About Eyal Golan

Eyal is a professional software engineer and an architect. He is a developer and leader of highly sophisticated systems in different areas, such as networking, security, commerce and more

Seven Databases in Seven Weeks – Hbase Day 2

This post is a recap of the second day of Hbase from the Seven Databases in Seven Weeks book.
Most of the commands and scripts can be found at GitHub: https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_2

Streaming Script

The first thing in day 2 was to download lots of data (big data) and stream it into Hbase. There’s a JRuby script, which I had to alter in order for it to work: https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/import_from_wikipedia.rb

After altering it, as the book suggested, I had to add some compression to the column family. After that, I could run the script:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_from_wikipedia.rb

curl http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell import_from_wikipedia.rb

This is the output while the script runs

1 10.0G    1  128M    0     0   456k      0  6:23:37  0:04:48  6:18:49  817k19000 records inserted (Serotonin)
  1 10.0G    1  131M    0     0   461k      0  6:19:03  0:04:51  6:14:12  921k19500 records inserted (Serotonin specific reuptake inhibitors)
  1 10.0G    1  135M    0     0   469k      0  6:12:45  0:04:54  6:07:51 1109k20000 records inserted (Tennis court)
  1 10.0G    1  138M    0     0   477k      0  6:06:12  0:04:57  6:01:15 1269k20500 records inserted (Tape drive)

The next part in this chapter talks about regions and some other plumbing stuff.

Build links table

In this part the source is the large Wiki table and the output is ‘links’ table. Each link has ‘From:’ and ‘To:’. Here’s a link to the altered working script: https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/generate_wiki_links.rb

The rest of the chapter shows how to look at the data, count it and more.


The main part in the homework, was to create a new table: ‘foods’ that takes data from an XML, which can be downloaded from the US’s health & nutrition site. This data shows the nutrition facts per type of food.

I decided to create a very simple table. The column family does not have any special options. I created one column family:facts. Each row data from the XML file will be part of facts. I also decided that the row’s key would be the Display_Name. After all, it’s much easier to look by key and not by some ID.

create 'foods' , 'facts'

In order to see how I should create the script I looked at two sources:

  1. The script that imported data for the Wiki table
  2. One element (food) from the XML

Here’s one element:

  <Display_Name>Sour cream dip</Display_Name>
  <Portion_Display_Name>cup </Portion_Display_Name>

I created the script by examining the wiki script and one element. Opening a document is when seeing an open XML element tag: Food_Display_Row. When seeing Food_Display_Row as the close tag, the script creates the document.

include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'
import 'javax.xml.stream.XMLStreamConstants'

def jbytes( *args )
  args.map { |arg| arg.to_s.to_java_bytes }

factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)

document = nil
buffer = nil
count = 0

puts( @hbase )
conf = HBaseConfiguration.new
table = HTable.new( conf, "foods" )
table.setAutoFlush( false )

while reader.has_next
  type = reader.next
  if type == XMLStreamConstants::START_ELEMENT # (3)
    case reader.local_name
    when 'Food_Display_Row' then document = {}
    when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/ then buffer = []
    when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/ then buffer = []
    when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/ then buffer = []
    when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/ then buffer = []
  elsif type == XMLStreamConstants::CHARACTERS
    buffer << reader.text unless buffer.nil?
  elsif type == XMLStreamConstants::END_ELEMENT
    case reader.local_name
    when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/
      document[reader.local_name] = buffer.join
    when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/
      document[reader.local_name] = buffer.join
    when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/
      document[reader.local_name] = buffer.join
    when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/
      document[reader.local_name] = buffer.join

    when 'Food_Display_Row'
      key = document['Display_Name'].to_java_bytes
      p = Put.new( key )
      p.add( *jbytes( "facts", "Display_Name", document['Display_Name'] ) )
      p.add( *jbytes( "facts", "Portion_Default", document['Portion_Default'] ) )
      p.add( *jbytes( "facts", "Portion_Amount", document['Portion_Amount'] ) )
      p.add( *jbytes( "facts", "Portion_Display_Name", document['Portion_Display_Name'] ) )
      p.add( *jbytes( "facts", "Factor", document['Factor'] ) )
      p.add( *jbytes( "facts", "Increment", document['Increment'] ) )
      p.add( *jbytes( "facts", "Multiplier", document['Multiplier'] ) )
      p.add( *jbytes( "facts", "Grains", document['Grains'] ) )
      p.add( *jbytes( "facts", "Whole_Grains", document['Whole_Grains'] ) )
      p.add( *jbytes( "facts", "Vegetables", document['Vegetables'] ) )
      p.add( *jbytes( "facts", "Orange_Vegetables", document['Orange_Vegetables'] ) )
      p.add( *jbytes( "facts", "Drkgreen_Vegetables", document['Drkgreen_Vegetables'] ) )
      p.add( *jbytes( "facts", "Starchy_vegetables", document['Starchy_vegetables'] ) )
      p.add( *jbytes( "facts", "Other_Vegetables", document['Other_Vegetables'] ) )
      p.add( *jbytes( "facts", "Fruits", document['Fruits'] ) )
      p.add( *jbytes( "facts", "Milk", document['Milk'] ) )
      p.add( *jbytes( "facts", "Meats", document['Meats'] ) )
      p.add( *jbytes( "facts", "Drybeans_Peas", document['Drybeans_Peas'] ) )
      p.add( *jbytes( "facts", "Soy", document['Soy'] ) )
      p.add( *jbytes( "facts", "Oils", document['Oils'] ) )
      p.add( *jbytes( "facts", "Solid_Fats", document['Solid_Fats'] ) )
      p.add( *jbytes( "facts", "Added_Sugars", document['Added_Sugars'] ) )
      p.add( *jbytes( "facts", "Alcohol", document['Alcohol'] ) )
      p.add( *jbytes( "facts", "Calories", document['Calories'] ) )
      p.add( *jbytes( "facts", "Saturated_Fats", document['Saturated_Fats'] ) )

      table.put( p )
      count += 1
      table.flushCommits() if count % 10 == 0
      if count % 500 == 0
        puts "#{count} records inserted (#{document['Display_Name']})"


Following are the shell commands that take the XML file and stream them to Hbase. The first command runs against the file with the single element. After I verified the correctness, I ran it against to full file.

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/food-display-example.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/MyFoodapediaData/Food_Display_Table.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

Let’s get some food…

get 'foods' , 'fruit smoothie made with milk'

And the result:

facts:Added_Sugars timestamp=1399932481440, value=82.54236
facts:Alcohol timestamp=1399932481440, value=.00000
facts:Calories timestamp=1399932481440, value=197.96000
facts:Display_Name timestamp=1399932481440, value=fruit smoothie made with milk
facts:Drkgreen_Vegetables timestamp=1399932481440, value=.00000
facts:Drybeans_Peas timestamp=1399932481440, value=.00000
facts:Factor timestamp=1399932481440, value=1.00000
facts:Fruits timestamp=1399932481440, value=.56358
facts:Grains timestamp=1399932481440, value=.00000
facts:Increment timestamp=1399932481440, value=.25000
facts:Meats timestamp=1399932481440, value=.00000
facts:Milk timestamp=1399932481440, value=.22624
facts:Multiplier timestamp=1399932481440, value=.25000
facts:Oils timestamp=1399932481440, value=.00808
facts:Orange_Vegetables timestamp=1399932481440, value=.00000
facts:Other_Vegetables timestamp=1399932481440, value=.00000
facts:Portion_Amount timestamp=1399932481440, value=1.00000
facts:Portion_Default timestamp=1399932481440, value=2.00000
facts:Portion_Display_Name timestamp=1399932481440, value=cup
facts:Saturated_Fats timestamp=1399932481440, value=1.91092
facts:Solid_Fats timestamp=1399932481440, value=24.14304
facts:Soy timestamp=1399932481440, value=.00000
facts:Starchy_vegetables timestamp=1399932481440, value=.00000
facts:Vegetables timestamp=1399932481440, value=.00000
facts:Whole_Grains timestamp=1399932481440, value=.00000 
Related Whitepaper:

Professional NoSQL

A hands-on guide to leveraging NoSQL databases!

NoSQL databases are an efficient and powerful tool for storing and manipulating vast quantities of data. Most NoSQL databases scale well as data grows. In addition, they are often malleable and flexible enough to accommodate semi-structured and sparse data sets. This comprehensive hands-on guide presents fundamental concepts and practical solutions for getting you ready to use NoSQL databases. Expert author Shashank Tiwari begins with a helpful introduction on the subject of NoSQL, explains its characteristics and typical uses, and looks at where it fits in the application stack. Unique insights help you choose which NoSQL solutions are best for solving your specific data storage needs.

Get it Now!  

Leave a Reply

4 × = twenty

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.

Sign up for our Newsletter

20,709 insiders are already enjoying weekly updates and complimentary whitepapers! Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

As an extra bonus, by joining you will get our brand new e-books, published by Java Code Geeks and their JCG partners for your reading pleasure! Enter your info and stay on top of things,

  • Fresh trends
  • Cases and examples
  • Research and insights
  • Two complimentary e-books