About Eyal Golan

Eyal is a professional software engineer and an architect. He is a developer and leader of highly sophisticated systems in different areas, such as networking, security, commerce and more

Seven Databases in Seven Weeks – Hbase Day 2

This post is a recap of the second day of Hbase from the Seven Databases in Seven Weeks book.
Most of the commands and scripts can be found at GitHub: https://github.com/eyalgo/seven-dbs-in-seven-weeks/tree/master/hbase/day_2

Streaming Script

The first thing in day 2 was to download lots of data (big data) and stream it into Hbase. There’s a JRuby script, which I had to alter in order for it to work: https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/import_from_wikipedia.rb

After altering it, as the book suggested, I had to add some compression to the column family. After that, I could run the script:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_from_wikipedia.rb

curl http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 | bzcat | /opt/hbase/hbase-0.94.18/bin/hbase shell import_from_wikipedia.rb

This is the output while the script runs

1 10.0G    1  128M    0     0   456k      0  6:23:37  0:04:48  6:18:49  817k19000 records inserted (Serotonin)
  1 10.0G    1  131M    0     0   461k      0  6:19:03  0:04:51  6:14:12  921k19500 records inserted (Serotonin specific reuptake inhibitors)
  1 10.0G    1  135M    0     0   469k      0  6:12:45  0:04:54  6:07:51 1109k20000 records inserted (Tennis court)
  1 10.0G    1  138M    0     0   477k      0  6:06:12  0:04:57  6:01:15 1269k20500 records inserted (Tape drive)

The next part in this chapter talks about regions and some other plumbing stuff.

Build links table

In this part the source is the large Wiki table and the output is ‘links’ table. Each link has ‘From:’ and ‘To:’. Here’s a link to the altered working script: https://github.com/eyalgo/seven-dbs-in-seven-weeks/blob/master/hbase/day_2/generate_wiki_links.rb

The rest of the chapter shows how to look at the data, count it and more.

Homework

The main part in the homework, was to create a new table: ‘foods’ that takes data from an XML, which can be downloaded from the US’s health & nutrition site. This data shows the nutrition facts per type of food.

I decided to create a very simple table. The column family does not have any special options. I created one column family:facts. Each row data from the XML file will be part of facts. I also decided that the row’s key would be the Display_Name. After all, it’s much easier to look by key and not by some ID.

create 'foods' , 'facts'

In order to see how I should create the script I looked at two sources:

  1. The script that imported data for the Wiki table
  2. One element (food) from the XML

Here’s one element:

<Food_Display_Row>
  <Food_Code>12350000</Food_Code>
  <Display_Name>Sour cream dip</Display_Name>
  <Portion_Default>1.00000</Portion_Default>
  <Portion_Amount>.25000</Portion_Amount>
  <Portion_Display_Name>cup </Portion_Display_Name>
  <Factor>.25000</Factor>
  <Increment>.25000</Increment>
  <Multiplier>1.00000</Multiplier>
  <Grains>.04799</Grains>
  <Whole_Grains>.00000</Whole_Grains>
  <Vegetables>.04070</Vegetables>
  <Orange_Vegetables>.00000</Orange_Vegetables>
  <Drkgreen_Vegetables>.00000</Drkgreen_Vegetables>
  <Starchy_vegetables>.00000</Starchy_vegetables>
  <Other_Vegetables>.04070</Other_Vegetables>
  <Fruits>.00000</Fruits>
  <Milk>.00000</Milk>
  <Meats>.00000</Meats>
  <Soy>.00000</Soy>
  <Drybeans_Peas>.00000</Drybeans_Peas>
  <Oils>.00000</Oils>
  <Solid_Fats>105.64850</Solid_Fats>
  <Added_Sugars>1.57001</Added_Sugars>
  <Alcohol>.00000</Alcohol>
  <Calories>133.65000</Calories>
  <Saturated_Fats>7.36898</Saturated_Fats>
</Food_Display_Row>

I created the script by examining the wiki script and one element. Opening a document is when seeing an open XML element tag: Food_Display_Row. When seeing Food_Display_Row as the close tag, the script creates the document.

include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'
import 'javax.xml.stream.XMLStreamConstants'

def jbytes( *args )
  args.map { |arg| arg.to_s.to_java_bytes }
end

factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)

document = nil
buffer = nil
count = 0

puts( @hbase )
conf = HBaseConfiguration.new
table = HTable.new( conf, "foods" )
table.setAutoFlush( false )

while reader.has_next
  type = reader.next
  
  if type == XMLStreamConstants::START_ELEMENT # (3)
  
    case reader.local_name
    when 'Food_Display_Row' then document = {}
    when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/ then buffer = []
    when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/ then buffer = []
    when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/ then buffer = []
    when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/ then buffer = []
    end
    
  elsif type == XMLStreamConstants::CHARACTERS
    buffer << reader.text unless buffer.nil?
    
  elsif type == XMLStreamConstants::END_ELEMENT
    
    case reader.local_name
    when /Display_Name|Portion_Default|Portion_Amount|Portion_Display_Name|Factor/
      document[reader.local_name] = buffer.join
    when /Increment|Multiplier|Grains|Whole_Grains|Vegetables|Orange_Vegetables/
      document[reader.local_name] = buffer.join
    when /Drkgreen_Vegetables|Starchy_vegetables|Other_Vegetables|Fruits|Milk|Meats/
      document[reader.local_name] = buffer.join
    when /Drybeans_Peas|Soy|Oils|Solid_Fats|Added_Sugars|Alcohol|Calories|Saturated_Fats/
      document[reader.local_name] = buffer.join

    when 'Food_Display_Row'
      key = document['Display_Name'].to_java_bytes
      
      p = Put.new( key )
      p.add( *jbytes( "facts", "Display_Name", document['Display_Name'] ) )
      p.add( *jbytes( "facts", "Portion_Default", document['Portion_Default'] ) )
      p.add( *jbytes( "facts", "Portion_Amount", document['Portion_Amount'] ) )
      p.add( *jbytes( "facts", "Portion_Display_Name", document['Portion_Display_Name'] ) )
      p.add( *jbytes( "facts", "Factor", document['Factor'] ) )
      p.add( *jbytes( "facts", "Increment", document['Increment'] ) )
      p.add( *jbytes( "facts", "Multiplier", document['Multiplier'] ) )
      p.add( *jbytes( "facts", "Grains", document['Grains'] ) )
      p.add( *jbytes( "facts", "Whole_Grains", document['Whole_Grains'] ) )
      p.add( *jbytes( "facts", "Vegetables", document['Vegetables'] ) )
      p.add( *jbytes( "facts", "Orange_Vegetables", document['Orange_Vegetables'] ) )
      p.add( *jbytes( "facts", "Drkgreen_Vegetables", document['Drkgreen_Vegetables'] ) )
      p.add( *jbytes( "facts", "Starchy_vegetables", document['Starchy_vegetables'] ) )
      p.add( *jbytes( "facts", "Other_Vegetables", document['Other_Vegetables'] ) )
      p.add( *jbytes( "facts", "Fruits", document['Fruits'] ) )
      p.add( *jbytes( "facts", "Milk", document['Milk'] ) )
      p.add( *jbytes( "facts", "Meats", document['Meats'] ) )
      p.add( *jbytes( "facts", "Drybeans_Peas", document['Drybeans_Peas'] ) )
      p.add( *jbytes( "facts", "Soy", document['Soy'] ) )
      p.add( *jbytes( "facts", "Oils", document['Oils'] ) )
      p.add( *jbytes( "facts", "Solid_Fats", document['Solid_Fats'] ) )
      p.add( *jbytes( "facts", "Added_Sugars", document['Added_Sugars'] ) )
      p.add( *jbytes( "facts", "Alcohol", document['Alcohol'] ) )
      p.add( *jbytes( "facts", "Calories", document['Calories'] ) )
      p.add( *jbytes( "facts", "Saturated_Fats", document['Saturated_Fats'] ) )

      table.put( p )
      
      count += 1
      table.flushCommits() if count % 10 == 0
      if count % 500 == 0
        puts "#{count} records inserted (#{document['Display_Name']})"
      end
    end
  end
end

table.flushCommits()
exit

Following are the shell commands that take the XML file and stream them to Hbase. The first command runs against the file with the single element. After I verified the correctness, I ran it against to full file.

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/food-display-example.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

curl file:///home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/MyFoodapediaData/Food_Display_Table.xml | cat | /opt/hbase/hbase-0.94.18/bin/hbase shell /home/eyalgo/seven-dbs-in-seven-weeks/hbase/day_2/import_food_display.rb

Let’s get some food…

get 'foods' , 'fruit smoothie made with milk'

And the result:

COLUMN CELL
facts:Added_Sugars timestamp=1399932481440, value=82.54236
facts:Alcohol timestamp=1399932481440, value=.00000
facts:Calories timestamp=1399932481440, value=197.96000
facts:Display_Name timestamp=1399932481440, value=fruit smoothie made with milk
facts:Drkgreen_Vegetables timestamp=1399932481440, value=.00000
facts:Drybeans_Peas timestamp=1399932481440, value=.00000
facts:Factor timestamp=1399932481440, value=1.00000
facts:Fruits timestamp=1399932481440, value=.56358
facts:Grains timestamp=1399932481440, value=.00000
facts:Increment timestamp=1399932481440, value=.25000
facts:Meats timestamp=1399932481440, value=.00000
facts:Milk timestamp=1399932481440, value=.22624
facts:Multiplier timestamp=1399932481440, value=.25000
facts:Oils timestamp=1399932481440, value=.00808
facts:Orange_Vegetables timestamp=1399932481440, value=.00000
facts:Other_Vegetables timestamp=1399932481440, value=.00000
facts:Portion_Amount timestamp=1399932481440, value=1.00000
facts:Portion_Default timestamp=1399932481440, value=2.00000
facts:Portion_Display_Name timestamp=1399932481440, value=cup
facts:Saturated_Fats timestamp=1399932481440, value=1.91092
facts:Solid_Fats timestamp=1399932481440, value=24.14304
facts:Soy timestamp=1399932481440, value=.00000
facts:Starchy_vegetables timestamp=1399932481440, value=.00000
facts:Vegetables timestamp=1399932481440, value=.00000
facts:Whole_Grains timestamp=1399932481440, value=.00000 

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.

Leave a Reply


five − = 1



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close