Big Data analytics with Hive and iReport

Spyros SakellariouFebruary 11th, 2012Last Updated: October 26th, 2012

1 70 4 minutes read

Each J.J. Abrams’ TV series Person of Interest episode starts with the following narration from Mr. Finch one of the leading characters: “You are being watched. The government has a secret system–a machine that spies on you every hour of every day. I know because…I built it.” Of course us technical people know better. It would take a huge team of electrical and software engineers many years to build such a high performing machine and the budget would be unimaginable… or wouldn’t be? Wait a second we have Hadoop! Now everyone of us can be Mr. Finch for a modest budget thanks to Hadoop.

In JCG article “Hadoop Modes Explained – Standalone, Pseudo Distributed, Distributed” JCG partner Rahul Patodi explained how to setup Hadoop. The Hadoop project has produced a lot of tools for analyzing semi-structured data but Hive is perhaps the most intuitive among them as it allows anyone with an SQL background to submit MapReduce jobs described as SQL queries. Hive can be executed from a command line interface, as well as run in a server mode with a Thrift client acting as a JDBC/ODBC interface giving access to data analysis and reporting applications.

In this article we will set up a Hive Server, create a table, load it with data from a text file and then create a Jasper Resport using iReport. The Jasper Report executes an SQL query on the Hive Server that is then translated to a MapReduce job executed by Hadoop.

Note: I used Hadoop version 0.20.205, Hive version 0.7.1 and iReport version 4.5 running OpenSuSE 12.1 Linux with MySQL 5.5 installed.

Assuming you have already installed Hadoop download and install Hive following the Hive Getting Started wiki instructions. By default Hive is installed in CLI mode running on a Standalone Hadoop mode.

Making a multiuser Hive metastore

The default Hive install uses a derby embedded database as its metastore. The metastore is where Hive maintains descriptions of the data we want to access via SQL. In order for the metastore to be accessible from many users simultaneously it is necessary to be moved into a standalone database. Here is how to install a MySQL metastore.

Copy the MySQL JDBC driver jar file to ~/hive-0.7.1-bin/lib directory

Change the following properties in file hive-default.xml found in ~/hive-0.7.1-bin/conf directory:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hyperion/metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>foo</value>
  <description>Username to connect to the database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>bar</value>
  <description>Password to connect to the database</description>
</property>

Use MySQL workbench or the MySQL command line utility to create a schema using the latin1 character set. If Hive does not find a schema it will create it on its own using the default character set for MySQL. In my case that was UTF-8 and that generated jdbc errors. If you want to use the command line utility just type:
```
mysql> CREATE DATABASE  IF NOT EXISTS `metastore` DEFAULT CHARACTER SET latin1 COLLATE latin1_bin;
```
Type Hive in your command prompt to enter Hive CLI and type:
```
hive> SHOW TABLES;
OK
testlines
Time taken: 3.654 seconds
hive> 
```
That will populate your newly created metastore schema. If you see any errors check your hive-default.xml configuration and make sure your database schema is named ‘metastore’ with latin1 as the default Character Set.

Now let’s populate Hadoop Hive with some data

Let’s just create two text files named file01 and file02 each containing:

file01:
Hello World Bye World
Hello Everybody Bye Everybody

file02:
Hello Hadoop Goodbye Hadoop
Hello Everybody Goodbye Everybody

Copy these files from your local filesystem to HDFS:

$ hadoop fs -mkdir HiveExample
$ hadoop fs -copyFromLocal ~/file* /user/ssake/HiveExample

Go to Hive CLI and create a table named testlines that would contain each line’s words in an array of strings:

hive> create table testlines (line array<string>) row format delimited collection items terminated by ' ';

Load the text files into Hive:

hive> load data inpath "/user/ssake/HiveExample/file01" INTO table testlines;
hive> load data inpath "/user/ssake/HiveExample/file02" INTO table testlines;

Check that testlines now contains each line’s words:

hive> select * from testlines;
OK
["Hello","World","Bye","World"]
["Hello","Everybody","Bye","Everybody"]
["Hello","Hadoop","Goodbye","Hadoop"]
["Hello","Everybody","Goodbye","Everybody"]
Time taken: 0.21 seconds

Now that we have a Hive with data we can run it as a server in port 10000 which is typical for running a hive server:

$ HIVE_PORT=10000
$ hive --service hiveserver

With this setup it is possible to have several Thrift clients accessing our Hive server. However according to the Apache Hive blog the multithreaded Hive features are not thoroughly tested and thus it is safer to use a separate port and hive instance per Thrift client.

Create a “word count” report

iReport 4.5 supports hive datasources so let’s use it to create a report that runs with our hive server as its datasource:

1. Create a datasource connecting to the hive server

2. Use the report wizard to generate your report

3. Type the following in the HiveQL Query input box:

select word,count(word) from testlines lateral view explode(line) words as word group by word

Lets briefly explain what the above query does:

Our source table, the “testlines” table, has a single column named “line” which contains data in the form of array of strings. Each array of strings represents the words in a sentence as found in the imported files “file01” and “file02“.

In order to properly count the occurrences of each distinct word in all of the input files we have to “explode” the arrays of strings from our source table into a new one that should contain every individual word. To do so we use the “lateral view” in conjunction with the “explode()” HiveQL commands as shown above.

Issuing the HiveQL query above we create an new iconic table named “words” that has a single column named “word” containing all the words found in every string array from our “testlines” table.