In JCG article “Hadoop Modes Explained – Standalone, Pseudo Distributed, Distributed” JCG partner Rahul Patodi explained how to setup Hadoop. The Hadoop project has produced a lot of tools for analyzing semi-structured data but Hive is perhaps the most intuitive among them as it allows anyone with an SQL background to submit MapReduce jobs described as SQL queries. Hive can be executed from a command line interface, as well as run in a server mode with a Thrift client acting as a JDBC/ODBC interface giving access to data analysis and reporting applications.
In this article we will set up a Hive Server, create a table, load it with data from a text file and then create a Jasper Resport using iReport. The Jasper Report executes an SQL query on the Hive Server that is then translated to a MapReduce job executed by Hadoop.
Note: I used Hadoop version 0.20.205, Hive version 0.7.1 and iReport version 4.5 running OpenSuSE 12.1 Linux with MySQL 5.5 installed.
Assuming you have already installed Hadoop download and install Hive following the Hive Getting Started wiki instructions. By default Hive is installed in CLI mode running on a Standalone Hadoop mode.
Making a multiuser Hive metastore
The default Hive install uses a derby embedded database as its metastore. The metastore is where Hive maintains descriptions of the data we want to access via SQL. In order for the metastore to be accessible from many users simultaneously it is necessary to be moved into a standalone database. Here is how to install a MySQL metastore.
- Copy the MySQL JDBC driver jar file to ~/hive-0.7.1-bin/lib directory
- Change the following properties in file hive-default.xml found in ~/hive-0.7.1-bin/conf directory:
<description>JDBC connect string for a JDBC metastore</description>
<description>Driver class name for a JDBC metastore</description>
<description>Username to connect to the database</description>
<description>Password to connect to the database</description>
- Use MySQL workbench or the MySQL command line utility to create a schema using the latin1 character set. If Hive does not find a schema it will create it on its own using the default character set for MySQL. In my case that was UTF-8 and that generated jdbc errors. If you want to use the command line utility just type:
mysql> CREATE DATABASE IF NOT EXISTS `metastore` DEFAULT CHARACTER SET latin1 COLLATE latin1_bin;
- Type Hive in your command prompt to enter Hive CLI and type:
hive> SHOW TABLES;
Time taken: 3.654 seconds
That will populate your newly created metastore schema. If you see any errors check your hive-default.xml configuration and make sure your database schema is named ‘metastore’ with latin1 as the default Character Set.
Now let’s populate Hadoop Hive with some data
Let’s just create two text files named file01 and file02 each containing:
Hello World Bye World
Hello Everybody Bye Everybody
Hello Hadoop Goodbye Hadoop
Hello Everybody Goodbye Everybody
Copy these files from your local filesystem to HDFS:
$ hadoop fs -mkdir HiveExample
$ hadoop fs -copyFromLocal ~/file* /user/ssake/HiveExample
Go to Hive CLI and create a table named testlines that would contain each line’s words in an array of strings:
hive> create table testlines (line array<string>) row format delimited collection items terminated by ' ';
Load the text files into Hive:
hive> load data inpath "/user/ssake/HiveExample/file01" INTO table testlines;
hive> load data inpath "/user/ssake/HiveExample/file02" INTO table testlines;
Check that testlines now contains each line’s words:
hive> select * from testlines;
Time taken: 0.21 seconds
Now that we have a Hive with data we can run it as a server in port 10000 which is typical for running a hive server:
$ hive --service hiveserver
With this setup it is possible to have several Thrift clients accessing our Hive server. However according to the Apache Hive blog the multithreaded Hive features are not thoroughly tested and thus it is safer to use a separate port and hive instance per Thrift client.
Create a “word count” report
iReport 4.5 supports hive datasources so let’s use it to create a report that runs with our hive server as its datasource:
1. Create a datasource connecting to the hive server
2. Use the report wizard to generate your report
3. Type the following in the HiveQL Query input box:
select word,count(word) from testlines lateral view explode(line) words as word group by word
Lets briefly explain what the above query does:
Our source table, the “testlines” table, has a single column named “line” which contains data in the form of array of strings. Each array of strings represents the words in a sentence as found in the imported files “file01” and “file02“.
In order to properly count the occurrences of each distinct word in all of the input files we have to “explode” the arrays of strings from our source table into a new one that should contain every individual word. To do so we use the “lateral view” in conjunction with the “explode()” HiveQL commands as shown above.
Issuing the HiveQL query above we create an new iconic table named “words” that has a single column named “word” containing all the words found in every string array from our “testlines” table.
4. Click the … button to select all fileds and click next
5. When you are at the designer view click the Preview tab to execute your HiveQL report
And here is our report:
Now you are all set to build applications that access your Hadoop data using the familiar JDBC interface!
Reference: Big Data analytics with Hive and iReport from our W4G partner Spyros Sakellariou.