Software Development

How to create your first HIVE script?

Overview

Apache Hive is an integral part of Hadoop eco-system. Hive can be defined as a data warehouse like software which facilitates query and large data management on HDFS (Hadoop distributed file system). One must remember that Hive is not data warehouse software rather it provides some mechanism to manage data on distributed environment and query it by using an SQL-like language called HiveQL or HIVE Query Language. Hive scripts can be defined as a group of Hive commands bundled together to reduce the execution time. In this article I will discuss about the Hive scripts and its execution.

Introduction

HDFS or Hadoop Distributed File System provides a scalable and fault tolerant enabled data storage. HIVE provides a simple SQL like query language – HIVE QL. HIVE QL allows the traditional map reduce developers to plug-in their custom mappers and reducers to do more sophisticated analysis.

Limitation of HIVE

Latency for HIVE queries is usually very high because of the substantial overheads in job submission and scheduling. Hive does not offer real time queries and row level updates. It is best used for log analysis.

HIVE Data Units

Hive data is organized into following four categories:

  • Databases: This consists of the namespaces which separates the tables and other data units to avoid name conflicts.
  • Tables: These are homogeneous units of data having a common schema. A commonly used example could be of a page view table where each row can have the following columns :
    • USERID
    • IPADDRESS
    • LAST ACCESSED
    • PAGE URL

This example lists the record of the usage of a website or an application for individual users.

  • Partitions: Partitions determine how the data is stored. Each table can have one or multiple partitions. Partitions also help users to efficiently identify the rows which satisfy a certain selection criteria.
  • Buckets or Clusters: Data in each partition may be further subdivided into buckets or clusters or blocks. The data in the above example can be clustered based on the user id or on the ip address or on the page url column.

HIVE Data Types

Based on the need, HIVE supports primitive and complex data types as described below:

  • Primitive types:
    • INTEGERS
      • TINY INT 1 byte integer
      • SMALL INT 2 byte integer
      • INT 4 byte integer
      • BIGINT 8 byte integer
    • BOOLEAN
      • BOOLEAN TRUE or FALSE
    • FLOATING POINT numbers
      • FLOAT Single precision
      • DOUBLE Double precision
    • STRING type
      • STRING Sequence of characters
    • Complex Types: Complex types can be constructed using primitive data types and other composite types with the help of :
      • Structs
      • Maps or key value pairs
      • Arrays – Indexed lists

HIVE Scripting

Similar to any other scripting language, HIVE scripts are used to execute a set of HIVE commands collectively. HIVE scripting helps us to reduce the time and effort invested in writing and executing the individual commands manually. HIVE scripting is supported in HIVE 0.10.0 or higher versions of HIVE. To write and execute a HIVE script, we need to install Cloudera distribution for Hadoop CDH4.

Writing HIVE SCRIPTS

First, open a terminal in your Cloudera CDH4 distribution and give the below command to create a Hive Script.

gedit sample.sql

Similar to any other query language, the Hive script file should be saved with .sql extension. This will enable the execution of the commands. Now open the file in Edit mode and write your Hive commands that will be executed using this script. In this sample script, we will do the following tasks sequentially (create, describe and then load the data into the table. And then retrieve the data from table).

Create a table ‘product’ in Hive

create table product_dtl ( product_id: int, product-name: string, product_price: float, product_category: string) rows format delimited fields terminated by ‘,’ ;

Here { product_id, product-name, product_price, product_category} are names of the columns in the ‘product_dtl’ table. “Fields terminated by ‘,’ ” indicates that the columns in the input file are separated by the ‘,’ delimiter.  You can also use other delimiters as per your requirement. For example, we can consider the records in an input file separated by a new line (‘\n’) character.

Describe the Table

describe product_dtl;

Load the data into the Table

Now, let’s check the data loading part. Create an input file which contains the records that needs to be inserted into the table.

sudo gedit input.txt

Now let’s create few records in the input text file as shown in the figure below:

Figure 1: Input file.
Figure 1: Input file.

So, our input file will look like:

1, Laptop, 45000, Computers

2, Pencils, 2, Stationery

3, Rice, 64.45, Grocery

4, Furniture, 65000, Interiors

To load the data from this file we need to execute the following command:

load data local inpath ‘/home/cloudera/input.txt’ into table product_dtl; 

Retrieving the Data: To retrieve the data we use the simple select statement as under:

select * from product_dtl;

The above command will execute and fetch all the records from the table ‘product’.

The script will look like the following image:

Figure 2: The sample SQL file
Figure 2: The sample SQL file

Save this sample.sql file and run the following command:

hive –f /home/cloudera/sample.sql

While executing the script, mention the entire path of the script location. Here the sample script is present in the current directory; I haven’t provided the complete path of the script.

The following image displays that all the commands were executed successfully.

The following output shows that the table is created and the data from our sample input file is stored in the database.

1Laptop45000Computers
2Pencils2Stationery
3Rice64.45Groceries
4Furniture65000

Summary

Before concluding our discussion, we must keep a note of the following points

  • Apache HIVE is an integral part of HDFS
  • HIVE is an SQL like query language
  • HIVE script is easy to understand and implement
  • Hive supports both primitive data types and complex data types.

Kaushik Pal

Kaushik has 16 years of experience as a technical architect and software consultant in enterprise application and product development. He has interest in new technology and innovation area along with technical writing. His main focus is on web architecture, web technologies, java/j2ee, Open source, big data and semantic technologies. He has demonstrated his expertise in requirement analysis, architecture design & implementation, technical use case preparation, and software development. His experience has spanned in different domains like Insurance, banking, airlines, shipping, document management etc. Kaushik worked with a wide variety of technologies starting from Mainframe (IBM S/390), midrange (AS/400), web technologies and open source area. He has worked with clients like IBM, Lexmark, United Airlines and many more.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button