About Piyas De

Piyas is a Sun Microsystems certified Enterprise Architect with 10+ years of professional IT experience in areas such as architecture definition, enterprise application design, and client-server/e-business solutions.

Running Map-Reduce Job in Apache Hadoop (Multinode Cluster)

We will describe here the process of running a MapReduce job on an Apache Hadoop multinode cluster. To set up such a cluster, one can read Setting up Apache Hadoop Multi-Node Cluster.

To set things up, we have to configure Hadoop as follows on each machine:

  • Add the following properties to conf/mapred-site.xml on all the nodes:

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>

N.B.: The last three are optional settings, so we can omit them.
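To verify that the property took effect on a node, one can extract the configured job tracker address from the file with standard Unix tools. A minimal sketch (the sample file below simply mirrors the fragment above; on a real node, point sed at conf/mapred-site.xml instead):

```shell
# Create a minimal mapred-site.xml for illustration (mirrors the fragment above);
# on a real node you would read conf/mapred-site.xml instead.
cat > /tmp/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
</configuration>
EOF

# Print the <value> on the line after the mapred.job.tracker <name> element.
sed -n '/<name>mapred.job.tracker<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' /tmp/mapred-site.xml
# prints: master:54311
```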

  • The Gutenberg Project

For our MapReduce demo we will use the WordCount example job, which reads text files and counts how often words occur. Both the input and the output are text files; each line of the output contains a word and the number of times it occurred, separated by a tab.
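As a point of reference, the same word<TAB>count output shape can be approximated on a single machine with standard Unix tools. This is only a local sketch of what the job computes, not part of Hadoop:

```shell
# wordcount: split input on whitespace, then emit "word<TAB>count" lines,
# the same shape as the WordCount job's output (ordering aside).
wordcount() {
  tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort | uniq -c |
    awk '{printf "%s\t%s\n", $2, $1}'
}

# Demo on a small sample file:
printf 'to be or not to be\n' > /tmp/wc-sample.txt
wordcount /tmp/wc-sample.txt
# prints: be 2, not 1, or 1, to 2 (tab-separated, one word per line)
```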

We will use the following e-texts as example inputs; each should be in Plain Text UTF-8 encoding.

  • The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
  • The Notebooks of Leonardo Da Vinci
  • Ulysses by James Joyce
  • The Art of War by 6th cent. B.C. Sunzi
  • The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
  • The Devil’s Dictionary by Ambrose Bierce
  • Encyclopaedia Britannica, 11th Edition, Volume 4, Part 3

Search for these texts online (they are available from Project Gutenberg). Download each e-book as a Plain Text UTF-8 file and store the files in a local temporary directory of your choice, for example /tmp/gutenberg. Check the files with the following command:

$ ls -l /tmp/gutenberg/

  • Next we start the DFS and MapReduce layers in our cluster:

$ start-dfs.sh

$ start-mapred.sh

Issue the jps command on each node to check that the DataNode, NameNode, TaskTracker, and JobTracker daemons are all up and running fine on all the nodes.
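This check can be scripted: capture the jps output and grep for each expected daemon. Which daemons you expect on a given node depends on your cluster layout (e.g. NameNode/JobTracker on the master, DataNode/TaskTracker on the slaves); the jps output below is a made-up sample for illustration, and on a real node you would capture "$(jps)" instead:

```shell
# Sample jps output for illustration; on a real node use: jps_output="$(jps)"
jps_output="$(cat <<'EOF'
2564 NameNode
2688 DataNode
2910 JobTracker
3021 TaskTracker
3155 Jps
EOF
)"

# Daemons expected on this node (adjust per node: masters typically run
# NameNode/JobTracker, slaves run DataNode/TaskTracker).
for daemon in NameNode DataNode JobTracker TaskTracker; do
  if echo "$jps_output" | grep -q "$daemon"; then
    echo "$daemon is running"
  else
    echo "$daemon is NOT running"
  fi
done
```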

  • Next we will copy the local files (here, the text files) to HDFS

$ hadoop dfs -copyFromLocal /tmp/gutenberg /Users/hduser/gutenberg

$ hadoop dfs -ls /Users/hduser

If the files were copied successfully, we will see something like:

Found 2 items
drwxr-xr-x - hduser supergroup 0 2013-05-21 14:48 /Users/hduser/gutenberg

Furthermore, we check what is in /Users/hduser/gutenberg:

$ hadoop dfs -ls /Users/hduser/gutenberg
Found 7 items

-rw-r--r-- 2 hduser supergroup 336705 2013-05-21 14:48 /Users/hduser/gutenberg/pg132.txt
-rw-r--r-- 2 hduser supergroup 581877 2013-05-21 14:48 /Users/hduser/gutenberg/pg1661.txt
-rw-r--r-- 2 hduser supergroup 1916261 2013-05-21 14:48 /Users/hduser/gutenberg/pg19699.txt
-rw-r--r-- 2 hduser supergroup 674570 2013-05-21 14:48 /Users/hduser/gutenberg/pg20417.txt
-rw-r--r-- 2 hduser supergroup 1540091 2013-05-21 14:48 /Users/hduser/gutenberg/pg4300.txt
-rw-r--r-- 2 hduser supergroup 447582 2013-05-21 14:48 /Users/hduser/gutenberg/pg5000.txt
-rw-r--r-- 2 hduser supergroup 384408 2013-05-21 14:48 /Users/hduser/gutenberg/pg972.txt

  • We start our MapReduce job

Let us run the MapReduce WordCount example:

$ hadoop jar hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output

N.B.: This assumes that you are already in the HADOOP_HOME directory. If not, then:

$ hadoop jar ABSOLUTE/PATH/TO/HADOOP/DIR/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output

Or, if you have installed Hadoop in /usr/local/hadoop:

$ hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output

The output looks something like this:

13/05/22 13:12:13 INFO mapred.JobClient:  map 0% reduce 0%
13/05/22 13:12:59 INFO mapred.JobClient:  map 28% reduce 0%
13/05/22 13:13:05 INFO mapred.JobClient:  map 57% reduce 0%
13/05/22 13:13:11 INFO mapred.JobClient:  map 71% reduce 0%
13/05/22 13:13:20 INFO mapred.JobClient:  map 85% reduce 0%
13/05/22 13:13:26 INFO mapred.JobClient:  map 100% reduce 0%
13/05/22 13:13:43 INFO mapred.JobClient:  map 100% reduce 50%
13/05/22 13:13:55 INFO mapred.JobClient:  map 100% reduce 100%
13/05/22 13:13:59 INFO mapred.JobClient:  map 85% reduce 100%
13/05/22 13:14:02 INFO mapred.JobClient:  map 100% reduce 100%
13/05/22 13:14:07 INFO mapred.JobClient: Job complete: job_201305211616_0011
13/05/22 13:14:07 INFO mapred.JobClient: Counters: 26
13/05/22 13:14:07 INFO mapred.JobClient:   Job Counters 
13/05/22 13:14:07 INFO mapred.JobClient:     Launched reduce tasks=3
13/05/22 13:14:07 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=118920
13/05/22 13:14:07 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/22 13:14:07 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/05/22 13:14:07 INFO mapred.JobClient:     Launched map tasks=10
13/05/22 13:14:07 INFO mapred.JobClient:     Data-local map tasks=10
13/05/22 13:14:07 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=54620
13/05/22 13:14:07 INFO mapred.JobClient:   File Output Format Counters 
13/05/22 13:14:07 INFO mapred.JobClient:     Bytes Written=1267287
13/05/22 13:14:07 INFO mapred.JobClient:   FileSystemCounters
13/05/22 13:14:07 INFO mapred.JobClient:     FILE_BYTES_READ=4151123
13/05/22 13:14:07 INFO mapred.JobClient:     HDFS_BYTES_READ=5882320
13/05/22 13:14:07 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6937084
13/05/22 13:14:07 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1267287
13/05/22 13:14:07 INFO mapred.JobClient:   File Input Format Counters 
13/05/22 13:14:07 INFO mapred.JobClient:     Bytes Read=5881494
13/05/22 13:14:07 INFO mapred.JobClient:   Map-Reduce Framework
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce input groups=114901
13/05/22 13:14:07 INFO mapred.JobClient:     Map output materialized bytes=2597630
13/05/22 13:14:07 INFO mapred.JobClient:     Combine output records=178795
13/05/22 13:14:07 INFO mapred.JobClient:     Map input records=115251
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce shuffle bytes=1857123
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce output records=114901
13/05/22 13:14:07 INFO mapred.JobClient:     Spilled Records=463427
13/05/22 13:14:07 INFO mapred.JobClient:     Map output bytes=9821180
13/05/22 13:14:07 INFO mapred.JobClient:     Total committed heap usage (bytes)=1567514624
13/05/22 13:14:07 INFO mapred.JobClient:     Combine input records=1005554
13/05/22 13:14:07 INFO mapred.JobClient:     Map output records=1005554
13/05/22 13:14:07 INFO mapred.JobClient:     SPLIT_RAW_BYTES=826
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce input records=178795

  • Retrieving the Job Result

To read the results directly from HDFS without copying them to the local file system:

$ hadoop dfs -cat /Users/hduser/gutenberg-output/part-r-00000

Let us copy the results to the local file system, though.

$ mkdir /tmp/gutenberg-output

$ hadoop dfs -getmerge /Users/hduser/gutenberg-output /tmp/gutenberg-output

$ head /tmp/gutenberg-output/gutenberg-output

We will get output like:

"'Ample.' 1
"'Arthur!' 1
"'As 1
"'Because 1
"'But,' 1
"'Certainly,' 1
"'Come, 1
"'DEAR 1
"'Dear 2
"'Dearest 1
"'Don't 1
"'Fritz! 1
"'From 1
"'Have 1
"'Here 1
"'How 2

The fs -getmerge command simply concatenates any HDFS files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
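If a ranked list is wanted, the merged file can be sorted locally on its second (count) column. A small sketch with made-up sample data; point sort at /tmp/gutenberg-output/gutenberg-output for the real merged file:

```shell
# Sample "word<TAB>count" lines for illustration; the real merged file is
# /tmp/gutenberg-output/gutenberg-output.
printf 'the\t150\nof\t90\nand\t120\n' > /tmp/wc-out.txt

# Sort numerically on field 2 (the count), descending, to rank frequent words.
sort -k2,2nr /tmp/wc-out.txt
# prints: the 150, and 120, of 90 (tab-separated)
```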




Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use
