Hadoop Single Node Set Up
With this post I hope to share the procedure for setting up Apache Hadoop on a single node. Hadoop is used for dealing with big data sets, with deployment happening on low-cost commodity hardware. It is a map-reduce framework that maps segments of a job to the nodes in a cluster for execution. Though we will not see the real power of Hadoop by running it on a single node, this is the first step towards a multi-node cluster. A single node set-up is useful for getting familiar with the operations and for debugging applications for accuracy, but the performance may be far lower than what is achievable.
I am sharing the steps to be followed on a Linux system, as Linux is supported as both a development and a production platform for Hadoop. Win32 is supported only as a development platform, and the equivalent commands for the given Linux commands need to be used there. The Hadoop documentation covers the set-up briefly; here I am sharing a more detailed guide to the process, with the following line-up.
- Pre-requisites
- Hadoop Configurations
- Running the Single Node Cluster
- Running a Map-reduce job
Pre-requisites
Java 1.6.X
Java needs to be installed on your node in order to run Hadoop, as it is a Java based framework. We can check the Java installation of the node by running:
pushpalanka@pushpalanka-laptop:~$ java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)
If this does not give the expected output, we need to install Java by running,
pushpalanka@pushpalanka-laptop:~$ sudo apt-get install sun-java6-jdk
or by running a downloaded binary Java installation package. Then set the environment variable JAVA_HOME in the relevant configuration file (.bashrc for Linux bash shell users).
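For bash users this can look like the following sketch; the path is only an illustration and should point to wherever Java is actually installed on your machine.

# append to ~/.bashrc - adjust the path to match your Java installation
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$JAVA_HOME/bin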
Create a Hadoop User
At this step we create a user account dedicated to the Hadoop installation. This is not a 'must do' step, but it is recommended for security reasons and for ease of managing the nodes. So we create a group called 'hadoop' and add a new user called 'hpuser' to the group (the names can be of our choice) with the following commands.
pushpalanka@pushpalanka-laptop:~$ sudo addgroup hadoop
pushpalanka@pushpalanka-laptop:~$ sudo adduser --ingroup hadoop hpuser
pushpalanka@pushpalanka-laptop:~$ su hpuser
hpuser@pushpalanka-laptop:~$
From now on we will work as hpuser, having switched with the last command.
Enable SSH Access
Hadoop requires SSH (Secure Shell) access to the machines it uses as nodes, in order to create a secured channel for exchanging data. So even on a single node, localhost needs to allow SSH access for hpuser so that data can be exchanged for Hadoop operations. Refer to the SSH documentation if you need more details on SSH.
We can install SSH with the following command.
hpuser@pushpalanka-laptop:~$ sudo apt-get install ssh
Now let's try to SSH into localhost without a passphrase.
hpuser@pushpalanka-laptop:~$ ssh localhost
Linux pushpalanka-laptop 2.6.32-33-generic #72-Ubuntu SMP Fri Jul 29 21:08:37 UTC 2011 i686 GNU/Linux
Ubuntu 10.04.4 LTS
Welcome to Ubuntu!
................................................
If this does not give something similar to the above, we have to enable SSH access to localhost as follows.
Generate an RSA key pair without a passphrase. (We skip the passphrase only because otherwise we would be prompted for it every time Hadoop communicates with the node.)
hpuser@pushpalanka-laptop:~$ ssh-keygen -t rsa -P ''
Then we need to append the generated public key to the authorized keys list of localhost. This is done as follows. Afterwards, make sure 'ssh localhost' succeeds.
hpuser@pushpalanka-laptop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Now we are ready to move on to Hadoop. :) Use the latest stable release from the Hadoop site; I used hadoop-1.0.3.
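If you need a sketch of the download and extraction step, something like the following works; the mirror URL and target directory are assumptions, so pick a current mirror and a location of your choice.

# fetch and unpack the Hadoop 1.0.3 release (URL and paths are illustrative)
cd ~
wget https://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
tar -xzf hadoop-1.0.3.tar.gz
# the extracted folder ~/hadoop-1.0.3 is what HADOOP_HOME will point to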
Hadoop Configurations
Set Hadoop Home
As we set JAVA_HOME at the beginning of this post, we need to set HADOOP_HOME too. Open the same file (.bashrc) and add the following two lines at the end.
export HADOOP_HOME=<absolute path to the extracted hadoop distribution>
export PATH=$PATH:$HADOOP_HOME/bin
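To make the new variables take effect in the current shell and to confirm the Hadoop scripts are on the path, a minimal check (assuming a bash shell) is:

source ~/.bashrc
hadoop version   # should print the Hadoop version, e.g. 1.0.3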
Disable IPv6
As there is no practical need for IPv6 addressing inside a Hadoop cluster, we disable it to be less error-prone. Open the file HADOOP_HOME/conf/hadoop-env.sh and add the following line.
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Set Paths and Configuration
In the same hadoop-env.sh file, set JAVA_HOME too.
export JAVA_HOME=<path to the java installation>
Then we need to create a directory to be used by HDFS (Hadoop Distributed File System). Create a directory at a place of your choice and make sure its owner is 'hpuser'. I will refer to this as 'temp_directory'. We can set the ownership with,
hpuser@pushpalanka-laptop:~$ sudo chown hpuser:hadoop <absolute path to temp_directory>
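For example, assuming the directory lives under hpuser's home (the path is only an illustration):

sudo mkdir -p /home/hpuser/hdfs-tmp           # choose any location you prefer
sudo chown hpuser:hadoop /home/hpuser/hdfs-tmp
sudo chmod 750 /home/hpuser/hdfs-tmp          # keep access limited to the hadoop group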
Now we add the following property segments to the relevant files, inside the <configuration>...</configuration> tags.
conf/core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>path to temp_directory</value>
  <description>Location for HDFS.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
</property>
conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>
conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default number of block replications.</description>
</property>
The replication value needs to be decided by weighing speed, space, and fault-tolerance priorities against each other.
Format HDFS
This operation is needed every time we create a new Hadoop cluster. If we do it on a running cluster, all the data will be lost. What it basically does is create the Hadoop Distributed File System on top of the local file system of the cluster.
hpuser@pushpalanka-laptop:~$ <HADOOP_HOME>/bin/hadoop namenode -format
12/09/20 14:39:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = pushpalanka-laptop/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
************************************************************/
...
12/09/20 14:39:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at pushpalanka-laptop/127.0.1.1
************************************************************/
If the output looks like the above, HDFS has been formatted successfully. Now we are almost done and ready to see some action.
Running the Single Node Cluster
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ ./start-all.sh
and you will see,
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ ./start-all.sh
starting namenode, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-namenode-pushpalanka-laptop.out
localhost: starting datanode, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-datanode-pushpalanka-laptop.out
localhost: starting secondarynamenode, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-secondarynamenode-pushpalanka-laptop.out
starting jobtracker, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-jobtracker-pushpalanka-laptop.out
localhost: starting tasktracker, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-tasktracker-pushpalanka-laptop.out
which starts the HDFS and MapReduce daemons. We can observe these Hadoop processes using the jps tool that ships with Java.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ jps
5767 NameNode
6619 Jps
6242 JobTracker
6440 TaskTracker
6155 SecondaryNameNode
5958 DataNode
We can also check the ports Hadoop is configured to listen on,
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ sudo netstat -plten | grep java
[sudo] password for hpuser:
tcp   0   0 0.0.0.0:33793   0.0.0.0:*   LISTEN   1001   175415   8164/java
....
tcp   0   0 0.0.0.0:60506   0.0.0.0:*   LISTEN   1001   174566   7767/java
tcp   0   0 0.0.0.0:50075   0.0.0.0:*   LISTEN   1001   176269   7962/java
There are several web interfaces available for observing internal behavior, job completion, memory consumption, etc., as follows.
- http://localhost:50070/ – web UI of the NameNode
- http://localhost:50030/ – web UI of the JobTracker
- http://localhost:50060/ – web UI of the TaskTracker
At the moment these will not show much information, as we have not yet submitted any job whose progress we could observe.
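A quick way to confirm the UIs are up without opening a browser is to poll them with curl; this is just a rough check, assuming the default ports used above.

# an HTTP 200 response means the daemon's web UI is reachable
for port in 50070 50030 50060; do
  curl -s -o /dev/null -w "localhost:$port -> %{http_code}\n" "http://localhost:$port/"
done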
We can stop the cluster at any time with the following command.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ ./stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Running a Map-reduce Job
Let's try to get some work done using Hadoop. We can use the word count example distributed with Hadoop, as explained at the Hadoop wiki. In brief, it takes text files as input, counts the number of repetitions of distinct words by mapping each line to a mapper, and outputs the reduced result as a text file.
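If you are curious which example programs ship with the distribution, running the examples jar without a program name should print the list of bundled examples, including wordcount (a quick check; the exact output depends on the release).

cd ~/hadoop-1.0.3
# with no program name the examples driver lists the available example programs
bin/hadoop jar hadoop*examples*.jar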
First create a folder and copy in some .txt files containing several tens of thousands of words. We are going to count the repetitions of the distinct words in those files.
hpuser@pushpalanka-laptop:~/tempHadoop$ ls -l
total 1392
-rw-r--r-- 1 hpuser hadoop 1423810 2012-09-21 02:10 pg5000.txt
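If you do not have large text files at hand, one option is to download a free plain-text e-book, which is what the pg5000.txt file above is; the URL below is only an assumption about the Project Gutenberg layout, and any sizeable .txt file will do.

mkdir -p ~/tempHadoop
cd ~/tempHadoop
# any large plain-text file works; this URL pattern may change over time
wget -O pg5000.txt http://www.gutenberg.org/cache/epub/5000/pg5000.txt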
Then start the cluster again, using start-all.sh as done before.
Copy the sample input file to HDFS.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -copyFromLocal /home/hpuser/tempHadoop/ /user/hpuser/testHadoop
Let's check whether it has been correctly copied to HDFS,
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hpuser/testHadoop
Found 1 items
-rw-r--r--   1 hpuser supergroup    1423810 2012-09-21 02:17 /user/hpuser/testHadoop/pg5000.txt
Now the inputs are ready; let's run the map-reduce job. For this we use a jar distributed with Hadoop that implements the job, which you can refer to later to learn how things are done.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hpuser/testHadoop /user/hpuser/testHadoop-output
12/09/21 02:24:34 INFO input.FileInputFormat: Total input paths to process : 1
12/09/21 02:24:34 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/09/21 02:24:34 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/21 02:24:34 INFO mapred.JobClient: Running job: job_201209210216_0003
12/09/21 02:24:35 INFO mapred.JobClient:  map 0% reduce 0%
12/09/21 02:24:51 INFO mapred.JobClient:  map 100% reduce 0%
12/09/21 02:25:06 INFO mapred.JobClient:  map 100% reduce 100%
12/09/21 02:25:11 INFO mapred.JobClient: Job complete: job_201209210216_0003
12/09/21 02:25:11 INFO mapred.JobClient: Counters: 29
12/09/21 02:25:11 INFO mapred.JobClient:   Job Counters
12/09/21 02:25:11 INFO mapred.JobClient:     Launched reduce tasks=1
12/09/21 02:25:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=17930
12/09/21 02:25:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/21 02:25:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/09/21 02:25:11 INFO mapred.JobClient:     Launched map tasks=1
12/09/21 02:25:11 INFO mapred.JobClient:     Data-local map tasks=1
12/09/21 02:25:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14153
12/09/21 02:25:11 INFO mapred.JobClient:   File Output Format Counters
12/09/21 02:25:11 INFO mapred.JobClient:     Bytes Written=337639
12/09/21 02:25:11 INFO mapred.JobClient:   FileSystemCounters
12/09/21 02:25:11 INFO mapred.JobClient:     FILE_BYTES_READ=466814
12/09/21 02:25:11 INFO mapred.JobClient:     HDFS_BYTES_READ=1423931
12/09/21 02:25:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=976811
12/09/21 02:25:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=337639
12/09/21 02:25:11 INFO mapred.JobClient:   File Input Format Counters
12/09/21 02:25:11 INFO mapred.JobClient:     Bytes Read=1423810
12/09/21 02:25:11 INFO mapred.JobClient:   Map-Reduce Framework
12/09/21 02:25:11 INFO mapred.JobClient:     Map output materialized bytes=466814
12/09/21 02:25:11 INFO mapred.JobClient:     Map input records=32121
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce shuffle bytes=466814
12/09/21 02:25:11 INFO mapred.JobClient:     Spilled Records=65930
12/09/21 02:25:11 INFO mapred.JobClient:     Map output bytes=2387668
12/09/21 02:25:11 INFO mapred.JobClient:     CPU time spent (ms)=9850
12/09/21 02:25:11 INFO mapred.JobClient:     Total committed heap usage (bytes)=167575552
12/09/21 02:25:11 INFO mapred.JobClient:     Combine input records=251352
12/09/21 02:25:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce input records=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce input groups=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Combine output records=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Physical memory (bytes) snapshot=237834240
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce output records=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=778846208
12/09/21 02:25:11 INFO mapred.JobClient:     Map output records=251352
While the job is running, the previously mentioned web interfaces let us observe the progress and the utilization of resources in a summarized way. As given in the command, the output that carries the word counts is written to '/user/hpuser/testHadoop-output'.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hpuser/testHadoop-output
Found 3 items
-rw-r--r--   1 hpuser supergroup          0 2012-09-21 02:25 /user/hpuser/testHadoop-output/_SUCCESS
drwxr-xr-x   - hpuser supergroup          0 2012-09-21 02:24 /user/hpuser/testHadoop-output/_logs
-rw-r--r--   1 hpuser supergroup     337639 2012-09-21 02:25 /user/hpuser/testHadoop-output/part-r-00000
To see what is inside the file, let's copy it to the local file system.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -getmerge /user/hpuser/testHadoop-output /home/hpuser/tempHadoop/out
12/09/21 02:38:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
Here the getmerge option is used to merge the output if it consists of several files; it takes the HDFS location of the output folder and the desired destination in the local file system. Now you can browse to the given output folder and open the result file, which will contain something like the following, depending on your input file.
"1 1 '1,' 1 '35' 1 '58,' 1 'AS'. 1 'Apple 1 'Abs 1 'Ah! 1
Now we have completed setting up a single node cluster with Apache Hadoop and running a map-reduce job. In a later post I will share how to set up a multi-node cluster.
Original Source
Hadoop Single Node Set-up – Guchex. By Pushpalanka Jayawardhana
Note: On RHEL the correct commands are groupadd and useradd, rather than addgroup and adduser. For example, to add the user:
useradd -G hadoop hpuser