
Hadoop Modes Explained – Standalone, Pseudo Distributed, Distributed

After understanding what Hadoop is, let's start Hadoop on a single machine:

This post contains instructions for installing Hadoop on Ubuntu. It is a quick, step-by-step tutorial covering all the commands, with descriptions, required to install Hadoop in Standalone mode (single node cluster), in Pseudo distributed mode (single node cluster) and in distributed mode (multi node cluster).

The main goal of this tutorial is to get a "simple" Hadoop installation up and running so that you can play around with the software and learn more about it.

This Tutorial has been tested on:

  • Ubuntu Linux (10.04 LTS)
  • Hadoop 0.20.2

Prerequisites:
Install Java:
Java 1.6.x (either Sun Java or OpenJDK) is recommended for Hadoop.

1. Add the Canonical Partner Repository to your apt repositories:

$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner" 

2. Update the source list

$ sudo apt-get update 

3. Install sun-java6-jdk

$ sudo apt-get install sun-java6-jdk 

4. After installation, make a quick check whether Sun’s JDK is correctly set up:

user@ubuntu:~# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

Adding a dedicated Hadoop system user:
We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).

$ sudo adduser hadoop_admin  
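The namenode -format output later in this post shows hadoop_admin belonging to a hadoop group. If you also want that dedicated group (optional; the group name here is an assumption), you could create the account like this instead:

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop_admin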

Log in as the hadoop_admin user:

user@ubuntu:~$ su - hadoop_admin

Hadoop Installation:

$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo chown -R hadoop_admin /usr/local/hadoop-0.20.2   
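These commands assume the Hadoop release archive has already been copied to /usr/local. If you still need it, Hadoop 0.20.2 can be downloaded from the Apache archives (verify the URL before using it, as mirror layouts can change):

$ cd /usr/local
$ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

Optionally, since the startup logs shown later in this post refer to /usr/local/hadoop, you can create a convenience symlink to the versioned directory (not required for this tutorial):

$ sudo ln -s /usr/local/hadoop-0.20.2 /usr/local/hadoop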

Define JAVA_HOME:

Edit the configuration file /usr/local/hadoop-0.20.2/conf/hadoop-env.sh and set JAVA_HOME to the root of your Java installation (e.g. /usr/lib/jvm/java-6-sun):

$ vi conf/hadoop-env.sh
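In hadoop-env.sh the JAVA_HOME line typically ships commented out; uncomment it and point it at your JDK. Using the example path from above (adjust it to wherever your JDK actually lives):

export JAVA_HOME=/usr/lib/jvm/java-6-sun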

Go to your Hadoop installation directory (HADOOP_HOME, i.e. /usr/local/hadoop-0.20.2/) and run the hadoop script without arguments:

$ bin/hadoop  

It will generate the following output:

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format                format the DFS filesystem
secondarynamenode               run the DFS secondary namenode
namenode                        run the DFS namenode
datanode                        run a DFS datanode
dfsadmin                        run a DFS admin client
mradmin                         run a Map-Reduce admin client
fsck                            run a DFS filesystem checking utility
fs                              run a generic filesystem user client
balancer                        run a cluster balancing utility
jobtracker                      run the MapReduce job Tracker node
pipes                           run a Pipes job
tasktracker                     run a MapReduce task Tracker node
job                             manipulate MapReduce jobs
queue                           get information regarding JobQueues
version                         print the version
jar <jar>                       run a jar file
distcp <srcurl>                 <desturl> copy file or directories recursively
archive -archiveName NAME <src>*<dest> create a hadoop archive
daemonlog                       get/set the log level for each daemon
or
CLASSNAME                       run the class named CLASSNAME
Most commands print help when invoked w/o parameters:

Hadoop Setup in Standalone Mode is Completed!

Now let's run some examples:
1. Run Classic Pi example:

$ bin/hadoop jar hadoop-*-examples.jar pi 10 100   
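In this example the first argument (10) is the number of map tasks and the second (100) is the number of samples computed per map; more samples give a better estimate of pi at the cost of a longer run. For instance, you could also try:

$ bin/hadoop jar hadoop-*-examples.jar pi 16 1000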

2. Run grep example:

$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' 
$ cat output/*
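Note that Hadoop refuses to run a job if its output directory already exists, so if you want to re-run the grep example, remove the old output first:

$ rm -rf output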

3. Run word count example:

$ mkdir inputwords
$ cp conf/*.xml inputwords
$ bin/hadoop jar hadoop-*-examples.jar wordcount inputwords outputwords 
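As with the grep example, in standalone mode the results land in the local outputwords directory, so you can inspect them directly:

$ cat outputwords/*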

If you find any error, visit Hadoop troubleshooting.

After running Hadoop in Standalone mode, let's start Hadoop in Pseudo distributed mode (single node cluster):

Configuring SSH:
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hadoop_admin user.

user@ubuntu:~$ su - hadoop_admin   
hadoop_admin@ubuntu:~$ sudo apt-get install openssh-server openssh-client

hadoop_admin@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop_admin/.ssh/id_rsa):
Created directory '/home/hadoop_admin/.ssh'.
Your identification has been saved in /home/hadoop_admin/.ssh/id_rsa.
Your public key has been saved in /home/hadoop_admin/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hadoop_admin@ubuntu
The key's randomart image is:
[...snipp...]
hadoop_admin@ubuntu:~$

Enable SSH access to your local machine and connect using ssh:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is e7:89:26:49:ae:02:30:eb:1d:75:4f:bb:44:f9:36:29.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 30 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
$

Edit configuration files:

$ vi conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
</property>

</configuration>

If you give some other path, ensure that the hadoop_admin user has read and write permission on that directory (sudo chown hadoop_admin /your/path).
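For example, if you wanted to keep Hadoop's temporary data under /app/hadoop/tmp (a purely illustrative path), you could prepare the directory before pointing hadoop.tmp.dir at it:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hadoop_admin /app/hadoop/tmp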

$ vi conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
$ vi conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name> 
<value>localhost:9001</value>
</property>
</configuration>

Formatting the NameNode:

$ bin/hadoop namenode -format

It will generate the following output:

$ bin/hadoop namenode -format
10/05/10 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/10 16:59:56 INFO namenode.FSNamesystem: fsOwner=hadoop_admin,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../.../dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
$

STARTING SINGLE-NODE CLUSTER:

$ bin/start-all.sh

It will generate the following output:

hadoop_admin@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ubuntu.out
hadoop_admin@ubuntu:/usr/local/hadoop$

Check whether the expected Hadoop processes are running with jps:

$ jps
14799 NameNode
14977 SecondaryNameNode 
15183 DataNode
15596 JobTracker
15897 TaskTracker
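If one of these processes is missing, the corresponding daemon most likely failed to start; its log files are written to the logs directory under HADOOP_HOME (the exact file names depend on your user name and host name), for example:

$ ls logs/
$ tail -n 50 logs/hadoop-*-namenode-*.log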

Hadoop Setup in Pseudo Distributed Mode is Completed!

STOPPING SINGLE-NODE CLUSTER:

$ bin/stop-all.sh

It will generate the following output:

$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
$

You can run the same set of examples as in standalone mode in order to check whether your installation is successful.

Web based Interface for NameNode: http://localhost:50070
Web based Interface for JobTracker: http://localhost:50030
Web based Interface for TaskTracker: http://localhost:50060

After running Hadoop in Pseudo distributed mode, let's start Hadoop in distributed mode (multi node cluster).

Prerequisite: Before starting Hadoop in distributed mode you must set up Hadoop in pseudo distributed mode, and you need at least two machines, one for the master and another for the slave (you can create more than one virtual machine on a single physical machine).

The steps below list each command, what it does, and on which machine(s) in the cluster it should be run.

1. Stop any running cluster before starting Hadoop in distributed mode. Run this command on all machines in the cluster (master and slave):

$ bin/stop-all.sh

2. Edit /etc/hosts on all machines in the cluster (master and slave) and add the IP address and hostname of each node, for example:

$ vi /etc/hosts
192.168.0.1 master
192.168.0.2 slave

3. Set up passwordless SSH from the master to the slave (you must log in with the same user name on all machines). Run this command on the master:

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave

Alternatively, you can set up passwordless SSH manually: print the master's public key with

$ cat .ssh/id_rsa.pub

and append its contents to the .ssh/authorized_keys file of the slave (the system you wish to SSH to without being prompted for a password).

4. Edit conf/masters and add the line "master". The conf/masters file defines the master node of our multi-node cluster. Run this command on the master:

$ vi conf/masters

5. Edit conf/slaves and add the line "slave". The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. Run this command on all machines in the cluster (master and slave):

$ vi conf/slaves

6. Edit the configuration file conf/core-site.xml on all machines in the cluster (master and slave):

$ vi conf/core-site.xml

then add:

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

7. Edit the configuration file conf/mapred-site.xml on all machines in the cluster (master and slave):

$ vi conf/mapred-site.xml

then add:

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

8. Edit the configuration file conf/hdfs-site.xml on all machines in the cluster (master and slave):

$ vi conf/hdfs-site.xml

then add:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

9. Edit the configuration file conf/mapred-site.xml again, this time on the master:

$ vi conf/mapred-site.xml

then add:

<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>

10. Start the multi-node cluster. First the HDFS daemons are started: the namenode daemon is started on the master, and datanode daemons are started on all slaves. Run this command on the master:

$ bin/start-dfs.sh

11. Check which Java processes are running with jps. On the master it should give output like this:

$ jps
14799 NameNode
15314 Jps
16977 SecondaryNameNode

On each slave it should give output like this:

$ jps
15183 DataNode
15616 Jps

12. Next the MapReduce daemons are started: the jobtracker is started on the master, and tasktracker daemons are started on all slaves. Run this command on the master:

$ bin/start-mapred.sh

13. Check the Java processes again with jps. On the master it should give output like this:

$ jps
16017 Jps
14799 NameNode
15596 JobTracker
14977 SecondaryNameNode

On each slave it should give output like this:

$ jps
15183 DataNode
15897 TaskTracker
16284 Jps

Congratulations, Hadoop Setup in Distributed Mode is Completed!

Web based Interface for NameNode: http://localhost:50070
Web based Interface for JobTracker: http://localhost:50030
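You can also confirm from the command line that the datanodes have joined the cluster using the dfsadmin client listed in the hadoop usage output earlier (the exact report layout varies between Hadoop versions):

$ bin/hadoop dfsadmin -report

The report shows the configured and used DFS capacity, followed by one section per live datanode.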
Now let's run some examples:

1. Run the pi example:

$ bin/hadoop jar hadoop-*-examples.jar pi 10 100

2. Run the grep example:

$ bin/hadoop dfs -mkdir input
$ bin/hadoop dfs -put conf input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ bin/hadoop dfs -cat output/*

3. Run the wordcount example:

$ bin/hadoop dfs -mkdir inputwords
$ bin/hadoop dfs -put conf inputwords
$ bin/hadoop jar hadoop-*-examples.jar wordcount inputwords outputwords
$ bin/hadoop dfs -cat outputwords/*

To stop the daemons, run these commands on the master:

$ bin/stop-mapred.sh
$ bin/stop-dfs.sh

Reference: Hadoop in Standalone Mode, Hadoop in Pseudo Distributed Mode & Hadoop in Distributed Mode from our JCG partner Rahul Patodi at the High Performance Computing blog.
