About Piyas De

Piyas is a Sun Microsystems certified Enterprise Architect with 10+ years of professional IT experience in areas such as architecture definition, enterprise application design, and client-server/e-business solutions.

Setting up Apache Hadoop Multi-Node Cluster

We are sharing our experience of installing Apache Hadoop on Linux-based machines in a multi-node setup. We will also share the troubleshooting we went through, and will update this post in the future.

User creation and other configuration steps -

  • We start by adding a dedicated Hadoop system user on each node of the cluster.

 
 
 

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
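To verify that the group and user were created as expected (an optional check we are adding here, not part of the original steps), the standard id command can be used:

$ id hduser   # should report hadoop as hduser's primary group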
  • Next we configure SSH (Secure Shell) on all the cluster nodes to enable secure data communication.
user@node1:~$ su - hduser
hduser@node1:~$ ssh-keygen -t rsa -P ""

The output will be something like the following:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
.....
  • Next we need to enable SSH access to the local machine with this newly created key:
hduser@node1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Repeat the above steps on all the cluster nodes and test by executing the following command:

hduser@node1:~$ ssh localhost

This step is also needed to save the local machine’s host key fingerprint to the hduser user’s known_hosts file.

Next we need to edit the /etc/hosts file, in which we put the IP address and hostname of each system in the cluster.

In our scenario we have one master (with IP 192.168.0.100) and one slave (with IP 192.168.0.101)

$ sudo vi /etc/hosts

and add the entries to the hosts file as IP/hostname pairs:

 
192.168.0.100 master
192.168.0.101 slave
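As an optional sanity check (our addition, not strictly required), hostname resolution can be verified from each node with a simple ping:

$ ping -c 1 master
$ ping -c 1 slave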
  • Providing SSH access

The hduser user on the master node must be able to connect

    1. to its own user account on the master via ssh master (in this context, not just ssh localhost).
    2. to the hduser account of the slave(s) via a password-less SSH login.

So we distribute the SSH public key of hduser@master to all of its slaves (in our case we have only one slave; if you have more, run the following command for each of them, changing the machine name, i.e. slave, slave1, slave2).

hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave

Test by connecting from master to master and from master to the slave(s), and check that everything works.
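A minimal check, assuming the host names configured above; both logins should succeed without asking for a password (the very first connection will only ask to confirm the host key):

hduser@master:~$ ssh master
hduser@master:~$ ssh slave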

Configuring Hadoop

  • Let us edit the conf/masters file (only on the master node)

and we enter master into the file.

By doing this we tell Hadoop to start the SecondaryNameNode daemon of our multi-node cluster on this machine.

The primary NameNode and the JobTracker will always run on the machine on which we execute bin/start-dfs.sh and bin/start-mapred.sh, respectively.
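A one-line sketch of the conf/masters edit, assuming we are inside the Hadoop installation directory (the same directory from which bin/start-dfs.sh is run):

$ echo "master" > conf/masters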

  • Let us now edit conf/slaves (only on the master node) with:
master
slave

This means that we also run a DataNode process on the master machine, where the NameNode is running as well. We can leave the master out of the slave role if we have more machines at our disposal to act as DataNodes.

If we have more slaves, we add one host per line, like the following (a command sketch is given after the list):

master
slave
slave2
slave3

and so on.
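A minimal sketch of creating conf/slaves, again assuming we are inside the Hadoop installation directory and that slave2/slave3 are only illustrative host names:

$ cat > conf/slaves <<EOF
master
slave
slave2
slave3
EOF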

Let us now edit two important files (on all the nodes in our cluster):

  1. conf/core-site.xml
  2. conf/hdfs-site.xml

1) conf/core-site.xml

We have to change the fs.default.name parameter, which specifies the NameNode host and port. (In our case this is the master machine.)

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  …..[Other XML Values]
</property>

Create a directory in which Hadoop will store its data -

$ sudo mkdir -p /app/hadoop

We have to ensure the directory is writable by any user:

$ sudo chmod 777 /app/hadoop

Modify core-site.xml once again to add the following property:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop</value>
</property>
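Putting the two snippets together, a minimal complete conf/core-site.xml would look roughly like this (only the properties discussed here are shown; any other properties belong inside the same <configuration> element):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop</value>
  </property>
</configuration>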

2) conf/hdfs-site.xml

We have to change the dfs.replication parameter which specifies default block replication. It defines how many machines a single file should be replicated to before it becomes available. If we set this to a value higher than the number of available slave nodes (more precisely, the number of DataNodes), we will start seeing a lot of “(Zero targets found, forbidden1.size=1)” type errors in the log files.

The default value of dfs.replication is 3. However, as we have only two nodes available in our scenario, we set dfs.replication to 2.

<property>
  <name>dfs.replication</name>
  <value>2</value>
  …..[Other XML Values]
</property>
  • Let us format the HDFS file system via the NameNode.

Run the following command on the master:

bin/hadoop namenode -format
  • Let us start the multi-node cluster:

Run the command (in our case we will run it on the machine named master):

bin/start-dfs.sh
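If the MapReduce layer is also needed (it is referred to earlier and in the troubleshooting section below), it is started from the same machine with the second start script; note that the jps output shown next reflects only the HDFS daemons:

bin/start-mapred.sh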

Checking Hadoop Status -

After everything has started, run the jps command on all the nodes to check whether everything is running properly.

On the master node the expected output will be -

$ jps

14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode

On the slave(s):

$ jps
15314 Jps
14880 DataNode

Of course, the process IDs will vary from machine to machine.

Troubleshooting

The DataNode might not start on all of our nodes. In that case, check the log file

logs/hadoop-hduser-datanode-.log

on the affected nodes for the exception -

java.io.IOException: Incompatible namespaceIDs

In this case we need to do the following (a command sketch follows the list) -

  1. Stop the full cluster, i.e. both MapReduce and HDFS layers.
  2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml. In our case (with hadoop.tmp.dir set to /app/hadoop as above), the relevant directory is /app/hadoop/dfs/data.
  3. Reformat the NameNode. All HDFS data will be lost during the format process.
  4. Restart the cluster.
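A rough command sketch of these four steps, assuming the scripts shown earlier and the data directory noted in step 2 (the prompt paths are illustrative; adjust the path to your actual dfs.data.dir):

hduser@master:~/hadoop$ bin/stop-mapred.sh && bin/stop-dfs.sh    # step 1: stop MapReduce, then HDFS
hduser@slave:~$ rm -rf /app/hadoop/dfs/data                      # step 2: on the problematic DataNode only
hduser@master:~/hadoop$ bin/hadoop namenode -format              # step 3: all HDFS data is lost
hduser@master:~/hadoop$ bin/start-dfs.sh && bin/start-mapred.sh  # step 4: restart the cluster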

Or

We can manually update the namespaceID of the problematic DataNodes (a sketch of the files involved follows these steps):

  1. Stop the problematic DataNode(s).
  2. Edit the value of namespaceID in ${dfs.data.dir}/current/VERSION to match the corresponding value of the current NameNode in ${dfs.name.dir}/current/VERSION.
  3. Restart the fixed DataNode(s).
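A sketch of what this looks like on disk, using the directory layout assumed above (adjust the paths to your dfs.name.dir and dfs.data.dir):

hduser@master:~$ cat /app/hadoop/dfs/name/current/VERSION   # note the namespaceID of the NameNode
hduser@slave:~$ vi /app/hadoop/dfs/data/current/VERSION     # set namespaceID to the NameNode's value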

In Running Map-Reduce Job in Apache Hadoop (Multinode Cluster), we will share our experience of running a MapReduce job based on the Apache Hadoop examples.


3 Responses to "Setting up Apache Hadoop Multi-Node Cluster"

  1. mohammad says:

    hi & thanks for this post.
    Is it possible for you to do these steps on a Windows-based system?
    (I have this error: Cannot connect to the Map/Reduce location java.net.ConnectException … in Eclipse - based on http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html )

    thanks.

  2. Shiv says:

    Sir at step
    Configuring Hadoop

    Let us edit the conf/masters (only in the masters node)

    and we enter master into the file.

    where is the conf/masters file located in the package, or do we have to create the file ourselves, and with what extension, and how do we write in it…
    please clarify sir

  3. Hafiz Muhammad Shafiq says:

    Good tutorial, but without an example.
    Also I had to face the issue discussed in the Troubleshooting part above. I solved it by changing the port numbers on the slaves and master (it was not mentioned above).
