MapReduce for dummies

Munish GuptaMay 11th, 2012Last Updated: October 22nd, 2012

1 180 2 minutes read

Continuing the coverage on Hadoop component, we will go through the MapReduce component. MapReduce is a concept that has been programming model of LISP. But before we jump into MapReduce, lets start with an example to understand how MapReduce works.

Given a couple of sentences, write a program that counts the number of words.

Now, the traditional thinking when solving this problem is read a word, check whether the word is one of the stop words, if not , add the word in a HashMap with key as the word and set the value to number of occurrences. If the word is not found in HashMap, then add the word and set the value to 1. If the word is found, then increment the value and word the same in HashMap.

Now, in this scenario, the program is processing the sentence in a serial fashion. Now, imagine if instead of a sentence, we need to count the number of words in encylopedia. Serial processing of this amount of data is time consuming. So, question is is there another algorithm we can use to speed up the processing.

Lets take the same problem and divide the same into 2 steps. In the first step, we take each sentence each and map the number of words in that sentence.

Once, the words have been mapped, lets move to the next step. In this step, we combine (reduce) the maps from two sentences into a single map.

That’s it, we have just seen how, individual sentences can be mapped individually and then once mapped, can be reduced to a single resulting map.Advantage of the MapReduce approach is

The whole process got distributed in small tasks that will help in faster completion of the job
Both the steps can be broken down into tasks. In the first, instance, run multiple map tasks, once the mapping is done, run multiple reduce tasks to combine the results and finally aggregate the results

Now, imagine this MapReduce paradigm working on the HDFS. HDFS has data nodes that splits and store the files in blocks. Now, if map the tasks on each of the data nodes, then we can easily leverage the compute power of those data node machines.

So, each of the data nodes, can run tasks (map or reduce) which are the essence of the MapReduce. As each data nodes stores data for multiple files, multiple tasks might be running at the same time for different data blocks.

To control the MapReduce tasks, there are 2 processes that need to be understood

JobTracker – The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
TaskTracker – TaskTracker is a process that starts and tracks MapReduce Tasks in a cluster. It contacts the JobTracker for Task assignments and reporting results.

These Trackers are part of the Hadoop itself and can be tracked easily via