About Munish Gupta

Munish K Gupta is a senior architect working in a leading IT services company. His experience is in building Online Portals, SaaS Platforms, CRM Solutions and Transaction Processing Systems. He is author of the book - Akka Essentials.

MapReduce for dummies

Continuing the coverage on Hadoop component, we will go through the MapReduce component. MapReduce is a concept that has been programming model of LISP. But before we jump into MapReduce, lets start with an example to understand how MapReduce works.

Given a couple of sentences, write a program that counts the number of words.

Now, the traditional thinking when solving this problem is read a word, check whether the word is one of the stop words, if not , add the word in a HashMap with key as the word and set the value to number of occurrences. If the word is not found in HashMap, then add the word and set the value to 1. If the word is found, then increment the value and word the same in HashMap.

Now, in this scenario, the program is processing the sentence in a serial fashion. Now, imagine if instead of a sentence, we need to count the number of words in encylopedia. Serial processing of this amount of data is time consuming. So, question is is there another algorithm we can use to speed up the processing.

Lets take the same problem and divide the same into 2 steps. In the first step, we take each sentence each and map the number of words in that sentence.

Once, the words have been mapped, lets move to the next step. In this step, we combine (reduce) the maps from two sentences into a single map.

That’s it, we have just seen how, individual sentences can be mapped individually and then once mapped, can be reduced to a single resulting map.Advantage of the MapReduce approach is

  • The whole process got distributed in small tasks that will help in faster completion of the job
  • Both the steps can be broken down into tasks. In the first, instance, run multiple map tasks, once the mapping is done, run multiple reduce tasks to combine the results and finally aggregate the results

Now, imagine this MapReduce paradigm working on the HDFS. HDFS has data nodes that splits and store the files in blocks. Now, if map the tasks on each of the data nodes, then we can easily leverage the compute power of those data node machines.

So, each of the data nodes, can run tasks (map or reduce) which are the essence of the MapReduce. As each data nodes stores data for multiple files, multiple tasks might be running at the same time for different data blocks.

To control the MapReduce tasks, there are 2 processes that need to be understood

  • JobTracker – The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
  • TaskTracker – TaskTracker is a process that starts and tracks MapReduce Tasks in a cluster. It contacts the JobTracker for Task assignments and reporting results.

These Trackers are part of the Hadoop itself and can be tracked easily via

  • http://<host-name>:50030/ – web UI for MapReduce job tracker(s)
  • http://<host-name>:50060/ – web UI for task tracker(s)

Reference: MapReduce for dummies from our JCG partner Munish K Gupta at the Tech Spot blog.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.

One Response to "MapReduce for dummies"

  1. Franklin says:

    Brilliant illustration of Map-Reduce algorithm.

    Thank you.

Leave a Reply


− 3 = two



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close