
About Manoj Khangaonkar

MapReduce: A Soft Introduction

MapReduce is a parallel programming technique made popular by Google. It is used for processing very large amounts of data. Such processing can be completed in a reasonable amount of time only by distributing the work to multiple machines in parallel. Each machine processes a small subset of the data.

MapReduce is a programming model that lets developers focus on writing the code that processes their data without having to worry about the details of parallel execution.

MapReduce requires the data to be processed to be modeled as key-value pairs. The developer codes a map function and a reduce function.

The MapReduce runtime calls the map function once for each input key-value pair. The map function takes a key-value pair as input and emits zero or more intermediate key-value pairs as output.

The MapReduce runtime sorts and groups the output from the map functions by key. It then calls the reduce function once for each intermediate key, passing it the key and the list of values associated with that key. The output from the reduce function is a key-value pair, where the value is generally an aggregate or something else calculated from the list of values passed in for that key. The combined output of all the reduce calls is the required result.
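The contract between the developer and the runtime can be sketched with plain Java types. The interface names below are illustrative, not taken from any real framework; frameworks such as Hadoop define their own equivalents:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the two user-supplied functions; the interface
// names here are hypothetical, not from any particular framework.
public class MapReduceContract {

    interface MapFunction<K1, V1, K2, V2> {
        // Invoked once per input key-value pair; may emit any number of pairs.
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    interface ReduceFunction<K2, V2, V3> {
        // Invoked once per intermediate key, with all values grouped under it.
        V3 reduce(K2 key, List<V2> values);
    }

    public static void main(String[] args) {
        // Trivial instances: map emits (line, 1); reduce sums the ones.
        MapFunction<Long, String, String, Integer> mapper =
            (offset, line) -> List.of(new SimpleEntry<>(line.trim(), 1));
        ReduceFunction<String, Integer, Integer> reducer =
            (key, ones) -> ones.stream().mapToInt(Integer::intValue).sum();

        System.out.println(mapper.map(0L, "acct-42")); // [acct-42=1]
        System.out.println(reducer.reduce("acct-42", List.of(1, 1, 1))); // 3
    }
}
```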

As an example, let us say you have a large number of log files that contain audit logs for some event, such as access to an account. You need to find out how many times each account was accessed in the last 10 years. Assume each line in a log file is an audit record, and that the files are processed line by line. The map and reduce functions would look like this:

map(key, value) {
 // key = byte offset in the log file
 // value = one line from the log file
 if (value is an account access audit record) {
  account = parse account number from value
  output key = account, value = 1
 }
}

reduce(key, values) {
 // key = account number
 // values = list of 1s, e.g. {1, 1, 1, 1, ...}
 count = 0
 for each value in values
  count = count + value
 output key, count
}

The map function is called for each line in each log file. Lines that are not relevant are ignored. The account number is parsed out of each relevant line and output with a value of 1. The MapReduce runtime sorts and groups the map output by account number, then calls the reduce function once per account. The reduce function sums the values for each account, producing the required per-account access counts.
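The whole pipeline can be simulated on a single machine in a few lines of plain Java, with a sorted map standing in for the runtime's sort-and-group step. The log line format ("ACCESS &lt;account&gt;") is invented for the example; a real job would run the phases on separate machines:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine simulation of the account-access count. The log line
// format ("ACCESS <account>") is made up for illustration.
public class AccountAccessCount {

    public static Map<String, Integer> run(List<String> logLines) {
        // Map phase: emit (account, 1) for each relevant line.
        // Shuffle phase: the TreeMap groups and sorts emitted pairs by key,
        // standing in for the runtime's sort-and-group step.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : logLines) {
            if (line.startsWith("ACCESS ")) {              // ignore irrelevant lines
                String account = line.substring("ACCESS ".length());
                grouped.computeIfAbsent(account, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum the list of 1s for each account.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((account, ones) ->
            counts.put(account, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "ACCESS acct-1", "LOGIN bob", "ACCESS acct-2", "ACCESS acct-1");
        System.out.println(run(log)); // {acct-1=2, acct-2=1}
    }
}
```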

MapReduce jobs are generally executed on a cluster of machines. Each machine executes tasks, where each task is either a map task or a reduce task, and each task processes a subset of the data. In the above example, say we start with a set of large input files. The MapReduce runtime breaks the input data into partitions called splits (or shards). Each split is processed by a map task on some machine. The output from each map task is sorted and partitioned by key, and the outputs from all the map tasks are merged into the partitions that become the input to the reduce tasks.

There can be multiple machines, each running a reduce task. Each reduce task gets one partition to process. A partition can contain multiple keys, but all the data for a given key is in a single partition; in other words, each key is processed by exactly one reduce task.

The number of machines, the number of map tasks, the number of reduce tasks, and several other parameters are configurable.

MapReduce is useful for problems that require processing large data sets where the algorithm can be broken into map and reduce functions. The MapReduce runtime takes care of distributing the processing to multiple machines and of aggregating the results.

Apache Hadoop is an open source Java implementation of MapReduce. Stay tuned for a future blog / tutorial on MapReduce using Hadoop.

Reference: What is MapReduce? from our JCG partner at The Khangaonkar Report.

