Apache Hadoop Tutorial – The ULTIMATE Guide (PDF Download)

EDITORIAL NOTE: Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be handled automatically by the framework. Hadoop has become the de facto tool ...

MapReduce Design Patterns Implemented in Apache Spark

This blog post is the first in a series that discusses some design patterns from the book MapReduce Design Patterns and shows how these patterns can be implemented in Apache Spark(R). When writing MapReduce or Spark programs, it is useful to think about the data flows needed to perform a job. Even if Pig, Hive, Apache Drill and Spark DataFrames make it ...

Is there a future for Map/Reduce?

Google’s Jeffrey Dean and Sanjay Ghemawat filed the patent request and published the map/reduce paper 10 years ago (2004). According to Wikipedia, Doug Cutting and Mike Cafarella created Hadoop, with its own implementation of Map/Reduce, one year later at Yahoo. Both implementations were built for the same purpose: batch indexing of the web. Back then, the web began its “web 2.0” transition, ...

Hadoop MapReduce Concepts

What do you mean by Map-Reduce programming? MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. The MapReduce programming model is inspired by functional languages and targets data-intensive computations. The input data format is application-specific, and is specified by the user. The output is a ...
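
The model described in this excerpt can be illustrated with a minimal, single-process sketch. It is not Hadoop code and uses no Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen only for illustration. Each input line is mapped independently, the emitted pairs are grouped by key, and each group is reduced on its own, which is what lets a real framework spread the work across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Every line is mapped independently of the others, so in a real
    # cluster these calls could run as separate parallel tasks.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    # Every key is reduced independently of the others as well.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop runs MapReduce", "Spark runs fast"]))
```

In this sketch the input format (a list of text lines) and the output format (a dictionary of counts) are picked by the caller, mirroring the point that both are application-specific.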

Can MapReduce solve planning problems?

To solve a planning or optimization problem, some solvers tend to scale out poorly: as the problem gains more variables and more constraints, they use a lot more RAM and CPU power. They can hit hardware memory limits at a few thousand variables and a few million constraint matches. One way their users typically work around such hardware limits is ...

Apache Spark is now a top-level project

The Apache Software Foundation (ASF) happily announced that Apache Spark has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying the project’s stability. Apache Spark is an Open Source cluster computing framework for fast and flexible large-scale data analysis. Spark has been the talk of the Big Data town for a while, and 2014 was predicted to ...

MapReduce Algorithms – Understanding Data Joins Part II

It’s been a while since I last posted, and like the last time I took a big break, I was taking some classes on Coursera. This time it was Functional Programming Principles in Scala and Principles of Reactive Programming. I found both of them to be great courses and would recommend taking either one if you have the time. In this post ...

Run your Hadoop MapReduce job on Amazon EMR

A while ago I posted how to set up an EMR cluster by using the CLI. In this post I will show how to set up the cluster by using the Java SDK for AWS. In my opinion, the best way to show how to do this with the Java AWS SDK is to show the complete example, so let's start. Set ...

Writing a Hadoop MapReduce task in Java

Although the Hadoop framework itself is written in Java, MapReduce jobs can be written in many different languages. In this post I show how to create a MapReduce job in Java based on a Maven project like any other Java project. Prepare the example input Let's start with a fictional business case. ...

