Tag Archives: Apache Hadoop

Hadoop MapReduce Concepts

What do you mean by Map-Reduce programming? MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. The MapReduce programming model is inspired by functional languages and targets data-intensive computations. The input data format is application-specific, and is specified by the user. The output is a ...

MapReduce Algorithms – Understanding Data Joins Part II

It’s been a while since I last posted and, like the last time I took a big break, I was taking some classes on Coursera. This time it was Functional Programming Principles in Scala and Principles of Reactive Programming. I found both of them to be great courses and would recommend taking either one if you have the time. In this post ...

Coordination and service discovery with Apache ZooKeeper

Service-oriented design has proven to be a successful solution for a huge variety of distributed systems. When used properly, it has a lot of benefits. But as the number of services grows, it becomes more difficult to understand what is deployed and where. And because we are building reliable and highly available systems, there is yet another question to ask: how many instances ...

Configuring Hadoop with Guava MapSplitters

In this post we are going to provide a new twist on passing configuration parameters to a Hadoop Mapper via the Context object. Typically, we set configuration parameters as key/value pairs on the Context object when starting a map-reduce job. Then in the Mapper we use the key(s) to retrieve the value(s) to use for our configuration needs. The twist ...

Unit testing a Java Hadoop job

In my previous post I showed how to set up a complete Maven-based project to create a Hadoop job in Java. Of course it wasn’t really complete, because the unit test part was missing. In this post I show how to add MapReduce unit tests to the project I started previously. For the unit test I make use of ...

Run your Hadoop MapReduce job on Amazon EMR

A while ago I posted how to set up an EMR cluster using the CLI. In this post I will show how to set up the cluster using the Java SDK for AWS. In my opinion, the best way to show how to do this with the Java AWS SDK is to show the complete example, so let’s start. Set ...

Writing a Hadoop MapReduce task in Java

Although the Hadoop framework itself is written in Java, MapReduce jobs can be written in many different languages. In this post I show how to create a MapReduce job in Java based on a Maven project, like any other Java project. Prepare the example input: let’s start with a fictional business case. ...

Big Data Open Source Security

In security there have never (IMHO) been enough open source solutions. Bruce Schneier has written about this several times in the past, so there’s no need to rewrite the arguments again. Now, with the “NoSQL” and “Big Data” open source trends in the marketplace, security finally has an intersection… a union, if I may, where new solutions to solve ...

MapReduce Algorithms – Understanding Data Joins Part 1

In this post we continue with our series of implementing the algorithms found in the Data-Intensive Text Processing with MapReduce book, this time discussing data joins. While we are going to discuss the techniques for joining data in Hadoop and provide sample code, in most cases you probably won’t be writing code to perform joins yourself. Instead, joining data is ...
