Let's Crunch big data

As developers, we focus on simple, effective solutions, and one of our most valued principles is “Keep it simple, stupid”. With Hadoop MapReduce, however, it was a bit hard to stick to this. When data is evaluated across multiple MapReduce jobs, we end up with code that is related less to the business problem and more to infrastructure. Most non-trivial business data processing involves quite a few MapReduce tasks, which means longer development times and solutions that are harder to test.

Google presented a solution to these issues in its FlumeJava paper, and Apache Crunch is an implementation based on that paper. In a nutshell, Crunch is a Java library that simplifies the development of MapReduce pipelines. It provides a set of lazily evaluated collections that can be used to perform various operations, which are executed as MapReduce jobs.
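To make "lazily evaluated" concrete, here is a minimal sketch in plain Java. It does not use Crunch: LazyCollection is a hypothetical stand-in written for this post, and it only illustrates the idea that transformations are recorded first and executed later, when the results are actually requested. In Crunch the recorded plan is turned into MapReduce jobs; here it is just an in-memory chain of functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for a lazily evaluated collection; NOT a Crunch class.
final class LazyCollection<T> {
    // The deferred computation that produces the elements when asked.
    private final Supplier<List<T>> plan;

    private LazyCollection(Supplier<List<T>> plan) {
        this.plan = plan;
    }

    // Wraps an existing list without touching its elements.
    static <T> LazyCollection<T> of(List<T> source) {
        return new LazyCollection<>(() -> source);
    }

    // Records the transformation; fn is not called here.
    <R> LazyCollection<R> map(Function<T, R> fn) {
        return new LazyCollection<>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : plan.get()) {
                out.add(fn.apply(item));
            }
            return out;
        });
    }

    // Runs the whole recorded plan and returns the results.
    List<T> materialize() {
        return plan.get();
    }

    public static void main(String[] args) {
        LazyCollection<Integer> lengths = LazyCollection
                .of(Arrays.asList("lets", "crunch", "data"))
                .map(word -> {
                    System.out.println("processing " + word);
                    return word.length();
                });
        System.out.println("plan built, nothing processed yet");
        System.out.println(lengths.materialize());
    }
}
```

Running main prints the "plan built" line before any "processing" line, which shows that map only described the work and materialize triggered it.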

Here is what Brock Noland said in one of his posts introducing Crunch:

Using Crunch, a Java programmer with limited knowledge of Hadoop and MapReduce can utilize the Hadoop cluster. The program is written in pure Java and does not require the use of MapReduce specific constructs such as writing a Mapper, Reducer, or using Writable objects to wrap Java primitives.

Crunch supports reading data from various sources, such as sequence files, Avro, text, HBase and JDBC, with a simple read API:

<T> PCollection<T> read(Source<T> source)

You can import data in various formats such as JSON, Avro and Thrift, and perform efficient join, aggregation, sort, cartesian and filter operations. Additionally, any custom operation over these collections is quite easy to write: all you have to do is extend the simple, to-the-point DoFn class. You can unit test your DoFn implementations without any MapReduce constructs.
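The following sketch shows why such functions are easy to test. It compiles with the JDK alone: SimpleEmitter and SimpleDoFn are simplified stand-ins that mirror the general shape of Crunch's Emitter and DoFn (a process method that receives one input and an emitter for outputs), and they are not the org.apache.crunch classes. SplitWordsFn and the collect helper are likewise names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SplitWordsDemo {
    // Test helper: runs the function on one input and collects what it emits.
    static List<String> collect(SimpleDoFn<String, String> fn, String input) {
        List<String> out = new ArrayList<>();
        fn.process(input, out::add);
        return out;
    }

    public static void main(String[] args) {
        // prints [lets, crunch, big, data]
        System.out.println(collect(new SplitWordsFn(), "Lets Crunch big data"));
    }
}

// Stand-in for an emitter: receives each output value (NOT Crunch's Emitter).
interface SimpleEmitter<T> {
    void emit(T value);
}

// Stand-in for a DoFn: one input in, zero or more outputs emitted (NOT Crunch's DoFn).
abstract class SimpleDoFn<S, T> {
    abstract void process(S input, SimpleEmitter<T> emitter);
}

// Splits a line on whitespace and emits each word in lower case.
final class SplitWordsFn extends SimpleDoFn<String, String> {
    @Override
    void process(String line, SimpleEmitter<String> emitter) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitter.emit(word.toLowerCase(Locale.ROOT));
            }
        }
    }
}
```

Because the function only talks to an emitter, a test can pass in a list-backed emitter and assert on the collected values, with no cluster, job configuration or Writable types involved.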

I am not walking through a full Crunch program here. Usage is quite simple, and complete examples can be found on the Apache Crunch site.

Alternatively, you can generate a project from the available crunch-archetype, which also generates a simple WordCount example. The archetype can be selected using:

mvn archetype:generate -Dfilter=crunch-archetype

The project has quite a few examples for its different aspects and is also available in Scala.

So now let's CRUNCH some data!
 

Reference: Lets Crunch big data from our JCG partner Rahul Sharma at the The road so far… blog.
