Mapreduce in Java8

Advait TrivediMarch 9th, 2015Last Updated: March 6th, 2015

0 203 2 minutes read

Wordcount is to Mapreduce what “Hello world” is for rest of the programming world. Recently I have been exploring some of the more prominent Java 8’s features like Lambda, Streams and Optionals, so I thought it’d be nice to do a simplified version of Wordcount in Java.

Java’s Stream and Lambda functions makes it really easy and concise to implement a data processing pipeline. Prior to Java 8 it will take some extra ordinary effort to write code which utilizes Java’s concurrency mechanism and benefit from multicore CPU (which is normal now a days). So lets look at Java 8 code, I have two tests, they both do the same but one utilizes concurrent APIs in Java’s library and other one does not. The purpose is to illustrate how easy it is to write code runs concurrently and test how much time each takes – concurrent vs non-concurrent.

Now lets go through some code.

Various steps in the data processing pipeline are:

On line 19 and 34 we read a text file as a stream (test uses Jane Austen’s Pride And Prejudice from project Gutenberg – http://www.gutenberg.org/cache/epub/42671/pg42671.txt). Rest of the points explain line 22 and 37 from left to right.
Read each line and split it at non-word boundaries with regex \W. This will give a Stream of Array of strings – Stream<String[]>
We need a Stream of String – Stream<String> so lets flatten it using flatMap(Arrays::stream). stream function in Arrays generates a Stream from array. So, basically flatMap will flatten Stream<Stream<String>> to Stream<String>. Which is what we set out to achieve in the beginning of this point.
Next we convert each element in the stream toLowerCase, so that we dont count, say “Pride” and “pride” as 2 different words. Points 2, 3 & 4 are all about mapping – “Map” in Mapreduce
At this point, we have a Stream of all words from the book in lower case. So we can start grouping them by using groupingBy/groupingByConcurrent collector. Collector is a form of reduction, the “Reduce” in Mapreduce. groupingBy has 2 parameters – 1. Is a Function, also called classifier, which allows us to assign each value in the stream into a group. Second parameter is for downstream reduction. It allows us to further reduce elements which we grouped into each group. Since, we dont need to classify each element we just return the element as it is in the first argument s -> s. In the second argument we call counting() collector, which counts elements in each group.

I am sure by, now you can easily see how much of processing power we can express concisely.

A comparision of “time taken” in ms by concurrent vs non-concurrent tests on my system