Is there a future for Map/Reduce?

Google’s Jeffrey Dean and Sanjay Ghemawat filed the patent request and published the Map/Reduce paper ten years ago (2004). According to Wikipedia, Doug Cutting and Mike Cafarella created Hadoop, with its own implementation of Map/Reduce, one year later at Yahoo. Both implementations were built for the same purpose: batch indexing of the web.

Back then, the web began its “web 2.0” transition: pages became more dynamic and people began to create more content, so an efficient way to reprocess and rebuild the web index was needed, and Map/Reduce was it. Web indexing was a great fit for Map/Reduce, since the initial processing of each source (web page) is completely independent of any other, i.e. a very convenient map phase, and you then combine the results to build the inverted index. That said, even the core Google algorithm, the famous PageRank, is iterative (so less appropriate for Map/Reduce), not to mention that as the internet got bigger and updates became more and more frequent, Map/Reduce wasn’t enough. Again Google (who seem to be consistently a few years ahead of the industry) began coming up with alternatives like Percolator and Dremel (both papers were published in 2010; Percolator was introduced that year, and Dremel had been in use at Google since 2006).
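To make that fit concrete, here is a minimal sketch (my own, not Google’s or Hadoop’s actual indexing code) of inverted-index construction in Hadoop’s Java Map/Reduce API. It assumes an input format that hands the mapper (url, pageText) pairs:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Map phase: each page is processed independently of all others.
    public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text pageText, Context context)
                throws IOException, InterruptedException {
            for (String term : pageText.toString().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    context.write(new Text(term), url); // emit (term, url)
                }
            }
        }
    }

    // Reduce phase: the shuffle groups all urls per term, and the reducer
    // combines them into the posting list that makes up the inverted index.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text url : urls) {
                if (postings.length() > 0) postings.append(',');
                postings.append(url.toString());
            }
            context.write(term, new Text(postings.toString()));
        }
    }
}
```

The map side is embarrassingly parallel and the framework handles the grouping, which is exactly why this workload made Map/Reduce look so good.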

So now it is 2014, and it is time for the rest of us to catch up with Google and get over Map/Reduce, for multiple reasons:

  • end users’ expectations: they hear “big data” but interpret it as “fast data”
  • iterative problems, like graph algorithms, which are inefficient since you need to load and reload the data on each iteration (see the sketch after this list)
  • continuous ingestion of data (increments coming in as small batches or streams of events), where joining to existing data can be expensive
  • real-time problems: both queries and processing.
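To illustrate the iterative pain point: on plain Map/Reduce, an algorithm like PageRank is typically run by a driver that submits one full job per iteration, and every iteration re-reads its input from HDFS and writes its result back to HDFS. A hypothetical driver (RankMapper and RankReducer are placeholder names, not real classes) would look roughly like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int iterations = 20; // graph algorithms often need tens of passes

        for (int i = 0; i < iterations; i++) {
            Job job = Job.getInstance(conf, "pagerank-iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // job.setMapperClass(RankMapper.class);   // hypothetical
            // job.setReducerClass(RankReducer.class); // hypothetical

            // The whole dataset is read from HDFS and written back to HDFS
            // on every iteration -- this disk round-trip is what makes
            // Map/Reduce such a poor fit for iterative algorithms.
            FileInputFormat.addInputPath(job, new Path("ranks-" + i));
            FileOutputFormat.setOutputPath(job, new Path("ranks-" + (i + 1)));

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
        }
    }
}
```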

In my opinion, Map/Reduce is an idea whose time has come and gone. It won’t die in a day or a year; there are still a lot of working systems that use it, and the alternatives are still maturing. I do think, however, that if you need to write or implement something new that would build on Map/Reduce, you should use other options, or at the very least carefully consider them.

So how is this change going to happen? Luckily, Hadoop has recently adopted YARN (you can see my presentation on it here), which opens up the possibility of going beyond Map/Reduce without replacing everything, even though in effect a lot will change. Note that some of the new options have migration paths, that we still retain access to all that “big data” we already have in Hadoop, and that much of the ecosystem can be reused.

The first type of effort to replace Map/Reduce is to actually subsume it by offering more flexible batch processing. After all, saying that Map/Reduce is no longer relevant doesn’t mean that batch processing is no longer relevant; it does mean that there’s a need to support more complex processing flows. There are two main candidates here: Tez and Spark. Tez offers a nice migration path, as it is replacing Map/Reduce as the execution engine for both Pig and Hive, while Spark has a compelling offer in combining batch and stream processing (more on this later) in a single engine.
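To give a feel for Spark’s single-engine pitch, here is a minimal sketch using Spark’s Java API (the HDFS path, host, and port are made-up values for illustration): the same filter logic runs once over data at rest and then continuously over micro-batches of arriving events.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchAndStream {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("batch-and-stream")
                .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Batch: count error lines in data already sitting in HDFS.
        long archivedErrors = sc.textFile("hdfs:///logs/archive")
                                .filter(line -> line.contains("ERROR"))
                                .count();
        System.out.println("archived errors: " + archivedErrors);

        // Streaming: the same filter, applied to 5-second micro-batches
        // of events arriving on a socket.
        JavaStreamingContext ssc =
                new JavaStreamingContext(sc, Durations.seconds(5));
        ssc.socketTextStream("localhost", 9999)
           .filter(line -> line.contains("ERROR"))
           .count()
           .print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```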

The second type of effort, or processing capability, that will help kill Map/Reduce is MPP databases on Hadoop. Like the “flexible batch” approach mentioned above, this replaces functionality that Map/Reduce was used for: unleashing the data already processed and stored in Hadoop. The idea here is twofold:

  • To provide fast query capabilities* – by using specialized columnar data formats and database engines deployed as daemons on the cluster
  • To provide rich query capabilities – by supporting more and more of the SQL standard and enriching it with analytics capabilities (e.g. via MADlib).

Efforts in this arena include Impala from Cloudera, Hawq from Pivotal (which is essentially Greenplum over HDFS), startups like Hadapt, and even Actian, which is trying to leverage its ParAccel acquisition with the recently announced Actian Vector. Hive is somewhere in the middle, relying on Tez on the one hand and using vectorization and a columnar format (ORC) on the other.
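From the application side, querying data already in Hadoop through such an engine looks like ordinary SQL over JDBC. A minimal sketch against HiveServer2 (the host name and table are made up for the example; Impala exposes a similar JDBC endpoint):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoop {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://namenode.example.com:10000/default");
             Statement stmt = conn.createStatement()) {
            // An analytical query over a (hypothetical) ORC-backed table,
            // answered by engine daemons running on the cluster.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, AVG(latency_ms) "
                    + "FROM requests GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```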

The third type of processing that will help dethrone Map/Reduce is stream processing. Unlike the two previous types of effort, this covers ground that Map/Reduce can’t cover, even inefficiently. Stream processing is about handling a continuous flow of new data (e.g. events) and processing it (enriching, aggregating, etc.) in seconds or less. The two major contenders in the Hadoop arena seem to be Spark Streaming and Storm though, of course, there are several other commercial and open source platforms that handle this type of processing as well.
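As a taste of the model, here is a minimal Spark Streaming sketch (the socket source and the “type,payload” event format are assumptions made for the example) that keeps per-type event counts over a sliding 30-second window, recomputed every 5 seconds; a Storm topology would express the same idea with spouts and bolts:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamAggregation {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("stream-agg")
                .setMaster("local[2]");
        JavaStreamingContext ssc =
                new JavaStreamingContext(conf, Durations.seconds(5));

        // Events arrive as "type,payload" lines; count events per type
        // over a sliding 30-second window, updated every 5 seconds.
        ssc.socketTextStream("localhost", 9999)
           .mapToPair(line -> new Tuple2<String, Integer>(line.split(",")[0], 1))
           .reduceByKeyAndWindow((a, b) -> a + b,
                                 Durations.seconds(30), // window length
                                 Durations.seconds(5))  // slide interval
           .print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```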

In summary: Map/Reduce is great. It has served us (as an industry) for a decade, but it is now time to move on and bring the richer processing capabilities we have elsewhere to solve our big data problems as well.

One last note: I focused on Hadoop in this post even though there are several other platforms and tools around. I think that regardless of whether Hadoop is the best platform, it is the one becoming the de facto standard for big data (remember Betamax vs. VHS?).

One really, really last note: if you’ve read up to here, you are a developer living in Israel, and you happen to be looking for a job, I am looking for another developer to join my Technology Research team @ Amdocs. If you’re interested, drop me a note at arnon.rotemgaloz at amdocs dot com or via my Twitter/LinkedIn profiles.

* Especially in regard to analytical queries; operational SQL on Hadoop is also happening, with efforts like Phoenix, IBM’s BigSQL, and Splice Machine, but that’s another story!

Illustration idea found in James Mickens’s talk at Monitorama 2014 (which is, by the way, a really funny presentation; go watch it). Oh yeah… and Pulp Fiction!

Reference: Is there a future for Map/Reduce? from our JCG partner Arnon Rotem-Gal-Oz at the Cirrus Minor blog.