Google’s Jeffrey Dean and Sanjay Ghemawat filed the patent request and published the map/reduce paper 10 year ago (2004). According to WikiPedia Doug Cutting and Mike Cafarella created Hadoop, with its own implementation of Map/Reduce, one year later at Yahoo – both these implementations were done for the same purpose – batch indexing of the web.
Back than, the web began its “web 2.0″ transition, pages became more dynamic , people began to create more content – so an efficient way to reprocess and build the web index was needed and map/reduce was it. Web Indexing was a great fit for map/reduce since the initial processing of each source (web page) is completely independent from any other – i.e. a very convenient map phase and you need to combine the results to build the reverse index. That said, even the core google algorithm – the famous pagerank is iterative (so less appropriate for map/reduce), not to mention that as the internet got bigger and the updates became more and more frequent map/reduce wasn’t enough. Again Google (who seem to be consistently few years ahead of the industry) began coming up with alternatives like Google Percolator or Google Dremel (both papers were published in 2010, Percolator was introduced at that year, and dremel has been used in Google since 2006).
So now, it is 2014, and it is time for the rest of us to catch up with Google and get over Map/Reduce and for multiple reasons:
- end-users’ expectations (who hear “big data” but interpret that as “fast data”)
- iterative problems like graph algorithms which are inefficient as you need to load and reload the data each iteration
- continuous ingestion of data (increments coming on as small batches or streams of events) – where joining to existing data can be expensive
- real-time problems – both queries and processing.
In my opinion, Map/Reduce is an idea whose time has come and gone – it won’t die in a day or a year, there is still a lot of working systems that use it and the alternatives are still maturing. I do think, however, that if you need to write or implement something new that would build on map/reduce – you should use other option or at the very least carefully consider them.
So how is this change going to happen ? Luckily, Hadoop has recently adopted YARN (you can see my presentation on it here), which opens up the possibilities to go beyond map/reduce without changing everything … even though in effect, a lot will change. Note that some of the new options do have migration paths and also we still retain the access to all that “big data” we have in Hadoopm as well as the extended reuse of some of the ecosystem.
The first type of effort to replace map/reduce is to actually subsume it by offering more flexible batch. After all saying Map/reduce is not relevant, deosn’t mean that batch processing is not relevant. It does mean that there’s a need to more complex processes. There are two main candidates here Tez and Spark where Tez offers a nice migration path as it is replacing map/reduce as the execution engine for both Pig and Hive and Spark has a compelling offer by combining Batch and Stream processing (more on this later) in a single engine.
The second type of effort or processing capability that will help kill map/reduce is MPP databases on Hadoop. Like the “flexible batch” approach mentioned above, this is replacing a functionality that map/reduce was used for – unleashing the data already processed and stored in Hadoop. The idea here is twofold:
- To provide fast query capabilities* – by using specialized columnar data format and database engines deployed as daemons on the cluster
- To provide rich query capabilities – by supporting more and more of the SQL standard and enriching it with analytics capabilities (e.g. via MADlib).
Efforts in this arena include Impala from Cloudera, Hawq from Pivotal (which is essentially greenplum over HDFS), startups like Hadapt or even Actian trying to leverage their ParAccel acquisition with the recently announced Actian Vector . Hive is somewhere in the middle relying on Tez on one hand and using vectorization and columnar format (Orc) on the other.
The Third type of processing that will help dethrone Map/Reduce is Stream processing. Unlike the two previous types of effort this is covering a ground the map/reduce can’t cover, even inefficiently. Stream processing is about handling continuous flow of new data (e.g. events) and processing them (enriching, aggregating, etc.) them in seconds or less. The two major contenders in the Hadoop arena seem to be Spark Streaming and Storm though, of course, there are several other commercial and open source platforms that handle this type of processing as well.
In summary – Map/Reduce is great. It has served us (as an industry) for a decade but it is now time to move on and bring the richer processing capabilities we have elsewhere to solve our big data problems as well.
Last note – I focused on Hadoop in this post even thought there are several other platforms and tools around. I think that regardless if Hadoop is the best platform it is the one becoming the de-facto standard for big data (remember betamax vs VHS?).
One really, really last note – if you read up to here, and you are a developer living in Israel, and you happen to be looking for a job – I am looking for another developer to join my Technology Research team @ Amdocs. If you’re interested drop me a note: arnon.rotemgaloz at amdocs dot com or via my twitter/linkedin profiles.
*esp. in regard to analytical queries – operational SQL on hadoop with efforts like Phoenix ,IBM’s BigSQL or Splice Machine are also happening but that’s another story!
illustration idea found in James Mickens’s talk in Monitorama 2014 – (which is, by the way, a really funny presentation – go watch it) -ohh yeah… and pulp fiction!