The problems in Hadoop – When does it fail to deliver?

Hadoop is a great piece of software. It is not original but that certainly does not take away its glory. It builds on parallel processing, a concept that’s been around for decades. Although conceptually unoriginal, Hadoop shows the power of being free and open (as in beer!) and most of all shows about what usability is all about. It succeeded where most other parallel processing frameworks failed. So, now you know that I’m not a hater.

On the contrary, I think Hadoop is amazing. But, it does not justify some blatant failures on the part of Hadoop, may it be architectural, conceptual or even documentation wise. Hadoop’s popularity should not shield it from the need to re-enginer and re-work problems in the Hadoop implementation. The point below are based on months of exploring and hacking around Hadoop. Do dig in.

  1. Did I hear someone say “Data Locality”?
  2. Hadoop harps over and over again on data locality. In some workshops conducted by Hadoop milkers, they just went on and on about this. They say whenever possible, Hadoop will attempt to start a task on a block of data that is stored locally on that node via HDFS. This sounds like a super feature, doesn’t it? It saves so much of bandwidth without having to transfer TBs of data, right? Hellll, no. It does not. What this means is that first you have to figure out a way of getting data into HDFS, the Hadoop Distributed File System. This is non trivial, unless you live in the last decade and all your data exists as files. Assuming that you do, let’s transfer the TBs of data over to HDFS. Now, it will start doing it’s whole “data locality” thing. Ermm, OK. Am I hit by a wave of brilliance or isn’t it what’s is supposed to do anyway? Let’s get our facts straight. To use Hadoop, our problem should be able to execute in parallel. If the problem or a at least a sub-problem can’t be parallelized it won’t gain much out of Hadoop. This means the task algorithm is independent of any specific part of the data it processes. Further simplifying this would be saying, any task can process any section of the data. So, doesn’t that mean the “data locality” thing is the obvious thing to do? Why, would the Hadoop developers even write some code that would make a task process data in another node unless something goes horribly wrong. The feature would be if it was doing otherwise! If a task has finished operating on the node’s local data and then would transfer data from another node and process this data, that would be a worthy feature of the conundrum. At least that would be worthy of noise.

  3. Would you please put everything back into files
  4. Do you have nicely structured data in databases? Maybe, you became a bit fancy and used the latest and greatest NoSQL data store? Now let me write down what you are thinking. “OK, let’s get some Hadoop jobs to run on this, cause I want to find all this hidden gold mines in my data, that will get me a front page of Forbes.” I hear you. Let’s get some Hadoop jobs rolling. But wait! What the …..? Why are all the samples in text files. A plethora of examples using CSV files, tab delimited files, space delimited files, and all other kind of neat files. Why is everyone going back a few decades and using files again? Haven’t all these guys heard of DBs and all that fancy stuff. It seems that you were too early an adopter of Data Stores. Files are the heroes of the Hadoop world. If you want to use Hadoop quickly and easily, the best path for you right is to export your data neatly into files and run all those snazzy word count samples (Pun intended!). Because without files Hadoop can’t do all that cool “data locality” shit. Everything has to be in HDFS first. So, what would you do to analyze your data in the hypothetical FUHadoopDB? First of all, implement about 10+ classes necessary to split and transfer data into the Hadoop nodes and run your tasks. Hadoop needs to know how to get data from FUHadoopDB, so let’s assume this is acceptable. Now, if you don’t store it in HDFS, you won’t get the data locality shit. If this is the case, when the task runs, they themselves will have to pull data from the FUHadoopDB to process the data. But, if you want the snazzy data locality shit you need to pull data from FUHadoopDB and store it in HDFS. You will not incur the penalty of pulling data while the tasks are running, but you pay it at the preparation stage of the job, in the form of transferring the data into HDFS. Oh and did I mention the additional disk space you would need to store the same data in HDFS. I wanted to save that disk space, so I chose to make my tasks pull data while running the tasks. The choice is yours.

  5. Java is OS independent, isn’t it?
  6. Java has its flaws but for the most part it runs smoothly on most OSs. Even if there are some OS issues, it can be ironed out easily. The Hadoop folks have issued document mostly based on Linux environments. They say Windows is supported, but ignored those ignorant people by not providing adequate documentation. Windows didn’t even make it to the recommended production environments. It can be used as a development platform, but then you will have to deploy it on Linux. I’m certainly not a windows fan. But if I write a Java program, I’d bother to make it run on Windows. If not, why the hell are you using Java? Why the trouble of coming up with freaking bytecode? Oh, the sleepless nights of all those good people who came up with byte code and JVMs and what not have gone to waste.

  7. CS 201: Object Oriented Programming
  8. If you are trying to integrate Hadoop into your platform, think again. Let me take the liberty of typing your thoughts. “Let’s just extend a few interfaces and plugin my authentication mechanism. It should be easy enough. I mean these guys designed the world’s greatest software that will end world hunger.”. I hear you again. If you are planning to do this, don’t. It’s like OOP anti patterns 101 in there. So many places that would say “if (kerberos)” and execute some security specific function. One of my colleagues went through this pain, and finally decided to that it’s easier to write keberos based authentication for his software and then make it work with Hadoop. With great power comes great responsibility. Hadoop fails to fulfil this responsibility.

Even with these issues, Hadoop’s popularity seems to be catching significant attention, and its rightfully deserved. Its ability to commodotize big data analytics should be exalted. But it’s my opinion that it got way too popular way too fast. The Hadoop community needs to have another go at revamping this great piece of software.

Reference: The problems in Hadoop – When does it fail to deliver? from our JCG partner Mackie Mathew at the dev_religion blog

Related Whitepaper:

Hadoop Illuminated

Gentle Introduction of Hadoop and Big Data!

This Hadoop book was written with following goals and principles: Make Hadoop accessible to a wider audience -- not just the highly technical crowd. There are a few unique chapters that you won't find in other Hadoop books, for example: Hadoop use cases, Hadoop distributions rundown, BI Tools feature matrix.

Get it Now!  

One Response to "The problems in Hadoop – When does it fail to deliver?"

  1. Joydeep Sen Sarma says:

    Hadoop can use databases directly as input. There are numerous adapters that allow Hadoop jobs to read/write data from distributed databases like MongoDB and HBase. The same can be done against a regular SQL database except that they would typically tip over under that kind of parallel workload.

    What you cite as a weakness – is actually a sign of strength. That the software was built abstract enough to allow all these kinds of input adapters to be written.

Leave a Reply


1 + one =



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.

Sign up for our Newsletter

20,709 insiders are already enjoying weekly updates and complimentary whitepapers! Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

As an extra bonus, by joining you will get our brand new e-books, published by Java Code Geeks and their JCG partners for your reading pleasure! Enter your info and stay on top of things,

  • Fresh trends
  • Cases and examples
  • Research and insights
  • Two complimentary e-books