The problems in Hadoop – When does it fail to deliver?

Mackie MathewMarch 2nd, 2012Last Updated: October 21st, 2012

1 174 5 minutes read

Hadoop is a great piece of software. It is not original but that certainly does not take away its glory. It builds on parallel processing, a concept that’s been around for decades. Although conceptually unoriginal, Hadoop shows the power of being free and open (as in beer!) and most of all shows about what usability is all about. It succeeded where most other parallel processing frameworks failed. So, now you know that I’m not a hater.

On the contrary, I think Hadoop is amazing. But, it does not justify some blatant failures on the part of Hadoop, may it be architectural, conceptual or even documentation wise. Hadoop’s popularity should not shield it from the need to re-enginer and re-work problems in the Hadoop implementation. The point below are based on months of exploring and hacking around Hadoop. Do dig in.

Did I hear someone say “Data Locality”?

Hadoop harps over and over again on data locality. In some workshops conducted by Hadoop milkers, they just went on and on about this. They say whenever possible, Hadoop will attempt to start a task on a block of data that is stored locally on that node via HDFS. This sounds like a super feature, doesn’t it? It saves so much of bandwidth without having to transfer TBs of data, right? Hellll, no. It does not. What this means is that first you have to figure out a way of getting data into HDFS, the Hadoop Distributed File System. This is non trivial, unless you live in the last decade and all your data exists as files. Assuming that you do, let’s transfer the TBs of data over to HDFS. Now, it will start doing it’s whole “data locality” thing. Ermm, OK. Am I hit by a wave of brilliance or isn’t it what’s is supposed to do anyway? Let’s get our facts straight. To use Hadoop, our problem should be able to execute in parallel. If the problem or a at least a sub-problem can’t be parallelized it won’t gain much out of Hadoop. This means the task algorithm is independent of any specific part of the data it processes. Further simplifying this would be saying, any task can process any section of the data. So, doesn’t that mean the “data locality” thing is the obvious thing to do? Why, would the Hadoop developers even write some code that would make a task process data in another node unless something goes horribly wrong. The feature would be if it was doing otherwise! If a task has finished operating on the node’s local data and then would transfer data from another node and process this data, that would be a worthy feature of the conundrum. At least that would be worthy of noise.

Would you please put everything back into files

Do you have nicely structured data in databases? Maybe, you became a bit fancy and used the latest and greatest NoSQL data store? Now let me write down what you are thinking. “OK, let’s get some Hadoop jobs to run on this, cause I want to find all this hidden gold mines in my data, that will get me a front page of Forbes.” I hear you. Let’s get some Hadoop jobs rolling. But wait! What the …..? Why are all the samples in text files. A plethora of examples using CSV files, tab delimited files, space delimited files, and all other kind of neat files. Why is everyone going back a few decades and using files again? Haven’t all these guys heard of DBs and all that fancy stuff. It seems that you were too early an adopter of Data Stores. Files are the heroes of the Hadoop world. If you want to use Hadoop quickly and easily, the best path for you right is to export your data neatly into files and run all those snazzy word count samples (Pun intended!). Because without files Hadoop can’t do all that cool “data locality” shit. Everything has to be in HDFS first. So, what would you do to analyze your data in the hypothetical FUHadoopDB? First of all, implement about 10+ classes necessary to split and transfer data into the Hadoop nodes and run your tasks. Hadoop needs to know how to get data from FUHadoopDB, so let’s assume this is acceptable. Now, if you don’t store it in HDFS, you won’t get the data locality shit. If this is the case, when the task runs, they themselves will have to pull data from the FUHadoopDB to process the data. But, if you want the snazzy data locality shit you need to pull data from FUHadoopDB and store it in HDFS. You will not incur the penalty of pulling data while the tasks are running, but you pay it at the preparation stage of the job, in the form of transferring the data into HDFS. Oh and did I mention the additional disk space you would need to store the same data in HDFS. I wanted to save that disk space, so I chose to make my tasks pull data while running the tasks. The choice is yours.

Java is OS independent, isn’t it?

Java has its flaws but for the most part it runs smoothly on most OSs. Even if there are some OS issues, it can be ironed out easily. The Hadoop folks have issued document mostly based on Linux environments. They say Windows is supported, but ignored those ignorant people by not providing adequate documentation. Windows didn’t even make it to the recommended production environments. It can be used as a development platform, but then you will have to deploy it on Linux. I’m certainly not a windows fan. But if I write a Java program, I’d bother to make it run on Windows. If not, why the hell are you using Java? Why the trouble of coming up with freaking bytecode? Oh, the sleepless nights of all those good people who came up with byte code and JVMs and what not have gone to waste.

CS 201: Object Oriented Programming

If you are trying to integrate Hadoop into your platform, think again. Let me take the liberty of typing your thoughts. “Let’s just extend a few interfaces and plugin my authentication mechanism. It should be easy enough. I mean these guys designed the world’s greatest software that will end world hunger.”. I hear you again. If you are planning to do this, don’t. It’s like OOP anti patterns 101 in there. So many places that would say “if (kerberos)” and execute some security specific function. One of my colleagues went through this pain, and finally decided to that it’s easier to write keberos based authentication for his software and then make it work with Hadoop. With great power comes great responsibility. Hadoop fails to fulfil this responsibility.

Even with these issues, Hadoop’s popularity seems to be catching significant attention, and its rightfully deserved. Its ability to commodotize big data analytics should be exalted. But it’s my opinion that it got way too popular way too fast. The Hadoop community needs to have another go at revamping this great piece of software.

Reference: The problems in Hadoop – When does it fail to deliver? from our JCG partner Mackie Mathew at the dev_religion blog