About Adam Warski

Adam is one of the co-founders of SoftwareMill, a company specialising in delivering customised software solutions. He is also involved in open-source projects, as a founder, lead developer or contributor to: Hibernate Envers, a Hibernate core module, which provides entity versioning/auditing capabilities; ElasticMQ, an SQS-compatible messaging server written in Scala; Veripacks, a tool to specify and verify inter-package dependencies, and others.

Big data: when single node is better than clustered

There’s a lot of hype about “big data” and a general trend to try to apply Hadoop to almost every problem. However, sometimes it turns out that you can get much better results by writing an old-fashioned, but optimised, single-node version of your algorithm.

The specific case I’m writing about is generating recommendations (what items user may like) basing on a data set of 300 million preference values (user-item pairs, what users currently like). Whether you call it “big data” or not is disputable, however it’s big enough for the classic single-node algorithms (e.g. Taste recommenders) to stop taking a reasonable amount of time and memory to complete.

It may seem that the obvious choice then is to go clustered: Mahout contains a Hadoop implementation of a co-occurence based recommendation algorithm. During our test runs, it took about 7 hours on a 10-node small-instance EMR cluster.

Then I read about GraphChi and what you can do with a single laptop! While GraphChi isn’t exactly useable by people other than its authors, it got us inspired to try the same approach with our recommendation problem.

Hence we tried implementing a single-node version of the co-occurence based recommendation algorithm, which stores as much data as possible in-memory and uses persistent maps from MapDB as a fall-back in case memory runs out. Mostly thanks to the dominantly in-memory computations, our optimised version takes about 4 hours to complete on a large EC2 instance.

Thanks to the simple setup, and full control of the code base it’s also now in fact easier to add various enhancements, like dithering, combining with other recommendation sources or incorporating content-based features.

And this approach scales as well: either by adding more RAM, a faster disk (which is used in case there’s not enough memory), e.g. an SSD, or finally by partitioning the data set and running in parallel. However at some point adding more data points stops to be relevant from a recommendation quality point of view.

To sum up: if you have a large data set, not necessarily “big” as in trillions, but too big for classic approaches to digest, consider writing a custom, optimised, in-memory version. It may end up being simpler and faster.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Leave a Reply

1 + six =

Java Code Geeks and all content copyright © 2010-2015, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below: