About Adam Warski

Adam is one of the co-founders of SoftwareMill, a company specialising in delivering customised software solutions. He is also involved in open-source projects, as a founder, lead developer or contributor to: Hibernate Envers, a Hibernate core module, which provides entity versioning/auditing capabilities; ElasticMQ, an SQS-compatible messaging server written in Scala; Veripacks, a tool to specify and verify inter-package dependencies, and others.

Big data: when single node is better than clustered

There’s a lot of hype about “big data” and a general trend to try to apply Hadoop to almost every problem. However, sometimes it turns out that you can get much better results by writing an old-fashioned, but optimised, single-node version of your algorithm.

The specific case I’m writing about is generating recommendations (what items user may like) basing on a data set of 300 million preference values (user-item pairs, what users currently like). Whether you call it “big data” or not is disputable, however it’s big enough for the classic single-node algorithms (e.g. Taste recommenders) to stop taking a reasonable amount of time and memory to complete.

It may seem that the obvious choice then is to go clustered: Mahout contains a Hadoop implementation of a co-occurence based recommendation algorithm. During our test runs, it took about 7 hours on a 10-node small-instance EMR cluster.

Then I read about GraphChi and what you can do with a single laptop! While GraphChi isn’t exactly useable by people other than its authors, it got us inspired to try the same approach with our recommendation problem.

Hence we tried implementing a single-node version of the co-occurence based recommendation algorithm, which stores as much data as possible in-memory and uses persistent maps from MapDB as a fall-back in case memory runs out. Mostly thanks to the dominantly in-memory computations, our optimised version takes about 4 hours to complete on a large EC2 instance.

Thanks to the simple setup, and full control of the code base it’s also now in fact easier to add various enhancements, like dithering, combining with other recommendation sources or incorporating content-based features.

And this approach scales as well: either by adding more RAM, a faster disk (which is used in case there’s not enough memory), e.g. an SSD, or finally by partitioning the data set and running in parallel. However at some point adding more data points stops to be relevant from a recommendation quality point of view.

To sum up: if you have a large data set, not necessarily “big” as in trillions, but too big for classic approaches to digest, consider writing a custom, optimised, in-memory version. It may end up being simpler and faster.
 

Related Whitepaper:

Big Data Basics

An Introduction to Big Data and How It Is Changing Business

Amazingly, 90% of the data in the world today has been created only in the last two years. With the increase of mobile devices, social media networks, and the sharing of digital photos and videos, we are continuing to grow the world's data at an astounding pace. However, big data is more than just the data itself. It is a combination of factors that require a new way of collecting, analyzing, visualizing, and sharing data. These factors are forcing software companies to re-think the ways that they manage and offer their data, from new insights to completely new revenue streams.

Get it Now!  

Leave a Reply


− 2 = seven



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.

Sign up for our Newsletter

20,709 insiders are already enjoying weekly updates and complimentary whitepapers! Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

As an extra bonus, by joining you will get our brand new e-books, published by Java Code Geeks and their JCG partners for your reading pleasure! Enter your info and stay on top of things,

  • Fresh trends
  • Cases and examples
  • Research and insights
  • Two complimentary e-books