Software Development

Big data: when single node is better than clustered

There’s a lot of hype around “big data” and a general trend to apply Hadoop to almost every problem. Sometimes, however, it turns out that you can get much better results by writing an old-fashioned, but optimised, single-node version of your algorithm.

The specific case I’m writing about is generating recommendations (which items a user may like) based on a data set of 300 million preference values (user-item pairs recording what users currently like). Whether you call it “big data” or not is disputable; however, it’s big enough for the classic single-node algorithms (e.g. the Taste recommenders) to no longer complete in a reasonable amount of time and memory.

It may seem that the obvious choice then is to go clustered: Mahout contains a Hadoop implementation of a co-occurrence-based recommendation algorithm. During our test runs, it took about 7 hours on a 10-node small-instance EMR cluster.
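For reference, the distributed job can also be driven programmatically. Below is a rough sketch of such a driver, assuming Mahout’s RecommenderJob with co-occurrence similarity and boolean (like-only) preference data; the S3 paths and option values are placeholders, and the exact option set may differ between Mahout versions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

// Sketch of a driver for Mahout's distributed item-based recommender.
// The paths are hypothetical; the input is expected as userId,itemId lines.
object DistributedRecommendations {
  def main(args: Array[String]): Unit = {
    val jobArgs = Array(
      "--input", "s3://my-bucket/preferences",
      "--output", "s3://my-bucket/recommendations",
      "--booleanData", "true", // we only know what users like, there are no ratings
      "--similarityClassname", "SIMILARITY_COOCCURRENCE",
      "--numRecommendations", "10"
    )
    ToolRunner.run(new Configuration(), new RecommenderJob(), jobArgs)
  }
}
```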

Then I read about GraphChi and what you can do with a single laptop! While GraphChi isn’t exactly usable by people other than its authors, it inspired us to try the same approach with our recommendation problem.

Hence we tried implementing a single-node version of the co-occurrence-based recommendation algorithm, which stores as much data as possible in memory and uses persistent maps from MapDB as a fallback in case memory runs out. Thanks mainly to the predominantly in-memory computation, our optimised version takes about 4 hours to complete on a large EC2 instance.
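To give an idea of the shape of the code, here is a minimal sketch of such a single-node co-occurrence recommender, written from the description above; the names and data structures are illustrative, not our production code. It keeps everything in plain in-memory maps and only notes in comments where MapDB-backed persistent maps would take over.

```scala
import scala.collection.mutable

object CooccurrenceRecommender {

  type UserId = Int
  type ItemId = Int

  def recommend(prefs: Iterator[(UserId, ItemId)], topN: Int): Map[UserId, Seq[ItemId]] = {
    // 1. Group the preference pairs by user: user -> items the user likes.
    val itemsByUser = mutable.Map.empty[UserId, mutable.Set[ItemId]]
    prefs.foreach { case (user, item) =>
      itemsByUser.getOrElseUpdate(user, mutable.Set.empty) += item
    }

    // 2. Count co-occurrences: for each item, how often every other item is
    //    liked by the same user. This is the structure most likely to outgrow
    //    the heap, and hence the candidate for a MapDB-backed persistent map.
    val cooccurrence = mutable.Map.empty[ItemId, mutable.Map[ItemId, Int]]
    for {
      items <- itemsByUser.values
      a <- items
      b <- items if a != b
    } {
      val row = cooccurrence.getOrElseUpdate(a, mutable.Map.empty[ItemId, Int].withDefaultValue(0))
      row(b) += 1
    }

    // 3. Score candidates per user: sum the co-occurrence counts between each
    //    candidate and the items the user already likes, then take the top N.
    itemsByUser.map { case (user, liked) =>
      val scores = mutable.Map.empty[ItemId, Int].withDefaultValue(0)
      for {
        item <- liked
        row <- cooccurrence.get(item)
        (candidate, count) <- row if !liked.contains(candidate)
      } scores(candidate) += count

      val top = scores.toSeq.sortBy { case (_, score) => -score }.take(topN).map(_._1)
      user -> top
    }.toMap
  }
}
```

The co-occurrence counts are the part that can grow far beyond the raw preference data, which is why they are the natural candidate for spilling to a persistent map when RAM runs out.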

Thanks to the simple setup and full control over the code base, it’s also now easier to add various enhancements, such as dithering, combining results with other recommendation sources, or incorporating content-based features.
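As an illustration of one of these, here is a small, hedged sketch of dithering: re-ranking a user’s top-N list by adding Gaussian noise to the log of each item’s rank, so that repeated visits see a slightly shuffled list. The formula and the sigma default are common choices, not values taken from our system.

```scala
import scala.util.Random

object Dithering {
  // Lower score = better position; rank 0 keeps a score of ~0, later ranks get
  // log-spaced scores plus noise, so the head of the list stays mostly stable
  // while the tail gets shuffled more.
  def dither[Item](ranked: Seq[Item], sigma: Double = 0.3, rnd: Random = new Random): Seq[Item] =
    ranked.zipWithIndex
      .map { case (item, rank) => item -> (math.log(rank + 1.0) + rnd.nextGaussian() * sigma) }
      .sortBy { case (_, noisyScore) => noisyScore }
      .map { case (item, _) => item }
}
```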

And this approach scales as well: by adding more RAM, by using a faster disk such as an SSD (the disk is only used when there isn’t enough memory), or finally by partitioning the data set and processing the partitions in parallel. However, at some point adding more data points stops being relevant from a recommendation-quality point of view.
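A rough sketch of the partitioning option, reusing the CooccurrenceRecommender sketch from above: shard the preference pairs by user id and process each shard independently, here with Futures on a single machine (the same split could just as well be spread across machines). Note that each shard only sees its own users, so the co-occurrence counts become an approximation of the global ones.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object PartitionedRecommendations {
  def run(prefs: Seq[(Int, Int)], partitions: Int, topN: Int): Map[Int, Seq[Int]] = {
    // Shard the (userId, itemId) pairs so that each user's data stays together.
    val shards = prefs.groupBy { case (userId, _) => userId % partitions }.values.toSeq

    // Run the single-node recommender on each shard in parallel.
    val futureResults = Future.traverse(shards) { shard =>
      Future(CooccurrenceRecommender.recommend(shard.iterator, topN))
    }

    // Merge the per-shard recommendation maps (users are disjoint across shards).
    Await.result(futureResults, Duration.Inf)
      .foldLeft(Map.empty[Int, Seq[Int]])(_ ++ _)
  }
}
```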

To sum up: if you have a large data set, not necessarily “big” as in trillions, but too big for classic approaches to digest, consider writing a custom, optimised, in-memory version. It may end up being simpler and faster.
 

Adam Warski

Adam is one of the co-founders of SoftwareMill, a company specialising in delivering customised software solutions. He is also involved in open-source projects as a founder, lead developer or contributor to: Hibernate Envers, a Hibernate core module which provides entity versioning/auditing capabilities; ElasticMQ, an SQS-compatible messaging server written in Scala; Veripacks, a tool to specify and verify inter-package dependencies; and others.