Home » Java » Enterprise Java » Amazon Elastic Map Reduce to compute recommendations with Apache Mahout

About Adam Warski

Adam Warski
Adam is one of the co-founders of SoftwareMill, a company specialising in delivering customised software solutions. He is also involved in open-source projects, as a founder, lead developer or contributor to: Hibernate Envers, a Hibernate core module, which provides entity versioning/auditing capabilities; ElasticMQ, an SQS-compatible messaging server written in Scala; Veripacks, a tool to specify and verify inter-package dependencies, and others.

Amazon Elastic Map Reduce to compute recommendations with Apache Mahout

Apache Mahout is a “scalable machine learning library” which, among others, contains implementations of various single-node and distributed recommendation algorithms. In my last blog post I described how to implement an on-line recommender system processing data on a single node. What if the data is too large to fit into memory (>100M preference data points)? Then we have no choice, but to take a look at Mahout’s distributed recommenders implementation!

The distributed recommender is based on Apache Hadoop; it’s a job which takes as input a list of user preferences, computes an item co-occurence matrix, and outputs top-K recommendations for each user. For an introductory blog on how this works and how to run it locally, see for example this blog post.

We can of course run this job on a custom Hadoop cluster, but it’s much faster (and less painful) to just use a pre-configured one, like EMR. However, there’s a slight problem. The latest Hadoop version that is available on EMR is 1.0.3, and it contains jars for Apache Lucene 2.9.4. However, the recommender job depends on Lucene 4.3.0, which results in the following beautiful stack trace:

2013-10-04 11:05:03,921 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.NoSuchMethodError: org.apache.lucene.util.PriorityQueue.<init>(I)V
  at org.apache.mahout.math.hadoop.similarity.cooccurrence.TopElementsQueue.<init>(TopElementsQueue.java:33)
	at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$UnsymmetrifyMapper.
map(RowSimilarityJob.java:405)
	at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$UnsymmetrifyMapper.
map(RowSimilarityJob.java:389)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

How to solve this? Well, we “just” need to update Lucene in the EMR Hadoop installation. We can use a bootstrap action for that. Here are the exact steps:

  1. Download lucene-4.3.0.tgz (e.g. from here) and upload it into a S3 bucket; make the file public.
  2. Upload this script to the bucket as well; call it e.g. update-lucene.sh:
    #!/bin/bash
    cd /home/hadoop
    wget https://s3.amazonaws.com/bucket_name/bucket_path/lucene-4.3.0.tgz
    tar -xzf lucene-4.3.0.tgz
    cd lib
    rm lucene-*.jar
    cd ..
    cd lucene-4.3.0
    find . | grep lucene- | grep jar$ | xargs -I {} cp {} ../lib

    This script will be run on the Hadoop nodes and will update the Lucene version. Make sure to change the script and enter the correct bucket name and bucket path, so that it points to the public Lucene archive.

  3. mahout-core-0.8-job.jar to the bucket
  4. Finally, we need to upload the input data into S3. Output data will be saved on S3 as well.
  5. Now we can start setting up the EMR job flow. Go to the EMR page on Amazon’s console, and start creating a new job flow. We’ll be using the “Amazon Distribution” Hadoop version and using a “Custom JAR” as the job type.

    2013-10-15_1230-1024x672

  6. The “JAR location” must point to the place where we’ve uploaded the Mahout jar, e.g. s3n://bucket_name/bucket_path/mahout-0.8-job.jar (make sure to change this to point to the real bucket!). As for the jar arguments, we’ll be running the RecommenderJob and using the log-likelihood similarity:
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob 
    --booleanData 
    --similarityClassname SIMILARITY_LOGLIKELIHOOD 
    --output s3n://bucket_name/output 
    --input s3n://bucket_name/input.dat

    That’s also the place to specify where the input data on S3 is, and where output should be written.

    2013-10-15_1232-1024x405

  7. Then we can choose how many machines we want to use. This depends of course on the size of the input data and how fast you want the results. The main thing to change here is the “core instance group” count. 2 is a reasonable default for testing.

    2013-10-15_1236-1024x532

  8. We can leave the advanced options as-is
  9. Now we get to one of the more important steps: setting up bootstrap actions. We’ll need to setup two:
    • a Memory Intenstive Configuration (otherwise you’ll see an OOM quickly)
    • our custom update-lucene action (the path should point to S3, e.g. s3://bucket_name/bucket_path/update-lucene.sh)

    2013-10-15_1240-1024x622

And that’s it! You can now create and run the job flow, and after a couple of minutes/hours/days you’ll have the results waiting on S3.
 

Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
6. Spring Interview Questions
7. Android UI Design
and many more ....
Email address:

Leave a Reply

Be the First to Comment!

Notify of
avatar
Want to take your Java skills to the next level?
Grab our programming books for FREE!
Here are some of the eBooks you will get:
  • Spring Interview QnA
  • Multithreading & Concurrency QnA
  • JPA Minibook
  • JVM Troubleshooting Guide
  • Advanced Java
  • Java Interview QnA
  • Java Design Patterns
The Spring Framework Cookbook
  • Learn the best practices of the Spring Framework
  • Build simple, portable, fast and flexible JVM-based systems and applications
  • Explore specific projects like Boot and Batch
The Spring Data Programming Cookbook
  • Learn how to use data access technologies and cloud-based data services
  • Set up the environment and create a basic project
  • Learn how to handle the various modules (e.g. JPA, MongoDB, Redis etc.)
The Selenium Programming Cookbook
  • Kick-start your own projects using this testing framework for web applications
  • Learn JUnit integration and Standalone Server functionality
  • Find out the most popular Interview Questions about the Selenium Framework
The Mockito Programming Cookbook
  • Kick-start your own web projects using this open source testing framework
  • Write simple test cases using the Mockito Framework
  • Learn how to integrate with JUnit, Maven and other frameworks
The JUnit Programming Cookbook
  • Learn basic usage and configuration of JUnit
  • Create multithreaded tests
  • Learn how to integrate with other testing frameworks
The JSF 2.0 Programming Cookbook
  • Build component-based user interfaces for web applications
  • Set up the environment and create a basic project
  • Learn Internationalization and Facelets Templates
The Amazon S3 Tutorial
  • Develop your own Amazon S3 based applications
  • Learn API usage and pricing
  • Get your own projects up and running in minimum time
Java Design Patterns
  • Learn how Design Patterns are implemented and utilized in Java
  • Understand the reasons why patterns are so important
  • Learn when and how to apply each one of them
Java Concurrency Essentials
  • Dive into the magic of concurrency
  • Learn about concepts like atomicity, synchronization and thread safety
  • Learn about testing concurrent applications
The IntelliJ IDEA Handbook
  • Kick-start your own programming projects using IntelliJ IDEA
  • Learn how to setup and install plugins
  • Create UIs with this Java integrated development environment
The Git Tutorial
  • Learn why Git differs from other version control systems
  • Explore Git's usage and best practises
  • Learn branching strategies
The Eclipse IDE Handbook
  • Explore the most widely used Java IDE
  • Learn how to setup and install plugins
  • Built your own projects up and running in minimum time
The Docker Containerization Cookbook
  • Explore the world’s leading software containerization platform
  • Learn how to wrap a piece of software in a complete filesystem
  • Learn how to use DNS and various commands
Developing Modern Applications With Scala
  • Develop modern Scala applications
  • Build SBT and reactive applications
  • Learn about testing and database access
The Apache Tomcat Cookbook
  • Explore Apache Tomcat open-source web server
  • Learn about installation, configuration, logging and clustering
  • Kick-start your own web projects using Apache Tomcat
The Apache Maven Cookbook
  • Explore the Apache Maven build automation tool
  • Learn about Maven's project structure and configuration
  • Learn about Maven's dependency management and plug-ins
The Apache Hadoop Cookbook
  • Explore the Apache Hadoop open-source software framework
  • Learn distributed caching and streaming
  • Kick-start your own web projects using Apache Hadoop
The Android Programming Cookbook
  • Explore the Android mobile operating system
  • Learn about services and page views
  • Learn about Google Maps and Bluetooth functionality
The Elasticsearch Tutorial
  • Explore the Elasticsearch search engine
  • Develop your own Elasticsearch based applications
  • Learn operations, Java API Integration and reporting
Amazon DynamoDB Tutorial
  • Develop your own Amazon DynamoDB based applications
  • Learn DynamoDB Concepts and Best Practices
  • Get your own projects up and running in minimum time
Java NIO Programming Cookbook
  • Learn features for intensive I/O operations
  • Follow a series of tutorials on Java NIO examples
  • Get knowledge on Java Nio Socket and Asynchronous Channels
JBoss Drools Cookbook
  • Explore Drools business rule management system
  • Follow a series of tutorials on Drools examples
  • Get knowledge on business rules for a shopping domain model
Vaadin Programming Cookbook
  • Explore Vaadin web framework for rich Internet applications
  • Learn the Architecture and Best Practices
  • Get knowledge on Data Binding and Custom Components
Groovy Programming Cookbook
  • Explore Apache Groovy object-oriented programming language
  • Create sample applications and explore interview questions
  • Get knowledge on Callback functionality and various widgets
GWT Programming Cookbook
  • Explore the open source Google Web Toolkit
  • Create sample applications and explore interview questions
  • Create and maintain complex JavaScript front-end applications in Java
Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
and many more ....
Email address:
Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
and many more ....
Email address:
Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
6. Spring Interview Questions
7. Android UI Design
and many more ....
Email address:
Want to be a DynamoDB Master ?
Subscribe to our newsletter and download the Amazon DynamoDB Tutorial right now!
In order to help you master this Amazon NoSQL database service, we have compiled a kick-ass guide with all the major DynamoDB features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Programming Interview Coming Up?
Subscribe to our newsletter and download the Ultimate Multithreading and Concurrency interview questions and answers collection right now!
In order to get you prepared for your next Programming Interview, we have compiled a huge list of relevant Questions and their respective Answers. Besides studying them online you may download the eBook in PDF format!
Email address:
Java Interview Coming Up?
Subscribe to our newsletter and download the Ultimate Spring interview questions and answers collection right now!
In order to get you prepared for your next Java Interview, we have compiled a huge list of relevant Questions and their respective Answers. Besides studying them online you may download the eBook in PDF format!
Email address:
Java Interview Coming Up?
Subscribe to our newsletter and download the Ultimate Java interview questions and answers collection right now!
In order to get you prepared for your next Java Interview, we have compiled a huge list of relevant Questions and their respective Answers. Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Java NIO Master ?
Subscribe to our newsletter and download the Java NIO Programming Cookbook right now!
In order to help you master Java NIO Library, we have compiled a kick-ass guide with all the major Java NIO features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Drools Master ?
Subscribe to our newsletter and download the JBoss Drools Cookbook right now!
In order to help you master Drools Business Rule Management System, we have compiled a kick-ass guide with all the major Drools features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be an iText Master ?
Subscribe to our newsletter and download the iText Tutorial right now!
In order to help you master iText Library, we have compiled a kick-ass guide with all the major iText features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be an Elasticsearch Master ?
Subscribe to our newsletter and download the Elasticsearch Tutorial right now!
In order to help you master Elasticsearch search engine, we have compiled a kick-ass guide with all the major Elasticsearch features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Scala Master ?
Subscribe to our newsletter and download the Scala Cookbook right now!
In order to help you master Scala, we have compiled a kick-ass guide with all the basic concepts! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a JUnit Master ?
Subscribe to our newsletter and download the JUnit Programming Cookbook right now!
In order to help you master unit testing with JUnit, we have compiled a kick-ass guide with all the major JUnit features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Amazon Web Services ?
Subscribe to our newsletter and download the Amazon S3 Tutorial right now!
In order to help you master the leading Web Services platform, we have compiled a kick-ass guide with all its major features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Spring Framework ?
Subscribe to our newsletter and download the Spring Framework Cookbook right now!
In order to help you master the leading and innovative Java framework, we have compiled a kick-ass guide with all its major features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Eclipse IDE ?
Subscribe to our newsletter and download the Eclipse IDE Handbook right now!
In order to help you master Eclipse, we have compiled a kick-ass guide with all the basic features of the popular IDE! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master IntelliJ IDEA ?
Subscribe to our newsletter and download the IntelliJ IDEA Handbook right now!
In order to help you master IntelliJ IDEA, we have compiled a kick-ass guide with all the basic features of the popular IDE! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Docker ?
Subscribe to our newsletter and download the Docker Containerization Cookbook right now!
In order to help you master Docker, we have compiled a kick-ass guide with all the basic concepts of the Docker container system! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to create a kick-ass Android App ?
Subscribe to our newsletter and download the Android Programming Cookbook right now!
With this book, you will delve into the fundamentals of Android programming. You will understand user input, views and layouts. Furthermore, you will learn how to communicate over Bluetooth and also leverage Google Maps into your application!
Email address:
Want to be a GIT Master ?
Subscribe to our newsletter and download the GIT Tutorial eBook right now!
In order to help you master GIT, we have compiled a kick-ass guide with all the basic concepts of the GIT version control system! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Hadoop Master ?
Subscribe to our newsletter and download the Apache Hadoop Cookbook right now!
In order to help you master Apache Hadoop, we have compiled a kick-ass guide with all the basic concepts of a Hadoop cluster! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Spring Data ?
Subscribe to our newsletter and download the Spring Data Ultimate Guide right now!
In order to help you master Spring Data, we have compiled a kick-ass guide with all the major features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to create a kick-ass Android App ?
Subscribe to our newsletter and download the Android UI Design mini-book right now!
With this book, you will delve into the fundamentals of Android UI design. You will understand user input, views and layouts, as well as adapters and fragments. Furthermore, you will learn how to add multimedia to an app and also leverage themes and styles!
Email address:
Want to be a Java 8 Ninja ?
Subscribe to our newsletter and download the Java 8 Features Ultimate Guide right now!
In order to get you up to speed with the major Java 8 release, we have compiled a kick-ass guide with all the new features and goodies! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Java Annotations ?
Subscribe to our newsletter and download the Java Annotations Ultimate Guide right now!
In order to help you master the topic of Annotations, we have compiled a kick-ass guide with all the major features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a JUnit Master ?
Subscribe to our newsletter and download the JUnit Ultimate Guide right now!
In order to help you master unit testing with JUnit, we have compiled a kick-ass guide with all the major JUnit features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Java Abstraction ?
Subscribe to our newsletter and download the Abstraction in Java Ultimate Guide right now!
In order to help you master the topic of Abstraction, we have compiled a kick-ass guide with all the major features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to master Java Reflection ?
Subscribe to our newsletter and download the Java Reflection Ultimate Guide right now!
In order to help you master the topic of Reflection, we have compiled a kick-ass guide with all the major features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a JMeter Master ?
Subscribe to our newsletter and download the JMeter Ultimate Guide right now!
In order to help you master load testing with JMeter, we have compiled a kick-ass guide with all the major JMeter features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Servlets Master ?
Subscribe to our newsletter and download the Java Servlet Ultimate Guide right now!
In order to help you master programming with Java Servlets, we have compiled a kick-ass guide with all the major servlet API uses and showcases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a JAXB Master ?
Subscribe to our newsletter and download the JAXB Ultimate Guide right now!
In order to help you master XML Binding with JAXB, we have compiled a kick-ass guide with all the major JAXB features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a JDBC Master ?
Subscribe to our newsletter and download the JDBC Ultimate Guide right now!
In order to help you master database programming with JDBC, we have compiled a kick-ass guide with all the major JDBC features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a JPA Master ?
Subscribe to our newsletter and download the JPA Ultimate Guide right now!
In order to help you master programming with JPA, we have compiled a kick-ass guide with all the major JPA features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Hibernate Master ?
Subscribe to our newsletter and download the Hibernate Ultimate Guide right now!
In order to help you master JPA and database programming with Hibernate, we have compiled a kick-ass guide with all the major Hibernate features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a JSF Master ?
Subscribe to our newsletter and download the JSF 2.0 Programming Cookbook right now!
In order to get you prepared for your JSF development needs, we have compiled numerous recipes to help you kick-start your projects. Besides reading them online you may download the eBook in PDF format!
Email address:
Want to be a Java Master ?
Subscribe to our newsletter and download the Advanced Java Guide right now!
In order to help you master the Java programming language, we have compiled a kick-ass guide with all the must-know advanced Java features! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Java Master ?
Subscribe to our newsletter and download the Java Design Patterns right now!
In order to help you master the Java programming language, we have compiled a kick-ass guide with all the must-know Design Patterns for Java! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be a Hadoop Master ?
Subscribe to our newsletter and download the Hadoop Tutorial right now!
In order to help you master Apache Hadoop, we have compiled a kick-ass guide with all the basic concepts of a Hadoop cluster! Besides studying them online you may download the eBook in PDF format!
Email address:
Want to be an Elastic Beanstalk Master ?
Subscribe to our newsletter and download the Amazon Elastic Beanstalk Tutorial right now!
In order to help you master AWS Elastic Beanstalk, we have compiled a kick-ass guide with all the major Elastic Beanstalk features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Amazon Elastic Beanstalk Tutorial
  • Develop your own Amazon Elastic Beanstalk based applications
  • Learn Java Integration and Command Line Interfacing
  • Get your own projects up and running in minimum time
Want to take your Java skills to the next level?
Grab our programming books for FREE!
Here are some of the eBooks you will get:
  • Spring Interview QnA
  • Multithreading & Concurrency QnA
  • JPA Minibook
  • JVM Troubleshooting Guide
  • Advanced Java
  • Java Interview QnA
  • Java Design Patterns
Insiders are already enjoying weekly updates and complimentary whitepapers!
Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.
Email address:
Want to be an ActiveMQ Master ?
Subscribe to our newsletter and download the Apache ActiveMQ Cookbook right now!
In order to help you master Apache ActiveMQ JMS, we have compiled a kick-ass guide with all the major ActiveMQ features and use cases! Besides studying them online you may download the eBook in PDF format!
Email address:
Apache ActiveMQ Cookbook
  • Explore Apache ActiveMQ Best Practices
  • Learn ActiveMQ Load Balancing and File Transfer
  • Develop your own Apache ActiveMQ projects