Andrey Redko

About Andrey Redko

Andriy is a well-grounded software developer with more than 12 years of practical experience using Java/Java EE, C#/.NET, C++, Groovy, Ruby, functional programming (Scala), databases (MySQL, PostgreSQL, Oracle) and NoSQL solutions (MongoDB, Redis).

Apache Mahout: Getting started

Recently I got an interesting problem to solve: how to classify text from different sources automatically? Some time ago I read about a project that does this, along with many other text analysis tasks: Apache Mahout. Though it's not a very mature one yet (the current version is 0.4), it's very powerful and scalable. Built on top of another excellent project, Apache Hadoop, it's capable of analyzing huge data sets.

So I did a small project in order to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all the dependencies, so I will start with the POM file.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.acme</groupId>
  <artifactId>mahout</artifactId>
  <version>0.94</version>
  <name>Mahout Examples</name>
  <description>Scalable machine learning library examples</description>
  <packaging>jar</packaging>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <apache.mahout.version>0.4</apache.mahout.version>
  </properties>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <encoding>UTF-8</encoding>
          <source>1.6</source>
          <target>1.6</target>
          <optimize>true</optimize>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>${apache.mahout.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>${apache.mahout.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-utils</artifactId>
      <version>${apache.mahout.version}</version>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.6.0</version>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-jcl</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>

Then I looked into the Apache Mahout examples and the algorithms available for the text classification problem. The simplest and quite accurate one is the Naive Bayes classifier. Here is a code snippet:

package org.acme;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.FileReader;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class Starter {
    public static void main( final String[] args ) {
        // Configure the Naive Bayes trainer and classifier
        final BayesParameters params = new BayesParameters();
        params.setGramSize( 1 );
        params.set( "verbose", "true" );
        params.set( "classifierType", "bayes" );
        params.set( "defaultCat", "OTHER" );
        params.set( "encoding", "UTF-8" );
        params.set( "alpha_i", "1.0" );
        params.set( "dataSource", "hdfs" );
        params.set( "basePath", "/tmp/output" );

        try {
            // Train the model on the labeled examples under /tmp/input
            final Path input = new Path( "/tmp/input" );
            TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );

            final Algorithm algorithm = new BayesAlgorithm();
            final Datastore datastore = new InMemoryBayesDatastore( params );
            final ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
            classifier.initialize();

            // Classify the file passed as the first command line argument, line by line
            final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
            try {
                String entry;
                while( ( entry = reader.readLine() ) != null ) {
                    final List< String > document = new NGrams( entry,
                                    Integer.parseInt( params.get( "gramSize" ) ) )
                                    .generateNGramsWithoutLabel();

                    final ClassifierResult result = classifier.classifyDocument(
                                     document.toArray( new String[ document.size() ] ),
                                     params.get( "defaultCat" ) );

                    System.out.println( entry + " -> " + result.getLabel() );
                }
            } finally {
                reader.close();
            }
        } catch( final IOException ex ) {
            ex.printStackTrace();
        } catch( final InvalidDatastoreException ex ) {
            ex.printStackTrace();
        }
    }
}
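For reference, the NGrams.generateNGramsWithoutLabel() call above breaks each line into word-level n-grams of the configured size; for a gram size of 1 this amounts to plain whitespace tokenization. A rough standalone illustration of the idea (this is my own sketch, not Mahout's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    // Generate word-level n-grams: for size 1 these are the individual tokens,
    // for size 2 adjacent word pairs, and so on.
    public static List<String> ngrams(final String text, final int size) {
        final String[] tokens = text.trim().split("\\s+");
        final List<String> result = new ArrayList<String>();
        for (int i = 0; i + size <= tokens.length; i++) {
            final StringBuilder gram = new StringBuilder();
            for (int j = 0; j < size; j++) {
                if (j > 0) gram.append(' ');
                gram.append(tokens[i + j]);
            }
            result.add(gram.toString());
        }
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(ngrams("do you sell office", 1)); // [do, you, sell, office]
        System.out.println(ngrams("do you sell office", 2)); // [do you, you sell, sell office]
    }
}
```

Larger gram sizes let the classifier pick up on short phrases instead of isolated words, at the cost of a sparser model.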

There is one important note here: the system must be trained before it can start classifying. To do so, it's necessary to provide examples (the more, the better) of texts for the different categories. These should be simple files where each line starts with the category, separated by a tab from the text itself, e.g.:

SUGGESTION  That's a great suggestion
QUESTION  Do you sell Microsoft Office?
...

The more files you provide, the more precise a classification you will get. All files must be put into the '/tmp/input' folder; they will be processed by Apache Hadoop first. :)
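As a quick sanity check of the format, the training files can also be generated programmatically. A minimal sketch (the file name and categories here are made up for illustration) that writes two examples in the expected tab-separated layout:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class TrainingFileWriter {
    // Each training line: CATEGORY<TAB>sample text.
    public static List<String> sampleLines() {
        return Arrays.asList(
            "SUGGESTION\tThat's a great suggestion",
            "QUESTION\tDo you sell Microsoft Office?");
    }

    public static void main(final String[] args) throws IOException {
        // The trainer reads this layout from every file under the input
        // folder; "training.txt" is just an illustrative name.
        final BufferedWriter writer = new BufferedWriter(new FileWriter("training.txt"));
        try {
            for (final String line : sampleLines()) {
                writer.write(line);
                writer.newLine();
            }
        } finally {
            writer.close();
        }
    }
}
```

The resulting file would then be copied into '/tmp/input' (on HDFS, given the 'dataSource' setting above) before training.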

Reference: Getting started with Apache Mahout from our JCG partner Andrey Redko at the Andriy Redko {devmind}.


6 Responses to "Apache Mahout: Getting started"

  1. Ali says:

    Hi, nice tutorial.
    I am able to run the code. I tested with a sample file which contains
    QUESTION
    SUGGESTION

    series, and I gave a test file consisting of sentences of questions and suggestions without any label.
    In the output directory I get three folders: "trainer-tfIdf", "trainer-thetaNormalizer", "trainer-weights".

    How do I see the output?

    Can you please help?

    • Andriy Redko says:

      Hi Ali,

      Thank you for your comment. The variable 'result' of type 'ClassifierResult' contains the classification (including scores) for a particular text or message. You can print it out on a console or output it to another file. Please note that at the time of writing, the post targeted version 0.4 of Apache Mahout. The current version is 0.7 and unfortunately the two are not compatible at all.

      Please let me know if it’s helpful.
      Thank you.

      Best Regards,
      Andriy Redko

  2. aparnesh gaurav says:

    Thanks for sharing this .

  3. aparnesh gaurav says:

    Hi,

    Q1. Does the above algorithm work on a distributed framework? (Assuming that we are keeping the input file in HDFS.)
    Q2. Is the output folder referred to here in HDFS?
    Q3. I don't see any map-reduce code here, so shall I assume it's only HDFS applied here but no parallel processing, since no map-reduce code is written here?

    Regards,
    Aparnesh


Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.