About Bill Bejeck

Husband, father of 3, passionate about software development.

Calculating A Co-Occurrence Matrix with Hadoop

This post continues with our series of implementing the MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. This time we will be creating a word co-occurrence matrix from a corpus of text. Previous posts in this series are:
  1. Working Through Data-Intensive Text Processing with MapReduce
  2. Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II

A co-occurrence matrix could be described as the tracking of an event, and given a certain window of time or space, what other events seem to occur. For the purposes of this post, our “events” are the individual words found in the text and we will track what other words occur within our “window”, a position relative to the target word. For example, consider the phrase “The quick brown fox jumped over the lazy dog”. With a window value of 2, the co-occurrence for the word “jumped” would be [brown,fox,over,the]. A co-occurrence matrix could be applied to other areas that require investigation into when “this” event occurs, what other events seem to happen at the same time. To build our text co-occurrence matrix, we will be implementing the Pairs and Stripes algorithms found in chapter 3 of Data-Intensive Text Processing with MapReduce. The body of text used to create our co-occurrence matrix is the collective works of William Shakespeare.

Pairs

Implementing the pairs approach is straightforward. For each line passed in when the map function is called, we will split on spaces creating a String Array. The next step would be to construct two loops. The outer loop will iterate over each word in the array and the inner loop will iterate over the “neighbors” of the current word. The number of iterations for the inner loop is dictated by the size of our “window” to capture neighbors of the current word. At the bottom of each iteration in the inner loop, we will emit a WordPair object (consisting of the current word on the left and the neighbor word on the right) as the key, and a count of one as the value. Here is the code for the Pairs implementation:

public class PairsOccurrenceMapper extends Mapper<LongWritable, Text, WordPair, IntWritable> {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int neighbors = context.getConfiguration().getInt('neighbors', 2);
        String[] tokens = value.toString().split('\\s+');
        if (tokens.length > 1) {
          for (int i = 0; i < tokens.length; i++) {
              wordPair.setWord(tokens[i]);

             int start = (i - neighbors < 0) ? 0 : i - neighbors;
             int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
              for (int j = start; j <= end; j++) {
                  if (j == i) continue;
                   wordPair.setNeighbor(tokens[j]);
                   context.write(wordPair, ONE);
              }
          }
      }
  }
}

The Reducer for the Pairs implementation will simply sum all of the numbers for the given WordPair key:

public class PairsReducer extends Reducer<WordPair,IntWritable,WordPair,IntWritable> {
    private IntWritable totalCount = new IntWritable();
    @Override
    protected void reduce(WordPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
             count += value.get();
        }
        totalCount.set(count);
        context.write(key,totalCount);
    }
}

 
Stripes

Implementing the stripes approach to co-occurrence is equally straightforward. The approach is the same, but all of the “neighbor” words are collected in a HashMap with the neighbor word as the key and an integer count as the value. When all of the values have been collected for a given word (the bottom of the outer loop), the word and the hashmap are emitted. Here is the code for our Stripes implementation:

public class StripesOccurrenceMapper extends Mapper<LongWritable,Text,Text,MapWritable> {
  private MapWritable occurrenceMap = new MapWritable();
  private Text word = new Text();

  @Override
 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
   int neighbors = context.getConfiguration().getInt('neighbors', 2);
   String[] tokens = value.toString().split('\\s+');
   if (tokens.length > 1) {
      for (int i = 0; i < tokens.length; i++) {
          word.set(tokens[i]);
          occurrenceMap.clear();

          int start = (i - neighbors < 0) ? 0 : i - neighbors;
          int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
           for (int j = start; j <= end; j++) {
                if (j == i) continue;
                Text neighbor = new Text(tokens[j]);
                if(occurrenceMap.containsKey(neighbor)){
                   IntWritable count = (IntWritable)occurrenceMap.get(neighbor);
                   count.set(count.get()+1);
                }else{
                   occurrenceMap.put(neighbor,new IntWritable(1));
                }
           }
          context.write(word,occurrenceMap);
     }
   }
  }
}

The Reducer for the Stripes approach is a little more involved due to the fact we will need to iterate over a collection of maps, then for each map, iterate over all of the values in the map:

public class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    private MapWritable incrementingMap = new MapWritable();

    @Override
    protected void reduce(Text key, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException {
        incrementingMap.clear();
        for (MapWritable value : values) {
            addAll(value);
        }
        context.write(key, incrementingMap);
    }

    private void addAll(MapWritable mapWritable) {
        Set<Writable> keys = mapWritable.keySet();
        for (Writable key : keys) {
            IntWritable fromCount = (IntWritable) mapWritable.get(key);
            if (incrementingMap.containsKey(key)) {
                IntWritable count = (IntWritable) incrementingMap.get(key);
                count.set(count.get() + fromCount.get());
            } else {
                incrementingMap.put(key, fromCount);
            }
        }
    }
}

 
Conclusion

When looking at the two approaches, we can see that the Pairs algorithm will generate more key value pairs compared to the Stripes algorithm. Also, the Pairs algorithm captures each individual co-occurrence event while the Stripes algorithm captures all co-occurrences for a given event. Both the Pairs and Stripes implementations would benefit from using a Combiner. Because both produce commutative and associative results, we can simply re-use each Mapper’s Reducer as the Combiner. As stated before, creating a co-occurrence matrix has applicability to other fields beyond text processing, and represent useful MapReduce algorithms to have in one’s arsenal. Thanks for your time.

Resources

 
Reference: Calculating A Co-Occurrence Matrix with Hadoop from our JCG partner Bill Bejeck at the Random Thoughts On Coding blog.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.

Leave a Reply


1 × = six



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close