Shamik Majumdar

About Shamik Majumdar

I have 14 years of experience in developing and architecting complex enterprise software. I work as a senior architect at Mindspark Interactive Networks Inc., in New York Area.

Keyword extraction and similarity calculation among textual content

Background

Web applications are becoming smarter. Gone are the days when to avail a service from a website, user had to fill up a giant form. Let’s say, you have a website which is for book lovers. Before web 2.0, sites like these used to ask user all kind of questions in a form like age, books they read, types of book they like, language preference, author preference etc. Now days, it is a common practice to ask user to write a paragraph on themselves (profile). In this note, user express some details, but the challenge is, how we extract useful information from such free form text and more over how we find user who have similar interest?

This use case has become so common that every java developer should know some tricks about information retrieval from text. In this article, I shall walk you through one simple yet effective way to do it.

Processes to extract information from text

  1. Filter Words: Read textual content word by word and remove unwanted words. As part of this filtering state, remove all commonly used English words. One can also apply censor rules and remove sexually explicit words or hate speech etc.
  2. Perform Stemming: Words like ‘search’ or ‘searched’ or ‘searching’ which all mean ‘search’. This process of reducing word to its root is called stemming.
  3. Calculate Similarity: After first two steps, we now have a set of keywords that truly represent original text (user profile in this example). We can treat these keywords as set of unique words. To calculate similarity between two user profiles, it would be better if we represent similarity in terms of a number which represent how similar two contents are in a 0 (not similar) to 1 (completely similar) scale. One way to achieve that is to calculate Jaccard Index which is used to calculate similarity or diversity of sets.
Jaccard index J(A,B) = |A∩B|/| A⋃B|

where A and B are sets and J(A,B) lies between 0 to 1.

Implementation Details

Based on the points outlined above, one can develop a library to extract keywords and calculate similarity. However, Apache Lucene is a java library that has plenty of API to perform keyword extraction. Here is a brief description of different important areas of this API.

Tokenizer

Tokenizer splits your text into chunks. There are different tokenizers and depending upon the tokenizer you use, you can get different output token streams (sequences of chunks of text).

Stemmers

Stemmers are used to get the base of a word in question. It heavily depends on the language used. Words like ‘seaerch’, ’searched’, ’searching’ etc comes from the root word ‘search’. In information retrieval field, it is very useful if we get to the root words as that reduce noise and with fewer words we can still carry the intent of the document. One of the famous stemmer algorithm is Porter Stemmer algo.

TokenFilter

Tokenfilter can be applied on the tokenizer output to normalize or filter tokens. Like LowerCaseFilter which normalizes token text to lower case or stopfilter that suppress most frequent and almost useless words. Again, it heavily depends on language. For English these stop words are “a”, “the”, “I”, “be”, “have”, etc.

Analyzer

An analyzer is the higher level class that uses tokenizers to produce tokens from input, uses stemmers to reduce the token, uses filters to suppress/normalize the tokens. This is the class that glue the other three main components. Different Analyzers use different combinations of tokenizers and filters. For example, StandardAnalyzer uses StandardTokenizer to extract tokens from string, pass that through LowerCaseFilter to convert tokens into lower case and then pass the stream of tokens through StopFilter to remove most commonly used English words. It does not perform stemming by default. One can develop a custom analyzer by mixing and matching tokenizer and tokenfilters according the need.

Code walk through

Source code of this example can be accessed from https://github.com/shamikm/similarity . Below is a highlight of the steps:

  1. Create a custom analyzer that perform the following steps:
    • Tokenize English words based on space, comma, period etc. Use StandardTokenizer for this task.
    • Convert the tokens into lowercase using LowerCaseFilter
    • Stop common English words using StopFilter
    • Stem English words using Porter Stemmer

    From StemmAnalyzer class:

    @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
          (a)..  final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
                 TokenStream tok = new StandardFilter(matchVersion, src);
          (b)..  tok = new LowerCaseFilter(matchVersion, tok);
          (c)..  tok = new StopFilter(matchVersion, tok, getStopWords());
          (d)..  return new PorterStemFilter(tok);
        }
  2. Once we have set of words, it’s easy to calculate similarity between two sets.

    From JaccardIndexBasedSimilarity class:

    public double calculateSimilarity(String oneContent, String otherContet) {
            Set<String> keyWords1 = keywordGenerator.generateKeyWords(oneContent);
            Set<String> keyWords2 = keywordGenerator.generateKeyWords(otherContet);
            Set<String> denominator = Sets.union(keyWords1,keyWords2);
            Set<String> numerator = Sets.intersection(keyWords1,keyWords2);
    
            return denominator.size()>0? (double)numerator.size()/(double)denominator.size() : 0;
        }

Here is a sample test case to demonstrate how the code works:

@Test
    public void calculateSim(){
        SimilarityCalculator calculator = new JaccardIndexBasedSimilarity();
        Assert.assertEquals(calculator.calculateSimilarity("They Licked the platter clean","Jack Sprat could eat no fat"),0.0);
        //1(lamb) out of 6(littl,lamb,mari,had,go,sure) words are same
        Assert.assertEquals(calculator.calculateSimilarity("Mary had a little lamb", "The lamb was sure to go."), 0.16, 0.02);
        Assert.assertEquals(calculator.calculateSimilarity("Mary had a little lamb","Mary had a little lamb"),1.0);
    }

You can run this process offline and find out how one user profile is similar to any other users in your database and can start recommending users based on what similar users are reading.

Conclusion

Information retrieval from text is a common use case now-a-days. Having a basic knowledge on this critical field is helpful for any developer and in this article, we looked at how Apache Lucene API can be used effectively to extract keyword and calculate similarity among text.

Resources:

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.

Leave a Reply


+ seven = 11



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close