Finding long tail suggestions using Lucene’s new FreeTextSuggester

Lucene’s suggest module offers a number of fun auto-suggest implementations to give a user live search suggestions as they type each character into a search box.

For example, WFSTCompletionLookup compiles all suggestions and their weights into a compact Finite State Transducer, enabling fast prefix lookup for basic suggestions.

AnalyzingSuggester improves on this by using an Analyzer to normalize both the suggestions and the user’s query so that trivial differences in whitespace, casing, stop-words, synonyms, as determined by the analyzer, do not prevent a suggestion from matching.

Finally, AnalyzingInfixSuggester goes further by allowing infix matches so that words inside each suggestion (not just the prefix) can trigger a match. You can see this one action at the Lucene/Solr Jira search application (e.g., try “python”) that I recently created to eat our own dog food. It is also the only suggester implementation so far that supports highlighting (this has proven challenging for the other suggesters).

Yet, a common limitation to all of these suggesters is that they can only suggest from a finite set of previously built suggestions. This may not be a problem if your suggestions are past user queries and you have tons and tons of them (e.g., you are Google). Alternatively, if your universe of suggestions is inherently closed, such as the movie and show titles that Netflix’s search will suggest, or all product names on an e-commerce site, then a closed set of suggestions is appropriate.

N-Gram language models

For everyone else, where a high percentage of the incoming queries fall into the never-seen-before long tail, Lucene’s newest suggester, FreeTextSuggester, can help! It uses the approach described in this Google blog post.

Rather than precisely matching a previous suggestion, it builds up a simple statistical n-gram language model from all suggestions and looks at the last tokens (plus the prefix of whatever final token the user is typing, if present) to predict the most likely next token.

For example, perhaps the user’s query so far is: “flashforge 3d p”, and because flashforge is an uncommon brand of 3D printer, this particular suggestion prefix was never added to the suggester. Yet, “3d printer” was a frequently seen phrase in other contexts (different brands). In this case, FreeTextSuggester will see “3d” and the “p” prefix for the next token and predict printer, even though “flashforge 3d printer” was never explicitly added as a suggestion.

You specify the order (N) of the model when you create the suggester: larger values of N require more data to train properly but can make more accurate predictions. All lower order models are also built, so if you specify N=3, you will get trigrams, bigrams and unigrams, all compiled into a single weighted FST for maximum sharing of the text tokens. Of course, larger N will create much larger FSTs. In practice N=3 is the highest you should go, unless you have tons of both suggestions to train and RAM to hold the resulting FST.

To handle sparse data, where a given context (the N-1 prior words) was not seen frequently enough to make accurate predictions, the suggester uses the stupid backoff language model (yes, this is really its name, and yes, it performs well!).

I expect the best way to use this new FreeTextSuggester will be as a fallback: you would first use one of the existing exact match suggesters, but when those suggesters fail to find any suggestions for a given query, because it’s “unusual” and has crossed over into the long tail, you then fall back to FreeTextSuggester.

Google seems to use such a modal approach to suggestions as well: if you type “flashforge 3d p” you should see something like this, where each suggestion covers your entire query so far (indeed, Google has heard of the flashforge brand of 3d printer!):

suggest1

But then if you keep typing and enter “flashforge 3d printer power u”, the suggestions change: instead of suggesting an entire query, matching everything I have typed, Google instead suggests the last word or two:

suggest2

As usual, this feature is very new and likely to contain exciting bugs! See the Jira issue, LUCENE-5214, for details. If you play with this new suggester please start a discussion on the Lucene’s user list!
 

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.

Leave a Reply


eight − = 4



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close