Three exciting Lucene features in one day

Yesterday was a productive day: suddenly, there are three exciting new features coming to Lucene.

Expressions module

The first feature, committed yesterday, is the new expressions module. It lets you define a dynamic field for sorting, using an arbitrary String expression. There is built-in support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.

For example, you could define a sort field using the expression

sqrt(_score) + ln(popularity)

if you want to offer a blended sort primarily by relevance and boosting by a popularity field.
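To make the arithmetic concrete, here is a minimal plain-Java sketch of what that expression computes and how it reorders hits. It deliberately does not use the expressions module itself; the BlendedSortSketch class, the Hit holder and the sample values are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BlendedSortSketch {

    /** Hypothetical search hit: an id, a relevance score and a popularity count. */
    static final class Hit {
        final String id;
        final double score;
        final long popularity;

        Hit(String id, double score, long popularity) {
            this.id = id;
            this.score = score;
            this.popularity = popularity;
        }
    }

    /** sqrt(_score) + ln(popularity); note that ln(0) is -Infinity, so popularity should be >= 1. */
    static double blend(double score, double popularity) {
        return Math.sqrt(score) + Math.log(popularity);
    }

    /** Returns the hit ids ordered by descending blended value. */
    static List<String> sortByBlend(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> blend(h.score, h.popularity)).reversed());
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("A", 4.0, 1));    // blend = 2.0 + 0.0
        hits.add(new Hit("B", 1.0, 100));  // blend = 1.0 + ln(100), about 5.6
        hits.add(new Hit("C", 9.0, 1));    // blend = 3.0 + 0.0
        System.out.println(sortByBlend(hits)); // prints [B, C, A]
    }
}
```

Because both terms are damped (square root and logarithm), a very popular document can outrank a more relevant one, but only by a bounded margin.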

The code is very easy to use; there are some nice examples in the TestDemoExpressions.java unit test case, and this will be available in Lucene’s next stable release (4.6).

Updateable numeric doc-values fields

The second feature, also committed yesterday, is updateable numeric doc-values fields, letting you change previously indexed numeric values using the new updateNumericDocValue method on IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.

The feature is currently trunk only as we work out a few remaining issues involving a particularly controversial boolean. It also currently does not work on sparse fields, i.e. you can only update a document’s value if that document had already indexed that field in the first place.

Combined, these two features enable powerful use-cases where you want to sort by a blended field that is changing over time. For example, perhaps you measure how often your users click through each document in the search results, and then use that to update the popularity field, which is then used for a blended sort. This way the rankings of the search results change over time as you learn from users which documents are popular and which are not.
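The feedback loop described above can be simulated in a few lines of plain Java. This is only an in-memory sketch under stated assumptions, not Lucene code: ClickFeedbackSketch and its methods are hypothetical, the popularity map stands in for the numeric doc-values field, and the check in recordClick mimics the restriction that only documents which already have a value can be updated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClickFeedbackSketch {

    // Stand-in for the per-document "popularity" numeric field.
    private final Map<String, Long> popularity = new HashMap<>();

    /** Adds a document with an initial popularity (use >= 1 so that ln is defined). */
    void index(String docId, long initialPopularity) {
        popularity.put(docId, initialPopularity);
    }

    /** Bumps popularity by one click; refuses documents that never had a value. */
    boolean recordClick(String docId) {
        if (!popularity.containsKey(docId)) {
            return false;
        }
        popularity.merge(docId, 1L, Long::sum);
        return true;
    }

    private static double blend(double score, long pop) {
        return Math.sqrt(score) + Math.log(pop);
    }

    /** Ranks the scored documents by sqrt(score) + ln(popularity), highest first. */
    List<String> rank(Map<String, Double> scores) {
        List<String> ids = new ArrayList<>();
        for (String id : scores.keySet()) {
            if (popularity.containsKey(id)) {
                ids.add(id);
            }
        }
        ids.sort((a, b) -> Double.compare(
                blend(scores.get(b), popularity.get(b)),
                blend(scores.get(a), popularity.get(a))));
        return ids;
    }

    public static void main(String[] args) {
        ClickFeedbackSketch index = new ClickFeedbackSketch();
        index.index("a", 1);
        index.index("b", 1);

        Map<String, Double> scores = new HashMap<>();
        scores.put("a", 4.0);
        scores.put("b", 1.0);

        System.out.println(index.rank(scores)); // prints [a, b]
        for (int i = 0; i < 9; i++) {
            index.recordClick("b");             // popularity of b goes from 1 to 10
        }
        System.out.println(index.rank(scores)); // prints [b, a]
    }
}
```

The relevance scores never change in this run; only the clicks do, and that alone is enough to flip the ranking.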

Of course such a feature was always possible before, using custom external code, but with both expressions and updateable doc-values now available it becomes trivial to implement!

Free text suggestions

Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is very different from the existing suggesters: rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the “long tail” of possible suggestions based on the one or two previous tokens.

Under the hood, it uses ShingleFilter to create the ngrams, and an FST to store and look up the resulting ngram models. While multiple ngram models are stored compactly in a single FST, the FST can still get quite large; the combined 3-gram, 2-gram and 1-gram model built on the AOL query logs is 19.4 MB (the queries themselves are 25.4 MB). This was inspired by Google’s approach.
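To illustrate the idea of predicting the next token from the one or two previous tokens, here is a toy plain-Java sketch. It is not how FreeTextSuggester is implemented: it uses hash maps of counts instead of ShingleFilter and an FST, and the NgramSuggestSketch class plus its simple "try two tokens of context, then one, then none" fallback are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NgramSuggestSketch {

    // context (zero, one or two previous tokens joined by a space) -> next token -> count
    private final Map<String, Map<String, Integer>> model = new HashMap<>();

    private static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private void count(String context, String next) {
        model.computeIfAbsent(context, k -> new HashMap<>()).merge(next, 1, Integer::sum);
    }

    /** Adds the 1-gram, 2-gram and 3-gram counts of one query to the model. */
    void add(String query) {
        String[] t = tokenize(query);
        for (int i = 0; i < t.length; i++) {
            count("", t[i]);
            if (i >= 1) {
                count(t[i - 1], t[i]);
            }
            if (i >= 2) {
                count(t[i - 2] + " " + t[i - 1], t[i]);
            }
        }
    }

    /** Suggests up to max next tokens, using the longest context that has been seen. */
    List<String> suggest(String typed, int max) {
        String[] t = tokenize(typed);
        for (int n = Math.min(2, t.length); n >= 0; n--) {
            String context = String.join(" ", Arrays.copyOfRange(t, t.length - n, t.length));
            Map<String, Integer> next = model.get(context);
            if (next != null) {
                List<String> out = new ArrayList<>(next.keySet());
                // Most frequent first; ties broken alphabetically for stable output.
                out.sort((a, b) -> {
                    int c = Integer.compare(next.get(b), next.get(a));
                    return c != 0 ? c : a.compareTo(b);
                });
                return new ArrayList<>(out.subList(0, Math.min(max, out.size())));
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        NgramSuggestSketch s = new NgramSuggestSketch();
        s.add("the fast and the furious");
        s.add("the fast and the furious");
        s.add("the fast and the dead");
        s.add("burning man");
        s.add("burning man");
        System.out.println(s.suggest("the fast and the ", 2)); // prints [furious, dead]
        System.out.println(s.suggest("xyz burning", 5));       // prints [man]
    }
}
```

In the second call the two-token context "xyz burning" was never seen, so the sketch falls back to the single token "burning", which is the kind of long-tail prediction a fixed list of full suggestions cannot make.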

Most likely this suggester would not be used by itself, but rather as a fallback when your primary suggester fails to find any suggestions; you can see this behavior with Google. Try searching for “the fast and the “, and you will see the suggestions are still full queries. But if the next word you type is “burning” then suddenly Google (so far!) does not have a full suggestion and falls back to its free text approach.
 

Reference: Three exciting Lucene features in one day from our JCG partner Michael McCandless at the Changing Bits blog.