Understanding ElasticSearch Analyzers

Sadly, lots of early Internet beer recipes aren’t in an easily digestible format; that is, they are unstructured, intermixed lists of directions and ingredients, often originally composed in an email or forum post.

So while it’s hard to fit these recipes into traditional data stores (which would ostensibly make searching easier), they’re perfect for ElasticSearch in their current form.

Accordingly, imagine an ElasticSearch index full of beer recipes, since…well…I enjoy making beer (and drinking it too).

First, I’ll add some beer recipes into ElasticSearch using Node’s ElasticSearch client (note that the code samples are in CoffeeScript). I’ll be adding these beer recipes into a beer_recipes index like so:

Adding a beer recipe

beer_1 = {
  name: "Todd Enders' Witbier",
  style: "wit, Belgian ale, wheat beer",
  ingredients: "4.0 lbs Belgian pils malt, 4.0 lbs raw soft red winter wheat, 0.5 lbs rolled oats, 0.75 oz coriander, freshly ground Zest from two table oranges and two lemons, 1.0 oz 3.1% AA Saaz, 3/4 corn sugar for priming, Hoegaarden strain yeast"
}

client.index('beer_recipes', 'beer', beer_1).on('data', (data) ->
  console.log(data)
).exec()

Note how the interesting part of the recipe JSON document, dubbed beer_1, is found in the ingredients field. This field is basically a big string of valuable text (you can imagine how this string was essentially the body of an email). So while the ingredients field is unstructured, it’s clearly something people will want to search on.

I will add one more recipe for good measure:

Adding a second beer recipe

beer_2 = {
  name: "Wit",
  style: "wit, Belgian ale, wheat beer",
  ingredients: "4 lbs DeWulf-Cosyns 'Pils' malt, 3 lbs brewers' flaked wheat (inauthentic; will try raw wheat next time), 6 oz rolled oats, 1 oz Saaz hops (3.3% AA), 0.75 oz bitter (Curacao) orange peel quarters (dried), 1 oz sweet orange peel (dried), 0.75 oz coriander (cracked), 0.75 oz anise seed, one small pinch cumin, 0.75 cup corn sugar (priming), 10 ml 88% food-grade lactic acid (at bottling), BrewTek 'Belgian Wheat' yeast"
}

client.index('beer_recipes', 'beer', beer_2).on('data', (data) ->
  console.log(data)
).exec()

It’s a hot summer’s day and I’m thinking I’d like to make a beer with lemon as an ingredient (to be clear: I want to use lemon zest, which is obtained from a lemon peel). So naturally, I need to find (i.e. search for) a recipe with lemons in it.

Consequently, I’ll search my index for recipes that contain the word “lemon” like so:

Searching for lemon

query = { "query" : { "term" : { "ingredients" : "lemon" } } }

client.search('beer_recipes', 'beer', query).on('data', (data) ->
  data = JSON.parse(data)
  for doc in data.hits.hits
      console.log doc._source.style
      console.log doc._source.name
      console.log doc._source.ingredients
).exec()

But nothing shows up – there are no results! Why is that?

If you look closely at the earlier code example (specifically, the beer_1 JSON document), you can see that the word “lemons” is in the text (i.e. “…two table oranges and two lemons…”). It turns out that, with ElasticSearch’s default indexing behavior, a term query for lemon doesn’t match – a query for lemons does, though.

Searching for lemons

query = { "query" : { "term" : { "ingredients" : "lemons" } } }

client.search('beer_recipes', 'beer', query).on('data', (data) ->
  data = JSON.parse(data)
  for doc in data.hits.hits
      console.log doc._source.style
      console.log doc._source.name
      console.log doc._source.ingredients
).exec()

Lo and behold, this search returns a hit! But that’s inconvenient, to say the least. Basically, the words in the ingredients field are tokenized as-is. Hence, a search for “lemons” works while one for “lemon” doesn’t. Note: there are various other query mechanisms, and a wildcard search for “lemon*” would have returned a result.

When a document is added into an ElasticSearch index, its fields are analyzed and converted into tokens. When you execute a search against an index, you search against those tokens. How ElasticSearch tokenizes a document is configurable.
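To make the idea concrete, here’s a rough sketch in Python (not ElasticSearch code – the tokenization rule below is a deliberate simplification of what the default analyzer does): documents are reduced to tokens at index time, and a term query matches those tokens exactly.

```python
import re

def tokenize(text):
    # Lowercase and split on anything that isn't a letter or digit --
    # a crude stand-in for ElasticSearch's default tokenization.
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]

ingredients = "Zest from two table oranges and two lemons, 1.0 oz 3.1% AA Saaz"
tokens = tokenize(ingredients)

# A term query matches indexed tokens exactly:
print("lemons" in tokens)  # True  -- this token was indexed
print("lemon" in tokens)   # False -- no such token, hence no hit
```

This is why the earlier search for “lemon” came up empty: no token in the index was exactly “lemon”.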

There are different ElasticSearch analyzers available – from language analyzers that let you support non-English searches to the snowball analyzer, which converts a word into its root, or stem (the process of reducing a word to its stem is called stemming), yielding a simpler token. For example, the snowball stem of “lemons” is “lemon”. Likewise, if the words “knocks” and “knocking” appeared in a snowball-analyzed document, both would be stemmed to “knock”.
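The flavor of stemming can be sketched with a toy suffix-stripper (this is emphatically not the real snowball algorithm, which has many more rules – it just illustrates how related word forms collapse to a shared token):

```python
def toy_stem(word):
    # A deliberately tiny suffix-stripper. The real snowball stemmer is far
    # more sophisticated, but the idea is the same: reduce related word
    # forms to one shared stem so they match each other at search time.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("lemons"))    # lemon
print(toy_stem("knocks"))    # knock
print(toy_stem("knocking"))  # knock
```

At index time the stem is what gets stored, and queries are stemmed the same way – so “lemon” and “lemons” both resolve to the token “lemon”.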

You can change how documents are tokenized via the index mapping API like so:

Changing the mapping for an index using cURL

curl -XPUT 'http://localhost:9200/beer_recipes' -d '{ "mappings" : {
  "beer" : {
    "properties" : {
      "ingredients" : { "type" : "string", "analyzer" : "snowball" }
    }
   }
 }
}'

Note how the above mapping specifies that the ingredients field will be analyzed via the snowball analyzer. Also note that you have to define the mapping of an index before you begin adding documents to it! So, in this case, I’ll need to drop the index, run the mapping call above, and then re-add those two recipes.
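That drop-remap-reload workflow looks like this via cURL (a sketch, assuming ElasticSearch is running locally on the default port; the document IDs and the beer_*.json file names are arbitrary choices for this example – you could equally re-run the Node client snippets from earlier):

```shell
# 1. Drop the existing index -- this deletes the documents in it!
curl -XDELETE 'http://localhost:9200/beer_recipes'

# 2. Re-create the index with the snowball mapping
curl -XPUT 'http://localhost:9200/beer_recipes' -d '{
  "mappings" : {
    "beer" : {
      "properties" : {
        "ingredients" : { "type" : "string", "analyzer" : "snowball" }
      }
    }
  }
}'

# 3. Re-add the two recipes, here from JSON files holding the documents
curl -XPUT 'http://localhost:9200/beer_recipes/beer/1' -d @beer_1.json
curl -XPUT 'http://localhost:9200/beer_recipes/beer/2' -d @beer_2.json
```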

Now I can begin searching recipes for the ingredient “lemon” or “lemons”.

Searching for lemon now works!

query = { "query" : { "term" : { "ingredients" : "lemon" } } }

client.search('beer_recipes', 'beer', query).on('data', (data) ->
  data = JSON.parse(data)
  for doc in data.hits.hits
      console.log doc._source.style
      console.log doc._source.name
      console.log doc._source.ingredients
).exec()

Keep in mind that stemming can inadvertently make your search results less relevant: distinct words can be reduced to the same, more common stem. For example, if a stemmer reduced the word “sextant” to the stem “sex”, then searches for “sextant” would also return documents containing the word “sex” (and vice versa).

ElasticSearch puts a powerful search engine at your fingertips; plus, with a little forethought into how a document’s contents are analyzed, you’ll make searches even more relevant.
 

Reference: Understanding ElasticSearch Analyzers from our JCG partner Andrew Glover at the The Disco Blog blog.