About Jaibeer Malik

Jaibeer is an experienced Java Software Architect and Agile enthusiast with a passion for new technologies, clean code and agile development.

Elasticsearch: Text analysis for content enrichment

Every text search solution is only as powerful as the text analysis capabilities it offers. Lucene is one such open source information retrieval library offering many text analysis possibilities. In this post, we will cover some of the main text analysis features offered by ElasticSearch that you can use to enrich your search content.

Content Enrichment

Taking the example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by the search solution plays a very big role in it. As a search user, I would expect a typical search to automatically behave as follows for my query,

  • should look for synonyms matching my query text
  • should match singular and plural words, or words sounding similar to the entered query text
  • should not allow searching on protected words
  • should allow searching for words mixed with numeric or special characters
  • should not allow searching on HTML tags
  • should allow searching for text based on the proximity of letters and the number of matching letters

Enriching the content here means adding the above search capabilities to your content while indexing and searching.

Lucene Text Analysis

Lucene is an information retrieval (IR) library providing full text indexing and searching capability. For a quick reference, check the post Text Analysis inside Lucene. In Lucene, a document contains fields of text. Analysis is the process of further converting that field text into terms. These terms are used to match a search query. There are three main abstractions involved in the whole analysis process,

  • Analyzer: An Analyzer is responsible for building a TokenStream which can be consumed by the indexing and searching processes.
  • Tokenizer: A Tokenizer is a TokenStream and is responsible for breaking up incoming text into Tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
  • TokenFilter: A TokenFilter is also a TokenStream and is responsible for modifying Tokens that have been created by the Tokenizer.

A common usage style of Tokenizers and TokenFilters inside an Analyzer is the chaining pattern, which lets you build complex analyzers from simple Tokenizer/TokenFilter building blocks. Tokenizers start the analysis process by demarcating the character input into tokens (mostly these correspond to words in the original text). TokenFilters then take over the remainder of the analysis, initially wrapping a Tokenizer and successively wrapping nested TokenFilters.
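The chaining pattern described above can be sketched in plain Java. Note this is a hypothetical, list-based simplification for illustration only; the real Lucene classes (`Analyzer`, `Tokenizer`, `TokenFilter`) are stream-based and more involved:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of the Tokenizer/TokenFilter chaining pattern.
// Hypothetical interfaces, not the actual Lucene API.
public class AnalysisChainSketch {

    interface TokenStream {
        List<String> tokens();
    }

    // A Tokenizer starts the chain by breaking raw text into tokens.
    static class WhitespaceTokenizer implements TokenStream {
        private final String text;
        WhitespaceTokenizer(String text) { this.text = text; }
        public List<String> tokens() {
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    // A TokenFilter wraps another TokenStream and modifies its tokens.
    static class LowerCaseFilter implements TokenStream {
        private final TokenStream input;
        LowerCaseFilter(TokenStream input) { this.input = input; }
        public List<String> tokens() {
            return input.tokens().stream()
                    .map(String::toLowerCase)
                    .collect(Collectors.toList());
        }
    }

    // An "Analyzer" assembles the chain: Tokenizer first, then nested filters.
    static List<String> analyze(String text) {
        return new LowerCaseFilter(new WhitespaceTokenizer(text)).tokens();
    }

    public static void main(String[] args) {
        System.out.println(analyze("Quick Brown FOX")); // [quick, brown, fox]
    }
}
```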

ElasticSearch Text Analysis

ElasticSearch uses Lucene's inbuilt text analysis capabilities and allows you to enrich your search content. As stated above, text analysis is divided into filters, tokenizers and analyzers. ElasticSearch offers quite a few inbuilt analyzers with preconfigured tokenizers and filters. For a detailed list of the existing analyzers, check the complete list under Analysis.

Update Analysis Settings

ElasticSearch allows you to dynamically update index settings and mappings. To update index settings from the Java API client,

Settings settings = settingsBuilder().loadFromSource(jsonBuilder()
                    .startObject()
                        //Add analyzer settings
                        .startObject("analysis")
                            .startObject("filter")
                                .startObject("test_filter_stopwords_en")
                                    .field("type", "stop")
                                    .field("stopwords_path", "stopwords/stop_en")
                                .endObject()
                                .startObject("test_filter_snowball_en")
                                    .field("type", "snowball")
                                    .field("language", "English")
                                .endObject()
                                .startObject("test_filter_worddelimiter_en")
                                    .field("type", "word_delimiter")
                                    .field("protected_words_path", "worddelimiters/protectedwords_en")
                                    .field("type_table_path", "typetable")
                                .endObject()
                                .startObject("test_filter_synonyms_en")
                                    .field("type", "synonym")
                                    .field("synonyms_path", "synonyms/synonyms_en")
                                    .field("ignore_case", true)
                                    .field("expand", true)
                                .endObject()
                                .startObject("test_filter_ngram")
                                    .field("type", "edgeNGram")
                                    .field("min_gram", 2)
                                    .field("max_gram", 30)
                                .endObject()
                           .endObject()
                           .startObject("analyzer")
                                .startObject("test_analyzer")
                                    .field("type", "custom")
                                    .field("tokenizer", "whitespace")
                                    .field("filter", new String[]{"lowercase",
                                                                    "test_filter_worddelimiter_en",
                                                                        "test_filter_stopwords_en",
                                                                        "test_filter_synonyms_en",
                                                                        "test_filter_snowball_en"})
                                    .field("char_filter", "html_strip")
                                .endObject()
                           .endObject()
                        .endObject()
                    .endObject().string()).build();
 
CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate(indexName);
createIndexRequestBuilder.setSettings(settings);

You can also define your index settings in your configuration file. The paths mentioned in the above example are relative to the config directory of the installed ElasticSearch server. The above example creates custom filters and analyzers for your index; ElasticSearch also ships with existing combinations of different filters and tokenizers, allowing you to select the right combination for your data.
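For reference, the same analysis settings can equivalently be declared as index settings JSON (a partial sketch mirroring the builder code above; the file paths remain relative to the config directory):

```json
{
  "analysis": {
    "filter": {
      "test_filter_stopwords_en": {
        "type": "stop",
        "stopwords_path": "stopwords/stop_en"
      },
      "test_filter_snowball_en": {
        "type": "snowball",
        "language": "English"
      }
    },
    "analyzer": {
      "test_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "char_filter": ["html_strip"],
        "filter": ["lowercase", "test_filter_stopwords_en", "test_filter_snowball_en"]
      }
    }
  }
}
```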

Synonyms

Synonyms are words with the same or similar meaning. Synonym expansion is where we take the variants of a word and feed them to the search engine at indexing and/or query time. To add a synonym filter to the settings for the index,

.startObject("test_filter_synonyms_en")
    .field("type", "synonym")
    .field("synonyms_path", "synonyms/synonyms_en")
    .field("ignore_case", true)
    .field("expand", true)
.endObject()

Check the Synonym Filter for the complete syntax. You can add synonyms in Solr or WordNet format. Have a look at the Solr synonym format for further examples,

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

Check the wordlist for the list of words and synonyms matching your requirements.
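The effect of the expand option from the Solr format above can be sketched in plain Java (an illustrative toy, not the actual Lucene synonym filter): with expand=true every variant maps to the whole group, with expand=false every variant collapses to the first term.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of synonym expansion, using the ipod group
// from the Solr format example above.
public class SynonymExpansionSketch {

    static final List<String> GROUP = Arrays.asList("ipod", "i-pod", "i pod");

    static List<String> expand(String term, boolean expand) {
        if (!GROUP.contains(term)) {
            return Arrays.asList(term);              // not a synonym, pass through
        }
        return expand ? GROUP                        // emit the whole group
                      : Arrays.asList(GROUP.get(0)); // collapse to canonical term
    }

    public static void main(String[] args) {
        System.out.println(expand("i-pod", true));  // [ipod, i-pod, i pod]
        System.out.println(expand("i-pod", false)); // [ipod]
    }
}
```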

Stemming

Word stemming is defined as the ability to include word variations by reducing words to their root (stem) form. For example, a search for "search" would also match variations such as "searched" and "searching". Stemming applies quantified rules of grammar to derive word stems and rank them according to their degree of separation from the root word. To add a stemming filter to the settings for the index,

.startObject("test_filter_snowball_en")
    .field("type", "snowball")
    .field("language", "English")
.endObject()

Check the Snowball Filter syntax for details. Stemming programs are commonly referred to as stemming algorithms or stemmers. Lucene analysis can be algorithmic or dictionary based. Snowball, based on Martin Porter's Snowball algorithm, provides stemming functionality and is used as the stemmer in the above example. Check the list of Snowball stemmers for the different supported languages. Synonym and stemming filters can sometimes return strange results depending on the order of text processing, so make sure to chain the two in the order matching your requirements.
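To make the idea of algorithmic stemming concrete, here is a deliberately tiny suffix-stripping sketch. This is not the Snowball/Porter algorithm, which applies a far richer rule set; it only shows the general suffix-stripping technique:

```java
// Toy suffix-stripping stemmer for illustration only;
// the real Snowball (Porter) stemmer is much more sophisticated.
public class StemmerSketch {

    static String stem(String word) {
        if (word.endsWith("ies") && word.length() > 4) {
            return word.substring(0, word.length() - 3) + "y"; // "queries" -> "query"
        }
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);       // "searching" -> "search"
        }
        if (word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);       // "phones" -> "phone"
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("searching")); // search
        System.out.println(stem("queries"));   // query
    }
}
```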

Stop words

Stop words are the list of words that you do not want to allow users to index or query upon. To add a stop word filter to the settings,

.startObject("test_filter_stopwords_en")
    .field("type", "stop")
    .field("stopwords_path", "stopwords/stop_en")
.endObject()

Check the complete syntax of the stop words filter. To derive your own list, check the Snowball stop words list for the English language, or the shared Solr list of English stop words.
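What the stop word filter does to a token stream can be sketched as follows (an illustrative toy; in ElasticSearch the stop list is loaded from the configured stopwords_path):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of a stop word filter: tokens found
// in the stop list are dropped from the stream.
public class StopWordSketch {

    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "on"));

    static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("the", "price", "of", "ipod")));
        // [price, ipod]
    }
}
```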

Word Delimiter

The word delimiter filter allows you to split a word into sub-words, for further processing on the sub-words. To add a word delimiter filter to the settings,

.startObject("test_filter_worddelimiter_en")
    .field("type", "word_delimiter")
    .field("protected_words_path", "worddelimiters/protectedwords_en")
    .field("type_table_path", "typetable")
.endObject()

Words are commonly split on non-alphanumeric characters, case transitions, intra-word delimiters, etc. Check the complete syntax and the different available options of the Word Delimiter Filter. The list of protected words allows you to protect business-relevant words from being delimited in the process.
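Two of those splitting rules, case transitions and non-alphanumeric characters, can be sketched with a couple of regular expressions (an illustrative toy; the real filter has many more options, such as catenation and the protected words list):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of word delimiter splitting: break on
// lower-to-upper case transitions and non-alphanumeric characters.
public class WordDelimiterSketch {

    static List<String> split(String word) {
        return Arrays.asList(word
                .replaceAll("([a-z0-9])([A-Z])", "$1 $2") // case transition
                .split("[^a-zA-Z0-9]+"));                 // non-alphanumeric
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot")); // [Power, Shot]
        System.out.println(split("wi-fi"));     // [wi, fi]
    }
}
```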

N-grams

An n-gram is a contiguous sequence of n letters from a given sequence of text. To add an edge ngram filter to the settings,

.startObject("test_filter_ngram")
    .field("type", "edgeNGram")
    .field("min_gram", 2)
    .field("max_gram", 30)
.endObject()

Based on your configuration, the input text will be broken down at indexing time into multiple tokens of the lengths configured above. This allows you to return results based on matching n-gram tokens, also taking proximity into account. Check the detailed syntax of the Edge NGram Filter.
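Edge n-gram generation itself is simple to sketch: for each token, emit every prefix whose length lies between min_gram and max_gram (an illustrative toy mirroring what the edgeNGram filter produces from the front edge of a token):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of edge n-gram generation: emit the prefixes
// of a token between minGram and maxGram characters long.
public class EdgeNgramSketch {

    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int max = Math.min(maxGram, token.length());
        for (int len = minGram; len <= max; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("elastic", 2, 4)); // [el, ela, elas]
    }
}
```

A query for "ela" would then match the indexed token "elastic" through its shared edge n-grams.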

HTML Strip Char Filter

Most websites have HTML content that should be indexable, but indexing and querying on the raw HTML tags is not desired for most sites. ElasticSearch allows you to filter out the HTML tags, so that they won't be indexed and won't be available for query.

.startObject("analyzer")
    .startObject("test_analyzer")
        .field("type", "custom")
        .field("tokenizer", "whitespace")
        .field("filter", new String[]{"lowercase", "test_filter_worddelimiter_en", "test_filter_stopwords_en", "test_filter_synonyms_en", "test_filter_snowball_en"})
        .field("char_filter", "html_strip")
    .endObject()
.endObject()
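The effect of the html_strip char filter can be sketched with a simple regex (an illustrative toy; the real char filter also handles entities and malformed markup): the markup is removed before tokenization so only the visible text is analyzed.

```java
// Illustrative sketch of HTML stripping: remove markup so that
// only the visible text reaches the tokenizer.
public class HtmlStripSketch {

    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Apple <b>iPod</b> 32GB</p>")); // Apple iPod 32GB
    }
}
```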

Check the complete syntax of the HTML Strip Char Filter for details. In addition to the common filters mentioned above, there are many more filters available, allowing you to enrich your search content in the desired way based on end user requirements and your business data.
 


Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.