About Jaibeer Malik

Jaibeer is an experienced Java Software Architect and Agile enthusiast with a passion for new technologies, clean code and agile development.

Elasticsearch: Text analysis for content enrichment

Every text search solution is only as powerful as the text analysis capabilities it offers. Lucene is one such open source information retrieval library offering many text analysis possibilities. In this post, we will cover some of the main text analysis features offered by ElasticSearch that you can use to enrich your search content.

Content Enrichment

Taking the example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by the search solution plays a very big role in it. As a search user, I would expect a typical search to automatically behave as follows for my query,

  • should look for synonyms matching my query text
  • should match singular and plural words, or words sounding similar to the entered query text
  • should not allow searching on protected words
  • should allow searching for words mixed with numeric or special characters
  • should not allow searching on HTML tags
  • should allow searching for text based on the proximity of letters and the number of matching letters

Enriching the content here means adding the above search capabilities to your content while indexing and searching.

Lucene Text Analysis

Lucene is an information retrieval (IR) library providing full text indexing and searching capability. For a quick reference, check the post Text Analysis inside Lucene. In Lucene, a document contains fields of text. Analysis is the process of further converting that field text into terms. These terms are used to match a search query. There are three main abstractions involved in the whole analysis process,

  • Analyzer: An Analyzer is responsible for building a TokenStream which can be consumed by the indexing and searching processes.
  • Tokenizer: A Tokenizer is a TokenStream and is responsible for breaking up incoming text into Tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
  • TokenFilter: A TokenFilter is also a TokenStream and is responsible for modifying Tokens that have been created by the Tokenizer.

A common usage style of Tokenizers and TokenFilters inside an Analyzer is the chaining pattern, which lets you build complex analyzers from simple Tokenizer/TokenFilter building blocks. Tokenizers start the analysis process by demarcating the character input into tokens (mostly these correspond to words in the original text). TokenFilters then take over the remainder of the analysis, initially wrapping a Tokenizer and successively wrapping nested TokenFilters.
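The chaining pattern described above can be sketched in plain Java. Note this is a hypothetical, list-based simplification for illustration only; the real Lucene classes (`Analyzer`, `Tokenizer`, `TokenFilter`) are stream-based and more involved:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of the Tokenizer/TokenFilter chaining pattern.
// Hypothetical interfaces, not the actual Lucene API.
public class AnalysisChainSketch {

    interface TokenStream {
        List<String> tokens();
    }

    // A Tokenizer starts the chain by breaking raw text into tokens.
    static class WhitespaceTokenizer implements TokenStream {
        private final String text;
        WhitespaceTokenizer(String text) { this.text = text; }
        public List<String> tokens() {
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    // A TokenFilter wraps another TokenStream and modifies its tokens.
    static class LowerCaseFilter implements TokenStream {
        private final TokenStream input;
        LowerCaseFilter(TokenStream input) { this.input = input; }
        public List<String> tokens() {
            return input.tokens().stream()
                    .map(String::toLowerCase)
                    .collect(Collectors.toList());
        }
    }

    // An "Analyzer" assembles the chain: Tokenizer first, then nested filters.
    static List<String> analyze(String text) {
        return new LowerCaseFilter(new WhitespaceTokenizer(text)).tokens();
    }

    public static void main(String[] args) {
        System.out.println(analyze("Quick Brown FOX")); // [quick, brown, fox]
    }
}
```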

ElasticSearch Text Analysis

ElasticSearch uses Lucene's inbuilt text analysis capabilities and allows you to enrich your search content. As stated above, text analysis is divided into filters, tokenizers and analyzers. ElasticSearch offers quite a few inbuilt analyzers with preconfigured tokenizers and filters. For a detailed list of the existing analyzers, check the complete list under Analysis.

Update Analysis Settings

ElasticSearch allows you to dynamically update index settings and mappings. To update index settings from the Java API client,

Settings settings = settingsBuilder().loadFromSource(jsonBuilder()
                    .startObject()
                        //Add analyzer settings
                        .startObject("analysis")
                            .startObject("filter")
                                .startObject("test_filter_stopwords_en")
                                    .field("type", "stop")
                                    .field("stopwords_path", "stopwords/stop_en")
                                .endObject()
                                .startObject("test_filter_snowball_en")
                                    .field("type", "snowball")
                                    .field("language", "English")
                                .endObject()
                                .startObject("test_filter_worddelimiter_en")
                                    .field("type", "word_delimiter")
                                    .field("protected_words_path", "worddelimiters/protectedwords_en")
                                    .field("type_table_path", "typetable")
                                .endObject()
                                .startObject("test_filter_synonyms_en")
                                    .field("type", "synonym")
                                    .field("synonyms_path", "synonyms/synonyms_en")
                                    .field("ignore_case", true)
                                    .field("expand", true)
                                .endObject()
                                .startObject("test_filter_ngram")
                                    .field("type", "edgeNGram")
                                    .field("min_gram", 2)
                                    .field("max_gram", 30)
                                .endObject()
                           .endObject()
                           .startObject("analyzer")
                                .startObject("test_analyzer")
                                    .field("type", "custom")
                                    .field("tokenizer", "whitespace")
                                    .field("filter", new String[]{"lowercase",
                                                                    "test_filter_worddelimiter_en",
                                                                        "test_filter_stopwords_en",
                                                                        "test_filter_synonyms_en",
                                                                        "test_filter_snowball_en"})
                                    .field("char_filter", "html_strip")
                                .endObject()
                           .endObject()
                        .endObject()
                    .endObject().string()).build();
 
CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate(indexName);
createIndexRequestBuilder.setSettings(settings);

You can also define your index settings in your configuration file. The paths mentioned in the above example are relative to the config directory of the installed ElasticSearch server. The above example creates custom filters and analyzers for your index; ElasticSearch also ships with existing combinations of different filters and tokenizers, allowing you to select the right combination for your data.
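For reference, the same analysis settings can equivalently be declared as index settings JSON (a partial sketch mirroring the builder code above; the file paths remain relative to the config directory):

```json
{
  "analysis": {
    "filter": {
      "test_filter_stopwords_en": {
        "type": "stop",
        "stopwords_path": "stopwords/stop_en"
      },
      "test_filter_snowball_en": {
        "type": "snowball",
        "language": "English"
      }
    },
    "analyzer": {
      "test_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "char_filter": ["html_strip"],
        "filter": ["lowercase", "test_filter_stopwords_en", "test_filter_snowball_en"]
      }
    }
  }
}
```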

Synonyms

Synonyms are words with the same or similar meaning. Synonym expansion is where we take the variants of a word and feed them to the search engine at indexing and/or query time. To add a synonym filter to the settings for the index,

.startObject("test_filter_synonyms_en")
    .field("type", "synonym")
    .field("synonyms_path", "synonyms/synonyms_en")
    .field("ignore_case", true)
    .field("expand", true)
.endObject()

Check the Synonym Filter for the complete syntax. You can add synonyms in Solr or WordNet format. Have a look at the Solr synonym format for further examples,

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

Check the wordlist for the list of words and synonyms matching your requirements.
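The effect of the expand option from the Solr format above can be sketched in plain Java (an illustrative toy, not the actual Lucene synonym filter): with expand=true every variant maps to the whole group, with expand=false every variant collapses to the first term.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of synonym expansion, using the ipod group
// from the Solr format example above.
public class SynonymExpansionSketch {

    static final List<String> GROUP = Arrays.asList("ipod", "i-pod", "i pod");

    static List<String> expand(String term, boolean expand) {
        if (!GROUP.contains(term)) {
            return Arrays.asList(term);              // not a synonym, pass through
        }
        return expand ? GROUP                        // emit the whole group
                      : Arrays.asList(GROUP.get(0)); // collapse to canonical term
    }

    public static void main(String[] args) {
        System.out.println(expand("i-pod", true));  // [ipod, i-pod, i pod]
        System.out.println(expand("i-pod", false)); // [ipod]
    }
}
```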

Stemming

Word stemming is defined as the ability to include word variations by reducing words to their root (stem) form. For example, a search for "search" would also match variations such as "searched" and "searching". Stemming applies quantified rules of grammar to derive word stems and rank them according to their degree of separation from the root word. To add a stemming filter to the settings for the index,

.startObject("test_filter_snowball_en")
    .field("type", "snowball")
    .field("language", "English")
.endObject()

Check the Snowball Filter syntax for details. Stemming programs are commonly referred to as stemming algorithms or stemmers. Lucene analysis can be algorithmic or dictionary based. Snowball, based on Martin Porter's Snowball algorithm, provides stemming functionality and is used as the stemmer in the above example. Check the list of Snowball stemmers for the different supported languages. Synonym and stemming filters can sometimes return strange results depending on the order of text processing, so make sure to chain the two in the order matching your requirements.
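To make the idea of algorithmic stemming concrete, here is a deliberately tiny suffix-stripping sketch. This is not the Snowball/Porter algorithm, which applies a far richer rule set; it only shows the general suffix-stripping technique:

```java
// Toy suffix-stripping stemmer for illustration only;
// the real Snowball (Porter) stemmer is much more sophisticated.
public class StemmerSketch {

    static String stem(String word) {
        if (word.endsWith("ies") && word.length() > 4) {
            return word.substring(0, word.length() - 3) + "y"; // "queries" -> "query"
        }
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);       // "searching" -> "search"
        }
        if (word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);       // "phones" -> "phone"
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("searching")); // search
        System.out.println(stem("queries"));   // query
    }
}
```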

Stop words

Stop words are the list of words that you do not want to allow users to index or query upon. To add a stop word filter to the settings,

.startObject("test_filter_stopwords_en")
    .field("type", "stop")
    .field("stopwords_path", "stopwords/stop_en")
.endObject()

Check the complete syntax of the stop words filter. To derive your own list, check the Snowball stop words list for the English language, or the shared Solr list of English stop words.
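What the stop word filter does to a token stream can be sketched as follows (an illustrative toy; in ElasticSearch the stop list is loaded from the configured stopwords_path):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of a stop word filter: tokens found
// in the stop list are dropped from the stream.
public class StopWordSketch {

    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "on"));

    static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("the", "price", "of", "ipod")));
        // [price, ipod]
    }
}
```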

Word Delimiter

The word delimiter filter allows you to split a word into sub-words, for further processing on the sub-words. To add a word delimiter filter to the settings,

.startObject("test_filter_worddelimiter_en")
    .field("type", "word_delimiter")
    .field("protected_words_path", "worddelimiters/protectedwords_en")
    .field("type_table_path", "typetable")
.endObject()

Words are commonly split on non-alphanumeric characters, case transitions, intra-word delimiters, etc. Check the complete syntax and the different available options of the Word Delimiter Filter. The list of protected words allows you to protect business-relevant words from being delimited in the process.
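Two of those splitting rules, case transitions and non-alphanumeric characters, can be sketched with a couple of regular expressions (an illustrative toy; the real filter has many more options, such as catenation and the protected words list):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of word delimiter splitting: break on
// lower-to-upper case transitions and non-alphanumeric characters.
public class WordDelimiterSketch {

    static List<String> split(String word) {
        return Arrays.asList(word
                .replaceAll("([a-z0-9])([A-Z])", "$1 $2") // case transition
                .split("[^a-zA-Z0-9]+"));                 // non-alphanumeric
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot")); // [Power, Shot]
        System.out.println(split("wi-fi"));     // [wi, fi]
    }
}
```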

N-grams

An n-gram is a contiguous sequence of n letters from a given sequence of text. To add an edge ngram filter to the settings,

.startObject("test_filter_ngram")
    .field("type", "edgeNGram")
    .field("min_gram", 2)
    .field("max_gram", 30)
.endObject()

Based on your configuration, the input text will be broken down at indexing time into multiple tokens of the lengths configured above. This allows you to return results based on matching n-gram tokens, also taking proximity into account. Check the detailed syntax of the Edge NGram Filter.
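Edge n-gram generation itself is simple to sketch: for each token, emit every prefix whose length lies between min_gram and max_gram (an illustrative toy mirroring what the edgeNGram filter produces from the front edge of a token):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of edge n-gram generation: emit the prefixes
// of a token between minGram and maxGram characters long.
public class EdgeNgramSketch {

    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int max = Math.min(maxGram, token.length());
        for (int len = minGram; len <= max; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("elastic", 2, 4)); // [el, ela, elas]
    }
}
```

A query for "ela" would then match the indexed token "elastic" through its shared edge n-grams.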

HTML Strip Char Filter

Most websites have HTML content that should be indexable, but indexing and querying on the raw HTML tags is not desired for most sites. ElasticSearch allows you to filter out the HTML tags, so that they won't be indexed and won't be available for query.

.startObject("analyzer")
    .startObject("test_analyzer")
        .field("type", "custom")
        .field("tokenizer", "whitespace")
        .field("filter", new String[]{"lowercase", "test_filter_worddelimiter_en", "test_filter_stopwords_en", "test_filter_synonyms_en", "test_filter_snowball_en"})
        .field("char_filter", "html_strip")
    .endObject()
.endObject()
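The effect of the html_strip char filter can be sketched with a simple regex (an illustrative toy; the real char filter also handles entities and malformed markup): the markup is removed before tokenization so only the visible text is analyzed.

```java
// Illustrative sketch of HTML stripping: remove markup so that
// only the visible text reaches the tokenizer.
public class HtmlStripSketch {

    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Apple <b>iPod</b> 32GB</p>")); // Apple iPod 32GB
    }
}
```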

Check the complete syntax of the HTML Strip Char Filter for details. In addition to the common filters mentioned above, there are many more filters available, allowing you to enrich your search content in the desired way based on end user requirements and your business data.
 


Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.