
Elasticsearch – Ignore special characters in query with pattern replace filter and custom analyzer

Using Elasticsearch 5, we had a field, such as a driver's license number, whose values could include special characters and inconsistent upper/lower case, because users entered them with limited validation. For example, consider these hypothetical values:

  • CA-123-456-789
  • WI.12345.6789
  • tx123456789
  • az-123-xyz-456

In our application, the end user needs to search by that field. We had a business requirement that the user should not have to enter special characters such as hyphens and periods to get back the record. So for the first example above, the user should be able to type any of these values and see that record:

  • CA-123-456-789 (an exact match)
  • CA123456789  (no special chars)
  • ca123456789  (lower-case letters and no special chars)
  • Ca.123.456-789 (mixed case letters and mixed special chars)

Our approach was to write a custom analyzer that ignores special characters and then query against that field.

Step 1:  Create pattern replace character filter and custom analyzer

We defined a pattern replace character filter to remove any non-alphanumeric characters as follows on the index:

"char_filter": {
    "specialCharactersFilter": {
        "pattern": "[^A-Za-z0-9]",
        "type": "pattern_replace",
        "replacement": ""
    }
}
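
To make the effect of the character filter concrete, here is a minimal Java sketch that applies the same regular expression to the hypothetical values from the introduction. The class and method names are made up for illustration; the sketch only mimics what the filter does and does not call Elasticsearch.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elasticsearch: it imitates the character filter locally.
final class SpecialCharacterFilterSketch {
    // Same regular expression as the "pattern" of specialCharactersFilter.
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9]");

    // Mirrors "replacement": "" by deleting every character that matches the pattern.
    static String strip(String value) {
        return NON_ALPHANUMERIC.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> values =
                List.of("CA-123-456-789", "WI.12345.6789", "tx123456789", "az-123-xyz-456");
        for (String value : values) {
            // Case is left untouched here; lowercasing is a separate analysis step.
            System.out.println(value + " -> " + strip(value));
        }
    }
}
```

Note that the character filter alone does not change letter case, so "CA-123-456-789" becomes "CA123456789" at this stage.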

Then we used that filter to create a custom analyzer that we named “alphanumericStringAnalyzer” on the index:

"analyzer": {
    "alphanumericStringAnalyzer": {
        "filter": [
            "lowercase"
        ],
        "char_filter": [
            "specialCharactersFilter"
        ],
        "type": "custom",
        "tokenizer": "standard"
    }
}
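
As a rough illustration of what this analyzer produces, the following Java sketch chains the two transformations: strip non-alphanumeric characters, then lowercase. It assumes that the standard tokenizer emits the stripped value as a single token, which should hold for a short run of ASCII letters and digits. The class name is hypothetical and nothing here talks to a cluster.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical approximation of alphanumericStringAnalyzer, for illustration only.
final class AlphanumericAnalyzerSketch {
    static String analyze(String value) {
        // Step 1: specialCharactersFilter removes everything except letters and digits.
        String stripped = value.replaceAll("[^A-Za-z0-9]", "");
        // Step 2 (assumption): the standard tokenizer keeps the stripped value as one token.
        // Step 3: the lowercase token filter normalizes case.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> inputs =
                List.of("CA-123-456-789", "CA123456789", "ca123456789", "Ca.123.456-789");
        for (String input : inputs) {
            // All four user inputs from the introduction reduce to the same token.
            System.out.println(input + " -> " + analyze(input));
        }
    }
}
```

Under that assumption, all four search inputs listed in the introduction reduce to the token "ca123456789".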

Step 2: Define field mapping using the custom analyzer

The next step was to define a field mapping that applies the new “alphanumericStringAnalyzer” analyzer to an “alphanumeric” sub-field, alongside a “raw” keyword sub-field:

"driversLicenseNumber": {
    "type": "text",
    "fields": {
        "alphanumeric": {
            "type": "text",
            "analyzer": "alphanumericStringAnalyzer"
        },
        "raw": {
            "type": "keyword"
        }
    }
}

Step 3: Run query against new field

In our case, this match query is part of the “should” clause of a boolean query:

{
    "match" : {
        "driversLicenseNumber.alphanumeric" : {
            "query" : "Ca.123.456-789",
            "operator" : "OR",
            "boost" : 10.0
        }
    }
}
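
The query works because a match query analyzes its query text with the field's analyzer by default, so the indexed value and the search input end up as the same token. The Java sketch below imitates that with a plain map. It is a deliberate simplification with hypothetical names: it ignores scoring, the boost, and the real inverted index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for the index; illustrates token matching only.
final class MatchQuerySketch {
    // Approximates the custom analyzer: strip special characters, then lowercase.
    static String analyze(String value) {
        return value.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Index time: store each original value under its analyzed token.
    static Map<String, String> index(List<String> storedValues) {
        Map<String, String> tokenToOriginal = new LinkedHashMap<>();
        for (String value : storedValues) {
            tokenToOriginal.put(analyze(value), value);
        }
        return tokenToOriginal;
    }

    // Query time: analyze the query text the same way, then look up the token.
    // Returns null when no stored value shares the token.
    static String search(Map<String, String> index, String queryText) {
        return index.get(analyze(queryText));
    }

    public static void main(String[] args) {
        Map<String, String> idx = index(
                List.of("CA-123-456-789", "WI.12345.6789", "tx123456789", "az-123-xyz-456"));
        System.out.println(search(idx, "Ca.123.456-789"));
    }
}
```
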
Published on Java Code Geeks with permission by Steven Wall, partner at our JCG program. See the original article here: Elasticsearch – Ignore special characters in query with pattern replace filter and custom analyzer

