Steven Wall

Elasticsearch – Ignore special characters in query with pattern replace filter and custom analyzer

Using Elasticsearch 5, we had a field for a driver's license number whose values could include special characters and inconsistent upper/lower case, because users entered them with limited validation. For example, these are hypothetical values:

  • CA-123-456-789
  • WI.12345.6789
  • tx123456789
  • az-123-xyz-456

In our application, end users need to search by that field. We had a business requirement that users should not have to enter special characters such as hyphens and periods to get back the record. So for the first example above, the user should be able to type any of these values and see that record:

  • CA-123-456-789 (an exact match)
  • CA123456789  (no special chars)
  • ca123456789  (lower-case letters and no special chars)
  • Ca.123.456-789 (mixed case letters and mixed special chars)

Our approach was to write a custom analyzer that ignores special characters and then query against that field.

Step 1:  Create pattern replace character filter and custom analyzer

In the index settings, we defined a pattern replace character filter that removes any non-alphanumeric character:

"char_filter": {
    "specialCharactersFilter": {
        "pattern": "[^A-Za-z0-9]",
        "type": "pattern_replace",
        "replacement": ""
    }
}

Then we used that filter to create a custom analyzer that we named “alphanumericStringAnalyzer” on the index:

"analyzer": {
    "alphanumericStringAnalyzer": {
        "filter": [
            "lowercase"
        ],
        "char_filter": [
            "specialCharactersFilter"
        ],
        "type": "custom",
        "tokenizer": "standard"
    }
}
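
To make the effect of this analyzer concrete, here is a rough Java approximation of the chain (character filter, then tokenizer, then lowercase filter). It is only a sketch and not Elasticsearch code: the class and method names are made up for illustration, and it assumes that once only ASCII letters and digits remain, the standard tokenizer emits the whole value as a single token.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class AlphanumericAnalyzerSketch {

    // Same regex as the "specialCharactersFilter" char filter above.
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9]");

    // Approximates the analyzer: strip special characters, then lowercase.
    // Assumption: with only ASCII letters and digits left, the standard
    // tokenizer produces one token, so tokenizing is treated as a no-op here.
    static String analyze(String input) {
        String stripped = NON_ALPHANUMERIC.matcher(input).replaceAll("");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Token that would be stored for the first example value.
        String indexedToken = analyze("CA-123-456-789");

        // The four inputs the user should be able to type.
        List<String> userInputs = List.of(
                "CA-123-456-789", "CA123456789", "ca123456789", "Ca.123.456-789");

        for (String input : userInputs) {
            String queryToken = analyze(input);
            System.out.println(input + " -> " + queryToken
                    + " (matches: " + queryToken.equals(indexedToken) + ")");
        }
    }
}
```

All four inputs reduce to the same token, ca123456789, which is why the same analyzer has to be applied both when indexing and when searching.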

Step 2: Define field mapping using the custom analyzer

The next step was to define a new field mapping that used the new “alphanumericStringAnalyzer” analyzer:

"driversLicenseNumber": {
    "type": "text",
    "fields": {
        "alphanumeric": {
            "type": "text",
            "analyzer": "alphanumericStringAnalyzer"
        },
        "raw": {
            "type": "keyword"
        }
    }
}
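
One way to check that the sub-field really uses the analyzer is the _analyze API, which can analyze a piece of text the way a given field would. The sketch below only builds such a request with the JDK's HTTP client (Java 11+). The host http://localhost:9200 and the index name licenses are assumptions, not from our setup, and the class and helper names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class AnalyzeRequestSketch {

    // Minimal JSON string escaping for this sketch (backslash and double quote only).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Body asking Elasticsearch to analyze `text` with the analyzer of `field`.
    static String analyzeBody(String field, String text) {
        return String.format("{\"field\": \"%s\", \"text\": \"%s\"}",
                escape(field), escape(text));
    }

    // Builds (but does not send) a POST to <baseUrl>/<index>/_analyze.
    static HttpRequest analyzeRequest(String baseUrl, String index,
                                      String field, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(analyzeBody(field, text)))
                .build();
    }

    public static void main(String[] args) {
        String field = "driversLicenseNumber.alphanumeric";
        String text = "Ca.123.456-789";

        // Assumed host and index name; adjust to the actual cluster.
        HttpRequest request = analyzeRequest("http://localhost:9200", "licenses", field, text);

        System.out.println(request.method() + " " + request.uri());
        System.out.println(analyzeBody(field, text));
        // To send it: HttpClient.newHttpClient()
        //         .send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

If the mapping is in place, the response should list a single token, ca123456789, for this input.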

Step 3: Run query against new field

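In our case, this match query is part of a boolean query, in the “should” clause. A match query analyzes the query text with the field's search analyzer, which by default is the same analyzer used at index time, so the input below is reduced to ca123456789 before it is compared with the indexed value: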

{
    "match" : {
        "driversLicenseNumber.alphanumeric" : {
            "query" : "Ca.123.456-789",
            "operator" : "OR",
            "boost" : 10.0
        }
    }
}
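
For illustration, this is roughly how the match clause above could be placed in the “should” clause of a boolean query from Java. It is a sketch with plain string building only: the class and method names are hypothetical, the surrounding bool query is reduced to this one clause (any other clauses are omitted), and real code would more likely use a JSON library or the Elasticsearch client.

```java
class LicenseQuerySketch {

    // Minimal JSON string escaping for this sketch (backslash and double quote only).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // The match clause from the article, with the user's raw input as the query text.
    static String matchClause(String userInput) {
        return String.format(
                "{\"match\": {\"driversLicenseNumber.alphanumeric\": "
                        + "{\"query\": \"%s\", \"operator\": \"OR\", \"boost\": 10.0}}}",
                escape(userInput));
    }

    // Wraps the match clause in bool/should; other clauses are left out here.
    static String searchBody(String userInput) {
        return "{\"query\": {\"bool\": {\"should\": [" + matchClause(userInput) + "]}}}";
    }

    public static void main(String[] args) {
        // The input is passed through unchanged; Elasticsearch applies the
        // analyzer of the "alphanumeric" sub-field to it at search time.
        System.out.println(searchBody("Ca.123.456-789"));
    }
}
```

Note that the application does not strip the special characters itself: the user's input goes into the query as typed, and the analyzer does the normalization.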

Published on Java Code Geeks with permission by Steven Wall, partner at our JCG program. See the original article here: Elasticsearch – Ignore special characters in query with pattern replace filter and custom analyzer
