About Sumith Puri

Sumith Kumar Puri has 15+ Years of Experience in Entrepreneurship, Conceptualization, Architecture, Design and Development of Software Products and Solutions (Predominantly on Core Java and Jakarta EE). About ~85%+ of this Experience is in Software Product Development with Top Companies like Yahoo (Verizon), Symantec, Siebel (Oracle), Huawei, GXS (OpenText) and Misys (Finastra). This also includes time spent as an Author, Entrepreneur and in Independent Practice. He is a SCJP 1.4, SCJP 5.0, SCBCD 1.3, SCBCD 5.0, Quest C, Quest CPP, Quest Data Structures, Brainbench Spring 2.x*, Brainbench Hibernate 3.5*, Brainbench Java EE 6.0*. He is a Senior Member in IEEE/IEEE Computer Society and a Senior Member at ACM/ACM India. He is currently the Head of Product Engineering at Agiledge, Bangalore, India

How to handle Stop Words in Hibernate Search 5.5.2 / Apache Lucene 5.4.x?

Stop Words such as [“a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”, “for”, “if”, “in”, “into”, “is”, “it”, “no”, “not”, “of”, “on”, “or”, “such”, “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to”, “was”, “will”, “with”] can cause problems when they appear in the terms, database columns or files that are indexed or searched by Lucene. Their presence can lead to any of the following:

  1. Stop Words being Ignored/Filtered during the Lucene Indexing Process
  2. Stop Words being Ignored/Filtered during the Lucene Querying Process
  3. No Result for Queries that Include, Start With or End With any Stop Word
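
To make these three effects concrete, here is a minimal plain-Java sketch that imitates a tokenize, lowercase and stop-filter chain. It does not use Lucene at all: the class `StopWordDemo`, its methods and the shortened stop list are made up for illustration, and the matching rule (every query term must be present among the indexed terms) is a simplifying assumption rather than how Lucene scores documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordDemo {
    // Shortened, illustrative subset of the stop word list above (not Lucene's own set).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "for", "of", "the", "to"));

    // Splits on whitespace, lowercases, and optionally drops stop words.
    public static List<String> analyze(String text, boolean removeStopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty()) {
                continue; // empty input produces one empty token; skip it
            }
            if (removeStopWords && STOP_WORDS.contains(token)) {
                continue; // this is the step a stop filter performs
            }
            tokens.add(token);
        }
        return tokens;
    }

    // Simplified matching (assumption): every query term must occur among the indexed terms.
    public static boolean matches(List<String> indexedTerms, List<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String title = "Computer Science Books for Beginners";
        List<String> filtered = analyze(title, true);    // stop words dropped at index time
        List<String> unfiltered = analyze(title, false); // stop words kept at index time
        List<String> query = analyze("books for beginners", false);

        System.out.println(filtered);                    // [computer, science, books, beginners]
        System.out.println(matches(filtered, query));    // false: "for" was never indexed
        System.out.println(matches(unfiltered, query));  // true
    }
}
```

The point of the sketch is only that the index side and the query side have to agree on whether stop words are kept; when one side drops them and the other does not, queries containing a stop word can come back empty.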

To handle Stop Words during both the indexing and the searching process, follow the steps below. The method explained here is especially suitable if you are using Hibernate Search 5.5.2, which in turn uses Apache Lucene 5.3.x/5.4.x.

1. Define your Custom Analyzer, Adapted from the Standard Analyzer

You need to include only two filters, ‘LowerCaseFilterFactory’ and ‘StandardFilterFactory’, in the Analyzer definition alongside the Tokenizer. The filter factory that is deliberately left out here is ‘StopFilterFactory’. Without it, Stop Words are treated like any other English Words and are indexed.

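// Imports assume JPA plus Hibernate Search 5.x with the Lucene 5.x analysis module.
import javax.persistence.*;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.standard.StandardFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.*;

// Class-level annotations on the indexed entity (class declaration not shown).
// No StopFilterFactory in the filter chain, so stop words are indexed like any other token.
@Entity
@Indexed
@Table(name="table_name", catalog="catalog_name")
@AnalyzerDef(name = "FedexTextAnalyzer",
   tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
   filters = {
     @TokenFilterDef(factory = LowerCaseFilterFactory.class),
     @TokenFilterDef(factory = StandardFilterFactory.class)
   })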

2. Mark the Field with Relevant Annotations (@Analyzer on @Field)

Along with the @Field Annotation on each indexed Column property of the Entity, declare the Analyzer that we have defined above.

@Column(name="Fedex_cs_product_name", nullable=false, length=100)
@Field(index=Index.YES, analyze=Analyze.YES, store=Store.NO, analyzer=@Analyzer(definition = "FedexTextAnalyzer"))
public String getFedexCsItemName() {
   return this.FedexCsItemName;
}

3. Use WhitespaceAnalyzer to Query so that Stop Words are ‘Processed’ by Default

Although the official documentation suggests that constructing ‘StandardAnalyzer’ with CharArraySet.EMPTY_SET as the Stop Words argument should keep Stop Words, I found that the Query was still not able to retrieve any result. On analysis with Luke, I found that for Queries such as ‘Computer Science Books for Beginners’, the ‘for’ was still being ignored. Strange! Once I replaced it with WhitespaceAnalyzer, I found that it works for all ‘Stop Words’ and all ‘Cases’.
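
One detail worth checking when the index side lowercases (as in step 1) and the query side only splits on whitespace: Lucene's WhitespaceAnalyzer does not change the case of tokens, so mixed-case query text may need to be lowercased before it is parsed. The plain-Java sketch below imitates that situation without using Lucene. `QueryCaseDemo` and its methods are hypothetical helpers, whitespace splitting stands in for the real tokenizers, and the all-terms-must-match rule is a simplification.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class QueryCaseDemo {
    // Imitates the index-side chain from step 1: split, lowercase, keep stop words.
    public static List<String> indexTerms(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Imitates a whitespace-only query analyzer: split, no case change, no stop filtering.
    public static List<String> whitespaceTerms(String query) {
        return Arrays.asList(query.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> indexed = indexTerms("Computer Science Books for Beginners");
        String query = "Books for Beginners";

        // Mixed-case terms do not equal the lowercased indexed terms.
        System.out.println(indexed.containsAll(whitespaceTerms(query)));

        // Lowercasing the query text first lines both sides up, stop word included.
        System.out.println(indexed.containsAll(whitespaceTerms(query.toLowerCase(Locale.ROOT))));
    }
}
```

If mixed-case queries return nothing in your own setup, normalising the query string this way before handing it to the query parser is one simple thing to try.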

I have found the above to be the most minimal way to fix this issue. Our QA has also verified that it works for all ‘Stop Word’ cases. Hope this helps you!
