
Advanced Lucene Query Examples

This article is part of our Academy Course titled Apache Lucene Fundamentals.

In this course, you will get an introduction to Lucene. You will see why a library like this is important and then learn how searching works in Lucene. Moreover, you will learn how to integrate Lucene Search into your own applications in order to provide robust searching capabilities.

1.Introduction

In a previous chapter we learned about the different components of the Lucene search engine. We also built a small search application using Lucene's indexing and searching procedures. In this chapter we will talk about Lucene queries.

2.Lucene Queries

Lucene has a custom query syntax for querying its indexes. A query is broken up into terms and operators. Terms are of two types: single terms and phrases. A single term is a single word such as “test” or “sample”. A phrase is a group of words surrounded by double quotes, such as “welcome lucene”. Multiple terms can be combined with Boolean operators to form a more complex query. In Lucene's Java API, TermQuery is the most primitive Query. Then there are BooleanQuery, PhraseQuery, and many other Query subclasses to choose from.

Fields

When performing a search, we can specify the field we want to search in. Any existing field name can be used as the field name. The syntax is FieldName:VALUE. There are a few special field types that have their own syntax for defining the query term, for example DateTime: ModificationDate:>'2010-09-01 12:00:00'. We will explain searching operations on these fields in a later section.

3.Lucene Query API

When a human-readable query is parsed by Lucene’s QueryParser, it is converted to a single concrete subclass of the Query class. We need some understanding of the underlying concrete Query subclasses. The relevant subclasses, their purpose, and some example expressions for each are listed in the following table:

Query Implementation | Purpose | Sample expressions
TermQuery | Single term query, which effectively is a single word. | reynolds
PhraseQuery | A match of several terms in order, or in near vicinity to one another. | "light up ahead"
RangeQuery | Matches documents with terms between beginning and ending terms, including or excluding the end points. | [A TO Z], {A TO Z}
WildcardQuery | Lightweight, regular-expression-like term-matching syntax. | j*v?, f??bar
PrefixQuery | Matches all terms that begin with a specified string. | cheese*
FuzzyQuery | Levenshtein algorithm for closeness matching. | tree~
BooleanQuery | Aggregates other Query instances into complex expressions allowing AND, OR, and NOT logic. | reynolds AND "light up ahead", cheese* -cheesewhiz

 
All of these Query implementations are in the org.apache.lucene.search package. The BooleanQuery is a bit of a special case because it is a Query container that aggregates other queries (including nested BooleanQuerys for sophisticated expressions).

Here is a BooleanQuery built through the API for the query expression 'java AND net NOT dot'. The test shows that the query created by QueryParser is equal to the one created through the API:

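import junit.framework.TestCase;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RulezTest extends TestCase {
  public void testJavaNotDotNet() throws Exception {
    BooleanQuery apiQuery = new BooleanQuery();
    // add(query, required, prohibited): "java" and "net" are required
    apiQuery.add(new TermQuery(new Term("contents", "java")), true, false);
    apiQuery.add(new TermQuery(new Term("contents", "net")), true, false);
    // "dot" is prohibited
    apiQuery.add(new TermQuery(new Term("contents", "dot")), false, true);
    Query qpQuery = QueryParser.parse("java AND net NOT dot", "contents", new StandardAnalyzer());
    // Query and subclasses behave as expected with .equals
    assertEquals(qpQuery, apiQuery);
  }
}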

An interesting feature of the Query classes is their toString methods. Each Query subclass generates the equivalent (though not necessarily textually exact) QueryParser expression. There are two variants: one is the standard overridden Object.toString method, and the other accepts the default field name. The following test case demonstrates how these two methods work, and illustrates how an equivalent (yet not exact) expression is returned.

public void testToString() throws Exception { 
  Query query = QueryParser.parse("java AND net NOT dot", "contents", new StandardAnalyzer()); 
  assertEquals("+java +net -dot", query.toString("contents")); 
  assertEquals("+contents:java +contents:net -contents:dot", query.toString()); 
}

Notice that the expression parsed was ‘java AND net NOT dot’, but the expression returned from the toString methods used the abbreviated syntax ‘+java +net -dot’. Our first test case (testJavaNotDotNet) demonstrated that the underlying query objects themselves are equivalent.

The no-arg toString method makes no assumptions about the field names for each term, and specifies them explicitly using field-selector syntax. Using these toString methods is handy for diagnosing QueryParser issues.
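
To make the two renderings concrete, here is a minimal sketch in plain Java that models the three clauses of 'java AND net NOT dot' and prints them with and without a default field. It is only a toy model: the names ClauseToStringSketch, Occur, Clause and render are our own assumptions for illustration and are not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of boolean clauses; not Lucene's actual implementation. */
final class ClauseToStringSketch {

    /** How a clause takes part in matching: required, optional, or prohibited. */
    enum Occur {
        MUST("+"), SHOULD(""), MUST_NOT("-");

        final String prefix;

        Occur(String prefix) {
            this.prefix = prefix;
        }
    }

    /** One term clause: a field, a term text, and its occurrence flag. */
    static final class Clause {
        final String field;
        final String text;
        final Occur occur;

        Clause(String field, String text, Occur occur) {
            this.field = field;
            this.text = text;
            this.occur = occur;
        }
    }

    /** Renders the clauses, omitting the field name when it equals defaultField. */
    static String render(List<Clause> clauses, String defaultField) {
        List<String> parts = new ArrayList<>();
        for (Clause c : clauses) {
            String term = c.field.equals(defaultField) ? c.text : c.field + ":" + c.text;
            parts.add(c.occur.prefix + term);
        }
        return String.join(" ", parts);
    }

    /** The clauses behind "java AND net NOT dot" on the "contents" field. */
    static List<Clause> javaNetNotDot() {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause("contents", "java", Occur.MUST));
        clauses.add(new Clause("contents", "net", Occur.MUST));
        clauses.add(new Clause("contents", "dot", Occur.MUST_NOT));
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(render(javaNetNotDot(), "contents")); // +java +net -dot
        System.out.println(render(javaNetNotDot(), ""));         // +contents:java +contents:net -contents:dot
    }
}
```

Passing an empty default field plays the role of the no-arg toString here: no clause matches it, so every term is printed with its field selector.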

4.Basic search

In most cases we want to look for a single term or a phrase, which is a group of words surrounded by double quotes (“sample application”). In these cases the search runs against the default field of the index, which holds all the relevant text of the content.

In more complex situations we may need some filtering based on the type or place of the content we are looking for, or we want to search in a specific field. Here we can learn how to construct more complex queries that can be used to effectively find content in a huge repository.

4.1.Terms

Suppose we want to search for the keyword ‘blog’ in the tag field. The syntax is:

tag:blog

Now we will search for the phrase ‘lucene blog’ in the tag field. The syntax is:

tag:"lucene blog"

Now let's search for ‘lucene blog’ in the tag field and ‘technical blog’ in the body:

tag:"lucene blog" AND body:"technical blog"

Suppose we want to search for the phrase ‘lucene blog’ in the tag field and ‘technical blog’ in the body, or for the phrase ‘searching blog’ in the tag field:

(tag:"lucene blog" AND body:"technical blog") OR tag:"searching blog"

If we want to search for ‘blog’ but not ‘lucene’ in the tag field, the syntax looks like this:

tag:blog -tag:lucene

4.2.WildcardQuery

Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).

  1. To perform a single character wildcard search use the “?” symbol.
  2. To perform a multiple character wildcard search use the “*” symbol.

A single character wildcard search looks for terms that match the pattern with the “?” replaced by any one character. For example, to search for “text” or “test” we can use the search: te?t

A multiple character wildcard search looks for 0 or more characters. For example, to search for test, tests or tester, we can use the search:

test*

We can also use the wildcard searches in the middle of a term.

te*t
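
The wildcard rules above can be sketched in plain Java by translating a wildcard pattern into a regular expression. This is only an illustration of the matching semantics, under the assumption that “?” means exactly one character and “*” means zero or more; the class WildcardSketch and its methods are hypothetical names, and Lucene itself matches wildcards against the term dictionary in a different way.

```java
import java.util.regex.Pattern;

/** Illustrates '?' and '*' semantics with java.util.regex; not Lucene code. */
final class WildcardSketch {

    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');   // exactly one character
            } else if (c == '*') {
                regex.append(".*");  // zero or more characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true when the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "test"));    // true
        System.out.println(matches("te?t", "text"));    // true
        System.out.println(matches("test*", "tester")); // true
        System.out.println(matches("te*t", "tet"));     // true, '*' may match nothing
        System.out.println(matches("te?t", "tet"));     // false, '?' needs one character
    }
}
```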

Here is an example of lucene wildcard search,

Suppose, we have two files in the ‘files’ directory.

  1. test-foods.txt

    Here are some foods that Deron likes:

    hamburger
    french fries
    steak
    mushrooms
    artichokes

  2. sample-foods.txt

    Here are some foods that Nicole likes:

    apples
    bananas
    salad
    mushrooms
    cheese

Now we’ll look at the LuceneWildcardQueryDemo class. This class creates an index via the createIndex() method based on the text files mentioned above, and after this, it attempts to perform 8 wildcard searches against this index. Four of the searches are performed using the WildcardQuery class, and the other four searches are performed using the QueryParser class.

The above 2 files are indexed first using a createIndex() method.

public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
  Analyzer analyzer = new StandardAnalyzer();
  boolean recreateIndexIfExists = true;
  IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
  File dir = new File(FILES_TO_INDEX_DIRECTORY);
  File[] files = dir.listFiles();
  for (File file : files) {
    Document document = new Document();
    String path = file.getCanonicalPath();
    document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));
    Reader reader = new FileReader(file);
    document.add(new Field(FIELD_CONTENTS, reader));
    indexWriter.addDocument(document);
  }
  indexWriter.optimize();
  indexWriter.close();
}

For doing the searching operations using the query parser, we can add a method named searchIndexWithQueryParser(),

public static void searchIndexWithQueryParser(String whichField, String searchString) throws IOException,ParseException {
  System.out.println("\nSearching for '" + searchString + "' using QueryParser");
  Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
  IndexSearcher indexSearcher = new IndexSearcher(directory);
  QueryParser queryParser = new QueryParser(whichField, new StandardAnalyzer());
  Query query = queryParser.parse(searchString);
  System.out.println("Type of query: " + query.getClass().getSimpleName());
  Hits hits = indexSearcher.search(query);
  displayHits(hits);
}

The WildcardQuery queries are performed in the searchIndexWithWildcardQuery() method, using the following code:

  Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY); 
  IndexSearcher indexSearcher = new IndexSearcher(directory); 
  Term term = new Term(whichField, searchString); 
  Query query = new WildcardQuery(term); 
  Hits hits = indexSearcher.search(query);

The QueryParser queries are performed in the searchIndexWithQueryParser() method via:

  Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY); 
  IndexSearcher indexSearcher = new IndexSearcher(directory); 
  QueryParser queryParser = new QueryParser(whichField, new StandardAnalyzer()); 
  Query query = queryParser.parse(searchString); 
  Hits hits = indexSearcher.search(query);

The LuceneWildcardQueryDemo class performs eight wildcard searches, as we can see from its main() method:

  searchIndexWithWildcardQuery(FIELD_CONTENTS, "t*t"); 
  searchIndexWithQueryParser(FIELD_CONTENTS, "t*t"); 
  searchIndexWithWildcardQuery(FIELD_CONTENTS, "sam*"); 
  searchIndexWithQueryParser(FIELD_CONTENTS, "sam*"); 
  searchIndexWithWildcardQuery(FIELD_CONTENTS, "te?t"); 
  searchIndexWithQueryParser(FIELD_CONTENTS, "te?t"); 
  searchIndexWithWildcardQuery(FIELD_CONTENTS, "*est"); 
  try { 
    searchIndexWithQueryParser(FIELD_CONTENTS, "*est"); 
  } catch (ParseException pe) { 
    pe.printStackTrace(); 
  }

Lastly, we print the number of hits for each search operation using a method like this:

public static void displayHits(Hits hits) throws CorruptIndexException, IOException {
  System.out.println("Number of hits: " + hits.length());
  Iterator<Hit> it = hits.iterator();
  while (it.hasNext()) {
    Hit hit = it.next();
    Document document = hit.getDocument();
    String path = document.get(FIELD_PATH);
    System.out.println("Hit: " + path);
  }
}

If we run the above code, it will show us,

Searching for 't*t' using WildcardQuery
Number of hits: 1
Hit: /home/debarshi/workspace/Test/filesToIndex/test-foods.txt
Searching for 't*t' using QueryParser
Type of query: WildcardQuery
Number of hits: 1
Hit: /home/debarshi/workspace/Test/filesToIndex/test-foods.txt
Searching for 'sam*' using WildcardQuery
Number of hits: 1
Hit: /home/debarshi/workspace/Test/filesToIndex/sample-foods.txt
Searching for 'sam*' using QueryParser
Type of query: PrefixQuery
Number of hits: 1
Hit: /home/debarshi/workspace/Test/filesToIndex/sample-foods.txt
Searching for 'te?t' using WildcardQuery
Number of hits: 1
Hit: /home/debarshi/workspace/Test/filesToIndex/test-foods.txt
Searching for 'te?t' using QueryParser
Type of query: WildcardQuery
Number of hits: 1
Hit: /home/debarshi/workspace/Test/filesToIndex/test-foods.txt
Searching for '*est' using WildcardQuery
Number of hits: 1
Hit: /home/debarshi/workspace/Test/filesToIndex/test-foods.txt
Searching for '*est' using QueryParser

  org.apache.lucene.queryParser.ParseException: Cannot parse '*est': '*' or '?' not allowed as first character in WildcardQuery 
  at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:175) 
  at LuceneWildcardQueryDemo.searchIndexWithQueryParser(LuceneWildcardQueryDemo.java:81) 
  at LuceneWildcardQueryDemo.main(LuceneWildcardQueryDemo.java:46)

The eight searches behave as follows:

  1. The first query uses a WildcardQuery object with “t*t”. Since “t*t” matches “test” in the index, this query returns 1 hit.
  2. The second query uses a QueryParser to query “t*t”. The QueryParser parse() method returns a WildcardQuery, and the query returns 1 hit, since it is essentially identical to the first query.
  3. The third query uses a WildcardQuery object with “sam*”. This wildcard query gets one hit, since “sam*” matches “sample”.
  4. The fourth query uses QueryParser with “sam*”. Notice, however, that QueryParser’s parse() method returns a PrefixQuery rather than a WildcardQuery, since the only wildcard is the asterisk at the end of “sam*”. This PrefixQuery for “sam*” gets a hit, since “sam*” matches “sample”.
  5. The fifth query is a WildcardQuery that uses a question mark in its search word, “te?t”. The question mark matches exactly one character. Since “te?t” matches “test”, the search returns 1 hit.
  6. The sixth query uses a QueryParser with “te?t”. The QueryParser parse() method returns a WildcardQuery, and it gets one hit, just like the fifth query.
  7. The seventh query is a WildcardQuery for “*est”. It receives one match, since “test” matches “*est”. In general, it’s not a good idea to perform queries where the first character is a wildcard, since such a query has to examine every term in the field.
  8. The eighth query is a QueryParser query for “*est”. Notice that the QueryParser object does not even allow us to perform a query where the first character is an asterisk. It throws a parse exception.
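
The way QueryParser picked a query type in the console output above can be summarised in a small decision function. The sketch below is a simplified model that reproduces only the behaviour seen in this demo; it is not the parser's real logic, and the class and method names are assumptions made for illustration.

```java
/** Simplified model of how the demo's QueryParser chose a query type. */
final class ParserClassificationSketch {

    /** Returns the simple class name we would expect for a single search term. */
    static String classify(String term) {
        if (term.startsWith("*") || term.startsWith("?")) {
            // Mirrors the ParseException seen for "*est" in the demo output.
            throw new IllegalArgumentException(
                "'*' or '?' not allowed as first character: " + term);
        }
        boolean hasQuestionMark = term.indexOf('?') >= 0;
        int firstStar = term.indexOf('*');
        if (!hasQuestionMark && firstStar < 0) {
            return "TermQuery";      // no wildcard at all
        }
        if (!hasQuestionMark && firstStar == term.length() - 1) {
            return "PrefixQuery";    // a single trailing asterisk
        }
        return "WildcardQuery";      // any other wildcard placement
    }

    public static void main(String[] args) {
        System.out.println(classify("t*t"));  // WildcardQuery
        System.out.println(classify("sam*")); // PrefixQuery
        System.out.println(classify("te?t")); // WildcardQuery
        System.out.println(classify("test")); // TermQuery
    }
}
```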

4.3.Boolean Operators

Boolean operators allow terms to be combined through logic operators. Lucene supports AND, “+”, OR, NOT and “-” as Boolean operators (Note: Boolean operators must be ALL CAPS).

OR

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exists in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.

To search for documents that contain either “jakarta apache” or just “jakarta” use the query:

"jakarta apache" jakarta 

or

"jakarta apache" OR jakarta

AND

The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND.

To search for documents that contain “jakarta apache” and “Apache Lucene” use the query:

"jakarta apache" AND "Apache Lucene" 

+

The “+” or required operator requires that the term after the “+” symbol exists somewhere in a field of a single document.

To search for documents that must contain “jakarta” and may contain “lucene” use the query:

+jakarta lucene 

NOT

The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.

To search for documents that contain “jakarta apache” but not “Apache Lucene” use the query:

"jakarta apache" NOT "Apache Lucene" 

Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:

NOT "jakarta apache" 

“-”

The “-” or prohibit operator excludes documents that contain the term after the “-” symbol.

To search for documents that contain “jakarta apache” but not “Apache Lucene” use the query:

"jakarta apache" -"Apache Lucene" 

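The set analogy used for OR, AND and NOT can be sketched with plain Java collections. Here each operand is represented by the set of document ids that match it; the ids and the class BooleanSetSketch are made up for illustration, and real Lucene evaluates and scores clauses quite differently.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Models OR, AND and NOT as set operations on matching document ids. */
final class BooleanSetSketch {

    /** Convenience factory for a sorted set of (hypothetical) document ids. */
    static Set<Integer> docs(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    /** OR: union of both result sets. */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** AND: intersection of both result sets. */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** NOT (and "-"): documents in a that are not in b. */
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> jakartaApache = docs(1, 2, 3); // docs matching "jakarta apache"
        Set<Integer> apacheLucene = docs(2, 3, 4);  // docs matching "Apache Lucene"
        System.out.println(or(jakartaApache, apacheLucene));  // [1, 2, 3, 4]
        System.out.println(and(jakartaApache, apacheLucene)); // [2, 3]
        System.out.println(not(jakartaApache, apacheLucene)); // [1]
    }
}
```

The difference operation also shows why NOT needs a positive clause next to it: there must be a first set to subtract from.
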
Grouping

Lucene supports using parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.

To search for either “jakarta” or “apache” and “website” use the query:

(jakarta OR apache) AND website 

This eliminates any confusion and makes sure that website must exist and that either jakarta or apache may exist.

Field Grouping

Lucene supports using parentheses to group multiple clauses to a single field.

To search for a title that contains both the word “return” and the phrase “pink panther” use the query:

title:(+return +"pink panther") 

Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters, use “\” (backslash) before the character. For example, to search for (1+1):2 use the query:

\(1\+1\)\:2
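
When the search text comes from user input, the escaping can be done programmatically. Below is a minimal hand-written sketch that puts a backslash in front of every character from the list above; EscapeSketch and its escape method are hypothetical names used only for illustration. Lucene's QueryParser also offers an escape helper of its own in many releases, which is usually preferable to a home-grown version.

```java
/** Escapes query-syntax characters with a backslash; an illustrative sketch. */
final class EscapeSketch {

    // Every single character that takes part in the special-character list.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    /** Returns text with each special character preceded by a backslash. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("(1+1):2")); // \(1\+1\)\:2
    }
}
```

Note that this sketch escapes each “&” and “|” individually instead of treating “&&” and “||” as units.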

4.4.PhraseQuery

A PhraseQuery in Lucene matches documents containing a particular sequence of terms. PhraseQuery uses positional information of the term that is stored in an index.

The number of other words permitted between words in query phrase is called “slop”. It can be set by calling the setSlop method. If zero, then this is an exact phrase search. For larger values this works like a WITHIN or NEAR operator.

The slop is in fact an edit-distance, where the units correspond to moves of terms in the query phrase out of position. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit re-orderings of phrases, the slop must be at least two.

More exact matches are scored higher than sloppier matches, thus search results are sorted by exactness. The slop is zero by default, requiring exact matches.

PhraseQuery also supports multiple term phrases.

A Phrase Query may be combined with other terms or queries with a BooleanQuery. The maximum number of clauses is restricted to 1,024 by default.

In the previous example of lucene wildcard query, we have done the searching operations based on two text files. Now we will try to find matching phrases using PhraseQuery in lucene.

For that, instead of searchIndexWithWildcardQuery() method, we will introduce a new method searchIndexWithPhraseQuery() which takes two strings representing words in the document and a slop value. It constructs a PhraseQuery by adding two Term objects based on the “contents” field and the string1 and string2 parameters. Following that, it sets the slop value of the PhraseQuery object using the setSlop() method of PhraseQuery. The search is conducted by passing the PhraseQuery object to IndexSearcher’s search() method. Here is the code,

public static void searchIndexWithPhraseQuery(String string1, String string2, int slop) throws IOException,ParseException {
  Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
  IndexSearcher indexSearcher = new IndexSearcher(directory);
  Term term1 = new Term(FIELD_CONTENTS, string1);
  Term term2 = new Term(FIELD_CONTENTS, string2);
  PhraseQuery phraseQuery = new PhraseQuery();
  phraseQuery.add(term1);
  phraseQuery.add(term2);
  phraseQuery.setSlop(slop);
  displayQuery(phraseQuery);
  Hits hits = indexSearcher.search(phraseQuery);
  displayHits(hits);
}

And, we call this method from main(),

  searchIndexWithPhraseQuery("french", "fries", 0);
  searchIndexWithPhraseQuery("hamburger", "steak", 0);
  searchIndexWithPhraseQuery("hamburger", "steak", 1);
  searchIndexWithPhraseQuery("hamburger", "steak", 2);
  searchIndexWithPhraseQuery("hamburger", "steak", 3);
  searchIndexWithQueryParser(FIELD_CONTENTS, "french fries"); // BooleanQuery
  searchIndexWithQueryParser(FIELD_CONTENTS, "\"french fries\""); // PhraseQuery
  searchIndexWithQueryParser(FIELD_CONTENTS, "\"hamburger steak\"~1"); // PhraseQuery
  searchIndexWithQueryParser(FIELD_CONTENTS, "\"hamburger steak\"~2"); // PhraseQuery

The first query searches for “french” and “fries” with a slop of 0, meaning that the phrase search ends up being a search for “french fries”, where “french” and “fries” are next to each other. Since this exists in test-foods.txt, we get 1 hit.

In the second query, we search for “hamburger” and “steak” with a slop of 0. Since “hamburger” and “steak” don’t exist next to each other in either document, we get 0 hits. The third query also involves a search for “hamburger” and “steak”, but with a slop of 1. These words are not within 1 word of each other, so we get 0 hits.

The fourth query searches for “hamburger” and “steak” with a slop of 2. In the test-foods.txt file, we have the words “… hamburger french fries steak …”. Since “hamburger” and “steak” are within two words of each other, we get 1 hit. The fifth phrase query is the same search but with a slop of 3. Since “hamburger” and “steak” are within three words of each other (they are two words apart), we get 1 hit.
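
The slop arithmetic in these five queries can be checked with a small calculation. The sketch below is a simplified model for a phrase of exactly two distinct terms that each occur once: it computes how far the second term has to move to sit directly after the first. The class SlopSketch is our own illustration, not Lucene's scorer, and the positions are counted within the excerpt “hamburger french fries steak” (0, 1, 2, 3).

```java
/** Simplified slop model for a two-term phrase; not Lucene's implementation. */
final class SlopSketch {

    /**
     * Smallest slop that lets the phrase (first, second) match when the two
     * terms occur at the given positions in the document.
     */
    static int requiredSlop(int firstPos, int secondPos) {
        // In an exact phrase the second term sits at firstPos + 1. Shifting it
        // back by its query offset (1) makes an exact match give distance 0.
        int shiftedSecond = secondPos - 1;
        return Math.abs(shiftedSecond - firstPos);
    }

    /** True when a phrase query with the given slop would accept the positions. */
    static boolean matches(int firstPos, int secondPos, int slop) {
        return requiredSlop(firstPos, secondPos) <= slop;
    }

    public static void main(String[] args) {
        // Positions in "hamburger french fries steak": 0, 1, 2, 3
        System.out.println(requiredSlop(1, 2)); // french, fries    -> 0
        System.out.println(requiredSlop(0, 3)); // hamburger, steak -> 2
        System.out.println(matches(0, 3, 1));   // false, as in the third query
        System.out.println(matches(0, 3, 2));   // true, as in the fourth query
        System.out.println(requiredSlop(2, 1)); // "fries french" reversed -> 2
    }
}
```

The last line reproduces the rule that swapping two adjacent words costs two moves.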

The next four queries utilize QueryParser. Notice that in the first of the QueryParser queries, we get a BooleanQuery rather than a PhraseQuery. This is because we passed QueryParser’s parse() method “french fries” rather than “\”french fries\””. If we want QueryParser to generate a PhraseQuery, the search string needs to be surrounded by double quotes. The next query does search for “\”french fries\”” and we can see that it generates a PhraseQuery (with the default slop of 0) and gets 1 hit in response to the query.

The last two QueryParser queries demonstrate setting slop values. We can see that the slop value is set by following the closing double quote of the search string with a tilde (~) and the slop number.

4.5.RangeQuery

RangeQuery is a Query that matches documents within a range of terms. It allows one to match documents whose field values are between the lower and upper bound specified by the RangeQuery. Range queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically, that is, terms are compared in dictionary order.

Now, if we want to use RangeQuery for Lucene searching operations, we add a method named searchIndexWithRangeQuery(). It wraps the RangeQuery constructor, which requires a Term indicating the start of the range, a Term indicating the end of the range, and a boolean value indicating whether the search is inclusive of the start and end values (“true”) or exclusive of them (“false”). The code looks like this:

public static void searchIndexWithRangeQuery(String whichField, String start, String end, boolean inclusive)
  throws IOException, ParseException {
  System.out.println("\nSearching for range '" + start + " to " + end + "' using RangeQuery");
  Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
  IndexSearcher indexSearcher = new IndexSearcher(directory);
  Term startTerm = new Term(whichField, start);
  Term endTerm = new Term(whichField, end);
  Query query = new RangeQuery(startTerm, endTerm, inclusive);
  Hits hits = indexSearcher.search(query);
  displayHits(hits);
}

Now we will call the above method,

  searchIndexWithRangeQuery(FIELD_LAST_MODIFIED, "2014-04-01-00-00-00", "2014-04-01-23-59-59", INCLUSIVE);
  searchIndexWithRangeQuery(FIELD_LAST_MODIFIED, "2014-04-02-00-00-00", "2014-04-02-23-59-59", INCLUSIVE);
  searchIndexWithRangeQuery(FIELD_LAST_MODIFIED, "2014-04-01-00-00-00", "2014-04-01-21-21-02", INCLUSIVE);
  searchIndexWithRangeQuery(FIELD_LAST_MODIFIED, "2014-04-01-00-00-00", "2014-04-01-21-21-02", EXCLUSIVE);
  // equivalent range searches using QueryParser
  searchIndexWithQueryParser(FIELD_LAST_MODIFIED, "[2014-04-01-00-00-00 TO 2014-04-01-23-59-59]");
  searchIndexWithQueryParser(FIELD_LAST_MODIFIED, "[2014-04-02-00-00-00 TO 2014-04-02-23-59-59]");
  searchIndexWithQueryParser(FIELD_LAST_MODIFIED, "[2014-04-01-00-00-00 TO 2014-04-01-21-21-02]");
  searchIndexWithQueryParser(FIELD_LAST_MODIFIED, "{2014-04-01-00-00-00 TO 2014-04-01-21-21-02}");

Lastly, there is a small change in the createIndex() method. We have added some date-time related operations and printed the last modified time of the files being indexed.

At the top of the console output, we can see that the two files get indexed, and that the “last modified” times for these files are “2014-04-01-21-21-02” (for test-foods.txt) and “2014-04-01-21-21-38” (for sample-foods.txt).

In the first range query, we search for all files that were last modified on April 1st 2014. This returns 2 hits since both files were last modified on this date. In the second range query, we search for all files that were last modified on April 2nd 2014. This returns 0 hits, since both documents were last modified on April 1st 2014.

Next, we search in a range from 2014-04-01-00-00-00 to 2014-04-01-21-21-02, inclusively. Since test-foods.txt was last modified at 2014-04-01-21-21-02 and the range query includes this value, we get one search hit. Following this, we search in a range from 2014-04-01-00-00-00 to 2014-04-01-21-21-02, exclusively. Since test-foods.txt was last modified at 2014-04-01-21-21-02 and the range query doesn’t include this value (since it is excluded), this search returns 0 hits.
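
Because the range is compared lexicographically, the inclusive and exclusive results above can be reproduced with plain string comparison. The following sketch uses the two “last modified” values from the console output; RangeSketch and inRange are hypothetical names for illustration and do not use Lucene at all.

```java
/** Lexicographic range check on strings, mirroring inclusive/exclusive bounds. */
final class RangeSketch {

    /** True when value lies between lower and upper in dictionary order. */
    static boolean inRange(String value, String lower, String upper, boolean inclusive) {
        int vsLower = value.compareTo(lower);
        int vsUpper = value.compareTo(upper);
        return inclusive
            ? (vsLower >= 0 && vsUpper <= 0)
            : (vsLower > 0 && vsUpper < 0);
    }

    public static void main(String[] args) {
        String testFoods = "2014-04-01-21-21-02";
        String sampleFoods = "2014-04-01-21-21-38";
        String start = "2014-04-01-00-00-00";

        System.out.println(inRange(testFoods, start, "2014-04-01-23-59-59", true));   // true
        System.out.println(inRange(sampleFoods, start, "2014-04-01-23-59-59", true)); // true
        System.out.println(inRange(testFoods, start, testFoods, true));    // true, bound included
        System.out.println(inRange(testFoods, start, testFoods, false));   // false, bound excluded
        System.out.println(inRange(sampleFoods, start, testFoods, true));  // false, 38 > 02

        // Dictionary order is not numeric order, hence the zero-padded format.
        System.out.println("10".compareTo("9") < 0); // true
    }
}
```

The final line shows why the timestamps are written with fixed-width, zero-padded fields: without padding, dictionary order and chronological order would disagree.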

After this, our next four searches show the equivalent searches performed using a QueryParser object. Notice that QueryParser’s parse() method returns a ConstantScoreRangeQuery object rather than a RangeQuery object for each of these queries, as we can see from the console output for these queries.

4.6.Prefix Query

A Query that matches documents containing terms with a specified prefix. A PrefixQuery is built by QueryParser for input like nam*.

We will now search for terms by their prefix, using the same two text files (test-foods.txt and sample-foods.txt).

To do that, we add a method named searchIndexWithPrefixQuery(), which searches the index (created by createIndex()) using a PrefixQuery. This method takes two parameters: the field name and the search string.

public static void searchIndexWithPrefixQuery(String whichField, String searchString) throws IOException,
  ParseException {
  System.out.println("\nSearching for '" + searchString + "' using PrefixQuery");
  Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
  IndexSearcher indexSearcher = new IndexSearcher(directory);
  Term term = new Term(whichField, searchString);
  Query query = new PrefixQuery(term);
  Hits hits = indexSearcher.search(query);
  displayHits(hits);
}

Next we call this method, together with searchIndexWithQueryParser(), from the main() method of the program:

  searchIndexWithPrefixQuery(FIELD_CONTENTS, "tes");
  searchIndexWithQueryParser(FIELD_CONTENTS, "test");
  searchIndexWithQueryParser(FIELD_CONTENTS, "tes*");

The first call builds a PrefixQuery directly. PrefixQuery takes the bare prefix, so no trailing asterisk is passed; an asterisk in the Term text would be treated as a literal character.

In the second call, the query string does not contain an asterisk, so if we print the type of the query returned by QueryParser’s parse() method, it will print TermQuery instead of PrefixQuery.

In the third call, the asterisk indicates to the QueryParser that this is a prefix query, so it returns a PrefixQuery object from its parse() method. This results in a search for the prefix “tes” in the indexed contents, which gives 1 hit, since “test” is in the index.
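
A prefix search can be pictured as a scan over a sorted list of terms: jump to the first term that is not smaller than the prefix, then read on while the terms still start with it. The sketch below shows this idea with a TreeSet; the term list and the class PrefixSketch are invented for illustration, and Lucene's real term dictionary is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Finds all terms with a given prefix in a sorted term set; illustrative only. */
final class PrefixSketch {

    /** Collects the contiguous run of terms that start with the prefix. */
    static List<String> termsWithPrefix(TreeSet<String> terms, String prefix) {
        List<String> result = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matches are contiguous.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last possible match
            }
            result.add(term);
        }
        return result;
    }

    /** A small, made-up term dictionary for the demonstration. */
    static TreeSet<String> sampleTerms() {
        return new TreeSet<>(Arrays.asList(
            "apples", "bananas", "cheese", "salad", "steak", "test", "tester", "tests"));
    }

    public static void main(String[] args) {
        System.out.println(termsWithPrefix(sampleTerms(), "tes")); // [test, tester, tests]
        System.out.println(termsWithPrefix(sampleTerms(), "s"));   // [salad, steak]
        System.out.println(termsWithPrefix(sampleTerms(), "x"));   // []
    }
}
```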

4.7.Fuzzy Query

Fuzzy query is based on the Damerau-Levenshtein (optimal string alignment) algorithm. FuzzyQuery matches terms “close” to a specified base term: we specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: we can search for lucene~1 and it will match luccene (insert c), lucee (remove n), lukene (replace c with k) and a great many other “close” terms. With max edit distance 2 we can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term, so an exact match is scored highest, edit distance 1 lower, and so on.

QueryParser supports fuzzy-term queries using a trailing tilde on a term. For example, searching for wuzza~ will find documents that contain “fuzzy” and “wuzzy”. Edit distance affects scoring, such that lower edit distances score higher.
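
The edit distances quoted above can be verified with a short implementation of the optimal string alignment distance, which counts insertions, deletions, substitutions and swaps of adjacent characters. This is a textbook dynamic-programming sketch under our own class name EditDistanceSketch; it is not the automaton-based machinery a search engine would use internally.

```java
/** Optimal string alignment (restricted Damerau-Levenshtein) distance. */
final class EditDistanceSketch {

    /** Number of single-character edits needed to turn a into b. */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1,      // deletion
                             d[i][j - 1] + 1),     // insertion
                    d[i - 1][j - 1] + cost);       // substitution or match
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent swap
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "luccene")); // 1 (insert c)
        System.out.println(distance("lucene", "lucee"));   // 1 (remove n)
        System.out.println(distance("lucene", "lukene"));  // 1 (replace c with k)
        System.out.println(distance("wuzza", "wuzzy"));    // 1
        System.out.println(distance("wuzza", "fuzzy"));    // 2
    }
}
```

Both “wuzzy” and “fuzzy” are within two edits of “wuzza”, which is consistent with the wuzza~ example.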

Piyas De

Piyas is a Sun Microsystems certified Enterprise Architect with 10+ years of professional IT experience in various areas such as Architecture Definition, Enterprise Application Definition, and Client-server/e-business solutions. Currently he is engaged in providing solutions for digital asset management in media companies. He is also the founder and main author of "Technical Blogs (Blog about small technical Know hows)": http://www.phloxblog.in