Lucene Query (Search) Syntax Examples

This article is part of our Academy Course titled Apache Lucene Fundamentals.

In this course, you will get an introduction to Lucene. You will see why a library like this is important and then learn how searching works in Lucene. Moreover, you will learn how to integrate Lucene Search into your own applications in order to provide robust searching capabilities.

1. Introduction

In this lesson of our course we are going to investigate the basic querying mechanisms offered by Lucene. As you might remember from the introductory lesson, Lucene does not send raw search text to the index; it uses Query objects instead. In this lesson we are going to see the crucial components that work together to convert human-written search phrases into representative structures such as Queries.

2. The Query Class

The Query class is a public abstract class that represents a query to the index. In this section we are going to see the most important Query subclasses that you can use to perform highly tailored queries.

2.1 TermQuery

This is the simplest and most straightforward query you can perform against a Lucene index. You simply search for Documents that contain a single word in a specific Field.

The basic TermQuery constructor is defined as follows: public TermQuery(Term t). As you remember from the first lesson, a Term consists of two parts:

  1. The name of the Field in which this term resides.
  2. The actual value of the Term, which, in the great majority of cases, is a single word, obtained from the analysis of some plain text.

So, if you want to create a TermQuery to find all Documents that contain the word "good" in their "content" Field, here’s how you can do it:

TermQuery termQuery = new TermQuery(new Term("content","good"));

We can use that to search for the word “static” in our previously created index:

String q = "static";

// indexDir is the location of the index created in the previous lesson
Directory directory = FSDirectory.open(indexDir);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(indexReader);

// Not used by TermQuery, which takes the term as-is without analyzing it
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);

TermQuery termQuery = new TermQuery(new Term("content", q));

// maxHits is the maximum number of results to return
TopDocs topDocs = searcher.search(termQuery, maxHits);
ScoreDoc[] hits = topDocs.scoreDocs;

for (ScoreDoc hit : hits) {
    int docId = hit.doc;
    Document d = searcher.doc(docId);
    System.out.println(d.get("fileName") + " Score :" + hit.score);
}

System.out.println("Found " + hits.length);

The output of this would be:

C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Product.java Score :0.29545835
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleSearcher.java Score :0.27245367
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PropertyObject.java Score :0.24368995
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleIndexer.java Score :0.14772917
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\TestSerlvet.java Score :0.14621398
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java Score :0.13785185
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\MyServlet.java Score :0.12184498
Found 7

As you can see, seven of my source files contain the "static" keyword. That is all there is to it. Naturally, if you add another word to the query string, the search will return 0 results, because a TermQuery looks for one single Term. For example, if you set your query string to:

String q = "private static";

The output would be:

Found 0

Now, I know that "private static" is present in many of my source files. But as you might remember, we used the StandardAnalyzer to process the plain text retrieved from our files during indexing. StandardAnalyzer splits the text into individual words (and lowercases them), so every Term contains one single word and no Term has the value "private static". You can choose not to tokenize an indexed Field, but I would suggest doing that only for Fields that contain meta-information about the Document, e.g. the title or the author, and not for the Fields that hold its content. For example, if you choose not to tokenize an indexed Field with name 'author' and value 'James Wilslow', the Field 'author' will contain only one Term with value 'James Wilslow' as a whole. If you did tokenize the Field, it would contain two Terms, one for 'James' and one for 'Wilslow'.
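To make the difference between a tokenized and a non-tokenized Field more concrete, here is a small plain-Java sketch. It does not use Lucene at all: the class TokenizeSketch and its methods are hypothetical names, and the splitting rule (lowercase, then split on anything that is not a letter or digit) is only a rough stand-in for what an analyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: imitates the effect of analysis on a field value.
final class TokenizeSketch {

    // Rough stand-in for an analyzer: lowercase, then split on non-letter/non-digit characters.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // A non-tokenized field keeps the whole value as one single term.
    static List<String> keepWhole(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("private static final int x;"));
        System.out.println(tokenize("James Wilslow"));
        System.out.println(keepWhole("James Wilslow"));
    }
}
```

With the tokenized list, a lookup for the single term "private static" cannot succeed, because only the individual words exist as terms. That is exactly why the TermQuery above found nothing.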

2.2 PhraseQuery

With PhraseQuery you can search for Documents that contain a particular sequence of words, aka phrases.

You can create a PhraseQuery like this:

PhraseQuery phraseQuery = new PhraseQuery();

And then you can add Terms to it. For example if you want to search for Documents that contain the phrase “private static” in their “content” Field, you could do it like so:

PhraseQuery phraseQuery = new PhraseQuery();

phraseQuery.add(new Term("content","private"));
phraseQuery.add(new Term("content","static"));

TopDocs topDocs = searcher.search(phraseQuery, maxHits);

ScoreDoc[] hits = topDocs.scoreDocs;

for (ScoreDoc hit : hits) {
      int docId = hit.doc;
      Document d = searcher.doc(docId);
      System.out.println(d.get("fileName") + " Score :" + hit.score);
}

System.out.println("Found " + hits.length);

The output would be :

C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Product.java Score :0.54864377
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PropertyObject.java Score :0.45251375
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleSearcher.java Score :0.45251375
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\TestSerlvet.java Score :0.27150828
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java Score :0.25598043
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\MyServlet.java Score :0.22625688
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleIndexer.java Score :0.22398287
Found 7

A Document makes it into the results only if the Field contains both the words "private" and "static", consecutively and in that exact order.

So if you change the above code to something like this:

phraseQuery.add(new Term("content","private"));
phraseQuery.add(new Term("content","final"));

You will get :

Found 0

That’s because, although my source files contain both words, they are not consecutive. To relax that behavior a little you can add a slop to the PhraseQuery. With a slop of 1, you allow at most one word to intervene between the words of your phrase. With a slop of 2, you allow at most 2 words between the words of your phrase.

Interestingly, as the PhraseQuery documentation puts it: “The slop is in fact an edit-distance, where the units correspond to moves of terms in the query phrase out of position. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit re-orderings of phrases, the slop must be at least two.”

So if we do:

PhraseQuery phraseQuery = new PhraseQuery();

phraseQuery.add(new Term("content","private"));
phraseQuery.add(new Term("content","final"));

phraseQuery.setSlop(2);

The output of our search will give:

C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Product.java Score :0.38794976
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PropertyObject.java Score :0.31997555
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleSearcher.java Score :0.31997555
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\TestSerlvet.java Score :0.19198532
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java Score :0.18100551
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\MyServlet.java Score :0.15998778
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleIndexer.java Score :0.15837982

It is important to mention that documents containing phrases closer to the exact phrase of the query will get higher scores.
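The edit-distance reading of slop can be sketched in a few lines of plain Java. This is a simplified model for a two-term phrase only, not Lucene's actual matching code: SlopSketch and requiredSlop are hypothetical names, and the positions are simply the word offsets of the two terms inside the Field.

```java
// Hypothetical helper, not part of Lucene: a simplified slop model for a two-term phrase.
final class SlopSketch {

    // posFirst / posSecond: word positions in the document of the first and second query term.
    // An exact phrase has posSecond == posFirst + 1, which needs slop 0.
    static int requiredSlop(int posFirst, int posSecond) {
        return Math.abs(posSecond - posFirst - 1);
    }

    public static void main(String[] args) {
        System.out.println(requiredSlop(0, 1)); // terms adjacent and in order
        System.out.println(requiredSlop(0, 3)); // two other words in between
        System.out.println(requiredSlop(1, 0)); // terms adjacent but in reverse order
    }
}
```

Under this model, two words in between need a slop of 2, and so does a plain swap of two adjacent words, which agrees with the "two moves" mentioned in the quote.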

2.3 BooleanQuery

BooleanQuery is a more expressive and powerful tool, as it lets you combine multiple Queries with Boolean clauses. A BooleanQuery is populated with BooleanClauses. A BooleanClause consists of a Query and the role that this Query should play in the boolean search.

To be more specific, a boolean clause can play the following roles in a query:

  1. MUST : This is pretty self-explanatory. A Document makes it into the results if and only if it matches that clause.
  2. MUST NOT (MUST_NOT in code) : The exact opposite case. For a Document to make it into the results, it must not match that clause.
  3. SHOULD : This is for clauses that may match a Document, but do not have to match for the Document to make it into the results.

If you have a boolean query with only SHOULD clauses, each result matches at least one of the clauses. This looks like the classic OR boolean operator, but it is not so straightforward to use it properly.
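If the three roles feel abstract, the following plain-Java sketch models them without Lucene. A "document" is just a set of words, and BooleanSketch, Occur and term are hypothetical names chosen for this illustration. The rule it encodes is the one described above: every MUST has to match, no MUST_NOT may match, and when there is no MUST at all, at least one SHOULD has to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of boolean clause semantics; a document is a plain set of words.
final class BooleanSketch implements Predicate<Set<String>> {

    enum Occur { MUST, MUST_NOT, SHOULD }

    private final List<Predicate<Set<String>>> queries = new ArrayList<>();
    private final List<Occur> occurs = new ArrayList<>();

    BooleanSketch add(Predicate<Set<String>> query, Occur occur) {
        queries.add(query);
        occurs.add(occur);
        return this;
    }

    // Stand-in for a TermQuery: matches when the document contains the word.
    static Predicate<Set<String>> term(String word) {
        return doc -> doc.contains(word);
    }

    @Override
    public boolean test(Set<String> doc) {
        boolean hasMust = false;
        boolean anyShouldHit = false;
        for (int i = 0; i < queries.size(); i++) {
            boolean hit = queries.get(i).test(doc);
            switch (occurs.get(i)) {
                case MUST:
                    hasMust = true;
                    if (!hit) return false;   // a required clause is missing
                    break;
                case MUST_NOT:
                    if (hit) return false;    // a prohibited clause is present
                    break;
                case SHOULD:
                    anyShouldHit |= hit;      // optional, remembered for the final decision
                    break;
            }
        }
        // Without any MUST clause, at least one SHOULD clause has to match.
        return hasMust || anyShouldHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("string", "nikos"));
        BooleanSketch q = new BooleanSketch()
                .add(term("string"), Occur.MUST)
                .add(term("int"), Occur.MUST_NOT);
        System.out.println(q.test(doc));
    }
}
```

Because BooleanSketch is itself a Predicate, one instance can be added as a clause of another, which mirrors how BooleanQuery instances can be nested.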

Now let’s see some examples. Let’s find the source files that contain the word “string” but don’t contain the word “int”.

TermQuery termQuery = new TermQuery(new Term("content","string"));
TermQuery termQuery2 = new TermQuery(new Term("content","int"));

BooleanClause booleanClause1 = new BooleanClause(termQuery, BooleanClause.Occur.MUST);
BooleanClause booleanClause2 = new BooleanClause(termQuery2, BooleanClause.Occur.MUST_NOT);

BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(booleanClause1);
booleanQuery.add(booleanClause2);

TopDocs topDocs = searcher.search(booleanQuery, maxHits);

Here is the result:

C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleEJB.java Score :0.45057273
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PropertyObject.java Score :0.39020744
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java Score :0.20150226
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\TestSerlvet.java Score :0.13517183
Found 4

Now let’s try to find all the Documents that contain the word “nikos” and the phrase “httpservletresponse response”. In the following snippet you can see how you can avoid creating BooleanClause instances, making your code more compact.

TermQuery termQuery = new TermQuery(new Term("content","nikos"));

PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("content","httpservletresponse"));
phraseQuery.add(new Term("content","response"));

BooleanQuery booleanQuery = new BooleanQuery();

booleanQuery.add(phraseQuery,BooleanClause.Occur.MUST);
booleanQuery.add(termQuery,BooleanClause.Occur.MUST);

TopDocs topDocs = searcher.search(booleanQuery, maxHits);

This is the result:

C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java Score :0.3148332
Found 1

Let’s find all the Documents that contain the word “int” or the word “nikos”. As you might imagine, you must use the SHOULD specification somehow:

TermQuery termQuery = new TermQuery(new Term("content","int"));
TermQuery termQuery2 = new TermQuery(new Term("content","nikos"));

BooleanQuery booleanQuery = new BooleanQuery();

booleanQuery.add(termQuery,BooleanClause.Occur.SHOULD);
booleanQuery.add(termQuery2,BooleanClause.Occur.SHOULD);

TopDocs topDocs = searcher.search(booleanQuery, maxHits);

This was not too hard but it is a bit tricky to create more complicated disjunctive queries. It is not always straightforward how you can use SHOULD correctly.

For example let’s try to find all the Documents that contain the word “nikos” and the phrase “httpservletresponse response” or contain the word “int”. One could write something like this:

TermQuery termQuery = new TermQuery(new Term("content","nikos"));

PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("content","httpservletresponse"));
phraseQuery.add(new Term("content","response"));

BooleanQuery booleanQuery = new BooleanQuery();

booleanQuery.add(phraseQuery,BooleanClause.Occur.MUST);
booleanQuery.add(termQuery,BooleanClause.Occur.MUST);
booleanQuery.add(new TermQuery(new Term("content","int")),BooleanClause.Occur.SHOULD);

TopDocs topDocs = searcher.search(booleanQuery, maxHits);

But this query would fail to provide the results you want. Remember that the results of this query, as we’ve constructed it, MUST contain the word "nikos" and MUST contain the phrase "httpservletresponse response" at the same time; next to those MUST clauses, the SHOULD clause on "int" is optional and only influences scoring. But this is not what you want. You want the documents that contain the word "nikos" and the phrase "httpservletresponse response", but you also want documents that contain the word "int" independently, no matter whether they match the other clauses. To be fair, the above boolean query is ambiguous to begin with, because in strict boolean syntax you would never write something like A AND B OR C. You would write (A AND B) OR C, or A AND (B OR C). See the difference?

So you should write the query you want like this: ( “nikos” AND “httpservletresponse response” ) OR “int”.

You can do that by combining BooleanQueries. Using the above strict syntax it is not very hard to imagine how this would go:

TermQuery termQuery = new TermQuery(new Term("content","nikos"));

PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("content","httpservletresponse"));
phraseQuery.add(new Term("content","response"));

// (A AND B)
BooleanQuery conjunctiveQuery = new BooleanQuery();
conjunctiveQuery.add(termQuery,BooleanClause.Occur.MUST);
conjunctiveQuery.add(phraseQuery,BooleanClause.Occur.MUST);

BooleanQuery disjunctiveQuery = new BooleanQuery();

// (A AND B) OR C
disjunctiveQuery.add(conjunctiveQuery,BooleanClause.Occur.SHOULD);
disjunctiveQuery.add(new TermQuery(new Term("content","int")),BooleanClause.Occur.SHOULD);

TopDocs topDocs = searcher.search(disjunctiveQuery, maxHits);
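To see how much the grouping matters, here is a tiny plain-Java truth-table sketch. It is only an illustration with hypothetical names (PrecedenceSketch, flat, nested): a, b and c stand for "the document matches clause A, B, C", and flat models a single query with MUST A, MUST B, SHOULD C, where the SHOULD clause does not decide whether a document matches.

```java
// Hypothetical illustration: compares the flat MUST/MUST/SHOULD query with (A AND B) OR C.
final class PrecedenceSketch {

    // One BooleanQuery with MUST a, MUST b, SHOULD c: c is optional, so it is ignored for matching.
    static boolean flat(boolean a, boolean b, boolean c) {
        return a && b;
    }

    // Two nested BooleanQueries: (a AND b) OR c.
    static boolean nested(boolean a, boolean b, boolean c) {
        return (a && b) || c;
    }

    // Counts the combinations of a, b, c for which the two queries disagree.
    static int countDifferences() {
        int differences = 0;
        boolean[] values = {false, true};
        for (boolean a : values) {
            for (boolean b : values) {
                for (boolean c : values) {
                    if (flat(a, b, c) != nested(a, b, c)) {
                        differences++;
                    }
                }
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        System.out.println("Combinations that differ: " + countDifferences());
    }
}
```

The two forms disagree exactly for the documents that match C but not both A and B, which are the documents the flat query silently drops.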

This is a quick guide you can follow when constructing boolean queries using the BooleanQuery class:

  • X AND Y
    BooleanQuery bool = new BooleanQuery();
    bool.add(X, BooleanClause.Occur.MUST);
    bool.add(Y, BooleanClause.Occur.MUST);

  • X OR Y
    BooleanQuery bool = new BooleanQuery();
    bool.add(X, BooleanClause.Occur.SHOULD);
    bool.add(Y, BooleanClause.Occur.SHOULD);

  • X AND (NOT Y)
    BooleanQuery bool = new BooleanQuery();
    bool.add(X, BooleanClause.Occur.MUST);
    bool.add(Y, BooleanClause.Occur.MUST_NOT);

  • (X AND Y) OR Z
    BooleanQuery conj = new BooleanQuery();
    conj.add(X, BooleanClause.Occur.MUST);
    conj.add(Y, BooleanClause.Occur.MUST);

    BooleanQuery disj = new BooleanQuery();
    disj.add(conj, BooleanClause.Occur.SHOULD);
    disj.add(Z, BooleanClause.Occur.SHOULD);

  • (X OR Y) AND Z
    BooleanQuery disj = new BooleanQuery();
    disj.add(X, BooleanClause.Occur.SHOULD);
    disj.add(Y, BooleanClause.Occur.SHOULD);

    BooleanQuery conj = new BooleanQuery();
    conj.add(disj, BooleanClause.Occur.MUST);
    conj.add(Z, BooleanClause.Occur.MUST);

  • X OR (NOT Z)
    // A query with only MUST_NOT clauses matches nothing, so NOT Z is
    // expressed as "all documents except those matching Z"
    BooleanQuery neg = new BooleanQuery();
    neg.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
    neg.add(Z, BooleanClause.Occur.MUST_NOT);

    BooleanQuery disj = new BooleanQuery();
    disj.add(neg, BooleanClause.Occur.SHOULD);
    disj.add(X, BooleanClause.Occur.SHOULD);

The above can be used to create more and more complex boolean queries.

2.4 WildcardQuery

As the name suggests, you can use the WildcardQuery class to perform wildcard queries using the ‘*’ or ‘?’ characters. For example, if you want to search for documents that contain terms starting with ‘ni’ followed by any other character sequence, you can search for ‘ni*’. If you want to search for terms that start with ‘jamie’ followed by exactly one (arbitrary) character, you can search for ‘jamie?’. Simple as that. Naturally, WildcardQueries can be slow, because the search may have to go through a lot of different terms to find matches. It is generally good practice to avoid placing the wildcard character at the front of the word, like “*abcde”.

Let’s see an example:

Query wildcardQuery = new WildcardQuery(new Term("content", "n*os"));
TopDocs topDocs = searcher.search(wildcardQuery, maxHits);

And

Query wildcardQuery = new WildcardQuery(new Term("content", "niko?"));
TopDocs topDocs = searcher.search(wildcardQuery, maxHits);
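If you want to check which terms a wildcard pattern would accept, the rules can be sketched in plain Java by translating the pattern into a java.util.regex expression. This is not how Lucene evaluates wildcards, and WildcardSketch is a hypothetical helper; it only assumes the two rules described above ('*' for any character sequence, including an empty one, and '?' for exactly one character) and ignores escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: checks a term against a '*' / '?' wildcard pattern.
final class WildcardSketch {

    // Translates the wildcard pattern into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char ch : wildcard.toCharArray()) {
            if (ch == '*') {
                regex.append(".*");                                 // any sequence, possibly empty
            } else if (ch == '?') {
                regex.append('.');                                  // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));    // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // The whole term has to match the pattern, not just a part of it.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("n*os", "nikos"));
        System.out.println(matches("niko?", "nikos"));
        System.out.println(matches("niko?", "niko"));
    }
}
```

Note that ‘niko?’ accepts ‘nikos’ but not ‘niko’, since the ‘?’ always consumes one character.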

2.5 RegexpQuery

Using RegexpQuery, you can perform regular expression queries, which Lucene evaluates with a very fast automaton implementation. Here is an example:

Query regexpQuery = new RegexpQuery(new Term("content","n[a-z]+"));

TopDocs topDocs = searcher.search(regexpQuery, maxHits);

2.6 TermRangeQuery

This Query subclass is useful when performing range queries on string terms. For example, you can search for terms between the words “abc” and “xyz”. The terms are compared byte by byte (the documentation describes this as Byte.compareTo(Byte)). You might find this particularly useful for range queries on the meta-data of your documents, like titles and even dates (in the case of dates, be careful to use DateTools).

Here is how you can find all documents created during the last week:

Calendar c = Calendar.getInstance();
c.add(Calendar.DATE, -7);
Date lastWeek = c.getTime();

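// newStringRange expects the lower bound first, then the upper bound
Query termRangeQuery = TermRangeQuery.newStringRange("date",
        DateTools.dateToString(lastWeek, DateTools.Resolution.DAY),   // lower bound: one week ago
        DateTools.dateToString(new Date(), DateTools.Resolution.DAY), // upper bound: today
        true, true);                                                  // include both bounds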

Of course you have to be careful when indexing the “date” field. You have to apply DateTools.dateToString to it as well, and specify that field not to be analyzed (so it’s not tokenized and split into words).
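The reason this works is that day-resolution date strings are fixed-width, so comparing them as strings gives the same order as comparing the dates. The plain-Java sketch below illustrates this with java.time instead of Lucene. DateRangeSketch is a hypothetical helper, and it assumes terms of the form yyyyMMdd, which is, as far as I know, what DateTools.dateToString produces for Resolution.DAY.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Lucene: string range check on yyyyMMdd date terms.
final class DateRangeSketch {

    // Formats a date as a fixed-width yyyyMMdd term (assumed to mirror Resolution.DAY).
    static String toTerm(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Inclusive range check using plain string comparison.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2014, 3, 10);   // fixed example date
        LocalDate lastWeek = today.minusDays(7);
        String indexed = toTerm(LocalDate.of(2014, 3, 5));

        // Lower bound first, upper bound second
        System.out.println(inRange(indexed, toTerm(lastWeek), toTerm(today)));
        // Swapped bounds describe an empty range
        System.out.println(inRange(indexed, toTerm(today), toTerm(lastWeek)));
    }
}
```

The second call shows why the order of the bounds matters: with the bounds swapped, no term can fall inside the range.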

2.7 NumericRangeQuery

This is for performing numeric range queries. Imagine that you have a field “wordcount” that stores the number of the words of that document, and you want to retrieve documents that have between 2000 and 10000 words:

Query numericRangeQuery = NumericRangeQuery.newIntRange("wordcount",2000,10000,true,true);

The boolean arguments dictate that the upper and lower limits are included in the range.
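The effect of the two boolean flags can be spelled out in a few lines of plain Java. This is just an illustration of inclusive versus exclusive bounds with hypothetical names (IntRangeSketch, inRange), not Lucene code.

```java
// Hypothetical helper, not part of Lucene: the meaning of the two inclusive flags.
final class IntRangeSketch {

    static boolean inRange(int value, int min, int max, boolean minInclusive, boolean maxInclusive) {
        boolean aboveMin = minInclusive ? value >= min : value > min;
        boolean belowMax = maxInclusive ? value <= max : value < max;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        // A document with exactly 2000 words is only found when the lower bound is inclusive
        System.out.println(inRange(2000, 2000, 10000, true, true));
        System.out.println(inRange(2000, 2000, 10000, false, true));
    }
}
```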

2.8 FuzzyQuery

This is a very interesting Query subclass. It evaluates terms according to proximity measures, like the well-known Damerau-Levenshtein distance, and finds words that are lexicographically close to the query term. If you want to build a lexicography-heavy application, like a dictionary or a “did you mean” word suggestion feature, you can use the SpellChecker API.

Let’s see how you can perform a fuzzy query search, with an unfortunate ‘string’ misspelling:

Query fuzzyQuery = new FuzzyQuery(new Term("content","srng"));
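To get a feeling for how "close" two words are, here is a plain-Java sketch of the classic Levenshtein distance (insertions, deletions and substitutions). It is not Lucene's implementation, which is automaton-based, and it leaves out the transposition step of the Damerau variant; EditDistanceSketch is a hypothetical name.

```java
// Hypothetical helper, not part of Lucene: classic Levenshtein distance with two rolling rows.
final class EditDistanceSketch {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),   // insertion, deletion
                        prev[j - 1] + substitution);              // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("srng", "string"));
    }
}
```

The misspelling ‘srng’ is two insertions away from ‘string’, so a fuzzy search that tolerates up to two edits can still find it.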

Piyas De

Piyas is a Sun Microsystems certified Enterprise Architect with 10+ years of professional IT experience in various areas such as Architecture Definition, Define Enterprise Application, Client-server/e-business solutions. Currently he is engaged in providing solutions for digital asset management in media companies. He is also founder and main author of "Technical Blogs (Blog about small technical Know hows)": http://www.phloxblog.in