Integrating Lucene Search into an Application

This article is part of our Academy Course titled Apache Lucene Fundamentals.

In this course, you will get an introduction to Lucene. You will see why a library like this is important and then learn how searching works in Lucene. Moreover, you will learn how to integrate Lucene Search into your own applications in order to provide robust searching capabilities.

1. Introduction

Apache Lucene provides a powerful query language for searching large amounts of data.

A query is broken up into terms and operators. There are three types of terms: Single Terms, Phrases, and Subqueries. A Single Term is a single word such as “test” or “hello”. A Phrase is a group of words surrounded by double quotes such as “hello dolly”. A Subquery is a query surrounded by parentheses such as “(hello dolly)”.

Lucene supports fields of data. When performing a search you can either specify a field or use the default field. The available field names depend on the indexed data, and the default field is defined by the current settings.

2. Parsing a query string

The job of a query parser is to convert a query string submitted by a user into query objects.

The parser reads the query string and interprets its content. Here is an example:

    "query_string" : {
        "default_field" : "content",
        "query" : "this AND that OR thus"
    }

The query_string top level parameters include:

  • query -> The actual query to be parsed.
  • default_field -> The default field for query terms if no prefix field is specified. Defaults to the index.query.default_field index setting, which in turn defaults to _all.
  • default_operator -> The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with a default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR.
  • analyzer -> The analyzer name used to analyze the query string.
  • allow_leading_wildcard -> When set, * or ? are allowed as the first character. Defaults to true.
  • lowercase_expanded_terms -> Whether terms of wildcard, prefix, fuzzy, and range queries are to be automatically lower-cased or not (since they are not analyzed). Defaults to true.
  • enable_position_increments -> Set to true to enable position increments in result queries. Defaults to true.
  • fuzzy_max_expansions -> Controls the number of terms fuzzy queries will expand to. Defaults to 50.
  • fuzziness -> Sets the fuzziness for fuzzy queries. Defaults to AUTO.
  • fuzzy_prefix_length -> Sets the prefix length for fuzzy queries. Defaults to 0.
  • phrase_slop -> Sets the default slop for phrases. If zero, then exact phrase matches are required. Defaults to 0.
  • boost -> Sets the boost value of the query. Defaults to 1.0.
  • analyze_wildcard -> By default, wildcard terms in a query string are not analyzed. By setting this value to true, a best effort will be made to analyze those as well.
  • auto_generate_phrase_queries -> Defaults to false.
  • minimum_should_match -> A value controlling how many “should” clauses in the resulting boolean query should match. It can be an absolute value (2), a percentage (30%) or a combination of both.
  • lenient -> If set to true, format-based failures (like providing text to a numeric field) are ignored.
  • locale -> Locale that should be used for string conversions. Defaults to ROOT. Added in 1.1.0.

Table 1
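The effect of default_operator on a query made up of bare terms can be illustrated with a few lines of plain Java. This is only a toy sketch: the class and method names are made up for illustration, and a real query parser also has to deal with phrases, field prefixes and explicit operators.

```java
class DefaultOperatorSketch {

    /** Joins whitespace-separated bare terms with the given default operator. */
    static String applyDefaultOperator(String query, String operator) {
        String[] terms = query.trim().split("\\s+");
        return String.join(" " + operator + " ", terms);
    }

    public static void main(String[] args) {
        System.out.println(applyDefaultOperator("capital of Hungary", "OR"));
        System.out.println(applyDefaultOperator("capital of Hungary", "AND"));
    }
}
```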

When a multi term query is being generated, one can control how it gets rewritten using the rewrite parameter.

2.1. Rules of QueryParser

Suppose you are searching the Web for pages that contain both the words java and net but not the word dot. What if search engines made you type in something like the following for this simple query?

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("contents","java")), true, false);
query.add(new TermQuery(new Term("contents", "net")), true, false);
query.add(new TermQuery(new Term("contents", "dot")), false, true);

That would be a real drag. Thankfully, Google, Nutch, and other search engines are friendlier than that, allowing you to enter something much more succinct: java AND net NOT dot. First we’ll see what is involved in using QueryParser in an application.
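To make the meaning of java AND net NOT dot concrete, here is a minimal sketch in plain Java that checks a set of document terms against required and prohibited terms. It does not use Lucene; the class and helper names are hypothetical and only illustrate the boolean logic.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BooleanMatchSketch {

    /** True if the document holds every required term and no prohibited term. */
    static boolean matches(Set<String> docTerms, Set<String> required, Set<String> prohibited) {
        if (!docTerms.containsAll(required)) {
            return false;
        }
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return false;
            }
        }
        return true;
    }

    /** Small helper to build a term set from words. */
    static Set<String> terms(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        Set<String> required = terms("java", "net");
        Set<String> prohibited = terms("dot");
        System.out.println(matches(terms("java", "net", "io"), required, prohibited));
        System.out.println(matches(terms("java", "net", "dot"), required, prohibited));
    }
}
```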

2.2. Using QueryParser

Using a QueryParser is quite straightforward. Three things are needed: an expression, the default field name to use for unqualified fields in the expression, and an analyzer to analyze pieces of the expression. Field-selection qualifiers are discussed in the query syntax section. Analysis, specific to query parsing, is covered in the “Analysis paralysis” section. Now, let’s parse an expression:

String humanQuery = getHumanQuery();
Query query = QueryParser.parse(humanQuery, "contents", new StandardAnalyzer());

Once you’ve obtained a Query object, searching is done the same as if the query had been created directly through the API. Here is a full method to search an existing index with a user-entered query string, and display the results to the console:

public static void search(File indexDir, String q) throws Exception {
    Directory fsDir = FSDirectory.getDirectory(indexDir, false);
    IndexSearcher is = new IndexSearcher(fsDir);
    Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());
    Hits hits = is.search(query);
    System.out.println("Found " + hits.length() +
                       " document(s) that matched query '" + q + "':");
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        // Print the stored fields of each matching document
        System.out.println(doc);
    }
}

Expressions handed to the QueryParser are parsed according to a simple grammar. When an illegal expression is encountered, QueryParser throws a ParseException.

2.3. QueryParser expression syntax

The following items in this section describe the syntax that QueryParser supports to create the various query types.

Single-term query

A query string of only a single word is converted to an underlying TermQuery.

Phrase query

To search for a group of words together in a field, surround the words with double quotes. The query “hello world” corresponds to an exact phrase match, requiring “hello” and “world” to be successive terms for a match. Lucene also supports sloppy phrase queries, where the terms between quotes do not necessarily have to be in the exact order. The slop factor is compared against the number of moves it takes to rearrange the terms into the exact order. If the number of moves is less than or equal to the specified slop factor, it is a match.

QueryParser parses the expression “hello world”~2 as a PhraseQuery with a slop factor of 2, allowing matches on the phrases “world hello”, “hello world”, “hello * world”, and “hello * * world”, where the asterisks represent irrelevant words in the index. Note that “world * hello” does not match with a slop factor of 2, because the number of moves needed to get that back to “hello world” is 3: hopping the word “world” to the asterisk position is one, to the “hello” position is two, and the third hop makes the exact match.
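The move counts quoted above can be reproduced with a small calculation: for every query term, take the difference between its position in the document and its position in the query, then subtract the smallest difference from the largest. The sketch below is plain Java, not Lucene code; it assumes each query term occurs exactly once in the document, and the class and method names are hypothetical. A phrase matches when the returned value does not exceed the slop factor.

```java
import java.util.Arrays;
import java.util.List;

class SlopSketch {

    /**
     * Smallest slop under which the query terms match the document tokens,
     * assuming every query term occurs exactly once in the document.
     */
    static int requiredSlop(String[] queryTerms, String[] docTokens) {
        List<String> doc = Arrays.asList(docTokens);
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int queryPos = 0; queryPos < queryTerms.length; queryPos++) {
            int docPos = doc.indexOf(queryTerms[queryPos]);
            if (docPos < 0) {
                throw new IllegalArgumentException("term not in document: " + queryTerms[queryPos]);
            }
            // How far this term sits from where the exact phrase would put it
            int shift = docPos - queryPos;
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    public static void main(String[] args) {
        String[] query = {"hello", "world"};
        System.out.println(requiredSlop(query, new String[] {"world", "hello"}));
        System.out.println(requiredSlop(query, new String[] {"world", "*", "hello"}));
    }
}
```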

Range query

Text or date range queries use bracketed syntax, with TO between the beginning term and ending term. The type of bracket determines whether the range is inclusive (square brackets) or exclusive (curly brackets).

NOTES: Non-date range queries use the start and end terms as the user entered them without modification. In the case of {Aardvark TO Zebra}, the terms are not lowercased. Start and end terms must not contain whitespace, or parsing fails; only single words are allowed. The analyzer is not run on the start and end terms.

Date range handling

When a range query (such as [1/1/03 TO 12/31/03]) is encountered, the parser code first attempts to convert the start and end terms to dates. If the terms are valid dates, according to DateFormat.SHORT and lenient parsing, then the dates are converted to their internal textual representation (however, date field indexing is beyond the scope of this article). If either of the two terms fails to parse as a valid date, they both are used as is for a textual range.
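The date-or-text decision described above can be sketched with the JDK's own DateFormat. This is an illustration rather than the parser's actual code: the class and method names are hypothetical, and Locale.US is an assumption made so that 1/1/03 is read as month/day/year.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.util.Date;
import java.util.Locale;

class DateRangeSketch {

    /** Parses a term as a short-format date with lenient parsing, or returns null. */
    static Date tryParseShortDate(String term) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.SHORT, Locale.US);
        format.setLenient(true);
        try {
            return format.parse(term);
        } catch (ParseException e) {
            return null;
        }
    }

    /** A range is treated as a date range only if both end points parse as dates. */
    static boolean isDateRange(String start, String end) {
        return tryParseShortDate(start) != null && tryParseShortDate(end) != null;
    }

    public static void main(String[] args) {
        System.out.println(isDateRange("1/1/03", "12/31/03"));
        System.out.println(isDateRange("Aardvark", "Zebra"));
    }
}
```

If either end point fails to parse, both terms would be used as plain text, which is the fallback the paragraph above describes.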

Wildcard and prefix queries

If a term contains an asterisk or question mark, it is considered a WildcardQuery, except when the term only contains a trailing asterisk and QueryParser optimizes it to a PrefixQuery instead. While the WildcardQuery API itself supports a leading wildcard character, the QueryParser does not allow it. An example wildcard query is w*ldc?rd, whereas the query prefix* is optimized as a PrefixQuery.
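As a rough illustration of the wildcard semantics, the following plain-Java sketch translates a wildcard term into a java.util.regex pattern and checks whether a term is a pure prefix pattern. It is not how Lucene evaluates wildcards internally, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class WildcardSketch {

    /** Translates * (any character sequence) and ? (any single character) into a regex. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True if the whole term matches the wildcard expression. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    /** True when the only wildcard is a single trailing asterisk, as in prefix*. */
    static boolean isPrefixPattern(String wildcard) {
        int star = wildcard.indexOf('*');
        return star >= 0 && star == wildcard.length() - 1 && wildcard.indexOf('?') < 0;
    }

    public static void main(String[] args) {
        System.out.println(matches("w*ldc?rd", "wildcard"));
        System.out.println(isPrefixPattern("prefix*"));
    }
}
```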

Fuzzy query

Lucene’s FuzzyQuery matches terms close to a specified term. The Levenshtein distance algorithm determines how close terms in the index are to a specified target term. “Edit distance” is another term for “Levenshtein distance” and is a measure of similarity between two strings, where distance is measured as the number of character deletions, insertions, or substitutions required to transform one string into the other. For example, the edit distance between “three” and “tree” is one, as only one character deletion is needed. The number of moves is used in a threshold calculation, which is a ratio of distance to string length. QueryParser supports fuzzy-term queries using a trailing tilde on a term. For example, searching for wuzza~ will find documents that contain “fuzzy” and “wuzzy”. Edit distance affects scoring, such that lower edit distances score higher.
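The edit distance itself is easy to compute with the classic dynamic-programming formulation. The following is a minimal stand-alone sketch in plain Java (the class name is hypothetical, and this is not Lucene's own implementation):

```java
class EditDistanceSketch {

    /** Levenshtein distance: minimum number of insertions, deletions and substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("three", "tree"));
        System.out.println(distance("wuzza", "fuzzy"));
    }
}
```

With this measure, “wuzza” is one edit away from “wuzzy” and two edits away from “fuzzy”, which is why both count as close terms in the example above.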

Boolean query

Constructing Boolean queries textually is done using the operators AND, OR, and NOT. Terms listed without an operator specified use an implicit operator, which by default is OR. A query of abc xyz will be interpreted as abc OR xyz. Placing a NOT in front of a term excludes documents containing the following term. Negating a term must be combined with at least one non-negated term to return documents. Each of the uppercase word operators has shortcut syntax shown in the following table.

Verbose syntax    Shortcut syntax
a AND b           +a +b
a OR b            a b
a AND NOT b       +a -b

Table 2

QueryParser is a quick and effortless way to give users powerful query construction, but it is not for everyone. QueryParser cannot create every type of query that can be constructed using the API. For instance, a PhrasePrefixQuery cannot be constructed. You must keep in mind all of the possibilities available when exposing freeform query parsing to an end user. Some queries have the potential for performance bottlenecks. The syntax used by the built-in QueryParser may not be suitable for your needs. Some control is possible by subclassing QueryParser, though it is still limited.

3. Create an index with index searcher

In general, an application usually needs to call only the inherited search methods. For better performance, open a single IndexSearcher and use it for all search operations. Here is a simple example of how to create an index in Lucene and search it using IndexSearcher.

 public void simpleLucene() throws Exception {
    Analyzer analyzer = new StandardAnalyzer();

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead (note that the
    // parameter true will overwrite the index in that directory
    // if one exists):
    // Directory directory = FSDirectory.getDirectory("/tmp/myfiles", true);
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true);

    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES,
        Field.Index.TOKENIZED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    IndexSearcher isearcher = new IndexSearcher(directory);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    Hits hits = isearcher.search(query);
    assertEquals(1, hits.length());
    // Iterate through the results:
    for (int i = 0; i < hits.length(); i++) {
      Document hitDoc = hits.doc(i);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    isearcher.close();
    directory.close();
 }

4. Different types of query

Lucene supports a variety of queries. Here are some of them.

  1. TermQuery
  2. BooleanQuery
  3. WildcardQuery
  4. PhraseQuery
  5. PrefixQuery
  6. MultiPhraseQuery
  7. FuzzyQuery
  8. RegexpQuery
  9. TermRangeQuery
  10. NumericRangeQuery
  11. ConstantScoreQuery
  12. DisjunctionMaxQuery
  13. MatchAllDocsQuery

4.1 TermQuery

Matches documents that have fields that contain a term (not analyzed). The term query maps to Lucene TermQuery. The following matches documents where the user field contains the term kimchy:

 "term" : { "user" : "kimchy" }

A boost can also be associated with the query:

 "term" : { "user" : { "value" : "kimchy", "boost" : 2.0 } }

Or :

 "term" : { "user" : { "term" : "kimchy", "boost" : 2.0 } }

With Lucene, it’s possible to search for a particular word that has been indexed using the TermQuery class. This tutorial will compare TermQuery searches with QueryParser searches, as well as show some of the nuances involved with a term query.

4.2 BooleanQuery

We can run multifield searches in Lucene using either the BooleanQuery API or the MultiFieldQueryParser for parsing the query text. For example, if an index has two fields, FirstName and LastName, and you need to search for “John” in the FirstName field and “Travis” in the LastName field, you can use a BooleanQuery like this:

BooleanQuery bq = new BooleanQuery();
Query qf = new TermQuery(new Term("FirstName", "John"));
Query ql = new TermQuery(new Term("LastName", "Travis"));
// Both clauses are required (MUST), so a document has to match in both fields
bq.add(qf, BooleanClause.Occur.MUST);
bq.add(ql, BooleanClause.Occur.MUST);
IndexSearcher srchr = new IndexSearcher("C:\\indexDir");

4.3 WildcardQuery

Matches documents that have fields matching a wildcard expression (not analyzed). Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. Note this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?. The wildcard query maps to Lucene WildcardQuery.

    "wildcard" : { "user" : "ki*y" }

A boost can also be associated with the query:

    "wildcard" : { "user" : { "value" : "ki*y", "boost" : 2.0 } }

Or :

    "wildcard" : { "user" : { "wildcard" : "ki*y", "boost" : 2.0 } }

This multi term query allows you to control how it gets rewritten using the rewrite parameter.

4.4 PhraseQuery

With Lucene, a PhraseQuery can be used to query for a sequence of terms, where the terms do not necessarily have to be next to each other or in order. The PhraseQuery object’s setSlop() method can be used to set how many words can be between the various words in the query phrase.

We can use PhraseQuery like this:

Term term1 = new Term(FIELD_CONTENTS, string1);
Term term2 = new Term(FIELD_CONTENTS, string2);
PhraseQuery phraseQuery = new PhraseQuery();
// Add the terms in phrase order, then allow one intervening word
phraseQuery.add(term1);
phraseQuery.add(term2);
phraseQuery.setSlop(1);

4.5 PrefixQuery

Matches documents that have fields containing terms with a specified prefix (not analyzed). The prefix query maps to Lucene PrefixQuery. The following matches documents where the user field contains a term that starts with ki:

    "prefix" : { "user" : "ki" }

A boost can also be associated with the query:

    "prefix" : { "user" :  { "value" : "ki", "boost" : 2.0 } }

Or :

    "prefix" : { "user" :  { "prefix" : "ki", "boost" : 2.0 } }

This multi term query allows you to control how it gets rewritten using the rewrite parameter.

4.6 MultiPhraseQuery

The built-in MultiPhraseQuery is definitely a niche query, but it’s potentially useful. MultiPhraseQuery is just like PhraseQuery except that it allows multiple terms per position. You could achieve the same logical effect, albeit at a high performance cost, by enumerating all possible phrase combinations and using a BooleanQuery to “OR” them together.

For example, suppose we want to find all documents about speedy foxes, with quick or fast followed by fox. One approach is to do a “quick fox” OR “fast fox” query. Another option is to use MultiPhraseQuery.
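To see what “enumerating all possible phrase combinations” means in practice, the sketch below expands a list of per-position alternatives into the individual phrases that would have to be OR-ed together. It is plain Java with hypothetical names and does not use the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhraseExpansionSketch {

    /** Expands per-position term alternatives into every concrete phrase. */
    static List<String> expand(List<List<String>> positions) {
        List<String> phrases = new ArrayList<>();
        phrases.add("");
        for (List<String> alternatives : positions) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String term : alternatives) {
                    next.add(prefix.isEmpty() ? term : prefix + " " + term);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<List<String>> positions = Arrays.asList(
                Arrays.asList("quick", "fast"),
                Arrays.asList("fox"));
        System.out.println(expand(positions));
    }
}
```

The number of phrases is the product of the number of alternatives at each position, which is why the enumeration approach becomes expensive as alternatives are added.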

4.7 FuzzyQuery

FuzzyQuery can be categorized into two kinds: (a) the fuzzy like this query and (b) the fuzzy like this field query.

a. fuzzy like this query

The fuzzy like this query finds documents that are “like” the provided text by running it against one or more fields.

    "fuzzy_like_this" : {
        "fields" : ["name.first", "name.last"],
        "like_text" : "text like this one",
        "max_query_terms" : 12
    }

fuzzy_like_this can be shortened to flt.
The fuzzy_like_this top level parameters include:

  • fields -> A list of the fields to run the more like this query against. Defaults to the _all field.
  • like_text -> The text to find documents like it, required.
  • ignore_tf -> Should term frequency be ignored. Defaults to false.
  • max_query_terms -> The maximum number of query terms that will be included in any generated query. Defaults to 25.
  • fuzziness -> The minimum similarity of the term variants. Defaults to 0.5. See the section called “Fuzziness”.
  • prefix_length -> Length of required common prefix on variant terms. Defaults to 0.
  • boost -> Sets the boost value of the query. Defaults to 1.0.
  • analyzer -> The analyzer that will be used to analyze the text. Defaults to the analyzer associated with the field.

Fuzzifies ALL terms provided as strings and then picks the best n differentiating terms. In effect this mixes the behaviour of FuzzyQuery and MoreLikeThis but with special consideration of fuzzy scoring factors. This generally produces good results for queries where users may provide details in a number of fields and have no knowledge of boolean query syntax and also want a degree of fuzzy matching and a fast query.

For each source term the fuzzy variants are held in a BooleanQuery with no coord factor (because we are not looking for matches on multiple variants in any one doc). Additionally, a specialized TermQuery is used for variants and does not use that variant term’s IDF because this would favor rarer terms, such as misspellings. Instead, all variants use the same IDF ranking (the one for the source query term) and this is factored into the variant’s boost. If the source query term does not exist in the index the average IDF of the variants is used.

b. fuzzy like this field query

The fuzzy_like_this_field query is the same as the fuzzy_like_this query, except that it runs against a single field. It provides a nicer query DSL than the generic fuzzy_like_this query, and supports typed field queries (it automatically wraps typed fields with a type filter to match only on the specific type).

    "fuzzy_like_this_field" : {
        "name.first" : {
            "like_text" : "text like this one",
            "max_query_terms" : 12
        }
    }

fuzzy_like_this_field can be shortened to flt_field.
The fuzzy_like_this_field top level parameters include:

  • like_text -> The text to find documents like it, required.
  • ignore_tf -> Should term frequency be ignored. Defaults to false.
  • max_query_terms -> The maximum number of query terms that will be included in any generated query. Defaults to 25.
  • fuzziness -> The fuzziness of the term variants. Defaults to 0.5. See the section called “Fuzziness”.
  • prefix_length -> Length of required common prefix on variant terms. Defaults to 0.
  • boost -> Sets the boost value of the query. Defaults to 1.0.
  • analyzer -> The analyzer that will be used to analyze the text. Defaults to the analyzer associated with the field.

4.8 RegexpQuery

The regexp query allows you to use regular expression term queries. See Regular expression syntax for details of the supported regular expression language.

Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything like .* is very slow, as is using lookaround regular expressions. If possible, you should try to use a long prefix before your regular expression starts. Wildcard matchers like .*?+ will mostly lower performance.

    "regexp" : {
        "name.first" : "s.*y"
    }

Boosting is also supported:

    "regexp" : {
        "name.first" : {
            "value" : "s.*y",
            "boost" : 1.2
        }
    }

You can also use special flags:

    "regexp" : {
        "name.first" : {
            "value" : "s.*y",
            "flags" : "INTERSECTION|COMPLEMENT|EMPTY"
        }
    }

Possible flags are ALL, ANYSTRING, AUTOMATON, COMPLEMENT, EMPTY, INTERSECTION, INTERVAL, or NONE. Regular expression queries are supported by the regexp and the query_string queries. The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators.

Standard operators

Anchoring

Most regular expression engines allow you to match any part of a string. If you want the regexp pattern to start at the beginning of the string or finish at the end of the string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to indicate the end.

Lucene’s patterns are always anchored. The pattern provided must match the entire string. For string “abcde”:

ab.* # match
abcd # no match
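The difference between anchored and unanchored matching can be tried out with the JDK's java.util.regex package. Note that this is a different engine from Lucene's regular expressions; the sketch only uses the fact that Matcher.matches() requires the whole input to match, while Matcher.find() accepts a match anywhere. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class AnchoringSketch {

    /** Anchored: the pattern must cover the entire input. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Unanchored: the pattern may match any part of the input. */
    static boolean matchesAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("ab.*", "abcde"));
        System.out.println(matchesWhole("abcd", "abcde"));
        System.out.println(matchesAnywhere("abcd", "abcde"));
    }
}
```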

Allowed characters

Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \

If you enable optional features (see below) then these characters may also be reserved:

# @ & < > ~

Any reserved character can be escaped with a backslash, for example “\*”, including a literal backslash character: “\\”.

Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes.


Match any character

The period “.” can be used to represent any character. For string “abcde”:

ab... # match
a.c.e # match


One-or-more

The plus sign “+” can be used to repeat the preceding shortest pattern one or more times. For string “aaabbb”:

a+b+ # match
aa+bb+ # match
a+.+ # match
aa+bbb+ # match


Zero-or-more

The asterisk “*” can be used to match the preceding shortest pattern zero or more times. For string “aaabbb”:

a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match


Zero-or-one

The question mark “?” makes the preceding shortest pattern optional. It matches zero or one times. For string “aaabbb”:

aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match


Min-to-max

Curly brackets “{}” can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:

{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice

For string “aaabbb”:

a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match


Grouping

Parentheses “()” can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string “ababab”:

(ab)+ # match
ab(ab)+ # match
(..)+ # match
(...)+ # match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match


Alternation

The pipe symbol “|” acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string “aabb”:

aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match

Character classes

Ranges of potential characters may be represented as character classes by enclosing them in square brackets “[]”. A leading ^ negates the character class. The allowed forms are:

[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc] # any character except 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash “-” indicates a range of characters, unless it is the first character or it is escaped with a backslash.

For string “abcd”:

ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match

4.9 TermRangeQuery

A query that matches documents within a range of terms. It matches documents whose terms fall into the supplied range according to String#compareTo(String), unless a Collator is provided. It is not intended for numerical ranges.

Here is an example of how to use TermRangeQuery in lucene,

private Query createQuery(String field, DateOperator dop) throws UnsupportedSearchException {
    Date date = dop.getDate();
    DateResolution res = dop.getDateResultion();
    DateTools.Resolution dRes = toResolution(res);
    String value = DateTools.dateToString(date, dRes);
    switch (dop.getType()) {
        case ON:
            return new TermQuery(new Term(field, value));
        case BEFORE:
            // From the minimum date (inclusive) up to, but excluding, the given date
            return new TermRangeQuery(field, DateTools.dateToString(MIN_DATE, dRes), value, true, false);
        case AFTER:
            // From the given date (exclusive) up to the maximum date (inclusive)
            return new TermRangeQuery(field, value, DateTools.dateToString(MAX_DATE, dRes), false, true);
        default:
            throw new UnsupportedSearchException();
    }
}
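The String#compareTo(String) ordering mentioned above, and the reason it is unsuitable for numbers, can be seen in a small stand-alone sketch. This is plain Java with hypothetical names, not the TermRangeQuery implementation; the two boolean parameters play the same role as the inclusive and exclusive bounds.

```java
class TermRangeSketch {

    /** True if term lies between lower and upper in String.compareTo order. */
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        System.out.println(inRange("Mango", "Aardvark", "Zebra", false, false));
        // "10" sorts before "2" as text, so it falls outside the range 2..9
        System.out.println(inRange("10", "2", "9", true, true));
    }
}
```

Because terms are compared character by character, “10” sorts before “2”, and a lowercase term such as “mango” sorts after “Zebra”. Numeric values therefore need a numeric range query instead.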

4.10 NumericRangeQuery

A NumericRangeQuery matches numeric values within a specified range. To use it, you must first index the numeric values. We can combine NumericRangeQuery with TermQuery like this:

String termQueryString = "title:\"hello world\"";
Query termQuery = parser.parse(termQueryString);
Query pageQueryRange = NumericRangeQuery.newIntRange("page_count", 10, 20, true, true);
Query query = termQuery.combine(new Query[]{termQuery, pageQueryRange});

4.11 ConstantScoreQuery

A query that wraps another query or a filter and simply returns a constant score equal to the query boost for every document that matches the filter or query. For queries it therefore simply strips off all scores and returns a constant one.

    "constant_score" : {
        "filter" : {
            "term" : { "user" : "kimchy" }
        },
        "boost" : 1.2
    }

The filter object can hold only filter elements, not queries. Filters can be much faster than queries since they don’t perform any scoring, especially when they are cached.

A query can also be wrapped in a constant_score query:

    "constant_score" : {
        "query" : {
            "term" : { "user" : "kimchy" }
        },
        "boost" : 1.2
    }

4.12 DisjunctionMaxQuery

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
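The scoring rule in the paragraph above amounts to the highest subquery score plus the tie breaker multiplied by the sum of the remaining subquery scores. A minimal sketch in plain Java, with hypothetical names and assuming non-negative subquery scores:

```java
class DisMaxScoreSketch {

    /** Best subquery score plus tieBreaker times the sum of all other subquery scores. */
    static double score(double[] subScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double[] subScores = {1.0, 0.5};
        System.out.println(score(subScores, 0.0));
        System.out.println(score(subScores, 0.7));
    }
}
```

With a tie breaker of 0.0 only the best-matching subquery counts; larger values let additional matching subqueries contribute to the score.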

This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as Boolean Query would give). If the query is “albino elephant” this ensures that “albino” matching one field and “elephant” matching another gets a higher score than “albino” matching both fields. To get this result, use both Boolean Query and DisjunctionMaxQuery: for each term a DisjunctionMaxQuery searches for it in each field, while the set of these DisjunctionMaxQuery’s is combined into a BooleanQuery.

The tie breaker capability allows results that include the same term in multiple fields to be judged better than results that include this term in only the best of those multiple fields, without confusing this with the better case of two different terms in the multiple fields.

The default tie_breaker is 0.0. This query maps to Lucene DisjunctionMaxQuery.

    "dis_max" : {
        "tie_breaker" : 0.7,
        "boost" : 1.2,
        "queries" : [
            { "term" : { "age" : 34 } },
            { "term" : { "age" : 35 } }
        ]
    }

4.13 MatchAllDocsQuery

A query that matches all documents. Maps to Lucene MatchAllDocsQuery.

    "match_all" : { }

Which can also have boost associated with it:

    "match_all" : { "boost" : 1.2 }

Piyas De

Piyas is a Sun Microsystems certified Enterprise Architect with 10+ years of professional IT experience in various areas such as Architecture Definition, Define Enterprise Application, Client-server/e-business solutions. Currently he is engaged in providing solutions for digital asset management in media companies. He is also founder and main author of "Technical Blogs (Blog about small technical Know hows)": http://www.phloxblog.in