
Introduction to Lucene

This article is part of our Academy Course titled Apache Lucene Fundamentals.

In this course, you will get an introduction to Lucene. You will see why a library like this is important and then learn how searching works in Lucene. Moreover, you will learn how to integrate Lucene Search into your own applications in order to provide robust searching capabilities. Check it out here!

1. Introduction

In this course we are going to dive into Apache Lucene. Lucene is a rich, open-source, full-text search suite. This means that Lucene is going to help you implement a full-text search engine tailored to your application's needs. We are going to deal with the Java flavor of Lucene, but bear in mind that there are API clients for a variety of programming languages.

1.1 What is full-text search

It is a common need for users to retrieve a list of documents or sources that match certain criteria. For example, a library user needs to be able to find all of the books written by a specific author, or all of the books that have a specific word or phrase in their title, or all of the books published in a specific year by a specific publisher. The above queries can be easily handled by a well-known relational database: if you hold a table that stores (title, author, publisher, year of publishing) tuples, these searches can be completed efficiently. Now, what if a user wants to obtain all the documents that contain a certain word or phrase in their actual content? If you try to use a traditional database and store the raw content of all the documents in a field of a tuple, searching would take unacceptably long.

That’s because in a full-text search, the search engine has to scan all of the words of the text document, or text stream in general, and try to match several criteria against it, e.g. find certain words or phrases in its content. Those kinds of queries in a classic relational database would be hopeless. Granted, many database systems, like MySQL and PostgreSQL, support full-text searching, either natively or through external libraries. But they are neither efficient and fast enough nor customizable enough. The biggest problem, though, is scalability: they simply cannot handle the amount of data that dedicated full-text search engines can.

1.2 Why do we need full-text search engines

The process of generating vast amounts of data is one of the defining characteristics of our time and a major consequence of technological advancements. It goes by the term information overload. Having said that, gathering and storing all that data is beneficial only if you are able to extract useful information out of it and make it reachable by your application's end users. The most well-known and widely used tool to achieve that is, of course, searching.

One could argue that searching a file for a word or a phrase is as simple as scanning it serially from top to bottom, just like you would with a grep command. This might actually suffice for a small number of documents. But what about huge file systems with millions of files? And, if that seems extraordinary to you, what about web pages, databases, emails and code repositories, to name just a few, or all of them combined? It becomes easy to understand that the information each individual user needs might reside in a little document, somewhere in a vast ocean of different information resources. And the retrieval of that document has to seem as easy as breathing.

One can now see why fully customized search-based applications are gaining lots of attention and traction. Adding to that is the fact that searching has become such an important aspect of the end user's experience that for modern web applications, varying from simple blogs to big platforms like Twitter or Facebook and even military-grade applications, it is inconceivable not to have searching facilities. That's why big vendors don't want to risk messing up their searching features and want to keep them as fast and yet as simple as possible. This has led to the need to upgrade searching from a simple feature to a full platform: a platform with power, effectiveness, and the necessary flexibility and customization. And Apache Lucene delivers, which is why it is used in many of the aforementioned applications.

1.3 How Lucene works

So, you must be wondering how Lucene can perform very fast full-text searches. Not surprisingly, the answer is that it uses an index. Lucene indexes fall into the category of inverted indexes. Instead of a classic index where, for every document, you keep the full list of words (or terms) it contains, an inverted index does it the other way round: for every term (word) appearing in the documents, you keep a list of all of the documents that contain that term. That is hugely more convenient when performing full-text searches.

The reason that inverted indexes work so well can be seen in the following diagrams. Imagine you have 3 very large documents. A classic index would be of the form:

Classic index:

Document1 -> { going, to, dive, into, Apache, Lucene, rich, open, source, full, text, search,... }
Document2 -> { so, must, wonder, Lucene, can, achieve, very, fast, full, text, search, not,... }
Document3 -> { reason, that, inverted, index, work, good ,can, be, seen, following, diagrams,... }

For every document you have a huge list of all of the terms it contains. In order to find whether a document contains a specific term, you have to scan, probably sequentially, these vast lists.

On the other hand, an inverted index would have this form:

Inverted index:

reason -> { (3,0) }
Lucene -> { (1,6), (2,4) }
full   -> { (1,10),(2,9) }
going  -> { (1,0) }
index  -> { (3,3) }
search -> { (1,11), (2,10)}
...

For every term we maintain a list with all of the documents that contain that term, along with the position of the term inside the document (of course, additional information can be kept). Now, when a user searches for the term “Lucene”, we can instantly answer that the term “Lucene” is located inside Document1 at position 6 and inside Document2 at position 4. To sum up, inverted indexes use a very big number of very small lists that can be searched almost instantly. In contrast, a classic index would use a small number of extremely big lists that are impossible to search quickly.
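To make the idea concrete, here is a minimal, purely illustrative sketch of such an inverted index as a plain Java map. It is not how Lucene actually stores its index, just the data structure in miniature:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyInvertedIndex {

    // term -> list of postings, each posting being (documentId, position)
    private final Map<String, List<int[]>> postings = new HashMap<String, List<int[]>>();

    public void addDocument(int docId, String text) {
        String[] terms = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < terms.length; pos++) {
            List<int[]> list = postings.get(terms[pos]);
            if (list == null) {
                list = new ArrayList<int[]>();
                postings.put(terms[pos], list);
            }
            list.add(new int[] { docId, pos });
        }
    }

    // returns the postings list for a term, e.g. "lucene" -> [(1,6), (2,4)] as in the example above
    public List<int[]> search(String term) {
        List<int[]> list = postings.get(term.toLowerCase());
        return list != null ? list : new ArrayList<int[]>();
    }
}

Answering "which documents contain term X" is then a single map lookup instead of a scan over every document's word list, which is exactly the property Lucene exploits.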

1.4 Basic Lucene workflow

So, Lucene has to do some work before the actual search, and that work is to create the index. The basic workflow of the indexing process is depicted below:

[Figure: Basic Lucene indexing workflow]

As you can see in the above diagram:

  1. You feed it with text documents/sources
  2. For every document it analyzes the text and splits it into terms (words). Meanwhile it can perform all kinds of analysis on the plain text. You can tailor that step to suit the needs of your own application.
  3. For every term of the documents it creates the previously described inverted lists.
  4. Now the index is ready to be searched. You can write queries in many different formats and as a result you will get a list of all of the documents that satisfy the criteria specified in the query.

So far Lucene seems to be a very powerful tool, as it can analyze the text, create the indexes and perform queries against the index. But you have to do some work yourself, like selecting the documents to be indexed, organizing and managing the whole process and its several aspects, as well as eventually getting the search queries from the users and presenting any results to them.

2. Basic components for Indexing

In this section we are going to describe the basic components and the basic Lucene classes used to create Indexes.

2.1 Directories

A Lucene Index is hosted simply in a normal file system location, or in memory when extra performance is needed and you don’t want it to be stored permanently on your drive. You can even choose to store your index in a database via JDBC. The implementations of the aforementioned options extend the abstract Directory class.

To keep things simple, let’s just say that the index uses a directory in your file system, although there aren’t many differences when using memory or databases; the normal directory is simply more intuitive, in my opinion. Lucene will use that directory to store everything that is necessary for the index. You can work with such a directory using the FSDirectory class, feeding it with an arbitrary path in your file system (when working with memory, you use RAMDirectory). The FSDirectory class is simply an abstraction on top of the normal Java file manipulation classes.

This is how you can create an FSDirectory:

Directory directory = FSDirectory.open( new File("C:/Users/nikos/Index"));

And this is how you can create a RAMDirectory:

Directory  ramDirectory = new RAMDirectory();

2.2 Documents

As you remember, we said that it is your responsibility to choose the documents (text files, PDFs, Word documents etc.) and any text sources you want to make searchable, and thus indexed. For every document you want to index, you have to create one Document object that represents it. At this point, it is important to understand that Documents are indexing components, not the actual text sources. Naturally, because a Document represents an individual physical text source, it is the building unit of the Index. After creating such a Document you have to add it to the Index. Later, when dispatching a search, you will get back a list of Document objects that satisfy your query.

This is how you can create a new empty Document:

Document doc = new Document();

Now it’s time to fill the Document with Fields.

2.3 Fields

Document objects are populated with a collection of Fields. A Field is simply a pair of (name, value) items. So, when creating a new Document object you have to fill it with that kind of pairs. A Field can be stored in the index, in which case both the name and the value of the field are literally stored in the Index. Additionally, a Field can be indexed, or to be more precise inverted, in which case the value of the field gets analyzed and tokenized into Terms and becomes available for searching. A Term represents a word from the text of a Field‘s value. A Field can be both stored and indexed/inverted, but you don’t have to store a Field to make it indexed/inverted. Storing a Field and indexing/inverting a Field are two different, independent things.

As I mentioned before, when dispatching a search, in return you will get a list of Document objects (representing the physical text sources) that satisfy your query. If you want to have access to the actual value of a Field, you have to declare that Field stored. This is usually helpful when you want to store the name of the file that this Document represents, its last modification date, its full file path, or any additional information about the text source you want to have access to. For example, if the text source you are indexing is an email, the Document object that represents that email could have these fields:

Example Document representing an email:

Field Name | Field Value                   | Stored | Indexed
Title      | Email from example            | Yes    | No
Location   | location of the email         | Yes    | No
From       | example@javacodegeeks.com     | Yes    | No
To         | foo@example.com               | Yes    | No
Subject    | Communication                 | Yes    | No
Body       | Hi there ! Nice meeting you…  | No     | Yes

In the above Document, I’ve chosen to index/invert the body of the email, but not to store it. This means that the body of the email will be analyzed and tokenized into searchable terms, but it will not be literally stored in the Index. You can follow that tactic when the volume of the content of your text source is very big and you want to save space. On the other hand, I’ve chosen to store, but not index, all the other fields. When I perform a search, only the body will be searched, while all the other Fields are not taken into consideration, as they are not indexed.

If the Body‘s aforementioned tokenized terms satisfy the query, this Document will be included in the results. Now, when you access that retrieved Document, you can only view its stored Fields and their values. Thus, the actual body of the email will not be available to you through the Document object, despite being searchable; you can only see the Title, Location, From, To and Subject Fields. Storing the location of that email will help me access its actual body content. Of course, you can store the body of the email as well, if you want to retrieve it through the Document object, and thus make it both searchable and stored (the same goes for the other fields).
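As a quick, hypothetical illustration (using the IndexSearcher API shown later in this article), this is what retrieving the email Document above would look like: stored Fields return their values, while the indexed-but-not-stored body comes back as null:

Document retrieved = searcher.doc(scoreDoc.doc);   // a hit returned by a search
String from = retrieved.get("from");               // "example@javacodegeeks.com" - stored, so available
String body = retrieved.get("body");               // null - searchable, but not stored in the index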

So let’s see how you would create the above Document. To create the stored-only fields we are going to use the StoredField class, and to create the non-stored, indexed body of the text we are going to use a TextField.

Document doc = new Document();
Field title = new StoredField("fileName", "Email from example");
doc.add(title);
Field location = new StoredField("location", "C:/users/nikos/savedEmail");
doc.add(location);
Field from = new StoredField("from", "example@javacodegeeks.com");
doc.add(from);
Field to = new StoredField("to", "foo@example.com");
doc.add(to);
Field body = new TextField("body", new FileReader(new File("C:/users/nikos/savedEmail")));
doc.add(body);

As you can see, we’ve given a FileReader as the value of the "body" field. This FileReader will be used during the analysis phase to extract the plain text from that source. After extracting the plain text from the file, special components of Lucene will analyze it and split it into indexed terms.

2.4 Terms

A Term represents a word from the text. Terms are extracted from the analysis and tokenization of Fields’ values, thus a Term is the unit of search. Terms are composed of two elements: the actual text word (this can be anything from ordinary words to email addresses, dates etc.), and the name of the Field this word appeared in.

It is not absolutely necessary to tokenize and analyze a Field to extract Terms out of it. In the previous example, if you want to make the From Field indexed, you don’t really have to tokenize it; the email address example@javacodegeeks.com can serve as a Term on its own.
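For instance, if you wanted the From Field to be searchable as a single, untokenized Term, a StringField would be a reasonable choice. This is only a sketch of an alternative to the stored-only "from" field used above:

// Indexed as one single Term ("example@javacodegeeks.com"), not tokenized, and also stored
Field from = new StringField("from", "example@javacodegeeks.com", Field.Store.YES);
doc.add(from);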

2.5 Analyzers

An Analyzer is one of the most crucial components of the indexing and searching process. It is responsible for taking plain text and converting it to searchable Terms. Now, it is important to understand that Analyzers work with plain text input. It’s the programmer’s responsibility to provide a parser that is able to convert a text source, like an HTML page or a file from your file system, into plain text. This parser is usually a Reader; for example, in the case of files, this could be a FileReader.

An Analyzer internally uses a Tokenizer. The Tokenizer can take as input the aforementioned Reader and use it to extract plain text from a specific source (e.g. a File). After obtaining the plain text, the Tokenizer simply splits the text into words. But an Analyzer can do much more than simple text splitting. It can perform several kinds of text and word analysis, like:

  • Stemming : Replacing words with their stems. For example, in English the stem of “oranges” is “orange”. So if the end user searches for “orange”, documents that contained “oranges” and “orange” will be obtained.
  • Stop Words Filtering : Words like “the”, “and” and “a” are not of any particular interest when performing a search and one might as well consider them “noise”. Removing them will result in better performance and more accurate results.
  • Text Normalization : Removes accents and other character markings.
  • Synonym Expansion : Adds in synonyms at the same token position as the current word.

These are only some of the analysis tools that are built into Lucene’s Analyzer classes. The most commonly used built-in Analyzer is the StandardAnalyzer, which removes stop words and converts tokens to lowercase (it does not perform stemming; for that you would pick a language-specific analyzer such as EnglishAnalyzer). As you know, different languages have different grammar rules, and the Lucene community is trying to embed grammars for as many different languages as possible. Still, if none of Lucene’s built-in Analyzers is suitable for your application, you can create your own.
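If you want to see what an Analyzer actually produces, you can run some text through it and print the resulting tokens. Here is a small sketch using StandardAnalyzer; the field name "content" is arbitrary, and the classes come from org.apache.lucene.analysis and org.apache.lucene.analysis.tokenattributes:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
TokenStream stream = analyzer.tokenStream("content", new StringReader("The Oranges of Apache Lucene"));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(termAttr.toString()); // prints: oranges, apache, lucene
}
stream.end();
stream.close();

Note that "The" and "of" are dropped as stop words and everything is lowercased, but "oranges" is not reduced to "orange", which illustrates that StandardAnalyzer does not stem.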

2.6 Interacting with the Index

So far, we’ve seen how to create an Index Directory, create a Document and add Fields to it. Now we have to write the Document to the Directory and thus add it to the Index. This is also the step where Analyzers and Tokenizers play their parts.

There is no special class called Index (or something like that) in Lucene, as you might expect. The way you interact with the Index is through an IndexWriter when you want to push content to your Index (and generally manipulate it), an IndexReader when you want to read from your Index, and an IndexSearcher when you want to search the Index.

Now let’s see how we can create the IndexWriter we want:

Directory directory = FSDirectory.open( new File("C:/Users/nikos/Index"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter indexWriter = new IndexWriter(directory , config);

We’ve chosen to use a StandardAnalyzer instance for this. Its constructor takes Version.LUCENE_46 as an argument, which helps Lucene maintain compatibility across its releases. Keep in mind that StandardAnalyzer internally uses StandardTokenizer. Then, we create an IndexWriterConfig instance. This is a helper class that can hold all the configuration options for an IndexWriter. As you can see, we’ve specified that we want our IndexWriter to use the previously created analyzer and to be set to the appropriate version. Finally, we create the IndexWriter instance, giving its constructor the FSDirectory instance and the previously created configuration options.

Now you can go ahead and add the previously created Document to the Index, using the above IndexWriter:

indexWriter.addDocument(doc);

And that’s it. Now, when addDocument is called, all the previously described operations take place:

  1. The Tokenizer uses the FileReader to read the file and converts it to plain text. It then breaks it into tokens.
  2. The Analyzer meanwhile can perform all kinds of syntactic and grammar analysis on the plain text and then on the individual tokens.
  3. From the token analysis, Terms are created and used to generate the inverted index.
  4. Finally all the necessary files, holding all the info for the Document and the Index, are written to the specified path: "C:/Users/nikos/Index".

Our document is now indexed. The same process is followed for every Document you add to the Index.
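One practical detail worth noting: changes only become visible to index readers after the IndexWriter commits them (closing the writer also commits). So a typical indexing session ends roughly like this:

indexWriter.commit(); // make the pending changes visible to newly opened IndexReaders
indexWriter.close();  // flushes any remaining changes, commits and releases the index lock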

Now that everything is clearer, let’s see the indexing process, with the classes we used, in a diagram:

[Figure: Indexing process with IndexWriter, Analyzer and Directory]

As you can see:

  • We pass the Document object to the IndexWriter.
  • It uses the Analyzer to generate Terms from the plain text obtained by the Reader.
  • Then, it writes everything that is necessary to update the index to the Directory.

Indexing is the hard bit when trying to build your search engine, because you have to:

  • Choose the documents and text sources you want to index.
  • Provide Reader classes that read the text sources and convert them to plain text. There is a wide variety of built-in classes (or external libraries) that can read a big number of document formats. But if none of them is suitable for your documents, you will have to write your own Reader that will parse them and convert them to plain text.
  • Decide on the tokenization and analysis policy that suits your application’s needs. The great majority of applications will do just fine with the StandardAnalyzer and StandardTokenizer. But, it is possible that you want to customize the analysis step a bit further and that requires some work to be done.
  • Decide what kind of Fields to use and which of them to store and/or index.

3. Basic components for searching

In this section we are going to describe the basic components and the basic Lucene classes used to perform searches. Searching is the whole point of a platform like Lucene, so it has to be as nimble and as easy as possible.

3.1 QueryBuilder and Query

In Lucene, every query passed to the Index is a Query object. So before actually interacting with the Index to perform a search, you have to build such objects.

Everything starts with the query string. It can be like the query strings you type into well-known search engines, like Google. It can be an arbitrary phrase, or a more structured one, as we will see in the next lesson. But it would be useless to just send this raw string to be searched in the index. You have to process it, like you did with the indexed plain text: you have to split the query string into words and create searchable Terms. Presumably, this can be done using an Analyzer.

Note: It is important to note that you should use the same Analyzer subclass to process the query as the one you used during the indexing process to process the plain text.

Here is how you would create a simple Query, processed with StandardAnalyzer:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
QueryBuilder builder = new QueryBuilder(analyzer);
Query query = builder.createBooleanQuery("content", queryStr);

We bind a QueryBuilder object to an instance of StandardAnalyzer. You can now use that QueryBuilder instance to create Query objects.

Query is an abstract class and many concrete subclasses are available, like:

  • TermQuery that searches for Documents that contain a specific term.
  • BooleanQuery that creates Boolean combinations of other queries.
  • WildcardQuery that implements wildcard searches, e.g. for query strings like "*abc*".
  • PhraseQuery that searches for whole phrases, not just for individual Terms.
  • PrefixQuery that searches for Terms with a predefined prefix.

All of these different Query flavors determine the nature of the search that is going to be performed over your index, and each one of them can be obtained through that QueryBuilder instance. In our example we have chosen to use the createBooleanQuery method. It takes two arguments: the first one is the name of the Field whose (indexed and possibly tokenized) value is going to be searched, and the second one is the query string that is going to be analyzed with the StandardAnalyzer. createBooleanQuery can return either a TermQuery or a BooleanQuery, depending on the syntax of the query string.
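For illustration, here is how some of these Query subclasses can also be constructed by hand with the Lucene 4.6 API. Note that Terms built this way are not passed through an Analyzer, so their text must match the indexed form (e.g. lowercase when StandardAnalyzer was used at index time); the field name "content" is just the one used throughout this article:

// Documents whose "content" field contains the term "lucene"
Query termQuery = new TermQuery(new Term("content", "lucene"));

// Documents containing the phrase "apache lucene"
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("content", "apache"));
phraseQuery.add(new Term("content", "lucene"));

// Must contain "lucene" and must not contain "solr"
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new TermQuery(new Term("content", "lucene")), BooleanClause.Occur.MUST);
booleanQuery.add(new TermQuery(new Term("content", "solr")), BooleanClause.Occur.MUST_NOT);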

3.2 IndexReader

Presumably, to perform a search in the Index, you first have to open it. You can use an IndexReader to open and access it. All processes that need to pull data out of the index go through that abstract class.

It’s very easy to open an already created index with IndexReader:

Directory directory = FSDirectory.open(new File("C:/Users/nikos/Index"));
IndexReader indexReader = DirectoryReader.open(directory);

As you can see we use a DirectoryReader to open the Directory in which the Index is stored. DirectoryReader returns a handle for the index, and that’s what the IndexReader is.

3.3 IndexSearcher

The IndexSearcher is the class you use to search a single Index. It is bound with an IndexReader.

Here is how you can create one:

IndexReader  indexReader  = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(indexReader);

You use an IndexSearcher to pass Query objects to the IndexReader. Here is how:

Query query = builder.createBooleanQuery("content", queryStr);
TopDocs topDocs = searcher.search(query, maxHits);

We used the public TopDocs search(Query query, int n) method of IndexSearcher to perform the search. This method takes two arguments. The first one is the Query object. The second is an integer that sets a limit on the number of returned search results. If, for example, you have 10000 documents that satisfy your query, you might not want all of them to be returned; you can state that you want only the first n results. Finally, the method returns a TopDocs instance.

3.4 TopDocs

The TopDocs class represents the hits that satisfy your query. TopDocs has a public ScoreDoc[] scoreDocs field that holds those hits.

3.5 ScoreDoc

A ScoreDoc represents a single hit for a query. It consists of:

  • A public int doc field, that is the id of the Document that satisfied the query.
  • And a public float score field, that is the score that the Document achieved in the query.
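Putting TopDocs and ScoreDoc together, iterating over the hits of a search typically looks like the following sketch (the "fileName" stored Field is the one used in the sample application later in this article):

TopDocs topDocs = searcher.search(query, maxHits);
for (ScoreDoc hit : topDocs.scoreDocs) {
    Document d = searcher.doc(hit.doc); // loads only the stored Fields of that hit
    System.out.println(d.get("fileName") + " score: " + hit.score);
}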

The scoring formula is a very essential and complex part of any search platform, and it is what makes Lucene work so well. This formula is used to provide a relevance measure for each retrieved Document: the higher the score, the more relevant that Document is to your query. This helps to separate “good” from “bad” Documents and ensures that you are provided with high-quality results that are as close as possible to the Documents you really need. You can find some useful information on scoring in the documentation of the current version’s Similarity class (and the older ones), as well as in introductory Information Retrieval material.
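For reference, the default similarity in Lucene 4.6 (DefaultSimilarity, a TFIDFSimilarity) computes a score that is, roughly, the classic TF-IDF formula below; this is a simplified rendering of what the Similarity javadoc describes, not the full set of details:

\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot\sum_{t\in q}\Big(\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Big)
\]

where, in DefaultSimilarity, \(\mathrm{tf}(t,d)=\sqrt{\mathrm{freq}(t,d)}\) and \(\mathrm{idf}(t)=1+\log\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t)+1}\). Intuitively, documents that contain the query terms more often, and terms that are rare across the whole index, contribute more to the score.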

4. A Simple Search Application

We are going to build a simple search application that will demonstrate the basic steps of indexing and searching. In this application we are going to use an input folder that contains a bunch of Java source files. Every file in this folder will be processed and added to the index. Then we are going to perform simple queries over that index to see how it works.

We are going to use:

  • Eclipse Kepler 4.3 as our IDE.
  • JDK 1.7.
  • Maven 3 to build our project.
  • Lucene 4.6.0, the latest version of Lucene at the time of writing.

First of all, let’s create our Maven Project with Eclipse.

4.1 Create a new Maven Project with Eclipse

Open Eclipse and go to File -> New -> Other -> Maven -> Maven Project and click Next

[Figure: New Maven Project wizard in Eclipse]

In the next window select the “Create a simple project (skip archetype selection)” option and click Next :

[Figure: "Create a simple project (skip archetype selection)" option]

In the next window fill in the Group Id and Artifact Id, as shown in the picture below, and click Finish:

[Figure: Group Id and Artifact Id configuration]

A new Maven project will be created with the following structure:

[Figure: Initial Maven project structure]

4.2 Maven Dependencies

Open pom.xml and add the dependencies required to use Lucene libraries:

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.javacodegeeks.enterprise.lucene.index</groupId>
    <artifactId>LuceneHelloWorld</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.6.0</version>
        </dependency>
    </dependencies>
</project>

As you can see, we are importing lucene-core-4.6.0.jar, which provides all the core classes, and lucene-analyzers-common-4.6.0.jar, which provides all the classes necessary for text analysis.

4.3. A simple indexer class

To create this class, go to the Package Explorer of Eclipse. Under src/main/java create a new package named com.javacodegeeks.enterprise.lucene.index. Under the newly created package, create a new class named SimpleIndexer.

Let’s see the code of that class, which will do the indexing:

SimpleIndexer.java:

package com.javacodegeeks.enterprise.lucene.index;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class SimpleIndexer {
    private static final String indexDirectory = "C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/Index";
    private static final String dirToBeIndexed = "C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/SourceFiles";
    public static void main(String[] args) throws Exception {
        File indexDir = new File(indexDirectory);
        File dataDir = new File(dirToBeIndexed);
        SimpleIndexer indexer = new SimpleIndexer();
        int numIndexed = indexer.index(indexDir, dataDir);
        System.out.println("Total files indexed " + numIndexed);
    }
    private int index(File indexDir, File dataDir) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46,
                analyzer);
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir),
                config);
        File[] files = dataDir.listFiles();
        for (File f : files) {
            System.out.println("Indexing file " + f.getCanonicalPath());
            Document doc = new Document();
            doc.add(new TextField("content", new FileReader(f)));
            doc.add(new StoredField("fileName", f.getCanonicalPath()));
            indexWriter.addDocument(doc);
        }
        int numIndexed = indexWriter.maxDoc();
        indexWriter.close();
        return numIndexed;
    }
}

In the above class we specified the input folder holding the files to be indexed, C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/SourceFiles, and the folder where the index is going to be stored, C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/Index.

In the index method, we first create a new StandardAnalyzer instance and a new IndexWriter instance. The IndexWriter will use the StandardAnalyzer to analyze the text, and will store the index in the FSDirectory pointing to the aforementioned index path.

The interesting bit is in the for loop. For every file in the source directory:

  1. We create a new Document instance.
  2. We add a new Field, a TextField to be precise, that represents the content of the file. Remember that a TextField is used to create a field whose value is going to be tokenized and indexed but not stored.
  3. We add another Field, this time a StoredField, that holds the name of the file. Remember that a StoredField is for fields that are just stored, not indexed and not tokenized. Because we store the file name as its full path, we can later use it to access, present and examine the file's contents.
  4. Then we simply add the Document to the index.

After the loop:

  1. We call the maxDoc() method of IndexWriter, which returns the number of indexed Documents.
  2. We close the IndexWriter, because we don’t need it anymore, so the system can reclaim its resources.

When I run this code, here is the output it produces:

Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Cart.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\CartBean.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\MyServlet.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Passivation.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PassivationBean.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Product.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PropertyObject.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SecondInterceptor.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleEJB.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleIndexer.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleInterceptor.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleSearcher.java
Indexing file C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\TestSerlvet.java
Total files indexed 14

Here is the index folder in my system. As you can see, several special files have been created (more on that in the next lessons):

[Figure: Index folder contents]

Now let’s search that index.

4.4. A simple searcher class

To create this class, go to the Package Explorer of Eclipse. Under src/main/java create a new package named com.javacodegeeks.enterprise.lucene.search. Under the newly created package, create a new class named SimpleSearcher.

To get a clearer view of the final structure of the project, have a look at the image below:

[Figure: Final project structure in the Package Explorer]

Let’s see the code of that class, which will do the searching:

SimpleSearcher.java:

package com.javacodegeeks.enterprise.lucene.search;
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.QueryBuilder;
import org.apache.lucene.util.Version;
public class SimpleSearcher {
    private static final String indexDirectory = "C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/Index";
    private static final String queryString = "private static final String";
    private static final int maxHits = 100;
    public static void main(String[] args) throws Exception {
        File indexDir = new File(indexDirectory);
        SimpleSearcher searcher = new SimpleSearcher();
        searcher.searchIndex(indexDir, queryString);
    }
    private void searchIndex(File indexDir, String queryStr)
            throws Exception {
        Directory directory = FSDirectory.open(indexDir);
        
        IndexReader  indexReader  = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        
        QueryBuilder builder = new QueryBuilder(analyzer);
        
        Query query = builder.createBooleanQuery("content", queryStr);
        
        TopDocs topDocs =searcher.search(query, maxHits);
        
        ScoreDoc[] hits = topDocs.scoreDocs;
        
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println(d.get("fileName") + " Score :"+hits[i].score);
            
        }
        
        System.out.println("Found " + hits.length);
    }
}

In the searchIndex method we pass the index directory and the query string as arguments. So I’m going to search for “private static final String”. Remember that the files I’ve indexed were Java source files.

The code is pretty self explanatory:

  1. We open the index directory and obtain the IndexReader and IndexSearcher classes.
  2. We then use the QueryBuilder, fed with the StandardAnalyzer, to build our Query object. We used createBooleanQuery to obtain the Query object. Our query string does not have a boolean format (as we will see in the next lesson), so the method will create TermQueries for the analyzed and tokenized terms of the query string.
  3. We then use the search method of the IndexSearcher to perform the actual search.
  4. We obtain the ScoreDocs that satisfied the query from the TopDocs returned by the search method. For every document ID in that ScoreDoc array, we obtain the corresponding Document from the IndexSearcher.
  5. From that Document, using the get method, we obtain the name of the file, stored in the value of the “fileName” Field.
  6. We finally print the name of the file and the score it achieved.

Let’s run the program and see the output:

C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Product.java Score :0.6318474
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PropertyObject.java Score :0.58126664
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleSearcher.java Score :0.50096965
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleIndexer.java Score :0.31737804
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java Score :0.3093049
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\TestSerlvet.java Score :0.2769966
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\MyServlet.java Score :0.25359935
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleEJB.java Score :0.05496885
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\PassivationBean.java Score :0.03272106
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\CartBean.java Score :0.028630929
Found 10

It is important to understand that these files do not necessarily contain the whole phrase "private static final String". Intuitively, the documents with a higher score contain more of the words of that phrase, and more frequently, than documents with lower scores. Of course, the scoring formula is much more complex than that, as we said earlier.

For example if you change:

Query query = builder.createBooleanQuery("content", queryStr);

to

Query query = builder.createPhraseQuery("content", queryStr);

the whole phrase will be searched, and only documents that contain the whole phrase will be returned. When you run the code with that minor change, here is the output:

C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\Product.java Score :0.9122286
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleSearcher.java Score :0.7980338
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\SimpleIndexer.java Score :0.4561143
C:\Users\nikos\Desktop\LuceneFolders\LuceneHelloWorld\SourceFiles\ShoppingCartServlet.java Score :0.36859602
Found 4

These files contain the whole query string "private static final String" in their contents.

4.5 Download the source code

You can download the Eclipse project of this example here: LuceneHelloWorld.zip

5. Final notes

It is important to mention that an IndexReader reads the “image” of the index at the moment it is opened. So, if your application is dealing with text sources that change over short periods of time, you may have to re-index those files at run time. But you also want to be sure that the changes are reflected when you search the Index while your application is still running and an already opened IndexReader (which is now outdated) is in use. In this case you have to obtain an updated IndexReader, like this:

//directoryReader is the old DirectoryReader obtained from DirectoryReader.open(...)
//openIfChanged returns null if the index has not changed since that reader was opened
DirectoryReader updatedIndexReader = DirectoryReader.openIfChanged(directoryReader);

This will give you a new, more up-to-date reader, but only if the index has actually changed; if it hasn't, openIfChanged returns null and you can keep using the old reader. Furthermore, if you want to achieve fast, near real-time searches (e.g. for streaming data), you can obtain your IndexReader like so:

//indexWriter is your IndexWriter
IndexReader nearRealTimeIndexReader = DirectoryReader.open(indexWriter, true);

For performance reasons, the IndexWriter doesn’t flush the changes of the index immediately to the disk. It uses buffers instead, and persists the changes asynchronously. Opening an IndexReader as in the above snippet gives it access to the IndexWriter’s not-yet-committed changes, and thus it sees index updates almost immediately.

Finally, it’s worth mentioning that IndexWriter is thread-safe, so you can use the same instance in many threads to add Documents to one particular index. The same goes for IndexReader and IndexSearcher: many threads can use the same instance of these classes to simultaneously read or search the same index.
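As a small, hypothetical illustration of that thread safety, several worker threads can safely share one IndexWriter. The buildDocument helper below is assumed, not a Lucene API, and the snippet would live inside a method that declares throws Exception and imports java.util.concurrent:

final IndexWriter sharedWriter = new IndexWriter(FSDirectory.open(indexDir), config);
ExecutorService pool = Executors.newFixedThreadPool(4);
for (final File f : dataDir.listFiles()) {
    pool.submit(new Runnable() {
        public void run() {
            try {
                // buildDocument is a hypothetical helper that creates a Document for a file
                sharedWriter.addDocument(buildDocument(f));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.MINUTES);
sharedWriter.close();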

Piyas De

Piyas is a Sun Microsystems certified Enterprise Architect with 10+ years of professional IT experience in various areas such as architecture definition, enterprise application design, and client-server/e-business solutions. Currently he is engaged in providing solutions for digital asset management in media companies. He is also the founder and main author of "Technical Blogs (blog about small technical know-hows)": http://www.phloxblog.in