Introduction to Lucene
This article is part of our Academy Course titled Apache Lucene Fundamentals.
In this course, you will get an introduction to Lucene. You will see why a library like this is important and then learn how searching works in Lucene. Moreover, you will learn how to integrate Lucene Search into your own applications in order to provide robust searching capabilities. Check it out here!
Table Of Contents
- 1. Introduction
- 1.1 What is full-text search
- 1.2 Why do we need full-text search engines
- 1.3 How Lucene works
- 1.4 Basic Lucene workflow
- 2. Basic components for Indexing
- 3. Basic components for searching
- 4. A Simple Search Application
- 4.1 Create a new Maven Project with Eclipse
- 4.2 Maven Dependencies
- 4.3 A simple indexer class
- 4.4 A simple searcher class
- 4.5 Download the source code
- 5. Final notes
In this course we are going to dive into Apache Lucene. Lucene is a rich, open-source, full-text search suite. This means that Lucene is going to help you implement a full-text search engine, tailored to your application's needs. We are going to deal with the Java flavor of Lucene, but bear in mind that there are API clients for a variety of programming languages.
1.1 What is full-text search
It is a common need for users to want to retrieve a list of documents or sources that match certain criteria. For example, a library user needs to be able to find all of the books written by a specific author, or all of the books that have a specific word or phrase in their title, or all of the books published in a specific year by a specific publisher. The above queries can be easily handled by a well-known relational database. If you hold a table that stores (title, author, publisher, year of publication) tuples, the above searches can be completed efficiently. Now, what if a user wants to obtain all the documents that contain a certain word or phrase in their actual content? If you try to use a traditional database and store the raw content of all the documents in a field of a tuple, searching would take unacceptably long.
That’s because in a full-text search, the search engine has to scan all of the words of the text document, or text stream in general, and try to match several criteria against it, e.g. find certain words or phrases in its content. Queries of that kind are hopeless in a classic relational database. Granted, many database systems, like MySQL and PostgreSQL, support full-text searching, either natively or via external libraries. But these implementations are neither efficient nor fast and customizable enough. The biggest problem, though, is scalability: they just cannot handle the amount of data that dedicated full-text search engines can.
1.2 Why do we need full-text search engines
The process of generating vast amounts of data is one of the defining characteristics of our time and a major consequence of technological advancements. It goes by the term information overload. Having said that, gathering and storing all that data is beneficial only if you are able to extract useful information out of it and make it reachable by your application’s end users. The most well-known and widely used tool to achieve that is, of course, searching.
One could argue that searching files for a word or a phrase is as simple as scanning them serially from top to bottom, just like you would with a grep command. This might actually suffice for a small number of documents. But what about huge file systems with millions of files? And, if that seems extraordinary to you, what about web pages, databases, emails and code repositories, to name just a few, or all of them combined? It becomes easy to understand that the information each individual user needs might reside in a little document, somewhere in a vast ocean of different information resources. And the retrieval of that document has to seem as easy as breathing.
One can now see why fully customized search-based applications are gaining lots of attention and traction. Adding to that is the fact that searching has become such an important aspect of the end user’s experience that for modern web applications, varying from simple blogs to big platforms like Twitter or Facebook and even military-grade applications, it is inconceivable not to have search facilities. That’s why big vendors don’t want to risk messing up their search features and want to keep them as fast and yet as simple as possible. This has led to the need to upgrade searching from a simple feature to a full platform: a platform that has power, effectiveness, and the necessary flexibility and customization. Apache Lucene delivers exactly that, which is why it is used in most of the aforementioned applications.
1.3 How Lucene works
So, you must be wondering how Lucene can perform very fast full-text searches. Not surprisingly, the answer is that it uses an index. Lucene indexes fall into the category of inverted indexes. Instead of having a classic index where for every document you have the full list of words (or terms) it contains, inverted indexes do it the other way round. For every term (word) in the documents, you have a list of all of the documents that contain that term. That is hugely more convenient when performing full text searches.
The reason that inverted indexes work so well can be seen in the following diagrams. Imagine you have three very large documents. A classic index would be of the form:
For every document you have a huge list of all of the terms it contains. In order to find out whether a document contains a specific term, you have to scan these vast lists, probably sequentially.
On the other hand, an inverted index would have this form:
For every term we maintain a list with all of the documents that contain that term, followed by the position of the term inside the document (of course, additional information can be kept). Now, when a user searches for the term “Lucene”, we can instantly answer that the term “Lucene” is located inside Document1 at position 6 and inside Document2 at position 4. To sum up, inverted indexes use a very big number of very small lists that can be searched instantly. In contrast, a classic index would use a small number of extremely big lists that are impossible to search quickly.
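To make the contrast concrete, here is a toy inverted index in plain Java (a deliberately simplified illustration, not how Lucene actually stores its postings). Each term maps to a list of (document id, position) postings, so answering "which documents contain this term?" is a single map lookup:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexDemo {

    // term -> list of postings, each posting being {documentId, position}
    static Map<String, List<int[]>> buildIndex(String[] docs) {
        Map<String, List<int[]>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            // naive "analysis": lowercase and split on whitespace
            String[] terms = docs[docId].toLowerCase().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], k -> new ArrayList<>())
                     .add(new int[] { docId, pos });
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Lucene is a search library",
            "full text search with Lucene"
        };
        Map<String, List<int[]>> index = buildIndex(docs);
        // finding every occurrence of "lucene" needs no document scan at all
        for (int[] posting : index.get("lucene")) {
            System.out.println("doc " + posting[0] + ", position " + posting[1]);
        }
    }
}
```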
1.4 Basic Lucene workflow
So, Lucene has to have some work done before the actual search, and that work is to create the index. The basic workflow of the indexing process is depicted below:
As you can see in the above diagram:
- You feed it with text documents/sources
- For every document it analyzes the text and splits it into terms (words). Meanwhile it can perform all kinds of analysis on the plain text. You can tailor that step to suit the needs of your own application.
- For every term of the documents it creates the previously described inverted lists.
- Now the index is ready to be searched. You can write queries in many different formats and as a result you will get a list of all of the documents that satisfy the criteria specified in the query.
So far Lucene seems to be a very powerful tool, as it can analyze the text, create the indexes and perform queries on the index. But you have to do some work yourself, like selecting the documents to be indexed, organizing and managing the whole process and its several aspects, as well as eventually getting the search queries from the users and presenting any possible results to them.
2. Basic components for Indexing
In this section we are going to describe the basic components and the basic Lucene classes used to create Indexes.
A Lucene Index is hosted in a normal file system location, or in memory when extra performance is needed and you don’t want it to be stored permanently on your drive. You can even choose to store your index in a database, via JDBC. The implementations of the aforementioned options extend the abstract Directory class.
To keep things simple, let’s just say that Lucene uses a directory in your file system, although there aren’t many differences when using memory or databases; the normal directory is simply more intuitive in my opinion. Lucene will use that directory to store everything that is necessary for the index. You can work with such a directory using the FSDirectory class, feeding it with an arbitrary path of your file system (when working with memory, you use RAMDirectory instead). The FSDirectory class is simply an abstraction above the normal Java file manipulation classes. Creating either kind of Directory takes just a line of code.
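A minimal sketch of creating both kinds of Directory with the Lucene 4.6 API (the file-system path is a hypothetical example):

```java
import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// a file-system backed Directory (the path is just an example)
Directory fsDir = FSDirectory.open(new File("C:/lucene/index"));

// an in-memory Directory, useful for tests or transient indexes
Directory ramDir = new RAMDirectory();
```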
As you remember, we said that it is your responsibility to choose the documents (text files, PDFs, Word documents etc.) and any text sources you want to make searchable, and thus indexed. For every document you want to index, you have to create one Document object that represents it. At this point, it is important to understand that Documents are indexing components, and not the actual text sources. Naturally, because a Document represents an individual physical text source, it is the building unit of the Index. After creating such a Document you have to add it to the Index. Later, when dispatching a search, you will get as a result a list of Document objects that satisfy your query.
Creating a new, empty Document is as simple as calling its no-argument constructor. Now it’s time to fill the Document with Fields. Document objects are populated with a collection of Fields. A Field is simply a pair of (name, value) items. So, when creating a new Document object, you have to fill it with such pairs. A Field can be stored in the index, in which case both the name and the value of the field are literally stored in the Index. Additionally, a Field can be indexed, or to be more precise inverted, in which case the value of the field gets analyzed and tokenized into Terms and becomes available for searching. A Term represents a word from the text of a Field’s value. A Field can be both stored and indexed/inverted, but you don’t have to store a Field to make it indexed/inverted. Storing a Field and indexing/inverting a Field are two different, independent things.
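A sketch of creating an empty Document and populating it with a stored-only Field (the field name and value are just examples):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;

Document doc = new Document();                       // a new, empty Document
doc.add(new StoredField("fileName", "hello.txt"));   // a stored-only (name, value) pair
```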
As I mentioned before, when deploying a search, you will get in return a list of Document objects (representing the physical text sources) that satisfy your query. If you want to have access to the actual value of a Field, you have to declare that Field as stored. This is usually helpful when you want to store the name of the file that the Document represents, its last modification date, its full file path, or any additional information about the text source you want to have access to. For example, if the text source you are indexing is an email, your Document object that represents that email could have these fields:
Example Document representing an email:
- Subject: “Email from example”
- Location: the location of the email
- Body: “Hi there ! Nice meeting you…”
In the above Document, I’ve chosen to index/invert the body of the email, but not to store it. This means that the body of the email will be analyzed and tokenized into searchable terms, but it will not be literally stored in the Index. You can follow that tactic when the volume of the contents of your text source is very big and you want to save space. On the other hand, I’ve chosen to store but not to index all the other fields. When I perform a search, only the body will be searched, while all the other Fields are not taken into consideration in the search, as they are not indexed.
If the Body‘s aforementioned tokenized terms satisfy the query, this Document will be included in the results. Now, when you access that retrieved Document, you can only view its stored Fields and their values. Thus, the actual body of the file will not be available to you through the Document object, despite being searchable. You can only see the Title, Location, From, To and Subject Fields. Storing the location of that email will help me get access to its actual body content. Of course, you can store the body of the email as well, if you want to retrieve it through the Document object, thus making it both searchable and stored (the same goes for the other fields).
So let’s see how you would create the above Document. To create the stored-only fields we are going to use the StoredField class, and to create the non-stored, indexed body of the text we are going to use a TextField. As the value of the “body” field we give a FileReader. This FileReader will be used during the analysis phase to extract the plain text from that source. After extracting the plain text from the file, special components of Lucene will analyze it and split it into indexed terms.
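Putting it together, the email Document could be built like this (the field names and the file path are illustrative assumptions, not fixed Lucene conventions):

```java
import java.io.FileReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

Document doc = new Document();

// stored-only fields: literally kept in the index, not searchable
doc.add(new StoredField("subject", "Email from example"));
doc.add(new StoredField("location", "C:/mails/example.eml")); // hypothetical path

// indexed but not stored: the FileReader's text is analyzed and
// tokenized into Terms, but the raw body is not kept in the index
doc.add(new TextField("body", new FileReader("C:/mails/example.eml")));
```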
A Term represents a word from the text. Terms are extracted from the analysis and tokenization of Fields’ values, thus a Term is the unit of search. Terms are composed of two elements: the actual text word (this can be anything from literal words to email addresses, dates etc.), and the name of the Field this word appeared in.
It is not absolutely necessary to tokenize and analyze a Field to extract Terms out of it. In the previous example, if you want to make the From Field indexed, you don’t really have to tokenize it. The email address email@example.com can serve as a single, searchable Term on its own.
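Lucene provides the StringField class for exactly this case: the whole value is indexed as a single, untokenized Term. A sketch (assuming a Document named doc, and the field name is an example):

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// the whole address becomes one Term; Field.Store.YES also stores the value
doc.add(new StringField("from", "email@example.com", Field.Store.YES));
```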
The Analyzer is one of the most crucial components of the indexing and searching process. It is responsible for taking plain text and converting it to searchable Terms. It is important to understand that Analyzers work with plain-text inputs. It’s the programmer’s responsibility to provide a parser that is able to convert a text source, like an HTML page or a file from your file system, to plain text. This parser is usually a Reader; for example, in the case of files this could be a FileReader.
An Analyzer internally uses a Tokenizer. The Tokenizer can take as input the aforementioned Reader and use it to extract plain text from a specific source (e.g. a File). After obtaining the plain text, the Tokenizer simply splits the text into words. But an Analyzer can do much more than simple text splitting. It can perform several kinds of text and word analysis, like:
- Stemming : Replacing words with their stems. For example, in English the stem of “oranges” is “orange”. So if the end user searches for “orange”, documents that contained “oranges” and “orange” will be obtained.
- Stop Words Filtering : Words like “the”, “and” and “a” are not of any particular interest when performing a search and one might as well consider them “noise”. Removing them will result in better performance and more accurate results.
- Text Normalization : Removes accents and other character markings.
- Synonym Expansion : Adds in synonyms at the same token position as the current word.
These are only some of the analysis tools that are built into Lucene’s Analyzer classes. The most commonly used built-in Analyzer is the StandardAnalyzer, which can remove stop words and convert words to lowercase (for stemming you can turn to the language-specific analyzers, like EnglishAnalyzer). As you know, different languages have different grammar rules, and the Lucene community is trying to embed grammars for as many different languages as possible. But still, if none of Lucene’s built-in Analyzers is suitable for your application, you can create your own.
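To get a feeling for what an Analyzer produces, you can run some text through a StandardAnalyzer’s token stream. A sketch with the Lucene 4.6 API; with the default stop-word set, “The” would be dropped and the remaining words lowercased:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
TokenStream stream = analyzer.tokenStream("body", new StringReader("The Quick Brown Fox"));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString());   // quick, brown, fox
}
stream.end();
stream.close();
```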
2.6 Interacting with the Index
So far, we’ve seen how to create an Index Directory, create a Document and add Fields to it. Now we have to write the Document to the Directory and thus add it to the Index. This is also the step where Tokenizers play their part.
There is no special class called Index (or something like that) in Lucene, as you might expect. Instead, you interact with the Index through an IndexWriter when you want to push content to it (and generally manipulate it), an IndexReader when you want to read from it, and an IndexSearcher when, of course, you want to search it.
Now let’s see how we can create the IndexWriter we want.
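A sketch of that setup with the Lucene 4.6 API (the index path is hypothetical):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter indexWriter =
        new IndexWriter(FSDirectory.open(new File("C:/lucene/index")), config);
```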
We’ve chosen to use a StandardAnalyzer instance for this. Its constructor takes Version.LUCENE_46 as an argument; that is helpful for discovering compatibility dependencies across several releases of Lucene. Keep in mind that StandardAnalyzer internally uses StandardTokenizer. Then, we create an IndexWriterConfig instance. This is a helper class that can hold all the configuration options for an IndexWriter. As you can see, we’ve specified that we want our IndexWriter to use the previously created analyzer and to be set to the appropriate version. Finally, we create the IndexWriter instance. In its constructor arguments we give the FSDirectory instance and the previously created configuration options.
Now you can go ahead and add the previously created Document to the Index, using the above IndexWriter.
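Assuming the IndexWriter is named indexWriter and the Document is named doc, this is a one-liner:

```java
indexWriter.addDocument(doc);   // analyze, tokenize and add the Document to the index
```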
And that’s it. Now, when addDocument is called, all the previously described operations take place:
- The FileReader is used to read the file and convert it to plain text, which is then broken into tokens.
- The Analyzer meanwhile can perform all kinds of syntactic and grammatical analysis on the plain text and then on the individual tokens.
- From the token analysis, Terms are created and used to generate the inverted index.
- Finally, all the necessary files, holding all the info for the Document and the Index, are written to the specified path.
Our document is now indexed. The same process is followed for every Document you add to the Index.
Now that everything is more clear, let’s see the indexing process, with the classes we used, in a diagram. As you can see:
- We pass the Document object to the IndexWriter.
- It uses the Analyzer and its Tokenizer to extract Terms from the plain text obtained by the FileReader.
- Then, it writes everything that is necessary to update the index to the Directory.
Indexing is the hard bit when trying to build your search engine, because you have to:
- Choose the documents and text sources you want to index.
- Provide Reader classes that read the text sources and convert them to plain text. There is a wide variety of built-in classes (and external libraries) that can read a big number of document formats. But if none of them is suitable for your documents, you will have to write your own Reader that will parse them and convert them to plain text.
- Decide on the tokenization and analysis policy that suits your application’s needs. The great majority of applications will do just fine with the StandardAnalyzer and StandardTokenizer. But, it is possible that you want to customize the analysis step a bit further and that requires some work to be done.
- Decide what kind of Fields to use and which of them to store and/or index.
3. Basic components for searching
In this section we are going to describe the basic components and the basic Lucene classes used to perform searches. Searching is the whole point of platforms like Lucene, so it has to be as nimble and as easy as possible.
3.1 QueryBuilder and Query
In Lucene, every query passed to the Index is a Query object. So before actually interacting with the Index to perform a search, you have to build such objects.
Everything starts with the query string. It can be like the query strings you type into well-known search engines, like Google. It can be an arbitrary phrase, or a more structured one, as we will see in the next lesson. But it would be useless to just send this raw string to be searched in the index. You have to process it, like you did with the indexed plain text: you have to split the query string into words and create searchable Terms. Presumably, this can be done using an Analyzer.
Note: It is important to note that you should use the same Analyzer subclass to examine the query as the one you used during the indexing process to examine the plain text.
Here is how you would create a simple Query, processed with StandardAnalyzer:
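A sketch with Lucene 4.6’s QueryBuilder (the field name and the query string are examples):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.QueryBuilder;
import org.apache.lucene.util.Version;

QueryBuilder builder = new QueryBuilder(new StandardAnalyzer(Version.LUCENE_46));
Query query = builder.createBooleanQuery("contents", "lucene index");
```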
We bind a QueryBuilder object to an instance of StandardAnalyzer. You can now use that QueryBuilder instance to create Query objects.
Query is an abstract class and many concrete subclasses are available, like:
- TermQuery, which searches for Documents that contain a specific term.
- BooleanQuery, which creates Boolean combinations of other queries.
- WildcardQuery, which implements wildcard searches, e.g. for query strings like “*abc*”.
- PhraseQuery, which searches for whole phrases, not just for individual Terms.
- PrefixQuery, which searches for Terms with a predefined prefix.
All of these different Query flavors determine the nature of the search that is going to be performed over your index, and each one of them can be obtained through that QueryBuilder instance. In our example we have chosen to use the createBooleanQuery method. It takes two arguments: the first one is the name of the Field whose (surely indexed and probably tokenized) value is going to be searched, and the second one is the query string that is going to be analyzed with the Analyzer we bound to the QueryBuilder. createBooleanQuery can either return a TermQuery or a BooleanQuery, depending on the syntax of the query string.
Presumably, to do a search in the Index, you first have to open it. You can use an IndexReader to open and access it. All processes that need to pull data out of the index go through that abstract class.
It’s very easy to open an already created index with a DirectoryReader.
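A sketch, assuming the index was previously stored in a Directory named directory:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;

IndexReader indexReader = DirectoryReader.open(directory);
```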
As you can see we use a DirectoryReader to open the Directory in which the Index is stored.
DirectoryReader returns a handle for the index, and that’s what the IndexSearcher will work with. IndexSearcher is the class you use to search a single Index. It is bound to an IndexReader. Here is how you can create one:
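Creating the searcher is a one-liner, assuming an IndexReader named indexReader is already open:

```java
import org.apache.lucene.search.IndexSearcher;

IndexSearcher indexSearcher = new IndexSearcher(indexReader);
```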
You use an IndexSearcher to pass Query objects to the IndexReader. Here is how:
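A sketch, assuming an IndexSearcher named indexSearcher and a Query named query (the limit of 10 results is arbitrary):

```java
import org.apache.lucene.search.TopDocs;

TopDocs topDocs = indexSearcher.search(query, 10);   // return at most the 10 best hits
```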
We use the public TopDocs search(Query query, int n) method of IndexSearcher to perform the search. This method takes two arguments. The first one is the Query object. The second is an integer that sets a limit on the number of returned search results. If, for example, you have 10000 documents that satisfy your query, you might not want all of them to be returned; you can state that you want only the first n results. Finally, that method returns a TopDocs object. The TopDocs class represents the hits that satisfy your query.
TopDocs has a public ScoreDoc[] scoreDocs field. A ScoreDoc represents a single hit for a query. It consists of:
- A public int doc field, which is the id of the Document that satisfied the query.
- A public float score field, which is the score that the Document achieved for the query.
The scoring formula is a very essential and complex part of any search platform, and it is what makes Lucene work so well. This formula is used to provide a relevance measure for each retrieved Document. The higher the score, the more relevant that Document is to your query. This helps to separate “good” from “bad” Documents and ensures that you are provided with high-quality results that are as close as possible to the Documents you really need. You can find some useful information on scoring in the documentation of the current version’s Similarity class, and of the older one, as well as in this Information Retrieval article.
4. A Simple Search Application
We are going to build a simple search application that will demonstrate the basic steps of indexing and searching. In this application we are going to use an input folder that contains a bunch of Java source files. Every file in this folder will be processed and added to the index. Then we are going to perform simple queries over that index to see how it works.
We are going to use:
- Eclipse Kepler 4.3 as our IDE.
- JDK 1.7.
- Maven 3 to build our project.
- Lucene 4.6.0, the latest version of Lucene at the time of writing.
First of all, let’s create our Maven Project with Eclipse.
4.1 Create a new Maven Project with Eclipse
Open Eclipse and go to File -> New -> Other -> Maven -> Maven Project and click Next
In the next window select the “Create a simple project (skip archetype selection)” option and click Next :
In the next window fill in the Group Id and Artifact Id, as shown in the picture below, and click Finish:
A new Maven project will be created with the following structure:
4.2 Maven Dependencies
Open the pom.xml and add the dependencies required to use the Lucene libraries:
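The two dependencies, as they would appear in the pom.xml (version 4.6.0, as used throughout this article):

```xml
<dependencies>
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-core</artifactId>
		<version>4.6.0</version>
	</dependency>
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-analyzers-common</artifactId>
		<version>4.6.0</version>
	</dependency>
</dependencies>
```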
As you can see, we are importing lucene-core-4.6.0.jar, which provides all the core classes, and lucene-analyzers-common-4.6.0.jar, which provides all the classes necessary for text analysis.
4.3. A simple indexer class
To create this class, go to the Package Explorer of Eclipse. Under src/main/java create a new package named com.javacodegeeks.enterprise.lucene.index. Inside the newly created package, create a new class that will host the indexing code.
Let’s see the code of that class, which will do the indexing:
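Such an indexer class could look like the following sketch. The class name SimpleIndexer, the INDEX_DIRECTORY path and the field name "contents" are assumptions; the source-files path and the "fileName" stored field follow the description below:

```java
package com.javacodegeeks.enterprise.lucene.index;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleIndexer {

    private static final String SOURCE_FILES =
            "C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/SourceFiles";
    private static final String INDEX_DIRECTORY =
            "C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/Index"; // hypothetical

    public static void main(String[] args) throws IOException {
        int numIndexed = index(new File(INDEX_DIRECTORY), new File(SOURCE_FILES));
        System.out.println("Total files indexed: " + numIndexed);
    }

    private static int index(File indexDir, File sourceDir) throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), config);

        for (File file : sourceDir.listFiles()) {
            Document document = new Document();
            // tokenized and indexed, but not stored
            document.add(new TextField("contents", new FileReader(file)));
            // stored as-is, so we can later locate the actual file
            document.add(new StoredField("fileName", file.getCanonicalPath()));
            indexWriter.addDocument(document);
        }

        int numIndexed = indexWriter.numDocs();
        indexWriter.close();   // release resources and commit the index to disk
        return numIndexed;
    }
}
```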
In the above class we specified our input folder, where the text files are placed, to be C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/SourceFiles, along with the folder where the index is going to be stored. In the index method, we first create a new StandardAnalyzer instance and a new IndexWriter instance. The IndexWriter will use the StandardAnalyzer to analyze the text, and will store the index in the FSDirectory pointing to the aforementioned index path.
The interesting bit is in the for loop. For every file in the source directory:
- We create a new Document.
- We add a new Field, a TextField to be precise, that represents the content of the file. Remember that TextField is used to create a field whose value is going to be tokenized and indexed, but not stored.
- We add another Field, this time a StoredField, that holds the name of the file. Remember that a StoredField is for fields that are just stored, not indexed and not tokenized. Because we store the file name as its full path, we can later use it to access, present and examine the file’s contents.
- Then we simply add the Document to the index.
After the loop:
- We obtain from the IndexWriter the number of indexed Documents.
- We close the IndexWriter, because we don’t need it anymore, and thus the system can reclaim its resources.
When I run this code, here is the output it produces:
Here is the index folder in my system. As you can see, several special files are created (more on that in the next lessons):
Now let’s search that index.
4.4. A simple searcher class
To create this class, go to the Package Explorer of Eclipse. Under src/main/java create a new package named com.javacodegeeks.enterprise.lucene.search. Inside the newly created package, create a new class that will host the searching code.
To get a clearer view of the final structure of the project, have a look at the image below:
Let’s see the code of that class, which will do the searching:
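The searcher class could look like the following sketch. The class name SimpleSearcher, the index path and the field name "contents" are assumptions; the searchIndex method, the query string and the "fileName" field follow the description below:

```java
package com.javacodegeeks.enterprise.lucene.search;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.QueryBuilder;
import org.apache.lucene.util.Version;

public class SimpleSearcher {

    private static final String INDEX_DIRECTORY =
            "C:/Users/nikos/Desktop/LuceneFolders/LuceneHelloWorld/Index"; // hypothetical

    public static void main(String[] args) throws IOException {
        searchIndex(INDEX_DIRECTORY, "private static final String");
    }

    public static void searchIndex(String indexDir, String queryString) throws IOException {
        // open the index and bind a searcher to it
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexDir)));
        IndexSearcher searcher = new IndexSearcher(reader);

        // the same Analyzer used at indexing time must process the query string
        QueryBuilder builder = new QueryBuilder(new StandardAnalyzer(Version.LUCENE_46));
        Query query = builder.createBooleanQuery("contents", queryString);
        // use builder.createPhraseQuery(...) instead to match the whole phrase

        TopDocs topDocs = searcher.search(query, 10);   // at most 10 hits
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document document = searcher.doc(scoreDoc.doc);
            System.out.println(document.get("fileName") + " score: " + scoreDoc.score);
        }
        reader.close();
    }
}
```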
In the searchIndex method we pass the index directory and the query string as arguments. So I’m going to search for “private static final String”. Remember that the files I’ve indexed were Java source files.
The code is pretty self explanatory:
- We open the index directory and obtain an IndexReader for it.
- We then use the QueryBuilder, fed with the StandardAnalyzer, to build our Query object. We used createBooleanQuery to obtain the Query object. Our query string does not have a boolean format (as we will see in the next lesson), so the method will create TermQueries for the analyzed and tokenized terms of the query string.
- We then use the search method of the IndexSearcher to perform the actual search.
- We obtain the ScoreDocs that satisfied the query from the TopDocs returned by the search method. For every ID in the ScoreDocs array, we obtain the corresponding Document.
- From that Document’s get method, we obtain the name of the file, stored in the value of the “fileName” Field.
- We finally print the name of the file and the score it achieved.
Let’s run the program and see what is the output:
It is important to understand that these files do not necessarily contain the whole phrase “private static final String”. Intuitively, the documents with the higher scores contain more of the words of that phrase, and more frequently, than documents with smaller scores. Of course, the scoring formula is much more complex than that, as we said earlier.
For example, if you change the createBooleanQuery call to createPhraseQuery, the whole phrase will be searched: only documents that contain the whole phrase will be returned. When you run the code with that minor change, here is the output:
These files contain the whole query string “private static final String” in their contents.
4.5 Download the source code
You can download the Eclipse project of this example here: LuceneHelloWorld.zip
5. Final notes
It is important to mention that an IndexReader reads the “image” that the index has at the moment it is opened. So, if your application is dealing with text sources that change over short periods of time, it is possible that you have to re-index those files at run time. But you want to be sure that the changes are reflected when you search the Index while your application is still running and you have already opened an IndexReader (which is now outdated). In this case you have to obtain an updated IndexReader like this:
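A sketch using DirectoryReader.openIfChanged, which returns null when the index has not changed (assuming an open DirectoryReader named directoryReader):

```java
import org.apache.lucene.index.DirectoryReader;

DirectoryReader newReader = DirectoryReader.openIfChanged(directoryReader);
if (newReader != null) {
    directoryReader.close();     // release the outdated reader
    directoryReader = newReader; // and switch to the fresh one
}
```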
This will ensure that you get a new, more up-to-date IndexReader, but only if the index has changed. Furthermore, if you want to achieve fast, near real-time searches (e.g. for stream data), you can obtain your IndexReader like so:
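A sketch of opening a near-real-time reader straight from the writer, assuming an open IndexWriter named indexWriter (the boolean asks Lucene to also apply buffered deletes):

```java
import org.apache.lucene.index.DirectoryReader;

DirectoryReader nrtReader = DirectoryReader.open(indexWriter, true);
```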
For performance reasons, the IndexWriter doesn’t flush the changes of the index to the disk immediately. It uses buffers instead, and persists the changes asynchronously. Opening an IndexReader as in the above snippet gives it immediate access to the write buffers of the IndexWriter, and thus instant access to index updates.
Finally, it’s worth mentioning that IndexWriter is thread-safe, so you can use the same instance in many threads to add Documents to one particular index. The same goes for IndexReader and IndexSearcher: many threads can use the same instance of these classes to simultaneously read or search the same index.