Searching made easy with Apache Lucene 4.3

Lucene is a full-text search engine library written in Java that can lend powerful search capabilities to any application. At the heart of Lucene lies a file-based full-text index. Lucene provides APIs to create this index, to add content to it and to delete content from it. It also allows search and retrieval of information from the index using powerful search algorithms. The data stored can be pulled from disparate sources such as a database, a filesystem or websites. Before beginning, let us consider a few terms.

Inverted Index

An inverted index is a data structure that maps a piece of content, such as a word, to the locations of the objects that contain it. To make this clearer, here are some examples:

  1. Book index – The index of a book lists the important words and the pages that contain them, so it helps us navigate to the pages that contain a particular word.
  2. Listing of wines by price range – The price range is the content and the wine name is the object that falls in that price range.
  3. Web index – Listing of website addresses by keyword. For example, a list of all web pages containing the keywords “Apache Lucene”.
  4. Shopping cart – Listing of items in a shopping cart by category.
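
To make the idea concrete, here is a minimal plain-Java sketch of an inverted index. It is not how Lucene implements its index internally; the class name InvertedIndexSketch and its methods are hypothetical names chosen for illustration. It only shows the core mapping from a word to the ids of the documents that contain it.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene or the tutorial project.
final class InvertedIndexSketch {

    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Lowercases the text, splits it on non-alphanumeric characters
    // and records the document id under every resulting word.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns a read-only view of the ids of all documents containing the term.
    Set<Integer> lookup(String term) {
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySet());
        return Collections.unmodifiableSet(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Apache Lucene in Action");
        index.add(2, "Searching with Apache Solr");
        index.add(3, "Lucene for beginners");
        System.out.println(index.lookup("lucene")); // prints [1, 3]
        System.out.println(index.lookup("apache")); // prints [1, 2]
    }
}
```

A lookup is a single map access rather than a scan over every document, which is why an inverted index makes word searches fast.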

Faceted Search

Any object can have multiple properties, and each of these properties is a facet of that object. Faceted search allows us to search for a collection of objects based on multiple facets. It is also known as faceted navigation or faceted browsing, and it lets us search information that is organized according to a faceted classification structure.

Consider an item in a shopping cart. The item can have multiple facets such as category, title, price, color and weight. A faceted search would allow us to find all the items that are in the garden category, are red in color and are priced between Rs.30 and Rs.40.
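
The shopping-cart query above can be sketched in plain Java as a filter over several facets at once. This is only an illustration under assumptions: the Item class, its fields and the sample data are hypothetical, and the linear scan stands in for what a real engine such as Lucene answers from its index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; Item and its fields are assumptions, not Lucene types.
final class FacetFilterSketch {

    static final class Item {
        final String title;
        final String category;
        final String color;
        final double price;

        Item(String title, String category, String color, double price) {
            this.title = title;
            this.category = category;
            this.color = color;
            this.price = price;
        }
    }

    // Keeps only the items that match every requested facet value:
    // the category, the color and the inclusive price range.
    static List<Item> search(List<Item> items, String category, String color,
                             double minPrice, double maxPrice) {
        List<Item> result = new ArrayList<>();
        for (Item item : items) {
            if (item.category.equals(category)
                    && item.color.equals(color)
                    && item.price >= minPrice
                    && item.price <= maxPrice) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Item> cart = Arrays.asList(
                new Item("Watering can", "garden", "red", 35.0),
                new Item("Garden hose", "garden", "green", 38.0),
                new Item("Red mug", "home", "red", 32.0),
                new Item("Flower pot", "garden", "red", 75.0));
        for (Item hit : search(cart, "garden", "red", 30.0, 40.0)) {
            System.out.println(hit.title); // prints Watering can
        }
    }
}
```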

Lucene provides us an API to:

  1. Create an inverted index.
  2. Store information according to faceted classification.
  3. Retrieve information using faceted search.

All of the above makes Lucene a fast search engine that returns highly relevant search results.

Lucene Features

  1. Relevance-ranked search.
  2. Phrase, proximity and wildcard search.
  3. Pluggable analyzers.
  4. Faceted search.
  5. Field-based sorting.
  6. Range queries.
  7. Multiple-index searching.
  8. Fast indexing (150GB/hour).
  9. Easy backup and restore.
  10. Small RAM requirement.
  11. Incremental addition and fast searches.

For the full list, visit: http://lucene.apache.org/core/features.html

Lucene Concepts and Terminologies

  1. Indexing – Indexing involves adding a document to the Lucene index with the help of a class called “IndexWriter”.
  2. Searching – Searching involves retrieving documents from the Lucene index with the help of a class called “IndexSearcher”.
  3. Document – A Lucene Document is a single unit of search and index, for example an item in a shopping cart. A Lucene index can contain millions of documents.
  4. Fields – Fields are the properties of a document. In other words, fields are the facets of the document, which is an object. For example, the category of an item in a shopping cart is a field. Each document can have multiple fields.
  5. Queries – Lucene has its own query language, which allows us to search for documents based on multiple fields. We can assign a weight to a field and also use boolean operators such as AND and OR in the query. For example: return all items in the cart that belong to category garden or home, have color red and have a price less than Rs.1000.
  6. Analyzers – Before the text of a field is indexed, it needs to be converted into its most basic form. First it is tokenized, and then the tokens are typically converted to lowercase, singularized and stripped of punctuation. These tasks are performed by Analyzers. Analyzers are complicated, and using them well requires deeper study. Often the built-in analyzers don’t suffice for our requirements, in which case we can create a new one. For this tutorial we will use StandardAnalyzer, as it contains most of the basic features we require.
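
As a rough illustration of what an analysis step does, the plain-Java sketch below tokenizes a piece of text, drops punctuation and lowercases each token. It is not Lucene's StandardAnalyzer and does not use Lucene's analysis API; the class and method names are hypothetical, and real analyzers do considerably more than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of an analysis chain; not a Lucene Analyzer.
final class AnalysisSketch {

    // Splits the text on runs of characters that are neither letters nor digits
    // (which removes punctuation and whitespace) and lowercases each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, in, action, 2nd, edition]
        System.out.println(analyze("Lucene in Action, 2nd Edition!"));
    }
}
```

Because the same normalization is applied to the indexed text and to the query text, a search for "LUCENE" and a search for "lucene" end up looking for the same token.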

Tutorial objective

  1. Create a Lucene index.
  2. Insert book records into it.
  3. Perform various kinds of searches on this index.

The book item will have the following facets:

  1. Book Title (String)
  2. Book Author (String)
  3. Book Category (String)
  4. #Pages (int)
  5. Price (float)

The code for this tutorial has been committed to SVN. It can be checked out from: https://www.assembla.com/code/weblog4j/subversion/nodes/24/SpringDemos/trunk

This is an extended project with more tutorials. The Lucene classes are in the com.aranin.spring.lucene package:

  1. LuceneUtil – This class contains utility methods to create the index, the IndexWriter and the IndexSearcher.
  2. MySearcherManager – This class uses LuceneUtil and performs searches on the index.
  3. MyWriterManager – This class uses LuceneUtil and performs writes on the index.

Step by step walk-through

1. Dependencies – The dependencies can be added via Maven. Here ${lucene-version} is a Maven property that has to be defined in the POM.

      <dependency>
        <artifactId>lucene-core</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-queries</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-queryparser</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-analyzers-common</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

      <dependency>
        <artifactId>lucene-facet</artifactId>
        <groupId>org.apache.lucene</groupId>
        <type>jar</type>
        <version>${lucene-version}</version>
      </dependency>

2. Creating the index – The index is created by opening an IndexWriter in create mode.

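// Imports needed by the snippets in this walk-through:
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public void createIndex() throws Exception {

    // Use CREATE mode only when the index directory does not exist yet.
    boolean create = true;
    File indexDirFile = new File(this.indexDir);
    if (indexDirFile.exists() && indexDirFile.isDirectory()) {
       create = false;
    }

    Directory dir = FSDirectory.open(indexDirFile);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);

    if (create) {
       // Create a new index in the directory, removing any
       // previously indexed documents:
       iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    }

    // Opening and committing the writer writes an empty index to disk.
    IndexWriter writer = new IndexWriter(dir, iwc);
    writer.commit();
    writer.close(true);
}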
  • indexDir is the directory where you want to create your index.
  • Directory represents a flat list of files used for storing the index. It can be a RAMDirectory, an FSDirectory or a DB-based directory.
  • FSDirectory implements Directory and saves the index in files in the file system.
  • IndexWriterConfig.OpenMode determines whether the writer is opened in CREATE, CREATE_OR_APPEND or APPEND mode. CREATE mode creates a new index if none exists and overwrites an existing one. The method above sets CREATE mode only when the index directory does not exist yet.
  • Calling the above method creates an empty index.

3. Writing to the index – Once the index is created we can write documents to it. That can be done as follows.

public void createIndexWriter() throws Exception {

     File indexDirFile = new File(this.indexDir);

     Directory dir = FSDirectory.open(indexDirFile);
     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
     IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);
     // Append to the index if it exists; otherwise create it.
     iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
     // Keep the writer open and store it for reuse.
     this.writer = new IndexWriter(dir, iwc);

}

The above method creates a writer in CREATE_OR_APPEND mode. In this mode an existing index is not overwritten. Note that this method does not close the writer; it just creates it and stores it in an instance field. Creating an IndexWriter is a costly operation, so we should not create a writer every time we have to write a document to the index. Lucene also allows only one open IndexWriter per index directory at a time (the writer holds a write lock), and IndexWriter is thread-safe, so instead of a pool of writers we should create a single long-lived writer and share it among the threads that write to the index.

public void addBookToIndex(BookVO bookVO) throws Exception {
     Document document = new Document();
     document.add(new StringField("title", bookVO.getBook_name(), Field.Store.YES));
     document.add(new StringField("author", bookVO.getBook_author(), Field.Store.YES));
     document.add(new StringField("category", bookVO.getCategory(), Field.Store.YES));
     document.add(new IntField("numpage", bookVO.getNumpages(), Field.Store.YES));
     document.add(new FloatField("price", bookVO.getPrice(), Field.Store.YES));
     IndexWriter writer =  this.luceneUtil.getIndexWriter();
     writer.addDocument(document);
     writer.commit();
 }

We don't create a writer in the code while inserting. Instead we use a pre-created writer that was stored as an instance variable.

4. Searching the index – This is again done in two steps: 1. Creating an IndexSearcher. 2. Creating a query and doing the search.

public void createIndexSearcher(){
    IndexReader indexReader = null;
    IndexSearcher indexSearcher = null;
    try{
         File indexDirFile = new File(this.indexDir);
         Directory dir = FSDirectory.open(indexDirFile);
         indexReader  = DirectoryReader.open(dir);
         indexSearcher = new IndexSearcher(indexReader);
    }catch(IOException ioe){
        ioe.printStackTrace();
    }

    this.indexSearcher = indexSearcher;
 }

Note – The Analyzer used when searching should be the same as the one used to create the writer, as the analyzer determines the way in which data is stored in the index. Opening the underlying IndexReader is a costly operation, so it makes sense to create the IndexSearcher once up front and reuse it across searches, in a similar way to the IndexWriter. IndexSearcher is thread-safe, so a single instance can be shared.

public List<BookVO> getBooksByField(String value, String field, IndexSearcher indexSearcher){
     List<BookVO> bookList = new ArrayList<BookVO>();
     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
     QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);

     try {
         BooleanQuery query = new BooleanQuery();
         query.add(new TermQuery(new Term(field, value)), BooleanClause.Occur.MUST);

        // Alternative: Query query = parser.parse(value); (throws ParseException)
        int numResults = 100;
        ScoreDoc[] hits =   indexSearcher.search(query,numResults).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
             Document doc = indexSearcher.doc(hits[i].doc);
             bookList.add(getBookVO(doc));
        }

     } catch (IOException e) {
         e.printStackTrace(); 
     }

     return bookList;
}

The IndexSearcher was pre-created and passed to the method. The main part of searching is query formation. Lucene supports many different kinds of queries.

  1. TermQuery
  2. BooleanQuery
  3. WildcardQuery
  4. PhraseQuery
  5. PrefixQuery
  6. MultiPhraseQuery
  7. FuzzyQuery
  8. RegexpQuery
  9. TermRangeQuery
  10. NumericRangeQuery
  11. ConstantScoreQuery
  12. DisjunctionMaxQuery
  13. MatchAllDocsQuery

You can choose the appropriate queries for your searches. The query language syntax is described here: http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.pdf
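
To give a feel for how boolean clauses combine, here is a small plain-Java sketch that works on sets of document ids. It does not use Lucene's BooleanQuery API and ignores scoring; the class, the method names and the sample ids are hypothetical. Roughly speaking, two required (MUST) clauses behave like an intersection, optional (SHOULD) clauses on their own behave like a union, and a MUST_NOT clause removes matches.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of boolean clause semantics on document id sets.
final class BooleanClauseSketch {

    // Documents present in both sets (two required clauses).
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Documents present in either set (two optional clauses).
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // Documents in the first set that are not in the excluded set.
    static Set<Integer> not(Set<Integer> a, Set<Integer> excluded) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample data: ids of the documents matching each single term.
        Set<Integer> garden = new TreeSet<>(Arrays.asList(1, 2, 5));
        Set<Integer> home = new TreeSet<>(Arrays.asList(3, 5));
        Set<Integer> red = new TreeSet<>(Arrays.asList(2, 3, 4));

        // (category garden OR category home) AND color red
        System.out.println(and(or(garden, home), red)); // prints [2, 3]
    }
}
```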

Resources

  1. http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.pdf
  2. http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/index/IndexWriterConfig.OpenMode.html
  3. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/store/FSDirectory.html
  4. https://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
  5. http://www.lucenetutorial.com/lucene-query-syntax.html
  6. http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/Query.html

Summary

Search remains the backbone of any content-driven application. Traditional DB-driven searches are not very powerful and leave a lot to be desired, so there is a need for a fast, accurate and powerful search solution that can be easily incorporated into application code. Lucene fills that gap well: it makes search a breeze and is backed by a powerful array of search capabilities such as relevance ranking and phrase, wildcard, proximity and range search. It is also space and memory efficient. No wonder so many applications have been built on top of Lucene. This article is intended as a basic tutorial that gives readers the tools for getting started with Lucene. There is a lot more to be said, but then don’t you want to explore some on your own?
 

Reference: Searching made easy with Apache Lucene 4.3 from our JCG partner Niraj Singh at the Weblog4j blog.

10 Responses to "Searching made easy with Apache Lucene 4.3"

  1. Majid Lotfi says:

    Hi,
    Thank you for this tutorial, I noticed that the source code you provided in the SVN is missing lot, it does not have the lucene package, the pom is incomplete, can you please add a readme file on how to setup or run this project ?
    thanks lot.

  2. Hi Majid,

    The SVN for this project has not been configured very well. But you can checkout the code from url below

    https://www.assembla.com/code/weblog4j/subversion/nodes/29/SpringDemos/trunk

    Please note that the revision version is 29 rather than 24.

    It is a stand alone java project written in Intellij so migrating to IDE of your preference would just require copying the java files in lucene package and pom dependencies over to your POM.

    Please let me know if you have issues.

    Regards
    Niraj

  3. Teresa says:

    Hi,

    I find it is very helpful, do you have the schema for the database? so I can create the database to test the code?

    Thanks in advance,
    Teresa

  4. GG says:

    I tried downloading and running it on my Netbeans (Ubuntu 12.10 OS) but it was unable to run. Some error in cropping up while accessing “D:/samayik” Also, please tell which is the main class…

  5. Hi GG,

    The path “D:/samayik/mydemoindex ” is a hardcoded path in main methods of classes com.aranin.spring.lucene.MyWriterManager and com.aranin.spring.lucene.MySearcherManager.

    Just to explain the project a bit. There are two main functions of lucene writing and reading. So as a first step we create the index in com.aranin.spring.lucene.MyWriterManager main method. Then we go on to write something to the index.

    Next step we read from the index using com.aranin.spring.lucene.MySearcherManager. Check out the main method of this class to see how we are reading. So to answer your question.

    1. There are two main classes MySearcherManager and MyWriterManager.
    2. Modify the index path “D:/samayik/mydemoindex” in main method in both the classes to suit ubuntu file system path. Make sure that java has read write permission to that directory.
    3. Run MyWriterManager first. This will create an index for you and write few records to it.
    4. Run MySearcherManager to read from this.

    Please let me know if you have further questions.

    Regards
    Niraj

  6. Veer says:

    By default, facets are weighted as the number of documents present in them. Can we give our own weighing parameter (Eg. Relevance, Score)??

  7. Hi Veer,

    As we know, the basic unit of information we store in an index is a Document. Each Document contains fields, which are really the facets. Each field has a score. When we search, the scores of all the fields are taken into consideration, and their sum is used to decide which result scores higher. There are several ways to influence scoring; after some searching on the net, here are a few.

    1. Boost scoring

    There are ways to boost scoring at indexing time and searching time as well. This can be done using Field.setBoost() and Query.setBoost()

    2. Custom scoring.
    We can also write our own scoring algorithms using the Similarity API. We can extend one of the existing Similarity classes or directly subclass the parent to create our own similarity. Once done, the similarity has to be passed to your writer and searcher. Please read this: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

    Other ways are following.

    3. Custom Queries
    4. Scorer
    5. Weight interface

    Please read

    http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/package-summary.html#scoring to get going.

    I am no expert on this so I Hope this puts you in right direction.

    Regards
    Niraj

    • Veer says:

      I know we can modify the scoring system for returning search results.

      But my question is about faceting (in Lucene 4.4): if we use CountFacetRequest, it returns the number of documents present in that category (facet), but I want to change this count. I mean, is it possible to return scores instead? I tried using SumScoreFacetRequest, but it's not working. It's still returning the count.

      How can we change this parameter?

  8. Pete Lyon says:

    Hi! Thank you for this information!
    I had to convert from Lucene 3 and it works perfectly now in 4!

    Regards Pete

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.