Srividhya Umashanker

About Srividhya Umashanker

A Highly motivated engineer experienced is telecom/marketing and server management domain with a wide variety of technology experience.

Lucene – Quickly add Index and Search Capability

What is Lucene?

Apache LuceneTM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Lucene can plain text, integers, index PDF, Office Documents. etc.,

How Lucene enables Faster Search?

Lucence creates something called Inverted Index. Normally we map document -> terms in the document. But, Lucene does the reverse. Creates a index term -> list of documents containing the term, which makes it faster to search.

Install Lucene

Maven Dependency

<pre class='brush:xml'><dependency>
 <groupid>org.apache.lucene</groupid>
 <artifactid>lucene-core</artifactid>
 <version>3.0.2</version>
 <type>jar</type>
 <scope>compile</scope>
</dependency>

Download Dependency

Download Lucene from http://lucene.apache.org/ and add the lucene-core.jar in the classpath

How does Lucene Work?


Let’s understand the picture first from bottom – Center. The Raw Text is used to create a Lucene ‘Document’ which is analyzed using the specified Analyzer and Document is added to the index based on the Store, TermVector and Analzed property of the Fields.

Next, the search from top to center. The users specify the query in a text format. The query Object is build based on the query text and the result of the executed query is returned as TopDocs.

Core Lucene Classes

Directory, FSDirectory, RAMDirectoryDirectory containing Index

File system based index dir

Memory based index dir

Directory

indexDirectory = FSDirectory.open(new File(‘c://lucene//nodes’));

IndexWriterHandling writing to index – addDocument, updateDocument, deleteDocuments, merge etcIndexWriter writer = new IndexWriter(indexDirectory,

new StandardAnalyzer(Version.LUCENE_30),

new MaxFieldLength(1010101));

IndexSearcherSearch using indexReader – search(query, int)IndexSearcher searcher = new IndexSearcher(indexDirectory);
DocumentDTO used to index and searchDocument document = new Document();
FieldEach document contains multiple fields. Has 2 part, name, value.new Field(‘id’, ’1′, Store.YES, Index.NOT_ANALYZED)
TermA word from test. Used in search.2 parts.Field to search and value to searchTerm term = new Term(‘id’, ’1′);
QueryBase of all types of queries – TermQuery, BooleanQuery, PrefixQuery, RangeQuery, WildcardQuery, PhraseQuery etc.Query query = new TermQuery(term);
AnalyzerBuilds tokens from text, and helps in building index terms from textnew StandardAnalyzer()

The Lucene Directory

Directory – is the data space on which lucene operates. It can be a File System or a Memory.

Below are the often used Directory

DirectoryDescriptionExample
FSDirectoryFile System based DirectoryDirectory = FSDirectory.open(File file);
// File -> Directory path
RAMDirectoryMemory based Lucene directoryDirectory = new MemoryDirectory()

Directory = new MemoryDirectory(Directory dir) // load File based Directory to memory


Create an Index Entry

Lucene ‘Document’ object is the main object used in indexing. Documents contain multiple fields. The Analyzers work on the document fields to break them down into tokens and then writes the Directory using an Index Writer.

IndexWriter

IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer(Version.LUCENE_30), true, MaxFieldLength.UNLIMITED);

Analyzers

The job of analyzing text into tokens or keywords to be searched. There are few default Analyzers provided by Lucene. The choice of Analyzer defined how the indexed text is tokenized and searched.

Below are some standard analyzers.

Example – How analyzers work on sample text

Properties that define Field indexing

  • Store – Should the Field be stored to retrieve in the future
  • ANALYZED – Should the contents be split into tokens
  • TermVECTOR – Term based details to be stored or not


Store :

Should the field be stored to retreive later

STORE.YESStore the value, can be retrieved later from the index
STORE.NODon’t store. Used along with Index.ANALYZED. When token are used only for search


Analyzed:

How to analyze the text

Index.ANALYZEDBreak the text into tokens, index each token make them searchable
Index.NOT_ANALYZEDIndex the whole text as a single token, but don’t analyze (split them)
Index.ANALYZED_NO_NORMSSame as ANALYZED, but does not store norms
Index.NOT_ANALYZED_NO_NORMSSame as NOT_ANALYZED, but without norms
Index.NODon’t
make this field searchable completely


Term Vector

Need Term details for similar, highlight etc.

TermVector.YESRecord
UNIQUE TERMS + COUNTS + NO POSITIONS + NO OFFSETSin each document
TermVector.WITH_POSITIONSRecord
UNIQUE TERMS + COUNTS + POSITIONS + NO OFFSETSin each document
TermVector.WITH_OFFSETSRecord
UNIQUE TERMS + COUNTS + NO POSITIONS + OFFSETSin each document
TermVector.WITH_POSITIONS_OFFSETSRecord
UNIQUE TERMS + COUNTS + POSITIONS + OFFSETSin each document
TermVector.NODon’t record term vector information

Example of Creating Index

IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer(Version.LUCENE_30), true,MaxFieldLength.UNLIMITED);

Document document = new Document();
document.add(new Field('id', '1', Store.YES, Index.NOT_ANALYZED));
document.add(new Field('name', 'user1', Store.YES, Index.NOT_ANALYZED));
document.add(new Field('age', '20', Store.YES, Index.NOT_ANALYZED));
writer.addDocument(document);

Example of Updating Index

IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer(Version.LUCENE_30), true,MaxFieldLength.UNLIMITED);
 
Document document = new Document();
document.add(new Field("id", "1", Store.YES, Index.NOT_ANALYZED));
document.add(new Field("name", "user1", Store.YES, Index.NOT_ANALYZED));
document.add(new Field("age", "20", Store.YES, Index.NOT_ANALYZED));
writer.addDocument(document);

Example of Deleting Index

IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer(Version.LUCENE_30), MaxFieldLength.UNLIMITED);

Term term = new Term('id', '1');
writer.deleteDocuments(term);

Searching an Index : The users specify the query in a text format. The query Object is built based on the query text, analyzed and the result of the executed query is returned as TopDocs.
Queries are the main input for the search.

TermQuery
BooleanQueryAND or not ( combine multiple queries)

Machine generated alternative text: Table 36 Boolean query operator shortcuts Verbose syntax Shortcut syntax aANDb +a+b aORb ab aNDNOTb +a-b

 

PrefixQueryStarts with
WildcardQuery? And *
- * not allowed in the beginning
PhraseQueryExact phrase
RangeQueryTerm range or numeric range
FuzzyQuerySimilar words search

Sample Queries

Example on Search:

IndexSearcher searcher = new IndexSearcher(indexDirectory);
Term term = new Term('id', '1');
Query query = new TermQuery(term);
TopDocs docs = searcher.search(query, 3);
for (int i = 1; i <= docs.totalHits; i++)
{
     System.out.println(searcher.doc(i));
}

Lucene Diagnostic Tools:

  • Luke – http://code.google.com/p/luke/
    Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:
  • Limo – http://limo.sourceforge.net/
    The idea is to have a small tool, running as a web application, that gives basic information about indexes used by the Lucene search engine

Complete Example:

Download here : LuceneTester.java

Resources

 

Reference: Lucene – Quickly add Index and Search Capability from our JCG partner Srividhya Umashanker at the Thoughts of a Techie blog.

Related Whitepaper:

Functional Programming in Java: Harnessing the Power of Java 8 Lambda Expressions

Get ready to program in a whole new way!

Functional Programming in Java will help you quickly get on top of the new, essential Java 8 language features and the functional style that will change and improve your code. This short, targeted book will help you make the paradigm shift from the old imperative way to a less error-prone, more elegant, and concise coding style that’s also a breeze to parallelize. You’ll explore the syntax and semantics of lambda expressions, method and constructor references, and functional interfaces. You’ll design and write applications better using the new standards in Java 8 and the JDK.

Get it Now!  

One Response to "Lucene – Quickly add Index and Search Capability"

  1. good article! but the link seems to be wrong ;-)

Leave a Reply


− two = 7



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.

Sign up for our Newsletter

20,709 insiders are already enjoying weekly updates and complimentary whitepapers! Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

As an extra bonus, by joining you will get our brand new e-books, published by Java Code Geeks and their JCG partners for your reading pleasure! Enter your info and stay on top of things,

  • Fresh trends
  • Cases and examples
  • Research and insights
  • Two complimentary e-books