The Lucene search engine: Powerful, flexible, and free

Easily add searching to your application with Lucene
Summary
Lucene is a Java-based open source toolkit for text indexing and searching. It is easy to use, flexible, and powerful -- a model of good object-oriented software architecture. Powerful abstractions and useful concrete implementations make Lucene very flexible, and allow new users to get up and running quickly and painlessly. In this article, we explore what Lucene does, how it works, and what software engineers can learn from its design. (3,400 words)
By Brian Goetz

Don't let the low version number -- 0.04 as of August 2000 -- fool you. The Lucene search engine is a robust, powerful, and flexible search toolkit, ready to tackle many common search problems. And since it's now available under the more flexible LGPL open source license, the price (free!) is right too.
Doug Cutting, an experienced developer of text-search and retrieval tools, created Lucene. Cutting is the primary author of the V-Twin search engine (part of Apple's Copland operating system effort) and is currently a senior architect at Excite. He designed Lucene to make it easy to add indexing and search capability to a broad range of applications, including:
Searchable email: An email application could let users search archived messages and add new messages to the index as they arrive.
Online documentation search: A documentation reader -- CD-based, Web-based, or embedded within the application -- could let users search online documentation or archived publications.
Searchable Webpages: A Web browser or proxy server could build a personal search engine to index every Webpage a user has visited, allowing users to easily revisit pages.
Website search: A CGI program could let users search your Website.
Content search: An application could let the user search saved documents for specific content; this could be integrated into the Open Document dialog.
Version control and content management: A document management system could index documents, or document versions, so they can be easily retrieved.
News and wire service feeds: A news server or relay could index articles as they arrive.
Of course, many search engines could perform most of those functions, but few open source search tools offer Lucene's ease of use, rapid implementation, and flexibility.
I first used Lucene when developing Eyebrowse, an open source Java-based tool for cataloguing and browsing mailing lists. (See Resources for a link.) A core requirement for Eyebrowse was flexible message search and retrieval capability. It demanded an indexing and search component that would efficiently update the index base as new messages arrived, allow multiple users to search and update the index base concurrently, and scale to archives containing millions of messages.
Every other open source search engine I evaluated, including Swish-E, Glimpse, iSearch, and libibex, was poorly suited to Eyebrowse's requirements in some way. This would have made integration problematic and/or time-consuming. With Lucene, I added indexing and searching to Eyebrowse in little more than half a day, from initial download to fully working code! This was less than one-tenth of the development time I had budgeted, and yielded a more tightly integrated and feature-rich result than any other search tool I considered.
How search engines work
Creating and maintaining an inverted index is the central problem when building an efficient keyword search engine. To index a document, you must first scan it to produce a list of postings. Postings describe occurrences of a word in a document; they generally include the word, a document ID, and possibly the location(s) or frequency of the word within the document.
If you think of the postings as tuples of the form <word, document ID>, a set of documents will yield a list of postings sorted by document ID. But in order to efficiently find documents that contain specific words, you should instead sort the postings by word (or by both word and document, which will make multiword searches faster). In this sense, building a search index is basically a sorting problem: the search index is a list of postings sorted by word.
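To make the idea concrete, here is a toy inverted index in plain Java. It illustrates only the data structure, not Lucene's code; the class and its methods are invented for this sketch.
import java.util.*;

// A toy inverted index: each word maps to the sorted set of IDs of the
// documents containing it. The TreeMap keeps the postings sorted by word,
// which is exactly the ordering a search index needs.
public class ToyInvertedIndex {
    private final SortedMap index = new TreeMap();   // word -> document IDs

    public void addDocument(int docId, String text) {
        StringTokenizer words = new StringTokenizer(text.toLowerCase());
        while (words.hasMoreTokens()) {
            String word = words.nextToken();
            SortedSet postings = (SortedSet) index.get(word);
            if (postings == null) {
                postings = new TreeSet();
                index.put(word, postings);
            }
            postings.add(new Integer(docId));
        }
    }

    // Returns the IDs of documents containing the given word, in ID order.
    public SortedSet lookup(String word) {
        SortedSet postings = (SortedSet) index.get(word.toLowerCase());
        return (postings != null) ? postings : new TreeSet();
    }
}
Lucene's on-disk index is far more compact and scalable than this, but the word-sorted ordering is the same.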
An innovative implementation
Most search engines use B-trees to maintain the index; they are relatively stable with respect to insertion and have well-behaved I/O characteristics (lookups and insertions are O(log n) operations). Lucene takes a slightly different approach: rather than maintaining a single index, it builds multiple index segments and merges them periodically. For each new document indexed, Lucene creates a new index segment, but it quickly merges small segments with larger ones -- this keeps the total number of segments small, so searches remain fast. To optimize the index for fast searching, Lucene can merge all the segments into one, which is useful for infrequently updated indexes. To prevent conflicts (or locking overhead) between index readers and writers, Lucene never modifies segments in place; it only creates new ones. When merging segments, Lucene writes a new segment and deletes the old ones once any active readers have closed them. This approach scales well, offers the developer a high degree of flexibility in trading off indexing speed for searching speed, and has desirable I/O characteristics for both merging and searching.
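For example, a read-mostly application might collapse its index after each batch of updates. A minimal sketch, assuming your release exposes the optimize method on IndexWriter that performs this full merge (imports and exception handling omitted):
IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
// ... add the new batch of documents here ...
writer.optimize();   // merge all existing segments into a single segment
writer.close();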
A Lucene index segment consists of several files:
A dictionary index, containing one entry for each 100 entries in the dictionary
A dictionary, containing one entry for each unique word
A postings file, containing an entry for each posting
Since Lucene never updates segments in place, they can be stored in flat files instead of complicated B-trees. For quick retrieval, the dictionary index contains offsets into the dictionary file, and the dictionary holds offsets into the postings file. Lucene also implements a variety of tricks to compress the dictionary and posting files -- thereby reducing disk I/O -- without incurring substantial CPU overhead.
Evaluating search engines
Other widely used open source search engines include Swish-E, Glimpse, libibex, freeWAIS, and iSearch. Like any software package, each is optimized for use in particular situations; it is often difficult to deploy these tools outside of their intended domains. Consider the following features when evaluating a search engine:
Incremental versus batch indexing: Some search engines only support batch indexing; once they create an index for a set of documents, adding new documents becomes difficult without reindexing all the documents. Incremental indexing allows easy adding of documents to an existing index. For some applications, like those that handle live data feeds, incremental indexing is critical. Lucene supports both types of indexing.
Data sources: Many search engines can only index files or Webpages. This handicaps applications where indexed data comes from a database, or where multiple virtual documents exist in a single file, such as a ZIP archive. Lucene allows developers to deliver the document to the indexer through a String or an InputStream, permitting the data source to be abstracted from the data. However, with this approach, the developer must supply the appropriate readers for the data.
Indexing control: Some search engines can automatically crawl through a directory tree or a Website to find documents to index. While this is convenient if your data is already stored in this manner, crawler-based indexers often provide limited flexibility for applications that require fine-grained control over the indexed documents. Since Lucene operates primarily in incremental mode, it lets the application find and retrieve documents.
File formats: Some search engines can only index text or HTML documents; others support a filter mechanism, which offers a simple way to index word processing documents, SGML documents, and other file formats. Lucene supports such a mechanism.
Content tagging: Some search engines treat a document as a single stream of tokens; others allow the specification of multiple data fields within a document, such as "subject," "abstract," "author," and "body." This permits semantically richer queries like "author contains Hamilton AND body contains Constitution." Lucene supports content tagging by treating documents as collections of fields, and supports queries that specify which field(s) to search.
Stop-word processing: Common words, such as "a," "and," and "the," add little value to a search index. But since these words are so common, cataloging them will contribute considerably to the indexing time and index size. Most search engines will not index certain words, called stop words. Some use a list of stop words, while others select stop words statistically. Lucene handles stop words with the more general Analyzer mechanism, to be described later, and provides the StopAnalyzer class, which eliminates stop words from the input stream.
Stemming: Often, a user desires a query for one word to match other similar words. For example, a query for "jump" should probably also match the words "jumped," "jumper," or "jumps." Reducing a word to its root form is called stemming. Lucene does not yet implement stemming, but you could easily add a stemmer through a more sophisticated Analyzer class (see the sketch at the end of this section).
Query features: Search engines support a variety of query features. Some support full Boolean queries; others support only "and" queries. Some return a "relevance" score with each hit. Some can handle adjacency or proximity queries -- "search followed by engine" or "Knicks near Celtics" -- while others can only search on single keywords. Some can search multiple indexes at once and merge the results to give a meaningful relevance score. Lucene supports a wide range of query features, including all of those listed above. However, Lucene does not support the valuable Soundex, or "sounds like," query.
Concurrency: Can multiple users search an index at the same time? Can a user search an index while another updates it? Lucene allows users to search an index transactionally, even if another user is simultaneously updating the index.
Non-English support: Many search engines implicitly assume that English is the target language; this is evident in areas such as stop-word lists, stemming algorithms, and the use of proximity to match phrase queries. As Lucene preprocesses the input stream through the Analyzer class provided by the developer, it is possible to perform language-specific filtering.
Though by no means exhaustive, the above list offers a starting point for evaluating a search engine for a particular project. Some search tools are poorly suited to certain tasks -- understanding your application's requirements can help you choose the right tool for the job.
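To make the stop-word and stemming points above concrete, here is a sketch of an analyzer that lowercases its input, removes stop words, and stems what remains. The filter classes shown (LowerCaseTokenizer, StopFilter, and a Porter stemming filter) follow later Lucene releases; treat the names and signatures as assumptions, since the API in the version described here may differ. Imports are omitted.
// Hypothetical analyzer combining stop-word removal and stemming.
public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(Reader reader) {
        TokenStream stream = new LowerCaseTokenizer(reader);   // split into lowercase words
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);   // drop "a", "and", "the", ...
        stream = new PorterStemFilter(stream);   // reduce each word to its root form
        return stream;
    }
}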
Using Lucene
I will illustrate how to use Lucene to create, populate, and search an index. For clarity, import statements and exception handling have been omitted from the sample programs. In these illustrations, I have stored the search index in the filesystem (you can store indexes anywhere, e.g., in memory or in a database). The files being indexed are simple text files. With Lucene, you can also easily index other document formats and documents not stored in files.
Create an index
The simple program CreateIndex.java creates an empty index by generating an IndexWriter object and instructing it to build an empty index. In this example, the name of the directory that will store the index is specified on the command line.
public class CreateIndex {
    // usage: CreateIndex index-directory
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        IndexWriter writer;
        // An index is created by opening an IndexWriter with the
        // create argument set to true.
        writer = new IndexWriter(indexPath, null, true);
        writer.close();
    }
}
Index text documents
IndexFiles.java shows how to add documents -- the files named on the command line -- to an index. For each file, IndexFiles creates a Document object, then calls IndexWriter.addDocument to add it to the index. From Lucene's point of view, a Document is a collection of fields that are name-value pairs. A Field can obtain its value from a String, for short fields, or an InputStream, for long fields. Using fields allows you to partition a document into separately searchable and indexable sections, and to associate metadata -- such as name, author, or modification date -- with a document. For example, when storing mail messages, you could put a message's subject, author, date, and body in separate fields, then build semantically richer queries like "subject contains Java AND author contains Gosling." In the code below, we store two fields in each Document: path, to identify the original file path so it can be retrieved later, and body, for the file's contents.
public class IndexFiles {
    // usage: IndexFiles index-path file . . .
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        IndexWriter writer;
        // Opening the IndexWriter with create set to false adds
        // documents to an existing index (incremental indexing).
        writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
        for (int i = 1; i < args.length; i++) {
            System.out.println("Indexing file " + args[i]);
            InputStream is = new FileInputStream(args[i]);
            // We create a Document with two Fields, one which contains
            // the file path, and one the file's contents.
            Document doc = new Document();
            doc.add(Field.UnIndexed("path", args[i]));
            doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));
            writer.addDocument(doc);
            is.close();
        }
        writer.close();
    }
}
Search
Search.java provides an example of how to search the index. While the com.lucene.search package contains many classes for building sophisticated queries, here we use the built-in query parser, which handles the most common queries and is less complicated to use. We create a Searcher object, use the QueryParser to create a Query object, and call Searcher.search on the query. The search operation returns a Hits object -- a collection of Document objects, one for each document matched by the query, each with an associated relevance score -- sorted by score.
public class Search {
    public static void main(String[] args) throws Exception {
        String indexPath = args[0], queryString = args[1];
        Searcher searcher = new IndexSearcher(indexPath);
        Query query = QueryParser.parse(queryString, "body",
                                        new SimpleAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("path") + "; Score: " +
                               hits.score(i));
        }
    }
}
The built-in query parser supports most queries, but if it is insufficient, you can always fall back on the rich set of query-building constructs provided; a sketch follows the examples below. The query parser can parse queries like these:
free AND "text search" -- Search for documents containing "free" and the phrase "text search"
+text search -- Search for documents containing "text" and preferentially containing "search"
giants -football -- Search for "giants" but omit documents containing "football"
author:gosling java -- Search for documents containing "gosling" in the author field and "java" in the body
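When the parser is not enough, Query objects can be assembled directly. Here is a sketch of the first example above built programmatically. It assumes the TermQuery, PhraseQuery, and BooleanQuery classes of the search package; the BooleanQuery.add signature shown (required/prohibited flags) varies across Lucene releases.
// Build the query: free AND "text search"
TermQuery free = new TermQuery(new Term("body", "free"));

PhraseQuery phrase = new PhraseQuery();   // the phrase "text search"
phrase.add(new Term("body", "text"));
phrase.add(new Term("body", "search"));

BooleanQuery query = new BooleanQuery();
query.add(free, true, false);     // required, not prohibited
query.add(phrase, true, false);   // required, not prohibited

Hits hits = searcher.search(query);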
Beyond basic text documents
Lucene uses three major abstractions to support building text indexes: Document, Analyzer, and Directory. The Document object represents a single document, modeled as a collection of Field objects (name-value pairs). For each document to be indexed, the application creates a Document object and adds it to the index store. The Analyzer converts the contents of each Field into a sequence of tokens.
A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied. The application filters undesired tokens, like stop words or portions of the input that do not need to be indexed, through the Analyzer class. It also modifies tokens as they are encountered in the input, to perform stemming or other term normalization. Conveniently, Lucene comes with a set of standard Analyzer objects for handling common transformations like word identification and stop-word elimination, so indexing simple text documents requires no additional work. If these aren't enough, the developer can provide more sophisticated analyzers.
The application provides the document data in the form of a String or InputStream, which the Analyzer converts to a stream of tokens. Because of this, Lucene can index data from any data source, not just files. If the documents are stored in files, use FileInputStream to retrieve them, as illustrated in IndexFiles.java. If they are stored in an Oracle database, provide an InputStream class to retrieve them. If a document is not a text file but an HTML or XML file, for example, you can extract content by eliminating markup like HTML tags, document headers, or formatting instructions. This can be done with a FilterInputStream that wraps the InputStream retrieving the document and converts the raw stream into a stream containing only the document's content text. So, if we wanted to index a collection of XML documents stored in an Oracle database, the resulting code would be very similar to IndexFiles.java. But it would use an application-provided InputStream class to retrieve the document from the database (instead of FileInputStream), as well as an application-provided FilterInputStream to parse the XML and extract the desired content.
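Here is one possible tag-stripping stream of the sort described above; the class name is invented for this sketch. It passes through everything outside angle-bracket markup -- a production filter would also need to handle entities, comments, and character encodings.
import java.io.*;

// A crude content filter: delivers only the bytes outside <...> markup.
public class TagStripInputStream extends FilterInputStream {
    private boolean inTag = false;

    public TagStripInputStream(InputStream in) {
        super(in);
    }

    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (c == '<')
                inTag = true;
            else if (c == '>')
                inTag = false;
            else if (!inTag)
                return c;          // content byte: pass it through
        }
        return -1;                 // end of the underlying stream
    }

    // Route bulk reads through the filtering read() above.
    public int read(byte[] b, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1)
                return (n == 0) ? -1 : n;
            b[off + n++] = (byte) c;
        }
        return n;
    }
}
The indexing code would then wrap the stream that retrieves the document, for example new InputStreamReader(new TagStripInputStream(databaseStream)).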
Just as Lucene allows the application to control the handling of raw document data through the Analyzer and InputStream classes, it also defines an abstract class for reading and writing the index store (Directory). Lucene also provides concrete implementations of Directory for storing indexes in RAM (RAMDirectory) or in files (FSDirectory). If, for instance, you want to store the index data in a document control system or database -- or compress or encrypt the index data -- you can simply provide your own Directory class. Most users will use the provided implementations, usually the file-based implementation. But allowing the application to handle index storage enhances the package's flexibility.
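For instance, a short-lived index can live entirely in memory. A minimal sketch, assuming IndexWriter and IndexSearcher constructors that accept a Directory, as in later releases:
Directory directory = new RAMDirectory();   // the index lives entirely in memory
IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
// ... add documents exactly as in IndexFiles.java ...
writer.close();
Searcher searcher = new IndexSearcher(directory);   // search the same in-memory index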
A case study
When developing Eyebrowse, we examined -- and discarded -- a number of widely used open source search tools. At first glance, Eyebrowse's search and retrieval features seemed quite straightforward, but we were surprised to find that few of the tools we examined were flexible enough for our purposes. Most search engines are designed to index files or Webpages only -- we didn't want to index either. Message metainformation was stored in an SQL database; message bodies and attachments were stored in mailbox files that contained many individual messages. This would have necessitated an intermediate step in which the mailbox files were exploded into thousands of small files just for indexing purposes, which seemed silly and inefficient.
Because Lucene is a search toolkit, not a monolithic search program, it was much easier to tightly integrate it into our application and control its behavior. Because of Lucene's flexible document model, we were able to construct and index virtual documents, which were a combination of the metadata drawn from the database and the message body drawn from the mailbox file, without having to create any intermediate files. Because it supports efficient incremental indexing, we could add new messages to the index base as they arrived. The built-in query parser supported every query feature we needed, and the search performance was perfectly acceptable. Ultimately, we added the required search features in much less time than we had budgeted, but more importantly, we were very satisfied with the quality of the resulting integration.
What can we learn?
Lucene is a fine example of good object-oriented software design and architecture. Beneath its design lies a carefully crafted division of labor between the application and the search engine, which transforms indexing from a monolithic process into a collection of cooperating objects, each performing a single function and operating in a single domain. For example, when indexing a file, the FileInputStream class retrieves the document data; the appropriate Analyzer transforms it into a stream of tokens; the IndexWriter class indexes it; and the FSDirectory class stores the index on disk for later retrieval. Each of these classes performs one function, and each can be easily replaced without affecting the others.
Lucene's factoring leaves the application in charge of functions that it already knows about -- selecting and retrieving documents, storing the index data -- and leaves the search engine to do what it does best. However, good factoring between the component and application domains is only part of what makes a software toolkit easy to use. A useful set of default implementations for the application-domain objects is equally important. Instead of just dumping the application-domain problems in the developer's lap, Lucene provides a set of tools for solving the most common application-domain problems. This supports the design principle of commensurate effort -- the user does not have to learn much about the architecture to implement its basic functionality, but can access more advanced functionality with additional effort. The result: developers can often integrate Lucene's searching capabilities with their projects in just a few hours.
We can all learn something from Lucene's design. While many programs make excellent use of abstraction, not many are able to craft abstractions that a new user can easily and quickly grasp, and few provide all the pieces that allow users to get up and running so quickly. When a software tool demands that its users completely understand everything about it before they can benefit from it, it alienates would-be users. Shouldn't we all make our software inviting, rather than intimidating, to our users?
Conclusion
Lucene is the most flexible and convenient open source search toolkit I've ever used. Cutting describes his primary goal for Lucene as "simplicity without loss of power or performance," and this shines through clearly in the result. Lucene's design seems so simple, you might suspect it is just the obvious way to design a search toolkit. We should all be so lucky as to craft such obvious designs for our own software.
About the author
Brian Goetz is a professional software developer with over 15 years of experience. He is a principal consultant at Quiotix Corporation, a software development and consulting firm in Los Altos, Calif.
Resources
See the Lucene Website to download the Lucene code, browse or subscribe to the Lucene mailing lists, or view the Lucene documentation:
http://www.lucene.com
For more information about Eyebrowse:
http://eyebrowse.tigris.org