Lucene Tutorial
By Steven J. Owens
Jakarta Lucene (http://jakarta.apache.org/lucene/) is a high-performance, full-featured, open-source text search engine API, written in Java by Doug Cutting.
Note that Lucene is specifically an API, not an application. This means that all the hard parts have been done, but the easy programming has been left to you. The payoff is that, unlike a normal search engine application, you spend less time wading through tons of options and more time building a search application specifically suited to what you're doing. Lucene is startlingly easy to develop with and use.
I'm going to assume that you're a basically competent programmer, and basically competent in Java.
Use the Source, Luke
This tutorial is a brief overview; the Lucene distribution comes with four example classes:
FileDocument, IndexFiles, SearchFiles, and DeleteFiles
These classes are really a good introduction to how to use Lucene. I wrote this tutorial because I find it easier to follow code if I have a general idea of what's going on, but it was tricky to write because it starts to look like the source code. Lucene really does make it that easy.
Overview
I'm going to try to use emphasis tags any time I introduce a Lucene API class name.
Here's a simple attempt to diagram how the Lucene classes go together:

Index
    Document
        Field (name/value)
At the heart of Lucene is an Index. You pump data into the Index, then do searches on the Index to get results out. To build the Index, you use an IndexWriter object. To run a search on the Index you use an IndexSearcher object.
The search itself is a Query object, which you pass into IndexSearcher.search(). IndexSearcher.search() returns a Hits object, which contains a Vector of Document objects.
Document objects are stored in the Index, but they have to be put into the Index at some point, and that's your job. You have to select what data to index and convert it into Documents. You read in each data file (or database entry, or whatever), instantiate a Document for it, break down the data into chunks, and store the chunks in the Document as Field objects (name/value pairs). When you're done building a Document, you write it to the Index using the IndexWriter.
Queries can be quite complicated, so Lucene includes a tool to help generate Query objects, called a QueryParser. The QueryParser takes a query string, much like what you'd put into an Internet search engine, and generates a Query object.
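To make that overview concrete, here's a minimal sketch of the whole round trip - index one Document, then search for it. This is my sketch, not code from the Lucene distribution; I'm assuming the Jakarta-era org.apache.lucene package names and an index directory of /tmp/index, so adjust to taste.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        // Indexing: pump a Document full of Fields into an IndexWriter.
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "Let's get together for dinner tonight!"));
        writer.addDocument(doc);
        writer.close();

        // Searching: parse the query string with the SAME kind of Analyzer,
        // then run the Query against the index.
        Query query = QueryParser.parse("dinner", "contents", new StandardAnalyzer());
        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching document(s)");
        searcher.close();
    }
}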
Note: There's a gotcha that often pops up, so even though it's a lower-level detail, I'm going to mention it here. It's the Analyzer. Lucene indexes text, and part of the first step is cleaning up the text. You use an Analyzer to do this - it drops out punctuation and commonly occurring but meaningless words (the, a, an, etc). Lucene provides a couple of different Analyzers, and you can also make your own, but the BIG GOTCHA people keep running into is that you must make sure you use the same sort of Analyzer for indexing and for searching.
Did you notice what's not in the above? Lucene handles the indexing, searching and retrieving, but it doesn't handle:
- managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
- selecting the data files
- parsing the data files
- getting the search string from the user
- displaying the search results to the user
Those are all your job. There are some helpful tools and some good examples available in the Lucene contrib space, but generally Lucene is focused on doing the indexing and searching, and leaves all of the rest up to you (so you can make exactly the search solution you want).
I'm going to assume that typical uses for Lucene are either command-line driven, or web-driven. The example code I mentioned above is for a command-line driven searchable recipe database. Someday I'm going to build an example of how to make a web-driven Lucene application and add it to this tutorial.
Don't Get Clever
You'll notice, as we get into this, a common theme. You'll notice the same theme if you hang out on the lucene-user list and listen to Doug Cutting answering questions. That theme is: don't get clever; all the cleverness you'll ever need has been put into really, really fast indexing and searching. This isn't to say it's always best to use brute force, but in Lucene, if there's a simple way to do it, that way probably makes the most sense. Remember Knuth: "premature optimization is the root of all evil."
At the top, you're either pumping data into your search application (indexing) or pulling data out of it (searching).
I'm going to go over these classes in more or less the order you'd encounter them by going through the sample source files. Well, to be exact, I'm going to go through them in the order the data would go through them, in going from an input file to the output of a search request.
If you're not sure you're ready to dive into this depth, take a look at my not-so-nitty-gritty overview.
Indexing In Depth
You index by creating Documents full of Fields (which contain name/value pairs) and pumping them into an IndexWriter to parse the contents of the Field values into tokens and create an index.
Document Objects
Lucene doesn't index files, it indexes Document objects. To index and then search files, you first need to convert them to Document objects.
A Document object is a collection of Field objects (name/value pairs). So, for each file, instantiate a Document, then populate it with Fields.
This is the first potentially tricky bit, depending on what kind of files you're indexing, how much the data in those files is structured, and how much of that structure you want to preserve. Lucene just handles name/value pairs. Email, for example, is mostly name/value oriented:

to: fred
from: barney
subject: dinner?
body: Let's get together for dinner tonight!
For more complex files, you have to "flatten" that structure out into a set of name/value fields.
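For instance, that email might flatten into a Document like this (my sketch; the field names are arbitrary, and the imports are the same as in the round-trip sketch above):

Document doc = new Document();
doc.add(Field.Keyword("to", "fred"));       // Keyword: indexed as-is, not tokenized
doc.add(Field.Keyword("from", "barney"));
doc.add(Field.Text("subject", "dinner?"));  // Text: tokenized, indexed, and stored
doc.add(Field.Text("body", "Let's get together for dinner tonight!"));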
By the way, I'm saying "files" here, but the data source could really be anything - chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file.
A minimum, as in the standard Lucene examples, would be:
A field containing...                  Which you'll use to...
the path to the original document      actually show the user the original document after the search
a modification date                    compare against the original Document's modification date, to see if it needs to be reindexed
the contents of the file               run the search against

Note: This is an example, not a requirement. For example, if you don't have a modification date, don't sweat it, you just have to reindex all of your files every time (and in fact, that's the standard recommended approach for reindexing, under the "don't get clever" rule of thumb).
The All Field
You also ought to really think about glomming all of the Field data together and storing it as some sort of "all" Field. This is the easiest way to set it up so your users can search all Fields at once, if they want. Yes, you could come up with a complex scheme to rewrite your users' query so it searches across all of the known fields, but remember, keep it simple.
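One low-tech way to do that - a sketch under the "keep it simple" rule, not an official Lucene feature - is to concatenate the values yourself as you add the individual Fields, then index the result under a single name. Field.UnStored() (covered below) tokenizes and indexes the text without storing a second copy of it:

String to = "fred", from = "barney", subject = "dinner?";
Document doc = new Document();
doc.add(Field.Keyword("to", to));
doc.add(Field.Keyword("from", from));
doc.add(Field.Text("subject", subject));
// Glom everything together so a query against "all" can hit any field.
doc.add(Field.UnStored("all", to + " " + from + " " + subject));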
Digression: Field Objects
A Field object contains a name (a String) and a value (a String or a Reader), and three booleans that control whether or not the value will be indexed for searches, tokenized prior to indexing, and stored in the index so it can be returned with the search.
Let me explain those three booleans a bit more.
- Indexed for searches - sometimes you'll want to have fields available in your Documents that don't really have anything to do with searching. Two examples I can think of off the top of my head are creation dates and file names, so you can compare when the Document was created against the file modification date, and decide if the document needs to be reindexed. Since these fields won't ever make sense to use in an actual search, you can decrease the amount of work Lucene does by marking them as not indexed for searches.
- Tokenized prior to indexing - tokenizing refers to taking a piece of text, cleaning it up, and breaking it down into individual pieces (tokens) for the indexer. This is done by the Analyzer. Some fields you may not want to be tokenized, for example a serial number field.
- Stored in the index - even if a field is entirely indexed, it doesn't necessarily mean that it'll be easy for Lucene to reconstruct it. Although Lucene is a search index, and not a database, if your fields are reasonably small, you can ask Lucene to store them in the index. With the fields stored in the index, instead of using the Document to locate the original file or data and load it, you can actually pull the data out of the Document. This works best with fairly small fields and documents that you'd need to parse for display anyway.
Some fields contain bulk data and are too large to be stored in the index. For them, you can create the field with a Reader, which makes it simpler for your application to just get the Reader and read in the data in order to display it.
The Field class itself is pretty simple; it pretty much consists of the instance variables of the field and accessor methods for those variables, a toString() method, and a normal constructor. The only special part is several convenience methods for manufacturing fields (static factory methods). These factory methods build Fields that are appropriate for several typical uses. I've listed them in order of how often they'd likely be used (in my unqualified opinion):
(Note: Yes, these method names are capitalized; if I had to guess, I'd say it's probably because they're factory methods - they instantiate and return Field objects with particular parameters.)
Factory Method                               Tokenized  Indexed  Stored  Use for
Field.Text(String name, String value)        Yes        Yes      Yes     contents you want stored
Field.Text(String name, Reader value)        Yes        Yes      No      contents you don't want stored
Field.Keyword(String name, String value)     No         Yes      Yes     values you don't want broken down
Field.UnIndexed(String name, String value)   No         No       Yes     values you don't want indexed
Field.UnStored(String name, String value)    Yes        Yes      No      values you don't want stored
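Putting the factory methods to work on the minimal path/date/contents Document from earlier, the standard examples do roughly this (my paraphrase of the demo code, not a requirement; DateField is a Lucene helper that encodes a timestamp as a single sortable string):

File f = new File("/docs/dinner-recipe.txt");        // hypothetical input file
Document doc = new Document();
doc.add(Field.Text("path", f.getPath()));            // tokenized, indexed, stored
doc.add(Field.Keyword("modified",
        DateField.timeToString(f.lastModified())));  // one untokenized term
doc.add(Field.Text("contents", new FileReader(f)));  // indexed but NOT stored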
IndexWriter
The IndexWriter's job is to take the input (a Document), feed it through the Analyzer you instantiate it with, and create an index. Using the IndexWriter itself is fairly simple. You instantiate it with a location for the index and an Analyzer, then feed Documents into IndexWriter.addDocument(). The actual index is a set of data files that the IndexWriter creates in a location defined, depending on how you instantiate the IndexWriter, by a Lucene Directory object, a File, or a path string.
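The indexing loop, as a sketch (fileToDocument() here is a stand-in for whatever Document-building code you write, not a Lucene method):

File[] files = new File("/docs").listFiles();      // hypothetical input directory
// true means create a new index; false means add to an existing one
IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
for (int i = 0; i < files.length; i++) {
    writer.addDocument(fileToDocument(files[i]));  // your Document-building code
}
writer.optimize();  // optional: merges index segments for faster searching
writer.close();     // flushes everything and releases the write lock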
Directory Objects
You can also store the index in a Lucene Directory object. A Lucene Directory is an abstraction around the Java filesystem classes. Using a Directory lets the Lucene classes hide what exactly is going on. This in turn lets you do clever behind-the-scenes things like keeping the file cached in memory (Lucene comes with two Directory classes, one file-based and one RAM-based, for really high performance).
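A sketch of both flavors (FSDirectory and RAMDirectory; the boolean on FSDirectory.getDirectory() says whether to create a fresh index directory):

// File-based: the index lives on disk and survives restarts.
Directory onDisk = FSDirectory.getDirectory("/tmp/index", true);
IndexWriter diskWriter = new IndexWriter(onDisk, new StandardAnalyzer(), true);

// RAM-based: very fast, but gone when the JVM exits.
Directory inMemory = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(inMemory, new StandardAnalyzer(), true);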
Analyzers and Tokenizers
The Analyzer's job is to take apart a string of text and give you back a stream of tokens. The tokens are usually words from the text content of the string, and that's what gets stored (along with the location and other details) in the index.
Each analyzer includes one or more tokenizers and may include filters. The tokenizers take care of the actual rules for where to break the text up into words. The filters do any post-tokenizing work on the tokens.
Lucene provides an Analyzer abstract class, and three implementations of Analyzer. Glossing over the details:
SimpleAnalyzer - seems to just use a Tokenizer that converts all of the input to lower case.
StopAnalyzer - includes the lower-case filter, and also has a filter that drops out any "stop words", words like articles (a, an, the, etc) that occur so commonly in English that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own array of stop words.
StandardAnalyzer - does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (e.g. "T.L.A." becomes "TLA").
These analyzers are English-oriented. There are several analyzers for other languages that have been developed by Lucene users; check the Lucene Sandbox. If you can't find an analyzer for your language, it's pretty straightforward to implement your own. Use a SimpleAnalyzer while you're learning.
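For example, instantiating a StopAnalyzer with your own stop words looks like this (a sketch; "recipe" is just a made-up domain-specific stop word):

String[] myStopWords = { "a", "an", "the", "recipe" };
Analyzer analyzer = new StopAnalyzer(myStopWords);
// Remember the gotcha: use this same Analyzer for indexing AND searching.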
Searching In Depth
To actually do the search, you need an IndexSearcher, but we'll get to that in a moment; before you can even think about feeding the IndexSearcher a query, you have to have a Query object. The IndexSearcher does the actual munging through the index, but it only understands Query objects.
Query and QueryParser Objects
You produce the Query object by feeding the user's argument string into QueryParser.parse(), along with a string for the default field to search (if the user doesn't specify which field to search) and an Analyzer. The Analyzer is what QueryParser uses to tokenize the argument string. Gotcha Warning: remember, again, you have to make sure that you use the same flavor of Analyzer for tokenizing the argument string as you used for tokenizing the Index. StopAnalyzer is probably a safe choice for this, since that's the one used in the example code. QueryParser.parse() returns a Query.
QueryParser has a static version of parse(), which I guess is there for convenience. You can instantiate a QueryParser with an Analyzer and default field String and keep it around. However, note that QueryParser is not thread-safe, so each thread will need its own QueryParser.
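Both styles, sketched (each throws ParseException if the query string has bad syntax):

// Static convenience form: query string, default field, Analyzer.
Query q1 = QueryParser.parse("dinner tonight", "contents", new StopAnalyzer());

// Instance form: reusable, but only within a single thread.
QueryParser parser = new QueryParser("contents", new StopAnalyzer());
Query q2 = parser.parse("dinner tonight");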
Digression: Thread Safety
Doug Cutting has posted on the topic of thread safety a couple of times. Indexing and searching are not only thread safe, but process safe. What this means is that:
- Multiple index searchers can read the Lucene index files at the same time.
- An index writer or reader can edit the Lucene index files while searches are ongoing.
- Multiple index writers or readers can try to edit the Lucene index files at the same time (it's important for the index writer/reader to be closed so it will release the file lock).
However, the query parser is not thread safe, so each thread using the index should have its own query parser.
The index writer is thread safe, so you can update the index while people are searching it. However, you then have to make sure that the threads with open index searchers close them and open new ones, to get the newly updated data.
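The refresh itself is just a close-and-reopen (a sketch; how you coordinate the swap between your threads is up to you):

searcher.close();                            // let go of the old index snapshot
searcher = new IndexSearcher("/tmp/index");  // sees the newly written documents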
IndexSearchers
To get an IndexSearcher you simply instantiate an IndexSearcher with a single argument that tells Lucene where to find an existing index. The argument is either of these two:
- a string containing a path to the index files
- a Lucene Directory object (see the section about Directory objects under "Indexing In Depth", above)
Digression: IndexReaders
(You can safely skip this section, as it's just me meandering through the Lucene source code; not a whole lot of practical value here yet.)
There's actually a third option for instantiating an IndexSearcher; you can instantiate it with any class that is a concrete subclass of the abstract class IndexReader.
This makes more sense if you take a peek at the code for IndexSearcher. The other two constructors just turn your file path or Directory object into an IndexReader by calling the static method IndexReader.open(). Just for kicks, let's do a little more digging: IndexReader.open() takes either a String file path or a Java File object, uses it to instantiate a Lucene Directory object, and calls open(Directory).
NOTE: I have to admit, I'm a little confused at this point, since the API docs say IndexReader is abstract (which means it can't be instantiated). Presumably that means IndexReader.open(), a static factory method, instantiates an appropriate concrete subclass of IndexReader and returns it. However, the API docs don't show any concrete subclasses of IndexReader. Since I'm too lazy at the moment to look through the source... oh, all right, I'm not too lazy to look through the source. Hm. It appears the API docs are out of date; the com/lucene/index directory contains a SegmentReader, which IndexReader.open() uses.
Multiple Indexes
If you're searching a single index, you use an IndexSearcher with a single index. If you need to search across multiple indexes, you instantiate one IndexSearcher per index, create an array, stick the IndexSearcher instances in the array, and instantiate a MultiSearcher with the array as an argument.
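As a sketch (MultiSearcher actually takes an array of Searchables, and IndexSearcher is one; the index paths here are made up):

Searchable[] searchables = {
    new IndexSearcher("/tmp/recipes-index"),
    new IndexSearcher("/tmp/email-index")
};
Searcher searcher = new MultiSearcher(searchables);
Hits hits = searcher.search(query);  // one ranked result list across both indexes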
Doing The Search
To actually do the search, you take the argument string the user enters, pass it to a QueryParser (and remember (third time's the charm) that QueryParser needs to be instantiated with the same sort of Analyzer that you used when you built the index; the QueryParser'll use the Analyzer to tokenize the argument string) and get back a parsed Query object.
Then you feed the parsed Query to IndexSearcher.search(). The return is a Hits object, which is a collection of matching Document objects. The Hits object also includes a score for each Document, indicating how well it matched.
Hits
IndexSearcher.search(Query) returns a "Hits" object, which is sort of like a Vector, containing a ranked list of Lucene Document objects - the same Document objects you fed into the IndexWriter. Now you need to format the hits for display, or manufacture HREFs pointing to the original documents, or whatever you were planning to do with the search results.
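Walking the Hits looks like this (a sketch; "path" is whatever stored Field you put into the Document at indexing time):

for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);  // the i-th ranked Document
    System.out.println(hits.score(i) + "\t" + doc.get("path"));
}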
What's Not Mentioned Here
There are classes in the Lucene project that didn't get mentioned here, or only got mentioned in passing. After all, the point of a tutorial is as much what NOT to tell you (yet) as what to tell you. Otherwise I'd just say Use The Source, Luke.
I highly recommend sitting down with this tutorial and following through the source of the demo classes first. Then, go back and do it again, only this time when the demo class does something with a Lucene class, go look at the source of the Lucene class and see what it's doing. Not only is this a good way to learn about Lucene, it's an excellent way to learn more about programming.
Someday To Come
Next we'll go through this process again, and actually build an example program to index some files and then do searches against that index.
After that, we'll actually build a basic web search engine, using servlets and JSP. We've already seen that Lucene is a piece of cake to use, and the servlet/JSP stuff isn't much harder (unless you want to make it harder, which of course is possible to do). This will also introduce the whole question of multithreading Lucene. Fortunately, Lucene makes this really easy, because most of the key Lucene classes are thread-safe.
Copyright 2001 by Steven J. Owens, all rights reserved.