Beef up Web search applications with Lucene

Improve searches with a more robust app from the Apache Jakarta Project

Level: Intermediate
Deng Peng Zhou (zhoudengpeng@yahoo.com.cn), Software Engineer, Shanghai Jiaotong University
08 Aug 2006
Lucene is a full-text information retrieval (IR) library written in the Java™ programming language. It is now an open source project in the popular Apache Jakarta Project family. Discover how to implement advanced searching capabilities, and learn how to create a robust Web search application using Lucene.
In this article, you learn to implement advanced searches with Lucene, as well as how to build a sample Web search application that integrates with Lucene. By the end, you will have created your own Web search application with this open source workhorse.
The architecture of a common Web search engine contains a front-end process and a back-end process, as shown in Figure 1. In the front-end process, the user enters the search words into the search engine interface, which is usually a Web page with an input box. The application then parses the search request into a form that the search engine can understand, and the search engine executes the search operation on the index files. After ranking, the search engine interface returns the search results to the user. In the back-end process, a spider or robot fetches the Web pages from the Internet, and then the indexing subsystem parses the Web pages and stores them in the index files. If you use Lucene to build a Web search application, the final architecture will be similar to that shown in Figure 1.
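To make the two processes concrete, here is a minimal sketch in plain Java. It is not Lucene and it skips ranking entirely; the class TinySearchEngine and its methods are hypothetical names chosen for illustration, and the code assumes Java 8 or later. The indexPage method plays the role of the back-end indexing subsystem, and search plays the role of the front-end lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of the front-end/back-end split.
// This is not how Lucene stores or searches its index.
final class TinySearchEngine {
    // term -> sorted set of page names that contain the term
    private final Map<String, TreeSet<String>> index = new HashMap<>();

    // Back end: split a fetched page into lowercase terms and store them in the index.
    void indexPage(String name, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(name);
            }
        }
    }

    // Front end: normalize the search word and look it up in the index.
    List<String> search(String word) {
        TreeSet<String> hits = index.get(word.trim().toLowerCase());
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        TinySearchEngine engine = new TinySearchEngine();
        engine.indexPage("a.html", "Lucene is a Java library");
        engine.indexPage("b.html", "Tomcat runs Java servlets");
        System.out.println(engine.search("java"));
        System.out.println(engine.search("Lucene"));
    }
}
```

A real engine adds analysis, ranking, and a persistent index on top of this idea, which is the work Lucene takes over in the rest of the article.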

Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then demonstrate how to implement these searches with Lucene's Application Programming Interfaces (APIs).
Most search engines provide Boolean operators so users can compose queries. Typical Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND, OR, NOT, plus (+), and minus (-). I'll describe each of these operators.
OR: If you want to search for documents that contain the words "A" or "B", use the OR operator. Keep in mind that if you don't put any Boolean operator between two search words, the OR operator is added between them automatically. For example, "Java OR Lucene" and "Java Lucene" both search for documents that contain "Java" or "Lucene".
AND: If you want to search for documents that contain all of several words, use the AND operator. For example, "Java AND Lucene" returns all documents that contain both "Java" and "Lucene".
NOT: Documents that contain the search word immediately after the NOT operator won't be retrieved. For example, if you want to search for documents that contain "Java" but not "Lucene", you can use the query "Java NOT Lucene". You cannot use this operator with only one term. For example, the query "NOT Java" returns no results.
+: This operator is similar to the AND operator, but it applies only to the word immediately following it and marks that word as required. For example, if you want to search for documents that must contain "Java" and may contain "Lucene", you can use the query "+Java Lucene".
-: This operator works the same way as the NOT operator. The query "Java -Lucene" returns all of the documents that contain "Java" but not "Lucene".
Now look at how to implement a query with Boolean operators using Lucene's API. Listing 1 shows the process of doing searches with Boolean operators.
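//Test boolean operator
//Needs org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.standard.StandardAnalyzer,
//org.apache.lucene.queryParser.QueryParser, org.apache.lucene.search.*, org.apache.lucene.store.*
public void testOperator(String indexDirectory) throws Exception{
    Directory dir = FSDirectory.getDirectory(indexDirectory,false);
    IndexSearcher indexSearcher = new IndexSearcher(dir);
    String[] searchWords = {"Java AND Lucene", "Java NOT Lucene", "Java OR Lucene",
                            "+Java +Lucene", "+Java -Lucene"};
    Analyzer language = new StandardAnalyzer();
    //"title" is the default field for terms without a field prefix.
    //The instance form of QueryParser is used here, as in Listing 7;
    //the static QueryParser.parse method is not available in Lucene 2.0.
    QueryParser queryParser = new QueryParser("title", language);
    Query query;
    for(int i = 0; i < searchWords.length; i++){
        query = queryParser.parse(searchWords[i]);
        Hits results = indexSearcher.search(query);
        System.out.println(results.length() + " search results for query " + searchWords[i]);
    }
}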
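The operator rules described above can also be written down as a small matching function. The following is a sketch in plain Java, not Lucene code: the class BooleanMatch and its matches method are hypothetical, and the three lists stand for the required (+ or AND), optional (plain or OR), and prohibited (- or NOT) terms of an already parsed query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that mirrors the matching rules described in the text.
// It is not part of Lucene and ignores scoring.
final class BooleanMatch {
    static boolean matches(Set<String> docTerms, List<String> required,
                           List<String> optional, List<String> prohibited) {
        for (String term : prohibited) {
            if (docTerms.contains(term)) return false;   // "-term" or "NOT term"
        }
        for (String term : required) {
            if (!docTerms.contains(term)) return false;  // "+term" or both sides of AND
        }
        if (!required.isEmpty()) return true;            // optional terms are not needed for a match
        for (String term : optional) {
            if (docTerms.contains(term)) return true;    // plain terms joined by OR
        }
        return false;                                    // nothing positive matched, as with "NOT Java"
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "tomcat"));
        List<String> none = Collections.emptyList();
        // "+Java -Lucene"
        System.out.println(matches(doc, Arrays.asList("java"), none, Arrays.asList("lucene")));
        // "Java AND Lucene"
        System.out.println(matches(doc, Arrays.asList("java", "lucene"), none, none));
        // "NOT Lucene" on its own
        System.out.println(matches(doc, none, none, Arrays.asList("lucene")));
    }
}
```

The last call shows why a query made of a single NOT term returns nothing: a prohibited term can only remove documents that some other term has already selected.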
Lucene supports field search. You can specify the fields that a query will be executed on. For example, if your document contains two fields, title and content, you can use the query "title:Lucene AND content:Java" to search for documents that contain the term "Lucene" in the title field and "Java" in the content field. Field names are case-sensitive and must match the names used at indexing time. Listing 2 shows how to use Lucene's API to do a field search.
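//Test field search
//Needs org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.standard.StandardAnalyzer,
//org.apache.lucene.queryParser.QueryParser, org.apache.lucene.search.*, org.apache.lucene.store.*
public void testFieldSearch(String indexDirectory) throws Exception{
    Directory dir = FSDirectory.getDirectory(indexDirectory,false);
    IndexSearcher indexSearcher = new IndexSearcher(dir);
    String searchWords = "title:Lucene AND content:Java";
    Analyzer language = new StandardAnalyzer();
    //"title" is only the default field; both terms above name their field explicitly.
    //The instance form of QueryParser is used because the static parse method
    //is not available in Lucene 2.0.
    Query query = new QueryParser("title", language).parse(searchWords);
    Hits results = indexSearcher.search(query);
    System.out.println(results.length() + " search results for query " + searchWords);
}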
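The role of the default field passed to the query parser is easy to miss, so here is a small plain-Java sketch of the rule: a clause of the form field:term is searched in the named field, and a bare term falls back to the default field. The class FieldTerm is hypothetical and far simpler than Lucene's QueryParser, which also handles operators, quoting, and analysis.

```java
// Hypothetical sketch of how a "field:term" clause resolves to a field.
// It only illustrates the default-field rule and is not Lucene's QueryParser.
final class FieldTerm {
    final String field;
    final String text;

    FieldTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Uses the explicit field before the colon, or defaultField when none is given.
    static FieldTerm parse(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        if (colon <= 0) {
            return new FieldTerm(defaultField, clause.trim());
        }
        return new FieldTerm(clause.substring(0, colon).trim(),
                             clause.substring(colon + 1).trim());
    }

    public static void main(String[] args) {
        FieldTerm a = parse("content:Java", "title");
        FieldTerm b = parse("Lucene", "title");
        System.out.println(a.field + " -> " + a.text);
        System.out.println(b.field + " -> " + b.text);
    }
}
```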
Lucene supports two wildcard symbols: the question mark (?) and the asterisk (*). You can use ? to perform a single-character wildcard search, and you can use * to perform a multiple-character wildcard search. For example, if you want to search for "tiny" or "tony", you can use the query "t?ny", and if you want to search for "Teach", "Teacher", and "Teaching", you can use the query "Teach*". Listing 3 demonstrates the process of doing a wildcard search.
//Test wildcard search
//Needs org.apache.lucene.index.Term, org.apache.lucene.search.*, org.apache.lucene.store.*
public void testWildcardSearch(String indexDirectory) throws Exception{
    Directory dir = FSDirectory.getDirectory(indexDirectory,false);
    IndexSearcher indexSearcher = new IndexSearcher(dir);
    String[] searchWords = {"tex*", "tex?", "?ex*"};
    Query query;
    for(int i = 0; i < searchWords.length; i++){
        //the pattern is matched against the indexed terms of the "title" field
        query = new WildcardQuery(new Term("title",searchWords[i]));
        Hits results = indexSearcher.search(query);
        System.out.println(results.length() + " search results for query " + searchWords[i]);
    }
}
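To pin down what the two symbols mean, the following plain-Java sketch translates a wildcard pattern into a regular expression and tests a few terms against it. The class WildcardDemo is hypothetical, and the regular expression is only a convenient way to state the semantics; it is not a description of how Lucene evaluates a WildcardQuery.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing the meaning of ? and * in a wildcard pattern.
final class WildcardDemo {
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');            // exactly one character
            } else if (c == '*') {
                regex.append(".*");           // zero or more characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("t?ny", "tiny"));      // true
        System.out.println(matches("teach*", "teacher")); // true
        System.out.println(matches("tex?", "text"));      // true
        System.out.println(matches("tex?", "tex"));       // false: ? needs one character
    }
}
```

One practical point: a pattern handed directly to WildcardQuery is not run through an analyzer, while StandardAnalyzer lowercases terms at indexing time, so lowercase patterns such as those in Listing 3 are usually what you want.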
Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query "think~" searches for terms similar in spelling to the term "think". Listing 4 features sample code that conducts a fuzzy search with Lucene's API.
//Test fuzzy search
//Needs org.apache.lucene.index.Term, org.apache.lucene.search.*, org.apache.lucene.store.*
public void testFuzzySearch(String indexDirectory) throws Exception{
    Directory dir = FSDirectory.getDirectory(indexDirectory,false);
    IndexSearcher indexSearcher = new IndexSearcher(dir);
    String[] searchWords = {"text", "funny"};
    Query query;
    for(int i = 0; i < searchWords.length; i++){
        //matches terms in the "title" field that are spelled similarly to the search word
        query = new FuzzyQuery(new Term("title",searchWords[i]));
        Hits results = indexSearcher.search(query);
        System.out.println(results.length() + " search results for query " + searchWords[i]);
    }
}
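For readers who have not met edit distance before, here is a sketch of the classic Levenshtein algorithm in plain Java. It counts the single-character insertions, deletions, and substitutions needed to turn one word into another. The class EditDistance is a hypothetical name, and the code illustrates the idea only; it makes no claim about how FuzzyQuery computes or thresholds similarity internally.

```java
// Classic Levenshtein edit distance with two rolling rows.
// A sketch of the idea behind fuzzy matching, not Lucene's implementation.
final class EditDistance {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // "" -> first j chars of b: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                       // first i chars of a -> "": i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;                  // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("think", "thank"));    // 1
        System.out.println(distance("text", "test"));      // 1
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

A small distance relative to the word length means the two spellings are close, which is the property a fuzzy search relies on.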
A range search matches the documents whose field values are in a range. For example, the query "age:[18 TO 35]" returns all of the documents with the value of the "age" field between 18 and 35. Listing 5 shows the process of doing a range search with Lucene's API.
//Test range search
//Needs org.apache.lucene.index.Term, org.apache.lucene.search.*, org.apache.lucene.store.*
public void testRangeSearch(String indexDirectory) throws Exception{
    Directory dir = FSDirectory.getDirectory(indexDirectory,false);
    IndexSearcher indexSearcher = new IndexSearcher(dir);
    Term begin = new Term("birthDay","20000101");
    Term end = new Term("birthDay","20060606");
    //true makes the range inclusive of both end points
    Query query = new RangeQuery(begin,end,true);
    Hits results = indexSearcher.search(query);
    System.out.println(results.length() + " search results returned");
}
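Range bounds in this version of Lucene are terms, and terms are compared as strings, which is why Listing 5 writes dates as yyyyMMdd. The plain-Java sketch below shows the consequence; the class TermRange is a hypothetical helper that applies the same inclusive string comparison to single values.

```java
// Hypothetical helper: an inclusive range check that uses string order,
// the order in which indexed terms are compared.
final class TermRange {
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Dates in yyyyMMdd form sort correctly as strings.
        System.out.println(inRange("20030715", "20000101", "20060606")); // true
        System.out.println(inRange("19991231", "20000101", "20060606")); // false
        // Unpadded numbers do not: "2" sorts between "18" and "35".
        System.out.println(inRange("2", "18", "35"));                    // true, although 2 < 18
        // Zero-padding to a fixed width restores numeric order.
        System.out.println(inRange("02", "18", "35"));                   // false
    }
}
```

If you index numeric values such as the age field in the earlier example, padding them to a fixed width at indexing time is one simple way to keep range results consistent with numeric order.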


Now you'll develop a sample Web application that uses Lucene to search HTML files stored on the file server. Before you begin, make sure you have installed the following software in your environment:
Eclipse IDE
Tomcat 5.0
Lucene library
JDK 1.5
The sample uses Eclipse as the IDE to develop the Web application, and the Web application runs on Tomcat 5.0. After you prepare your environment, you can begin your development step by step.
In Eclipse, select File > New > Project, and then select Dynamic Web Project in the pop-up window, as shown in Figure 2.

After you create the dynamic Web project, you'll see the structure of the project, as shown in Figure 3. The name of the project is sample.dw.paper.lucene.

In this design, you can separate the system into four subsystems:
User Interface: This subsystem provides the user interface that lets the user submit a search request to the Web application server and displays the search results to the user. A JSP file named search.jsp implements this subsystem.
Request Manager: This subsystem receives the search request from the client and forwards it to the Searching subsystem. It then passes the search results returned from the Searching subsystem to the User Interface subsystem. A servlet implements this subsystem.
Searching: This subsystem searches the Lucene index and returns the search results to the Request Manager subsystem. Lucene's API implements this subsystem.
Indexing: This subsystem creates an index for the HTML files. Lucene's API and an HTML parser provided by Lucene implement this subsystem.
Figure 4 shows the detailed information of the design, where you put the User Interface subsystem in the webcontent folder. You'll see that a JSP file named search.jsp is in the folder. The Request Manager subsystem is located in the sample.dw.paper.lucene.servlet package, and the SearchController class is responsible for the function implementation. The Searching subsystem is in the sample.dw.paper.lucene.search package, which contains two classes: SearchManager and SearchResultBean. The first class implements the search function, and the second class describes the structure of the search result. The Indexing subsystem is in the sample.dw.paper.lucene.index package. A class named IndexManager is responsible for creating the Lucene index for the HTML files. This subsystem uses the methods getTitle and getContent provided by the HTMLDocParser class in the sample.dw.paper.lucene.util package to parse HTML files.

After analyzing the architecture design, you can move on to the detailed implementation of these subsystems.
User Interface: This subsystem is implemented by a JSP file named search.jsp, which contains two parts. The first part provides a user interface to submit the search request to the Web application server, as shown in Figure 5. Notice that this form submits the search request to a servlet named SearchController. The mapping between the servlet and the implementation class is specified in the web.xml file.

The second part of the search.jsp file displays the search results to the user, as shown in Figure 6.

Request Manager: A servlet named SearchController implements this subsystem. Listing 6 shows the content of this class.
package sample.dw.paper.lucene.servlet;

import java.io.IOException;
import java.util.List;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import sample.dw.paper.lucene.search.SearchManager;

/**
 * This servlet is used to deal with the search request
 * and return the search results to the client
 */
public class SearchController extends HttpServlet{

    private static final long serialVersionUID = 1L;

    public void doPost(HttpServletRequest request, HttpServletResponse response)
                       throws IOException, ServletException{
        String searchWord = request.getParameter("searchWord");
        SearchManager searchManager = new SearchManager(searchWord);
        List searchResult = null;
        searchResult = searchManager.search();
        RequestDispatcher dispatcher = request.getRequestDispatcher("search.jsp");
        request.setAttribute("searchResult",searchResult);
        dispatcher.forward(request, response);
    }

    public void doGet(HttpServletRequest request, HttpServletResponse response)
                      throws IOException, ServletException{
        doPost(request, response);
    }
}
In Listing 6, the doPost method first gets the search word from the client and then creates an instance of the SearchManager class, which is defined in the Searching subsystem. After that, it calls the search method of the SearchManager class. Finally, it stores the search results in the request and forwards the request to search.jsp, which displays them to the client.
Searching subsystem: You define two classes in this subsystem: SearchManager and SearchResultBean. The first class implements the search function, and the second class is a JavaBean used to describe the structure of the search result. Listing 7 shows the content of the SearchManager class.
package sample.dw.paper.lucene.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import sample.dw.paper.lucene.index.IndexManager;

/**
 * This class is used to search the
 * Lucene index and return search results
 */
public class SearchManager {

    private String searchWord;

    private IndexManager indexManager;

    private Analyzer analyzer;

    public SearchManager(String searchWord){
        this.searchWord   = searchWord;
        this.indexManager = new IndexManager();
        this.analyzer     = new StandardAnalyzer();
    }

    /**
     * do search
     */
    public List search(){
        List searchResult = new ArrayList();
        //create the index on first use if it does not exist yet
        if(false == indexManager.ifIndexExist()){
            try {
                if(false == indexManager.createIndex()){
                    return searchResult;
                }
            } catch (IOException e) {
                e.printStackTrace();
                return searchResult;
            }
        }

        IndexSearcher indexSearcher = null;

        try{
            indexSearcher = new IndexSearcher(indexManager.getIndexDir());
        }catch(IOException ioe){
            ioe.printStackTrace();
        }

        //"content" is the default field for terms without a field prefix
        QueryParser queryParser = new QueryParser("content",analyzer);
        Query query = null;
        try {
            query = queryParser.parse(searchWord);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        //search only if both the query and the searcher were created successfully
        if(null != query && null != indexSearcher){
            try {
                Hits hits = indexSearcher.search(query);
                for(int i = 0; i < hits.length(); i ++){
                    SearchResultBean resultBean = new SearchResultBean();
                    resultBean.setHtmlPath(hits.doc(i).get("path"));
                    resultBean.setHtmlTitle(hits.doc(i).get("title"));
                    searchResult.add(resultBean);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return searchResult;
    }
}
In Listing 7, notice the three private attributes in this class. The first is searchWord, which represents the search words from the client. The second, indexManager, represents an instance of the IndexManager class that is defined in the Indexing subsystem. The third is analyzer, which represents the Analyzer that is used when parsing the search words. Now let's focus on the search method. This method first checks whether Lucene's index already exists. If so, it searches the existing index. If not, the search method first calls the method provided by IndexManager to create the index, and then it searches the newly created index. After the search result is returned, this method fetches the needed attributes from the search results and generates an instance of the SearchResultBean class for each search result. Finally, the instances of SearchResultBean are put into a list and returned to the Request Manager subsystem.
In the SearchResultBean class, there are two private fields -- htmlPath and htmlTitle -- and the get and set methods for the two fields. This means that each search result contains only two attributes: htmlPath and htmlTitle. htmlPath represents the path of the HTML file, and htmlTitle represents the title of the HTML file.
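The article does not list SearchResultBean, so the following is a sketch reconstructed from the description above and from the setter calls in the SearchManager class. The version in the download may differ in detail. In the project, the class would be declared public and placed in the sample.dw.paper.lucene.search package; both are left out here so that the snippet compiles on its own, and the main method exists only as a small demonstration.

```java
// A sketch of the bean described in the text, reconstructed from the article's description.
// In the project: public, in package sample.dw.paper.lucene.search.
class SearchResultBean {
    private String htmlPath;   // path of the matching HTML file
    private String htmlTitle;  // title of the matching HTML file

    public String getHtmlPath() {
        return htmlPath;
    }

    public void setHtmlPath(String htmlPath) {
        this.htmlPath = htmlPath;
    }

    public String getHtmlTitle() {
        return htmlTitle;
    }

    public void setHtmlTitle(String htmlTitle) {
        this.htmlTitle = htmlTitle;
    }

    // Demonstration only; not needed in the Web application.
    public static void main(String[] args) {
        SearchResultBean bean = new SearchResultBean();
        bean.setHtmlPath("c:\\dataDir\\sample.html");
        bean.setHtmlTitle("Sample page");
        System.out.println(bean.getHtmlTitle() + " (" + bean.getHtmlPath() + ")");
    }
}
```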
Indexing subsystem: The IndexManager class implements this subsystem. Listing 8 shows the content of this class.
package sample.dw.paper.lucene.index;

import java.io.File;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import sample.dw.paper.lucene.util.HTMLDocParser;

/**
 * This class is used to create an index for HTML files
 *
 */
public class IndexManager {

    //the directory that stores HTML files
    private final String dataDir  = "c:\\dataDir";

    //the directory that is used to store a Lucene index
    private final String indexDir = "c:\\indexDir";

    /**
     * create index
     */
    public boolean createIndex() throws IOException{
        if(true == ifIndexExist()){
            return true;
        }
        File dir = new File(dataDir);
        if(!dir.exists()){
            return false;
        }
        File[] htmls = dir.listFiles();
        Directory fsDirectory = FSDirectory.getDirectory(indexDir, true);
        Analyzer  analyzer    = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(fsDirectory, analyzer, true);
        for(int i = 0; i < htmls.length; i++){
            String htmlPath = htmls[i].getAbsolutePath();

            if(htmlPath.endsWith(".html") || htmlPath.endsWith(".htm")){
                addDocument(htmlPath, indexWriter);
            }
        }
        indexWriter.optimize();
        indexWriter.close();
        return true;
    }

    /**
     * Add one document to the Lucene index
     */
    public void addDocument(String htmlPath, IndexWriter indexWriter){
        HTMLDocParser htmlParser = new HTMLDocParser(htmlPath);
        String path    = htmlParser.getPath();
        String title   = htmlParser.getTitle();
        Reader content = htmlParser.getContent();

        Document document = new Document();
        //path: stored for display, not searchable
        document.add(new Field("path",path,Field.Store.YES,Field.Index.NO));
        //title: stored and tokenized for searching
        document.add(new Field("title",title,Field.Store.YES,Field.Index.TOKENIZED));
        //content: read from the Reader, tokenized and indexed, but not stored
        document.add(new Field("content",content));
        try {
            indexWriter.addDocument(document);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * judge if the index exists already
     */
    public boolean ifIndexExist(){
        File directory = new File(indexDir);
        //listFiles() returns null when the directory does not exist
        File[] files = directory.listFiles();
        return files != null && files.length > 0;
    }

    public String getDataDir(){
        return this.dataDir;
    }

    public String getIndexDir(){
        return this.indexDir;
    }
}
This class contains two private fields: dataDir and indexDir. dataDir represents the directory that stores the HTML files to be indexed, and indexDir represents the directory used to store the Lucene index. The IndexManager class provides three main methods: createIndex, addDocument, and ifIndexExist. You use createIndex to create the Lucene index if it doesn't exist, and you use addDocument to add one document to the index. In this scenario, one document is an HTML file. The addDocument method calls the methods provided by the HTMLDocParser class to parse the HTML content. You use the last method, ifIndexExist, to judge whether the Lucene index exists already.
Now, look at the HTMLDocParser class in the sample.dw.paper.lucene.util package. This class extracts the text content from the HTML file. You provide three methods in this class: getContent, getTitle, and getPath. The first method returns the HTML contents without HTML tags, the second method returns the title of the HTML file, and the last method gets the path of the HTML file. Listing 9 shows the source code of this class.
package sample.dw.paper.lucene.util;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.demo.html.HTMLParser;

public class HTMLDocParser {

    private String htmlPath;

    private HTMLParser htmlParser;

    public HTMLDocParser(String htmlPath){
        this.htmlPath = htmlPath;
        initHtmlParser();
    }

    private void initHtmlParser(){
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(htmlPath);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        if(null != inputStream){
            try {
                htmlParser = new HTMLParser(new InputStreamReader(inputStream, "utf-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    public String getTitle(){
        if(null != htmlParser){
            try {
                return htmlParser.getTitle();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return "";
    }

    public Reader getContent(){
        if(null != htmlParser){
            try {
                return htmlParser.getReader();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    public String getPath(){
        return this.htmlPath;
    }
}
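To see what "title" and "content without HTML tags" mean for a concrete page, here is a rough plain-Java illustration based on regular expressions. The class NaiveHtmlText is hypothetical and deliberately simplistic. The HTMLParser class used in Listing 9 is a real parser and may treat entities, scripts, comments, and the title differently, so treat this only as a picture of the kind of text that ends up in the two index fields.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough regex-based illustration of "title" and "content without tags".
// Not a substitute for a real HTML parser.
final class NaiveHtmlText {
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first title element, or "" if there is none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Returns the page text with the title element and all tags removed.
    static String content(String html) {
        String body = TITLE.matcher(html).replaceAll(" ");   // this sketch keeps the title separate
        return body.replaceAll("<[^>]*>", " ")                // drop tags
                   .replaceAll("\\s+", " ")                   // collapse whitespace
                   .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Lucene Intro</title></head>"
                    + "<body><h1>Search</h1><p>Full-text <b>information</b> retrieval.</p></body></html>";
        System.out.println(title(html));    // Lucene Intro
        System.out.println(content(html));  // Search Full-text information retrieval.
    }
}
```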
Now you can run the application on Tomcat 5.0.
Right-click search.jsp, and then select Run as > Run on Server, as shown in Figure 7.

In the pop-up window, select Tomcat v5.0 Server as the target Web application server, and then click Next, as shown in Figure 8.

Now specify the installation directory of Apache Tomcat v5.0 and the JRE you want to use to run the Web application. The JRE you select here must be the same version as the JRE used to compile the Java files. Then click Finish to complete the configuration, as shown in Figure 9.

After the configuration, Tomcat 5.0 starts automatically, and search.jsp is compiled and displayed, as shown in Figure 10.

Enter the search word "information" in the text box and then click Search. The page displays the search results, as shown in Figure 11.

Click the first link in the search results. The browser loads the HTML file that the link points to. Figure 12 shows the result.

Now you've finished developing the demo project and have successfully implemented the searching and indexing functions with Lucene. You can also download the source code of this project (see Download).


Lucene provides a flexible interface so you can design your own Web search application. If you want to add search capability to your application, Lucene is a good choice. Give it serious consideration when you design your next application with search functionality.


Description | Name | Size | Download method
Sample Lucene Web application | wa-lucene2_source_code.zip | 504KB | HTTP






Deng Peng Zhou is a graduate student from Shanghai Jiaotong University. He works as an intern software engineer in IBM Shanghai Globalization Lab and is interested in Java technology and modern information retrieval. You can contact him at zhoudengpeng@yahoo.com.cn.