darren hobbs: distributed lucene

Interesting article by Mark Harwood here regarding distributed Lucene indexes. Using distributed indexes is how Google achieves its scalability, I believe, but they are a fairly special case. If scalability in the sense of concurrent users is the issue, I tend to favour multiple identical boxes with a load balancer and an RPC frontend. This can be as simple as a servlet, or you can use SOAP, XML-RPC, etc. (possibly RMI, although I've never tried that across a load balancer). Doing things this way is probably a lot simpler to manage than splitting your indexes across boxes, and it means that even if your queries are asymmetric (i.e. 85% of the queries are for the same thing), the load can be fairly evenly balanced. Reliability comes for free as well: if a box dies, just stop sending requests there. Given Lucene's performance (it has been used to index collections of more than 10 million documents), it's pretty unlikely that your dataset will get so large that sheer size starts to affect your query times. Unless, of course, you are Google :)
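
To make the "identical boxes behind a load balancer" idea concrete, here is a minimal sketch in Python. It is not how any particular load balancer works. The class name `RoundRobinBalancer`, the helpers `make_replica` and `broken_replica`, and the choice of `ConnectionError` as the failure signal are all assumptions for illustration. Each backend stands in for an RPC call (servlet, XML-RPC, etc.) to one box holding a full copy of the index.

```python
from itertools import count
from typing import Callable, Dict, List, Set

# A backend is any callable that takes a query string and returns hits.
# In a real setup it would wrap an RPC call to one search box (assumption).
Backend = Callable[[str], List[str]]


class RoundRobinBalancer:
    """Spread queries over identical replicas, skipping boxes that fail."""

    def __init__(self, backends: Dict[str, Backend]) -> None:
        self.backends = dict(backends)
        self.dead: Set[str] = set()   # boxes we have stopped sending requests to
        self._turn = count()          # monotonically increasing request counter

    def search(self, query: str) -> List[str]:
        live = [name for name in self.backends if name not in self.dead]
        start = next(self._turn)
        # Try each live replica at most once, starting at the next in rotation.
        for offset in range(len(live)):
            name = live[(start + offset) % len(live)]
            try:
                return self.backends[name](query)
            except ConnectionError:
                # The box died: stop sending requests there and try the next one.
                self.dead.add(name)
        raise RuntimeError("no live search replicas")


def make_replica(name: str, calls: List[str]) -> Backend:
    """Hypothetical healthy box: records that it served the query."""
    def search(query: str) -> List[str]:
        calls.append(name)
        return [f"doc matching {query!r}"]
    return search


def broken_replica(query: str) -> List[str]:
    """Hypothetical dead box: every request fails."""
    raise ConnectionError("box is down")
```

Because every box holds the whole index, the balancer never needs to look at the query: even if every request is for the same thing, the load still rotates across the live boxes, and a dead box simply drops out of the rotation.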