Quora - What are good resources to learn about search engine architecture?

Source: 百度文库 (Baidu Wenku)  Editor: 神马文学网  Time: 2024/04/29 09:57:56

What are good resources to learn about search engine architecture?

I think Manning and Prabhakar's book on Information Retrieval covers a good bit of the theory behind search engines, but what are the best resources out there to learn about their distributed systems, network routing and scalability aspects? Pointers to books, or perhaps conference and journal papers, that talk about real-world designs and systems?

3 Answers

Krishna Gade, Twitter Search <- Bing <- Live Search <- M...
6 votes by Anon User, Anon User, Amund Tveit, Viksit Gaur, Michael Malone and Babak Hamadani

If you're interested in the architecture of search engines as they are built in practice rather than in academia, the following papers are very good. The last one especially gives you a good model for approaching the problem of designing a search engine's architecture.

- Evolution of Google's search architecture by Jeff Dean. http://research.google.com/peopl...
- Lessons from building large scale systems by Jeff Dean. http://www.cs.cornell.edu/projec...
- Operational Requirements for Scalable Search Systems. http://www.ir.iit.edu/~abdur/pub...

Also, I found that this IR lab produces good search architecture papers:
http://cis.poly.edu/westlab/publ...

Thu May 27 2010 12:47:27 GMT+0800 (China Standard Time)

Reynold Xin
4 votes by Zak Stone, Viksit Gaur, Yang Zhang and Vaibhav Mallya

There is a new book, which is still incomplete: http://www.ir.uwaterloo.ca/book/

It focuses more on the use of IR in search engines.

Also, search engines are a large area. In general you can divide it into the systems side and the algorithms side. The algorithms part is obvious; systems refers to building large-scale distributed systems that enable the algorithms to perform effectively and efficiently.

Some conferences to follow on this topic are: SOSP, WWW, SIGMOD, VLDB.

As for informal reading, I personally subscribe to the following blogs, and many of them talk about challenges in building real systems (not necessarily search engines, but all kinds of distributed systems):

Werner Vogels (Amazon CTO)
http://www.allthingsdistributed....

James Hamilton (Amazon DE & VP of Engineering)
http://perspectives.mvdirona.com/

http://highscalability.com/ (lots of coverage of different systems: YouTube, Google, Flickr, etc.)

http://www.royans.net/arch/ (similar to highscalability, but updated less frequently)

http://googleresearch.blogspot.com/

1 Comment • Sun May 09 2010 23:45:32 GMT+0800 (China Standard Time)

From the sample chapters, it looks like the new book could be fantastic.

Zak Stone • Mon May 10 2010 03:12:01 GMT+0800 (China Standard Time)

Xuehua Shen
2 votes by Amund Tveit and Viksit Gaur

For the ranking part of a search engine, SIGIR is the most relevant conference, followed by CIKM. A relatively new book about it is D. Grossman and O. Frieder, Information Retrieval: Algorithms and Heuristics, Kluwer Academic Publishers (http://ir.iit.edu/~ophir/pub.html).
 
For the system implementation part, there is one more new book, http://www.search-engines-book.com/, which I have not read, so I am not sure whether it touches on distributed systems, network routing, etc.
 
You may study the open source project Lucene (an indexing and ranking library), or Solr (an enterprise search solution based on Lucene), which is used by many companies such as Netflix. Katta, http://katta.sourceforge.net/, is "Lucene & more in the cloud". You may read the source code or documentation to learn the implementation details. It is very easy to set up a search engine using Solr and play with it.
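
To get a feel for what an indexing and ranking library does before reading the Lucene source, here is a minimal sketch of an inverted index with a simple TF-IDF score. It is not Lucene's API or its scoring formula; the class name TinyIndex, the whitespace tokenizer and the sample documents are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


class TinyIndex:
    """Toy inverted index (hypothetical): term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.num_docs = 0

    def add(self, doc_id, text):
        # Record how often each term occurs in this document.
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf
        self.num_docs += 1

    def search(self, query, k=10):
        # Accumulate tf * idf per document, one posting list at a time.
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue  # term not in the index
            idf = math.log(1 + self.num_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Highest score first; ties broken by doc_id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


index = TinyIndex()
index.add("d1", "distributed search engine architecture")
index.add("d2", "search ranking with language models")
index.add("d3", "cooking pasta at home")
results = index.search("search architecture")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

A real library adds text analysis, compressed posting lists, on-disk segments and better scoring on top of this idea, which is the part worth looking for in the source code.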
 
There is also the Lemur project and the related Indri project in academia (http://www.lemurproject.org/lemu...). Using this toolkit, it is easy to implement and test sophisticated search algorithms such as language-model-based ones.
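
As a rough idea of what a language-model-based ranking function looks like, here is a small sketch of query-likelihood scoring with Dirichlet smoothing, one common choice in the IR literature. It does not use the Lemur/Indri API; the function name query_likelihood, the value of mu and the sample documents are assumptions for illustration only.

```python
import math
from collections import Counter


def query_likelihood(query, docs, mu=2000.0):
    """Rank docs by log P(query | doc) under Dirichlet-smoothed unigram models.

    docs maps doc_id -> text; mu is the smoothing parameter (assumed value).
    """
    doc_tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Collection-wide term counts provide the background model P(term | C).
    collection = Counter(t for toks in doc_tokens.values() for t in toks)
    collection_len = sum(collection.values())

    scores = {}
    for doc_id, toks in doc_tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            p_coll = collection[term] / collection_len
            if p_coll == 0.0:
                continue  # term unseen in the whole collection: skip it
            # Smoothed estimate: (tf + mu * P(term | C)) / (|doc| + mu)
            score += math.log((tf[term] + mu * p_coll) / (len(toks) + mu))
        scores[doc_id] = score
    # Highest log-likelihood first; ties broken by doc_id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "a": "language models for retrieval",
    "b": "network routing and scalability",
    "c": "language of cooking",
}
ranked = query_likelihood("language models", docs, mu=10.0)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

The smoothing is what lets a document that is missing a query term still get a finite score, so documents matching only part of the query are ranked rather than dropped.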