Why Writing Your Own Search Engine Is Hard

ANNA PATTERSON, STANFORD UNIVERSITY

Big or small, proprietary or open source, Web or intranet, it’s a tough job.

There must be 4,000 programmers typing away in their basements trying to build the next “world’s most scalable” search engine. It has been done only a few times. It has never been done by a big group; always one to four people did the core work, and the big team came on to build the elaborations and the production infrastructure. Why is it so hard? We are going to delve a bit into the various issues to consider when writing a search engine. This article is aimed at those individuals or small groups that are considering this endeavor for their Web site or intranet. It is fun, but a word of caution: not only is it difficult, but you need two commodities in short supply—time and patience.

SUPER-SHORT SEARCH ENGINE OVERVIEW

OK, let’s do it. Let’s write a search engine.

A crawler gets the Web pages off of that pesky Web and onto your beautiful disks. You’ll need lots of disks.

Then you need to index these pages—say which page has which words. This will tell you that Janet Jackson was found on the www.superbowl.com page. Usually, indexing happens locally on the disks where your crawler dumped these Web pages. Hey, why move them?

In most architectures, now you need to merge these indices so that you have one place to go to in order to find all the pages mentioning Janet Jackson’s Super Bowl performance. When you merge all these small indices, the final index will be so big that it won’t fit on one machine. This means that you’ll have to merge these small indices in such a way as to split the final big index across many machines.
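To make the shape of this concrete, here is a minimal Python sketch of building small per-disk indices and then merging them into word-sharded final indices. Everything in it (the tokenization, the shard count, the in-memory dictionaries) is illustrative, not a description of any particular engine’s format.

```python
# Toy sketch: build a small index per crawl disk, then merge them into
# word-sharded final indices so each final shard fits on one machine.
import re
import zlib
from collections import defaultdict

def build_local_index(pages):
    """pages: iterable of (doc_id, text) pairs sitting on one crawl disk."""
    index = defaultdict(list)                    # word -> [doc_id, ...]
    for doc_id, text in pages:
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].append(doc_id)
    return index

def merge_and_shard(local_indices, num_shards):
    """Merge many small indices, splitting the result across machines by
    hashing each word, so a query knows exactly which shard to ask."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for local in local_indices:
        for word, doc_ids in local.items():
            shard_id = zlib.crc32(word.encode()) % num_shards   # stable word -> shard
            shards[shard_id][word].extend(doc_ids)
    for shard in shards:                         # keep postings sorted for later merging
        for word in shard:
            shard[word].sort()
    return shards

if __name__ == "__main__":
    disk_a = [(1, "Janet Jackson Super Bowl halftime"), (2, "Super Bowl ads")]
    disk_b = [(3, "Janet Jackson tour dates")]
    shards = merge_and_shard([build_local_index(disk_a),
                              build_local_index(disk_b)], num_shards=2)
    word = "jackson"
    print(shards[zlib.crc32(word.encode()) % 2][word])          # -> [1, 3]
```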

Now you are ready to serve queries? Wrong. Now you build the runtime system that gets users’ queries, retrieves the results out of the index from the right machine(s), and re-ranks them according to the query. All this, while people are drumming their fingers on their desks waiting—hopefully, lots of people and, hopefully, not enough time for much drumming.

Resources

People talk a lot about the thousands of machines needed to build a search engine. This sounds very scary. All search engines, however, started with a lot more thought and design than they did machines. So let’s see what is fact and what is fallacy.

Bandwidth. Legend has it that venture capitalists used to buy hard disks for young entrepreneurs to prove that their ideas would work. Now disks are cheap—but the new bottleneck is bandwidth. Usually that takes capital. You need this bandwidth to get the pages from the Web in the first place. The “CPU-ness” or memory of the machines that you use doesn’t really matter. All that matters is how much bandwidth you have (can afford) and can use because crawling is not a CPU endeavor—crawling is a bandwidth monster.

There are lots of ways around this issue, but the most useful is to realize that you won’t get the indexer and the servers working right (if at all) for six months, anyway, so crawl slowly and index what you have as you go along. Bugs will show up in the later phases, so the lack of pages won’t be the thing holding you up; instead, it’ll be those nasty bugs slowing you down. So crawl continuously at whatever rate you can afford (down to 1-megabyte DSL), and the rest will take care of itself. By the time you have a search engine that works on the pages you have and can keep up with your super-slow crawl, perhaps you’ll be in a position to afford big bandwidth by raising capital.

Big bandwidth is usually found at a collocation facility (or colo). I want to warn against this if you are a super-small company. Get the bandwidth to the office! If you have a small team, the last thing you can afford is people on the highway all day long running to the colo. This is another big reason that I recommend small bandwidth for the development phase. You can’t afford the loss of a person for half a day to go exchange a disk. Another reason to avoid a colo is that it’s hugely expensive. Just throw the stack of machines under your desk and consider it a space heater.

CPU Issues. People argue all day about which types of CPUs to use for which phase of a search engine. Most people argue that the ideal is to get stupid CPUs for crawling and fast CPUs for indexing and serving. Why is this?

You don’t need a lot of thinking to do crawling; you need bandwidth, so any old CPU will do. For indexing, you are doing a lot of I/O and a lot of thinking/analyzing the page, so the bigger the better. At serve time, you’re going to need to re-rank the URLs in response to a query, so again, the bigger the better.

Since you’re writing the search engine yourself, however, it has to be one size fits all. Most indexing algorithms worth their salt will probably peg any CPU. So the same advice goes: it doesn’t matter, get what you can afford; the bugs you write will slow you down more than the cheap CPUs. If you have to look around your local Fry’s or CompUSA for CPUs, however, more on-board cache will be key for the indexing algorithms because more of the page will be kept onboard.

If your algorithm doesn’t peg a Pentium 4, then rethink the game plan of building a better search engine, because yours will not be the one that wins.

Disk Issues. SCSI is faster, but IDE is bigger (and cheaper). If you are writing a search engine yourself, use IDE. This will save money in many ways. You get bigger disks, so one machine can easily hold 1 terabyte of IDE storage, but this just isn’t the case for SCSI. Secondly, SCSI disks are a lot more expensive—also not a good idea for four guys in the garage.

At runtime, you’ll be disk-bound. You have two tasks: get the index entries off disk and re-rank these for relevancy. For getting the index entries off disk, you might think the faster the disk the better. But users will not see the performance increase you get from SCSI in the disk transfer rate, because it takes a lot of practice with the search engine end game (the runtime architecture) for this difference to be an issue. Instead, use parallelism and multiple cheap disks to achieve this speed-up. This will still save you money in buying fewer machines and give you practice with the key tool of search engine architectures—parallelism.

Ah, but SCSIs are hot-swappable, you say. Get over it. Remember, no colo. You cannot afford it and you don’t want it. So if you’re worried about disk failures since you picked your disks out of a Dumpster, then my advice is don’t screw the covers onto your machines and don’t use four screws per disk. This makes IDEs pretty easy to repair, but certainly not hot-swappable.

Storing Files. Old-fashioned file systems used to have a limit on file size—some of them had a 2-gigabyte limit. These file systems also used to have an issue with storing lots and lots of files in one directory. For these reasons, the prevailing wisdom has been to crawl a bunch of URLs and stuff them into one big file (up to the limit) and then start on the next file. Even though current operating systems don’t have the same number-of-file restrictions they used to, putting lots of pages in one file is still a good idea. Stuff them in—up to the limit of good performance of your operating system.

Why? When indexing, or laying down the crawl, a big continuous file saves a whole lot of disk seeks—the fewer files the better. Disk seeks will kill you even if your disk transfer rate is high. You cannot afford the time to seek to a file to process a Web page. Web pages right now average around 10 kilobytes per page (I’m such an old-timer, I remember when they were 2 KB, and others remember when they were 1 KB). You don’t want to seek to a disk to read 10 KB when we are talking about millions, if not billions, of Web pages. Essentially, this will almost double your processing time, as well as fry your disks from the Dumpster.

While you might think that it is conceptually cleaner to store one Web page per one file, this will become a management pain—and it will also slow down your processing.
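As a rough illustration of the big-file layout, here is a toy Python sketch that appends pages to one data file and keeps a side list of (url, offset, length) records so a later pass can stream or seek as needed. The file name and record format are invented for this example.

```python
# Toy sketch: pack many crawled pages into one big file, with a side
# list of (url, offset, length) so the indexer can stream through the
# data file sequentially instead of seeking once per page.

def append_page(data_file, offsets, url, page_bytes):
    offset = data_file.tell()
    data_file.write(page_bytes)
    offsets.append((url, offset, len(page_bytes)))

def read_page(path, offset, length):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

if __name__ == "__main__":
    offsets = []
    with open("crawl-000.dat", "wb") as data:
        append_page(data, offsets, "http://example.com/a", b"<html>page a</html>")
        append_page(data, offsets, "http://example.com/b", b"<html>page b</html>")
    url, off, length = offsets[1]
    print(read_page("crawl-000.dat", off, length))   # b'<html>page b</html>'
```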

Networking. With real estate they say “location, location, location.” Well, a good search engine rule that I’ve learned the hard way is: Don’t use NFS. Don’t use NFS. Don’t use NFS (network file system). NFS might seem like a great idea for an index that won’t fit on one machine (and yours probably won’t). It seems like the perfect solution. If you put the index on multiple machines, then NFS will make it seem like your index is on one machine. Sound good? That way you don’t have to do or learn any networking yourself. Wrong! You’ll have to do real distributed systems work for the serving architecture, anyway, so get it over with and do the work now.

Current NFS implementations can’t stand the punishment inflicted by the runtime system or the indexing phase without using “spendy” specialized hardware.

In the indexing phase, you will get corrupted indices as you try to do lots of networked writes. Ask the contributors to NFS in Linux and they will tell you the same: not ready for serious punishment.

Next, using NFS in the runtime system, you will get machines that don’t have fault tolerance. If one of the NFS’d machines is sick, then the rest just seize. Not good.

Software to write/get

Crawler. If you don’t use an open source crawler, my advice is a super-simple multistep crawler. This is very important advice that will cut months off your development time, so if you ignore everything else, don’t ignore this.

If you want to build a crawler yourself, then first get a list of URLs that you want to seed your crawler with (these need to be good starting points for exploring the Web—dmoz, Yahoo...). Then write any simple program that will get them. For instance, (dolist (y list of URLs) GET y) is essentially all you need.

When you get these pages, analyze the outgoing links in the pages to create a new list for your simple crawler and go get those. What about duplicates, you ask? Sort | uniq on Linux will do this for you; otherwise, I think you can handle it. This takes care of duplicate URLs, but what about duplicate content? My advice: find those at serve time.
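If you want something a bit more concrete than the one-liner above, here is a rough Python sketch of that multistep loop: fetch the current list, dump the pages into one big file, pull out the outgoing links, and de-duplicate them in memory (standing in for sort | uniq). The seed URLs, the regex link extraction, and the error handling are all deliberate oversimplifications.

```python
# Minimal multistep crawler sketch: one round of fetching, dumping pages
# to disk, extracting links, and de-duplicating URLs for the next round.
import re
import urllib.request

LINK_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

def crawl_round(urls, out_file):
    next_urls = set()
    for url in urls:
        try:
            page = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue                              # skip pages we can't fetch
        out_file.write(page)                      # dump into one big file
        next_urls.update(LINK_RE.findall(page.decode("utf-8", "replace")))
    return next_urls

if __name__ == "__main__":
    seeds = {"http://www.dmoz.org/", "http://www.yahoo.com/"}
    seen = set(seeds)
    with open("crawl-000.dat", "wb") as out:
        frontier = crawl_round(seeds, out) - seen  # duplicate URLs dropped here
    # repeat: add frontier to seen, crawl frontier, and so on
```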

The really hard problem with crawlers is to perform dynamic duplicate elimination—eliminating both duplicate URLs and duplicate content. With the system that I described, we’ve avoided getting a Ph.D. dissertation and instead have some piece of code you can hand off to your youngest sibling.

Indexing. Next you need to churn through the pages and build an index. This is tricky. Just don’t do anything wrong, as the saying goes. One false step and those billions of pages are going to take too long to process and your 1-MB DSL crawling line is going to seem fast.

There is a major field of study about the different things to index on. Don’t get a Ph.D.; just index on words. Words are what people search for; they don’t search for N-grams or letters or PTrees or locations in streams, so any method other than the simplest will make you seem clever. But, hey, writing your own search engine is hard enough. Save what cleverness you own for ranking.

Two other pieces of key advice: First, just index the data you need to serve your kind of search results and do your kind of ranking. Don’t write down everything and the kitchen sink—save that for when you go ultra-commercial. The first item of business is getting something presentable up. Correction—start by getting something up. Find out what went wrong and fix it.

Second, do not get attached to the “index format.” The hallowed “index format” is not the end of the search engine; it is just the beginning. It is a tool to see results, so change it and change it often. Play with it, and you and your team will be able to improve search results quickly.

Why would you need to add things to the index? Perhaps you’ve just decided that it would be a good idea to keep track of whether the indexed word is in the title. So now you need a space to annotate this fact. You might have other ideas that mean adding more data to the index.
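One possible way to leave room for that annotation, sketched in Python: give each posting a small flags field alongside the doc ID. The bit assignments here are invented purely for illustration.

```python
# Toy posting layout: (doc_id, flags), where flags records per-word
# facts such as "this word appeared in the <title>". Bit values invented.
IN_TITLE = 0x01
IN_BOLD = 0x02

def make_posting(doc_id, in_title=False, in_bold=False):
    flags = (IN_TITLE if in_title else 0) | (IN_BOLD if in_bold else 0)
    return (doc_id, flags)

posting = make_posting(42, in_title=True)
print(posting)                        # (42, 1)
print(bool(posting[1] & IN_TITLE))    # True
```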

Let’s say that you’ve worked in the long dark until the proud day when you type in a search for bug, and pages that mention Britney Spears but not bug appear. All kinds of things like that happen. Do a dance—you’re almost there. Just keep fixing.

A last word of advice: when in the development phase, keep a disk-based index architecture. You are not getting lots of traffic, you want flexibility regarding which items to place in the index, and mostly you want a happy team. A happy team does not fight over bits. A happy team does not see whose new feature is in and whose is out because there isn’t enough memory. Buy disks, play with features, and have fun.

Dynamic versus Static Ranking. Don’t do page rank initially. Actually don’t do it at all. For this observation I risk being inundated with hate mail, but nonetheless don’t do page rank. If you four guys in your garage can’t get something decent-looking up without page rank, you’re not going to get anything decent up with page rank. Use the source, Luke—the HTML source, that is.

Page rank is a lengthy analysis of a global nature and will cause you to buy more machines and get bogged down on this one complicated step—this one factor in ranking. Start by exploiting everything else you can think of: Is the word in the title? Is it in bold? Etc. Spend your time thinking about anything you can exploit and try it out.

This again will give you freedom and will make you develop an architecture that is good for adding things and trying them out. This will become invaluable later.

Serving. Runtime systems are hard. Algorithms are hard. The hardest part about a search engine is that you have to do both. They have to work together, and both parts are absolutely critical.

At serve time, you have to get the results out of the index, sort them by their relevancy to the query, stick them in a pretty Web page, and return them.

If it sounds easy, then you haven’t written a search engine. Remember, first, that some queries have more than one word. This means that you have to intersect the index entries for the two words. My advice is to have them presorted in some canonical URL number order so that you can view the two (n) index entries as two stacks and pop until the tops are equal, in which case, you win the prize—the URL is in both index entries. These sorts of computations have to be run at query time and they need to be run quickly, so think hard about how you are going to do intersections.
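Here is a small Python sketch of that intersection, assuming each index entry is a list of URL numbers presorted in the same canonical order; the two-pointer walk below is the array version of popping two stacks until the tops match.

```python
# Intersect two presorted index entries (lists of canonical URL numbers).
def intersect(postings_a, postings_b):
    i = j = 0
    hits = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            hits.append(postings_a[i])    # the URL is in both index entries
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1                        # advance the list with the smaller top
        else:
            j += 1
    return hits

print(intersect([1, 4, 7, 9], [2, 4, 9, 12]))   # [4, 9]
```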

Next problem: query-time ranking. Now that you have the list of URLs, you have to rank them according to your relevancy algorithm. This has to be fast. People are waiting.

The fastest thing to do at runtime is to pre-rank and then sort according to the pre-rank part of your indexing structure. This often results in generic (read: not the best of breed) ranking algorithms. You need to take into account the actual query when you are ranking. Thus, you need some data in your index to help take the query into account and re-rank your a priori ranking quickly at runtime.
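One way to picture that, in a Python sketch with invented weights and signals: each posting carries a static score computed at index time, and a cheap query-dependent bonus (here, whether the query word appeared in the title) adjusts it before the final sort.

```python
# Pre-rank plus query-time re-rank, with made-up scores and weights.
IN_TITLE = 0x01

def rerank(postings, title_bonus=2.0):
    """postings: list of (doc_id, static_score, flags) for one query word."""
    scored = []
    for doc_id, static_score, flags in postings:
        bonus = title_bonus if flags & IN_TITLE else 0.0
        scored.append((static_score + bonus, doc_id))
    scored.sort(reverse=True)             # best adjusted score first; must be fast
    return [doc_id for _, doc_id in scored]

print(rerank([(1, 3.0, 0), (2, 2.5, IN_TITLE), (3, 1.0, 0)]))   # [2, 1, 3]
```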

For the basic runtime architecture, you will find no end to people willing to argue about the “appropriate” way to do it. In practice, there are two basic disk-based methods and other memory-based methods. Since we’re doing this on the cheap, we’ll cover just the basic disk-based methods.

The first major method is this: after indexing the files locally—where your crawler deposited them—leave the little indices there. Yes, do nothing more. This means at runtime you ask all machines that have answers for the appropriate query to get back to you ASAP. You drum your fingers as long as you are willing, then gather these little lists into a big list and sort this list for relevancy.

The other method is to gather all results for a particular word together in a big list beforehand. Then when a query arrives, go to the appropriate machine, get the list, and then sort for relevancy. Without showing my bias too much, look on the bright side: for rare queries or obscure words, these are equivalent.
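Here is a minimal sketch of the first method, with Python threads standing in for the machines that hold the little local indices. The timeout is the drum-your-fingers knob; the shard contents and scores are made up for illustration.

```python
# Scatter-gather over per-machine indices: ask every shard, wait only as
# long as you are willing, then merge and sort whatever came back.
from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed

LOCAL_SHARDS = [                          # pretend each dict lives on its own machine
    {"jackson": [(1, 3.0), (4, 1.5)]},
    {"jackson": [(2, 2.8)]},
    {"jackson": [(7, 0.9)]},
]

def query_shard(shard, word):
    return shard.get(word, [])            # (url_number, score) pairs

def scatter_gather(word, patience_s=0.2):
    results = []
    with ThreadPoolExecutor(max_workers=len(LOCAL_SHARDS)) as pool:
        futures = [pool.submit(query_shard, shard, word) for shard in LOCAL_SHARDS]
        try:
            for fut in as_completed(futures, timeout=patience_s):
                results.extend(fut.result())     # gather the little lists
        except TimeoutError:
            pass                                 # out of patience; use what arrived
    results.sort(key=lambda r: r[1], reverse=True)   # one big list, sorted by relevancy
    return results

print(scatter_gather("jackson"))          # [(1, 3.0), (2, 2.8), (4, 1.5), (7, 0.9)]
```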

NO ROOM FOR ERROR

When you look at all these steps and all the complications, this process is rife with things that can go wrong. The hardest part about writing a search engine is that you’re going to process billions of URLs and serve millions, if not billions, of queries. This does not leave a lot of room for error. One super-linear algorithm applied over the wrong-sized list of items and you are sunk. One lock inside another lock and you are sunk. There will be no code paths not explored. All of those comments in your code, which print out errors like “This will never happen,” will happen.

When you think that you are done, there is still the load balancing, the caching, the DNS servers, the ad service, the image servers, the update architecture, and (to take off on a familiar tune) a cartridge in a tape drive. Oh, and if you would like to hear from someone who’s already done it, read Mike Cafarella and Doug Cutting’s article, “Nutch: Open Source Web Search,” on page 54 of this issue.

Sadly, the biggest thing that goes wrong while writing your own search engine is running out of time. Real life often interferes and forces you to end your quest. In that case, cheer up; once the search bug gets you, you’ll be back. The problem isn’t getting any easier, and it needs all the experience anyone can muster.

ANNA PATTERSON (anna@cs.stanford.edu) has written two search engines. Most recently she wrote the biggest index in the world by indexing 30 billion Web pages at the Internet Archive at Recall.Archive.org. In 1998 she coauthored a search engine at Xift, where she was a founder. She received her Ph.D. in computer science from the University of Illinois at Urbana-Champaign and was a research scientist at Stanford University, where she worked on phenomenal data mining. She is also the mother of three preschoolers, who let her hack sometimes.

© 2004 ACM 1542-7730/04/0400 $5.00