A Comparison of the Size of the Yahoo! and Google Indices
Matthew Cheney, Mike Perry, and Dr. Orville Vernon Burton
University of Illinois at Urbana-Champaign and the National Center for Supercomputing Applications
Introduction
On August 8th, 2005, Tim Mayer of Yahoo! posted on the Yahoo! Search Blog that the “[Yahoo!] index now provides access to over 20 billion items”, which include “19.2 billion web documents, 1.6 billion images, and over 50 million audio and video files”. [1] Two days later, University of California at Berkeley Visiting Professor John Battelle reported on his blog that Google disputed this claim, saying “[their] scientists are not seeing the increase claimed in the Yahoo! index”. [2] To test Yahoo!'s claims, Matthew Cheney and Mike Perry, two researchers working for the National Center for Supercomputing Applications (NCSA) under the supervision of Associate Director of Humanities and Social Sciences Dr. Orville Vernon Burton, conducted a brief study of the indices of the two search engines.
Methodology
Although there is no direct way to verify the size of each search engine's index, the standard method for measuring relative size was developed by Krishna Bharat and Andrei Broder in 1998. [3] Their method used a corpus of “web words” from the Yahoo! Web Hierarchy to generate random search queries. The results of those queries were in turn sampled, and each major search engine was then checked for the presence of the sampled web pages.
For our study, instead of focusing on documents that match the common “web words”, we chose to focus on the more obscure documents of the web, the “long tail” of the search index. By counting how many of these obscure documents each search engine returns, we hope to measure each engine's comprehensiveness and relative size.
The study operates under two working assumptions. The first is that both the Yahoo! and the Google search engines return all the results that match the particular keywords. The second is that if Yahoo!'s index contains more than twice as many documents as Google's index (19.2 billion documents to 8.1 billion documents), a series of random searches on both search engines should return more than twice as many results from Yahoo! as from Google.
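As a rough check on the second assumption, the claimed index sizes imply the ratio computed below. This is a minimal sketch in Python; the variable names are ours, and the two figures are the ones quoted in the paragraph above.

```python
# Index sizes quoted in the text, in billions of documents.
yahoo_claimed = 19.2
google_claimed = 8.1

# If both engines returned every match for a random query,
# result counts should scale roughly with index size.
expected_ratio = yahoo_claimed / google_claimed
print(f"Expected Yahoo!/Google result ratio: {expected_ratio:.2f}")
```

Under the two assumptions, a ratio well below this value in the sampled searches would be hard to reconcile with the claimed index sizes.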
Unfortunately, both the Yahoo! and Google search engines truncate the results returned to the user after 1,000 results. Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned fewer than 1,000 results on both Yahoo! and Google. Any query found to return more than 1,000 results on either search engine was discarded from our sample. [4]
In order to create a large number of queries that returned fewer than 1,000 results, we took the commonly available English Ispell Wordlist (a total of 135,069 words) [5] and wrote a Perl script to randomly select two words at a time from that list. The script then used those keywords to search both Yahoo! and Google and logged the number of results returned. For the purposes of this study we used a sample of 10,012 different searches of Yahoo! and Google using our randomly selected keywords.
In the interest of transparency, we have included a copy of the Perl script and the dictionary file we used to run the queries on the project website.
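The sampling procedure described above can be sketched roughly as follows. This is an illustration in Python, not the Perl script used in the study. The functions `count_yahoo` and `count_google` are hypothetical stand-ins for real search requests, the tiny word list is made up, and drawing two distinct words per query is our assumption, since the text does not say whether a word could repeat.

```python
import random

RESULT_CAP = 1000  # both engines truncate listings at 1,000 results


def random_query(words, rng):
    """Pick two distinct words at random to form one query (assumption)."""
    return " ".join(rng.sample(words, 2))


def run_sample(words, count_yahoo, count_google, n_attempts, seed=0):
    """Return (query, yahoo_count, google_count) rows for uncapped queries.

    count_yahoo and count_google are hypothetical callables that take a
    query string and return a result count; they replace real requests.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_attempts):
        query = random_query(words, rng)
        y, g = count_yahoo(query), count_google(query)
        # Keep only queries under the cap on both engines.
        if y < RESULT_CAP and g < RESULT_CAP:
            rows.append((query, y, g))
    return rows


# Stand-in word list and counters so the sketch runs offline.
demo_words = ["abacus", "quixotic", "zephyr", "lintel", "ossify"]
demo_rows = run_sample(demo_words, lambda q: len(q), lambda q: 2 * len(q), n_attempts=3)
for row in demo_rows:
    print(row)
```

Passing the counters in as arguments keeps the sampling and filtering logic separate from however the result counts are actually obtained.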
Results
Over a period of 18 hours, using computing resources at both the National Center for Supercomputing Applications (NCSA) and the University of Illinois at Urbana-Champaign chapter of the Association for Computing Machinery (ACM), we conducted a random sample of 10,012 searches of Yahoo! and Google.
Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly fewer. As our search results indicate, there are a number of cases in which Google returns dozens of results while Yahoo! returns only one or two results, or none at all. Because of the search display cap of 1,000 results and our deliberate focus on obscure documents, these averages are of course small. The results are displayed in Table One.
Table One (n=10,012)

          Average Search Results           Average Search Results
          (Excluding Duplicate Results)    (Including Duplicate Results)
Yahoo!    14                               22
Google    38                               64

In aggregate, Yahoo! returned a total of 146,330 results to our 10,012 searches, while Google returned almost three times as many at 390,595. This pattern holds when “omitted” or “duplicate” search results are included (both search engines offer an option to show omitted or duplicate results), with Google returning 651,398 total results and Yahoo! returning only 223,522. This information is available in Table Two.
Table Two (n=10,012)

          Total Search Results             Total Search Results
          (Excluding Duplicate Results)    (Including Duplicate Results)
Yahoo!    146,330                          223,522
Google    390,595                          651,398

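The aggregate comparison can be recomputed directly from the totals in Table Two. The following is a small Python sketch with variable names of our own choosing; it uses only the totals excluding duplicate results.

```python
# Totals from Table Two (excluding duplicate results).
yahoo_total = 146_330
google_total = 390_595

# Yahoo! results as a fraction of Google's results.
yahoo_share = yahoo_total / google_total
# How many more results Google returned, relative to Yahoo!.
google_excess = google_total / yahoo_total - 1

print(f"Yahoo! / Google: {yahoo_share:.1%}")
print(f"Google over Yahoo!: {google_excess:.1%} more")
```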
Interestingly, the actual total number of results returned varies dramatically from the estimated total that both Google and Yahoo! display to users in the search results. In the case of Google, the number of actual results returned is about half of the estimate that Google gives. In the case of Yahoo!, however, the actual number of search results returned is only about one-fifth of the estimated total. This information is available in Table Three.
Table Three (n=10,012)

          Excluding Duplicate Results                Including Duplicate Results
          Estimated    Total      Percent of         Estimated    Total      Percent of
          Results      Results    Actual Results     Results      Results    Actual Results
                                  Based on Estimate                          Based on Estimate
Yahoo!    690,360      146,330    21.1%              821,043      223,522    27.2%
Google    713,729      390,595    54.7%              708,029      651,398    92.0%

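The percentages in Table Three are the actual totals divided by the estimated totals. A short Python sketch of that arithmetic follows; the dictionary layout and labels are ours, and the figures are copied from the table.

```python
# (estimated results, actual results) from Table Three.
table_three = {
    ("Yahoo!", "excluding duplicates"): (690_360, 146_330),
    ("Yahoo!", "including duplicates"): (821_043, 223_522),
    ("Google", "excluding duplicates"): (713_729, 390_595),
    ("Google", "including duplicates"): (708_029, 651_398),
}

# Fraction of each engine's estimate that was actually returned.
actual_share = {
    key: actual / estimated for key, (estimated, actual) in table_three.items()
}

for (engine, mode), share in actual_share.items():
    print(f"{engine:7s} {mode}: {share:.1%} of the estimate was returned")
```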
Conclusions
Based on the data from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, of the 10,012 test cases we ran, Yahoo! returned more results in only 3% of the cases (307). In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results.
It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Google's index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, or the Yahoo! search engine is not returning all the documents that match our specific search queries, we find it puzzling that Yahoo!'s search engine consistently returned fewer results than Google.
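The case breakdown above can be checked with a few lines of Python. The counts are the ones reported in the paragraph; the labels and variable names are ours.

```python
# Per-query outcomes reported in the text.
cases = {"Yahoo! returned more": 307, "Google returned more": 9_676, "Tie": 29}

total = sum(cases.values())
shares = {label: count / total for label, count in cases.items()}

print(f"Total queries: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```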
Footnotes
[1] Mayer, Tim. "Our Blog is Growing Up . And So Has Our Index". Yahoo! Search Blog. August 8th, 2005. [http://www.ysearchblog.com/archives/000172.html]
[2] Battelle, John. "In This Battle, Size Does Matter: Google Responds to Yahoo Index Claims". John Battelle's Searchblog. August 10th, 2005. [http://battellemedia.com/archives/001790.php]
[3] Bharat, Krishna and Andrei Broder. "A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines". In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia (WWW7), pages 379-388, April 1998.
[4] In a small number of cases, one search engine (almost always Google) will return more than 1,000 results while the other search engine will not. Although we discard this data, we recognize that it is meaningful and we hope to refine our code to take it into account. However, since this occurs infrequently (and almost always favors Google), we do not feel it changes our findings.
[5] Kuenning, Geoff. "Ispell Word List". [http://fmg-www.cs.ucla.edu/geoff/ispell.html]. We are aware that the study focuses on websites in English and would be interested in hearing from other researchers who have done studies using other languages.
[6] We are aware that a number of our random search queries seem to return several dictionaries and wordlists (including the original Ispell dictionary itself) and that Google is favored in these results. We are working on a method to filter these results out, but it should be noted that in the many cases where results other than dictionaries and wordlists are returned, the same pattern is seen.
Additional Information
The log of the search results is available here.
The Perl code used to do the study is available here.
The wordlist used for the study is available here.
A full zip file of all the materials used to do the study is available here.
A PDF version of the study is available here.
Contact Information
To contact the authors, please email mcheney@uiuc.edu
This page was last updated on August 16 at 3:00 PM. The update added a description of the Bharat and Broder method, more discussion of our particular methodology, a discussion of the "wordlist" problem, and some stylistic corrections.