Indexable Web Size

The Indexable Web is more than 11.5 billion pages
Antonio Gulli
Università di Pisa, Informatica
gulli@di.unipi.it

Alessio Signorini
University of Iowa, Computer Science
alessio-signorini@uiowa.edu
Abstract
What is the current size of the Web? At the time of this writing, Google claims to index more than 8 billion pages, MSN Beta claims about 5 billion pages, Yahoo! at least 4 billion and Ask/Teoma more than 2 billion. Two sources for tracking the growth of the Web are [6,7], although they are not kept up to date. Estimating the size of the whole Web is quite difficult, due to its dynamic nature (according to Andrei Broder, the size of the whole Web depends strongly on whether his laptop is on the Web, since it can be configured to produce links to an infinite number of URLs!). Nevertheless, it is possible to assess the size of the publicly indexable Web. The indexable Web [4] is defined as "the part of the Web which is considered for indexing by the major engines". In 1997, Bharat and Broder [2] estimated the size of the Web indexed by Hotbot, Altavista, Excite and Infoseek (the largest search engines at that time) at 200 million pages. They also pointed out that the estimated intersection of the indexes was less than 1.4%, or about 2.2 million pages. Furthermore, in 1998, Lawrence and Giles [3] gave a lower bound of 800 million pages. These estimates have now become obsolete.
In this short paper, we revise and update the estimated size of the indexable Web to at least 11.5 billion pages as of the end of January 2005. We also estimate the relative size and overlap of the largest Web search engines. Precisely, Google is the largest engine, followed by Yahoo!, Ask/Teoma, and MSN Beta. We adopted the methodology proposed in 1997 by Bharat and Broder [2], but extended the number of queries used for testing from 35,000 in English to more than 438,141 in 75 different languages. We remark that an estimate of the size of the Web is useful in many situations, such as when compressing, ranking, spidering, indexing and mining the Web.
Data files
The data used in the experiment are available for download in UTF-8 plain text format, compressed with bzip2. They are formatted as follows:
SearchTime, Engine, Query, Rank, URL, CheckTime, GMTY
The field SearchTime is the integer returned by the system function time() at the time of the search. The field Engine indicates the queried search engine, and Query is the word used in the search. Rank indicates the position of the URL among the first 100 returned by the search engine. The last two fields are related to the checking procedure. The field CheckTime is the integer returned by the system function time() at the time of the check, and the field GMTY indicates whether the URL was recognized (1) or not recognized (0) by each search engine (G=Google, M=Msn Beta, T=Ask/Teoma, Y=Yahoo!).
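The record layout above can be read with a few lines of Python. This is a minimal sketch, assuming the released files are comma-separated with the seven fields in the order listed (the delimiter and the sample line are assumptions, not taken from the released data):

```python
# Minimal sketch of a parser for the data-file format described above.
# Field names follow the text; the comma delimiter is an assumption.

from dataclasses import dataclass

@dataclass
class Record:
    search_time: int   # time() at the moment of the search
    engine: str        # engine that returned the URL
    query: str         # query word used in the search
    rank: int          # position among the first 100 results
    url: str
    check_time: int    # time() at the moment of the check
    gmty: str          # 4 flags: G, M, T, Y (1 = recognized)

def parse_line(line: str) -> Record:
    # Split on commas, but rebuild the URL from the middle pieces so
    # that URLs containing commas do not break the surrounding fields.
    parts = line.rstrip("\n").split(",")
    url = ",".join(parts[4:-2])
    return Record(int(parts[0]), parts[1], parts[2], int(parts[3]),
                  url, int(parts[-2]), parts[-1])

def recognized_by(rec: Record) -> dict:
    # Map the GMTY bit-string to engine names.
    names = ["Google", "Msn", "Teoma", "Yahoo"]
    return {n: bit == "1" for n, bit in zip(names, rec.gmty)}
```

For example, a (hypothetical) line `1106900000,Google,pisa,3,http://www.di.unipi.it/,1107000000,1101` parses to a record whose GMTY flags say the URL was recognized by Google, MSN Beta and Yahoo! but not by Ask/Teoma.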
~ Round 1 ~
Engines Coverage %
Google Msn Teoma Yahoo!
Coverage 76.30 62.03 57.58 69.28
Engines Intersections %
Google Msn Teoma Yahoo!
Google - 55.80 35.56 55.63
Msn 78.40 - 49.56 67.38
Teoma 58.83 42.99 - 54.13
Yahoo! 67.96 49.33 45.21 -
Download the data file: round1.urls.bz2 [3.0 MB]
~ Round 2 ~
Engines Coverage %
Google Msn Teoma Yahoo!
Coverage 76.09 61.90 57.69 69.39
Engines Intersections %
Google Msn Teoma Yahoo!
Google - 55.27 35.89 56.60
Msn 78.48 - 49.57 67.28
Teoma 58.17 42.95 - 53.70
Yahoo! 67.71 49.38 45.32 -
Download the data file: round2.urls.bz2 [3.0 MB]
~ Round 3 ~
Engines Coverage %
Google Msn Teoma Yahoo!
Coverage 76.27 61.87 57.70 69.37
Engines Intersections %
Google Msn Teoma Yahoo!
Google - 55.23 35.96 56.04
Msn 78.42 - 49.87 67.30
Teoma 58.20 42.68 - 54.13
Yahoo! 68.45 49.56 44.98 -
Download the data file: round3.urls.bz2 [3.0 MB]
~ Round 4 ~
Engines Coverage %
Google Msn Teoma Yahoo!
Coverage 76.05 61.73 57.57 69.30
Engines Intersections %
Google Msn Teoma Yahoo!
Google - 55.30 35.75 56.23
Msn 78.52 - 49.42 67.09
Teoma 57.81 42.18 - 53.88
Yahoo! 67.85 49.45 45.11 -
Download the data file: round4.urls.bz2 [3.0 MB]
~ Round 5 ~
Engines Coverage %
Google Msn Teoma Yahoo!
Coverage 76.11 61.96 57.56 69.26
Engines Intersections %
Google Msn Teoma Yahoo!
Google - 55.52 35.46 56.06
Msn 78.42 - 49.77 67.15
Teoma 58.19 42.74 - 53.84
Yahoo! 67.84 49.58 45.02 -
Download the data file: round5.urls.bz2 [3.0 MB]
URLs normalization
Since we used the web interface of each search engine to check the presence of the URLs, a lot of care was taken in normalizing them. In addition, we eliminated from our computations the URLs not recognized by the originating search engine after normalization, to avoid any bias this procedure might introduce.
On each retrieved URL, we applied the following steps:
- Hex-encoded characters (%XX) were converted to standard ISO-8859-1 characters.
- HTML entities were converted to their corresponding standard characters.
- Every URL not terminating with a dot (.) followed by 2-5 characters was considered a directory, and a slash (/) was added to its tail.
- Everything (parameters) after the question mark (?) was removed, question mark included.
- URLs containing invalid characters (spaces, quotes, equals signs, ...) were eliminated.
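The steps above can be sketched in Python roughly as follows. The exact character classes, the order of the steps, and the handling of URLs already ending in a slash are assumptions, since the text does not spell them out:

```python
# Sketch of the URL normalization steps listed above (details such as
# the "invalid character" set are assumptions, not the authors' code).

import html
import re
import urllib.parse

INVALID = re.compile(r'[\s"\'=]')                 # spaces, quotes, equals, ...
HAS_EXTENSION = re.compile(r'\.[A-Za-z0-9]{2,5}$')  # ends in ".xx" to ".xxxxx"

def normalize(url: str):
    """Return the normalized URL, or None if it must be discarded."""
    url = url.split("?", 1)[0]                    # drop parameters, '?' included
    url = urllib.parse.unquote(url, encoding="iso-8859-1")  # %XX -> ISO-8859-1
    url = html.unescape(url)                      # &amp; -> &  etc.
    if INVALID.search(url):
        return None                               # invalid characters: discard
    if not url.endswith("/") and not HAS_EXTENSION.search(url):
        url += "/"                                # treat as a directory
    return url
```

Note that a naive extension check like this one would also match a bare hostname such as `a.com`; a faithful implementation would presumably apply the directory rule to the path component only.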
Engine sizes estimation
In the square distance method, we tried to minimize the squared distance between the estimated sizes of each pair of engines. Let A and B be two search engines, and let x and y be the relative size coefficients such that x*A = B and y*B = A. Using the declared sizes of the engines' indexes as lower bounds, we tried to minimize, for each engine, the squared difference between the declared size and the relative size obtained from the pairwise overlaps.
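The relative size coefficients themselves come from the Bharat and Broder argument: sampling URLs from B, the fraction recognized by A estimates |A∩B|/|B|, so the ratio of the two cross-recognition rates estimates |A|/|B|. A small sketch using the Round 1 intersection table; anchoring Google at its declared 8 billion pages is an assumption made here only to produce absolute numbers, whereas the paper fits all four sizes jointly against the declared lower bounds:

```python
# inter[a][b] = percentage of engine a's sampled URLs that engine b
# recognizes (Round 1 figures from the tables above).  By the
# Bharat-Broder argument,  |A| / |B| = Pr_B(url in A) / Pr_A(url in B).

inter = {
    "Google": {"Msn": 55.80, "Teoma": 35.56, "Yahoo": 55.63},
    "Msn":    {"Google": 78.40, "Teoma": 49.56, "Yahoo": 67.38},
    "Teoma":  {"Google": 58.83, "Msn": 42.99, "Yahoo": 54.13},
    "Yahoo":  {"Google": 67.96, "Msn": 49.33, "Teoma": 45.21},
}

def size_ratio(a: str, b: str) -> float:
    """Estimated |a| / |b| from the pairwise overlap percentages."""
    return inter[b][a] / inter[a][b]

# Anchoring Google at its declared 8 billion pages (illustration only;
# the paper's fit uses all declared sizes as simultaneous lower bounds).
GOOGLE_DECLARED = 8.0e9
for engine in ("Msn", "Teoma", "Yahoo"):
    est = GOOGLE_DECLARED / size_ratio("Google", engine)
    print(f"{engine}: ~{est / 1e9:.2f} billion pages")
```

Each pair of engines yields one such ratio, and the two fitting approaches below reconcile the six pairwise ratios with the declared index sizes.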
In the linear program approach, we built a linear program with 12 constraints of the form A - y*B <= Cn, for each n from 1 to 12. The objective was to minimize the sum of the Cn variables. Each engine variable (in this example, A and B) represents its index size and can assume any value greater than or equal to the declared size of the engine's index.
Both approaches give similar engine sizes.
Indexed Web estimation
Analyzing the coverage of each engine over the 5 rounds, we obtained the following coverages on the test data:
Google = 76.16%, Msn Beta = 61.90%, Ask/Teoma = 57.62%, Yahoo! = 69.32%
Since we generated the same number of URLs from each engine, and since we eliminated the URLs not recognized by their originating engine after the normalization process, we can take these values as representative of each engine's coverage of the Indexed Web. Thus, using the estimated index sizes, we can estimate the size of the Indexed Web from each engine. Averaging the values obtained with the declared index sizes gives 9.36 billion pages.
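A sketch of this final step: each engine whose index size is size_i and whose coverage of the Indexed Web is cov_i contributes the estimate size_i / cov_i, and the per-engine estimates are averaged. The index sizes to plug in come from the fitting step of the previous section; the toy values below are placeholders used only to check the arithmetic:

```python
# Sketch of the Indexed Web estimation step.  COVERAGES holds the
# five-round average coverages quoted in the text; the index sizes
# passed to the function come from the fitting step (toy values are
# used in the check below, not the engines' real sizes).

def estimate_indexed_web(sizes: dict, coverages: dict) -> float:
    """Average, over the engines, of index size / coverage fraction."""
    estimates = [sizes[e] / coverages[e] for e in sizes]
    return sum(estimates) / len(estimates)

# Five-round average coverages from the text.
COVERAGES = {"Google": 0.7616, "Msn": 0.6190, "Teoma": 0.5762, "Yahoo": 0.6932}

# Toy check: two engines covering 50% and 25% of a 10-billion-page web
# should both point back at the same 10-billion-page total.
toy = estimate_indexed_web({"A": 5.0e9, "B": 2.5e9}, {"A": 0.50, "B": 0.25})
print(f"toy estimate: {toy / 1e9:.1f} billion pages")
```

With the engines' fitted index sizes in place of the toy values, the same averaging produces the figures discussed in the text.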
Furthermore, by computing how many URLs (among those in the test data) are recognized by every search engine, we can estimate the number of URLs of the Indexed Web shared by all four search engines. The estimated intersection of the engines' indexes turned out to be 28.85% of the Indexed Web, or about 2.7 billion pages.
Bibliography
[1] A. Gulli and A. Signorini, Building an open source meta search engine [WWW 2005]
[2] K. Bharat and A. Broder, A technique for measuring the relative size and overlap of public web search engines [WWW 1998]
[3] S. Lawrence and C. L. Giles, Accessibility of information on the web [Nature 400:107-109, 1999]
[4] E. Selberg, Towards Comprehensive Web Search [PhD thesis, University of Washington, 1999]
[5] Lawrence Web site (http://www.neci.nj.nec.com/homepages/lawrence/)
[6] SearchEngineShowDown (http://searchengineshowdown.com/stats/)
[7] SearchEngineWatch (http://searchenginewatch.com/reports/article.php/2156481)