Introduction to Nutch, Part 1: Crawling
by Tom White
01/10/2006


Contents
Nutch Vs. Lucene
Architecture
The Crawler
The Crawl Tool
Configuration and Customization
Running a Crawl
Examining the Results of the Crawl
WebDB
Segments
Index
Conclusion
Resources

Nutch is an open source Java implementation of a search engine. It provides all of the tools you need to run your own search engine. But why would anyone want to run their own search engine? After all, there's always Google. There are at least three reasons.
1. Transparency. Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret, so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
2. Understanding. We don't have the source code to Google, so Nutch is probably the best we have. It's interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
3. Extensibility. Don't like the way other search engines display their results? Write your own search engine--using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, with the ability to customize the search interface or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.
Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web. All three have different characteristics. For instance, crawling a local filesystem is reliable compared to the other two, since network errors don't occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole-web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we re-crawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search--how do we cope with hundreds of concurrent queries on such a large dataset? Building a whole-web search engine is a major investment. In "Building Nutch: Open Source Search," authors Mike Cafarella and Doug Cutting (the prime movers behind Nutch) conclude that:
... a complete system might cost anywhere between $800 per month for two-search-per-second performance over 100 million pages, to $30,000 per month for 50-page-per-second performance over 1 billion pages.
This series of two articles shows you how to use Nutch at the more modest intranet scale (note that you may see this term being used to cover sites that are actually on the public internet--the point is the size of the crawl being undertaken, which ranges from a single site to tens, or possibly hundreds, of sites). This first article concentrates on crawling: the architecture of the Nutch crawler, how to run a crawl, and understanding what it generates. The second looks at searching, and shows you how to run the Nutch search application, ways to customize it, and considerations for running a real-world system.
Nutch Vs. Lucene
Nutch is built on top of Lucene, which is an API for text indexing and searching. A common question is: "Should I use Lucene or Nutch?" The simple answer is that you should use Lucene if you don't need a web crawler. A common scenario is that you have a web front end to a database that you want to make searchable. The best way to do this is to index the data directly from the database using the Lucene API, and then write code to do searches against the index, again using Lucene. Erik Hatcher and Otis Gospodnetić's Lucene in Action gives all of the details. Nutch is a better fit for sites where you don't have direct access to the underlying data, or where it comes from disparate sources.
Architecture
Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users' search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments described below in order to produce page summaries and to provide access to cached pages.)
The main practical spin-off from this design is that the crawler and searcher systems can be scaled independently on separate hardware platforms. For instance, a highly trafficked search page that provides searching for a relatively modest set of sites may only need a correspondingly modest investment in the crawler infrastructure, while requiring more substantial resources for supporting the searcher.
We will look at the Nutch crawler here, and leave discussion of the searcher to part two.
The Crawler
The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structures: the web database, a set of segments, and the index. We describe each of these in more detail next.
The web database, or WebDB, is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links. A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including the number of links in the page (also called outlinks); fetch information (such as when the page is due to be refetched); and the page's score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages). A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.
A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it's easy to tell how old they are.
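Since stale segments are just directories, they can be pruned with ordinary shell tools. A minimal sketch, assuming the crawl-tinysite layout used later in this article and using directory modification times as a stand-in for the timestamped names:

# List segment directories that have not been touched for more than 30 days
find crawl-tinysite/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -print
# Once you are happy with the list, replace -print with -exec rm -r {} + to delete them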
The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes. Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.
The Crawl Tool
Now that we have some terminology, it is worth trying to understand the crawl tool, since it does a lot behind the scenes. Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB, a set of fetchers downloads the content from the Web, the crawler updates the WebDB with new links that were found, and then the crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle) and the cycle repeats. This cycle is often referred to as the generate/fetch/update cycle, and runs periodically as long as you want to keep your search index up to date.
URLs with the same host are always assigned to the same fetchlist. This is done for reasons of politeness, so that a web site is not overloaded with requests from multiple fetchers in rapid succession. Nutch observes the Robots Exclusion Protocol, which allows site owners to control which parts of their site may be crawled.
The crawl tool is actually a front end to other, lower-level tools, so it is possible to get the same results by running the lower-level tools in a particular sequence. Here is a breakdown of what crawl does, with the lower-level tool names in parentheses:
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
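As a rough command-line illustration of the same sequence, the tool names below come from the list above, but the flags and argument order are assumptions that vary between Nutch releases; run each tool with no arguments to see its usage message before relying on this sketch:

bin/nutch admin db -create            # 1. create an empty WebDB in ./db
bin/nutch inject db -urlfile urls     # 2. seed it with root URLs (flag name assumed)
bin/nutch generate db segments        # 3. generate a fetchlist in a new segment
s=`ls -d segments/2* | tail -1`       # pick up the segment just created
bin/nutch fetch $s                    # 4. fetch the pages on the fetchlist
bin/nutch updatedb db $s              # 5. update the WebDB with links from fetched pages
# ... repeat steps 3-5 until the required depth is reached ...
bin/nutch updatesegs db segments      # 7. push scores and links into the segments (arguments assumed)
bin/nutch index $s                    # 8. index each fetched segment in turn
bin/nutch dedup segments              # 9. remove duplicate content and URLs
bin/nutch merge index segments/*      # 10. merge the segment indexes (arguments assumed)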
After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle has finished, the crawler goes on to create an index from all of the segments (steps 7-10). Each segment is indexed independently (step 8), before duplicate pages (that is, pages at different URLs with the same content) are removed (step 9). Finally, the individual indexes are combined into a single index (step 10).
The dedup tool can remove duplicate URLs from the segment indexes. This is not to undo duplication of a URL within the WebDB--that cannot happen, since the WebDB does not allow duplicate URL entries. Instead, duplicates can arise if a URL is re-fetched and the old segment for the previous fetch still exists (because it hasn't been deleted). This situation can't arise during a single run of the crawl tool, but it can during re-crawls, which is why dedup also removes duplicate URLs.
While the crawl tool is a great way to get started with crawling websites, you will need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl. We shall see how to do this in the real-world example later, in part two of this series. Also, crawl is really aimed at intranet-scale crawling. To do a whole web crawl, you should start with the lower-level tools. (See the "Resources" section for more information.)
Configuration and Customization
All of Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. The main configuration file is conf/nutch-default.xml. As the name suggests, it contains the default settings, and should not be modified. To change a setting, you create conf/nutch-site.xml and add your site-specific overrides.
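An override uses the same <property> element format found in conf/nutch-default.xml. Here is a minimal sketch that shortens the re-fetch interval; the property name db.default.fetch.interval and the root element name are assumptions, so copy the exact spellings from the nutch-default.xml shipped with your release:

cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<nutch-conf>
  <!-- site-specific override: re-fetch pages every 10 days instead of the default 30 -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>10</value>
  </property>
</nutch-conf>
EOF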
Nutch defines various extension points, which allow developers to customize Nutch's behavior by writing plugins, found in the plugins subdirectory. Nutch's parsing and indexing functionality is implemented almost entirely by plugins--it is not in the core code. For instance, the code for parsing HTML is provided by the HTML document parsing plugin, parse-html. You can control which plugins are available to Nutch with the plugin.includes and plugin.excludes properties in the main configuration file.
With this background, let's run a crawl on a toy site to get a feel for what the Nutch crawler does.
Running a Crawl
First, download the latest Nutch distribution and unpack it on your system (I used version 0.7.1). To use the Nutch tools, you will need to make sure the NUTCH_JAVA_HOME or JAVA_HOME environment variable is set to tell Nutch where Java is installed.
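In a Bourne-style shell this is a single export; the path below is purely illustrative, so point it at wherever your JDK actually lives:

export NUTCH_JAVA_HOME=/usr/java/jdk1.5.0   # illustrative path; use your own JDK install directory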
I created a contrived example with just four pages to understand the steps involved in the crawl process. Figure 1 illustrates the links between pages. C and C-dup (C-duplicate) have identical content.

Figure 1. The site structure for the site we are going to crawl
Before we run the crawler, create a file called urls that contains the root URLs from which to populate the initial fetchlist. In this case, we'll start from page A.
echo 'http://keaton/tinysite/A.html' > urls
The crawl tool uses a filter to decide which URLs go into the WebDB (in steps 2 and 5 in the breakdown of crawl above). This can be used to restrict the crawl to URLs that match any given pattern, specified by regular expressions. Here, we just restrict the domain to the server on my intranet (keaton), by changing the line in the configuration file conf/crawl-urlfilter.txt from
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to
+^http://keaton/
Now we are ready to crawl, which we do with a single command:
bin/nutch crawl urls -dir crawl-tinysite -depth 3 >& crawl.log
The crawl uses the root URLs in urls to start the crawl, and puts the results of the crawl in the directory crawl-tinysite. The crawler logs its activity to crawl.log. The -depth flag tells the crawler how many generate/fetch/update cycles to carry out to get full page coverage. Three is enough to reach all of the pages in this example, but for real sites it is best to start with five (the default), and increase it if you find some pages aren't being reached.
We shall now look in some detail at the data structures crawl has produced.
Examining the Results of the Crawl
If we peek into the crawl-tinysite directory, we find three subdirectories: db, segments, and index (see Figure 2). These contain the WebDB, the segments, and the Lucene index, respectively.

Figure 2. The directories and files created after running the crawl tool
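You can confirm this layout from the shell; the segment names shown below are the timestamps from this example run, so yours will differ:

ls crawl-tinysite
# db  index  segments
ls crawl-tinysite/segments
# 20051025121334  20051025121337  20051025121339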
Nutch comes with several tools for examining the data structures it builds, so let's use them to see what the crawl has created.
WebDB
The first thing to look at is the number of pages and links in the database. This is useful as a sanity check to give us some confidence that the crawler did indeed crawl the site, and how much of it. The readdb tool parses the WebDB and displays portions of it in human-readable form. We use the -stats option here:
bin/nutch readdb crawl-tinysite/db -stats
which displays:
Number of pages: 4
Number of links: 4
As expected, there are four pages in the WebDB (A, B, C, and C-duplicate) and four links between them. The links to Wikipedia are not in the WebDB, since they did not match the pattern in the URL filter file. Both C and C-duplicate are in the WebDB, since the WebDB doesn't de-duplicate pages by content, only by URL (which is why A isn't in twice). Next, we can dump all of the pages, by using a different option for readdb:
bin/nutch readdb crawl-tinysite/db -dumppageurl
which gives:
Page 1:
Version: 4
URL: http://keaton/tinysite/A.html
ID: fb8b9f0792e449cda72a9670b4ce833a
Next fetch: Thu Nov 24 11:13:35 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 1
Score: 1.0
NextScore: 1.0

Page 2:
Version: 4
URL: http://keaton/tinysite/B.html
ID: 404db2bd139307b0e1b696d3a1a772b4
Next fetch: Thu Nov 24 11:13:37 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 3
Score: 1.0
NextScore: 1.0

Page 3:
Version: 4
URL: http://keaton/tinysite/C-duplicate.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:39 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

Page 4:
Version: 4
URL: http://keaton/tinysite/C.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:40 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
Each page appears in a separate block, with one field per line. The ID field is the MD5 hash of the page contents: note that C and C-duplicate have the same ID. There is also information about when the pages should be next fetched (which defaults to 30 days), and page scores. It is easy to dump the structure of the web graph, too:
bin/nutch readdb crawl-tinysite/db -dumplinks
which produces:
from http://keaton/tinysite/B.html
  to http://keaton/tinysite/A.html
  to http://keaton/tinysite/C-duplicate.html
  to http://keaton/tinysite/C.html

from http://keaton/tinysite/A.html
  to http://keaton/tinysite/B.html
For sites larger than a few pages, it is less useful to dump the WebDB in full using these verbose formats. The readdb tool also supports extraction of an individual page or link by URL or MD5 hash. For example, to examine the links to page B, issue the command:
bin/nutch readdb crawl-tinysite/db -linkurl http://keaton/tinysite/B.html
to get:
Found 1 links.
Link 0:
Version: 5
ID: fb8b9f0792e449cda72a9670b4ce833a
DomainID: 3625484895915226548
URL: http://keaton/tinysite/B.html
AnchorText: B
targetHasOutlink: true
Notice that the ID is the MD5 hash of the source page A.
There are other ways to inspect the WebDB. The admin tool can produce a dump of the whole database in plain-text tabular form, with one entry per line, using the -textdump option. This format is handy for processing with scripts. The most flexible way of reading the WebDB is through the Java interface. See the Nutch source code and API documentation for more details. A good starting point is org.apache.nutch.db.WebDBReader, which is the Java class that implements the functionality of the readdb tool (readdb is actually just a synonym for org.apache.nutch.db.WebDBReader).
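Because readdb is just an alias for that class, the two invocations below should behave identically--assuming your bin/nutch script follows the common convention of treating an unrecognized command as a fully qualified class name, which is worth verifying against your copy of the script:

bin/nutch readdb crawl-tinysite/db -stats
bin/nutch org.apache.nutch.db.WebDBReader crawl-tinysite/db -stats   # assumed equivalent invocation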
Segments
The crawl created three segments in timestamped subdirectories in the segments directory, one for each generate/fetch/update cycle. The segread tool gives a useful summary of all of the segments:
bin/nutch segread -list -dir crawl-tinysite/segments/
giving the following tabular output (slightly reformatted to fit this page):
PARSED?  STARTED            FINISHED           COUNT  DIR NAME
true     20051025-12:13:35  20051025-12:13:35  1      crawl-tinysite/segments/20051025121334
true     20051025-12:13:37  20051025-12:13:37  1      crawl-tinysite/segments/20051025121337
true     20051025-12:13:39  20051025-12:13:39  2      crawl-tinysite/segments/20051025121339
TOTAL: 4 entries in 3 segments.
The PARSED? column is always true when using the crawl tool. It is useful when fetchers are run with parsing turned off, so that parsing can be run later as a separate process. The STARTED and FINISHED columns indicate the times when fetching started and finished. This information is invaluable for bigger crawls, when tracking down why crawling is taking a long time. The COUNT column shows the number of fetched pages in the segment. The last segment, for example, has two entries, corresponding to pages C and C-duplicate.
Sometimes it is necessary to find out in more detail what is in a particular segment. This is done using the -dump option for segread. Here we dump the first segment (again, slightly reformatted to fit this page):
s=`ls -d crawl-tinysite/segments/* | head -1`
bin/nutch segread -dump $s

Recno:: 0
FetcherOutput::
FetchListEntry: version: 2
fetch: true
page: Version: 4
URL: http://keaton/tinysite/A.html
ID: 6cf980375ed1312a0ef1d77fd1760a3e
Next fetch: Tue Nov 01 11:13:34 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
anchors: 1
anchor: A

Fetch Result:
MD5Hash: fb8b9f0792e449cda72a9670b4ce833a
ProtocolStatus: success(1), lastModified=0
FetchDate: Tue Oct 25 12:13:35 BST 2005

Content::
url: http://keaton/tinysite/A.html
base: http://keaton/tinysite/A.html
contentType: text/html
metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, Server=Apache-Coyote/1.1, Connection=close, Content-Type=text/html, ETag=W/"1106-1130238131000", Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, Content-Length=1106}
Content:
[the raw HTML of page A appears here: the title "'A' is for Alligator", the alligator paragraph quoted from the Wikipedia entry, and a link to page B; the same text appears in the ParseText section below]

ParseData::
Status: success(1,0)
Title: 'A' is for Alligator
Outlinks: 2
  outlink: toUrl: http://en.wikipedia.org/wiki/Alligator anchor: the Wikipedia entry for Alligator
  outlink: toUrl: http://keaton/tinysite/B.html anchor: B
Metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, CharEncodingForConversion=windows-1252, Server=Apache-Coyote/1.1, Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, ETag=W/"1106-1130238131000", Content-Type=text/html, Connection=close, Content-Length=1106}

ParseText::
'A' is for Alligator Alligators live in freshwater environments such as ponds, marshes, rivers and swamps. Although alligators have heavy bodies and slow metabolisms, they are capable of short bursts of speed that can exceed 30 miles per hour. Alligators' main prey are smaller animals that they can kill and eat with a single bite. Alligators may kill larger prey by grabbing it and dragging it in the water to drown. Food items that can't be eaten in one bite are either allowed to rot or are rendered by biting and then spinning or convulsing wildly until bite size pieces are torn off. (From the Wikipedia entry for Alligator.) B
There's a lot of data for each entry--remember this is just a single entry, for page A--but it breaks down into the following categories: fetch data, raw content, and parsed content. The fetch data, indicated by the FetcherOutput section, is data gathered by the fetcher to be propagated back to the WebDB during the update part of the generate/fetch/update cycle.
The raw content, indicated by the Content section, contains the page contents as retrieved by the fetcher, including HTTP headers and other metadata. (By default, the protocol-httpclient plugin is used to do this work.) This content is returned when you ask the Nutch search application for a cached copy of the page. In the dump above, this is the HTML of page A, shown abridged.
Finally, the raw content is parsed using an appropriate parser plugin--determined by looking at the content type and then the file extension. In this case, parse-html was used, since the content type is text/html. The parsed content (indicated by the ParseData and ParseText sections) is used by the indexer to create the segment index.
Index
The tool of choice for examining Lucene indexes is Luke. Luke allows you to look at individual documents in an index, as well as perform ad hoc queries. Figure 3 shows the merged index for our example, found in the index directory.

Figure 3. Browsing the merged index in Luke
Recall that the merged index is created by combining all of the segment indexes after duplicate pages have been removed. In fact, if you use Luke to browse the index for the last segment (found in the index subdirectory of the segment) you will see that page C-duplicate has been removed from the index. Hence, the merged index only has three documents, corresponding to pages A, B, and C.
Figure 3 shows the fields for page A. Most are self-explanatory, but the boost field deserves a mention. It is calculated on the basis of the number of pages linking to this page--the more pages that link to the page, the higher the boost. The boost is not proportional to the number of inbound links; instead, it is damped logarithmically. The formula used is ln(e + n), where n is the number of inbound links. In our example, only page B links to page A, so there is only one inbound link, and the boost works out as ln(e + 1) = 1.3132616 ...
You might be wondering how the boost field is related to the page score that is stored in the WebDB and the segment fetcher output. The boost field is actually calculated by multiplying the page score by the formula in the previous paragraph. For our crawl--indeed, for all crawls performed using the crawl tool--the page scores are always 1.0, so the boosts depend simply on the number of inbound links.
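As a quick sanity check of the arithmetic, the boost for page A (score 1.0, one inbound link) can be reproduced with a throwaway one-liner; awk is used here purely as a calculator:

awk 'BEGIN { score = 1.0; n = 1; print score * log(exp(1) + n) }'   # prints 1.31326...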
When are page scores not 1.0? Nutch comes with a tool for performing link analysis, LinkAnalysisTool, which uses an algorithm like Google's PageRank to assign a score to each page based on how many pages link to it (weighted by the scores of those linking pages). Notice that this is a recursive definition, and it is for this reason that link analysis is expensive to compute. Luckily, intranet search usually works fine without link analysis, which is why it is not part of the crawl tool; but it is a key part of whole-web search--indeed, PageRank was crucial to Google's success.
Conclusion
In this article, we looked at the Nutch crawler in some detail. The second article will show how to get the Nutch search application running against the results of a crawl.
Resources
The Nutch project page is the place to start for more information on Nutch. The mailing lists for nutch-user and nutch-dev are worth searching if you have a question.
At the time of this writing, the MapReduce version of Nutch is in the main trunk and is not in a released version. This means that you need to build it yourself if you want to use it (or wait for version 0.8 to be released).
For more on whole-web crawling, see the Nutch tutorial.
For more information on Nutch plugins (which are based on the Eclipse 2.0 plugin architecture), a good starting point is PluginCentral on the Nutch Wiki.
Creative Commons provides a Nutch-powered search option for finding Creative-Commons-licensed content (see also this blog entry).
"Building Nutch: Open Source Search" (ACM Queue, vol. 2, no. 2, April 2004), by Mike Cafarella and Doug Cutting, is a good high-level introduction to Nutch.
"Nutch: A Flexible and Scalable Open Source Web Search Engine" (PDF) (CommerceNet Labs Technical Report 04-04, November 2004), by Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin, covers the filesystem scale of Nutch particularly well.
Acknowledgments
Thanks to Doug Cutting and Piotr Kosiorowski for their feedback and helpful suggestions.
Dedication
This article is for my younger daughter Charlotte who learned to crawl while I was writing it.
Tom White is lead Java developer at Kizoom, a leading U.K. software company in the delivery of personalized travel information.


So, that's how Nutch crawls pages. Any questions or comments?


webdb in nutch
2006-07-02 00:06:21 fandufunkyman
hi,
I would like to know about the type of web DB that Nutch is using.
Nutch is using a custom database--what is that?
thanks
funky
DB-driven portal
2006-02-26 23:37:10 abah
Having a portal in which almost all content (menus, links, articles...) is retrieved from a DB, how can I update the WebDB without using "./nutch crawl ..."? I have one URL to crawl and I don't need any other URLs any more, just the portal.
How to integrate in a java app?
2006-01-23 02:17:12 ennio
Now that sounds great! But how can I integrate nutch in a java application?
The only documentation I found for developers is the nutch APIs, but I don't have a clue on where to start...
Btw, your article was very instructive, thank you!
Crawl
2006-01-11 04:58:21 athome
"Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any url in the domain apache.org."
If I follow the instructions above I get the following error:
bin/nutch crawl urls -dir crawl.test -depth 10000
060111 072541 parsing file:/usr/local/nutch/nutch-0.7.1/conf/nutch-default.xml
060111 072541 parsing file:/usr/local/nutch/nutch-0.7.1/conf/crawl-tool.xml
060111 072541 parsing file:/usr/local/nutch/nutch-0.7.1/conf/nutch-site.xml
060111 072541 No FS indicated, using default:local
060111 072541 crawl started in: crawl.test
060111 072541 rootUrlFile = urls
060111 072541 threads = 10
060111 072541 depth = 10000
060111 072541 No FS indicated, using default:local
Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: hunter: hunter
at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:67)
at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:94)
at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1507)
at org.apache.nutch.db.WebDBWriter.createWebDB(WebDBWriter.java:1438)
at org.apache.nutch.tools.WebDBAdminTool.main(WebDBAdminTool.java:172)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:133)
Caused by: java.net.UnknownHostException: hunter: hunter
at java.net.InetAddress.getLocalHost(InetAddress.java:1191)
at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:64)
... 5 more
So, how do we use the crawl-tool for non-intranet crawling?
Crawl
2006-06-20 20:59:14 edwardwang
Setting these properties should work for you:
http.proxy.host -> your http proxy host
http.proxy.port -> your http proxy port
fetcher.threads.per.host.by.ip -> false
The 3rd property is only available since nutch-0.8, though.
Crawl
2006-01-11 13:36:52 tomwhite
The crawl tool works fine with non-intranet crawling. Looks like you have a problem with DNS - can you ping "hunter" from the machine you are running nutch on?
Also, a depth of 10000 is far too high. Most pages should be discovered using a depth of 5 or so. This is the default so I would suggest you start with that.
Tom
Crawl
2006-01-12 05:54:09 athome
added 127.0.0.1 "hostname" to the hosts file.
After running the same command I get :
060112 085356 Added 0 pages
060112 085356 FetchListTool started
060112 085356 Overall processing: Sorted 0 entries in 0.0 seconds.
060112 085356 Overall processing: Sorted NaN entries/second
060112 085356 FetchListTool completed
060112 085357 logging at INFO
060112 085358 Updating /usr/local/nutch/nutch-0.7.1/test/db
060112 085358 Updating for /usr/local/nutch/nutch-0.7.1/test/segments/20060112085356
060112 085358 Finishing update
060112 085358 Update finished
060112 085358 FetchListTool started
060112 085358 Overall processing: Sorted 0 entries in 0.0 seconds.
060112 085358 Overall processing: Sorted NaN entries/second
060112 085358 FetchListTool completed
060112 085358 logging at INFO
060112 085359 Updating /usr/local/nutch/nutch-0.7.1/test/db
060112 085359 Updating for /usr/local/nutch/nutch-0.7.1/test/segments/20060112085358
060112 085359 Finishing update
060112 085400 Update finished
060112 085400 FetchListTool started
060112 085400 Overall processing: Sorted 0 entries in 0.0 seconds.
060112 085400 Overall processing: Sorted NaN entries/second
060112 085400 FetchListTool completed
060112 085400 logging at INFO
060112 085401 Updating /usr/local/nutch/nutch-0.7.1/test/db
060112 085401 Updating for /usr/local/nutch/nutch-0.7.1/test/segments/20060112085400
060112 085401 Finishing update
060112 085401 Update finished
060112 085401 FetchListTool started
060112 085401 Overall processing: Sorted 0 entries in 0.0 seconds.
060112 085401 Overall processing: Sorted NaN entries/second
060112 085401 FetchListTool completed
060112 085401 logging at INFO
060112 085403 Updating /usr/local/nutch/nutch-0.7.1/test/db
060112 085403 Updating for /usr/local/nutch/nutch-0.7.1/test/segments/20060112085401
060112 085403 Finishing update
060112 085403 Update finished
060112 085403 FetchListTool started
060112 085403 Overall processing: Sorted 0 entries in 0.0 seconds.
060112 085403 Overall processing: Sorted NaN entries/second
060112 085403 FetchListTool completed
060112 085403 logging at INFO
060112 085404 Updating /usr/local/nutch/nutch-0.7.1/test/db
060112 085404 Updating for /usr/local/nutch/nutch-0.7.1/test/segments/20060112085403
060112 085404 Finishing update
060112 085404 Update finished
060112 085404 Updating /usr/local/nutch/nutch-0.7.1/test/segments from /usr/local/nutch/nutch-0.7.1/test/db
060112 085405 reading /usr/local/nutch/nutch-0.7.1/test/segments/20060112085356
060112 085405 reading /usr/local/nutch/nutch-0.7.1/test/segments/20060112085358
060112 085405 reading /usr/local/nutch/nutch-0.7.1/test/segments/20060112085400
060112 085405 reading /usr/local/nutch/nutch-0.7.1/test/segments/20060112085401
060112 085405 reading /usr/local/nutch/nutch-0.7.1/test/segments/20060112085403
060112 085405 Sorting pages by url...
060112 085405 Getting updated scores and anchors from db...
060112 085405 Sorting updates by segment...
060112 085405 Updating segments...
060112 085405 Done updating /usr/local/nutch/nutch-0.7.1/test/segments from /usr/local/nutch/nutch-0.7.1/test/db
060112 085405 indexing segment: /usr/local/nutch/nutch-0.7.1/test/segments/20060112085356
060112 085405 * Opening segment 20060112085356
060112 085405 * Indexing segment 20060112085356
060112 085405 * Optimizing index...
060112 085405 * Moving index to NFS if needed...
060112 085405 DONE indexing segment 20060112085356: total 0 records in 0.054 s (NaN rec/s).
060112 085405 done indexing
060112 085405 indexing segment: /usr/local/nutch/nutch-0.7.1/test/segments/20060112085358
060112 085405 * Opening segment 20060112085358
060112 085405 * Indexing segment 20060112085358
060112 085405 * Optimizing index...
060112 085405 * Moving index to NFS if needed...
060112 085405 DONE indexing segment 20060112085358: total 0 records in 0.034 s (NaN rec/s).
060112 085405 done indexing
060112 085405 indexing segment: /usr/local/nutch/nutch-0.7.1/test/segments/20060112085400
060112 085405 * Opening segment 20060112085400
060112 085405 * Indexing segment 20060112085400
060112 085405 * Optimizing index...
060112 085405 * Moving index to NFS if needed...
060112 085405 DONE indexing segment 20060112085400: total 0 records in 0.032 s (NaN rec/s).
060112 085405 done indexing
060112 085405 indexing segment: /usr/local/nutch/nutch-0.7.1/test/segments/20060112085401
060112 085405 * Opening segment 20060112085401
060112 085405 * Indexing segment 20060112085401
060112 085405 * Optimizing index...
060112 085405 * Moving index to NFS if needed...
060112 085405 DONE indexing segment 20060112085401: total 0 records in 0.129 s (NaN rec/s).
060112 085405 done indexing
060112 085405 indexing segment: /usr/local/nutch/nutch-0.7.1/test/segments/20060112085403
060112 085405 * Opening segment 20060112085403
060112 085405 * Indexing segment 20060112085403
060112 085405 * Optimizing index...
060112 085405 * Moving index to NFS if needed...
060112 085405 DONE indexing segment 20060112085403: total 0 records in 0.03 s (NaN rec/s).
060112 085405 done indexing
060112 085405 Reading url hashes...
060112 085405 Sorting url hashes...
060112 085405 Deleting url duplicates...
060112 085405 Deleted 0 url duplicates.
060112 085405 Reading content hashes...
060112 085405 Sorting content hashes...
060112 085405 Deleting content duplicates...
060112 085405 Deleted 0 content duplicates.
060112 085405 Duplicate deletion complete locally. Now returning to NFS...
060112 085405 DeleteDuplicates complete
060112 085405 Merging segment indexes...
060112 085405 crawl finished: test
Zero pages?
"Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any url in the domain apache.org."
So, why zero pages?
Crawl
2006-01-13 01:22:40 tomwhite
I would suggest directing your questions to the nutch user mailing list (http://lucene.apache.org/nutch/mailing_lists.html#Users) as there are plenty more folks there who should be able to help out.
It would be helpful if you could post your urls and conf/crawl-urlfilter.txt too.
Hope this helps,
Tom
That explains something
2006-01-10 07:47:58 c_armstrong
ah hah - the map reduction algorithm is related to functional programming
That explains why when you type 'functional programming' into the Google search engine the top sponsored link is 'Work at Google'!
as per 3
2006-01-10 07:13:16 dnamiot
Number 3 in your list is not a reason. All engines (e.g. Google, Yahoo) provide their own APIs, so you can manipulate the returned data according to your own requirements.
See this for example: http://www.servletsuite.com/servlets/yahoosearchtag.htm
API Limitations
2006-01-11 04:07:47 mparaz
It's a good reason. The free public APIs have limits - Google is only 1,000 queries/day - to prevent practical use. Those without limits, like the Alexa Web Search Platform, are for pay. Even this one doesn't do ranking very well.