Archive Crawler Wiki: NutchSearchingArcs

This page holds notes on ARC/Nutch integration: using Nutch to search ARC content. Notes are not kept in any particular order.
* A concern is that Nutch weights incoming link text. We'll be feeding Nutch content without giving it a chance to do its analysis step, so it never gets to work out incoming link text. (Of note: when Nutch crawls an internal Wayback and does run analysis, it finds the original links rather than the rewritten versions the Wayback would have a browser read, because Nutch does not read content with an active JavaScript engine running.) Can we still get good search results? Do we need to change the weighting algorithm so that any website typed in comes out as the first hit, etc.?
* What is Nutch MD5'ing? It doesn't seem to be content only, though that is my reading of the code (I can't get md5sum and the Nutch scoring to agree). See the hashing sketch after this list.
* How do we do a date-range search with Nutch? (A query sketch closes this page.)
* The Nutch webapp has hardcoding that shows only two pages from a particular site in a page of hits (see src/web/jsp/search.jsp).
* Only one fetcher at a time can fetch from a given site, which means it fetches too slowly. If fetcher.server.delay is zero, fetches always fail with http.max.delays errors. (See the configuration sketch after this list.)
* If URLs are different but the document hash is the same, both show as distinct hits. This is problematic when we have a page that hasn't changed but has been fetched twice or more (it shows as two distinct dates in the Wayback). It would be nice if pages only went into the index when they differed, so that if an URL shows twice in the results, it is because the content differs. (See the deduplication sketch after this list.)
* Nutch just uses the proffered mimetype; it doesn't look at the document to confirm it (we need a Java version of the Unix 'file' utility).
* How do we stop mid-fetch and pick up from where we left off (e.g. when the JVM crashes mid-fetch, as it seems to do with the IBM JVM)?
* Need to up http.content.limit from 64k: set {http,ftp,file}.content.limit to -1. (Again, see the configuration sketch after this list.)

How to get ARC content into Nutch? A couple of approaches, ranging from the most Nutch-intrusive to the least:

* A Nutch protocol-http plugin that goes directly against a local filesystem ARC collection. The disadvantage is that we'd have to move Nutch to the ARCs rather than have Nutch ask the cluster for an ARC entry (we might then merge the Nutch indexes into one large index). If we were distributing queries over the cluster to Nutch instances running on each machine, this configuration would make better sense. Doug Cutting has been helping us with this; he has written a Nutch insertion tool for ARCs. See [nutch].
* Having Nutch crawl the Wayback (the Wayback bars crawlers in its robots.txt, and Nutch always respects robots.txt; you can't turn that off without changing code), or crawl the (experimental) [ARC Server], or crawl the wb server, the server the Wayback talks to. A deny-all robots.txt of the kind that bars crawlers is sketched just below.
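For reference, the minimal deny-all form of robots.txt that bars all crawlers looks like the following (what web.archive.org actually serves may be more elaborate; this is illustrative only):

    User-agent: *
    Disallow: /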
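The content-limit and fetcher-delay settings above live in the Nutch configuration. A sketch of the overrides in conf/nutch-site.xml, with property names as they appear in the Nutch 0.x conf/nutch-default.xml (check your version's defaults file before relying on them):

    <?xml version="1.0"?>
    <nutch-conf>
      <!-- Lift the 64k truncation limit; -1 means unlimited. -->
      <property>
        <name>http.content.limit</name>
        <value>-1</value>
      </property>
      <property>
        <name>ftp.content.limit</name>
        <value>-1</value>
      </property>
      <property>
        <name>file.content.limit</name>
        <value>-1</value>
      </property>
      <!-- Seconds to wait between two requests to the same host.
           Zero makes fetches fail once http.max.delays is exhausted. -->
      <property>
        <name>fetcher.server.delay</name>
        <value>1.0</value>
      </property>
    </nutch-conf>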
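On the MD5 question: one way to pin down what Nutch is digesting is to hash the raw fetched bytes yourself and compare against md5sum over the same file; if Nutch's stored hash differs, it is digesting more than the content bytes alone. A minimal JDK-only sketch:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    /** Print the MD5 of a file's raw bytes in md5sum-compatible form. */
    public class Md5Check {
      public static void main(String[] args)
          throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        FileInputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
          md5.update(buf, 0, n);   // hash content bytes only, nothing else
        }
        in.close();
        byte[] digest = md5.digest();
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
          hex.append(Integer.toHexString((digest[i] >> 4) & 0x0f));
          hex.append(Integer.toHexString(digest[i] & 0x0f));
        }
        System.out.println(hex + "  " + args[0]);
      }
    }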
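The "only index when content differs" wish could be prototyped by keying each insert on a content digest and skipping digests already seen for that URL. This is an illustrative sketch, not existing Nutch code:

    import java.util.HashSet;
    import java.util.Set;

    /** Illustrative: admit a page to the index only if this exact
        content has not already been indexed under this URL. */
    public class ContentDeduper {
      private final Set<String> seen = new HashSet<String>();

      public boolean shouldIndex(String url, String contentDigest) {
        // add() returns false when the key is already present, i.e.
        // the URL was fetched before with byte-identical content.
        return seen.add(url + "|" + contentDigest);
      }
    }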
Wayback maps links as follows: the content for an URL http://www.duboce.net/index.html on a particular date '20030422041338' can be fetched with the link http://web.archive.org/web/20030422041338/www.duboce.net/index.html (a sketch of this mapping follows below). The Wayback embeds JavaScript to rewrite page links to point back at the Wayback. The Wayback doesn't do any protocols other than 'http' (no 'https' nor 'ftp', etc.).

Here is the blog for the USF CS students looking at Nutch/ARC integration: [Chronica]. [September 09, 2004] has a couple of comments on why we'd consider Nutch for searching ARCs. [CrawlProjects] has a demo web search running. It was made by having a modified Nutch crawl an internal Wayback instance. Here are notes on how the demo was set up: [README.txt].

The Nutch fetcher is basic: it opens a socket and conjures an HTTP request in a StringBuffer. Little configuration. It always respects robots.txt.

Nutch population is done by putting a list of URLs into the db using 'inject'. The db then makes a fetchlist for the Nutch fetcher. The fetcher goes and downloads the content. Run another script to insert the downloaded content into the Nutch db. Another script does an analysis that finds links in the just-downloaded pages to produce a new fetchlist. Refetch. Repeat. Another script is run to index each of the downloads; it goes against each download segment, not against the db. All segments can be merged into one index. (A sketch of this cycle follows below.)

Nutch/Heritrix integration? Heritrix would run in two modes. In mode one, the Heritrix character would predominate: it would be run on setup of a Nutch db to do the initial population. Heritrix would be configured to do something like an intranet crawl and off it would go. Per resource, rather than write the Heritrix-native ARC format, it would instead write the Nutch segment format. When done, all would be imported into the Nutch db. In mode two, Heritrix would let Nutch dictate what to crawl; the Nutch character would predominate. Code would be written to make Heritrix read the Nutch fetchlist. No link extraction. A slimmed-down Heritrix.

Nutch is all plugins. You add document strainers as a new plugin. On the frontend, you add a new query parser as a plugin. It would be grand if we had a query parser that could take date ranges and an URL, both currently unsupported in Nutch. This plugin could be complemented by pages that display the date range in the returned hit set. (A date-range query sketch closes this page.)
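The link mapping described at the top of this section is mechanical; a sketch, using the page's own example (the 14 digits are a yyyyMMddHHmmss capture timestamp):

    /** Build a Wayback replay link from a capture timestamp and URL. */
    public class WaybackUrl {
      public static String replayLink(String timestamp, String originalUrl) {
        // The Wayback drops the scheme from the original URL, e.g.
        // http://www.duboce.net/index.html -> www.duboce.net/index.html
        String stripped = originalUrl.replaceFirst("^http://", "");
        return "http://web.archive.org/web/" + timestamp + "/" + stripped;
      }

      public static void main(String[] args) {
        // Prints: http://web.archive.org/web/20030422041338/www.duboce.net/index.html
        System.out.println(
            replayLink("20030422041338", "http://www.duboce.net/index.html"));
      }
    }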
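The population cycle described above, as it looks with the Nutch 0.x command-line tools. The command names are from the Nutch tutorial of that era, but flags differ between versions, so treat this as a sketch rather than something to paste verbatim:

    # Create an empty web db and seed it with a file of URLs.
    bin/nutch admin db -create
    bin/nutch inject db -urlfile urls

    # Generate a fetchlist into a new segment, fetch it, then fold the
    # content and the newly discovered links back into the db.
    bin/nutch generate db segments
    s=`ls -d segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch updatedb db $s

    # Repeat generate/fetch/updatedb to crawl deeper, then index each
    # segment (the index step runs against segments, not the db).
    bin/nutch index $s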
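On the date-range wish: Nutch queries bottom out in Lucene, so if capture dates were indexed as a fixed-width 14-digit field (no such field exists in stock Nutch; the "date" name here is an assumption), the range clause itself is straightforward. A sketch against the Lucene 1.x API:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    /** Sketch: range query over a hypothetical "date" field holding
        yyyyMMddHHmmss capture timestamps. Lexicographic order matches
        chronological order because the values are fixed-width digits. */
    public class DateRange {
      public static Query between(String from, String to) {
        return new RangeQuery(
            new Term("date", from), new Term("date", to), true /* inclusive */);
      }
    }

A real Nutch date-range query parser would live in a query-filter plugin wrapping something like this.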