FAQ

This is the official Nutch FAQ.
Contents

General
  Are there any mailing lists available?
  Is there a mail archive?
  How can I stop Nutch from crawling my site?
  Will Nutch be a distributed, P2P-based search engine?
  Will Nutch use a distributed crawler, like Grub?
  Won't open source just make it easier for sites to manipulate rankings?
  What Java version is required to run Nutch?
  Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4
  I have two XML files, nutch-default.xml and nutch-site.xml, why?
  My system does not find the segments folder. Why? OR How do I tell the Nutch servlet where the index files are located?
Injecting
  What happens if I inject urls several times?
Fetching
  Is it possible to fetch only pages from some specific domains?
  How can I recover an aborted fetch process?
  Who changes the next fetch date?
  I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?
  How can I force the fetcher to use a custom Nutch config?
  bin/nutch generate generates an empty fetchlist, what can I do?
  While fetching I get UnknownHostException for known hosts
  How can I fetch pages that require Authentication?
Updating
Indexing
  Is it possible to change the list of common words without crawling everything again?
  How do I index my local file system?
  Nutch crawling parent directories for file protocol -> misconfigured URLFilters
  How do I index remote file shares?
  While indexing documents, I get the following error:
Segment Handling
  Do I have to delete old segments after some time?
MapReduce
  What is MapReduce?
  How to start working with MapReduce?
NDFS
  What is it?
  How to send commands to NDFS?
Searching
  Common words are saturating my search results.
  What ranking algorithm is used in searches? Does Nutch use the HITS algorithm?
  How is scoring done in Nutch? (Or, explain the "explain" page?)
  How can I influence Nutch scoring?
  What is the RSS symbol in search results all about?
  How can I find out/display the size and mime type of the hits that a search returns?
Crawling
  java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml, mapred-default.xml
Discussion

Nutch FAQ
General
Are there any mailing lists available?
There are user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html#Agents .
Is there a mail archive?
Yes: http://www.mail-archive.com/nutch-user%40lucene.apache.org/maillist.html or http://www.nabble.com/Nutch-f362.html .
How can I stop Nutch from crawling my site?
Please visit our "webmaster info page".
Will Nutch be a distributed, P2P-based search engine?
We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. We don't think many people would want to use a search engine that takes ten or more seconds to return results.
That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong.
Will Nutch use a distributed crawler, like Grub?
Distributed crawling can save download bandwidth but, in the long run, the savings are not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a search engine is not crawling, but searching.
Won't open source just make it easier for sites to manipulate rankings?
Search engines work hard to construct ranking algorithms that are immune to manipulation. Search engine optimizers still manage to reverse-engineer the ranking algorithms used by search engines and improve the ranking of their pages. For example, many sites use link farms to manipulate search engines' link-based ranking algorithms, and search engines retaliate by improving their link-based algorithms to neutralize the effect of link farms.
With an open-source search engine, this will still happen, just out in the open. This is analogous to encryption and virus-protection software. In the long term, making such algorithms open source makes them stronger, as more people can examine the source code to find flaws and suggest improvements. Thus we believe that an open-source search engine has the potential to better resist manipulation of its rankings.
What Java version is required to run Nutch?
Nutch 0.7 will run with Java 1.4 and up.
Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4
This usually means IPv6 is enabled on your machine.
To solve the problem, add the following Java parameter to the java invocation in bin/nutch:
JAVA_IPV4=-Djava.net.preferIPv4Stack=true

# run it
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@"
I have two XML files, nutch-default.xml and nutch-site.xml, why?
nutch-default.xml is the out-of-the-box configuration for Nutch. Most of it can (and, unless you know what you are doing, should) stay as it is. nutch-site.xml is where you make the changes that override the default settings. The same applies to the copy of these files deployed inside the servlet container webapp.
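For example, a minimal override in nutch-site.xml might look like the sketch below (http.agent.name is a property defined in nutch-default.xml; the value here is a placeholder, and the entry goes inside the file's root configuration element):

<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>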
My system does not find the segments folder. Why? OR How do I tell the Nutch servlet where the index files are located?
There are at least two ways to do it:
First you need to copy the .WAR file to the servlet container webapps folder.
% cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
1) After building your first index, start Tomcat from the index folder.
Assuming your index is located at /index :
% cd /index/
% $CATALINA_HOME/bin/startup.sh
Now you can search.
2) After building your first index, start and stop Tomcat, which will make Tomcat extract the Nutch webapp. Then edit nutch-site.xml and put in it the location of the index folder:
% $CATALINA_HOME/bin/startup.sh
% $CATALINA_HOME/bin/shutdown.sh
% vi $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml

<property>
  <name>searcher.dir</name>
  <value>/your_index_folder_path</value>
</property>

% $CATALINA_HOME/bin/startup.sh
Injecting
What happens if I inject urls several times?
URLs that are already in the database won't be injected again.
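For example (a sketch; the db path and url file are placeholders, and -urlfile is the WebDBInjector syntax from the 0.7 era, so check bin/nutch inject's usage message for your version):
% bin/nutch inject /index/db -urlfile urls.txt
# running the same command again is harmless: URLs already in the db are skipped
% bin/nutch inject /index/db -urlfile urls.txt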
Fetching
Is it possible to fetch only pages from some specific domains?
Please have a look at PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might also work, but a list with thousands of regular expressions would slow your system down excessively.
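As a sketch, the regular-expression approach for two domains could look like this in the file named by urlfilter.regex.file (the domains are placeholders):
# accept pages from the two domains, including their subdomains
+^http://([a-z0-9]*\.)*example.com/
+^http://([a-z0-9]*\.)*example.org/
# reject everything else
-.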
How can I recover an aborted fetch process?
Well, you cannot. However, you have two ways to proceed:
1) Recover the pages already fetched and then restart the fetcher.
You'll need to create a file fetcher.done in the segment directory and then run updatedb, generate and fetch. Assuming your index is at /index :
% touch /index/segments/2005somesegment/fetcher.done
% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
% bin/nutch generate /index/db/ /index/segments/2005somesegment/
% bin/nutch fetch /index/segments/2005somesegment
All the pages that were not crawled will be re-generated for fetching. If you fetched lots of pages and don't want to re-fetch them, this is the best way.
2) Discard the aborted output.
Delete all folders from the segment folder except the fetchlist folder and restart the fetcher.
Who changes the next fetch date?
After injecting a new url, the next fetch date is set to the current time.
Generating a fetchlist advances the date by 7 days.
Updating the db sets the date to the current time plus db.default.fetch.interval, minus 7 days.
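A worked example with the default db.default.fetch.interval of 30 days (day numbers are illustrative):
inject on day 0    -> next fetch date = day 0
generate on day 0  -> next fetch date = day 7
updatedb on day 1  -> next fetch date = day 1 + 30 - 7 = day 24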
I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?
You have to decide how many pages you want to crawl before generating segments, and use the options of bin/nutch generate:
Use -topN to limit the total number of pages.
Use -numFetchers to generate multiple small segments.
Then you can either generate new segments (you may want to use -adddays so that bin/nutch generate puts all the urls in the new fetchlist again; add more than 7 days if you did not run updatedb),
or send the fetcher process a unix STOP signal. You can then index the already-fetched part of the segment, and later send a CONT signal to resume the process (see the sketch below). Do not turn off your computer in between!
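A sketch of the signal approach (12345 stands for the fetcher's process id, which you can find with ps):
% kill -STOP 12345    # pause the fetcher
# ... index the already-fetched part of the segment ...
% kill -CONT 12345    # resume fetching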
How can I force the fetcher to use a custom Nutch config?
Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
Modify the nutch-default.xml to suit your needs.
Set the NUTCH_CONF_DIR environment variable to point at the directory you created.
Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to verify they are really loaded from your custom directory.
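For example, a sketch using the directory from the steps above (the segment path is a placeholder):
% export NUTCH_CONF_DIR=$NUTCH_HOME/conf/myconfig
% $NUTCH_HOME/bin/nutch fetch /index/segments/2005somesegment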
Happy using.
bin/nutch generate generates an empty fetchlist, what can I do?
When a page is fetched, it is timestamped in the webdb, so a page whose time is not yet up will not be included in a fetchlist. For example, if you generate a fetchlist and then delete the segment dir that was created, calling generate again will produce an empty fetchlist. You have two choices:
1) Change your system date to 30 days from today (if you haven't changed the default settings) and re-run bin/nutch generate.
2) Call bin/nutch generate with -adddays 30 (if you haven't changed the default settings) to make generate think the time has come.
After generate you can call bin/nutch fetch.
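For example (a sketch; the db and segments paths are placeholders, following the command layout used elsewhere in this FAQ):
% bin/nutch generate /index/db /index/segments -adddays 30
% bin/nutch fetch /index/segments/2005newsegment
Here 2005newsegment stands for the segment directory that generate just created.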
While fetching I get UnknownHostException for known hosts
Make sure your DNS server is working and that it can handle the load of requests.
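A quick sanity check from the machine running the fetcher (standard unix tools; the hostname is a placeholder):
% nslookup www.example.com
% host www.example.com
If these are slow or fail intermittently under load, consider running a local caching DNS server.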
How can I fetch pages that require Authentication?
Unknown.
Updating
Indexing
Is it possible to change the list of common words without crawling everything again?
Yes. The list of common words is used only during indexing and searching, not during the other steps. So if you change the list of common words, there is no need to re-fetch the content; you just need to re-create the segment indexes to reflect the changes.
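A sketch of re-creating a segment index after editing conf/common-terms.utf8 (this assumes the 0.7-style index command, which indexes one segment's fetcher output; the path is a placeholder, so check bin/nutch's usage message for your version):
% bin/nutch index /index/segments/2005somesegment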
How do I index my local file system?
The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of "remote" web servers; you have to change config files to get it to crawl your local disk.
1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it will jump off your disk onto web sites.
Change this line:
-^(file|ftp|mailto|https):
to this:
-^(http|ftp|mailto|https):
2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
# accept anything else
+.*
3) By default the file plugin is disabled. nutch-site.xml needs to be modified to enable it. Add an entry like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that Mozilla will not load file: URLs from a web page fetched over http, so if you test with the Nutch webapp running in Tomcat, annoyingly, nothing will happen when you click on results, since Mozilla by default does not load file: URLs. This behavior may be disabled by a preference (see security.checkloaduri). IE5 does not have this problem.
Nutch crawling parent directories for file protocol -> misconfigured URLFilters
http://issues.apache.org/jira/browse/NUTCH-407 E.g. for urlfilter-regex you should put the following in regex-urlfilter.txt :
+^file:///c:/top/directory/
-.
How do I index remote file shares?
At the current time, Nutch has no built-in support for accessing files over SMB (Windows) shares. The only available method is to mount the shares yourself and then index the contents as though they were local directories (see above; a mount sketch follows the list of drawbacks below).
Note that the share mounting method suffers from the following drawbacks:
1) The links generated by Nutch will only work for queries from localhost (end users typically won't have the exact same shares mounted in the exact same way).
2) You are limited to the number of mounted shares your operating system supports. In *nix environments this is effectively unlimited, but on Windows you may mount at most 26 (one share or drive per letter of the English alphabet).
3) Documents with links to other shares are unlikely to work, since those links point at the SMB locations rather than at mounts on your machine.
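A sketch of the mounting step on Linux (server, share, mount point and credentials are placeholders; requires cifs/smbfs support in the kernel):
% mkdir -p /mnt/winshare
% mount -t cifs //fileserver/share /mnt/winshare -o username=guest
Then index file:///mnt/winshare/ as described in the local file system answer above.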
While indexing documents, I get the following error:
050529 011245 fetch okay, but can't parse myfile, reason: Content truncated at 65536 bytes. Parser can't handle incomplete msword file.
What is happening?
By default, the size of the documents downloaded by Nutch is limited (to 65536 bytes). To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry like this:

<property>
  <name>http.content.limit</name>
  <value>150000</value>
</property>
If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
Segment Handling
Do I have to delete old segments after some time?
If you're fetching regularly, segments older than db.default.fetch.interval (30 days by default) can be deleted, as their pages should have been refetched into newer segments by then.
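A sketch for spotting candidate directories (the path is a placeholder; double-check that everything in them has been refetched and reindexed before deleting):
% find /index/segments -maxdepth 1 -type d -mtime +30
# review the list, then remove, for example:
% rm -rf /index/segments/2005oldsegment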
MapReduce
What is MapReduce?
See the MapReduce wiki page.
How to start working with MapReduce?
edit conf/nutch-site.xml

<property>
  <name>fs.default.name</name>
  <value>localhost:9000</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

edit conf/mapred-default.xml

<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
  <description>Define mapred.map.tasks to be a multiple of the number of slave hosts.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>Define mapred.reduce.tasks to be the number of slave hosts.</description>
</property>
create a file with slave host names
% echo localhost >> ~/.slaves
% echo somemachine >> ~/.slaves
start all ndfs & mapred daemons
% bin/start-all.sh
create a directory with a seed list file
% mkdir seeds
% echo http://www.cnn.com/ > seeds/urls
copy the seeds directory to ndfs
% bin/nutch ndfs -put seeds seeds
crawl a bit
% bin/nutch crawl seeds -depth 3
monitor things from the administrative interface: open a browser and enter your masterHost:7845
NDFS
What is it?
See the NutchDistributedFileSystem wiki page.
How to send commands to NDFS?
list files in the root of NDFS
[root@xxxxxx mapred]# bin/nutch ndfs -ls /
050927 160948 parsing file:/mapred/conf/nutch-default.xml
050927 160948 parsing file:/mapred/conf/nutch-site.xml
050927 160948 No FS indicated, using default:localhost:8009
050927 160948 Client connection to 127.0.0.1:8009: starting
Found 3 items
/user/root/crawl-20050927142856
/user/root/crawl-20050927144626
/user/root/seeds
remove a directory from NDFS
[root@xxxxxx mapred]# bin/nutch ndfs -rm /user/root/crawl-20050927144626
050927 161025 parsing file:/mapred/conf/nutch-default.xml
050927 161025 parsing file:/mapred/conf/nutch-site.xml
050927 161025 No FS indicated, using default:localhost:8009
050927 161025 Client connection to 127.0.0.1:8009: starting
Deleted /user/root/crawl-20050927144626
Searching
Common words are saturating my search results.
You can regenerate the list in conf/common-terms.utf8 after creating an index, using the following command to list the highest-frequency terms:
bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
What ranking algorithm is used in searches? Does Nutch use the HITS algorithm (http://en.wikipedia.org/wiki/HITS_algorithm)?
N/A yet.
How is scoring done in Nutch? (Or, explain the "explain" page?)
Nutch is built on Lucene, so to understand Nutch scoring, study how Lucene does it. The formula Lucene uses for scoring can be found at the head of the Lucene Similarity class in the Lucene Similarity Javadoc. Roughly, the score for a particular document in a set of query results, "score(q,d)", is the sum of the score for each term of a query ("t in q"). A term's score in a document is itself the sum of the term run against each field that comprises a document ("title" is one field, "url" another; a "document" is a set of "fields"). Per field, the score is the product of the following factors: its "tf" (term frequency in the document), a score factor "idf" (usually made up of the frequency of the term relative to the number of docs in the index), an index-time boost, a normalization of the count of terms found relative to the size of the document ("lengthNorm"), a similar normalization for the term in the query itself ("queryNorm"), and finally a factor weighting how many of the query's terms a particular document contains. Study the Lucene javadoc for more detail on each of the equation components and how they affect the overall score.
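In symbols, the formula from the Lucene Similarity javadoc reads roughly as follows (coord is the factor weighting how many of the query's terms the document contains, and norm folds together the index-time boost and lengthNorm):
score(q,d) = coord(q,d) * queryNorm(q) * SUM over t in q of [ tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) ]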
To interpret the Nutch "explain.jsp", keep the above Lucene scoring equation in mind. First, notice how we move right as we move from "score total", to "score per query term", to "score per query document field" (a document field is not shown if the term was not found in that field). Next, a particular field score comprises a query component and a field component. The query component includes the query-time (as opposed to index-time) boost, an "idf" that is the same for the query and field components, and a "queryNorm". The field component is similar ("fieldNorm" is an aggregation of certain of the Lucene equation components).
How can I influence Nutch scoring?
The easiest way to influence scoring is to change the query-time boosts (this requires editing nutch-site.xml and redeploying the WAR file). The default query-time boosts look like this:
query.url.boost     4.0f
query.anchor.boost  2.0f
query.title.boost   1.5f
query.host.boost    2.0f
query.phrase.boost  1.0f
From the list above, you can see that terms found in a document's URL get the highest boost, with anchor text next, etc.
Anchor text makes a large contribution to document score. (You can see the anchor text for a page by browsing to "explain" and then editing the URL to put "anchors.jsp" in place of "explain.jsp".)
What is the RSS symbol in search results all about?
Clicking on the RSS symbol sends the current query back to Nutch, to a servlet named OpenSearchServlet. OpenSearchServlet reruns the query and returns the results formatted as RSS (XML) instead. The RSS format is based on OpenSearch RSS 1.0 from a9.com: "OpenSearch RSS 1.0 is an extension to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as outlined by the RSS 2.0 specification" (see also opensearch). Nutch in turn extends OpenSearch. The Nutch extensions are identified by the 'nutch' namespace prefix and add to OpenSearch navigation information, the original query, and all fields that are available at search-result time, including the Nutch page boost, the name of the segment the page resides in, etc.
Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients should query OpenSearchServlet rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development going forward.
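For example, a programmatic client might fetch results like this (a sketch; the host, port, path and query parameter are assumptions about a default Tomcat deployment, so check the webapp's web.xml for the actual servlet mapping):
% curl 'http://localhost:8080/opensearch?query=nutch'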
How can I find out/display the size and mime type of the hits that a search returns?
To get this information you have to modify the standard plugin.includes property of the Nutch configuration file, adding the index-more and query-more plugins:

<property>
  <name>plugin.includes</name>
  <value>...|index-more|...|query-more|...</value>
</property>

After that, don't forget to crawl again; you should then be able to retrieve the mime type and content length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength"), just as you normally do for the title and URL of the hits.
(Note by DanielLopez) Thanks to Doğacan Güney for the tip.
Crawling
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml, mapred-default.xml
The crawl tool expects as its first parameter the folder where the file of seed urls is located. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: bin/nutch crawl /nutch/seeds -dir /user/nutchuser...
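Putting it together (a sketch; paths and depth are placeholders):
% mkdir -p /nutch/seeds
% echo http://www.cnn.com/ > /nutch/seeds/urls.txt
% bin/nutch crawl /nutch/seeds -dir /user/nutchuser/crawl -depth 3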
Discussion
Grub has some interesting ideas about building a search engine using distributed computing. And how is that relevant to nutch?