Nutch Notes (2): Crawl More URLs and Recrawl

Source: 百度文库 | Editor: 神马文学网 | Time: 2024/04/28 01:28:37

1: Recrawl
The Nutch wiki has a ready-made script, so we can simply take it and use it:

http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
Save it as nutch-0.8.1/bin/recrawl.sh
Code
martin@martinx:~/workspace/doc/nutch-0.8.1$ sudo bin/recrawl.sh ../tomcat5/webapps/ROOT xici/ 10 1 5

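The wiki already explains the parameters in detail, so there is no need to repeat it here. One parameter worth a look is ../tomcat5/webapps/ROOT: as you can see, the script uses it only for
Code
touch $tomcat_dir/WEB-INF/web.xml
which makes Tomcat reload the webapp. If you are not using Tomcat and only want to crawl, edit the script and remove this parameter.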
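
Instead of deleting the reload step, one option is to make it conditional. The sketch below is only an illustration under that assumption: `reload_webapp` is a hypothetical helper name and is not part of the wiki script.

```shell
#!/bin/sh
# Hypothetical helper (not in the wiki script): reload the webapp only
# when a Tomcat webapp directory is given and contains WEB-INF/web.xml.
reload_webapp() {
    tomcat_dir="$1"
    if [ -n "$tomcat_dir" ] && [ -f "$tomcat_dir/WEB-INF/web.xml" ]; then
        # Same reload step the recrawl script uses: refresh the timestamp
        touch "$tomcat_dir/WEB-INF/web.xml"
        echo "reloaded"
    else
        # No Tomcat directory (crawl-only setup): nothing to do
        echo "skipped"
    fi
}

# Usage (paths as in this note):
#   reload_webapp ../tomcat5/webapps/ROOT
#   reload_webapp ""    # crawl only, no Tomcat
```

With a guard like this the same script could serve both the Tomcat and the crawl-only setup.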

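2: Crawl more URLs and merge
Above we only crawled pages from a single site, xici, but our goal is a whole series of sites rather than just one, so we have to add new URLs to crawl.
Add news.163.com:
Code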
mkdir url2
echo http://news.163.com>url2/163
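
When more sites are added later, the seed file can hold several URLs, one per line. The following is a small sketch for appending URLs without creating duplicates; `add_seed` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append a URL to a seed file unless it is already listed.
add_seed() {
    seed_file="$1"
    url="$2"
    # Create the seed directory and file if they do not exist yet
    mkdir -p "$(dirname "$seed_file")"
    touch "$seed_file"
    # -x matches the whole line, -F treats the URL as a fixed string
    if ! grep -qxF -- "$url" "$seed_file"; then
        echo "$url" >> "$seed_file"
    fi
}

# Usage:
#   add_seed url2/163 http://news.163.com
```

Calling it twice with the same URL leaves a single line in the file.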

Rerun the crawl command mentioned above:
Code
martin@martinx:~/workspace/doc/nutch-0.8.1$ sudo bin/nutch crawl url2 -dir 163 -depth 10 -topN 50
Note:
This takes a long time; if you prefer, substitute a site with far less content.

For merging, we use the script from the Nutch wiki, http://wiki.apache.org/nutch/MergeCrawl, saved as bin/mergecrawl.sh.
Code
martin@martinx:~/workspace/doc/nutch-0.8.1$ bin/mergecrawl.sh newpath 163/ xici/
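The first argument, newpath, is the output directory for the merged result; the other two are the directories of the two crawls.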
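
Before merging, it can help to confirm that each input really is a finished crawl directory. The check below is a sketch that assumes a Nutch 0.8 crawl directory contains `crawldb`, `linkdb` and `segments` subdirectories; `is_crawl_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the directory looks like a crawl
# directory. Assumption: a finished crawl has crawldb, linkdb and segments.
is_crawl_dir() {
    for sub in crawldb linkdb segments; do
        [ -d "$1/$sub" ] || return 1
    done
    return 0
}

# Usage:
#   is_crawl_dir 163 && is_crawl_dir xici && bin/mergecrawl.sh newpath 163/ xici/
```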

Edit the classes/nutch-site.xml file under the Tomcat directory and change searcher.dir to the new index directory:
Code
perl -pi -e 's|xici|newpath|' ../tomcat5/webapps/ROOT/WEB-INF/classes/nutch-site.xml
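
The substitution above replaces the first occurrence of xici on every line of the file, so it may be worth checking the resulting value. A minimal sketch, assuming the usual `<property>` layout with `<name>` and `<value>` on the same or on adjacent lines; `get_searcher_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> configured for searcher.dir.
# Assumes <value> is on the same line as <name> or on the line after it.
get_searcher_dir() {
    grep -A1 '<name>searcher.dir</name>' "$1" |
        sed -n 's|.*<value>\(.*\)</value>.*|\1|p' |
        head -n 1
}

# Usage:
#   get_searcher_dir ../tomcat5/webapps/ROOT/WEB-INF/classes/nutch-site.xml
```

If the printed path does not end in newpath, the file probably needs to be edited by hand.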

Reload the webapp:
Code
touch ../tomcat5/webapps/ROOT/WEB-INF/web.xml

Screenshots follow.
This one is for 163:

This one is for xici:
