Crawl The Nutch --

Source: 百度文库  Editor: 神马文学网  Date: 2024/04/29 18:44:59

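Crawl modes
Intranet Crawling:
Suited to crawls where the expected total is on the order of one million pages and the number of sites is limited. The one-step command bin/nutch crawl is convenient here, and it is enough for many vertical search applications.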
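Whole-web Crawling:
Crawls very large volumes of WWW data. It is usually split into the following steps:
inject: inject URLs
generate: generate the fetch list
fetch: fetch the pages
updatedb: update the crawldb
invertlinks: build the link database
index: build the index
dedup: remove duplicates
merge: merge the indexes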
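These two modes are in fact basically the same and can be used interchangeably; the only difference is in the configuration files (my personal view!). With the crawl command, the configuration files that may be involved are crawl-urlfilter.txt, regex-urlfilter.txt, prefix-urlfilter.txt, suffix-urlfilter.txt, and automaton-urlfilter.txt. Note that these files must be on the classpath: if you start the program with the bin/nutch script, they should all be found in the conf directory. Also note that whether each of these filter files takes effect depends on your plugin configuration. Good grief, everything needs configuring!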
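To make the role of the filter files more concrete, here is a minimal sketch of how a rule list in the style of regex-urlfilter.txt could be evaluated. It assumes each rule line starts with + (accept) or - (reject) followed by a regular expression, that the first rule found in the URL decides, and that a URL matching no rule is dropped. The class name UrlFilterSketch and the domain example.com are made up for illustration; this is not the plugin's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of first-match-wins URL filtering in the style of regex-urlfilter.txt. */
class UrlFilterSketch {

    /** One rule: '+' accepts, '-' rejects; the rest of the line is a regex. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses rule lines, skipping blank lines and '#' comments. */
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) {
                continue;
            }
            char sign = s.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(s.substring(1))));
        }
        return rules;
    }

    /** Returns the verdict of the first rule whose regex is found in the URL. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false; // assumption: a URL that matches no rule is dropped
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(
            "# skip image and archive suffixes",
            "-\\.(gif|jpg|png|zip|gz)$",
            "# accept hosts under example.com (hypothetical domain)",
            "+^http://([a-z0-9]*\\.)*example\\.com/",
            "# reject everything else",
            "-.");
        System.out.println(accepts(rules, "http://www.example.com/index.html")); // true
        System.out.println(accepts(rules, "http://www.example.com/logo.gif"));   // false
        System.out.println(accepts(rules, "http://other.org/"));                 // false
    }
}
```

Because the first matching rule decides, the order of the lines matters: the suffix rule has to come before the domain rule, otherwise the image URL would be accepted.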
Updating
Run the following loop:
generate
fetch
updatedb
invertlinks
index
dedup
merge
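The relationship between the first crawl and the update loop can be sketched as follows. This is only an illustration of the ordering described above, assuming the single difference between the two is that seed URLs are injected once, in the first round. The names CrawlCycleSketch and stepsFor are hypothetical, and nothing here calls Nutch.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: which steps run in the first round versus a later update round. */
class CrawlCycleSketch {

    /** Step names as listed in the text, in execution order. */
    enum Step { INJECT, GENERATE, FETCH, UPDATEDB, INVERTLINKS, INDEX, DEDUP, MERGE }

    /** Returns the steps for one round; only the first round injects seed URLs. */
    static List<Step> stepsFor(boolean firstRound) {
        List<Step> steps = new ArrayList<>();
        for (Step s : Step.values()) {
            if (s == Step.INJECT && !firstRound) {
                continue; // the crawldb already exists, so nothing to inject
            }
            steps.add(s);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("first round : " + stepsFor(true));
        System.out.println("update round: " + stepsFor(false));
    }
}
```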
I made a small modification to org.apache.nutch.crawl.Crawl and produced a new class that performs the update conveniently in one step:
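package org.apache.nutch.crawl;

// Imports are assumed to match those used by Crawl.java in the same Nutch
// release; adjust them if your version places these classes elsewhere.
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.logging.Logger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.LogFormatter;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.DeleteDuplicates;
import org.apache.nutch.indexer.IndexMerger;
import org.apache.nutch.indexer.Indexer;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

/** Runs one update round (generate through merge) on an existing crawl directory. */
public class CrawlUpdate {

  public static final Logger LOG = LogFormatter
      .getLogger("org.apache.nutch.crawl.CrawlUpdate");

  /** Returns a timestamp used to name per-run directories. */
  private static String getDate() {
    return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date(System
        .currentTimeMillis()));
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.out
          .println("Usage: CrawlUpdate [-dir d] [-threads n] [-topN N]");
      return;
    }

    Configuration conf = NutchConfiguration.create();
    conf.addDefaultResource("crawl-tool.xml");
    JobConf job = new NutchJob(conf);

    // Defaults, overridden by the command-line options parsed below.
    Path dir = new Path("crawl-" + getDate());
    int threads = job.getInt("fetcher.threads.fetch", 10);
    int topN = Integer.MAX_VALUE;

    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        dir = new Path(args[i + 1]);
        i++;
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i + 1]);
        i++;
      }
    }

    // Unlike Crawl, an update requires the crawl directory to exist already.
    FileSystem fs = FileSystem.get(job);
    if (!fs.exists(dir)) {
      throw new RuntimeException(dir + " doesn't exist.");
    }

    LOG.info("crawl started in: " + dir);
    LOG.info("threads = " + threads);
    if (topN != Integer.MAX_VALUE)
      LOG.info("topN = " + topN);

    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    // Each run writes its own "indexes<timestamp>" directory.
    Path indexes = new Path(dir + "/indexes" + getDate());
    Path index = new Path(dir + "/index");
    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());

    // generate
    Path segment = new Generator(job).generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());
    // fetch
    new Fetcher(job).fetch(segment, threads, Fetcher.isParsing(job));
    if (!Fetcher.isParsing(job)) {
      new ParseSegment(job).parse(segment); // parse it, if needed
    }
    new CrawlDb(job).update(crawlDb, segment); // update crawldb
    new LinkDb(job).invert(linkDb, new Path[] { segment }); // invert links

    // index, dedup & merge
    new Indexer(job)
        .index(indexes, crawlDb, linkDb, new Path[] { segment });

    // Collect every "indexes*" directory, from this run and earlier ones.
    Path[] indexesDirs = fs.listPaths(dir, new PathFilter() {
      public boolean accept(Path p) {
        return p.getName().startsWith("indexes");
      }
    });
    new DeleteDuplicates(job).dedup(indexesDirs);

    // Merge all index parts into the single "index" directory.
    List indexesParts = new ArrayList();
    for (int i = 0; i < indexesDirs.length; i++) {
      indexesParts.addAll(Arrays.asList((fs.listPaths(indexesDirs[i]))));
    }
    new IndexMerger(fs, (Path[]) (indexesParts
        .toArray(new Path[indexesParts.size()])), index, tmpDir, job)
        .merge();

    LOG.info("crawl update finished: " + dir);
  }
}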
This lets me refresh my search data periodically with the following pattern:
Crawl urlsdir -dir crawl -topN 1000 -- initial crawl
CrawlUpdate -dir crawl -topN 1000 -- update
CrawlUpdate -dir crawl -topN 1000 -- update again
...
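The pattern above could be driven from a small launcher. The sketch below only builds and prints the command lines; it assumes the bin/nutch script will launch an arbitrary class when given its fully qualified name, and the names UpdatePlanSketch, command, and plan are hypothetical helpers, not part of Nutch. Each printed list could be handed to java.lang.ProcessBuilder to actually run it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: builds the command lines for one initial crawl plus N update runs. */
class UpdatePlanSketch {

    /** Builds one command; seedDir is null for update runs, which take no URL dir. */
    static List<String> command(String mainClass, String seedDir, String crawlDir, int topN) {
        List<String> cmd = new ArrayList<>(Arrays.asList("bin/nutch", mainClass));
        if (seedDir != null) {
            cmd.add(seedDir);
        }
        cmd.addAll(Arrays.asList("-dir", crawlDir, "-topN", String.valueOf(topN)));
        return cmd;
    }

    /** Returns the initial Crawl command followed by the requested number of updates. */
    static List<List<String>> plan(String seedDir, String crawlDir, int topN, int updates) {
        List<List<String>> all = new ArrayList<>();
        all.add(command("org.apache.nutch.crawl.Crawl", seedDir, crawlDir, topN));
        for (int i = 0; i < updates; i++) {
            all.add(command("org.apache.nutch.crawl.CrawlUpdate", null, crawlDir, topN));
        }
        return all;
    }

    public static void main(String[] args) {
        for (List<String> cmd : plan("urlsdir", "crawl", 1000, 2)) {
            System.out.println(String.join(" ", cmd));
        }
    }
}
```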
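I have not yet figured out whether it is a Lucene limitation or some other consideration, but Tomcat has to be stopped first when updating (more precisely, when updating the index), which feels a little uncomfortable.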
