Adding Chinese Word Segmentation to Lucene, Part 3 (another article from Lietu)
Source: Baidu Wenku · Editor: 神马文学网 · Date: 2024/04/29 16:48:23
Reposted from: http://www.lietu.com/en/
The tokenizer consists of two parts: the code, packaged in a jar file, and the dictionary data (a Chinese language model), which ships compressed in a zip file; uncompress it to a directory of your choice.
Make a CnAnalyzer class to test it:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import seg.result.CnTokenizer;
import seg.result.PlaceFilter;

/**
 * An Analyzer that demonstrates CnTokenizer.
 */
public class CnAnalyzer extends Analyzer {
    //~ Constructors -----------------------------------------------------------
    public CnAnalyzer() {
    }

    //~ Methods ----------------------------------------------------------------
    /**
     * Get a token stream from the input.
     *
     * @param fieldName Lucene field name
     * @param reader    input reader
     * @return TokenStream
     */
    public final TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new CnTokenizer(reader);
        result = new LowerCaseFilter(result);
        // A place-name filter is also added
        result = new PlaceFilter(result);
        return result;
    }
}
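The analyzer above just chains a tokenizer with successive filters (lowercasing, then place-name filtering). The decorator pattern behind that chaining can be sketched with minimal stand-in types; the names below (SimpleTokenStream, SimpleLowerCaseFilter, etc.) are hypothetical illustrations, not the real Lucene API:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for Lucene's TokenStream (hypothetical interface).
interface SimpleTokenStream {
    String next(); // returns null when the stream is exhausted
}

// Source of the chain: emits a fixed list of tokens one by one.
class ListTokenizer implements SimpleTokenStream {
    private final Iterator<String> it;
    ListTokenizer(List<String> tokens) { this.it = tokens.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// A filter wraps another stream and transforms each token it yields.
class SimpleLowerCaseFilter implements SimpleTokenStream {
    private final SimpleTokenStream in;
    SimpleLowerCaseFilter(SimpleTokenStream in) { this.in = in; }
    public String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase();
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        SimpleTokenStream s = new SimpleLowerCaseFilter(
            new ListTokenizer(Arrays.asList("Lucene", "中文", "Demo")));
        for (String t = s.next(); t != null; t = s.next()) {
            System.out.println(t); // prints: lucene, 中文, demo
        }
    }
}
```

Each filter in the real analyzer (LowerCaseFilter, PlaceFilter) wraps the stream below it in exactly this way, which is why the filters can be stacked in any order.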
Use a test class to test CnAnalyzer:
// Requires: import java.io.StringReader;
//           import org.apache.lucene.analysis.Token;
//           import org.apache.lucene.analysis.TokenStream;
public static void testCnAnalyzer() throws Exception {
    CnTokenizer.makeTag = false;
    String sentence =
        "其中包括兴安至全州、桂林至兴安、全州至黄沙河、阳朔至平乐、桂林至阳朔、桂林市国道过境线灵川至三塘段、平乐至钟山、桂林至三江高速公路。";
    StringReader input = new StringReader(sentence);
    long startTime = System.currentTimeMillis();
    TokenStream tokenizer = new seg.result.CnTokenizer(input);
    for (Token t = tokenizer.next(); t != null; t = tokenizer.next()) {
        System.out.println(t.termText() + " " + t.startOffset() + " "
            + t.endOffset() + " " + t.type());
    }
    // Stop the clock after the loop: segmentation happens lazily inside
    // next(), so timing only the constructor would miss most of the work.
    long endTime = System.currentTimeMillis();
    System.out.println("seg time cost:" + (endTime - startTime));
}
To run it, set the Java system property "dic.dir" on the command line so the tokenizer can find the dictionary data. For example:
"-Ddic.dir=D:/lg/work/SSeg/Data"
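The dictionary loader presumably picks this value up through System.getProperty; a minimal sketch of that pattern (the fallback path "./Data" is a hypothetical default, not part of the seg package):

```java
public class DicDirDemo {
    public static void main(String[] args) {
        // Read the dictionary directory passed via -Ddic.dir=...,
        // falling back to a default if the property is not set.
        String dicDir = System.getProperty("dic.dir", "./Data");
        System.out.println("dictionary dir: " + dicDir);
    }
}
```

If the property is missing at startup, the real tokenizer will fail to load its language model, so it is worth checking for and reporting a clear error in your own code.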
The package can also learn new words automatically from a corpus, for example:
java -Xmx512m "-Ddic.dir=/home/lg/SSeg/Data" -cp seg.jar:je.jar:libsvm.jar:lucene-1.4.3.jar seg.train.FindNewWords -p /home/lg/segtest/db -v -c ../0/