Adding Chinese Word Segmentation to Lucene, Part 3 (another article from Lietu)

The original text is reposted below: http://www.lietu.com/en/
The tokenizer consists of two parts: the code, shipped in a jar file, and the dictionary data (a Chinese language model), which is compressed in a zip file that you can uncompress to a path of your choice.
Make a CnAnalyzer class to test it:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import seg.result.CnTokenizer;
import seg.result.PlaceFilter;

/**
 * The Analyzer to demo CnTokenizer.
 */
public class CnAnalyzer extends Analyzer {
    //~ Constructors -----------------------------------------------------------
    public CnAnalyzer() {
    }

    //~ Methods ----------------------------------------------------------------
    /**
     * Get a token stream from the input.
     *
     * @param fieldName lucene field name
     * @param reader input reader
     *
     * @return TokenStream
     */
    public final TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new CnTokenizer(reader);
        result = new LowerCaseFilter(result);
        // Place-name filtering is added as well
        result = new PlaceFilter(result);
        return result;
    }
}
Use a test method to exercise the tokenizer that CnAnalyzer wraps:
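// Imports needed by the enclosing test class:
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import seg.result.CnTokenizer;

public static void testCnAnalyzer() throws Exception {
    long startTime;
    long endTime;
    StringReader input;
    CnTokenizer.makeTag = false;
    String sentence =
        "其中包括兴安至全州、桂林至兴安、全州至黄沙河、阳朔至平乐、桂林至阳朔、桂林市国道过境线灵川至三塘段、平乐至钟山、桂林至三江高速公路。";
    input = new StringReader(sentence);
    // Note: this interval covers only the CnTokenizer constructor call,
    // not the next() loop below.
    startTime = System.currentTimeMillis();
    TokenStream tokenizer = new CnTokenizer(input);
    endTime = System.currentTimeMillis();
    System.out.println("seg time cost:" + (endTime - startTime));
    // Print each token's text, start offset, end offset and type.
    for (Token t = tokenizer.next(); t != null; t = tokenizer.next()) {
        System.out.println(t.termText() + " " + t.startOffset() + " "
            + t.endOffset() + " " + t.type());
    }
}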
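The test method above prints each token's start and end offsets. In Lucene these are character offsets into the original text, with the end offset one past the token's last character, so sentence.substring(start, end) recovers the source text of a token. The following is a minimal sketch of that relationship in plain Java. It does not call CnTokenizer: OffsetDemo, Span and spansFor are hypothetical names, and the segmentation is written by hand for illustration, not actual tokenizer output.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {
    /** Hypothetical stand-in for a Lucene token: text plus [start, end) character offsets. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Builds spans for consecutive words, assuming they cover the text from offset 0 without gaps. */
    static List<Span> spansFor(String[] words) {
        List<Span> spans = new ArrayList<Span>();
        int pos = 0;
        for (String w : words) {
            spans.add(new Span(w, pos, pos + w.length()));
            pos += w.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        String sentence = "桂林至阳朔";
        // Hand-written segmentation for illustration only (an assumption, not CnTokenizer output).
        List<Span> spans = spansFor(new String[] {"桂林", "至", "阳朔"});
        for (Span s : spans) {
            // The substring taken with the offsets equals the token text.
            System.out.println(s.text + " " + s.start + " " + s.end
                + " -> " + sentence.substring(s.start, s.end));
        }
    }
}
```

The same substring check can be applied to the real tokens printed by the test method, keeping in mind that filters such as LowerCaseFilter may change a token's text without changing its offsets.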
To run it, set the Java system property "dic.dir" on the java command line so that it names the path where the dictionary was uncompressed. For example:
"-Ddic.dir=D:/lg/work/SSeg/Data"
It can also automatically learn new words from a corpus, for example:
java -Xmx512m "-Ddic.dir=/home/lg/SSeg/Data" -cp seg.jar:je.jar:libsvm.jar:lucene-1.4.3.jar seg.train.FindNewWords -p /home/lg/segtest/db -v -c ../0/
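Both commands above depend on the "dic.dir" system property. How the tokenizer itself reads that property is not shown in the original article, so the following is only a sketch of how a Java program could check the property up front, before constructing the tokenizer, and fail with a readable message. DicDirCheck and resolveDicDir are hypothetical names introduced for this example.

```java
import java.io.File;

public class DicDirCheck {
    /**
     * Returns the directory named by the given property value,
     * or null if the value is missing, blank, or not an existing directory.
     */
    static File resolveDicDir(String value) {
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        File dir = new File(value.trim());
        return dir.isDirectory() ? dir : null;
    }

    public static void main(String[] args) {
        // System.getProperty returns null when -Ddic.dir=... was not passed.
        File dir = resolveDicDir(System.getProperty("dic.dir"));
        if (dir == null) {
            System.err.println("dic.dir is not set or is not a directory; pass -Ddic.dir=<path>");
        } else {
            System.out.println("Using dictionary directory: " + dir.getAbsolutePath());
        }
    }
}
```

Running it as java "-Ddic.dir=/home/lg/SSeg/Data" DicDirCheck reports whether that path exists on the local machine.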