Segmentation of Chinese Text
Various approaches to the problems of separating the components of a sentence
TOM EMERSON
According to a recent report by the Gartner Dataquest group, the number of Internet subscribers in Mainland China is expected to grow almost 37 percent a year through 2004 to 51 million (“Gartner Dataquest,” ZDNetAsia, 2000). Given the huge growth potential, many non-Asian companies are attempting to migrate their on-line offerings into China, Japan and Korea, as well as the smaller (but growing) markets in Thailand, Vietnam and Indonesia. Users in most East Asian countries prefer and are coming to expect that Internet content be localized to their languages.
While all of these countries have their own barriers to entry, China as a whole (Mainland China, Taiwan, Hong Kong, Singapore and Macau) presents several difficult issues that have to be addressed before a company moves into the region. These include a myriad of conflicting character sets and character encodings, a number of languages and dialects and a disparate set of regions with different geopolitical alliances and user bases. One thing almost every application does is process text, and this task is especially difficult for Chinese.
Many text processing applications need to know where the words are in a line of text. For many languages, this is a relatively “easy” task: words are separated by white space and punctuation. There are some complicated cases, such as how to treat punctuation used after (or in) an abbreviation (“Dear Mr. Emerson, we at A.C.M.E. brand ...”) and single quotes (“He said, ‘I don’t know’”), but generally these problems are tractable.
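For languages that do use white space, a tokenizer can be surprisingly small. Here is a minimal Python sketch of the approach just described; the abbreviation list and the regular expression are illustrative only, not a complete treatment of English punctuation.

```python
import re

# A minimal whitespace-and-punctuation tokenizer for English.
# The abbreviation list is a toy; a real system would need a much
# larger one, plus rules for cases it cannot enumerate.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "A.C.M.E."}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:      # keep known abbreviations whole
            tokens.append(chunk)
            continue
        # Peel punctuation off each chunk, keeping internal
        # apostrophes ("don't") inside the word.
        tokens.extend(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[.,!?;:]", chunk))
    return tokens

print(tokenize("Dear Mr. Emerson, we at A.C.M.E. don't know."))
# ['Dear', 'Mr.', 'Emerson', ',', 'we', 'at', 'A.C.M.E.', "don't", 'know', '.']
```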
Chinese, in comparison, is written without any separation between words. White space serves little or no purpose. You are as likely to find spaces between every character as you are to find no spaces at all. Lines can be broken anywhere, even in the middle of a number.
What kinds of applications require word recognition?
Search engines (including Internet search engines such as Google and Lycos and general-purpose full-text search engines such as the Verity™ K2 Toolkit) often create indices based on the important words in a document.
In word processors, users expect to be able to navigate through a sentence by word, instead of being limited to moving by character or sentence.
Chinese spell-checkers need to know where the word boundaries are in order to correctly determine whether a particular word is misspelled in its context.
Speech recognition and speech generation applications need to know the word boundaries so they can correctly pronounce or transcribe an utterance.
For natural language processing (NLP) applications, such as automatic translation systems, the first phase is to find the words, as these are often the fundamental unit processed in an NLP application.
The problem of finding the words in a sentence, the “word segmentation problem,” is an area of active research in industry and academia. This article describes the issues faced when attacking this problem, as well as the approaches that have been used.
Word Segmentation
The problem sounds deceptively simple: given a sentence with no spaces, break it into words. The fundamental question is this: what is a word, and where does one begin and end in a sentence?
The first definition that might come to a Western mind is “a group of letters having meaning separated by spaces in the sentence.” Of course, this definition doesn’t work for Chinese, since white space is inconsequential. Is the word a single Chinese character? Not necessarily. Is it the smallest set of characters that can have meaning by themselves? Maybe. Is it the longest set of characters that can have meaning by themselves? Perhaps.
Even within Western linguistics there are several notions of what a word actually is. Does Chinese even have a notion of word in the Western sense? At least one respected Chinese linguist, Chao Yuen Ren, has argued at length that there are languages, like Chinese, which do not have a notion of “word-hood” like that in Western linguistics. However, this view is not widely accepted today.
Entire books have been written about the question of “word-hood” in Chinese: what is a word, what are the processes involved in word formation, and how have these evolved over time? I cannot answer these questions here. However, having a sound definition is vital when designing, implementing and testing a word-segmentation system.
Consider the following sentence:
我不是中国人
"I am not Chinese."
This sentence contains six characters, each of which is a dictionary word: 我 (wo, I), 不 (bu, not), 是 (shi, be), 中 (zhong, middle), 国 (guo, country) and 人 (ren, person). Yet no Chinese speaker would actually view the sentence as six separate words. Depending on your dictionary, 不是 (not be) may be listed as a word. In addition, 中国人 can be segmented in four different ways using a dictionary:
中 (middle), 国 (country) and 人 (person)
中 (middle) and 国人 (compatriot)
中国 (China) and 人 (person)
中国人 (Chinese)
The correct segmentation is the last one. The second and third interpretations illustrate a common type of ambiguity faced when trying to segment sentences: ABC can be segmented as either A+BC or AB+C.
Regardless of the method used to segment a sentence, there are some constructions that can cause problems: transliterated foreign words and names, abbreviations, and personal, organization and company names. These often comprise the “interesting” parts of a sentence (from an information retrieval perspective); hence, correctly recognizing and segmenting these is vital.
Foreign names are written in Chinese by transliterating them using Chinese characters for their sound value only. The meaning of each character is irrelevant and cannot be relied on. Different Chinese-speaking regions often transliterate the same name differently: Kennedy is transcribed as 肯尼迪 (Kennidi) in Mainland China but 甘迺迪 (Gannaidi) in Taiwan. Fortunately, each region uses a relatively small and consistent set of characters when transliterating.
Abbreviations are another source of complication. In Chinese, abbreviations are formed by taking a character from each word in the phrase being abbreviated. Sometimes it is the first character in each, sometimes the second. For example, Beijing University, 北京大学, is abbreviated 北大. As another example, you could write 中美 instead of 中国美国 for Chinese-American or Sino-American. The complication arises from the fact that virtually any phrase can be abbreviated by taking a character from each component, and these characters usually have no independent relation to each other. Because of this, abbreviations cannot simply be listed in a dictionary: it is impossible to enumerate every one that might be coined.
Chinese proper names are extremely difficult to recognize since they can be created from almost any combination of characters. While some characters are considered bad luck or inappropriate for a name, most characters are fair game. A Chinese name is formed from a single-character surname (though two- and three-character surnames are sometimes seen, particularly in Singapore) followed by a one- or two-character given name. There are approximately 100 common surnames, but the number of given names is huge. It is often difficult to determine gender based solely on the name. Some names related to beauty, flowers and such are generally used for girls, but others are gender-neutral: I’ve known both males and females named Yuen, for example. The complication lies in the fact that the characters used in names also appear in “regular” words and may even be words themselves in some contexts (compare with English names such as “Dawn White”).
Another interesting, though not necessarily complicating, issue in written Chinese is the various ways you can write numbers. For example, the number two hundred can be written several different ways, including 200, ２００ (in full-width digits), 2百, 二百, 贰百, 二00 and 2○○, all of which are valid. This complicates number parsing slightly and is something you need to be aware of when working with Chinese text.
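To make the issue concrete, here is a small Python sketch that normalizes the variants above to the same value. It handles only the patterns shown; a real normalizer would also need 十 (tens), 千 (thousands), 万 (ten-thousands) and mixed positional forms.

```python
# A minimal sketch: normalize the "two hundred" variants above to 200.
# The digit table and the two patterns handled are only what the
# example requires.
DIGIT = {
    "0": 0, "1": 1, "2": 2, "3": 3, "4": 4,
    "5": 5, "6": 6, "7": 7, "8": 8, "9": 9,
    "〇": 0, "○": 0, "零": 0,
    "一": 1, "二": 2, "贰": 2, "三": 3, "四": 4,
    "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
}

def normalize_number(s):
    if s.endswith("百"):               # e.g. 2百, 二百, 贰百
        return DIGIT[s[:-1]] * 100
    value = 0                          # positional, e.g. 200, 二00, 2○○
    for ch in s:
        value = value * 10 + DIGIT[ch]
    return value

for form in ["200", "2百", "二百", "贰百", "二00", "2○○"]:
    print(form, "->", normalize_number(form))   # all print 200
```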
Approaches to Segmentation
So, how do you go about breaking a sentence into its constituents? At a high level, three approaches are available: statistical techniques, dictionary-based techniques and a hybrid approach using a combination of these. Each has its supporters and detractors, and each has its advantages and disadvantages.
Statistical Approaches
The basic idea of all the statistical approaches is the recognition that certain characters have a stronger affinity for some characters than for others. For example, in English the pair th is far more likely to occur than the pair td. If you have a large enough (tens of megabytes) text sample, you can build a statistical “language model” that can be used to break a sequence of characters into segments. Breaks are placed at the points of minimum affinity between adjacent characters.
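As a rough illustration, the following Python sketch scores adjacent characters with pointwise mutual information, one of several possible affinity measures, and breaks wherever the score drops below a threshold. The training corpus and the threshold are placeholders.

```python
import math
from collections import Counter

# Affinity-based segmentation sketch: train character unigram/bigram
# counts, then break the input wherever the pointwise mutual
# information (PMI) of an adjacent pair falls below a threshold.
def train(corpus):
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n = len(corpus)
    def pmi(a, b):
        p_ab = bigrams[(a, b)] / max(n - 1, 1)
        # Unseen pairs (including unseen characters) get -infinity,
        # which forces a break.
        if p_ab == 0:
            return float("-inf")
        return math.log(p_ab / ((unigrams[a] / n) * (unigrams[b] / n)))
    return pmi

def segment(sentence, pmi, threshold=0.0):
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if pmi(a, b) >= threshold:
            current += b           # strong affinity: same word
        else:
            words.append(current)  # weak affinity: break here
            current = b
    words.append(current)
    return words

# Hypothetical usage, given some large training text:
# pmi = train(open("corpus.txt", encoding="utf-8").read())
# print(segment("我不是中国人", pmi))
```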
There are two parameters to consider in a statistical model: how many consecutive characters are examined, and which statistical method is used to build the model.
The more characters that are included in the model, the more complex it becomes. Using an example once again from English, assume that we have twenty-six letters (ignoring case distinction). Starting with a single character, there are twenty-six possible characters appearing after it, and twenty-six possible characters appearing after each of those, so any single character has 676 possible two-character continuations. Some are more probable than others: t+h is far more probable than t+d, and th+e is considerably more probable than th+v. For Chinese these factors are much greater: there are approximately 6,000 characters in common use in Mainland China, so looking at a three-character sequence means that there are over 36 million possibilities at any point in the input.
The statistical method used varies; the most common is a first-order hidden Markov model. A Markov model is a description of a sequence of random variables over time. The value of a particular variable at time t+1 is dependent only on the state of the variable at time t. In the context of segmentation, the Markov model describes how the probability that the character at position p+1 is part of a word depends on the character at position p. The order of the model describes the number of previous states considered at each position. A Markov model is “visible” if all of the information needed to compute the probabilities is available to the model. A model is “hidden” if some of the information is outside its scope. For example, in a language model, grammatical information exists outside the model but can affect the adjacency of characters. Higher-order hidden Markov models (third order and above) are computationally intractable and are rarely used.
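To make this concrete, here is a compact Python sketch of Viterbi decoding over a first-order model in which each character is tagged B (begins a word) or I (inside a word). The probability tables are toy values; in a real system they would be estimated from a hand-segmented corpus.

```python
# Viterbi decoding sketch for B/I character tagging.
# All quantities are log-probabilities; the tables are placeholders.
STATES = ("B", "I")

def viterbi(sentence, start_p, trans_p, emit_p):
    # best[t][s]: log-probability of the best tag path ending in s
    best = [{s: start_p[s] + emit_p[s].get(sentence[0], -20.0) for s in STATES}]
    back = [{}]
    for t in range(1, len(sentence)):
        best.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda r: best[t - 1][r] + trans_p[r][s])
            best[t][s] = (best[t - 1][prev] + trans_p[prev][s]
                          + emit_p[s].get(sentence[t], -20.0))
            back[t][s] = prev
    # Recover the best tag path, then cut the sentence before every B.
    state = max(STATES, key=lambda s: best[-1][s])
    tags = [state]
    for t in range(len(sentence) - 1, 0, -1):
        state = back[t][state]
        tags.append(state)
    tags.reverse()
    words, cur = [], ""
    for ch, tag in zip(sentence, tags):
        if tag == "B" and cur:
            words.append(cur)
            cur = ""
        cur += ch
    words.append(cur)
    return words

# Toy tables (invented log-probabilities, for illustration only):
start_p = {"B": -0.1, "I": -2.3}
trans_p = {"B": {"B": -1.2, "I": -0.4}, "I": {"B": -0.7, "I": -0.7}}
emit_p = {"B": {"我": -2.0, "不": -2.5, "中": -2.2},
          "I": {"是": -2.0, "国": -2.1, "人": -2.3}}
print(viterbi("我不是中国人", start_p, trans_p, emit_p))
```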
The disadvantage of statistical approaches is called the “data-sparseness problem”: a language model is only as comprehensive as the data used to train it. The type of language differs greatly between newspaper articles, government documents and romance novels, and a model trained on one may not work well on another. Another problem is that certain characters, such as 的 (de) and 了 (le), appear with high frequency but carry only grammatical meaning, which can dilute the probabilities.
Dictionary-Based Approaches
The dictionary-based techniques can be divided into two varieties: strict dictionary approaches and a combination of dictionary and linguistic knowledge. The idea is straightforward: use a dictionary to find the words in the sentence. Several methods have used this theme.
Starting at the left and moving to the right, repeatedly find the longest word at the current position that exists in the dictionary, until you get to the end of the sentence. This technique, called forward maximum match, is based on the hypothesis that a longer word is more specific and probably correct in the given context.
Alternatively, start at the right and work your way to the left, finding the longest match. This is called backward maximum match. The maximum match strategy, while conceptually simple, does quite well in practice, as long as the dictionary is comprehensive.
Both methods are often used together, since the combination can detect certain ambiguities. For example, a string ABC may be segmented as either A+BC, or AB+C. By segmenting in both directions, you can determine whether such an ambiguity exists: the right-to-left will find the former, and the left-to-right the latter.
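Here is a Python sketch of both directions over a toy dictionary; comparing the two outputs is one way to flag the A+BC versus AB+C ambiguities described above.

```python
# Forward and backward maximum match over a toy dictionary.
DICT = {"我", "不", "是", "不是", "中", "国", "人", "中国", "国人", "中国人"}
MAX_LEN = max(len(w) for w in DICT)

def forward_mm(s):
    words, i = [], 0
    while i < len(s):
        for j in range(min(len(s), i + MAX_LEN), i, -1):
            if s[i:j] in DICT or j == i + 1:   # fall back to one character
                words.append(s[i:j])
                i = j
                break
    return words

def backward_mm(s):
    words, j = [], len(s)
    while j > 0:
        for i in range(max(0, j - MAX_LEN), j):
            if s[i:j] in DICT or i == j - 1:   # fall back to one character
                words.append(s[i:j])
                j = i
                break
    return list(reversed(words))

s = "我不是中国人"
print(forward_mm(s))   # ['我', '不是', '中国人']
print(backward_mm(s))  # ['我', '不是', '中国人']
# Disagreement between the two outputs would signal an ambiguous span.
```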
There are limitations with the pure dictionary methods. The size and quality of the dictionary (or dictionaries) used are of paramount importance. By the nature of language, no dictionary can ever be complete: new words are constantly being coined, so a dictionary is constantly out-of-date. Different text domains have different vocabularies, which require domain-appropriate dictionaries to even attempt a correct segmentation. In general, once the language model is created, a probabilistic segmenter will run faster than a dictionary-based segmenter.
A pure maximal match strategy also has its limitations. The method is inherently greedy, which can cause missegmentations. Say you have the string ABC, where A, AB, C and BC are all in the dictionary. If the correct segmentation is A+BC, the forward maximal match will incorrectly find AB+C.
A variation of maximal match is to find all of the matches at each point in the sentence and have some method for picking the right one (a sketch of this enumeration follows the example below). In the case described in the previous paragraph, both segmentations would be generated and the correct one (hopefully) selected. As a more concrete example, recall the sentence given earlier, 我不是中国人. This has four components:
我 + 不 + 是 + 中国人
It is possible to treat 不是 as a single word, yielding
我 + 不是 + 中国人
And as we saw earlier, 中国人 has a number of possible segmentations:
中 + 国 + 人
中 + 国人
中国 + 人
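A small Python sketch of this exhaustive variant, using the same toy dictionary as before, recursively lists every dictionary covering of the sentence and leaves the choice of the best one to a later step:

```python
# Enumerate every segmentation of s in which each piece is a
# dictionary word. The dictionary is the same toy set as above.
DICT = {"我", "不", "是", "不是", "中", "国", "人", "中国", "国人", "中国人"}

def all_segmentations(s):
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        if s[:i] in DICT:
            results.extend([s[:i]] + rest for rest in all_segmentations(s[i:]))
    return results

for seg in all_segmentations("我不是中国人"):
    print(" + ".join(seg))
# 我 + 不 + 是 + 中 + 国 + 人
# ... (eight segmentations in all, including 我 + 不是 + 中国人)
```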
So you need a way to choose the correct one. This is when making use of other knowledge — morphology and grammar — comes into play. The word 人 (person), for example, is a very productive word that can be combined with country, city and other place names to create a word referring to a person of or from that place. So, one can code a rule that states, “When a country name is followed by 人, join them to create a new word.” By creating a number of generalized rules like this based on Chinese morphology, you can correctly segment many cases that are not handled by a dictionary alone.
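A sketch of one such rule in Python might look like the following; the place-name set stands in for the part-of-speech information a real dictionary would supply.

```python
# One morphological rule: when a segment known to be a place name is
# followed by 人, join them into a single word. PLACE_NAMES is a toy
# stand-in for real part-of-speech tags.
PLACE_NAMES = {"中国", "美国", "北京"}

def apply_person_rule(segments):
    merged, i = [], 0
    while i < len(segments):
        if (i + 1 < len(segments) and segments[i] in PLACE_NAMES
                and segments[i + 1] == "人"):
            merged.append(segments[i] + "人")   # e.g. 中国 + 人 -> 中国人
            i += 2
        else:
            merged.append(segments[i])
            i += 1
    return merged

print(apply_person_rule(["我", "不是", "中国", "人"]))
# ['我', '不是', '中国人']
```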
Grammatical knowledge can be used to aid in disambiguation. If two segments have conflicting parts of speech, then you can reject those segments. For example, an adjective and an adverb cannot appear together. There are also characters that cannot stand in isolation, but must be part of a larger word. Using this knowledge allows a dictionary-based system to handle words that are not explicitly encoded in the dictionary.
There are some linguistic features that you can take advantage of to simplify the segmentation task. Any type of punctuation can be used as a stopping point; an entire sentence does not need to be processed. Similarly, when you see an Arabic numeral, you know that a new word will follow it, which allows you again to limit the number of characters that need to be considered. There are also features that can be used as a pivot when examining a sentence. For example, grammatical particles like 的 (de) and 了 (le) and measure words like 个 (ge) can be used as a possible transition point since a segment probably ended just before them. Similar methods can be used with longer segments that are easily recognized, such as numbers and date expressions.
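Here is a sketch of this pre-splitting step in Python; the CJK Unified Ideographs range stands in for a proper classification of Chinese characters.

```python
import re

# Use punctuation and Arabic numerals as "pivots": pre-split the text
# into runs of Chinese characters versus everything else, so only the
# Chinese runs need to be passed to the (slower) segmenter.
def split_at_pivots(text):
    return re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text)

print(split_at_pivots("我有2个苹果。你呢？"))
# ['我有', '2', '个苹果', '。', '你呢', '？']
# Only the runs of Chinese characters go to the segmenter.
```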
Word frequency information can also be taken into account when trying to select the correct segmentation. In contemporary Chinese the average word length is around 2.5 characters, so a two-character segment AB, which is quite frequent, is probably correct while A+B, where A and B are very rare, is incorrect.
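A minimal Python sketch of frequency-based selection, scoring each candidate under a unigram model built from invented counts:

```python
import math

# Score each candidate segmentation by its log-probability under a
# unigram model. The frequency counts here are invented; rare or
# unknown words are heavily penalized, so fewer, more frequent words win.
FREQ = {"我": 5000, "不": 4000, "是": 4500, "不是": 1200,
        "中": 300, "国": 250, "人": 2000, "中国": 3000,
        "国人": 50, "中国人": 1500}
TOTAL = sum(FREQ.values())

def score(segmentation):
    return sum(math.log(FREQ.get(w, 1) / TOTAL) for w in segmentation)

candidates = [["我", "不是", "中国人"],
              ["我", "不是", "中国", "人"],
              ["我", "不", "是", "中", "国", "人"]]
print(max(candidates, key=score))
# ['我', '不是', '中国人'] under these toy counts
```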
The hybrid systems have their disadvantages. They require a dictionary with part-of-speech and word frequency information. They also require you to develop, test and maintain the set of grammatical and morphological rules that are used in the disambiguation process. All of the computation involved in disambiguating the segmentation means that these systems can be quite slow. However, the accuracy can be quite good.
Basis Technology’s CMA
Basis Technology develops Asian linguistic technologies targeted at search engine providers, including Google, Lycos and Verity. This section describes some of the design and implementation decisions that went into the Basis Technology Chinese Morphological Analyzer (CMA), a high-performance segmentation engine designed for information retrieval applications.
CMA is a dictionary-based segmentation system that makes use of part-of-speech information, grammatical and syntactic knowledge, and word frequency information to aid in the disambiguation of ambiguous sentences. Unicode is used throughout the analyzer, allowing it to be agnostic with regard to character encoding.
CMA uses a dictionary containing more than 1.2 million entries, each including part-of-speech and frequency information. Entries range from single characters to phrases containing twelve or more characters. The dictionary also contains thousands of proper nouns, including place names, organizations and companies. The dictionaries are updated several times a year to stay current. They include vocabulary from all Chinese locales. Users are also able to supply their own dictionaries to provide domain-specific vocabularies to the segmenter.
CMA moves left to right, generating all possible segmentations for a string of Chinese characters and using various heuristics to “prune” improbable segmentations and favor others. Such pruning is absolutely necessary: the number of possible segmentations can grow exponentially with the length of the string. These heuristics include favoring longer segments (as in maximum match); disfavoring rare, that is, low-frequency, single-character words; disfavoring pairs of segments with incompatible parts of speech (for example, an adverb and a noun do not go together); and joining compatible segments (new words can be created by combining compatible segments based on grammatical rules).
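By way of illustration only, here is a generic beam-pruning sketch in Python with an invented scoring heuristic; it shows the general idea of keeping only the most promising partial segmentations at each position, not Basis Technology's actual implementation.

```python
# Generic beam pruning (not CMA's implementation): extend partial
# segmentations left to right, score them with a simple heuristic,
# and keep only the best few hypotheses at each character position.
DICT = {"我": 5000, "不": 4000, "是": 4500, "不是": 1200,
        "中": 300, "国": 250, "人": 2000, "中国": 3000,
        "国人": 50, "中国人": 1500}
BEAM = 3

def segment_beam(s):
    # beams[i]: best (score, words) hypotheses covering s[:i]
    beams = {0: [(0.0, [])]}
    for i in range(len(s)):
        for score, words in beams.get(i, []):
            for j in range(i + 1, len(s) + 1):
                w = s[i:j]
                if w in DICT:
                    # Invented heuristic: favor longer words
                    # (quadratically) and more frequent words.
                    new = score + len(w) ** 2 + DICT[w] / 10000.0
                    beams.setdefault(j, []).append((new, words + [w]))
        if i + 1 in beams:                 # prune to the top BEAM
            beams[i + 1] = sorted(beams[i + 1], reverse=True)[:BEAM]
    return max(beams[len(s)])[1]

print(segment_beam("我不是中国人"))   # ['我', '不是', '中国人']
```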
The analyzer also has knowledge of how some types of words are formed, such as numeric expressions and “reduplicative phrases” (in Chinese some words can be repeated in various patterns for certain purposes) which it uses to recognize larger units in the text.
CMA is under constant development, with the aim of improving its handling of various constructs, including proper names, abbreviations and transliterated foreign words.
For Further Reference
Chao, Yuen Ren. A Grammar of Spoken Chinese. Berkeley: University of California Press, 1968.
Packard, Jerome L., ed. New Approaches to Chinese Word Formation: Morphology, Phonology and the Lexicon in Modern and Ancient Chinese. Berlin: Mouton de Gruyter, 1997.
ZDNetAsia. “Gartner Dataquest Predicts Soaring Internet Growth in Asia Pacific.” 18 December 2000. [On-line]
Tom Emerson is a senior software engineer at Basis Technology Corp. in Cambridge, MA. He can be reached at tree@basistech.com.
This article is reprinted from MultiLingual Computing & Technology, #38, Volume 12, Issue 2, published by MultiLingual Computing, Inc., 319 North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.
Basis Technology Corp., 150 CambridgePark Drive, Cambridge, MA 02140-2322 USA, 617-386-2000, 800-697-2062, Fax: 617-386-2020, e-mail: info@basistech.com, Web: http://www.basistech.com