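Classification No.:    Security level:
UDC:    Serial No.:
Graduate School of the Chinese Academy of Sciences
Master's Thesis
Research on Chinese Word and Sentence Segmentation Techniques and Machine Translation Evaluation Methods
Liu Ding
Supervisor: Zong Chengqing, Research Professor, Ph.D., Institute of Automation, Chinese Academy of Sciences
Degree applied for: Master of Engineering
Discipline: Pattern Recognition and Intelligent Systems
Thesis submission date: June 2004
Thesis defense date: June 2004
Training institution: Institute of Automation, Chinese Academy of Sciences
Degree-granting institution: Graduate School of the Chinese Academy of Sciences
Chair of the defense committee:

Approaches to Chinese Word Analysis, Utterance Segmentation and Automatic Evaluation of Machine Translation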
Dissertation Submitted to Institute of Automation, Chinese Academy of Sciences in partial fulfillment of the requirements for the degree of Master of Engineering by Ding Liu (Pattern Recognition and Intelligence System)
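Dissertation Supervisor: Professor Chengqing Zong

Declaration of Originality
I declare that the thesis I am submitting presents research work carried out, and results obtained, by me personally under the guidance of my supervisor. To the best of my knowledge, except where specifically marked and acknowledged in the text, the thesis contains no research results previously published or written by others. Any contribution made to this research by colleagues who worked with me has been clearly stated and acknowledged in the thesis.
Signature: _  Supervisor's signature: _  Date: _

Statement on Authorization for Use of the Thesis
I fully understand the regulations of the Institute of Automation, Chinese Academy of Sciences on the retention and use of degree theses, namely: the Institute has the right to keep copies of the submitted thesis and to allow it to be consulted and borrowed; it may publish all or part of the thesis and may preserve it by photocopying, reduced-size printing or other means of reproduction. (Classified theses shall comply with these regulations after declassification.)
Signature: _  Supervisor's signature: _  Date: _

Abstract (Chinese version)
Based on statistical models and drawing on a large body of previous work, this thesis presents a fairly in-depth study of Chinese word analysis, spoken utterance segmentation and machine translation evaluation. Chinese word analysis is the first step in most Chinese language processing, and its importance is self-evident. Utterance segmentation is the bridge that connects speech recognition and text translation in speech translation: however well speech recognition and text translation perform on their own, if this bridge is not built well the overall performance still cannot [...]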
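[...] proposes innovative methods based on statistical models for Chinese word analysis, spoken utterance segmentation and machine translation evaluation. They are not only of significant reference value in terms of scientific method but also of practical importance for real applications.

ABSTRACT
This thesis proposes novel statistical approaches to Chinese word analysis, utterance segmentation and automatic evaluation of machine translation (MT).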
Word analysis is the first step for most applications based on Chinese language technologies; utterance segmentation is the bridge that connects speech recognition and text translation in a speech translation system; automatic evaluation of MT systems can speed up the research and development of an MT system and reduce its development cost.
In short, all three topics belong to basic research in Natural Language Processing (NLP) and are significant for many important applications such as text translation and speech translation.
In Chinese word analysis, we proposed a novel unified approach based on HMM, which efficiently combines word segmentation, Par[...]
The experimental results show that our combined model, by jointly considering information about Chinese characters, words, POS tags and named entities (NE), achieved much better precision and recall in Chinese word segmentation.
Building on the combined model, we described the implementation of the general word segmentation system APCWS. We discussed several technical problems in saving and loading data, and described our modules for knowledge management and word lattice construction.
In utterance segmentation, this thesis proposed a novel approach based on a bi-directional N-gram model and the Maximum Entropy model. This method, which effectively combines the normal and reverse N-gram algorithms, is able to use both the left and the right context of a candidate site and achieved very good performance in utterance segmentation.
We conducted experiments in both Chinese and English. The results showed that our method performed much better than the normal N-gram algorithm. By analyzing the experimental results, we found the reason: on the one hand, the method retains the correct segmentations of the normal N-gram algorithm; on the other hand, it avoids incorrect segmentations by making use of the reverse N-gram algorithm.
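The bi-directional idea can be illustrated with a minimal sketch. It is not the thesis's actual algorithm: it omits the Maximum Entropy component, uses bigrams with add-k smoothing, and places a boundary only when the forward and the reverse model both prefer one. The function names, the smoothing constant, the agreement rule and the toy training data are all assumptions made for illustration.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with boundary tokens."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = [BOS] + list(words) + [EOS]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def prob(model, prev, word, k=0.01):
    """Add-k smoothed bigram probability P(word | prev); k is an assumed constant."""
    unigrams, bigrams = model
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

def prefers_boundary(model, left, right):
    """True if 'left </s> <s> right' is more probable than 'left right'."""
    split = prob(model, left, EOS) * prob(model, BOS, right)
    joined = prob(model, left, right)
    return split > joined

def segment(words, forward, reverse):
    """Split a word stream at sites where both directional models prefer a boundary."""
    utterances, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i + 1 < len(words):
            nxt = words[i + 1]
            # The reverse model reads right to left, so left and right swap roles.
            if prefers_boundary(forward, word, nxt) and prefers_boundary(reverse, nxt, word):
                utterances.append(current)
                current = []
    if current:
        utterances.append(current)
    return utterances

# Hypothetical toy corpus of already segmented utterances.
training = [
    ["hello"],
    ["how", "are", "you"],
    ["i", "am", "fine"],
    ["thank", "you"],
]
forward = train_bigram(training)
reverse = train_bigram([s[::-1] for s in training])  # same data, read right to left
result = segment(["how", "are", "you", "i", "am", "fine"], forward, reverse)
print(result)
```

Requiring agreement is one simple way for the reverse model to veto boundaries that the forward model would place on its own. A trained classifier over both models' scores would be another way to combine them.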
In automatic evaluation of MT systems, we first introduced two classic automatic evaluation methods that rely on reference translations. We then proposed a novel sentence fluency evaluation method based on the N-gram model. This method, called E3, needs no reference translations and achieved very good evaluation performance by making discriminative use of the different transition probabilities of the words in the sentence being evaluated.
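As a rough illustration of reference-free fluency scoring, the sketch below rates a sentence by its average log transition probability under a bigram model. This is a generic baseline and not E3 itself, since the way E3 weights individual transition probabilities is not described in this abstract. The function names, the add-k smoothing constant and the toy corpus are assumptions.

```python
import math
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams over sentences padded with boundary tokens."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def fluency(words, model, k=0.01):
    """Average log-probability per bigram transition; higher means more fluent."""
    unigrams, bigrams = model
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(words) + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-k smoothing keeps unseen transitions from having zero probability.
        p = (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)
        total += math.log(p)
    return total / (len(padded) - 1)

# Hypothetical toy corpus standing in for target-language training text.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
model = train(corpus)
good = fluency(["the", "cat", "ran"], model)
bad = fluency(["ran", "the", "cat"], model)
print(good > bad)
```

Normalizing by the number of transitions keeps scores comparable across sentences of different lengths, which matters when ranking the outputs of different MT systems.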
In summary, this thesis proposed novel approaches to three basic research problems in NLP: Chinese word analysis, utterance segmentation and automatic evaluation of MT systems. We believe the original ideas in them not only have important reference value for other research, but can also be used to improve the performance of NLP applications.
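Contents
Chapter 1 Introduction
Chapter 2 Statistical Language Models
2.1 N-gram Model
2.1.1 Definition of the N-gram Model
2.1.2 Parameter Estimation
2.2 Hidden Markov Model
2.2.1 Definition
2.2.2 The Three Problems Associated with HMMs
2.3 Maximum Entropy Model
2.3.1 Introduction
2.3.2 Definition
2.3.3 Parameter Training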
