Cross-Language Information Retrieval Technology (跨语言信息检索技术28085.pptx)
Cross Language Information Retrieval

Road Map
- Cross-Lingual IR
- Motivation
- Definition
- General Issues With CLIR
- Basic Approaches to CLIR
- CLIR evaluation
- CLIR applications

2023/3/15

Information Retrieval
- Single language: both the user's query and the documents to be searched are in the same language.
- Cross language: the documents are written in a language different from the language of the user's query.
[Figure labels: documents, query]

Growth in Internet language usage by world region, 2000-2010 (data updated June 30, 2010)

The Internet Big Picture

World Region        Population      Internet Users   Penetration      Users %    Growth
                                                     (% population)   of Table   2000-2015
Africa              1,158,355,663     313,257,074    27.0%             9.6%      6,839%
Asia                4,032,466,882   1,563,208,143    38.8%            47.8%      1,268%
Europe                821,555,904     604,122,380    73.5%            18.5%        475%
Middle East           236,137,235     115,823,882    49.0%             3.5%      3,426%
North America         357,172,209     313,862,863    87.9%             9.6%        191%
Latin America         617,776,105     333,115,908    53.9%            10.2%      1,743%
Oceania/Australia      37,157,120      27,100,334    72.9%             0.8%        256%
World Total         7,260,621,118   3,270,490,584    45%             100%          806%
World Internet Users and 2015 Population Stats

Usage of content languages for websites

2002                  2015
English      72%      English      54.5%
German        7%      Russian       5.9%
Japanese      6%      German        5.7%
Spanish       3%      Japanese      5.0%
French        3%      Spanish       4.7%
Italian       2%      French        4.1%
Dutch         2%      Portuguese    2.6%
Chinese       2%      Chinese       2.2%
Korean        1%      Italian       2.1%
Russian       1%      Polish        1.9%
Portuguese    1%      Turkish       1.6%

Cross Language IR
- Motivation
  - Information unavailability in some languages
  - Language barrier
- Definition:
  - Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia)
- Example:
  - A user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
- We need technologies that enable access to information regardless of geographic or language barriers.
- To find, retrieve and understand relevant information in whatever language or form.
- CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
- Multilingual text access (character sets, etc.)
- Differences between languages: stemming, compound words, breaks between words, etc.
- Term ambiguity between languages
- What to translate (query vs. document) and how

Matching strategies
- No translation
  - (1) Cognate matching
- Translation
  - (2) Query translation
  - (3) Document translation
  - (4) Interlingual techniques

Cognate matching
- In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
- The unchanged term can be expected to match a corresponding term in the other language if the two languages are closely related (for example, "generation" in English and French).
- When the two languages are very different, cognate matching may still be feasible by measuring the similarity between a transliteration and its original word.

Query translation
[Figure: query translation workflow. Labels: Chinese query, translation system, French query, search engine, French document collection, French documents, results, select and browse]
Process:
- Translate the Chinese query into French
- Search the French document collection
- Translate the retrieved results into Chinese

Query translation
- Query translation is the most widely used matching strategy for CLIR due to its tractability.
- The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
- Translating a query is computationally cheaper than translating a large set of documents.
- Challenge: term ambiguity
  - Queries are often short, and short queries provide little context for disambiguation.
- Term disambiguation will be discussed later.

Pros and cons of query translation
- Advantages
  - Simple
  - Easy to operate
  - Flexible
  - Saves time and space; highly efficient
- Disadvantages
  - Lacks context
  - For short queries, translation is highly ambiguous

Document translation
[Figure: document translation workflow. Labels: Chinese query, French document collection, translation system, Chinese document collection, search engine, results, select and browse]
Process:
- Translate the entire French document collection into Chinese
- Search the Chinese documents directly

Document translation
- Document translation has the opposite advantages and disadvantages to query translation.
- In CLIR experiments this approach is not usually used, and query translation is dominant.
- However, some researchers have used it to translate large sets of documents, since the more varied context available within each document can improve translation quality.
- Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperforms query translation in an experiment of CLIR from German to English.

Pros and cons of document translation
- Advantages
  - Each document is translated only once
  - Documents provide richer context
  - Documents can be translated offline in advance
- Disadvantages
  - Translation is slow
  - Consumes a lot of space and time; low efficiency
  - Depends on the quality of the machine translation system

Query translation vs. document translation
- The choice depends on the language-specific resources available
- Query translation is generally more widely used
- Both methods raise an "interactivity" challenge

Interlingual approach
- The query and the documents are both converted into an intermediate space of subject representation, where they are compared.
- One type of interlingual approach uses the synsets provided in WordNet, a well-known machine-readable thesaurus.
- For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
- Since a synset number (label) representing a concept corresponds to a set of concrete words in each of the supported languages (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques

Dictionary-based methods
- Use a bilingual Machine Readable Dictionary (MRD).
- Most retrieval systems are still based on so-called bag-of-words architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through a process of indexing.
- Thus we can translate a query easily by replacing each query term with its translation equivalents appearing in a bilingual dictionary or a bilingual term list.

Bilingual dictionary

Term translation
[Figure: term translation example. Candidate translations shown: oil, petroleum, probe, survey, take samples; restrain, cymbidium goeringii. Annotations: "Which translation to choose?", "No translation!", "Word segmentation error"]

Some issues in term translation
- Compound words, for example in German
  - Decomposition
- No boundary between words, e.g. in Chinese
  - Segmentation
- Specialized vocabulary not contained in the dictionary, e.g. named entities

Examples
- Compound decomposition
- Chinese word segmentation
  - 新西兰花
  - 新西兰 花: New Zealand flowers
  - 新 西兰花: fresh broccoli

Corpora-based method
- Parallel or comparable corpora are useful resources from which we can extract information that benefits CLIR.
- For example, in order to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents which had been searched using an English query (source query).

Parallel corpora
- A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are translations of documents in each other subset.
- Very high accuracy

[Figure: Rosetta Stone. Labels: hieroglyphic script, ancient Egyptian script, Greek]

The Rosetta Stone
- The Rosetta Stone (Chinese: 罗塞塔石碑 or 罗塞达碑) is a stone stele 1.14 m high and 0.73 m wide, made in 196 BC and inscribed with a decree of the Egyptian king Ptolemy V. The same content is carved in Greek script, ancient Egyptian script, and the demotic script of the time. Because the stone carries three versions of the same text, modern archaeologists were able to compare the versions and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. It became an important milestone in the study of ancient Egyptian history.

More parallel corpora
- News:
  - DE-News (German-English)
  - Hong-Kong News, Xinhua News (Chinese-English)
- Government documents:
  - Canadian-Hansards (French-English)
  - Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  - UN Treaties (Russian, English, Arabic, ...)
- Bible (many, many languages)

Examples
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues.
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.

English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
German: Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
- A comparable corpus is a pair of corpora in two different languages which come from the same domain.
- They cover the same topics
- Parallel sentences may also be mined from comparable corpora such as news stories wr
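
Worked sketch: dictionary-based query translation with Chinese word segmentation

The slides on dictionary-based methods and term translation describe replacing each query term with its equivalents from a bilingual term list, and note that Chinese needs segmentation first. The Python sketch below illustrates that idea under stated assumptions. The tiny term list `ZH_EN`, the function names, and the choice of greedy maximum matching as the segmenter are all hypothetical and are not taken from the slides. The Chinese source terms 石油 and 勘探 are also assumed, since the slides show only the English candidates.

```python
from typing import Dict, List

# Toy bilingual term list. All entries are hypothetical, for illustration only.
ZH_EN: Dict[str, List[str]] = {
    "新西兰": ["New Zealand"],
    "西兰花": ["broccoli"],
    "新": ["new", "fresh"],
    "花": ["flower"],
    "石油": ["oil", "petroleum"],   # assumed source term for "oil / petroleum"
    "勘探": ["probe", "survey"],    # assumed source term for "probe / survey"
}

# Longest dictionary entry, used to bound the matching window.
MAX_LEN = max(len(term) for term in ZH_EN)


def forward_max_match(text: str) -> List[str]:
    """Segment left to right, taking the longest dictionary term at each position.

    Characters not covered by any dictionary term become single-character tokens.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in ZH_EN or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text: str) -> List[str]:
    """Segment right to left, taking the longest dictionary term ending at each position."""
    tokens: List[str] = []
    j = len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            piece = text[j - size:j]
            if piece in ZH_EN or size == 1:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens


def translate_query(tokens: List[str]) -> List[List[str]]:
    """Replace each term with all of its dictionary equivalents.

    Terms missing from the dictionary are passed through unchanged, as in
    naive cognate matching. No disambiguation is attempted.
    """
    return [ZH_EN.get(token, [token]) for token in tokens]


if __name__ == "__main__":
    query = "新西兰花"
    for segment in (forward_max_match, backward_max_match):
        tokens = segment(query)
        print(segment.__name__, tokens, translate_query(tokens))
```

On this toy dictionary the two segmenters disagree on 新西兰花: the forward pass yields 新西兰 + 花 and the backward pass yields 新 + 西兰花, which mirrors the "New Zealand flowers" vs. "fresh broccoli" ambiguity in the slides. The sketch keeps every candidate translation for each term, so choosing among them (term disambiguation) is left open.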