BASIC TECHNIQUES IN STATISTICAL NLP
Word prediction, n-grams, smoothing
September 2003

Statistical Methods in NLE
- Two characteristics of natural language make it desirable to endow programs with the ability to LEARN from examples of past use:
  - VARIETY (no programmer can really take into account all possibilities)
  - AMBIGUITY (need to have ways of choosing between alternatives)
- In a number of NLE applications, statistical methods are very common.
- The simplest application: WORD PREDICTION.

We are good at word prediction
- Stocks plunged this morning, despite a cut in interest ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...

Real Spelling Errors
- They are leaving in about fifteen minuets to go to her house
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than one year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of this problem.
- He is trying to fine out.

Handwriting recognition
- From Woody Allen's Take the Money and Run (1969): Allen (a bank robber) walks up to the teller and hands her a note that reads "I have a gun. Give me all your cash." The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun," Allen says. "Looks like gub to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.

Applications of word prediction
- Spelling checkers
- Mobile phone texting
- Speech recognition
- Handwriting recognition
- Disabled users

Statistics and word prediction
- The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of a spelling error.
- I.e., to compute P(w | W1 ... WN-1) for all words w, and predict as the next word the one for which this (conditional) probability is highest.
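
A minimal sketch of this decision rule (the toy probability table and the function name are illustrative, not from the slides), assuming the conditional probabilities are already available:

```python
# Toy conditional-probability table standing in for real estimates:
# keys are histories (tuples of previous words), values map candidate
# next words to P(word | history).
toy_probs = {
    ("a",): {"rabbit": 0.2, "dog": 0.5, "run": 0.3},
}

def predict_next(history, probs):
    """Return the word w maximizing P(w | history), or None if the history is unseen."""
    dist = probs.get(tuple(history), {})
    return max(dist, key=dist.get) if dist else None

print(predict_next(["a"], toy_probs))  # -> 'dog'
```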

Using corpora to estimate probabilities
- But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.
- The simplest method: the Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.
- "Maximum" because it doesn't waste any probability on events not in the corpus.

Maximum Likelihood Estimation for conditional probabilities
- In order to estimate P(w | W1 ... WN-1), we can use the ratio of counts instead: P(w | W1 ... WN-1) ≈ C(W1 ... WN-1 w) / C(W1 ... WN-1).
- Cf.: P(A|B) = P(A & B) / P(B)
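
A small sketch of this MLE estimate computed directly from a token list; the helper name and the toy counts are illustrative, not from the slides:

```python
# MLE estimate of P(w | history) = C(history followed by w) / C(history).
def mle_conditional(tokens, history, w):
    h, n = list(history), len(history)
    count_h = sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == h)
    count_hw = sum(1 for i in range(len(tokens) - n)
                   if tokens[i:i + n] == h and tokens[i + n] == w)
    return count_hw / count_h if count_h else 0.0

tokens = "the big dog saw the big cat near the big dog".split()
# "the big" occurs 3 times and is followed by "dog" twice -> 2/3
print(mle_conditional(tokens, ["the", "big"], "dog"))
```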

Aside: counting words in corpora
- Keep in mind that it's not always so obvious what a word is (cf. yesterday).
- In text: "He stepped out into the hall, was delighted to encounter a brother." (From the Brown corpus.)
- In speech: "I do uh main- mainly business data processing"
- LEMMAS: cats vs. cat
- TYPES vs. TOKENS
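
A tiny illustration of the TYPES vs. TOKENS distinction on a made-up sentence, using naive whitespace tokenization (exactly the kind of simplification this slide warns about):

```python
# TOKENS are running words in the text; TYPES are distinct word forms.
# Lemmatization would further collapse inflected forms ("cats" -> "cat").
text = "the cat saw the cats and the cat ran"
tokens = text.split()         # naive whitespace tokenization
types = set(tokens)
print(len(tokens), "tokens")  # 9 tokens
print(len(types), "types")    # 6 types: the, cat, saw, cats, and, ran
```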

The problem: sparse data
- In principle, we would like the n of our models to be fairly large, to model long-distance dependencies such as: "Sue SWALLOWED the large green ..."
- However, in practice, sequences of words of length greater than 3 hardly ever occur in our corpora! (See below.)
- (Part of the) Solution: we APPROXIMATE the probability of a word given all previous words.
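
A quick, illustrative check of the sparsity claim on a made-up miniature corpus (the text and the resulting numbers are my own, not the slides'):

```python
from collections import Counter

# For each n, count distinct n-grams and how many of them occur only once.
# Even in this tiny text, nearly all 3- and 4-grams are singletons.
tokens = ("sue swallowed the large green pill and then sue swallowed "
          "the small red pill before dinner").split()

for n in range(1, 5):
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    singletons = sum(1 for count in ngrams.values() if count == 1)
    print(f"n={n}: {len(ngrams)} distinct n-grams, {singletons} seen only once")
```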

The Markov Assumption
- The probability of being in a certain state only depends on the previous state: P(Xn = Sk | X1 ... Xn-1) = P(Xn = Sk | Xn-1)
- This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite.
- (N-gram models / Markov models can be seen as probabilistic finite state automata.)

The Markov assumption for language: n-gram models
- Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous n-1 words (N-GRAM model).

Bigrams and trigrams
- Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
  P(Wn | W1 ... Wn-1) ≈ P(Wn | Wn-2, Wn-1)
  P(W1 ... Wn) ≈ ∏i P(Wi | Wi-2, Wi-1)
- What a bigram model means in practice: instead of P(rabbit | Just the other day I saw a), we use P(rabbit | a).
- Unigram: P(dog); Bigram: P(dog | big); Trigram: P(dog | the, big)
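
A minimal sketch of what the n-gram approximation does to the conditioning context (the sentence is the slide's own example; the helper function is illustrative):

```python
# Under an n-gram model, only the last n-1 words of the history are kept.
def truncate_history(history, n):
    """Return the part of the history an n-gram model actually conditions on."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

history = "Just the other day I saw a".split()
print(truncate_history(history, 2))  # bigram:  ('a',)
print(truncate_history(history, 3))  # trigram: ('saw', 'a')
```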

The chain rule
- So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
  P(W1 ... Wn) = P(W1) P(W2 | W1) P(W3 | W1 W2) ... P(Wn | W1 ... Wn-1)
- E.g., P(the big dog) = P(the) P(big | the) P(dog | the big)
- Then we use the Markov assumption to reduce this to manageable proportions.

Example: the Berkeley Restaurant Project (BERP) corpus
- BERP is a speech-based restaurant consultant.
- The corpus contains user queries; examples include:
  - I'm looking for Cantonese food
  - I'd like to eat dinner someplace nearby
  - Tell me about Chez Panisse
  - I'm looking for a good place to eat breakfast

Computing the probability of a sentence
- Given a corpus like BERP, we can compute the probability of a sentence, as sketched below.
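
Putting the pieces together, a hedged end-to-end sketch: estimate bigram probabilities by relative frequency, then multiply them along the sentence, i.e. the chain rule with the bigram Markov assumption. The three toy queries merely stand in for BERP; the real corpus and its counts are not reproduced here, and the <s> / </s> boundary markers are an assumption.

```python
from collections import Counter

# Three made-up queries standing in for the BERP corpus; <s> and </s> are
# hypothetical sentence-boundary markers.
corpus = [
    "<s> i want to eat chinese food </s>",
    "<s> i want to eat dinner </s>",
    "<s> tell me about chinese food </s>",
]

tokens = [w for sent in corpus for w in sent.split()]
unigram_counts = Counter(tokens)
# For simplicity, bigrams are counted across sentence boundaries too.
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_bigram(w_prev, w):
    """MLE estimate P(w | w_prev) = C(w_prev w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev] if unigram_counts[w_prev] else 0.0

def sentence_probability(sentence):
    """Chain rule with the bigram Markov assumption: multiply P(w_i | w_{i-1})."""
    words = sentence.split()
    prob = 1.0
    for w_prev, w in zip(words, words[1:]):
        prob *= p_bigram(w_prev, w)
    return prob

print(sentence_probability("<s> i want to eat chinese food </s>"))  # 2/3 * 1/2 = 1/3
```

Note that any bigram unseen in the corpus makes the whole product zero; this is the sparse-data problem that the smoothing techniques announced in the title of this deck are meant to address.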