Statistical NLP.ppt
Part II. Statistical NLP
Advanced Artificial Intelligence: N-Grams
Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme
Most slides taken from Helmut Schmid, Rada Mihalcea, Bonnie Dorr, Leila Kosseim and others.

Contents
- Short recap of the motivation for statistical NLP
- Probabilistic language models: N-grams
- Predicting the next word in a sentence
- Language guessing
Largely chapter 6 of Statistical NLP, Manning and Schuetze, and chapter 6 of Speech and Language Processing, Jurafsky and Martin.

Human language is highly ambiguous at all levels:
- acoustic level: "recognize speech" vs. "wreck a nice beach"
- morphological level: "saw" = to see (past), saw (noun), to saw (present, infinitive)
- syntactic level: "I saw the man on the hill with a telescope"
- semantic level: "One book has to be read by every student"

Motivation: Statistical Disambiguation
- Define a probability model for the data.
- Compute the probability of each alternative.
- Choose the most likely alternative.
NLP and Statistics
Speech recognisers use a "noisy channel model": the source generates a sentence s with probability P(s), and the channel transforms it into an acoustic signal a with probability P(a|s). The task of the speech recogniser is to find, for a given speech signal a, the most likely sentence s:
s = argmax_s P(s|a) = argmax_s P(a|s) P(s) / P(a) = argmax_s P(a|s) P(s)

Language Models
source --s--> channel --> acoustic signal a
Speech recognisers employ two statistical models:
- a language model P(s)
- an acoustics model P(a|s)
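A minimal Python sketch of this decision rule, assuming two candidate transcriptions; the probability values are invented for illustration, not outputs of a real acoustic or language model:

```python
# Noisy-channel decoding: pick the sentence s that maximises P(a|s) * P(s).
# P(a) is constant over s and can be dropped from the argmax.
candidates = {
    "recognize speech":   {"p_acoustic": 0.30, "p_language": 0.010},
    "wreck a nice beach": {"p_acoustic": 0.32, "p_language": 0.00001},
}

def decode(cands):
    # argmax over candidate sentences of P(a|s) * P(s)
    return max(cands, key=lambda s: cands[s]["p_acoustic"] * cands[s]["p_language"])

print(decode(candidates))  # -> "recognize speech"
```

Even though the acoustic model slightly prefers "wreck a nice beach", the language model tips the decision the other way, which is exactly the point of combining the two models.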
Language Model
Definition: a language model is a model that enables one to compute the probability, or likelihood, of a sentence s, P(s).
Let's look at different ways of computing P(s) in the context of word prediction.

Example of a bad language model

A Good Language Model
Determine reliable sentence probability estimates, e.g.
- P("And nothing but the truth") ≈ 0.001
- P("And nuts sing on the roof") ≈ 0

Shannon game: Word Prediction
Predicting the next word in a sequence:
- "Statistical natural language ..."
- "The cat is thrown out of the ..."
- "The large green ..."
- "Sue swallowed the large green ..."

Claim: a useful part of the knowledge needed for word prediction can be captured using simple statistical techniques. Compute:
- the probability of a sequence
- the likelihood of words co-occurring

Why would we want to do this?
- Rank the likelihood of sequences containing various alternative hypotheses.
- Assess the likelihood of a hypothesis.

Applications
- Spelling correction
- Mobile phone texting
- Speech recognition
- Handwriting recognition
- Disabled users
Spelling errors
- "They are leaving in about fifteen minuets to go to her house."
- "The study was conducted mainly be John Black."
- "Hopefully, all with continue smoothly in my absence."
- "Can they lave him my messages?"
- "I need to notified the bank of ..."
- "He is trying to fine out."

Handwriting recognition
Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen). NLP to the rescue: "gub" is not a word; "gun", "gum", "Gus" and "gull" are words, but "gun" has a higher probability in the context of a bank.

For Spell Checkers
Collect a list of commonly substituted words: piece/peace, whether/weather, their/there, ...
Example: "On Tuesday, the whether ..." -> "On Tuesday, the weather ..."
How to assign probabilities to word sequences?
The probability of a word sequence w_{1,n} is decomposed into a product of conditional probabilities (chain rule):
P(w_{1,n}) = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) ... P(w_n|w_{1,n-1}) = ∏_{i=1..n} P(w_i | w_{1,i-1})

Language Models
In order to simplify the model, we assume that
- each word only depends on the 2 preceding words: P(w_i | w_{1,i-1}) = P(w_i | w_{i-2}, w_{i-1}) (2nd-order Markov model, trigram)
- the probabilities are time-invariant (stationary): P(W_i = c | W_{i-2} = a, W_{i-1} = b) = P(W_k = c | W_{k-2} = a, W_{k-1} = b)
Final formula: P(w_{1,n}) = ∏_{i=1..n} P(w_i | w_{i-2}, w_{i-1})
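A small Python sketch of the final formula, assuming a hypothetical `trigram_prob(w, u, v)` lookup for P(w | u, v); here it just returns a constant placeholder so that only the factorisation itself is shown:

```python
def trigram_prob(w, u, v):
    # Hypothetical P(w | u, v); a real model would estimate this from corpus counts.
    return 0.1  # placeholder value

def sentence_prob(words, start="<s>"):
    # P(w_1..n) = product over i of P(w_i | w_{i-2}, w_{i-1}),
    # padding with two start symbols so every word has two predecessors.
    padded = [start, start] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_prob(padded[i], padded[i - 2], padded[i - 1])
    return prob

print(sentence_prob("I want to eat British food".split()))  # 0.1 ** 6 ≈ 1e-06
```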
Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one: P(w_n | w_{n-N+1} ... w_{n-1})
- unigrams: P(dog)
- bigrams: P(dog | big)
- trigrams: P(dog | the big)
- quadrigrams: P(dog | chasing the big)

A Bigram Grammar Fragment
| bigram     | P   | bigram        | P    |
|------------|-----|---------------|------|
| Eat on     | .16 | Eat Thai      | .03  |
| Eat some   | .06 | Eat breakfast | .03  |
| Eat lunch  | .06 | Eat in        | .02  |
| Eat dinner | .05 | Eat Chinese   | .02  |
| Eat at     | .04 | Eat Mexican   | .02  |
| Eat a      | .04 | Eat tomorrow  | .01  |
| Eat Indian | .04 | Eat dessert   | .007 |
| Eat today  | .03 | Eat British   | .001 |

Additional Grammar
| bigram   | P   | bigram             | P   |
|----------|-----|--------------------|-----|
| <s> I    | .25 | Want some          | .04 |
| <s> I'd  | .06 | Want Thai          | .01 |
| <s> Tell | .04 | To eat             | .26 |
| <s> I'm  | .02 | To have            | .14 |
| I want   | .32 | To spend           | .09 |
| I would  | .29 | To be              | .02 |
| I don't  | .08 | British food       | .60 |
| I have   | .04 | British restaurant | .15 |
| Want to  | .65 | British cuisine    | .01 |
| Want a   | .05 | British lunch      | .01 |

Computing Sentence Probability
P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
= .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
vs. P(I want to eat Chinese food) ≈ .00015
The probabilities seem to capture "syntactic" facts and "world knowledge":
- "eat" is often followed by an NP
- British food is not too popular
N-gram models can be trained by counting and normalization.

Some adjustments
The product of probabilities leads to numerical underflow for long sentences, so instead of multiplying the probabilities we add their logs:
log P(I want to eat British food)
= log P(I|<s>) + log P(want|I) + log P(to|want) + log P(eat|to) + log P(British|eat) + log P(food|British)
= log(.25) + log(.32) + log(.65) + log(.26) + log(.001) + log(.6) = -11.722
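The worked example above can be reproduced directly from the bigram fragment; a minimal sketch, using `<s>` for the sentence-start marker and Python's `math` module (natural logs) for the log computation:

```python
import math

# Bigram probabilities taken from the fragment above.
bigram = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}

words = ["<s>"] + "I want to eat British food".split()
pairs = list(zip(words, words[1:]))

prob = math.prod(bigram[p] for p in pairs)           # product of bigram probabilities
log_prob = sum(math.log(bigram[p]) for p in pairs)   # sum of log probabilities

print(f"P     = {prob:.2e}")      # ~8.1e-06
print(f"log P = {log_prob:.3f}")  # -11.722, matching the slide
```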
Why use only bi- or tri-grams?
The Markov approximation is still costly: with a 20,000-word vocabulary,
- a bigram model needs to store 400 million parameters
- a trigram model needs to store 8 trillion parameters
- using a language model beyond trigrams is impractical
To reduce the number of parameters, we can:
- do stemming (use stems instead of word types)
- group words into semantic classes
- treat words seen once the same as unseen words
- ...

Building N-gram Models
Data preparation:
- decide on a training corpus
- clean and tokenize
- decide how to deal with sentence boundaries (see the sketch below), e.g. for "I eat. I sleep.":
  without markers: (I eat) (eat I) (I sleep)
  with markers: <s> I eat </s> <s> I sleep </s> gives (<s> I) (I eat) (eat </s>) (<s> I) (I sleep) (sleep </s>)
Then use statistical estimators to derive good probability estimates from the training data.
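A short sketch of the two tokenisation choices above, assuming the conventional `<s>`/`</s>` boundary markers:

```python
def bigrams(tokens):
    # consecutive pairs of tokens
    return list(zip(tokens, tokens[1:]))

# Without boundary markers: the spurious bigram (eat, I) crosses a sentence break.
print(bigrams(["I", "eat", "I", "sleep"]))
# [('I', 'eat'), ('eat', 'I'), ('I', 'sleep')]

# With boundary markers: each sentence is padded with <s> ... </s>.
sentences = [["I", "eat"], ["I", "sleep"]]
marked = [["<s>"] + s + ["</s>"] for s in sentences]
print([bigrams(s) for s in marked])
# [[('<s>', 'I'), ('I', 'eat'), ('eat', '</s>')],
#  [('<s>', 'I'), ('I', 'sleep'), ('sleep', '</s>')]]
```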
Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Smoothing
  - Add-one (Laplace)
  - Add-delta (Lidstone's and Jeffreys-Perks laws, ELE)
  - (Validation: held-out estimation, cross-validation)
  - Witten-Bell smoothing
  - Good-Turing smoothing
- Combining estimators
  - Simple linear interpolation
  - General linear interpolation
  - Katz's backoff
Maximum Likelihood Estimation
Choose the parameter values which give the highest probability to the training corpus. Let C(w_1,...,w_n) be the frequency of the n-gram w_1,...,w_n; then
P_MLE(w_n | w_1,...,w_{n-1}) = C(w_1,...,w_n) / C(w_1,...,w_{n-1})

Example: in a training corpus we have 10 instances of "come across": 8 times it is followed by "as", 1 time by "more", and 1 time by "a". With MLE we have:
- P(as | come across) = 0.8
- P(more | come across) = 0.1
- P(a | come across) = 0.1
- P(X | come across) = 0 for any X not in {"as", "more", "a"}
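A minimal sketch of this counting-and-normalizing step for the "come across" counts above, using Python's `collections.Counter`:

```python
from collections import Counter

# Observed continuations of "come across" in the training corpus (10 in total).
continuations = Counter({"as": 8, "more": 1, "a": 1})
total = sum(continuations.values())  # C(come across) = 10

# MLE: P(w | come across) = C(come across w) / C(come across)
p_mle = {w: c / total for w, c in continuations.items()}

print(p_mle)                  # {'as': 0.8, 'more': 0.1, 'a': 0.1}
print(p_mle.get("the", 0.0))  # unseen continuation -> 0.0
```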
Problem with MLE: data sparseness
What if a sequence never appears in the training corpus? Then its MLE probability is 0:
- P(come across the men) = 0
- P(come across some men) = 0
- P(come across 3 men) = 0
MLE assigns a probability of zero to unseen events, so the probability of any n-gram involving an unseen word will be zero!

Problem with MLE: data sparseness (cont.)
Maybe with a larger corpus? No: some words or word combinations are simply unlikely to appear. Recall Zipf's law: f ∝ 1/r. In (Bahl et al., 1983), training with 1.5 million words, 23% of the trigrams from another part of the same corpus were previously unseen. So MLE alone is not a good enough estimator.

Discounting or Smoothing
MLE is usually unsuitable for NLP because of the sparseness of the data. We need to allow for the possibility of seeing events not seen in training, so we must use a discounting or smoothing technique: decrease the probability of previously seen events so that some probability mass is left over for unseen events.
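As a preview of the smoothing techniques listed in the outline, here is a minimal add-one (Laplace) sketch for the same "come across" counts; the vocabulary size of 10,000 is an assumed figure, not taken from the slides:

```python
from collections import Counter

continuations = Counter({"as": 8, "more": 1, "a": 1})
N = sum(continuations.values())  # 10 observed continuations
V = 10_000                       # assumed vocabulary size

def p_laplace(word):
    # Every word, seen or unseen, gets its count incremented by 1.
    return (continuations[word] + 1) / (N + V)

print(p_laplace("as"))   # ~0.0009 (heavily discounted from the MLE value 0.8)
print(p_laplace("the"))  # ~0.0001 (an unseen continuation now gets non-zero mass)
```

Note how aggressively add-one discounts the seen events when the vocabulary is large; this is one reason the outline also lists add-delta, Witten-Bell and Good-Turing smoothing.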