Statistical NLP.ppt
Part II. Statistical NLP
Advanced Artificial Intelligence: N-Grams
Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme
Most slides taken from Helmut Schmid, Rada Mihalcea, Bonnie Dorr, Leila Kosseim and others.

Contents
- Short recap of the motivation for statistical NLP
- Probabilistic language models: N-grams
- Predicting the next word in a sentence
- Language guessing
Largely chapter 6 of Statistical NLP, Manning and Schuetze, and chapter 6 of Speech and Language Processing, Jurafsky and Martin.

Human language is highly ambiguous at all levels:
- acoustic level: "recognize speech" vs. "wreck a nice beach"
- morphological level: "saw" = to see (past), saw (noun), to saw (present, infinitive)
- syntactic level: "I saw the man on the hill with a telescope"
- semantic level: "One book has to be read by every student"

Motivation: Statistical Disambiguation
- Define a probability model for the data.
- Compute the probability of each alternative.
- Choose the most likely alternative.
NLP and Statistics
Speech recognisers use a "noisy channel model": the source generates a sentence s with probability P(s), and the channel transforms it into an acoustic signal a with probability P(a|s). The task of the speech recogniser is to find, for a given speech signal a, the most likely sentence s:
s = argmax_s P(s|a) = argmax_s P(a|s) P(s) / P(a) = argmax_s P(a|s) P(s)

Language Models
source --s--> channel --> acoustic signal a
Speech recognisers employ two statistical models:
- a language model P(s)
- an acoustics model P(a|s)
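A minimal Python sketch of this decision rule, assuming two candidate transcriptions; the probability values are invented for illustration, not outputs of a real acoustic or language model:

```python
# Noisy-channel decoding: pick the sentence s that maximises P(a|s) * P(s).
# P(a) is constant over s and can be dropped from the argmax.
candidates = {
    "recognize speech":   {"p_acoustic": 0.30, "p_language": 0.010},
    "wreck a nice beach": {"p_acoustic": 0.32, "p_language": 0.00001},
}

def decode(cands):
    # argmax over candidate sentences of P(a|s) * P(s)
    return max(cands, key=lambda s: cands[s]["p_acoustic"] * cands[s]["p_language"])

print(decode(candidates))  # -> "recognize speech"
```

Even though the acoustic model slightly prefers "wreck a nice beach", the language model tips the decision the other way, which is exactly the point of combining the two models.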
Language Model
Definition: a language model is a model that enables one to compute the probability, or likelihood, of a sentence s, P(s).
Let's look at different ways of computing P(s) in the context of word prediction.

Example of a bad language model

A Good Language Model
Determine reliable sentence probability estimates, e.g.
- P("And nothing but the truth") ≈ 0.001
- P("And nuts sing on the roof") ≈ 0

Shannon game: Word Prediction
Predicting the next word in a sequence:
- "Statistical natural language ..."
- "The cat is thrown out of the ..."
- "The large green ..."
- "Sue swallowed the large green ..."

Claim: a useful part of the knowledge needed for word prediction can be captured using simple statistical techniques. Compute:
- the probability of a sequence
- the likelihood of words co-occurring

Why would we want to do this?
- Rank the likelihood of sequences containing various alternative hypotheses.
- Assess the likelihood of a hypothesis.

Applications
- Spelling correction
- Mobile phone texting
- Speech recognition
- Handwriting recognition
- Disabled users
Spelling errors
- "They are leaving in about fifteen minuets to go to her house."
- "The study was conducted mainly be John Black."
- "Hopefully, all with continue smoothly in my absence."
- "Can they lave him my messages?"
- "I need to notified the bank of ..."
- "He is trying to fine out."

Handwriting recognition
Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen). NLP to the rescue: "gub" is not a word; "gun", "gum", "Gus" and "gull" are words, but "gun" has a higher probability in the context of a bank.

For Spell Checkers
Collect a list of commonly substituted words: piece/peace, whether/weather, their/there, ...
Example: "On Tuesday, the whether ..." -> "On Tuesday, the weather ..."
How to assign probabilities to word sequences?
The probability of a word sequence w_{1,n} is decomposed into a product of conditional probabilities (chain rule):
P(w_{1,n}) = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) ... P(w_n|w_{1,n-1}) = ∏_{i=1..n} P(w_i | w_{1,i-1})

Language Models
In order to simplify the model, we assume that
- each word only depends on the 2 preceding words: P(w_i | w_{1,i-1}) = P(w_i | w_{i-2}, w_{i-1}) (2nd-order Markov model, trigram)
- the probabilities are time-invariant (stationary): P(W_i = c | W_{i-2} = a, W_{i-1} = b) = P(W_k = c | W_{k-2} = a, W_{k-1} = b)
Final formula: P(w_{1,n}) = ∏_{i=1..n} P(w_i | w_{i-2}, w_{i-1})
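A small Python sketch of the final formula, assuming a hypothetical `trigram_prob(w, u, v)` lookup for P(w | u, v); here it just returns a constant placeholder so that only the factorisation itself is shown:

```python
def trigram_prob(w, u, v):
    # Hypothetical P(w | u, v); a real model would estimate this from corpus counts.
    return 0.1  # placeholder value

def sentence_prob(words, start="<s>"):
    # P(w_1..n) = product over i of P(w_i | w_{i-2}, w_{i-1}),
    # padding with two start symbols so every word has two predecessors.
    padded = [start, start] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_prob(padded[i], padded[i - 2], padded[i - 1])
    return prob

print(sentence_prob("I want to eat British food".split()))  # 0.1 ** 6 ≈ 1e-06
```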
Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one: P(w_n | w_{n-N+1} ... w_{n-1})
- unigrams: P(dog)
- bigrams: P(dog | big)
- trigrams: P(dog | the big)
- quadrigrams: P(dog | chasing the big)

A Bigram Grammar Fragment
| bigram     | P   | bigram        | P    |
|------------|-----|---------------|------|
| Eat on     | .16 | Eat Thai      | .03  |
| Eat some   | .06 | Eat breakfast | .03  |
| Eat lunch  | .06 | Eat in        | .02  |
| Eat dinner | .05 | Eat Chinese   | .02  |
| Eat at     | .04 | Eat Mexican   | .02  |
| Eat a      | .04 | Eat tomorrow  | .01  |
| Eat Indian | .04 | Eat dessert   | .007 |
| Eat today  | .03 | Eat British   | .001 |

Additional Grammar
| bigram   | P   | bigram             | P   |
|----------|-----|--------------------|-----|
| <s> I    | .25 | Want some          | .04 |
| <s> I'd  | .06 | Want Thai          | .01 |
| <s> Tell | .04 | To eat             | .26 |
| <s> I'm  | .02 | To have            | .14 |
| I want   | .32 | To spend           | .09 |
| I would  | .29 | To be              | .02 |
| I don't  | .08 | British food       | .60 |
| I have   | .04 | British restaurant | .15 |
| Want to  | .65 | British cuisine    | .01 |
| Want a   | .05 | British lunch      | .01 |

Computing Sentence Probability
P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
= .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
vs. P(I want to eat Chinese food) ≈ .00015
The probabilities seem to capture "syntactic" facts and "world knowledge":
- "eat" is often followed by an NP
- British food is not too popular
N-gram models can be trained by counting and normalization.

Some adjustments
The product of probabilities leads to numerical underflow for long sentences, so instead of multiplying the probabilities we add their logs:
log P(I want to eat British food)
= log P(I|<s>) + log P(want|I) + log P(to|want) + log P(eat|to) + log P(British|eat) + log P(food|British)
= log(.25) + log(.32) + log(.65) + log(.26) + log(.001) + log(.6) = -11.722
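The worked example above can be reproduced directly from the bigram fragment; a minimal sketch, using `<s>` for the sentence-start marker and Python's `math` module (natural logs) for the log computation:

```python
import math

# Bigram probabilities taken from the fragment above.
bigram = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}

words = ["<s>"] + "I want to eat British food".split()
pairs = list(zip(words, words[1:]))

prob = math.prod(bigram[p] for p in pairs)           # product of bigram probabilities
log_prob = sum(math.log(bigram[p]) for p in pairs)   # sum of log probabilities

print(f"P     = {prob:.2e}")      # ~8.1e-06
print(f"log P = {log_prob:.3f}")  # -11.722, matching the slide
```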
Why use only bi- or tri-grams?
The Markov approximation is still costly: with a 20,000-word vocabulary,
- a bigram model needs to store 400 million parameters
- a trigram model needs to store 8 trillion parameters
- using a language model beyond trigrams is impractical
To reduce the number of parameters, we can:
- do stemming (use stems instead of word types)
- group words into semantic classes
- treat words seen once the same as unseen words
- ...

Building N-gram Models
Data preparation:
- decide on a training corpus
- clean and tokenize
- decide how to deal with sentence boundaries (see the sketch below), e.g. for "I eat. I sleep.":
  without markers: (I eat) (eat I) (I sleep)
  with markers: <s> I eat </s> <s> I sleep </s> gives (<s> I) (I eat) (eat </s>) (<s> I) (I sleep) (sleep </s>)
Then use statistical estimators to derive good probability estimates from the training data.
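A short sketch of the two tokenisation choices above, assuming the conventional `<s>`/`</s>` boundary markers:

```python
def bigrams(tokens):
    # consecutive pairs of tokens
    return list(zip(tokens, tokens[1:]))

# Without boundary markers: the spurious bigram (eat, I) crosses a sentence break.
print(bigrams(["I", "eat", "I", "sleep"]))
# [('I', 'eat'), ('eat', 'I'), ('I', 'sleep')]

# With boundary markers: each sentence is padded with <s> ... </s>.
sentences = [["I", "eat"], ["I", "sleep"]]
marked = [["<s>"] + s + ["</s>"] for s in sentences]
print([bigrams(s) for s in marked])
# [[('<s>', 'I'), ('I', 'eat'), ('eat', '</s>')],
#  [('<s>', 'I'), ('I', 'sleep'), ('sleep', '</s>')]]
```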
Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Smoothing
  - Add-one (Laplace)
  - Add-delta (Lidstone's and Jeffreys-Perks laws, ELE)
  - (Validation: held-out estimation, cross-validation)
  - Witten-Bell smoothing
  - Good-Turing smoothing
- Combining estimators
  - Simple linear interpolation
  - General linear interpolation
  - Katz's backoff
Maximum Likelihood Estimation
Choose the parameter values which give the highest probability to the training corpus. Let C(w_1,...,w_n) be the frequency of the n-gram w_1,...,w_n; then
P_MLE(w_n | w_1,...,w_{n-1}) = C(w_1,...,w_n) / C(w_1,...,w_{n-1})

Example: in a training corpus we have 10 instances of "come across": 8 times it is followed by "as", 1 time by "more", and 1 time by "a". With MLE we have:
- P(as | come across) = 0.8
- P(more | come across) = 0.1
- P(a | come across) = 0.1
- P(X | come across) = 0 for any X not in {"as", "more", "a"}
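A minimal sketch of this counting-and-normalizing step for the "come across" counts above, using Python's `collections.Counter`:

```python
from collections import Counter

# Observed continuations of "come across" in the training corpus (10 in total).
continuations = Counter({"as": 8, "more": 1, "a": 1})
total = sum(continuations.values())  # C(come across) = 10

# MLE: P(w | come across) = C(come across w) / C(come across)
p_mle = {w: c / total for w, c in continuations.items()}

print(p_mle)                  # {'as': 0.8, 'more': 0.1, 'a': 0.1}
print(p_mle.get("the", 0.0))  # unseen continuation -> 0.0
```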
Problem with MLE: data sparseness
What if a sequence never appears in the training corpus? Then its MLE probability is 0:
- P(come across the men) = 0
- P(come across some men) = 0
- P(come across 3 men) = 0
MLE assigns a probability of zero to unseen events, so the probability of any n-gram involving an unseen word will be zero!

Problem with MLE: data sparseness (cont.)
Maybe with a larger corpus? No: some words or word combinations are simply unlikely to appear. Recall Zipf's law: f ∝ 1/r. In (Bahl et al., 1983), training with 1.5 million words, 23% of the trigrams from another part of the same corpus were previously unseen. So MLE alone is not a good enough estimator.

Discounting or Smoothing
MLE is usually unsuitable for NLP because of the sparseness of the data. We need to allow for the possibility of seeing events not seen in training, so we must use a discounting or smoothing technique: decrease the probability of previously seen events so that some probability mass is left over for unseen events.
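As a preview of the smoothing techniques listed in the outline, here is a minimal add-one (Laplace) sketch for the same "come across" counts; the vocabulary size of 10,000 is an assumed figure, not taken from the slides:

```python
from collections import Counter

continuations = Counter({"as": 8, "more": 1, "a": 1})
N = sum(continuations.values())  # 10 observed continuations
V = 10_000                       # assumed vocabulary size

def p_laplace(word):
    # Every word, seen or unseen, gets its count incremented by 1.
    return (continuations[word] + 1) / (N + V)

print(p_laplace("as"))   # ~0.0009 (heavily discounted from the MLE value 0.8)
print(p_laplace("the"))  # ~0.0001 (an unseen continuation now gets non-zero mass)
```

Note how aggressively add-one discounts the seen events when the vocabulary is large; this is one reason the outline also lists add-delta, Witten-Bell and Good-Turing smoothing.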