(8.2.1)--8.2BasicsofNLP.pdf
Bag of Words
A model that lets us count all the words in a piece of text by creating an occurrence matrix for the sentence or document.

Example sentences:
1. Jim and Pam traveled by bus.
2. The train was late.
3. The flight was full. Traveling by flight is expensive.

[Figures: Basic structure for a bag of words; Words with frequencies; Combination of words; Bag of words]

TF-IDF
TF: Term Frequency. If a particular word appears multiple times in a document, it may be more important than other words that appear fewer times.
IDF: Inverse Document Frequency. If a particular word appears many times in a document but also appears many times in other documents, it is probably just a common word, so we cannot assign much importance to it.

Example sentences:
1. This is the first document.
2. This document is the second document.

[Figures: Resulting multiplication of TF-IDF; TF-IDF using a log]

Tokenization
Tokenization is the process of segmenting running text into sentences and words. In essence, it is the task of cutting a text into pieces called tokens while throwing away certain characters, such as punctuation.

Example sentences:
1. This is the first document.
2. This document is the second document.

Tokens:
1. This, is, the, first, document
2. This, document, is, the, second, document

Stop Words Removal
Some very common words that appear to provide little or no value to the NLP objective are filtered out and excluded from the text to be processed. This removes widespread, frequent terms that are not informative about the corresponding text.
Stop words can be safely ignored by carrying out a lookup in a pre-defined list of keywords, which frees up database space and improves processing time.

[Figure: Stop words list]

Stemming
Stemming is used to normalize words. In English and many other languages, a single word can take multiple forms depending on the context in which it is used.
Example: study, studies, studying, studied

Lemmatization
Lemmatization aims to reduce a word to its base form and to group together different forms of the same word.
Example word: bank

Part of Speech Tagging
Part of speech tagging is crucial for syntactic and semantic analysis.

[Figure labels: Question formation, Verb; Container, Noun]

Chunking
Chunking means extracting meaningful phrases from unstructured text.

[Figures: Categories of phrases; Phrase structure rules]
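
The occurrence matrix described under Bag of Words can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the method shown on the slides: the helper names `tokenize` and `bag_of_words`, the lowercasing step, and the alphabetical column order are all assumptions made for the example.

```python
import re
from collections import Counter

# The three example documents from the Bag of Words slide.
documents = [
    "Jim and Pam traveled by bus.",
    "The train was late.",
    "The flight was full. Traveling by flight is expensive.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    # Assumption: columns are ordered alphabetically for readability.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(documents)

# Print one row per document: word counts in vocabulary order.
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so the word "flight" gets a count of 2 in the third row and 0 in the others.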
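
The TF and IDF definitions given earlier can be combined into a small worked example on the two TF-IDF sentences. The slides mention a log-based variant but the exact formula did not survive extraction, so the sketch below assumes one common choice: TF is the word count divided by the document length, and IDF is the natural log of (number of documents / number of documents containing the word). The function name `tf_idf` is hypothetical.

```python
import math
import re
from collections import Counter

# The two example documents from the TF-IDF slide.
documents = [
    "This is the first document.",
    "This document is the second document.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return one {word: score} dict per document.

    Assumed formulas (not confirmed by the slides):
      tf(word, doc) = count of word in doc / number of tokens in doc
      idf(word)     = ln(number of docs / number of docs containing word)
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(documents)
for doc_scores in scores:
    print(doc_scores)
```

Under these assumptions, words that occur in both documents ("this", "is", "the", "document") score zero because their IDF is ln(2/2) = 0, while "first" and "second" receive positive scores. Library implementations often add smoothing terms, so their numbers will differ.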
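
Tokenization and stop word removal, as described earlier, can be sketched together as a lookup against a pre-defined list. The stop word set below is a tiny hypothetical list chosen for this example, not the list shown on the slides, and the regular-expression tokenizer is likewise an assumption.

```python
import re

# Hypothetical, deliberately small stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "by", "is", "the", "this", "was"}

def tokenize(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not found in the stop word list."""
    return [token for token in tokens if token not in stop_words]

tokens = tokenize("This document is the second document.")
content_tokens = remove_stop_words(tokens)

print(tokens)          # tokens before filtering
print(content_tokens)  # tokens after stop word removal
```

Using a set for the stop word list keeps each lookup constant-time, which matters when filtering large corpora.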
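
To make the stemming example (study, studies, studying, studied) concrete, here is a toy suffix-stripping stemmer. It is a sketch under stated assumptions, not the Porter algorithm or any library's implementation: the rule table `SUFFIX_RULES` and the minimum stem length of three letters are invented for this illustration, and the rules will mis-stem many real words.

```python
# Toy rules: (suffix to strip, text to append). Checked in order, first match wins.
# These rules are assumptions for illustration, not a standard stemming algorithm.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ied", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def toy_stem(word):
    """Strip one known suffix, provided at least three letters remain."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

words = ["study", "studies", "studying", "studied"]
stems = [toy_stem(word) for word in words]
print(stems)
```

All four forms reduce to the same string here, which is the normalization effect stemming aims for. Lemmatization differs in that it maps words to dictionary base forms, typically with the help of a vocabulary and part-of-speech information rather than suffix rules alone.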