Introduction to Modern Information Retrieval: Homework Review

Homework notes
- Do not plagiarize.
- Two students have not handed in a single assignment; classmates who know them, please pass this on.

Homework 1
- Estimate the time and space complexity of the SPIMI indexing algorithm.
  - Common mistake: confusing space complexity with actual memory usage.
- Design a MapReduce algorithm for counting the occurrences of a "phrase" of n ordered words W1 ... Wn.
  - Work directly with (tid, did, pos) triples; a sketch of the idea is given below.
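To make the (tid, did, pos) hint concrete, here is a minimal C sketch of the reduce step only. It assumes the map step emits key = did and value = (k, pos) for every posting whose term equals the k-th query word, and that the framework groups these pairs by document; the names KV and reduce_phrase_count are invented for this illustration and are not the graded solution.

/* Reduce step of a MapReduce job that counts occurrences of the ordered
 * phrase W1 ... Wn in one document, given all (k, pos) pairs the mappers
 * emitted for that document: count positions p with W1 at p, W2 at p+1, ... */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int k; int pos; } KV;  /* k = query-word index (0-based), pos = position in the document */

static int cmp_pos(const void *a, const void *b) {
    return ((const KV *)a)->pos - ((const KV *)b)->pos;
}

/* Count phrase occurrences in one document from its m (k, pos) pairs. */
int reduce_phrase_count(KV *vals, int m, int n) {
    int count = 0;
    qsort(vals, m, sizeof(KV), cmp_pos);
    for (int i = 0; i < m; i++) {
        if (vals[i].k != 0) continue;            /* potential start: W1 at position p */
        int p = vals[i].pos, matched = 1;
        for (int k = 1; k < n && matched; k++) { /* need W_{k+1} at position p+k */
            matched = 0;
            for (int j = 0; j < m; j++)
                if (vals[j].k == k && vals[j].pos == p + k) { matched = 1; break; }
        }
        if (matched) count++;
    }
    return count;
}

int main(void) {
    /* Document "new york is not new york city", phrase "new york" (n = 2):
     * the mapper emitted (k=0,pos=0), (k=1,pos=1), (k=0,pos=4), (k=1,pos=5). */
    KV vals[] = { {0, 0}, {1, 1}, {0, 4}, {1, 5} };
    printf("phrase count = %d\n", reduce_phrase_count(vals, 4, 2));  /* prints 2 */
    return 0;
}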
Homework 2 (programming)
1. Write two C functions to encode and decode variable-byte integers.
2. Write two C functions to encode and decode gamma-coded integers.
3. (*) Write two C functions to encode and decode ... integers (Exercise 5.9).

Comments:
- Most submissions were done well.
- Some submissions misunderstood the encodings.
- Keep the code concise.
A sketch of variable-byte coding follows.
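For reference, a minimal sketch of variable-byte coding in the IIR style (7 payload bits per byte, high bit of the last byte set as the termination flag); vb_encode and vb_decode are illustrative names, not the reference solution.

/* Variable-byte coding of a single unsigned integer. */
#include <stdio.h>
#include <stdint.h>

/* Encode n into buf (most significant group first); returns number of bytes written. */
int vb_encode(uint32_t n, uint8_t *buf) {
    uint8_t tmp[5];
    int len = 0;
    do {                                  /* collect 7-bit groups, least significant first */
        tmp[len++] = n % 128;
        n /= 128;
    } while (n > 0);
    for (int i = 0; i < len; i++)         /* reverse into output order */
        buf[i] = tmp[len - 1 - i];
    buf[len - 1] |= 0x80;                 /* mark the last byte */
    return len;
}

/* Decode one VB-coded integer starting at buf; *used receives the bytes consumed. */
uint32_t vb_decode(const uint8_t *buf, int *used) {
    uint32_t n = 0;
    int i = 0;
    for (;;) {
        n = n * 128 + (buf[i] & 0x7F);
        if (buf[i++] & 0x80) break;       /* termination bit reached */
    }
    if (used) *used = i;
    return n;
}

int main(void) {
    uint8_t buf[5];
    int used, nbytes = vb_encode(824, buf);   /* IIR example: 824 -> 00000110 10111000 */
    printf("%d bytes, decoded back to %u\n", nbytes, vb_decode(buf, &used));
    return 0;
}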
Homework 3, Question 1
Compute the vector space similarity between the query "digital cameras" and the document "digital cameras and video cameras" by filling out the empty columns in Table 6.1 (p. 132). Assume N = 10,000,000, logarithmic term weighting (wf columns) for query and document, idf weighting for the query only, and cosine normalization for the document only. Treat "and" as a stop word. Enter term counts in the tf columns. Give the final similarity score.

Answer (query "digital cameras", document "digital cameras and video cameras", N = 10,000,000):

word     | query: tf  wf  df       idf = log(N/df)  q = wf*idf | doc: tf  wf = 1+log(tf)  d = norm(wf) | q*d
digital  |        1   1   10,000   log(1,000) = 3   3          |      1   1+log(1) = 1    0.52         | 1.56
video    |        0   0   100,000  log(100)   = 2   0          |      1   1+log(1) = 1    0.52         | 0
cameras  |        1   1   50,000   log(200)   = 2.3 2.3        |      2   1+log(2) = 1.3  0.68         | 1.564

Final similarity score: 1.56 + 0 + 1.564 = 3.124.
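As a quick check of the table, the following sketch recomputes the score with log-base-10 weights (wf*idf for the query, cosine-normalized wf for the document); it is only a sanity check under the stated assumptions, not part of the assignment.

/* Recompute the similarity for "digital cameras" vs. the sample document. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double N = 10000000.0;
    const double q_tf[3] = {1, 0, 1};            /* digital, video, cameras in the query */
    const double d_tf[3] = {1, 1, 2};            /* ... and in the document */
    const double df[3]   = {10000, 100000, 50000};
    double q[3], d[3], norm = 0.0, score = 0.0;

    for (int i = 0; i < 3; i++) {
        double wf_q = q_tf[i] > 0 ? 1 + log10(q_tf[i]) : 0;
        double wf_d = d_tf[i] > 0 ? 1 + log10(d_tf[i]) : 0;
        q[i] = wf_q * log10(N / df[i]);          /* query weight: wf * idf */
        d[i] = wf_d;                             /* document weight: wf, normalized below */
        norm += d[i] * d[i];
    }
    norm = sqrt(norm);
    for (int i = 0; i < 3; i++)
        score += q[i] * (d[i] / norm);
    printf("similarity = %.3f\n", score);        /* about 3.12 */
    return 0;
}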
Homework 3, Question 2
We suggested that the postings for static quality ordering be in decreasing order of g(d). Why do we use the decreasing rather than the increasing order? How to do linear merge of postings with g(d)? Write a C function to present your idea.

Common problems:
- Ignoring the docid.
- Not preserving the ordering by g(d) in the merged result.
A possible merge sketch is given below.
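One possible answer sketch (among many): intersect two postings lists that are both sorted by decreasing g(d), breaking ties by docid, so the merged result keeps the same order and documents with high static quality come out first. Struct and function names here are illustrative.

/* Linear intersection of two postings lists ordered by decreasing g(d). */
#include <stdio.h>

typedef struct { int docid; double g; } Posting;

/* Compare by decreasing g(d), then by increasing docid; < 0 means a comes first. */
static int order(const Posting *a, const Posting *b) {
    if (a->g != b->g) return (a->g > b->g) ? -1 : 1;
    return a->docid - b->docid;
}

/* Intersect a[0..na) and b[0..nb) into out; returns the result length. */
int intersect(const Posting *a, int na, const Posting *b, int nb, Posting *out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i].docid == b[j].docid) {          /* same document: keep it */
            out[k++] = a[i]; i++; j++;
        } else if (order(&a[i], &b[j]) < 0) {    /* advance the list whose head comes first */
            i++;
        } else {
            j++;
        }
    }
    return k;                                    /* output is still ordered by g(d) */
}

int main(void) {
    Posting a[] = { {7, 0.9}, {2, 0.7}, {5, 0.4} };
    Posting b[] = { {7, 0.9}, {5, 0.4}, {9, 0.1} };
    Posting out[3];
    int n = intersect(a, 3, b, 3, out);
    for (int i = 0; i < n; i++)
        printf("doc %d (g = %.1f)\n", out[i].docid, out[i].g);
    return 0;
}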
Homework 4, Question 2
In a collection of 10,000 documents, a query has 8 relevant documents in total. The relevance (R) / nonrelevance (N) of a system's top 20 ranked results is:
RRNNN NNNRN RNNNR NNNNR

A. Precision over the top 20 documents: P = 6/20 = 30%.
B. F1 over the top 20 documents: F1 = 2PR/(P+R), with R = 6/8, so F1 = 0.4286.
C. Interpolated precision at 25% recall: 100%.
D. Interpolated precision at 33% recall: 36.4%.
[Slide figure: interpolated precision-recall curve for this ranking (recall 0.125-0.75 on the x-axis, precision 0-100% on the y-axis).]

E. Assuming the system returns exactly these 20 results, MAP = (1 + 2/2 + 3/9 + 4/11 + 5/15 + 6/20 + 0 + 0)/8 = 0.4163.
F. The largest MAP this system could achieve: the maximum is reached when documents 21 and 22 are both relevant, giving MAP = (1 + 2/2 + 3/9 + 4/11 + 5/15 + 6/20 + 7/21 + 8/22)/8 = 0.5034.
G. The smallest MAP this system could achieve: the minimum is reached when documents 9,999 and 10,000 are the remaining relevant documents, giving MAP = (1 + 2/2 + 3/9 + 4/11 + 5/15 + 6/20 + 7/9999 + 8/10000)/8 = 0.4165.
H. In a series of experiments, only the top 20 documents are judged manually, and the result of (E) is used to approximate the MAP range from (F) to (G). For the example above, how large (in absolute value) is the error introduced by computing MAP via (E) rather than (F) and (G)? |MAP_F - MAP_G| = 0.0869.
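The numbers in (E)-(G) can be sanity-checked by computing average precision directly from the ranks of the relevant documents; the short program below is only such a check, not part of the assignment.

/* Average precision over a ranked list, dividing by the total number of
 * relevant documents (8); relevant documents never retrieved contribute 0. */
#include <stdio.h>

/* ranks[] lists the ranks at which relevant documents appear, in increasing order. */
double average_precision(const int *ranks, int found, int total_relevant) {
    double sum = 0.0;
    for (int i = 0; i < found; i++)
        sum += (double)(i + 1) / ranks[i];       /* precision at the i-th relevant document */
    return sum / total_relevant;
}

int main(void) {
    int top20[] = {1, 2, 9, 11, 15, 20};                      /* (E): only the judged top 20 */
    int best[]  = {1, 2, 9, 11, 15, 20, 21, 22};              /* (F): remaining two right after */
    int worst[] = {1, 2, 9, 11, 15, 20, 9999, 10000};         /* (G): remaining two at the very end */
    printf("E: %.4f\n", average_precision(top20, 6, 8));      /* 0.4163 */
    printf("F: %.4f\n", average_precision(best,  8, 8));      /* 0.5034 */
    printf("G: %.4f\n", average_precision(worst, 8, 8));      /* 0.4165 */
    return 0;
}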
Homework 4, Question 3
Write a C program to highlight the keywords of an input query in the text of an input document, where both the query and document text are input as character strings:
const char *q = "word1 word2 word3";
const char *doc_text = "";
(Requirements: first segment the text into sentences, then select them.)

Grading notes:
- The program must be written in C.
- Segment the text into sentences first.
- Highlight occurrences of the whole query, not of individual query words.
- The program should produce an HTML file.
A possible sketch follows.
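A possible sketch of such a program, assuming a sentence ends at '.', '!' or '?', matching the query exactly and case-sensitively, and writing the selected sentences to highlight.html with the query wrapped in <b> tags; the function names and the sample document are invented for the illustration.

/* Select the sentences containing the whole query and write them as HTML. */
#include <stdio.h>
#include <string.h>

/* Write one sentence to fp, bolding each occurrence of q. Returns 1 if q occurred. */
static int emit_highlighted(FILE *fp, const char *sent, int len, const char *q) {
    int qlen = (int)strlen(q), found = 0, i = 0;
    while (i < len) {
        if (i + qlen <= len && strncmp(sent + i, q, qlen) == 0) {
            fprintf(fp, "<b>%s</b>", q);
            i += qlen;
            found = 1;
        } else {
            fputc(sent[i++], fp);
        }
    }
    return found;
}

void highlight(const char *q, const char *doc_text, const char *html_path) {
    FILE *fp = fopen(html_path, "w");
    if (!fp) return;
    fprintf(fp, "<html><body>\n");
    const char *start = doc_text;
    while (*start) {
        int len = (int)strcspn(start, ".!?");     /* sentence ends at '.', '!' or '?' */
        if (start[len]) len++;                    /* keep the punctuation mark */
        const char *hit = strstr(start, q);
        if (hit && hit < start + len) {           /* sentence contains the whole query */
            fprintf(fp, "<p>");
            emit_highlighted(fp, start, len, q);
            fprintf(fp, "</p>\n");
        }
        start += len;
        while (*start == ' ') start++;            /* skip spaces between sentences */
    }
    fprintf(fp, "</body></html>\n");
    fclose(fp);
}

int main(void) {
    const char *q = "banana slug";
    const char *doc_text = "The banana slug is bright yellow. Slugs are molluscs. "
                           "Santa Cruz chose the banana slug as its mascot.";
    highlight(q, doc_text, "highlight.html");
    return 0;
}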
Homework 5, Question 2
Give three reasons why relevance feedback has been little used in web search.
- Users are unwilling to provide explicit feedback (it lengthens the search interaction).
- Relevance feedback produces long queries, which hurts system efficiency.
- Relevance feedback mainly improves recall, whereas in web search it is precision that improves the user experience.
- It is hard for ordinary users to understand and use.

Homework 5, Question 3
Why is positive feedback likely to be more useful than negative feedback to an IR system?
- The relevant documents returned by positive feedback are more similar to each other and cluster well, so they tend to bring in more relevant documents.
Why might only using one nonrelevant document be more effective than using several?
- In practice the vast majority of documents are nonrelevant; nonrelevant documents do not cluster well, so using several of them tends to cancel out.
Homework 5, Question 4
Omar has implemented a relevance feedback web search system, where he is going to do relevance feedback based only on words in the title text returned for a page (for efficiency). The user is going to rank 3 results. The first user, Jinxing, queries for:
  banana slug
and the top three titles returned are:
  banana slug Ariolimax columbianus
  Santa Cruz mountains banana slug
  Santa Cruz Campus Mascot
Jinxing judges the first two documents relevant, and the third nonrelevant. Assume that Omar's search engine uses term frequency but no length normalization nor IDF. Assume that he is using the Rocchio relevance feedback mechanism, with α = β = γ = 1. Show the final revised query that would be run. (Please list the vector elements in alphabetical order.)

Answer:
Query: banana slug
Documents:
  (R) banana slug Ariolimax columbianus
  (R) Santa Cruz mountains banana slug
  (N) Santa Cruz Campus Mascot

Writing the documents as vectors over the terms in alphabetical order (Ariolimax, banana, Campus, columbianus, Cruz, Mascot, mountains, Santa, slug):
  Q  = (0, 1, 0, 0, 0, 0, 0, 0, 1)
  D1 = (1, 1, 0, 1, 0, 0, 0, 0, 1)
  D2 = (0, 1, 0, 0, 1, 0, 1, 1, 1)
  D3 = (0, 0, 1, 0, 1, 1, 0, 1, 0)

By the Rocchio formula
  Qm = α*Q + β*(1/|Dr|)*Σ dj over relevant Dr - γ*(1/|Dnr|)*Σ dj over nonrelevant Dnr,
with α = β = γ = 1, we get
  Qm = (0.5, 2, -1, 0.5, -0.5, -1, 0.5, -0.5, 2).
Setting negative weights to 0:
  Qm = (0.5, 2, 0, 0.5, 0, 0, 0.5, 0, 2).
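The vector arithmetic above can be double-checked with a few lines of C; this is only a verification of the Rocchio update with α = β = γ = 1, not part of the required answer.

/* Rocchio update: q_m = q + centroid(relevant) - centroid(nonrelevant),
 * with negative weights clipped to 0. Vectors follow the alphabetical term order above. */
#include <stdio.h>

#define T 9   /* Ariolimax banana Campus columbianus Cruz Mascot mountains Santa slug */

int main(void) {
    double q[T]  = {0, 1, 0, 0, 0, 0, 0, 0, 1};
    double d1[T] = {1, 1, 0, 1, 0, 0, 0, 0, 1};   /* relevant */
    double d2[T] = {0, 1, 0, 0, 1, 0, 1, 1, 1};   /* relevant */
    double d3[T] = {0, 0, 1, 0, 1, 1, 0, 1, 0};   /* nonrelevant */
    double qm[T];

    for (int i = 0; i < T; i++) {
        qm[i] = q[i] + (d1[i] + d2[i]) / 2.0 - d3[i];
        if (qm[i] < 0) qm[i] = 0;                 /* negative weights are set to 0 */
    }
    for (int i = 0; i < T; i++)
        printf("%.1f ", qm[i]);                   /* 0.5 2.0 0.0 0.5 0.0 0.0 0.5 0.0 2.0 */
    printf("\n");
    return 0;
}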