Nanjing University, Department of Computer Science and Technology. Lecturers: Huang Yihua, Yang Xiaoliang. Spring semester 2011. MapReduce: Massive Data Parallel Processing (MapReduce海量数据并行处理.ppt)
A Full Text Search Engine For BBS Lily
Nanjing University, Department of Computer Science and Technology
Presenters: Huang Yihua, Gu Rong
Spring semester, 2011

Acknowledgement: this course is supported by a curriculum grant from the China University Relations department of Google (Beijing).

Contents
1. Background
2. Brief Intro to the Principle of Full Text Search Engines
3. Implementation of the FTSE for BBS Lily
4. Maybe Google & Baidu Have Done These
5. Conclusion

1. Background
1.1 What is a full text search engine?
1.2 Why do we need it?

What is a full text search engine?
"In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user." (from Wikipedia)

Why do we need an FTSE for BBS Lily?
Data in BBS Lily:
- Base capacity: around 3 million posts in total;
- Increasing speed: over a thousand new posts every day;
- Post granularity: each post is 1 KB to 4 KB.

2. Brief Intro to the Principle of a Full Text Search Engine
What happens after you press Enter?

Abstract IR Architecture
[Figure: offline side - document acquisition (e.g., web crawling) feeds documents into a representation function, which produces document representations stored in an index; online side - the query goes through a representation function to a query representation, and a comparison function matches it against the index to produce the hits.]

About the Representation Function
[Figure: documents are reduced to a bag of words (case folding, tokenization, stopword removal, stemming) and then turned into an inverted index; deeper features such as syntax, semantics and word knowledge are left out.]

A Simple Inverted Index Demo
Documents:
  Doc 1: one fish, two fish
  Doc 2: red fish, blue hat
  Doc 3: cat in the hat
  Doc 4: green eggs and ham
Resulting inverted index (term -> list of (doc, term frequency)):
  blue  -> (2, 1)
  cat   -> (3, 1)
  egg   -> (4, 1)
  fish  -> (1, 2), (2, 1)
  green -> (4, 1)
  ham   -> (4, 1)
  hat   -> (2, 1), (3, 1)
  one   -> (1, 1)
  red   -> (2, 1)
  two   -> (1, 1)

Map/Reduce's Role
Indexing problem - characteristics:
  1. scalability;
  2. relatively fast, but not real time;
  3. a batch operation;
  4. updates may not be important;
  5. crawling is a challenge in itself.
Retrieval problem - characteristics:
  1. must have sub-second response time;
  2. for the web, only relatively few results are needed.
Suitable? Map/Reduce fits the indexing problem; the retrieval problem, with its sub-second latency requirement, is served online.

3. Implementation of the FTSE for BBS Lily
3.1 Outline of the Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Process
3.4 Set up the Web Retrieval Interface
3.5 Optimization

3.1 Outline of the Work Flow
[Figure: overall work flow - Crawler: web page 0..n -> crawl & info mining -> formatted files (content plus auxiliary info); Map/Reduce: formatted files -> inverted index & ranking -> index for indices; Web retrieval: query string -> JSP page -> split into term 0..n -> search & merge -> target DIDs -> result list (title, context, author, URL, hot).]

3.2 Crawl Web Pages & Mine Info
3.2.1 Target
3.2.2 Framework of BBS Lily
3.2.3 Strategy of the Crawler
3.2.4 Strategy of the Miner
3.2.5 Common issues

Target of the Crawler & Miner
A. Crawler: crawl every post from BBS Lily, continuously, with fault tolerance.
B. Miner: mine the wanted info from each post the crawler has fetched, and store it in a designed pattern.

Framework of BBS Lily
[Figure: BBS Lily is divided into sections (section 0 .. section 12); each section contains boards (board 0 .. board n); each board contains posts (post 0 .. post n).]

Strategy of the Crawler: DFS
Tips:
- Traverse the catalog links to reach the content pages;
- Automatically follow the "Next Page" link and repeat the routine job.
(A minimal crawler sketch is given below.)
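The slides name only the strategy (a depth-first traversal of the catalog links, automatically following each "Next Page" link), not the code. Purely as an illustration, a minimal sketch of such a traversal is shown below; the entry URL, the link-extraction regex, the page encoding and the fetch helper are assumptions made for the example, not the project's actual crawler.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LilyCrawlerSketch {
    // Hypothetical entry point and link pattern: the real catalog/post URL formats
    // of BBS Lily are not given on the slides.
    private static final String ROOT = "http://bbs.example.edu.cn/sections";
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Deque<String> stack = new ArrayDeque<>();   // DFS: last-in, first-out
        Set<String> visited = new HashSet<>();      // do not crawl the same URL twice
        stack.push(ROOT);
        while (!stack.isEmpty()) {
            String url = stack.pop();
            if (!visited.add(url)) continue;
            String html = fetch(url);
            if (html == null) continue;             // fault tolerance: skip pages that fail to load
            // Push every link found on the page; in the real crawler the catalog and
            // "Next Page" links are followed while post pages are handed to the miner
            // (resolving relative links is omitted here).
            Matcher m = LINK.matcher(html);
            while (m.find()) stack.push(m.group(1));
        }
    }

    private static String fetch(String url) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "GBK"))) {  // encoding is a guess
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
            return sb.toString();
        } catch (Exception e) {                     // network problems, connection time-outs
            return null;
        }
    }
}
```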
Strategy of the Miner: Regex

[Demo: a raw post page as the crawler fetches it (translated excerpt) - Nanjing University Lily BBS, article view, board D_Computer. From: MSer (Microsoft campus ambassador), hits: 205. Title: (forwarded) Microsoft's 2011 spring/summer intern recruitment starts next Monday! Posted: Fri Mar 18 00:28:59 2011. Body: the recruitment opens nationwide next Monday and the JoinMS site relaunches the same day; nearly 200 positions, located in Beijing and Shanghai, covering basic research, software development, sales/marketing/services and technical support; see Microsoft's campus recruiting site for the detailed positions and skill requirements. Footer: source IP 180.109.95.252, navigation links, CopyRight (C) 1997-2011, NJU Lily BBS.]

- Use HtmlParser to get the tags' content;
- Extract the wanted info by regex;
- Store it in a designed pattern: each post is stored in one line, in the pattern below (\007, the ASCII BEL character, is the field separator):
    URL \007 hot \007 author \007 title \007 content
(See the demo.)

Common issues
Fault tolerance: network problems, connection time-outs.

3.3 Indexing Process
3.3.1 Target
3.3.2 Filter the Source File
3.3.3 Build the Inverted Index
3.3.4 Partition the Inverted Index File
3.3.5 Second-Level Index (Index for Indices)

Target of the Indexing Process
Run a series of Map/Reduce operations to generate inverted indices with rank and position info.

Indexing Process
[Figure: indexing pipeline - Txt_Filter -> inverted index -> partition -> index table / index for indices.]

Filter the Source File (1)
Although the source file stores posts in a well-designed pattern, we still need to filter it before we do the inverted-index job.
Reasons:
1. Examine and eliminate noise and duplicates:
   - noise such as "http://... \007 null \007 null \007 null \007 null";
   - duplicates, i.e. posts that were crawled more than once.
2. It is natural to pre-process the data before we really handle it.

Filter the Source File (2) - Pseudo Code (1)

    public class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable keyin, Text valin, Context context)
                throws IOException, InterruptedException {
            // If the input line has a legal structure, emit it with its URL as key and itself as value
            if (IsLegal(valin.toString())) {
                Text keyout = new Text(GetURL(valin.toString()));
                context.write(keyout, valin);
            }
        }

        public boolean IsLegal(String val) {
            // check whether the input line's structure is legal;
            // if legal return true, else return false
        }

        public String GetURL(String val) {
            // return the URL part of the input line:
            // split the input line by '\007' and return the first part
        }
    }

Filter the Source File (2) - Pseudo Code (2)

    public static class FilterReducer extends Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text keyin, Iterable<Text> valsin, Context context)
                throws IOException, InterruptedException {
            // A flag denotes whether the post with this URL has already been emitted,
            // so only the first copy of each URL survives
            boolean flag = false;
            for (Text val : valsin) {
                if (flag) break;
                context.write(NullWritable.get(), val);
                flag = true;
            }
        }
    }

Build the Inverted Index
The process of building the inverted index is simple; it becomes smarter if we calculate and record the posting info at the same time. The posting info includes the rank, the positions, etc.

Posting Info
1. TF-IDF (term frequency - inverse document frequency): the rank of a term t_i combines how often it occurs in the post with how rare it is in the corpus, roughly
       tf-idf(t_i, d) = tf(t_i, d) * log( |D| / |{d : t_i in d}| )
   where |D| is the total number of documents in the corpus and |{d : t_i in d}| is the number of documents in which the term t_i appears.
2. The position info does not need any calculation; each occurrence can be recorded as an integer pair like (StartIndex, EndIndex).

Build the Inverted Index - structure
1. For each post in the filtered source file, its offset in the file is used as its DID.
2. Each line of the inverted index file stores a term together with its info; the details are as below:
       term  info
       info          = SingleDIDInfo;SingleDIDInfo;SingleDIDInfo...
       SingleDIDInfo = DID:rank:positions
       positions     = position%position%position...
       position      = IsTitle|start|end
   E.g.  黑莓  48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562

We need to customize a new Writable type to represent a term's info (only the fields are shown on the slide; a fuller sketch follows):

    public class SingleRecordWritable implements WritableComparable {
        private LongWritable DID = new LongWritable();        // DID
        private FloatWritable rank = new FloatWritable();     // rank
        private Text positions = new Text();                  // positions
        ...
    }
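The slide lists only the three fields of SingleRecordWritable. The sketch below fills in what a working Hadoop WritableComparable would additionally need (write, readFields, compareTo); those method bodies and the toString format are assumptions inferred from the posting pattern above, not the original code.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// One posting ("SingleDIDInfo") of the inverted index, as a custom Hadoop writable.
public class SingleRecordWritable implements WritableComparable<SingleRecordWritable> {
    private LongWritable DID = new LongWritable();      // document id (offset in the filtered source file)
    private FloatWritable rank = new FloatWritable();   // rank of the term in this document
    private Text positions = new Text();                // "IsTitle|start|end%..." position list

    @Override
    public void write(DataOutput out) throws IOException {
        DID.write(out);
        rank.write(out);
        positions.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        DID.readFields(in);
        rank.readFields(in);
        positions.readFields(in);
    }

    @Override
    public int compareTo(SingleRecordWritable other) {
        return DID.compareTo(other.DID);                // order postings by document id
    }

    @Override
    public String toString() {
        // matches the "DID:rank:positions" pattern used in the inverted index file
        return DID.get() + ":" + rank.get() + ":" + positions.toString();
    }
}
```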
Build the Inverted Index - Pseudo Code (1)

    public class IndexMapper extends Mapper<LongWritable, Text, Text, SingleRecordWritable> {
        public HashMap<String, SingleRecordWritable> TokenMap;   // temporary per-post term info

        public void map(LongWritable offset, Text val, Context context)
                throws IOException, InterruptedException {
            String[] str = val.toString().split("\007", 5);  // split the post to get the title and the content
            HandleToken(str[3], true);                        // collect terms from the title
            HandleToken(str[4], false);                       // collect terms from the content; the terms go into TokenMap
            this.EmitMapValue(offset, context);               // after the whole post is scanned, emit the key/value pairs
        }

        // Handle one input field: segment it into words, then handle the words one by one.
        // (Pseudocode: "line" stands for a word-segmentation iterator over the field.)
        public void HandleToken(line, boolean IsTitle) {
            while (line.hasNext()) {
                if (TokenMap.contains(line.GetTerm()))
                    TokenMap.Update(line.GetTerm());          // already seen: update the term's info (frequency, positions)
                else
                    TokenMap.Put(line.GetTerm());             // first occurrence: put the term into TokenMap
            }
        }

        public void EmitMapValue(LongWritable key, Context context)
                throws IOException, InterruptedException {
            for (Map.Entry<String, SingleRecordWritable> entry : TokenMap.entrySet())
                context.write(new Text(entry.getKey()), entry.getValue());  // emit the key/value pairs one by one
        }
    }

Build the Inverted Index - Pseudo Code (2)

    public class LineIndexerReducer extends Reducer<Text, SingleRecordWritable, Text, Text> {
        // The reduce input key is a term; the input values are the infos that share
        // that term, one per document.
        public void reduce(Text key, Iterable<SingleRecordWritable> values, Context context)
                throws IOException, InterruptedException {
            // The postings are buffered first (in real Hadoop code each value must be copied,
            // since the framework reuses the object), because the number of documents that
            // contain the term is needed before the final ranks can be written out.
            List<SingleRecordWritable> docs = new ArrayList<SingleRecordWritable>();
            for (SingleRecordWritable val : values) docs.add(val);
            StringBuilder info = new StringBuilder();
            for (SingleRecordWritable val : docs)
                info.append(ResetFinalRank(val, docs.size())).append(";");
            context.write(key, new Text(info.toString()));
        }

        // The number of documents in which the term appears is taken into account.
        public String ResetFinalRank(SingleRecordWritable val, long TotalNum) {
            val.rank = val.rank / TotalNum;
            return val.toString();
        }
    }

Partition the Inverted Index File
After the last step we have the inverted index file. However, the file is big:

    Source file size    Inverted index file size
    48 M                72.5 M
    182 M               240 M
    703 M               828 M

Partition the Inverted Index File - Pseudo Code

    // The algorithm is simple: we just need to write a custom Partitioner class
    // that deals the term/info pairs out to the reducers in round-robin fashion.
    public static class SplitFilePartitioner extends Partitioner<Text, Text> {
        public static int count = 0;

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int partition = this.count % numPartitions;
            this.count++;
            return partition;
        }
    }

Second-Level Index (Index for Indices)
In the last step we partitioned the inverted index file into a certain number of parts, for example 16; each part-file contains some term/info pairs. So, when a term is given, how can we know which part-file it is in, and which line it is on? We need an index for the indices.
P.S. This really works: the second-level index files take up less than 10% of the size of the source file.

    Source file size    Inverted index file size    Second-level index file size
    48.1 M              72.5 M                      2.375 M
    182 M               240 M                       5.17 M
    703 M               828 M                       10.5 M

3.4 Set up the Web Retrieval Interface
3.4.1 Target
3.4.2 Sort Pages
3.4.3 Technics

Target of the Web Retrieval Interface
Make an interface which accepts the user's query and responds with search results.
1. Restrict the query string;
2. Sort the search results dynamically;
3. Return the results page by page.

Sort Pages
[Figure: example - the query string is word-segmented into Term0, Term1 and Term2; each term's posting list is looked up (Term0: Doc1 10, Doc3 90, Doc7 20; Term1: Doc2 20, Doc7 80, Doc5 15; Term2: Doc3 05, Doc2 40, Doc6 40); the lists are merged, the ranks belonging to the same document are summed, and the documents are ranked again: Doc7 100, Doc3 95, Doc6 40, Doc2 40, Doc5 15, Doc1 10.]
Merge the ranks and rank again.

Technics
Some technics used in the Web Interface:
- Javascript + HTML: work in the client - restrict the input, submit the form, split the result pages.
- Tomcat + Servlet: work on the server - listen for requests, handle requests, feed back the response.

3.5 Optimization
Optimization measures in different areas:
- Reduce sort time: a) for each term, at most the top 1500 DIDs are reserved; b) use a TreeMap to sort.
- Reduce I/O operations: a) the response page is created dynamically; b) each response returns 10 records.
- Cache strategy: a) put some hot inverted-index files in memory; b) cache replacement policy: LRU (a minimal sketch follows).
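The slides name the cache policy (keep hot inverted-index part-files in memory, evict with LRU) but show no code. A minimal sketch of such a cache, built on Java's LinkedHashMap in access-order mode, might look like the following; the key/value types (part-file name mapped to its loaded contents) and the capacity are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for hot inverted-index part-files: the key is the part-file
// name, the value is that file's contents loaded into memory.
public class IndexFileCache extends LinkedHashMap<String, String> {
    private final int capacity;

    public IndexFileCache(int capacity) {
        // accessOrder = true: get() moves an entry to the most-recently-used end
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        // evict the least-recently-used part-file once the cache holds more than 'capacity' entries
        return size() > capacity;
    }
}
```

On a query, get(partFileName) either returns the cached contents or misses; on a miss the part-file is read from storage and put() into the cache, and the least-recently-used entry is evicted automatically once the capacity is exceeded.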
4. Maybe Google & Baidu Have Done These
1. Search the index in parallel;
2. An outstanding word-segmentation algorithm;
3. A better rank strategy, to describe the relationship between a token and a DID precisely;
4. Record each user's query string:
   a) feed it back to the word segmentation;
   b) provide a suggestion ("remind") function, driven by the input-change event.

5. Conclusion
5.1 Summary
5.2 Highlights
5.3 About Map/Reduce

Summary
The related work has three parts, as below:
- Crawler: a hard-coded crawler and miner, aimed at getting the data of BBS Lily;
- Indexing process: the indexing process runs as a sequence of MapReduce operations;
- Web retrieval: a web interface set up for the user to retrieve info.

Highlights
View of the application: this stuff is cool - it provides a friendly user experience when we want to search for something on our BBS Lily.
View of the technics: using Map/Reduce to process the data offline has provided several benefits, such as:
1. The indexing code is simpler, smaller, and easier to understand;
2. We can keep conceptually unrelated computations separate;
3. The indexing process has become much easier to operate and maintain.

About Map/Reduce
Map/Reduce is not just a programming model; actually, it is also a life model.

Many thanks to Teacher Huang, Yang Xiaoliang, Xiao Tao, Liu Yulong, and Zhang Lu.

Demonstration of the system.