Nanjing University, Department of Computer Science and Technology. Lecturers: Huang Yihua, Yang Xiaoliang. Spring semester 2011. MapReduce: Massive Data Parallel Processing (MapReduce海量数据并行处理.ppt)
A Full Text Search Engine For BBS Lily
Nanjing University, Department of Computer Science and Technology
Presenters: Huang Yihua, Gu Rong
Spring semester, 2011

Acknowledgement: this course is supported by a curriculum grant from the China University Relations department of Google (Beijing).

Contents
1. Background
2. Brief Intro to the Principle of Full Text Search Engines
3. Implementation of the FTSE for BBS Lily
4. Maybe Google & Baidu Have Done These
5. Conclusion

1. Background
1.1 What is a full text search engine?
1.2 Why do we need it?

What is a full text search engine?
"In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user." (from Wikipedia)

Why do we need an FTSE for BBS Lily?
Data in BBS Lily:
- Base capacity: around 3 million posts in total;
- Increasing speed: over a thousand new posts every day;
- Post granularity: each post is 1 KB to 4 KB.

2. Brief Intro to the Principle of a Full Text Search Engine
What happens after you press Enter?

Abstract IR Architecture
[Figure: offline side - document acquisition (e.g., web crawling) feeds documents into a representation function, which produces document representations stored in an index; online side - the query goes through a representation function to a query representation, and a comparison function matches it against the index to produce the hits.]

About the Representation Function
[Figure: documents are reduced to a bag of words (case folding, tokenization, stopword removal, stemming) and then turned into an inverted index; deeper features such as syntax, semantics and word knowledge are left out.]

A Simple Inverted Index Demo
Documents:
  Doc 1: one fish, two fish
  Doc 2: red fish, blue hat
  Doc 3: cat in the hat
  Doc 4: green eggs and ham
Resulting inverted index (term -> list of (doc, term frequency)):
  blue  -> (2, 1)
  cat   -> (3, 1)
  egg   -> (4, 1)
  fish  -> (1, 2), (2, 1)
  green -> (4, 1)
  ham   -> (4, 1)
  hat   -> (2, 1), (3, 1)
  one   -> (1, 1)
  red   -> (2, 1)
  two   -> (1, 1)

Map/Reduce's Role
Indexing problem - characteristics:
  1. scalability;
  2. relatively fast, but not real time;
  3. a batch operation;
  4. updates may not be important;
  5. crawling is a challenge in itself.
Retrieval problem - characteristics:
  1. must have sub-second response time;
  2. for the web, only relatively few results are needed.
Suitable? Map/Reduce fits the indexing problem; the retrieval problem, with its sub-second latency requirement, is served online.

3. Implementation of the FTSE for BBS Lily
3.1 Outline of the Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Process
3.4 Set up the Web Retrieval Interface
3.5 Optimization

3.1 Outline of the Work Flow
[Figure: overall work flow - Crawler: web page 0..n -> crawl & info mining -> formatted files (content plus auxiliary info); Map/Reduce: formatted files -> inverted index & ranking -> index for indices; Web retrieval: query string -> JSP page -> split into term 0..n -> search & merge -> target DIDs -> result list (title, context, author, URL, hot).]

3.2 Crawl Web Pages & Mine Info
3.2.1 Target
3.2.2 Framework of BBS Lily
3.2.3 Strategy of the Crawler
3.2.4 Strategy of the Miner
3.2.5 Common issues

Target of the Crawler & Miner
A. Crawler: crawl every post from BBS Lily, continuously, with fault tolerance.
B. Miner: mine the wanted info from each post the crawler has fetched, and store it in a designed pattern.

Framework of BBS Lily
[Figure: BBS Lily is divided into sections (section 0 .. section 12); each section contains boards (board 0 .. board n); each board contains posts (post 0 .. post n).]

Strategy of the Crawler: DFS
Tips:
- Traverse the catalog links to reach the content pages;
- Automatically follow the "Next Page" link and repeat the routine job.
(A minimal crawler sketch is given below.)
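The slides name only the strategy (a depth-first traversal of the catalog links, automatically following each "Next Page" link), not the code. Purely as an illustration, a minimal sketch of such a traversal is shown below; the entry URL, the link-extraction regex, the page encoding and the fetch helper are assumptions made for the example, not the project's actual crawler.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LilyCrawlerSketch {
    // Hypothetical entry point and link pattern: the real catalog/post URL formats
    // of BBS Lily are not given on the slides.
    private static final String ROOT = "http://bbs.example.edu.cn/sections";
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Deque<String> stack = new ArrayDeque<>();   // DFS: last-in, first-out
        Set<String> visited = new HashSet<>();      // do not crawl the same URL twice
        stack.push(ROOT);
        while (!stack.isEmpty()) {
            String url = stack.pop();
            if (!visited.add(url)) continue;
            String html = fetch(url);
            if (html == null) continue;             // fault tolerance: skip pages that fail to load
            // Push every link found on the page; in the real crawler the catalog and
            // "Next Page" links are followed while post pages are handed to the miner
            // (resolving relative links is omitted here).
            Matcher m = LINK.matcher(html);
            while (m.find()) stack.push(m.group(1));
        }
    }

    private static String fetch(String url) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "GBK"))) {  // encoding is a guess
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
            return sb.toString();
        } catch (Exception e) {                     // network problems, connection time-outs
            return null;
        }
    }
}
```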
Strategy of the Miner: Regex

[Demo: a raw post page as the crawler fetches it (translated excerpt) - Nanjing University Lily BBS, article view, board D_Computer. From: MSer (Microsoft campus ambassador), hits: 205. Title: (forwarded) Microsoft's 2011 spring/summer intern recruitment starts next Monday! Posted: Fri Mar 18 00:28:59 2011. Body: the recruitment opens nationwide next Monday and the JoinMS site relaunches the same day; nearly 200 positions, located in Beijing and Shanghai, covering basic research, software development, sales/marketing/services and technical support; see Microsoft's campus recruiting site for the detailed positions and skill requirements. Footer: source IP 180.109.95.252, navigation links, CopyRight (C) 1997-2011, NJU Lily BBS.]

- Use HtmlParser to get the tags' content;
- Extract the wanted info by regex;
- Store it in a designed pattern: each post is stored in one line, in the pattern below (\007, the ASCII BEL character, is the field separator):
    URL \007 hot \007 author \007 title \007 content
(See the demo.)

Common issues
Fault tolerance: network problems, connection time-outs.

3.3 Indexing Process
3.3.1 Target
3.3.2 Filter the Source File
3.3.3 Build the Inverted Index
3.3.4 Partition the Inverted Index File
3.3.5 Second-Level Index (Index for Indices)

Target of the Indexing Process
Run a series of Map/Reduce operations to generate inverted indices with rank and position info.

Indexing Process
[Figure: indexing pipeline - Txt_Filter -> inverted index -> partition -> index table / index for indices.]

Filter the Source File (1)
Although the source file stores posts in a well-designed pattern, we still need to filter it before we do the inverted-index job.
Reasons:
1. Examine and eliminate noise and duplicates:
   - noise such as "http://... \007 null \007 null \007 null \007 null";
   - duplicates, i.e. posts that were crawled more than once.
2. It is natural to pre-process the data before we really handle it.

Filter the Source File (2) - Pseudo Code (1)

    public class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable keyin, Text valin, Context context)
                throws IOException, InterruptedException {
            // If the input line has a legal structure, emit it with its URL as key and itself as value
            if (IsLegal(valin.toString())) {
                Text keyout = new Text(GetURL(valin.toString()));
                context.write(keyout, valin);
            }
        }

        public boolean IsLegal(String val) {
            // check whether the input line's structure is legal;
            // if legal return true, else return false
        }

        public String GetURL(String val) {
            // return the URL part of the input line:
            // split the input line by '\007' and return the first part
        }
    }

Filter the Source File (2) - Pseudo Code (2)

    public static class FilterReducer extends Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text keyin, Iterable<Text> valsin, Context context)
                throws IOException, InterruptedException {
            // A flag denotes whether the post with this URL has already been emitted,
            // so only the first copy of each URL survives
            boolean flag = false;
            for (Text val : valsin) {
                if (flag) break;
                context.write(NullWritable.get(), val);
                flag = true;
            }
        }
    }

Build the Inverted Index
The process of building the inverted index is simple; it becomes smarter if we calculate and record the posting info at the same time. The posting info includes the rank, the positions, etc.

Posting Info
1. TF-IDF (term frequency - inverse document frequency): the rank of a term t_i combines how often it occurs in the post with how rare it is in the corpus, roughly
       tf-idf(t_i, d) = tf(t_i, d) * log( |D| / |{d : t_i in d}| )
   where |D| is the total number of documents in the corpus and |{d : t_i in d}| is the number of documents in which the term t_i appears.
2. The position info does not need any calculation; each occurrence can be recorded as an integer pair like (StartIndex, EndIndex).

Build the Inverted Index - structure
1. For each post in the filtered source file, its offset in the file is used as its DID.
2. Each line of the inverted index file stores a term together with its info; the details are as below:
       term  info
       info          = SingleDIDInfo;SingleDIDInfo;SingleDIDInfo...
       SingleDIDInfo = DID:rank:positions
       positions     = position%position%position...
       position      = IsTitle|start|end
   E.g.  黑莓  48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562

We need to customize a new Writable type to represent a term's info (only the fields are shown on the slide; a fuller sketch follows):

    public class SingleRecordWritable implements WritableComparable {
        private LongWritable DID = new LongWritable();        // DID
        private FloatWritable rank = new FloatWritable();     // rank
        private Text positions = new Text();                  // positions
        ...
    }
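The slide lists only the three fields of SingleRecordWritable. The sketch below fills in what a working Hadoop WritableComparable would additionally need (write, readFields, compareTo); those method bodies and the toString format are assumptions inferred from the posting pattern above, not the original code.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// One posting ("SingleDIDInfo") of the inverted index, as a custom Hadoop writable.
public class SingleRecordWritable implements WritableComparable<SingleRecordWritable> {
    private LongWritable DID = new LongWritable();      // document id (offset in the filtered source file)
    private FloatWritable rank = new FloatWritable();   // rank of the term in this document
    private Text positions = new Text();                // "IsTitle|start|end%..." position list

    @Override
    public void write(DataOutput out) throws IOException {
        DID.write(out);
        rank.write(out);
        positions.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        DID.readFields(in);
        rank.readFields(in);
        positions.readFields(in);
    }

    @Override
    public int compareTo(SingleRecordWritable other) {
        return DID.compareTo(other.DID);                // order postings by document id
    }

    @Override
    public String toString() {
        // matches the "DID:rank:positions" pattern used in the inverted index file
        return DID.get() + ":" + rank.get() + ":" + positions.toString();
    }
}
```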
Build the Inverted Index - Pseudo Code (1)

    public class IndexMapper extends Mapper<LongWritable, Text, Text, SingleRecordWritable> {
        public HashMap<String, SingleRecordWritable> TokenMap;   // temporary per-post term info

        public void map(LongWritable offset, Text val, Context context)
                throws IOException, InterruptedException {
            String[] str = val.toString().split("\007", 5);  // split the post to get the title and the content
            HandleToken(str[3], true);                        // collect terms from the title
            HandleToken(str[4], false);                       // collect terms from the content; the terms go into TokenMap
            this.EmitMapValue(offset, context);               // after the whole post is scanned, emit the key/value pairs
        }

        // Handle one input field: segment it into words, then handle the words one by one.
        // (Pseudocode: "line" stands for a word-segmentation iterator over the field.)
        public void HandleToken(line, boolean IsTitle) {
            while (line.hasNext()) {
                if (TokenMap.contains(line.GetTerm()))
                    TokenMap.Update(line.GetTerm());          // already seen: update the term's info (frequency, positions)
                else
                    TokenMap.Put(line.GetTerm());             // first occurrence: put the term into TokenMap
            }
        }

        public void EmitMapValue(LongWritable key, Context context)
                throws IOException, InterruptedException {
            for (Map.Entry<String, SingleRecordWritable> entry : TokenMap.entrySet())
                context.write(new Text(entry.getKey()), entry.getValue());  // emit the key/value pairs one by one
        }
    }

Build the Inverted Index - Pseudo Code (2)

    public class LineIndexerReducer extends Reducer<Text, SingleRecordWritable, Text, Text> {
        // The reduce input key is a term; the input values are the infos that share
        // that term, one per document.
        public void reduce(Text key, Iterable<SingleRecordWritable> values, Context context)
                throws IOException, InterruptedException {
            // The postings are buffered first (in real Hadoop code each value must be copied,
            // since the framework reuses the object), because the number of documents that
            // contain the term is needed before the final ranks can be written out.
            List<SingleRecordWritable> docs = new ArrayList<SingleRecordWritable>();
            for (SingleRecordWritable val : values) docs.add(val);
            StringBuilder info = new StringBuilder();
            for (SingleRecordWritable val : docs)
                info.append(ResetFinalRank(val, docs.size())).append(";");
            context.write(key, new Text(info.toString()));
        }

        // The number of documents in which the term appears is taken into account.
        public String ResetFinalRank(SingleRecordWritable val, long TotalNum) {
            val.rank = val.rank / TotalNum;
            return val.toString();
        }
    }

Partition the Inverted Index File
After the last step we have the inverted index file. However, the file is big:

    Source file size    Inverted index file size
    48 M                72.5 M
    182 M               240 M
    703 M               828 M

Partition the Inverted Index File - Pseudo Code

    // The algorithm is simple: we just need to write a custom Partitioner class
    // that deals the term/info pairs out to the reducers in round-robin fashion.
    public static class SplitFilePartitioner extends Partitioner<Text, Text> {
        public static int count = 0;

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int partition = this.count % numPartitions;
            this.count++;
            return partition;
        }
    }

Second-Level Index (Index for Indices)
In the last step we partitioned the inverted index file into a certain number of parts, for example 16; each part-file contains some term/info pairs. So, when a term is given, how can we know which part-file it is in, and which line it is on? We need an index for the indices.
P.S. This really works: the second-level index files take up less than 10% of the size of the source file.

    Source file size    Inverted index file size    Second-level index file size
    48.1 M              72.5 M                      2.375 M
    182 M               240 M                       5.17 M
    703 M               828 M                       10.5 M

3.4 Set up the Web Retrieval Interface
3.4.1 Target
3.4.2 Sort Pages
3.4.3 Technics

Target of the Web Retrieval Interface
Make an interface which accepts the user's query and responds with search results.
1. Restrict the query string;
2. Sort the search results dynamically;
3. Return the results page by page.

Sort Pages
[Figure: example - the query string is word-segmented into Term0, Term1 and Term2; each term's posting list is looked up (Term0: Doc1 10, Doc3 90, Doc7 20; Term1: Doc2 20, Doc7 80, Doc5 15; Term2: Doc3 05, Doc2 40, Doc6 40); the lists are merged, the ranks belonging to the same document are summed, and the documents are ranked again: Doc7 100, Doc3 95, Doc6 40, Doc2 40, Doc5 15, Doc1 10.]
Merge the ranks and rank again.

Technics
Some technics used in the Web Interface:
- Javascript + HTML: work in the client - restrict the input, submit the form, split the result pages.
- Tomcat + Servlet: work on the server - listen for requests, handle requests, feed back the response.

3.5 Optimization
Optimization measures in different areas:
- Reduce sort time: a) for each term, at most the top 1500 DIDs are reserved; b) use a TreeMap to sort.
- Reduce I/O operations: a) the response page is created dynamically; b) each response returns 10 records.
- Cache strategy: a) put some hot inverted-index files in memory; b) cache replacement policy: LRU (a minimal sketch follows).
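The slides name the cache policy (keep hot inverted-index part-files in memory, evict with LRU) but show no code. A minimal sketch of such a cache, built on Java's LinkedHashMap in access-order mode, might look like the following; the key/value types (part-file name mapped to its loaded contents) and the capacity are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for hot inverted-index part-files: the key is the part-file
// name, the value is that file's contents loaded into memory.
public class IndexFileCache extends LinkedHashMap<String, String> {
    private final int capacity;

    public IndexFileCache(int capacity) {
        // accessOrder = true: get() moves an entry to the most-recently-used end
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        // evict the least-recently-used part-file once the cache holds more than 'capacity' entries
        return size() > capacity;
    }
}
```

On a query, get(partFileName) either returns the cached contents or misses; on a miss the part-file is read from storage and put() into the cache, and the least-recently-used entry is evicted automatically once the capacity is exceeded.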
4. Maybe Google & Baidu Have Done These
1. Search the index in parallel;
2. An outstanding word-segmentation algorithm;
3. A better rank strategy, to describe the relationship between a token and a DID precisely;
4. Record each user's query string:
   a) feed it back to the word segmentation;
   b) provide a suggestion ("remind") function, driven by the input-change event.

5. Conclusion
5.1 Summary
5.2 Highlights
5.3 About Map/Reduce

Summary
The related work has three parts, as below:
- Crawler: a hard-coded crawler and miner, aimed at getting the data of BBS Lily;
- Indexing process: the indexing process runs as a sequence of MapReduce operations;
- Web retrieval: a web interface set up for the user to retrieve info.

Highlights
View of the application: this stuff is cool - it provides a friendly user experience when we want to search for something on our BBS Lily.
View of the technics: using Map/Reduce to process the data offline has provided several benefits, such as:
1. The indexing code is simpler, smaller, and easier to understand;
2. We can keep conceptually unrelated computations separate;
3. The indexing process has become much easier to operate and maintain.

About Map/Reduce
Map/Reduce is not just a programming model; actually, it is also a life model.

Many thanks to Teacher Huang, Yang Xiaoliang, Xiao Tao, Liu Yulong, and Zhang Lu.

Demonstration of the system.