欢迎来到淘文阁 - 分享文档赚钱的网站! | 帮助中心 好文档才是您的得力助手!
淘文阁 - 分享文档赚钱的网站
全部分类
  • 研究报告>
  • 管理文献>
  • 标准材料>
  • 技术资料>
  • 教育专区>
  • 应用文书>
  • 生活休闲>
  • 考试试题>
  • pptx模板>
  • 工商注册>
  • 期刊短文>
  • 图片设计>
  • ImageVerifierCode 换一换

    南京大学计算机科学与技术系主讲人:黄宜华杨晓亮 2011年春季学期 MapReduce海量数据并行处理.ppt

    • 资源ID:56695327       资源大小:5.33MB        全文页数:52页
    • 资源格式: PPT        下载积分:20金币
    快捷下载 游客一键下载
    会员登录下载
    微信登录下载
    三方登录下载: 微信开放平台登录   QQ登录  
    二维码
    微信扫一扫登录
    下载资源需要20金币
    邮箱/手机:
    温馨提示:
    快捷下载时,用户名和密码都是您填写的邮箱或者手机号,方便查询和重复下载(系统自动生成)。
    如填写123,账号就是123,密码也是123。
    支付方式: 支付宝    微信支付   
    验证码:   换一换

     
    账号:
    密码:
    验证码:   换一换
      忘记密码?
        
    友情提示
    2、PDF文件下载后,可能会被浏览器默认打开,此种情况可以点击浏览器菜单,保存网页到桌面,就可以正常下载了。
    3、本站不支持迅雷下载,请使用电脑自带的IE浏览器,或者360浏览器、谷歌浏览器下载即可。
    4、本站资源下载后的文档和图纸-无水印,预览文档经过压缩,下载后原文更清晰。
    5、试题试卷类文档,如果标题没有明确说明有答案则都视为没有答案,请知晓。

    南京大学计算机科学与技术系主讲人:黄宜华杨晓亮 2011年春季学期 MapReduce海量数据并行处理.ppt

    南京大学计算机科学与技术系主讲人:黄宜华,杨晓亮 2011年春季学期 MapReduce海量数据并行处理 Still waters run deep.流静水深流静水深,人静心深人静心深 Where there is life,there is hope。有生命必有希望。有生命必有希望LOGOA Full Text Search Engine For BBS Lily 南京大学计算机科学与技术系主讲人:黄宜华、顾荣 2011年春季学期鸣谢:本课程得到鸣谢:本课程得到鸣谢:本课程得到鸣谢:本课程得到GoogleGoogleGoogleGoogle公司公司公司公司(北京)北京)北京)北京)中国大学合作部精品课程计划资助中国大学合作部精品课程计划资助中国大学合作部精品课程计划资助中国大学合作部精品课程计划资助ContentsBackgroundBrief IntrotoprincipleofFullTextSearchEngineImplementofFTSEforBBSLilyMaybeGoogle&Baiduhasdonethese.Conclusion1.BackgroundWhat is a full text search engine?1.11.2Why do we need it?What is a full text search engine?Inafulltextsearch,thesearchengineexaminesallofthewordsineverystoreddocumentasittriestomatchsearchwordssuppliedbytheuser.-FromWikiWhy do we need a FTSE for BBS Lily?Totalamount:around3millionpostsOverathousandeveryday.Eachpostssize:1K4KData InBBS Lily BaseCapacityIncreasingSpeed Post GranularityContentsBackgroundBrief IntrotoprincipleofFullTextSearchEngineImplementofFTSEforBBSLilyMaybeGoogle&Baiduhasdonethese.Conclusion2.Brief Intro to the Principle of Full Text Search EngineWhat happens after you press enter?Abstract IR ArchitectureDocumentsQueryHitsRepresentationFunctionRepresentationFunctionQueryRepresentationDocumentRepresentationComparisonFunctionIndexofflineonlinedocument acquisition(e.g.,web crawling)About Representation FunctionDocumentsInvertedIndexBag of Wordscase folding,tokenization,stopword removal,stemmingsyntax,semantics,word knowledge,etc.A Simple Inverted Index Demo111211111112111111123111411121121bluecateggfishgreenhamhatone11111121bluecateggfishgreenhamhatone11red11two1red1twoone fish,two fishDoc 1red fish,blue hatDoc 2cat in the hatDoc 3green eggs and hamDoc 43414432122112Map/Reduces Role1.musthavesub-second responsetime2.fortheweb,onlyneedrelativelyfew resultsIndexingIndexingProblemProblemRetrievalRetrievalProblemProblemCharacter DescriptionCharacter DescriptionSuitable?Suitable?1.scalability2.relativelyfast3.batchoperation4.updatesmaynotbeimportant5.crawlingisachallengeinitselfContentsBackgroundBrief IntrotoprincipleofFullTextSearchEngineImplementofFTSEforBBSLilyMaybeGoogle&Baiduhasdonethese.Conclusion3.Implement of FTSE for Lily BBS3.4OutlineofWorkFlow3.13.23.33.5CrawlWebPages&MineInfoIndexingProcessesSetupWebRetrievalInterfaceOptimizationResponseQuery String3.1 Outline of work FlowWeb Page 0Web Page 1Web Page nCrawl&Info MiningFormated Files/Content/Vice InfoInverted Index&Ranking JSP PageSplitTerm0,Term1Term nSearch&MergeTarget DIDResult ListTitleContextAuthorURLHottoken 1token 0token nIndexForIndicesCrawlerWeb RetrivalMap/Reduce3.2 Crawl Web Pages&Mine Info3.2.1Target FrameworkofLilyBBSStrategyofCrawler StrategyofMiner3.2.53.2.43.2.23.2.3CommonissuesTarget of Crawler&MinerCrawl every postFrom BBS lily Continuously.FaulttoleranceMine wanted infoFrom each post that Crawler has got from web;store the them in a designed pattern.A CrawlerB MinerFramework of BBS Lily(1)Post 0Post 1Post nTitleinhereBBSLilyTitleinheresection 12Titleinheresection0Titleinheresection2Titleinheresection1TitleinhereBoard 0Board 1TitleinhereBoard nFramework of BBS Lily(2)Strategy of CrawlerDFSPost 1Post nPost 0TitleinhereBBSLilyTitleinhereSection 12Titleinheresection0Titleinheresection2Titleinheresection1TitleinhereBoard 0Board 1TitleinhereBoard n-Traversalcataloglinkstogetthecontent;-AutomaticlinktoNextPageanddotheroutinejob.tipsStrategy of MinerRegex 南京大学小百合站-文章阅读讨论区:D_ComputerNet.User.init(WHEEL:0,FACE:1,BACK:0)发信人:MSer(微软校园大使),信区:D_Computer.本篇人气:205标题:转载微软2011春夏季实习生招聘将于下周一启动!发信站:南京大学小百合站(FriMar1800:28:592011)【以下文字转载自MSer的blog】【原文由MSer所发表】抢先知道:微软2011春夏季实习生招聘将于下周一启动!亲爱的同学们,微软2011春夏实习生招聘将于下周一在全国范围内全面启动!届时,JoinMS网站也将以全新的内容在同一时间与同学们见面!2011微软实习生招聘的职位数量接近200余个,工作地点分布在北京和上海,涵盖了基础研究,软件开发,销售、市场和服务,技术支持等领域。具体的职位信息和技能需求请同学们登录微软的校园招聘网站进行查看!加入微软,加入IT精英的行列!微软期待与你携手创造更加辉煌的未来!-来源:南京大学小百合站http:/FROM:180.109.95.252上一篇本讨论区下一篇主题列表同主题阅读Net.Html.show(copyright)-CopyRight(C)1997-2011,NJULilyBBS-Use HtmlParserTo get Tags ContentExtract Info by regexStore in a designed pattern EachpostwillbestoredinalineasthepatternblewEachpostwillbestoredinalineasthepatternblewClicktoaddTextURL/007hot/007auhtor/007title/007content SeeDemoCommon issuesFault Tolerance Network Problems ConnectionTime Out3.3 Indexing Process3.3.1TargetFilterSourceFileBuildInvertedIndex3.3.23.3.3PartitionInvertedIndexFile3.3.53.3.4Second-LevelIndex(IndexforIndices)Target of Indexing ProcessRunaseriesofMap/ReduceoperationstogenerateInvertedIndiceswithrankandpositioninfo.Indexing ProcessTxt_FilterPartitionIndex TableInverted IndexIndexForIndicesFilter Source File(1)AlthoughSourceFilestorespostsinawell-designedpattern,WestillneedtofilteritbeforewedotheInvertedIndicesjob.1.Examine and eliminate noises and duplications -“http:/ null 007 null 007 null 007 null”-About duplications2.It is natural to pre-process the data before we really handle it.ReasonsFilter Source File(2)-Pseudo Code(1)publicclassFilterMapperextendsMapperpublicvoidmap(LongWritablekeyin,Textvalin,Contextcontext)/If the input line is has a legal structure emit it with its URL as key and itself as valueif(IsLegal(valin.toString()Textkeyout=newText(GetURL(val);context.write(keyout,valin);publicbooleanIsLegal(Stringval)/check whether the input lines structure is legal;/If legal return True,else return false;PublicStringGetURL(val)/returntheURLpartoftheinputline;/splittheinputlineby007andreturnthefirstpart.publicstaticclassFilterReducerextendsReducerpublicvoidreduce(Textkeyin,Iterablevalsin,Contextcontext)/A sign denotes whether the post with certain URL has been emittedbooleanflag=false;for(Textval:valsin)if(flag)break;elsecontext.write(NullWritable.get(),val);flag=true;Filter Source File(2)-Pseudo Code(2)Build Inverted Index TheprocessofbuildingInverted Indexissmart,itwillbesmarterifwecancalculateandrecordpostinginfoatthesametime.Thepostinginfoincludesrank、positions etc.DetailsBuild Inverted IndexPosting Info1.TF-IDF(Term Frequency-Inverse Document Frequency):2.Positions info do not need any calculation,the can be record as a Integer Pair like(StartIndex,EndIndex).Posting Info|D|:totalnumberofdocumentsinthecorpus:numberofdocumentswherethetermtiappears(thatis)Build Inverted Index-structure1.Foreachpostinfilteredsourcefile,theoffsetinthefilecanbeconsideredasitsDID;2.EachlineofInvertedIndexfilestoresatermwithitsinfo,thedetailsareasblew:term infoinfo=SingleDIDInfo;SingleDIDInfo;SingleDIDInfo.SingleDIDInfo=DID:rank:positionspositions=position%position%position%position.position=IsTitle|start|endEg.黑莓48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562/weneedtocustomizeanewWritableformattorepresenta/termsInfopublicclassSingleRecordWritableimplementsWritableComparableprivateLongWritableDID=newLongWritable();/DIDprivateFloatWritablerank=newFloatWritable();/RankprivateTextpositions=newText();/Positions.Build Inverted Index-Pseudo Code(1)publicclassMapperextendsMapperpublicHashMapTokenMap;/TMP Infopublicvoidmap(LongWritableoffset,Textval,Contextcontext)Stringstr=val.toString().split(007,5);/Split post to get title and contentHandleToken(str3,true);HandleToken(str4,false);/store lines terms in TMP Info this.EmitMapValue(key,context);/After scaned the whole text,emit key and value/Handleinputline:segementintowords,handlethewordsonebyone.publicHandleToken(Iine,IsTitle)while(line.hasnext()if(TokenMap.contians(line.GetTerm()TokenMap.Update(line.GetTerm();/if TMP dont contains the word put in TMP elseTokenMap.Put(line.GetTerm();/Update the terms info in TMP.publicvoidEmitMapValue(LongWritablekey,Contextcontext)for(Map.Entryentry:TokenMap.entrySet()context.write(word,singleRecord);/emit the key/value pair one by oneBuild Inverted Index-Pseudo Code(2)publicclassLineIndexerReducerextendsReducer/reduces input key is a term,input value is a List of Infos that/contains the same term,each info represent a Doc.publicvoidreduce(Textkey,Iterablevalues,Contextcontext)StringBuilderInfo;for(SingleRecordWritableval:values)Info.append(ResetFinalRank(val,values.len)+;);out.set(info.toString();context.write(key,out);/The Num of doucment that a term shows is taken into account.publicStringResetFinalRank(SingleRecordWritableVal,longTotalNum)Val.Rank=Val.Rank/TotalNum;returnVal.toString();Build Inverted Index-Pseudo Code(3)Partition Inverted Index FileAfter last step,we got the Inverted Index File.However,the file is so big.SourcefilesizeInvertedindexfilesize48M72.5M182M240M703M828MPartition Inverted Index File-Pseudo Code/this algorithm is simple,we just need to rewrite a/partitioner classpublicstaticclasssplitfilePartitionerextendsPartitionerpublicstaticintcount=0;OverridepublicintgetPartition(Textkey,Textvalue,intnumPartitions)intpartition=(this.count%numPartitions);this.count+;returnpartition;Second-Level Index(Index for Indices)In last step,we partitioned the Inverted Index file into a certain num parts,for example 16.Each file contains some term-info pairs.So,when a term is given?How can we know which part-file is it in?which line is it in?We need an Index for Indices.Ps.This really works.The second-level index files size is less than 10%of the source file.Second-Level Index(Index for Indices)SourcefilesizeInvertedindexfilesizeSecond-LevelIndexfilesize48.1M72.5M2.375M182M240M5.17M703M828M10.5M3.4 Set up Web Retrieval Interface3.4.1Target SortPagesTechnics3.4.23.4.3Target of Web Retrieval InterfaceMakeanInterfacewhichaccpetusersqueryandresponsesearchresults.1.Restrict the query string;2.Sort search result dynamically;3.Response results page by page.Web RetrievalInterfaceSort PagesTerm0Term2Term1Here is a demo.Doc1 10Doc3 90Doc7 20Text in hereQueryStringWordSegementDoc2 20Doc7 80 Doc5 15Doc3 05Doc2 40Doc6 40MergeRankAgainDoc7100Doc395Doc640Doc240Doc515Doc110Merge the rank and Rank againTechnicsJavascipt+HTMLTomcat+ServletWorkinclient.restrictinput,submitform,splitpageSome technics used in Web InterfaceWorkinServer.listernforrequest,handlerequest,feedbackresponse.3.5 Optimizationa)Foreachtermonlytop1500DIDarereservedatmost.b)UseTreeMaptosort.Reduce Sort TimeReduce I/O operationsCache StrategyOptimization measures in different areas.a)ResponsePageiscreateddynamically.b)Eachtimereturn10records.a)PutsomehotInvertedIndexfileinthememory.b)Cachereplacement-LRUContentsBackgroundBrief IntrotoprincipleofFullTextSearchEngineImplementofFTSEforBBSLilyMaybeGoogle&Baiduhasdonethese.Conclusion4.Maybe Google&Baidu has done.ParallellyWordSegementUsers queryRank3.Abetterrankstrategy:TodescirbetherelationshipbetweenatokenandDIDprecisely4.Recordeachusersquerystring;a)feedbacktoWordSegementb)Provideremindfunction.(Byinputchangeevent)1.SearchStuffparallelly2.AnOutStandingWordSegementAlgorithm.ContentsBackgroundBrief IntrotoprincipleofFullTextSearchEngineImplementofFTSEforBBSLilyMaybeGoogle&Baiduhasdonethese.Conclusion5.ConclusionSummary5.15.25.3HighlightsAboutMap/ReduceSummaryCrawlerIndexingprocessWeb retrievalA Hard Coding Crawler and Miner,aimed to get data for BBS lilyThe indexing process runs as a sequence of MapReduce operations.Set up a Web Interface for userto retrieve info.CrawlerWeb InterfaceIndexingprocessRelatedWorkhasthreepartsasblew:HighlightsThisstuffisCOOLIt can provide a friendly User Experience,when we wanna search something in our BBS Lily.UseMap/Reducetoprocessdataoffline.Ithasprovidedseveralbenefitssuchas:1.The indexing code is simpler,smaller,and easier tounderstand.2.we can keep conceptually unrelated computations separate.3.The indexing process has become much easier toOperate and maintain.View of ApplicationView of Technics1Map/Reduce is not just a Programming Model,actually its also a Life ModelClick to add TextAbout Map/ReduceMany thanks toTeacher Huang;Yang Xiaoliang;Xiao Tao;Liu Yulong;Zhang Lu;Demonstrate the systemLOGO

    注意事项

    本文(南京大学计算机科学与技术系主讲人:黄宜华杨晓亮 2011年春季学期 MapReduce海量数据并行处理.ppt)为本站会员(豆****)主动上传,淘文阁 - 分享文档赚钱的网站仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知淘文阁 - 分享文档赚钱的网站(点击联系客服),我们立即给予删除!

    温馨提示:如果因为网速或其他原因下载失败请重新下载,重复下载不扣分。




    关于淘文阁 - 版权申诉 - 用户使用规则 - 积分规则 - 联系我们

    本站为文档C TO C交易模式,本站只提供存储空间、用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。本站仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知淘文阁网,我们立即给予删除!客服QQ:136780468 微信:18945177775 电话:18904686070

    工信部备案号:黑ICP备15003705号 © 2020-2023 www.taowenge.com 淘文阁 

    收起
    展开