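Department of Computer Science and Technology, Nanjing University. Lecturers: Huang Yihua, Yang Xiaoliang. Spring semester 2011. MapReduce Massive Data Parallel Processing.ppt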
A Full Text Search Engine For BBS Lily
Department of Computer Science and Technology, Nanjing University
Lecturers: Huang Yihua, Gu Rong. Spring semester 2011.
Acknowledgment: this course is funded by the Excellent Course Program of the China University Relations department of Google (Beijing).

Contents
- Background
- Brief Intro to the Principle of Full Text Search Engine
- Implementation of FTSE for BBS Lily
- Maybe Google & Baidu have done these.
- Conclusion

1. Background
1.1 What is a full text search engine?
1.2 Why do we need it?

What is a full text search engine?
"In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user." - From Wiki

Why do we need a FTSE for BBS Lily?
Data in BBS Lily:
- Base capacity: around 3 million posts in total.
- Increasing speed: over a thousand new posts every day.
- Post granularity: each post is 1K to 4K in size.
2. Brief Intro to the Principle of Full Text Search Engine

What happens after you press enter?

Abstract IR Architecture (diagram)
- Offline: documents, obtained by document acquisition (e.g., web crawling), pass through a representation function to a document representation, which is stored in the index.
- Online: the query passes through a representation function to a query representation.
- A comparison function matches the query representation against the index and returns the hits.

About Representation Function
- Documents -> Bag of Words -> Inverted Index
- Steps applied: case folding, tokenization, stopword removal, stemming.
- A bag of words ignores syntax, semantics, word knowledge, etc.

A Simple Inverted Index Demo
- Doc 1: one fish, two fish
- Doc 2: red fish, blue hat
- Doc 3: cat in the hat
- Doc 4: green eggs and ham
- Terms: blue, cat, egg, fish, green, ham, hat, one, red, two
(Figure: term-document matrix and the postings list of each term.)

Map/Reduce's Role: is each problem suitable for Map/Reduce?
Indexing problem, character description:
1. scalability
2. relatively fast
3. batch operation
4. updates may not be important
5. crawling is a challenge in itself
Retrieval problem, character description:
1. must have sub-second response time
2. for the web, only need relatively few results
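The indexing side of the demo can be phrased as a map step that emits (term, document id) pairs and a reduce step that groups them into postings. The following is a minimal plain-Java sketch of that idea, not the course's implementation: the class name `InvertedIndexSketch`, the three-word stopword list and the plural-stripping rule (a crude stand-in for real stemming) are assumptions made for illustration, and the four documents are taken as they appear in this extraction.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Assumed stopword list, just large enough for the four demo documents.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("in", "the", "and"));

    // Case folding, tokenization, stopword removal and a crude plural-stripping "stemmer".
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z]+")) {
            if (raw.isEmpty() || STOPWORDS.contains(raw)) continue;
            boolean plural = raw.length() > 3 && raw.endsWith("s");
            out.add(plural ? raw.substring(0, raw.length() - 1) : raw);
        }
        return out;
    }

    // "Map": emit one (term, docId) pair per term occurrence.
    static List<Map.Entry<String, Integer>> map(int docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String term : terms(text)) {
            pairs.add(new AbstractMap.SimpleEntry<>(term, docId));
        }
        return pairs;
    }

    // "Reduce": group the pairs by term into postings (docId -> term frequency).
    static SortedMap<String, SortedMap<Integer, Integer>> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, SortedMap<Integer, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            index.computeIfAbsent(pair.getKey(), k -> new TreeMap<>())
                 .merge(pair.getValue(), 1, Integer::sum);
        }
        return index;
    }

    // Index the four demo documents, numbered from 1 as on the slide.
    static SortedMap<String, SortedMap<Integer, Integer>> buildDemo() {
        String[] docs = {"one fish, two fish", "red fish, blue hat", "cat in the hat", "green eggs and ham"};
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            pairs.addAll(map(i + 1, docs[i]));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        buildDemo().forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

Each printed line pairs a term with its postings, for example the documents containing "fish" and how often it occurs in each. In a real Map/Reduce job the grouping between the two steps is done by the framework's shuffle rather than by a single in-memory list.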
3. Implementation of FTSE for Lily BBS
3.1 Outline of Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Processes
3.4 Set Up Web Retrieval Interface
3.5 Optimization

3.1 Outline of Work Flow (diagram)
- Crawler: Web Page 0, Web Page 1, ..., Web Page n -> Crawl & Info Mining -> Formatted Files (Content / Vice Info)
- Map/Reduce: Formatted Files -> Inverted Index & Ranking -> Index For Indices (token 0, token 1, ..., token n)
- Web Retrieval: Query String -> JSP Page -> Split (Term 0, Term 1, ..., Term n) -> Search & Merge -> Target DID -> Result List (Title, Context, Author, URL, Hot) -> Response

3.2 Crawl Web Pages & Mine Info
3.2.1 Target
3.2.2 Framework of Lily BBS
3.2.3 Strategy of Crawler
3.2.4 Strategy of Miner
3.2.5 Common issues

Target of Crawler & Miner
A. Crawler: crawl every post from BBS Lily continuously, with fault tolerance.
B. Miner: mine the wanted info from each post that the Crawler has fetched from the web, and store it in a designed pattern.

Framework of BBS Lily (1)
BBS Lily -> Section 0, Section 1, Section 2, ..., Section 12 -> Board 0, Board 1, ..., Board n -> Post 0, Post 1, ..., Post n

Framework of BBS Lily (2)

Strategy of Crawler
DFS over the tree BBS Lily -> Sections -> Boards -> Posts.
Tips:
- Traverse the catalog links to get the content.
- Automatically follow the link to the Next Page and repeat the routine job.

Strategy of Miner: Regex
Sample post page (translated; the original field labels are given in parentheses):
Nanjing University Lily BBS (南京大学小百合站) - Article reading [Board: D_Computer]
Net.User.init(WHEEL:0,FACE:1,BACK:0)
Poster (发信人): MSer (Microsoft Campus Ambassador), Board (信区): D_Computer. Popularity of this post (本篇人气): 205
Title (标题): [Repost] Microsoft 2011 spring/summer intern recruitment starts next Monday!
Posting site (发信站): Nanjing University Lily BBS (Fri Mar 18 00:28:59 2011)
[The following text is reposted from MSer's blog] [Original posted by MSer]
Be the first to know: Microsoft 2011 spring/summer intern recruitment starts next Monday! Dear students, Microsoft's 2011 spring/summer intern recruitment will launch nationwide next Monday! At the same time, the JoinMS website will greet you with brand-new content! The 2011 Microsoft intern recruitment offers close to 200 positions, located in Beijing and Shanghai, covering basic research, software development, sales, marketing and services, technical support and other fields. For the specific positions and skill requirements, please log in to Microsoft's campus recruitment website! Join Microsoft, join the ranks of the IT elite! Microsoft looks forward to creating a more brilliant future together with you!
- Source (来源): Nanjing University Lily BBS http:/ [FROM: 180.109.95.252]
Previous post | This board | Next post | Topic list | Same-topic reading
Net.Html.show(copyright) - CopyRight(C) 1997-2011, NJU Lily BBS

Steps:
- Use HtmlParser to get the tags' content.
- Extract info by regex.
- Store it in a designed pattern.

Each post is stored in one line, following the pattern below:
URL/007hot/007author/007title/007content
See Demo.

Common issues
- Fault tolerance
- Network problems
- Connection time-out
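The one-line-per-post pattern could be written and read back roughly as follows. This is only a sketch under stated assumptions: the class `PostRecord`, its field types and the `clean` helper are hypothetical, the `/007` on the slide is assumed to stand for the control character `\007` (if a literal string is meant, only `SEP` changes), and the URL in `main` is a placeholder.

```java
import java.util.Optional;

class PostRecord {
    // Assumption: the "/007" on the slide denotes the control character \007 (BEL).
    static final String SEP = "\007";

    final String url;
    final int hot;        // popularity counter of the post
    final String author;
    final String title;
    final String content;

    PostRecord(String url, int hot, String author, String title, String content) {
        this.url = url;
        this.hot = hot;
        this.author = author;
        this.title = title;
        this.content = content;
    }

    // A record must stay on one line, so line breaks and stray separators become spaces.
    private static String clean(String s) {
        return s.replaceAll("[\\r\\n\\x07]+", " ").trim();
    }

    // Serialize in the order URL, hot, author, title, content.
    String toLine() {
        return String.join(SEP, clean(url), String.valueOf(hot),
                clean(author), clean(title), clean(content));
    }

    // Parse one stored line; malformed lines yield an empty Optional instead of an exception.
    static Optional<PostRecord> parse(String line) {
        String[] f = line.split(SEP, 5);
        if (f.length != 5) return Optional.empty();
        try {
            return Optional.of(new PostRecord(f[0], Integer.parseInt(f[1].trim()), f[2], f[3], f[4]));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        PostRecord post = new PostRecord("http://example.invalid/post/1", 205,
                "MSer", "Intern recruitment", "line one\nline two");
        PostRecord back = parse(post.toLine()).get();
        System.out.println(back.title + " by " + back.author + ", hot=" + back.hot);
    }
}
```

Returning an empty `Optional` for malformed lines keeps the reader tolerant of records damaged by network problems or time-outs during crawling, which is one way to approach the fault-tolerance concern listed under common issues.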
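3.3 Indexing Process
3.3.1 Target
3.3.2 Filter Source File
3.3.3 Build Inverted Index
3.3.4 Partition Inverted Index File
3.3.5 Second-Level Index (Index for Indices)

Target of Indexing Process
Run a series of Map/Reduce operations to generate inverted indices with rank and position info.

Indexing Process (diagram)
Txt_Filter -> Inverted Index -> Partition Index Table -> Index For Indices

Filter Source File (1)
Although the source file stores posts in a well-designed pattern, we still need to filter it before we run the inverted index job.
Reasons:
1. Examine and eliminate noise and duplications.
   - Noise, for example: "http:/ null 007 null 007 null 007 null"
   - About duplications
2. It is natural to pre-process the data before we really handle it.

Filter Source File (2) - Pseudo Code (1)

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        // Assumption: "007" on the slides is the control character \007 used as field separator.
        private static final String SEP = "\007";

        @Override
        public void map(LongWritable keyin, Text valin, Context context)
                throws IOException, InterruptedException {
            // If the input line has a legal structure, emit it with its URL as key and itself as value.
            String line = valin.toString();
            if (isLegal(line)) {
                context.write(new Text(getURL(line)), valin);
            }
        }

        // Check whether the input line's structure is legal.
        // Assumed rule: exactly five fields, none of them empty or "null".
        public boolean isLegal(String val) {
            String[] fields = val.split(SEP, 5);
            if (fields.length != 5) return false;
            for (String field : fields) {
                String t = field.trim();
                if (t.isEmpty() || t.equals("null")) return false;
            }
            return true;
        }

        // Return the URL part of the input line: split by \007 and return the first part.
        public String getURL(String val) {
            return val.split(SEP, 2)[0];
        }
    }

Filter Source File (2) - Pseudo Code (2)

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FilterReducer extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        public void reduce(Text keyin, Iterable<Text> valsin, Context context)
                throws IOException, InterruptedException {
            // All values share one URL; emit only the first one so duplicates are dropped.
            for (Text val : valsin) {
                context.write(NullWritable.get(), val);
                break;
            }
        }
    }

Build Inverted Index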
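Before the index is built, the filter step's logic can be tried locally without a Hadoop cluster. The sketch below imitates the job in plain Java: legal lines are keyed by URL (the map and shuffle part) and one line per URL is kept (the reduce part). The class `FilterJobSketch`, the `record` helper, the legality rule (five fields, none empty or "null"), the reading of `007` as the control character `\007`, and the example URLs are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FilterJobSketch {
    // Assumption: "007" on the slides is the control character \007 used as field separator.
    static final String SEP = "\007";

    // Assumed legality rule: exactly five fields, none empty or the literal "null".
    static boolean isLegal(String line) {
        String[] fields = line.split(SEP, 5);
        if (fields.length != 5) return false;
        for (String field : fields) {
            String t = field.trim();
            if (t.isEmpty() || t.equals("null")) return false;
        }
        return true;
    }

    static List<String> filter(List<String> lines) {
        // "Map" and "shuffle": group legal lines by URL; LinkedHashMap keeps first-seen order.
        Map<String, List<String>> byUrl = new LinkedHashMap<>();
        for (String line : lines) {
            if (!isLegal(line)) continue;
            String url = line.split(SEP, 2)[0];
            byUrl.computeIfAbsent(url, k -> new ArrayList<>()).add(line);
        }
        // "Reduce": emit exactly one line per URL.
        List<String> out = new ArrayList<>();
        for (List<String> group : byUrl.values()) {
            out.add(group.get(0));
        }
        return out;
    }

    // Hypothetical helper that joins fields into one stored line.
    static String record(String... fields) {
        return String.join(SEP, fields);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                record("http://example.invalid/p/1", "205", "MSer", "title A", "content A"),
                record("http://example.invalid/p/1", "205", "MSer", "title A", "content A"),
                record("http://example.invalid/p/2", "null", "null", "null", "null"),
                record("http://example.invalid/p/3", "7", "someone", "title B", "content B"));
        System.out.println(filter(lines).size()); // prints 2
    }
}
```

One difference from a real job is worth keeping in mind: Hadoop does not guarantee the order of values within a key, so "the first record" picked by a reducer is an arbitrary one among the duplicates, which is harmless only when the duplicates are identical.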