Deduplication (Data Deduplication).ppt


Deduplication
CSCI 572: Information Retrieval and Search Engines
Summer 2010

May-20-10 CS572-Summer2010 CAM-2

Outline
- What is Deduplication?
- Importance
- Challenges
- Approaches

What are web duplicates?
- The same page, referenced by different URLs
  - http:/
  - http:/
- What are the differences?
  - URL host (virtual hosts), sometimes protocol, sometimes page name, etc.

What are web duplicates?
- Near-identical page, referenced by the same URLs
  - Google search for “search engines”
  - Google search for “search engines”
- What are the differences?
  - Page is within some delta % similar to the other (where delta is a large number), but may differ in, e.g., ads, counters, timestamps, etc.

Why is it important to consider duplicates?
- In search engines, URLs tell the crawlers where to go and how to navigate the information space
- Ideally, given the web's scale and complexity, we'll give priority to crawling content that we haven't already stored or seen before
  - Saves resources (on the crawler end, as well as the remote host)
  - Increases crawler politeness
  - Reduces the analysis that we'll have to do later

Why is it important to consider duplicates?
- Identification of website mirrors (or copies of content) used to spread the load and bandwidth consumption
  - S, CPAN, Apache, etc.
- If you identify a mirror, you can omit crawling many web pages and save crawler resources

“More Like This”
- Finding similar content to what you were looking for
- As we discussed during the lecture on the search engine architecture, much of the time in search engines is spent filtering through the results. Presenting similar documents can cut down on that filtering time

XML
- XML documents structurally appear very similar
  - What's the difference between RSS and RDF and OWL and XSL and XSLT and any number of XML documents out there?
- With the ability to identify similarity and reduce duplication of XML, we could identify
  - XML documents with similar structure
  - RSS feeds that contain the same links
- Differentiate RSS (crawl more often) from other less frequently updated XML

Detect Plagiarism
- Determine web sites and reports that plagiarize one another
- Important for copyright laws and enforcement
- Determine similarity between source code
  - Licensing issues
  - Open Source, other.

Detection of SPAM
- Identifying malicious SPAM content
  - Adult sites
  - Pharmaceutical drug and prescription drug SPAM
  - Malware and phishing scams
- Need to ignore this content from a crawling perspective
  - Or to “flag” it and not include it in (general) search results

Challenges
- Scalability
  - Most approaches to detecting duplicates rely on training and analytical approaches that may be computationally expensive
  - Challenge is to perform the evaluation at low cost
- What to do with the duplicates?
  - The answer isn't always to throw them out; they may be useful for study
  - The content may require indexing for later comparison in legal issues, or for “snapshotting” the web at the time, i.e., the Internet Archive

Challenges
- Structure versus Semantics
  - Documents that are structurally dissimilar may contain the exact same content
  - Think of the use of tags to emphasize versus tags in HTML
  - Need to take this into account
- Online versus offline
  - Depends on crawling strategy, but offline typically can provide more precision at the cost of inability to dynamically react

Approaches for Deduplication
- SIMHASH and Hamming Distance
  - Treat web documents as a set of features, constituting an n-dimensional vector; transform this vector into an f-bit fingerprint of a small size, e.g., 64
  - Compare fingerprints and look for a difference in at most k bits
  - Manku et al., WWW 2007
- Syntactic similarity
  - Shingling
  - Treat web documents as contiguous subsequences of words
  - Compute w-shingling
  - Broder et al., WWW 1997
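The two content-based approaches on the slide above can be sketched together in a few lines of Python. This is an illustrative sketch only, not the implementation from either cited paper. It assumes unweighted word shingles as the feature set, SHA-256 truncated to f bits as the per-feature hash, f a multiple of 8, and a threshold of k = 3; the function names (`shingles`, `resemblance`, `simhash`, `hamming`, `near_duplicate`) are hypothetical.

```python
import hashlib
import re


def shingles(text: str, w: int = 4) -> set[tuple[str, ...]]:
    """Return the set of w-shingles (contiguous runs of w words) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Too short for a full shingle: use the whole text as one feature.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard overlap of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def simhash(text: str, f: int = 64, w: int = 2) -> int:
    """Build an f-bit fingerprint; assumes f is a multiple of 8, at most 256."""
    votes = [0] * f
    for feature in shingles(text, w):
        digest = hashlib.sha256(" ".join(feature).encode("utf-8")).digest()
        h = int.from_bytes(digest[: f // 8], "big")
        # Each feature votes +1 on bit positions it sets and -1 on the rest
        # (all features carry weight 1 here, which is an assumption).
        for bit in range(f):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is 1 where the positive votes won.
    return sum(1 << bit for bit in range(f) if votes[bit] > 0)


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


def near_duplicate(a: str, b: str, k: int = 3, f: int = 64) -> bool:
    """Flag two documents whose fingerprints differ in at most k bits."""
    return hamming(simhash(a, f), simhash(b, f)) <= k
```

A crawler could call `near_duplicate(new_page_text, stored_page_text)` before indexing a page, or compare `resemblance(...)` against a chosen threshold. Good values for `w`, `k`, and that threshold depend on the collection and would need tuning.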

Approaches for Deduplication
- Link structure similarity
  - Identify similarity in the linkages between web collections
  - Choo et al.

Approaches for Deduplication
- Exploiting the structure and links between physical network hosts
  - Look at:
    - Language
    - Geographical connection
    - Continuations and proxies
    - Zipfian function
  - Bharat et al., ICDM 2001

Wrapup
- Need Deduplication for conserving resources and ensuring quality and accuracy of resultant search indices
- Can assist in other areas like plagiarism, SPAM detection, fraud detection, etc.
- Deduplication at web scale is difficult; need efficient means to perform this computation online or offline
- Techniques look at page structure/content, page link structure, or physical web node structure
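The link-structure idea mentioned in the slides can be illustrated with a small sketch. It is not the method of the work cited on the slides. It assumes each collection is given as a list of (source URL, target URL) pairs, and it compares only links that stay within a host. The helper names `relative_edges` and `link_similarity` and the example hosts are hypothetical.

```python
from urllib.parse import urlsplit


def relative_edges(links):
    """Reduce (source, target) URL pairs to host-independent path pairs."""
    edges = set()
    for src, dst in links:
        s, d = urlsplit(src), urlsplit(dst)
        # Keep only links that stay on the source's own host, so that two
        # mirrors on different hosts yield comparable edge sets.
        if s.netloc == d.netloc:
            edges.add((s.path or "/", d.path or "/"))
    return edges


def link_similarity(links_a, links_b):
    """Jaccard overlap of the internal link graphs of two collections."""
    ea, eb = relative_edges(links_a), relative_edges(links_b)
    if not ea and not eb:
        return 0.0  # no internal links on either side, so no evidence
    return len(ea & eb) / len(ea | eb)


site_a = [
    ("http://a.example/", "http://a.example/docs"),
    ("http://a.example/docs", "http://a.example/faq"),
]
site_b = [
    ("http://b.example/", "http://b.example/docs"),
    ("http://b.example/docs", "http://b.example/faq"),
]
score = link_similarity(site_a, site_b)
```

A score near 1.0 suggests the two hosts may mirror each other and that one of them could be skipped during crawling; the cutoff to use is a judgment call.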
