Deduplication
CSCI 572: Information Retrieval and Search Engines
Summer 2010

May-20-10  CS572-Summer2010  CAM-2

Outline
- What is Deduplication?
- Importance
- Challenges
- Approaches

What are web duplicates?
- The same page, referenced by different URLs
  - http:/
  - http:/
- What are the differences?
  - URL host (virtual hosts), sometimes protocol, sometimes page name, etc.

What are web duplicates?
- Near identical page, referenced by the same URLs
  - Google search for “search engines”
  - Google search for “search engines”
- What are the differences?
  - The page is within some delta % similar to the other (where delta is a large number), but may differ in, e.g., ads, counters, timestamps, etc.

Why is it important to consider duplicates?
- In search engines, URLs tell the crawlers where to go and how to navigate the information space
- Ideally, given the web's scale and complexity, we'll give priority to crawling content that we haven't already stored or seen before
  - Saves resources (on the crawler end, as well as the remote host)
  - Increases crawler politeness
  - Reduces the analysis that we'll have to do later

Why is it important to consider duplicates?
- Identification of website mirrors (or copies of content) used to spread the load and bandwidth consumption
  - S, CPAN, Apache, etc.
- If you identify a mirror, you can omit crawling many web pages and save crawler resources

“More Like This”
- Finding similar content to what you were looking for
- As we discussed during the lecture on the search engine architecture, much of the time in search engines is spent filtering through the results
- Presenting similar documents can cut down on that filtering time

XML
- XML documents structurally appear very similar
  - What's the difference between RSS and RDF and OWL and XSL and XSLT and any number of XML documents out there?
- With the ability to identify similarity and reduce duplication of XML, we could identify
  - XML documents with similar structure
  - RSS feeds that contain the same links
- Differentiate RSS (crawl more often) from other less frequently updated XML

Detect Plagiarism
- Determine web sites and reports that plagiarize one another
- Important for copyright laws and enforcement
- Determine similarity between source code
  - Licensing issues
  - Open Source, other

Detection of SPAM
- Identifying malicious SPAM content
  - Adult sites
  - Pharmaceutical drug and prescription drug SPAM
  - Malware and phishing scams
- Need to ignore this content from a crawling perspective
  - Or to “flag” it and not include it in (general) search results

Challenges
- Scalability
  - Most approaches to detecting duplicates rely on training and analytical approaches that may be computationally expensive
  - The challenge is to perform the evaluation at low cost
- What to do with the duplicates?
  - The answer isn't always to throw them out; they may be useful for study
  - The content may require indexing for later comparison in legal issues, or for “snapshot”ing the web at the time, i.e., the Internet Archive

Challenges
- Structure versus Semantics
  - Documents that are structurally dissimilar may contain the exact same content
  - Think of the use of tags to emphasize versus tags in HTML
  - Need to take this into account
- Online versus offline
  - Depends on crawling strategy, but offline typically can provide more precision at the cost of inability to dynamically react

Approaches for Deduplication
- SIMHASH and Hamming Distance
  - Treat web documents as a set of features, constituting an n-dimensional vector; transform this vector into an f-bit fingerprint of a small size, e.g., 64
  - Compare fingerprints and look for a difference in at most k bits
  - Manku et al., WWW 2007
- Syntactic similarity: Shingling
  - Treat web documents as continuous subsequences of words
  - Compute the w-shingling
  - Broder et al., WWW 1997

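The SIMHASH bullets above can be illustrated with a minimal sketch. It is not the implementation from Manku et al.; it only follows the outline on the slide. The feature extractor (lowercase word tokens weighted by frequency), the use of MD5 as the per-feature hash, and the helper names `features`, `simhash`, `hamming_distance`, and `near_duplicates` are all assumptions made for illustration. The defaults f = 64 and k = 3 match the values reported in the WWW 2007 paper.

```python
import hashlib
import re
from collections import Counter


def features(text: str) -> Counter:
    """Assumed feature extractor: lowercase word tokens, weighted by frequency."""
    return Counter(re.findall(r"\w+", text.lower()))


def simhash(text: str, f: int = 64) -> int:
    """Fold a weighted feature vector into an f-bit fingerprint (f <= 128 here)."""
    totals = [0] * f
    mask = (1 << f) - 1
    for token, weight in features(text).items():
        # Stable hash of the feature; MD5 is an arbitrary choice for this sketch.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for bit in range(f):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(f):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(a: str, b: str, k: int = 3, f: int = 64) -> bool:
    """Flag two documents whose fingerprints differ in at most k bits."""
    return hamming_distance(simhash(a, f), simhash(b, f)) <= k
```

Because the fingerprint is built from a bag of features, word order does not affect it in this sketch. At web scale, the pairwise comparison in `near_duplicates` would have to be replaced by an index over fingerprints; that part is not shown here.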
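The shingling bullets above can be sketched in the same spirit. This is a minimal illustration under stated assumptions: tokenization by lowercase word characters, a default window of w = 4, and the helper names `shingles` and `resemblance` are choices made here, not taken from the slides. The resemblance measure (size of the intersection of two shingle sets over the size of their union) is the one defined by Broder et al.

```python
import re


def shingles(text: str, w: int = 4) -> set:
    """Return the set of contiguous w-word subsequences (the w-shingling)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < w:
        # Assumption: a document shorter than w words yields one short shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard overlap of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # Assumption: two empty documents count as identical.
    return len(sa & sb) / len(sa | sb)
```

For example, `shingles("a rose is a rose is a rose", 4)` yields three distinct shingles: (a, rose, is, a), (rose, is, a, rose), and (is, a, rose, is). Two pages could then be treated as near duplicates when `resemblance` exceeds a chosen threshold; the threshold itself is a tuning decision this sketch does not make.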
Approaches for Deduplication
- Link structure similarity
  - Identify similarity in the linkages between web collections
  - Choo et al.

Approaches for Deduplication
- Exploiting the structure and links between physical network hosts
- Look at:
  - Language
  - Geographical connection
  - Continuations and proxies
  - Zipfian function
- Bharat et al., ICDM 2001

Wrapup
- Need deduplication for conserving resources and ensuring quality and accuracy of the resultant search indices
- Can assist in other areas like plagiarism, SPAM detection, fraud detection, etc.
- Deduplication at web scale is difficult; need efficient means to perform this computation online or offline
- Techniques look at page structure/content, page link structure content, or physical web node structure