Inconsistent and Uncertain Data Management (English version) course slides.ppt
Efficient Management of Inconsistent and Uncertain Data
Renée J. Miller
University of Toronto

Contributors
- Ariel Fuxman, PhD Thesis
  - Microsoft Search Labs
  - Jim Gray SIGMOD 2008 Dissertation Award
- Periklis Andritsos, PhD
- Jiang Du, MS
- Elham Fazli, MS
- Diego Fuxman, Undergrad

Dirty Databases
- The presence of dirty data is a major problem in enterprises
- Traditional solution: data cleaning
[Cartoon caption: "No. I don't see any problem with the data"]

Limitations of Data Cleaning
- Semi-automatic process
  - Requires highly qualified domain experts
  - Time consuming
- May not be possible to wait until the database is clean
  - Operational systems answer queries assuming clean data

Our Work
- Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases
- Show how to do it efficiently, reusing existing database technology

Why is this Business Intelligence?
- Business intelligence (BI) refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of information.
- The goal of BI is to support better decision making, based on information.
- A DBMS should provide meaningful query answers even over data that is dirty

Outline
[x] Introduction
[ ] Semantics for dirty databases
[ ] Contributions
[ ] Conclusions

A Data Integration Example
Integrating customer data
Sources: Sales, Shipping, Customer Support, Web Forms, Demographic Data
Target: Integrated Customer Database

Matching and Merging
[Figure: customer tables from the Web and Sales sources]
Matching and merging are two fundamental tasks in data integration

True Disagreement Between Sources
[Figure: customer tables from the Web and Sales sources]
What's Peter's salary?

Inconsistent Integrated Databases
In the absence of complete resolution rules
[Figure: the Web and Sales tables each SATISFY the custid KEY; the Inconsistent Integrated Database VIOLATES the custid KEY]

Querying Inconsistent Databases
Example: Offering a Platinum credit card
Query: "Get customers who make more than 100K"
[Figure: integrated table with each tuple tagged by its source: sales, web, or sales/web]
Answer: Peter, Paul, Mary
Are we sure that we want to offer a card to Peter?

Querying Inconsistent Databases
- Aggressive: Get customers who possibly make more than 100K
  - Peter, Paul, Mary
- Conservative: Get customers who certainly make more than 100K
  - Paul, Mary

Formal Semantics
- Related to semantics for querying incomplete data [Imielinski, Lipski 84; Abiteboul, Duschka 98]
  - Possible world: "complete" databases
- Consistent answers
  - Proposed by Arenas, Bertossi, and Chomicki in 1999
  - Corresponds to conservative semantics
  - Possible world: "consistent" databases

Consistent Answers
Key: custid
[Figure: the inconsistent database and its repairs]

Consistent Answers
CONSISTENT ANSWERS: answers obtained no matter which repair we choose
Query q = "Get customers who make more than 100K"
[Figure: q evaluated over each of the repairs]
CONSISTENT ANSWER = Paul, Mary

Outline
[x] Introduction
[x] Semantics for dirty databases
[ ] Contributions
[ ] Conclusions

When We Started
- Semantics well understood
- Problem
  - Potentially HUGE number of repairs!
- Negative results [Chomicki et al. 02, Arenas et al. 01, Cali et al. 04]
- Few tractability results [Arenas et al. 99, Arenas et al. 01]
- Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03]
  - Expressive queries and constraints
  - Computationally expensive
  - Applicable only to small databases with a small number of inconsistencies

Our Proposal: ConQuer
[Figure: an SQL query q and the keys go into ConQuer's rewriting algorithm, which produces a rewritten SQL query Q*; a commercial database engine runs Q* over the inconsistent database and returns the consistent answer to q]

Class of Rewritable Queries
- ConQuer handles a broad class of SPJ queries with
  - Set semantics
  - Bag semantics, grouping, and aggregation
- No restrictions on
  - Number of relations
  - Number of joins
  - Conditions or built-in predicates
  - Key-to-key joins
- The class is "maximal"

Why not all SPJ queries?
- Some SPJ queries cannot be rewritten into SQL
  - Consistent query answering is coNP-complete even for some SPJ queries and key constraints
- Maximality of ConQuer's class
  - Minimal relaxations lead to intractability
- Restrictions only on
  - Nonkey-to-nonkey joins
  - Self joins
  - Nonkey-to-key joins that form a cycle

Example: A Rewritable Query
SELECT c_custkey, c_name,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       c_acctbal, n_name, c_address, c_phone, c_comment
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate >= '1993-10-01'
  and o_orderdate < date('1993-10-01') + 3 MONTHS
  and l_returnflag = 'R'
  and c_nationkey = n_nationkey
GROUP BY
  c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
ORDER BY revenue desc

(TPC-H Query 10)

Rewritings Can Get Quite Complex
[Figure: full text of the rewriting of TPC-H Query 10]
Can this rewriting be executed efficiently?
- 1.7 overhead
- 20 GB database, 5% inconsistency

Experimental Evaluation
- Goals
  - Quantify the overhead of the rewritings
  - Assess the scalability of the approach
  - Determine sensitivity of the rewritten queries to the level of inconsistency of the instance
- Queries and databases
  - Representative decision support queries (TPC-H benchmark)
  - TPC-H databases, altered to introduce inconsistencies
- Database parameters
  - Database size
  - Percentage of the database that is inconsistent
  - Conflicts per key value (in the inconsistent portion)

Scalability
[Figure: plot against database size, Size (GB); 5% inconsistent tuples, 2 conflicts per inconsistent key value]
- Worst case: 5.8 overhead, selectivity 98.56%
- Best case: 1.2 overhead, selectivity 0.001%

Contributions: Theory
- Formal characterization of a broad class of queries
  - For which computing consistent answers is tractable under key constraints
  - That can be rewritten into first-order/SQL
- Query rewriting algorithms for a class of Select-Project-Join queries
  - With set semantics
  - With bag semantics, grouping, and aggregation
- Maximality of the class of queries

Contributions: Practice
- Implementation of ConQuer
  - Designed to compute consistent answers efficiently
  - Multiple rewriting strategies
- Experimental validation of efficiency and scalability
  - Representative queries from TPC-H
  - Large databases

Uncertain Data
Source tables (Web, Sales):

custid  income
Peter   40K
Paul    400K
Mary    110K

custid  income
Peter   200K
Paul    400K
Mary    130K

Integrated Database:

custid  income
Peter   40K
Peter   200K
Paul    400K
Mary    110K
Mary    130K
PROVENANCE INFORMATION (e.g., source reputation)
- Source weights: 0.3 and 0.7
- Tuple weights in the Integrated Database, in row order: 0.3, 0.7, 1, 0.3, 0.7

Publications and Demo
- These and other contributions appear in
  - ICDT05/JCSS06
  - SIGMOD05
  - ICDE06
  - PODS06/TODS06
  - VLDB06
- Demo given at VLDB05
  - http://queens.db.toronto.edu/project/conquer/demo2/

Outline
[x] Introduction
[x] Semantics for dirty databases
[x] Contributions
[ ] Conclusions

A Virtuous Cycle
[Figure: a cycle linking Data Integration and Query Answering]
- Recognize and characterize inconsistent data
- Use knowledge about inconsistencies to:
  - give better answers
  - suggest ways to clean the database

Beyond the Enterprise
- Can we apply principled models of inconsistency or uncertainty to the Web?
- Different assumptions
  - Uncertainty in queries
  - There's never a "true" answer
- Challenge
  - Build models based on user preferences
  - Leverage massive repositories of user behavior data

THANK YOU
Plug: Discovering Data Quality Rules, Fei Chiang
Thursday 11:15am, Research Session 33
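
Sketch: Possible and Consistent Answers by Enumerating Repairs
The "Consistent Answers" slides define a consistent answer as one obtained no matter which repair we choose. The following is a minimal brute-force illustration of that definition, not ConQuer's method: it assumes the only constraint is the key custid, so a repair keeps exactly one tuple per custid. The table contents come from the "Uncertain Data" slide; the function names (repairs, high_earners, answers) are hypothetical and chosen for this sketch.

```python
from itertools import product

# Integrated customer table from the "Uncertain Data" slide; custid is the key.
INTEGRATED = [
    ("Peter", 40_000),
    ("Peter", 200_000),
    ("Paul", 400_000),
    ("Mary", 110_000),
    ("Mary", 130_000),
]


def repairs(rows):
    """Yield every repair: exactly one tuple kept per key value (hypothetical helper)."""
    groups = {}
    for custid, income in rows:
        groups.setdefault(custid, []).append((custid, income))
    # One choice per key group; the number of repairs is the product of group sizes.
    for choice in product(*groups.values()):
        yield list(choice)


def high_earners(db, threshold=100_000):
    """The query q: customers who make more than the threshold."""
    return {custid for custid, income in db if income > threshold}


def answers(rows):
    """Return (possible, consistent) answers to q across all repairs."""
    results = [high_earners(r) for r in repairs(rows)]
    possible = set().union(*results)          # aggressive semantics
    consistent = set.intersection(*results)   # conservative semantics
    return possible, consistent


possible, consistent = answers(INTEGRATED)
print(sorted(possible))    # ['Mary', 'Paul', 'Peter']
print(sorted(consistent))  # ['Mary', 'Paul']
```

The number of repairs is the product of the conflict-group sizes (here 2 x 1 x 2 = 4), which is why enumeration does not scale and why the talk turns to query rewriting instead.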
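
Sketch: A Rewritten Query Executed by an Ordinary SQL Engine
The "Our Proposal: ConQuer" slide describes rewriting a query q into an SQL query Q* whose result over the inconsistent database is the consistent answer to q. The sketch below shows the idea on the deck's running example only. The rewriting is a hand-written first-order rewriting for this one single-table query under the key custid; it is an assumption of this sketch and is not output produced by ConQuer. SQLite (through Python's sqlite3 module) stands in for the commercial database engine, and the table name customer and helper run are hypothetical.

```python
import sqlite3

# Integrated customer table from the "Uncertain Data" slide; custid should be a key.
ROWS = [
    ("Peter", 40_000),
    ("Peter", 200_000),
    ("Paul", 400_000),
    ("Mary", 110_000),
    ("Mary", 130_000),
]

# Original query q: "Get customers who make more than 100K".
ORIGINAL_Q = "SELECT DISTINCT custid FROM customer WHERE income > 100000"

# Hand-written rewriting Q*: keep a customer only if no tuple with the same
# key fails the condition, so the customer qualifies in every repair.
REWRITTEN_Q = """
SELECT DISTINCT c.custid
FROM customer AS c
WHERE c.income > 100000
  AND NOT EXISTS (
        SELECT 1
        FROM customer AS c2
        WHERE c2.custid = c.custid
          AND NOT (c2.income > 100000))
"""


def run(query):
    """Load ROWS into an in-memory SQLite database and return sorted custids."""
    conn = sqlite3.connect(":memory:")
    # No PRIMARY KEY is declared, because the data violates the key on custid.
    conn.execute("CREATE TABLE customer (custid TEXT, income INTEGER)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", ROWS)
    result = sorted(row[0] for row in conn.execute(query).fetchall())
    conn.close()
    return result


print(run(ORIGINAL_Q))   # ['Mary', 'Paul', 'Peter']
print(run(REWRITTEN_Q))  # ['Mary', 'Paul']
```

Running q directly returns the aggressive (possible) answers, while the rewritten query returns only Paul and Mary, matching the conservative semantics, and it does so without materializing any repair.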