機械学習.pptx (Machine Learning)
[Slide 1] Machine Learning (機械学習)
Takashi Okada (岡田孝), TUT, 2000/06/07

[Slide 2] Knowledge Discovery
- Knowledge discovery and intelligent data analysis
- Very large data volumes
- Diversifying targets: data on the Web
(some bullet text lost in extraction)

[Slide 3] The Knowledge Discovery Process
- Core DBs and external DBs → extraction → transformation → integration → mining → evaluation → visualization → knowledge

[Slide 4] Data Mining Techniques
- Statistics
- Pattern recognition
- Decision trees
- Rough sets
- Association rules
- Graph-Based Induction
- Inductive logic programming
- Variable selection
- Visualization

[Slide 5] What is supervised learning?
- Input instances contain a class attribute and explanation attributes.
- Generate rules inductively that describe the classes: IF conditions THEN class
- Learning from examples; incorporation of background knowledge
- cf. regression, discriminant analysis, neural networks, nearest neighbor

[Slide 6] Typical applications
- Knowledge acquisition for plant-operation expert systems
- Action prediction of opponent teams in sports matches
- Diagnosis from medical tests
- Discovery of active motifs in chemical compounds from structure-activity relationship datasets

[Slide 7] Classification of Problems
(figure)

[Slide 8] Streams in learning research: I. Classification

[Slide 9] Decision Tree Methods

[Slide 10] A Decision Tree Example
- Attributes: height (tall/short), hair color (red/black/brown), eye color (blue/brown)
- The tree splits on hair color and then on eye color
(instance table garbled in extraction)

[Slide 11] Variable Selection by Average Information
- Average information (entropy) of the class distribution
- Entropy before the split

[Slide 12] Entropy After the Split: Information Gain
- Split on height: gain 0.003 bit
- Split on hair color: gain 0.454 bit
- Split on eye color: gain 0.347 bit

[Slide 13] Numeric Attributes and Attribute Combinations: a Diabetes Diagnosis Tree
(figure)

[Slide 14] Progress in Decision Trees
- Variables with continuous values
- Entropy gain ratio, Gini index
- Sampling
- Pruning
- Bagging, boosting
- User interface: interactive expansion of a tree; visualization; rules

[Slide 15] Gini Index vs. Entropy
- Gini index = Σ Pi (1 - Pi) = 1 - Σ Pi²
(figure)

[Slide 16] Decision Tree Methods: References
- 秋葉, 金田: 事例からの学習 (learning-from-examples techniques and their applications; title partly garbled), 情報処理学会誌, Vol.39, No.2, pp.145-151; No.3, pp.245-251 (1998).
- Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J.: Classification and Regression Trees, Wadsworth & Brooks/Cole (1984). [CART]
- Quinlan, J.R.: C4.5: Programs for Machine Learning, Morgan Kaufmann (1993). Japanese translation by 古川 (1995).

[Slide 17] Streams in learning research: III. Rough Sets
- Characteristics:
  - Non-exploratory
  - A methodology for decision tables
  - Analysis of variable dependencies
  - NP-hard in the numbers of attributes and attribute values
- References:
  - Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers (1991).
  - Ziarko, W.: Review of Basics of Rough Sets in the Context of Data Mining, Proc. Fourth International Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, pp.447-457, Tokyo (1996).
  - Datalogic/R, Reduct Systems Inc.
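The information-gain figures quoted above (hair 0.454 bit, eyes 0.347 bit, height 0.003 bit) can be reproduced with a short sketch. Because the slide's own instance table is garbled, the eight instances below are an assumption: they are the classic learning-from-examples illustration with the same attribute structure, and they match the quoted gains to rounding. The `gini` helper corresponds to the Gini index formula quoted a few slides later.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Average information (bits) of a class distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(rows, attr, class_attr="class"):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(rows)
    before = entropy([r[class_attr] for r in rows])
    after = 0.0
    for v in {r[attr] for r in rows}:
        part = [r[class_attr] for r in rows if r[attr] == v]
        after += len(part) / n * entropy(part)
    return before - after

# Hypothetical reconstruction of the slide's eight instances
# (the table itself is garbled in the source).
data = [
    {"height": "short", "hair": "blond", "eyes": "blue",  "class": "+"},
    {"height": "tall",  "hair": "blond", "eyes": "brown", "class": "-"},
    {"height": "tall",  "hair": "red",   "eyes": "blue",  "class": "+"},
    {"height": "short", "hair": "dark",  "eyes": "blue",  "class": "-"},
    {"height": "tall",  "hair": "dark",  "eyes": "blue",  "class": "-"},
    {"height": "tall",  "hair": "blond", "eyes": "blue",  "class": "+"},
    {"height": "tall",  "hair": "dark",  "eyes": "brown", "class": "-"},
    {"height": "short", "hair": "blond", "eyes": "brown", "class": "-"},
]

for a in ("height", "hair", "eyes"):
    print(a, round(info_gain(data, a), 3))
# hair gains about 0.454 bit, eyes about 0.348, height about 0.003,
# matching the figures quoted on the slide.
```

Splitting on the highest-gain attribute (hair) first is exactly the ID3-style heuristic the slides describe.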
[Slide 18] Rough Sets
- Positive region, boundary region, negative region
(figure)

[Slide 19] Computation Steps: Discretization and Classification
- Reduct1 = {Size, Height, Energy}
- Reduct2 = {Size, Height, Current}
- Core = {Size, Height}

[Slide 20] Explanation Variables P and Target Variable Q
- P = {Size, Height, Energy, Current}; Q = {Temperature}
- Reduct1(P,Q) = {Height, Energy}
- Reduct2(P,Q) = {Height, Current}
- Core(P,Q) = {Height}

[Slide 21] Computation Steps: Rule Derivation from the Decision Matrix
- Simplified entries (connectives lost in extraction): B11 = (E,2) ∨ (C,1); B12 = (H,2) ∨ (C,1); B13 = (H,2); B14 = (H,2) ∨ (E,2) ∨ (C,1); B15 = (E,2) ∨ (C,1)
- Derived rules: (Energy = 2) → (Temperature = 1); (Height = 2) → (Temperature = 1); (Current = 1) → (Temperature = 1)

[Slide 22] Variable Precision Rough Set Model
- Positive region, boundary region, negative region
(figure)

[Slide 23] Variable Dependency Analysis
- Necessary and sufficient variable sets: Reducts 1-5 and their common Core
(figure)

[Slide 24] Cars Example
- Reducts:
  (1) cyl, fuelsys, comp, power, weight
  (2) size, fuelsys, comp, power, weight
  (3) size, fuelsys, displace, weight
  (4) size, cyl, fuelsys, power, weight
  (5) cyl, turbo, fuelsys, displace, comp, trans, weight
  (6) size, cyl, fuelsys, comp, weight
  (7) size, cyl, turbo, fuelsys, trans, weight
- Core: fuelsys, weight
- Ziarko: The discovery, analysis, and representation of data dependencies in databases, in Piatetsky-Shapiro & Frawley (eds.), Knowledge Discovery in Databases, pp.195-209, AAAI Press (1991).
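A minimal sketch of reduct and core computation on a decision table, in the spirit of the rough-set slides above: a reduct is a minimal subset of condition attributes that still classifies consistently, and the core is the intersection of all reducts. The toy table below is invented for illustration (the actual data behind the slides' reducts is not in the source); only the attribute names echo the slides.

```python
from itertools import combinations

def consistent(rows, attrs, decision):
    """True if no two rows agree on all attrs while differing in the decision."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if seen.setdefault(key, r[decision]) != r[decision]:
            return False
    return True

def reducts_and_core(rows, conditions, decision):
    """Reducts: minimal consistent attribute subsets. Core: their intersection."""
    reducts = []
    for k in range(1, len(conditions) + 1):
        for subset in combinations(conditions, k):
            if consistent(rows, subset, decision) and \
               not any(set(r) <= set(subset) for r in reducts):
                reducts.append(subset)
    core = set(conditions)
    for r in reducts:
        core &= set(r)
    return [set(r) for r in reducts], core

# Invented toy decision table; Size and Height carry the same information,
# which is exactly what produces multiple reducts.
table = [
    {"Size": 0, "Height": 0, "Energy": 0, "Temp": 0},
    {"Size": 0, "Height": 0, "Energy": 1, "Temp": 1},
    {"Size": 1, "Height": 1, "Energy": 0, "Temp": 1},
    {"Size": 1, "Height": 1, "Energy": 1, "Temp": 1},
]

reds, core = reducts_and_core(table, ("Size", "Height", "Energy"), "Temp")
print(reds)   # two reducts: {Size, Energy} and {Height, Energy}
print(core)   # {'Energy'}
```

The exhaustive subset search also makes the NP-hardness noted on the slides concrete: the loop is exponential in the number of attributes.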
[Slide 25] Reduct & Core Effects on the Sum of Squares
(figure: SS contributions of size, cyl, turbo, fuelsys, displace, comp, power, trans, and weight to net-power; vertical scale 0-12)

[Slide 26] The Rough Set Method as a Data Analysis Tool
- Produces rules that are very good for understanding, despite:
  - Too many reducts
  - The number of reducts changes with the confidence value in VPRSM
  - Disregard of frequencies

[Slide 27] Rough Sets: References and Methodological Characteristics
- Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers (1991).
- Ziarko, W.: Review of Basics of Rough Sets in the Context of Data Mining, Proc. Fourth International Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, pp.447-457, Tokyo (1996).
- Datalogic/R, Reduct Systems Inc.
- Methodological characteristics:
  - A methodology for discretized data representations
  - Knowledge can be acquired from co-occurrence distributions
  - Computational cost: exp(N) in the number of cases N and in the numbers of attributes and attribute values

[Slide 28] Streams in learning research: II. Characteristic Rules
- Evaluation by usefulness
- Patterns with accuracy & support
- Statistical estimation of generality and accuracy
  - 鈴木 (1999): simultaneous evaluation of the generality, accuracy, and reliability of characteristic rules, 人工知能学会誌, 14, 139-147.
- Exceptions as interestingness
  - 鈴木, 志村 (1997): discovery of exceptional knowledge by information-theoretic methods, 人工知能学会誌, 12, 305-312.
- Rating usefulness by human estimation; rule generation by a genetic algorithm
  - Terano, T. and Ishino, Y. (1996): Interactive knowledge discovery from marketing questionnaire using simulated breeding and inductive learning methods, Proc. KDD-96, 279-282.
- Market basket analysis

[Slide 29] Association Rule Mining (相関ルール抽出)
(figure)

[Slide 30] Apriori Algorithm
- Candidate itemsets (候補集合)
(figure)

[Slide 31] Time-Series Data Analysis (1)
(figure)

[Slide 32] Time-Series Data Analysis (2)
(figure)

[Slide 33] Introducing a Classification Hierarchy
- Beverages: soft drinks, light alcoholic drinks, strong alcoholic drinks (飲料 > 清涼飲料, 弱酒類, 強酒類)

[Slide 34] Handling Numeric Attributes
- Discretization
- Merge adjacent ranges while max-support is not exceeded; allow multiple ranges
- Compute frequent itemsets and derive rules
- Prune by rule interest
- Set intervals using the partial-completeness concept to ensure soundness
- Srikant, R. & Agrawal, R.: Mining Quantitative Association Rules in Large Relational Tables, Proc. ACM SIGMOD, pp.1-12 (1996).
(figure: "People" table and part of the frequent itemsets)
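The level-wise candidate generation of Apriori sketched above can be written compactly: candidates for level k+1 are joins of frequent level-k sets, pruned whenever any k-subset is infrequent. The basket data below is hypothetical, and `minsup` plays the same role as the support threshold in the slides.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining (Apriori)."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items if support(s) >= minsup}
    frequent = set(level)
    while level:
        size = len(next(iter(level))) + 1
        # Join step: unions of two frequent sets that give a (k+1)-set.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, size - 1))}
        level = {c for c in candidates if support(c) >= minsup}
        frequent |= level
    return frequent

# Hypothetical market-basket data for illustration.
baskets = [frozenset(t) for t in (
    {"beer", "diaper", "bread"},
    {"beer", "diaper"},
    {"bread", "milk"},
    {"beer", "diaper", "bread", "milk"},
)]
result = apriori(baskets, minsup=0.5)
print(sorted(map(sorted, result)))
```

With minsup = 0.5 the triple {beer, bread, diaper} survives, while any candidate containing the infrequent pair {beer, milk} is pruned before counting, which is the whole point of the algorithm.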
[Slide 35] Factor Analysis by Introducing Virtual Items
- 沼尾, 清水: application to the retail industry (流通業; title partly garbled), 人工知能学会誌, Vol.12, No.4, pp.528-535 (1997).

[Slide 36] Time-Series Examples
- Symbolization and pattern recognition of time series
- Supervised inductive learning
- Maximum delay time before a failure occurs; pressure and temperature patterns
- IF pressure rises AND temperature falls THEN failure occurs (probability 80%)
- 佐藤: application to plant technology, 情報処理学会関西支部研究会.

[Slide 37] Extending the Structure: History Data Analysis
- 猪口 et al.: 人工知能学会基礎論研究会 SIG-FAI-9801-10, pp.55-60 (1998).

[Slide 38] Association Rule Search
(figure)

[Slide 39] Graph-Based Induction: Stepwise Pair Expansion
- Step 1: rewrite the input
- Step 2: find the most frequent pairs in the input (text partly garbled)
- Step 3: selection and extraction
- 吉田, 元田: inductive inference based on stepwise pair expansion, 人工知能学会誌, Vol.12, pp.58-67 (1997).

[Slide 40] GBI Applied to Command-History Analysis
- Commands and files: emacs, latex, xdvi, dvi2ps, lpr; paper.tex → paper.dvi → paper.ps
(figure: command/file graph)

[Slide 41] Characteristics of Graph-Based Induction
- Fast; can analyze structured data
- Applicable to anything from concept acquisition and classification-rule learning to inference speed-up
- Applications to sequences (DNA, proteins) need some device for expressing negative conditions
- Limited to ordered graphs
- Rules and concepts are limited to connected graphs
- Duplicated substructures and other complex cases are hard to handle (text partly garbled)

[Slide 42] Inductive Logic Programming: the Simplest Example
- Background knowledge: parent(1,2). parent(1,3). …
- Positive examples: grandparent(1,4). grandparent(1,5). (everything else is negative)
- Result: grandparent(X,Y) :- parent(X,Z), parent(Z,Y).

[Slide 43] Hypothesis Search in the Version Space
- grandparent(X,Y) :- ?
- Covering sets
- Adding new literals; instantiating variables to constants
- Hypothesis selection by the minimum description length principle
- FOIL (Quinlan, 1990): best-first search guided by entropy
- Progol (Muggleton, 1995): search-space reduction by inverse entailment
(figure: accepted vs. rejected hypotheses; positive vs. negative examples)

[Slide 44] Identifying Mutagenic Compounds with Progol
- 230 compounds: Ames test positive 138 / negative 92; Debnath et al.: J. Med. Chem. 34: 786-797 (1991).
- Multiple regression analysis had been carried out
- Progol: analyzed as a split of 188 compounds (12 hr) and 42 compounds (6 hr)
- atm(compound, atom, element, type, charge). bond(compound, atom1, atom2, bondtype).
- 9 rules; classification accuracy similar to regression
- Automatic discovery of indicator variables: the phenanthrene skeleton; exceptional acetylenes
- Difficulty of use and long computation times
- King et al.: Relating chemical activity to structure: an examination of ILP successes, New Generation Computing, Vol.13, pp.411-433 (1995).
(figure: input and rule)
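The grandparent example above can be checked mechanically. The slide's fact list is truncated, so the extra facts parent(2,4) and parent(3,5) are assumed here to make the quoted positive examples derivable; the sketch only tests coverage of the learned clause against positives and negatives, not the ILP search itself.

```python
# Background knowledge; parent(2,4) and parent(3,5) are assumptions,
# since the slide's fact list is truncated.
parents = {(1, 2), (1, 3), (2, 4), (3, 5)}

def grandparent(x, y):
    """The clause learned on the slide:
    grandparent(X,Y) :- parent(X,Z), parent(Z,Y)."""
    return any((x, z) in parents and (z, y) in parents
               for z in {b for (_, b) in parents})

positives = {(1, 4), (1, 5)}
negatives = {(1, 2), (2, 3), (4, 5)}  # a few of the implicit negatives

covered = {p for p in positives if grandparent(*p)}
wrongly = {n for n in negatives if grandparent(*n)}
print(covered, wrongly)  # all positives covered, no negatives covered
```

A hypothesis that covers all positives and no negatives is exactly what the version-space search on the next slide accepts.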
[Slide 45] Inductive Logic Programming: References
- 人工知能学会誌 小特集: 帰納論理プログラミング, Vol.12, No.5, pp.654-688 (1997).
- Lavrac, N. & Dzeroski, S.: Inductive Logic Programming: Techniques and Applications, Ellis Horwood, Hertfordshire (1994).
- Dzeroski, S.: Inductive Logic Programming and Knowledge Discovery in Databases, in Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, pp.117-152, AAAI Press (1996).
- Quinlan, J.R.: Learning Logical Definitions from Relations, Machine Learning, Vol.5, pp.239-266 (1990).
- Muggleton, S.: Inductive Logic Programming, New Generation Computing, Vol.8, pp.295-318 (1991); Inverse Entailment and Progol, ibid., Vol.13, pp.245-286 (1995).
- King, R.D. et al.: Relating Chemical Activity to Structure: an examination of ILP successes, ibid., Vol.13, pp.411-433 (1995).
- gruffle lab.ox.ac.uk/oucl/groups/machlearn/ (URL partly garbled in the source)

[Slide 46] Reference Materials (参考資料)

[Slide 47] Akame 48 Waterfalls (赤目四十八滝), Mie Prefecture
(photo)

[Slide 48] Rule Induction as a Data Analysis Tool
- Rules accurate? Yes.
- Software available? Yes.
- Computing fast? Yes.
- Easy understanding? Yes.
- Popular? No.

[Slide 49] Possible Reasons
- Conservative users
- Unix environment
- No familiar examples
- Too many methods
- Too many rules
- Self-evident rules
- Impressions: ad hoc methods; exploratory

[Slide 50] Response of Users to Expected Results
- Regression by a few variables:
  - TSS = ESS + RSS (100% = 99% + 1%)
  - Hypothesis confirmed: satisfactory
- Rule induction:
  - A few simple rules (average accuracy 99%, sum of coverage 99%)
  - Self-evident rules: unsatisfactory
(figure: with vs. without Datascape)

[Slide 51] What is Datascape?
- Quantification: of the problem and of the solution
- Multiple data dependencies: explanation from plural viewpoints; correlation among explanation variables; concise, levelwise-deepening descriptions
- Views of a solution: inspection of individual data; surroundings of the solution

[Slide 52] Answers to Datascape by the Cascade Model
- Quantification by SS (sum of squares)
- Data dependencies: detection of local interactions; a unified mechanism for discrimination rules and characteristic rules
- Levelwise creation of rule sets

[Slide 53] Problem in Decision Trees 1
- Heuristic search is used to get the best tree.
(figure: two alternative trees over attributes A, B, C with class counts such as A 4/4, B 3/1, C 1/1)
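Throughout this part of the deck, rule quality is measured by accuracy (correct fraction among matched cases) and cover (matched fraction of all cases). A minimal sketch of those two measures, over an invented ten-instance table:

```python
def evaluate(rule_conditions, rule_class, rows):
    """Accuracy = P(class | conditions); cover = P(conditions)."""
    matched = [r for r in rows
               if all(r[a] == v for a, v in rule_conditions.items())]
    correct = [r for r in matched if r["class"] == rule_class]
    accuracy = len(correct) / len(matched) if matched else 0.0
    cover = len(matched) / len(rows)
    return accuracy, cover

# Invented ten-instance table over attributes A and B.
rows = ([{"A": "a1", "B": "b1", "class": "pos"}] * 4 +
        [{"A": "a1", "B": "b1", "class": "neg"}] * 1 +
        [{"A": "a2", "B": "b2", "class": "neg"}] * 3 +
        [{"A": "a2", "B": "b1", "class": "pos"}] * 2)

acc, cov = evaluate({"A": "a1", "B": "b1"}, "pos", rows)
print(acc, cov)  # 0.8 and 0.5
```

Computing both numbers for every rule makes the ordering problem on the following slides concrete: a rule can win on accuracy while losing on cover, so neither measure alone ranks a rule set.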
[Slide 54] Problem in Decision Trees 2
- Distinction between main conditions and preconditions
- IF c1 and b1 THEN positive. Accuracy: 100%; cover: 25%
- IF b1 added on c1 THEN positive. Accuracy: 50% → 100%; cover: 25%
(figure: tree with root C 4/4)

[Slide 55] Problem in Decision Trees 3
- Discrimination power of a rule: how can we order rules?
- Pruned rule set: accuracy 85%, coverage 100%
  - IF a1 and b1 THEN positive. Accuracy: 80%; cover: 25%
  - IF a1 and b2 THEN negative. Accuracy: 100%; cover: 25%
  - IF a2 and b1 THEN positive. Accuracy: 100%; cover: 15%
  - IF a2 and b2 THEN negative. Accuracy: 71%; cover: 35%
(figure: tree with root A 11/9)

[Slide 56] SAR Discovery on the Mutagenicity of Aromatic Nitro Compounds Studied by the Cascade Model
- Takashi Okada, Kwansei Gakuin University

[Slide 57] Contents
- Previous work on this data set: regression; inductive logic programming; aim of this work
- Cascade model
- Item generation: relative indexing of graph vertices; types of item expressions
- Results: interpretation of rules; characteristic structural patterns

[Slide 58] Regression Analysis
- 230 aromatic and heteroaromatic nitro compounds; Debnath et al.: J. Med. Chem. 34: 786-797 (1991).
- Division into 2 data sets: 188 compounds subject to regression; 42 compounds with diverse structures
- Conclusions:
  - Importance of hydrophobicity (log P)
  - An electron-attracting element conjugated with the nitro group enhances mutagenicity
  - Compounds with 3 or more fused rings (I1 = 1) are much more mutagenic
  - Less active aceanthrylenes (Ia = 1)

[Slide 59] Analysis by Progol (Inductive Logic Programming)
- King et al.: Proc. Natl. Acad. Sci. USA, Vol.93, pp.438-442 (1996).
- atm(compound, atom, element, type, charge). bond(compound, atom1, atom2, bondtype).
- Progol processing: 5 rules from 188 compounds (12 hrs.); 1 rule from 42 compounds (6 hrs.); accuracy comparable to regression
- Automatic discovery of indicator variables: 3 or more fused rings; aceanthrylenes

[Slide 60] Aim of This Work
- Uniform treatment of all 230 compounds
- Using 2D structural formulae
- Feasibility study of the cascade model
  to large-scale structural data mining:
  - Large toxicology databases
  - Huge SAR databases from high-throughput screening

[Slide 61] Scheme of Analysis
- Structural formulae (SDF file) → relative indexing of graph vertices → attribute table (Compound, Activity, Ptrn-1, Ptrn-2, …, Ptrn-999; e.g. compound 1: 1.5, Y, Y, N; compound 2: 3.2, N, Y, N) → cascade model → rules → interpretation

[Slide 62] Cascade Model 1: the Itemset Lattice
- An itemset is a node
- Item inclusion defines the links
- The class distribution is a node property
(figure: lattice over items a1/a2 and b1/b2 with class counts such as 4/4, 2/0, 1/2)

[Slide 63] Cascade Model 2: Questions
- Definition of potential; definition of power; cascade construction; selection of waterfalls
- Nodes are lakes with potential; links are waterfalls with power
- High-power waterfalls become rules
- Potential: class purity (mixed → pure)

[Slide 64] Sum-of-Squares Decomposition for Categorical Data
- Using the SS definition by Gini: TSS = WSS(1) + BSS(1) + WSS(2) + BSS(2)

[Slide 65] WSS as Potential & BSS as Power, Given by the TSS Decomposition
- Overall: 800 positives / 200 negatives; sample variance 0.16; TSS = 160
- Group 1: 760/40; variance 0.0475; WSS(1) = 38; BSS(1) = 18
- Group 2: 40/160; variance 0.16; WSS(2) = 32; BSS(2) = 72
- Check: 18 + 38 + 72 + 32 = 160

[Slide 66] A Link, the Distribution of Veiled Items & a Rule
- IF B:y added on A:y THEN D:y; E:n
- Cases: 100 → 60
- D:y 60% → 93%; BSS = 6.67
- E:n 60% → 90%; BSS = 5.40
- No need to generate the itemset {A:y, B:y, D:y, E:n}

[Slide 67] Item Generation Scheme
- Linear substructure patterns
- Two terminal atom parts: coordination number and/or hydrogens
- Connecting part along the shortest path: element name and/or bond type; coordination number and/or hydrogens; number of bonds (=, -)

[Slide 68] Item Types and Results
(table: #features generated, #features analyzed, and SS explained (#rules) in the first rule set, for terminal-atom parts {element & bond type; coordination no.; element} crossed with connecting parts {bond type; #bonds & bond types at terminals; coordination no. & hydrogen}; the individual figures are garbled in extraction)
- Parameters: minsup = 0.1, thres = 0.1, thr-BSS = 0.01
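The Gini-based sum-of-squares decomposition above can be verified numerically. The formulas below, SS = (n/2)(1 - Σ p²) and BSS as the squared distance between a group's class proportions and the overall ones, are one reading of the garbled slide, but they reproduce every quoted figure (TSS 160, WSS 38 and 32, BSS 18 and 72).

```python
def ss(pos, neg):
    """Gini-based sum of squares: SS = (n/2) * (1 - sum p_k^2)."""
    n = pos + neg
    return n / 2 * (1 - (pos / n) ** 2 - (neg / n) ** 2)

def bss(pos, neg, total_pos, total_neg):
    """Between-group SS: (n_g/2) * squared distance between the group's
    class-proportion vector and the overall one."""
    n, t = pos + neg, total_pos + total_neg
    return n / 2 * ((pos / n - total_pos / t) ** 2 +
                    (neg / n - total_neg / t) ** 2)

# The slide's example: 800/200 cases split into groups of 760/40 and 40/160.
tss = ss(800, 200)                      # 160
wss1, wss2 = ss(760, 40), ss(40, 160)   # 38 and 32
bss1 = bss(760, 40, 800, 200)           # 18
bss2 = bss(40, 160, 800, 200)           # 72
print(tss, wss1 + bss1 + wss2 + bss2)   # the decomposition is exact: 160 = 160
```

The same BSS formula, applied to a single veiled item across a link (for instance E:n moving from 60% of 100 cases to 90% of 60 cases), yields the link powers such as 5.40 quoted on the following slide.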
terminalscoord.no.&hydrogen20447710.7(7)17108513.0(8)8227214.1(10)coord.no.1678469.8(7)11985914.2(9)5314711.4(4)element1587518.5(7)10625714.2(10)4125111.6(5)#featuresgenerated#featuresanalyzedSSexplained(#rules)inthefirstrulesetusingminsup=0.1,thres=0.1,thr-BSS=0.01TUT 2000/06/0768Computation by DISCASnCategorizationto4activityclassesinactive,low:y 0.0,medium:0.0 y 14.0%;BSS:3.25;Cases:157-430.100.410.410.08=0.000.140.580.28inactlowmedhighinactlowmedhighTUT 2000/06/0770Interpretation of a Rule(2)an Optional RHS TermIFC3HrCrC-CrCrCrC-N3:yIIaddedonC3rCrCrCrC3:nTHENC3rCrCrC-N-O1=yIII68.2%-100.0%;BSS:4.36;Cases:157-43TUT 2000/06/0771Interpretation of a Rule(3)Integration of InformationnIncorporationofotheroptionalRHStermswith100%confidencenConsultation to the supportinginstancesindatabaseuPossiblesubstructurepatternsfordiscriminationuSelectionbychemistsintuition43 compounds114 compoundsTUT 2000/06/0772Interpretation of a Rule(4)Absence of a SubstructureIFC3rCrCrC-N3:naddedonC3HrCrCrC-N-O1:yTHENActivity=low30.7%-56.4%;BSS:3.03;Cases:137-390.13 0.31 0.38 0.18=0.26 0.56 0.18 0.00CooccurrenceofAbsenceofC3rCrCrC-N3PresenceofC3HrCrCrC-N3Allsubstituentsatthe4-thpositionfromNarehydrogen.TUT 2000/06/0773Characteristic Substructure Patterns(1)Lower ActivityL1:0.130.310.380.18=0.260.