(3.4.1) Big data analysis 3-4: data transform
Data transform

1 Data integration
2 Data transformation

1 Data integration

Data integration: integrate data from multiple data sources into one consistent data store. It involves three tasks:
- Pattern matching
- Data redundancy processing
- Data value conflict resolution

1 Data integration - Pattern matching

Pattern matching (more commonly called schema matching) integrates the metadata of the different data sources.
Entity recognition problem: match real-world entities across data sources, for example deciding that A.cust-id and B.customer_no refer to the same attribute (A.cust-id = B.customer_no).

1 Data integration - Data redundancy

- The same attribute may have different field names in different databases.
- One attribute may be derivable from another. For example, the average monthly income attribute in a customer data table can be calculated from the monthly income attribute.
- Some redundancy can be detected by correlation analysis.

1 Data integration - Data value conflict

For the same real-world entity, attribute values from different data sources may differ because of differences in representation, scale, or coding. For example:
- Weight is recorded in metric units (kg, g) in one system but in imperial units (pounds) in another.
- The same price attribute is recorded in different currency units ($, pound, RMB) in different locations.

2 Data transformation - 1) Smoothing

Remove noise, discretize continuous data, and move the data to a coarser granularity. Methods:
- Binning
- Clustering
- Regression

2 Data transformation - 2) Aggregation

Summarize the data with aggregate functions: avg(), count(), sum(), min(), max().
For example, daily sales data can be aggregated into monthly or annual totals.

2 Data transformation - 3) Data generalization

Replace low-level data objects with more abstract (higher-level) concepts.
For example, a street attribute can be generalized to higher-level concepts such as city or country. Similarly, a numeric attribute such as age can be mapped to higher-level concepts such as young, middle-aged, and old.

2 Data transformation - 4) Data normalization

Attribute values are scaled proportionally so that they fall into a small specified range. This removes the bias in mining results that arises when numeric attributes have very different magnitudes. For example, the salary income attribute can be mapped to the range [-1.0, 1.0].
Methods:
(1) Min-max normalization
(2) Zero-mean normalization (z-score normalization)
(3) Normalization by decimal scaling

2 Data transformation - 5) Attribute construction

Construct new attributes from the existing attribute set and add them to it, to help mine deeper patterns and improve the accuracy of the mining results.
For example, from the width and height attributes a new attribute, area, can be constructed.
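
The slides name three normalization methods (min-max, z-score, and decimal scaling) without giving their formulas. The following is a minimal Python sketch of the usual textbook definitions, not code from the course. The function names, the default target range, the handling of constant attributes, and the use of the population standard deviation for the z-score are all assumptions made for illustration.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute has no spread, so map it to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation; the slides do not say which.
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical salary values, used only to show the calls.
salaries = [2000, 4000, 6000, 8000, 10000]
print(min_max(salaries, -1.0, 1.0))   # [-1.0, -0.5, 0.0, 0.5, 1.0]
print(z_score(salaries))
print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]
```

Min-max keeps the relative spacing of the values but is sensitive to outliers, since a single extreme value sets the range. The z-score is the usual alternative when the true minimum and maximum are unknown or unreliable.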