Clustering Algorithm Practice (聚类算法实践.docx)
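Lab Report

Course: Data Mining Theory and Practice    Instructor: Xiang Qian (向前)
Student ID: 20191106078    Name: Gong Yonghao (龚永好)
Lab location: Xin-506 (信-506)    Major: Electronic Information Engineering
Class: Dianxin 1902 (电信1902班)    Date: May 24, 2022
Lab content: Experiment 4: Clustering Algorithm Practice

I. Purpose and Requirements

Purpose: further master the process of data exploration, data preprocessing, and attribute construction; become familiar with the principles of clustering algorithms; learn to process data with a clustering algorithm; master the application of a customer value model.

Requirements:
1. Become more familiar with the data mining process.
2. Learn how to preprocess data with Python.
3. Learn to use the K-Means function for cluster analysis.
4. Learn to draw a vector chart from the clustering results.

II. Equipment (Environment) and Requirements

1. Hardware: CPU at 2.0 GHz or above, at least 4 GB of memory, 8 GB recommended.
2. Software: Windows 7 or later, Anaconda environment.

III. Content

(1) Data mining steps

Understand the basic steps of customer value analysis:
1. Extract the full sample data for a given time period.
2. Explore and preprocess the extracted data, including exploration of missing values and outliers, data cleaning, feature construction, and standardization.
3. Based on the RFM model, segment customers with the K-Means algorithm.
4. For the customer groups of different value given by the model, apply different marketing approaches and provide customized services.

(2) Data exploration

Customer information analysis

4-1 code: data exploration

[Figure: boxplot of member flight counts; y axis: flight count]
[Figure: boxplot of customer total flight kilometres; y axis: total flight kilometres]

4-4 code: explore the distribution of customer points information

    # coding: utf-8
    """
    4-4
    """
    # `data` is the DataFrame loaded by the earlier exploration listings
    import matplotlib.pyplot as plt

    # Points information
    # Extract the number of points exchanges per member
    ec = data['EXCHANGE_COUNT']
    # Draw a histogram of the points-exchange counts
    fig = plt.figure(figsize=(8, 5))  # set the canvas size
    plt.hist(ec, bins=5, color='#0504aa')
    plt.xlabel('Exchange count')
    plt.ylabel('Number of members')
    plt.title('Histogram of member points-exchange counts')
    plt.show()
    plt.close()

    # Extract the total accumulated points per member
    ps = data['Points_Sum']
    # Draw a boxplot of the total accumulated points
    fig = plt.figure(figsize=(5, 8))
    plt.boxplot(ps,
                patch_artist=True,
                labels=['Total accumulated points'],  # x-axis label
                boxprops={'facecolor': 'lightblue'})  # fill colour
    plt.title('Boxplot of customer total accumulated points')
    # Show horizontal grid lines for the y axis
    plt.grid(axis='y')
    plt.show()
    plt.close()

4-4 run results

    In [6]: runfile('D:/shujuwajue-shiyan4/4-4.py')

[Figure: histogram of member points-exchange counts; x axis: exchange count]
[Figure: boxplot of customer total accumulated points]

4-5 code: correlation matrix and heatmap

    # coding: utf-8
    """
    4-5
    """
    # `data`, `plt` and `ffp_year` come from the earlier exploration listings
    # Extract attributes and merge them into a new dataset
    data_corr = data[['FFP_TIER', 'FLIGHT_COUNT', 'LAST_TO_END',
                      'SEG_KM_SUM', 'EXCHANGE_COUNT', 'Points_Sum']]
    age1 = data['AGE'].fillna(0)
    data_corr['AGE'] = age1.astype('int64')
    data_corr['ffp_year'] = ffp_year

    # Compute the correlation matrix
    dt_corr = data_corr.corr(method='pearson')
    print('Correlation matrix:\n', dt_corr)

    # Draw the heatmap
    import seaborn as sns
    plt.subplots(figsize=(10, 10))  # set the figure size
    sns.heatmap(dt_corr, annot=True, vmax=1, square=True, cmap='Blues')
    plt.show()
    plt.close()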
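The SettingWithCopyWarning that listing 4-5 produces in its run results comes from adding columns to a frame that pandas considers a possible slice of `data`. The sketch below shows one common way to avoid it, calling `.copy()` on the column selection. The frame `toy` and its values are invented for illustration and are not taken from the airline dataset; this is a sketch of the pattern, not necessarily how the exercise was meant to be written.

```python
import pandas as pd

# Toy stand-in for the airline data; all values are invented for illustration.
toy = pd.DataFrame({
    'FLIGHT_COUNT': [10, 20, 30, 40],
    'SEG_KM_SUM': [1000, 2100, 2900, 4200],
    'AGE': [31.0, None, 45.0, 52.0],
})

# .copy() makes data_corr an independent frame, so the assignment below
# writes to data_corr itself rather than to a possible view of `toy`.
data_corr = toy[['FLIGHT_COUNT', 'SEG_KM_SUM']].copy()
data_corr['AGE'] = toy['AGE'].fillna(0).astype('int64')

# Pearson correlation between every pair of columns.
dt_corr = data_corr.corr(method='pearson')
print(dt_corr)
```

The same change applied to listing 4-5 would be a single `.copy()` after the double-bracket selection; the correlation values themselves are unaffected.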
4-5 run results

    In [7]: runfile('D:/shujuwajue-shiyan4/4-5.py')
    D:/shujuwajue-shiyan4/4-5.py:10: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      data_corr['AGE'] = age1.astype('int64')
    D:/shujuwajue-shiyan4/4-5.py:11: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      data_corr['ffp_year'] = ffp_year
    Correlation matrix:
                     FFP_TIER  FLIGHT_COUNT       AGE  ffp_year
    FFP_TIER         1.000000      0.582447  0.076245 -0.116510
    FLIGHT_COUNT     0.582447      1.000000  0.075309 -0.188181
    LAST_TO_END     -0.206313     -0.404999 -0.027654  0.117913
    SEG_KM_SUM       0.522350      0.850411  0.087285 -0.171508
    EXCHANGE_COUNT   0.342355      0.502501  0.032760 -0.216610
    Points_Sum       0.559249      0.747092  0.074887 -0.163431
    AGE              0.076245      0.075309  1.000000 -0.242579
    ffp_year        -0.116510     -0.188181 -0.242579  1.000000

    [8 rows x 8 columns]

[Figure: annotated heatmap of the 8 x 8 correlation matrix for FFP_TIER, FLIGHT_COUNT, LAST_TO_END, SEG_KM_SUM, EXCHANGE_COUNT, Points_Sum, AGE and ffp_year]

(3) Data preprocessing

Data cleaning and attribute construction

4-6 code: clean null values and outliers

    # coding: utf-8
    """
    4-6
    """
    # Handle missing values and outliers

    import numpy as np
    import pandas as pd
    datafile = 'D:/shujuwajue-shiyan4/air_data.csv'  # path to the raw airline data
    cleanedfile = 'D:/shujuwajue-shiyan4/data_cleaned.csv'  # path where the cleaned data is saved

    # Read the data
    airline_data = pd.read_csv(datafile, encoding='utf-8')
    print('Shape of the raw data:', airline_data.shape)

    # Remove records whose fare is empty
    airline_notnull = airline_data.loc[airline_data['SUM_YR_1'].notnull() &
                                       airline_data['SUM_YR_2'].notnull(), :]
    print('Shape after dropping records with missing values:', airline_notnull.shape)

    # Keep only records with a non-zero fare, a non-zero average discount
    # and a total flight distance greater than 0
    index1 = airline_notnull['SUM_YR_1'] != 0
    index2 = airline_notnull['SUM_YR_2'] != 0
    index3 = (airline_notnull['SEG_KM_SUM'] > 0) & (airline_notnull['avg_discount'] != 0)
    index4 = airline_notnull['AGE'] > 100  # marks records with age above 100
    # ~index4 drops the records marked above
    airline = airline_notnull[(index1 | index2) & index3 & ~index4]
    print('Shape after cleaning:', airline.shape)

    airline.to_csv(cleanedfile)  # save the cleaned data

4-6 run results

    In [8]: runfile('D:/shujuwajue-shiyan4/4-6.py')
    Shape of the raw data: (62988, 44)
    Shape after dropping records with missing values: (62299, 44)
    Shape after cleaning: (62043, 44)

4-7 code: attribute selection

    # coding: utf-8
    """
    4-7
    """
    # Attribute selection, construction and data standardization

    import pandas as pd
    import numpy as np

    # Read the cleaned data
    cleanedfile = 'D:/shujuwajue-shiyan4/data_cleaned.csv'  # path where the cleaned data is saved
    airline = pd.read_csv(cleanedfile, encoding='utf-8')
    # Select the required attributes
    airline_selection = airline[['FFP_DATE', 'LOAD_TIME', 'LAST_TO_END',
                                 'FLIGHT_COUNT', 'SEG_KM_SUM', 'avg_discount']]
    print('First 5 rows of the selected attributes:\n', airline_selection.head())
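The cleaning rule of listing 4-6 and the column selection of listing 4-7 can be tried on a handful of records before running them on the full file. In the sketch below the five rows are invented for illustration (only the column names come from the report), and the boolean expression follows the code of listing 4-6.

```python
import numpy as np
import pandas as pd

# Invented records for illustration; only the column names follow listing 4-6.
toy = pd.DataFrame({
    'SUM_YR_1':     [100.0, np.nan, 0.0, 50.0, 80.0],
    'SUM_YR_2':     [200.0, 300.0, 0.0, 0.0, 90.0],
    'SEG_KM_SUM':   [5000, 4000, 3000, 2000, 1000],
    'avg_discount': [0.9, 0.8, 0.7, 0.6, 0.5],
    'AGE':          [30.0, 40.0, 50.0, np.nan, 120.0],
})

# Step 1: drop rows where either fare column is missing.
notnull = toy.loc[toy['SUM_YR_1'].notnull() & toy['SUM_YR_2'].notnull(), :]

# Step 2: require a non-zero fare in at least one year, positive mileage
# and a non-zero discount, then drop ages above 100.
has_fare = (notnull['SUM_YR_1'] != 0) | (notnull['SUM_YR_2'] != 0)
flew = (notnull['SEG_KM_SUM'] > 0) & (notnull['avg_discount'] != 0)
too_old = notnull['AGE'] > 100  # NaN > 100 is False, so unknown ages are kept
cleaned = notnull[has_fare & flew & ~too_old]

# Step 3: keep only the columns needed downstream (a subset, as in listing 4-7).
selection = cleaned[['SEG_KM_SUM', 'avg_discount']]
print(cleaned.index.tolist())  # [0, 3]
```

Row 1 is removed for the missing fare, row 2 for having no fare in either year, and row 4 for the age above 100; row 3 survives even though its age is unknown.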
4-7 run results

    In [9]: runfile('D:/shujuwajue-shiyan4/4-7.py')
    First 5 rows of the selected attributes:
        FFP_DATE  LOAD_TIME  SEG_KM_SUM  avg_discount
    0  2006/11/2  2014/3/31      580717      0.961639
    1  2007/2/19  2014/3/31      293678      1.252314
    2   2007/2/1  2014/3/31      283712      1.254676
    3  2008/8/22  2014/3/31      281336      1.090870
    4  2009/4/10  2014/3/31      309928      0.970658

    [5 rows x 6 columns]

4-8 code: attribute construction and data standardization

    """
    4-8
    """
    # Construct attribute L (membership length in months of 30 days)
    L = pd.to_datetime(airline_selection['LOAD_TIME']) - \
        pd.to_datetime(airline_selection['FFP_DATE'])
    L = L.astype('str').str.split().str[0]
    L = L.astype('int') / 30

    # Merge the attributes
    airline_features = pd.concat([L, airline_selection.iloc[:, 2:]], axis=1)
    airline_features.columns = ['L', 'R', 'F', 'M', 'C']
    print('First 5 rows of the constructed LRFMC attributes:\n', airline_features.head())

    # Standardize the data
    from sklearn.preprocessing import StandardScaler
    data = StandardScaler().fit_transform(airline_features)
    np.savez('D:/shujuwajue-shiyan4/airline_scale.npz', data)
    print('The five standardized LRFMC attributes:\n', data[:5, :])

4-8 run results

    In [10]: runfile('D:/shujuwajue-shiyan4/4-8.py')
    First 5 rows of the constructed LRFMC attributes:
               L   R    F       M         C
    0  90.200000   1  210  580717  0.961639
    1  86.566667   7  140  293678  1.252314
    2  87.166667  11  135  283712  1.254676
    3  68.233333  97   23  281336  1.090870
    4  60.533333   5  152  309928  0.970658
    The five standardized LRFMC attributes:
     [[ 1.43579256 -0.94493902 14.03402401 26.76115699  1.29554188]
     [ 1.30723219 -0.91188564  9.07321595 13.12686436  2.86817777]
     [ 1.32846234 -0.88985006  8.71887252 12.65348144  2.88095186]
     [ 0.65853304 -0.41608504  0.78157962 12.54062193  1.99471546]
     [ 0.3860794  -0.92290343  9.92364019 13.89873597  1.34433641]]

[Screenshot: IDE variable explorer after running 4-8, listing among others airline_data (DataFrame, 62988 x 44), airline_notnull (62299 x 44), airline (62043 x 45), airline_selection (62043 x 6), airline_features (62043 x 5, columns L, R, F, M, C) and data (float64 array, 62043 x 5)]
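The construction of L and the standardization in listing 4-8 can be checked by hand on a few rows. The sketch below reuses the first three members shown in the 4-7 and 4-8 outputs. It makes two substitutions that are assumptions of this sketch rather than the report's method: the day count is read with `.dt.days` instead of parsing the string form of the timedelta, and the z-score is computed with NumPy (population standard deviation, the formula `StandardScaler` uses by default) so that the example needs no scikit-learn.

```python
import numpy as np
import pandas as pd

# First three members from the 4-7 and 4-8 outputs of this report.
sel = pd.DataFrame({
    'FFP_DATE': ['2006/11/2', '2007/2/19', '2007/2/1'],
    'LOAD_TIME': ['2014/3/31', '2014/3/31', '2014/3/31'],
    'LAST_TO_END': [1, 7, 11],
    'FLIGHT_COUNT': [210, 140, 135],
    'SEG_KM_SUM': [580717, 293678, 283712],
    'avg_discount': [0.961639, 1.252314, 1.254676],
})

# L: membership length in months, counting a month as 30 days as in listing 4-8.
span = (pd.to_datetime(sel['LOAD_TIME'], format='%Y/%m/%d')
        - pd.to_datetime(sel['FFP_DATE'], format='%Y/%m/%d'))
L = span.dt.days / 30

# Put L next to R, F, M, C (the last four selected columns).
features = pd.concat([L, sel.iloc[:, 2:]], axis=1)
features.columns = ['L', 'R', 'F', 'M', 'C']

# Z-score each column: subtract the mean, divide by the population std (ddof=0).
values = features.to_numpy(dtype=float)
scaled = (values - values.mean(axis=0)) / values.std(axis=0)

print(features)
print(scaled.round(3))
```

The first row gives L = 2706 / 30 = 90.2, matching the report. The scaled values differ from the report's because the mean and standard deviation here are taken over three rows instead of 62043.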

(4) Data model construction

Cluster with the K-Means model

4-9 code: K-Means clustering of the standardized data, with run results

    In [1]: import pandas as pd
            import numpy as np
            from sklearn.cluster import KMeans  # import the K-Means algorithm

    In [2]: # Read the standardized data
            airline_scale = np.load('D:/shujuwajue-shiyan4/airline_scale.npz')['arr_0']
            k = 5  # number of cluster centers

    In [3]: # Build the model; the random seed is set to 123
            kmeans_model = KMeans(n_clusters=k, n_jobs=4, random_state=123)

    In [4]: fit_kmeans = kmeans_model.fit(airline_scale)  # train the model
    C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\sklearn\cluster\_kmeans.py:793: FutureWarning: 'n_jobs' was deprecated in version 0.23 and will be removed in 1.0 (renaming of 0.25).
      " removed in 1.0 (renaming of 0.25).", FutureWarning)

    In [5]: # View the clustering results
            kmeans_cc = kmeans_model.cluster_centers_  # cluster centers
            print('Cluster centers:\n', kmeans_cc)
            kmeans_labels = kmeans_model.labels_  # class label of each sample
            print('Class label of each sample:\n', kmeans_labels)
            r1 = pd.Series(kmeans_model.labels_).value_counts()  # count the samples in each class
            print('Final number in each class:\n', r1)
    Cluster centers:
     [[-0.70030628 -0.41502288 -0.16081841 -0.16053724 -0.25728596]
     [ 0.0444681  -0.00249102 -0.23046649 -0.23492871  2.17528742]
     [ 0.48370858 -0.79939042  2.48317171  2.42445742  0.30923962]
     [ 1.1608298  -0.37751261 -0.08668008 -0.09460809 -0.15678402]
     [-0.31319365  1.68685465 -0.57392007 -0.5367502  -0.17484815]]
    Class label of each sample:
     [2 2 2 ... 0 4 4]
    Final number in each class:
     0    24630
    3    15733
    4    12117
    2     5337
    1     4226
    dtype: int64

    In [6]: # Output the clustering result
            cluster_center = pd.DataFrame(kmeans_model.cluster_centers_,
                                          columns=['ZL', 'ZR', 'ZF', 'ZM', 'ZC'])  # put the cluster centers in a DataFrame
            cluster_center.index = pd.DataFrame(kmeans_model.labels_).drop_duplicates().iloc[:, 0]  # use the sample classes as the index
            print(cluster_center)
             ZL        ZR        ZF        ZM        ZC
    0
    2 -0.700306 -0.415023 -0.160818 -0.160537 -0.257286
    1  0.044468 -0.002491 -0.230466 -0.234929  2.175287
    3  0.483709 -0.799390  2.483172  2.424457  0.309240
    0  1.160830 -0.377513 -0.086680 -0.094608 -0.156784
    4 -0.313194  1.686855 -0.573920 -0.536750 -0.174848

4-10 code: draw the customer segmentation radar chart

    # coding: utf-8
    """
    4-1
    """
    # Basic exploration of the data
    # Return the number of missing values and the maximum and minimum values

    import pandas as pd

    datafile = 'D:/shujuwajue-shiyan4/air_data.csv'
    resultfile = 'D:/shujuwajue-shiyan4/explore.csv'  # data exploration result table

    # Read the raw data as UTF-8 (convert the file to UTF-8 with a text editor first)
    data = pd.read_csv(datafile, encoding='utf-8')

    # Basic description of the data; percentiles sets which quantiles are
    # computed (e.g. the first quartile, the median)
    explore = data.describe(percentiles=[], include='all').T  # .T transposes the table
    # describe() counts the non-null values; the null count is computed by hand
    explore['null'] = len(data) - explore['count']

    explore = explore[['null', 'max', 'min']]
    explore.columns = ['Null count', 'Max', 'Min']
    '''
    Only part of the exploration result is kept here.
    describe() computes these fields automatically: count (non-null values),
    unique (number of distinct values), top (most frequent value),
    freq (frequency of the top value), mean, std (standard deviation),
    min, 50% (median), max
    '''

    explore.to_csv(resultfile)

4-1 run results

    # Code 4-10
    %matplotlib inline
    import matplotlib.pyplot as plt
    cluster_center = pd.DataFrame(kmeans_model.cluster_centers_,
                                  columns=['ZL', 'ZR', 'ZF', 'ZM', 'ZC'])
    cluster_center.index = pd.DataFrame(kmeans_model.labels_).drop_duplicates().iloc[:, 0]
    print(cluster_center)
    labels = ['ZL', 'ZR', 'ZF', 'ZM', 'ZC']
    legen = ['Customer group ' + str(i + 1) for i in cluster_center.index]
    lstype = ['-', '--', (0, (3, 5, 1, 5, 1, 5)), ':', '-.']
    kinds = list(cluster_center.iloc[:, 0])
    # Repeat the first column at the end so each polygon closes
    cluster_center = pd.concat([cluster_center, cluster_center[['ZL']]], axis=1)
    centers = np.array(cluster_center.iloc[:, 0:])
    n = len(labels)
    angle = np.linspace(0, 2 * np.pi, n, endpoint=False)
    angle = np.concatenate((angle, [angle[0]]))
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, polar=True)
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    for i in range(len(kinds)):
        ax.plot(angle, centers[i], linestyle=lstype[i], linewidth=2, label=kinds[i])
    # Drop the repeated closing angle so angles and labels have the same length
    ax.set_thetagrids((angle * 180 / np.pi)[:-1], labels)
    plt.title('Customer feature analysis radar chart')
    plt.legend(legen)
    plt.show()
    plt.close()

4-10 run results

[Console output: the same cluster centers, sample labels ([2 2 2 ... 0 4 4]) and per-class counts as in the 4-9 results, followed by the ZL/ZR/ZF/ZM/ZC cluster-center table with index 2, 1, 3, 0, 4]
[Figure: customer feature analysis radar chart]

IV. Analysis of Results and Problems Encountered

(1) Problems encountered

The 4-9 code produced no results when run as a script:

    In [9]: runfile('D:/shujuwajue-shiyan4/4-9.py')

Solution: running the same code in Jupyter Notebook succeeds.

    # Read the standardized data
    airline_scale = np.load('D:/shujuwajue-shiyan4/airline_scale.npz')['arr_0']
    k = 5  # number of cluster centers
    # Build the model; the random seed is set to 123
    kmeans_model = KMeans(n_clusters=k, n_jobs=4, random_state=123)
    fit_kmeans = kmeans_model.fit(airline_scale)  # train the model
    # View the clustering results
    kmeans_cc = kmeans_model.cluster_centers_  # cluster centers
    print('Cluster centers:\n', kmeans_cc)
    kmeans_labels = kmeans_model.labels_  # class label of each sample
    print('Class label of each sample:\n'