Clustering Algorithms in Practice
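Lab Report

Course: Data Mining Theory and Practice
Instructor: 向前
Student ID: 20191106078
Name: 龚永好
Lab location: 信-506
Major: Electronic Information Engineering
Class: Electronic Information 1902
Date: May 24, 2022
Lab topic: Experiment 4: Clustering Algorithms in Practice

I. Objectives and Requirements

Objective: Gain further command of data exploration, data preprocessing, and attribute construction; become familiar with how clustering algorithms work; learn to process data with a clustering algorithm and to apply a customer value model.

Requirements:
1. Become more familiar with the data mining process.
2. Learn to preprocess data with Python.
3. Learn to use the K-Means function for cluster analysis.
4. Learn to draw a vector plot from the clustering results.

II. Equipment (Environment) and Requirements

1. Hardware: CPU at 2.0 GHz or faster; at least 4 GB of memory, 8 GB recommended.
2. Software: Windows 7 or later; Anaconda environment.

III. Experiment Content

(I) Data mining steps

Understand the basic steps of customer value analysis:
1. Extract the complete sample data for a given time period.
2. Explore and preprocess the extracted data, including analysis of missing and abnormal values, data cleaning, feature construction, and standardization.
3. Based on the RFM model, segment the customers with the K-Means algorithm.
4. For the customer groups of different value identified by the model, apply different marketing measures and provide customized services.

(II) Data exploration

Customer information analysis

4-1 Code: Data exploration

[Figure: box plot "Distribution of member flight counts"; axis label: flight count]

[Figure: box plot "Customer total flight distance"; axis label: total flight kilometers]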
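Step 2 of the procedure in (I) calls for exploring missing and abnormal values before any cleaning is done. A minimal sketch of such a summary is given below. The toy DataFrame and its values are made up for illustration (only the column names follow the airline data set used in this report), and the summary layout is an assumption, not the report's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
data = pd.DataFrame({
    "SUM_YR_1": [239560.0, np.nan, 0.0, 1200.0],
    "SEG_KM_SUM": [580717, 293678, 0, 3500],
    "AGE": [31.0, 42.0, np.nan, 110.0],
})

# Per-column count of missing values plus the observed minimum and maximum;
# suspicious extremes (a fare of 0, an age above 100) show up in min/max
summary = pd.DataFrame({
    "null_count": data.isnull().sum(),
    "min": data.min(),
    "max": data.max(),
})
print(summary)
```

Columns with a nonzero null_count or implausible extremes are the candidates for the cleaning rules applied later in the report.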
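4-4 Code: Exploring the distribution of customer points information

    # coding: utf-8
    """
    4-4
    """
    import matplotlib.pyplot as plt

    # `data` is the airline DataFrame already loaded in the same session

    # Points information
    # Extract the number of times each member exchanged points
    ec = data['EXCHANGE_COUNT']
    # Draw a histogram of the points-exchange counts
    fig = plt.figure(figsize=(8, 5))  # set the canvas size
    plt.hist(ec, bins=5, color='#0504aa')
    plt.xlabel('Number of exchanges')
    plt.ylabel('Number of members')
    plt.title('Distribution of member points-exchange counts')
    plt.show()
    plt.close()

    # Extract the total accumulated points of each member
    ps = data['Points_Sum']
    # Draw a box plot of the total accumulated points
    fig = plt.figure(figsize=(5, 8))
    plt.boxplot(ps,
                patch_artist=True,
                labels=['Total accumulated points'],  # set the x-axis label
                boxprops={'facecolor': 'lightblue'})  # set the fill color
    plt.title('Customer total accumulated points')
    # Show horizontal grid lines along the y axis
    plt.grid(axis='y')
    plt.show()
    plt.close()

4-4 Run result

    In [6]: runfile('D:/shujuwajue-shiyan4/4-4.py')

[Figure: histogram "Distribution of member points-exchange counts"; x-axis: number of exchanges, y-axis: number of members]

[Figure: box plot "Customer total accumulated points"; x-axis label: total accumulated points]

4-5 Code: Correlation matrix and heat map

    # coding: utf-8
    """
    4-5
    """
    # `data`, `plt` and `ffp_year` come from the earlier scripts in the same session

    # Extract attributes and combine them into a new data set
    data_corr = data[['FFP_TIER', 'FLIGHT_COUNT', 'LAST_TO_END',
                      'SEG_KM_SUM', 'EXCHANGE_COUNT', 'Points_Sum']]
    age1 = data['AGE'].fillna(0)
    # The next two assignments write to a slice of `data`, which triggers
    # the SettingWithCopyWarning shown in the run result
    data_corr['AGE'] = age1.astype('int64')
    data_corr['ffp_year'] = ffp_year

    # Compute the correlation matrix
    dt_corr = data_corr.corr(method='pearson')
    print('Correlation matrix:\n', dt_corr)

    # Draw the heat map
    import seaborn as sns
    plt.subplots(figsize=(10, 10))  # set the figure size
    sns.heatmap(dt_corr, annot=True, vmax=1, square=True, cmap='Blues')
    plt.show()
    plt.close()

4-5 Run result

    In [7]: runfile('D:/shujuwajue-shiyan4/4-5.py')
    D:/shujuwajue-shiyan4/4-5.py:10: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      data_corr['AGE'] = age1.astype('int64')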
    D:/shujuwajue-shiyan4/4-5.py:11: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      data_corr['ffp_year'] = ffp_year
    Correlation matrix:
                     FFP_TIER  FLIGHT_COUNT  ...       AGE  ffp_year
    FFP_TIER         1.000000      0.582447  ...  0.076245 -0.116510
    FLIGHT_COUNT     0.582447      1.000000  ...  0.075309 -0.188181
    LAST_TO_END     -0.206313     -0.404999  ... -0.027654  0.117913
    SEG_KM_SUM       0.522350      0.850411  ...  0.087285 -0.171508
    EXCHANGE_COUNT   0.342355      0.502501  ...  0.032760 -0.216610
    Points_Sum       0.559249      0.747092  ...  0.074887 -0.163431
    AGE              0.076245      0.075309  ...  1.000000 -0.242579
    ffp_year        -0.116510     -0.188181  ... -0.242579  1.000000

    [8 rows x 8 columns]

[Figure: heat map of the Pearson correlation matrix for FFP_TIER, FLIGHT_COUNT, LAST_TO_END, SEG_KM_SUM, EXCHANGE_COUNT, Points_Sum, AGE and ffp_year, with each cell annotated with its coefficient]
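The SettingWithCopyWarning in the 4-5 run result appears because data_corr is a slice of data, and new columns are then assigned to that slice. One common way to avoid the warning, shown here as a minimal sketch rather than as the report's own code, is to take an explicit copy of the selected columns first. The toy DataFrame below and all of its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the airline data; the values are invented
data = pd.DataFrame({
    "FFP_TIER": [6, 5, 4, 4],
    "FLIGHT_COUNT": [210, 140, 23, 8],
    "AGE": [31.0, None, 48.0, 64.0],
})

# .copy() makes data_corr an independent DataFrame instead of a slice of
# `data`, so adding a column to it is unambiguous
data_corr = data[["FFP_TIER", "FLIGHT_COUNT"]].copy()
data_corr["AGE"] = data["AGE"].fillna(0).astype("int64")

# Pearson correlation matrix, as in listing 4-5
dt_corr = data_corr.corr(method="pearson")
print(dt_corr)
```

The original data is left untouched, and the correlation matrix is computed on the copy exactly as before.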
(III) Data preprocessing

Data cleaning and attribute construction

4-6 Code: Cleaning null and abnormal values

    # coding: utf-8
    """
    4-6
    """
    # Handle missing and abnormal values

    import numpy as np
    import pandas as pd
    datafile = 'D:/shujuwajue-shiyan4/air_data.csv'  # path of the raw airline data
    cleanedfile = 'D:/shujuwajue-shiyan4/data_cleaned.csv'  # path for the cleaned data

    # Read the data
    airline_data = pd.read_csv(datafile, encoding='utf-8')
    print('Shape of the raw data:', airline_data.shape)

    # Drop records whose fare is null
    airline_notnull = airline_data.loc[airline_data['SUM_YR_1'].notnull() &
                                       airline_data['SUM_YR_2'].notnull(), :]
    print('Shape after dropping records with missing fares:', airline_notnull.shape)

    # Keep only records with a nonzero fare, a nonzero average discount rate
    # and a total flight distance greater than 0
    index1 = airline_notnull['SUM_YR_1'] != 0
    index2 = airline_notnull['SUM_YR_2'] != 0
    index3 = (airline_notnull['SEG_KM_SUM'] > 0) & (airline_notnull['avg_discount'] != 0)
    index4 = airline_notnull['AGE'] > 100  # flags records with age above 100
    # ~index4 removes the flagged records
    airline = airline_notnull[(index1 | index2) & index3 & ~index4]
    print('Shape after data cleaning:', airline.shape)

    airline.to_csv(cleanedfile)  # save the cleaned data

4-6 Run result

    In [8]: runfile('D:/shujuwajue-shiyan4/4-6.py')
    Shape of the raw data: (62988, 44)
    Shape after dropping records with missing fares: (62299, 44)
    Shape after data cleaning: (62043, 44)

4-7 Code: Attribute selection

    # coding: utf-8
    """
    4-7
    """
    # Attribute selection, construction and data standardization

    import pandas as pd
    import numpy as np

    # Read the cleaned data
    cleanedfile = 'D:/shujuwajue-shiyan4/data_cleaned.csv'  # path of the cleaned data
    airline = pd.read_csv(cleanedfile, encoding='utf-8')
    # Select the required attributes
    airline_selection = airline[['FFP_DATE', 'LOAD_TIME', 'LAST_TO_END',
                                 'FLIGHT_COUNT', 'SEG_KM_SUM', 'avg_discount']]
    print('First 5 rows of the selected attributes:\n', airline_selection.head())

4-7 Run result

    In [9]: runfile('D:/shujuwajue-shiyan4/4-7.py')
    First 5 rows of the selected attributes:
        FFP_DATE  LOAD_TIME  ...  SEG_KM_SUM  avg_discount
    0  2006/11/2  2014/3/31  ...      580717      0.961639
    1  2007/2/19  2014/3/31  ...      293678      1.252314
    2   2007/2/1  2014/3/31  ...      283712      1.254676
    3  2008/8/22  2014/3/31  ...      281336      1.090870
    4  2009/4/10  2014/3/31  ...      309928      0.970658

    [5 rows x 6 columns]
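The FFP_DATE and LOAD_TIME columns in the 4-7 output are what the membership length L is derived from: the number of days between joining and the end of the observation window, divided by 30. A minimal sketch of that calculation on the five rows shown above follows. It uses the `.dt.days` accessor, which is a swapped-in alternative to the string splitting in listing 4-8, and the explicit date format string is an assumption based on how the dates are printed.

```python
import pandas as pd

# First five FFP_DATE / LOAD_TIME pairs, copied from the 4-7 run result
sample = pd.DataFrame({
    "FFP_DATE": ["2006/11/2", "2007/2/19", "2007/2/1", "2008/8/22", "2009/4/10"],
    "LOAD_TIME": ["2014/3/31"] * 5,
})

fmt = "%Y/%m/%d"  # assumed format: year/month/day without zero padding
membership = (pd.to_datetime(sample["LOAD_TIME"], format=fmt)
              - pd.to_datetime(sample["FFP_DATE"], format=fmt))

# Whole days as integers, converted to months of 30 days each
L = membership.dt.days / 30
print(L.round(6).tolist())
```

Reading the day count through `.dt.days` avoids converting the timedelta to text and back, while producing the same quantity as the string-based approach.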
4-8 Code: Attribute construction and data standardization

    """
    4-8
    """
    # `pd`, `np` and `airline_selection` come from script 4-7 in the same session

    # Construct attribute L (membership length in months of 30 days)
    L = pd.to_datetime(airline_selection['LOAD_TIME']) - \
        pd.to_datetime(airline_selection['FFP_DATE'])
    L = L.astype('str').str.split().str[0]  # keep the day count from e.g. "2706 days"
    L = L.astype('int') / 30

    # Combine the attributes
    airline_features = pd.concat([L, airline_selection.iloc[:, 2:]], axis=1)
    airline_features.columns = ['L', 'R', 'F', 'M', 'C']
    print('First 5 rows of the constructed LRFMC attributes:\n', airline_features.head())

    # Standardize the data
    from sklearn.preprocessing import StandardScaler
    data = StandardScaler().fit_transform(airline_features)
    np.savez('D:/shujuwajue-shiyan4/airline_scale.npz', data)
    print('The five LRFMC attributes after standardization:\n', data[:5, :])

4-8 Run result

    In [10]: runfile('D:/shujuwajue-shiyan4/4-8.py')
    First 5 rows of the constructed LRFMC attributes:
               L   R    F       M         C
    0  90.200000   1  210  580717  0.961639
    1  86.566667   7  140  293678  1.252314
    2  87.166667  11  135  283712  1.254676
    3  68.233333  97   23  281336  1.090870
    4  60.533333   5  152  309928  0.970658
    The five LRFMC attributes after standardization:
     [[ 1.43579256 -0.94493902 14.03402401 26.76115699  1.29554188]
     [ 1.30723219 -0.91188564  9.07321595 13.12686436  2.86817777]
     [ 1.32846234 -0.88985006  8.71887252 12.65348144  2.88095186]
     [ 0.65853304 -0.41608504  0.78157962 12.54062193  1.99471546]
     [ 0.3860794  -0.92290343  9.92364019 13.89873597  1.34433641]]
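Step 3 of the procedure in (I), segmenting customers with K-Means on the standardized features, is the natural next step after standardization. A minimal sketch follows, assuming scikit-learn's KMeans. The feature values are synthetic (two artificial groups with LRFMC-like scales), and both the number of clusters k = 2 and the random seeds are illustrative assumptions, not values taken from this report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic LRFMC-like features (columns: L, R, F, M, C); values are invented.
# Group 1: short membership, long time since last flight, few short flights.
low = rng.normal(loc=[30, 200, 3, 5000, 0.6],
                 scale=[5, 20, 1, 500, 0.05], size=(50, 5))
# Group 2: long membership, recent activity, many long flights.
high = rng.normal(loc=[90, 10, 100, 200000, 1.1],
                  scale=[5, 3, 10, 20000, 0.05], size=(50, 5))
features = np.vstack([low, high])

# Standardize so that no single attribute dominates the distance metric
data = StandardScaler().fit_transform(features)

k = 2  # assumed number of customer groups for this toy example
kmeans_model = KMeans(n_clusters=k, n_init=10, random_state=123)
labels = kmeans_model.fit_predict(data)

print(kmeans_model.cluster_centers_)  # one standardized LRFMC center per group
print(np.bincount(labels))            # number of customers in each group
```

Each row of cluster_centers_ describes the average standardized L, R, F, M and C of one customer group, which is what a customer value analysis would interpret when choosing a marketing approach per group.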