pandas Data Analysis in 10 Minutes, with Examples
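1. What is pandas?

pandas: the Python data analysis module

pandas was created for data analysis tasks. It brings together a large number of libraries and standard data models, and provides the tools needed to work efficiently with large datasets.

Data structures in pandas:

Series: a one-dimensional array, similar to Python's basic list. The difference is that a Series holds values of a single data type, which uses memory more efficiently and speeds up computation. Think of it as a column in a database table.

DataFrame: a two-dimensional, table-like data structure. Much of its functionality resembles data.frame in R. A DataFrame can be thought of as a container of Series.

Panel: a three-dimensional array, which can be thought of as a container of DataFrames. (Panel was deprecated in pandas 0.20 and removed in 0.25, so it only exists in old releases.)

2. pandas in 10 minutes

Import the required packages:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import matplotlib.pyplot as plt

numpy is a scientific computing package for Python, and matplotlib is a Python 2D plotting library.

For more chapters, see the Cookbook.

7. Fast access to a single scalar (equivalent to item 6). Here df is the date-indexed frame with columns A to D that is created under "3. Object creation":

    >>> df.iat[1, 1]
    -0.48405080229207309

Boolean indexing

1. Select rows using the values of a single column:

    >>> df[df.A > 0]
                       A         B         C         D
    2013-01-01  0.859619 -0.545903  0.012447  1.257684
    2013-01-02  0.119622 -0.484051  0.404728  0.360880

2. Select values with a where-style condition (cells that fail the condition become NaN):

    >>> df[df > 0]
                       A         B         C         D
    2013-01-01  0.859619       NaN  0.012447  1.257684
    2013-01-02  0.119622       NaN  0.404728  0.360880
    2013-01-03       NaN       NaN  0.635237  0.216691
    2013-01-04       NaN  0.876693       NaN  1.468060
    2013-01-05       NaN       NaN       NaN  1.694740
    2013-01-06       NaN  0.786785       NaN  0.177973

3. Filter rows with isin():

    >>> df2 = df.copy()
    >>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    >>> df2
                       A         B         C         D      E
    2013-01-01  0.859619 -0.545903  0.012447  1.257684    one
    2013-01-02  0.119622 -0.484051  0.404728  0.360880    one
    2013-01-03 -0.719234 -0.396174  0.635237  0.216691    two
    2013-01-04 -0.921692  0.876693 -0.670553  1.468060  three
    2013-01-05 -0.300317 -0.011320 -1.376442  1.694740   four
    2013-01-06 -1.903683  0.786785 -0.194179  0.177973  three
    >>> df2[df2['E'].isin(['two', 'four'])]
                       A         B         C         D     E
    2013-01-03 -0.719234 -0.396174  0.635237  0.216691   two
    2013-01-05 -0.300317 -0.011320 -1.376442  1.694740  four

Setting

1. Add a new column. The Series is aligned to df by index, so dates missing from s1 get NaN:

    >>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
    >>> s1
    2013-01-02    1
    2013-01-03    2
    2013-01-04    3
    2013-01-05    4
    2013-01-06    5
    2013-01-07    6
    Freq: D, dtype: int64
    >>> df['F'] = s1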
2. Update a value by label:

    >>> df.at[dates[0], 'A'] = 0

3. Update a value by position:

    >>> df.iat[0, 1] = 0

4. Update a whole column from an array:

    >>> df.loc[:, 'D'] = np.array([5] * len(df))

Result of the steps above:

    >>> df
                       A         B         C  D   F
    2013-01-01  0.000000  0.000000  0.012447  5 NaN
    2013-01-02  0.119622 -0.484051  0.404728  5   1
    2013-01-03 -0.719234 -0.396174  0.635237  5   2
    2013-01-04 -0.921692  0.876693 -0.670553  5   3
    2013-01-05 -0.300317 -0.011320 -1.376442  5   4
    2013-01-06 -1.903683  0.786785 -0.194179  5   5

5. Update values with a where-style condition (every positive value is replaced by its negative):

    >>> df2 = df.copy()
    >>> df2[df2 > 0] = -df2
    >>> df2
                       A         B         C  D   F
    2013-01-01  0.000000  0.000000 -0.012447 -5 NaN
    2013-01-02 -0.119622 -0.484051 -0.404728 -5  -1
    2013-01-03 -0.719234 -0.396174 -0.635237 -5  -2
    2013-01-04 -0.921692 -0.876693 -0.670553 -5  -3
    2013-01-05 -0.300317 -0.011320 -1.376442 -5  -4
    2013-01-06 -1.903683 -0.786785 -0.194179 -5  -5

6. Missing data

pandas uses np.nan to represent missing data. For details, see the Missing Data section.

1. reindex() can change, add, or delete index entries; it returns a copy of the data:

    >>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
    >>> df1.loc[dates[0]:dates[1], 'E'] = 1
    >>> df1
                       A         B         C  D   F   E
    2013-01-01  0.000000  0.000000  0.012447  5 NaN   1
    2013-01-02  0.119622 -0.484051  0.404728  5   1   1
    2013-01-03 -0.719234 -0.396174  0.635237  5   2 NaN
    2013-01-04 -0.921692  0.876693 -0.670553  5   3 NaN

2. Drop every row that contains a missing value:

    >>> df1.dropna(how='any')
                       A         B         C  D  F  E
    2013-01-02  0.119622 -0.484051  0.404728  5  1  1
3. Fill in missing values:

    >>> df1.fillna(value=5)
                       A         B         C  D  F  E
    2013-01-01  0.000000  0.000000  0.012447  5  5  1
    2013-01-02  0.119622 -0.484051  0.404728  5  1  1
    2013-01-03 -0.719234 -0.396174  0.635237  5  2  5
    2013-01-04 -0.921692  0.876693 -0.670553  5  3  5

4. Get a boolean mask that marks the missing values:

    >>> pd.isnull(df1)
                    A      B      C      D      F      E
    2013-01-01  False  False  False  False   True  False
    2013-01-02  False  False  False  False  False  False
    2013-01-03  False  False  False  False  False   True
    2013-01-04  False  False  False  False  False   True

7. Operations

For details, see the Basic section on Binary Ops.

- Statistics (operations generally exclude missing data)

1. Mean of each column:

    >>> df.mean()
    A   -0.620884
    B    0.128655
    C   -0.198127
    D    5.000000
    F    3.000000
    dtype: float64

2. Mean of each row:

    >>> df.mean(1)
    2013-01-01    1.253112
    2013-01-02    1.208060
    2013-01-03    1.303966
    2013-01-04    1.456889
    2013-01-05    1.462384
    2013-01-06    1.737785
    Freq: D, dtype: float64

3. Operating on objects with different dimensions requires alignment first; pandas then applies the operation along the specified dimension:

    >>> s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
    >>> s
    2013-01-01   NaN
    2013-01-02   NaN
    2013-01-03     1
    2013-01-04     3
    2013-01-05     5
    2013-01-06   NaN
    Freq: D, dtype: float64
    >>> df.sub(s, axis='index')
                       A         B         C   D   F
    2013-01-01       NaN       NaN       NaN NaN NaN
    2013-01-02       NaN       NaN       NaN NaN NaN
    2013-01-03 -1.719234 -1.396174 -0.364763   4   1
    2013-01-04 -3.921692 -2.123307 -3.670553   2   0
    2013-01-05 -5.300317 -5.011320 -6.376442   0  -1
    2013-01-06       NaN       NaN       NaN NaN NaN

Here, aligning means matching rows on the date index. shift(2) moves the data two positions forward along the time axis. sub is subtraction, and any operation with NaN yields NaN.

- Apply

1. Apply a function to the data:

    >>> df.apply(np.cumsum)
                       A         B         C   D   F
    2013-01-01  0.000000  0.000000  0.012447   5 NaN
    2013-01-02  0.119622 -0.484051  0.417175  10   1
    2013-01-03 -0.599612 -0.880225  1.052412  15   3
    2013-01-04 -1.521304 -0.003532  0.381859  20   6
    2013-01-05 -1.821621 -0.014853 -0.994583  25  10
    2013-01-06 -3.725304  0.771932 -1.188763  30  15
    >>> df.apply(lambda x: x.max() - x.min())
    A    2.023304
    B    1.360744
    C    2.011679
    D    0.000000
    F    4.000000
    dtype: float64

PS: cumsum is the cumulative sum.

- Histogramming

For details, see Histogramming and Discretization.

    >>> s = pd.Series(np.random.randint(0, 7, size=10))
    >>> s
    >>> s.value_counts()

s prints the ten random integers, and value_counts() returns how often each distinct value occurs, most frequent first. The exact numbers differ on every run.

pandas comes with a set of string processing methods that make it easy to operate on each element, as shown below.
(For details, see Vectorized String Methods.)

- String methods

    >>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
    >>> s.str.lower()
    0       a
    1       b
    2       c
    3    aaba
    4    baca
    5     NaN
    6    caba
    7     dog
    8     cat
    dtype: object

8. Merging

- Concat

pandas provides many ways to combine Series, DataFrame, and Panel objects easily. For details, see the Merging section.

Use concat() to concatenate pandas objects:

    >>> df = pd.DataFrame(np.random.randn(10, 4))
    >>> df
              0         1         2         3
    0 -0.199614  1.914485  0.396383 -0.295306
    1 -0.061961 -1.352883  0.266751 -0.874132
    2  0.346504 -2.328099 -1.492250  0.095392
    3  0.187115  0.562740 -1.677737 -0.224807
    4 -1.422599 -1.028044  0.789487  0.578698
    5  0.439478 -0.592229 -0.706395  1.965318
    6 -0.205641 -0.649465  0.060258  0.816516
    7 -2.168725 -2.487189  0.595373  0.612208
    8  0.207634  0.512572  0.806940 -1.022504
    9  0.764893  0.736081  1.008404 -2.032126
    >>> pieces = [df[:3], df[3:7], df[7:]]
    >>> pd.concat(pieces)

pd.concat(pieces) prints the same ten rows as df, because the three slices are stacked back together in their original order.
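The split-and-concatenate round trip above can be checked in a few lines. This is a minimal sketch rather than part of the original walkthrough: the seed and the name `rebuilt` are assumptions added only to make the run repeatable and readable.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, only so the sketch is repeatable
df = pd.DataFrame(np.random.randn(10, 4))

# Split the frame into three row slices, as in the text above.
pieces = [df[:3], df[3:7], df[7:]]

# concat stacks the pieces along the row axis and keeps their index labels.
rebuilt = pd.concat(pieces)

print(rebuilt.shape)          # (10, 4)
print(list(rebuilt.index))    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print((rebuilt.values == df.values).all())
```

Because concat keeps the original row labels, the pieces must be passed in order to get the original frame back; passing ignore_index=True would renumber the rows instead.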
- Join

SQL-style merge operations. For details, see Database style joining.

Example:

    >>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
    >>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
    >>> left
       key  lval
    0  foo     1
    1  foo     2
    >>> right
       key  rval
    0  foo     4
    1  foo     5
    >>> pd.merge(left, right, on='key')
       key  lval  rval
    0  foo     1     4
    1  foo     1     5
    2  foo     2     4
    3  foo     2     5

Because the key foo appears twice on each side, every left row matches every right row, giving 2 x 2 = 4 result rows.

Example:

    >>> left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
    >>> right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
    >>> left
       key  lval
    0  foo     1
    1  bar     2
    >>> right
       key  rval
    0  foo     4
    1  bar     5
    >>> pd.merge(left, right, on='key')
       key  lval  rval
    0  foo     1     4
    1  bar     2     5

- Append

For details, see Appending.

    >>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
    >>> df
              A         B         C         D
    0 -1.710447  2.541720 -0.654403  0.132077
    1  0.667796 -1.124769 -0.430752 -0.244731
    2  1.555865 -0.483805  0.066114 -0.409518
    3  1.171798  0.036219 -0.515065  0.860625
    4 -0.834051 -2.178128 -0.345627  0.819392
    5 -0.354886  0.161204  1.465532  1.879841
    6  0.560888  1.208905  1.301983  0.799084
    7 -0.770196  0.307691  1.212200  0.909137
    >>> s = df.iloc[3]
    >>> df.append(s, ignore_index=True)
              A         B         C         D
    0 -1.710447  2.541720 -0.654403  0.132077
    1  0.667796 -1.124769 -0.430752 -0.244731
    2  1.555865 -0.483805  0.066114 -0.409518
    3  1.171798  0.036219 -0.515065  0.860625
    4 -0.834051 -2.178128 -0.345627  0.819392
    5 -0.354886  0.161204  1.465532  1.879841
    6  0.560888  1.208905  1.301983  0.799084
    7 -0.770196  0.307691  1.212200  0.909137
    8  1.171798  0.036219 -0.515065  0.860625

Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions the same result is obtained with pd.concat.
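For readers on pandas 2.0 or later, where DataFrame.append no longer exists, the append step above can be written with pd.concat. This is a minimal sketch under stated assumptions: the seed and the name `result` are illustrative and not from the original.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, only so the sketch is repeatable
df = pd.DataFrame(np.random.randn(8, 4), columns=["A", "B", "C", "D"])

s = df.iloc[3]  # row 3 as a Series indexed by column name

# s.to_frame().T turns the Series back into a one-row DataFrame;
# ignore_index=True renumbers the rows 0..8, like append(..., ignore_index=True).
result = pd.concat([df, s.to_frame().T], ignore_index=True)

print(result.shape)    # (9, 4)
print(result.tail(2))  # the last row repeats the values of row 3
```

Wrapping the Series in a one-row frame matters: concatenating a bare Series with a DataFrame would add it as a new column rather than a new row.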
9. Grouping

group by covers one or more of these steps:

- Splitting: divide the data into groups
- Applying: apply a function to each group
- Combining: combine the results into a data structure

For details, see the Grouping section. An example follows after the object-creation steps.

3. Object creation

For details, see the Data Structure Intro.

1. Create a Series by passing a list; pandas creates a default integer index:

    >>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
    >>> s
    0     1
    1     3
    2     5
    3   NaN
    4     6
    5     8
    dtype: float64

2. Create a DataFrame by passing a numeric array, a datetime index, and column labels:

    >>> dates = pd.date_range('20130101', periods=6)
    >>> dates
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    >>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    >>> df
                       A         B         C         D
    2013-01-01  0.859619 -0.545903  0.012447  1.257684
    2013-01-02  0.119622 -0.484051  0.404728  0.360880
    2013-01-03 -0.719234 -0.396174  0.635237  0.216691
    2013-01-04 -0.921692  0.876693 -0.670553  1.468060
    2013-01-05 -0.300317 -0.011320 -1.376442  1.694740
    2013-01-06 -1.903683  0.786785 -0.194179  0.177973

PS: np.random.randn(6, 4) creates a 6-row, 4-column array of random numbers.

3. Create a DataFrame by passing a dict of objects that can be converted to a series-like structure:

    >>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    ...                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    ...                    'C': np.random.randn(8),
    ...                    'D': np.random.randn(8)})
    >>> df
         A      B         C         D
    0  foo    one -0.655020 -0.671592
    1  bar    one  0.846428  1.884603
    2  foo    two -2.280466  0.725070
    3  bar  three  1.166448 -0.208171
    4  foo    two -0.257124 -0.850319
    5  bar    two -0.654609  1.258091
    6  foo    one -1.624213 -0.383978
    7  foo  three -0.523944  0.114338

Group, then sum each group:

    >>> df.groupby('A').sum()
                C         D
    A
    bar  1.358267  2.934523
    foo -5.340766 -1.066481

On pandas 2.0 and later the text column B is summed as well (its strings are concatenated); pass numeric_only=True to sum() to get the output shown here.

Group by several columns, then sum:

    >>> df.groupby(['A', 'B']).sum()
                      C         D
    A   B
    bar one    0.846428  1.884603
        three  1.166448 -0.208171
        two   -0.654609  1.258091
    foo one   -2.279233 -1.055570
        three -0.523944  0.114338
        two   -2.537589 -0.125249

10. Reshaping

For details, see Hierarchical Indexing and Reshaping.

stack:

    >>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ...                     ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
    >>> tuples
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'),
     ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
    >>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    >>> index
    MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
               labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
               names=[u'first', u'second'])
    >>> df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
    >>> df
                         A         B
    first second
    bar   one    -0.922059 -0.918091
          two    -0.825565 -0.880527
    baz   one     0.241927  1.130320
          two    -0.261823  2.463877
    foo   one    -0.220328 -0.519477
          two    -1.028038 -0.543191
    qux   one     0.315674  0.558686
          two     0.422296  0.241212
    >>> df2 = df[:4]
    >>> df2
                         A         B
    first second
    bar   one    -0.922059 -0.918091
          two    -0.825565 -0.880527
    baz   one     0.241927  1.130320
          two    -0.261823  2.463877
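The index construction used above can be tried on its own. A minimal sketch, assuming the same two label lists as in the text; the names `arrays` and `same` are illustrative, and the from_arrays comparison is an addition to show an equivalent route, not something the original uses.

```python
import pandas as pd

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]

# zip(*arrays) pairs the i-th entries of both lists into one tuple per row.
tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

print(tuples[:2])             # [('bar', 'one'), ('bar', 'two')]
print(index.nlevels)          # 2
print(list(index.levels[0]))  # ['bar', 'baz', 'foo', 'qux']

# Building the index straight from the lists gives an equal index.
same = index.equals(pd.MultiIndex.from_arrays(arrays, names=["first", "second"]))
print(same)
```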
PS: zip(*...) turns the lists into a list of tuples, which from_tuples converts into a hierarchical (multi-level) index.

The stack() method moves the column labels of a DataFrame into a new innermost index level, producing a Series:

    >>> stacked = df2.stack()
    >>> stacked
    first  second
    bar    one     A   -0.922059
                   B   -0.918091
           two     A   -0.825565
                   B   -0.880527
    baz    one     A    0.241927
                   B    1.130320
           two     A   -0.261823
                   B    2.463877
    dtype: float64

The unstack() method reverses stack(). By default it unstacks the last level, but any level can be chosen:

    >>> stacked.unstack()
                         A         B
    first second
    bar   one    -0.922059 -0.918091
          two    -0.825565 -0.880527
    baz   one     0.241927  1.130320
          two    -0.261823  2.463877
    >>> stacked.unstack(1)
    second        one       two
    first
    bar   A -0.922059 -0.825565
          B -0.918091 -0.880527
    baz   A  0.241927 -0.261823
          B  1.130320  2.463877
    >>> stacked.unstack(0)
    first          bar       baz
    second
    one    A -0.922059  0.241927
           B -0.918091  1.130320
    two    A -0.825565 -0.261823
           B -0.880527  2.463877

Pivot tables

For details, see Pivot Tables.

Example:

    >>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
    ...                    'B': ['A', 'B', 'C'] * 4,
    ...                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
    ...                    'D': np.random.randn(12),
    ...                    'E': np.random.randn(12)})

PS: a pivot table can be thought of as freely recombining the rows and columns of a table, similar to a cross-tab report.

Building a pivot table is very simple:

    >>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
    C             bar       foo
    A     B
    one   A -1.250611 -1.047274
          B  1.532134 -0.455948
          C  0.125989 -0.500260
    three A  0.623716       NaN
          B       NaN  0.095117
          C -0.348707       NaN
    two   A       NaN  0.390363
          B -0.743466       NaN
          C       NaN  0.792279

11. Time series

pandas can resample simply and efficiently during frequency conversion (for example, converting per-second data into 5-minute data). This is common in financial applications, but not limited to them. For details, see the Time Series section.

Example:

    >>> rng = pd.date_range('1/1/2012', periods=100, freq='S')
    >>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
    >>> ts.resample('5Min').sum()
    2012-01-01    24390
    Freq: 5T, dtype: int64

Recent pandas releases prefer the lowercase aliases 's' and '5min' for these frequencies.
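The resampling step above can be sketched with deterministic data so that the result is easy to verify. This is a minimal sketch, not the original example: the values 0..99 replace the random integers, the name `five_min` is illustrative, and the lowercase aliases "s" and "5min" are used in place of the older uppercase spellings.

```python
import numpy as np
import pandas as pd

# 100 one-second timestamps starting at midnight on 2012-01-01.
rng = pd.date_range("2012-01-01", periods=100, freq="s")

# Deterministic values (0..99) instead of random ones, so the total is known.
ts = pd.Series(np.arange(100), index=rng)

# Group the samples into 5-minute bins and add up each bin.
five_min = ts.resample("5min").sum()

print(len(five_min))      # 1, since 100 seconds fit inside one 5-minute bin
print(five_min.iloc[0])   # 4950, the sum of 0..99
```

Swapping sum() for mean(), max(), or another aggregation changes how each bin is summarized without changing the binning itself.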
PS: the randomly generated per-second data is aggregated into 5-minute data.

Time zone representation:

    >>> rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
    >>> ts = pd.Series(np.random.randn(len(rng)), rng)
    >>> ts
    2012-03-06    0.972202
    2012-03-07   -0.839969
    2012-03-08   -0.979993
    2012-03-09   -0.052460
    2012-03-10   -0.487963
    Freq: D, dtype: float64
    >>> ts_utc = ts.tz_localize('UTC')
    >>> ts_utc
    2012-03-06 00:00:00+00:00    0.972202
    2012-03-07 00:00:00+00:00   -0.839969
    2012-03-08 00:00:00+00:00   -0.979993
    2012-03-09 00:00:00+00:00   -0.052460
    2012-03-10 00:00:00+00:00   -0.487963
    Freq: D, dtype: float64

Time zone conversion:

    >>> ts_utc.tz_convert('US/Eastern')
    2012-03-05 19:00:00-05:00    0.972202
    2012-03-06 19:00:00-05:00   -0.839969
    2012-03-07 19:00:00-05:00   -0.979993
    2012-03-08 19:00:00-05:00   -0.052460
    2012-03-09 19:00:00-05:00   -0.487963
    Freq: D, dtype: float64

Converting between time span representations (pandas 2.2 and later spell the month-end frequency 'ME' in date_range):

    >>> rng = pd.date_range('1/1/2012', periods=5, freq='M')
    >>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
    >>> ts
    2012-01-31   -0.681068
    2012-02-29   -0.263571
    2012-03-31    1.268001
    2012-04-30    0.331786
    2012-05-31    0.663572
    Freq: M, dtype: float64
    >>> ps = ts.to_period()
    >>> ps
    2012-01
    2012-02
    2012-03
    20
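The time zone and period conversions above can be sketched with fixed values. This is a minimal sketch under stated assumptions: the values 0.0 to 4.0 replace the random numbers, and the names `ts_eastern` and `periods` are illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2012-03-06", periods=5, freq="D")
ts = pd.Series(np.arange(5.0), index=rng)  # fixed values instead of random ones

# tz_localize attaches a time zone to naive timestamps without shifting the clock.
ts_utc = ts.tz_localize("UTC")

# tz_convert re-expresses the same instants in another zone; the values are untouched.
ts_eastern = ts_utc.tz_convert("US/Eastern")

print(ts_utc.index[0])      # 2012-03-06 00:00:00+00:00
print(ts_eastern.index[0])  # 2012-03-05 19:00:00-05:00

# to_period collapses each timestamp to the calendar month that contains it.
periods = ts.to_period("M")
print(periods.index[0])     # 2012-03
```

Only the index labels change in each step; the data values stay the same, which is why the three outputs in the text above list identical numbers.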