Wine Website Data Analysis
This article analyses data collected from a Greek wine e-shop (houseofwine.gr) to see which wines are most popular. The scraper itself is fairly simple and can be found on the GitHub page (https://github.com/Florents-Tselai/greek-wines-analysis). The focus here is a quick exploratory analysis of the resulting data (1,125 unique labels) using standard Python packages.

The scraper exposes a fairly simple API. First, request the data for a wine page; it is returned as a nice dict, as shown below:

In [2]:
    from houseofwine_gr import get

In [3]:
    get('http://houseofwine.gr/how/megas-oinos-skoura.html')

Out[3]:
    {'ageable': True,
     'alcohol_%': 13.5,
     'avg_rating_%': 82,
     'color': 'Ερυθρός',
     'description': '...',
     'drink_now': False,
     'keep_2_3_years': False,
     'n_votes': 21,
     'name': 'Μέγας Οίνος Σκούρα 2014',
     'price': 20.9,
     'tags': ['Ήπιος', 'Ξηρός', 'Cabernet Sauvignon', 'Αγιωργίτικο'],
     'url': 'http://houseofwine.gr/how/megas-oinos-skoura.html',
     'year': 2014}

Next, we define some matplotlib settings.

Chardonnay appears to be the most popular variety, while Vidal and Sangiovese are the most expensive. Malvasia has the highest rating, but all varieties are very close.

Turning to blends, we use NumPy and scikit-learn to build a co-occurrence matrix of the blended varieties.

In [26]:
    def create_coocurrence_df(docs):
        # Source: https://stackoverflow.com/a/37822989
        count_model = CountVectorizer(lowercase=False)  # default unigram model
        X = count_model.fit_transform(docs)
        Xc = (X.T @ X)  # this is the co-occurrence matrix in sparse csr format
        Xc.setdiag(0)  # sometimes you want to set same-word co-occurrence to 0
        ret = pd.DataFrame(Xc.todense())
        # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
        names = count_model.get_feature_names_out()
        ret.index = ret.columns = list(map(lambda f: f.replace('_', ' '), names))
        return ret

In [27]:
    from sklearn.feature_extraction.text import CountVectorizer
    docs = (df.loc[df.is_blend, 'varieties']
              .map(lambda x: (s.replace(' ', '_') for s in x))
              .map(lambda x: ' '.join(x)))
    coocurrence = create_coocurrence_df(docs)

The code above simply takes us from this:

In [28]:
    df.loc[df.is_blend, 'varieties'].head(10)

Out[28]:
    11    {Ξινόμαυρο, Merlot}
    13    {Ασύρτικο, Μαλαγουζιά}
    16    {..., Merlot, Cabernet Sauvignon}
    22    {Viognier, Syrah}
    23    {Sauvignon Blanc, Semillon}
    29    {Merlot, Cabernet Sauvignon}
    35    {Syrah, Cabernet Sauvignon}
    ...   {Tannat, Ξινόμαυρο}
    ...   {Sangiovese, Merlot}
    ...   {Ασύρτικο, ...}
    Name: varieties, dtype: object

to this:

In [29]:
    [...]

Out[29]:
    [Table: symmetric co-occurrence matrix with rows and columns Cabernet Franc, Cabernet Sauvignon, Chardonnay, Merlot, Pinot Noir, Syrah, Αγιωργίτικο, Ασύρτικο. Most cell values are garbled in this copy; legible entries include Chardonnay / Pinot Noir = 46 and Chardonnay / Ασύρτικο = 8.]

These are the varieties that appear most frequently in blends.

In [30]:
    ax = (coocurrence.sum()
                     .sort_values(ascending=False)
                     .head(10)
                     .sort_values(ascending=True)
                     .plot(kind='barh'))
    ax.set(title='Most Frequent Varieties Appearing in Blends (Top 10)');

[Figure: Most Frequent Varieties Appearing in Blends (Top 10), horizontal bar chart]

Here is a heatmap showing which varieties are usually blended together.

In [31]:
    ax = sns.heatmap(coocurrence, square=True, annot=True, fmt='d', cmap='Blues')
    ax.set(title='Which varieties are usually blended together ?\n'.title());
    plt.gcf().set_size_inches(8, 6)

[Figure: Which Varieties Are Usually Blended Together ? (heatmap)]
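To make it easier to see what create_coocurrence_df computes, here is a minimal, dependency-free sketch of the same idea. It assumes each blend is a set of variety names; the sample blends and the cooccurrence helper are hypothetical illustrations, not part of the original notebook. Each blend becomes a 0/1 row over the vocabulary, the product X^T X counts the blends that contain both varieties, and the diagonal is zeroed just as setdiag(0) does. The notebook joins multi-word names with underscores only so that CountVectorizer treats each variety as a single token; with plain sets that step is not needed.

```python
# Hypothetical sample blends (illustrative data, not taken from the dataset).
blends = [
    {"Merlot", "Cabernet Sauvignon"},
    {"Merlot", "Cabernet Sauvignon", "Syrah"},
    {"Syrah", "Viognier"},
]


def cooccurrence(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is the number of
    blends containing both vocabulary[i] and vocabulary[j]."""
    vocab = sorted(set().union(*docs))
    # Binary incidence matrix X: one row per blend, one column per variety.
    X = [[1 if v in doc else 0 for v in vocab] for doc in docs]
    n = len(vocab)
    # Xc = X^T X, computed entry by entry.
    Xc = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    # Zero the diagonal: a variety trivially co-occurs with itself.
    for i in range(n):
        Xc[i][i] = 0
    return vocab, Xc


vocab, Xc = cooccurrence(blends)
# Column sums rank varieties by how often they share a blend with another one.
totals = {v: sum(Xc[i][j] for i in range(len(vocab))) for j, v in enumerate(vocab)}
```

The column sums play the role of coocurrence.sum() in the bar chart above: a variety scores higher the more often it shares a bottle with other varieties.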
In [32]:
    fig, axes = plt.subplots(nrows=3, figsize=(12, 18))
    for c, ax in zip(['Red', 'White', 'Rosé'], axes):
        # Blends of the current colour, one space-separated document per wine
        docs = (df.loc[df.is_blend & (df.color == c), 'varieties']
                  .map(lambda x: (s.replace(' ', '_') for s in x))
                  .map(lambda x: ' '.join(x)))
        cooc = create_coocurrence_df(docs)
        cmaps = {'Red': 'Reds', 'White': 'Blues', 'Rosé': sns.cubehelix_palette(8)}
        sns.heatmap(cooc, square=True, annot=True, fmt='d', cmap=cmaps.get(c), ax=ax)
        ax.set(title='Usually blended varieties for {}\n'.format(c).title())
    plt.tight_layout()

[Figure: Usually Blended Varieties For Red (heatmap)]
[Figure: Usually Blended Varieties For White (heatmap)]
[Figure: Usually Blended Varieties For Rosé (heatmap)]

In [4]:
    %matplotlib inline
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    import seaborn as sns
    plt.style.use('fivethirtyeight')
    mpl.rcParams['figure.figsize'] = (8, 6)
    mpl.rcParams['font.size'] = 14
    mpl.rcParams['font.family'] = 'Serif'
    mpl.rcParams['figure.facecolor'] = 'white'
    plt.rcParams['axes.facecolor'] = 'white'
    plt.rcParams['axes.grid'] = False
    plt.rcParams['figure.facecolor'] = 'white'

Load the data dump generated by the houseofwine_gr.dump module. The dataset is also available on the GitHub page in .json, .csv and .xlsx formats.

In [5]:
    import pandas as pd
    df = pd.read_json('./data/houseofwine.gr-wines.json', encoding='utf-8')

Here is a view of the data we have:

In [6]:
    df.head()

Out[6]:
       ageable  alcohol_%  ...  drink_now  keep_2_3_years                               name  price  ...    year
    0    False        8.5  ...      False            True             Riesling Spatlese 2013   33.7  ...  2013.0
    1    False       13.5  ...       True           False  Spy Valley - Sauvignon Blanc 2016   18.3  ...  2016.0
    2     True       13.0  ...      False           False                           Cava ...   25.0  ...  2008.0
    3    False       12.5  ...       True           False  Κτήμα Γεροβασιλείου - Λευκός 2016   12.1  ...  2016.0
    4    False       13.5  ...      False            True         Landskroon - Pinotage 2013   13.1  ...  2013.0
Replace empty strings with np.nan so that pandas can handle them more easily.

In [7]:
    from numpy import nan
    df = df.replace('', nan, regex=True)

Rename the columns whose names contain special characters so that they can be used as native DataFrame attribute accessors.

In [8]:
    df = df.rename(columns={'alcohol_%': 'alcohol', 'avg_rating_%': 'avg_rating'}, inplace=False)

We also assign appropriate types to the columns:

In [9]:
    df['alcohol'] = df.alcohol.astype(float)
    df['n_votes'] = df.n_votes.astype(int, errors='ignore')
    df['price'] = df.price.astype(float)
    df['year'] = df.year.astype(int, errors='ignore')

Let's translate the values of the color column from Greek to English.

In [10]:
    df['color'] = df.color.replace({'Λευκός': 'White', 'Ερυθρός': 'Red', 'Ροζέ': 'Rosé'})

Here is the colour histogram of the dataset.

In [11]:
    ax = df['color'].value_counts().plot(kind='bar')
    ax.set(ylabel='Wines', title='Wine Color Frequency');

Here are the distributions of some simple per-wine metrics:

In [12]:
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(ncols=2, nrows=2, figsize=(12, 8))
    df.year.dropna().astype(int).value_counts().plot(kind='bar', ax=ax1)
    ax1.set(title='Production Year Frequency', xlabel='Year');
    # distplot is deprecated in recent seaborn releases; histplot is its replacement
    sns.distplot(df[df.alcohol < 100].alcohol.dropna(), ax=ax2)
    ax2.set(xlabel='Alcohol %', title='Alcohol % Distribution');
    sns.distplot(df[df.price < 100].price.dropna(), ax=ax3)
    ax3.set(xlabel='Price (< 100)')
    [...]

[Output: the first entries of the tags column, each a list of tags such as Cabernet Sauvignon, Merlot, Ήπιος and Ξηρός. Name: tags, dtype: object]

Each tag list appears to carry information about several attributes of a wine (variety, sweetness, and so on). Next, we separate these attributes. The elements of the tags column are first converted from list to set, because that makes the operations simpler: instead of chains of if x in ... else and try ... except IndexError, we can use set operations.

In [14]:
    df['tags'] = df.tags.map(set)

Now a few simple operations extract information about sweetness, mildness and so on. These values are also translated from Greek to English.

In [15]:
    sweetness_values = {'Γλυκός', 'Ημίγλυκος', 'Ξηρός', 'Ημίξηρος'}
    df['sweetness'] = df.tags.map(sweetness_values.intersection).map(lambda x: x.pop() if x else None)
    translations = {'Γλυκός': 'Sweet', 'Ημίγλυκος': 'Semi-Sweet', 'Ξηρός': 'Dry', 'Ημίξηρος': 'Semi-Dry'}
    [...]

    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, squeeze=False)
    for attr, ax in zip(['sweetness', 'color', 'sparkling', 'is_mild'], (ax1, ax2, ax3, ax4)):
        df[attr].value_counts().sort_values(ascending=True).plot(kind='barh', ax=ax)
        # compare strings with ==, not with identity
        attr_str = 'Mildness' if attr == 'is_mild' else attr.title()
        ax.set(xlabel='Number of Wines', title='{} Frequency'.format(attr_str));
    fig.set_size_inches(12, 8)
    fig.tight_layout()

[Figure: horizontal bar charts titled Sweetness Frequency, Color Frequency, Sparkling Frequency and Mildness Frequency; x-axis: Number of Wines]
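The set-based extraction described above can be tried without pandas. The following is a minimal sketch under stated assumptions: the three sample wines are hypothetical, the Greek sweetness tags are the ones used in the translation table, and the use of 'Ήπιος' as the mildness tag is an assumption, since the notebook cell that defines is_mild is not shown here.

```python
# Hypothetical tag sets (illustrative, not taken from the dataset).
wines = [
    {"name": "Wine A", "tags": {"Ξηρός", "Ήπιος", "Merlot"}},
    {"name": "Wine B", "tags": {"Γλυκός", "Muscat"}},
    {"name": "Wine C", "tags": {"Syrah"}},
]

# Greek sweetness tags and their English translations.
SWEETNESS = {
    "Γλυκός": "Sweet",
    "Ημίγλυκος": "Semi-Sweet",
    "Ξηρός": "Dry",
    "Ημίξηρος": "Semi-Dry",
}


def extract_sweetness(tags):
    """Return the translated sweetness tag, or None if the wine has none."""
    # One set intersection replaces a chain of membership tests.
    matches = tags.intersection(SWEETNESS)
    # next(iter(...), None) picks a match without mutating the set.
    return SWEETNESS.get(next(iter(matches), None))


for wine in wines:
    wine["sweetness"] = extract_sweetness(wine["tags"])
    # Assumption: 'Ήπιος' is the tag that marks a mild wine.
    wine["is_mild"] = "Ήπιος" in wine["tags"]
```

Unlike x.pop() in the notebook, this variant leaves the intersection result untouched, which makes no difference here but avoids surprises if the matched set is reused.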
At this point we can (almost) safely assume that all the remaining tags describe the varieties of each wine, so we define a new column to store them.

In [19]:
    non_varietal_tags = {'Αφρώδης', 'Ημιαφρώδης', ...}
    [...]

In [24]:
    df.loc[df.is_varietal, 'variety_type'] = 'Varietal'
    df.loc[df.is_blend, 'variety_type'] = 'Blend'

Let's look at how the wines are distributed between varietals and blends.

In [25]:
    ax = df.is_varietal.replace({True: 'Varietal', False: 'Blend'}).value_counts().plot(kind='barh')
    ax.set(title='How many wines are varietals and how many blends ?');

[Figure: How many wines are varietals and how many blends ? (horizontal bar chart, Blend vs. Varietal)]

Here are some indicative plots.

In [27]:
    fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(12, 14))
    fig.tight_layout()

    # Single-variety wines: take the only element of each variety set
    varieties_hist = df[df.is_varietal].varieties.map(lambda x: next(iter(x))).value_counts()
    varieties_hist.head(10).sort_values(ascending=True).plot(kind='barh', ax=ax1)
    ax1.set(title='Most Frequent Varietal Wines (Top 10)');

    varietals_mean_price = df[['single_variety', 'price']].groupby('single_variety').mean().dropna()['price']
    varietals_mean_price.sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh', ax=ax2)
    ax2.set(title='Most Expensive Varietals', xlabel='', ylabel='');

    varietals_mean_rating = df[['single_variety', 'avg_rating']].groupby('single_variety').mean().dropna()['avg_rating']
    varietals_mean_rating.sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh', ax=ax3);
    ax3.set(title='Top Rated Varietals', ylabel='');
    fig.tight_layout()

[Figure: three horizontal bar charts titled Most Frequent Varietal Wines (Top 10), Most Expensive Varietals and Top Rated Varietals]
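The "mean price per variety, then top N" step behind the Most Expensive Varietals chart can be sketched without pandas. This is an illustration under stated assumptions: the records are hypothetical, and a wine is treated as a varietal when its variety set has exactly one element, which is a guess at how is_varietal is defined, since those notebook cells are not shown here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (illustrative values, not taken from the dataset).
wines = [
    {"varieties": {"Chardonnay"}, "price": 12.0},
    {"varieties": {"Chardonnay"}, "price": 18.0},
    {"varieties": {"Syrah"}, "price": 25.0},
    {"varieties": {"Merlot", "Syrah"}, "price": 30.0},  # blend, excluded
    {"varieties": {"Merlot"}, "price": None},           # missing price, skipped
]


def mean_price_by_variety(records, top=10):
    """Mean price per variety over single-variety wines, highest first."""
    prices = defaultdict(list)
    for rec in records:
        # Assumption: a varietal wine has exactly one variety.
        if len(rec["varieties"]) == 1 and rec["price"] is not None:
            (variety,) = rec["varieties"]  # unpack the only element
            prices[variety].append(rec["price"])
    ranked = sorted(
        ((variety, mean(values)) for variety, values in prices.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top]


ranking = mean_price_by_variety(wines)
```

Skipping records with a missing price mirrors the dropna() call in the notebook, so a variety with no priced wines simply does not appear in the ranking.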