Data Analysis of a Wine Website
This article analyzes data obtained from the Greek wine e-shop (a Greek wine website) to see which wines are the most popular. The scraper itself is fairly simple and can be found on the GitHub page (https://github.com/Florents-Tselai/greek-wines-analysis). The author focuses on a quick exploratory analysis of the resulting data (1,125 unique labels) using standard Python packages.

The scraper exposes a fairly simple API. First, request the data for a wine page; it is returned as a nice dict, as shown below:

In [2]:
from houseofwine_gr import get

In [3]:
get('http://houseofwine.gr/how/megas-oinos-skoura.html')

Out[3]:
{'ageable': True,
 'alcohol_%': 13.5,
 'avg_rating_%': 82,
 'color': 'Ερυθρός',
 'description': '…',
 'drink_now': False,
 'keep_2_3_years': False,
 'n_votes': 21,
 'name': 'Μέγας Οίνος Σκούρα 2014',
 'price': 20.9,
 'tags': ['Ήπιος', 'Ξηρός', 'Cabernet Sauvignon', 'Αγιωργίτικο'],
 'url': 'http://houseofwine.gr/how/megas-oinos-skoura.html',
 'year': 2014}

Next, define some matplotlib settings.

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

plt.style.use('fivethirtyeight')
mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['font.size'] = 14
mpl.rcParams['font.family'] = 'Serif'
mpl.rcParams['figure.facecolor'] = 'white'
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.grid'] = False
plt.rcParams['figure.facecolor'] = 'white'

Load the data dump produced by the houseofwine_gr.dump module. The dataset is also available on the GitHub page in .json, .csv and .xlsx formats.

In [5]:
import pandas as pd
df = pd.read_json('./data/houseofwine.gr-wines.json', encoding='utf-8')

Here is a view of the data we have:

In [6]:
df.head()

Out[6]: (the Greek-text columns color, description, tags and url are not shown)
   ageable  alcohol_%  avg_rating_%  drink_now  keep_2_3_years  n_votes  name                               price  year
0  False    8.5                      False      True                     Riesling Spatlese 2013             33.7   2013.0
1  False    13.5                     True       False                    Spy Valley - Sauvignon Blanc 2016  18.3   2016.0
2  True     13.0                     False      False                    Cava …                             25.0   2008.0
3  False    12.5       90            True       False           27       Κτήμα Γεροβασιλείου - Λευκός 2016  12.1   2016.0
4  False    13.5                     False      True                     Landskroon - Pinotage 2013         13.1   2013.0

Replace empty strings with np.nan so that pandas can handle them more easily.

In [7]:
from numpy import nan
# Match only empty or whitespace-only strings; a bare '' pattern would match every string.
df = df.replace(r'^\s*$', nan, regex=True)

Rename the columns whose names contain special characters so that they can be used with native DataFrame attribute access.

In [8]:
df = df.rename(columns={'alcohol_%': 'alcohol', 'avg_rating_%': 'avg_rating'}, inplace=False)

We also assign appropriate types to the columns:

In [9]:
df['alcohol'] = df.alcohol.astype(float)
df['n_votes'] = df.n_votes.astype(int, errors='ignore')
df['price'] = df.price.astype(float)
df['year'] = df.year.astype(int, errors='ignore')

Let's translate the values of the color column from Greek to English.

In [10]:
df['color'] = df.color.replace({'Λευκός': 'White', 'Ερυθρός': 'Red', 'Ροζέ': 'Rosé'})

Here is the color histogram of the dataset.

In [11]:
ax = df['color'].value_counts().plot(kind='bar')
ax.set(ylabel='Wines', title='Wine Color Frequency');

Here are the distributions of a few simple metrics for the wines:

In [12]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(ncols=2, nrows=2, figsize=(12, 8))
df.year.dropna().astype(int).value_counts().plot(kind='bar', ax=ax1)
ax1.set(title='Production Year Frequency', xlabel='Year');
sns.distplot(df[df.alcohol < 100].alcohol.dropna(), ax=ax2)
ax2.set(xlabel='Alcohol %', title='Alcohol % Distribution');
sns.distplot(df[df.price < 100].price.dropna(), ax=ax3)
ax3.set(xlabel='Price (< 100)')
sns.distplot(df.avg_rating.dropna(), ax=ax4)
ax4.set(xlabel='Average Rating');

[Figure: four panels: Production Year Frequency, Alcohol % Distribution, Price (< 100), Average Rating]

As the figure shows, the Average Rating column is almost normally distributed, with a mean (μ) of 85 or more. Kroutoner on Reddit explained why this happens (and corrected an earlier mistake by the author): typical wine ratings run from 50 to 100, not from 0 to 100. So what looks like only half of a distribution is in fact an almost complete one. In addition, wines rated 90 or above are generally considered better and also sell better. This changes the interpretation of the data: most wines are rated good, and only a small fraction are rated very good.

To go further, let's look at the tags column.

In [13]:
df.tags.head(10)

Out[13]:
0    [Riesling, Ήπιος, Ημίγλυκος]
1    [Ήπιος, Sauvignon Blanc, Ξηρός]
2    [Ξηρός, Ξινόμαυρο, Merlot]
3    [Ήπιος, Ξηρός, Ασύρτικο, Μαλαγουζιά]
4    [Ήπιος, Ξηρός, Pinotage]
5    [Ήπιος, Sangiovese, Ξηρός]
6    […, Ξινόμαυρο]
7    [Cabernet Sauvignon, Ήπιος, 13, Ξηρός, Merlot, ...
8    […]
9    [Ήπιος, …, Ξηρός]
Name: tags, dtype: object

Each tag list appears to carry information about several attributes of the wine (variety, sweetness, and so on). Next, the author separates these attributes. The elements of the tags column are converted from lists to sets because that makes the operations simpler: instead of chains of if x in … else and try … except IndexError, we use set operations.

In [14]:
df['tags'] = df.tags.map(set)

Now a few simple operations extract information on sweetness, mildness, and so on. These values are also translated from Greek to English.

In [15]:
sweetness_values = {'Γλυκός', 'Ημίγλυκος', 'Ξηρός', 'Ημίξηρος'}
df['sweetness'] = df.tags.map(sweetness_values.intersection).map(lambda x: x.pop() if x else None)
translations = {'Γλυκός': 'Sweet', 'Ημίγλυκος': 'Semi-Sweet', 'Ξηρός': 'Dry', 'Ημίξηρος': 'Semi-Dry'}
df['sweetness'] = df['sweetness'].replace(translations)

In [16]:
df['sparkling'] = df.tags.map({'Αφρώδης', 'Ημιαφρώδης'}.intersection).map(lambda x: x.pop() if x else None).replace({'Αφρώδης': 'Sparkling', 'Ημιαφρώδης': 'Semi-Sparkling'})
df['sparkling'] = df.sparkling.fillna('Not Sparkling')

In [17]:
df['is_mild'] = df.tags.map(lambda x: 'Ήπιος' in x)

Here is a histogram for each of the four attributes:

In [18]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, squeeze=False)
for attr, ax in zip(['sweetness', 'color', 'sparkling', 'is_mild'], (ax1, ax2, ax3, ax4)):
    df[attr].value_counts().sort_values(ascending=True).plot(kind='barh', ax=ax)
    # Compare strings with ==, not with `is`.
    attr_str = 'Mildness' if attr == 'is_mild' else attr.title()
    ax.set(xlabel='Number of Wines', title='{} Frequency'.format(attr_str));
fig.set_size_inches(12, 8)
fig.tight_layout()

[Figure: four panels: Sweetness Frequency, Color Frequency, Sparkling Frequency, Mildness Frequency; x-axis: Number of Wines]

At this point we can (almost) safely assume that all remaining tags describe the varieties of each wine, so we define a new column to store them.

In [19]:
non_varietal_tags = {'Αφρώδης', 'Ημιαφρώδης', 'Ξηρός', 'Ημίξηρος', 'Γλυκός', 'Ημίγλυκος', 'Ήπιος'}
df['varieties'] = df.tags.map(lambda t: t.difference(non_varietal_tags))

Because of parsing errors, some integers appear in the column; we filter them out.

In [20]:
def is_not_int(x):
    try:
        int(x)
        return False
    except ValueError:
        return True

df['varieties'] = df.varieties.map(lambda x: set(filter(is_not_int, x)))

We can also add a boolean is_varietal column. A wine made from a single variety is called a varietal; a wine made from at least two varieties is called a blend.

In [21]:
df['is_varietal'] = df.varieties.map(set.__len__) == 1

For varietal wines we set a single_variety value; for all non-varietal wines this value is NaN.

In [22]:
df.loc[df.is_varietal, 'single_variety'] = df.loc[df.is_varietal, 'varieties'].map(lambda v: next(iter(v)))

In [23]:
df['is_blend'] = df.varieties.map(set.__len__) >= 2

In [24]:
df.loc[df.is_varietal, 'variety_type'] = 'Varietal'
df.loc[df.is_blend, 'variety_type'] = 'Blend'

Let's see how the Varietal / Blend distribution looks.

In [25]:
ax = df.is_varietal.replace({True: 'Varietal', False: 'Blend'}).value_counts().plot(kind='barh')
ax.set(title='How many wines are varietals and how many blends ?');

[Figure: How many wines are varietals and how many blends ? (horizontal bars for Blend and Varietal)]

Here are a few indicative plots.

In [27]:
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(12, 14))
varieties_hist = df[df.is_varietal].varieties.map(lambda x: next(iter(x))).value_counts()
varieties_hist.head(10).sort_values(ascending=True).plot(kind='barh', ax=ax1)
ax1.set(title='Most Frequent Varietal Wines (Top 10)');
varietals_mean_price = df[['single_variety', 'price']].groupby('single_variety').mean().dropna()['price']
varietals_mean_price.sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh', ax=ax2)
ax2.set(title='Most Expensive Varietals', xlabel='', ylabel='');
varietals_mean_rating = df[['single_variety', 'avg_rating']].groupby('single_variety').mean().dropna()['avg_rating']
varietals_mean_rating.sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh', ax=ax3);
ax3.set(title='Top Rated Varietals', ylabel='');
fig.tight_layout()

[Figure: three panels: Most Frequent Varietal Wines (Top 10), Most Expensive Varietals, Top Rated Varietals]

Chardonnay appears to be the most popular variety, while Vidal and Sangiovese are the most expensive. Malvasia has the highest rating, but all varieties are very close.

Turning to blends, we use a little NumPy and scikit-learn to produce a co-occurrence matrix of the blended varieties.

In [26]:
def create_coocurrence_df(docs):
    # Source: https://stackoverflow.com/a/37822989
    count_model = CountVectorizer(lowercase=False)  # default unigram model
    X = count_model.fit_transform(docs)
    Xc = X.T @ X  # co-occurrence matrix in sparse CSR format
    Xc.setdiag(0)  # sometimes you want to set same-word co-occurrence to 0
    ret = pd.DataFrame(Xc.todense())
    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases.
    ret.index = ret.columns = list(map(lambda f: f.replace('_', ' '), count_model.get_feature_names_out()))
    return ret

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
# Join multi-word variety names with '_' so that each variety stays a single token.
docs = df.loc[df.is_blend, 'varieties'].map(lambda x: (s.replace(' ', '_') for s in x)).map(lambda x: ' '.join(x))
coocurrence = create_coocurrence_df(docs)

The code above simply takes us from this:

In [28]:
df.loc[df.is_blend, 'varieties'].head(10)

Out[28]:
11    {Ξινόμαυρο, Merlot}
13    {Ασύρτικο, Μαλαγουζιά}
16    {…, Merlot, Cabernet Sauvignon}
22    {Viognier, Syrah}
23    {Sauvignon Blanc, Semillon}
29    {Merlot, Cabernet Sauvignon}
35    {Syrah, Cabernet Sauvignon}
…     {Tannat, Ξινόμαυρο}
…     {Sangiovese, Merlot}
…     {Ασύρτικο, Αθήρι}
Name: varieties, dtype: object

to this:

In [29]:

Out[29]:
                    Cabernet Franc  Cabernet Sauvignon  Chardonnay  Merlot  Pinot Noir  Syrah  Αγιωργίτικο  Ασύρτικο
Cabernet Franc                   0                  42           0      52           0      2            1         0
Cabernet Sauvignon              42                   0           0      85           1     24           21         0
Chardonnay                       0                   0           0       0          46      0            0         8
Merlot                          52                  85           0       0           1     29            7         0
Pinot Noir                       0                   1          46       1           0      1            0         0
Syrah                            2                  24           0      29           1      0           11         0
Αγιωργίτικο                      1                  21           0       7           0     11            0         0
Ασύρτικο                         0                   0           8       0           0      0            0         0

These are the varieties that appear most frequently in blends.

In [30]:
ax = coocurrence.sum().sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh')
ax.set(title='Most Frequent Varieties Appearing in Blends (Top 10)');

[Figure: Most Frequent Varieties Appearing in Blends (Top 10)]

Here is a heatmap showing which varieties are usually blended together.

In [31]:
ax = sns.heatmap(coocurrence, square=True, annot=True, fmt='d', cmap='Blues')
ax.set(title='Which varieties are usually blended together ?\n'.title());
plt.gcf().set_size_inches(8, 6)

[Figure: heatmap, Which Varieties Are Usually Blended Together ?]

In [32]:
fig, axes = plt.subplots(nrows=3, figsize=(12, 18))
for c, ax in zip(['Red', 'White', 'Rosé'], axes):
    docs = df.loc[df.is_blend & (df.color == c), 'varieties'].map(lambda x: (s.replace(' ', '_') for s in x)).map(lambda x: ' '.join(x))
    cooc = create_coocurrence_df(docs)
    cmaps = {'Red': 'Reds', 'White': 'Blues', 'Rosé': sns.cubehelix_palette(8)}
    sns.heatmap(cooc, square=True, annot=True, fmt='d', cmap=cmaps.get(c), ax=ax)
    ax.set(title='Usually blended varieties for {}\n'.format(c).title());
plt.tight_layout()

[Figure: three heatmaps: Usually Blended Varieties For Red, Usually Blended Varieties For White, Usually Blended Varieties For Rose]
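To make the blend co-occurrence matrix used in this article more concrete, here is a minimal sketch in plain Python that counts, for every pair of varieties, how many blends contain both. It is not the author's code: the article builds the matrix with CountVectorizer and a sparse matrix product, whereas this sketch swaps in explicit pair counting, which gives the same off-diagonal counts as long as each variety appears at most once per blend. The function name count_pairs and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def count_pairs(blends):
    """Count how many blends contain each unordered pair of varieties."""
    pair_counts = Counter()
    for varieties in blends:
        # Sort so that (a, b) and (b, a) always map to the same key.
        for pair in combinations(sorted(set(varieties)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical toy data, not taken from the dataset.
toy_blends = [
    {"Merlot", "Cabernet Sauvignon"},
    {"Merlot", "Cabernet Sauvignon", "Syrah"},
    {"Syrah", "Viognier"},
]

pair_counts = count_pairs(toy_blends)
for (a, b), n in sorted(pair_counts.items()):
    print(f"{a} + {b}: {n}")
```

Each key is an alphabetically ordered pair, so a symmetric matrix like the one plotted in the heatmaps can be filled by writing each count to both (a, b) and (b, a). The matrix-product approach scales better on large vocabularies; the explicit loop is simply easier to inspect.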