Regression Course Project: A Regression Analysis of the Factors Influencing Domestic Tourism Consumption

I. Problem Introduction

China's tertiary industry has developed rapidly; by 2010 it accounted for 43.14% of gross domestic product. Tourism holds an important place within the tertiary industry and is closely tied to catering, accommodation, leisure, transportation, and related industries. This analysis therefore aims to identify the factors that influence domestic tourism consumption and to build a regression model for it.

II. Model Design

A multiple linear model is fitted first. If the fit is not significant, a log or square-root transformation, a polynomial fit, or another model is used instead.

1. Correlation analysis. First determine which variables are correlated with the dependent variable.
2. Fit the full multiple linear regression model. If the regression equation fails the F test, find the cause and change the model. If some regression coefficients fail their tests, carry out variable selection (step 3), remove some variables, and continue. If all tests are satisfactory, the model is provisionally established and step 3 is skipped.
3. Select variables by stepwise regression and run t tests. If the results are significant, the multiple linear regression model is provisionally established. If some variables still fail their tests, screen variables individually, using the AIC criterion among others to decide which to remove, until every variable passes its t test.
4. Regression diagnostics. Carry out residual analysis: test whether the residuals are normally distributed and whether they are correlated (that is, whether autocorrelation is present); check for outliers and influential observations, for heteroscedasticity, and for multicollinearity. If any of these problems is present, revise the model, reselect the variables, or add or remove observations.
5. Establish the final model.

III. Data

    year    income  number  expense   level    road  rail
    1994   48108.5     524    195.3   320.0  111.78  5.90
    1995   59810.5     629    218.7   345.1  115.70  6.24
    1996   70142.5     640    256.2   377.6  118.58  6.49
    1997   78060.9     644    328.1   394.6  122.64  6.60
    1998   83024.3     695    345.0   417.8  127.85  6.64
    1999   88479.2     719    394.0   452.3  135.17  6.74
    2000   98000.5     744    426.6   491.0  140.27  6.87
    2001       ….2     784    449.5   521.2  169.80  7.01
    2002       ….7     878    441.8   557.6  176.52  7.19
    2003       ….0     870    395.7   596.9  180.98  7.30
    2004       ….8    1102    427.5   645.3  187.07  7.44
    2005       ….5    1212    436.1   695.2  334.52  7.54
    2006       ….9    1394    446.9   761.9  345.70  7.71
    2007       ….0    1610    482.6   843.4  358.37  7.80
    2008       ….7    1712    511.0   916.8  373.02  7.97
    2009       ….5    1902    535.4  1001.6  386.08  8.55
    2010       ….0    2103    598.2  1062.6  400.82  9.12

    year     air  railtran  roadtran  shiptran  airtran   travel
    1994  104.56         …         …     26165     4039   1023.5
    1995  112.90         …         …     23924     5117   1375.7
    1996  116.65     94797         …     22895     5555   1638.4
    1997  142.50     93308         …     22573     5630   2112.7
    1998  150.58     95085         …     20545     5755   2391.2
    1999  152.22         …         …     19151     6094   2831.9
    2000  150.29         …         …     19386     6722   3175.5
    2001  155.36         …         …     18645     7524   3522.4
    2002  163.77         …         …     18693     8594   3878.4
    2003  174.95     97260         …     17142     8759   3442.3
    2004  204.94         …         …     19040    12123   4710.7
    2005  199.85         …         …     20227    13827   5285.9
    2006  211.35         …         …     22047    15968   6229.7
    2007  234.30         …         …     22835    18576   7770.6
    2008  246.18         …         …     20334    19251   8749.3
    2009  234.51         …         …     22314    23052  10183.7
    2010  276.51         …         …     22392    26769  12579.8

Data source: China Statistical Yearbook 2011

Variable descriptions:
Year: year.
Income: gross national income, in 100 million yuan.
Number: number of tourists.
Expense: per-capita tourism expenditure, in yuan.
Level: household consumption level index, with 1978 as the base year.
Road: length of highways, in 10,000 km.
Rail: length of railways, in 10,000 km.
Air: length of civil aviation routes, in 10,000 km.
Roadtran: highway passenger traffic, in 10,000 persons.
Railtran: railway passenger traffic, in 10,000 persons.
Shiptran: waterway passenger traffic, in 10,000 persons.
Airtran: civil aviation passenger traffic, in 10,000 persons.
Travel: total domestic tourism consumption, in 100 million yuan.

IV. Regression Analysis

1. Correlation

Correlation is examined first by drawing a scatterplot matrix.

[Figure: scatterplot matrix of the variables]

The plot shows fairly directly that travel is strongly correlated with most of the variables, the exceptions being road and shiptran. Correlation tests on these two show that travel and road are in fact linearly correlated (correlation coefficient 0.93, p-value = 4.563e-08), whereas travel and shiptran are uncorrelated (p-value = 0.9983). shiptran can therefore be excluded before the regression is fitted.

2. Full regression model

Fitting the multiple regression model directly gives:

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -5.972e+03  3.193e+03  -1.870 0.
    income       2.151e-02  4.779e-03   4.501 0.       *
    number       1.039e+00  1.446e+00   0.719 0.
    expense      6.805e+00  1.124e+00   6.052 0.       *
    level       -5.815e+00  1.261e+00  -4.610 0.       *
    road        -1.468e+00  1.019e+00  -1.441 0.
    rail         6.274e+02  4.462e+02   1.406 0.
    air         -4.155e+00  2.790e+00  -1.490 0.
    railtran     2.524e-02  8.492e-03   2.972 0.       *
    roadtran    -4.093e-04  4.554e-04  -0.899 0.
    airtran      1.058e-01  1.272e-01   0.832 0.
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 84.55 on 6 degrees of freedom
    Multiple R-squared: 0.9998, Adjusted R-squared: 0.9994
    F-statistic: 2462 on 10 and 6 DF, p-value: 5.061e-10

Here R2 = 0.9998 and the p-value of the F test is 5.061e-10, so the regression model as a whole passes its test. Not all of the regression coefficients pass theirs, however, so variable selection is needed.

3. Variable selection

Stepwise regression is run first. It removes the two variables roadtran and number, with the AIC criterion as the main basis for the decision; the AIC after selection is 153.73, its minimum. Checking the resulting regression model:

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -4.393e+03  2.102e+03  -2.090 0.       .
    income       1.898e-02  2.320e-03   8.179 3.72e-05 ***
    expense      7.038e+00  9.369e-01   7.512 6.85e-05 ***
    level       -5.427e+00  1.057e+00  -5.133 0.       *
    road        -1.460e+00  9.339e-01  -1.564 0.
    rail         3.697e+02  2.865e+02   1.290 0.
    air         -3.589e+00  2.496e+00  -1.438 0.
    railtran     2.166e-02  6.843e-03   3.165 0.       *
    airtran      2.032e-01  5.464e-02   3.719 0.       *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 78.95 on 8 degrees of freedom
    Multiple R-squared: 0.9997, Adjusted R-squared: 0.9994
    F-statistic: 3529 on 8 and 8 DF, p-value: 2.252e-13

The regression model has improved: the adjusted multiple R-squared stays at 0.9994 to the printed precision while the residual standard error falls from 84.55 to 78.95, which agrees with the verdict of the AIC criterion. The coefficient tests have also improved, but road, rail, and air still fail. Removing one variable at a time gives:

             Df Sum of Sq    RSS    AIC
    <none>                 49866 153.73
    income    1         …      … 189.75
    expense   1         …      … 187.19
    level     1         …      … 176.50
    road      1     15241  65107 156.26
    rail      1     10380  60246 154.94
    air       1     12886  62752 155.63
    railtran  1     62438      … 165.53
    airtran   1     86215      … 168.79

Removing rail produces the smallest increase in AIC and also the smallest increase in RSS. The coefficient tests of the resulting regression equation are:

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -1.773e+03  5.648e+02  -3.140 0.       *
    income       1.935e-02  2.386e-03   8.112 1.98e-05 ***
    expense      7.977e+00  6.116e-01  13.043 3.77e-07 ***
    level       -5.126e+00  1.069e+00  -4.797 0.       *
    road        -2.214e+00  7.550e-01  -2.933 0.       *
    air         -5.129e+00  2.272e+00  -2.257 0.       .
    railtran     1.495e-02  4.613e-03   3.241 0.       *
    airtran      2.603e-01  3.323e-02   7.832 2.62e-05 ***

Only air fails its test at α = 0.05. If air is removed as well:

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.450e+03  5.683e+02  -4.310 0.00154  **
    income       1.834e-02  2.782e-03   6.593 6.13e-05 ***
    expense      7.465e+00  6.742e-01  11.072 6.21e-07 ***
    level       -5.389e+00  1.261e+00  -4.273 0.00163  **
    road        -2.381e+00  8.921e-01  -2.669 0.02355  *
    railtran     1.933e-02  4.970e-03   3.889 0.00301  **
    airtran      2.451e-01  3.864e-02   6.343 8.42e-05 ***

All regression coefficients now pass their tests, and the regression model is provisionally established.

4. Regression diagnostics

The residuals are computed and a Shapiro-Wilk (W) normality test is applied, giving p-value = 0.9066, so the normality hypothesis cannot be rejected. The plot of standardized residuals against fitted values is:

[Figure: standardized residuals versus fitted values]

The plot also shows that the residuals are spread evenly with no pattern, so the basic assumptions of linear regression are satisfied and there is no autocorrelation. Next:

[Figure: four diagnostic plots]

Taking the four plots above together, observations 11 and 15 may be influential points. Their cause still needs investigation; it may lie in how the statistics were collected or in the method of analysis. Removing them weakens the regression, so they are retained for now.

Multicollinearity is checked next. kappa = 1346.411 > 1000, so multicollinearity is present. The eigenvalues close to zero and their corresponding eigenvectors are:

    0.,,61, 0.2, 0.3, -0.4, 0.5, -0.6, -0. 0.,51, -0.2, 0.3, -0.4, 0.5, -0.6, 0.

This suggests that variables 1, 3, and 6, that is income, level, and airtran, may be severely multicollinear, most likely income and level. This makes economic sense: the higher national income is, the higher the consumption level and the more people travel by air, with the first two related most directly. The cause may therefore be a redundant independent variable, so income, level, and airtran are each removed in turn, the regression is refitted, and kappa is recomputed. Whichever one is removed, kappa falls by roughly half, but only when level is removed is the regression equation almost unaffected:

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -3.824e+03  7.511e+02  -5.091 0.       *
    income       1.217e-02  3.811e-03   3.194 0.       *
    expense      5.483e+00  7.843e-01   6.991 2.3e-05  ***
    road        -4.247e+00  1.247e+00  -3.407 0.       *
    railtran     2.708e-02  7.416e-03   3.651 0.       *
    airtran      1.929e-01  5.876e-02   3.284 0.       *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 155.7 on 11 degrees of freedom
    Multiple R-squared: 0.9985, Adjusted R-squared: 0.9978
    F-statistic: 1450 on 5 and 11 DF, p-value: 4.078e-15

level can therefore be removed. Heteroscedasticity is then tested with the rank correlation method: the rank correlation coefficients between the absolute residuals and the independent variables are 0., 0., 0., 0, 0. None indicates a correlation, so no heteroscedasticity is evident and the model fits well.

5. Final model

Travel = -3.824e+03 + 1.217e-02*income + 5.483*expense - 4.247*road + 2.708e-02*railtran + 1.929e-01*airtran

V. Model Commentary

According to the model, domestic tourism consumption can be modeled and predicted from national income, per-capita tourism expenditure, railway passenger traffic, civil aviation passenger traffic, and highway length, which is consistent with practical intuition. The first two can be summarized as the population's standard of living; the last three reflect national transport infrastructure and happen to cover road, rail, and air. The regression equation is thus broadly consistent with its practical meaning, and the influencing factors are largely identified. Because the result depends on the initial choice of independent variables, however, some important variables may not have been included.

VI. Program Code and Output (language: R)

> x=read.csv("数据.csv",head=T)
> a=x[,2:13]
> plot(a)
> cor.test(road,travel)   # correlation test

	Pearson's product-moment correlation

data:  road and travel
t = 10.0692, df = 15, p-value = 4.563e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0. 0.
sample estimates:
cor
0.

> cor.test(shiptran,travel)

	Pearson's product-moment correlation

data:  shiptran and travel
t = 0.0021, df = 15, p-value = 0.9983
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0. 0.
sample estimates:
cor
0.

> model=lm(travel~income+number+expense+level+road+rail+air+railtran+roadtran+airtran)
> summary(model)   # fit the regression model

Call:
lm(formula = travel ~ income + number + expense + level + road + rail + air + railtran + roadtran + airtran)

Residuals:
    Min      1Q  Median      3Q     Max
-72.549 -44.860   3.562  44.806  90.603

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.972e+03  3.193e+03  -1.870 0.
income       2.151e-02  4.779e-03   4.501 0.       *
number       1.039e+00  1.446e+00   0.719 0.
expense      6.805e+00  1.124e+00   6.052 0.       *
level       -5.815e+00  1.261e+00  -4.610 0.       *
road        -1.468e+00  1.019e+00  -1.441 0.
rail         6.274e+02  4.462e+02   1.406 0.
air         -4.155e+00  2.790e+00  -1.490 0.
railtran     2.524e-02  8.492e-03   2.972 0.       *
roadtran    -4.093e-04  4.554e-04  -0.899 0.
airtran      1.058e-01  1.272e-01   0.832 0.
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 84.55 on 6 degrees of freedom
Multiple R-squared: 0.9998, Adjusted R-squared: 0.9994
F-statistic: 2462 on 10 and 6 DF, p-value: 5.061e-10

> model1=step(model)   # stepwise regression
Start:  AIC=155.17
travel ~ income + number + expense + level + road + rail + air + railtran + roadtran + airtran

           Df Sum of Sq    RSS    AIC
- number    1      3693  46589 154.57
- airtran   1      4948  47844 155.02
<none>                   42897 155.17
- roadtran  1      5775  48671 155.31
- rail      1     14137  57033 158.01
- road      1     14850  57746 158.22
- air       1     15862  58758 158.52
- railtran  1     63136      … 168.55
- income    1         …      … 178.26
- level     1         …      … 178.90
- expense   1         …      … 186.50

Step:  AIC=154.57
travel ~ income + expense + level + road + rail + air + railtran + roadtran + airtran

           Df Sum of Sq    RSS    AIC
- roadtran  1      3276  49866 153.73
<none>                   46589 154.57
- rail      1     11735  58325 156.39
- air       1     15657  62246 157.50
- road      1     17009  63598 157.86
- airtran   1     58169      … 166.34
- railtran  1     64855      … 167.40
- income    1         …      … 176.91
- level     1         …      … 178.18
- expense   1         …      … 189.12

Step:  AIC=153.73
travel ~ income + expense + level + road + rail + air + railtran + airtran

           Df Sum of Sq    RSS    AIC
<none>                   49866 153.73
- rail      1     10380  60246 154.94
- air       1     12886  62752 155.63
- road      1     15241  65107 156.26
- railtran  1     62438      … 165.53
- airtran   1     86215      … 168.79
- level     1         …      … 176.50
- expense   1         …      … 187.19
- income    1         …      … 189.75

> summary(model1)

Call:
lm(formula = travel ~ income + expense + level + road + rail + air + railtran + airtran)

Residuals:
    Min      1Q  Median      3Q     Max
-66.673 -57.766   2.796  46.749  91.039

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.393e+03  2.102e+03  -2.090 0.       .
income       1.898e-02  2.320e-03   8.179 3.72e-05 ***
expense      7.038e+00  9.369e-01   7.512 6.85e-05 ***
level       -5.427e+00  1.057e+00  -5.133 0.       *
road        -1.460e+00  9.339e-01  -1.564 0.
rail         3.697e+02  2.865e+02   1.290 0.
air         -3.589e+00  2.496e+00  -1.438 0.
railtran     2.166e-02  6.843e-03   3.165 0.       *
airtran      2.032e-01  5.464e-02   3.719 0.       *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 78.95 on 8 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9994
F-statistic: 3529 on 8 and 8 DF, p-value: 2.252e-13

> model2=drop1(model1)   # refit with one variable removed at a time
> model2
Single term deletions

Model:
travel ~ income + expense + level + road + rail + air + railtran + airtran
         Df Sum of Sq    RSS    AIC
<none>                 49866 153.73
income    1         …      … 189.75
expense   1         …      … 187.19
level     1         …      … 176.50
road      1     15241  65107 156.26
rail      1     10380  60246 154.94
air       1     12886  62752 155.63
railtran  1     62438      … 165.53
airtran   1     86215      … 168.79

> model3=update(model1,.~.-rail)   # remove rail
> summary(model3)

Call:
lm(formula = travel ~ income + expense + level + road + air + railtran + airtran)

Residuals:
    Min      1Q  Median      3Q     Max
-77.120 -62.739  -7.682  57.073  96.157

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.773e+03  5.648e+02  -3.140 0.       *
income       1.935e-02  2.386e-03   8.112 1.98e-05 ***
expense      7.977e+00  6.116e-01  13.043 3.77e-07 ***
level       -5.126e+00  1.069e+00  -4.797 0.       *
road        -2.214e+00  7.550e-01  -2.933 0.       *
air         -5.129e+00  2.272e+00  -2.257 0.       .
railtran     1.495e-02  4.613e-03   3.241 0.       *
airtran      2.603e-01  3.323e-02   7.832 2.62e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 81.82 on 9 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9994
F-statistic: 3756 on 7 and 9 DF, p-value: 7.348e-15

> model4=update(model3,.~.-air)
> summary(model4)

Call:
lm(formula = travel ~ income + expense + level + road + railtran + airtran)

Residuals:
    Min      1Q  Median      3Q     Max
-165.78  -44.43   12.86   49.24  123.92

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.450e+03  5.683e+02  -4.310 0.00154  **
income       1.834e-02  2.782e-03   6.593 6.13e-05 ***
expense      7.465e+00  6.742e-01  11.072 6.21e-07 ***
level       -5.389e+00  1.261e+00  -4.273 0.00163  **
road        -2.381e+00  8.921e-01  -2.669 0.02355  *
railtran     1.933e-02  4.970e-03   3.889 0.00301  **
airtran      2.451e-01  3.864e-02   6.343 8.42e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 97.14 on 10 degrees of freedom
Multiple R-squared: 0.9995, Adjusted R-squared: 0.9991
F-statistic: 3108 on 6 and 10 DF, p-value: 9.282e-16

> resid=resid(model4)
> resid
    1     2     3     4     5     6
  32.   -8.   12.  -83.   50.   47.
    7     8     9    10    11    12
 -54.  -28.  123.   80. -165.   33.
   13    14    15    16    17
 -28.  -44. -112.   96.   49.
> shapiro.test(resid)   # Shapiro-Wilk (W) normality test

	Shapiro-Wilk normality test

data:  resid
W = 0.9756, p-value = 0.9066

> y=predict(model4)
> rstandard=rstandard(model4)
> plot(