Application Example of Linear Regression Models in SAS_EM
Chapter 6 Predictive Modeling Using Regression

Introduction to Regression
Regression in Enterprise Miner

6.1 Introduction to Regression

The Regression node in Enterprise Miner does either linear or logistic regression depending upon the measurement level of the target variable.

Linear regression is done if the target variable is an interval variable. In linear regression the model predicts the mean of the target variable at the given values of the input variables.

Logistic regression is done if the target variable is a discrete variable.
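
To make the statement that linear regression predicts the mean of the target at given input values more concrete, here is a minimal Python sketch. It is not Enterprise Miner code: the function names `fit_simple_ols` and `predict` and the toy data are assumptions made purely for illustration, and only a single input variable is handled.

```python
# Minimal sketch (illustrative toy data, not from the MYRAW data set):
# ordinary least squares with one input variable, solved in closed form.

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing squared error for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted mean of the target at input value x."""
    return intercept + slope * x


# Hypothetical data: two observations at each of two input values.
xs = [1.0, 1.0, 2.0, 2.0]
ys = [2.0, 4.0, 5.0, 7.0]

intercept, slope = fit_simple_ols(xs, ys)
pred_at_1 = predict(intercept, slope, 1.0)  # mean of 2.0 and 4.0
pred_at_2 = predict(intercept, slope, 2.0)  # mean of 5.0 and 7.0
print(pred_at_1, pred_at_2)
```

Because this toy data has only two distinct input values, the fitted line passes exactly through the two group means. With more input values, the prediction is the model's estimate of the mean under the assumption that the relationship is linear.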
In logistic regression the model predicts the probability of a particular level (or levels) of the target variable at the given values of the input variables. Because the predictions are probabilities, which are bounded by 0 and 1 and are not linear in this space, the probabilities must be transformed in order to be adequately modeled. The most common transformation for a binary target is the logit transformation. Probit and complementary log-log transformations are also available in the Regression node.

Recall that one assumption of logistic regression is that the logit transformation of the probabilities of the target variable results in a linear relationship with the input variables.

Regression uses only full cases in the model. This means that any case, or observation, that has a missing value will be excluded from consideration when building the model. As discussed earlier, when there are many potential input variables to be considered, this could result in an unacceptably high loss of data. Therefore, when possible, missing values should be imputed prior to running a regression model.

Other reasons for imputing missing values include the following:
- Decision trees handle missing values directly, whereas regression and neural network models ignore all observations with missing values on any of the input variables. It is more appropriate to compare models built on the same set of observations. Therefore, before doing a regression or building a neural network model, you should perform data replacement, particularly if you plan to compare the results to results obtained from a decision tree model.
- If the missing values are in some way related to each other or to the target variable, the models created without those observations may be biased.
- If missing values are not imputed during the modeling process, observations with missing values cannot be scored with the score code built from the models.

There are three variable selection methods available in the Regression node of Enterprise Miner.

Forward: first selects the best one-variable model. Then it selects the best two-variable model among those that contain the first selected variable. This process continues until it reaches the point where no additional variables have a p-value less than the specified entry p-value.

Backward: starts with the full model. Next, the variable that is least significant, given the other variables, is removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value.

Stepwise: is a modification of the forward selection method. The difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value.

Note: The specified p-values are also known as significance levels.

6.2 Regression in Enterprise Miner

Imputation, Transformation, and Regression

The data for this example is from a nonprofit organization that relies on fundraising campaigns to support its efforts. After analyzing the data, a subset of 19 predictor variables was selected to model the response to a mailing. Two response variables were stored in the data set. One response variable related to whether or not someone responded to the mailing (TARGET_B), and the other response variable measured how much the person actually donated in U.S. dollars (TARGET_D).

Name       Model Role   Measurement Level   Description
AGE        Input        Interval            Donor's age
AVGGIFT    Input        Interval            Donor's average gift
CARDGIFT   Input        Interval            Donor's gifts to card promotions
CARDPROM   Input        Interval            Number of card promotions
FEDGOV     Input        Interval            % of household in federal government
FIRSTT     Input        Interval            Elapsed time since first donation
GENDER     Input        Binary              F=female, M=male
HOMEOWNR   Input        Binary              H=homeowner, U=unknown
IDCODE     ID           Nominal             ID code, unique for each donor
INCOME     Input        Ordinal             Income level (integer values 0-9)
LASTT      Input        Interval            Elapsed time since last donation
LOCALGOV   Input        Interval            % of household in local government
MALEMILI   Input        Interval            % of household males active in the military
MALEVET    Input        Interval            % of household male veterans
NUMPROM    Input        Interval            Total number of promotions
PCOWNERS   Input        Binary              Y=donor owns computer (missing otherwise)
PETS       Input        Binary              Y=donor owns pets (missing otherwise)
STATEGOV   Input        Interval            % of household in state government
TARGET_B   Target       Binary              1=donor to campaign, 0=did not contribute
TARGET_D   Target       Interval            Dollar amount of contribution to campaign
TIMELAG    Input        Interval            Time between first and second donation

Note: The variable TARGET_D is not considered in this chapter, so its model role will be set to Rejected.

Note: A card promotion is one where the charitable organization sends potential donors an assortment of greeting cards and requests a donation for them.

The MYRAW data set in the CRSSAMP library contains 6,974 observations for building and comparing competing models. This data set will be split equally into training and validation data sets for analysis.

Building the Initial Flow and Identifying the Input Data

1. Open a new diagram by selecting File > New > Diagram.
2. On the Diagrams subtab, name the new diagram by right-clicking on Untitled and selecting Rename.
3. Name the new diagram Non-Profit.
4. Add an Input Data Source node to the diagram workspace by dragging the node from the toolbar or from the Tools tab.
5. Add a Data Partition node to the diagram and connect it to the Input Data Source node.
6. To specify the input data, double-click on the Input Data Source node.
7. Click on Select in order to choose the data set.
8. Click on the drop-down arrow and select CRSSAMP from the list of defined libraries.
9. Select the MYRAW data set from the list of data sets in the CRSSAMP library and then select OK.

Observe that this data set has 6,974 observations (rows) and 21 variables (columns). Evaluate (and update, if necessary) the assignments that were made using the metadata sample.

1. Click on the Variables tab to see all of the variables and their respective assignments.
2. Click on the Name column heading to sort the variables by their name. A portion of the table showing the first 10 variables is shown below.

The first several variables (AGE through FIRSTT) have the measurement level interval because they are numeric in the data set and have more than 10 distinct levels in the metadata sample. The model role for all interval