Application Examples of the Linear Regression Model in SAS_EM


Chapter 6 Predictive Modeling Using Regression

    Introduction to Regression
    Regression in Enterprise Miner

6.1 Introduction to Regression

The Regression node in Enterprise Miner does either linear or logistic regression depending upon the measurement level of the target variable.

Linear regression is done if the target variable is an interval variable. In linear regression the model predicts the mean of the target variable at the given values of the input variables.

Logistic regression is done if the target variable is a discrete variable.
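
As a rough illustration of the rule above, the following Python sketch picks the regression type from the target's measurement level and fits a one-input linear regression by ordinary least squares, so that the prediction is the estimated mean of the target at a given input value. It is not Enterprise Miner code: the function names and the toy data are hypothetical, and treating binary, ordinal, and nominal targets as the "discrete" case is an assumption.

```python
from statistics import mean


def regression_type(target_level: str) -> str:
    """Return the kind of regression implied by the target's measurement level.

    Assumption: "discrete" is taken to mean binary, ordinal, or nominal.
    """
    level = target_level.strip().lower()
    if level == "interval":
        return "linear"
    if level in {"binary", "ordinal", "nominal"}:
        return "logistic"
    raise ValueError(f"unsupported measurement level: {target_level!r}")


def fit_simple_linear(xs, ys):
    """Ordinary least squares with one input; returns (intercept, slope)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("input has no variation")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict_mean(intercept, slope, x):
    """Predicted mean of the target at input value x."""
    return intercept + slope * x


if __name__ == "__main__":
    # Hypothetical toy data, used only to exercise the functions.
    intercept, slope = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
    print(regression_type("interval"), predict_mean(intercept, slope, 5))
```

With a discrete target the same dispatch returns "logistic", in which case the quantity being modeled is a probability rather than a mean.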

In logistic regression, the model predicts the probability of a particular level (or levels) of the target variable at the given values of the input variables. Because the predictions are probabilities, which are bounded by 0 and 1 and are not linear in this space, the probabilities must be transformed in order to be adequately modeled. The most common transformation for a binary target is the logit transformation, log(p / (1 - p)). Probit and complementary log-log transformations are also available in the Regression node.

Recall that one assumption of logistic regression is that the logit transformation of the probabilities of the target variable results in a linear relationship with the input variables.

Regression uses only complete cases in the model. This means that any case, or observation, that has a missing value will be excluded from consideration when building the model. As discussed earlier, when there are many potential input variables to be considered, this could result in an unacceptably high loss of data. Therefore, when possible, missing values should be imputed prior to running a regression model.

Other reasons for imputing missing values include the following:

- Decision trees handle missing values directly, whereas regression and neural network models ignore all observations with missing values on any of the input variables. It is more appropriate to compare models built on the same set of observations. Therefore, before doing a regression or building a neural network model, you should perform
data replacement, particularly if you plan to compare the results to results obtained from a decision tree model.
- If the missing values are in some way related to each other or to the target variable, the models created without those observations may be biased.
- If missing values are not imputed during the modeling process, observations with missing values cannot be scored with the score code built from the models.

There are three variable selection methods available in the Regression node of Enterprise Miner.

Forward: first selects the best one-variable model. Then it selects the best two-variable model among those that contain the first selected variable. This process continues until it reaches the point where no additional variables have a p-value less than the specified entry p-value.

Backward: starts with the full model. Next, the variable that is least significant, given the other variables, is removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value.

Stepwise: is a modification of the forward selection method. The difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value.

Note: The specified p-values are also known as significance levels.

6.2 Regression in Enterprise Miner

Imputation, Transformation, and Regression

The data for this example is from a nonprofit organization that relies on fundraising campaigns to
support their efforts. After analyzing the data, a subset of 19 predictor variables was selected to model the response to a mailing. Two response variables were stored in the data set. One response variable related to whether or not someone responded to the mailing (TARGET_B), and the other response variable measured how much the person actually donated in U.S. dollars (TARGET_D).

Name      Model Role  Measurement Level  Description
AGE       Input       Interval           Donor's age
AVGGIFT   Input       Interval           Donor's average gift
CARDGIFT  Input       Interval           Donor's gifts to card promotions
CARDPROM  Input       Interval           Number of card promotions
FEDGOV    Input       Interval           % of household in federal government
FIRSTT    Input       Interval           Elapsed time since first donation
GENDER    Input       Binary             F=female, M=male
HOMEOWNR  Input       Binary             H=homeowner, U=unknown
IDCODE    ID          Nominal            ID code, unique for each donor
INCOME    Input       Ordinal            Income level (integer values 0-9)
LASTT     Input       Interval           Elapsed time since last donation
LOCALGOV  Input       Interval           % of household in local government
MALEMILI  Input       Interval           % of household males active in the military
MALEVET   Input       Interval           % of household male veterans
NUMPROM   Input       Interval           Total number of promotions
PCOWNERS  Input       Binary             Y=donor owns computer (missing otherwise)
PETS      Input       Binary             Y=donor owns pets (missing otherwise)
STATEGOV  Input       Interval           % of household in state government
TARGET_B  Target      Binary             1=donor to campaign, 0=did not contribute
TARGET_D  Target      Interval           Dollar amount of contribution to campaign
TIMELAG   Input       Interval           Time between first and second donation

Note: The variable TARGET_D
is not considered in this chapter, so its model role will be set to Rejected.

Note: A card promotion is one where the charitable organization sends potential donors an assortment of greeting cards and requests a donation for them.

The MYRAW data set in the CRSSAMP library contains 6,974 observations for building and comparing competing models. This data set will be split equally into training and validation data sets for analysis.

Building the Initial Flow and Identifying the Input Data

1. Open a new diagram by selecting File > New > Diagram.
2. On the Diagrams subtab, name the new diagram by right-clicking on Untitled and selecting Rename.
3. Name the new diagram Non-Profit.
4. Add an Input Data Source node to the diagram workspace by dragging the node from the toolbar or from the Tools tab.
5. Add a Data Partition node to the diagram and connect it to the Input Data Source node.
6. To specify the input data, double-click on the Input Data Source node.
7. Click on Select in order to choose the data set.
8. Click on the drop-down arrow and select CRSSAMP from the list of defined libraries.
9. Select the MYRAW data set from the list of data sets in the CRSSAMP library and then select OK.

Observe that this data set has
6,974 observations (rows) and 21 variables (columns). Evaluate (and update, if necessary) the assignments that were made using the metadata sample.

1. Click on the Variables tab to see all of the variables and their respective assignments.
2. Click on the Name column heading to sort the variables by their name.

A portion of the table showing the first 10 variables is shown below.

The first several variables (AGE through FIRSTT) have the measurement level interval because they are numeric in the data set and have more than 10 distinct levels in the metadata sample. The model role for all interval variables is set to input by default. The variables GENDER and HOMEOWNR have the measurement level binary because they have only two different nonmissing levels in the metadata sample. The model role for all binary variables is set to input by default.

The variable IDCODE is listed as a nominal variable because it is a character variable with more than two nonmissing levels in the metadata sample. Furthermore, because it is nominal and the number of distinct values is at least 2000 or greater than 90% of the sample size, the IDCODE variable has the model role id. If the ID value had been stored as a number, it would have been assigned an interval measurement level and an input model role.

The variable INCOME is listed as an ordinal variable because it is a numeric variable with more than two but no more than ten distinct levels in the metadata sample. All ordinal variables are set to have
the input model role.

Scroll down to see the rest of the variables. The variables PCOWNERS and PETS are both identified as unary for their measurement level. This is because there is only one nonmissing level in the metadata sample. It does not matter in this case whether the variable was character or numeric; the measurement level is set to unary and the model role is set to rejected. These variables do have useful information, however, and it is the way in which they are coded that makes them seem useless. Both variables contain the value Y for a person if the person has that condition (pet owner for PETS, computer owner for PCOWNERS) and a missing value otherwise. Decision trees handle missing values directly, so no data modification needs to be done for fitting a decision tree; however, neural networks and regression models ignore any observation with a missing value, so you will need to recode these variables to get at the desired information. For example, you can recode the missing values as a U, for unknown. You do this later using the Replacement node.

Identifying Target Variables

Note that the variables TARGET_B and TARGET_D are the response variables for this analysis. TARGET_B is binary even though it is a numeric variable, because there are only two nonmissing levels in the metadata sample. TARGET_D has the interval measurement level. Both variables are set to have the input model role (just like any other binary or interval variable). This analysis will focus on TARGET_B, so you need to change the model role for TARGET_B to target and the model role for TARGET_D to rejected, because you should not use a response variable as a predictor.

1. Right-click in the Model Role column of the row for TARGET_B.
2. Select Set Model Role > target from the pop-up menu.
3. Right-click in the Model Role column of the row for TARGET_D.
4. Select Set Model Role > rejected from the pop-up menu.

Inspecting Distributions

You can inspect the distribution of values in the metadata sample for each of the variables. To view the distribution of TARGET_B:

1. Right-click in the name column of the row for TARGET_B.
2. Select View distribution of TARGET_B.

Investigate the distribution of the unary variables, PETS and PCOWNERS. What percentage of the observations have pets? What percentage of the observations own personal computers? Recall that these distributions depend on the metadata sample. The numbers may be slightly different if you refresh your metadata sample; however, these distributions are only being used for a quick overview of the data.

Evaluate the distribution of other variables as desired. For example, consider the distribution of INCOME. Some analysts would assign the interval measurement level to this variable. If this were done and the distribution was highly skewed, a transformation of this variable may lead to better results.

Modifying Variable Information

Earlier you changed the model role for TARGET_B to target. Now modify the model role and measurement level for PCOWNERS and PETS.

1. Click and drag to select the rows for PCOWNERS and PETS.
2. Right-click in the Model Role column for one of these variables and select Set Model Role > input from the pop-up menu.
3. Right-click in the measurement column for one of these variables and select Set Measurement > binary from the pop-up menu.

Understanding the Target Profiler for a Binary Target

When building predictive models, the best model often varies according to the criteria used for evaluation. One criterion might suggest that the best model is the one that most accurately predicts the response. Another criterion
might suggest that the best model is the one that generates the highest expected profit. These criteria can lead to quite different results.

In this analysis, you are analyzing a binary variable. The accuracy criterion would choose the model that best predicts whether someone actually responded; however, there are different profits and losses associated with different types of errors. Specifically, it costs less than a dollar to send someone a mailing, but you receive a median of $13.00 from those that respond. Therefore, to send a mailing to someone that would not respond costs less than a dollar, but failing to mail to someone that would have responded costs over $12.00 in lost revenue.

Note: In the example shown here, the median is used as the measure of central tendency. In computing expected profit, it is theoretically more appropriate to use the mean.

In addition to considering the ramifications of different types of errors, it is important to consider whether or not the sample is representative of the population. In your sample, almost 50% of the observations represent responders. In the population, however, the response rate was much closer to 5% than 50%. In order to obtain appropriate predicted values, you must adjust these predicted probabilities based on the prior probabilities. In this situation, accuracy would yield a very poor model because you would be correct approximately 95% of the time in concluding that nobody will respond. Unfortunately, this does not satisfactorily solve your problem of trying to identify the best subset of a population for your mailing.

Note: In the case of rare target events, it is not uncommon to oversample. This is because you tend to get better models when they are built on a data set that is more balanced with respect to the levels of the
target variable.

Using the Target Profiler

When building predictive models, the choice of the best model depends on the criteria you use to compare competing models. Enterprise Miner allows you to specify information about the target that can be used to compare competing models. To generate a target profile for a variable, you must have already set the model role for the variable to target. This analysis focuses on the variable TARGET_B. To set up the target profile for TARGET_B, proceed as follows:

1. Right-click over the row for TARGET_B and select Edit target profile.
2. When the message stating that no target profile was found appears, select Yes to create the profile.

The target profiler opens with the Profiles tab active. You can use the default profile or you can create your own.

3. Select Edit > Create New Profile to create a new profile.
4. Type My Profile as the description for this new profile (currently named Profile1).
5. To set the newly created profile for use, position your cursor in the row corresponding to your new
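
The prior-probability adjustment and the profit reasoning described in the target profiler discussion can be sketched in a few lines of Python. This is an illustration only, not what Enterprise Miner runs internally: the function names are hypothetical, the mailing cost of $0.75 is an assumed stand-in for "less than a dollar", and the 50% and 5% rates are the rounded figures quoted in the text. The adjustment shown is the standard reweighting of a predicted probability by the ratio of population prior to sample proportion for each target level.

```python
def adjust_for_priors(p_sample, sample_rate, population_rate):
    """Rescale a probability estimated on an oversampled data set.

    p_sample:        predicted response probability from the model
    sample_rate:     proportion of responders in the modeling sample
    population_rate: proportion of responders in the population (the prior)
    """
    if not 0.0 <= p_sample <= 1.0:
        raise ValueError("p_sample must be between 0 and 1")
    for name, value in (("sample_rate", sample_rate),
                        ("population_rate", population_rate)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    # Reweight each target level by (population prior / sample proportion).
    responder = p_sample * population_rate / sample_rate
    non_responder = (1.0 - p_sample) * (1.0 - population_rate) / (1.0 - sample_rate)
    return responder / (responder + non_responder)


def expected_profit(p_response, revenue=13.00, cost=0.75):
    """Expected profit of mailing one person.

    A responder yields revenue - cost and a non-responder yields -cost,
    which simplifies to p_response * revenue - cost.
    The cost default is an assumed value, not a figure from the text.
    """
    return p_response * revenue - cost


def should_mail(p_sample, sample_rate=0.50, population_rate=0.05):
    """Mail only when the prior-adjusted expected profit is positive."""
    p_adjusted = adjust_for_priors(p_sample, sample_rate, population_rate)
    return expected_profit(p_adjusted) > 0


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6, 0.8):
        p_adj = adjust_for_priors(p, 0.50, 0.05)
        print(p, round(p_adj, 4), round(expected_profit(p_adj), 2), should_mail(p))
```

With these assumed figures, mailing breaks even at an adjusted response probability of cost / revenue, roughly 0.058, which is why a model judged on expected profit can rank candidates very differently from one judged on accuracy alone.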
