Computing Corner
Logistic Regression Model Using the SAS System
Logistic regression is commonly used to predict the probability that a unit under analysis will experience the event of interest, based on the values of one or more continuous variables, dichotomous (binary) variables, or a combination of both. The dependent variable is dichotomous and is coded either zero (event did not occur) or one (event did occur). The logistic function is used to estimate the probability that the event of interest will occur as a function of the independent variable(s); it is the log odds of the event, rather than the probability itself, that is modeled as a linear function of those variables. Logistic regression techniques are implemented in the LOGISTIC procedure in the STAT module of the SAS System for Information Delivery.
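For a single independent variable x, the model can be written as follows. The symbols used here (beta_0 for the intercept, beta_1 for the slope, and pi(x) for the probability of the event) are notation assumed for illustration; the article itself does not fix a notation.

```latex
\pi(x) = \Pr(Y = 1 \mid x)
       = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
\log\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x .
```

The second expression is the logit, which is the quantity that is linear in x.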
Your dependent variable should be dichotomous, coded zero for 'non-event' and one for 'event'. The coding dictates how PROC LOGISTIC will operate. The zero/one coding scheme is the one commonly used to indicate non-event/event for the dependent variable. By default, however, PROC LOGISTIC will attempt to model (that is, predict the probability of) the lower of the two values, which is usually not the desired result. Specifying the DESCENDING option in the PROC LOGISTIC statement reverses the ordering, so that the probability of the value one (the event) is modeled instead.
Logistic regression, when properly used, develops a model that attempts to predict the probability of an event of interest occurring in the population from which the data under analysis are assumed to have been randomly sampled. The effect of each independent variable is expressed as an odds ratio: the factor by which the odds of the outcome are multiplied for subjects whose value of that variable is one unit higher, compared with subjects whose value is not, with the other variables in the model held fixed. An odds ratio above one indicates increased odds of the outcome, and an odds ratio below one indicates decreased odds.
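The link between a regression coefficient and the odds ratio can be written out in one line. The notation is assumed here for illustration: beta_0 is the intercept, beta_1 the coefficient of a single independent variable x, and odds(x) the odds of the event at the value x.

```latex
\operatorname{odds}(x) = e^{\beta_0 + \beta_1 x}
\quad\Longrightarrow\quad
\frac{\operatorname{odds}(x + 1)}{\operatorname{odds}(x)}
  = \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}}
  = e^{\beta_1} .
```

The ratio does not depend on x, which is why a single odds ratio can summarize the effect of a one-unit increase.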
I will use the Intensive Care Unit (ICU) Survival Study (Hosmer and Lemeshow, 1989, used by permission of authors) as an example. The SAS routine is as follows:
PROC LOGISTIC DATA=ICU DESCENDING;
MODEL STA = AGE;
RUN;
The outcome (dependent variable) is survival after admission to a hospital intensive care unit (STA). The independent variable is the age of the patient in years (AGE). When a single-unit change in an independent variable is not substantively relevant to the analysis at hand, the UNITS statement, available in Release 6.10 and later, can be used to obtain the impact of changes of more than one unit. An example is as follows: UNITS AGE = 5 10 20;. This statement provides odds ratios for increases of 5, 10, and 20 years in patient age.(1)
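The arithmetic behind odds ratios for multi-unit changes can be sketched outside SAS: for a change of c units, the odds ratio is exp(c × b), where b is the estimated coefficient. The Python below is only a minimal illustration of that formula. The coefficient value 0.03 is a made-up placeholder, not an estimate from the ICU data, and the function name is hypothetical.

```python
import math

def odds_ratio_for_change(beta, units):
    """Odds ratio implied by a change of `units` in a predictor whose
    logistic regression coefficient (log odds per unit) is `beta`."""
    return math.exp(beta * units)

# ASSUMPTION: placeholder coefficient for AGE, chosen only for illustration.
beta_age = 0.03

for c in (1, 5, 10, 20):
    ratio = odds_ratio_for_change(beta_age, c)
    print(f"{c:>2}-year increase: odds ratio = {ratio:.3f}")
```

Because the effect is multiplicative, the 10-year odds ratio is the square of the 5-year odds ratio, not twice it.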
Several examples of PROC LOGISTIC can be found in a data set on the mainframe of the Health and Welfare Agency Data Center (HWDC). However, you must use //STEP1 EXEC HWSAS607 in your Job Control Language (JCL): at HWDC the procedure works under version 6.07 but not under SAS608. The simplest example uses data taken from Cox and Snell (1989, pp. 10-11), consisting of the number of ingots not ready for rolling (R) out of the number tested (N) for several combinations of heating time (HEAT) and soaking time (SOAK). The SAS program is below. The SAS output is shown in Table 1.
DATA INGOTS;
INPUT HEAT SOAK R N;
CARDS;
.
.;
RUN;
PROC LOGISTIC DATA = INGOTS;
MODEL R/N = HEAT SOAK;
RUN;
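To see how the fitted model is used, the estimates reported in Table 1 can be plugged into the logistic function by hand. The following Python is a sketch of that arithmetic and is not part of the SAS job. The HEAT and SOAK values passed in are arbitrary illustrative settings, and the function names are hypothetical. It also re-derives the Wald chi-square, AIC, and SC figures from the printed values; these agree with Table 1 only up to the rounding of the printed estimates.

```python
import math

# Estimates and standard errors as printed in Table 1.
ESTIMATES = {"INTERCPT": -5.5592, "HEAT": 0.0820, "SOAK": 0.0568}
STD_ERRORS = {"INTERCPT": 1.1197, "HEAT": 0.0237, "SOAK": 0.3312}

def predicted_probability(heat, soak):
    """Estimated probability that an ingot is not ready for rolling."""
    logit = (ESTIMATES["INTERCPT"]
             + ESTIMATES["HEAT"] * heat
             + ESTIMATES["SOAK"] * soak)
    return 1.0 / (1.0 + math.exp(-logit))

def wald_chi_square(name):
    """Wald chi-square: (estimate / standard error) squared."""
    return (ESTIMATES[name] / STD_ERRORS[name]) ** 2

# Fit criteria for the model with covariates:
#   AIC = -2 log L + 2k,  SC = -2 log L + k ln(n),
# with k = 3 parameters and n = 12 + 375 = 387 trials (Response Profile).
n = 12 + 375
aic_full = 95.346 + 2 * 3
sc_full = 95.346 + 3 * math.log(n)

# ASSUMPTION: HEAT=7 and SOAK=1.0 are illustrative inputs only.
print(f"P(not ready | HEAT=7, SOAK=1.0) = {predicted_probability(7, 1.0):.4f}")
for name in ESTIMATES:
    print(f"{name:<8} Wald chi-square = {wald_chi_square(name):.4f}")
print(f"AIC = {aic_full:.3f}, SC = {sc_full:.3f}")
```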
The MODEL statement of PROC LOGISTIC accepts many options, including PLCL, PLRL, WALDCL, and WALDRL (available in Release 6.10 and later). WALDRL is equivalent to the RISKLIMITS option, which is available in earlier releases. With these options, you can compute confidence limits for regression parameters and odds ratios. The estimated odds ratio for an explanatory variable is computed by exponentiating its parameter estimate when the following conditions are met:
- the explanatory variable does not interact with any other variable,
- the explanatory variable is represented by a single term in the model,
- and a one-unit change in the explanatory variable is relevant.
Similarly, confidence limits for odds ratios are computed by exponentiating the confidence limits for the logistic regression parameters.
There are two available methods of computing confidence limits for logistic regression parameters: the likelihood ratio method and the Wald method. The likelihood ratio method is an iterative process based on the profile likelihood function. The Wald method is a simpler method based on the asymptotic normality of the parameter estimator. The two methods should produce approximately the same results for large samples, but may produce different results for small samples. When the parameter estimate is very large, however, they may produce different results even for large sample sizes.
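The Wald computation is simple enough to reproduce by hand: the limits are the estimate plus or minus a standard normal quantile times the standard error. The sketch below, in Python rather than SAS, applies this to the HEAT estimate and standard error printed in Table 1. The function name is hypothetical, and the resulting limits are an illustration computed here, not SAS output. Profile likelihood limits are not reproduced because they require refitting the model iteratively.

```python
import math
from statistics import NormalDist

def wald_limits(estimate, std_error, alpha=0.05):
    """Two-sided 100*(1 - alpha)% Wald confidence limits for a parameter."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 when alpha = 0.05
    return estimate - z * std_error, estimate + z * std_error

# HEAT estimate and standard error as printed in Table 1.
heat_est, heat_se = 0.0820, 0.0237

for alpha in (0.05, 0.10):
    lower, upper = wald_limits(heat_est, heat_se, alpha)
    # Exponentiating the parameter limits gives limits for the odds ratio.
    print(f"{100 * (1 - alpha):.0f}% limits: "
          f"parameter ({lower:.4f}, {upper:.4f}), "
          f"odds ratio ({math.exp(lower):.3f}, {math.exp(upper):.3f})")
```

Lowering the confidence coefficient (a larger alpha) narrows the interval, and the Wald limits always sit symmetrically about the estimate on the parameter scale.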
The next example uses options in the MODEL statement of the LOGISTIC procedure to compute confidence limits for odds ratios and parameter estimates. It also shows how to use an option to adjust the confidence coefficient for the confidence limits.
First, you must create your SAS data set. The DIABETES data set is created as follows (only the first three observations are shown):
DATA DIABETES;
INPUT PATIENT RELWT GLUFAST GLUTEST INSTEST
SSPG GROUP;
LABEL
relwt = 'Relative weight'
glufast = 'Fasting Plasma Glucose'
glutest = 'Test Plasma Glucose'
instest = 'Plasma Insulin during Test'
sspg = 'Steady State Plasma Glucose'
group = 'Clinical Group';
CARDS;
1 0.81 80 356 124 55 1
2 0.95 97 289 117 76 1
3 0.94 105 319 143 105 1
...
;
Next, you convert the ordinal response (GROUP) to a binary response (GRP). Combine the chemical diabetics and overt diabetics into one group, the event group. The normals are the nonevent group. You set up this SAS data set as follows:
DATA DIABET2;
SET DIABETES;
GRP = (GROUP=1);
RUN;
Finally, you run your logistic regression model to compute both types of confidence limits for the regression parameter estimates and the odds ratios. We will use the options in the MODEL statement as follows:
PROC LOGISTIC DATA=DIABET2;
MODEL GRP = GLUTEST / PLCL PLRL WALDCL WALDRL;
RUN;
Partial output generated by this PROC LOGISTIC program is listed in Table 2.
The explanation of this output is as follows:
- The PLCL option produces the first table labeled Parameter Estimates and 95% Confidence Intervals. The confidence limits are labeled Profile Likelihood Confidence Limits. The construction of these confidence intervals is derived from the asymptotic chi-square distribution of the likelihood ratio test.
- The WALDCL option produces the second table labeled Parameter Estimates and 95% Confidence Intervals. The confidence limits are labeled Wald Confidence Limits. Wald confidence limits are computed by assuming a normal distribution for each parameter estimator. This computation method is less time consuming than the one based on the profile likelihood function because it does not involve an iterative process. However, it is considered to be less accurate, especially for small sample sizes. When you examine the confidence intervals for the parameter estimates, you can see that the Wald confidence intervals are symmetric about the point estimate, but the profile likelihood confidence intervals are asymmetrical. This is because the upper and lower profile likelihood confidence limits are computed separately using an iterative process, and the distribution of a parameter estimate is not symmetric for small sample sizes.
- The PLRL option produces the third table labeled Conditional Odds Ratios and 95% Confidence Intervals. The confidence limits are labeled Profile Likelihood Confidence Limits. Profile likelihood confidence limits for odds ratios are a transformation of the confidence limits that you can produce with the PLCL option for the corresponding regression parameters.
- The WALDRL option produces the fourth table labeled Conditional Odds Ratios and 95% Confidence Intervals. The confidence limits are labeled Wald Confidence Limits. WALDRL is an alias of the RISKLIMITS option, which is available in Release 6.07 and later releases. It requests confidence intervals for the odds ratios of all explanatory variables. Computation of these confidence intervals is based on the asymptotic normality of the parameter estimators. (2)
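The relationships described in this list can be checked directly against the figures in Table 2. The Python sketch below is an outside check, not part of the SAS program, and its variable names are hypothetical. It exponentiates the GLUTEST estimate and both sets of parameter limits, and measures how far each interval's midpoint lies from the point estimate.

```python
import math

# GLUTEST figures as printed in Table 2, parts (1) and (2).
estimate = 0.2153
parameter_limits = {
    "Profile likelihood": (0.0918, 0.5073),
    "Wald": (0.0171, 0.4136),
}

odds_ratio = math.exp(estimate)
print(f"Odds ratio for a one-unit increase in GLUTEST: {odds_ratio:.3f}")

odds_ratio_limits = {}
midpoint_offsets = {}
for method, (lower, upper) in parameter_limits.items():
    # Odds ratio limits are the exponentiated parameter limits.
    odds_ratio_limits[method] = (math.exp(lower), math.exp(upper))
    # A symmetric interval has its midpoint at the point estimate.
    midpoint_offsets[method] = (lower + upper) / 2.0 - estimate
    or_lower, or_upper = odds_ratio_limits[method]
    print(f"{method}: odds ratio limits ({or_lower:.3f}, {or_upper:.3f}), "
          f"midpoint minus estimate = {midpoint_offsets[method]:+.4f}")
```

The exponentiated values match parts (3) and (4) of Table 2 to three decimals. The Wald midpoint coincides with the estimate to within rounding, while the profile likelihood interval extends further above the estimate than below it.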
Footnotes:
(1) Information from the lecture titled "Getting Started With PROC LOGISTIC: A Beginning Tutorial" by Andrew Karp (SAS Consultant with Sierra Information Services, San Francisco, CA) at the State of California Health and Welfare Agency Data Center Statistical Users Group on July 12, 1995.
(2) SAS program and information from Example 2 (pp. 27-29) in the new SAS Institute book titled "Logistic Regression Examples Using the SAS System," Version 6, First Edition (1995).
TABLE 1: The LOGISTIC Procedure (SAS Output)

Data Set: WORK.INGOTS
Response Variable (Events): R
Response Variable (Trials): N
Number of Observations: 19
Link Function: Logit

Response Profile

Ordered   Binary
  Value   Outcome     Count
      1   EVENT          12
      2   NO EVENT      375

Criteria for Assessing Model Fit

             Intercept   Intercept and
Criterion         Only      Covariates   Chi-Square for Covariates
AIC            108.988         101.346   .
SC             112.947         113.221   .
-2 LOG L       106.988          95.346   11.643 with 2 DF (p=0.0030)
Score                .               .   15.109 with 2 DF (p=0.0005)

Analysis of Maximum Likelihood Estimates

                Parameter   Standard         Wald         Pr >   Standardized
Variable   DF    Estimate      Error   Chi-Square   Chi-Square       Estimate
INTERCPT    1     -5.5592     1.1197      24.6504       0.0001              .
HEAT        1      0.0820     0.0237      11.9453       0.0005       0.449368
SOAK        1      0.0568     0.3312       0.0294       0.8639       0.029509

Association of Predicted Probabilities and Observed Responses

Concordant = 64.4%     Somers' D = 0.460
Discordant = 18.4%     Gamma     = 0.555
Tied       = 17.2%     Tau-a     = 0.028

TABLE 2: Diabetes Data (SAS Output)

(1) Parameter Estimates and 95% Confidence Intervals

                          Profile Likelihood
                          Confidence Limits
           Parameter
Variable    Estimate        Lower        Upper
INTERCPT    -90.4017       -213.2     -38.6425
GLUTEST       0.2153       0.0918       0.5073

(2) Parameter Estimates and 95% Confidence Intervals

                                 Wald
                          Confidence Limits
           Parameter
Variable    Estimate        Lower        Upper
INTERCPT    -90.4017       -173.8      -7.0490
GLUTEST       0.2153       0.0171       0.4136

(3) Conditional Odds Ratios and 95% Confidence Intervals

                               Profile Likelihood
                               Confidence Limits
                      Odds
Variable     Unit    Ratio       Lower       Upper
GLUTEST    1.0000    1.240       1.096       1.661

(4) Conditional Odds Ratios and 95% Confidence Intervals

                                      Wald
                               Confidence Limits
                      Odds
Variable     Unit    Ratio       Lower       Upper
GLUTEST    1.0000    1.240       1.017       1.512

Contributed by Ronald Ridley, California Department of Health Services