Computing Corner
Logistic Regression Model Using the SAS System


Logistic regression is commonly used to predict the probability that a unit under analysis will experience the event of interest. The independent variables may be continuous, dichotomous (binary), or a combination of both. The dependent variable is dichotomous and is coded either zero (event did not occur) or one (event did occur). The model expresses the log odds of the event as a linear function of the independent variable(s), and the logistic function converts that linear predictor into the estimated probability that the event will occur.

Logistic regression techniques are implemented in the LOGISTIC procedure in the STAT module of the SAS System for Information Delivery. The coding of the dependent variable dictates how PROC LOGISTIC will operate. The zero/one scheme is the one commonly used to indicate non-event/event. By default, however, PROC LOGISTIC models (that is, predicts the probability of) the lower of the two values, which is usually not the desired result. The DESCENDING option used in the first example below reverses this ordering.
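For a single independent variable, the logistic function described above can be written out explicitly. The symbols here (p for the probability of the event, x for the independent variable, beta_0 and beta_1 for the intercept and slope) are notation introduced for illustration and do not appear in the source.

```latex
\begin{align*}
\operatorname{logit}(p) &= \ln\frac{p}{1-p} = \beta_0 + \beta_1 x \\
p &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\end{align*}
```

The first line is linear in x on the log odds scale. The second line is the same model solved for p, which always lies between 0 and 1.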

Logistic regression, when properly used, develops a model that predicts the probability of an event of interest occurring in the population from which the data under analysis are assumed to have been randomly sampled. The effect of each independent variable is expressed as an odds ratio: the factor by which the odds of the outcome are multiplied when that variable increases by one unit, comparing subjects with the increase to otherwise similar subjects without it.
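The odds ratio interpretation can be sketched for a model with one independent variable x. The notation is assumed for illustration only: p(x) is the probability of the event at x, beta_0 and beta_1 are the intercept and slope on the log odds scale, and c is the size of the increase in x.

```latex
\begin{align*}
\text{odds}(x) &= \frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x} \\
\mathrm{OR}_c &= \frac{\text{odds}(x + c)}{\text{odds}(x)} = e^{c\,\beta_1}
\end{align*}
```

With c = 1 this is the odds ratio for a one-unit increase. A value above 1 means the odds of the event rise as x increases, and a value below 1 means they fall.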

I will use the Intensive Care Unit (ICU) Survival Study (Hosmer and Lemeshow, 1989, used by permission of authors) as an example. The SAS routine is as follows:

PROC LOGISTIC DATA=ICU DESCENDING;
   MODEL STA = AGE;
RUN;

The outcome (dependent variable) is survival after admission to a hospital intensive care unit (STA). The independent variable is the age of the patient in years (AGE). When a single-unit change in an independent variable is not substantively relevant to the analysis at hand, the UNITS statement, available in Version 6.10 and above, gives the impact of changes of more than one unit. For example, UNITS AGE = 5 10 20; provides odds ratios for increments of 5, 10, and 20 years in patient age.(1)
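Placed in context, the UNITS example above might look like the following. This is a minimal sketch, assuming the same ICU data set and variables and a release that supports UNITS (Version 6.10 or later according to the text).

```sas
/* Sketch only: assumes the ICU data set with variables STA and AGE
   from the example above. */
PROC LOGISTIC DATA=ICU DESCENDING;
   MODEL STA = AGE;
   UNITS AGE = 5 10 20;   /* odds ratios for 5-, 10-, and 20-year increases in AGE */
RUN;
```

Each requested odds ratio corresponds to exponentiating the AGE parameter estimate multiplied by the stated number of years.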

Several examples of PROC LOGISTIC can be found in a data set on the mainframe of the Health and Welfare Agency Data Center (HWDC). To run them, however, you must use //STEP1 EXEC HWSAS607 in your Job Control Language (JCL); that is, the procedure works in Version 6.07 at HWDC but does not work with SAS608. The simplest program uses data taken from Cox and Snell (1989, pp. 10-11), consisting of the number of ingots not ready for rolling (R) out of N tested, for a number of combinations of heating time and soaking time. The SAS program is below, and the SAS output is shown in Table 1.

DATA INGOTS;
   INPUT HEAT SOAK R N;
   CARDS;
   .
   .;
RUN;
PROC LOGISTIC DATA = INGOTS;
   MODEL R/N = HEAT SOAK;
RUN;

The MODEL statement of PROC LOGISTIC has many options, such as PLCL, PLRL, WALDCL, and WALDRL (available in Release 6.10 and later). WALDRL is equivalent to the RISKLIMITS option, which is available in earlier releases. With these options, you can compute confidence limits for regression parameters and odds ratios. Estimated odds ratios are computed by exponentiating the parameter estimates of the logistic regression model, provided certain conditions on the model are met. Similarly, confidence limits for odds ratios are computed by exponentiating the confidence limits for the logistic regression parameters.

There are two available methods of computing confidence limits for logistic regression parameters: the likelihood ratio method and the Wald method. The likelihood ratio method is an iterative process based on the profile likelihood function. The Wald method is a simpler method based on the asymptotic normality of the parameter estimator. These two methods should produce approximately the same results for large samples, but may produce different results for small samples. When the parameter estimate is very large, however, the two methods may produce different results even for large sample sizes.
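The Wald computation is simple enough to write out. In the sketch below, the symbols (beta-hat for the parameter estimate, SE for its estimated standard error, z for the standard normal quantile, and the subscripts L and U for lower and upper limits) are notation assumed for illustration. The numbers are the GLUTEST values reported in Table 2 of this article.

```latex
\begin{align*}
\text{Wald limits for } \beta:\quad
  & \hat\beta \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\beta) \\
\text{Limits for the odds ratio}:\quad
  & \left[\, e^{\hat\beta_L},\; e^{\hat\beta_U} \,\right] \\
\text{GLUTEST odds ratio}:\quad
  & e^{0.2153} \approx 1.240 \\
\text{GLUTEST Wald limits}:\quad
  & e^{0.0171} \approx 1.017, \qquad e^{0.4136} \approx 1.512 \\
\text{GLUTEST profile likelihood limits}:\quad
  & e^{0.0918} \approx 1.096, \qquad e^{0.5073} \approx 1.661
\end{align*}
```

The Wald limits 0.0171 and 0.4136 lie about equally far from the estimate 0.2153, while the profile likelihood limits 0.0918 and 0.5073 do not, which is the asymmetry discussed with Table 2.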

The next example uses options in the MODEL statement of the LOGISTIC procedure to compute confidence limits for odds ratios and parameter estimates. It also shows how to use an option to adjust the confidence coefficient for the confidence limits.

First, you must create your SAS data set. The DIABETES data set is created as follows (only the first data lines are shown):

DATA DIABETES; 
  INPUT PATIENT RELWT GLUFAST GLUTEST INSTEST 
        SSPG GROUP;
  LABEL 
   relwt   = 'Relative weight'             
   glufast = 'Fasting Plasma Glucose'        
   glutest = 'Test Plasma Glucose'           
   instest = 'Plasma Insulin during Test'       
   sspg    = 'Steady State Plasma Glucose'    
   group   = 'Clinical Group'; 
  CARDS; 
     1 0.81  80  356 124   55  1                          
     2 0.95  97  289 117   76  1         
     3 0.94 105  319 143  105  1           
      ... 
      ;

Next, you convert the ordinal response (GROUP) to a binary response (GRP). Combine the chemical diabetics and overt diabetics into one group, the event group; the normals are the nonevent group. The Boolean expression (GROUP=1) evaluates to 1 when GROUP equals 1 and to 0 otherwise. You set up this SAS data set as follows:

DATA DIABET2;
   SET DIABETES;
   GRP = (GROUP=1);
RUN;
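A quick cross-tabulation is one way to confirm the recoding before fitting the model. This is a minimal sketch, assuming the DIABET2 data set created above; PROC FREQ is not used in the source example.

```sas
/* Sketch: cross-tabulate the original and recoded responses.
   Each GROUP value should fall in exactly one GRP column. */
PROC FREQ DATA=DIABET2;
   TABLES GROUP*GRP / NOROW NOCOL NOPERCENT;
RUN;
```

The NOROW, NOCOL, and NOPERCENT options suppress percentages so that only the cell counts are printed.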
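Finally, you run your logistic regression model to compute both types of confidence limits for the regression parameter estimates and the odds ratios. Because DESCENDING is not specified, PROC LOGISTIC models the lower value of the response, GRP=0, as the event. The options are given in the MODEL statement as follows: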

PROC LOGISTIC DATA=DIABET2;
   MODEL GRP=GLUTEST / PLCL PLRL WALDCL WALDRL;
RUN;

Partial output generated by this PROC LOGISTIC program is listed in Table 2.

The explanation of this output is as follows:
  1. The PLCL option produces the first table, labeled Parameter Estimates and 95% Confidence Intervals. The confidence limits are labeled Profile Likelihood Confidence Limits. These confidence intervals are derived from the asymptotic chi-square distribution of the likelihood ratio test.
  2. The WALDCL option produces the second table, also labeled Parameter Estimates and 95% Confidence Intervals. The confidence limits are labeled Wald Confidence Limits. Wald confidence limits are computed by assuming a normal distribution for each parameter estimator. This method is less time-consuming than the one based on the profile likelihood function because it does not involve an iterative process. However, it is considered less accurate, especially for small sample sizes. When you examine the confidence intervals for the parameter estimates, you can see that the Wald confidence intervals are symmetric about the point estimate, but the profile likelihood confidence intervals are asymmetric. This is because the upper and lower profile likelihood confidence limits are computed separately using an iterative process, and the distribution of a parameter estimate is not symmetric for small sample sizes.
  3. The PLRL option produces the third table, labeled Conditional Odds Ratios and 95% Confidence Intervals. The confidence limits are labeled Profile Likelihood Confidence Limits. Profile likelihood confidence limits for odds ratios are obtained by exponentiating the confidence limits that the PLCL option produces for the corresponding regression parameters.
  4. The WALDRL option produces the fourth table labeled Conditional Odds Ratios and 95% Confidence Intervals. The confidence limits are labeled Wald Confidence Limits. WALDRL is an alias of the RISKLIMITS option, which is available in Release 6.07 and later releases. It requests confidence intervals for the odds ratios of all explanatory variables. Computation of these confidence intervals is based on the asymptotic normality of the parameter estimators. (2)
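All four tables use the default 95% confidence coefficient. A minimal sketch of requesting 90% limits instead is given here, assuming the ALPHA= option of the MODEL statement and the same DIABET2 data set and GLUTEST variable; no output for this run appears in the source.

```sas
/* Sketch only: ALPHA=0.10 requests 100(1-0.10)% = 90% confidence limits
   for the tables produced by PLCL, PLRL, WALDCL, and WALDRL. */
PROC LOGISTIC DATA=DIABET2;
   MODEL GRP=GLUTEST / PLCL PLRL WALDCL WALDRL ALPHA=0.10;
RUN;
```

A smaller confidence coefficient gives narrower intervals around the same point estimates.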


Footnotes:
  1. Information from the lecture titled "Getting Started With PROC LOGISTIC: A Beginning Tutorial" by Andrew Karp (SAS Consultant with Sierra Information Services, San Francisco, CA) at the State of California Health and Welfare Agency Data Center Statistical Users Group on July 12, 1995.
  2. SAS Program and information from example 2 (pp. 27-29) in the new SAS Institute book titled "Logistic Regression Examples Using the SAS System" Version 6, First Edition (1995).



TABLE 1: The LOGISTIC Procedure (SAS Output)

Data Set: WORK.INGOTS
Response Variable (Events): R
Response Variable (Trials): N
Number of Observations: 19
Link Function: Logit

                 Response Profile

        Ordered    Binary
          Value    Outcome      Count
              1    EVENT           12
              2    NO EVENT       375

             Criteria for Assessing Model Fit

                             Intercept
                Intercept       and
    Criterion     Only       Covariates   Chi-Square for Covariates

    AIC          108.988      101.346         .
    SC           112.947      113.221         .
    -2 LOG L     106.988       95.346     11.643 with 2 DF (p=0.0030)
    Score           .            .        15.109 with 2 DF (p=0.0005)

             Analysis of Maximum Likelihood Estimates

              Parameter  Standard     Wald        Pr >     Standardized
Variable  DF   Estimate    Error   Chi-Square  Chi-Square    Estimate

INTERCPT   1    -5.5592    1.1197    24.6504     0.0001        .
HEAT       1     0.0820    0.0237    11.9453     0.0005      0.449368
SOAK       1     0.0568    0.3312     0.0294     0.8639      0.029509

    Association of Predicted Probabilities and Observed Responses

         Concordant = 64.4%          Somers' D = 0.460
         Discordant = 18.4%          Gamma     = 0.555
         Tied       = 17.2%          Tau-a     = 0.028

TABLE 2: Diabetes Data (SAS Output)
                                         
(1) Parameter Estimates and 95% Confidence Intervals 
                 
                                Profile Likelihood    
                                 Confidence Limits   
                Parameter                   
    Variable     Estimate        Lower      Upper      
    INTERCPT    -90.4017        -213.2   -38.6425   
    GLUTEST       0.2153        0.0918     0.5073  
 
(2) Parameter Estimates and 95% Confidence Intervals  

                                        Wald       
                                 Confidence Limits   
                Parameter                             
    Variable     Estimate        Lower      Upper     
    INTERCPT    -90.4017        -173.8    -7.0490     
    GLUTEST       0.2153        0.0171     0.4136      
 
(3) Conditional Odds Ratios and 95% Confidence Intervals 
                      
                                Profile Likelihood    
                                 Confidence Limits   
                         Odds                      
    Variable     Unit   Ratio    Lower      Upper      
    GLUTEST    1.0000   1.240    1.096      1.661  
      
(4) Conditional Odds Ratios and 95% Confidence Intervals 
                                 
                                        Wald          
                                 Confidence Limits     
                         Odds                           
    Variable     Unit   Ratio    Lower      Upper     
    GLUTEST    1.0000   1.240    1.017      1.512

Contributed by Ronald Ridley, California Department of Health Services
