All the statistical methods we have developed so far have been for quantitative dependent variables, measured on more-or-less continuous scales. The assumptions of linear regression—in particular, that the mean value of the population at any combination of the independent variables be a linear function of the independent variables and that the variation about the plane of means be normally distributed—required that the dependent variable be measured on a continuous scale. In contrast, because we did not need to make any assumptions about the nature of the independent variables, we could incorporate qualitative or categorical information (such as whether or not a Martian was exposed to secondhand tobacco smoke) into the independent variables of the regression model. There are, however, many times when we would like to evaluate the effects of multiple independent variables on a qualitative dependent variable, such as the presence or absence of a disease. Because the methods that we have developed so far depend strongly on the continuous nature of the dependent variable, we will have to develop a new approach to deal with the problem of regression with a qualitative dependent variable.
To meet this need, we will develop two related statistical techniques, logistic regression in this chapter and the Cox proportional hazards model in Chapter 13. Logistic regression is used when we are seeking to predict a dichotomous outcome* from one or more independent variables, all of which are known at a given time.* The Cox proportional hazards model is used when we are following individuals for varying lengths of time to see when events occur and how the pattern of events over time is influenced by one or more additional independent variables.
To develop these two techniques, we need to address three related issues.
We need a dependent variable that represents the two possible qualitative outcomes in the observations. We will use a dummy variable, which takes on values of 0 and 1, as the dependent variable.
We need a way to estimate the coefficients in the regression model because the ordinary least-squares criterion we have used so far is not relevant when we have a qualitative dependent variable. We will use maximum likelihood estimation.
We need statistical hypothesis tests for the goodness of fit of the regression model and for whether the individual coefficients in the model differ significantly from zero, as well as confidence intervals for the individual coefficients.
In addition, we will need to develop specific mathematical models that describe the underlying population from which the observations were drawn.
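As a preview of how these pieces fit together, the following is a minimal sketch rather than the procedure developed in this chapter. It assumes the logistic function as the population model, uses a small hypothetical data set invented purely for illustration, and maximizes the likelihood with Newton-Raphson iteration, which is one common numerical approach. The function name `fit_logistic_ml` and all variable names are assumptions made for this example.

```python
import numpy as np

def fit_logistic_ml(x, y, n_iter=25, tol=1e-10):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1*x))) by maximum likelihood.

    Hypothetical helper for illustration; uses Newton-Raphson iteration.
    """
    X = np.column_stack([np.ones(len(x)), x])  # intercept column plus predictor
    b = np.zeros(X.shape[1])                   # start from all-zero coefficients
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))       # predicted probabilities
        gradient = X.T @ (y - p)               # slope of the log likelihood
        information = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(information, gradient)
        b = b + step
        if np.max(np.abs(step)) < tol:         # stop once the estimates settle
            break
    # Re-evaluate everything at the final estimates.
    p = 1.0 / (1.0 + np.exp(-X @ b))
    information = X.T @ (X * (p * (1.0 - p))[:, None])
    log_likelihood = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # Approximate standard errors from the inverse information matrix.
    se = np.sqrt(np.diag(np.linalg.inv(information)))
    return b, se, log_likelihood

# Hypothetical data: x is a quantitative independent variable and y is a
# dummy variable coded 1 if the outcome is present and 0 if it is absent.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

b, se, loglik = fit_logistic_ml(x, y)
print("coefficients:", b)
print("approximate standard errors:", se)
print("log likelihood:", loglik)
```

The dependent variable enters only as the 0/1 dummy variable `y`, no sum of squared residuals is minimized, and the approximate standard errors are the raw material for the hypothesis tests and confidence intervals mentioned above.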