Up to this point we have formulated models for multiple linear regression and used these models to describe how a dependent variable depends on one or more independent variables. By defining new variables as nonlinear functions of the original variables, such as logarithms or powers, and including interaction terms in the regression equation, we have been able to account for some nonlinear relationships between the variables. By defining appropriate dummy variables, we have been able to account for shifts in the relationship between the dependent and independent variables in the presence or absence of some condition. In each case, we estimated the coefficients in the regression equation, which, in turn, could be interpreted as the sensitivity of the dependent variable to changes in the independent variables. We could also test a variety of statistical hypotheses to obtain information on whether different treatments affected the dependent variable. All these powerful techniques rest on the assumptions that we made at the outset concerning the population from which the observations were drawn.
We now turn our attention to procedures to ensure that the data reasonably match these underlying assumptions. A mismatch between the data and the assumptions of the analysis indicates either that there are erroneous data or that the regression equation does not accurately describe the underlying processes. The size and nature of the violations of the assumptions can provide a framework in which to identify and correct erroneous data or to revise the regression equation.
Before continuing, let us recapitulate the assumptions that underlie what we have accomplished so far.
The mean of the population of the dependent variable at a given combination of values of the independent variables varies linearly as the independent variables vary.
For any given combination of values of the independent variables, the possible values of the dependent variable are distributed normally.
The standard deviation of the dependent variable about its mean at any given values of the independent variables is the same for all combinations of values of the independent variables.
The deviations of all members of the population from the plane of means are statistically independent, that is, the deviation associated with one member of the population has no effect on the deviations associated with other members.
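To make these four assumptions concrete, the following minimal Python sketch simulates a population that satisfies all of them, fits a straight line by least squares, and computes the residuals. It is an illustration only: it uses a single independent variable for simplicity, and the function names (simulate, fit_line, residuals) and the parameter values (intercept 2.0, slope 0.5, standard deviation 1.0, 200 observations) are hypothetical choices, not taken from any example in the text.

```python
import random
import statistics


def simulate(n, b0, b1, sigma, seed=0):
    """Draw n observations from a population that meets all four assumptions."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # The mean of y is linear in x; each deviation from that mean is an
    # independent draw from a normal distribution with the same standard
    # deviation (sigma) at every value of x.
    ys = [b0 + b1 * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares intercept and slope for one independent variable."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed value minus the value predicted by the fitted line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical population parameters chosen only for illustration.
xs, ys = simulate(n=200, b0=2.0, b1=0.5, sigma=1.0)
b0_hat, b1_hat = fit_line(xs, ys)
res = residuals(xs, ys, b0_hat, b1_hat)

# Crude check of constant spread: compare the residual standard deviation
# in the lower and upper halves of the x values.
pairs = sorted(zip(xs, res))
half = len(pairs) // 2
sd_low = statistics.stdev([r for _, r in pairs[:half]])
sd_high = statistics.stdev([r for _, r in pairs[half:]])

print("estimated intercept and slope:", b0_hat, b1_hat)
print("residual SD (low x, high x):", sd_low, sd_high)
```

Because the simulated population obeys the assumptions, the two residual standard deviations should be of similar size and the estimates should fall near the values used to generate the data. Replacing the straight-line mean in simulate with a curved one, or letting sigma grow with x, is a simple way to see how a violation would show up in the residuals.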
The first assumption boils down to the statement that the regression model (i.e., the regression equation) is correctly specified or, in other words, that the regression equation reasonably describes the relationship between the mean value of the dependent variable and the independent variables. We have already encountered this problem, for example, in our analysis of the relationship between sperm reactive oxygen species production and cell phone electromagnetic signals or the relationship between heat loss and temperature in gray seals in Chapter 2 and in the study of protein synthesis in newborns and adults in Chapter 3. In both cases, we fit a straight line through the data, ...