# Chapter Five: Multicollinearity and What to Do About It

Multiple regression allows us to study how several independent variables act together to determine the value of a dependent variable. The coefficients in the regression equation quantify the nature of these dependencies. Moreover, we can compute the standard errors associated with each of these regression coefficients to quantify the precision with which we estimate how the different independent variables affect the dependent variable. These standard errors also permit us to conduct hypothesis tests about whether the different proposed independent variables affect the dependent variable at all. The conclusions we draw from regression analyses will be unambiguous when the independent variables in the regression equation are *statistically independent* of each other, that is, when the value of one of the independent variables does not depend on the values of any of the other independent variables. Unfortunately, as we have already seen in Chapter 3, the independent variables often contain at least some redundant information and so tend to vary together, a situation called *multicollinearity.* Severe multicollinearity indicates that a substantial part of the information in one or more of the independent variables is redundant, which makes it difficult to separate the effects of the different independent variables on the dependent variable.

The resulting ambiguity is reflected quantitatively as reduced precision of the estimates of the parameters in the regression equation. As a result, the standard errors associated with the regression coefficients will be inflated, and the values of the coefficients themselves may be unreasonable and can even have a sign opposite to the one we would expect based on other information about the system under study. Indeed, large standard errors or unreasonable parameter estimates are a qualitative suggestion of multicollinearity.
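The inflation of the standard errors can be seen in a small simulation. The following is a minimal sketch, not taken from the text: the sample size, the true coefficients (1, 2, 3), the correlation values, and the helper names `ols_standard_errors` and `simulate` are all illustrative assumptions. It fits the same two-variable regression twice, once with essentially uncorrelated independent variables and once with strongly correlated ones, and compares the standard errors of the coefficients.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Fit y on X by least squares (intercept added here); return estimates and standard errors."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # design matrix with an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficient estimates
    resid = y - A @ beta
    dof = n - A.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # estimate of the residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance matrix of the estimates
    return beta, np.sqrt(np.diag(cov))

def simulate(rho, n=200, seed=0):
    """Hypothetical data: x1 and x2 have population correlation rho (assumed values)."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.standard_normal(n)  # illustrative true model
    return ols_standard_errors(np.column_stack([x1, x2]), y)

beta_indep, se_indep = simulate(rho=0.0)          # nearly independent predictors
beta_collinear, se_collinear = simulate(rho=0.99) # severe multicollinearity

print("standard errors, rho = 0.00:", np.round(se_indep, 3))
print("standard errors, rho = 0.99:", np.round(se_collinear, 3))
```

Under these assumptions, the standard errors of the two slope coefficients should come out several times larger in the correlated case, even though the sample size and the residual variance are the same in both fits.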

Although multicollinearity causes problems in interpreting the regression coefficients, it does not affect the usefulness of a regression equation for purely empirical description of data or prediction of new observations if we make no interpretations based on the values of the individual coefficients. It is only necessary that the data used for future prediction have the same multicollinearities and range as the data originally used to estimate the regression coefficients. In contrast, if the goal is meaningful estimation of parameters in the regression equation or identification of a model structure, one must deal with multicollinearity.
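The distinction between prediction and interpretation can also be illustrated numerically. The sketch below is an assumption-laden illustration rather than a procedure from the text: the true coefficients, sample sizes, correlation values, and the helper names `make_data` and `rmse` are hypothetical. It repeatedly fits a regression to strongly collinear data and then predicts two kinds of new observations: ones that share the original correlation between the independent variables, and ones in which that correlation is absent. Only the collinearity pattern is changed here; the condition on the range of the data is not examined.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_BETA = np.array([1.0, 2.0, 3.0])  # intercept, x1, x2 (arbitrary illustrative values)

def make_data(n, rho):
    """Draw n observations whose two predictors have population correlation rho."""
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    A = np.column_stack([np.ones(n), x1, x2])
    y = A @ TRUE_BETA + rng.standard_normal(n)   # residual standard deviation of 1
    return A, y

def rmse(A, y, beta):
    """Root-mean-square prediction error of the fitted coefficients on (A, y)."""
    return float(np.sqrt(np.mean((y - A @ beta) ** 2)))

same, shifted = [], []
for _ in range(200):                               # repeat to average over sampling noise
    A_train, y_train = make_data(50, rho=0.99)     # collinear training data
    beta_hat, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_same, y_same = make_data(1000, rho=0.99)     # new data, same collinearity
    A_shift, y_shift = make_data(1000, rho=0.0)    # new data, collinearity removed
    same.append(rmse(A_same, y_same, beta_hat))
    shifted.append(rmse(A_shift, y_shift, beta_hat))

rmse_same = float(np.mean(same))
rmse_shifted = float(np.mean(shifted))
print("mean prediction RMSE, same collinearity:   ", round(rmse_same, 3))
print("mean prediction RMSE, collinearity removed:", round(rmse_shifted, 3))
```

In this setup, the prediction error for new data with the same collinearity should stay close to the residual standard deviation of 1, whereas it should grow noticeably when the new data no longer follow the pattern present in the fitting data, because the imprecisely estimated individual coefficients then matter.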

As we did with outliers in Chapter 4, we will develop *diagnostic* techniques to identify multicollinearity and assess its importance as well as techniques to reduce or eliminate the effects of multicollinearity on the results of the regression analysis.

Multicollinearity among the independent variables arises for two reasons.

*Structural multicollinearity* is a mathematical artifact due to creating new independent variables, as we often did in Chapters 3 and 4, from other independent variables, such as by introducing powers of an independent variable or by introducing interaction terms as the product ...