Our discussion of regression analysis to this point has been based on the premise that we have correctly identified all the relevant independent variables. Given these independent variables, we have concentrated on investigating whether it was necessary to transform these variables or to consider interaction terms, evaluating data points for undue influence (in Chapter 4), and resolving ambiguities that arise when some of the variables contain redundant information (in Chapter 5). It turns out that, in addition to such analyses of data using a predefined model, multiple regression analysis can be used as a tool to screen potential independent variables and select the subset of them that makes up the “best” regression model. As a general principle, we wish to identify the simplest model, with the smallest number of independent variables, that describes the data adequately.

Procedures known as *all possible subsets regression* and *stepwise regression* permit you to use the data to guide you in the formulation of the regression model. *All possible subsets regression* involves forming all possible combinations of the potential independent variables, computing the regression associated with each combination, and then selecting the one with the best characteristics. *Stepwise regression* sequentially enters independent variables into the regression equation, one at a time, in the order that most improves the equation’s predictive ability, or removes them when doing so does not significantly degrade that ability. These methods are particularly useful for screening data sets that contain many independent variables in order to identify a smaller subset of variables that determine the value of the dependent variable.
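The two procedures can be sketched in a few lines of code. The following is a minimal illustration using only NumPy, on synthetic data, with the Bayesian information criterion (BIC) standing in for “best characteristics”; this is one possible criterion among several, and real statistical packages offer many more, along with formal entry and removal tests for the stepwise case.

```python
# Illustrative sketch: all-possible-subsets and forward stepwise selection.
# BIC is used as the selection criterion here; this is an assumption for the
# sake of the example, not the only reasonable choice.
import itertools
import numpy as np

def bic(y, X):
    """Fit OLS with an intercept and return the BIC (lower is better)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least-squares fit
    resid = y - Xd @ beta
    ss_res = resid @ resid
    k = Xd.shape[1]                                # parameters, incl. intercept
    return n * np.log(ss_res / n) + k * np.log(n)

def best_subset(y, X, names):
    """All possible subsets: enumerate every nonempty combination of columns
    and return (score, variable names) for the best-scoring one."""
    p = X.shape[1]
    best = (np.inf, ())
    for size in range(1, p + 1):
        for cols in itertools.combinations(range(p), size):
            score = bic(y, X[:, list(cols)])
            if score < best[0]:
                best = (score, tuple(names[c] for c in cols))
    return best

def forward_select(y, X, names):
    """Forward stepwise: at each step, add the variable that most lowers the
    criterion; stop when no addition improves it."""
    remaining = list(range(X.shape[1]))
    chosen, current = [], np.inf
    while remaining:
        best_score, best_j = min((bic(y, X[:, chosen + [j]]), j)
                                 for j in remaining)
        if best_score >= current:
            break
        current = best_score
        chosen.append(best_j)
        remaining.remove(best_j)
    return tuple(names[j] for j in chosen)

# Synthetic data: the true model uses only x0 and x2; x1 and x3 are
# pure-noise candidate predictors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

names = ["x0", "x1", "x2", "x3"]
_, chosen = best_subset(y, X, names)
fwd = forward_select(y, X, names)
print("all subsets:", chosen, " forward stepwise:", fwd)
```

Note the cost difference the text alludes to: all possible subsets fits \(2^p - 1\) regressions, which quickly becomes expensive as the number of candidate variables \(p\) grows, whereas forward selection fits at most on the order of \(p^2\).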

Although these methods can be very helpful, there are two important caveats: First, the list of candidate independent variables to be screened must include all the variables that actually predict the dependent variable; and, second, there is no single criterion that will always identify the “best” regression equation. In short, these methods, although very powerful, require considerable thought and care to produce meaningful results.

The problem of specifying the correct regression model has several elements, only some of which can be addressed directly via statistical calculations. The first, and most important, element follows not from statistical calculations but from knowledge of the substantive topic under study: *You need to carefully consider what you know—from both theory and experience—about the system under study to select the potentially important variables for study.* Once you have selected these variables, statistical methods can help you decide whether the model is adequately specified, whether you have left out important variables, or whether you have included extraneous or redundant variables. Completing this analysis requires a combination of studying the *residual diagnostics* for an adequate functional form of the regression equation (discussed in Chapter 4), studying the *multicollinearity diagnostics* for evidence of model overspecification ...