Missing data are a fact of research life. A measuring device may fail in an experiment. Survey respondents may not answer questions about their health or income or may mark “don’t know” or “don’t remember.” In longitudinal studies, research participants may die or be lost to follow-up. Giving careful consideration at the study-design stage to how to minimize missing data is a key facet of a well-designed research study, but even the best, most carefully designed studies often have to contend with missing data.
So far, all the equations and examples have assumed there are no missing data. The classical methods we (and most other textbooks) use are based on ordinary least squares (OLS) regression, which is the best method when there are either no missing data or so few missing data that cases with missing data can be safely dropped and the analysis conducted using only the cases with complete data, so-called listwise deletion. For instance, if less than 5 percent of the data are missing, cases with missing data can usually be removed and the analysis run using the usual estimation methods that assume all cases have complete data.*
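As a minimal sketch of listwise deletion, the following Python code drops every case that has at least one missing value and reports the fraction of cases lost. The data set, the variable names (`age`, `income`, `health`), and the helper functions `listwise_delete` and `fraction_incomplete` are hypothetical and are used only for illustration. The code also assumes that "percent missing" is measured as the fraction of incomplete cases, which is one of several reasonable readings of the 5 percent guideline.

```python
import math


def is_missing(value):
    """Treat None and floating-point NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def listwise_delete(rows):
    """Keep only the cases (rows) in which no variable is missing."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


def fraction_incomplete(rows):
    """Fraction of cases that have at least one missing value."""
    if not rows:
        raise ValueError("need at least one case")
    return (len(rows) - len(listwise_delete(rows))) / len(rows)


# Hypothetical data: 25 cases, three variables, generated deterministically.
cases = [
    {"age": 30 + i, "income": 40.0 + 2 * i, "health": 1 + i % 5}
    for i in range(25)
]
cases[7]["income"] = None  # one respondent did not report income

complete_cases = listwise_delete(cases)
frac = fraction_incomplete(cases)

print(f"{len(complete_cases)} of {len(cases)} cases are complete")
print(f"fraction of incomplete cases: {frac:.2f}")
```

In this hypothetical example only 1 of 25 cases is incomplete, so fitting the regression to `complete_cases` with the usual OLS methods would likely be acceptable under the guideline above. With a larger fraction of incomplete cases, the same deletion step would discard correspondingly more information.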
The problem is that simply dropping cases with incomplete information throws away the valuable information, often collected at substantial expense, that the dropped cases contain. In addition, if there are systematic factors associated with the pattern of missing data, the resulting regression coefficients could be biased estimates of the underlying population parameters. As noted above, when there are only a few incomplete cases this is not a practical problem, but there are often more than a few incomplete cases. Fortunately, there are methods that allow us to use all the available information in the data, even when the information is incomplete, to fit our regression models to the data.
What should you do if you have more than a small percentage of missing data? We first discuss classical ad hoc methods for handling missing data and why they do not work very well when there is more than a trivial amount of missing data. We then present two alternative approaches to producing parameter estimates, maximum likelihood estimation (MLE) and multiple imputation (MI), which almost always work better than ad hoc methods, including listwise deletion, because they use all the available, albeit incomplete, information. First, however, we need to consider how the data became missing because that will guide our choice of analytic methods for handling the missing data. Even more fundamentally, it is worth considering an under-discussed topic in research: how to prevent missing data.
This book is about how to analyze data ...