What is new?
- We showed that the missing indicator method (MIM) yields bias whose direction is unpredictable.
- We showed the direction and magnitude of bias resulting from using the MIM.
- We gave a clear explanation of why the MIM fails.
- We combined empirical data with simulations to convey our message.
- The implication is that the MIM should never be used to handle missing confounder data.
In observational etiologic research, missing data on one or more confounders can limit the ability to adjust adequately for confounding. There are many methods to handle missing data [1], [2]. Commonly, researchers exclude subjects with missing data, the so-called complete case analysis (CC), because multivariable modeling in standard software packages usually excludes persons with a missing value on any of the variables in the model. This reduces the number of subjects and, thereby, the statistical power, but more importantly, it may lead to seriously biased estimates [1], [3], [4], [5]. This bias occurs because missing values are typically related to other observed subject characteristics, including the outcome. Even in a follow-up study where the outcome is not yet known at baseline, missing values can be (indirectly) related to the outcome if predisposing factors are associated with both the missing covariate and the outcome. For example, a question on medication use might be skipped by someone with a low education level, whereas education level is also associated with mortality.
Another approach to handle missing data is the missing indicator method (MIM), which was specifically proposed for missing confounder data in etiologic research [6], [7]. This method does not exclude subjects from the analysis but adds an extra variable to the statistical model to indicate that the value of a certain variable is missing. Although it has been argued that the MIM gives biased results [1], [3], [5], [8], it is still often used. A survey of 100 articles showed that 32 of 81 articles mentioned a method for handling missing covariate data, of which four (13%) used the MIM [9]. Reasons to use the MIM might be that it is intuitively appealing, because it seems to adjust for missing values, and that it is easy to use. In addition, it is thought that the MIM only gives some residual confounding [10], and researchers might think that this is acceptable. Furthermore, the MIM is advocated for handling missing baseline measurements in randomized trials [11].
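To make the mechanics of the MIM concrete, the following minimal Python sketch builds the two model columns for a single confounder: the confounder with its missing values replaced by a fixed constant, and the extra indicator variable. The function name `missing_indicator_columns` and the fill value of 0 are illustrative assumptions and are not taken from the cited descriptions of the method.

```python
from typing import List, Optional, Sequence, Tuple


def missing_indicator_columns(
    values: Sequence[Optional[float]],
    fill_value: float = 0.0,  # assumed constant; any fixed value could be used
) -> Tuple[List[float], List[int]]:
    """Return (filled confounder, missing indicator) for one confounder.

    Missing values are represented by None. Each missing value is replaced
    by `fill_value`, and the indicator is 1 where the value was missing
    and 0 where it was observed.
    """
    filled = [fill_value if v is None else float(v) for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical confounder values for five subjects; two are missing.
confounder = [1.0, None, 3.5, None, 2.0]
filled, indicator = missing_indicator_columns(confounder)
```

Both columns would then be entered into the regression model together with the exposure, so that no subject is excluded from the analysis.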
A more sophisticated approach to handle missing data is to impute (ie, fill in) missing values. Imputing the overall or subgroup mean commonly yields biased effect estimates as well [2]. Replacing a missing value with a value predicted by a regression model that uses all other observed variables in the data set, including the outcome, seems a better approach [1], [2], [4], [8], [12], [13]. This imputation can be done once (single imputation) or multiple times (multiple imputation [MI]). If the missing values are completely random or if they depend on observed variables, single imputation gives unbiased effect estimates but overestimates the precision of the estimate, that is, underestimates its standard error, because it treats the imputed values as if they had been observed [1], [8], [13]. In MI, the missing value is imputed multiple times (usually 5–10), and the uncertainty of the imputed values is taken into account. MI has been described in numerous articles, and it has been shown to give valid estimates and valid standard errors if missing values are completely random or if they depend on observed variables [1], [4], [8], [12], [13], [14], [15], [16], [17], [18].
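The way MI accounts for the uncertainty of the imputed values is commonly formalized in Rubin's rules, which combine the results of the analyses of the imputed data sets. The sketch below assumes that each of the m imputed data sets has already been analyzed and has yielded an effect estimate (for example, a log odds ratio) and its variance (squared standard error); the function name `pool_rubin` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float],
    variances: Sequence[float],
) -> Tuple[float, float]:
    """Pool m estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates, each with a variance")
    pooled = mean(estimates)       # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical log odds ratios and variances from three imputed data sets.
log_or, se = pool_rubin([0.5, 0.6, 0.7], [0.04, 0.04, 0.04])
```

The between-imputation term is what single imputation omits, which is why its standard error is too small.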
Although MI is increasingly being used (eg, [16], [19], [20]), CC and the MIM are still common in the epidemiologic literature [9]. Many researchers and editors alike appear not to be aware of the degree of bias that can result from these two methods. The objective of this study was to show the direction and degree of bias when using the MIM or CC to handle missing confounder data in an etiologic context. In contrast to earlier studies, we used an empirical data set and simulated different patterns of missing data in a confounder. We studied the bias in the odds ratio of the association between exposure and outcome and varied the percentage of missing values from 2.5% to 30%. We compared the MIM and CC with MI.
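As an illustration of how missing values can be imposed on a fully observed confounder at a chosen percentage, the sketch below deletes values completely at random. This is only one hypothetical missingness pattern and is not the procedure used in this study; the function name `make_missing_completely_at_random`, the `seed` argument, and the example data are assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence


def make_missing_completely_at_random(
    values: Sequence[float],
    fraction: float,
    seed: int = 0,  # assumed argument, used only to make the example reproducible
) -> List[Optional[float]]:
    """Set a given fraction of the values to None, chosen at random.

    Every subject has the same probability of a missing value, so the
    missingness does not depend on exposure, outcome, or the confounder.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


# Hypothetical confounder for 200 subjects, with 2.5% and 30% missing values.
confounder = [float(i % 7) for i in range(200)]
low = make_missing_completely_at_random(confounder, 0.025)
high = make_missing_completely_at_random(confounder, 0.30)
```

Patterns in which missingness depends on the exposure, the outcome, or the confounder itself would require the selection probability to vary across subjects rather than being constant.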