Original Article
Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example

https://doi.org/10.1016/j.jclinepi.2009.08.028

Abstract

Objective

The missing indicator method (MIM) and complete case analysis (CC) are frequently used to handle missing confounder data. Using empirical data, we demonstrated the degree and direction of bias in the effect estimate when using these methods compared with multiple imputation (MI).

Study Design and Setting

From a cohort study, we selected an exposure (marital status), an outcome (depression), and confounders (age, sex, and income). Missing values in “income” were created according to different patterns of missingness: either completely at random or depending on exposure and outcome values. Percentages of missing values ranged from 2.5% to 30%.

Results

When missing values were completely random, MIM gave an overestimation of the odds ratio, whereas CC and MI gave unbiased results. MIM and CC gave under- or overestimations when missing values depended on observed values. Magnitude and direction of bias depended on how the missing values were related to exposure and outcome. Bias increased with increasing percentage of missing values.

Conclusion

The MIM should not be used to handle missing confounder data because it gives unpredictable bias of the odds ratio even with small percentages of missing values. CC can be used when missing values are completely random, but at the cost of statistical power.

Introduction

What is new?

  • We showed that using the missing indicator method (MIM) gives an unpredictable direction of bias.

  • We showed the direction and magnitude of bias resulting from using the MIM.

  • We gave a clear explanation of why the MIM fails.

  • We combined empirical data with simulations to convey our message.

  • The implication is that the MIM should never be used to handle missing confounder data.

In observational etiologic research, missing data on one or more confounders can compromise the ability to adjust adequately for confounding. There are many methods to handle missing data [1], [2]. Commonly, researchers exclude subjects with missing data, the so-called complete case analysis (CC), because multivariable modeling in standard software packages usually excludes persons with a missing value on any of the variables in the model. This reduces the number of subjects and, thereby, the statistical power, but more importantly, it may lead to seriously biased estimates [1], [3], [4], [5]. This bias occurs because missing values are typically related to other observed subject characteristics, including the outcome. Even in a follow-up study where the outcome is not yet known at baseline, missing values can be (indirectly) related to the outcome if predisposing factors are associated with both the missing covariate and the outcome. For example, a question on medication use might be skipped by someone with a low education level, whereas education level is also associated with mortality.
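As a minimal sketch of what complete case analysis does to the analysis set, the following Python snippet drops every record that has a missing value on any model variable and reports the share of subjects retained. The record layout, field names, and values are hypothetical and are not taken from the study data.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def complete_cases(records: List[Record]) -> List[Record]:
    """Keep only records with no missing (None) value on any variable."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical records: exposure, outcome, and one confounder (income).
records: List[Record] = [
    {"exposure": 1, "outcome": 0, "income": 2.0},
    {"exposure": 0, "outcome": 1, "income": None},
    {"exposure": 1, "outcome": 1, "income": 3.5},
    {"exposure": 0, "outcome": 0, "income": None},
]

kept = complete_cases(records)
retained = len(kept) / len(records)
print(f"{len(kept)} of {len(records)} subjects retained ({retained:.0%})")
```

The loss of subjects is visible directly in `retained`. Whether the remaining subjects still give an unbiased estimate depends on why the values are missing, which this sketch does not address.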

Another approach to handle missing data is the missing indicator method (MIM), which was specifically proposed for missing confounder data in etiologic research [6], [7]. This method does not exclude subjects from the analysis but adds an extra variable to the statistical model to indicate that the value of a certain variable is missing. Although it has been argued that the MIM gives biased results [1], [3], [5], [8], it is still used often. A survey of 100 articles showed that 32 of 81 articles mentioned a method for handling missing covariate data, of which four (13%) used the MIM [9]. The MIM may be used because it is intuitively appealing: it seems to adjust for missing values and is easy to apply. In addition, it is thought that the MIM only gives some residual confounding [10], and researchers might think that this is acceptable. Furthermore, the MIM is advocated for handling missing baseline measurements in randomized trials [11].
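The coding step of the MIM can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and the fill value of 0 for missing entries is one common convention rather than something the text specifies. Both resulting columns would then enter the regression model alongside the exposure.

```python
from typing import List, Optional, Sequence, Tuple


def missing_indicator_columns(
    values: Sequence[Optional[float]], fill: float = 0.0
) -> Tuple[List[float], List[int]]:
    """Return (filled, indicator) columns for one confounder.

    filled    -- observed values, with `fill` substituted where missing
    indicator -- 1 where the value was missing, 0 where it was observed
    """
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical income values with two missing entries.
income = [2.0, None, 3.5, None, 1.0]
income_filled, income_missing = missing_indicator_columns(income)
print(income_filled)
print(income_missing)
```

No subject is dropped, which is what makes the method attractive. However, every subject with a missing value receives the same placeholder, so the confounder is not actually adjusted for in that group.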

A more sophisticated approach to handle missing data is to impute (ie, fill in) missing values. Imputing the overall or subgroup mean commonly yields biased effect estimates as well [2]. Imputing a missing value by a value predicted by a regression model using all other observed variables in the data set including the outcome seems a better approach [1], [2], [4], [8], [12], [13]. This imputation can be done once (single imputation) or multiple times (multiple imputation [MI]). If the missing values are completely random or if they depend on observed variables, single imputation gives unbiased effect estimates but overestimates the precision, that is, underestimates the standard error, of the estimate because it assumes that all data are present [1], [8], [13]. In MI, the missing value is imputed multiple times (usually 5–10), and the uncertainty of the imputed values is taken into account. MI has been described in numerous articles, and it has been shown to give valid estimates and valid standard errors if missing values are completely random or if they depend on observed variables [1], [4], [8], [12], [13], [14], [15], [16], [17], [18].
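To illustrate how MI takes the uncertainty of the imputed values into account, the sketch below pools estimates from several imputed data sets using Rubin's rules. The text does not name the pooling rule, so its use here is an assumption. The log odds ratios and variances are hypothetical numbers, not results from the study.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Pool per-imputation estimates and their squared standard errors.

    Returns the pooled estimate and its total variance:
    total = within + (1 + 1/m) * between
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical log odds ratios and variances from m = 3 imputed data sets.
log_ors = [0.5, 0.7, 0.6]
variances = [0.04, 0.04, 0.04]

pooled_log_or, total_var = pool_rubin(log_ors, variances)
pooled_or = math.exp(pooled_log_or)
pooled_se = math.sqrt(total_var)
print(f"pooled OR = {pooled_or:.2f}, SE(log OR) = {pooled_se:.3f}")
```

The between-imputation term is what single imputation lacks: with one imputed data set only the within-imputation variance is available, so the standard error is understated.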

Although MI is increasingly being used (eg, [16], [19], [20]), CC and the MIM are still common in the epidemiologic literature [9]. Many researchers, and editors alike, appear not to be aware of the degree of bias that can result from both methods. The objective of this study was to show the direction and degree of bias when using the MIM or CC to handle missing confounder data in an etiological context. In contrast to earlier studies, we used an empirical data set and simulated different patterns of missing data in a confounder. We studied the bias in the odds ratio of the association between exposure and outcome and varied the percentage of missing values from 2.5% to 30%. We compared using the MIM and CC with the use of MI.
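The idea of creating missing values in a confounder under different patterns can be sketched as below. This is not the authors' procedure: the function names, the synthetic cohort, and all missingness probabilities are hypothetical, chosen only to contrast values missing completely at random with values whose missingness depends on exposure and outcome.

```python
import random
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[float]]


def set_income_missing(
    records: List[Record],
    prob_missing: Callable[[Record], float],
    rng: random.Random,
) -> List[Record]:
    """Return copies of the records with income set to None at random."""
    result = []
    for record in records:
        copy = dict(record)
        if rng.random() < prob_missing(copy):
            copy["income"] = None
        result.append(copy)
    return result


def completely_at_random(record: Record) -> float:
    # Same (hypothetical) probability for every subject.
    return 0.10


def depends_on_exposure_and_outcome(record: Record) -> float:
    # Hypothetical probabilities that rise with exposure and outcome.
    if record["exposure"] == 1 and record["outcome"] == 1:
        return 0.30
    if record["exposure"] == 1 or record["outcome"] == 1:
        return 0.15
    return 0.05


# Synthetic cohort with fully observed income.
cohort: List[Record] = [
    {"exposure": i % 2, "outcome": (i // 2) % 2, "income": 1.0 + i % 5}
    for i in range(1000)
]

rng = random.Random(2009)  # fixed seed so the sketch is repeatable
mcar_data = set_income_missing(cohort, completely_at_random, rng)
mar_data = set_income_missing(cohort, depends_on_exposure_and_outcome, rng)

for name, data in [("random", mcar_data), ("dependent", mar_data)]:
    share = sum(r["income"] is None for r in data) / len(data)
    print(f"{name}: {share:.1%} of income values missing")
```

Because the fully observed cohort is kept, an estimate from each handling method can be compared with the estimate from the complete data, which is what makes the bias measurable.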

Section snippets

Data set

We used data from the PREDICT study, which is described in detail elsewhere [21]. In short, the PREDICT study is a European prospective cohort study that aimed to develop a multifactor risk algorithm for the onset of major depression over 12 months. A total of 1,338 subjects were included in the Dutch part of the study (PREDICT-NL). The outcome of interest was the occurrence of major depressive disorder according to DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, version IV) criteria

Scenario 1—missing completely at random

Using the MIM gave an overestimation of the odds ratio of marital status (Fig. 1A). This bias increased with an increasing percentage of missing values. Both CC and MI gave an unbiased odds ratio of exposure. The coverage of the 95% CI was around 0.95 for all methods of handling missing data (Fig. 1B).

Scenario 2—MAR, odds ratios of 3-2-3-1

Using MIM resulted in a small overestimation of the odds ratio of exposure with 30% of missing values (Fig. 2A). CC gave an overestimation of the odds ratio with more bias in higher percentages of

Discussion

This study demonstrates the degree of bias in the effect estimate of exposure when using the MIM and CC for missing confounder data in comparison with MI. MIM and CC gave a biased odds ratio in almost all situations of missing confounder data. The direction and degree of that bias depended on how the missing values were related to exposure and outcome. The bias was already present with a small percentage of missing values and increased when the percentage of missing values increased. Coverage

Acknowledgments

This research was funded by an unrestricted grant from Novo Nordisk and the Scientific Institute of the Dutch Pharmacists (WINAp) and by the Netherlands Organization for Scientific Research, a VIDI grant (NWO: project no. 917-66-311). The PREDICT study was funded by The European Commission, reference: PREDICT-QL4-CT2002-00683.

References (29)

  • A.B. Anderson et al.

    Missing data: a review of the literature

  • Regression analysis

  • A. Burton et al.

    Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines

    Br J Cancer

    (2004)

  • J.P. Vandenbroucke et al.

    Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration

    Epidemiology

    (2007)