Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example

doi:10.1016/j.jclinepi.2009.08.028

Journal of Clinical Epidemiology

Volume 63, Issue 7, July 2010, Pages 728-736

https://doi.org/10.1016/j.jclinepi.2009.08.028

Abstract

Objective

Missing indicator method (MIM) and complete case analysis (CC) are frequently used to handle missing confounder data. Using empirical data, we demonstrated the degree and direction of bias in the effect estimate when using these methods compared with multiple imputation (MI).

Study Design and Setting

From a cohort study, we selected an exposure (marital status), outcome (depression), and confounders (age, sex, and income). Missing values in “income” were created according to different patterns of missingness: missing values were created completely at random and depending on exposure and outcome values. Percentages of missing values ranged from 2.5% to 30%.

Results

When missing values were completely random, MIM gave an overestimation of the odds ratio, whereas CC and MI gave unbiased results. MIM and CC gave under- or overestimations when missing values depended on observed values. Magnitude and direction of bias depended on how the missing values were related to exposure and outcome. Bias increased with increasing percentage of missing values.

Conclusion

MIM should not be used to handle missing confounder data because it biases the odds ratio unpredictably, even with small percentages of missing values. CC can be used when missing values are completely random, but at the cost of statistical power.

Introduction

What is new?

- We showed that using the missing indicator method (MIM) gives unpredictable direction of bias.
- We showed the direction and magnitude of bias resulting from using the MIM.
- We gave a clear explanation of why the MIM fails.
- We combined empirical data with simulations to convey our message.
- Implication is that the MIM should never be used to handle missing confounder data.

In observational etiologic research, missing data on one or more confounders can make it impossible to adjust adequately for confounding. There are many methods to handle missing data [1], [2]. Commonly, researchers exclude subjects with missing data, the so-called complete case analysis (CC), because multivariable modeling in standard software packages usually excludes persons with a missing value on any of the variables in the model. Obviously, this reduces the number of subjects and, thereby, the statistical power, but more importantly, it may lead to seriously biased estimates [1], [3], [4], [5]. This bias occurs because missing values are typically related to other observed subject characteristics, including the outcome. Even in a follow-up study where the outcome is not yet known at baseline, missing values can be (indirectly) related to the outcome if predisposing factors are associated with both the missingness of the covariate and the outcome. For example, a question on medication use might be skipped by someone with a low education level, and education level is also associated with mortality.

Another approach to handle missing data is the “MIM”, which was specifically proposed for missing confounder data in etiologic research [6], [7]. This method does not exclude subjects from the analysis but adds an extra variable to the statistical model to indicate that the value of a certain variable is missing. Although it has been argued that the MIM gives biased results [1], [3], [5], [8], it is still used often. A survey of 100 articles showed that 32 of 81 articles mentioned a method for handling missing covariate data, of which four (13%) used the MIM [9]. Reasons to use the MIM might be that it is intuitively appealing, because it seems to adjust for missing values, and that it is easy to use. In addition, it is thought that the MIM only gives some residual confounding [10], and researchers might think that this is acceptable. Furthermore, the MIM is advocated for handling missing baseline measurements in randomized trials [11].
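The difference between CC and the MIM is easiest to see in how each prepares the data before the model is fitted. The following is a minimal sketch, not the authors' code: the row layout, the function names, and the choice of 0.0 as the fixed fill value are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical row layout: (exposure, outcome, income); income may be missing (None).
Row = Tuple[int, int, Optional[float]]


def complete_cases(rows: List[Row]) -> List[Row]:
    """Complete case analysis: drop every subject whose confounder is missing."""
    return [row for row in rows if row[2] is not None]


def missing_indicator(rows: List[Row], fill: float = 0.0) -> List[Tuple[int, int, float, int]]:
    """Missing indicator method: keep all subjects, replace each missing value
    by a fixed value (assumed 0.0 here), and add a 0/1 column that flags it."""
    prepared = []
    for exposure, outcome, income in rows:
        is_missing = 1 if income is None else 0
        value = fill if income is None else income
        prepared.append((exposure, outcome, value, is_missing))
    return prepared


rows = [(1, 0, 2.0), (0, 1, None), (1, 1, 3.5), (0, 0, None)]
cc_rows = complete_cases(rows)
mim_rows = missing_indicator(rows)
print(len(cc_rows), len(mim_rows))  # prints: 2 4
```

Under CC the model sees only two of the four subjects. Under the MIM it sees all four, but the two flagged subjects contribute no real information about income, so within that group the confounder is effectively left unadjusted.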

A more sophisticated approach to handle missing data is to impute (ie, fill in) missing values. Imputing the overall or subgroup mean commonly yields biased effect estimates as well [2]. Imputing a missing value by a value predicted by a regression model using all other observed variables in the data set including the outcome seems a better approach [1], [2], [4], [8], [12], [13]. This imputation can be done once (single imputation) or multiple times (multiple imputation [MI]). If the missing values are completely random or if they depend on observed variables, single imputation gives unbiased effect estimates but overestimates the precision, that is, underestimates the standard error, of the estimate because it assumes that all data are present [1], [8], [13]. In MI, the missing value is imputed multiple times (usually 5–10), and the uncertainty of the imputed values is taken into account. MI has been described in numerous articles, and it has been shown to give valid estimates and valid standard errors if missing values are completely random or if they depend on observed variables [1], [4], [8], [12], [13], [14], [15], [16], [17], [18].
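The text does not say how the results from the imputed data sets are combined. A common choice is Rubin's rules, sketched below under that assumption: the pooled estimate is the mean of the per-data-set estimates, and the total variance adds the between-imputation variance to the average within-imputation variance. The function name and the example numbers are illustrative only.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Pool m estimates (e.g. log odds ratios) and their squared standard errors.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Illustrative log odds ratios from five imputed data sets, each with SE 0.20.
log_ors = [0.40, 0.50, 0.45, 0.55, 0.35]
squared_ses = [0.20 ** 2] * 5
pooled_log_or, pooled_se = pool_rubin(log_ors, squared_ses)
print(round(pooled_log_or, 3), round(pooled_se, 3))
```

The pooled standard error exceeds the per-data-set value of 0.20 because the spread of the estimates across imputations is added in. Single imputation omits exactly this term, which is why it overstates precision.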

Although MI is increasingly being used (eg, [16], [19], [20]), CC and the MIM are still common in the epidemiologic literature [9]. Many researchers, and editors alike, appear not to be aware of the degree of bias that can result from both methods. The objective of this study was to show the direction and degree of bias when using the MIM or CC to handle missing confounder data in an etiological context. In contrast to earlier studies, we used an empirical data set and simulated different patterns of missing data in a confounder. We studied the bias in the odds ratio of the association between exposure and outcome and varied the percentage of missing values from 2.5% to 30%. We compared using the MIM and CC with the use of MI.
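The two kinds of missingness pattern can be generated as in the sketch below, which is an illustration rather than the authors' simulation code. The toy data, the function names, the intercept, and the missingness odds ratios of 3 are all assumptions; the actual scenario parameters are not reproduced here.

```python
import math
import random
from typing import List, Optional, Tuple

# Hypothetical row layout: (exposure, outcome, income); income may be missing (None).
Row = Tuple[int, int, Optional[float]]


def make_mcar(rows: List[Row], fraction: float, rng: random.Random) -> List[Row]:
    """Missing completely at random: every subject has the same probability."""
    return [(e, d, None if rng.random() < fraction else x) for e, d, x in rows]


def make_mar(rows: List[Row], intercept: float, log_or_exposure: float,
             log_or_outcome: float, rng: random.Random) -> List[Row]:
    """Missingness depends on observed exposure and outcome via a logistic model."""
    result = []
    for e, d, x in rows:
        linear = intercept + log_or_exposure * e + log_or_outcome * d
        p_missing = 1 / (1 + math.exp(-linear))
        result.append((e, d, None if rng.random() < p_missing else x))
    return result


def missing_fraction(rows: List[Row]) -> float:
    return sum(1 for row in rows if row[2] is None) / len(rows)


rng = random.Random(0)
# Toy cohort: binary exposure and outcome, standardized income.
cohort = [(rng.randint(0, 1), rng.randint(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
mcar = make_mcar(cohort, 0.30, rng)
mar = make_mar(cohort, -2.0, math.log(3), math.log(3), rng)
print(round(missing_fraction(mcar), 3), round(missing_fraction(mar), 3))
```

In the second data set, exposed subjects with the outcome are far more likely to have a missing income than unexposed subjects without it, which is the kind of dependence under which CC and the MIM are expected to go wrong.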

Section snippets

Data set

We used data from the PREDICT study, which is described in detail elsewhere [21]. In short, the PREDICT study is a European prospective cohort study aimed to develop a multifactor risk algorithm for onset of major depression over 12 months. A total of 1,338 subjects were included in the Dutch part of the study (PREDICT-NL). The outcome of interest was the occurrence of major depressive disorder according to DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, version IV) criteria

Scenario 1—missing completely at random

Using the MIM gave an overestimation of the odds ratio of marital status (Fig. 1A). This bias increased with an increasing percentage of missing values. Both CC and MI gave an unbiased odds ratio of exposure. The coverage of the 95% CI was around 0.95 for all methods of handling missing data (Fig. 1B).
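Coverage here is the proportion of simulated data sets whose 95% CI contains the reference value. The snippet does not say how the intervals were computed, so the sketch below assumes a Wald interval on the log odds ratio scale; the function name and the numbers are illustrative.

```python
from typing import Sequence


def ci_coverage(estimates: Sequence[float], std_errors: Sequence[float],
                truth: float, z: float = 1.96) -> float:
    """Fraction of Wald intervals (estimate +/- z * SE) that contain the truth."""
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need matching, non-empty sequences")
    hits = 0
    for estimate, se in zip(estimates, std_errors):
        if estimate - z * se <= truth <= estimate + z * se:
            hits += 1
    return hits / len(estimates)


# Four illustrative log odds ratios, each with SE 0.10; the reference value is 0.45.
coverage = ci_coverage([0.40, 0.55, 0.10, 0.48], [0.10, 0.10, 0.10, 0.10], truth=0.45)
print(coverage)  # prints: 0.75
```

A method can show coverage near 0.95 despite a biased point estimate when its intervals are wide, so coverage is best read alongside the bias plot rather than on its own.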

Scenario 2—MAR, odds ratios of 3-2-3-1

Using MIM resulted in a small overestimation of the odds ratio of exposure with 30% of missing values (Fig. 2A). CC gave an overestimation of the odds ratio with more bias in higher percentages of

Discussion

This study demonstrates the degree of bias in the effect estimate of exposure when using the MIM and CC for missing confounder data in comparison with MI. MIM and CC gave a biased odds ratio in almost all situations of missing confounder data. The direction and degree of that bias depended on how the missing values were related to exposure and outcome. The bias was already present with a small percentage of missing values and increased when the percentage of missing values increased. Coverage

Acknowledgments

This research was funded by an unrestricted grant from Novo Nordisk and the Scientific Institute of the Dutch Pharmacists (WINAp) and by the Netherlands Organization for Scientific Research, a VIDI grant (NWO: project no. 917-66-311). The PREDICT study was funded by The European Commission, reference: PREDICT-QL4-CT2002-00683.

References (29)

A.R. Donders et al.
Review: a gentle introduction to imputation of missing values
J Clin Epidemiol
(2006)
K.G. Moons et al.
Using the outcome for imputation of missing predictor values was preferred
J Clin Epidemiol
(2006)
G.J. van der Heijden et al.
Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example
J Clin Epidemiol
(2006)
S.L. Crawford et al.
A comparison of analytic methods for non-random missingness of outcome data
J Clin Epidemiol
(1995)
M.H. Gorelick
Bias arising from missing data in predictive models
J Clin Epidemiol
(2006)
S. Greenland et al.
A critical look at methods for handling missing covariates in epidemiologic regression analyses
Am J Epidemiol
(1995)
R.J. Little
Regression with missing X's: a review
J Am Stat Assoc
(1992)
M.P. Jones
Indicator and stratification methods for missing explanatory variables in multiple linear regression
J Am Stat Assoc
(1996)
J.L. Schafer et al.
Missing data: our view of the state of the art
Psychol Methods
(2002)
W. Vach et al.
Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables
Am J Epidemiol
(1991)
A.B. Anderson et al.
Missing data: a review of the literature
Regression analysis
A. Burton et al.
Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines
Br J Cancer
(2004)
J.P. Vandenbroucke et al.
Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration
Epidemiology
(2007)
