The common practice of reporting the percentage of patients with scores above cut-off thresholds in screening questionnaires for depression as disorder prevalence substantially overestimates prevalence and misinforms users of epidemiological evidence.
Exaggeration of the prevalence of depression is disproportionately high in low-prevalence populations and blurs distinctions between high- and low-prevalence populations.
Researchers should use diagnostic interview methods that have been validated for estimating prevalence.
A two-stage estimation method that combines screening questionnaires and diagnostic interviews can reduce resource requirements and generate valid prevalence estimates for depression.
Mental health disorders, including major depressive disorder, are classified in research using validated diagnostic interviews.1,2 However, administering diagnostic interviews to large population samples to estimate prevalence is expensive because of the time and trained personnel that are required. This is likely why researchers increasingly use self-report screening questionnaires, which require fewer resources, to estimate prevalence. We searched PubMed from Jan. 1, 2017, to Mar. 14, 2017, for primary studies with titles that indicated that prevalence of depression or depressive disorders had been assessed. Prevalence was based on screening questionnaires in 17 of 19 studies (89%; Appendix 1, available at www.cmaj.ca/lookup/suppl/doi:10.1503/cmaj.170691/-/DC1). Many recent meta-analyses have also based estimates of prevalence of depression on screening questionnaires.3–7 However, using screening questionnaires to estimate prevalence can overestimate prevalence and blur distinctions between low- and high-prevalence populations. We describe the problem and possible strategies for estimation of prevalence that are less resource intensive than conducting diagnostic interviews with all patients.
How are patients classified with screening questionnaires for depression?
Typically, screening questionnaires for depression are completed independently by respondents. The questionnaires assess symptoms similar to those evaluated in diagnostic interviews, but they do not assess functional impairment or investigate non-psychiatric conditions that can produce similar symptoms. Patients are classified as likely or unlikely to have depression based on scores above or below a cut-off threshold. Researchers set cut-offs by comparing scores on a screening questionnaire to classifications based on validated diagnostic interviews and attempting to maximize correct classifications. Different approaches may be used,8,9 but many researchers simply maximize combined sensitivity (probability that a person with depression is classified correctly) and specificity (probability that a person without depression is classified correctly).10 Because screening is intended to identify previously unrecognized cases, cut-off thresholds for screening are set to cast a wide net and identify substantially more patients who may have depression than those who will meet diagnostic criteria based on a diagnostic interview.
How should percentage above cut-offs on screening questionnaires be interpreted?
Positive predictive value (PPV) is the percentage of patients with scores above a test cut-off who have the target condition. For screening questionnaires for depression, PPV is the percentage of patients with a positive screen who meet diagnostic criteria. Positive predictive value depends on test sensitivity, specificity and true prevalence, but because screening tests are designed to cast a wide net, PPV is often very low. In many medical settings, fewer than 3 of 10 patients with a positive screen have major depression.11
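The arithmetic behind PPV follows directly from Bayes' rule. As a minimal sketch, the calculation below applies the PHQ-9 point estimates cited later in this article (78% sensitivity, 87% specificity) to an assumed true prevalence of 5%; the 5% figure is hypothetical, chosen only to illustrate a low-prevalence setting.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: share of positive screens that are true cases."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# PHQ-9 point estimates (sensitivity 78%, specificity 87%) at an
# assumed 5% true prevalence: about 24% of positive screens are true
# cases, i.e., "fewer than 3 of 10".
print(f"{ppv(0.78, 0.87, 0.05):.0%}")
```

Even with reasonable test accuracy, most positive screens in a low-prevalence setting come from the much larger pool of patients without the disorder.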
The percentage of patients above a cut-off threshold typically exceeds true prevalence substantially. This has been shown by several recent highly cited meta-analyses that combined results from primary studies that used validated diagnostic interviews with results from primary studies that reported percentages of patients above cut-off thresholds on screening questionnaires for depression. In a meta-analysis involving patients who underwent bariatric surgery, the prevalence of depression was 19% across 34 studies that used screening questionnaires, but 7% to 8% in six studies that used a validated diagnostic interview.3 Another meta-analysis of 43 studies involving new fathers during the prenatal and postpartum periods reported an overall prevalence of depression of 10%; however, the three included studies that used validated diagnostic interviews reported a prevalence of less than 5%.5,12 Yet another meta-analysis, on depression among medical students,7 reported that 27% of participants from 183 studies had depression. However, the only included study that used a validated diagnostic interview reported 9% prevalence, which is comparable to the 9% prevalence among 18- to 25-year-olds and the 7% prevalence among 26- to 49-year-olds in the general population of the United States.13
Some researchers have attempted to address this problem by labelling the percentage of patients above cut-offs for screening questionnaires as the prevalence of “clinically significant” symptoms or “symptoms” of depression rather than depression.14,15 However, these designations are not based on evidence that these cut-offs reflect a meaningful divide between impairment and nonimpairment. Furthermore, the percentage of patients above cut-off thresholds varies depending on the particular screening questionnaire and cut-off threshold used. For example, a systematic review of depression after myocardial infarction found that 31% of patients had a score at or above the standard cut-off of 10 on the Beck Depression Inventory, whereas only 16% had a score at or above the standard cut-off of 8 on the Hospital Anxiety and Depression Scale.16
Another concern is that screening tools for depression overestimate prevalence more in low true-prevalence populations than in high true-prevalence populations. Based on assumed values of sensitivity and specificity, the percentage of patients who would score above a cut-off threshold for a screening questionnaire can be calculated for different values of true prevalence. In Table 1, we used estimates of sensitivity and specificity for the standard cut-off of 10 or greater on the Patient Health Questionnaire-9 (PHQ-9) from a recent meta-analysis involving about 20 000 patients (12% had depression).17 Sensitivity and specificity may vary with symptom severity in a patient population and, thus, with prevalence.18,19 Therefore, Table 1 shows a basic scenario and scenarios where sensitivity and specificity are adjusted in calculations across prevalence. Estimated prevalence is substantially exaggerated when true prevalence is lowest. In all scenarios, the percentage of patients above the cut-off threshold is at least twice the true prevalence when true prevalence is 10% or less, but this ratio decreases as true prevalence increases. This is because the misclassification of noncases as cases of depression (false positives) is disproportionately high in low-prevalence populations and only minimally offset by false-negative screens, which occur when true cases are missed by the screening test. Consequently, even populations with very low prevalence appear to have high prevalence based on the percentage above a screening test cut-off; this is the case even if terms such as “clinically significant symptoms” are used to describe patients above the cut-off threshold. The calculations in Table 1 do not account for the precision of sensitivity and specificity estimates or potential heterogeneity across samples; these factors could further exacerbate the problem.
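The arithmetic behind calculations of this kind is simple: the expected percentage screening positive is the sum of true positives and false positives. The sketch below uses the PHQ-9 point estimates (78% sensitivity, 87% specificity), held fixed across prevalence for simplicity — corresponding to the basic scenario rather than the adjusted ones — to show how the ratio of apparent to true prevalence shrinks as true prevalence rises.

```python
def percent_above_cutoff(sens: float, spec: float, prev: float) -> float:
    """Expected share of positive screens = true positives + false positives."""
    return sens * prev + (1 - spec) * (1 - prev)

SENS, SPEC = 0.78, 0.87  # PHQ-9 point estimates at cut-off of 10 or greater

for true_prev in (0.05, 0.10, 0.25):
    apparent = percent_above_cutoff(SENS, SPEC, true_prev)
    # Ratio of apparent to true prevalence shrinks as prevalence rises.
    print(f"true {true_prev:.0%} -> apparent {apparent:.1%} "
          f"(ratio {apparent / true_prev:.2f})")
```

At 5% true prevalence the apparent prevalence is roughly 16%, more than three times the truth; at 25% the exaggeration is comparatively modest.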
Comparison of true prevalence and percentage of patients above a cut-off threshold for screening tests
What are the alternatives for estimating prevalence of depression?
Three methods for generating prevalence estimates from screening questionnaires, or from a combination of screening questionnaires and diagnostic interviews, have been proposed: back calculation based on sensitivity and specificity,20 prevalence matching8 and two-stage estimation.21
Back calculation
Back calculation involves adjusting the percentage above a cut-off threshold by existing estimates of sensitivity and specificity.20 The percentage of patients above a cut-off is equal to the percentage with true positive results for screening plus the percentage with false-positive screens. Based on this, a simple formula can be derived to estimate disorder prevalence (derivation of the formula is presented in Appendix 2, available at www.cmaj.ca/lookup/suppl/doi:10.1503/cmaj.170691/-/DC1):
estimated prevalence = (percentage above cut-off - (1 - specificity)) / (sensitivity - (1 - specificity))
However, estimation based on this method assumes that the exact sensitivity and specificity are known for the population being studied, which rarely occurs in practice. A meta-analysis of screening and case finding for major depressive disorder using the PHQ-9, which included about 20 000 patients, reported 95% confidence intervals (CIs) for the standard cut-off threshold of 10 or greater that ranged from 70% to 84% for sensitivity and from 84% to 90% for specificity.17 This uncertainty about true sensitivity and specificity can lead to substantial swings in back-calculated prevalence.
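The size of these swings can be made concrete with a short sketch of the back calculation. The bounds below come from the meta-analytic 95% CIs quoted above (sensitivity 70% to 84%, specificity 84% to 90%), applied to a situation where 20% of patients screen positive; the function itself simply solves the identity stated above for prevalence.

```python
def back_calculate(pct_above: float, sens: float, spec: float) -> float:
    """Solve pct_above = sens * prev + (1 - spec) * (1 - prev) for prev."""
    return (pct_above - (1 - spec)) / (sens - (1 - spec))

# 20% of patients screen positive; vary sensitivity and specificity
# across their 95% CIs from the PHQ-9 meta-analysis.
point = back_calculate(0.20, 0.78, 0.87)  # point estimates: ~11%
high  = back_calculate(0.20, 0.70, 0.90)  # low sens, high spec: ~17%
low   = back_calculate(0.20, 0.84, 0.84)  # high sens, low spec: ~6%
print(f"back-calculated prevalence: {low:.0%} to {high:.0%} (point {point:.0%})")
```

Note also that when the observed percentage above the cut-off is smaller than the assumed false-positive rate (1 - specificity), the formula returns a negative estimate, which is the failure mode described below.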
Figure 1 shows the estimated prevalence generated across a range of percentages of patients with a score above the cut-off for a screening questionnaire. The red line shows the estimated prevalence based on point estimates of PHQ-9 sensitivity (78%) and specificity (87%).17 The green line shows estimated prevalence based on the lower bound of the 95% CI for sensitivity and the upper bound for specificity; the blue line shows the opposite (upper bound for sensitivity, lower bound for specificity). As shown by the black lines, if 20% of patients have a score above the cut-off threshold, plausible estimates of true disorder prevalence would range from 6% (blue line) to 17% (green line). However, this example likely underestimates the actual degree of imprecision that would be encountered in practice: for simplicity, we incorporated ranges of estimates for sensitivity and specificity but ignored imprecision in the estimated percentage of patients with scores above the cut-off threshold for the screening questionnaire. We used CI estimates of sensitivity and specificity from a very large meta-analysis of the PHQ-9; intervals for other screening questionnaires with less data would be even wider. Furthermore, we did not consider heterogeneity of estimates from different settings and its ramifications for implementation. An additional consideration is that estimated prevalence may be negative in some scenarios where assumptions about sensitivity and specificity are inaccurate.
Estimated disorder prevalence based on the percentage of patients with scores above a cut-off threshold for a screening test, using estimates of sensitivity and specificity from a meta-analysis of the Patient Health Questionnaire-9 for detecting major depressive disorder.17 Black lines highlight estimated prevalence for situations where 20% of patients have a score above the test cut-off threshold.
Prevalence matching
Prevalence matching8 involves conducting very large research studies to set a cut-off for estimation of the prevalence of depression rather than screening for previously unidentified cases. This could be done by administering a screening tool and a validated diagnostic interview to all patients included in a study and setting a cut-off score that results in the percentage above the cut-off matching as closely as possible the number of patients with depression based on a validated diagnostic interview rather than to balance sensitivity and specificity. However, barriers to using this approach and generating accurate estimates of prevalence include the large number of patients who would need to be administered a diagnostic interview in the calibrating study and the high likelihood that results would not generalize well to other samples, given the substantial heterogeneity of results in existing studies of screening questionnaires.17 Thus, estimates based on a cut-off score established in one study may be inaccurate when the cut-off is applied in other settings.
Two-stage prevalence estimation
In the two-stage approach,21,22 first, all patients are administered a screening questionnaire. Then, all patients with positive screens, but only a randomly selected portion of patients with negative screens, are evaluated with a validated diagnostic interview. Prevalence is estimated by adding the number of patients with positive screens who meet diagnostic criteria and the number of patients with negative screens who also meet diagnostic criteria, weighting the latter to reflect their actual proportion of the total sample. This still requires diagnostic interviews but can reduce the number of interviews that need to be conducted substantially. Methods for implementing a two-stage approach have been described previously.22
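Under hypothetical counts, the weighting works as follows; all numbers in this sketch are invented for illustration, and the estimator shown is a simple inverse-probability weighting of cases found among screen-negatives (full implementation details are in the methods cited above).

```python
def two_stage_prevalence(n_total: int, n_pos: int, cases_pos: int,
                         n_neg_interviewed: int, cases_neg: int) -> float:
    """Two-stage estimate: all screen-positives are interviewed; cases found
    among the sampled screen-negatives are scaled up by the inverse of the
    fraction of negatives interviewed."""
    n_neg = n_total - n_pos
    weight = n_neg / n_neg_interviewed
    return (cases_pos + cases_neg * weight) / n_total

# Hypothetical example: 1000 patients screened, 200 screen positive and all
# are interviewed (60 meet diagnostic criteria); 200 of the 800 screen-
# negatives are randomly interviewed (4 meet criteria). The 4 negative-screen
# cases are scaled by 800/200 = 4, so the estimate is (60 + 16)/1000 = 7.6%.
print(f"{two_stage_prevalence(1000, 200, 60, 200, 4):.1%}")
```

In this invented scenario, only 400 of 1000 patients need a diagnostic interview, yet the estimate still accounts for cases the screening test missed.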
Table 2 shows the precision of estimates that would likely be obtained using a two-stage approach. Precision, based on the width of estimated 95% CIs, is higher when true prevalence is lower, when the total number of patients is higher, and when a greater percentage of patients with negative results for screening are interviewed. In many scenarios, differences in precision are minimal, and this shows that investigators may be able to achieve sufficient precision to meet their needs by interviewing only a small proportion of patients with negative results for screening, which would have positive resource implications. The methods used to generate Table 2 can be found in Appendix 3, available at www.cmaj.ca/lookup/suppl/doi:10.1503/cmaj.170691/-/DC1.
Precision of two-stage prevalence estimation for true prevalence, sample size and percentage of patients with negative results for a screening test who were administered diagnostic interviews*
What are the implications of these observations in depression research?
Screening tests for mental health and other types of screening questionnaires are not designed to make diagnostic classifications, and they are not calibrated to estimate prevalence. Using them in this way distorts prevalence estimates, often substantially, and does so disproportionately in low-prevalence populations. Estimating disorder prevalence with screening questionnaires misinforms evidence users, including health care decision-makers. It may also contribute to overdiagnosis, because practitioners may use the same methods to diagnose cases in clinical practice, and they may assume that they should be finding similar rates of disorders. Overdiagnosis can lead to inappropriate labelling and nocebo effects, as well as the unnecessary consumption of health care resources and potentially harmful treatment for patients who will not benefit.23,24
There are important implications for how research should be conducted and reported. First, prevalence estimates should be based on appropriate methods. Researchers should not report rates above cut-off thresholds in screening questionnaires as estimates of prevalence or clinical impairment. Second, systematic reviews and meta-analyses of the prevalence of depression should be based on results from validated diagnostic interviews. Third, comparisons between samples and descriptions of mental health symptoms based on depression screening tools should ideally use continuous scores rather than cut-off categories for screening questionnaires.25 In some cases, categorical divisions may be helpful to illustrate data distributions and make comparisons, but there is no reason why the categories used should be dichotomous or based on cut-off thresholds of screening questionnaires. If categories are used, a clear rationale should be provided, including a justification for the category thresholds chosen. Finally, the knowledge needed to accurately implement back calculation and prevalence matching is not yet available. When efficient methods for estimating the prevalence of depression are needed, two-stage estimation of prevalence presents a viable option that can reduce resource use substantially and generate unbiased, reasonably precise prevalence estimates.
Acknowledgements
The authors thank Scott Patten, Kira Riehm, Ian Shrier and Roy Ziegelstein for their helpful feedback on earlier versions of this manuscript.
Footnotes
Competing Interests: None declared.
This article has been peer reviewed.
Contributors: Brett Thombs was responsible for the study concept and design. All of the authors participated in the conduct of analyses, contributed to interpretation of data, drafted sections of the manuscript, reviewed the manuscript critically for intellectual content, gave final approval of the version to be published and agreed to be accountable for all aspects of the work.
Funding: Brett Thombs and Andrea Benedetti were supported by researcher salary awards from the Fonds de recherche du Québec – Santé. Linda Kwakkenbos was supported by a Banting Postdoctoral Fellowship from the Canadian Institutes of Health Research. Alexander Levis was supported by a Masters Award from the Canadian Institutes of Health Research. There was no specific funding for this study, and no funding body had any input into any aspect of the study.