 © 2007 Canadian Medical Association or its licensors
Abstract
Background: Statistical tests for funnel-plot asymmetry are common in meta-analyses. Inappropriate application can generate misleading inferences about publication bias. We aimed to measure, in a survey of meta-analyses, how frequently the application of these tests would be not meaningful or inappropriate.
Methods: We evaluated all meta-analyses of binary outcomes with ≥ 3 studies in the Cochrane Database of Systematic Reviews (2003, issue 2). A separate, restricted analysis was confined to the largest meta-analysis in each of the review articles. In each meta-analysis, we assessed whether criteria to apply asymmetry tests were met: no significant heterogeneity, I^{2} < 50%, ≥ 10 studies (with statistically significant results in at least 1) and ratio of the maximal to minimal variance across studies > 4. We performed a correlation and 2 regression asymmetry tests and evaluated their concordance. Finally, we sampled 60 meta-analyses from print journals in 2005 that cited use of the standard regression test.
Results: A total of 366 of 6873 (5%) and 98 of 846 meta-analyses (12%) in the wider and restricted Cochrane data set, respectively, would have qualified for use of asymmetry tests. Asymmetry test results were significant in 7%–18% of the meta-analyses. Concordance between the 3 tests was modest (estimated κ 0.33–0.66). Of the 60 journal meta-analyses, 7 (12%) would qualify for asymmetry tests; all 11 claims for identification of publication bias were made in the face of large and significant heterogeneity.
Interpretation: Statistical conditions for employing asymmetry tests for publication bias are absent from most meta-analyses; yet, in medical journals these tests are performed often and interpreted erroneously.
Publication bias, the selective publication of studies based on whether results are “positive” or not, is a major threat to the validity of clinical research.1–4 This bias can distort the totality of the available evidence on a research question, which leads to misleading inferences in reviews and meta-analyses. Without upfront study registration, however, this bias is difficult to identify after the fact.5 Many tests have therefore been proposed to help identify publication bias.6
The most common approaches try to investigate the presence of asymmetry in (inverted) funnel plots.7–10 A funnel plot shows the relation between study effect size and its precision. The premise is that small studies are more likely to remain unpublished if their results are nonsignificant or unfavourable, whereas larger studies get published regardless. This leads to funnel-plot asymmetry. Although visual inspection of funnel plots is unreliable,11,12 statistical tests can be used to quantify the asymmetry.7–10 These tests have become popular: one relevant article8 has been cited more than 1000 times.
The limitations of these tests have been documented for some time. Begg and Mazumdar7 mentioned in 1994 that the false-positive rates of their popular rank-correlation test were too low. In 2000, Sterne and colleagues13 showed in a simulation study that the regression method described by Egger and associates8 was more powerful than the rank-correlation test, although the power of either method was low for meta-analyses of 10 or fewer trials. False-positive results were found to be a major concern in the presence of heterogeneity.13,14 To reduce the problem, a modified regression test was developed,10 and several other tests proposed.6,15 Because they differ in their assumptions and statistical properties, discordant results can be expected with different tests.
There are situations when the use of these tests is clearly inappropriate, and others where their use is futile or meaningless. Application of these tests with few studies is not wrong, but has low statistical power. Application in the presence of heterogeneity is more clearly inappropriate, and may lead to false-positive claims for publication bias.14,16,17 When all available studies are equally large (i.e., have similar precision), the tests are not meaningful. Finally, it makes no sense to evaluate whether studies with significant results are preferentially published when none with significant results have been published.
Despite these limitations, these tests figure prominently in the medical literature. It would be useful to estimate how often these tests are appropriately or meaningfully applied. We therefore appraised almost 7000 meta-analyses in the Cochrane Database of Systematic Reviews to discover the extent to which tests of funnel-plot asymmetry would be inappropriate or nonconcordant. We also examined the appropriateness of the application of asymmetry testing in meta-analyses recently published in print journals.
Methods
We used issue 2, 2003, of the Cochrane Database of Systematic Reviews (n = 1669 reviews). We imported into Stata software all meta-analyses that had binary outcomes and numerical 2 × 2 table information available (n = 12 709).18 We did not consider studies where no patients in either arm of the study had an event, or all patients in both arms had an event; this eliminated 906 meta-analyses. Zero counts in one arm only were handled in the calculations via the addition of 0.5 to all data cells, which allowed an odds ratio to be calculated without distorting the data appreciably. Meta-analysis data sets were further scrutinized for similarity. When numbers of studies, patients and events were all the same and summary results were identical (to 7 digits of accuracy), the meta-analyses were considered to contain duplicate data sets and only one of them was retained: similarity checks eliminated 761 duplicate meta-analyses. We also excluded meta-analyses where only 2 studies were available (n = 4169), which makes correlation and regression diagnostics impossible to calculate. Thus, our analysis of the wider Cochrane data set included data from 6873 meta-analyses.
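The handling of 2 × 2 tables described above can be illustrated with a minimal Python sketch. The function name and argument names are hypothetical, and the standard-error formula (square root of the sum of reciprocal cell counts) is the usual large-sample approximation, assumed here rather than taken from the text.

```python
import math

def log_odds_ratio(events_a, total_a, events_b, total_b):
    """Return (log odds ratio, standard error) for one 2 x 2 table.

    Hypothetical helper: names and error handling are illustrative only.
    """
    a, b = events_a, total_a - events_a   # arm A: events, non-events
    c, d = events_b, total_b - events_b   # arm B: events, non-events
    # Studies with no events in either arm, or events in everyone, are excluded.
    if (a == 0 and c == 0) or (b == 0 and d == 0):
        raise ValueError("no events, or all events, in both arms")
    # A zero count elsewhere is handled by adding 0.5 to all four cells.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se
```

For example, `log_odds_ratio(0, 10, 5, 10)` applies the correction and works with cell counts 0.5, 10.5, 5.5 and 5.5.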
The data sets of these meta-analyses are not necessarily independent. Within the same systematic review, different outcomes, contrasts and analyses may be correlated. To minimize correlation, we created a separate, more restricted data set for which we selected one meta-analysis, the one with the largest number of studies, per systematic review. When the largest number of studies was equal in 2 or more of the meta-analyses, we chose the one with the largest number of subjects; if that number was also equal, we chose the one with the largest number of events. The problem of inappropriateness of the asymmetry tests due to limited number of studies was thereby minimized in this analysis of the restricted Cochrane data set of data from 846 meta-analyses.
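The tie-breaking rule above amounts to a lexicographic maximum. A short sketch, assuming each meta-analysis is represented as a dictionary with hypothetical keys `studies`, `subjects` and `events`:

```python
def pick_largest(meta_analyses):
    """Select one meta-analysis per review: most studies, then most
    subjects, then most events (keys are illustrative assumptions)."""
    return max(
        meta_analyses,
        key=lambda m: (m["studies"], m["subjects"], m["events"]),
    )
```

Tuples compare element by element in Python, so the number of subjects is consulted only when the number of studies is tied, and the number of events only when both are tied.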
For each eligible meta-analysis, we evaluated 4 aspects that bear on whether applying an asymmetry test may be meaningful or appropriate. Statistical significance of heterogeneity was tested with the χ^{2}-based Q statistic and considered significant for p < 0.10 (2-tailed);19 the extent of between-study heterogeneity was measured with the I^{2} statistic and considered large for values of 50% or more.20 The number of included studies was noted; 10 or more was considered sufficient. To see if the difference in precision of the largest and the smallest study was sufficiently large (ratio of extreme values of variances > 4), we noted the ratio of the maximal versus minimal variance (the square of the standard error of estimates) across the included studies. Finally, we recorded whether at least one study had found formally statistically significant results (p < 0.05).
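The criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Stata code: the function names are hypothetical, Q is computed around the inverse-variance fixed-effect estimate, I^{2} is taken as max(0, (Q − df)/Q), and the χ^{2} tail probability is obtained from a simple series expansion to keep the example free of third-party libraries. The heterogeneity criterion is split into its two parts (Q test and I^{2}), so 5 checks are reported.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable (series expansion)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def asymmetry_test_criteria(log_ors, ses, study_p_values):
    """Return (Q, I2 in percent, dict of checks) for one meta-analysis."""
    k = len(log_ors)
    weights = [1.0 / se ** 2 for se in ses]          # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ors))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    variances = [se ** 2 for se in ses]
    checks = {
        "no_significant_heterogeneity": chi2_sf(q, df) >= 0.10,
        "i2_below_50": i2 < 50.0,
        "at_least_10_studies": k >= 10,
        "variance_ratio_above_4": max(variances) / min(variances) > 4.0,
        "any_significant_study": any(p < 0.05 for p in study_p_values),
    }
    checks["qualifies"] = all(checks.values())
    return q, i2, checks
```

A meta-analysis is flagged as qualifying only when every check passes, which mirrors the requirement that all criteria be fulfilled together.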
Some debate about the extent to which criteria need be fulfilled for asymmetry tests to be meaningful or appropriate is unavoidable. The thresholds listed above are not very demanding, based on the properties of the tests. Results of analyses with alternative, even more lenient criteria are illustrated in Venn diagrams of the 4 overlapping criteria.
The odds ratio was used as the metric of choice for all the meta-analyses. We documented the degree of overlap of the criteria described above and the number of meta-analyses that would qualify, based not only upon each criterion but also on combinations thereof.
We evaluated each meta-analysis by means of 3 asymmetry tests: the 2 most popular tests in the literature (the Begg–Mazumdar τ rank-correlation coefficient,7 and the standard regression test of the standardized effect size [i.e., the natural logarithm of the odds ratio divided by its standard error] against its precision [the inverse of the standard error]8) and a new variant, a modified version of the regression test, which has a lower false-positive rate.10 For all tests, statistical significance was claimed for p < 0.10 (2-tailed).7,8,10 We estimated inferences on the basis of these 3 tests in the entire data sets and in the subsets of meta-analyses fulfilling the appropriateness criteria already described. Pairwise concordance between the 3 tests was assessed with the κ statistic.21
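As an illustration of the standard regression test described above, the following Python sketch fits an ordinary least-squares line of the standardized effect (log odds ratio divided by its standard error) on precision (inverse of the standard error). The function name is hypothetical. Asymmetry is judged from the intercept; the p value would come from a t distribution with k − 2 degrees of freedom, which is not computed here. The rank-correlation test and the modified regression test are not shown.

```python
import math

def egger_regression(log_ors, ses):
    """Return (intercept, slope, standard error of the intercept).

    Minimal sketch; requires at least 3 studies with differing precision.
    """
    k = len(log_ors)
    snd = [y / se for y, se in zip(log_ors, ses)]    # standardized effects
    prec = [1.0 / se for se in ses]                  # precisions
    mean_x = sum(prec) / k
    mean_y = sum(snd) / k
    sxx = sum((x - mean_x) ** 2 for x in prec)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(prec, snd))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with k - 2 degrees of freedom.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(prec, snd))
    resid_var = rss / (k - 2)
    se_intercept = math.sqrt(resid_var * (1.0 / k + mean_x ** 2 / sxx))
    return intercept, slope, se_intercept
```

An intercept near zero is what a symmetric funnel plot would produce; the ratio of the intercept to its standard error is the test statistic.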
The Cochrane Handbook for Systematic Reviews of Interventions16 has taken a critical stance toward the use of these tests. RevMan, the Cochrane Library meta-analysis software, does not include any options for running them, and their use in the Cochrane Library is limited.22 We therefore used a sample of meta-analyses in printed journals to examine whether these tests are used inappropriately in practice. We examined papers published in 2005 that cited the most common reference for the standard regression test,8 the asymmetry test most commonly used in the current literature. We screened citations in sequential order (as indexed in the Science Citation Index) until we identified 60 meta-analyses in which asymmetry testing had been employed. The 60 meta-analyses examined were within 24 published articles. Although we focused on the standard regression test,8 we also recorded results from the other 2 tests whenever such data were reported. We examined whether these 60 meta-analyses fulfilled the criteria that we set, what they found, and how they interpreted the application of the test.
Results
In terms of fulfillment of criteria, the most common feasibility problem we encountered in both of our Cochrane data set analyses was too low a number of studies, with three-quarters or more of the meta-analyses examining fewer than 10 studies (Table 1). Lack of significant studies was also a major issue: of the wider and restricted data sets, about half and a third of the meta-analyses, respectively, included no studies with statistically significant results; a fifth and a quarter, respectively, had significant or large between-study heterogeneity; and nearly a quarter and a fifth, respectively, had a ratio of extreme values of variances of 4 or less. Only 366 (5%) of the meta-analyses in the wider Cochrane data set and 98 (12%) of those in the restricted Cochrane data set fulfilled all 4 of the original criteria (Fig. 1, left).
Results of the 3 tests showed statistically significant asymmetry in few meta-analyses (Table 2); overall, in the 2 data sets, rates of significant signals (i.e., statistically significant results) varied between 7% and 18%. They tended to be smallest for the correlation test and highest for the unmodified standard regression test, but did not much differ between the 2 data sets. When the data sets were split according to whether meta-analyses met the criteria for applying asymmetry tests or not, significant signals were more prevalent in the meta-analyses that fulfilled the criteria than in those that did not. Nevertheless, even in the former group, the rates of signals varied from 14% to 24%.
The 3 asymmetry tests had modest concordance across the entire data sets (Table 2, Fig. 2); results were largely similar across the wider and restricted Cochrane data sets. Overall, 3% and 4% of the meta-analyses, respectively, gave a significant signal with all 3 tests. In 19% and 22% of the meta-analyses, a result from at least 1 of the 3 tests was significant. Estimated κ values fell generally below 0.5 (range 0.33–0.45) for the concordance of the correlation test with either of the regression diagnostics, and were somewhat higher (0.64–0.66) for concordance between the unmodified and modified regression diagnostics. When analyses were limited to meta-analyses that fulfilled the criteria for asymmetry tests, concordance slightly improved between the correlation and the regression diagnostics (estimated κ 0.39–0.60) and worsened slightly between the unmodified and modified regression diagnostics (estimated κ 0.57–0.59).
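For readers unfamiliar with the κ statistic, a minimal Python sketch of pairwise concordance between 2 tests follows. It assumes each test's result per meta-analysis is coded as 1 (significant) or 0 (not significant); the function name is hypothetical.

```python
def cohen_kappa(calls_a, calls_b):
    """Cohen's kappa for two equal-length sequences of 0/1 calls.

    Undefined (division by zero) when chance agreement equals 1,
    e.g. when both tests give the same call for every meta-analysis.
    """
    n = len(calls_a)
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    p_a = sum(calls_a) / n            # proportion significant, test A
    p_b = sum(calls_b) / n            # proportion significant, test B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)   # agreement by chance
    return (observed - expected) / (1 - expected)
```

κ corrects the raw agreement for the agreement expected by chance, which matters here because most meta-analyses are nonsignificant on every test and raw agreement is therefore high even for poorly concordant tests.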
Of the 60 meta-analyses that stated their use of the regression test within the 24 print articles, use of the test was meaningful or appropriate in 7 of the meta-analyses (12%, 95% confidence interval 5%–23%). Of the 24 articles, 6 had at least one meta-analysis where use of the test was appropriate. Twenty-six meta-analyses had significant heterogeneity (all with I^{2} > 50%), and another 4 had I^{2} > 50% without statistically significant heterogeneity. Twenty-six meta-analyses were of fewer than 10 studies. Eighteen meta-analyses included no significant studies; 3 had ratios of extreme variances ≤ 4. Four of the 24 articles also reported rank-correlation test results (with similar inferences). Another cited the regression test when what had actually been performed were rank-correlation tests. One other article apparently used a regression test based on sample size, a different test than the one that was cited.
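The text does not state how the confidence interval for 7 of 60 was computed. As one possibility, the sketch below computes an exact (Clopper–Pearson) binomial interval by bisection; the choice of method and the function names are assumptions, and the result can be compared against the reported interval.

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(x + 1))

def _solve_increasing(g, target):
    """Find p in [0, 1] with g(p) = target, for g increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(successes, n, level=0.95):
    """Clopper-Pearson interval for a binomial proportion."""
    alpha = 1.0 - level
    s = successes
    # Lower limit: P(X >= s | p) = alpha / 2.
    lower = 0.0 if s == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(s - 1, n, p), alpha / 2)
    # Upper limit: P(X <= s | p) = alpha / 2.
    upper = 1.0 if s == n else _solve_increasing(
        lambda p: 1 - binom_cdf(s, n, p), 1 - alpha / 2)
    return lower, upper
```

Calling `exact_binomial_ci(7, 60)` returns the lower and upper limits as proportions.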
All 24 articles claimed that the tests were done to estimate publication bias, with a single exception: an article that clarified that the authors tested for “small-study bias, of which publication bias is one potential cause.” Eleven meta-analyses (18%) claimed that there was evidence for publication bias, whereas the other 49 stated that they found no such evidence. All meta-analyses that claimed to have detected publication bias were found to have between-study heterogeneity that was large and statistically significant.
Interpretation
In most meta-analyses, the application of funnel-plot asymmetry tests to detect publication bias is inappropriate or not meaningful. We found a major problem to be lack of a sufficient number of studies; lack of studies with significant results and the presence of heterogeneity were also common issues. In a smaller proportion of meta-analyses, differences in the magnitude of the smallest versus the largest studies were negligible.
When each of 3 asymmetry (“publication bias”) tests was applied, we found a minority of the examined meta-analyses to have a positive signal. About a fifth of the meta-analyses gave a signal with any of the 3 tests; 3%–4% gave consistent signals for asymmetry with all diagnostics. In the absence of a criterion standard about the presence of publication bias, it is impossible to decide whether these figures were low because the tests we examined were underpowered or because publication bias is uncommon. Moreover, concordance among the 3 tests was modest. Automatic and undocumented use of these tests may lead to unreliable inferences.
A survey of 60 recently published meta-analyses from 24 published reports that had cited use of the standard regression test8 revealed that most had used the test inappropriately. With one exception, all these articles misleadingly equated the results of these tests with the presence or absence of publication bias, ignoring numerous other causes that may underlie differences between small and larger studies.8 Moreover, all signals for publication bias occurred in meta-analyses with large, significant between-study heterogeneity. It is also disquieting that 82% of the meta-analyses were assumed to have no publication bias simply because of a “negative” asymmetry test result.
When these diagnostics give significant signals, this does not necessarily mean that publication bias is present. This applies even when the meta-analyses fulfill all of the 4 eligibility criteria that we considered. In the absence of a prospective registry of studies, publication bias cannot be proven or excluded, because a criterion standard is lacking.
The 4 criteria we used are merely technical and conceptual prerequisites. Even if statistical prerequisites are met, the conceptual assumptions may sometimes not hold. Very large sample size,11 increased attention to the research question and heightened interest in contradicting previous publications with extreme opposite results may contribute as much or more than statistical significance to dictating publication in selected cases or in entire scientific fields.23
We used the Cochrane Database of Systematic Reviews because it is by far the largest compilation of meta-analyses. The composition of this database may differ from that of the totality of meta-analyses published.22,24,25 Despite some uneven emphasis on specific diseases in the evolving Cochrane Database of Systematic Reviews,26 this database is likely to be less selective compared with the meta-analyses that appear in the medical journal literature. Meta-analyses published in printed medical journals are larger but also more likely to have large heterogeneity, because they also include a greater share of nonrandomized studies. In the journal literature, the percentage of meta-analyses where asymmetry tests are applied inappropriately is therefore also very high.
There can be some subjectivity about thresholds for a definition of when a statistical test is meaningful or appropriate. Our criteria tended toward the lenient; use of even more lenient criteria would increase the proportion of appropriateness, but not to very high percentages (Fig. 1).
Publication bias is compounded by additional biases that pertain to selective outcome reporting27,28 and “significance-chasing”29 in the data published. It would be misleading to claim that all these problems can be addressed with asymmetry tests. Occasionally, in a meta-analysis of many studies, the retrieval of unpublished data may “correct” a funnel-plot asymmetry.30 However, we should caution that, when unpublished data exist, only a portion might possibly be retrievable; so, it is unknown what would happen if data from all studies could be retrieved. Whenever both unpublished and published information is available, the results of these 2 types of evidence should be compared. Nevertheless, as has been stressed repeatedly, prospective registration of clinical studies and of their analyses and outcomes5,31 may be the only means to properly address publication bias.
In conclusion, meta-analysts should refrain from inappropriate or unmeaningful application of funnel-plot asymmetry tests. Readers should not be misled that publication bias has been documented or excluded according to inappropriate use or interpretation of funnel plots.
Footnotes

This article has been peer reviewed.
Contributors: John Ioannidis originated the study concept and wrote the protocol and manuscript, with input and critical revisions by Thomas Trikalinos. John Ioannidis evaluated the meta-analyses published in printed journals; Thomas Trikalinos performed all the statistical analyses. Both authors interpreted the data from their analyses, and approved the final version of the article for publication.
Competing interests: None declared.