Abstract
BACKGROUND: Peer review is used to determine what research is funded and published, yet little is known about its effectiveness, and it is suspected that there may be biases. We investigated the variability of peer review and factors influencing ratings of grant applications.
METHODS: We evaluated all grant applications submitted to the Canadian Institutes of Health Research between 2012 and 2014. The contribution of application, principal applicant and reviewer characteristics to overall application score was assessed after adjusting for the applicant’s scientific productivity.
RESULTS: Among 11 624 applications, 66.2% of principal applicants were male and 64.1% were in a basic science domain. We found a significant nonlinear association between scientific productivity and final application score that differed by applicant gender and scientific domain, with higher scores associated with past funding success and h-index and lower scores associated with female applicants and those in the applied sciences. Significantly lower application scores were also associated with applicants who were older, evaluated by female reviewers only (v. male reviewers only, −0.05 points, 95% confidence interval [CI] −0.08 to −0.02) or reviewers in scientific domains different from the applicant’s (−0.07 points, 95% CI −0.11 to −0.03). Significantly higher application scores were also associated with reviewer agreement in application score (0.23 points, 95% CI 0.20 to 0.26), the existence of reviewer conflicts (0.09 points, 95% CI 0.07 to 0.11), larger budget requests (0.01 points per $100 000, 95% CI 0.007 to 0.02), and resubmissions (0.15 points, 95% CI 0.14 to 0.17). In addition, reviewers with high expertise were more likely than those with less expertise to provide higher scores to applicants with higher past success rates (0.18 points, 95% CI 0.08 to 0.28).
INTERPRETATION: There is evidence of bias in peer review of operating grants that is of sufficient magnitude to change application scores from fundable to nonfundable. This should be addressed by training and policy changes in research funding.
Peer review is the backbone of modern science. Scientists with expertise in the field collectively make recommendations about what research is funded. Despite the almost ubiquitous use of peer review and its role in the scientific enterprise, there is limited evidence about its effectiveness.1–3 Critics have expressed concerns about the reliability and fairness of the process, and its innate conservatism in funding interdisciplinary and innovative science.4–6 An increasing number of empirical studies of peer review have investigated some of these criticisms. Reliability, when measured, is poor;7–11 of greater concern is evidence suggesting the presence of systematic bias. Female scientists are less likely to be funded and published than male scientists.12–22 Reviewers who declare conflicts of interest with an application positively bias other reviewers’ rating.23,24 Yet, few studies have taken differences in the quality of the applicant or nature of the research into account.14,25 The first study to disentangle these effects, by adjusting for the applicant’s publication impact score to quantify potential bias in the peer review of postdoctoral fellowships, reported substantial gender bias, with female scientists with the highest productivity being scored equivalent to males with the lowest productivity.23 There have been limited attempts to separate scientific quality from potential biases in the investigator-initiated operating grant competitions that fund the bulk of science.13,14,22,25–32 When scientific productivity is taken into account, potential gender biases are not evident in all studies, even within the same funding agency.13,14,22,28 Differences in the characteristics of peer reviewers may explain the lack of consistency in findings, but the interaction between reviewer and applicant characteristics has not yet been investigated.
In particular, there is interest in determining whether reviewer gender, expertise, success rate, experience, scientific domain, conflict of interest and reviewer disagreement would influence and potentially bias the overall rating of an application. In this study, we used data from the national health research funding agency in Canada to estimate the reliability of peer review and to investigate potential bias in rating after differences in the scientific productivity of applicants had been taken into account.
Methods
The Canadian Institutes of Health Research (CIHR) is Canada’s national health research funding agency. CIHR invests about $800 million annually in health research.33 About 70%, or $540 million annually, is used for investigator-driven research, and $268 million for research to address strategic priorities for the country.33 Until 2015, when reforms were made in funding programs, investigator-initiated operating grant applications were submitted to biannual competitions and were evaluated by 1 of 53 standing committees, selected by the applicant as the most appropriate for review. The chair and scientific officer of each standing committee, composed of 10–15 committee members, assigned a first and second reviewer to each application, based on the committee member’s self-assessment of their expertise to review each application and conflict of interest, if relevant. The first and second reviewer independently assigned preliminary scores that reflected their assessment of the quality of the application, from 1 (poor) to 4.9 (excellent); provided a written review; and presented the application and their comments to the committee for discussion. The first and second reviewer agreed on a consensus score after committee discussion, and all committee members then scored the application within 0.5 points above or below the consensus score. The final score for an application was computed as the mean of scores assigned by all committee members, and was used to rank applications. The top-ranked applications in each committee were funded, with funded and not funded applications often differing by less than 0.1 of a point in score. If both reviewers independently assigned a score of < 3.5 to an application, the application was considered nonfundable: it was not discussed or rated by the committee, and the mean score of the 2 reviewers was used as the final score. Members who were in conflict with an application were excused during the discussion and rating.
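The scoring rules described above can be illustrated with a minimal sketch. The function name, argument names and error handling are hypothetical and introduced only for illustration; they do not represent CIHR software.

```python
from statistics import mean

# Values taken from the description of the review process above.
SCORE_MIN, SCORE_MAX = 1.0, 4.9   # rating scale, poor to excellent
TRIAGE_THRESHOLD = 3.5            # both preliminary scores below this: nonfundable
MAX_DEVIATION = 0.5               # committee scores stay within 0.5 of the consensus


def final_score(first, second, committee_scores=None, consensus=None):
    """Return (final score, triaged flag) for one application.

    `first` and `second` are the preliminary scores of the 2 reviewers.
    `committee_scores` and `consensus` are needed only when the
    application is discussed by the committee (hypothetical interface).
    """
    for score in (first, second):
        if not SCORE_MIN <= score <= SCORE_MAX:
            raise ValueError(f"score {score} is outside the 1 to 4.9 scale")

    # Triage: both reviewers independently scored below 3.5, so the
    # final score is the mean of the 2 preliminary scores.
    if first < TRIAGE_THRESHOLD and second < TRIAGE_THRESHOLD:
        return mean([first, second]), True

    if not committee_scores or consensus is None:
        raise ValueError("a discussed application needs committee scores and a consensus score")

    # Small tolerance guards against floating-point rounding at the boundary.
    for score in committee_scores:
        if abs(score - consensus) > MAX_DEVIATION + 1e-9:
            raise ValueError(f"committee score {score} is more than 0.5 from the consensus")

    # Discussed applications: mean of all committee members' scores.
    return mean(committee_scores), False
```

For example, an application with preliminary scores of 3.0 and 3.4 would be triaged with a final score of about 3.2, whereas one scored 4.0 and 4.2 would proceed to committee rating.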
Study population
We extracted all applications to the investigator-initiated open operating grant competition between 2012 and 2014 from the CIHR database. In this period, CIHR recorded the reviewers’ self-declared expertise in reviewing each application and conflicts of interest in the central research database, which allowed their contribution to application scores to be assessed.
Variables
Application characteristics
We classified the scientific domain of the application as basic sciences (biomedical), or applied sciences (clinical, health services and policy and population health), based on the applicant’s self-designation. We classified the history of the application as a new grant or a resubmission of a previously unsuccessful grant, and measured the total amount of funding requested as the sum of the amount requested per year over the grant duration.
Reviewer characteristics
We considered an application to have conflicts of interest if 1 or more members of the review panel declared a conflict of interest with the application. The self-assessed expertise of the first and second reviewer was classified as: 1) both reviewers had high expertise, 2) a mix of high and medium expertise, 3) a mix of low expertise with a high- or medium-expertise reviewer, and 4) both with low expertise. The genders of the reviewers were classified as: 1) both male, 2) male and female, 3) both female. The research experience of the first and second reviewer was based on the number of years they had applied to CIHR since its inception in 2000, and the counts for the 2 reviewers were summed to provide a continuous years-of-experience measure. Similarly, the scientific domain of the reviewer’s own CIHR applications was measured as: 1) in the same scientific domain as the applicant only, 2) in mixed domains, including the domain of the applicant, 3) in domains different from that of the applicant. The proportion of applications submitted by the 2 reviewers that were successfully funded by CIHR represented the reviewers’ past success at CIHR.
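The reviewer-pair classifications above can be sketched as follows. All function names and string labels are hypothetical. Two points are assumptions rather than details stated in the text: a pair in which both reviewers reported medium expertise is placed in category 2, and the scientific domains of the 2 reviewers' own applications are pooled before comparison with the applicant's domain.

```python
def classify_expertise(first, second):
    """Return expertise category 1 to 4 for a pair of self-assessed levels."""
    levels = {first, second}
    if not levels <= {"high", "medium", "low"}:
        raise ValueError("levels must be 'high', 'medium' or 'low'")
    if levels == {"high"}:
        return 1                      # both reviewers high
    if levels == {"low"}:
        return 4                      # both reviewers low
    if "low" in levels:
        return 3                      # low with a high or medium reviewer
    return 2                          # high/medium mix (both medium: assumption)


def classify_gender_mix(first, second):
    """Return the gender mix of the 2 reviewers."""
    genders = {first, second}
    if genders == {"male"}:
        return "both male"
    if genders == {"female"}:
        return "both female"
    if genders == {"male", "female"}:
        return "male and female"
    raise ValueError("genders must be 'male' or 'female'")


def classify_domain(reviewer_domains, applicant_domain):
    """Compare the pooled domains of the reviewers' own applications
    with the applicant's domain (pooling is an assumption)."""
    domains = set(reviewer_domains)
    if not domains:
        raise ValueError("no reviewer applications on record")
    if domains == {applicant_domain}:
        return "same domain only"
    if applicant_domain in domains:
        return "mixed, including applicant's domain"
    return "different domains"


def combined_experience(years_first, years_second):
    """Sum of the years each reviewer applied to CIHR since 2000."""
    return years_first + years_second


def pooled_success_rate(funded_first, submitted_first, funded_second, submitted_second):
    """Proportion of the 2 reviewers' own applications that were funded."""
    submitted = submitted_first + submitted_second
    if submitted == 0:
        raise ValueError("reviewers have no applications on record")
    return (funded_first + funded_second) / submitted
```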
Applicant characteristics
The principal applicant’s self-reported age, gender and primary academic institution were retrieved from the application form. When there was more than 1 principal applicant (17.5% of applications), the characteristics of the older, more senior applicant were measured.
Scientific productivity
We measured the applicant’s scientific productivity by 2 indicators of academic performance and predictors of funding success: 1) previous success rate in CIHR funding, and 2) bibliometric indicators of impact. To measure CIHR funding success, we retrieved all applications submitted to CIHR since 2000 and calculated the proportion funded to provide a quantitative measure of success rate. For the bibliometric measures, we calculated the Wennerås total impact measure to enable comparisons with this study.23 This indicator sums the impact factors of all published articles. We also calculated the h-index for each applicant.34 This measure estimates the impact of a scientist’s cumulative research contributions based on citations and allows an unbiased comparison of scientific achievement between individuals competing for the same resources.
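As an illustration of the 3 productivity indicators, the following minimal sketch computes them from simple lists. The function names and input formats are hypothetical; the h-index follows its standard definition (the largest h such that h publications each have at least h citations).

```python
def h_index(citations):
    """Largest h such that h publications each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def total_impact(journal_impact_factors):
    """Total impact measure: sum of the impact factors of the journals
    in which the applicant's articles were published (one entry per article)."""
    return sum(journal_impact_factors)


def funding_success_rate(n_funded, n_submitted):
    """Proportion of the applicant's submitted applications that were funded."""
    if n_submitted == 0:
        raise ValueError("applicant has no previous applications")
    return n_funded / n_submitted
```

For instance, an applicant whose 5 publications were cited 10, 8, 5, 4 and 3 times has an h-index of 4.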
To produce the bibliometric measures of scientific productivity, we used the applicant’s first, middle and last name listed on the grant proposal to retrieve all publications, up to and including 2011, from the Web of Science, where the applicant was listed as an author. For each publication, we retrieved the detailed text file (authors’ names, corresponding author’s name and institution, publication title) and the citation reports. We found the impact factor for each journal by linking the ISSN of the journal to the Journal Citation Record file or, when there was no recorded ISSN, we used the full and abbreviated journal name to make the link.
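The journal linkage step can be sketched as a lookup with a fallback. The record fields (`issn`, `journal`, `journal_abbrev`) and the 2 lookup tables are hypothetical structures chosen for illustration, and the name normalization is an assumption; the actual files and matching rules may have differed.

```python
def normalise(name):
    """Lower-case a journal name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def find_impact_factor(publication, by_issn, by_name):
    """Return the journal impact factor for one publication, or None.

    publication: dict with optional keys 'issn', 'journal', 'journal_abbrev'
    by_issn:     dict mapping ISSN -> impact factor
    by_name:     dict mapping normalised full or abbreviated name -> impact factor
    """
    issn = publication.get("issn")
    if issn:
        # Primary link: ISSN of the journal.
        return by_issn.get(issn)

    # Fallback, used only when no ISSN was recorded: full name, then abbreviation.
    for key in ("journal", "journal_abbrev"):
        name = publication.get(key)
        if name and normalise(name) in by_name:
            return by_name[normalise(name)]
    return None
```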
Application scores
For each application, we retrieved the first and second reviewer scores to estimate the reliability of rating, as well as the final score to assess potential sources of systematic bias. As disagreement in ratings between reviewers may bias final application scores, we classified applications as having differences in score between the first and second reviewer of greater than 1 scale point to assess the impact of disagreement on the final application score.
Statistical analysis
We used descriptive statistics to summarize the application, applicant and reviewer characteristics. Inter-rater reliability was estimated by the intraclass correlation coefficient (ICC), where values of 0.00 to 0.40, 0.41 to 0.59, 0.60 to 0.74, and 0.75 to 1.00 are considered to represent poor, fair, good and excellent agreement, respectively.35 We estimated the ICC for the first and second reviewer, overall and by scientific domain. Differences in within-rater variance in different scientific domains were tested using a 2-tailed F test.
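To make the reliability measure concrete, the sketch below computes a one-way random-effects ICC for a single rating and maps it to the agreement categories listed above. The choice of the one-way form is an assumption, because the specific ICC model is not stated here, and the function names are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for a single rating.

    ratings: list of equal-length tuples, one tuple of reviewer scores
             per application, e.g. (first reviewer, second reviewer).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("at least 2 applications are required")
    k = len(ratings[0])
    if k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("each application needs the same number (>= 2) of scores")

    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]

    # Between-application and within-application sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for r, m in zip(ratings, row_means) for x in r)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def agreement_label(icc):
    """Map an ICC (rounded to 2 decimals) to the categories used in the text."""
    value = round(icc, 2)
    if value >= 0.75:
        return "excellent"
    if value >= 0.60:
        return "good"
    if value >= 0.41:
        return "fair"
    return "poor"
```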
To assess potential sources of systematic bias in rating, we estimated the association between the applicant’s scientific productivity and his or her final application score using multiple linear regression within a generalized estimating equation framework to account for clustering of multiple applications from the same applicant. We used an exchangeable correlation structure to account for clustering, and added quadratic terms to assess linearity. Application was the unit of analysis, and final application score was the outcome. As the h-index and the total impact measure were highly correlated, we included only the h-index and funding success rate in the final model. We added application, reviewer and applicant characteristics to the model. In theory, after adjusting for scientific productivity, there should be no additional variance in final application score that is explained by applicant gender or age, or by reviewer expertise or agreement. To determine whether the relationship between scientific productivity and application score was modified by applicant gender, gender mix of the reviewers, scientific domain or reviewer expertise, we included 2-way interaction terms in the model and tested them using the Wald χ2 test. As basic and applied sciences have been shown to differ in the weight given to past scientific productivity in evaluating the quality of the application,32,36,37 we also tested the 3-way interaction between scientific productivity, science domain and applicant characteristics. To facilitate interpretation of significant interactions, we illustrated these associations graphically, and calculated the impact of these biases on final application score for common scenarios. All analyses were completed using SAS version 9.4 TS Level 1M4.
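One way to write the mean structure of such a model is sketched below. The notation is introduced here purely for illustration and is an assumption, not the fitted specification: Y_{ij} is the final score of application j from applicant i, h_i the h-index, s_i the past funding success rate, x_{ij} the vector of application, reviewer and applicant characteristics, and z_{ij} the potential effect modifiers (applicant gender, reviewer gender mix, scientific domain, reviewer expertise). Only the 2-way interactions with s_i are shown; interactions with h_i and the 3-way terms would be written analogously.

```latex
\begin{aligned}
\mathrm{E}[Y_{ij}] &= \beta_0 + \beta_1 h_i + \beta_2 h_i^2 + \beta_3 s_i + \beta_4 s_i^2
  + \boldsymbol{\gamma}^{\top}\mathbf{x}_{ij}
  + \boldsymbol{\delta}^{\top}\left(s_i\,\mathbf{z}_{ij}\right), \\
\mathrm{Corr}\left(Y_{ij}, Y_{ij'}\right) &= \rho \quad \text{for } j \neq j'.
\end{aligned}
```

The second line expresses the exchangeable working correlation: any 2 applications from the same applicant are assumed to share a common correlation \rho.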
Ethics approval
This study was approved by CIHR senior executive management and the CIHR legal counsel.
Results
Overall, 11 624 applications were submitted to the open operating grant competitions between 2012 and 2014; 66.2% of principal applicants were male and 69.1% were aged 40 years or older (Table 1). The scientific domains of the applications were basic science (64.1%) and applied science (35.9%), the latter comprising clinical (16.6%), health services and policy (8.1%) and population health (11.3%) applications. Most applications were new submissions, and more than half had 1 to 3 investigators. The mean amount of funding requested was $747 981. About 20% of applications were classified as nonfundable because both the first and second reviewer independently provided scores of less than 3.5.
In the majority of applications, both the first and second reviewer had high (16.3%) or medium-high (68.1%) expertise to review, and half had submitted their own grant applications in the same science domain in which they were reviewing (Table 1). The majority of applications were reviewed by both male and female reviewers, or male reviewers only. Most reviewers had between 10 and 20 years of combined experience, and a success rate in their own applications of between 25% and 50%. In 66.9% of applications, at least 1 member of the review panel had a conflict of interest with the application. Female applicants were more likely to apply with multiple co-investigators, ask for less funding, have their application triaged, be reviewed by female reviewers only, and have reviewers from other scientific domains.
Overall, the reliability of application rating by the first and second reviewer was fair (ICC 0.41, 95% CI 0.39 to 0.43). Reliability was fair for basic science applications (ICC 0.41, 95% CI 0.39 to 0.44) but poor for applied science applications (ICC 0.33, 95% CI 0.30 to 0.36), despite greater variance between applications in the applied sciences (Table 2). The within-rater variance component for health services and policy reviewers was almost double that of basic science reviewers (0.28 v. 0.15, p < 0.05).
The h-index and the total impact measure were highly correlated (r = 0.8), and showed similar trends regarding the characteristics of the application’s principal investigator. Clinical investigators had the highest scientific productivity, and health services and population health researchers had the lowest. Scientific productivity was systematically lower for women and for younger applicants (Table 3). History of funding success was not strongly correlated with bibliometric measures of scientific productivity (cumulative impact: r = 0.11, h-index: r = 0.18), and was highest in older, male and basic science applicants. The mean final application score was highest for basic science applications.
There was a significant nonlinear association between the h-index, past success rate and final application score (Table 4). The greatest impact of scientific productivity on the application score was at the lower levels of the distribution. The gender and scientific domain of the applicant modified the association between past success rate and application score (significant 2- and 3-way interactions) (Figure 1). Increasing past success rate in funding had a greater positive impact on application scores in basic science compared with applied science. Overall, female applicants who had past success rates equivalent to male applicants received lower application scores, the difference being greater in applied science applications and as the past success rates increased (Figure 1). Based on the fitted model (see the note in Figure 1), a female applicant in applied sciences with a success rate of 50% would get a score of 3.75 (95% CI 3.32 to 4.18), while a male applicant would get a score of 3.82 (95% CI 3.36 to 4.28). A male applicant in applied sciences needs a funding success of 23% to get a score of 3.75 (95% CI 3.39 to 4.11). A female applicant in basic sciences with a funding success of 50% would achieve a final application score of 4.02 (95% CI 3.57 to 4.47), compared with 4.06 (95% CI 3.78 to 4.34) for males.
With regard to peer review and application characteristics, significantly lower application scores were associated with both reviewers being female (adjusted difference in score v. male reviewers only, −0.05, 95% CI −0.08 to −0.02), or the applications of both reviewers being outside of the scientific domain of the applicant (−0.07, 95% CI −0.11 to −0.03). Moreover, we observed a significant interaction between reviewer expertise and applicant past funding success, such that when both reviewers had high expertise, they were more likely to provide higher application scores to applicants with higher past success rates than were reviewers with less expertise (adjusted difference 0.18, 95% CI 0.08 to 0.29). In contrast, final application scores were higher when there was reviewer agreement (adjusted difference 0.23, 95% CI 0.20 to 0.26), when at least 1 member of the panel had a conflict (0.09, 95% CI 0.07 to 0.11), for resubmissions (0.15, 95% CI 0.14 to 0.17), and for applications that requested more funding (0.01 per additional $100 000, 95% CI 0.01 to 0.02). There was no significant interaction between applicant gender and either reviewer gender or reviewer expertise.
The impact of peer review characteristics on application score is sufficient to have an impact on the likelihood of funding success. Based on the model, the estimated application score for 2 male applicants in basic science with equivalent mean scientific productivity, age and application characteristics is 3.9 for the applicant with the most favourable peer review characteristics — agreement between reviewers, conflicts on the panel, high-expertise reviewers, male reviewers only, and reviewers from the same scientific domain — compared with a score of 3.4 for the applicant without these conditions, a score that would place the application in the nonfundable range.
Interpretation
This study confirmed many of the suspected biases in the peer review of operating grant applications and identified important characteristics of peer reviewers that must be considered in application assignment. By measuring and controlling for scientific excellence of the applicant, we were able to examine how applicant, application and reviewer characteristics may unduly influence the assessment of operating grant applications. We found lower scores for applied science applications, and gender inequities in application scores that favoured male applicants over female applicants with equivalent past funding success rates, particularly in the applied sciences. Conflicts on the panel, male reviewers only, reviewers who all had high expertise, and reviewers whose own research was exclusively in the same scientific domain as the applicant’s conferred benefits in application rating.
The issue of gender inequity in peer review has been a topic of considerable debate since the original Swedish studies.23,24 Subsequent investigations evaluated differences in success rates or application scores,10,13–15,17,22,28,30,31,38,39 many without adjustment for scientific productivity,22,28,30,40 an important deficiency because women have lower productivity measures. The results are mixed;10,13–15,17,23,24,30,31,38,41 a meta-analysis suggests a modest bias of a 7% higher odds of funding success in favour of men.17 Our results provide some possible explanation of differences across studies. We showed that the association is not linear, and is modified by scientific domain, with greater inequities for women in the applied sciences at the upper end of funding success rates. This may be why studies can show negligible to large effects depending on the scientific domain and performance of the cohort being investigated. Previous studies report that female scientists are perceived as being less competent23,42 and having weaker leadership skills.13,40,42 Moreover, the language used in application evaluation criteria may favour male stereotypes (e.g., “independent,” “challenging”).13,15,40 In keeping with these biases, there may be greater concerns about the ability of successful female scientists to lead multiple funded projects, resulting in lower application scores, and lower funding success.
Although we did not find an interaction between the gender of the applicant and the gender mix of the reviewers, female reviewers were more stringent in their rating. Two previous studies reported similar results.18,30 To provide equitable assessment, these systematic differences in ratings by male and female reviewers need to be addressed — for example, by reviewer training, monitoring and intervention, and possibly statistical adjustment, as is done in high-stakes professional licensing examinations.7,43,44
Our study confirmed that conflict of interest has an important positive impact on application scoring, even though panel members who have conflicts are not present for the discussion and scoring. One possible reason is that reviewers vote favourably for applicants from the same institution, even if they have never met them and would therefore not be in conflict — a phenomenon that was noted in both the French and Swedish studies.23,24,45 Alternatively, as the same reviewers may be on the same panel for years, they may want to support the colleagues of other panel members with more positive ratings, in the spirit of collegiality. Several suggestions have been made on how to address this problem, including blinding the applicant’s identity, selecting international reviewers (especially for smaller research communities), and allowing the applicant to respond to the reviewers’ comments, as is done in manuscript review45 and by some granting agencies.18 To date, there is no evidence on whether these strategies mitigate conflict bias in peer review.
Our analyses provide novel evidence about the effect of reviewer expertise and the scientific domain of their own applications on application rating. Of particular interest was the observation that high-expertise reviewers were more likely to pay attention to the applicant’s past funding success rate, rating the applications from more successful scientists higher. There has been very limited exploration of reviewer expertise and the role it plays in grant review. A recent study of reviewer expertise at the National Institutes of Health suggests that reviewers with higher levels of expertise are more informed and positively biased in their rating of projects in their own area.29 These preliminary results suggest that the reviewers’ own grant and publication track record, as well as their self-reported expertise, should be considered in reviewer assignments.
When combined, reviewer characteristics can have a substantial effect on an application’s score and its likelihood of funding. In the worst-case scenario, an applicant who has female reviewers only, no conflicts on the committee, disagreement in the quality of the application by the reviewers, and reviewers with less expertise in the domain may receive a score 0.5 points lower on a 1 to 4.9 scale. A difference of this size could move an application with a fundable score of 3.9 to a nonfundable score of 3.4. Future research should be directed toward better methods of matching reviewers to applications, and monitoring and correcting for potential reviewer biases.
Similar to many other studies,7–10 we found that the reliability of scientific review was fair to poor. Moreover, we found that disagreement between reviewers systematically lowered the score of an application. Increasing the number of reviewers has been recommended as an effective means of improving reliability.46 Also, as noted in another study, reviewers give different weights to evaluation criteria such as originality, usefulness, methodology and feasibility.47 Structuring and rating each component is recommended to address this problem, by providing explicit, transparent weighting of assessment.47 In addition, training has been shown to be effective in getting reviewers to use rating scales in the same way.7
Limitations
There are important limitations to consider in the interpretation of the results. Although we used standard measures to assess the scientific excellence of the applicant, we had no external gold standard measure of the quality of the proposal. The improvement in scoring seen with resubmissions, and with higher funding requests, which have been reported previously,15,25,48 may represent true superiority in the quality of the proposal; however, it is unlikely that biases related to reviewer characteristics or scientific domain are related to differences in proposal quality. We were conservative in our linkage of applicants to publications, requiring perfect agreement on first and last name. Our approach likely underestimated the bibliometric measures of productivity and impact, possibly differentially penalizing female scientists if they changed their name after marriage. Finally, there may be other factors that influence application score that we could not measure, such as the quality of the institution or department.
Conclusion
We identified potential systematic biases in peer review that penalize female applicants and are associated with peer reviewer characteristics; these may be addressed through policy change, training and monitoring.
Acknowledgements:
The authors wish to acknowledge David Peckham and his team from the Canadian Institutes of Health Research for assembling the data for this study and answering many questions; and Dr. Jeff Latimer for reviewing and providing feedback on the manuscript.
Footnotes
Competing interests: None declared.
This article has been peer reviewed.
Contributors: All authors contributed substantially to study design, analysis, and interpretation of data. All authors also participated in drafting and revising the manuscript, providing final approval of the version to be published, and agree to be accountable for all aspects of the work.
Funding: Funding was provided by the Canadian Institutes of Health Research (CIHR). The study sponsor approved the use of the data, and the original manuscript as per CIHR policy. CIHR had no role in the design of the study, the analysis or interpretation of the data, or the review and approval of the final published manuscript.
Accepted November 23, 2017.