Christopher Carpenter suggests another approach to teaching the kappa statistic: giving real data to the students and having them do the requisite calculations. Our approach takes them through the principles of the calculations, step by step. Calculating the kappa score from real data would be an excellent subsequent step for the most enthusiastic students. We encourage readers to consult the teachers' version of our article on the kappa score,1 which has interactive components and data that may be helpful in understanding the concept of kappa.
Regarding the interpretation of kappa scores, Table 1 in both the teachers'1 and learners'2 versions of our article was based on a text by Sackett and colleagues3 and not, as Carpenter correctly points out, the article by Maclure and associates.4 Carpenter is also correct in noting that several different versions of this table are available in the literature. All have 3 basic categories: poor agreement, fair to good agreement, and very good to excellent agreement. In our view, further differentiating within these groups adds little to the practical clinical discussion.
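To make the 3 basic categories concrete, here is a minimal sketch of how a kappa value maps onto them. The numeric cutoffs (0.40 and 0.75) are Fleiss's commonly cited ones, used here purely for illustration; as noted above, other texts draw the lines slightly differently.

```python
# Illustrative only: map a kappa value onto the 3 broad agreement
# categories. The 0.40 and 0.75 cutoffs follow Fleiss; other versions
# of the table in the literature use slightly different boundaries.

def interpret_kappa(kappa):
    """Return the broad agreement category for a kappa value."""
    if kappa < 0.40:
        return "poor agreement"
    if kappa <= 0.75:
        return "fair to good agreement"
    return "very good to excellent agreement"

print(interpret_kappa(0.82))  # very good to excellent agreement
```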
Although we have never seen it in real life, David Juurlink and Allan Detsky correctly state that kappa can theoretically be less than 0 when agreement is poorer than chance. This is most likely to occur when both observers call almost every observation positive or almost every observation negative. In these circumstances, chance-corrected agreement would be close to zero and could at times be negative; determining chance-independent agreement (the phi statistic) may represent a better approach.5
Michael Allan asks about chance-corrected agreement when outcomes are categorical or continuous. One useful approach to this problem is the “weighted kappa,” which gives maximal credit for full agreement, partial credit for partial agreement and no credit when disagreement is extreme. For example, in the case of ventilation–perfusion scans for the evaluation of pulmonary embolus, if both people reading a scan interpret the test result as normal (or both say there is intermediate or high probability of embolus), they get full credit for their agreement (weight of 1.0). If one reads the result as normal and the other as high probability, they get no credit (weight of 0). If one assessor classifies the scan as low probability and the other calls the same scan high probability or normal, they get partial credit (a weight of 0.75, for instance). Readers can find a more in-depth explanation and the details of how to calculate weighted kappa on the Web site of MedCalc Software (Mariakerke, Belgium; www.MedCalc.org).
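The calculation itself can be sketched as follows. Everything here is illustrative: the counts are invented, and the weight matrix uses simple linear weights (1.0 for full agreement, 0 for the normal-versus-high extreme) rather than the exact 0.75 scheme suggested above; any agreement weights with 1.0 on the diagonal can be substituted.

```python
# Hedged sketch of weighted kappa for ordered categories, echoing the
# ventilation-perfusion scan example. Counts and weights are hypothetical.

CATEGORIES = ["normal", "low", "intermediate", "high"]

def weighted_kappa(table, weights):
    """Weighted kappa from a square table; weights hold 1.0 on the diagonal."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed and chance agreement, each credited according to the weights.
    p_o = sum(weights[i][j] * table[i][j]
              for i in range(k) for j in range(k)) / n
    p_e = sum(weights[i][j] * row_tot[i] * col_tot[j]
              for i in range(k) for j in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Linear agreement weights: 1 - |i - j| / (k - 1), so full agreement scores
# 1.0, adjacent categories 2/3, and normal vs. high scores 0.
k = len(CATEGORIES)
weights = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]

# Invented counts: reader A (rows) vs. reader B (columns).
table = [[20, 3, 1, 0],
         [2, 10, 4, 1],
         [0, 3, 8, 2],
         [0, 1, 2, 13]]
print(round(weighted_kappa(table, weights), 3))
```

With these invented counts, most disagreements sit in adjacent categories, so the weighted kappa comes out higher than an unweighted kappa on the same table would.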
Footnotes
Competing interests: None declared.