From the Department of Medicine, Division of General Internal Medicine (McGinn), and the Department of Geriatrics (Leipzig), Mount Sinai Medical Center, New York, NY; the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Departments of Epidemiology and Biostatistics and of Pediatrics, University of California, San Francisco, San Francisco, Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)
Members of the Evidence-Based Medicine Teaching Tips Working Group: Peter C. Wyer (project director), College of Physicians and Surgeons, Columbia University, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia, Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa
Imagine that you're a busy family physician and that you've found a rare free moment to scan the recent literature. Reviewing your preferred digest of abstracts, you notice a study comparing emergency physicians' interpretation of chest radiographs with radiologists' interpretations.1 The article catches your eye because you have frequently found that your own reading of a radiograph differs from both the official radiologist reading and an unofficial reading by a different radiologist, and you've wondered about the extent of this disagreement and its implications.
Looking at the abstract, you find that the authors have reported the extent of agreement using the κ statistic. You recall that κ stands for "kappa" and that you have encountered this measure of agreement before, but your grasp of its meaning remains tentative. You therefore choose to take a quick glance at the authors' conclusions as reported in the abstract and to defer downloading and reviewing the full text of the article.
Practitioners, such as the family physician just described, may benefit from understanding measures of observer variability. For many studies in the medical literature, clinician readers will be interested in the extent of agreement among multiple observers. For example, do the investigators in a clinical study agree on the presence or absence of physical, radiographic or laboratory findings? Do investigators involved in a systematic overview agree on the validity of an article, or on whether the article should be included in the analysis? In perusing these types of studies, where investigators are interested in quantifying agreement, clinicians will often come across the kappa statistic.
In this article we present tips aimed at helping clinical learners to use the concepts of kappa when applying diagnostic tests in practice. The tips presented here have been adapted from approaches developed by educators experienced in teaching evidence-based medicine skills to clinicians.2 A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/11/1369/DC1.
Clinician learners' objectives
Defining the importance of kappa
Calculating kappa
Calculating chance agreement
Tip 1: Defining the importance of kappa
A common stumbling block for clinicians is the basic concept of agreement beyond chance and, in turn, the importance of correcting for chance agreement. People making a decision on the basis of presence or absence of an element of the physical examination, such as Murphy's sign, will sometimes agree simply by chance. The kappa statistic corrects for this chance agreement and tells us how much of the possible agreement over and above chance the reviewers have achieved.
A simple example should help to clarify the importance of correcting for chance agreement. Two radiologists independently read the same 100 mammograms. Reader 1 is having a bad day and reads all the films as negative without looking at them in great detail. Reader 2 reads the films more carefully and identifies 4 of the 100 mammograms as positive (suspicious for malignancy). How would you characterize the level of agreement between these 2 radiologists?
The percent agreement between them is 96%, even though one of the readers has, on cursory review, decided to call all of the results negative. Hence, measuring the simple percent agreement overestimates the degree of clinically important agreement in a fashion that is misleading. The role of kappa is to indicate how much the 2 observers agree beyond the level of agreement that could be expected by chance. Table 1 presents a rating system that is commonly used as a guideline for evaluating kappa scores. Purely to illustrate the range of kappa scores that readers can expect to encounter, Table 2 gives some examples of commonly reported assessments and the kappa scores that resulted when investigators studied their reproducibility.
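The mammogram example can be checked numerically. The following is a minimal Python sketch and not part of the original teaching material; the function names and the positions of the 4 films that Reader 2 calls positive are illustrative assumptions. It computes the raw percent agreement and the agreement that would be expected by chance if each reader kept the same overall proportion of positive calls but assigned them independently.

```python
def percent_agreement(ratings_a, ratings_b):
    """Fraction of films on which the two readers give the same rating."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def chance_agreement(ratings_a, ratings_b):
    """Agreement expected if each reader kept the same number of positive
    calls but assigned them independently of the other reader."""
    n = len(ratings_a)
    pos_a, pos_b = sum(ratings_a), sum(ratings_b)
    neg_a, neg_b = n - pos_a, n - pos_b
    return (pos_a * pos_b + neg_a * neg_b) / (n * n)


# Hypothetical data: True means "positive (suspicious for malignancy)".
reader_1 = [False] * 100              # reads every film as negative
reader_2 = [True] * 4 + [False] * 96  # calls 4 of the 100 films positive

observed = percent_agreement(reader_1, reader_2)
expected = chance_agreement(reader_1, reader_2)
print(f"observed agreement:           {observed:.0%}")
print(f"agreement expected by chance: {expected:.0%}")
print(f"agreement beyond chance:      {observed - expected:.0%}")
```

Under these assumptions the observed agreement and the chance-expected agreement are both 96%, so none of the agreement lies beyond chance.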
The bottom line
If clinicians neglect the possibility of chance agreement, they will come to misleading conclusions about the reproducibility of clinical tests. The kappa statistic allows us to measure agreement above and beyond that expected by chance alone. Examples of kappa scores for frequently ordered tests sometimes show surprisingly poor levels of agreement beyond chance.
Tip 2: Calculating kappa
What is the maximum potential for agreement between 2 observers doing a clinical assessment, such as presence or absence of Murphy's sign in patients with abdominal pain? In Fig. 1, the upper horizontal bar represents 100% agreement between 2 observers. For the hypothetical situation represented in the figure, the estimated chance agreement between the 2 observers is 50%. This would occur if, for example, each of the 2 observers randomly called half of the assessments positive. Given this information, what is the possible agreement beyond chance?
The vertical line in Fig. 1 intersects the horizontal bars at the 50% point that we identified as the expected agreement by chance. All agreement to the right of this line corresponds to agreement beyond chance. Hence the maximum agreement beyond chance is 50% (100% − 50%).
The other number you need to calculate the kappa score is the degree of agreement beyond chance. The observed agreement, as shown by the lower horizontal bar in Fig. 1, is 75%, so the degree of agreement beyond chance is 25% (75% − 50%).
Kappa is calculated as the observed agreement beyond chance (25%) divided by the maximum agreement beyond chance (50%); here, kappa is 0.50.
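The same arithmetic can be written as a short function. This is a minimal Python sketch, not code from the article; the name `kappa_from_agreement` is an illustrative assumption, and both inputs are expressed as proportions between 0 and 1.

```python
def kappa_from_agreement(observed, chance):
    """Kappa = (observed agreement - chance agreement)
             / (maximum possible agreement - chance agreement),
    where the maximum possible agreement is 1 (i.e., 100%)."""
    if not 0 <= chance < 1:
        raise ValueError("chance agreement must be at least 0 and below 1")
    return (observed - chance) / (1 - chance)


# Values from the hypothetical situation in Fig. 1.
kappa = kappa_from_agreement(observed=0.75, chance=0.50)
print(kappa)  # prints 0.5
```

The guard on `chance` reflects the fact that kappa is undefined when chance agreement is already 100%, because no agreement beyond chance is then possible.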
The bottom line
Kappa allows us to measure agreement above and beyond that expected by chance alone. We calculate kappa by estimating the chance agreement and then comparing the observed agreement beyond chance with the maximum possible agreement beyond chance.
Tip 3: Calculating chance agreement
A conceptual understanding of kappa may still leave the actual calculations a mystery. The following example is intended for those who desire a more complete understanding of the kappa statistic.
Let us assume that 2 hopeless clinicians are assessing the presence of Murphy's sign in a group of patients. They have no idea what they are doing, and their evaluations are no better than blind guesses. Let us say they are each guessing the presence and absence of Murphy's sign in a 50:50 ratio: half the time they guess that Murphy's sign is present, and the other half that it is absent. If you were completing a 2 × 2 table, with these 2 clinicians evaluating the same 100 patients, how would the cells, on average, get filled in?
Fig. 2 represents the completed 2 × 2 table. Guessing at random, the 2 hopeless clinicians have agreed on the assessments of 50% of the patients. How did we arrive at the numbers shown in the table? According to the laws of chance, each clinician guesses that half of the 50 patients assessed as positive by the other clinician (i.e., 25 patients) have Murphy's sign.
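The way the cells fill in on average can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `expected_cells` and the dictionary keys are assumptions made for the example. It multiplies each clinician's guessing proportions together, which is valid only because the guesses are assumed to be independent.

```python
def expected_cells(n_patients, prop_pos_a, prop_pos_b):
    """Average 2 x 2 cell counts when clinicians A and B guess independently."""
    prop_neg_a = 1 - prop_pos_a
    prop_neg_b = 1 - prop_pos_b
    return {
        "both_positive": n_patients * prop_pos_a * prop_pos_b,
        "a_positive_b_negative": n_patients * prop_pos_a * prop_neg_b,
        "a_negative_b_positive": n_patients * prop_neg_a * prop_pos_b,
        "both_negative": n_patients * prop_neg_a * prop_neg_b,
    }


# Two clinicians guessing 50:50 on the same 100 patients.
cells = expected_cells(100, 0.5, 0.5)
agreed = cells["both_positive"] + cells["both_negative"]
print(cells)
print(f"patients agreed on by chance: {agreed:.0f} of 100")
```

With 50:50 guessing every cell holds 25 patients, so the 2 clinicians agree on 50 of the 100 patients by chance alone.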
How would this exercise work if the same 2 hopeless clinicians were to randomly guess that 60% of the patients had a positive result for Murphy's sign? Fig. 3 provides the answer in this situation. The clinicians would agree for 52 of the 100 patients (or 52% of the time) and would disagree for 48 of the patients. In a similar way, using 2 × 2 tables for higher and higher positive proportions (i.e., how often the observer makes the diagnosis), you can figure out how often the observers will, on average, agree by chance alone (as delineated in Table 3).
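The pattern across positive proportions can also be computed directly rather than by filling in each table by hand. The sketch below is an illustration, not a reproduction of Table 3: it assumes that both observers guess "positive" with the same probability and independently of each other, and the proportions from 50% to 90% are arbitrary choices for the example.

```python
def chance_agreement(prop_positive):
    """Expected agreement when 2 observers each guess 'positive'
    independently with the same probability: both positive or both negative."""
    return prop_positive ** 2 + (1 - prop_positive) ** 2


for percent in (50, 60, 70, 80, 90):
    p = percent / 100
    print(f"{percent}% positive -> {chance_agreement(p):.0%} agreement by chance")
```

For these inputs the expected agreement by chance rises from 50% (at a 50% positive proportion) through 52% (at 60%) to 82% (at 90%). The function is symmetric, so a 10% positive proportion gives the same chance agreement as a 90% positive proportion.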
At this point, we have demonstrated 2 things. First, even if the reviewers have no idea what they are doing, there will be substantial agreement by chance alone. Second, the magnitude of the agreement by chance increases as the proportion of positive (or negative) assessments increases.
But how can we calculate kappa when the clinicians whose assessments are being compared are no longer "hopeless," in other words, when their assessments reflect a level of expertise that one might actually encounter in practice? It's not very hard.
Let's take a simple example, returning to the premise that each of the 2 clinicians assesses Murphy's sign as being present in 50% of the patients. Here, we assume that the 2 clinicians now have some knowledge of Murphy's sign and their assessments are no longer random. Each decides that 50% of the patients have Murphy's sign and 50% do not, but they still don't agree on every patient. Rather, for 40 patients they agree that Murphy's sign is present, and for 40 patients they agree that Murphy's sign is absent. Thus, they agree on the diagnosis for 80% of the patients, and they disagree for 20% of the patients (see Fig. 4A). How do we calculate the kappa score in this situation?
Recall that if each clinician found that 50% of the patients had Murphy's sign but their decision about the presence of the sign in each patient was random, the clinicians would be in agreement 50% of the time, each cell of the 2 × 2 table would have 25 patients (as shown in Fig. 2), chance agreement would be 50%, and maximum agreement beyond chance would also be 50%.
The no-longer-hopeless clinicians' agreement on 80% of the patients is therefore 30% above chance. Kappa is a comparison of the observed agreement above chance with the maximum agreement above chance: 30%/50% = 60% of the possible agreement above chance, which gives these clinicians a kappa of 0.6, as shown in Fig. 4B.
Hence, to calculate kappa when only 2 alternatives are possible (e.g., presence or absence of a finding), you need just 2 numbers: the percentage of patients that the 2 assessors agreed on and the expected agreement by chance. Both can be determined by constructing a 2 × 2 table exactly as illustrated above.
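For readers who prefer to see the whole calculation in one place, here is a minimal Python sketch that goes from the 4 cells of a 2 × 2 table to kappa. The function and parameter names are illustrative assumptions, and the cell counts are those implied by the worked example: 40 patients judged positive by both clinicians, 40 judged negative by both, and the 20 disagreements split 10 and 10 so that each clinician calls 50% of patients positive.

```python
def kappa_from_table(both_pos, a_pos_b_neg, a_neg_b_pos, both_neg):
    """Kappa for 2 observers and 2 categories, from 2 x 2 cell counts."""
    n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg

    # Observed agreement: the 2 cells on the diagonal.
    observed = (both_pos + both_neg) / n

    # Chance agreement: product of each observer's marginal proportions.
    a_pos = (both_pos + a_pos_b_neg) / n
    b_pos = (both_pos + a_neg_b_pos) / n
    chance = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)

    if chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return (observed - chance) / (1 - chance)


# Cell counts implied by the no-longer-hopeless clinicians' example.
kappa = kappa_from_table(both_pos=40, a_pos_b_neg=10, a_neg_b_pos=10, both_neg=40)
print(round(kappa, 2))
```

Here the observed agreement is 80% and the chance agreement is 50%, so the function returns a kappa of approximately 0.6, matching the hand calculation.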
The bottom line
Chance agreement is not always 50%; rather, it varies from one clinical situation to another. When the prevalence of a disease or outcome is low, 2 observers will guess that most patients are normal and the symptom of the disease is absent. This situation will lead to a high percentage of agreement simply by chance. When the prevalence is high, there will also be high apparent agreement, with most patients judged to exhibit the symptom. Kappa measures the agreement after correcting for this variable degree of chance agreement.
Conclusions
Armed with this understanding of kappa as a measure of agreement between different observers, you are able to return to the study of agreement in chest radiography interpretations between emergency physicians and radiologists1 in a more informed fashion. You learn from the abstract that the kappa score for overall agreement between the 2 classes of practitioners was 0.40, with a 95% confidence interval ranging from 0.35 to 0.46. This means that the agreement between emergency physicians and radiologists represented 40% of the potentially achievable agreement beyond chance. You understand that this kappa score would be conventionally considered to represent fair to moderate agreement but is inferior to many of the kappa values listed in Table 2. You are now much more confident about going to the full text of the article to review the methods and assess the clinical applicability of the results to your own patients.
The ability to understand measures of variability in data presented in clinical trials and systematic reviews is an important skill for clinicians. We have presented a series of tips developed and used by experienced teachers of evidence-based medicine for the purpose of facilitating such understanding.
Footnotes
This article has been peer reviewed.
Contributors: Thomas McGinn developed the original idea for tips 1 and 2 and, as principal author, oversaw and contributed to the writing of the manuscript. Thomas Newman and Rosanne Leipzig reviewed the manuscript at all phases of development and contributed to the writing as coauthors. Sheri Keitz used all of the tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are described in the article. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence with format specifications. Gordon Guyatt developed the original idea for tip 3, reviewed the manuscript at all phases of development, contributed to the writing as a coauthor, and, as general editor, reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content.
Competing interests: None declared.
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 914 738-9368; pwyer{at}att.net
References