Cohen’s kappa

Jacob Cohen proposed kappa as a way of describing how good agreement is between two raters using some categorical rating (Cohen, 1960). A kappa of zero indicates purely random, chance-level agreement between the raters; perfect agreement across a sample of things rated (therapist styles, therapy “ruptures”, whatever is rated) gives a kappa of 1. Where agreement is worse than chance, kappa takes values below zero.
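To make that concrete, here is a minimal sketch of the kappa calculation in Python (the original paper is about the statistic, not any particular software; the function name and the example ratings here are mine). Kappa is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater’s marginal category frequencies.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters using nominal categories.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance
    from each rater's marginal category frequencies.
    """
    n = len(ratings_a)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of the
    # two raters' marginal proportions for that category.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Two raters classifying ten items as "yes"/"no" (invented data).
a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]
print(round(cohens_kappa(a, b), 3))  # 8/10 raw agreement gives kappa 0.583
```

Note that the raw agreement rate here is 0.8 but kappa, correcting for chance, is noticeably lower.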

Details #

The reason for agreement indices like kappa is that simple agreement rates can look very good, even with purely chance agreement, if one rating category is very common and both raters use it often, even though they have little or no agreement about which particular items belong in it. See my blog post: Why kappa? or How simple agreement rates are deceptive, which has pointers to other information, including to my own Rblog post that goes into more of the technicalities using R.
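A small self-contained sketch of that deception (the scenario and data are invented for illustration): two raters each flag 10 of 100 interviews as containing a “rupture”, but never the same ones, so they actually have zero agreement on the rare category. The raw agreement rate still looks high.

```python
from collections import Counter

# 100 interviews; each rater flags 10 as containing a "rupture",
# but never the same ones -- their agreement on that category is nil.
a = ["rupture"] * 10 + ["none"] * 90
b = ["none"] * 10 + ["rupture"] * 10 + ["none"] * 80

n = len(a)
# Raw agreement rate: looks impressive because "none" dominates.
p_o = sum(x == y for x, y in zip(a, b)) / n
# Chance agreement from the marginal frequencies.
fa, fb = Counter(a), Counter(b)
p_e = sum(fa[c] * fb.get(c, 0) for c in fa) / n**2
kappa = (p_o - p_e) / (1 - p_e)
print(f"raw agreement {p_o:.2f}, chance agreement {p_e:.2f}, kappa {kappa:.2f}")
# raw agreement 0.80, chance agreement 0.82, kappa -0.11
```

An 80% raw agreement rate hides agreement that is actually slightly worse than chance: kappa exposes this.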

Most illustrations, including mine above, look at binary, yes/no ratings as those are easiest to illustrate, but kappa can handle any number of categories, and there are extensions that weight the seriousness of disagreements where there are multiple categories. For example, in AAI attachment coding a higher weight might be attached to one rater rating an interview “autonomous” while the other rates it “dismissing” than to a disagreement in which the first rater classifies an interview as “Unresolved/Disorganized” and the second classifies it as “dismissing” (as there is an overarching distinction in AAI categories between “autonomous” and the other, non-autonomous categories).
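A sketch of weighted kappa in that spirit, using Cohen’s disagreement-weight form: kappa_w = 1 − (observed weighted disagreement / chance-expected weighted disagreement). The category labels, weight values, and ratings below are invented for illustration, not actual AAI coding weights.

```python
from collections import Counter

# Hypothetical disagreement weights for three AAI-style categories:
# confusing "autonomous" with either non-autonomous category is
# penalised more heavily than confusing the two non-autonomous ones.
cats = ["autonomous", "dismissing", "unresolved"]
weights = {
    ("autonomous", "dismissing"): 1.0,
    ("autonomous", "unresolved"): 1.0,
    ("dismissing", "unresolved"): 0.5,
}

def w(c1, c2):
    # Symmetric lookup; agreement (c1 == c2) carries zero weight.
    return weights.get((c1, c2), weights.get((c2, c1), 0.0))

def weighted_kappa(ra, rb):
    """Weighted kappa: 1 - (observed weighted disagreement /
    chance-expected weighted disagreement)."""
    n = len(ra)
    obs = sum(w(x, y) for x, y in zip(ra, rb)) / n
    fa, fb = Counter(ra), Counter(rb)
    exp = sum(fa.get(c1, 0) * fb.get(c2, 0) * w(c1, c2)
              for c1 in cats for c2 in cats) / n**2
    return 1 - obs / exp

a = ["autonomous", "dismissing", "unresolved",
     "autonomous", "dismissing", "autonomous"]
b = ["autonomous", "unresolved", "dismissing",
     "autonomous", "dismissing", "dismissing"]
print(round(weighted_kappa(a, b), 3))
```

With these weights, the two dismissing/unresolved mix-ups cost half as much as the one autonomous/dismissing disagreement, so weighted kappa comes out higher than an unweighted kappa on the same ratings would.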

There are also extensions of kappa for situations with more than two raters, and there are alternative agreement indices with claims to some advantages over kappa. However, kappa, weighted or not, for two raters or more, for a binary or a multiple-option classification, remains the agreement index you are most likely to meet in our field.

Try also #

Inter-rater reliability
Weighted kappa
Interview measures
Rating scales
Nominal/category scaling
Ordinal scaling
Stevens’ levels of measurement

Chapters #

Chapter 3

Online applications #

None currently.

Reference #

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

History #

Created 26/1/22.
