Inter-annotator agreement (inter-rater reliability): the degree of agreement among independent observers who rate, code, or assess the same phenomenon.
Cohen’s Kappa
takes into account the possibility of the agreement occurring by chance, correcting observed agreement for expected chance agreement (formula below)
- P(a) = observed agreement, the proportion of times the judges agreed
- P(e) = expected agreement, the proportion of times the judges would be expected to agree by chance
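Combining the two,

κ = (P(a) - P(e)) / (1 - P(e))

where κ = 1 means perfect agreement and κ = 0 means agreement no better than chance.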
Example
For two annotators (A1, A2) and a binary yes/no classification problem:
- P(a) = P(A1=yes, A2=yes) + P(A1=no, A2=no)
- P(e) = P(A1=yes) * P(A2=yes) + P(A1=no) * P(A2=no)
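A minimal Python sketch of this computation, using hypothetical labels for the two annotators (the function itself works for any set of categories, not just yes/no):

```python
from collections import Counter

def cohens_kappa(labels1, labels2):
    """Cohen's kappa for two annotators who labelled the same items."""
    assert len(labels1) == len(labels2)
    n = len(labels1)

    # P(a): observed agreement, proportion of items on which the judges agree
    p_a = sum(a == b for a, b in zip(labels1, labels2)) / n

    # P(e): expected chance agreement from each annotator's own label distribution
    c1, c2 = Counter(labels1), Counter(labels2)
    categories = set(labels1) | set(labels2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in categories)

    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two annotators labelling 10 items
a1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
a2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no"]
print(cohens_kappa(a1, a2))  # P(a) = 0.7, P(e) = 0.5, kappa = 0.4
```

In practice, scikit-learn's sklearn.metrics.cohen_kappa_score computes the same statistic directly from two label sequences.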
Interpretation
- κ < 0.6: unacceptable
- 0.6 ≤ κ < 0.8: substantial agreement, tentative conclusions only
- κ ≥ 0.8: definite conclusions
Scott’s Pi
improves on simple observed agreement by factoring in the extent of agreement that might be expected by chance; unlike Cohen's kappa, the expected agreement P(e) is computed from the pooled label distribution of both annotators rather than from each annotator's individual distribution
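The only practical difference from the Cohen's kappa sketch above is how P(e) is computed; a sketch under the same assumptions (hypothetical labels, any category set):

```python
def scotts_pi(labels1, labels2):
    """Scott's pi: like Cohen's kappa, but P(e) uses the pooled label distribution."""
    assert len(labels1) == len(labels2)
    n = len(labels1)

    # P(a): observed agreement, same as for Cohen's kappa
    p_a = sum(a == b for a, b in zip(labels1, labels2)) / n

    # P(e): pool both annotators' labels, square each category's overall proportion
    pooled = list(labels1) + list(labels2)
    p_e = sum((pooled.count(c) / (2 * n)) ** 2 for c in set(pooled))

    return (p_a - p_e) / (1 - p_e)

# scotts_pi(a1, a2) with the lists from the Cohen's kappa sketch gives ~0.39
# (vs. 0.40 for Cohen's kappa), since the pooled P(e) is 0.505 rather than 0.5.
```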
Fleiss’ kappa
Scott’s pi extended to more than two annotators
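A minimal sketch, assuming each item's ratings are supplied as a dictionary of category counts (a hypothetical input format; the categories and the number of annotators per item are inferred from the data):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each labelled by the same number of annotators.

    ratings: list of dicts, one per item, mapping category -> number of
    annotators who chose that category.
    """
    N = len(ratings)                 # number of items
    n = sum(ratings[0].values())     # annotators per item
    categories = {c for item in ratings for c in item}

    # Observed agreement: average pairwise agreement per item
    p_bar = sum(
        (sum(count ** 2 for count in item.values()) - n) / (n * (n - 1))
        for item in ratings
    ) / N

    # Expected agreement from the overall (pooled) category proportions
    p_e = sum(
        (sum(item.get(c, 0) for item in ratings) / (N * n)) ** 2
        for c in categories
    )

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 items, 3 annotators each, yes/no labels
items = [
    {"yes": 3},
    {"yes": 2, "no": 1},
    {"no": 3},
    {"yes": 1, "no": 2},
]
print(fleiss_kappa(items))  # 0.333...
```

With exactly two annotators this reduces to Scott's pi above.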