Inter-rater reliability - Fleiss' kappa with uneven rater response distribution


mathglot
04-06-2009, 01:45 PM
I'm using Fleiss' kappa for the first time (via R's irr (http://cran.r-project.org/web/packages/irr/irr.pdf) package) and wanted to make sure that it's valid for our experiment. Does Kappa require that the same raters provide the judgments across queries? I assume not.

We're using Amazon's Mechanical Turk (http://www.mturk.com/) to solicit up to 5 judgments (MaxAssignments=5) on 1000 queries with 3 multiple-choice categories (yes/no/I-don't-know) for a maximum of 5000 judgments.

Typically about 50 unique workers from the pool of about 500 pre-qualified workers actually log in and contribute to an experiment. But the distribution of contributions is unequal, with workers providing anywhere from 2 or 3 judgments all the way up to 300 or more.

Can we assume that Kappa (irr: kappam.fleiss) is valid for this experiment? Or to take a worst case: suppose we had 5000 unique workers each contribute exactly one judgment--is kappa still valid? I assume so, but I don't know.
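To make the question concrete, here is a minimal Python sketch of the textbook Fleiss' kappa computation. It is not the irr implementation; the function name `fleiss_kappa` and the toy data are hypothetical, and it assumes every item has the same number of ratings. Note that the inputs are only the labels each item received: nothing in the calculation records which worker gave which label.

```python
from collections import Counter


def fleiss_kappa(ratings_per_item, categories):
    """Fleiss' kappa from per-item label lists (hypothetical helper).

    ratings_per_item: one list of labels per item, all of equal length n >= 2.
    categories: every label that may occur.
    """
    n_items = len(ratings_per_item)
    n = len(ratings_per_item[0])
    if n < 2 or any(len(r) != n for r in ratings_per_item):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    if any(set(r) - set(categories) for r in ratings_per_item):
        raise ValueError("found a label that is not in `categories`")

    # Per-item category counts; rater identity is never used.
    counts = [Counter(r) for r in ratings_per_item]

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n) / (n * (n - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n) for cat in categories]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data: two items, two ratings each.
toy = [["yes", "yes"], ["yes", "no"]]
print(round(fleiss_kappa(toy, ["yes", "no"]), 3))  # -0.333
```

Reordering the labels within an item leaves the result unchanged, which is one way to see that the statistic depends only on the per-item counts.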

Finally, how should we handle our "I-don't-know" category: throw those judgments out and calculate based on number of categories=2, or include the judgments and base it on category count=3? I assume the former, since "I-don't-know" isn't really a classification.
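One practical consequence of the first option can be checked before choosing: the textbook Fleiss formula uses a single number of ratings per item, and discarding "I-don't-know" judgments may leave items with different numbers of ratings. The Python sketch below only tallies how many ratings each item would keep; the function name and the toy data are hypothetical.

```python
from collections import Counter


def ratings_left_after_drop(ratings_per_item, drop_label="I-don't-know"):
    """Map 'ratings remaining per item' -> 'number of items' (hypothetical helper)."""
    return Counter(
        sum(1 for label in ratings if label != drop_label)
        for ratings in ratings_per_item
    )


# Toy data: three items with three judgments each.
toy = [
    ["yes", "no", "I-don't-know"],
    ["yes", "yes", "yes"],
    ["I-don't-know", "I-don't-know", "no"],
]
for remaining, n_items in sorted(ratings_left_after_drop(toy).items()):
    print(remaining, "ratings left on", n_items, "item(s)")
```

If the tally shows more than one distinct value, the two-category calculation would need some further decision about items with fewer ratings, whereas keeping all three categories preserves the original count per item.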