# Help with inter-rater reliability

#### Blank

##### New Member
Hi all, I've got a set of data that I need some advice on. I'm working on a pilot project where a group of four people evaluates essays. The way I'm working it is that each essay is read by two of the four people on the assessment team. Each of these individuals rates five different aspects of the essay on a five-point scale, giving each essay a possible score of 5-25 points. My question is about calculating inter-rater reliability. What calculation would be best for this assessment? I'm running into trouble since I'm looking for reliability within a group of four when only two randomly assigned people are reading each essay. Do I need to calculate a separate reliability for each possible pair of readers, or is there a measure I can use for the entire group?

Any advice you can provide is greatly appreciated.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
How many essays?

#### Blank

##### New Member
Not many since this is just a pilot to get the university started. There will be 50 essays with two readers each. Believe me, I know there are problems with such a low number, but the administration wants something soon to show that we're doing assessment. Once that goes through, I'll actually have money to put together a more appropriate study.

But anyway, I'm capable of running the numbers myself, I'm just not sure what formula to use for this particular data set.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Sorry for all of the questions.

So there are 4 essay reviewers and 50 essays. Each essay was reviewed by two reviewers, assigned, I am assuming, by convenience rather than at random.

If only two people reviewed each unique essay, and these combinations were always changing, it almost seems like you may need to calculate a kappa statistic for each combination of two reviewers. Did a reviewer examine all 5 portions of the essay, or did the components get split up among reviewers as well?

I will acknowledge that I have not run into this exact scenario before. Can you not get all four to review some of the same essays, or is that too labor intensive? Let's see what others propose; a broader literature review may also be beneficial. You MAY be able to run a generalized kappa for all of the reviewers across all essay sections combined, but that would not really help describe where the differences are. Depends on your aims.
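To make the pairwise idea concrete, here is a minimal sketch of an unweighted Cohen's kappa computed from scratch for one pair of reviewers. All ratings below are hypothetical, and in practice you would run this per facet, since plain kappa treats the 1-5 points as nominal categories:

```python
# Unweighted Cohen's kappa for two raters who scored the same items.
# Hypothetical data: thesis-facet scores for the essays one pair both read.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items with identical ratings.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters were independent.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

a = [3, 4, 2, 5, 3, 4, 3, 2]
b = [3, 4, 3, 5, 3, 3, 3, 2]
print(round(cohens_kappa(a, b), 3))  # prints 0.636
```

A weighted kappa (linear or quadratic weights) would be the more usual choice for ordinal 1-5 ratings, since it gives partial credit for near-misses; the unweighted version above is just the simplest starting point.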

#### Blank

##### New Member
No worries. Thanks for the help.

Yes, four reviewers, 50 essays. The essays weren't split up by components; the five scores were just different facets of the same essay. So one reader would pick up an essay (yes, distributed by convenience rather than any specific method), evaluate five aspects of it (thesis, support, structure, mechanics, conclusion), and then move on to the next.

We had someone collecting the data as it came in, and any essay that showed more than a one-point difference on any facet was then read and scored by a third reader. So most of the essays had two readers, but a few had three, which makes it problematic to calculate reliability between pairs. Also, if I went that route, wouldn't I have to break it down by each of the five facets as well? I should note that the third reader was just meant to improve the overall reliability of the essay rating; I'm not using a tertium quid approach to throw out the discrepant score.

So yeah, I'm still not sure where to go with this.
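For what it's worth, the triage rule described above (send the essay to a third reader if any facet differs by more than one point) can be sketched like this; the scores are invented:

```python
# Sketch of the third-reader adjudication rule, with made-up scores.
FACETS = ["thesis", "support", "structure", "mechanics", "conclusion"]

def needs_third_reader(scores_a, scores_b):
    """scores_*: dicts mapping facet name -> 1-5 rating from one reader."""
    return any(abs(scores_a[f] - scores_b[f]) > 1 for f in FACETS)

a = {"thesis": 4, "support": 3, "structure": 4, "mechanics": 5, "conclusion": 3}
b = {"thesis": 4, "support": 5, "structure": 4, "mechanics": 4, "conclusion": 3}
print(needs_third_reader(a, b))  # prints True: support differs by 2 points
```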

#### Blank

##### New Member
Got a follow up here. I've found a source that says that I can use this formula for averaged ratings:

((Between persons mean square) - (Within persons mean square)) / (Between persons mean square)

Unfortunately, it doesn't give any more explanations than that. Do these terms mean anything to anyone?

#### spunky

##### Can't make spagetti
yup... it looks a lot like one of the many formulas for the many intraclass correlation coefficients out there..

#### Blank

##### New Member
Yep, that's where I found it. So, any idea what "between persons mean square" and "within persons mean square" are?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Can you post where you got the formula? URL or link.

#### Blank

##### New Member
Er... not really. It's from a book. The article is "Reliability Issues in Holistic Assessment" by Roger Cherry and Paul Meyer, from the book Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations, edited by M. Williamson and B. Huot. The Google Books information is here:

But there's no available ebook to peruse. At this point, I'm considering just going with this simple calculator http://www.med-ed-online.org/rating/reliability.html and calling it done. The admins who'll be looking at this stuff won't really know the difference, and that'll give me more time to see what other people have done in preparation for continuing the study next year.

#### Berley

##### Member
Not commenting on the validity of the formula, but just to explain it. (And with the caveat that I haven't finished my coffee yet this morning...) I believe it means:

(The mean score for all 4 evaluators reading all 50 essays)squared - (the mean score of all a single evaluator's scores for all the essays they read)squared / the first one

So take the mean score for all 100 evaluations (pretend it's 15), square it (to 225), then subtract the mean score for all 25 of evaluator #1's evaluations (pretend it's 9) squared (to make it 81) and divide it by the overall mean score squared (225 again) for a grand total of 0.64. That tells you Evaluator #1's score.

Now if you did the same thing for Evaluator #2 and came up with the same number (then presumably) you could assume that Evaluator #1 and Evaluator #2 were similar evaluators even if they didn't read the same essays.

But if Evaluator #2's mean score was 20 (squared to 400), then you come up with -0.7778. So you could conclude that Evaluator #2 doesn't see essays the same way that Evaluator #1 does. And you can say that even if you don't have essays in common by which you can directly compare scores.

I think that's right. Yes? Someone who is awake?

#### spunky

##### Can't make spagetti
> (The mean score for all 4 evaluators reading all 50 essays)squared - (the mean score of all a single evaluator's scores for all the essays they read)squared / the first one

uhmm... this does not seem quite right. look at how to calculate the mean squares from the standard ANOVA formulas; most traditional ICCs (intraclass correlations) are obtained from the mean squares you'd calculate when doing an ANOVA

#### Berley

##### Member
> uhmm... this does not seem quite right. look at how to calculate the mean squares from the standard ANOVA formulas; most traditional ICCs (intraclass correlations) are obtained from the mean squares you'd calculate when doing an ANOVA
So you think maybe it means to calculate the sample variance for each instead? I could buy that, and it certainly seems like it would be more accurate. I just didn't take that away from what was presented.

#### spunky

##### Can't make spagetti
> So you think maybe it means to calculate the sample variance for each instead? I could buy that, and it certainly seems like it would be more accurate. I just didn't take that away from what was presented.
something like that... i mean, not exactly, but you got the idea that some sort of variance is at play here. i guess i'm just so used to working with these things that the minute i saw a ratio of mean squares in the context of inter-rater reliability i immediately thought "oh... the OP must be looking at a formula for the intraclass correlation coefficient". the thing is, there are many intraclass correlation coefficients (i think there are 3 or 4 out there), so it really depends on which one the OP is looking at
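In case it helps close the loop: the averaged-ratings formula quoted earlier, (between-persons MS - within-persons MS) / (between-persons MS), looks like the one-way, average-measures ICC, with "persons" here meaning the rated essays as the groups of a one-way ANOVA. A minimal sketch with invented scores:

```python
# One-way ANOVA mean squares and the averaged-ratings ICC,
# (MS_between - MS_within) / MS_between. Data are hypothetical.
def icc_average(ratings):
    """ratings: list of per-essay rating lists (e.g., 2 readers each, balanced)."""
    n = len(ratings)                 # number of essays ("persons")
    k = len(ratings[0])              # raters per essay
    grand = sum(sum(r) for r in ratings) / (n * k)
    essay_means = [sum(r) / k for r in ratings]
    # Between-essays mean square: spread of essay means around the grand mean.
    ms_between = k * sum((m - grand) ** 2 for m in essay_means) / (n - 1)
    # Within-essays mean square: disagreement among raters on the same essay.
    ms_within = sum((x - m) ** 2
                    for r, m in zip(ratings, essay_means)
                    for x in r) / (n * (k - 1))
    return (ms_between - ms_within) / ms_between

# Hypothetical total scores (5-25) from the two readers of each essay:
scores = [[18, 19], [12, 14], [22, 21], [15, 15], [9, 11], [20, 18]]
print(round(icc_average(scores), 3))  # prints 0.968
```

Because the formula is one-way (rater identity never enters the model), it doesn't care which two of the four readers scored a given essay, which seems to be exactly the OP's situation; the trade-off is that it can't separate rater bias from random error the way the two-way ICC variants do.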