Any advice you can provide is greatly appreciated.


But anyway, I'm capable of running the numbers myself, I'm just not sure what formula to use for this particular data set.

So there are 4 essay reviewers and 50 essays. Each essay was reviewed by two reviewers, assigned (I'm assuming) by convenience rather than by any random assignment.

If only two people reviewed each unique essay, and those pairings were always changing, it almost seems like you may need to calculate a kappa statistic for each combination of two reviewers. Did a reviewer examine all 5 portions of the essay, or did the components get split up among reviewers as well?
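For what it's worth, the per-pair agreement is easy to compute by hand. Here's a minimal from-scratch sketch of Cohen's kappa for a single reviewer pairing; the rating lists in the example are made-up placeholders, not your data:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal proportions.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)
```

For example, with `ratings_a = [1, 1, 2, 2]` and `ratings_b = [1, 2, 2, 2]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa comes out to 0.5.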

I will acknowledge that I have not run into this exact scenario before. Can you not get all four to review some of the same essays, or is that too labor intensive? Let's see what others propose; a broader literature review may be beneficial. You MAY also be able to run a generalized kappa for all of the reviewers across all essay sections combined, but that would not really help describe where the differences are. Depends on your aims.
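On the "generalized kappa" idea: Fleiss' kappa doesn't require the same two raters on every item, only the same number of ratings per item, so it would at least run on a rotating-pairs design. A hedged sketch (the function name and the toy count matrix below are mine, not from the thread):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for multiple raters over n items.

    counts: n x c matrix; counts[i][j] = number of raters who put
    item i in category j. Assumes every item received the same
    number of ratings (m per row)."""
    counts = np.asarray(counts, dtype=float)
    n, c = counts.shape
    m = counts.sum(axis=1)[0]                 # ratings per item
    p_j = counts.sum(axis=0) / (n * m)        # overall category proportions
    # Per-item observed agreement among the m raters.
    p_i = ((counts ** 2).sum(axis=1) - m) / (m * (m - 1))
    p_bar = p_i.mean()
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)
```

With two ratings per essay and two categories, a matrix like `[[2, 0], [0, 2], [2, 0]]` (both raters agree on every essay) gives kappa = 1.0.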

Yes, four reviewers, 50 essays. The essays weren't split up by components; each reader scored every facet. One reader would pick up an essay (yes, distributed by convenience rather than any specific method), evaluate five aspects of it (thesis, support, structure, mechanics, conclusion), and then move on to the next.

We had someone collecting the data as it came in, and any essay that showed more than a one-point difference on any axis was then read and scored by a third reader. So most of the essays had two readers but a few had three, making it problematic to calculate reliability between pairs. Also, if I went that route, wouldn't I have to break it down by each of the five facets as well? I should note that the third reader was just meant to improve the overall reliability of the essay rating; I'm not using tertium quid to throw out the divergent score.

So yeah, I'm still not sure where to go with this.

http://books.google.com/books/about/Validating_holistic_scoring_for_writing.html?id=ozgmAQAAIAAJ

But there's no available ebook to peruse. At this point, I'm considering just going with this simple calculator http://www.med-ed-online.org/rating/reliability.html and calling it done. The admins who'll be looking at this stuff won't really know the difference, and that'll give me more time to see what other people have done in preparation for continuing the study next year.

[(the mean score for all 4 evaluators reading all 50 essays)² − (the mean of a single evaluator's scores for all the essays they read)²] / (the overall mean score)²

So take the mean score for all 100 evaluations (pretend it's 15), square it (to 225), then subtract the squared mean of all 25 of Evaluator #1's evaluations (pretend the mean is 9, squared to 81), and divide the difference by the overall mean score squared (225 again), for a grand total of (225 − 81) / 225 = 0.64. That's Evaluator #1's score.

Now if you did the same thing for Evaluator #2 and came up with the same number, then presumably you could assume that Evaluator #1 and Evaluator #2 were similar evaluators, even if they didn't read the same essays.

But if Evaluator #2's mean score was 20 (squared to 400), then you come up with (225 − 400) / 225 = −0.7778. So you could conclude that Evaluator #2 doesn't see essays the same way that Evaluator #1 does. And you can say that even if you don't have essays in common by which you can directly compare scores.
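If it helps to sanity-check the arithmetic, the proposed ratio (however ad hoc) is one line of code; the means 15, 9, and 20 are the pretend numbers from above, not real data:

```python
def proposed_index(overall_mean, evaluator_mean):
    # The thread's proposed ratio: (M_all^2 - M_i^2) / M_all^2.
    return (overall_mean ** 2 - evaluator_mean ** 2) / overall_mean ** 2

print(proposed_index(15, 9))   # 0.64
print(proposed_index(15, 20))  # approx -0.7778
```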

I think that's right. Yes? Someone who is awake?

[(the mean score for all 4 evaluators reading all 50 essays)² − (the mean of a single evaluator's scores for all the essays they read)²] / (the overall mean score)²

uhmm... this does not seem quite right. Look at how the mean squares are calculated in any standard ANOVA formulas; most traditional ICCs (intraclass correlations) are obtained from the mean squares you'd calculate when doing an ANOVA.
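To make the ANOVA connection concrete, here's a hedged sketch of the simplest variant, ICC(1,1) (one-way random effects), built directly from the between-essay and within-essay mean squares. The function name is mine, and it assumes every essay has the same number of ratings, which the third-reader essays in this study would violate without some reshaping:

```python
import numpy as np

def icc_oneway(scores):
    """ICC(1,1): one-way random-effects intraclass correlation.

    scores: 2-D array, rows = targets (essays), columns = ratings.
    Assumes every target has the same number of ratings k."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    # Mean squares straight from the one-way ANOVA table.
    ms_between = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_within = ((scores - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

As a quick check, perfectly agreeing raters (e.g. `[[1, 1], [2, 2], [3, 3]]`) give an ICC of 1.0, since the within-essay mean square is zero.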

I'm not sure why it wasn't showing up when I searched for the terms specifically, but adding in the ANOVA led me in the right direction.

Thanks for all of the help.

So you think maybe it means to calculate the sample variance for each instead? I could buy that, and it certainly seems like it would be more accurate. I just didn't take that away from what was presented.