Hi all, I've got a set of data that I need some advice on. I'm working on a pilot project where a group of four people evaluate essays. The way I'm working it is that each essay is read by two of the four people on the assessment team. Each of these individuals rates five different aspects of the essay on a five-point scale, giving each essay a possible score of 5-25 points. My question is about calculating inter-rater reliability. What calculation would be best for this assessment? I'm running into trouble since I'm looking for reliability within a group of four when only two randomly assigned people are reading each essay. Do I need to calculate a separate reliability for each possible pair of readers, or is there a measure I can use for the entire group?
Any advice you can provide is greatly appreciated.
How many essays?
Blank (09-07-2012)
Not many since this is just a pilot to get the university started. There will be 50 essays with two readers each. Believe me, I know there are problems with such a low number, but the administration wants something soon to show that we're doing assessment. Once that goes through, I'll actually have money to put together a more appropriate study.
But anyway, I'm capable of running the numbers myself, I'm just not sure what formula to use for this particular data set.
Sorry for all of the questions.
So there are 4 essay reviewers and 50 essays. Each essay was reviewed by two reviewers; I'm assuming assignment was by convenience rather than random.
If only two people reviewed each unique essay, and these combinations were always changing, it almost seems like you may need to calculate a kappa statistic for each combination of two reviewers. Did each reviewer examine all 5 portions of the essay, or did the components get split up among reviewers as well?
I'll acknowledge that I have not run into this exact scenario before. Can you not get all four to review some of the same essays, or is that too labor intensive? Let's see what others propose; a broader literature review may be beneficial. You MAY also be able to run a generalized kappa for all of the reviewers for all essay sections combined, but that would not really help describe where the differences are. Depends on your aims.
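If you do go pair-by-pair, Cohen's kappa is easy enough to compute by hand. A minimal sketch (the ratings below are made up for illustration, not the OP's data, and each rating is treated as an unordered category; a weighted kappa would respect the 1-5 ordering better):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected matches given each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings on one facet for the essays a pair read in common.
a = [1, 2, 3, 3, 2, 1, 1, 2]
b = [1, 2, 3, 2, 2, 1, 2, 2]
print(cohens_kappa(a, b))
```

With four reviewers you would run this for each of the six possible pairs, restricted to the essays that pair actually shared.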
No worries. Thanks for the help.
Yes, four reviewers, 50 essays. The essays weren't split up by components, they were just different facets of the essay. So one reader would pick up an essay (yes, distributed by convenience rather than any specific method), evaluate five aspects of the essay (thesis, support, structure, mechanics, conclusion) and then move on to the next. We had someone collecting the data as it came in, and any essay that displayed more than a one point difference on any axis was then read and scored by a third reader. So most of the essays had two readers but a few had three, making it problematic to calculate reliability between pairs. Also, if I went that route, wouldn't I have to cut it down to each of the five facets, as well? I should note that the third reader was just meant to improve upon the overall reliability of the essay rating. I'm not using tertium quid to throw out the different score.
So yeah, I'm still not sure where to go with this.
Got a follow up here. I've found a source that says that I can use this formula for averaged ratings:
((Between persons mean square) - (Within persons mean square)) / (Between persons mean square)
Unfortunately, it doesn't give any more explanations than that. Do these terms mean anything to anyone?
yup... it looks a lot like one of the many formulas for the many intraclass correlation coefficients out there..
for all your psychometric needs! https://psychometroscar.wordpress.com/about/
Yep, that's where I found it. So, any idea what "within persons mean square" and "between persons mean square" are?
Can you post where you got the formula? URL or link.
Er.. not really. It's from a book. The article is "Reliability Issues in Holistic Assessment" by Roger Cherry and Paul Meyer from the book Validating Holistic Assessment Scoring for Writing Assessment: Theoretical and Empirical Foundations edited by M Williamson and B Huot. The Google Books information is here:
http://books.google.com/books/about/...d=ozgmAQAAIAAJ
But there's no available ebook to peruse. At this point, I'm considering just going with this simple calculator http://www.med-ed-online.org/rating/reliability.html and calling it done. The admins who'll be looking at this stuff won't really know the difference, and that'll give me more time to see what other people have done in preparation for continuing the study next year.
Not commenting on the validity of the formula, but just to explain it. (And with the caveat that I haven't finished my coffee yet this morning...) I believe it means:
((The mean score for all 4 evaluators reading all 50 essays) squared - (the mean score of a single evaluator's scores for all the essays they read) squared) / (the first term)
So take the mean score for all 100 evaluations (pretend it's 15), square it (to 225), then subtract the mean score for all 25 of evaluator #1's evaluations (pretend it's 9) squared (to make it 81) and divide it by the overall mean score squared (225 again) for a grand total of 0.64. That tells you Evaluator #1's score.
Now if you did the same thing for Evaluator #2 and came up with the same number (then presumably) you could assume that Evaluator #1 and Evaluator #2 were similar evaluators even if they didn't read the same essays.
But if Evaluator #2's mean score was 20 (squared to 400), then you come up with -0.7778. So you could conclude that Evaluator #2 doesn't see essays the same way that Evaluator #1 does. And you can say that even if you don't have essays in common by which you can directly compare scores.
I think that's right. Yes? Someone who is awake?
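In code, the walkthrough above is just this (mechanizing the arithmetic as described, not vouching for it as the actual ICC formula):

```python
overall_mean = 15   # pretend mean of all 100 evaluations
e1_mean = 9         # pretend mean of Evaluator #1's 25 evaluations
e2_mean = 20        # pretend mean of Evaluator #2's 25 evaluations

# (overall mean squared - evaluator mean squared) / overall mean squared
score_1 = (overall_mean**2 - e1_mean**2) / overall_mean**2   # 0.64
score_2 = (overall_mean**2 - e2_mean**2) / overall_mean**2   # about -0.778
print(score_1, score_2)
```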
Ah, I get it now. I found this site that explains what those terms actually mean: http://people.richland.edu/james/lec.../ch13-1wy.html
I'm not sure why it wasn't showing up when I searched for the terms specifically, but adding "ANOVA" to the search led me in the right direction.
Thanks for all of the help.
something like that... i mean not exactly, but you got the idea that some sort of variance is at play here. i guess i'm just so used to working with these things that the minute i saw a ratio of mean squares in the context of inter-rater reliability i immediately thought "oh... the OP must be looking at a formula for the intraclass correlation coefficient". the thing is there are many intraclass correlation coefficients (i think there are 3 or 4 out there) so it really depends on which one the OP is looking at
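to make the mean squares concrete, here's a minimal sketch on simulated data (50 essays with 2 scores each, numbers invented, not the OP's): "persons" are the essays, the between-persons and within-persons mean squares come from a one-way ANOVA, and the quoted formula gives the averaged-ratings ICC:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: 50 essays ("persons"), each scored by k = 2 readers.
# Rows are essays; columns are the two total scores an essay received.
true_quality = rng.normal(15, 3, size=(50, 1))
scores = true_quality + rng.normal(0, 2, size=(50, 2))

n, k = scores.shape
grand_mean = scores.mean()
essay_means = scores.mean(axis=1, keepdims=True)

# Between-persons mean square: variation of essay means around the grand mean.
bms = k * ((essay_means - grand_mean) ** 2).sum() / (n - 1)
# Within-persons mean square: variation of scores around their own essay's mean.
wms = ((scores - essay_means) ** 2).sum() / (n * (k - 1))

# The averaged-ratings formula from the book: (BMS - WMS) / BMS.
icc_avg = (bms - wms) / bms
print(icc_avg)
```

with simulated rater noise smaller than the true spread between essays, this lands well above zero; with pure noise it would hover near zero or go negative.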