Hello,

I'm tasked with analyzing how reliable raters are in rating a single piece of work along multiple dimensions. I'm trying to answer questions like:

1) Which rubric dimensions (some or all) have unacceptable variance across the raters? How do I define the cutoff for unacceptable variance? It's important to note that while 35 raters rated a single piece of work in this norming exercise, when these raters go on to rate a large number of pieces of work, each piece will probably be rated by only three to five raters.

2) Which raters were outliers (in terms of overall deviation from the mean and in terms of positive bias or negative bias) and how do I determine the cutoff that defines outliers?

In an ideal world, each rater's scores would be exactly the same for each dimension of the rubric.

I've done a lot of research into inter-rater reliability (that's why I know what it's called!) but am confused about which tests to perform here (and how).

Do I use Krippendorff’s Alpha? If so, how?

Do I use raters' average scores? I can, of course, get mean ratings, but then which ANOVA do I use?

Do I compute the Pearson correlation coefficient (r) between every pair of raters and then average those inter-rater correlations?

Do I use none of the above?

I have access only to Excel.

The data looks like this, with A, B, and C being the rubric dimensions and R1..R35 being the raters. I've also uploaded the same data as an Excel document.

Rater  A  B  C
R1     4  3  3
R2     2  2  1
R3     2  2  1
R4     2  2  1
R5     2  2  1
R6     2  2  1
R7     2  2  3
R8     2  2  2
R9     3  2  2
R10    2  2  2
R11    2  2  1
R12    2  2  2
R13    2  2  2
R14    4  3  3
R15    3  3  4
R16    2  2  3
R17    2  2  1
R18    3  3  1
R19    3  3  3
R20    2  3  1
R21    1  1  1
R22    1  1  1
R23    2  2  2
R24    2  2  1
R25    2  2  1
R26    2  2  1
R27    2  2  1
R28    3  2  3
R29    2  2  2
R30    2  2  1
R31    2  1  2
R32    3  3  3
R33    2  2  2
R34    2  1  3
R35    2  2  1

Thank you so much for your help! I'm very new to statistical research and will probably be asking a lot of questions until I know enough to start helping others!

---

Update: I went ahead with a simple consensus analysis to test what percentage of raters agreed on the mean for each of the rubric dimensions. It's not perfect, but I think it meets my needs. I do wonder why the guideline is 70% for tests of this kind.
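For anyone who wants to check this kind of consensus analysis outside Excel, here is a minimal Python sketch. It assumes that "agreeing on the mean" means matching the dimension's mean rounded to the nearest whole score; that rounding rule, the function name `consensus_by_dimension`, and the `THRESHOLD` constant are my own assumptions for illustration, not an established procedure. The same counts can be reproduced in Excel with AVERAGE and COUNTIF.

```python
import math

# Ratings copied from the table in this post: one (A, B, C) tuple per rater.
RATINGS = [
    (4, 3, 3), (2, 2, 1), (2, 2, 1), (2, 2, 1), (2, 2, 1),  # R1-R5
    (2, 2, 1), (2, 2, 3), (2, 2, 2), (3, 2, 2), (2, 2, 2),  # R6-R10
    (2, 2, 1), (2, 2, 2), (2, 2, 2), (4, 3, 3), (3, 3, 4),  # R11-R15
    (2, 2, 3), (2, 2, 1), (3, 3, 1), (3, 3, 3), (2, 3, 1),  # R16-R20
    (1, 1, 1), (1, 1, 1), (2, 2, 2), (2, 2, 1), (2, 2, 1),  # R21-R25
    (2, 2, 1), (2, 2, 1), (3, 2, 3), (2, 2, 2), (2, 2, 1),  # R26-R30
    (2, 1, 2), (3, 3, 3), (2, 2, 2), (2, 1, 3), (2, 2, 1),  # R31-R35
]
DIMENSIONS = ("A", "B", "C")
THRESHOLD = 0.70  # the 70% guideline mentioned above


def consensus_by_dimension(ratings):
    """Return {dimension: (rounded_mean, share_of_raters_matching_it)}."""
    result = {}
    for i, name in enumerate(DIMENSIONS):
        scores = [row[i] for row in ratings]
        mean = sum(scores) / len(scores)
        # Assumption: the consensus score is the mean rounded half up.
        target = math.floor(mean + 0.5)
        share = sum(1 for s in scores if s == target) / len(scores)
        result[name] = (target, share)
    return result


if __name__ == "__main__":
    for name, (target, share) in consensus_by_dimension(RATINGS).items():
        verdict = "meets" if share >= THRESHOLD else "falls below"
        print(f"{name}: consensus score {target}, {share:.1%} agree, {verdict} 70%")
```

One design choice to be aware of: the rounded mean and the most common score (the mode) need not coincide, so the agreement percentage can change noticeably depending on which one is treated as the consensus.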

Also, can I use the same 70% threshold to test how reliable each individual rater was? For example, if a rater matched the mean in his/her scoring of >70% of the rubric dimensions, is he/she more reliable?
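To make the per-rater version of the question concrete, here is a rough Python sketch of what I have in mind. For each rater it computes the share of dimensions on which the rater matched the rounded mean, plus a signed bias (the rater's average deviation from the dimension means, positive for lenient and negative for severe). The name `rater_summary`, the rounding rule, and the bias definition are assumptions of mine for illustration only.

```python
import math

# Ratings copied from the table in this post: one (A, B, C) tuple per rater.
RATINGS = [
    (4, 3, 3), (2, 2, 1), (2, 2, 1), (2, 2, 1), (2, 2, 1),  # R1-R5
    (2, 2, 1), (2, 2, 3), (2, 2, 2), (3, 2, 2), (2, 2, 2),  # R6-R10
    (2, 2, 1), (2, 2, 2), (2, 2, 2), (4, 3, 3), (3, 3, 4),  # R11-R15
    (2, 2, 3), (2, 2, 1), (3, 3, 1), (3, 3, 3), (2, 3, 1),  # R16-R20
    (1, 1, 1), (1, 1, 1), (2, 2, 2), (2, 2, 1), (2, 2, 1),  # R21-R25
    (2, 2, 1), (2, 2, 1), (3, 2, 3), (2, 2, 2), (2, 2, 1),  # R26-R30
    (2, 1, 2), (3, 3, 3), (2, 2, 2), (2, 1, 3), (2, 2, 1),  # R31-R35
]


def rater_summary(ratings):
    """Return {rater: (share_of_dimensions_matched, signed_bias)}."""
    n_raters = len(ratings)
    n_dims = len(ratings[0])
    # Mean score per dimension across all raters.
    means = [sum(row[i] for row in ratings) / n_raters for i in range(n_dims)]
    # Assumption: the consensus score is the mean rounded half up.
    targets = [math.floor(m + 0.5) for m in means]
    summary = {}
    for idx, row in enumerate(ratings, start=1):
        matches = sum(1 for score, target in zip(row, targets) if score == target)
        # Signed bias: average of (score - dimension mean) over the dimensions.
        bias = sum(score - m for score, m in zip(row, means)) / n_dims
        summary[f"R{idx}"] = (matches / n_dims, bias)
    return summary


if __name__ == "__main__":
    for rater, (match_share, bias) in rater_summary(RATINGS).items():
        print(f"{rater}: matched {match_share:.0%} of dimensions, bias {bias:+.2f}")
```

With only three rubric dimensions, a rater's match share can only be 0%, 33%, 67%, or 100%, so a >70% cutoff is effectively a requirement to match on all three.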
