I have a sample of 5,000 health assessments (with ratings from 1-5 on 7 dimensions that roll up to a composite score) completed on 1,400 patients over three years by over 100 clinicians at about 15 locations. These assessments are essentially ratings (by trained clinicians) of various dimensions of behavioral health risk.

I am interested in a proxy measure of inter-rater reliability because I have limited data. I would like to estimate inter-rater reliability at three levels: overall, by location (each location has its own group of clinicians, none of whom work at any other location), and by clinician.

Can I do this if I don't have all raters completing the entire assessment (with 7 dimensions) on all of the same patients?

The clinicians and patients in this data set are the complete population, not a sample. However, clinicians only completed assessments on a subset of their patients, and they completed more than one rating (over time) on some of the same patients.

I can run an Intra-Class Correlation Coefficient analysis on the dimension ratings (each dimension is composed of questions with scales of 1-5) for the whole sample and get a Cronbach's Alpha (I chose the one-way random model and got a Cronbach's Alpha of .861 overall, which seems very high). I would then like to do the same for each location and each clinician and compare the results. However, I don't know whether what I'm doing is meaningful in terms of inter-rater reliability.

Or maybe I should use Fleiss' Kappa (or weighted Kappa) on the individual dimensions (each of which is a rating scale from 1-5), because the data are ordinal and because I have more than two raters?

I'm thinking that all of this will be garbage until I have the right data - which would be a sample of patients (or scenarios) with multiple raters.

Hi,
I would be most comfortable with Fleiss' Kappa. It could give you a lot of interesting information.
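
For reference, here is a minimal sketch of how Fleiss' Kappa can be computed for one dimension. It assumes the ratings have already been arranged as a table of counts (one row per patient, one column per rating category 1-5), and that every patient has the same number of ratings, which the standard Fleiss formula requires. The function name `fleiss_kappa` and the example counts are hypothetical and not taken from your data.

```python
import numpy as np


def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories table of counts.

    counts[i, j] = number of ratings that placed subject i in category j.
    Every row must sum to the same number of ratings n (n >= 2).
    """
    counts = np.asarray(counts, dtype=float)
    n_subjects = counts.shape[0]
    n = counts[0].sum()  # ratings per subject
    if n < 2 or not np.allclose(counts.sum(axis=1), n):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    # Share of all ratings that fall in each category.
    p_j = counts.sum(axis=0) / (n_subjects * n)
    # Observed agreement for each subject, then averaged over subjects.
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()
    # Agreement expected by chance.
    P_e = np.square(p_j).sum()
    # Undefined (division by zero) if every rating is in a single category.
    return (P_bar - P_e) / (1.0 - P_e)


# Hypothetical example: 4 patients, 3 ratings each, categories 1-5.
example_counts = [
    [0, 0, 3, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 0, 0, 2, 1],
    [3, 0, 0, 0, 0],
]
kappa = fleiss_kappa(example_counts)
print(round(kappa, 3))
```

Note that this treats the five categories as unordered, so a 1 versus a 5 counts the same as a 1 versus a 2. With your data, it could only be applied to patients who were rated by several clinicians, each the same number of times.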

