Non-unique raters: ICC or Fleiss' kappa

In my country, people living in retirement homes are surveyed about the quality of service. If the person has dementia or an illness, a proxy is asked to fill out the survey instead. My research asks whether proxies are really capable of answering correctly on behalf of the person living in the retirement home. So we asked residents to fill out a survey, and we gave the same survey to the person who would be their proxy. We want to compare their answers.

I've been told that I should use the "intraclass correlation coefficient (ICC)". However, I have two issues:
1) the data is ordinal
2) most examples I find online compare two or more teachers who all grade several students. However, in my case there aren't two or more people who each give a score for everybody. We have an elderly person and his proxy, then another elderly person with a different proxy, and so on. Can you use ICC to compare the group in the retirement home with the proxy group, although each elderly-proxy pair is different? Does it matter that ICC uses the mean over all the cases? I think it is possible because you compare two groups, "elderly" (let's say rater 1) and "proxy" (let's say rater 2). You will not be able to say that retiree 16 and proxy 16 are a good match, but that's not important.

My question is therefore: is it valid to compare the two groups with ICC?
If you cannot use ICC, what else would you advise? I'm thinking Fleiss' kappa is a possibility.
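
To make the question concrete, here is a minimal sketch of a one-way random-effects ICC, often written ICC(1,1), which is the variant usually described for designs where each subject is scored by a different set of raters. It is only an illustration: the function name `icc_one_way` and the toy scores are made up, and it treats the ordinal scores as numbers, which is exactly the first issue listed above.

```python
from statistics import mean

def icc_one_way(pairs):
    """One-way random-effects ICC(1,1) for (resident, proxy) score pairs.

    Hypothetical helper for illustration; scores are treated as numeric.
    """
    n = len(pairs)          # number of resident-proxy pairs
    k = 2                   # ratings per pair
    grand = mean(score for pair in pairs for score in pair)
    pair_means = [mean(pair) for pair in pairs]
    # Between-pair mean square
    msb = k * sum((m - grand) ** 2 for m in pair_means) / (n - 1)
    # Within-pair mean square
    msw = sum((score - m) ** 2
              for pair, m in zip(pairs, pair_means)
              for score in pair) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Made-up example data: each tuple is (resident score, proxy score)
toy_pairs = [(1, 1), (2, 3), (4, 4), (5, 4), (3, 3)]
print(icc_one_way(toy_pairs))
```

Because the within-pair variation enters the formula directly, this version does look at each elderly person together with his own proxy rather than only at the two group means.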


Active Member
Are there many retirement homes? I.e., what is being rated: the (assumed common) service of a given retirement home, or the experience of each old (senior citizen?) person?

Are you actually interested in qualifying the raters/rating system, or is it just the agreement between the two raters you care about? In the latter case, I am getting the statistical feeling that it's actually going to be more like a paired t-test here, except ordinal, using whatever test that is; I can't remember the name.
Thank you for your answer!
The reason I don't want to use a paired t-test is that it is calculated from the means, and here the mean has little value. It should be a test that looks at each elderly person together with his proxy. Pooling all the retired people (and calculating their mean) and all the proxies (their mean as well) doesn't answer the question of whether a proxy is capable of answering in place of the elderly person.
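
A small made-up example of what I mean (the scores and the helper names `mean_difference` and `exact_agreement` are invented purely for illustration): the two groups can have exactly the same mean while almost no proxy gives the same answer as the person they answer for.

```python
from statistics import mean

def mean_difference(residents, proxies):
    """Difference between the two group means (what a mean-based test looks at)."""
    return mean(proxies) - mean(residents)

def exact_agreement(residents, proxies):
    """Share of pairs in which the proxy gave exactly the resident's answer."""
    matches = sum(r == p for r, p in zip(residents, proxies))
    return matches / len(residents)

# Invented scores on a 1-5 scale; position i is the same resident-proxy pair
residents = [1, 2, 3, 4, 5]
proxies = [5, 4, 3, 2, 1]

print(mean_difference(residents, proxies))  # group means are identical
print(exact_agreement(residents, proxies))  # yet only one pair agrees
```

So a result of "no difference in means" would say nothing about whether each proxy matches their own resident.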

We had five retirement homes. We measured the experience of each retired person (and gave the same questions to the proxy). Examples are "Do you have sufficient privacy?", "Do they have sufficient activities?", "Is the home clean?" ...

Feel free to ask additional questions if I'm not making myself clear!