I have a question regarding which interrater reliability test I should use.

The situation is as follows: 12 judges rated 20 profiles on 14 questions (each profile was rated on the same 14 questions). I want to know to what degree the raters agreed in their judgments.

I was thinking about using the ICC (should I use single or average measures?), Fleiss' kappa, or Krippendorff's alpha, but I'm not sure which one to pick. Beyond that, depending on the reliability test, how should I organize the dataset?
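In case it helps frame the question, here is a rough sketch of how I imagine the data could be laid out for the ICC (long format) versus Fleiss' kappa (a subjects-by-categories count table). The 1–5 rating scale, the column names, treating each profile–question pair as one "subject", and the choice of pingouin/statsmodels are just assumptions on my part, not how my data actually looks.

```python
# A minimal sketch of two possible data layouts, assuming a 1-5 rating scale
# and treating each profile-question pair as one "subject"/"target".
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
judges = [f"J{i}" for i in range(1, 13)]           # 12 judges
subjects = [f"P{p}_Q{q}"                           # 20 profiles x 14 questions
            for p in range(1, 21) for q in range(1, 15)]

# Long format (one row per judge x subject) -- what pingouin's ICC expects.
long = pd.DataFrame(
    [{"judge": j, "subject": s, "score": int(rng.integers(1, 6))}
     for j in judges for s in subjects]
)
icc = pg.intraclass_corr(data=long, targets="subject",
                         raters="judge", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Wide format (subjects x raters), then counts of how many raters chose each
# category per subject -- the table that Fleiss' kappa is computed from.
wide = long.pivot(index="subject", columns="judge", values="score").to_numpy()
table, _ = aggregate_raters(wide)
print("Fleiss' kappa:", fleiss_kappa(table))
```

Is one of these layouts (or some other one) the right way to set the data up for the test you would recommend?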

Thanks in advance for your input!