Which design is more powerful for testing inter-rater reliability: one with 2 judges or one with 4 judges? (Citations appreciated.)