Do you guys think ANOVA with collapsing works for this?

We want to design a warning label.

Basically we intend to show subjects 25 packages, each with a different warning label (the labels will vary from barely different – maybe a font size change – to quite different – maybe a differently shaped label). Subjects will see each image for 15 seconds or so (there is a lot of information on the images). We will then ask them to recall what they saw; there will be distractor questions, and the image presentation order will be randomized. We will have 4,500 subjects.

To score how well a label performed, we intend to generate a difference score between hits and false alarms. The item set is balanced so that we ask about the same number of targets and distractors. For every ‘hit’ (a label element correctly identified as present) the participant gets +1. For every ‘false alarm’ (a distractor incorrectly identified as present) they get -1. A perfect score is +4, with a possible range of +4 to -4. This gives us a score for the label as a whole.
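To make the scoring concrete, here's a minimal sketch assuming 4 target elements and 4 distractors per label (the element names are made up for illustration):

```python
# Sketch of the hit-minus-false-alarm difference score described above.
# Hypothetical: 4 target elements and 4 distractor elements per label.

def label_score(responses, targets, distractors):
    """Score one participant's responses for one label.

    responses: set of elements the participant said were present.
    targets: elements actually on the label (each hit scores +1).
    distractors: elements not on the label (each false alarm scores -1).
    """
    hits = len(responses & targets)
    false_alarms = len(responses & distractors)
    return hits - false_alarms  # range: -len(distractors) .. +len(targets)

targets = {"skull icon", "red border", "bold text", "hazard word"}
distractors = {"green border", "photo", "italic text", "barcode"}

# A participant who recalls 3 real elements and falsely endorses 1 distractor:
responses = {"skull icon", "red border", "bold text", "photo"}
print(label_score(responses, targets, distractors))  # 3 hits - 1 FA = 2
```

A participant's scores across all 25 labels would then feed into whatever omnibus test is chosen.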

The problem is, at the end of the first part we will have 25 scores, and we will need to determine whether the differences between them are significant. Originally we were going to use an ANOVA and then rank the top 5…

An ANOVA with 25 label scores might not work – especially in this case, where the differences among the top-performing labels are likely to be minor (if some design attributes attract attention better, then the top 5 labels should look similar). With a sample this large, I don’t think any statistical significance we see in those differences will actually mean the labels were meaningfully better.
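To illustrate the worry: with roughly 4500/25 = 180 subjects per label, the omnibus F test will almost certainly come out significant as long as *any* labels differ, even though it says nothing about the near-identical top performers. A quick simulation sketch (all means and SDs here are hypothetical, not your data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical setup: 25 labels, 180 subjects each.
# The top 5 labels differ by trivially small amounts; the omnibus
# significance is driven entirely by the clearly weaker labels.
n_per_label = 180
means = np.concatenate([
    np.array([3.00, 2.98, 2.97, 2.95, 2.94]),  # near-identical top 5
    np.linspace(2.5, 1.0, 20),                  # clearly weaker labels
])
groups = [rng.normal(mu, 1.5, n_per_label) for mu in means]

# Omnibus one-way ANOVA across all 25 labels: significant, but
# uninformative about whether the top 5 differ from one another.
f_all, p_all = stats.f_oneway(*groups)
print(f"All 25 labels: F = {f_all:.1f}, p = {p_all:.3g}")

# Restricting to the top 5 shows how little separates them.
f_top, p_top = stats.f_oneway(*groups[:5])
print(f"Top 5 only:   F = {f_top:.2f}, p = {p_top:.3f}")
```

This is why a significant omnibus result wouldn't by itself justify picking a winner from the top cluster.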

Do you have any ideas how we could easily solve this? We can still change the study design a bit. Is there another test that might work? I’m wondering if you think this would.

We can't rely on post hoc (ad hoc) collapsing of labels: we have to write up an analysis plan in advance, and it can't include ad hoc analyses.

The top 5 labels will move on to another phase of the study.