Multiple Comparison of Percentages from Binary Within-Subjects Response Trials

I have searched and read a lot about this, but I'm still unsure which test suits the following situation.

Data come from a few hundred participants who each saw, in each of several trials, a pair of stimuli and made a binary judgment (indicating which stimulus in the pair has a certain quality; there is an objectively correct answer). I am not interested in participant-level variation; I want to make inferences about the stimulus pairs. I want to compare these several pairs against each other (e.g., Pair 1 vs. Pair 2, Pair 1 vs. Pair 3, etc.) to determine whether certain pairs afford more accurate judgments. Note that the stimuli overlap across pairs (e.g., Pair 1 contrasts Stim 1 vs. Stim 2; Pair 2 contrasts Stim 1 vs. Stim 3, etc.).

In the end, I only want to compare a single figure (the percentage of correct responses) for one pair against the same figure for another pair, and do this across multiple pairs.

The definitions of McNemar's test and Cochran's Q seem closest to this. Can someone verify that either of these would be a sensible choice here? I suspect I cannot simply use the percentages; I would need to create 2×2 tables (e.g., how many participants responded correctly in Pair 1 AND incorrectly in Pair 2, etc.).

I can work with R or Excel (also SPSS, but that is not preferred, if it matters). Thank you in advance!
I've searched quite a bit, and I think the right answer is Cochran's Q as an omnibus test, followed by McNemar's tests if the former is significant. Both are easy to run in R or other software. McNemar's test has several versions, whose descriptions I did not find very simple and clear; most recently, a mid-P version has been proposed as superior. I tried several variants of McNemar's test, and they more or less yield the same binary decisions (reject or fail to reject H0).
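To make the omnibus step concrete, here is a minimal sketch of Cochran's Q computed directly from the raw binary responses. This is not any particular package's implementation, just the textbook formula; the toy data are invented for illustration, with each row a participant and each column a stimulus pair (1 = correct).

```python
from scipy.stats import chi2  # only needed for the chi-square p-value

def cochrans_q(data):
    """Cochran's Q on a list of per-participant rows of 0/1 responses,
    one column per stimulus pair (condition)."""
    k = len(data[0])                                        # number of pairs
    col_sums = [sum(row[j] for row in data) for j in range(k)]
    row_sums = [sum(row) for row in data]
    n_total = sum(row_sums)
    # Q = (k-1) * [k * sum(C_j^2) - N^2] / [k*N - sum(R_i^2)]
    num = (k - 1) * (k * sum(c * c for c in col_sums) - n_total ** 2)
    den = k * n_total - sum(r * r for r in row_sums)
    q = num / den
    p = chi2.sf(q, df=k - 1)    # Q is approx. chi-square with k-1 df under H0
    return q, p

# invented toy data: 6 participants x 3 stimulus pairs
demo = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
q, p = cochrans_q(demo)
```

In practice you would of course use a tested routine (e.g., `statsmodels.stats.contingency_tables.cochrans_q` in Python, or `DescTools::CochranQTest` in R) rather than rolling your own; the point of the sketch is just that the input is the raw participants × pairs 0/1 matrix, not the percentages.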

Both Cochran's Q and McNemar's test can be run on the raw data (the binary responses). This is what had confused me: I had calculated the percentages and was trying to compare those. But most software and R packages take the raw data and automatically generate the 2×2 tables and the percentages I mentioned in my post above. Only some packages require the user to build the contingency tables themselves, which is not much work.
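For anyone who does want to see the 2×2 table built by hand, here is a sketch of the pairwise follow-up: it tallies the discordant counts from two paired 0/1 vectors and computes the exact (binomial) McNemar p-value along with its mid-P variant (exact p minus the point probability of the observed count). The two response vectors below are invented; only the discordant cells (correct on one pair, incorrect on the other) enter the test.

```python
from math import comb

def mcnemar_exact(x, y):
    """Exact (binomial) McNemar test on two paired 0/1 response vectors.
    Returns the discordant counts (b, c), the two-sided exact p, and the mid-P."""
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # correct on A only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # correct on B only
    n = b + c                      # only discordant responses are informative
    k = min(b, c)
    # two-sided exact p: double one tail of Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    p_exact = min(1.0, 2 * tail)
    # mid-P: subtract half of the doubled point probability of the observed count
    p_mid = p_exact - comb(n, k) * 0.5 ** n
    return b, c, p_exact, p_mid

# invented paired responses (1 = correct) for two stimulus pairs
pair_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
pair_b = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]
b, c, p_exact, p_mid = mcnemar_exact(pair_a, pair_b)
```

Ready-made versions exist as well, e.g., `statsmodels.stats.contingency_tables.mcnemar` in Python or `mcnemar.test` in base R (though the mid-P variant typically needs a dedicated package or a few lines like the above).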

There are also some alternative follow-up tests, such as Dunn's test, but again, I did not find a good description of how it compares to McNemar's.

I was unable to delete my post, so I'm answering it myself in case someone finds it useful.