df and Fisher's exact vs. Pearson's chi-square

Hello everyone,

I'm having a mental block today and am having trouble working out my df value. Here's a summary of my data to help:

32 participants face-to-face
18 participants online
40 questions; binomial correct/incorrect answers

I'm comparing the accuracy rates of the face-to-face vs. online groups using Pearson's chi-square. If I enter the total counts of correct and incorrect answers, rather than listing each individual response, do I use a different df value?

And, given my small sample sizes, would I be better off using Fisher's exact test, even if the p-value ends up much the same in the end?
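To make the setup concrete, here's a rough sketch in Python of what I have in mind (the counts below are placeholders, not my real numbers):

```python
# Hypothetical pooled counts (placeholders, not real data):
# rows = face-to-face vs. online, columns = correct vs. incorrect.
# Face-to-face: 32 participants x 40 questions = 1280 responses;
# online: 18 x 40 = 720 responses.
from scipy.stats import chi2_contingency

table = [[960, 320],   # face-to-face: correct, incorrect
         [500, 220]]   # online: correct, incorrect

chi2, p, dof, expected = chi2_contingency(table)
print(dof)  # df = (rows - 1) * (cols - 1) = 1, however many responses are pooled
```

So if I'm reading this right, the df depends only on the shape of the table, not on how many responses go into the totals.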

Please ask for further clarification...it might even help me to figure it out on my own!
Thanks in advance



Cookie Scientist
I started off writing a long lecture on why the analysis you have in mind--which I readily acknowledge is certainly the traditional and straightforward way to handle this type of data--is demonstrably flawed and why the general strategy needs to finally die a quiet death... but the post was really starting to go far outside the scope of this thread, so I'll try to be a little more succinct and to the point. (If the post still seems a little long, well, you should have seen the first draft. :p)

By calculating proportion correct for each participant and then testing for group differences in these proportions, you ignore question-to-question variability in difficulty (e.g., there are a large number of possible ways for a participant to get 0.60 correct) and thereby implicitly treat questions as a fixed factor nested under participants. Not only are questions not nested under participants here--every participant responds to the same set of 40 questions so clearly questions are crossed with participants--but depending on the context you're working in, you probably wouldn't be too happy about assuming questions are fixed either. Can the questions in your study reasonably be considered a sample from a larger (theoretical) population of potential questions? And might the data look a little different if you had used some other sample of questions? If yes then you need to explicitly take question-to-question variability into account in your analysis if you are to avoid systematic bias in your results.

The appropriate analysis here is a logit or probit mixed model where correct responses are predicted from the online vs. face-to-face factor, with crossed random effects for participants and test questions. This is very similar to an item response theory type of analysis (the two may in fact be equivalent here but unfortunately I haven't studied enough item response theory to know if that is actually true). Further reading and some instruction on how to go about conducting such an analysis can be found in the papers cited below.
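If it helps, here is a rough sketch of what such a model looks like. The usual tool is glmer in R's lme4 package; I'll sketch it with statsmodels' BinomialBayesMixedGLM in Python instead, with simulated data and made-up column names, purely for illustration:

```python
# Sketch: logit mixed model with crossed random effects for
# participants and questions. All column names are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)

# Long-format data: every participant answers every question (crossed design).
participants = np.arange(50)                 # 32 face-to-face + 18 online
groups = (participants >= 32).astype(int)    # 0 = face-to-face, 1 = online
questions = np.arange(40)

data = pd.DataFrame(
    [(p, g, q) for p, g in zip(participants, groups) for q in questions],
    columns=["participant", "group", "question"],
)

# Random intercepts for participants and for questions; no true group effect.
p_eff = rng.normal(0, 1, size=50)[data["participant"]]
q_eff = rng.normal(0, 1, size=40)[data["question"]]
prob = 1 / (1 + np.exp(-(0.5 + p_eff + q_eff)))
data["correct"] = rng.binomial(1, prob)

# Fixed effect of group; crossed random effects via variance components.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ group",
    {"participant": "0 + C(participant)", "question": "0 + C(question)"},
    data,
)
result = model.fit_vb()
print(result.summary())
```

The key point is that both participant and question appear as random effects on an equal footing, rather than questions being averaged away before the test.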

I have to stress that this is not an esoteric piece of statistical pedantry--the amount of positive bias you may be introducing in your analysis by ignoring random effects can be quite substantial depending on various details of your study. In various simulations I have conducted (albeit looking at continuous responses, not categorical), it has not been at all uncommon to see empirical type 1 error rates that exceed the nominal .05 error rate by more than an order of magnitude (!!). So doing this right matters.
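As a toy illustration of the mechanism (far cruder than the simulations I mentioned, and binary rather than continuous): if you pool every response into a single 2x2 table while participants actually vary in ability, the naive chi-square test rejects a true null far more than 5% of the time. Only participant variability is simulated here; question variability is left out since, in a crossed design, it affects both groups equally.

```python
# Toy simulation: no true group difference, but participants vary in
# ability (random intercepts on the logit scale). Pooling all responses
# into one 2x2 table and running chi-square ignores that clustering.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_sims, n_q = 500, 40
n_ftf, n_online = 32, 18

false_positives = 0
for _ in range(n_sims):
    # Participant intercepts, logit scale; same distribution in both groups.
    theta = rng.normal(0.5, 1.0, size=n_ftf + n_online)
    p = 1 / (1 + np.exp(-theta))
    correct = rng.binomial(n_q, p)      # correct count per participant
    incorrect = n_q - correct
    table = [
        [correct[:n_ftf].sum(), incorrect[:n_ftf].sum()],
        [correct[n_ftf:].sum(), incorrect[n_ftf:].sum()],
    ]
    if chi2_contingency(table)[1] < 0.05:
        false_positives += 1

print(false_positives / n_sims)  # empirical type 1 error rate, well above the nominal .05
```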

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434-446.


Hi Jake,

Far from being a pedant, I think you're probably spot on and your input most appreciated.

As it happens, in my most recent experiment I used a logistic mixed-effects regression, for the very reasons you stated, i.e. to avoid treating the questions as fixed rather than as randomly sampled from a larger pool of possible questions. Now that I know how to do this analysis (using R), I'd love to have done it for my old experiment (the one I'm describing), but I no longer have the individual responses for each question, just the overall counts.

I should say that I originally performed a crosstabs analysis on the raw data, so I did look at the individual effects of each question, but I certainly didn't adjust my model for individual question strength/bias.

Sigh... a lesson to everyone: treat raw data like diamonds. Or malt whisky.


No cake for spunky
The appropriate analysis here is a logit or probit mixed model where correct responses are predicted from the online vs. face-to-face factor, with crossed random effects for participants and test questions.
Which will work fine for the 1 percent of the population who actually understand logit models. :) And probably no more than a tenth of one percent will understand logit with crossed random effects. No one else will understand it. So it depends on your audience--or on whether you value your audience understanding what you found.

Fisher's exact test is used when the assumptions of chi-square are violated (too many cells with expected counts below five). Most software will warn you about this.
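For instance (with made-up counts), you can check that rule of thumb yourself from the expected frequencies and then run Fisher's exact test in one call:

```python
# Made-up small-sample 2x2 table to illustrate the rule of thumb:
# if any expected cell count falls below 5, prefer Fisher's exact test.
from scipy.stats import chi2_contingency, fisher_exact

table = [[7, 3],
         [2, 8]]

chi2, p_chi2, dof, expected = chi2_contingency(table)
print(expected.min() < 5)            # True here, so chi-square is shaky
odds_ratio, p_fisher = fisher_exact(table)
print(p_fisher)                      # exact p-value, no expected-count assumption
```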