chi-square and multiple testing


I want to compare responses to a particular survey question between one year and the next, to see if there is a statistically significant difference in the % agreeing.

Year - 2012 or 2013 (explanatory variable)
Response - Agree or Not Agree (response variable)

I am happy to do chi-square. However if I break down the people who were surveyed by the department in which they work (could be one of 30) and run the chi-square test thirty times and report the results for every department, do I need to worry about the multiple testing problem?
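For concreteness, the test for a single department and a single question might look like the sketch below. It uses only the Python standard library; the function name `chi2_2x2` and the counts are made up for illustration, and no continuity correction is applied.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts for one department (not real survey data):
# 2012: 30 agree, 20 not agree; 2013: 38 agree, 14 not agree
stat, p = chi2_2x2(30, 20, 38, 14)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

With very small departments (10-19 staff) the expected cell counts can get low, in which case an exact test may be the safer choice.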

I could also add that there are 20 questions in the survey, so I would be doing 20*30 tests. Obviously, at the 0.05 significance level there is a 5% chance that any given association I report as significant actually is not, but I can live with that.
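To put numbers on what 20*30 tests at the 0.05 level means, here is a small sketch. It assumes, purely for illustration, that nothing truly changed between the years and that the tests are independent (they probably are not, since the same people answer all 20 questions).

```python
def expected_false_positives(m, alpha):
    """Expected number of significant results when all m null hypotheses are true."""
    return m * alpha

def prob_at_least_one_false_positive(m, alpha):
    """Family-wise error rate for m independent tests, all nulls true."""
    return 1 - (1 - alpha) ** m

m = 20 * 30      # questions times departments, as in the post
alpha = 0.05

print(expected_false_positives(m, alpha))
print(prob_at_least_one_false_positive(m, alpha))
```

Under those assumptions about 30 of the 600 tests would come out significant by chance alone, which is the core of the multiple testing concern.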

Anyway, I have been advised that I must adjust for multiple testing, and I wanted an expert second opinion because I cannot quite understand why I should!

I'd be extremely grateful for any help.


There are 2600 staff in total, and their distribution over the 33 departments is shown below. No response rate is less than 50%. I am only interested in finding out whether there is a statistically significant difference between 2012 and 2013 in the proportion of respondents who agree with a question. I am not interested in differences between departments. My question is: can I not just do 33 chi-square tests, each at the 0.05 significance level?

Size of department    Number of departments
10-19                 2
20-29                 2
30-39                 1
40-49                 6
50-59                 4
60-69                 2
70-79                 5
>100                  11


TS Contributor
If you are not interested in the particular departments, then why
don't you just pool the data and perform the analysis on the total
of 2600 respondents...?

With kind regards


Thanks very much. I'm not interested in differences between departments, but I am interested in seeing, for each department, how the responses compare from one year to the next. My question is: is it legitimate to just perform the chi-square test on its own for each department and report significant results at the 0.05 level, or do I need to do something complex to address the 'multiple testing problem'?

Thanks very much,


Mean Joe

TS Contributor
Yeah, I would say you can perform the test on its own for each department, without adjusting the p-values.

The reason why: since you are treating departments on their own, then you are not assuming that all the departments have the same proportion. You are not testing that some department differs from the common proportion.

Look at it this way: if you only collected the same data for one department, and ignored the other 32 departments, would you do a p value adjustment?

I believe your tests are not connected.

Now here's one reason why I asked for your n in each department. With large n, it is easier to get p<.05. So you need to consider this, when deciding to use p-critical = .05.
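To illustrate the point about n, the sketch below runs the same comparison (60% vs 70% agreeing) at two sample sizes. The counts are invented, and the helper `pearson_chi2_2x2` is just a hand-rolled Pearson chi-square without continuity correction.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical department with 50 respondents per year: 60% vs 70% agree
stat_small, p_small = pearson_chi2_2x2(30, 20, 35, 15)

# Same proportions, but 500 respondents per year
stat_large, p_large = pearson_chi2_2x2(300, 200, 350, 150)

print(f"n=50 per year:  chi2 = {stat_small:.2f}, p = {p_small:.4f}")
print(f"n=500 per year: chi2 = {stat_large:.2f}, p = {p_large:.4f}")
```

The statistic grows in proportion to n when the proportions stay fixed, so the same 10-point difference is far from significant in the small department and clearly significant in the large one.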

Instead of "adjusting the p-value to make the .05 magic level have meaning", why don't people just realize they should use a different p-critical? p-values are based on the known distribution of the test statistic. I'm not a fan of the Bonferroni correction, which just multiplies each p by the number of tests (equivalently, divides p-critical by it). Furthermore, with any style of p-value correction, the smallest p will remain the smallest p.
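As a small sketch of what these corrections actually do, here are hand-written Bonferroni and Holm adjustments applied to a made-up list of p-values. The function names and the numbers are only for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply each p by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p is multiplied by (m - k + 1); enforce monotonicity.
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

raw = [0.001, 0.012, 0.03, 0.04, 0.20]   # hypothetical p-values
print(bonferroni(raw))
print(holm(raw))
```

Comparing a Bonferroni-adjusted p to .05 is the same decision as comparing the raw p to .05 divided by the number of tests, and both adjustments keep the p-values in their original order.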

To try to answer the explicit question, I think that it would be legitimate to use a chi-squared test for each department. But I also agree that the multiple inference problem needs to be dealt with.

Maybe a "false discovery rate" (FDR) could be used.
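One common FDR method is the Benjamini-Hochberg procedure. The sketch below is a hand-written version applied to invented p-values; the function name is my own, and it assumes the tests are independent or positively dependent.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p to the smallest, enforcing monotonicity.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        candidate = pvalues[i] * m / (rank + 1)
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.012, 0.03, 0.04, 0.20]   # hypothetical p-values
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
print(adjusted)
print(significant)
```

Results with an adjusted value at or below the chosen rate (say 0.05) are reported, and on average at most that fraction of the reported results are expected to be false discoveries.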

Another, and I believe better, alternative would be to do a Mantel-Haenszel test with department as the stratifying variable. That would be one test per question, since statlearner1000 is:
"not interested in differences between departments".
But the 20 questions still remain. Maybe FDR is useful there.
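Going back to the Mantel-Haenszel suggestion, a minimal sketch of the Cochran-Mantel-Haenszel statistic for one question, stratified by department, could look like this. The function name `cmh_test` and the counts are hypothetical, and no continuity correction is applied.

```python
import math

def cmh_test(tables):
    """Cochran-Mantel-Haenszel test without continuity correction.

    tables: one (a, b, c, d) tuple per department, where
      a = agree 2012, b = not agree 2012, c = agree 2013, d = not agree 2013.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    diff_sum = 0.0
    var_sum = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n                      # observed minus expected
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var_sum == 0:
        raise ValueError("no stratum has variation in both year and response")
    stat = diff_sum ** 2 / var_sum
    return stat, math.erfc(math.sqrt(stat / 2))

# Three hypothetical departments (invented counts)
departments = [(30, 20, 38, 14), (12, 8, 15, 6), (55, 45, 62, 40)]
stat, p = cmh_test(departments)
print(f"CMH chi2 = {stat:.3f}, p = {p:.3f}")
```

This gives a single test of the year effect per question while allowing each department its own baseline agreement rate.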

- - -

But there is another possibility....
statlearner1000 wants to:
"see if there is a statistically significant difference in the % agreeing between one year and the next

Year - 2012 or 2013 (explanatory variable)
Response - Agree or Not Agree (response variable)"

Then it seems natural, to me, to model the proportion (p) who "Agree" versus those who "Not Agree".

(Here the response variable is Y = 1 (Agree) or 0 (Not agree). Y is binomial with n=1 and probability p, thus Y is Bin(1,p).)

Year would be a fixed independent variable that will "explain" the change in the proportion p.

The department could be a random effect. Maybe these 30 departments are the whole population, that is, all the departments that they have in the company. Even so, it may be fruitful to model them as if they were a random sample from a larger population.

There are also 20 questions. Maybe these can also be modelled as a random effect. Then it would be a crossed mixed-effects model with departments as one random effect and questions as the other. Here I think of the "questions" as randomly drawn from a "population" of possible questions. It could be that the questions are not statistically independent. (If a respondent is dissatisfied on one question, then she may also be dissatisfied on the next, similar question.) Such modelling could be controversial, but I suggest it so that it can be discussed.

Possibly one could get better precision for each question by modelling the questions as a random effect, in contrast to modelling them one by one.

Essentially, the main interest would be in the year-to-year difference in the proportion parameter p.

There would also be random department effects, random question effects, and interaction effects between these and the explanatory variable.
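Written out, one possible form of this model is sketched below. The logit link and the normal distributions for the random effects are my assumptions (nothing above fixes them), and the interaction terms are left out to keep it short. Here i indexes a single answer, j the department and k the question, with year_i = 0 for 2012 and 1 for 2013.

```latex
Y_{ijk} \sim \mathrm{Bin}(1, p_{ijk}), \qquad
\operatorname{logit}(p_{ijk}) = \beta_0 + \beta_1\,\mathrm{year}_{i} + u_j + v_k,
\qquad u_j \sim N(0, \sigma_u^2), \quad v_k \sim N(0, \sigma_v^2)
```

In this form the year effect of main interest is the single parameter \beta_1, while u_j and v_k are the department and question random effects.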

Let's see if someone else has any comments on this modelling.


TS Contributor
"I want to compare responses to a particular survey question between one year and the next, to see if there is a statistically significant difference in the % agreeing"

Why do you want to do this, what will be done with the results,
and what will be the consequences? I'm asking because, as you
are already aware, Bonferroni adjustments or the like could
increase the type II error probability, and unadjusted multiple testing
could increase the type I error probability, so one has to take into
consideration the consequences of each error type in the present
study. In addition, it would be useful to know how large
the effects are which you expect and/or how large the
effects are which you need to reliably detect (desired power).

If e.g. you just want to give feedback to each department
separately, without comparisons between departments, then
I'd suppose that no adjustment for department needs to be made,
but within department (20 tests) a more conservative level of
significance (1% instead of 5%) could be useful, if you accept
moderate power.

Again, the strategy depends on the consequences
of the respective errors (for example, depending on such
considerations, as a preliminary step one could aggregate the
20 items and compare the mean or median number of agreements
between years, for each department).

By the way, since the subjects in each department are nearly identical between
years (right?), tests for independent samples are not quite correct. Is it
possible to match the 2012 and 2013 surveys? I am not sure, though, in
which way results would be biased if dependent measures are treated as
independent.
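If the two surveys could be matched person by person (which has not been confirmed here), one standard option for paired Agree / Not Agree data is McNemar's test. The sketch below is a hand-written version without continuity correction; the function name and the counts are invented.

```python
import math

def mcnemar_test(b, c):
    """McNemar's chi-square test (no continuity correction) for paired binary data.

    b = respondents who agreed in 2012 but not in 2013
    c = respondents who did not agree in 2012 but agreed in 2013
    Respondents who gave the same answer in both years do not enter the statistic.
    """
    if b + c == 0:
        raise ValueError("no respondent changed answer; the test is undefined")
    stat = (b - c) ** 2 / (b + c)
    # 1 degree of freedom: tail probability is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical department: 6 switched from agree to not agree, 15 the other way
stat, p = mcnemar_test(6, 15)
print(f"McNemar chi2 = {stat:.3f}, p = {p:.3f}")
```

When only a few respondents change their answer, an exact binomial version of the test would be more appropriate than the chi-square approximation.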

With kind regards