McNemar or Chi squared test for case-control study

Dear all,

I'm facing a dilemma about which test is appropriate for comparing the frequency of a feature in two samples of patients with the same disease. The samples are matched for age, sex and a risk score for a frequent outcome, but are otherwise unrelated. The samples differ in a feature that is also considered a risk for the frequent outcome but is not included in the risk score. The tested feature, whose frequency would be compared between the samples, is the presence of a genetic mutation (which can also be considered a risk for the outcome). So we have two matched but otherwise independent populations and a 2x2 contingency table. Logically, it seems to me that the Chi squared test would be appropriate.

However, I came across different opinions. One of them is that the two groups, although drawn from different populations, are not independent of each other because they are matched on common features. In that case paired tests should be used, such as the McNemar test, or the paired t-test for other comparisons.

I find this very interesting, but I'm not sure which approach would be reasonable. Would the same hold if the two samples were only age and sex matched? Dear readers, could you please provide your opinion.
Sometimes it is helpful to consider extreme situations. Imagine that you have a random sample of 40 suitable patients. You make a 2x2 table for the two features of interest. Then I see no reason why Chi squared shouldn't be used. However, if those 40 patients consisted of 20 pairs of (absolutely) identical twins, then in effect you have only 20 patients to work with, and to use all 40 results (really just 20 results written twice) would be clearly wrong. If the 40 patients were 20 pairs of non-identical twins, it would also be wrong to use all 40 results, but probably not as wrong as with identical twins. 20 pairs of siblings have the same problem, but not as badly. Finally, using 20 pairs of matched patients will also be wrong (or at least susceptible to valid criticism from referees). It depends on how alike the responses are from any matched pair.
Unfortunately McNemar's test will not fix this for the same reason (and probably isn't appropriate anyway). My only thought is a permutation test which maintains the paired nature of the data.
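The identical-twins extreme can be checked with a few lines of arithmetic. The sketch below is only an illustration: the counts are made up, and the helper names `chi_square_2x2` and `p_value_1df` are hypothetical. It computes the Pearson Chi squared statistic (without continuity correction) for a table of 20 patients and for the same table with every result written twice.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def p_value_1df(stat):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))

# Hypothetical counts for 20 genuinely distinct patients.
table_20 = (7, 3, 4, 6)
# The same 20 results written twice, as with 20 pairs of identical twins.
table_40 = tuple(2 * count for count in table_20)

stat_20 = chi_square_2x2(*table_20)
stat_40 = chi_square_2x2(*table_40)
print(stat_20, p_value_1df(stat_20))
print(stat_40, p_value_1df(stat_40))
```

Doubling every cell doubles the statistic, so the p value shrinks although no new information has been added. Weaker forms of pairing would presumably distort the test less, as the post argues.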
Thank you very much, that is a very interesting viewpoint and very well elaborated. So you consider that neither the Chi squared nor the McNemar test would be the optimal choice. Can you suggest what kind of test should be used to compare numerical variables between two matched samples?

If I understood correctly, we are wrong at some level in both situations (unpaired vs. paired tests, such as Mann-Whitney U vs. Wilcoxon signed rank, speaking at a basic level), but we have to make a compromise, and the paired approach takes into account some assumptions that the unpaired approach doesn't. What bothers me is that paired tests need far fewer individual patients to "prove" a significant association - it is easier to obtain a significant P value.

Speaking in extremes - it seems to me that we can run into an unethical situation here. Let's say we want to analyse a population of 1000 patients, and 100 of them have some feature of interest that defines two groups (with and without it, say advanced disease status). Then we would like to compare something between the groups (age, weight, some score... or the presence of another feature, like the genetic mutation in the previous question).
If we analyse a numerical variable we can use the t-test/Mann-Whitney U test - and suppose we observe that the P value is close to the significance level. But we can choose another approach: first find all patients with the feature (100 patients) and then find matched controls in the rest of our available population. This way cases and controls are similar in sample size, but they are now also "matched by some criteria". We can then argue that a paired test would be "more optimal" and use the paired t-test/Wilcoxon signed rank test instead - and now we get a firmly significant P value. I hope you see my concern here.
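The concern can be made concrete with a small sketch. The numbers below are invented for illustration, and `unpaired_t` and `paired_t` are hypothetical helper names. The same twelve measurements are analysed once as two independent samples (Welch t statistic) and once as six matched pairs (paired t statistic).

```python
from math import sqrt
from statistics import mean, stdev

def unpaired_t(x, y):
    """Welch t statistic, treating x and y as independent samples."""
    standard_error = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(y) - mean(x)) / standard_error

def paired_t(x, y):
    """Paired t statistic, treating (x[i], y[i]) as a matched pair."""
    diffs = [y_i - x_i for x_i, y_i in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical values: each case sits slightly above its matched control,
# while the pairs themselves differ a lot from one another.
controls = [10, 12, 14, 16, 18, 20]
cases = [11, 13.5, 15, 17.5, 19, 21.5]

print(unpaired_t(controls, cases))  # small: swamped by between-pair spread
print(paired_t(controls, cases))    # large: the spread cancels within pairs
```

In this made-up data the paired statistic is far larger, but only because the values within a pair really do move together. If the matching had no bearing on the measurement, the differences would be as noisy as the raw values and the paired analysis would gain nothing, so the choice of test follows from how the data were collected, not from which P value is preferred.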
We are talking about two different situations, I think.

You are right that matched pairs tests are better than independent samples tests in situations where you want to compare two groups: Treatment A/Treatment B, or with/without advanced disease status. If you want to compare two groups, then you are better off if the two groups are matched pairs. This removes much of the between-subject variation. Agricultural research establishments keep whole herds of identical twin animals for just that purpose. The ultimate in pairing for them would be to have all their animals identical by cloning. This is your t test/MW test situation (actually something like Wilcoxon, which uses paired data, rather than MW). So, you can look at one feature between two matched groups: is feature A more prevalent in the group with advanced disease status than in the group without? In short, use matched pairs if you can.

However, as I read your first post, you are looking for an association between two natural features of the patients - when a patient has feature A, are they more or less likely to have feature B, or doesn't it make any difference? In this case there are no Treatment A/Treatment B groups. To find an association you need a good collection of independent cases, and your patients are not independent because they are matched pairs. The ultimate pairing here would be for all patients to be genetically identical, which wouldn't tell us anything at all.

One possibility is to simply do the Chi square test on just one group. Or perhaps some sort of one off permutation/randomization test could be worked out which takes into account the paired nature of the data.
Cheers, kat
Let's say we are evaluating a retrospective data set of all patients with some disease treated in one department (one population). And let's say that we take patients with some other disease as controls (healthy in the organ affected by the first disease), but we try to find patients of similar age and sex (a kind of vague matching) so that there are no statistically significant differences in age and gender between the two groups. Then we analyse the frequency of a genetic mutation. I feel it would be more appropriate and fairer to investigate such a problem using the Chi squared test.

Now we separate the first group into two subgroups and check for the same thing (the frequency of something). Here we encounter the possible problem from the first post: analyse the data as they are with non-paired tests, or match patients in the two subgroups by some risk feature, if the sample is big enough, and choose a paired test for the analysis. Both criteria used to separate patients into two groups are intrinsic features of one population, as you accurately observed. Would Chi squared also be the better choice in this situation? I have a feeling it would; it would definitely be the more conservative choice (with regard to the P value).
Looking at the last post, and reading back to the original post, I think I may have misinterpreted the question and was unduly dismissive about McNemar's test. Here is a summary as I see it.
Let's say we want 40 patients, each classified by a selection factor A or B, and a testing factor X or Y.
Situation 1. Pick 40 patients at random. Make up a two way table AB vs XY. This table has a grand total of 40. Do a Chi square test. All assumptions are true. The test is valid. The p value is reliable.
Situation 2. Pick 20 patients at random. Choose 20 more patients, matching them as closely as possible to the first set, but ensuring that each new patient is the opposite of its pair with regard to factor A/B. Make up a two way table AB vs XY. This table has a grand total of 40. Do a Chi square test. This test is not valid because the subjects are not independent. The p value is not reliable. If the connection between the factors and the matching variables is practically zero, then the p value is the same as in situation 1, and is correct. If the connection between matching and factors is very strong, then the p value will be too high; although that is conservative, it is open to criticism on review. Unfortunately we don't know where on the weak/strong matching scale our data lies.
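The claim that the naive test becomes conservative under strong matching can be explored with a small simulation. This is a sketch under stated assumptions, not a model of any real data: the parameter `share` (the probability that a matched partner simply copies its pair's X/Y result) is a hypothetical stand-in for "strength of matching", X and Y are taken to be equally likely, and factor A/B never influences X/Y, so every rejection is a false positive.

```python
import random

CRITICAL_5PCT_1DF = 3.841  # chi-square critical value, 1 df, alpha = 0.05

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def rejection_rate(share, n_pairs=20, n_sim=2000, seed=1):
    """Fraction of simulated null data sets in which the naive test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        a_x = b_x = 0  # number of X results among the A and the B members
        for _ in range(n_pairs):
            x_a = rng.random() < 0.5
            # With probability `share` the partner copies the result,
            # otherwise it is drawn independently.
            x_b = x_a if rng.random() < share else rng.random() < 0.5
            a_x += x_a
            b_x += x_b
        stat = chi_square_2x2(a_x, n_pairs - a_x, b_x, n_pairs - b_x)
        if stat > CRITICAL_5PCT_1DF:
            rejections += 1
    return rejections / n_sim

rate_unmatched = rejection_rate(share=0.0)  # matching unrelated to X/Y
rate_strong = rejection_rate(share=0.9)     # partners almost always agree
print(rate_unmatched, rate_strong)
```

Under these assumptions the rejection rate with strong matching falls well below the rate with no effective matching, which is the sense in which the naive p value is "too high".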
Assuming that it is desirable to have matched pairs for other reasons, here are some solutions.
Solution 1. Just use the first 20 random subjects. Do a Chi square test. The p value will be valid, but because we only have 20 patients, the power will not be as great.
Solution 2. Do a permutation Chi square test which includes the matching. This will automatically allow for any amount of matching strength. You use all the data. The p value will be valid. This probably has the best power. Snag - it needs some specialist knowledge, and is harder to explain to referees.
Solution 3. McNemar's test. This uses a 2x2 table, but a different table from the ones above. Instead of AB vs XY it is XY vs XY - the A set of XY vs the B set of XY. Each pair is one of AXBX, AXBY, AYBX, or AYBY and so goes into the appropriate part of the table. Each entry in the table is one pair, so the table has a grand total of 20, half the number in the Chi square test. The test is valid and the p value is reliable. However, you start with only half the amount of data, and the test only uses the data in two of the cells (the discordant pairs AXBY and AYBX). The rest of the data is ignored. Consequently the power of the test can be quite low. It is also less powerful than the Chi square test when the strength of the matching is practically zero. It may be better to do a straight Chi square test on half the data as in solution 1. McNemar's test is most useful when you have repeated evaluation of the same patient and so are forced to the ultimate in matched pairs - each patient is its own matched pair.
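Solutions 2 and 3 can be sketched side by side. Everything here is an assumption made for illustration: the 20 pairs are invented, each pair is coded as (A member, B member) with 1 for X and 0 for Y, the function names are hypothetical, and the permutation scheme (randomly swapping the two results within each pair) is only one possible reading of "a permutation Chi square test which includes the matching". The McNemar p value is the exact binomial version.

```python
import random
from math import comb

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def ab_table(pairs):
    """AB vs XY table; grand total is twice the number of pairs."""
    n = len(pairs)
    a_x = sum(x_a for x_a, _ in pairs)
    b_x = sum(x_b for _, x_b in pairs)
    return a_x, n - a_x, b_x, n - b_x

def paired_permutation_p(pairs, n_perm=5000, seed=0):
    """Permutation p value that keeps pairs intact and swaps results within them."""
    rng = random.Random(seed)
    observed = chi_square_2x2(*ab_table(pairs))
    hits = 0
    for _ in range(n_perm):
        shuffled = [(x_b, x_a) if rng.random() < 0.5 else (x_a, x_b)
                    for x_a, x_b in pairs]
        if chi_square_2x2(*ab_table(shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def mcnemar_table(pairs):
    """XY vs XY table; grand total is the number of pairs."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for pair in pairs:
        counts[pair] += 1
    return counts

def mcnemar_exact_p(pairs):
    """Two-sided exact (binomial) McNemar p value from the discordant pairs."""
    counts = mcnemar_table(pairs)
    n_discordant = counts[(1, 0)] + counts[(0, 1)]
    if n_discordant == 0:
        return 1.0
    k = min(counts[(1, 0)], counts[(0, 1)])
    tail = sum(comb(n_discordant, i) for i in range(k + 1)) / 2 ** n_discordant
    return min(1.0, 2 * tail)

# Hypothetical data: 12 AXBY, 1 AYBX, 4 AXBX and 3 AYBY pairs.
pairs = [(1, 0)] * 12 + [(0, 1)] * 1 + [(1, 1)] * 4 + [(0, 0)] * 3

print(ab_table(pairs))        # 40 patients in total
print(mcnemar_table(pairs))   # 20 pairs in total
print(paired_permutation_p(pairs), mcnemar_exact_p(pairs))
```

One caveat about this particular sketch: swapping results within a pair cannot change a concordant pair, so only the discordant pairs move the AB vs XY table, and the permutation p value ends up close to the exact McNemar one. A permutation scheme that gains power over McNemar's test would have to use more information than within-pair swaps of a single binary result, and the post may well have had such a scheme in mind.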