Sparsity in an m*n contingency table

#1
I am attempting to find a citation for or document the use of a correction factor for an \(m \times n\) contingency table with sparse data (for most of the tables, 20% to 80% of the cells have a value of 0).

I have been working with a statistician who insists that adding 0.5 to each value is an appropriate and common method of dealing with sparsity. The situation is as follows:

The data are for a preliminary analysis to judge whether proposed projects are worthwhile. The table consists of the occurrences of haplotypes in different regions: 130 haplotypes and 42 geographically defined regions. The aim is to demonstrate whether or not there is variation in haplotype distribution across the regions. Due to the high genetic variability, many haplotypes may be observed only once in the data.

The statistician proposed a permutation test using the \(\chi^2\) score as the test statistic. Prior to computing the \(\chi^2\) for the observed data and for each permutation, 0.5 is added to every cell of the table.
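
To make the procedure concrete, here is a minimal sketch of how I understand it, not the statistician's actual code. It assumes the permutation shuffles the region labels across individual observations, and that 0.5 is added to every cell just before the \(\chi^2\) score is computed. The function names (`chi2_stat`, `permutation_test`) and the parameters `add`, `n_perm`, and `seed` are my own illustrative choices.

```python
import numpy as np

def chi2_stat(table, add=0.5):
    """Pearson chi-square score after adding `add` to every cell."""
    t = np.asarray(table, dtype=float) + add
    # Expected counts under independence, from the (adjusted) margins.
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    return float(((t - expected) ** 2 / expected).sum())

def permutation_test(table, n_perm=999, add=0.5, seed=0):
    """Permutation p-value using the adjusted chi-square score as the metric."""
    table = np.asarray(table, dtype=int)
    n_rows, n_cols = table.shape

    # Expand the table into one (row, column) label pair per observation.
    rows, cols = np.nonzero(table)
    counts = table[rows, cols]
    row_labels = np.repeat(rows, counts)
    col_labels = np.repeat(cols, counts)

    rng = np.random.default_rng(seed)
    observed = chi2_stat(table, add)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle column labels (assumed here to be the regions), keeping both margins fixed.
        shuffled = rng.permutation(col_labels)
        permuted = np.zeros((n_rows, n_cols), dtype=int)
        np.add.at(permuted, (row_labels, shuffled), 1)
        if chi2_stat(permuted, add) >= observed:
            exceed += 1

    # Count the observed table as one of the permutations.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

Because the same 0.5 is added to the observed table and to every permuted table, the adjustment changes the metric being compared, while the p-value still comes from the permutation distribution and not from the \(\chi^2\) reference distribution.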

I requested references showing that adding 0.5 is appropriate, and he insists that it is common enough that I should be able to find one. So far, the closest I have come is the Yates continuity correction, which the statistician insists is not what he means.

That's a long way of saying: does anyone know of a literature citation that introduces or supports adding 0.5 to an \(m \times n\) contingency table?
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Yeah, it is common practice. Finding a citation for just adding it to a table may be difficult; you have a better chance of finding a citation for adding it when conducting a specific procedure.

I believe that if you use exact methods (e.g., Fisher's exact test), you typically don't have to add it. However, if you leave zero cells uncorrected, you may be unable to calculate relative risks and odds ratios.
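
As a small illustration of the zero-cell problem, here is a sketch for a single 2×2 table with made-up counts. The function `odds_ratio` and its `add` parameter are hypothetical names for this example only, and the numbers are not from the haplotype data.

```python
def odds_ratio(a, b, c, d, add=0.0):
    """Odds ratio (a*d)/(b*c) for a 2x2 table, optionally adding `add` to every cell."""
    a, b, c, d = (x + add for x in (a, b, c, d))
    if b == 0 or c == 0:
        return None  # undefined: the denominator is zero
    return (a * d) / (b * c)

# Made-up counts with one empty cell.
raw = odds_ratio(5, 0, 3, 7)                 # None, cannot be computed
adjusted = odds_ratio(5, 0, 3, 7, add=0.5)   # finite after adding 0.5
```

With an empty cell the ratio cannot be computed at all, which is the usual motivation for adding 0.5 in that specific procedure.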

Sorry, no citations come to my mind.
 

gianmarco

TS Contributor
#3
The table consists of the occurrences of haplotypes in different regions: 130 haplotypes and 42 geographically defined regions. The aim is to demonstrate whether or not there is variation in haplotype distribution across the regions.
Just wondering how you will perform the analysis. What approach will you use?

Gm