I am analyzing how a gene frequency (technically an allele frequency) correlates with certain ecological variables (mainly temperature) using regression in STATA. I have genotype frequencies from several hundred populations, along with the corresponding ecological variables. The problem is that the sample size used to estimate each population's frequency varies radically. In some populations only 5 individuals were sampled, whereas in others many thousands were sampled (the median study sample size is 105 individuals). I realize it would be inappropriate to simply weight the regression by the population sample size, and that I instead need to derive something like confidence intervals and weight by those. I also expect that the true genotype frequency affects the sample size needed to be confident that a sample reflects the real population.

The genotype I am concerned with averages 14.6% of the population worldwide and ranges from 0% to 49%, with an SD of 0.0845 (as a proportion, i.e. 8.45 percentage points).
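
To make the concern concrete, here is a minimal sketch (Python, standard library only) of how the uncertainty in a sampled proportion depends on both the frequency and the sample size. It uses the usual binomial standard error, a Wilson score interval, and an inverse-variance weight. The function names are hypothetical, the sample size of 5000 stands in for "many thousands", and treating each population as a simple binomial sample is an assumption, not something I have verified for my data.

```python
import math


def binomial_se(p, n):
    """Standard error of a sample proportion p estimated from n individuals."""
    return math.sqrt(p * (1.0 - p) / n)


def wilson_interval(p, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


def inverse_variance_weight(p, n):
    """Weight proportional to 1 / Var(p_hat).

    Returns NaN when the observed proportion is 0 or 1, because the
    plug-in binomial variance is then zero and the weight is undefined.
    """
    var = p * (1.0 - p) / n
    return 1.0 / var if var > 0 else float("nan")


# Illustrative values: the worldwide mean frequency and three sample sizes
# (smallest study, median study, and an assumed large study).
p = 0.146
for n in (5, 105, 5000):
    lo, hi = wilson_interval(p, n)
    print(f"n={n:5d}  SE={binomial_se(p, n):.4f}  "
          f"95% CI=({lo:.3f}, {hi:.3f})  weight={inverse_variance_weight(p, n):.1f}")
```

One thing this sketch makes visible: populations with an observed frequency of exactly 0% get a plug-in variance of zero, so a naive inverse-variance weight breaks down for them and some adjustment would be needed.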

Don’t let the fact that I am using genes make you think this is more complicated than it is. We could replace genes with hair color and the question would be the same (black hair versus not black hair in different human populations).

I assume there must be a literature on this, but I am not quite sure where to start. If, along with advice, you can provide a scholarly citation or two, that would be great, so that I can justify my approach in the manuscript I am preparing.

(I realize there are other problems with assuming that these populations are independent, and I am working to address those by other means.)

Many thanks,

Dan