I have a conceptual issue with some data I am analysing. I am looking at the point prevalence of a particular disease as reported in medical journal articles, and am then regressing this on time (median year of study) and per capita GDP for the year, controlling for things like geographical area and study setting.

The problem I have is in weighting the studies. I could just use study size, but I would like to use some sort of variance estimate. Since I have the total number of people in each study and the total number of people diagnosed with the disease, I could use p(1-p)/n.

BUT the disease in question is interesting: it is quite difficult to diagnose, but when you do diagnose it you are usually correct. The error in these studies is therefore likely to be in the number of undiagnosed cases. Or, in other words, a group which finds 10 cases in 100 is more likely to be nearer the true rate than a group which finds 1 in 100. Perhaps they are both equally correct, but what is very unlikely is the reverse, i.e. that the group finding 1 in 100 are more correct than those finding 10 in 100.

However, the variance term p(1-p) (which ends up in the denominator of an inverse-variance weight) increases as you go from p = 0.01 up to p = 0.5. In my data, all prevalence values are less than 0.25, and hence the higher estimates are progressively more punished in terms of variance than the lower ones. As discussed, this is contrary to the intuitive prediction that studies with higher estimates are likely to be more correct...
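To make the problem concrete, here is a minimal sketch (with made-up study numbers) of how binomial inverse-variance weights behave. The study reporting 10/100 receives a much smaller weight than the one reporting 1/100, which is the opposite of what the diagnostic argument above suggests:

```python
def inverse_variance_weight(cases, n):
    """Weight = 1 / (p(1-p)/n) for an observed prevalence of cases/n."""
    p = cases / n
    return n / (p * (1 - p))

# Hypothetical studies of equal size but different observed prevalence:
w_low = inverse_variance_weight(1, 100)    # p = 0.01
w_high = inverse_variance_weight(10, 100)  # p = 0.10

print(round(w_low))   # ~10101 -- the 1/100 study dominates
print(round(w_high))  # ~1111
```

So under p(1-p)/n weighting, the study most likely to have missed cases gets roughly nine times the weight of the study that found more of them.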

There is generally insufficient information in the papers to derive a robust quality assessment to use for weighting.

Does anyone have any ideas? Is there an alternative way of estimating variance in data like this? Or should I just stick with study size?

Many thanks!