I have a "two-column" data set, with a multi-class categorical variable A and a two-class variable B. Each observation is assumed to be independent.
For each category of variable A, I want to make a Bayesian estimate of a binomial parameter for class 1 of variable B, consistent with the number of B=0 and B=1 observations for each value of A.
If my data set had been created by true random sampling, the answer to this would be very easy. I'd apply a beta prior to a binomial likelihood, and for each class A get a beta posterior that depends straightforwardly on the number of B=0 and B=1 counts within class A.
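As a minimal sketch of that easy case, assuming a Beta(a, b) prior and per-category counts n1 (B=1) and n0 (B=0). The function name, the uniform prior default and the example counts are illustrative and not taken from the actual data:

```python
def beta_posterior(n1, n0, a=1.0, b=1.0):
    """Conjugate update: Beta(a, b) prior + binomial counts -> Beta(a + n1, b + n0)."""
    a_post = a + n1
    b_post = b + n0
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1))
    return a_post, b_post, mean, var

# Hypothetical counts for a single category of A
a_post, b_post, mean, var = beta_posterior(n1=30, n0=70)
print(a_post, b_post, mean, var)
```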
However, in this data the class ratios of variable B are very skewed, i.e. there are many more members of class 0 (B=0) than of class 1 (B=1). For this reason, the data set I've been given contains all of the members of class 1 plus a 1% sample of class 0 (and it still contains millions of "rows").
It's easy enough to turn the crank and perform the beta-binomial Bayesian analysis on the sampled data. My question is how to rigorously project the result from the sampled data back to the original data.
Intuitively, I'm fairly sure that dividing the estimated means of the binomial parameters by the sampling ratio for class 0 of variable B (100 for a 1% sample) leads to the right mean, at least in cases where there are large counts of B=0 and B=1 within a class of A.
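A quick numerical check of this intuition, under the assumption that every B=1 row is kept and each B=0 row is kept independently with probability r = 0.01. Under that assumption it is the odds, rather than the mean, that scale exactly by the sampling ratio, and dividing the mean by 100 is close only when the B=1 proportion in the sampled data is itself small. The function name and the example proportions are illustrative:

```python
def project_to_original(q, r=0.01):
    """Map a B=1 proportion q on the sampled data back to the original scale.

    Assumes all B=1 rows are kept and B=0 rows are kept with probability r,
    so the odds shrink by a factor of r: p / (1 - p) = r * q / (1 - q).
    """
    return r * q / (r * q + (1.0 - q))

for q in (0.001, 0.05, 0.5):
    exact = project_to_original(q)
    naive = q * 0.01  # dividing the sampled-data mean by 100
    print(f"q={q}: odds-based p={exact:.6g}, naive p={naive:.6g}")
```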
However, I'm not sure that treating the variance is as straightforward, or that this approach is valid at all when the counts within a class of A are small.
Are there any suggestions on how to set this problem up from the start, in a way that includes the additional sampling step for the B=0 data, so I can see how the unequal sampling propagates through the problem?
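One possible way to write this setup down, offered as a sketch under stated assumptions rather than a definitive answer: if each B=0 row is retained independently with probability r, then within a class of A a retained row has B=1 with probability q = p / (p + r(1 - p)), where p is the parameter on the original data. The sampled counts are then binomial in q, and posterior draws of q can be mapped back to p, which carries the mean, variance and intervals through the sampling step, including for small counts. The sketch assumes the beta prior is placed on q rather than on p, and the function name and counts are hypothetical:

```python
import random
import statistics

def posterior_p_draws(n1, m0, r=0.01, a=1.0, b=1.0, size=20_000, seed=0):
    """Draw from the posterior of p (original scale) for one class of A.

    n1: number of B=1 rows (all kept); m0: number of B=0 rows after sampling.
    Assumption: Beta(a, b) prior on q, the B=1 probability among kept rows.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        q = rng.betavariate(a + n1, b + m0)        # conjugate posterior on the sampled scale
        draws.append(r * q / (r * q + (1.0 - q)))  # invert q = p / (p + r * (1 - p))
    return draws

draws = posterior_p_draws(n1=30, m0=70)
ordered = sorted(draws)
print("posterior mean of p:", statistics.fmean(draws))
print("posterior variance of p:", statistics.pvariance(draws))
print("central 95% interval:",
      ordered[int(0.025 * len(ordered))], ordered[int(0.975 * len(ordered))])
```

Placing the prior on p instead would require a change of variables in the prior density, and a fixed-size 1% sample (rather than independent retention) would change the likelihood slightly, so both are left as open choices here.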