I would like to give the scientist who asked me the question a simple test, so my thinking was to give them a prediction interval for the mean $\bar y$ of the sample of size $m$. Let the null hypothesis be that both samples are from the same population, and the alternative hypothesis that they are not. Under the null hypothesis, $(\bar x - \bar y)/(\sigma \sqrt{1/n+1/m}) \sim N(0,1)$, where $\sigma^2$ is the variance of our initial population. However, as usual, we don't know $\sigma^2$. The standard thing to do here would be to use the t-distribution, take a weighted mean of the sample variances of the $x$'s and $y$'s as an estimate for $\sigma^2$, and take $n+m-2$ as the degrees of freedom.
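
For concreteness, the standard pooled procedure can be written out as follows. The symbols $s_x^2$, $s_y^2$ and $s_p^2$ are notation introduced only for this illustration: the sample variances of the $x$'s and $y$'s and their weighted mean.

$$s_p^2 = \frac{(n-1)\,s_x^2 + (m-1)\,s_y^2}{n+m-2}, \qquad \frac{\bar x - \bar y}{s_p \sqrt{1/n+1/m}} \sim t_{n+m-2} \quad \text{under the null hypothesis, assuming normal data.}$$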

I don't want to do that, though. In this situation there will be many possible populations that the sample could be from, and I want the test to be simple (i.e. not require them to calculate the variance of the $y$'s, take the weighted mean of the variances, calculate the degrees of freedom, put everything into the t-distribution, etc.). So what I want to do is give them, for each population, a prediction interval for $\bar y$ that is calculated without using the sample variance of $y_1,\dots,y_m$. My thinking is to simply use $s^2$, the sample variance of the $x$'s, as an estimate for the variance $\sigma^2$, so that approximately $(\bar x - \bar y)/(s \sqrt{1/n+1/m}) \sim t_r$. My question then is: what is the degrees of freedom $r$? My feeling is that $n+m-2$ is too optimistic, because $s^2$ was calculated from a sample of size only $n$, and that therefore $n-1$ is the correct value. This seems basically to agree with what is done in Chapter 2 of "Predictive Inference: An Introduction" by Geisser, except that there only the case $m=1$ is done in this fashion, and the rest is somewhat different.
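
One rough way to sanity-check the guess about $r$ is a small simulation. The sketch below is only an illustration under stated assumptions: the data are normal with a common mean and variance (i.e. the null hypothesis holds), the sample sizes $n=8$ and $m=10$ are arbitrary, and the function names are made up for this example. It compares the simulated variance of the statistic with $r/(r-2)$, the variance of a $t_r$ distribution, for the two candidate values of $r$. Agreement in variance is a quick plausibility check, not a proof.

```python
import math
import random


def simulate_statistic(n, m, reps, seed=0, mu=0.0, sigma=1.0):
    """Simulate (xbar - ybar) / (s * sqrt(1/n + 1/m)) under the null.

    Assumption: both samples are normal with the same mean and variance.
    Only the x sample is used to estimate the variance.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 / n + 1.0 / m)
    values = []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        y = [rng.gauss(mu, sigma) for _ in range(m)]
        xbar = sum(x) / n
        ybar = sum(y) / m
        # Sample variance of the x's only (divisor n - 1).
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        values.append((xbar - ybar) / (math.sqrt(s2) * scale))
    return values


def sample_variance(values):
    """Unbiased sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def t_variance(r):
    """Variance of a t distribution with r > 2 degrees of freedom."""
    return r / (r - 2)


n, m = 8, 10  # illustrative sample sizes, not from the original question
observed = sample_variance(simulate_statistic(n, m, reps=100_000))

print(f"simulated variance of the statistic: {observed:.3f}")
print(f"variance of t with n-1 df:           {t_variance(n - 1):.3f}")
print(f"variance of t with n+m-2 df:         {t_variance(n + m - 2):.3f}")
```

Changing `n` and `m` (keeping `n` at least 6 so that the simulated variance is reasonably stable) shows how the comparison behaves for other sample sizes.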

Has anyone seen this question before? A reference would be great.

Thanks, Greg