I've been asked the following question by a scientist who is not a statistician. Suppose we take a sample $x_1, \ldots, x_n$ from a population that we can assume is normal (I don't think this assumption is critical). We then calculate the sample mean $\bar x$ and sample variance $s^2$. We then take a different sample $y_1, \ldots, y_m$ from an unknown (normal) population, and what we want to test is whether the two samples could be from the same population.
I would like to give the scientist a simple test, so my thinking was to give them a prediction interval for the mean $\bar y$ of the sample of size $m$. Let the null hypothesis be that the two samples are from the same population, and the alternative hypothesis that they are not. Under the null hypothesis, $(\bar x - \bar y)/(\sigma \sqrt{1/n+1/m}) \sim N(0,1)$, where $\sigma^2$ is the variance of our initial population. However, as usual, we don't know $\sigma^2$. The standard thing to do here would be to use the t-distribution, take a weighted mean of the sample variances of the $x$'s and $y$'s (the pooled variance) as an estimate for $\sigma^2$, and take $n+m-2$ as the degrees of freedom.
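For comparison, the standard pooled procedure described above can be sketched in a few lines. This is only an illustration using the Python standard library; the function name `pooled_t_statistic` is made up for this example, and equal population variances are assumed.

```python
import math
import statistics


def pooled_t_statistic(x, y):
    """Return (t, df) for the classical equal-variance two-sample t-test.

    Hypothetical helper for illustration; assumes both samples come from
    normal populations with a common variance.
    """
    n, m = len(x), len(y)
    # Sample variances with the usual n-1 and m-1 denominators.
    s2_x = statistics.variance(x)
    s2_y = statistics.variance(y)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    df = n + m - 2
    s2_pooled = ((n - 1) * s2_x + (m - 1) * s2_y) / df
    # Standardised difference of the two sample means.
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(
        s2_pooled * (1 / n + 1 / m)
    )
    return t, df
```

The returned statistic would then be compared with a critical value of the t-distribution with `df` degrees of freedom, taken from a table or a statistics package.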
I don't want to do that, though. In this situation there will be many possible populations that the sample could be from, and I want the test to be simple (i.e. not require them to calculate the variance of the $y$'s, take the weighted mean of the variances, calculate the degrees of freedom, put everything in the t-distribution, etc.). So what I want to do is give them a prediction interval for $\bar y$ for each population, calculated without using the sample variance of $y_1, \ldots, y_m$. My thinking is to simply use $s^2$, the sample variance of the $x$'s, as an estimate for the variance $\sigma^2$, and then approximately $(\bar x - \bar y)/(s \sqrt{1/n+1/m}) \sim t_r$. My question then is, what are the degrees of freedom $r$? My feeling is that $n+m-2$ is too optimistic, because $s^2$ was calculated from a sample of size only $n$, and that therefore $n-1$ is the correct value. This seems basically to agree with what is done in Chapter 2 of "Predictive Inference: An Introduction" by Geisser, except that there only the case $m=1$ is done in this fashion, and the rest is somewhat different.
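To make the proposal concrete, here is a minimal sketch of the interval $\bar x \pm t \, s \sqrt{1/n+1/m}$ that the scientist would be handed. The function names are hypothetical, and the critical value `t_crit` is assumed to be looked up separately (from a t-table or software) for whatever degrees of freedom $r$ turns out to be appropriate.

```python
import math
import statistics


def prediction_interval_for_mean(x, m, t_crit):
    """Interval for the mean of a future sample of size m, using only x.

    Hypothetical helper: t_crit is a two-sided critical value of the
    t-distribution supplied by the caller.
    """
    n = len(x)
    x_bar = statistics.mean(x)
    s = statistics.stdev(x)  # square root of the sample variance s^2
    half_width = t_crit * s * math.sqrt(1 / n + 1 / m)
    return x_bar - half_width, x_bar + half_width


def is_consistent(y, interval):
    """True if the mean of y falls inside the precomputed interval."""
    lo, hi = interval
    return lo <= statistics.mean(y) <= hi
```

For example, with $n = 5$ and a 95% two-sided interval, one might pass `t_crit=2.776`, the tabulated value for 4 degrees of freedom. The scientist then only has to compute $\bar y$ and check whether it lies in the interval.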
Has anyone seen this question before? A reference would be great.
Thanks, Greg
Nevermind, I figured it out. It's n-1.
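A sketch of the argument, assuming both samples are independent and normal with common variance $\sigma^2$ under the null hypothesis:

$$\bar x - \bar y \sim N\!\left(0,\; \sigma^2\left(\tfrac{1}{n}+\tfrac{1}{m}\right)\right), \qquad \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}.$$

These two quantities are independent, because $s^2$ is independent of $\bar x$ for a normal sample and does not involve the $y$'s at all. Writing $Z = (\bar x - \bar y)/(\sigma\sqrt{1/n+1/m})$ and $V = (n-1)s^2/\sigma^2$,

$$\frac{\bar x - \bar y}{s\sqrt{1/n+1/m}} = \frac{Z}{\sqrt{V/(n-1)}} \sim t_{n-1}.$$

So under these assumptions the result is exact rather than approximate, and the degrees of freedom come only from the sample used to estimate the variance.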
Thanks,
Greg