# Thread: Should splitting data improve the standard error?

1. ## Should splitting data improve the standard error?

I am simulating random samples of 20 from normal data with mean 20 and SD 5. The SE of the mean should be SD/sqrt(20) = 5/sqrt(20) = 1.12. If I estimate the SE from this formula using the sample SD, it varies from sample to sample, but over several thousand iterations it settles down to an average value of 1.12. All good so far.

I can also split the sample of 20 into two samples of 10 and get two “submeans”. The SD of these submeans is 5/sqrt(10). Now, if I find the SD of these two submeans and estimate the SE using the formula SD of the means/sqrt(2), I should get another estimate of the SE of the overall mean, because (5/sqrt(10))/sqrt(2) = 5/sqrt(20) as before. Each sample will produce its own submeans and its own estimate of the SE of the overall mean. However, when I do this several thousand times, the estimated SE averages out at 0.82. Apparently splitting the data and using the formula SD of the means/sqrt(2) has made the SE smaller on average.

This doesn’t seem right. Any thoughts?
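For anyone who wants to try the comparison described in this post, here is a minimal sketch of such a simulation. It is not the original poster's code: the seed, the iteration count, and the variable names are assumptions made for illustration. One thing worth checking is whether the SE estimates or their squares are being averaged. The sample SD is a downward-biased estimator of the true SD, and the bias is largest for very small samples such as n = 2, so a plain average of SE values is expected to come out lower for the split method even though both squared estimators target the same variance.

```python
import numpy as np

# Illustrative settings; the seed and iteration count are arbitrary assumptions.
rng = np.random.default_rng(0)
n_iter, n, mu, sigma = 20_000, 20, 20.0, 5.0

# Each row is one simulated sample of 20 values.
samples = rng.normal(mu, sigma, size=(n_iter, n))

# Estimate 1: SE from the whole sample, SD / sqrt(20).
se_full = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Estimate 2: split each sample into two halves of 10,
# then take SD of the two submeans / sqrt(2).
submeans = samples.reshape(n_iter, 2, n // 2).mean(axis=2)
se_split = submeans.std(axis=1, ddof=1) / np.sqrt(2)

# Plain averages of the SE estimates (affected by the bias of the sample SD).
print("mean SE      full:", se_full.mean(), " split:", se_split.mean())

# Root of the average squared SE (the variance estimators are unbiased).
rms_full = np.sqrt((se_full ** 2).mean())
rms_split = np.sqrt((se_split ** 2).mean())
print("root mean SE^2 full:", rms_full, " split:", rms_split)

# Spread of the SE estimates from sample to sample.
print("SD of SE     full:", se_full.std(), " split:", se_split.std())
```

Comparing the first and second printed lines shows how much of the apparent difference comes from averaging SDs rather than variances, and the third line shows how much noisier the two-submean estimate is.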

2. ## Re: Should splitting data improve the standard error?

So are you now taking two separate samples of 10, or are you randomly taking 10 values from your sample of 20?

3. ## Re: Should splitting data improve the standard error?

Thanks for looking at this. In the simulation I tried both: taking two samples of 10 and then combining them into one, or putting the first 10 out of 20 into one sample and the other half into the second. It makes no difference to the results.

The situation is this - I'm planning several estimates of average tree size in a (hopefully) uniform forest, say 15 of them of 30 trees each. The traditional method is to average the 15 estimates and use the SD of the means and the number of samples to get SE = SD/sqrt(15). However, with the data I've got it is also possible to pool the data and find a mean and SE from the pooled data using SD of all the data/sqrt(450). So what I put in my first post was just a mini version of the real situation.

The problem is that the SE from the traditional method is lower on average than from the pooled method, but the sampling distribution of the SE from the traditional method is much wider than that from the pooled method, so the pooled estimate of the SE is more precise. I think the pooled method is correct (or better, anyway) even if it is higher, but I have to convince other people. Cheers
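The forest scenario in this post can be sketched the same way. This is only an illustration under stated assumptions: tree sizes are drawn from a single normal distribution with a hypothetical mean of 20 and SD of 5 (reusing the numbers from the first post), and the forest is treated as perfectly uniform, so there is no extra plot-to-plot variation. The seed, iteration count, and names are also assumptions.

```python
import numpy as np

# Hypothetical settings: 15 plots of 30 trees, all from one normal population.
rng = np.random.default_rng(1)
n_iter, n_plots, n_trees = 10_000, 15, 30
mu, sigma = 20.0, 5.0  # assumed tree-size mean and SD, for illustration only

# trees[i, j, :] holds the 30 tree sizes of plot j in simulated survey i.
trees = rng.normal(mu, sigma, size=(n_iter, n_plots, n_trees))

# Traditional method: SD of the 15 plot means / sqrt(15).
plot_means = trees.mean(axis=2)
se_trad = plot_means.std(axis=1, ddof=1) / np.sqrt(n_plots)

# Pooled method: SD of all 450 trees / sqrt(450).
pooled = trees.reshape(n_iter, n_plots * n_trees)
se_pooled = pooled.std(axis=1, ddof=1) / np.sqrt(n_plots * n_trees)

true_se = sigma / np.sqrt(n_plots * n_trees)
print("true SE:", true_se)
print("mean SE   traditional:", se_trad.mean(), " pooled:", se_pooled.mean())
print("SD of SE  traditional:", se_trad.std(), " pooled:", se_pooled.std())
```

Under these assumptions the two methods aim at the same quantity, and the comparison of interest is the last printed line: how much each SE estimate varies from one simulated survey to the next.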

4. ## Re: Should splitting data improve the standard error?

I'm convinced. The pooled method is better.

6. ## Re: Should splitting data improve the standard error?

Hi,
Maybe your simulation has an error? I also simulated these scenarios, and when I split the original group in two the variance increased by a factor of 2. In general, if I split an original group into k subgroups, the variance of the mean increases by a factor of k and the standard error by a factor of sqrt(k). So it is definitely a bad idea to split the group.

Regards

8. ## Re: Should splitting data improve the standard error?

BTW,
it should be simple to demonstrate this mathematically. Could anyone give it a try?

It seems that V(N/k) = k*V(N), where V(N) is the variance of the mean estimate using a sample of size N, and V(N/k) is the variance of the mean estimate obtained by first using samples of size N/k and then averaging the k estimates?
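As a starting point for the requested demonstration, here is a short worked sketch. It assumes independent observations with a common variance $\sigma^2$ and $k$ equal subgroups of size $N/k$; the symbols $\bar{x}_N$, $\bar{x}_{N/k}$, and $\bar{x}_{N/k}^{(j)}$ are introduced here for illustration and do not come from the thread.

```latex
% Variance of the mean of one subgroup of size N/k:
\operatorname{Var}\!\left(\bar{x}_{N/k}\right)
  = \frac{\sigma^2}{N/k}
  = k\,\frac{\sigma^2}{N}
  = k\,\operatorname{Var}\!\left(\bar{x}_N\right)

% Variance of the average of the k independent subgroup means:
\operatorname{Var}\!\left(\frac{1}{k}\sum_{j=1}^{k}\bar{x}_{N/k}^{(j)}\right)
  = \frac{1}{k^2}\cdot k\cdot k\,\frac{\sigma^2}{N}
  = \frac{\sigma^2}{N}
  = \operatorname{Var}\!\left(\bar{x}_N\right)
```

Under these assumptions the factor of $k$ applies to each individual subgroup mean. Once the $k$ subgroup means are averaged, the variance returns to $\sigma^2/N$, the same as for the mean of the full sample.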
