I constructed the attached mock data samples with four variables, A, B, C, D, each of which is either present (1) or not (0). The true value for each is 2, 4, 6, and 8, respectively, and each has a standard deviation of 0.5, 1, 3, and 10, respectively. I generated 1000 observations using the RAND() function (since copied and pasted as values) and the results of each observation is in column L of the Original Data worksheet. I then ran Excel’s regression analysis (constant set to zero) and came up with the regression results in the Regression Results worksheet.

This shows the standard error for each coefficient as all being relatively close (approx. 0.42), but looking at the data clearly the bulk of the deviation comes from the inclusion of variable D. In fact, variable D appears in slightly less than half of all trials (48%), but nearly all of the SSE comes from the trials where D is present (95%).

It would seem to me there would be some way to capture this.

So, figuring the SSE would be a function of the standard deviation for each coefficient, I then ran a regression on the SSE thinking the result would be the variance of each coefficient, but then I end up with negative numbers. I’ve tried other data sets and I almost always get similar results.

It seems to me if the standard deviation of each coefficient can be estimated, then a better measure of standard error for each coefficient can be calculated, but obviously this is not the case since it isn’t done this way.

Am I missing something or going about this the wrong way?

LKM