Improving an estimate of a mean value


I have a dataset that has 10 independent descriptors and one dependent statistic, b. Only a few of the descriptors are quantitative and most are qualitative.

For an arbitrary set of descriptors there is no guarantee that any record in the dataset matches the set exactly: a record might match, say, 6 descriptors, or there may be only a few records that match all descriptors but hundreds that match a subset of, say, 8.

I want to identify the 'best estimate' for b given an arbitrary set of descriptors. I define B as the average of all b values in a given subset of the dataset.
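For concreteness, B for a partial match can be computed by averaging b over every record whose fields agree with the queried descriptors. A minimal sketch in Python, with hypothetical record layout, descriptor names and values:

```python
# Sketch: B is the mean of b over all records matching a (possibly
# partial) set of descriptors.  Records, field names and values are
# hypothetical.
records = [
    {"D1": "red", "D2": "small", "b": 4.0},
    {"D1": "red", "D2": "large", "b": 6.0},
    {"D1": "blue", "D2": "small", "b": 5.0},
]

def estimate_B(records, descriptors):
    """Mean of b over records whose fields match every queried descriptor."""
    matched = [r["b"] for r in records
               if all(r.get(k) == v for k, v in descriptors.items())]
    return sum(matched) / len(matched) if matched else None

B = estimate_B(records, {"D1": "red"})   # averages 4.0 and 6.0
```

The more descriptors in the query, the smaller the matching subset becomes, which is exactly the trade-off at issue here.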

My initial analysis was to filter the data for qualitative descriptor values and then run a multiple regression; however, the correlation coefficients were so poor that I gave up on this approach. [I recognise that this could be the key problem; however, I am still required to find a better estimate of B.]

My next idea is to take the initial estimate of B to be the average of all b values for the whole dataset, B0, with corresponding stdev S0. Now match one descriptor (giving B1 and S1) and pose the null hypothesis that B0=B1 and S0=S1.
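The proposed test on the means could be sketched as below. One caveat: comparing the subgroup against the whole dataset means the two samples overlap, so it is cleaner to compare the subgroup against its complement; a Welch t-test is one reasonable choice when S0 and S1 may differ. All data here are synthetic:

```python
# Sketch of the proposed test: compare the mean of b within the
# subgroup matching one descriptor (B1) against the remaining records.
# Using the *complement* rather than the whole dataset keeps the two
# samples independent.  Values are synthetic.
from scipy import stats

b_matching = [5.1, 4.9, 5.3, 5.0, 5.2]       # records where D1 == some value
b_rest     = [4.0, 6.5, 3.8, 6.9, 4.2, 6.1]  # all other records

t_stat, p_value = stats.ttest_ind(b_matching, b_rest, equal_var=False)
# Small p-value -> reject H0 (the descriptor shifts the mean of b);
# large p-value -> no evidence that the filter changes the estimate.
```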

Now if the H0 is true then the descriptor is not important in the analysis and I have not improved my estimate by filtering the data.

If however H0 is not true then I have, presumably, improved my estimate since I have filtered the data by a significant descriptor.

The problem is that the order in which I filter may influence the result (denoting descriptor 1 by D1, etc.): if I filter by D2 and then by D1 I may get a different result from filtering by D1 and then D2. Also, although a descriptor may be significant, I do not know whether I have improved the estimate.

Intuitively, if S1<S0 I have narrowed the data and therefore presumably improved the estimate of B. However, I may simply have filtered out important data that showed the spread. I could end up reducing the dataset to one record, with an undefined stdev, which gives me a 'perfect' estimate of B but is actually not an improvement. This is clearly unacceptable.

So how can I get a 'better' estimate of B?

My hunch is that I will have to improve the multiple regression analysis....

Thanks in advance,


from what I understand you are doing something very close to multiple regression: conditioning on a certain level of a predictor variable and comparing "within-group" and overall variabilities. I can imagine that certain properties of regression would be inherited by this analysis, for example:
Including a significant predictor in the regression model may alter the significance of another predictor.

In order to avoid being left with just one observation after predictor-specific filtering (overfitting), and hence zero deviation, you may want to separate the data into GROUPS according to the levels of the predictor and compare the means of the response across the groups. As you can clearly see, this is just one-way ANOVA (read: regression).
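A one-way ANOVA across the levels of a single qualitative descriptor can be run directly with `scipy.stats.f_oneway`; the groups below are synthetic:

```python
# One-way ANOVA across the levels of one qualitative descriptor.
# Group labels and b values are illustrative only.
from scipy.stats import f_oneway

group_a = [5.1, 4.9, 5.3, 5.0]   # b values where D1 == "a"
group_b = [6.2, 6.0, 6.4, 5.9]   # b values where D1 == "b"
group_c = [5.5, 5.6, 5.4, 5.7]   # b values where D1 == "c"

f_stat, p_value = f_oneway(group_a, group_b, group_c)
# A small p-value says at least one group mean differs, i.e. the
# descriptor carries information about b.
```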

I am not really sure how you intend to obtain the best estimate for all sets of predictors with your "one predictor level at a time" approach; you will clearly run into multiple-testing problems and will need a Bonferroni adjustment of some sort.
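The Bonferroni adjustment is simple to apply: with m tests at family-wise level alpha, each individual test is run at alpha/m. Illustrative values:

```python
# Bonferroni adjustment: m tests at family-wise level alpha means
# each individual test is judged against alpha / m.
alpha = 0.05
p_values = [0.003, 0.020, 0.041, 0.300]  # one p-value per descriptor test
m = len(p_values)

significant = [p for p in p_values if p < alpha / m]  # threshold 0.0125
```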

anyway, just an opinion....

I knew I didn't know a lot of stats, but this makes it clear I know very little.

I understand the principle of regression (single and multivariate) and that 1-way ANOVA is similar to single variable regression.

The issue is that when I do the regression or ANOVA analysis by filtering the data, I find that D1 is (as expected) the most important descriptor. The problem is that within the subset I cannot get a statistical correlation with an r2 above 0.5 for any of the other quantitative descriptors. My conclusion is that the best estimate will be the mean (perhaps the median?) of the b values for the subset.

The question is how many filters should I use? How can I avoid overfitting?

I may simply have to set some 'standards', such as:
If the sample size falls below 30 then stop filtering.
If the mean and stdev are significantly different for the filtered dataset and the stdev is smaller, then accept the filter (and the risk that this is a wrong assumption).
If B0>B1 then use B0; if B1>B0 then use B1 (i.e. take the larger of the two estimates).
[The whole calculation is to estimate a risk function and we are expected to be risk averse!]
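Those rules could be sketched roughly as follows (hypothetical names and thresholds; the significance check tests only the shift in the mean, the spread is compared directly rather than with an F-test, and this is not a statistically validated procedure):

```python
# Rough sketch of the 'standards' above.
import statistics
from scipy import stats

MIN_N = 30  # rule 1: stop filtering below this sample size

def try_filter(b_all, b_filtered, alpha=0.05):
    """Risk-averse estimate of B after one candidate filter."""
    B0 = statistics.mean(b_all)
    if len(b_filtered) < MIN_N:
        return B0                          # rule 1: filter too aggressive
    B1 = statistics.mean(b_filtered)
    S0 = statistics.stdev(b_all)
    S1 = statistics.stdev(b_filtered)
    _, p = stats.ttest_ind(b_filtered, b_all, equal_var=False)
    if p < alpha and S1 < S0:              # rule 2: accept the filter
        return max(B0, B1)                 # rule 3: keep the larger estimate
    return B0
```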

This may not be a statistically valid method, but it is practical in the absence of statistics showing a relationship.

If after gathering more data we are able to formulate a relationship with a high level of confidence then we will use that.

Just re-read my reply and the answer has struck me: if, after filtering for D1, the regression improves, then use the filtered data. By the same token, if filtering for D2 does not improve the regression, then we should not use that filter. This effectively means that the only filter required is D1.

Thanks for your help,



TS Contributor
You may want to try a multiple regression approach known as stepwise multiple regression.

Basically, you start with the descriptor that has the strongest correlation with the dependent variable, then you add the next strongest correlating descriptor to the model - if it adds "significant" information or value to the model, then the stats software will "advise" you to keep it in the model.

After adding a few descriptors to the model, you may find that they no longer add any "new" or "independent" information in explaining the variation in the dependent variable, and so they are left out.
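A bare-bones forward selection of this kind can be sketched without any stepwise-regression library: at each step add the single predictor that most improves R-squared, and stop when the gain falls below a threshold. This is only illustrative, not the exact SPSS procedure:

```python
# Minimal forward stepwise selection (illustrative only): repeatedly
# add the predictor that most improves R^2, stop when the gain is small.
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def forward_select(X, y, min_gain=0.01):
    """Greedy forward selection on R^2 with a fixed improvement threshold."""
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        r2, j = max((r_squared(X[:, selected + [j]], y), j) for j in remaining)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

# Synthetic demo: only column 1 actually drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)
selected, r2 = forward_select(X, y)
```

Real stepwise procedures usually add or drop variables on F-tests or p-values rather than a raw R-squared gain, but the greedy structure is the same.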

Here's a link that explains the process a bit more:

and this link shows how to do this in SPSS: