When removing influential outliers creates more, what then?

#1
Hello good folks,

I am conducting a multiple regression for a meta-analysis. The dependent variable (y) is a range of values ($/ha/year) extracted from studies, and the x variables are a range of geographical and methodological variables. Both the y variable and Area (ha) have poorly behaved, heavily skewed distributions with a huge range. Collinear variables were identified and removed.

After running a successful model, I had some hugely influential outliers (identified with leverage charts and Bonferroni-adjusted p-values < 1). I tried removing these to see what effect it had, and more outliers have simply taken their place. I have attempted log and square transformations of the troublesome variables, but to no avail. I wouldn't say the data looks non-linear from the plots. I've attached the plots to help. Sorry if this is poorly worded; I can provide further info if needed. Could anyone let me know the next step?
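In case it helps, the diagnostics above were roughly along these lines (a Python/statsmodels sketch, not my exact code; `df` and the column names are placeholders for my actual data):

```python
# Roughly the model and Bonferroni outlier check described above;
# df, value, area_ha, method, region are placeholder names.
import statsmodels.formula.api as smf

fit = smf.ols("value ~ area_ha + method + region", data=df).fit()

# Studentized residuals with Bonferroni-adjusted p-values; rows with
# bonf(p) < 1 are the candidate outliers.
print(fit.outlier_test(method="bonf"))
```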
Thank you in advance.
 

noetsi

Fortran must die
#2
It is common for removing outliers to generate new outliers. This is called masking: the outliers you removed were masking the impact of the ones that showed up once the extreme outliers were gone. There is disagreement among data analysts over whether one should remove outliers at all (other than those caused by simple mistakes on the collector's part, which all agree should be removed). I think the great majority these days say you should not remove outliers that reflect valid responses, although personally that makes little sense to me, since it allows a small number of cases to distort the impact of the bulk of the data.

There are many solutions to the problems created by outliers other than simply removing them: for example robust regression, splines, and other forms of regression that essentially generate local slope estimates rather than the single one of "standard" OLS regression (for lack of a better word). Are you concerned with non-normality, distortion of the parameters by outliers (leverage), or something else? Different problems lead to different solutions.
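For instance, here is a minimal robust-regression sketch in Python/statsmodels; the data frame `df` and the column names are placeholders you would swap for your own variables:

```python
# Robust regression down-weights high-residual observations instead of
# deleting them; Huber's T is a common choice of weighting function.
import statsmodels.api as sm
import statsmodels.formula.api as smf

robust_fit = smf.rlm("value ~ area_ha + method + region", data=df,
                     M=sm.robust.norms.HuberT()).fit()
print(robust_fit.summary())
print(robust_fit.weights)  # weights near 0 mark heavily down-weighted cases
```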

You could also show the regression with the outliers and without, and try to explain why the outliers occur and why they have the impact they do.
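Something along these lines (again a sketch with placeholder names; `flagged_idx` stands in for the indices of whatever observations you flagged):

```python
# Fit the same model with and without the flagged observations and
# report both, rather than silently dropping points.
import statsmodels.formula.api as smf

full_fit = smf.ols("value ~ area_ha + method + region", data=df).fit()
trimmed_fit = smf.ols("value ~ area_ha + method + region",
                      data=df.drop(index=flagged_idx)).fit()

print(full_fit.params)
print(trimmed_fit.params)  # how far do the coefficients move?
```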
 
#3
Hello,

Thank you for responding! The info on masking was useful. My main concern was the Q-Q plot, which showed outliers above the line on the right and below the line on the left, and whether normality is therefore being seriously violated. There are a few points that are very high on Cook's distance too, but once I remove them, more take their place.
I tried a GLM, but when I ran an overdispersion test the result was hugely over acceptable limits, so I'm just not sure where to go from here. I've tried various transformations of the worrisome variables to little avail. I realise this alone may not be enough information, however, so please let me know what more I could provide.

Thanks,
Sam
 

noetsi

Fortran must die
#4
In honesty the Q-Q plot does not look that bad to me. It is unusual not to have any outliers; by definition, a certain percentage of cases will always be far from the norm, and a Q-Q plot is assessed primarily, I believe, on whether the data as a whole stay close to the line. In any case, the real question is not whether you have outliers, but whether they have leverage: are they moving the regression line (assessed by Cook's D, DFBETA, or whatever you use)?
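In statsmodels terms, that check looks roughly like this (a sketch; `fit` is assumed to be your fitted OLS results object, and the 4/n and 1 cut-offs are common rules of thumb, not hard rules):

```python
# Influence diagnostics for an OLS fit: Cook's distance and DFBETAS.
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(fit)
cooks_d, _ = infl.cooks_distance   # one value per observation
dfbetas = infl.dfbetas             # rows = observations, columns = coefficients

n = len(cooks_d)
print(np.where(cooks_d > 4 / n)[0])                    # high-influence observations
print(np.unique(np.where(np.abs(dfbetas) > 1)[0]))     # observations that move some coefficient
```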

Are you concerned that the regression line is moving, that the data are non-normal, or what, specifically? If the outliers are not moving the regression line significantly, I would tend to keep them and analyse why they are occurring, that is, what is special about them. Robust regression is also a possibility. If non-normality is the issue, then logs, squaring and the like can be used (there is a whole range of transformations, some more powerful than others). Or there are non-parametric methods.
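The Box-Cox family is one way to choose among those power transformations rather than guessing; a sketch, assuming the response is strictly positive and stored in a hypothetical column `value`:

```python
# Box-Cox estimates the power (lambda) that makes y most nearly normal.
from scipy import stats

y = df["value"].to_numpy()     # must be strictly positive
y_bc, lam = stats.boxcox(y)    # transformed values and the fitted lambda
print(lam)  # lambda near 0 behaves like log, 0.5 like sqrt, 1 like no change
```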
 

Mean Joe

TS Contributor
#5
What are your results when you include the outliers, and what are they when you remove them?

How many outliers are there, as a percentage of your sample?

Maybe your sample is not simply one "population", so you may want to stratify or something.
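If that's the case, one quick way to look is to fit the model within strata (a sketch; the "method" grouping column is a placeholder for whatever variable might split your sample):

```python
# Fit the model separately within each stratum of a grouping variable
# to see whether the slopes differ across plausible sub-populations.
import statsmodels.formula.api as smf

for group, sub in df.groupby("method"):
    fit = smf.ols("value ~ area_ha + region", data=sub).fit()
    print(group, fit.params.round(3).to_dict())
```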
 

noetsi

Fortran must die
#6
A good point: extreme outliers often suggest a different population that the overall slopes don't explain. That is really what I was getting at in suggesting you try to explain why the outliers occur.
 
#7
Hi there, and thanks for all your help. I was mostly worried about the extreme outliers moving the regression line, so I went away and ran Cook's test and DFBETA. Cook's test was fine, nothing over 0.6 at the very worst, but DFBETA had 2 observations over 1. I then removed these observations and re-ran, and this time there were loads of observations over 1!! So it was better before. I realise I'm perhaps panicking, as this is part of a meta-analysis where some of the figures are a tad questionable. Is it OK simply to explain that there are possible influential outliers? Or should I add a dummy variable to account for the questionable figures (something like the sketch below), or do I just need to explore further modelling techniques? The residuals don't seem to show me the data is non-linear in any way.
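For the dummy-variable option, I mean something like this (a sketch; `questionable_ids` stands in for however I end up identifying the questionable studies):

```python
# Add an indicator for the questionable studies so the model estimates
# a separate intercept shift for them instead of dropping them.
import statsmodels.formula.api as smf

df["questionable"] = df.index.isin(questionable_ids).astype(int)
fit = smf.ols("value ~ area_ha + method + region + questionable",
              data=df).fit()
print(fit.params["questionable"])  # shift attributable to the flagged studies
```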

Thanks again.
 
#8
I see what you mean about populations. I have a few further ideas, such as adding interactions based on the method of achieving the value, etc. Thanks for your ideas.