Leverage/outliers in large data sets

noetsi

Fortran must die
#1
Discussions of leverage and outliers, and their impact on regression, typically deal with individual points. I commonly have 10-30 thousand data points, so it's unlikely that any one point will have a large impact. But I have many outliers, and jointly a set of such points might influence the results.

So how do you tell whether a set of outliers, rather than a single one, is influencing the results (ideally in terms of leverage)?
 

lken

New Member
#2
In this case I would say looking at your data graphically might be a better option than relying on packages to calculate outliers for you.

Or, you can try to use stats to create critical cutoff values for outliers (from Stack Overflow): Lund, R. E. 1975, "Tables for An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 4, pp. 473-476, and Prescott, P. 1975, "An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 1, pp. 129-132.
 

hlsmith

Omega Contributor
#3
Per my own thoughts, which may mirror lken's links, you might think about temporarily removing the upper or lower ?tile observations and seeing if there is an effect. ?tile = whatever percentile you decide to define. Also, given the graphical approach, you may be able to color code these ?tile observations in your graph, and if they are way out on the fringe you can better understand them.
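
A rough sketch of that remove-and-refit check, in plain Python with one predictor. Everything here is an assumption for illustration: the simulated data, the 1% and 99% cut-offs, the choice to trim on the predictor (since the question was about leverage), and the helper names `ols_fit` and `refit_without_tails`. It is not a standard diagnostic from any package, just a way to see whether a set of fringe points jointly moves the fit.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def refit_without_tails(x, y, lower=0.01, upper=0.99):
    """Drop points whose x lies outside the chosen quantiles, then refit.

    Returns ((intercept, slope), number_of_points_dropped).
    """
    xs = sorted(x)
    n = len(xs)
    lo = xs[int(lower * (n - 1))]   # crude empirical quantiles
    hi = xs[int(upper * (n - 1))]
    kept = [(xi, yi) for xi, yi in zip(x, y) if lo <= xi <= hi]
    kx, ky = zip(*kept)
    return ols_fit(kx, ky), n - len(kept)

# Simulated data (assumption): 9,900 points on the line y = 1 + 2x plus noise,
# and a cluster of 100 points far out in x that does not follow the line.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(9900)]
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
x += [random.uniform(6, 7) for _ in range(100)]
y += [random.gauss(0, 1) for _ in range(100)]

full_fit = ols_fit(x, y)
trimmed_fit, dropped = refit_without_tails(x, y)
print("slope, all points:   ", round(full_fit[1], 3))
print("slope, tails removed:", round(trimmed_fit[1], 3))
print("points dropped:      ", dropped)
```

If the two slopes differ by more than you would expect from sampling noise, the trimmed set is jointly influential even though no single point in it would be. Trying a few different cut-offs is probably wise, since a cluster larger than the trimmed fraction will only be partly removed.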
 

Miner

TS Contributor
#4
Are you able to attribute these outliers to an assignable cause that would allow you to legitimately remove them? Have you considered robust regression methods?
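
As one illustration of what a robust method does, here is a minimal sketch of a Huber-type fit by iteratively reweighted least squares, again in plain Python with one predictor. The data are simulated, the function names `weighted_line` and `huber_line` are made up for this post, and the tuning constant 1.345 with a MAD-based scale is a common convention rather than anything specific to your problem. Note that Huber weighting downweights points with large residuals; it does not by itself protect against high-leverage points, so it is not a full answer to the leverage question.

```python
import random
import statistics

def weighted_line(x, y, w):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def huber_line(x, y, c=1.345, iters=30):
    """Huber-type fit via iteratively reweighted least squares."""
    w = [1.0] * len(x)                      # start from ordinary least squares
    for _ in range(iters):
        a, b = weighted_line(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate from the median absolute residual
        scale = statistics.median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break
        cut = c * scale
        # Full weight inside the cut-off, shrinking weight beyond it
        w = [1.0 if abs(r) <= cut else cut / abs(r) for r in resid]
    return weighted_line(x, y, w)

# Simulated data (assumption): 1,900 clean points on y = 1 + 2x plus noise,
# and 100 points at large x whose response is shifted down by 30.
random.seed(1)
rx = [random.uniform(0, 10) for _ in range(1900)]
ry = [1 + 2 * xi + random.gauss(0, 1) for xi in rx]
bad_x = [random.uniform(8, 10) for _ in range(100)]
rx += bad_x
ry += [1 + 2 * xi + random.gauss(0, 1) - 30 for xi in bad_x]

ols_slope = weighted_line(rx, ry, [1.0] * len(rx))[1]
huber_slope = huber_line(rx, ry)[1]
print("OLS slope:  ", round(ols_slope, 3))
print("Huber slope:", round(huber_slope, 3))
```

Comparing the ordinary and robust coefficients gives a quick read on how much the outlying set is driving the results, without having to decide up front which points to delete.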