We have this discussion from time to time, but it has become more pertinent to me now... Before I begin I should note that I am not really interested in testing theory in my use of regression. I want to know the relative impact of regressors on the DV.
I know a central issue is how to detect outliers (I use some form of Cook's D and studentized residuals). My concern is different. Once you find them, how do you address them? I know the general answer is that (barring measurement error) you don't. But the result can be a wildly wrong analysis: for our data, individual points can lead to massively different results, which is less than ideal.
One suggestion is to use a different estimator (least absolute deviations, which I don't know), or a different model might be useful. However, throwing out a theoretically useful variable to address an outlier seems doubtful. Others suggest transforming the outlier or using non-parametric methods.
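For readers who want to see the detection step concretely, here is a minimal sketch of internally studentized residuals and Cook's distance for a one-regressor OLS fit. The data, the function name `ols_diagnostics`, and the 4/n cutoff (a common rule of thumb, not a rule from this thread) are all assumptions made for illustration.

```python
import math

def ols_diagnostics(x, y):
    """Simple OLS (intercept + one slope) with studentized residuals and Cook's D."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    # Leverage (diagonal of the hat matrix) has a closed form with one regressor.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Internally studentized residual: residual scaled by its estimated std. error.
    student = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, leverage)]
    # Cook's D combines residual size and leverage.
    cooks_d = [(r * r / p) * h / (1 - h) for r, h in zip(student, leverage)]
    return slope, intercept, student, cooks_d

# Hypothetical data: roughly y = 2x + 1, with the last point pushed far upward.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 40.0]

slope, intercept, student, cooks_d = ols_diagnostics(x, y)
cutoff = 4 / len(x)  # assumed rule-of-thumb threshold
flagged = [i for i, d in enumerate(cooks_d) if d > cutoff]
print("flagged indices:", flagged)
```

With this made-up data only the last point is flagged. The sketch only finds the point; it says nothing about what to do with it, which is the actual question here.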
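Since least absolute deviations (LAD) comes up above, here is a rough sketch of how it compares with OLS on data containing one extreme point. It approximates the LAD line by iteratively reweighted least squares, which is one of several ways to fit it and is chosen here only because it needs no libraries. The data and the function names `weighted_line` and `lad_line` are hypothetical.

```python
def weighted_line(x, y, w):
    """Weighted least-squares line; returns (slope, intercept)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def lad_line(x, y, iterations=200, eps=1e-6):
    """Approximate the LAD fit by reweighting each point by 1 / |residual|."""
    w = [1.0] * len(x)  # the first pass is plain OLS
    for _ in range(iterations):
        slope, intercept = weighted_line(x, y, w)
        # eps keeps the weights finite when a residual is (almost) zero.
        w = [1.0 / max(abs(yi - (intercept + slope * xi)), eps)
             for xi, yi in zip(x, y)]
    return slope, intercept

# Hypothetical data: roughly y = 2x + 1, with the last point pushed far upward.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 40.0]

ols_slope, ols_intercept = weighted_line(x, y, [1.0] * len(x))
lad_slope, lad_intercept = lad_line(x, y)
print("OLS slope:", round(ols_slope, 3))
print("LAD slope:", round(lad_slope, 3))
```

The point of the comparison is that LAD penalizes large residuals linearly rather than quadratically, so a single extreme response pulls the line less. It does not protect against extreme values of the regressor itself.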
I would appreciate comments on how members deal with outliers.
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Perhaps some bootstrapping would be an option. I think it depends on a person's field and the purpose of the investigation.
Stop cowardice, ban guns!
noetsi (12-30-2014)
How do you use bootstrapping in this regard?
If the model is appropriate, you can get corrected estimates. For example, if you calculate the c-statistic in logistic regression, you can refit the model on 10,000 bootstrap resamples and subtract the mean difference across those additional runs from the original c-statistic. This addresses sampling variation; it is not designed directly for outliers, but you can see how they may get addressed.
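A minimal sketch of that bootstrap correction follows. To keep it free of dependencies it swaps in R-squared from a one-regressor linear fit in place of the c-statistic from logistic regression, and uses 500 resamples rather than 10,000; the recipe (refit on each resample, compare performance on the resample with performance on the original data, subtract the mean difference) is the same. The data and all function names are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return slope, y_bar - slope * x_bar

def r_squared(x, y, slope, intercept):
    """R-squared of a given line evaluated on the given data."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

def optimism_corrected_r2(x, y, n_boot=500, seed=0):
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(x, y, *fit_line(x, y))
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: a line cannot be fitted
        slope, intercept = fit_line(bx, by)
        # Performance on the resample minus performance on the original data.
        diffs.append(r_squared(bx, by, slope, intercept)
                     - r_squared(x, y, slope, intercept))
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism

# Hypothetical data: roughly y = 2x + 1, with the last point pushed far upward.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 40.0]

apparent, optimism, corrected = optimism_corrected_r2(x, y)
print("apparent:", apparent, "optimism:", optimism, "corrected:", corrected)
```

Resamples that happen to omit or duplicate the extreme point give quite different fits, so the spread of the bootstrap results is itself a rough indication of how much the analysis hinges on that point.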
It always comes back to why you have an outlier. Was it a transcription error or measurement error? Throw it out.
On the other hand, was it an indicator that you have a discrete IV that has not been considered? If so, keep it in, but discover and include that lurking IV. Otherwise your model is still questionable.
This is interesting. Knowing where SAS runs these types of regression can be difficult. Too many procs.
http://www.bauer.uh.edu/rsusmel/phd/ec1-25.pdf
We have outliers that are not measurement errors, because a small number of customers have extreme expenses. They disrupt the true cost structure of our data. It may be that this is really only a problem with descriptive statistics and not the regression itself. Need to find out.
Not knowing what you analyze, I'll speculate a little based on my industry's experience with customers.
I can see several possibilities here. The first is that you do not have a linear relationship, but some type of curve. The second is that customer type/size/etc. is a discrete IV that causes a change in the coefficients.
You are seeing an example of the Pareto principle (see http://en.wikipedia.org/wiki/Pareto_principle), where a small number of customers account for a large percentage of the expenses/sales/returns/etc. This is well known in quality assurance.
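A quick way to check whether your data look Pareto-like is to compute the share of the total that comes from the top fraction of customers. The sketch below uses made-up expense figures and a hypothetical helper `top_share`; the 20% default is just the conventional figure from the 80/20 rule.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))  # at least one customer
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical expenses for ten customers.
expenses = [5000, 3000, 400, 300, 250, 200, 150, 100, 50, 50]

share = top_share(expenses)
print(f"Top 20% of customers account for {share:.0%} of expenses")
```

If a handful of customers account for most of the total, that points toward treating customer type or size as its own variable rather than treating those customers as errors.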
Wrong in what sense though? I guess the point I'd come back to is that the analyses we use generally don't assume that there are no outliers. Sometimes outliers can cause problems with other assumptions, like error normality or homoscedasticity. But the presence of outliers will not in and of itself lead to incorrect results. (E.g., OLS will still provide estimates that are unbiased, consistent and efficient, even if outliers are present, provided the distributional assumptions for OLS are met).
On individual points leading to massively different results: I hear this, but while the objection makes sense at a "gut" level, the estimators we use most often do not carry any requirement that the data points have equal influence on the results.
I think the outlier issue also comes up a bit more when interpreting statistical results in a binary fashion. E.g., if your main focus comes down to whether or not p < 0.05. In that case it can more often seem like an outlier makes a massive difference (e.g., if excluding an outlier changes the p value from 0.04 to 0.06). But if you interpret estimates and statistics quantitatively instead of focusing on binary decision rules, it's (in my experience) quite rarely the case that an outlier makes a big difference.
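One way to look at this quantitatively rather than through a p-value is to refit with each point left out and see how far the estimate itself moves. This is a minimal leave-one-out sketch for a one-regressor fit; the data and function names are hypothetical, and the example is deliberately built so that one point matters a lot.

```python
def fit_slope(x, y):
    """Slope of the ordinary least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

def leave_one_out_slopes(x, y):
    """Slope obtained when each observation in turn is dropped."""
    return [fit_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]

# Hypothetical data: roughly y = 2x + 1, with the last point pushed far upward.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 40.0]

full_slope = fit_slope(x, y)
loo_slopes = leave_one_out_slopes(x, y)
changes = [s - full_slope for s in loo_slopes]
most_influential = max(range(len(x)), key=lambda i: abs(changes[i]))
print("full-data slope:", round(full_slope, 3))
print("most influential index:", most_influential)
```

Whether a shift of that size matters is then a judgment about the estimate in its own units, which is a different question from whether a p-value crosses 0.05.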