Thread: Dealing with outliers

  1. #1
    noetsi

    We have this discussion from time to time, but it has become more pertinent to me now. Before I begin, I should note that I am not really interested in testing theory in my use of regression. I want to know the relative impact of the regressors on the DV.

    I know a central issue is how to detect outliers [I use some form of Cook's D and studentized residuals]. My concern is different: once you find them, how do you address them? I know the general answer is that (barring measurement error) you don't. But the result can be a wildly wrong analysis; individual points can, for our data, lead to massively different results, which is less than ideal.
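
For readers who have not computed these diagnostics by hand, here is a minimal sketch of leverage, internally studentized residuals and Cook's D for a one-predictor regression. The data are hypothetical, and the 4/n cutoff is only a common rule of thumb, not something proposed in this thread.

```python
import math

def ols_influence(x, y):
    """Fit y = a + b*x by OLS and return (a, b, rows), where each row is
    (leverage, studentized residual, Cook's D) for one observation."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    rows = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - mx) ** 2 / sxx      # leverage of this point
        r = e / math.sqrt(mse * (1.0 - h))      # internally studentized residual
        d = (r * r / p) * h / (1.0 - h)         # Cook's distance
        rows.append((h, r, d))
    return a, b, rows

# Hypothetical data: roughly y = 2x, with one extreme value at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.0, 40.0]

a, b, rows = ols_influence(x, y)
cutoff = 4 / len(x)  # rule-of-thumb threshold (an assumption, not a standard)
flagged = [i for i, (h, r, d) in enumerate(rows) if d > cutoff]
print("slope:", round(b, 3), "flagged observations:", flagged)
```

Detection is the easy part; the sketch only shows which points the diagnostics would single out, not what to do with them.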

    One suggestion is to use a different estimator [least absolute deviations, which I don't know] or a different model. However, throwing out a theoretically useful variable to address an outlier seems doubtful. Others suggest transforming the outlier or using non-parametric methods.
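
As a rough illustration of what least absolute deviations does differently, the sketch below fits a one-predictor line by LAD and by OLS on the same hypothetical data. It relies on the fact that, with one predictor, some optimal LAD line passes through two of the data points, so it simply tries every pair. That brute-force search is only sensible for small samples; real software uses linear programming or similar.

```python
from itertools import combinations

def ols_line(x, y):
    """Ordinary least squares intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def lad_line(x, y):
    """Least absolute deviations intercept and slope, found by checking
    every line through two data points and keeping the smallest loss."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, not a valid regression fit
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = sum(abs(yk - (a + b * xk)) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]

# Hypothetical data: roughly y = 2x, with one extreme value at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.0, 40.0]

ols_a, ols_b = ols_line(x, y)
lad_a, lad_b = lad_line(x, y)
print("OLS slope:", round(ols_b, 3), "LAD slope:", round(lad_b, 3))
```

On data like these the LAD slope stays close to the trend of the bulk of the points, while the OLS slope is pulled toward the extreme value. Whether that is desirable depends on whether the extreme customers are part of what you want the model to describe.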

    I would appreciate comments on how members deal with outliers.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  2. #2
    hlsmith

    Perhaps some bootstrapping would be an option. I think it depends on a person's field and the purpose of the investigation.
    Stop cowardice, ban guns!

  3. The Following User Says Thank You to hlsmith For This Useful Post:

    noetsi (12-30-2014)

  4. #3
    noetsi

    How do you use bootstrapping in this regard?

  5. #4
    hlsmith

    If the model is appropriate, you can get corrected estimates. For example, if you calculate the c-statistic in logistic regression, you can refit the model on 10,000 bootstrap resamples and subtract the mean difference between the resampled and original results from the original c-statistic. This addresses sampling variation; it is not designed specifically for outliers, but you can see how they may get addressed.
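
One reading of the procedure described above is the optimism-corrected bootstrap. The sketch below assumes that reading, and to stay self-contained it swaps logistic regression and the c-statistic for a one-predictor linear regression and R^2. The data, function names and seed are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def r_squared(a, b, x, y):
    """R^2 of the line (a, b) evaluated on the data (x, y)."""
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1.0 - sse / sst

def optimism_corrected_r2(x, y, n_boot=10_000, seed=0):
    """Return (apparent R^2, optimism-corrected R^2)."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(*fit_line(x, y), x, y)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # degenerate resample, cannot fit or score
        a, b = fit_line(bx, by)
        # Optimism: fit quality on the resample minus quality on the original data.
        gaps.append(r_squared(a, b, bx, by) - r_squared(a, b, x, y))
    optimism = sum(gaps) / len(gaps)
    return apparent, apparent - optimism

# Hypothetical data: roughly y = 2x, with one extreme value at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.0, 40.0]

apparent, corrected = optimism_corrected_r2(x, y)
print("apparent R^2:", round(apparent, 3), "corrected R^2:", round(corrected, 3))
```

The correction targets overfitting to the sample at hand. An outlier affects it only indirectly, through the resamples that happen to include or exclude that point.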

  7. #5
    Miner

    It always comes back to why you have an outlier. Was it a transcription error or measurement error? Throw it out.


    On the other hand, was it an indicator that you have a discrete IV that has not been considered? If so, keep it in, but discover and include that lurking IV. Otherwise your model is still questionable.

  8. #6
    noetsi

    This is interesting. Knowing where SAS runs these types of regression can be difficult; there are too many procs.

    http://www.bauer.uh.edu/rsusmel/phd/ec1-25.pdf

    We have outliers that are not measurement errors: a small number of customers have extreme expenses. This distorts the true cost structure of our data. It may be that this is really only a problem with descriptive statistics and not the regression itself. Need to find out.

  9. #7
    Miner

    Not knowing what you analyze, I'll speculate a little based on my industry's experience with customers.


    I can see several possibilities here. The first is that you do not have a linear relationship, but some type of curve. The second is that customer type/size/etc. is a discrete IV that causes a change in the coefficients.


    You are seeing an example of the Pareto principle (see http://en.wikipedia.org/wiki/Pareto_principle), where a small number of customers account for a large percentage of the expenses/sales/returns/etc. This is well known in quality assurance.
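
A quick way to check whether a Pareto-type concentration is present is to compute the share of the total that comes from the largest accounts. This is a minimal sketch with hypothetical expense figures; `top_share` is an illustrative helper, not a function from any library.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))  # number of top accounts
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical customer expenses: eight ordinary accounts and two extreme ones.
expenses = [120, 95, 110, 130, 105, 90, 115, 100, 2500, 4000]

share = top_share(expenses)
print(f"Top 20% of customers account for {share:.0%} of expenses")
```

If the share is very high, the extreme customers may be better treated as their own segment than as outliers.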

  10. #8
    CowboyBear

    Originally posted by noetsi:
    "But the result can be a wildly wrong analysis"
    Wrong in what sense, though? I guess the point I'd come back to is that the analyses we use generally don't assume that there are no outliers. Sometimes outliers can cause problems with other assumptions, like error normality or homoscedasticity. But the presence of outliers will not in and of itself lead to incorrect results. (E.g., OLS will still provide estimates that are unbiased, consistent and efficient even if outliers are present, provided the distributional assumptions for OLS are met.)

    "individual points can, for our data, lead to massively different results, which is less than ideal."
    I hear this, but while the objection makes sense at a "gut" level, the estimators we use most often do not carry any requirement that the data points have equal influence on the results.

    I think the outlier issue also comes up a bit more when interpreting statistical results in a binary fashion, e.g., if your main focus comes down to whether or not p < 0.05. In that case it can more often seem like an outlier makes a massive difference (e.g., if excluding an outlier changes the p value from 0.04 to 0.06). But if you interpret estimates and statistics quantitatively instead of focusing on binary decision rules, it's (in my experience) quite rarely the case that an outlier makes a big difference.
