Cook's distance cutoff

#1
Hello,
I urgently need help understanding the following:
I am running a large number of simple linear regressions. For each regression I want to use an outlier test (outlierTest(fit)), an influence index test, and influence plots to identify outliers and influential data points.
I read that for Cook's distance people use 1 or 4/n as a cutoff, and that outlierTest by default uses 0.05 as the cutoff for the p-value.
Doing this, I get some regressions where the test reports no outliers (result = FALSE with p > 0.05) but Cook's distance (using 4/n as the cutoff) indicates an influential data point.
Can a data point be 'influential' even though it is not an outlier?
Or am I making some mistake in interpreting outlierTest and Cook's distance?
Thank you very much!
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
I see no one got back to you. I don't know the answer, but it comes down to whether the two calculations measure the same thing. Offhand, I think it is possible for them to disagree, but that is a hunch and I have no data to back it up at this time. It may depend on your sample size and the scatter in the data. How do your data look when plotted? Does the point look like an outlier and a potential influencer?
 
#3
In addition to what has already been suggested, one way to check whether a single point is influential is to compare the regression with and without that observation: if the results change substantially, that is some evidence that the point is influential.

EDIT:

Although, I guess that's what Cook's distance is measuring, so it's kind of a silly suggestion...
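
For anyone who wants to try the with/without comparison directly, here is a minimal sketch in plain Python (no libraries). The data and the helper names `fit_line` and `leave_one_out_slopes` are made up for illustration; they are not from the original poster's analysis.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def leave_one_out_slopes(xs, ys):
    """Slope of the refitted line after dropping each observation in turn."""
    slopes = []
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        slopes.append(fit_line(xs_i, ys_i)[1])
    return slopes


# Hypothetical data: five clustered points plus one far out on the x axis.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
ys = [2.1, 1.9, 2.2, 2.0, 2.1, 6.0]

full_slope = fit_line(xs, ys)[1]
loo_slopes = leave_one_out_slopes(xs, ys)
changes = [abs(s - full_slope) for s in loo_slopes]
most_influential = max(range(len(xs)), key=lambda i: changes[i])

for i, (s, c) in enumerate(zip(loo_slopes, changes)):
    print(f"drop obs {i}: slope = {s:.4f}, change = {c:.4f}")
print("largest change when dropping obs", most_influential)
```

With only 6 to 9 points per regression, every deletion moves the line somewhat, so the useful signal is one observation whose removal changes the slope far more than the others.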
 
#4
Thank you hlsmith!
As I said, I am doing more than 300 simple linear regressions, so it will take time to investigate each case. I will have a look at the plots. My guess is that since the sample size is small (from 6 to 9), each data point will be influential anyway and may not need to be an outlier to have a significant effect, especially where the relationship is weak (R-squared is low). Any thoughts?

Disvengeance, thanks a lot to you too! Yes, that is exactly what Cook's distance measures, but the choice of 'cutoff' is still important. I have selected 4/n but am not sure that is the correct approach. Please advise how a cutoff should be selected for Cook's distance if you have any idea...
Thank you,
Cheers!
 
#5
From what I recall, there isn't really one perfect cutoff, but 4/n is commonly used. Perhaps sample size is an issue when deciding between 1 and 4/n; I don't have my text handy, but I will check later. It may also be useful to calculate other measures of influence and use them all to get an idea of how influential an observation is.
 

noetsi

Fortran must die
#6
There are disagreements on what the cutoff should be, and a number of rules of thumb that, in my experience, can conflict. Since there is no agreement on which cutoff to use, you just have to choose one that makes sense to you.

Outliers in the classical sense measure distance from the regression line (and only that). Influence, which is what Cook's D gets at by combining leverage with the size of the residual, deals with impact on the regression line. It is possible that a point has so much leverage that it pulls the whole line towards itself. In that case it won't be an outlier, because it will be close to the line, but it will have significant leverage. So logically you can have a point that is not an outlier but has high leverage.
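
To make that distinction concrete, here is a rough sketch in plain Python that computes raw residuals, leverage, and Cook's D for a simple regression using the textbook formulas. The data are invented, and the function name `cooks_distance_simple` is only for illustration; it is not the code behind outlierTest or any R function.

```python
def cooks_distance_simple(xs, ys):
    """Return (residuals, leverages, cooks_d) for the least-squares line y = a + b*x."""
    n = len(xs)
    k = 2  # number of fitted coefficients: intercept and slope
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance estimate
    # Leverage in simple regression: 1/n + (x - mean)^2 / Sxx
    lev = [1.0 / n + (x - mx) ** 2 / sxx for x in xs]
    # Cook's D: e^2 / (k * s^2) * h / (1 - h)^2
    cooks = [e * e / (k * s2) * h / (1.0 - h) ** 2 for e, h in zip(resid, lev)]
    return resid, lev, cooks


# Hypothetical data: seven points near a line, one far out on the x axis.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 16.0]
ys = [1.2, 1.8, 3.3, 3.7, 5.2, 5.8, 7.3, 13.0]

resid, lev, cooks = cooks_distance_simple(xs, ys)
cutoff = 4.0 / len(xs)  # the 4/n rule of thumb discussed in this thread

largest_resid = max(range(len(xs)), key=lambda i: abs(resid[i]))
largest_cooks = max(range(len(xs)), key=lambda i: cooks[i])

for i in range(len(xs)):
    flag = "influential by 4/n" if cooks[i] > cutoff else ""
    print(f"obs {i}: resid = {resid[i]:+.3f}, leverage = {lev[i]:.3f}, D = {cooks[i]:.3f} {flag}")
print("farthest from the line:", largest_resid, "| largest Cook's D:", largest_cooks)
```

In this made-up example the observation farthest from the fitted line and the observation with the largest Cook's D are different points: the last point sits fairly close to the line it has pulled towards itself, yet its leverage makes its Cook's D the largest by a wide margin.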
 

Dragan

Super Moderator
#7
I think a more appropriate "cut-off" would be to use the F(p+1, N-p-1) distribution, where p is the number of predictors. You would take your computed value of Cook's D and determine whether its cumulative probability under that F distribution exceeds 0.50 (that is, whether D is above the median of the distribution). If so, consider the point influential.

Example: Suppose you have a simple regression model with N=12 subjects, so p=1. Thus, use the F(2, 10) distribution. Further suppose your value of Cook's D is 1.00. The resulting cumulative probability for D=1.00 under the F(2, 10) distribution is 0.598122, which is greater than 0.50, so consider this point influential.
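
The calculation in this example can be sketched in plain Python. This relies on the closed form of the F CDF that holds only when the numerator degrees of freedom equal 2, which is the simple-regression case (p=1). The function names are made up for illustration; for more predictors you would need a general F CDF from a statistics library.

```python
def f_cdf_num_df2(x, den_df):
    """CDF of the F(2, den_df) distribution at x.

    Closed form valid only for numerator df = 2:
    1 - (den_df / (den_df + 2x)) ** (den_df / 2).
    """
    if x <= 0:
        return 0.0
    return 1.0 - (den_df / (den_df + 2.0 * x)) ** (den_df / 2.0)


def cooks_d_percentile(d, n):
    """Cumulative probability of Cook's D under F(2, n - 2) for simple regression."""
    return f_cdf_num_df2(d, n - 2)


def f_median_num_df2(den_df):
    """Median of F(2, den_df), obtained by solving the closed-form CDF for 0.5."""
    return den_df / 2.0 * (2.0 ** (2.0 / den_df) - 1.0)


# The example above: N = 12, D = 1.00, so F(2, 10).
prob = cooks_d_percentile(1.00, 12)
print(f"cumulative probability = {prob:.6f}")  # 0.598122, as in the example
print("flag the point" if prob > 0.5 else "do not flag the point")

# Compare the F-median cutoff with 4/n for the sample sizes in this thread.
for n in range(6, 10):
    print(f"n = {n}: F median = {f_median_num_df2(n - 2):.3f}, 4/n = {4.0 / n:.3f}")
```

For n between 6 and 9 the F-median cutoff comes out higher than 4/n, so under these formulas it would flag fewer points than the 4/n rule at those sample sizes.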
 
#8
Thanks a lot, Disvengeance! It would be great if you could share any reference that recommends 4/n, as that does look logical to me.
 
#9
Thank you very much, Noetsi; that really explains why there could be points that are influential but are not identified as outliers. I guess that particularly in cases where the sample size is small, an individual point may behave like that, as it can pull the line towards itself and will not be an outlier because it is itself contributing to the regression equation.