# Thread: Removing an outlier DECREASES correlation??

1. Here's my data. Note the "outlier" (marked with ***):

Code:
```
X	LN(X)	Y
4	1.386294361	0.107543212
6	1.791759469	3.289563593
7	1.945910149	2.058949066
8	2.079441542	1.590767664
9	2.197224577	0.944430212
12	2.484906650	0.974058018
13	2.564949357	2.396011242
14	2.639057330	0.091316141
15	2.708050201	1.284675088
16	2.772588722	1.776876231
17	2.833213344	1.890457578
19	2.944438979	569.0429437 ***
20	2.995732274	0.632025158
```
The correlations are:
X vs Y: 0.39
ln(X) vs Y: 0.32

But if I remove the outlier, the correlations DROP:
X vs Y: 0.15
ln(X) vs Y: 0.08
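For anyone who wants to check these numbers independently, here is a quick sketch (Python/NumPy, my translation, not anything from the original spreadsheet) that reproduces both correlations from the table above:

```python
import numpy as np

x = np.array([4, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 19, 20], dtype=float)
y = np.array([0.107543212, 3.289563593, 2.058949066, 1.590767664,
              0.944430212, 0.974058018, 2.396011242, 0.091316141,
              1.284675088, 1.776876231, 1.890457578, 569.0429437,
              0.632025158])

r_with = np.corrcoef(x, y)[0, 1]                 # Pearson r, full data
mask = y < 100                                   # drop the 569 point
r_without = np.corrcoef(x[mask], y[mask])[0, 1]  # Pearson r, outlier removed

print(round(r_with, 2), round(r_without, 2))
```

(One detail the rounded figures hide: with the point removed, r actually comes out slightly negative, about -0.15; the 0.15 quoted above is its magnitude, and r-squared is about 0.02 either way.)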

I am used to seeing the correlation increase when removing outliers. In fact, that's often (usually??) a reason for removing outliers, no?

Now, I am not a fan of removing outliers, and I am not sold on removing this one. And I really don't want to get into that discussion here (whether or not to remove outliers). But I would like to know when it is common to see correlations decrease after removing outliers like this one. Anyone have a good mathematical explanation?

Skip

2. Hi,
I am not going to give you the needed answer. I am just noting that the correlation is not that good (0.39).

I am wondering whether your data meet the assumptions for parametric correlation (I guess you are using Pearson's r), whether you have tried a nonparametric correlation, and whether, besides the correlation coefficient, you are considering evaluating the probability associated with r.

Regards
gm

3. Yes, I am using Pearson to determine the critical value. This is just one data series of over 1000. I am conducting an aging study for different types of equipment installed onboard ships. The X's are ages in years, and the Y's are the amounts of man-days of maintenance expended maintaining that piece of equipment. I rarely get good correlations, but I do use Pearson to make sure I am using significant data. I don't know if I have even checked this particular series for significance yet (that might be the next step). I was just puzzled by the fact that the correlation went down after removing an outlier, that's all.

4. Without looking at the formula you used, I think you could have outliers which actually increase correlation, but it must be a rare coincidence.

You could probably generate your own examples now that you know about it. In a way, you'd be 'fooling' the formula.

5. With regards to the formula that I used, I assume you mean the formula used to calculate the correlation value. For the most part, I just use the (linear) Excel CORREL() function, but I have also manually calculated the residual sum of squares and the total sum of squares, divided one by the other, and subtracted from 1, and I end up with the same number.

That is,
1 - (SSerr/SStot)

For my model, I am also trying to see if a logarithmic model would fit better (that's why you see the LN(X) column). When I calculate the correlation for that, I cannot use the Excel CORREL() function because the model is not linear, so I use an entirely manual process. I have tested my process with some sample data and I do end up with correct results.
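One side note, offered as a sketch (Python/NumPy, my illustration rather than the original worksheet): since y = m*ln(x) + b is still linear in the transformed variable ln(x), the 1 - SSerr/SStot result from the manual process should match the squared linear correlation computed on the LN(X) column:

```python
import numpy as np

lnx = np.log(np.array([4, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 19, 20], dtype=float))
y = np.array([0.107543212, 3.289563593, 2.058949066, 1.590767664,
              0.944430212, 0.974058018, 2.396011242, 0.091316141,
              1.284675088, 1.776876231, 1.890457578, 569.0429437,
              0.632025158])

m, b = np.polyfit(lnx, y, 1)                  # OLS fit of y against ln(x)
ss_err = np.sum((y - (m * lnx + b)) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares
r2_manual = 1 - ss_err / ss_tot               # the manual-process result
r2_correl = np.corrcoef(lnx, y)[0, 1] ** 2    # squared CORREL on LN(X)

print(round(r2_manual, 4), round(r2_correl, 4))  # the two should agree
```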

So, I go back to my original question. There must be a way to describe when it is likely to see a situation where removing outliers can actually decrease correlation. I suspect it has something to do with the magnitude of the outlier (note that in this case, my outlier is many orders of magnitude beyond the other data points). Or perhaps it is a function of how many data points there are. And then my next question would be: when such an outlier is found, does that give any additional credence (or confirmation) that the point in question either specifically should or should not be removed? Or should it just be treated like any other data point, perhaps labeled with an explanation if deemed necessary, and deleted only if there is other evidence to indicate it is warranted (such as a procedural error in the collection of that data)?

6. Anyone want to take a shot at my questions above? Specifically, I want to know: "when such an outlier is found, does that give any additional credence (or confirmation) that the point in question either specifically should or should not be removed? Or should it just be treated like any other data point, perhaps labeled with an explanation if deemed necessary, and deleted only if there is other evidence to indicate it is warranted (such as a procedural error in the collection of that data)?"

7. Post the formulas you used and I will try to come up with a reason why you have that result.

Another thing to try is to adjust the value of the outlier to see if, at some point, it does lower the r value.

8. The easiest way to do it is to plot those data points in Excel and watch what happens when you remove the point. I also tested it using least squares to make sure I wasn't being hoodwinked by Bill Gates.

The model: y=m*x+b (I will use the non-transformed X-values)

r-squared= 1-SSerr/SStot

SStot = sumof((y_sub_i - y_bar)^2)
SSerr = sumof((y_sub_i - y_hat)^2)

y_bar = average(Y-data) = 45.08
SStot = (.108 - 45.08)^2 + (3.29 - 45.08)^2 + (2.06 - 45.08)^2 .... = 297421.6

Calculate y_hat for each X value. y_hat is the predicted value using the model in question. But I need m and b to complete the model.

The easy way is to again let Bill Gates figure it out using the slope() and intercept() functions. Of course, I can manually do it using OLS methods:

n = 13
Sx = 160
Sy = 586
Sxx = 2286
Syy = 323843.8
Sxy = 11004.2

m= (n*Sxy - Sx*Sy)/(n*Sxx - Sx^2)
m= (13*11004 - 160*586)/(13*2286 - 160^2)
m= 11.97 (agrees with excel slope function)

b= Sy/n - m*Sx/n
b= 586/13 - 11.97*160/13
b= -102.2 (agrees with excel intercept function)

So, my model for this data is y=11.97*x-102.2

calculate y_hat for each X
4: y=11.97*4-102.2 = -54.3
6: y=11.97*6-102.2 = -30.4
etc etc

Here are the y_hats

4 -54.33906291
6 -30.40411041
7 -18.43663417
8 -6.469157918
9 5.498318329
12 41.40074707
13 53.36822332
14 65.33569956
15 77.30317581
16 89.27065206
17 101.2381283
19 125.1730808
20 137.140557

So, SSerr is (0.108-(-54.34))^2+(3.29-(-30.4))^2+...
SSerr = 252053.7

R2 = 1 - 252053/297421 = 0.1525 (agrees with the Excel RSQ function)

Now, if I do all that again after removing the outlier, I find that the r squared goes DOWN to 0.023, which to me is counter-intuitive. Normally we remove outliers to improve r-squared.
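For anyone who wants to re-run this walkthrough outside of Excel, here is a minimal sketch (Python/NumPy, my translation of the steps above, not the original worksheet) that reproduces m, b, and R2 from the same data:

```python
import numpy as np

x = np.array([4, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 19, 20], dtype=float)
y = np.array([0.107543212, 3.289563593, 2.058949066, 1.590767664,
              0.944430212, 0.974058018, 2.396011242, 0.091316141,
              1.284675088, 1.776876231, 1.890457578, 569.0429437,
              0.632025158])

n = len(x)
# OLS slope and intercept, same formulas as the hand calculation
m = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x * x) - x.sum() ** 2)
b = y.mean() - m * x.mean()

y_hat = m * x + b
ss_err = np.sum((y - y_hat) ** 2)      # SSerr: residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # SStot: total sum of squares
r2 = 1 - ss_err / ss_tot

print(round(m, 2), round(b, 1), round(r2, 4))  # ≈ 11.97, -102.2, 0.1525
```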

I could run a bunch of sample values through some trials to find out what happens to r-squared in the presence of outliers, and the implications and results of removing those outliers, but I was hoping someone could point me to some studies that say "If there is one (or more) outlier(s) that meet such-and-such criteria, and there are this many data points, then it is possible/likely/impossible/whatever to see R-squared go down (or up, as the case may be)".

So, the crux of my problem is not that I think I am calculating r-squared incorrectly. Bill Gates and I agree on the results. But I am confused as to when this phenomenon can occur.

Thanks for your help, everyone!

9. I printed out Excel's formula and I'll see if I can find the conditions under which removing an outlier decreases correlation. For some reason I couldn't paste it into this post.

Simplifying the formula, what you're seeing doesn't seem possible, so I'd say the problem crops up before r is computed. I'd do each step with and without the outlier to see at what point the inputs that eventually go into the r formula change in the direction necessary to give this result.

Or, using both r values, work backwards to see what each of the four terms in the Excel formula must be in order to get this answer. There are really only two terms: x minus x-bar and y minus y-bar.

10. A non-mathematical observation: 569 man-days is almost 2 years. Are you sure that this isn't a data entry issue? The ship in question is 19 years old, and almost 2 of those years were spent maintaining equipment that normally would be just a few days? Seems like a data problem to me.

Also, data that is poorly correlated can have a noisy correlation coefficient as you move the mean around. To visualize this, imagine a data set X,Y that is very badly correlated and the data values are between 0 and 1. If you tug the Y data set up or down slightly by shifting the mean, the correlation coefficient can go up and down.

Imagine a scenario where you have a multi-modal distribution. Let's say that ships with new equipment are running great, but the equipment manufactured 5 years ago was unreliable, the equipment manufactured 10 years ago was very reliable, and the equipment manufactured 15 years ago was unreliable. Ship age and equipment reliability aren't linearly correlated.

A single outlier of the magnitude that you see here could easily move the Y mean enough that the correlation coefficient could go up or down. I'd recommend a scatter plot first to see what the data looks like.

Your comment that you typically expect removing outliers to increase the correlation coefficient is a true observation, as long as the underlying data is reasonably well correlated. It sounds like you have an interesting data set, and an interesting business problem. But I would guess that there is something more fundamental (and probably really interesting) underlying the maintenance requirements on the equipment than the age of the ship.

11. Thank you guys so much for your help.

Outlier, I'm not really sure where you are going with looking at how Excel calculates r-squared, since it looks like the way I am doing it comes up with the same numbers. I would *ahem* assume that Excel is calculating it the same way. I have had to look at how Excel calculates different things before because I was getting screwy numbers, so this wouldn't be the first time, and I usually get a kick out of it because sometimes they do some really silly things, but oh well. That's Excel for you. I guess I could drop it into R (I don't have S+), but I am quite certain that I would get the same results. I think linear regression is pretty straightforward in all "stats" packages (and, yes, I use that term loosely when putting Excel in that group). But if you uncover something interesting, I sure would love to hear it! I have attached my worksheet for this example; you can see how I calculated everything and how it agrees with Excel.

RealityFocus, you bring up some interesting ideas. We are pretty confident that this outlier is indeed a good data point. By the way, if you were wondering about the man-days that were less than 1.0, that happens because I have to divide the total man-days expended working on that piece of equipment by the number of ships that have that piece of equipment. The extreme precision is also an artifact of some pro-rating that we have to do. It usually doesn't throw the individual data points off too much, and it happens across the board, so it's generally not a problem for our analysis. We have double-checked the math there, and the work orders on this particular line item, and the numbers seem to be valid.

Unfortunately, what can happen is some pieces of equipment can go a long time with very little maintenance and then it requires an extensive overhaul. This leads to extreme variance in the data, and it is very difficult to identify them when you have data that goes back 20+ years, for literally thousands of different pieces of equipment.

Yes, the data set is very interesting, and it keeps the job challenging (this is for professional work, not academic). We believe we have some process control problems, and would like to start getting this excessive variance under control. This is the start of that process.

Since I have so many line items, I can't do this level of manual analysis for each line. What I was hoping was that someone could point me to a confidence test or something that says "you have a good chance of having an outlier here that could be adversely affecting your data". Then I would need to develop a process to numerically identify the outlier, and tabulate the results of the regression with and without the outlier(s).

I have run some trials in the spreadsheet attached, and see that any value above ~100 keeps the r-squared around 0.15. Then as you drop the value of the outlier until it falls into the range of the other values, r-squared decreases until it reaches a minimum of about 0.02, which is the r-squared of the data without the outlier. So, it does look like when you have highly uncorrelated data, having an outlier can (often???) increase r-squared.

So my fear is that had I not graphed this particular set and noticed this phenomenon, I would have looked at the Pearson critical value and would have said "it passed" (we have pretty low standards when it comes to confidence intervals around here due to the high variance that we deal with all the time). So, when you have thousands of line items to analyze, how can I be sure that one isn't getting in there because it has a powerful outlier that is munging up the analysis?

Thanks again to everyone that is helping me on this one. It sure is challenging (and fun!).

Skip

**EDIT** Attachment removed because of errors. The correct spreadsheet is attached in the next post.

12. Whoops, found an error on the spreadsheet I attached last time. I failed to upload the latest version. Anyway, there are some different numbers on the "Without Outlier" tables, and the other spreadsheet should not be used.

I also used the Excel Solver to find the value for that data point that gives the worst (lowest) R2, and that is 2.582.

Of course it is not possible to maximize R2, because R2 continues to increase as the value of the outlier increases. However, it looks like it does approach some limit (around 0.153).
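That limiting behavior can be checked numerically. Below is a sketch (Python/NumPy; the closed-form limit is my own derivation, not something from this thread) that sweeps the suspect point's value, mirroring the Solver experiment, and compares the plateau against a leverage-driven limit: as the outlier's y goes to infinity, r tends to (x_o - x_mean) / sqrt(((n-1)/n) * sum((x - x_mean)**2)).

```python
import numpy as np

x = np.array([4, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 19, 20], dtype=float)
y = np.array([0.107543212, 3.289563593, 2.058949066, 1.590767664,
              0.944430212, 0.974058018, 2.396011242, 0.091316141,
              1.284675088, 1.776876231, 1.890457578, 569.0429437,
              0.632025158])
i = 11                      # index of the X=19 point
n = len(x)

def r2(val):
    """R-squared of the linear fit with the suspect point's y set to val."""
    yy = y.copy()
    yy[i] = val
    return np.corrcoef(x, yy)[0, 1] ** 2

for v in (2.582, 10, 100, 1e4, 1e6):
    print(v, round(r2(v), 4))   # R2 climbs from ~0 toward a plateau

# Closed-form limit as the outlier's value goes to infinity
r_limit = (x[i] - x.mean()) / np.sqrt((n - 1) / n * np.sum((x - x.mean()) ** 2))
print(round(r_limit ** 2, 3))   # ≈ 0.153, the plateau seen in the Solver runs
```

Note the limit depends only on the outlier's position in X (its leverage), not on the other Y values, which is one way to state the mathematical explanation the thread is after.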

I look forward to your comments!

Skip

13. You shouldn't think about the effect removing outliers has on a model - only about whether they need to be removed based on the assumptions of the model you are using. Perhaps I don't understand what you're trying to achieve, but making a decision about whether to remove something based on whether it increases the correlation seems to be fiddling the data, and the result won't be an accurate representation of the population the data came from.

http://en.wikipedia.org/wiki/Mahalanobis_distance

http://en.wikipedia.org/wiki/Cook's_distance

Also search for studentised deleted residuals.

There are different classes of outliers, exerting various kinds of influence over the model: leverage, influence, distance.
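As a concrete illustration of the Cook's distance idea, here is a minimal sketch (Python/NumPy, for the simple linear fit; statsmodels' OLSInfluence provides the same diagnostics off the shelf) applied to the thread's data:

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each point of a simple (one-predictor) OLS fit."""
    n, p = len(x), 2                          # p = number of fitted parameters
    m, b = np.polyfit(x, y, 1)                # OLS slope and intercept
    resid = y - (m * x + b)
    # leverage h_ii for simple regression
    h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    s2 = np.sum(resid ** 2) / (n - p)         # residual variance estimate
    return resid ** 2 / (p * s2) * h / (1 - h) ** 2

x = np.array([4, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 19, 20], dtype=float)
y = np.array([0.107543212, 3.289563593, 2.058949066, 1.590767664,
              0.944430212, 0.974058018, 2.396011242, 0.091316141,
              1.284675088, 1.776876231, 1.890457578, 569.0429437,
              0.632025158])

d = cooks_distance(x, y)
print(np.argmax(d), round(d.max(), 2))   # the X=19 point dominates by far
```

A common screening rule is to flag points with D_i greater than 4/n, which would make this kind of check easy to automate across thousands of line items.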

If there is a strong reason to think the outlying data points are good data, I would keep them. If they are not, then of course lose them.

I'm interested to know why you're using Excel to do data analysis - do you have access to SPSS? It's a big help.

14. You know, that's a good point, and I should have known better. I've been preaching that to my peers here for a long time. I was surprised to see this happen (correlation decreasing after removing a suspected outlier), and thought there must be some mathematical, statistical anomaly going on that I had better be aware of and be ready to fix. For the most part, our data has been scrutinized to the point that it has been accepted, approved, and verified by everyone. You are right: removing a point now just because I don't like what it is doing to the correlation is not the right way to do business.

But now I still wonder: if tests like Pearson's are based on the assumption (fact) that correlation behaves consistently and things like this can't happen, do they lose some credibility? For example, in my opinion, I don't think this particular data set should pass the "Pearson" test. The underlying data has a correlation of almost 0.0, but as soon as you add in one wild outlier, you all of a sudden have a remarkable correlation. It's almost like we need a robust descriptive statistic here, the way the median is strong against outliers while the mean is not. Sure, we can tighten our confidence threshold, and then hopefully things like this won't make it through, but that's gaming the system, in my opinion. It does come down to the fact that it is always important to plot the scatterplot, but sometimes that isn't practical (like when you have thousands of data sets, as I do). In that case, it would be great if the math could stand on its own.

However, I intend to read up more on this. Thank you for the links and other ideas.

15. Oh, and the reason why I use Excel is that everyone in my office is using Excel. I know I could import and export my data to and from a real stats package, but I have to do many on-call, rapid-fire analyses each day for my co-workers, and having to import, crunch, and export every time would just get annoying. I wish it weren't that way, and I think I will look for ways to use these tools for the large projects, but sometimes it just isn't practical.
