Help with Outliers

#1
Hello fellow scholars and researchers,

I am a first-year doctoral student, and I could use a bit of advice.

It's been a few years since my first graduate degree, and things are a bit rusty. :confused:

I have been working with a dataset for a week now, and I recently read that one of the assumptions of correlations and other tests (I am also running an independent-samples t-test) is that the data are normally distributed. So, I ran both scatterplots and boxplots. On the boxplots, I have 2-3 outliers. I also read that outliers can "mess up" your results, or that the results might not be "true" because of them. Question: Should I take out the outliers? Leave them in? Just report that I found them?
Also, on one of my boxplots, one of the outliers was marked with a * instead of a circle. Thoughts?

Another question, if you have the time. I figured out how to make the scatterplot in SPSS. Now it looks like a bad game of connect the dots, but apparently it means something. How the heck do I interpret it? The points don't seem to follow any sort of linear trend. But then again, I am no expert.

Thanks so much for any input,

Doc Student :wave:
 

gianmarco

TS Contributor
#2
Hi!
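1) I think that the different symbols reflect the type of outlier: mild vs extreme. As far as I know, SPSS draws a circle for a point lying between 1.5 and 3 box lengths (IQRs) beyond the edge of the box, and an asterisk for a point lying more than 3 box lengths beyond it.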
See:
http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
http://bcs.siuc.edu/facultypages/young/ResMethodsStuff/howtoboxplot.html
http://condor.depaul.edu/sjost/lsp121/documents/boxplots.htm
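
To make the mild vs extreme distinction concrete, here is a minimal Python sketch of the usual 1.5 x IQR and 3 x IQR fences. The data and the function name classify_outliers are made up for illustration, and the quartiles come from Python's statistics module, so the cut-offs may differ slightly from the hinges SPSS computes.

```python
import statistics

def classify_outliers(values):
    """Label each value 'ok', 'mild' or 'extreme' using 1.5*IQR and 3*IQR fences."""
    # Quartiles via linear interpolation; SPSS may compute its hinges differently.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for v in values:
        # Distance of v beyond the nearest edge of the box (0 if inside the box).
        distance = max(q1 - v, v - q3, 0)
        if distance > 3 * iqr:
            labels.append("extreme")
        elif distance > 1.5 * iqr:
            labels.append("mild")
        else:
            labels.append("ok")
    return labels

# Hypothetical scores, not the original poster's data.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 29, 60]
labels = classify_outliers(scores)
for value, label in zip(scores, labels):
    if label != "ok":
        print(value, label)
```

With these made-up scores, one value falls in the mild band and one in the extreme band, which corresponds to the circle and the asterisk on the boxplot.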

2) How should you deal with outliers? This is a BIG question and a controversial topic.
For an overview, see:
http://en.wikipedia.org/wiki/Outlier

3) I do not understand what your goal is. What kind of analysis will you perform on your data? What are you interested in?
Non-normality can be a problem for parametric tests (i.e., tests that assume normally distributed data); moreover, outliers can heavily affect parametric correlation (e.g., Pearson's r).
So, after further consideration, if you decide to leave the outliers where they are (i.e., you decide not to drop them), you could use non-parametric tests to reduce the influence of extreme values. As far as descriptive statistics are concerned, you could consider using robust measures, such as the median (to get an idea of the central tendency of your data) or the IQR or Median Absolute Deviation (to get a measure of the amount of variation in your data).
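
As a rough illustration of why robust and rank-based measures help, the Python sketch below compares the mean with the median, computes the Median Absolute Deviation, and compares Pearson's r with a rank-based (Spearman) correlation on made-up data containing one extreme value. The data and the helper names (mad, pearson, ranks, spearman) are assumptions for the example only; in SPSS you would use the built-in procedures instead.

```python
import statistics

def mad(values):
    """Median Absolute Deviation: median distance from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def pearson(x, y):
    """Pearson product-moment correlation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y rises steadily with x, except for one extreme value (60).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 4, 6, 7, 8, 9, 10, 60]

print("mean:", statistics.fmean(y), "median:", statistics.median(y))
print("MAD:", mad(y))
print("Pearson r:", round(pearson(x, y), 3))
print("Spearman rho:", round(spearman(x, y), 3))
```

In this example the single extreme value pulls the mean well above the median and lowers Pearson's r, while the rank-based correlation stays close to 1, because ranks only record the order of the values, not how far apart they are.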


Hope this helps
Regards
Gm