Skew in data

noetsi

Fortran must die
#1
I have a huge data set; 10,000 points is common. I have significant skew in my data and some large outliers. Some authors argue for addressing skew with transformations, while others caution that this distorts the findings: it can be difficult to understand what your results actually mean substantively after a transformation. Similarly, some argue you should address outliers through various approaches, while others argue that outliers will have limited impact on large data sets.

One suggestion is transforming the variables back after the regression, but I have never seen a concrete example of how you do that.
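For reference, here is a minimal sketch of what "transforming back" often looks like with a logged outcome. This is illustrative Python (statsmodels) on simulated data, not anyone's actual SAS setup; the numbers and variable names are made up.

```python
# Minimal sketch: back-transforming after regressing log(Y) on X.
# Simulated, strictly positive, right-skewed data -- purely illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = np.exp(1.0 + 0.3 * x + rng.normal(scale=0.5, size=n))

X = sm.add_constant(x)
fit = sm.OLS(np.log(y), X).fit()   # the model lives on the log scale

b = fit.params[1]
print(f"slope on the log scale: {b:.3f}")
print(f"back-transformed effect: a 1-unit increase in x multiplies Y by exp(b) = {np.exp(b):.3f}")

# Fitted values can be put back in original units by exponentiating; under
# lognormal errors this is closer to a conditional median than a mean.
pred_original_units = np.exp(fit.predict(X))
```

The caveat in the last comment is part of why the interpretation debate exists: the back-transformed model describes multiplicative effects on a typical (median-like) value of Y rather than additive effects on its mean.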
 

Miner

TS Contributor
#2
I prefer to work with the data in the original, untransformed state for the reasons cited unless a transform is necessary to stabilize variances in the residuals. I would also agree that with a very large sample size, a few outliers would have little impact on the results.
 

hlsmith

Omega Contributor
#3
Would logistic regression be an example of transformed data (log link, base e) for you? Just kidding. Yeah, a few outliers are probably fine, especially if you have 10,000 observations.
 

noetsi

Fortran must die
#4
As I reviewed my data I realized I had quite a few missing points - which raised the issue of what "a few" really is :p
 

noetsi

Fortran must die
#5
I am guessing this is too much skew :p Now I need to find something to deal with it that does not involve transformations... if there is such a thing. Or figure out how you transform the model back. This is actually pretty typical of our cost data.
 

noetsi

Fortran must die
#7
Edit: 10 percent of my data is 0 or negative, so taking logs or square roots eliminates that data in practice [SAS makes it missing]. The problem with adding a constant to get rid of the negative values is that there are a few extremely negative points. Adding 39,001 to each data point [to get rid of the most extreme point] would, I assume, seriously distort the interpretation of the results.

Wicklin makes this point:
A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.
I am not sure which is best.


The skew reversed itself when I logged Y [note: about 10 percent of the data was excluded when I did this]; it was then negatively skewed. I also tried taking the square root, which did not eliminate the positive skew.

It's interesting that when I logged the data, which dropped out about ten percent of the cases (those who lost income as a result of the process), two variables that had been significant were no longer so, and one that had not been significant became so. R squared seemed very similar. I am somewhat concerned about dropping all those who lost in the process as missing data; they would substantively seem to be a different group than those who gained from the process. Maybe I should just add a constant.
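To make the two options being debated concrete, here is a small sketch in Python (statsmodels) on made-up data with zeros and negatives; the constants tried below are arbitrary illustrations, including the 39,001 figure mentioned above.

```python
# Sketch of the two options discussed above, on fake data with nonpositive values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = np.exp(2.0 + 0.4 * x + rng.normal(scale=0.6, size=n)) - 5.0  # some values <= 0

X = sm.add_constant(x)

# Option 1: log only the positive values (the rest become "missing").
keep = y > 0
fit_drop = sm.OLS(np.log(y[keep]), X[keep]).fit()
print(f"dropping nonpositive values: slope = {fit_drop.params[1]:.4f}, "
      f"n used = {keep.sum()} of {n}")

# Option 2: add a constant big enough to make everything positive, then log.
# Note how the slope on the log scale moves with the (arbitrary) choice of constant.
for c in (abs(y.min()) + 1, 100, 39_001):
    fit_shift = sm.OLS(np.log(y + c), X).fit()
    print(f"constant {c:>10.1f}: slope on log scale = {fit_shift.params[1]:.4f}")
```

Neither option is obviously "best" here; the sketch just shows that the two choices can give noticeably different answers.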
 
Last edited:

noetsi

Fortran must die
#8
It would be nice if statisticians could agree for once. I found some who argue that when you have negative numbers you should add a constant to make all numbers positive before logging, and then comments like this:

Adding a constant in order to get only positive values makes it mathematically possible to apply a log transform. But it's almost always a really bad idea. The results you get from that are very sensitive to the choice of the constant being added, and the impact of that on subsequent analyses can be enormous. So unless there is a scientifically justified choice of the constant (or one justified by the data collection procedures) the results may well be meaningless.

The presence of negative values in a variable is usually a good sign that taking logs is conceptually inappropriate (unless the negative values are themselves data errors).
So if logging won't work with negative numbers (and square roots won't either, obviously), how do you deal with skew when you have negative numbers?

While I am asking, do you do this with slopes as well [so you would square the slope estimates if you had used a square root transformation]?

If you use transformed data to calculate statistical values like means, you should back-transform the final results and report them in their original units. To back-transform, you just do the opposite of the mathematical function you used in the first place. For example, if you did a square root transformation, you would back-transform by squaring your end result.
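Here is the quoted rule on toy numbers (illustrative Python, made-up values). As far as I understand it, the rule applies to statistics like means and to fitted values, not to slopes: a slope from a square-root model has no simple back-transformed interpretation, so squaring it is not generally meaningful.

```python
# The quoted back-transform rule on toy numbers: square root forward, square back.
import numpy as np

y = np.array([1.0, 4.0, 9.0, 16.0, 100.0])   # made-up right-skewed values

mean_sqrt_scale = np.sqrt(y).mean()       # statistic computed on the transformed scale
back_transformed = mean_sqrt_scale ** 2   # reported back in the original units

print(back_transformed)   # 16.0 -- not the same as the raw mean
print(y.mean())           # 26.0 -- the arithmetic mean of the untransformed data
```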
 
Last edited:

hlsmith

Omega Contributor
#9
I don't recall running into this problem myself, but below would be my approach per your chatbox posts.


I would take a subset of your data, perhaps the straightforward positive values, run the regression, then transform them and rerun the regression on the transformed data, then back-transform the results. Make sure the results are the same and that you have your coding right. Next, I would do the same thing but add the constant or secondary transformation. Back-transform and make sure you have everything right. Lastly, I would broaden the dataset to include the values that you are fretting about. This iterative process will help ensure, when you actually do the transformation, that you have everything right.
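For example, one way this step-by-step check might look in code (a sketch in Python/statsmodels on simulated data; the real work is in SAS, and the correlation check at each step is just one possible sanity check):

```python
# Sketch of the iterative check described above, on simulated data.
# Step 1: positive values only -- fit raw, fit logged, back-transform, compare.
# Step 2: repeat with the "add a constant" version.
# Step 3: only then broaden to the full data set.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)
y = np.exp(1.5 + 0.25 * x + rng.normal(scale=0.5, size=n)) - 2.0  # includes nonpositives

X = sm.add_constant(x)
pos = y > 0

# Step 1: straightforward positive values.
raw_fit = sm.OLS(y[pos], X[pos]).fit()
log_fit = sm.OLS(np.log(y[pos]), X[pos]).fit()
back = np.exp(log_fit.predict(X[pos]))    # back-transformed fitted values
print("step 1: back-transformed fit tracks raw fit, r =",
      round(np.corrcoef(raw_fit.predict(X[pos]), back)[0, 1], 3))

# Step 2: same idea, but with a constant added so everything is positive.
c = abs(y.min()) + 1
shift_fit = sm.OLS(np.log(y + c), X).fit()
back_shift = np.exp(shift_fit.predict(X)) - c   # undo the log, then undo the shift
print("step 2: back-transformed fit tracks Y, r =",
      round(np.corrcoef(y, back_shift)[0, 1], 3))

# Step 3 would be rerunning the real model on the full data once steps 1-2 behave.
```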
 

noetsi

Fortran must die
#10
I am not sure what "everything right" means :p

If you add constants, that will change the slope, apparently. And I am not sure, per the disagreement I cited, that the results you get accurately reflect the underlying phenomenon, or even how you would know. Which may be moot: none of the transformations I have used so far (logging, cube and square roots, etc.) have actually gotten rid of the skew.

Looking at the QQ plots and pictures of skew, I am not even sure this is a skewed distribution. It does not tail off sharply at the end the way descriptions of skew commonly do. But I guess it's positive skew.
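One way to quantify this instead of eyeballing it (a sketch in Python; the lognormal variable below is just a stand-in for the cost data):

```python
# Put a number on the skew and look at a QQ plot against the normal.
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
y = rng.lognormal(mean=8, sigma=1, size=10_000)   # stand-in for the cost variable

print("sample skewness:", stats.skew(y))   # > 0 suggests right (positive) skew

sm.qqplot(y, line="45", fit=True)   # points bending away from the line = non-normal tails
plt.show()
```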
 
Last edited:

Miner

TS Contributor
#11
Looking at the QQ plots and pictures of skew, I am not even sure this is a skewed distribution. It does not tail off sharply at the end the way descriptions of skew commonly do.
Is there any chance that you have a mixture from different processes, types of customer, etc.?

As an example, cycle times often follow a lognormal distribution. However, introduce a deadline and you have an instant mixture. There will be one type of behavior before the deadline and an entirely different behavior after the deadline. If you separate the two, you can often model them individually, but cannot model them as a mixture.
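A toy illustration of that point (Python; the "deadline" split and the lognormal pieces are entirely hypothetical):

```python
# Sketch: a variable that looks odd as one lump may be a mixture of two processes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.lognormal(mean=2.0, sigma=0.4, size=5_000)   # behavior before the deadline
after = rng.lognormal(mean=3.0, sigma=0.8, size=5_000)    # different behavior after it
combined = np.concatenate([before, after])

# In this toy setup the mixture fits a single lognormal poorly; each piece fits well.
for label, data in [("combined", combined), ("before", before), ("after", after)]:
    shape, loc, scale = stats.lognorm.fit(data, floc=0)
    ks = stats.kstest(data, "lognorm", args=(shape, loc, scale))
    print(f"{label:>9}: KS p-value = {ks.pvalue:.3g}")
```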
 

noetsi

Fortran must die
#12
In theory there is one process. In practice there probably is not, but the variation in processes is not known.

I decided to treat it as skew. I removed about a hundred extreme points (which had forced me to add a massive number to make the data non-negative), added a constant to make the smallest DV equal 1, and logged the data. The right skew goes away, but left skew shows up :p I decided, since this is a population, to assume future populations would be similar (an unknowable point) and not worry about normality, which does not bias the slope estimates. I tried several transformations; none removed the non-normality from the data. I am going to do a non-parametric test, if I can find one that handles 14 predictors, and just use the untransformed results.

But two strange things happened. First, the logging, and I guess the removal of a few extreme data points, got rid of the autocorrelation I found without the transformation (the Durbin-Watson is no longer significant). I don't know why this type of transformation would get rid of serial autocorrelation; nothing I have read suggests it does.

Also, in the residual plot a strange spear-like set of points appears, some type of pattern I have never seen before. It is the line at the bottom of the residuals, hard to see in this picture. It is obviously created by the transformation.
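For what it's worth, here are the mechanics of that Durbin-Watson comparison in Python (simulated data with AR(1) errors built into the log scale, so in this toy setup both fits show autocorrelation; it only illustrates the check, not why logging changed the result):

```python
# Compare the Durbin-Watson statistic for the raw-scale and log-scale fits.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(size=n)

# AR(1) errors to create serial correlation in the simulated series.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal(scale=0.5)
y = np.exp(1.0 + 0.3 * x + e)

X = sm.add_constant(x)
print("DW, raw Y: ", durbin_watson(sm.OLS(y, X).fit().resid))        # ~2 means little first-order autocorrelation
print("DW, log(Y):", durbin_watson(sm.OLS(np.log(y), X).fit().resid))
```

On the spear-like line: hard to say without seeing the plot, but one common cause is a pile-up of identical Y values, for example all the observations that were shifted to the minimum before logging. For those points the residual is just the negative of the fitted value, which traces a straight diagonal line along the lower edge of a residual-versus-fitted plot.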
 

noetsi

Fortran must die
#13
This type of comment always confuses me. Does this mean that even with the type of very non-normal distribution I have [and ignoring the issue of populations versus samples], the lack of normality does not impact your ability to determine whether a relationship is statistically significant?

It is widely but incorrectly believed that the t-test and linear regression are valid only for Normally distributed outcomes. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is Normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution. We demonstrate this validity by simulation in extremely non-Normal data.
The difference between these two in practice confuses me.

We should note that our discussion is entirely restricted to inference about associations between variables. When linear regression is used to predict outcomes for individuals, knowing the distribution of the outcome variable is critical to computing valid prediction intervals.
If you are using regression to show what impact a variable has on the dependent variable, are you showing association or prediction? [I assume the former; I am not saying that I can predict what value a customer will have, but rather that this predictor has this impact on the customer results.]
 
Last edited:

hlsmith

Omega Contributor
#14
Yeah, the first quote just says that if you have, say, 20 people, normality helps, but if you have, say, 150 people, the test is robust.

The second quote says you can model for association and for prediction. However, an association model still has predictors, so that word can be used in both scenarios. It all comes down to what you want to do with your data and model building.
 

noetsi

Fortran must die
#15
If you want to know the impact of the IV on the DV, that is association, right, not prediction?

I found this interesting. They created data that was

It is clear that the standard deviation increases strongly as the mean increases. The data are as far from being Normal and homoscedastic as can be found in any real examples.
and found

Note that for sample sizes of about 500 or more, the coverage for all regression coefficients is quite close to 95%. Thus, even with these very extreme data, least-squares regression performed well with 500 or more observations.
You might be interested in this article, hlsmith, as it deals with public health data...

http://www.rctdesign.org/techreports/ARPHnonnormality.pdf
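A quick simulation in the spirit of that quoted result (a sketch, not the paper's setup: here the errors are heavily skewed lognormal noise with constant variance, and we check how often the usual 95% CI for the slope covers the true value at different sample sizes):

```python
# Coverage of the 95% CI for an OLS slope when the errors are very non-normal.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
true_slope = 0.5
reps = 2_000

for n in (25, 100, 500):
    covered = 0
    for _ in range(reps):
        x = rng.uniform(0, 3, size=n)
        # Heavily right-skewed, mean-zero errors (centered lognormal noise).
        e = rng.lognormal(sigma=1.0, size=n) - np.exp(0.5)
        y = 1.0 + true_slope * x + e
        lo, hi = sm.OLS(y, sm.add_constant(x)).fit().conf_int()[1]
        covered += (lo <= true_slope <= hi)
    print(f"n = {n:4d}: 95% CI coverage for the slope = {covered / reps:.3f}")
```

The pattern to look for is coverage approaching 0.95 as n grows, which is the large-sample validity the article describes.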
 

noetsi

Fortran must die
#16
The more I read today, the less serious violations of assumptions appear to be if you have large data sets. So my question is: why do so many texts strongly stress the danger of violating the assumptions and the need for strict tests of them?

I assume it is because it's common not to have data sets with thousands of records in academic research :p

I spent a lot of time in the last few years learning how to detect and correct violations of assumptions specifically because of the concerns raised....
 

Dason

Ambassador to the humans
#17
It's better to have people concerned about the assumptions even if they might not be necessary in every single case than it is to have people blindly ignore the assumptions all the time.
 

noetsi

Fortran must die
#18
It's better to have people concerned about the assumptions even if they might not be necessary in every single case than it is to have people blindly ignore the assumptions all the time.
I am sure that is true. Of course there are outliers like me, so paranoid about making "mistakes" by violating assumptions that they are reluctant to send anything in :p

One thing I had not thought of is that many of the alternatives, such as robust methods, have assumptions of their own that, when violated, can create significant issues. And transformations, which in my case did not work well even though I used most of the ones recommended, have serious issues. First, it's hard to interpret the results for the transformed variables; and second, in changing the distribution of the data they can distort the results [especially if you are not careful which one to use, but to some extent always]. This is particularly true with data that has lots of 0's and negative numbers, which my data always has.