1. ## Skew in data

I have a huge data set; 10,000 points is common. I have significant skew in my data and some large outliers. Some authors argue for addressing skew with transformations, while others caution that this distorts the findings: it can be difficult to understand what your results actually mean substantively after a transformation. Similarly, some argue you should address outliers through various approaches, while others argue that outliers will have limited impact on large data sets.

One suggestion is transforming the variables back after the regression, but I have never seen a concrete example of how to do that.
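For what it's worth, here is a minimal sketch of what "transforming back" can look like, in Python with made-up cost data (not your actual variables): fit the regression on log(Y), then exponentiate to return to the original units. Note that exponentiating the fitted value estimates the *geometric* mean of Y at that X, not the arithmetic mean, and exp(slope) is a multiplicative effect per unit of X.

```python
import math
import random

random.seed(1)
# hypothetical positively skewed data: y = exp(1.0 + 0.5*x + noise)
x = [random.uniform(0, 4) for _ in range(1000)]
y = [math.exp(1.0 + 0.5 * xi + random.gauss(0, 0.3)) for xi in x]

# fit OLS on the log scale: log(y) = b0 + b1*x
ly = [math.log(yi) for yi in y]
n = len(x)
mx, my = sum(x) / n, sum(ly) / n
b1 = sum((xi - mx) * (li - my) for xi, li in zip(x, ly)) / sum(
    (xi - mx) ** 2 for xi in x
)
b0 = my - b1 * mx

# back-transform a prediction to original units by exponentiating;
# exp(b0 + b1*x) estimates the geometric mean of y at x
pred_original = math.exp(b0 + b1 * 2.0)
# a one-unit increase in x multiplies the typical y by exp(b1)
multiplier = math.exp(b1)
print(round(b1, 3), round(multiplier, 3), round(pred_original, 2))
```

The key interpretive point: after logging, slopes are no longer "a one-unit change in X adds b1 to Y" but "a one-unit change in X multiplies Y by exp(b1)".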

2. ## Re: Skew in data

I prefer to work with the data in the original, untransformed state for the reasons cited unless a transform is necessary to stabilize variances in the residuals. I would also agree that with a very large sample size, a few outliers would have little impact on the results.

3. ## The Following User Says Thank You to Miner For This Useful Post:

noetsi (03-14-2016)

4. ## Re: Skew in data

Would logistic regression be an example of transformed data (natural log link) for you? Just kidding. Yeah, a few outliers is probably fine, especially if you have 10,000 observations.

5. ## The Following User Says Thank You to hlsmith For This Useful Post:

noetsi (03-14-2016)

6. ## Re: Skew in data

As I reviewed my data I realized I had quite a few missing points, which raised the issue of what "a few" really is.

7. ## Re: Skew in data

I am guessing this is too much skew. Now I need to find something to deal with it that does not involve transformations... if there is such a thing. Or learn how you transform the model back. This is actually pretty typical of our cost data.
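One quick way to put a number on "too much skew" is the sample skewness coefficient g1 (third central moment over the 1.5 power of the variance). A rough sketch with hypothetical values standing in for the cost data:

```python
import math

# hypothetical cost values with one long right-tail point
y = [2, 3, 3, 4, 5, 5, 6, 8, 12, 90]
n = len(y)
m = sum(y) / n
m2 = sum((v - m) ** 2 for v in y) / n  # second central moment
m3 = sum((v - m) ** 3 for v in y) / n  # third central moment
g1 = m3 / m2 ** 1.5                    # sample skewness
print(round(g1, 2))
```

Values near 0 suggest symmetry; as a common rule of thumb, magnitudes beyond about 1 indicate substantial skew (the hypothetical data above comes out strongly positive).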

8. ## Re: Skew in data

How does this look on a lognormal plot?

9. ## Re: Skew in data

Edit: 10 percent of my data is 0 or negative, so taking logs or square roots eliminates that data in practice [SAS makes it missing]. The problem with adding a constant to get rid of the negative values is that there are a few extremely negative points. Adding 39,001 to each data point [to get rid of the most extreme point] would seriously distort the interpretation of the results, I assume.

Wicklin makes this point:
A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.
I am not sure which is best.

The skew reversed itself when I logged Y [note: about 10 percent of the data was excluded when I did this]; it was then negatively skewed. I also tried taking the square root, which did not eliminate the positive skew.

It's interesting that when I logged the data, which dropped out about ten percent of the data (those who lost income as a result of the process), two variables that had been significant were no longer so, and one that had not been significant became so. R squared seemed very similar. I am somewhat concerned about dropping all those who lost in the process as missing data; they would substantively seem to be a different group than those who gained from the process. Maybe I should just add a constant.
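A quick illustration of why the choice of constant matters so much, using hypothetical values shaped like the ones described (one extreme negative point): the spread of the logged data, and hence the whole shape of the transformed distribution, changes substantially depending on which constant you add.

```python
import math

# hypothetical gains/losses including zeros and one extreme negative point
y = [-39000, -5, 0, 3, 10, 120, 5000, 60000]

# shifting by different constants before logging gives very different shapes;
# measure the range of the data on the log scale for each candidate constant
spreads = {}
for c in (39001, 40000, 100000):
    shifted = [math.log(yi + c) for yi in y]
    spreads[c] = max(shifted) - min(shifted)
    print(c, round(spreads[c], 3))
```

With the smallest workable constant the logged values span more than 11 log units; with a much larger constant they span less than 1. Any model fit to these would tell quite different stories, which is exactly the sensitivity the quoted critic warns about.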

10. ## Re: Skew in data

It would be nice if statisticians could agree for once. I found some who argue that when you have negative numbers you add a constant to make all numbers positive before logging, and then comments like this:

Adding a constant in order to get only positive values makes it mathematically possible to apply a log transform. But it's almost always a really bad idea. The results you get from that are very sensitive to the choice of the constant being added, and the impact of that on subsequent analyses can be enormous. So unless there is a scientifically justified choice of the constant (or one justified by the data collection procedures) the results may well be meaningless.

The presence of negative values in a variable is usually a good sign that taking logs is conceptually inappropriate (unless the negative values are themselves data errors).
So if logging won't work with negative numbers (and square roots won't either, obviously), how do you deal with skew if you have negative numbers?

While I am asking, do you do this with slopes as well [so you would square slope results if you have taken the square root as a transformation]?

If you use transformed data to calculate statistical values like means, you should back-transform the final results and report them in their original units. To back-transform, you just do the opposite of the mathematical function you used in the first place. For example, if you did a square root transformation, you would back-transform by squaring your end result.
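A concrete version of that last paragraph, with made-up numbers: the back-transformed mean from a square-root transformation is the square of the mean of the square roots, and it is generally smaller than the ordinary arithmetic mean of the raw data.

```python
# hypothetical skewed values; compute the mean on the square-root scale,
# then back-transform by squaring (this is not the arithmetic mean of y)
y = [1.0, 4.0, 9.0, 100.0]
roots = [v ** 0.5 for v in y]           # [1, 2, 3, 10]
mean_root = sum(roots) / len(roots)     # mean on the transformed scale
back = mean_root ** 2                   # back-transformed mean
arith = sum(y) / len(y)                 # ordinary mean, for comparison
print(back, arith)
```

Here the back-transformed mean is 16.0 while the arithmetic mean is 28.5, so the two answer different questions; that gap is part of why interpretation after transformation is tricky.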

11. ## Re: Skew in data

I don't recall running into this problem myself, but below would be my approach per your chatbox posts.

I would take a subsection of your data, perhaps the straightforward positive values, and run the regression; then transform them, rerun the regression using the transformed data, and back-transform the results. Make sure the results are the same and you have your coding right. Next, I would do the same thing but add the constant or secondary transformation part; back-transform and make sure you have everything right. Lastly, I would broaden the dataset to include those values that you are fretting about. This iterative process will help ensure that, when you actually do the transformation, you have everything right.
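A minimal version of that round-trip check (Python, hypothetical values): before trusting any regression on transformed data, verify that the back-transform exactly undoes the transform, including the added-constant variant.

```python
import math

# hypothetical positive subset: check the log / exp round trip
positives = [3.5, 10.0, 250.0, 6100.0]
logged = [math.log(v) for v in positives]
recovered = [math.exp(v) for v in logged]
assert all(abs(a - b) < 1e-9 for a, b in zip(positives, recovered))

# same check with the added-constant variant: log(y + c) back via exp(.) - c
mixed = [-5.0, 0.0, 3.5, 250.0]
c = 6.0  # hypothetical shift making everything positive
shifted_log = [math.log(v + c) for v in mixed]
recovered2 = [math.exp(v) - c for v in shifted_log]
ok = all(abs(a - b) < 1e-9 for a, b in zip(mixed, recovered2))
print(ok)
```

Once the round trip checks out on the easy subset, the same code can be pointed at the full data, which is the spirit of the iterative approach above.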

12. ## The Following User Says Thank You to hlsmith For This Useful Post:

noetsi (03-15-2016)

13. ## Re: Skew in data

I am not sure what "everything right" means.

If you add constants, that will apparently change the slope. And I am not sure, per the disagreement I cited, that the results you get accurately reflect the underlying phenomenon, or even how you would know. Which may be moot: none of the transformations I have used so far (logging, cube and square roots, etc.) have actually gotten rid of the skew.

Looking at the QQ plots and pictures of skew, I am not even sure this is a skewed distribution. It does not fall sharply at the end as descriptions of skew commonly do. But I guess it's positive skew.

14. ## Re: Skew in data

Originally Posted by noetsi
Looking at the QQ plots and pictures of skew I am not even sure that is a skewed distribution. It does not fall sharply at the end as descriptions of skew commonly do.
Is there any chance that you have a mixture from different processes, types of customer, etc.?

As an example, cycle times often follow a lognormal distribution. However, introduce a deadline and you have an instant mixture. There will be one type of behavior before the deadline and an entirely different behavior after the deadline. If you separate the two, you can often model them individually, but cannot model them as a mixture.

15. ## Re: Skew in data

In theory there is one process. In practice there probably is not, but the variation in processes is not known.

I decided to treat it as skew. I removed about a hundred extreme points (which had forced me to add a massive number to make the data non-negative), added a constant to make the smallest DV 1, and logged the data. The right skew goes away, but left skew shows up. I decided, since this is a population, to assume future populations would be similar (an unknowable point) and not worry about normality, which does not bias the slope estimates. I tried several transformations; none remove the non-normality from the data. I am going to do a non-parametric test, if I can find one with 14 predictors, and just use the untransformed results.

But two strange things happened. First, the logging (and, I guess, the removal of a few extreme data points) got rid of the autocorrelation I find without the transformation (the Durbin-Watson is no longer significant). I don't know why this type of transformation would get rid of serial autocorrelation; nothing I have read suggests it does.

Also, in the residual plot a strange spear-like set of data appears, some type of pattern I have never seen before. It is the line at the bottom of the residuals, hard to see in this picture. It is obviously created by the transformation.

16. ## Re: Skew in data

This type of comment always confuses me. Does this mean that even with the type of very non-normal distribution I have [and ignoring the issue of populations versus samples], non-normality does not impact your ability to determine if a relationship is statistically significant?

It is widely but incorrectly believed that the t-test and linear regression are valid only for Normally distributed outcomes. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is Normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution. We demonstrate this validity by simulation in extremely non-Normal data.

The difference between these two in practice confuses me.

We should note that our discussion is entirely restricted to inference about associations between variables. When linear regression is used to predict outcomes for individuals, knowing the distribution of the outcome variable is critical to computing valid prediction intervals.

If you are using regression to show what impact a variable has on the dependent variable, are you showing association or prediction? [I assume the former; I am not stating that I can predict what value a customer will have, but instead that this predictor has this impact on the customer results.]
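The large-sample claim in the first quote is easy to check with a small simulation (a sketch using a strongly right-skewed exponential population rather than the actual cost data): even though individual observations are very non-normal, the usual 95% confidence interval for the mean covers the true value close to 95% of the time once n is large.

```python
import math
import random

random.seed(2)
n, sims = 500, 2000      # sample size per replication, number of replications
true_mean = 1.0          # exponential(rate=1) has mean 1 and is heavily skewed
hits = 0
for _ in range(sims):
    sample = [random.expovariate(1.0) for _ in range(n)]
    m = sum(sample) / n
    s2 = sum((v - m) ** 2 for v in sample) / (n - 1)
    se = math.sqrt(s2 / n)
    # does the usual normal-theory 95% CI cover the true mean?
    if m - 1.96 * se <= true_mean <= m + 1.96 * se:
        hits += 1
coverage = hits / sims
print(round(coverage, 3))
```

The observed coverage lands near the nominal 95%, which is the same kind of demonstration the quoted paper runs on its extremely non-Normal data.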

17. ## Re: Skew in data

Yeah, the first quote just says that if you have, say, 20 people, normality helps, but if you have, say, 150 people, the test is robust.

The second quote says you can model for association and prediction. However, an association model still has predictors, so that word can be used in both scenarios. It all comes down to what you want to do with your data and model building.

18. ## The Following User Says Thank You to hlsmith For This Useful Post:

noetsi (03-18-2016)

19. ## Re: Skew in data

If you want to know the impact of the IV on the DV, that is association, right, not prediction?

I found this interesting. They created data that was:

It is clear that the standard deviation increases strongly as the mean increases. The data are as far from being Normal and homoscedastic as can be found in any real examples.
and found:

Note that for sample sizes of about 500 or more, the coverage for all regression coefficients is quite close to 95%. Thus, even with these very extreme data, least-squares regression performed well with 500 or more observations.
You might be interested in this article, hlsmith, as it deals with public health data...

http://www.rctdesign.org/techreports...nnormality.pdf
