One suggestion is transforming the variables back after the regression, but I have never seen a concrete example of how you do that.



Edit: 10 percent of my data is 0 or negative, so taking logs or square roots eliminates that data in practice [SAS makes it missing]. The problem with adding a constant to get rid of the negative values is that there are a few extremely negative points: adding 39,001 to each data point [to get rid of the most extreme point] would, I assume, seriously distort the interpretation of the results.

I am not sure which is best.

Wicklin makes this point:

A criticism of the previous method is that some practicing statisticians don't like to add an arbitrary constant to the data. They argue that a better way to handle negative values is to use missing values for the logarithm of a nonpositive number.

The skew reversed itself when I logged Y [note that about 10 percent of the data was excluded when I did this]: it was then negatively skewed. I also tried taking the square root, which did not eliminate the positive skew.

It's interesting that when I logged the data, which dropped about ten percent of it (those who lost income as a result of the process), two variables that had been significant were no longer so, and one that had not been significant became so. R squared seemed very similar. I am somewhat concerned about dropping all those who lost in the process as missing data; they would substantively seem to be a different group than those that gained from the process. Maybe I should just add a constant.


It would be nice if statisticians could agree for once. I found some who argue that when you have negative numbers you should add a constant to make all numbers positive before logging, and then comments like this:

So if logging won't work with negative numbers (and square roots won't either, obviously), how do you deal with skew if you have negative numbers?

While I am asking, do you do this with slopes as well [so you would square slope results if you have taken the square root as a transformation]?

Adding a constant in order to get only positive values makes it mathematically possible to apply a log transform. But it's almost always a really bad idea. The results you get from that are very sensitive to the choice of the constant being added, and the impact of that on subsequent analyses can be enormous. So unless there is a scientifically justified choice of the constant (or one justified by the data collection procedures) the results may well be meaningless.
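To make the sensitivity concrete, here is a minimal sketch on simulated data (all numbers below are made up for illustration): the fitted slope of a shifted-log regression changes by more than an order of magnitude depending solely on the constant added.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
# Simulated outcome with some negative values (illustrative only)
y = 2.0 * x - 5.0 + rng.normal(0, 2, 200)

def slope_after_shift(x, y, c):
    """OLS slope of log(y + c) on x, after shifting y by the constant c."""
    return np.polyfit(x, np.log(y + c), 1)[0]

c_small = abs(y.min()) + 1       # just enough to make every value positive
c_large = abs(y.min()) + 1000    # a much larger, equally "arbitrary" shift

print(slope_after_shift(x, y, c_small))   # on the order of 0.1
print(slope_after_shift(x, y, c_large))   # orders of magnitude smaller
```

The reason is simple: for a large shift c, log(y + c) is nearly linear in y with gradient 1/c, so the slope shrinks toward zero as c grows, and any interpretation of the coefficient depends entirely on the arbitrary choice of c.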

The presence of negative values in a variable is usually a good sign that taking logs is conceptually inappropriate (unless the negative values are themselves data errors).


While I am asking, do you do this with slopes as well [so you would square slope results if you have taken the square root as a transformation]?

If you use transformed data to calculate statistical values like means, you should back-transform the final results and report them in their original units. To back-transform, you just do the opposite of the mathematical function you used in the first place. For example, if you did a square root transformation, you would back-transform by squaring your end result.
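As a minimal sketch of that back-transformation on simulated (made-up) data: take square roots, average on the transformed scale, and square the result. One caveat worth knowing is that the back-transformed mean is not the arithmetic mean of the original data; it is systematically smaller.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3.0, size=500)   # positively skewed, positive data

ty = np.sqrt(y)         # square-root transformation
mean_t = ty.mean()      # mean computed on the transformed scale
back = mean_t ** 2      # back-transform: apply the inverse function (squaring)

print(back, y.mean())   # back-transformed mean sits below the arithmetic mean
```

The gap between the two is Jensen's inequality at work: the square of the mean of square roots is always at most the mean of the original values, so back-transformed summaries describe something closer to a "typical" value than the arithmetic mean.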


I would take a subsection of your data, perhaps the straightforward positive values, and run the regression; then transform them, rerun the regression using the transformed data, and back-transform the results. Make sure the results are the same and you have your coding right. Next, I would do the same thing but add the constant or secondary transformation part; back-transform and make sure you have everything right. Lastly, I would broaden the dataset to include those values that you are fretting about. This iterative process will help ensure that, when you actually do the transformation, you have everything right.

I am not sure what "everything right" means.

If you add constants, that will change the slope, apparently. And I am not sure, per the disagreement I cited, that the results you get accurately reflect the underlying phenomenon, or even how you would know. Which may be moot: none of the transformations I have used so far (logging, cube and square roots, etc.) has actually gotten rid of the skew.

Looking at the QQ plots and pictures of skew, I am not even sure this is a skewed distribution. It does not fall sharply at the end as depictions of skew commonly do. But I guess it's positive skew.



Looking at the QQ plots and pictures of skew I am not even sure that is a skewed distribution. It does not fall sharply at the end as descriptions of skew commonly do.

As an example, cycle times often follow a lognormal distribution. However, introduce a deadline and you have an instant mixture. There will be one type of behavior before the deadline and an entirely different behavior after the deadline. If you separate the two, you can often model them individually, but cannot model them as a mixture.

I decided to treat it as skew. I removed about a hundred extreme points (which had been forcing me to add a massive constant to make the data non-negative), added a constant to make the smallest DV equal 1, and logged the data. The right skew goes away, but left skew shows up. Since this is a population, I decided to assume future populations would be similar (an unknowable point) and not to worry about normality, which does not bias the slope estimates. I tried several transformations; none removed the non-normality from the data. I am going to do a non-parametric test, if I can find one that handles 14 predictors, and just use the untransformed results.
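The shift-and-log recipe above can be sketched as follows on simulated data (the values are made up for illustration, and whether any leftover skew ends up negative, as it did for me, depends on the data):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
# Right-skewed data with some zero/negative values, loosely mimicking the situation
y = rng.lognormal(mean=2.0, sigma=1.0, size=1000) - 10.0

c = 1.0 - y.min()     # constant chosen so the smallest value becomes exactly 1
ly = np.log(y + c)    # shifted-log transformation

print(skew(y))        # strongly positive before the transformation
print(skew(ly))       # substantially reduced after the shift and log
```

The skewness check before and after is the quick way to see whether the transformation overshot (flipping right skew into left skew) rather than eyeballing histograms alone.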

But two strange things happened. First, the logging, and I guess the removal of a few extreme data points, got rid of the autocorrelation I found without the transformation (the Durbin-Watson is no longer significant). I don't know why this type of transformation would get rid of serial autocorrelation - nothing I have read suggests it does.
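For reference, the Durbin-Watson statistic is easy to compute directly from the residuals; here is a minimal sketch on simulated residuals (the 0.7 autocorrelation and sample size are made up for illustration):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 means no first-order autocorrelation;
    values well below 2 suggest positive serial correlation."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
e_indep = rng.normal(size=500)        # independent residuals
e_ar = np.empty(500)                  # AR(1) residuals with rho = 0.7
e_ar[0] = rng.normal()
for t in range(1, 500):
    e_ar[t] = 0.7 * e_ar[t - 1] + rng.normal()

print(durbin_watson(e_indep))   # close to 2
print(durbin_watson(e_ar))      # well below 2
```

One plausible (though unverified) mechanism for the change: if the apparent autocorrelation was driven by a handful of extreme points sitting near each other in the sort order, removing them or compressing them with the log could pull the statistic back toward 2.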

Also, in the residual plot a strange spear-like set of data appears - some type of pattern, although I have never seen this before. It is the line at the bottom of the residuals, hard to see in this picture. It is obviously created by the transformation.

This type of comment always confuses me. Does this mean that even with the type of very non-normal distribution I have [and ignoring the issue of populations versus samples], non-normality does not impact your ability to determine whether a relationship is statistically significant?

The difference between these two in practice confuses me.

If you are using regression to show what impact a variable has on the dependent variable, are you showing association or prediction? [I assume the former; I am not stating that I can predict what value a customer will have, but instead that this predictor has this impact on the customer results.]

"It is widely but incorrectly believed that the t-test and linear regression are valid only for Normally distributed outcomes. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is Normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution. We demonstrate this validity by simulation in extremely non-Normal data.

We should note that our discussion is entirely restricted to inference about associations between variables. When linear regression is used to predict outcomes for individuals, knowing the distribution of the outcome variable is critical to computing valid prediction intervals."


The second quote says you can model for association and for prediction. However, an association model has predictors too, so that word can be used in both scenarios. It all comes down to what you want to do with your data and model building.

I found this interesting. They created data that was:

It is clear that the standard deviation increases strongly as the mean increases. The data are as far from being Normal and homoscedastic as can be found in any real examples.

Note that for sample sizes of about 500 or more, the coverage for all regression coefficients is quite close to 95%. Thus, even with these very extreme data, least-squares regression performed well with 500 or more observations.

http://www.rctdesign.org/techreports/ARPHnonnormality.pdf
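The coverage claim is easy to check in a simpler setting. Here is a sketch with heavily right-skewed (though homoscedastic) errors - a milder violation than the paper's, and all values made up for illustration - showing 95% confidence intervals for the OLS slope covering at roughly their nominal rate at n = 500:

```python
import numpy as np
from scipy import stats

def coverage(n, reps=2000, beta=1.5, seed=0):
    """Fraction of nominal 95% OLS confidence intervals for the slope that
    contain the true slope when the errors are strongly right-skewed."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.uniform(0, 10, n)
        err = rng.exponential(scale=2.0, size=n) - 2.0   # skewed, mean-zero errors
        y = beta * x + err
        res = stats.linregress(x, y)
        half = stats.t.ppf(0.975, n - 2) * res.stderr    # CI half-width
        if res.slope - half <= beta <= res.slope + half:
            hits += 1
    return hits / reps

print(coverage(500))   # close to the nominal 0.95 despite the non-Normal errors
```

This is the central limit theorem doing the work: the sampling distribution of the slope estimate is approximately Normal at large n even though the errors are not.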

I assume it is because it's common not to have data sets with thousands of records in academic research.

I spent a lot of time in the last few years learning how to capture and correct violations of methods specifically because of the concern raised....

It's better to have people concerned about the assumptions even if they might not be necessary in every single case than it is to have people blindly ignore the assumptions all the time.

One thing I had not thought of is that many of the alternatives, such as robust methods, have assumptions themselves that, when violated, can create significant issues. And transformations, which in my case did not work well although I used most of the recommended ones, have serious issues. First, it's hard to interpret the results for the transformed variables; second, in changing the distribution of the data they can distort the results [especially if you are not careful about which one to use, but to some extent always]. This is particularly true with data that has lots of 0's and negative numbers, which my data always has.