Should you transform skewed data when the distribution is expected?

#1
Hi all,

I have been discussing this topic all week with people in my department and I can't seem to get a straight answer. I was taught long ago that you should not necessarily transform a variable (for use in regression analysis) if the population distribution is expected to be skewed. For example, symptom measures (e.g., posttraumatic stress symptoms, depression symptoms) are expected to have a positively skewed distribution in the general population, and this is typically what we see in sample data. So, if the population distribution is supposed to be positively skewed and your sample data has the expected positive skew, should you transform that variable? Any clarification on this issue would be greatly appreciated!
 

Jake

Cookie Scientist
#2
It's true that you don't always need to transform your variables just because they have funny distributions, but this has nothing whatsoever to do with whether or not you expected the data to have a funny distribution beforehand. Why would it???
 
#3
I think the response to the question would be another question: why would you alter the distribution of scores so that it represents a distribution they do not actually follow in the general population? Isn't that part of the idea of the normal distribution? That in the general population, scores for most things will follow a normal distribution. Well, if you have a variable or measurement that does not follow the normal distribution in the general population, why would you alter it in a smaller sample when it is correctly representing the distribution in the general population?
 

noetsi

Fortran must die
#5
My answer would be that methods like regression assume non-skewed data, and that the results will not be accurate if your data are skewed. One example of how this works: an outlier test that assumes normal data (such as Tukey's boxplot rule) will flag an incorrect number of outliers if the data are highly skewed.
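
To make that concrete, here is a minimal simulation sketch (my own illustration with made-up data, not from any real study):

```python
# Tukey's 1.5*IQR boxplot rule is calibrated for roughly symmetric data;
# on a right-skewed sample it flags far more "outliers" than it should.
import numpy as np

rng = np.random.default_rng(0)

def tukey_outlier_count(x):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))

n = 10_000
normal_sample = rng.normal(size=n)     # symmetric
skewed_sample = rng.lognormal(size=n)  # heavily right-skewed

print("flagged as outliers, normal data:", tukey_outlier_count(normal_sample))
print("flagged as outliers, skewed data:", tukey_outlier_count(skewed_sample))
# Expect roughly 0.7% of the normal sample flagged vs. several percent of
# the skewed one, even though the skewed values are perfectly "real" data.
```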

As far as I know, the fact that the data really reflect a skewed population (as data often do) has absolutely nothing to do with this. It has to do with the assumptions the method makes in doing its analysis.
 
#6
I received the following response to my original question from an old stats professor of mine:

"Regarding transformations, it depends upon what analyses you plan to carry out on the variables. For example, when doing ANOVA with large (greater than 40 per group) sample sizes, there's no need to transform a skewed DV, as the Central Limit Theorem ensures normality of the mean values. Similarly, for regression, the assumption is that the *residuals* are normally distributed, not necessarily the variables. This may result in an analysis where a number of variables are skewed, but there's no need for transformation because the residuals have a normal distribution. Really, transformation is so dependent upon the particulars of your situation, that it's difficult to formulate generalizations."

While I appreciate his response, it doesn't necessarily get at my original question.
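
For what it's worth, the CLT part of his answer is easy to check with a quick simulation (a sketch of my own, not his):

```python
# Sample means of a skewed variable are already close to normal at n = 40
# per group, which is the professor's point about ANOVA on a skewed DV.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n, reps = 40, 5_000
# Exponential "symptom scores": strongly right-skewed at the raw level.
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print("skew of raw scores:      ", stats.skew(rng.exponential(size=100_000)))
print("skew of means at n = 40: ", stats.skew(sample_means))
# Raw skewness is about 2; the skewness of the means shrinks by roughly
# sqrt(n), so the distribution the F-test actually relies on is near-normal.
```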
 

noetsi

Fortran must die
#7
It is also a response other statisticians disagree with. One Stanford professor wrote a book suggesting outliers could totally invalidate ANOVA despite the CLT, even with very large sample sizes (although this is somewhat different from skewness per se). And testing for skewness in the regression data, not the residuals, is commonly recommended in statistics texts and classes. I posted a while back about well-known statistical experts who argued that normality was critical to regression itself, not simply normality of the residuals.
 

DMCH

New Member
#9
I am glad this thread is open because I'm two months out from submitting and freaking out. I have a heavily skewed set of survey replies [which I also expected, given the population] and want to know if I can adequately compare my results with data from the general survey my items were taken from. I expect my survey items to come back weighted more heavily toward certain policy areas than others, but this then seems to violate normality. If I'm not performing regressions with my data, is there any significant issue with using the non-transformed data to compare inter-group means using ANOVA??? Any help would be immensely appreciated. Pronto.
 

noetsi

Fortran must die
#10
ANOVA and regression are essentially the same method (or, more accurately, ANOVA is a specialized form of regression) - although some argue ANOVA is more robust to violations of the normality assumption.

If you use ANOVA, then statistical tests such as p-values assume a normal distribution, as do confidence limits. So if you have non-normal residuals (not just individual variables that are skewed; the key is whether the error is non-normal), this will be an issue.

Generalization to a population is an entirely different issue, I believe. The problem with non-normality has to do with the assumptions and results of the statistics, not with whether you can generalize to the population (that is, whether you have external validity). Statistics deal with random error; they assume you do not have systematic error such as a sample that is not representative of the population.

Why not just run a non-parametric test and an ANOVA and see if the results are very different?
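
Something like this, for instance (a rough sketch; group_a, group_b, and group_c are hypothetical skewed samples standing in for your subsamples):

```python
# Run a one-way ANOVA and its rank-based counterpart (Kruskal-Wallis)
# on the same groups and compare the conclusions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.lognormal(mean=0.0, size=100)  # invented right-skewed data
group_b = rng.lognormal(mean=0.2, size=80)
group_c = rng.lognormal(mean=0.2, size=50)

f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
# If both tests lead to the same conclusion, skewness is probably not
# driving the result; if they diverge, lean on the rank-based test.
```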
 

CB

Super Moderator
#11
If you use ANOVA, then statistical tests such as p-values assume a normal distribution, as do confidence limits. So if you have non-normal residuals (not just individual variables that are skewed; the key is whether the error is non-normal), this will be an issue.
Yes :)

Generalization to a population is an entirely different issue, I believe. The problem with non-normality has to do with the assumptions and results of the statistics, not with whether you can generalize to the population (that is, whether you have external validity). Statistics deal with random error; they assume you do not have systematic error such as a sample that is not representative of the population.
Actually, the point of making an assumption such as normal errors is that, when certain assumptions hold, the estimates from an OLS regression have particular properties as estimates of the population parameters. I.e., when errors are independent, homoscedastic, and have zero mean for any value of X, then OLS regression estimates are unbiased and consistent estimates of the population parameters; they are also efficient in the sense that they are the best of all linear unbiased estimates for the parameters, and the asymptotic distribution of the coefficients will be normal.

If we add the assumption that the errors are normally distributed, then the OLS estimates are the best of all unbiased estimates for the parameters, and furthermore the distribution of the regression coefficients (over repeated samplings) will be normal even for small samples.

If we just want to describe relationships within a sample, with no generalization to a population, I'm not sure there's really any need for distributional assumptions at all.
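
A small simulation sketch of those sampling-distribution claims (my own illustration, invented numbers):

```python
# With skewed (centered exponential) errors, OLS slope estimates are
# still unbiased, and their distribution over repeated samples is
# near-normal even at moderate n - the asymptotic result above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_slope, n, reps = 2.0, 200, 5_000
slopes = np.empty(reps)

for r in range(reps):
    x = rng.uniform(0, 1, size=n)
    errors = rng.exponential(scale=1.0, size=n) - 1.0  # skewed, mean zero
    y = 1.0 + true_slope * x + errors
    slopes[r] = np.polyfit(x, y, deg=1)[0]             # OLS slope estimate

print("mean of slope estimates:", slopes.mean())       # ~2.0: unbiased
print("skew of slope estimates:", stats.skew(slopes))  # ~0: near-normal
```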
 
#12
Yes :)

If we just want to describe relationships within a sample, with no generalization to a population, I'm not sure there's really any need for distributional assumptions at all.
The situation is that my sample was 1000 persons within and 1000 persons outside detention facilities. As a hard-to-reach population, sampling was based on multiple criteria which I could not play with too much. I received 134 responses from one subsample, 95 from the other, and a further 50 from an alternate subsample. Only the first two were randomly selected. I began by analyzing whether there were any differences in the means between groups using independent t-tests, but then realized I was creating a big mess of it. So I am now trying to work out how to proceed so that I can draw valid conclusions about the means and differences between not only my subsamples, but also in comparison to the general population, who I assume are not under supervision at the time the surveys are distributed to them. Sample sizes for the main surveys the items came from are all large and normally distributed. I am not trying to make any causal inferences, only to compare the groups in terms of their mean answers on the items.

Ideas???
 

hlsmith

Less is more. Stay pure. Stay poor.
#13
Per the general theme of this thread, I think it should be remembered that if you transform data to meet a procedure's assumptions, yes, the data are now different and perhaps more normal - but when you interpret the results from the model, you incorporate the transformation into the description. For example, you can say that every unit increase in the log-transformed variable results in such-and-such a change in the dependent variable. You are just creating the best-fitting model for these data, and you are not misleading the reviewer, since they will realize what you did and that the original data can be skewed.
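
For example, a sketch of that interpretation with hypothetical data (the model and numbers are invented for illustration):

```python
# With a log-transformed outcome, a slope b on the log scale means
# roughly a 100*(exp(b) - 1)% change in the original outcome per
# unit increase in the predictor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=300)
y = np.exp(0.5 + 0.1 * x + rng.normal(scale=0.3, size=300))  # skewed DV

model = sm.OLS(np.log(y), sm.add_constant(x)).fit()
b = model.params[1]
print(f"slope on log scale: {b:.3f}")
print(f"each 1-unit increase in x multiplies y by ~{np.exp(b):.3f}, "
      f"i.e. about a {100 * (np.exp(b) - 1):.1f}% change")
```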
 

noetsi

Fortran must die
#14
Which is a major point, often forgotten, about transformation: the transformed data are commonly not easy to interpret. I have seen suggestions not to transform data, even when there are problems, for exactly that reason.

I.e., when errors are independent, homoscedastic, and have zero mean for any value of X, then OLS regression estimates are unbiased and consistent estimates of the population parameters; they are also efficient in the sense that they are the best of all linear unbiased estimates for the parameters, and the asymptotic distribution of the coefficients will be normal.

If we add the assumption that the errors are normally distributed, then the OLS estimates are the best of all unbiased estimates for the parameters, and furthermore the distribution of the regression coefficients (over repeated samplings) will be normal even for small samples.
I understand that is the assumption of OLS. But as someone who came up through survey research, not statistics, I have always had a problem with this. I would think all of those could be true of a convenience sample - and that absolutely cannot be generalized to the population as a whole. I don't see, regardless of whether these assumptions are met or not, how estimates can be unbiased if the data you are working from do not represent the population as a whole.

This type of issue came up on my master's comps in the context of IRT (which uses logistic regression, but which separately assumes that, regardless of who is measured, you can generalize to the population even with very biased samples - or at least many portray it this way). One of my professors strongly disagreed with this assumption, arguing that if the sample is biased the results would be too, regardless of whether the assumptions of the method are met.

I think statistical methods in general tend to ignore external validity issues. Regardless of whether the assumptions of the method are met, I don't believe you can generate unbiased estimates of a population without a representative sample. Statistics is not magic....
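
A toy simulation of that point (my own sketch, invented numbers):

```python
# Even when within-sample statistics are perfectly well-behaved, a
# non-representative sample gives a biased estimate of the population mean.
import numpy as np

rng = np.random.default_rng(5)
# Population: two subgroups with different score levels.
low = rng.normal(10, 2, size=80_000)
high = rng.normal(20, 2, size=20_000)
population = np.concatenate([low, high])

random_sample = rng.choice(population, size=500, replace=False)
# Convenience sample: over-recruits the high-scoring subgroup 50/50.
convenience = np.concatenate([rng.choice(low, 250), rng.choice(high, 250)])

print("population mean:        ", population.mean())     # ~12
print("random sample mean:     ", random_sample.mean())  # ~12
print("convenience sample mean:", convenience.mean())    # ~15: biased
```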

Heresy, heresy :p
 

BGM

TS Contributor
#15
I.e., when errors are independent, homoscedastic, and have zero mean for any value of X, then OLS regression estimates are unbiased and consistent estimates of the population parameters; they are also efficient in the sense that they are the best of all linear unbiased estimates for the parameters, and the asymptotic distribution of the coefficients will be normal.

If we add the assumption that the errors are normally distributed, then the OLS estimates are the best of all unbiased estimates for the parameters, and furthermore the distribution of the regression coefficients (over repeated samplings) will be normal even for small samples.
I believe that this paragraph just briefly describes the standard theoretical results of OLS regression. I think you cannot mix this up with the sampling method - the "bias" in sampling is different from the bias in the quote above.
 

noetsi

Fortran must die
#16
A good point. My argument was that, in practice, treatments of regression tend to suggest that if the assumptions of the method are met, it will produce unbiased estimates of the population - entirely ignoring the need for a representative sample. Issues of external validity routinely get ignored in statistical treatments (at least based on my observations over the years). That is, they fail entirely to address external bias, leaving the reader to believe that the results will be entirely unbiased if the assumptions are met, regardless of the sample.

It would be better if statistical texts distinguished between methodological bias tied to violations of assumptions and bias tied to faulty survey methods. But they don't because, I believe, design issues are largely ignored in most statistical texts.
 

CB

Super Moderator
#17
I understand that is the assumption of OLS. But as someone who came up through survey research, not statistics, I have always had a problem with this. I would think all of those could be true of a convenience sample - and that absolutely cannot be generalized to the population as a whole. I don't see, regardless of whether these assumptions are met or not, how estimates can be unbiased if the data you are working from do not represent the population as a whole.
I see where you're coming from, but I think the real question here is: Are the distributional assumptions of regression really likely to be met with a convenience sample?