# Transformation didn't normalize the data set... now what?

#### FriskyBeetle

##### New Member
I study an insect that has a tendency to produce non-normal data sets. I have transformed my current data set using sqrt, sqrt + 0.5, log, log10, and arcsine (even though this probably isn't appropriate) transformations, but when I check the assumptions on the transformed data set it's still non-normal. Is there another transformation method I could use, or am I just missing something?

```sas
data EmergenceAllt;
    set EmergenceAll;
    TotalperDayt = sqrt(TotalperDay + 0.5);
run;

proc univariate data=EmergenceAllt normal;
    var TotalperDayt;
    probplot TotalperDayt / normal(mu=est sigma=est color=black w=3);
    inset mean median min max skewness kurtosis var probn;
run;
```

#### FriskyBeetle

##### New Member
And here is the code for my residual checks:

```sas
proc glm data=EmergenceAllt;
    class year;
    model TotalperDayt = year;
    output out=resid r=resid;
    means year / hovtest=bf;
run;

proc gplot data=resid;
    plot resid*year;
run;

proc univariate data=resid normal;
    var resid;
    probplot resid / normal(mu=est sigma=est color=black w=3);
    inset mean median min max skewness kurtosis var probn;
run;
```

#### noetsi

##### No cake for spunky
You could look at Tukey's ladder of powers, which suggests a series of power transformations to try.

Another alternative is a non-parametric method.
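For example, a Kruskal-Wallis test (the nonparametric analogue of one-way ANOVA) could be run on the untransformed counts. A sketch, assuming the dataset and variable names from the earlier posts:

```sas
/* Kruskal-Wallis test as a nonparametric alternative to one-way ANOVA.
   Runs on the untransformed counts; dataset and variable names are
   taken from the posts above. With more than two levels of year,
   the WILCOXON option produces the Kruskal-Wallis chi-square test. */
proc npar1way data=EmergenceAll wilcoxon;
    class year;
    var TotalperDay;
run;
```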

You suggest, I think, that the nature of insect behavior leads to non-normal data. How so?

#### Karabiner

##### TS Contributor
> I study an insect that has a tendency to produce non-normal data sets.
Why do you bother whether your data are non-normally distributed?

With kind regards

K.

#### FriskyBeetle

##### New Member

How would I code Tukey's ladder of powers in SAS?

I meant to say that my particular insect has a tendency to produce non-normal data, probably because of its aggregated distribution.

#### FriskyBeetle

##### New Member
> Why do you bother whether your data are non-normally distributed?
Need to meet the assumptions so I can run ANOVA.

#### Karabiner

##### TS Contributor
> Need to meet the assumptions so I can run ANOVA.
ANOVA doesn't assume normally distributed data.
It may assume that the residuals of the model
are normally distributed. Did you refer to the residuals
or to the unconditional data distributions?

And if a sample size is large enough, then non-normality
of residuals doesn't matter much, either.

With kind regards

K.

#### FriskyBeetle

##### New Member
My residuals are non-normal.

Roughly how large would my sample size have to be to reduce the importance of normality? I have ~6500 observations.

#### noetsi

##### No cake for spunky
There is no agreed-upon answer as to how large a sample has to be for the results to be robust, but 6,500 cases would meet most definitions, and it is often argued that ANOVA is highly robust to violations of normality anyway because of the central limit theorem (although some argue that outliers in the tails disrupt ANOVA regardless of sample size; again, this is not an area with complete consensus).

You don't code Tukey's ladder of powers directly. It suggests which transformations to try to deal with non-normality; you then code whichever transformation you choose into SAS. (I don't work with ANOVA in SAS, but usually a proc takes a TRANSFORM= option in the model options plus the keyword for a given transformation.)
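For instance, several rungs of the ladder can be computed side by side in a DATA step and each candidate checked for normality with PROC UNIVARIATE. A sketch using the dataset names from the earlier posts; the +1 offsets and new variable names are illustrative:

```sas
/* Compute several rungs of Tukey's ladder of powers side by side.
   Dataset and variable names follow the earlier posts; the +1 offsets
   are illustrative guards against log(0) or division by zero counts. */
data EmergenceLadder;
    set EmergenceAll;
    t_sqrt  = sqrt(TotalperDay);        /* power 1/2 */
    t_log   = log(TotalperDay + 1);     /* power 0 (log) */
    t_recip = -1 / (TotalperDay + 1);   /* power -1; negated to preserve order */
run;

/* Check each candidate transformation for approximate normality */
proc univariate data=EmergenceLadder normal;
    var t_sqrt t_log t_recip;
run;
```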

If you are familiar with Box-Cox (a family of transformations used in part to induce normality), SAS will do that in PROC TRANSREG: http://support.sas.com/documentatio...efault/viewer.htm#statug_transreg_sect015.htm
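A sketch of a Box-Cox fit in PROC TRANSREG, assuming the dataset and variable names from the earlier posts (the lambda grid is illustrative):

```sas
/* Box-Cox transformation of the response, with lambda chosen by
   maximum likelihood over a grid. Names follow the earlier posts.
   Note: Box-Cox requires a strictly positive response, so zero
   counts would need an offset added first. */
proc transreg data=EmergenceAll;
    model boxcox(TotalperDay / lambda=-2 to 2 by 0.25) = class(year);
run;
```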

I am not certain I would worry about non-normal residuals in ANOVA unless it is truly extreme.

#### Karabiner

##### TS Contributor
> Roughly how large would my sample size have to be to reduce the importance of normality?
50 or so.
> I have ~6500 observations
Well... if you performed formal tests of significance in order to decide
whether your residuals are (non-)normal, then with this sample size
the power of such tests is far too high.
But maybe you used graphical methods (histograms, Q-Q plots...)?

Anyway, results shouldn't be affected by the shape of the residual
distribution with such a large sample size, I suppose.

This depends on how you gathered your data. If you e.g. had 3 insects
which delivered 2160 data points each, then things might be different.

With kind regards

K.

#### FriskyBeetle

##### New Member
> Well... if you performed formal tests of significance in order to decide
> whether your residuals are (non-)normal, then with this sample size
> the power of such tests is far too high.
> But maybe you used graphical methods (histograms, Q-Q plots...)?
>
> Anyway, results shouldn't be affected by the shape of the residual
> distribution with such a large sample size, I suppose.
>
> This depends on how you gathered your data. If you e.g. had 3 insects
> which delivered 2160 data points each, then things might be different.

It sounds like my data set is large enough that it didn't need to be transformed. There were several hundred insects collected over all of those observations.

I did initially run Shapiro-Wilk and Kolmogorov-Smirnov tests along with probability plots to check normality.

> You don't code Tukey's ladder of powers directly. It suggests which transformations to try to deal with non-normality; you then code whichever transformation you choose into SAS.
Is that something that is available in SAS Enterprise? I have only used 9.2 and 9.3 up until this point and have had no exposure to the rest of the SAS suite.

Thank you so much for your help, this puts my mind at ease!