Alternatives for Non-normality and Inequality of Variance?

#1
I've been working with a rather difficult data set for over a week with no real progress. I am trying to compare the effect of temperature (15, 20, 25, 28, 30 degrees) on development time. The problem is that the data are very non-normal and the variances are unequal despite many transformations. What I have observed is that as temperature decreases, the variance increases quite substantially. For example, at 30 degrees the organism basically develops in 7 or 8 days, but at 15 degrees development can take anywhere from 22 to 28 days, etc. I've looked at running Kruskal-Wallis and Welch ANOVA, but am still too concerned about the assumptions. Any advice? Here is my SAS code if anyone wants to see what the issues are. Thanks!

data dev1;
input id duration temp;
datalines;
1 8 30
2 8 30
3 8 30
4 8 30
5 7 30
6 8 30
7 8 30
8 8 30
9 8 30
10 8 30
11 8 30
12 8 30
13 8 30
14 8 30
15 8 30
16 8 30
17 8 30
18 8 30
19 8 30
20 8 30
21 8 30
22 7 30
23 7 30
24 7 30
25 7 30
26 7 30
27 7 30
28 7 30
29 7 30
30 7 30
31 7 30
32 6 30
33 6 30
34 7 30
35 6 30
36 6 30
37 7 30
38 7 30
39 7 30
40 7 30
41 7 30
42 7 30
43 7 30
44 7 30
45 7 30
46 7 30
47 7 30
48 7 30
49 8 30
50 7 30
51 7 30
52 7 30
53 7 30
54 7 30
55 7 30
56 7 30
57 7 30
58 7 30
59 7 30
60 7 30
61 7 30
62 7 30
63 7 30
64 7 30
65 7 30
66 7 30
67 7 30
68 7 30
69 7 30
70 7 30
71 7 30
72 7 30
73 7 30
74 7 30
75 7 30
76 7 30
77 7 30
78 7 30
79 8 30
80 8 30
81 8 30
82 7 30
83 7 30
84 7 30
85 7 30
86 7 30
87 6 30
88 6 30
89 6 30
90 6 30
91 6 30
92 6 30
93 7 30
94 6 30
95 7 30
96 6 30
97 6 30
98 7 28
99 6 28
100 6 28
101 6 28
102 6 28
103 6 28
104 6 28
105 6 28
106 6 28
107 6 28
108 6 28
109 6 28
110 6 28
111 6 28
112 6 28
113 6 28
114 6 28
115 6 28
116 6 28
117 6 28
118 6 28
119 6 28
120 7 28
121 6 28
122 6 28
123 7 28
124 6 28
125 6 28
126 6 28
127 6 28
128 6 28
129 6 28
130 7 28
131 7 28
132 6 28
133 6 28
134 6 28
135 6 28
136 6 28
137 7 28
138 7 28
139 7 28
140 7 28
141 7 28
142 7 28
143 6 28
144 7 28
145 7 28
146 7 28
147 6 28
148 6 28
149 6 28
150 6 28
151 6 28
152 6 28
153 6 28
154 7 28
155 7 28
156 7 28
157 7 28
158 7 28
159 7 28
160 6 28
161 6 28
162 6 28
163 7 28
164 6 28
165 6 28
166 6 28
167 7 28
168 7 28
169 7 28
170 7 28
171 6 28
172 7 28
173 7 28
174 7 28
175 7 28
176 7 28
177 7 28
178 7 28
179 7 28
180 7 28
181 7 28
182 8 28
183 7 28
184 7 28
185 7 28
186 7 28
187 7 28
188 8 28
189 7 28
190 6 28
191 6 28
192 7 28
193 7 28
194 6 28
195 6 28
196 6 28
197 6 28
198 6 28
199 6 28
200 6 28
201 6 28
202 6 28
203 6 28
204 6 28
205 6 28
206 6 28
207 6 28
208 6 28
209 6 28
210 7 28
211 7 28
212 6 28
213 6 28
214 6 28
215 6 28
216 6 28
217 6 28
218 6 28
219 6 28
220 6 28
221 6 28
222 6 28
223 6 28
224 6 28
225 6 28
226 6 28
227 7 28
228 7 28
229 7 28
230 7 28
231 7 28
232 6 28
233 6 28
234 6 28
235 6 28
236 6 28
237 6 28
238 6 28
239 6 28
240 6 28
241 6 28
242 6 28
243 6 28
244 6 28
245 6 28
246 6 28
247 6 28
248 6 28
249 6 28
250 6 28
251 6 28
252 6 28
253 6 28
254 6 28
255 6 28
256 6 28
257 6 28
258 6 28
259 6 28
260 6 28
261 6 28
262 6 28
263 6 28
264 6 28
265 6 28
266 6 28
267 6 28
268 6 28
269 6 28
270 6 28
271 6 28
272 6 28
273 6 28
274 6 28
275 6 28
276 6 28
277 6 28
278 6 28
279 6 28
280 6 28
281 6 28
282 6 28
283 6 28
284 6 28
285 6 28
286 6 28
287 6 28
288 6 28
289 6 28
290 6 28
291 6 28
292 8 25
293 7 25
294 7 25
295 7 25
296 7 25
297 7 25
298 7 25
299 7 25
300 7 25
301 7 25
302 7 25
303 7 25
304 7 25
305 7 25
306 7 25
307 7 25
308 8 25
309 8 25
310 8 25
311 8 25
312 8 25
313 8 25
314 8 25
315 8 25
316 8 25
317 8 25
318 8 25
319 8 25
320 8 25
321 8 25
322 8 25
323 9 25
324 7 25
325 8 25
326 8 25
327 8 25
328 7 25
329 8 25
330 8 25
331 8 25
332 8 25
333 9 25
334 8 25
335 8 25
336 8 25
337 7 25
338 9 25
339 8 25
340 8 25
341 8 25
342 8 25
343 8 25
344 7 25
345 8 25
346 8 25
347 8 25
348 8 25
349 8 25
350 8 25
351 8 25
352 8 25
353 8 25
354 8 25
355 7 25
356 7 25
357 7 25
358 7 25
359 7 25
360 8 25
361 7 25
362 7 25
363 8 25
364 8 25
365 8 25
366 8 25
367 9 25
368 8 25
369 8 25
370 8 25
371 8 25
372 9 25
373 7 25
374 8 25
375 8 25
376 8 25
377 8 25
378 8 25
379 8 25
380 8 25
381 8 25
382 8 25
383 8 25
384 8 25
385 8 25
386 8 25
387 8 25
388 8 25
389 8 25
390 8 25
391 7 25
392 7 25
393 7 25
394 7 25
395 7 25
396 7 25
397 7 25
398 7 25
399 7 25
400 7 25
401 7 25
402 7 25
403 7 25
404 7 25
405 7 25
406 7 25
407 7 25
408 7 25
409 7 25
410 7 25
411 7 25
412 7 25
413 8 25
414 8 25
415 8 25
416 8 25
417 8 25
418 8 25
419 8 25
420 7 25
421 7 25
422 7 25
423 7 25
424 7 25
425 7 25
426 7 25
427 7 25
428 7 25
429 7 25
430 7 25
431 7 25
432 7 25
433 7 25
434 7 25
435 7 25
436 7 25
437 8 25
438 8 25
439 8 25
440 7 25
441 7 25
442 7 25
443 7 25
444 7 25
445 7 25
446 7 25
447 7 25
448 7 25
449 7 25
450 7 25
451 7 25
452 7 25
453 7 25
454 7 25
455 7 25
456 7 25
457 7 25
458 7 25
459 7 25
460 7 25
461 7 25
462 7 25
463 7 25
464 7 25
465 7 25
466 7 25
467 7 25
468 7 25
469 7 25
470 7 25
471 8 25
472 8 25
473 8 25
474 8 25
475 8 25
476 8 25
477 8 25
478 8 25
479 8 25
480 7 25
481 7 25
482 7 25
483 7 25
484 7 25
485 7 25
486 7 25
487 7 25
488 7 25
489 7 25
490 7 25
491 7 25
492 7 25
493 7 25
494 7 25
495 7 25
496 7 25
497 7 25
498 7 25
499 7 25
500 7 25
501 7 25
502 7 25
503 7 25
504 7 25
505 7 25
506 7 25
507 7 25
508 7 25
509 8 25
510 8 25
511 8 25
512 8 25
513 8 25
514 8 25
515 8 25
516 8 25
517 8 25
518 8 25
519 8 25
520 11 20
521 11 20
522 12 20
523 12 20
524 11 20
525 11 20
526 12 20
527 11 20
528 11 20
529 11 20
530 11 20
531 11 20
532 12 20
533 11 20
534 11 20
535 12 20
536 10 20
537 10 20
538 11 20
539 10 20
540 10 20
541 11 20
542 11 20
543 11 20
544 11 20
545 12 20
546 12 20
547 12 20
548 12 20
549 13 20
550 12 20
551 12 20
552 13 20
553 13 20
554 12 20
555 13 20
556 12 20
557 12 20
558 12 20
559 12 20
560 13 20
561 13 20
562 12 20
563 13 20
564 12 20
565 14 20
566 12 20
567 13 20
568 12 20
569 12 20
570 11 20
571 11 20
572 11 20
573 11 20
574 11 20
575 11 20
576 11 20
577 11 20
578 11 20
579 11 20
580 11 20
581 11 20
582 12 20
583 11 20
584 12 20
585 11 20
586 11 20
587 11 20
588 12 20
589 13 20
590 12 20
591 12 20
592 12 20
593 12 20
594 12 20
595 12 20
596 13 20
597 12 20
598 13 20
599 13 20
600 13 20
601 12 20
602 12 20
603 13 20
604 12 20
605 11 20
606 11 20
607 11 20
608 10 20
609 11 20
610 11 20
611 11 20
612 12 20
613 12 20
614 13 20
615 12 20
616 13 20
617 12 20
618 12 20
619 13 20
620 12 20
621 25 15
622 24 15
623 24 15
624 23 15
625 23 15
626 23 15
627 23 15
628 24 15
629 24 15
630 23 15
631 25 15
632 25 15
633 25 15
634 25 15
635 23 15
636 25 15
637 25 15
638 23 15
639 25 15
640 25 15
641 25 15
642 26 15
643 25 15
644 26 15
645 24 15
646 26 15
647 25 15
648 25 15
649 26 15
650 26 15
651 26 15
652 24 15
653 24 15
654 23 15
655 22 15
656 23 15
657 22 15
658 22 15
659 24 15
660 24 15
661 24 15
662 23 15
663 23 15
664 25 15
665 22 15
666 24 15
667 24 15
668 25 15
669 24 15
670 24 15
671 29 15
672 25 15
673 25 15
674 24 15
675 26 15
676 26 15
677 25 15
678 26 15
679 25 15
680 24 15
681 26 15
682 25 15
683 25 15
684 26 15
685 25 15
686 26 15
687 26 15
688 26 15
689 26 15
690 26 15
691 26 15
692 26 15
693 26 15
694 26 15
695 26 15
696 27 15
697 27 15
698 27 15
699 27 15
700 27 15
701 27 15
702 27 15
703 25 15
704 25 15
705 26 15
706 26 15
707 26 15
708 26 15
709 26 15
710 26 15
711 26 15
712 26 15
713 26 15
714 26 15
715 26 15
716 26 15
717 26 15
718 27 15
719 27 15
720 27 15
721 25 15
722 25 15
723 25 15
724 25 15
725 25 15
726 25 15
727 25 15
728 25 15
729 25 15
730 25 15
731 26 15
732 26 15
733 26 15
734 26 15
735 26 15
736 26 15
737 26 15
738 27 15
739 27 15
740 26 15
741 26 15
742 26 15
743 26 15
744 26 15
745 26 15
746 26 15
747 26 15
748 26 15
749 26 15
750 26 15
751 26 15
752 26 15
753 26 15
754 26 15
755 27 15
756 27 15
757 27 15
758 27 15
759 26 15
760 26 15
761 26 15
762 26 15
763 26 15
764 26 15
765 26 15
766 26 15
767 26 15
768 26 15
769 26 15
770 26 15
771 26 15
772 26 15
773 27 15
774 27 15
775 27 15
776 27 15
777 27 15
;

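/* Normality tests and descriptive statistics for duration within each temperature */
proc univariate data=dev1 normaltest;
class temp;
var duration;
run;

/* One-way ANOVA with a homogeneity-of-variance test (Levene by default) and Welch's ANOVA */
proc glm data=dev1;
class temp;
model duration=temp;
means temp / hovtest welch;
run;
quit;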
 

rogojel

TS Contributor
#2
hi,
It is nice that you showed us the data! The first, obvious question is about the goal of the analysis: do you just want to demonstrate the relationship between temp and duration, or do you want to build a predictive model?

Second: there seem to be two issues with your data. If this is in time order, then you have a strong grouping (high temps only in a short period) and you cannot exclude confounding factors, such as something else besides temperature also differing at the time of the measurement. Also, you have a large gap in the temperatures between about 15 and 22. Obviously, this will be problematic for predictions.

If you only want a generic proof that higher temps are linked to lower durations, you could, for instance, group the temperatures into 3 classes (High, Med, Low) and run an ANOVA or some non-parametric variant (like Kruskal-Wallis). You have enough data that the lower power of the non-parametric test will not matter, and the effect is also quite clear.
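
For reference, a minimal SAS sketch of the Kruskal-Wallis route. It assumes the dev1 data set from the first post and uses the five temperatures as they are rather than regrouping them. The dscf option (Dwass-Steel-Critchlow-Fligner pairwise comparisons) is an assumption about what your SAS/STAT release supports; drop it if it is not recognized.

```sas
/* Kruskal-Wallis test of duration across temperature groups.
   With more than two CLASS levels, the WILCOXON option reports
   the Kruskal-Wallis chi-square test. */
proc npar1way data=dev1 wilcoxon dscf;
  class temp;
  var duration;
run;
```

To try the three-class idea instead, a grouped temperature variable would have to be created in a data step first and used in the CLASS statement.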

If you want predictions you should take care of that gap first IMO.

regards
 
#3
I was fitting a nonlinear model (Lactin/Briere) to describe the relationship between temperature and developmental rate, hence the clustering of high temperatures to capture the peak of the curve. With the analysis I posted, I simply want to show a difference in development time at each temperature through some sort of multiple comparison test (Dunn's/Games-Howell), but I can't run any ANOVA/KW/Welch test because of the data assumptions.
 

noetsi

Fortran must die
#7
Non-normality is not a major issue when you have at least 30 data points, because of the central limit theorem (some say 40, others higher). You can transform the data to make it normal if you like (Box-Cox transformations are sometimes useful), or run a non-parametric test. If you mean heteroscedasticity, you can use transformations, WLS (if you know the source of the problem), or a robust SE (I think White's is recommended).

Neither of these affects the point estimate, only the statistical test. I do not think you can split the data into three levels of the dependent variable and run ANOVA, which requires a linear DV. You could use ordinal or multinomial logistic regression for that.
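
One way to act on the unequal-variance point in SAS is to let each temperature have its own residual variance, which amounts to a weighted analysis with the weights estimated from the data. This is only a sketch, assuming the dev1 data set from the first post. The choice of PROC MIXED and of the Tukey-Kramer adjustment is an assumption for illustration, not something anyone in the thread ran.

```sas
/* One-way model with a separate residual variance for each temperature */
proc mixed data=dev1;
  class temp;
  model duration = temp / ddfm=satterthwaite;
  repeated / group=temp;              /* heterogeneous variances by temp */
  lsmeans temp / pdiff adjust=tukey;  /* pairwise comparisons, Tukey-Kramer adjusted */
run;
```

The covariance parameter table then shows one variance estimate per temperature, which also makes the size of the heteroscedasticity visible.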
 

Miner

TS Contributor
#8
You can also transform your response variable (i.e., duration). Using Minitab, I applied a Box-Cox transform to the response, then analyzed the results using a one-way ANOVA followed by a Tukey post-hoc test. The transform corrected the heteroskedasticity in the residuals. You can also repeat the analysis using regression on the transformed response.
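
A rough SAS equivalent of this workflow might look like the following. It assumes the dev1 data set from the first post. The macro variable lambda and its value of 0.5 are placeholders only, not the value Minitab found, and dev1_bc and duration_bc are hypothetical names.

```sas
%let lambda = 0.5;   /* placeholder; substitute the lambda you settle on */

/* Box-Cox transform of the response:
   log(y) when lambda is 0, otherwise (y**lambda - 1)/lambda */
data dev1_bc;
  set dev1;
  if &lambda = 0 then duration_bc = log(duration);
  else duration_bc = (duration**(&lambda) - 1) / (&lambda);
run;

/* One-way ANOVA on the transformed response with Tukey comparisons
   and Levene's test on the transformed scale */
proc glm data=dev1_bc;
  class temp;
  model duration_bc = temp;
  means temp / tukey hovtest=levene;
run;
quit;
```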
 

noetsi

Fortran must die
#9
How did you decide what the correct Box-Cox transformation was, Miner? This is the element of Box-Cox that always confuses me.
 

Miner

TS Contributor
#10
Minitab allows you to set lambda at 0 (natural log), 0.5 (square root), any value between -5 and 5, or allow Minitab to find an optimal value.

I started with Tukey's "Ladder of Powers" and Tukey and Mosteller's "Bulge Rules", focusing on transforming the response to correct for heteroskedasticity, but could not find a standard power transform that worked. Then I tried Box-Cox and allowed Minitab to search for an optimal lambda, which worked.
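
In SAS, a comparable search over lambda can be sketched with PROC TRANSREG. This assumes the dev1 data set from the first post. The grid from -3 to 3 in steps of 0.25 is an arbitrary choice for illustration, and the lambda it selects may differ from Minitab's if the two programs search or round differently.

```sas
/* Box-Cox search for the response, with temperature as a classification effect */
proc transreg data=dev1;
  model boxcox(duration / lambda=-3 to 3 by 0.25) = class(temp);
run;
```

The output lists the log likelihood for each lambda on the grid, so nearby values can be compared before settling on one.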
 
#14
Here is the Box-Cox optimal transform for the Duration response.
Interesting. I had tried a Box-Cox transformation in SAS before posting my question here, and it suggested lambda = -1 (i.e., the reciprocal transformation), which essentially means using developmental rate (1/d) rather than time. I tried that transformation and it did not help the heteroskedasticity. Since that is not a standard power transformation like you suggested, I will use Minitab and see if I get the same results you reached. Thank you very much, this will definitely help me in the future.

Edit: What p-value did you get for the variance test you used? After transforming the data and using Levene's test I got F=2.75 and p=0.02. This is much better than any transformation I had tried, but still not non-significant. I tried other values proposed in that range and X^0.33 seemed to be best, with a p-value of 0.0327.
 

ondansetron

TS Contributor
#15
noetsi wrote: "Non-normality is not a major issue when you have at least 30 data points because of the central limit theorem (some say 40 others higher)."
Just wanted to add that with more than two groups, as in an ANOVA for example, the CLT can't apply at any sample size, so you will always need the two assumptions of a normally distributed DV within the groups and a common variance across the groups. Good points, though!
 

ondansetron

TS Contributor
#16
I would be careful not to read the Levene's test p-value as a measure of effect size, that is, of how unequal the variances actually are in this case.
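
To judge the size of the inequality rather than its significance, it can help to look at the group standard deviations directly. A minimal sketch, assuming the dev1 data set from the first post; to check the scale actually analyzed, run it on the transformed variable instead of duration.

```sas
/* Sample size, mean, standard deviation and variance of duration by temperature */
proc means data=dev1 n mean std var;
  class temp;
  var duration;
run;
```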
 

Miner

TS Contributor
#17
Don't focus too much on the equal variances. ANOVA is similar to regression in that equal variance in the residuals is what really matters. As you can see from the residual plots that I attached, the residuals vs. fitted values plot is fine. Don't worry about the normality plot. The low p-value there is due to the chunky nature of the data collected (durations recorded in whole days). With this number of samples, the normality test is extremely sensitive. Visually, the plot is acceptable.
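
For anyone who wants to produce similar residual plots in SAS rather than Minitab, a minimal sketch follows. It assumes the dev1 data set from the first post and uses the untransformed duration; swap in the transformed response to match the analysis described above. The names resid, fitted and res are hypothetical.

```sas
ods graphics on;

/* Fit the one-way model and save fitted values and residuals */
proc glm data=dev1 plots=diagnostics;
  class temp;
  model duration = temp;
  output out=resid predicted=fitted residual=res;
run;
quit;

/* Residuals versus fitted values; look for roughly equal spread across groups */
proc sgplot data=resid;
  scatter x=fitted y=res;
  refline 0 / axis=y;
run;
```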