I'm trying to produce an automated framework for choosing hypothesis tests and carrying them out. I'm really new to statistics, so sorry if I'm asking really obvious questions, but I've tried for so long to find an answer and am not getting anywhere. This also isn't homework help, so sorry if it's in the wrong sub-forum, but it is a request for help and I'm not exactly an expert yet.

So the problem is that, as we know, in hypothesis testing some tests assume a normal distribution, so you should only use them if your data is normally distributed. Generally, non-parametric tests don't require a normal distribution whereas parametric tests do. Suppose you have more than 2 groups of continuous data which are not paired. If the choice is between ANOVA and Kruskal-Wallis, you should use Kruskal-Wallis unless you know that your data is normally distributed.
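To make the choice concrete, here's a minimal sketch of the two candidate tests side by side using SciPy. The data here is made up by me purely for illustration; nothing about it comes from a real study:

```python
# Illustrative sketch: one-way ANOVA vs. Kruskal-Wallis on three unpaired
# groups of continuous data (simulated here; the numbers are hypothetical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=30)
b = rng.normal(loc=11.0, scale=2.0, size=30)
c = rng.normal(loc=12.0, scale=2.0, size=30)

# Parametric: one-way ANOVA assumes normally distributed groups
# (and equal variances).
f_stat, p_anova = stats.f_oneway(a, b, c)

# Non-parametric: Kruskal-Wallis only needs independent samples and a
# continuous (or at least ordinal) response; it does not assume normality.
h_stat, p_kw = stats.kruskal(a, b, c)

print(f"ANOVA p = {p_anova:.4f}, Kruskal-Wallis p = {p_kw:.4f}")
```

Both calls take the groups as separate arrays and return a test statistic plus a p-value, so swapping one for the other is mechanically trivial; the whole question is which one you're entitled to use.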

My problem is: how can you ever know that your data is normally distributed? I've spent ages looking up tests for normality and there are plenty about - Shapiro-Wilk, Shapiro-Francia, Lilliefors etc. But the problem is that they all have a normal distribution as their null hypothesis. This means, as I understand it, that if you get a p-value from them lower than your significance level (e.g. 0.05) then you have fairly strong evidence that your data is not normal, but if the p-value is higher then you haven't proved anything: your data still may or may not be normal. I wanted to know whether you could test against the null hypothesis that your data isn't normal, so I googled for this but didn't get anywhere. A Stack Overflow thread told me that such a test doesn't even make mathematical sense.
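Here's a small sketch of exactly that asymmetry with Shapiro-Wilk in SciPy, again on data I've simulated myself just to show the direction of the null:

```python
# Illustrative sketch: Shapiro-Wilk has normality as its NULL hypothesis.
# A small p-value rejects normality; a large p-value proves nothing.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=200)    # drawn from a normal distribution
skewed_sample = rng.exponential(size=200)  # clearly non-normal

_, p_normal = stats.shapiro(normal_sample)
_, p_skewed = stats.shapiro(skewed_sample)

# p_skewed should be tiny -> we reject the null of normality.
# p_normal will typically be large -> we merely FAIL to reject; this is
# not positive evidence that the data is normal.
print(f"normal sample p = {p_normal:.4f}, exponential sample p = {p_skewed:.3g}")
```

So the test can only ever tell me "definitely not normal" or "no comment", which is the heart of my question.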

http://stackoverflow.com/questions/...ribution-not-test-for-non-normal-distribution.

It looks to me as though you can never prove with any quantifiable certainty that your data is normally distributed, only that it isn't.

What I don't understand is that this would surely mean that Kruskal-Wallis would always have to be used in favour of ANOVA. In fact, given that the same thing seems to apply to heteroscedasticity tests, it looks like it's a general rule that you have no choice but to always use the test that makes the fewest assumptions, e.g. always use something like the Brunner-Dette-Munk test, which doesn't even assume homoscedasticity.

But people must somehow be finding justification for using tests like ANOVA, especially as I keep seeing advice to use tests like ANOVA if I do know my data is normally distributed, as parametric tests can give you more power. Are people simply choosing ANOVA by looking at their data by eye to see if it looks normal? Or maybe using Bayesian, rather than frequentist, methods? But even then you'd still have to somehow choose a threshold below which you'd fall back to the test which doesn't assume normality, and I don't know how to choose that threshold. Am I missing something here?
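For what it's worth, the power difference people cite is easy to see in a quick Monte Carlo sketch. Everything below is my own hypothetical setup (means, sample sizes, and number of repetitions chosen arbitrarily), just to illustrate what "more power under normality" means in practice:

```python
# Illustrative sketch: estimate how often ANOVA and Kruskal-Wallis reject
# at alpha = 0.05 when the data really are normal with a modest mean shift.
# All parameters here are arbitrary choices for demonstration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, reps, n = 0.05, 500, 20
anova_rejects = 0
kw_rejects = 0

for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.5, 1.0, n)
    c = rng.normal(1.0, 1.0, n)
    if stats.f_oneway(a, b, c).pvalue < alpha:
        anova_rejects += 1
    if stats.kruskal(a, b, c).pvalue < alpha:
        kw_rejects += 1

print(f"ANOVA power ~ {anova_rejects / reps:.2f}, "
      f"Kruskal-Wallis power ~ {kw_rejects / reps:.2f}")
```

Under normality the two come out close, with ANOVA usually slightly ahead, which I assume is why the advice to prefer it exists - but that just brings me back to the question of how anyone ever establishes the normality in the first place.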

Thanks a lot,

Geoff