# help with Normality test?

#### nicola

##### New Member
Hi all!

this is probably a simple question; however, my statistics skills got a bit rusty and I cannot find an appropriate solution on the internet...

This is the problem: Let X_1, ..., X_n be a series of values drawn from a normal distribution. All I know about them is their mean u=sum X_i/n and their standard deviation s (n is unknown and can be assumed to be large). I want to compute the likelihood (i.e. a p-value) that X_1, ..., X_n come from the normal distribution N(U,S^2), with U and S known.

I need something like Student's or Welch's t-tests; however, those tests (1) require n to be known and (2) test the hypothesis that two populations have equal means (instead I want to test for both equal mean and equal standard deviation). Problem (1) could probably be solved with the assumption that n is large, so the t distribution tends to N(0,1)...

Can someone help me with this? thank you very much!

#### Dason

Without the sample size I think you're out of luck for what you want to do. At least if I'm understanding you correctly.

#### nicola

##### New Member
I think there should be a solution (it's just that I cannot find it): consider the simpler version of this problem where I just want to take into account the mean u of my data (and just ignore its standard deviation). Then, this can be solved with a Student's t-test. Since I can assume large n, I simply can approximate the t-distribution of the test with N(0,1)...

The problem is: is there something like Student's t-test that takes into account both sample mean and standard deviation?

#### gene2420

##### New Member
Look into normality tests, e.g. the Shapiro-Wilk test.

#### Dason

> Since I can assume large n, I simply can approximate the t-distribution of the test with N(0,1)...

Sure, you get rid of the issue of needing to care about the degrees of freedom, but... how are you getting your t-statistic?

$$T = \frac{\bar{X} - \mu_o}{s/\sqrt{n}}$$

You need n to get the t-statistic.
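To make the dependence on n concrete, here is a minimal C++ sketch of the statistic above. The function name `t_statistic` and the numbers in the usage note are illustrative assumptions, not something posted in the thread at this point.

```cpp
#include <cassert>
#include <cmath>

// One-sample t-statistic T = (xbar - mu0) / (s / sqrt(n)).
// xbar: sample mean, mu0: hypothesized mean,
// s: sample standard deviation, n: sample size (must be known).
// The name t_statistic is hypothetical, chosen for this sketch.
double t_statistic(double xbar, double mu0, double s, double n) {
    assert(s > 0.0 && n > 0.0);  // the statistic is undefined otherwise
    return (xbar - mu0) / (s / std::sqrt(n));
}
```

With the same sample mean and standard deviation, T grows like the square root of n: for example, `t_statistic(21.0, 21.5, 3.0, 25.0)` is about -0.83 while `t_statistic(21.0, 21.5, 3.0, 250.0)` is about -2.64, so the statistic cannot be pinned down unless n is known.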

#### nicola

##### New Member
yes, you are totally right. ok, I'll see if in some way I can derive n from the data I have... thank you!

#### Dason

What data do you have?

#### nicola

##### New Member
Hi! in the end, I managed to obtain the sample size n! this makes everything much easier.

To summarize, this is an example instance of the problem I wish to solve: I know that mean(X_1,...,X_n)=21, stdev(X_1,...,X_n)=3, and n=250. How to compute the likelihood that X_1, ..., X_n have been generated from the distribution N(21.5,4)?

I could perform a Student's or Welch's t-test, but those tests only give me the likelihood that the means are equal, right? Is there a way to compute the likelihood that both mean and standard deviation are the same?

thanks!

#### hlsmith

##### Less is more. Stay pure. Stay poor.
what is the purpose of this endeavor? Does it have to be compared to (21.5, 4) or can you just test whether your data is normally distributed?

In the prior posts, you may have been able to insert a range of n-values and say that your parameters would be normally distributed given n-value = ? - ?.

Currently you can also plot your data, if you actually have them, and overlay a normal distribution with mean 21.5 and SD 4, and visually examine the distributions.

#### nicola

##### New Member
It must be compared to N(21.5,4). I already know the data is normally distributed, so this is not of concern. I could use the test $$T = \frac{\bar{X} - \mu_o}{s/\sqrt{n}}$$ , but this would only include $$\bar{X}$$ in the computation, and not the standard deviation of the sample (instead I want to use also the standard deviation to make the estimate more accurate)

Unfortunately, I need an automatic method to perform this task (I cannot use graphical methods) because I am implementing this as a C++ routine to be called hundreds of times per second... this problem comes from the analysis of DNA sequencing data.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
If you gave N(21.5,4) the same sample size as the comparison sample, and you confirmed the normality assumptions for the t-test, then you can put the other pieces together and do a t-test. Also, if you just gave N(21.5,4) the same sample size, you could run a K-S test to compare the distributions.

#### nicola

##### New Member
I solved the problem. I post the solution here in the hope it will be useful to others.

Again, the problem formulation is:

compute the likelihood of observing sample mean $$\bar\mu$$ and sample standard deviation $$\bar\sigma$$ in $$n$$ samples drawn from the distribution $$N(\mu,\sigma^2)$$

The quantity we are interested in is $$\log P(\bar \mu, \bar\sigma | \mu, \sigma)$$ (I use the log-likelihood since taking the log simplifies notation). Since the sample mean $$\bar \mu$$ and sample variance $$\bar\sigma^2$$ of a sample drawn from a normal distribution are two independent random variables, we have that

$$\log P(\bar \mu, \bar\sigma | \mu, \sigma) = \log P(\bar \mu | \mu, \sigma) + \log P(\bar\sigma | \mu, \sigma)$$

The random variable $$M=\frac{\bar\mu-\mu}{\sigma/\sqrt{n}}$$ is exactly $$N(0,1)$$ distributed, because $$\sigma$$ here is the known standard deviation of the reference distribution (a Student's t-distribution with $$n-1$$ degrees of freedom would arise only if $$\sigma$$ were replaced by the sample standard deviation $$\bar\sigma$$, and for large $$n$$ that also tends to $$N(0,1)$$). Then (applying the definition of the standard normal distribution's density function),

$$\log P(\bar \mu | \mu, \sigma) \approx \log\left(\frac{1}{\sqrt{2\pi}}\exp(-M^2/2)\right) = - \frac{1}{2}\log{2\pi} - \frac{(\bar\mu-\mu)^2n}{2\sigma^2}$$

The random variable $$S=\frac{(n-1)\bar\sigma^2}{\sigma^2}$$ is chi-squared distributed with $$n-1$$ degrees of freedom, so it has mean $$n-1$$ and variance $$2(n-1)$$. Again, we assume $$n$$ to be large, so $$n-1\approx n$$. Then, the distribution of the random variable $$Q=\frac{S-n}{\sqrt{2n}}$$ tends to $$N(0,1)$$ and we have:

$$\log P(\bar\sigma | \mu, \sigma) \approx \log\left( \frac{1}{\sqrt{2\pi}}\exp(-Q^2/2) \right) = - \frac{1}{2}\log{2\pi} - \frac{\left((n-1)\bar\sigma^2-n\sigma^2\right)^2}{4n\sigma^4}$$

note that $$n\approx n-1$$ (n is large), so the above quantity simplifies to
$$- \frac{1}{2}\log{2\pi} - \frac{n^2\left(\bar\sigma^2-\sigma^2\right)^2}{4n\sigma^4} = - \frac{1}{2}\log{2\pi} - \frac{n\left(\bar\sigma^2-\sigma^2\right)^2}{4\sigma^4}$$

Putting it all together, we finally obtain

$$\log P(\bar \mu, \bar\sigma | \mu, \sigma) \approx - \log{2\pi} - \frac{n}{2\sigma^2}\left( (\bar\mu-\mu)^2 + \frac{(\bar\sigma^2-\sigma^2)^2}{2\sigma^2} \right)$$
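Since the thread mentions that this is meant to run as a C++ routine, the final formula might be coded roughly as follows. This is only a sketch: the function name `approx_log_score` and its parameter order are assumptions, and `sigma` is taken to be a standard deviation, not a variance. Note also that the formula scores the standardized statistics $$M$$ and $$Q$$; it leaves out the change-of-variable terms that a density in $$\bar\mu$$ and $$\bar\sigma$$ themselves would carry, so it is best read as a relative score for comparing candidates rather than as a p-value.

```cpp
#include <cassert>
#include <cmath>

// Approximate joint log-score from the final formula above:
//   -log(2*pi) - n/(2*sigma^2) * ((m - mu)^2 + (s^2 - sigma^2)^2 / (2*sigma^2))
// m, s      : sample mean and sample standard deviation
// mu, sigma : mean and standard deviation (not variance) of the reference normal
// n         : sample size, assumed large
// The name approx_log_score is hypothetical, chosen for this sketch.
double approx_log_score(double m, double s, double n, double mu, double sigma) {
    assert(n > 0.0 && sigma > 0.0);
    const double two_pi = 2.0 * std::acos(-1.0);
    const double var = sigma * sigma;   // sigma^2
    const double d_mean = m - mu;       // difference of means
    const double d_var = s * s - var;   // difference of variances
    return -std::log(two_pi)
           - n / (2.0 * var) * (d_mean * d_mean + d_var * d_var / (2.0 * var));
}
```

For the example given earlier in the thread (mean 21, standard deviation 3, n = 250), reading the 4 in N(21.5,4) as a standard deviation, `approx_log_score(21.0, 3.0, 250.0, 21.5, 4.0)` gives about -15.75. If the 4 is meant as a variance instead, pass `sigma = 2.0`, which gives about -107.31. The maximum possible value, reached when the sample statistics match the reference exactly, is $$-\log 2\pi \approx -1.84$$.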