t, Z, and s; known vs estimated sigma

#1
I see "if sigma is unknown, use t" here.
1. When do we know sigma? I can't remember a time.
2. With n > ? (30, 100, or 1000?), it seems that the t distribution is ~ z for any practical question. Is there an n threshold?

Thanks;
joe b.
 
#2
I think Z-tests are more common in non-linear regression or generalized linear models, where they arise as Wald tests. I just go with the default behavior of the software. I think it is generally understood that this is anti-conservative, though, and relies on large samples.

As far as a threshold goes, I think computers make the question irrelevant, since the software will pretty much make this decision for you. If you were working off tables, then I guess you'd just pull from the z-table at some point. The table in the back of my linear models book only goes to df = 120 and has a gap from 60 to 120.
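For what it's worth, you can see both conventions side by side in R (a quick made-up example, nothing from this thread):

# hypothetical data, just to show the two summary conventions
set.seed(1)
x  <- rnorm(50)
y1 <- 2 + 0.5 * x + rnorm(50)               # continuous outcome for lm
y2 <- rbinom(50, 1, plogis(0.5 * x))        # binary outcome for glm
summary(lm(y1 ~ x))$coefficients                        # reports "t value" and Pr(>|t|), residual df = n - 2
summary(glm(y2 ~ x, family = binomial))$coefficients   # reports Wald "z value" and Pr(>|z|)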
 

obh

#5
Hi Joe,

When you don't know the standard deviation, you should use the t-test, which uses the sample standard deviation.
The t distribution has heavier tails to account for the extra uncertainty from estimating the standard deviation from the sample.
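You can see the heavier tails directly: for the same statistic, the t distribution puts more probability beyond it than the normal does (a toy check, numbers made up):

2 * pt(-2.5, df = 10)   # two-sided tail probability under t with 10 df, about 0.031
2 * pnorm(-2.5)         # under the normal, about 0.012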

A long time ago, before computers, people used tables.
The normal distribution had the advantage of a more detailed table, since you don't need a separate table for every df as you do with the t distribution.

As df tends to infinity, the t-distribution tends to the normal distribution.
At df = 30 the t-distribution is not too far from the z-distribution (but still not exact).
Maybe the more detailed z-table compensated for the less accurate z approximation.
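To put a rough number on "not too far", here is a quick check with qt()/qnorm() (the 0.975 quantile is the two-sided 95% critical value):

df <- c(5, 10, 30, 100, 1000)
round(cbind(df, t_crit = qt(0.975, df), z_crit = qnorm(0.975)), 4)
# at df = 30 the t critical value is about 2.042 vs 1.960 for z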

But today you should use the t distribution.

The average of a sample of size 30 tends to be approximately normally distributed per the CLT, but that is a different story ...
 
#6
I tried to go deeper into this question by looking at the R documentation for pt()/qt(). Apparently it also uses a normal approximation in the tails. They give a reference to an ASM article. It comes down to some computer-science details about how quickly various things can be calculated; it was pretty complicated, so I didn't fully follow it, but suffice it to say it is pretty accurate. Looking into the C code was not enlightening for me, a lot of 'SEXP' and such. I was hoping to find an 'if n > xx then use normal' statement, but no such luck.
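One crude empirical check, without reading the C code (just a sketch of how close they get, not what R does internally):

x <- seq(-5, 5, by = 0.1)
max(abs(pt(x, df = 1e6) - pnorm(x)))   # essentially zero at a huge df
max(abs(pt(x, df = 30)  - pnorm(x)))   # visibly larger at df = 30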
 

obh

#7
Hi Fed,

Even if R puts the cutoff at df = 1000, you might decide to put it at 5000.
It is probably a trade-off between precision and performance; if I remember correctly, R's distribution functions give about 7 digits of precision.
So if df = 121 already gives 7 digits of precision and the normal calculation is faster than the t calculation, it is more efficient to use the normal distribution.
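A rough way to check that kind of trade-off yourself (a sketch; the specific df values are just examples, not documented R behaviour):

x <- seq(-8, 8, by = 0.01)
max(abs(pt(x, df = 1000) - pnorm(x)))           # how close t and normal are at df = 1000
system.time(for (i in 1:2000) pt(x, df = 1000))
system.time(for (i in 1:2000) pnorm(x))         # typically faster than pt()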

As a user, if you use the sample standard deviation you should just use the t distribution.