# Thread: Type 1 error in regression with entire populations

1. ## Type 1 error in regression with entire populations

I have read that you are not supposed to keep choosing different levels of a categorical variable as the reference level, because this can inflate the familywise error rate: the true type 1 error rate will be greater than your nominal alpha level. By this I mean that if you have 5 levels, you change the reference level five times to see what the results will be (although it can be useful substantively to do that).

I am not sure the above is true. But even if it is, does it apply when you have an entire population, as I usually do? Can you even have type 1 error when you are analyzing a population? By that I mean there are 25,000 people in the population of interest and I have all of them; there is no sample involved.

2. ## Re: Type 1 error in regression with entire populations

Why do you have a reference group? Did you run something on this population?

Type I error is probably exclusive to statistical testing. Statistics are for generalizing from collected samples. You don't have a sample and aren't conducting statistical tests. Your numbers are the truth, so if two groups are different, they are different; there are no threats from a sampling distribution. This is what I think.

3. ## Re: Type 1 error in regression with entire populations

I just saw this but have not actually looked at its content:

Xiaoqin Wang, Yin Jin, and Li Yin. Measuring and estimating treatment effect on dichotomous outcome of a population. Stat Methods Med Res October 2016 25: 1779-1790, first published on September 3, 2013, doi:10.1177/0962280213502146

4. ## Re: Type 1 error in regression with entire populations

Originally Posted by noetsi
I have read that you are not supposed to keep choosing different levels of a categorical variable as the reference level, because this can inflate the familywise error rate: the true type 1 error rate will be greater than your nominal alpha level. By this I mean that if you have 5 levels, you change the reference level five times to see what the results will be (although it can be useful substantively to do that).

I am not sure the above is true.
Hi,
I am pretty sure this is not true. If you are only changing the reference level, I think you are actually repeating the same test, only presented differently. It is kind of like writing up the same test results in different languages: it is still the same test, not an independent one.
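The point about reference levels can be checked numerically. The following is a minimal sketch on simulated data (the level names, means, sample size, and the helper `fit_with_reference` are all made up for illustration, not taken from the thread). It fits the same dummy-coded regression twice with different reference levels and compares the fitted values and the implied level means.

```python
import numpy as np

# Simulated data for illustration only; the level names and means are made up.
rng = np.random.default_rng(0)
levels = ["A", "B", "C", "D", "E"]
codes = rng.integers(0, len(levels), size=500)      # level index per observation
level_means = np.array([0.0, 0.5, 1.0, 1.0, 2.0])
y = level_means[codes] + rng.normal(size=codes.size)


def fit_with_reference(ref):
    """OLS with dummy coding, omitting level index `ref` as the reference."""
    others = [j for j in range(len(levels)) if j != ref]
    dummies = [(codes == j).astype(float) for j in others]
    X = np.column_stack([np.ones(codes.size)] + dummies)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Estimated mean of each level: intercept, plus that level's coefficient.
    means = np.empty(len(levels))
    means[ref] = beta[0]
    means[others] = beta[0] + beta[1:]
    return X @ beta, means


fitted_a, means_a = fit_with_reference(0)   # reference = A
fitted_b, means_b = fit_with_reference(1)   # reference = B

same_fit = np.allclose(fitted_a, fitted_b)
same_means = np.allclose(means_a, means_b)
print(same_fit, same_means)
```

Because the two codings describe the same fitted model, every pairwise difference between levels is identical under either reference. What changes is only which of those differences appear directly as coefficients in the output.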

5. ## Re: Type 1 error in regression with entire populations

Whenever I do multiple testing like noetsi mentioned, I always correct my level of significance, unless it is the exact same test (a vs b, b vs a).

6. ## Re: Type 1 error in regression with entire populations

Originally Posted by hlsmith
Whenever I do multiple testing like noetsi mentioned, I always correct my level of significance, unless it is the exact same test (a vs b, b vs a).
But this is the same thing, right? a vs. b, c, d, e or b vs. a, c, d, e, etc.

7. ## Re: Type 1 error in regression with entire populations

A vs b, a vs c, and b vs c are three hypothesis tests in my practice.
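As a rough illustration of why the count of distinct pairwise tests matters, here is a small sketch of the familywise error rate for the 5-level case from the first post. It assumes the tests are independent, which pairwise comparisons from one model are not, so the number is only indicative; the function name `familywise_error` is made up for this example.

```python
from math import comb


def familywise_error(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


k = 5                    # levels of the categorical variable
m = comb(k, 2)           # number of distinct pairs of levels
alpha = 0.05             # nominal per-test level

fwer = familywise_error(alpha, m)
bonferroni_alpha = alpha / m    # per-test level under a Bonferroni correction

print(m, round(fwer, 3), bonferroni_alpha)
```

With 5 levels there are 10 distinct pairs, and cycling through reference levels does not add to that count; it only changes which pairs are displayed.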

8. ## Re: Type 1 error in regression with entire populations

Originally Posted by noetsi
I have read that you are not supposed to keep choosing different levels of a categorical variable as the reference level, because this can inflate the familywise error rate: the true type 1 error rate will be greater than your nominal alpha level. By this I mean that if you have 5 levels, you change the reference level five times to see what the results will be (although it can be useful substantively to do that).

I am not sure the above is true. But even if it is, does it apply when you have an entire population, as I usually do? Can you even have type 1 error when you are analyzing a population? By that I mean there are 25,000 people in the population of interest and I have all of them; there is no sample involved.
I believe that you do have type 1 error even though you have a population.

Thinking it through, you have two types of statistics, descriptive and inferential. In descriptive statistics (e.g., mean, standard deviation) you no longer have sampling error, so your measures of mean and standard deviation are absolute. There are no confidence intervals around the mean or standard deviation. However, when you use inferential statistics, you still have variation that you must deal with. True, the sampling variation is gone, but all of your other sources of variation still exist. Where there is variation there is uncertainty. I believe that all of the typical rules, assumptions, etc. still apply.

9. ## Re: Type 1 error in regression with entire populations

Hmmm. We need references that go either way. Does the traditional 1.96 critical value disappear, though? It would seem natural not to have it.

10. ## Re: Type 1 error in regression with entire populations

Originally Posted by hlsmith
A vs b, a vs c, and b vs c are three hypothesis tests in my practice.
Yep, you are right. But if I understand the question correctly, it is about doing the same analysis and only changing the reference level. So, I agree, you have one test for each pair of levels, but you do not have more tests just because you changed the reference level from A to B.

11. ## Re: Type 1 error in regression with entire populations

I found the following:
It appears that there are a lot of arguments either way. Andrew Gelman's position makes the most sense to me.

12. ## Re: Type 1 error in regression with entire populations

I think a key issue raised by miner is how you understand your population. If you think of your analysis as pertaining only to that population, then type 1 error does not make sense to me. You know the true results in the population and no error is possible. However, if you think of your population as a sample of all possible populations (ones that might occur in the future, for example), then error does pertain, in terms of applying your analysis to those other macro-populations. I have gone back and forth on this issue. In this case I decided to ignore that future populations might be different (or ones in other states, etc.; this analysis is very focused).

Honestly, I am not sure what sources of variation exist, other than sampling variation, that could cause error. I don't doubt they might exist. I just can't imagine what they are or how they would introduce error in the regression. All I really care about here are the slopes and odds ratios; I am not interested in other statistics.

The literature I have seen comes down on hlsmith's side in terms of multiple tests (that is why post hoc tests are penalized). But I don't think that literature deals with populations. I am still not sure one way or the other whether familywise error applies, because I think type 1 error itself is impossible in a population when all you care about is that population. It's a problem with a sample because you are interested in the larger population and you don't know if what you find in the sample matches the real population.

Obviously I remain uncertain (thanks miner and hlsmith for the articles). I don't think the assumptions of regression, except for linearity, really apply when you are analyzing a large population (I have 25,000 cases). Even setting aside that you have the whole population, with that many cases I don't think heteroscedasticity or multicollinearity influence the results, because the results are asymptotically correct with that many cases even if those problems exist. Nor does normality matter, because of the CLT.
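The "sample of all possible populations" reading can be illustrated with a small simulation. This is only a sketch under assumptions that are not from the thread: a two-group comparison, a normally distributed outcome that is unrelated to group, and 2,000 hypothetical populations of 25,000 people each. It counts how often a census shows a "significant" group difference at the usual 1.96 cutoff even though the underlying process has no effect.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 25_000              # population size mentioned in the thread
n_populations = 2_000   # hypothetical repeated populations (an assumption)
z_crit = 1.96           # two-sided 5% critical value

rejections = 0
for _ in range(n_populations):
    group = rng.integers(0, 2, size=N)   # two arbitrary groups
    y = rng.normal(size=N)               # outcome generated independently of group
    y0, y1 = y[group == 0], y[group == 1]
    diff = y1.mean() - y0.mean()         # exact difference within this population
    se = np.sqrt(y0.var(ddof=1) / y0.size + y1.var(ddof=1) / y1.size)
    if abs(diff / se) > z_crit:
        rejections += 1

rejection_rate = rejections / n_populations
print(rejection_rate)
```

Within any single simulated population the difference `diff` is exactly what it is, which matches the "no error is possible" view. The rejection rate near the nominal level only has meaning if the conclusion is meant to carry over to the process that generates such populations.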

13. ## Re: Type 1 error in regression with entire populations

I am also unsure of this additional variation.

You could say measurement error, but that isn't usually addressed in your model. Sensitivity analysis can try to assume its direction and magnitude. You have dispersion of the variate, but that is the nature of a random (stochastic) variable.

For example, how would one perform a two-sample t-test with a population?

I get the attempt to ask "what about a future sample?", but the moment after you take a measurement things are different. And what if you aren't predicting, just taking a cross-sectional measurement?

14. ## Re: Type 1 error in regression with entire populations

It all depends on what the analysis will be used for, I guess. The moment somebody says that we have proven an effect, it is implicitly understood that we are talking about future samples, as an effect presupposes that it will not disappear after we stop the measurement. Also, I do not think anyone uses the very careful wording that would be necessary to avoid this, something like "we did not use the usual statistical methods because we have a full census and we have no intention to discuss the existence or non-existence of any effect whatsoever that might show up in our analysis". So it would probably be best to go with Gelman.

regards

15. ## Re: Type 1 error in regression with entire populations

I accessed the paper I referenced in post #3; it is not relevant to your question. It covers using maximum likelihood to estimate risk differences, relative risks, and odds ratios from one model.
