Minimum population size for statistical tests?

#1
Hi all - I am hoping somebody could help with a broader question.

I am trying to see if individuals with certain characteristics (independent variables) are statistically more or less likely to have a certain property (dependent variables) than individuals without the characteristic. I am looking at the whole population rather than a sample, as I have data on every individual.

For example, are individuals with characteristic x more likely to have property 1 than individuals without characteristic x? All my independent and dependent variables are "either-or" (the individuals either do, or do not, have the characteristic/property).

My main issue is that there is a much smaller number of individuals with the characteristic than without in every case. For example, only 300 individuals have characteristic x, and 30,000 do not. Does this need to be taken into account when performing the analysis? Is there a minimum "ratio" for comparing two populations?

To give actual data, characteristics are things like disability (disabled or not) and age (young or mature) and properties are things like graduation (did or did not) and drop-out (did or did not).

Thanks very much in advance,

J
 

obh

Active Member
#2
Generally, you use statistical tests when you can't use the full population data, and you take only a sample of the data, trying to understand if the sample's results represent the population.

But I think that sometimes it is okay to use statistics on the full population as well.
If the full population is small, like one group of 100, you may ask whether the full population results would be the same if you checked the same group again. So you may treat the full population of a small group as one sample out of several repeats of the same small group.
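
To make the "check the same group again" idea concrete, here is a minimal simulation sketch. The group size of 100 is from the post above; the 30% rate, the 1,000 repeats and the fixed seed are assumptions picked purely for illustration.

```python
import random
import statistics

def simulate_repeats(group_size, true_rate, n_repeats, seed=0):
    """Simulate the observed rate for many hypothetical repeats of one group."""
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    rates = []
    for _ in range(n_repeats):
        # Each individual independently has the property with probability true_rate.
        count = sum(rng.random() < true_rate for _ in range(group_size))
        rates.append(count / group_size)
    return rates

# Hypothetical values: a group of 100 with an underlying 30% rate.
rates = simulate_repeats(group_size=100, true_rate=0.30, n_repeats=1000)
spread = statistics.pstdev(rates)

print(f"lowest observed rate:  {min(rates):.2f}")
print(f"highest observed rate: {max(rates):.2f}")
print(f"spread (std dev):      {spread:.3f}")
```

The spread between repeats is what a statistical test tries to account for. Whether that framing applies depends on whether you see your group as one realisation of a repeatable process or as a fixed set of individuals.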

What statistical tests do you want to run?
 
#3
Thanks very much for the reply - I now understand this issue a lot better, and am in a bit of a better position to know what the next steps should be. The fact we are analysing trends from an entire population removes the "statistical significance" element, in favour of a contextual significance - i.e. what is the trend we're most concerned by, what has changed the most over time etc.

If we were looking at continuous data, I would use Welch's t-test to ascertain the most significant differences, but for this initial analysis I'll focus on the contextual significance. Thanks very much for the help!
 

Miner

TS Contributor
#4
Are you certain that you have the entire population? Do you ever intend to generalize your findings for another group?

One could collect data for 100% of the people in a specific classroom and say they have the entire population. This is true for that classroom. However, if they ever want to make generalizations to more classrooms, then it is no longer a population, but a sample. Then the question becomes: is that sample representative?
 

noetsi

Fortran must die
#5
It's dangerous to assume you have a population. For example, we have the entire population of our customers. But we may want to generalize to a larger population in the state that we do not serve. You have to think carefully about who you want to generalize to in making this type of decision.

If your DV has two levels then you want to do something like logistic regression. So the 300, the least common group, is the key factor according to many. I would look at Agresti's suggestions on necessary sample size, or those of authors like him, remembering that these are rules of thumb.

This is worth considering.

https://journals.sagepub.com/doi/full/10.1177/0962280218784726
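
A rough sketch of why the smaller group tends to be the key factor: with a single either-or predictor, the logistic regression slope is the log odds ratio of the 2x2 table, and its usual (Wald) standard error is sqrt(1/a + 1/b + 1/c + 1/d). The group sizes of 300 and 30,000 are from the first post; the split within each group (90/210 and 6,000/24,000) is made up for illustration, as is the function name.

```python
import math

def log_or_and_se(a, b, c, d):
    """Log odds ratio and Wald standard error for a 2x2 table.

    a, b: with the characteristic, property yes / no
    c, d: without the characteristic, property yes / no
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Hypothetical counts: 300 with the characteristic, 30,000 without.
a, b = 90, 210
c, d = 6000, 24000

log_or, se = log_or_and_se(a, b, c, d)

# Share of the variance (se squared) that comes from the small group's cells.
small_share = (1 / a + 1 / b) / (se ** 2)

print(f"odds ratio: {math.exp(log_or):.3f}")
print(f"standard error of log odds ratio: {se:.4f}")
print(f"share of variance from the group of 300: {small_share:.1%}")
```

With these made-up counts nearly all of the uncertainty comes from the two cells of the small group, so adding more people to the group of 30,000 would barely change the precision.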
 
#6
@noetsi and @Miner - thank you both for your replies and apologies for my late response! I see what you're saying regarding having an entire population, but there is no extrapolation or predictive element to this analysis; we are most concerned with the individuals in our "population," and reporting on their attributes - for example, we are interested in whether disabled students in our population dropped out, not whether disabled students in general are more likely to drop out. I hope this makes sense; I may be missing something here, and needing to rethink my entire strategy or philosophy behind what I'm doing.
 

noetsi

Fortran must die
#7
If you think that you have the population then you can ignore statistical tests. Whatever the effect size you find, that is the true effect size. So p values, sample size etc are entirely unimportant. I would not even report statistical tests in this case.
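
If you go the purely descriptive route, the effect sizes can be read straight off the counts. A minimal sketch, again with made-up counts (90 of 300 and 6,000 of 30,000) and a hypothetical helper name:

```python
def describe_groups(yes_with, n_with, yes_without, n_without):
    """Descriptive effect sizes for an either-or property in two groups."""
    p_with = yes_with / n_with            # rate among those with the characteristic
    p_without = yes_without / n_without   # rate among those without it
    return {
        "rate_with": p_with,
        "rate_without": p_without,
        "risk_difference": p_with - p_without,
        "risk_ratio": p_with / p_without,
    }

# Hypothetical counts: 90 of 300 vs 6,000 of 30,000 have the property.
summary = describe_groups(90, 300, 6000, 30000)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Which of these to report (difference in percentage points or ratio) is a judgment call that depends on what the audience cares about.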
 
#8
All populations are of infinite size.
If I understand the study, you've got a set of binomial distributions.

"My main issue is there is a much small number of individuals with the characteristic than without in every case. For example, only 300 individuals have characteristic x, and 30,000 do not."
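P(characteristic) = 300/30,300 ≈ .01 = 1%. Now measure/count the property within the characteristic group. If P(property | characteristic) = .3 and that differs from the rate among those without the characteristic, the property is connected to the characteristic. Etc.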