Census data rules, variability and significance

ebf

New Member
#1
Hi

I am having trouble understanding census populations, variability in census populations, and how and why one is allowed to attribute significance to tests conducted on data derived from census populations vs population samples.

1) Is there any point at which missing data points invalidate a data set as a "census" and render it a "sample"? For example, three morphological measurements were collected on a finite population of animals. The data are historical, and animals with incomplete records (missing one or more measurements) were excluded from the analysis. Is there a percentage of missing records that would render this population a sample?

2) Is variance completely irrelevant to census data? In other words, does the spread in measurement data for a given morphological feature (for instance, height) have any effect on the significance of the difference between means when animals are grouped into categories, for example by the year the animal was measured?

I understand that sampling error cannot be used to evaluate statistical significance because a census, by measuring every member of the population, theoretically has no sampling error. But does the fact that the data are derived from a census also rule out using the degree of variance to assess the significance of a difference between two means? In other words, no matter how much variability there is within the data sets being compared, is it the case that the variability can never invalidate the difference between the means?

In practical terms, when working with census data, is it valid to create graphs that show means with their standard deviations and to make a statement about whether the means of measurements taken in certain year ranges (e.g. 1981-1988, 1989-1992) are significantly different?

I am having a hard time wrapping my brain around how variability in the data due to genetic or environmental causes can be ignored even when you have data on the total population.

3) At what degree of distance from raw census data are you allowed to attribute statistical significance to the results? In other words, once you start performing tests on the raw data (generating statistics from that raw data), is it then valid to examine statistical significance?

4) Is a census population statistically the same as a finite population and do the same statistical rules apply to both?

Thanks in advance...and sorry for the basic nature of these questions.
 

terzi

TS Contributor
#2
Hi ebf,

As you already know, a census implies that you get information about every single unit in your population. Since you have full information on your subjects, there is no reason to perform estimation: what you get is what it is. I'll try to answer your questions based on that principle. I hope it is useful.

1) Missing data can become problematic, because it alters your calculations and the numbers you get won't be the real values of your parameters. Missing data will not turn your census into a sample; it will directly distort the results. Procedures for handling missing data, such as imputation, need to be appropriate to the mechanism behind the missingness, since depending on that mechanism your information could be biased.
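To make the distortion concrete, here is a minimal sketch with made-up numbers (the heights and the pattern of missingness are assumptions for illustration, not data from this thread). It compares the true population mean with the mean computed only from the animals that still have a recorded value.

```python
from statistics import mean

# Hypothetical census: heights (cm) of all 8 animals in a finite population.
true_heights = [50, 52, 54, 56, 58, 60, 62, 64]

# Assumption for illustration: the two tallest animals were never recorded,
# so whether a value is missing depends on the value itself.
recorded = [50, 52, 54, 56, 58, 60, None, None]

population_mean = mean(true_heights)
complete_case_mean = mean(h for h in recorded if h is not None)

# A negative value means the complete cases understate the true mean.
distortion = complete_case_mean - population_mean

print(population_mean, complete_case_mean, distortion)
```

If the missing values were instead scattered without any relation to height, the complete-case mean would tend to stay closer to the true one, which is why the mechanism matters.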

2) Variance is relevant as a measure of dispersion. You can still calculate the variance and standard deviation and interpret them accordingly. Estimation is irrelevant, since the standard deviation you get is the real standard deviation: you are not using an estimator, so you should calculate the parameter directly. You don't have to look for significant differences, then. If you get 10 one year and 11 in another, those are different, period. As you mentioned, there is no sampling error in a census, but there are many non-sampling errors that may occur (problems with surveyors, poorly designed questionnaires, etc.) and that can alter your results.
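As a rough illustration of "calculate the parameter directly", the sketch below uses Python's standard statistics module, where pvariance divides by N (population parameter) and variance divides by N - 1 (sample estimator). The two yearly groups are invented numbers chosen so the means come out to 10 and 11.

```python
from statistics import mean, pvariance, variance

# Hypothetical census values for two years (every animal measured).
year_a = [8, 10, 10, 12]
year_b = [9, 11, 11, 13]

# Population variance: sum of squared deviations divided by N.
pop_var_a = pvariance(year_a)

# Sample variance: same sum divided by N - 1. This is the estimator you
# would use if year_a were only a sample from a larger population.
sample_var_a = variance(year_a)

# With a true census, the difference in means is simply a fact about
# the population, not an estimate.
mean_difference = mean(year_b) - mean(year_a)

print(pop_var_a, sample_var_a, mean_difference)
```

The population variance is always the smaller of the two, and the gap shrinks as N grows.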

3) I don't really get this question. As I said, statistical significance is a concept used with samples, so I don't think it would be very useful in a census.

4) A census can only occur in a finite population. As you know, a finite population is well defined and every member can be appropriately identified. An infinite population means you don't know the size or the actual members of your population, so you can't adequately produce a census.
 
#3
Conditional on me understanding the questions correctly, I'll take a shot at some of these:

1) A set of observations either contains an entire population or it doesn't. If the latter, then it is a sample. If your data contain only those observations of a population for which there are no missing cells, then the data are a systematic sample of the total population where inclusion in the sample is based on listwise deletion. However, you could also call those data "the population of individuals for whom there are no missing cells".
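For what listwise deletion looks like in practice, here is a small sketch. The record layout and field names (height, length, weight) are hypothetical stand-ins for the three morphological measurements; only animals with all three values survive.

```python
# Hypothetical records: one dict per animal, None marks a missing cell.
records = [
    {"id": 1, "height": 50.0, "length": 80.0, "weight": 20.0},
    {"id": 2, "height": 52.0, "length": None, "weight": 21.0},
    {"id": 3, "height": 54.0, "length": 83.0, "weight": 22.5},
    {"id": 4, "height": None, "length": 85.0, "weight": None},
    {"id": 5, "height": 58.0, "length": 88.0, "weight": 24.0},
]
fields = ("height", "length", "weight")

# Listwise deletion: drop any animal with at least one missing measurement.
complete = [r for r in records if all(r[f] is not None for f in fields)]

# Share of the full population that remains after deletion.
retained_fraction = len(complete) / len(records)

print([r["id"] for r in complete], retained_fraction)
```

Whatever is computed from the retained records describes "the population of individuals for whom there are no missing cells", which may or may not resemble the full population.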

2) That your data come from a census does not mean that individual variables are not random. Variance always matters for hypothesis testing.
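One way to see why variance matters for a test is to compute a test statistic by hand. The sketch below uses Welch's t statistic, which is my choice for illustration and not something specified in this thread, and it assumes each year's measurements are treated as realizations of a random process rather than fixed census values. Both scenarios have the same difference in means (1 unit) but different spread; the data are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_1, group_2):
    """Welch's t statistic: mean difference over its estimated standard error."""
    standard_error = sqrt(
        variance(group_1) / len(group_1) + variance(group_2) / len(group_2)
    )
    return (mean(group_1) - mean(group_2)) / standard_error


# Same 1-unit gap between group means, small within-group spread.
t_low_spread = welch_t([9, 10, 11], [10, 11, 12])

# Same 1-unit gap, much larger within-group spread.
t_high_spread = welch_t([5, 10, 15], [6, 11, 16])

print(round(t_low_spread, 4), round(t_high_spread, 4))
```

The statistic is several times larger in magnitude when the spread is small, so the same difference in means carries more evidential weight. Turning t into a p-value would additionally require the Welch-Satterthwaite degrees of freedom, which is left out here.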

3) I don't think I understand this question.

4) You may need to clarify this question, too. A census of a population should give you an observation for each member of the population. So if the population is finite, so will be the number of individuals in the census. The distinction between a census population and a finite population is like the distinction between apples and oranges: a census population is a population on which a census is performed, while a finite population is a population that is not infinite. So asking whether they are "statistically the same" doesn't make sense to me in this context.
 

ebf

New Member
#4
Thanks for the answers so far. I realize these are very basic questions but I appreciate you taking the time to try to explain.

I guess what I am trying to understand is what can and cannot be said about the significance of the difference between means of census measurement data grouped by year. Specifically, if you create a bar chart of the means and apply standard deviation bars to the columns, and the sd bars overlap between columns, what are you allowed to say about the means? Can you say anything about the significance of the difference between the means?