
Thread: Census data: Should I worry about sample size and pvalues?

  1. #1
    Hi all,
    I have some questions regarding statistical power and sample size. (1) Do I have to worry about sample size in a multiple logistic regression if I am using all the individuals in a population (a census) and not a sample? Let’s say that I want to see how many tourists in a resort report a complaint. This would be my dependent variable (complaint: yes/no). I have 7 independent variables, including age, sex, ethnicity, past complaint (yes/no), etc. Again, I am including all guests in a period of time (not a sample). The problem is that a very small proportion of people report a complaint (DV): 271 reported a complaint and 32,469 did not (so less than 1% report a complaint). I worry that some cells (categories) of my independent and dependent variables will not contain any people, since we don’t have many people who said Yes for my DV. For example, the Asian category may have only 2 people, and neither reported a complaint. (2) Would this affect the regression and the p-values? We expected people with previous complaints to be more likely to complain, but in my regression analysis this is not statistically significant, even though the OR is 1.6. (3) Can this be affected by the low N? (4) Should I report and consider p-values or not, since I don’t have sampling error? I would appreciate any help/ideas!

    Thank you!

  2. #2
    noetsi

    The Census does not really have the population. Commonly it uses samples, as with the ACS, but even when in theory it covers the population it likely does not capture everyone, because some people don't get the census and some don't return it. That said, if you have the population you know the true effect size. I don't see how type 1 or type 2 errors can apply when you know beyond any doubt what the true population result is. For the same reason, I don't think p-values matter when you know the true effect size and there is no sampling error.

    Having too few people in a cell might cause your regression not to run. But I think a more basic issue is whether your results are reasonable. If only a tiny portion of the population has complaints, does that mean they truly don't have any, or that they just did not go through the process of complaining?
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  3. #3
    rogojel

    hi,
    this will depend on the scope of your conclusions. If you only describe what happened up to this moment, then you are probably right and you need not worry. If you or the readers of your report interpret your results as saying something about the future as well (like having to take some action based on the knowledge that 35% of tourists complained in the past), then you conceptually have a sample from the population of tourists past and future, and all the sample size issues should be considered. This is tricky because your readers might look to the future even if you explicitly state that you do not draw any conclusions about it.

    regards

  4. #4

    I am sorry for the late reply. I read both of your comments and I still have a couple of questions.

    1. I think it makes sense to consider the p-values. The main task of this project is to predict which guests will report a complaint during their stay, so I can treat my population as a sample from a universe of future guests. This can be very tricky, since in reality this is not a sample from a population. Thus I am not estimating a parameter of a static population; my universe is constantly moving. For example, the next group of guests can be totally different from the population on which I ran the logistic regression. Is there a standard terminology for this type of population/sample? Does it make sense to consider and use the p-values?

    2. Since less than 1% of the guests reported a complaint during their stay, I have cells with very few observations. After my logistic regression (using Stata), two race categories had no complaints, and Stata alerted me with the following message: "Asian and Indian was dropped because it predicts failure perfectly". That is, none of the Asian and Indian guests reported a complaint (DV). How would this affect my regression? Why is Stata dropping these observations? What if this is true, for example, that if you are Asian you have 0 probability of reporting a complaint? Or, in order to generate a logistic regression coefficient, does at least one person need to be in each of the categories?


    3. On the other hand, we were expecting that past complaints (IV) would predict our DV (complaint in the current stay). The regression shows that those with a previous complaint have 1.6 times the odds of complaining, but it is not significant. Can this be affected by the few people who had a complaint in the past, as well as by the low proportion of people who complained in their current stay (DV)? Namely, 13 out of the 271 people who currently complained also had a complaint in the past. On the other hand, 133 out of the 32,000 people who did not have a complaint had a complaint in the past. Are these numbers too small, and do they therefore influence my p-value? If I had a larger N, would the result become significant?

  5. #5
    noetsi

    Quote: originally posted by rogojel
    hi,
    this will depend on the scope of your conclusions. If you only describe what happened up to this moment, then you are probably right and you need not worry. If you or the readers of your report interpret your results as saying something about the future as well (like having to take some action based on the knowledge that 35% of tourists complained in the past), then you conceptually have a sample from the population of tourists past and future, and all the sample size issues should be considered. This is tricky because your readers might look to the future even if you explicitly state that you do not draw any conclusions about it.

    regards

    This is true, of course, only if the effect size varies across other populations.

  6. #6
    noetsi

    1. I think it makes sense to consider the p-values. The main task of this project is to predict which guests will report a complaint during their stay, so I can treat my population as a sample from a universe of future guests. This can be very tricky, since in reality this is not a sample from a population. Thus I am not estimating a parameter of a static population; my universe is constantly moving. For example, the next group of guests can be totally different from the population on which I ran the logistic regression. Is there a standard terminology for this type of population/sample? Does it make sense to consider and use the p-values?
    If you think of your population as a sample of some other population, you should use p-values. If you don't, you should not. That pretty much is the standard terminology: a population is the entire unchanging population, and a sample is a portion of a larger unknown population. One practical problem here is that many analyses require a "random sample", and you are not sampling randomly (well, you don't know if you are or not, since you are not sampling at all in the classical sense).

    2. Since less than 1% of the guests reported a complaint during their stay, I have cells with very few observations. After my logistic regression (using Stata), two race categories had no complaints, and Stata alerted me with the following message: "Asian and Indian was dropped because it predicts failure perfectly". That is, none of the Asian and Indian guests reported a complaint (DV). How would this affect my regression? Why is Stata dropping these observations? What if this is true, for example, that if you are Asian you have 0 probability of reporting a complaint? Or, in order to generate a logistic regression coefficient, does at least one person need to be in each of the categories?
    I don't believe it is possible to run a regression with no variance in the outcome for those categories, which is why Stata dropped them.
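
On the "predicts failure perfectly" note, here is a minimal sketch in Python (not Stata) of why no finite coefficient exists for a category with zero complaints. The function name and the group size of 2, borrowed from the hypothetical in the first post, are illustrative assumptions. When every outcome in the category is 0, the log-likelihood keeps rising as the category's log-odds falls, so the maximum is never reached at a finite value. As far as I know, that is why Stata drops the category and sets its observations aside instead of reporting an estimate.

```python
import math

def loglik_all_failures(beta, n):
    """Log-likelihood of n observations with zero events at log-odds beta.

    Each observation contributes log(1 - p) = -log(1 + exp(beta)).
    """
    return -n * math.log1p(math.exp(beta))

# Hypothetical category with 2 guests and no complaints (assumed size).
betas = [0, -2, -5, -10, -20]
values = [loglik_all_failures(b, n=2) for b in betas]

for b, v in zip(betas, values):
    print(f"log-odds {b:>4}: log-likelihood {v:.6g}")
```

The values climb toward 0 without ever reaching it, so the fitted odds head toward 0 as the optimizer runs. That is a statement about what this data set can estimate, not evidence that the true probability of a complaint in that group is exactly 0.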

    3. On the other hand, we were expecting that past complaints (IV) would predict our DV (complaint in the current stay). The regression shows that those with a previous complaint have 1.6 times the odds of complaining, but it is not significant. Can this be affected by the few people who had a complaint in the past, as well as by the low proportion of people who complained in their current stay (DV)? Namely, 13 out of the 271 people who currently complained also had a complaint in the past. On the other hand, 133 out of the 32,000 people who did not have a complaint had a complaint in the past. Are these numbers too small, and do they therefore influence my p-value? If I had a larger N, would the result become significant?
    I know a very low variation in a predictor (say 95 percent in one group) can negatively affect regression. How it influences odds ratios I have never read.
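
For question 3, here is a rough sketch of how small cells feed into the p-value, using a crude 2x2 table and the usual Wald standard error for a log odds ratio. The counts are assumptions pieced together from the thread: 13 of the 271 complainers and 133 of the non-complainers had a past complaint, with the non-complainer total taken as the 32,469 given in the first post. This is an unadjusted calculation, so it need not match the 1.6 from the seven-variable model.

```python
import math

def crude_or_wald(a, b, c, d, z=1.96):
    """Crude odds ratio, SE of its log, and Wald interval for a 2x2 table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_hat) - z * se)
    hi = math.exp(math.log(or_hat) + z * se)
    return or_hat, se, (lo, hi)

# Counts assumed from the thread (exposure = past complaint).
cells = {
    "past complaint, complained": 13,
    "past complaint, no complaint": 133,
    "no past complaint, complained": 271 - 13,
    "no past complaint, no complaint": 32469 - 133,
}
a, b, c, d = cells.values()
or_hat, se, ci = crude_or_wald(a, b, c, d)

# Share of the variance of log(OR) contributed by each cell (1/n over se^2).
shares = {name: (1 / n) / se**2 for name, n in cells.items()}

print(f"crude OR = {or_hat:.2f}, SE(log OR) = {se:.3f}")
print(f"Wald interval = ({ci[0]:.2f}, {ci[1]:.2f})")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of the variance")
```

The smallest cell contributes most of the variance, so the handful of guests with both a past and a current complaint largely sets the width of the interval. A larger N narrows it mainly to the extent that this cell grows.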

  7. #7

    Thank you, noetsi! rogojel, any ideas?
