
Thread: Linear Probability Models and Complex Survey Data

  1. #1
    Lazar (Phineas Packard)
    Location: Sydney

    Linear Probability Models and Complex Survey Data




    So there has been a movement back toward linear probability models in cases where multi-group comparisons are of focal interest. The rationale for this is that cross-group comparisons of parameter estimates get shaky in probit and logit models, requiring a fair amount of finessing to get right.

    So I now get frequent requests from reviewers to run an LPM in addition to probit/logit. That is fine, and I can deal with heteroskedasticity in the standard errors with the sandwich package in R. The issue arises when there is a complex sample design. For example, with PISA, PIAAC, TIMSS, etc., I get 80 replication weights designed to account for the complex sample design; these are essentially pre-specified bootstrap selectors for the sample. Given that the standard errors come from the variance across these repeated replications, do I need to deal with heteroskedasticity in the LPM? If so, does anyone know how? I am using the survey package in R.
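
    For reference, here is a minimal sketch of the kind of setup I mean, on made-up data with PISA-style names (W_FSTUWT for the final student weight, W_FSTR1 to W_FSTR80 for the Fay replicate weights); it is an illustration only, not a full recipe:

    Code:
    library(survey)

    # Made-up stand-in for a PISA-style file; real data would already contain
    # W_FSTUWT and the 80 replicate weights W_FSTR1 ... W_FSTR80.
    set.seed(1)
    n <- 400
    pisa <- data.frame(escs = rnorm(n), W_FSTUWT = runif(n, 0.5, 1.5))
    pisa$outcome <- rbinom(n, 1, plogis(0.4 * pisa$escs))
    for (g in 1:80)
      pisa[[paste0("W_FSTR", g)]] <- pisa$W_FSTUWT * sample(c(0.5, 1.5), n, replace = TRUE)

    # Fay's balanced repeated replication, as used by PISA (Fay factor 0.5)
    des <- svrepdesign(data = pisa, weights = ~W_FSTUWT,
                       repweights = "W_FSTR[0-9]+", type = "Fay", rho = 0.5)

    # Linear probability model: default gaussian family; the standard errors
    # come from the spread of the 80 replicate fits.
    lpm <- svyglm(outcome ~ escs, design = des)
    summary(lpm)

    # Logit counterpart for comparison
    logit <- svyglm(outcome ~ escs, design = des,
                    family = quasibinomial(link = "logit"))
    summary(logit)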
    "I have done things to data. Dirty things. Things I am not proud of."

  2. #2
    vinux (Dark Knight)

    Re: Linear Probability Models and Complex Survey Data

    Not answering your question, but: the LPM is not a bad model if you are interested in central values (mean, median, etc.). The non-tail part of the sigmoid curve can be approximated by a straight line.
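
    A quick numerical illustration of that point, a minimal sketch of the tangent-line approximation of the logistic curve at zero (just arithmetic, not from the thread):

    Code:
    # plogis(eta) has slope 1/4 at eta = 0, so near the centre it is well
    # approximated by the straight line 0.5 + eta/4.
    eta <- seq(-1, 1, by = 0.25)
    round(cbind(eta = eta, logistic = plogis(eta), linear = 0.5 + eta / 4), 3)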
    In the long run, we're all dead.

  3. #3
    spunky (TS Contributor)
    Location: vancouver, canada

    Re: Linear Probability Models and Complex Survey Data

    OK, I am far from an expert on this and willing to admit that I may be wrong, but for the sake of saying something, here are my two cents.

    Quote Originally Posted by Lazar:
    do I need to deal with heteroskedasticity in the LPM?
    I would be inclined to say "yes". I mean, your dependent variable is still constrained to be either 0 or 1, right? I think it is reasonable to claim that it is Bernoulli-distributed, and the mean and variance of a Bernoulli distribution are not independent. So you are still stuck with the problem that, simply because of how the distribution behaves, higher/lower values of the mean are associated with higher/lower values of the residual variance. So I would still be inclined to use some type of robust correction to the standard errors.
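
    For what it's worth, a minimal sketch of that kind of robust correction with the sandwich and lmtest packages, on made-up data (the variable names are illustrative):

    Code:
    library(sandwich)  # heteroskedasticity-consistent covariance estimators
    library(lmtest)    # coeftest() for tests with a user-supplied vcov

    # Made-up binary outcome y and predictor x
    set.seed(1)
    n <- 500
    x <- rnorm(n)
    y <- rbinom(n, size = 1, prob = plogis(0.3 + 0.8 * x))

    fit <- lm(y ~ x)                                  # the linear probability model
    coeftest(fit, vcov = vcovHC(fit, type = "HC3"))   # robust (HC3) standard errors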

    Quote Originally Posted by Lazar:
    If so, does anyone know how?
    I'm not sure if this is relevant at all, but have you heard of the wild bootstrap? If you think it is relevant, one of my profs was obsessed with it last semester and he gave us R code to implement the wild bootstrap both by itself AND in the presence of missing data with multiple imputation (he was also infatuated with multiple imputation). Would that be helpful to you? It does account for heteroskedasticity!
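
    Not that professor's code, but a minimal sketch of a residual wild bootstrap with Rademacher multipliers for the slope of an LPM, again on made-up data:

    Code:
    # Residual wild bootstrap for the slope of a linear probability model
    set.seed(1)
    n <- 500
    x <- rnorm(n)
    y <- rbinom(n, size = 1, prob = plogis(0.3 + 0.8 * x))

    fit <- lm(y ~ x)
    mu  <- fitted(fit)
    e   <- resid(fit)

    B <- 2000
    boot_slope <- replicate(B, {
      v      <- sample(c(-1, 1), n, replace = TRUE)  # Rademacher multipliers
      y_star <- mu + e * v                           # perturbed outcomes
      coef(lm(y_star ~ x))["x"]
    })
    sd(boot_slope)   # wild-bootstrap standard error for the slope on x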
    for all your psychometric needs! https://psychometroscar.wordpress.com/about/

  4. #4
    Lazar (Phineas Packard)
    Location: Sydney

    Re: Linear Probability Models and Complex Survey Data

    Quote Originally Posted by spunky:
    I would be inclined to say "yes". I mean, your dependent variable is still constrained to be either 0 or 1, right? I think it is reasonable to claim that it is Bernoulli-distributed, and the mean and variance of a Bernoulli distribution are not independent. So you are still stuck with the problem that, simply because of how the distribution behaves, higher/lower values of the mean are associated with higher/lower values of the residual variance. So I would still be inclined to use some type of robust correction to the standard errors.
    I am inclined to agree, but I am not certain how this would be done in this context, given that the standard errors are taken from the variance in the point estimates across the replicates.
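
    To make the mechanics concrete, here is a minimal sketch of the Fay replicate-variance formula PISA uses (Fay factor k = 0.5, G = 80 replicates), with an LPM fitted inside each replicate; the data and weights below are crude stand-ins, not real survey weights:

    Code:
    # Var(beta_hat) = 1 / (G * (1 - k)^2) * sum_g (beta_g - beta_full)^2
    set.seed(2)
    n <- 1000
    dat <- data.frame(x = rnorm(n))
    dat$y <- rbinom(n, 1, plogis(0.2 + 0.6 * dat$x))
    dat$w_final <- runif(n, 0.5, 1.5)
    for (g in 1:80)   # crude stand-ins for the 80 Fay replicate weights
      dat[[paste0("w_rep", g)]] <- dat$w_final * sample(c(0.5, 1.5), n, replace = TRUE)

    G <- 80; k <- 0.5
    beta_full <- coef(lm(y ~ x, data = dat, weights = w_final))["x"]
    beta_rep  <- sapply(seq_len(G), function(g)
      coef(lm(y ~ x, data = dat, weights = dat[[paste0("w_rep", g)]]))["x"])

    var_rep <- sum((beta_rep - beta_full)^2) / (G * (1 - k)^2)
    sqrt(var_rep)   # replicate-based standard error for the LPM slope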

    Quote Originally Posted by spunky:
    I'm not sure if this is relevant at all, but have you heard of the wild bootstrap? If you think it is relevant, one of my profs was obsessed with it last semester and he gave us R code to implement the wild bootstrap both by itself AND in the presence of missing data with multiple imputation (he was also infatuated with multiple imputation). Would that be helpful to you? It does account for heteroskedasticity!
    Not really helpful in this case, as the form of the 'bootstraps' is predefined by the survey organisers to account for the complex sample, but still cool.
    "I have done things to data. Dirty things. Things I am not proud of."

  5. #5
    GretaGarbo (Human)

    Re: Linear Probability Models and Complex Survey Data


    Quote Originally Posted by Lazar:
    If so, does anyone know how?
    I don't know! But these are my thoughts.

    Quote Originally Posted by Lazar:
    So there has been a movement back toward linear probability models in cases where multi-group comparisons are of focal interest. The rationale for this is that cross-group comparisons of parameter estimates get shaky in probit and logit models, requiring a fair amount of finessing to get right.
    The usual logit model with link function g(.) is often written as:

    g(p) = log(p/(1-p)) = beta'x

    As I understand it, the linear probability model is just a model with identity link:

    p = beta'x

    Maximum likelihood estimation in a generalized linear model is done by iteratively reweighted least squares (IRLS):

    (X'W(t)X) beta(t+1) = X'W(t) z(t), with conventional nomenclature (for the identity link the working response z is just y),

    where the weights are updated in each round. But this just takes care of the increasing variance as p gets closer to 0.5, not the sampling design.
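
    As a small illustration of that update, here is a hand-rolled IRLS for the identity-link binomial on made-up data (glm() with binomial(link = "identity") fits the same model but can be fussy about starting values); the clipping of p is only a practical guard:

    Code:
    # IRLS for the identity-link binomial: solve (X'WX) beta = X'Wz with z = y
    set.seed(1)
    n <- 500
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(0.2 + 0.5 * x))
    X <- cbind(1, x)

    beta <- c(mean(y), 0)                           # crude starting values
    for (it in 1:25) {
      p <- pmin(pmax(X %*% beta, 1e-3), 1 - 1e-3)   # keep fitted p inside (0, 1)
      W <- diag(as.vector(1 / (p * (1 - p))))       # weights updated each round
      beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
      if (max(abs(beta_new - beta)) < 1e-8) break
      beta <- beta_new
    }
    drop(beta)   # ML estimates of the identity-link (LPM-as-GLM) coefficients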

    Let's write the model as:

    Y = beta'x + eps

    where Y is still Bernoulli distributed with mean beta'x and eps is an awkward disturbance term.

    Now suppose that there is a complex sampling design, so that there is a random selection variable S (selected or not selected) whose selection probability is not constant, as it would be under simple random sampling.

    My point is that if the disturbance term eps in the regression model and S are statistically independent, then, my guess is, you can ignore the complex sampling design and estimate the model as if it came from a simple random sample.

    If they are independent, the likelihood just multiplies the densities:

    L = f(s)*f(y; beta), and in the log-likelihood the sampling part is just an additive constant: log L = log f(s) + log f(y; beta).

    But if the sampling design is not independent, so that the sampling probability is a function of beta, then maybe that can be modelled in the joint likelihood:

    L = f(s, y; beta)

    If it is estimated with ML (and for me, ML means maximum likelihood!), the variances and covariances can be found from the inverse of the information matrix.

    But I believe this was solved long ago; I just guess that the results have not been used very much.

    Lazar wants cross-group comparisons of parameter estimates. Then I guess he needs not only the standard errors but also the variances and covariances of the parameter estimates.

    I don't know much about bootstrapping in this case, but I would guess that the maximum likelihood estimates would be more precise.
