+ Reply to Thread
Results 1 to 15 of 15

Thread: Simulating a logistic regressio - scary results

  1. #1
    TS Contributor
    Points: 12,227, Level: 72
    Level completed: 45%, Points required for next Level: 223
    rogojel's Avatar
    Location
    I work in Europe, live in Hungary
    Posts
    1,470
    Thanks
    160
    Thanked 332 Times in 312 Posts

    Simulating a logistic regressio - scary results




    Hi,
    I tried to work out the necessary sample size for a logistic regression by simulations and got some scary results. If anyone could check the code below, it would be a great help.

    I simulate a logistic regression with two normalized variables, one having a fixed odds ratio of 1.4 the other a variable odds ratio from 1 to 2.5.

    What makes this scary is that even at relatively high sample sizes (190) the probability of not identifying the real situation (both xs significant) is very high. At lower sample sizes the situation is more or less hopeless.

    Code: 
    # logistic regression sample size simulation
    
    # for several variables (2) 
    
    #Define beta0 (start with 0)
    B0=0
    
    # Define the odds ratio
    #First trial was from 1 to 5 where 1 is giving us a rate of the false alarms
    #It dies not make sense to go higher then 2 because at that point 
    #the power will be almost always high
    OR=seq(from=1, to = 2.5, by=0.1)
    
    #Define the sample size
    SSize=seq(from=50, to=200, by=20)
    
    #Define table to store the power
    T=expand.grid(OR,SSize)
    T$PP1=rep(0, length(T$Var1))
    T$PP2=rep(0, length(T$Var1))
    T$PP3=rep(0, length(T$Var1))
    T$PP4=rep(0, length(T$Var1))
    names(T)=c("OddsRatio", "SampleSize", "Only.X1", "X1.and.X2", "Only.X2", "Neither")
    
    #Define the X variable (from -3 to 3) assuming standardized X
    #Normally distributed with mean 0 and stddev 1
    
    #Define the number of trials per setting (OR and sample size)
    NTrials=200
    
    #FOR each OR DO
    for(or in OR){
      
      # FOR each Sample Size DO
      for(ss in SSize){
        
        #   Define holding vector for pvalues
        pvals1=numeric(length=NTrials)
        pvals2=numeric(length=NTrials)
        
        #   For NTrials times DO
        for(count in 1:NTrials){
          
          #     Generate SSize X values
          XVals1=rnorm(ss)
          XVals2=rnorm(ss)
          
          #     Calculate the probability of an event for the generated X
          B1=log(or)
          Fact=B0+B1*XVals1+log(1.4)*XVals2
          YProb=exp(-Fact)/(1+exp(-Fact))
          
          #     Generate the Ys for the X values
          Y=rbinom(length(YProb),1,YProb)
          
          #     Generate a dataframe (not really necessary here)
          testdata=data.frame(Y=as.factor(Y), X1=XVals1, X2=XVals2)
          
          #     Run a logistic regression Y~X
          mod=glm(Y~X1+X2, data=testdata, family="binomial")
          
          #     Save the p-value
          pvals1[count]=coef(summary(mod))[11]
          pvals2[count]=coef(summary(mod))[12]
          
          #   ENDFOR Ntrials
        }
        
        #   Calculate the percent p-values less then 0.05
        px1=sum(pvals1<=0.05 & pvals2>0.05)/NTrials
        px1x2=sum(pvals1<=0.05 & pvals2<=0.05)/NTrials
        px2=sum(pvals1>0.05 & pvals2<=0.05)/NTrials
        pnone=sum(pvals1>0.05 & pvals2>0.05)/NTrials
        
        #   Save them in the power table
        T[T$OddsRatio==or &T$SampleSize==ss,]$Only.X1=px1
        T[T$OddsRatio==or &T$SampleSize==ss,]$X1.and.X2=px1x2
        T[T$OddsRatio==or &T$SampleSize==ss,]$Only.X2=px2
        T[T$OddsRatio==or &T$SampleSize==ss,]$Neither=pnone
        
        # ENDFOR Sample Size
      }
      
      #ENDFOR OR
    }

  2. The Following User Says Thank You to rogojel For This Useful Post:

    GretaGarbo (05-29-2017)

  3. #2
    TS Contributor
    Points: 5,246, Level: 46
    Level completed: 48%, Points required for next Level: 104
    maartenbuis's Avatar
    Location
    Konstanz
    Posts
    372
    Thanks
    3
    Thanked 146 Times in 123 Posts

    Re: Simulating a logistic regressio - scary results

    Yes, the power is fairly low in those cases. To make you more scared: you have chosen a best case scenario by setting the constant at 0 and using standard normal distributions for you xs. That way you have overall about 50% 1s and %50% 0s for your dependent variable, thus maximizing the variance in your dependent variable. If you change the constant (or change the mean of one or more of your independent variables) the power will get worse. Here is my simulation in Stata:

    Code: 
    clear all
    set more off
    set seed 123456789
    program define sim
    	args n b0
    	drop _all
    	set obs `n'
    	gen x1 = rnormal()
    	gen x2 = rnormal()
    	gen y = runiform() < invlogit( ln(`b0') + ln(1.4)*x1 + ln(1.4)*x2 )
    	version 5: logit y x1 x2
    end
    
    matrix res = J(189,3,.)
    matrix colnames res = N constant power
    local i = 1
    foreach n of numlist 100(20)500 {
    	foreach b0 of numlist .2 .4 .6 .8 1 1.25 `=5/3' 2.5 5{
    		simulate b=_b[x2] se=_se[x2], reps(20000) nodots : sim `n' `b0'
    		count if abs(b/se) > invnormal(.975)
    		matrix res[`i++',1] = `n', `b0', r(N)/_N
    	}
    }
    
    svmat res, names(col)
    seperate power, by(constant)
    twoway line power? N,                                     ///
         lcolor(gs14 gs11 gs8 gs4 gs0 gs4 gs8 gs11 gs14)      ///
         lpattern(longdash longdash longdash longdash solid   ///
                  shortdash shortdash shortdash shortdash)    ///
         legend(order(1 "0.2" 2 "0.4" 3 "0.6" 4 "0.8" 5 "1"   ///
                      6 "1.25" 7 "1.667" 8 "2.5" 9 "5")       ///
                subtitle(baseline odds) )                     ///
         yline(.8) ytitle(power)
    Attached Images  
    Last edited by maartenbuis; 05-27-2017 at 05:36 AM. Reason: updated the simulation

  4. The Following 2 Users Say Thank You to maartenbuis For This Useful Post:

    hlsmith (05-26-2017), rogojel (05-26-2017)

  5. #3
    TS Contributor
    Points: 12,227, Level: 72
    Level completed: 45%, Points required for next Level: 223
    rogojel's Avatar
    Location
    I work in Europe, live in Hungary
    Posts
    1,470
    Thanks
    160
    Thanked 332 Times in 312 Posts

    Re: Simulating a logistic regressio - scary results

    hi,
    the idea is to use standardised X-es, which is recommanded anyway. It definitely should not bring about 50% 1-s in my Y, it should depend on the probability of Y, no?

    What I find scary is the high percentage of false diagnostics, independently of the sample size. The odds ratios are practically quite high, due to the use of standardised Xs, still, the chances of getting an accurate picture are low.
    regards

  6. #4
    TS Contributor
    Points: 5,246, Level: 46
    Level completed: 48%, Points required for next Level: 104
    maartenbuis's Avatar
    Location
    Konstanz
    Posts
    372
    Thanks
    3
    Thanked 146 Times in 123 Posts

    Re: Simulating a logistic regressio - scary results

    It is the combination of symmetrically distributed and standardized xs with a constant of 0 that leads to the 50% 1s in y.

  7. #5
    TS Contributor
    Points: 12,227, Level: 72
    Level completed: 45%, Points required for next Level: 223
    rogojel's Avatar
    Location
    I work in Europe, live in Hungary
    Posts
    1,470
    Thanks
    160
    Thanked 332 Times in 312 Posts

    Re: Simulating a logistic regressio - scary results

    Quote Originally Posted by maartenbuis View Post
    It is the combination of symmetrically distributed and standardized xs with a constant of 0 that leads to the 50% 1s in y.
    Hi,
    I just ran a few simulations and you are right. I would just rephrase it a bit - it is the uniform sampling of the whole relevant range of Xs that is generating the roughly 50% events. It is not relevant whether the Xs are standardized or not as long as the relevant range (defined as between X1 and X2, where P(Y|x=X1) ~ 0 and P(Y|X=X2)~1) is uniformly covered.

    This is BTW making the often quoted sample size rule of 10 events per independent variable look really bad.

    regards

  8. #6
    TS Contributor
    Points: 5,246, Level: 46
    Level completed: 48%, Points required for next Level: 104
    maartenbuis's Avatar
    Location
    Konstanz
    Posts
    372
    Thanks
    3
    Thanked 146 Times in 123 Posts

    Re: Simulating a logistic regressio - scary results

    The constant is the log(odds) of a 1 on y when all xs are 0. You standardized your xs, so they are 0 at their mean. You set the constant to 0, so for a person with a mean value on both xs you expect an odds of getting a 1 on y to be exp(0)=1, i.e. 1 success (1) per failure(0), or a 50% chance of a 1. The overall (marginal) probability of getting a 1 on y is the average of that probability over all observations. Since the probability is a non-linear function of the xs, you normally cannot say that the probability at mean values of x is the average probability. But the function relating the probability to the xs is symmetrical, and you created your explanatory variables from a symmetrical distribution, so now that happens to work out.

  9. #7
    TS Contributor
    Points: 5,246, Level: 46
    Level completed: 48%, Points required for next Level: 104
    maartenbuis's Avatar
    Location
    Konstanz
    Posts
    372
    Thanks
    3
    Thanked 146 Times in 123 Posts

    Re: Simulating a logistic regressio - scary results

    The sample size rule of 10 per variable refers to the sample size needed before the asymptotics on which the standard error computations are based starts kicking in. The fact that a standard error means what it is supposed to mean, does not guarantee satisfactory power.

  10. #8
    TS Contributor
    Points: 12,227, Level: 72
    Level completed: 45%, Points required for next Level: 223
    rogojel's Avatar
    Location
    I work in Europe, live in Hungary
    Posts
    1,470
    Thanks
    160
    Thanked 332 Times in 312 Posts

    Re: Simulating a logistic regressio - scary results

    hi,
    so, by adjusting the B0 I could shift the range where I am sampling with respect to the logistic curve. I guess, I can have the same effect, by keeping B0 zero but moving the range of the Xs to something else instead of [-3,3].

    If, for instance I will pick Xs between -2 and -1, I will see just a few events, between 2 and 3 just a few non-events. I am curious how this affects the power.


    regards

  11. #9
    TS Contributor
    Points: 5,246, Level: 46
    Level completed: 48%, Points required for next Level: 104
    maartenbuis's Avatar
    Location
    Konstanz
    Posts
    372
    Thanks
    3
    Thanked 146 Times in 123 Posts

    Re: Simulating a logistic regressio - scary results

    That is exactly the excercise I have done in #2. I fixed the effect at an odds ratio of 1.4 and changed the baseline odds to 0.4, 0.6, 0.8, 1, 1.25, 1 2/3, and 2.5 (the numbers larger than 1 are the reciprocal of the numbers lower than 1). The result is that if you choose more extreme constants, the power will go down. This is to be expected, as by choosing more extreme constants you will reduce the variance in the dependent variable.

  12. #10
    TS Contributor
    Points: 12,227, Level: 72
    Level completed: 45%, Points required for next Level: 223
    rogojel's Avatar
    Location
    I work in Europe, live in Hungary
    Posts
    1,470
    Thanks
    160
    Thanked 332 Times in 312 Posts

    Re: Simulating a logistic regressio - scary results

    This means, that in real life situations where we do not know where our Xs are with respect to the logistic curve, the same sample size could have a lot less power then we think, right? Especially, that in many cases we are likely at the end of the curve where the event probability is low, like in health applications or reliability.

  13. #11
    TS Contributor
    Points: 5,246, Level: 46
    Level completed: 48%, Points required for next Level: 104
    maartenbuis's Avatar
    Location
    Konstanz
    Posts
    372
    Thanks
    3
    Thanked 146 Times in 123 Posts

    Re: Simulating a logistic regressio - scary results

    I do a lot of analyses with logistic regression, and I never rely on only the number of observations to get a feel for how much I trust my analysis. One of the very first things I do is look at the proportion of 1s on my dependent variable to get a feel for how much information I really have. It is this raw proportion that will feed into the constant, and thus link to the simulation in #2.

  14. The Following User Says Thank You to maartenbuis For This Useful Post:

    hlsmith (05-27-2017)

  15. #12
    TS Contributor
    Points: 12,227, Level: 72
    Level completed: 45%, Points required for next Level: 223
    rogojel's Avatar
    Location
    I work in Europe, live in Hungary
    Posts
    1,470
    Thanks
    160
    Thanked 332 Times in 312 Posts

    Re: Simulating a logistic regressio - scary results

    I think the proportion of events will depend on which logistic curve we are on (the OR in fact) and where are our Xs on that curve. Once this is fixed, the power will depend on the sample size - meaning that for the same proportion of events my power can be abysmal (0.03 or high(ish) like 0.4). For the simulations I picked the edge where the events are fairly rare and avoided the middle region. Proportions I checked were between 30% and 15% and the power really low.

    As an example for standardised Xs varying between 2 and 3:

    OddsRatio SampleSize Power YRatio
    1.4 50 0.040 0.2963000
    1.4 100 0.080 0.2998500
    1.4 150 0.060 0.3038333
    1.4 200 0.115 0.2982500
    1.4 250 0.095 0.3027800
    1.4 300 0.145 0.3030500
    1.4 350 0.165 0.3013571
    1.4 400 0.135 0.3032625
    1.4 450 0.165 0.3020667
    1.4 500 0.220 0.3016300

  16. #13
    Devorador de queso
    Points: 95,705, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Posting AwardCommunity AwardDiscussion EnderFrequent Poster
    Dason's Avatar
    Location
    Tampa, FL
    Posts
    12,931
    Thanks
    307
    Thanked 2,629 Times in 2,245 Posts

    Re: Simulating a logistic regressio - scary results

    You don't need to ask. Send the user a pm explaining the situation and go ahead and clean up the thread. Or else I'll sue you.
    I don't have emotions and sometimes that makes me very sad.

  17. #14
    TS Contributor
    Points: 12,227, Level: 72
    Level completed: 45%, Points required for next Level: 223
    rogojel's Avatar
    Location
    I work in Europe, live in Hungary
    Posts
    1,470
    Thanks
    160
    Thanked 332 Times in 312 Posts

    Re: Simulating a logistic regressio - scary results

    I sure won't

  18. #15
    Human
    Points: 12,676, Level: 73
    Level completed: 57%, Points required for next Level: 174
    Awards:
    Master Tagger
    GretaGarbo's Avatar
    Posts
    1,362
    Thanks
    455
    Thanked 462 Times in 402 Posts

    Re: Simulating a logistic regressio - scary results


    I don't understand Rogojels code in post # 1.

    It is the expression "YProb=exp(-Fact)/(1+exp(-Fact))" with the minus signs in the exponents:

    Code: 
    x1 <-  c(1)
    x1
    B0 <- 0
    B1 <- 1 
    Fact=B0+B1*x1  
    Fact
    YProb=exp(-Fact)/(1+exp(-Fact))
    YProb
    # [1] 0.2689414
    I would prefer it like:

    Code: 
    YProb2=exp(Fact)/(1+exp(Fact)) 
    YProb2
    # [1] 0.7310586
    
    #or this:
    YProb3=  1/(1+exp(-1*(Fact))) 
    YProb3
    # [1] 0.7310586
    Also, I don't understand that the estimated p-values are so low. They are around 0.025 when the null is true. I had expected them to be more like 0.05. What have I missed?

    And if the new user "Behroozbalaei" does NOT make a thread of his own and explain better, I will sue him. But I will sue Dason anyway!

    Edit: Ha ha! I failed to put in a "not" in the comment to Behroozbalaei, so maybe the joke went all wrong.
    Behroozbalaei! Please make your own thread and then we will try to answer it.

+ Reply to Thread

           




Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts






Advertise on Talk Stats