Simulating a logistic regression - scary results


TS Contributor
I tried to work out the necessary sample size for a logistic regression by simulation and got some scary results. If anyone could check the code below, it would be a great help.

I simulate a logistic regression with two normalized variables, one with a fixed odds ratio of 1.4 and the other with a variable odds ratio from 1 to 2.5.

What makes this scary is that even at relatively high sample sizes (190) the probability of failing to identify the real situation (both Xs significant) is very high. At lower sample sizes the situation is more or less hopeless.

# logistic regression sample size simulation

# for two predictor variables

# Define beta0 (start with 0)
beta0 <- 0

# Define the odds ratio for X2
# A first trial ran from 1 to 5, where 1 gives us the rate of false alarms.
# It does not make sense to go higher than about 2.5 because at that point
# the power will almost always be high.
OR <- seq(from = 1, to = 2.5, by = 0.1)

# Define the sample size
SSize <- seq(from = 50, to = 200, by = 20)

# Define a table to store the power for each OR / sample size combination
T <- expand.grid(OR, SSize)
T$PP1 <- rep(0, length(T$Var1))
T$PP2 <- rep(0, length(T$Var1))
T$PP3 <- rep(0, length(T$Var1))
T$PP4 <- rep(0, length(T$Var1))
names(T) <- c("OddsRatio", "SampleSize", "Only.X1", "X1.and.X2", "Only.X2", "Neither")

# The X variables (roughly from -3 to 3) are standardized:
# normally distributed with mean 0 and stddev 1

# Define the number of trials per setting (OR and sample size)
NTrials <- 200

# FOR each OR DO
for (or in OR) {
  # FOR each sample size DO
  for (ss in SSize) {
    # Define holding vectors for p-values
    pvals1 <- rep(NA, NTrials)
    pvals2 <- rep(NA, NTrials)
    # For NTrials times DO
    for (count in 1:NTrials) {
      # Generate SSize X values (X1 has the fixed odds ratio of 1.4)
      XVals1 <- rnorm(ss)
      XVals2 <- rnorm(ss)
      # Calculate the probability of an event for the generated Xs
      Fact <- beta0 + log(1.4) * XVals1 + log(or) * XVals2
      YProb <- exp(-Fact) / (1 + exp(-Fact))
      # Generate the Ys for the X values
      Y <- rbinom(ss, size = 1, prob = YProb)
      # Generate a data frame (not really necessary here)
      testdata <- data.frame(Y = as.factor(Y), X1 = XVals1, X2 = XVals2)
      # Run a logistic regression Y ~ X1 + X2
      mod <- glm(Y ~ X1 + X2, data = testdata, family = "binomial")
      # Save the p-values
      pvals1[count] <- summary(mod)$coefficients["X1", 4]
      pvals2[count] <- summary(mod)$coefficients["X2", 4]
    } # ENDFOR NTrials
    # Calculate the percentage of p-values less than 0.05 in each category
    px1   <- sum(pvals1 <= 0.05 & pvals2 >  0.05) / NTrials
    px1x2 <- sum(pvals1 <= 0.05 & pvals2 <= 0.05) / NTrials
    px2   <- sum(pvals1 >  0.05 & pvals2 <= 0.05) / NTrials
    pnone <- sum(pvals1 >  0.05 & pvals2 >  0.05) / NTrials
    # Save them in the power table
    T[T$OddsRatio == or & T$SampleSize == ss, ]$Only.X1   <- px1
    T[T$OddsRatio == or & T$SampleSize == ss, ]$X1.and.X2 <- px1x2
    T[T$OddsRatio == or & T$SampleSize == ss, ]$Only.X2   <- px2
    T[T$OddsRatio == or & T$SampleSize == ss, ]$Neither   <- pnone
  } # ENDFOR sample size
} # ENDFOR OR


TS Contributor
Yes, the power is fairly low in those cases. To make you more scared: you have chosen a best-case scenario by setting the constant at 0 and using standard normal distributions for your xs. That way you have overall about 50% 1s and 50% 0s for your dependent variable, thus maximizing the variance in your dependent variable. If you change the constant (or change the mean of one or more of your independent variables) the power will get worse. Here is my simulation in Stata:

clear all
set more off
set seed 123456789
program define sim
	args n b0
	drop _all
	set obs `n'
	gen x1 = rnormal()
	gen x2 = rnormal()
	gen y = runiform() < invlogit( ln(`b0') + ln(1.4)*x1 + ln(1.4)*x2 )
	version 5: logit y x1 x2
end

matrix res = J(189,3,.)
matrix colnames res = N constant power
local i = 1
foreach n of numlist 100(20)500 {
	foreach b0 of numlist .2 .4 .6 .8 1 1.25 `=5/3' 2.5 5{
		simulate b=_b[x2] se=_se[x2], reps(20000) nodots : sim `n' `b0'
		count if abs(b/se) > invnormal(.975)
	matrix res[`i++',1] = `n', `b0', r(N)/_N
	}
}

svmat res, names(col)
separate power, by(constant)
twoway line power? N,                                     ///
     lcolor(gs14 gs11 gs8 gs4 gs0 gs4 gs8 gs11 gs14)      ///
     lpattern(longdash longdash longdash longdash solid   ///
              shortdash shortdash shortdash shortdash)    ///
     legend(order(1 "0.2" 2 "0.4" 3 "0.6" 4 "0.8" 5 "1"   ///
                  6 "1.25" 7 "1.667" 8 "2.5" 9 "5")       ///
            subtitle(baseline odds) )                     ///
     yline(.8) ytitle(power)


TS Contributor
The idea is to use standardised Xs, which is recommended anyway. It definitely should not bring about 50% 1s in my Y; it should depend on the probability of Y, no?

What I find scary is the high percentage of false diagnoses, independently of the sample size. The odds ratios are practically quite high, due to the use of standardised Xs; still, the chances of getting an accurate picture are low.


TS Contributor
It is the combination of symmetrically distributed and standardized xs with a constant of 0 that leads to the 50% 1s in y.
I just ran a few simulations :) and you are right. I would just rephrase it a bit: it is the uniform sampling of the whole relevant range of Xs that is generating the roughly 50% events. It is not relevant whether the Xs are standardized or not, as long as the relevant range (defined as between x1 and x2, where P(Y|X=x1) ~ 0 and P(Y|X=x2) ~ 1) is uniformly covered.

This is, BTW, making the often-quoted sample size rule of 10 events per independent variable look really bad.



TS Contributor
The constant is the log(odds) of a 1 on y when all xs are 0. You standardized your xs, so they are 0 at their mean. You set the constant to 0, so for a person with mean values on both xs you expect the odds of getting a 1 on y to be exp(0) = 1, i.e. one success (1) per failure (0), or a 50% chance of a 1. The overall (marginal) probability of getting a 1 on y is the average of that probability over all observations. Since the probability is a non-linear function of the xs, you normally cannot say that the probability at mean values of x is the average probability. But the function relating the probability to the xs is symmetrical, and you created your explanatory variables from a symmetrical distribution, so here that happens to work out.
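To put a number on that argument (a quick Python sketch of my own, not code from the thread; the odds ratios of 1.4 and the zero constant are taken from the simulations above): the probability at the mean of the xs is invlogit(0) = 0.5, and because both the logistic curve and the x distribution are symmetric, the marginal probability averages out to 0.5 as well.

```python
import math
import random

def invlogit(z):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

random.seed(42)
b0 = 0.0                               # constant set to zero
b1, b2 = math.log(1.4), math.log(1.4)  # odds ratios of 1.4 for both xs

# probability for an observation at the mean (zero) of both standardized xs
p_at_mean = invlogit(b0)

# marginal probability: average event probability over standard-normal xs
n = 200_000
probs = [invlogit(b0 + b1 * random.gauss(0, 1) + b2 * random.gauss(0, 1))
         for _ in range(n)]
p_marginal = sum(probs) / n

print(round(p_at_mean, 3))   # 0.5
print(round(p_marginal, 3))  # very close to 0.5, by symmetry
```

With a non-zero constant (or non-symmetric xs) the two quantities would no longer coincide.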


TS Contributor
The sample size rule of 10 events per variable refers to the sample size needed before the asymptotics on which the standard error computations are based start kicking in. The fact that a standard error means what it is supposed to mean does not guarantee satisfactory power.


TS Contributor
So, by adjusting B0 I could shift the range where I am sampling with respect to the logistic curve. I guess I can have the same effect by keeping B0 at zero but moving the range of the Xs to something other than [-3, 3].

If, for instance, I pick Xs between -2 and -1, I will see just a few events; between 2 and 3, just a few non-events. I am curious how this affects the power.
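A rough illustration of that shift (a Python sketch of my own, assuming B0 = 0 and X sampled uniformly over the stated range; the OR = 5 case is my addition to show a steeper curve): how far the event proportion moves away from 50% depends on the slope as well as the range.

```python
import math
import random

def invlogit(z):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def event_rate(b1, x_lo, x_hi, n=100_000, b0=0.0):
    """Average P(event) when X is drawn uniformly from [x_lo, x_hi]."""
    total = sum(invlogit(b0 + b1 * random.uniform(x_lo, x_hi))
                for _ in range(n))
    return total / n

random.seed(1)
# With the thread's modest slope (OR = 1.4 per unit of X) the shift is mild...
print(round(event_rate(math.log(1.4), -2, -1), 2))  # about 0.38
print(round(event_rate(math.log(1.4), 2, 3), 2))    # about 0.70
# ...but with a steep curve (OR = 5) the low end really is event-poor
print(round(event_rate(math.log(5), -2, -1), 2))    # about 0.09
```

So "just a few events" in [-2, -1] requires a fairly steep logistic curve; with OR = 1.4 the event rate only drops to roughly 38%.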



TS Contributor
That is exactly the exercise I did in #2. I fixed the effect at an odds ratio of 1.4 and varied the baseline odds over 0.2, 0.4, 0.6, 0.8, 1, 1.25, 1 2/3, 2.5, and 5 (the numbers larger than 1 are the reciprocals of the numbers smaller than 1). The result is that if you choose more extreme constants, the power will go down. This is to be expected, as by choosing more extreme constants you reduce the variance in the dependent variable.


TS Contributor
This means that in real-life situations, where we do not know where our Xs are with respect to the logistic curve, the same sample size could have a lot less power than we think, right? Especially since in many cases we are likely at the end of the curve where the event probability is low, as in health applications or reliability.


TS Contributor
I do a lot of analyses with logistic regression, and I never rely on the number of observations alone to get a feel for how much I trust an analysis. One of the very first things I do is look at the proportion of 1s on my dependent variable to get a feel for how much information I really have. It is this raw proportion that will feed into the constant, and thus links to the simulation in #2.
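To make that link concrete (a Python sketch of my own; the exact intercept also depends on the covariate distribution, so this is only the rough correspondence): the raw proportion of 1s pins down approximately where the constant sits, via the logit.

```python
import math

def logit(p):
    """Log-odds implied by a raw proportion of 1s."""
    return math.log(p / (1 - p))

# Raw proportions of events and the baseline log-odds they roughly imply:
# 50% events corresponds to a constant near 0; rarer events push it negative.
for p in (0.5, 0.3, 0.1):
    print(p, round(logit(p), 2))
```

So a sample with only 10% events behaves like a simulation run with a constant around -2.2, which the simulation in #2 suggests is a low-power setting.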


TS Contributor
I think the proportion of events will depend on which logistic curve we are on (the OR, in fact) and where our Xs are on that curve. Once this is fixed, the power will depend on the sample size, meaning that for the same proportion of events my power can be abysmal (0.03) or high-ish (0.4). For the simulations I picked the edge where events are fairly rare and avoided the middle region. The proportions I checked were between 15% and 30%, and the power was really low.

As an example, for standardised Xs varying between 2 and 3:

OddsRatio  SampleSize  Power  YRatio
1.4         50         0.040  0.2963000
1.4        100         0.080  0.2998500
1.4        150         0.060  0.3038333
1.4        200         0.115  0.2982500
1.4        250         0.095  0.3027800
1.4        300         0.145  0.3030500
1.4        350         0.165  0.3013571
1.4        400         0.135  0.3032625
1.4        450         0.165  0.3020667
1.4        500         0.220  0.3016300


Ambassador to the humans
You don't need to ask. Send the user a pm explaining the situation and go ahead and clean up the thread. Or else I'll sue you.
I don't understand Rogojel's code in post #1.

It is the expression "YProb=exp(-Fact)/(1+exp(-Fact))" with the minus signs in the exponents:

x1 <- c(1)
B0 <- 0
B1 <- 1
Fact <- B0 + B1*x1
YProb <- exp(-Fact)/(1+exp(-Fact))
# [1] 0.2689414
I would prefer it like:

YProb2 <- exp(Fact)/(1+exp(Fact))
# [1] 0.7310586

#or this:
YProb3 <- 1/(1+exp(-1*(Fact)))
# [1] 0.7310586
Also, I don't understand why the estimated rejection rates are so low: they are around 0.025 when the null is true. I had expected them to be more like 0.05. What have I missed?

And if the new user "Behroozbalaei" does NOT make a thread of his own and explain better, I will sue him. But I will sue Dason anyway!

Edit: Ha ha! I failed to put in a "not" in the comment to Behroozbalaei, so maybe the joke went all wrong. :)
Behroozbalaei! Please make your own thread and then we will try to answer it.