# Simulation for logistic regression in R

#### masarimk

##### Member
Dear All,

I would like to simulate data for a logistic regression, and I need the following variables:
x1 = numeric (mean = 25, SD = 5)
x2 = numeric (mean = 50, SD = 10)
x3 = factor variable with 5 levels
x4 = factor variable with 3 levels
x5 = factor variable with 2 levels

How can I do that?
Thank you

#### hlsmith

##### Less is more. Stay pure. Stay poor.
I think the big question now is how these variables relate to the Y variable (e.g., y = B0 + B1*X1 + ... + random error)?
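A minimal sketch of one way to set this up, assuming equal level probabilities for the factors, n = 1000, and a made-up coefficient vector `beta` (none of these come from the original question). It builds the five requested predictors, forms the linear predictor from the dummy-coded design matrix, and draws y as a Bernoulli variable:

```r
set.seed(1)
n <- 1000  # assumed sample size

# Predictors as requested: two numeric variables and three factors
dat <- data.frame(
  x1 = rnorm(n, mean = 25, sd = 5),
  x2 = rnorm(n, mean = 50, sd = 10),
  x3 = factor(sample(1:5, n, replace = TRUE)),  # 5 levels, equal probability (assumption)
  x4 = factor(sample(1:3, n, replace = TRUE)),  # 3 levels
  x5 = factor(sample(1:2, n, replace = TRUE))   # 2 levels
)

# Dummy-coded design matrix: intercept, x1, x2, 4 + 2 + 1 factor contrasts = 10 columns
X <- model.matrix(~ x1 + x2 + x3 + x4 + x5, data = dat)

# Hypothetical true coefficients, in the column order of X
beta <- c(-4.0, 0.08, 0.03,       # intercept, x1, x2
          0.3, 0.6, -0.3, 0.5,    # x3 levels 2 to 5 versus level 1
          0.4, -0.4,              # x4 levels 2 and 3 versus level 1
          0.7)                    # x5 level 2 versus level 1

p <- plogis(drop(X %*% beta))     # inverse logit of the linear predictor
dat$y <- rbinom(n, 1, p)          # Bernoulli response

fit <- glm(y ~ x1 + x2 + x3 + x4 + x5, data = dat, family = binomial)
summary(fit)
```

Because `beta` is known here, the estimates in `summary(fit)` can be compared directly with the values used to generate the data.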

#### masarimk

##### Member
Hlsmith, can you tell me how to do it with random error? Actually, below is what I have done so far. It seems okay, but I do not get x1 significant.
Thank you.

Code:
set.seed(666)
x1 = rnorm(100)           # some continuous variables
x2 = rnorm(100)
x3 = sample(x = c(1, 2, 3), size = 100, prob = rep(1/3, 3), replace = TRUE)
z = 0.01 + 0.5*x1 + 1.2*x2 + 0.75*x3   # linear combination with a bias; x3 enters as a number here
pr = 1/(1 + exp(-z))      # pass through an inv-logit function

y = rbinom(100, 1, pr)    # Bernoulli response variable
data.frame(pr, y)
df = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
glm(y ~ x1 + x2 + x3, data = df, family = "binomial")                      # x3 as numeric, matching z
summary(glm(y ~ x1 + x2 + as.factor(x3), data = df, family = "binomial"))  # x3 as a factor


#### JesperHP

##### TS Contributor
Increase the sample size above N = 100 to get x1 significant at an alpha level lower than 10%.
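A rough sketch of how the effect of sample size could be checked, reusing the coefficients from the code posted earlier in the thread. The helper names `p_value_x1` and `power_x1`, the 200 replications, and the 5% level are choices made for illustration only. The model is fitted with x3 as a numeric term, matching how the data are generated:

```r
set.seed(666)

# Simulate one data set of size n and return the p-value of x1
p_value_x1 <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- sample(1:3, n, replace = TRUE)
  z  <- 0.01 + 0.5 * x1 + 1.2 * x2 + 0.75 * x3   # same linear predictor as in the thread
  y  <- rbinom(n, 1, plogis(z))
  fit <- glm(y ~ x1 + x2 + x3, family = binomial)
  coef(summary(fit))["x1", "Pr(>|z|)"]
}

# Share of simulated data sets in which x1 is significant at level alpha
power_x1 <- function(n, reps = 200, alpha = 0.05) {
  mean(replicate(reps, p_value_x1(n)) < alpha)
}

power_100  <- power_x1(100)
power_1000 <- power_x1(1000)
c(n100 = power_100, n1000 = power_1000)
```

The two numbers estimate how often a true coefficient of 0.5 on x1 would be detected at each sample size, which is more informative than the p-value from a single draw.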

#### Dason

> Hlsmith, can you tell me how to do it with random error? Actually, below is what I have done so far. It seems okay, but I do not get x1 significant.
> Thank you.

You don't want "+ random error" in your model. You are already simulating y according to the model specified by logistic regression.
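To spell the distinction out, here is a short side-by-side in generic notation (the symbols are not from the thread, and the normal error is just the usual textbook assumption for linear regression). In the logistic model the randomness enters through the Bernoulli draw itself, so there is no separate additive error term:

```latex
\begin{align*}
\text{linear regression:}\quad
  & y_i = \beta_0 + \beta_1 x_{1i} + \dots + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma^2) \\
\text{logistic regression:}\quad
  & y_i \sim \mathrm{Bernoulli}(p_i),
  \qquad \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_{1i} + \dots
\end{align*}
```

In R terms, the call to `rbinom(n, 1, pr)` is the only random step needed once `pr` has been computed from the linear predictor.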

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Correct! I was thinking of the random error you can add when generating the predictor variables themselves. Dason is right that there is no individual, stand-alone error term.

#### masarimk

##### Member
Hello guys,
I found another way to do this simulation. With the code above I could not get good estimates. Check the simulation below, which has one numeric and one categorical variable. If you have anything to add, please let me know.
Thanks.

Code:
x1_a = rnorm(100000, mean = 290, sd = 15)
x1_b = rnorm(100000, mean = 300, sd = 15)
x1 = c(x1_a, x1_b)   # numeric variable
x2_a = sample(1:4, size = 100000, prob = c(.3, .5, .1, .1), replace = TRUE)
x2_b = sample(1:4, size = 100000, prob = c(.1, .1, .3, .5), replace = TRUE)
x2 = c(x2_a, x2_b)   # categorical variable with 4 levels
y1 = sample(0:1, size = 100000, prob = c(.8, .2), replace = TRUE)
# the weights below do not sum to 1; sample() rescales them to c(2/3, 1/3)
y2 = sample(0:1, size = 100000, prob = c(.6, .3), replace = TRUE)
table(y2)            # can only be tabulated once y2 exists
y = c(y1, y2)        # create y variable
table(y)
dat = data.frame(x1 = x1, x2 = x2, y = as.factor(y))
mylogit = glm(y ~ x1 + as.factor(x2), data = dat, family = binomial())
summary(mylogit)

#### JesperHP

##### TS Contributor
You may get what you want, but what you are doing is - in a manner of speaking - simply wrong.

The reason is:

Code:
y1=sample(0:1, size=100000,  prob=c(.8,.2),replace = TRUE)
y2=sample(0:1, size=100000,  prob=c(.6,.3),replace = TRUE)
table(y2)
y=c(y1,y2)
Here the dependent variable is not simulated from a logistic model, so there is no explicit dependence of y on x and no known true parameters to estimate. If you do not know the true parameters, how do you know your estimator is not simply inconsistent?

And if you can't tell this from the simulation, what can you tell from it? What is the purpose of the simulation? (I get that it is fun to make random draws, throw dice and so on, but a higher purpose than simply celebrating randomness is usually wanted.)
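For comparison, a minimal sketch of a simulation in which the true parameters are known, using one numeric and one four-level categorical predictor as in the code above. The sample size, the level probabilities, and the values of `b0`, `b1` and `x2_effect` are all invented for illustration. Since y is drawn from the logistic model itself, the estimates can be checked against the truth:

```r
set.seed(42)
n <- 50000  # assumed sample size

x1 <- rnorm(n, mean = 295, sd = 15)                                   # numeric predictor
x2 <- sample(1:4, n, replace = TRUE, prob = c(0.3, 0.3, 0.2, 0.2))    # 4-level categorical predictor

# Hypothetical true parameters
b0 <- -15.5
b1 <- 0.05
x2_effect <- c(0, 0.5, 1.0, 1.5)   # effect of each x2 level; level 1 is the reference

z <- b0 + b1 * x1 + x2_effect[x2]  # linear predictor
y <- rbinom(n, 1, plogis(z))       # y depends on x1 and x2 through the logistic model

fit <- glm(y ~ x1 + factor(x2), family = binomial)

# Compare estimates with the values used to generate the data
truth <- c(b0, b1, x2_effect[-1])
est   <- unname(coef(fit))
cbind(truth = truth, estimate = est)
```

Rerunning this with different seeds or sample sizes shows how the estimates scatter around the true values, which is the kind of check that is not possible when y is sampled independently of the predictors.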