# Logistic regression on a binary outcome with nominal predictors in R

#### mhof08

##### New Member
Hi there.

I have a dataset consisting of 1812 observations on packages. The variables are error (y=1/n=0) and country of origin (five countries). Error is the dependent variable and country is the independent variable. I've created dummy variables for each of the five countries.

My research question is: what is the likelihood of an error from each country compared to the entire population?

```
> logreg_3
# A tibble: 1,812 x 7
   error country    cn    nl    ee    hk    my
   <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 cn          1     0     0     0     0
 2     1 nl          0     1     0     0     0
 3     1 nl          0     1     0     0     0
 4     1 my          0     0     0     0     1
 5     1 my          0     0     0     0     1
 6     1 nl          0     1     0     0     0
 7     1 hk          0     0     0     1     0
 8     1 hk          0     0     0     1     0
 9     1 hk          0     0     0     1     0
10     1 hk          0     0     0     1     0
# ... with 1,802 more rows
```
I've run the data through a logistic model in R:

```
Call:
glm(formula = error ~ cn + nl + ee + hk + my, family = binomial,
    data = logreg_3)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0933  -1.0893  -0.6987   1.2640   1.7492

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.25276    0.40089  -3.125  0.00178 **
cn           -0.03292    0.40989  -0.080  0.93598
nl            1.04184    0.40993   2.542  0.01104 *
ee          -14.31331  280.09167  -0.051  0.95924
hk            1.05157    0.41364   2.542  0.01102 *
my                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2308.1  on 1811  degrees of freedom
Residual deviance: 2177.1  on 1807  degrees of freedom
AIC: 2187.1

Number of Fisher Scoring iterations: 14
```
I then calculated the odds ratios:

```
round(exp(coef(logit)),3)

(Intercept)          cn          nl          ee          hk          my
      0.286       0.968       2.834       0.000       2.862          NA
```

I have some difficulty interpreting the results, and there are some specific issues I'd like to address.

My questions are:

1) How do I overcome the dummy variable trap in R and avoid the NA for the last predictor? Using +0 to remove the intercept does not seem to work, as the results change in a way that makes no sense. I wish to calculate the OR for all countries to determine/forecast the risk of error for each country.
2) Is this even the right model for answering my research question?
3) If it is the correct model: is it correct to interpret the positive estimates as a sign of increased risk of error and the negative estimates as a sign of decreased risk of error? I understand that the relationship is non-linear, hence the size of an estimate makes little sense on its own.
4) How should I interpret the odds ratios in this case, with multiple predictors and a single outcome?
5) Any ideas for further modelling/analysis?

Thanks in advance

#### hlsmith

##### Not a robit
Odds ratios represent a relative comparison, so I think you are going to have difficulties. Often, if you dummy-code every level of a variable, the procedure will catch this and complain that one dummy is a linear combination of the others.
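
To make the linear-combination point concrete, here is a minimal sketch on made-up data (the `country` and `error` vectors below are simulated for illustration and are not the poster's `logreg_3`). It checks that the five dummies always sum to 1, which is exactly the intercept column, and shows that passing the factor itself lets `glm()` pick a reference level instead of returning an `NA`.

```r
# Toy data, invented for illustration only (not the poster's logreg_3)
set.seed(1)
country <- factor(sample(c("cn", "nl", "ee", "hk", "my"), 200, replace = TRUE))
error   <- rbinom(200, 1, 0.3)

# One 0/1 column per country, like the hand-made dummies in the post
dummies <- model.matrix(~ 0 + country)

# Every row has exactly one 1, so the dummies add up to the intercept column
all(rowSums(dummies) == 1)

# Intercept plus five dummies is six columns, but the rank is only five
X <- cbind(1, dummies)
c(columns = ncol(X), rank = qr(X)$rank)

# Passing the factor lets glm() drop one level as the reference category,
# so every reported coefficient is a log odds ratio against that level
fit <- glm(error ~ country, family = binomial)
coef(fit)
```

With this coding, each exponentiated coefficient is an odds ratio relative to the reference country, not relative to the whole population, which is probably why the comparison the poster wants is awkward to get from a single model.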

Is it possible for you to just run five models?
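
One possible reading of "five models" is one logistic regression per country, each with a single dummy for that country against all the others combined. That reading is an assumption, and the sketch below uses simulated data (`toy`, `is_cc` and `odds_ratios` are hypothetical names, not objects from the post).

```r
# Toy data, invented for illustration only (not the poster's logreg_3)
set.seed(2)
toy <- data.frame(
  country = sample(c("cn", "nl", "ee", "hk", "my"), 500, replace = TRUE),
  error   = rbinom(500, 1, 0.3)
)

countries <- sort(unique(toy$country))

# Fit one model per country: that country (1) versus all the others (0)
odds_ratios <- sapply(countries, function(cc) {
  toy$is_cc <- as.integer(toy$country == cc)  # changes a local copy only
  fit <- glm(error ~ is_cc, family = binomial, data = toy)
  unname(exp(coef(fit)["is_cc"]))
})

round(odds_ratios, 3)
```

Each value compares the odds of an error for one country with the odds for the remaining four countries pooled, which is close to, but not the same as, a comparison with the entire population. Fitting five separate models also means five tests, so some adjustment for multiple comparisons may be worth considering.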