# Logistic regression on a binary outcome with nominal predictors in R

#### mhof08

Hi there.

I have a dataset of 1812 observations on packages. The variables are error (yes = 1, no = 0) and country of origin (five countries). Error is the dependent variable and country is the independent variable. I've created a dummy variable for each of the five countries.
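For reference, one way such indicator columns could be built is sketched below. The `toy` data frame and its values are made up for illustration and only stand in for the real data; `model.matrix()` is one option among several.

```r
# Hypothetical toy data standing in for the real 1,812-row dataset
toy <- data.frame(
  error   = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 1),
  country = c("cn", "cn", "nl", "nl", "ee", "ee", "hk", "hk", "my", "my")
)

# One indicator column per country: "+ 0" drops the intercept so that
# model.matrix() keeps all five levels instead of omitting a reference level
dummies <- model.matrix(~ country + 0, data = toy)

# Strip the "country" prefix so the columns are named cn, ee, hk, my, nl
colnames(dummies) <- sub("^country", "", colnames(dummies))

toy <- cbind(toy, dummies)
head(toy)
```

Each row belongs to exactly one country, so the five indicators always sum to 1 across a row.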

My research question is: what is the likelihood of an error for packages from each country compared to the entire population?

Code:
> logreg_3
# A tibble: 1,812 x 7
   error country    cn    nl    ee    hk    my
   <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 cn          1     0     0     0     0
 2     1 nl          0     1     0     0     0
 3     1 nl          0     1     0     0     0
 4     1 my          0     0     0     0     1
 5     1 my          0     0     0     0     1
 6     1 nl          0     1     0     0     0
 7     1 hk          0     0     0     1     0
 8     1 hk          0     0     0     1     0
 9     1 hk          0     0     0     1     0
10     1 hk          0     0     0     1     0
# ... with 1,802 more rows
I ran the data through a logistic regression model in R:

Code:
Call:
glm(formula = error ~ cn + nl + ee + hk + my, family = binomial,
    data = logreg_3)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0933  -1.0893  -0.6987   1.2640   1.7492

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.25276    0.40089  -3.125  0.00178 **
cn           -0.03292    0.40989  -0.080  0.93598
nl            1.04184    0.40993   2.542  0.01104 *
ee          -14.31331  280.09167  -0.051  0.95924
hk            1.05157    0.41364   2.542  0.01102 *
my                 NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2308.1  on 1811  degrees of freedom
Residual deviance: 2177.1  on 1807  degrees of freedom
AIC: 2187.1

Number of Fisher Scoring iterations: 14
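
The `NA` for `my` looks like the usual result of fitting an intercept together with all five indicators: the indicators sum to 1 in every row, so together they duplicate the intercept column. A minimal sketch on made-up data (the `toy` data frame and the 0.3 error rate are hypothetical, not taken from the real dataset) reproduces the effect:

```r
set.seed(1)

# Hypothetical toy data; not the real dataset
country <- sample(c("cn", "nl", "ee", "hk", "my"), 200, replace = TRUE)
toy <- data.frame(
  error = rbinom(200, 1, 0.3),
  cn = as.numeric(country == "cn"),
  nl = as.numeric(country == "nl"),
  ee = as.numeric(country == "ee"),
  hk = as.numeric(country == "hk"),
  my = as.numeric(country == "my")
)

# Same model structure as above: intercept plus all five indicators
fit <- glm(error ~ cn + nl + ee + hk + my, family = binomial, data = toy)
coef(fit)  # the last indicator, my, is reported as NA

# The design matrix has 6 columns (intercept + 5 dummies) but only rank 5,
# because cn + nl + ee + hk + my equals the intercept column
X <- model.matrix(~ cn + nl + ee + hk + my, data = toy)
ncol(X)
qr(X)$rank
```

Because `my` is listed last, it is the column that gets dropped, which effectively makes it the reference category for the other four coefficients.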
I then calculated the odds ratios:

Code:
round(exp(coef(logit)), 3)

(Intercept)          cn          nl          ee          hk          my
      0.286       0.968       2.834       0.000       2.862          NA
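
As a worked check of the arithmetic, assuming the dropped `my` indicator acts as the reference category (so the intercept $\beta_0$ is the log-odds of an error for `my`), the numbers above can be reproduced by hand:

$$
\text{odds}_{my} = e^{\beta_0} = e^{-1.25276} \approx 0.286, \qquad p_{my} = \frac{0.286}{1 + 0.286} \approx 0.222
$$

$$
\text{OR}_{nl \text{ vs } my} = e^{\beta_{nl}} = e^{1.04184} \approx 2.834
$$

$$
p_{nl} = \frac{1}{1 + e^{-(\beta_0 + \beta_{nl})}} = \frac{1}{1 + e^{0.21092}} \approx 0.447
$$

Under that assumption, each odds ratio compares one country with `my`, not with the entire population.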

I'm having some difficulty interpreting the results, and there are some specific issues I'd like to address.

My questions are:

1) How do I overcome the dummy variable trap in R and avoid the NA for the last predictor? Using +0 to remove the intercept does not seem to work, as the results change in a manner that makes no sense. I want to calculate an OR for every country to determine/forecast the risk of error for each country.
2) Is this even the right model for answering my research question?
3) If it is the correct model: is it correct to interpret positive estimates as a sign of increased risk of error and negative estimates as a sign of decreased risk? I understand that the relationship is non-linear, so the size of an estimate means little on its own.
4) How should I interpret the odds ratios in this case, with multiple predictors and a single outcome?
5) Any ideas for further modelling/analysis?
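
To make questions 1) and 2) concrete, here is a rough sketch of two alternative codings of the same model, using `country` as a single factor instead of hand-made dummies. Everything in it is an assumption for illustration: the `toy` data, the `p_true` probabilities, and the choice of `contr.sum`. Note that sum-to-zero coding compares each country with the unweighted mean of the country log-odds, which is not quite the same thing as the pooled population.

```r
set.seed(42)

# Hypothetical toy data; the real data frame is logreg_3
toy <- data.frame(
  country = factor(sample(c("cn", "nl", "ee", "hk", "my"), 500, replace = TRUE))
)

# Made-up error probabilities per country, for illustration only
p_true <- c(cn = 0.2, ee = 0.1, hk = 0.45, my = 0.2, nl = 0.45)
toy$error <- rbinom(nrow(toy), 1, p_true[as.character(toy$country)])

# Treatment coding (R's default): the first factor level is the reference,
# so no coefficient comes out as NA
fit_ref <- glm(error ~ country, family = binomial, data = toy)

# Sum-to-zero (deviation) coding: each coefficient is a country's log-odds
# minus the unweighted mean of the five country log-odds
fit_dev <- glm(error ~ country, family = binomial, data = toy,
               contrasts = list(country = "contr.sum"))

# Fitted error probability for each country (identical under both codings)
newd <- data.frame(
  country = factor(levels(toy$country), levels = levels(toy$country))
)
pred <- predict(fit_ref, newdata = newd, type = "response")
names(pred) <- levels(toy$country)
round(pred, 3)
```

With `contr.sum`, the coefficient for the last factor level is not printed; it equals minus the sum of the other four.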