Comparison of one outcome to all the other outcomes

Remi

New Member
#1
Here is some data to play with


Code:
    set.seed(1e4)    
    color = factor(c(rep("blue",20), rep("red",20), rep("yellow",150)), levels=c("red","blue","yellow"))
    x1 = rnorm(20+20+150,12,3)
    x2 = c(rnorm(20,5), rnorm(20,5), rnorm(150,7))
    x3 = c(rnorm(20,7), rnorm(20,5), rnorm(150,5))
    x4 = c(rnorm(20,5), rnorm(20,7), rnorm(150,5))
    RedOrNot = ifelse(color=="red","Red","NotRed")
    
    df = data.frame(
      color = color,
      RedOrNot = RedOrNot,
      x1 = x1, 
      x2 = x2,
      x3 = x3,
      x4 = x4
     )

We are interested in the color `red` in particular. We would like to understand which explanatory variables (which among `x1`,`x2`,`x3` and `x4`) affects the probability `red` "differentially that it affects the probability of outcome `blue` or `yellow`". The question might be a little unclear but I hope the following will make my point very obvious.

Below graphs show 95% bootstrap confidence intervals.

Code:
    p1 = ggplot(df,aes(x=color,y=x1)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    p2 = ggplot(df,aes(x=color,y=x2)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    p3 = ggplot(df,aes(x=color,y=x3)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    p4 = ggplot(df,aes(x=color,y=x4)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    multiplot(p1,p2,p3,p4,layout=matrix(1:4,ncol=2))


"red" seem to differ from the two other groups only in respect to `x4`. I would be tempted to group `yellow` and `blue` into a single group called `NotRed` as shown below.

Code:
    p1 = ggplot(df,aes(x=RedOrNot,y=x1)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    p2 = ggplot(df,aes(x=RedOrNot,y=x2)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    p3 = ggplot(df,aes(x=RedOrNot,y=x3)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    p4 = ggplot(df,aes(x=RedOrNot,y=x4)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
    multiplot(p1,p2,p3,p4,layout=matrix(1:4,ncol=2))



Because we sampled way more `yellow` than `blue`, it seems that `Red` differ from the other colors for `x2` as well as `x4` but this is misleading. I would rather not consider `x2` as having an effect of interest when I see the first series of plots as I am looking for variables that make `red` different than both other groups.


Let's make a test with `color` as response variable.

Code:
    k = length(levels(df$color))
    I = diag(k-1)
    J = matrix(rep(1, (k-1)^2), c(k-1, k-1))
    priors = list(R = list(fix=1, V=(1/k) * (I + J), n = k - 1))    
    m = MCMCglmm(color ~  -1  + trait:(x1) + trait:(x2) + trait:(x3) + trait:(x4) , rcov = ~ us(trait):units, data = df, family = "categorical", prior = priors, verbose = FALSE, burnin = 10000, nitt = 60000, thin = 50)
    summary(m)

                         post.mean l-95% CI u-95% CI eff.samp  pMCMC    
    traitcolor.blue:x1    -0.30261 -0.60335  0.02378    72.04  0.048 *  
    traitcolor.yellow:x1  -0.02728 -0.24525  0.19561   129.73  0.832    
    traitcolor.blue:x2    -0.60401 -1.20394  0.04258    32.72  0.118    
    traitcolor.yellow:x2   2.05721  1.35060  2.94990    26.03 <0.001 ***
    traitcolor.blue:x3     2.27062  1.36630  3.27224    30.57 <0.001 ***
    traitcolor.yellow:x3  -0.30359 -0.97326  0.27461    86.95  0.306    
    traitcolor.blue:x4    -1.28126 -2.11725 -0.55014    41.71 <0.001 ***
    traitcolor.yellow:x4  -1.41233 -2.05106 -0.80523    68.91 <0.001 ***

Would it be wise to consider only the highest p.value for each explanatory variable to decide (relative to a threshold) whether a variable affects the probability of the outcome `red` differentially than both `blue` and `yellow`?

How can I go about testing what variables affect the probability of the outcome `red` "differentially than both `blue` and `yellow`"?

Similarly, when performing an LDA, I would have wished `x4` to pop up as important driver of the first axis but this is not made obvious due to `x2` again.

Code:
    lda(data=df,RedOrNot~x1+x2+x3+x4)$scaling
    
                LD1
    x1  0.004390466
    x2 -0.571945258
    x3 -0.241371706
    x4  0.784905799

    lda(data=df,color~x1+x2+x3+x4)$scaling

              LD1         LD2
    x1 -0.0361047  0.03607743
    x2 -0.8110918 -0.09303029
    x3  0.4472190 -0.69882190
    x4  0.3820449  0.73839111
 
Last edited: