Here is some data to play with
Code:set.seed(1e4) color = factor(c(rep("blue",20), rep("red",20), rep("yellow",150)), levels=c("red","blue","yellow")) x1 = rnorm(20+20+150,12,3) x2 = c(rnorm(20,5), rnorm(20,5), rnorm(150,7)) x3 = c(rnorm(20,7), rnorm(20,5), rnorm(150,5)) x4 = c(rnorm(20,5), rnorm(20,7), rnorm(150,5)) RedOrNot = ifelse(color=="red","Red","NotRed") df = data.frame( color = color, RedOrNot = RedOrNot, x1 = x1, x2 = x2, x3 = x3, x4 = x4 )
We are interested in the color `red` in particular. We would like to understand which explanatory variables (which among `x1`,`x2`,`x3` and `x4`) affects the probability `red` "differentially that it affects the probability of outcome `blue` or `yellow`". The question might be a little unclear but I hope the following will make my point very obvious.
Below graphs show 95% bootstrap confidence intervals.
Code:p1 = ggplot(df,aes(x=color,y=x1)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() p2 = ggplot(df,aes(x=color,y=x2)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() p3 = ggplot(df,aes(x=color,y=x3)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() p4 = ggplot(df,aes(x=color,y=x4)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() multiplot(p1,p2,p3,p4,layout=matrix(1:4,ncol=2))
"red" seem to differ from the two other groups only in respect to `x4`. I would be tempted to group `yellow` and `blue` into a single group called `NotRed` as shown below.
Code:p1 = ggplot(df,aes(x=RedOrNot,y=x1)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() p2 = ggplot(df,aes(x=RedOrNot,y=x2)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() p3 = ggplot(df,aes(x=RedOrNot,y=x3)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() p4 = ggplot(df,aes(x=RedOrNot,y=x4)) + stat_summary(fun.data=mean_cl_boot) + coord_flip() multiplot(p1,p2,p3,p4,layout=matrix(1:4,ncol=2))
Because we sampled way more `yellow` than `blue`, it seems that `Red` differ from the other colors for `x2` as well as `x4` but this is misleading. I would rather not consider `x2` as having an effect of interest when I see the first series of plots as I am looking for variables that make `red` different than both other groups.
Let's make a test with `color` as response variable.
Code:k = length(levels(df$color)) I = diag(k-1) J = matrix(rep(1, (k-1)^2), c(k-1, k-1)) priors = list(R = list(fix=1, V=(1/k) * (I + J), n = k - 1)) m = MCMCglmm(color ~ -1 + trait:(x1) + trait:(x2) + trait:(x3) + trait:(x4) , rcov = ~ us(trait):units, data = df, family = "categorical", prior = priors, verbose = FALSE, burnin = 10000, nitt = 60000, thin = 50) summary(m) post.mean l-95% CI u-95% CI eff.samp pMCMC traitcolor.blue:x1 -0.30261 -0.60335 0.02378 72.04 0.048 * traitcolor.yellow:x1 -0.02728 -0.24525 0.19561 129.73 0.832 traitcolor.blue:x2 -0.60401 -1.20394 0.04258 32.72 0.118 traitcolor.yellow:x2 2.05721 1.35060 2.94990 26.03 <0.001 *** traitcolor.blue:x3 2.27062 1.36630 3.27224 30.57 <0.001 *** traitcolor.yellow:x3 -0.30359 -0.97326 0.27461 86.95 0.306 traitcolor.blue:x4 -1.28126 -2.11725 -0.55014 41.71 <0.001 *** traitcolor.yellow:x4 -1.41233 -2.05106 -0.80523 68.91 <0.001 ***
Would it be wise to consider only the highest p.value for each explanatory variable to decide (relative to a threshold) whether a variable affects the probability of the outcome `red` differentially than both `blue` and `yellow`?
How can I go about testing what variables affect the probability of the outcome `red` "differentially than both `blue` and `yellow`"?
Similarly, when performing an LDA, I would have wished `x4` to pop up as important driver of the first axis but this is not made obvious due to `x2` again.
Code:lda(data=df,RedOrNot~x1+x2+x3+x4)$scaling LD1 x1 0.004390466 x2 -0.571945258 x3 -0.241371706 x4 0.784905799 lda(data=df,color~x1+x2+x3+x4)$scaling LD1 LD2 x1 -0.0361047 0.03607743 x2 -0.8110918 -0.09303029 x3 0.4472190 -0.69882190 x4 0.3820449 0.73839111
Last edited by Remi; 10-23-2016 at 07:03 PM.
Tweet |