# anova error (Error in contrasts<-)

#### sak

##### New Member
I am trying to find joint F-statistics. Since this is a question about R usage, I won't spend time on the logic of it, but this is what I am trying to achieve:

Code:
Set11 = read.table("D1.csv",sep=",",header=TRUE)  ##change attached xlsx to csv

Set21 = read.table("D2.csv",sep=",",header=TRUE)  ##change attached xlsx to csv

combset <- rbind(Set11, Set21)
combset$grp <- rep(c("Set11", "Set21"), times=c(nrow(Set11), nrow(Set21)))

anova(lm(A ~ grp, data=combset))             ## Runs successfully
anova(lm(A+B+C+D+E+F+G ~ grp, data=combset)) ## Produces the error below

The error is:

Code:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In model.matrix.default(mt, mf, contrasts) :
  variable 'grp' converted to a factor

I looked the error up on Google, and it seems a lot of people have had it in the past, but none of the suggested solutions works for me, especially given the sparse nature of my data. Can anyone please help?

Last edited:

#### Dason

##### Ambassador to the humans
Just out of curiosity... if you already had the files converted to csv and you used the csvs in your code... why did you attach the Excel versions instead of the csvs?

#### trinker

##### ggplot2orBust
Are you attempting a multivariate model?

#### sak

##### New Member
Just out of curiosity... if you already had the files converted to csv and you used the csvs in your code... why did you attach the Excel versions instead of the csvs?

Because it would not let me attach the CSV version.

#### sak

##### New Member
Are you attempting a multivariate model?

Yes. I have two sets of data: Set11 and Set21. Each set has the same variables A, B, C, D, E, F, and G. I want to do an F-test to determine whether the following relationships hold simultaneously:

Code:
Set11_A = Set21_A, Set11_B = Set21_B, Set11_C = Set21_C, Set11_D = Set21_D,
Set11_E = Set21_E, Set11_F = Set21_F, Set11_G = Set21_G

#### Dason

##### Ambassador to the humans
Because it would not let me attach the CSV version.

Huh. That's interesting. Sorry about that.

#### Dason

##### Ambassador to the humans
It looks like you actually only have one observation in each row, so we can model everything as independent.
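As a side note on the original error: "contrasts can be applied only to factors with 2 or more levels" typically appears in this situation because lm() drops every row containing an NA, and with sparse data that can leave grp with only a single level. A minimal sketch reproducing it on made-up data:

```r
# Made-up data: every g2 row contains an NA, so lm()'s default NA
# handling drops those rows, leaving grp with a single level and
# breaking the contrasts for the model matrix
d <- data.frame(A   = c(1, 2, NA, NA),
                B   = c(3, 4, NA, NA),
                grp = c("g1", "g1", "g2", "g2"))

try(lm(A + B ~ grp, data = d))
# Error in `contrasts<-`(...) :
#   contrasts can be applied only to factors with 2 or more levels
```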
If you were willing to assume that the variances were equal in all of the treatment*grp combinations, then this could be done fairly easily by reshaping the data and using a full vs. reduced model test. Here is some code for that:

Code:
setwd("Downloads/")
Set11 = read.table("D1.csv",sep=",",header=TRUE)  ##change attached xlsx to csv
Set21 = read.table("D2.csv",sep=",",header=TRUE)  ##change attached xlsx to csv

library(reshape2)
melt.1 <- na.exclude(melt(Set11))
melt.2 <- na.exclude(melt(Set21))
dat <- rbind(melt.1, melt.2)
dat$grp <- factor(rep(c("g1", "g2"), c(nrow(melt.1), nrow(melt.2))))

# Create the full and reduced models
o.full <- lm(value ~ variable*grp, data = dat)
o.red <- lm(value ~ variable, data = dat)
anova(o.red, o.full)

# Same as
o.full2 <- lm(value ~ variable + variable:grp, data = dat)
# using test of variable:grp
anova(o.full2)
We might want to examine the data a little bit though...

Code:
library(plyr)
out <- ddply(dat, .(variable, grp), summarize,
             mean = mean(value),
             sd = sd(value))

#How many observations in each variable*grp combo?
dcast(variable ~ grp, fun.aggregate = length, data = dat)
# Look at means to see if there are differences 'by eye'
dcast(variable ~ grp, data = out, value.var = "mean")
# Check if the equal variance assumption is reasonable
dcast(variable ~ grp, data = out, value.var = "sd")
with corresponding output
Code:
> dcast(variable ~ grp, fun.aggregate = length, data = dat)
variable   g1  g2
1        A 1041 979
2        B  702 734
3        C  278 268
4        D  241 231
5        E   71  90
6        F   25  26
7        G    2   3
> dcast(variable ~ grp, data = out, value.var = "mean")
variable          g1         g2
1        A  -0.8918652  -2.039113
2        B  -7.3818872  -5.766886
3        C  -8.4213611 -10.894831
4        D -20.5045199 -24.218392
5        E -35.9456003 -41.776600
6        F -18.3655425 -34.704142
7        G -49.6065706  -6.387440
> dcast(variable ~ grp, data = out, value.var = "sd")
variable       g1       g2
1        A 23.10399 21.73232
2        B 48.47890 33.11398
3        C 55.20237 54.88554
4        D 56.58071 62.00402
5        E 98.76417 84.41985
6        F 71.81309 63.67966
7        G 18.51917 20.40112
so it probably isn't reasonable to assume completely equal variances. We might want to consider an analysis that gives a unique variance term to each variable*grp combination. Alternatively, it would probably be fine to fit a model that assumes equal variances within each variable (treatment... whatever A, B, ..., G represent...).
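One way to allow a separate variance term per variable, as suggested above, is generalized least squares with nlme's varIdent variance structure. A minimal sketch on simulated stand-in data (the variable names mirror the thread, but the data here are made up):

```r
library(nlme)

set.seed(42)
# Simulated stand-in for the melted data: two groups, three variables,
# with a deliberately different error SD for each variable
sim <- expand.grid(variable = factor(c("A", "B", "C")),
                   grp      = factor(c("g1", "g2")),
                   rep      = 1:50)
sim$value <- rnorm(nrow(sim), mean = 0,
                   sd = c(A = 1, B = 5, C = 10)[sim$variable])

# Full model: a separate mean for each variable*grp combination,
# with a separate residual variance for each variable
o.full <- gls(value ~ variable * grp, data = sim,
              weights = varIdent(form = ~ 1 | variable), method = "ML")
# Reduced model: drops the group effect entirely
o.red <- gls(value ~ variable, data = sim,
             weights = varIdent(form = ~ 1 | variable), method = "ML")

# Likelihood ratio test of the group differences (both fits use ML)
anova(o.red, o.full)
```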

#### sak

##### New Member
# Create the full and reduced models
o.full <- lm(value ~ variable*grp, data = dat)
o.red <- lm(value ~ variable, data = dat)
anova(o.red, o.full)

# Same as
o.full2 <- lm(value ~ variable + variable:grp, data = dat)
# using test of variable:grp
anova(o.full2)
Can you please explain what exactly you are doing above? How does the lm function behave when using variable*grp? Why are you suggesting to compare full vs reduced models here. Since all I want to do is find the F statistics to verify whether
Code:
Set11_A = Set21_A, Set11_B = Set21_B,etc
are true simultaneously.
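For reference on the variable*grp question asked above: in an R model formula, `*` is shorthand for both main effects plus their interaction, so `variable*grp` expands to `variable + grp + variable:grp`. A quick check:

```r
# The expanded terms of a formula can be inspected with terms():
f <- value ~ variable * grp
attr(terms(f), "term.labels")
# "variable"  "grp"  "variable:grp"
```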

#### Dason

The way I set it up the full model gives a separate mean to each variable*grp combination. The reduced model only gives a mean for each variable - so it assumes that the mean is the same for variable A in both groups 1 and 2. So the full versus reduced test here is simultaneously testing

$$H_o: \mu_{A1} = \mu_{A2}; \mu_{B1} = \mu_{B2}; ...; \mu_{G1} = \mu_{G2}$$
against the alternative hypothesis that "at least one of those equalities doesn't hold" (where $$\mu_{A1}$$ is the mean of variable A in group 1).

#### sak

##### New Member
Thanks for your help with this.

My output from running the code:
Code:
anova(o.red, o.full)
is:

Code:
Analysis of Variance Table

Model 1: value ~ variable
Model 2: value ~ variable * grp
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1   4680 8320332
2   4674 8311518  6    8813.4 0.826 0.5495
Can you please explain in English what exactly it means?

Also, on what basis does it decide which model is the null hypothesis you defined above? I ask this because if I instead use:
Code:
anova(o.full,o.red)
I get the same output as above with regard to the p-value and F-statistic.

Edit:
But this does not seem to conform to the R documentation, which says:

When given a sequence of objects, anova tests the models against one another in the order specified.

Last edited:
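For reference, the F statistic and p-value in the table above can be reproduced by hand from the two residual sums of squares, which is presumably also why swapping the model order in anova() leaves them unchanged: the test depends only on the difference between the fits. A quick sanity check using the printed values:

```r
# Reproduce the F statistic from the printed ANOVA table:
# F = (difference in RSS / difference in df) / (RSS_full / df_full)
rss_red  <- 8320332; df_red  <- 4680
rss_full <- 8311518; df_full <- 4674

F_stat <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_val  <- pf(F_stat, df_red - df_full, df_full, lower.tail = FALSE)

F_stat  # close to the 0.826 in the table
p_val   # close to the 0.5495 in the table
```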