Greetings all, I have what is likely a simple R data format question. My experiment is as follows: I have a single quantitative metric (Y) measured in a number of groups:
Group 1: no treatment
Group 2: Treatment A
Group 3: Treatment B
Group 4: Treatments A and B
My data is in a dataframe in R with a column specifying the quantitative metric Y and columns specifying the A and/or B treatments (i.e., 0 for not receiving a treatment and 1 for receiving it). What I'd like to know is whether this setup is correct for representing my data in R. The command I'm running is:
Am I correct in thinking that the data from groups 1 and 4 are still being incorporated into the analysis even though they aren't explicitly mentioned in the linear model? (The anova output gives a p-value for A:B, so I would assume so.) Thanks much
You will probably want to make use of R data types. In particular, R handles categorical variables through factors. For instance, consider the following Quick Bread data set (from "Applied Linear Statistical Models," chapter 15).
volume = c(910, 810, 1270, 1220, 1510, 1290, 1180, 1030)
temp = factor(rep(1:4, each = 2), labels = c("Low", "Med", "High", "HOT"))
df = data.frame(volume, temp)
fit = lm(volume ~ temp, data = df) ## Of course, I could have just used the vectors without the data frame
model.matrix(fit)
# (Intercept) tempMed tempHigh tempHOT
# 1 1 0 0 0
# 2 1 0 0 0
# 3 1 1 0 0
# 4 1 1 0 0
# 5 1 0 1 0
# 6 1 0 1 0
# 7 1 0 0 1
# 8 1 0 0 1
Notice how the "lm" function automatically knows how to treat qualitative data. Also notice how the first level (i.e., "Low") isn't included in the list of variables. This is because when R codes your categories it uses the first level as the reference. This can be changed with the "relevel" function.
relevel(temp, ref = "HOT")
# [1] Low  Low  Med  Med  High High HOT  HOT
# Levels: HOT Low Med High
Notice that you'll have to reassign the result (e.g., temp = relevel(temp, ref = "HOT")) to make use of the change in reference, since relevel() does not modify its argument in place. Now, an anova model, technically, should not include an intercept; in that case, all four levels will be included in the linear model.
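For instance, after reassigning, the new reference shows up in the fit (a quick sketch, re-creating the vectors so it runs standalone):

```r
volume = c(910, 810, 1270, 1220, 1510, 1290, 1180, 1030)
temp = factor(rep(1:4, each = 2), labels = c("Low", "Med", "High", "HOT"))

temp = relevel(temp, ref = "HOT")  # must reassign; relevel() returns a new factor
coef(lm(volume ~ temp))
# (Intercept) is now the HOT mean (1105); the other coefficients are
# differences from HOT rather than from Low
```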
Another way to get the same result is by using the "aov" function instead of "lm". Note that an "aov" object also inherits from the "lm" class. The benefits of using "aov" are debatable, and I'm sure smarter people than me can tell when it is best to use it. I believe you don't try to exclude the intercept in an "aov" model. But as you can see, you get the same results. That is not always the case when you do anova(fit) or summary(aov(...)); both of those calls will print ANOVA tables.
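To illustrate, on the bread data above the two routes give the same ANOVA table (a quick sketch, re-creating the vectors so it runs standalone):

```r
volume = c(910, 810, 1270, 1220, 1510, 1290, 1180, 1030)
temp = factor(rep(1:4, each = 2), labels = c("Low", "Med", "High", "HOT"))

fit = lm(volume ~ temp)       # fit via lm
fit_aov = aov(volume ~ temp)  # fit via aov

class(fit_aov)    # "aov" "lm" -- an aov object also inherits from lm
anova(fit)        # ANOVA table from the lm fit
summary(fit_aov)  # same table from the aov fit
```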
Now, the case where the intercept is excluded from the linear model, so that volume is regressed only on the coded temperatures, is called the "cell means model." The coefficients on the predictors are simply the group means.
lm(volume ~ temp - 1)
# lm(formula = volume ~ temp - 1)
# tempLow tempMed tempHigh tempHOT
# 860 1245 1400 1105
model.tables(aov(volume ~ temp), "means")
# Tables of means
# Grand mean
# 1152.5
#
# temp
#  Low  Med High  HOT
#  860 1245 1400 1105
So, to get to your case. The simple way would be to have a two-column dataframe containing your response variable and your categorical predictor, just like in the example I used above. You will need to make the predictor a factor either when you build the dataframe or when you make the linear fit--i.e., you can do it like I did above, or just go with lm(volume ~ factor(temp)). If temp were not already a factor, it would have to be coded to represent each of the levels, either as numbers (say, 1:4) or as characters (like my labels); R can figure it out. I defined mine manually from scratch. In any case, once you have your dataframe set up, you can make your model like above.
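As a sketch of that point, here are the two equivalent routes on the bread data: converting to a factor inside the formula versus storing a factor in the dataframe up front.

```r
volume = c(910, 810, 1270, 1220, 1510, 1290, 1180, 1030)
temp_num = rep(1:4, each = 2)  # numeric coding of the four levels

# Route 1: convert in the formula
fit_a = lm(volume ~ factor(temp_num))

# Route 2: store a labeled factor in the dataframe, then fit
df2 = data.frame(volume,
                 temp = factor(temp_num, labels = c("Low", "Med", "High", "HOT")))
fit_b = lm(volume ~ temp, data = df2)

all.equal(fitted(fit_a), fitted(fit_b))  # the two fits agree
```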
Now, what you want to do is have a model with an interaction term between a categorical variable and itself. This doesn't seem right, and I believe R would exclude it from the model (as it does if you try lm(y ~ x:x) or lm(y ~ x*x)). Both of those forms specify interactions (note, if you want a squared term you have to use either lm(y ~ poly(x, 2)) or lm(y ~ I(x*x))).
If, however, you do not have just two categories (A and B), but four as you listed, then you will have to code them that way. For instance,
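A minimal sketch of that single-factor coding, fit as a cell means model. The group labels and Y values here are made-up placeholders for illustration, not your data:

```r
# Hypothetical: all four treatment groups coded as levels of one factor
grp = factor(rep(c("None", "A", "B", "AB"), each = 2),
             levels = c("None", "A", "B", "AB"))  # set level order explicitly
Y = c(10, 12, 15, 14, 13, 15, 20, 22)             # illustrative values only

coef(lm(Y ~ grp - 1))  # cell means model: one coefficient per group mean
```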
As shown above, all the coding work for the four predictor variables that would be created is done for us. Since this is the cell means model, the coefficients listed are the means for each group, which can be checked against model.tables on an "aov" object, or simply with tapply(y, x, mean), which is just a shortcut for sapply(split(y, x), mean) (or lapply, if you want the results as a list for some reason).
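For example, on the bread data both forms recover the group means:

```r
volume = c(910, 810, 1270, 1220, 1510, 1290, 1180, 1030)
temp = factor(rep(1:4, each = 2), labels = c("Low", "Med", "High", "HOT"))

tapply(volume, temp, mean)         # Low 860, Med 1245, High 1400, HOT 1105
sapply(split(volume, temp), mean)  # same result, spelled out the long way
```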
I would need to know more about your data to help you further, but I hope the above discussion is beneficial to you.
Actually, I should revise what I mentioned above. I was thinking of your analysis as a single-factor study. In that case you would use four groupings in one column: None, A, B, and A-and-B, and the cell means model would be appropriate. Instead, you will want two variables (factors) with two levels each: A vs. not A, and B vs. not B. In that case, you should use the model lm(Y ~ A + B + A:B). You can short-hand it as lm(Y ~ A*B), since '*' expands to the main effects plus the interaction, or, in multifactor studies, lm(Y ~ (A + B + ...)^2) for all main effects and two-way interactions. The point about using factors mentioned above still stands: you don't need to code the 0s and 1s yourself; just create factors with two levels and R takes care of the rest. For instance:
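A minimal sketch of that two-factor setup (the level names and Y values here are made-up placeholders, not your data):

```r
# Hypothetical 2x2 layout with two replicates per cell
A = factor(rep(c("NotA", "A"), each = 4), levels = c("NotA", "A"))
B = factor(rep(c("NotB", "B"), times = 4), levels = c("NotB", "B"))
Y = c(10, 11, 14, 15, 12, 13, 20, 21)  # illustrative values only

fit_ab = lm(Y ~ A * B)  # expands to A + B + A:B
anova(fit_ab)           # one row each for A, B, and the A:B interaction
```

Note that no 0/1 columns were typed anywhere: R builds the dummy coding (and the interaction column) from the factor levels.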