You will probably want to make use of R data types. In particular, R handles categorical variables through factors. For instance, consider the following Quick Bread data set (from "Applied Linear Statistical Models," chapter 15).

Notice how the "lm" function automatically knows how to treat qualitative data. Also notice how the first factor (i.e., "Low") isn't included in the list of variables. This is because when R codes your categories it uses the first factor as the reference. This can be changed with the "relevel" function.Code:`volume = c(910, 810, 1270, 1220, 1510, 1290, 1180, 1030) temp = factor(rep(1:4, each = 2), labels = c("Low", "Med", "High", "HOT")) df = data.frame(volume, temp) fit = lm(volume ~ temp, data = df) ## Of course, I could have just used the vectors without the data frame model.matrix(fit) # (Intercept) tempMed tempHigh tempHOT # 1 1 0 0 0 # 2 1 0 0 0 # 3 1 1 0 0 # 4 1 1 0 0 # 5 1 0 1 0 # 6 1 0 1 0 # 7 1 0 0 1 # 8 1 0 0 1`

Notice that you'll have to reassign the variable to make use of the change in reference. Now, an anova model, technically, should not include an intercept. In that case, all four categories will be included in the linear model.Code:`relevel(temp, ref = "HOT") # [1] Low Low Med Med High High HOT HOT # Levels: HOT Low Med High`

Another way to get the same result is by using the "aov" function instead of "lm". Note that "aov" does retain an "lm" class object, also. The benefits of using "aov" are questionable, and I'm sure smarter people than me can tell when it is best to use it. I believe you don't try to exclude the intercept in an "aov" model. But as you can see, you get the same results. That is not always the case when you do anova(fit) or summary(aov(...)). Both of those modes will print ANOVA tables.Code:`model.matrix(lm(volume ~ temp - 1)) # tempLow tempMed tempHigh tempHOT # 1 1 0 0 0 # 2 1 0 0 0 # 3 0 1 0 0 # 4 0 1 0 0 # 5 0 0 1 0 # 6 0 0 1 0 # 7 0 0 0 1 # 8 0 0 0 1`

Now, the case when the intercept is excluded from the linear model, so all that volume is being regressed on are the coded temperatures, is called the "cell means model." The coefficients on the predictors just are the means.Code:`model.matrix(aov(volume ~ temp)) # (Intercept) tempMed tempHigh tempHOT # 1 1 0 0 0 # 2 1 0 0 0 # 3 1 1 0 0 # 4 1 1 0 0 # 5 1 0 1 0 # 6 1 0 1 0 # 7 1 0 0 1 # 8 1 0 0 1`

So, to get to your case. The simple way would be to have a two column dataframe, containing your response variable and your categorical predictor, just like in the example I chose to use above. You will need to either control the assignment of it being a factor in the dataframe, or when you make the linear fit--i.e., you can do it like I did above, or just go with lm(volume ~ factor(temp)). If temp were not a factor, it would have to be coded to represent each of the levels, either as numbers (say, 1:4) or as characters (like my labels). R can figure it out. I defined mine manually from scratch. In any case, once you have your dataframe set up, you can make your model like above.Code:`lm(volume ~ temp - 1) # Call: # lm(formula = volume ~ temp - 1) # # Coefficients: # tempLow tempMed tempHigh tempHOT # 860 1245 1400 1105 model.tables(aov(volume ~ temp), "means") # Tables of means # Grand mean # # 1152 # # temp # temp # Low Med High HOT # 860 1245 1400 1105`

Now, what you want to do is have a model that has an interaction term with a categorical variable with itself. This doesn't seem right, and I believe R would exclude it from the model (like it does if you try to do lm(y ~ x:x) or lm(y ~ x*x)). Both of those forms would specify interactions (note, if you want a squared term you either have to do lm(y ~ poly(x, 2)) or lm(y ~ I(x*x))).

If, however, you do not have just two categories (A and B), but four as you listed, then you will have to code them that way. For instance,

As shown above, all the coding work for the four predictor variables that would be created are done for us. As this is the cell means model, the coefficients listed are the means for each group, which can be checked against the model.tables on an "aov" object, or by simply doing tapply(y, x, mean), which is just a shortcut way of saying sapply(split(y, x), mean) (or lapply, if you want the results as a list for some reason).Code:`(y <- sample(1:1000, 12)) # [1] 161 741 375 265 412 303 823 252 716 910 526 654 (x <- gl(4, 3)) # [1] 1 1 1 2 2 2 3 3 3 4 4 4 # Levels: 1 2 3 4 lm(y ~ x -1) # Call: # lm(formula = y ~ x - 1) # # Coefficients: # x1 x2 x3 x4 # 426 327 597 697 model.matrix(lm(y ~ x -1)) # x1 x2 x3 x4 # 1 1 0 0 0 # 2 1 0 0 0 # 3 1 0 0 0 # 4 0 1 0 0 # 5 0 1 0 0 # 6 0 1 0 0 # 7 0 0 1 0 # 8 0 0 1 0 # 9 0 0 1 0 # 10 0 0 0 1 # 11 0 0 0 1 # 12 0 0 0 1`

I would need to know more about your data to help your further, but I hope the above discussion is beneficial to you.