Question regarding grouping several variables

In the context of regression, in what situations am I allowed to group separate variables into a single categorical predictor? I'll use R code as an example since it's what I'm most familiar with.
For example, I could run a model like this:
dat <- data.frame(ind=c(1,2,3,4,5,6), y=c(40,63,23,66,74,45), day1=c(4,6,3,6,1,3), day2=c(6,4,7,9,8,9))
  ind  y day1 day2
1   1 40    4    6
2   2 63    6    4
3   3 23    3    7
4   4 66    6    9
5   5 74    1    8
6   6 45    3    9

mod <- lm(y ~ day1 + day2 + day1:day2, data=dat)
summary(mod)  # coefficient table shown below
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -133.873    230.464  -0.581    0.620
day1          33.121     41.781   0.793    0.511
day2          23.062     29.095   0.793    0.511
day1:day2     -4.046      5.289  -0.765    0.524
But I could also reshape dat into long format like this:
library(reshape2)  # melt() is assumed to come from reshape2
dat2 <- melt(dat, id.vars = c("ind","y"))
   ind  y variable value
1    1 40     day1     4
2    2 63     day1     6
3    3 23     day1     3
4    4 66     day1     6
5    5 74     day1     1
6    6 45     day1     3
7    1 40     day2     6
8    2 63     day2     4
9    3 23     day2     7
10   4 66     day2     9
11   5 74     day2     8
12   6 45     day2     9

mod2 <- lm(y ~ variable + value + variable:value, data=dat2)
summary(mod2)  # coefficient table shown below
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         47.7965    20.7469   2.304   0.0502 .
variableday2        -1.7345    41.7837  -0.042   0.9679
value                1.0531     4.9129   0.214   0.8356
variableday2:value  -0.2478     6.9479  -0.036   0.9724
The two approaches give very different results, and I'm not sure if one is considered more valid than the other.


Well-Known Member
Can you give a simple non-R example? Or paste a small runnable R example inside a block that starts with [ CODE] and ends with [/CODE], and include the data.


Less is more. Stay pure. Stay poor.
Two questions:

1.) What is the purpose of the model? What model are you trying to run? mod, fit on dat, has two main effects and a single two-way interaction.

2.) Look at which coefficients each model generates; that will help you work out what model is actually being run. If they produce different output, then they are likely different models.


Well-Known Member
Hi Stat20,

The second model is incorrect, even if you ignore the interaction X1X2.

Take one row as an example:

In the first model:

X1 X2 X1X2 Y
4 6 24 40

In the second model:

X1 X2 X1X2 Y
4 0 0 40
0 6 0 40

Why would you expect to get the same answer?
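A minimal sketch of how to inspect what each model is actually fed, using `model.matrix()` from base R. The long-format `dat2` is rebuilt here with base R so the snippet runs without extra packages; that construction is an assumption meant to mirror the `melt()` call in the first post. R's default coding produces different columns from the X1/X2 layout sketched above, but the key point is the same: in the second model each individual contributes two rows that share one y.

```r
# Data from the original post
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format built with base R (intended to match the melt() output)
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = nrow(dat))),
                   value    = c(dat$day1, dat$day2))

# Design matrix of the first model: one row per individual
X1 <- model.matrix(y ~ day1 + day2 + day1:day2, data = dat)
print(X1[1, ])

# Design matrix of the second model: two rows per individual, same y in both
X2 <- model.matrix(y ~ variable + value + variable:value, data = dat2)
print(X2[dat2$ind == 1, ])
```

For individual 1, the first matrix carries day1, day2 and their product in a single row, while the second never sees day1 and day2 together in the same row.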
Thank you obh for your answer.
I do not expect the models to give the same answer; I was just wondering whether the second model is valid. Why am I not allowed to have this structure?
X1 X2 X1X2 Y
4 0 0 40
0 6 0 40

If y represents performance on a math test and day1 and day2 are hours spent studying, maybe I would like to test whether math scores can be predicted by a combination of time spent studying on day1 and day2. I realize that this is a repeated-measures design, so to be accurate I should use a mixed-model approach instead, but my question applies to the mixed-model case as well.


Well-Known Member
Hi S,

The following row is incorrect:
X1 X2 X1X2 Y
4 0 0 40

This is incorrect because when X1=4 and Y=40, X2 is not 0; it is 6.

You may decide, for example, to use only X1, only X2, or both:
Y=a0+a1X1 okay.
Y=a0+a1X2 okay.
Y=a0+a1X1+a2X2 okay.

But you can't stack the data for Y=a0+a1X1 and the data for Y=a0+a1X2 into the same model.
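One way to see what the stacked model does, sketched under the assumption that `dat2` below matches the melted data from the first post: its coefficients reproduce the two separate simple regressions of y on day1 and y on day2. It therefore never estimates the effect of day1 while holding day2 fixed, and it treats the 12 stacked rows as if they were independent observations.

```r
# Data from the original post
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format built with base R (intended to match the melt() output)
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = nrow(dat))),
                   value    = c(dat$day1, dat$day2))

# Two separate simple regressions
fit_day1 <- lm(y ~ day1, data = dat)
fit_day2 <- lm(y ~ day2, data = dat)

# The stacked model from the first post
mod2 <- lm(y ~ variable + value + variable:value, data = dat2)
b <- coef(mod2)

# Intercept and slope implied by mod2 for each "variable" level
line_day1 <- c(b[["(Intercept)"]], b[["value"]])
line_day2 <- c(b[["(Intercept)"]] + b[["variableday2"]],
               b[["value"]] + b[["variableday2:value"]])

# Compare with the separate fits (point estimates agree)
print(rbind(stacked = line_day1, separate = unname(coef(fit_day1))))
print(rbind(stacked = line_day2, separate = unname(coef(fit_day2))))
```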

Thank you obh for your answer.
I realize that this is a repeated measures.
This is not a repeated-measures design, since you have only one DV (Y).
You have two predictors for one DV (the predictors may be dependent).
Thanks obh, that was very helpful. Do you mind if I ask you another question?
If I go with

do I need to correct for multiple testing?

This is not a repeated-measures design, since you have only one DV (Y).
You have two predictors for one DV (the predictors may be dependent).
Oh I see. I thought repeated measures meant that I have more than two IVs per individual. In my case each individual is measured on two occasions: day1 and day2.
I'm sure I'm wrong, I'm just trying to understand :)


Well-Known Member
Hi S,

You should choose one model, the best regression model: Y=a0+a1X1 or Y=a0+a1X2 or Y=a0+a1X1+a2X2 or Y=a0+a1X1+a2X2+a3X1X2.
So there is no need to correct (unless you need to select predictors, which is a different story).
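As a rough sketch, the four candidate models can be fit side by side and compared. Using AIC as the yardstick is an assumption made here for illustration; the post does not say which criterion defines "best", and with only six observations any such comparison is very unstable.

```r
# Data from the original post
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# The four candidate models listed above
candidates <- list(
  day1_only   = lm(y ~ day1, data = dat),
  day2_only   = lm(y ~ day2, data = dat),
  additive    = lm(y ~ day1 + day2, data = dat),
  interaction = lm(y ~ day1 + day2 + day1:day2, data = dat)
)

# AIC for each candidate (lower is better); the criterion is an assumption
aic <- sapply(candidates, AIC)
print(aic)
print(names(which.min(aic)))
```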

Repeated measures means that you measure the same thing on the same subject more than once.
For example, blood pressure before and after taking a medicine (paired t-test).

When you have more than one predictor, this is multiple regression.
Again, there may be a dependency between day1 and day2.
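A quick check of both points, as a sketch that assumes the long-format `dat2` below matches the melted data from the first post. In the long data the outcome y does not vary within an individual (it is the same value copied twice), so y is not measured repeatedly; only the predictor differs between day1 and day2. The correlation between day1 and day2 gives a first look at their dependency.

```r
# Data from the original post
dat <- data.frame(ind  = 1:6,
                  y    = c(40, 63, 23, 66, 74, 45),
                  day1 = c(4, 6, 3, 6, 1, 3),
                  day2 = c(6, 4, 7, 9, 8, 9))

# Long format built with base R (intended to match the melt() output)
dat2 <- data.frame(ind      = rep(dat$ind, 2),
                   y        = rep(dat$y, 2),
                   variable = factor(rep(c("day1", "day2"), each = nrow(dat))),
                   value    = c(dat$day1, dat$day2))

# Within-individual variance of the outcome: zero for everyone
within_var_y <- tapply(dat2$y, dat2$ind, var)
print(within_var_y)

# Within-individual variance of the predictor: non-zero, it really was measured twice
within_var_value <- tapply(dat2$value, dat2$ind, var)
print(within_var_value)

# Dependency between the two predictors in the original wide data
print(cor(dat$day1, dat$day2))
```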
Hi obh,

A follow-up question: what if I use a mixed model (e.g., lme4::lmer) for mod2 to account for the fact that the data come from the same ind? Would it still be incorrect?