1. Dummy Variables

Is there a limit to the number of independent variables that can be dummy coded for multiple regression as well as linear regression? In other words, can I only have 2 dummy-coded predictors in a model with no other continuous variables for MR and LR? Thanks.

2. Re: Dummy Variables

Yes, there are limits...in fact you're discussing two separate cases.

(1) The first is that the number of IV's cannot exceed K-1, where K is the number of IVs.

(2) The number of IV's cannot exceed N - 1, where N is the sample size.

3. Re: Dummy Variables

Originally Posted by Dragan
(1) The first is that the number of IV's cannot exceed K-1, where K is the number of IVs.
What now?

4. Re: Dummy Variables

I think he/she meant that the number of dummy variables cannot exceed k-1, where k is the number of levels of the categorical variable they were created from. There is no limit on the number of dummy variables tied to other specific IVs in the model (exceeding your degrees of freedom is extremely rare in practice in regression).

5. Re: Dummy Variables

Ah. Well then we need to add the caveat that you only need k-1 when you have an intercept term in your model. If you don't then you can create k dummy variables for a categorical variable with k levels.


7. Re: Dummy Variables

Originally Posted by Dason
Ah. Well then we need to add the caveat that you only need k-1 when you have an intercept term in your model. If you don't then you can create k dummy variables for a categorical variable with k levels.
That is fascinating. Every commentary I ever read stressed that you could only have k-1 - that otherwise you would have a perfect linear combination, which would make the regression not run (something I have actually encountered in SAS). Indeed I read that several times in the last few days while studying for comps.
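One way to see that perfect linear combination concretely - sketched here in Python rather than R, with made-up group labels: when the model has an intercept column of 1s, the k dummy columns always sum to exactly that intercept column, so the design matrix is rank-deficient.

```python
# Three groups, three observations each (made-up labels).
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# Intercept column plus ALL k = 3 dummy columns
intercept = [1 for _ in groups]
dummies = [[1 if g == level else 0 for g in groups] for level in (1, 2, 3)]

# Each observation has exactly one of the k dummies set to 1,
# so the dummy columns sum column-wise to the intercept column:
dummy_sum = [sum(cols) for cols in zip(*dummies)]

print(dummy_sum == intercept)  # True: a perfect linear combination
```

This is exactly the redundancy that makes software refuse to fit (or silently drop a column from) a model with an intercept plus all k dummies.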

8. Re: Dummy Variables

That's because it's assuming you have an intercept in your model.

9. Re: Dummy Variables

Can you simplify this again, one more time? Just the k-1 part. Perhaps with a hypothetical example.

10. Re: Dummy Variables

Here is an example showing what the dummy coding would look like for 3 groups. In one model I include an intercept (and thus can only use 2 dummy variables) and in the other model I don't include an intercept (and thus have 3 dummy variables). I use R.

Code:
> set.seed(551)
> # Construct 3 groups and random y values
> n.group <- 3
> n.samp <- 3
> dat <- data.frame(group = gl(n.group, n.samp), y = rnorm(n.group*n.samp))
> dat
group          y
1     1  0.8410229
2     1  0.2376842
3     1  0.6098874
4     2 -1.4915477
5     2 -0.4559226
6     2  0.7860064
7     3 -1.0857772
8     3 -1.2947018
9     3 -0.2002826
>
> # By default lm includes an intercept
> # So for group it will create 2 dummy variables for us
> # model.matrix constructs the matrix used in the regression
> model.matrix(~ group, data = dat)
(Intercept) group2 group3
1           1      0      0
2           1      0      0
3           1      0      0
4           1      1      0
5           1      1      0
6           1      1      0
7           1      0      1
8           1      0      1
9           1      0      1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
>
> # This fits the linear model itself
> # using the model matrix above.  So it has
> # an intercept and 2 dummy variables
> # The dummy variables here represent the difference between
> # group2 and group1, and the difference between group3 and group1
> o <- lm(y ~ group, data = dat)
> summary(o)

Call:
lm(formula = y ~ group, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-1.10439 -0.32518 -0.06877  0.27816  1.17316

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.5629     0.4385   1.284   0.2466
group2       -0.9500     0.6201  -1.532   0.1764
group3       -1.4231     0.6201  -2.295   0.0615 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7595 on 6 degrees of freedom
Multiple R-squared: 0.4766,	Adjusted R-squared: 0.3021
F-statistic: 2.732 on 2 and 6 DF,  p-value: 0.1434

> anova(o)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
group      2 3.1516 1.57581  2.7317 0.1434
Residuals  6 3.4612 0.57687
>
>
> # We can tell it to not include an intercept
> # Now we get three dummy variables
> # The dummy variables here just represent the mean
> # value for each group (cell means model)
> model.matrix(~ group - 1, data = dat)
  group1 group2 group3
1      1      0      0
2      1      0      0
3      1      0      0
4      0      1      0
5      0      1      0
6      0      1      0
7      0      0      1
8      0      0      1
9      0      0      1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

>
>
> o.cellmean <- lm(y ~ group -1, data = dat)
> summary(o.cellmean)

Call:
lm(formula = y ~ group - 1, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-1.10439 -0.32518 -0.06877  0.27816  1.17316

Coefficients:
Estimate Std. Error t value Pr(>|t|)
group1   0.5629     0.4385   1.284   0.2466
group2  -0.3872     0.4385  -0.883   0.4113
group3  -0.8603     0.4385  -1.962   0.0975 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7595 on 6 degrees of freedom
Multiple R-squared: 0.5112,	Adjusted R-squared: 0.2668
F-statistic: 2.092 on 3 and 6 DF,  p-value: 0.2027

> anova(o.cellmean)
Analysis of Variance Table

Response: y
Df Sum Sq Mean Sq F value Pr(>F)
group      3 3.6202 1.20674  2.0919 0.2027
Residuals  6 3.4612 0.57687
>
> # Notice that the resulting p-values from the two
> # calls to "anova" are different even though I
> # claimed we can get the same results.
> # That is because the two tests being done
> # are testing different things.  So...
> # if we want the "Omnibus" F-test (which is performed
> # in the first anova) we need
> # to construct it ourselves by creating the null model
> o.null <- lm(y ~ 1, data = dat)
>
> # The F-tests here should now match
> anova(o.null, o.cellmean)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ group - 1
Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      8 6.6128
2      6 3.4612  2    3.1516 2.7317 0.1434
> anova(o)
Analysis of Variance Table

Response: y
Df Sum Sq Mean Sq F value Pr(>F)
group      2 3.1516 1.57581  2.7317 0.1434
Residuals  6 3.4612 0.57687
And if you just want the code to reproduce this yourself:
Code:

set.seed(551)
# Construct 3 groups and random y values
n.group <- 3
n.samp <- 3
dat <- data.frame(group = gl(n.group, n.samp), y = rnorm(n.group*n.samp))
dat

# By default lm includes an intercept
# So for group it will create 2 dummy variables for us
# model.matrix constructs the matrix used in the regression
model.matrix(~ group, data = dat)

# This fits the linear model itself
# using the model matrix above.  So it has
# an intercept and 2 dummy variables
# The dummy variables here represent the difference between
# group2 and group1, and the difference between group3 and group1
o <- lm(y ~ group, data = dat)
summary(o)
anova(o)

# We can tell it to not include an intercept
# Now we get three dummy variables
# The dummy variables here just represent the mean
# value for each group (cell means model)
model.matrix(~ group - 1, data = dat)

o.cellmean <- lm(y ~ group -1, data = dat)
summary(o.cellmean)
anova(o.cellmean)

# Notice that the resulting p-values from the two
# calls to "anova" are different even though I
# claimed we can get the same results.
# That is because the two tests being done
# are testing different things.  So...
# if we want the "Omnibus" F-test (which is performed
# in the first anova) we need
# to construct it ourselves by creating the null model
o.null <- lm(y ~ 1, data = dat)

# The F-tests here should now match
anova(o.null, o.cellmean)
anova(o)

11. Re: Dummy Variables

A simple example. Gender is your categorical variable. There are k=2 levels (male and female). You create one (k-1) dummy variable from this - for instance, male, where if the subject is male they are coded 1 and if female, 0.

Then the regression becomes:

Y = b0 (the intercept) + b1*Xmale (the dummy variable)
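As a sketch of what those coefficients mean - in Python with made-up numbers, since the formula above is generic: for a single-dummy regression with an intercept, least squares gives b0 = mean of the reference group (female here) and b1 = difference between the male and female means.

```python
from statistics import mean

# Hypothetical data: outcome y and a dummy "male" (1 = male, 0 = female)
y    = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
male = [0, 0, 0, 1, 1, 1]

female_mean = mean(v for v, m in zip(y, male) if m == 0)
male_mean   = mean(v for v, m in zip(y, male) if m == 1)

# OLS with one dummy just reproduces the group means:
b0 = female_mean              # intercept = mean of the reference group
b1 = male_mean - female_mean  # dummy coefficient = difference in means

print(b0, b1)  # 4.0 3.0
```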

12. Re: Dummy Variables

And if you don't include an intercept you could let

Xfemale = 0 if male, 1 if female
Xmale = 1 if male, 0 if female.

Then the regression could be

Y = b1*Xmale + b2*Xfemale
13. Re: Dummy Variables

Yes, I get this part (thanks).

I was unsure about the question of whether there was a limit on the number of categories for a variable. I understand no more categories than the number of observations, and the related power issues, but the k-1 portion got a little cloudy.

14. Re: Dummy Variables

Maybe you guys are talking about something else, but isn't having K dummy variables for a variable with K levels redundant? You can get away with having only K-1 dummy variables because setting all K-1 of them to zero can represent the Kth level.

15. Re: Dummy Variables

Originally Posted by lancearmstrong1313
Maybe you guys are talking about something else, but isn't having K dummy variables for a variable with K levels redundant? You can get away with having only K-1 dummy variables because setting all K-1 of them to zero can represent the Kth level.
But that only works if you have an intercept term in your model. Which is what I've been saying.

If you have an intercept then you only need k-1 dummy variables to represent a categorical variable that has k levels.
If you don't have an intercept then you need all k dummy variables to represent a categorical variable with k levels.
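The two codings can be sketched side by side - a Python illustration with assumed level names, following the same treatment coding the R example above uses:

```python
levels = ["a", "b", "c"]  # a categorical variable with k = 3 levels

def with_intercept(g):
    # Intercept plus k - 1 dummies; level "a" is the reference,
    # encoded by setting both dummies to zero.
    return [1] + [1 if g == lev else 0 for lev in levels[1:]]

def without_intercept(g):
    # No intercept, so all k dummies are needed.
    return [1 if g == lev else 0 for lev in levels]

print([with_intercept(g) for g in levels])
# [[1, 0, 0], [1, 1, 0], [1, 0, 1]]
print([without_intercept(g) for g in levels])
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Both encodings use three columns for three levels and carry the same information; the first spends one column on the intercept, the second spends it on the reference group's own dummy.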

16. Re: Dummy Variables

Originally Posted by hlsmith
Yes, I get this part (thanks).

I was unsure about the question on if there was a limit on the number of categories for a variable. I understand no more categories than number of observations and related power issues, but the k-1 portion got a little cloudy.
There is no statistical limit to how many categories a variable can have. There is a limit to how many dummies you can create from that variable if you use an intercept (which most of the time you will): it is one less than the number of levels of the categorical variable (k-1).

If the categorical variable has a lot of levels and they are ordered, at a certain point it makes sense to treat it as an interval variable. But that is not required.
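As a hypothetical illustration of that last point (Python, with an assumed 7-point ordered scale): the same variable can enter the design matrix either as k-1 dummy columns or as a single numeric column.

```python
# A hypothetical ordered rating on a 1-7 scale
def as_dummies(r, k=7):
    # Categorical treatment: k - 1 = 6 dummy columns
    # (level 1 is the reference when the model has an intercept)
    return [1 if r == level else 0 for level in range(2, k + 1)]

def as_interval(r):
    # Interval treatment: one numeric column
    return [r]

print(as_dummies(5))   # [0, 0, 0, 1, 0, 0]
print(as_interval(5))  # [5]
```

The dummy version spends six degrees of freedom and makes no ordering assumption; the interval version spends one but assumes equal spacing between adjacent levels.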