Is there a limit to the number of independent variables that can be dummy coded for multiple regression as well as linear regression? In other words, can I only have 2 dummy-coded predictors in a model with no other continuous variables for MR and LR? Thanks.
Yes, there are limits...in fact you're discussing two separate cases.
(1) The first is that the number of IV's cannot exceed K-1, where K is the number of IVs.
(2) The number of IV's cannot exceed the sample size (N - 1).
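A minimal R sketch of case (2), using made-up data (the seed and variable names are arbitrary): with N = 3 observations there is only room for the intercept plus N - 1 = 2 slopes, so a third predictor is aliased.

Code:
set.seed(4)
# Illustration only: N = 3 observations, 3 candidate predictors
N <- 3
dat <- data.frame(y = rnorm(N), x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N))
# The intercept plus x1 and x2 already use all N degrees of freedom,
# so lm reports the aliased predictor (x3) as NA
coef(lm(y ~ x1 + x2 + x3, data = dat))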
I think he/she meant that the number of dummy variables cannot exceed k-1, where k is the number of levels of the categorical variable they were created from. There is no limit on the number of dummy variables tied to other specific IVs in the model (exceeding your degrees of freedom is extremely rare in practice in regression).
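To illustrate that point, a small R sketch with invented data: each categorical IV contributes its own k-1 dummy columns, so one model can hold several dummy-coded variables at once.

Code:
set.seed(6)
# Illustration only: a 3-level IV and a 2-level IV in the same model
dat <- data.frame(a = gl(3, 4), b = gl(2, 2, 12), y = rnorm(12))
# With an intercept: 2 dummy columns for a (3 levels) and 1 for b (2 levels)
model.matrix(~ a + b, data = dat)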
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Ah. Well then we need to add the caveat that you only need k-1 dummy variables when you have an intercept term in your model. If you don't, then you can create k dummy variables for a categorical variable with k levels.
That is fascinating. Every commentary I ever read stressed that you could only have k-1 - that otherwise you would have a perfect linear combination, which would keep the regression from running (something I have actually encountered in SAS). Indeed, I read that several times over the last few days while studying for comps.
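To see that perfect linear combination directly, here is a small R sketch with made-up data: with an intercept, the k dummy columns sum to the column of 1s, so the design matrix is rank deficient and R marks the aliased dummy as NA rather than refusing to run.

Code:
set.seed(1)
# Illustration only: a categorical variable with k = 3 levels
g <- gl(3, 3)
y <- rnorm(9)
d1 <- as.numeric(g == 1)  # all k = 3 dummies
d2 <- as.numeric(g == 2)
d3 <- as.numeric(g == 3)
# d1 + d2 + d3 equals the intercept column of 1s, so with an intercept
# the model is rank deficient; R reports NA for the aliased dummy
coef(lm(y ~ d1 + d2 + d3))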
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
That's because it's assuming you have an intercept in your model.
Can you simplify this again, one more time? Just the k-1 part. Perhaps with a hypothetical example.
Here is an example showing what the dummy coding would look like for 3 groups. In one model I include an intercept (and thus can only use 2 dummy variables), and in the other model I don't include an intercept (and thus have 3 dummy variables). I use R:
Code:
> set.seed(551)
> # Construct 3 groups and random y values
> n.group <- 3
> n.samp <- 3
> dat <- data.frame(group = gl(n.group, n.samp), y = rnorm(n.group*n.samp))
> dat
  group          y
1     1  0.8410229
2     1  0.2376842
3     1  0.6098874
4     2 -1.4915477
5     2 -0.4559226
6     2  0.7860064
7     3 -1.0857772
8     3 -1.2947018
9     3 -0.2002826
>
> # By default lm includes an intercept
> # So for group it will create 2 dummy variables for us
> # model.matrix constructs the matrix used in the regression
> model.matrix(~ group, data = dat)
  (Intercept) group2 group3
1           1      0      0
2           1      0      0
3           1      0      0
4           1      1      0
5           1      1      0
6           1      1      0
7           1      0      1
8           1      0      1
9           1      0      1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
>
> # This fits the linear model itself
> # using the model matrix above. So it has
> # an intercept and 2 dummy variables
> # The dummy variables here represent the difference between
> # group2 and group1, and the difference between group3 and group1
> o <- lm(y ~ group, data = dat)
> summary(o)

Call:
lm(formula = y ~ group, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-1.10439 -0.32518 -0.06877  0.27816  1.17316

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.5629     0.4385   1.284   0.2466
group2       -0.9500     0.6201  -1.532   0.1764
group3       -1.4231     0.6201  -2.295   0.0615 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7595 on 6 degrees of freedom
Multiple R-squared: 0.4766,     Adjusted R-squared: 0.3021
F-statistic: 2.732 on 2 and 6 DF,  p-value: 0.1434

> anova(o)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
group      2 3.1516 1.57581  2.7317 0.1434
Residuals  6 3.4612 0.57687
>
>
> # We can tell it to not include an intercept
> # Now we get three dummy variables
> # The dummy variables here just represent the mean
> # value for each group (cell means model)
> model.matrix(~ group - 1, data = dat)
  group1 group2 group3
1      1      0      0
2      1      0      0
3      1      0      0
4      0      1      0
5      0      1      0
6      0      1      0
7      0      0      1
8      0      0      1
9      0      0      1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
>
>
> o.cellmean <- lm(y ~ group - 1, data = dat)
> summary(o.cellmean)

Call:
lm(formula = y ~ group - 1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-1.10439 -0.32518 -0.06877  0.27816  1.17316

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
group1   0.5629     0.4385   1.284   0.2466
group2  -0.3872     0.4385  -0.883   0.4113
group3  -0.8603     0.4385  -1.962   0.0975 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7595 on 6 degrees of freedom
Multiple R-squared: 0.5112,     Adjusted R-squared: 0.2668
F-statistic: 2.092 on 3 and 6 DF,  p-value: 0.2027

> anova(o.cellmean)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
group      3 3.6202 1.20674  2.0919 0.2027
Residuals  6 3.4612 0.57687
>
> # Notice that the resulting p-values from the two
> # calls to "anova" are different even though I
> # claimed we can get the same results.
> # That is because the two tests being done
> # are testing different things. So...
> # if we want the "Omnibus" F-test (which is performed
> # in the first anova) we need
> # to construct it ourselves by creating the null model
> o.null <- lm(y ~ 1, data = dat)
>
> # The F-tests here should now match
> anova(o.null, o.cellmean)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ group - 1
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      8 6.6128
2      6 3.4612  2    3.1516 2.7317 0.1434
> anova(o)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
group      2 3.1516 1.57581  2.7317 0.1434
Residuals  6 3.4612 0.57687

And if you just want the code to reproduce this yourself:
Code:
set.seed(551)

# Construct 3 groups and random y values
n.group <- 3
n.samp <- 3
dat <- data.frame(group = gl(n.group, n.samp), y = rnorm(n.group*n.samp))
dat

# By default lm includes an intercept
# So for group it will create 2 dummy variables for us
# model.matrix constructs the matrix used in the regression
model.matrix(~ group, data = dat)

# This fits the linear model itself
# using the model matrix above. So it has
# an intercept and 2 dummy variables
# The dummy variables here represent the difference between
# group2 and group1, and the difference between group3 and group1
o <- lm(y ~ group, data = dat)
summary(o)
anova(o)

# We can tell it to not include an intercept
# Now we get three dummy variables
# The dummy variables here just represent the mean
# value for each group (cell means model)
model.matrix(~ group - 1, data = dat)
o.cellmean <- lm(y ~ group - 1, data = dat)
summary(o.cellmean)
anova(o.cellmean)

# Notice that the resulting p-values from the two
# calls to "anova" are different even though I
# claimed we can get the same results.
# That is because the two tests being done
# are testing different things. So...
# if we want the "Omnibus" F-test (which is performed
# in the first anova) we need
# to construct it ourselves by creating the null model
o.null <- lm(y ~ 1, data = dat)

# The F-tests here should now match
anova(o.null, o.cellmean)
anova(o)
A simple example: gender is your categorical variable. There are k = 2 levels (male and female). You create one (k-1 = 1) dummy variable from this, for instance "male", coded 1 if the subject is male and 0 if female.
Then the regression becomes:

Y = b0 (the intercept) + b1*Xmale (the dummy variable)
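A quick R sketch of that equation with fabricated data (the names are just for illustration): b0 comes out as the female group mean and b1 as the male minus female difference.

Code:
set.seed(2)
# Illustration only: 5 females and 5 males
gender <- rep(c("female", "male"), each = 5)
y <- rnorm(10, mean = ifelse(gender == "male", 2, 0))
Xmale <- as.numeric(gender == "male")  # the single (k-1 = 1) dummy
coef(lm(y ~ Xmale))      # b0 = female mean, b1 = male minus female mean
tapply(y, gender, mean)  # check against the raw group means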
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Yes, I get this part (thanks).
I was unsure about whether there is a limit on the number of categories a variable can have. I understand you can't have more categories than observations, and the related power issues, but the k-1 portion got a little cloudy.
Maybe you guys are talking about something else, but isn't having K dummy variables for a variable with K levels redundant? You can get away with having only K-1 dummy variables because setting all K-1 of them to zero can represent the Kth level.
But that only works if you have an intercept term in your model, which is what I've been saying.
If you have an intercept then you only need k-1 dummy variables to represent a categorical variable that has k levels.
If you don't have an intercept then you need all k dummy variables to represent a categorical variable with k levels.
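A tiny R sketch of that equivalence, with arbitrary made-up data: the two codings are the same model in different clothes, so the fitted values match exactly.

Code:
set.seed(3)
# Illustration only: a 3-level factor with 2 observations per level
g <- gl(3, 2)
y <- rnorm(6)
fit.int  <- lm(y ~ g)      # intercept + k-1 = 2 dummies
fit.cell <- lm(y ~ g - 1)  # no intercept, all k = 3 dummies
all.equal(fitted(fit.int), fitted(fit.cell))  # TRUE: same model, recoded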
I don't have emotions and sometimes that makes me very sad.
There is no statistical limit to how many categories a variable can have. There is a limit to how many dummies you can create from this variable if you use an intercept (which most of the time you will): one less than the number of levels of the categorical variable (k-1).
If you have a lot of levels to the categorical variable and they are ordered, at a certain point it makes sense to treat them as an interval variable. But that is not required.
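As a rough illustration of that last point, here is an R sketch with invented rating data: coded as a factor, the variable costs an intercept plus k-1 = 4 dummy coefficients, while treated as interval it costs a single slope.

Code:
set.seed(5)
# Illustration only: a 5-level ordered rating with a roughly linear effect
rating <- sample(1:5, 100, replace = TRUE)
y <- 0.5 * rating + rnorm(100)
coef(lm(y ~ factor(rating)))  # intercept plus k-1 = 4 dummy coefficients
coef(lm(y ~ rating))          # a single slope, treating rating as interval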
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Tweet |