
Thread: Dummy Variables

  1. #1
    Dummy Variables




    Is there a limit to the number of independent variables that can be dummy coded in multiple regression, as well as in simple linear regression? In other words, am I restricted to only two dummy-coded predictors in a model with no other continuous variables? Thanks.

  2. #2
    Dragan (Super Moderator)

    Re: Dummy Variables

    Yes, there are limits...in fact you're discussing two separate cases.

    (1) The first is that the number of IV's cannot exceed K-1, where K is the number of IVs.

    (2) The number of IV's cannot exceed N - 1, where N is the sample size.

  3. #3
    Dason (Devorador de queso)

    Re: Dummy Variables

    Quote Originally Posted by Dragan View Post
    (1) The first is that the number of IV's cannot exceed K-1, where K is the number of IVs.
    What now?
    .
    I don't have emotions and sometimes that makes me very sad.

  4. #4
    noetsi (Fortran must die)

    Re: Dummy Variables

    I think he/she meant that the number of dummy variables cannot exceed k-1, where k is the number of levels of the categorical variable they were created from. There is no limit on the number of dummy variables tied to other IVs in the model (exceeding your degrees of freedom is extremely rare in practice in regression).
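    As a minimal sketch of that k-1 rule (in Python with made-up data; the thread's own examples use R), treatment coding drops one reference level:

    ```python
    # Build k-1 treatment-coded dummy variables for a k-level categorical
    # variable, dropping the first (reference) level, as R's contr.treatment does.
    def dummy_code(values, levels):
        return [[1 if v == lev else 0 for lev in levels[1:]] for v in values]

    colors = ["red", "green", "blue", "green", "red"]
    rows = dummy_code(colors, levels=["red", "green", "blue"])
    # k = 3 levels -> each observation gets k - 1 = 2 dummy columns;
    # the reference level "red" is the all-zeros row.
    print(rows)  # [[0, 0], [1, 0], [0, 1], [1, 0], [0, 0]]
    ```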
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  5. #5
    Dason (Devorador de queso)

    Re: Dummy Variables

    Ah. Well then we need to add the caveat that you only need k-1 when you have an intercept term in your model. If you don't then you can create k dummy variables for a categorical variable with k levels.
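    Dason's caveat can be checked numerically. In this hypothetical Python/NumPy sketch, adding an intercept to all k indicator columns makes the design matrix rank deficient, while the k columns alone are full rank:

    ```python
    import numpy as np

    # Three groups of three observations each (k = 3 levels).
    group = np.repeat([0, 1, 2], 3)
    dummies = np.eye(3)[group]                  # all k indicator columns
    intercept = np.ones((len(group), 1))

    with_int = np.hstack([intercept, dummies])  # intercept + k dummies: 4 columns
    print(np.linalg.matrix_rank(with_int))      # 3 -- rank deficient, not 4
    print(np.linalg.matrix_rank(dummies))       # 3 -- full rank without intercept
    ```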

  6. The Following User Says Thank You to Dason For This Useful Post:

    noetsi (08-20-2012)

  7. #6
    noetsi (Fortran must die)

    Re: Dummy Variables

    Quote Originally Posted by Dason View Post
    Ah. Well then we need to add the caveat that you only need k-1 when you have an intercept term in your model. If you don't then you can create k dummy variables for a categorical variable with k levels.
    That is fascinating. Every commentary I ever read stressed that you could only have k-1; that otherwise you would have a perfect linear combination, which would make the regression not run (something I have actually encountered in SAS). Indeed, I read that several times in the last few days while studying for comps.
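    That "perfect linear combination" is easy to see directly: with an intercept and all k dummies, the dummy columns sum to the intercept column. A small hypothetical Python check:

    ```python
    # Three groups; build all k = 3 indicator columns.
    group = [1, 1, 1, 2, 2, 2, 3, 3, 3]
    d1 = [1 if g == 1 else 0 for g in group]
    d2 = [1 if g == 2 else 0 for g in group]
    d3 = [1 if g == 3 else 0 for g in group]
    intercept = [1] * len(group)

    # d1 + d2 + d3 reproduces the intercept column exactly, so a design
    # matrix containing all four columns is singular and the fit fails.
    print([a + b + c for a, b, c in zip(d1, d2, d3)] == intercept)  # True
    ```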

  8. #7
    Dason (Devorador de queso)

    Re: Dummy Variables

    That's because it's assuming you have an intercept in your model.

  9. #8
    hlsmith (Omega Contributor)

    Re: Dummy Variables

    Can you simplify this again, one more time? Just the k-1 part. Perhaps with a hypothetical example.

  10. #9
    Dason (Devorador de queso)

    Re: Dummy Variables

    Here is an example showing what the dummy coding would look like for 3 groups. In one model I include an intercept (and thus can only use 2 dummy variables), and in the other model I don't include an intercept (and thus have 3 dummy variables). I use R.

    Code: 
    > set.seed(551)
    > # Construct 3 groups and random y values
    > n.group <- 3
    > n.samp <- 3
    > dat <- data.frame(group = gl(n.group, n.samp), y = rnorm(n.group*n.samp))
    > dat
      group          y
    1     1  0.8410229
    2     1  0.2376842
    3     1  0.6098874
    4     2 -1.4915477
    5     2 -0.4559226
    6     2  0.7860064
    7     3 -1.0857772
    8     3 -1.2947018
    9     3 -0.2002826
    > 
    > # By default lm includes an intercept
    > # So for group it will create 2 dummy variables for us
    > # model.matrix constructs the matrix used in the regression
    > model.matrix(~ group, data = dat)
      (Intercept) group2 group3
    1           1      0      0
    2           1      0      0
    3           1      0      0
    4           1      1      0
    5           1      1      0
    6           1      1      0
    7           1      0      1
    8           1      0      1
    9           1      0      1
    attr(,"assign")
    [1] 0 1 1
    attr(,"contrasts")
    attr(,"contrasts")$group
    [1] "contr.treatment"
    
    > 
    > # This fits the linear model itself
    > # using the model matrix above.  So it has
    > # an intercept and 2 dummy variables
    > # The dummy variables here represent the difference between
    > # group2 and group1, and the difference between group3 and group1
    > o <- lm(y ~ group, data = dat)
    > summary(o)
    
    Call:
    lm(formula = y ~ group, data = dat)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -1.10439 -0.32518 -0.06877  0.27816  1.17316 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)   0.5629     0.4385   1.284   0.2466  
    group2       -0.9500     0.6201  -1.532   0.1764  
    group3       -1.4231     0.6201  -2.295   0.0615 .
    ---
    Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 
    
    Residual standard error: 0.7595 on 6 degrees of freedom
    Multiple R-squared: 0.4766,	Adjusted R-squared: 0.3021 
    F-statistic: 2.732 on 2 and 6 DF,  p-value: 0.1434 
    
    > anova(o)
    Analysis of Variance Table
    
    Response: y
              Df Sum Sq Mean Sq F value Pr(>F)
    group      2 3.1516 1.57581  2.7317 0.1434
    Residuals  6 3.4612 0.57687               
    > 
    > 
    > # We can tell it to not include an intercept
    > # Now we get three dummy variables
    > # The dummy variables here just represent the mean
    > # value for each group (cell means model)
    > model.matrix(~ group - 1, data = dat)
      group1 group2 group3
    1      1      0      0
    2      1      0      0
    3      1      0      0
    4      0      1      0
    5      0      1      0
    6      0      1      0
    7      0      0      1
    8      0      0      1
    9      0      0      1
    attr(,"assign")
    [1] 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$group
    [1] "contr.treatment"
    
    > 
    > 
    > o.cellmean <- lm(y ~ group -1, data = dat)
    > summary(o.cellmean)
    
    Call:
    lm(formula = y ~ group - 1, data = dat)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -1.10439 -0.32518 -0.06877  0.27816  1.17316 
    
    Coefficients:
           Estimate Std. Error t value Pr(>|t|)  
    group1   0.5629     0.4385   1.284   0.2466  
    group2  -0.3872     0.4385  -0.883   0.4113  
    group3  -0.8603     0.4385  -1.962   0.0975 .
    ---
    Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 
    
    Residual standard error: 0.7595 on 6 degrees of freedom
    Multiple R-squared: 0.5112,	Adjusted R-squared: 0.2668 
    F-statistic: 2.092 on 3 and 6 DF,  p-value: 0.2027 
    
    > anova(o.cellmean)
    Analysis of Variance Table
    
    Response: y
              Df Sum Sq Mean Sq F value Pr(>F)
    group      3 3.6202 1.20674  2.0919 0.2027
    Residuals  6 3.4612 0.57687               
    > 
    > # Notice that the resulting p-values from the two
    > # calls to "anova" are different even though I
    > # claimed we can get the same results.
    > # That is because the two tests being done
    > # are testing different things.  So...
    > # if we want the "Omnibus" F-test (which is performed
    > # in the first anova) we need
    > # to construct it ourselves by creating the null model
    > o.null <- lm(y ~ 1, data = dat)
    > 
    > # The F-tests here should now match
    > anova(o.null, o.cellmean)
    Analysis of Variance Table
    
    Model 1: y ~ 1
    Model 2: y ~ group - 1
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1      8 6.6128                           
    2      6 3.4612  2    3.1516 2.7317 0.1434
    > anova(o)
    Analysis of Variance Table
    
    Response: y
              Df Sum Sq Mean Sq F value Pr(>F)
    group      2 3.1516 1.57581  2.7317 0.1434
    Residuals  6 3.4612 0.57687
    And if you just want the code to reproduce this yourself:
    Code: 
    
    set.seed(551)
    # Construct 3 groups and random y values
    n.group <- 3
    n.samp <- 3
    dat <- data.frame(group = gl(n.group, n.samp), y = rnorm(n.group*n.samp))
    dat
    
    # By default lm includes an intercept
    # So for group it will create 2 dummy variables for us
    # model.matrix constructs the matrix used in the regression
    model.matrix(~ group, data = dat)
    
    # This fits the linear model itself
    # using the model matrix above.  So it has
    # an intercept and 2 dummy variables
    # The dummy variables here represent the difference between
    # group2 and group1, and the difference between group3 and group1
    o <- lm(y ~ group, data = dat)
    summary(o)
    anova(o)
    
    
    # We can tell it to not include an intercept
    # Now we get three dummy variables
    # The dummy variables here just represent the mean
    # value for each group (cell means model)
    model.matrix(~ group - 1, data = dat)
    
    
    o.cellmean <- lm(y ~ group -1, data = dat)
    summary(o.cellmean)
    anova(o.cellmean)
    
    # Notice that the resulting p-values from the two
    # calls to "anova" are different even though I
    # claimed we can get the same results.
    # That is because the two tests being done
    # are testing different things.  So...
    # if we want the "Omnibus" F-test (which is performed
    # in the first anova) we need
    # to construct it ourselves by creating the null model
    o.null <- lm(y ~ 1, data = dat)
    
    # The F-tests here should now match
    anova(o.null, o.cellmean)
    anova(o)

  11. #10
    noetsi (Fortran must die)

    Re: Dummy Variables

    A simple example: gender is your categorical variable, with k = 2 levels (male and female). You create one (k - 1 = 1) dummy variable from this; for instance "male", coded 1 if the subject is male and 0 if female.

    Then the regression becomes:

    Y = b0 (the intercept) + b1*X_male (the dummy variable)
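    With made-up numbers (hypothetical data, just to illustrate the interpretation), b0 comes out as the mean for the reference group (female) and b1 as the male-female difference in means:

    ```python
    # Hypothetical outcome values: three female subjects, three male subjects.
    y_female = [3.0, 4.0, 5.0]
    y_male = [6.0, 7.0, 8.0]

    # For Y = b0 + b1*X_male, least squares gives the group means:
    b0 = sum(y_female) / len(y_female)       # intercept = female mean
    b1 = sum(y_male) / len(y_male) - b0      # dummy coefficient = difference
    print(b0, b1)  # 4.0 3.0
    ```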

  12. #11
    Dason (Devorador de queso)

    Re: Dummy Variables

    And if you don't include an intercept, you could let
    X_{1i} = 0 if male, 1 if female
    X_{2i} = 1 if male, 0 if female.

    Then the regression could be:

    Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i
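    A quick NumPy check of this parameterization (hypothetical data): with no intercept and both indicators, the least-squares coefficients are just the two group means:

    ```python
    import numpy as np

    # Three female subjects (X1 = 1, X2 = 0) then three male (X1 = 0, X2 = 1).
    y = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    X = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)

    # No intercept: beta_1 and beta_2 are the female and male means directly.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)  # [4. 7.]
    ```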

  13. #12
    hlsmith (Omega Contributor)

    Re: Dummy Variables

    Yes, I get this part (thanks).

    I was unsure about the question of whether there is a limit on the number of categories a variable can have. I understand there can be no more categories than observations, and the related power issues, but the k-1 portion got a little cloudy.

  14. #13
    lancearmstrong1313

    Re: Dummy Variables

    Maybe you guys are talking about something else, but isn't having K dummy variables for a variable with K levels redundant? You can get away with only K-1 dummy variables, because setting all K-1 of them to zero can represent the Kth level.

  15. #14
    Dason (Devorador de queso)

    Re: Dummy Variables

    Quote Originally Posted by lancearmstrong1313 View Post
    Maybe you guys are talking about something else, but isn't having K dummy variables for a variable with K levels redundant? You can get away with only K-1 dummy variables, because setting all K-1 of them to zero can represent the Kth level.
    But that only works if you have an intercept term in your model. Which is what I've been saying.

    If you have an intercept then you only need k-1 dummy variables to represent a categorical variable that has k levels.
    If you don't have an intercept then you need all k dummy variables to represent a categorical variable with k levels.
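    The two parameterizations describe the same model space, which can be verified numerically (a Python/NumPy sketch with simulated data): both fits produce identical fitted values.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    group = np.repeat([0, 1, 2], 4)             # k = 3 levels, 4 obs each
    y = rng.normal(size=len(group))

    D = np.eye(3)[group]                        # all k indicator columns
    X_cell = D                                  # no intercept, k dummies
    X_int = np.hstack([np.ones((len(group), 1)), D[:, 1:]])  # intercept + k-1

    def fitted(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return X @ beta

    # Same column space -> same projection -> identical fitted values.
    print(np.allclose(fitted(X_cell, y), fitted(X_int, y)))  # True
    ```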

  16. #15
    noetsi (Fortran must die)

    Re: Dummy Variables


    Quote Originally Posted by hlsmith View Post
    Yes, I get this part (thanks).

    I was unsure about the question on if there was a limit on the number of categories for a variable. I understand no more categories than number of observations and related power issues, but the k-1 portion got a little cloudy.
    There is no statistical limit to how many categories a variable can have. There is a limit to how many dummies you can create from that variable if you use an intercept (which most of the time you will): one less than the number of levels of the categorical variable (k - 1).

    If the categorical variable has many levels and they are ordered, at a certain point it makes sense to treat it as an interval variable. But that is not required.
