Thread: Two categorical variables, one continuous dependent variable

  1. #1

    Hi,

    I want to check whether linear regression is the right statistical tool for my problem, and I'd like to know how to check whether my data are appropriate for it.

    I have a dataset with two categorical variables: person and procedure; and one continuous variable: time taken for that person to do the procedure.

    There are many measurements and some people have done the same procedure multiple times. Some procedures have only been done by one person but most have been done by most people at least once.

    What I am trying to see:
    - individuals who take significantly longer or significantly less time than everyone else
    - anything else that might be of interest.

    I've created some R code with dummy data and tried to run a linear regression model.

    Any help would be very much appreciated.

    Code: 
    dput(df)
    
    ## r output follows
    structure(list(Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), Procedure = structure(c(1L, 2L, 
    3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L
    ), .Label = c("proc1", "proc2", "proc3", "proc4", "proc5", "proc6", 
    "proc7", "proc8"), class = "factor"), Time = c(88L, 54L, 12L, 
    76L, 72L, 91L, 60L, 18L, 81L, 68L, 101L, 80L, 9L, 90L, 75L, 80L, 
    12L, 9L)), .Names = c("Person", "Procedure", "Time"), class = "data.frame", row.names = c(NA, 
    -18L))
    Code: 
    library(dplyr)
    reg <- df %>%
      lm(formula = Time ~ Person + Procedure)
    
    summary(reg)
    
    ## R output follows
    Call:
    lm(formula = Time ~ Person + Procedure, data = .)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -10.000  -2.133   0.000   1.667   9.333 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)      88.733      4.421  20.071 3.96e-08 ***
    PersonB           3.200      4.093   0.782 0.456841    
    PersonC          10.600      4.093   2.590 0.032126 *  
    Procedureproc2  -28.667      5.284  -5.425 0.000627 ***
    Procedureproc3  -80.333      5.284 -15.203 3.47e-07 ***
    Procedureproc4  -11.000      5.284  -2.082 0.070928 .  
    Procedureproc5  -21.667      5.284  -4.100 0.003436 ** 
    Procedureproc6  -19.333      7.838  -2.467 0.038909 *  
    Procedureproc7  -87.333      7.838 -11.143 3.76e-06 ***
    Procedureproc8  -90.333      7.838 -11.526 2.91e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 6.472 on 8 degrees of freedom
    Multiple R-squared:  0.9812,	Adjusted R-squared:  0.9601 
    F-statistic:  46.5 on 9 and 8 DF,  p-value: 5.9e-06
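
A note on reading the Person rows of this output: with R's default treatment contrasts, PersonB and PersonC are differences from person A, not from "everyone else". One possible way to get closer to the stated goal is to refit the same model with sum-to-zero contrasts on Person, so each person is compared with the unweighted average of the person effects. This is only a sketch on the dummy data; the names `fit_sum`, `dev` and `se_C` are made up for illustration.

```r
# Dummy data from the post, rebuilt compactly (same values as the dput output).
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Same additive model, but Person is coded with sum-to-zero contrasts.
# The coefficients Person1 and Person2 are then the deviations of A and B
# from the unweighted mean of the three person effects.
fit_sum <- lm(Time ~ Person + Procedure, data = df,
              contrasts = list(Person = "contr.sum"))
co <- coef(fit_sum)

# The deviation for the last level (C) is minus the sum of the others.
dev <- c(A = co[["Person1"]],
         B = co[["Person2"]],
         C = -(co[["Person1"]] + co[["Person2"]]))

# Standard error for C's deviation: Var(-(b1 + b2)) is the sum of all
# entries of the 2 x 2 covariance block of Person1 and Person2.
V    <- vcov(fit_sum)[c("Person1", "Person2"), c("Person1", "Person2")]
se_C <- sqrt(sum(V))

print(round(dev, 3))
print(summary(fit_sum)$coefficients[c("Person1", "Person2"), ])
print(c(estimate = dev[["C"]], se = se_C, t = dev[["C"]] / se_C))
```

The fitted values are identical to those of the original model; only the parameterisation of Person changes. Procedures done by a single person (proc6 to proc8 here) are fitted exactly and carry no information about differences between people.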

  2. #2
    kiton

    Could you please elaborate on the nature of your dependent variable? Its descriptive statistics would also help.

    Note, the type of analysis to be used is determined by the DV, not the IVs.
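
For reference, the descriptive statistics asked for here could be produced along these lines in base R. This is a minimal sketch using the dummy data from the first post; the object names `desc`, `coverage` and `people_per_proc` are illustrative only.

```r
# Dummy data from the first post (same values as the dput output).
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Overall five-number summary plus mean of the dependent variable.
print(summary(df$Time))

# Count, mean and standard deviation of Time within each procedure.
# sd() is NA for procedures with a single measurement.
by_proc <- split(df$Time, df$Procedure)
desc <- data.frame(
  n    = sapply(by_proc, length),
  mean = sapply(by_proc, mean),
  sd   = sapply(by_proc, sd)
)
print(desc)

# Cross-tabulation of who did what, and how many different people
# performed each procedure (1 means the procedure cannot be used to
# compare people).
coverage        <- table(df$Person, df$Procedure)
people_per_proc <- colSums(coverage > 0)
print(coverage)
print(people_per_proc)
```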

  3. #3

    Quote, originally posted by kiton:
        Could you please elaborate on the nature of your dependent variable. Its descriptive statistics would also help.

        Note, the type of analysis to be used is determined by the DV, not the IVs.

    The dependent variable is the time taken for workers to complete a procedure - there are lots of procedure types, lots of workers, and many, many measurements. I would expect the time taken for a worker to complete a procedure to be normally distributed for that procedure.
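
One way to probe that normality expectation is to look at the residuals of the additive model rather than at the raw times. The following is a rough sketch in base R on the dummy data from the first post, not a full diagnostic workflow. With this few observations the test has little power, and the three procedures measured only once are fitted exactly, so their residuals are zero by construction.

```r
# Dummy data from the first post (same values as the dput output).
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Additive model from the first post, fitted without the dplyr pipe.
fit <- lm(Time ~ Person + Procedure, data = df)
res <- residuals(fit)

# Shapiro-Wilk test on the residuals: a small p-value would suggest a
# departure from normality.
sw <- shapiro.test(res)
print(sw)

# Visual checks; only drawn in an interactive session.
if (interactive()) {
  qqnorm(res); qqline(res)        # normal Q-Q plot of residuals
  plot(fitted(fit), res,          # spread of residuals against fitted values
       xlab = "Fitted time", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```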

    Does that answer all your questions?

  4. #4
    kiton

    Considering that your DV is time, a count-type variable (i.e., minutes or hours: integers that cannot be below zero), I'd recommend you look into exponential models, such as Poisson or negative binomial regression. OLS estimates in such a case might be inconsistent. Note that robust standard errors should be used with such models, since real data rarely follow a Poisson or NegBin distribution exactly.
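
As a rough illustration of this suggestion, a Poisson model with a log link can be fitted with `glm()`, and heteroskedasticity-robust (sandwich, HC0-style) standard errors can be computed by hand in base R. This is a sketch on the dummy data from the first post, assuming Time is recorded in whole units; the names `pois`, `bread`, `meat` and `se_robust` are illustrative. In practice the `sandwich` package's `vcovHC()` is the usual route to robust standard errors.

```r
# Dummy data from the first post (same values as the dput output).
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Poisson regression with log link: E[Time] = exp(X %*% beta).
pois <- glm(Time ~ Person + Procedure,
            family = poisson(link = "log"), data = df)

X  <- model.matrix(pois)   # n x p design matrix
mu <- fitted(pois)         # fitted means
y  <- df$Time

# Sandwich estimator: bread %*% meat %*% bread.
# bread = (X' W X)^-1 with W = diag(mu), the model-based covariance.
# meat  = sum over observations of the outer product of the score
#         contributions x_i * (y_i - mu_i).
bread  <- solve(crossprod(X, X * mu))
scores <- X * (y - mu)
meat   <- crossprod(scores)
vc_robust <- bread %*% meat %*% bread
se_robust <- sqrt(diag(vc_robust))

# Coefficients are on the log scale; exp() gives multiplicative effects
# on the expected time.
print(cbind(estimate     = coef(pois),
            rate_ratio   = exp(coef(pois)),
            se_model     = sqrt(diag(vcov(pois))),
            se_robust    = se_robust))
```

For the negative binomial alternative, `glm.nb()` in the MASS package takes the same formula. With very few observations per cell, as in this dummy data, robust standard errors can be unreliable, so the numbers are only meant to show the mechanics.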

    Your topic also mentions categorical variables as DVs. In that case, if the number of categories is two (i.e., binary), look into logistic regression; if the number of categories is above two, then look into multinomial logistic regression.
    Last edited by kiton; 06-17-2016 at 12:31 PM.
