# Thread: Two categorical variables, one continuous dependent variable

1. ## Two categorical variables, one continuous dependent variable

Hi,

I want to check whether linear regression is the right statistical tool for this problem, and how to check whether my data are appropriate for it.

I have a dataset with two categorical variables: person and procedure; and one continuous variable: time taken for that person to do the procedure.

There are many measurements and some people have done the same procedure multiple times. Some procedures have only been done by one person but most have been done by most people at least once.

What I am trying to see:
- individuals who take significantly longer or significantly less time than everyone else
- anything else that might be of interest.

I've created some R code with dummy data and tried to run a linear regression model.

Any help would be very much appreciated.

Code:
```r
# Recreate the dummy data; the structure() call is the output of dput(df)
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), Procedure = structure(c(1L, 2L,
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L
), .Label = c("proc1", "proc2", "proc3", "proc4", "proc5", "proc6",
"proc7", "proc8"), class = "factor"), Time = c(88L, 54L, 12L,
76L, 72L, 91L, 60L, 18L, 81L, 68L, 101L, 80L, 9L, 90L, 75L, 80L,
12L, 9L)), .Names = c("Person", "Procedure", "Time"), class = "data.frame", row.names = c(NA,
-18L))
```
Code:
```r
library(dplyr)

# Fit Time on both categorical predictors; df is piped in as the data argument
reg <- df %>%
  lm(formula = Time ~ Person + Procedure)

summary(reg)
```

R output follows:

```
Call:
lm(formula = Time ~ Person + Procedure, data = .)

Residuals:
    Min      1Q  Median      3Q     Max
-10.000  -2.133   0.000   1.667   9.333

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      88.733      4.421  20.071 3.96e-08 ***
PersonB           3.200      4.093   0.782 0.456841
PersonC          10.600      4.093   2.590 0.032126 *
Procedureproc2  -28.667      5.284  -5.425 0.000627 ***
Procedureproc3  -80.333      5.284 -15.203 3.47e-07 ***
Procedureproc4  -11.000      5.284  -2.082 0.070928 .
Procedureproc5  -21.667      5.284  -4.100 0.003436 **
Procedureproc6  -19.333      7.838  -2.467 0.038909 *
Procedureproc7  -87.333      7.838 -11.143 3.76e-06 ***
Procedureproc8  -90.333      7.838 -11.526 2.91e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.472 on 8 degrees of freedom
Multiple R-squared:  0.9812,	Adjusted R-squared:  0.9601
F-statistic:  46.5 on 9 and 8 DF,  p-value: 5.9e-06
```
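
For the first goal in the post (spotting individuals who are slower or faster than everyone else), the default output compares persons B and C with person A only. A minimal sketch of two alternatives follows, assuming the same dummy data (rebuilt compactly here) and an additive model: a nested-model F test for the overall Person effect, and sum-to-zero contrasts so that each Person coefficient is a deviation from the average person effect. The object names (`fit_proc`, `fit_full`, `fit_sum`) are illustrative only.

```r
# Dummy data from the post, rebuilt compactly
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Does Person explain anything once Procedure is accounted for?
fit_proc <- lm(Time ~ Procedure, data = df)
fit_full <- lm(Time ~ Person + Procedure, data = df)
person_test <- anova(fit_proc, fit_full)  # nested-model F test
print(person_test)

# Sum-to-zero contrasts: Person1 and Person2 are the deviations of persons
# A and B from the unweighted average of the person effects; the deviation
# for C is minus the sum of the other two
fit_sum <- lm(Time ~ Person + Procedure, data = df,
              contrasts = list(Person = "contr.sum"))
print(coef(summary(fit_sum))[c("Person1", "Person2"), ])
```

Both fits describe the same model as the one above; only the parameterisation of Person changes, so the fitted values are identical.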

2. ## Re: Two categorical variables, one continuous dependent variable

Could you please elaborate on the nature of your dependent variable. Its descriptive statistics would also help.
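
One way to produce the descriptive statistics asked for here, as a minimal sketch using the dummy data from the first post (rebuilt compactly; the helper name `describe` is made up for illustration):

```r
# Dummy data from the first post, rebuilt compactly
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Overall distribution of the dependent variable
overall <- summary(df$Time)
print(overall)

# Count, mean and standard deviation within each group
describe <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))
by_person <- aggregate(Time ~ Person, data = df, FUN = describe)
by_proc   <- aggregate(Time ~ Procedure, data = df, FUN = describe)
print(by_person)
print(by_proc)  # sd is NA for procedures with a single measurement
```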

Note, the type of analysis to be used is determined by the DV, not the IVs.

3. ## Re: Two categorical variables, one continuous dependent variable

Originally Posted by kiton:

> Could you please elaborate on the nature of your dependent variable. Its descriptive statistics would also help.
>
> Note, the type of analysis to be used is determined by the DV, not the IVs.

The dependent variable is the time taken for workers to complete a procedure - there are lots of procedure types, lots of workers, and many, many measurements. I would expect the time taken for a worker to complete a procedure to be normally distributed for that procedure.
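
The expectation of normality within each procedure is a statement about the model residuals rather than about the raw times, so it can be checked on the fitted model. A rough sketch with the dummy data from the first post (rebuilt compactly), using a Shapiro-Wilk test and a normal Q-Q plot; with only 18 points this is illustrative and not conclusive:

```r
# Dummy data from the first post, rebuilt compactly
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

fit <- lm(Time ~ Person + Procedure, data = df)
res <- residuals(fit)

# Procedures measured only once (proc6 to proc8) are fitted exactly,
# so their residuals are zero and carry no information about normality
print(round(res, 3))

# Formal test and visual check of residual normality
sw <- shapiro.test(res)
print(sw)
qqnorm(res)
qqline(res)

# Residuals against fitted values, to look for non-constant variance
plot(fitted(fit), res, xlab = "Fitted time", ylab = "Residual")
abline(h = 0, lty = 2)
```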

4. ## Re: Two categorical variables, one continuous dependent variable

Considering your DV is time, a count-type variable (i.e., minutes or hours: integers that cannot be below zero), I'd recommend you look into exponential models, such as Poisson or negative binomial regression. OLS estimates in such a case might be inconsistent. Note that robust standard errors should be used with such models, as we rarely believe that the data exactly follow a Poisson or negative binomial distribution.
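
A minimal sketch of the Poisson suggestion, using the dummy data from the first post (rebuilt compactly). The robust standard errors are computed by hand as a basic HC0 sandwich estimator in base R; that choice is an assumption made here to keep the example self-contained, and a dedicated package could be used instead. The negative binomial variant is not shown.

```r
# Dummy data from the first post, rebuilt compactly
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Poisson regression with a log link
pois <- glm(Time ~ Person + Procedure, family = poisson(link = "log"),
            data = df)

# HC0 sandwich: bread = (X' diag(mu) X)^-1, meat = X' diag((y - mu)^2) X
X  <- model.matrix(pois)
mu <- fitted(pois)
u  <- df$Time - mu
bread <- solve(crossprod(X, X * mu))
meat  <- crossprod(X * u)
robust_vcov <- bread %*% meat %*% bread
robust_se   <- sqrt(diag(robust_vcov))

# exp(coefficient) is a multiplicative effect on the expected time
print(cbind(estimate   = coef(pois),
            model_se   = sqrt(diag(vcov(pois))),
            robust_se  = robust_se,
            rate_ratio = exp(coef(pois))))
```

With so few observations, and three procedures measured only once, the robust standard errors here are unreliable; the sketch only shows the mechanics.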

Also, your topic mentions categorical variables as DVs as well. In that case, if the number of categories is two (i.e., binary), look into logistic regression; if the number of categories is above two, then look into multinomial logistic regression.
