# Two categorical variables, one continuous dependent variable

#### SpoodleBeast

##### New Member
Hi,

I want to check whether linear regression is the right statistical tool for my problem, and how to check whether my data are appropriate for it.

I have a dataset with two categorical variables: person and procedure; and one continuous variable: time taken for that person to do the procedure.

There are many measurements and some people have done the same procedure multiple times. Some procedures have only been done by one person but most have been done by most people at least once.

What I am trying to see:
- individuals who take significantly longer or significantly less time than everyone else
- anything else that might be of interest.

I've created some R code with dummy data and tried to run a linear regression model.

Any help would be very much appreciated.

Code:
dput(df)

## R output follows
structure(list(Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), Procedure = structure(c(1L, 2L,
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L
), .Label = c("proc1", "proc2", "proc3", "proc4", "proc5", "proc6",
"proc7", "proc8"), class = "factor"), Time = c(88L, 54L, 12L,
76L, 72L, 91L, 60L, 18L, 81L, 68L, 101L, 80L, 9L, 90L, 75L, 80L,
12L, 9L)), .Names = c("Person", "Procedure", "Time"), class = "data.frame", row.names = c(NA,
-18L))
Code:
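library(dplyr)  # provides the %>% pipe

# The pipe passes df as the first unnamed argument; because formula is
# given by name, df ends up as lm()'s data argument.
reg <- df %>%
  lm(formula = Time ~ Person + Procedure)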

summary(reg)

## R output follows
Call:
lm(formula = Time ~ Person + Procedure, data = .)

Residuals:
    Min      1Q  Median      3Q     Max
-10.000  -2.133   0.000   1.667   9.333

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      88.733      4.421  20.071 3.96e-08 ***
PersonB           3.200      4.093   0.782 0.456841
PersonC          10.600      4.093   2.590 0.032126 *
Procedureproc2  -28.667      5.284  -5.425 0.000627 ***
Procedureproc3  -80.333      5.284 -15.203 3.47e-07 ***
Procedureproc4  -11.000      5.284  -2.082 0.070928 .
Procedureproc5  -21.667      5.284  -4.100 0.003436 **
Procedureproc6  -19.333      7.838  -2.467 0.038909 *
Procedureproc7  -87.333      7.838 -11.143 3.76e-06 ***
Procedureproc8  -90.333      7.838 -11.526 2.91e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.472 on 8 degrees of freedom
Multiple R-squared:  0.9812,	Adjusted R-squared:  0.9601
F-statistic:  46.5 on 9 and 8 DF,  p-value: 5.9e-06
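
One way to check whether the data suit this model is to look at the residuals and the leverage of each observation. The following is a minimal sketch in base R, assuming the dummy `df` from the `dput()` output above (rebuilt here so the block runs on its own). The names `res`, `lev` and `sw` are illustrative only.

```r
# Rebuild the dummy data from the dput() output above
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

reg <- lm(Time ~ Person + Procedure, data = df)

# Residuals: the model assumes these are roughly normal with constant spread
res <- residuals(reg)
sw  <- shapiro.test(res)   # Shapiro-Wilk normality test on the residuals
print(sw)

# Leverage: values close to 1 mean the observation is fitted almost exactly
lev <- hatvalues(reg)
print(round(cbind(df, residual = res, leverage = lev), 3))

# plot(reg) would draw the standard diagnostic plots (residuals vs fitted,
# normal Q-Q, scale-location, residuals vs leverage)
```

In the dummy data, proc6, proc7 and proc8 were each done once, by person C only. Each of those rows gets its own procedure coefficient, so it has leverage 1 and a residual of 0, and tells the model nothing about how C compares with A and B.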

#### kiton

##### New Member
Could you please elaborate on the nature of your dependent variable? Its descriptive statistics would also help.

Note that the type of analysis to use is determined by the dependent variable (DV), not the independent variables (IVs).
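
As a rough sketch of the kind of summary that would help, assuming the dummy `df` posted above (rebuilt here so the block is self-contained), base R can give the overall distribution of `Time`, its mean and spread per procedure, and how many times each person did each procedure. The names `proc_mean`, `proc_sd` and `counts` are illustrative only.

```r
# Rebuild the dummy data from the dput() output in the first post
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Overall distribution of the dependent variable
print(summary(df$Time))

# Mean and standard deviation of Time within each procedure
# (sd() is NA for a procedure with a single measurement)
proc_mean <- tapply(df$Time, df$Procedure, mean)
proc_sd   <- tapply(df$Time, df$Procedure, sd)
print(round(cbind(mean = proc_mean, sd = proc_sd), 2))

# How many measurements exist for each person/procedure combination
counts <- table(df$Person, df$Procedure)
print(counts)
```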

#### SpoodleBeast

##### New Member
The dependent variable is the time taken for workers to complete a procedure. There are lots of procedure types, lots of workers, and many, many measurements. I would expect the time a worker takes to complete a procedure to be normally distributed within that procedure.
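
If times are expected to vary around a procedure-specific mean, one simple way to look for people who run long or short is to subtract each procedure's mean from its times and then average those deviations per person. This is only a descriptive sketch on the dummy data from the first post, not a significance test. The column `Dev` and the name `person_dev` are made up for illustration.

```r
# Rebuild the dummy data from the dput() output in the first post
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Deviation of each time from the mean of its own procedure;
# ave() returns the group mean repeated for every row of the group
df$Dev <- df$Time - ave(df$Time, df$Procedure)

# Average deviation per person: positive means slower than the
# procedure average, negative means faster
person_dev <- tapply(df$Dev, df$Person, mean)
print(round(person_dev, 2))
```

Procedures done by only one person always have a deviation of 0, so they pull that person's average toward zero. With real data it may be worth dropping them before comparing people.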