Hi,
I want to check whether linear regression is the statistical tool I need and ways of checking whether my data is appropriate for this statistical test.
I have a dataset with two categorical variables: person and procedure; and one continuous variable: time taken for that person to do the procedure.
There are many measurements and some people have done the same procedure multiple times. Some procedures have only been done by one person but most have been done by most people at least once.
What am I trying to see:
- individuals who take significantly longer or significantly less time than everyone else
- anything else that might be of interest.
I've created some R code with dummy data and tried to run a linear regression model.
Any help would be very much appreciated.
Code:
dput(df)
## r output follows
structure(list(
  Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
                       3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
                     .Label = c("A", "B", "C"), class = "factor"),
  Procedure = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L,
                          1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L),
                        .Label = c("proc1", "proc2", "proc3", "proc4",
                                   "proc5", "proc6", "proc7", "proc8"),
                        class = "factor"),
  Time = c(88L, 54L, 12L, 76L, 72L, 91L, 60L, 18L, 81L, 68L,
           101L, 80L, 9L, 90L, 75L, 80L, 12L, 9L)),
  .Names = c("Person", "Procedure", "Time"),
  class = "data.frame", row.names = c(NA, -18L))
Code:
library(dplyr)
reg <- df %>% lm(formula = Time ~ Person + Procedure)
summary(reg)
## R output follows
Call:
lm(formula = Time ~ Person + Procedure, data = .)

Residuals:
    Min      1Q  Median      3Q     Max
-10.000  -2.133   0.000   1.667   9.333

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      88.733      4.421  20.071 3.96e-08 ***
PersonB           3.200      4.093   0.782 0.456841
PersonC          10.600      4.093   2.590 0.032126 *
Procedureproc2  -28.667      5.284  -5.425 0.000627 ***
Procedureproc3  -80.333      5.284 -15.203 3.47e-07 ***
Procedureproc4  -11.000      5.284  -2.082 0.070928 .
Procedureproc5  -21.667      5.284  -4.100 0.003436 **
Procedureproc6  -19.333      7.838  -2.467 0.038909 *
Procedureproc7  -87.333      7.838 -11.143 3.76e-06 ***
Procedureproc8  -90.333      7.838 -11.526 2.91e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.472 on 8 degrees of freedom
Multiple R-squared:  0.9812,    Adjusted R-squared:  0.9601
F-statistic: 46.5 on 9 and 8 DF,  p-value: 5.9e-06
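One thing worth noting about the fit above: with R's default treatment contrasts, PersonB and PersonC are compared with Person A, not with "everyone else". A minimal sketch of one alternative, assuming the same dummy data, is to use sum-to-zero contrasts for Person, so each person coefficient is a deviation from the unweighted average of the person effects. The names fit_sum and person_dev are only illustrative.

```r
# Rebuild the dummy data from the dput() output above
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Sum-to-zero contrasts for Person: the coefficients Person1 and Person2
# are the deviations of A and B from the average person effect
fit_sum <- lm(Time ~ Person + Procedure, data = df,
              contrasts = list(Person = contr.sum))

# The last level (C) gets no coefficient of its own; its deviation is
# minus the sum of the other two
dev_AB     <- coef(fit_sum)[c("Person1", "Person2")]
person_dev <- c(A = unname(dev_AB[1]),
                B = unname(dev_AB[2]),
                C = -sum(dev_AB))
print(person_dev)

# Tests for Person1 and Person2 against the average person effect
print(coef(summary(fit_sum))[c("Person1", "Person2"), ])
```

The fitted values are identical to the original model; only the meaning of the Person coefficients changes.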
Could you please elaborate on the nature of your dependent variable? Its descriptive statistics would also help.
Note, the type of analysis to be used is determined by the DV, not the IVs.
The dependent variable is the time taken for workers to complete a procedure - there are lots of procedure types, lots of workers, and many, many measurements. I would expect the time taken for a worker to complete a procedure to be normally distributed for that procedure.
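The normality expectation can be checked on the residuals of the linear model rather than on the raw times. A rough sketch, assuming the dummy data posted earlier (with so few observations the test has little power, so treat the result as a hint only):

```r
# Rebuild the dummy data from the dput() output
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

fit <- lm(Time ~ Person + Procedure, data = df)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests the residuals
# depart from a normal distribution
sw <- shapiro.test(res)
print(sw)

# Descriptive statistics of the DV, overall and per procedure
print(summary(df$Time))
print(aggregate(Time ~ Procedure, data = df, FUN = mean))
```

For a visual check, qqnorm(res) followed by qqline(res) draws a normal Q-Q plot of the same residuals.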
Does that answer all your questions?
Since your DV is time, which behaves like a count (minutes or hours recorded as integers that cannot fall below zero), I'd recommend looking into exponential (log-link) models such as Poisson or negative binomial regression. OLS estimates may be inconsistent in such a case. Note that robust standard errors should be used with these models, since we rarely believe the data exactly follow a Poisson or NegBin distribution.
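As a rough illustration of that suggestion, here is a Poisson fit on the dummy data from the first post, with heteroskedasticity-robust (HC0 sandwich) standard errors computed by hand in base R. This is only a sketch under the assumption that Time is recorded in whole units; in practice the sandwich package's vcovHC() is the more usual route, and the names bread, meat, and robust_se are just illustrative.

```r
# Rebuild the dummy data from the dput() output
df <- data.frame(
  Person    = factor(rep(c("A", "B", "C"), times = c(5, 5, 8))),
  Procedure = factor(paste0("proc", c(1:5, 1:5, 1:8))),
  Time      = c(88, 54, 12, 76, 72, 91, 60, 18, 81, 68,
                101, 80, 9, 90, 75, 80, 12, 9)
)

# Poisson regression with the default log link
pois <- glm(Time ~ Person + Procedure, family = poisson(link = "log"),
            data = df)

# HC0 sandwich estimator: bread %*% meat %*% bread
X  <- model.matrix(pois)
mu <- fitted(pois)
bread <- solve(t(X) %*% (X * mu))              # (X' diag(mu) X)^-1
meat  <- t(X) %*% (X * (df$Time - mu)^2)       # X' diag((y - mu)^2) X
robust_vcov <- bread %*% meat %*% bread
robust_se   <- sqrt(diag(robust_vcov))

# Coefficients are on the log scale; exp() turns them into rate ratios
print(cbind(estimate   = coef(pois),
            rate_ratio = exp(coef(pois)),
            robust_se  = robust_se))
```

A negative binomial fit would follow the same pattern with MASS::glm.nb() in place of glm().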
Also, your topic mentions categorical variables as DVs as well. In that case, if the number of categories is two (i.e., binary), look into logistic regression; if the number of categories is above two, then look into multinomial logistic regression.
Last edited by kiton; 06-17-2016 at 12:31 PM.