testing rates through time

shanehall.m
10-11-2011, 09:31 AM
Hello all, this seems like it should be a simple question, but our team cannot find an answer to it. Suppose I have data set up like the following (the numbers are the percentage that react positively to the drug):

Drug 2007 2008 2009 2010
A 59 62 61 62
B 50 49 58 61
C 67 70 69 83

and my goal is to see whether there is a significant difference in the observed trends in rates over time. How can I test this? Unfortunately, in my line of work this is all we get; we cannot get the sample sizes or any other information. Any help would be greatly appreciated, as I have run into this problem many times.

Thank you so much!

trinker
10-11-2011, 11:01 AM
Without sample sizes or SDs, I think you may have to rely on simple mean differences.

d21e7x11
10-11-2011, 01:45 PM
shanehall.m, you may want to consider first applying the arcsine transformation to the percentages. Then you can carry out the repeated measurements analysis on the transformed outcome.

shanehall.m
10-11-2011, 02:12 PM
So there is no sort of time series analysis? We can construct a line graph with both the different drugs having different lines, but we don't know of a way to compare them. So you're saying to do an arcsine transformation on the percentages and then run a repeated measures analysis. That will negate the trend effect, won't it? Are you suggesting the arcsine transformation as a more precise way of treating the percentages as true continuous numbers?

shanehall.m
10-11-2011, 02:17 PM
Slide 9 of this powerpoint is almost identical to my question. In this powerpoint, I'd be trying to compare the age groups. If I had no other numbers other than the rates/proportions given throughout the years, how would I be able to find a difference in the trends of the age groups. Thank you all for your help.
http://www.hsph.harvard.edu/means-matter/files/SuicideTrends.ppt

10-11-2011, 02:22 PM
One method is to just throw the numbers into a linear regression model:

Percentage ~ DrugA*time + DrugB*time + DrugC*time

This will tell you whether there is a significant increasing linear trend in the percentages over time. A problem with this, though, is that you have so few observations. A way around this could be to assume that all the drugs have the same growth over time.

PS. If you are planning to present this professionally, I strongly recommend bringing someone onto your team who knows what they're doing.

Dason
10-11-2011, 02:45 PM
I did a quick little thing to see, if we assume that there is a linear trend, whether we have any evidence that the slope is different for the different drugs. We end up concluding we don't have enough evidence. Once again, though, if we actually had sample sizes we could do quite a bit more with the data.

dat <- data.frame(drug = rep(c("A","B","C"), each = 4),
                  year = rep(2007:2010, 3),
                  vals = c(59,62,61,62,50,49,58,61,67,70,69,83))
# I'd rather work with smaller numbers for the predictor
dat$time = dat$year - 2007
# Apply the arcsin-squareroot transformation
dat$transvals <- asin(sqrt(dat$vals/100))

# Plotting to see what the data looks like
library(ggplot2)
# Plot of actual data
qplot(time, vals, colour = drug, data = dat, geom = "line")
# Plot of transformed data
qplot(time, transvals, colour = drug, data = dat, geom = "line")

# Fit a line for each drug (actual data)
o.full <- lm(vals ~ drug + time + drug:time, data = dat)
# Fit a line for each drug (transformed data)
o.trans <- lm(transvals ~ drug + time + drug:time, data = dat)

# Check the interaction term to see if there is a "significant"
# difference
anova(o.full) # Interaction isn't significant
anova(o.trans) # Interaction isn't significant

# Not entirely sure the transformation is completely appropriate
# since the point is to stabilize the variance but it partially
# depends on sample size which we don't know. So if the sample
# sizes are approximately equal it doesn't matter. But then again
# all the observations are in a relatively small range anyways so
# it doesn't really matter... and that's probably why we don't
# see any big changes in the analysis.

And the code along with the output

> dat <- data.frame(drug = rep(c("A","B","C"), each = 4),
+                   year = rep(2007:2010, 3),
+                   vals = c(59,62,61,62,50,49,58,61,67,70,69,83))
> # I'd rather work with smaller numbers for the predictor
> dat$time = dat$year - 2007
> # Apply the arcsin-squareroot transformation
> dat$transvals <- asin(sqrt(dat$vals/100))
>
> # Plotting to see what the data looks like
> library(ggplot2)
> # Plot of actual data
> qplot(time, vals, colour = drug, data = dat, geom = "line")
> # Plot of transformed data
> qplot(time, transvals, colour = drug, data = dat, geom = "line")
>
> # Fit a line for each drug (actual data)
> o.full <- lm(vals ~ drug + time + drug:time, data = dat)
> # Fit a line for each drug (transformed data)
> o.trans <- lm(transvals ~ drug + time + drug:time, data = dat)
>
> # Check the interaction term to see if there is a "significant"
> # difference
> anova(o.full) # Interaction isn't significant
Analysis of Variance Table

Response: vals
          Df Sum Sq Mean Sq F value    Pr(>F)
drug       2 645.17  322.58 28.5052 0.0008634 ***
time       1 156.82  156.82 13.8571 0.0098231 **
drug:time  2  45.03   22.52  1.9897 0.2173416
Residuals  6  67.90   11.32
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(o.trans) # Interaction isn't significant
Analysis of Variance Table

Response: transvals
          Df   Sum Sq  Mean Sq F value   Pr(>F)
drug       2 0.073098 0.036549 24.8271 0.001253 **
time       1 0.018545 0.018545 12.5975 0.012082 *
drug:time  2 0.005863 0.002932  1.9914 0.217123
Residuals  6 0.008833 0.001472
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> # Not entirely sure the transformation is completely appropriate
> # since the point is to stabilize the variance but it partially
> # depends on sample size which we don't know. So if the sample
> # sizes are approximately equal it doesn't matter. But then again
> # all the observations are in a relatively small range anyways so
> # it doesn't really matter... and that's probably why we don't
> # see any big changes in the analysis.

d21e7x11
10-11-2011, 04:05 PM
"We can construct a line graph with both the different drugs having different lines, but we don't know of a way to compare them."
Actually, repeated measurements analysis would apply if outcomes at different time points were obtained from the same "entity". If you can assume independence then you could fit regression lines within each group and obtain estimates of the slope and their standard errors. Then I think it will be possible to test if there is a difference between the slopes since you'll have an estimate and a standard error.
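The per-group approach described above can be sketched in R as follows. This is only an illustration under the independence assumption: the helper names `slope_fit` and `compare_slopes` are made up for this example, and the Welch-Satterthwaite degrees of freedom are one possible choice of reference distribution, not something stated in the thread. With only 4 points per drug, each line has just 2 residual degrees of freedom, so the standard errors are very imprecise.

```r
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))

# Fit a separate line for one drug; return slope, its SE, and residual df
slope_fit <- function(d) {
  fit <- lm(vals ~ time, data = d)
  est <- coef(summary(fit))["time", c("Estimate", "Std. Error")]
  c(slope = unname(est[1]), se = unname(est[2]), df = df.residual(fit))
}

# One row per drug
fits <- t(sapply(split(dat, dat$drug), slope_fit))

# Compare two slopes using estimate and standard error only
# (hypothetical helper; approximate df via Welch-Satterthwaite)
compare_slopes <- function(a, b) {
  diff <- fits[a, "slope"] - fits[b, "slope"]
  se   <- sqrt(fits[a, "se"]^2 + fits[b, "se"]^2)
  df   <- se^4 / (fits[a, "se"]^4 / fits[a, "df"] + fits[b, "se"]^4 / fits[b, "df"])
  tval <- diff / se
  c(diff = unname(diff), se = unname(se), t = unname(tval),
    df = unname(df), p = unname(2 * pt(-abs(tval), df)))
}

print(fits)
print(compare_slopes("A", "C"))
```

The fitted slopes are 0.8, 4.2 and 4.7 percentage points per year for drugs A, B and C.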

"Are you suggesting the arcsine transformation as a more precise way of treating the percentages as true continuous numbers?"
Yes, that's right.

d21e7x11
10-11-2011, 04:25 PM
In SAS:

data test;
input Drug $ t2007 t2008 t2009 t2010;
datalines;
A 59 62 61 62
B 50 49 58 61
C 67 70 69 83
;
run;

data test7; set test(keep=drug t2007 rename=(t2007=outc)); time=2007;
data test8; set test(keep=drug t2008 rename=(t2008=outc)); time=2008;
data test9; set test(keep=drug t2009 rename=(t2009=outc)); time=2009;
data test10; set test(keep=drug t2010 rename=(t2010=outc)); time=2010;

data test2; set test7 test8 test9 test10;
troutc=arsin(sqrt(outc*0.01));
run;

proc glm data=test2; /*untransformed outcome, time categorical*/
class drug time;
model outc=time drug;
run;

proc glm data=test2; /*transformed outcome, time categorical*/
class drug time;
model troutc=time drug;
run;

Output (ANOVA table), untransformed outcome, time categorical:

Source    DF    Type III SS    Mean Square    F Value    Pr > F

time       3    172.2500000     57.4166667       3.53    0.0881
Drug       2    645.1666667    322.5833333      19.85    0.0023

Output (ANOVA table), transformed outcome, time categorical:

Source    DF    Type III SS    Mean Square    F Value    Pr > F

time       3     0.02074889     0.00691630       3.32    0.0983
Drug       2     0.07309807     0.03654903      17.55    0.0031

Dason, I don't think we can test the time*drug interaction with these data. We have a single observation in each time*drug cell, so there is simply no error term left to test the interaction against.

Dason
10-11-2011, 04:26 PM
If we look for a linear trend and treat time as continuous then we can look for an interaction. It's not ideal but it's probably the best we could do with this dataset.
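A small sketch of why this matters, using the same data as above: with time as a factor the interaction model uses one parameter per cell and leaves no residual degrees of freedom, while the linear-trend version leaves 6. The object names `sat` and `lin` are just labels for this example.

```r
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))

# Time as categorical: 3 drugs x 4 years = 12 parameters for 12 observations
sat <- lm(vals ~ drug * factor(time), data = dat)

# Time as continuous: 3 intercepts + 3 slopes = 6 parameters
lin <- lm(vals ~ drug * time, data = dat)

df.residual(sat)  # 0, so there is no error term to test against
df.residual(lin)  # 6
```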

d21e7x11
10-11-2011, 04:32 PM
Yes, that's true. I'm not proficient enough in R so I couldn't tell from your code if time was continuous or categorical.

Here is what I got with time continuous, time*drug interaction included:

SAS code with time continuous:

proc glm data=test2; /*untransformed outcome, time continuous*/
class drug;
model outc=time drug time*drug;
run;

proc glm data=test2; /*transformed outcome, time continuous*/
class drug;
model troutc=time drug time*drug;
run;

Output - untransformed outcome:

Source       DF    Type III SS    Mean Square    F Value    Pr > F

time          1    156.8166667    156.8166667      13.86    0.0098
Drug          2     44.9826521     22.4913261       1.99    0.2176
time*Drug     2     45.0333333     22.5166666       1.99    0.2173

Output - transformed outcome:

Source       DF    Type III SS    Mean Square    F Value    Pr > F

time          1     0.01854527     0.01854527      12.60    0.0121
Drug          2     0.00585338     0.00292669       1.99    0.2176
time*Drug     2     0.00586314     0.00293157       1.99    0.2171

10-12-2011, 02:04 AM
In case you guys are interested, I became more curious and set up a model assuming the same chronological growth in all three drugs (editing Dason's coding):

> o.full <- lm(vals ~ drug + time, data = dat)

> summary(o.full)

Call:
lm(formula = vals ~ drug + time, data = dat)

Residuals:
   Min     1Q Median     3Q    Max
-4.867 -2.175 -0.025  2.067  5.900

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  56.1500     2.3763  23.629 1.09e-08 ***
drugB        -6.5000     2.6568  -2.447  0.04015 *
drugC        11.2500     2.6568   4.234  0.00286 **
time          3.2333     0.9701   3.333  0.01034 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.757 on 8 degrees of freedom
Multiple R-squared: 0.8766, Adjusted R-squared: 0.8303
F-statistic: 18.94 on 3 and 8 DF, p-value: 0.0005423

Looks like there may be enough evidence to say that there IS an overall growth over time. I still feel like we have too few observations though.
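The common-slope model is nested inside the separate-slopes model from earlier in the thread, so the two can be compared directly with an F test. A minimal sketch (the object names `common`, `separate` and `cmp` are just labels for this example); it reproduces the drug:time line of the earlier ANOVA table, so it adds no new evidence, it only shows the comparison explicitly:

```r
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))

# Same slope for every drug, different intercepts
common <- lm(vals ~ drug + time, data = dat)

# A separate slope for every drug
separate <- lm(vals ~ drug + time + drug:time, data = dat)

# F test of "all slopes equal" against "slopes differ"
cmp <- anova(common, separate)
print(cmp)
```

With 12 observations this test has very little power, so failing to reject equal slopes should not be read as evidence that the trends are the same.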

shanehall.m
10-12-2011, 10:56 AM
d21e7x11, in your SAS code, should that be model troutc instead of model outc, or does the step before that mean that any time data=test2 is used, the transformed data is used? Everyone, thanks for your help! So what I am getting out of this is to apply the arcsine transformation to the proportions, and then fit a line for each drug as well as an ANOVA model to make sure there is no interaction. But I guess the main point is that if we get this type of data, there needs to be more than 4 years of data.

Dason
10-12-2011, 11:00 AM
Do you have any idea if the sample sizes for each data point are similar? Or even what the relative size of the sample sizes is?

shanehall.m
10-12-2011, 12:21 PM
Drug A would have a sample size > 200,000
Drug B would have a sample size > 7,000
Drug C would have a sample size > 600

That is as accurate as I can get.

d21e7x11
10-12-2011, 12:26 PM
shanehall.m, I added to the code/comments/output in my posts - sorry for the confusion.
Note that the interaction can only be tested in the models where time is continuous.

shanehall.m
10-12-2011, 12:56 PM
So from what I understand, first I must apply the arcsine transformation to the proportions. Then fit each drug to a regression line and run an anova model testing to see if the drugs are different. I do not know R code too well... Is year a repeated measure variable?

shanehall.m
10-13-2011, 09:03 AM
I see now... thank you... In our case we probably want the transformed data and time as continuous. So to see if there was a difference in the trends of the drugs, we would be looking at the interaction, correct? What that output is saying is that there is no difference in the trends or in the drugs, but there is a difference in the rates over the years. We probably need more than 4 years of data to get reliable estimates for the terms, especially interaction and time. Is my thinking correct?

shanehall.m
10-13-2011, 09:04 AM
Drug A would have a sample size > 200,000
Drug B would have a sample size > 7,000
Drug C would have a sample size > 600

That is as accurate as I can get.
Does this change anything?
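One way such sample sizes could enter the analysis is through weights. The sketch below is only an illustration and rests on loud assumptions: it treats the lower bounds above as if they were the actual per-year sample sizes (`n_assumed` is hypothetical), and it treats each percentage as a binomial proportion, for which the arcsine-square-root value has approximate variance 1/(4n).

```r
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))
dat$transvals <- asin(sqrt(dat$vals / 100))

# ASSUMPTION: lower bounds used as stand-ins for the real sample sizes
n_assumed <- c(A = 200000, B = 7000, C = 600)
dat$n <- unname(n_assumed[as.character(dat$drug)])

# Approximate standard error of each transformed value: 1 / (2 * sqrt(n))
approx_se <- 1 / (2 * sqrt(n_assumed))
print(round(approx_se, 5))

# Weighted fit: weights proportional to n (inverse of the approximate variance)
w_fit <- lm(transvals ~ drug * time, data = dat, weights = n)
u_fit <- lm(transvals ~ drug * time, data = dat)
print(anova(w_fit))
```

Because the weights are constant within each drug and each drug has its own intercept and slope, the fitted lines are the same as in the unweighted model; what changes is how much each drug's scatter contributes to the pooled error term, and therefore the tests. Under these assumptions drug C's points are far noisier than drug A's (approximate SE of about 0.0204 versus 0.0011 on the transformed scale).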