testing rates through time

#1
Hello all, this seems like it should be a simple question, but our team cannot find an answer to it. If I have data set up like the following (the numbers are the percentage that react positively to the drug):
Code:
Drug 2007 2008 2009 2010
   A   59   62   61   62
   B   50   49   58   61
   C   67   70   69   83
and my goal is to see whether there is a significant difference in the observed trends in rates over time, how can I test this? Unfortunately, in my line of work, this is all we get; we cannot get the sample sizes or any other information. Any help would be greatly appreciated, as I have run into this problem many times.

Thank you so much!
 
#3
shanehall.m, you may want to consider first applying the arcsine transformation to the percentages. Then you can carry out a repeated measurements analysis on the transformed outcome.
 
#4
So there is no sort of time series analysis? We can construct a line graph with both the different drugs having different lines, but we don't know of a way to compare them. So you're saying to do an arcsine transformation on the percentages and then run a repeated measures analysis. That will negate the trend effect, won't it? Are you suggesting the arcsine transformation as a more precise way of treating the percentages as true continuous numbers?
 

Link

Ninja say what!?!
#6
One method is to just throw the numbers into a linear regression model:

Percentage ~ DrugA*time + DrugB*time + DrugC*time

This will tell you whether there is a significant increasing linear trend in the percentages over time. A problem with this, though, is that you have so few observations. A way around this could be to assume that all the drugs have the same growth over time.

PS. If you are planning to present this professionally, I strongly recommend bringing someone onto your team who knows what they're doing.
 

Dason

Ambassador to the humans
#7
I did a quick little analysis to see whether, assuming there is a linear trend, we have any evidence that the slope differs between the drugs. We end up concluding that we don't have enough evidence. Once again, though, if we actually had sample sizes we could do quite a bit more with the data.
Code:
dat <- data.frame(drug = rep(c("A","B","C"), each = 4),
                  year = rep(2007:2010, 3),
                  vals = c(59,62,61,62,50,49,58,61,67,70,69,83))
# I'd rather work with smaller numbers for the predictor
dat$time = dat$year - 2007
# Apply the arcsin-squareroot transformation
dat$transvals <- asin(sqrt(dat$vals/100))

# Plotting to see what the data looks like
library(ggplot2)
# Plot of actual data
qplot(time, vals, colour = drug, data = dat, geom = "line")
# Plot of transformed data
qplot(time, transvals, colour = drug, data = dat, geom = "line")

# Fit a line for each drug (actual data)
o.full <- lm(vals ~ drug + time + drug:time, data = dat)
# Fit a line for each drug (transformed data)
o.trans <- lm(transvals ~ drug + time + drug:time, data = dat)

# Check the interaction term to see if there is a "significant"
# difference
anova(o.full) # Interaction isn't significant
anova(o.trans) # Interaction isn't significant

# Not entirely sure the transformation is completely appropriate
# since the point is to stabilize the variance but it partially
# depends on sample size which we don't know.  So if the sample
# sizes are approximately equal it doesn't matter.  But then again
# all the observations are in a relatively small range anyways so
# it doesn't really matter... and that's probably why we don't
# see any big changes in the analysis.
And the code along with the output:
Code:
> dat <- data.frame(drug = rep(c("A","B","C"), each = 4),
+                   year = rep(2007:2010, 3),
+                   vals = c(59,62,61,62,50,49,58,61,67,70,69,83))
> # I'd rather work with smaller numbers for the predictor
> dat$time = dat$year - 2007
> # Apply the arcsin-squareroot transformation
> dat$transvals <- asin(sqrt(dat$vals/100))
> 
> # Plotting to see what the data looks like
> library(ggplot2)
> # Plot of actual data
> qplot(time, vals, colour = drug, data = dat, geom = "line")
> # Plot of transformed data
> qplot(time, transvals, colour = drug, data = dat, geom = "line")
> 
> # Fit a line for each drug (actual data)
> o.full <- lm(vals ~ drug + time + drug:time, data = dat)
> # Fit a line for each drug (transformed data)
> o.trans <- lm(transvals ~ drug + time + drug:time, data = dat)
> 
> # Check the interaction term to see if there is a "significant"
> # difference
> anova(o.full) # Interaction isn't significant
Analysis of Variance Table

Response: vals
          Df Sum Sq Mean Sq F value    Pr(>F)    
drug       2 645.17  322.58 28.5052 0.0008634 ***
time       1 156.82  156.82 13.8571 0.0098231 ** 
drug:time  2  45.03   22.52  1.9897 0.2173416    
Residuals  6  67.90   11.32                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> anova(o.trans) # Interaction isn't significant
Analysis of Variance Table

Response: transvals
          Df   Sum Sq  Mean Sq F value   Pr(>F)   
drug       2 0.073098 0.036549 24.8271 0.001253 **
time       1 0.018545 0.018545 12.5975 0.012082 * 
drug:time  2 0.005863 0.002932  1.9914 0.217123   
Residuals  6 0.008833 0.001472                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> 
> # Not entirely sure the transformation is completely appropriate
> # since the point is to stabilize the variance but it partially
> # depends on sample size which we don't know.  So if the sample
> # sizes are approximately equal it doesn't matter.  But then again
> # all the observations are in a relatively small range anyways so
> # it doesn't really matter... and that's probably why we don't
> # see any big changes in the analysis.
 
#8
We can construct a line graph with both the different drugs having different lines, but we don't know of a way to compare them.
Actually, repeated measurements analysis would apply if outcomes at different time points were obtained from the same "entity". If you can assume independence then you could fit regression lines within each group and obtain estimates of the slope and their standard errors. Then I think it will be possible to test if there is a difference between the slopes since you'll have an estimate and a standard error.
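
A minimal R sketch of the slope comparison described above, assuming independent observations and reusing the data from the first post. The helper `compare_slopes` and the choice of 4 degrees of freedom (2 residual df from each of the two per-drug fits) are illustrative assumptions, not something established in this thread. With only four points per drug, the test has very little power.

```r
# Data from the first post; time is years since 2007
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))

# Fit a separate straight line to each drug; keep the slope and its SE
slope_table <- t(sapply(split(dat, dat$drug), function(d) {
  fit <- lm(vals ~ time, data = d)
  coef(summary(fit))["time", c("Estimate", "Std. Error")]
}))
print(slope_table)

# Hypothetical helper: approximate t-test for the difference between
# two independently estimated slopes (df is an assumption, see above)
compare_slopes <- function(tab, g1, g2, df) {
  est_diff <- unname(tab[g1, "Estimate"] - tab[g2, "Estimate"])
  se_diff  <- unname(sqrt(tab[g1, "Std. Error"]^2 + tab[g2, "Std. Error"]^2))
  t_stat   <- est_diff / se_diff
  c(diff = est_diff, se = se_diff, t = t_stat,
    p = 2 * pt(-abs(t_stat), df = df))
}

res <- compare_slopes(slope_table, "C", "A", df = 4)
print(res)
```

The same helper can be called for the other pairs, for example `compare_slopes(slope_table, "B", "A", df = 4)`; running all three pairwise tests would call for a multiple-comparison adjustment.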

Are you suggesting the arcsine transformation as a more precise way of treating the percentages as true continuous numbers?
Yes, that's right.
 
#9
In SAS:

Code:
data test;
input Drug $ t2007 t2008 t2009 t2010;
datalines;
   A   59   62   61   62
   B   50   49   58   61
   C   67   70   69   83
;
run;

data test7; set test(keep=drug t2007 rename=(t2007=outc)); time=2007; 
data test8; set test(keep=drug t2008 rename=(t2008=outc)); time=2008;
data test9; set test(keep=drug t2009 rename=(t2009=outc)); time=2009;
data test10; set test(keep=drug t2010 rename=(t2010=outc)); time=2010;

data test2; set test7 test8 test9 test10;
 troutc=arsin(sqrt(outc*0.01));
run;

proc glm data=test2; /*untransformed outcome, time categorical*/
   class drug time;
   model outc=time drug;
run;

proc glm data=test2; /*transformed outcome, time categorical*/
   class drug time;
   model troutc=time drug;
run;
Output (ANOVA table), untransformed outcome, time categorical:
Code:
Source                      DF     Type III SS     Mean Square    F Value    Pr > F

time                         3     172.2500000      57.4166667       3.53    0.0881
Drug                         2     645.1666667     322.5833333      19.85    0.0023
Output (ANOVA table), transformed outcome, time categorical:
Code:
Source                      DF     Type III SS     Mean Square    F Value    Pr > F

time                         3      0.02074889      0.00691630       3.32    0.0983
Drug                         2      0.07309807      0.03654903      17.55    0.0031

Dason, I don't think we can test the time*drug interaction with these data. We have a single observation in each time*drug cell, so there is simply no error term left to test for an interaction.
 

Dason

Ambassador to the humans
#10
If we look for a linear trend and treat time as continuous then we can look for an interaction. It's not ideal but it's probably the best we could do with this dataset.
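
Both points can be checked directly in R, as a sketch using the data from the first post: with time as a factor, the interaction model uses up all 12 observations and leaves no residual degrees of freedom, while with time as a numeric predictor 6 remain. The object names `fit_cat` and `fit_lin` are only illustrative.

```r
# Data from the first post; time is years since 2007
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))

# Time as categorical: 3 drugs x 4 years = 12 parameters for 12 points
fit_cat <- lm(vals ~ drug * factor(time), data = dat)

# Time as continuous: 3 intercepts + 3 slopes = 6 parameters
fit_lin <- lm(vals ~ drug * time, data = dat)

df.residual(fit_cat)  # 0, so no error term to test the interaction
df.residual(fit_lin)  # 6, so the drug:time interaction can be tested
```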
 
#11
Yes, that's true. I'm not proficient enough in R so I couldn't tell from your code if time was continuous or categorical.

Here is what I got with time continuous, time*drug interaction included:

SAS code with time continuous:
Code:
proc glm data=test2; /*untransformed outcome, time continuous*/
   class drug;
   model outc=time drug time*drug;
run;

proc glm data=test2; /*transformed outcome, time continuous*/
   class drug;
   model troutc=time drug time*drug;
run;
Output - untransformed outcome:
Code:
Source                      DF     Type III SS     Mean Square    F Value    Pr > F

time                         1     156.8166667     156.8166667      13.86    0.0098
Drug                         2      44.9826521      22.4913261       1.99    0.2176
time*Drug                    2      45.0333333      22.5166666       1.99    0.2173
Output - transformed outcome:
Code:
Source                      DF     Type III SS     Mean Square    F Value    Pr > F

time                         1      0.01854527      0.01854527      12.60    0.0121
Drug                         2      0.00585338      0.00292669       1.99    0.2176
time*Drug                    2      0.00586314      0.00293157       1.99    0.2171
 

Link

Ninja say what!?!
#12
In case you guys are interested, I became more curious and set up a model assuming the same chronological growth in all three drugs (editing Dason's coding):

Code:
1> o.full <- lm(vals ~ drug + time, data = dat)

1> summary(o.full)

Call:
lm(formula = vals ~ drug + time, data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
-4.867 -2.175 -0.025  2.067  5.900 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  56.1500     2.3763  23.629 1.09e-08 ***
drugB        -6.5000     2.6568  -2.447  0.04015 *  
drugC        11.2500     2.6568   4.234  0.00286 ** 
time          3.2333     0.9701   3.333  0.01034 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 3.757 on 8 degrees of freedom
Multiple R-squared: 0.8766,	Adjusted R-squared: 0.8303 
F-statistic: 18.94 on 3 and 8 DF,  p-value: 0.0005423


Looks like there may be enough evidence to say that there IS an overall growth over time. I still feel like we have too few observations though.
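
One way to connect this common-slope model to the separate-slopes model from post #7 is a partial F-test between the two nested fits. The sketch below rebuilds the data and uses the illustrative names `o.common` and `o.sep`. It should reproduce the `drug:time` line of the earlier ANOVA table (F of about 1.99, p of about 0.217), so it is the same test viewed as a model comparison rather than new evidence.

```r
# Data from the first post; time is years since 2007
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))

# Same slope for every drug, different intercepts
o.common <- lm(vals ~ drug + time, data = dat)

# Separate slope and intercept for every drug
o.sep <- lm(vals ~ drug + time + drug:time, data = dat)

# Partial F-test: does allowing separate slopes improve the fit?
cmp <- anova(o.common, o.sep)
print(cmp)
```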
 
#13
d21e7x11, in your SAS code, should that be model troutc instead of model outc, or does the step before that automatically mean that the transformed data is used any time data=test2 is used? Everyone, thanks for your help! So what I am getting out of this is to apply the arcsine transformation to the proportions, then fit a line for each drug as well as an ANOVA model to make sure there is no interaction. But I guess the main point is that, if we get this type of data, there needs to be more than 4 years of data.
 

Dason

Ambassador to the humans
#14
Do you have any idea if the sample sizes for each data point are similar? Or even what the relative size of the sample sizes is?
 
#16
shanehall.m, I added to the code/comments/output in my posts - sorry for the confusion.
Note that the interaction can only be tested in the models where time is continuous.
 
#17
So from what I understand, first I must apply the arcsine transformation to the proportions, then fit a regression line for each drug and run an ANOVA model to test whether the drugs are different. I do not know R code too well... Is year a repeated measures variable?
 
#18
I see now... thank you... In our case we probably want the transformed data and time as continuous. So to see if there was a difference in the trends of the drugs, we would be looking at the interaction, correct? What that output is saying is that there is no difference in the trends or in the drugs, but there is a difference in the rates over the years. We probably need more than 4 years of data to get reliable estimates for the terms, especially interaction and time. Is my thinking correct?
 
#19
Drug A would have a sample size > 200,000
Drug B would have a sample size > 7,000
Drug C would have a sample size > 600

That is as accurate as I can get.
Does this change anything?