Hello all, this seems like it should be a simple question, but our team cannot find an answer to it. I have data set up like the following (the numbers are the percent that react positively to the drug):
Code:
Drug  2007  2008  2009  2010
A       59    62    61    62
B       50    49    58    61
C       67    70    69    83
My goal is to see whether there is a significant difference in the observed trends in rates over time. How can I test this? Unfortunately, in my line of work this is all we get; we cannot get data with the sample sizes or any other information. Any help would be greatly appreciated, as I have run into this problem many times.
Thank you so much!
Last edited by Dason; 10-11-2011 at 11:22 AM.
Without sample sizes or standard deviations, I think you may have to rely on simple mean differences.
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
shanehall.m (10-11-2011)
shanehall.m, you may want to consider first applying the arcsine transformation to the percentages. Then you can carry out a repeated measurements analysis on the transformed outcome.
shanehall.m (10-11-2011)
So there is no sort of time series analysis? We can construct a line graph with a separate line for each drug, but we don't know of a way to compare them. So you're saying do an arcsine transformation on the percentages and then run a repeated measures analysis. That will negate the trend effect, won't it? Are you suggesting the arcsine transformation as a more precise way of treating the percentages as true continuous numbers?
Slide 9 of this PowerPoint is almost identical to my question. In this PowerPoint, I'd be trying to compare the age groups. If I had no numbers other than the rates/proportions given throughout the years, how would I be able to find a difference in the trends of the age groups? Thank you all for your help.
http://www.hsph.harvard.edu/means-ma...cideTrends.ppt
One method is to just throw the numbers into a linear regression model:
Percentage ~ DrugA*time + DrugB*time + DrugC*time
This will tell you whether there is a significant increasing linear trend in the percentages over time. One problem with this, though, is that you have so few observations. A way around this could be to assume that all the drugs have the same growth over time.
PS. If you are planning to present this professionally, I strongly recommend bringing someone onto your team who knows what they're doing.
I did a quick little analysis to see whether, assuming there is a linear trend, we have any evidence that the slope differs between the drugs. We end up concluding that we don't have enough evidence. Once again, though, if we actually had sample sizes we could do quite a bit more with the data.
And the code, along with the output:
Code:
dat <- data.frame(drug = rep(c("A","B","C"), each = 4),
                  year = rep(2007:2010, 3),
                  vals = c(59,62,61,62,50,49,58,61,67,70,69,83))
# I'd rather work with smaller numbers for the predictor
dat$time = dat$year - 2007
# Apply the arcsin-squareroot transformation
dat$transvals <- asin(sqrt(dat$vals/100))

# Plotting to see what the data looks like
library(ggplot2)
# Plot of actual data
qplot(time, vals, colour = drug, data = dat, geom = "line")
# Plot of transformed data
qplot(time, transvals, colour = drug, data = dat, geom = "line")

# Fit a line for each drug (actual data)
o.full <- lm(vals ~ drug + time + drug:time, data = dat)
# Fit a line for each drug (transformed data)
o.trans <- lm(transvals ~ drug + time + drug:time, data = dat)

# Check the interaction term to see if there is a "significant"
# difference
anova(o.full) # Interaction isn't significant
anova(o.trans) # Interaction isn't significant

# Not entirely sure the transformation is completely appropriate
# since the point is to stabilize the variance but it partially
# depends on sample size which we don't know. So if the sample
# sizes are approximately equal it doesn't matter. But then again
# all the observations are in a relatively small range anyways so
# it doesn't really matter... and that's probably why we don't
# see any big changes in the analysis.
Code:
> dat <- data.frame(drug = rep(c("A","B","C"), each = 4),
+                   year = rep(2007:2010, 3),
+                   vals = c(59,62,61,62,50,49,58,61,67,70,69,83))
> # I'd rather work with smaller numbers for the predictor
> dat$time = dat$year - 2007
> # Apply the arcsin-squareroot transformation
> dat$transvals <- asin(sqrt(dat$vals/100))
>
> # Plotting to see what the data looks like
> library(ggplot2)
> # Plot of actual data
> qplot(time, vals, colour = drug, data = dat, geom = "line")
> # Plot of transformed data
> qplot(time, transvals, colour = drug, data = dat, geom = "line")
>
> # Fit a line for each drug (actual data)
> o.full <- lm(vals ~ drug + time + drug:time, data = dat)
> # Fit a line for each drug (transformed data)
> o.trans <- lm(transvals ~ drug + time + drug:time, data = dat)
>
> # Check the interaction term to see if there is a "significant"
> # difference
> anova(o.full) # Interaction isn't significant
Analysis of Variance Table

Response: vals
          Df Sum Sq Mean Sq F value    Pr(>F)
drug       2 645.17  322.58 28.5052 0.0008634 ***
time       1 156.82  156.82 13.8571 0.0098231 **
drug:time  2  45.03   22.52  1.9897 0.2173416
Residuals  6  67.90   11.32
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(o.trans) # Interaction isn't significant
Analysis of Variance Table

Response: transvals
          Df   Sum Sq  Mean Sq F value   Pr(>F)
drug       2 0.073098 0.036549 24.8271 0.001253 **
time       1 0.018545 0.018545 12.5975 0.012082 *
drug:time  2 0.005863 0.002932  1.9914 0.217123
Residuals  6 0.008833 0.001472
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> # Not entirely sure the transformation is completely appropriate
> # since the point is to stabilize the variance but it partially
> # depends on sample size which we don't know. So if the sample
> # sizes are approximately equal it doesn't matter. But then again
> # all the observations are in a relatively small range anyways so
> # it doesn't really matter... and that's probably why we don't
> # see any big changes in the analysis.
Actually, repeated measurements analysis would apply if outcomes at different time points were obtained from the same "entity". If you can assume independence then you could fit regression lines within each group and obtain estimates of the slope and their standard errors. Then I think it will be possible to test if there is a difference between the slopes since you'll have an estimate and a standard error.
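A minimal R sketch of that idea, assuming the four yearly percentages within each drug can be treated as independent. It reuses the data from this thread; the helper name `slope_fit` is made up for illustration, and the final ratio is only a rough guide, since each line is fitted to four points and has just two residual degrees of freedom.

```r
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  time = rep(0:3, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))

# Hypothetical helper: slope estimate and standard error for one drug
slope_fit <- function(d) {
  fit <- lm(vals ~ time, data = d)
  coef(summary(fit))["time", c("Estimate", "Std. Error")]
}

# One row per drug: slope and its standard error
slopes <- t(sapply(split(dat, dat$drug), slope_fit))
print(slopes)

# Rough comparison of two slopes (here B vs A): difference over its SE
diff_BA  <- slopes["B", "Estimate"] - slopes["A", "Estimate"]
se_BA    <- sqrt(slopes["B", "Std. Error"]^2 + slopes["A", "Std. Error"]^2)
ratio_BA <- diff_BA / se_BA
print(c(difference = diff_BA, se = se_BA, ratio = ratio_BA))
```

The per-drug slopes come out as 0.8, 4.2 and 4.7 percentage points per year for A, B and C. The same comparison can be repeated for any other pair of drugs by changing the row names.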
Yes, that's right.
In SAS:
Code:
data test;
 input Drug $ t2007 t2008 t2009 t2010;
 datalines;
A 59 62 61 62
B 50 49 58 61
C 67 70 69 83
;
run;
data test7; set test(keep=drug t2007 rename=(t2007=outc)); time=2007;
data test8; set test(keep=drug t2008 rename=(t2008=outc)); time=2008;
data test9; set test(keep=drug t2009 rename=(t2009=outc)); time=2009;
data test10; set test(keep=drug t2010 rename=(t2010=outc)); time=2010;
data test2; set test7 test8 test9 test10;
 troutc=arsin(sqrt(outc*0.01));
run;
proc glm data=test2; /*untransformed outcome, time categorical*/
 class drug time;
 model outc=time drug;
run;
proc glm data=test2; /*transformed outcome, time categorical*/
 class drug time;
 model troutc=time drug;
run;
Output (ANOVA table), untransformed outcome, time categorical:
Code:
Source    DF    Type III SS    Mean Square    F Value    Pr > F
time       3    172.2500000     57.4166667       3.53    0.0881
Drug       2    645.1666667    322.5833333      19.85    0.0023
Output (ANOVA table), transformed outcome, time categorical:
Code:
Source    DF    Type III SS    Mean Square    F Value    Pr > F
time       3     0.02074889     0.00691630       3.32    0.0983
Drug       2     0.07309807     0.03654903      17.55    0.0031
Dason, I don't think we can test the time*drug interaction with these data. We have a single observation in each time*drug cell, so there is simply no error term left to test for an interaction.
Last edited by d21e7x11; 10-12-2011 at 12:17 PM.
shanehall.m (10-13-2011)
If we look for a linear trend and treat time as continuous then we can look for an interaction. It's not ideal but it's probably the best we could do with this dataset.
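The degrees-of-freedom point in the last two posts can be checked directly. The R sketch below, using the thread's data and Dason's variable names, counts the residual degrees of freedom for the interaction model with time as a factor and with time as a number; the object names `fit_cat` and `fit_lin` are just illustrative.

```r
dat <- data.frame(drug = rep(c("A", "B", "C"), each = 4),
                  year = rep(2007:2010, 3),
                  vals = c(59, 62, 61, 62, 50, 49, 58, 61, 67, 70, 69, 83))
dat$time <- dat$year - 2007

# Time as a factor with interaction: 3 drugs x 4 years = 12 parameters
fit_cat <- lm(vals ~ drug * factor(year), data = dat)

# Time as a numeric predictor with interaction: 3 intercepts + 3 slopes
fit_lin <- lm(vals ~ drug * time, data = dat)

# Residual degrees of freedom left over for an error term
df_cat <- df.residual(fit_cat)
df_lin <- df.residual(fit_lin)
print(c(categorical = df_cat, linear = df_lin))
```

With twelve observations, the categorical model uses all twelve degrees of freedom and leaves none for error, while the linear-trend model leaves six, which matches the Residuals row in the ANOVA tables above.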
shanehall.m (10-12-2011)
Yes, that's true. I'm not proficient enough in R so I couldn't tell from your code if time was continuous or categorical.
Here is what I got with time continuous, time*drug interaction included:
SAS code with time continuous:
Code:
proc glm data=test2; /*untransformed outcome, time continuous*/
 class drug;
 model outc=time drug time*drug;
run;
proc glm data=test2; /*transformed outcome, time continuous*/
 class drug;
 model troutc=time drug time*drug;
run;
Output - untransformed outcome:
Code:
Source       DF    Type III SS    Mean Square    F Value    Pr > F
time          1    156.8166667    156.8166667      13.86    0.0098
Drug          2     44.9826521     22.4913261       1.99    0.2176
time*Drug     2     45.0333333     22.5166666       1.99    0.2173
Output - transformed outcome:
Code:
Source       DF    Type III SS    Mean Square    F Value    Pr > F
time          1     0.01854527     0.01854527      12.60    0.0121
Drug          2     0.00585338     0.00292669       1.99    0.2176
time*Drug     2     0.00586314     0.00293157       1.99    0.2171
Last edited by d21e7x11; 10-12-2011 at 12:25 PM.
shanehall.m (10-12-2011)
In case you guys are interested, I became more curious and set up a model assuming the same growth over time in all three drugs (editing Dason's code):
Code:
> o.full <- lm(vals ~ drug + time, data = dat)
> summary(o.full)

Call:
lm(formula = vals ~ drug + time, data = dat)

Residuals:
   Min     1Q Median     3Q    Max
-4.867 -2.175 -0.025  2.067  5.900

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  56.1500     2.3763  23.629 1.09e-08 ***
drugB        -6.5000     2.6568  -2.447  0.04015 *
drugC        11.2500     2.6568   4.234  0.00286 **
time          3.2333     0.9701   3.333  0.01034 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.757 on 8 degrees of freedom
Multiple R-squared: 0.8766, Adjusted R-squared: 0.8303
F-statistic: 18.94 on 3 and 8 DF, p-value: 0.0005423
Looks like there may be enough evidence to say that there IS an overall growth over time. I still feel like we have too few observations though.
shanehall.m (10-12-2011)
d21e7x11, in your SAS code, should that be model troutc instead of model outc, or does the step before that automatically mean that whenever data=test2 is used, the transformed data is used? Everyone, thanks for your help! So what I am getting out of this is: apply the arcsine transformation to the proportions, then fit a line for each drug as well as an ANOVA model to make sure there is no interaction. But I guess the main point is that if we get this type of data, we need more than 4 years of it.
Do you have any idea whether the sample sizes for each data point are similar? Or even roughly how large the sample sizes are relative to each other?
shanehall.m (10-12-2011)
Drug A would have a sample size > 200,000
Drug B would have a sample size > 7,000
Drug C would have a sample size > 600
That is as accurate as I can get.
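As a rough illustration of why these sample sizes matter, the R sketch below computes binomial standard errors for the 2010 percentages, treating the lower bounds above (200,000, 7,000 and 600) as if they were the actual sample sizes. That is an assumption made purely for illustration; the true counts are unknown and may differ from year to year.

```r
# 2010 percentages from the thread, as proportions
p <- c(A = 0.62, B = 0.61, C = 0.83)

# ASSUMPTION: the reported lower bounds stand in for the real sample sizes
n <- c(A = 200000, B = 7000, C = 600)

# Binomial standard error of each percentage, in percentage points
se_pct <- 100 * sqrt(p * (1 - p) / n)
print(round(se_pct, 2))

# Relative weights proportional to n, e.g. for a weighted regression
# on the arcsine-square-root scale, where the variance is roughly 1/(4n)
w <- n / sum(n)
print(round(w, 4))
```

Under that assumption, Drug C's percentages are far noisier than Drug A's, so an unweighted line through all twelve points treats very unequal measurements as equals. Weights such as `w` could be passed to the `weights` argument of `lm`, though with bounds rather than real counts this remains a sketch.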