# Survival analysis with few dates: GLMs? log-rank? Cox?

#### Rets

##### New Member
Hello,

In much of the documentation I have read, Kaplan-Meier curves followed by a log-rank test and/or Cox models are the recommended statistical methods for analysing survival and testing for factors that may affect it.

However, I have also heard that these are suitable when the data contain many time points, but may not be the best choice when I want to compare at only one, two or three times (say at t = 0, t = 7 days and t = 14 days). In most examples of these methods, the alive/dead information is available at 10 or more time points. Should I consider something else to investigate the effect of a group or factor on survival?

I was considering GLMs with a binomial family at a single target date (removing day 0), then comparing the reduced and full models with an ANOVA (likelihood-ratio test), but I am not sure whether this is appropriate for survival data, nor which assumptions of such a binomial GLM I should verify beforehand. A good point of GLMs would be the possibility to test for an interaction between two factors, which I believe is not possible with the log-rank test or with Cox models.

Many thanks for any comments or advice!

#### hlsmith

##### Not a robit
Please provide more detail. What do you mean by 10 times? Do you have the actual day of the event?

You can have interaction terms in Cox reg!
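For instance, a minimal sketch with R's survival package and its built-in lung dataset (the dataset and variable names here are just for illustration, not from the thread):

```r
library(survival)

# Cox model with an interaction between sex and ECOG performance score;
# in the lung dataset, status == 2 codes a death event
fit <- coxph(Surv(time, status == 2) ~ factor(sex) * ph.ecog, data = lung)
summary(fit)  # the factor(sex)2:ph.ecog row tests the interaction
```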

#### Rets

##### New Member
Thank you, I have actually never run Cox; I thought it was similar to the log-rank test with survfit, in which we cannot add an interaction. I will give it a try.

For clarification of my post: let's consider an experiment lasting 9 days. If I record the survival of all animals every day, I will have 10 measurement times (= 10 "events" in my loose wording). If I record only at t = 0 and at t = day 9, I will have only 2 times; at t = 0, t = 2 d and t = 9 d, I will have 3 times, etc. So for very few measurement times (2 or 3), are Kaplan-Meier curves, the log-rank test and Cox models still the most suitable analyses for survival? Should I consider a GLM or something else?

#### hlsmith

##### Not a robit
Still not quite following. Are you saying you have a small sample, rare events, or short follow-up, or a combination of these? Human clinical trials may look at survival at 1 week, 30 days, 6 months, 1 year, and then 5 years. Is this what you are referencing?

#### Rets

##### New Member
Sorry, I was not clear in my post and misused the word "event". What I mean is a short follow-up. Over 9 days, instead of having 10 follow-ups, i.e. 10 recordings of the alive/dead information for every animal of the same cohort (same number of animals at every follow-up), which would be good, I have only 2 or 3 follow-ups of the same cohort. I just have the information dead/alive at t = 0, t = 2 d and t = 9 d for all animals.

In this context, are Kaplan-Meier curves, the log-rank test and Cox models appropriate? Do you think I should consider a GLM or something else?

Bonus question: if, on the contrary, I have a good follow-up, say every day for one year, but a small cohort, so very few events per day (e.g., 5 animals), are the methods mentioned above still the most appropriate?

#### obh

##### Active Member
Hi Rets,

The power of the log-rank test is the same as the power of the chi-squared test with 2 groups (df = 1).
The sample size is the number of events, say "did not survive" events (not the number of follow-ups, not the group size).
Of course, the number of events depends on the research duration, the group size, the event frequency ...

"one, two or three times only (let say at t=0 , t=7 days and t= 14days). In many examples with the methods above, they have at least 10 "times""
Two or three times misses the point of survival analysis if you want a continuous chart; 10 times is close to continuous.
I'm not sure whether there is a problem using the log-rank test with only two or three times, but using it with only one time is like a regular goodness-of-fit test (with the expected remaining counts taken as half of the totals of the two groups), so my common sense says there is no problem with 2-3 times.
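To illustrate that last point with a sketch (the counts below are invented): comparing survival between two groups at one single time reduces to a 2x2 contingency test.

```r
# alive/dead counts for groups A and B at the single follow-up time (made-up data)
tab <- matrix(c(12, 3,   # group A: 12 alive, 3 dead
                4, 11),  # group B: 4 alive, 11 dead
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("alive", "dead")))
chisq.test(tab)   # chi-squared test with df = 1
fisher.test(tab)  # exact alternative when expected counts are below 5
```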


#### Rets

##### New Member
Thank you very much for your explanations.

So "event" or "sample size" is the number of deaths × the number of follow-ups? If no one is dead, is the sample size 0?

And if I want to assess the effect of a factor on mortality at one single time, do you think an ANOVA on GLMs with a binomial family could be appropriate? I have seen many studies assessing the effect of a factor on survival with one- or two-way ANOVAs.

#### obh

##### Active Member
So "Event" or "sample size" is the number of deaths × the number of follow-ups? If no one is dead, the sample size is 0?
In your example, an event is a death. (It can be any other event in other examples, like a birth, rain, the next bus, a car breakdown, etc.)
The sample size is just the number of events, not multiplied by the number of follow-ups.

This makes sense: think, for example, of checking every day for 30 days while no event happens. What can you conclude? Only that the time to event is usually more than 30 days ...

PS: in the chi-squared test the expected frequencies must be at least 5, so I assume this is also relevant for the log-rank test (?)

And if I want to assess the effect of a factor on the mortality at one time only
What do you mean?


#### obh

##### Active Member
Hi @hlsmith

1. I didn't see a minimum of 5 events per group written anywhere, but since the test uses the chi-squared goodness-of-fit test, I assume it is relevant here as well. Do you have any idea?

I also saw a sample size formula based on the RH: 4(Z_α + Z_β)² / (d·(log RH)²).
Is my assumption to use the chi-squared sample size correct? (I can't see why not.)
Maybe it would give the same result if I calculated the chi-squared effect size from the RH?
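As a sketch of how that formula behaves in R (my reading, stated as assumptions: RH is the hazard ratio, the Z-values are normal quantiles, and d is the probability of observing an event, so dividing the event count by d gives a total sample size):

```r
alpha <- 0.05; power <- 0.80
RH <- 2         # assumed relative hazard (hazard ratio)
p_event <- 0.5  # assumed probability that a subject has an event
# required number of events: 4 * (z_{alpha/2} + z_beta)^2 / (log RH)^2
events <- 4 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / log(RH)^2
ceiling(events)            # about 66 events for these settings
ceiling(events / p_event)  # total sample size = events / P(event)
```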

Thanks

#### Rets

##### New Member
Hello,

Thank you for your explanations !

1: OK, so if the number of individuals followed is 100, and 50 died during a study lasting 365 days, then the sample size is 50. However, I suppose the log-rank test will be much more powerful with a follow-up every day (365 follow-ups) than once a month (12 follow-ups), and even worse with only 3 follow-ups (say at day 0, day 180 and day 365). Yet in each of these examples, the sample size = the number of events = 50. Is there a way to determine the power taking into account the follow-up schedule, the sample size and the number of individuals followed?

2: Just for precision on your post: when you say it could be any event, like the next bus, should it be a binomial and irreversible variable, like death, which happens only once with no going back (except for zombies)?

3: Rets said:
And if I want to assess the effect of a factor on the mortality at one time only
--> What do you mean?

I mean that, in the example written in 1, I want to assess the effect of smoking vs non-smoking on survival at day 365 only, over the period from day 0 to day 365. Say that of the 100 individuals, 50 survived (40 non-smokers, 10 smokers), and of the 50 deaths, 20 were non-smokers and 30 were smokers. Say I only have t = 0 (all alive) and t_final = 365 d (50 dead). To test the factor "smoking" on survival, may I take only the data at day 365 and compare something like (with R):

```r
m0 <- glm(survival ~ 1, family = binomial, data = d)
m1 <- glm(survival ~ smoke, family = binomial, data = d)
anova(m0, m1, test = "Chisq")  # likelihood-ratio test for the smoke effect
```

That would also allow me to include other factors (e.g., "sport") and interactions.

Many thanks again for your insights.

#### obh

##### Active Member
Hi Rets,

1: Ok, so if the number of individuals being followed is 100, and 50 died during the study which is lasting 365 days, then the sample size is 50. However, I suppose the logrank test will be much more powerful if we have a follow-up every day so 365 follow-ups, rather than once a month, (n=12 follow-ups) and even worst with only 3 follow-ups (say at day=0, day=180 and day=365). However, in every of these examples, the sample size= the number of events=50. Is there a way to determine the power taking into account the follow up, sample size and number of individuals being followed?
If your goal is the Kaplan-Meier plot, you want it to be close to a continuous chart: you want to know the probability of an event at any point in time. With more follow-ups you get a higher-resolution plot. So the required resolution of the follow-ups depends on the time between events across all subjects; if you get on average 1 event per 3 months, I assume you don't need to follow up every day.

Regarding the power: I understand that it doesn't "feel" right; you may think that more follow-ups mean more data, and more data means better power. But if you calculate the power based on the chi-squared test, you won't get a higher-powered test.
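One quick way to see the power of a final-time comparison (a sketch with made-up group size and survival proportions, using base R's power.prop.test):

```r
# power of a two-sample proportion comparison at the last follow-up only,
# assuming 30 animals per group and true survival of 80% vs 50% (invented numbers)
power.prop.test(n = 30, p1 = 0.8, p2 = 0.5, sig.level = 0.05)$power
```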

Maybe the standard deviation of the power estimate will be lower with more follow-ups?? Anyway, you can run a simulation in R and see for yourself. (If you have never run simulations, you should; it is a very powerful way to learn and relatively easy.)

2: Just a precision in your post, when you say it could be any event, like next bus, should it be a binomial and irreversible variable, like death as it happens once only and there is no going back, except for zombies or J.
The bus of 14:00 and the bus of 15:00 are different entities.
The same goes for the time between headaches (although it is the same person, it is a different headache).

#### Rets

##### New Member
Hi Obh,

Ok thank you so much for these explanations!
I like the comparisons, this is very clear to me now, thanks!

Regarding the power, it is indeed counter-intuitive that a higher number of follow-ups does not result in higher statistical power! So we could actually drop all the intermediate follow-ups and look at the last one only... that would lighten the data a lot without decreasing the power?

Regarding simulations to estimate the power of log-rank tests or Cox models, I have no idea how to proceed...

Any opinion regarding the ANOVA and GLMs at one date (= the last day of follow-up) only?

#### obh

##### Active Member
Hi Rets,

Start by installing RStudio.

An example of a t-test power calculation:
1. Create 2 random samples: 100 observations of N(10, 70) and 250 observations of N(13, 70); calculate the p-value.
2. Repeat the step above 10,000 times.
3. Calculate the proportion of runs in which the t-test rejected H0 (p-value < alpha); this is the power.

```r
n1 <- 100; n2 <- 250        # sample sizes
sigma1 <- 70; sigma2 <- 70  # true SDs
delta <- 3                  # true difference between the means
mu1 <- 10                   # mean of the first population
mu2 <- 10 + delta           # mean of the second population (the alternative)
alpha <- 0.03               # significance level

reps <- 10000  # number of simulations (a bigger value gives higher accuracy)

## p-value approach:

pvalues <- numeric(reps)

set.seed(1)

for (i in 1:reps) {
  x1 <- rnorm(n1, mu1, sigma1)
  x2 <- rnorm(n2, mu2, sigma2)
  pvalues[i] <- t.test(x2, x1, alternative = "greater", var.equal = FALSE)$p.value
}
mean(pvalues < alpha)  # the estimated power
```

=====================

A simple example of the log-rank test (one run, not a simulation):

```r
library(survival)
time   = c(3,5,6,7,9,11,12,15,17,17,18,21,8,12,17,21,24,26,29,32,38,39)
status = c(1,1,1,1,1,0,1,1,1,0,1,1,1,1,1,1,0,1,1,1,0,1)
group  = c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
sdata = data.frame(time, status, group)
formula1 <- Surv(time, status == 1) ~ group
surv_by_group = survfit(Surv(time, status == 1) ~ group, data = sdata)
test_result <- survdiff(formula = formula1, data = sdata, rho = 0)
test_result
```

"Any opinion regarding the anova and glms at one date = last day of follow-up only ?"
Not sure exactly what you mean; probably not good if you have censored data.

#### Rets

##### New Member
Dear OBH,

Thank you for your instructive reply. I have been trying to run your script; however, I got confused. In the first example with t-tests, the command mean(pvalues < alpha) gives me only: 0. So what is the power in that case? I am not used to loops, sorry.

For the second example, I am not sure where the power calculation is; or maybe it is meant to show how to test the effect of a group on survival using the log-rank test and the survival package?
Regarding "the glm at one date", I have written a reproducible example below; please let me know if such a method for analysing a real dataset is statistically acceptable/correct:

```r
surv0 <- rep(1, 30)  # at t = 0 h, every individual is alive (here alive = "1", dead = "0")
length(surv0)        # verification of the length
surv96 <- c(1,1,0,1,1,0,1,0,1,0,1,0,1,0,1,0,0,0,1,1,1,0,1,1,1,0,0,0,0,0)  # at t = 96 h, some are dead
length(surv96)       # verification of the length
surv <- append(surv0, surv96)  # binding the vectors
time0  <- rep(0, 30)
time96 <- rep(96, 30)
time <- append(time0, time96)  # we only have two times, t = 0 h and t = 96 h
condition <- rep(c("a", "b"), 30)  # two alternating conditions (same cohort in the same order: 30 at t = 0 and 30 at t = 96 h)
all <- data.frame(surv, time, condition)
tfinal <- subset(all, time == 96)  # we look only at the final time
table(tfinal)  # just to have an overview of the data
m1 <- glm(surv ~ 1, data = tfinal, family = binomial)
m2 <- glm(surv ~ condition, data = tfinal, family = binomial)
anova(m1, m2, test = "Chisq")  # likelihood-ratio test
summary(m2)  # condition b has a significant negative effect on individual survival (p < 0.01)
```

Many thanks in advance.

#### obh

##### Active Member
Hi Rets,

Sorry, very strange: for some reason the forum omitted the [ i ] index from the pvalues[i] <- t.test(...) line when pasting, turning it into plain pvalues (that is why mean(pvalues < alpha) gave you 0: without the index, the whole vector is overwritten by a single p-value on every iteration). I added spaces, pvalues [ i ], as a workaround; surely there is a better way. Of course you don't need the index in the array definition (pvalues <- numeric(reps)). Just remove the extra spaces when pasting into R!! I also updated the original message to avoid confusing others.

#### obh

##### Active Member
Hi Rets,

It is probably not easy to write the simulation for the first time, but it is clearly rewarding.
I wrote some partial code to start with, but you need to complete the missing parts.

You should generate the random data with rbinom(n, 1, p) (instead of x1 <- rnorm(n1, mu1, sigma1) in the t-test example). For example, n = 10, p = 0.7 generated the following:

```r
> rbinom(10, 1, 0.7)
 [1] 0 1 0 1 0 1 1 1 0 0
```

and a second time:

```r
> rbinom(10, 1, 0.7)
 [1] 0 1 1 1 1 1 0 1 1 1
```

The loop will generate "reps" such vectors.

```r
..............
pvalues <- numeric(reps)
set.seed(1)
for (i in 1:reps) {
  status1 <- rbinom(n1, 1, p1)
  status2 <- rbinom(n2, 1, p2)
  .............
  .............
  sdata = data.frame(time, status, group)
  formula1 <- Surv(time, status == 1) ~ group
  surv_by_group = survfit(Surv(time, status == 1) ~ group, data = sdata)
  sd1 <- survdiff(formula = formula1, data = sdata, rho = 0)
  pvalues[i] <- 1 - pchisq(sd1$chisq, df = 1)  # survdiff does not return the p-value directly
}
mean(pvalues < alpha)
```
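One possible way to fill in the missing parts above (my assumptions, not from the thread: exponential event times with rates r1 and r2, everyone censored at day 30, 50 animals per group; the p-value is taken from the chi-squared statistic because older versions of survival's survdiff do not return one directly):

```r
library(survival)

n1 <- 50; n2 <- 50      # group sizes (assumed)
r1 <- 0.05; r2 <- 0.10  # daily hazard rates (assumed)
cens <- 30              # administrative censoring at day 30
alpha <- 0.05
reps <- 2000            # number of simulated experiments

set.seed(1)
pvalues <- numeric(reps)
for (i in 1:reps) {
  t1 <- rexp(n1, r1); t2 <- rexp(n2, r2)   # true event times
  time <- pmin(c(t1, t2), cens)            # observed times, capped at end of follow-up
  status <- as.numeric(c(t1, t2) <= cens)  # 1 = event observed, 0 = censored
  group <- rep(c("A", "B"), c(n1, n2))
  sdata <- data.frame(time, status, group)
  sd1 <- survdiff(Surv(time, status == 1) ~ group, data = sdata, rho = 0)
  pvalues[i] <- 1 - pchisq(sd1$chisq, df = 1)
}
mean(pvalues < alpha)  # estimated power of the log-rank test
```

Changing cens to a smaller value shows how shorter follow-up (fewer observed events) reduces the estimated power.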