Let’s investigate immortal time bias with a coded example.
Published
February 23, 2024
In the last post I introduced the concept of immortal time bias and how it can distort associations in your survival analysis, if you naively misclassify unexposed/untreated observation time as exposed/treated. This week I am going to illustrate the concept with some data and R code. It would have been good to analyse the Oscar Winner’s data but as I could not locate that anywhere online, we are instead going to look at one of the first studies in which immortal time bias was subsequently recognised to be a problem.
The work came out of Stanford University in the early 1970s and assessed the survival benefit of potential heart transplant recipients. In the analysis, the event of interest was death and the primary treatment was heart transplantation - so survival amongst transplant recipients was compared to that amongst accepted patients into the program that did not end up receiving a transplant. Treatment was initially considered time-fixed and the patients divided into two groups - ‘ever transplanted’ vs ‘never transplanted’. Survival time amongst recipients was found to be longer than those who didn’t receive transplantation.
The immortal time bias here involves the waiting time of those patients who survived to make it to the transplant. Because this portion of the observation time was classified as exposed to transplantation instead of unexposed, it offered a guaranteed survival time to the transplanted group. The result of this misclassification was to produce an artificial increase in the mortality rate of the reference group, thus suggesting a benefit of heart transplant surgery. In a later reanalysis of the data, the apparent survival benefit of the transplanted group disappeared when the immortal time was properly accounted for by a time-dependent analysis.
1 Load data
As this study is considered a canonical example of immortal time bias, the data comes built into R’s survival package. We can load the data and inspect the relevant jasa dataframe as below.
fustat - the ‘event’ variable; 0 = alive, 1 = dead at the end of follow-up.
futime - the primary ‘time’ variable; time (days) from acceptance into the transplant program until death or censoring.
wait.time - the secondary ‘time’ variable; time (days) from acceptance into the transplant program until receiving a heart if transplanted (NA for those who never underwent transplant surgery).
transplant - the ‘treatment/exposure’ variable; 0 = did not receive heart, 1 = received heart.
2 Visualise individual survival trajectories
Using a bit of ggplot2 magic, we can now plot the individual observation times for the 103 patients in the study. Note that I have stratified observation time by transplant status (orange for the period a patient remains untransplanted and blue for the period following a transplant).
Code
# Create 'id' variablejasa$id <-seq(1:dim(jasa)[1])# Replace wait.time with futime if didn't undergo transplantjasa$wait.time[is.na(jasa$wait.time)] <- jasa$futime[is.na(jasa$wait.time)]# Plotjasa |>ggplot(aes(x = id, y = futime)) +geom_linerange(aes(ymin =0, ymax = wait.time), color ="#E7B800", linewidth =1) +geom_linerange(aes(ymin = wait.time, ymax = futime), color ="#2E9FDF", linewidth =1) +geom_point(aes(shape =factor(fustat)), stroke =1, cex =1, color ="black") +scale_shape_manual(values =c(1, 3), labels =c("Censored", "Died"), name ="Outcome") +annotate("text", x =95, y =1400, label ="Observation time = yellow - untransplanted", size =5, color ="#E7B800") +annotate("text", x =92, y =1380, label ="Observation time = blue - post-transplant", size =5, color ="#2E9FDF") +ggtitle("Survival Trajectories for Heart Transplant Patients") +ylab("Time (days)") +xlab("Patient Number") +coord_flip() +theme_bw(base_size =20) +theme(axis.text.y =element_text(size =15))
3 Naive analysis assuming treatment status is time-fixed
3.1 Visualise survival curves
Plotting the Kaplan-Meier survival curves are easy by first saving the survfit object:
fit <- survfit(Surv(futime, fustat) ~ transplant, data = jasa)
and then passing this ggsurvplot which does a nicer job of plotting survival data then using R’s base functions. Note that we ignore wait.time and only specify futime in our fit function. This is because we are assuming if a patient was transplanted, the entire duration of their observation period was considered as such.
This gives a HR = 0.27 (95% CI 0.17, 0.43; p < 0.001) indicating that there is about a 73% reduction in the risk of death with transplantation. Pretty effective, right?
4 Correct analysis assuming treatment status is time-varying
Up until now we have just used the data as it’s been presented to us. Each patient has a single observation with all information about them contained in that row of data. However, to perform the correct time-dependent analysis we first need to construct a time-varying version of the treatment (i.e. transplant) variable. This data format is known as ‘counting process’ and in the general case involves creating potentially multiple rows of data for each patient with each row corresponding to a different exposure/treatment period of that patients observation time. In this specific example, we will create an additional row of data for transplanted patients splitting time at the point of transplant, so that the first row contains the time from acceptance into the transplant program to the point of transplant, and the second row contains the time from transplant to either death or censoring. We specify this in ‘start, stop’ format rather than the duration of the interval itself. We will use the tmergefunction to do this, although a little bit of manual programming can also achieve the same result.
Code
# Create subset of data selecting relevant variablesjasa_subset <- jasa |>select(id, wait.time, futime, fustat, transplant)# Can't have an end time of 0 (one obs) - change this to 0.5jasa_subset$futime[jasa_subset$futime ==0] <-0.5# Create dataframe in counting process formatjasa_cp <-tmerge(data1 = jasa_subset |>select(id, futime, fustat), data2 = jasa_subset |>select(id, futime, fustat, wait.time, transplant), id = id, death =event(futime, fustat),transplant =tdc(wait.time)) |>select(-c(futime, fustat))
Remember that the original data looked like:
Code
head(jasa_subset, 7)
id
wait.time
futime
fustat
transplant
1
49
49
1
0
2
5
5
1
0
3
0
15
1
1
4
35
38
1
1
5
17
17
1
0
6
2
2
1
0
7
50
674
1
1
And the newly created dataframe in counting process format:
Code
head(jasa_cp, 9)
id
tstart
tstop
death
transplant
1
0
49
1
0
2
0
5
1
0
3
0
15
1
1
4
0
35
0
0
4
35
38
1
1
5
0
17
1
0
6
0
2
1
0
7
0
50
0
0
7
50
674
1
1
Note the new ‘start, stop’ time variables. We have also renamed fustat to death for a more intuitive name. In this small data subset, Subject’s 3, 4 and 7 underwent a transplant, but only the latter two had both unexposed and exposed time periods during observation (Subject 3 was transplanted at the beginning of their observation), hence each subject now has two rows of data.
4.1 Visualise survival curves
Plotting the Kaplan-Meier survival curves for the data in this correct format reveals a vastly different result to that which we viewed earlier. There is now almost no separation in the curves.
Commensurately, the output of the Cox model now gives a HR = 1.13 (95% CI 0.63, 2.04; p = 0.7) indicating that there is about a 13% increase in the risk of death with transplantation - but this could be as much as a 104% increase or even a 37% decrease. That is, we can’t be confident the observed effect didn’t occur just by chance. Clearly, this tells a different story to the naive analysis we previously conducted.
The lesson here is to always think about whether your exposure or treatment changes over the course of an individual’s observation time, and if it does, to account for that in your survival model by constructing a time-varying covariate.
Source Code
---title: "Immortal time bias - \"The fallacy that never dies\" (Part 2)"date: 2024-02-23categories: [code, analysis, modelling, survival, visualisation]#image: description: "Let's investigate immortal time bias with a coded example."---In the last post I introduced the concept of immortal time bias and how it can distort associations in your survival analysis, if you naively misclassify unexposed/untreated observation time as exposed/treated. This week I am going to illustrate the concept with some data and `R` code. It would have been good to analyse the Oscar Winner's data but as I could not locate that anywhere online, we are instead going to look at [one of the first studies](https://www.jstor.org/stable/2286902 "https://www.jstor.org/stable/2286902"){style="color: #1f7de6"} in which immortal time bias was subsequently recognised to be a problem.The work came out of Stanford University in the early 1970s and assessed the survival benefit of potential heart transplant recipients. In the analysis, the event of interest was death and the primary treatment was heart transplantation - so survival amongst transplant recipients was compared to that amongst accepted patients into the program that did not end up receiving a transplant. Treatment was initially considered **time-fixed** and the patients divided into two groups - '*ever transplanted*' vs '*never transplanted*'. Survival time amongst recipients was found to be longer than those who didn't receive transplantation.The immortal time bias here involves the waiting time of those patients who survived to make it to the transplant. Because this portion of the observation time was classified as exposed to transplantation instead of unexposed, it offered a guaranteed survival time to the transplanted group. The result of this misclassification was to produce an artificial increase in the mortality rate of the reference group, thus suggesting a benefit of heart transplant surgery. In a later reanalysis of the data, the apparent survival benefit of the transplanted group disappeared when the immortal time was properly accounted for by a time-dependent analysis.## Load dataAs this study is considered a canonical example of immortal time bias, the data comes built into `R`'s `survival` package. We can load the data and inspect the relevant `jasa` dataframe as below.```{r}#| label: load data#| message: false#| warning: falselibrary(survival)library(survminer)library(gtsummary)library(dplyr)data(heart, package ="survival")head(jasa)```The variables that we're going to use are:- `fustat` - the 'event' variable; 0 = alive, 1 = dead at the end of follow-up.- `futime` - the primary 'time' variable; time (days) from acceptance into the transplant program until death or censoring.- `wait.time` - the secondary 'time' variable; time (days) from acceptance into the transplant program until receiving a heart if transplanted (`NA` for those who never underwent transplant surgery).- `transplant` - the 'treatment/exposure' variable; 0 = did not receive heart, 1 = received heart.## Visualise individual survival trajectoriesUsing a bit of `ggplot2` magic, we can now plot the individual observation times for the 103 patients in the study. Note that I have stratified observation time by transplant status (<FONT COLOR = "#E7B800">orange</FONT> for the period a patient remains untransplanted and <FONT COLOR = "#2E9FDF">blue</FONT> for the period following a transplant).```{r}#| label: survival trajectories# Create 'id' variablejasa$id <-seq(1:dim(jasa)[1])# Replace wait.time with futime if didn't undergo transplantjasa$wait.time[is.na(jasa$wait.time)] <- jasa$futime[is.na(jasa$wait.time)]# Plotjasa |>ggplot(aes(x = id, y = futime)) +geom_linerange(aes(ymin =0, ymax = wait.time), color ="#E7B800", linewidth =1) +geom_linerange(aes(ymin = wait.time, ymax = futime), color ="#2E9FDF", linewidth =1) +geom_point(aes(shape =factor(fustat)), stroke =1, cex =1, color ="black") +scale_shape_manual(values =c(1, 3), labels =c("Censored", "Died"), name ="Outcome") +annotate("text", x =95, y =1400, label ="Observation time = yellow - untransplanted", size =5, color ="#E7B800") +annotate("text", x =92, y =1380, label ="Observation time = blue - post-transplant", size =5, color ="#2E9FDF") +ggtitle("Survival Trajectories for Heart Transplant Patients") +ylab("Time (days)") +xlab("Patient Number") +coord_flip() +theme_bw(base_size =20) +theme(axis.text.y =element_text(size =15))```## Naive analysis assuming treatment status is time-fixed### Visualise survival curvesPlotting the **Kaplan-Meier** survival curves are easy by first saving the `survfit` object:`fit <- survfit(Surv(futime, fustat) ~ transplant, data = jasa)`and then passing this `ggsurvplot` which does a nicer job of plotting survival data then using `R`'s `base` functions. Note that we ignore `wait.time` and only specify `futime` in our fit function. This is because we are assuming if a patient was transplanted, the entire duration of their observation period was considered as such.```{r}#| label: naive KM curvefit_naive <-survfit(Surv(futime, fustat) ~ transplant, data = jasa)ggsurvplot(fit_naive,risk.table =TRUE,risk.table.col ="strata",linetype ="strata",surv.median.line ="hv",ggtheme =theme_bw(base_size =20),palette =c("#E7B800", "#2E9FDF"))```### Cox modelFitting a Cox model is also simple with:`mod_naive <- coxph(Surv(futime, fustat) ~ transplant, data = jasa)````{r}#| label: naive Coxmod_naive <-coxph(Surv(futime, fustat) ~ transplant, data = jasa)tbl_regression(mod_naive, exp = T)```This gives a `HR = 0.27 (95% CI 0.17, 0.43; p < 0.001)` indicating that there is about a 73% reduction in the risk of death with transplantation. Pretty effective, right?## Correct analysis assuming treatment status is time-varyingUp until now we have just used the data as it's been presented to us. Each patient has a single observation with all information about them contained in that row of data. However, to perform the correct time-dependent analysis we first need to construct a time-varying version of the treatment (i.e. transplant) variable. This data format is known as **'counting process'** and in the general case involves creating potentially multiple rows of data for each patient with each row corresponding to a different exposure/treatment period of that patients observation time. In this specific example, we will create an additional row of data for transplanted patients splitting time at the point of transplant, so that the first row contains the time from acceptance into the transplant program to the point of transplant, and the second row contains the time from transplant to either death or censoring. We specify this in **'start, stop'** format rather than the duration of the interval itself. We will use the `tmerge`[function](https://www.rdocumentation.org/packages/survival/versions/3.5-7/topics/tmerge){style="color: #1f7de6"} to do this, although a little bit of manual programming can also achieve the same result.```{r}#| label: tmerge# Create subset of data selecting relevant variablesjasa_subset <- jasa |>select(id, wait.time, futime, fustat, transplant)# Can't have an end time of 0 (one obs) - change this to 0.5jasa_subset$futime[jasa_subset$futime ==0] <-0.5# Create dataframe in counting process formatjasa_cp <-tmerge(data1 = jasa_subset |>select(id, futime, fustat), data2 = jasa_subset |>select(id, futime, fustat, wait.time, transplant), id = id, death =event(futime, fustat),transplant =tdc(wait.time)) |>select(-c(futime, fustat))```Remember that the original data looked like:```{r}#| label: view originalhead(jasa_subset, 7)```And the newly created dataframe in counting process format:```{r}#| label: view cphead(jasa_cp, 9)```Note the new 'start, stop' time variables. We have also renamed `fustat` to `death` for a more intuitive name. In this small data subset, Subject's 3, 4 and 7 underwent a transplant, but only the latter two had both unexposed and exposed time periods during observation (Subject 3 was transplanted at the beginning of their observation), hence each subject now has two rows of data.### Visualise survival curvesPlotting the **Kaplan-Meier** survival curves for the data in this correct format reveals a vastly different result to that which we viewed earlier. There is now almost no separation in the curves.```{r}#| label: correct KM curvefit_correct <-survfit(Surv(tstart, tstop, death) ~ transplant, data = jasa_cp)ggsurvplot(fit_correct,risk.table =TRUE,risk.table.col ="strata",linetype ="strata",surv.median.line ="hv",ggtheme =theme_bw(base_size =20),palette =c("#E7B800", "#2E9FDF"))```### Cox modelCommensurately, the output of the Cox model now gives a `HR = 1.13 (95% CI 0.63, 2.04; p = 0.7)` indicating that there is about a 13% increase in the risk of death with transplantation - but this could be as much as a 104% increase or even a 37% decrease. That is, we can't be confident the observed effect didn't occur just by chance. Clearly, this tells a different story to the naive analysis we previously conducted.```{r}#| label: correct Coxmod_correct <-coxph(Surv(tstart, tstop, death) ~ transplant, data = jasa_cp)tbl_regression(mod_correct, exp = T)```The lesson here is to always think about whether your exposure or treatment changes over the course of an individual's observation time, and if it does, to account for that in your survival model by constructing a time-varying covariate.