Target Trial Emulation

We have explained how to implement target trial emulation (TTE) through manual coding for an active-comparator study design. In this tutorial, we focus on a placebo-controlled design, where individuals in the control group may meet the eligibility criteria multiple times (Hernán and Robins 2016). This makes it unclear how to define their time zero.

A Placebo-control Design

We simulated a data set as an example to compare the effectiveness of angiotensin receptor blockers (ARBs) versus no antihypertensive medication in reducing the risk of cardiovascular disease (CVD) among individuals with hypertension who have no history of chronic disease and have not used ARBs during the previous 2 years. Table 1 shows, side by side, the protocol of the target trial we wish to run and our plan for emulating it with observational data.

Recall the three key steps before implementing TTE:

Step 1: Clearly define the research questions of interest. In our case, they are:
- Primary: What is the effect of initiating ARBs versus no medication on CVD risk?
- Secondary: What is the effect of initiating and continuously using ARBs versus no medication on CVD risk?
- Note: Vaguely defined questions are troublesome because they affect the study design, data preparation, and analysis methods used to answer them, which can lead to biased and misleading findings.
Step 2: Clarify study design
- Our example uses an observational, new-user, placebo-controlled, retrospective cohort design
- Individuals in the control group who continuously meet the eligibility criteria in Table 1 would be eligible for the target trial at multiple times during their lifetime; that is, they have multiple times that could qualify as time zero. When should their follow-up start in the observational study?
- Two unbiased options:
  - All eligible times:
    - This option is more efficient by using all the data from each individual
    - A classic use case: Data with a small sample size or low event rate
    - However, it requires emulating a sequence of trials, each with a different start of follow-up
    - The analysis is more complicated because it must account for the same individual’s data being used multiple times in the study
  - A random eligible time:
    - Simple but less efficient
    - This option is appropriate when there is sufficient statistical power in a study
Step 3: Build a data set that fits the purpose of the study
- A fit-for-purpose data set can answer these questions with minimal bias
- It is a false belief that one data set can be used to answer any question
- To build a fit-for-purpose data set, we follow Danaei et al. (2013), who demonstrated how to emulate a sequence of trials with observational data; the TrialEmulation R package implements this approach
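
The two options for choosing time zero in Step 2 can be illustrated with a minimal base-R sketch. The toy data frame and the helper pick_one are hypothetical and are not part of the simulated data used in this tutorial; the sketch only shows how the candidate time zeros differ between the two options.

```r
# Toy person-month data (hypothetical): one row per individual per month
set.seed(2024)
toy <- data.frame(
  id       = rep(1:3, each = 4),
  month    = rep(0:3, times = 3),
  eligible = c(1, 1, 0, 1,   # id 1: eligible at months 0, 1, 3
               0, 1, 1, 1,   # id 2: eligible at months 1, 2, 3
               0, 0, 1, 0)   # id 3: eligible at month 2 only
)

elig <- toy[toy$eligible == 1, ]

# Option 1: all eligible times -> every eligible person-month is a time zero
all_times <- elig[, c("id", "month")]

# Option 2: a random eligible time -> one time zero per individual
# (sample.int avoids the pitfall of sample() on a length-one vector)
pick_one <- function(m) m[sample.int(length(m), 1)]
random_time <- aggregate(month ~ id, data = elig, FUN = pick_one)

nrow(all_times)    # 7 candidate time zeros
nrow(random_time)  # 3, one per individual
```

Option 1 keeps all seven eligible person-months and therefore leads to a sequence of trials, whereas option 2 keeps a single row per individual and can be analyzed as one trial.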

Example data

We import the data we simulated.

obsdata1 <- readRDS("obsdata1trt.rds")

get_label <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) "" else as.character(lbl)
}

dict <- data.frame(
  Variable = names(obsdata1),
  Meaning  = vapply(obsdata1, get_label, character(1)),
  check.names = FALSE
)

knitr::kable(dict, caption = "Data Dictionary", row.names = FALSE)

Data Dictionary
Variable	Meaning
id	Patient ID
time	Time index for longitudinal records (months)
X1	Non-ACEI or ARB antihypertensive medication use over time
X2	Standardized systolic blood pressure over time
X3	Biological sex (M=1, F=0)
X4	Standardized diastolic blood pressure at baseline
age	Age over time (years)
A	Treatment indicator over time (ARB = 1, control = 0)
Y	Event indicator of cardiovascular disease
C	Indicator of early dropout / censoring

Emulating a sequence of trials

Suppose treatment and covariate information is updated monthly in our observational data, so we treat each month as a separate enrollment period. For instance, the first enrollment period is Jan. 2017, the second is Feb. 2017, and so on.

In contrast to emulating a single trial, we need to construct a pooled dataset by stacking the separate data from each trial. Here are a few things to keep in mind during this process:

- Each emulated trial has a different baseline (time zero), so time-varying baseline information (covariates and treatment) needs to be set to its value at that trial's time zero. For instance, the baseline systolic blood pressure (SBP) for Trial 1 is the SBP value in Jan. 2017, but the baseline SBP for Trial 2 is the SBP value in Feb. 2017
- The same individual could contribute data to one emulated trial as a non-initiator and to a later emulated trial as an ARB initiator, since their treatment status changes over time
- Each time we move the enrollment period forward by one month, follow-up starts one month later and the available follow-up time is one month shorter, so we need to adjust the follow-up time in each emulated trial accordingly
- For this reason, it is good practice to prespecify an overall index window, say 01/01/2017 - 01/01/2020, and a study period, say 01/01/2017 - 01/01/2024, so that even individuals in the last emulated trial can be followed for a sufficient amount of time, here 4 years
- Create an emulated trial indicator for downstream analyses, so that treatment effects can be held constant or allowed to vary across trials
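
As a rough check of the window arithmetic in the list above, the following base-R sketch computes the maximum possible follow-up for each monthly enrollment date. It uses the example dates from the text; the helper months_between is a hypothetical name introduced here for illustration.

```r
# Example index window and study end taken from the text above
index_start <- as.Date("2017-01-01")
index_end   <- as.Date("2020-01-01")
study_end   <- as.Date("2024-01-01")

# One enrollment date (time zero) per month in the index window
enroll <- seq(index_start, index_end, by = "month")

# Whole calendar months between two first-of-month dates (hypothetical helper)
months_between <- function(from, to) {
  12 * (as.integer(format(to, "%Y")) - as.integer(format(from, "%Y"))) +
    (as.integer(format(to, "%m")) - as.integer(format(from, "%m")))
}

max_followup <- months_between(enroll, study_end)

length(enroll)       # 37 monthly enrollment dates
range(max_followup)  # 48 84: from 4 years (last trial) to 7 years (first trial)
```

Under these assumed dates, the last emulated trial still allows 48 months of follow-up, which matches the 4 years mentioned above.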

Before using the TrialEmulation R package, it is crucial to understand what the data_preparation function does. We explain the steps using a toy example with only three enrollment times: the beginning of the overall study (month 0), month 1, and month 2.

- Patients without hypertension and patients with CVD have already been excluded from our observational data in a separate pre-processing step
- We apply the remaining eligibility criteria (the first two in Table 1) to our data and create an “eligible” indicator

library(dplyr)   # group_by(), mutate(), filter(), %>%
library(slider)  # slide_dbl()

# Eligible if aged 50 or older, no prior CVD event, and no ARB use in the previous 24 months
obsdata2 <- obsdata1 %>%
  group_by(id) %>%
  mutate(eligible = as.integer(
    age >= 50 &
      cumsum(Y) == 0 &
      slide_dbl(
        A,
        sum,
        .before = 24,      # look back 24 rows (months)
        .after = -1,       # exclude the current row
        .complete = TRUE   # NA when fewer than 24 earlier rows exist
      ) == 0
  )) %>%
  mutate(eligible = ifelse(is.na(eligible), 0, eligible)) %>%  # if unknown, then not eligible
  ungroup()
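
To make the lookback window concrete, here is a base-R sketch that is intended to mirror the sliding sum above, using a 3-row window instead of 24 so the result is easy to read. The function lookback_sum and the vector arb_use are hypothetical names used only for this illustration.

```r
# Sum of the k values immediately before position i, excluding position i itself.
# Returns NA when fewer than k earlier values exist (an incomplete window).
lookback_sum <- function(x, k) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i > k) out[i] <- sum(x[(i - k):(i - 1)])
  }
  out
}

arb_use <- c(0, 0, 1, 0, 0, 0, 0)   # hypothetical monthly ARB indicator for one person
prior   <- lookback_sum(arb_use, k = 3)
prior
# NA NA NA  1  1  1  0

# "No recent use" holds only when the window is complete and contains no use
no_recent_use <- as.integer(!is.na(prior) & prior == 0)
no_recent_use
# 0 0 0 0 0 0 1
```

One consequence worth keeping in mind: because incomplete windows are treated as not eligible, an individual's first 24 records can never count as an eligible time under the 24-month rule.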

Create data for the 1st emulated trial

library(lubridate)  # interval(), months()

# Find the first date on which any individual becomes eligible; enrollment can start on this date
start.date <- obsdata2 %>%
  filter(eligible == 1) %>%
  summarise(start.date = min(time))

# For convenience, convert time to whole months since the start date
obsdata2 <- obsdata2 %>%
  mutate(month = interval(start.date$start.date, time) %/% months(1))  # month 0 is time zero of the 1st trial

# Select individuals eligible at month 0 and record their baseline values
eligible1 <- obsdata2 %>%
  filter(eligible == 1 & month == 0) %>%    # time zero is the original study baseline for the 1st trial
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) # baseline covariates of the 1st trial equal those of the study

trial.1 <- obsdata2 %>%
  filter(id %in% eligible1$id & month >= 0) %>%
  mutate(trial = 0,              # emulated trial indicator
         follow_up = month) %>%  # no adjustment needed: the 1st trial shares the study baseline
  left_join(eligible1, by = "id")

Create data for the 2nd emulated trial

# Select individuals eligible at month 1 and record their baseline values
eligible2 <- obsdata2 %>%
  filter(eligible == 1 & month == 1) %>%   # time zero is month 1 for the 2nd trial
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) # baseline covariates of the 2nd trial are the values at month 1

trial.2 <- obsdata2 %>%
  filter(id %in% eligible2$id & month >= 1) %>%  # keep records from month 1 onward
  mutate(trial = 1,                  # emulated trial indicator
         follow_up = month - 1) %>%  # shift follow-up so that it starts at 0
  left_join(eligible2, by = "id")

Create data for the 3rd emulated trial

# Select individuals eligible at month 2 and record their baseline values
eligible3 <- obsdata2 %>%
  filter(eligible == 1 & month == 2) %>%    # time zero is month 2 for the 3rd trial
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) # baseline covariates of the 3rd trial are the values at month 2

trial.3 <- obsdata2 %>%
  filter(id %in% eligible3$id & month >= 2) %>%  # keep records from month 2 onward
  mutate(trial = 2,                  # emulated trial indicator
         follow_up = month - 2) %>%  # shift follow-up so that it starts at 0
  left_join(eligible3, by = "id")

Now we stack the data from the three emulated trials:

obsdata2.all.trials <- data.frame(rbind(trial.1, trial.2, trial.3)) %>%
  rename(trial_period = trial,
         followup_time = follow_up,
         treatment = A, 
         outcome = Y)
head(obsdata2.all.trials, n=10)
#>    id       time X1          X2 X3        X4      age treatment outcome C
#> 1  46 2000-01-01  0  0.86415249  1 -1.377567 68.93422         0       0 0
#> 2  46 2000-02-01  0  1.01366731  1 -1.377567 69.01756         0       0 0
#> 3  46 2000-03-01  1  2.68932942  1 -1.377567 69.10089         0       0 0
#> 4  46 2000-04-01  0  1.26360835  1 -1.377567 69.18422         0       0 0
#> 5  46 2000-05-01  0  1.55297235  1 -1.377567 69.26756         0       0 0
#> 6  46 2000-06-01  1 -2.70904796  1 -1.377567 69.35089         0       0 0
#> 7  46 2000-07-01  1 -2.04404444  1 -1.377567 69.43422         0       0 0
#> 8  46 2000-08-01  1 -2.53416818  1 -1.377567 69.51756         0       0 0
#> 9  46 2000-09-01  1 -2.48962305  1 -1.377567 69.60089         0       0 0
#> 10 46 2000-10-01  1 -0.09444011  1 -1.377567 69.68422         0       0 0
#>    eligible month trial_period followup_time assigned_treatment X1_0      X2_0
#> 1         1     0            0             0                  0    0 0.8641525
#> 2         1     1            0             1                  0    0 0.8641525
#> 3         1     2            0             2                  0    0 0.8641525
#> 4         1     3            0             3                  0    0 0.8641525
#> 5         1     4            0             4                  0    0 0.8641525
#> 6         1     5            0             5                  0    0 0.8641525
#> 7         1     6            0             6                  0    0 0.8641525
#> 8         1     7            0             7                  0    0 0.8641525
#> 9         1     8            0             8                  0    0 0.8641525
#> 10        1     9            0             9                  0    0 0.8641525
#>    X3_0      X4_0    age_0
#> 1     1 -1.377567 68.93422
#> 2     1 -1.377567 68.93422
#> 3     1 -1.377567 68.93422
#> 4     1 -1.377567 68.93422
#> 5     1 -1.377567 68.93422
#> 6     1 -1.377567 68.93422
#> 7     1 -1.377567 68.93422
#> 8     1 -1.377567 68.93422
#> 9     1 -1.377567 68.93422
#> 10    1 -1.377567 68.93422
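
The three manual blocks follow the same pattern, so they can be generalized to any number of enrollment periods. The following base-R sketch shows one way to do this on a small hypothetical data set; the helper build_trial and the toy values are assumptions for illustration and are not taken from the TrialEmulation package.

```r
# Toy person-month data (hypothetical): two individuals, four months each
toy <- data.frame(
  id       = rep(1:2, each = 4),
  month    = rep(0:3, times = 2),
  A        = c(0, 0, 1, 1,  0, 0, 0, 0),                 # id 1 starts ARB at month 2
  X2       = c(0.5, 0.7, 0.9, 1.1,  -0.2, -0.1, 0.0, 0.3),
  eligible = c(1, 1, 1, 0,  0, 1, 1, 1)
)

# Build the data for the emulated trial whose time zero is month m
build_trial <- function(data, m) {
  # Baseline treatment and covariate values at this trial's time zero
  base <- data[data$eligible == 1 & data$month == m, c("id", "A", "X2")]
  names(base) <- c("id", "assigned_treatment", "X2_0")
  # All records from time zero onward for the enrolled individuals
  rows <- data[data$id %in% base$id & data$month >= m, ]
  rows$trial_period  <- m
  rows$followup_time <- rows$month - m   # follow-up restarts at 0 in every trial
  merge(rows, base, by = "id")
}

# Stack trials 0, 1, and 2
stacked <- do.call(rbind, lapply(0:2, build_trial, data = toy))
stacked <- stacked[order(stacked$trial_period, stacked$id, stacked$followup_time), ]

nrow(stacked)  # 14 rows: 4 (trial 0) + 6 (trial 1) + 4 (trial 2)
unique(stacked[stacked$id == 1, c("trial_period", "assigned_treatment")])
```

In this toy example, individual 1 enters trials 0 and 1 as a non-initiator and trial 2 as an initiator, which is the situation described earlier for individuals whose treatment status changes over time.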

Use TrialEmulation R package

We now use the data_preparation function to prepare the data for emulating a sequence of trials and focus on the primary intention-to-treat estimand.

library(TrialEmulation)  # data_preparation()

prep_ITT_data <- data_preparation(
  data = obsdata2,
  id = "id", 
  period = "month", 
  treatment = "A",
  outcome = "Y", 
  eligible = "eligible",  # indicator of eligibility for the target trial at that visit/period
  estimand_type = "ITT",
  outcome_cov = ~ X1 + X2 + X3 + X4 + age,
  model_var = "assigned_treatment",
  use_censor_weights = FALSE, 
  first_period = 0,       # first enrollment period (month 0)
  last_period = 2,        # last enrollment period (month 2)
  quiet = TRUE,
  control = list(maxit = 100))

dt <- data.frame(prep_ITT_data$data)
dt <- dt %>% 
  rename(X1_0=X1, X2_0=X2, X3_0=X3, X4_0=X4, age_0=age) %>%
  arrange(trial_period, id, followup_time)

Let us compare the data set we prepared manually with the one produced by the data_preparation function.

table(dt$trial_period==obsdata2.all.trials$trial_period)
#> 
#> TRUE 
#> 1026
table(dt$id==obsdata2.all.trials$id)
#> 
#> TRUE 
#> 1026
table(dt$followup_time==obsdata2.all.trials$followup_time)
#> 
#> TRUE 
#> 1026
table(dt$treatment==obsdata2.all.trials$treatment)
#> 
#> TRUE 
#> 1026
table(dt$outcome==obsdata2.all.trials$outcome)
#> 
#> TRUE 
#> 1006
table(dt$age_0==obsdata2.all.trials$age_0)
#> 
#> TRUE 
#> 1026
table(dt$X1_0==obsdata2.all.trials$X1_0)
#> 
#> TRUE 
#> 1026
table(dt$X2_0==obsdata2.all.trials$X2_0)
#> 
#> TRUE 
#> 1026
table(dt$X3_0==obsdata2.all.trials$X3_0)
#> 
#> TRUE 
#> 1026
table(dt$X4_0==obsdata2.all.trials$X4_0)
#> 
#> TRUE 
#> 1026

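The comparisons return no FALSE values, so the two data sets agree on every value that could be compared. Note that the outcome comparison counts 1006 rows rather than 1026: table() drops NA results by default, so 20 rows have a missing outcome in at least one of the two data sets and should be inspected separately. Finer checks can also be made. With that caveat, both data sets are ready for downstream analyses.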
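
One possible finer check is to make missing values visible in the comparison. The base-R sketch below uses two short hypothetical vectors, not the tutorial data, and a hypothetical helper agree that treats two missing values as agreeing.

```r
# Two hypothetical columns to compare; each has one missing value
a <- c(0, 1, 0, NA, 0)
b <- c(0, 1, 0, 0, NA)

table(a == b)                    # counts TRUE only; NA comparisons are dropped
table(a == b, useNA = "ifany")   # also reports how many comparisons are NA

# Agreement that counts "both missing" as a match and never returns NA
agree <- function(x, y) {
  both_na <- is.na(x) & is.na(y)
  equal   <- !is.na(x) & !is.na(y) & x == y
  both_na | equal
}

sum(!agree(a, b))   # 2 rows where the columns do not agree
which(!agree(a, b)) # positions 4 and 5
```

For numeric columns that may differ by floating-point rounding, all.equal() is usually a safer comparison than ==.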

Peirong Hao, Kevin Ying, Adam Bress, Tom Greene, Yizhe Xu*

2026-01-29