Target Trial Emulation

Peirong Hao, Kevin Ying, Adam Bress, Tom Greene, Yizhe Xu*

2026-01-29

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(warn = -1)

We previously explained how to implement target trial emulation (TTE) through manual coding for an active-comparator study design. In this tutorial, we focus on a placebo-controlled design, in which individuals in the control group may meet the eligibility criteria at multiple time points (Hernán and Robins 2016). As a result, their time zero is not uniquely defined.

A Placebo-Controlled Design

We simulated a dataset to illustrate a comparison of the effectiveness of ARB (angiotensin receptor blocker) treatment versus no antihypertensive medication in reducing the risk of cardiovascular disease (CVD) among individuals with hypertension, no history of chronic disease, and no ARB use during the previous 2 years. Table 1 shows, side by side, the protocol of the target trial we wish to run and our plan for emulating it with observational data.

Recall the three key steps before implementing TTE:

Example data

We import the data we simulated.

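library(dplyr)          # %>%, group_by(), mutate(), filter(), left_join()
library(slider)         # slide_dbl() for the look-back window
library(lubridate)      # interval() and months()
library(TrialEmulation) # data_preparation()

obsdata1 <- readRDS("obsdata1trt.rds")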

get_label <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) "" else as.character(lbl)
}

dict <- data.frame(
  Variable = names(obsdata1),
  Meaning  = vapply(obsdata1, get_label, character(1)),
  check.names = FALSE
)

knitr::kable(dict, caption = "Data Dictionary", row.names = FALSE)
Data Dictionary

| Variable | Meaning |
|:---------|:--------|
| id | Patient ID |
| time | Time index for longitudinal records (months) |
| X1 | Non-ACEI or ARB antihypertensive medication use over time |
| X2 | Standardized systolic blood pressure over time |
| X3 | Biological sex (M = 1, F = 0) |
| X4 | Standardized diastolic blood pressure at baseline |
| age | Age over time (years) |
| A | Treatment indicator over time (ARB = 1, control = 0) |
| Y | Event indicator of cardiovascular disease |
| C | Indicator of early dropout / censoring |

Emulating a sequence of trials

Suppose treatment and covariate information is updated monthly in our observational data, so we treat each month as a separate enrollment period. For instance, the first enrollment period is Jan. 2017, the second is Feb. 2017, and so on.
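
To make the time-zero problem concrete, here is a minimal sketch on a small hypothetical person-month table (toy_pm is invented for illustration and is not the simulated data). It lists every month at which each person is eligible. Each such month is a candidate time zero, so one person can enter several emulated trials, possibly under different treatment assignments.

```r
# Hypothetical person-month records: 2 people observed at months 0, 1, 2.
toy_pm <- data.frame(
  id       = c(1, 1, 1, 2, 2, 2),
  month    = c(0, 1, 2, 0, 1, 2),
  A        = c(0, 0, 0, 0, 1, 1),  # person 2 starts treatment at month 1
  eligible = c(1, 1, 1, 1, 1, 0)   # person 2 is ineligible once already treated
)

# Every eligible person-month is one entry into the trial starting that month.
entries <- toy_pm[toy_pm$eligible == 1, c("id", "month", "A")]
names(entries) <- c("id", "trial_period", "assigned_treatment")
rownames(entries) <- NULL
print(entries)

# Number of emulated trials each person enters
n_trials <- table(entries$id)
print(n_trials)
```

In this toy table, person 1 enters three trials as a control, while person 2 enters the month-0 trial as a control and the month-1 trial as treated.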

In contrast to emulating a single trial, we need to construct a pooled dataset by stacking the separate data from each trial. Here are a few things to keep in mind during this process:

Before using the TrialEmulation R package, it is crucial to understand what the data_preparation function does. We explain the steps using a toy example with only three enrollment times: the beginning of the overall study (month 0), month 1, and month 2.

obsdata2 <- obsdata1 %>%
  group_by(id) %>%
  # Eligible if aged 50 or older, no prior CVD event, and no ACEI or ARB
  # treatment (A) in the previous 24 months
  mutate(eligible = as.integer(
    age >= 50 &
      cumsum(Y) == 0 &
      slide_dbl(
        A,
        sum,
        .before = 24,      # window starts 24 rows back
        .after = -1,       # and ends at the previous row (current row excluded)
        .complete = TRUE   # NA when fewer than 24 earlier rows exist
      ) == 0
  )) %>%
  mutate(eligible = ifelse(is.na(eligible), 0, eligible)) %>%  # if unknown, treat as not eligible
  ungroup()

# Find the first date on which anyone is eligible; enrollment can start on this date
start.date <- obsdata2 %>%
  filter(eligible == 1) %>%
  summarise(start.date = min(time))

# For convenience, convert time to whole months since that date
obsdata2 <- obsdata2 %>%
  mutate(month = interval(start.date$start.date, time) %/% months(1))  # month 0 is time zero for the 1st trial
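
The look-back rule can be hard to read inside the pipeline, so the following base-R sketch spells out what it is intended to compute, using a 2-row window instead of 24. The helper lookback_sum and the vector A_toy are hypothetical names introduced only for this illustration; this is not the slider implementation.

```r
# Hypothetical helper: sum of x over the `width` rows strictly before row i.
# Returns NA when fewer than `width` earlier rows exist (an incomplete window).
lookback_sum <- function(x, width) {
  vapply(seq_along(x), function(i) {
    if (i <= width) return(NA_real_)
    sum(x[(i - width):(i - 1)])
  }, numeric(1))
}

A_toy <- c(0, 0, 1, 0, 0, 0)   # treatment indicator for one person, one row per month
prior_use <- lookback_sum(A_toy, width = 2)
print(prior_use)

# Same recoding as in the tutorial: an unknown history counts as not eligible.
eligible_toy <- as.integer(prior_use == 0)
eligible_toy[is.na(eligible_toy)] <- 0L
print(eligible_toy)
```

One consequence worth noting: because incomplete windows are recoded to 0, a person cannot be flagged as eligible until a full look-back window of earlier records exists.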

# Select individuals eligible at month 0 and record their baseline values
eligible1 <- obsdata2 %>%
  filter(eligible == 1 & month == 0) %>%    # time zero is the original study baseline for the 1st trial
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) # baseline covariates in the 1st trial are the baseline covariates of the study

trial.1 <- obsdata2 %>%
  filter(id %in% eligible1$id & month >= 0) %>%   # month zero is the original study baseline for the 1st trial
  mutate(trial = 0,   # create an emulated trial indicator
         follow_up = month) %>%  # no adjustment of follow-up time is needed since the 1st trial shares its baseline with the entire study
  left_join(eligible1, by = "id")

# Select individuals eligible at month 1 and record their baseline values
eligible2 <- obsdata2 %>%
  filter(eligible == 1 & month == 1) %>%   # time zero is month 1 for the 2nd trial
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) # baseline covariates in the 2nd trial are the covariates at month 1

trial.2 <- obsdata2 %>%
  filter(id %in% eligible2$id & month >= 1) %>% # time zero is month 1 for the 2nd trial
  mutate(trial = 1,       # create an emulated trial indicator
         follow_up = month - 1) %>%     # shift follow-up time so that it starts at 0
  left_join(eligible2, by = "id")

# Select individuals eligible at month 2 and record their baseline values
eligible3 <- obsdata2 %>%
  filter(eligible == 1 & month == 2) %>%    # time zero is month 2 for the 3rd trial
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) # baseline covariates in the 3rd trial are the covariates at month 2

trial.3 <- obsdata2 %>%
  filter(id %in% eligible3$id & month >= 2) %>% # time zero is month 2 for the 3rd trial
  mutate(trial = 2,       # create an emulated trial indicator
         follow_up = month - 2) %>%     # shift follow-up time so that it starts at 0
  left_join(eligible3, by = "id")

obsdata2.all.trials <- data.frame(rbind(trial.1, trial.2, trial.3)) %>%
  rename(trial_period = trial,
         followup_time = follow_up,
         treatment = A, 
         outcome = Y)
head(obsdata2.all.trials, n=10)
#>    id       time X1          X2 X3        X4      age treatment outcome C
#> 1  46 2000-01-01  0  0.86415249  1 -1.377567 68.93422         0       0 0
#> 2  46 2000-02-01  0  1.01366731  1 -1.377567 69.01756         0       0 0
#> 3  46 2000-03-01  1  2.68932942  1 -1.377567 69.10089         0       0 0
#> 4  46 2000-04-01  0  1.26360835  1 -1.377567 69.18422         0       0 0
#> 5  46 2000-05-01  0  1.55297235  1 -1.377567 69.26756         0       0 0
#> 6  46 2000-06-01  1 -2.70904796  1 -1.377567 69.35089         0       0 0
#> 7  46 2000-07-01  1 -2.04404444  1 -1.377567 69.43422         0       0 0
#> 8  46 2000-08-01  1 -2.53416818  1 -1.377567 69.51756         0       0 0
#> 9  46 2000-09-01  1 -2.48962305  1 -1.377567 69.60089         0       0 0
#> 10 46 2000-10-01  1 -0.09444011  1 -1.377567 69.68422         0       0 0
#>    eligible month trial_period followup_time assigned_treatment X1_0      X2_0
#> 1         1     0            0             0                  0    0 0.8641525
#> 2         1     1            0             1                  0    0 0.8641525
#> 3         1     2            0             2                  0    0 0.8641525
#> 4         1     3            0             3                  0    0 0.8641525
#> 5         1     4            0             4                  0    0 0.8641525
#> 6         1     5            0             5                  0    0 0.8641525
#> 7         1     6            0             6                  0    0 0.8641525
#> 8         1     7            0             7                  0    0 0.8641525
#> 9         1     8            0             8                  0    0 0.8641525
#> 10        1     9            0             9                  0    0 0.8641525
#>    X3_0      X4_0    age_0
#> 1     1 -1.377567 68.93422
#> 2     1 -1.377567 68.93422
#> 3     1 -1.377567 68.93422
#> 4     1 -1.377567 68.93422
#> 5     1 -1.377567 68.93422
#> 6     1 -1.377567 68.93422
#> 7     1 -1.377567 68.93422
#> 8     1 -1.377567 68.93422
#> 9     1 -1.377567 68.93422
#> 10    1 -1.377567 68.93422
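
The three trial blocks above follow one pattern, which can be sketched as a single function. The helper make_trial and the data frame toy below are hypothetical and use only one covariate to stay short. The sketch assumes the same column names as obsdata2 (id, month, A, eligible) and uses inner_join in place of the filter-then-left_join step, which keeps the same rows when the baseline table has one row per person.

```r
library(dplyr)

# Hypothetical helper: build the data for the emulated trial starting at month k.
make_trial <- function(data, k) {
  # Baseline values come from the rows that are eligible at month k
  baseline <- data %>%
    filter(eligible == 1, month == k) %>%
    select(id, assigned_treatment = A, X1_0 = X1)

  data %>%
    filter(month >= k) %>%
    inner_join(baseline, by = "id") %>%   # keeps only people eligible at month k
    mutate(trial_period = k,
           followup_time = month - k)     # follow-up restarts at 0 in each trial
}

# Hypothetical toy data: 2 people followed over months 0 to 2
toy <- data.frame(
  id       = rep(1:2, each = 3),
  month    = rep(0:2, times = 2),
  A        = c(0, 0, 0, 0, 1, 1),
  X1       = c(5, 6, 7, 1, 2, 3),
  eligible = c(1, 1, 1, 1, 1, 0)
)

# Stack the trials starting at months 0, 1, and 2
stacked <- bind_rows(lapply(0:2, make_trial, data = toy))
print(stacked)
```

Writing the step as a function makes it easier to extend the toy example beyond three enrollment months without copying code.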

Using the TrialEmulation R package

We now use the data_preparation function to prepare the data for emulating a sequence of trials and focus on the primary intention-to-treat estimand.

prep_ITT_data <- data_preparation(
  data = obsdata2,
  id = "id", 
  period = "month", 
  treatment = "A",
  outcome = "Y", 
  eligible = "eligible",  # indicator of eligibility for the target trial at that visit/period
  estimand_type = "ITT",
  outcome_cov = ~ X1 + X2 + X3 + X4 + age,
  model_var = "assigned_treatment",
  use_censor_weights = FALSE,
  first_period = 0,
  last_period = 2,
  quiet = TRUE,
  control = list(maxit = 100))

dt <- data.frame(prep_ITT_data$data)
dt <- dt %>% 
  rename(X1_0=X1, X2_0=X2, X3_0=X3, X4_0=X4, age_0=age) %>%
  arrange(trial_period, id, followup_time)

Let us compare the data set we prepared manually with the one produced by the data_preparation function.

table(dt$trial_period==obsdata2.all.trials$trial_period)
#> 
#> TRUE 
#> 1026
table(dt$id==obsdata2.all.trials$id)
#> 
#> TRUE 
#> 1026
table(dt$followup_time==obsdata2.all.trials$followup_time)
#> 
#> TRUE 
#> 1026
table(dt$treatment==obsdata2.all.trials$treatment)
#> 
#> TRUE 
#> 1026
table(dt$outcome==obsdata2.all.trials$outcome)
#> 
#> TRUE 
#> 1006
table(dt$age_0==obsdata2.all.trials$age_0)
#> 
#> TRUE 
#> 1026
table(dt$X1_0==obsdata2.all.trials$X1_0)
#> 
#> TRUE 
#> 1026
table(dt$X2_0==obsdata2.all.trials$X2_0)
#> 
#> TRUE 
#> 1026
table(dt$X3_0==obsdata2.all.trials$X3_0)
#> 
#> TRUE 
#> 1026
table(dt$X4_0==obsdata2.all.trials$X4_0)
#> 
#> TRUE 
#> 1026

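These comparisons return no FALSE values, so the two data sets agree on every variable wherever a comparison could be made. The outcome check reports 1,006 rather than 1,026 TRUE values: table() drops NA by default, so the remaining 20 comparisons evaluated to NA, meaning the outcome is missing in at least one of the two data sets for those rows. Finer checks are possible, but both data sets are now ready for downstream analyses.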
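
As one possible finer check, the sketch below shows how missing values can hide inside an elementwise comparison and how to surface them. The vectors manual and package are hypothetical stand-ins for the same column taken from two data sets; they are not values from the simulated data.

```r
# Hypothetical outcome columns from two versions of the same stacked data
manual  <- c(0, 0, 1, 0, NA)
package <- c(0, 0, 1, NA, NA)

# Default table() silently drops comparisons that evaluate to NA
print(table(manual == package))

# useNA = "ifany" reports the NA comparisons as their own category
print(table(manual == package, useNA = "ifany"))

# Stricter rule: two NAs count as equal, NA versus a value counts as a mismatch
same <- (is.na(manual) & is.na(package)) |
  (!is.na(manual) & !is.na(package) & manual == package)
print(which(!same))   # positions where the two columns genuinely differ
```

Applying the same pattern to each pair of columns would show whether NA comparisons reflect matching missingness or a real difference between the two data sets.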

Funding

The research reported in this publication was supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UM1 TR 004409. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Danaei, Goodarz, Luis A García Rodríguez, Oscar Fernández Cantero, Roger Logan, and Miguel A Hernán. 2013. “Observational Data for Comparative Effectiveness Research: An Emulation of Randomised Trials of Statins and Primary Prevention of Coronary Heart Disease.” Statistical Methods in Medical Research 22 (February): 70–96. https://doi.org/10.1177/0962280211403603.
Hernán, Miguel A, and James M Robins. 2016. “Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available.” American Journal of Epidemiology 183: 758–64.