Target Trial Emulation

Peirong Hao, Kevin Ying, Adam Bress, Tom Greene, Yizhe Xu*

2026-01-29

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(warn = -1)

Introduction

Randomized controlled trials (RCTs) are the gold standard for evaluating medical interventions, but they are often impractical, slow, or costly. Modern causal inference methods, coupled with real-world data (RWD), offer a faster and often well-powered way to estimate treatment effects in diverse patient populations. Yet several barriers hinder their effective use in translational research. A key challenge is aligning “time zero,” the moment at which the cohort is defined, eligibility is assessed, and treatment strategies are assigned. When these three are misaligned, the analysis can suffer from biases, such as immortal time bias, that are difficult to correct afterward.

Target trial emulation (TTE) addresses these barriers by emulating RCT protocols with observational data, e.g., electronic health records (EHRs) (Hernán and Robins 2016; Hernán, Wang, and Leaf 2022). The TTE framework has proven especially useful for improving communication between statisticians and clinicians and for integrating clinical insight with sound study design and incisive data analysis, which enhances the quality of research. High-quality observational studies built on TTE can complement findings from RCTs and provide actionable evidence (Wang et al. 2023).

This vignette shows how to build a fit-for-purpose dataset that minimizes bias. We briefly review the five key components of the TTE framework and explain why clear definitions for each are essential.

Eligibility Criteria

Treatment Strategies

Assignment Procedures

Outcome Identification and Validation

Time Zero

Implementation of TTE framework

An Active-comparator Design

Suppose we are interested in comparing the effectiveness of angiotensin receptor blockers (ARBs) versus angiotensin-converting enzyme inhibitors (ACEIs), two classes of antihypertensive medication, in reducing the risk of cardiovascular disease (CVD).

Before we begin, we import a simulated observational dataset representing patients with hypertension who are followed over time to compare the risk of incident CVD under ACEI versus ARB treatment. The dataset includes baseline covariates, time-varying covariates, treatment, and outcome indicators, allowing us to demonstrate the implementation of TTE and causal inference methods in a realistic longitudinal setting. For details on the data-generating mechanism and parameter settings, please refer to the data simulation function.

# Load packages (dplyr provides %>%, group_by(), mutate(), lag(), etc.)
library(dplyr)

# Load data
obsdata <- readRDS("obsdata.rds")

# Function to extract variable label
get_label <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) "" else as.character(lbl)
}

# Create data dictionary
dict <- data.frame(
  Variable = names(obsdata),
  Meaning  = vapply(obsdata, get_label, character(1)),
  check.names = FALSE
)

# Present data dictionary
knitr::kable(dict, caption = "Data Dictionary", row.names = FALSE)
Data Dictionary

Variable  Meaning
--------  ---------------------------------------------------------
id        Patient ID
time      Time index for longitudinal records
X1        Non-ACEI or ARB antihypertensive medication use over time
X2        Standardized systolic blood pressure over time
X3        Biological sex (F/M)
X4        Standardized diastolic blood pressure at baseline
age       Age over time (years)
A         Treatment indicator of ARB use over time (ACEI=0)
Y         Event indicator of cardiovascular disease
C         Indicator of early dropout
age_s     Standardized age over time (years)
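The dictionary relies on each column carrying a "label" attribute. As a minimal sketch of how that works, the toy data frame below (hypothetical values, not taken from obsdata) labels two of its three columns and reuses the same helper; the unlabeled column falls back to an empty string.

```r
# Toy data frame with hypothetical values; only id and A receive a label
toy <- data.frame(id = 1:3, A = c(0, 1, 0), Y = c(0, 0, 1))
attr(toy$id, "label") <- "Patient ID"
attr(toy$A, "label") <- "Treatment indicator"

# Same helper as in the vignette: return the label, or "" if none is set
get_label <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) "" else as.character(lbl)
}

# Build the dictionary: one row per column
toy_dict <- data.frame(
  Variable = names(toy),
  Meaning = vapply(toy, get_label, character(1)),
  check.names = FALSE
)

toy_dict$Meaning
```

The Y column has no label, so its Meaning entry is empty. In a real dataset, an empty entry is a quick way to spot undocumented variables.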

Write your own code to implement TTE

Table 1 shows the protocol of the target trial we wish to run side by side with our plan for emulating it with observational data.

Component 1: Eligibility criteria

obsdata1 <- obsdata %>%
  group_by(id) %>%
  # lag(A) and lag(A, 2) give the treatment one and two years earlier, respectively.
  # Both are NA at a subject's first records, so eligible can be NA there.
  mutate(eligible = as.integer(age >= 50 & cumsum(Y) == 0 & (lag(A) + lag(A, 2)) == 0)) %>%
  ungroup()

Let us see when Subject 2 becomes eligible. Subject 2 is not eligible from time 0 to 6 because age is below 50, and not eligible from time 7 to 9 because of ARB use in the previous 2 years. Subject 2 first becomes eligible at time 10.

obsdata1 %>% filter(id==2)
#> # A tibble: 20 × 12
#>       id  time    X1      X2    X3    X4   age     A     Y     C age_s eligible
#>    <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <int>
#>  1     2     0     1  0.177      1 0.654  43.7     1     0     0 0.722        0
#>  2     2     1     1 -0.0711     1 0.654  44.7     1     0     0 0.805        0
#>  3     2     2     0 -0.363      1 0.654  45.7     1     0     0 0.888        0
#>  4     2     3     0 -1.27       1 0.654  46.7     1     0     0 0.972        0
#>  5     2     4     0  0.387      1 0.654  47.7     1     0     0 1.06         0
#>  6     2     5     0 -1.96       1 0.654  48.7     1     0     0 1.14         0
#>  7     2     6     1  1.20       1 0.654  49.7     1     0     0 1.22         0
#>  8     2     7     0  0.960      1 0.654  50.7     1     0     0 1.31         0
#>  9     2     8     0 -0.624      1 0.654  51.7     0     0     0 1.39         0
#> 10     2     9     1  1.36       1 0.654  52.7     0     0     0 1.47         0
#> 11     2    10     0 -0.893      1 0.654  53.7     0     0     0 1.56         1
#> 12     2    11     1 -0.0871     1 0.654  54.7     0     0     0 1.64         1
#> 13     2    12     0 -0.752      1 0.654  55.7     0     0     0 1.72         1
#> 14     2    13     1 -0.142      1 0.654  56.7     1     0     0 1.81         1
#> 15     2    14     1 -0.0428     1 0.654  57.7     1     0     0 1.89         0
#> 16     2    15     0  0.965      1 0.654  58.7     1     0     0 1.97         0
#> 17     2    16     0  0.887      1 0.654  59.7     1     0     0 2.06         0
#> 18     2    17     1  0.0558     1 0.654  60.7     1     0     0 2.14         0
#> 19     2    18     1  0.327      1 0.654  61.7     1     0     0 2.22         0
#> 20     2    19     0 -0.0961     1 0.654  62.7     0     0     0 2.31         0
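To make the role of lag() in the eligibility rule concrete, here is a small sketch on a hypothetical single-subject record (the values are invented for illustration). It applies the same three conditions as above and keeps the two lagged treatment columns visible.

```r
library(dplyr)

# Hypothetical subject aged 55 at time 0 who stops ARB (A = 1) after time 1
toy <- tibble(
  id = 1,
  time = 0:5,
  age = 55 + 0:5,
  A = c(1, 1, 0, 0, 0, 1),
  Y = 0
)

toy_elig <- toy %>%
  group_by(id) %>%
  mutate(
    A_lag1 = lag(A),      # treatment one period earlier (NA at time 0)
    A_lag2 = lag(A, 2),   # treatment two periods earlier (NA at times 0 and 1)
    eligible = as.integer(age >= 50 & cumsum(Y) == 0 & (A_lag1 + A_lag2) == 0)
  ) %>%
  ungroup()

toy_elig$eligible
#> [1] NA NA  0  0  1  1
```

The first two values are NA because the two-period treatment history is not yet observed, and the age and event conditions alone cannot rule eligibility out. Eligibility at time 5 depends only on treatment at times 3 and 4, so starting ARB at time 5 does not affect it.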

Next, we identify individuals who ever meet the eligibility criteria and record the first time each of them does so. This first eligibility time defines time zero (\(T_0\)), marking the start of follow-up for each participant in the target trial emulation.

eligible_info <- obsdata1 %>%
  group_by(id) %>%
  summarise(
    ever_eligible = any(eligible == 1, na.rm = TRUE),
    # which() drops NA values of eligible; indexing by the raw logical would turn min() into NA
    first_eligible_time = if (any(eligible == 1, na.rm = TRUE)) min(time[which(eligible == 1)]) else NA_real_) %>%
  ungroup()

Check how many subjects are ever eligible.

table(eligible_info$ever_eligible)
#> 
#> FALSE  TRUE 
#>   744   256

Let’s verify a few examples. Subject 2 first becomes eligible at time 10, whereas Subjects 1 and 3 never meet the eligibility criteria during follow-up.

eligible_info %>% filter(id %in% 1:3)
#> # A tibble: 3 × 3
#>      id ever_eligible first_eligible_time
#>   <dbl> <lgl>                       <dbl>
#> 1     1 FALSE                          NA
#> 2     2 TRUE                           10
#> 3     3 FALSE                          NA
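One detail worth checking when computing the first eligibility time is how missing values in the eligibility indicator behave under logical indexing. The sketch below uses hypothetical vectors, not obsdata, to show that an NA in the index propagates into the subset, and that wrapping the condition in which() avoids this.

```r
# Hypothetical records for one subject: eligibility is unknown (NA) at times 0 and 1
time <- 0:4
eligible <- c(NA, NA, 0L, 1L, 1L)

# A logical index containing NA returns NA for those positions
time[eligible == 1]
#> [1] NA NA  3  4

# so the naive minimum is NA
naive_min <- min(time[eligible == 1])

# which() keeps only positions where the condition is TRUE
safe_min <- min(time[which(eligible == 1)])
safe_min
#> [1] 3
```

Calling min(..., na.rm = TRUE) would give the same result here. Either way, the first eligibility time should come from records where eligibility is known to hold.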

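Next, we merge the eligibility information into the data, then drop subjects who are never eligible along with any records before time zero.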

obsdata2 <- obsdata1 %>%
  left_join(eligible_info, by = "id") # merge the eligibility information to our observational data
attr(obsdata2$eligible, "label") <- "Eligibility indicator at each time in records"
attr(obsdata2$ever_eligible, "label") <- "Whether a subject ever becomes eligible"
attr(obsdata2$first_eligible_time, "label") <- "First time a subject becomes eligible"

obsdata3 <- obsdata2 %>%
  filter(time >= first_eligible_time) %>% # keep records from time zero onward; never-eligible subjects (NA) are dropped
  select(-ever_eligible)
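The filter step does two jobs at once, which the following sketch illustrates on a hypothetical two-subject table. dplyr's filter() drops rows where the condition evaluates to NA, so subjects whose first eligibility time is missing are removed entirely, while eligible subjects lose only their records before time zero.

```r
library(dplyr)

# Hypothetical data: subject 1 is never eligible, subject 2 becomes eligible at time 1
toy <- tibble(
  id = c(1, 1, 2, 2, 2),
  time = c(0, 1, 0, 1, 2),
  first_eligible_time = c(NA, NA, 1, 1, 1)
)

# time >= NA is NA, and filter() keeps only rows where the condition is TRUE
kept <- toy %>%
  filter(time >= first_eligible_time)

kept
```

Only the records of subject 2 at times 1 and 2 remain.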

Component 2: Treatment strategies

Component 3: Treatment assignment

Components 4 & 5: Follow-up and Outcome

obsdata4 <- obsdata3 %>%
  mutate(follow_up = time - first_eligible_time) %>% # time elapsed since time zero
  select(-first_eligible_time, -eligible) 
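Downstream analyses often need each subject's follow-up to end at the first event or dropout. The sketch below is one possible way to do that, under the assumption that Y and C are per-period indicators and that records may continue after an event or dropout. The toy data and the names toy_trunc and toy_summary are hypothetical and not part of the vignette's pipeline.

```r
library(dplyr)

# Hypothetical long-format data: subject 1 has an event at follow_up 2,
# subject 2 drops out at follow_up 1
toy <- tibble(
  id = rep(1:2, each = 4),
  follow_up = rep(0:3, times = 2),
  Y = c(0, 0, 1, 0,  0, 0, 0, 0),
  C = c(0, 0, 0, 0,  0, 1, 0, 0)
)

toy_trunc <- toy %>%
  group_by(id) %>%
  # keep rows up to and including the first row with an event or dropout
  filter(lag(cumsum(Y + C), default = 0) == 0) %>%
  ungroup()

# One row per subject: last observed period and how follow-up ended
toy_summary <- toy_trunc %>%
  group_by(id) %>%
  summarise(
    last_follow_up = max(follow_up),
    event = max(Y),
    dropout = max(C)
  )

toy_summary
```

Here subject 1 contributes follow-up periods 0 to 2 and ends with an event, while subject 2 contributes periods 0 and 1 and ends with dropout. Whether such truncation is needed for obsdata depends on how the simulation records post-event periods.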

Finally, we attach each subject's baseline values to the longitudinal records and save the data for downstream statistical analysis.

# Extract baseline data 
baseline <- obsdata4 %>%
  filter(follow_up==0) %>%
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) 

# Merge baseline data to longitudinal data
obsdata5 <- obsdata4 %>%
  left_join(baseline, by = "id") %>%
  arrange(id, follow_up)

obsdata5 %>% slice_head(n=10)
#> # A tibble: 10 × 18
#>       id  time    X1      X2    X3    X4   age     A     Y     C age_s follow_up
#>    <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
#>  1     2    10     0 -0.893      1 0.654  53.7     0     0     0  1.56         0
#>  2     2    11     1 -0.0871     1 0.654  54.7     0     0     0  1.64         1
#>  3     2    12     0 -0.752      1 0.654  55.7     0     0     0  1.72         2
#>  4     2    13     1 -0.142      1 0.654  56.7     1     0     0  1.81         3
#>  5     2    14     1 -0.0428     1 0.654  57.7     1     0     0  1.89         4
#>  6     2    15     0  0.965      1 0.654  58.7     1     0     0  1.97         5
#>  7     2    16     0  0.887      1 0.654  59.7     1     0     0  2.06         6
#>  8     2    17     1  0.0558     1 0.654  60.7     1     0     0  2.14         7
#>  9     2    18     1  0.327      1 0.654  61.7     1     0     0  2.22         8
#> 10     2    19     0 -0.0961     1 0.654  62.7     0     0     0  2.31         9
#> # ℹ 6 more variables: assigned_treatment <dbl>, X1_0 <dbl>, X2_0 <dbl>,
#> #   X3_0 <dbl>, X4_0 <dbl>, age_0 <dbl>

# Save the data 
saveRDS(obsdata5, "obsdata5.rds")

Funding

The research reported in this publication was supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UM1 TR 004409. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Danaei, Goodarz, Luis A García Rodríguez, Oscar Fernández Cantero, Roger Logan, and Miguel A Hernán. 2013. “Observational Data for Comparative Effectiveness Research: An Emulation of Randomised Trials of Statins and Primary Prevention of Coronary Heart Disease.” Statistical Methods in Medical Research 22 (February): 70–96. https://doi.org/10.1177/0962280211403603.
Hernán, Miguel A, and James M Robins. 2016. “Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available.” American Journal of Epidemiology 183: 758–64.
Hernán, Miguel A, Brian C Sauer, Sonia Hernández-Díaz, Robert Platt, and Ian Shrier. 2016. “Specifying a Target Trial Prevents Immortal Time Bias and Other Self-Inflicted Injuries in Observational Analyses.” Journal of Clinical Epidemiology 79: 70–75.
Hernán, Miguel A, Wei Wang, and David E Leaf. 2022. “Target Trial Emulation: A Framework for Causal Inference from Observational Data.” JAMA 328 (24): 2446–47.
Hernán, Miguel A, and James M Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
Wang, Shirley V, Sebastian Schneeweiss, Jessica M Franklin, Rishi J Desai, William Feldman, Elizabeth M Garry, Robert J Glynn, et al. 2023. “Emulation of Randomized Clinical Trials with Nonrandomized Database Analyses: Results of 32 Clinical Trials.” JAMA 329: 1376–85.