Introduction

Randomized controlled trials (RCTs) are the gold standard for evaluating medical interventions, but they are often impractical, slow, or costly. Modern causal inference methods, coupled with real-world data (RWD), provide a faster, well-powered approach to estimating treatment effects in diverse patient populations. Yet several barriers hinder their effective use in translational research. A key challenge is aligning “time zero,” the moment at which the cohort is defined, eligibility is assessed, and treatment strategies are assigned. When these time points are misaligned, the analysis is subject to biases, such as selection bias and immortal time bias, that are difficult to correct afterwards.

Target trial emulation (TTE) addresses these barriers by emulating RCT protocols with observational data, e.g., electronic health records (EHRs) (Hernán and Robins 2016; Hernán, Wang, and Leaf 2022). The TTE framework has proven especially useful for improving communication between statisticians and clinicians and for integrating clinical insight with sound study design and incisive data analysis, which enhances the quality of research. High-quality observational studies built on TTE can complement findings from RCTs and provide actionable evidence (Wang et al. 2023).

This vignette shows how to build a fit-for-purpose dataset that minimizes bias. We briefly review the five key components of the TTE framework and explain why clear definitions for each are essential.

Eligibility Criteria

Both inclusion and exclusion criteria need to be defined using only information available prior to or at baseline.
Eligibility criteria that use post-baseline information induce selection bias.
Not all trials of interest can be emulated using observational data. TTE is only feasible when the observational data contains all the information needed to apply the pre-defined eligibility criteria.

Treatment Strategies

Observational data is best used to emulate a pragmatic trial, in the sense that:
- Open-label design: patients are aware of the treatment they are prescribed.
- Treatments are compared under their usual conditions of use.
- New treatments cannot be compared because of a lack of data.
A new-user design is highly recommended to avoid prevalent user bias. For example:
- Treatment of interest: hormone therapy versus no treatment (Hernán and Robins 2016).
- Given that hormone therapy may cause a short-term increase in the risk of coronary heart disease (CHD), current users are likely to be those who remained CHD free after initiation.
- Including current users in the study cohort may introduce bias because patients who initiated hormone therapy and then switched due to side effects are not included.
- Prevalent user bias is a type of selection bias.
There are benefits to using an active control. For example:
- Treatment of interest: angiotensin-converting enzyme inhibitors (ACEIs) versus angiotensin receptor blockers (ARBs).
- Both are first-line antihypertensive medications that physicians prescribe interchangeably.
- Thus, patients in these two treatment groups are much more similar, which leads to less confounding by indication even before any adjustment. We refer to this situation as pseudo-randomization.
- In contrast, in the example of hormone therapy versus no treatment, patients who use hormone therapy can be very different from those who do not in terms of socioeconomic status and body mass index.
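The new-user idea above can be sketched in a few lines of R. This is a minimal illustration on made-up records, not the authors' pipeline: the data frame rx, the column on_drug, and the two-year washout window are all assumptions introduced here. A patient counts as a new user at the first time the drug is used after a fully observed washout window with no use.

```r
library(dplyr)

# Hypothetical toy records: one row per patient-year; on_drug = 1 if the
# drug was dispensed in that year.
rx <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 2, 2),
  time    = c(0, 1, 2, 3, 0, 1, 2, 3),
  on_drug = c(0, 0, 1, 1, 1, 1, 1, 1)
)

new_users <- rx %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(
    # Number of treated years in the two years before the current record;
    # NA when the two-year washout window is not fully observed.
    prior_use = lag(on_drug, 1) + lag(on_drug, 2),
    new_user  = on_drug == 1 & !is.na(prior_use) & prior_use == 0
  ) %>%
  filter(new_user) %>%
  slice_min(time, n = 1) %>%   # first qualifying initiation per patient
  ungroup()

new_users
```

In this toy example, patient 1 initiates at time 2 after two untreated years and is kept as a new user, while patient 2 is already on the drug at the first record (a prevalent user) and is excluded.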

Assignment Procedures

It is infeasible to emulate trials with blinded assignment because individuals in observational data are aware of the treatment they received.
To mimic the randomized treatment assignment in RCTs, we restore comparability between treatment groups within strata defined by baseline covariates, and we refer to this as conditional exchangeability.
To achieve conditional exchangeability, we need to adjust for all confounding covariates using causal inference methods (see Hernán and Robins 2020 for more details).
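In potential-outcome notation, conditional exchangeability can be written compactly. The symbols below are introduced here for illustration and are not defined elsewhere in this vignette: \(A\) is the treatment indicator, \(Y^{a}\) is the outcome that would be observed under treatment level \(a\), and \(L\) is the vector of baseline covariates. The second line is the standard standardization formula, which additionally relies on consistency and positivity.

```latex
\[
Y^{a} \perp\!\!\!\perp A \mid L \quad \text{for } a \in \{0, 1\}
\]
\[
E[Y^{a}] = \sum_{l} E[Y \mid A = a, L = l] \, P(L = l)
\]
```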

Outcome Identification and Validation

Outcomes can be identified using International Classification of Diseases (ICD) codes (Version 9 or 10), medication use, or natural language processing of clinical notes.
It is challenging to conduct blinded outcome ascertainment in observational data because healthcare providers are often aware of the treatment a patient received.
- Death is an exception because it can be independently ascertained through a death registry.

Time Zero

This is the most important component in TTE.
Time zero is the start of follow-up, also referred to as baseline.
In RCTs, the time of treatment assignment (A), the time of meeting eligibility criteria (E), and the start of follow-up (\(T_0\)) are typically well aligned. This alignment substantially reduces selection bias and immortal time bias.
See the four classic emulation failure modes below (Hernán et al. 2016).
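To make the consequence of misalignment concrete, the following small simulation sketches one way immortal time bias can arise. It is a simplified illustration under assumptions introduced here (exponential event times, a constant hazard of 0.2 per year, treatment start uniformly distributed over 5 years, no censoring), and it is not taken from the cited paper. The treatment has no effect by construction, yet defining treatment groups by what happens after the start of follow-up makes it look protective.

```r
set.seed(2024)
n <- 20000

# Time from eligibility to the event. The treatment has NO effect by construction:
# the event hazard is constant at 0.2 per year whether or not a patient is treated.
event_time <- rexp(n, rate = 0.2)

# Time from eligibility at which a patient starts treatment, if still event-free.
start_time   <- runif(n, min = 0, max = 5)
ever_treated <- start_time < event_time

# Misaligned analysis: follow-up starts at eligibility, but groups are defined by
# treatment received later. "Treated" patients must survive until start_time.
horizon <- 5
risk_treated_naive   <- mean(event_time[ever_treated] <= horizon)
risk_untreated_naive <- mean(event_time[!ever_treated] <= horizon)

# Aligned analysis: person-time counts as untreated before start_time and as
# treated afterwards, so no event-free time is credited to the wrong group.
rate_untreated <- sum(!ever_treated) / sum(pmin(event_time, start_time))
rate_treated   <- sum(ever_treated) /
  sum(event_time[ever_treated] - start_time[ever_treated])

round(c(naive_risk_ratio   = risk_treated_naive / risk_untreated_naive,
        aligned_rate_ratio = rate_treated / rate_untreated), 2)
```

The naive risk ratio falls below 1 even though the treatment does nothing, whereas the rate ratio based on correctly classified person-time is close to 1. Splitting person-time is only one simple correction. The emulation approach described in this vignette instead aligns eligibility, treatment assignment, and the start of follow-up by design.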

Implementation of TTE framework

An Active-comparator Design

Suppose we are interested in comparing the effectiveness of ARB versus ACEI antihypertensive medications in reducing the risk of cardiovascular disease (CVD).

Step 1: Clearly define the research questions of interest. In our case, they are:
- Primary: What is the effect of initiating ARBs versus ACEIs on CVD risk?
- Secondary: What is the effect of initiating and continuously using ARBs versus ACEIs on CVD risk?
- Note: Vaguely defined questions are troublesome because they affect the study design, data preparation, and analysis methods used to answer them, which can lead to biased and misleading findings.
Step 2: Clarify the study design
- Our example uses an observational, new-user, active-comparator, retrospective cohort design. Thanks to the active-comparator design, time zero is straightforward for both treatment groups: it is the time of initiating ARBs or ACEIs. Therefore, we only need to emulate a single trial.
- In contrast, in a placebo-control design, subjects in the control group may meet the eligibility criteria multiple times, so emulating a sequence of trials may be needed, especially with a small sample size or low event rate. We will explain how to implement TTE for a placebo-control design using the TrialEmulation R package (Danaei et al. 2013).
Step 3: Build a data set that fits the purpose of the study
- A fit-for-purpose data set is one that can answer these questions with minimal bias.
- It is a false belief that one data set can be used to answer any question.
- To build a fit-for-purpose data set, we demonstrate how to implement the TTE framework with observational data through manual coding.

Before we begin, we import a simulated observational dataset representing patients with hypertension who are followed over time to compare the risk of incident cardiovascular disease (CVD) under ACEI versus ARB treatment. The dataset includes baseline covariates, time-varying covariates, treatment, and outcome indicators, allowing us to demonstrate the implementation of TTE and causal inference methods in a realistic longitudinal setting. For details on the data-generating mechanism and parameter settings, please refer to the data simulation function.

# Load dplyr (used for %>%, group_by(), mutate(), lag(), and joins below) and the data
library(dplyr)
obsdata <- readRDS("obsdata.rds")

# Function to extract variable label
get_label <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) "" else as.character(lbl)
}

# Create data dictionary
dict <- data.frame(
  Variable = names(obsdata),
  Meaning  = vapply(obsdata, get_label, character(1)),
  check.names = FALSE
)

# Present data dictionary
knitr::kable(dict, caption = "Data Dictionary", row.names = FALSE)

Data Dictionary
Variable	Meaning
id	Patient ID
time	Time index for longitudinal records
X1	Non-ACEI or ARB antihypertensive medication use over time
X2	Standardized systolic blood pressure over time
X3	Biological sex (F/M)
X4	Standardized diastolic blood pressure at baseline
age	Age over time (years)
A	Treatment indicator of ARB use over time (ACEI=0)
Y	Event indicator of cardiovascular disease
C	Indicator of early dropout
age_s	Standardized age over time (years)

Write your own code to implement TTE

Table 1 shows the protocol of the target trial that we wish to run and our plan for emulating it with observational data, side by side.

Component 1: Eligibility criteria

Patients without hypertension and patients with CVD have already been excluded from our observational data in a separate pre-processing step.
We apply the remaining eligibility criteria (the first two in Table 1) to our data and create an “eligible” indicator.

obsdata1 <- obsdata %>%
  group_by(id) %>%
  # lag(A) and lag(A, 2) give the treatment one and two years earlier, respectively.
  # Both are NA in a subject's first two records, so eligibility can be NA there.
  mutate(eligible = as.integer(age >= 50 & cumsum(Y) == 0 & (lag(A) + lag(A, 2)) == 0)) %>%
  ungroup()

Let us see when Subject 2 becomes eligible. Subject 2 is not eligible from time 0 to 6 because age is below 50, and not eligible from time 7 to 9 because of ARB use in the previous 2 years. Subject 2 becomes eligible at time 10.

obsdata1 %>% filter(id==2)
#> # A tibble: 20 × 12
#>       id  time    X1      X2    X3    X4   age     A     Y     C age_s eligible
#>    <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <int>
#>  1     2     0     1  0.177      1 0.654  43.7     1     0     0 0.722        0
#>  2     2     1     1 -0.0711     1 0.654  44.7     1     0     0 0.805        0
#>  3     2     2     0 -0.363      1 0.654  45.7     1     0     0 0.888        0
#>  4     2     3     0 -1.27       1 0.654  46.7     1     0     0 0.972        0
#>  5     2     4     0  0.387      1 0.654  47.7     1     0     0 1.06         0
#>  6     2     5     0 -1.96       1 0.654  48.7     1     0     0 1.14         0
#>  7     2     6     1  1.20       1 0.654  49.7     1     0     0 1.22         0
#>  8     2     7     0  0.960      1 0.654  50.7     1     0     0 1.31         0
#>  9     2     8     0 -0.624      1 0.654  51.7     0     0     0 1.39         0
#> 10     2     9     1  1.36       1 0.654  52.7     0     0     0 1.47         0
#> 11     2    10     0 -0.893      1 0.654  53.7     0     0     0 1.56         1
#> 12     2    11     1 -0.0871     1 0.654  54.7     0     0     0 1.64         1
#> 13     2    12     0 -0.752      1 0.654  55.7     0     0     0 1.72         1
#> 14     2    13     1 -0.142      1 0.654  56.7     1     0     0 1.81         1
#> 15     2    14     1 -0.0428     1 0.654  57.7     1     0     0 1.89         0
#> 16     2    15     0  0.965      1 0.654  58.7     1     0     0 1.97         0
#> 17     2    16     0  0.887      1 0.654  59.7     1     0     0 2.06         0
#> 18     2    17     1  0.0558     1 0.654  60.7     1     0     0 2.14         0
#> 19     2    18     1  0.327      1 0.654  61.7     1     0     0 2.22         0
#> 20     2    19     0 -0.0961     1 0.654  62.7     0     0     0 2.31         0

Next, we identify individuals who ever meet the eligibility criteria and determine the first time each eligible subject becomes eligible. This first eligibility time defines time zero (\(T_0\)), marking the start of follow-up for each participant in the target trial emulation.

eligible_info <- obsdata1 %>%
  group_by(id) %>%
  summarise(
    ever_eligible = any(eligible == 1, na.rm = TRUE),
    # which() skips records where eligibility is NA, so min() is taken over eligible times only
    first_eligible_time = if (any(eligible == 1, na.rm = TRUE)) min(time[which(eligible == 1)]) else NA_real_) %>%
  ungroup()
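Because lag() returns NA for a subject's earliest records, an eligibility indicator built from lagged treatment can contain missing values. The toy check below uses made-up values for a single hypothetical subject and shows how NA behaves when the first eligible time is computed with logical indexing versus which().

```r
# Made-up records for one subject; eligibility is unknown (NA) at the first two times.
time     <- 0:4
eligible <- c(NA, NA, 0L, 1L, 1L)

# Logical indexing keeps the NA positions, so the minimum is NA.
first_naive <- min(time[eligible == 1])

# which() returns only the positions that are TRUE, so NA records are skipped.
first_safe <- min(time[which(eligible == 1)])

c(first_naive = first_naive, first_safe = first_safe)
```

Whether a record with unknown eligibility should count as ineligible is a design decision. This sketch treats it as ineligible and only guards against the NA propagating into the first eligible time.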

Check how many subjects are ever eligible.

table(eligible_info$ever_eligible)
#> 
#> FALSE  TRUE 
#>   744   256

Let’s verify a few examples. Subject 2 first becomes eligible at time 10, whereas Subjects 1 and 3 never meet the eligibility criteria during follow-up.

eligible_info %>% filter(id %in% 1:3)
#> # A tibble: 3 × 3
#>      id ever_eligible first_eligible_time
#>   <dbl> <lgl>                       <dbl>
#> 1     1 FALSE                          NA
#> 2     2 TRUE                           10
#> 3     3 FALSE                          NA

Filter out ineligible subjects

obsdata2 <- obsdata1 %>%
  left_join(eligible_info, by = "id") # merge the eligibility information to our observational data
attr(obsdata2$eligible, "label") <- "Eligibility indicator at each time in records"
attr(obsdata2$ever_eligible, "label") <- "Whether a subject ever becomes eligible"
attr(obsdata2$first_eligible_time, "label") <- "When is the first time a subject becomes eligible"

obsdata3 <- obsdata2 %>%
  filter(time >= first_eligible_time) %>% # keep records from time zero onward; never-eligible subjects (NA) are dropped
  select(-ever_eligible)

Component 2: Treatment strategies

We group patients into ARB or ACEI initiators based on their outpatient pharmacy dispensing records.
ARB initiators are coded as 1 and ACEI initiators are coded as 0 in our data (code is not shown).
Individuals who initiated both ARB and ACEI medications are excluded.

Component 3: Treatment assignment

Although ARBs and ACEIs are both first-line treatments and are used interchangeably in clinical practice, they are not completely randomized to patients in our observational data, which may lead to imbalance in patient characteristics between treatment groups.
It is perhaps more plausible to consider ARB or ACEI medications as randomized among patients with the same age, sex, blood pressure, and other antihypertensive medication use.
Formally, we assume that treatment assignment is independent of the CVD outcome conditional on all the baseline covariates mentioned above, i.e., the conditional exchangeability assumption.
We are mostly concerned with baseline covariates that affect both treatment assignment and the outcome, which are called confounders.
The conditional exchangeability assumption becomes more plausible as more confounders are measured and considered, though we can never verify that the assumption holds completely.
We will demonstrate how to restore conditional exchangeability using causal inference methods in the tutorial on Intention-to-treat Analysis.
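As a rough preview of what restoring comparability can look like, the sketch below estimates stabilized inverse probability of treatment weights on simulated baseline data. It is an illustration under assumptions made here, not the method of the Intention-to-treat tutorial: the column names mirror the baseline variables used in this vignette, but the values, the treatment model, and its coefficients are invented.

```r
set.seed(1)
n <- 5000

# Simulated baseline data (values are made up); treatment depends on age and sex.
toy <- data.frame(age_0 = rnorm(n), X3_0 = rbinom(n, 1, 0.5))
toy$assigned_treatment <- rbinom(n, 1, plogis(-0.3 + 0.8 * toy$age_0 + 0.5 * toy$X3_0))

# Propensity score: probability of receiving treatment given baseline covariates.
ps_fit <- glm(assigned_treatment ~ age_0 + X3_0, family = binomial(), data = toy)
ps     <- predict(ps_fit, type = "response")

# Stabilized weights: marginal treatment probability over the conditional one.
p_marg <- mean(toy$assigned_treatment)
toy$sw <- ifelse(toy$assigned_treatment == 1, p_marg / ps, (1 - p_marg) / (1 - ps))

# Balance check: difference in mean baseline age between groups, before and after weighting.
trt <- toy$assigned_treatment == 1
diff_raw <- mean(toy$age_0[trt]) - mean(toy$age_0[!trt])
diff_weighted <- weighted.mean(toy$age_0[trt], toy$sw[trt]) -
  weighted.mean(toy$age_0[!trt], toy$sw[!trt])

round(c(raw = diff_raw, weighted = diff_weighted), 3)
```

After weighting, the difference in mean age between groups should shrink toward zero, which is the sense in which weighting restores comparability on measured covariates. It cannot address confounders that were not measured.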

Components 4 & 5: Follow-up and Outcome

The start of follow-up, or time zero (\(T_0\)), is when individuals first become eligible and initiate ARB or ACEI.
In our long-format data, the follow-up at each record is the time elapsed since time zero, in the units of the time index (in our simulated data, age increases by one year per time unit).
In a one-row-per-patient data set, the end of follow-up is the time of the CVD event, death, loss to follow-up, or the end of the study, whichever occurs first.
It is good practice to prespecify an index window, say 01/01/2017 - 01/01/2020, and a study period, say 01/01/2017 - 01/01/2024, so that individuals can be followed for a sufficient amount of time, i.e., at least 4 years.

obsdata4 <- obsdata3 %>%
  mutate(follow_up = time - first_eligible_time) %>% # time elapsed since time zero
  select(-first_eligible_time, -eligible)
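For a one-row-per-patient version, the long-format records can be collapsed so that follow-up ends at the first event or dropout, or at the last available record otherwise. The sketch below uses a made-up toy data frame with the same column names as above (follow_up, Y, C). Death and the administrative end of the study are not represented in this toy example and would need to be handled in the same way.

```r
library(dplyr)

# Hypothetical long-format records: Y = CVD event indicator, C = dropout indicator.
long <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3),
  follow_up = c(0, 1, 2, 0, 1, 0, 1, 2),
  Y         = c(0, 0, 1, 0, 0, 0, 0, 0),
  C         = c(0, 0, 0, 0, 1, 0, 0, 0)
)

one_row <- long %>%
  arrange(id, follow_up) %>%
  group_by(id) %>%
  # Keep records up to and including the first event or dropout.
  filter(cumsum(lag(Y == 1 | C == 1, default = FALSE)) == 0) %>%
  summarise(
    end_follow_up = max(follow_up),          # last record retained
    event         = as.integer(any(Y == 1)), # CVD event observed
    dropout       = as.integer(any(C == 1))  # lost to follow-up
  )

one_row
```

Here subject 1 has an event at follow-up time 2, subject 2 drops out at time 1, and subject 3 is followed to the last record at time 2 without an event.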

Finally, we attach each subject's treatment and covariates at time zero to the longitudinal records and save the data for downstream statistical analysis.

# Extract baseline data 
baseline <- obsdata4 %>%
  filter(follow_up==0) %>%
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age) 

# Merge baseline data to longitudinal data
obsdata5 <- obsdata4 %>%
  left_join(baseline, by = "id") %>%
  arrange(id, follow_up)

obsdata5 %>% slice_head(n=10)
#> # A tibble: 10 × 18
#>       id  time    X1      X2    X3    X4   age     A     Y     C age_s follow_up
#>    <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
#>  1     2    10     0 -0.893      1 0.654  53.7     0     0     0  1.56         0
#>  2     2    11     1 -0.0871     1 0.654  54.7     0     0     0  1.64         1
#>  3     2    12     0 -0.752      1 0.654  55.7     0     0     0  1.72         2
#>  4     2    13     1 -0.142      1 0.654  56.7     1     0     0  1.81         3
#>  5     2    14     1 -0.0428     1 0.654  57.7     1     0     0  1.89         4
#>  6     2    15     0  0.965      1 0.654  58.7     1     0     0  1.97         5
#>  7     2    16     0  0.887      1 0.654  59.7     1     0     0  2.06         6
#>  8     2    17     1  0.0558     1 0.654  60.7     1     0     0  2.14         7
#>  9     2    18     1  0.327      1 0.654  61.7     1     0     0  2.22         8
#> 10     2    19     0 -0.0961     1 0.654  62.7     0     0     0  2.31         9
#> # ℹ 6 more variables: assigned_treatment <dbl>, X1_0 <dbl>, X2_0 <dbl>,
#> #   X3_0 <dbl>, X4_0 <dbl>, age_0 <dbl>

# Save the data 
saveRDS(obsdata5, "obsdata5.rds")

Funding

The research reported in this publication was supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UM1 TR 004409. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Danaei, Goodarz, Luis A García Rodríguez, Oscar Fernández Cantero, Roger Logan, and Miguel A Hernán. 2013. “Observational Data for Comparative Effectiveness Research: An Emulation of Randomised Trials of Statins and Primary Prevention of Coronary Heart Disease.” Statistical Methods in Medical Research 22 (February): 70–96. https://doi.org/10.1177/0962280211403603.

Hernán, Miguel A, and James M Robins. 2016. “Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available.” American Journal of Epidemiology 183: 758–64.

Hernán, Miguel A, Brian C Sauer, Sonia Hernández-Díaz, Robert Platt, and Ian Shrier. 2016. “Specifying a Target Trial Prevents Immortal Time Bias and Other Self-Inflicted Injuries in Observational Analyses.” Journal of Clinical Epidemiology 79: 70–75.

Hernán, Miguel A, Wei Wang, and David E Leaf. 2022. “Target Trial Emulation: A Framework for Causal Inference from Observational Data.” JAMA 328 (24): 2446–47.

Hernán, Miguel A, and James M Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Wang, Shirley V, Sebastian Schneeweiss, Jessica M Franklin, Rishi J Desai, William Feldman, Elizabeth M Garry, Robert J Glynn, et al. 2023. “Emulation of Randomized Clinical Trials with Nonrandomized Database Analyses: Results of 32 Clinical Trials.” JAMA 329: 1376–85.

Target Trial Emulation

Peirong Hao, Kevin Ying, Adam Bress, Tom Greene, Yizhe Xu*

2026-01-29