Randomized controlled trials (RCTs) are the gold standard for
evaluating medical interventions, but they are often impractical, slow,
or costly. Modern causal inference methods, coupled with real-world data
(RWD), provide a faster, well-powered approach to estimating treatment
effects in diverse patient populations. Yet several barriers hinder
their effective use in translational research. A key challenge is
aligning “time zero,” the moment at which the cohort is defined,
eligibility is assessed, and treatment strategies are assigned. When
these events are misaligned, the resulting bias can be intractable.
Target trial emulation (TTE) addresses these barriers by emulating
RCT protocols with observational data, e.g., electronic health records
(EHRs) (Hernán and Robins 2016; Hernán, Wang, and Leaf 2022). The TTE
framework has proven especially useful for improving communication
between statisticians and clinicians and for better integrating clinical
insight with sound study design and incisive data analysis, which
enhances the quality of research. High-quality observational
studies built on TTE can complement findings from RCTs and provide
actionable evidence (Wang
et al. 2023).
This vignette shows how to build a fit-for-purpose dataset that minimizes bias. We briefly review the five key components of the TTE framework and explain why clear definitions for each are essential.
Suppose we are interested in comparing the effectiveness of angiotensin
receptor blockers (ARBs) versus angiotensin-converting enzyme inhibitors
(ACEIs), two classes of antihypertensive medication, in reducing the risk of
cardiovascular disease (CVD).
Before we begin, we import a simulated observational dataset of patients with hypertension who are followed over time, which we use to compare the risk of incident CVD under ACEI versus ARB treatment. The dataset includes baseline covariates, time-varying covariates, treatment indicators, and outcome indicators, allowing us to demonstrate TTE and causal inference methods in a realistic longitudinal setting. For details on the data-generating mechanism and parameter settings, please refer to the data simulation function.
# Load packages and data
library(dplyr)
obsdata <- readRDS("obsdata.rds")
# Function to extract variable label
get_label <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) "" else as.character(lbl)
}
# Create data dictionary
dict <- data.frame(
  Variable = names(obsdata),
  Meaning = vapply(obsdata, get_label, character(1)),
  check.names = FALSE
)
# Present data dictionary
knitr::kable(dict, caption = "Data Dictionary", row.names = FALSE)
| Variable | Meaning |
|---|---|
| id | Patient ID |
| time | Time index for longitudinal records |
| X1 | Non-ACEI or ARB antihypertensive medication use over time |
| X2 | Standardized systolic blood pressure over time |
| X3 | Biological sex (F/M) |
| X4 | Standardized diastolic blood pressure at baseline |
| age | Age over time (years) |
| A | Treatment indicator of ARB use over time (ACEI=0) |
| Y | Event indicator of cardiovascular disease |
| C | Indicator of early dropout |
| age_s | Standardized age over time (years) |
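Because the records are longitudinal, with one row per subject and time index, later steps that look back over a subject's previous records implicitly assume that the time index has no duplicates or gaps. The following is a minimal sketch of such a check on a hypothetical toy data frame (`toy`, `n_duplicates`, and `gaps` are illustrative names, not objects from this vignette); the same pattern could be applied to `obsdata`.

```r
library(dplyr)

# Hypothetical toy records: subject 1 is complete, subject 2 is missing time 2
toy <- tibble(
  id   = c(1, 1, 1, 2, 2, 2),
  time = c(0, 1, 2, 0, 1, 3)
)

# Number of duplicated (id, time) pairs
n_duplicates <- toy %>%
  count(id, time) %>%
  filter(n > 1) %>%
  nrow()

# Flag subjects whose time index does not increase by exactly 1 between records
gaps <- toy %>%
  arrange(id, time) %>%
  group_by(id) %>%
  summarise(has_gap = any(diff(time) != 1), .groups = "drop")

n_duplicates
gaps
```

In this toy example no (id, time) pair is duplicated, and only subject 2 is flagged as having a gap.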
Table 1 shows, side by side, the protocol of the target trial we wish to run and our plan for emulating it with observational data.
Component 1: Eligibility criteria
# lag(A) and lag(A, 2) give the treatment one and two years earlier, respectively.
obsdata1 <- obsdata %>%
  group_by(id) %>%
  mutate(eligible = as.integer(age >= 50 & cumsum(Y) == 0 & (lag(A) + lag(A, 2)) == 0)) %>%
  ungroup()
Let us see when Subject 2 becomes eligible. Subject 2 is not eligible from time 0 to 6 because age is below 50, and not eligible from time 7 to 9 because of ARB use in the previous 2 years. Subject 2 first becomes eligible at time 10.
obsdata1 %>% filter(id==2)
#> # A tibble: 20 × 12
#> id time X1 X2 X3 X4 age A Y C age_s eligible
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 2 0 1 0.177 1 0.654 43.7 1 0 0 0.722 0
#> 2 2 1 1 -0.0711 1 0.654 44.7 1 0 0 0.805 0
#> 3 2 2 0 -0.363 1 0.654 45.7 1 0 0 0.888 0
#> 4 2 3 0 -1.27 1 0.654 46.7 1 0 0 0.972 0
#> 5 2 4 0 0.387 1 0.654 47.7 1 0 0 1.06 0
#> 6 2 5 0 -1.96 1 0.654 48.7 1 0 0 1.14 0
#> 7 2 6 1 1.20 1 0.654 49.7 1 0 0 1.22 0
#> 8 2 7 0 0.960 1 0.654 50.7 1 0 0 1.31 0
#> 9 2 8 0 -0.624 1 0.654 51.7 0 0 0 1.39 0
#> 10 2 9 1 1.36 1 0.654 52.7 0 0 0 1.47 0
#> 11 2 10 0 -0.893 1 0.654 53.7 0 0 0 1.56 1
#> 12 2 11 1 -0.0871 1 0.654 54.7 0 0 0 1.64 1
#> 13 2 12 0 -0.752 1 0.654 55.7 0 0 0 1.72 1
#> 14 2 13 1 -0.142 1 0.654 56.7 1 0 0 1.81 1
#> 15 2 14 1 -0.0428 1 0.654 57.7 1 0 0 1.89 0
#> 16 2 15 0 0.965 1 0.654 58.7 1 0 0 1.97 0
#> 17 2 16 0 0.887 1 0.654 59.7 1 0 0 2.06 0
#> 18 2 17 1 0.0558 1 0.654 60.7 1 0 0 2.14 0
#> 19 2 18 1 0.327 1 0.654 61.7 1 0 0 2.22 0
#> 20 2 19 0 -0.0961 1 0.654 62.7 0 0 0 2.31 0
Next, we identify individuals who ever meet the eligibility criteria and determine the first time each eligible subject becomes eligible. This first eligibility time defines time zero (\(T_0\)), marking the start of follow-up for each participant in the target trial emulation.
eligible_info <- obsdata1 %>%
  group_by(id) %>%
  summarise(
    ever_eligible = any(eligible == 1, na.rm = TRUE),
    # na.rm = TRUE skips records whose eligibility is NA (e.g., no lagged treatment history yet)
    first_eligible_time = if (any(eligible == 1, na.rm = TRUE)) min(time[eligible == 1], na.rm = TRUE) else NA_real_
  ) %>%
  ungroup()
Check the number of subjects who are ever eligible.
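One way to carry out this check is sketched below. To keep the example self-contained, it uses a toy stand-in (`eligible_info_toy`, a hypothetical name) filled with the values this vignette reports for Subjects 1 to 3; in practice the same calls would be applied to `eligible_info`.

```r
library(dplyr)

# Toy stand-in for eligible_info, using the values reported for Subjects 1 to 3
eligible_info_toy <- tibble(
  id = c(1, 2, 3),
  ever_eligible = c(FALSE, TRUE, FALSE),
  first_eligible_time = c(NA, 10, NA)
)

# Tabulate subjects by whether they are ever eligible
eligible_counts <- eligible_info_toy %>%
  count(ever_eligible, name = "n_subjects")

# Number and proportion of ever-eligible subjects
n_ever <- sum(eligible_info_toy$ever_eligible)
prop_ever <- mean(eligible_info_toy$ever_eligible)

eligible_counts
n_ever
prop_ever
```

Reporting both the count and the proportion makes it easier to judge how much of the source population the emulated trial retains.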
Let’s verify a few examples. Subject 2 first becomes eligible at time 10, whereas Subjects 1 and 3 never meet the eligibility criteria during follow-up.
eligible_info %>% filter(id %in% 1:3)
#> # A tibble: 3 × 3
#> id ever_eligible first_eligible_time
#> <dbl> <lgl> <dbl>
#> 1 1 FALSE NA
#> 2 2 TRUE 10
#> 3 3 FALSE NA
Next, we filter out ineligible subjects, along with the records that precede each remaining subject's time zero.
obsdata2 <- obsdata1 %>%
  left_join(eligible_info, by = "id") # merge the eligibility information into the observational data
attr(obsdata2$eligible, "label") <- "Eligibility indicator at each time in records"
attr(obsdata2$ever_eligible, "label") <- "Whether a subject ever becomes eligible"
attr(obsdata2$first_eligible_time, "label") <- "First time at which a subject becomes eligible"
obsdata3 <- obsdata2 %>%
  # keep only ever-eligible subjects, from time zero onward
  # (never-eligible subjects have first_eligible_time = NA and are dropped)
  filter(time >= first_eligible_time) %>%
  select(-ever_eligible)
Component 2: Treatment strategies
Component 3: Treatment assignment
Components 4 & 5: Follow-up and Outcome
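obsdata4 <- obsdata3 %>%
  mutate(follow_up = time - first_eligible_time) %>% # time elapsed since time zero
  select(-first_eligible_time, -eligible)
Finally, we attach each subject's baseline values (treatment and covariates at time zero) to the longitudinal records and save the data for downstream statistical analysis.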
# Extract baseline data
baseline <- obsdata4 %>%
  filter(follow_up == 0) %>%
  select(id, A, X1, X2, X3, X4, age) %>%
  rename(assigned_treatment = A, X1_0 = X1, X2_0 = X2, X3_0 = X3, X4_0 = X4, age_0 = age)
# Merge baseline data to longitudinal data
obsdata5 <- obsdata4 %>%
  left_join(baseline, by = "id") %>%
  arrange(id, follow_up)
obsdata5 %>% slice_head(n = 10)
#> # A tibble: 10 × 18
#> id time X1 X2 X3 X4 age A Y C age_s follow_up
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 10 0 -0.893 1 0.654 53.7 0 0 0 1.56 0
#> 2 2 11 1 -0.0871 1 0.654 54.7 0 0 0 1.64 1
#> 3 2 12 0 -0.752 1 0.654 55.7 0 0 0 1.72 2
#> 4 2 13 1 -0.142 1 0.654 56.7 1 0 0 1.81 3
#> 5 2 14 1 -0.0428 1 0.654 57.7 1 0 0 1.89 4
#> 6 2 15 0 0.965 1 0.654 58.7 1 0 0 1.97 5
#> 7 2 16 0 0.887 1 0.654 59.7 1 0 0 2.06 6
#> 8 2 17 1 0.0558 1 0.654 60.7 1 0 0 2.14 7
#> 9 2 18 1 0.327 1 0.654 61.7 1 0 0 2.22 8
#> 10 2 19 0 -0.0961 1 0.654 62.7 0 0 0 2.31 9
#> # ℹ 6 more variables: assigned_treatment <dbl>, X1_0 <dbl>, X2_0 <dbl>,
#> # X3_0 <dbl>, X4_0 <dbl>, age_0 <dbl>
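Before saving, it may be worth confirming that the analysis dataset has the structure the emulation assumes: exactly one baseline record per subject, follow-up starting at 0, and an assigned treatment that is constant within subject and equal to the treatment observed at time zero. The following is a minimal sketch on hypothetical toy data laid out like `obsdata5` (`toy`, `checks`, and `all_ok` are illustrative names, and subject 5 is invented for the example).

```r
library(dplyr)

# Hypothetical toy analysis data with the same key columns as obsdata5
toy <- tibble(
  id                 = c(2, 2, 2, 5, 5),
  follow_up          = c(0, 1, 2, 0, 1),
  A                  = c(0, 0, 1, 1, 1),
  assigned_treatment = c(0, 0, 0, 1, 1)
)

# One row of structural checks per subject
checks <- toy %>%
  group_by(id) %>%
  summarise(
    one_baseline_row   = sum(follow_up == 0) == 1,
    starts_at_zero     = min(follow_up) == 0,
    constant_assigned  = n_distinct(assigned_treatment) == 1,
    matches_baseline_A = all(assigned_treatment == A[follow_up == 0]),
    .groups = "drop"
  )

# TRUE only if every check passes for every subject
all_ok <- all(unlist(select(checks, -id)))

checks
all_ok
```

If `all_ok` were `FALSE`, the per-subject rows in `checks` would indicate which subject and which condition to investigate.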
# Save the data
saveRDS(obsdata5, "obsdata5.rds")
The research reported in this publication was supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UM1 TR 004409. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.