Comparing Individualized Treatment Effect Inference Methods Through a Simulation Study

Event: ISPOR Europe 2025, Glasgow, Scotland, UK

Authors: Diane Vincent, Antoine Movschin, Tristan Fauvel 

OBJECTIVES

Randomized controlled trials (RCTs) are the gold standard for estimating the average treatment effect (ATE), relying on the equipoise principle and on adequate trial size and design, but they are typically not designed to estimate individual-level effects.

Conversely, the richness and abundance of real-world data (RWD) offer an opportunity to estimate the individualized treatment effect (ITE), albeit limited by biases arising from violations of causal assumptions.

While a variety of machine learning (ML)-based methods exist to estimate the conditional average treatment effect (CATE) from observational data, there is no guidance for practitioners on choosing the best-suited method, and systematic comparisons on unbiased benchmark datasets remain limited.

This simulation study aims to guide method selection by comparing a range of approaches across diverse performance metrics and constraints.

METHODS

A set of representative ML-based CATE estimation methods is evaluated, including meta-learners as well as tree-based, deep-learning, and Bayesian methods.
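To illustrate the meta-learner family, the sketch below implements a T-learner on toy data: one outcome model is fitted per treatment arm, and the CATE estimate is the difference of their predictions. The data, the linear outcome models, and all coefficients are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with a known heterogeneous effect: tau(x) = 1 + 2*x (assumed for illustration)
n = 2000
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)                    # randomized treatment
y = x + (1 + 2 * x) * t + rng.normal(scale=0.1, size=n)

def fit_linear(xs, ys):
    # Least-squares fit of y = a + b*x
    A = np.column_stack([np.ones_like(xs), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

# T-learner: one outcome model per arm, CATE = difference of predictions
b1 = fit_linear(x[t == 1], y[t == 1])             # treated-arm model
b0 = fit_linear(x[t == 0], y[t == 0])             # control-arm model

def cate_hat(xs):
    return (b1[0] + b1[1] * xs) - (b0[0] + b0[1] * xs)

true_cate = 1 + 2 * x
```

In practice the two arm-specific models would be flexible ML regressors rather than linear fits; the two-model structure is what defines the T-learner.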

The simulation study relies on a data-generating process (DGP) that provides full control over, and knowledge of, the true treatment effect, and emulates a variety of realistic scenarios by varying constraints such as sample size, CATE heterogeneity, covariate overlap, and confounding (both observed and unobserved).
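A minimal sketch of such a DGP is shown below: knobs control the strength of confounding (which also shrinks covariate overlap) and of effect heterogeneity, while the true CATE is known by construction. The functional forms and coefficients are hypothetical, chosen only to illustrate the principle.

```python
import numpy as np

def simulate(n, confounding=1.0, heterogeneity=1.0, seed=0):
    """Toy DGP with known individual treatment effects (illustrative, not the study's DGP)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))                       # baseline covariates
    # Propensity depends on x[:, 0]; a larger `confounding` reduces overlap
    p = 1.0 / (1.0 + np.exp(-confounding * x[:, 0]))
    t = rng.binomial(1, p)                            # confounded treatment assignment
    tau = 1.0 + heterogeneity * x[:, 1]               # true CATE, known by design
    y = x[:, 0] + tau * t + rng.normal(scale=0.5, size=n)
    return x, t, y, tau

x, t, y, tau = simulate(5000)
```

Because `tau` is returned alongside the observed data, any estimator's output can be scored against the ground truth, which is exactly what observational datasets cannot offer.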

The methods are assessed using the metrics most frequently reported in the literature, including standard ML metrics, the Precision in Estimation of Heterogeneous Effect (PEHE), and its observable approximations.

RESULTS

CATE estimation methods are accurate, with the best method reducing PEHE by a factor of 27.5 on average relative to the ATE baseline; however, confidence intervals remain wide, spanning 25% of CATE values on average, and no single method outperforms the others across all scenarios and metrics.

Observable metrics poorly reflect true performance: coverage is uncorrelated with it, and PEHE approximations show only moderate alignment.

CONCLUSIONS

We provide a comprehensive mapping of the evaluated methods under the defined constraints, offering guidance for method selection in real-world contexts, tailored to specific use cases.

Let’s bring science to impact together

Whether you’re interested in our work, looking to co-publish, or exploring how
our insights can support your objectives, our team is here to connect.
