Speaker

Mr. Joshua Gilbert
PhD Candidate
Harvard Graduate School of Education

Joshua Gilbert is a PhD candidate in Education Policy and Program Evaluation at the Harvard Graduate School of Education, where he works with James Kim and Luke Miratrix. His research focuses on the intersection of causal inference and psychometric methods. He has over twenty peer-reviewed publications in journals such as Developmental Psychology, Journal of Educational Psychology, Journal of Educational and Behavioral Statistics, Behavior Research Methods, and Psychological Methods, and a current h-index of 13. He has also curated a collection of over 100 item-level datasets from randomized controlled trials, available via the Item Response Warehouse, and has hosted workshops on how to conceptualize and analyze item-level effects. In 2025, he was awarded a Spencer / NAEd dissertation fellowship.

Title

Estimating Heterogeneous Treatment Effects with Item-Level Data: Insights from Item Response Theory

Abstract

Analyses of heterogeneous treatment effects (HTE) are common in applied causal inference research. However, when outcomes are latent variables assessed via psychometric instruments such as educational tests, standard methods ignore the potential HTE that may exist among the individual items of the outcome measure. Failing to account for “item-level” HTE (IL-HTE) can lead to both underestimated standard errors and identification challenges in the estimation of treatment-by-covariate interaction effects. We demonstrate how Item Response Theory (IRT) models that estimate a treatment effect for each assessment item can both address these challenges and provide new insights into HTE generally. This study articulates the theoretical rationale for the IL-HTE model and demonstrates its practical value using 75 datasets from 48 randomized controlled trials containing 5.8 million item responses in economics, education, and health research. Our results show that the IL-HTE model reveals item-level variation masked by single-number scores, provides more meaningful standard errors in many settings, allows for estimates of the generalizability of causal effects to untested items, resolves identification problems in the estimation of interaction effects, and provides estimates of standardized treatment effect sizes corrected for attenuation due to measurement error.