Emulating trials using EHR and Cogstack

Lead Supervisor
Professor Sabine Landau
Professor of Biostatistics
Dept. of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King’s College London

Dr James Teo (King’s College Hospital), Professor Richard Dobson (King’s College London), Dr Dan Bean (King’s College London)

Project Details

The increasing adoption and availability of Electronic Health Records (EHRs) over the last decade has provided a rich data source for observational research and the promise of “virtual trials” in real-world populations. However, EHR data are subject to selection biases, so it is necessary to “emulate clinical trials” with robust cohort identification and correction for time-varying confounders.

The EHR of King’s College Hospital NHS Foundation Trust contains 4 billion documents and, over the last decade, has become densely rich in e-notes and clinical text narrative. This system is supplemented by search functionality through Cogstack and natural language processing (NLP) methods. The aim of this project is to develop trial emulation techniques within an existing EHR system (the KCH Cogstack system) and to perform a prototype trial emulation in one or two domains (e.g. cardiology and neurology). These specialties are active domains of clinical trials research with a spectrum of trial designs, from small-scale studies with complex interventions to large-scale studies with simpler interventions, making them suitable for evaluating the scope of trial emulation from EHR data.

Trial emulation involves a number of methodological challenges (Hernan and Robins, 2016; Danaei et al., 2013). To mimic good clinical trial design, (i) patients for study should be selected according to defined inclusion/exclusion criteria, (ii) treatment strategies under study should be clearly described, (iii) outcomes used to evaluate health benefits should be defined and (iv) the analysis planned to estimate the causal contrast of interest should be pre-specified. To avoid confounding biases, (v) variables driving treatment decisions and outcomes need to be measured and appropriately accounted for in the modelling. This project will consider whether information retrievable from EHRs/search engines is sufficiently rich to address these issues.
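As a rough illustration of components (i) to (v) above, the sketch below records an emulated trial protocol as a simple Python data structure and reports which components are still unspecified. The class name, field names and example values are hypothetical placeholders chosen for illustration; they are not taken from the cited papers or from any existing Cogstack interface.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class TargetTrialProtocol:
    """Hypothetical container for protocol components (i)-(v)."""
    eligibility_criteria: List[str] = field(default_factory=list)  # (i) inclusion/exclusion
    treatment_strategies: List[str] = field(default_factory=list)  # (ii) strategies compared
    outcomes: List[str] = field(default_factory=list)              # (iii) outcome definitions
    analysis_plan: str = ""                                        # (iv) pre-specified analysis
    confounders: List[str] = field(default_factory=list)           # (v) variables to adjust for

    def missing_components(self) -> List[str]:
        """Return the names of components that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values for illustration only (assumptions, not study details).
protocol = TargetTrialProtocol(
    eligibility_criteria=["aged 18 or over", "no prior use of the study drug"],
    treatment_strategies=["initiate study drug", "do not initiate study drug"],
    outcomes=["first recorded event of interest"],
)

# The analysis plan and confounder list have not been filled in yet.
print(protocol.missing_components())
```

A structure like this could serve as a checklist while assessing whether each component can actually be retrieved from the EHR, which is the question the project sets out to examine.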

In the first year, the student will learn how to retrieve information from both structured and unstructured datasets, familiarise themselves with the existing literature on trial emulation using routinely collected data and generate a dictionary of variables that can be retrieved from the EHR (within categories: (i) patient clinical and demographic characteristics, (ii) treatment characteristics, (iii) post-treatment health outcomes and (iv) confounding variables measured before or after treatment initiation). In the second year, the student will devise a protocol for emulating a trial in the area of cardiology and/or neurology and develop the accompanying statistical analysis plan. This pre-specified plan will detail the methods to be used to estimate causal contrasts between the treatment strategies under investigation, together with their underlying assumptions. Finally, in their third year, the student will apply these methodologies to emulate a trial in the area of cardiology and/or neurology, place the findings in the context of the wider substantive literature and critically appraise the EHR as a source for trial emulation.


  • Hernan and Robins, American Journal of Epidemiology: 2016 (for the principles of this approach)
  • Danaei et al., Statistical Methods in Medical Research: 2013 (emulation of a trial of statins for the prevention of coronary heart disease)


EHR, see project description.


Electronic health records, virtual trials, trial emulation, cardiology, neurology, causal inference