Combining statistical and knowledge-based methods for clinical modelling of electronic health record text

Lead Supervisor
Dr Angus Roberts
Senior Lecturer in Biostatistics
Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King’s College London

Dr Sumithra Velupillai
Lecturer in Applied Health Informatics, King’s College London

Industrial Partner

Project Details

As much as 70% of the information in electronic health records (EHRs) is found in unstructured, natural language, such as attached documents and the free text notes associated with health care events. In recognition of this, natural language processing (NLP) of the EHR has been widely adopted to assist with reuse of the EHR in support of research and clinical care. In recent years, state-of-the-art clinical NLP, as in other domains, has been dominated by neural networks and other statistical models. In contrast to the unstructured nature of EHR text, medical and biomedical knowledge is increasingly available in structured and codified forms, underpinned by curated databases, common data models, machine-readable clinical guidelines, and logically defined terminologies.

This project will explore interactions between structured medical knowledge and clinical NLP by incorporating representations of logical inference across medical knowledge bases into statistical and neural models of language. These models will thus combine the data-driven, statistical view of clinical language with the logical view of medical knowledge. The models will be used to map EHR text to medical knowledge at two levels of granularity:

  1. Linking mentions of medical entities in text to compositional (post-coordinated) terms in the standard medical terminology, SNOMED-CT. Concepts in SNOMED-CT are compositional. For example, “emergency appendectomy” can either be considered as an atomic term, or as a term composed of “appendectomy” with “priority=emergency”. Unlike current approaches to entity linking, which only consider atomic terms, this project will make use of the compositional nature of SNOMED-CT.
  2. Mapping text to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). The CDM is a common format for representing the disparate models of different EHR systems, in order that they might be shared for analysis. For example, the CDM contains tables for patient demographics, and for the observations and procedures carried out on that patient. Values in these tables may be recorded using standard terminologies, such as SNOMED-CT.
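
As a rough illustration of the two target representations above, the following Python sketch shows one way a post-coordinated term and a simplified CDM-style record might be held in memory. It is a minimal sketch under stated assumptions, not the project's method: the class name, function name, and dictionary keys are hypothetical, text labels stand in for real SNOMED-CT concept identifiers, and the record is a simplification rather than the actual OMOP table schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PostCoordinatedTerm:
    """Hypothetical container: a focus concept plus attribute=value refinements."""
    focus: str
    refinements: Tuple[Tuple[str, str], ...] = ()

    def render(self) -> str:
        # An atomic term has no refinements and is rendered as the focus alone.
        if not self.refinements:
            return self.focus
        parts = ", ".join(f"{attr}={value}" for attr, value in self.refinements)
        return f"{self.focus} : {parts}"


def to_procedure_record(person_id: int, term: PostCoordinatedTerm) -> dict:
    """Build a simplified, illustrative procedure record for one patient.

    The keys are assumptions made for this sketch, not real CDM column names.
    """
    return {
        "person_id": person_id,
        "procedure": term.focus,
        "modifiers": dict(term.refinements),
    }


# The same mention, "emergency appendectomy", viewed in two ways.
atomic = PostCoordinatedTerm("emergency appendectomy")
composed = PostCoordinatedTerm("appendectomy", (("priority", "emergency"),))

print(atomic.render())
print(composed.render())
print(to_procedure_record(1, composed))
```

Keeping the refinements separate from the focus concept means that a record for "emergency appendectomy" can still be retrieved by a query for "appendectomy", which is the kind of benefit the compositional view is intended to offer.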

Models developed in the PhD project will be evaluated in a real health research use case, that of physical multimorbidities in patients with mental illness. The interaction between physical and mental health is a complex interplay of physical, mental and social factors, often resulting in worse outcomes. This is consequently an active research area.

DRIVE-Health is associated with one of the UK’s leading centres in clinical NLP, the NIHR Maudsley Biomedical Research Centre (BRC). The BRC routinely runs more than 80 different NLP applications over 35 million documents to support mental health research, including several projects in physical multimorbidities. The BRC has large and active NLP, data science, and health informatics research groups.

The project is sponsored by AIMES, a cloud service provider and data analytics company with contracts throughout the health sector. The student will benefit from opportunities to explain their work to a commercial audience, which will bring insights into how their work may be generalised, and from potential opportunities to validate methods with AIMES health clients.


Data from the Clinical Records Interactive Search (CRIS) database will be central to the initial development of NLP algorithms. CRIS contains anonymized EHRs from the South London and Maudsley (SLaM) NHS Foundation Trust and has ethical approval for research use (Oxford REC C, reference 08/H0606/71+5) under an extensive governance model. Opportunities for further development and evaluation on similar data in other NHS trusts, e.g. Mersey Care, will be actively explored.


Mental health, natural language processing, knowledge representation, ontologies, NLP, EHR, electronic health records, physical multimorbidities