Advancing explainable human in the loop NLP analytics for clinical applications

Lead Supervisor
Dr Iain Marshall
Clinical Academic Fellow
School of Population Health and Environmental Sciences, King’s College London

Dr Angus Roberts (Senior Lecturer in Biostatistics, King’s College London)
Dr Petr Slovak (Lecturer in Computer Science, King’s College London)
Serge Umansky, PhD (Metadvice)

Industrial Partner
Metadvice

Project Details

Essential background:

Clinicians typically have ten minutes with each patient, but generate more than one question every two consultations.1 Instead of using the principles of evidence-based medicine (EBM), clinicians more often search Google or ask colleagues.2,3 Most questions go unanswered, not because an answer doesn’t exist, but because one wasn’t found, or because no search was attempted. Yet the volume of evidence available is now overwhelming,4 and actionable insights are typically locked inside clinical guidelines (PDFs or websites) which run to tens or hundreds of pages.5

A burgeoning field of health informatics (including research by both proposed supervisors, and the focus of industry collaborators Metadvice) concerns the use of computers to speed up the process of finding answers to health questions.6–12 Core to this task is the use of natural language processing (NLP) and machine learning (ML) methods to automatically extract and classify the plain text within the health literature and electronic health records, producing a machine-understandable format. If sufficiently accurate, this could allow summarised research evidence relevant to the care of individual patients to be provided to clinicians at the point of care in a timely way, potentially improving the quality of patient care.

Much research to date has focused on using these methods in the general scientific literature (e.g. the contents of PubMed Central), including a number of ‘shared tasks’ where global academic teams compete to produce the most accurate system for automatically retrieving structured answers from the source.13–18 The past 5 years have seen NLP methods develop at pace, particularly the use of transformer neural-network models (including BERT and derivatives targeting the scientific literature), which have led to performance improvements in many8,19 (though not all10) NLP tasks. This project proposes to investigate the role and effectiveness of both classical NLP approaches and recent methodological developments for this task, leading to an in-practice pilot evaluation with health professionals in primary care.


  1. Del Fiol, G., Workman, T. E. & Gorman, P. N. Clinical questions raised by clinicians at the point of care: a systematic review. JAMA Intern. Med. 174, 710–718 (2014).
  2. Hider, P. N., Griffin, G., Walker, M. & Coughlan, E. The information-seeking behavior of clinical staff in a large health care organization. J Med Libr Assoc 97, 47–50 (2009).
  3. Papermaster, A. & Champion, J. D. The common practice of ‘curbside consultation’: A systematic review. J Am Assoc Nurse Pr. 29, 618–628 (2017).
  4. Bastian, H., Glasziou, P. & Chalmers, I. Seventy-Five Trials and Eleven Systematic Reviews a Day: How Will We Ever Keep Up? PLOS Med. 7, e1000326 (2010).
  5. Pronovost, P. J. Enhancing Physicians’ Use of Clinical Guidelines. JAMA 310, 2501–2502 (2013).
  6. Marshall, I. et al. Trialstreamer: a living, automatically updated database of clinical trial reports. JAMIA [in press], (2020).
  7. Marshall, I. J., Kuiper, J., Banner, E. & Wallace, B. C. Automating Biomedical Evidence Synthesis: RobotReviewer. in vol. 2017 7–12 (2017).
  8. Nye, B. et al. A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature. ACL 2018, 197–207 (2018).
  9. Kraljevic, Z. et al. MedCAT – medical concept annotation tool. (2019).
  10. Mascio, A. et al. Comparative analysis of text classification approaches in electronic health records. in Proceedings of the 19th SIGBioMed workshop on biomedical language processing, BioNLP 2020, online, july 9, 2020 (eds. Demner-Fushman, D., Cohen, K. B., Ananiadou, S. & Tsujii, J.) 86–94 (Association for Computational Linguistics, 2020).
  11. Marshall, I. J. & Wallace, B. C. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst. Rev. 8, 163 (2019).
  12. Marshall, I. J., Kuiper, J. & Wallace, B. C. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. JAMIA 23, 193–201 (2016).
  13. Kilicoglu, H. et al. Semantic annotation of consumer health questions. BMC Bioinformatics 19, 34 (2018).
  14. Deardorff, A., Masterton, K., Roberts, K., Kilicoglu, H. & Demner‐Fushman, D. A protocol-driven approach to automatically finding authoritative answers to consumer health questions in online resources. J. Assoc. Inf. Sci. Technol. 68, 1724–1736 (2017).
  15. Sarrouti, M. & Ouatik El Alaoui, S. SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions. Artif. Intell. Med. 102, 101767 (2020).
  16. Abacha, A. B. & Demner-Fushman, D. A Question-Entailment Approach to Question Answering. BMC Bioinformatics 20, 511 (2019).
  17. Balikas, G., Krithara, A., Partalas, I. & Paliouras, G. BioASQ: A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering. in Multimodal Retrieval in the Medical Domain (eds. Müller, H., Jimenez del Toro, O. A., Hanbury, A., Langs, G. & Foncubierta Rodriguez, A.) vol. 9059 26–39 (Springer International Publishing, 2015).
  18. Demner-Fushman, D., Mrabet, Y. & Ben Abacha, A. Consumer Health Information and Question Answering: Helping consumers find answers to their health-related information needs. J. Am. Med. Inform. Assoc. 27, 194–201 (2020).
  19. Beltagy, I., Lo, K. & Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. in EMNLP-IJCNLP 3615–3620 (2019). doi:10.18653/v1/D19-1371.
  20. Ely, J. W. et al. A taxonomy of generic clinical questions: classification study. BMJ 321, 429–432 (2000).
  21. Soboczenski, F. et al. Machine learning to help researchers evaluate biases in clinical trials: a prospective, randomized user study. BMC Med Inf. Decis Mak 19, 96 (2019).
  22. Nielsen, J. Estimating the number of subjects needed for a thinking aloud test. Int. J. Hum.-Comput. Stud. 41, 385–397 (1994).
  23. Lewis, J. R. Psychometric Evaluation of the PSSUQ Using Data from Five Years of Usability Studies. Int. J. Human–Computer Interact. 14, 463–488 (2002).

Proposed plan of work:

1. Development of a gold standard dataset for training and evaluation

The construction of annotated corpora has been a critical step in recent advances in natural language processing. These corpora serve to train and evaluate systems, and (via shared-task competitions) have helped the research community advance the field. In Aim 1 of the project, an annotated (i.e. manually labelled) corpus of guideline documents will be developed, and the reliability of the corpus evaluated. As the first part of this aim, a schema for the structured representation of clinical guidelines will be constructed. The precise nature of the schema will be determined from a literature review, and in collaboration with informaticians at Metadvice, using structured vocabularies [SNOMED CT] to describe the participants and interventions, and the Ely taxonomy20 to classify the informational need. Guidelines will be labelled using the schema via an online task by independent annotators with domain expertise, comprising a combination of paid medical students, and physicians and medical writers hired via the UpWork platform. The reliability of the corpus will be evaluated by redundant duplicate labelling by recruited annotators (with Kappa measures of agreement), and by sample review by health informaticians at Metadvice and health professionals working on the project. We estimate that labelling 80 clinical guideline documents (selected from UpToDate and NICE clinical guidelines) would provide in the region of >5,000 individual guideline recommendations. This sample would provide sufficient data for training and evaluating models via 5-fold cross-validation, with an estimated margin of error of 3% for precision, recall, and F1 statistics on the validation portion. The supervisors have experience in developing annotated datasets in a similar way using both crowdworkers8 and medical students paid via KCL TalentBank, and our industry partners Metadvice have agreed to provide the resources needed to complete this part of the project.
This dataset will be supplemented with a set of annotated clinical guidelines on 10 topics from Metadvice.
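The two quantitative checks above (chance-corrected inter-annotator agreement, and the margin of error implied by the corpus size) can be sketched as follows. This is an illustrative sketch only: the function names are ours, and the margin-of-error formula is the standard normal approximation for a binomial proportion, not a procedure specified in the proposal.

```python
from collections import Counter
import math

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two annotators labelled independently
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion-like metric (precision,
    recall) estimated on n validation items, via the normal approximation."""
    return z * math.sqrt(p * (1 - p) / n)
```

With >5,000 recommendations split 5-fold, each validation fold holds roughly 1,000 items; at the worst-case proportion p = 0.5, `margin_of_error(0.5, 1000)` gives about ±3.1%, consistent with the ~3% estimate above.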

2. Evaluation of computational methods for translating clinical guidelines into structured representations

Aim 2 concerns the development and evaluation of natural language processing (NLP) and machine learning (ML) methods for producing structured representations of guideline recommendations. Identified elements of text will be referenced to widely used structured vocabularies (such as the SNOMED Clinical Terms collection), particularly text describing the clinical condition, interventions, and outcomes (the ‘PICO’ representation). To further characterise the information need, question taxonomies (including the Ely taxonomy1,20) will be used. As part of this aim, ‘classical’ statistical approaches to language modelling (namely logistic regression and Support Vector Machines with bag-of-words text representations for classification; conditional random fields and Hidden Markov Models for sequence labelling) will be evaluated. Aim 2 will also investigate the accuracy of neural text models (including Convolutional Neural Networks for classification, Long Short-Term Memory networks for sequence labelling, and transformer model variants including SciBERT) for this task. The accuracy of structured guideline representations will be evaluated as averaged precision, recall, and F1 scores for each target data element. All models developed will be evaluated on a withheld portion of the dataset from Aim 1, with the calculation of information retrieval metrics (precision of top-k answers; mean reciprocal rank [MRR]). The accuracy of the information extraction steps will be evaluated using precision, recall, and the percentage of exact matches.
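The evaluation metrics named above are standard, and can be sketched as below; the function names and the set-based framing of extraction (predicted vs. gold elements per recommendation) are illustrative assumptions, not details fixed by the proposal.

```python
def precision_recall_f1(predicted, gold):
    """Set-based metrics for one extracted data element
    (e.g. the set of intervention terms found in a recommendation)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def mean_reciprocal_rank(ranked_results, relevant):
    """MRR across queries: ranked_results[i] is the ranked candidate
    answer list for query i; relevant[i] is its correct answer."""
    total = 0.0
    for results, gold in zip(ranked_results, relevant):
        for rank, result in enumerate(results, start=1):
            if result == gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

For example, retrieving the correct recommendation at rank 1 for one query and rank 2 for another yields an MRR of (1 + 0.5) / 2 = 0.75.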

3. Evaluation of human-in-the-loop strategies

A key issue for real-world clinical use is that even systems which perform well on statistical measures may not be sufficient for a high-risk context, where extremely high reliability is needed. To this end, various ‘human-in-the-loop’ methods have been proposed, in which automated systems provide suggestions for experts to validate, combining some of the speed of automated systems with the reassurance of human validation.21 In this aim, using the output from Aims 1–2, an analysis of errors from the automated system will be conducted. Using these data, targeted strategies for human interaction with automatic extractions will be devised and evaluated through simulation. Strategies which rely on the end-user (e.g. clinicians being presented with more than one short summary and asked to choose the most relevant), and those dependent on the data collector (i.e. where a human annotator responsible for collating the guidance repository corrects a targeted sample), will be evaluated.
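A data-collector strategy of the kind described above could be simulated along the following lines: route the least-confident fraction of automatic extractions to a human for correction, and measure the resulting overall accuracy. This is a minimal sketch under assumptions of our own (model confidence scores are available, and human review is treated as perfect), not the proposal’s specified simulation design.

```python
def simulate_review_strategy(confidences, correct, review_fraction):
    """Simulate targeted human review: the least-confident
    `review_fraction` of extractions are corrected by a (perfect)
    human reviewer; the rest keep the model's output. Returns the
    resulting overall accuracy, so the accuracy/effort trade-off can
    be plotted as review_fraction varies."""
    items = sorted(zip(confidences, correct), key=lambda x: x[0])
    n_review = round(len(items) * review_fraction)
    kept = [ok for _, ok in items[n_review:]]  # model output retained
    n_correct = n_review + sum(kept)           # reviewed items become correct
    return n_correct / len(items)
```

Sweeping `review_fraction` from 0 to 1 over the Aim 2 error analysis would show how much human effort is needed to reach a target reliability threshold.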

4. In practice pilot evaluation

Aim 4 proposes to take the best-performing methods from Aims 2–3 and evaluate their utility in clinical use. Project partners Metadvice currently work with a network of GP practices in Lambeth, south London, where they provide decision support software via the electronic health record (EHR). This project proposes to use this system to pilot a functional system for delivering clinical guideline recommendations to clinicians in practice. GPs who are users of the Metadvice system will be recruited to evaluate the system. In this initial stage, a ‘think-aloud’ user test will be conducted with these users in year 2 of the project, to inform the design of the guideline presentation to users. During this pilot, we propose to recruit a convenience sample of 6 GPs, based on recommendations for studies of this design.22 These users will be presented (automatically) with a series of guideline recommendations pertinent to their use of the EHR. As part of this step, users will be asked to feed back via a simple questionnaire, including Likert scales covering the relevance and accuracy of the suggested answers, which will allow the calculation of information retrieval metrics. Pilot users will additionally be asked to complete a validated 19-item usability questionnaire including an information usefulness domain,23 and to provide qualitative feedback about their experience of using the system, which will be analysed thematically. These data will allow an assessment of the real-world utility of the system, help establish suitable accuracy thresholds for use, and identify key risks to be addressed in the clinical context.