Machine Learning for Disease Subtype Discovery

Lead Supervisor
Dr Mansoor Saqi
Head of Translational Bioinformatics, NIHR Biomedical Research Centre
Honorary Senior Lecturer, Faculty of Life Sciences & Medicine, King’s College London

Professor Richard Dobson
King’s College London

Project Details

Identification of disease subtypes is important for the development of precision medicine strategies. A complex heterogeneous disease, while often described as a single disorder, may be better characterised as a collection of subtypes, each involving different mechanisms of action. Identifying the disease subtype can influence the choice of treatment and healthcare pathway, since the different underlying mechanisms may dictate the appropriate therapeutic intervention and may be associated with different prognoses.

Many large translational medicine studies now collect multiple types of molecular data, such as transcriptomics, methylation, microbiome, metabolomics and proteomics data. More recently, data from imaging, wearable sensors and behavioural phenotypes have also become available. Clustering patients based on these multiple layers of data can lead to the discovery of subtypes and can suggest a new taxonomy of disease.

The datasets being collected can be envisaged as a collection of layers, where each layer corresponds to a component data type (e.g. transcriptomics, metabolomics, methylation). Clustering can be carried out on the individual layers as well as on the integrated dataset. Several methods have been developed for integrating multiple datasets, including network fusion approaches (Wang et al., 2014), non-negative matrix factorisation methods (Chalise & Fridley, 2017), and Bayesian correlated clustering (Kirk et al., 2012).
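
The idea of clustering on individual layers and on an integrated dataset can be illustrated with a minimal sketch. It is not an implementation of any of the cited methods: it assumes a much simpler scheme in which each layer is turned into a patient-by-patient Gaussian similarity matrix, the matrices are averaged, and patients are grouped by thresholding the result. The function names, the toy values, the scale and the threshold are all hypothetical choices made for illustration.

```python
import math

def pairwise_similarity(layer, scale=1.0):
    """Gaussian similarity between every pair of patients in one data layer."""
    n = len(layer)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(layer[i], layer[j]))
            sim[i][j] = math.exp(-sq_dist / (2 * scale ** 2))
    return sim

def average_similarity(sims):
    """Integrate layers by averaging their similarity matrices (an assumed,
    deliberately naive integration rule)."""
    n = len(sims[0])
    return [[sum(s[i][j] for s in sims) / len(sims) for j in range(n)]
            for i in range(n)]

def threshold_clusters(sim, threshold=0.5):
    """Label patients by connected components of the graph that links
    pairs whose similarity exceeds the threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy layers: one row per patient, same patient order in every layer.
transcriptomics = [[0.0, 0.1], [0.1, 0.0], [3.0, 3.1], [3.1, 3.0]]
metabolomics = [[1.0], [1.1], [5.0], [5.2]]

layer_sims = [pairwise_similarity(transcriptomics),
              pairwise_similarity(metabolomics)]

# Cluster each layer on its own, then the integrated similarity matrix.
per_layer_labels = [threshold_clusters(s) for s in layer_sims]
labels = threshold_clusters(average_similarity(layer_sims))
print(per_layer_labels, labels)
```

In this toy case the two layers agree, so the per-layer and integrated clusterings coincide. The interesting cases in practice are those where layers disagree, which is where the choice of integration method matters.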

Early work on integrating multiple datasets has typically been carried out on omics data types. However, there are considerable challenges in applying these approaches to datasets that combine omics and non-omics data types, such as imaging data, wearable sensor data and behavioural phenotype data obtained from questionnaires, as used in neurodevelopmental studies (see, for example, Stefanik et al., 2018). These challenges relate to the optimal way to represent and integrate the data layers, and to the biological and clinical interpretation of the resulting clusters. This project will use machine learning approaches to explore comprehensively how these issues affect disease subtype discovery in large public-domain multimodal datasets associated with translational medicine studies.
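
One concrete form of the representation challenge is that omics measurements are numeric while questionnaire responses are often categorical, so a single Euclidean distance does not apply to both. A minimal sketch of one possible workaround, a Gower-style dissimilarity, is given below. It is an illustrative assumption rather than the method this project will use, and the features (an expression level, a daily step count and a questionnaire answer) and their values are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity between two patient records.

    numeric_ranges maps the index of each numeric feature to its observed
    range (max - min); every other feature is treated as categorical.
    """
    total = 0.0
    for k, (x, y) in enumerate(zip(a, b)):
        if k in numeric_ranges:
            rng = numeric_ranges[k]
            # Numeric feature: absolute difference scaled to [0, 1].
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            # Categorical feature: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical records: (expression level, daily steps, questionnaire answer).
patients = [
    (2.0, 7500, "often"),
    (2.5, 8000, "often"),
    (9.0, 2000, "never"),
]

# Observed range of each numeric feature across the cohort.
numeric_ranges = {
    k: max(p[k] for p in patients) - min(p[k] for p in patients)
    for k in (0, 1)
}

d01 = gower_distance(patients[0], patients[1], numeric_ranges)
d02 = gower_distance(patients[0], patients[2], numeric_ranges)
print(round(d01, 3), round(d02, 3))
```

Because every feature contributes a value between 0 and 1, numeric and categorical layers are placed on a common scale. The resulting dissimilarities could then feed any distance-based clustering method, although equal weighting of features is itself a modelling choice.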


machine learning, disease subtype, patient stratification