Advancing Unsupervised and Weakly Supervised Learning with Emphasis on Data-Driven Healthcare

Publication details

Supervised by: Jenssen, Robert
Publisher: UiT Norges arktiske universitet

In healthcare, vast amounts of data are stored digitally in electronic health records (EHRs). EHRs represent a largely untapped source of clinically relevant information which, combined with advances in machine learning, has the potential to move healthcare in a more data-driven direction. However, due to the complexity and poor quality of EHR data, data-driven healthcare faces many challenges. In this thesis, we address the challenge posed by the lack of ground-truth labels and provide methodological solutions to challenges related to missing data, temporality, and high dimensionality. To that end, we present four lines of work in which we develop novel unsupervised and weakly supervised learning methodology. The first work presents a kernel for multivariate time series with missing values, which occur frequently in EHRs. Key components of the method are clustering and ensemble learning. Experiments on benchmark datasets demonstrate that the proposed kernel is robust to hyper-parameter choices and performs well in the presence of missing data. Next, we present a dimensionality reduction method designed to address several of the challenges facing data-driven healthcare: besides handling high dimensionality, it is capable of exploiting noisy and partially labeled multi-label data. We provide a case study of patients suffering from chronic diseases. In the third work, we present a kernel capable of exploiting informative missingness in multivariate time series, as well as a novel semi-supervised kernel. The effectiveness of the proposed methods is demonstrated via experiments on benchmark data and a case study of patients suffering from infectious postoperative complications.
In the last work, we phenotype patients with postoperative delirium using a weakly supervised learning framework in which clinical knowledge is used to generate a noisily labeled training set, which in turn is used to train classifiers. Experiments on a dataset collected from a Norwegian university hospital demonstrate the efficiency of the framework.
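
The weakly supervised idea described above can be illustrated with a minimal sketch. This is not the framework developed in the thesis: the data are synthetic, the two numeric features and the knowledge-based rule (with its 80%/10% firing rates) are hypothetical, and a plain logistic regression stands in for whatever classifiers were actually used. The sketch only shows the general pattern of letting a rule generate noisy labels and then training a classifier on those labels instead of on ground truth.

```python
import math
import random


def simulate_patients(n, seed=0):
    """Synthetic stand-in for EHR-derived data (assumption: not real patient data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0        # hidden true phenotype
        x1 = rng.gauss(2.0 if y else 0.0, 1.0)    # hypothetical numeric feature
        x2 = rng.gauss(1.5 if y else 0.0, 1.0)    # hypothetical numeric feature
        # Hypothetical knowledge-based rule (e.g. a keyword hit in a note):
        # assumed to fire for 80% of true cases and 10% of non-cases.
        rule_fires = rng.random() < (0.8 if y else 0.1)
        rows.append(((x1, x2), int(rule_fires), y))
    return rows


def sigmoid(z):
    z = max(-30.0, min(30.0, z))                  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=300):
    """Full-batch gradient descent for a two-feature logistic regression."""
    w, b, n = [0.0, 0.0], 0.0, len(features)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for (x1, x2), t in zip(features, labels):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - t
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b


rows = simulate_patients(500)
features = [r[0] for r in rows]
noisy_labels = [r[1] for r in rows]               # produced by the rule only
true_labels = [r[2] for r in rows]                # used for evaluation only

# Train on the noisy labels; the true labels are never seen during training.
w, b = train_logistic(features, noisy_labels)
preds = [1 if sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5 else 0
         for (x1, x2) in features]

accuracy_vs_truth = sum(p == t for p, t in zip(preds, true_labels)) / len(rows)
print(f"accuracy against hidden true labels: {accuracy_vs_truth:.3f}")
```

In this toy setting the true labels exist only because the data are simulated; with real EHR data, evaluation would require a separately annotated validation set.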