ICM project 1 en – Euro HPC

PH.D. NORBERT KAPIŃSKI
INTERDISCIPLINARY CENTRE FOR MATHEMATICAL AND COMPUTER MODELLING,
WARSAW UNIVERSITY

Mortality prediction and pathology detection in thoracic imaging screening using deep machine learning techniques

PROJECT OBJECTIVE

Problem

The project concerns the use of artificial intelligence techniques, including deep machine learning methods, to analyse X-ray and low-dose CT images acquired in screening, with the aim of detecting pathological changes early and predicting the risk of death.

Project objective

The aim of the project is to demonstrate that Trustworthy AI techniques, applied to the analysis of medical images from screening, make early detection of pathological changes and assessment of the risk of death possible. The main tasks are to develop and validate artificial intelligence models that perform these tasks, and to develop methods for explaining model results, in particular with regard to the influence of low-level and high-level image information.

TASKS FOR THE SUPERCOMPUTER

01.

Deep machine learning methods, convolutional neural networks (CNNs) in particular, rely on so-called tensor computing. Graphics processing units (GPUs) are among the most powerful architectures for such computations. Because of the volume and size of the data, the size of the models, and the need to iterate over many training epochs, supercomputer-class systems are best suited to this type of project: they combine GPU computational power with large host and GPU memory and fast access to stored data and metadata.

02.

Problems investigated with AI techniques require large volumes of data and metadata (e.g. labels) for training and validation, often on the order of tens or even hundreds of thousands of images and associated records. For 3D medical imaging (e.g. CT scans), a single data sample is usually an entire imaging study (a so-called series) on the order of 100 MB (e.g. 200 2D images of 512 × 512 pixels at 2 bytes per pixel). A training set of tens of thousands of examinations therefore amounts to several TB, and training the model requires multiple passes over the full set. In addition, metadata, and labels in particular, are often assigned at the level of individual slices or even individual pixels of an imaging study. Storing and searching them efficiently requires adequate database resources, because the number of metadata records exceeds the number of studies by several orders of magnitude.
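The size figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: it counts uncompressed pixel data and ignores file headers and compression, and the function names are illustrative rather than part of the project code.

```python
# Back-of-the-envelope storage estimate based on the example figures in the text.
SLICES_PER_SERIES = 200   # 2D images in one CT series (example value from the text)
ROWS, COLS = 512, 512     # pixels per 2D image
BYTES_PER_PIXEL = 2       # 2 bytes per pixel


def series_bytes(slices=SLICES_PER_SERIES, rows=ROWS, cols=COLS, bpp=BYTES_PER_PIXEL):
    """Uncompressed pixel data of one series in bytes (headers ignored)."""
    return slices * rows * cols * bpp


def dataset_bytes(n_series, **series_kwargs):
    """Uncompressed pixel data of n_series identical series in bytes."""
    return n_series * series_bytes(**series_kwargs)


if __name__ == "__main__":
    print(f"one series: {series_bytes() / 2**20:.0f} MiB")
    for n in (10_000, 50_000, 90_000):
        print(f"{n:>6} series: {dataset_bytes(n) / 1e12:.2f} TB")
```

With these assumptions one series is exactly 100 MiB, and tens of thousands of series land in the range of several TB, consistent with the estimate in the text.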

BENEFITS OF COOPERATION WITH ICM UW

The thoracic screening analysis task undertaken in the project was based on a data resource of 90,000 CT image series. Because of the size of the data and the complexity of the trained model, the following problems had to be solved: efficient metadata management at the level of single images (18 million) for experiment planning, efficient access to the datasets, efficient computation, and GPU memory large enough to hold the model together with a sufficient number of data samples (a so-called batch).

With the Centre’s resources, all the requirements of the model training process described above were met. A dedicated solution built on high-speed storage (SSD/NVMe) and in-memory databases (IMDB) integrates the database with the medical image storage system (PACS) for metadata management and image data addressing.
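As an illustration of per-image metadata management in an in-memory database, the sketch below uses Python's built-in `sqlite3` module with a `:memory:` database. This is a stand-in only: the text does not name the database product or its schema, so the table layout, the column names, and the example labels and locations are all hypothetical.

```python
import sqlite3


def build_index(rows):
    """Load (series_uid, slice_no, label, location) rows into an in-memory database.

    The schema is hypothetical; `location` stands for whatever address the
    image storage system uses to retrieve a single image.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE image ("
        " series_uid TEXT NOT NULL,"
        " slice_no INTEGER NOT NULL,"
        " label TEXT,"
        " location TEXT NOT NULL,"
        " PRIMARY KEY (series_uid, slice_no))"
    )
    db.executemany("INSERT INTO image VALUES (?, ?, ?, ?)", rows)
    # A secondary index keeps label lookups fast as the table grows.
    db.execute("CREATE INDEX image_by_label ON image (label)")
    db.commit()
    return db


def series_with_label(db, label):
    """Sorted identifiers of series containing at least one image with `label`."""
    cur = db.execute(
        "SELECT DISTINCT series_uid FROM image WHERE label = ? ORDER BY series_uid",
        (label,),
    )
    return [uid for (uid,) in cur]


def slice_locations(db, series_uid):
    """Storage locations of all images of one series, in slice order."""
    cur = db.execute(
        "SELECT location FROM image WHERE series_uid = ? ORDER BY slice_no",
        (series_uid,),
    )
    return [location for (location,) in cur]


if __name__ == "__main__":
    demo_rows = [  # made-up example records
        ("series-A", 1, "normal", "store/series-A/1"),
        ("series-A", 2, "nodule", "store/series-A/2"),
        ("series-B", 1, "normal", "store/series-B/1"),
    ]
    db = build_index(demo_rows)
    print(series_with_label(db, "nodule"))
    print(slice_locations(db, "series-A"))
```

Queries of this kind (select the series that match a label, then resolve the addresses of their images) are what experiment planning at the single-image level needs to answer quickly.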

The Rysy cluster, based on NVIDIA V100 32GB GPU cards, was used for the calculations; it provides high computational performance and a large amount of memory for the model. The calculations were implemented in Python using TensorFlow.

Because of the high data access intensity (a high ratio of data reading time to computing time) and the large size of the training and validation sets, data access from the GPU computing system had to be organised hierarchically.

The complete dataset was stored in the Lustre file system (Tethys) and asynchronously staged onto the high-speed local storage (SSD/NVMe) of the computing cluster.
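The asynchronous staging described above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's implementation: it assumes each sample is a single file with a unique name, and the names `stage`, `staged_files`, `lookahead`, and `workers` are introduced here for the example.

```python
import itertools
import shutil
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stage(src, cache_dir):
    """Copy one file into the local cache unless it is already present."""
    dst = cache_dir / src.name
    if not dst.exists():
        tmp = dst.with_name(dst.name + ".part")
        shutil.copyfile(src, tmp)
        tmp.replace(dst)  # rename, so readers never see a half-copied file
    return dst


def staged_files(sources, cache_dir, lookahead=4, workers=4):
    """Yield local copies in input order while later copies run in the background.

    Up to `lookahead` copies are in flight at any time, so reading from the
    shared file system overlaps with whatever the consumer does per file.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    remaining = iter(Path(s) for s in sources)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque(
            pool.submit(stage, src, cache_dir)
            for src in itertools.islice(remaining, max(1, lookahead))
        )
        while pending:
            local_path = pending.popleft().result()
            upcoming = next(remaining, None)
            if upcoming is not None:
                # Start the next copy before handing control to the consumer.
                pending.append(pool.submit(stage, upcoming, cache_dir))
            yield local_path
```

A training loop would iterate over `staged_files(shared_paths, local_cache_dir)` and read each sample from the returned local path. Because files already in the cache are not copied again, later epochs read from local storage only.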

Thanks to the use of HPC infrastructure:

training of more complex models at larger batch sizes became possible;
the response time of the metadata database was reduced by several orders of magnitude;
data access times were reduced by several orders of magnitude.

Only with these infrastructure improvements could the calculations planned in the project’s multi-experiment design be completed in a realistic time.

90,000 IMAGING SERIES

18 MILLION SINGLE IMAGES

EFFECTS

The results of the project can be used in practice and, in future, implemented in clinical applications, provided that validation is extended, the credibility of the results is demonstrated, and the results are tied to measurable clinical benefits in thoracic screening, with particular emphasis on lung cancer screening.
