12. Large Scale Transfer Learning applied to X-Ray based COVID-19 diagnostics

 

Jenia Jitsev, Mehdi Cherti

Blog post

 

Cross-Sectional Team Deep Learning, residing at the Juelich Supercomputing Center (JSC), Helmholtz Research Center Juelich (FZJ), Germany, deals with learning algorithms that are able to transfer or generalize well across different datasets, conditions and tasks in a largely unsupervised, self-organized manner. The long-term goal is to build generic cross-domain models that can be trained to handle and perform well on a very broad range of conditions and tasks, and to do so continually, without ever stopping to learn when a new task, condition or dataset comes up that was not experienced before, which is also known as continual learning. Another focus is large-scale pre-training, which uses supercomputers to enable training on very large datasets in a reasonable amount of time.

What is the data science project’s research question? In this project we attempt to devise large-scale pre-training such that models pre-trained on large conventional datasets like ImageNet-1k and ImageNet-21k, on large medical imaging datasets like CheXpert, MIMIC-CXR and PadChest, or on combinations of those can be used for fast and robust transfer to different, smaller, specific medical imaging datasets. We envisage different network architectures and network sizes during pre-training, as well as different schemes for pre-training, supervised or unsupervised, e.g. based on contrastive losses. As a use case for transfer, we take COVID-19 related X-Ray imaging datasets like COVIDx.
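To make the transfer step concrete, below is a minimal sketch, assuming a PyTorch/torchvision setup: an ImageNet-1k pre-trained ResNet-50 is fine-tuned on a small target chest X-ray dataset. The dataset path, folder layout and hyperparameters are illustrative placeholders, not the project's actual pipeline.

    # Minimal transfer-learning sketch (illustrative placeholders, not the project's actual pipeline):
    # fine-tune an ImageNet-1k pre-trained ResNet-50 on a small target chest X-ray dataset.
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Standard ImageNet preprocessing; X-rays are grayscale, so replicate to 3 channels.
    preprocess = transforms.Compose([
        transforms.Grayscale(num_output_channels=3),
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Hypothetical target dataset laid out as ImageFolder (e.g. normal / pneumonia / COVID-19 subfolders).
    target_data = datasets.ImageFolder("path/to/target_xray_train", transform=preprocess)
    loader = torch.utils.data.DataLoader(target_data, batch_size=32, shuffle=True, num_workers=4)

    # Load ImageNet-pre-trained weights and replace the classification head for the target classes.
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, len(target_data.classes))
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    model.train()
    for epoch in range(5):  # short fine-tuning run for illustration
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()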

What data will be worked on? We work with publicly available large natural image datasets like ImageNet-1k and ImageNet-21k, and with publicly available large X-Ray medical image datasets like MIMIC-CXR, CheXpert, and PadChest.

What tasks will this project involve? The project will involve setting up deep learning training and transfer routines for different network architectures and testing these with respect to their transfer performance on a target dataset presented after large-scale pre-training. This requires coding the respective training routines and running and evaluating distributed training of deep learning networks on a supercomputer (JUWELS Booster, ca. 3700 A100 GPUs).
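As one concrete way transfer performance on a target dataset could be measured, the sketch below computes per-class AUROC for a multi-label chest X-ray classifier on held-out target data; the metric choice, function name and data layout are assumptions for illustration, not the project's fixed evaluation protocol.

    # Hedged evaluation sketch: per-class AUROC on a held-out target set, a common
    # metric for multi-label chest X-ray classification (assumed, not project code).
    # The loader is expected to yield (images, labels) with labels of shape [N, num_classes].
    import numpy as np
    import torch
    from sklearn.metrics import roc_auc_score

    @torch.no_grad()
    def evaluate_auroc(model, loader, device, class_names):
        model.eval()
        all_scores, all_labels = [], []
        for images, labels in loader:
            logits = model(images.to(device))
            all_scores.append(torch.sigmoid(logits).cpu().numpy())
            all_labels.append(labels.numpy())
        scores = np.concatenate(all_scores)
        labels = np.concatenate(all_labels)
        # One AUROC per pathology; the mean AUROC summarizes transfer performance.
        per_class = {name: roc_auc_score(labels[:, i], scores[:, i])
                     for i, name in enumerate(class_names)}
        return per_class, float(np.mean(list(per_class.values())))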

What makes this project interesting to work on? The project has at least two aspects that are of general value and interest for the frontiers of deep learning research. One is transfer learning: the general problem of how models trained on a particular set of data can successfully cope with new datasets that may be much smaller than those used for previous learning, or pre-training, while also benefiting from the knowledge acquired in the large-scale pre-training phase, for instance by being able to adapt to a new dataset very quickly in a few-shot or even zero-shot manner. This is also related to the general question of how to achieve strong generalization when learning from data, whether in a supervised or unsupervised fashion. The other aspect is training models in a distributed fashion on a supercomputer, which helps to speed up experiments that would otherwise take too long to execute on a usual machine. Both aspects are of generic interest. Specific to this project, in addition, there is a part involving the handling of different medical X-Ray image datasets and, even more specifically, datasets containing X-Ray lung images of COVID-19 patients, such as COVIDx.

What is the expected outcome? Contribution to research paper, Contribution to software development

What infrastructure, programs and tools will be used? Can they be used remotely? The project can be pursued remotely via access to the supercomputing facilities at the Juelich Supercomputing Center (JSC). The facilities at JSC, where the lab is hosted, offer supercomputing machines like the newly installed JUWELS Booster, featuring ca. 3700 new-generation NVIDIA A100 GPU accelerators, which allows conducting training and transfer experiments at very large scale. In addition, publicly available large-scale natural image datasets (ImageNet-1k, ImageNet-21k) and medical imaging datasets (MIMIC-CXR, NIH ChestX-ray14, PadChest, RSNA Pneumonia, CheXpert), together amounting to more than 850,000 relevant medical images, offer the right data basis to conduct such large-scale medical imaging model pre-training. A computing time budget is available for the supercomputer, which can be accessed remotely with ease. The libraries required for model and software development are TensorFlow, PyTorch and Horovod (for distributed training in the data-parallel regime).
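For the distributed training part, below is a rough sketch of how data-parallel training is typically wired up with Horovod and PyTorch; the model, dataset path and hyperparameters are illustrative placeholders, not the project's actual setup.

    # Sketch of data-parallel training with Horovod + PyTorch (placeholder model, data path and hyperparameters).
    import torch
    import horovod.torch as hvd
    from torchvision import datasets, models, transforms

    hvd.init()                                 # one process per GPU, launched e.g. via srun / mpirun / horovodrun
    torch.cuda.set_device(hvd.local_rank())    # pin each process to its local GPU

    transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    dataset = datasets.ImageFolder("path/to/xray_pretraining_data", transform=transform)  # placeholder path

    # Each worker reads a disjoint shard of the dataset.
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    model = models.resnet50(num_classes=len(dataset.classes)).cuda()

    # Scale the learning rate with the number of workers and wrap the optimizer
    # so that gradients are averaged across all workers via allreduce.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size(), momentum=0.9)
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

    # Start all workers from identical model and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(10):
        sampler.set_epoch(epoch)               # reshuffle shards each epoch
        for images, labels in loader:
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()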

What skills are necessary for this project? Data analytics / statistics, Data mining / Machine learning, Deep learning, High-performance computing

Is the data open source? Yes

Interested candidates should be at PhD level. Jenia Jitsev and Mehdi Cherti are looking for 2 visiting scientists to work on the project together with the team.