29. Statistical compositional data science for microbiome data

 

Christian L. Müller

We develop and apply computational statistics and data science techniques for the analysis of microbial ecosystems. We focus in particular on high-dimensional statistical methods, including structured sparsity and graphical models, Bayesian techniques, and compositional data analysis methods and their deep learning extensions. Using multi-omics high-throughput data, such as amplicon and single-cell data, we seek to understand microbial ecosystems, including the human microbiome and marine ecosystems, and their intricate relationships with the host and the environment.

What is the data science project’s research question?  The overarching goal of this data science project is the development and implementation of generative models and differential abundance (DA) tests for compositional (microbiome) data. The project comprises two components: A) a statistical modeling component and B) a software component. In the modeling component, we will investigate how to integrate available covariate data, such as host and environmental factors, into generative models and DA tests. Special emphasis will be given to novel stratification and automatic compositional reference selection techniques. The software component is concerned with the efficient implementation of state-of-the-art compositional data analysis (CoDA) techniques in Python. More specifically, we will focus on implementing DA techniques in standalone Python packages and interfacing them with QIIME 2.
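To make the notion of a compositional reference concrete, the following is a minimal sketch of two standard log-ratio transforms: the centered log-ratio (CLR), which uses the geometric mean of all taxa as reference, and the additive log-ratio (ALR), which uses a single chosen taxon. The function names, the pseudocount of 0.5, and the example counts are illustrative assumptions and not part of the project's methodology.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed value) avoids log(0) for unobserved taxa.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


def alr(counts, ref_index, pseudocount=0.5):
    """Additive log-ratio transform against one reference taxon.

    The reference taxon itself is dropped from the output.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    return [v - logs[ref_index] for i, v in enumerate(logs) if i != ref_index]


# Hypothetical counts for four taxa in a single sample.
sample = [120, 0, 30, 850]
clr_values = clr(sample)
alr_values = alr(sample, ref_index=3)
```

Both transforms depend only on ratios between taxa, so without a pseudocount they are unchanged when a sample is rescaled by a constant sequencing depth factor. The choice of reference changes how each coordinate is interpreted, which is why reference selection matters for DA testing.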

What data will be worked on?  To test the developed and implemented methods, we will consider large-scale collections of publicly available amplicon sequencing data, including human gut and environmental (soil, marine) microbiome data. These data carry relative abundance information in the form of overdispersed, highly sparse counts. Our group has also compiled several specific data collections, including gut microbiome data of Irritable Bowel Syndrome (IBS) patients and multi-view (satellite, sequencing, in situ physico-chemical measurements) marine data.
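As an illustration of what overdispersed, highly sparse count data can look like, here is a minimal sketch that simulates a small count table from a Dirichlet-multinomial model using only the Python standard library. The Dirichlet-multinomial is one common choice for such data and is used here as an assumption; the function name and all parameter values (number of taxa, concentration, depth) are hypothetical.

```python
import random


def dirichlet_multinomial(alpha, depth, rng):
    """Draw one sample of taxon counts summing to `depth`."""
    # Dirichlet draw via normalized gamma variates.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Multinomial draw: assign each read to a taxon.
    counts = [0] * len(alpha)
    for taxon in rng.choices(range(len(alpha)), weights=probs, k=depth):
        counts[taxon] += 1
    return counts


rng = random.Random(42)   # fixed seed for reproducibility
n_taxa = 20
n_samples = 10
depth = 1000              # reads per sample (assumed)
alpha = [0.2] * n_taxa    # small concentration gives sparse, overdispersed counts

table = [dirichlet_multinomial(alpha, depth, rng) for _ in range(n_samples)]
n_zeros = sum(c == 0 for row in table for c in row)
zero_fraction = n_zeros / (n_samples * n_taxa)
```

Because every row sums to the same depth, the counts carry only relative information. Lowering the concentration values in `alpha` increases both the sample-to-sample variability and the fraction of zeros.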

What tasks will this project involve?  In the statistical modeling part, the key tasks are the theoretical development of a novel statistical methodology for generative modeling and differential abundance (DA) testing, and its documentation in a project report. In addition, a prototype R implementation and comprehensive tests on real-world microbiome data are of paramount importance. The software implementation part involves numerically efficient Python code development, unit testing, and reproducible workflow design that includes the application to real-world microbiome data. For at least one DA method, a QIIME 2 interface needs to be provided and tested.
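To indicate the kind of building block a DA method and its unit tests might start from, the sketch below implements a simple two-group permutation test on log-ratio-transformed abundances of a single taxon. This is a generic baseline chosen for illustration, not the methodology to be developed in the project; the function name, the number of permutations, and the example values are assumptions.

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the samples
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical CLR-transformed abundances of one taxon in two groups.
cases = [2.1, 2.3, 1.9, 2.2, 2.0, 2.4]
controls = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
p_shifted = permutation_pvalue(cases, controls)
p_null = permutation_pvalue([1.0] * 4, [1.0] * 4)
```

In a real workflow the test would be applied per taxon with a multiple-testing correction, and covariates such as host or environmental factors would need to be accounted for, which a plain permutation of group labels does not do.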

What makes this project interesting to work on?  While compositional data sets are ubiquitous in many scientific disciplines, scalable data science techniques for these types of data have remained scarce. The advent of high-throughput compositional amplicon and single-cell sequencing data has triggered strong demand for, and interest in, these techniques. This project, with its theoretical and software components, seeks (i) to contribute to this expanding research field in data science and (ii) to apply the methods to large-scale data from the rapidly growing field of microbiome research.

What is the expected outcome?   Contribution to research paper, Contribution to software development

What infrastructure, programs and tools will be used? Can they be used remotely?   We will use R and Python for code development, collaborative tools such as GitHub, Slack, and Overleaf for communication and documentation, as well as distributed and high-performance clusters for computation.

What skills are necessary for this project?   Data analytics / statistics, Scientific computation, Computational models, Data mining / Machine learning, Visualization

Is the data open source?  Yes

Interested candidates should be at PhD level. Christian L. Müller is looking for 2 visiting scientists, who will work on the project together with the team.