6. Optimization of data distribution


Volker Guelzow

Research on big data management and analytics

What is the data science project’s research question? dCache is a large-scale storage system that stores large amounts of scientific data distributed over many independent nodes. Because data placement is not directly correlated with data usage, some nodes can become hotspots when popular data is concentrated on a small number of servers. To optimize data access, the dCache access information can be analyzed to find the best data distribution layout. Moreover, as the popularity of data sets decreases over time, less popular files can be grouped on dedicated nodes or even pushed to so-called cold storage.
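To make the idea concrete, here is a minimal sketch in Python of how per-node load could be derived from access counts and how a simple greedy heuristic could spread popular files across nodes. It is only an illustration under stated assumptions: the function names, the input format (a mapping from file name to access count) and the "cold" label are hypothetical, and none of this reflects the dCache API or the method the project will actually use.

```python
import heapq
from collections import Counter


def node_load(access_counts, placement):
    """Sum the access counts of all files placed on each node."""
    load = Counter()
    for file_name, count in access_counts.items():
        load[placement[file_name]] += count
    return dict(load)


def rebalance(access_counts, nodes, cold_threshold=0):
    """Greedy layout (assumed heuristic, not dCache's own policy):
    visit files from most to least popular and put each one on the
    node with the lowest accumulated load so far. Files whose access
    count is <= cold_threshold are labelled "cold" instead."""
    heap = [(0, node) for node in nodes]
    heapq.heapify(heap)
    layout = {}
    ordered = sorted(access_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for file_name, count in ordered:
        if count <= cold_threshold:
            layout[file_name] = "cold"
            continue
        load, node = heapq.heappop(heap)
        layout[file_name] = node
        heapq.heappush(heap, (load + count, node))
    return layout


# Hypothetical example data: two hot files initially share one node.
counts = {"a": 100, "b": 90, "c": 10, "d": 5, "e": 0}
initial = {"a": "n1", "b": "n1", "c": "n2", "d": "n2", "e": "n2"}
before = node_load(counts, initial)
layout = rebalance(counts, ["n1", "n2"])
after = node_load(counts, layout)
```

In this toy example the initial placement puts both popular files on the same node, while the greedy layout separates them and marks the never-accessed file as cold. A real study would also have to account for file sizes, node capacities, and how popularity changes over time.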

What data will be worked on?  Scientific data from high energy physics and research with photons

What tasks will this project involve? Study of access patterns and automatic design of a data layout

What makes this project interesting to work on?  The management and efficient retrieval of huge amounts of data is a key factor for scientific experiments

What is the expected outcome?  Contribution to a research paper, contribution to software development

What infrastructure, programs and tools will be used? Can they be used remotely?  The large computing infrastructure at DESY will be used. The task is part of the programme Matter&Technology, Topic DMA

What skills are necessary for this project?  Data analytics / statistics, Scientific computation, Data mining / Machine learning, Deep learning, Data warehousing

Is the data open source? Yes  

Interested candidates should be at Bachelor level (3+).  Volker Guelzow is looking for 1 visiting scientist, who will work on the project with the team, supervised by Tigran Mkrtchyan (tigran.mkrtchyan@desy.de).