Research on big data management and analytics
What is the data science project’s research question? dCache is a large-scale storage system that stores large amounts of scientific data distributed over many independent nodes. Because data placement is not directly correlated with usage, some nodes can become hotspots when popular data is concentrated on a small number of servers. To optimize data access, dCache access information can be analyzed to find the best data distribution layout. Moreover, as the popularity of data sets decreases over time, less popular files can be grouped on dedicated nodes or even pushed to so-called cold storage.
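As a rough illustration of the kind of analysis described above, the following minimal Python sketch counts accesses per node, flags nodes that carry well above the average load, and lists files that have not been read recently. The record layout, the node and file names, the `factor` threshold and the `max_idle_days` cutoff are all assumptions made for this example; they do not reflect the actual dCache access-log format or any method chosen by the project.

```python
from collections import Counter

# Hypothetical access records: (timestamp in days, file id, node holding the file).
# This layout is an assumption for illustration, not the dCache log format.
ACCESSES = [
    (1, "f1", "node-a"),
    (2, "f1", "node-a"),
    (3, "f2", "node-a"),
    (3, "f3", "node-b"),
    (40, "f1", "node-a"),
    (41, "f2", "node-a"),
]


def node_load(accesses):
    """Count how many accesses each node served."""
    return Counter(node for _, _, node in accesses)


def hotspots(accesses, factor=1.5):
    """Return nodes whose load exceeds `factor` times the mean node load.

    Only nodes that appear in the records are counted; idle nodes
    would have to be added separately.
    """
    load = node_load(accesses)
    mean = sum(load.values()) / len(load)
    return sorted(n for n, c in load.items() if c > factor * mean)


def cold_files(accesses, now, max_idle_days=30):
    """Return files whose most recent access is older than `max_idle_days`."""
    last_seen = {}
    for ts, file_id, _ in accesses:
        last_seen[file_id] = max(ts, last_seen.get(file_id, ts))
    return sorted(f for f, ts in last_seen.items() if now - ts > max_idle_days)


if __name__ == "__main__":
    print("load per node:", dict(node_load(ACCESSES)))
    print("hotspot nodes:", hotspots(ACCESSES))
    print("cold-storage candidates:", cold_files(ACCESSES, now=45))
```

A real study would replace the fixed thresholds with statistics or learned models over much larger logs, but the two questions stay the same: which nodes are overloaded, and which files have become unpopular.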
What data will be worked on? Scientific data from high energy physics and research with photons
What tasks will this project involve? Study of access patterns and automatic design of a data layout
What makes this project interesting to work on? The management and efficient retrieval of huge amounts of data is a key factor for scientific experiments
What is the expected outcome? Contribution to research paper, Contribution to software development
What infrastructure, programs and tools will be used? Can they be used remotely? The large computing infrastructure at DESY will be used. The task is part of the programme Matter&Technology, Topic DMA
What skills are necessary for this project? Data analytics / statistics, Scientific computation, Data mining / Machine learning, Deep learning, Data warehousing
Is the data open source? Yes
Interested candidates should be at Bachelor level (3+). Volker Guelzow is looking for 1 visiting scientist to work on the project with the team, supervised by Tigran Mkrtchyan (email@example.com).