Research and Development on Big Data Management
What is the data science project’s research question? Some communities, especially those that process images, need to populate metadata
catalogs before a data set can be processed. Traditionally, this is done by
repeatedly querying a storage system to learn whether new data is available. This approach
does not scale with the number of clients: every query consumes resources to obtain the current status, which puts an upper limit on the number of concurrent activities a service can sustain.
dCache is a large-scale storage system that stores large amounts of scientific data distributed across many independent nodes. In addition to storing the data, dCache can generate storage events when data changes and deliver them asynchronously to clients. Combined with a community-specific
metadata extraction application, storage events can be used to populate metadata catalogs automatically when new data is written.
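The event-driven idea described above can be illustrated with a minimal Python sketch. It does not use dCache's actual event interface: the in-memory queue stands in for the event channel, and `StorageEvent`, `extract_metadata`, `consume_events` and `run_demo` are hypothetical names introduced only for this illustration.

```python
import os
import queue
from dataclasses import dataclass


@dataclass
class StorageEvent:
    """Hypothetical event record; not dCache's actual event format."""
    path: str
    size: int


def extract_metadata(event: StorageEvent) -> dict:
    """Placeholder for a community-specific extractor (assumption)."""
    return {
        "path": event.path,
        "size": event.size,
        "suffix": os.path.splitext(event.path)[1],
    }


def consume_events(events: "queue.Queue", catalog: dict) -> int:
    """Drain the queue and add one catalog entry per event."""
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        catalog[event.path] = extract_metadata(event)
        handled += 1


def run_demo() -> dict:
    """Simulate two newly written files and build the catalog."""
    events = queue.Queue()
    # The storage side pushes one event per newly written file.
    events.put(StorageEvent("/data/run1/img_0001.tif", 4096))
    events.put(StorageEvent("/data/run1/img_0002.tif", 8192))
    catalog = {}
    consume_events(events, catalog)
    return catalog
```

In this sketch the client does work only when an event has been delivered, instead of repeatedly asking the storage system whether anything has changed. A real deployment would replace the queue with whatever event channel the storage system provides.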
What data will be worked on? Data from research with photons
What tasks will this project involve? Study of the structure of the data, analysis of existing metadata systems, and design and software development of an automatic metadata generator
What makes this project interesting to work on? Metadata are essential for finding the right data and for making them available to others
What is the expected outcome? Contribution to software development
What infrastructure, programs and tools will be used? Can they be used remotely? The large DESY computing infrastructure will be used
What skills are necessary for this project? Data analytics / statistics, Scientific computation, Data mining / Machine learning, Deep learning, Databases, Data warehousing
Is the data open source? Yes
Interested candidates should be at Master's level. Volker Guelzow is looking for one visiting scientist to work on the project with the team, with Tigran Mkrtchyan (firstname.lastname@example.org) as supervisor.