9. Repository Mining of Models


Thomas Heinze

The Secure Software Engineering group at the DLR Institute of Data Science investigates and develops concepts and tools to support the development of safe software systems in research institutions such as the DLR. A particular focus is on the validation of development concepts and best practices through intelligent and data-driven analysis of the process data, metadata and artifacts that are generated in modern software development.

What is the data science project’s research question? What are the characteristics of model-based engineering, and how are model artifacts used in open-source software development projects on GitHub.com?

What data will be worked on? Depending on the progress of the work, part of the project will be to generate a dataset by mining public software repositories on GitHub.com at scale.

What tasks will this project involve? 

Based on the group’s prior work in repository mining of model artifacts on GitHub.com, generate a more comprehensive and representative dataset.

Perform initial analysis on the dataset, considering, e.g., aspects like model clones, model formats, repository maturity, etc.

Analyze and interpret the dataset with respect to the use of model-based software engineering and model artifacts in open-source projects.
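As an illustration of what the initial analysis of model formats and model clones could look like, the following is a minimal sketch, not the group's actual tooling. It assumes that the mined repositories are already checked out under a local directory, that model files can be recognized by a hypothetical set of file extensions (`MODEL_EXTENSIONS`), and that "clone" means only byte-identical files. Real clone detection on models would likely need a more tolerant notion of similarity.

```python
import hashlib
from collections import Counter, defaultdict
from pathlib import Path

# Hypothetical set of model file extensions (an assumption for illustration);
# the formats actually of interest are not fixed by the project description.
MODEL_EXTENSIONS = {".uml", ".ecore", ".xmi"}


def find_model_files(root):
    """Return all files under root whose extension is in MODEL_EXTENSIONS."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in MODEL_EXTENSIONS
    )


def count_formats(paths):
    """Count model files per (lower-cased) file extension."""
    return Counter(p.suffix.lower() for p in paths)


def exact_clone_groups(paths):
    """Group files with byte-identical content (exact clones only)."""
    by_digest = defaultdict(list)
    for p in paths:
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        by_digest[digest].append(p)
    # Only groups with at least two files are clones of each other.
    return [group for group in by_digest.values() if len(group) > 1]
```

Calling `find_model_files` on the directory holding the checked-out repositories and passing the result to `count_formats` and `exact_clone_groups` gives a first overview of which formats occur and which files are duplicated across repositories.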

What makes this project interesting to work on? Repository mining is a current and fast-developing field in software engineering providing empirical foundations for vital questions about how software is being developed. The student will gain insight into the field and obtain hands-on experience in conducting empirical software engineering research.

What is the expected outcome? Contribution to a research paper, contribution to software development, dataset

What infrastructure, programs and tools will be used? Can they be used remotely? GitHub.com is publicly accessible, but mining its repositories at scale requires dedicated infrastructure and technical setup, so we will provide access to our server infrastructure for implementing the mining. Remote work will be possible, supported by regular meetings to discuss goals, progress, and other aspects of the project.

What skills are necessary for this project? Data analytics / statistics, Data mining / Machine learning, Databases

Is the data open source? Yes

Interested candidates should be at Master's level. Thomas Heinze is looking for one visiting scientist to work on the project together with the team.