R-CCS Cafe - Special Edition (May 9, 2025) | RIKEN Center for Computational Science


Details
Date	Fri, May 9, 2025
Time	4:00 pm - 5:00 pm
City	Online
Place	Online seminar on Zoom. If you are not affiliated with R-CCS and would like to attend R-CCS Cafe, please email us at r-ccs-cafe[at]ml.riken.jp.
Organizer	RIKEN Center for Computational Science (R-CCS)
Language	Presentation Language: English; Presentation Material: English
Speakers	Dr. Andreas Knüpfer (16:00-16:30), Head of the Scientific Computing Core (SCC) department at the Center for Advanced Systems Understanding (CASUS) in Görlitz, Saxony, Germany, which is part of the Helmholtz-Zentrum Dresden-Rossendorf (HZDR); Ivan R. Ivanov (16:30-17:00), Junior Research Associate, Supercomputing Performance Research Team, RIKEN R-CCS

Talk Titles and Abstracts

1st Speaker: Dr. Andreas Knüpfer

Title:
Data Management and Machine-Actionable Reproducibility for HPC with git and Datalad
Abstract:
The talk will present two connected topics: Research Data Management (RDM) in HPC environments, and reproducibility of computational results and how it can be adopted for HPC. Research Data Management according to the F.A.I.R. principles is an established concept and is required for most, if not all, Computational Science projects, both for formal reasons (because institutions and funding agencies demand it) and for practical reasons (because it benefits researchers). Versioning of data collections is, for some reason, not part of the F.A.I.R. principles, even though it is most useful for software development and many types of text-based data collections. Git is the de facto standard for distributed version control there. The talk presents how git can be applied to general data, including large binary files. Furthermore, it presents extra benefits of such research data repositories for typical HPC projects. The second part of the talk introduces Datalad, a tool on top of git repositories, which offers machine-actionable reproducibility for computational results integrated in the git log of a data repository. Unfortunately, this is incompatible with HPC processing, because of batch processing and because git repositories don't allow concurrent access. We developed a solution for this conflict and can now offer machine-actionable reproducibility for HPC where many Slurm jobs contribute to large data repositories.
Speaker Info:
Andreas Knüpfer is the head of the Scientific Computing Core (SCC) department at the Center for Advanced Systems Understanding (CASUS) in Görlitz, Saxony, Germany, which is part of the Helmholtz-Zentrum Dresden-Rossendorf (HZDR). His background is in Applied Mathematics from his first degree and in High Performance Computing (HPC) / Computer Science from his PhD. The mission of the SCC department is research on Computational Science methods, from classical HPC simulations and parallel programming to ML/DNN models and surrogate models, as well as Research Data Management (RDM) and collaborative software development methodologies and tools. Furthermore, it collaborates with and supports all other groups at CASUS in their Computational Science challenges, enabling better science by combining application-area expertise with the best computational approaches.

2nd Speaker: Ivan R. Ivanov

Title:
Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training
Abstract:
The size and complexity of software applications are increasing at an accelerating pace. Source code repositories (along with their dependencies) require vast amounts of labor to keep them tuned, tested, maintained, and up to date. As the discipline now begins to also incorporate automatically generated programs, automation in testing and tuning is required to keep up with the pace, let alone reduce the present level of complexity. While machine learning has been used to understand and generate code in various contexts, machine learning models themselves are trained almost exclusively on static code without inputs, traces, or other execution-time information. This lack of training data limits the ability of these models to understand real-world problems in software. In this work we show that inputs, like code, can be generated automatically at scale. Our generated inputs are stateful, and appear to faithfully reproduce the arbitrary data structures and system calls required to rerun a program function. Because our tool is built within the compiler, it can both be applied to arbitrary programming languages and architectures and leverage static analysis and transformations for improved performance. Our approach is able to produce valid inputs, including initial memory states, for 90% of the ComPile dataset modules we explored, for a total of 21.4 million executable functions.

Important Notes

Please turn off your video and microphone when you join the meeting.
The broadcast may be interrupted or terminated depending on network conditions or any other unexpected event.
The program schedule and contents may be modified without prior notice.
Depending on your device and network environment, you may not be able to watch the session.
All rights concerning the broadcast material belong to the organizer and the presenters. Copying, modifying, or redistributing all or part of the broadcast material without prior permission from RIKEN is prohibited.

(May 8, 2025)