Program (subject to change)
DAY-1 : 23rd Jan. (JST)
Time | Program |
---|---|
8:30 - | Registration |
9:00 - 9:20 | Opening Message from MEXT |
9:20 - 10:10 | Keynote 1 Satoshi Matsuoka (RIKEN R-CCS) |
10:10 - 10:30 | Break |
Science by Computing: Classical, AI/ML | |
10:30 - 11:10 | Plenary Talk Robert Harrison (Stony Brook University) |
11:10 - 11:35 | Invited Talk Masateru Ohta (RIKEN R-CCS) |
11:35 - 12:00 | Invited Talk Ayori Mitsutake (Meiji University) |
12:00 - 13:20 | Lunch |
Quantum Computing | |
13:20 - 14:00 | Plenary Talk Hiroshi Horii (IBM) |
14:00 - 14:40 | Plenary Talk Masahiro Horibe (AIST) |
14:40 - 15:05 | Invited Talk Miwako Tsuji (RIKEN R-CCS) |
15:05 - 15:30 | Break |
Science of Computing: Classical, AI/ML | |
15:30 - 16:10 | Plenary Talk John Shalf (LBNL) |
16:10 - 16:35 | Invited Talk Kenji Tanaka (NTT) |
16:35 - 17:00 | Invited Talk Ryousei Takano (AIST) |
17:00 - 18:20 | Poster Session |
18:30 - 20:00 | Reception (Optional) |
DAY-2 : 24th Jan. (JST)
Time | Program |
---|---|
9:00 - 9:30 | Fugaku Tour at R-CCS |
9:30 - | Registration |
10:10 - 10:30 | Group Photo at Room 301, Kobe International Conference Center |
Quantum Science | |
10:30 - 11:00 | Invited Talk Roland Farrell (Caltech) |
11:00 - 11:30 | Invited Talk Masanao Ozawa (Chubu University) |
11:30 - 12:00 | Invited Talk Kentaro Yamamoto (Quantinuum) |
12:00 - 13:30 | Lunch |
13:30 - 14:10 | Keynote 2 Christophe Calvin (CEA) |
14:10 - 14:30 | Break |
International Activity (EU/HANAMI, DOE) | |
14:30 - 15:10 | Plenary Talk Franck Cappello (ANL) |
15:10 - 15:35 | Invited Talk Mohamed Wahib (RIKEN R-CCS) |
15:35 - 16:00 | Invited Talk Brian C. Van Essen (LLNL) |
16:00 - 16:10 | Closing |
Program
- Keynote 1 (DAY-1 : Jan 23 9:20 - 10:10)
-
- Satoshi Matsuoka (RIKEN R-CCS)
-
Towards AI "Zettascale" FugakuNEXT (@40MW) --- How Simulation, AI for Science and Quantum-HPC will Innovate Society
-
Japan's flagship supercomputer Fugaku has achieved notable milestones, including being named a 2023 Gordon Bell Prize finalist for climate modeling and supporting real-time torrential rain forecasts through its "Fugaku siblings." Meanwhile, the new "Virtual Fugaku" environment on AWS Graviton extends Fugaku's software stack into the cloud, broadening access and exemplifying our overall strategy to leverage HPC for societal impact. These achievements pave the way for our next advances in AI-driven science ("AI for Science"), quantum-HPC integration, and the upcoming FugakuNEXT system targeting "Zettascale" performance.
AI for Science merges large-scale simulations with robust AI methods such as generative models and LLMs (e.g., FugakuLLM). Beyond LLM training, we envision using AI-driven surrogates, robotic systems, and foundation models to accelerate innovation across various domains, from drug discovery to structural engineering. Our TRIP-AGIS program (2024-2031) will further expand AI's impact, bolstered by seamless integration with Fugaku's massive computational capabilities.
On the quantum-HPC front, the JHPC Quantum Project develops new software stacks and libraries to integrate quantum computers with supercomputers for workloads such as quantum chemistry. By 2025, RIKEN plans to install two distinct quantum platforms (IBM and Quantinuum) to form a hybrid HPC infrastructure with Fugaku, thus enabling exploration of NISQ-era applications.
Finally, our FugakuNEXT feasibility study examines heterogeneous node architectures to address both high-precision simulations and low-precision AI under a 40MW power envelope. By integrating matrix accelerators, advanced networking, and high-bandwidth memory, we aim to deliver “Zettascale” capabilities in simulation, AI, and quantum-HPC hybrid computing—pushing the frontiers of computational science for future discoveries.
- Science by Computing: Classical, AI/ML Plenary Talk (DAY-1 : Jan 23 10:30 - 11:10)
-
- Robert Harrison (Stony Brook University)
-
Co-design of an Ecosystem for Programming and Executing eXtreme-scale Applications
-
Complexity is overwhelming our collective ability to advance the frontiers of science by exploiting continued progress and revolutions in computing technology. Multidisciplinary science applications and their associated multiple numerical representations are increasingly seeking accuracy and performance through asymptotically fast but irregular algorithms that are poorly supported by current programming paradigms on modern computer architectures.
The EPEXA collaboration is creating a production-quality, general-purpose, community-supported open-source software ecosystem. It attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on massively parallel, hybrid, many-core systems. Specifically, through science-driven co-design, we aim to transition into production a successful research prototype of a powerful new data-flow programming model and associated parallel runtime, and to accelerate the growth of the community of computer scientists and domain scientists employing these tools for their research. Crucially, the new ecosystem is focused upon facilitating high-performance implementations of irregular and dynamic computations, while also enabling efficient execution of regular computations such as dense (multi)linear algebra. In this talk I will examine some of the key design concepts and technical details, and provide initial performance results.
I will also introduce Empire AI, a new $0.5B initiative in New York State that closes the gap between AI research in academia and industry. A consortium of New York State's leading academic institutions in partnership with the State will provision world class AI compute resources for academic research in the public good.
The EPEXA project is supported by the National Science Foundation under grants OAC-1931387 at Stony Brook University, OAC-1931347 at Virginia Tech, and ACI-1450300 at the University of Tennessee, Knoxville.
- Science by Computing: Classical, AI/ML Invited Talk (DAY-1 : Jan 23 11:10 - 11:35)
-
- Masateru Ohta (RIKEN R-CCS)
-
AI-Powered Transformation of Simulations in Drug Discovery
-
We are living in an era where high-performance computing (HPC) and artificial intelligence (AI) are transforming numerous fields. In the realm of drug discovery, my area of expertise, I have been leveraging these technologies to drive my research. In this lecture, I will present two illustrative examples that highlight the synergy of these cutting-edge technologies.
The first topic focuses on the water molecules surrounding proteins. Understanding the behavior of these water molecules is crucial for drug design, as a drug must displace the water at the binding site on the target protein.
Using 3D-RISM methods, we can estimate the probability of water molecule distribution around a protein, provided its 3D structure is known. Leveraging HPC, we conducted comprehensive 3D-RISM calculations on 3,706 proteins, uncovering detailed properties of water surrounding these proteins [1].
Building on these results, we developed a deep learning model trained to replicate the water probability distributions derived from the 3D-RISM [2]. This AI enables rapid estimation of water distributions around proteins within seconds, bypassing the need for 3D-RISM calculations, which typically require several hours.
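As a rough sketch of this kind of AI surrogate (the input channels, layer sizes, and grid shapes below are assumptions for illustration, not the published gr Predictor architecture), a small 3D convolutional network can be trained to map a voxelized protein grid onto the water distribution computed by 3D-RISM:

```python
# Schematic only: a tiny 3D CNN surrogate trained to reproduce 3D-RISM water
# distributions from a voxelized protein grid. Shapes and channels are assumptions.
import torch
import torch.nn as nn

class WaterGridNet(nn.Module):
    def __init__(self, in_channels=4):            # e.g., a few atom-type occupancy channels
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 1, kernel_size=1), nn.Softplus(),   # non-negative water density
        )

    def forward(self, grid):
        return self.net(grid)

model = WaterGridNet()
protein_grid = torch.randn(2, 4, 32, 32, 32)       # batch of voxelized protein structures
rism_target = torch.rand(2, 1, 32, 32, 32)         # 3D-RISM water distributions (training target)
loss = nn.functional.mse_loss(model(protein_grid), rism_target)
loss.backward()                                    # gradients for one training step
print(float(loss))
```

Once such a surrogate is trained on many protein/3D-RISM pairs, a forward pass takes seconds, which is the speed-up described above.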
The second topic addresses docking simulations of an exceptionally large number of compounds, based on the 3D structures of proteins. Traditionally, pharmaceutical companies have relied on compound libraries and employed high-throughput screening (HTS) to assay these compounds. However, with the number of synthesizable compounds now reaching billions, HTS is no longer feasible. To overcome this challenge, a new process is needed: create virtual libraries in silico, perform docking simulations to identify potential binders, and then assay a much smaller subset of compounds.
In collaboration with 10 pharmaceutical and agrochemical companies, we utilized Fugaku to conduct docking simulations on a virtual library of 100 million (M) synthesizable compounds against 12 target proteins. Docking 100M compounds to a single protein required 184,000 node hours (NH), while the cumulative calculations for all 12 proteins totaled 2.21M NH. These simulations provided docking scores for 100M compounds, reflecting their predicted binding affinities for the 12 proteins.
Given the substantial computational resources required for docking 100M compounds, we developed an AI-based system to efficiently prioritize compounds with high docking scores while significantly reducing the number of docking operations. By docking only 1% of the 100M compounds, the system successfully identified 74.6% of the top 100 scoring compounds and 68.6% of the top 1,000 scoring compounds, on average, across the 12 proteins.
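The prioritization idea can be sketched as follows; the dock() and fingerprint() helpers below are hypothetical stand-ins, and the loop is a generic surrogate-screening pattern rather than the authors' actual system:

```python
# Illustrative only: dock a small random subset, train a cheap surrogate on the
# resulting scores, and use it to decide which compounds are worth docking next.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fingerprint(compound):                 # hypothetical featurizer (e.g., a bit vector)
    rng = np.random.default_rng(abs(hash(compound)) % 2**32)
    return rng.integers(0, 2, size=256)

def dock(compound):                        # hypothetical, expensive docking call
    return float(fingerprint(compound)[:32].sum())   # toy stand-in for a docking score

library = [f"cmpd_{i}" for i in range(10_000)]        # stands in for the 100M library
seed = list(np.random.default_rng(0).choice(library, size=100, replace=False))  # ~1%
X_seed = np.array([fingerprint(c) for c in seed])
y_seed = np.array([dock(c) for c in seed])            # only these are actually docked

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_seed, y_seed)
rest = [c for c in library if c not in set(seed)]
pred = model.predict(np.array([fingerprint(c) for c in rest]))
shortlist = [rest[i] for i in np.argsort(pred)[::-1][:100]]   # dock only these next
print(shortlist[:5])
```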
These two examples demonstrate how AI technology can significantly accelerate processes and enhance efficiency by leveraging the results of simulations performed with HPC.
[1] T. Yoshidome, M. Ikeguchi, M. Ohta, “Comprehensive 3D-RISM analysis of the hydration of small molecule binding sites in ligand-free protein structures”, Journal of Computational Chemistry, 2020, 41, pp 2406-2419
[2] K. Kawama, Y. Fukushima, M. Ikeguchi, M. Ohta, and T. Yoshidome, “gr Predictor: A Deep Learning Model for Predicting the Hydration Structures around Proteins”, J. Chem. Inf. Model., 2022, 62, 4460-4473
- Science by Computing: Classical, AI/ML Invited Talk (DAY-1 : Jan 23 11:35 - 12:00)
-
- Ayori Mitsutake (Meiji University)
-
Investigating stability and dynamics of proteins using molecular dynamics simulations and efficient analysis methods
-
To investigate the stability and dynamics of proteins using computer simulations, we have developed molecular simulation algorithms. Specifically, we have developed generalized-ensemble algorithms (GEAs) to enhance the conformational sampling of protein systems [1,2]. Additionally, we have developed analysis methods for proteins, such as relaxation mode analysis and the three-dimensional reference interaction site model (3D-RISM) theory. Relaxation mode analysis is a dynamical analysis method for extracting slow relaxation modes. We have focused on investigating the dynamics of G protein-coupled receptors (GPCRs), which are membrane proteins, through molecular simulations and relaxation mode analysis. In this work, we primarily present results related to the simulations of GPCRs.
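As a schematic of the slow-mode extraction step (a simplified time-lagged correlation analysis; the actual relaxation mode analysis formulation used in this work may differ in detail), one can solve a generalized eigenvalue problem built from equal-time and time-lagged covariance matrices of a trajectory:

```python
# Simplified sketch: estimate slow modes of a trajectory from the generalized
# eigenvalue problem  C(tau) v = lambda C(0) v  built from covariance matrices.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
traj = np.cumsum(rng.standard_normal((5000, 6)), axis=0)   # toy trajectory (frames x coordinates)
traj -= traj.mean(axis=0)

tau = 10
m = len(traj) - tau
C0 = traj[:-tau].T @ traj[:-tau] / m                        # equal-time covariance
Ctau = traj[:-tau].T @ traj[tau:] / m                       # time-lagged covariance
Ctau = 0.5 * (Ctau + Ctau.T)                                # symmetrize

evals, evecs = eigh(Ctau, C0)                               # generalized eigenproblem
slowest_mode = evecs[:, np.argmax(evals)]                   # largest eigenvalue <-> slowest mode
print(np.sort(evals)[::-1][:3])
```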
- Quantum Computing Plenary Talk (DAY-1 : Jan 23 13:20 - 14:00)
-
- Hiroshi Horii (IBM)
-
TBA
-
Quantum-centric supercomputing (QCSC) is a revolutionary approach to computer science that combines quantum computing with high-performance computing to create a computing system capable of solving highly complex real-world problems. In this talk, I would like to give an update on recent research at IBM Quantum on QCSC software and applications.
- Quantum Computing Plenary Talk (DAY-1 : Jan 23 14:00 - 14:40)
-
- Masahiro Horibe (AIST)
-
TBA
-
TBA
- Quantum Computing Invited Talk (DAY-1 : Jan 23 14:40 - 15:05)
-
- Miwako Tsuji (RIKEN R-CCS)
-
2nd Year in Quantum-HPC Hybrid Platform
-
The evolution of quantum computers in recent years has been remarkable. Exploiting quantum computers as new accelerators in supercomputers will expand computing capability in various areas. We started building a platform that integrates quantum computers and supercomputers in 2023. This talk focuses on the development of the quantum-HPC hybrid platform: design, implementation, and demonstration.
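As a generic illustration of the hybrid pattern (this is not the platform's actual software stack; a local Qiskit Aer simulator stands in for the remote quantum machines), a classical HPC-side driver can offload a circuit to a quantum backend and post-process the returned counts:

```python
# Generic hybrid sketch: classical driver loop around a quantum "kernel".
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

def run_quantum_kernel(theta, shots=1024):
    qc = QuantumCircuit(2, 2)
    qc.ry(theta, 0)
    qc.cx(0, 1)
    qc.measure([0, 1], [0, 1])
    backend = AerSimulator()                        # stands in for a remote QPU
    job = backend.run(transpile(qc, backend), shots=shots)
    return job.result().get_counts()

# Classical (HPC-side) parameter sweep around the quantum kernel
for theta in (0.1, 0.5, 1.0):
    counts = run_quantum_kernel(theta)
    print(theta, counts.get("11", 0) / 1024)
```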
- Science of Computing: Classical, AI/ML Plenary Talk (DAY-1 : Jan 23 15:30 - 16:10)
-
- John Shalf (LBNL)
-
Investigating Open Chiplets for HPC Beyond Exascale
-
After three decades of phenomenal performance growth (1000x every 11 years), HPC performance improvement rates have dropped below 5x every 10 years. Architecture specialization is a promising approach to extract more performance, but challenges remain in creating a viable economic model for delivering it in mainstream products. Chiplets have become a compelling approach to scaling and heterogeneous integration, e.g., integrating workload-specific processors and massive bandwidth memory systems into computing systems; integrating die from multiple function-optimized process nodes into one product; and integrating silicon from multiple businesses into one product. Chiplet-based products have been produced in high volume by multiple companies using proprietary chiplet ecosystems. Recently, the community has proposed several new standards (e.g., UCIe) to facilitate integration and interoperability of any compliant chiplet. Hyperscalers (e.g., Google, Amazon) are actively designing high-volume products with chiplets through these open interfaces. Other communities are exploring the end-to-end workflow and tooling to assemble chiplet-based products. High performance computing can benefit from this trend. However, the performance, power, and thermal requirements unique to HPC present many challenges to realizing a vision for affordable, modular HPC using this new approach.
- Science of Computing: Classical, AI/ML Invited Talk (DAY-1 : Jan 23 16:10 - 16:35)
-
- Kenji Tanaka (NTT)
-
Enabling High-Efficiency Multi-Site Data Centers
-
In recent years, the demand for data centers (DCs) has grown exponentially. At the same time, the importance of interconnecting multiple DCs and operating them in a geographically distributed manner to prepare for disaster risks is rising more than ever. However, geographic separation introduces various forms of overhead, traditionally making it difficult to achieve close coordination among DCs. In this presentation, we will focus on two key challenges: the CPU load overhead caused by network packet processing and the latency overhead associated with seamless accelerator utilization. We will introduce a system architecture designed to overcome these issues, as well as the computing infrastructure that NTT envisions through the Innovative Optical and Wireless Network (IOWN) project.
- Science of Computing: Classical, AI/ML Invited Talk (DAY-1 : Jan 23 16:35 - 17:00)
-
- Ryousei Takano (AIST)
-
ABCI 3.0: Evolution of the leading AI infrastructure in Japan
-
ABCI 3.0 is the latest version of ABCI, the large-scale open AI infrastructure that AIST has operated since August 2018; it will be fully operational in January 2025. ABCI 3.0 consists of computing servers equipped with 6,128 NVIDIA H200 GPUs and an all-flash storage system. Its peak performance is 6.22 exaflops in half precision and 3.0 exaflops in single precision, which is 7 to 13 times faster than the previous system, ABCI 2.0. It also more than doubles both storage capacity and theoretical read/write performance. ABCI 3.0 is expected to accelerate research and development, evaluation, and workforce development of cutting-edge AI technologies, with a particular focus on generative AI.
- Quantum Science Invited Talk (DAY-2 : Jan 24 10:30 - 11:00)
-
- Roland Farrell (Caltech)
-
Simulations of scattering in quantum field theories using a quantum computer
-
The simulation of scattering between particles in a quantum field theory is needed to deepen our understanding of dynamics in non-integrable systems and of particle production in high-energy collisions. While these simulations exceed the capabilities of even the most powerful classical supercomputers, they are believed to be achievable with quantum computers. In this talk I will introduce a new technique for efficiently preparing the initial state of a scattering simulation (a wavepacket) on a quantum computer. This method reduces the problem of wavepacket preparation to a minimization of the energy subject to a set of constraints. This procedure is then demonstrated in multiple one-dimensional lattice quantum field theories. Progress toward performing quantum simulations of scattering and obtaining a quantum advantage will be discussed.
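The core idea, energy minimization under constraints, can be illustrated classically with a toy one-dimensional lattice model (real amplitudes only; the Hamiltonian, constraints, and optimizer below are assumptions for the sketch, not the speaker's algorithm):

```python
# Toy sketch: prepare a localized, low-energy "wavepacket" on a 1D lattice by
# minimizing <psi|H|psi> subject to normalization, mean-position, and spread constraints.
import numpy as np
from scipy.optimize import minimize

n = 16
H = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # discrete Laplacian as a toy Hamiltonian
x = np.arange(n)
center, width = n // 2, 2.0                             # desired packet location and spread

def energy(psi):
    return psi @ H @ psi

constraints = [
    {"type": "eq", "fun": lambda p: p @ p - 1.0},                          # normalization
    {"type": "eq", "fun": lambda p: (p**2) @ x - center},                  # mean position
    {"type": "eq", "fun": lambda p: (p**2) @ (x - center)**2 - width**2},  # spread
]

psi0 = np.exp(-0.5 * ((x - center) / width) ** 2)       # Gaussian initial guess
psi0 /= np.linalg.norm(psi0)
res = minimize(energy, psi0, constraints=constraints)
wavepacket = res.x / np.linalg.norm(res.x)
print("constrained energy:", energy(wavepacket))
```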
- Quantum Science Invited Talk (DAY-2 : Jan 24 11:00 - 11:30)
-
- Masanao Ozawa (Chubu University)
-
Testing uncertainty principle using currently available quantum computer platforms
-
Heisenberg's uncertainty principle states that non-commuting observables can be measured simultaneously only with a characteristic relation for their errors known as Heisenberg's uncertainty relation. However, Heisenberg's relation was shown not to be universally valid, and a new universally valid relation was derived in 2003. The violation of Heisenberg's original relation and the validity of the new relation were experimentally demonstrated in 2012, and later in both neutron-spin and photon-polarization measurements. In this talk, we propose experimental studies of the uncertainty principle using currently available quantum computer platforms.
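As a minimal, hedged example of the kind of platform experiment being proposed (this checks only the textbook Robertson preparation-uncertainty relation on an exact statevector; the error-disturbance relations discussed in the talk require more elaborate measurement circuits):

```python
# Minimal sketch: verify  dX * dZ >= |<Y>| / 2  for a single-qubit state in Qiskit.
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector, Pauli

qc = QuantumCircuit(1)
qc.ry(0.7, 0)                                     # prepare a test state (angles chosen arbitrarily)
qc.rz(0.5, 0)
state = Statevector.from_instruction(qc)

exp = {p: float(np.real(state.expectation_value(Pauli(p)))) for p in ("X", "Y", "Z")}
dX = np.sqrt(1.0 - exp["X"] ** 2)                 # for a Pauli P, Var(P) = 1 - <P>^2
dZ = np.sqrt(1.0 - exp["Z"] ** 2)
print(f"dX*dZ = {dX*dZ:.3f} >= |<Y>|/2 = {abs(exp['Y'])/2:.3f}")
```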
- Quantum Science Invited Talk (DAY-2 : Jan 24 11:30 - 12:00)
-
- Kentaro Yamamoto (Quantinuum)
-
Tierkreis: A Dataflow Framework for Hybrid Quantum–Classical Computing
-
We present Tierkreis, a higher-order dataflow graph program representation and runtime designed for compositional, quantum-classical hybrid algorithms. The design of the system is motivated by the remote nature of quantum computers, the need for hybrid algorithms to involve distributed computing, and the long-running nature of these algorithms. The graph-based representation allows automatic parallelism and asynchronicity. We also give a perspective on the ongoing deployment of Tierkreis to the Fugaku supercomputer and its application to quantum chemistry.
- Keynote 2 (DAY-2 : Jan 24 13:30 - 14:10)
-
- Christophe Calvin (CEA)
-
International collaboration in High Performance Computing: Success and Challenges
-
The High Performance Computing (HPC) landscape has changed drastically in the last decade. Strategic areas such as Quantum Computing and Artificial Intelligence have reshaped traditional HPC, from the scientific field to the industry ecosystem. Science, especially when identified as a strategic area, is also subject to geopolitical forces: strategic regions and partners may change, affecting several research areas positively or negatively.
In this talk, we will explore the current context of HPC at the international level and how the "merging" of HPC, AI, and QC is changing the paradigms. We will present several key collaborations regarded as successes in these strategic areas (HPC, QC, and AI). We will then discuss the current and upcoming challenges we have identified regarding international collaboration and present several perspectives for overcoming them.
- International Activity (EU/HANAMI, DOE) Plenary Talk (DAY-2 : Jan 24 14:30 - 15:10)
-
- Franck Cappello (ANL)
-
AuroraGPT/Eval: Establishing a methodology to evaluate LLMs/FMs as Research Assistants
-
The capabilities of large language models such as ChatGPT, Claude, Gemini, and Llama have progressed dramatically in the past 2-3 years, raising the question of using them as research assistants. Moreover, recent results and publications suggest that future generations of LLMs may exceed the skills of scientists. However, while many benchmarks exist to assess the general language skills of these models, there is no established methodology to evaluate them as scientific assistants. This talk introduces and presents the current state of the effort at Argonne, in the context of the AuroraGPT project, to establish a methodology to rigorously evaluate the capabilities, trustworthiness, and safety of LLMs as research assistants. As we will show, this is a complex open problem demanding expertise in domain sciences (including computer science), AI, and psychometrics.
- International Activity (EU/HANAMI, DOE) Invited Talk (DAY-2 : Jan 24 15:10 - 15:35)
-
- Mohamed Wahib (RIKEN R-CCS)
-
Ongoing Collaborative Research with Europe (EU HANAMI Project)
-
In this talk, we highlight ongoing research activities between RIKEN-CCS researchers and their counterparts from Europe, in particular the EU HANAMI project. We identify some of the common challenges in preparing for the next era of HPC, which will increasingly rely on advanced AI and data analytics.
- International Activity (EU/HANAMI, DOE) Invited Talk (DAY-2 : Jan 24 15:35 - 16:00)
-
- Brian C. Van Essen (LLNL)
-
Scale-Free Fractal Benchmark: Working to develop a scalable AI Benchmark for HPC-AI Systems
-
In this talk, we will present our current work on creating a scalable AI benchmark for AI4Science HPC systems. The goals of this benchmark are to test the accelerator-to-accelerator communication algorithms, compute kernels, and I/O requirements of neural network models that are common in scientific AI workloads, and to create a model that can be used to test systems from a single node all the way up to full-scale HPC-AI systems. We will present the design philosophy of the benchmark as well as some open challenges that we are working to address.
List of Accepted Posters
- 1. Reinforcement learning-based statistical search strategy for an axion model from flavor
Satsuki Nishimura(Kyushu University)*; Coh Miyao(Kyushu University); Hajime Otsuka(Kyushu University) - The Standard Model of particle physics describes the behavior of elementary particles with high accuracy, but many problems remain. For example, the Standard Model does not explain the origin of the mass hierarchy of matter particles. In addition, the difference in flavor mixing between quarks and leptons is also mysterious. We therefore propose a reinforcement learning-based search strategy to explore new physics beyond the Standard Model. Reinforcement learning, one of the machine learning methods, is a powerful approach to finding model parameters under phenomenological constraints. As a concrete example, we focus on a minimal axion model with a global U(1) flavor symmetry. The learning agents succeed in finding U(1) charge assignments of quarks and leptons that solve the flavor and cosmological puzzles in the Standard Model, and find more than 150 realistic solutions for the quark sector when renormalization effects are taken into account. For the solutions found by the reinforcement learning-based analysis, we discuss the sensitivity of future experiments for the detection of an axion, which is a Nambu-Goldstone boson of the spontaneously broken U(1). We also examine how fast the reinforcement learning-based search method finds the best discrete parameters in comparison with conventional optimization methods. In conclusion, the efficient parameter search based on the reinforcement learning strategy enables us to perform a statistical analysis of the vast parameter space associated with the axion model from flavor. The reference is arXiv:2409.10023 [hep-ph].
- 2. 2D-replica exchange simulation of membrane permeation process of cyclic hexapeptides
Tsutomu Yamane(RIKEN Center for Computational Science)*; Masateru Ota(RIKEN Center for Computational Science); Mitsunori Ikeguchi(RIKEN Center for Computational Science, Graduate School of Medical Life Science, Yokohama City University) -
Cyclic peptides, composed of approximately ten amino acid residues, form a cyclic structure by forming bonds between their main or side chains. These molecules are of sufficient size to inhibit protein-protein interactions, and notably, some cyclic peptides are capable of permeating membranes. These properties make cyclic peptides excellent candidates for drug development aimed at inhibiting intracellular protein-protein interactions. However, the challenge of predicting the membrane permeability of cyclic peptides remains a major obstacle to their use in drug development. This is because the membrane permeation process of cyclic peptides is a complex process involving conformational changes in response to different hydrophilic (in water) and hydrophobic (in membrane) surrounding environments. To gain a comprehensive understanding of the membrane permeation process, it is essential to conduct a detailed analysis using molecular dynamics (MD) simulations of the membrane-water system. The conventional MD simulation approaches are often inadequate for following cyclic peptides' slow membrane permeation process due to their limited time scales.
In the present study, we utilized MD simulations enhanced by the gREST/REUS method [1], a sophisticated type of two-dimensional replica exchange method designed to improve the sampling of conformational changes of cyclic peptides and their translocation from water to the membrane. We conducted 400 ns of MD simulations using the gREST/REUS method with a total of 120 replicas, which includes six replicas of gREST for conformational sampling and 20 replicas of REUS for monitoring the translocation from water to membrane. These simulations were carried out using the GENESIS molecular dynamics simulation program [2] on the supercomputer Fugaku.
Our research specifically targeted 62 cyclic hexapeptides with known membrane permeability, as determined by Caco2 and PAMPA assays [3], and performed MD simulations with the gREST/REUS method. In this study, we specifically focused on 11 cyclic hexapeptides with comparable hydrophobicity indices (MlogP) but markedly different membrane permeation rates. We investigated their behavior during the membrane permeation process to uncover the underlying factors influencing their permeability.
References: [1] S. Re, H. Oshima, et al., Proc. Natl. Acad. Sci. USA, 116, 18404-18409 (2019). [2] C. Kobayashi, et al., J. Comput. Chem., 38, 2193-2206 (2017). [3] C.K. Wang, et al., Eur. J. Med. Chem., 97, 202-213 (2015).
Supported by: The HPCI System Application Proposal (Proposal No. ra000017) with the computational resources of the supercomputer "Fugaku" provided by RIKEN.
- 3. Optimizing the GENESIS Molecular Dynamics Software on the AMD CPU/GPU LUMI Supercomputer for Simulation of Large Biological Systems
Diego Ugarte La Torre(RIKEN R-CCS)*; Jaewoon Jung(RIKEN R-CCS,RIKEN CPR); Yuji Sugita(RIKEN R-CCS,RIKEN BDR,RIKEN CPR) - This work explores optimizing the GENESIS molecular dynamics software for LUMI, a supercomputer powered by AMD EPYC processors and AMD Instinct GPUs. By addressing critical performance challenges, we enhance GENESIS, leveraging the computational resources of the LUMI-G hardware partition. Key improvements include mitigating CPU-GPU communication bottlenecks, selecting optimal Particle Mesh Ewald (PME) kernels for the electrostatic calculations, and tailoring computational kernels for AMD GPU architecture. We find that LUMI-G's high-speed interconnects enable the reciprocal-space PME calculations to outperform those on FUGAKU and that for large-scale simulations, GENESIS achieves a performance of 200.8 nanoseconds per day using 1,024 nodes on LUMI-G. Finally, we provide practical recommendations for using GENESIS on LUMI to obtain optimal performance in the simulation of large biomolecular systems.
- 4. Monitoring Performance From An Application Perspective
Sriram Swaminarayan(Los Alamos National Laboratory )*; Carola Ellinger(Los Alamos National Laboratory ) - While microbenchmarks are useful for monitoring machine health, they are akin to unit tests, testing each component of the system independently. On the other hand, system health measured by running large complex applications shows us a different view of system performance – one that more immediately affects the user’s experience on the machine. In this poster we present work we have done to continuously monitor the health of large complex multi-physics applications at Los Alamos National Laboratory on some of the world’s largest supercomputers and the different technologies we use to make this happen. The applications themselves consist of more than a million lines of code each and have been modified so that we use common technologies for monitoring performance nightly on a prespecified set of problems. All applications measure performance in a consistent method using Caliper. RabbitMQ is used as a message broker to transfer data to a Splunk datastore. The performance results stored in Splunk can be viewed via dashboards. Using this system we have been able to classify nightly variations in performance as due to machine variability or changes to source code.
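A minimal sketch of the transport step described above (the host, queue name, and record fields are assumptions, not the production configuration): a nightly performance record is published to RabbitMQ with the standard pika client, from where it can be forwarded into Splunk.

```python
# Minimal sketch: publish one nightly performance record to a RabbitMQ queue.
import json
import pika

record = {"app": "multiphysics_app", "problem": "nightly_case_3",
          "caliper_region": "main_loop", "time_s": 123.4, "nodes": 512}

conn = pika.BlockingConnection(pika.ConnectionParameters(host="mq.example.org"))
channel = conn.channel()
channel.queue_declare(queue="perf-results", durable=True)
channel.basic_publish(exchange="", routing_key="perf-results", body=json.dumps(record))
conn.close()
```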
- 5. ORCHA: A Performance Portability System For Non-C++ Codes
Anshu Dubey(Argonne National Laboratory,University of Chicago)*; Younjun Lee(Argonne National Laboratory); Klaus Weide(University of Chicago); Wesley Kwiecinski(University of Illinois Chicago); Johann Rudi(Virginia Tech); Jared O'Neal(Argonne National Laboratory) ; Mohamed Wahib(Riken-CCS) -
Applications can effectively utilize heterogeneous platforms if they have data structures and algorithms suitable for target devices, can conceptualize a map of computation to target devices, and can execute the map by moving data and computation to devices efficiently. We have designed a performance portability orchestration system, ORCHA, to address these challenges through abstractions and code generation. The tools are designed such that each tool focuses on a small subset of abstractions and code generation with similar requirements, but substantially different from those addressed by the other tools. Through this divide-and-conquer approach, the tools have been kept relatively simple and customizable, but their combination provides a powerful performance portability solution. In this poster we show the results of using ORCHA for real applications with different hardware configurations, as well as the performance of using GPUs with and without ORCHA.
- 6. Adaptive Super-Resolution Approaches for Enhanced Video Compression and Reconstruction
Alexis Amoyo(Florida State University); Amarjit Singh(RIKEN R-CCS); Kento Sato(RIKEN R-CCS); Weikuan Yu(Florida State University) -
This work introduces TEZip, a compression and decompression framework that takes advantage of temporal patterns in data. By applying a deep neural network model called PredNet, we improve how well video-like data can be compressed. Building on this approach, we explore more advanced video super-resolution (VSR) models to handle high-resolution images, where restoring quality from lower resolutions becomes more challenging.
In particular, we apply various downsampling and compression methods to the Vimeo90K dataset to see how image quality changes. We start by downsampling images by a factor of four and then apply different techniques, such as bicubic interpolation (BI) and blur-based downsampling (BD), to further reduce size. These methods let us examine the balance between compression ratio and the resulting image quality, which is measured with metrics like PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity), and MSE (Mean Square Error).
We also evaluated two state-of-the-art VSR models, BasicVSR++ and VRT. BasicVSR++ enhances performance by using improved temporal propagation and alignment, effectively dealing with frame-to-frame motion. Meanwhile, VRT uses a transformer-based architecture to capture long-range dependencies, offering a different path to high-quality reconstruction from low-resolution inputs.
By comparing pre-trained models specialized for different compression techniques—like BI and BD—we identify the most effective methods for various scenarios. This study guides the selection of strategies to achieve better balance between file size reduction and image quality in high-resolution video scenarios.
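A minimal sketch of the quality-evaluation step (a stock scikit-image test image stands in for a Vimeo90K frame; this is not the project's evaluation code): downsample a frame by a factor of four, upsample it back, and score the reconstruction with PSNR, SSIM, and MSE.

```python
# Minimal sketch: 4x downsample/upsample with a cubic spline and score the result.
import numpy as np
from skimage import data, transform, metrics

frame = data.astronaut().astype(np.float64) / 255.0                  # stand-in video frame
low = transform.resize(frame, (frame.shape[0] // 4, frame.shape[1] // 4, 3), order=3)
rec = transform.resize(low, frame.shape, order=3)                    # reconstruct to full size

psnr = metrics.peak_signal_noise_ratio(frame, rec, data_range=1.0)
ssim = metrics.structural_similarity(frame, rec, channel_axis=-1, data_range=1.0)
mse = metrics.mean_squared_error(frame, rec)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.3f}, MSE = {mse:.5f}")
```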
Acknowledgement
This work has been supported by the COE research grant in computational science from Hyogo Prefecture and Kobe City through Foundation for Computational Science.
This work ("AI for Science" supercomputing platform project) was supported by the RIKEN TRIP initiative (System software for AI for Science).
- 7. Harnessing Computational Material Science for the Design of Photo-active Nanomaterials in Sustainable Technologies
Amrit Sarmah(RIKEN-CCS)* - Photo-active nanomaterials are pivotal in advancing light-driven applications across energy, environmental, and electronic sectors. Their successful exploitation hinges on a profound understanding of the interplay between their electronic structure, topology, and synthetic methodologies. Leveraging computational material science, we integrate experimental data with advanced theoretical models to elucidate the electronic behavior of these materials. This integrated approach not only facilitates the rational design of photo-active nanomaterials but also clarifies the discrepancies between experimental observations and theoretical predictions. By employing cutting-edge computational techniques, we accelerate the discovery of novel materials with optimized properties, paving the way for innovative solutions to global energy and environmental challenges.
- 8. Deterministic and Ensemble forecasts of the Kuroshio south of Japan
Shun Ohishi(RIKEN Center for Computational Science (R-CCS))*; Takemasa Miyoshi(RIKEN Center for Computational Science (R-CCS)); Misako Kachi(JAXA) -
The Kuroshio flows eastward along the southern coast of Japan and has various flow paths, such as straight and large-meander paths, with a wide range of timescales from interannual to decadal. These variations significantly influence fisheries, marine transportation, and the marine environment (e.g., Nakata et al. 2000; Barreto et al. 2021). Japanese research institutions have investigated Kuroshio prediction using regional ocean data assimilation systems with the Kalman filter (Hirose et al. 2013) and the three- and four-dimensional variational methods (Miyazawa et al. 2017; Kuroda et al. 2017; Hirose et al. 2019). However, these assimilation approaches are not designed for ensemble forecasts, and the predictions have been limited to deterministic forecasts.
Therefore, we developed a new local ensemble transform Kalman filter (LETKF)-based regional ocean data assimilation system (Ohishi et al. 2022a, b) and released ensemble ocean analysis datasets called the LETKF-based Ocean Research Analysis (LORA) for the western North Pacific and Maritime Continent regions (Ohishi et al. 2023, 2024a, b). The LORA datasets demonstrate sufficient accuracy for geoscience research, particularly in the mid-latitude regions (Ohishi et al. 2023). Here, we perform deterministic and ensemble forecasts initialized by the LORA and compare them in terms of the predictability of the Kuroshio path.
We conducted 6-month deterministic and ensemble forecasts initialized on the first day of each month from January 2016–December 2018 totaling 36 cases using the initial conditions of the ensemble mean analyses and 128-ensemble analyses from the LORA, respectively. The results show that the predictability limit of the ensemble mean forecast is estimated to be 108 days, demonstrating for the first time that the ensemble mean forecasts achieve a significantly longer predictability limit than the 74-day limit of the deterministic forecast for the Kuroshio path.
- 9. Open Composer: A Web Application for Simplifying Job Submission on HPC Clusters
Masahiro Nakao(RIKEN Center for Computational Science)*; Keiji Yamamoto(RIKEN Center for Computational Science) - Using HPC clusters requires extensive prerequisite knowledge, such as Linux commands and job schedulers, which poses a high learning cost for beginners. To address this challenge, we have developed Open Composer, a web application designed to simplify the submission of batch jobs, the primary use case for HPC clusters. Open Composer runs on Open OnDemand, the de facto standard web portal for HPC clusters. Open Composer features automated job script generation using web forms, job submission functionalities, and more. This poster describes the design, development, and usage of Open Composer.
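As a toy sketch of the kind of automation involved (the form fields and the Fugaku-style #PJM directives below are illustrative assumptions, not Open Composer's real schema), values collected from a web form can be turned into a batch job script:

```python
# Toy sketch: generate a batch job script from web-form values.
form = {"nodes": 4, "elapse": "01:00:00", "command": "mpiexec ./a.out input.dat"}

script = (
    "#!/bin/bash\n"
    f"#PJM -L \"node={form['nodes']}\"\n"
    f"#PJM -L \"elapse={form['elapse']}\"\n"
    f"{form['command']}\n"
)
with open("job.sh", "w") as f:       # the web application would then submit this script
    f.write(script)
print(script)
```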
- 10. New enhanced sampling method integrating ABMD with gREST for multi-scale free energy calculations
Shingo Ito(RIKEN Center for Computational Science)*; Hiraku Oshima(Graduate School of Life Science, University of Hyogo); Yuji Sugita(RIKEN Center for Computational Science, Computational Biophysics Research Team, RIKEN Cluster for Pioneering Research, Laboratory for Biomolecular Function Simulation, RIKEN Center for Biosystems Dynamics Research ) - Multidimensional enhanced sampling methods have been widely used for free energy calculations because they efficiently explore complicated free energy surfaces with many local minima. However, these methods, such as generalized replica-exchange with solute tempering/replica-exchange umbrella sampling (gREST/REUS), have large computational costs, and only supercomputers allow us to perform MD simulations with them. Here, we developed a new multi-scale enhanced conformational sampling method integrating adaptively biased molecular dynamics (ABMD) with gREST, which has low computational costs and quick convergence of free energy calculations compared to conventional multidimensional enhanced sampling methods. We implemented the new method, called gREST-ABMD, into the GENESIS MD simulation package and applied it to free energy calculations in three different multi-scale models: the proton transfer reaction of malonaldehyde in a quantum mechanics (QM)/molecular mechanics (MM) model, the conformational change of adenylate kinase (AdK) in a coarse-grained model, and the folding of chignolin in an all-atom model. In every case, the performance of gREST-ABMD is better than that of conventional ABMD and multiple-walker ABMD (MW-ABMD). We expect that gREST-ABMD will accelerate free energy calculations with reasonable computational resources.
- 11. Technical knowledge on effective use of Supercomputer Fugaku: Case studies from program tuning supports of RIST
Yukihiro Ota(Research Organization for Information Science and Technology)*; Gilles Gouaillardet(Research Organization for Information Science and Technology); Daisuke Matsuoka(Research Organization for Information Science and Technology); Eiji Tomiyama(Research Organization for Information Science and Technology); Yoshinori Kusama(Research Organization for Information Science and Technology) - On the supercomputer Fugaku, use of system-intrinsic features, such as the Arm instruction sets and the Tofu interconnect D, can lead to efficient execution of programs. Elementary techniques for optimization on cache-based microprocessors are also useful. Applying these aspects to a program requires an understanding of the connection between the properties of hardware and software, which is not straightforward except for HPC experts. To overcome this difficulty, practical guidance on finding clues to good performance and avoiding common pitfalls is preferable for beginners who intend to optimize programs. Since 2021, we have provided user support on Fugaku as part of the program tuning support of RIST. Based on our experience, we summarize technical knowledge helpful for using Fugaku. In this poster, we illustrate optimization techniques for Fugaku, mainly focusing on three examples. Our first example is an application of automatic vectorization. Vectorization is important for effective use of the processor cores in Fugaku, but a compiler-based approach is not always successful owing to the complexity of source codes. We convert the original program into a compiler-friendly one and successfully apply vectorization to longer loops. Second, we show a tip for MPI communication with user-defined types. Derived types are useful for treating inhomogeneous data in memory, but the performance may depend on the data structure and MPI implementation. We show an example in which explicitly packing and unpacking derived-type data has a positive impact on performance. Third, we focus on GROMACS on Fugaku. GROMACS abstracts vector registers in C++ classes, and vector operations are implemented via SIMD intrinsics specific to each given architecture. To achieve the best possible performance on A64FX and SVE-capable Arm processors, SVE support was contributed to GROMACS by one of the authors (Gilles Gouaillardet). This is explained here, along with other techniques, such as increasing instructions per cycle and mixing objects generated by the Fujitsu, Arm, and GNU compilers.
- 12. Hamiltonian simulation for solving Linear PDE via Schrödingerisation
Sangwon Kim(R-CCS)*; Junya Onishi(R-CCS); Ayato Takii(Kobe University); Younghwa Cho(Hokkaido University); Makoto Tsubokura(R-CCS, Kobe University) -
The Hamiltonian simulation, which has been gaining attention as a method in quantum computing, is used to solve the time-dependent Schrödinger equation in quantum computing. To apply general linear PDEs to Hamiltonian simulation, it is necessary to define a Hamiltonian system. However, not all linear PDEs can be directly transformed into a Hamiltonian system, as they often lack the imaginary unit i required for the Schrödinger equation, or the coefficient matrix may not be symmetric.
Thus, we explore the Schrödingerisation method (Jin et al., 2022) for solving linear partial differential equations (PDEs) using quantum systems. This method introduces an additional one-dimensional variable, referred to as the warped phase transformation, transforming linear PDEs into a Hamiltonian system.
In this study, the Qiskit library is employed to conduct Hamiltonian simulations using quantum simulators. We apply the method to both one-dimensional and two-dimensional advection, diffusion, and advection-diffusion equations. By comparing the simulation results with those obtained from classical numerical methods, we validate the accuracy and efficiency of the Schrödingerisation approach.
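A minimal Qiskit sketch of the Hamiltonian-simulation building block (a toy two-qubit Pauli operator stands in for the Hamiltonian obtained after Schrödingerisation; this is not the study's actual workflow):

```python
# Minimal sketch: Trotterized evolution exp(-iHt) of a toy Hamiltonian in Qiskit.
from qiskit import QuantumCircuit
from qiskit.circuit.library import PauliEvolutionGate
from qiskit.quantum_info import SparsePauliOp, Statevector
from qiskit.synthesis import SuzukiTrotter

H = SparsePauliOp.from_list([("XX", 0.5), ("ZI", 0.3), ("IZ", 0.3)])   # toy Hermitian operator
evolution = PauliEvolutionGate(H, time=1.0, synthesis=SuzukiTrotter(order=2, reps=4))

qc = QuantumCircuit(2)
qc.append(evolution, [0, 1])
print(Statevector.from_instruction(qc))   # evolved state, to compare with a classical solver
```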
- 13. Development of GENESIS CGDYN for large scale coarse-grained MD simulations
Jaewoon Jung(RIKEN R-CCS, RIKEN CPR)*; Cheng Tan(RIKEN R-CCS); Yuji Sugita(RIKEN R-CCS, RIKEN CPR,RIKEN BDR) - The coarse-grained (CG) model at the residue level is one of the most valuable tools for analyzing large-scale biological phenomena and understanding their mechanisms using molecular dynamics simulations. However, this model cannot be directly used to perform MD simulations on massively parallel computers due to difficulty in parallelization. In this study, we developed a non-uniform domain decomposition method with dynamic load balancing and succeeded in greatly improving the computational efficiency on supercomputers such as Fugaku. We named this computational engine “CGDYN” and implemented it in the multi-scale molecular dynamics software GENESIS. We performed CG simulations using CGDYN with Fugaku and observed the fusion process of condensed droplets consisting of intrinsically disordered proteins (IDPs). In particular, we observed “Ostwald ripening” characterized by the dissolution of small droplets and re-fusion into larger ones. Future development of coarse-grained models using “CGDYN” is expected to serve as a powerful framework to reproduce experimentally observable biological phenomena in a computer using simulations and to elucidate the details of their mechanisms.
- 14. Installation of application software in Fugaku and HPCI
Kanako Yoshizawa(Research Organization for Information Science and Technology )*; Gilles Gouaillardet(Research Organization for Information Science and Technology ); Akira Azami(Research Organization for Information Science and Technology ); Daisuke Matsuoka(Research Organization for Information Science and Technology ); Asako Terasawa(Research Organization for Information Science and Technology ); Masato Matsui(Research Organization for Information Science and Technology );Satoru Shingu(Research Organization for Information Science and Technology ); Takaaki Noguchi(Research Organization for Information Science and Technology ); Naoki Sueyasu(Research Organization for Information Science and Technology ); Yoshinori Kusama(Research Organization for Information Science and Technology ); Tatsuya Nakano(Research Organization for Information Science and Technology ); -
The Research Organization for Information Science and Technology (RIST) provides various services such as user advanced consulting and tuning supports for the High Performance Computing Infrastructure (HPCI) users and prospective users.
RIST installs application software on the HPCI systems, including the supercomputer Fugaku, and provides useful information in order to make them ready to use. The application software includes commonly used Open Source Software (OSS) as well as Japan-developed software that is commonly used or expected to be used, recommended by application development projects and the computational science community.
Information for using the application software, including examples of batch-job scripts on each HPCI computing resource, is provided via the HPCI portal site [1]. Performance information on widely used OSS on Fugaku is also published there [2]. In this presentation, we will discuss the updated performance information for application software such as GROMACS [3], LAMMPS [4] (using a machine learning potential), and FDS [5].
Fugaku Open OnDemand [6] has been introduced by RIKEN R-CCS. It is a convenient tool that allows users to run application software and visualize the results using only a web browser. We are collaborating with RIKEN R-CCS to expand the number of applications that can be executed using the tool. We have applied in-house developed software as well as pre-installed application software to the tool. An example of the in-house developed software is reconstruction software for CT data obtained at the large-scale synchrotron radiation facility SPring-8. We will present execution examples using the tool.
1. https://www.hpci-office.jp/en/for_users/appli_software.
2. https://www.hpci-office.jp/en/for_users/appli_info.
3. https://www.gromacs.org.
4. https://www.lammps.org.
5. https://pages.nist.gov/fds-smv.
6. https://www.hpci-office.jp/en/for_users/appli_software#openondemand.
- 15. Single-reference coupled cluster theory for systems with strong correlation extended to excited states
Stanislav Kedzuch(RIKEN Center for Computational Science)*; Shota Tsuru(RIKEN Center for Computational Science); Takahito Nakajima(RIKEN Center for Computational Science) -
Conventional single-reference coupled cluster (CC) theories often suffer from instability of the reference Slater determinant in the presence of strong correlations. Such a situation occurs, for example, in bond formation, breaking, and at the dissociation limit. The instability in the given cases is removed by breaking the spin symmetry of the reference and the lost spin symmetry is recovered by projection. In this manner, the Scuseria Group proposed the CCSD0 model [1], which behaved correctly at the dissociation limit by removing the spin-triplet pair excitations from CCSD. The group also proposed FSigCCSD, where the removed excitations were added back to CCSD0 [2].
We apply the F12 methodology [3] to CCSD0 and FSigCCSD to consider more dynamic correlations. We also extend the CCSD0 model to excited states in the equation-of-motion (EOM) scheme. In the poster, we will show the potential energy surfaces of the azomethane (C2N2H6) molecule, which undergoes torsional motion from the trans to cis forms and the elimination of a methyl group by photoexcitation.
EOM-CCSD0 still does not qualitatively describe conical intersections between excited states. We are implementing the similarity-constrained (SC) scheme [4] to rectify the conical intersections. Our aim is to track potential energy surfaces of any kind of molecule without specifying an active space.
References
[1] I. W. Bulik et al., J. Chem. Theory Comput., 11, 3171-3179 (2015).
[2] J. A. Gomez et al., J. Chem. Phys. 145, 134103 (2016).
[3] S. Ten-no, and J. Noga, WIREs Comput. Mol. Sci. 2, 114-125 (2012).
[4] E. F. Kjønstad, and H. Koch, J. Chem. Theory Comput. 15, 5386-5397 (2019).
- 16. Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit
Yuki Uchino(RIKEN Center for Computational Science)*; Toshiyuki Imamura(RIKEN Center for Computational Science); Katsuhisa Ozaki(Shibaura Institute of Technology) -
This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix multiplications in machine learning models, and next-generation architectures are expected to follow the same design principle. The key to achieving superior performance is to fully leverage such architectures.
The Ozaki scheme, a highly accurate matrix multiplication algorithm using error-free transformations, enables higher-precision matrix multiplication to be performed through multiple lower-precision matrix multiplications and higher-precision matrix additions. Ootomo et al. implemented the Ozaki scheme on high-performance matrix multiplication units with the aim of achieving both sufficient accuracy and high performance.
This paper proposes alternative approaches to improving performance by reducing the numbers of lower-precision matrix multiplications and higher-precision matrix additions. Numerical experiments demonstrate the accuracy of the results and benchmark the performance of the proposed approaches. These approaches are expected to yield more efficient results on next-generation architectures.
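A simplified illustration of the splitting idea (per-matrix scaling and a fixed number of 7-bit slices are assumptions for the sketch; the actual scheme and the proposed optimizations are more refined): a float64 GEMM is assembled from exact integer slice products, with the integer multiplications standing in for a low-precision matrix engine.

```python
# Simplified Ozaki-style splitting: A ≈ scale * sum_k S_k * 2**(-bits*(k+1)),
# so A @ B can be recombined from exact integer products of the slices.
import numpy as np

def int_slices(A, num_slices=6, bits=7):
    scale = np.max(np.abs(A))
    X = A / scale
    slices = []
    for _ in range(num_slices):
        S = np.rint(X * (1 << bits))
        slices.append(S.astype(np.int64))
        X = X * (1 << bits) - S          # remainder carried into the next slice
    return scale, slices

rng = np.random.default_rng(0)
n, bits = 64, 7
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
sA, As = int_slices(A, bits=bits)
sB, Bs = int_slices(B, bits=bits)

C = np.zeros((n, n))
for k, Ak in enumerate(As):
    for l, Bl in enumerate(Bs):
        # integer GEMM is exact here (stands in for an integer matrix multiplication unit)
        C += (Ak @ Bl).astype(np.float64) * 2.0 ** (-bits * (k + l + 2))
C *= sA * sB

print("max |difference| from float64 GEMM:", np.max(np.abs(C - A @ B)))
```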
This study was supported by JSPS Grant-in-Aid for Research Activity Start-up No. 24K23874.
- 17. Electrostatic Repulsions Regulate Multicomponent Biomolecular Condensation: Insights from Large-Scale Simulations on Fugaku
Cheng Tan(RIKEN Center for Computational Science)*; Yuji Sugita(RIKEN Center for Computational Science) - The highly charged protein Hero11 regulates condensates formed by its client protein TDP43 through liquid-liquid phase separation. Previous coarse-grained simulations revealed residue-level interactions and proposed potential regulatory mechanisms. However, the lack of atomic details limited our understanding of the roles of protein side chains, water molecules, and ions. In this study, we employed atomistic molecular dynamics simulations to investigate these interactions at high resolution. Using results from coarse-grained simulations as a starting point, we reconstructed atomistic systems comprising approximately 2.5 million atoms, including around 120 proteins. Simulations spanning over 3 μs across six independent systems were performed on the supercomputer Fugaku. Our results confirm that Hero11 induces electrostatic repulsive interactions, reducing the overall density in the dense phase. This allows ions to permeate more freely into the condensate, enhancing internal dynamics and potentially preventing maturation. These findings provide valuable insights into the regulatory role of electrostatic repulsions in biomolecular condensates.
- 18. Massively parallel state-vector simulation of quantum computer
Naoki Yoshioka(RIKEN Center for Computational Science)*; Nobuyasu Ito(RIKEN Center for Computational Science); Kazuhiro Seki(RIKEN Center for Quantum Computing); Tomonori Shirakawa(RIKEN Center for Computational Science, RIKEN Center for Quantum Computing, RIKEN Cluster for Pioneering Research, RIKEN Interdisciplinary Theoretical and Mathematical Sciences); Seiji Yunoki(RIKEN Center for Computational Science, RIKEN Center for Quantum Computing, RIKEN Cluster for Pioneering Research, RIKEN Center for Emergent Matter Science); Doru Thom Popovici(Lawrence Berkeley National Laboratory); Anastasiia Butkoi(Lawrence Berkeley National Laboratory) -
A state-vector simulator of quantum computers has been developed for massively parallel supercomputers. Our simulator, RIKEN-braket, removes the limitations on the number of MPI processes and on the size of the state-vector data array found in the commonly used parallelization method, making it possible to increase the number of qubits by one in state-vector simulations. We demonstrate that our simulations work well up to 46 qubits on the supercomputer Fugaku, using up to 55,296 computing nodes. Recent progress in our study, such as the implementation of gate fusion and the application to massively parallel simulation of the Variational Quantum Eigensolver (VQE) and the dynamics of spin systems, is also presented.
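As a minimal sketch of the core operation such a simulator performs (not RIKEN-braket's implementation; the qubit-ordering convention is assumed), applying a single-qubit gate amounts to a tensor contraction over one axis of the 2^n-element state vector, which is also why memory, and hence node count, doubles with every added qubit:

```python
# Minimal sketch: apply a 2x2 gate to one qubit of an n-qubit state vector.
import numpy as np

def apply_1q_gate(state, gate, target, n_qubits):
    psi = state.reshape([2] * n_qubits)              # one axis per qubit
    psi = np.moveaxis(psi, target, 0)                # bring the target qubit to the front
    psi = np.tensordot(gate, psi, axes=([1], [0]))   # contract the gate with that axis
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

n = 10
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                                       # |00...0>
hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
state = apply_1q_gate(state, hadamard, target=3, n_qubits=n)
print(np.vdot(state, state).real)                    # norm remains 1
```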
This presentation is based on results obtained from a project, JPNP20017, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
- 19. Study on the Multi-Conformational States of the SARS-CoV-2 Spike Protein Using the CryoMDM Method
Atsushi Tokuhisa(RIKEN, R-CCS)* -
Abstract:
Understanding biomolecular structures is essential for comprehending life phenomena. In recent years, cryo-electron microscopy (cryoEM) single-particle structure analysis has enabled remarkable advancements. However, single-particle cryoEM images have poor signal-to-noise ratios, and existing methods rely on multi-image analysis, which involves averaging multiple images to produce high-resolution structures, causing the unique features of individual structures to be lost. To overcome this challenge, we developed a molecular dynamics (MD) simulation-based structural matching method, cryoMDM, which identifies plausible structures in two-dimensional cryoEM images from many three-dimensional (3D) structural model candidates generated through MD simulations. The method innovatively enables the estimation of 3D atomic models from single-particle images rather than merely generating 3D electron density maps. By linking each cryoEM particle image to a 3D structure in the simulation space, we can directly connect cryoEM data to continuous structural changes in biomolecules. In order to elucidate the biomolecular structures reflected in the experimental data at high resolution, we have established a large-scale structure-search workflow in the HPCI environment of Fugaku with AI-driven technologies. We applied this method to the spike protein derived from SARS-CoV-2 and successfully captured four characteristic intermediate states of the spike protein: Open, Close, Free, and Condensed. As this example shows, the cryoMDM method can be a powerful tool for elucidating various structural states of biomolecules as high-resolution 3D structural models through the conformational search workflow established in the HPCI environment.
Acknowledgement:
This work was supported by the FOCUS Establishing Supercomputing Center of Excellence project, MEXT, as “Program for Promoting Research on the Supercomputer Fugaku” (Simulation- and AI-driven next-generation medicine and drug discovery based on "Fugaku", JPMXP1020230120). We used computational resources from the supercomputer Fugaku provided by the RIKEN Center for Computational Science through HPCI System Research Projects (Project IDs: hp220078, hp230102, hp230216, hp240109, hp240162, ra000018) and the supercomputer system at the Information Initiative Center, Hokkaido University, Sapporo, Japan, through HPCI System Research Projects (Project IDs: hp220078, hp230102).
- 20. Nuclear Quantum Effects in Sub- and Supercritical Water Investigated at ab initio level with SL-PIHMC-MIX
Bo Thomsen(CCSE - JAEA)*; Motoyuki Shiga(CCSE - JAEA) - Nuclear Quantum effects (NQEs) play a large role in the accurate modelling of water structure, even at sub- and supercritical conditions, where the NQEs modulate the hydrogen bonds of the liquid phase. In previous works we have studied the NQEs using ab initio path integral molecular dynamics (AI-PIMD). Unlike conventional ab initio molecular dynamics (AI-MD), where only a single instance of the system must be evaluated per timestep, AI-PIMD requires evaluating several copies simultaneously, dramatically increasing computational costs. This makes the straightforward application of AI-PIMD expensive due to the repeated solution of density functional theory (DFT) equations for multiple configurations at each timestep. In this presentation, we introduce the self-learning path integral hybrid Monte Carlo with mixed ab initio and machine learning potentials (SL-PIHMC-MIX) method. This approach maintains DFT accuracy while reducing the number of DFT evaluations by an order of magnitude. This is achieved using a machine learned potential (MLP) that is trained on the fly during the propagation of the SL-PIHMC-MIX trajectory. This method has allowed the benchmarking of the RPBE-D3, SCAN, rev-vdW-DF2, and rev-PBE0-D3 functionals for modelling sub- and supercritical water. This benchmarking is crucial for understanding neutron scattering phenomena in water-cooled nuclear power plants, where accurate modeling of water’s structure and properties is essential. Additionally, we compare the performance of pure MLPs trained on each of these functionals to confirm the consistency and reliability of the MLP-generated results relative to the underlying DFT methods. Furthermore, we will discuss recent improvements to our locally developed program, which enable efficient propagation of PIMD trajectories when employing MLPs or other computationally intensive potentials.
- 21. WHEEL: A web-based analysis workflow tool and its use on Open OnDemand on supercomputer Fugaku
Tomohiro Kawanabe(RIKEN R-CCS)*; Naoyuki Sogo(Longtail software LLC); Masahiro Nakao(RIKEN R-CCS); Kenji Ono(Research Institute for Information Technology, Kyushu University) -
The RIKEN Center for Computational Science (R-CCS) has set a goal of expanding the user base of the supercomputer “Fugaku” to potential users in industry and is promoting research, development, and implementation of tools that make the HPC environment easy to use even for novices. The analysis workflow tool WHEEL (Workflow in Hierarchical distributed parallEL) is open-source software researched and developed primarily by R-CCS to enable even HPC beginners in the industrial sector to build and execute practical analysis workflows.
WHEEL is a client-server web application with a GUI that runs on a web browser. There is no need to install a special plug-in to the web browser. The WHEEL server is built on Node.js, a server-side JavaScript framework, and is provided as a Docker container to be used in various environments.
Workflows can be built by users dragging and dropping workflow components on the GUI.
Workflow components include repetitive processing components (For, While, ForEach) and a ParameterStudy component that simultaneously submits multiple jobs across a user-defined parameter space, making it easy to build the optimal-design workflows required by industrial users. This is a major advantage of WHEEL, as such workflows are difficult to achieve with Directed Acyclic Graph-based workflow tools.
In this poster, we will first provide an overview of WHEEL, and then explain an example of a multi-objective optimal design exploration workflow using the supercomputer Fugaku as an application example in industry. We will also introduce how to use WHEEL in the Open OnDemand environments of Fugaku and Kyushu University's supercomputer “Genkai” and explain the HPCI shared storage utilization function currently under development.
Acknowledgement:
This work used computational resources of the supercomputer Fugaku provided by the RIKEN Center for Computational Science (ProjectID: rccs-hud).
- 22. Development of OpenACC for MN-Core: Part of the Post-Fugaku FS by Kobe University
Ryuta Tsunashima(Kobe University)*; Katsuhiro Endo(National Institute of Advanced Industrial Science and Technology); Naohito Nakasato(The University of Aizu); Hiroto Imachi(Preferred Networks, Inc.); Junichiro Makino (Kobe University, Preferred Networks, Inc.) -
The MN-Core series of processors are AI and HPC accelerators developed by Kobe University and Preferred Networks, Inc. From autumn 2022 to March 2025, MEXT organized the Post-Fugaku Feasibility Study (Post-Fugaku FS). The Kobe University team of the Post-Fugaku FS has evaluated possible software environments that would let a future MN-Core serve not only AI but also general-purpose computing such as HPC. For general-purpose computing, we have designed two programming environments: MNCL, which is similar to OpenCL, and OpenACC for MN-Core.
Traditional programming for accelerators requires a lot of effort because the programming languages differ from those for CPUs, and languages such as CUDA and OpenCL demand a different programming style from CPU programming. Therefore, programs written in these languages are not portable. Compiler-directive-based APIs for C, C++, and Fortran are more favorable, and OpenACC is the most promising.
In this poster we describe the progress of the development of OpenACC for MN-Core. We show the interface design, the compiler implementation and the future programming environment of MN-Core.
GPUs are based on the shared memory model. On the other hand, MN-Core is based on the distributed memory model. Therefore, the programming model and the language processor should adopt the distributed memory model too. OpenACC for MN-Core extends the standard OpenACC to support the interface for the distributed memory. This extended interface is based on XcalableMP and High Performance Fortran.
The OpenACC for MN-Core compiler is a source-to-source compiler from OpenACC to MNCL. One advantage of the source-to-source approach is that it is easy to support calling MNCL functions from OpenACC; users can perform detailed performance tuning with MNCL from within OpenACC code.
Acknowledgment:
This work was supported by MEXT as "Feasibility studies for the next-generation computing infrastructure".
- 23. An Adaptive Kernel Scheduler for Irregular Applications on GPUs Using CUDA Graphs
Kento Kitamura(NTT Device Innovation Center)*; Kenji Tanaka(NTT Device Innovation Center); Kazunori Seno(NTT Device Innovation Center) - We propose a novel approach for executing irregular applications using Graphics Processing Units (GPUs). Irregular applications are applications whose task parallelism changes dynamically. When executing such applications with traditional GPU programming models, CPU-GPU communication occurs at every change in parallelism to redefine kernels, which decreases the application's throughput in proportion to the number of kernel redefinitions. Some prior works use Dynamic Parallelism (DP), which allows dynamic kernel launch on GPUs, but they suffer from large kernel launch overhead. CUDA Graphs were also introduced for dynamic kernel launch on GPUs, but they require the kernel launch flow to be defined statically and can therefore waste a lot of compute resources when executing irregular applications. To execute irregular applications efficiently on GPUs, our method allows GPUs to dynamically allocate computational resources to kernels without large kernel launch overhead. We decompose a kernel into multiple fragmented kernels and dynamically launch an optimal number of them according to the size of the input data. The input data is partitioned into smaller segments, and each fragmented kernel processes one of the partitioned segments. In addition, we implemented a kernel scheduler in each fragmented kernel: when a fragmented kernel finishes and unprocessed input data remains, the scheduler relaunches the fragmented kernel. This method is implemented using CUDA Graphs conditional nodes, which determine the number of fragmented kernels to be launched based on the input size, together with the schedulers of the fragmented kernels. We compared the proposed method with a traditional CPU-driven kernel execution method and with a DP-based method. We executed Breadth-First Search (BFS), one of the representative irregular applications, with several sizes of input graphs using these three methods and evaluated Traversed Edges Per Second (TEPS) per core. Compared to the traditional method and the DP method, the TEPS of our method was up to 1.27 and 1.26 times greater, respectively. We plan to investigate GPU-GPU communication methods to implement larger-scale applications.
- 24. Using Data Assimilation to Improve Data-Driven Models
Michael Goodliff(RIKEN R-CCS)*; Takemasa Miyoshi(RIKEN R-CCS) -
Data-driven models (DDMs) are mathematical, statistical, or computational models built upon data, where patterns, relationships, or predictions are derived directly from the available information rather than through explicit instructions or rules defined by humans. These models are constructed by analysing large volumes of data to identify patterns, correlations, trends, and other statistical relationships. In areas such as numerical weather prediction (NWP), DDMs trained on reanalysis data are becoming increasingly popular, with the aim of replacing numerical models. Data assimilation (DA) is a process which combines observations from various sources with numerical models to improve the accuracy of predictions or simulations of a system's behaviour.
This presentation focuses on the application of DA methodologies to enhance the precision and efficiency of DDM generation within computational models characterised by inherent observation error. The aim is to demonstrate the pivotal role that DA techniques can play in refining and optimising the process of DDM generation, thereby augmenting the accuracy and reliability of predictive models despite the presence of observational uncertainties.
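As a minimal illustration of the DA update that underlies this work, the following NumPy sketch performs one stochastic ensemble Kalman filter analysis step for a toy linear observation operator; all dimensions and error statistics are invented for the example and are not the configuration used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_state, n_obs, n_ens = 8, 3, 40

# Toy forecast ensemble (columns are members) and a linear observation operator.
X_f = rng.normal(size=(n_state, n_ens))
H = np.zeros((n_obs, n_state)); H[0, 0] = H[1, 3] = H[2, 6] = 1.0
R = 0.5 * np.eye(n_obs)                     # observation error covariance
y = rng.normal(size=n_obs)                  # observations

# Ensemble statistics.
x_mean = X_f.mean(axis=1, keepdims=True)
A = X_f - x_mean                            # anomalies
P_HT = A @ (H @ A).T / (n_ens - 1)          # Pf H^T estimated from the ensemble
S = H @ P_HT + R                            # innovation covariance
K = P_HT @ np.linalg.inv(S)                 # Kalman gain

# Stochastic EnKF: perturb the observations for each member.
Y_pert = y[:, None] + rng.multivariate_normal(np.zeros(n_obs), R, size=n_ens).T
X_a = X_f + K @ (Y_pert - H @ X_f)          # analysis ensemble

spread_a = (X_a - X_a.mean(axis=1, keepdims=True)).std()
print("forecast spread:", A.std(), " analysis spread:", spread_a)
```

The same combine-forecast-with-noisy-observations step is what allows noisy training data to be turned into a more consistent state estimate before (or while) a DDM is fitted.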
- 25. Computational evolution of social norms in indirect reciprocity
Yohsuke Murase(RIKEN, R-CCS)*; Christian Hilbe - Cooperation is a crucial aspect of human society, and indirect reciprocity is one of the mechanisms that promote cooperation. Models of indirect reciprocity study how social norms promote cooperation. In these models, cooperative individuals build up a positive reputation, which in turn helps them in their future interactions. The exact reputational benefits of cooperation depend on the norm in place, which may change over time. Previous research focused on the stability of social norms. Much less is known about how social norms initially evolve when competing with many others. A comprehensive evolutionary analysis, however, has been difficult. Even among the comparably simple space of so-called third-order norms, there are thousands of possibilities, each one inducing its own reputation dynamics. To address this challenge, we conduct large-scale computer simulations using Fugaku. We study the reputation dynamics of each social norm and all evolutionary transitions between them. In contrast to established work with only a handful of norms, we find that cooperation is hard to maintain in well-mixed populations. However, within group-structured populations, cooperation can emerge. The most successful norm in our simulations is particularly simple. It regards cooperation as universally positive, and defection as usually negative—unless defection takes the form of justified punishment. This research sheds light on the complex interplay of social norms, their induced reputation dynamics, and population structure.
- 26. Connecting Physical Qubits to Quantum Error Correction Systems using Regular Ethernet
Jan-Erik R. Wichmann(RIKEN Center for Computational Science)*; Kentaro Sano(RIKEN Center for Computational Science) -
Fault tolerant quantum computers require quantum error correction (QEC) due to the intrinsic instability of physical qubits. Use of suitable QEC algorithms allows for the creation of a few error-resistant logical qubits from many noisy physical qubits. For QEC algorithms to properly work, it is necessary to repeatedly measure a subset of the physical qubits and evaluate the measurement results. This creates classical data transfer challenges especially for fast superconducting qubits, where many independent frontend units (FEU) control only a few physical qubits each, but the QEC backend requires the results of all FEU to properly estimate and correct for errors. Keeping the latency of the connections between FEU and backend low is necessary to ensure a high logical clock rate for the quantum computer.
Here we present a scalable connection between FEU and QEC backend using regular Ethernet. Assuming contemporary FEU controlling 4 qubits each, we show that the backend can handle the information of a single code distance 11 (19) qubit using a regular 100 (400) Gbps Ethernet connection. This can be improved upon by using intermediate aggregation units (AU) between FEU and QEC backend. Each AU aggregates the information of a few FEU, effectively reducing the required bandwidth at the backend. Using two layers of AU, a single backend can receive the error syndrome information of 95 logical qubits of code distance 41 over a single 400 Gbps connection, while only moderately increasing communication latency. This allows for long-distance lattice surgery operations, highlighting the suitability of Ethernet-based frontend-backend communications for large scale fault tolerant quantum computers.
Acknowledgement: This work was supported by the JST Moonshot R&D Grant Number JPMJMS226A.
- 27. Development of Scalable Systolic Arrays for Quantum Error Correction Core
Prasoon Ambalathankandy(RIKEN Center for Computational Science)*; Werner Florian Samayoa(RIKEN Center for Computational Science); Jan Wichmann(RIKEN Center for Computational Science); Kentaro Sano(RIKEN Center for Computational Science) - This work focuses on scalable Systolic Arrays (SA) for Quantum Error Correction (QEC) in superconducting qubit-based quantum devices. As part of a QEC core, the SA processes syndrome graphs in real-time, leveraging a reduced syndrome subgraph decoder to minimize latency. Designed at the RTL level, the SA supports varying surface code distances (D) and measurement rounds (R), ensuring scalability and adaptability to different quantum architectures. When synthesized on Intel Agilex 7, the SA achieves high efficiency: for D = 3, it utilizes <1% of resources at 433 MHz, and for D = 17, it uses 68% of ALMs at 207 MHz. On Stratix 10, the resource usage for D = 3 includes <1% ALMs and <1% registers, achieving an Fmax of 354.23 MHz. For D = 13, it utilizes 42% ALMs and 3.3% registers, with an Fmax of 133.28 MHz. These results demonstrate the SA's scalability and resource efficiency, enabling robust, real-time QEC for fault-tolerant quantum computing systems.
- 28. Global precipitation nowcasting using a ConvLSTM with adversarial training
Shigenori Otsuka(RIKEN)*; Takemasa Miyoshi(RIKEN) -
Recent advancements of deep learning in the earth sciences enable us to perform accurate predictions of atmospheric states for an extended period. Among various meteorological parameters, precipitation has a large socioeconomic impact. However, precipitation is one of the most difficult quantities to predict, mainly due to its nonuniform spatiotemporal distribution and complex dynamics. From the viewpoint of data-driven approaches, we have a sufficiently large historical record of global precipitation distributions from the past 20 or more years of the TRMM and GPM era with satellite-borne radar data. However, precipitation distributions based on microwave satellite observations also suffer from quality issues.
To take advantage of such abundant but noisy precipitation data, we adopted convolutional long short-term memory (ConvLSTM), a type of recurrent neural network, in a hierarchical manner to perform 12-hour-lead global precipitation nowcasting. Adversarial training is one of the key techniques for providing realistic precipitation distributions with extreme values. In this study, the loss consists of a pixelwise Huber loss, spatial and temporal discriminator losses, and constraints on non-local statistical values, such as the average and standard deviation over the entire domain.
The model input variables are the hourly GSMaP (Global Satellite Mapping of Precipitation) near-real-time product v8 and its quality indicator, as well as the latitude at each pixel. The output is trained on the hourly GSMaP standard product v8 on a 0.1 x 0.1 degree mesh. This setup is designed for real-time operation in the future. Preliminary experiments indicated that this model can produce reasonable precipitation predictions up to 12 hours ahead.
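A hedged PyTorch sketch of the composite loss described above (pixelwise Huber term, adversarial terms from spatial and temporal discriminators, and constraints on domain-wide statistics); the tensor shapes, module names, and weights are placeholders, not the authors' actual configuration.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred, target, d_spatial_logits, d_temporal_logits,
                   w_adv=0.05, w_stat=0.1):
    """Composite loss for a nowcasting generator.

    pred, target: (B, T, 1, H, W) precipitation sequences.
    d_*_logits: discriminator outputs (logits) on the generated sequence.
    """
    # Pixelwise Huber (smooth L1) loss.
    l_pix = F.smooth_l1_loss(pred, target)

    # Non-saturating adversarial terms: the generator tries to make the
    # spatial and temporal discriminators output "real" (label 1).
    l_adv = (F.binary_cross_entropy_with_logits(
                 d_spatial_logits, torch.ones_like(d_spatial_logits))
             + F.binary_cross_entropy_with_logits(
                 d_temporal_logits, torch.ones_like(d_temporal_logits)))

    # Constraints on non-local statistics (domain mean and standard deviation).
    dims = (2, 3, 4)
    l_stat = (F.l1_loss(pred.mean(dim=dims), target.mean(dim=dims))
              + F.l1_loss(pred.std(dim=dims), target.std(dim=dims)))

    return l_pix + w_adv * l_adv + w_stat * l_stat

# Example with random tensors standing in for a 12-frame forecast.
pred = torch.rand(2, 12, 1, 64, 64, requires_grad=True)
target = torch.rand(2, 12, 1, 64, 64)
loss = generator_loss(pred, target,
                      d_spatial_logits=torch.randn(2, 1),
                      d_temporal_logits=torch.randn(2, 1))
loss.backward()
print(float(loss))
```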
- 29. Can Tensor Cores Benefit Memory-Bound Kernels?
Lingqi Zhang(RIKEN Center for Computational Science)*; Jiajun Huang(University of California, Riverside)*; Mohamed Wahib(RIKEN Center for Computational Science) -
Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success, researchers have attempted to extend tensor core capabilities beyond dense matrix computations to other computational patterns, including memory-bound kernels. Recent studies have reported that tensor cores can outperform traditional CUDA cores even on memory-bound kernels, where the primary performance bottleneck is memory bandwidth rather than computation.
In this research, we challenge these findings through both theoretical and empirical analysis. Our theoretical analysis reveals that tensor cores can achieve a maximum speedup of only 1.33× over CUDA cores for memory-bound kernels in double precision (for V100, A100, and H100 GPUs). We validate this theoretical limit through empirical analysis of three representative memory-bound kernels—STREAM Scale, SpMV, and stencil. We demonstrate that optimizing memory-bound kernels using tensor cores does not yield sound performance improvements over CUDA cores.
- 30. Evaluating the Impact of Batch-Independent Normalization and Shuffling in DDP
Kohei Yoshida(The University of Electro-Communications, RIKEN R-CCS)*; Kento Sato( RIKEN R-CCS); Shinobu Miwa(The University of Electro-Communications, RIKEN R-CCS); Rio Yokota(Institute of Science Tokyo) - In Distributed Data Parallel (DDP), shuffling plays an important role in increasing local batch diversity to help convergence. To achieve this, various shuffling techniques have been studied to promote both faster convergence and shorter training time, but they rely on batch size-dependent normalization techniques, such as batch normalization (BN). On the other hand, there are several batch size-independent normalization techniques, such as group normalization (GN). By using GN, aggressive shuffling to increase batch diversity may not be necessary, as the effect of batch diversity on accuracy is anticipated to be less significant. Additionally, when training with 3D data, ultra-high resolution, or long context, it is sometimes necessary to set the batch size to extremely small values, such as one or two, due to GPU memory constraints. In these situations, training using GN can be conducted without incurring additional communication costs, unlike the use of SyncBatchNorm. While previous studies have compared the accuracy of local shuffling and global shuffling in DDP, no studies have evaluated the impact of shuffling on accuracy in the case of normalization methods that are independent of the local batch size in DDP. In this study, we investigate the impact of GN and shuffling by comparing the accuracy of multiple patterns of local and global shuffling. The experimental results show that the following two scenarios lead to accuracy equivalent to global shuffling: (1) Shuffling the entire dataset before training, without performing any shuffling between epochs in each process, and (2) Not shuffling the entire dataset before training, but instead performing local shuffling between epochs in each process. These results suggest that with local shuffling and GN, overly complex shuffling is not necessary to achieve accuracy similar to that obtained with global shuffling.
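To make the setup concrete, here is a minimal PyTorch sketch of the two ingredients compared in this study: a batch-size-independent GroupNorm layer in place of BatchNorm, and the per-epoch shuffling control offered by DistributedSampler. The model, dataset, and rank settings are placeholders for illustration only.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, distributed

# Batch-size-independent normalization: GroupNorm statistics are computed
# per sample, so tiny local batches need no SyncBatchNorm communication.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.GroupNorm(num_groups=8, num_channels=32),   # instead of nn.BatchNorm2d(32)
    nn.ReLU(),
)

dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# In a DDP run, DistributedSampler splits the dataset across ranks.
# shuffle=True corresponds to local shuffling between epochs; shuffle=False
# keeps the initial (optionally pre-shuffled) order fixed.
sampler = distributed.DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)   # reshuffle this rank's shard each epoch
    for x, _ in loader:
        _ = model(x)
```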
- 31. Assimilation of dual MP-PAWR observations within a regional-scale numerical weather prediction system with a 1000-member ensemble Kalman filter and 30-second update
James Taylor(RIKEN)* -
The multi-parameter phased array weather radar (MP-PAWR) is an advanced X-band radar system designed to provide high-density observations of Doppler wind velocity and reflectivity to observe heavy rainfall systems. Several studies have assimilated MP-PAWR observations within regional-scale numerical weather prediction (NWP) systems, with a positive impact found on both analyses and rain forecasts. However, previous studies have only assimilated observations from a single MP-PAWR, and the use of multiple MP-PAWR observations has not yet been explored. With the recent deployment of MP-PAWRs in Osaka and Kobe, there now exists a common observation region, providing two sets of observations over much of the Kansai region and the possibility of performing dual-MP-PAWR data assimilation, with the potential to improve short-range rain forecasts.
In this study we perform data assimilation experiments to assimilate observations from the Osaka and Kobe MP-PAWRs within the SCALE-LETKF NWP modelling system. A 1000-member ensemble is refreshed every 30 seconds with both sets of observations on a 500-m resolution mesh that covers approximately 192 km by 192 km of the Kansai region. Results showed improvements in the distribution and intensity of rainfall in both the analyses and forecasts up to 30-minute lead times compared to the assimilation of a single MP-PAWR dataset. Plans to demonstrate dual MP-PAWR assimilation within the real-time SCALE-LETKF system at the Osaka World Expo 2025 will also be presented.
- 32. Architecture Exploration of Heterogeneous CGRA Architecture for HPC and AI
Boma Anantasatya Adhi(R-CCS)*; Chenlin Shi(R-CCS); Kentaro Sano(R-CCS) - A coarse-grained reconfigurable array (CGRA) is a highly efficient compute accelerator architecture. It offers a compelling balance between performance and flexibility compared to traditional CPUs and ASICs, making it attractive for embedded systems and, more recently, AI and HPC applications. However, typical AI and HPC applications often require complex mathematical functions beyond the capabilities of traditional CGRA architectures, which are commonly limited to basic arithmetic operations like multiplication and addition. Therefore, specialized function units need to be provisioned for these mathematical functions, which presents a significant challenge: over- or under-provisioning may lead to resource wastage, and incorrect placement of those functions will also cause issues for the CGRA placement-and-routing compiler. Furthermore, the physical placement of the large macros associated with these specialized functions must be carefully considered during the CGRA's ASIC design, as improper placement can lead to implementation issues. In this poster, we present an architectural exploration of potential heterogeneous computing resources in a CGRA. First, we propose an improved Processing Element (PE) architecture to support those functions. Then, we study the provisioning and placement of those functions in various heterogeneous layouts and their impact on the CGRA's mappability for multiple benchmarks, together with a Power, Performance, and Area (PPA) analysis of the CGRA design on a state-of-the-art 3-nm process node.
- 33. Reduced non-Gaussianity and improved analysis by assimilating every-30-second radar observation: a case of idealized deep convection
Arata Amemiya(RIKEN Center for Computational Science)*; Takemasa Miyoshi -
The non-Gaussianity of error probability distributions is a major challenge in the application of data assimilation methods based on Kalman filter theory to rapidly growing convective systems. The assimilation of phased array weather radar data with a very short interval of 30 seconds is an interesting approach to overcome this problem. However, as previous studies investigated only real-world cases, it was difficult to verify analysis accuracy of dynamical variables. It was also difficult to distinguish the effect of non-Gaussianity from other factors which may also degrade the analysis and forecast accuracies, such as the errors in model physics, imperfect and nonlinear observation operators, limited observation coverage, and multi-scale background error.
In this study, we perform a series of idealized experiments for a convective cell triggered by a warm bubble and investigate the impact of assimilating radar observations at high frequency, focusing on the non-Gaussianity and the analysis accuracy. We used a 100-member local ensemble transform Kalman filter (LETKF) and synthetic radar reflectivity observations generated every 30 seconds. We compared the analysis fields after 50 minutes of data assimilation cycles for three different cases: 3D-LETKF with a 5-minute interval, 4D-LETKF with a 5-minute interval, and 3D-LETKF with a 30-second interval. We found that assimilating radar reflectivity every 30 seconds leads to a significant reduction of the non-Gaussianity of the background ensemble and an improvement of the analysis field, particularly for vertical velocity around the convective core.
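One simple way to quantify the non-Gaussianity discussed above is to look at higher moments of the background ensemble perturbations. The sketch below computes per-grid-point skewness and excess kurtosis for a synthetic 100-member ensemble; it is illustrative only and not the diagnostic used in the study.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
n_members, n_grid = 100, 5000

# Synthetic background ensemble of, e.g., vertical velocity at n_grid points.
# A mixture makes some grid points strongly non-Gaussian on purpose.
ens = rng.normal(size=(n_members, n_grid))
ens[:, :1000] = rng.exponential(size=(n_members, 1000))  # skewed subset

skw = skew(ens, axis=0)        # third standardized moment (0 for a Gaussian)
kur = kurtosis(ens, axis=0)    # excess kurtosis (0 for a Gaussian)

# Fraction of grid points whose perturbations look clearly non-Gaussian.
frac = np.mean((np.abs(skw) > 1.0) | (np.abs(kur) > 1.0))
print(f"strongly non-Gaussian grid points: {100 * frac:.1f}%")
```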
- 34. Towards Practical In-Situ Visualization Environment on A64FX HPC Systems using Kombyne and KVS
Jorji Nonaka(RIKEN R-CCS)*; Tomohiro Kawanabe(RIKEN R-CCS); Toshihiko Kai(RIKEN R-CCS); Shunji Uno(JAXA); Naoyuki Fujita(JAXA); Daichi Obinata(Fujitsu); Hiroyuki Ito(Ryoyu Systems); Naohisa Sakamoto(Kobe University); Atsushi Toyoda(Intelligent Light); et al.(RIKEN R-CCS, JAXA, Kobe University) -
Abstract
R-CCS and JAXA have operated HPC systems with similar architectures, from the previous SPARC64-based HPC systems (the K computer and SORA-MA, a Fujitsu PRIMEHPC FX100) to the current Arm A64FX-based HPC systems (Fugaku and TOKI-SORA, a Fujitsu PRIMEHPC FX1000), and have worked independently on the visualization environment for their respective HPC users. However, the continuous increase in the size of simulation outputs has made the large-data visualization environment a common problem for both HPC sites, and we started working collaboratively on “in-situ visualization”, an umbrella term that covers a wide range of visualization-related techniques and methods for processing simulation results while they are still resident in memory. JAXA has adopted Kombyne, a commercial in-situ/in-transit visualization and analysis tool, and R-CCS has worked with Kobe University on smart in-situ visualization using KVS (Kyoto Visualization System), an open-source software. Although Kombyne is commercial software, it works as an "Enterprise Open-Source Tool" and offers the flexibility to rework and customize it. In this poster, we will present some ongoing activities focused on building a practical in-situ visualization environment using Kombyne, with Fugaku as the testbed. Although Kombyne provides rich data conversion and rendering features for realizing in-situ/in-transit visualization, KVS has the potential to increase its effectiveness by adding new features, so we are planning to integrate some additional functionalities provided by KVS, focusing on assisting HPC users with their visualization and analysis tasks.
Acknowledgments
This work has been conducted under an MoU between R-CCS (PI: Toshihiko Kai) and JAXA (PI: Naoyuki Fujita), and a Collaborative Research Agreement between R-CCS (PI: Jorji Nonaka) and Kobe University (PI: Naohisa Sakamoto). This work was partially supported by JSPS KAKENHI (Grant Numbers: 20H04194, 22H03603), and MEXT as “Program for Promoting Researches on the Supercomputer Fugaku” (Grant Number: JPMXP102023032100). This work used computational resources of the supercomputer Fugaku provided by the RIKEN Center for Computational Science (ProjectID: rccs-hud).
- 35. Multiple-component floating point with internal scaling for stable arithmetic in large-scale linear solvers
Atsushi Suzuki(RIKEN Center for Computational Science)*; Toshiyuki Imamura(RIKEN Center for Computational Science) -
Utilizing arithmetic in several precisions to obtain solutions in numerical simulations is becoming more common, thanks to newly developed numerical schemes and to libraries for dealing with mixed-precision data. Using two-component FP64 data as double-double arithmetic was established by the DD/QD library around 15 years ago. There are several linear systems that require higher-precision data and operations than double precision, e.g., more than 24 digits of accuracy, due to a high condition number of the coefficient matrix. When the physical model contains an exponential function in the coefficient matrix, localized variation of the coefficients occurs, which makes the condition number of the matrix high.
Nowadays, we have more computational resources for FP32 than for FP64, so it is urgent to find a way to utilize FP32 data and arithmetic to achieve higher precision than FP64, e.g., FP128. FP32 consists of an 8-bit exponent and a 24-bit effective mantissa, and it is a natural idea to concatenate several FP32 values to express higher precision than FP64.
Five FP32 values can keep a 120-bit mantissa, which covers the 113-bit mantissa of FP128. However, we need to pay attention to the treatment of the exponent after concatenating several FP32 values. The concatenated FP32 mantissa consumes part of the exponent range, which reduces the exponent range of the compound data. Ahead of the last component of a five-component FP32 number, the 96-bit mantissa consumes 7 bits of the exponent, which leaves only a 1-bit exponent range to store the floating-point value. For multiplication of five-component FP32 numbers, a sixth component needs to be evaluated, with a risk of underflow.
This issue is not observed in FP64-based compound floating-point data thanks to the wide 11-bit exponent, but it is critical in the FP32-based case. A remedy is a combination of introducing separate data to keep information on the exponent and of scaling during arithmetic between components.
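A minimal NumPy sketch of the underlying building block: error-free two-component (double-word) addition in FP32, which is the primitive that multi-component formats chain together. The scaling remedy mentioned above is only hinted at in the final comment, and the exact scheme used in this work may differ.

```python
import numpy as np

f32 = np.float32

def two_sum(a, b):
    """Error-free transformation: a + b = s + e exactly, with s, e in FP32."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def dw_add(x, y):
    """Add two double-word FP32 numbers x = (hi, lo) and y = (hi, lo)."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi, lo = two_sum(s, e)          # renormalize
    return hi, lo

a = (f32(1.0), f32(1e-9))           # ~1 + 1e-9, more precise than a single FP32
b = (f32(3.0), f32(-2e-9))
hi, lo = dw_add(a, b)
print(hi, lo)                       # hi ~ 4.0, lo carries the small residual

# Note: each extra component eats into the representable exponent range of the
# compound value, which is why FP32-based formats need an explicit scaling
# factor (kept as separate data) to avoid underflow of the low-order parts.
```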
- 36. Improving Global Precipitation Forecast by Assimilating Frequent Satellite Microwave Observations
Rakesh Teja Konduru(Data Assimilation Research Team, RIKEN Center for Computational Science, Kobe)*; Jianyu Liang(Data Assimilation Research Team, RIKEN Center for Computational Science, Kobe); Shigenori Otsuka(Data Assimilation Research Team, RIKEN Center for Computational Science, Kobe); Takemasa Miyoshi(Data Assimilation Research Team, RIKEN Center for Computational Science, Kobe) - We explored the impact of assimilating frequent satellite microwave observations with a global atmospheric ensemble data assimilation system known as the NICAM-LETKF. We performed four Observing System Simulation Experiments (OSSEs) with conventional observation data and clear-sky AMSU-A satellite observations at different frequencies: hourly (1H), bi-hourly (2H), three-hourly (3H), and six-hourly (6H). The results showed that 1H and 2H resulted in higher Root Mean Square Error (RMSE) for air temperature compared to 3H and 6H because of dynamic imbalances caused by more frequent data assimilation. These imbalances were investigated by computing representative measures, such as the second time derivative of vertical velocity, and were found to be more problematic in 1H and 2H than in 3H and 6H. To understand the problem better, we investigated sensitivities to horizontal localization parameters and observation error settings. Adjusting the horizontal localization in the 1H-HLOC experiment reduced the air temperature RMSE by 5-10% but did not significantly affect the dynamic imbalance. Conversely, inflating the observation error standard deviations manually by 60% in the 1H-Rinfl experiment reduced the imbalances by 5-10% and enhanced the global and tropical representation of air temperature, with the RMSE reduced by 10-15%. However, manual tuning is computationally expensive. The Adaptive Observation Error Inflation (AOEI) method adjusts observation error standard deviations online by considering the innovations and is found to be effective in avoiding manual tuning. AOEI not only reduced the imbalance and RMSE in the 1H-AOEI experiment but also demonstrated superior performance compared to 3H and 6H and comparable results to 1H-Rinfl. This behaviour was consistent in the 2H-AOEI experiment. In addition, the results indicated that 1H markedly improved the air temperature representation and precipitation forecasts, particularly for tropical precipitation systems influenced by tropical waves, such as tropical storms and Kelvin waves. Further analysis showed an improved tropical precipitation forecast in 1H. In summary, the AOEI method can successfully rectify the imbalances due to frequent satellite data assimilation in the NICAM-LETKF.
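The adaptive inflation rule can be sketched as follows, in the spirit of AOEI: the observation error variance is inflated whenever the squared innovation exceeds what the background spread plus the prescribed observation error can explain. This is an illustrative simplification, not the exact implementation in the NICAM-LETKF.

```python
import numpy as np

def aoei_obs_error(innovation, sigma_o, sigma_b_obs):
    """Return an inflated observation error standard deviation.

    innovation  : y_obs - H(x_background_mean)
    sigma_o     : prescribed observation error std. deviation
    sigma_b_obs : background ensemble spread in observation space
    """
    # Inflate so the assumed error budget is consistent with the innovation.
    var_inflated = max(sigma_o**2, innovation**2 - sigma_b_obs**2)
    return np.sqrt(var_inflated)

# A large innovation relative to the spread triggers inflation;
# a small one leaves the prescribed error untouched.
print(aoei_obs_error(innovation=6.0, sigma_o=2.0, sigma_b_obs=1.5))  # inflated
print(aoei_obs_error(innovation=1.0, sigma_o=2.0, sigma_b_obs=1.5))  # unchanged (2.0)
```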
- 37. Simulation of the Potts Model on IBM Quantum Computers: Exploring Phase Structures and Error Mitigation Techniques
Jishnu Goswami(RIKEN)*; Katsumasa Nakayama(RIKEN) -
We employ the Potts model as a simplified toy model for Quantum Chromodynamics to investigate its phase structure using quantum computing technologies, specifically IBM devices featuring 127 qubits. Leveraging IBM’s Quantum Computing platform and the Variational Quantum Eigensolver (VQE) algorithm, we construct and simulate the Potts model by mapping its Hamiltonian for efficient multi-qubit operator representation. Furthermore, we explore various error mitigation strategies and, by comparing our results with exact diagonalization methods, address the inherent noise and decoherence challenges of current quantum hardware.
Acknowledgement: This project is supported by the FY2024 Incentive Research Project (project code: 202459499010) under the project name 寄22_使途_2件/CN・I/奨励24_JishnuGoswami
- 38. KL Divergence-based Dynamic Threshold Adjustment of Timestep Intervals for Enhancing Smart In-Situ Visualization
Kazuya Adachi(Kobe University)*; Taisei Matsushima(Kobe University); Naohisa Sakamoto(Kobe University); Jorji Nonaka(RIKEN R-CCS); Chongke Bi(Tianjin University) -
Abstract
In-situ visualization has become a key approach to address the ever-increasing I/O costs of modern large-scale simulations capable of scaling to extreme sizes. Over the past decade, in-situ visualization-related libraries and tools like KVS, Catalyst, LibSim, Ascent, SENSEI, and Kombyne have become widely accessible to HPC users. Most of these tools provide image- or video-based visualization capabilities, enabling interactive viewpoint adjustments during offline visual analysis. However, it usually requires the generation of dozens or even hundreds of rendering images at each visualization timestep, thus contributing to increased turnaround times during the analysis of simulation results. To overcome this limitation, smart visualization techniques aiming to reduce the total number of images while maintaining visualization quality have been proposed. For instance, using information entropy to identify user-preferred viewpoints, and reducing redundant image generation by focusing on key areas. In addition, camera path interpolation can further be applied to create a smooth transition between image sequences. We can also cite dynamic camera adjustments, such as focus and zoom, to emphasize potentially important features in the underlying simulation. These aforementioned techniques have been designed with static timestep intervals in mind, and in this poster, we will present an adaptive threshold determination method to dynamically adjust the viewpoint estimation timestep intervals, aiming to further reduce the total amount of rendering images. The proposed method uses Kullback-Leibler divergence to decide whether viewpoint estimation is necessary or not, based on a pre-defined threshold. We evaluated this method using the general-purpose KVS framework and a FORTRAN-based CFD simulation code on an x86/GPU Server. In our initial experiments, by using specific parameter settings, the total number of renderings could be reduced by up to 50%. As some future works, we plan to further evaluate different parameter settings and also to carry out some production runs on the supercomputer Fugaku.
Acknowledgements
This work was partially supported by MEXT as "Program for Promoting Researches on the Supercomputer Fugaku" (Grant Number JPMXP1020210123), and Fujitsu & Kobe University Joint Laboratory for Advanced Computing and Social Implementation (Fujitsu Small Research Lab).
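A minimal sketch of the triggering logic described above: viewpoint estimation (and the associated batch of renderings) is re-run only when the KL divergence between the current field's value distribution and that of the last rendered timestep exceeds a threshold. The histogram binning, threshold value, and synthetic field are placeholder choices.

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(field_a, field_b, bins=64, eps=1e-12):
    """KL divergence between the value histograms of two scalar fields."""
    lo = min(field_a.min(), field_b.min())
    hi = max(field_a.max(), field_b.max())
    p, _ = np.histogram(field_a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(field_b, bins=bins, range=(lo, hi), density=True)
    return entropy(p + eps, q + eps)   # D_KL(p || q)

threshold = 0.05
last_rendered = None
rng = np.random.default_rng(0)

for step in range(10):
    # Stand-in for the simulation field at this timestep (slow drift + noise).
    field = rng.normal(loc=0.02 * step, scale=1.0, size=100_000)
    if last_rendered is None or kl_divergence(field, last_rendered) > threshold:
        print(f"step {step}: re-run viewpoint estimation and render")
        last_rendered = field
    else:
        print(f"step {step}: reuse previous viewpoint, skip extra renderings")
```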
- 39. Design of a Power-efficient Data Compression System for Edge-cloud Computing Platforms
YIYU TAN(Iwate University) - Edge-cloud computing platforms are becoming important in the post-5G era, supporting applications like real-time data analysis, data communication in IoT, and machine learning at edge devices. In edge-cloud computing platforms, a big challenge is the large amount of data generated at the edge devices, which may overwhelm network bandwidth and increase energy consumption in the cloud. In this research, GZIP, a highly optimized lossless compression algorithm, was implemented on a field-programmable gate array (FPGA) to reduce data transfer in edge-cloud computing platforms. The whole system consisted of three kernels: LZ reduction, static Huffman, and CRC. The LZ reduction kernel implemented an LZ77 algorithm for data de-duplication. The static Huffman kernel realized a static Huffman coding scheme for bit reduction. The CRC kernel computed a CRC checksum over the input file. After the data compression system was implemented on the FPGA board DE10-Agilex, the system ran at 535 MHz and consumed 57% and 56% of the available logic resources and RAM blocks inside the FPGA, respectively. The data size was reduced by 48.4% and 52.5% for the datasets enwik8 and enwik9, respectively. Compared with pigz running on a desktop machine with an Intel Xeon Gold 6212U processor at 2.4 GHz, the FPGA-based data compression system achieved about 5.6x higher compression throughput with a slightly lower compression ratio, and 5.2x higher power efficiency.
- 40. Bandwidth/Compute Tradeoff Control Mechanism through Hardware-Based Data Compression
Tomohiro Ueno(RIKEN R-CCS)*; Kaito Kitazume(University of Tsukuba); Masato Kiyama(Kumamoto University); Kazutomo Yoshii(Argonne National Laboratory); Kentaro Sano(RIKEN R-CCS) -
Data compression is a technique that exploits the redundancy of data written in a particular format to convert them into smaller data volumes while preserving as much of its information as possible. The use of data compression can reduce data movement costs such as communication time and power consumption in large systems and networks. By employing a dedicated design circuit to achieve high-throughput processing, we have been developing a hardware-based system that can compress transmitted data in real-time while improving memory and network bandwidth and reducing power consumption.
In this poster, we present our research on controlling the tradeoff between the data compression ratio and the computational resources used. If the data compression ratio is high, the effective transmission bandwidth can be increased, so if this tradeoff can be controlled, it becomes possible to allocate computational resources between actual processing and available bandwidth. We introduce quantization parameters for the data compression hardware, which make it possible to control the tradeoff between the actual data compression ratio and the circuit area required for implementation. This parameter is independent of the compression algorithm and only affects the efficiency of the encoding of the data after compression.
The evaluation result shows that the quantization parameter can adjust the resource allocation between the compression ratio (i.e., the available bandwidth) and the required circuit resources. The quantization parameters make it significantly easier to explore the design space to optimize the mechanism for the target environment, especially the output specifications when implementing the adaptive data compression hardware we are aiming for.
This work was supported by JSPS KAKENHI JP22K17870 and JST ASPIRE JPMJAP2341.
- 41. A Glimpse into Generational Improvements Across Stratix 10, Agilex 7, and Agilex 7M FPGAs
Werner Oswaldo Florian Samayoa(Processor Research Team, RIKEN Center for Computational Science)*; Prasoon AMBALATHANKANDY(Processor Research Team, RIKEN Center for Computational Science); Jan ERIK REINHARD WICHMANN(Processor Research Team, RIKEN Center for Computational Science); Kentaro SANO(Processor Research Team, RIKEN Center for Computational Science) -
Estimating generational improvements in FPGA architectures is challenging because of the lack of detailed vendor disclosures and the diverse range of applications that defy standardized benchmarks. In this study, we evaluate the improvements attained by testing our scalable QEC core on three different Intel FPGA architectures. By systematically adjusting the scaling parameters of the core, we analyze the evolution of resource utilization and performance across these devices. Using a consistent behavioral description, we avoided architecture-specific constraints and provided a fixed framework for comparison. The results indicate that by upgrading the FPGA, a substantial performance improvement (up to 1.5x) can be attained.
Acknowledgements
This work was supported by the JST Moonshot R&D Grant Number JPMJMS226A.
- 42. A Comparative Survey of GPUs and ASICs for AI Acceleration
Akira Jinguji(RIKEN R-CCS)*; Kentaro Sano(RIKEN R-CCS) -
The exponential growth in deep learning workloads has fueled the development of specialized hardware accelerators. Among these, GPUs and ASICs stand out as the primary pillars driving AI applications. This poster provides a comparative survey of contemporary GPU and ASIC solutions, emphasizing their architectural distinctions, performance benchmarks, and key innovations that tackle challenges such as latency, throughput, and power consumption.
GPUs, with their highly parallel structures, continue to dominate various AI domains thanks to well-established ecosystems and flexible programmability. However, as model sizes and computational demands surge, recent ASIC designs have pushed the boundaries of efficiency by customizing resources for specific tasks, reducing overhead, and improving overall scalability.
We explore notable ASIC-based solutions such as wafer-scale engines, which consolidate massive compute resources to reduce communication latencies. We also cover chiplet-based architectures, where modular sub-chips are integrated to optimize yields, lower production costs, and enhance scalability. Additionally, in-package memory integration emerges as a crucial factor in minimizing data transfer times and curbing power draw, thereby increasing total throughput.
Through direct comparisons of the latest GPUs and ASICs, we highlight trade-offs in programmability, performance-per-watt, and deployment constraints. We discuss how the growing diversity of AI workloads, ranging from image recognition to large language models, shapes the choice between GPUs and ASICs, ultimately influencing cost-effectiveness and real-world practicality.
By focusing on these primary hardware accelerators, this survey offers insights into the current state of GPU- and ASIC-based AI solutions. Our findings aim to guide practitioners, researchers, and stakeholders in selecting and developing the most suitable hardware accelerators for next-generation AI infrastructures.
- 43. Symmetry Breaking: Efficient Structure Search for Quantum Tensor Network Systems
CHAO LI(RIKEN-AIP, Juntendo University) -
The tensor network (TN) has emerged as a powerful tool for both quantum and classical machine learning (ML), and its model selection, known as tensor network structure search (TN-SS), has recently gained much attention. In this work, we explore a new realm for TN-SS: we consider optimizing multiple structures jointly for a TN system, which is involved in many applications yet commonly recognized as a difficult problem due to the combinatorial explosion issue. To this end, we propose a simple and efficient method inspired by physics. The key idea is to treat the objective function as "energy" in physics and then introduce a "barrier" acting as a regularizer to reshape the landscape. In the method, we initialize the search with a positive barrier, which enforces a symmetric solution across TN structures, i.e., the entries of the solution are identical to each other. We then decrease the magnitude of the barrier until the sign flips to negative, at which point a phase transition called symmetry breaking (SB) finally occurs: the TN structures spontaneously diverge from each other. Extensive numerical results demonstrate that the SB mechanism allows the proposed method, at limited additional cost, to achieve much more compact TN structures for a TN system than existing TN-SS methods. Leveraging the computational power of RAIDEN and FUGAKU, the proposed method is also applied to challenges in large language models and quantum computing. The work will be presented in a collaborative paper, currently in preparation, with RIKEN colleagues from AIP and R-CCS. This work was partially supported by the RIKEN TRIP initiative (RIKEN Quantum).
- 44. Improving Image Classification Accuracy with Hybrid Classical-Quantum Machine Learning
Maxence Vandromme(RIKEN R-CCS)* -
Quantum Machine Learning is still a developing research field with several open problems.
One way to demonstrate its value is to show a gain in classification accuracy when using quantum circuits in Machine Learning (ML) algorithms, compared to fully classical models. The underlying assumption is that the highly non-linear functions represented by quantum circuits make it possible to find better solutions during the exploration of the parameter space. We focus here on the task of image classification, which has many real-life applications. In this context, quantum circuits are used to replace some of the layers in a neural network and have their parameters iteratively optimized to fit the training data. Previous studies have shown that this type of hybrid classical-quantum ML algorithm gives higher classification accuracy for several use cases.
However, quantum computing is very wide-ranging, encompassing different technologies, providers, software, and so on. When implementing a QML algorithm, one can choose between many different backends (simulators or physical devices) on which to run the quantum circuits. These backends have various properties and constraints, which may have significant effects on the feasibility or performance of the algorithm. These aspects are of prime importance when developing actual applications but tend not to be the main focus of research in QML.
We implement a transfer-learning-based hybrid classical-quantum ML algorithm and compare classification performance and runtime when running the quantum circuit on different backends. We detail the issues with training these algorithms on the current cloud-based physical devices. We present a framework for executing different parts of the learning process on different backends and show some first observations about this approach. We hope that such a systematic comparison will shed some light on the benefits and drawbacks of various types of quantum backends in the context of QML.
- 45. Performance evaluation of open-source Molecular Dynamics software on Virtual Fugaku
Chigusa Kobayashi(RIKEN Center for Computational Science)*; Kazuto Ando(RIKEN Center for Computational Science); Tsuyoshi Yamaura(RIKEN Center for Computational Science); Hikaru Inoue(RIKEN Center for Computational Science); Hitoshi Murai(RIKEN Center for Computational Science) -
In recent years, systems based on the ARM architecture have been adopted in High-Performance Computing (HPC). "Fugaku," developed by RIKEN and Fujitsu, is equipped with the A64FX processor. Additionally, Amazon Web Services (AWS) Graviton series and NVIDIA's Grace utilize the ARM architecture.
To establish a universal HPC software platform compatible with various computing environments, the RIKEN Center for Computational Science (R-CCS) is advancing the "Virtual Fugaku" project, which aims to make the software infrastructure developed for “Fugaku” available on other systems. As the first step of the “Virtual Fugaku”, an environment deployable on AWS Graviton 3E instances was released in August 2024. Users can download a singularity file from the R-CCS website and deploy the environment on AWS EC2 instances. Additionally, a small-scale test and development environment is experimentally provided to Fugaku users as "Satellite Fugaku".
Molecular Dynamics (MD) is a method based on Newton's equations of motion to calculate the motion of particles. It is widely used in the field of biophysics to understand the dynamics of atoms and molecules in various biological systems. We have evaluated the performance of open-source MD software on "Satellite Fugaku". In “Satellite Fugaku”, GENESIS, Gromacs, and LAMMPS are pre-installed via spack. We measured their performance with the benchmark sets provided with each software package. The GCC 14.1.0 compiler and the Elastic Fabric Adapter-enabled OpenMPI 4.1.6 on AWS EC2 instances were used for the computations.
- 46. A GAN-based Visualization Surrogate Model for Time-varying Numerical Simulations
Tomoya Miyake(Kobe University)*; Naohisa Sakamoto(Kobe University) -
Abstract
Time-varying, or transient, numerical simulations have been widely used to study and analyze dynamic and complex systems, such as automotive aerodynamics. For this purpose, HPC has been widely adopted due to its computationally intensive nature, and the continuous advancements in HPC technology have made large-scale and accurate simulations possible. However, since such simulations are time-consuming, a parameter study can become too expensive to perform in a reasonable timeframe. To minimize this problem, machine learning-based surrogate models have received increasing attention for providing efficient approximations of complex systems, by significantly reducing computational costs while maintaining acceptable accuracy. However, most of those surrogate models have focused mainly on replacing numerical simulations, while the visualization and analysis portion was not taken into consideration. Surrogate models capable of directly outputting visualization results have the potential to enhance the efficiency of the end-to-end simulation and analysis pipeline, and there are some prior works, such as the InSituNet, which uses Generative Adversarial Networks (GANs) to generate visualization images by specifying simulation and visualization parameters. However, InSituNet does not support time evolution, limiting its applicability in dynamic simulations where faster training and efficient image generation over time are essential. To overcome this limitation, in this poster, we present a GAN-based visualization surrogate model with added pixel shuffling functionality to accelerate the training process for enabling interactive visual analytics of time-varying simulations. We evaluated the proposed method by applying it to an automobile aerodynamic simulation and used PyTorch and a workstation with AMD EPYC 7313 CPU and NVIDIA RTX6000 GPU for the training. We compared the predicted images with the ground truth data and obtained reasonable results even when changing the initial parameter values. In some future works, we plan to apply it to other numerical simulations and conduct qualitative and quantitative evaluations with corresponding domain experts.
Acknowledgements
This work was partially supported by MEXT as "Program for Promoting Researches on the Supercomputer Fugaku" (Grant Number JPMXP1020210123), and Fujitsu & Kobe University Joint Laboratory for Advanced Computing and Social Implementation (Fujitsu Small Research Lab).
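The pixel-shuffling idea mentioned above can be illustrated with a small PyTorch upsampling block: a convolution expands the channel count by r², and nn.PixelShuffle rearranges those channels into an r-times larger image, avoiding transposed convolutions. The layer sizes below are placeholders and not the actual surrogate model.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Conv + PixelShuffle upsampling block (x2 per block)."""
    def __init__(self, channels, r=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)     # (B, C*r^2, H, W) -> (B, C, r*H, r*W)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

# A toy generator tail: latent feature map -> 4x larger visualization image.
decoder = nn.Sequential(
    UpsampleBlock(64), UpsampleBlock(64),
    nn.Conv2d(64, 3, kernel_size=3, padding=1), nn.Sigmoid(),
)

features = torch.randn(1, 64, 32, 32)   # stand-in for conditioned latent features
image = decoder(features)
print(image.shape)                       # torch.Size([1, 3, 128, 128])
```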
- 47. Critical Path Detection with Ray Tracing Core Acceleration
Zhengyang Bai(RIKEN R-CCS)*; Peng Chen(National Institute of Advanced Industrial Science and Technology); Chen Zhuang(RIKEN R-CCS, Institute of Science Tokyo); Jing Xu(Kyoto University); Emil Vatai(RIKEN R-CCS); Mohamed Wahib(RIKEN R-CCS) - The critical path is widely used in project management; it determines the longest sequence of dependent tasks and activities required to complete a project and helps identify the most crucial tasks that directly impact the project's duration. However, critical path detection is difficult to accelerate due to the random data access of sparse matrices and the algorithm's complexity. In this presentation, we introduce a radically different approach to acceleration using the ray tracing cores in the latest GPUs.
- 48. A general method for the initial tensor for TRG with the Steiner tree problem
Katsumasa Nakayama(RIKEN)*; Manuel Schneider(NYCU) - In this poster, we present a general method for constructing the tensor network representation of the tensor renormalization group based on the Boltzmann factor representation. Our approach can be interpreted as a solution to the minimum spanning tree problem, providing valuable insights into the properties of the tensor.
- 49. Energy Conversions in the Venus Atmosphere: Insights from Bred Vector Energy Equations
Jianyu Liang(RIKEN R-CCS)*; Norihiko Sugimoto(Keio University); Takemasa Miyoshi(RIKEN R-CCS) -
Conducting numerical simulations of the planetary atmospheres is important to deepen our knowledge of the atmospheres and provide their forecasts. Analyzing the simulation results can help us understand the instabilities and related energy conversion in the planetary atmospheres. Two instabilities are very important. The baroclinic instability is related to the meridional temperature gradients and the barotropic instability is related to the horizontal wind shear. Traditional methods like the Lorenz energy cycle can analyze energy conversions. However, it relies on zonal averages as the basic state, which works well for the Earth's atmosphere but less so for Venus's atmosphere, where longitudinal variations are significant.
In this study, we applied the Bred Vector (BV) energy equations to study energy conversions in the Venus atmosphere. This method uses a control run as the basic state, capturing longitudinal dependencies and quantifying energy contributions from baroclinic and barotropic instabilities. It has been applied previously to the atmospheres on Earth and Mars, but not Venus. The results show that baroclinic conversions are stronger at higher latitudes and exceed barotropic conversions at mid- to high-latitudes in the cloud layer. Thermal tide, related to solar heating, increases the baroclinic and barotropic energy conversions in the morning hemisphere at the mid-latitudes in the cloud layer. This study provides new insights into the energy conversion of Venus’s atmosphere with potential application to other planetary atmospheres.
- 50. Parallel Bayesian Optimization of Atomic-Level Interaction Descriptors for Protein Modeling
Shuntaro Chiba(RIKEN R-CCS)*; Tsutomu Yamane (RIKEN R-CCS); Mitsunori Ikeguchi (RIKEN R-CCS,Yokohama City University); Masateru Ohta(RIKEN R-CCS) -
Understanding protein–protein interactions can be facilitated by modeling protein structures at atomic resolution. To address this challenge, we developed a predictive model, using a machine learning algorithm, that assesses the likelihood of modeled amino acid sidechain structures occurring naturally, based on a set of interaction descriptors. These descriptors represent van der Waals (vdW) interactions, hydrogen bonds, and so-called weak interactions, such as CH-O and CH-π interactions. Our prior research has demonstrated that incorporating these weak interactions enhances discrimination performance.
Some interaction definitions, such as distances between heavy atoms, were determined empirically, indicating scope for optimization to further improve predictive accuracy. To explore this, we developed a robust workflow that includes descriptor generation, predictive model construction, and performance evaluation, utilizing a structural dataset for training. In addition, we implemented a parallel Bayesian optimization framework to refine the interaction definitions. By deploying this optimization system on the Fugaku supercomputer, we significantly accelerated computationally intensive tasks, including descriptor generation and hyperparameter tuning for the predictive model, via massively parallel execution. This approach facilitated the identification of interaction definitions with high optimization potential, resulting in improved discrimination performance.
These results can potentially accelerate discoveries in drug design, molecular biology, and other areas where understanding structural protein interactions is crucial.
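A schematic of one parallel Bayesian optimization iteration under simple assumptions: a Gaussian process surrogate scores candidate interaction-definition parameters by expected improvement, and the top-q candidates are evaluated concurrently (e.g., one per node). This is a generic sketch using scikit-learn, not the framework actually deployed on Fugaku; the objective, dimensions, and batch strategy are placeholders.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def objective(x):
    # Stand-in for the expensive evaluation (descriptor generation + model
    # training + scoring); x could be, e.g., interaction distance cutoffs.
    return -np.sum((x - 0.3) ** 2, axis=-1)

dim, q = 4, 8                               # parameter dimension, batch size
X = rng.uniform(size=(10, dim))             # initial design
y = objective(X)

for it in range(5):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)

    # Score a large pool of random candidates by expected improvement (EI).
    cand = rng.uniform(size=(2000, dim))
    mu, sd = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sd, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

    # Naive batch selection: take the q highest-EI candidates and evaluate
    # them "in parallel" (sequentially here, standing in for q concurrent jobs).
    batch = cand[np.argsort(ei)[-q:]]
    X = np.vstack([X, batch])
    y = np.concatenate([y, objective(batch)])
    print(f"iteration {it}: best score so far = {y.max():.4f}")
```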
- 51. GPU Implementation of Lattice QCD code with OpenACC
Issaku Kanamori(RIKEN)*; Tatsumi Aoyama(University of Tokyo); Kazuyuki Kanaya(University of Tsukuba); Hideo Matsufuru(KEK); Yusuke Namekawa(Hiroshima University); Hidekatsu Nemura(Osaka University); Keigo Nitadori(RIKEN) -
Lattice QCD (LQCD) is a physics application that treats the interaction of quarks and gluons. It is one of the typical HPC applications on massively parallel systems. In this presentation, we report the status of Bridge++, a general lattice QCD code set, focusing on its GPU implementation with OpenACC. The most time-consuming part of LQCD simulation is solving a discretized partial differential equation called the Dirac equation, for which iterative solvers such as the Conjugate Gradient method are applied.
Owing to the nature of the partial differential equation, the kernel of the discretized equation is a sparse matrix whose structure depends on the discretization scheme. We implement most of the major matrices and report the performance on NVIDIA GH200 and H100 systems.
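As a minimal stand-in for the solver stage described above, the following SciPy sketch applies the Conjugate Gradient method to a sparse symmetric positive-definite stencil matrix. A real Dirac operator is more complicated (non-Hermitian, so CG is typically applied to the normal equations), and the matrix here is purely illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# A 2-D Laplacian-like stencil as a toy sparse, SPD "lattice" operator.
n = 64
lap1d = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
A = sp.kron(sp.identity(n), lap1d) + sp.kron(lap1d, sp.identity(n))
A = (A + 0.1 * sp.identity(n * n)).tocsr()     # shift keeps it well conditioned

b = np.ones(n * n)
x, info = cg(A, b, maxiter=1000)               # iterative Krylov solve

print("converged" if info == 0 else f"cg returned info={info}")
print("residual norm:", np.linalg.norm(b - A @ x))
```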
- 52. Performance Evaluation of Quantum Computing Simulation Methods for Intra-Node Multi-GPU Systems: Qubit Reordering and PGAS Approaches
Naoto Aoki(RIKEN Center for Computational Science)*; Naoki Yoshioka(RIKEN Center for Computational Science); Nobuyasu Ito(RIKEN Center for Computational Science) -
In recent years, the development of quantum computers has progressed dramatically, and proposals for various applications have become increasingly active.
There are two types of quantum computers: NISQ (Noisy Intermediate-Scale Quantum) computers, which are affected by noise, and
FTQCs (Fault-Tolerant Quantum Computers), which are equipped with fault tolerance.
NISQ computers are being developed and applications for them are being proposed, but their range of applications is limited.
FTQCs, on the other hand, are expected to enable large-scale, accurate quantum computation by utilizing error-correction technology, but they are still at the research stage.
In this context, high-precision quantum computer simulators are essential for the research and development of future FTQC algorithms.
In addition, the performance of GPUs (Graphics Processing Units) has improved significantly due to the explosive demand for AI (Artificial Intelligence) technology in recent years. It is hoped that the computational power of such GPUs can be utilized to achieve fast and efficient quantum simulations. In this research, we used such high-performance GPUs to implement a prototype and identify performance issues.
In this presentation, we evaluate implementation methods for a quantum computer simulator in an environment with multiple GPUs per node; the core state-vector kernel that both methods distribute is sketched after this abstract.
As a comparison, we prepared (1) an implementation using the qubit reordering proposed by K. De Raedt et al. (2006) and (2) an implementation using a partitioned global address space (PGAS) as proposed by A. Li et al. (2021).
Comparing (1) and (2), the method in (1) required approximately 1.2 times the computation time of the method in (2).
This result is likely because the synchronization cost of qubit reordering exceeds the communication cost in environments where high-speed intra-node communication is available.
This presentation is based on results obtained from a project, JPNP20017, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
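For readers unfamiliar with state-vector simulation, the sketch below shows the serial core of such a simulator: applying a single-qubit gate to an n-qubit state vector. The qubit-reordering and PGAS strategies compared in the abstract differ in how the paired amplitudes are brought together across GPUs; that distribution logic is not reproduced here.

```python
# Minimal sketch of the core kernel of a state-vector quantum simulator:
# applying a single-qubit gate to an n-qubit state. The multi-GPU data
# distribution (qubit reordering vs. PGAS) decides how the two amplitudes
# paired below are brought together across devices; that part is omitted.
import numpy as np

def apply_single_qubit_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to qubit `target` of an n-qubit state vector."""
    psi = state.reshape([2] * n_qubits)
    # Move the target axis to the front, apply the gate, move it back.
    psi = np.moveaxis(psi, target, 0)
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

n = 20
state = np.zeros(2 ** n, dtype=np.complex128)
state[0] = 1.0                                    # |00...0>
hadamard = np.array([[1, 1], [1, -1]], dtype=np.complex128) / np.sqrt(2)
state = apply_single_qubit_gate(state, hadamard, target=0, n_qubits=n)
print("norm:", np.vdot(state, state).real)        # should remain 1.0
```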
- 53 Memory-Efficient Parallel-in-Spacetime 4-D AMR for Chaotic PDEs
Masado Ishii(RIKEN)* -
Strong scaling (solving a fixed problem size faster by using more cores) is essential to speed up the solution of partial differential equations (PDEs) in such areas as numerical weather prediction and aerodynamic design. Weak scaling (solving larger problem sizes with more cores in the same time) is also required to meet the surging demand for higher fidelity simulations. Multigrid methods have long been known for both their strong and weak scaling. Two essential properties are 1) parallelism of local computations on spatial subdomains and 2) fast convergence of the solver with a constant number of iterations as either the problem size or number of parallel subdomains grows. Furthermore, parallel octree-based data structures are often used for adaptive mesh refinement (AMR) to automatically generate hierarchical subdomain partitions, specifically suitable for geometric multigrid methods.
However, for time-evolving problems over long time intervals, spatial parallelism inevitably reaches a limit, rendering the sequential nature of time-marching a scalability bottleneck. Thus, several groups over the last six decades have studied techniques to parallelize in time (PinT), a landmark example being Parareal (2001). Interest in PinT has intensified in anticipation of exascale computing and beyond, with varying rates of success depending on the type of problem. PinT speedup for a chaotic PDE was achieved for the first time in 2023 by Vargas et al., using Multigrid Reduction in Time (MGRIT) augmented with Lyapunov stability theory (here called MGRIT-θ/Δ).
This poster proposes a plan to combine the capabilities of MGRIT-θ/Δ with the memory efficiency that is possible with spacetime AMR and reduced precision. Firstly, MGRIT-θ/Δ assumes that the time domain is thinly sliced into synchronized planes. Spacetime AMR saves memory by permitting nonuniform adaptive (space and) time step size---that is, dramatically fewer time steps where possible. Secondly, crucial to convergence, MGRIT-θ/Δ stores potentially many leading Lyapunov vectors to span the unstable manifold. Fortunately, the computation of these vectors is a well-conditioned problem. However, the minimum number of Lyapunov vectors needed to maintain fast convergence depends on the problem, ranging from nine in the simplest case of the Kuramoto-Sivashinsky system to perhaps dozens or more for other PDEs. To mitigate overhead, we propose to store the Lyapunov vectors in reduced precision. These improvements will help realize the ultimate goal of PinT-accelerated, highly dynamic AMR multiphysics simulations.
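As background for the parallel-in-time machinery discussed above, here is a minimal Parareal sketch on a scalar ODE. Parareal is only the landmark method the abstract cites; MGRIT and the Lyapunov-augmented MGRIT-θ/Δ targeted in this work are considerably more elaborate, and the propagators below are toy choices.

```python
# Minimal Parareal sketch on the scalar ODE y' = -y, illustrating the
# parallel-in-time idea: a cheap coarse propagator corrects a serial sweep,
# while the accurate fine propagations over each time slice are independent
# and could run in parallel.
import numpy as np

def fine(y, t0, t1, steps=100):      # accurate (expensive) propagator
    dt = (t1 - t0) / steps
    for _ in range(steps):
        y = y + dt * (-y)
    return y

def coarse(y, t0, t1):               # cheap propagator: one Euler step
    return y + (t1 - t0) * (-y)

T, N = 2.0, 10                       # time interval split into N slices
t = np.linspace(0.0, T, N + 1)
y = np.empty(N + 1)
y[0] = 1.0
for n in range(N):                   # initial serial coarse sweep
    y[n + 1] = coarse(y[n], t[n], t[n + 1])

for k in range(5):                   # Parareal iterations
    # Fine propagations over each slice are independent -> parallel in time.
    f = np.array([fine(y[n], t[n], t[n + 1]) for n in range(N)])
    g_old = np.array([coarse(y[n], t[n], t[n + 1]) for n in range(N)])
    y_new = y.copy()
    for n in range(N):               # serial correction sweep
        y_new[n + 1] = coarse(y_new[n], t[n], t[n + 1]) + f[n] - g_old[n]
    y = y_new

print("Parareal:", y[-1], " exact:", np.exp(-T))
```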
- 54 From Hits to Drugs: AI Applications in Multi-Objective Structure Development
Kazuyoshi Ikeda (R-CCS, Keio University)*; Tomoki Yonezawa (Keio University); Masateru Ohta (R-CCS); Shoichi Ishida (Yokohama City University); Yugo Shimizu (R-CCS); Kei Terayama (Yokohama City University); Teruki Honma (RIKEN) -
Various structure generation methods have been developed to enhance drug design efficiency. This study explores AI-based generation of marketed-drug and clinical-candidate structures to validate its applicability in drug discovery. We aimed to replicate the structure development process from hit compounds to approved drugs or candidates, focusing on multi-objective optimization for primary activity, solubility, membrane permeability, and pharmacokinetic properties. ChemTS, a structure generation AI, employs reinforcement learning to optimize target parameters. By integrating predictive models of physicochemical and ADME properties as reward functions, we searched for structures with improved predicted values (a hypothetical composite reward is sketched after this abstract). Combining docking or 3D shape similarity with these models, we generated structures using ChemTS to achieve enhanced solubility, membrane permeability, and metabolic stability. To evaluate the approach, we identified successful structure development examples from a Journal of Medicinal Chemistry review, extracting hit-drug and hit-candidate pairs and related activity data from ChEMBL. For DORAVIRINE, an HIV reverse transcriptase inhibitor, we reproduced its structure using rewards for docking, membrane permeability, solubility, and metabolic stability. Similarly, for TEPOTINIB, a c-Met inhibitor, we generated structures using docking, membrane permeability, and metabolic stability as rewards, producing highly active compounds similar to TEPOTINIB. These findings highlight the strengths and limitations of ChemTS in structure generation for drug discovery, providing insights into its application for optimizing complex pharmacological properties.
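As a purely hypothetical illustration of the kind of multi-objective reward plugged into a ChemTS-style generator, the sketch below aggregates placeholder predictions for docking, solubility, permeability, and metabolic stability. The property keys, scaling ranges, and geometric-mean aggregation are assumptions made for illustration; the actual predictive models and reward design used in the study are not specified in the abstract.

```python
# Hypothetical multi-objective reward sketch. In the real workflow, each
# term would come from a trained docking/ADME predictor evaluated on a
# generated structure; here they are read from a dict so the sketch runs.
import math

def scaled(value, low, high):
    """Map a raw predicted value onto [0, 1] for reward aggregation."""
    return min(max((value - low) / (high - low), 0.0), 1.0)

def reward(props):
    docking      = scaled(-props["docking"], 6.0, 12.0)   # more negative docking = better
    solubility   = scaled(props["logS"], -6.0, 0.0)
    permeability = scaled(props["caco2"], -7.0, -4.0)
    stability    = scaled(props["stability"], 0.0, 100.0)
    terms = [docking, solubility, permeability, stability]
    # Geometric mean: a molecule that fails any one objective scores low.
    return math.prod(terms) ** (1.0 / len(terms))

example = {"docking": -9.5, "logS": -3.2, "caco2": -5.1, "stability": 60.0}
print(round(reward(example), 3))
```

The geometric mean is one common aggregation choice for pushing the generator toward structures that satisfy all objectives at once, rather than excelling at a single one.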
- 55 PIMID: A Full-System Simulator for Processing-in-Memory with Intricacy and Diversity
Yuan He(RIKEN Center for Computational Science)*; Masaaki Kondo(RIKEN Center for Computational Science); Galen M. Shipman(Los Alamos National Laboratory); Jered Benjamin Dominguez-Trujillo(Los Alamos National Laboratory) -
Processing-in-Memory (PIM) has emerged as a promising solution to the longstanding memory wall challenge, driven by the increasing disparity between processor and memory performance. As computational demands continue to grow, conventional memory systems struggle to keep pace, prompting the need for innovative approaches that reduce data movement and improve energy efficiency. PIM offers a way forward by integrating computational capabilities directly into memory systems, but current tools for exploring this paradigm remain limited in their scope and flexibility.
To address these shortcomings, we introduce PIMID, a full-system simulation framework tailored for the comprehensive evaluation of PIM architectures. PIMID stands out for its ability to co-simulate host and memory devices in real time with the help of multiple ZSim simulation processes, alongside its support for diverse memory technologies, including DRAM, SRAM, and STT-MRAM, using tools such as Ramulator, CACTI, and NVSim to ensure accurate modeling. It also allows detailed, fine-grained configuration of the processing elements (PEs) at various levels, such as subarrays, banks, chips, and stacks, which enables precise architectural exploration.
Another key feature of PIMID is its advanced support for in-memory networks. By incorporating detailed network models like GARNET, it captures complex data communication patterns and addresses often-overlooked challenges, such as disparities in addressing between host systems and PEs and the need for realistic interconnect designs.
PIMID provides researchers and engineers with a flexible, high-fidelity platform for designing and evaluating innovative PIM architectures. By offering extensive configurability and support for emerging technologies, it serves as a robust foundation for exploring the next generation of memory-centric computing systems.
- 56 Assessment of physics-informed deep operator network as a surrogate model for fluid flow at different conditions
Junya Onishi(Center for Computational Science, RIKEN)*; Harutaka Kitagawa(Graduate School of System Informatics, Kobe University); Makoto Tsubokura(Center for Computational Science, RIKEN, Graduate School of System Informatics, Kobe University) -
In this study, we develop a surrogate model for fluid flow at different flow conditions by employing Physics-Informed Deep Operator Network (PI-DeepONet).
The operator-learning capability of the Deep Operator Network is leveraged to train the neural network to learn the effects of different flow conditions, such as the Reynolds number and the inlet velocity distribution.
Then, the Navier-Stokes equations are embedded into the loss function following the concept of Physics-Informed Neural Networks (PINNs); a minimal example of such a physics-informed loss is sketched after this abstract. This enables the neural network to learn the solution of the Navier-Stokes equations without the need for large volumes of high-quality training data, or even without any data at all.
Therefore, compared to traditional data-driven approaches, this data-independent approach can be expected to enhance the reliability and generalizability of the neural network, resulting in a powerful surrogate model for predicting fluid flow across a wide range of flow conditions.
We assess the performance of the PI-DeepONet in terms of generalizability, accuracy, and training cost.
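The following PyTorch sketch illustrates the physics-informed loss idea on 1-D viscous Burgers rather than the Navier-Stokes equations, and it omits the DeepONet branch/trunk structure that conditions on flow parameters. The network size, collocation sampling, and initial condition are illustrative assumptions.

```python
# Minimal PINN-style loss sketch: the PDE residual is penalized at random
# collocation points instead of fitting simulation data. Shown for 1-D
# viscous Burgers (u_t + u u_x = nu u_xx), not the Navier-Stokes equations.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
nu = 0.01  # viscosity

def pde_residual(x, t):
    u = net(torch.cat([x, t], dim=1))
    u_x, u_t = torch.autograd.grad(
        u, (x, t), grad_outputs=torch.ones_like(u), create_graph=True)
    u_xx = torch.autograd.grad(
        u_x, x, grad_outputs=torch.ones_like(u_x), create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(100):
    x = torch.rand(256, 1, requires_grad=True)           # collocation points
    t = torch.rand(256, 1, requires_grad=True)
    x0 = torch.rand(128, 1)                               # initial-condition points
    loss_pde = pde_residual(x, t).pow(2).mean()
    loss_ic = (net(torch.cat([x0, torch.zeros_like(x0)], dim=1))
               - torch.sin(torch.pi * x0)).pow(2).mean()
    loss = loss_pde + loss_ic                             # physics + initial condition
    opt.zero_grad()
    loss.backward()
    opt.step()
```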
- 57 Data assimilation in volcanic deformation
Shungo Tonoyama(Data Assimilation Research Team, RIKEN Center for Computational Science)*; Atsushi Suzuki(Large-scale Parallel Numerical Computing Technology Research Team, RIKEN Center for Computational Science); Kengo Nakajima(Large-scale Parallel Numerical Computing Technology Research Team, RIKEN Center for Computational Science, The University of Tokyo); Takemasa Miyoshi(Data Assimilation Research Team, RIKEN Center for Computational Science) -
Data assimilation combines numerical modelling and observational data and is now being applied across various fields in Earth sciences. Here, we propose a new data assimilation method for volcanic eruptions. We developed an elastic deformation model based on the finite element method and carried out identical twin experiments. Data assimilation experiments were performed by applying the Kalman filter and the adjoint method. The surface displacements from the true model were perturbed with independent Gaussian random noise to represent the actual observation data estimated by time-series Interferometric Synthetic Aperture Radar (InSAR) analysis. By assimilating the ground deformation data, the state of the magma chamber was estimated.
In the Kalman filter experiment, a magma chamber initially placed at a location different from the true one was sequentially adjusted by data assimilation (a generic Kalman update step is sketched after this abstract). The results show that the analyzed magma chamber depth converged to the true state after assimilating data for eight steps (equivalent to two months), likely because the depth has a significant impact on the observed ground deformation.
In the adjoint method, the surface stress field of the magma chamber is estimated by minimizing the difference between the observed and model-predicted surface displacements. Here, the objective is to solve an inverse problem to accurately determine the surface stress field of the magma chamber, assuming its location and size are known. The results showed convergence after approximately 50 iterations and successful reproduction of the target stress field.
The proposed approaches have distinct advantages in predicting the transition of volcanic eruptions. Our future research will explore the application to real-world observation data for volcanic disaster prevention.
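A generic Kalman filter analysis step, of the kind applied sequentially in the twin experiments, is sketched below. The two-component state, the linear observation operator, and the noise levels are toy assumptions; the actual study uses a finite-element deformation model to map the magma chamber state to surface displacements.

```python
# Generic Kalman filter analysis step. The state here is a toy 2-vector
# (e.g. chamber depth and pressure change); the real study uses a
# finite-element forward model as the observation operator.
import numpy as np

def kalman_update(x, P, y, H, R):
    """Update state estimate x (covariance P) with observation y = H x + noise."""
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x + K @ (y - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

x = np.array([4.0, 0.0])                    # first guess: depth 4 km, no pressure change
P = np.diag([4.0, 1.0])                     # prior uncertainty
H = np.array([[0.5, 1.2],                   # toy linearized displacement operator
              [0.3, 0.8]])
R = 0.01 * np.eye(2)                        # observation-error covariance
truth = np.array([2.0, 1.0])

rng = np.random.default_rng(1)
for step in range(8):                       # eight assimilation steps, as in the abstract
    y = H @ truth + rng.normal(0.0, 0.1, size=2)
    x, P = kalman_update(x, P, y, H, R)
print("analysis:", x, " truth:", truth)
```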
- 58 Data driven hydrological simulation at national scale
Tristan Erwan Marie Hascoet (RIKEN R-CCS)*; Victor Pellet (LMD X); Takemasa Miyoshi (RIKEN R-CCS) -
Climate change is increasing stress on water management operations worldwide. In Japan, public dam infrastructure faces the dual challenge of mitigating flood risk while aiming to strengthen its hydroelectric power generation. Simultaneously, recent advances in deep learning (DL) have demonstrated notable improvements in hydrological prediction and rainfall nowcasting, raising the question of whether these advances can yield actionable policy insights for Japanese dam operations.
We develop a national-scale dataset tailored for data-driven hydrological modeling. Key to this dataset is a “gauge- and dam-aware” segmentation of hydrologically corrected elevation maps into catchments. We then overlay Japan Meteorological Agency (JMA) land surface observations and Ministry of Land, Infrastructure, Transport and Tourism (MLIT) hydrological measurements onto these catchments. Building on this foundation, our proposed data-driven hydrological model pipeline includes runoff, routing, and inundation modules in an end-to-end differentiable, GPU-accelerated framework. We refactor traditional Muskingum routing in time and space to propose an efficient GPU implementation leveraging block-sparse causal convolution. Additionally, we aggregate routing kernels using scattered complex-product transformations in frequency space.
On our newly constructed dataset, the proposed model achieves a median Nash–Sutcliffe Efficiency (NSE) of 0.82, indicating robust predictive performance (the routing recursion and the NSE metric are sketched after this abstract). Error analysis reveals four main error sources: snowmelt modeling, heavy rainfall events, base flow estimation, and miscellaneous factors. We highlight the impact of snowfall underestimation in northern Japan, biases in heavy-rain predictions, and, more generally, the challenges data-driven approaches face in dealing with the biases of noisy observations.
Our results confirm the potential of DL approaches for modeling hydrological processes at national scale, towards the goal of providing efficient forecasts to inform dam operations. Ongoing work focuses on refining the model's ability to deal with biases in the available set of observations.
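Two ingredients mentioned above can be sketched compactly: the classic Muskingum routing recursion (which the study refactors into a GPU block-sparse causal convolution, not shown here) and the Nash-Sutcliffe Efficiency used to score simulated discharge. The reach parameters and hydrograph below are illustrative.

```python
# Sketch of classic single-reach Muskingum routing (sequential recursion)
# and the Nash-Sutcliffe Efficiency (NSE) used to evaluate the model.
import numpy as np

def muskingum_route(inflow, K=2.0, x=0.2, dt=1.0):
    """Route an inflow hydrograph through one reach (time unit: hours)."""
    denom = 2 * K * (1 - x) + dt
    c0 = (dt - 2 * K * x) / denom
    c1 = (dt + 2 * K * x) / denom
    c2 = (2 * K * (1 - x) - dt) / denom
    out = np.zeros_like(inflow)
    out[0] = inflow[0]
    for t in range(1, len(inflow)):
        out[t] = c0 * inflow[t] + c1 * inflow[t - 1] + c2 * out[t - 1]
    return out

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the mean of obs."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

inflow = np.concatenate([np.linspace(1, 10, 12), np.linspace(10, 1, 24)])
obs = muskingum_route(inflow) + np.random.default_rng(0).normal(0, 0.2, inflow.size)
print("NSE:", round(nse(muskingum_route(inflow), obs), 3))
```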
- 59 Porting of Intel Spin-qubit Quantum Simulator to Fugaku Supercomputer
Soratouch Pornmaneerattanatri(RIKEN R-CCS)*; Miwako Tsuji(RIKEN R-CCS); Mitsuhisa Sato(RIKEN R-CCS) -
The spin-qubit quantum machine, one of the prominent quantum machine architectures, is currently under development by Intel. Electron spin-qubit devices are built on silicon-based gate-defined quantum dots (QDs) that utilize existing complementary metal-oxide-semiconductor (CMOS) manufacturing technology to achieve small device sizes and, effectively, increased scalability of the quantum processing unit. Additionally, the qubits of spin-qubit quantum machines exhibit long coherence times and high-fidelity read-out, making them a compelling platform. Manufacturers of quantum machines typically provide a quantum software development kit (SDK), along with a quantum simulator that emulates quantum phenomena specific to their quantum computing designs. Similarly, Intel develops the Intel Quantum SDK (IQSDK), which provides compilation, runtime, and simulation support for its spin-qubit quantum computer. The IQSDK is an LLVM-based compiler extension. Alongside the SDK, the simulator, also built with the LLVM compiler, functions as a state-vector simulator that facilitates the testing and validation of Intel quantum programs on a qubit-agnostic back end, offering both noiseless and noisy modes.
We already provide SDKs for superconducting and trapped-ion quantum machines on the Fugaku supercomputer. To support researchers' interest in a wide variety of quantum hardware, we ported the IQSDK, including the spin-qubit quantum simulator, to the Fugaku supercomputer. However, Fugaku's processor, the Fujitsu A64FX, is based on the Arm architecture rather than Intel's x86_64, and the simulator (IQS) is developed for and embedded in the Intel software environment, especially the Math Kernel Library (MKL), which is incompatible with the Fujitsu-developed software environment of Fugaku. To build and run the simulator on Fugaku, we modified the simulator source code, substituting the Intel software environment with the Fugaku software environment. The port was successful, with all functions, including qubit and classical-bit behavior and quantum gate operations, performing as intended.
- 60 Investigating Secondary Structure Effects on TDP-43 Condensation Using Coarse-Grained Molecular Dynamics Simulations
Zhang Yangyang(Computational Biophysics Research Team, R-CCS)*; Cheng Tan(Computational Biophysics Research Team, R-CCS); Yuji Sugita(Computational Biophysics Research Team, R-CCS) -
Molecular dynamics (MD) simulations have become an essential tool in computational biophysics, offering insights into complex biological phenomena such as protein condensation during liquid-liquid phase separation (LLPS). Coarse-grained (CG) models, such as Martini and SPICA, play a critical role in bridging the gap between atomic-level details and large-scale simulations, enabling the exploration of macromolecular interactions with reduced computational cost.
Recent experiments and residue-level CG simulations suggested that secondary structures, particularly helices, may enhance TDP-43 low-complexity domain (LCD) assembly during LLPS. However, Martini CG simulations showed decreased interchain contacts under extended helix constraints, while hybrid Martini-level CG models with all-atom sidechains failed to show enhanced assembly due to reinforced secondary structure constraints. These discrepancies underscore the need for further investigation into the role of secondary structures in LLPS.
In this study, we employed the SPICA v2.0 CG model, which features well-defined long-range interactions for various secondary structures, to examine the effects of helices on TDP-43 LCD condensation. We analyzed two extreme models: one representing the TDP-43 LCD as a purely intrinsically disordered protein (IDP), and the other incorporating elastic network (EN) constraints on the helical regions (the EN restraint idea is sketched after this abstract). We conducted MD simulations on double-chain systems to evaluate helix-helix interactions and on 100-chain systems at various temperatures to construct a phase diagram.
Our results revealed that secondary structure constraints alone do not drive enhanced phase separation. Contact maps and phase diagrams showed no significant differences between the IDP and EN-constrained models. This suggests that helix-enhancing mutations likely promote assembly through a coupled folding-binding mechanism, rather than directly through secondary structure stabilization.
This study underscores the importance of employing advanced CG models like SPICA in deciphering complex biomolecular interactions. Our findings contribute to the understanding of secondary structure effects in LLPS and highlight the potential of computational approaches in advancing biophysical research.
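The elastic-network restraint used to keep helical regions folded can be sketched as harmonic springs between reference contacts, as below. The force constant, cutoff, and random toy coordinates are illustrative assumptions, not the SPICA parameters used in the study.

```python
# Sketch of the elastic-network (EN) restraint idea: harmonic springs between
# all bead pairs of a helix that are within a cutoff in the reference
# structure. Parameter values here are illustrative only.
import numpy as np

def elastic_network_energy(coords, ref_coords, k=500.0, cutoff=0.9):
    """Sum of harmonic restraints (kJ/mol, nm) over reference contacts."""
    n = len(ref_coords)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r0 = np.linalg.norm(ref_coords[i] - ref_coords[j])
            if r0 < cutoff:                      # pair is a reference contact
                r = np.linalg.norm(coords[i] - coords[j])
                energy += 0.5 * k * (r - r0) ** 2
    return energy

rng = np.random.default_rng(0)
ref = rng.uniform(0, 1.5, size=(20, 3))          # toy reference helix beads (nm)
perturbed = ref + rng.normal(0, 0.05, size=ref.shape)
print("EN restraint energy:", round(elastic_network_energy(perturbed, ref), 2))
```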