The continuous improvement in processing speed in high-performance computer systems such as the supercomputers K and Fugaku have been enabled by transistor scaling. This trend is known as Moore’s law. But Moore’s law is predicted to end in the near future. Hence, it is vital to research and develop novel and more efficient high performance computer systems if we are to continue realizing high performance computing in the future.
Based on the experience of hardware development and the existing software environment of the supercomputers K and Fugaku, we are researching and developing a next-generation high-performance computer architecture together with strategies to improve the power efficiency of exascale supercomputer systems. Currently, we are mainly focusing on non-von Neumann architectures such as systolic arrays and neuromorphic computers based on the latest advances in device technologies, architectures that can integrate next generation non-volatile memories and/or various types of accelerators into a general-purpose processor, the advancement of scientific simulations by accelerating machine learning computations, and hybrid computing architectures that combine the benefits of quantum computing and classical computing. And we are performing detailed co-design (coordinated design of hardware and software) evaluations of the computer architectures noted above as well as the co-design evaluations of algorithms that take advantage of them on the supercomputers K and Fugaku.
Development of power management strategies for next-generation high performance computing systems
Power consumption is a prerequisite design constraint for developing exascale or next-generation computer systems. In order to maximize effective performance within a given power constraint, we need a new system-design concept in which the system’s peak power is allowed to exceed maximum power provisioning using adaptively controlling power knobs incorporated in hardware components so that effective power consumption is maintained below the power constraint. In such systems, it is indispensable to allocate the power budget adaptively among various hardware component such as processors, memories, and interconnects, or among co-scheduled jobs, instead of fully utilizing all available hardware resources.
To help resolve this challenge, we have devised a software framework for code optimization and system power management. For example, we have developed a variation-aware power budgeting scheme to maximize effective application performance, and in tests it produced a 5.4X speed increase compared to a variation-unaware power allocation scheme. Based on the Slurm workload manager, we have also developed a power-aware resource manager and a job scheduler to control power allocation among co-scheduled jobs. We have tested this framework on a large-scale HPC cluster system with about 1000 compute-nodes and showed that it can successfully manage the system’s power consumption below a given power constraint.
Power-aware resource manager based on the Slurm workload manager