Improving the Performance and Productive Programming Environment for Exascale and Beyond
We research and develop parallel programming models and a new programming language to exploit the full potential of large-scale parallel systems and to increase the productivity of parallel programming. The language, called XcalableMP (XMP), is based on the PGAS (Partitioned Global Address Space) model and was originally designed by the Japanese HPC language research community. We are developing a reference XMP compiler, the Omni XMP compiler, which we have deployed for users on several systems including the K computer, and we have conducted performance studies and optimization of the PGAS language. We have also developed an extension for accelerator clusters, looking beyond the K computer. Currently, we are developing XcalableMP for the supercomputer “Fugaku”, together with research on functional extensions that make efficient use of the many cores in the “Fugaku” processor.
Towards a high-performance and highly productive programming model for exascale computing, we are working on a new version of XcalableMP, XcalableMP 2.0. Because exploiting the performance of a large-scale many-core system such as “Fugaku” will be an important issue, we are proposing a programming model that integrates task parallelism with RDMA operations in the PGAS model. It can improve performance by removing time-consuming global synchronization and by enabling computation and communication to overlap, and it is expected to improve the performance of several programs that use the many-core processor of the “Fugaku” supercomputer. Our team is also researching a programming model for FPGAs (field-programmable gate arrays) as a future accelerator device.
Development of XcalableACC, a high-performance and highly productive parallel language for parallel systems with accelerators
Although parallel systems with accelerators such as GPUs have come to be widely used, programming such systems is often said to be difficult and time-consuming. We have developed a new programming language called XcalableACC, which integrates OpenACC, used for offloading operations to GPUs, with XcalableMP, developed by our team. XcalableMP is used to describe the data distribution and the work assignment for each node, and OpenACC is used to describe offloading to GPUs. The XcalableACC extension enables not only globally distributed data to be offloaded to GPUs, but also communication between GPUs. In our example program, a solver for the Laplace equation is parallelized just by adding XcalableMP directives (shown in red) and OpenACC directives (shown in blue) to the sequential code.
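The directive-based approach described above can be sketched roughly as follows. This is a minimal illustration written for this page, not the team's original example program: the grid size, the function names set_boundary and laplace_jacobi, and the one-dimensional block distribution are assumptions, and the exact directive spellings should be checked against the XcalableMP and XcalableACC specifications.

```c
#include <assert.h>

#define XSIZE 16   /* hypothetical grid size, chosen only for illustration */
#define YSIZE 16

/* u holds the current solution, uu the previous Jacobi iterate. */
double u[XSIZE][YSIZE];
double uu[XSIZE][YSIZE];

/* XcalableMP (assumed spelling): block-distribute the rows of both
   arrays over all nodes and give uu a one-row halo on each side. */
#pragma xmp nodes p(*)
#pragma xmp template t(0:XSIZE-1)
#pragma xmp distribute t(block) onto p
#pragma xmp align u[i][*] with t(i)
#pragma xmp align uu[i][*] with t(i)
#pragma xmp shadow uu[1:1][0]

/* Set every edge cell of u to `value` and every interior cell to 0. */
void set_boundary(double value)
{
#pragma xmp loop on t(x)
    for (int x = 0; x < XSIZE; x++) {
        for (int y = 0; y < YSIZE; y++) {
            int on_edge = (x == 0 || x == XSIZE - 1 ||
                           y == 0 || y == YSIZE - 1);
            u[x][y] = on_edge ? value : 0.0;
        }
    }
}

/* Run max_iter Jacobi sweeps of the 5-point Laplace stencil on u. */
void laplace_jacobi(int max_iter)
{
    assert(max_iter >= 0);

    /* OpenACC: keep both grids on the GPU for the whole iteration. */
#pragma acc data copy(u) create(uu)
    {
        for (int k = 0; k < max_iter; k++) {
            /* Save the current iterate, including boundary cells. */
#pragma xmp loop on t(x)
#pragma acc parallel loop collapse(2)
            for (int x = 0; x < XSIZE; x++)
                for (int y = 0; y < YSIZE; y++)
                    uu[x][y] = u[x][y];

            /* XcalableACC (assumed spelling): exchange halo rows of uu
               between neighbouring nodes, operating on GPU memory. */
#pragma xmp reflect (uu) acc

            /* Each interior cell becomes the mean of its 4 neighbours. */
#pragma xmp loop on t(x)
#pragma acc parallel loop collapse(2)
            for (int x = 1; x < XSIZE - 1; x++)
                for (int y = 1; y < YSIZE - 1; y++)
                    u[x][y] = (uu[x - 1][y] + uu[x + 1][y] +
                               uu[x][y - 1] + uu[x][y + 1]) / 4.0;
        }
    }
}
```

A standard C compiler typically ignores the unknown pragmas, so the same source runs as the sequential solver; for example, calling set_boundary(1.0) followed by laplace_jacobi(2000) drives the interior toward the boundary value. With an XcalableACC-capable compiler, the directives would be expected to distribute the rows across nodes and offload the loops to GPUs without changes to the loop bodies.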
Our case study using a QCD application indicates high productivity: the sequential version of the program can be parallelized by adding just 160 lines of XcalableACC code to the 824 lines of the original C code, whereas the MPI and CUDA version requires 832 lines of additions and modifications.
The XcalableACC code achieved performance comparable to that of the MPI and CUDA version, reaching 95% to 99% of it.
Professor (Cooperative Graduate School Program) and Professor Emeritus, University of Tsukuba (-present)
Team Leader, Architecture Development Team of Flagship 2020 Project, AICS, RIKEN (-present)
Team Leader, Programming Environment Research Team, AICS, RIKEN (-present)
Director, Center for Computational Sciences, University of Tsukuba
Professor, Graduate School of Systems and Information Engineering, University of Tsukuba (-present)
Chief, Parallel and Distributed System Performance Laboratory, Real World Computing Partnership, Japan
Senior Researcher, ElectroTechnical Laboratory (-1996)
M.S. and Ph.D. in Information Science, The University of Tokyo (-1990)