DAY-1 : Feb 17, 2020

Time Program
Opening (welcome remarks)
Message from MEXT
Keynote 1
  • Satoshi Matsuoka, Director, R-CCS, RIKEN
Invited Talk 1
  • Evelyne Foerster, CEA
Session 1
Achievements with K computer & prospects for Fugaku (R-CCS team) 1
Invited Talk 2
  • Norman Christ, Columbia University
Session 2
Achievements with K computer & prospects for Fugaku (R-CCS team) 2
Session 3
Achievements with K computer & prospects for Fugaku 3
Invited Talk 3
  • Takaki Hatsui, SPring-8, RIKEN
Poster Session / Fugaku Tour @RIKEN R-CCS building 1st floor "List of Accepted Posters"
Reception @RIKEN R-CCS building 1st floor

DAY-2 : Feb 18, 2020

Time Program
Keynote 2
  • Jeffrey Vetter, Oak Ridge National Laboratory
Session 4 : ARM eco system
Lunch & Group photo
Session 5 : Future co-design
Panel Session : Future co-design


Keynote 1 (DAY-1 : Feb 17 9:20 - 10:10)

Session Chair: Kentaro Sano, R-CCS, RIKEN

Invited Talk 1 (DAY-1 : Feb 17 10:30 - 11:00)

Session Chair: Satoru Oishi, R-CCS, RIKEN

  • Evelyne Foerster, CEA
    "Modelling strategies for Nuclear Probabilistic Safety Assessment in case of natural external events"
Session 1 (DAY-1 : Feb 17 11:00 - 12:15)

Session Chair: Yasumichi Aoki, R-CCS, RIKEN

Achievements with K computer & prospects for Fugaku (R-CCS team) 1
Talk 15min x 5

  • Satoru Oishi, R-CCS
    "Data Processing for Digital Ensemble of Cities to Simulate Catastrophic Disaster"
  • Takemasa Miyoshi, R-CCS
    "Big Data Assimilation in Weather Prediction: From K to Fugaku"
  • Hirofumi Tomita, R-CCS
    "Mathematical climate studies from K to Fugaku"
  • Yohsuke Murase, R-CCS
    "Finding successful strategies for social dilemma using the K computer"
  • Makoto Tsubokura, R-CCS
    "Application of HPC-CFD for industrial applications on the K computer and toward Fugaku"
Invited Talk 2 (DAY-1 : Feb 17 13:30 - 14:00)

Session Chair: Yasumichi Aoki, R-CCS, RIKEN

Session 2 (DAY-1 : Feb 17 14:00 - 15:00)

Session Chair: Hirofumi Tomita, R-CCS, RIKEN

Achievements with K computer & prospects for Fugaku (R-CCS team) 2
Talk 15min x 4

  • Junichiro Makino, R-CCS (15min)
    "Current Status of FDPS and Future plan"
  • Yasumichi Aoki, R-CCS (15min)
    "Field Theory Simulation towards Fugaku"
  • Shigetoshi Sota, R-CCS (15min)
    "Quantum dynamics simulation towards Fugaku"
  • Takahito Nakajima, R-CCS (15min)
    "Molecular simulation on the K computer and towards Fugaku"
Session 3 (DAY-1 : Feb 17 15:20 - 16:15)

Session Chair: Takahito Nakajima, R-CCS, RIKEN

Achievements with K computer & prospects for Fugaku 3

  • Jaewoon Jung, R-CCS (15min)
    "Acceleration of large-scale MD simulations for biological functions"
  • Osamu Miyashita, R-CCS (15min)
    "Hybrid Approach for Biomolecular Structure Modeling"
  • Fumiyoshi Shoji, R-CCS (25min)
    "The concept of user services on Fugaku"
Invited Talk 3 (DAY-1 : Feb 17 16:15 - 16:45)

Session Chair: Kentaro Sano, R-CCS, RIKEN

Poster Session / Fugaku Tour @RIKEN R-CCS building 1st floor (DAY-1 : Feb 17 17:00 - 18:30)

Session Chair: Kentaro Sano, R-CCS, RIKEN

  • Poster Session: The Poster Session and Reception are held on the 1st floor of the R-CCS building (see List of Accepted Posters)
  • Fugaku Tour: Please do NOT take photos
Keynote 2 (DAY-2 : Feb 18 9:00 - 9:50)

Session Chair: Mitsuhisa Sato

Session 4 : ARM eco system (DAY-2 : Feb 18 10:10 - 12:10)

Session Chair: Toshiyuki Imamura

Session 5 : Future co-design (DAY-2 : Feb 18 13:30 - 15:30)

Session Chair: Kentaro Sano

Panel Session : Future co-design (DAY-2 : Feb 18 15:50 - 17:20)
Moderator : Masaaki Kondo, R-CCS, RIKEN
Panelists : Ahmed Hemani (KTH), Toshiyuki Shimizu (Fujitsu), Andy Hock (Cerebras Systems), Jean-Marc Denis (European Processor Initiative), Kentaro Sano (RIKEN)

List of Accepted Posters

1. Takumi Honda, Shu-Chih Yang and Takemasa Miyoshi.
"Assimilating all-sky Himawari-8 radiances in the heavy rainfall event on 23 August 2018 in Taiwan"
In August 2018, a tropical depression stayed near Taiwan and induced heavy precipitation over the southwest region of Taiwan. The detailed evolution of this depression was well captured by the Himawari-8 geostationary satellite of the Japan Meteorological Agency. Previous studies have investigated the impact of all-sky assimilation of Himawari-8 radiance observations on predicting the rapid intensification of a typhoon over the ocean and typhoon-associated precipitation. However, it is an open question whether Himawari-8 observations can improve the prediction of a tropical depression and its associated precipitation in a warm and moist environment. This study explores the impact of assimilating all-sky Himawari-8 observations on the analyses and forecasts of the heavy rainfall event on 23 August 2018 in Taiwan. This collaborative research is based on the MOU between R-CCS and the National Central University, Taiwan.
2. Takehiro Azuma, Konstantinos N. Anagnostopoulos, Toshihiro Aoki, Mitsuaki Hirasawa, Yuta Ito, Jun Nishimura, Stratos Kovalkov Papadoudis and Asato Tsuchiya.
"Complex Langevin studies of the continuum limit of the Lorentzian type IIB matrix model"
The type IIB matrix model, also known as the IKKT model, is a promising candidate for the nonperturbative formulation of string theory. Recently, its Lorentzian version, in which the indices are contracted with the Lorentzian metric, has been studied numerically. This model has a sign problem stemming from the factor "e^{iS}" in the partition function (where S is the action). Its complex Langevin analysis has shown the emergence of a (3+1)D expanding spacetime, different from a Pauli-matrix structure. In this work, we attempt to reveal the continuum limit of the Lorentzian type IIB matrix model by studying its scaling behavior.
3. Hiroshi Murakami.
"Filter Diagonalization with Iterative Refinement"

For a real symmetric-definite generalized eigenproblem A v = \lambda B v, we compute approximate eigenpairs whose eigenvalues lie within a specified real interval by using a filter.

The filter is built from resolvents. The resolvent with a complex shift \rho is defined as R(\rho) = (A-\rho B)^{-1}B. For a vector x, applying the resolvent, y := R(\rho)x, amounts to solving the system of linear equations C(\rho) y = B x for y, where C(\rho) = A-\rho B is the shifted matrix.

When \rho is real, the shifted matrix is real symmetric. When \rho is real and less than the minimum eigenvalue, it is real symmetric positive-definite (this case is used to solve for eigenpairs with lower-end eigenvalues).

When \rho is imaginary, the shifted matrix is non-singular complex symmetric (this case is used to solve for eigenpairs with interior eigenvalues). In the present study, we assume that the system of linear equations corresponding to a resolvent is solved by a direct method. When both A and B are narrow banded, so is C(\rho), and we use the banded modified Cholesky method to solve the symmetric system.

We use a filter consisting of only a single resolvent to reduce both the amount of arithmetic for matrix factorizations and, especially, the storage needed to hold the matrix factors. The filter is F = g_s T_n( 2\gamma R(\rho)-I ) when \rho is real, and F = g_s T_n( 2\gamma Im R(\rho)-I ) when \rho is imaginary. Here, \gamma is a real constant, g_s is an upper bound of the filter's transmissivity in the stop-bands, T_n(z) is the n-th degree Chebyshev polynomial, I denotes the identity operator, and Im is the operator that takes the imaginary part.

Unfortunately, the filter's transfer function does not have a good shape when only a single resolvent is used. Such a filter does not strongly attenuate the eigenvectors to be removed, and its transmissivities differ by orders of magnitude among the eigenvectors to be solved. When the transfer function has a poor shape, the approximate eigenpairs extracted from the set of vectors obtained by B-orthonormalization and filtering of a set of random vectors are not accurate.

However, even when the shape is poor, our experiments show that a few applications of the combination of B-orthonormalization and filtering to a set of vectors, starting from a set of random vectors, improve the approximate invariant subspace spanned by the set so that it gives approximate eigenpairs.
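
The real-shift case can be illustrated with a minimal Python sketch. It is not the author's implementation: the toy matrices, the values of \rho, \gamma and n, the block size, and all function names are assumptions chosen by hand, a dense Cholesky factorization from SciPy stands in for the banded modified Cholesky method, and B-orthonormalization is done by a QR factorization followed by a Rayleigh-Ritz step.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, eigh


def make_filter(A, B, rho, gamma, n, g_s=1.0):
    """Return X -> g_s * T_n(2*gamma*R(rho) - I) X for a real shift rho."""
    factor = cho_factor(A - rho * B)  # C(rho) is SPD when rho < min eigenvalue

    def z(X):
        # (2*gamma*R(rho) - I) X, where R(rho) X solves C(rho) Y = B X
        return 2.0 * gamma * cho_solve(factor, B @ X) - X

    def apply(X):
        t_prev, t_cur = X, z(X)  # T_0 X and T_1 X
        for _ in range(n - 1):   # recurrence T_{k+1} = 2 z T_k - T_{k-1}
            t_prev, t_cur = t_cur, 2.0 * z(t_cur) - t_prev
        return g_s * t_cur

    return apply


def rayleigh_ritz(A, B, X):
    """Ritz values and B-orthonormal Ritz vectors of the span of X."""
    Q, _ = np.linalg.qr(X)
    values, W = eigh(Q.T @ A @ Q, Q.T @ B @ Q)
    return values, Q @ W


def filtered_eigenpairs(A, B, rho, gamma, n=8, block=8, sweeps=3, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((A.shape[0], block))  # random starting vectors
    apply_filter = make_filter(A, B, rho, gamma, n)
    for _ in range(sweeps):
        _, X = rayleigh_ritz(A, B, X)  # B-orthonormalize the current set
        X = apply_filter(X)            # then filter it
    return rayleigh_ritz(A, B, X)


# Toy banded pencil (assumption): 1D stiffness matrix A and mass matrix B.
N = 50
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
B = (4.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)) / 6.0
values, vectors = filtered_eigenpairs(A, B, rho=-0.05, gamma=0.15)
```

With these hand-picked parameters, eigenvalues \lambda with \lambda - \rho >= \gamma fall where |T_n| <= 1, while smaller eigenvalues are amplified by very different factors, which is why the sketch repeats the orthonormalize-and-filter step a few times before extracting the lowest Ritz pairs.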

4. Koji Terasaki and Takemasa Miyoshi.
"Towards big data assimilation in Fugaku by accounting for the horizontal observation error correlation of satellite observations"

Recent developments in sensing technology have increased the number of observations in both space and time. It is essential to effectively utilize the information from observations for numerical weather prediction (NWP). Observation errors are usually correlated among data measured with a single instrument, such as satellite radiances. Error-correlated observations are usually thinned to avoid the degradation caused by neglecting the horizontal observation error correlation in data assimilation. As a result, an enormous amount of observation data is discarded. It is important to fully utilize those observations by explicitly accounting for the horizontal observation error correlations in data assimilation. This study explores explicitly accounting for the horizontal observation error correlation of Advanced Microwave Sounding Unit-A (AMSU-A) radiances, one of the highest-impact data sources in NWP.

In this study, we estimated the horizontal observation error correlation of AMSU-A radiances using innovation statistics (Desroziers, 2005). We performed data assimilation experiments using the global atmospheric data assimilation system known as the NICAM-LETKF (Terasaki et al. 2015, Terasaki and Miyoshi, 2017). The results show that the analyses and forecasts are improved by accounting for the horizontal error correlations.

The computational cost was also examined. The cost of inverting the observation error covariance matrix increases when non-zero off-diagonal terms are included. In this study, we assumed uncorrelated errors between different instruments and observation variables. Therefore, the observation error covariance matrix becomes block diagonal. We successfully reduced the computational cost by inverting the small diagonal blocks separately. In this study, we still thinned the AMSU-A radiances because the horizontal resolution of the NICAM was low. In the Fugaku era, assimilating AMSU-A radiances without any thinning at increased model resolution will be the next challenge.
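
The saving from the block-diagonal structure can be sketched in a few lines of Python. This is only an illustration under assumptions: the block sizes, the exponentially decaying correlation model, and the function names are hypothetical and are not taken from the NICAM-LETKF.

```python
import numpy as np


def correlated_block(n_obs, sigma, length):
    """Covariance block with exponentially decaying correlation (assumed model)."""
    idx = np.arange(n_obs)
    corr = np.exp(-np.abs(idx[:, None] - idx[None, :]) / length)
    return sigma**2 * corr


def solve_block_diagonal(blocks, d):
    """Return R^{-1} d for R = diag(blocks) without ever forming R."""
    out = np.empty_like(d, dtype=float)
    start = 0
    for block in blocks:
        stop = start + block.shape[0]
        # Each instrument/variable block is solved independently.
        out[start:stop] = np.linalg.solve(block, d[start:stop])
        start = stop
    return out


# Two correlated instruments plus one with uncorrelated errors (toy sizes).
blocks = [
    correlated_block(40, 0.3, 5.0),
    correlated_block(25, 0.5, 3.0),
    np.diag(np.full(10, 1.0)),
]
rng = np.random.default_rng(0)
d = rng.standard_normal(75)  # stand-in for an innovation vector
r_inv_d = solve_block_diagonal(blocks, d)
```

For dense blocks of sizes n_1, ..., n_m, the blockwise solve costs on the order of the sum of n_i^3 operations instead of (n_1 + ... + n_m)^3 for the full matrix.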

5. Masahiro Nakao, Maaki Sakai, Yoshiko Hanada, Hitoshi Murai and Mitsuhisa Sato.
"Construction Method for Low-Latency Interconnection Network using Simulated Annealing"

Due to the increase in the scale of parallel computer systems such as supercomputers and data centers, the communication latency of the interconnection network connecting the computing nodes strongly affects the performance of parallel applications. Several studies have applied the small-world property of random topologies to various networks in industrial products. When a random topology is adopted as the network topology of a parallel computer system, the performance of parallel applications improves because the diameter and the average number of hops (ASPL: Average Shortest Path Length) of the interconnection network are smaller than those of a regular topology such as the conventional k-ary n-cube. However, in most cases, the diameter and ASPL of a random topology are larger than their theoretical lower bounds, so it is crucial to design a graph with a smaller diameter and ASPL.

Against this background, the Order/Degree problem has been proposed as one of the problems in graph theory. The Order/Degree problem is to find a graph with the minimum diameter and ASPL from a set of graphs that satisfy a given number of vertices (Order) and degree (Degree). The network topology can be represented as a graph by considering the computing nodes of the system as “vertices” and the cables of the network as “edges.” The graphs discovered from the Order/Degree problem can be used for various industrial products.

We propose a novel method to generate a topology that has both randomness and regularity. Since this method is based on Simulated Annealing (SA), which is a general-purpose optimization method, an efficient solution search that does not easily fall into local optima can be performed. Furthermore, by making the topology symmetrical, it is possible to enhance the solution search capability of SA and to reduce the amount of calculation time significantly for finding the diameter and ASPL. As a result of evaluation using test cases, it was shown that our method could generate a graph whose diameter and ASPL are significantly smaller than those of a random graph.
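
As a rough illustration of the Order/Degree search, the following Python sketch anneals a small regular graph with degree-preserving edge swaps and scores it by ASPL. It is a generic simulated annealing loop under stated assumptions, not the proposed method: it does not exploit symmetry, and the ring-shaped starting graph, the cooling schedule, and all names and parameter values are hypothetical.

```python
import math
import random
from collections import deque


def diameter_and_aspl(adj):
    """All-pairs BFS; returns (inf, inf) if the graph is disconnected."""
    n = len(adj)
    diameter, total = 0, 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:
            return math.inf, math.inf
        diameter = max(diameter, max(dist))
        total += sum(dist)
    return diameter, total / (n * (n - 1))


def swap(adj, a, b, c, e):
    """Replace edges (a,b),(c,e) by (a,e),(c,b); vertex degrees are unchanged."""
    adj[a].remove(b); adj[b].remove(a)
    adj[c].remove(e); adj[e].remove(c)
    adj[a].add(e); adj[e].add(a)
    adj[c].add(b); adj[b].add(c)


def anneal(n, degree, steps=400, t0=0.05, seed=1):
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for i in range(n):  # regular start: ring with links to i+1 .. i+degree/2
        for k in range(1, degree // 2 + 1):
            adj[i].add((i + k) % n)
            adj[(i + k) % n].add(i)
    start = cur = best = diameter_and_aspl(adj)
    best_adj = [set(s) for s in adj]
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling (assumption)
        a, c = rng.randrange(n), rng.randrange(n)
        b, e = rng.choice(sorted(adj[a])), rng.choice(sorted(adj[c]))
        if len({a, b, c, e}) < 4 or e in adj[a] or b in adj[c]:
            continue  # swap would create a self-loop or a duplicate edge
        swap(adj, a, b, c, e)
        cand = diameter_and_aspl(adj)
        delta = cand[1] - cur[1]
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur = cand
            if cur[1] < best[1]:
                best, best_adj = cur, [set(s) for s in adj]
        else:
            swap(adj, a, e, c, b)  # reject: undo the swap
    return start, best, best_adj


start, best, graph = anneal(32, 4)  # Order 32, Degree 4
```

Recomputing all-pairs BFS after every swap is the dominant cost in this sketch, which is the part the abstract says can be reduced by making the topology symmetrical.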

6. Yuhsiang Tsai, Terry Cojean and Hartwig Anzt.
"Ginkgo's SpMV on NVIDIA and AMD GPU architectures"

Efficiently processing matrices on manycore architectures is an essential and critical operation in many scientific applications such as iterative linear solvers, fluid flow simulations, or Google's PageRank algorithm. We have designed COO, CSR, ELL, and HYBRID SpMV kernels that give attractive performance across a wide range of matrices and across different GPU architectures from AMD and NVIDIA. In this evaluation, we focus not only on runtime performance but also on metrics like performance-per-$. Ginkgo is a next-generation sparse linear algebra library able to run on multi- and manycore architectures. Ginkgo is an open source library licensed under BSD 3-clause and ships with the latest version of the xSDK package (v.0.5.0).

In this poster, we show the improvement of four SpMV strategies (COO, CSR, ELL, and HYBRID) and present their performance on AMD and NVIDIA GPUs. Furthermore, we evaluate each format for 2,800 test matrices on high-end GPUs from AMD and NVIDIA. We integrated all introduced SpMV kernels into the Ginkgo open-source library, a modern C++ library designed for the iterative solution of sparse linear systems. We demonstrate that these kernels often outperform their corresponding kernels in the AMD hipSPARSE and the NVIDIA cuSPARSE vendor libraries.

The following list shows our primary contributions:

  1. We develop new SpMV kernels for COO, CSR, ELL, and HYBRID that are optimized for AMD and NVIDIA GPUs and outperform existing implementations. In particular, we enable performance portability through algorithmic improvements and tuning parameters.
  2. We evaluate the performance of the new kernels against SpMV kernels available in AMD's hipSPARSE library and NVIDIA's cuSPARSE library on all test matrices from the Suite Sparse Matrix Collection. We use performance plots and relative performance to compare Ginkgo's kernels with the corresponding vendor kernels. We use performance profiles to analyze how well the distinct kernels generalize. Moreover, we compare the SpMV performance of high-end GPUs from AMD and NVIDIA.
  3. To our knowledge, Ginkgo is the first open-source sparse linear algebra library based on C++ that features multiple SpMV kernels suitable for irregular matrices with backends for both AMD's and NVIDIA's GPUs.
  4. We make all kernels publicly available as part of the Ginkgo library and archive the performance results in a public repository to ensure full result reproducibility.
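
For readers unfamiliar with the storage formats named above, the following NumPy sketch shows what COO, CSR, and ELL store and how a sparse matrix-vector product reads them. It is a reference illustration only: it does not use Ginkgo's API or reflect its GPU kernels, and every function name here is hypothetical. A HYBRID format would typically keep the regular part of each row in ELL and the overflow in COO.

```python
import numpy as np


def to_coo(dense):
    """COO: one (row, col, value) triple per nonzero, in row-major order."""
    rows, cols = np.nonzero(dense)
    return rows, cols, dense[rows, cols]


def spmv_coo(rows, cols, vals, x, n_rows):
    y = np.zeros(n_rows)
    np.add.at(y, rows, vals * x[cols])  # scatter-add each nonzero's product
    return y


def to_csr(dense):
    """CSR: row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i."""
    rows, cols, vals = to_coo(dense)
    counts = np.bincount(rows, minlength=dense.shape[0])
    row_ptr = np.concatenate(([0], np.cumsum(counts)))
    return row_ptr, cols, vals


def spmv_csr(row_ptr, cols, vals, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y


def to_ell(dense):
    """ELL: every row padded to the maximum number of nonzeros per row."""
    n_rows = dense.shape[0]
    width = int((dense != 0).sum(axis=1).max())
    ell_cols = np.zeros((n_rows, width), dtype=int)
    ell_vals = np.zeros((n_rows, width))
    for i in range(n_rows):
        c = np.nonzero(dense[i])[0]
        ell_cols[i, :len(c)] = c
        ell_vals[i, :len(c)] = dense[i, c]
    return ell_cols, ell_vals


def spmv_ell(ell_cols, ell_vals, x):
    # Padded entries have value 0, so they contribute nothing.
    return (ell_vals * x[ell_cols]).sum(axis=1)


dense = np.array([[4.0, 0.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [2.0, 3.0, 0.0, 5.0],
                  [0.0, 0.0, 6.0, 0.0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
y_coo = spmv_coo(*to_coo(dense), x, dense.shape[0])
y_csr = spmv_csr(*to_csr(dense), x)
y_ell = spmv_ell(*to_ell(dense), x)
```

The padding in ELL gives regular memory access at the price of wasted storage on irregular matrices, which is one reason a library may offer several formats.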
7. Yoshifumi Nakamura.
"Nature of the finite temperature phase transition for three flavor QCD"

We present our recent studies on the nature of the finite temperature phase transition for three flavor QCD by using lattice QCD.

The nature of the finite temperature phase transition of 2+1 flavor QCD (Nf=2+1) at zero chemical potential depends on quark masses. The order of transition and universality class are summarized in the plane of light quark mass, ml, and strange quark mass, ms, which is called the Columbia plot. A first order phase transition is expected in the small quark mass region. Many lattice QCD studies have shown that the phase transition is also of first order in the heavy quark mass region while it is crossover in the medium quark mass region. The boundary between the first order and crossover regions is a second order phase transition of Z2 universality class.

The nature of the transition in the lower-left corner of the Columbia plot has not been fully understood yet. The first lattice QCD calculation was done by using standard Wilson fermions at temporal lattice size Nt=4 roughly 20 years ago. It reported that the critical mass at the critical endpoint (CEP) for three flavor QCD (Nf=3) is heavy: the critical quark mass, mq, is heavier than 140 MeV or, equivalently, the critical pseudoscalar mass is heavier than 1 GeV. Karsch et al. reported preliminary values for the critical pseudoscalar mass, mPS, of about 290 MeV with unimproved gauge and staggered fermion actions and about 190 MeV with improved gauge and p4 staggered fermion actions. These results were obtained by using the R-algorithm. Afterward, the results were updated to mPS=290(20) MeV with unimproved gauge and staggered fermion actions and mPS=67(17) MeV with improved gauge and p4 staggered fermion actions. Then, de Forcrand and Philipsen obtained amq=0.0260(5) by using the RHMC algorithm, which is about 25% smaller than the value of about 0.033 quoted by works using the R-algorithm. They also performed Nf = 2+1 simulations and obtained the critical line and tri-critical point, where the lattice spacing, a, was approximately 0.3 fm. A preliminary study with standard Wilson gauge and staggered fermions reported that the bare critical mass amq is about 0.035 at Nt=4. It was also reported that the ratio of mPS to the CEP temperature decreased from 1.680(4) to 0.954(12) when Nt was increased from 4 to 6 with unimproved staggered fermions.

These results show very large cutoff effects on the critical mass, so it is important to increase Nt and use improved actions. Further studies with improved staggered fermions have not found the first order phase transition and quoted only a bound on the critical mass, mPS < 50 MeV. Therefore, the positions of the critical end line (CEL) and CEP for Nf=3 are still particularly important problems to be solved at this moment.

Recently, we have also investigated the nature of the finite temperature phase transition in the small quark mass region by using nonperturbatively O(a)-improved Wilson-clover fermions. We have determined the CEP at Nt=4, 6, 8, and 10 as well as an upper bound of the CEP in the continuum limit for Nf=3. For Nf=2+1, we have studied Nt=6 and determined the CEL around the SU(3) flavor symmetric point. Then, we confirmed that the slope of the CEL at the SU(3) flavor symmetric point is -2.

In this poster, we present our extended study for both CEP at the SU(3) flavor symmetric point and CEL away from the SU(3) flavor symmetric point. We employ the renormalization-group improved Iwasaki gauge action and nonperturbatively O(a)-improved Wilson-clover fermion action. The critical endpoints are determined by using the intersection point of kurtosis, employing the multi-parameter, multi-ensemble reweighting method. We present results for the critical end line at Nt=6 and the continuum extrapolation for the critical endpoint of the SU(3)-flavor symmetric point.

8. Yoshifumi Nakamura.
"QCD Wide SIMD Library (QWS) for Fugaku"

We present a co-design activity of the Flagship 2020 project for QCD Wide SIMD Library (QWS) for Fugaku.

“The Flagship 2020 Project (Supercomputer Fugaku) is carrying out research and development for future supercomputing. Its initial task is the development of the supercomputer to succeed the K computer, with the goal of starting public service around 2021. It is taking on a number of challenges from four aspects, system software, architecture, applications, and co-design with the aim to create a system that can properly respond to the future needs of science and technology. (abstracted from project web page)”. The CPU of Fugaku is the Fujitsu A64FX processor, which supports the Arm SVE instruction set and has 48 computing cores with 512-bit wide SIMD. The main memory is HBM2 with a peak bandwidth of 1024 GB/s. The interconnect is TofuD, with 28 Gbps x 2 lanes x 10 ports.

Lattice quantum chromodynamics (LQCD) is the discretized theory of QCD, which is the theory of the strong interaction, one of the four fundamental forces in nature. To get good performance for LQCD, arithmetic, memory, and network performance must all be balanced at a high level.

QCD Wide SIMD Library (QWS), developed in the FS2020 project, is an LQCD simulation kernel library for wide SIMD widths, optimized especially to obtain high performance on the supercomputer Fugaku. It is written in C and C++, mostly in C. It supports the clover Wilson Dirac operator and the domain wall Dirac operator, the even-odd preconditioned Dirac matrix, Schwarz Alternating Procedure (SAP) domain decomposition for the full Dirac matrix, double, float, and half precisions, and the conjugate gradient (CG), shifted CG, and BiCGstab solvers. QWS will be free software under a BSD-like license and will appear at

In this poster, we briefly introduce lattice QCD and present the current tuning status of QWS, especially single precision BiCGStab for the clover Dirac operator with SAP preconditioning. The target problem size is a 192x192x192x192 lattice. For this lattice size, the whole evaluation region fits in L2 cache when the full system is used, so the main-memory bandwidth bottleneck is avoided in this setup. The data layout of the QCD field variables is Array of Structures of Arrays (AoSoA) to fit the SIMD register width. Kernel parts are tuned by using the Arm C Language Extensions (ACLE). For OpenMP tuning, the parallel region is expanded because creating an omp parallel region is costly, so “omp parallel” is placed in higher-level caller routines. All data arrays are prefetched explicitly by software prefetching, every 256 B, using __builtin_prefetch. Instructions for the clover multiplication are scheduled by hand. The four-dimensional QCD MPI processes are perfectly mapped onto TofuD by using a process mapping search tool for LQCD. The tool calculates the stream for all possible rank maps and finds the best process mapping. Tofu Network Interface (TNI) load balancing is also considered to minimize the link stream.
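
The AoSoA idea can be sketched with NumPy index arithmetic. This is a hypothetical layout for illustration and is not taken from the QWS source: it assumes 16 single precision lanes per 512-bit SIMD register and 24 real numbers per site (4 spin x 3 color x real/imaginary for a Wilson-type spinor), and the function names are invented here.

```python
import numpy as np

SIMD_LANES = 16   # 512-bit register / 32-bit float
COMPONENTS = 24   # 4 spin x 3 color x (re, im), assumed spinor field


def aos_to_aosoa(field):
    """(sites, COMPONENTS) -> (sites // SIMD_LANES, COMPONENTS, SIMD_LANES)."""
    sites = field.shape[0]
    assert sites % SIMD_LANES == 0
    blocks = field.reshape(sites // SIMD_LANES, SIMD_LANES, COMPONENTS)
    # Make the lane index the fastest-varying (contiguous) one.
    return np.ascontiguousarray(blocks.transpose(0, 2, 1))


def aosoa_index(site, component):
    """Map a (site, component) pair to its (block, component, lane) position."""
    return site // SIMD_LANES, component, site % SIMD_LANES


rng = np.random.default_rng(0)
field = rng.standard_normal((64, COMPONENTS)).astype(np.float32)
packed = aos_to_aosoa(field)
```

In this layout, the same component of 16 consecutive sites sits in 64 contiguous bytes, so one vector load fills a full register without gathers.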

Note, the software used for the evaluation, such as the compiler, is still under development and its performance, which is obtained by “performance estimation tool” and even actual execution on a prototype machine, may be different when the supercomputer Fugaku starts its operation.

9. Masamitsu Nakayama, Shinichi Goto, Kengo Ayabe, Hiroto Yabushita and Shinya Goto.
"The Impact of Thrombin Binding to Platelet Glycoprotein Ibα and von Willebrand Factor using Molecular Dynamics Simulations"
【Background】The binding of glycoprotein Ibα (GPIbα) on the platelet membrane to von Willebrand factor (VWF) exclusively mediates the initial platelet adhesion to damaged vessel walls under blood flow conditions. Thrombin, an important component of the coagulation cascade that catalyzes the conversion of fibrinogen to fibrin, is known to bind to GPIbα. However, the impact of thrombin binding to GPIbα on the GPIbα-VWF bond is unknown.
【Aims】To evaluate the impact of the presence and absence of thrombin binding to GPIbα on the interaction of GPIbα with VWF using molecular dynamics simulation.
【Methods】The initial structure of GPIbα containing the N-terminal leucine rich repeat (residues HSE1-PRO265) bound to the A1 domain of VWF (residues ASP506-PRO703) and thrombin (residues THR1-GLU247) was made by combining two structures. One was the crystal structure of GPIbα bound to thrombin, and the other was the stable structure of GPIbα bound to VWF, which was calculated by MD simulation in a previous study. The whole GPIbα-thrombin structure was added to the GPIbα-VWF structure by aligning the position of GPIbα, and the GPIbα of the original GPIbα-VWF structure was then deleted. Molecular dynamics simulation for all atoms and water molecules was done on a computer equipped with Xeon Phi. Nanoscale Molecular Dynamics (NAMD) software with the Chemistry at Harvard Macromolecular Mechanics (CHARMM)-36 force field was used for the calculation. The dynamic binding structure of GPIbα bound to VWF was calculated both in the presence and absence of thrombin binding to GPIbα.
【Results】The root mean square deviation (RMSD) of all protein atoms of GPIbα bound to VWF, starting from the initial structure, stabilized after 144 ns in the presence of thrombin and 48 ns in its absence. The distance between the centers of mass of the GPIbα N-terminal and the VWF A1 domain in the presence and absence of thrombin was 29.9 Å and 39.5 Å, respectively. The relative position of the center of mass of GPIbα to VWF moved approximately 59.1 Å in the presence of thrombin compared to its thrombin-absent counterpart.
【Conclusions】The stable structure of GPIbα bound to VWF is markedly influenced by thrombin binding to GPIbα.
10. Yasumitsu Maejima, Shigenori Otsuka and Takemasa Miyoshi.
"Assimilating every 30-second phased array weather radar data in a torrential rainfall event on July 6, 2018 around Kobe city"
To investigate the impact of every 30-second phased array weather radar (PAWR) observation on a simulation of a severe rainfall event that occurred on July 6, 2018 around Kobe city, we perform 30-second-update 100-m-mesh data assimilation (DA) experiments using the Local Ensemble Transform Kalman Filter (LETKF)[3] with the Scalable Computing for Advanced Library and Environment regional numerical weather prediction model (SCALE), the so-called SCALE-LETKF (Miyoshi et al. 2016)[4], [5]. Two experiments were performed: the test experiment with every 30-second PAWR observation (TEST), and the other without observation (NO-DA). The TEST analysis shows intense rainfall with a detailed structure of active convection, matching the PAWR observation better than the NO-DA analysis. We also perform forecast experiments initialized by the ensemble mean analyses of TEST and NO-DA. The TEST forecasts are more skillful than NO-DA for 30 minutes, although the skill drops rapidly. The results suggest that rapid-update PAWR DA has the potential to improve the numerical simulation of this torrential rainfall event.
11. Yoshinori Kusama, Takaaki Noguchi, Noriyuki Shiobara and Motoi Okuda.
"Overview of User Support Activities by RIST in HPCI in K computer Era in Japan"

The Research Organization for Information Science and Technology (RIST) has been responsible for user selection and resource allocation, user support, and dissemination of achievements in the High Performance Computing Infrastructure (HPCI) of Japan [1], as the Registered Institution for Facilities Use Promotion of the “Specific High-speed Computer Facilities” (K computer) since the start of its operation in 2012 (its shared use operation ended on August 16, 2019) and as the Representative for HPCI Operation since FY2017. Among these responsibilities of RIST, this paper overviews the user support activities by RIST in HPCI in the K computer era.

RIST provides the following user support activities: provision of a variety of information; response to inquiries and requests from HPCI users; technical support; organization of seminars and workshops; provision of application software and other necessary support services. Its key activities are introduced in this abstract.

Expert Support: Experts in a variety of scientific areas and/or in computer science and applications provide high-level technical support as “Expert Support”. The main support items are porting and optimization of application software for the target platform. RIST provided 206 expert supports in total from FY2012 to the end of Nov. 2019, of which 173 were for projects using the K computer, corresponding to 18% of all projects using the K computer. Expert support reports describing the detailed support contents are shared on the HPCI portal site [1], and the number of accesses to them is ~5,800 (as of end of Nov. 2019). Provision of application software is ongoing as part of the expert support [2].

Seminars and workshops: RIST has organized a variety of seminars and workshops aiming to improve participants’ skills and to contribute to human resource development and the extension of HPCI users. Since holding the first seminar for K computer users and an HPC seminar on programming techniques in 2012, RIST has added seminars on application software and workshops on specific areas such as materials and CAE (Computer-Aided Engineering), responding to requests from HPCI users. RIST has held 154 seminars and workshops in total (as of end of Nov. 2019), with 3,382 participants in total. Participants from companies account for 55%.

Promotion of Industrial use: To promote the industrial use of HPCI, the early creation of outcomes, and the expansion of HPCI users in industry, RIST has provided user support responding to users’ experiences and needs in industry: consultation as a concierge and the other user support mentioned above, providing information specific to industrial use. RIST has accepted 363 consultation requests, of which 169 have led to applications for HPCI research projects (145 projects awarded). In total, 230 companies have used HPCI systems, and the K computer has been used by 202 companies (as of end of Nov. 2019).

12. Takaaki Miyajima, Tomohiro Ueno, Jens Huthmann, Atsushi Koshiba and Kentaro Sano.
"High-Performance Off-loading Engine with multiple FPGAs through Custom Computing"

Hardware customization for target applications is gathering attention in the HPC area. Recent large-scale Field Programmable Gate Arrays (FPGAs) can be applied for this purpose. We believe that one promising approach is to combine an existing HPC system with multiple FPGAs to cover a wider range of applications by off-loading some of them.

Our goal is to realize a high-performance off-loading engine with multiple FPGAs through custom computing. We must address the following two main problems to achieve this goal. How should we combine an FPGA cluster with an existing HPC system? How should we achieve high-performance custom computing by off-loading tasks to FPGA clusters? This research poster overviews our research activity, including motivation, challenges, target system, and system stack.

13. Yi-Chao Wang and James Lin.
"An Empirical Study of HPC Workloads on Huawei Arm-based Processor"

ARM-based server processors have been gaining momentum in high performance computing (HPC). In 2019, Huawei announced that it will invest about 436 million U.S. dollars over the next 5 years to develop an ecosystem for its ARM-based server chips and complementary products. Kunpeng is the product line name of the ARM-based server processors developed by Huawei. Launched in 2016, Kunpeng 916 was the latest generation of Kunpeng processor publicly available in the market when the paper was written (the first half of 2019). Kunpeng 916 aims at cloud computing and storage servers. While not designed specifically for HPC, the Kunpeng 916 processor has 32 ARMv8 cores and is tempting for HPC workloads. Each core is an ARMv8 IP core running at 2.4GHz and supporting the advanced SIMD extension NEON. Each processor can support up to 128GB of DDR4-2400 memory. The thermal design power (TDP) is only 85W, whereas the TDP of Intel Xeon server processors is usually 120-140W. In order to thoroughly understand the potential of Kunpeng 916, we conducted a systematic evaluation in three steps by using: 1) three well-known benchmarks (HPL, STREAM, and LMbench); 2) three typical scientific kernels (SpMV, N-body, and GEMM); 3) three widely used mini-apps (TeaLeaf, Neutral, and SNAP) and a real-world application, GTC-P.

Based on our systematic evaluation, we believe that Kunpeng 916 is compelling for running memory bound HPC workloads, due to its high memory bandwidth. We developed a microbenchmark in assembly language to measure instruction latency and throughput on Kunpeng 916. We found that the two most used instructions, FMA and MUL, have longer latency and lower throughput on Kunpeng 916 than on Intel Haswell and Broadwell, which causes the significant performance gap in running HPL and other compute bound workloads. We also found that binding four threads into one process can improve data locality, thus increasing performance on Kunpeng 916, due to a processor micro-architectural design (four cores in the same core group share an L2 cache).

Based on the evaluation results, we highlight the key findings of this paper as below:

  1. Memory bound workloads. Kunpeng 916 can achieve higher memory bandwidth than Intel Haswell and Broadwell. Therefore, memory bound workloads, such as SpMV, TeaLeaf, and the charge kernel of GTC-P, can achieve competitive, or even better, performance compared with the two Intel processors.
  2. Compute bound workloads. Kunpeng 916 has at least 35% lower HPL results than Intel Haswell and Broadwell. Therefore, compute bound workloads, such as N-body and GEMM, suffer a significant performance loss on Kunpeng 916 compared with the two Intel processors.
  3. Instruction latency and throughput. The two typical instructions used in compute bound kernels, FMA and MUL, have longer latency and lower throughput on Kunpeng 916 than on the two Intel processors, causing the significant performance gap for compute bound workloads.
  4. Data locality. In Kunpeng 916, the four cores in the same core group share an L2 cache. Therefore, based on our experience optimizing GTC-P, binding four threads to one process can improve data locality and thus performance.
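
The distinction between memory bound and compute bound workloads in the findings above can be illustrated with a simple roofline estimate. The sketch below is not taken from the paper: the peak rate, the bandwidth, and the arithmetic intensities are hypothetical placeholders, not measured values for Kunpeng 916 or the Intel processors.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def is_memory_bound(peak_gflops, bandwidth_gbs, flops_per_byte):
    """A kernel is memory bound when its intensity lies below the ridge point."""
    return flops_per_byte < peak_gflops / bandwidth_gbs


# Hypothetical machine: 500 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK, BW = 500.0, 100.0

# SpMV-like kernel: about 0.25 flop per byte (illustrative value only).
spmv = attainable_gflops(PEAK, BW, 0.25)

# GEMM-like kernel: tens of flops per byte (illustrative value only).
gemm = attainable_gflops(PEAK, BW, 32.0)

print(spmv, is_memory_bound(PEAK, BW, 0.25))
print(gemm, is_memory_bound(PEAK, BW, 32.0))
```

Under this model, a processor with higher memory bandwidth raises the bound for the SpMV-like kernel, while a lower FMA throughput lowers the bound for the GEMM-like kernel.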
14Yen-Chen Chen and Kengo Nakajima.
"Parallel-in-Space/Time Method for Explicit Time-Marching Schemes"

Partial differential equation (PDE) solvers benefit greatly from the massive computing power of supercomputers. However, as that computing power grows year by year and we enter the exaflops era, spatial parallelization of PDE solvers is gradually reaching its limits. To exploit further parallelism, researchers have proposed Parallel-in-Space/Time (PinST) methods. As the name suggests, PinST methods parallelize PDE solvers in both the spatial and temporal dimensions.

Two of the most potent PinST methods today are the Parareal method and the Multigrid Reduction in Time (MGRIT) method. The Parareal method involves one high-precision solver and one low-precision solver. The two solvers are applied in parallel along the timeline, and the errors are propagated forward in time by the low-precision solver. The MGRIT method is based on the multigrid hierarchy and constructs coarse grids on the timeline. Falgout et al. (2014) show that the MGRIT method performs efficient parallel computation for diffusion equations using implicit time-marching schemes.
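
As a rough illustration of the Parareal iteration described above, the following Python sketch applies it to the scalar test equation dy/dt = lam * y rather than to a PDE. The propagators `coarse` and `fine`, the step counts, and all parameter values are assumptions chosen for illustration; the per-slice loops that a real code would run in parallel are executed serially here.

```python
def euler(y, t0, t1, steps, lam=-1.0):
    """Explicit Euler for dy/dt = lam * y, integrated from t0 to t1."""
    dt = (t1 - t0) / steps
    for _ in range(steps):
        y = y + dt * lam * y
    return y


def coarse(y, t0, t1):
    """Low-precision propagator G: one Euler step per time slice."""
    return euler(y, t0, t1, steps=1)


def fine(y, t0, t1):
    """High-precision propagator F: 100 Euler steps per time slice."""
    return euler(y, t0, t1, steps=100)


def parareal(y0, t_end, n_slices, n_iters):
    """Return the solution at every slice boundary after n_iters iterations."""
    ts = [t_end * i / n_slices for i in range(n_slices + 1)]
    # Initial guess: one serial sweep of the coarse propagator.
    u = [y0]
    for n in range(n_slices):
        u.append(coarse(u[n], ts[n], ts[n + 1]))
    for _ in range(n_iters):
        # Independent per slice: this is the part that runs in parallel.
        f_old = [fine(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        g_old = [coarse(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        # Serial correction sweep with the cheap coarse propagator.
        new = [y0]
        for n in range(n_slices):
            new.append(coarse(new[n], ts[n], ts[n + 1]) + f_old[n] - g_old[n])
        u = new
    return u


def serial_fine(y0, t_end, n_slices):
    """Reference: the fine propagator applied slice after slice."""
    y = y0
    for n in range(n_slices):
        y = fine(y, t_end * n / n_slices, t_end * (n + 1) / n_slices)
    return y
```

With as many iterations as time slices, the iteration reproduces the serial fine solution up to rounding; the method is useful only when far fewer iterations suffice.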

Although PinST methods have solved some PDE problems successfully, they still have difficulty with hyperbolic problems such as the advection equation, which often requires an explicit time-marching scheme. An essential property of explicit time-marching schemes is that they must satisfy the Courant-Friedrichs-Lewy (CFL) condition to converge. This property causes trouble for the coarse solvers in PinST methods. In addition, each step of an explicit scheme depends only on values from the current time; thus, explicit schemes are much more scalable in the spatial dimension than current PinST methods.

In this research, we propose a multilevel parareal method, which is similar to the traditional parareal method but has more levels, like the MGRIT method. We also coarsen the time grid and the spatial grid at the same time to handle the CFL constraint. Our target hyperbolic PDE problem is the one-dimensional advection equation. We implemented the multilevel parareal method in C++ with MPI parallelization, and we showed that it can serve as a decent approximation to explicit time-marching schemes while scaling better than pure spatial parallelization.
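
The reason for coarsening space together with time can be seen from the CFL number of the one-dimensional advection equation. The sketch below assumes a first-order upwind scheme, for which the stability bound is C = |a| dt / dx <= 1; the scheme, the grid sizes, and the coarsening factor are illustrative assumptions, not values from the poster.

```python
def cfl_number(speed, dt, dx):
    """Courant number C = |a| * dt / dx for advection speed a."""
    return abs(speed) * dt / dx


def coarsen(dt, dx, factor, coarsen_space):
    """Coarsen the time step, and optionally the grid spacing, by `factor`."""
    return dt * factor, (dx * factor if coarsen_space else dx)


def is_stable(speed, dt, dx, limit=1.0):
    """Stability check for the assumed first-order upwind scheme."""
    return cfl_number(speed, dt, dx) <= limit


# Hypothetical fine level: a = 1, dt = 0.008, dx = 0.01, so C = 0.8.
A, DT, DX = 1.0, 0.008, 0.01

# Coarsening time alone multiplies C by the factor and breaks the bound.
dt_t, dx_t = coarsen(DT, DX, 4, coarsen_space=False)

# Coarsening time and space together leaves C unchanged.
dt_ts, dx_ts = coarsen(DT, DX, 4, coarsen_space=True)

print(is_stable(A, DT, DX), is_stable(A, dt_t, dx_t), is_stable(A, dt_ts, dx_ts))
```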

15Shigenori Otsuka, Marimo Ohhigashi, Viet Phi Huynh, Pierre Tandeo and Takemasa Miyoshi.
"A deep-learning approach to three-dimensional precipitation nowcasting"

The Phased-Array Weather Radar (PAWR), developed by the National Institute of Information and Communications Technology, Osaka University, and Toshiba Corporation, has been in operation in Japan since 2012. The PAWR scans the whole sky within a 60-km range every 30 seconds at 110 elevation angles. Four PAWRs of the same type have been installed at Osaka, Kobe, Okinawa, and Tsukuba, and two similar radars of other types have also been installed in Japan. Taking advantage of the PAWRs’ frequent and dense three-dimensional volume scans, we have been operating 30-second-update three-dimensional precipitation nowcasting at RIKEN since 2017 (Otsuka et al., 2016). Our current system adopts an optical-flow-based algorithm in three-dimensional space; its computational cost is orders of magnitude higher than that of traditional two-dimensional optical flow. In addition, convective clouds evolve rapidly within a 10-minute forecast, sometimes violating the assumption of Lagrangian persistence.

Recent advances in machine-learning algorithms may help solve these problems. In this study, a three-dimensional extension of the Convolutional Long Short-Term Memory (Conv-LSTM; Shi et al., 2015), a kind of deep-learning algorithm, is applied to PAWR nowcasting. In addition to the Conv-LSTM with past observations, we also develop a Conv-LSTM that accepts forecast data from numerical weather prediction (NWP) or from an optical-flow-based nowcast. NWP uses HPC resources to solve the full physics equations of the atmosphere, so Conv-LSTM with NWP would be a new direction toward fusing Big Data and HPC, in which training with the big data from high-resolution NWP and PAWR observations would be a challenge.

The three-dimensional Conv-LSTM successfully made predictions of convective storms; in some cases, Conv-LSTM had additional skill in capturing intensification and weakening of precipitation that were not predicted by the optical-flow-based system. On average, the Conv-LSTM-based system outperformed the optical-flow-based system statistically. Furthermore, Conv-LSTM with forecast data outperformed that without forecast data.

16James Taylor, Guo-Yuan Lien, Shinsuke Satoh and Takemasa Miyoshi.
"The Assimilation of Dual Phased Array Weather Radar Observations to Short-range Convective Forecasts"

The assimilation of Doppler velocity and reflectivity observations from phased array weather radar (PAWR) has been widely studied for short-range numerical weather prediction (NWP) and has been found to have a positive impact on analyses and forecasts of convective-scale weather systems (Maejima et al., 2017). However, these studies assimilated observations from only a single PAWR, and the use of multiple PAWR observations for NWP has not yet been explored. With the recent development of PAWRs at sites at Osaka University and in Kobe, a common observation region has been established, providing dual radar observations over a large area of the Kobe region, where severe convective storms can develop quickly and bring intense rain, causing hazardous conditions.

In this study, we investigate the impact of using dual PAWR observations for the prediction of a localized convective weather system. We use the SCALE-LETKF system (Lien et al., 2017), which couples the Local Ensemble Transform Kalman Filter (LETKF) with the Scalable Computing for Advanced Library and Environment (SCALE)-Regional Model (RM), to perform data assimilation experiments with 30-second updates of PAWR observations on a high-resolution 1-km mesh in order to capture the rapid development of convective activity. The dual reflectivity observations are first used to identify and remove remaining false echoes from the observations, including range sidelobes, a common radar artifact, thereby improving upon existing quality control measures. Next, we introduce a process that combines reflectivity observations from both radars to reduce the loss of information caused by rain attenuation, which is a major problem for X-band radars. The results show that implementing these new data assimilation methods leads to improvements in both rainfall intensity and distribution in short-range forecasts.

17Koya Kobayashi.
"Blood flow prediction using machine learning"

Cardiovascular disease is the second leading cause of death in Japan. The disease is mainly caused by atherosclerosis. The rate of progression of atherosclerosis depends strongly on lifestyle. Atherosclerosis can be prevented by improving lifestyle, which also means that early detection of the disease is important. So far, various studies using mathematical models have been conducted to improve atherosclerosis diagnosis technology. Nevertheless, the diagnosis takes time and requires cuff inflation, which is stressful for patients. Therefore, a more patient-friendly method is required.

The diagnosis can be simplified by measuring fingertip blood flow. However, the vessel segments from the wrist to the fingertip have not been modeled because of their complicated structure. In contrast, machine learning based on input and output data can be used to predict blood flow. The purpose of the present study is to develop a method to predict fingertip blood flow using machine learning.

In the present study, we use the LSTM (Long Short-Term Memory) model, which can predict a current value while taking past values into account. Avolio’s model (1980) is adopted to generate the training data. However, this model describes only the blood vessels from the heart to the wrist; a hand model has not been developed because of the complicated structure.

We develop a model capable of predicting the blood flow at the wrist from that at the upper arm, using data simulated with Avolio's model. As shown in Figure 1, we trained the LSTM model with the blood flow at the upper arm as the input and that at the wrist as the output. In the first step, a peak value of the wrist blood flow is predicted from the time-series data of the upper-arm blood flow from peak to peak. In the next step, the input and output data are shifted by one step (Figure 1). This procedure is repeated. After training is complete, the LSTM model is used to estimate the wrist blood flow from clinically measured data (Figure 2).
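
The one-step shifting of input and output data described above amounts to building sliding-window training pairs. The following Python sketch shows one way to do this; the function name `make_windows`, the fixed window length, and the toy series are hypothetical, and the poster's actual window spans one peak-to-peak period of the simulated waveform.

```python
def make_windows(inputs, targets, window):
    """Pair each run of `window` input samples with the target at its last step.

    Consecutive pairs are shifted by one time step, so a series of length N
    yields N - window + 1 training pairs.
    """
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    pairs = []
    for end in range(window, len(inputs) + 1):
        pairs.append((inputs[end - window:end], targets[end - 1]))
    return pairs


# Toy series standing in for upper-arm (input) and wrist (output) blood flow.
upper_arm = [0, 1, 2, 3, 4]
wrist = [10, 11, 12, 13, 14]

pairs = make_windows(upper_arm, wrist, window=3)
for x, y in pairs:
    print(x, "->", y)
```

Each pair would then be fed to the LSTM as one training sample.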

18Mariia V. Ivonina, Yuuichi Orimoto and Yuriko Aoki.
"Nonlinear optical properties of push-pull π-conjugated molecules via elongation-FF method"

Nonlinear optical (NLO) materials, defined as materials demonstrating large (hyper)polarizabilities, are extensively used in optical and electro-optical applications such as optical communication, computing, modulation, and switching. One promising class of NLO compounds is organic molecules with a π-conjugated chain substituted at the terminal positions by donor and acceptor groups, well known as push-pull molecules. The target properties can be obtained by tuning the molecular structure through variation of the donor and acceptor and/or extension of the π-system by elongating the π-linker. To make theoretical studies on the design of such molecules at the ab initio level more efficient, we use the elongation finite-field (elongation-FF) method [1], developed for calculations on huge aperiodic systems. In the elongation procedure, monomers attack the molecule one by one, gradually increasing the chain length. During the self-consistent field calculation, the eigenvalue problem is solved only for the active region, which interacts strongly with the attacking monomer, while the frozen region, the part of the chain that interacts only insignificantly with the attacking monomer, is disregarded. We consider the elongation method a promising approach for investigating long molecular chains because the molecular framework can be built step by step while varying the substituents and the number of monomers. In this study, the NLO properties of selected Donor-π-Acceptor molecules with a π-conjugated linker were tuned using the elongation-FF method and compared with reference results obtained by the Hartree-Fock method.
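
The finite-field part of the method obtains (hyper)polarizabilities as numerical derivatives of the energy with respect to an applied electric field. The sketch below illustrates only this differentiation step, assuming the Taylor convention E(F) = E0 - mu F - alpha F^2 / 2 - beta F^3 / 6; the function names, the field step, and the toy energy polynomial are hypothetical stand-ins for an actual electronic-structure calculation.

```python
def finite_field_properties(energy, h=0.01):
    """Estimate alpha and beta from energies at fields 0, +/-h, and +/-2h.

    alpha = -d2E/dF2 and beta = -d3E/dF3 at zero field, evaluated with
    central finite differences.
    """
    e0 = energy(0.0)
    ep1, em1 = energy(h), energy(-h)
    ep2, em2 = energy(2 * h), energy(-2 * h)
    alpha = -(ep1 - 2.0 * e0 + em1) / h**2
    beta = -(ep2 - 2.0 * ep1 + 2.0 * em1 - em2) / (2.0 * h**3)
    return alpha, beta


def toy_energy(field):
    """Hypothetical energy model with mu = 0.5, alpha = 10, beta = 30."""
    return -1.0 - 0.5 * field - 0.5 * 10.0 * field**2 - 30.0 * field**3 / 6.0


alpha, beta = finite_field_properties(toy_energy)
print(alpha, beta)
```

In practice, each call to `energy` would be a separate self-consistent field calculation with the field added to the Hamiltonian, and the field step must balance truncation error against numerical noise.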


  1. F.L. Gu, Y. Aoki, A. Imamura, D.M. Bishop and B. Kirtman, Mol. Phys. 101, 1487 (2003).
19Takehiro Yonehara and Takahito Nakajima.
"Computational study on quantum dynamics of excited electrons in molecular aggregates : Toward an efficient control of a conversion from light energy to chemical functionality"
To design an energy flow and realize desired states in material systems, it is critically important to understand the propensity rules of the dynamics among the constituent elements. The recent social demand for low-cost light-energy conversion, for example the development of highly efficient solar cells and artificial photosynthesis systems, poses important challenges not only for fundamental science but also for human well-being through contributions from industry. In such systems, excited electrons play an important role in chemical reactions and efficient energy distribution. To elucidate this role, we developed a theoretical method and applied it to molecular aggregates. In this presentation, we report the results of our project over the past three years.
20Yoshiaki Yamaoka, Naohisa Sakamoto and Jorji Nonaka.
"Adaptive spatial and temporal sampling for In-situ visualization"
Recently, the scale of numerical simulations and of the data they generate has become large, so in-situ visualization, which bypasses disk I/O when processing the simulation results, has become an attractive solution. Some in-situ visualization approaches have also been tried on the K computer environment, including KVS, VisIt-libsim, and Pi2D. In batch-based in-situ visualization, multi-viewpoint visualization is highly useful for increasing the probability that all important information is visualized. However, the resulting increase in the number of generated images is expected to affect not only the processing time but also the time required for analysis, and consequently the turnaround time for acquiring scientific knowledge. In this poster, we present adaptive spatial and temporal sampling to mitigate this problem during in-situ visualization. We have already proposed an adaptive temporal sampling technique that optimizes the visualization processing by thinning out the regions with small changes across simulation time steps. Although this method has proven useful for reducing the number of generated images in the temporal direction, it yields no gain for in-situ multi-viewpoint visualization because it does not consider the spatial direction. In addition, setting the threshold value that determines whether an image shows important changes proved difficult in the initial prototype because the user had to set this value manually without prior information. To solve these problems, we propose a thinning technique that works not only in the temporal direction but also in the spatial direction; that is, the number of visualization results can be reduced by eliminating unnecessary images from certain viewpoints.
We also improved the threshold-setting method by introducing new indexes that make the task more intuitive: users only need to set these indexes, and the threshold value is estimated automatically. Experiments were performed to evaluate the usefulness of the proposed method by measuring the processing time. We also asked domain scientists to verify the validity of the visualization results, and we obtained positive feedback from these experiments.
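
The temporal thinning idea can be sketched as follows. The poster does not specify its change metric or thresholding rule, so the mean absolute pixel difference and the manually supplied threshold used here are assumptions, and the function names are hypothetical.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_time_steps(frames, threshold):
    """Return indices of the frames to keep.

    A frame is kept only when it differs from the most recently kept frame
    by more than `threshold`; frames with small changes are thinned out.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


# Toy sequence of two-pixel images from one viewpoint.
frames = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.0], [3.0, 3.0]]
kept = select_time_steps(frames, threshold=0.5)
print(kept)
```

Running the same selection separately for each viewpoint would give a simple form of spatial thinning, since viewpoints whose images barely change would contribute few frames.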
21Tsuyoshi Yamaura, Seiya Nishizawa and Hirofumi Tomita.
"Theoretical time evolution of numerical errors when using floating point numbers in shallow-water models"
We carried out a theoretical investigation of the impact of the numerical errors caused by using floating point numbers (FPNs) in simulations, such as rounding errors. Under the presupposition that model variables can be written as the linear sum of the true value and the numerical error, equations governing the time evolution of numerical errors due to FPNs (FPN errors) are obtained by considering the total errors of the results of simulations of shallow-water models and estimating the errors incurred by using FPNs with varying precision. We can use the time evolution equations to estimate the behavior of the FPN errors, then confirm these estimations by carrying out numerical simulations. In a geostrophic wind balance state, the FPN error oscillates and gradually increases in proportion to the square root of the number of time steps, like a random walk. We found that the error introduced by using FPNs can be considered as stochastic forcing. In a state of barotropic instability, the FPN error initially evolves as stochastic forcing, as in the case of the geostrophic wind balance state. However, it then begins to increase exponentially, like a barotropic instability wave. These numerical results are obtained by using a staggered-grid arrangement and stable time-integration method to retain near-neutral numerical stability in the simulations. The FPN error tends to behave as theoretically predicted if the numerical stability is close to neutral.
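
The random-walk behavior mentioned above, with the error growing in proportion to the square root of the number of time steps, can be reproduced with a toy model in which each step adds an independent, zero-mean rounding error. This is an illustrative assumption only and not the shallow-water error equations of the poster; the function name and parameters are hypothetical.

```python
import random


def rms_error_after(n_steps, n_trials, eps=1.0, seed=0):
    """RMS accumulated error after n_steps of independent rounding errors.

    Each step adds an error drawn uniformly from [-eps, eps], mimicking
    unbiased rounding; the result is averaged over n_trials walks.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        err = 0.0
        for _ in range(n_steps):
            err += rng.uniform(-eps, eps)
        total += err * err
    return (total / n_trials) ** 0.5


# Quadrupling the number of steps should roughly double the RMS error.
ratio = rms_error_after(400, 1000) / rms_error_after(100, 1000)
print(ratio)
```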
22Toshiyuki Imamura, Daichi Mukunoki, Yiyu Tan, Atsushi Koshiba, Jens Huthmann, Kentaro Sano, Fabienne Jézéquel, Stef Graillat, Roman Iakymchuk, Norihisa Fujita and Taisuke Boku.
"Minimal-Precision Computing for High-Performance, Energy-Efficient, and Reliable Computations"
We have recently started a research collaboration among RIKEN CCS, Sorbonne University, and the University of Tsukuba to explore the possibility of a new computing system with precision tuning. Our proposal, "minimal-precision computing," aims to achieve both reliability (accuracy and reproducibility) and high performance (speed and energy efficiency) by obtaining computing results with the accuracy requested by users while using the minimal precision. The proposal involves both hardware and software stacks and combines (1) a precision-tuning method based on numerical validation by Discrete Stochastic Arithmetic (DSA), (2) arbitrary-precision arithmetic libraries, (3) fast and accurate numerical libraries, and (4) Field-Programmable Gate Arrays (FPGAs) with high-level synthesis, together with some important components that we are developing. This poster gives an overview of our ideas and our latest contributions.
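
The goal of using the minimal precision that still meets a requested accuracy can be illustrated with a brute-force search. The sketch below uses Python's standard `decimal` module as a stand-in for an arbitrary-precision library and compares against a high-precision reference; it does not use Discrete Stochastic Arithmetic, and the function name and the example computation are hypothetical.

```python
from decimal import Decimal, localcontext


def minimal_precision(compute, tolerance, max_digits=50):
    """Smallest number of significant decimal digits meeting `tolerance`.

    `compute` is a zero-argument function built on Decimal arithmetic; it is
    evaluated once at max_digits to obtain a reference value, then at
    increasing precision until the absolute error is within tolerance.
    """
    with localcontext() as ctx:
        ctx.prec = max_digits
        reference = compute()
    for digits in range(1, max_digits):
        with localcontext() as ctx:
            ctx.prec = digits
            value = compute()
        with localcontext() as ctx:
            ctx.prec = max_digits
            error = abs(value - reference)
        if error <= tolerance:
            return digits
    return max_digits


def one_third():
    """Example computation whose result cannot be represented exactly."""
    return Decimal(1) / Decimal(3)


digits = minimal_precision(one_third, Decimal("1e-6"))
print(digits)
```

A practical tool would estimate the error without a high-precision reference run, which is the role numerical validation plays in the proposal.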
23Yutaka Ishikawa, Atsushi Hori, Balazs Gerofi, Masamichi Takagi, Takahiro Ogura, Toyohisa Kameyama and Yin Jie.
"MPI, OS and Runtime Enhancement for the Fugaku Supercomputer"

The RIKEN Center for Computational Science (R-CCS) is leading the development of Japan's next-generation flagship supercomputer, Fugaku, the successor of the K computer. The system software research team and development team are developing a new advanced software stack consisting of a low-level OS kernel, a novel in-node parallel execution model, and an MPI that utilizes Fugaku’s unique features.

Fugaku, the successor of the K computer, has an advanced interconnect called Tofu Interconnect D (TofuD) for massive scalability. TofuD has ten external ports and six DMA engines, which transfer data between the memory devices and the local ports of the internal router. It also has offload engines, which process orchestrated communication among many MPI processes without CPU involvement. In this poster, we present preliminary implementations of the two libraries that make applications scale to tens of thousands of nodes. The first is MPICH for Fugaku, called RIKEN MPI. It is customized for Fugaku’s TofuD and complements the vendor-provided MPI (Fujitsu MPI) by providing emerging optimizations and emerging MPI-standard features through frequent updates. RIKEN MPI provides offloaded, persistent collective protocols and a memory-saving point-to-point protocol for this purpose. The offloaded protocols are customized for the unique network / NUMA topology of Fugaku (Post-K). The "neighbor" type of protocol, which performs a collection of communications with subsets of nodes, adds another optimization in which the DMA engines are scheduled so that none of them is under- or over-utilized. The memory-saving protocol uses the receive buffer efficiently by sharing one receive buffer among multiple remote MPI ranks. We evaluate the offloaded protocol on a PRIMEHPC FX100 cluster, which has the predecessor of TofuD, and confirm its benefit over the non-offloaded counterpart. As for the memory-saving protocol, we show that it consumes a fixed amount of memory and is thus scalable to over one million MPI processes.

Another effort is to design and develop a system software stack that suits the needs of future extreme-scale computing. In this poster, we also introduce IHK/McKernel and PiP. IHK/McKernel is a lightweight multi-kernel operating system that runs Linux and a lightweight kernel side by side on compute nodes, with the primary motivation of providing scalable and consistent performance for large-scale HPC simulations. At the same time, it retains a fully Linux-compatible execution environment. We provide an overview of the software architecture and show performance results on the full-scale Oakforest-PACS machine, a many-core-based supercomputer consisting of 8,192 Intel Xeon Phi Knights Landing nodes. Process in Process (PiP) is a new concurrent execution model that takes the best of the multi-process and multi-thread execution models. In this model, variables are privatized so that each task has its own variable set, and tasks share the same virtual address space. The execution entity in this model is called a “task” since it does not follow the definition of either a process or a thread. PiP tasks can interact and communicate with one another easily, and race conditions can happen only on explicitly shared variables and/or data. Since tasks run in the same virtual address space, any data owned by a task can be accessed if the addresses of the variables and/or data are known. While this idea is not new, existing implementations require either a dedicated OS kernel or a specialized language processing system consisting of a compiler, linker, runtime system, and debugger. PiP is the first implementation as a user-level library, which needs neither a dedicated OS kernel nor a language system. Thus, PiP is portable and easy to deploy.

24Sachiho Adachi, Seiya Nishizawa and Hirofumi Tomita.
"A new framework built into a numerical library for climate to use spatially-detailed urban parameters"

We have added to SCALE, a numerical library for weather and climate developed at RIKEN R-CCS, a new framework for using spatially detailed urban parameters. In this presentation, we introduce the new framework and show its effect on an atmospheric simulation from the viewpoint of performance by comparing the results of the original and improved versions of SCALE.

The atmospheric layer within about 3 km above ground level is called the atmospheric boundary layer, where the atmospheric condition is strongly influenced by ground surface conditions. The ground surface receives solar and longwave radiation and then emits heat and water vapor as thermal energy into the atmosphere. The energy fluxes carried by heat and water vapor are called the sensible and latent heat fluxes, respectively. The ground surface also acts as a resistance for the atmosphere; it provides kinetic energy. Therefore, the atmospheric condition in the atmospheric boundary layer is horizontally inhomogeneous according to the differences in the surface conditions. One example is the urban heat island phenomenon, in which the surface temperature in an urban area is higher than that in surrounding areas covered with grasslands and forests. The higher temperature in the urban area is attributed to the difference in the Bowen ratio (the ratio of sensible heat flux to latent heat flux) between urban and rural areas.

In addition, thermal properties are inhomogeneous within urban areas because they depend on the materials and shapes of buildings. The original version of SCALE used uniform values for the urban parameters over the entire calculation domain; that is, the model did not consider spatial inhomogeneity within the urban area. However, this inhomogeneity is important in simulations with high spatial resolution. In this study, we show the improvement of SCALE and the results of an initial analysis.

25Yiyu Tan, Toshiyuki Imamura and Daichi Mukunoki.
"An FPGA-based Matrix Multiplier with Task Parallelism"
Matrix multiplication requires computer systems with huge computing capability and data throughput as the problem size increases. In this research, an OpenCL-based matrix multiplier with task parallelism is designed and implemented on the FPGA board DE5a-NET to improve computation throughput and energy efficiency. The matrix multiplier is based on a systolic array architecture with 10 × 16 processing elements (PEs), and all modules except the data loading modules are autorun to hide computation overhead. For single-precision floating-point data, the proposed matrix multiplier achieves on average about 785 GFLOPS in computation throughput and 66.75 GFLOPS/W in energy efficiency. Compared with Intel’s OpenCL example with data parallelism on the FPGA and with the SGEMM routines in the Intel MKL and OpenBLAS libraries executed on a desktop with 32 GB of DDR4 RAM and an Intel i7-6800K processor running at 3.4 GHz, the proposed matrix multiplier is on average 3.2 times, 1.3 times, and 1.6 times faster in computation throughput and 2.9 times, 10.5 times, and 11.8 times better in energy efficiency, respectively, even though the fabrication technology is 20 nm for the FPGA and 14 nm for the CPU. Although the proposed FPGA-based matrix multiplier delivers only 6.5% of the computation throughput of the SGEMM routine in cuBLAS on the Nvidia TITAN V GPU, it is 1.2 times better in energy efficiency even though the fabrication technology of the GPU is 12 nm.
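
How a 10 × 16 array of processing elements maps onto a matrix product can be sketched in software. The sketch below assumes an output-stationary dataflow, in which each PE accumulates one element of the result tile while operands stream through; the abstract does not state the dataflow actually used, so this choice and the function name are assumptions, and the code models only the arithmetic, not the timing or the OpenCL kernels.

```python
PE_ROWS, PE_COLS = 10, 16  # PE array size quoted in the abstract


def tiled_matmul(a, b):
    """Compute C = A * B tile by tile, one PE_ROWS x PE_COLS tile per pass."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, PE_ROWS):
        for j0 in range(0, m, PE_COLS):
            # One pass of the array: PE (i, j) owns the element c[i][j].
            for p in range(k):
                # Step p: row p of B and column p of A stream through the PEs.
                for i in range(i0, min(i0 + PE_ROWS, n)):
                    for j in range(j0, min(j0 + PE_COLS, m)):
                        c[i][j] += a[i][p] * b[p][j]
    return c


# Small example whose sizes are not multiples of the tile size.
a = [[float(i + j) for j in range(5)] for i in range(12)]
b = [[float(i * j + 1) for j in range(18)] for i in range(5)]
c = tiled_matmul(a, b)
print(len(c), len(c[0]))
```

In hardware, the two innermost loops correspond to the 160 PEs working concurrently, so each tile costs about k steps rather than 160 k multiply-accumulate operations in sequence.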
26Akiyoshi Kuroda, Kiyoshi Kumahata, Syuichi Chiba, Katsutoshi Takashina and Kazuo Minami.
"Performance Tuning of Deep Learning Framework Chainer on the K computer"

Recently, GPUs have become a popular platform for executing deep learning (DL) workloads. We revisit the idea of doing DL on CPUs, especially massively parallel CPU clusters (supercomputers). In anticipation of the deployment of the supercomputer Fugaku, with its much more DL-capable CPUs, we investigate which optimizations can already be done on the K computer, the current leadership computing facility and predecessor of Fugaku. We use Chainer as our deep learning framework of choice. Chainer expresses the hierarchical structure of deep learning in Python, and all calculations can be realized using numpy without special libraries. Much of the cost came from the square-root and arithmetic calculations performed when the filters were updated, and from the activation functions.

These operations are not optimized when calculated using numpy and are particularly slow on the K computer. By replacing the kernel with a Fortran library that uses software pipelining and SIMD optimization, the kernel elapsed time was reduced to 1/11.08 and the total elapsed time to 1/4.54. Moreover, by optimizing the handling of floating-point underflow exceptions when building Python, the total elapsed time was reduced to 1/3.39. In general, the cost of gemm convolution is high in DL calculations. By replacing the SSL2 gemm library called by Python with the thread-parallel version, the elapsed time of that section was reduced to 1/5.03, the total elapsed time was reduced to 1/1.15, and the performance efficiency of gemm convolution was improved about 70.05% [1]. The Python control part has no thread scalability. By dividing the Python procedure through data process parallelization using ChainerMN, the total elapsed time was reduced to 1/2.24. As a result of these optimizations, the overall speedup was 36.4 times, and the efficiency reached 35.9%.

There are some limitations on the use of Chainer on the K computer. The learning data must be prepared beforehand and staged in to an appropriate storage system. Moreover, since Python resides in the shared storage, loading the library takes time. Nevertheless, we believe that the supercomputer Fugaku will be usable for deep learning as effectively as GPUs.


  1. Akiyoshi Kuroda, Kiyoshi Kumahata, Syuichi Chiba, Katsutoshi Takashina, and
    Kazuo Minami. 2019. Performance Tuning of Deep Learning Framework Chainer
    on the K computer. ISC2019 Research Poster, PR (2019), 28.
27Arata Amemiya, Shlok Mohta and Takemasa Miyoshi.
"Application of machine learning methods to model bias correction: idealized experiments with the Lorenz-96 model"

Numerical prediction models in meteorology and oceanography are composed of a set of discretized partial differential equations based on the basic laws of physics. This 'knowledge-based' approach is advantageous over a purely 'data-driven' forecast approach because observation data are generally limited and noisy. However, knowledge-based models also have inevitable systematic biases and random errors due to discretization and various approximations of physical processes. Hence, a hybrid approach in which the systematic model bias is compensated by statistical estimation may provide a better forecast than a knowledge-based or data-driven model alone.

Model bias correction has been studied as an important subject in data assimilation. Model bias can be effectively alleviated by statistical model bias correction methods combined with a variational or sequential (Kalman filter-based) data assimilation method (Dee, 2005). Conventional methods have assumed a bias correction term of simple functional form, such as a constant or a linear dependence on model state variables (Dee and Da Silva, 1998; Danforth et al., 2007). Recently, data-driven forecasting and the estimation of the governing functions of a system in arbitrary form, known as ‘model detection’ or ‘system identification’, have developed rapidly with the use of machine learning (Brunton et al., 2016; Vlachas et al., 2018). The application of such machine learning methods to data assimilation is a potential solution to the correction of model bias of unknown complexity.

In this study, the application of Long Short-Term Memory (LSTM) to model bias correction problems is explored in the context of data assimilation using the Local Ensemble Transform Kalman Filter (LETKF). The proposed method is applicable to model bias that depends on current and past model states in an arbitrarily nonlinear manner. Localization of the bias correction treatment is also implemented. The new method is examined in idealized numerical experiments using a multi-scale Lorenz-96 model (Lorenz 1996; Wilks 2005). Its performance and efficiency compared with existing methods that assume linear dependence will be discussed.

28Bhaskar Dasgupta, Osamu Miyashita and Florence Tama.
"Hybrid Structural Modeling of Proteins Based on Atomic Force Microscopy (AFM) to Recover Conformational Transition Included in AFM Images"

Hybrid structural modeling combines experimental and computational techniques to obtain a reliable three-dimensional (3D) model of molecules. One field of application of such modeling is atomic force microscopy (AFM), which is used to study the structure-function relationships of biomolecules. AFM provides 2D images of biomolecules, enabling us to follow conformational changes. To understand the function of a biomolecule in more detail, a 3D structural model is valuable.

Because AFM images are low-resolution, we represent the associated structure by coarse-grained models (a mixture of 3D Gaussian kernels). We use our original Monte Carlo (MC) sampling technique to update an initial candidate model so that it better fits a given AFM image. The MC sampling is continued until the similarity to the AFM image converges to a maximum value.

We applied our computational technique in a theoretical study in which synthetic AFM images are generated from two proteins, Elongation Factor 2 (EF2) and CRISPR associated protein 9 (Cas9). One of the conformations of each protein is used as the target conformation from which a reference synthetic AFM image is generated, and the other conformation is represented as a set of 3D kernels producing a low-resolution representation. By MC sampling, the 3D kernels are optimized to fit the reference image, producing a set of candidate models. We observed that those candidate models are highly similar to the target conformation. We also tested the effect of the orientation of a molecule in the AFM image and observed that similar accuracy can be obtained for multiple orientations. For the Cas9 protein, we also performed stepwise modeling, as this protein exists in four major conformations.

In the next step, we are analyzing experimentally obtained AFM images of a heat-shock disaggregase protein, ClpB from bacteria. The molecule is hexameric in solution and shows a variety of conformational forms by transitioning between closed and different types of open forms. Here, our aim is to reconstruct the major conformations observed in the AFM image set. For this, we constructed an initial model of 18 kernels from the known 3D structure of one of the conformations of ClpB. The MC method here also uses several structural restraints for reliable 3D modeling. We discuss our approach to modeling conformational changes in this large oligomeric protein complex based on AFM data.

29Kohei Takatama, Yusuke Uchiyama and Takemasa Miyoshi.
"Simulations of lake currents toward the prediction of blue-green algae – cases of Lake Biwa and Lake Kinneret -"

As part of the Japan Science and Technology Agency (JST) Strategic International Collaboration Research Program (SICORP), we have been working on the prediction of blue-green algae in Lake Biwa, Japan, and in Lake Kinneret, Israel. We will represent the algae motion by a Lagrangian particle tracking model for buoyant Microcystis. We used a hydrostatic model known as ROMS and have obtained reasonable lake current fields which can drive the tracking model.
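
As a rough illustration of Lagrangian particle tracking for buoyant particles, the sketch below advects particles with forward-Euler steps through a user-supplied current field and adds a constant rise speed. It is only a sketch under stated assumptions: the function name `advect`, the `velocity` callback, the constant `rise_speed`, and the clipping at the surface are all hypothetical and do not describe the ROMS-driven model of the poster.

```python
def advect(particles, velocity, rise_speed, dt, n_steps):
    """Forward-Euler tracking of buoyant particles.

    particles:  list of (x, y, z) in metres; z <= 0 and z = 0 is the lake surface.
    velocity:   function (x, y, z, t) -> (u, v, w) in m/s (hypothetical current field).
    rise_speed: buoyant rise velocity in m/s, added to the vertical current.
    """
    t = 0.0
    out = [tuple(p) for p in particles]
    for _ in range(n_steps):
        moved = []
        for x, y, z in out:
            u, v, w = velocity(x, y, z, t)
            # Particles cannot leave the water: clip at the surface z = 0.
            z_new = min(0.0, z + (w + rise_speed) * dt)
            moved.append((x + u * dt, y + v * dt, z_new))
        out = moved
        t += dt
    return out


def uniform_current(x, y, z, t):
    """Illustrative current field: steady 0.1 m/s flow in the x direction."""
    return (0.1, 0.0, 0.0)
```

In a real application the `velocity` callback would interpolate the simulated lake current field in space and time, and a higher-order time integrator would usually replace the Euler step.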

First, we conducted a year-long experiment for Lake Biwa with a 250-m-mesh lake grid and 5-km-mesh atmospheric forcing from operational weather forecast data of the Japan Meteorological Agency Meso-scale Model (JMA MSM). The model reproduced the seasonal change of temperature and a major gyre well. Next, we conducted 1-month-long experiments for Lake Biwa and Lake Kinneret with a 50-m-mesh lake grid and 1-km-mesh atmospheric forcing from RIKEN's regional atmospheric model known as SCALE. Around the period of an observation campaign in Lake Kinneret on February 19-23, 2018, the model reproduced the in-situ observed temperature at the center of Lake Kinneret well. For Lake Biwa, the model partly reproduced a sudden decrease of the water level in the south lake caused by typhoon Jebi in September 2018.

30Kazunori Mikami, Hirofumi Tomita, Soichiro Suzuki and Kazuo Minami.
"Target applications performance estimation through the codesign of Fugaku computer"

The target applications representing the nine social and scientific priority issues were chosen for codesigning the Fugaku computer, a.k.a. post-K. Each of the nine target applications represents the typical workload of its originating priority issue. They cover different numerical algorithms, different space and time discretization schemes, different grid structures, and different data elements. Consequently, they show quite different computing characteristics and expose bottlenecks in various components of the system. Addressing and relaxing the performance bottlenecks of the target applications is expected to help improve the computational performance of applications, not only in the priority issues but also across the wide range of high performance computing demand on Fugaku. The list of target applications and their objectives, numerical algorithms, computing characteristics and estimated performance will be shown in the poster.

The target applications undergo many performance analyses by using various tools. The tools can be categorized, for example, as below:

  1. software simulator executed on a non-Fugaku platform
  2. performance estimation tool based on precise PA data from FX100
  3. hardware emulator
  4. Fugaku prototype test vehicle

There are multiple simulators and estimation tools to support different users and usages. Example outputs from these tools will be shown in the poster.

For each design and update of the application, detailed analysis is conducted in order to identify the latest computing bottleneck, and a corresponding feasibility study of the system design is requested from the system development team and the manufacturing vendor, in the expectation of improved system design parameters. The updated system design in turn gives the target applications the opportunity for further optimization from different perspectives. Thus, the codesign of the target applications and the Fugaku computer is effectively a repeated procedure of mutual optimization in terms of performance, power and economy. The design of Fugaku allows the user to choose the processor operating frequency and the economy mode. Although the default Fugaku operation should provide an efficient environment for most applications, users will be able to explicitly set up the best combination for their applications, as the target application developers do. The estimated performance under restricted power consumption will be shown in the poster.

Codesign in the early stage can contribute to the design of the system architecture, such as the processor and memory specification and configuration, and codesign in the later stage can contribute to the design of system software such as compilers and libraries. Ideally, the codesign of Fugaku and the applications can be completed at the same time. Some codesign achievements reflected in the system will be shown in the poster.

31Maha Mdini, Shigenori Otsuka and Takemasa Miyoshi.
"Precipitation Nowcasting Based on Convolutional Neural Networks"
  1. Objective
    Short-term prediction of precipitation based on the latest precipitation observations is called “nowcasting,” and is widely used for various purposes such as disaster prevention. The aim of this project is to predict precipitation amounts 10-30 minutes ahead by taking advantage of the dense and frequent observations of the Phased-Array Weather Radar (PAWR). Among various techniques, we compared a classical convolutional neural network (CNN) model with a modified version of InceptionV3 [1].
  2. Data
    We used a dataset of 3D PAWR images of the Kobe area with a frequency of one image every 30 seconds. The image size is 321×321×57 pixels. The grid spacing is 250 m. We used three months of data from 01/05/2018 to 31/07/2018. The prediction is based on the images from the preceding hour. For computational reasons, we aggregated the data from 30-second to 10-minute intervals.
  3. Challenges
    1. High dimensionality
      The output is a predicted image having the same size as the input images. The high dimensionality of the output leads to a large number of local minima in the objective function during the optimization process. The model is more likely to converge to a local minimum rather than the global one. To address this issue, we use stacked CNNs, each contributing to the prediction of a portion of the output image.
    2. Sparsity
      The images are sparse (2% on average contain rain). Because the raw data are stored densely despite this sparsity, two issues arise: space complexity and time complexity. Space complexity refers to the inability to fit the data into memory during processing due to its large size. Time complexity denotes the growth in processing time induced by the data size. To deal with these issues, we introduce a compressed data structure that contains only the significant information (rain pixels).
    3. Border Effect
      We have limited information about pixels at the image border, so the prediction accuracy there is lower than for pixels in the image center. As the image size is large, images are split into patches for the learning phase. To prevent accuracy loss due to the border effect at the border of each patch, we use a sliding patch and predict only the central part at each step.
    4. Resolution
      The prediction accuracy for high resolution (HR) images is higher than for low resolution (LR) images. However, for HR images, the accuracy drops quickly after about 10 minutes of prediction, unlike for LR images, where the accuracy is relatively stable. Consequently, we use HR images for short-range prediction up to 10 minutes and LR images for longer-range prediction.
  4. Results and future work
    The results given by CNN are promising in terms of precision and recall. The next step of our work is to extend our study to Convolutional Long Short-Term Memory (LSTM) and to Reservoir computing. Our goal is to compare the performance of different algorithms to select the most suitable solution for 3D precipitation nowcasting.
  [1] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," Computer Vision and Pattern Recognition, 2016.
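
Two of the preprocessing ideas in the nowcasting abstract above (a compressed structure holding only rain pixels, and a sliding patch whose central part alone is predicted) can be sketched as follows. This is an illustrative sketch, not the authors' code: the dictionary-based storage, the function names, and the patch size and margin in the usage note are assumptions; only the 321-pixel image width comes from the abstract.

```python
def compress(volume, threshold=0.0):
    """Return (shape, {(i, j, k): value}) keeping only voxels above threshold."""
    shape = (len(volume), len(volume[0]), len(volume[0][0]))
    entries = {}
    for i, plane in enumerate(volume):
        for j, row in enumerate(plane):
            for k, value in enumerate(row):
                if value > threshold:
                    entries[(i, j, k)] = value
    return shape, entries


def decompress(shape, entries, fill=0.0):
    """Rebuild the dense volume; voxels that were not stored get the fill value."""
    nx, ny, nz = shape
    volume = [[[fill] * nz for _ in range(ny)] for _ in range(nx)]
    for (i, j, k), value in entries.items():
        volume[i][j][k] = value
    return volume


def patch_origins(n, patch, margin):
    """Origins of sliding patches along one axis of length n.

    Only the central part [origin + margin, origin + patch - margin) of each
    patch is predicted; the origins are chosen so that these central parts
    together cover [margin, n - margin). Requires n >= patch > 2 * margin.
    """
    core = patch - 2 * margin
    origins = list(range(0, n - patch + 1, core))
    if origins[-1] != n - patch:
        origins.append(n - patch)  # final patch overlaps its neighbour to reach the edge
    return origins
```

For example, `patch_origins(321, 65, 16)` (hypothetical patch size 65 and margin 16) yields patches whose central 33-pixel cores tile pixels 16 to 304 along one axis of the 321-pixel image.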
32Kazuki Koiso, Naohisa Sakamoto, Jorji Nonaka, Keiji Yamamoto and Fumiyoshi Shoji.
"A Visual Causality Exploration System for HPC Hardware Failure Analysis"
Needless to say, stable and uninterrupted operation is in high demand for any HPC system. However, hardware failure can be considered inherent in the long-term operation of an HPC system, especially due to the large number of hardware components involved. Among different types of failures, those requiring the substitution of the damaged component may impact the normal operation of the system, and understanding the cause of such hardware failures is highly desirable. In this poster, we present a visual causal analytics system, which was initially developed for hardware failure analysis of the K computer. It uses the maintenance record of hardware component substitutions and a log dataset from the K computer system and facility as input data, and provides an interactive visualization workspace for visual causal exploration. What distinguishes it from related systems is the proposed causal network representation, which emphasizes the causal relationships between the involved entities, estimated by using the transfer entropy technique. The interactive graphical user interface provides a workspace for selecting the period of interest for the analysis, and a heat map representation to assist in identifying the spatiotemporal distribution of the hardware failures. For the experimental evaluation of the developed system, we utilized a maintenance record of the K computer for a period of three years (April 2014 to April 2017), focusing on the critical failures of CPU and memory (DIMM), that is, those that required maintenance intervention and component substitution. We also utilized the K computer system and facility log data, from which we extracted the CPU temperature and the cooling water and air temperatures. To verify the intuitive operability of this system, a practical user evaluation was conducted with the help of the Operations and Computer Technologies Division at RIKEN R-CCS.
We will utilize the feedback obtained to improve the system, and also expect to make the necessary adjustments for the Fugaku system environment.
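
For readers unfamiliar with transfer entropy, the following minimal sketch estimates it for two discrete time series with a history length of one step, using plug-in (frequency count) probabilities. The function name, the history length, and the assumption that the logs have already been discretized into symbols are illustrative choices, not details taken from the poster.

```python
import math
from collections import Counter

def transfer_entropy(source, target):
    """Plug-in estimate of the transfer entropy from source to target, in bits.

    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) ),
    where y1 is the next target value and y0, x0 are the current values.
    Both series must be sequences of discrete symbols of equal length.
    """
    n = len(target) - 1
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    pairs_yx = Counter(zip(target[:-1], source[:-1]))
    pairs_yy = Counter(zip(target[1:], target[:-1]))
    singles = Counter(target[:-1])
    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y0, x0)]
        p_given_self = pairs_yy[(y1, y0)] / singles[y0]
        te += p_joint * math.log2(p_given_both / p_given_self)
    return te
```

If the target simply copies the source with a one-step delay, the estimate from source to target approaches one bit for a fair binary source, while the estimate in the reverse direction stays near zero. This asymmetry is what makes the measure usable for drawing directed edges in a causal network.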
33Chigusa Kobayashi, Yasuhiro Matsunaga, Jaewoon Jung and Yuji Sugita.
"Analysis of ligand dissociation in Ca2+-ATPase by using molecular dynamics"

Molecular dynamics (MD) is an effective tool for investigating the functions of biomolecules in a biological system. The method calculates the motion of particles by numerically solving the equations of motion using the interaction forces between atoms. While it can reveal atomistic details of biological functions, in many cases it requires a large amount of computation time. In particular, it is difficult to simulate reaction processes in relatively large proteins, such as membrane proteins.

To overcome this difficulty, we have developed a high-performance MD simulation package, GENESIS, and have performed various simulations efficiently on the K computer and other supercomputers. We have also introduced a rare-event sampling method, the string method, into GENESIS. We applied the method to a calcium pump, Sarco(endo)plasmic reticulum Ca2+-ATPase (SERCA). SERCA is a representative membrane transport protein. The protein transfers Ca2+ ions across a membrane against a large concentration gradient by using ATP hydrolysis. In this study, we applied the string method to investigate a reaction pathway for the dissociation of the nucleotide and Ca2+ ions. We discuss mechanisms of ligand dissociation in SERCA along the reaction pathway.

34Jorji Nonaka, Toshiyuki Tsukamoto, Motohiko Matsuda, Keiji Yamamoto, Akiyoshi Kuroda, Atsuya Uno and Naohisa Sakamoto.
"A Brief Analysis of the K Computer by using the HPC Facility's Water Cooling Subsystem"
The K computer ceased its regular operation in August 2019, and was shut down after 7 years of successful operation, having delivered a high level of availability to its users. Taking into consideration the early access period before the start of the regular operation, the K computer was actually used for almost 8 years. During this running period, the HPC facility's water cooling subsystem was set to deliver cool water in the range of 14°C to 16°C for removing the heat generated by the CPU and ICC (Inter-Connect Controller), and 10°C chilled water was applied to control this water temperature via heat exchangers. A preliminary analysis shows that the use of such a low water temperature contributed to minimizing the energy consumption, probably by reducing the static power consumption of the chips. In addition, although there is no scientific evidence, there is a theoretical hypothesis relating this low operational temperature to the reduced hardware failure rate of the K computer. Obviously, we have to take into consideration that the CPU was produced using a decade-old manufacturing process, with higher leakage current than modern processors. In addition, it seems to have no dynamic frequency scaling functionality for reducing the power consumption during idle time. Therefore, it is probably useless to make any comparison at the microprocessor architecture level. However, from the facility operation point of view, it is valuable to understand the impact of using different temperature settings not only on the cooling subsystems but also on the main HPC system. We utilized the pre-shutdown period, after the end of the regular operation, to make some evaluations of the facility subsystems targeting the next HPC system (the Fugaku supercomputer), which is expected to generate more heat energy from the CPU.
Even during the idle time of the K computer, it was possible to control the water temperature by changing the water flow at the heat exchanger in order to emulate high CPU temperature operation. In this poster, we will present a brief analysis of the probable impacts brought about by the use of higher running temperature, especially on the energy consumption and hardware failure rate.
35Erina Mills, Nicholas Mills and Steven J. Stuart.
"Automated Parallel Optimization of Simulation Parameters using Modified Simplex Algorithm"

The use of simulations for addressing social and scientific issues of high priority is well known. For example, a priority use of the Fugaku supercomputer will be to predict the effects of disasters such as earthquakes and tsunami through the use of complex simulations. These simulations use parameters to define models that are used to evaluate the simulated properties. When developing these models, the goal is to choose the parameters that best replicate a set of desired properties. In such cases, mathematical optimization methods can be used to optimize the simulation parameters by defining a function that uses simulation parameters as an input and outputs a function value describing how well a set of targeted properties are reproduced.

We propose a modified Nelder-Mead (NM) simplex in a master-worker parallel algorithm. The parallelization of the NM simplex algorithm will allow multiple simulations to run in parallel for indefinite amounts of time until the noise is adequately converged and the simulation parameters produce the targeted properties within an appropriate error bound. In preliminary studies, a parallel NM simplex was implemented for optimizing a four-point water model to replicate the experimentally observed internal energy, pressure, self-diffusion coefficient, and pair-correlation function of liquid water. These simulated properties were optimized by modifying the σ and ε parameters found in the Lennard-Jones potential and the partial charges of hydrogen (and the M-site) in the water molecule. These optimized results are compared with experimentally observed values published in the literature. The preliminary study showed that the modified NM simplex was able to optimize the parameters to some extent despite the presence of sampling noise. However, stochastic errors in the simulated results do negatively affect the quality of the optimization.
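
To make the idea concrete, here is a minimal sketch of one possible master-worker Nelder-Mead variant, in which the master speculatively submits the reflection, expansion and contraction points to a worker pool in each iteration and averages repeated evaluations to damp noise. It is not the authors' modified algorithm: the speculative strategy, the averaging, the thread pool, and all names (`averaged`, `parallel_nelder_mead`, `n_samples`, `workers`) are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def averaged(f, x, n_samples):
    """Average repeated evaluations of f at x to damp sampling noise."""
    return sum(f(x) for _ in range(n_samples)) / n_samples

def parallel_nelder_mead(f, simplex, iterations=200, n_samples=1, workers=4):
    """Nelder-Mead in which candidate points are evaluated concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        def evaluate(points):
            # The master hands each point to a worker and collects the values.
            return list(pool.map(lambda p: averaged(f, p, n_samples), points))

        pts = [list(p) for p in simplex]
        vals = evaluate(pts)
        for _ in range(iterations):
            order = sorted(range(len(pts)), key=lambda i: vals[i])
            pts = [pts[i] for i in order]
            vals = [vals[i] for i in order]
            worst = pts[-1]
            dim = len(worst)
            centroid = [sum(p[d] for p in pts[:-1]) / (len(pts) - 1)
                        for d in range(dim)]
            reflect = [c + (c - w) for c, w in zip(centroid, worst)]
            expand = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            contract = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            # Speculative step: all three candidates are evaluated at once.
            f_r, f_e, f_c = evaluate([reflect, expand, contract])
            if f_r < vals[0]:
                pts[-1], vals[-1] = (expand, f_e) if f_e < f_r else (reflect, f_r)
            elif f_r < vals[-2]:
                pts[-1], vals[-1] = reflect, f_r
            elif f_c < vals[-1]:
                pts[-1], vals[-1] = contract, f_c
            else:
                # Shrink every vertex toward the best one and re-evaluate them.
                best = pts[0]
                pts = [best] + [[b + 0.5 * (x - b) for b, x in zip(best, p)]
                                for p in pts[1:]]
                vals = [vals[0]] + evaluate(pts[1:])
        i_best = min(range(len(pts)), key=lambda i: vals[i])
        return pts[i_best], vals[i_best]
```

In a real setting each evaluation would be a separate simulation job rather than a Python thread, and `n_samples` (or the simulation length) would be increased until the sampling noise is small compared with the differences between vertices.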

36Motohiko Matsuda, Hiroshi Shibata, Jorji Nonaka, Toshiyuki Tsukamoto, Keiji Yamamoto and Hajime Naemura.
"R-CCS Facility Simulation Modeling for Assisting Operation Planning and Decision Making"
One of the missions of the Operations and Computer Technologies Division at RIKEN R-CCS is the daily operation of the facility, which includes the management of the electricity and air/water cooling subsystems of a supercomputer. However, the spotlight usually shines only on the achievements obtained on the system, although we spend a lot of effort on delivering continuous and stable operation of these subsystems. We need to prepare for quick responses to unpredictable hardware faults, occasional external incidents such as power outages, and natural phenomena such as severe rainfall and lightning strikes, all of which affect the stable operation of a supercomputer. It is worth mentioning that the facility has achieved long-term stable operation of the K computer, and we expect to continue delivering such high-quality operation for Fugaku. In order to better prepare for Fugaku, we are working on Modelica-based facility simulation models [1] to understand the behavior of the cooling subsystems initially designed for the K computer and modified for Fugaku (Fig. 1). In this poster, we present some trial usages of the models, which include a study on the effects of higher cooling water temperatures caused by hypothetical failure situations. The simulation models are still for the K computer, but we believe that this kind of simulation approach will benefit operation planning to improve energy efficiency and decision making during disastrous events.
37William Dawson, Luigi Genovese and Takahito Nakajima.
"Complexity Reduction in Density Functional Theory Calculations of Large Systems"
In recent years, significant advances in computational power alongside methodological developments have enabled the application of Density Functional Theory (DFT) to systems with tens of thousands of atoms. However, with an increase in system size comes an increase in complexity, making it challenging to interpret the results of such calculations. To aid in the interpretation of large-scale calculations, here we present a systematic complexity reduction framework, which is able to use the results of large-scale DFT calculations to generate coarse-grained models of systems. This framework requires no a priori information, allowing a single workflow to be applied to any type of system. We demonstrate this methodology by showing how it can be used to derive new system descriptors, generate QM/MM partitioning schemes, and generate graph-like views of complex systems.
38Yasuhiro Idomura, Takuya Ina, Yussuf Ali and Toshiyuki Imamura.
"Optimization of Fusion Plasma Turbulence Code GT5D on FUGAKU and SUMMIT"
Turbulent transport is one of the key issues in fusion science. To address this issue via a five-dimensional (5D) gyrokinetic model, the Gyrokinetic Toroidal 5D full-f Eulerian code GT5D [Idomura et al., Nucl. Fusion (2009)] has been developed. GT5D is based on a non-dissipative and conservative finite difference scheme and a semi-implicit time integration scheme, in which a stiff linear 4D convection operator is subject to implicit time integration, and the implicit finite difference solver for fast kinetic electrons occupies more than 80% of the total computing cost. The implicit solver was originally based on a Krylov subspace method (GCR), in which global collective communications and halo data communications were becoming bottlenecks on the latest accelerator-based platforms. This issue was partly resolved by introducing a communication-avoiding Krylov subspace method (CA-GMRES), and good strong scaling was demonstrated on the Oakforest-PACS (KNL) [Idomura et al., ScalA17@SC17]. However, the remaining halo data communications in SpMV (finite difference) still account for a significant part of the cost. To resolve this issue, in this work, the number of SpMVs, and thus of halo data communications, was reduced by improving the convergence property. For this purpose, we developed a new FP16 preconditioner for the CA-GMRES solver on FUGAKU (A64FX). The FP16 preconditioner was designed for the smooth linear operator by fully utilizing the new support for FP16 SIMD operations on A64FX, and achieved an order of magnitude fewer iterations and a ~1.7x speedup compared to the CA-GMRES solver without preconditioning. In the poster, we will also present its performance on SUMMIT (V100), and discuss the performance portability of the new FP16 preconditioner between FUGAKU and SUMMIT.
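
The reason a half-precision preconditioner can be acceptable is that a preconditioner only has to approximate the inverse of the operator: the residual is still computed in full precision, so the final accuracy is unaffected. The sketch below illustrates this with a deliberately simple stand-in, a Jacobi-preconditioned Richardson iteration on a small tridiagonal matrix, and not with CA-GMRES or the GT5D operator. The function names and the test matrix are hypothetical; FP16 rounding is emulated with the standard `struct` format code `'e'`.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def tridiagonal(n, diag, off):
    """Dense n x n tridiagonal matrix (illustrative test operator)."""
    return [[diag if i == j else off if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]

def solve(a, b, inv_diag, tol=1e-12, max_iter=500):
    """Preconditioned Richardson iteration x <- x + M^{-1} (b - A x).

    inv_diag holds the (possibly low-precision) inverse diagonal used as M^{-1};
    the residual b - A x is always evaluated in double precision.
    Returns the solution and the number of residual evaluations performed.
    """
    n = len(b)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) < tol:
            return x, it
        x = [x[i] + inv_diag[i] * r[i] for i in range(n)]
    return x, max_iter

a = tridiagonal(5, 4.1, -1.0)
b = [1.0, 2.0, 3.0, 4.0, 5.0]
inv_fp64 = [1.0 / a[i][i] for i in range(5)]
inv_fp16 = [to_fp16(v) for v in inv_fp64]  # preconditioner stored in FP16
x64, it64 = solve(a, b, inv_fp64)
x16, it16 = solve(a, b, inv_fp16)
```

Both runs converge to the same solution within the residual tolerance, even though the FP16 preconditioner entries carry a relative rounding error of order 1e-3.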
39Jun Nagao, Masahiro Susaki, Abhishek Pillai and Ryoichi Kurose.
"Effects of non-uniform fuel injection on spray combustion instability using large-eddy simulation"

Lean combustion is one of the effective ways to reduce NOx emissions from gas turbine engines. However, lean turbulent combustion is inherently unstable due to the insufficiency of fuel, and under such conditions the combustor is prone to combustion instability. When combustion instability occurs, severe oscillations of pressure and heat release rate are observed. Combustion instability causes loud combustion noise and damage to the combustor. Therefore, understanding the mechanisms of combustion instability is necessary to control it. In order to elucidate the underlying physics of combustion instability, many studies have been conducted, not only with experiments but also with numerical simulations. However, the detailed mechanism of combustion instability is still not completely understood, and the issue remains unresolved because of the complexity of the phenomenon.

Recently, Kitano et al. [1] showed by means of large-eddy simulation (LES) that the droplet diameter distribution of the fuel spray injected into the combustor (with a back-step configuration) has a significant impact on combustion instability. Nagao et al. [2] further showed that the time fluctuations of the liquid fuel droplet diameter distribution and of the liquid fuel mass flow rate, caused by pressure fluctuations in a back-step combustor, also significantly influence combustion instability. In their study, fuel was injected uniformly from each of the seven injectors, and the effect of the large recirculating vortex behind the back-step, in the direction along which the injectors are aligned, was negligible.

Therefore, in this study, we use LES to investigate how varying the fuel mass flow rate among the seven fuel injectors (a fuel injection configuration similar to that of the previous studies [1, 2]) affects the combustion instability occurring in the back-step combustor. Two cases are simulated: one with a uniform fuel injection rate for all injectors, as in the previous studies, and the other with non-uniform fuel injection rates across the injectors. The results show that, with a non-uniform fuel injection rate among the injectors, the large recirculating vortex behind the back-step is damped, thereby reducing the intensity of combustion instability.

40Hideo Matsufuru, Yutaro Akahoshi, Sinya Aoki, Tatsumi Aoyama, Issaku Kanamori, Kazuyuki Kanaya, Yusuke Namekawa, Hidekatsu Nemora and Yusuke Taniguchi.
"Lattice QCD code Bridge++: preparation to Fugaku"

We are developing a general-purpose lattice QCD code `Bridge++' that is written in C++ based on the object-oriented design. Our development policy is to implement a readable, extensible, portable code with sufficiently high performance for productive studies. This code provides various lattice actions, algorithms, and measurement of observables. We further develop a mechanism to incorporate the implementation optimized for a specific architecture. This approach has been applied to GPUs, SIMD (AVX-512), and vector architectures.

For productive runs on Fugaku, we are extending our code in two ways: preparation of an interface to the `QCD Wide SIMD' (QWS) library and optimization within Bridge++. QWS is a QCD library optimized for Fugaku that is being developed in the FS2020 project. Our interface class absorbs the differences in conventions and data layout between Bridge++ and QWS. Since QWS does not cover all the actions demanded in practice, we also extend Bridge++ to cover these actions with the optimization techniques employed in QWS. The preparation of the interface to QWS and the implementation of several major fermion actions in Bridge++ have now been finished. The optimization on the RIKEN simulator is in progress.

41Naoki Yoshioka, Hajime Inaoka, Nobuyasu Ito, Fengping Jin, Kristel Michielsen and Hans De Raedt.
"Optimization for quantum computer simulation"
A simulator of quantum circuits has been developed for massively parallel classical computers, and it has been tested on the K computer at RIKEN R-CCS with up to 45 qubits. Two optimization techniques are proposed in order to improve the performance of the simulator. The "page method" reduces unnecessary copies in each node. It is found that this method yields a speed-up of up to approximately 17%. We also study how the initial permutation of qubits affects the performance of the simulator. It is found that a simple permutation in ascending order of the number of operations on each qubit is sufficient in the case of simulations of quantum adder circuits.
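
As background, a state-vector simulator stores one complex amplitude per basis state, so n qubits need 2**n amplitudes; assuming double-precision complex amplitudes (16 bytes each), 45 qubits correspond to 2**49 bytes, i.e. 512 TiB, which is why such runs need a massively parallel machine. The single-node sketch below shows the core update for a single-qubit gate and a CNOT. It is a generic textbook scheme with hypothetical function names, and does not include the "page method" or any distributed-memory logic from the poster.

```python
import math

def apply_gate(state, gate, qubit):
    """Apply a 2x2 gate to `qubit` of a state vector holding 2**n amplitudes.

    Bit `qubit` of the amplitude index is the value of that qubit.
    """
    stride = 1 << qubit
    new = list(state)
    for base in range(0, len(state), 2 * stride):
        for offset in range(stride):
            i0 = base + offset      # index with the qubit equal to 0
            i1 = i0 + stride        # same index with the qubit equal to 1
            a0, a1 = state[i0], state[i1]
            new[i0] = gate[0][0] * a0 + gate[0][1] * a1
            new[i1] = gate[1][0] * a0 + gate[1][1] * a1
    return new

def apply_cnot(state, control, target):
    """Flip `target` wherever `control` is 1 by swapping amplitude pairs."""
    new = list(state)
    for i in range(len(state)):
        if (i >> control) & 1 and not (i >> target) & 1:
            j = i | (1 << target)
            new[i], new[j] = state[j], state[i]
    return new

HADAMARD = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
            [1 / math.sqrt(2), -1 / math.sqrt(2)]]

# Two-qubit example: |00> -> Hadamard on qubit 0 -> CNOT(0 -> 1) gives a Bell state.
state = [1 + 0j, 0j, 0j, 0j]
state = apply_gate(state, HADAMARD, 0)
bell = apply_cnot(state, 0, 1)
```

In a distributed simulator the amplitude index is split between node-local bits and node-number bits, so gates on "high" qubits require communication; this is one reason why the ordering of qubits can affect performance.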
42Kento Sato, Akiyoshi Kuroda, Kazuo Minami, Jens Domke, Aleksandr Drozd, Mohamed Wahib, Shuhei Kudo, Toshiyuki Imamura, Kiyoshi Kumahata, Keigo Nitadori, Kazuo Ando and Satoshi Matsuoka.
"DL4Fugaku: Deep learning for Fugaku - Scalability Performance Extrapolation -"

Large-scale deep learning has emerged as an essential machine learning approach for many research challenges such as image classification, speech recognition and many others. Fast and large-scale deep learning enables us to train neural networks with more training data in a shorter time. Our next-generation supercomputer, Supercomputer Fugaku, is expected to enable high performance computing for deep learning, since A64FX, the general-purpose processor with which Fugaku is equipped, provides high-speed half-precision floating point (FP16) and 8-bit integer (INT8) operations for matrix multiplications and high-bandwidth HBM2 memory (1,048 GB/sec) for convolutions. In addition, Fugaku employs the next-generation ToFu interconnect (ToFuD) for gradient reduction operations. However, to make use of the Fugaku/A64FX hardware performance, tuning of the software stack, from deep learning frameworks to low-level numerical libraries, is indispensable.

To achieve fast and scalable deep learning on Fugaku, we launched a new project, DL4Fugaku (Deep learning for Fugaku). The goals of the project are (1) performance analysis and tuning of deep learning frameworks and the low-level numerical libraries used by the frameworks; (2) reliable deployment of large-scale deep learning environments; and (3) enhancement of usability for production use on Fugaku. We organized a project team for DL4Fugaku from PIs and researchers in the application development unit, the high-performance AI system research team, the high-performance big data research team and the large-scale parallel numerical computing technology research team, in collaboration with industry, academia and government: AIST, ARM, Cybozu, Fujitsu Laboratories, Fujitsu Limited, Linaro and Tokyo Tech. To facilitate the logistics and accelerate the software development, RIKEN R-CCS signed an MOU with Fujitsu Ltd. for further collaboration on the DL4Fugaku project.

In this poster presentation, we will give an overview of the DL4Fugaku project and of recent scalability performance extrapolation studies for deep learning on Fugaku/A64FX. Our extrapolation suggests good scalability in both multi-threading and multi-processing. We expect that Fugaku will be capable of scalable deep learning with A64FX and the ToFuD interconnect.

43Kentaro Nomura, Youhei Ishihara, Masaki Iwasawa, Daisuke Namekata and Junichiro Makino.
"Optimized particle-particle interaction kernel generator for various architectures"

Particle-based simulation, such as N-body, molecular dynamics, and smoothed particle hydrodynamics simulation, is widely used in science and engineering. To explore new phenomena or to obtain higher resolution or accuracy, large-scale particle-based simulation codes are developed in many fields. Such development requires multi-year effort by multiple people. The features that particle-based simulation codes need in order to achieve high efficiency on massively parallel environments are largely similar. Therefore, a framework for developing particle-based simulation codes is very useful for efficient code development.

FDPS (Framework for Developing Particle Simulator) is an application framework which enables researchers to develop highly parallelized codes for particle-based simulations. FDPS has proved useful and is widely used in many simulation codes. However, users of FDPS still need to tune the particle-particle interaction kernel, which is the most computationally intensive part, by themselves for multiple target architectures (e.g. AVX-512, ARM SVE, CUDA, OpenCL, etc.). Developing multiple highly optimized kernel codes for various target architectures can be very difficult and time consuming for general code developers. Thus, an optimized kernel generator is in great demand.

Here, we report a newly developed optimized particle-particle interaction kernel generator for FDPS. To use the kernel generator, the user only needs to provide information on the particles, including the target precision, and to describe the particle-particle interaction in a high-level language. The kernel generator generates optimized AVX-512 or ARM SVE kernel code from an input file. The generated ARM SVE kernel is composed of ARM C Language Extension (ACLE) intrinsics and is highly optimized for the A64FX processor of Fugaku. We demonstrate an example of a typical N-body simulation on A64FX. We also plan to generate kernels in CUDA and OpenCL.
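
For orientation, the kind of particle-particle interaction that such a generator would turn into SIMD code can be written as a plain reference kernel. The sketch below computes softened gravitational accelerations (with G = 1) in pure Python; it is only a readable reference under stated assumptions, not the generator's input language or its output, and the name `gravity_kernel` and the softening parameter `eps2` are hypothetical.

```python
import math

def gravity_kernel(targets, sources, eps2=1e-6):
    """Accumulate softened gravitational acceleration on each target (G = 1).

    targets: list of (x, y, z) positions receiving the force.
    sources: list of (x, y, z, mass) particles exerting the force.
    eps2:    squared softening length; it keeps the self-interaction finite
             (it then contributes zero acceleration) when a target is also a source.
    """
    accelerations = []
    for xi, yi, zi in targets:
        ax = ay = az = 0.0
        for xj, yj, zj, mj in sources:
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            inv_r = 1.0 / math.sqrt(r2)
            m_inv_r3 = mj * inv_r * inv_r * inv_r
            ax += m_inv_r3 * dx
            ay += m_inv_r3 * dy
            az += m_inv_r3 * dz
        accelerations.append((ax, ay, az))
    return accelerations
```

The inner loop over sources is the part that an optimized kernel vectorizes; the reciprocal square root is typically the dominant cost and is where reduced target precision can pay off.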

44Toshiki Matsushima, Seiya Nishizawa and Shin-ichiro Shima.
"Large eddy simulations of cumulus congestus cloud using super-droplet method"

Clouds consist of many micro-scale cloud droplets. Since the cloud droplet size distribution affects rain onset and radiation, accurate prediction of the droplet size distribution can decrease the uncertainty of weather and climate prediction. In the K-computer era, we implemented the super-droplet method, a sophisticated cloud microphysical model, in SCALE-RM. Using SCALE-RM, we conducted large-eddy simulations of cumulus congestus for the Small Cumulus Microphysical Study field campaign, and we compared the results with aircraft observations. The obtained relation between entrainment and the cloud droplet size distribution is found to be similar to typical observations. We show that the droplet number concentration, the liquid water content, and the standard deviation of the droplet radius are consistent with the observations in the middle and upper layers of the cloud. On the other hand, the standard deviation in the cloud core is smaller than in the observations.

In the poster presentation, we will also discuss a fast microphysical algorithm using FP16 and science targets for Fugaku. Furthermore, we will show an application of scientific visualization using virtual reality techniques, which can integrate cyber space and physical space.

45Ryousei Takano, Takahiro Hirofuchi, Mohamed Wahib, Truong Thao Nguyen, Hiroki Kanezashi and Akram Ahmed.
"Make Friends with Errors: a New Approach of Fault-Tolerant Computer System Design"

The end of Moore's law is coming within a decade because of technical as well as economic reasons. Architecture specialization is a promising approach in the post-Moore era. We have started exploring a new concept of fault-tolerant computer systems that improve the capability and capacity while drastically reducing power consumption. More specifically, we controllably allow hardware errors and develop system software to assure acceptable computational results.

Approximate computing is conceptually similar in that it reduces computation and storage costs by relaxing hardware reliability and application accuracy. Although most approximate computing research targets specific applications such as image processing and convolutional neural networks, we target various applications from AI/deep learning to HPC. We aim to design a well-balanced computer system by exploiting approximate computing and fault-tolerant computing techniques. For example, an error correction technique can result in increased latency and reduced capacity. By taking a holistic approach across the layers from hardware to software, lightweight and appropriate error correction is performed at the software layer while general-purpose error correction in the hardware layer is eliminated.

In this poster, we introduce our vision and current activities, focusing especially on system software and device simulation/emulation technologies.

46Jian Guo and Kento Sato.
"Research on reproducibility and universality in machine learning-based optimizations for prediction job runtime in HPC systems"

User predictions of job runtimes have emerged as an important component of the workload on HPC systems and can have a significant impact on how a job scheduler treats different jobs, and thus on overall performance [1]. At present, backfilling is the most common scheduling method for job schedulers in HPC systems. To use backfilling, the job scheduler must know in advance the predicted runtime of each job. This information is used to compute reservation times when scheduling jobs. However, studies and analyses based on existing job logs show that user runtime predictions are actually rather inaccurate [2], which leads to worse overall performance of HPC systems [3]. Therefore, researchers have begun to try various methods and algorithms to predict job runtimes more accurately and thereby improve the overall efficiency of HPC systems.
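To make the role of the predicted runtime concrete, here is a minimal sketch of the admission test used in EASY-style backfilling, one common variant. It is an illustration only, not the scheduler used on any system named in this abstract; `can_backfill`, `shadow_time` (the reserved start time of the job at the head of the queue) and `extra_nodes` (nodes that remain free even once that job starts) are names assumed for this example.

```python
def can_backfill(job_nodes, predicted_runtime, now,
                 free_nodes, shadow_time, extra_nodes):
    """Return True if a waiting job may start now without delaying
    the reservation held by the job at the head of the queue."""
    if job_nodes > free_nodes:
        return False  # not enough idle nodes right now
    # Either the job is predicted to finish before the reservation starts ...
    finishes_in_time = now + predicted_runtime <= shadow_time
    # ... or it only uses nodes the reserved job will not need.
    fits_beside_reservation = job_nodes <= extra_nodes
    return finishes_in_time or fits_beside_reservation

# The same 2-node job with two different runtime predictions (seconds).
accurate = can_backfill(2, 90, now=0, free_nodes=4, shadow_time=100, extra_nodes=0)
overestimated = can_backfill(2, 300, now=0, free_nodes=4, shadow_time=100, extra_nodes=0)
```

In this example the accurate prediction lets the job start immediately, while the overestimate keeps it waiting even though nodes are idle, which is how inaccurate predictions reduce utilization.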

Recently, machine learning-based techniques have been widely used [4,5,6]. Specifically, researchers train machine learning models on job log data and on features extracted from that data to predict job runtimes. The predicted runtimes are then fed into the backfilling algorithm of the job scheduler. As a result, machine learning-based methods have predicted job runtimes more accurately than traditional methods. However, all machine learning-based techniques rely on job log data and the features extracted from it, and different authors extract different features to train their models. For example, suppose centers A and B build machine learning models to predict job runtime with the same prediction model (machine learning algorithm, e.g. SVM or decision tree). Center A trains its model with features f1, f2, f3, f4 and f5, but center B did not record the job log data needed for f5. Models trained by A and B on different feature sets (feature set A vs. feature set B) may therefore deviate significantly in their results, which causes a reproducibility issue.

On the other hand, even if different data centers use the same model and the same features to predict job runtime, those features are likely to be extracted from the original job log data with different formulas, because there is no uniform feature extraction standard in the field of job runtime prediction for HPC systems. Taking Figure 1 as an example, centers A and B build the same machine learning model for predicting job runtime with the same features (f1, f2, f3, and f4 in this case). However, each center extracts those features with its own extraction formulas, so different feature values are obtained in the end. If the same model is trained on features that share names but are extracted with different formulas, its results will differ, which may lead to a universality issue.

In this research, we study the reproducibility and universality of existing machine learning-based methods for optimizing the maintenance of HPC systems. Starting from predicting the user-estimated job runtime with job log data collected from the K computer at RIKEN and the AAIC computer at AIST, we will try to improve the accuracy of existing prediction models and build new universal models, as well as establish guidelines for system log collection in HPC centers for such optimization purposes.

47Takahiro Ogura, Toyohisa Kameyama, Tomoki Karatsu, Fumichika Sueyasu and Masamichi Takagi.
"Growing Arm Ecosystem with Open Source Software Management Tool Spack"

Users of supercomputers want to use scientific open source software (OSS). However, users must build the scientific applications themselves, and the complexity of building them makes this increasingly difficult. Some prominent OSS packages are installed by HPC site administrators and can be used directly, but users demand many combinations of package versions. It is impossible for site administrators to prepare applications for so many combinations of conditions without using an OSS management tool.

RIKEN Center for Computational Science (R-CCS) is leading the development of Japan's next-generation flagship supercomputer, Fugaku, the successor to the K computer. RIKEN decided to use an OSS management tool called Spack for Fugaku. Spack is a flexible package manager supporting multiple versions, configurations, platforms, and compilers.

There are two issues in using Spack on Fugaku. The first is to prepare OSS build recipes that can be used on Fugaku, which uses A64FX, an Arm architecture CPU. Arm architecture CPUs have only recently been introduced to HPC, and there are far fewer machines using them than x86_64 CPUs. For this reason, Arm support in OSS is not yet sufficient, and each recipe must be handled carefully. We verified each Spack recipe using GNU GCC, the FUJITSU Compiler, and LLVM (CLANG / FLANG) on an Arm architecture machine. This poster introduces the verification results and the know-how obtained during the verification process.

The second issue is that we need to add functionality to Spack so that it can be used in Fugaku's cross-compilation environment. Currently, Spack can distinguish between the backend (execution node architecture) and the frontend (compiling node architecture) for cross-compilation support. For example, OSS used at build time must be compiled for the frontend, and OSS used at link and execution time must be compiled for the backend. However, in the current implementation, the backend architecture is always used as the compilation target architecture. Therefore, it is necessary to add functionality to choose appropriate compilers according to the step in which an OSS package is used (i.e. building, linking, or executing). For example, autoconf for x86_64 is needed to build fftw for Arm on x86_64. This poster introduces how we address this issue.

After these two issues are solved, the recipes need to be updated in a timely manner as the OSS is updated, but it is impossible for RIKEN or FUJITSU to do this work alone. Therefore, it is important to build a relationship in which users and administrators with the same concerns can cooperate with each other, that is, to build an ecosystem. This poster gives a sense of what kind of efforts can be made to build such an ecosystem.

48Yosuke Ueno and Masaaki Kondo.
"Neuromorphic Graph Processing for Minimum Weight Perfect Matching"

The exponential growth of processor performance known as Moore's law is expected to end in the near future because semiconductor process advancement is approaching its physical limit. To achieve further performance improvement in the post-Moore era, we need to make use of new types of computer architectures and computing models such as Neuromorphic Computing (NC). Computer systems based on NC are attracting a lot of attention as a post-Moore architecture for various reasons. For example, NC can potentially mitigate the von Neumann bottleneck, and it is inherently power efficient.

In NC, many simple processing elements inspired by the neurons of a human brain work as computation cores. The communication among them is relatively simple and takes the form of spikes. Therefore, NC has the potential to achieve higher computational efficiency and lower power consumption than traditional architectures. Although most applications of NC are based on neural networks, the defining characteristic of NC, massively parallel computation with many simple computational units, can be applied to other types of applications.

In this work, we study the use of NC for a graph problem, specifically the minimum weight matching problem. In particular, we propose an approximate algorithm for minimum weight perfect matching with NC. We show that the proposed algorithm is equivalent to a greedy algorithm whose approximation ratio is $\frac{1}{2}$. We apply it to several random graphs of different scales and evaluate its performance. We also show an implementation of the proposed algorithm on an FPGA device.
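For readers unfamiliar with greedy matching, the sketch below shows one common greedy heuristic: repeatedly take the lightest remaining edge whose two endpoints are both still unmatched. Whether this is exactly the greedy algorithm the abstract refers to is an assumption, and the sketch is ordinary sequential code, not the neuromorphic implementation. The function name and the `(weight, u, v)` edge format are chosen for this example.

```python
def greedy_matching(edges):
    """Greedy matching: scan edges from lightest to heaviest and keep
    an edge whenever both of its endpoints are still unmatched."""
    matched = set()
    matching = []
    for w, u, v in sorted(edges):
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching

# Complete graph on 4 vertices, edges given as (weight, u, v).
edges = [(1, 0, 1), (2, 0, 2), (2, 1, 3), (5, 0, 3), (5, 1, 2), (10, 2, 3)]
result = greedy_matching(edges)
greedy_weight = sum(w for _, _, w in result)
```

On this small graph the greedy choice of edge (0, 1) forces the heavy edge (2, 3), giving total weight 11, whereas the perfect matching {(0, 2), (1, 3)} has weight 4. This is the sense in which the greedy result is only approximate.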

49Issaku Kanamori.
"Neighboring Communication with uTofu for LQCD Application"

For a large parallel system such as Fugaku, it is important to reduce communication overhead. In Lattice Quantum ChromoDynamics (LQCD) applications, which describe the dynamics of quarks and gluons, solving linear equations requires a lot of neighboring communication. The neighboring communication appears in sparse matrix multiplications, which are 9-point stencil computations on a 4-dimensional spacetime lattice.

To obtain the best performance on Fugaku, we use remote direct memory access (RDMA) through the uTofu interface. In addition to low latency, it enables efficient use of the network bandwidth: we can manually specify which of the 6 RDMA engines (called TNIs) per node is used to send data to each neighbor. We also employ a double buffering technique, which helps to reduce the communication overhead.

In the poster, we present benchmark results for the communication part of our LQCD library. To isolate the communication part, we use a toy two-dimensional system instead of the real LQCD application. Although the results were measured on the evaluation environment and do not guarantee the performance on the actual Fugaku, they are promising: we observe a smaller overhead than with MPI persistent communication, and saturation of the communication bandwidth.

This work is based on collaboration with the co-design team for LQCD applications on Fugaku.

50Kenta Sueki, Tsuyoshi Yamaura, Hisashi Yashiro, Seiya Nishizawa, Ryuji Yoshida, Yoshiyuki Kajikawa and Hirofumi Tomita.
"Convergence of Convective Updraft Ensembles with Respect to the Grid Spacing of Atmospheric Models"

Atmospheric deep moist convection can organize into cloud systems, which impact the Earth’s climate significantly. High-resolution simulations that correctly reproduce organized cloud systems are necessary to understand the role of deep moist convection in the Earth’s climate system. However, there remain issues regarding convergence with respect to grid spacing. To investigate the resolution necessary for a reasonable simulation of deep moist convection, we conducted grid-refinement experiments using state-of-the-art atmospheric models on the K computer (Sueki et al. 2019). We found that the structure of an updraft ensemble in an organized cloud system converges at progressively smaller scales as the grid spacing is reduced. The gap between two adjacent updrafts converges to a particular distance when the grid spacing becomes as small as 1/20–1/40 of the updraft radius. We also found that the converged inter-updraft distance value is not significantly different between Reynolds-averaged Navier–Stokes simulations and large eddy simulations for grid spacings in the terra incognita range.


Sueki, K., et al. 2019, Geophys. Res. Lett., 46,

51Siddharth Jaiswal, Manoj Agarwal and Yogesh Simmhan.
"STEM: STreaming Edge Partitioner based on Motifs"

Large-scale graph data is being accrued from various domains such as web crawls, social networks, biological networks, IoT sensor networks, and road networks. Due to the massive size of these graphs, storing them on a single machine is infeasible from both a storage and a compute perspective. Hence, there is a growing need to process these datasets in a scalable, distributed manner. This is done by distributing the input graph over multiple partitions so that distributed memory and parallel computation can be leveraged to improve the performance of downstream graph algorithms. More recently, most graph data generated from web crawls and social network updates is ingested as streams of vertices or edges, where each arriving graph element is sent to one of P partitions, and the other element type is cut (edges) or replicated (vertices) across partitions, respectively. Graph partitioning itself is an NP-complete problem, and hence various heuristics have been proposed in the literature. We specifically deal with the problem of partitioning graph edge streams, which leads to replication of vertices of the graph. Traditionally, most heuristics attempt to reduce the vertex replication factor while keeping the partitions balanced in the number of edges they store. Such partitioning algorithms ignore inherent topological structures of the graph such as triangles and quasi-cliques. Maintaining such motifs locally within partitions allows analytics such as subgraph mining or clustering to run with lower latency on the distributed graph: because most of the triangles are present within partitions rather than across them, network communication and synchronization across machines are reduced.

We propose a graph partitioning logic that attempts to maximise the formation of triangles within partitions while also satisfying the previously defined graph partitioning goals. The density of triangles within a partition serves as a good indication of the community structure being developed there. We develop a graph partitioner, STEM (STreaming Edge Partitioner based on Motifs), which ingests the graph at a single Reservoir machine and distributes the edges to one of P different Partition machines. The Reservoir maintains exact and approximate data structures that help in making the partitioning decision. The approximate data structures are of fixed size and are updated locally at the Reservoir, whereas the exact data structures are updated using regular updates from the Partition machines. Each Partition machine maintains the local graph it has received, as well as all vertices that form triangles and all vertices that are of sufficiently high degree on that partition. The Reservoir receives edges one at a time and decides on the destination of each edge based on its current state. This reduces the latency of the partitioning decision compared to querying all partitions for every edge. The trade-off here is between the partitioning latency and the staleness of the state maintained. We evaluate our algorithms on 4 real-world graphs, with as many as 18M vertices, 133M edges and 17B triangles. Our best heuristic performs up to 5x better than the state-of-the-art heuristics HDRF and DBH in terms of identifying triangles and reduces the vertex replication factor by up to 44%.
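The general idea of triangle-aware streaming edge partitioning can be sketched as follows. This is a toy, single-process illustration with a hypothetical scoring rule (triangles closed first, then endpoints already present, then load); it is not STEM's actual heuristic and does not model the Reservoir/Partition machines or their approximate data structures. The vertex replication factor is computed with its usual definition: the total number of vertex copies over all partitions divided by the number of distinct vertices.

```python
def partition_stream(edges, num_parts, capacity):
    """Assign each streamed edge (u, v) to one partition.
    Hypothetical score per partition: (triangles the edge would close
    there, endpoints already stored there, negative current load)."""
    adj = [dict() for _ in range(num_parts)]  # local adjacency per partition
    load = [0] * num_parts
    assignment = []
    for u, v in edges:
        best, best_score = None, None
        for p in range(num_parts):
            if load[p] >= capacity:
                continue  # keep partitions balanced in edge count
            nu, nv = adj[p].get(u, set()), adj[p].get(v, set())
            score = (len(nu & nv), (u in adj[p]) + (v in adj[p]), -load[p])
            if best_score is None or score > best_score:
                best, best_score = p, score
        if best is None:
            raise ValueError("all partitions are full")
        adj[best].setdefault(u, set()).add(v)
        adj[best].setdefault(v, set()).add(u)
        load[best] += 1
        assignment.append(best)
    return assignment, adj

def replication_factor(adj):
    """Total vertex copies across partitions / number of distinct vertices."""
    copies = sum(len(a) for a in adj)
    distinct = len(set().union(*[set(a) for a in adj]))
    return copies / distinct

# Two disjoint triangles streamed edge by edge into 2 partitions.
stream = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
assignment, adj = partition_stream(stream, num_parts=2, capacity=3)
rf = replication_factor(adj)
```

In this example each triangle ends up entirely inside one partition, so no vertex is replicated and the replication factor is 1.0.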

52Nobuki Inoue and Takahito Nakajima.
"Efficient calculation method considering charge distribution of nuclei in electronic structure theory"

In electronic structure theory, the interaction between nuclei and electrons in molecules is described by the nuclear attraction integral (NAI).

Nuclei are usually treated as a point charge (PC), a Gaussian distribution (Gauss), or a homogeneously charged sphere (HCS) in electronic structure calculations, because these models have analytic formulas for the NAI. However, they cannot be applied across a wide range of mass numbers.

On the other hand, the function types used to fit experimental data have no analytical formula.

We therefore propose a function form, AG12, for which the nuclear attraction integral can be evaluated analytically with a Gaussian basis set and which can represent a realistic nuclear charge distribution.

This function form enables the analytical calculation of the NAI in terms of two-electron integrals and three-center overlap integrals, and is easily applicable to multi-center systems.

We have also found a procedure to efficiently determine the AG12 parameters for a given charge distribution. This procedure has been shown to work well for 2pF, and fitting to experimental values has also been successful. More detailed results, such as applications to molecular orbital calculations, are shown in the poster.

53Go Tamura, Naohisa Sakamoto, Yasumitsu Maejima and Jorji Nonaka.
"Visual analysis of meteorological ensemble data sets by using stochastic isosurface visualization technique"

Supercomputers have been used to run large-scale ensemble simulations, such as the record-breaking 10,240-member global weather ensemble simulations done on the K computer. However, the typical ensemble size for weather forecasting is usually in the range of dozens, and the ensemble mean of the set of forecasts is commonly used. Such statistical values are also normally used for the visual analysis of ensemble simulation results. However, this kind of visualization can mask important information contained in the full collection of simulation results. In this poster, we present a visualization approach that assists in understanding the spatial and temporal distribution of ensemble simulation results by using the stochastic isosurface visualization technique [1]. We will also present a prototype developed for evaluating the effectiveness of the proposed approach, and present some experimental results obtained with a 20-member meteorological simulation of a severe rainfall event that occurred in Kobe city in 2008. We expect that this system will help computational scientists better understand the behavior of the simulation models they develop.

54Igarashi, Yamaura and Yamazaki.
"Large-scale simulation of a cortico-thalamo-cerebellar circuit using MONET simulator on the K computer"

The next-generation exascale supercomputers are estimated to be able to perform simulations of a human-scale whole-brain model using single-compartment spiking neuron models. However, it remains unknown how the heterogeneous circuit structure of the brain can be parallelized to realize efficient parallel computing in the simulation. In particular, the communication of spike information among compute nodes is a problem on exascale computers. The volume of spike data communicated among compute nodes increases with model size, while the communication performance gain of exascale supercomputers is expected to be less than one-tenth of the gain in processor computation.

We conducted a feasibility study of efficient parallelization and communication methods for a brain model using the K computer, which has a computational performance of 11 petaflops.

In the mammalian brain, the cortex and the cerebellum contain 99% of the neurons in the brain and have a layered sheet structure. The neurons are densely wired within neighboring regions and sparsely wired across remote regions. Therefore, efficient parallel computing of layered sheet spiking neural networks with dense neighboring connections and sparse long-range connections is essential for realizing human-scale whole-brain simulations.

Taking into account these anatomical features of the brain, we propose a parallelization method that combines a tile partitioning method with a method that reduces communication frequency using the minimum signal transmission delay.

We tested the proposed method by applying it to a realistic spiking neural network model of the cortico-cerebello-thalamo-cortical (CCTC) circuit using an in-house simulator, MONET (Mille-feuille like Organization NEural neTwork). The CCTC circuit was developed based on anatomical and electrophysiological features of the brain.

We assigned a cortical tile with 45 thousand neurons, a thalamic tile with 2 thousand neurons, and a cerebellar tile with 200 thousand neurons to compute nodes and tested weak scaling performance. The results showed excellent weak scaling performance for CCTC circuits of 63 million to 1 billion neurons computed on 768 to 12288 compute nodes. The result suggests that the proposed method may lead to human-scale whole-brain simulation on the next-generation exascale supercomputers.
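As a quick check that the two end points quoted above describe a weak-scaling experiment (problem size growing in proportion to node count), the following arithmetic sketch computes the average number of neurons per compute node. The neuron counts are the rounded figures from the abstract, so the per-node values are approximate.

```python
# (compute nodes, total neurons) at the two ends of the reported range.
runs = {768: 63e6, 12288: 1e9}

# Average neurons simulated per compute node at each scale.
neurons_per_node = {nodes: neurons / nodes for nodes, neurons in runs.items()}

node_ratio = 12288 / 768      # growth in node count
neuron_ratio = 1e9 / 63e6     # growth in model size (from rounded figures)
```

Both scales work out to roughly 8 x 10^4 neurons per node, with the node count growing 16-fold and the model size growing by a similar factor.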

55Eisuke Kawashima and Takahito Nakajima.
"Multi-Scale Simulation to Predict Biodegradability of Plastics"

Biodegradable plastics are attracting attention as a way to reduce environmental impact and achieve the Sustainable Development Goals, SDGs. They are applied to disposable items such as packaging, food trays, agricultural films, and medical sutures. Biodegradable aliphatic polyesters (polymers or copolymers of hydroxyalkanoates, or copolymers of diols and dicarboxylic acids) eventually decompose into carbon dioxide and water by microbial metabolism. Examples of current commercial biodegradable polyesters are poly(lactic acid), PLA, poly(ε-caprolactone), PCL, and poly(3-hydroxybutyrate-co-3-hydroxyhexanoate), PHBH [1]. Skeleton structures of these materials determine mechanical and thermal properties and biodegradability, which restricts their application.

We are developing theoretical tools to estimate the biodegradability of plastics and offer design guidelines by applying materials informatics. Density functional theory, DFT, calculations are employed as implemented in NTChem, a quantum chemistry package developed by our group [2]. Methylmethanoate, CH3COOCH3, was chosen as a model ester to guess structures for others. The systems are optimized at the ωB97X-D/def2-SVPD level of theory. Reaction pathways of the hydrolysis of esters under acidic and basic conditions are investigated by the nudged elastic band, NEB [3], and string [4] methods. Calculations are performed on HPC clusters including the K computer, and activation energies and heats of reaction are obtained. A macroscopic Monte Carlo simulator is also implemented to estimate changes in molecular weight distributions under degradation. At initialization, polymers whose molecular weights follow a probability distribution such as log-normal or inverse-gamma are populated to reproduce experimental number- and mass-average molecular weights, and an ester bond is randomly cleaved at each Monte Carlo step. A kinetic Monte Carlo, KMC, simulator is under implementation; it uses rates calculated by DFT and estimates the time evolution of populations. We are developing models to predict these values from monomer structures.
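The macroscopic Monte Carlo step described above (populate chains, then cleave one randomly chosen ester bond per step) can be sketched as follows. This is a minimal illustration, not the authors' simulator: chain sizes are measured in repeat units rather than molecular weight, every bond is assumed equally likely to break, and all function names and parameter values are assumptions for this example.

```python
import random

def lognormal_population(n_chains, mu, sigma, rng):
    """Chain lengths (in repeat units) drawn from a log-normal distribution."""
    return [max(2, round(rng.lognormvariate(mu, sigma))) for _ in range(n_chains)]

def random_scission(chain_lengths, steps, rng):
    """Cleave one uniformly chosen ester bond per Monte Carlo step.
    A chain of n repeat units is assumed to contain n - 1 cleavable bonds."""
    chains = list(chain_lengths)
    for _ in range(steps):
        total_bonds = sum(n - 1 for n in chains)
        if total_bonds == 0:
            break  # only monomers are left
        k = rng.randrange(total_bonds)  # index of the bond to cleave
        for idx, n in enumerate(chains):
            if k < n - 1:
                left = k + 1            # units on the left fragment
                chains[idx] = left
                chains.append(n - left)
                break
            k -= n - 1
    return chains

def number_average(chains):
    return sum(chains) / len(chains)

def mass_average(chains):
    return sum(n * n for n in chains) / sum(chains)

rng = random.Random(0)
initial = lognormal_population(1000, mu=5.0, sigma=0.5, rng=rng)
degraded = random_scission(initial, steps=500, rng=rng)
```

Each scission conserves the total number of repeat units and adds one chain, so the number-average chain length falls deterministically, while the mass average depends on which bonds happen to be cut.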

  1. T. Iwata and Y. Doi, Macromol. Chem. Phys., 1999, 200, 2429–2442.
  2. T. Nakajima, M. Katouda, M. Kamiya and Y. Nakatsuka, Int. J. Quantum Chem., 2015, 115, 349–359.
  3. G. Henkelman and H. Jónsson, J. Chem. Phys., 2000, 113, 9978–9985.
  4. W. E, W. Ren and E. Vanden-Eijnden, Phys. Rev. B, 2002, 66, 052301.

56Ken Seki.
"Green500, Nov. 2019 #1 Fujitsu A64FX prototype 48C 2GHz a.k.a. micro-Fugaku"

A64FX prototype - Fujitsu A64FX 48 cores 2.0GHz was ranked No. 1 on the SC19 Green500 list in November 2019. The system consists of 768 general-purpose A64FX CPUs without accelerators. It achieved a power efficiency of 16.876 Gflops/Watt at quality level 2 during its 1.9995 Pflops Linpack performance run. This figure is approximately three times better than that of the next-best system without accelerators.

It is listed at position 159 in the Top500, with a high calculation efficiency of 84.75% at a small Nmax of 1,576,960.
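The quoted figures can be cross-checked with a short calculation. The sketch below derives the theoretical peak from the CPU count, core count and clock given above, assuming 32 double-precision floating-point operations per cycle per core (two 512-bit fused multiply-add pipelines); that per-core figure is not stated in the abstract and is an assumption here. The implied power is simply the Linpack result divided by the efficiency figure.

```python
cpus = 768
cores_per_cpu = 48
clock_hz = 2.0e9
flops_per_cycle = 32          # assumption: 2 x 512-bit FMA pipelines, FP64

rmax = 1.9995e15              # Linpack result, flops
gflops_per_watt = 16.876

rpeak = cpus * cores_per_cpu * clock_hz * flops_per_cycle
efficiency = rmax / rpeak                        # fraction of peak
implied_power_w = rmax / (gflops_per_watt * 1e9)  # watts

print(f"Rpeak      = {rpeak / 1e15:.4f} Pflops")
print(f"Efficiency = {efficiency:.2%}")
print(f"Power      = {implied_power_w / 1e3:.1f} kW")
```

Under that assumption the peak is about 2.36 Pflops, which reproduces the 84.75% efficiency quoted above, and the implied power during the run is roughly 118 kW.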

This high power efficiency results from both energy-efficient hardware and superior parallel execution efficiency.

Fujitsu measured the power consumption of a quarter of the A64FX prototype system and confirmed that its computation is steady and highly efficient from the point of view of energy.

The A64FX CPU will be widely used by many customers not only as a main CPU for Supercomputer Fugaku but also as a CPU for Fujitsu PRIMEHPC FX1000/FX700 and Cray CS500.

The power-efficient general purpose CPU will enable Fugaku to fulfill its goals: high application performance, power efficiency and usability.

57Muhammed Emin Ozturk.
"Theoretical Approach for Handling Soft Error in Krylov Subspace Methods"

Nowadays, there is growing attention on resilience, particularly as we approach the exascale era. One reason is that with an increasing number of cores or nodes, the likelihood of a failure and the possibility of error increase. In addition, several other factors such as radiation, near-threshold computation and smaller feature sizes also contribute to a greater likelihood of soft errors, which involve bit flips that cause silent data corruption.

Previously, computers have been made resilient through hardware-level detection. Nonetheless, this approach has become too expensive to cover all the components as the system scale grows. Therefore, there has been considerable focus both on detecting soft errors at the software level and on making software resilient against such errors.

However, most of the techniques proposed in the literature so far do not take into account the specific numerical properties of the underlying methods. For instance, redundancy has been used as a promising solution, in which one duplicates the execution and cross-checks the state across different instances. However, such an approach is expensive in terms of the resources required.

In this research, we focus on a class of numerical methods called Krylov subspace methods.

These methods are designed to solve either large sparse matrix eigenvalue problems or large linear systems of equations, both of which are among the most significant problems in scientific computing. Krylov subspace methods play a significant role in solving large, sparse, symmetric and non-symmetric matrix problems. Examples of such methods include Conjugate Gradient (CG), Conjugate Residual (CR), and Generalized Minimum Residual (GMRES). So far, there has been only a limited amount of work on soft error detection for these methods. Moreover, most of these solutions have been heuristic in nature and do not build on an analysis of the numerical properties of the methods. For example, multiple efforts have used the residual values across iterations as an indicator of soft errors in CG. However, the residual is not strictly monotonically decreasing for CG, which results in an indicator of limited accuracy. In this research, we consider three such methods: GMRES, CG, and CR. We concentrate on the problem of efficiently and accurately detecting soft errors that lead to silent data corruption (SDC) in each of these methods.

Our research not only draws on the existing literature on the analysis of these methods, but also identifies properties that lead to efficient and accurate detectors of soft errors. In particular, we identify a term we refer to as the energy norm, which is monotonically decreasing for our target class of methods. We also show other applications of the error norm and the residual value, and expand the set of algorithms to which they can be applied.
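To illustrate how a monotonically decreasing quantity can serve as a soft error detector, the sketch below runs CG on a small symmetric positive definite system and monitors the quadratic functional phi(x) = 1/2 x^T A x - b^T x. For such systems phi differs from half the squared energy norm of the error only by a constant, so it must not increase from one CG iteration to the next. This is a toy example under those assumptions, not the detector proposed in the poster; the fault injection, tolerance and function names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def energy(A, b, x):
    """phi(x) = 0.5 x^T A x - b^T x; equals 0.5 * ||x - x*||_A^2 plus a constant."""
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)

def cg_with_monitor(A, b, max_iters, fault_at=None, fault_size=1e3, tol=1e-12):
    """Plain CG that returns (x, iteration at which phi increased or None)."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = dot(r, r)
    phi_prev = energy(A, b, x)
    for k in range(max_iters):
        if rs < 1e-24:
            break  # converged
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if k == fault_at:
            x[0] += fault_size  # emulate a silent corruption of the iterate
        phi = energy(A, b, x)
        if phi > phi_prev + tol * (abs(phi_prev) + 1.0):
            return x, k  # monotonic decrease violated: flag a soft error
        phi_prev = phi
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, None

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # symmetric positive definite
b = [1.0, 2.0, 3.0]
x_clean, flag_clean = cg_with_monitor(A, b, 10)
x_fault, flag_fault = cg_with_monitor(A, b, 10, fault_at=1)
```

Note that the injected fault changes only the iterate `x`, not the recursively updated residual `r`, so a check based on the recurrence residual alone would not see it. Evaluating phi costs an extra matrix-vector product per iteration in this naive form.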

We have extensively analyzed and evaluated our proposed methods (Monotonic Residual and Energy Norm) using several real matrices. Our evaluation shows high detection accuracy, especially for GMRES and CG. One of our fundamental observations is that the effectiveness and performance of our methods depend on the matrix type: the sparsity structure of the matrix affects the effectiveness of both Energy Norm and Monotonic Residual. Moreover, we demonstrate that the average error due to undetected errors is small, which shows that the most significant errors are detected. Finally, we observe low overheads when applying our methods to GMRES, CG and CR.

58Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Tomohiro Ueno, Kentaro Sano and Taisuke Boku.
"Pipelined Communication Combined with Computation in OpenCL Programming on FPGA"

In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. We can use FPGAs for communication as well as computation thanks to their I/O capabilities. HPC scientists have been unable to utilize FPGAs for their applications because of the difficulty of FPGA development; however, High Level Synthesis (HLS) allows them to use FPGAs at a reasonable cost.

In this study, we propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) that enables us to utilize the high-speed interconnect of FPGAs from OpenCL HLS. CIRCUS builds a single fused pipeline combining computation and communication, which hides the communication latency by completely overlapping the two. In this poster, we propose and evaluate the CIRCUS system for high-speed inter-FPGA communication in OpenCL. CIRCUS extends the channels used for intra-FPGA communication to inter-FPGA communication. By using channels, CIRCUS can create a fused pipeline for both computation and communication, so that computation and communication overlap completely at clock-cycle resolution. Because this characteristic is unique to FPGAs, we believe we can accelerate HPC applications on FPGAs by combining computation and communication.

We used the Cygnus supercomputer operated by the Center for Computational Sciences, University of Tsukuba, for the performance evaluation. Cygnus is a multi-heterogeneous system with a total of 80 nodes, consisting of 48 Deneb nodes and 32 Albireo nodes. The Deneb nodes are CPU + GPU nodes (no FPGAs), and the Albireo nodes are CPU + GPU + FPGA nodes. An Albireo node is equipped with four Intel Xeon CPUs, two NVIDIA V100 GPUs, four Mellanox InfiniBand HDR100 HCAs, and two Bittware (formerly Nallatech) 520N FPGA boards. The Bittware 520N FPGA board is equipped with an Intel Stratix10 FPGA, 32GB of DDR4 external memory, and four QSFP28 external ports supporting up to 100Gbps.

Moreover, there are 64 FPGAs (32 Albireo nodes x 2 FPGAs / node). Therefore, Cygnus has an 8x8 2D-torus network dedicated to FPGAs connected by Mellanox 100Gbps optical cables. We can still use the InfiniBand network independently for CPU or GPU applications. We used up to 16 FPGAs in the following evaluations.

We used three benchmarks to evaluate the CIRCUS system: pingpong benchmark, allreduce benchmark, and Himeno benchmark (19-point stencil computation). According to the pingpong benchmark results, the minimum latency was 0.5μs, and the maximum throughput was 90.2Gbps, and the additional latency per hop was approximately 0.23μs. We used an allreduce-like program to measure the overlapping effect. The maximum throughput was 90.2Gbps, which was the same throughput as the pingpong benchmark result. This result showed that we can make a successful communication-computation combined pipeline. Finally, we evaluated Himeno benchmark performance. We applied CIRCUS communication to the halo and allreduce communication in the benchmark. Strong-scalability was observed in the case of the problem size L, with 94.2% parallel efficiency. We consider this result to be a validation for the implementation of CIRCUS communication to HPC applications.

59 Hao Zhang, Keigo Nishida, Itta Ohmura and Makoto Taiji.
"A Network-on-Chip Generator for Deep Probabilistic Computing Platform"

As deep learning technology advances, deep probabilistic learning is drawing more and more attention. Probabilistic models can outperform deterministic models when the amount of training data is insufficient. Another advantage of probabilistic learning is its ability to address the overfitting problem commonly seen in deterministic learning, which lacks the capability to estimate uncertainty.

In our new project, we plan to develop a high-performance, power-efficient AI System-on-Chip (SoC) for a next-generation computing technology, deep probabilistic computing. The target is to outperform commercial off-the-shelf CPUs by 100 times in power efficiency and by 10 times in communication traffic load. For this purpose, we proposed a heterogeneous processing element (PE) with a RISC-V core (Rocket Core) and an accelerator for probabilistic computing, covering both training and inference. The PEs exchange data over a Network-on-Chip (NoC) for parallel computing. As is well known, NoCs are significant contributors to overall chip power and energy consumption. Consequently, designers have to develop solutions that reduce power consumption while retaining high performance. As one part of our project, a NoC generator is necessary to explore the design space for low-power, high-performance NoC development.

We proposed a NoC generator called EDGENoC (Energy-efficient Deep probabilistic computing-oriented GEnerator for NoC). It is a comprehensive NoC generator written in Chisel. Chisel is a hardware design language that facilitates advanced circuit generation and design reuse for both ASIC and FPGA digital logic designs. It generates both C++ and Verilog models from the same Chisel source code. Chisel also provides the capability of writing parameterizable circuit generators that produce synthesizable Verilog. Our NoC generator makes full use of these capabilities to provide a parameterizable NoC exploration tool. In general, everything is configurable, including router microarchitectures, routing algorithms, and topologies.
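To illustrate what "parameterizable" means for a NoC generator, the sketch below describes a network by a small configuration object and derives the link list and a route from it. It is written in Python for brevity and is not EDGENoC's Chisel code or API; `NoCConfig`, `build_links`, and `xy_route` are hypothetical names, and dimension-order (XY) routing on a mesh is chosen only as a familiar example.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Node = Tuple[int, int]


@dataclass(frozen=True)
class NoCConfig:
    """Hypothetical generator parameters: grid size and topology."""
    width: int
    height: int
    topology: str = "mesh"  # "mesh" or "torus"


def build_links(cfg: NoCConfig) -> Set[Tuple[Node, Node]]:
    """Return the set of directed router-to-router links for the configuration."""
    if cfg.topology not in ("mesh", "torus"):
        raise ValueError(f"unknown topology: {cfg.topology}")
    wrap = cfg.topology == "torus"
    links: Set[Tuple[Node, Node]] = set()
    for x in range(cfg.width):
        for y in range(cfg.height):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if wrap:
                    nx, ny = nx % cfg.width, ny % cfg.height
                elif not (0 <= nx < cfg.width and 0 <= ny < cfg.height):
                    continue  # a mesh has no link beyond its border
                if (nx, ny) != (x, y):
                    links.add(((x, y), (nx, ny)))
    return links


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Dimension-order routing on a mesh: travel along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    for topology in ("mesh", "torus"):
        cfg = NoCConfig(width=3, height=3, topology=topology)
        print(topology, len(build_links(cfg)), "directed links")
    print(xy_route((0, 0), (2, 1)))
```

Changing a single field of the configuration yields a different network, which is the kind of design-space sweep a generator is meant to support.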

We are now developing the framework for our NoC generator using Chisel. The traffic patterns of deep probabilistic computing, for both training and inference, will be analyzed. With the analysis results, we can propose a suitable routing algorithm, topology, and router architecture. Finally, we will compare our NoC generator with pre-validated state-of-the-art simulators, using both the generated C++ and Verilog models with FireSim.

60 Jens Domke, Kento Sato and Masaaki Kondo.
"Counter-based Performance Extrapolation Toolchain: How far can we look into the Future?"

Nowadays, co-design efforts are aided by simulators, such as RIKEN's gem5-based architecture simulator for Supercomputer Fugaku. However, such simulators have severe drawbacks. First, the development cost (human time and labor) is substantial, while runtime estimation errors still fall in the low double-digit percentage range. Second, this labor intensity and the required computer architecture knowledge mean that a near-accurate simulation approach can only be used towards the end of a co-design effort, when system deployment is only a few years or months away. Lastly, simulators have reported slowdowns of 1000-10,000x compared to executing the program on real hardware, so only small toy codes for application hotspots can be tested, instead of full scientific programs running on future HPC systems.

Combined, these drawbacks make simulators inadequate for rapid design-space exploration and performance estimation during the earlier stages of the co-design process. We are currently exploring an alternative approach that combines various tools into a framework designed to quickly test new ideas and extrapolate the performance of known/legacy applications 5 to 10 years into the future. This toolchain shall aid processor and full-system architects in their early "what-if" stages in estimating the effect of a proposed architecture change on full application runtime. The R&D of our framework is still a work in progress, but we believe our poster can foster discussions among different future-architecture groups, and we hope to get feedback from the domestic and international community.

Our poster will sketch the basic concept of our idea and current work, similar to Figure 1. We will explain the project's goals and showcase some of the achieved milestones, which, when combined, should ultimately lead to an easy-to-use framework for extrapolating application runtime. Hereafter, we give the reviewing committee a quick rundown of the framework, so that they can judge the usefulness of our contribution to the community and the symposium.

Assume we have an application X and want to know how much faster it can be executed on future hardware Y compared to the hardware we have right now. Our toolchain analyzes the application with different tools to extract the required information. The application is executed on current hardware to collect the runtime and trace hardware performance counters, such as cache misses, on a per-function basis. Furthermore, we extract the basic blocks, i.e., the set of logical assembly blocks comprising a function, and the number of executions per block via Intel's Software Development Emulator. These basic blocks can be combined into a DAG that represents the full application's execution, and the DAG is enriched with the number of executions per basic block and the counter information. We then utilize llvm-mca, a tool of the LLVM compiler infrastructure project, to estimate the instructions per cycle for a given/future CPU architecture and set of instructions (meaning a given basic block). The future architecture can be easily altered by exchanging the architecture and instruction scheduling models within LLVM. Hence, being able to estimate the execution time of each basic block, and knowing the executions per basic block as well as their dependencies, we can extrapolate the runtime by calculating the distance between the application's start and end within the basic block DAG.

Given this tool, researchers should be able to quickly test new ideas, such as "What if our future architecture can reduce the cache misses by 90% in this application?" or "How much faster can we execute this application if we provide 10 times more floating-point units, and is there enough parallelism in the application to exploit those units?", instead of relying on labor-intensive simulators.
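The extrapolation step described above can be sketched as a weighted longest-path computation over the basic-block DAG. This is only an illustration under stated assumptions, not the authors' toolchain: the "distance" between start and end is interpreted here as the critical path, each block costs its execution count times an estimated cycle count per execution (as a tool such as llvm-mca might provide), and the block names and numbers are invented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extrapolate_cycles(blocks, deps):
    """Estimate total cycles as the critical path through a basic-block DAG.

    blocks: name -> (executions, estimated cycles per execution)
    deps:   name -> set of predecessor block names
    """
    cost = {name: execs * cycles for name, (execs, cycles) in blocks.items()}
    graph = {name: deps.get(name, set()) for name in blocks}
    finish = {}
    # static_order() yields every block after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[name]), default=0.0)
        finish[name] = start + cost[name]
    return max(finish.values(), default=0.0)


if __name__ == "__main__":
    # Hypothetical application with a compute loop and an independent I/O block.
    blocks = {
        "entry": (1, 10.0),
        "loop": (1000, 4.0),
        "io": (1, 500.0),
        "exit": (1, 5.0),
    }
    deps = {"loop": {"entry"}, "io": {"entry"}, "exit": {"loop", "io"}}
    baseline = extrapolate_cycles(blocks, deps)

    # "What-if": a future CPU halves the cycles per execution of the loop block.
    future = dict(blocks, loop=(1000, 2.0))
    print(baseline, extrapolate_cycles(future, deps))
```

A what-if question then amounts to editing the per-block cycle estimates and recomputing the path, which takes far less time than a cycle-level simulation.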

61 Hiroshi Harada, Hidetomo Kaneyama and Chihiro Shibano.
"Incident and operation analysis of HPCI shared storage system R-CCS hub"

HPCI shared storage has achieved continuous non-stop operation for 13 months since October 2018. In FY2018, the system was shut down only twice, and it continued to operate even through natural disasters such as typhoons and lightning strikes. In this presentation, we introduce the incidents and operations at the HPCI shared storage system R-CCS hub over the past 2 years. We show where hardware failures occurred, which parts failed, and which operation errors were made. In particular, all HDDs are inspected regularly; the number of failures is small, and the current MTBF is higher than the vendor's nominal value. This poster presentation will also cover the network traffic of the R-CCS hub, one-site operation, which is the key technology for continuous operation, and the detailed procedures for failover of the master metadata server.

62 Ryohei Kobayashi, Norihisa Fujita, Ayumi Nakamichi, Yoshiki Yamaguchi, Taisuke Boku, Kohji Yoshikawa, Makito Abe and Masayuki Umemura.
"Accelerating Radiative Transfer Simulation with GPU-FPGA cooperative computation"

Graphics processing units (GPUs) offer good peak performance and high memory bandwidth, and they have been widely used as accelerators in high-performance computing (HPC) systems. However, executing parallel applications on such heterogeneous clusters requires inter-accelerator communication between nodes. This in turn requires multiple memory copies, which increases latency and severely degrades application performance, particularly when short messages are involved. Moreover, despite these beneficial characteristics, the GPU is not effective as an accelerator for applications that employ complicated algorithms with exceptions, non-SIMD (single instruction, multiple data) operations, or partially poor parallelism.

To address the above problems, field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research, because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore’s Law. In addition to the improved FPGA performance, FPGA vendors now offer OpenCL-based development toolchains that reduce the programming effort required. These improvements make it possible to offload, on the fly, computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers that use accelerators such as the GPU.

One motivation for such GPU-FPGA coupling is to accelerate multiphysics applications. Multiphysics is defined as the coupled processes or systems involving more than one simultaneously occurring physical field, and the studies of and knowledge about these processes and systems. Multiphysics applications therefore simulate multiple interacting physical properties, and a single simulation contains various kinds of computation. For this reason, accelerating the simulation with the GPU alone is quite difficult, which is why we combine the GPU and the FPGA and let the FPGA cover the computation that is not suited to the GPU.

In this paper, we focus on a radiative transfer simulation code that is based on two types of radiation transfer: radiation transfer from spot light and radiation transfer from spatially distributed light. We make GPUs and FPGAs work together, performing the former on the GPU and the latter on the FPGA. As a result, we realized a GPU-FPGA-accelerated simulation whose performance was up to 10.4x better than that of the GPU-based implementation.

63 Toshiyuki Imamura, Yusuke Hirota and Takuya Ina.
"Re-design of parallel divide and conquer algorithm for a symmetric band matrix"

Cuppen’s divide and conquer algorithm (called DC hereafter), invented in the 1980s, offers trivially high parallelism in parallel numerical eigenvalue calculation, and the DC method has been adopted in many modern parallel numerical libraries such as ScaLAPACK, ELPA, and EigenExa. From the viewpoint of information science, DC follows a binary-tree-structured control flow and data flow. Historically, ScaLAPACK takes advantage of a regular and fixed mapping between a processor grid and distributed data arrays. This was an intuitively reasonable way to implement a general framework in a numerical library. However, it allows an unfavorable load imbalance from the top to the bottom of the binary tree representation. Therefore, we analyzed the mapping problem carefully and re-designed the mapping strategy for flagship-class parallel systems such as Fugaku. The primary idea is borrowed from the construction of a k-d tree with alternating spatial plane splitting. A (k-1)-dimensional hyperplane through the center point can split any k-dimensional data into two groups. This procedure is applied recursively until each group reaches an adequate size. The prototype routine of the DC method for a tridiagonal matrix has already been developed, and some benchmarks were demonstrated on the K computer and the Oakforest-PACS (OFP) system. We have confirmed excellent performance and parallel scalability up to 4096 and 2048 on K and OFP, respectively. Currently, we are porting and optimizing the prototype routines for the Fugaku system as one of the core parts of EigenExa, a parallel eigenvalue solver developed by R-CCS. We expect to provide many users with this brand-new routine and new functionality before the regular operation of Fugaku starts, possibly by the end of 2020.
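The k-d-tree-style splitting mentioned above can be sketched in a few lines. This is an illustration of the general technique only, not the EigenExa mapping: it assumes that the items are coordinate tuples (for example, positions in a 2D process grid), that the split goes through the median along an axis that alternates with depth, and that recursion stops once a group has at most `leaf_size` members. `kd_split` and `leaf_size` are hypothetical names.

```python
def kd_split(points, leaf_size, depth=0):
    """Recursively halve a list of k-dimensional points, alternating the split axis.

    Returns a list of groups, each holding at most leaf_size points.
    """
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    if len(points) <= leaf_size:
        return [list(points)]
    axis = depth % len(points[0])  # cycle through the k axes
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2  # the hyperplane passes through the median
    return (kd_split(ordered[:mid], leaf_size, depth + 1)
            + kd_split(ordered[mid:], leaf_size, depth + 1))


if __name__ == "__main__":
    # A 4x4 grid of (row, column) coordinates split into groups of four.
    grid = [(r, c) for r in range(4) for c in range(4)]
    for group in kd_split(grid, leaf_size=4):
        print(sorted(group))
```

Because the axis alternates at each level, the resulting groups stay compact in every dimension instead of degenerating into long strips, which is the property that makes the idea attractive for balancing work across the levels of a binary tree.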
64 Keiji Onishi, Makoto Tsubokura, Rahul Bale, Koji Nishiguchi and Kazuto Ando.
"Unified Multi-physics Framework for Industrial Scale Simulations"

The objective of our research is to propose a unified simulation method for solving multiple partial differential equations by developing common fundamental techniques, such as effective algorithms for multi-scale phenomena and simulation modeling that makes effective use of massively parallel computer architectures. The target of the unified simulation is the complex, combined phenomena observed in manufacturing processes in industrial cycles. Our final goal is to contribute to enhancing Japanese technological capabilities and industrial process innovation through high-performance computing simulation.

Most of the complex flow phenomena observed in manufacturing processes are related to or coupled with other physical or chemical phenomena, such as turbulent diffusion, structural deformation, heat transfer, electromagnetic fields, or chemical reactions. While computer simulations are rapidly spreading in industry as useful engineering tools, their limitations for such coupled phenomena have recently become apparent. This is because each simulation method has been optimized for a specific phenomenon, and once two or more solvers for different phenomena are coupled for such a complicated target, computational performance is seriously degraded. This is especially true on a high-performance computer such as Fugaku. In such a situation, in addition to the fundamental difficulty of treating different time or spatial scales, interpolating physical quantities like pressure or velocity at the interface of two different phenomena requires additional computational cost and communication among processor cores. Different mesh topologies, and hence different data structures, in each simulation, together with the treatment of different time or spatial scales, also deteriorate single-processor performance. We understand that one of the keys to solving these problems is to adopt a unified structured mesh and data structure across the multiple simulations for coupled phenomena. As a candidate for a unified data structure for complicated and coupled phenomena, we focused on the building-cube method (BCM) proposed by Nakahashi.

In this research, the following recent cases are presented: the verification analysis of full-vehicle aerodynamics, the transient aerodynamic analysis with six-degree-of-freedom motion, the body strength analysis by an Eulerian structural analysis method, and the IC engine combustion analysis with valve/piston motion.