# Workshop on Large-scale Parallel Numerical Computing Technology

(LSPANC 2020 January)

— HPC and Computer Arithmetics toward Minimal-Precision Computing —

January 29 – 30, 2020

RIKEN Center for Computational Science (R-CCS), Kobe, Japan

## Overview

In numerical computations, the precision of floating-point arithmetic is a key factor in determining both performance (speed and energy efficiency) and reliability (accuracy and reproducibility). However, precision generally pulls these two in opposite directions: higher precision improves reliability but costs performance, and lower precision does the reverse. The ultimate way to maximize both at once is therefore minimal-precision computation through precision-tuning, which selects the optimal precision for each operation and each piece of data. Several studies have already addressed this, but their scope is limited to precision-tuning alone.
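As a rough illustration of what precision-tuning means, the sketch below tries a dot product in binary32 and binary64 and returns the lowest precision that meets a requested relative accuracy. It is only a toy under stated assumptions: the helper names (`to_fp32`, `minimal_precision`) are hypothetical, binary32 is emulated with the standard `struct` module, and the error is checked against an exact rational reference, which is not how a real tuning tool validates results.

```python
import struct
from fractions import Fraction


def to_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(xs, ys, rnd):
    """Dot product in which every input and operation result is rounded by `rnd`."""
    acc = rnd(0.0)
    for x, y in zip(xs, ys):
        acc = rnd(acc + rnd(rnd(x) * rnd(y)))
    return acc


# Candidate precisions, ordered from cheapest to most expensive.
PRECISIONS = [("fp32", to_fp32), ("fp64", float)]


def minimal_precision(xs, ys, rel_tol):
    """Return (name, value) for the lowest precision whose relative error <= rel_tol.

    Assumption: the exact result is non-zero and is available as a reference
    (computed here with rational arithmetic).
    """
    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))
    for name, rnd in PRECISIONS:
        approx = dot(xs, ys, rnd)
        rel_err = abs(Fraction(approx) - exact) / abs(exact)
        if rel_err <= rel_tol:
            return name, approx
    return None, None


if __name__ == "__main__":
    xs, ys = [0.1, 0.2, 0.3], [1.0, 2.0, 3.0]
    print(minimal_precision(xs, ys, 1e-6)[0])   # a loose tolerance
    print(minimal_precision(xs, ys, 1e-12)[0])  # a tight tolerance
```

With a loose tolerance the cheaper format suffices, while a tight tolerance forces the search up to binary64.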

In 2019, we started the Minimal-Precision Computing project [1] to propose a broader concept: a minimal-precision computing system with precision-tuning that involves both the hardware and software stacks. Specifically, our system combines (1) a precision-tuning method based on Discrete Stochastic Arithmetic (DSA), (2) arbitrary-precision arithmetic libraries, (3) fast and accurate numerical libraries, and (4) Field-Programmable Gate Array (FPGA) with High-Level Synthesis (HLS).
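To give a feel for component (1), the following sketch imitates the idea behind DSA: run the same computation a few times with randomly perturbed rounding and estimate, from the spread of the samples, how many decimal digits they share. This is a crude stand-in and not CADNA: the function names are hypothetical, random rounding is emulated by moving each result to a neighbouring binary64 value with `math.nextafter` (Python 3.9+), and the three samples are produced by independent passes.

```python
import math
import random

# Student's t quantile for 2 degrees of freedom, 95% two-sided confidence.
TAU_95_N3 = 4.303


def rr(x: float, rng: random.Random) -> float:
    """Crude emulation of random rounding: move x one float up or down."""
    return math.nextafter(x, rng.choice((-math.inf, math.inf)))


def significant_digits(samples) -> float:
    """Estimate the number of decimal digits shared by the samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    if var == 0.0:
        return math.log10(2.0 ** 53)  # samples identical: full binary64 precision
    if mean == 0.0:
        return 0.0
    digits = math.log10(math.sqrt(n) * abs(mean) / (math.sqrt(var) * TAU_95_N3))
    return max(0.0, digits)


def noisy_sum(rng: random.Random, n: int = 10_000, term: float = 0.1) -> float:
    """Accumulate `term` n times, perturbing the rounding of every addition."""
    s = 0.0
    for _ in range(n):
        s = rr(s + term, rng)
    return s


def estimate(compute, n_samples: int = 3, seed: int = 0):
    """Run `compute` n_samples times and return (mean, estimated digits)."""
    rng = random.Random(seed)
    samples = [compute(rng) for _ in range(n_samples)]
    return sum(samples) / n_samples, significant_digits(samples)


if __name__ == "__main__":
    # The sum itself is accurate to many digits ...
    print(estimate(noisy_sum))
    # ... but subtracting 1000 cancels them and leaves mostly rounding noise.
    print(estimate(lambda rng: rr(noisy_sum(rng) - 1000.0, rng)))
```

The second call shows the kind of situation such an estimate is meant to flag: a result that looks like a number but carries almost no reliable digits.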

This workshop aims to discuss the future direction of the project by reviewing the available technologies and current challenges together with our project members, guest speakers, and you.

## General information

- **Date**: January 29 – 30, 2020
- **Location**: RIKEN Center for Computational Science (R-CCS), Kobe, Japan (access map), seminar room (ground floor)
- **Registration**: Free of charge
- **Wifi access**: Eduroam and “R-CCS guest” wifi are available.
- **Note**: This workshop does not publish any proceedings.

## Program committee

- Toshiyuki Imamura (RIKEN Center for Computational Science) (*toshiyuki.imamura[at]riken.jp*)
- Daichi Mukunoki (RIKEN Center for Computational Science) (*daichi.mukunoki[at]riken.jp*)
- Roman Iakymchuk (Sorbonne University and Fraunhofer ITWM)

## Speakers (alphabetical order)

### Speakers from project members

- Daichi Mukunoki (RIKEN Center for Computational Science)
- Fabienne Jézéquel (Sorbonne University)
- Jens Huthmann (RIKEN Center for Computational Science)
- Kentaro Sano (RIKEN Center for Computational Science)
- Norihisa Fujita (University of Tsukuba)
- Roman Iakymchuk (Sorbonne University and Fraunhofer ITWM)
- Taisuke Boku (University of Tsukuba)
- Toshiyuki Imamura (RIKEN Center for Computational Science)
- Yiyu Tan (RIKEN Center for Computational Science)

### Guest speakers

- Artur Podobas (RIKEN Center for Computational Science)
- Jens Domke (RIKEN Center for Computational Science)
- Kai Torben Ohlhus (Tokyo Woman’s Christian University)
- Maho Nakata (RIKEN)
- Shuhei Kudo (RIKEN Center for Computational Science)
- Takeo Hoshi (Tottori University)
- Takeshi Fukaya (Hokkaido University)
- Takeshi Ogita (Tokyo Woman’s Christian University)
- Yuki Murakami (University of Aizu)

## Program

### January 28, Tuesday

- 15:00-17:00
*“Tutorial and Hands-on for FPGA and Cygnus”* Norihisa Fujita (University of Tsukuba)

**Tutorial (only for project members)**

### Day 1: January 29, Wednesday

- 9:30-9:40 Opening
- 9:40-10:10
Toshiyuki Imamura (RIKEN Center for Computational Science) (slide (PDF)) **“Overview of minimal-precision computing and (weak)-numerical reproducibility”** Abstract: The primary purpose of this project is to explore the possibility of a new computing system with precision-tuning; it is being conducted in collaboration between RIKEN CCS, Sorbonne University, and University of Tsukuba. In the project, we have investigated “minimal-precision computing,” which aims to achieve reliability (accuracy and reproducibility) as well as high performance (speed and energy) by obtaining results with the accuracy requested by users while using the minimal precision, which eventually leads to a new concept of (weak)-numerical reproducibility. It involves hardware and software stacks combining (1) a precision-tuning method through numerical validation by Discrete Stochastic Arithmetic (DSA), (2) arbitrary-precision arithmetic libraries, (3) fast and accurate numerical libraries, and (4) Field-Programmable Gate Array (FPGA) with high-level synthesis, together with some essential components that we have developed. We will integrate these underlying software and hardware stacks and demonstrate them on cases ranging from a small test to realistic application codes.

- 10:10-10:50
Fabienne Jézéquel (Sorbonne University) (slide (PDF)) **“Precision Auto-Tuning and Control of Accuracy in Numerical Applications”** Abstract: In the context of high performance computing, new architectures, becoming more and more parallel, offer higher floating-point computing power. Thus, the size of the problems considered (and with it, the number of operations) increases, becoming a possible cause for increased uncertainty. As such, estimating the reliability of a result at a reasonable cost is of major importance for numerical software. In this talk we describe the principles of Discrete Stochastic Arithmetic (DSA) that enables one to estimate rounding errors by performing all arithmetic operations several times using a random rounding mode. DSA is implemented, on the one hand, in the CADNA library (http://cadna.lip6.fr) that can be used to control the accuracy of programs in half, single, double and/or quadruple precision, and, on the other hand, in the SAM library (http://www-pequan.lip6.fr/~jezequel/SAM) that estimates rounding errors in arbitrary precision programs. Most numerical simulations are performed in double precision, and this can be costly in terms of computing time, memory transfer and energy consumption. We also present the PROMISE tool (PRecision OptiMISE, http://promise.lip6.fr), based on CADNA, that aims at reducing in numerical programs the number of double precision variable declarations in favor of single precision ones, taking into account a requested accuracy of the results. Finally, in order to combine high performance and control of accuracy in a numerical simulation, we show that the cost of rounding error estimation may be avoided if particular numerical kernels are used with perturbed input data.

- 10:50-11:00 Coffee Break
- 11:00-11:40
Taisuke Boku (University of Tsukuba) (slide (PDF)) **“Cygnus: GPU meets FPGA for HPC”** Abstract: Traditional supercomputing facilities have been contributing to scientific calculations that require very high floating-point performance. Against this background, today’s world-leading supercomputers are equipped with GPUs (Graphics Processing Units) alongside ordinary CPUs (Central Processing Units). In fact, about half of the TOP-10 machines in the world are large cluster systems with tens of thousands of GPUs. However, new fields of scientific computation such as deep learning make much more complicated demands, which traditional raw computing power alone cannot cover. One of the big changes in processor architecture is the change of floating-point precision, for example FP16 (16-bit half-precision floating point). Although new generations of GPUs and CPUs support such requests nowadays, we need a more aggressive challenge toward new system architectures, not only for high performance but also for high performance per energy consumption. At our Center for Computational Sciences, University of Tsukuba, we have been researching original technologies toward next-generation accelerated supercomputing. The GPU is still the main player, but we need to consider a wider variety of other kinds of accelerators. One of the key processor technologies that has recently attracted attention is the FPGA (Field Programmable Gate Array), where the logic circuit itself can be programmed in a hardware description language according to the algorithm of the target application. We are building a new method that combines GPU and FPGA in a single system, so that the flexibility of the FPGA compensates for the weak points of the GPU on complicated algorithms and problems.
As the practical testbed for this challenge, our center introduced the world’s first cluster combining GPU and FPGA technologies for advanced scientific research. In this talk, I will introduce this new concept of supercomputing, along with system development and applications, toward next-generation accelerated supercomputing.

- 11:40-12:10
Kentaro Sano (RIKEN Center for Computational Science) (slide (PDF)) **“Data-flow Compiler for Stream Computing Hardware on FPGA”** Abstract:

- 12:10-13:20 Lunch Break
- 13:20-13:50
Jens Huthmann (RIKEN Center for Computational Science) **“Exploring HLS with arbitrary precision with the Nymble compiler”** Abstract: As Moore’s Law is slowing down, people are looking for other methods to increase the performance of calculations. Relieving the memory bottleneck by decreasing the precision, and thus the input size, is one option. Another option is to increase the number of operations executed on that data by exploiting the capability for parallel computation using FPGAs. Generating these computational units for FPGAs is becoming more and more convenient with HLS compilers such as IntelHLS, VivadoHLS, LegUp and Nymble. However, we do not have full control when integrating arbitrary-precision operations in commercial compilers. In this talk I want to discuss the opportunities in using Nymble with arbitrary-precision operations. The goal of Nymble is to provide high productivity in exploring new approaches by offering high compatibility with standard C code and OpenMP support.

- 13:50-14:20
Norihisa Fujita (University of Tsukuba) **“CIRCUS: Pipelined Inter-FPGA Communication with Computation in OpenCL on Cygnus Supercomputer”** Abstract: We propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) that enables us to utilize the high-speed interconnection of FPGAs from OpenCL HLS. CIRCUS builds a single fused pipeline combining computation and communication, which hides the communication latency by completely overlapping the two. In this talk, I will show the details of the CIRCUS system and the results of the performance evaluation. We used the Cygnus supercomputer operated by the Center for Computational Sciences, University of Tsukuba, for the performance evaluation. Cygnus has 64 Bittware 520N FPGA boards (2 boards / node), and the FPGAs are connected by an 8×8 2D-torus FPGA network. Each Bittware 520N board is equipped with an Intel Stratix10 FPGA, 32GB DDR4 external memory, and four QSFP28 external ports supporting up to 100Gbps.

- 14:20-14:40 Coffee Break
- 14:40-15:10
Yiyu Tan (RIKEN Center for Computational Science) **“Precision Tuning of the Arithmetic Units in Matrix Multiplication on FPGA”** Abstract:

- 15:10-15:40
Takeshi Fukaya (Hokkaido University) **“Investigation into the convergence behavior of the mixed-precision GMRES(m) method using FP64 and FP32”** Abstract: The GMRES(m) method is one of the typical iterative methods for solving a linear system with an unsymmetric sparse coefficient matrix. Based on the restart technique employed in GMRES(m), a mixed-precision variant of GMRES(m) is easily derived. In this talk, we focus on GMRES(m) using FP64 and FP32, and report an experimental evaluation of its convergence properties. This is joint work with Takeshi Iwashita (Hokkaido University).

- 15:40-15:50 Coffee Break
- 15:50-16:20
Maho Nakata (RIKEN ACCC) and Naohito Nakasato (University of Aizu) **“Implementation binary128 version of semidefinite programming solver”** Abstract: Semidefinite programming is an important optimization problem, and higher precision than binary64 (double precision) is required for several applications. We implemented and evaluated a binary128 version of a semidefinite programming solver on a PC, as a step toward using hardware-implemented binary128 on FPGAs.

- 16:20-17:00
Takeo Hoshi (Tottori University) ~~“An a posteriori verification method for generalized real-symmetric eigenvalue problems in large-scale electronic state calculations”~~ (presentation cancelled) Abstract: An a posteriori verification method is proposed for the generalized real-symmetric eigenvalue problems and is applied to densely clustered eigenvalue problems in large-scale electronic state calculations (https://arxiv.org/abs/1904.06461/). The method is realized by a two-stage process in which the approximate solution is generated by existing numerical libraries and then is verified with a moderate computational time. The procedure returns intervals containing one exact eigenvalue in each interval. Test calculations were carried out for organic device materials, and the verification method confirms that all the exact eigenvalues are well separated in the obtained intervals. The verification method will be integrated into EigenKernel (https://github.com/eigenkernel/), a middleware for various parallel solvers for the generalized eigenvalue problem. Such an a posteriori verification method will be important in future computational science.

- 17:00-17:45 Open Discussion

**Session 1: Plenary talks**

**Session 2: FPGA technologies**

**Session 3: Mixed-precision and applications (1)**
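The mixed-precision pattern behind this session, in which cheap low-precision work is restarted and corrected by residuals computed in high precision, can be sketched with classical iterative refinement. This is a stand-in technique and not the GMRES(m) variant from the talk: the inner solver here is dense Gaussian elimination with every result rounded to binary32 (emulated through `struct`), and the helper names are hypothetical. For simplicity the matrix is re-factorized at every step, which a real implementation would avoid.

```python
import struct


def f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_fp32(A, b):
    """Gaussian elimination with partial pivoting, all results rounded to binary32."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, b, x):
    """r = b - A x, evaluated in binary64."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iters: int = 5):
    """Low-precision solve followed by high-precision residual corrections."""
    x = solve_fp32(A, b)
    for _ in range(iters):
        r = residual(A, b, x)          # high precision
        d = solve_fp32(A, r)           # low precision correction
        x = [xi + di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.5], [2.0, 0.5, 5.0]]
    b = [0.1, 0.2, 0.3]
    print(max(abs(v) for v in residual(A, b, solve_fp32(A, b))))
    print(max(abs(v) for v in residual(A, b, refine(A, b))))
```

For this small, well-conditioned system the refined solution reaches a residual near binary64 rounding level even though every solve runs in binary32.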

### Day 2: January 30, Thursday

- 10:40-11:10
Kai Torben Ohlhus (Tokyo Woman’s Christian University) (slide (PDF)) **“More system independent usage of numerical verification algorithms written in high-level programming languages”** Abstract: Many numerical verification algorithms are actively developed using high-level programming languages. For example, the Matlab/GNU Octave software VSDP (https://vsdp.github.io/) is able to compute rigorous error bounds for conic linear programs with up to 19 million variables and 27 thousand constraints using further verification algorithms from INTLAB (http://www.ti3.tu-harburg.de/intlab/). The application to large-scale problems often requires using High-Performance Computing (HPC) systems. Those systems sometimes lack appropriate high-level language support, offer outdated versions, or hardly allow beneficial customization of the pre-installed software, like choosing specialized BLAS/LAPACK implementations. On the other hand, porting verification algorithms to another or lower-level programming language is time consuming and error prone. To overcome these issues, a recent promising approach of using lightweight Singularity (https://sylabs.io/singularity/) containers in combination with Spack (https://spack.io/) to control software dependencies is used. For the verification algorithms all necessary software customization can be prepared and tested on a desktop PC, while the final benchmark is performed on a Singularity-supporting HPC system, which is not rare in practice.

- 11:10-11:20 Coffee Break
- 11:20-12:00
Roman Iakymchuk (Sorbonne University and Fraunhofer ITWM) (slide (PDF)) **“Hierarchical and modular approach for reproducible and accurate linear algebra algorithms”** Abstract: Due to the non-associativity of floating-point operations and dynamic resource utilization on parallel architectures, it is challenging to obtain reproducible floating-point results for multiple executions of the same code on different or even similar parallel architectures. We address the problem of reproducibility in the context of fundamental linear algebra operations – like the ones included in the BLAS library – and propose algorithms that yield both reproducible and accurate results. Following the hierarchical and modular structure of many linear algebra algorithms, we leverage these results and extend them to the LU factorization and Preconditioned Conjugate Gradient method. In this talk, we will present our developments on the ExBLAS library and the higher level algorithms, as well as show how they contribute to the minimal-precision system.

- 12:00-12:30
Daichi Mukunoki (RIKEN Center for Computational Science) (slide (PDF)) **“Accurate BLAS implementations: OzBLAS and BLAS-DOT2”** Abstract: In the minimal-precision computing system, we utilize fast and accurate numerical libraries, instead of MPFR, for accelerating the portions of the computation that require high accuracy. This talk introduces two accurate BLAS implementations developed by us, OzBLAS and BLAS-DOT2. OzBLAS is a reproducible BLAS implementation with tunable accuracy on CPUs and GPUs. It can obtain the correctly-rounded result as well as bit-level reproducibility using the Ozaki scheme. BLAS-DOT2 is an accurate BLAS implementation on GPUs. It computes on double-precision data in two-fold (quadruple) precision using the Dot2 algorithm. Both implementations are available at http://www.math.twcu.ac.jp/ogita/post-k/.

- 12:30-13:30 Lunch Break
- 13:30-14:10
Yuki Murakami (University of Aizu) **“Performance Evaluation of Scientific Applications with Posit by using OpenCL”** Abstract: Posit is a floating-point number format published in 2017 by Gustafson. The main feature of Posit is that the number of bits used for the fraction and exponent parts is variable. The exponent of a Posit is divided into two parts: regime and exponent. The regime bits act like a main scale and the exponent bits act like an auxiliary scale. We implemented Posit arithmetic operations on GPU using OpenCL, and evaluated matrix multiplication and scientific applications, for example, an n-body problem and a tsunami simulation.

- 14:10-14:40
Artur Podobas (RIKEN Center for Computational Science) (slide (PDF)) **“Using Field-Programmable Gate Arrays to Explore Different Numerical Representation: A Use-Case on POSITs”** Abstract: The inevitable end of Moore’s law motivates researchers to re-think many of the historical architectural decisions. Among these decisions we find the representation of floating-point numbers, which has remained unchanged for nearly three decades. Chasing better performance, lower power consumption or improved accuracy, researchers today are actively searching for smaller and/or better representations. Today, a multitude of different representations are found in specialized (e.g. Deep-Learning) applications as well as in general-purpose applications (e.g. POSITs). However, despite their claimed strengths, alternative representations remain difficult to evaluate empirically. There are software approaches and emulation libraries available, but their sluggishness only allows the smallest of inputs to be evaluated and understood. POSIT is a new numerical representation, introduced by Professor John Gustafson in 2017 as a candidate to replace the traditional IEEE-754 representation. In this talk I will present my experience in designing, building and accelerating the POSIT numerical representation on Field-Programmable Gate Arrays (FPGAs). I will start by briefly introducing the POSIT representation, show its hardware implementation details, reason about its trade-offs (with respect to IEEE-754) and conclude the presentation with small use-cases and their measured/obtained performance.

- 14:40-15:00 Coffee Break
- 15:00-15:30
Shuhei Kudo (RIKEN Center for Computational Science) **“How (not) to cheat in HPL-AI”** Abstract: HPL-AI is a new benchmark program for supercomputers, released by Jack Dongarra at ISC 2019 together with a significant performance figure of 445 PFlop/s measured on the world’s fastest supercomputer, Summit. Like the well-known HPL, the program measures the time to solve a large linear system, but it allows mixed-precision techniques followed by iterative refinement in order to take advantage of hardware capabilities such as 16-bit floating-point arithmetic, which is also used in emerging AI workloads. Unfortunately, such lower-precision computation raises problems such as numerical instability and, even worse, can lead programmers to cheat even when they do not intend to. In this talk, we show examples of failures in HPL-AI implementations and discuss the problems of using lower- and mixed-precision computation in scientific computing.

- 15:30-16:00
Jens Domke (RIKEN Center for Computational Science) (slide (PDF)) **“Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?”** Abstract: Among the common wisdom in High-Performance Computing (HPC) is the applications’ need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and legacy software have without doubt followed and contributed to this view. In this talk, we challenge this wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel’s Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNL and KNM deviate architecturally at one important point: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic. With Moore’s law beginning to fail, our results partially reinforce the view taken by modern industry (e.g., Fujitsu’s A64FX CPU) to integrate hybrid-precision hardware units.

- 16:00-16:10 Coffee Break
- 16:10-16:50
Takeshi Ogita (Tokyo Woman’s Christian University) **“Verified Numerical Computations on Supercomputers”** Abstract:

- 16:50-17:20 Discussion
- 17:20-17:30 Closing
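The reproducibility problem described in the ExBLAS abstract above comes from the non-associativity of floating-point addition: the same numbers summed in a different order can give a different result. The short sketch below shows this with four binary64 values and contrasts it with Python's `math.fsum`, which returns the correctly rounded sum regardless of order. `math.fsum` is used here only as a stand-in for an order-independent summation; it is not the algorithm used by ExBLAS.

```python
import math


def naive_sum(xs):
    """Left-to-right summation, as a simple sequential loop would do it."""
    total = 0.0
    for x in xs:
        total += x
    return total


if __name__ == "__main__":
    order_a = [1.0e16, 1.0, -1.0e16, 1.0]
    order_b = [1.0e16, -1.0e16, 1.0, 1.0]  # same values, different order

    # The first 1.0 is absorbed by 1.0e16 in order_a, so the results differ.
    print(naive_sum(order_a), naive_sum(order_b))

    # A correctly rounded sum does not depend on the order of the terms.
    print(math.fsum(order_a), math.fsum(order_b))
```

On a parallel machine the summation order depends on how work is scheduled, which is why results can change from run to run unless the algorithm is designed to be order-independent.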

**Session 4: Numerical verification (1)**

**Session 5: Accurate numerical libraries**
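The Dot2 algorithm named in the BLAS-DOT2 abstract can be sketched in a few lines using error-free transformations: `two_sum` and `two_prod` return a floating-point result together with its exact rounding error, and the errors are accumulated separately. The version below is a plain CPU sketch in Python for illustration, assuming binary64 inputs and no overflow; it is not the GPU implementation from the talk, and the function names are chosen here.

```python
def two_sum(a: float, b: float):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def split(a: float):
    """Veltkamp splitting of a binary64 value into high and low parts."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a: float, b: float):
    """Return (p, e) with p = fl(a * b) and a * b = p + e exactly."""
    p = a * b
    a1, a2 = split(a)
    b1, b2 = split(b)
    e = a2 * b2 - (((p - a1 * b1) - a2 * b1) - a1 * b2)
    return p, e


def dot2(xs, ys):
    """Dot product computed as if in twice the working precision."""
    p, s = 0.0, 0.0
    for x, y in zip(xs, ys):
        h, r = two_prod(x, y)
        p, q = two_sum(p, h)
        s += q + r  # accumulate the rounding errors
    return p + s


def naive_dot(xs, ys):
    """Ordinary left-to-right dot product in binary64."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


if __name__ == "__main__":
    xs, ys = [1.0e16, 1.0, -1.0e16], [1.0, 1.0, 1.0]
    print(naive_dot(xs, ys), dot2(xs, ys))
```

In this example the naive loop loses the small term to cancellation, while the compensated version recovers it.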

**Session 6: Mixed-precision and applications (2)**
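The Posit layout described in the abstracts above (sign, a variable-length regime, up to `es` exponent bits, then the fraction) is easier to see in a small decoder. The sketch below follows the format as introduced in 2017, with the width `n` and exponent size `es` as parameters; the function name and the choice of returning `None` for NaR are assumptions made for this illustration, and it performs decoding only, not arithmetic.

```python
from fractions import Fraction


def decode_posit(bits: int, n: int = 8, es: int = 1):
    """Decode an n-bit posit pattern into an exact Fraction (None for NaR)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None  # NaR ("not a real")

    negative = (bits >> (n - 1)) == 1
    if negative:
        bits = (-bits) & mask  # two's complement yields the magnitude pattern

    body = format(bits, f"0{n}b")[1:]  # drop the sign bit
    first = body[0]
    run = len(body) - len(body.lstrip(first))  # length of the regime run
    k = run - 1 if first == "1" else -run

    rest = body[run + 1:]  # skip the bit that terminates the regime
    exp_bits = rest[:es].ljust(es, "0")  # truncated exponent bits count as 0
    e = int(exp_bits, 2) if es > 0 else 0

    frac_bits = rest[es:]
    frac = Fraction(int(frac_bits, 2), 1 << len(frac_bits)) if frac_bits else Fraction(0)

    # value = useed**k * 2**e * (1 + frac), with useed = 2**(2**es)
    scale = k * (1 << es) + e
    value = (1 + frac) * Fraction(2) ** scale
    return -value if negative else value


if __name__ == "__main__":
    for pattern in (0x01, 0x20, 0x40, 0x48, 0x50, 0x60, 0x7F, 0xC0):
        print(f"{pattern:#04x} -> {decode_posit(pattern)}")
```

The regime acts as a coarse scale in steps of `useed`, and the exponent bits refine it in steps of 2, which is the "main scale" and "auxiliary scale" picture used in the abstract.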

**Session 7: Numerical verification (2)**

## Acknowledgement

This workshop was supported by FOCUS Establishing Supercomputing Center of Excellence.