In this talk, we introduce two research activities aimed at improving vectorization and performance optimization on state-of-the-art HPC platforms. Recent processor designs incorporate wide vector extensions, so SIMD vectorization is more important than ever for exploiting the potential performance of the target architecture. The latest OpenMP specification provides new directives that help compilers produce better code through SIMD auto-vectorization. However, it is hard to optimize SIMD code performance in OpenMP, since the generated SIMD code depends mostly on the compiler implementation. In the first part of the talk, we propose a new directive that specifies user-defined SIMD variants of functions used in SIMD loops. When the compiler encounters OpenMP loops, it can then use the user-defined SIMD variants instead of auto-vectorized ones. The user can thus optimize SIMD performance by implementing highly optimized SIMD code with intrinsic functions. A performance evaluation using an image composition kernel shows that our approach lets the user control SIMD code generation explicitly: the user-defined function reduces the number of instructions by 70% compared with the auto-vectorized code generated from the serial code.
In the latter part of the talk, we propose a programming model for FPGAs. Because of the recent slowdown in silicon technology and the increasing power consumption of hardware, several dedicated architectures have been proposed in High Performance Computing (HPC) to exploit the limited number of transistors on a chip at low power consumption. Although the Field-Programmable Gate Array (FPGA) is considered one of the promising ways to realize dedicated hardware for HPC, FPGAs are difficult for non-experts to program because of the gap between their applications and the hardware-level programming models for FPGAs.
To improve productivity on FPGAs, we propose C2SPD, a C/C++-based programming framework for describing stream processing on FPGAs. C2SPD provides directives that specify the code regions to be offloaded onto the FPGA. Two popular performance optimization techniques, vectorization and loop unrolling, can also be described in the directives. The compiler is built on LLVM, the widely used open source compiler infrastructure. It takes C/C++ code as input and translates it into DSL code for the FPGA backend and binary code for the CPU. The FPGA backend translates the DSL code into Verilog HDL code, which is passed to the vendor's FPGA compiler to generate hardware. The CPU binary code includes C2SPD runtime calls that control the FPGA and transfer data between the CPU and the FPGA. C2SPD assumes a single FPGA device of the PCI-card type, so data transfer goes through the PCI Express interface. As its FPGA backend, the C2SPD compiler uses SPGen, a data-flow High Level Synthesis (HLS) tool for stream processing on FPGAs. The SPGen compiler takes code in its DSL, Stream Processing Description (SPD), and generates pipelined stream cores on FPGAs. Although its domain-specific approach limits the range of applications, it can generate highly pipelined hardware on FPGAs. A 2D stencil computation kernel written in C with C2SPD directives yields FPGA hardware that achieves 175.41 GFLOPS using 256 stream cores. The performance evaluation shows that vectorization can exploit the FPGA memory bandwidth and that loop unrolling can generate a deep pipeline that hides instruction latency. By modifying the numbers in the directives, the user can easily change the configuration of the generated hardware on the FPGA and optimize its performance.
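As background for the first part of the talk, the following is a minimal sketch of the mechanism that standard OpenMP already offers: `declare simd` asks the compiler to auto-generate a SIMD variant of a function called from a SIMD loop. It does not show the directive proposed in the talk, whose syntax is not given in this abstract. The function names `blend_pixel` and `compose`, and the alpha-blend formula standing in for an image composition kernel, are assumptions made for illustration only.

```c
#include <stddef.h>

/* Hypothetical per-pixel operation (simple alpha blend).
 * "declare simd" asks the compiler to generate a vector variant
 * of this function automatically; the quality of that variant
 * depends on the compiler, which is the limitation the talk addresses. */
#pragma omp declare simd notinbranch
static float blend_pixel(float fg, float bg, float alpha)
{
    return alpha * fg + (1.0f - alpha) * bg;
}

/* Hypothetical composition loop. Inside an "omp simd" loop the
 * compiler may call the auto-generated SIMD variant of blend_pixel
 * instead of the scalar version. */
void compose(const float *fg, const float *bg, const float *alpha,
             float *out, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        out[i] = blend_pixel(fg[i], bg[i], alpha[i]);
    }
}
```

With GCC or Clang, the pragmas take effect when compiling with `-fopenmp` or `-fopenmp-simd`; without those flags they are ignored and the code runs as ordinary serial C with the same results.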
Date and time: Friday, December 7, 2018, 15:15 - 16:15
Venue: R-CCS, 6th floor auditorium
Talk title: Research Activities for Parallel Programming Models for Current HPC Platforms