Batched BLAS Generator
Overview
Batched BLAS is a recent approach to task-based BLAS invocation: an interface that lets users execute multiple independent BLAS operations in a single subroutine call. Some batched BLAS routines in high demand, such as cblas_batched_xgemm, have been developed for general-purpose CPUs, many-core accelerated processors, and GPUs, and they have accelerated codes in the deep-learning domain. However, existing implementations were limited to a few specific kernels, and a full set of BLAS routines (covering levels 1, 2, and 3) was not provided.
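To make the idea of "multiple independent BLAS operations in a single subroutine call" concrete, here is a minimal sketch in C. It is not the interface shipped in the package: the name sketch_dgemm_vbatched, its argument order, and the naive reference kernel ref_dgemm are assumptions made for illustration only. Each problem in the batch carries its own sizes, scalars, and matrix pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Reference serial kernel: C = alpha * A * B + beta * C for column-major,
   non-transposed matrices. It stands in for an optimized serial dgemm. */
static void ref_dgemm(int m, int n, int k, double alpha,
                      const double *a, int lda,
                      const double *b, int ldb,
                      double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            double *cij = &c[i + (size_t)j * ldc];
            /* As in BLAS, C is not read when beta is zero. */
            *cij = (beta == 0.0) ? alpha * sum : alpha * sum + beta * *cij;
        }
    }
}

/* Hypothetical variable-size batched wrapper (illustrative name and
   signature). Problem i uses m[i], n[i], k[i], alpha[i], beta[i] and the
   matrices a[i], b[i], c[i]. The problems are independent of each other,
   so this loop is the natural place to distribute work across threads. */
void sketch_dgemm_vbatched(const int *m, const int *n, const int *k,
                           const double *alpha,
                           const double *const *a, const int *lda,
                           const double *const *b, const int *ldb,
                           const double *beta,
                           double *const *c, const int *ldc,
                           int batch_count)
{
    assert(batch_count >= 0);
    for (int i = 0; i < batch_count; ++i)
        ref_dgemm(m[i], n[i], k[i], alpha[i], a[i], lda[i],
                  b[i], ldb[i], beta[i], c[i], ldc[i]);
}
```

A caller fills the per-problem arrays once and makes a single call, instead of looping over individual dgemm calls. The matrices in one batch may all have different shapes.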
We have developed the first implementation of a full-set (level 1, 2, and 3) variable-size batched (vbatched) BLAS interface compatible with Intel MKL and MAGMA BLAS. To develop it, we introduced an efficient automatic code generation mechanism. The generated APIs include not only the invocation of the underlying serial BLAS kernels but also parallel task scheduling based on a cost description supplied by the user. Our preliminary evaluation on an Intel Xeon Phi 7210 processor, presented as a conference poster at ISC18 [8], demonstrated that the auto-generated batched BLAS routines achieve performance competitive with the standard BLAS. Our results suggest that such automatic generation would be an effective way to develop batched BLAS routines for the supercomputer Fugaku.
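As an illustration of what scheduling "based on a cost description" could look like, the sketch below assigns batch entries to workers with a greedy longest-task-first heuristic. This is an assumed technique chosen for the example, not necessarily the strategy used by the generator. The names gemm_cost and schedule_by_cost and the flop-count cost model are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int index; double cost; } task_t;

/* Sort by descending cost; ties keep the lower task index first. */
static int by_cost_desc(const void *x, const void *y)
{
    const task_t *a = x, *b = y;
    if (a->cost > b->cost) return -1;
    if (a->cost < b->cost) return 1;
    return a->index - b->index;
}

/* Hypothetical user-supplied cost description for one gemm problem:
   the usual 2*m*n*k floating-point operation count. */
double gemm_cost(int m, int n, int k)
{
    return 2.0 * m * n * k;
}

/* Greedy longest-task-first assignment (an assumed heuristic).
   On return, owner[i] is the worker that runs task i and load[w] is the
   total cost assigned to worker w. Returns 0 on success, -1 if the
   temporary buffer cannot be allocated. */
int schedule_by_cost(const double *cost, int ntasks, int nworkers,
                     int *owner, double *load)
{
    assert(ntasks >= 0 && nworkers > 0);
    for (int w = 0; w < nworkers; ++w)
        load[w] = 0.0;
    if (ntasks == 0)
        return 0;

    task_t *tasks = malloc((size_t)ntasks * sizeof *tasks);
    if (tasks == NULL)
        return -1;
    for (int i = 0; i < ntasks; ++i) {
        tasks[i].index = i;
        tasks[i].cost = cost[i];
    }
    qsort(tasks, (size_t)ntasks, sizeof *tasks, by_cost_desc);

    /* Give each task, most expensive first, to the least-loaded worker. */
    for (int t = 0; t < ntasks; ++t) {
        int best = 0;
        for (int w = 1; w < nworkers; ++w)
            if (load[w] < load[best])
                best = w;
        owner[tasks[t].index] = best;
        load[best] += tasks[t].cost;
    }
    free(tasks);
    return 0;
}
```

Each worker thread would then run the serial kernel only for the batch entries it owns. Placing expensive problems first tends to balance the load when problem sizes vary widely within a batch.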
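Batched BLAS is a new approach to task-based BLAS invocation, a new interface that allows multiple independent BLAS operations to be executed as a single subroutine. Several routines in high demand have already been developed for general-purpose CPUs and GPUs and have contributed to, for example, speeding up deep-learning codes, but a full set of BLAS routines including the level-1/2/3 routines had not been provided. We therefore developed a full-set level-1/2/3 Batched (vbatched) BLAS compatible with Intel MKL and MAGMA BLAS. The code is produced by an automatic code generation mechanism and internally calls existing serial BLAS kernels; to obtain higher performance, users can also define parallel task scheduling based on operation cost. A preliminary evaluation on an Intel Xeon Phi 7210 processor demonstrated that the automatically generated Batched BLAS routines perform on par with standard BLAS (presented as a poster at ISC18). We expect our implementation based on automatic code generation to deliver sufficient performance on the supercomputer Fugaku as well.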
Downloads
- Batched BLAS version 1.0 (tar.gz, 807KB) (February 9, 2021)
- Batched BLAS Generator 1.1 (tgz, 682KB) (February 3, 2022)
  - Name changed from Batched BLAS to Batched BLAS Generator.
  - Modified to generate separate implementations with different scheduling methods (for details, see our MCSoC 2021 paper).
- Batched BLAS Generator 1.2 (tgz, 688KB) (September 21, 2022)
  - README and test codes modified (no changes to the Batched BLAS Generator itself).
Publications
- Yusuke Hirota, Daichi Mukunoki, and Toshiyuki Imamura, Automatic Generation of Full-Set Batched BLAS, Research Poster, International Supercomputing Conference (ISC’18), Jun. 26, 2018.
- Daichi Mukunoki, Yusuke Hirota, and Toshiyuki Imamura, Task Scheduling Strategies for Batched Basic Linear Algebra Subprograms on Many-core CPUs, 14th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2021), Dec. 2021.