Semi-ScaLAPACK-Compatible 2.5D-PxGEMM based on SUMMA (SC-SUMMA-25D)
Overview
We are developing a parallel matrix multiplication routine (PDGEMM in PBLAS) that achieves good strong scaling on large-scale parallel systems such as the post-K computer (the supercomputer Fugaku) by using the 2.5D algorithm, a communication-avoiding algorithm. The 2.5D algorithm requires a 2.5D matrix distribution, in which a 2D-distributed matrix is stacked over a 3D process grid. To remain compatible with the conventional PDGEMM, which operates on matrices distributed in 2D over a 2D process grid, our implementation redistributes the matrices between the 2D and 2.5D distributions before and after the computation (2D-compatible 2.5D-PDGEMM). We have developed prototype implementations based on Cannon's algorithm and the SUMMA algorithm and evaluated their performance on up to 16,384 nodes of the K computer. Our implementations outperformed conventional 2D-PDGEMMs, including the PBLAS PDGEMM, even when the cost of redistribution between the 2D and 2.5D distributions was included. For example, on 16,384 nodes with matrix size n=32,768, our implementation with stack size c=4 was approximately 3.3 times faster than the 2D implementation.
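To illustrate how a SUMMA-based 2.5D multiplication divides the work, the following is a minimal serial sketch in Python with NumPy. It is not the SC-SUMMA-25D code: the function name `summa_25d_simulated`, the grid edge `q`, and the plain (non-cyclic) block layout are assumptions made for this example, and the broadcasts and reductions that a real implementation would perform with MPI are only mimicked by array indexing. The stack size `c` follows the notation used above.

```python
import numpy as np

def summa_25d_simulated(A, B, q, c):
    """Serially mimic a SUMMA-based 2.5D product on a q x q x c process grid.

    Hypothetical helper for illustration only. Assumes square n x n inputs,
    n divisible by q, and q divisible by c.
    """
    n = A.shape[0]
    assert A.shape == (n, n) and B.shape == (n, n)
    assert n % q == 0 and q % c == 0
    b = n // q  # edge length of the block owned by one process in a layer

    def blk(M, i, j):
        # View of block (i, j) in a simple block (non-cyclic) 2D distribution.
        return M[i * b:(i + 1) * b, j * b:(j + 1) * b]

    # 2D -> 2.5D redistribution: every layer is assumed to hold a copy of
    # A and B. Here all layers simply read the same arrays.
    C_layers = np.zeros((c, n, n))

    steps_per_layer = q // c  # each layer runs only 1/c of the SUMMA steps
    for layer in range(c):
        first = layer * steps_per_layer
        for k in range(first, first + steps_per_layer):
            for i in range(q):
                for j in range(q):
                    # In a distributed run, A(i,k) would be broadcast along
                    # process row i and B(k,j) along process column j.
                    Cb = blk(C_layers[layer], i, j)
                    Cb += blk(A, i, k) @ blk(B, k, j)

    # 2.5D -> 2D: summing the partial results over the layers stands in for
    # the reduction along the stacking dimension.
    return C_layers.sum(axis=0)
```

With `c=1` the sketch reduces to ordinary 2D SUMMA. With larger `c`, each layer executes fewer broadcast steps, at the cost of holding extra copies of the inputs and one final reduction across the layers.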
Downloads
- SC-SUMMA-25D version 1.0a (tgz, 334KB) (April 9, 2021)
- Alpha version. A document ("A Working Note for Development of Semi-ScaLAPACK-Compatible 2.5D-PxGEMM based on SUMMA (SC-SUMMA-25D)") is included.
Publications
- Daichi Mukunoki and Toshiyuki Imamura: Performance Analysis of 2D-compatible 2.5D-PDGEMM on Knights Landing Cluster, Proc. International Conference on Computational Science (ICCS 2018), Lecture Notes in Computer Science, Vol. 10862, pp. 853-858, Jun. 2018 (short paper for poster presentation).
- Daichi Mukunoki and Toshiyuki Imamura: Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer, Proc. 12th International Conference on Parallel Processing and Applied Mathematics (PPAM2017), Lecture Notes in Computer Science, Vol. 10777, pp. 348-358, Mar. 2018.
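- Daichi Mukunoki and Toshiyuki Imamura: Implementation and Evaluation of Distributed Parallel Matrix Multiplication Using the 2.5D Algorithm on the K Computer (original title: 京コンピュータにおける2.5次元アルゴリズムを用いた分散並列行列積の実装と評価), IPSJ SIG Technical Report: High Performance Computing, Vol. 2017-HPC-159, No. 1, pp. 1-6, Apr. 2017 (in Japanese).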
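- Daichi Mukunoki and Toshiyuki Imamura: Development of a High-Performance PDGEMM Using the 2.5D Algorithm (original title: 2.5次元アルゴリズムを用いた高性能PDGEMMの開発), Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 20, No. 4, pp. 31-36, Jul. 2018 (in Japanese).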