Opening Address

From K to Post-K and Beyond

Satoshi Matsuoka, Director, Riken R-CCS

Riken R-CCS Symposium, Kobe, 20190218

# Post-K: The Game Changer

1. Heritage of the K-Computer, HP in simulation via extensive Co-Design
   - High performance: up to x100 performance of K in real applications
   - Multitudes of Scientific Breakthroughs via Post-K application programs
   - Simultaneous high performance and ease-of-programming
2. New Technology Innovations of Post-K
   - High Performance, esp. via high memory BW: performance boost by "factors" c.f. mainstream CPUs in many HPC & Society5.0 apps
   - Very Green, e.g. extreme power efficiency: ultra power-efficient design & various power control knobs
   - Arm Global Ecosystem & SVE contribution: top CPU in ARM Ecosystem of 21 billion chips/year, SVE codesign and world's first implementation by Fujitsu
   - High Perf. on Society5.0 apps incl. AI: architectural features for high perf on Society 5.0 apps based on Big Data, AI/ML, CAE/EDA, Blockchain security, etc.

Global leadership not just in the machine & apps, but as cutting edge IT. Technology not just limited to Post-K, but into societal IT infrastructures, e.g. Clouds.

## "Post-K" Chronology (part1)

- Jun 2006 Tokyo Tech. TSUBAME1.0 becomes Top500#1 in Japan, first time based on general-purpose multi-core CPU (AMD Opteron)
- Sep 2006 K Computer project officially launched (Kobe 2007)
- May 2009 Fujitsu unveils SPARC64 VIIIfx, same day NEC drops out of K project, K Computer to become purely based on general-purpose multi-core CPU
- Nov 2009 Govt. review almost cancels K => 2010 formulation of *HPCI* (Japan's High Performance Computing Infrastructure consortium) based on NAREGI National Grid Project (2003-2007)
- Nov 2010 Tokyo Tech.
TSUBAME2.0 becomes Top500#1 in Japan, first petascale and many-core SC in Japan (NVIDIA Fermi GPU)
- Apr 2011 Riken AICS (R-CCS predecessor) starts
- Jun 2011 K Computer becomes Top500#1 in the World
- Nov 2011 ACM Gordon Bell Prizes for K Computer & Tsubame2.0

## "Post-K" Chronology (part2)

- ~Apr 2011 SDHPC started, whitepaper on "Post-K" 2018~ SC
- Apr. 2012~2013 "Post-K Feasibility Study" officially starts
  - 3 architecture investigation teams, 1 application requirements team
- Apr 2014 "Post-K" project officially starts at Riken AICS, objective: up to x100 speedup on benchmark apps (NOT Exaflops on Linpack)
- Jun 2014 K Computer becomes Graph500#1: Big Data convergence
- Jun 2016 K Computer becomes HPCG#1: From FLOPS to BYTES
- Nov 2016 U-Tokyo Tsukuba-U Oakforest PACS (Intel KNL) becomes Top500#1 in Japan – first general-purpose
- Aug 2017 HotChips announcement that "Post-K" will adopt Arm ISA
- Apr 2018 Riken AICS => Riken R-CCS (Riken Center for Computational Science), Satoshi (me) becomes Director
- Aug 2018 Fujitsu unveils Arm64fx @ Hotchips2018

## "Post-K" Chronology (part3)

- Oct 2018 Several basic research projects towards future architectures and AI&HPC convergence start at R-CCS
- Nov 2018 AIST ABCI becomes Top500#1 in Japan: HPC & AI convergence becomes real, based on 2017 TSUBAME3.0
- Nov 2018 "Post-K" manufacturing and production officially approved by CSTI, Prime Minister's Science and Technology Committee
- Nov 2018 MEXT committee on investigating the future usage of "Post-K" towards Society 5.0 starts: extreme broadening of SC usage
- Feb 2019 "Post-K" public naming commences, to be announced May~June 2019 – submit your ideas now(!)
- Mar 2019 "Post-K" Hardware specs to be completed by Fujitsu and handed over to Riken R-CCS, machine sized at >150,000 nodes

## "Post-K" Chronology (part4)

(Disclaimer: below includes speculative schedules and is subject to change)

- 1H2019 "Post-K" manufacturing budget approval by the Diet, actual manufacturing commences
- Apr 2019 R-CCS-led research activities on next-gen architectures will commence => whitepaper to be written by Winter
- Aug 2019 End of K-Computer operations
- 4Q2019~1Q2020 "Post-K" installation starts
- 1H2020 "Post-K" preproduction operation starts
- 2020~2021 "Post-K" production operation starts (hopefully)
- And of course we move on…

Watch for announcements on "Post-K" technology commercialization by Fujitsu and its partner vendors RSN

# Apr 1 2018 Became Director of Riken-CCS: Science of Computing, by Computing, and for Computing

#### **Riken Center for Computational Science (R-CCS)**

World Leading HPC Research, active collaborations w/Universities, national labs, & Industry

**Sci. of Computing**: Foundational research on computing in high performance for K, Post-K, and beyond towards the "Post-Moore" era, including future high performance architectures, new computing and programming models, system software, large scale systems modeling, big data analytics, and scalable artificial intelligence / machine learning

**Sci. by Computing**: Breakthrough Science & Technology using high performance computing capabilities of K, Post-K and beyond to address issues of high public concern, in areas such as life sciences, climate & environment, disaster prediction & prevention, advanced manufacturing, and applications of machine learning for Society 5.0.

High Resolution, High Fidelity Analysis & Simulation

**Mutual Synergy**

Novel Future High Performance Computing Architectures & Algorithms

**Sci.
for Computing** New Materials & Electronic Devices, e.g., Photonics, Neuromorphics, Quantum, Reconfigurable

# **Co-Design Activities in Post-K**

Multiple Activities since 2011

#### Science by Computing

- 9 Priority App Areas: High Concern to General Public: Medical/Pharma, Environment/Disaster, Energy, … computational characteristics

#### Science of Computing

- Extremely tight collaborations between the Co-Design apps centers, Riken, and Fujitsu, etc.
- Chose 9 representative apps as "target application" scenario
- Achieved up to x100 speedup c.f. K-Computer
- Also ease-of-programming, broad SW ecosystem, very low power, …

#### "Post-K" Arm64fx Processor is...

- A Many-Core ARM CPU…
  - 48 compute cores + 2 or 4 assistant (OS) cores
  - Brand new core design by Fujitsu
  - Near Xeon-Class Integer performance core
  - ARM V8.2: 64bit ARM ecosystem
  - SVE 512 bit vector extensions (ARM & Fujitsu)
    - Integer (1, 2, 4, 8 bytes) + Float (16, 32, 64 bits)
  - Cache + access localization (sector cache) similar to scratchpad
  - HBM2 OPM: Massive Mem BW (1TByte/s, Bytes/DPF ~0.4 same as K)
    - Streaming memory access, strided access, scatter/gather etc.
  - Intra-chip barrier synch. and other memory enhancing features
  - 40GByte/s Tofu-D interconnect + PCIe 3
- GPU-like High performance in HPC, AI/Big Data, Auto Driving...
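The bandwidth-per-flop figure quoted for the processor can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: the core count, the 512-bit vector width and the 1 TByte/s bandwidth come from the slide, while the two FMA pipelines per core and the 1.8 GHz / 2.0 GHz clock rates are assumptions made for illustration and are not stated on the slide.

```python
def peak_dp_tflops(cores, fma_pipes, vector_bits, clock_ghz):
    """Peak double-precision TFLOPS of a SIMD many-core CPU."""
    lanes = vector_bits // 64                      # 64-bit lanes per vector
    flops_per_cycle = cores * fma_pipes * lanes * 2  # an FMA counts as 2 flops
    return flops_per_cycle * clock_ghz / 1000.0


def bytes_per_flop(mem_bw_gb_s, peak_tflops):
    """Memory bytes available per double-precision flop at peak."""
    return mem_bw_gb_s / (peak_tflops * 1000.0)


# Assumed values (not from the slide): 2 FMA pipes per core, 1.8 or 2.0 GHz.
# 1 TByte/s is taken as 1024 GB/s.
for clock in (1.8, 2.0):
    peak = peak_dp_tflops(cores=48, fma_pipes=2, vector_bits=512, clock_ghz=clock)
    print(f"{clock} GHz: peak {peak:.3f} TF, B/F {bytes_per_flop(1024, peak):.2f}")
```

Under these assumptions the ratio lands at roughly 0.33 to 0.37 bytes per flop, in line with the "~0.4" quoted on the slide.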
# Post-K A64fx A0 (ES) performance

| | Peak TF (DFP) | Peak Mem. BW | Stream Triad | Theoretical B/F | DGEMM Efficiency | Linpack Efficiency | GF/W | Network BW Per Chip |
|---|---|---|---|---|---|---|---|---|
| Post-K A64fx (A0 Eng. Sample) | 2.764/3.072 | 1024GB/s | 840GB/s | 0.37/ | 94 % | 87.7 % | >15 | TOFU-D 40.8GB/s (6.8 x 6) |
| Intel KNL | 3.0464 | 600GB/s | 490GB/s | 0.20 | 66 % | 54.4 % | 4.9 | 12.5GB/s |
| Intel Skylake | 1.6128 | 127.8GB/s | 97GB/s | 0.08 | 80 % | 66.7 % | 4.5 | 6.2GB/s |
| NVIDIA V100 (DGX-2) | 7.8 | 900GB/s | 855GB/s | 0.12 | | 76 % | 15.113 | 160GB/s<br>6.2GB/s |

Columns "Peak TF" through "DGEMM Efficiency" are Performance / CPU; "Linpack Efficiency", "GF/W" and "Network BW Per Chip" are Machine Performance (HPC).

### **Performance**

- A64FX boosts performance by microarchitectural enhancements, 512-bit wide SIMD, HBM2 and process technology
  - More than 2.5x faster in HPC/AI benchmarks than SPARC64 XIfx (Fujitsu's previous HPC CPU)
- The results are based on the Fujitsu compiler optimized for our microarchitecture and SVE

A64FX Benchmark Kernel Performance (Preliminary results). Baseline: SPARC64 XIfx (PRIMEHPC FX100)

~x2 c.f. Haswell Xeon per socket according to Fujitsu studies https://www.ssken.gr.jp/MAINSITE/event/2015/20151028-

#### ARM HPC ECOSYSTEM

(slides courtesy Prof. Simon McIntosh-Smith, U-Bristol)

- "Isambard" Cavium TX2 HPC Cluster
- **Various Portings** and Benchmarking
- Practically all x86 codes work "out-of-the-box"
- Compiler dependency more crucial c.f.
ISA
- Performance competitive due to most applications being memory BW dependent, and Cavium BW 33% superior

'Isambard', a new Tier 2 HPC service from GW4. Named in honour of Isambard Kingdom Brunel (I.K.Brunel 1804-1859).

Isambard system specification (red = new info):

- Cray "Scout" system – XC50 series
- Cavium ThunderX2 processors, 2x 32core @ >2GHz per node
- Aries interconnect
- 10,000+ Armv8 cores
- Cray software tools
- Technology comparison: x86, Xeon Phi, Pascal GPUs
- Phase 1 installed March 2017
- The Arm part arrives early 2018

Codes:

- VASP, CASTEP, GROMACS, CP2K, UM, HYDRA, NAMD, Oasis, SBLI, NEMO
- Note: 8 of these 10 codes are written in FORTRAN
- Additional important codes for project partners:
  - OpenFOAM, OpenIFS, WRF, CASINO, LAMMPS, ...
- We want to collaborate wherever possible!
- Accelerate the adoption of Arm in HPC

@simonmcs http://gw4.ac.uk/isambard/ bristol.ac.uk

#### Post-K Chassis, PCB (w/DLC), and A64fx CPU Package

**FUJITSU CONFIDENTIAL** Copyright 2018 FUJITSU LIMITED

#### TOFU-D 6D Mesh/Torus Network

- Six coordinate axes: X, Y, Z, A, B, C
  - X, Y, Z: the size varies according to the system configuration
  - A, B, C: the size is fixed to $2 \times 3 \times 2$
- Tofu stands for "torus fusion": $(X, Y, Z) \times (A, B, C)$
- Overall shape: $X \times Y \times Z \times 2 \times 3 \times 2$
- Embedded on-Chip
- 0.49 µs latency
- 38.1GByte/s throughput
- Scalable to > 100,000 nodes

#### **Overview of Post-K System**

- Compute Node, Compute + I/O Node connected by TOFU-D
- 3-level hierarchical storage
  - 1<sup>st</sup> Layer: GFS Cache + Temp FS
  - 2<sup>nd</sup> Layer: Lustre-based GFS
  - 3<sup>rd</sup> Layer: Off-site Cloud Storage
- Full Machine Spec
  - More than 150,000 nodes, ~8 million High Perf.
Arm v8.2 Cores
  - More than 400 racks
  - ~40 MegaWatts Machine+IDC, PUE ~ 1.1, High Pressure DLC
  - ~= 15~30 million state-of-the-art competing CPU Cores for HPC workloads (both dense and sparse problems)

# **Sparse BYTES: The Graph500 – 2015~2018 – world #1 x 7**

K Computer #1: Tokyo Tech [Matsuoka EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu

Rank 1

| List | GTEPS |
|---|---|
| November 2013 | 5524.12 |
| June 2014 | 17977.05 |
| November 2014 | 19585.2 |
| June 2015, June 2016 ~ Nov 2018 | 38621.4 |

- **Efficient hybrid**
- Hybrid + Node Compression

Performance c.f. Linpack:

- K Computer: 660,000 CPU Cores, 1.3 Petabyte mem, 20GB/s Tofu NW
- LLNL-IBM Sequoia: 1.6 million CPUs, 1.6 Petabyte mem
- TaihuLight: 10 million CPUs, 1.3 Petabyte mem

BYTES, not FLOPS!

# Massive Scale Deep Learning on Post-K

#### **Post-K Processor**

- High perf FP16 & Int8
- High mem BW for convolution
- Built-in scalable Tofu network

**High Performance DNN Convolution**

Low Precision ALU + High Memory Bandwidth + Advanced Combining of Convolution Algorithms (FFT+Winograd+GEMM)

#### **Unprecedented DL scalability**

High Performance and Ultra-Scalable Network for massive scaling model & data parallelism

Unprecedented Scalability of Data/

#### What is worse: Moore's Law will end in the 2020's

- Much of underlying IT performance growth due to Moore's law
  - "LSI: x2 transistors in 1~1.5 years"
  - Causing qualitative "leaps" in IT and societal innovations
  - The main reason we have supercomputers and Google...
- But this is slowing down & ending, by mid 2020s...!!!
  - End of Lithography shrinks
  - End of Dennard scaling
  - End of Fab Economics

The curse of <u>constant transistor power</u> shall

(Photo: Gordon Moore)

- How do we *sustain* "performance growth" beyond the "end of Moore"?
- Not just one-time speed bumps
- Will affect all aspects of IT, including BD/AI/ML/IoT, not just HPC
- End of IT as we know it

#### 20-year Eras towards the End of Moore's Law

- 1980s~2004 Dennard scaling, perf+ = single thread+ = transistor & freq+ = power+
- 2004~2015 feature scaling, perf+ = transistor+ = core#+, constant power
- 2015~2025 all above gets harder
- 2025~ post-Moore, constant feature & power = flat performance

Need to realize the next 20-year era of supercomputing

(Figure: dotted line extrapolations by C. Moore)

# 2025-2028 Post-Moore FLOPS-to-BYTES x100 Speedup Architecture

(Figure: package diagram with 2.5D DRAM, 3D SRAM, VCSEL Optical Launch Pads, Dataflow+Scalar Processor, Neuromorphic Accelerator, TSV Interposer, VCSEL-based Photonic Network, Organic Substrate)

- 3 nm UV fabrication
- Medium Bandwidth 2.5D DRAM: >64GBytes Capacity, ~3TB/s Bandwidth
- High Capacity Flash NVM: >1 TBytes Capacity
- Multi-Port High Injection NVM/Flash 1Tbps x 12 = 12Tbps
- Low Latency 3D SRAM: >8GBytes Capacity, >10TB/s Bandwidth
- General purpose processor: Heterogeneous reconfigurable dataflow + scalar many-core processor, 200 Teraflops SFP, 20 TeraFlops DFP
- Accelerators: Neural/Neuromorphic, Ising, Graph, etc.
- Direct Chip-Chip Interconnect with DWDM VCSEL micro-optics, 12Tbps injection bandwidth
- Low arity switches for multi-dimensional torus, multi-channel network injection ports

#### Many Core Era

- Flops-Centric Monolithic Algorithms and Apps
- Flops-Centric Monolithic System Software
- Hardware/Software System APIs
- Flops-Centric Massively Parallel Architecture
- Homogeneous General Purpose Nodes + Localized Data (Compute Nodes, each with Gen CPU + Data)
- Loosely Coupled with Electronic Interconnect
- Transistor Lithography Scaling (CMOS Logic Circuits, DRAM/SRAM)

~2025 M-P Extinction Event

#### Post Moore Cambrian Era

- Cambrian Heterogeneous Algorithms and Apps
- Cambrian Heterogeneous System Software
- Hardware/Software System APIs
- "Cambrian" Heterogeneous Architecture
- Novel Devices + CMOS (Dark Silicon) (Nanophotonics, Non-Volatile Devices etc.)

#### Re-thinking of Solvers in the Post-Moore Architecture

Iwashita (Hokkaido U)

Solver does not matter as long as we obtain effective solution

Governing Equation (e.g. Electromagnetic Field):

$$\nabla \times \nabla \times A = -\sigma \frac{\partial A}{\partial t} + J$$

**Traditional** Discretization Solvers (FEM, BEM, etc.)

- Galerkin method, Discretization
- Large Scale Linear System
- Linear Iterative Solver
- Solution
- Linear Solver determines the runtime

**Post-Moore Solver** (Quantum / Neuromorphic Computers)

- **Quantum Annealer Solver**
- Offload whole or part of the solver to Ising Model
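
As a rough illustration of what "offloading the solver to an Ising model" could look like, the sketch below recasts a tiny linear system as a QUBO (quadratic unconstrained binary optimization) problem, which is equivalent to an Ising model under the substitution x = (1 + s) / 2 with spins s = ±1. It is not the method from the slide: the binary restriction on the unknowns, the 3x3 matrix, and all function names are hypothetical choices made for the example, and an exhaustive search stands in for the annealer hardware.

```python
from itertools import product


def qubo_from_least_squares(A, b):
    """Build Q so that x^T Q x + b^T b == ||A x - b||^2 for binary x."""
    rows, n = len(A), len(A[0])
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            Q[i][j] = sum(A[r][i] * A[r][j] for r in range(rows))
        # For binary x, x_i^2 == x_i, so the linear term folds into the diagonal.
        Q[i][i] -= 2.0 * sum(A[r][i] * b[r] for r in range(rows))
    return Q


def qubo_energy(Q, x):
    """Energy x^T Q x of one binary assignment."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def brute_force_minimum(Q):
    """Exhaustive search over all 2^n states; a stand-in for an annealer."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: qubo_energy(Q, x))


# Hypothetical toy system whose exact solution is binary.
A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x_true = (1, 0, 1)
b = [sum(A[r][c] * x_true[c] for c in range(3)) for r in range(3)]

Q = qubo_from_least_squares(A, b)
x_best = brute_force_minimum(Q)
print(x_best, qubo_energy(Q, x_best))
```

Real-valued unknowns would need an additional encoding step (for example, several bits per unknown in fixed point), which grows the number of spins and is one reason the slide speaks of offloading "whole or part" of the solver.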