Date: Friday, January 18, 2019, 15:30 - 16:30
Venue: R-CCS 6th Floor Auditorium
・Title: Nonadiabatic electron dynamics in intense laser fields
Recent advances in laser technology have provided methods to track the dynamics of electrons in molecular systems. A molecular system in an intense laser field undergoes repeated re-collisions between the ionizing electron and the parent ion, together with intense emission of radiation, resulting in high-order harmonic generation (HHG) and above-threshold ionization (ATI). Moreover, the availability of intense X-ray laser pulses allows further investigation of autoionization and the Auger effect. Theoretical studies of HHG and ATI processes have succeeded in giving reasonably accurate results for single atoms or very small molecules. In this talk, I will introduce my theoretical studies on the nonadiabatic electron dynamics of molecules in intense laser fields. A time-dependent configuration interaction method has been applied to a single water molecule including the nuclear nonadiabatic effect. The electronic wavefunction is represented by complex natural orbitals, which allows evaluation of the electron flux, a one-particle operator, for each natural orbital. Within this treatment, we can determine how the nuclear nonadiabatic force induces the electron flux and assess the validity of the adiabatic approximation. The ionization process has been included in the calculations by means of continuity equations for the electron fluxes of each complex natural orbital, which allows us to target molecular systems in intense laser fields. The nuclear nonadiabatic effect introduces an isotope effect into the ATI and HHG processes for systems in intense laser fields. The path-branching method has also been implemented, whereby the symmetry-breaking force induced by the nuclear nonadiabatic effect leads to spontaneous symmetry breaking of the electronic state. A single water molecule in an intense laser field undergoes electronic excitation to a continuum state.
The symmetry breaking induced by the nuclear nonadiabatic effect in such a system can be attributed to the pseudo-Jahn-Teller effect. The numerical calculation allows a simple evaluation of complex nuclear nonadiabatic effects among quasi-degenerate electronic states.
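The per-orbital continuity-equation treatment of ionization mentioned above can be written schematically as follows; the sink term \(\Gamma_k\) representing ionization loss is a notational assumption here, not the talk's exact formulation.

```latex
% One-particle continuity equation for each complex natural orbital k:
% rho_k is the orbital density, j_k the orbital electron flux, and the
% (assumed) sink term Gamma_k accounts for outgoing ionized density.
\frac{\partial \rho_k(\mathbf{r},t)}{\partial t}
  + \nabla \cdot \mathbf{j}_k(\mathbf{r},t)
  = -\,\Gamma_k(\mathbf{r},t)
```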
Date: Friday, January 11, 2019, 13:00 - 14:00
Venue: R-CCS 6th Floor Auditorium
・Title: Beyond Post-K and Moore’s Law --- Imminent Failure of FLOPS-centric HPC leading to a Bright Future towards Post-Moore
・Speaker: Satoshi Matsuoka (Director, R-CCS)
No one can now deny the inevitability that Moore’s Law, which has sustained performance growth in computing in terms of FLOPS, is approaching its end; already the number of fabs that can sustain continued shrinkage of lithography is less than a handful, as technical difficulties result in soaring costs with little hope of economic gain. One-time architectural techniques to circumvent the limitations, such as many-core architectures and reduced-precision computing, have already been applied and thus cannot be “used again” to attain speedups. As such, we might consider ourselves to be in crisis: as transistor power improvements saturate over time, our next machines will no longer be significantly faster in terms of “catalog FLOPS” than the machines of the early 2020s, since all the techniques to improve FLOPS in a straightforward way will have been used up, be it for HPC or for other applications such as AI. In reality, however, the situation is welcome, as we can finally break away from classical, FLOPS-centric thinking (as represented by outdated metrics such as Linpack/Top500) and move on to the realm of new computing paradigms that improve the time-to-solution of real applications. Such “Post-Moore” research has already arisen globally, and it is essential that RIKEN R-CCS, as the leadership center for high performance computing in Japan, break away from the FLOPS mindset and assume leadership in alternative devices and architectural parameters, leading to new programming, new algorithms and new applications. Already some of us are at the task with new research agendas, from data- or BYTES-centric computing and the use of machine learning to augment or replace first-principles simulations, to new computing models such as neuromorphic and quantum computing. We will discuss our future research roadmaps and activities for research on machines beyond Post-K in the late 2020s, as well as into the 2030s.
Date: Friday, January 11, 2019, 14:00 - 15:00
Venue: R-CCS 6th Floor Auditorium
・Title: An Innovative Method for Integration of Simulation/Data/Learning in the Exascale/Post-Moore Era
・Speaker: Kengo Nakajima (Deputy Director, R-CCS)
"ppOpen-HPC" is an open source infrastructure for development and execution of optimized and reliable simulation code on post-peta-scale (pp) parallel computers based on many-core architectures, and it consists of various types of libraries, which cover general procedures for scientific computation. Source code developed on a PC with a single processor is linked with these libraries, and the parallel code generated is optimized for post-peta-scale systems with manycore architectures, such as the Oakforest-PACS system of Joint Center for Advanced High Performance Computing (JCAHPC). "ppOpen-HPC" is part of a five-year project (FY.2011-2015) spawned by the "Development of System Software Technologies for Post-Peta Scale High Performance Computing" funded by JST-CREST. The framework covers various types of procedures for scientific computations, such as parallel I/O of data-sets, matrix-assembly, linear-solvers with practical and scalable preconditioners, visualization, adaptive mesh refinement and dynamic load-balancing, in various types of computational models, such as FEM, FDM, FVM, BEM and DEM. Automatic tuning (AT) technology enables automatic generation of optimized libraries and applications under various types of environments. We release the most updated version of ppOpen-HPC as open source software every year in November (2012-2015), which is available at http://ppopenhpc.cc.u-tokyo.ac.jp/ppopenhpc/ . In 2016, the team of ppOpen-HPC joined ESSEX-II (Equipping Sparse Solvers for Exascale) project (Leading P.I. Professor Gerhard Wellein (University of Erlangen-Nuremberg)), which is funded by JST-CREST and the German DFG priority programme 1648 "Software for Exascale Computing" (SPPEXA) under Japan (JST)-Germany (DFG) collaboration until FY.2018. 
In ESSEX-II, we are developing pK-Open-HPC (an extended version of ppOpen-HPC, a framework for exa-feasible applications), preconditioned iterative solvers for quantum sciences, and a framework for automatic tuning (AT) with a performance model. In the presentation, various achievements of the ppOpen-HPC, ESSEX-II, and pK-Open-HPC projects will be shown, such as applications using the HACApK library for H-matrix computation, coupled simulations by ppOpen-MATH/MP, and parallel preconditioned iterative solvers.
Supercomputing in the exascale and Post-Moore era is inherently different from that in the petascale era and before. Although supercomputers have been the essential tool for computational science for the past 30 years, they are now also used for other purposes, such as data analytics and machine learning. The architecture of next-generation supercomputing systems is essentially heterogeneous to serve these multiple purposes (simulations + data + learning). We propose a new, innovative method for the integration of computational and data science (Big Data & Extreme Computing, BDEC) for the sustainable promotion of new scientific discovery by supercomputers with heterogeneous architectures in the exascale/Post-Moore era. "h3-Open-BDEC" (h3: hierarchical, hybrid, heterogeneous) is an open-source infrastructure for the development and execution of optimized and reliable codes for BDEC on such supercomputers, and is the extended version of ppOpen-HPC. In this presentation, we will give an overview of h3-Open-BDEC and the target supercomputer system, which will start operation in April 2021.
Date: Thursday, January 10, 2019, 14:00 - 15:00
Venue: R-CCS 6th Floor Auditorium
・Title: Multigrid for structured grids on large-scale parallel computers
・Speaker: Prof. Dr. Matthias Bolten (University of Wuppertal, High Performance Computing / Software Engineering)
In many simulations in computational science and engineering a partial differential equation has to be
solved. Multigrid methods are among the fastest methods for accomplishing this task, in many cases with
optimal, i.e., O(N), complexity. As a consequence, a multigrid method is often used in
simulations of huge problems on large-scale supercomputers. If the underlying problem is formulated on a structured
grid, this structure can be exploited in the multigrid method to build up the grid hierarchy.
Additionally, the presence of structure allows for a relatively straightforward efficient implementation
on modern computer architectures, like modern CPUs or GPUs. Further, structure allows for a rigorous
analysis of the problem and of the multigrid method used for solving it. Still, the multigrid
components used, i.e., the grid transfer operators and smoothers, have to be carefully chosen to
treat the underlying problem. Besides the adaptation to the problem, the chosen components can
have a huge influence on the serial efficiency and the parallel scalability of the whole method.
In this talk, multigrid methods for structured grids, their analysis, and the specific choice of
algorithmic components for parallel computers will be discussed.
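The components named above (smoother, grid transfer operators, coarse-grid correction) can be made concrete with a minimal, hypothetical V-cycle for the 1D Poisson equation on a structured grid, using a damped-Jacobi smoother, full-weighting restriction and linear interpolation. This is only an illustrative sketch of how the grid structure yields the hierarchy, not code from the talk.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// residual r = f - A u, where A = tridiag(-1, 2, -1)/h^2 acts on the
// n interior points of [0,1] with homogeneous Dirichlet boundaries
static Vec residual(const Vec& u, const Vec& f, double h) {
    int n = (int)u.size();
    Vec r(n);
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? u[i - 1] : 0.0;
        double right = (i < n - 1) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h);
    }
    return r;
}

// damped Jacobi smoother (weight 2/3 is the classical 1D choice)
static void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    int n = (int)u.size();
    for (int s = 0; s < sweeps; ++s) {
        Vec v = u;
        for (int i = 0; i < n; ++i) {
            double left  = (i > 0)     ? v[i - 1] : 0.0;
            double right = (i < n - 1) ? v[i + 1] : 0.0;
            double jac = 0.5 * (left + right + h * h * f[i]);
            u[i] = v[i] + (2.0 / 3.0) * (jac - v[i]);
        }
    }
}

// one V-cycle; the recursion exploits the structured-grid hierarchy
// (n interior points with n = 2*nc + 1 coarsen to nc points)
static void vcycle(Vec& u, const Vec& f, double h) {
    int n = (int)u.size();
    if (n <= 3) { smooth(u, f, h, 50); return; }   // coarsest level: relax
    smooth(u, f, h, 3);                            // pre-smoothing
    Vec r = residual(u, f, h);
    int nc = (n - 1) / 2;
    Vec rc(nc), ec(nc, 0.0);
    for (int i = 0; i < nc; ++i)                   // full-weighting restriction
        rc[i] = 0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2];
    vcycle(ec, rc, 2.0 * h);                       // coarse-grid correction
    for (int i = 0; i < nc; ++i) {                 // linear interpolation
        u[2 * i]     += 0.5 * ec[i];
        u[2 * i + 1] += ec[i];
        u[2 * i + 2] += 0.5 * ec[i];
    }
    smooth(u, f, h, 3);                            // post-smoothing
}
```

Because indexing on the structured grid is purely arithmetic, no explicit coarse-grid data structures are needed, which is exactly the implementation advantage the talk refers to.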
Date: Thursday, January 10, 2019, 15:20 - 15:50
Venue: R-CCS 6th Floor Auditorium
・Title: Multiplicative Schwarz type block multi-color Gauss-Seidel smoother for the AMG/GMG method
In this talk, we will focus on multigrid methods. The convergence and performance of a multigrid method strongly depend on the smoother. We have proposed a multiplicative Schwarz type block multi-color Gauss-Seidel (MS-BMC-GS) smoother. This smoother has better convergence, a higher cache-hit ratio, and fewer communications than existing methods. In this talk, we introduce the MS-BMC-GS smoother and show numerical evaluations with geometric and algebraic multigrid methods.
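The MS-BMC-GS smoother itself is not spelled out in the abstract; the sketch below shows only the underlying multi-coloring idea in its simplest form, a red-black (two-color) Gauss-Seidel sweep for a 1D Poisson problem. All points of one color are mutually independent, so each color can be updated in parallel; the proposed smoother adds blocking and a multiplicative Schwarz structure on top of this.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Two-color (red-black) Gauss-Seidel sweeps for -u'' = f on n interior
// points of [0,1], h = 1/(n+1), homogeneous Dirichlet boundaries.
// Within one color, updates have no mutual data dependence.
void rb_gauss_seidel(std::vector<double>& u, const std::vector<double>& f,
                     double h, int sweeps) {
    int n = (int)u.size();
    for (int s = 0; s < sweeps; ++s)
        for (int color = 0; color < 2; ++color)     // red = 0, black = 1
            for (int i = color; i < n; i += 2) {    // parallelizable loop
                double left  = (i > 0)     ? u[i - 1] : 0.0;
                double right = (i < n - 1) ? u[i + 1] : 0.0;
                u[i] = 0.5 * (left + right + h * h * f[i]);
            }
}
```

The cache and communication benefits claimed for MS-BMC-GS come from choosing the blocks and colors so that each block's working set stays in cache and neighbor exchanges are batched per color.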
Date: Friday, December 21, 2018, 15:15 - 16:15
Venue: R-CCS 6th Floor Auditorium
・Title: Designing a Communication Platform for an FPGA Cluster
A field-programmable gate array (FPGA) is a reconfigurable device on which arbitrary circuits can be implemented repeatedly.
With an optimized implementation and stream processing, FPGA-based computing achieves both high computing performance and high power efficiency.
To further improve performance, it is necessary to build an FPGA cluster with multiple nodes.
We have developed a directly connected FPGA cluster and a communication platform for it.
In this talk, I introduce the design and structure of the FPGA cluster and how its nodes communicate.
I also explain the communication modules and the actual data movement on the FPGA cluster.
Finally, I show that the proposed platform achieves fast and flexible communication for various applications on FPGAs.
Date: Friday, December 7, 2018, 13:00 - 14:00
Venue: R-CCS 6th Floor Auditorium
・Title: Applying HPC to mitigate disaster damage by developing and integrating advanced computational science
The Computational Disaster Mitigation and Reduction Research Team aims to develop advanced large-scale numerical simulations of natural disasters such as earthquakes, tsunamis, floods and inundation for Kobe City and other urban areas in Hyogo Prefecture. The Oishi team integrates geohazards, water hazards and related hazards. Demand for natural disaster simulations is increasing because disasters take place frequently. We are therefore developing appropriate sets of computer programs that meet this demand. The team is dealing with the following three kinds of research topics.
Urban model development: Research on urban hazards requires urban models that represent the structure and shape of cities in numerical form.
However, it takes a very long time to develop urban models consisting of buildings, foundations and infrastructure such as bridges, ports and roads. It is therefore indispensable to invent methods that automatically construct urban models from existing data, which is basically ill-structured. The Oishi team developed the Data Processing Platform (DPP) for this purpose. Using DPP, the construction of a nationwide urban model and 3D model construction from engineering drawings have been achieved.
Recently, the Oishi team has started major collaborative research projects with Hanshin Expressway Co. Ltd. and the National Institute for Land and Infrastructure Management (MLIT). Three-dimensional bridge models for simulation codes will be generated automatically from paper-based engineering drawings or 2D CAD data so that the team can simulate the seismic response of the entire network with high-fidelity models. Since paper-based engineering drawings contain errors and lack information, robust model construction cannot be achieved by merely extracting information from the drawings. To tackle this problem, the Oishi team has developed a template-based methodology.
Developing particle methods for landslide simulation using FDPS:
Conventional mesh-based numerical methods, such as the finite element method
(FEM) and the finite difference method (FDM), have difficulty simulating the large deformations and the evolution and break-down of traction-free surfaces during a landslide. On the other hand, meshfree methods, such as smoothed particle hydrodynamics (SPH) and the moving particle semi-implicit (MPS) method, are regarded as promising candidates for landslide simulations. Using a framework for developing parallel particle simulation codes (FDPS), we are developing a large-scale landslide simulation code. Since FDPS provides the common routines needed to parallelize a general particle method, we can focus on the numerical schemes and the mechanisms of landslides. In this talk, we present an improvement of a mathematical reformulation of MPS (iMRMPS). iMRMPS shows no deterioration of accuracy or convergence for randomly distributed particles, outperforming most conventional particle methods.
Water-related disasters: The frequency of water disasters has increased. Not only water itself but also sediment causes damage to residents and their assets. Understanding possible hazards is necessary for taking precautionary measures and reducing damage. The Oishi team has therefore started to deal with water- and sediment-related disasters by developing numerical simulation models for river basins in Kobe City and Hyogo Prefecture. Estimating the damage of sediment-related disasters accompanied by flood, inundation, and sediment supply due to landslides is important for establishing a prevention plan. The Oishi team has developed a 2D Distributed Rainfall and Sediment Runoff/Inundation Simulator (DRSRIS), coupling a 2D rainfall runoff model, an inundation flow model, and a sediment transport model on a staggered grid, which runs on the supercomputer.
Date: Friday, December 7, 2018, 14:00 - 15:00
Venue: R-CCS 6th Floor Auditorium
・Title: Predictability of the July 2018 Record-breaking Rainfall in Western Japan
Data assimilation combines computer model simulations and real-world data based on dynamical systems theory and statistical mathematics. Data assimilation addresses the predictability of dynamical systems and has long played a crucial role in numerical weather prediction. The Data Assimilation Research Team (DA Team) has been working on various problems of data assimilation, mainly focusing on weather prediction. In July 2018, a broad area in western Japan was damaged severely by record-breaking heavy rainfall. The DA Team developed real-time regional and global weather forecasting systems and investigated the historic rainfall event using these systems. Also, the DA Team took the lead in organizing a rapid-response conference for meteorologists in August, about a month after the event, in collaboration with the Computational Climate Science Research Team. In this presentation, we will report the recent research progress of the DA Team, mainly focusing on the investigation related to the July 2018 rainfall event.
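The blending of a model forecast with observations that data assimilation performs can be illustrated with a deliberately minimal scalar example, a textbook Kalman-style analysis step; this is only the principle, not the DA Team's high-dimensional ensemble systems.

```cpp
#include <cassert>

// Scalar "analysis" step: blend a model forecast xb (error variance B)
// with an observation y (error variance R) using the optimal weight
// K = B / (B + R). Real NWP systems apply ensemble or variational
// generalizations of this update in millions of dimensions.
struct Analysis { double x; double var; };

Analysis assimilate(double xb, double B, double y, double R) {
    double K = B / (B + R);        // Kalman gain: trust data vs. model
    return { xb + K * (y - xb),    // analysis state
             (1.0 - K) * B };      // analysis variance (always reduced)
}
```

With equal forecast and observation variances the analysis is simply the midpoint of the two values, with half the uncertainty of the forecast.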
Date: Friday, December 7, 2018, 15:15 - 16:15
Venue: R-CCS 6th Floor Auditorium
・Title: Research Activities on Parallel Programming Models for Current HPC Platforms
In this talk, we introduce two research activities to improve vectorization and performance optimization for state-of-the-art HPC platforms. Recent trends in processor design accommodate wide vector extensions, making SIMD vectorization more important than before for exploiting the potential performance of the target architecture. The latest OpenMP specification provides new directives that help compilers produce better code for SIMD auto-vectorization. However, it is hard to optimize SIMD code performance in OpenMP since the SIMD code generation mostly relies on the compiler implementation. In the first part of the talk, we propose a new directive that specifies user-defined SIMD variants of functions used in SIMD loops. The compiler can then use the user-defined SIMD variants when it encounters OpenMP loops, instead of auto-vectorized variants. The user can optimize SIMD performance by implementing highly optimized SIMD code with intrinsic functions. A performance evaluation using an image composition kernel shows that the user can control SIMD code generation explicitly with our approach: the user-defined function reduces the number of instructions by 70% compared with the auto-vectorized code generated from the serial code. In the latter part of the talk, we propose a programming model for FPGAs. Because of the recent slowdown in silicon technology and the increasing power consumption of hardware, several dedicated architectures have been proposed in High Performance Computing
(HPC) to exploit the limited number of transistors in a chip with low power consumption. Although the Field-Programmable Gate Array (FPGA) is considered one of the promising solutions for realizing dedicated hardware for HPC, it is difficult for non-experts to program FPGAs due to the gap between their applications and the hardware-level programming models for FPGAs. To improve productivity on FPGAs, we propose a C/C++ based programming framework, C2SPD, for describing stream processing on FPGAs. C2SPD provides directives to specify code regions to be offloaded onto FPGAs. Two popular performance optimization techniques, vectorization and loop unrolling, can also be described in the directives. The compiler is implemented on top of the well-known open-source compiler infrastructure LLVM. It takes C/C++ code as input and translates it into DSL code for the FPGA backend and CPU binary code.
The DSL code is translated into Verilog HDL code by the FPGA backend and passed to the vendor’s FPGA compiler to generate hardware. The CPU binary code includes C2SPD runtime calls to control the FPGA and transfer data between the CPU and the FPGA. C2SPD assumes a single PCI-card type FPGA device, so data transfer involves communication via the PCI Express interface. The C2SPD compiler uses SPGen, a data-flow High-Level Synthesis (HLS) tool, as the FPGA backend. SPGen is an HLS tool for stream processing on FPGAs: its compiler takes its DSL, the Stream Processing Description (SPD), and generates pipelined stream cores on FPGAs. Although the range of applications is limited by this domain-specific approach, it can generate highly pipelined hardware on FPGAs. A 2D-stencil computation kernel written in C with C2SPD directives achieves 175.41 GFLOPS on the generated FPGA hardware using 256 stream cores. The performance evaluation shows that vectorization can exploit FPGA memory bandwidth and that loop unrolling can generate deep pipelines to hide instruction latency. By modifying parameters in the directives, the user can easily change the configuration of the generated hardware on the FPGA and optimize the performance.
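The proposed directive for user-defined SIMD variants is not public, but its closest standard relative can be sketched. The following hypothetical image-composition kernel uses the existing OpenMP `declare simd` and `omp simd` pragmas, which ask the compiler to auto-generate a vector variant of the scalar function; the talk's proposal goes one step further by letting the user substitute a hand-written intrinsic-based variant for that auto-generated one.

```cpp
#include <cassert>
#include <cstddef>

// `declare simd` requests a compiler-generated SIMD variant of this
// scalar function for use in `omp simd` loops. (Function name, alpha
// blending formula, and array layout are illustrative assumptions.)
#pragma omp declare simd
inline float blend(float src, float dst, float alpha) {
    return alpha * src + (1.0f - alpha) * dst;   // alpha compositing
}

// The loop is marked for SIMD execution; the compiler may call the
// vector variant of blend() here. The proposed directive would let the
// user supply that variant explicitly with intrinsics instead.
void composite(const float* src, float* dst, const float* a, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = blend(src[i], dst[i], a[i]);
}
```

Without OpenMP enabled the pragmas are ignored and the code runs serially, so the sketch is also a statement of the semantics the SIMD variant must preserve.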
Date: Tuesday, November 27, 2018, 14:00 - 15:00
Venue: R-CCS 6th Floor Auditorium
・Title: Performance portable parallel CP-APR tensor decompositions
・Speaker: Keita Teranishi (Principal Member of Technical Staff, Sandia National Laboratories, California)
Tensors have found utility in a wide range of applications, such as chemometrics, network traffic analysis, neuroscience, and signal processing. Many of these data science applications have increasingly large amounts of data to process and require high-performance methods to provide a reasonable turnaround time for analysts. Sparse tensor decomposition is a tool that allows analysts to explore a compact representation (low-rank models) of high-dimensional data sets, expose patterns that may not be apparent in the raw data, and extract useful information from the large amount of initial data. In this work, we consider decomposition of sparse count data using CANDECOMP-PARAFAC Alternating Poisson Regression (CP-APR).
Unlike the Alternating Least Squares (ALS) version, the CP-APR algorithm involves non-trivial constrained optimization of a nonlinear, nonconvex function, which has contributed to its slow adoption on high performance computing (HPC) systems. Recent studies by Kolda et al. suggest multiple variants of the CP-APR algorithm amenable to both data and task parallelism, but their parallel implementation involves several challenges due to the continuing trend toward a wide variety of HPC system architectures and programming models.
To this end, we have implemented a production-quality sparse tensor decomposition code, named SparTen, in C++ using Kokkos as a hardware abstraction layer. By using Kokkos, we have been able to develop a single code base and achieve good performance on each architecture. Additionally, SparTen is templated on several data types, allowing the use of mixed precision so that the user can tune performance and accuracy for specific applications. In this presentation, we will use SparTen as a case study to document the performance gains, the performance/accuracy tradeoffs of mixed precision in this application, and the development effort, and to discuss the level of performance portability achieved. Performance profiling results from each of these architectures will be shared to highlight the difficulties of efficiently processing sparse, unstructured data. By combining these results with an analysis of each hardware architecture, we will discuss insights for improved use of the available cache hierarchy, the potential costs and benefits of analyzing the sparsity pattern of the input data as a preprocessing step, critical aspects of these hardware architectures that allow for improved performance in sparse tensor applications, and where remaining performance may still have been left on the table due to having single algorithm implementations on diverging hardware architectures.
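The flavor of the CP-APR multiplicative update can be conveyed by its two-way (matrix) analogue: the Lee-Seung style multiplicative update for a nonnegative factorization X ≈ A Bᵀ under the Poisson/KL objective, which CP-APR generalizes to tensors. This is a hedged sketch with dense data for brevity, not SparTen code; in SparTen the data are sparse and the loops are Kokkos parallel kernels.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Poisson negative log-likelihood (up to a constant) for X ~ A B^T,
// X is m x n row-major, A is m x R, B is n x R.
double poisson_obj(const std::vector<double>& X, const std::vector<double>& A,
                   const std::vector<double>& B, int m, int n, int R) {
    double obj = 0.0;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 1e-12;                    // guard against log(0)
            for (int r = 0; r < R; ++r) s += A[i*R + r] * B[j*R + r];
            obj += s - X[i*n + j] * std::log(s);
        }
    return obj;
}

// One alternating sweep of multiplicative updates: each factor entry is
// scaled by a ratio of "observed over modeled" counts, so nonnegativity
// is preserved automatically and the objective is non-increasing.
void cpapr_mu_sweep(const std::vector<double>& X, std::vector<double>& A,
                    std::vector<double>& B, int m, int n, int R) {
    auto model = [&](int i, int j) {
        double s = 1e-12;
        for (int r = 0; r < R; ++r) s += A[i*R + r] * B[j*R + r];
        return s;
    };
    std::vector<double> Anew = A;                // update A from old A
    for (int i = 0; i < m; ++i)
        for (int r = 0; r < R; ++r) {
            double num = 0.0, den = 1e-12;
            for (int j = 0; j < n; ++j) {
                num += X[i*n + j] / model(i, j) * B[j*R + r];
                den += B[j*R + r];
            }
            Anew[i*R + r] = A[i*R + r] * num / den;
        }
    A = Anew;                                    // then update B with new A
    std::vector<double> Bnew = B;
    for (int j = 0; j < n; ++j)
        for (int r = 0; r < R; ++r) {
            double num = 0.0, den = 1e-12;
            for (int i = 0; i < m; ++i) {
                num += X[i*n + j] / model(i, j) * A[i*R + r];
                den += A[i*R + r];
            }
            Bnew[j*R + r] = B[j*R + r] * num / den;
        }
    B = Bnew;
}
```

The nonconvexity mentioned in the abstract is visible even here: the update is a fixed-point iteration with no global optimality guarantee, which is part of why parallelizing the full constrained CP-APR variants is non-trivial.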