Session 10

10.1 Project Talk: Scalability Enhancements to FMM for Molecular Dynamics Simulations "Scratching The Millisecond - The std::Way"

David Haensel (JSC)
Modern HPC resources owe their peak performance to an increased node-level core count. To keep up scalability, a low-overhead threading approach in the user software is required. In our case, an FMM serves as a Coulomb solver for molecular dynamics simulations, with the goal of millisecond execution time for a single time step. To avoid parallelization overhead, e.g., synchronization points or load imbalance, algorithm-aware strategies have to be applied. Such measures improve performance especially for a tasking approach with dependency resolution and work scheduling. Implementing those specific strategies in the scheduler and dependency resolver of a third-party library can be quite challenging, and relying solely on generic dynamic scheduling implementations can hurt performance. The current C++ language standard (C++11) offers several robust features for parallel intranode programming. With the help of these standardized C++ features, we added a tasking layer to our FMM library. In this talk we present which C++11 features are best suited for tasking and how we apply and tailor such schemes for our purposes. Finally, we show an in-depth performance analysis of our current model utilizing dynamic work-stealing.
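
As a rough illustration of the kind of standard-only building blocks such a tasking layer can be assembled from, the sketch below shows a minimal work-stealing thread pool written against C++11 (std::thread, std::mutex, std::atomic, std::function). All names are hypothetical and the shutdown handling is deliberately simplified; this is not the FMM library's actual scheduler.

```cpp
#include <atomic>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class WorkStealingPool {
public:
    explicit WorkStealingPool(unsigned n)
        : queues_(n), mutexes_(n), done_(false) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { run(i); });
    }
    ~WorkStealingPool() {
        done_ = true;                                  // simplified shutdown: pending tasks are dropped
        for (auto& w : workers_) w.join();
    }
    void submit(unsigned q, std::function<void()> task) {
        q %= queues_.size();
        std::lock_guard<std::mutex> lock(mutexes_[q]);
        queues_[q].push_back(std::move(task));
    }
private:
    bool pop(unsigned q, std::function<void()>& task) {
        std::lock_guard<std::mutex> lock(mutexes_[q]);
        if (queues_[q].empty()) return false;
        task = std::move(queues_[q].front());
        queues_[q].pop_front();
        return true;
    }
    bool steal(unsigned self, std::function<void()>& task) {
        for (unsigned i = 1; i < queues_.size(); ++i) {
            unsigned victim = (self + i) % queues_.size();
            std::lock_guard<std::mutex> lock(mutexes_[victim]);
            if (queues_[victim].empty()) continue;
            task = std::move(queues_[victim].back());  // steal from the opposite end
            queues_[victim].pop_back();
            return true;
        }
        return false;
    }
    void run(unsigned self) {
        std::function<void()> task;
        while (!done_) {
            if (pop(self, task) || steal(self, task)) task();
            else std::this_thread::yield();            // idle worker backs off
        }
    }
    std::vector<std::deque<std::function<void()>>> queues_;
    std::vector<std::mutex> mutexes_;
    std::atomic<bool> done_;
    std::vector<std::thread> workers_;
};

int main() {
    std::atomic<int> counter(0);
    WorkStealingPool pool(4);
    for (unsigned i = 0; i < 100; ++i)
        pool.submit(i, [&counter] { ++counter; });     // round-robin over the queues
    while (counter.load() < 100) std::this_thread::yield();
    std::cout << counter.load() << " tasks executed\n";
}
```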

10.2 Individual Talk: Daino: A High-level AMR Framework on GPUs

Mohamed Wahib (RIKEN)
Adaptive Mesh Refinement (AMR) methods reduce the computational requirements of problems by increasing resolution only in areas of interest. In practice, however, efficient AMR implementations are difficult to achieve, since the mesh-hierarchy management must be optimized for the underlying hardware. The architectural complexity of GPUs can make efficient AMR particularly challenging in GPU-accelerated supercomputers. This talk presents a compiler-based high-level framework that can automatically transform serial uniform-mesh code annotated by the user into parallel adaptive-mesh code optimized for GPU-accelerated supercomputers. We show experimental results on three production applications. The speedups of code generated by our framework are comparable to hand-written AMR code while achieving good strong and weak scaling up to 3,600 GPUs.
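
To make the framework's starting point concrete, here is a hypothetical sketch of the input it consumes: a serial, uniform-mesh stencil annotated by the user. The "amr" directives below are invented for illustration and are not Daino's actual annotation syntax; they only convey the idea that the compiler derives mesh-hierarchy management and GPU parallelization from a plain serial loop nest.

```cpp
const int N = 256;
double u[N][N], u_new[N][N];

#pragma amr refine(u) criterion(gradient) max_level(4)    // hypothetical directive
void step() {
    #pragma amr parallel                                   // hypothetical directive
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                  u[i][j-1] + u[i][j+1]);  // Jacobi relaxation step
}
```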

10.3 Individual Talk: Portable Asynchronous Progress Model for MPI on Multi- and Many-Core Systems

Min Si (ANL)
Casper provides efficient process-based asynchronous progress for MPI one-sided communication on multi- and many-core architectures by dedicating a small number of cores to the asynchronous progress engine. It is designed as an external library of MPI through the PMPI interface, thus ensuring portability among various MPI implementations and platforms. Although this model has successfully improved the performance of the large quantum chemistry application NWChem by up to 30%, several issues still lack a solution, including a general approach to deliver optimal performance in multi-phase applications, support for other MPI communication models such as two-sided and collectives, and cooperation with other PMPI-based libraries such as MPI performance tools. In this talk, we first propose an efficient dynamic adaptation mechanism to address the issue in multi-phase applications. It allows Casper to dynamically study the performance characteristics of an application's internal phases and to adapt the configuration of asynchronous progress for each phase efficiently. We then briefly introduce ongoing work in Casper to improve the dynamic adaptation and to resolve the second issue, supporting two-sided and collective modes, by integrating the PVAS and ULP concepts contributed by RIKEN researchers. Finally, we discuss potential collaborations on the challenge of cooperating with PMPI-based tools, and opportunities to integrate Casper with other MPI-based parallel runtime systems such as XMP on MPI.
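
The PMPI interception mechanism the abstract refers to can be sketched in a few lines: the library defines its own MPI_Init and forwards to the MPI implementation through the PMPI_ profiling interface, which is the point where a progress library can set aside dedicated progress ranks. The body below is illustrative only and is not Casper's actual logic.

```cpp
#include <mpi.h>
#include <cstdio>

extern "C" int MPI_Init(int* argc, char*** argv) {
    int err = PMPI_Init(argc, argv);         // forward to the real MPI_Init
    int rank = 0, size = 0;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    PMPI_Comm_size(MPI_COMM_WORLD, &size);
    // A real progress library would now partition the ranks into
    // application processes and dedicated progress ("ghost") processes.
    if (rank == 0)
        std::printf("intercepted MPI_Init: %d ranks\n", size);
    return err;
}
```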

10.4 Individual Talk: Efficient Composition of Task-Graph-Based Code

Christian Perez (INRIA)
Composition of code is still an open issue in HPC applications. The problem is even more difficult as task-based programming models seem unavoidable. Moreover, applications may need to be specialized with respect to an objective function (maximum performance, power capping, etc.). This talk shows, through examples, the possibilities and the open challenges that a component-based approach brings.
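
As a minimal, entirely hypothetical sketch of the composition problem, the code below has two independently written components each contribute tasks to a shared graph, with a thin assembly layer wiring an output of one to an input of the other so that a single runtime could schedule across the component boundary. All names are invented for illustration.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Task {
    std::string name;
    std::vector<int> deps;           // indices of tasks this one depends on
    std::function<void()> body;
};

// Component A: produces data.
int add_producer(std::vector<Task>& graph) {
    graph.push_back({"produce", {}, [] { std::cout << "produce\n"; }});
    return (int)graph.size() - 1;
}

// Component B: consumes data; the dependency is wired in by the assembler.
int add_consumer(std::vector<Task>& graph, int input) {
    graph.push_back({"consume", {input}, [] { std::cout << "consume\n"; }});
    return (int)graph.size() - 1;
}

int main() {
    std::vector<Task> graph;
    int p = add_producer(graph);
    add_consumer(graph, p);          // the composition point between components
    for (auto& t : graph) t.body();  // stand-in for a dependency-aware scheduler
}
```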

10.5 Individual Talk: In C++ we trust - performance portability out of the box?

Ivo Kabadshow (JSC)
The HPC landscape is very diverse, and performance on a single platform may come at the high price of lost portability. In this talk we outline our attempt to reach high resource utilization without losing portability. We present our current C++11 software layout for a fast multipole method (FMM), focusing on intranode performance, especially vectorization and multithreading. We also show how abstraction and template metaprogramming help us maintain clean and readable code without impacting performance.
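
The following sketch illustrates the general flavor of such an abstraction, under the assumption that it is representative rather than the actual FMM library interface: a small compile-time-sized vector type whose operators are plain loops the compiler can auto-vectorize, so user code stays readable while the generated code maps onto the platform's SIMD units.

```cpp
#include <array>
#include <cstddef>
#include <iostream>

template <typename T, std::size_t N>
struct Vec {
    std::array<T, N> v;
    Vec operator+(const Vec& o) const {
        Vec r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = v[i] + o.v[i];  // vectorizable loop
        return r;
    }
    Vec operator*(const Vec& o) const {
        Vec r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = v[i] * o.v[i];  // vectorizable loop
        return r;
    }
};

int main() {
    Vec<double, 4> a{{1, 2, 3, 4}}, b{{5, 6, 7, 8}};
    Vec<double, 4> c = a * b + a;    // reads like scalar math
    std::cout << c.v[0] << '\n';     // prints 6
}
```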

10.6 Individual Talk: Feedback Control for Autonomic High-Performance Computing

Eric Rutten (INRIA)
Due to the increasing complexity, scale, and heterogeneity of computing systems and applications, including hardware, software, communications, and networks, there is a growing need for runtime management of resources, in an automated self-adaptation to variations due to the data or the environment. Such feedback loops are the object of Autonomic Computing (AC) and can involve the use of Control Theory. We have first results, from joint work with J.-F. Méhaut, N. Zhou, G. Delaval, and B. Robu, illustrating the topic: the dynamic management of the trade-off between parallel computation and synchronization. Higher parallelism can potentially bring speedup, but also higher synchronization costs around shared data, which can even outgrow the gain. Additionally, thread locality on different cores may impact program performance. We have studied the problem in the particular framework of Software Transactional Memory (STM) and proposed solutions to dynamically adapt the degree of parallelism and the thread-mapping policy in order to reduce program execution time. The perspectives for adopting such feedback-loop-based approaches are very broad, and the topic remains novel. We propose to generalize the approach, first to other platforms and synchronization mechanisms (OpenMP, Charm++), and further to consider objectives of a nature other than performance, such as energy consumption or dependability. The relevance to JLESC relates to the topic "Programming Languages and Runtimes". Preliminary exchanges have begun with S. Kale (UIUC), who works on the related topic of load-balancing regulation.
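
To make the feedback-loop idea concrete, here is a hypothetical sketch: periodically measure a proxy for synchronization cost (here, an STM-style abort ratio) and adjust the degree of parallelism in response. Both the fake measurement and the simple threshold controller are invented for illustration; the actual work uses control-theoretic designs.

```cpp
#include <iostream>

int main() {
    int threads = 16;
    for (int epoch = 0; epoch < 5; ++epoch) {
        // Stand-in for a real measurement of commits vs. aborts per epoch.
        double abort_ratio = 0.1 * threads;  // pretend contention grows with threads
        if (abort_ratio > 0.5 && threads > 1)
            threads /= 2;                    // too much contention: shrink
        else if (abort_ratio < 0.2)
            threads += 2;                    // resources underused: grow
        std::cout << "epoch " << epoch << ": " << threads << " threads\n";
    }
}
```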

10.7 Individual Talk: An Overview of the IHK/McKernel Lightweight Multikernel OS for Kernel-Assisted Communication Progression

Balazs Gerofi (RIKEN)
The "Lightweight Kernel (LWK) Assisted Communication Progression" JLESC project's goal is to propose mechanisms that obtain asynchronous progression of communications running on an LWK without disturbing application thread scheduling. The project combines the IHK/McKernel (RIKEN, Tokyo) lightweight multi-kernel with the MadMPI+Pioman (INRIA, Bordeaux) communication library. IHK/McKernel is a multi-kernel approach that runs Linux and LWK(s) side by side on compute nodes, with the primary aim of exploiting the lightweight kernel's ability to provide a scalable and consistent execution environment for large-scale parallel applications while retaining the full POSIX/Linux APIs. In this talk, we provide an architectural overview of the IHK/McKernel software stack and report on the current status of the collaboration.

10.8 Individual Talk: New Parallel Execution Model - User-Level Implementation

Atsushi Hori (RIKEN)
A new execution model, implemented at user level, that maps multiple processes into a single virtual address space will be introduced.