Session 7

7.1 Project Talk: Reducing Communication in Sparse Iterative and Direct Solvers "Reducing Communication in Sparse Iterative and Direct Solvers"

William Gropp (UIUC)
The focus of this project is on reducing communication overhead in sparse kernels, with application to both iterative and direct solvers. Communication is a significant challenge, particularly as architectures add more layers of parallelism. In this talk we highlight several steps toward addressing communication limitations, including an improved performance model and a new approach to handling communication in the sparse matrix-vector multiplication operation by increasing locality. Our recent findings highlight several areas that are ripe for collaboration through the JointLab, and we discuss the next steps in developing these methods.
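
As a point of reference only (a generic sketch, not this project's implementation), the communication such work targets is typically the halo exchange in the distributed sparse matrix-vector product: each rank receives just the vector entries its off-process nonzeros touch. All neighbor lists and buffers below are hypothetical placeholders; increasing locality shrinks exactly these neighbor lists and message volumes.

    /* Illustrative halo-exchange sketch for a distributed SpMV y = A*x.
     * Hypothetical CSR-style neighbor metadata, not the project's code. */
    #include <mpi.h>

    void spmv_halo_exchange(
        int nsend, const int *send_rank, const int *send_off,
        const int *send_idx,            /* which owned x entries to pack */
        int nrecv, const int *recv_rank, const int *recv_off,
        const double *x, double *x_ghost,
        double *sendbuf, MPI_Request *req /* nrecv + nsend requests */)
    {
        /* Post receives for the ghost entries this rank needs. */
        for (int p = 0; p < nrecv; ++p)
            MPI_Irecv(x_ghost + recv_off[p], recv_off[p+1] - recv_off[p],
                      MPI_DOUBLE, recv_rank[p], 0, MPI_COMM_WORLD, &req[p]);

        /* Pack and send the owned entries that neighbors need. */
        for (int p = 0; p < nsend; ++p) {
            for (int k = send_off[p]; k < send_off[p+1]; ++k)
                sendbuf[k] = x[send_idx[k]];
            MPI_Isend(sendbuf + send_off[p], send_off[p+1] - send_off[p],
                      MPI_DOUBLE, send_rank[p], 0, MPI_COMM_WORLD,
                      &req[nrecv + p]);
        }

        /* The diagonal-block SpMV can overlap with this exchange; the
         * off-diagonal block is applied after
         * MPI_Waitall(nrecv + nsend, req, MPI_STATUSES_IGNORE). */
    }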

7.2 Project Talk: Comparison of Meshing and CFD Methods for Accurate Flow Simulations on HPC Systems "Recent Scaling-Enabling Developments of the CFD Codes CUBE and ZFS"

Andreas Lintermann (JSC), Keiji Onishi (RIKEN)
The talk focuses on recent developments in the CFD codes CUBE from RIKEN AICS and ZFS from JSC. It discusses efforts to increase the performance of both codes and to extend their applicability to engineering problems from the technical to the biofluid-mechanical realm. The methodological discussion will be complemented by examples from different applications.

7.3 Individual Talk: Handling Pointers and Dynamic Memory in Algorithmic Differentiation

Sri Hari Krishna Narayanan (ANL)
Proper handling of pointers and the (de)allocation of dynamic memory in the context of an adjoint computation via source transformation has so far had no established solution that is both comprehensive and efficient. This talk categorizes the memory references involving pointers to heap and stack memory, along with the principal options for recovering addresses in the reverse sweep. The main contributions are a code analysis algorithm that determines which remedy applies, memory-mapping algorithms for the general case where one cannot assume invariant absolute addresses, and an algorithm for handling pointers upon restoring checkpoints that reuses the memory-mapping approach of the reverse sweep.
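
As a loose illustration of one ingredient (a minimal sketch under strong simplifications, not the talk's algorithms), a pointer recorded in the forward sweep can be stored as an allocation-relative handle and rebuilt against the reverse sweep's own allocations when absolute addresses are not invariant:

    /* Hypothetical memory-mapping sketch: pointers become
     * (allocation id, offset) handles that survive re-allocation. */
    #include <stddef.h>
    #include <assert.h>

    #define MAX_ALLOCS 1024

    static char  *fwd_base[MAX_ALLOCS];  /* bases seen in the forward sweep    */
    static size_t fwd_size[MAX_ALLOCS];
    static char  *rev_base[MAX_ALLOCS];  /* matching reverse-sweep allocations */
    static int    nallocs;

    typedef struct { int id; ptrdiff_t off; } handle_t;

    /* Called at every tracked allocation in the forward sweep. */
    int record_alloc(void *base, size_t size) {
        fwd_base[nallocs] = (char *)base;
        fwd_size[nallocs] = size;
        return nallocs++;
    }

    /* Forward sweep: translate an absolute pointer into a handle. */
    handle_t encode(const void *p) {
        for (int i = 0; i < nallocs; ++i) {
            ptrdiff_t off = (const char *)p - fwd_base[i];
            if (off >= 0 && (size_t)off < fwd_size[i])
                return (handle_t){ i, off };
        }
        assert(0 && "pointer is not inside any recorded allocation");
        return (handle_t){ -1, 0 };
    }

    /* Reverse sweep: rebuild the pointer against the new base. */
    void *decode(handle_t h) {
        return rev_base[h.id] + h.off;
    }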

7.4 Individual Talk: Coupled multiphysics and parallel programming

Mariano Vazquez (BSC)
In this talk, BSC will address problems encountered while implementing parallel multiphysics in Alya. Several issues appear at the level of MPI tasks and relate to point-to-point data interchange, where a correct communication pattern must be established to avoid bottlenecks. We have explored different solutions: different numerical schemes, solution strategies, and runtime load balancing. We would like to share our experiences (good and bad) to seek help and stimulate discussion.
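
For context, a minimal sketch of a deadlock-free point-to-point exchange at a coupling interface (hypothetical buffers and neighbor arrays, not Alya's code). Posting every receive before any send removes ordering dependences between neighbors, which is one way such bottlenecks are avoided:

    /* Nonblocking interface exchange between nneigh neighbor ranks. */
    #include <mpi.h>

    void exchange_interface(int nneigh, const int *neigh,
                            double *const *sendbuf, double *const *recvbuf,
                            const int *count)
    {
        MPI_Request req[2 * nneigh];          /* C99 variable-length array */

        for (int p = 0; p < nneigh; ++p)      /* all receives first ...    */
            MPI_Irecv(recvbuf[p], count[p], MPI_DOUBLE, neigh[p],
                      0, MPI_COMM_WORLD, &req[p]);
        for (int p = 0; p < nneigh; ++p)      /* ... then all sends        */
            MPI_Isend(sendbuf[p], count[p], MPI_DOUBLE, neigh[p],
                      0, MPI_COMM_WORLD, &req[nneigh + p]);

        MPI_Waitall(2 * nneigh, req, MPI_STATUSES_IGNORE);
    }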

7.5 Project Talk: Optimizing ChASE Eigensolver for Bethe-Salpeter Computations on Multi-GPUs "Efficient Parallel Implementation of the ChASE Library on Distributed CPU-GPU Computing Architectures"

Edoardo Di Napoli (JSC)
The Chebyshev Accelerated Subspace iteration Eigensolver (ChASE) is an iterative eigensolver developed at JSC by the SimLab ab initio. The solver principally targets sequences of dense eigenvalue problems as they arise in Density Functional Theory, but can also be applied to a single eigenproblem. ChASE relies predominantly on BLAS 3 subroutines to achieve close-to-peak performance. Currently, the library can be executed in parallel on many- and multi-core platforms. The latest development in this project extended the CUDA build to encompass multiple GPUs attached to distinct CPUs. This hybrid parallelization uses MPI as well as CUDA interfaces to effectively exploit heterogeneous multi-GPU platforms. The extended library was tested on large, dense eigenproblems extracted from excitonic Hamiltonians. The ultimate goal is to integrate this new parallel implementation of ChASE with the VASP-BSE code.
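
To see why the method lives in BLAS 3, consider a generic block Chebyshev filter built on the standard three-term recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x): applying it to a block of vectors turns every step into a matrix-matrix product. This is a textbook sketch with placeholder names, not ChASE's implementation; c and e denote the center and half-width of the spectral interval to damp.

    /* Block Chebyshev filter sketch: W <- T_degree((A - cI)/e) V. */
    #include <cblas.h>
    #include <string.h>

    void chebyshev_filter(int n, int nev, int degree, const double *A,
                          double c, double e,
                          double *V, double *W, double *T /* n*nev scratch */)
    {
        /* W = (A - cI) V / e : degree-1 term, one GEMM. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, nev, n, 1.0 / e, A, n, V, n, 0.0, W, n);
        for (int j = 0; j < nev; ++j)
            cblas_daxpy(n, -c / e, V + (size_t)j * n, 1, W + (size_t)j * n, 1);

        for (int k = 2; k <= degree; ++k) {
            /* T = 2 (A - cI)/e W - V : three-term recurrence, one GEMM. */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, nev, n, 2.0 / e, A, n, W, n, 0.0, T, n);
            for (int j = 0; j < nev; ++j) {
                cblas_daxpy(n, -2.0 * c / e, W + (size_t)j * n, 1,
                            T + (size_t)j * n, 1);
                cblas_daxpy(n, -1.0, V + (size_t)j * n, 1, T + (size_t)j * n, 1);
            }
            memcpy(V, W, (size_t)n * nev * sizeof(double));  /* V <- T_{k-1} V0 */
            memcpy(W, T, (size_t)n * nev * sizeof(double));  /* W <- T_k    V0 */
        }
    }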

7.6 Individual Talk: Extreme-scaling applications at JSC

Brian Wylie (JSC), Dirk Broemmel (JSC), Wolfgang Frings (JSC)
Since 2006 JSC has held a series of well-received Extreme Scaling Workshops for its Blue Gene systems, and with the High-Q Club it has documented 28 application codes that successfully scaled to exploit all 28 racks (458,752 cores, capable of running over 1.8 million threads) of its JUQUEEN Blue Gene/Q. We briefly review these activities and the lessons they may hold for future exascale computer systems.