Session 4

4.1 Project Talk: Exploiting the Dragonfly Topology to Improve Communication Operations in MPI "Topology-Aware Scatter and AllGather operations for Dragonfly Networks"

Nathanaël CHERIERE (INRIA)
"High-radix direct network topologies such as Dragonfly have been proposed for petascale and exascale supercomputers because they ensure fast interconnections and reduce the cost of the network compared with traditional network topologies. The design of new machines such as Theta with a Dragonfly network present an opportunity to further improve the performance of distributed applications by making the algorithms aware of the topology. Indeed, current algorithms do not consider the topology and thus lose numerous opportunities of optimization for performance that have been created by the topology. This talk describes optimized algorithms for two collective operations: AllGather and Scatter and presents the results of an evaluation using the CODES simulator. *Note: this talk is the result of a 5-month internship of Nathanaël CHEERIER (INRIA) at ANL. The project emerged recently and will be added to the JLESC web site (topic: I/O, storage and in situ processing)."

4.2 Individual Talk: In situ workflow tools: Technologies and opportunities

Justin M Wozniak (ANL)
This talk will present recent advances in programming interfaces for Decaf-based in situ workflows.

4.3 Individual Talk: Early Work in the Application of Deep Learning for Scientific Visualization

Rob Sisneros (NCSA)
In recent years, deep learning has become a trusted path towards intelligent automation and learning in various fields. Large-scale image classification, computer vision, text analysis, and 3D model reconstruction are only some of the areas that have benefited from deep learning. Its applications in scientific visualization, however, have seldom been studied. A reason for this may be differences between how visualization scientists and machine learning scientists perceive the user’s role in data analysis: while machine learning tries to eliminate the user by means of intelligent automation, visualization benefits from unparalleled human intelligence. We believe it is possible to bridge this gap to the benefit of scientific visualization by automating complex and daunting tasks that result in visualizations. In this work we investigate deep learning for transfer function design. We will discuss the elements of an automatic technique that utilizes state-of-the-art deep learning and evolutionary optimization to create transfer functions based on sample target images. Even at this early stage, the approach has shed light on the exploratory process of transfer function design and shows promise to deliver impactful data insights and to free users to focus on analyzing rather than generating scientific visualization results.
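
To give a feel for the optimization half of such a pipeline, the sketch below evolves opacity control points toward a sample target image. The toy renderer, the MSE fitness, and all sizes are illustrative stand-ins; the talk's actual technique couples evolutionary optimization with deep-learning-based image comparison, which is not reproduced here.

```python
# A minimal sketch of evolutionary transfer function search.
# Everything here (renderer, fitness, encoding) is a simplified
# stand-in, not the authors' deep-learning pipeline.
import numpy as np

rng = np.random.default_rng(0)
field = rng.random((64, 64))            # synthetic scalar field
N_CTRL, POP, GENS = 8, 32, 200          # illustrative sizes

def render(tf):
    # Toy renderer: push each scalar through the piecewise-linear
    # opacity transfer function defined by N_CTRL control points.
    xs = np.linspace(0.0, 1.0, N_CTRL)
    return np.interp(field, xs, tf)

target_tf = np.clip(np.sin(np.linspace(0, 3, N_CTRL)), 0, 1)
target_img = render(target_tf)          # the sample target image

def fitness(tf):
    # The talk would compare against the target image with a learned
    # distance; plain mean squared error stands in for it here.
    return -np.mean((render(tf) - target_img) ** 2)

pop = rng.random((POP, N_CTRL))
for gen in range(GENS):
    scores = np.array([fitness(tf) for tf in pop])
    elite = pop[np.argsort(scores)[-POP // 4:]]       # keep best quarter
    parents = elite[rng.integers(len(elite), size=(POP, 2))]
    children = parents.mean(axis=1)                   # crossover: average
    children += rng.normal(0, 0.05, children.shape)   # mutation
    pop = np.clip(children, 0.0, 1.0)

best = pop[np.argmax([fitness(tf) for tf in pop])]
print("recovered opacity control points:", np.round(best, 2))
```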

4.4 Individual Talk: Performance of High level FPGA accelerators

Carlos Alvarez (BSC)
Using high-level programming models to program FPGA accelerators through automatic tool-chains is a new field that seeks to drive the new heterogeneous platforms currently appearing on the market. However, as in any new field, many questions, such as its efficiency and real productivity, remain unanswered. This talk will present the results obtained when analyzing the performance achieved with these new tools. It will describe how Paraver traces can help analyze the performance of the resulting HDL accelerators, and will highlight the problems discovered and the envisioned solutions.
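
As a rough illustration of the kind of analysis such traces enable, the snippet below computes per-accelerator utilization from state records. The record layout is a simplified, hypothetical stand-in, not the real Paraver .prv format; field names and values are invented for illustration.

```python
# A minimal sketch of trace-based accelerator analysis. The tuple
# layout (accelerator_id, start_ns, end_ns, state) is a hypothetical
# simplification of a real trace, used only to show the idea.
from collections import defaultdict

trace = [
    ("fpga0", 0, 400, "run"), ("fpga0", 400, 900, "idle"),
    ("fpga0", 900, 1500, "run"), ("fpga1", 0, 1200, "run"),
    ("fpga1", 1200, 1500, "idle"),
]

busy = defaultdict(int)
total = defaultdict(int)
for acc, start, end, state in trace:
    total[acc] += end - start
    if state == "run":                  # accelerator executing a task
        busy[acc] += end - start

for acc in sorted(total):
    util = 100.0 * busy[acc] / total[acc]
    print(f"{acc}: {util:.1f}% utilization")
    # Low utilization would point at data-transfer or task-creation
    # overheads of the kind the talk discusses.
```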

4.5 Individual Talk: Supporting extreme-scale real-time science workflows on exascale computers

Raj Kettimuthu (ANL)
Activities at experimental facilities such as light sources and fusion tokamaks are tightly scheduled, with timing driven by factors ranging from the physical processes involved in an experiment to the travel schedules of on-site researchers. Thus, computing must often be available at a specific time, for a specific period, with a high degree of reliability. Such near-real-time requirements are hard to meet on current HPC systems, which are typically batch-scheduled under policies in which an arriving job runs immediately only if enough resources are available, and is queued otherwise. We are investigating the following questions and are looking for collaboration opportunities: 1) What changes will be required to the scheduling algorithms, architecture, and implementation of exascale computers if they are to support real-time experimental science workloads effectively? 2) What are the implications for other exascale workloads? 3) What system-level support would be beneficial? We would like to examine a wide range of design alternatives, develop new system models and simulation methods, and perform extensive simulation-based (and real-world) studies using various combinations of real-world batch job traces and synthetic real-time jobs (created from a model that mimics actual real-time jobs) to answer these questions.
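
One design point in this space is preemption: killing or checkpointing batch jobs so a real-time job can start exactly on time. The toy model below simulates that single decision; the job parameters, the FCFS simplification, and the preemption policy are illustrative assumptions, not the authors' simulator or models.

```python
# A minimal toy model of preemptive scheduling for real-time jobs.
# All jobs and sizes are synthetic; this is a sketch of one policy,
# not the simulation framework proposed in the talk.
import heapq

NODES = 8
free = NODES
running = []   # min-heap of (end_time, nodes, name, preemptible)

def start(now, name, nodes, runtime, preemptible):
    global free
    free -= nodes
    heapq.heappush(running, (now + runtime, nodes, name, preemptible))
    print(f"t={now}: start {name} on {nodes} nodes")

def finish_until(now):
    # Retire every job whose end time has passed, releasing its nodes.
    global free
    while running and running[0][0] <= now:
        end, nodes, name, _ = heapq.heappop(running)
        free += nodes
        print(f"t={end}: finish {name}")

# Batch jobs start immediately while nodes are free (simplified FCFS).
start(0, "batch-A", 4, 60, preemptible=True)
start(0, "batch-B", 4, 60, preemptible=True)

# A real-time job must start at t=20 but the machine is full: preempt
# batch jobs until enough nodes are free.
now, need = 20, 6
finish_until(now)
while free < need:
    victim = next(j for j in running if j[3])   # any preemptible job
    running.remove(victim)
    heapq.heapify(running)
    free += victim[1]
    print(f"t={now}: preempt {victim[2]} to free {victim[1]} nodes")
start(now, "realtime-X", need, 10, preemptible=False)
```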

4.6 Individual Talk: Towards efficient Big Data processing in HPC systems: Performance Analysis of Spark on HPC

Orcun Yildiz (INRIA)
On the path towards the convergence of HPC and Big Data, the adoption of Big Data processing frameworks on HPC systems remains a challenge. In this work, we conduct an experimental campaign to provide a clearer understanding of the performance of Big Data processing frameworks on HPC systems. In this talk, we present the results of our campaign together with insights and open questions on how to design efficient Big Data processing solutions for HPC systems. We believe that our findings will interest participants working on the convergence between HPC and Big Data. Moreover, we hope that this talk will set up potential collaboration(s) towards efficient Big Data processing in HPC systems.

4.7 Individual Talk: A Suite of Collaborative Heterogeneous Applications for Integrated Architectures

Simon Garcia De Gonzalo (UIUC)
Heterogeneous systems are evolving into computing platforms with tighter integration between CPU and GPU, made possible by new features such as a shared memory space, memory coherence, and system-wide atomic operations. Exponents of this trend are the Heterogeneous System Architecture (HSA) and the NVIDIA Pascal architecture. Programming frameworks such as OpenCL 2.0 and CUDA 8.0 allow programmers to exploit these platforms with fine-grain coordination of CPU and GPU threads. To evaluate these new architectures and programming languages, and to empower researchers to experiment with new ideas, a suite of benchmarks targeting these architectures with close CPU-GPU collaboration is needed. We present Chai (Collaborative Heterogeneous Applications for Integrated-architectures), a benchmark suite that leverages the latest features of heterogeneous architectures. These benchmarks cover a wide range of collaboration patterns, exhibit great diversity within each pattern, and are each implemented in five different programming models: OpenCL 2.0, OpenCL 1.2, C++ AMP, CUDA 7.5, and CUDA-Sim, with CUDA 8.0 in progress.
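
As a conceptual illustration of one such collaboration pattern, the sketch below has two workers pull items from a single shared queue through an atomic fetch-and-add on the queue index. Python threads stand in for the CPU and GPU sides, and itertools.count stands in for a system-wide atomic counter; this is an analogy for the pattern, not Chai code.

```python
# A minimal sketch of fine-grain CPU-GPU work sharing via a shared
# atomic index. The "GPU" is just a second CPU thread here; the
# point is the collaboration pattern, not the devices.
import itertools
import threading

N_ITEMS = 16
data = [x * x for x in range(N_ITEMS)]
results = [None] * N_ITEMS
next_item = itertools.count()   # effectively atomic in CPython
done = [0, 0]                   # items processed per "device"

def worker(device):
    while True:
        i = next(next_item)     # fetch-and-add on the shared index
        if i >= N_ITEMS:
            return
        results[i] = data[i] + 1   # stand-in for real kernel work
        done[device] += 1

cpu = threading.Thread(target=worker, args=(0,))
gpu = threading.Thread(target=worker, args=(1,))  # "GPU" stand-in
cpu.start(); gpu.start(); cpu.join(); gpu.join()
print(f"CPU processed {done[0]} items, GPU processed {done[1]} items")
```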

4.8 Individual Talk: Exploring Memory Management and Performance on Deep-Memory Architectures

Swann Perarnau (ANL)
Hardware advances are enabling new memory technologies, resulting in a deepening of the memory hierarchy. On-package DRAM, now available on Intel’s Knights Landing architecture, provides higher bandwidth but limited capacity. Soon, byte-addressable NVRAM will also become available, deepening the memory hierarchy further. The Argo group at Argonne recently started exploring how this memory hierarchy should be exposed to HPC applications. Our approach follows several axes: 1) improve and augment current operating system interfaces to allow the available memory types to be managed explicitly, efficiently, and transparently; 2) develop tools to analyze the memory access patterns of HPC applications, focusing on providing guidance on the use of the memory hierarchy and the above interfaces; 3) integrate automatic memory management facilities into parallel runtimes, so that applications can benefit from better usage of the memory hierarchy transparently and effortlessly. At the OS level, we are currently designing a low-level framework to perform asynchronous memory migration between nodes of the memory hierarchy. Early results indicate that the use of this framework along with out-of-core programming schemes can significantly improve the performance of applications whose working set cannot fit in on-package memory. This summer, we prototyped a memory tracing and analysis toolset to extract from an HPC application the locality of memory accesses to specific data structures. This can then be used to identify which data structures would benefit from being migrated across the memory hierarchy, either statically at allocation time or at runtime using our memory migration framework. We are also developing models and heuristics to guide automatic memory migration inside runtimes like StarPU, OmpSs, or OpenMP 4. While we already collaborate with JLESC members (RIKEN and INRIA) on this work, we would like to extend these collaborations to other partners and to formally establish (or join) JLESC projects to support them. During this presentation, we will outline the current state of this work and suggest possible collaboration points.
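
To show the kind of decision such tracing data can feed, here is a minimal sketch of a greedy static-placement heuristic: rank data structures by accesses per byte and pack the on-package memory with the densest ones. The capacity, structure names, and access counts are invented for illustration and are not the project's actual models or heuristics.

```python
# A minimal sketch of static placement into fast on-package memory,
# driven by (hypothetical) per-structure access counts of the kind a
# memory tracer could produce.
FAST_CAPACITY = 16  # GiB of on-package memory (illustrative)

# (name, size_GiB, accesses) -- assumed output of the tracing toolset.
structures = [
    ("grid",      12, 9_000_000),
    ("particles",  8, 7_000_000),
    ("halo",       2, 3_000_000),
    ("lookup",     1, 2_500_000),
]

# Rank by access density (accesses per GiB) and pack greedily.
ranked = sorted(structures, key=lambda s: s[2] / s[1], reverse=True)
placed, used = [], 0
for name, size, _ in ranked:
    if used + size <= FAST_CAPACITY:
        placed.append(name)
        used += size

print(f"place in on-package memory: {placed} ({used}/{FAST_CAPACITY} GiB)")
# Everything else stays in DRAM, or is moved in and out at runtime by
# an asynchronous migration framework like the one the talk describes.
```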