

## Preparing for Extreme Heterogeneity in High Performance Computing

Jeffrey S. Vetter

With many contributions from FTG Group and Colleagues

R-CCS International Symposium Kobe 18 Feb 2020

ORNL is managed by UT-Battelle, LLC for the US Department of Energy







http://ft.ornl.gov

vetter@computer.org

## Highlights

- Recent trends in extreme-scale HPC paint an ambiguous future
  - Contemporary systems provide evidence that power constraints are driving architectures to change rapidly
  - Multiple architectural dimensions are being (dramatically) redesigned: Processors, node design, memory systems, I/O
  - Complexity is our main challenge
- Applications and software systems are all reaching a state of crisis
  - Applications will not be functionally or performance portable across architectures
  - Programming and operating systems need major redesign to address these architectural changes
  - Procurements, acceptance testing, and operations of today's new platforms depend on performance prediction and benchmarking.
- We need portable programming models and performance prediction now more than ever!
  - Heterogeneous processing
    - OpenACC->FPGAs
    - Intelligent runtime system (IRIS)
    - Clacc: OpenACC support in LLVM (not covered today)
    - OpenACC dialect of MLIR for Flang Fortran (not covered today)
  - Emerging memory hierarchies (NVM)
    - DRAGON: transparent NVM access from GPUs (not covered today)
    - NVL-C: user management of nonvolatile memory in C (not covered today)
    - Papyrus: parallel aggregate persistent storage (not covered today)
  - Performance prediction is critical for design and optimization (not covered today)



## Time for a short poll...



## History

Q: Think back 10 years. How many of you would have predicted that many of our top HPC systems would be GPU-based architectures?





## Future

Q: Think forward 10 years. How many of you predict that most of our top HPC systems will have the following architectural features? Assume a general-purpose multicore CPU.

- GPU
- FPGA/Reconfigurable processor
- Neuromorphic processor
- Deep learning processor
- Quantum processor
- RISC-V processor
- Some new unknown processor
- All/some of the above in one SoC



## Implications

Q: Now, imagine you are building a new application with an expected ~3M LOC and 20 team members over the next 10 years. What on-node programming model/system do you use?

- C, C++ XX, Fortran XX
- Metaprogramming, etc. (e.g., AMP, Kokkos, RAJA, SYCL)
- CUDA, cu\*\*\*, HIP, OpenCL
- Directives: OpenMP XX, OpenACC XX
- R, Python, Matlab, etc.
- A Domain Specific Language (e.g., Claw, PySL)
- A Domain Specific Framework (e.g., PETSc)
- Some new unknown programming approach
- All/some of the above



### The FTG Vision

- **Applications:** Science and Engineering (e.g., CFD, Materials, Fusion); Streaming (e.g., SW Radio, Experimental instrument); Sensing (e.g., SAR, …); Deep learning (e.g., CNN); Analytics (e.g., graphs); Robotics (e.g., sense and react)
- **Programming Systems:** Compiler; Domain Specific Languages; Just-in-time Compilation; Metaprogramming; Scripting; Libraries; Autotuning
- **Runtime and Operating Systems:** Discovery; Task Scheduling and Mapping; Data Orchestration; IO; Synchronization; Load balancing
- **Across all layers:** Performance, Productivity, Energy Efficiency

## The FTG Vision | Applications

[Figure: the FTG vision stack, showing the Programming Systems layer (Just-in-time Compilation, Metaprogramming, Scripting, Libraries, Autotuning) and the Runtime and Operating Systems layer (Discovery, Task Scheduling and Mapping, Data Orchestration, IO, Synchronization, Load balancing), with Performance, Productivity, and Energy Efficiency alongside.]


### ECP applications target national problems in 6 strategic areas

| National security | Energy security | Economic security | Scientific discovery | Earth system | Health care |
|---|---|---|---|---|---|
| Stockpile stewardship<br><br>Next-generation electromagnetics simulation of hostile environment and virtual flight testing for hypersonic re-entry vehicles | Turbine wind plant efficiency<br><br>High-efficiency, low-emission combustion engine and gas turbine design<br><br>Materials design for extreme environments of nuclear fission and fusion reactors<br><br>Design and commercialization of Small Modular Reactors<br><br>Subsurface use for carbon capture, petroleum extraction, waste disposal<br><br>Scale-up of clean fossil fuel combustion<br><br>Biofuel catalyst design | Additive manufacturing of qualifiable metal parts<br><br>Reliable and efficient planning of the power grid<br><br>Seismic hazard risk assessment<br><br>Urban planning | Find, predict, and control materials and properties<br><br>Cosmological probe of the standard model of particle physics<br><br>Validate fundamental laws of nature<br><br>Demystify origin of chemical elements<br><br>Light source-enabled analysis of protein and molecular structure and design<br><br>Whole-device model of magnetically confined fusion plasmas | Accurate regional impact assessments in Earth system models<br><br>Stress-resistant crop analysis and catalytic conversion of biomass-derived alcohols<br><br>Metagenomics for analysis of biogeochemical cycles, climate change, environmental remediation | |



DARPA Domain Specific System on Chip Program is investigating Performance Portability of Software Defined Radio




## The FTG Vision | Architectures

[Figure: the FTG vision stack, showing the Applications layer (Science and Engineering, Streaming, Sensing, Deep learning, Analytics, Robotics), the Programming Systems layer (Compiler, Domain Specific Languages, Just-in-time Compilation, Metaprogramming, Scripting, Libraries, Autotuning), and the Runtime and Operating Systems layer (Discovery, Task Scheduling and Mapping, Data Orchestration, IO, Synchronization, Load balancing), with Performance, Productivity, and Energy Efficiency alongside.]

### **Contemporary devices are approaching fundamental limits**



Dennard scaling has already ended. Dennard observed that voltage and current should be proportional to the linear dimensions of a transistor: 2x transistor count implies 40% faster and 50% more efficient.

R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," *IEEE Journal of Solid-State Circuits*, 9(5):256-68, 1974.
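
The "2x transistors, 40% faster, 50% more efficient" figures follow from simple arithmetic. The sketch below is a minimal illustration, assuming ideal constant-field scaling in which linear dimensions, voltage, and current all shrink by 1/k while density grows by k². The function name `dennard_scaling` and its dictionary keys are hypothetical, chosen for this example, and reading "50% more efficient" as half the power per transistor is an assumption.

```python
import math


def dennard_scaling(density_gain=2.0):
    """Ideal (constant-field) scaling factors for one process shrink.

    density_gain is the factor by which transistors per unit area grow.
    All returned values are multipliers relative to the previous node.
    """
    # Linear dimensions shrink by 1/k, so density grows by k**2.
    k = math.sqrt(density_gain)
    return {
        "linear_dimension": 1 / k,
        "voltage": 1 / k,
        "current": 1 / k,
        "frequency": k,                        # gate delay shrinks by 1/k
        "power_per_transistor": 1 / k**2,      # P = V * I
        "power_density": density_gain / k**2,  # more transistors, less power each
    }


factors = dennard_scaling(2.0)
print(f"frequency:            x{factors['frequency']:.2f}")
print(f"power per transistor: x{factors['power_per_transistor']:.2f}")
print(f"power density:        x{factors['power_density']:.2f}")
```

With a density gain of 2, k is about 1.41, so frequency rises by roughly 40%, power per transistor halves, and power density stays constant. The end of Dennard scaling means the last of these no longer holds.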




Figure 1 | … (MOSFET) shrinks, the gate dielectric (yellow) thickness approaches several atoms (0.5 nm at the 22-nm technology node). Atomic spacing limits the …

Figure 2 | As a MOSFET transistor shrinks, the shape of its electric field departs from basic rectilinear models, and the level curves become disconnected. Atomic-level manufacturing variations, especially for dopant …

I.L. Markov, "Limits on fundamental limits to computation," Nature, 512(7513):147-54, 2014, doi:10.1038/nature13570.



### End of Moore's Law: what's your prediction?



"The number of people predicting the death of Moore's Law doubles every two years." – Peter Lee, Microsoft


Recent headlines:

- "Foundries' Sales Show Hard Times Continuing," Peter Clarke, 5/23/2016
- "Uncertainty Grows For 5nm, 3nm," Semiconductor Engineering
- "GlobalFoundries Forfeit 7nm Manufacturing," EE Times Asia
- "Intel's 10nm Is Broken, Delayed Until 2019," Paul Alcorn, April 26, 2018
- "GlobalFoundries Selling ASIC Business to Marvell," Dylan McGrath, 05.20.19
- "Samsung to Invest \$115 Billion in Foundry & Chip Businesses by 2030"
- "Another Step Toward the End of Moore's Law: Samsung and TSMC move to 5-nanometer manufacturing"

- "TSMC's 5-Nanometer … Track for First Half …," Samuel K. Moore, 13 Dec 2019: devices are 15 percent faster, 30 per… efficient

Number of Foundries with a Cutting Edge Logic Fab

| 180 nm | 130 nm | 90 nm | 65 nm | 45 nm/40 nm | 32 nm/28 nm | 22 nm/20 nm | 16 nm/14 nm | 10 nm | 7 nm / 5 nm (Future) |
|---|---|---|---|---|---|---|---|---|---|
| SilTerra | | | | | | | | | |
| X-FAB | | | | | | | | | |
| Dongbu HiTek | | | | | | | | | |
| ADI | ADI | | | | | | | | |
| Atmel | Atmel | | | | | | | | |
| Rohm | Rohm | | | | | | | | |
| Sanyo | Sanyo | | | | | | | | |
| Mitsubishi | Mitsubishi | | | | | | | | |
| ON | ON | | | | | | | | |
| Hitachi | Hitachi | | | | | | | | |
| Cypress | Cypress | Cypress | | | | | | | |
| Sony | Sony | Sony | | | | | | | |
| Infineon | Infineon | Infineon | | | | | | | |
| Sharp | Sharp | Sharp | | | | | | | |
| Freescale | Freescale | Freescale | | | | | | | |
| Renesas (NEC) | Renesas | Renesas | Renesas | Renesas | | | | | |
| SMIC | SMIC | SMIC | SMIC | SMIC | | | | | |
| Toshiba | Toshiba | Toshiba | Toshiba | Toshiba | | | | | |
| Fujitsu | Fujitsu | Fujitsu | Fujitsu | Fujitsu | | | | | |
| TI | TI | TI | TI | TI | | | | | |
| Panasonic | Panasonic | Panasonic | Panasonic | Panasonic | Panasonic | | | | |
| STMicroelectronics | STM | STM | STM | STM | STM | | | | |
| UMC | UMC | UMC | UMC | UMC | UMC | | | | |
| IBM | IBM | IBM | IBM | IBM | IBM | IBM | | | |
| AMD | AMD | AMD | GlobalFoundries | GF | GF | GF | GF | | |
| Samsung | Samsung | Samsung | Samsung | Samsung | Samsung | Samsung | Samsung | Samsung | Samsung |
| TSMC | TSMC | TSMC | TSMC | TSMC | TSMC | TSMC | TSMC | TSMC | TSMC |
| Intel | Intel | Intel | Intel | Intel | Intel | Intel | Intel | Intel | Intel |

### Sixth Wave of Computing



http://www.kurzweilai.net/exponential-growth-of-computing



### Optimize Software and Expose New Hierarchical Parallelism

- Redesign software to boost performance on upcoming architectures
- Exploit new levels of parallelism and efficient data movement

### Architectural Specialization and Integration

- Use CMOS more effectively for specific workloads
- Integrate components to boost performance and eliminate inefficiencies
- Workload specific memory+storage system design

- Investigate new computational paradigms
  - Quantum
  - Neuromorphic
  - Advanced Digital
  - Emerging Memory Devices





## Quantum computing: Qubit design and fabrication have made recent progress but still face challenges

Science 354, 1091 (2016) - 2 December

#### A bit of the action

In the race to build a quantum computer, companies are pursuing many types of quantum bits, or qubits, each with its own strengths and weaknesses.



Note: Longevity is the record coherence time for a single qubit superposition state, logic success rate is the highest reported gate fidelity for logic operations on two qubits, and number entangled is the maximum number of qubits entangled and capable of performing two-qubit operations.

National Academies of Sciences, Engineering, and Medicine, *Quantum Computing: Progress and Prospects*, Consensus Study Report. http://nap.edu/25196

FIGURE 7.4 An illustration of potential milestones of progress in quantum computing. The arrangement of milestones corresponds to the order in which the committee thinks they are likely to be achieved; however, it is possible that some will not be achieved, or that they will not be achieved in the order indicated.

## Fun Question: when was the field effect transistor patented?

### Lilienfeld patents field effect transistor, **October 8, 1926**

Jessica MacNeil, October 08, 2018

On this day in tech history, JE Lilienfeld filed a patent for a three-electrode structure using copper-sulfide semiconductor material, known today as a field-effect transistor.

Lilienfeld's patent for a "method and apparatus for controlling electric currents" was granted on January 28, 1930.

According to the patent, his invention was for controlling the flow of electric current between two terminals of an electrically conducting solid by establishing a third potential between the terminals, particularly for the amplification of oscillating currents like those in radio communication.



Source: https://www.edn.com/electronics-blogs/edn-moments/4422371/Lilienfeld-patents-field-effect-

| lilienfeld controlling electric curr                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | € ♀                                                                       |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| Method and apparatus for controlling electric currents | US1745175A (United States) |
| Inventor | Lilienfeld Julius Edgar |
| Worldwide applications | 1925 CA; 1926 US |
| Priority (application US140363A) | 1925-10-22, priority to CA272437T |
| Application filed | 1926-10-08, by Lilienfeld Julius Edgar |
| Application granted; publication of US1745175A | 1930-01-28 |
| Anticipated expiration | 1947-01-28 |
| Application status (2020-02-16) | Expired |
| Classification H03F3/04 | Amplifiers with only discharge tubes or only semiconductor devices as amplifying elements with semiconductor devices only |
| Classification H01L29/00 | Semiconductor devices adapted for rectifying, amplifying, oscillating or switching, or capacitors or resistors with at least one potential-jump barrier or surface barrier, e.g. PN junction depletion layer or carrier concentration layer; Details of semiconductor bodies or of electrodes thereof; Multistep manufacturing processes therefor |
| Classification H01L29/78681 | Thin film transistors, i.e. transistors with a channel being at least partly a thin film having a semiconductor body comprising AIIIBV or AIIBVI or AIVBVI semiconductor materials, or Se or Te |


### Optimize Software and Expose New Hierarchical Parallelism

- Redesign software to boost performance on upcoming architectures
- Exploit new levels of parallelism and efficient data movement

### Architectural Specialization and Integration

- Use CMOS more effectively for specific workloads
- Integrate components to boost performance and eliminate inefficiencies
- Workload specific memory+storage system design

- Investigate new computational paradigms
  - Quantum
  - Neuromorphic
  - Advanced Digital
  - Emerging Memory Devices



### Pace of Architectural Specialization is Quickening

- Industry, lacking Moore's Law, will need to continue to differentiate products (to stay in business)
  - Use the same transistors differently to enhance performance
- Architectural design will become critically important
  - Dark Silicon
  - Address new parameters for benefits/curse of Moore's Law
- 50+ new companies focusing on hardware for Machine Learning



… announced on Thursday following Intel's acquisition of deep learning startup Nervana Systems earlier this year

http://www.theinquirer.net/inquirer/news/2477796/intels-nervai … ai-platform-takes-aim-at-nvidias-gpu-techology

GOOGLE BUILT ITS VERY OWN CHIPS TO POWER ITS AI BOTS



GOOGLE HAS DESIGNED its own computer chip for driving deep neural networks, an AI technology that is reinventing the way Internet services operate.

This morning at Google I/O, the centerpiece of the company's year, CEO Sundar Pichai said that Google has designed an ASIC, or application-specific integrated circuit, that's specific to deep neural nets. These are networks of

http://www.wired.com/2016/05/google-tpu-custom-chips



NEW AT AMAZON: ITS OWN CHIPS FOR CLOUD COMPUTING

TOM SIMONITE BUSINESS 11.27.18 08:12 PM

Amazon Web Services CEO Andy Jassy speaks at an event in San Francisco in 2017. DAVID PAUL MORRIS/BLOOMBERG/GETTY IMAGES

BIG SOFTWARE COMPANIES don't just stick to software any more—they build computer chips. The latest proof comes from Amazon, which announced late Monday that its cloud computing division has created its own chips to power customers' websites and other services. The chips, dubbed Graviton, are built around the same technology that powers smartphones and tablets. That approach has been much discussed in the cloud industry but never



https://fossbytes.com/nvidia-volta-gddr6-2018/









Xilinx ACAP

### Analysis of Apple A-\* SoCs







### **Intel Stratix 10 FPGA**

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group

- Intel Stratix 10 FPGA and four banks of DDR4 external memory
  - Board configuration: Nallatech 520 Network Acceleration Card
- Up to 10 TFLOPS of peak single precision performance
- 25MBytes of L1 cache @ up to 94 TBytes/s peak bandwidth
- 2X core performance gains over Arria® 10
- Quartus and OpenCL software (Intel SDK v18.1) for programming the FPGA
- Provides researchers access to an advanced FPGA/SoC environment
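The two peak figures quoted above can be combined into a rough on-chip machine balance (bytes of bandwidth per floating-point operation). The sketch below is only a back-of-the-envelope calculation using the slide's peak numbers. The function and constant names are illustrative assumptions, and sustained rates on real kernels will be lower than these peaks.

```python
# Peak figures quoted on the slide for the Intel Stratix 10 board.
PEAK_SP_FLOPS = 10e12            # up to 10 TFLOPS, single precision
PEAK_ONCHIP_BYTES_PER_S = 94e12  # up to 94 TBytes/s on-chip memory bandwidth
BYTES_PER_SP_WORD = 4            # one single-precision value


def machine_balance(bytes_per_s, flops_per_s):
    """Return bytes of peak bandwidth available per peak FLOP (illustrative helper)."""
    return bytes_per_s / flops_per_s


balance = machine_balance(PEAK_ONCHIP_BYTES_PER_S, PEAK_SP_FLOPS)
words_per_flop = balance / BYTES_PER_SP_WORD

print(f"on-chip balance: {balance:.2f} bytes/FLOP "
      f"({words_per_flop:.2f} single-precision words/FLOP)")
```

A balance well above one word per FLOP suggests that, at peak, on-chip memory can feed the arithmetic units. The four external DDR4 banks have far lower bandwidth, so data reuse on chip remains the key design concern.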





Mar 2019

### **NVIDIA Jetson AGX Xavier SoC**

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group

- NVIDIA Jetson AGX Xavier: high-performance system on a chip for autonomous machines
- Heterogeneous SoC contains:
  - Eight-core 64-bit ARMv8.2 CPU cluster (Carmel)
  - 1.4 CUDA TFLOPS (FP32) GPU with additional inference optimizations (Volta)
  - 11.4 DL TOPS (INT8) Deep learning accelerator (NVDLA)
  - 1.7 CV TOPS (INT8) 7-slot VLIW dual-processor Vision accelerator (PVA)
  - A set of multimedia accelerators (stereo, LDC, optical flow)
- Provides researchers access to an advanced high-performance SoC environment











Mar 2019

https://excl.ornl.gov/

## Qualcomm 855 SoC (SM8150P) Snapdragon™

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group



#### Kryo 485 (8-ARM Prime+BigLittle Cores)

Figure: Kryo 485 CPU cluster diagram showing A76 cores (256 KB cache each; one group labeled @ 2.42 GHz), A55 cores (128 KB cache each), and the DSU.

#### Hexagon 690 (DSP + AI)

- Quad threaded Scalar Core
- DSP + 4 Hexagon Vector Xccelerators
- New Tensor Xccelerator for AI
- Apps: AI, Voice Assistance, AV codecs

#### Adreno 640

- Vulkan, OpenCL, OpenGL ES 3.1
- Apps: HDR10+, HEVC, Dolby, etc
- Enables 8k-360° VR video playback
- 20% faster compared to Adreno 630



- Snapdragon X24 LTE (855 built-in) modem LTE Category 20
- Snapdragon X50 5G (external) modem (for 5G devices)
- Qualcomm Wi-Fi 6-ready mobile platform: (802.11ax-ready, 802.11ac Wave 2, 802.11ay, 802.11ad)
- Qualcomm 60 GHz Wi-Fi mobile platform: (802.11ay, 802.11ad)
- Bluetooth Version: 5.0
- Bluetooth Speed: 2 Mbps
- High accuracy location with dual-frequency GNSS.

#### Spectra 360 ISP

- New dedicated Image Signal Processor (ISP)
- Dual 14-bit CV-ISPs; 48MP @ 30fps single camera
- Hardware CV for object detection, tracking, stereo depth processing
- 6DoF XR Body tracking, H265, 4K60 HDR video capture, etc.

#### Qualcomm Development Board connected to HP Z820 (mcmurdo)



- Qualcomm board connected to the HP Z820 through USB
- Development environment: Android SDK/NDK
- Log in to the mcmurdo machine
  - `$ ssh -Y mcmurdo`
- Set up Android platform tools and development environment
  - `$ source /home/nqx/setup_android.source`
- Run Hello-world on ARM cores
  - `$ git clone https://code.ornl.gov/nqx/helloworld-android`
  - `$ make compile push run`
- Run OpenCL example on GPU
  - `$ git clone https://code.ornl.gov/nqx/opencl-img-processing`
  - Run Sobel edge detection
    - `$ make compile push run fetch`
- Log in to the Qualcomm development board shell
  - `$ adb shell`
  - `$ cd /data/local/tmp`

For more information or to apply for an account, visit https://excl.ornl.gov/
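For readers unfamiliar with the Sobel edge detection step run by the OpenCL example, here is a minimal CPU reference sketch of the operator in pure Python. It is not the code from the `opencl-img-processing` repository, which runs as an OpenCL kernel on the GPU. The function name `sobel`, the border handling, and the 5x5 test image are illustrative assumptions. A reference like this can be handy for checking a GPU result on a small image.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal (GX) and vertical (GY) gradients.
GX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
GY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel(image):
    """Return the gradient magnitude of a 2D grayscale image (list of rows).

    Border pixels are left at 0.0 because the 3x3 window does not fit there
    (an assumption of this sketch; other border policies are possible).
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += GX[dy][dx] * pixel
                    gy += GY[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


if __name__ == "__main__":
    # Hypothetical test image: dark on the left, bright on the right.
    img = [[0, 0, 0, 255, 255] for _ in range(5)]
    for row in sobel(img):
        print(" ".join(f"{value:6.0f}" for value in row))
```

Each output pixel depends only on its own 3x3 neighborhood, which is why the operator maps naturally onto one GPU work-item per pixel.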

### **Growing Open Source Hardware Movement Enables Rapid Chip Design**

[Figure: RISC-V ecosystem (RISC-V Summit, 2018). Software: Linux, BSD, FreeRTOS, SylixOS, and others. Open-source cores: Rocket, BOOM, RI5CY, Ariane, PicoRV32, Piccolo, SCR1, and others. Commercial cores: Andes, Bluespec, Cloudbear, Codasip, Cortus, C-Sky, Nuclei, SiFive, and others. In-house cores: Nvidia and others.]

[Figure: article excerpt, "Open-source computing: A new blueprint for microprocessors challenges the industry's giants. RISC-V is an alternative to proprietary designs" (print edition, Science and technology, Oct 3rd 2019).]

"An ISA is a standardised description of how a chip works at the most basic level, and instructions for writing software to run on it. To draw an analogy, a house might have two floors or three, five bedrooms or six, one bathroom or two. That is up to the architect. An ISA, however, is the equivalent of insisting that the same sorts of electrical sockets and water inlets and outlets be put in the same places in every appropriate room, so that an electrician or a plumber can find them instantly and carry the correct kit to connect to them."

## **DARPA ERI Programs Aiming for Agile (and Frequent) Chip Creation**





### Summary:

### **Transition Period will be Disruptive – Opportunities and Pitfalls Abound**

- New devices and architectures may not be hidden in traditional levels of abstraction
- Examples
  - A new type of CNT transistor may be completely hidden from higher levels
  - A new paradigm like quantum may require new architectures, programming models, and algorithmic approaches

| Layer       | Switch, 3D | NVM | Approximate | Neuro | Quantum |
|-------------|------------|-----|-------------|-------|---------|
| Application | 1          | 1   | 2           | 2     | 3       |
| Algorithm   | 1          | 1   | 2           | 3     | 3       |
| Language    | 1          | 2   | 2           | 3     | 3       |
| API         | 1          | 2   | 2           | 3     | 3       |
| Arch        | 1          | 2   | 2           | 3     | 3       |
| ISA         | 1          | 2   | 2           | 3     | 3       |
| Microarch   | 2          | 3   | 2           | 3     | 3       |
| FU          | 2          | 3   | 2           | 3     | 3       |
| Logic       | 3          | 3   | 2           | 3     | 3       |
| Device      | 3          | 3   | 2           | 3     | 3       |

Adapted from IEEE Rebooting Computing Chart



## **Department of Energy (DOE) Roadmap to Exascale Systems**

An impressive, productive lineup of *accelerated node* systems supporting DOE's mission



### **Frontier Continues the Accelerated Node Design**

- Partnership between ORNL, Cray, and AMD
- The Frontier system will be delivered in 2021
- Peak Performance greater than 1.5 EF
- Composed of more than 100 Cray Shasta cabinets
  - Connected by Slingshot<sup>™</sup> interconnect with adaptive routing, congestion control, and quality of service
- Accelerated Node Architecture:
  - One purpose-built AMD EPYC<sup>™</sup> processor
  - Four HPC and AI optimized Radeon Instinct<sup>™</sup> GPU accelerators
  - Fully connected with high speed AMD Infinity Fabric links
  - Coherent memory across the node
  - 100 GB/s injection bandwidth
  - Near-node NVM storage





### **Comparison of Titan, Summit, and Frontier Systems**

| System Specs            | Titan                                       | Summit                                              | Frontier                                                  |
|-------------------------|---------------------------------------------|-----------------------------------------------------|-----------------------------------------------------------|
| Peak                    | 27 PF                                       | 200 PF                                              | ~1.5 EF                                                   |
| # cabinets              | 200                                         | 256                                                 | > 100                                                     |
| Node                    | 1 AMD Opteron CPU<br>1 NVIDIA Kepler GPU    | 2 IBM POWER9™ CPUs<br>6 NVIDIA Volta GPUs           | 1 AMD EPYC CPU<br>4 AMD Radeon Instinct GPUs              |
| On-node<br>interconnect | PCI Gen2<br>No coherence<br>across the node | NVIDIA NVLINK<br>Coherent memory<br>across the node | AMD Infinity Fabric<br>Coherent memory<br>across the node |
| System<br>Interconnect  | Cray Gemini network<br>6.4 GB/s             | Mellanox Dual-port EDR IB network<br>25 GB/s        | Cray four-port Slingshot network<br>100 GB/s              |
| Topology                | 3D Torus                                    | Non-blocking Fat Tree                               | Dragonfly                                                 |
| Storage                 | 32 PB, 1 TB/s, Lustre<br>Filesystem         | 250 PB, 2.5 TB/s, IBM Spectrum<br>Scale™ with GPFS™ | 2-4x performance and capacity of Summit's I/O subsystem.  |
| On-node NVM             | No                                          | Yes                                                 | Yes                                                       |
| Power                   | 9 MW                                        | 13 MW                                               | 29 MW                                                     |
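The peak and power rows of the table can be combined into a rough peak-efficiency figure. The sketch below is an illustration only: it reads the power row as megawatts, treats Frontier's "~1.5 EF" as 1500 PF, and uses peak rather than sustained performance. The names `systems` and `gflops_per_watt` are introduced here for the example.

```python
# Peak performance and power taken from the comparison table above.
# Power values are read as megawatts (MW); Frontier's peak is approximate.
systems = {
    "Titan":    {"peak_pf": 27.0,   "power_mw": 9.0},
    "Summit":   {"peak_pf": 200.0,  "power_mw": 13.0},
    "Frontier": {"peak_pf": 1500.0, "power_mw": 29.0},  # ~1.5 EF = 1500 PF
}

def gflops_per_watt(peak_pf, power_mw):
    """Convert PF per MW into GFLOP/s per watt.

    1 PF = 1e6 GFLOP/s and 1 MW = 1e6 W, so PF/MW equals GFLOP/s per watt.
    """
    return (peak_pf * 1e6) / (power_mw * 1e6)

efficiency = {name: gflops_per_watt(**spec) for name, spec in systems.items()}

for name, value in efficiency.items():
    print(f"{name:9s} {value:6.1f} GFLOP/s per watt (peak)")
```

Under these assumptions, peak efficiency grows from about 3 GFLOP/s per watt (Titan) to about 15 (Summit) and about 52 (Frontier).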



### **Complex architectures yield...**

SIMD

NUMA, HBM

Resource contention

Locality



Cores: OpenACC, CUDA, OpenCL, OpenMP4, ...

Memory use, coalescing

Data orchestration

Fine grained parallelism

Hardware features

### Complex Programming Models




### During this Sixth Wave transition, Complexity is our major challenge!

### Design

How do we design future systems so that they are better than current systems on important applications?

- Simulation and modeling are more difficult
- Entirely possible that the new system will be slower than the old system!
- Expect 'disaster' procurements

### Programmability

How do we design applications with some level of performance portability?

- Software lasts much longer than transient hardware platforms
- Proper abstractions for flexibility and efficiency
- Adapt or die



## The FTG Vision | Programming Systems

| Layer | Components |
|-------|------------|
| Applications | Science and Engineering (e.g., CFD, Materials, Fusion); Streaming (e.g., SW Radio, Experimental instrument); Sensing (e.g., SAR, vision); Deep learning (e.g., CNN); Analytics (e.g., graph analysis); Robotics (e.g., sense and react) |
| Programming Systems | Compiler; Domain Specific Languages; Just-in-time Compilation; Metaprogramming; Scripting; Libraries; Autotuning |
| Runtime and Operating Systems | Discovery; Task Scheduling and Mapping; Data Orchestration; I/O; Synchronization; Load balancing |
| Architectures | Multicore CPU; GPU; FPGA; AI Accelerator; SoC; DSP; Deep Memory; Persistent Memory; Neuromorphic |
| Cross-cutting goals | Performance; Productivity; Energy Efficiency |

### What more to say ?!?!?





## Directive-based Strategy with OpenARC: Open Accelerator Research Compiler

- Open-Sourced, High-Level Intermediate Representation (HIR)-Based, Extensible Compiler Framework.
  - Perform source-to-source translation from OpenACC C to target accelerator models.
    - Support full features of OpenACC V1.0 ( + array reductions and function calls)
    - Support both CUDA and OpenCL as target accelerator models
  - Provide common runtime APIs for various backends
  - Can be used as a research framework for various studies on directive-based accelerator computing.
    - Built on top of Cetus compiler framework, equipped with various advanced analysis/transformation passes and built-in tuning tools.
    - OpenARC's IR provides an AST-like syntactic view of the source program, easy to understand, access, and transform the input program.






## FPGAs | Approach

- Design and implement an OpenACC-to-FPGA translation framework, which is the first work to use a standard and portable directive-based, high-level programming system for FPGAs.
- Propose FPGA-specific optimizations and novel pragma extensions to improve performance.
- Evaluate the functional and performance portability of the framework across diverse architectures (Altera FPGA, NVIDIA GPU, AMD GPU, and Intel Xeon Phi).




### **FPGA OpenCL Architecture**





### **Kernel-Pipelining Transformation Optimization**

- Kernel execution model in OpenACC
  - Device kernels can communicate with each other only through the device global memory.
  - Synchronizations between kernels are at the granularity of a kernel execution.
- Altera OpenCL channels
  - Allows passing data between kernels and synchronizing kernels with high efficiency and low latency



Kernel communications through global memory in OpenACC





## **Kernel-Pipelining Transformation Optimization (2)**

(a) Input OpenACC code

```
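// Keep a, b, and c resident on the device for the whole region.
#pragma acc data copyin (a) create (b) copyout (c)
{
    // Kernel 1: writes b to device global memory.
    #pragma acc kernels loop gang worker present (a, b)
    for(i=0; i<N; i++) { b[i] = a[i]*a[i]; }
    // Kernel 2: reads b back from device global memory.
    #pragma acc kernels loop gang worker present (b, c)
    for(i=0; i<N; i++) { c[i] = b[i]; }
}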
```



(b) Altera OpenCL code with channels

```
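// Enable the Altera channels extension before declaring a channel.
#pragma OPENCL EXTENSION cl_altera_channels : enable
channel float pipe_b;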
__kernel void kernel1(__global float* a) {
    int i = get_global_id(0);
    write_channel_altera(pipe_b, a[i]*a[i]);
}
__kernel void kernel2(__global float* c) {
    int i = get_global_id(0);
    c[i] = read_channel_altera(pipe_b);
}
```
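The effect of replacing the global-memory buffer `b` with a channel can be imitated on a host CPU. The following is only a Python analogy, not FPGA code: a bounded `queue.Queue` stands in for `pipe_b`, two threads stand in for the two kernels, and the channel depth of 2 is an arbitrary choice made for this sketch.

```python
import queue
import threading

N = 8
a = [float(i) for i in range(N)]
c = [0.0] * N

# Bounded FIFO standing in for the `pipe_b` channel; depth 2 is arbitrary.
pipe_b = queue.Queue(maxsize=2)

def kernel1():
    """Producer: squares each input and writes it to the channel."""
    for i in range(N):
        pipe_b.put(a[i] * a[i])   # blocks while the channel is full

def kernel2():
    """Consumer: reads each value from the channel as soon as it arrives."""
    for i in range(N):
        c[i] = pipe_b.get()       # blocks while the channel is empty

threads = [threading.Thread(target=kernel1), threading.Thread(target=kernel2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(c)
```

The consumer starts working before the producer has finished, and no full-size intermediate array `b` is ever allocated, which is the point of the kernel-pipelining transformation.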





## **Kernel-Pipelining Transformation Optimization (3)**

(a) Input OpenACC code





### **FPGA-specific Optimizations**

- Single work-item
- Collapse
- Reduction
- Sliding window
- (Branch-variant code motion)
- (Custom unrolling)



### **Overall Performance of OpenARC FPGA Evaluation**



FPGAs prefer applications with deep execution pipelines (e.g., FFT-1D and FFT-2D), where they perform much better than other accelerators.

For traditional HPC applications with abundant parallel floating-point operations, it seems to be difficult for FPGAs to beat the performance of other accelerators, even though FPGAs can be much more power-efficient.

- The tested FPGA does not contain dedicated, embedded floating-point cores, while the other accelerators have fully optimized floating-point computation units.

Current and upcoming high-end FPGAs are equipped with hardened floating-point operators, whose performance will be comparable to other accelerators, while remaining power-efficient.



### The FTG Vision | Runtime and Operating Systems

| Layer | Components |
|-------|------------|
| Programming Systems | Compiler; Just-in-time Compilation; Metaprogramming; Scripting; Libraries; Autotuning |
| Runtime and Operating Systems | Discovery; Task Scheduling and Mapping; Data Orchestration; I/O; Synchronization; Load balancing |
| Architectures | Multicore CPU; GPU; FPGA; AI Accelerator; SoC; DSP; Deep Memory; Persistent Memory; Neuromorphic |
| Cross-cutting goals | Performance; Productivity; Energy Efficiency |

### **IRIS: Mapping Strategy for Heterogeneous Architectures and Native Programming Models**





### **IRIS: An Intelligent Runtime System for Extremely Heterogeneous Architectures**

- Provide programmers a unified programming environment to write portable code across heterogeneous architectures (and preferred programming systems)
- Orchestrate diverse programming systems (OpenCL, CUDA, HIP, OpenMP for CPU) in a single application
  - OpenCL
    - NVIDIA GPU, AMD GPU, ARM GPU, Qualcomm GPU, Intel CPU, Intel Xeon Phi, Intel FPGA, Xilinx FPGA
  - CUDA
    - NVIDIA GPU
  - HIP
    - AMD GPU
  - OpenMP for CPU
    - Intel CPU, AMD CPU, PowerPC CPU, ARM CPU, Qualcomm CPU
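The mapping listed above can be written down as a small lookup table. This is a plain-Python illustration of the slide's content, not part of IRIS; `BACKENDS` and `backends_for` are names made up for the sketch.

```python
# Programming systems and the devices each one targets, as listed above.
BACKENDS = {
    "OpenCL": ["NVIDIA GPU", "AMD GPU", "ARM GPU", "Qualcomm GPU", "Intel CPU",
               "Intel Xeon Phi", "Intel FPGA", "Xilinx FPGA"],
    "CUDA": ["NVIDIA GPU"],
    "HIP": ["AMD GPU"],
    "OpenMP": ["Intel CPU", "AMD CPU", "PowerPC CPU", "ARM CPU", "Qualcomm CPU"],
}

def backends_for(device):
    """Return the programming systems that list `device` as a target."""
    return [name for name, devices in BACKENDS.items() if device in devices]

nvidia = backends_for("NVIDIA GPU")
intel_cpu = backends_for("Intel CPU")
print(nvidia)
print(intel_cpu)
```

Several devices are reachable through more than one programming system, which is why a runtime that orchestrates them needs some way to choose among backends.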





## **The IRIS Architecture**

- Platform Model
  - A single-node system equipped with host CPUs and multiple compute devices (GPUs, FPGAs, Xeon Phis, and multicore CPUs)
- Memory Model
  - Host memory + shared device memory
  - All compute devices share the device memory
- Execution Model
  - DAG-style task parallel execution across all available compute devices
- Programming Model
  - High-level OpenACC, OpenMP4, SYCL\* (\* planned)
  - Low-level C/Fortran/Python IRIS host-side runtime API + OpenCL/CUDA/HIP/OpenMP kernels (w/o compiler support)





## Supported Architectures and Programming Systems by IRIS

| ExCL* Systems       | Oswald                        | Summit-node   | Radeon                        | Xavier       | Snapdragon          |
|---------------------|-------------------------------|---------------|-------------------------------|--------------|---------------------|
| CPU                 | Intel Xeon                    | IBM Power9    | Intel Xeon                    | ARMv8        | Qualcomm Kryo       |
| Programming Systems | Intel OpenMP<br>Intel OpenCL  | IBM XL OpenMP | Intel OpenMP<br>Intel OpenCL  | GNU GOMP     | Android NDK OpenMP  |
| GPU                 | NVIDIA P100                   | NVIDIA V100   | AMD Radeon VII                | NVIDIA Volta | Qualcomm Adreno 640 |
| Programming Systems | NVIDIA CUDA<br>NVIDIA OpenCL  | NVIDIA CUDA   | AMD HIP<br>AMD OpenCL         | NVIDIA CUDA  | Qualcomm OpenCL     |
| FPGA                | Intel/Altera Stratix 10       |               |                               |              |                     |
| Programming Systems | Intel OpenCL                  |               |                               |              |                     |



### **IRIS Booting on Various Platforms**




### **Task Scheduling in IRIS**

- A task
  - A scheduling unit
  - Contains multiple in-order commands
    - Kernel launch command
    - Memory copy command (device-to-host, host-to-device)
  - May have DAG-style dependencies with other tasks
  - Enqueued to the application task queue with a device selection policy
    - Available device selection policies
      - Specific Device (compute device #)
      - Device Type (CPU, GPU, FPGA, XeonPhi)
      - Profile-based
      - Locality-aware
      - Ontology-based
      - Performance models (Aspen)
      - Any, All, Random, 3rd-party users' custom policies
- The task scheduler dispatches the tasks in the application task queue to available compute devices
  - Select the optimal target compute device according to the task's device selection policy
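The scheduling flow described above (tasks with DAG-style dependencies, each carrying a device selection policy) can be sketched in a few lines. This is a toy model under stated assumptions, not the IRIS scheduler: the `Task` class, the policy tuples, and the device list are all hypothetical, only three of the listed policies are modeled, and tasks are dispatched one at a time rather than concurrently.

```python
from collections import deque

# Hypothetical device inventory for the sketch.
DEVICES = [
    {"id": 0, "type": "CPU"},
    {"id": 1, "type": "GPU"},
    {"id": 2, "type": "FPGA"},
]

class Task:
    def __init__(self, name, policy, depends_on=()):
        self.name = name
        self.policy = policy            # ("device", 2), ("type", "GPU"), or ("any",)
        self.depends_on = list(depends_on)

def select_device(task, devices):
    """Resolve a task's device selection policy to one device id."""
    kind = task.policy[0]
    if kind == "device":                # specific device number
        return task.policy[1]
    if kind == "type":                  # first device of the requested type
        for dev in devices:
            if dev["type"] == task.policy[1]:
                return dev["id"]
        raise ValueError(f"no device of type {task.policy[1]}")
    if kind == "any":                   # simplest possible choice: first device
        return devices[0]["id"]
    raise ValueError(f"unknown policy {kind}")

def schedule(tasks, devices):
    """Dispatch tasks in dependency order (Kahn's topological sort)."""
    pending = {t.name: len(t.depends_on) for t in tasks}
    dependents = {t.name: [] for t in tasks}
    by_name = {t.name: t for t in tasks}
    for t in tasks:
        for dep in t.depends_on:
            dependents[dep].append(t.name)
    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append((name, select_device(by_name[name], devices)))
        for child in dependents[name]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order

tasks = [
    Task("scale", ("type", "GPU")),
    Task("add", ("type", "CPU"), depends_on=["scale"]),
    Task("log", ("any",), depends_on=["add"]),
]
plan = schedule(tasks, DEVICES)
print(plan)
```

A task becomes ready only when all of its predecessors have been dispatched, and the policy is consulted at dispatch time. A real runtime would additionally run independent ready tasks in parallel.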





### **SAXPY Example on Xavier**

- Computation
  - S[] = A \* X[] + Y[]
- Two tasks
  - S[] = A \* X[] on NVIDIA GPU (CUDA)
  - S[] += Y[] on ARM CPU (OpenMP)
    - S[] is shared between two tasks
    - Read-after-write (RAW), true dependency
- Low-level Python IRIS host code + CUDA/OpenMP kernels
  - saxpy.py
  - kernel.cu
  - kernel.openmp.h
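Before moving to the IRIS version, the two-task decomposition can be checked with a host-only reference. The sketch below uses plain Python lists and a small `SIZE` chosen for readability; the function names mirror the kernel names used later, but nothing here calls IRIS.

```python
SIZE = 8
A = 10.0
x = [float(i) for i in range(SIZE)]
y = [float(i) for i in range(SIZE)]
s = [0.0] * SIZE

def saxpy0(s, a, x):
    """Task 0: S[] = A * X[] (the step mapped to the GPU above)."""
    for i in range(len(s)):
        s[i] = a * x[i]

def saxpy1(s, y):
    """Task 1: S[] += Y[] (the step mapped to the CPU above)."""
    for i in range(len(s)):
        s[i] += y[i]

# Task 1 reads the S[] written by task 0 (read-after-write), so order matters.
saxpy0(s, A, x)
saxpy1(s, y)

expected = [A * xi + yi for xi, yi in zip(x, y)]
assert s == expected
print(s)
```

Swapping the two calls would give a different result, which is the true dependency the scheduler has to respect.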





### SAXPY: Python host code & CUDA kernel code

#### saxpy.py (1/2)

```
#!/usr/bin/env python

import iris
import numpy as np
import sys

iris.init()

SIZE = 1024
A = 10.0

x = np.arange(SIZE, dtype=np.float32)
y = np.arange(SIZE, dtype=np.float32)
s = np.arange(SIZE, dtype=np.float32)

print('X', x)
print('Y', y)

# Device memory objects sized to match the host arrays.
mem_x = iris.mem(x.nbytes)
mem_y = iris.mem(y.nbytes)
mem_s = iris.mem(s.nbytes)
```

#### saxpy.py (2/2)

```
kernel0 = iris.kernel("saxpy0")
kernel0.setmem(0, mem_s, iris.iris_w)
kernel0.setint(1, A)
kernel0.setmem(2, mem_x, iris.iris_r)

off = [ 0 ]
ndr = [ SIZE ]

# Task 0: copy X to the device, then run saxpy0 on the GPU.
task0 = iris.task()
task0.h2d_full(mem_x, x)
task0.kernel(kernel0, 1, off, ndr)
task0.submit(iris.iris_gpu)

kernel1 = iris.kernel("saxpy1")
kernel1.setmem(0, mem_s, iris.iris_rw)
kernel1.setmem(1, mem_y, iris.iris_r)

# Task 1: copy Y, run saxpy1 on the CPU, then copy S back to the host.
task1 = iris.task()
task1.h2d_full(mem_y, y)
task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
task1.submit(iris.iris_cpu)

print('S =', A, '* X + Y', s)

iris.finalize()
```

#### kernel.cu (CUDA)

```
extern "C" __global__ void saxpy0(float* S, float A, float* X) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    S[id] = A * X[id];
}

extern "C" __global__ void saxpy1(float* S, float* Y) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    S[id] += Y[id];
}
```



### SAXPY: Python host code & OpenMP kernel code

#### saxpy.py (1/2)

```
#!/usr/bin/env python

import iris
import numpy as np
import sys

iris.init()

SIZE = 1024
A = 10.0

x = np.arange(SIZE, dtype=np.float32)
y = np.arange(SIZE, dtype=np.float32)
s = np.arange(SIZE, dtype=np.float32)

print('X', x)
print('Y', y)

# Device memory objects sized to match the host arrays.
mem_x = iris.mem(x.nbytes)
mem_y = iris.mem(y.nbytes)
mem_s = iris.mem(s.nbytes)
```

#### saxpy.py (2/2)

```
kernel0 = iris.kernel("saxpy0")
kernel0.setmem(0, mem_s, iris.iris_w)
kernel0.setint(1, A)
kernel0.setmem(2, mem_x, iris.iris_r)

off = [ 0 ]
ndr = [ SIZE ]

# Task 0: copy X to the device, then run saxpy0 on the GPU.
task0 = iris.task()
task0.h2d_full(mem_x, x)
task0.kernel(kernel0, 1, off, ndr)
task0.submit(iris.iris_gpu)

kernel1 = iris.kernel("saxpy1")
kernel1.setmem(0, mem_s, iris.iris_rw)
kernel1.setmem(1, mem_y, iris.iris_r)

# Task 1: copy Y, run saxpy1 on the CPU, then copy S back to the host.
task1 = iris.task()
task1.h2d_full(mem_y, y)
task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
task1.submit(iris.iris_cpu)

print('S =', A, '* X + Y', s)

iris.finalize()
```

#### kernel.openmp.h (OpenMP)

```
#include <iris/iris_openmp.h>

static void saxpy0(float* S, float A, float* X, IRIS_OPENMP_KERNEL_ARGS) {
    int id;
    #pragma omp parallel for shared(S, A, X) private(id)
    IRIS_OPENMP_KERNEL_BEGIN
    S[id] = A * X[id];
    IRIS_OPENMP_KERNEL_END
}

static void saxpy1(float* S, float* Y, IRIS_OPENMP_KERNEL_ARGS) {
    int id;
    #pragma omp parallel for shared(S, Y) private(id)
    IRIS_OPENMP_KERNEL_BEGIN
    S[id] += Y[id];
    IRIS_OPENMP_KERNEL_END
}
```



### **Memory Consistency Management**

#### saxpy.py (1/2)

```
#!/usr/bin/env python

import iris
import numpy as np
import sys

iris.init()

SIZE = 1024
A = 10.0

x = np.arange(SIZE, dtype=np.float32)
y = np.arange(SIZE, dtype=np.float32)
s = np.arange(SIZE, dtype=np.float32)

print('X', x)
print('Y', y)

# Device memory objects sized to match the host arrays.
mem_x = iris.mem(x.nbytes)
mem_y = iris.mem(y.nbytes)
mem_s = iris.mem(s.nbytes)
```

#### saxpy.py (2/2)

mem\_s is shared between GPU and CPU

```
kernel0 = iris.kernel("saxpy0")
kernel0.setmem(0, mem_s, iris.iris_w)
kernel0.setint(1, A)
kernel0.setmem(2, mem_x, iris.iris_r)

off = [ 0 ]
ndr = [ SIZE ]

# Task 0: copy X to the device, then run saxpy0 on the GPU.
task0 = iris.task()
task0.h2d_full(mem_x, x)
task0.kernel(kernel0, 1, off, ndr)
task0.submit(iris.iris_gpu)

kernel1 = iris.kernel("saxpy1")
kernel1.setmem(0, mem_s, iris.iris_rw)
kernel1.setmem(1, mem_y, iris.iris_r)

# Task 1: copy Y, run saxpy1 on the CPU, then copy S back to the host.
task1 = iris.task()
task1.h2d_full(mem_y, y)
task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
task1.submit(iris.iris_cpu)

print('S =', A, '* X + Y', s)

iris.finalize()
```
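Because `mem_s` is written on the GPU and then read and updated on the CPU, something has to make the CPU see the GPU's result. One simple way a runtime could track this is sketched below. It is an assumption-laden illustration, not a description of how IRIS implements consistency: the `SharedBuffer` class, its `acquire` method, and the device names are hypothetical.

```python
class SharedBuffer:
    """Tracks which devices hold an up-to-date copy of one buffer."""

    def __init__(self, name):
        self.name = name
        self.valid = set()        # devices whose copy is current
        self.transfers = []       # log of (source, destination) copies

    def acquire(self, device, mode):
        """Prepare the buffer for a kernel on `device`; mode is 'r', 'w' or 'rw'."""
        if "r" in mode and self.valid and device not in self.valid:
            # The reader lacks a current copy: fetch it from a device that has one.
            source = sorted(self.valid)[0]
            self.transfers.append((source, device))
            self.valid.add(device)
        if "w" in mode:
            # A write makes every other copy stale.
            self.valid = {device}

mem_s = SharedBuffer("mem_s")
mem_s.acquire("gpu", "w")     # task0: saxpy0 writes S on the GPU
mem_s.acquire("cpu", "rw")    # task1: saxpy1 reads and updates S on the CPU

print(mem_s.transfers, mem_s.valid)
```

In this model the access modes passed to `setmem` (read, write, read-write) are exactly the information needed to decide when a copy between devices is required.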



### **Locality-aware Device Selection Policy**

#### saxpy.py (1/2)

```
#!/usr/bin/env python

import iris
import numpy as np
import sys

iris.init()

SIZE = 1024
A = 10.0

x = np.arange(SIZE, dtype=np.float32)
y = np.arange(SIZE, dtype=np.float32)
s = np.arange(SIZE, dtype=np.float32)

print('X', x)
print('Y', y)

# Device memory objects sized to match the host arrays.
mem_x = iris.mem(x.nbytes)
mem_y = iris.mem(y.nbytes)
mem_s = iris.mem(s.nbytes)
```

#### saxpy.py (2/2)

```
kernel0 = iris.kernel("saxpy0")
kernel0.setmem(0, mem_s, iris.iris_w)
kernel0.setint(1, A)
kernel0.setmem(2, mem_x, iris.iris_r)

off = [ 0 ]
ndr = [ SIZE ]

# Task 0: copy X to the device, then run saxpy0 on the GPU.
task0 = iris.task()
task0.h2d_full(mem_x, x)
task0.kernel(kernel0, 1, off, ndr)
task0.submit(iris.iris_gpu)

kernel1 = iris.kernel("saxpy1")
kernel1.setmem(0, mem_s, iris.iris_rw)
kernel1.setmem(1, mem_y, iris.iris_r)

# Task 1: let the locality-aware policy (iris_data) choose the device.
task1 = iris.task()
task1.h2d_full(mem_y, y)
task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
task1.submit(iris.iris_data)

print('S =', A, '* X + Y', s)

iris.finalize()
```

iris\_data selects the device that requires minimum data transfer to execute the task
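The minimum-data-transfer rule can be illustrated with a small cost function. This is a sketch of the idea only, not the IRIS implementation: the `residency` map, the helper names, and the treatment of host memory as distinct from the `cpu` compute device are assumptions made for the example.

```python
def bytes_to_move(task_buffers, device, residency):
    """Sum the sizes of the task's buffers that are not already on `device`."""
    return sum(size for name, size in task_buffers.items()
               if residency.get(name) != device)

def select_min_transfer(task_buffers, devices, residency):
    """Pick the device needing the least data movement; ties go to the first listed."""
    return min(devices, key=lambda d: bytes_to_move(task_buffers, d, residency))

SIZE = 1024
nbytes = SIZE * 4                        # float32 elements

# Assumed state after task0: mem_s lives on the GPU, mem_y only on the host.
residency = {"mem_s": "gpu", "mem_y": "host"}
task1_buffers = {"mem_s": nbytes, "mem_y": nbytes}

choice = select_min_transfer(task1_buffers, ["cpu", "gpu"], residency)
print(choice, bytes_to_move(task1_buffers, choice, residency))
```

Under these assumptions task1 would be placed on the GPU, because `mem_s` is already there and only `mem_y` has to be moved.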






### The FTG Vision

| Layer | Components |
|-------|------------|
| Applications | Science and Engineering (e.g., CFD, Materials, Fusion); Streaming (e.g., SW Radio, Experimental instrument); Sensing (e.g., SAR, vision); Deep learning (e.g., CNN); Analytics (e.g., graph analysis); Robotics (e.g., sense and react) |
| Programming Systems | Compiler; Domain Specific Languages; Just-in-time Compilation; Metaprogramming; Scripting; Libraries; Autotuning |
| Runtime and Operating Systems (IRIS) | Discovery; Task Scheduling and Mapping; Data Orchestration; I/O; Synchronization; Load balancing |
| Architectures | Multicore CPU; GPU; FPGA; AI Accelerator; SoC; DSP; Deep Memory; Persistent Memory; Neuromorphic |
| Cross-cutting goals | Performance; Productivity; Energy Efficiency |

## Recap

- Motivation: Recent trends in computing paint an ambiguous future
  - Multiple architectural dimensions are being (dramatically) redesigned: Processors, node design, memory systems, I/O
  - Complexity is our main challenge
- Applications and software systems across many areas are all reaching a state of crisis
  - Need a focus on performance portability
- ORNL FTG investigating design and programming challenges for these trends
  - Performance modeling and ontologies
  - Performance portable compilation to many different heterogeneous architectures/SoCs
  - Intelligent scheduling system to automate discovery, device selection, and data movement
  - Targeting wide variety of existing and future architectures (DSSoC and others)

### Visit us

- We host interns and other visitors year round
  - Faculty, grad, undergrad, high school, industry
- Jobs in FTG
  - Postdoctoral Research Associate in Computer Science
  - Software Engineer
  - Computer Scientist
  - Visit https://jobs.ornl.gov
- Contact me: vetter@ornl.gov



### **Final Report on Workshop on Extreme Heterogeneity**

- Maintaining and improving programmer productivity
  - Flexible, expressive, programming models and languages
  - Intelligent, domain-aware compilers and tools
  - Composition of disparate software components
- Managing resources intelligently
  - Automated methods using introspection and machine learning
  - Optimize for performance, energy efficiency, and availability
- Modeling & predicting performance
  - Evaluate impact of potential system designs and application mappings
  - Model-automated optimization of applications
- Enabling reproducible science despite non-determinism & asynchrony
  - Methods for validation on non-deterministic architectures
  - Detection and mitigation of pervasive faults and errors
- Facilitating Data Management, Analytics, and Workflows
  - Mapping of science workflows to heterogeneous hardware and software services
  - Adapting workflows and services to meet facility-level objectives through learning approaches









https://orau.gov/exheterogeneity2018/

https://doi.org/10.2172/1473756


# **Bonus Material**

