The performance improvement of conventional processors has begun to stagnate in recent years. Because of this, researchers are looking for new ways to improve the performance of computing systems. Heterogeneous systems have turned out to be a powerful option. In the context of this talk, a heterogeneous system consists of a software-programmable processor and an FPGA-based configurable hardware accelerator.
Due to their increased complexity, heterogeneous systems are harder to develop applications for than conventional systems based on a software-programmable processor. Different languages have to be used to program the software and hardware parts, and additional specialised hardware knowledge is required. Both factors increase the development cost.
This work presents the compiler framework Nymble, which allows a heterogeneous system to be programmed using only a single high-level language. In the high-level language, the developer only has to select which parts of the application should be executed in hardware. Nymble then generates a program for the software processor, the configuration of the hardware, and all interfaces between software and hardware.
To hide long memory access latencies, this talk presents an execution model which allows the simultaneous execution of multiple threads in a single accelerator. Additionally, the model enables threads to be dynamically reordered at specific points in the common accelerator pipeline. This capability is used to let other (non-waiting) threads overtake a thread which is waiting for a memory access. Thus, these other threads can execute their calculations independently of the waiting thread to bridge the latency of memory accesses.
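The effect of thread reordering can be illustrated with a toy cycle-level simulation. This is only a minimal sketch under strong simplifying assumptions, not the Nymble execution model itself: the function `simulate`, the single shared issue slot, and the representation of each thread as a list of operation latencies are all hypothetical and chosen for illustration.

```python
def simulate(threads, allow_overtaking):
    """Return the cycle at which all threads have finished (toy model).

    threads: list of lists of operation latencies in cycles
             (1 = simple computation, larger values = memory access).
    The shared pipeline issues at most one operation per cycle. A thread
    may issue its next operation only after its previous one completed.
    """
    ready_at = [0] * len(threads)   # cycle at which each thread may issue again
    next_op = [0] * len(threads)    # index of each thread's next operation
    order = [t for t, ops in enumerate(threads) if ops]  # queue of thread ids
    cycle = 0
    finish = 0
    while order:
        # Without overtaking, only the thread at the head of the queue may
        # issue; with overtaking, the first ready thread in the queue may.
        window = order if allow_overtaking else order[:1]
        chosen = next((t for t in window if ready_at[t] <= cycle), None)
        if chosen is not None:
            order.remove(chosen)
            latency = threads[chosen][next_op[chosen]]
            next_op[chosen] += 1
            ready_at[chosen] = cycle + latency
            finish = max(finish, ready_at[chosen])
            if next_op[chosen] < len(threads[chosen]):
                order.append(chosen)  # thread re-enters at the back of the queue
        cycle += 1
    return finish


# Thread 0 starts with a long memory access; thread 1 only computes.
workload = [[10, 1], [1, 1, 1]]
in_order = simulate(workload, allow_overtaking=False)   # 13 cycles
reordered = simulate(workload, allow_overtaking=True)   # 11 cycles
print(in_order, reordered)
```

In this example, the computing thread is blocked behind the waiting thread when overtaking is disabled, whereas with overtaking it finishes its work during the memory access, so the total runtime is bounded by the waiting thread alone.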
The presented execution model dynamically spreads multiple threads over the pipeline. This results in a higher utilisation of the available hardware resources. Furthermore, the simultaneous execution of multiple threads can achieve a throughput similar to that of multiple copies of a single-threaded accelerator running in parallel.
The execution model thus makes it possible to combine the improved throughput of multiple copies with the increased efficiency of simultaneous threads in a single accelerator. Thread reordering allows the new model to be used effectively with a cached shared memory.
In a comparison between four copies of a single-threaded accelerator and a multi-threaded accelerator with four threads (both created by Nymble), a resource efficiency improvement of up to 2.6x can be achieved. At the same time, four simultaneous threads can be up to 4x as fast as four threads executed consecutively on a single accelerator. Compared to other, more optimised compilers, Nymble can still achieve up to 2x faster runtime at 1.5x the resource efficiency.