Runtime adaptive hardware/software execution in complex heterogeneous systems

Suriano, Leonardo

Supervised by:

Eduardo Torre Arnanz, Supervisor

Defended at: Universidad Politécnica de Madrid

Defense date: 5 February 2021

Examination committee:

Juan Carlos López López, Chair
Jorge Portilla Berrueco, Secretary
Eduardo Juárez Martínez, Member
Jesús Lázaro Arrotegui, Member
Karol Desnos, Member

Type: Thesis

Teseo: 649779

Abstract

Nowadays, it is indisputable that society is in the era of the IoT and Industry 4.0. Everyday life relies on electronic devices (e.g., mobile phones, smartwatches, intelligent video surveillance cameras, et cetera). People's growing needs are pushing the development of electronic devices to a point that was unimaginable when, in 1970, the first microprocessor appeared. The tendency is clear: to always carry as much portable electronic capability as possible (communication, sensors, et cetera). The new generation of embedded computer systems should be portable, wearable, and offer the highest computing power using the least energy possible. The market analysis of the new generation of electronic platforms (reported in Chapter 1) will show that greater computational capability in smaller and less power-hungry devices is nowadays achievable. Traditionally, this goal was pursued by increasing the number of transistors and the frequency of digital circuits. However, during the last 20 years, the same objective has been attained by embedding more heterogeneous Processing Elements (PEs) on the same chip. For this reason, Multiprocessor Systems-on-Chip (MPSoCs) that combine software processing cores with programmable hardware acceleration are currently gaining market share in the embedded device domain, which is the context of this thesis. The trend points to a growing complexity of the hardware. At the same time, an application running on any of these new platforms must be able to exploit the hardware capabilities offered. Therefore, the use of these heterogeneous MPSoCs comes at the price of reduced productivity, usually imposed by the lack of efficient hardware/software co-design methods and tools that exploit parallelism efficiently.
On the other hand, an embedded device is usually part of a bigger system, generally called a Cyber-Physical System to stress the coexistence of a cyber part (for computational purposes) directly and tightly connected to the physical world by means of sensors and actuators. Section 1.1.2, which analyzes the main characteristics of these complex systems, will highlight that self-adaptation is required whenever run-time dynamism is necessary to react to changing external stimuli (for instance, to face newly detected adverse environmental conditions). The self-adaptation feature in a Cyber-Physical System must ensure the capability of adjusting its own structure and behavior at run-time. Thus, the adaptation can profoundly affect both the application (i.e., the software) and the hardware infrastructure. This motivates the proposal of this thesis: the development of a method for efficiently designing self-adaptive systems for complex heterogeneous devices, including hardware reconfiguration. This main task has several implications, which define the Ph.D. objectives in Section 1.3. A modern electronic system is always an extraordinary symbiosis of hardware shrewdly orchestrated by software. As such, both must be considered together from the very first phase of the design. Chapter 2 will analyze the state of the art of three crucial aspects of the thesis: the Models of Computation (MoCs), the prototyping techniques for hardware/software co-design, and modern heterogeneous hardware architectures. Traditional design flows often rely on explicit user-defined parallelism in the application code (imperative languages), instead of relying on alternative MoCs where parallelism is inherently present. New programming paradigms raise the level of abstraction and make parallelism explicit.
In Section 2.1, MoCs will be formally defined and their features discussed in depth. After a documented discussion of the dataflow literature, a MoC will be chosen for its expressiveness and analyzability, together with a crucial aspect for the thesis: its run-time reconfiguration capabilities. In fact, reconfiguration is one of the most important keywords in the context of self-adaptation: it is the possibility of dynamically changing and rearranging the software as well as the hardware to fulfill new requirements. In Section 2.2, a literature review of the main methods, techniques, and tools for rapid prototyping will be reported. The aim will be to highlight the main features and characteristics that the proposals of this thesis should achieve. Section 2.3, the last of the state-of-the-art chapter, will describe the benefits and drawbacks of the hardware platforms available on the market. The need to guarantee hardware reconfiguration capability for the designed system will deeply influence the choice of the architecture. Specifically, the benefits of the Dynamic Partial Reconfiguration (DPR) available on modern FPGAs are shown, in order to explain the important role of DPR within the thesis proposals. An FPGA is a reconfigurable architecture that offers a trade-off between performance and flexibility and makes it possible to create custom accelerators specialized for specific computations. In Section 3.1 of Chapter 3, the techniques and design tools for the creation of hardware accelerators will be reviewed. Among these techniques, the High-Level Synthesis (HLS) workflow allows a designer to start from a hardware description based on high-level languages (such as C/C++) instead of relying on the traditional Hardware Description Language (HDL)-based flow. In order to offload computation from CPUs to accelerators on the FPGA, the Operating System (OS) of the platform should be able to manage new custom hardware devices (when provided).
For this reason, the hardware abstraction and the OS services will also be discussed. Finally, the possibilities offered by the Software-Defined System-On-Chip (SDSoC) workflow (developed by Xilinx) will be examined. SDSoC is an Integrated Development Environment (IDE) that integrates, in a single flow, the creation of the hardware system and of the OS with the services needed to handle the accelerators properly. Benefits and drawbacks will be highlighted to justify its use in the main proposal of the chapter. In Section 3.2, the proposal of integrating SDSoC and the dataflow MoC in a single flow will be examined. The approach aims to offer a valid instrument to speed up the design of multithreaded applications that make use of multiple hardware accelerators. The idea involves the use of the already-mentioned SDSoC and the academic tool PREESM (developed at INSA Rennes). The method will be described step by step, and each challenge addressed will be analyzed. Specifically, PREESM is a rapid prototyping framework that deploys software applications starting from a high-level representation of architectures and a dataflow-based representation of applications. Thanks to its internal graph transformations and algorithms, it deploys the entire system, generating mapped and scheduled code for the target platform. The proposal will extend the use of PREESM to the creation of multi-hardware and multi-threaded heterogeneous systems. Additionally, the workflow allows Design Space Exploration (DSE) of different hardware/software design alternatives with no need to rethink and redefine the data partitioning among the PEs of the architecture. Also, the run-time manager for dataflow-based applications called SPiDER (also developed at INSA Rennes) will be adopted to vary, dynamically at run-time, the parameters of the application that influence and modify the data-level parallelism of the dataflow applications.
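As an illustration of how an application parameter can change the data-level parallelism of a dataflow graph, the following minimal sketch assumes a synchronous-dataflow-style graph with fixed token rates and computes how many times each actor fires per graph iteration (the repetition vector obtained from the balance equations). It does not use the PREESM or SPiDER APIs. The function names, the actor names, the 480-row image, and the `n_slices` parameter are all hypothetical.

```python
from fractions import Fraction
from math import lcm  # math.lcm requires Python 3.9 or newer


def repetition_vector(actors, edges):
    """Firings per graph iteration for a connected, consistent graph.

    edges: (producer, tokens_produced, consumer, tokens_consumed) tuples.
    Balance equation on every edge: firings[src] * prod == firings[dst] * cons.
    """
    rate = {actors[0]: Fraction(1)}  # fix one actor, propagate to the rest
    pending = list(edges)
    while pending:
        remaining = []
        for src, prod, dst, cons in pending:
            if src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates: no bounded schedule")
            elif src in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate:
                rate[src] = rate[dst] * cons / prod
            else:
                remaining.append((src, prod, dst, cons))
        if len(remaining) == len(pending):
            raise ValueError("graph is not connected")
        pending = remaining
    if set(rate) != set(actors):
        raise ValueError("graph is not connected")
    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}


def image_pipeline(height, n_slices):
    """Hypothetical pipeline: read a frame, filter it slice by slice, display it."""
    rows = height // n_slices  # rows handled by one firing of the filter actor
    return [("read", height, "filter", rows),
            ("filter", rows, "display", height)]


if __name__ == "__main__":
    actors = ["read", "filter", "display"]
    for n_slices in (1, 4, 8):
        print(n_slices, repetition_vector(actors, image_pipeline(480, n_slices)))
```

In this sketch, changing only `n_slices` changes the number of independent firings of the `filter` actor, which is the amount of work that a scheduler could distribute over several PEs without any change to the actor code.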
The entire DSE flow and the adopted run-time manager will be tested on an image processing application (Section 3.3). The mathematical details of the algorithm will be discussed, as well as the parallelization strategy applied to the use case. After the design of the ad-hoc hardware accelerator, the method is applied, and every proposed step is re-examined on the real application. The improvements obtained will then be compared with the state-of-the-art performance of the hardware-accelerated application. In Section 3.4, the method will also be applied to perform a DSE of several hardware/software solutions for a new hardware-accelerated version of the 3D video game DOOM. To make the hardware-accelerated execution of the video game possible, a custom Linux-based OS will also be developed, since the basic services offered by the OS automatically generated by SDSoC do not cover all the needs of this complex application. Finally, the performed DSE will highlight the design trade-offs among execution time, power requirements, and energy consumption. Additionally, it will be observed that the cache misses caused by the data starvation of several accelerators working in parallel can affect the overall performance of the entire system. In the conclusion of Chapter 3, the benefits and limitations of the proposed method are reported. The discussed limitations will, in fact, lay the foundation for the further proposals presented in Chapter 4. Firstly, the hardware architecture and the software layers automatically created by SDSoC must be used as a black box, thus limiting the designer's hardware/software actions. Secondly, DPR is not directly supported by SDSoC, which prevents changing the structure of the architecture at run-time. These limitations will push the adoption of a new architecture infrastructure. In Section 4.1, the open-source run-time reconfigurable processing architecture ARTICo3 (developed at CEI-UPM) will be analyzed.
The flexibility of its hardware infrastructure is the natural consequence of DPR, which allows time-division multiplexing of the logic resources. The use of the architecture is made easy by the automated toolchain (which helps the designer to build the entire FPGA-based system) and by a run-time execution environment (which transparently manages the reconfigurable accelerators). With the inclusion of a reconfigurable architecture, the PREESM workflow will be revisited in Section 4.2. On the one hand, the high-level description of the architecture (namely S-LAM) will allow the specification of reconfigurable slots. On the other hand, the mapping of dataflow actors onto a custom reconfigurable hardware accelerator is proposed, and its implications are analyzed. The code generator of PREESM will also be modified to allow the correct management of the ARTICo3 accelerators and the creation of a special software thread that delegates and dispatches hardware tasks to the slots of the ARTICo3 architecture. Finally, the details of how to manage DPR and hardware PEs at run-time will be discussed. This goal will be achieved by combining SPiDER and the collection of run-time Application Programming Interfaces (APIs) of ARTICo3. This last proposal ensures software and hardware reconfiguration of the whole system at run-time. However, for a system to be self-adaptable, self-awareness must also be guaranteed. In Section 4.3, the motivations for a unified hardware and software monitoring method will be discussed. The important role of the standard monitoring library PAPI will be described. Its integration with PAPIFY (developed at CITSEM-UPM) and PREESM will lay the foundation for adopting this multi-layered software infrastructure as a run-time monitoring instrument for reconfigurable architectures.
To make this integration possible, the modifications to the ARTICo3 run-time execution environment and the creation of a reconfigurable PAPI component specific to the ARTICo3 architecture (inspired by the PAPIFY software monitoring strategies) are reported and justified. The entire monitoring infrastructure will thus ensure self-awareness of the designed embedded system. As a proof of concept for the newly proposed method for designing run-time adaptive hardware- and software-reconfigurable systems, a parallel version of the matrix multiplication algorithm is used. After the presentation of the intuitive concepts behind the divide-and-conquer algorithm, the dataflow version of matrix multiplication is designed and proposed. In the experimental results in Section 4.4, DSE is performed by acting only on the parameters of the application, proving the strength and consistency of the method. As a use case for the proposals of the entire thesis, Chapter 5 will be entirely dedicated to the study of an old but still open problem: the Inverse Kinematics (IK) of a robotic arm manipulator, tackled from a novel multi-level parallel perspective and using the new design instruments presented throughout the thesis. To justify the novel approach to the problem, it will be observed that, in order to take full advantage of the new technological opportunities, even basic and widely used algorithms should be revisited. As such, the solver will be formulated as an optimization problem, in which two levels of algorithmic parallelism will be proposed: the Nelder-Mead derivative-free method used as the optimization engine will be modified to allow the evaluation of the cost function at multiple vertices simultaneously, and the trajectory path will be divided into non-overlapping segments, in which all the points will be solved concurrently.
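The two levels of algorithmic parallelism can be sketched as follows. This is a minimal illustration under stated assumptions, not the modified Nelder-Mead method of the thesis: it assumes a hypothetical planar two-link arm, and it uses a speculative variant in which the reflection, expansion, and both contraction candidates are evaluated concurrently at every iteration. The link lengths, function names, and thread pools are illustrative only; Python threads show the structure of the computation rather than a real speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor

L1, L2 = 1.0, 0.8  # hypothetical link lengths of a planar two-link arm


def forward_kinematics(q):
    """End-effector position (x, y) for the joint angles q = (q1, q2)."""
    q1, q2 = q
    return (L1 * math.cos(q1) + L2 * math.cos(q1 + q2),
            L1 * math.sin(q1) + L2 * math.sin(q1 + q2))


def make_cost(target):
    """IK as optimization: squared distance between FK(q) and the target."""
    def cost(q):
        x, y = forward_kinematics(q)
        return (x - target[0]) ** 2 + (y - target[1]) ** 2
    return cost


def nelder_mead_speculative(cost, start, pool, step=0.3, iters=300, tol=1e-12):
    """Nelder-Mead whose candidate vertices are evaluated concurrently."""
    n = len(start)
    simplex = [list(start)] + [
        [s + (step if i == j else 0.0) for j, s in enumerate(start)]
        for i in range(n)
    ]
    values = list(pool.map(cost, simplex))
    for _ in range(iters):
        order = sorted(range(n + 1), key=values.__getitem__)
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if values[0] < tol:
            break
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]
        # Level 1: reflection, expansion, outside and inside contraction
        # are computed up front and evaluated at the same time.
        trial = [[c + t * (c - w) for c, w in zip(centroid, worst)]
                 for t in (1.0, 2.0, 0.5, -0.5)]
        f_r, f_e, f_oc, f_ic = pool.map(cost, trial)
        if f_r < values[0]:
            pick = (trial[1], f_e) if f_e < f_r else (trial[0], f_r)
        elif f_r < values[-2]:
            pick = (trial[0], f_r)
        elif f_r < values[-1] and f_oc <= f_r:
            pick = (trial[2], f_oc)
        elif f_r >= values[-1] and f_ic < values[-1]:
            pick = (trial[3], f_ic)
        else:
            pick = None
        if pick is not None:
            simplex[-1], values[-1] = pick
        else:
            # Shrink every vertex toward the best one, again in parallel.
            simplex[1:] = [[b + 0.5 * (x - b) for b, x in zip(best, v)]
                           for v in simplex[1:]]
            values[1:] = pool.map(cost, simplex[1:])
    return simplex[min(range(n + 1), key=values.__getitem__)]


def solve_segment(targets, start, workers=4):
    """Level 2: all points of one trajectory segment are solved concurrently."""
    with ThreadPoolExecutor(workers) as vertex_pool, \
            ThreadPoolExecutor(workers) as point_pool:
        jobs = [point_pool.submit(nelder_mead_speculative,
                                  make_cost(t), start, vertex_pool)
                for t in targets]
        return [job.result() for job in jobs]


if __name__ == "__main__":
    segment = [(1.2, 0.5), (1.2, 0.6), (1.2, 0.7)]
    for target, q in zip(segment, solve_segment(segment, (0.3, 0.6))):
        print(target, q, forward_kinematics(q))
```

Every cost evaluation in this sketch is one FK computation, so the number of evaluations issued at once is the quantity that a pool of parallel FK units could absorb.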
Algorithmic parallelism will also be supported by a variable number of parallel instances of a custom hardware accelerator, which speeds up the computation of the Forward Kinematics (FK) equations of the robot required during the resolution of the IK. The experimental results (Section 5.7) will show how a variable number of dynamically reconfigurable hardware accelerators, combined with the reconfiguration capability of the application parameters, will provide run-time scalability in terms of trajectory accuracy, logic resources, dependability, and execution time. To prove the self-adaptivity opportunities provided by the designed system, a basic manager for the whole self-adaptive system will be described in Section 5.8. It will be implemented by simulating input from the outside world through the hardware connections of the development board. The last chapter of the thesis will briefly summarize the whole path followed to develop the thesis work and its main contributions. It will also analyze the impact of the thesis through journal and conference publications and other dissemination channels. The most significant results of the thesis will also be published in open-source repositories so that they can be reproduced, and even improved, by other academic researchers. The thesis ends with ideas for future research lines that will inspire and push the development of future autonomous self-adaptable heterogeneous systems.