The Control Unit uses a single-address microcode design to encode the set of hardware system calls. The system-level Datapath implements the multiplexing and demultiplexing of the system call parameters into the service-level Datapath. The M00_Kernel and S00_Task are the master and slave interfaces of the system call, which are used to connect with the corresponding interfaces in the Hardware Task. The Kernel Core is responsible for time management and provides waiting events coupled with time-out functionality and a parametrizable task sleep. The Control and Status registers allow the host system to interact with the Hardware Kernel. To preserve the Hardware Kernel status, any control operation issued by the CPU (or by multiple cores) is forwarded via the Authenticator unit, which validates permissions before authorizing a write operation. As a consequence of the microprogramming technique used for the hardware system calls, the Kernel Core implementation results in a static unit that is independent of the Hardware Task implementation.
A service-level Datapath includes: (1) a dual-port, bidirectional message-queue used to exchange control information with the host system services; (2) a dual-port, bidirectional data-FIFO available for generic Hardware Task use; (3) a local interrupt controller (LINTC) that allows synchronization with the Linux OS; (4) a true dual-port, general-purpose local RAM (LRAM) for data exchange and temporary storage; and (5) two dual-channel Hardware Mutexes that implement mutual exclusion within the accelerator model. The latter are directly coupled with the LRAM and with a system memory region allocated at boot time.
2.1. Kernel Core
Conceptually, the Kernel Core acts like the kernel found in the most elementary OS, providing a set of services that interact with local resources through hardware system call invocation. The Hardware Task implements system calls using procedures described in a kernel Hardware Description Language (HDL) package provided by the framework. For complex or composite operations, user-level HDL procedures, provided by the user package, can implement consecutive system calls involving more than one local resource. Procedures accept input and output parameters that link to resources in the Hardware Task design; these, in turn, allow the hardware system call to access those resources and ultimately update them with execution results.
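As a minimal illustration of this mechanism, a kernel-package procedure of this kind could simply drive the system call entry record with the requested service and its parameters. The sketch below is hedged: the procedure name and argument list are assumptions rather than the actual package API, and the record type is the one shown later in Listing 1.

-- Hypothetical sketch of a kernel-package procedure: requests a mutex lock
-- by driving the system call entry record. The real signatures are fixed by
-- the framework's kernel HDL package.
procedure sys_mutex_lock (
    signal sys_in : out sys_call_input_t) is
begin
    sys_in.this_call   <= '1';                  -- flag valid inputs to the Kernel Core
    sys_in.sys_call_id <= SYS_CALL_MUTEX_LOCK;  -- requested service
    sys_in.parameters  <= (others => '0');      -- no parameters for this call
end procedure;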
Figure 4 shows a simplified diagram of the internal organization of the Kernel Core component. The Control Unit determines the Status of the accelerator, which can be triggered by the active bits in the Control register. These registers can be manipulated by the host system to address the application’s functional requirements. Due to the critical nature of the available operations, the content of the Control register is updated under the supervision of the Authenticator device, which validates the received word by scanning for the required authentication field. Once active, the Control Unit operates through the system-level Datapath, establishing connectivity between the microprogrammed unit and the kernel’s Call and Response interfaces. These interfaces map directly to the S00_Task and M00_Kernel signals at the Hardware Kernel top level, and allow the Hardware Task to trigger the system calls present in the microprogrammed unit. In turn, the system calls implement a pre-programmed set of control actions that operate at the system level to handle data manipulation using the existing local resources.
When a Hardware Task requests a wait event lasting a predetermined number of clock cycles, or needs to wait for a hardware signal restricted to a maximum timeout interval, a system call interacts with the Time Event device to provide such a service. In addition, to implement composite operations, the Kernel Core uses the Scheduler services to select each system call from a concurrent implementation described by a user procedure. Similarly, the Index counter is used to manipulate data using consecutive indexes. Finally, an Error counter registers any errors that may occur while executing system calls. These may lead to an error state in the Control Unit, requiring the intervention of the host system.
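A timed wait maps naturally onto this interface: the call type selects the Time Event service and the parameters field carries the interval. The sketch below assumes the interval is encoded as a cycle count in the parameters field; this encoding is an illustrative choice, not the documented one.

-- Hedged sketch: requesting a wait event bounded by a timeout. Assumes
-- ieee.numeric_std and an interval encoded in the parameters field.
sys_in.this_call   <= '1';
sys_in.sys_call_id <= SYS_CALL_WAIT_EVENT_TIMEOUT;
sys_in.parameters  <= std_logic_vector(to_unsigned(1000, C_MESSAGE_WIDTH));  -- wait up to 1000 cycles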
Once running, the microprogrammed unit suspends the clock signal at strategic points of the Hardware Task design for all system calls. In doing so, the Hardware Task context remains suspended while the task interacts at the kernel level. Pre-programmed control signals are then generated to forward the received parameters through the system-level Datapath. At the same time, status information is generated to indicate whether the system call performs a write or a read operation, or whether it must stay blocked waiting for the involved resources to become available, and also to report the current microprogram location. In the final active clock cycle that completes the system call execution, the microprogram re-establishes the context of the Hardware Task, which resumes its normal processing.
2.2. Hardware System Calls
A hardware system call is a sequence of control operations organized into a predetermined number of steps, providing services that operate on the local resources in the accelerator model. Similar to the concept applied at the software level in OS environments, the hardware system calls virtualize the accelerator through a specific set of features, allowing the designer to easily create a Hardware Task. They are the Kernel Core’s fundamental interface for handling local resources and abstracting away the complexity that the accelerator model represents. Such abstraction, in turn, promotes design reuse by allowing deployment on different platforms, as long as the set of hardware system calls offers an appropriate implementation. The Kernel Core design is organized through an incremental set of programmable features.
As mentioned above, hardware system calls are implemented via procedures in the kernel package that specify the functionality, the involved parameters, and the connectivity between these and the kernel microprogram- and system-level Datapath units, while the Kernel Core provides entry and exit points in its interface that establish the required signals. Listing 1 shows an excerpt of the kernel package: line 163 defines the sys_call_t type as the subset of system calls the kernel supports, which is then used in the input and output records (lines 206 and 213) to establish the system call interface. When executing a system call, each procedure specifies the desired feature through the field in line 209 and links its arguments to the input parameters in line 210. It then activates the this_call flag to signal valid inputs to the Kernel Core and to proceed with the system call. In response, the microprogram activates the block_task signal and transfers the received system call type to the sys_call_id field (lines 216 and 217, respectively). During execution, the Kernel Core updates the return_arg output (line 218) with the processing results from the system-level Datapath. In the last step of the system call execution, the microprogram activates the signal in line 215, indicating valid parameters in the return_arg register; at completion, it disables the block_task output to return control to the Hardware Task. The output fields hold their contents until the next system call execution, thus allowing the Hardware Task to re-use or test them to evaluate results. Note that the kernel HDL package is included hierarchically, starting at the tool’s configuration package. This establishes, among other settings, the width of the system-level Datapath, determined by the largest received parameter (lines 210 and 218). This parameter is the kernel-level control message and depends on the target architecture of the host system. As a result, the Datapath width is fixed at two words when the tool targets a 32-bit host, or three words on a 64-bit host.
Listing 1. Kernel package source file excerpt, describing hardware system call types and the entry and exit records.
6   library hal_asos_v4_00_a;
7   use hal_asos_v4_00_a.hal_asos_configs_pkg.all;
8   use hal_asos_v4_00_a.hal_asos_utils_pkg.all;
...
163 type sys_call_t is (SYS_CALL_NONE, SYS_CALL_WAIT_EVENT_TIMEOUT, SYS_CALL_READ_LFIFO,
164     SYS_CALL_WRITE_LFIFO, SYS_CALL_READ_MESSAGE, SYS_CALL_WRITE_MESSAGE,
165     SYS_CALL_READ_LBUS, SYS_CALL_WRITE_LBUS, SYS_CALL_MUTEX_LOCK,
166     SYS_CALL_MUTEX_TRY_LOCK, SYS_CALL_MUTEX_UNLOCK, SYS_CALL_READ_MBUS,
167     SYS_CALL_WRITE_MBUS, SYS_CALL_READ_LBUS_BURST, SYS_CALL_WRITE_LBUS_BURST,
168     SYS_CALL_READ_MBUS_BURST, SYS_CALL_WRITE_MBUS_BURST, SYS_CALL_YIELD);
...
206 type sys_call_input_t is
207 record
208     this_call:   std_ulogic;                                   -- trigger sys_call
209     sys_call_id: sys_call_t;
210     parameters:  std_logic_vector(C_MESSAGE_WIDTH-1 downto 0); -- field for syscall parameters
211 end record;
212
213 type sys_call_output_t is
214 record
215     valid:       std_logic;
216     block_task:  std_logic;
217     sys_call_id: sys_call_t;
218     return_arg:  std_logic_vector(C_MESSAGE_WIDTH-1 downto 0); -- return sys_call data
219 end record;
...
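Put together, the records above suggest the following usage pattern on the Hardware Task side. In the sketch below, the state machine, task_state, message_reg, and the chosen call are assumptions; only the handshake itself (this_call, block_task, valid, return_arg) follows the description in the text.

-- Hedged usage sketch around the records of Listing 1, inside a Hardware
-- Task state machine.
process (clk)
begin
    if rising_edge(clk) then
        case task_state is
            when REQUEST =>
                sys_in.this_call   <= '1';             -- signal valid inputs
                sys_in.sys_call_id <= SYS_CALL_READ_MESSAGE;
                sys_in.parameters  <= (others => '0');
                task_state <= CONSUME;
            when CONSUME =>
                -- block_task gates the task clock, so this state only runs
                -- after the call has completed and the context is restored
                sys_in.this_call <= '0';
                if sys_out.valid = '1' then
                    message_reg <= sys_out.return_arg; -- holds until next call
                end if;
                task_state <= REQUEST;                 -- resume processing
            when others => null;
        end case;
    end if;
end process;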
Algorithm 1 describes a 4-step hardware system call for the Hardware Mutex lock, where Step0 evaluates the state of the resource and implements containment when it is locked. The Locked A flag indicates that the resource is locked by the CPU in the host platform; as such, in this particular case, the microprogram must return to Step0 when the condition is true, or proceed to Step1 otherwise. Step1 acquires the resource, while Step2 evaluates the final result of the operation. If the Locked B flag is set, it indicates that the resource is locked by the Kernel Core, and proceeding to Step3 releases the Hardware Task. Otherwise, the concurrent race for the resource is lost and the microprogram retries the system call invocation, returning to Step0 until it succeeds.
Algorithm 1. Microprogram to lock a Hardware Mutex.
1: pseudocode SYS_CALL_MUTEX_LOCK
2: Step0: produce block_task and lbus_rd_ce; test mutex status Locked A flag.
3:        if true then goto Step0.
4: Step1: produce block_task and lbus_wr_ce; test true input.
5:        if false then goto Step2.
6: Step2: produce block_task and lbus_rd_ce; test mutex status Locked B flag.
7:        if false then goto Step0.
8: Step3: produce valid
9: exit
2.3. Microprogrammed Unit
The accelerator model employs single-address microcode, and its operation is based on the flow of microinstructions in the microprogram, where each opcode activates certain outputs and selects one input for testing. Thus, an 8-bit Program Counter advances to the next instruction on a true test result, or takes a jump based on the current address and an implicit offset (the Step bit field) in the opcode if the result is false.
Figure 5 shows the opcode format for Step2 of the system call to lock a Hardware Mutex. In this example, the absolute address 0x22 is applied to the RAM where the microprogram is defined. The resulting word determines that input 10 is used as the test source; “00” is the next step false (NSF) field, which gives rise to the absolute address 0x20 in the case of a false test result; and output 7 remains set for as long as the current microinstruction is active. If the test result is true, the Program Counter is incremented to the next microinstruction, at the absolute address 0x23. The figure also shows the values of the outputs Valid (V), Block task (B), and Fault (F), which are transversal to all microinstructions and, for this reason, are located at fixed positions in the opcode.
To select a test input, the design of the microprogram uses a 5-bit field (Input) in the opcode to implement a multiplexing function (from 32 signals to 1), which implements the conditional jump; the values “00000” and “11111” in the Input bit field serve as auxiliary false and true tests, for the unconditional jump or unconditional next instruction, respectively. In the same opcode, a 4-bit field (Output) allows the microprogram to activate outputs by implementing a demultiplexing function (from 1 to 16 signals).
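A behavioral reading of this sequencing rule is sketched below as an architecture excerpt; the field positions, signal names, and the two-bit step slice are inferred from the text and Figure 5 and should be taken as assumptions rather than the exact implementation.

-- Sketch of the single-address sequencing: a true test increments the 8-bit
-- Program Counter; a false test replaces the step bits with the NSF field,
-- so the jump stays within the current system call block. Excerpt only;
-- signal declarations (pc, opcode_* fields, test_inputs, ctrl_out) omitted.
test_inputs(0)  <= '0';  -- auxiliary false test ("00000"): unconditional jump
test_inputs(31) <= '1';  -- auxiliary true test ("11111"): unconditional next
test_result <= test_inputs(to_integer(unsigned(opcode_input)));  -- 32-to-1 mux

process (clk)
begin
    if rising_edge(clk) then
        if test_result = '1' then
            pc <= pc + 1;                            -- true: next microinstruction
        else
            pc(1 downto 0) <= unsigned(opcode_nsf);  -- false: implicit step offset
        end if;
    end if;
end process;

gen_outputs: for i in 0 to 15 generate               -- 1-to-16 demultiplexer
    ctrl_out(i) <= '1' when to_integer(unsigned(opcode_output)) = i else '0';
end generate;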
Table 1 shows an excerpt from the microprogram that includes the microinstructions of two system calls, the mutex lock and try-lock, while empty locations are mapped to null values for input and output with the bits Block and Fault asserted. The first signal suspends the Hardware Task context, while the latter triggers a Linux OS kernel page-fault. The contents of this table are ordered according to the microinstruction opcode presented in Figure 5.
The first line of the mutex lock system call uses the absolute address 0x20 (see the example in Figure 5), where the microinstruction selects input 12 (“01100” in the Input bit field) to test the Locked A flag. As such, the microprogram should only proceed to the next instruction when the resource is free. In order to implement a continuous flow of valid tests, this flag must be complemented before the multiplexer input. In this way, when the Locked A flag is active, the input selection results in a false test, and the microprogram jumps to the current instruction until the resource is released (Algorithm 1). On release, the true result increments the step counter, which gives rise to the next instruction at the absolute address 0x21. In this step, the microprogram activates demultiplexer output 6 (“0110” in the Output bit field) to write to the Hardware Mutex, and implements a dummy test to proceed to Step2 on any result. For this test, it selects the auxiliary true logic test statically assigned to the multiplexer input.
Among the microprogram inputs, only Locked A and Locked B are used in complemented logic, and the latter is used to test whether the mutex has been released by the microprogram. As such, the same flag (without the complemented logic) is received at input 10, which gives rise to a true locked test. This test is used in Step2 of the lock system call to ensure success in the occurrence of a race condition for the resource. Upon success, the microprogram reaches Step3 by incrementing the step counter; otherwise, in Step2, a false test results in a conditional jump to address 0x20, repeating the system call. In Step3, the output is activated to indicate valid data in the return register, and the Hardware Task is released by disabling the Block output. At completion, the microprogram needs to jump to Step0 in the counter register so that a new system call can be started. Although incrementing the counter would result in similar behavior, the design applies a false test at input “00000” to favor regularity, jumping back in the last step of each system call.
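The walkthrough above can be condensed into the four microcode words of the lock system call at addresses 0x20 to 0x23. The record layout below is an assumption introduced purely for readability (the Step0 output is inferred from Algorithm 1); only the field values follow the text and Figure 5.

-- Illustrative reconstruction of the lock system call words at 0x20-0x23.
type microword_t is record
    input  : integer range 0 to 31;  -- test-source select (Input field)
    nsf    : integer range 0 to 3;   -- next step on false (NSF field)
    output : integer range 0 to 15;  -- active control line (Output field)
    blk    : std_logic;              -- Block task bit
    vld    : std_logic;              -- Valid bit
    flt    : std_logic;              -- Fault bit
end record;
type microcode_t is array (natural range <>) of microword_t;

constant MUTEX_LOCK : microcode_t(0 to 3) := (
    -- Step0 @0x20: read mutex, test /Locked A (input 12); false stays at Step0
    (input => 12, nsf => 0, output => 7, blk => '1', vld => '0', flt => '0'),
    -- Step1 @0x21: write mutex (output 6); auxiliary true test, proceed to Step2
    (input => 31, nsf => 0, output => 6, blk => '1', vld => '0', flt => '0'),
    -- Step2 @0x22: read mutex, test Locked B (input 10); false retries at Step0
    (input => 10, nsf => 0, output => 7, blk => '1', vld => '0', flt => '0'),
    -- Step3 @0x23: assert Valid, release the task; false test at input 0 jumps to Step0
    (input => 0,  nsf => 0, output => 0, blk => '0', vld => '1', flt => '0')
);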
The elasticity offered by the microprogram enables services of the accelerator model to detect runtime failures, namely failures due to unregistered addresses or wrong transaction formats while accessing the memory system. The system-level Datapath triggers the failure signal and the Kernel Core enters a fault state; consequently, it disconnects the Hardware Task from the microprogram and asserts an interrupt signal while waiting for the file system reply. This interrupt triggers a Linux OS page-fault which, when processed, checks the accelerator’s status register and accordingly launches a specific handler for the detected failure. Each handler runs a rule-based procedure to tackle a microprogram conflict: for the unregistered address failure, it requests memory allocation from the Linux OS and then forwards the assigned physical address to a special-purpose register of the accelerator’s Hardware Kernel. Otherwise, i.e., in the case of a transaction failure, the fault is processed to identify a replacement system call that is compatible with the memory interface, and the microprogram address is reprogrammed with the newly chosen system call.
Similarly, a DPR enabling new functionalities or replacing a whole Hardware Task can trigger a system call unsupported by the current microprogram, since the latter only contains the system calls generated during the synthesis of the original design. Trying to run such a system call raises a microprogram memory fault, as any free location of the microprogram memory is mapped to null values for input and output, with the bits Block and Fault asserted in the sequenced word (Table 1). The Kernel Core replies accordingly by disconnecting the Hardware Task from the microprogram while entering a failure state. A rule-based procedure is selected to reprogram the accessed memory location with the required system call and then trigger the Kernel Core to resume the microprogram with the newly added functionality. The S00_Control interface is used by the assigned handler to sequentially write the corresponding microprogram words, while specifying the offset of the given location. Afterwards, the handler concludes by asserting the resume bit in the Control register of the Kernel Core, which signals the kernel to return to a processing state and reconnect the suspended Hardware Task to the microprogram.
2.4. Linux Integration Model
The integration of the HAL-ASOS accelerator model with the Linux OS at both the user and kernel levels, together with the myriad of functional units in the model, demands proper OS support through a collection of device drivers that efficiently exports each functionality into the Linux OS user-space. This collection is organized through a customized file system, depicted in Figure 6.
The HAL-ASOS accelerator file system is mounted at system start-up and can be found at the root of the Linux OS file system, in the hal-asos folder. Any existing accelerators are probed from the device tree file and mapped into individual folders (e.g., Accelerator_1, …). Inside each accelerator folder, the structure is organized into a kernel folder, an interrupt folder, and a subset of virtual files that map the remaining functional units in the accelerator model (e.g., the local-ram, the lram-mutex, the sysram and the sysram-mutex, the Hardware Kernel message-queue, and the data-fifo). The interrupt folder contains the virtual files that provide synchronization between the software threads in the system and the accelerator. The lintc file represents the local interrupt controller; it uses eight native interrupts and up to twenty-three user-definable interrupts, mapped to the local_* and user_* virtual files, respectively. The kernel folder contains the local-kernel virtual file, used to register accelerator administrative features. Among these features, the distinct memory profiles that can be activated include UserIO, SharedMemory, and ZeroCopy. The UserIO profile is handled at the application level through the HAL-ASOS C/C++ software framework, and the remaining profiles are implemented using the shared and zero-copy virtual files. A microcode file is used to track the changes that resulted from accelerator faults and to include these in future system restarts.