3.1. General-Purpose Platform
Atomicity violations have long been observed in widely used general-purpose platforms. Information about their occurrence is shared among developers through each software project's bug reports. This information provides valuable patterns (or types) for research on on-the-fly repair approaches [36,37,38] that prevent system failure during operation. Because bug reports contain a wealth of information on atomicity violations, most on-the-fly repair research has been conducted on general-purpose platforms. Note, however, that airborne software is hard real-time software, whereas software on a general-purpose platform has no hard real-time requirement.
Considering the accuracy and overhead of existing research studies performed on a general-purpose platform, the method by J. Yu et al. [36] and AI (Anticipating Invariant) [37] are suitable for use in real-time systems. The two are similar in that both diagnose errors using information collected in advance during testing and treat them by stalling the thread in which the error is diagnosed.
J. Yu et al. [36] record the order of static memory operations in a data structure called PSet (predecessor set). A static memory operation P is collected into the PSet of a static memory operation M when: (1) at least one of P and M is a write access; (2) P and M are executed in two different threads; and (3) M is executed immediately after P. AI [37] records the order for each static instruction in a BSet (belonging set). Given a dynamic instruction, the BSet collects each static instruction that satisfies the following conditions: (1) it accesses the same address as the dynamic instruction; (2) it is executed in a thread other than the one executing the dynamic instruction; and (3) it is executed just before the dynamic instruction. The two methods are similar in that they store access information from different threads. However, J. Yu et al.'s method [36] does not store access information in the PSet if the previous access was executed in the same thread, whereas AI stores the access executed in a different thread even if the previous access was executed in the same thread. This difference makes the set of errors that AI can repair larger than that of J. Yu et al.'s method.
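The difference between the two collection rules can be illustrated with a minimal sketch. It is a simplified reading of the conditions above, not the implementation of either tool: the trace format, the function name `collect_sets`, and the instruction names `W`, `R1`, `R2` are all hypothetical, and the belonging set is assumed to record the most recent access to the same address by any other thread.

```python
def collect_sets(trace):
    """Collect predecessor sets and belonging sets from one tested run.

    trace: list of (thread, instr, addr, is_write) tuples in execution order.
    """
    pset, bset = {}, {}
    last_access = {}      # addr -> (thread, instr, is_write) of the previous access
    last_by_thread = {}   # addr -> {thread: (position, instr)}
    for pos, (thread, instr, addr, is_write) in enumerate(trace):
        pset.setdefault(instr, set())
        bset.setdefault(instr, set())

        # Predecessor set: only the immediately preceding access counts, and it
        # must come from another thread with at least one of the two a write.
        prev = last_access.get(addr)
        if prev is not None:
            p_thread, p_instr, p_write = prev
            if p_thread != thread and (p_write or is_write):
                pset[instr].add(p_instr)

        # Belonging set (assumed reading): the most recent access to the same
        # address by another thread, even if this thread accessed it since.
        per_thread = last_by_thread.setdefault(addr, {})
        others = [entry for t, entry in per_thread.items() if t != thread]
        if others:
            bset[instr].add(max(others)[1])

        last_access[addr] = (thread, instr, is_write)
        per_thread[thread] = (pos, instr)
    return pset, bset


X = 0x10  # hypothetical shared address
correct_run = [("T2", "W", X, True), ("T1", "R1", X, False), ("T1", "R2", X, False)]
pset, bset = collect_sets(correct_run)
```

In this hypothetical run, the second read `R2` directly follows `R1` in the same thread, so its predecessor set stays empty, while its belonging set still records the remote write `W`.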
Figure 2 shows two types of interleaving that can occur in the program of Figure 1. In Figure 1, the developer presumably intended the accesses in question to be executed atomically, as shown in Figure 2a. If this program is executed correctly in the testing phase, as shown in Figure 2a, only one read access satisfies the collection condition of the predecessor set, so the predecessor set corresponding to this interleaving is stored as the correct interleaving in Figure 2c. This interleaving information is compared with the actual interleaving in the operation phase to diagnose atomicity violations.
Figure 2b shows an interleaving of this program in which an atomicity violation occurs. If the program is executed incorrectly in the operation phase, as shown in Figure 2b, the predecessor set is collected as shown under the atomicity violation in Figure 2c. For the execution in Figure 2b, J. Yu et al.'s method repairs the atomicity violation in the following steps. First, when the first read access executes, its predecessor set is compared with that of the correct interleaving. No error is diagnosed, since the predecessor set of the correct interleaving and that of the actual interleaving are the same. Next, when the following read access executes, its predecessor set is again compared with that of the correct interleaving. This time an error is diagnosed, since the two predecessor sets differ. J. Yu et al.'s method then stalls the thread before this read access is executed, delaying the read until the expected write access arrives, which repairs the atomicity violation.
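The diagnosis step can be sketched as a check of each access against the predecessor sets learned in testing. This is a minimal illustration under assumptions, not J. Yu et al.'s implementation: the function name `diagnose`, the trace format, and the instruction names are hypothetical, and the example program is not the one in Figure 1. The function only reports suspicious accesses; in a repairing system the first reported access would be stalled instead of executed.

```python
def diagnose(trained_pset, trace):
    """Return positions of accesses whose remote predecessor was never seen in testing.

    trained_pset: dict mapping an instruction to its allowed predecessor set.
    trace: list of (thread, instr, addr, is_write) tuples in execution order.
    """
    last_access = {}   # addr -> (thread, instr, is_write) of the previous access
    flagged = []
    for pos, (thread, instr, addr, is_write) in enumerate(trace):
        prev = last_access.get(addr)
        if prev is not None:
            p_thread, p_instr, p_write = prev
            conflicting = p_thread != thread and (p_write or is_write)
            # A conflicting remote predecessor that is not in the trained set
            # indicates an interleaving that was not observed as correct.
            if conflicting and p_instr not in trained_pset.get(instr, set()):
                flagged.append(pos)
        last_access[addr] = (thread, instr, is_write)
    return flagged


X = 0x10  # hypothetical shared address
# Predecessor sets learned from a tested run in the order R1, R2, W.
trained = {"R1": set(), "R2": set(), "W": {"R2"}}
correct_run = [("T1", "R1", X, False), ("T1", "R2", X, False), ("T2", "W", X, True)]
faulty_run = [("T1", "R1", X, False), ("T2", "W", X, True), ("T1", "R2", X, False)]
```

Here `diagnose(trained, correct_run)` reports nothing, while `diagnose(trained, faulty_run)` reports positions 1 and 2, because the write slipped between the two reads. Stalling the access at position 1 until `R2` has executed would restore the tested order.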
Similarly, if this program is executed correctly in the testing phase, as shown in Figure 2a, the belonging sets are stored according to the belonging set collection conditions. These belonging sets are compared with the actual interleaving in the operation phase to diagnose atomicity violations. If the program is executed incorrectly in the operation phase, as shown in Figure 2b, the belonging set is collected as shown under the atomicity violation in Figure 2d. For the execution in Figure 2b, AI repairs the atomicity violation in the following steps. First, when the first read access is issued, its belonging set is compared with that from the correct interleaving. No error is diagnosed, because the belonging set from the correct interleaving and that from the actual interleaving are equal. Next, when the following read access is issued, its belonging set is again compared with that from the correct interleaving. Since the two belonging sets differ, an error is diagnosed. To treat the diagnosed error, AI stalls this read access via sleep() and lets the program continue with the other thread, which issues the write access. Because the execution sequence is thereby reordered into the correct sequence, the atomicity violation is repaired.
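Repair by stalling can be sketched with two threads. This is a minimal illustration under assumptions, not AI's implementation: the function `run_demo`, the `write_seen` flag (which stands in for the belonging set check), and the bound `max_stalls` are hypothetical. The bound is included because an unbounded stall would be questionable in a real-time setting.

```python
import threading
import time


def run_demo(writer_delay_s=0.01, stall_s=0.001, max_stalls=1000):
    shared = {"x": 0}
    log = []
    # Stands in for "the expected remote write has already executed".
    write_seen = threading.Event()

    def reader():
        stalls = 0
        # Stall in short sleeps while the expected predecessor is missing;
        # the bound keeps the total delay finite.
        while not write_seen.is_set() and stalls < max_stalls:
            time.sleep(stall_s)
            stalls += 1
        log.append(("read", shared["x"]))

    def writer():
        time.sleep(writer_delay_s)   # the write arrives late, as in a faulty run
        shared["x"] = 1
        log.append(("write", 1))
        write_seen.set()

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Although the reader is started first, the stall lets the writer run, so the log records the write before the read.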
Two aspects are essential for the on-the-fly repair of atomicity violations: time overhead and coverage. Both approaches have low overhead, because atomicity violations are diagnosed against the correct interleaving collected in the test phase and treated by stalling. This low overhead makes them applicable to real-time systems. Their coverage differs because they collect the correct interleaving differently: J. Yu et al.'s method [36] collects only one correct interleaving, whereas AI [37] collects as many types of correct interleaving as possible. As a result, AI recognizes more correct interleavings, so its diagnosis is more accurate than that of J. Yu et al.'s method [36].
3.2. Avionics
There are reports and research studies of concurrency errors that occurred in airborne software. According to the DoD JSSSEH [23], a race condition, which is one type of concurrency error, occurred in the first shuttle flight and in the 44th flight of NASA's Advanced Fighter Technology Integration (AFTI) F16. All airborne software must comply with ARINC 653, the standard for airborne software. To prevent airborne software from crashing, ARINC 653 introduces health management systems (HMSs) for IMA (integrated modular avionics). The job of an HMS is to detect and repair faults in software. Despite this preventive structure, concurrency errors in airborne software continue to be documented. This is not surprising, because concurrency errors are inherent in multi-threaded programs. Some research studies focus on repairing atomicity violations in ARINC 653-based airborne software [6,16]. Ha et al. [16] and Tchamgoue et al. [6] demonstrated experimentally that atomicity violations exist in ARINC 653-based software. The approaches these works use to treat the faults are similar. They store the shared variable access information of each thread in a data structure called the access history [5]. An atomicity violation is diagnosed by examining the access history to check whether the accesses of each thread to the shared variables can be executed in parallel and whether they are protected by a lock [4,28]. To treat atomicity violations, Ha et al.'s method [16] inserts a lock around a shared variable that is not protected by a lock, whereas Tchamgoue et al.'s method [6] delays the thread that initiates an access to the shared variable.
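The lock-based treatment can be illustrated with a minimal sketch. It only shows the general idea of placing a lock around an unprotected shared variable and is not Ha et al.'s tool; the function `run`, the `guard` lock, and the counter are hypothetical.

```python
import threading


def run(n_threads=4, n_iters=1000):
    counter = {"value": 0}
    guard = threading.Lock()   # lock inserted around the previously unprotected variable

    def worker():
        for _ in range(n_iters):
            with guard:
                # The read-modify-write on the shared variable now executes atomically.
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With the lock in place, no increment is lost, so `run()` returns `n_threads * n_iters`. The delay-based treatment trades this mutual exclusion for a reordering of the threads.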
The access history used to check parallelism is a set of data structures generally called labels. A label stores the parallelism and order relationship information for each access. A label can be expressed as a pair of integers [start, end], defined by the programmer to denote the start and the end of a thread, respectively; the start value must always be less than the end value. The following example illustrates how labeling identifies concurrent accesses to shared variables. Assume that two threads branch from the main thread. The label of the main thread at program startup is [1, 100]. When the program creates the two threads from the main thread, the label is split in half: the label of the first thread is set to [1, 50] and the label of the second thread to [51, 100]. Next, we compare the labels of the two threads to check whether the two regions collide. The following rules identify the concurrent relationship between threads:
The two threads and are in a concurrent relationship if the regions are and .
Otherwise, the two threads are in a sequential relationship.
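The labeling example can be sketched as follows. The concurrency rule used here is an assumption: two threads are treated as concurrent when their label regions are disjoint, which is consistent with the [1, 50] and [51, 100] example but may not match the exact condition used by the cited labeling schemes. The function names `split` and `concurrent` are hypothetical.

```python
def split(label):
    """Split a parent label into two halves, one per newly created thread."""
    start, end = label
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)


def concurrent(a, b):
    """Assumed rule: disjoint regions mean neither thread is ordered before the other."""
    return a[1] < b[0] or b[1] < a[0]


main_label = (1, 100)
first, second = split(main_label)   # (1, 50) and (51, 100)
```

Under this assumption, `concurrent(first, second)` holds for the two sibling threads, while `concurrent(main_label, first)` does not, since the child's region lies inside that of the main thread.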
The atomicity violation detection protocol guarantees that at least one error is found if the code contains concurrency violations. Every access event is logged in the history as a [label, Locks] pair and categorized into one of four types: Read, Write, Critical Section (CS)-Read and CS-Write. Based on the diagnosis, an action is taken according to a policy. When a Read operation is detected, the history is searched for Write and CS-Write operations that collide with it. When a Write operation is detected, the history is searched for Read, CS-Read and CS-Write operations to identify atomicity violations.
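The history search described above can be sketched as a small lookup. This is an illustration under assumptions, not the protocol of the cited works: the names `CHECK_AGAINST` and `log_access` are hypothetical, two labels are assumed to be concurrent when their regions are disjoint, and two events are assumed to collide only if they hold no lock in common. The text does not spell out what is checked for incoming CS-Read and CS-Write events, so the sketch logs them without checking.

```python
# Which logged event types an incoming event is compared against (from the text).
CHECK_AGAINST = {
    "Read": {"Write", "CS-Write"},
    "Write": {"Read", "CS-Read", "CS-Write"},
}


def concurrent(a, b):
    """Assumed rule: labels with disjoint regions belong to concurrent threads."""
    return a[1] < b[0] or b[1] < a[0]


def log_access(history, kind, label, locks):
    """Log one access event and return the earlier events it collides with."""
    locks = frozenset(locks)
    collisions = [
        (k, lab, held)
        for (k, lab, held) in history
        if k in CHECK_AGAINST.get(kind, set())
        and concurrent(label, lab)
        and not (locks & held)          # assumption: a common lock rules out a collision
    ]
    history.append((kind, label, locks))
    return collisions


history = []
first = log_access(history, "CS-Write", (1, 50), {"m"})
second = log_access(history, "Read", (51, 100), set())
third = log_access(history, "Write", (1, 50), set())
```

In this hypothetical run, the unprotected Read collides with the earlier CS-Write of the concurrent thread, and the later Write collides with that Read. Because every event is appended to the history, the sketch also shows why the cost of the scheme depends on the number of logged accesses.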
The labeling scheme needs to log every access event in the history. The time and space overheads of existing atomicity violation diagnosing schemes depend on the maximum parallelism T, which determines their time and space complexity. Recent real-world programs are known to have millions of threads running at the same time [39,40,41], and the navigation software of an aircraft has about a billion lines of code [42]. An autonomous repair-based HMS cannot rely on a scheme with such complexity, because it is inefficient and unreliable for airborne software.