Tightening contention delays while scheduling parallel applications on multi-core architectures

BENJAMIN ROUXEL, University of Rennes 1/IRISA
STEVEN DERRIEN, University of Rennes 1/IRISA
ISABELLE PUAUT, University of Rennes 1/IRISA

Multi-core systems are increasingly interesting candidates for executing parallel real-time applications, in avionic, space or automotive industries, as they provide both computing capabilities and power efficiency. However, ensuring that timing constraints are met on such platforms is challenging, because some hardware resources are shared between cores.

Assuming worst-case contentions when analyzing the schedulability of applications may result in systems mistakenly declared unschedulable, although the worst-case level of contentions can never occur in practice. In this paper, we present two contention-aware scheduling strategies that produce a time-triggered schedule of the application’s tasks. Based on knowledge of the application’s structure, our scheduling strategies precisely estimate the effective contentions, in order to minimize the overall makespan of the schedule. An Integer Linear Programming (ILP) solution of the scheduling problem is presented, as well as a heuristic solution that generates schedules very close to those of the ILP (5% longer on average), with a much lower time complexity. Our heuristic improves the overall makespan of the resulting schedules by 19% compared to a worst-case contention baseline.

CCS Concepts: • Computer systems organization → Parallel architectures; Real-time system architecture; • Software and its engineering → Scheduling; Real-time schedulability; • Hardware → Safety critical systems;

1 INTRODUCTION

The increasing demand for computing power and low energy consumption is placing multi-/many-core architectures as increasingly interesting candidates for executing embedded critical systems. It becomes more and more common to find such architectures in automotive, avionic or space industries [2, 16].

Guaranteeing that timing constraints are met on multi-core platforms is a challenging issue. One difficulty lies in the estimation of the Worst-Case Execution Time (WCET) of tasks. Due to the presence of shared hardware resources (buses, shared last level of cache, …), techniques designed for single-core architectures cannot be directly applied to multi-core ones. Since it is hard in general to guarantee the absence of resource conflicts during execution, current WCET techniques either produce pessimistic WCET estimates or constrain the execution to enforce the absence of conflicts, at the price of a significant hardware under-utilization.

A second issue is the selection of a scheduling strategy which will decide where and when to execute tasks. Scheduling for multi-core platforms was the subject of many research works, surveyed in [8]. We believe that static mapping of tasks to cores (partitioned scheduling) and time-triggered scheduling on each core give control over the sharing of hardware resources, and thus allow worst-case contention delays to be better estimated.

Some existing work on multi-core scheduling considers that the platform workload consists of independent tasks. As parallel execution is the most promising solution to improve performance, we envision that within only a few years from now, real-time workloads will evolve toward parallel programs. The timing behaviour of such programs is challenging to analyze because they consist of dependent tasks interacting through complex synchronization/communication mechanisms. We believe that models offering a high-level view of the behavior of parallel programs allow a precise estimation of shared resource conflicts. In this paper, we assume parallel applications modeled as directed acyclic task graphs (DAGs), and show that knowledge of the application’s structure makes it possible to precisely identify the tasks that effectively execute in parallel, and thus to estimate contention delays. These DAGs do not necessarily need to be built from scratch, which would require a significant engineering effort. Automatic extraction of parallelism, for instance from a high-level description of applications in model-based design workflows [10], appears to us a much more promising direction.

In this paper, we present two mapping and scheduling strategies featuring bus contention awareness. Both strategies apply to multi-core platforms where cores are connected to a round-robin bus. A safe (but pessimistic) bound for the access latency is to consider $\text{NbCores} - 1$ contending tasks being granted access to the bus (with $\text{NbCores}$ as the number of available cores). Our scheduling strategies take into consideration the application’s structure and information on the schedule under construction to estimate precisely the effective degree of interference used to compute the access latency. The proposed scheduling strategies generate a non-preemptive time-triggered partitioned schedule and select the appropriate level of contention to minimize the schedule length.

The first proposed scheduling method models the task mapping and scheduling problem as constraints on task assignment, task start times and communications between tasks. We demonstrate that the optimal schedule can only be found using quadratic equations due to the nature of the required information to model the communication cost. This modeling is then refined into an Integer Linear Programming (ILP) formulation that in some cases overestimates communication costs and thus may not find the shortest schedule. Since the solved scheduling problem is NP-hard, the ILP formulation is shown to not scale properly when the number of tasks grows. We thus developed a heuristic scheduling technique that scales much better with the number of tasks and is able to compute the accurate communication cost. Albeit not always finding the optimal solution, the ILP formulation is included in this paper, because it gives a non ambiguous description of the problem under study, and also serves as a baseline to evaluate the quality of the proposed heuristic technique.

The proposed scheduling techniques are evaluated experimentally. The schedule’s length generated by our heuristic is compared to its equivalent baseline scheduling technique accounting for the worst case contention. The experimental evaluation also studies the interest of allowing concurrent bus accesses as compared to related work where concurrent accesses are systematically avoided in the generated schedule. Finally, we study the time required by the proposed techniques, as well as how schedule lengths vary when changing architectural parameters such as the duration of one slot of the round-robin bus. The experimental evaluation uses a subset of the StreamIT streaming benchmarks [25] as well as synthetic task graphs using the TGFF graph generator [11].

The contributions of this work are threefold:

(1) First, we propose a novel approach to derive precise bounds on worst-case contention on a shared round-robin bus. Compared to previous methods, we use knowledge of the application’s structure (task graph) as well as knowledge of task placement and scheduling to precisely estimate tasks that execute in parallel, and thus tighten the worst-case bus access delay.

(2) Second, we present two scheduling methods that calculate a time-triggered partitioned schedule, using an ILP formulation and a heuristic. The novelty with respect to existing scheduling techniques lies in the ability of the scheduler to select the best strategy regarding concurrent accesses to the shared bus (allow or forbid concurrency) to minimize the overall makespan of the schedule.

(3) Third, we provide experimental data to evaluate the benefit of precise estimation of contentions as compared to the baseline estimation where \( \text{NbCores} - 1 \) tasks are granted access to the bus for every memory access. Moreover, we discuss the interest of allowing concurrency (and thus interference) between tasks as compared to state-of-the-art techniques such as [2] where contentions are systematically avoided.

The rest of this paper details the proposed techniques and is organized as follows. Section 2 presents related studies. Assumptions on the hardware and software are given in Section 3. Section 4 details the proposed method to calculate precise worst-case degree of interference when accessing the shared bus. Section 5 then presents the two techniques for schedule generation, using an ILP formulation and a heuristic. Section 6 presents experimental results. Concluding remarks are given in Section 7.

2 RELATED WORK

Task scheduling on multi-core platforms consists in deciding where (mapping) and when (scheduling) each task is executed. The literature on mapping/scheduling of tasks on multi-cores is vast, as approaches differ in many properties, e.g. the input task set or the type of scheduling algorithm. According to the survey from Davis and Burns [8], the three main categories of scheduling algorithms are global scheduling, semi-partitioned, and partitioned scheduling. According to their terminology, the scheduling methods presented in this paper can be classified as static, partitioned, time-triggered and non-preemptive.

Shared resources in multi-core systems may be either shared software objects (such as variables) that have to be used in mutual exclusion or shared hardware resources (such as buses or caches) that are shared between cores according to some resource arbitration policies (TDMA, round-robin, etc). These two classes of resources lead to different analyses to ensure that there is neither starvation nor deadlock. Dealing with shared objects is not new, and there now exist several techniques adapted from single-core systems. Most of them are based on priority inheritance. In particular Jarrett et al. [17] apply priority inheritance to multi-cores and propose a resource management protocol which bounds the access latency to a shared resource. Negrean et al. [20] provide a method to compute the blocking time induced by concurrent tasks in order to determine their response time.

Beyond shared objects, multi-core processors feature hardware resources that may be accessed concurrently by tasks running on the different cores. Typical hardware resources are the main memory, the memory bus or shared last-level cache. A contention analysis then has to be defined to determine the worst case delay for a task to gain access to the resource (see [12] for a survey). Some shared resources may directly implement timing isolation mechanism between cores, such as Time Division Multiple Access (TDMA) buses, making contention analysis straightforward.

To avoid resource under-utilization caused by TDMA, other resource sharing strategies such as round-robin offer a good trade-off between predictability and resource usage. Worst-case bounds on contention are similar to those of TDMA. However, knowledge about the system may help tightening estimated contention delays.

This article was presented in the International Conference on Embedded Software (EMSOFT) 2017 and appears as part of the ESWEEK-TECS special issue.
Approaches to estimate contention delays for round-robin arbitration differ according to the nature and amount of information used to estimate contention delays. For architectures with caches, Dasari et al. [6, 7] only assume task mapping known, whereas Rihani et al. [16] assume both mapping and execution order on each core known. Schliecker et al. [22] tightly determine the number of interfering bus requests. In comparison with these works, our technique jointly calculates task scheduling and contentions with the objective of minimizing the schedule makespan by letting the technique decide when it is necessary to avoid or to account for interference.

Further refinements of contention costs can be obtained by using specific task models. Pellizzoni et al. [21] introduced the PRedictable Execution Model (PREM) that splits a task in a read communication phase and an execute phase. This allows accesses to shared hardware resources to be precisely identified. In our work, we use a model very close to the PREM task model.

The PREM task model, or similar ones, was used in several research works [1, 2]. Alhammad and Pellizzoni [1] proposed a heuristic to map and schedule a fork/join graph onto a multi-core in a contention-free manner. They split the graph in sequential or parallel segments, and then schedule each segment. In contrast to us, they consider only code and local data access in contention estimation, leaving the global shared variables in the main external memory with worst-case concurrency assumed when accessing them. Moreover, we deal with any DAG, not just fork/join graphs, and write back modified data to memory only when required. Becker et al. [2] proposed an ILP formulation and a heuristic aiming at scheduling periodic independent PREM-based tasks on one cluster of a Kalray MPPA processor. They systematically create a contention-free schedule. Our work differs in the considered task model as well as the goal to reach. They consider sporadic independent tasks for which they aim at finding a valid schedule that meets each task’s deadline. In contrast, we consider one iteration of a task graph and we aim at finding the shortest schedule. In addition, our resulting schedule might include overlapping communications due to scheduler decision, while [1, 2] only allow synchronized communication.

Giannopoulou et al. proposed in [13] a combined analysis of computing, memory and communication scheduling in a mixed-criticality setting, for cluster-based architectures such as the Kalray MPPA. Similarly to our work, the authors aim, among others, at precisely estimating contention delays, in particular by identifying tasks that may run in parallel under the FTTS schedule, which uses synchronization barriers. However, to the best of our knowledge they do not take advantage of the application structure, in particular dependencies between tasks, to further refine contention delays.

In order to reduce the impact of communication delays on schedules, [4, 14] hide the communication request while a computation task is running. This accounts for the asynchronism implied by DMA requests. However, they use a worst-case contention which could be refined by our study. In addition to the initial problem, shared resource interference can be accounted for at scheduling time in order to tighten the overall makespan of the resulting schedule.

To quantify memory interference on DRAM-banks, [19, 27] proposed two analyses, request-driven and job-driven. The former one bounds memory request delays considering memory interference on the DRAM bank, while the latter adds the worst-case concurrency on the data-bus of the DRAM. Their work is orthogonal to ours: the request-driven analysis would refine the access time part in our delay, while our method could refine their job-driven analysis by decreasing the amount of concurrency they use.

3 SYSTEM MODEL
3.1 Architecture model
We consider a multi-core architecture for which every core is connected to a main external memory through a bus. Each core either has private access to a ScratchPad memory (SPM) (e.g.: Patmos [23]) or there exists a mechanism for bank privatization (e.g.: Kalray MPPA [9]). Such a private memory allows, after having first fetched data/code from the main memory, to perform computations without any access to the shared bus. For each core, data is transferred between the SPM and the main memory through a shared bus.

Communications are blocking and indivisible. The sender core initiates a memory request, then waits until the request is fully complete (blocking communications), i.e. the data is transferred from/to the external memory. There is no attempt to reuse processor time during a communication by allocating the processor to another task (indivisible communication). Execution on the sending core is stalled until communication completion. In case the sender and the receiver tasks execute on the same core, communications are performed directly using the SPM and no memory transfer is performed.

The shared bus is arbitrated using a round-robin policy. Access requests are enqueued (one queue per core) and served in a round-robin fashion. A maximum duration of $T_{\text{slot}}$ is allocated to each core, to transfer $D_{\text{slot}}$ data words to external memory (a data word needs $T_{\text{slot}}/D_{\text{slot}}$ time units to be sent). If a core requires more time than $T_{\text{slot}}$ to send all the data, then the data will be split in chunks to be sent in several intervals of length $T_{\text{slot}}$ (see equation (1a)) plus some additional remaining time (see equation (1b)). If a full $T_{\text{slot}}$ duration is not needed to send some data, the arbiter processes the request from the next core in the round. As an example, with a $D_{\text{slot}}$ of 2 data words, a core requesting the transfer of 5 data words needs two periods of duration $T_{\text{slot}}$ and a remaining time of $T_{\text{slot}}/D_{\text{slot}}$.

In the worst case and for each chunk, a request will be delayed by $\text{NbCores} - 1$ pending requests from the other cores (with $\text{NbCores}$ being the number of available cores), see equation (1c). Overall, equation (1d) derives the worst-case latency to transmit some data with a round-robin arbitration policy.

The round-robin arbiter is predictable as the latency of a request can be statically estimated, as long as the configuration of the arbiter (parameters $T_{\text{slot}}$ and $D_{\text{slot}}$) and the amount of data to be transferred ($data$) are known at design time [18].

\begin{align}
\#\text{chunks} &= \left\lfloor \frac{data}{D_{\text{slot}}} \right\rfloor \tag{1a} \\
\text{remainingTime} &= (data \bmod D_{\text{slot}}) \cdot \frac{T_{\text{slot}}}{D_{\text{slot}}} \tag{1b} \\
\#\text{waitingSlots} &= \left\lceil \frac{data}{D_{\text{slot}}} \right\rceil \tag{1c} \\
\text{delay} &= T_{\text{slot}} \cdot \#\text{waitingSlots} \cdot (\text{NbCores} - 1) + T_{\text{slot}} \cdot \#\text{chunks} + \text{remainingTime} \tag{1d}
\end{align}
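The worst-case latency of equations (1a) to (1d) can be sketched in a few lines of Python. This is only an illustration: the function name `round_robin_delay`, its parameter names and the value $T_{\text{slot}} = 4$ are assumptions made for the example, not part of the original formulation.

```python
def round_robin_delay(data, t_slot, d_slot, contenders):
    """Worst-case latency to transfer `data` words over the round-robin bus.

    `contenders` is the number of other cores assumed to delay each chunk;
    equation (1d) uses NbCores - 1 (hypothetical parameter name).
    """
    chunks = data // d_slot                                # (1a) full slots
    remaining_time = (data % d_slot) * (t_slot / d_slot)   # (1b) partial slot
    waiting_slots = -(-data // d_slot)                     # (1c) ceiling division
    return (t_slot * waiting_slots * contenders            # waiting for the other cores
            + t_slot * chunks + remaining_time)            # own transfer time


# Example from the text: D_slot = 2 words, 5 words to transfer.
# T_slot = 4 time units and 4 cores are illustrative values only.
worst = round_robin_delay(data=5, t_slot=4, d_slot=2, contenders=4 - 1)
# 3 waiting slots * 3 contenders * 4 + 2 full slots * 4 + 2.0 = 46.0
```

With `contenders=0` the same call returns only the transfer time itself (two full slots plus the remaining time).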

### 3.2 Task model

In this work, we consider an application modeled as a directed acyclic task graph (DAG), in which nodes represent computations (tasks) and edges represent communications between tasks. A task graph $G$ is a pair $(V, E)$ where the vertices in $V$ represent the tasks of the application. The set of edges $E$ represents the data dependencies. An edge is present when a task is causally dependent on another one, meaning the target of the edge needs the source to be completed before it can run. An example of a simple task graph, extracted from the StreamIT benchmark suite [25], is presented in Figure 1 and corresponds to a radix-2 Fast Fourier Transform.

Each task is divided in three phases (or sub-tasks) according to the read-execute-write semantics, as first defined in PREM [21] and augmented in [1] with a write phase. The read phase reads/receives the mandatory code and/or data from main memory to SPM, such that the execute phase can proceed without access to the bus. Finally the write phase writes/sends the resulting data back to the main memory. In the following, read and write will refer to the communication phases of tasks. The obvious interest of the read-execute-write semantics is to isolate the sub-tasks that use the bus. Therefore, the contention analysis can focus only on these sub-tasks. For the sake of simplicity, this study considers that all code and data fits in all types of memory at any point of time. We also assume the code entirely fits into the SPM, but a simple extension could consider prefetching the code in the read phase.

---

1 Non-blocking communication, using a DMA engine, is left for future work.

2 This work supports multiple DAGs with the same periodicity as it is; however, we skipped it for space considerations.

A task $i$ is defined by a tuple $\langle \tau_i^r, \tau_i^e, \tau_i^w \rangle$ to represent its read, execute, and write phases. An edge is defined by a tuple $e = \langle \tau_s^w, \tau_t^r, D_{s,t} \rangle$ where $\tau_s^w$ is the write phase of the source task $s$, $\tau_t^r$ is the read phase of the target task $t$, and $D_{s,t}$ is the amount of data exchanged between $s$ and $t$.

The WCET of the execute phase, noted $C_i$, can be estimated in isolation from the other tasks considering a single-core architecture, because there is no access to the main memory (all the required data and code have been loaded into the SPM before the task’s execution). Conversely, the communication delays of the read and write phases (respectively noted delay$_r$ and delay$_w$) depend on several factors: the amount of data to be transferred and the number of potential concurrent accesses to the bus. Thus there are two possibilities to estimate the WCET of the read/write phases: either take a pessimistic static bound, agnostic of task placement on cores (equation (1d)), or take into consideration the knowledge about the application’s structure (effective concurrency) and about task mapping and scheduling to obtain a more precise bound, as detailed in Section 4.

4 REFINING COMMUNICATION COSTS

The communication cost for a communication phase depends on how much interference this phase suffers from. The interference is due to tasks running in parallel on the other cores. The number of such tasks depends on scheduling decisions (task placement in time and space). Considering a task $i$, only tasks that are assigned to a different core may interfere with $i$, and only tasks that execute within a time window overlapping with that of $i$ actually interfere. This section presents, using a top-down approach, how a precise estimation of communication costs is obtained. For the whole document, concurrent tasks is used for tasks with no data dependencies that may be executed in parallel, while parallel tasks is used for tasks that are scheduled to run in overlapping time windows.

4.1 Accounting for the actual concurrency in applications

Equation (1d) statically computes communication costs assuming all other cores ($NbCores - 1$) execute a communication phase and thus always delay every memory access, which is pessimistic. From the example in Figure 1, and assuming that the application is the only one executing on the architecture, no parallel request can arise at the time of the read phase of task Split1 because of the structure of the application. Similarly on the same example, the read phase of task Add can only be delayed by the read and write phases of task Sub.

A pair of tasks is concurrent if they do not have data dependencies between each other, i.e. tasks that may be executed in parallel. As an example, in Figure 1, tasks Add and Sub are concurrent. Determining if two tasks are concurrent is usually NP-complete [24]. However, with the properties of our task model, in particular the presence of statically-paired communication points, evaluation of concurrency is polynomial. Two tasks are concurrent if there exists no path connecting them in the task graph. By building the transitive closure of the task graph, using for example the classical Warshall’s algorithm [26], two tasks \( i \) and \( j \) are concurrent if there is no edge connecting them in the transitive closure. In the following, function \( \mathrm{are\_conc}(i,j) \) will be used to indicate task concurrency according to the method described in this section. It returns \( true \) when tasks \( i \) and \( j \) are concurrent and \( false \) otherwise.

According to the knowledge of the structure of the task graph, equation (1d) can then be refined as follows. Instead of considering \( NbCores - 1 \) contentions for every memory access, the worst-case number of contenders with a task \( i \) will be \( \min(NbCores - 1, |\{ j \mid \mathrm{are\_conc}(i,j) \}|) \).
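The concurrency test described above can be sketched as follows. The four-task graph (`Split`, `Add`, `Sub`, `Join`) is a hypothetical fork/join example chosen for illustration and is not claimed to reproduce Figure 1; the helper names `transitive_closure`, `make_are_conc` and `max_contenders` are likewise assumptions.

```python
from itertools import product


def transitive_closure(tasks, edges):
    """Warshall's algorithm: reach[i, j] is True iff a path leads from i to j."""
    reach = {(a, b): (a, b) in edges for a, b in product(tasks, repeat=2)}
    for k in tasks:                      # intermediate task, outermost loop
        for i in tasks:
            for j in tasks:
                if reach[i, k] and reach[k, j]:
                    reach[i, j] = True
    return reach


def make_are_conc(tasks, edges):
    """Build are_conc(i, j): True when no path connects i and j in either direction."""
    reach = transitive_closure(tasks, set(edges))

    def are_conc(i, j):
        return i != j and not reach[i, j] and not reach[j, i]

    return are_conc


def max_contenders(i, tasks, are_conc, nb_cores):
    """Structure-based bound: min(NbCores - 1, number of tasks concurrent with i)."""
    return min(nb_cores - 1, sum(are_conc(i, j) for j in tasks))


# Hypothetical task graph used only for this example.
TASKS = ["Split", "Add", "Sub", "Join"]
EDGES = [("Split", "Add"), ("Split", "Sub"), ("Add", "Join"), ("Sub", "Join")]
are_conc = make_are_conc(TASKS, EDGES)
```

On this graph, `are_conc("Add", "Sub")` holds while `are_conc("Split", "Join")` does not, so with 4 cores `Add` has at most one contender instead of three.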

### 4.2 Further refining the worst-case degree of interference

Keeping the example from Figure 1, if the two concurrent tasks \( \text{Add} \) and \( \text{Sub} \) are mapped on the same core and thus are executed in sequence, then their communication phases do not interfere anymore. Knowledge of tasks’ scheduling (tasks placement and time window assigned to each task), when known, can further help refining the amount of interference suffered by a task.

Reasoning in reverse, two phases do not overlap if one ends before the other starts, which leads for tasks with read-execute-write semantics to equation (2). For two tasks \( i \) and \( j \), taking two phases \( \tau_i^X \) and \( \tau_j^Y \), where \( X \) and \( Y \) can either represent a read or a write, equation (2) states that if phase \( \tau_i^X \) ends before phase \( \tau_j^Y \) starts or vice versa, then the two phases \( \tau_i^X \) and \( \tau_j^Y \) do not overlap.

We consider here the end date as the first discrete time point at which the task is over, thus no overlapping occurs when

\[
\text{end}_j^Y \leq \text{start}_i^X \vee \text{end}_i^X \leq \text{start}_j^Y \tag{2}
\]

Then, by negating equation (2), we get equation (3) that will be true if two tasks have overlapping execution windows. In the following, \( \mathrm{are\_OL}(\tau_i^X, \tau_j^Y) \) returns \( true \) if the communication phases \( \tau_i^X \) and \( \tau_j^Y \) overlap, and \( false \) otherwise.

\[
\mathrm{are\_OL}(\tau_i^X, \tau_j^Y) = \neg(\text{end}_j^Y \leq \text{start}_i^X \vee \text{end}_i^X \leq \text{start}_j^Y) \equiv (\text{end}_j^Y > \text{start}_i^X \wedge \text{end}_i^X > \text{start}_j^Y) \tag{3}
\]
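As a quick sanity check of the overlap predicate, the sketch below writes the non-overlap condition of equation (2) and its negation side by side. By De Morgan's law, negating the disjunction of equation (2) yields a conjunction of strict inequalities. The function names and the plain integer arguments are assumptions made for the example.

```python
def no_overlap(start_x, end_x, start_y, end_y):
    """Equation (2): one phase ends before (or exactly when) the other starts."""
    return end_y <= start_x or end_x <= start_y


def are_ol(start_x, end_x, start_y, end_y):
    """Negation of equation (2): both strict inequalities must hold."""
    return end_y > start_x and end_x > start_y


# A phase over [0, 5) and one starting at 5 share no time point,
# whereas a phase starting at 4 overlaps the first one.
back_to_back = are_ol(0, 5, 5, 8)
overlapping = are_ol(0, 5, 4, 8)
```

Here `back_to_back` is `False` and `overlapping` is `True`, which matches the convention that the end date is the first time point at which a phase is over.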

Assuming the schedule is known, the degree of interference a task can suffer from can be determined by counting the number of other tasks that overlap in the schedule. Only concurrent tasks (function \( \mathrm{are\_conc} \)) can overlap, because dependent (not concurrent) tasks are ordered by their data dependencies.

As constrained by the task model (Section 3.2), only communication phases request accesses to the bus, thus only the amount of interference of the read and write phases needs to be computed. This leads to equations (4a) and (4b) that jointly compute the number of interfering tasks for each communication phase (\( \text{interf}_i^r \) and \( \text{interf}_i^w \) for respectively the read and write) of a task \( i \) by detecting overlapping read and write phases in the set of concurrent tasks.

\[
\forall i \in T: \quad \text{interf}_i^r = \sum_{j \in T \mid \mathrm{are\_conc}(i,j)} \mathrm{are\_OL}(\tau_i^r, \tau_j^r) + \sum_{j \in T \mid \mathrm{are\_conc}(i,j)} \mathrm{are\_OL}(\tau_i^r, \tau_j^w) \tag{4a}
\]

\[
\forall i \in T: \quad \text{interf}_i^w = \sum_{j \in T \mid \mathrm{are\_conc}(i,j)} \mathrm{are\_OL}(\tau_i^w, \tau_j^r) + \sum_{j \in T \mid \mathrm{are\_conc}(i,j)} \mathrm{are\_OL}(\tau_i^w, \tau_j^w) \tag{4b}
\]

The values of \( \text{interf}_i^r \) and \( \text{interf}_i^w \) from equations (4a) and (4b) can then replace the pessimistic value of \( NbCores - 1 \) from equation (1d) to tighten the worst-case delay of the read and write phases, leading to equations (5a) and (5b).

\begin{align}
\forall i \in T: \quad \text{delay}^r_i &= T_{\text{slot}} \cdot \#\text{waitingSlots} \cdot \text{interf}^r_i + T_{\text{slot}} \cdot \#\text{chunks} + \text{remainingTime} \tag{5a} \\
\text{delay}^w_i &= T_{\text{slot}} \cdot \#\text{waitingSlots} \cdot \text{interf}^w_i + T_{\text{slot}} \cdot \#\text{chunks} + \text{remainingTime} \tag{5b}
\end{align}

This last refinement depends on the knowledge of the schedule. The two scheduling techniques described in Section 5 use these formulas jointly with schedule generation.
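To make the refinement concrete, the sketch below counts interfering phases for a fixed, hypothetical schedule and plugs the result into the delay formula. It assumes that every read or write phase of a concurrent task that overlaps the phase under analysis counts as one contender (the rr, rw, wr and ww pairings listed in Table 1). The `Phase` class, the function names and all time values are illustrative assumptions; in the actual techniques these quantities are computed jointly with the schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    start: int
    end: int  # first time point at which the phase is over


def overlaps(a, b):
    """Overlap test between two communication phases (negation of equation (2))."""
    return b.end > a.start and a.end > b.start


def interference(i, kind, schedule, are_conc):
    """Number of read/write phases of concurrent tasks overlapping phase `kind` of task i."""
    own = schedule[i][kind]
    return sum(
        overlaps(own, other)
        for j, phases in schedule.items()
        if j != i and are_conc(i, j)
        for other in (phases["r"], phases["w"])
    )


def refined_delay(data, t_slot, d_slot, interf):
    """Equations (5a)/(5b): same shape as (1d), with interf replacing NbCores - 1."""
    chunks = data // d_slot
    remaining_time = (data % d_slot) * (t_slot / d_slot)
    waiting_slots = -(-data // d_slot)  # ceiling division
    return t_slot * waiting_slots * interf + t_slot * chunks + remaining_time


# Hypothetical schedule: Add and Sub are concurrent and mapped on different cores.
SCHEDULE = {
    "Add": {"r": Phase(0, 6), "w": Phase(10, 14)},
    "Sub": {"r": Phase(4, 9), "w": Phase(20, 24)},
}


def always_conc(i, j):
    return True  # stand-in for are_conc on this two-task example


interf_r = interference("Add", "r", SCHEDULE, always_conc)  # read of Sub overlaps
interf_w = interference("Add", "w", SCHEDULE, always_conc)  # nothing overlaps
```

In this example the read phase of `Add` has one contender and its write phase none, so `refined_delay(5, 4, 2, interf_r)` is shorter than the bound obtained with three contenders on a 4-core platform.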

## 5 CONTENTION-AWARE MAPPING/SCHEDULING ALGORITHMS

This section presents two scheduling techniques that integrate the precise interference costs calculated in the previous section, first as a constraint system expressed as an Integer Linear Programming (ILP) formulation, second as a heuristic method. The main outcome of both techniques is a static mapping and schedule for each core, for one single application. According to the terminology given in [8], the proposed scheduling techniques are partitioned, time-triggered and non-preemptive.

### 5.1 Integer Linear Programming technique

An Integer Linear Programming (ILP) problem consists of a set of integer variables constrained by linear inequalities. Solving an ILP problem then consists in optimizing (minimizing or maximizing) a linear function of the variables. When scheduling and mapping a task graph on a multi-core platform, the objective is to obtain the shortest schedule. Table 1 summarizes the notations and variables needed by the ILP formulation.

For a concise presentation of constraints, the two logical operators ∨, ∧ are directly used in the text of constraints. These operators can be transformed into linear constraints using the simple transformation rules from [3].
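As an illustration of what such a transformation looks like, the standard linearization of both operators over binary variables \( x, y, z \in \{0, 1\} \) is recalled below. It is given as a commonly used encoding, under the assumption that the operands are binary, and is not claimed to be the exact rule set of [3].

\[
z = x \wedge y \iff z \leq x, \quad z \leq y, \quad z \geq x + y - 1
\]

\[
z = x \vee y \iff z \geq x, \quad z \geq y, \quad z \leq x + y
\]

Each logical operator thus costs three linear inequalities and one auxiliary binary variable \( z \).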

**Objective function.** The goal is to minimize the makespan of the schedule, that is minimizing the end time of the last scheduled task. The objective function, given in equation (6a), is to minimize the makespan Θ. Equation (6b) constrains the completion time of all tasks (start of the write phase, ρ^w_i, plus its WCET, delay^w_i) to be less than or equal to the schedule makespan.

\begin{align}
\text{minimize } & \Theta \tag{6a} \\
\forall i \in T: \quad & \rho^w_i + \text{delay}^w_i \leq \Theta \tag{6b}
\end{align}

**Problem constraints.** Some basic rules of a valid schedule are expressed in the following equations. Equation (7a) ensures the uniqueness of a task mapping. Equation (7b) indicates if two tasks are mapped on the same core, with a simplification to decrease the number of constraints. Equation (7c) imposes an ordering between every pair of tasks, such that a task scheduled before another one cannot also be scheduled after it. Finally, equation (7d) derives the ordering of tasks assigned to the same core.

\begin{align}
\forall i \in T: \quad & \sum_{c \in P} p_{i,c} = 1 \tag{7a} \\
\forall (i, j) \in T \times T,\ i \neq j: \quad & m_{i,j} = \sum_{c \in P} (p_{i,c} \land p_{j,c}) \ \text{ and } \ m_{i,j} = m_{j,i} \tag{7b} \\
\forall (i, j) \in T \times T,\ i \neq j: \quad & a_{i,j} + a_{j,i} = 1 \tag{7c} \\
\forall (i, j) \in T \times T,\ i \neq j: \quad & am_{i,j} = a_{i,j} \land m_{i,j} \tag{7d}
\end{align}
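The meaning of the binary variables can be illustrated by deriving them from a given mapping and given start times, rather than by solving the ILP. In the sketch below, the function name `derive_binaries`, the dictionary-based inputs and the tie-breaking rule (tasks with equal start times are ordered by name) are assumptions made for the example. Representing the mapping as one core per task satisfies equation (7a) by construction.

```python
def derive_binaries(core, rho_r):
    """Compute m, a and am for every ordered pair of distinct tasks.

    core[t]  : core on which task t is mapped (one core per task, as in (7a))
    rho_r[t] : start time of the read phase of task t
    """
    tasks = sorted(core)
    # Ties on start time are broken by task name so that a[i,j] + a[j,i] = 1.
    rank = {t: (rho_r[t], k) for k, t in enumerate(tasks)}
    m, a, am = {}, {}, {}
    for i in tasks:
        for j in tasks:
            if i == j:
                continue
            m[i, j] = int(core[i] == core[j])    # (7b) same core
            a[i, j] = int(rank[i] < rank[j])     # (7c) i ordered before j
            am[i, j] = a[i, j] & m[i, j]         # (7d) before, and on the same core
    return m, a, am


# Hypothetical mapping: A and B share core 0, C runs alone on core 1.
CORE = {"A": 0, "B": 0, "C": 1}
RHO_R = {"A": 0, "B": 10, "C": 0}
m, a, am = derive_binaries(CORE, RHO_R)
```

Here `am["A", "B"]` is 1 because `A` precedes `B` on the same core, whereas `am["A", "C"]` is 0 since the two tasks are mapped on different cores.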

Table 1. Notations & ILP variables

<table>
<thead>
<tr><th>Notation</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td colspan="2">Sets</td></tr>
<tr><td>$T$</td><td>the set of tasks</td></tr>
<tr><td>$P$</td><td>the set of processors/cores</td></tr>
<tr><td colspan="2">Functions</td></tr>
<tr><td>$predecessors(i)$</td><td>returns the set of direct predecessors of task $i$</td></tr>
<tr><td>$successors(i)$</td><td>returns the set of direct successors of task $i$</td></tr>
<tr><td>$\mathrm{are\_conc}(i,j)$</td><td>returns true if $i$ and $j$ are concurrent, as defined in Section 4.1</td></tr>
<tr><td colspan="2">Constants</td></tr>
<tr><td>$C_i$</td><td>task $i$ execute phase’s WCET computed in isolation as stated in Section 3.2</td></tr>
<tr><td>$D_{i,j}$</td><td>amount of data exchanged between task $i$ and $j$</td></tr>
<tr><td colspan="2">Integer variables</td></tr>
<tr><td>$\Theta$</td><td>schedule makespan</td></tr>
<tr><td>$\rho^r_i, \rho_i, \rho^w_i$</td><td>start times of read, execute, write phases of task $i$</td></tr>
<tr><td>$\delta^r_i, \delta^w_i$</td><td>total amount of data read/written by a read/write phase of $i$ according to predecessors/successors’ mapping</td></tr>
<tr><td>$chunk^r_i, chunk^w_i$</td><td>the number of full slots to read/write $\delta^r_i/\delta^w_i$ from eq. (1a)</td></tr>
<tr><td>$remainingTime^r_i, remainingTime^w_i$</td><td>the remaining time to read/write data that does not fit in a full slot from eq. (1b)</td></tr>
<tr><td>$waitingSlots^r_i, waitingSlots^w_i$</td><td>the number of full slots a read/write phase must wait from eq. (1c)</td></tr>
<tr><td>$interf^r_i, interf^w_i$</td><td>the number of interfering tasks of the read/write phases of $i$ from eq. (4a) and (4b)</td></tr>
<tr><td>$delay^r_i, delay^w_i$</td><td>task $i$ read, write phases’ WCET from equations (5a) and (5b)</td></tr>
<tr><td colspan="2">Binary variables</td></tr>
<tr><td>$p_{i,c} = 1$</td><td>task $i$ is mapped on core $c$</td></tr>
<tr><td>$m_{i,j} = 1$</td><td>tasks $i$ &amp; $j$ are mapped on the same core</td></tr>
<tr><td>$a_{i,j} = 1$</td><td>task $i$ is scheduled before task $j$, in the sense $\rho^r_i \leq \rho^r_j$</td></tr>
<tr><td>$am_{i,j} = 1$</td><td>same as $a_{i,j}$ but on the same core</td></tr>
<tr><td>$ov_{i,j}^{XY} = 1$</td><td>phase $X$ of $i$ overlaps with phase $Y$ of $j$, $XY \in \{rr, ww, rw, wr\}$</td></tr>
</tbody>
</table>

**Read-execute-write semantics constraints.** We require the phases of a task to execute contiguously, as expressed in equations (8a) and (8b). The execute phase of task $i$ starts ($\rho_i$) immediately after the completion of the read phase (start of read phase $\rho^r_i$ + communication cost $delay^r_i$). Similarly, the write phase starts ($\rho^w_i$) right after the end of the execute phase (start of execute phase $\rho_i$ + WCET $C_i$).

$$\forall i \in T, \quad \rho_i = \rho^r_i + delay^r_i \qquad (8a)$$

$$\forall i \in T, \quad \rho^w_i = \rho_i + C_i \qquad (8b)$$

**Absence of overlapping on the same core.** Equation (9) forbids the overlapping of two tasks when mapped on the same core by forcing one to execute after the other.

$$\forall (i, j) \in T \times T,\ i \neq j; \quad \rho^w_i + delay^w_i \leq \rho^r_j + M (1 - am_{i,j}) \qquad (9)$$

This constraint must be activated only if the two tasks are mapped on the same core. Thus, a nullification method is applied, using the big-M notation [15]. The value selected for the big-M constant is the makespan of a sequential schedule on one core, i.e. the sum of the tasks' WCETs (see equation (10)), which is the worst scenario that can arise.

This article was presented in the International Conference on Embedded Software (EMSOFT) 2017 and appears as part of the ESWEEK-TECS special issue.

\[ M = \sum_{i \in T} C_i \qquad (10) \]
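The effect of the big-M term can be illustrated with a small Python sketch. The helper `no_overlap_holds` and the WCET values are hypothetical; the sketch only evaluates inequality (9) for fixed values, which is enough to see that the constraint is binding when $am_{i,j} = 1$ and trivially satisfied when $am_{i,j} = 0$.

```python
def no_overlap_holds(end_i, start_j, am_ij, big_m):
    """Evaluate constraint (9): end_i <= start_j + M * (1 - am_ij).

    end_i stands for rho^w_i + delay^w_i and start_j for rho^r_j.
    """
    return end_i <= start_j + big_m * (1 - am_ij)


# Hypothetical WCETs; M is their sum, as in equation (10).
wcet = {"A": 5, "B": 7, "C": 3}
big_m = sum(wcet.values())

# am = 1: i must really finish before j starts.
active_ok = no_overlap_holds(end_i=4, start_j=4, am_ij=1, big_m=big_m)
active_violated = no_overlap_holds(end_i=12, start_j=4, am_ij=1, big_m=big_m)
# am = 0: the same dates are accepted, the constraint is nullified.
nullified = no_overlap_holds(end_i=12, start_j=4, am_ij=0, big_m=big_m)
```

The nullification only works as long as no date in the schedule exceeds the chosen constant, which is why the big-M value has to bound the worst scenario.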

**Data dependencies in the task graph.** Equation (11) enforces data dependencies by constraining each task to start after the completion of all its predecessors.

\[ \forall i \in T, \forall j \in predecessors(i); \quad \rho_j^w + delay_j^w \leq \rho_i^r \qquad (11) \]

**Computing communication phases interference.** The following equations implement, in linear form, the refinement of contention duration presented in Section 4.2. Function \( are\_conc(i,j) \) is used to exclude from the search space the tasks that never interfere with each other. Equations (12a) to (12d) implement function \( are\_OL \), derived from equation (3). For each pair of communication phases, the equations indicate whether the phases overlap in the schedule (\( ov^{XY}_{i,j} = 1 \), with \( XY \in \{rr, ww, rw, wr\} \)).

Note that when there is no data for the considered communication phase (\( \delta_i^r = 0 \) or \( \delta_i^w = 0 \)), no overlap is possible, and the corresponding \( ov^{XY}_{i,j} \) are constrained to be equal to 0.

\[ \forall i \in T, \forall j \in are\_conc(i,j); \]

\[
\begin{aligned}
interf_{i}^{r} & = \sum_{j \in T \mid are\_conc(i,j)} ov_{i,j}^{rr} + \sum_{j \in T \mid are\_conc(i,j)} ov_{i,j}^{rw} && (13a) \\
interf_{i}^{w} & = \sum_{j \in T \mid are\_conc(i,j)} ov_{i,j}^{wr} + \sum_{j \in T \mid are\_conc(i,j)} ov_{i,j}^{ww} && (13b)
\end{aligned}
\]

Finally, equation (14) contains two optimizations that constrain the overlapping variables, to improve the solving time.

\[ ov_{i,j}^{rr} = ov_{j,i}^{rr} \qquad ov_{i,j}^{wr} = ov_{j,i}^{rw} \qquad (14) \]

**Estimation of worst-case communication duration.** To estimate the time needed for the communication phases, the volume of data read or written by the read and write phases, respectively, is required \( (\delta_i^r, \delta_i^w) \). This volume of data must account for the task mapping, as no communication overhead should be charged when both the producer and the consumer are mapped on the same core \( (m_{i,j} = 1) \). This leads to equations (15a) and (15b), which sum the data read or written (constant \( D_{i,j} \), extracted from the application) depending on the mapping of the predecessors and successors of a task.
∀ \( i \in T \);

\[
\delta^r_i = \sum_{j \in \text{predecessors}(i)} D_{j,i} \cdot (1 - m_{j,i}) \quad (15a)
\]

\[
\delta^w_i = \sum_{j \in \text{successors}(i)} D_{i,j} \cdot (1 - m_{i,j}) \quad (15b)
\]
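A minimal Python sketch of equations (15a) and (15b) follows, assuming the mapping is already known. The function name `data_volumes`, the `core_of` dictionary and the small task graph are assumptions made for illustration; in the ILP itself the mapping is a decision variable, not an input.

```python
def data_volumes(task, preds, succs, D, core_of):
    """Equations (15a)/(15b): data that actually crosses the bus for `task`.

    D[(i, j)] is the amount of data sent by i to j; exchanges between two
    tasks mapped on the same core (m = 1) are not counted.
    """
    delta_r = sum(D[(j, task)] for j in preds[task]
                  if core_of[j] != core_of[task])
    delta_w = sum(D[(task, j)] for j in succs[task]
                  if core_of[j] != core_of[task])
    return delta_r, delta_w


# Hypothetical graph: A -> C, A -> D, B -> C.
preds = {"A": [], "B": [], "C": ["A", "B"], "D": ["A"]}
succs = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
D = {("A", "C"): 4, ("A", "D"): 4, ("B", "C"): 2}
core_of = {"A": 1, "B": 2, "C": 2, "D": 3}

delta_r_C, delta_w_C = data_volumes("C", preds, succs, D, core_of)
delta_r_A, delta_w_A = data_volumes("A", preds, succs, D, core_of)
```

Here the 2 data items sent by B to C are not charged to the read phase of C, because both tasks share core 2.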

The next group of equations, (16a)-(16d), encodes the round-robin bus arbitration policy of equations (1a)-(1d), as later refined in equation (5). For conciseness, we skip the equations related to the write phase, as they are equivalent to those for the read phase with some trivial substitutions.

Equation (16a) computes the number of full slots needed as in (1a), according to the volume of data effectively transmitted (\( \delta^r_i \)) and the transfer rate (\( D_{\text{slot}} \)). Equation (16b) determines the remaining time as in (1b), equation (16c) computes the number of waiting slots of \( T_{\text{slot}} \) duration as in (1c), and finally equation (16d) estimates the communication delay required by the read phase of \( i \) as in equation (5).

∀ \( i \in T \);

\[
\text{chunks}^r_i = \lfloor \delta^r_i / D_{\text{slot}} \rfloor \quad (16a)
\]

\[
\text{remainingTime}^r_i = (\delta^r_i \text{ mod } D_{\text{slot}}) \cdot (T_{\text{slot}} / D_{\text{slot}}) \quad (16b)
\]

\[
\text{waitingSlots}^r_i = \lceil \delta^r_i / D_{\text{slot}} \rceil \quad (16c)
\]

\[
\text{delay}^r_i = T_{\text{slot}} \cdot \text{waitingSlots}^r_i \cdot \text{interf}^r_i + T_{\text{slot}} \cdot \text{chunks}^r_i + \text{remainingTime}^r_i \quad (16d)
\]
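Equations (16a)-(16d) translate directly into a few lines of Python. In the sketch below, the function name `read_delay` and the parameter `slot_capacity` are assumptions: `slot_capacity` plays the role of \( D_{\text{slot}} \) and is interpreted as the amount of data that fits in one slot of duration `t_slot`.

```python
def read_delay(delta, interf, t_slot, slot_capacity):
    """Communication delay of a read phase, following (16a)-(16d).

    delta: volume of data to read; interf: number of interfering phases;
    t_slot: slot duration; slot_capacity: data transferred per slot
    (our reading of D_slot, an assumption of this sketch).
    """
    chunks = delta // slot_capacity                                    # (16a)
    remaining_time = (delta % slot_capacity) * t_slot / slot_capacity  # (16b)
    waiting_slots = -(-delta // slot_capacity)                         # (16c), ceiling
    return (t_slot * waiting_slots * interf                            # (16d)
            + t_slot * chunks
            + remaining_time)


alone = read_delay(delta=4, interf=0, t_slot=3, slot_capacity=3)
contended = read_delay(delta=4, interf=1, t_slot=3, slot_capacity=3)
```

With these inputs `alone` evaluates to 4.0 and `contended` to 10.0: one interfering phase costs two extra waiting slots of 3 time units each.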

The reader may note that equations (16a)-(16c) are not linear. They can however easily be linearized, without any loss of information, into the set of equations (17a)-(17c).

∀ \( i \in T \);

\[
\delta^r_i = \text{chunks}^r_i \cdot T_{\text{slot}} + \text{remainingTime}^r_i \quad (17a)
\]

\[
\delta^r_i = \text{waitingSlots}^r_i \cdot T_{\text{slot}} - {\text{unused}_{\text{rest}}}^r_i \quad (17b)
\]

\[
{\text{unused}_{\text{rest}}}^r_i \geq \text{remainingTime}^r_i \quad (17c)
\]

A remaining issue emerges in the cost model provided in equation (16d): this equation is quadratic and non-convex, because of the term \( \text{waitingSlots}^r_i \cdot \text{interf}^r_i \), whose two operands are variables of the problem as defined by equations (16c) and (13a). To model our problem as an instance of ILP, we choose to use a safe linear approximation of equation (16d), in which we substitute variable \( \text{waitingSlots}^r_i \) by a constant \( \text{WAIT}^r_i \) that may overestimate the number of waiting slots. We do so by considering the worst-case scenario in terms of transmitted data, that is, when all data exchanged between dependent tasks go through the shared bus, which happens for a read phase when all the predecessors of the task are mapped on different cores. \( \text{WAIT}^r_i \) is thus determined by the sum of all data read \( D_{j,i} \), as in equation (18).

∀ \( i \in T \);

\[
\text{WAIT}^r_i = \lceil (\sum_{j \in \text{predecessors}(i)} D_{j,i}) / T_{\text{slot}} \rceil \quad (18)
\]
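The safety of this substitution can be checked by brute force on a small case. The Python sketch below enumerates every same-core/remote combination of the predecessors of one task and verifies that the number of waiting slots never exceeds the constant of equation (18). The data volumes are hypothetical, and the sketch assumes, as equation (18) suggests, a unit transfer rate so that one slot carries \( T_{\text{slot}} \) data items.

```python
from itertools import product


def waiting_slots(delta, t_slot):
    """Number of waiting slots for `delta` data items (ceiling division)."""
    return -(-delta // t_slot)


def wait_bound(pred_data, t_slot):
    """Equation (18): every predecessor is assumed to be on another core."""
    return -(-sum(pred_data) // t_slot)


pred_data = [4, 2, 5]  # hypothetical D_{j,i} for the predecessors of task i
t_slot = 3
bound = wait_bound(pred_data, t_slot)

# flags[k] = 1 means predecessor k is remote, so its data crosses the bus.
per_mapping = [
    waiting_slots(sum(d for d, remote in zip(pred_data, flags) if remote), t_slot)
    for flags in product([0, 1], repeat=len(pred_data))
]
worst = max(per_mapping)
```

The bound is reached only when all predecessors are remote; for every other mapping the linearized cost is an over-estimate, which is the source of the pessimism discussed next.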

This over-approximation of communication costs induces the solver to map tasks in sequence on the same core or isolate communication phases to avoid interference.

### 5.2 Heuristic technique based on list scheduling

The basic idea of the proposed heuristic, based on forward list scheduling, is to order tasks from the task graph, and then to add each task one by one in the schedule without backtracking, while keeping the goal of minimizing the overall makespan of the schedule. In the following, tasks are
sorted in topological order. Task ordering is a topic on its own and will not be further discussed in this paper.

The method is sketched in algorithm 1. It takes the task graph as input, sorts the nodes to create the list (line 1), and then iterates as long as there exist tasks to schedule (lines 4-18). This heuristic uses an As Soon As Possible (ASAP) strategy when mapping a task: it tries to schedule the task as early as possible on every processor, and then selects the processor where the mapping minimizes the overall makespan (line 16).

As previously explained, the communication cost depends on task placement. Thus, after scheduling each task, the communication costs related to the newly scheduled task must be recomputed, and tasks must be moved on the time line of each involved core to ensure a valid schedule, i.e. a schedule accounting for all interference (lines 12 and 15). Moreover, the heuristic also enforces read/execute/write phases to be scheduled contiguously.

**Algorithm 1: Forward list scheduling**

**Input**: A task graph \( G = (T, E) \) and a set of processors \( P \)

**Output**: A schedule

1. **Qready** ← TopologicalSortNode(\( G \))
2. **Qdone** ← ∅
3. schedule ← ∅
4. **while** \( t \in Qready \) **do**
   5. **Qready** ← **Qready** \( \setminus \{ t \} \)
   6. **Qdone** ← **Qdone** ∪ \( \{ t \} \)
   7. /* tmpSched contains the best schedule for the current task */
   8. tmpSched ← ∅ with makespan = \( \infty \)
   9. **foreach** \( p \in P \) **do**
      10. /* Set \( t \) in \( copy_{eff} \) on \( p \) the earliest in the schedule */
      11. MapTaskEarliestStartTime(\( copy_{eff}, t, p \))
      12. AdjustSchedule(\( copy_{eff}, Qdone, t \))
      13. /* Set \( t \) in \( copy_{mutex} \) on \( p \) the earliest in mutual exclusion with others */
      14. MapTaskEarliestStartTime(\( copy_{mutex}, t, p \))
      15. AdjustSchedule(\( copy_{mutex}, Qdone, t \))
      16. tmpSched ← min_makespan(tmpSched, \( copy_{mutex} \), \( copy_{eff} \))
   **end**
17. schedule ← tmpSched
18. **end**
19. return schedule
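To make the overall control flow concrete, here is a deliberately simplified Python sketch of forward list scheduling with ASAP placement. It is not the heuristic of algorithm 1: communication phases, the two candidate schedules (`copy_eff` and `copy_mutex`) and the call to AdjustSchedule are all omitted, so only the "try every processor, keep the shortest makespan" skeleton remains. The function name `list_schedule` and the four-task graph are assumptions made for this example.

```python
from graphlib import TopologicalSorter


def list_schedule(wcet, preds, n_procs):
    """Plain forward list scheduling, ASAP, without communication costs.

    wcet: task -> execution time; preds: task -> list of predecessors.
    Returns task -> (processor, start, end).
    """
    order = list(TopologicalSorter(preds).static_order())
    free_at = [0] * n_procs   # date at which each processor becomes idle
    placed = {}
    for t in order:
        # A task cannot start before all its predecessors have finished.
        ready = max((placed[q][2] for q in preds.get(t, ())), default=0)
        best = None
        for p in range(n_procs):
            start = max(ready, free_at[p])
            end = start + wcet[t]
            makespan = max(end, *free_at)
            # Keep the placement with the smallest resulting makespan.
            if best is None or (makespan, end) < best[0]:
                best = ((makespan, end), p, start, end)
        _, p, start, end = best
        placed[t] = (p, start, end)
        free_at[p] = end
    return placed


# Hypothetical fork-join graph: A -> {B, C} -> D, on two processors.
wcet = {"A": 2, "B": 3, "C": 4, "D": 1}
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
placed = list_schedule(wcet, preds, n_procs=2)
makespan = max(end for _, _, end in placed.values())
```

As in algorithm 1, a task is never moved once placed by this loop; in the actual heuristic, the only later movements come from the schedule adjustment that accounts for interference.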

**Finding the best solution between overlapping and mutual exclusion.** In the ILP formulation, to minimize the overall makespan, the ILP solver has the opportunity to select, on a per-task basis, the best solution between two options: synchronize every communication phase (perform them in mutual exclusion) to obtain a contention-free schedule, or enable concurrency if it results in a shorter global schedule. A similar approach is used in the heuristic. Two schedules are computed: one allowing overlapping between concurrent tasks (lines 10-12) and the other one avoiding it (lines 13-15). Then the shortest of the two schedules is selected (line 16).

**Updating the schedule to cope with interference.** Each time a task is added, new interferences caused by the addition of the task in the schedule must be added, and the delay of some communication phases must be recomputed.

As an example, Figure 2a depicts a partial initial schedule where arrows depict causality between tasks, red dashes draw the writing delay $delay^w_A$, and green dots draw the reading delay $delay^r_C$. A task $A$ mapped on $P_1$ sends 4 data items to a task $C$ mapped on $P_2$, with $T_{slot} = 3$ and $D_{slot} = 1$; thus $delay^w_A = delay^r_C = 4$ (equation (5); in this situation neither phase suffers from any interference). Figure 2b sketches the addition of task $D$ on $P_3$, which reads 4 data items written by $A$. The new writing delay for task $A$ thus becomes $delay^w_A = 8$ (still no interference on task $A$). The first consequence is that task $B$ must be moved to guarantee the blocking communication restriction (Section 3.1). Second, the reading delay $delay^r_C$ must now be adjusted to account for the interference introduced by the read phase of $D$: it becomes $delay^r_C = 10$ (according to equation (5)), and task $C$ is delayed accordingly.
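The three delays of this example can be re-derived from the structure of equation (16d) (waiting time, full slots, remainder). The derivation below assumes that one slot of $T_{slot} = 3$ time units carries 3 data items, i.e. one item per time unit; this reading is ours and is chosen because it reproduces the values quoted above.

\[
\begin{aligned}
delay^w_A &= 3 \cdot \lfloor 4/3 \rfloor + (4 \bmod 3) = 3 + 1 = 4 && \text{(4 items, no interference)} \\
delay^w_A &= 3 \cdot \lfloor 8/3 \rfloor + (8 \bmod 3) = 6 + 2 = 8 && \text{(8 items, no interference)} \\
delay^r_C &= 3 \cdot \lceil 4/3 \rceil \cdot 1 + 3 \cdot \lfloor 4/3 \rfloor + (4 \bmod 3) = 6 + 3 + 1 = 10 && \text{(4 items, 1 interfering phase)}
\end{aligned}
\]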

Whenever a communicating task that was already mapped needs to be rescheduled/delayed, it may change the number of interfering tasks. The communication delay of all tasks impacted by this change must therefore be recomputed, since they may in turn also create interference. The partial communication delay calculation must therefore proceed iteratively until no task is impacted. Convergence is always reached since, at worst, every concurrent task will interfere.
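The iterative recomputation and its convergence can be sketched in Python on a reduced model. In this sketch (function name `settle_delays` and phase data are hypothetical), the start dates of the communication phases are kept fixed and only their durations grow with the number of overlapping phases, whereas the actual heuristic also shifts tasks. A unit transfer rate is assumed, so that one slot carries `t_slot` data items, and every phase is assumed to carry some data.

```python
def settle_delays(phases, t_slot):
    """Recompute delays until the interference counts are stable.

    phases: name -> (start_time, data_items), with data_items > 0.
    Returns name -> delay once a fixed point is reached.
    """
    def delay(data, interf):
        chunks, rest = divmod(data, t_slot)
        waiting = chunks + (1 if rest else 0)
        return t_slot * waiting * interf + t_slot * chunks + rest

    interf = {n: 0 for n in phases}
    while True:
        end = {n: s + delay(d, interf[n]) for n, (s, d) in phases.items()}
        # Two phases interfere when their time intervals intersect.
        updated = {
            n: sum(1 for o, (so, _) in phases.items()
                   if o != n and so < end[n] and s < end[o])
            for n, (s, _) in phases.items()
        }
        if updated == interf:
            return {n: end[n] - phases[n][0] for n in phases}
        interf = updated


# Reads of C and D start together; E starts later and is caught
# only once C and D have been lengthened by their mutual interference.
delays = settle_delays({"C": (8, 4), "D": (8, 4), "E": (13, 2)}, t_slot=3)
```

In this reduced model delays only ever grow, so the interference counts are non-decreasing and bounded by the number of other phases, which is the same argument as the one given above for convergence.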

To reduce the number of tasks impacted by each adjustment, algorithm 2 first computes the set of related tasks (line 1), i.e. the tasks that can be impacted by the addition of the current task. The set is constructed by looking into the schedule for the earliest scheduled predecessor of the current task, then it includes all tasks scheduled after this predecessor on all processors.

To propagate these changes, algorithm 2 recomputes the delay of each communication phase (line 2). Then, it remaps each task ASAP (line 3) with respect to the previous choice regarding synchronization (explained in the previous paragraph). Because earlier tasks may have moved on the processor time lines, lines 6-13 shift forward the tasks that either need to be re-synchronized or are overlapped by an earlier task mapped on the same processor. This process is then repeated until the length of the schedule becomes stable.

ALGORITHM 2: AdjustSchedule : Updating the schedule to cope with interference

Input : An incomplete schedule schedule, the list of already mapped tasks Qdone and the newly mapped task cur_task
Output : An updated schedule

1. related_set ← BuildRelated(schedule, Qdone, cur_task)
2. Compute read/write phases’ delay ∀t ∈ related_set
   /* WCET of read/write might have changed, maybe we should move backward/forward some tasks */
3. Remap task t ∈ related_set as early as possible according to previous decision regarding synchronization

4. while schedule.length is not stable do
   5. Compute read/write phases’ delay ∀t ∈ related_set
   6. foreach t ∈ related_set do
      7. foreach t’ ∈ related_set | start time of t’ > start time of t do
         8. if t and t’ are on the same processor ∧ t extends itself on t’ then
            9. Add delay to t’ to start after t
         10. else if t and t’ are on different processors ∧ are_OL(t,t’) ∧ is_synchronized(t’) then
            11. Add delay to re-synchronize t’
      12. end
   13. end
   14. end

**Precision of the estimation of communication costs.** Unlike the ILP formulation, the heuristic does not suffer from the aforementioned over-estimation: the communication cost is computed as accurately as possible using equation (5) from Section 4.2 and the effective amount of data according to the tasks’ placement. For conciseness, the functions computing the communication cost of the read and write phases (line 2 in algorithm 2) are not detailed, as they are a direct application of equations (1) refined with equation (5).

6 EXPERIMENTS

Experiments were conducted on real code in the form of a subset of the StreamIT benchmark suite [25], as well as on synthetic task graphs generated using the TGFF [11] graph generator.

Applications from the StreamIT benchmark suite are modeled using fork-join graphs and come with timing estimates for each task and the amount of data exchanged between them. Table 2 summarizes the benchmarks we used for our experiments. We were not able to use all the benchmarks and applications provided in the suite, due to difficulties when extracting information (task graph, WCET, etc.) or because some test cases are linear chains of tasks with no concurrency. For each benchmark, the table includes its number of tasks, the width of the graph (maximum number of tasks that may run in parallel) and the average number of bytes exchanged between pairs of tasks. All average values given in the rest of the paper are arithmetic means.

Task Graph For Free (TGFF) was used whenever a large number of task graphs was needed. It is first used to evaluate the quality of our heuristic against the ILP formulation. Due to the intrinsic complexity of solving our scheduling problem using ILP, this experiment requires small task graphs, so that the ILP is solved in reasonable time. TGFF was also used to test our heuristic technique on applications larger than the StreamIT benchmarks.

We generated two sets of task graphs: one with relatively small task graphs (referred to as STG), and another with bigger graphs (referred to as BTG). For both sets, we used the latest version of the TGFF task generation software to generate task graphs with chains of tasks of different lengths and widths, including both fork-join graphs and more complex structures (e.g. multi-DAGs). Their resulting parameters are presented in Table 3. The table includes, for both sets, the number of task graphs, their number of tasks, the width of the task graph, the range of WCET values for each task and the range of the amount of data exchanged, in bytes, between pairs of tasks. The TGFF parameters for STG (average and indicator of variability) are set in such a way that the average values for task WCETs and volume of data exchanged between task pairs correspond to the analogous average values for the StreamIT benchmarks.

Table 3. Task graph parameters for synthetic task graphs

<table>
<thead>
<tr>
<th>Name</th>
<th>#Task graphs</th>
<th>#Tasks &lt;min, max, avg&gt;</th>
<th>Width &lt;min, max, avg&gt;</th>
<th>WCET</th>
<th>Amount of bytes exchanged</th>
</tr>
</thead>
<tbody>
<tr>
<td>STG</td>
<td>200</td>
<td>3, 34, 14</td>
<td>1, 11, 3</td>
<td>[1; 70]</td>
<td>[0; 11]</td>
</tr>
<tr>
<td>BTG</td>
<td>1000</td>
<td>9, 687, 228</td>
<td>1, 21, 8</td>
<td>[8; 999]</td>
<td>[0; 70]</td>
</tr>
</tbody>
</table>

All reported experiments have been conducted on several nodes of a heterogeneous computing grid with 138 computing nodes (1700 cores). In all experiments, the value of $T_{slot}$ is specified, and a transfer rate of one data word (32 bits) per time unit is used.

### 6.1 Scalability of the ILP formulation

Solving an ILP problem for a mapping/scheduling problem on multi-cores is known to be NP-hard [5]. Thus, the running time of our ILP formulation is expected to explode as the number of tasks grows. To evaluate the scalability of the ILP formulation with the number of tasks, a large number of different configurations is needed, which explains why we used synthetic task graphs for the evaluation. For each task graph in set STG, we vary the number of cores in interval [2; 15] and vary $T_{slot}$ in interval [1; 10]. With those varying parameters, the total number of scheduling problems to solve is \(200 \cdot 10 \cdot 14 = 28000\). The ILP solver used is CPLEX v12.6.0\(^4\) with a timeout of 11 hours.

---

The reader may notice that the WCET average value is not perfectly in the middle of the min and max values. This is due to the generation of random numbers in TGFF (pseudo-random, not perfectly random) combined with the limited number of values generated.

---

![Fig. 3. Scalability of ILP formulation (synthetic task graphs / STG)](image)

Figure 3 plots the average solving time against the number of tasks in each graph. As expected, when the number of tasks grows, the average solving time explodes, which motivates the need for a heuristic that produces schedules much faster. Similar observations were made on the StreamIT benchmarks, for which an exact solution was found for only 17 out of the 23 benchmarks.

### 6.2 Quality of the heuristic compared to ILP

The following experiments aim at estimating the gap between the makespans of the schedules generated by the heuristic (see Section 5.2) and those of the solutions found by the ILP formulation. We expect this gap to be small. To perform the experiments, we used the 200 task graphs from the STG task set with the same parameter variations as previously: number of cores \(\in [2; 15]\) and \(T_{slot} \in [1; 10]\).

The heuristic is implemented in C++ and CPLEX was configured with a timeout of 11 hours.

Table 4. Degradation of the heuristic compared with the ILP (synthetic task graphs / STG)

<table>
<thead>
<tr>
<th>% of exact results (ILP only)</th>
<th>degradation &lt;min,max,avg&gt; %</th>
</tr>
</thead>
<tbody>
<tr>
<td>98%</td>
<td>-8%, 43%, 5%</td>
</tr>
</tbody>
</table>

Table 4 summarizes the results. The first column of Table 4 presents the percentage of exact results the ILP solver is able to find in the granted time. We only use the exact solutions for the comparison, as the feasible ones (i.e. not exact) might bias the conclusion on the quality of the heuristic compared to the ILP. The next column presents the minimum, maximum and average degradation in percent, computed on makespans with the formula \((\text{heuristic} - \text{ILP})/\text{ILP}\). Positive values mean a degradation of the heuristic against the ILP formulation, while negative values show an improvement, which is due to the over-approximation of the communication delay in the ILP formulation (see Section 5.1).

As we can observe, the average degradation is low, which means our heuristic has acceptable quality. A deeper analysis of the distribution of the degradation, not included for space considerations, shows that 80% of the heuristic schedules are less than 10% worse than the ILP formulation solutions. We also observed a schedule generation time far lower than that of the ILP: less than 1 second on average, with an observed maximum of 2 seconds. Solving time depends on the number of re-adjustments the heuristic must perform to cope with the effective amount of interference.
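For clarity, the degradation metric and its &lt;min, max, avg&gt; summary can be written as the short Python sketch below. The makespan pairs are hypothetical values chosen for illustration and are unrelated to the measurements reported in Table 4.

```python
def degradation(heuristic, ilp):
    """Degradation in percent: (heuristic - ILP) / ILP."""
    return 100.0 * (heuristic - ilp) / ilp


# Hypothetical (heuristic makespan, ILP makespan) pairs, illustration only.
pairs = [(110, 100), (95, 100), (120, 100)]
values = [degradation(h, i) for h, i in pairs]
summary = (min(values), max(values), sum(values) / len(values))
```

A negative entry, such as the second pair here, corresponds to a case where the heuristic schedule is shorter than the one returned by the ILP.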

\(^4\)https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

Task sorting has a significant impact on the heuristic output. We chose a topological order with random tie breaking, as stated in Section 5.2. This rather simple choice of sorting algorithm explains the under-performance of 43% in the worst case. We did not include a comparative study of sorting algorithms, which we consider out of the scope of this paper.

### 6.3 Quality of the heuristic compared to basic contention analysis

We estimate the gain obtained when using our method to tighten communication delays over the same heuristic using the pessimistic estimation of interference from equation (1). The higher the gain, the tighter the proposed estimation of communication delays. The experiment was performed on the StreamIT benchmarks. The target architecture configuration includes 15 cores, and a value of $T_{slot} = 3$ is used as in [18].

![Fig. 4. Gain in % obtained by precise contention analysis (heuristic, StreamIT benchmarks)](image)

Results are depicted in Figure 4 by blue bars. The gain is computed using equation (19).

\[
gain = \frac{\text{worst concurrence} - \text{accurate interference}}{\text{worst concurrence}} \qquad (19)
\]

Results show that using the accurate degree of interference decreases the overall makespan by 19% on average compared to worst-case contention, demonstrating the benefits of precisely computing the degree of interference at schedule time.

### 6.4 Quality of the heuristic compared to synchronized communication

Recent papers [1, 2, 21] suggested building contention-free schedules to nullify the interference cost. Due to the different task models and system models in the aforementioned works, a direct comparison with them is hard to achieve. Thus, we modified our heuristic to produce a schedule without any contention, staying as close as possible to the ideas advocated in those papers. The gain of a contention-free heuristic against a worst-contention one is depicted for the StreamIT benchmarks in Figure 4 by red bars.

Among the contention-aware and contention-free variants of our heuristic, neither outperforms the other on all benchmarks. Moreover, the difference between the schedule makespans of the two variants is very small: 0.08% on average, with a worst value of 0.3%. Synchronized execution (red bars) gives better results for fft3, fft5, filterbank, fm, hdtv, mp3, tconvolve and vocoder; our proposed heuristic (blue bars) gives better results for audiobeam, beamformer and mpd; the results for all other benchmarks are identical. Regarding schedule generation time for the StreamIT benchmarks, contention-free solutions are found in less than 30 seconds on average, while contention-aware ones need less than 3 minutes on average. The shortest schedule generation times were, as expected, observed when generating contention-free schedules, because no estimation of interference costs has to be performed at all. We believe that our contention-aware scheduling heuristic will be better suited to task models in which communications are not separated from computations, i.e. non-PREM task sets. Quantitative evaluation of the obtained benefit for such a task model is left for future work.

### 6.5 Impact of \( T_{\text{slot}} \) on the schedule

With our heuristic, we finally studied the influence of the duration \( T_{\text{slot}} \) on the overall makespan, assuming a negligible overhead when switching between slots. We chose to fall back on synthetic task graphs to benefit from a wider range of test cases; here the BTG task set is employed. For each graph, we generated three versions of the same topology with different amounts of data exchanged between tasks, to study the influence of the duration \( T_{\text{slot}} \) on graphs that exchange a small amount of data \([0; 5]\), a reasonable amount of data \([5; 15]\) and a large amount of data \([15; 70]\). The duration \( T_{\text{slot}} \) is in the range \([1; 40]\), as it covers all scenarios in which data is exchanged in one or several chunks.

![Fig. 5. Average makespan when varying \( T_{\text{slot}} \) (synthetic task graphs / BTG)](image)

To compute the results of this experiment, we set a timeout of 1 hour, which left us with 75.6% of the initial number of task graphs. Results are presented in Figure 5, where the three curves correspond to the average makespan of each category as a function of \( T_{\text{slot}} \). We observe that \( T_{\text{slot}} \) has very little impact on task graphs with few communications (crossed line), whereas it has an impact on task graphs exchanging larger amounts of data (continuous line). These results confirm that it is preferable to keep \( T_{\text{slot}} \) small, to reduce the waiting time between slots even if several chunks are needed. This allows small packets of data to be handled faster when in competition with bigger packets.

7 CONCLUSION

In this work, we show how to take advantage of the structure of a parallel application, along with its target hardware platform, to obtain tight estimates of contention delays. Our approach builds on a precise model of the cost of bus contention for a round-robin bus arbitration policy, which we use to define two scheduling and mapping strategies. Our experimental results show that, compared to a scenario where we account for worst case contention, our approach improves the schedule makespan by 19% on average.

One limitation of our approach is its restriction to blocking communications. A natural extension of this work is therefore to relax this constraint and introduce support for asynchronous communications, which are notoriously more challenging to support in a real-time context. Another possible research direction is to further refine the contention model, by more accurately capturing the actual duration of contention phases between communicating tasks. Extending the approach to architectures with local caches is another direction for future research.

ACKNOWLEDGMENT
This work was partially supported by ARGO (http://www.argo-project.eu/), funded by the European Commission under Horizon 2020 Research and Innovation Action, Grant Agreement Number 688131.

REFERENCES




