1 Introduction

Recent deep-submicron semiconductor technology has made it possible to integrate hundreds or thousands of circuit modules (i.e. nodes) into a single VLSI chip [1]. Network-on-chip (NoC) technology is foreseeable to play a key role in the implementation of future highly-integrated System-on-Chip (SoC) devices. In NoC, nodes and an interconnection network are integrated into a single chip, and data communications between nodes are done by transmitting packets on the network. Using global interconnection structure reduces the difficulty of wiring design and the latency of signal transmission [12]. The most popular NoC topology is two-dimensional (2D) mesh network which offers a simple and regular structure and a small wire length between nodes. Examples of commercial mesh-based NoCs include Intel SCC[13] and TILERA Tile-Gx[11].

One of the most important and fundamental issues that must be addressed for NoCs is fault tolerance. Defects during fabrication and failure occurrence in runtime are inevitable, and it is almost impossible to remove their effects from the system completely. In NoCs, packet transmission can provide fault tolerance capability. Instead of regarding the whole system as faulty, it is cost effective to equip a fault-tolerant routing control to transmit packets avoiding defective/faulty areas.

Designing a routing algorithm that is both fault-tolerant and deadlock-free is a hard problem. Due to the defects/faults, regular 2D mesh networks become irregular ones, which makes packet routing complex. A deadlock occurs when some packets cannot proceed toward their destinations because communication resources (i.e. channels) required by them are not available. Once the deadlock occurs, all packets involved in the deadlock are blocked forever. Therefore, routing algorithm must ensure deadlock freeness, while providing fault-tolerance capability.

One way to design a deadlock-free routing algorithm is to find acyclic paths among data communications using graph theory such as [4], and to route packets by a source routing algorithm [7] or a look-up table (LUT)-based routing algorithm [9]. However, the assumption of using global information such as the status of all nodes and existing routing paths for establishing a new routing path is impractical for NoC purpose.

Ni et al. [6] proposed (N − 1)-fault-tolerant routing algorithm for N-D mesh networks based on turn model [5]. Although this algorithm does not require global information for routing, only one faulty node can be tolerated in 2D mesh. Wu [14] proposed a fault-tolerant X-Y routing algorithm based on Odd-Even turn model [3]. However, this algorithm requires two virtual channels (VCs) to guarantee the deadlock freeness for limited fault tolerance capability; it is known that this algorithm can not tolerate faulty nodes on the edge of 2D mesh.

To tolerate multiple faults of arbitrary distribution, a region-based approach has been proposed [2, 8, 14]. In this approach, a set of rectangular faulty regions called faulty blocks is formed for faulty nodes, and nonfaulty (i.e. healthy) nodes around each faulty block form a ring-shaped detour path called fault ring. Then, packets which encounter a faulty block avoid the faulty block following the detour rule of the fault ring. Routing algorithms on this approach do not require any additional hardware such as LUT or VC to ensure the deadlock freeness. This is a desirable feature for large scale NoCs, where a router must be designed in a compact circuit for a small hardware overhead.

Chen et al. [2] proposed a region-based fault-tolerant routing algorithm called Message-Route, where a message information is stored into each packet to control the routing. Some overlooked misrouting conditions in Message-Route have been recently corrected by Holsmark et al. [8]. To the best of our knowledge, in the region-based approach, only Message-Route is designed to tolerate multiple faulty nodes without restrictions on the number of tolerable faulty nodes and its distribution. However, this algorithm has three crucial problems; (1) this algorithm fails to provide complete and deadlock-free routing (i.e., misrouting and deadlock occur in some cases), (2) many healthy nodes are contained in the faulty blocks and thus deactivated, and (3) complex routing functions are not feasible for hardware implementation.

In this paper, we give some counter-examples to show that Message-Route is not complete and deadlock-free; packets may not reach the destinations and deadlock may occur. By correcting those errors, we make Message-Route complete and deadlock-free. Then, we propose a novel fault-tolerant routing algorithm based on Message-Route. Unlike the previous algorithms, our algorithm introduces the new function of “router node” to form the minimal rectangular faulty blocks, which significantly reduces the number of nodes to be deactivated. The deadlock freeness and completeness are fully analyzed. Experimental results show that the proposed algorithm can enhance node availability and reduce communication delay in comparison with Message-Route. We also implement the proposed routing algorithm on Xillinx Virtex-5 FPGA to show the feasibility of the hardware implementation.

The rest of this paper is organized as follows. Section 2 introduces general architecture of 2D mesh NoCs (2D-MNoCs) and several notations used throughout this paper. Section 3 corrects the errors of Message-Route and gives proof of its deadlock freeness. Section 4 describes the proposed routing algorithm with proof of the deadlock freeness. Section 5 compares the routing performance of the proposed algorithm with the traditional routing algorithm via computer simulation, and discusses the hardware performance. Finally, Section 6 concludes this paper.

2 NoC Architecture and Fault-tolerant Routing

2.1 Two-Dimensional Mesh NoC Architecture

A general n × m two-dimensional mesh network-on-chip (2D-MNoC) consists of n and m nodes in row (horizontal) and in column (vertical) directions, respectively. Each node is composed of a processing element (PE) and a router module, as shown in Fig. 1.

Fig. 1
figure 1

A general n × m 2D-mesh NoC architecture

Let \(\mathcal {N}=\{1,2,\dots ,n\}\) and \(\mathcal {M}=\{1,2,\dots ,m\}\) denote the set of row- and column-coordinates, respectively, and (i,j) be the node on the i-th row and the j-th column, where \(i\in \mathcal {N}\), and \(j\in \mathcal {M}\). (i,j) is connected with neighbor nodes (\(i^{\prime }\),\(j^{\prime }\)) via two unidirectional links if and only if \(i^{\prime } = i\pm 1\) and \(j^{\prime }=j\), or if \(i^{\prime }=i\) and \(j^{\prime }= j\pm 1\).

The data communications between two nodes are done by transmitting packets on the network of routers. Here, we define router functions. The four directions of neighbor nodes are labeled as east, west, north, and south, respectively. PassEW is defined as the router function that packets sent from East neighbor node are sent to West neighbor node. NE turn is defined as the router function that packets sent from North neighbor node are sent to East neighbor node changing the direction 90-degree on the router. In the same way, all possible router functions are defined as described in Fig. 2.

Fig. 2
figure 2

Possible router functions

In this paper, we focus on 2D-MNoCs constructed with homogeneous nodes. We follow the common assumption in designing routing algorithm [2, 8]; only node can be faulty and all links are faulty-free. This assumption does not lose generality, because a link failure can be treated as the failure of neighboring two nodes connected with the link.

2.2 Deadlock

A deadlock occurs when some packets cannot proceed toward their destinations because communication resources (i.e. channels in a link) required by them are not available. All packets involved in a deadlock are blocked forever. Let S, C and D denote a source node, a current node, and a destination node, respectively. As an example, consider a simple deadlock scenario illustrated in Fig. 3. In the scenario, \(S_{1}\) sends a packet to \(D_{1}\) through (1,1) and \(S_{2}\), \(S_{3}\) and \(S_{4}\) also send a packet in the same way, where \(S_{i}\) and \(D_{i}\) indicate the source and the destination nodes of routing path i. In the wormhole switching technology usually employed in NoCs, a packet is divided into several flits and they are transmitted in a pipeline fashion. When the packet from \(S_{1}\) to \(D_{1}\) is blocked at (1,1), the packet from \(S_{4}\) to \(D_{4}\) will be blocked at (1,2) because the output channel in the south link of (1,2) is occupied by the packet from \(S_{1}\) to \(D_{1}\). Similarly, the packets from \(S_{3}\) to \(D_{3}\) and from \(S_{2}\) and \(D_{2}\) will be blocked at (2,2) and (2,1), respectively. As a result, no packet can reach to its destination node eternally in such a circular wait of packets.

Fig. 3
figure 3

A simple deadlock example

Some deadlock recovery methods are available to cope with the deadlock, such as a method that discards some blocked packets and retransmits them with the help of a deadlock detection function. However, they are not feasible for NoCs because of the requirements for low latency and low hardware overhead. Hence, routing algorithm must guarantee that the deadlock never occurs for any communication patterns.

The following Theorem 1 is known to guarantee the deadlock freeness, which is also used in the latter section in this paper.

Theorem 1

Assume that (i,j), (i,\(j{+}s\)), (\(i^{\prime }\),\(j^{\prime }\)), and (\(i^{\prime }\),\(j^{\prime }{+}s\)) allow EN, NW, SW, and ES turns, respectively, where \(i, i^{\prime }\in \mathcal {N}\), \(j, j^{\prime }\in \mathcal {M}\), \(s,s^{\prime }>1\), and \(j<j^{\prime }\). As long as 180-degree turns are prohibited, a routing algorithm is deadlock-free, if the following two conditions are satisfied: (1) there is no routing path in the i-th row from (i,j) to (i,\(j{+}s\)), (2) there is no routing path in the \(i^{\prime }\)-th row from (\(i^{\prime }\),\(j^{\prime }{+}s^{\prime }\)) to (\(i^{\prime }\),\(j^{\prime }\)).

For the proof of the theorem, see [8].

2.3 Message-based Fault-Tolerant Routing Control Algorithm

This section describes the traditional region-based fault-tolerant routing algorithm with message propagation technique [2, 8], i.e. Message-Route. This algorithm mainly consists of two phases: (1) detour generation phase, and (2) routing control phase.

2.3.1 Detour Generation Phase

In this phase, to avoid creating complex detour path, faulty regions are converted into rectangular faulty blocks by deactivating healthy nodes in the following manner.

Definition 1

A node deactivates itself if there are two faulty or deactivated neighbor nodes.

After the node deactivation process, all deactivated nodes are treated as faulty. To increase node availability, in Message-Route, deactivated nodes that have at least one healthy neighbor node can be reactivated by the healthy neighbor node. Such reactivated nodes are referred to as unsafe nodes. Each unsafe node will communicate with other nodes through the neighboring healthy node.

After completion of the above processes, the set of healthy nodes surrounding a single faulty block forms a detour ring path called fault ring. For ease of explanation, a fault ring is divided into two parts: border and corner. A node in a fault ring is on the border if one neighbor node is in the faulty block; otherwise, it is on the corner. The northeast node of the fault ring is called reference node. When faulty blocks touch north and east edge of NoC, (m+1,\(\ast \)) and (1,\(\ast \)) are set as the reference nodes, respectively, where \(\ast \) represents don’t care case. Figure 4a shows an example of the node deactivation. In this example, (6,5) is the northeast corner node and also the reference node of the fault ring; (6,4) is the east border node of the fault ring.

Fig. 4
figure 4

Examples of a node deactivation and b three types of fault rings

To support edge node failures, fault rings are also classified into three types as follows;

Definition 2

When a fault ring touches only south edge of NoC, the ring is referred to as s-chain. When a fault ring touches west edge or west and south edge of NoC, the ring is referred to as f-chain. When a fault ring is neither s-chain nor f-chain, the ring is referred to as f-string.

Figure 4b shows an example of the three fault rings. Every node on the fault ring stores ring type, reference node address, and both clockwise and counter-clockwise detour directions. See [2] for details.

2.3.2 Routing Control Phase

In this phase, each healthy node is allowed to send packets, and the proper transfer direction for each incoming packet toward the destination node is determined based on the built-in routing algorithm. The schematic process diagram of the routing control in Message-Route is depicted in Fig. 5a.

Fig. 5
figure 5

Process diagrams of a routing control algorithm in message-route and b message update module

To simplify the routing control, each packet carries one of the three message types: Row-First (RF), Column-First (CF) and Row-Only (RO). RF message restricts the packet transfer to the westward. Similarly, CF message restricts the packet transfer to the northward (southward) when the destination is to the northward (southward). RO message restricts the packet transfer to the eastward while keeping the same column address as the destination node. The message type is updated in the following manner.

  • Initially, the message type of each packet is determined by the addresses of source (S) and destination (D). If D is to the west of S, it is determined as RF. If S and D are on the same column, it is RO. Otherwise it is CF.

  • At every intermediate router between S and D, the message is updated if and only if either one of the following two cases is satisfied: (1) When current node (C) and D are on the same row, RF is updated to CF, and (2) when C and D are in the same column, CF is updated to RO.

The process diagram of message initialization/update is described in Fig. 5b. Packets with RF and RO are basically sent to the west and east neighbor nodes, respectively, and packet with CF is sent to the north (south) when D is to the north (south) of C.

3 Correction of Message-Route Algorithm

In this section, we give counter-examples to show some cases in Message-Route where packets may not reach the destinations and deadlock may occur. Then, we correct the errors of Message-Route to make it complete and deadlock-free.

3.1 Correction of Wrong Ring Selections

From the node deactivation rule in Definition 1, at most two fault rings may overlap on a single node. In Message-Route, when the node is on two fault rings, the ring selection module called Overlapped-Ring-Chain-Route selects one fault ring for each incoming packet. The entire ring selection process is shown in Fig. 6. In the process, the following four cases are overlooked and wrong fault rings may be selected, resulting in incomplete routing.

  1. Case i)

    A node on overlapped fault rings sends packets: As pointed out in Fig. 6, the ring selection process has an undefined state. This state corresponds to the case when S is on two fault rings (i.e., no neighbor send packet to the node). Clearly, the routing algorithm fails to determine next route directions because no ring is selected. To correct this error, we modify the ring selection module to remove the undefined state.

  2. Case ii)

    f-string versus f-chain: Consider that an f-chain overlaps an f-string as described in Fig. 7a. In this case, when S sends a packet to D, the ring selection module selects the f-string at C. Since the packet has CF message at C, and D is to the south of the reference node of the f-string, the packet will be sent to \(C^{\prime }\) along the f-string. At \(C^{\prime }\), since the message is updated to RO, the packet will be sent to\(D^{\prime }\). Finally, due to the one way message update constraint, the packet never reaches D.

  3. Case iii)

    f-chain versus s-chain: Consider that an f-chain overlaps an s-chain as described in Fig. 7b. In this case, when S sends a packet to D, the ring selection module selects the s-chain at C. Since the packet has CF message at C, the packet will be sent to \(C^{\prime }\) along the s-chain. At \(C^{\prime }\), since the message is updated to RO, the packet will be sent to D. Finally, due to the one way message update constraint, the packet never reaches D.

Fig. 6
figure 6

A ring selection module in message-route [8]

Fig. 7
figure 7

Possible four wrong ring selections

To correct errors in Cases ii and iii, we modify the ring selection module to select the westward fault ring, when D is to the westward of C.

  1. Case iv)

    s-chain versus f-string: Consider that an s-chain overlaps an f-string. This case is further divided into the following two sub-cases.

  2. Case iv-1)

    s-chain is the westward fault ring: In this case (See Fig. 7c), when S sends a packet to D, the packet has RO message. Following the previously selected ring (i.e. s-chain at C), the ring selection module also selects the s-chain at \(C^{\prime }\). At \(C^{\prime }\), since the east neighbor node is faulty, the packet will be sent to \(D^{\prime }\) along the s-chain. However, since the routing algorithm does not allow the packet with RO message to be sent to neither the east neighbor node or the north neighbor node at \(D^{\prime }\), the packet never reaches to D.

  3. Case iv-2)

    s-chain is the eastward fault ring: In this case (See Fig. 7d), when S sends a packet to D, the packet has RO message. Following the previously selected ring (i.e. f-string at C), the ring selection module also selects the f-string at \(C^{\prime }\). At \(C^{\prime }\), since the east neighbor node is faulty, the packet will be forwarded along the f-string until it reaches the same column as D, then reaches (2,2). Since the east neighbor of (2,2) is also faulty node, the packet will be forwarded along the f-string, again. As a result, the packet never reaches D.

To correct errors in Cases \(i\hspace {-.1em}v\)-1 and \(i\hspace {-.1em}v\)-2, we modify the ring selection module to select the eastward fault ring when packets reach the same column as D. The modified ring selection module is described in Fig. 8.

Fig. 8
figure 8

Revised ring selection module

3.2 Correction of Deadlock Cases

Holsmark et al. [8] provides the proof of deadlock freeness for Message-Route based on Theorem 1. As an exceptional case, ES and NW turns may connect to SW and EN turns respectively on the same f-chain. The proof claims that such connected turns do not lead deadlock. It is true when the f-chain does not overlap other fault rings. However, when the f-chain overlaps fault rings, we found deadlock might occur.

Figure 9 illustrates possible two counter-examples for the proof. In Fig. 9a, the f-chain overlaps one f-string and an acyclic path is created at the north (south) of the f-chain. When an another acyclic path exists at the south (north) of the f-chain, the f-chain can connect the two paths that form a cyclic path leading to deadlock. Similarly, in Fig. 9b, the f-chain overlaps one s-chain and an acyclic path is created at the south of the f-chain. and an another acyclic path exists at the north of the f-chain, the f-chain can connect the two paths that form a cyclic path leading to deadlock.

Fig. 9
figure 9

Possible two deadlock cases: a an f-chain overlaps an f-string, and b an f-chain overlaps an s-chain

To solve the former case, we make a modification to the original routing behavior on the f-chain. As illustrated in Fig. 10a, by prohibiting both WS turn on the south border and PassNS on the southeast corner of the f-chain as long as the west neighbor node is available, the cyclic path shown in Fig. 9a is never created. For the latter case, we also modify the original routing behavior of RO message. As illustrated in Fig. 10b, by prohibiting EN turn on the west border of the s-chain that overlaps the f-chain, the cyclic path shown in Fig. 9b is never created. However, when the s-chain does not overlap the f-chain, EN turn should be allowed on the west border of the s-chain to send packets with RO to the east of the s-chain. In order to apply the above modification, we newly add the following state of the healthy node called sp-mode.

Fig. 10
figure 10

The modification for the case where the f-chain overlaps two fault rings

Definition 3

A healthy node becomes sp-mode and stores the own row address as reference row address, if the node is on either (1) the node is on both the south border of f-chain and the northwest corner of s-chain, or (2) both the southeast corner of f-chain and the west border of s-chain. The node under sp-mode propagates the reference row address to the south and the west neighbor nodes. When a node receives reference row address from the neighbor, the node becomes sp-mode even if the node does not satisfy the above condition.

Under the sp-mode, the reference row address allows nodes to judge whether each packet can make EN turn on the s-chain. Thus, in our modified routing algorithm, if a packet has RO message and the destination node is to the east of the reference row address, the packet will be sent northward based on the routing behavior for CF message until the packet reaches a node of being not sp-mode.

3.3 Packet Reachability

Here, we discuss the packet reachability; packets sent from any healthy (not deactivated) nodes are reachable to their destinations. In Message-Route, the packet reachability is guaranteed based on the one way message update constraint (see [2, 8] for details). This is true only when there is no overlapped fault ring. Since we have modified the ring selection module, we give a proof of packet reachability by seeing the routing behavior for each message as follows.

RF message

When a packet has CF message, the packet is sent to the west neighbor node until it reaches the same row coordinate as its destination node. When the west neighbor node is in a faulty block, the routing algorithm determines a proper detour direction by selecting the westward fault ring. After the packet detours the fault ring, it is sent to the west neighbor node. Thus, any packet with RF is reachable to the same row coordinate as its destination node.

CF message

When a packet has RF message, the packet is sent to either the north or south neighbor node having the same column coordinate as the destination node. Since the packet will never have RF message, the destination node is assumed not to be in the west of the current node. So, when the packet with CF is to the east of the destination node during it detours f-chain, the previous wrong ring selection (see Fig. 7a and b) makes the packet unreachable. In contrast, since the revised ring selection module selects f-chain; when the packet is to the east of the the destination node, the routing behavior on f-chain works as one of RF. For other case, when the packet is to the north (south) of the destination, the ring selection module selects the northward (southward) fault ring as the previous ring selection module does. Thus, any packet with CF is reachable to the node which is not westward of and the same as the destination node.

RO message

When a packet has RO message, the packet is sent to the eastward destination node by maintaining its column coordinate to be the same as the destination node. In Message-Route, to maintain column coordinate, the packet carries the ring information selected in the previous node (i.e previous ring information). When the packet is not in the same column as the destination and has previous ring information, the packet is sent along the previous selected ring to be in the same column as the destination. On the other hand, when the packet is in the same column as the destination, the ring selecting module selects the eastward fault ring to send the packet to the eastward. Thus, any packet with RO is reachable to the destination node.

From the above, the new ring selection module provides proper ring selection to deliver packets to their destination node, data communication between any available nodes are successfully performed as far as their routing paths are available.

3.4 Deadlock Freeness

Theorem 2

With our modifications, Message-Route algorithm becomes deadlock-free.

Proof

We restrict our proof to the case of ring overlapping, because the original proof [8] is correct for other case. First of all, we consider the special case in which an f-chain overlaps other fault rings. As discussed in the correction of deadlock cases (Fig. 9), connected ES and SW turns on an f-chain and connected EN and NW turns on the same f-chain can connect both the northward and the southward acyclic paths to form a circular waiting path. Since this problem is already solved by our modifications, those connected turns on f-chains do not lead deadlock. For the following other cases, our proof is also based on Theorem 1 as done in [2, 8].

First, we show that ES turn does not connect SW turn. For intuitive understanding, all possible ES and SW turns are summarized in Fig. 11a and b, respectively. According to the node deactivation rule (Definition 1), at most two fault rings overlap on the single node, and each corner node can overlap a corner and a border of other fault ring. For ES turn on s-chain made by a packet with either CF or RO message, no SW turn can be on the east border of the same s-chain even if the s-chain overlaps other fault ring. For ES turn on f-string made by a packet with CR message, basically it does not connect SW turn by the specific routing behavior shown in Fig. 11c. Therefore, ES turn does not connect SW turn.

Fig. 11
figure 11

Schematic illustration of possible a ES turns and b SW turns in message-route, and c a key solution not to connect ES turn on f-string to any SW turns

Second, we show that EN turn does not connect NW turn. Similarly, all possible EN and NW turns are summarized in Fig. 12a and b, respectively. For EN turn on f-string made by a packet with RO message, since NW turn on the northeast corner is prohibited even if the f-string overlaps other fault rings, the EN turn does not connect NW turns. For EN turn at s-chain made by a packet with RO message, basically it does not connect NW turn by the specific routing behavior shown in Fig. 12c. Hence, EN turn does not connect NW turn.

Fig. 12
figure 12

Schematic illustration of possible a EN turns and b NW turns in message-route, and c a key solution not to connect EN turn on s-chain to any NW turns

From the above, based on Theorem 1, the modified Message-Route is deadlock-free. □

4 Proposed Region-based Fault-Tolerant Routing Algorithm

In the previous section, we modified Message-Route to assure the packet reachability and deadlock freeness. However, the node deactivation method used in Message-Route requires many healthy nodes to be deactivated, and the related complex message and ring propagation functions are not feasible for hardware implementation. In addition, no ring management mechanism to provide simple data access is available. In this section, we propose a new fault-tolerant routing algorithm extending Message-Route to solve above problems.

4.1 Minimal Rectangular Faulty Block

Instead of the previous node deactivation method in Definition 1, faulty regions are converted into minimal rectangular faulty blocks in the following manner.

Definition 4

A node deactivates itself if there is at least one faulty or deactivated neighbor node in both row and column.

After completion of the node deactivation, fault rings are constructed around each faulty block as does in Message-Route. Notice that since at most four fault rings may overlap on a single healthy node in the new node deactivation method, Message-Route cannot provide deadlock-free routing control for the new node deactivation method. Additionally, to provide deadlock-free routing control in the proposed algorithm, we introduce a key node function called router node, in which each node deactivates own processor element and just works as a router. A node becomes a router node in the following manner.

Definition 5

A node becomes a router node if the node is on both the east border and the west border of two f-strings, and the reference node of the eastward f-string is to the north of the westward f-string.

In the proposed algorithm, to enhance node availability, deactivated nodes can be reactivated to be unsafe nodes by healthy neighbor nodes. Figure 13 illustrates an example of node deactivation.

Fig. 13
figure 13

Node deactivation examples: a traditional method used in [8] could save 22 % healthy nodes, and b the proposed minimal rectangle node deactivation with router node could save 96 % healthy nodes

4.2 Routing Control Algorithm

Since the new node deactivation method also forms rectangular faulty blocks, the specific routing behaviors for each single fault ring are also available for the new node deactivation method. However, since the deactivation method creates more complex fault ring overlapping cases which are unexpected in Message-Route, Message-Route does not work properly.

For the new node deactivation method, to realize a simple deadlock-free routing control taking full advantage of Message-Route, we propose a new routing algorithm called Position-Route algorithm. Specifically, Position-Route does not require complex message and ring information propagation functions. To enable simple and efficient ring selection, simple ring information units (RIUs) and the related new ring management technique is used in Position-Route.

4.2.1 Message Propagation / Update Functions

Although the message propagation and the message update are key functions to guarantee the packet reachability in Message-Route, those functions involve relatively large overheads in both packet header and hardware implementation and increase complexity of the routing control that may result in higher operating delay. To overcome this problem, we point out that there are some common routing behaviors among ones with different message types, and we remove them to simplify the entire routing process.

In each router, the message update module (Fig. 5b) obtains new message type from the positional relation of source and destination nodes for every packet, then the module updates the message type according to one way message update constraint. As pointed out in the previous section, there are only two cases in which the message update module selects old message type: (1) a packet with CF message detours f-chain, and (2) a packet with RO message detours either f-string or s-chain.

In the former case, the packet can be sent to the east of the destination node when the packet detours f-chain. To send the packet to the same column as the destination node after it detours the f-chain, the packet will be sent to the west neighbor to be in the same row coordinate of the destination node. As illustrated in Fig. 14, the routing behavior with RF message contains exactly the same routing behavior with CF message. In the latter case, although the destination node of each packet can be to the north (south) when the packet detours an f-string (s-chain), the message type will not be updated to CF message due to the one way message update constraint. After the packet detours the f-string (s-chain), since the routing behavior with RO message is assumed to have the destination node in the same column, the packet will be sent to the north (south) neighbor to be in the same column coordinate of the destination node. As illustrated in Fig. 15, the routing behavior for CF message contains exactly the same routing behavior for RO message. The above observations imply that the prohibited message update from CR and RO messages to RF and CF messages can be removed, respectively. In other words, it is not necessary to introduce the one way message update constraint.

Fig. 14
figure 14

Common routing behaviors on f-chain with RF and CF when the destination node is in a the northward and b the southward

Fig. 15
figure 15

Common routing behaviors with CF and RO a on f-string and b on s-chain

We denote the routing behaviors with RF and RO messages in Message-Route by W-Route and E-Route, respectively. We also denote the routing behaviors with CF messages when the destination nodes to the north and to the south by N-Route and S-Route, respectively. The main module in Position-Route consists of those four sub-modules without both message propagation and message update functions. (See Fig. 16.) In comparison with Message-Route, due to replacement of a part of the routing behavior with RO message by N-Route, Position-Route has one slight difference in the routing behavior on the west border of s-chains. As illustrated in Fig. 17, when the destination node is to the south of the reference node of the s-chain, N-Route in Position-Route makes shorter routing path. Since the routing behavior at the west border of the s-chain is the same as one with RO message, the modification does not loose both packet reachability and deadlock freeness of Message-Route as discussed in the previous section.

Fig. 16
figure 16

Position-route algorithm

Fig. 17
figure 17

One different routing behavior between message-route and position-route

4.2.2 Ring Selection

The new node deactivation method allows at most four fault ring overlapping on a single node. As discussed in the previous section, the ring selection is the key function to provide both packet reachability and deadlock freeness. Unlike the previous deactivation method, the west (north) border of the fault ring may overlap at the east (south) border of the other fault ring. Message-Route does not support such cases.

To cover those cases and to further reduce the complexity of the ring selection method, Position-Route uses four ring information units (RIUs). Let \(\mathcal {T} =\) {\(\emptyset \), f-string, f-chain, s-chain} and \(\mathcal {D} =\) {north, south, east, west} denote the set of fault rings and the set of directions. Each RIU consists of four memory units, such as ring type (tp), column address of the reference node (r), and clockwise (cw) and counter-clockwise (ccw) detour directions, where \(tp \in \mathcal {T}\), \(r \in \mathcal {M}{\cup }\{m{+}1\}\) and cw, \(ccw \in \mathcal {D}\). The four RIUs are labeled as ne, nw, se and sw to represent the northeastward, northwestward, southeastward and southwestward fault rings, respectively. Figure 18 shows a concrete example of RIU. In this example, (2,3) is on both southwest corner of the f-string and north border of the f-chain. The ring information of the f-string is stored in ne of (2,3). Since the north border of the f-chain is the southwest and southeast fault rings for (2,3), both sw and se of (2,3) store the ring information of the f-chain. Such position-based fault ring management allows the routing algorithm to access the specific ring information immediately. All four routing sub-modules that contain ring selection function are summarized in Appendix. Figure 19 illustrates routing examples of four routing sub-modules.

Fig. 18
figure 18

An example of RIU-based ring management of (2,3)

Fig. 19
figure 19

Routing examples in position-route algorithm

Since W-Route, E-Route, N-Route, and S-Route emulate routing behaviors with RF, RO, the northward CF and the southward CF messages, respectively, we give the ring selection strategy for each four sub-module as follows.

W-Route

To send each packet to the same row coordinate of the destination node, W-Route refers to nw and sw to obtain the westward ring information. When at least one f-chain overlaps on the current node, to avoid misrouting on the f-chain, W-Route selects the northward and the southward f-chains with the first and second priorities, respectively. This priority basis selection allows each packet to be delivered to the same row coordinate of the destination node.

E-Route

Since E-Route sends a packet to the east neighbor node as far as each node has the healthy east neighbor node (not deactivated), no ring selection is required.

N-Route

To send a packet to the north direction, N-Route refers to ne and nw to obtain the northward ring information. In particular, since N-Route should select the eastward fault ring when the destination node is to the south of the reference node of the ring, N-Route select the eastward fault ring (in ne) and the westward fault ring (in nw) with first and second priorities, respectively. This priority basis selection allows each packet to be delivered to the same column coordinate of the destination node.

S-Route

To send a packet to the same column coordinate of the destination node, S-Route refers to se and sw to obtain the southward ring information and also refers to nw to realize the specific routing behavior with the highest priority. (See Fig. 10a).

4.3 Deadlock Freeness

We give the proof of deadlock freeness based on Theorem 1.

Theorem 3

Position-Route is deadlock-free on irregular 2D-MNoCs.

Proof

Since the routing behaviors for each type of fault rings in Position-Route emulate ones of Message-Route, Position-Route is also deadlock-free under the previous node deactivation method in Definition 1. However, when the new node deactivation method is applied, specific routing behaviors (See Fig. 11c and 12c) to isolate ES and SW turns on the west border of f-string and EN and NW turns on the west border of the s-chain may cause deadlock. Therefore, in this proof, we only need to focus on the following two cases:

  1. 1.

    The west border of the f-string overlaps the east border of the other fault ring, and E-Route makes ES turn on the eastward f-string.

  2. 2.

    The west border of the s-chain overlaps the east border of the fault ring, and E-Route makes EN turn on the westward s-chain.

In the former case, when the westward fault ring is f-string, to make ES turn on the eastward f-string, the eastward f-string is to the north of the westward f-string as shown in Fig. 20a. In this case, only W-Route and N-Route can make SW turns on the east border of the eastward f-string, and at least one router node exists between two fault rings. Due to the function of router node, since PassNS is prohibited at southernmost router node, ES turn does not connect SW turn. When the westward fault ring is either f-chain or s-chain, those conditions do not make any cyclic path as discussed in the proof of Theorem 2.

Fig. 20
figure 20

Key solutions not to connect a ES and SW turns by assigning a router node and b EN and NW turn based on ring selection

In the latter case, when the westward fault ring is f-string, to make NW turn on the f-string, an another f-string also overlaps as shown in Fig. 20b. In this case, only N-Route can make NW turn on the southeast corner, the southwest corner, and the south border. Since NW turn on the northeast corner of fault rings is prohibited, EN turn does not connect NW turn. When the westward fault ring is f-chain, this condition does not make a cyclic path as discussed in the proof of Theorem 2.

Therefore, according to Theorem 1, Position-Route is deadlock-free. □

5 Performance Comparison

To evaluate node availability and routing performance of Position-Route, we conduct computer simulations. Our simulation program consists of two main modules: a fault generator and two routing simulators.

The fault generator generates a fault map based on either random and cluster fault models. In the random fault model, each healthy node becomes faulty with the average node failure probability, \(\overline {f}\). In contrast, in the cluster fault model, failure probability changes in accordance with the number of neighboring faulty nodes. We use fault generation method [10]. In the model, total \(\lceil nm\overline {f}\rceil \) faulty nodes are created with the probability \(f=(\sigma _{1}{+}\sigma _{2}{\times }F_{n})\), where \(\sigma _{1}\) and \(\sigma _{2}\) indicate cluster parameters and \(F_{n}\) indicates the number of neighboring faulty nodes. In this simulation, we use \(\sigma _{1}=0.001\) and \(\sigma _{2}=0.006\).

Two routing simulators implement Message-Route and Position-Route. The same fault map generated in the fault generator is used in each routing simulator. In the simulation, results for first 5,000 cycles are discarded as transients; then the following 100,000 cycles are measured for the evaluation.

5.1 Node Failure Rate versus Node Availability

To evaluate node deactivation methods, 10,000 fault maps are generated under both the random and the cluster fault models. Figures 21 and 22 show the average occupation rate of each node state for \(20{\times }20\) 2D-MNoCs. Since implementation of node reactivation and data communication function for unsafe nodes depends on NoC designers, we count only occupation rate of healthy nodes as node availability. In the figures, it can be observed that node availability is significantly improved in Position-Route. In particular, under the random fault model, Position-Route has less deactivated nodes and more unsafe nodes than Message-Route. This indicates that the node deactivation method in Position-Route does not create large-sized faulty blocks compared with Message-Route. For \(\overline {f}=\)10 %, Message-Route could only achieve 38 % healthy and 5 % unsafe nodes, while Position-Route achieves 83 % healthy and 4 % unsafe nodes. In contrast, under the cluster fault model, when \(\overline {f}=\)15 %, Message-Route could only achieve 20 % healthy and 5 % unsafe nodes, while Position-Route achieves 68 % healthy and 8 % unsafe nodes. It is notable that the node deactivation in Position-Route creates negligible router nodes for any node failure rate, where at most 0.26 % and 0.16 % router nodes are created under the random and the cluster fault models, respectively.

Fig. 21
figure 21

An average occupation rate of each node state under random fault model (\(n{=}m{=}20\))

Fig. 22
figure 22

An average occupation rate of each node state under cluster fault model (\(n{=}m{=}20\))

5.2 Node Failure Rate versus Average Delay

To understand the relation between the number of deactivated nodes and communication delay, we compare average delay for different node failure rate. Figure 23 shows average delay and throughput on \(10{\times }10\) 2D-MNoCs under the random fault model. For \(\overline {f}{=}5~\%\), both routing algorithms achieve almost the same delay and throughput due to negligible difference in the results of node deactivation. In case of higher node failure rate, \(\overline {f}{=}10~\%\), the communication delay in each algorithm is significantly increased. Compared to Message-Route, Position-Route could achieve the lower latency while keeping almost the same throughput. Due to lower node availability in Message-Route, although the large traffic congestion increases average delay, the shorter communication paths do not decrease the throughput.

Fig. 23
figure 23

Average delay and throughput on \(10{\times }10\) 2D-MNoCs with 6 flits wormhole routing; where a \(\overline {f}{=}5~\%\) and b \(\overline {f}{=}10~\%\) (random fault model)

5.3 Average Delay versus Packet Injection Rate

To understand overall delay performance, we compare the average delay with packet injection rate. Figure 24 shows simulation results on \(20{\times }20\) 2D-MNoCs. In case of lower packet injection rate, Position-Route has longer delay. This is because the average distance between source and destination nodes in Message-Route is shorter and a large number of deactivated nodes in Message-Route prevent packet generation. On the other hand, in case of higher packet injection rate, since a large number of packets detour large-sized faulty blocks in Message-Route, the average delay significantly increases compared with Position-Route.

Fig. 24
figure 24

Average delay vs. packet injection rate on a 20\(\times \)20 2D-MNoC with 10 % node failure rate; a packet, b 6 flits, and c 12 flits

5.4 Network Size versus Average Delay

For scalability, we also compare the average delay on \(15{\times }15\) and \(20{\times }20\) 2D-MNoCs. As shown in Fig. 25, Position-Route achieves small communication delay in both network sizes. In particular, on \(20{\times }20\) 2D-MNoCs, Position-Route could achieve much lower latency compared with Message-Route. It can be observed that Position-Route is more scalable.

Fig. 25
figure 25

Average delay on a \(15{\times }15\) and b \(20{\times }20\) 2D-MNoCs under the random fault distribution (\(\overline {f}{=}10~\%\), 6 flits)

5.5 Network Size versus Implementation Cost

To evaluate the hardware cost of our algorithm, we implemented the main module and four sub-modules on Xilinx Vertex 5 family FPGA. Table 1 shows the logic utilization of both main and four sub-modules and the maximum frequency for \(16{\times }16\) 2D-MNoCs. Since no message updating and no ring selection are involved in Position-Route, packet header is not required to contain message information, which results in fast routing control and small hardware resource. This comes from the advantage of Position-Route that the underlined parts of each sub-module of Position-Route (Fig. 26) can be pre-calculated because they are fixed once all fault rings are created.

Table 1 Implementation summary (\(n = m = 16\))
Fig. 26
figure 26

Four sub-modules in position-route algorithm

We also evaluate the size of RIUs that will be the main overhead of our algorithm. Since each RIU consists of four items, such as tp, r, cw, and ccw, the total size of four RIUs is given by

$$\begin{array}{rll} 4(\lceil{\rm log}_{2}(2|\mathcal{D}| + |\mathcal{T}|)\rceil + \lceil{\rm log}_{2}|r|\rceil)\\ = 4(6 + \lceil{\rm log}_{2}|r|\rceil) ~ ~ ~{\rm{(bits)}}. \end{array} $$
(1)

From the above, it can be seen that four RIUs offer relatively small hardware overhead.

6 Conclusion

In this paper, we corrected errors of the traditional region-based fault-tolerant routing algorithm called Message-Route, and proposed the fault-tolerant routing algorithm called Position-Route. Experimental results showed that Position-Route achieves higher node availability and provides simple routing control. We also implemented Position-Route on an FPGA and evaluated the hardware consumption and the maximum frequency of the operation. We expect the presented implementation technique, RIU, can be used in other routing algorithms.