A Survey On Application Mapping Strategies For Network-On-Chip Design

Journal of Systems Architecture 59 (2013) 60–76

Pradip Kumar Sahu, Santanu Chattopadhyay
Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, Kharagpur 721302, West Bengal, India
Abstract
Application mapping is one of the most important dimensions in Network-on-Chip (NoC) research. It maps the cores of the application to the routers of the NoC topology, affecting the overall performance and power requirement of the system. This paper presents a detailed survey of the work done over the last decade in the domain of application mapping. Apart from classifying the reported techniques, it also performs a quantitative comparison among them. The comparison has also been carried out for larger test applications by implementing some of the promising techniques. © 2012 Elsevier B.V. All rights reserved.
Article history: Received 1 February 2012; received in revised form 12 October 2012; accepted 13 October 2012; available online 2 November 2012.
Keywords: Application mapping; Network-on-chip; System-on-chip; Intellectual property
1. Introduction
With the growing complexity of embedded VLSI products, System-on-Chip (SoC) based single-chip implementation, integrating numerous Intellectual Property (IP) cores that perform various functions and possibly work at different clock frequencies, is now well established. A shared-medium arbitrated bus is the commonly used communication backbone in these SoCs. However, the performance of a bus-based SoC does not scale with the number of cores attached. In the search for a communication backbone for next-generation many-core SoCs supporting new inter-core communication demands, Network-on-Chip (NoC) has emerged as a viable alternative. It proposes the design of modular and scalable communication architectures in which the IP cores are connected to a router-based network through appropriate Network Interfaces (NIs) [1–3].
Fig. 1 shows the NoC synthesis flow. The application is specified as a collection of tasks with communication between them. This is known as the application task graph. The IP cores in a design library can each perform a subset of these tasks. Hence, the first step is to select a set of cores and allocate the tasks to be realized by them. This gives rise to the core graph, with cores as nodes and communication bandwidths as edge labels. The mapping techniques map this core graph onto a topology graph, the objective of mapping being a reduction in overall communication delay. The mapped graph then passes through routing and scheduling stages to generate the final NoC.
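To make the objects in this flow concrete, the following minimal Python sketch represents a core graph as a dictionary of bandwidth-labelled edges and a mapping as an assignment of cores to routers of a 2-D mesh. The cost used here (bandwidth multiplied by Manhattan hop count, summed over all edges) is a commonly used proxy for communication delay and is an assumption of this sketch, as are the names `core_graph`, `hops` and `communication_cost` and the bandwidth values.

```python
# Hypothetical core graph: (source core, destination core) -> bandwidth.
core_graph = {("a", "b"): 100, ("b", "c"): 50, ("a", "d"): 20}

def hops(p, q):
    """Manhattan distance between two (row, col) routers of a 2-D mesh."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def communication_cost(core_graph, mapping):
    """Sum of bandwidth x hop count over all edges (assumed cost model)."""
    return sum(bw * hops(mapping[s], mapping[d])
               for (s, d), bw in core_graph.items())

# Two candidate mappings of the same core graph onto a 2x2 mesh.
good = {"a": (0, 0), "b": (0, 1), "c": (1, 1), "d": (1, 0)}
bad = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}

print(communication_cost(core_graph, good))  # 170
print(communication_cost(core_graph, bad))   # 270
```

The two mappings use the same routers and differ only in which core sits where, which is exactly the degree of freedom the mapping stage explores.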
Corresponding author. Mobile: +91 9434042800; fax: +91 3222 255303.
E-mail addresses: pradipsahu.ece@iitkgp.ac.in (P.K. Sahu), santanu@ece.iitkgp.ernet.in (S. Chattopadhyay).
1383-7621/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.sysarc.2012.10.004
The holistic research problems in this NoC design paradigm can be broadly classified into four dimensions [4–6]. The first dimension focuses on the choice of communication infrastructure, such as network topology, router architecture, buffer optimization, link design, clocking, floor planning, and layout. The second dimension deals with the communication paradigm, including routing policies, switching techniques, congestion control, power and thermal management, fault tolerance, reliability, etc. The third dimension involves designing an evaluation framework for NoC to gain a good understanding of the achievable throughput, latency, and bandwidth of the network. Once the communication infrastructure and paradigm for a NoC have been finalized, a major challenge in overall system design is to associate the IP cores implementing the tasks of an application with the routers. This association plays a very significant role in determining the performance of the overall system, as it directly influences communication time, required link bandwidth, and admissible delay of the router. This application mapping forms the fourth important dimension in NoC research.
While a good number of surveys [4–6] exist on NoC work in the first three dimensions, the fourth one, application mapping, has not been surveyed well. To the best of our knowledge, the only survey on NoC mapping and scheduling techniques is [7], which is now dated. A large number of research works have been reported in recent years. The scope of this paper lies in studying these works. The objective is to classify the mapping algorithms into different categories and compare them.
In this paper we have mainly focused on application mapping. In some NoCs, the cores are already attached to the routers; hence, the task mapping stage can simply be augmented to decide on the particular core to be used. On the other hand, in general NoCs no such association pre-exists between cores and routers.
The application mapping phase performs this mapping. A third category synthesizes a topology graph specifically optimized for an application, commonly known as application-specific NoC synthesis. In this paper we survey mostly the techniques that map cores to routers without assuming any pre-existing core–router association. The survey also excludes application-specific topology generation from its purview.
The IP cores in a NoC topology may be homogeneous or heterogeneous in nature. All homogeneous cores can perform the same set of tasks, while heterogeneous cores can perform different sets of tasks. This is generally taken care of at the core selection stage (Fig. 1). The mapping process takes the core graph (along with its communication requirements) as input and finds a mapping of cores to the topology graph.
Fig. 1. Application-specific NoC design flow (elements shown: application task graph, core selection, core graph, topology graph, mapping, mapped topology graph, routing, routes for communication, scheduling, synthesized NoC).
The rest of the paper is organized as follows. Section 2 presents an overview of mapping techniques. Section 3 enumerates different dynamic mapping techniques, while Section 4 concentrates on static mapping approaches. Section 5 presents a performance comparison among the mapping techniques. Some special mapping techniques are presented in Section 6. Section 7 presents an overview of some of the application mapping tools. Section 8 draws the conclusion.
2. Mapping techniques
The problem of application mapping is NP-hard [7]. Depending on the time at which the tasks are assigned to the IPs for processing, the mapping techniques can be classified as dynamic mapping and static mapping (Fig. 2).
In on-line or dynamic mapping, the assignment and ordering of tasks are performed during the execution of the application. Dynamic mapping tries to detect performance bottlenecks and distribute the workload among the processors. As the mapping depends on the current load of the processors, it can result in a better solution. However, the computational overhead of the mapping algorithm may increase the delay and energy consumption of the application at run-time.
In static mapping, on the other hand, the mapping of tasks is generally performed off-line, before the application is run. For a given application and a target communication infrastructure, static mapping tries to define the best placement of tasks at design time. As the mapping is completed before execution, the mapping algorithm is executed only once. For NoC, static mapping is generally recommended, as the excess communication overhead in dynamic mapping significantly affects system performance, increasing the overall delay of the system [7].
3. Dynamic mapping techniques
Dynamic mapping is an on-line mapping strategy. The ready tasks are mapped to the processors by observing the load of the processors at run-time. So the placement of tasks on the NoC can be changed during execution of the application.
Fig. 2. Classification of mapping algorithms. Mapping algorithms divide into dynamic mapping and static mapping. Static mapping divides into exact mapping (mathematical programming based mapping: ILP, MILP, etc.) and search based mapping. Search based mapping divides into systematic or deterministic search (Branch and Bound (BB)) and heuristic search. Heuristic search divides into transformative heuristics (PSO, GA, ACO) and constructive heuristics, the latter being either constructive without iterative improvement (BMAP, CMAP, CHMAP, SMAP) or constructive with iterative improvement (NMAP, LMAP, SA, Onyx).
In [8], the authors have proposed a compiler-based application mapping scheme that performs task scheduling, processor mapping, data mapping, and packet routing. This technique needs a very high compilation time, which may degrade system performance. Energy has been compared with and without data mapping and packet routing.
In [9,18], the authors have presented heuristics for dynamic task mapping with an initial task mapping phase followed by a dynamic mapping phase. The dynamic mapping phase may use any one of the following techniques: First Free (FF), Nearest Neighbor (NN), Minimum Maximum Channel Load (MMC), Minimum Average Channel Load (MAC), and Path Load (PL). In the FF technique, the NoC selects the first free node that can execute the requested task, the network being searched column-wise. NN mapping is similar to FF; the only difference is that the requested task is placed at a free node neighboring the node making the request. MMC is a congestion-aware mapping heuristic that reduces the maximum load in the links. The MAC technique is similar to MMC; it distributes the communication load over the NoC to reduce the average load in the links. MMC and MAC consider all NoC links while mapping a new task; hence, mapping takes time. The PL technique overcomes this problem by considering only the links that are used by the task being mapped. It has been shown that the PL heuristic produces the best solution compared to the others [9,18].
The authors in [10] have proposed a technique for run-time application mapping onto NoC platforms with multiple voltage levels. This technique consists of a region selection algorithm using a heuristic for run-time application mapping. Different regions are operated at different voltage levels. Compared to random mapping, the communication energy saving is about 50%.
In [11], the authors have described a run-time strategy for task allocation on a homogeneous NoC platform.
It incorporates user behavior information in the resource allocation process. This allows the system to respond better to real-time changes and to adapt to user needs dynamically.
In [12], real-time applications are dynamically mapped onto embedded MPSoCs in which communication is performed via a NoC and the resources connected to the NoC have multiple voltage levels, as in [10]. A two-step algorithm has been proposed: first a proper region is selected for mapping, and then a greedy heuristic maps tasks to nodes at run-time. The task allocation may follow any one of the techniques Best Case (BC), Worst Case (WC), Euclidean Minimum (EM), Fixed Center (FC), Random Frontier (RF), and Neighbor-aware Frontier (NF). The BC technique corresponds to the optimal solution generated by an exhaustive search. It cannot be applied to dynamic mapping due to its high run-time overhead [12]. In WC, the mapping of a task onto a tile depends on an already mapped task. The EM technique maps a task by selecting an unmapped tile on the NoC having minimum Euclidean distance from all the mapped nodes. In FC, tasks are mapped with a minimum Manhattan Distance (MD) to the first mapped tile. The RF heuristic maps a task by selecting an unmapped tile randomly from the frontier of the mapped region. Every tile has four neighbors. A tile is available for mapping if it has not been included in the region. The NF heuristic maps a task onto a tile with a minimal number of available neighbors. It has been shown that the EM and FC heuristics produce reasonably better solutions than the others [12].
Sometimes the number of tasks running in an MPSoC may exceed the number of available resources, so dynamic task mapping is needed [13]. Tasks are mapped on the fly, according to the communication requests and the loads in the links. The target MPSoC architecture contains software and hardware processing elements. Each processing element can support only one task.
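The tile-selection rules of [12] described above (EM, FC and NF) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [12]: the region is a list of (row, col) tiles in mapping order, EM aggregates distances by summation, ties are broken by row-major scan order, and the function names are hypothetical.

```python
import math

def neighbors(tile, rows, cols):
    """Mesh neighbours of a (row, col) tile (up to four, fewer at the border)."""
    r, c = tile
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(x, y) for x, y in cand if 0 <= x < rows and 0 <= y < cols]

def free_tiles(region, rows, cols):
    """Tiles not yet included in the mapped region, in row-major order."""
    return [(r, c) for r in range(rows) for c in range(cols)
            if (r, c) not in region]

def pick_em(region, rows, cols):
    """EM: free tile with minimum summed Euclidean distance to mapped tiles
    (summation is an assumed way of combining the distances)."""
    return min(free_tiles(region, rows, cols),
               key=lambda t: sum(math.dist(t, m) for m in region))

def pick_fc(region, rows, cols):
    """FC: free tile with minimum Manhattan distance to the first mapped tile."""
    first = region[0]
    return min(free_tiles(region, rows, cols),
               key=lambda t: abs(t[0] - first[0]) + abs(t[1] - first[1]))

def pick_nf(region, rows, cols):
    """NF: frontier tile with the fewest available (unmapped) neighbours."""
    frontier = [t for t in free_tiles(region, rows, cols)
                if any(n in region for n in neighbors(t, rows, cols))]
    return min(frontier,
               key=lambda t: sum(n not in region
                                 for n in neighbors(t, rows, cols)))

region = [(0, 0), (0, 1)]          # two tiles already mapped on a 3x3 mesh
print(pick_fc(region, 3, 3))       # (1, 0)
print(pick_nf(region, 3, 3))       # (0, 2)
```

RF would simply draw a random element of the same `frontier` list; it is omitted to keep the sketch deterministic.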
The approach [13] reduces the NoC channel load, congestion, and packet latency using different heuristics, as in [9,18].
DSM, a dynamic spiral mapping technique, has been proposed in [14] for task mapping at run-time. The placement of a task is searched along a spiral path from the centre to the boundary of the network architecture. It tries to place the communicating tasks close to each other. It also attempts to reduce the communication time by reducing dynamic mapping time, reconfiguration time, and task migration time.
A run-time agent-based distributed application mapping technique for NoC-based heterogeneous MPSoCs has been presented in [15]. The technique maps the applications in a decentralized or distributed manner using an agent-based approach. It reduces monitoring traffic and computational effort for the mapping process, compared to centralized approaches. Agents are small tasks that can be executed on any node in the NoC. They perform resource management and store state information for the resources. The agents act and negotiate with each other to find processing elements suitable for mapping a task. There are two types of agents to accomplish this: Global Agents (GA) and Cluster Agents (CA). The CAs have knowledge about their clusters. When they get a new task request, they negotiate with the GAs, which have global information about all clusters.
The target MPSoC architecture designed in [16] contains both software and hardware processing elements, which can support more than one task in parallel. In this MPSoC architecture, one of the available processing nodes acts as a Manager Processor and is responsible for task binding, task mapping, task migration, resource control, and reconfiguration control. The resource status is updated at run-time, and the Manager Processor keeps track of the information about resource occupancy. Mapping decisions are taken accordingly. The mapping heuristics proposed in [16,17] map the communicating tasks of an application close to each other so as to minimize the communication overhead and thereby improve performance. The heuristics examine the available resources prior to recommending the adjacent tasks on the same processing element.
The algorithm also attempts to map the communicating tasks in close proximity in a compact manner, so as to reduce communication overhead and time. Hence, total execution time is also reduced. This is a two-phase algorithm. In the first phase, initial task mapping is done either on the first free position found in the network that can support the tasks, or the NoC is partitioned into regions and the initial tasks are placed at the centres of these region clusters. In the second phase, the newly requested tasks are mapped by an efficient run-time mapping algorithm for better performance gain. In general, the works proposed in [9,13,18] are extended in [16,17], employing a packing strategy that minimizes the communication overhead on the same NoC-based MPSoC platform. It supports multi-task mapping onto the same PE.
An energy-aware heuristic for dynamic task mapping, named lower energy consumption based on dependencies-neighborhood (LEC-DN), has been presented in [19,20]. The main cost function here is not only the distance in hops between communicating tasks, but also the proximity in the number of hops and the communication volume among the tasks, since the number of transmitted flits defines the communication energy. When the target task has only one communicating task that has already been mapped, LEC-DN uses the Nearest Neighbor (NN) search in a spiral fashion. On the other hand, if more than one communicating task has already been mapped, it searches for a processing element inside the bounding box defined by the positions of such tasks, depending on the communication volume.
In [21], a dynamic, decentralized, application-driven, and resource-aware mapping has been proposed, in which tasks can be embedded incrementally by an already mapped predecessor task. This is a self-embedding approach that is fully decentralized and autonomous.
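As an illustration of the simplest run-time heuristics surveyed in this section, the First Free (FF) and Nearest Neighbor (NN) rules of [9,18] can be sketched as below. The sketch assumes a homogeneous 2-D mesh in which any free node can execute the requested task, measures closeness by Manhattan hop count, and lets NN fall back to the closest free node when no direct neighbour is free; these choices and the function names are assumptions, not details taken from [9,18].

```python
def first_free(occupied, rows, cols):
    """FF: scan the mesh column by column and return the first free node."""
    for c in range(cols):
        for r in range(rows):
            if (r, c) not in occupied:
                return (r, c)
    return None  # every node is occupied

def nearest_neighbor(occupied, requester, rows, cols):
    """NN: return a free node at the smallest hop distance from the requester."""
    free = [(r, c) for c in range(cols) for r in range(rows)
            if (r, c) not in occupied]
    if not free:
        return None
    return min(free, key=lambda t: abs(t[0] - requester[0])
                                   + abs(t[1] - requester[1]))

# 3x3 mesh; the task running at (2, 2) requests a new task.
occupied = {(0, 0), (1, 0), (2, 2)}
print(first_free(occupied, 3, 3))                # (2, 0), two hops away
print(nearest_neighbor(occupied, (2, 2), 3, 3))  # (2, 1), one hop away
```

FF ignores where the requester sits, so the new task may land several hops away; NN trades a slightly larger search for a shorter communication path.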
4. Static mapping techniques
A mapping is called static if the resource on which a task is going to be executed is decided before its execution and is not changed thereafter. Static mapping (Fig. 2) is an off-line mapping. All the cores are mapped to routers at design time. Various techniques have been developed to find a good and efficient mapping solution. The static application mapping approaches can be broadly classified as either exact mapping or search based mapping, as shown in Fig. 2, depending upon the techniques employed to reach a mapping solution.
4.1. Exact mapping
The mathematical programming based mapping produces an optimal solution. A Mixed Integer Linear Programming (MILP) based task mapping for heterogeneous multiprocessor systems has been reported in [22]. In this heterogeneous multiprocessor, some processors are programmable, while others are application specific. The model determines the optimization trade-off between execution time, processor (general-purpose processor or application-specific processor), and communication cost. This is a hardware/software co-design process, run iteratively until the design goal is met.
An MILP formulation for mapping cores onto NoC, considering the choice of core placements, switches for each core, and network interfaces for communication, has been proposed in [23]. It is reported that the energy consumption is much lower than that of other mapping techniques for some real as well as random benchmarks.
An integrated approach for mapping of cores onto heterogeneous processor/memory based NoC topologies and physical planning has been presented in [24], where the position and size of the cores and network components are computed. For the initial mapping, the authors have followed a greedy mapping of cores onto the specified topology, and then in the improvement phase the relative core positions are fixed by Tabu search. An MILP-based physical planning algorithm has been formulated to improve the area and power of the final design and also to guarantee Quality-of-Service (QoS) for the application.
In [25], an MILP formulation for synthesis of custom NoC architecture has been presented.
Here the optimization objective is to minimize the power consumption, subject to the performance constraints. In Linear Programming (LP), the main bottleneck is runtime. To reduce runtime, the authors have partitioned the application task graph into a number of clusters. The MILP formulation for topology design is then applied and partial solutions are generated. At the end, the final mapped custom topology is generated by adding physical links between the ports of neighbouring routers of the clusters.
Network processors incorporate features like symmetric multiprocessing (SMP), block multi-threading, and multiple memory elements to support high-performance networking applications. Mapping an application onto a complex multi-processor, multi-threaded network processor is a difficult task. In [26], the authors have presented a two-stage Integer Linear Programming (ILP) formulation for process allocation and data mapping on an SMP and block multi-threading based network processor.
Power/energy control is a very important issue in NoC-based chip multiprocessors (CMPs). The work [27] attempts to minimize energy by shutting down certain communication links in such architectures. This formulation can be used for selecting the links in use, and the voltage and frequency for those links.
The problem of minimizing energy consumption during application execution while satisfying the performance constraint may be a combination of sub-problems, such as mapping of application tasks to IPs, mapping of IPs to the routers of the NoC architecture, assigning operating voltages to IPs, and routing. Different operating voltages are assigned to IPs if they can operate at multiple voltages. A unified approach to energy-efficient application mapping, which uses an MILP formulation of the problem, has been presented in [28], taking care of all the sub-problems, such as application mapping, operating voltage assignment, and routing. In [29], the existing ILP [28] has been extended to find a trade-off between computation and communication energy.
In [30], factors that produce network contention have been analyzed. It proposes an ILP formulation for a contention-aware application mapping algorithm in tile-based NoC to minimize inter-tile network contention. In NoC-based design, the global wires are replaced by a network of shared links, and the routers exchange data packets simultaneously through the links. So there is traffic congestion within the links, which significantly degrades system performance. The network contention may be source based, destination based, or path based. The result shows that there is a significant reduction of packet latency by reducing the network contention, but the loss of communication energy is high.
In [31], the authors have presented an ILP formulation for application mapping onto mesh-based NoC to minimize energy consumption for different benchmarks. However, the formulation does not include bandwidth constraints. The CPU time for different benchmarks reported in that work is also quite high. To overcome the high CPU time, a clustering-based relaxation of the ILP formulation has been proposed in [32]. The tasks of the application graph are clustered suitably, as in [25]. Based on the number of clusters, the mesh architecture is divided into smaller meshes. The ILP-based formulation of [31] is used to map the clusters onto the corresponding sub-meshes. At the end, it merges all such sub-meshes to determine the final solution. It has been noted that the CPU time improves at the cost of a sacrifice in the communication cost of the mapping solution.
4.2. Search based mapping
Depending on the search type and results, there are two types of search based mapping algorithms: (i) systematic or deterministic search and (ii) heuristic search.
4.2.1. Deterministic search
Search algorithms using Branch-and-Bound (BB) belong to this category.
It is a systematic search algorithm that topologically finds the mapping by searching for the solution in tree branches and bounding unallowable solutions. It is applicable mainly to smaller problems, as search time grows exponentially with the size of the problem.
In [33–35], the authors have proposed an energy- and performance-aware mapping for tile-based regular NoC architecture to satisfy the specified design constraints through bandwidth reservation. In [34,35], it has been suggested that the most appropriate routing technique for NoC should be deterministic, deadlock-free, minimal, and wormhole-based. In a tile-based NoC architecture, as the network wires are structured and modular, their electrical parameters can be very well controlled and optimized. They have first formulated Energy- and Performance-Aware Mapping (EPAM or GMAP) in a topological sense, and then an efficient Performance-aware Branch-and-Bound (PBB) algorithm is used to improve the solution quality. A good amount of energy saving has been reported for EPAM combined with PBB, compared to Simulated Annealing (SA) based solutions.
In the above mapping, a single IP is connected to each router. An IP with large communication volume will result in a heavy traffic load on certain routers, which may become hotspots due to high power density, affecting the reliability of the chip. A traffic-balanced IP mapping algorithm (TBMAP) for 2-D (two-dimensional) mesh-based NoC has been proposed in [36]. The TBMAP algorithm balances the traffic of all the routers without sacrificing network performance. To reduce unbalanced traffic loads, the authors have proposed various network interfaces (NIs): Single-Router to Single-IP (SRSI), Single-Router to Multiple-IP (SRMI), and Double-Router to Single-IP (DRSI). Based on the new NIs, TBMAP uses a modified branch-and-bound search, as in [33–35], to map all the IPs onto 2-D mesh based
NoC. As traffic loads are more balanced in this case, some data paths might not be the shortest ones. So, the average traffic load may be higher in some cases.
In [37], the authors have taken NMAP [75] (discussed in Section 4.2.2.2.2) as their initial mapping solution. A Branch-and-Bound algorithm, as in [33–35], has been applied to the NMAP mapping solution to arrive at a better solution. It has been reported that the new mapping solution is better than EPAM, PBB, and NMAP in terms of communication cost, power consumption, and network latency. The Branch-and-Bound algorithms demand high memory depth and suffer from long CPU time requirements.
4.2.2. Heuristic search
A number of heuristic approaches have been reported to solve the application mapping problem. They can be broadly classified into transformative heuristics and constructive heuristics.
4.2.2.1. Transformative heuristics
Transformative heuristics transform some existing mapping solution(s) to arrive at better ones. Typical examples include the evolutionary techniques, such as Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), and so on.
4.2.2.1.1. GA based transformative heuristics
Genetic Algorithm (GA) is a stochastic search algorithm based on the operations of natural genetics. Here, a fixed-size population of chromosomes evolves over a number of generations following the principle of natural selection. Each chromosome identifies a potential solution and has an associated fitness measure. Using operators similar to crossover and mutation in nature, the population evolves through generations. To evolve a new generation, generally the top few percent of chromosomes are directly copied to the next generation. The rest of the population is created by two operators: crossover and mutation. The crossover operator selects two parent chromosomes to participate in the operation. Their parts are exchanged to create new offspring. The crossover may be single-point or multi-point.
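The two crossover variants just mentioned can be sketched generically as follows. Chromosomes are taken to be equal-length Python lists, and the function names, the `rng` argument and the default of two cut points are assumptions of this sketch rather than details of any surveyed technique.

```python
import random

def single_point_crossover(p1, p2, rng):
    """Swap the tails of two equal-length parents after one random cut point."""
    cut = rng.randint(1, len(p1) - 1)  # both sides of the cut are non-empty
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def multi_point_crossover(p1, p2, rng, points=2):
    """Alternate parent segments between several sorted random cut points."""
    cuts = sorted(rng.sample(range(1, len(p1)), points))
    c1, c2, swap, prev = [], [], False, 0
    for cut in cuts + [len(p1)]:
        a, b = (p2, p1) if swap else (p1, p2)
        c1 += a[prev:cut]
        c2 += b[prev:cut]
        swap, prev = not swap, cut
    return c1, c2

rng = random.Random(0)                 # seeded only for repeatability
parent1, parent2 = [0] * 6, [1] * 6    # toy parents that make the cuts visible
print(single_point_crossover(parent1, parent2, rng))
print(multi_point_crossover(parent1, parent2, rng))
```

On chromosomes that directly encode a permutation of cores, both operators can duplicate or drop cores, so either a repair step or an encoding whose genes have independent ranges would be needed; that choice is left open here.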
The mutation operator may be implemented by selecting a parent chromosome and randomly changing some of its portions. The mutation rate can be controlled to control the rate of convergence to local or global minima. The termination criterion is often set to be either no improvement in the last few generations or a specified number of generations for which the GA has run.
A two-step Genetic Algorithm (GA) for mapping applications onto NoC, which reduces the overall execution time, has been proposed in [38]. In the first step, the tasks are assigned to different IPs assuming the edge delays to be constant and equal to the average edge delay. In the second step, the IPs are mapped to tiles of the NoC taking the actual edge delay based on the network traffic model, and the total system delay is minimized. In this mapping, some delay factors, such as the message sending probability of cores, packet length, and the network contention for communication, have not been considered.
In [39], the authors have proposed a delay model for application mapping onto NoC considering all these factors. Their genetic algorithm based on this delay model is reported to map an application onto NoC with a minimum average delay. In this genetic algorithm, each member of the population corresponds to the positions of the cores on the NoC topology. The initial population is chosen randomly. The fitness function is the average waiting time. To evolve a new generation, a multi-point crossover is used with randomly chosen crossover points. A chromosome with lower waiting time has a higher chance of participating in crossover than a chromosome with higher waiting time. The size of the chromosome is the same as the number of cores in the core graph. To control the rate of convergence to local or global minima, a mutation is performed randomly on a chromosome with a mutation probability. The above operations are performed repeatedly until the lowest average waiting time has not changed for a specified number of iterations. The best solution at the end then gives the final core positions on the NoC.
For example, let the cores a to f be mapped onto a (2 × 3) 2-D mesh, and let the final chromosome structure be (1 2 1 3 3 6). In this structure, the first integer is used to map a, the second integer is used to map b, and so on. All the solutions will have a 1 in the first position. The second core b can be placed in two positions, that is, before a (with a value 1) or after a (with a value 2). Here b is placed after a, as the integer value for b is 2 in the chromosome structure. Then there are three placeholders for c, that is, before a (with a value 1), between a and b (with a value 2), or after b (with a value 3). For c, the integer value is 1, so it is placed before a. Similarly, the other cores are placed, and the final positions of the cores are obtained as (c a e d b f) according to the chromosome structure. Using this representation for the chromosomes, the cores are placed in a (2 × 3) 2-D mesh connected NoC according to the positions obtained in the final solution, as shown in Fig. 3.
A Pareto-based multi-objective evolutionary computing technique that optimizes the performance and power consumption of the mapped NoC has been proposed in [40]. The same authors in [41] used this technique for application task mapping. For dynamic evaluation, an event-driven trace-based simulator has been used to compare their results with a Pareto-based Branch-and-Bound approach [41] and a Pareto-based NMAP approach [41].
A multi-objective genetic algorithm based application mapping for NoC has been presented in [42], which targets mapping with Network Assignment (NA) for heterogeneous distributed embedded systems to improve performance and reduce power consumption and area. This technique first allocates tasks to cores, and then maps the cores to different tiles of the NoC while satisfying communication requirements.
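The insertion-based chromosome of [39] worked through above, where (1 2 1 3 3 6) decodes to (c a e d b f), can be reproduced with a few lines of Python. The function names are hypothetical, and the row-by-row fill of the mesh in `place_on_mesh` is an assumption of this sketch; the source figure may lay the decoded order out differently.

```python
def decode(chromosome, cores):
    """Insert each core at the 1-based position given by its gene."""
    order = []
    for core, gene in zip(cores, chromosome):
        order.insert(gene - 1, core)
    return order

def place_on_mesh(order, rows, cols):
    """Fill the mesh row by row with the decoded order (assumed layout)."""
    return [order[r * cols:(r + 1) * cols] for r in range(rows)]

order = decode([1, 2, 1, 3, 3, 6], "abcdef")
print(order)                       # ['c', 'a', 'e', 'd', 'b', 'f']
print(place_on_mesh(order, 2, 3))  # [['c', 'a', 'e'], ['d', 'b', 'f']]
```

A useful property of this encoding is that any chromosome whose i-th gene lies between 1 and i decodes to a valid placement in which every core appears exactly once, so position-wise crossover never needs a repair step.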
The mapping of IP cores onto NoC tiles, together with routing path allocation has been referred as network assignment (NA). The network assignment is usually performed after task mapping to reduce on-chip inter-communication distance. The Genetic Algorithm based optimization technique MGAP proposed in [43] minimizes the power consumption by reducing the number of switches in the communication path between cores and also maximizes the throughput. Though similar technique has been used in [38], but here authors have considered the dynamic effect of trafc. They have also given a set of solutions using pareto mapping as used in [40,41]. A multi-objective Genetic Algorithm (MOGA) based application mapping technique has been proposed in [44], where oneone as well as manymany mapping between switches and tiles have been taken into consideration to minimize energy consumption and required link bandwidth. It is used to nd optimal solution from the pareto optimal solutions as in [43]. The chromosome is representation of the mapping solution [43,44] which is formed by m n genes, where, m is the number of rows and n is the number of columns of the mesh connected NoC. The ith gene corresponds to the core in the tile having row di=ne and column (i % n). Here the crossover is single-point, and during the crossover the maximum communicating cores are remapped to random tiles result in a new chromosome. The mutation operation is performed upon a chromosome by choosing highly communicated cores and placing them nearby to each other. In [45,46], CGMAP, a Genetic Algorithm based application mapping technique has been proposed that uses the chaotic mapping operator instead of the random processes in GA. Here the concept
Fig. 3. Final core placement [39] (2 × 3 mesh; top row: c, a, e; bottom row: d, b, f).
of chaotic sequences has been combined with the genetic algorithm to reach an optimal mapping solution. The same authors in [47] presented a different one-dimensional chaotic mapping technique onto NoC. Here the authors have combined different chaotic operators with GA to arrive at a better solution. GBMAP, an evolutionary approach for mapping cores onto NoC architecture, has been proposed in [48]; it reduces the energy consumption and the total bandwidth requirement of the NoC. GAMR [49], a genetic algorithm based mapping and routing approach, addresses a two-phase mapping of IP cores onto NoC architecture and generates a deterministic deadlock-free minimal routing path for each communication to minimize the total communication energy and maximize the link bandwidth utilization of the NoC architecture. In the first phase, GAMR maps IP cores onto different resource nodes of the mesh based NoC architecture. In the second phase, it generates a deterministic deadlock-free minimal routing path for each communication trace without changing the placement of cores generated in the first phase. In [50], the authors have proposed the Architecture-Aware Analytic Mapping algorithm (A3MAP) for NoC with homogeneous and heterogeneous cores on regular and irregular mesh or custom architecture. The task mapping problem is solved by two effective heuristics: a successive relaxation algorithm as a fast algorithm, and a genetic algorithm to find better mapping solutions. In [51], a genetic algorithm based mapping technique has been proposed for customized NoC architecture to reduce the communication energy. The same authors in [52] proposed a GA based congestion-aware mapping technique for irregular customized NoC architecture to reduce the communication energy. A Multi-objective Adaptive Immune Algorithm (MAIA), based on an evolutionary approach, has been proposed in [53]; it maps the application tasks onto NoC to reduce the power consumption and overall network latency.
The adaptive immune algorithms integrate a wide set of features that improve local search while preventing premature convergence by preserving the diversity of solutions in the population. The same authors in [54] have proposed an improved version of MAIA to solve the multi-application NoC problem. It produces a set of mapping alternatives by exploring the mapping space.

The main drawback of such genetic approaches is the slow rate of convergence. The GA often has to evolve through a large number of generations to converge to a solution. The best solution at the end is taken to be the solution of the mapping problem. To accelerate the rate of convergence, the mutation rate can be increased. However, the search then mostly converges to local best solutions, rather than finding the global best.

4.2.2.1.2. PSO and ACO based transformative heuristics. Particle Swarm Optimization (PSO) [55] is a population based stochastic technique developed by Eberhart and Kennedy in 1995, inspired by the social behaviour of bird flocking or fish schooling. In a PSO system, multiple candidate solutions coexist and collaborate simultaneously. Each solution, called a particle, flies (evolves) in the problem space according to its own experience as well as the experience of neighbouring particles. It has been successfully applied in many problem areas. In a PSO, each single solution is a particle in the search space, having a fitness value. The quality of a particle is evaluated by its fitness. PLBMR, a PSO based two-phase application mapping algorithm, has been proposed in [56]; it minimizes the NoC communication energy and allocates the routing paths so as to balance the link load. In the first phase, the PSO maps IP cores onto the NoC to minimize the energy consumption, and in the second phase the routing paths are allocated to every communicating pair to satisfy the link-load balance. The particle structure and initial particle generation are the same as the chromosome structure of the GA based technique described in [39].
In [57], the authors have proposed a Particle Swarm Optimization (PSO) based application mapping technique for NoC. However, the merit of the scheme is not clear, as no comparison has been made with the existing approaches. A mapping technique based on discrete PSO has been presented in [58]. However, it only considers improvement over a genetic algorithm based method and reports relative improvements only. In [59], a hybrid multi-objective algorithm has been proposed, where Dijkstra's shortest path algorithm has been used to find the shortest path among communicating cores to satisfy the bandwidth constraints, and then a multi-objective Pareto based Particle Swarm Optimization (PSO) technique is applied upon that to improve performance.

In [60], PSMAP, a meta-heuristic strategy using the Particle Swarm Optimization (PSO) technique, has been proposed to reduce both static and dynamic cost of NoC for 2-D mesh based application mapping. A particle corresponds to a possible mapping of cores to the routers. An example of a particle structure is shown in Fig. 4. The numbers shown within circles in the boxes are the core numbers present in the core graph. The numbers outside the box are the router numbers of the topology graph. It is assumed that the routers are numbered in increasing order from the top left to the bottom right position. The figure shows that core 1 is attached to router 0, core 4 is attached to router 1, and so on. If the number of nodes (routers) present in the topology graph is greater than the number of cores present in the core graph, dummy nodes are added to the core graph to make the two numbers the same. Dummy nodes are connected to all core nodes and among themselves. Edges connecting a core node to dummy nodes and the edges between dummy nodes are assigned a cost of zero. Let N be the number of cores present in the core graph, after connecting dummy nodes, if required. For these N cores, there are N node positions in the topology graph. A particle is a permutation of the numbers from 1 to N, which shows the placement of cores onto the node positions of the topology graph. The overall communication cost is influenced by the position of cores in a particle.
In our formulation, the overall communication cost forms the fitness function. The fitness of a particle $p_i$ is equal to the overall communication cost after placement of the cores of the core graph onto different routers, as specified by the particle. In the evolution process, every particle i has its corresponding local best $pbest_i$, which is the permutation of core positions that gives the minimum communication cost among all permutations that the particle has seen so far. The local best permutation partially guides the evolution of the particle. For a particular generation, the particle resulting in the minimum communication cost is the global best ($gbest_k$) for that generation. This parameter also guides the evolution of particles. The particles evolve through generations to create new particles which are expected to give results closer to the optimum. In the first generation, the initial population is created randomly and the fitness of individual particles is evaluated. The local best ($pbest_i$) of each particle is set to be the same as the initial particle. The global best ($gbest_k$) of a generation is the particle giving the least communication cost (smallest fitness function) in that generation. Further generations are evolved through a series of operations called swap operations [61]. The local best of each particle and the global best of a generation are modified if the corresponding values in the current generation are lower than the values in the previous generation. For a particle p, the router associated with a core is identified by the position index of the core in p. The position index takes values between 0 and N − 1 (N being the number of routers). The index corresponds to the router number, as shown in Fig. 4.
Router number: 0 1 2 3 4 5 6 7
Core number:   1 4 3 6 2 8 5 7
Fig. 4. Particle structure [60].
Let the swap operator be $SO_{j,k}$ (where $j, k = 0, 1, \ldots, N-1$) that swaps the jth and kth positions of the particle p to create a new particle $p^{new}$. For example, let us consider the particle p = {1, 4, 3, 6, 2, 8, 5, 7}, where the numbers represent the core numbers of the core graph and the positions represent the router numbers in the topology graph. The swap operator $SO_{4,6}$ swaps the cores at positions 4 and 6, which creates a new particle $p^{new}$ = {1, 4, 3, 6, 5, 8, 2, 7}. A swap sequence SS is made up of one or more swap operators. The swap operators of the swap sequence are applied, in order, upon the particle p to create a new particle $p^{new}$. For example, let the swap sequence SS = {$SO_{4,6}$, $SO_{2,5}$} be applied upon the particle p = {1, 4, 3, 6, 2, 8, 5, 7}. It creates a new particle $p^{new}$ = {1, 4, 8, 6, 5, 3, 2, 7}. To align a particle $p_i$ with its local best, the swap sequence is identified. Let this be $SS_i^{lbest}$. Then another swap sequence is identified to align the particle with the global best. Let this be $SS_i^{gbest}$. Now the swap sequence $SS_i^{lbest}$ is applied on particle $p_i$ with a probability of α [62]. Let the modified particle be $p_i^{lbest}$. Then the swap sequence $SS_i^{gbest}$ is applied on $p_i^{lbest}$ with a probability of β [62]. This creates a new particle $p_i^{new}$. Its fitness is evaluated and the local best is updated for particle i, if it is better than the previous local best for the particle. If the best fitness in a generation is better than the global best of the previous generation, the global best is also updated.

The Ant Colony Optimization (ACO) technique [63] is a population based probabilistic technique developed by A. Colorni and M. Dorigo in 1991, inspired by the biological behaviour of ants in finding paths from the colony to a food source. When one ant finds a good path from the colony to a food source, other ants are more likely to follow that path, and positive feedback eventually leads to all the ants following a single path.
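The swap operator and swap sequence used in the PSO evolution step described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation of [60]: the function names are hypothetical, and applying each swap sequence as a whole with the given probability is one reading of the description (applying each operator independently with that probability is another plausible reading).

```python
import random


def apply_swap_sequence(particle, swap_sequence):
    """Apply swap operators (j, k), in order, to a copy of the particle."""
    new_particle = list(particle)
    for j, k in swap_sequence:
        new_particle[j], new_particle[k] = new_particle[k], new_particle[j]
    return new_particle


def alignment_sequence(particle, target):
    """Return a swap sequence that turns `particle` into `target`."""
    current = list(particle)
    sequence = []
    for j in range(len(current)):
        if current[j] != target[j]:
            k = current.index(target[j])
            current[j], current[k] = current[k], current[j]
            sequence.append((j, k))
    return sequence


def evolve(particle, pbest, gbest, alpha, beta, rng=random):
    """One evolution step: move towards the local best, then the global best.

    Assumption: each sequence is applied in full with probability alpha
    (local best) or beta (global best).
    """
    ss_lbest = alignment_sequence(particle, pbest)
    ss_gbest = alignment_sequence(particle, gbest)
    new_particle = list(particle)
    if rng.random() < alpha:
        new_particle = apply_swap_sequence(new_particle, ss_lbest)
    if rng.random() < beta:
        new_particle = apply_swap_sequence(new_particle, ss_gbest)
    return new_particle


# The two examples from the text (positions are 0-based router numbers).
p = [1, 4, 3, 6, 2, 8, 5, 7]
print(apply_swap_sequence(p, [(4, 6)]))          # [1, 4, 3, 6, 5, 8, 2, 7]
print(apply_swap_sequence(p, [(4, 6), (2, 5)]))  # [1, 4, 8, 6, 5, 3, 2, 7]
```

Since every operator only exchanges two positions, any particle produced this way remains a valid permutation of cores, so no repair step is needed after an update.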
ACO is itself a meta-heuristic optimization technique. An Ant Colony Optimization (ACO) based algorithm has been proposed in [64] for application task mapping onto NoC to minimize the bandwidth requirement. The results have been compared with random mapping techniques.

4.2.2.2. Constructive heuristics. In constructive heuristics, partial solutions are generated sequentially, and at the end the final mapping solution is obtained. A constructive heuristic may be constructive without iterative improvement or constructive with iterative improvement. Constructive heuristic search techniques are normally much faster than the transformative heuristics.

4.2.2.2.1. Constructive heuristic without iterative improvement. A constructive heuristic without iterative improvement maps the cores of a core graph, one at a time, onto the NoC topology graph by selecting the cores based on some predefined criteria. The position of a core does not change once the core has been placed. No optimization technique is applied upon the initial solution to arrive at a better solution. PMAP, a two-phase mapping algorithm for placing clusters onto processors, has been presented in [65], where highly communicating clusters are placed on adjacent nodes of the processor network. Each cluster contains all tasks which are to be executed on the same processor, with zero interconnection overhead, to increase parallelism. UMARS, a unified mapping, routing and slot allocation algorithm presented in [66], couples mapping, path allocation and time-slot allocation to minimize communication energy. This technique maps cores onto the NoC topology, routes the communication and allocates TDMA time-slots on network channels so that application constraints are met. SMAP [67] is a simulation based environment which performs application mapping and task routing for 2-D mesh-based NoC to minimize execution time and communication energy.
In this technique, the highest priority task is mapped at the centre and the other tasks are mapped spirally [68] outwards from the already mapped tasks to the boundaries of the mesh based NoC by placing
Fig. 5. Application mapping onto NoC [67,68].
highly communicating cores as close as possible to each other (Fig. 5). An efficient binomial IP mapping and optimization algorithm (BMAP) has been presented in [69] to reduce the hardware cost of the on-chip network. It is a very fast and efficient algorithm with lower computational complexity than NMAP [75] (discussed in Section 4.2.2.2.2). The binomial mapping comprises three steps: IP ranking, merging of IP sets, and refreshing of IP sets. IP ranking depends upon the communication bandwidth of the IPs. The communication bandwidth of an IP is the sum of the bandwidth from it to other IPs and from other IPs to it. Depending on the IP ranking, the most heavily communicating IP sets are merged two-by-two in every iteration, as shown in Fig. 6. The new requirements of the merged IP sets are recalculated by taking each IP set as an individual IP, which refreshes the IP sets. CHMAP [70] is a chain-mapping algorithm that produces chains of connected cores and uses them to map applications onto mesh-based NoC. CMAP [71] is a fast constructive application mapping algorithm that maps tasks onto NoC, minimizing total communication cost and energy. It is a hybrid of two constructive mapping algorithms: link-based mapping (LBMAP) and sort-based mapping (SBMAP). After comparing the results of these two, the better one is taken as the output. RMAP, a reliability-aware application mapping technique for mesh-based NoC, has been proposed in [72]. It divides the application graph into two sub-graphs such that the communication traffic between the sub-graphs is minimized and the traffic within each sub-graph is maximized. Then one sub-graph is mapped onto the upper triangular nodes of the NoC and the other is mapped onto the lower triangular nodes. This technique utilizes the non-uniformity of traffic distribution over the network channels to efficiently route the packets of redundant communications. In [73], all the nodes and the interconnections among nodes of a 2-D mesh-based NoC are abstracted as a tree.
In this tree model, the vertex with the highest communication volume is selected as the root node. The vertices communicating with the root are the children of that node, and so on. During mapping,
Fig. 6. An example of binomial merging (N = 16) iterations [69]. Panels: Iteration 1, Iteration 2, Iteration 3, Iteration 4.
the root node is placed at the centre of the mesh-based NoC, and the traversal is made from the centre towards the borders of the NoC. The children nodes are placed according to the tree structure and the communication volume of the interconnects, from the centre towards the borders. CastNet, an energy-aware application mapping and routing technique for 2-D NoC, has been proposed in [74]. Before mapping, a priority list for the tasks is formed based on their total communication bandwidth and average communication bandwidth. Depending on the priority list, the initial task is selected. If there is a tie, the task is selected randomly. The next task selected for mapping is the one that communicates most with the mapped tasks. If again there is a tie, then the higher priority task among them is selected for mapping. For mapping the first task, a set of initial node positions is selected (Fig. 7). A set of solutions is generated by this technique, one for each initial node position of the initial task. The remaining tasks are placed on the nodes of the NoC according to the priority list. After each mapping the priority list is updated. Finally, from the set of solutions, the best one is taken as the solution for mapping of the application onto the NoC.

4.2.2.2.2. Constructive heuristic with iterative improvement. In this case, the cores of the core graph are mapped onto the NoC topology graph one at a time, based on some predefined criteria, to generate an initial solution. Then an iterative improvement is applied to the initial constructed solution to find better candidate solutions. In [75], NMAP, a mapping technique with minimum path routing in the mesh architecture, has been proposed; it satisfies the bandwidth constraint and minimizes the average communication delay. The proposed heuristic has three phases. In the initialization phase, the core having maximum communication demand is mapped to a node having the maximum number of neighbours.
Then the core having the most communication demand with the already mapped cores is selected for mapping. The selected core is mapped to the node that minimizes the communication cost, that is, (hop count × bandwidth), with the mapped cores. This node is obtained by examining every available node in the mesh. The procedure is continued till all the cores are mapped. In the next phase, Dijkstra's shortest path algorithm is applied to the quadrant graph for minimum path computation with satisfaction of bandwidth constraints. In the last phase, the initial solution is improved iteratively by invoking the second phase for each pair-wise swapping of mapped cores. It also proposes traffic splitting, which considers the mapping problem together with the possibility of splitting traffic among various paths. For various benchmark applications, NMAP produces better results than the mapping algorithms reported before it. A tool, SUNMAP, has been presented in [76] to automatically select the best standard topology for a given application and produce a mapping of cores onto that topology. It minimizes the average communication delay, area and power dissipation subject to bandwidth and area constraints. MOCA, a two-phase heuristic for low energy mesh based on-chip interconnection architecture, has been proposed in [77] to reduce the communication energy considering the bandwidth and latency constraints. In the first phase, the cores are mapped to different routers of the mesh by invoking a bi-partitioning based slicing tree generation technique. In the second phase, it attempts to find a minimal path from source to destination for each traffic trace. It does not give good solutions when latency constraints are considered.

All the mapping techniques proposed previously use the communication weighted model (CWM) to account for the overall communication volume of each channel. It does not consider communication timing. To capture both the timing of application communication and the communication volume, the communication dependence and computation model (CDCM) has been proposed in [78,79], which maps applications on regular NoC under bandwidth constraint and minimizes average communication delay. The same authors in [80] have compared different algorithms for obtaining low energy mappings onto NoCs using a CWM. They have also proposed two heuristics, largest communication first (LCF) and greedy incremental (GI), for low energy mapping using CWM. The CWM counts the dynamic energy only when there is a bit transition. However, traffic without bit transitions also consumes dynamic energy. Therefore, to overcome the problems of CWM, the same authors have proposed an extended communication weighted model (ECWM) [81], which captures both the volume of communication and the bit transition rate in each communication channel.

A Simulated Annealing (SA) based application mapping technique has been proposed in [82] for 2-D mesh based NoC, which minimizes the area requirement and the maximum bandwidth. It also proposes an efficient routing algorithm which selects a route among alternative paths based on the network state and occupancy of queues. A cluster based technique combined with simulated annealing has been proposed in [83] for application mapping onto 2-D mesh-based NoC. In this technique, mapping is done cluster-wise, instead of node-wise, to reduce the mapping complexity.
Clustering is a technique to partition nodes into groups according to the physical distance among them in the network topology. Clustering exploits the knowledge about the network architecture and the communication demand of applications. So in this mapping technique, first a cluster-based core-to-node initial mapping is done, and then a simulated annealing technique is applied upon it to find a good mapping solution. In [84–86], the authors have analyzed different approaches to minimize the total communication energy by inserting some permissible longer links and by-passing some routers of an application-specific NoC. In this process, by network partitioning, the area cost is reduced by reducing both the router area and the number of links. In [86], the authors have proposed an efficient methodology to choose the most power efficient application-specific NoC architecture. In this paper, the authors have compared different topologies taking only one application benchmark and reported the best one, but that topology may not be good for other applications. Topology design is one of the significant factors that affect the net delay and energy consumption of an application-specific NoC. The topology must satisfy the design constraints. For application mapping of streaming applications with a very high I/O rate, a guaranteed, high throughput pipelined mechanism for NoC is introduced in [87]. In this paper, the authors have proposed a pipeline-based high throughput low energy mapping algorithm which performs task allocation, pipelined task
Fig. 7. Candidate for initial core selection [74]. (a) Selected region for initial candidate core. (b) Initial candidate core (4 × 4 mesh). (c) Initial candidate core (5 × 5 mesh).
scheduling, and communication scheduling simultaneously on the heterogeneous NoC and minimizes the energy consumption. Onyx, a new bandwidth constrained application mapping technique, has been presented in [88] to minimize the overall communication cost of NoC. In this technique, the core with the highest communication bandwidth is mapped at the centre. Then the ranking of
Fig. 8. Concept of Lozenge-shape path selection [88].
Fig. 9. Zigzag path for core mapping [89].
other unmapped cores is settled according to their communication volume with the mapped cores. Each unmapped core is placed at the nearest possible distance from its related core by searching along the lozenge-shaped path at one hop distance, then two hop distance, and so on, till an empty tile is identified (Fig. 8). In [89], Crinkle, a mapping algorithm, has been presented to reduce the overall communication cost. In this technique, priority lists are prepared depending on the interconnection degree of nodes and the communication bandwidth before mapping onto mesh based NoC. Depending on the priority lists, the heuristic maps the tasks starting from one corner of the 2-D mesh platform and ending at another corner in a zigzag manner (Fig. 9). In [90,91], a power-aware template-based efficient mapping (TEM) algorithm for NoC has been proposed to generate good mapping solutions with low run time under bandwidth and latency constraints. In this mapping technique, a core having the highest connectivity is called a hot core and is mapped onto the tile having the maximum number of neighbour tiles. In an application core graph there is at least one hot core. Once all the hot cores are mapped, the remaining unmapped cores are mapped in decreasing order of the weight of the edges connecting them, with minimum hop distance to an already mapped hot core. An Architecture-Aware Analytic Mapping algorithm (A3MAP) for NoC with homogeneous and heterogeneous cores on regular and irregular mesh or custom architecture has been proposed in [50]. The task mapping problem is solved by a successive relaxation algorithm, and a genetic algorithm is applied upon this to find better mapping solutions. Citrine, a two-step 2-D mesh mapping algorithm, has been proposed in [92]; it uses the mapping technique Onyx [88] to retrieve the order of cores, and then a Branch-and-Bound search explores different permutations following the lozenge-shaped rule of Onyx [88].
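The lozenge-shaped search used by Onyx [88] and reused by Citrine [92], as described above, amounts to scanning tiles at increasing hop distance from a reference tile until a free one is found. The following Python sketch illustrates that idea only. The function names are hypothetical, and the order in which tiles on the same lozenge are tried (sorted by row, then column) is an assumption, since the tie-breaking rule of [88] is not given here.

```python
def lozenge_ring(center, distance, rows, cols):
    """Tiles of a rows x cols mesh at exactly `distance` hops from `center`."""
    r0, c0 = center
    ring = []
    for dr in range(-distance, distance + 1):
        dc = distance - abs(dr)          # remaining hops go along the row
        for c in {c0 - dc, c0 + dc}:     # set removes the duplicate when dc == 0
            r = r0 + dr
            if 0 <= r < rows and 0 <= c < cols:
                ring.append((r, c))
    return sorted(ring)                  # assumed tie-breaking order


def nearest_free_tile(center, occupied, rows, cols):
    """Return the closest unoccupied tile to `center`, or None if the mesh is full."""
    max_distance = (rows - 1) + (cols - 1)
    for distance in range(1, max_distance + 1):
        for tile in lozenge_ring(center, distance, rows, cols):
            if tile not in occupied:
                return tile
    return None


# Hypothetical 3 x 3 mesh with the first core already placed at the centre.
print(nearest_free_tile((1, 1), {(1, 1)}, rows=3, cols=3))
```

Scanning by rings guarantees that the returned tile has the minimum hop distance to the reference tile, which is the property the placement rule relies on.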
In [93], the authors have proposed a two-step multi-application mapping algorithm that maps multiple applications simultaneously onto different regions of the NoC to minimize network latency and energy consumption for a set of applications. The algorithm consists of an application mapping phase followed by a task mapping phase. The application mapping phase optimizes the layout of the multiple applications on the NoC and finds a region with minimal Nodes Average Distance (NAD) for each application. After the application mapping phase, the role of the task mapping phase is to map the tasks of each application so that the individual as well as the overall average communication
Level-0 (initial graph): C1 to C16.
Level-1 partitions (Partition IDs 0 and 1): {C8, C9, C10, C11, C12, C13, C14, C15}, {C1, C2, C3, C4, C5, C6, C7, C8}.
Level-2 partitions (Partition IDs 0 to 3): {C8, C9, C10, C12}, {C11, C13, C14, C15}, {C5, C6, C7, C16}, {C1, C2, C3, C4}.
Level-3 partitions (Partition IDs 0 to 7): {C9, C12}, {C8, C10}, {C11, C15}, {C13, C14}, {C6, C7}, {C5, C16}, {C1, C2}, {C3, C4}.
Fig. 10. Partitions at all levels for core graph VOPD [94].
Fig. 11. Initial mapping for VOPD application [94]. (Mesh nodes U1 to U16 holding the Level-3 partition pairs {C12, C9}, {C15, C11}, {C8, C10}, {C13, C14}, {C7, C6}, {C2, C1}, {C16, C5} and {C4, C3}, with the Level-1, Level-2 and Level-3 partitions marked.)
distance is minimized. The task mapping of each application follows the tree-model based mapping described in [73]. In [94], LMAP, a mapping algorithm, has been proposed to reduce both static and dynamic cost of mesh based NoC. This is a three-phase mapping algorithm. The first is a partitioning phase, in which a Kernighan–Lin (KL) partitioning scheme [95] is used to identify the closeness of cores by analyzing their bandwidth or communication requirements. This bi-partitioning is applied recursively until the two closest cores are left in each final partition, as shown in Fig. 10 for the VOPD application (Fig. 12b). The second phase is the initial mapping. In this phase, the placement of the cores of an application onto a mesh connected NoC is performed based on the final KL partitioning result (Fig. 11). After the initial mapping, an iterative improvement phase is applied by swapping and flipping cores within partitions to arrive at a final mapping solution.

All the application mapping techniques discussed above are based on mesh based network architecture. However, it is also essential to check the suitability of other network topologies when applications are mapped onto them. In [96], an energy-aware mapping technique has been proposed which maps the IPs onto a tree based NoC architecture such that the total communication energy is minimized. In this technique, an energy-aware mapping problem is first formulated, and then a recursive bi-partitioning algorithm is used to solve it. An application mapping heuristic has been proposed in [97] for generating an optimal tree based topology for multimedia applications to minimize energy consumption while meeting the design constraints. Application mapping techniques have been proposed in [98] and [99] to map applications onto Butterfly Fat-Tree (BFT) and Mesh-of-Tree (MoT) based NoC, respectively.
In these techniques, a Kernighan–Lin (KL) partitioning scheme has been used, as in [94], to identify the closeness of cores by analyzing their bandwidth or communication requirements.

5. Performance comparison

An application can be represented in the form of a core graph [75], defined as follows.

Definition 1. The core graph for an application is a directed graph $G(C, E)$, with each vertex $c_i \in C$ representing a core and the directed edge $e_{i,j} \in E$ representing the communication between the cores $c_i$ and $c_j$. The weight of edge $e_{i,j}$, denoted by $comm_{i,j}$, represents the bandwidth requirement of the communication from $c_i$ to $c_j$.

On the other hand, the given NoC topology can be represented in the form of a topology graph [75].

Definition 2. The NoC topology graph is a directed graph $P(U, F)$, with each vertex $u_i \in U$ representing a node in the topology and the directed edge $f_{i,j} \in F$ representing a direct communication between the vertices $u_i$ and $u_j$. The weight of the edge $f_{i,j}$, denoted as $bw_{i,j}$, represents the bandwidth available across the edge $f_{i,j}$.

A mapping of the core graph $G(C, E)$ onto the topology graph $P(U, F)$ is defined by the function $map: C \rightarrow U$, such that $\forall c_i \in C$, $\exists u_j \in U$ and $map(c_i) = u_j$. The function associates core $c_i$ to router $u_j$. Naturally, mapping is defined only when $|C| \le |U|$. The quality of such a mapping is defined in terms of the total communication cost of the application under this mapping [75]. The communication between each pair of cores can be treated as the flow of a single commodity $d^k$, $k = 1, 2, \ldots, |E|$. The value of commodity $d^k$, corresponding to the communication between cores $c_i$ and $c_j$, is equal to $comm_{i,j}$, the bandwidth requirement. If $c_i$ is mapped to the router $map(c_i)$ and $c_j$ is mapped to $map(c_j)$, the set of all commodities $D = \{d^k\}$ is defined as follows:
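$$D = \{\, d^k \mid \mathrm{value}(d^k) = comm_{i,j},\ \text{for } k = 1, 2, \ldots, |E| \text{ and } e_{i,j} \in E \,\}$$

Also,

$$\mathrm{source}(d^k) = map(c_i) \quad \text{and} \quad \mathrm{sink}(d^k) = map(c_j)$$

The link between two individual routers $u_i$ and $u_j$ of the topology has a maximum bandwidth of $bw_{i,j}$. The total commodity flowing through such a link should not exceed this bandwidth. The quantity $x^k_{i,j}$, indicating the value of commodity $d^k$ flowing through the link $(u_i, u_j)$, is given by

$$x^k_{i,j} = \begin{cases} \mathrm{value}(d^k), & \text{if } \mathrm{link}(u_i, u_j) \in \mathrm{Path}(\mathrm{source}(d^k), \mathrm{sink}(d^k)) \\ 0, & \text{otherwise} \end{cases}$$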
where Path (a, b) indicates the deterministic routing path between the mesh nodes a and b in the topology. Satisfaction of bandwidth limitations of individual links must be ensured. That is, all mapping solutions should satisfy the following relation.
Fig. 12. Application core graphs with communication bandwidth (MB/s). Panels: (a) DVOPD, (b) VOPD, (c) MPEG-4, (d) PIP, (e) MWD, (f) 263enc mp3dec, (g) mp3enc mp3dec, (h) 263dec mp3dec.
$$\sum_{k=1}^{|E|} x^k_{i,j} \le bw_{i,j}, \quad \text{for all } i, j \in \{1, 2, \ldots, |U|\}$$

If all bandwidth constraints are satisfied, the communication cost of a mapping solution is given by

$$\sum_{k=1}^{|E|} \mathrm{value}(d^k) \times \mathrm{hopcount}(\mathrm{source}(d^k), \mathrm{sink}(d^k))$$
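As an illustration of the cost expression and the bandwidth constraint, the following Python sketch evaluates a mapping on a 2-D mesh. Several points are assumptions made for this sketch rather than details taken from the surveyed works: XY routing is used as the deterministic shortest-path routing, all links share one bandwidth limit `link_bw`, and the three-core graph and its placement are hypothetical.

```python
def xy_path(src, dst):
    """Directed links used by XY routing between two mesh tiles (row, col).

    Assumption: the packet first travels along the row (column index changes),
    then along the column (row index changes).
    """
    (r, c), (r_dst, c_dst) = src, dst
    links = []
    while c != c_dst:
        next_c = c + (1 if c_dst > c else -1)
        links.append(((r, c), (r, next_c)))
        c = next_c
    while r != r_dst:
        next_r = r + (1 if r_dst > r else -1)
        links.append(((r, c), (next_r, c)))
        r = next_r
    return links


def evaluate_mapping(comm, mapping, link_bw):
    """Return (communication cost, feasibility) of a mapping.

    comm:    {(core_i, core_j): bandwidth}, the edges of the core graph
    mapping: {core: (row, col)}, the map function
    link_bw: bandwidth available on every link (assumed uniform)
    """
    cost = 0
    load = {}
    for (core_i, core_j), bandwidth in comm.items():
        path = xy_path(mapping[core_i], mapping[core_j])
        cost += bandwidth * len(path)            # value(d^k) x hopcount
        for link in path:                        # accumulate x^k_{i,j} per link
            load[link] = load.get(link, 0) + bandwidth
    feasible = all(total <= link_bw for total in load.values())
    return cost, feasible


# Hypothetical core graph (MB/s) placed on a 2 x 2 mesh.
comm = {("a", "b"): 100, ("b", "c"): 50, ("a", "c"): 20}
mapping = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
print(evaluate_mapping(comm, mapping, link_bw=150))  # (190, True)
print(evaluate_mapping(comm, mapping, link_bw=100))  # (190, False)
```

In the second call the link from (0, 0) to (0, 1) carries 120 MB/s, which exceeds the 100 MB/s limit, so the mapping is rejected even though its cost is unchanged.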
In the cost expression above, hopcount(a, b) is the number of hops between the topology nodes a and b. For a deterministic shortest path routing, hopcount corresponds to the minimum number of hops between the constituent nodes. Since the communication cost depends strongly on the mapping solution, the overall mapping problem is to minimize the communication cost while ensuring that the bandwidth constraints of all individual links are satisfied. The communication cost affects the performance of the overall system and its energy consumption, as both of these factors are directly proportional to the total hopcount. The application mapping results are generally reported on a set of benchmarks. Fig. 12 shows a number of such benchmark applications. Most of the existing tools have reported results on three benchmarks: VOPD, MPEG-4 and PIP. Table 1 lists those absolute mapping results, along with the cost normalized to NMAP and the average communication cost relative to NMAP. To compare other
benchmarks (DVOPD, MWD, 263enc mp3dec, mp3enc mp3dec, 263dec mp3dec), we have implemented the NMAP algorithm and ILP ourselves. As expected, ILP based exact methods achieve the best results. The evolutionary approaches also do quite well. In particular, PSMAP obtains the same results as ILP. The results of LMAP and PSMAP are available from our existing works [94,60]. Tables 2 and 3 list these results along with the cost normalized to NMAP. Here also it can be observed that PSMAP produces almost the same results as ILP with less CPU time. NMAP and LMAP results are very close. ILP could not be run on the 32-core DVOPD example, as the run-time becomes unacceptably high. All the algorithms are run on an Intel Core i5 platform with 4 GB main memory and 2.4 GHz clock frequency. The CPU times needed by each of the techniques for individual benchmarks are noted in Tables 2 and 3. The PSMAP algorithm has been run with at most 200 particles for at most 100 generations without improvement. Since the results reported in the literature are for applications with a small number of cores, we have used the TGFF tool [100] to generate a few task graphs with 64 and 128 cores. By varying the bandwidth, the number of start nodes and the in-out degree of nodes, different task graphs have been generated via TGFF. The bandwidths are varied from 10 to 1500 MB/s for some graphs and from 50 to 150 MB/s for other graphs. The in-out degrees of nodes are varied from 1 to 8 to generate both low and high communication graphs. The number of start nodes is also varied to generate different
Table 1
Absolute communication cost, cost normalized to NMAP, and average communication cost relative to NMAP.

| Mapping technique | VOPD: absolute comm. cost (hops × BW) | VOPD: cost normalized to NMAP | MPEG-4: absolute comm. cost (hops × BW) | MPEG-4: cost normalized to NMAP | PIP: absolute comm. cost (hops × BW) | PIP: cost normalized to NMAP | Average communication cost relative to NMAP |
|---|---|---|---|---|---|---|---|
| ILP based exact mapping techniques | | | | | | | |
| ILP [31] | 4119.0 | 0.966 | 3567.0 | 0.971 | - | - | 0.972 (a) |
| Cluster + ILP [32] | 4205.0 | 0.986 | 3567.0 | 0.971 | - | - | 0.979 |
| Deterministic search techniques | | | | | | | |
| GMAP [33-35] | 5553.0 | 1.302 | 7849.0 | 2.137 | 704.0 | 1.10 | 1.513 |
| PBB [33-35] | 4317.0 | 1.012 | 3763.0 | 1.025 | 640.0 | 1.0 | 1.012 |
| Elixir [37] | 4249.0 | 0.996 | 3640.0 | 0.991 | - | - | 0.994 |
| GA based transformative heuristic techniques | | | | | | | |
| CGMAP [45,46] | 4300.0 | 1.008 | 3600.0 | 0.980 | - | - | 0.994 |
| GBMAP [48] | 4217.0 | 0.989 | 3572.0 | 0.973 | - | - | 0.981 |
| GAMR [49] | - | - | 3772.0 | 1.027 | - | - | 1.027 |
| A3MAP-GA [50] | 4141.0 | 0.971 | - | - | - | - | 0.971 |
| PSO and ACO based transformative heuristic techniques | | | | | | | |
| PSMAP [60] | 4119.0 | 0.966 | 3567.0 | 0.971 | 640.0 | 1.0 | 0.940 (b) |
| ACO [64] | - | - | 3633.0 | 0.989 | - | - | 0.989 |
| Constructive heuristic without iterative improvement | | | | | | | |
| PMAP [65] | 7054.0 | 1.654 | 6128.0 | 1.669 | 832.0 | 1.30 | 1.541 |
| BMAP [69] | 4351.0 | 1.020 | 6280.0 | 1.710 | - | - | 1.365 |
| CHMAP [70] | 4249.0 | 0.996 | 3977.0 | 1.083 | - | - | 1.040 |
| CMAP [71] | 4281.0 | 1.004 | 3704.0 | 1.009 | - | - | 1.006 |
| CastNet [74] | 4135.0 | 0.969 | 3852.0 | 1.049 | - | - | 1.009 |
| Constructive heuristic with iterative improvement | | | | | | | |
| A3MAP-SR [50] | 4265.0 | 1.0 | - | - | - | - | 1.0 |
| NMAP [75] | 4265.0 | 1.0 | 3672.0 | 1.0 | 640.0 | 1.0 | 1.0 (b) |
| MOCA [77] | - | - | 5246.0 | 1.429 | - | - | 1.429 |
| SA [83] | 4231.0 | 0.992 | - | - | - | - | 0.992 |
| CSA [83] | 4169.0 | 0.977 | - | - | - | - | 0.977 |
| Onyx [88] | 4249.0 | 0.996 | 3612.0 | 0.984 | - | - | 0.990 |
| LMAP [94] | 4189.0 | 0.982 | 4006.0 | 1.091 | 640.0 | 1.0 | 0.983 (b) |

(a) Average communication cost relative to NMAP taking the values from Tables 1-3.
(b) Average communication cost relative to NMAP taking the values from Tables 1-4.
Table 2
Absolute communication cost, cost normalized to NMAP and CPU time for different applications with their corresponding algorithms.

| Mapping algorithm | DVOPD: comm. cost (hops × BW) | DVOPD: CPU time in s | DVOPD: cost normalized to NMAP | VOPD (a): comm. cost (hops × BW) | VOPD (a): CPU time in s | MPEG-4 (a): comm. cost (hops × BW) | MPEG-4 (a): CPU time in s | PIP (a): comm. cost (hops × BW) | PIP (a): CPU time in s |
|---|---|---|---|---|---|---|---|---|---|
| NMAP | 10253.0 | 0.380 | 1.0 | 4265.0 | 0.024 | 3672.0 | 0.016 | 640.0 | 0.010 |
| LMAP | 9974.0 | 1.660 | 0.973 | 4189.0 | 0.040 | 4006.0 | 0.040 | 640.0 | 0.010 |
| PSMAP | 9752.0 | 14.287 | 0.951 | 4119.0 | 0.260 | 3567.0 | 0.040 | 640.0 | 0.010 |
| ILP | - | - | - | 4119.0 | 4474.730 | 3567.0 | 21.530 | 640.0 | 1.280 |

(a) Cost normalized to NMAP shown in Table 1.
Table 3
Absolute communication cost, cost normalized to NMAP and CPU time for different applications with their corresponding algorithms.

| Mapping algorithm | MWD: comm. cost (hops × BW) | MWD: CPU time in s | MWD: cost normalized to NMAP | 263enc mp3dec: comm. cost (hops × BW) | 263enc mp3dec: CPU time in s | 263enc mp3dec: cost normalized to NMAP | mp3enc mp3dec: comm. cost (hops × BW) | mp3enc mp3dec: CPU time in s | mp3enc mp3dec: cost normalized to NMAP | 263dec mp3dec: comm. cost (hops × BW) | 263dec mp3dec: CPU time in s | 263dec mp3dec: cost normalized to NMAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NMAP | 1184.0 | 0.016 | 1.0 | 230.407 | 0.012 | 1.0 | 18.171 | 0.016 | 1.0 | 20.073 | 0.016 | 1.0 |
| LMAP | 1248.0 | 0.030 | 1.054 | 230.417 | 0.040 | 1.0 | 17.856 | 0.040 | 0.983 | 20.058 | 0.040 | 0.999 |
| PSMAP | 1120.0 | 0.020 | 0.946 | 230.407 | 0.268 | 1.0 | 17.021 | 0.320 | 0.937 | 19.823 | 0.260 | 0.987 |
| ILP | 1120.0 | 200.510 | 0.946 | 230.407 | 191.910 | 1.0 | 17.021 | 1432.430 | 0.937 | 19.823 | 4895.250 | 0.987 |
Table 4
Absolute communication cost, cost normalized to NMAP and CPU time for different TGFF task graphs (64 cores and 128 cores) with their corresponding algorithms.

| TGFF task graph | NMAP: comm. cost (hops × BW) | NMAP: CPU time in s | LMAP: comm. cost (hops × BW) | LMAP: CPU time in s | LMAP: cost normalized to NMAP | PSMAP: comm. cost (hops × BW) | PSMAP: CPU time in s | PSMAP: cost normalized to NMAP |
|---|---|---|---|---|---|---|---|---|
| Graph 1 | 9207.49 | 8.43 | 7441.40 | 6.43 | 0.808 | 8380.79 | 13.87 | 0.910 |
| Graph 2 | 132,292.38 | 12.37 | 128,174.0 | 9.87 | 0.969 | 115,797.09 | 23.63 | 0.875 |
| Graph 3 | 116,337.81 | 13.41 | 121,835.0 | 9.90 | 1.047 | 110,077.83 | 52.14 | 0.946 |
| Graph 4 | 55,244.17 | 12.09 | 51,344.40 | 9.75 | 0.929 | 50,947.07 | 35.76 | 0.922 |
| Graph 5 | 6015.28 | 12.96 | 6381.99 | 9.95 | 1.061 | 5949.09 | 42.22 | 0.989 |
| Graph 6 | 44,902.16 | 12.93 | 44,005.10 | 9.74 | 0.980 | 42,086.60 | 36.56 | 0.937 |
| Graph 7 | 70,168.36 | 410.57 | 70,168.36 | 186.34 | 1.0 | 67,508.53 | 205.85 | 0.962 |
| Graph 8 | 503,767.47 | 548.63 | 477,572.87 | 320.56 | 0.948 | 453,078.22 | 567.74 | 0.899 |
| Graph 9 | 343,982.87 | 423.57 | 306,761.0 | 207.67 | 0.892 | 285,295.72 | 405.34 | 0.829 |
| Graph 10 | 82,744.31 | 451.50 | 80,746.20 | 137.83 | 0.976 | 73,940.90 | 584.36 | 0.894 |
graphs and to see the effect of mapping solutions upon them. The 64-core NoCs are implemented as 8 × 8 2-D meshes, and for the 128-core case the 2-D mesh is realized as 8 × 16. We have implemented the mapping algorithms NMAP [75], LMAP [94], and PSMAP [60] for these 64- and 128-core graphs; the mapping solutions and the cost normalized to NMAP for these graphs are noted in Table 4. PSMAP produces consistently good results, whereas the results produced by NMAP and LMAP are comparable.
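As an illustration of how the figures in the comparison tables are obtained, the following minimal sketch computes a communication cost as the sum, over all core-graph edges, of bandwidth times hop count, and normalizes a cost to a reference such as NMAP. It assumes a 2-D mesh with routers numbered row by row and deterministic shortest-path routing, so that the hop count is the Manhattan distance. The function names, the toy core graph, and the two placements are hypothetical and serve only as an example; link bandwidth constraints are not checked here.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def hopcount(a: int, b: int, cols: int) -> int:
    """Hops between routers a and b on a 2-D mesh with `cols` columns.

    Assumes row-major router numbering and shortest-path routing,
    so the hop count equals the Manhattan distance.
    """
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)


def communication_cost(core_graph: Dict[Edge, float],
                       mapping: Dict[str, int],
                       cols: int) -> float:
    """Sum of bandwidth x hop count over all core-graph edges (hops x BW)."""
    return sum(bw * hopcount(mapping[src], mapping[dst], cols)
               for (src, dst), bw in core_graph.items())


def normalized_to(cost: float, reference_cost: float) -> float:
    """Cost relative to a reference mapping, e.g. the one produced by NMAP."""
    return cost / reference_cost


# Hypothetical 4-core graph (bandwidths in MB/s) on a 2 x 2 mesh.
toy_graph = {("c0", "c1"): 100.0, ("c1", "c2"): 50.0, ("c0", "c3"): 10.0}
good = {"c0": 0, "c1": 1, "c2": 3, "c3": 2}   # heavy edge placed on adjacent routers
poor = {"c0": 0, "c1": 3, "c2": 1, "c3": 2}   # heavy edge placed across the diagonal

cost_good = communication_cost(toy_graph, good, cols=2)
cost_poor = communication_cost(toy_graph, poor, cols=2)
```

For the toy graph, the first placement costs 160 and the second 260, which shows how strongly the cost depends on where the high-bandwidth edge is placed. Applying `normalized_to(4119.0, 4265.0)` to the VOPD costs of PSMAP and NMAP gives 0.966 after rounding to three digits, the value listed in Table 1.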
6. Special mapping techniques

The design flow of NoC includes several parameters. Some special mapping techniques, such as routing-aware mapping and integrated mapping and scheduling, address highly correlated design problems that must be handled carefully to optimize different performance metrics.

6.1. Routing-aware mapping

CART, a Communication-Aware Routing Technique, has been proposed in [101,102] to optimize the network performance of application-specific NoCs. It combines a topology-agnostic routing algorithm with a communication-aware mapping technique that considers the bandwidth constraints. In [103], the same authors have proposed a core mapping technique based on source routing to achieve a mapping with a path length constraint. The path length constraint is met by a heuristic search that satisfies the distance restrictions between source and destination. A multi-objective optimization strategy has been proposed in [104], which determines the Pareto-optimal NoC configurations to optimize the average delay of the network and the routing robustness. In this technique, topological mapping and routing are considered concurrently. An application-specific routing algorithm (APSRA) has been proposed in [105,106] to maximize the communication performance for an application after mapping onto the NoC. The algorithm can be applied to both deterministic and adaptive routing schemes. It can be used on any network topology and on both homogeneous and heterogeneous 2-D mesh connected NoC systems. Given the mapping of cores to routers, APSRA generates a set of routing tables that guarantees both reachability and deadlock-free communication among the cores while maximizing the routing adaptability.

6.2. Integrated mapping and scheduling

The process of application mapping answers the question "where", but to answer "when", scheduling is required.
If multiple tasks of an application are mapped onto one core, task scheduling becomes necessary. Given an application task graph mapped onto a NoC architecture, scheduling is the time ordering of tasks and communications: it determines the order in which tasks and the transactions between them are executed such that deadlines are met and some parameters are optimized. In this light, an energy-aware communication and task scheduling has been proposed in [107,108], which maps tasks and statically schedules both communication transactions and computation tasks onto heterogeneous NoCs. It automatically assigns the tasks to different processing elements and schedules their execution under real-time constraints. Here the communication is non-streaming in nature, that is, tasks communicate at most once with each other. For streaming communication, in which tasks periodically and repeatedly communicate with each other, a time-constrained, resource-efficient routing and scheduling strategy for task mapping has been presented in [109]. It minimizes resource usage by exploiting all the scheduling freedom offered by the NoC. Quality of service (QoS) is an essential parameter for real-time and multimedia applications. To achieve this, a rate-based scheduling policy for NoC has been proposed in [110]. In this task mapping method, a data flow requiring QoS is admitted only if all the routers on the path from source to destination can transmit at the rate required by that flow. Each router then defines the priority of each QoS flow dynamically and locally, depending on the required rate and the rate currently used by the flow. A non-pre-emptive, static, traffic-aware scheduling has been proposed in [111], which maps the application tasks onto the NoC while keeping track of the network traffic and then schedules the computation and communication of tasks. A power-aware online scheduling has been proposed in [112] to minimize communication energy consumption.
For online scheduling, the communication status of an application task graph is analyzed at run time to implement real-time scheduling. The general mapping idea is to place cores that exchange more communication bandwidth closer together. This mapping can cause run-time traffic congestion if the communication is not properly scheduled. Without coordinated scheduling of both computation and communication, speculative mapping may not produce effective run-time behaviour. To handle this issue, a combined mapping and scheduling algorithm has been proposed in [113], which routes and schedules the transmissions during task mapping. In this technique, a routing-aware list-scheduling method schedules each task onto the best-fit processor, minimizing the overall execution time. None of the above mapping and scheduling techniques considers the effect of temperature during mapping. Temperature affects the performance, power, and reliability of the system. A temperature-aware task mapping and scheduling technique has been proposed in [114], which maps tasks using a heuristic and uses a floorplanning tool to reduce the peak temperature.
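The admission rule summarized in this section for the rate-based policy of [110], where a QoS flow is admitted only if every router on its path can transmit at the required rate, can be sketched as follows. This is a minimal illustration and not the algorithm of [110]: it assumes that each router has a fixed capacity, that the path of a flow is already known, and that admitted rates are simply reserved. The function name, router names, and dictionaries are hypothetical, and the dynamic per-router priority assignment is not modelled.

```python
from typing import Dict, List


def admit_flow(path: List[str],
               rate: float,
               capacity: Dict[str, float],
               reserved: Dict[str, float]) -> bool:
    """Admit a QoS flow only if every router on `path` can still carry `rate`.

    `capacity` holds the (assumed) maximum rate of each router and
    `reserved` the rate already granted to admitted flows. On success the
    rate is reserved on all routers of the path; on failure nothing changes.
    """
    for router in path:
        residual = capacity[router] - reserved.get(router, 0.0)
        if residual < rate:
            return False            # one router cannot sustain the rate: reject
    for router in path:
        reserved[router] = reserved.get(router, 0.0) + rate
    return True


# Hypothetical three-router path; rates in MB/s.
capacity = {"r0": 100.0, "r1": 100.0, "r2": 60.0}
reserved: Dict[str, float] = {}

first = admit_flow(["r0", "r1", "r2"], 50.0, capacity, reserved)   # fits everywhere
second = admit_flow(["r1", "r2"], 20.0, capacity, reserved)        # r2 has only 10 left
```

Checking the whole path before reserving anything keeps a rejected flow from leaving partial reservations behind, which is the property the all-routers condition relies on.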
Fig. 13. Design flow of SUNMAP [76] (Phase 1: mapping onto topologies; Phase 2: topology selection; Phase 3: SystemC files of the whole design generated by the xpipesCompiler).
7. Application mapping tools

In this section we present an overview of some of the application mapping tools. The tool SUNMAP [76] can map the cores of an application onto various network architectures and choose the most suitable one among them. It explores the available topologies (from a library) for a given application and performs synthesis around the best topology. The exploration of RTL-level NoC topologies attempts to minimize average communication delay, design area, and power dissipation subject to bandwidth and area constraints. The tool supports various routing techniques, such as dimension-ordered, minimum-path, traffic splitting across minimum paths, and traffic splitting across all paths. The design flow of the SUNMAP tool is shown in Fig. 13. It has three phases of operation. In the first phase, mapping onto various network topologies is performed, considering the routing functions, area constraints, power constraints, and topology library. For each mapping, the bandwidth and area constraints are evaluated. In the second phase, all the mappings produced in the first phase are evaluated for several design objectives and the best one is selected. In the third phase, SUNMAP generates a SystemC description of the network components using the xpipesCompiler [115,116] and the xpipes component library. The xpipesCompiler automatically instantiates network components, such as routers, links, and network interfaces, for a specific NoC topology using the xpipes library. The most innovative feature of xpipes is that all its components are highly parameterized and can be tailored at design time to the needs of a specific architecture.

SMAP [67] is an application mapping and simulation tool in the Matlab environment. It performs application mapping and task routing in a spiral fashion to enhance the performance of the NoC.
It provides a variety of algorithms for application mapping, task routing, and task scheduling for different NoC topologies and calculates a series of performance and cost metrics to select the best mapping onto the best NoC topology.

Some more tools have been reported in [117,119,121]. Selection of the network architecture for an application and the associated mapping of cores onto the NoC are major issues in high-performance NoC design. Considering these issues, a NoC topology exploration based mapping and simulation model has been presented in [117] to select the best NoC topology for an application and the mapping onto it. The IP mapping is automatically computed by the SCOTCH partitioning tool [118], respecting different design constraints. It performs static graph partitioning, mapping, and sparse matrix block ordering. SCOTCH allows the user to efficiently map any kind of weighted core graph onto any kind of topology graph with different design and topological constraints.

xENoC [119] is an experimental NoC environment for parallel and distributed computing on NoC based MPSoC architectures. It shares the capabilities of xpipes and SUNMAP to select the topology for an application. It is also inspired by NoCGEN [120] in customizing and selecting NoC parameters from different mapping, routing, and switching schemes. xENoC performs a complete HW/SW co-design to build an efficient distributed NoC based MPSoC design.

HeMPS [121] targets flexible application mapping strategies, fast design space exploration, and performance evaluation of the mapped application to select the best mapping for a NoC based MPSoC design. It supports both static and dynamic application mapping and provides a SystemC simulation model for the evaluation of performance and cost metrics.
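The topology exploration idea behind the tools in this section (map the application onto each topology of a library, then keep the best result) can be sketched as a toy example. The sketch below is not the SUNMAP or SCOTCH algorithm: it assumes communication cost (bandwidth times hop count) as the only objective, uses exhaustive search that is feasible only for a handful of cores, and ignores area, power, and link bandwidth constraints. The library contents, function names, and the star-shaped core graph are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, Tuple

CoreGraph = Dict[Tuple[str, str], float]
HopFn = Callable[[int, int], int]


def mesh_hops(cols: int) -> HopFn:
    """Hop count on a 2-D mesh with row-major numbering (Manhattan distance)."""
    def hops(a: int, b: int) -> int:
        return abs(a % cols - b % cols) + abs(a // cols - b // cols)
    return hops


def ring_hops(n: int) -> HopFn:
    """Hop count on a bidirectional ring of n routers."""
    def hops(a: int, b: int) -> int:
        d = abs(a - b)
        return min(d, n - d)
    return hops


def best_mapping(graph: CoreGraph, n_nodes: int, hops: HopFn):
    """Exhaustively search core placements; returns (cost, mapping)."""
    cores = sorted({c for edge in graph for c in edge})
    best_cost, best_map = float("inf"), None
    for perm in permutations(range(n_nodes), len(cores)):
        mapping = dict(zip(cores, perm))
        cost = sum(bw * hops(mapping[s], mapping[d])
                   for (s, d), bw in graph.items())
        if cost < best_cost:
            best_cost, best_map = cost, mapping
    return best_cost, best_map


def select_topology(graph: CoreGraph, library: Dict[str, Tuple[int, HopFn]]):
    """Map onto every topology in the library and keep the cheapest one."""
    results = {name: best_mapping(graph, n, hops)
               for name, (n, hops) in library.items()}
    winner = min(results, key=lambda name: results[name][0])
    return winner, results[winner][0], results[winner][1]


# Hypothetical library and a star-shaped core graph (one hub, three leaves).
library = {"mesh_2x3": (6, mesh_hops(3)), "ring_6": (6, ring_hops(6))}
star = {("c0", "c1"): 10.0, ("c0", "c2"): 10.0, ("c0", "c3"): 10.0}
name, cost, mapping = select_topology(star, library)
```

For this graph the 2 × 3 mesh wins with a cost of 30, because a mesh router with three neighbours can host the hub, whereas on the ring one leaf is always two hops away and the best cost is 40. A practical tool would replace the exhaustive search with one of the heuristics surveyed earlier and evaluate further objectives.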
8. Conclusion

This paper surveys the NoC application mapping strategies reported mostly in the last decade. It classifies the reported techniques into dynamic and static mapping approaches. Static mapping techniques have further been categorized as exact methods, branch-and-bound, transformative, and constructive approaches. We have also presented a performance comparison of the static mapping techniques. Apart from the existing benchmarks, we have generated some test cases having 64 and 128 cores. The communication cost and mapping time of some of the algorithms have been compared. The survey thus provides a fair understanding of the effort needed and the quality of solution obtained with different mapping approaches.
References
[1] L. Benini, G. De Micheli, Networks on Chips: a new SoC paradigm, IEEE Computer 35 (1) (2002) 7078. [2] W. J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in: Proceedings of the 38th Design Automation Conference (DAC), 2001, pp. 684689. [3] S. Kumar, A. Jantsch, J.P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, A network on chip architecture and design methodology, in: Proceedings of ISVLSI, 2002, pp. 117124. [4] U.Y. Ogras, J. Hu, R. marculescu, Key research problems in NoC design: a holistic perspective, in: Proceedings of the IEEE International Conference on Hardware/Software Codesign and System, Synthesis, 2005, pp. 6974. [5] R. Marculescu, U.Y. Ogras, L.S. Peh, N.E. Jerger, Y. Hoskote, Outstanding research problems in NoC design: systems, microarchitecture, and circuit perspectives, IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems 28 (1) (2009) 0321.
74
P.K. Sahu, S. Chattopadhyay / Journal of Systems Architecture 59 (2013) 6076 [35] J. Hu, R. Marculescu, Energy- and performance-aware mapping for regular NoC architectures, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24 (4) (2005) 551562. [36] T.J. Lin, S.Y. Lin, A.Y. Wu, Trafc-balanced IP mapping algorithm for 2D-Mesh on-chip-networks, in: IEEE Workshop on Signal Processing Systems (SiPS), 2008, pp. 200203. [37] M. Reshadi, A. Khademzadeh, A. Reza, Elixir: a new bandwidth-constrained mapping for networks-on-chip, IEICE Electronics Express 7 (2) (2010) 7379. [38] T. Lei, S. Kumar, A two-step genetic algorithm for mapping task graphs to a network on chip architecture, in: Proceedings of the Euromicro Symposium on Digital System Design (DSD), 2003, pp. 180187. [39] W. Zhou, Y. Zhang, Z. Mao, An application specic NoC mapping for optimized delay, in: IEEE International Conference on Design and Test of Integrated Systems in Nanoscale (DTIS), 2006, pp. 184188. [40] G. Ascia, V. Catania, M. Palesi, Multi-objective mapping for mesh-based NoC architectures, in: ACM International Conference on Hardware/Software Codesign and System Synthesis, 2004, pp. 182187. [41] G. Ascia, V. Catania, M. Palesi, Multi-objective genetic approach to mapping problem on Network-on-Chip, Journal of Universal Computer Science 12 (4) (2006) 370394. [42] A.H. Benyamina, P. Boulet, Multi-objective mapping for NoC architecture, Journal of Digital Information Management 5 (2007) 378384. [43] R.K. Jena, G.K. Sharma, A multi-objective evolutionary algorithm based optimization model for Network-on-Chip synthesis, in: IEEE International Conference on Information Technology (ITNG), 2007, pp. 977982. [44] K. Bhardwaj, R.K. Jena, Energy and bandwidth aware mapping of IPs onto regular NoC architectures using multi-objective genetic algorithms, in: International Symposium on System-on-Chip (SOC), 2009, pp. 2731. [45] F.M. Darbari, A. Khademzadeh, G.G. 
Fard, Evaluating the performance of a chaos genetic algorithm for solving the network on chip mapping problem, in: International Conference on Computational Science and Engineering, 2009, pp 366373. [46] F.M. Darbari, A. Khademzadeh, G.G. Fard, CGMAP: a new approach to Network-on-Chip mapping problem, IEICE Electronics Express 6 (1) (2009) 2734. [47] G.G. Fard, A. Khademzadeh, F.M. Darbari, Evaluating the performance of onedimensional chaotic maps in Network-on-Chip mapping problem, IEICE Electronics Express 6 (12) (2009) 811817. [48] M. Tavanpour, A. Khademzadeh, S. Pourkiani, M. Yaghobi, GBMAP: an evolutionary approach to mapping cores onto a mesh-based NoC architecture, Journal of Communication and Computer 7 (3) (2010) 17. [49] G. Fen, W. Ning, Genetic algorithm based mapping and routing approach for network on chip architectures, Chinese Journal of Electronics 19 (1) (2010) 9196. [50] W. Jang, D.Z. Pan, A3MAP: Architecture-aware analytic mapping for Networkon-Chip, in: Asia and South Pacic Design Automation Conference (ASP-DAC), 2010, pp. 523528. [51] N. Choudhary, M.S. Gaur, V. Laxmi, V. Singh, Energy aware design methodologies for application specic NoC, in: Proceedings of NORCHIP, 2010, pp. 14. [52] N. Choudhary, M.S. Gaur, V. Laxmi, V. Singh, GA based congestion aware topology generation for application specic NoC, in: IEEE International Symposium on Electronics Design, Test, and Application, 2011, pp. 9398. [53] M.J. Sepulveda, M. Strum, W.J. Chau, A multi-objective adaptive immune algorithm for NoC mapping, in: International Conference on Very Large Scale Integration (VLSI-SOC), 2009, pp. 193196. [54] M.J. Sepulveda, M. Strum, W.J. Chau, G. Gogniat, A multi-objective approach for multi-application NoC mapping, in: IEEE Latin American Symposium on Circuits and Systems (LASCAS), 2011, pp. 14. [55] I. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of IEEE International Conference on Neural Networks, NJ, 1995, pp. 19421948. [56] W. 
Zhou, Y. Zhang, Z. Mao, Link-load balance aware mapping and routing for NoC, WSEAS Transactions on Circuits and Systems 6 (11) (2007) 583 591. [57] A.R. Fekr, A. Khademzadeh, M. Janidarmian, V.S. Bokharaei, Bandwidth/fault/ contention aware application-specic NoC using PSO as a mapping generator, in: Proceedings of the World Congress on Engineering (WCE), vol. 1, 2010, pp. 247252. [58] W. Lei, L. Xiang, Energy- and latency-aware NoC mapping based on discrete particle swarm optimization, in: Proceedings of IEEE International Conference on Communications and Mobile Computing, 2010, pp. 263268. [59] A.H. Benyamina, P. Boulet, A. Aroul, S. Eltar, K. Dellal, Mapping real time applications on NoC architecture with hybrid multi-objective algorithm, in: International Conference on Metaheuristics and Nature Inspired, Computing, 2010, pp. 110. [60] P.K. Sahu, P. Venkatesh, S. Gollapalli, S. Chattopadhyay, Application mapping onto mesh structured Network-on-Chip using particle swarm optimization, in: IEEE International symposium on VLSI (ISVLSI), 2011, pp. 335336. [61] K. Wang, L. Huang, C. Zhou, W. Pang, Particle swarm optimization for traveling salesman problem, in: Proceedings of the Second International Conference on Machine Learning and Cybermetics, 2003, pp. 15831585. [62] Yuhui Shi, Russell Eberhart, Parameter Selection in Particle Swarm Optimization, Springer Berlin/ Heidelberg, vol. 1447/1998, 2006, pp. 591-600. [63] A. Colorni, M. Dorigo, V. Maniezzo, Distributed optimization by ant colonies, actes de la premire confrence europenne sur la vie articielle, France, Elsevier Publishing, Paris, 1991. 134142.
[6] A. Agarwal, C. Iskander, R. Shankar, Survey of network on chip (NoC) architectures & contributions, Journal of Engineering, Computing and, Architecture 3 (1) (2009). [7] R. Pop, S. Kumar, A survey of techniques for mapping and scheduling applications to network on chip systems, ISSN 1404 0018, Research Report 04:4, School of Engineering, Jnkping University, 2004. [8] G. Chen, F. Li, M. Kandemir, Compiler-directed application mapping for NoC based chip multiprocessors, in: Proceedings of LCTES, 2007, pp. 155157. [9] E. Carvalho, N. Calazans, F. Moraes, Heuristics for Dynamic Task Mapping in NoC-based Heterogeneous MPSoCs, IEEE International Workshop on Rapid system Prototyping (RSP), 2007, pp. 3440. [10] C.L. Chou, R. Marculescu, Incremental runtime application mapping for homogeneous NoCs with multiple voltage levels, in: ACM International Conference on Hardware/Software codesign and system synthesys, 2007, pp. 161166. [11] C.L. Chou, R. Marculescu, User-aware dynamic task allocation in Network-onChip, in: Proceedings of Design, Automation and Test in Europe (DATE), 2008, pp. 12321237. [12] C.L. Chou, U.Y. Ogras, R. Marculescu, Energy- and performance-aware incremental mapping for NoCs with multiple voltage levels, IEEE Transactions on Computer-Aided design of Integrated Circuits and Systems 27 (10) (2008) 18661879. [13] E. Carvalho, F. Moraes, Congestion-aware task mapping in heterogeneous MPSoCs, in: International Symposium on SoC, 2008, pp. 14. [14] A. Mehran, A. Khademzadeh, S. Saeidi, DSM: a heuristic dynamic spiral mapping algorithm for Network-on-Chip, IEICE Electronics Express 5 (13) (2008) 464471. [15] M.A.A. Faruque, R. Krist, J. Henkel, ADAM: run-time agent based distributed application mapping for on-chip communication, IEEE Design Automation Conference (DAC), 2008, pp. 760765. [16] A.K. Singh, W. Jigang, A. Prakash, T. 
Srikanthan, Mapping algorithms for NoCbased heterogeneous MPSoC platforms, in: Euromicro Conference on Digital System Design/Architecture, Methods and Tools, 2009, pp. 133140. [17] A.K. Singh, T. Srikanthan, A. Kumar, W. Jigang, Communication-aware heuristics for run-time task mapping on NoC-based MPSoC platforms, Journal of System Architecture 56 (2010) 242255. [18] E. Carvalho, N. Calazans, F. Moraes, Dynamic task mapping for MPSoCs, IEEE Design and Test of Computers (2010) 2635. [19] M. Mandelli, L. Ost, E. Carara, G. Guindani, T. Gouvea, G. Medeiros, F.G. Moraes, Energy-aware dynamic task mapping for NoC-based MPSoCs, in: Proceedings of ISCAS, 2011, pp. 16761679. [20] M. Mandelli, A. Amory, L. Ost, F.G. Moraes, Multi-task dynamic mapping onto NoC-based MPSoCs, in: Proceedings of the 24th Symposium on Integrated Circuits and System Design, 2011, pp. 191196. [21] A. Weichslgartner, S. Wildermann, J. Teich, Dynamic decentralized mapping of tree-structured applications on NoC architectures, in: IEEE/ACM International Symposium on Network-on-Chip (NOCS), 2011, pp. 201208. [22] A. Bender, MILP based task mapping for heterogeneous multiprocessor systems, in: Proceedings of International conference on Design and Automation (EURO-DAC), 1996, pp. 190197. [23] C. Rhee, H. Jeong, S. Ha, Many-to-Mmany core-switch mapping in 2-D Mesh NoC architectures, in: IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD), 2004, pp. 438443. [24] S. Murali, L. Benini, G.D. Micheli, Mapping and physical planning of networkson-chip architectures with quality-of-service guarantees, in: Asia and South Pacic Design Automation Conference (ASP-DAC), 2005, pp. 2732. [25] K. Srinivasan, K.S. Chatha, G. Konjevod, Linear-programming-based techniques fo synthesis of Network-on-Chip architectures, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14 (4) (2006) 407420. [26] C. Ostler, K.S. 
Chatha, An ILP formulation for system-level application mapping on network processor architecture, in: Proceedings of Design, Automation and Test in Europe (DATE), 2007, pp. 16. [27] O. Ozturk, M. Kandemir, S.W. Son, An ILP based approach to reducing energy consumption in NoC based CMPs, in: IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2007, pp. 411414. [28] P. Ghosh, A. Sen, A. Hall, Energy efcient application mapping to NoC processing elements operating at multiple voltage levels, in: IEEE International Symposium on Network-on-Chip (NoCS), 2009, pp. 8085. [29] J. Huang, C. Buckl, A. Raabe, A. Knool, Energy-aware task allocation for Network-on-Chip based heterogeneous multiprocessor systems, in: Euromicro International Conference on Parallel, Distributed and Network based Processing (PDP), 2011, pp. 447454. [30] C.L. Chou, R. Marculescu, Contention-aware application mapping for Network-on-Chip communication architectures, in: IEEE International Conference on Computer Design (ICCD), 2008, pp. 164169. [31] S. Tosun, O. Ozturk, M. Ozen, An ILP formulation for application mapping onto Network-on-Chips, in: International Conference on Application of Information and Communication Technologies (AICT), 2009, pp. 15. [32] S. Tosun, Clustered-based application mapping method for Network-on-Chip, Journal of Advances in Engineering Software 42 (10) (2011) 868874. [33] J. Hu, R. Marculescu, Energy-aware mapping for tile-based NoC architectures under performance constraints, in: Asia and South Pacic Design Automation Conference (ASP-DAC), 2003, pp. 233239. [34] J. Hu, R. Marculescu, Exploiting the routing exibility for energy/performance aware mapping of regular NoC architectures, in: Proceedings of Design, Automation and Test in Europe (DATE), 2003, pp. 688693.
P.K. Sahu, S. Chattopadhyay / Journal of Systems Architecture 59 (2013) 6076 [64] J. Wang, Y. Li, S. Chai, Q. Peng, Bandwidth-aware application mapping for NoC-based MPSoCs, Journal of Computational Information Systems 7 (1) (2011) 152159. [65] N. Koziris, M. Romesis, P. Tsanakas, G. Papakonstantinou, An efcient algorithm for the physical mapping of clustered task graphs onto multiprocessor architectures, in: Proceedings of 8th Euro PDP, 2000, pp. 406413. [66] A. Hansson, K. Goossens, A. Radulescu, A unied approach to constrained mapping and routing on Network-on-Chip architectures, in: IEEE/ACM International Conference on Hardware/Software Codesign and System, Synthesis (CODES+ISSS), 2005, pp. 7580. [67] S. Saeidi, A. Khademzadeh, A. Mehran, SMAP: An intelligent mapping tool for network on chip, in: International Symposium on Signals, Circuits and Systems (ISSCS), 2007, pp. 14. [68] R. Mehran, S. Saeidi, A. Khademzadeh, A.A. Kusha, Spiral: a heuristic mapping algorithm for network on chip, IEICE Electronics Express 4 (15) (2007) 478 484. [69] T. Shen, C.H. Chao, Y.K. Lien, A.Y. Wu, A new binomial mapping and optimization algorithm for reduced-complexity mesh-based on-chip network, in: Proceedings of NOCS07, 2007, pp. 317322. [70] M. Tavanpour, A. Khademzadeh, M. Janidarmian, Chain-mapping for mesh based Network-on-Chip architecture, IEICE Electronics Express 6 (22) (2009) 15351541. [71] Y. Chen, L. Xie, J. Li, An energy-aware heuristic constructive mapping algorithm for network on chip, in: International Conference on ASIC (ASICON), 2009, pp. 101104. [72] A. Patooghy, H. Tabkhi, S.G. Miremadi, RMAP: a reliability-aware application mapping for Network-on-Chips, in: International Conference on Dependability, 2010, pp. 112117. [73] B. Yang, T.C. Xu, T. Santti, J. Plosila, Tree-model based mapping for energyefcient and low-latency Network-on-Chip, in: International Symposium on Design and Diagnostics of Electronics Circuits and Systems (DDECS), 2010, pp. 189192. 
[74] S. Tosun, New heuristic algorithm for energy aware application mapping and routing on mesh-based NoCs, Journal of System Architecture 57 (2011) 69 78. [75] S. Murali, G. De Micheli, Bandwidth constrained mapping of cores onto NoC architectures, in: Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE), vol. 2, 2004, pp. 896901. [76] S. Murali, G. De Micheli, SUNMAP: a tool for automatic topolog selection and generation for NoCs, in: Proceedings of 41st Design Automation Conference (DAC), 2004, pp. 914919. [77] K. Srinivasan, K.S. Chatha, A technique for low energy mapping and routing in Network-on-Chip architecture, in: IEEE International Symposiun on Low Power Electronics and Design (ISLPED), 2005, pp. 387392. [78] C. Marcon, N. Calazans, F. Moraes, A. Susin, I. Reis, F. Hessel, Exploring NoC mapping strategies: an energy and timing aware technique, in: Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE), vol. 1, 2005, pp. 502507. [79] C. Marcon, A. Borin, A. Susin, L. Carro, F. Wagner, Time and energy efcient mapping of embeded applications onto NoCs, in: Proceedings of Asia and South Pacic Design Automation Conference (ASP-DAC), vol. 1, 2005, pp. 33 38. [80] C.A.M. Marcon, E.I. Moreno, N.L.V. Calazans, F.G. Moraes, Comparison of Network-on-Chip mapping algorithms targeting low energy consumption, IET Computer & Digital Technique 2 (6) (2008) 471482. [81] C.A.M. Marcon, J.C.S. Palma, A.A. Susin, R.A.L. Reis, N.L.V. Calazans, F.G. Moraes, Modeling the trafc effect for the application cores mapping problem onto NoCs, VLSI-SoC International Federation for Information Processing, vol. 240/2007, 2007, pp. 179194. [82] H.M. Harmanani, R. Farah, A method for efcient mapping and reliable routing for NoC architectures with minimum bandwidth and area, in: IEEE International Workshop on Circuits and systems and TAISA Conference (NEWCAS-TAISA), 2008, pp. 2932. [83] Z. Lu, L. Xia, A. 
Jantsch, Cluster-based simulated annealing for mapping cores onto 2D mesh Networks on Chip, in: Proceedings of Design and Diagnostics of Electronic Circuits and Systems (DDECS), 2008, pp. 16. [84] H. Elmiligi, A.A. Morgan, M.W.E. Kharashi, F. Gebali, Power-aware topology optimization for Network-on-Chips, in: IEEE International Symposium on Circuits and Systems, 2008, PP. 360363. [85] A. Morgan, H. Elmiligi, A.M.W.E. Kharashi, F. Gebali, Application-specic networks-on-chip topology customization using network partitioning, in: 1st International Forum on Next-generation Multicore/manycore Technologies, 2008. [86] H. Elmiligi, A.A. Morgan, M.W.E. Kharashi, F. Gebali, Power optimization for application-specic networks-on-chips: a topology-based approach, Journal of Microprocessor and Microsystems 33 (2009) 343355. [87] M.Y. Yu, M. Li, J.J. Song, F.F. Fu, Y.X. Bai, Pipelining-based high throughput low energy mapping on Network-on-Chip, in: Euromicro International Conference on Digital System Design/Architectures, Methods and Tools, 2009, pp. 427432. [88] M. Janidarmian, A. Khademzadeh, M. Tavanpour, Onyx: a new heuristic bandwidth-constrained mapping of cores onto network on chip, IEICE Electronics Express 6 (1) (2009) 17. [89] S. Saeidi, A. Khademzadeh, F. Vardi, Crinkle: a heuristic mapping algorithm for network on chip, IEICE Electronics Express 6 (24) (2009) 17371744.
[90] X. Wang, M. Yang, Y. Jiang, P. Liu, Power-aware mapping for Network-on-Chip architectures under bandwidth and latency constraints, in: International Conference on Embedded and Multimedia Computing (EM-COM), 2009, pp. 1-6.
[91] X. Wang, M. Yang, Y. Jiang, P. Liu, Power-aware mapping approach to map IP cores onto NoCs under bandwidth and latency constraints, ACM Transactions on Architecture and Code Optimization 7 (1) (2010) 1-30.
[92] M. Janidarmian, A. Khademzadeh, A.R. Fekr, V.S. Bokharaei, Citrine: a methodology for application-specific Network-on-Chips design, in: Proceedings of World Congress on Engineering and Computer Science, vol. 1, 2010, pp. 196-202.
[93] B. Yang, L. Guang, T.C. Xu, T. Santti, J. Plosila, Multi-application mapping algorithm for Network-on-Chip platforms, in: IEEE 26th Convention of Electrical and Electronics Engineers in Israel (IEEEI), 2010, pp. 540-544.
[94] P.K. Sahu, N. Shah, K. Manna, S. Chattopadhyay, A new application mapping algorithm for mesh based Network-on-Chip design, in: IEEE International Conference (INDICON), 2010, pp. 1-4.
[95] B. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Technical Journal 49 (2) (1970) 291-307.
[96] Z. Chang, G. Xiong, N. Sang, Energy-aware mapping for tree-based NoC architecture by recursive bipartitioning, in: International Conference on Embedded Software and Systems (ICESS), 2008, pp. 105-109.
[97] D. Majeti, A. Pasalapudi, K. Yalamanchili, Low energy tree based network on chip architectures using homogeneous routers for bandwidth and latency constrained multimedia applications, in: International Conference on Emerging Trends in Engineering and Technology (ICETET), 2009, pp. 358-363.
[98] P.K. Sahu, N. Shah, K. Manna, S. Chattopadhyay, An application mapping technique for butterfly-fat-tree Network-on-Chip, in: IEEE International Conference on Emerging Applications and Information Technology (EAIT), 2011, pp. 383-386.
[99] P.K. Sahu, N. Shah, K. Manna, S. Chattopadhyay, A new application mapping strategy for mesh-of-tree based Network-on-Chip, in: IEEE International Conference on Emerging Trends in Electrical and Computer Technology (ICETECT), 2011, pp. 518-523.
[100] R.P. Dick, D.L. Rhodes, W. Wolf, TGFF: task graphs for free, in: Proceedings of International Workshop on Hardware/Software Codesign, 1998.
[101] R. Tornero, J.M. Orduna, A. Mejia, J. Flich, J. Duato, CART: communication-aware routing technique for application-specific NoCs, in: IEEE Euromicro Conference on Digital System Design Architecture, Methods and Tools, 2008, pp. 26-31.
[102] R. Tornero, J.M. Orduna, A. Mejia, J. Flich, J. Duato, A communication-driven routing technique for application-specific NoCs, International Journal of Parallel Programming 39 (3) (2011) 357-374.
[103] R. Tornero, S. Kumar, S. Mubeen, J.M. Orduna, Distance constrained mapping to support NoC platforms based on source routing, in: Workshop on Highly Parallel Processing on a Chip (HPPC), 2009, pp. 8-17.
[104] R. Tornero, V. Sterrantino, M. Palesi, J.M. Orduna, A multi-objective strategy for concurrent mapping and routing in Networks on Chip, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2009, pp. 1-8.
[105] M. Palesi, R. Holsmark, S. Kumar, A methodology for design of application specific deadlock-free routing algorithms for NoC Systems, in: ACM International Conference on Hardware/Software Codesign and System Synthesis, 2006, pp. 142-147.
[106] M. Palesi, R. Holsmark, S. Kumar, V. Catania, Application specific routing algorithms for network on chip, IEEE Transactions on Parallel and Distributed Systems 20 (3) (2009) 316-330.
[107] J. Hu, R. Marculescu, Energy-aware communication and task scheduling for Network-on-Chip architectures under real-time constraints, in: Design Automation and Test in Europe Conference and Exhibition, vol. 1, 2004, pp. 234-239.
[108] J. Hu, R. Marculescu, Communication and task scheduling of application-specific Network-on-Chip, IEE Proc. Comput. Digit. Tech. 152 (5) (2005) 643-651.
[109] S. Stuijk, T. Basten, M. Geilen, A.H. Ghamarian, B. Theelen, Resource-efficient routing and scheduling of time-constrained Network-on-Chip communication, in: Proceedings of EUROMICRO Conference on Digital System Design, 2006, pp. 45-52.
[110] A. Mello, N. Calazans, F. Moraes, Rate-based scheduling policy for QoS Flows in network on chip, in: International Conference on Very Large Scale Integration (VLSI-SoC), 2007, pp. 140-145.
[111] A. Raina, V. Muthukumar, Traffic aware scheduling algorithm for network on chip, in: International Conference on Information Technology, 2009, pp. 877-882.
[112] W. Hu, X. Tang, B. Xie, T. Chen, D. Wang, An efficient power-aware optimization for task scheduling on NoC-based many-core System, in: IEEE International Conference on Computer and Information Technology (CIT), 2010, pp. 171-178.
[113] H. Yu, Y. Ha, B. Veeravalli, Communication-aware application mapping and scheduling for NoC-based MPSoCs, in: IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 3232-3235.
[114] Y. Xie, W.L. Hung, Temperature-aware task allocation and scheduling for embedded multiprocessor system-on-chip (MPSoC) design, Journal of VLSI Signal Processing 45 (2006) 177-189.
[115] A. Jalabert, S. Murali, L. Benini, G.D. Micheli, xpipesCompiler: a tool for instantiating application specific network on chip, in: Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE), 2004, pp. 884-889.
[116] D. Bertozzi, L. Benini, xpipes: a Network-on-Chip architecture for gigascale system-on-chip, in: IEEE Circuits and Systems Magazine, 2004, pp. 18-31.
[117] L. Bononi, N. Concer, M. Grammatikakis, NoC topology exploration based on simulation models, in: Euromicro Conference on Digital System Design Architecture, Methods and Tools (DSD), 2007, pp. 543-546.
[118] F. Pellegrini, SCOTCH and LibScotch 4.0 User's Guide.
[119] J. Joven, O.F. Bach, D.C. Rufas, R. Martinez, L. Teres, J. Carrabina, xENoC - an experimental Network-on-Chip environment for parallel distributed computing on NoC-based MPSoC architecture, in: Euromicro Conference on Parallel, Distributed and Network-based Processing, 2008, pp. 141-148.
[120] J. Chan, S. Parameswaran, NoCGEN: a template based reuse methodology for network on chip architecture, in: International Conference on VLSI Design, 2004, pp. 717-720.
[121] E.A. Carara, R.P. de Oliveira, N.L.V. Calazans, F.G. Moraes, HeMPS - a framework for NoC based MPSoC generation, in: IEEE Symposium on Circuits and Systems (ISCAS), 2009, pp. 1345-1348.

Santanu Chattopadhyay is currently Professor in the Department of Electronics and Electrical Communication Engineering at Indian Institute of Technology, Kharagpur. He received the PhD degree in Computer Science and Engineering from Indian Institute of Technology Kharagpur in 1996. Before joining IIT Kharagpur, he was with Indian Institute of Technology, Guwahati. His research interests include CAD tools for low power circuit design and test, System-on-Chip testing, and Network-on-Chip design and test. He has more than 120 publications in refereed international journals and conferences. He is a co-author of the book Additive Cellular Automata Theory and Applications, published by the IEEE Computer Society Press in 1997.

Pradip Kumar Sahu is a PhD student in the Department of Electronics and Electrical Communication Engineering at Indian Institute of Technology, Kharagpur. His research interests include Network-on-Chip architecture design and Application mapping in 2-D and 3-D environments, Performance and Cost Evaluation, and Power-Performance-Reliability trade-off.


Fig. 2. Classification of mapping algorithms. Category labels recovered from the figure: Search based Mapping; Mathematical Programming based Mapping; Systematic or Deterministic Search; Transformative Heuristic; Constructive without Iterative Improvement; Constructive with Iterative Improvement. Example labels recovered from the figure: ILP, MILP, etc.; Branch and Bound (BB); BMAP, CMAP, CHMAP, SMAP; NMAP, LMAP, SA, Onyx.


Fig. 3. Final core placement [39].


Fig. 4. Particle structure [60]. Figure labels: core number, router number.


Fig. 5. Application mapping onto NoC [67,68].

Fig. 6. An example of binomial merging (N = 16) iterations [69].


Figure panels (region for initial candidate core): (a) Selected region for initial candidate core; (b) Initial candidate core (4 × 4 mesh); (c) Initial candidate core (5 × 5 mesh).


Fig. 8. Concept of Lozenge-shape path selection [88].

Fig. 9. Zigzag path for core mapping [89].

Fig. 11. Initial mapping for VOPD application [94]. Figure labels: Level-0 initial graph; Level-1, Level-2 and Level-3 partitions; core pairs {C12, C9}, {C13, C14}, {C7, C6}, {C2, C1}, {C16, C5}, {C4, C3}; router labels in the range U1 to U16.

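$$D = \{\, d^k \mid value(d^k) = comm_{i,j},\ \text{for } k = 1, 2, \ldots, |E| \text{ and } e_{i,j} \in E \,\}$$

$$source(d^k) = map(c_i) \quad \text{and} \quad sink(d^k) = map(c_j)$$

if $link(u_i, u_j) \in Path(source(d^k), sink(d^k))$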
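
As a rough illustration of the commodity definitions above, the following sketch builds one commodity per core-graph edge and accumulates the load that each commodity places on every link of its path. The 2 × 2 mesh, the XY routing used to obtain `Path`, the uniform link bandwidth, the example bandwidth values and all names (`xy_path`, `link_loads`, `feasible`, `comm`, `mapping`) are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def xy_path(src, dst, cols):
    """Directed links (u_i, u_j) visited between two routers of a mesh.

    Routers are numbered row-major. XY routing (X first, then Y) is an
    assumption of this sketch, not something the paper prescribes.
    """
    links = []
    cur = src
    r, c = divmod(src, cols)
    dr, dc = divmod(dst, cols)
    while c != dc:  # travel along the row first
        c += 1 if dc > c else -1
        nxt = r * cols + c
        links.append((cur, nxt))
        cur = nxt
    while r != dr:  # then along the column
        r += 1 if dr > r else -1
        nxt = r * cols + c
        links.append((cur, nxt))
        cur = nxt
    return links


def link_loads(comm, mapping, cols):
    """Total load per link: each commodity adds its value to every link on its path."""
    loads = defaultdict(float)
    for (ci, cj), value in comm.items():  # one commodity d^k per edge e_{i,j}
        source, sink = mapping[ci], mapping[cj]  # source(d^k), sink(d^k)
        for link in xy_path(source, sink, cols):
            loads[link] += value
    return dict(loads)


def feasible(loads, bw):
    """True if no link carries more than bw (uniform link bandwidth assumed)."""
    return all(load <= bw for load in loads.values())


# Hypothetical core graph (bandwidths in MB/s) and mapping onto a 2 x 2 mesh.
comm = {("c0", "c1"): 70.0, ("c1", "c2"): 30.0,
        ("c0", "c3"): 20.0, ("c0", "c2"): 10.0}
mapping = {"c0": 0, "c1": 1, "c2": 3, "c3": 2}
loads = link_loads(comm, mapping, cols=2)
print(loads, feasible(loads, bw=100.0))
```

In this toy instance the commodities c0 to c1 and c0 to c2 share the link from router 0 to router 1, so lowering the assumed link bandwidth makes the same mapping infeasible.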


Fig. 12. Application core graphs with communication bandwidth (MB/s). Recovered panel titles: (f) 263enc mp3dec; (g) mp3enc mp3dec; (h) 263dec mp3dec.

$$\sum_{k=1}^{|E|} x^{k}_{i,j} \le bw_{i,j}, \quad \text{for all } i, j \in \{1, 2, \ldots, |U|\}$$

$value(d^k) \times hopcount(source(d^k), \ldots$
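
The cost expression above is truncated in this copy, so the sketch below rests on an assumption: that the communication cost of a mapping is the sum, over all commodities, of the commodity value multiplied by the hop count between its mapped source and sink. Hop count is taken as the Manhattan distance on a 2-D mesh. The function names, the example core graph and both mappings are hypothetical.

```python
def hopcount(u, v, cols):
    """Manhattan distance between routers u and v on a row-major 2-D mesh."""
    r1, c1 = divmod(u, cols)
    r2, c2 = divmod(v, cols)
    return abs(r1 - r2) + abs(c1 - c2)


def communication_cost(comm, mapping, cols):
    """Sum of value(d^k) * hopcount(source(d^k), sink(d^k)) over all commodities."""
    return sum(
        value * hopcount(mapping[ci], mapping[cj], cols)
        for (ci, cj), value in comm.items()
    )


# Hypothetical core graph (bandwidths in MB/s) and two candidate mappings
# of four cores onto a 2 x 2 mesh.
comm = {("c0", "c1"): 70.0, ("c1", "c2"): 30.0,
        ("c0", "c3"): 20.0, ("c0", "c2"): 10.0}
mapping_a = {"c0": 0, "c1": 1, "c2": 3, "c3": 2}
mapping_b = {"c0": 0, "c1": 1, "c2": 2, "c3": 3}

for name, m in (("A", mapping_a), ("B", mapping_b)):
    print(name, communication_cost(comm, m, cols=2))
```

Comparing the two candidate mappings shows how placing heavily communicating cores on adjacent routers lowers the cost, which is the quantity a mapping algorithm would try to minimise under this assumed objective.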
