You are on page 1of 5

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication.

An FPGA-Based Framework
for Technology-Aware Prototyping
of Multi-Core Embedded Architectures
Paolo Meloni, Simone Secchi, Student Member, IEEE, and Luigi Raffo, Member, IEEE

Abstract—As the multi-core era has been entered, the hardware- II. R ELATED W ORK
software co-development of embedded platforms started to expose to
the designer a huge number of degrees of freedom. Effective design To date, software cycle-accurate simulation has been the primary
and simulation methodologies are needed to deal with the need of high tool to allow collaborative hardware and software research, for the
accuracy and with the strict time-to-market constraints. The use of cycle-
accurate software simulators as a foundation for the exploration of all detailed system-level evaluation of complex systems. Some examples
the possible hw-sw configurations of the full system does not appear to can be found in [5], [22].
be anymore a feasible way to handle this complexity. In this paper, an However, very often, for parallel software development of real-
FPGA-based cycle-accurate hardware emulation framework is presented world applications, such approaches to simulation do not provide a
and proposed as a research accelerator for the exploration of complete
multi-core systems. Support for semi-automatic instantiation and FPGA practical speed-accuracy trade-off. At the state-of-the-art, two main
emulation of a complete hw-sw multi-core architecture is provided, directions are followed in facing the problem of effectively simulating
building on a library of processing and interconnection configurable complex multi- and many-core architectures.
modules. The framework provides the possibility to extract from the The first one aims at achieving the maximum speed of the
hardware-emulated system a set of metrics for the assessment of the
performance and the evaluation of the architectural trade-offs, as well as simulation by completely or partially raising the abstraction-level of
the estimation of figures of power and area consumption of a prospective the described architecture. Simics [14] is one of the best known full-
ASIC implementation of the considered architecture. A possible use case system functional simulators. It offers the level of accuracy neces-
is presented, employing the emulation infrastructure inside a design space sary to execute fairly complex binaries on the simulated machine,
exploration flow for the configuration of some interconnection network
parameters.
including operating systems. Cycle-accurate timing simulations can
be performed including custom modules that extend Simics through
Index Terms—FPGA, MPSoC emulation, design exploration.
its set of APIs. A timing multi-processor simulator built on top of
the Simics library is GEMS: in [15], a SPARC-based multiprocessor
I. I NTRODUCTION and its memory hierarchy simulation are targeted. Extensions of

T HE prediction of the performances of modern multi-core ar-


chitectures requires an effective solution of the speed-accuracy
trade-off, especially in the early design steps. The interest has recently
Simics targeting the simulation of reconfigurable hardware processor
extensions have been developed, as reported in [12]. MC-Sim ([9])
is a multi-accuracy software-based simulator in which the processing
shifted, both at academic and industrial level, from well-established cores are simulated with functional accuracy, preserving the highest
cycle-accurate full-system simulators to the adoption of FPGA-based modularity (through definition of specific APIs) to enable the pos-
hardware emulation platforms, whose trends in integration capability, sible addition of custom processor or cache models. The on-chip
speed and price propose them as a candidate to speed-up the explo- interconnection model included in MC-Sim, instead, supports timing
ration of large multi-core architectures [3]. Moreover, to consider simulation, still preserving high modularity. The paper presents also a
already at system/architectural level the variables related to the low methodology for automatic generation of fast (claimed 45x over RTL)
level implementation, the concept of “System-Level Design with C-based simulators for coprocessors from a high-level description.
technology-awareness” must be introduced. Detailed area, frequency ReSP ([4]) presents a TLM SystemC-based simulation platform that
and power models can be used to back-annotate the architectural introduces automatically generated Python wrappers that provide
assumptions and the experimental results obtained by means of increased flexibility, in terms of integration of new components
the prototyping. This paper presents an FPGA-based framework for and advanced simulation control capabilities. The Liberty ([20])
the emulation of complex and large multi-core architectures that modeling framework emphasizes the reusability of components and
allows the easy instantiation of the desired system configuration the minimization of the specification overhead. The user specifies a
and automatically generates the hardware description files for the structural system description that is automatically translated into a
FPGA synthesis and implementation, thus enabling rapid and accurate simulator executable.
prototyping/exploration of different architectures. The prototyping The second state-of-the-art direction aims at preserving the maxi-
results can be duely back-annotated using analytic models included mum accuracy of the simulation (cycle-accuracy), by exploiting fine-
in the framework, to evaluate a prospective ASIC implementation of grained reconfigurable devices such as FPGAs to build hardware-
the system. The remaining of this paper is organized as follows: Sect. based emulators. Several approaches use FPGAs as a means for
II provides an overview of other works that share objectives, tools or accelerating simulation: FAST ([7]) and A-Ports ([19]), for instance,
methodologies with ours; Sect. III presents the proposed framework map timing and functionally accurate performance models of the
in its main components, Sections IV and V present a possible use circuit under emulation onto reconfigurable logic, and provide a
case of the considered framework and an assessment of the achievable particularly useful support for rapid architectural exploration in those
speedup over pure software cycle-accurate simulation, while Sect. VI design cases where the RTL is not available, or where device
concludes the work presenting some future research directions. utilization problems are verified. Protoflex [8], on the other hand,
P. Meloni, S. Secchi and L. Raffo are with the Department of Electrical
defines a methodology for implementing a hybrid hw/sw approach to
and Electronic Engineering (DIEE), University of Cagliari, Italy. full-system multiprocessor emulation. Part of the target architecture
E-mail: {paolo.meloni,simone.secchi,luigi}@diee.unica.it is prototyped on an FPGA device and part is simulated by a

Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
Authorized licensed use limited to: Southern Taiwan University. Downloaded on March 15,2010 at 00:04:26 EDT from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

host software-based functional environment. For every simulator the according to shared-memory, message-passing or hybrid models
specific hw/sw partitioning depends on the activation rate of the of computation, employing libraries provided by the framework.
elementary subsystems. Moreover, Protoflex virtualizes the execution Moreover, a high-level structural description of the candidate system,
of different logical processors on the same hardware resources, to whose composition is described in detail in section III-A, is taken as
minimize device utilization. The authors claim an average speed-up input.
of 38x over Simics. Step 1 - The system-level description file is passed to the SHMPI
Other approaches build the prototyping platform implementing the builder which, building upon a repository of soft-cores, generates
full-system under test on FPGA. The most important contribution to the files that are needed for the actual FPGA implementation and
this field is brought by the RAMP project ([2]). Atlas ([21]) is the first for the prototyping phase. The SHMPI builder is described in
operational design within the RAMP project. It is a framework for further detail in section III-A. The reference repository of HDL
research on CMP architectures with special emphasis on transactional IPs contains different Xilinx proprietary processing, memorization
memories. It provides an FPGA design composed by processing and interconnection cores, along with custom modules for hardware
and memory cores featuring rich software support. In this work the synchronization. Moreover, the Xpipes architecture ([6], [1]) has been
speed-up achievable with on-hardware emulation is clearly assessed. adopted as the template for interconnection network infrastructures.
Another example of full-system FPGA-based emulator is [10], where In order to preserve the possibility of extending the library with
application specific architectures are considered. Moreover, various little effort in terms of adaptation to the proprietary communication
approaches to FPGA-based emulation have been developed also on protocols, dedicated wrappers have been developed to adapt the cores
the industrial side, such as [11]. to the OCP communication protocol ([17]).
Our work is intended to enhance usability and flexibility of FPGA Step 2 - The architectural hardware description and the software
full-system prototyping, by provision of a system-level composition libraries generated by the SHMPI builder are then passed to the Xilinx
framework with novel features not elsewhere available in literature, proprietary tools for the FPGA implementation flow; the toolchain
such as: handles the configuration and the synthesis of the hardware modules,
• a “topology builder”, that is able to translate high-level directives as well as the compilation of the software code (application kernel,
input by the user into a prototyping platform; libraries and drivers) to be mapped onto the different cores included
• support for performance extraction during the prototyping, with in the platform. The FPGA emulation, by means of adequate support
modalities and granularity specified already at system level; for performance extraction (described in detail in section III-B),
• support for technology awareness, introduced by means of back- allows an accurate profiling of the target application on the configured
annotation of the prototyping data with power/frequency models, architecture.
in order to obtain detailed activity-based power figures; Step 3 - The cycle-accurate information on the switching activity
• capabilities of investigating new generation multi-core architec- collected during the emulation, can be passed as input to analytic
tures (e.g. including NoC interconnects programmed according power and area models, in order to investigate on a prospective ASIC
to message passing, shared memory or hybrid models of com- implementation of the system. A deeper insight of the models is
putation); provided in section III-C.
Finally, we provide other fundamental information derived by our Output - The user can compare the evaluated metrics/cost func-
experience in building the proposed tool, pointing out in detail tions with the constraints that the final system is desired to satisfy,
the effort related to the implementation of the architecture under being them either physical (power consumption, operating frequency
prototyping on FPGA. or occupied area) or purely functional (e.g. total execution time,
average latency, bandwidth) and tune the design accordingly.
III. F RAMEWORK DESCRIPTION
A. The SHMPI topology builder
The SHMPI topology builder generates the actual RTL cores (pro-
cessing, interconnection, memories) of the platform in a library-based
approach (HDL instantiation), based on the system-level specification
file input by the designer. The SHMPI topology builder includes the
parsing engine of ×pipes compiler, a tool developed for the automatic
instantiation of application-specific interconnection networks ([13]).
New functions have been developed, enabling: composition of the
entire multi-core hardware platform (including processing elements
and memory hierarchy definition), automatic configuration of the
software libraries and integration with the Xilinx development tools.
The topology is described, inside the input file, in terms of:
• a high-level component-wise topology description of processing
cores, memories and interconnection architecture (switches and
links);
• the routing tables to be used by the network interfaces in case
a source-routing interconnection network is instantiated;
Fig. 1. A view of the proposed framework employed in a possible design • all the necessary parameters for the configuration of the pro-
space exploration loop cessing and interconnection modules;
• the address map of the memory cores and the different memory-
Figure 1 gives a schematic view of the proposed framework, mapped peripherals to be included in the platform. Through
employed in a possible design space exploration loop. the declaration of the address spaces, the complete memory
Input - The designer, starting the exploration, inputs the parallel architecture (caches, shared portions, private portions) can be
application to be executed. The application can be parallelized defined;

Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
Authorized licensed use limited to: Southern Taiwan University. Downloaded on March 15,2010 at 00:04:26 EDT from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

• specific directives for the inclusion and the connection of the Finally, how they are accessed; the designer can trade-off additional
performance/event counters needed for metrics extraction. hardware resources to reduce traffic overhead, instantiating dedicated
The functions developed for the SHMPI builder, on the basis of point-to-point connections instead of using the interconnection layer
this system-level description, perform the following operations: already used in the system.
• instantiation and sizing of the HDL code of processors, periph-
erals, semaphores and memory blocks; C. Models for prospective ASIC implementation
• generation of the input platform hardware and software descrip- Considering classic hw/sw embedded system design flows, several
tion files (.mhs and .mss) for the Xilinx toolchain; the hardware iterations are often needed before being able to meet the design
is tailored according to the specified architectural parameters. closure. In addition, several assumptions made at the system-level
The compilers are configured according to the chosen kind of design phase are often verified only after the actual back-end imple-
processing elements, the linker script and dedicated header files mentation. The effort required by this kind of process is increasing
are modified to be compliant with the defined memory map with the technology scaling. For example, capacitive effects related to
and the makefiles for compilation are tuned to include the right long wires are very difficult to be pre-estimated at high-level. Thus,
libraries; introducing ”technology awareness” already at system-level would be
• instantiation of the performance counters and configuration of crucial to increase the productivity and reduce the iterations needed
their access mode. When instantiating a processing or inter- to achieve a physically feasible and efficient system configuration. To
connection core, if requested by the system-level description, this aim, the adoption of some kind of modeling is unavoidable to
dedicated performance extraction modules are instantiated and obtain an early estimation of technology-related features and to take
connected to the module pins. The performance extraction effective technology-aware system-level decisions.
subsystem is described in detail in Sect. III-B. A memory- Within the proposed framework, the use of analytic models, is
mapping is defined for each counter; coupled to FPGA fast emulation, to obtain early power and execution
• generation of software libraries for performance counter ac- time figures related to a prospective ASIC implementation, without
cessing. According to the memory-mapping defined for the the need to perform long post-synthesis software simulations. The
performance counters, adequate C functions are created and FPGA emulation provides event/cycle-based metrics that can be back-
included for compilation; annotated using the analytic models for the estimation of the physical
• configuration of the system to support shared memory, mes- figures of interest. This allows to evaluate the timing results according
sage passing or hybrid models of computation, and relative to the modeled target ASIC operating frequencies and to translate the
customization of the synchronization/communication software evaluated switching activity in detailed power numbers.
libraries and processor-to-memory interfaces. In particular, when The models included in the framework are built by interpolation of
message-passing support is enabled for a processing core, the layout-level experimental results obtained after the ASIC implemen-
related private memory is implemented using two dual-port tation of the reference library IPs. More detailed information can be
BRAMs, respectively for data and instructions. In such a way, found in [16], referring to the power consumption and area occupation
for each memory side, one port can be used to establish a direct of the Xpipes NoC building blocks. Although the step involving back-
connection from the memory block to the NoC, so to stream the annotation of the functional performances with “technology aware”
message without affecting the processor execution. In this case a models can insert a certain degree of imprecision, the accuracy of the
memory-mapped DMA module is instantiated and programmed models described in [16] is assessed in the paper to be lower then
via software to execute send/receive operations; 10% when complete topologies are considered, with respect to post
layout analysis of real ASIC implementations.
B. Performance extraction
The framework provides the possibility to connect dedicated low- IV. U SE C ASES
overhead memory-mapped event-counters to specific logic chunks, In this section we present a case study, that employs the framework
for the evaluation of different performance metrics. The designer to assess the performances and explore the hardware-related trade-
is allowed to monitor the processing core interface, the switch offs of different architectural configurations. The focus is on the
output channel interface (in case a NoC structure is instantiated) comparison between three system configurations for fixed number of
and the memory port. The declaration of these event-counters and processors and memory hierarchy, but considering three alternative
their memory-mapping to specific processors can be included in NoC topologies as interconnection infrastructure.
the topology file that is input to the whole framework; the SHMPI The used hardware FPGA-based platform includes a Xilinx Virtex5
topology builder then handles the insertion and the connection of the XC5VLX330 device. The target application is a shared-memory
necessary hardware modules and wires. implementation of the RadixSort algorithm, included in the SPLASH2
The overhead introduced by the insertion of the performance benchmark suite [18]. The considered topologies are shown in Figure
counters depends on three factors, under full control of the user. 2: each one includes 8 processors, 8 private memories and 3 shared
Firstly, which core is intended to access the performance extraction modules (a shared memory shm, a bank of hardware semaphores for
subsystem. A dedicated processing core, originally not included in the synchronization purposes t&s, and an I/O controller uart).
architecture under emulation, can be added and placed on the FPGA, The first explored topology (hereafter called “star”) features a
to access all the event-counters without affecting the application 11x11 central switch, limiting the number of hops between sources
execution. Conversely, in order to save hardware resources, the and destinations of the communication scheme imposed by the
performance counters can be read by the same system processors target application. In the second topology (hereafter called “tree”)
under emulation, inserting explicit instructions within the application the 11x11 switch is replaced with three 5x5 switches, in order to
code. The second factor is when the performance counters are increase the operative frequency at the cost of an increased latency
accessed; they can be read offline, at the end of the execution, adding to the shared devices. The last topology under comparison is a
dedicated BRAM buffers to store the event traces, or they can be read quasi-mesh 4x2 topology, where every node includes a processor
at runtime, enabling dynamic resources management mechanisms. and a private memory and the shared devices are placed at the

Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
Authorized licensed use limited to: Southern Taiwan University. Downloaded on March 15,2010 at 00:04:26 EDT from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

however mitigated by the lower power consumption, resulting in a


lower energy dissipation. The designer could here operate a perfor-
mance/energy trade-off choice according to the initial constraints.

Tree topology
BRAM blocks 80(27%)
Slice Registers 28618(13%)
Slice LUTs 42529(20%)
Critical Path 12.43 ns
Emulation Time < 2 sec
TABLE I
FPGA HARDWARE RESOURCES UTILIZATION AND EMULATION ACTUAL
TIME FOR ONE OF THE THREE TOPOLOGIES IN USE CASE .

In table I we show the hardware resources required to implement


one of the topologies under test. As can be noticed, the use case can
fit easily on commercial devices. Moreover, we report in the table
the critical path inside the system and the obtained emulation time.
The presented use-case was chosen since the network topology
selection is one of the interesting capabilities offered by the frame-
work. However, the design exploration is not limited to it. A potential
IP developer, in order to test the integration of its modules inside an
MPSoC platform, having access to the needed physical models, could
easily integrate the figures related with other components.

V. S IMULATION SPEED AND ACCURACY ASSESSMENT


A point of interest in using FPGA-based emulators is certainly the
speedup achievable over software-based cycle-accurate simulators.
Fig. 2. Explored topologies (left), compared with the per-switch dynamic
power consumption (right) All the results related to functional and physical metrics showed
in Figures 3 and 2 are obtained, for each topology, with a single
FPGA emulation. The time needed for application execution and
performance data outputting is 0.8 sec. This result is coherent with
corners of the mesh. The largest switch is still 5x5. Figure 3 plots
[10] and [21], where multi-core FPGA-based emulators are assessed
various functional and physical metrics evaluated on the examined
to be three orders of magnitude faster than software-based simulators,
topologies. In detail, the execution times and the total latencies are
when not accounting for the time spent on HW implementation flow.
plotted, both in terms of seconds, thus accounting for the different
When the emulation platform is instead used inside a Design Space
maximum operating frequencies of the three architectures, estimated
Exploration cycle, a factor limiting the mentioned speedup is the time
with the analytic models. Moreover, the occupied area and the
needed to traverse the whole FPGA implementation flow. This effort
total power/energy consumption of the NoC modules are indicated,
increases with the complexity of the architecture and especially when
according to the estimations provided by the analytic models included
the size approaches the limit of on-chip reconfigurable resources.
in the framework. The right part of Figure 2 shows the contribution
of each single switch (a single bar) of the topologies to the dynamic
power consumed in the different system configurations, estimated
observing the flit congestion at every single output channel.

Fig. 4. Computational effort needed to complete the implementation flow


for different architectural configurations

In Figure 4 we provide an overview of how the FPGA imple-


Fig. 3. Execution times and total latencies, area obstruction, energy/power mentation effort scales for regular quasi-mesh topologies with in-
consumption of the compared topologies creasing number of processors. The implementation flows have been
performed by Xilinx ISE 10.1.3 on a DualCore AMD Opteron (@2.2
This kind of technology-aware information has a relevant meaning GHz) with 6 GB RAM. To shrink the synthesis time we managed to
during system-level design. For instance, in Figure 3, ”star” and ”tree” build a library of reusable pre-synthesized components with different
topologies show contrasting advantages with respect to execution parameter configurations. For topologies not larger than 8 processors,
time and energy consumption. In fact, the ”star” topology, with its the total FPGA implementation time does not exceed one hour. Thus,
limited operating frequency, shows a higher execution time, that is we can consider iterative optimization and exploration loops as a

Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
Authorized licensed use limited to: Southern Taiwan University. Downloaded on March 15,2010 at 00:04:26 EDT from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

potential field of use for the proposed framework, especially when [4] G. Beltrame, C. Bolchini, L. Fossati, A. Miele, and D. Sciuto. ReSP: A
the target application is complex, while the number of integrated Non-Intrusive Transaction-Level Reflective MPSoC Simulation Platform
for Design Space Exploration. In ASP-DAC ’08: Proceedings of the 2008
cores does not increase very much, like in the embedded systems
Asia and South Pacific Design Automation Conference, pages 673–678,
domain. Los Alamitos, CA, USA, 2008. IEEE Computer Society Press.
In order to perform a fair comparison with software-based cycle- [5] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri.
accurate simulation, we simulated the SystemC RTL models of MPARM: Exploring the Multi-Processor SoC Design Space with Sys-
the NoC modules for a 16-processor mesh topology, enabling the temC. J. VLSI Signal Process. Syst., 41(2):169–182, 2005.
[6] D. Bertozzi and L. Benini. X-pipes: A Network-on-Chip Architecture
performance-counting needed for the application of the power models for Gigascale Systems-on-Chip. IEEE Circuits and Systems Magazine,
described in Sect. III-C. The simulation has been performed using the 4(2):18–31, 2004.
same RadixSort memory-access traces as stimulus, simulation time [7] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, D. E. Johnson,
resulted in more than 12 hours. The FPGA-based prototyping of the J. Keefe, and H. Angepat. FPGA-Accelerated Simulation Technologies
(FAST): Fast, Full-System, Cycle-Accurate Simulators. In MICRO ’07:
system, accounting for the implementation effort, takes, according to
Proceedings of the 40th Annual IEEE/ACM International Symposium on
Figure 4, about 1 hour and 40 minutes (8x). In those cases in which Microarchitecture, pages 249–261, Washington, DC, USA, 2007. IEEE
the exploration extends only to software configuration (mapping, Computer Society.
scheduling, TLP granularity), only one synthesis is needed, resulting [8] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, K. Mai, and
in a much higher speedup as reported in literature. Moreover, unlike B. Falsafi. ProtoFlex: Towards Scalable, Full-System Multiprocessor
Simulations Using FPGAs. ACM Trans. Reconfigurable Technol. Syst.,
software-based simulators, monitoring additional signals does not 2(2):1–32, 2009.
introduce an overhead in terms of emulation time, resulting in an [9] J. Cong, K. Gururaj, G. Han, A. Kaplan, M. Naik, and G. Reinman. MC-
improved scalability of the number of events that can be observed. Sim: An Efficient Simulation Tool for MPSoC Designs. In Computer-
Regarding simulation accuracy assessment, it is worth noting that Aided Design, 2008. ICCAD 2008. IEEE/ACM International Conference
on, pages 364–371, Nov. 2008.
in the proposed use-case the same RTL code could be synthesized
[10] P.G. Del Valle, D. Atienza, I. Magan, J.G. Flores, E.A. Perez, J.M.
for FPGA (for evaluation) and for ASIC (for real production). The Mendias, L. Benini, and G. De Micheli. Architectural Exploration
prototyping does not insert any error in the estimation of “func- of MPSoC Designs Based on an FPGA Emulation Framework. In
tional related” (execution time, latency, congestion,) performances. Proceedings of XXI Conference on Design of Circuits and Integrated
Cycle/signal-level accuracy is thus guaranteed without the need of Systems (DCIS), pages 12–18, 2006.
[11] Emulation and Verification Engineering - EVE. Zebu XL and ZV
a test comparison (emulated vs. prospective implementation). The models. Technical report, 2005.
inaccuracy that might be inserted by the technology-aware models, [12] W. Fu and K. Compton. A Simulation Platform for Reconfigurable Com-
as mentioned in section III-C, is affordable in most design cases. puting Research. In FPL - Field Programmable Logic and Applications,
pages 1–7, 2006.
VI. C ONCLUSIONS AND F UTURE D EVELOPMENTS [13] A. Jalabert, S. Murali, L. Benini, and G. De Micheli. XpipesCompiler:
A Tool for Instantiating Application Specific Networks on Chip. In
In this work, an FPGA-based framework for the exploration and DATE ’04: Proceedings of the conference on Design, automation and test
characterization of MPSoC architectures is presented, with particular in Europe, page 20884, Washington, DC, USA, 2004. IEEE Computer
emphasis on NoC-based systems. The two main points of strength of Society.
the proposed framework are high-level automatic hw-sw platform [14] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg,
J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full
instantiation, integrated with Xilinx proprietary tools for FPGA System Simulation Platform. Computer, 35(2):50–58, Feb 2002.
implementation, and the use of analytic models that, exploiting the [15] M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R.
functional information provided by the FPGA emulation, are able Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet’s
to estimate different technology-related parameters of a prospective General Execution-Driven Multiprocessor Simulator (GEMS) Toolset.
SIGARCH Comput. Archit. News, 33(4):92–99, 2005.
ASIC implementation, such as power and energy consumption, [16] P. Meloni, I. Loi, F. Angiolini, S. Carta, M. Barbaro, L. Raffo, and
area occupation, and maximum achievable operating frequency. The L. Benini. Area and Power Modeling for Networks-on-Chip with Layout
presented use case validates the usefulness of the framework in Awareness. VLSI-Design Journal, Hindawi Publications, (ID 50285),
all the contexts where rapid simulation methodologies are required, 2007.
as an effective support to quantitative design space exploration or [17] OCP International Partnership (OCP-IP). Open Core Protocol Standard,
2003. http://www.ocpip.org/home.
simply as an environment for rapid prototyping of complex multi- [18] J. Pal Singh, S. C. Woo, M. Ohara, E. Torrie, and A. Gupta. The
core platforms. Already planned future work includes the extension SPLASH-2 Programs: Characterization and Methodological Consider-
of the library of components, as well as the consideration of hard ations. Proceedings of the International Symposium on Computer
routed macros and partial reconfigurability mechanisms to further Architecture, 1995.
[19] M. Pellauer, M. Vijayaraghavan, M. Adler, Arvind, and J. Emer. A-
reduce the overhead introduced by the FPGA implementation flow. Ports: an Efficient Abstraction for Cycle-Accurate Performance Models
on FPGAs. In FPGA ’08: Proceedings of the 16th international
ACKNOWLEDGMENT ACM/SIGDA symposium on Field programmable gate arrays, pages 87–
This work has been funded by the EU FP7 project MADNESS 96, New York, NY, USA, 2008. ACM.
(Contract Info FP7/248424). [20] M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome, S. Malik,
and D. I. August. The Liberty Simulation Environment: A Deliberate
Approach to High-Level System Modeling. Technical report, ACM
R EFERENCES Transactions on Computer Systems, 2004.
[1] F. Angiolini, P. Meloni, S. Carta, L. Benini, and L. Raffo. Contrasting a [21] S. Wee, J. Casper, N. Njoroge, Y. Tesylar, D. Ge, C. Kozyrakis, and
NoC and a Traditional Interconnect Fabric with Layout Awareness. In K. Olukotun. A Practical FPGA-based Framework for Novel CMP
Proceedings of the DATE ’06 Conference, Munich, Germany, 2006. Research. In FPGA ’07: Proceedings of the 2007 ACM/SIGDA 15th
[2] Arvind, K. Asanovic, C. Kozyrakis, S. L. Lu, and M. Oskin. RAMP: international symposium on Field programmable gate arrays, pages
Research Accelerator for Multiple Processors - A Community Vision 116–125, New York, NY, USA, 2007. ACM.
for a Shared Experimental Parallel HW/SW Platform, 2005. [22] A. Ramirez, A. Rico, A. Vega, D. R. Pico, E. Ayguade,
[3] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, F. Cabarcas, P. Carpenter, X. Martorell. CellSim: Modular
K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, Simulator for Heterogeneous Multiprocessor Architectures.
and K. A. Yelick. The Landscape of Parallel Computing Research: A http://pcsostres.ac.upc.edu/cellsim/doku.php.
View from Berkeley. Technical Report UCB/EECS-2006-183, EECS
Department, University of California, Berkeley, Dec 2006.

Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
Authorized licensed use limited to: Southern Taiwan University. Downloaded on March 15,2010 at 00:04:26 EDT from IEEE Xplore. Restrictions apply.

You might also like