An FPGA-Based Framework
for Technology-Aware Prototyping
of Multi-Core Embedded Architectures
Paolo Meloni, Simone Secchi, Student Member, IEEE, and Luigi Raffo, Member, IEEE
Abstract—With the advent of the multi-core era, the hardware-software co-development of embedded platforms has started to expose a huge number of degrees of freedom to the designer. Effective design and simulation methodologies are needed to cope with the need for high accuracy and with strict time-to-market constraints. The use of cycle-accurate software simulators as a foundation for the exploration of all the possible hw-sw configurations of the full system no longer appears to be a feasible way to handle this complexity. In this paper, an FPGA-based cycle-accurate hardware emulation framework is presented and proposed as a research accelerator for the exploration of complete multi-core systems. Support for semi-automatic instantiation and FPGA emulation of a complete hw-sw multi-core architecture is provided, building on a library of configurable processing and interconnection modules. The framework makes it possible to extract from the hardware-emulated system a set of metrics for the assessment of performance and the evaluation of architectural trade-offs, as well as to estimate the power and area figures of a prospective ASIC implementation of the considered architecture. A possible use case is presented, employing the emulation infrastructure inside a design space exploration flow for the configuration of some interconnection network parameters.
Index Terms—FPGA, MPSoC emulation, design exploration.
I. INTRODUCTION
II. RELATED WORK
To date, cycle-accurate software simulation has been the primary tool enabling collaborative hardware and software research for the detailed system-level evaluation of complex systems. Some examples can be found in [5], [22].
However, for parallel software development of real-world applications, such approaches to simulation very often do not provide a practical speed-accuracy trade-off. In the state of the art, two main directions are followed in facing the problem of effectively simulating complex multi- and many-core architectures.
The first one aims at maximizing simulation speed by completely or partially raising the abstraction level of the described architecture. Simics [14] is one of the best-known full-system functional simulators. It offers the level of accuracy necessary to execute fairly complex binaries, including operating systems, on the simulated machine. Cycle-accurate timing simulations can be performed by including custom modules that extend Simics through its set of APIs. A timing multi-processor simulator built on top of the Simics library is GEMS: in [15], a SPARC-based multiprocessor and its memory hierarchy simulation are targeted. Extensions of
Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
Authorized licensed use limited to: Southern Taiwan University. Downloaded on March 15,2010 at 00:04:26 EDT from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
host software-based functional environment. For every simulator the specific hw/sw partitioning depends on the activation rate of the elementary subsystems. Moreover, Protoflex virtualizes the execution of different logical processors on the same hardware resources, to minimize device utilization. The authors claim an average speed-up of 38x over Simics.
Other approaches build the prototyping platform by implementing the full system under test on FPGA. The most important contribution to this field comes from the RAMP project ([2]). Atlas ([21]) is the first operational design within the RAMP project. It is a framework for research on CMP architectures with special emphasis on transactional memories. It provides an FPGA design composed of processing and memory cores featuring rich software support. In that work, the speed-up achievable with on-hardware emulation is clearly assessed. Another example of a full-system FPGA-based emulator is [10], where application-specific architectures are considered. Moreover, various approaches to FPGA-based emulation have also been developed on the industrial side, such as [11].
Our work is intended to enhance the usability and flexibility of FPGA full-system prototyping by providing a system-level composition framework with novel features not available elsewhere in the literature, such as:
• a “topology builder” that is able to translate high-level directives input by the user into a prototyping platform;
• support for performance extraction during the prototyping, with modalities and granularity specified already at system level;
• support for technology awareness, introduced by means of back-annotation of the prototyping data with power/frequency models, in order to obtain detailed activity-based power figures;
• the capability to investigate new-generation multi-core architectures (e.g. including NoC interconnects programmed according to message-passing, shared-memory or hybrid models of computation);
Finally, we provide other fundamental information derived from our experience in building the proposed tool, pointing out in detail the effort related to the implementation of the architecture under prototyping on FPGA.
III. FRAMEWORK DESCRIPTION
Fig. 1. A view of the proposed framework employed in a possible design space exploration loop
Figure 1 gives a schematic view of the proposed framework, employed in a possible design space exploration loop.
Input - The designer, starting the exploration, inputs the parallel application to be executed. The application can be parallelized according to shared-memory, message-passing or hybrid models of computation, employing libraries provided by the framework. Moreover, a high-level structural description of the candidate system, whose composition is described in detail in section III-A, is taken as input.
Step 1 - The system-level description file is passed to the SHMPI builder which, building upon a repository of soft-cores, generates the files that are needed for the actual FPGA implementation and for the prototyping phase. The SHMPI builder is described in further detail in section III-A. The reference repository of HDL IPs contains different Xilinx proprietary processing, memory and interconnection cores, along with custom modules for hardware synchronization. Moreover, the Xpipes architecture ([6], [1]) has been adopted as the template for interconnection network infrastructures. In order to keep the library extensible with little effort spent on adapting to proprietary communication protocols, dedicated wrappers have been developed to adapt the cores to the OCP communication protocol ([17]).
Step 2 - The architectural hardware description and the software libraries generated by the SHMPI builder are then passed to the Xilinx proprietary tools for the FPGA implementation flow; the toolchain handles the configuration and the synthesis of the hardware modules, as well as the compilation of the software code (application kernel, libraries and drivers) to be mapped onto the different cores included in the platform. The FPGA emulation, by means of adequate support for performance extraction (described in detail in section III-B), allows an accurate profiling of the target application on the configured architecture.
Step 3 - The cycle-accurate information on the switching activity collected during the emulation can be passed as input to analytic power and area models, in order to investigate a prospective ASIC implementation of the system. A deeper insight into the models is provided in section III-C.
Output - The user can compare the evaluated metrics/cost functions with the constraints that the final system is required to satisfy, whether they are physical (power consumption, operating frequency or occupied area) or purely functional (e.g. total execution time, average latency, bandwidth), and tune the design accordingly.
A. The SHMPI topology builder
The SHMPI topology builder generates the actual RTL cores (processing, interconnection, memories) of the platform in a library-based approach (HDL instantiation), based on the system-level specification file input by the designer. The SHMPI topology builder includes the parsing engine of the ×pipes compiler, a tool developed for the automatic instantiation of application-specific interconnection networks ([13]). New functions have been developed, enabling: composition of the entire multi-core hardware platform (including processing elements and memory hierarchy definition), automatic configuration of the software libraries and integration with the Xilinx development tools.
The topology is described, inside the input file, in terms of:
• a high-level component-wise topology description of processing cores, memories and interconnection architecture (switches and links);
• the routing tables to be used by the network interfaces in case a source-routing interconnection network is instantiated;
• all the necessary parameters for the configuration of the processing and interconnection modules;
• the address map of the memory cores and the different memory-mapped peripherals to be included in the platform. Through the declaration of the address spaces, the complete memory architecture (caches, shared portions, private portions) can be defined;
• specific directives for the inclusion and the connection of the performance/event counters needed for metrics extraction.
The functions developed for the SHMPI builder, on the basis of this system-level description, perform the following operations:
• instantiation and sizing of the HDL code of processors, peripherals, semaphores and memory blocks;
• generation of the input platform hardware and software description files (.mhs and .mss) for the Xilinx toolchain; the hardware is tailored according to the specified architectural parameters. The compilers are configured according to the chosen kind of processing elements, the linker script and dedicated header files are modified to be compliant with the defined memory map, and the makefiles for compilation are tuned to include the right libraries;
• instantiation of the performance counters and configuration of their access mode. When instantiating a processing or interconnection core, if requested by the system-level description, dedicated performance extraction modules are instantiated and connected to the module pins. The performance extraction subsystem is described in detail in Sect. III-B. A memory mapping is defined for each counter;
• generation of software libraries for accessing the performance counters. According to the memory mapping defined for the performance counters, adequate C functions are created and included for compilation;
• configuration of the system to support shared-memory, message-passing or hybrid models of computation, and the related customization of the synchronization/communication software libraries and processor-to-memory interfaces. In particular, when message-passing support is enabled for a processing core, the related private memory is implemented using two dual-port BRAMs, for data and instructions respectively. In this way, for each memory side, one port can be used to establish a direct connection from the memory block to the NoC, so as to stream the message without affecting the processor execution. In this case a memory-mapped DMA module is instantiated and programmed via software to execute send/receive operations;
B. Performance extraction
The framework provides the possibility to connect dedicated low-overhead memory-mapped event-counters to specific logic chunks, for the evaluation of different performance metrics. The designer is allowed to monitor the processing core interface, the switch output channel interface (in case a NoC structure is instantiated) and the memory port. The declaration of these event-counters and their memory-mapping to specific processors can be included in the topology file that is input to the whole framework; the SHMPI topology builder then handles the insertion and the connection of the necessary hardware modules and wires.
The overhead introduced by the insertion of the performance counters depends on three factors, under full control of the user. The first factor is which core is intended to access the performance extraction subsystem. A dedicated processing core, originally not included in the architecture under emulation, can be added and placed on the FPGA, to access all the event-counters without affecting the application execution. Conversely, in order to save hardware resources, the performance counters can be read by the same system processors under emulation, inserting explicit instructions within the application code. The second factor is when the performance counters are accessed; they can be read offline, at the end of the execution, adding dedicated BRAM buffers to store the event traces, or they can be read at runtime, enabling dynamic resource management mechanisms. The third factor is how they are accessed; the designer can trade off additional hardware resources to reduce traffic overhead, instantiating dedicated point-to-point connections instead of using the interconnection layer already used in the system.
C. Models for prospective ASIC implementation
In classic hw/sw embedded system design flows, several iterations are often needed before design closure is reached. In addition, several assumptions made at the system-level design phase are often verified only after the actual back-end implementation. The effort required by this kind of process is increasing with technology scaling. For example, capacitive effects related to long wires are very difficult to pre-estimate at a high level. Thus, introducing “technology awareness” already at system level would be crucial to increase productivity and reduce the iterations needed to achieve a physically feasible and efficient system configuration. To this aim, the adoption of some kind of modeling is unavoidable to obtain an early estimation of technology-related features and to take effective technology-aware system-level decisions.
Within the proposed framework, the use of analytic models is coupled to fast FPGA emulation, to obtain early power and execution time figures related to a prospective ASIC implementation, without the need to perform long post-synthesis software simulations. The FPGA emulation provides event/cycle-based metrics that can be back-annotated using the analytic models for the estimation of the physical figures of interest. This makes it possible to evaluate the timing results according to the modeled target ASIC operating frequencies and to translate the evaluated switching activity into detailed power numbers.
The models included in the framework are built by interpolation of layout-level experimental results obtained after the ASIC implementation of the reference library IPs. More detailed information can be found in [16], referring to the power consumption and area occupation of the Xpipes NoC building blocks. Although the step involving back-annotation of the functional performances with “technology aware” models can introduce a certain degree of imprecision, the error of the models described in [16] is assessed in that paper to be lower than 10% when complete topologies are considered, with respect to post-layout analysis of real ASIC implementations.
IV. USE CASES
In this section we present a case study that employs the framework to assess the performance and explore the hardware-related trade-offs of different architectural configurations. The focus is on the comparison between three system configurations with a fixed number of processors and a fixed memory hierarchy, but with three alternative NoC topologies as interconnection infrastructure.
The FPGA-based hardware platform used includes a Xilinx Virtex5 XC5VLX330 device. The target application is a shared-memory implementation of the RadixSort algorithm, included in the SPLASH2 benchmark suite [18]. The considered topologies are shown in Figure 2: each one includes 8 processors, 8 private memories and 3 shared modules (a shared memory shm, a bank of hardware semaphores for synchronization purposes t&s, and an I/O controller uart).
The first explored topology (hereafter called “star”) features an 11x11 central switch, limiting the number of hops between sources and destinations of the communication scheme imposed by the target application. In the second topology (hereafter called “tree”) the 11x11 switch is replaced with three 5x5 switches, in order to increase the operating frequency at the cost of an increased latency to the shared devices. The last topology under comparison is a quasi-mesh 4x2 topology, where every node includes a processor and a private memory and the shared devices are placed at the
                   Tree topology
BRAM blocks        80 (27%)
Slice Registers    28618 (13%)
Slice LUTs         42529 (20%)
Critical Path      12.43 ns
Emulation Time     < 2 sec
TABLE I
FPGA HARDWARE RESOURCES UTILIZATION AND ACTUAL EMULATION TIME FOR ONE OF THE THREE TOPOLOGIES IN THE USE CASE.
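As a rough illustration of how the figures in Table I and the back-annotation step of Section III-C relate, the following Python sketch converts a critical path into a maximum clock frequency and turns emulation counter values into prospective ASIC time, energy and power. It is only a sketch under stated assumptions: the function names, the event name `flits_routed`, the per-event energy and the 500 MHz ASIC clock are hypothetical placeholders, and the linear energy-per-event model is a simplification, not the interpolated models of [16].

```python
# Illustrative sketch only. The event names, per-event energies and the ASIC
# clock frequency are hypothetical placeholders, not values from the paper
# or from the models in [16].

def fmax_mhz(critical_path_ns):
    """Maximum clock frequency in MHz implied by a critical path in ns."""
    return 1000.0 / critical_path_ns


def back_annotate(cycles, event_counts, energy_pj_per_event, asic_freq_mhz):
    """Turn emulation counters into prospective ASIC time, energy and power.

    cycles: cycle count measured during the cycle-accurate FPGA emulation.
    event_counts: switching-activity events read from the counters.
    energy_pj_per_event: assumed energy cost of each event, in pJ.
    asic_freq_mhz: assumed operating frequency of the target ASIC.
    """
    exec_time_s = cycles / (asic_freq_mhz * 1e6)
    energy_j = sum(
        count * energy_pj_per_event[name]
        for name, count in event_counts.items()
    ) * 1e-12
    avg_power_w = energy_j / exec_time_s
    return exec_time_s, energy_j, avg_power_w


# Critical path of the tree topology reported in Table I.
fpga_fmax = fmax_mhz(12.43)

# Hypothetical counter values and model parameters.
time_s, energy_j, power_w = back_annotate(
    cycles=1_000_000,
    event_counts={"flits_routed": 2_000_000},
    energy_pj_per_event={"flits_routed": 5.0},
    asic_freq_mhz=500.0,
)
print(f"FPGA fmax: {fpga_fmax:.2f} MHz")
print(f"ASIC time: {time_s * 1e3:.3f} ms, energy: {energy_j * 1e6:.3f} uJ, "
      f"power: {power_w * 1e3:.3f} mW")
```

The cycle count is independent of the clock of the FPGA prototype, which is why the same counter values can be rescaled to any modeled ASIC frequency.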
potential field of use for the proposed framework, especially when the target application is complex, while the number of integrated cores does not increase very much, as in the embedded systems domain.
In order to perform a fair comparison with software-based cycle-accurate simulation, we simulated the SystemC RTL models of the NoC modules for a 16-processor mesh topology, enabling the performance counting needed for the application of the power models described in Sect. III-C. The simulation was performed using the same RadixSort memory-access traces as stimulus and took more than 12 hours. The FPGA-based prototyping of the system, accounting for the implementation effort, takes, according to Figure 4, about 1 hour and 40 minutes (8x). In those cases in which the exploration extends only to software configuration (mapping, scheduling, TLP granularity), only one synthesis is needed, resulting in a much higher speedup, as reported in the literature. Moreover, unlike in software-based simulators, monitoring additional signals does not introduce an overhead in terms of emulation time, resulting in an improved scalability of the number of events that can be observed.
Regarding simulation accuracy assessment, it is worth noting that in the proposed use case the same RTL code could be synthesized for FPGA (for evaluation) and for ASIC (for real production). The prototyping does not introduce any error in the estimation of “functional related” performance figures (execution time, latency, congestion). Cycle/signal-level accuracy is thus guaranteed without the need for a test comparison (emulated vs. prospective implementation). The inaccuracy that might be introduced by the technology-aware models, as mentioned in section III-C, is acceptable in most design cases.
VI. CONCLUSIONS AND FUTURE DEVELOPMENTS
In this work, an FPGA-based framework for the exploration and characterization of MPSoC architectures is presented, with particular emphasis on NoC-based systems. The two main strengths of the proposed framework are high-level automatic hw-sw platform instantiation, integrated with Xilinx proprietary tools for FPGA implementation, and the use of analytic models that, exploiting the functional information provided by the FPGA emulation, are able to estimate different technology-related parameters of a prospective ASIC implementation, such as power and energy consumption, area occupation, and maximum achievable operating frequency. The presented use case validates the usefulness of the framework in all the contexts where rapid simulation methodologies are required, as an effective support to quantitative design space exploration or simply as an environment for rapid prototyping of complex multi-core platforms. Planned future work includes the extension of the library of components, as well as the consideration of hard routed macros and partial reconfigurability mechanisms to further reduce the overhead introduced by the FPGA implementation flow.
ACKNOWLEDGMENT
This work has been funded by the EU FP7 project MADNESS (Contract Info FP7/248424).
REFERENCES
[1] F. Angiolini, P. Meloni, S. Carta, L. Benini, and L. Raffo. Contrasting a NoC and a Traditional Interconnect Fabric with Layout Awareness. In Proceedings of the DATE ’06 Conference, Munich, Germany, 2006.
[2] Arvind, K. Asanovic, C. Kozyrakis, S. L. Lu, and M. Oskin. RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform, 2005.
[3] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[4] G. Beltrame, C. Bolchini, L. Fossati, A. Miele, and D. Sciuto. ReSP: A Non-Intrusive Transaction-Level Reflective MPSoC Simulation Platform for Design Space Exploration. In ASP-DAC ’08: Proceedings of the 2008 Asia and South Pacific Design Automation Conference, pages 673–678, Los Alamitos, CA, USA, 2008. IEEE Computer Society Press.
[5] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri. MPARM: Exploring the Multi-Processor SoC Design Space with SystemC. J. VLSI Signal Process. Syst., 41(2):169–182, 2005.
[6] D. Bertozzi and L. Benini. X-pipes: A Network-on-Chip Architecture for Gigascale Systems-on-Chip. IEEE Circuits and Systems Magazine, 4(2):18–31, 2004.
[7] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, D. E. Johnson, J. Keefe, and H. Angepat. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. In MICRO ’07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 249–261, Washington, DC, USA, 2007. IEEE Computer Society.
[8] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, K. Mai, and B. Falsafi. ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs. ACM Trans. Reconfigurable Technol. Syst., 2(2):1–32, 2009.
[9] J. Cong, K. Gururaj, G. Han, A. Kaplan, M. Naik, and G. Reinman. MC-Sim: An Efficient Simulation Tool for MPSoC Designs. In Computer-Aided Design, 2008. ICCAD 2008. IEEE/ACM International Conference on, pages 364–371, Nov. 2008.
[10] P.G. Del Valle, D. Atienza, I. Magan, J.G. Flores, E.A. Perez, J.M. Mendias, L. Benini, and G. De Micheli. Architectural Exploration of MPSoC Designs Based on an FPGA Emulation Framework. In Proceedings of XXI Conference on Design of Circuits and Integrated Systems (DCIS), pages 12–18, 2006.
[11] Emulation and Verification Engineering - EVE. Zebu XL and ZV models. Technical report, 2005.
[12] W. Fu and K. Compton. A Simulation Platform for Reconfigurable Computing Research. In FPL - Field Programmable Logic and Applications, pages 1–7, 2006.
[13] A. Jalabert, S. Murali, L. Benini, and G. De Micheli. XpipesCompiler: A Tool for Instantiating Application Specific Networks on Chip. In DATE ’04: Proceedings of the conference on Design, automation and test in Europe, page 20884, Washington, DC, USA, 2004. IEEE Computer Society.
[14] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. Computer, 35(2):50–58, Feb 2002.
[15] M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet’s General Execution-Driven Multiprocessor Simulator (GEMS) Toolset. SIGARCH Comput. Archit. News, 33(4):92–99, 2005.
[16] P. Meloni, I. Loi, F. Angiolini, S. Carta, M. Barbaro, L. Raffo, and L. Benini. Area and Power Modeling for Networks-on-Chip with Layout Awareness. VLSI-Design Journal, Hindawi Publications, (ID 50285), 2007.
[17] OCP International Partnership (OCP-IP). Open Core Protocol Standard, 2003. http://www.ocpip.org/home.
[18] J. Pal Singh, S. C. Woo, M. Ohara, E. Torrie, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. Proceedings of the International Symposium on Computer Architecture, 1995.
[19] M. Pellauer, M. Vijayaraghavan, M. Adler, Arvind, and J. Emer. A-Ports: an Efficient Abstraction for Cycle-Accurate Performance Models on FPGAs. In FPGA ’08: Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays, pages 87–96, New York, NY, USA, 2008. ACM.
[20] M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome, S. Malik, and D. I. August. The Liberty Simulation Environment: A Deliberate Approach to High-Level System Modeling. Technical report, ACM Transactions on Computer Systems, 2004.
[21] S. Wee, J. Casper, N. Njoroge, Y. Tesylar, D. Ge, C. Kozyrakis, and K. Olukotun. A Practical FPGA-based Framework for Novel CMP Research. In FPGA ’07: Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays, pages 116–125, New York, NY, USA, 2007. ACM.
[22] A. Ramirez, A. Rico, A. Vega, D. R. Pico, E. Ayguade, F. Cabarcas, P. Carpenter, X. Martorell. CellSim: Modular Simulator for Heterogeneous Multiprocessor Architectures. http://pcsostres.ac.upc.edu/cellsim/doku.php.