Signal Processing: Image Communication: B. Krill, A. Ahmad, A. Amira, H. Rabah

ARTICLE IN PRESS
Signal Processing: Image Communication 25 (2010) 377–387

journal homepage: www.elsevier.com/locate/image
An efficient FPGA-based dynamic partial reconfiguration design flow and environment for image and signal processing IP cores
B. Krill a,b,*, A. Ahmad b,c, A. Amira a, H. Rabah d
a Nanotechnology and Integrated Bio-Engineering Centre (NIBEC), Faculty of Computing and Engineering, University of Ulster, Jordanstown Campus, Newtownabbey Co. Antrim, BT37 0QB Belfast, Northern Ireland
b Department of Electronic and Computer Engineering, School of Engineering and Design, Brunel University, West London, UB8 3PH Uxbridge, UK
c Department of Computer Engineering, Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia (UTHM), 86400 Batu Pahat, Johor, Malaysia
d Laboratoire d'Instrumentation, Electronique de Nancy, University Henri Poincare, 540003 Nancy, France
Article info
Article history: Received 30 October 2009; Accepted 26 April 2010
Keywords: Dynamic partial reconfiguration (DPR); Design flow; Field programmable gate array (FPGA); IP cores; Image and signal processing

Abstract
This paper describes a dynamic partial reconfiguration (DPR) design flow and environment for image and signal processing algorithms used in adaptive applications. Based on an evaluation of the existing DPR design flow, important features such as overall flexibility, application and standardised interfaces, host applications and DPR area/size placement have been taken into consideration in the proposed design flow and environment. Three intellectual property (IP) cores used in the pre-processing and transform blocks of compression systems, namely colour space conversion (CSC), two-dimensional biorthogonal discrete wavelet transform (2-D DBWT) and three-dimensional Haar wavelet transform (3-D HWT), have been selected to validate the proposed DPR design flow and environment. The results show that the proposed environment offers a scriptable program to establish communication between the field programmable gate array (FPGA) with its IP cores and their host application, power consumption estimation for the partial reconfiguration area, and automatic generation of the partial and initial bitstreams. The design exploration offered by the proposed DPR environment allows the generation of efficient IP cores with optimised area/speed ratios. An analysis of the bitstream size and dynamic power consumption for both static and reconfigurable areas is also presented. © 2010 Elsevier B.V. All rights reserved.
* Corresponding author at: Nanotechnology and Integrated Bio-Engineering Centre (NIBEC), Faculty of Computing and Engineering, University of Ulster, Jordanstown Campus, Newtownabbey Co. Antrim, BT37 0QB Belfast, Northern Ireland. E-mail address: ben@codiert.org (B. Krill).
0923-5965/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.image.2010.04.005

1. Introduction
Image and signal processing is an emerging area, and advances in the state of the art have changed the way its computationally complex algorithms and systems are implemented. The increasing demand for real-time processing while maintaining system performance is of crucial importance and strongly motivates further research in this area. Reconfigurable hardware, especially field programmable gate arrays (FPGAs), is widely used in image and signal processing applications, from simple low-resolution and low-bandwidth applications to very high-resolution and high-bandwidth ones [1]. Owing to their massive parallelism, multimillion gate counts and special low-power packages, FPGAs have attracted a great deal of research and development.

Since image and signal processing for adaptive applications uses several building blocks for its computationally intensive algorithms, the complexity of addressing and accessing massive amounts of data in real time poses considerable challenges from a hardware implementation point of view. To address these issues, FPGAs with an efficient reconfigurability mechanism should be deployed to meet the requirements in terms of speed, area (size), power consumption and throughput. Dynamic partial reconfiguration (DPR) is a promising solution for reducing the hardware required to implement an efficient design for image and signal processing algorithms used in adaptive applications [2,3]. With this solution, the design can be divided into sub-designs that fit into the available devices and can then be ported onto the reconfigurable hardware when needed.

DPR has been widely studied in various fields [4–18]. However, current DPR design flows and implementations do not provide a set of programs to establish communication between the FPGA and a host computer. Moreover, they cannot estimate the power consumption of the reconfigurable area or automatically generate the partial and initial bitstreams [2,3]. To address these issues, this paper proposes a DPR design flow and environment that accelerates the development of a partial reconfiguration platform, and validates it through a range of intellectual property (IP) cores used in image and signal processing for adaptive applications. Colour space conversion (CSC), two-dimensional biorthogonal discrete wavelet transform (2-D DBWT) and three-dimensional Haar wavelet transform (3-D HWT) IP cores have been selected to validate the feasibility and advantageous features of the proposed DPR design flow and environment.

The paper is organised as follows. An overview of related work on DPR design flows, environments and applications is given in Section 2. Section 3 presents the proposed design flow and environment. Section 4 presents the case study of the IP cores used in image and signal processing adaptive applications. Experimental results and analysis are described in Section 5. Finally, concluding remarks are given in Section 6.
2. Related work
2.1. DPR design flow and environment
Shoa and Shirani [2] thoroughly explain different issues in run-time reconfigurable (RTR) systems, list implemented systems that support RTR, and discuss different applications and the improvements achieved by applying RTR. An evaluation of DPR for signal and image processing is presented in [3]; the authors present the advantages and limitations of DPR in professional electronics applications and provide guidelines to improve its applicability. Furthermore, the missing elements of the design flow to be used in DPR are identified and explained. Mitra et al. [19] describe a software tool for automatically generating the communication interface between the IP core based co-processing on the Virtex-4 FX FPGA and the host application running on the central processing unit (CPU). The proposed implementation generates hardware wrappers for the core, and invoking the core resembles a C-function call in the source code. Moreover, the same static wrapper is used for multiple cores and allows users to select the core to be invoked in the program. The DPR design flows utilised in sensor and software-defined radio applications are discussed in [14,15], respectively. These papers give an overview of the design flow together with its advantages and disadvantages.

As one of the key players in the FPGA industry, Xilinx initially proposed the difference-based and module-based methodologies [16,17] and the early access partial reconfiguration (EAPR) flow [18]. It is worth noting that, although the difference-based approach is suitable for small changes in the bitstream, it is not suitable for large dynamically reconfigurable modules. This fact led to the introduction of the modular approach. In addition, the EAPR design flow suffers from the limitation that the partial bitstreams for a module to be executed on a reconfigurable region must be predetermined. Therefore, it can be classified as semi-partial dynamic in nature. To allow a better and more efficient system implementation, simplification of the design flow is important.

On the whole, the studies carried out reveal that there are opportunities to be explored. Table 1 summarises the significant features to be considered in the proposed DPR design flow and environment, together with a comparison with the Xilinx DPR design flow. All these features have been taken into consideration in the proposed DPR design flow and environment and are discussed further in Section 3.

Table 1
Important features for DPR techniques.

Features                   Xilinx DPR [16–18]
Overall flexibility        Total flexibility
Application interfaces     No fixed interfaces
Standardised interfaces    No standard specification
Host applications          Need to be programmed
DPR area place/size        Manual placement by users

The features can be explained as follows:
1. Overall flexibility: Flexibility is significant for the user/developer. With Xilinx DPR, the developer can exploit the full flexibility. The framework proposed in this paper can be considered fixed, since modifications can only be made to the DPR area.
2. Application interfaces: This feature concerns the implementation of the interfaces. In Xilinx DPR, the user/developer needs to establish the interfaces, whereas the proposed DPR framework has a defined interface to the host personal computer (PC).
3. Standardised interfaces: The framework provides a standardised interface which the user has to satisfy. In contrast, without a framework the user needs to specify or implement everything manually.
4. Host applications: The framework includes the host PC programs that communicate with the FPGA.
5. DPR area place/size: The framework adjusts the area automatically during the compilation flow, whereas in Xilinx DPR the user needs to do this manually.
2.2. IP cores in image and signal processing for adaptive applications using DPR
To validate the efficiency of the proposed DPR design flow and environment, image and signal processing algorithms for adaptive applications have been chosen as the target system. Several approaches reveal the advantages of DPR in image and signal processing, including an FPGA-based discrete cosine transform (DCT) architecture [7,13], real-time image interpolation [8,9], a dynamic image processor [10] and the advanced encryption standard (AES) [12]. However, no existing work has been reported on specific IP cores such as CSC, DBWT and HWT that can be used in image and signal processing for adaptive applications.

3. Proposed DPR design flow and environment
3.1. Design flow
This section explains the proposed design flow and environment, including the static and reconfigurable areas. The design flow used for the system architecture implementation is illustrated in Fig. 1. To implement the DPR technique efficiently, five stages are executed for the IP cores.

Fig. 1. Proposed design flow for partial reconfiguration.

1. Integration of IP cores and definition of the wrapper: All IP cores that are to be used as a partial module must implement the proposed entity. The entity is designed with input and output interfaces. With this design, it is possible to communicate with the IP cores from the outside, and each IP core can communicate independently with all modules connected to the Wishbone bus, including the memory controller, the liquid crystal display (LCD) and the universal serial bus to Wishbone (USB2WB) bridge. For debugging purposes, the entity also provides an interface to light emitting diodes (LEDs), which can be used to give status information easily.
2. Synthesis of the IP cores: This step provides information on all integrated IP cores. It is a very important step, since it provides the required area size of each module. Moreover, the netlists are generated.
3. Adjustment of the partial reconfiguration region area and size: Step 2 yields the size of each IP core. This step collects all sizes and adjusts the partial reconfiguration area to the biggest size needed. It also considers all the resources needed, such as slices, block random access memory (BRAM) and digital signal processor (DSP) blocks. After all information is collected and processed, the required constraints are exported for the actual build and bitstream generation.
4. Place and route (PAR) for the static and partial regions: In this step, PAR is performed for the static and partial areas. The static area build has to be executed first, and the information on some routed signals is transferred to the partial module build. The constraint information generated for the partial area in the previous step is applied.
5. Assembly of the bitstreams: This step collects all built native circuit description (.ncd) files. The programs PR verify design and PR assemble are then called to verify the build and to extract the partial modules in order to generate the partial bitstreams. Finally, the initial bitstream and all specified partial bitstreams are available.
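Step 3 of the design flow can be illustrated with a small sketch. The Python below is a hypothetical stand-in and not the framework's own back end: the `Resources` type, the `required_region` helper and all resource figures are assumptions made for illustration. It takes per-core requirements, such as those reported by synthesis, and returns the per-resource maximum, which is the smallest region able to host any one of the cores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Resource needs of one partial module (hypothetical record layout)."""
    slices: int
    bram: int
    dsp: int


def required_region(cores):
    """Return the per-resource maximum over all cores.

    The region must fit the largest demand for each resource type, so the
    result can exceed what any single core needs in total.
    """
    if not cores:
        raise ValueError("at least one IP core is required")
    return Resources(
        slices=max(c.slices for c in cores.values()),
        bram=max(c.bram for c in cores.values()),
        dsp=max(c.dsp for c in cores.values()),
    )


# Illustrative numbers only; they are not taken from the paper.
cores = {
    "core_a": Resources(slices=400, bram=12, dsp=0),
    "core_b": Resources(slices=200, bram=64, dsp=0),
    "core_c": Resources(slices=450, bram=0, dsp=2),
}
region = required_region(cores)
print(region)
```

Taking the maximum per resource type rather than per core matters here: in the example no single core needs both 64 BRAMs and 2 DSP blocks, yet the region has to provide both.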
3.1.1. DPR framework flow
The framework comprises a graphical user interface (GUI), low-level communication programs and build programs. The GUI (written in Qt [20]) is used to create a flow sheet and to determine the order in which the DPR IP cores should be programmed into the FPGA. Furthermore, it calls the back-end scripts (written in Bash [21]) to build the whole framework and the IP cores. Implementing the back end as scripts allows the same scripts to be called from a console without a GUI. The Bash scripts use the Xilinx environment to build the netlists and bitstreams. Fig. 2 illustrates the described workflow during the design phase.

Fig. 2. DPR framework flow.

The common/static framework is written in VHDL. The communication programs on the host are written in C and use libraries such as libusb to establish universal serial bus (USB) communication in user space. As already mentioned, the framework also provides host programs to communicate with the FPGA. A library was created to provide an application programming interface (API) for other user programs. The library provides write and read functions, which are propagated to the FPGA Wishbone bus. This gives access to all modules connected to the bus, so the user can, for example, modify configurations inside the IP core.

Fig. 3 shows an example of the program flow. At the beginning, the user program reads a file and writes the data into the FPGA double data rate (DDR2) memory. After the data has been written, the program sends a signal to the FPGA to activate the IP core, which processes the data in memory. After data processing is completed, the data can be read back from the memory.

Fig. 3. Example of the program flow procedures.

4. Case study: IP cores for image and signal processing adaptive applications
4.1. Overview of adaptive image and signal processing applications
Fig. 4 illustrates an application where adaptivity is important, including the transform, quantisation and entropy coding blocks of a compression system. In each block, buffers are used to store intermediate results. Our goal is to propose an adaptive compression system for medical images in which all blocks are reconfigurable. This paper focuses only on the pre-processing and transform blocks, which is the main reason for selecting CSC and the transforms (2-D DBWT and 3-D HWT) to validate the proposed DPR design flow and environment. DPR is used because it allows the type of algorithm (HWT, DBWT, etc.) to be exchanged, thus making the architecture flexible. It also helps to examine the best trade-offs between area, power consumption and hardware performance.

4.2. Proposed system architecture
The proposed system architecture, illustrated in Fig. 5(a)–(e), is briefly explained here, including the IP cores used, the top-level architecture of the proposed system environment and framework, the architecture for the CSC and the block diagrams for the 2-D DBWT and 3-D HWT. There are three areas in the system environment and architecture: the application and operating system, the static area and the reconfigurable area. Fig. 5(b) illustrates the details of the working system and the communication channel between the host and the FPGA and among the different components.
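The host-side sequence described for Fig. 3 (write the data, start the core, read the result back) can be sketched against a simulated bus. This is only an illustration in Python: the real host programs are written in C on top of libusb, and the `SimulatedBus` class, the `run_program_flow` function, the control register address and the command and status values are all hypothetical names chosen for this sketch.

```python
class SimulatedBus:
    """Software stand-in for the FPGA side of the bus (no hardware access)."""

    CTRL_ADDR = 0xFFFF0000  # hypothetical control/status register address
    START = 1               # hypothetical command value: start the IP core
    DONE = 1                # hypothetical status value: processing finished

    def __init__(self, core):
        self.memory = {}    # address -> 32-bit word, models the DDR2 memory
        self.core = core    # function applied to every stored word
        self.status = 0

    def write(self, addr, value):
        if addr == self.CTRL_ADDR:
            if value == self.START:
                # Model the IP core processing all data held in memory.
                for a in self.memory:
                    self.memory[a] = self.core(self.memory[a]) & 0xFFFFFFFF
                self.status = self.DONE
        else:
            self.memory[addr] = value & 0xFFFFFFFF

    def read(self, addr):
        if addr == self.CTRL_ADDR:
            return self.status
        return self.memory.get(addr, 0)


def run_program_flow(bus, words, base=0, max_polls=1000):
    """Write input words, activate the core, wait, then read results back."""
    for i, word in enumerate(words):
        bus.write(base + 4 * i, word)
    bus.write(bus.CTRL_ADDR, bus.START)
    for _ in range(max_polls):
        if bus.read(bus.CTRL_ADDR) == bus.DONE:
            break
    else:
        raise TimeoutError("IP core did not report completion")
    return [bus.read(base + 4 * i) for i in range(len(words))]


# The "core" here simply inverts every 32-bit word.
bus = SimulatedBus(core=lambda w: ~w)
result = run_program_flow(bus, [0x00000000, 0x00000001, 0xFFFFFFFF])
print([hex(w) for w in result])
```

Keeping the flow behind plain `write` and `read` calls mirrors the library described in Section 3.1.1, so the same sequence would apply regardless of whether the transport is USB or another link.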
The host part consists of command line tools and low-level drivers that can be changed within the GUI. The current implementation uses a USB connection to the FPGA, which could be replaced by a peripheral component interconnect express (PCI-E) connection. The static area on the FPGA is defined for all modules and must remain active during the FPGA duty cycle, while the reconfigurable area is used for the IP cores (CSC, 2-D DBWT and 3-D HWT). All components used are Wishbone bus compliant and connected via a shared bus interconnection.

Fig. 4. Proposed system applications.

Fig. 5. Proposed system architecture framework. (a) IP cores. (b) Proposed top-level architecture. (c) Parallel architecture for CSC based on DA principles. (d) Block diagrams for the separable 2-D DBWT and non-separable 2-D DBWT for two levels of decomposition. (e) Block diagram for the 3-D HWT with transpose based computation.

4.2.1. Static area
The static area serves as the communication zone for the components: the memory controller interface, the USB2WB interface, LCD visualisation, the internal configuration access port (ICAP) and clock domain translation. The memory controller interface provides a Wishbone connection and translation to the Xilinx memory interface generator (MIG) controller. Moreover, it provides a 256-byte cache to improve the read performance for accesses within a memory region. The core runs with a DDR2 frequency of 133 MHz and an interface frequency of 200 MHz, which provides a non-cached read performance of 256 MB/s and a write performance of 386 MB/s. The USB2WB interface is a component connected to a Cypress EZ-Host programmable embedded chip that has been configured as a USB host/peripheral controller [22].

4.2.2. Reconfigurable area
The partial reconfiguration area is reserved for the IP cores that can be exchanged for the system application. To provide enough space (slices) for the IP cores, the area size is calculated and is always defined by the largest IP core. There are two connections to the Wishbone bus, one acting as a slave and one as a master. With these connections, the core can be configured and can read data from and write data to the memory controller.

4.2.3. Selected IP cores
1. CSC: A colour space is a technique to specify, create and visualise colour. There are many existing colour spaces; the most popular models are RGB (used in computer graphics), YIQ, YUV and YCrCb (used in video systems) and CMYK (used in colour printing). The RGB colour space is simple and uses three numerical components to represent a colour. It can be represented using a 3-D coordinate system whose axes correspond to the three components, R or Red, G or Green, and B or Blue. On the other hand, Y'CrCb is a scaled and offset version of the YUV colour space based on luminance and chrominance, which correspond to brightness and colour. In this colour space, R'G'B' is separated into a luminance part (Y') and two chrominance parts (Cb and Cr). A general block diagram for R'G'B' to Y'CrCb is shown in Fig. 6. The CSC IP core used in this DPR environment is based on distributed arithmetic (DA) principles [23], and its architecture is shown in Fig. 5(c).

Fig. 6. Block diagram for R'G'B' to Y'CrCb.

DA provides an efficient method of computing vector multiplication by means of a bit-level rearrangement of the multiply-accumulate process; this is called the ROM-based DA method. It decomposes the variable input of the inner product to bit level in order to generate precomputed data. The basic operations required to perform a DA-based inner product are a sequence of ROM look-up table accesses, additions, subtractions and shift operations on the input data sequence.

2. DBWT: The discrete wavelet transform (DWT) plays a significant role in image and signal processing applications as an alternative to classical time-frequency representation techniques such as the discrete Fourier transform (DFT) and DCT [24]. With its multi-resolution characteristics and its capability to represent real-life non-stationary signals such as image and speech, the DWT has attracted a great deal of research and development. The 2-D DWT is a multi-level decomposition technique which provides an efficient method of analysing signals at different frequency bands. The block diagrams for both the separable and non-separable 2-D DWT are depicted in Fig. 5(d). Each decomposition level $j$ can be seen as the further decomposition of a 2-D data set $I_{j-1}$ (having $N_{j-1} \times N_{j-1}$ samples) into four sub-bands $LL_j$, $LH_j$, $HL_j$ and $HH_j$ (each having $N_{j-1}/2 \times N_{j-1}/2$ samples, where $N$ is the number of input data samples) [25].
3. 3-D HWT: As the simplest wavelet transform, the Haar wavelet $\psi_{haar}$ is a discontinuous, symmetric wavelet in the Daubechies family, and the only one which has an explicit expression. The scale function $\phi_{haar}$ is a simple average function. The wavelet and the scale functions can be expressed as follows:

\[
\psi_{haar}(t) =
\begin{cases}
1 & \text{if } 0 \le t < \tfrac{1}{2} \\
-1 & \text{if } \tfrac{1}{2} \le t < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

\[
\phi_{haar}(t) =
\begin{cases}
1 & \text{if } 0 \le t < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]

In addition, the HWT is simple and computationally cheap because it can be implemented with a few integer additions, subtractions and shift operations [26]. The mathematical features of the basis are as follows: it is the most simplistic wavelet basis, it can be implemented using pairwise averaging and differencing, it is both unitary and orthogonal, and it has compact support. Therefore, this wavelet basis is by no means the most suitable for achieving close to optimal compression performance in image compression systems. Fig. 5(e) shows a block diagram of the 3-D HWT IP core with transpose based computation, which is used as one of the IP cores in this DPR design flow and environment.

5. Experimental results and analysis
Fig. 7 illustrates the partial reconfiguration design flow of the framework. Action points 1–3 have to be done only once, since the framework specifies the top-level design. Currently, the user cannot change the top-level design. Points 4 and 5 are performed automatically by the framework during the build process. The size is adjusted to the biggest IP core used. Depending on the size of the DPR area, the bus macros are placed at the edges of the area. The proposed framework specifies an interface which provides a read and write connection to the Wishbone bus. This makes it possible to read from the core and for the core to write to the Wishbone bus.

The current Xilinx design flow for partial reconfiguration uses a module-based DPR, which requires bus macros to integrate the static and partial areas [27]. Configuration frames are reconfigured, and bus macros are used to connect the DPR areas with the static area [18]. Another solution, based on direct bitstream manipulation, is presented in [28]. Equations are required to locate components such as configurable logic blocks (CLBs) and block random access memories (BRAMs) inside the bitstream. However, this solution is currently only applicable to Virtex-II FPGA devices. A hierarchical analysis tool offered by Xilinx, PlanAhead [29], is discussed in [30]; its design flow requires manual placement of the partial area and bus macros, which leads to a complex design flow and system implementation.

In contrast, the proposed DPR environment presented in this paper provides the following advantages:
1. It has a scriptable program to establish communication between the FPGA with its IP cores and their application. It also avoids user errors in bus macro placement and partial area selection.
2. It calculates an estimate of the power consumption for the partial reconfiguration area based on the equations presented in Section 5.2.
3. It automates the generation of the partial and initial bitstream files.
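The pairwise averaging and differencing attributed to the Haar transform in Section 4.2.3 can be made concrete with a short sketch. The Python below shows one level of a 1-D integer Haar transform using only additions, subtractions and shifts. It uses a lifting-style formulation chosen here because it is exactly invertible on integers; this is an assumption for illustration and not necessarily the arithmetic used inside the authors' 3-D HWT core. The function names are hypothetical.

```python
def haar_forward(samples):
    """One level of an integer Haar transform (pairwise average/difference).

    Returns (approximation, detail). Uses only add, subtract and shift.
    """
    if len(samples) % 2:
        raise ValueError("input length must be even")
    approx, detail = [], []
    for a, b in zip(samples[0::2], samples[1::2]):
        d = a - b              # pairwise difference
        s = b + (d >> 1)       # equals floor((a + b) / 2), the pairwise average
        approx.append(s)
        detail.append(d)
    return approx, detail


def haar_inverse(approx, detail):
    """Exactly undo haar_forward on integer data."""
    samples = []
    for s, d in zip(approx, detail):
        b = s - (d >> 1)
        a = b + d
        samples.extend([a, b])
    return samples


approx, detail = haar_forward([9, 7, 3, 5])
print(approx, detail)
print(haar_inverse(approx, detail))
```

A 3-D transform can be built from such a 1-D step by applying it along each of the three axes in turn, which is where a transpose based organisation of the data becomes useful.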
Fig. 7. Partial reconfiguration design flow. (a) Steps for partial design flow. (b) Define static and reconfigurable modules.

5.1. Bitstream utilisation
An evaluation of the proposed DPR environment with the three IP cores provides information on the generated bitstream files and configuration sizes. Fig. 8 illustrates the relationship between the size of the generated bitstreams and the components used, including CLBs, RAMB36, DSP48E and input/output (I/O). This information is used in the internal calculation of the bitstream size and reconfiguration time of each IP core. To reduce the generated bitstream size, the proposed design flow also provides information on the available components and on a suitable area in which to place them.

The interface used for the partial area requires a minimum of 12 CLBs, because one CLB holds two bus macros with four inputs/outputs. Bus macros are needed to connect the static and partial areas. The number of CLBs follows from the 32-bit data in/out buses, the 32-bit address buses and eight control signals. To use the DDR2 memory, a minimum of 25 CLBs is also required. It is worth mentioning that partial reconfiguration is performed in frames; the minimum frame in Virtex-5 devices spans 20 CLBs, and a bitstream for both buses has a minimum size of 153 kBytes.

Fig. 8. Bitstream utilisation. [Plot: bitfile size (bytes) versus component count for SLICE, RAMB36, DSP48E and IO.]

Table 2
Synthesis results and power consumption for calculation of a 256 × 256 pixel image at 133 MHz.

Modules     Registers  LUTs  BRAM  DSP  Troute  Static power  Dynamic power
Framework   3075       2755  15    0    6       1.42657       1.77131
CSC         1563       1130  12    0    NA      0.01598       0.09079
2-D DBWT    623        345   64    0    NA      0.00623       0.03561
3-D HWT     1337       1305  0     2    NA      0.02074       0.12177

5.2. Dynamic area utilisation
The framework adjusts the dynamic area automatically. During the build process, the log files of all partial modules used are analysed and the required logic size is extracted. The area is adjusted to the biggest logic size needed, and the constraint file that specifies the DPR area is then rewritten with the new values.

The total power used by an FPGA device can be defined as the sum of static (StP) and dynamic (DP) power [31]. StP depends inherently on the architectural layout of the FPGA itself and is technology dependent. For Xilinx SRAM FPGAs, DP is classified into clock, signal, logic, input and output power and can be defined as follows:

\[
DP = \sum_{i} C_i V_i^2 f
\tag{3}
\]

where $C$ is the capacitance of the node, $V$ is the core voltage and $f$ is the switching frequency. In this work, the power components considered relevant are logic, hard cores (such as BRAM), signal and clock. The framework does not support the use of I/O components inside the partial reconfiguration area, so these components are ignored in the calculation.

Eq. (4) gives an average calculation of DP for a given area, where the sum of the dynamic clock power (DPc) reflects the power used by CLBs and hard cores, while ci denotes the component count. The arithmetic mean over all CLB inputs and outputs (DPio) provides an estimate of the signal power, and m is defined by the Virtex-5 CLB structure. Virtex-5 CLBs have 4 × 6 look-up-table (LUT) inputs and four flip-flop and multiplexer outputs.
\[
DP_A = \sum_{i}^{n} \left[ DP_c\, c_i + \frac{1}{m} \sum_{s}^{m = (4 \cdot 6 + 8)\, n} DP_s \right]
\tag{4}
\]
Table 3
IP cores performance comparison with existing implementations for 2-D DBWT.

Parameters          Proposed     2D-I [32]       2D-II [32]
FPGA                XC5VLX110T   Virtex1000E-6   Virtex1000E-6
Area (slices)       763          2221            434
Max. freq. (MHz)    150          78              105
Area/speed (ratio)  5.086        28.47           41.40
The power consumption can be reduced by shrinking the partial area to the minimum size of the largest core. This can be done by modifying the user constraint file (UCF) of the PAR tool. A further analysis, of the power consumption during the reconfiguration process, is given in Eq. (5). The first part of the equation represents the dynamic power consumption of the logic used for the reconfiguration of the partial area. The second part represents the components of the partial reconfiguration area multiplied by the reconfiguration power consumption constant z, which is needed to reconfigure the different modules used:
\[
DP_R = \sum_{i}^{300} DP_c\, c_i + \sum_{i}^{n} c_d\, z
\tag{5}
\]
Table 4
IP cores performance comparison with existing implementations for CSC and 3-D HWT.

Parameters          CSC proposed   CSC-I [33]   CSC-II [33]   3-D HWT proposed
FPGA                XC5VLX110T     XCV50E-8     XCV50E-8      XC5VLX110T
Area (slices)       152            70           193           3,140
Max. freq. (MHz)    213            128          234           360
Area/speed (ratio)  0.71           0.55         0.82          8.72
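The area/speed figures in Tables 3 and 4 appear to be the slice count divided by the maximum frequency in MHz. That reading is an inference from the tabulated numbers rather than a definition stated in the text, so the sketch below should be taken as a check under that assumption. It recomputes the ratio for a subset of the table entries; the `area_speed_ratio` helper and the dictionary layout are hypothetical.

```python
def area_speed_ratio(area_slices, max_freq_mhz):
    """Area/speed ratio, assumed here to be slices per MHz."""
    return area_slices / max_freq_mhz


# (area in slices, max. frequency in MHz, ratio as printed in the tables)
reported = {
    "2-D DBWT proposed": (763, 150, 5.086),
    "2D-I [32]": (2221, 78, 28.47),
    "CSC proposed": (152, 213, 0.71),
    "CSC-I [33]": (70, 128, 0.55),
    "CSC-II [33]": (193, 234, 0.82),
    "3-D HWT proposed": (3140, 360, 8.72),
}

for name, (area, freq, printed) in reported.items():
    ratio = area_speed_ratio(area, freq)
    print(f"{name}: computed {ratio:.3f}, printed {printed}")
```

Under this definition a smaller ratio is better, since it means fewer slices are spent per MHz of achievable clock rate.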
Fig. 9. Maximum frequencies with different DPR area sizes. [Plot: maximum frequency (MHz) versus area (slices) for Static, 3-D HWT, CSC and 2-D DBWT.]
Table 5
Power consumption (mW) of different FPGAs at 133 MHz.

Framework              Static power   Dynamic power   Total power
Static without cores   1.44749        1.89353         3.34103
Framework on 110T      1.42657        1.77131         3.19788
Framework on 50T       0.70593        1.62071         2.32665

Note: 110T: XC5VLX110T; 50T: XC5VLX50T.
5.3. IP cores analysis
The proposed DPR environment with the three IP cores has been implemented on the Xilinx Virtex-5 (XC5VLX110T-3FF1136) using the Xilinx ISE [29] design tool. Synthesis results and power consumption for the whole framework and for the IP cores without a partial area are shown in Table 2. The FPGA implementation performance of the IP cores and a comparison with existing implementations are given in Tables 3 and 4. Because the FPGA platforms and internal structures differ, the area/speed ratio is used as the comparable performance measure, and the results obtained reveal that all the IP cores have a better area/speed ratio than the existing implementations. Since this study presents the first hardware implementation of the 3-D HWT, no comparison is available for it.

The framework is built with a clock period constraint of 7.5 ns. Fig. 9 shows the maximum frequencies achieved with different DPR area sizes for each IP core. The minimum size of the biggest core is 3022, which is the minimum DPR area size. It can be seen that the clock frequency decreases as the DPR area grows. This is due to the framework logic, especially the DDR2 memory controller, which needs space to meet timing. The figure also shows that the optimal working point of the framework and all IP cores is 133 MHz with a DPR area size of 4004 slices. This area is a rectangle of 44 × 91 slices placed on the side opposite the DDR2 I/O pins. This placement is a design decision that depends on the development board and can only be changed by moving to a different board. The rectangle is also placed at a position where BRAM can be used by the IP core.

5.4. Framework analysis
The proposed framework, with the static part and the selected IP cores and without the cores, has been implemented on two different FPGA devices. Table 5 lists the StP, DP and total power consumption for the static build without cores and for the framework on the XC5VLX110T and XC5VLX50T devices. The results show that the static power consumption of the static build without cores is higher than that of the DPR framework on both devices, as a result of the built-in IP cores. The IP cores could not be used in parallel, which leads to the same flow as the DPR framework. To illustrate the impact of power minimisation with DPR, the framework has also been implemented on a smaller device, which results in an overmapping issue for the framework with each IP core.
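The relation between the columns of Table 5 follows the definition given in Section 5.2, where total power is the sum of static and dynamic power. The sketch below recomputes the totals from the tabulated values and derives the relative saving between the two framework builds. The `total_power` and `saving_percent` helpers are hypothetical names, and the saving figure is a derived illustration rather than a number quoted in the paper.

```python
# (static, dynamic, total) in the units used by Table 5
table5 = {
    "Static without cores": (1.44749, 1.89353, 3.34103),
    "Framework on XC5VLX110T": (1.42657, 1.77131, 3.19788),
    "Framework on XC5VLX50T": (0.70593, 1.62071, 2.32665),
}


def total_power(static, dynamic):
    """Total power as the sum of static and dynamic power."""
    return static + dynamic


def saving_percent(reference, value):
    """Relative reduction of value with respect to reference, in percent."""
    return 100.0 * (reference - value) / reference


for name, (static, dynamic, printed_total) in table5.items():
    print(f"{name}: computed {total_power(static, dynamic):.5f}, "
          f"printed {printed_total}")

saving = saving_percent(
    table5["Framework on XC5VLX110T"][2],
    table5["Framework on XC5VLX50T"][2],
)
print(f"Total power saving, 50T versus 110T: {saving:.1f}%")
```

The recomputed totals agree with the printed ones to within rounding of the last digit.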
6. Conclusions

This paper discusses the advantages of the proposed novel DPR design flow and environment for image and signal processing used in adaptive applications. The proposed approach is a fully custom design, while the proposed environment employs a complete framework that only requires the user to provide the DPR modules for FPGA implementation. The results show that the proposed IP cores have a better area/speed ratio than existing implementations. Ongoing research focuses on the evaluation and implementation of other optimised IP cores for computationally intensive applications, such as real-time medical imaging, which require DPR for better implementation as well as increased system performance.

References
[1] P. Dang, VLSI architecture for real-time image and video processing systems, Journal of Real-Time Image Processing 1 (October 2006) 57-62.
[2] A. Shoa, S. Shirani, Run-time reconfigurable systems for digital signal processing applications: a survey, The Journal of VLSI Signal Processing 39 (March 2005) 213-235.
[3] P. Manet, D. Maufroid, L. Tosi, G. Gailliard, O. Mulertt, M. Di Ciano, J.-D. Legat, D. Aulagnier, C. Gamrat, R. Liberati, V. La Barba, P. Cuvelier, B. Rousseau, P. Gelineau, An evaluation of dynamic partial reconfiguration for signal and image processing in professional electronics applications, EURASIP Journal on Embedded Systems 2008 (January 2008) 1-11.
[4] M. Majer, J. Teich, A. Ahmadinia, C. Bobda, The Erlangen slot machine: a dynamically reconfigurable FPGA-based computer, The Journal of VLSI Signal Processing 47 (April 2007) 15-31.
[5] C. Claus, J. Zeppenfeld, F. Muller, W. Stechele, Using partial-runtime reconfigurable hardware to accelerate video processing in driver assistance system, in: Proceedings of the Conference Design, Automation, Test and Exhibition in Europe (DATE '07), Nice, France, April 2007, pp. 1-6.
[6] L. Braun, K. Paulsson, H. Kromer, M. Hubner, J. Becker, Data path driven waveform-like reconfiguration, in: Proceedings of the International Conference on Field Programmable Logic and Applications (FPL 2008), Heidelberg, Germany, 2008, pp. 607-610.
[7] J. Huang, M. Parris, J. Lee, R.F. DeMara, Scalable FPGA-based architecture for DCT computation using dynamic partial reconfiguration, ACM Transactions on Embedded Computing Systems 9 (1) (2009) 1-18.
[8] E. Bourennane, C. Milan, M. Paindavoine, S. Bouchoux, Real time image rotation using dynamic reconfiguration, Real-Time Imaging 8 (4) (2002) 277-289.
[9] R. Hudson, D. Lehn, P. Athanas, A run-time reconfigurable engine for image interpolation, in: IEEE Symposium on FPGAs for Custom Computing Machines, 1998, Proceedings, April 1998, pp. 88-95.
[10] M.R. Boschetti, S. Bampi, I.S. Silva, Throughput and reconfiguration time trade-offs: from static to dynamic reconfiguration in dedicated image filters, in: Lecture Notes in Computer Science: Field Programmable Logic and Application, vol. 3203, August 2004, pp. 474-483.
[11] Y.-J. Oh, H. Lee, C.-H. Lee, Dynamic partial reconfigurable FIR filter design, in: Lecture Notes in Computer Science: Reconfigurable Computing: Architectures and Applications, vol. 3985, August 2006, pp. 30-35.
[12] Z.E.A.A. Ismaili, A. Moussa, Self-partial and dynamic reconfiguration implementation for AES using FPGA, IJCSI International Journal of Computer Science Issues 2 (August 2009) 33-40.
[13] J. Gause, P.Y.K. Cheung, W. Luk, Static and dynamic reconfigurable designs for a 2D shape-adaptive DCT, in: FPL '00: Proceedings of The Roadmap to Reconfigurable Computing, 10th International Workshop on Field-Programmable Logic and Applications, Springer-Verlag, London, UK, 2000, pp. 96-105.
[14] C. Ibala, K. Arshak, Using dynamic partial reconfiguration approach to read sensor with different bus protocol, in: Sensors Applications Symposium, 2009, SAS 2009, IEEE, February 2009, pp. 175-179.
[15] E. McDonald, Runtime FPGA partial reconfiguration, in: Aerospace Conference, 2008 IEEE, New Orleans, LA, March 2008, pp. 1-7.
[16] Two flows for partial reconfiguration: module based or difference based, Xilinx Inc. XAPP290 Version 1.1, Technical Report, 2003.
[17] Two flows for partial reconfiguration: module based or difference based, Xilinx Inc. XAPP290 Version 1.2, Technical Report, 2004.
[18] P. Lysaght, B. Blodget, J. Mason, J. Young, B. Bridgford, Invited paper: enhanced architectures, design methodologies and CAD tools for dynamic reconfiguration of Xilinx FPGAs, in: International Conference on Field Programmable Logic and Applications, 2006, FPL '06, August 2006, pp. 1-6.
[19] A. Mitra, Z. Guo, A. Banerjee, W. Najjar, Dynamic co-processor architecture for software acceleration on CSoCs, in: International Conference on Computer Design, 2006, ICCD 2006, October 2006, pp. 127-133.
[20] [Online]. Available: http://qt.nokia.com.
[21] [Online]. Available: http://en.wikipedia.org/wiki/Bash.
[22] EZ-host programmable embedded USB host/peripheral controller, Cypress Semiconductor Corporation, Technical Report, 2003.
[23] F. Bensaali, A. Amira, Accelerating colour space conversion on reconfigurable hardware, Journal of Image and Vision Computing 23 (11) (2005) 935-942.
[24] S.G. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989) 674-693.
[25] I.S. Uzun, A. Amira, Real-time 2-D wavelet transform implementation for HDTV compression, Journal of Real-Time Imaging - Spectral Imaging II 11 (2) (2005) 151-165.
[26] C. Bajaj, I. Ihm, S. Park, 3D RGB image compression for interactive applications, ACM Transactions on Graphics 20 (1) (January 2001) 10-38.
[27] Partial reconfiguration design with PlanAhead, Xilinx Inc., Technical Report, 2008.
[28] Y.E. Krasteva, E. de la Torre, T. Riesgo, D. Joly, Virtex II FPGA bitstream manipulation: application to reconfiguration control systems, in: International Conference on Field Programmable Logic and Applications, 2006, FPL '06, August 2006, pp. 1-4.
[29] [Online]. Available: http://www.xilinx.com.
[30] K. Arshak, E. Jafer, C. Ibala, Improving the performance of an FPGA based model design for sensor monitoring using PlanAhead tool, in: Proceedings of the 2006 IEEE International Behavioral Modeling and Simulation Workshop, September 2006, pp. 91-96.
[31] R. Fischer, K. Buchenrieder, U. Nageldinger, Reducing the power consumption of FPGAs through retiming, in: 12th IEEE International Conference and Workshops on the Engineering of Computer-Based Systems, 2005, ECBS '05, April 2005, pp. 89-94.
[32] I. Uzun, A. Amira, Rapid prototyping framework for FPGA-based discrete biorthogonal wavelet transforms implementation, IEE Proceedings Vision Image and Signal Processing 153 (6) (December 2006) 721-734.
[33] F. Bensaali, A. Amira, Design and efficient FPGA implementation of an RGB to YCrCb color space converter using distributed arithmetic, in: Lecture Notes in Computer Science: Field Programmable Logic and Application, vol. 3203, August 2004, pp. 991-995.


Fig. 1. Proposed design flow for partial reconfiguration.

Fig. 2. DPR framework flow.

Fig. 3. Example of the program flow procedures.

Fig. 4. Proposed system applications.

Fig. 6. Block diagram for R'G'B'2Y'CrCb.

Fig. 8. Bitstream utilisation. [Plot: bitfile size (bytes); labels: SLICE, RAMB36, DSP48E, IO.]

Fig. 9. Maximum frequencies with different DPR area sizes. [Plot: maximum frequency (MHz) versus area (slices); series: Static, 3-D_HWT, CSC, 2-D_DBWT.]

Note: 110T: XC5VLX110T; 50T: XC5VLX50T.
