Sami Hasan, Alex Yakovlev and Said Boussakta School of Electrical, Electronic and Computer Engineering, University of Newcastle upon Tyne, UK {sami.hasan, alex.yakovlev, s.boussakta}@ncl.ac.uk
Abstract
Implementing parallel multidimensional image filtering algorithms on Field Programmable Gate Arrays (FPGAs) now demands more than low-level, line-by-line hardware description language (HDL) programming. A high-level, abstract, hardware-oriented parallel programming method can bridge this gap. This paper proposes a first step toward such a method for efficiently implementing parallel 2-D MRI image filtering algorithms using the Xilinx System Generator. The implementation method consists of five simple steps that provide fast FPGA prototyping for high-performance computation with excellent quality of results. Results are obtained for nine 2-D image filtering algorithms. At the behavioural level, two Virtex-6 FPGA boards, namely xc6vlX240Tl-1lff1759 and xc6vlX130Tl-1lff1156, are targeted, achieving power consumption of 1.57 W and 0.97 W respectively at maximum sampling frequencies of up to 230 MHz. One of the nine MRI image filtering algorithms is then empirically improved to produce enhanced MRI image filtering with moderately lower power consumption at a higher maximum frequency.

I. INTRODUCTION
FPGAs are increasingly used in modern parallel algorithm applications such as medical imaging [1], DSP [2], image filtering [3], low-power portable image processing [4], MPEG-4 motion estimation in mobile applications [5], satellite data processing [6], the new Mersenne number transform [7][8], high-speed wavelet-based image compression [9] and even global communication links [10]. However, most of these FPGA-based solutions are programmed with low-level hardware description languages (HDLs) inherited from ASIC design methodologies [11]. On the other hand, parallel multidimensional image filtering algorithms [12] for the aerospace, defence, digital communications, multimedia, video and imaging industries demand large numbers of computationally complex operations [13][14] at maximum sampling frequency.
Traditional DSP processor arrays, with fixed architectures and relatively short lifetimes, are costly to program line by line with thousands of lines of code [15][16]. Alternatively, this paper presents a high-level abstract implementation method that fills the present programming gap between parallel algorithm coding and final FPGA implementation. The proposed FPGA implementation method is based on the Xilinx System Generator development tool [17] within the ISE 11.3 development suite. This tool is a system-level block-diagram modelling environment that facilitates FPGA hardware implementation from bit-accurate and cycle-true specifications of parallel multidimensional filtering algorithms. The new method is tested on the implementation of nine 2-D digital image filtering algorithms: Edge, Sobel X, Sobel Y, Sobel X-Y, Blur, Smooth, Sharpen, Gaussian and Identity [13][14], targeting two Virtex-6 FPGA [18] boards, namely xc6vlX240Tl-1lff1759 and xc6vlX130Tl-1lff1156.

II. PARALLEL 2-D IMAGE FILTERING ALGORITHMS
Parallel 2-D MRI filtering algorithms are image processing algorithms based on 5x5 convolution kernel masks. Generally, the parallel architecture of these algorithms consists of two input matrices, a 2-D convolution array for processing, and a parallel-to-serial reconstructed output matrix, as shown in Fig. 1.
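As a minimal sketch of the mask-based filtering described above, the following Python/NumPy code convolves an image with a 5x5 kernel by direct summation. The function name `conv2d_full`, the synthetic 64x64 test image and the single-tap identity mask are illustrative assumptions; they are not taken from the authors' System Generator design.

```python
import numpy as np


def conv2d_full(x, h):
    """Full 2-D convolution of image x with kernel h (hypothetical helper).

    For an N x N image and an M x M kernel the output is
    (N + M - 1) x (N + M - 1).
    """
    n_rows, n_cols = x.shape
    m_rows, m_cols = h.shape
    y = np.zeros((n_rows + m_rows - 1, n_cols + m_cols - 1), dtype=float)
    # Each kernel tap adds a shifted, scaled copy of the image to the output.
    for k1 in range(m_rows):
        for k2 in range(m_cols):
            y[k1:k1 + n_rows, k2:k2 + n_cols] += h[k1, k2] * x
    return y


# Assumed identity mask: a single 1 at the centre of a 5x5 kernel.
identity = np.zeros((5, 5))
identity[2, 2] = 1.0

# Synthetic stand-in for a 64x64 grayscale scan (not real MRI data).
image = np.arange(64 * 64, dtype=float).reshape(64, 64)
filtered = conv2d_full(image, identity)

# The central 64x64 region of the full output reproduces the input image.
print(filtered.shape, np.array_equal(filtered[2:66, 2:66], image))
```

Any of the nine 5x5 masks could be passed as `h` in the same way; the identity mask is used here only because its result is easy to check by eye.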
Figure 1. The architecture of the parallel 2-D MRI image filtering algorithms.

SIP-6 978-1-86135-369-6/10/$25.00 2010 IEEE 765 CSNDSP 2010

Generally, let the original image, $x(n_1, n_2)$, be of size (N x N) and the kernel, $h(m_1, m_2)$, be of size (M x M). The output image, $y(n_1, n_2)$, can then be expressed by the 2-D convolution formula:

\[
y(n_1, n_2) = \sum_{m_1=0}^{N-1} \sum_{m_2=0}^{N-1} x(m_1, m_2)\, h(n_1 - m_1, n_2 - m_2) \qquad (1)
\]

where $0 \le n_1, n_2 < N+M-1$. Moreover, the 2-D image is equally subdivided into small sub-images of size ((N/n) x (N/n)), which are convolved independently:

\[
y(n_1, n_2) = \sum_{m_1=0}^{(N/n)-1} \sum_{m_2=0}^{(N/n)-1} x(m_1, m_2)\, h(n_1 - m_1, n_2 - m_2) \qquad (2)
\]
where $0 \le n_1, n_2 < (N/n)+M-1$. Nine 5x5 convolution kernels are used for the parallel 2-D MRI image filtering algorithms. One of the nine algorithms, the Edge algorithm, is empirically modified with a new Edge enhancement orthogonal kernel matrix to enhance fine detail in images:

\[
\text{New Edge} =
\begin{bmatrix}
0 & 0 & -0.125 & 0 & 0 \\
0 & 0 & -0.125 & 0 & 0 \\
-0.125 & 0.125 & 2.00 & -0.125 & 0.125 \\
0 & 0 & -0.125 & 0 & 0 \\
0 & 0 & -0.125 & 0 & 0
\end{bmatrix}
\qquad (3)
\]

The Edge algorithm is selected after the first round of performance results in Table I, where it shows a noticeably wide performance span between the two boards.

III. THE PARALLEL 2-D ALGORITHMS CAPTURE
These parallel 2-D MRI image filtering algorithms can be behaviourally captured as a stream-based synchronous dataflow model using the System Generator libraries. The clock and its corresponding enable logic do not appear in the System Generator block diagram but are generated internally when the FPGA implementation is compiled within the Xilinx/Simulink environment. The 2-D convolution operation in (1) can be implemented as an n-tap MAC FIR filter [13][14][17]. Consequently, the parallel 2-D image filtering algorithms can be realized efficiently using n-tap MAC FIR filters with nine programmable coefficient sets. A more abstract implementation can be achieved using a 5x5 filter image block, as in Fig. 2. The implementation diagram consists of three stages: MRI input, processing and output.

TABLE I. PERFORMANCE INDICES USING TWO VIRTEX-6 BOARDS
2-D image filtering   Power consumption (W)   Maximum frequency (MHz)
algorithm             X240T      X130T        X240T      X130T
Edge                  1.57       0.97         194        230
SobelX                1.57       0.97         222        228
SobelY                1.57       0.97         202        230
SobelXY               1.56       0.97         223        230
Blur                  1.57       0.97         227        226
Smooth                1.57       0.97         230        207
Sharpen               1.57       0.97         214        230
Gaussian              1.57       0.97         230        230
Identity              1.57       0.97         230        230

Figure 2. Xilinx System Generator capture of the nine parallel 2-D image filtering algorithms.

In the first stage, the magnetic resonance imaging (MRI) pixels are sequentially streamed into five Virtex line buffers via a pipelined gateway block. Each line is delayed by 64 samples, and line 5 is a copy of the MRI scan. The second stage consists of five parallel n-tap MAC FIR filters and four adder blocks, a structure provided abstractly by the 5x5 filter block shown in Fig. 2, to filter the 64x64 grayscale MRI scan. Nine different 2-D FIR filters can be applied via the 5x5 filter block: Edge, SobelX, SobelY, SobelXY, Blur, Smooth, Sharpen, Gaussian and Identity. This 2-D FIR filter offers compile-time mask parameters, so any of the nine filter types can be selected, or modified, by changing the mask parameter of the 5x5 filter block. The 2-D filter coefficients are stored in a block RAM, so the stored coefficients can be modified by changing the mask of the 5x5 FIR filter. Each n-tap MAC FIR filter is clocked 5 times faster than the input rate, and the 5x5 filter operates at 213 MHz [17]. The throughput of the design is therefore 213 MHz / 5 = 42.6 million pixels/second. For the 64x64 MRI image, this is 42.6x10^6 / (64x64), or approximately 10,400 frames/second. The third stage is pipelined by inserting a delay block between the 5x5 filter and the gateway boundary block. Its output is displayed via a Simulink block, Fig. 2, that pops up the original MRI image together with the filtered result, as shown in Fig. 3, Fig. 4 and Fig. 5. The single System Generator diagram in Fig. 2 is behaviourally equivalent to 7140 lines of VHDL code and 8423 lines of Verilog code. Those thousands of lines of code must be manually verified, refined and re-entered line by line, which can waste valuable time. Consequently, this paper proposes an FPGA implementation method.

Figure 3. The 2-D MRI images filtered via the Virtex-6 X240T using the 2-D filter types: A. Edge, B. SobelX, C. SobelY, D. SobelXY, E. Blur, F. Smooth, G. Sharpen, H. Gaussian, I. Identity.

Figure 4. The 2-D MRI images filtered via the Virtex-6 X130T using the 2-D filter types: a. Edge, b. SobelX, c. SobelY, d. SobelXY, e. Blur, f. Smooth, g. Sharpen, h. Gaussian, i. Identity.

IV. AN FPGA IMPLEMENTATION METHOD
The developed method is a high-level FPGA implementation method for any DSP algorithm that avoids the drawbacks of traditional HDL programming. The method has only five simple steps:
1. State the DSP algorithm.
2. Structure the DSP algorithm architecture.
3. Capture the algorithm using the Xilinx System Generator.
4. Verify, refine and improve the quality of results.
5. Generate the FPGA bitstream.

V. RESULTS
The goal of this paper is a new FPGA implementation method that provides fast FPGA prototyping for high-performance computation of parallel 2-D MRI image filtering algorithms. A timing analysis tool is needed to evaluate the speed and power consumption performance indices. Thus the Xilinx Timing Analyzer is used to generate timing statistics, total power analysis and histogram charts of the FPGA implementation path delays. These guide the designer in identifying the bottlenecks in the design and in focusing optimization on the slow-path outliers. A performance-efficient implementation is one that achieves low power consumption at maximum frequency for the parallel 2-D MRI image filtering algorithms. Consequently, comparative results for two Virtex-6 FPGA boards, xc6vlX240Tl-1lff1759 and xc6vlX130Tl-1lff1156, are compiled for the nine 2-D filters using two sets of 5x5 coefficient masks.
The first set is the mask stored within the 5x5 filter block, and the second set is obtained by empirically modifying the 5x5 Edge coefficients to the new 5x5 Edge enhancement orthogonal kernel in (3). The results are presented in three forms: a performance index table, grayscale filtered MRI images, and histogram charts of the path delay distribution. The results from the first set show that the parallel 2-D MRI image filtering algorithms perform better when implemented on the X130T board than on the X240T board. Furthermore, the results from the second set reveal an observable MRI filtering improvement over the first set.

Figure 5. The 2-D MRI images filtered using the new 2-D MRI Edge filter for both FPGA boards.

The performance indices in Table I show that the X130T implementation outperforms the X240T implementation, with lower total power consumption (around 0.97 W) and a maximum frequency mostly around 230 MHz. The difference is most apparent for the 2-D MRI Edge filter algorithm (194 MHz on the X240T versus 230 MHz on the X130T). Thus, the modification is empirically conducted on the 5x5 convolutional Edge operator. The filtered 2-D MRI images of Fig. 3 and Fig. 4 are generated from the nine parallel filtering algorithm implementations using the Virtex-6 X240T and X130T FPGAs respectively. By inspection, the two figures show a slight improvement in the 2-D MRI images filtered via the X130T FPGA compared with the X240T. The histogram charts in Fig. 6 and Fig. 7 depict the slow-path distributions of the 2-D MRI Edge filter captured behaviourally via the X240T and X130T FPGA boards respectively. Each histogram chart is a useful metric for analysing the FPGA implementation: Where are the slowest paths concentrated? How many slow paths are in each bin? How efficiently does the implementation meet timing? The FPGA implementation can be adjusted accordingly. The histograms are grouped into regions of roughly normally distributed groups of paths. The numbers at the top of the bins show the number of paths in each bin. Fig. 6 shows 308 paths that roughly form five groups. These groups probably come from different portions of the System Generator architecture in Fig. 2, or from different clock region timing constraints. Most of the slow paths are concentrated around 2.81 ns. The slowest path is about 6.15 ns.
There is an outlier group of slow paths in the range 6.13 ns to 6.30 ns, with empty bins to the right of it. That is because the FPGA implementation frequency, from Table I, is the lowest (194 MHz) for this 2-D MRI Edge filter. However, there are no red/pink bins or portions that fail to meet the timing constraints. Fig. 7 shows a shorter histogram chart of 308 paths that form a totally different distribution, with roughly only three normally distributed groups of paths between 2.2 ns and 4.36 ns. That is because the FPGA implementation frequency, from Table I, is the highest (230 MHz) for the same 2-D MRI Edge filter. The slow paths are concentrated between 2.2 ns and 2.8 ns. The slowest path is about 4.2 ns. Moreover, the greater number of bins holding only one path, distributed throughout the delay range, demonstrates the efficiency of the implementation at its 230 MHz maximum frequency. Consequently, there are no red/pink bins or portions that fail to meet the timing constraints.

Figure 6. Histogram chart depicting the total path delay distribution of the 2-D MRI Edge filter captured behaviourally via the X240T FPGA board.

The second result set is generated by targeting the same two Virtex-6 FPGA boards after replacing the stored Xilinx Edge coefficient matrix with the new empirical Edge enhancement orthogonal kernel of (3). Fig. 5, Fig. 8 and Fig. 9 depict those results. The new Edge filtering algorithm noticeably improves the MRI image filtering, as depicted in Fig. 5, compared with the Edge-filtered MRI images in Fig. 3.A and Fig. 4.a.
Figure 7. Histogram chart depicting the total path delay distribution of the 2-D MRI Edge filter captured behaviourally via the X130T FPGA board.
Figure 8. Histogram chart depicting the total path delay distribution of the new 2-D MRI Edge filter captured behaviourally via the X240T FPGA board.

Furthermore, the maximum frequency of the X240T-based implementation increased from 194 MHz to 229 MHz with nearly the same total power consumption of 1.56 W. On the other hand, the X130T power consumption is slightly lowered to 0.96 W at a maximum frequency of 228 MHz. The histogram charts in Fig. 8 and Fig. 9 show how the new maximum sampling frequencies are reflected in the slow-path concentration for the new Edge filter implementation on the X240T and X130T respectively. Fig. 8 shows a shorter histogram than Fig. 6 because of the new maximum frequency (229 MHz). This chart depicts 308 paths grouped roughly into four bell-curve regions. Most of the slow paths are concentrated around 2.4 ns. The slowest path is about 4 ns. Consequently, the outlier group of the slowest paths is shifted to the range 3.88 ns to 4.20 ns, with empty bins to the right of it. There are no red/pink bins or portions that fail to meet the timing constraints.
Figure 9. Histogram chart depicting the path delay distribution of the new 2-D MRI Edge filter captured behaviourally via the X130T FPGA board.

The Fig. 9 histogram distributes 308 slow paths into roughly three bell-shaped groups between 2 ns and 4.2 ns. The slowest path is about 4.09 ns. There are fewer one-path bins than in Fig. 7. Consequently, there are no red/pink bins or portions that fail to meet the timing constraints.
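The throughput arithmetic quoted in Section III and the board comparison in Table I can be re-derived with a short Python sketch. The function and variable names below are illustrative assumptions; the numbers are copied from the text and from Table I, and the percentages are simple derived quantities rather than additional measurements.

```python
# Values copied from Section III of the paper.
FILTER_CLOCK_HZ = 213e6   # 5x5 filter clock rate quoted from [17]
CLOCKS_PER_PIXEL = 5      # each MAC FIR is clocked 5x faster than the input rate
IMAGE_PIXELS = 64 * 64    # 64x64 grayscale MRI scan


def pixel_throughput(clock_hz, clocks_per_pixel=CLOCKS_PER_PIXEL):
    """Pixels per second when each pixel needs `clocks_per_pixel` clock cycles."""
    return clock_hz / clocks_per_pixel


def frame_rate(clock_hz, pixels_per_frame=IMAGE_PIXELS):
    """Frames per second for a frame of `pixels_per_frame` pixels."""
    return pixel_throughput(clock_hz) / pixels_per_frame


def power_saving_percent(p_ref_w, p_new_w):
    """Relative power reduction of p_new_w with respect to p_ref_w, in percent."""
    return 100.0 * (p_ref_w - p_new_w) / p_ref_w


# Table I: filter -> (power X240T [W], power X130T [W],
#                     fmax X240T [MHz], fmax X130T [MHz])
TABLE_I = {
    "Edge":     (1.57, 0.97, 194, 230),
    "SobelX":   (1.57, 0.97, 222, 228),
    "SobelY":   (1.57, 0.97, 202, 230),
    "SobelXY":  (1.56, 0.97, 223, 230),
    "Blur":     (1.57, 0.97, 227, 226),
    "Smooth":   (1.57, 0.97, 230, 207),
    "Sharpen":  (1.57, 0.97, 214, 230),
    "Gaussian": (1.57, 0.97, 230, 230),
    "Identity": (1.57, 0.97, 230, 230),
}

print(f"{pixel_throughput(FILTER_CLOCK_HZ) / 1e6:.1f} Mpixel/s")
print(f"{frame_rate(FILTER_CLOCK_HZ):.0f} frames/s")

for name, (p240, p130, f240, f130) in TABLE_I.items():
    saving = power_saving_percent(p240, p130)
    print(f"{name:9s} power saving {saving:4.1f}%  fmax {f240} -> {f130} MHz")
```

With the quoted 213 MHz clock this reproduces the 42.6 million pixels/second and roughly 10,400 frames/second stated in Section III, and it shows that the X130T draws about 38% less power than the X240T for every filter in Table I.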
VI. CONCLUSION
This paper developed a new FPGA implementation method that provides fast FPGA prototyping for high-performance computation. The methodology is a high-level, abstract, hardware-oriented parallel programming approach intended to outperform low-level, line-by-line HDL programming. It gives excellent performance results for nine parallel 2-D MRI image filtering algorithms, with power consumption down to 0.96 W at maximum frequencies of up to 230 MHz. The FPGA implementation behaviourally targeted two Virtex-6 FPGA boards, namely xc6vlX240Tl-1lff1759 and xc6vlX130Tl-1lff1156, using the Xilinx System Generator within the ISE 11.3 development suite. The X130T board outperforms the X240T board in parallel MRI filtering by consuming less power at maximum sampling frequency. One of the nine parallel filtering algorithms, the Edge algorithm, is empirically improved by a new enhanced orthogonal 5x5 kernel, which generates improved MRI filtering results compared with the previous filtering run, with moderately lower power consumption at a higher maximum sampling frequency. Future work will focus on high performance efficient FPGA implementation of parallel 3-D image filtering algorithms for the next generation of advanced DSP applications in the aerospace, defence, digital communications, multimedia, video and imaging industries.

REFERENCES
[1] S. Coric, M. Leeser, E. Miller and M. Trepanier, "Parallel-Beam Backprojection: an FPGA implementation optimized for medical imaging," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 39 (3), 2005, pp. 295-311.
[2] O. Maslennikow and A. Sergiyenko, "Mapping DSP Algorithms into FPGA," Parallel Computing in Electrical Engineering, PAR ELEC 2006, International Symposium on, 13-17 Sept. 2006, pp. 208-213.
[3] M. Kiran, K. M. War, L. M. Kuan, L. K. Meng and L. W. Kin, "Implementing image processing algorithms using Hardware in the loop approach for Xilinx FPGA," Electronic Design, ICED 2008, International Conference, Dec. 2008, pp. 1-6.
[4] W. Atabany and P. Degenaar, "Parallelism to reduce power consumption on FPGA Spatiotemporal image processing," Proc. IEEE International Symposium on Circuits and Systems, ISCAS 2008, pp. 1476-1479.
[5] R. Gao, D. Xu and J. P. Bentley, "Reconfigurable Hardware Implementation of an Improved Parallel Architecture for MPEG-4 Motion Estimation in Mobile Applications," IEEE Transactions on Consumer Electronics, Vol. 49, 2003, pp. 1383-1390.
[6] K. R. Nataraj, S. Ramachandran and B. S. Nagabushan, "Development of Algorithm, Architecture and FPGA Implementation of Demodulator for Processing Satellite Data Communication," IJCSNS International Journal of Computer Science and Network Security, Vol. 9, 2009, pp. 137-147.
[7] O. Nibouche, S. Boussakta and M. Darnell, "Pipeline architectures for radix-2 new Mersenne number transform," IEEE Transactions on Circuits and Systems I: Regular Papers 56 (8), 2009, pp. 1668-1680.
[8] O. Nibouche, S. Boussakta and M. Darnell, "A new architecture for radix-2 new Mersenne number transform," IEEE International Conference on Communications 2006, pp. 3219-3222.
[9] A. Masoudnia, H. Sarbazi-Azad and S. Boussakta, "Design and performance of a pixel-level pipelined-parallel architecture for high speed wavelet-based image compression," Computers and Electrical Engineering 31 (8), 2005, pp. 572-588.
[10] T. Mak, et al., "Implementation of wave-pipelined interconnects in FPGAs," Proceedings - Second IEEE International Symposium on NOCS 2008, pp. 213-214.
[11] C. Chang, "Design and application of a reconfigurable computing System for High Performance Digital Signal Processing," Ph.D. thesis, University of California, Berkeley, 2005.
[12] S. Boussakta, "A novel method for parallel image Processing applications," Journal of Systems, 45 (10), 1999, pp. 825-839.
[13] Clive Maxfield, FPGAs: World Class Design, 2009, Elsevier Inc.
[14] R. Woods, J. McAllister, G. Lightbody and Y. Yi, FPGA-based Implementation of Signal Processing Systems, 2008, John Wiley & Sons, Ltd.
[15] M. Aziz, "Parallel Digital Filtering Algorithms for Multiprocessor DSP systems," Ph.D. thesis, University of Leeds, 2004.
[16] O. Alshibami, S. Boussakta and M. Aziz, "Fast algorithm for the 2-D new Mersenne number transform," Signal Processing 81 (8), 2001, pp. 1725-1735.
[17] System Generator for DSP user guides, 2010, downloadable from: http://www.xilinx.com/support/sw_manuals/sysgen_bklist.pdf
[18] Virtex-6 FPGA Xilinx documentation, 2010, downloadable from: http://www.xilinx.com/support/documentation/virtex-6.htm