
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

A Deep Learning prediction process accelerator based FPGA

Qi Yu, Chao Wang, Xiang Ma, Xi Li, Xuehai Zhou

School of Computer Science
University of Science and Technology of China
Hefei, China
e-mail: yuiq1123@mail.ustc.edu.cn, cswang@ustc.edu.cn, supermaxiang@gmail.com, {llxx, xhzhou}@ustc.edu.cn

Abstract: Recently, machine learning has been widely used in applications and cloud services. As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. To give users a better experience, high-performance implementations of deep learning applications are very important. As a common means of accelerating algorithms, FPGAs offer high performance, low power consumption, small size and other advantages. We therefore use an FPGA to design a deep learning accelerator; the accelerator focuses on the implementation of the prediction process, data access optimization and a high-throughput pipeline structure. Compared with a 2.3 GHz Core 2 CPU, our accelerator achieves promising results.

Keywords: FPGA; deep learning; prediction process; accelerator

I. INTRODUCTION

Recently, machine learning has been widely used in applications and cloud services, such as image search, face identification and speech recognition. Since 2006, a subset of artificial neural networks has emerged that achieves higher accuracy and better results across a broad set of machine learning applications than the traditional state-of-the-art algorithms [1]. This subset includes both Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), and the computational approach is called Deep Learning (DL). Deep Learning models are multilayer neural networks that are both compute intensive and memory intensive [2][3]. Moreover, with the increasing accuracy requirements and complexity of practical applications, Deep Learning networks have grown very large; examples include the Google cat-recognition system (1 billion neuronal connections) and the Baidu Brain system (100 billion neuronal connections). Therefore, the high-performance implementation of large-scale deep learning neural networks is particularly important and has become a research hotspot.

As a main means of accelerating deep learning algorithms, FPGA (Field-Programmable Gate Array) offers high performance and low power consumption. Among current FPGA acceleration efforts, Ly [5] and Kim [6] have each designed multi-FPGA architectures to accelerate the restricted Boltzmann machine, a pre-training algorithm of deep learning. Farabet [7] presents a runtime reconfigurable dataflow architecture for CNNs on FPGA; the system has a Control Unit, a grid of Processing Tiles, and a Smart DMA interfacing external memory.

In this paper, we also use an FPGA to design a hardware accelerator for the deep learning prediction process. For large-scale neural networks where direct mapping is not possible, the implementation becomes a problem in terms of performance and hardware resources. To tackle this problem, we use a time-sharing reuse technique and decompose the input data into data fragments [10]. In every iteration, we reuse the arithmetic computation units to process one data fragment. The accelerator design focuses on data access optimization and a high-throughput pipeline structure. The design is synthesized for the Zedboard development board.

II. DEEP NEURAL NETWORKS

Fig. 1 illustrates a deep neural network for handwritten digit recognition, composed of one input layer, several hidden layers and one output layer (MNIST is a dataset of handwritten digits). In this paper, we use DNNs as the example. DNNs have two computation modes, the prediction process and the training process [2]. The prediction process is the feedforward computation which computes the output for each given input using the weight coefficients obtained from the training process. The training process includes pre-training, which locally tunes the connection weights between the units in adjacent layers with the training datasets [4], and global training, which globally tunes the connection weights with the Back Propagation (BP) algorithm. For technical and market considerations, we implement the prediction process rather than the training process.

Figure 1. The schematic diagram of DNNs for MNIST (input layer: 784 neurons for a 28x28 image; hidden layers 1 to n: 500 neurons each; output layer: 10 neurons)
Like traditional neural networks, the prediction process computes layer by layer from the input layer to the output layer, and the outputs of the current layer are the inputs of the next layer. Suppose the layer with N_i neurons (x_1, x_2, ..., x_{N_i}) is used as the input to compute the layer with N_o neurons (y_1, y_2, ..., y_{N_o}). The computational formula is presented in (1):

y_j = f\left( \sum_{k=1}^{N_i} w_{kj} x_k + b_j \right), \quad j = 1, \dots, N_o    (1)

where f represents the activation function (usually the sigmoid function), w_{kj} indicates the weight coefficient between neuron x_k and neuron y_j, and b_j represents the offset value. The whole calculation can be written as a matrix multiplication followed by the activation function, as shown in (2), with Y = (y_1, y_2, ..., y_{N_o}) and X = (1, x_1, x_2, ..., x_{N_i}):

Y = f(XW)    (2)

where

W = \begin{pmatrix} b_1 & b_2 & \cdots & b_{N_o} \\ w_{11} & w_{12} & \cdots & w_{1 N_o} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N_i 1} & w_{N_i 2} & \cdots & w_{N_i N_o} \end{pmatrix}    (3)

f(x) = \frac{1}{1 + e^{-x}}    (4)
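To make the computation in (1), (2) and (4) concrete, the following is a minimal C++ software sketch of one prediction-process layer. The function name predictLayer and the std::vector-based data layout are illustrative choices of this listing only, not the hardware implementation described later.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Software model of one prediction-process layer, following (1)-(2):
    // y_j = f(sum_k w[k][j] * x[k] + b[j]) with the sigmoid activation (4).
    std::vector<float> predictLayer(const std::vector<float>& x,               // N_i inputs
                                    const std::vector<std::vector<float>>& w,  // N_i x N_o weights
                                    const std::vector<float>& b) {             // N_o offsets
        const std::size_t Ni = x.size();
        const std::size_t No = b.size();
        std::vector<float> y(No);
        for (std::size_t j = 0; j < No; ++j) {
            float acc = b[j];                       // start from the offset value b_j
            for (std::size_t k = 0; k < Ni; ++k)
                acc += w[k][j] * x[k];              // accumulate w_kj * x_k
            y[j] = 1.0f / (1.0f + std::exp(-acc));  // sigmoid activation f
        }
        return y;                                   // outputs feed the next layer
    }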
III. ARCHITECTURE

An FPGA-based accelerator architecture is proposed in this section. Fig. 2 shows the application framework for our accelerator. The accelerators serve as computing resources in an application system for deep learning applications or services [8][9]. Every compute node is a physical machine which includes a CPU, an FPGA, networking and memory. A control node is the controller of the entire system: it handles the requests from users, creates tasks and schedules the computing resources. The control node also updates the weight coefficients, which are trained off-line, and keeps the weight coefficients consistent on every compute node.

Figure 2. The application framework of the FPGA-based accelerator

To support FPGA usage, the FPGA-based accelerator must have a framework. As Fig. 2 shows, the overall framework includes a hardware layer, a driver layer and a library layer. (1) The hardware layer contains the memory and the FPGA board; the FPGA board consists of a DMA and the deep learning module (DL Module). All of the data (the configuration information, the user requests and the weight coefficients) are transmitted through the DMA. The DL Module is the main computing logic, and the prediction process is mapped onto this unit. (2) The driver layer: the drivers of the DL Module and the DMA are implemented in this layer, and a FIFO task queue is used to schedule the user requests. (3) The library layer: standard APIs and function libraries are provided in this layer, so that the accelerator is easy to use for developers. This paper focuses on the accelerator design on the FPGA, so we do not devote much space to the superstructure of the accelerator.

A. Compatibility

Compatibility has always been a weakness of FPGA computation units. Once the computing units are mapped onto the FPGA board, even a very small modification is difficult. Although reconfigurable technology can alleviate this problem, the flexibility is still poor compared with a CUDA-based GPU implementation. So, typically, the implementation is designed for specific FPGA boards and specific applications, and the superstructure (the driver layer and the library layer) is likewise written for the specific FPGAs and applications. This is troublesome for developers when the implementation changes.

In order to improve compatibility, we take the following measures:

- Use the same communication protocol and interface in the IP core design. In our case, we use the AXI-Stream protocol to transmit the data and define two FIFO ports: one for the input stream and one for the output stream (a behavioral sketch of this interface follows the list). Besides, all data transmissions between the FPGA and the memory are performed by the DMA. So we can keep the superstructure design relatively uniform.
- Use a C-to-HDL methodology. C-to-HDL tools can convert C or C-like high-level languages into hardware description languages such as VHDL or Verilog. The converted code can be synthesized and used on FPGA boards. This methodology greatly reduces the development cycle.
requests and the weight coefficients) are transmitted through
B. Performance

Section II described the main computing process. When the size of the deep learning neural network becomes very large, the fully parallel computation cannot be mapped onto a single FPGA. For example, in our case we use DSP48E slices to implement the floating-point arithmetic, and the number of DSP48E resources cannot satisfy the fully parallel requirement. So we decompose the input data into data fragments and use a time-sharing multiplexing technique to deal with these fragments: in every iteration, we reuse the arithmetic computation units to process one data fragment. However, this method leads to a performance decline.
We adopt the following measures to improve performance:

- Data access optimization. We analyze the data locality of the prediction process and cache the reused data in Block RAM (BRAM) based on the reuse distance. We use registers or BRAMs to read the data required by the next iteration while the present iteration is still computing (a sketch of this double-buffering idea follows the list).
- Flow calculation. We design the DL Module as a pipeline with a streaming protocol. In this way, the DL Module achieves a high throughput.
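The following C++ sketch illustrates the double-buffering behind the first measure: while one buffer set is being computed on, the other is filled with the data of the next iteration. The buffers, the function name processAllFragments and the trivial "compute" step are illustrative only.

    #include <cstddef>
    #include <vector>

    // Model of double buffering: buffer[cur] is computed on while
    // buffer[1 - cur] is prefetched with the next fragment, then the roles swap.
    float processAllFragments(const std::vector<std::vector<float>>& fragments) {
        std::vector<float> buffer[2];                  // two register/BRAM sets
        float total = 0.0f;
        if (fragments.empty()) return total;
        int cur = 0;
        buffer[cur] = fragments[0];                    // only the first fill is waited on
        for (std::size_t i = 0; i < fragments.size(); ++i) {
            if (i + 1 < fragments.size())
                buffer[1 - cur] = fragments[i + 1];    // prefetch the next fragment
            for (float v : buffer[cur])                // stand-in for the real computation
                total += v;
            cur = 1 - cur;                             // swap the two sets
        }
        return total;
    }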
IV. IMPLEMENTATION DETAILS

In this section, we introduce the details of our accelerator implementation; the schematic diagram is shown in Fig. 3. As mentioned before, the DL Module is designed to accelerate the prediction process of a DNN application. The main computation consists of matrix multiplication and the activation function, and matrix multiplication is well suited to parallel processing.

Figure 3. The schematic diagram of DL Module details
A. Time-sharing computation

For large-scale deep learning neural networks, we cannot map all of the parallel computation. According to the amount of hardware resources, we implement only part of the computational logic and use a time-sharing technique to complete the total computation.

1) For several or many layers: Since a deep learning neural network computes layer by layer from the input layer to the output layer, and the outputs of the current layer are the inputs of the next layer, we only implement the largest layer, i.e., the one with the largest weight matrix. When a smaller layer is being processed, the remaining space in the input or weight coefficient blocks is padded with zeros.

2) For a single layer: We implement only part of the arithmetic logic. In our case, we instantiate 31 floating-point adders and 32 floating-point multipliers, and we decompose the input data into fragments of size 32. In every iteration the accelerator processes 32 input values of the dot-product calculation.
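The C++ sketch below models this time-sharing scheme in software: the dot product of one output neuron is computed in fragments of 32 values, reusing the same 32 multiplier slots every iteration, and a short final fragment is zero-padded. The constant kFragment and the function name are ours, not taken from the RTL.

    #include <cstddef>
    #include <vector>

    constexpr std::size_t kFragment = 32;   // fragment width, matching the 32 multipliers

    float dotProductTimeShared(const std::vector<float>& x,
                               const std::vector<float>& w) {
        float sum = 0.0f;
        for (std::size_t base = 0; base < x.size(); base += kFragment) {
            float partial = 0.0f;                       // reused accumulation unit
            for (std::size_t k = 0; k < kFragment; ++k) {
                std::size_t idx = base + k;
                // Out-of-range positions act as the zero padding of a small layer.
                float xv = (idx < x.size()) ? x[idx] : 0.0f;
                float wv = (idx < w.size()) ? w[idx] : 0.0f;
                partial += xv * wv;                     // 32 multiplications per iteration
            }
            sum += partial;                             // accumulate the fragment result
        }
        return sum;
    }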
B. Data access optimization

Because we need to upload the computation task to the FPGA, data access between the memory and the FPGA is unavoidable. To get better performance, we must optimize the number and the timing of these data accesses.

We use the BRAM resources to cache the weight coefficients between two adjacent layers. The accelerator reads the weight coefficient matrix from the input buffer and loops over its rows to store them into 32 different BRAMs according to the row number of the weight matrix (n = i % 32, where n is the BRAM index and i is the row number of the weight matrix). So the accelerator can read 32 weight values in parallel.

To reduce the impact of the data access time on performance, we design two register sets that are used alternately: one reads the data required by the next iteration while the other serves the current iteration. In our test, the time to cache 32 input values is much less than the time to compute on 32 values, so every iteration except the first starts without waiting.
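A small C++ sketch of the banking rule n = i % 32 follows; a std::vector stands in for one BRAM, and the names kBanks and bankWeights are ours. Because consecutive rows land in different banks, 32 weight values (one per bank) can be fetched in the same cycle.

    #include <cstddef>
    #include <vector>

    constexpr std::size_t kBanks = 32;   // number of BRAM banks

    // Distribute the rows of the weight matrix across 32 banks: row i goes to bank i % 32.
    std::vector<std::vector<float>> bankWeights(
            const std::vector<std::vector<float>>& weightRows) {
        std::vector<std::vector<float>> banks(kBanks);
        for (std::size_t i = 0; i < weightRows.size(); ++i) {
            std::size_t n = i % kBanks;                 // bank index for row i
            for (float w : weightRows[i])
                banks[n].push_back(w);                  // append the row to its BRAM model
        }
        return banks;
    }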

C. Adder tree

We use a binary adder tree to implement the accumulation. In this way, the time consumption of the accumulation is reduced from O(n) to O(log n). With the pipelined design, we obtain a sum or a partial sum every clock cycle. When there are not enough input values (32 in our case), zeros are used as temporary input values. Because the input values are floating point, we use DSP48E slices to implement the floating-point addition; for this reason, the number of DSP48E resources limits the size of the binary adder tree.
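The reduction structure can be sketched in C++ as below. The function adderTreeSum simply mirrors the O(log n) pairwise structure in software (32 zero-padded inputs give five levels); it does not model the DSP48E pipeline stages.

    #include <cstddef>
    #include <vector>

    // Software model of a binary adder tree: repeatedly add neighbouring pairs
    // until one value remains; each while-iteration corresponds to one tree level.
    float adderTreeSum(std::vector<float> values) {      // e.g. 32 values, zero-padded if fewer
        while (values.size() > 1) {
            std::vector<float> next;
            for (std::size_t i = 0; i + 1 < values.size(); i += 2)
                next.push_back(values[i] + values[i + 1]);  // one adder per tree node
            if (values.size() % 2 != 0)
                next.push_back(values.back());              // carry an odd element upward
            values.swap(next);
        }
        return values.empty() ? 0.0f : values.front();
    }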
D. Piecewise linear interpolation

We use piecewise linear interpolation to realize the activation function, for two reasons: 1) the computation of DNNs does not need high precision; 2) piecewise linear interpolation achieves better performance than other methods, such as binomial expansion. Piecewise linear interpolation (y = a_i x + b_i for x in the i-th interval) can implement any activation function with negligible loss of accuracy when the interval width k between x_i and x_{i+1} is small enough. Equation (5) shows the implementation of the sigmoid function. This implementation takes advantage of symmetry and the bounded range: for x > 10 or x <= -10, the results are sufficiently close to the bounds of 1 and 0, respectively. Besides, the sigmoid function is symmetric about (0, 0.5), i.e., f(x) = 1 - f(-x) for x < 0. So we divide the sigmoid function into four segments, and only the 0 < x <= 10 segment uses piecewise linear interpolation:

f(x) = \begin{cases} 0, & x \le -10 \\ 1 - \left( a_i \cdot (-x) + b_i \right), & -10 < x \le 0 \\ a_i x + b_i, & 0 < x \le 10 \\ 1, & x > 10 \end{cases}    (5)

We use two BRAMs to store the values of the a set and the b set. The values of a, b and k are fixed after the DL Module is hardwired. We find the corresponding a and b values according to the value of x, and obtain y after one multiplication and one addition. Fig. 4 shows the computation schematic; the computation process is pipelined, and we can obtain a value of y every clock cycle.

Figure 4. The piecewise linear interpolation schematic diagram
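A C++ sketch of the four-segment sigmoid in (5) is given below. The coefficient tables a and b and the interval width k are assumed to be produced off-line (see Section V), and the function name sigmoidPwl is ours.

    #include <cstddef>
    #include <vector>

    // Piecewise linear sigmoid per (5): saturate outside (-10, 10], use
    // a_i*x + b_i on (0, 10], and the symmetry f(x) = 1 - f(-x) for x <= 0.
    float sigmoidPwl(float x, const std::vector<float>& a,
                     const std::vector<float>& b, float k) {
        if (x > 10.0f)   return 1.0f;                   // upper bound
        if (x <= -10.0f) return 0.0f;                   // lower bound
        bool negative = (x <= 0.0f);
        float xp = negative ? -x : x;                   // fold onto [0, 10] by symmetry
        std::size_t i = static_cast<std::size_t>(xp / k);
        if (i >= a.size()) i = a.size() - 1;            // clamp to the last segment
        float y = a[i] * xp + b[i];                     // one multiplication and one addition
        return negative ? 1.0f - y : y;                 // apply f(x) = 1 - f(-x) for x <= 0
    }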
V. EXPERIMENT AND RESULT

A. Evaluation Methodology

MATLAB: In Section IV, we use piecewise linear interpolation to realize the activation function, so we obtain the a set and the b set for different interval widths k in (5) by linear regression. The test data, i.e., the weight coefficients and the inputs, are also generated with MATLAB.

CPU: For the CPU baseline, we use a Core 2 processor clocked at 2.3 GHz. We implement a standard C++ version of the prediction process for layers of different sizes, and we use QueryPerformanceCounter() to count the CPU clock cycles.

FPGA: We use the Zedboard development board to implement the DL Module design. The Zedboard has two ARM Cortex-A9 processors clocked at 667 MHz. For the running time of the DL Module, we use an AXI Timer to record the total number of running clock cycles.
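For reference, the CPU-side timing can be sketched as below, assuming a Windows build since QueryPerformanceCounter() is a Win32 API; the dummy workload runLayer() merely stands in for the C++ layer implementation and is not part of the paper.

    #include <windows.h>
    #include <cstdio>

    volatile float sink = 0.0f;

    void runLayer() {                          // placeholder workload, illustrative only
        for (int i = 0; i < 1000000; ++i)
            sink += 0.001f * static_cast<float>(i);
    }

    int main() {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);      // counter ticks per second
        QueryPerformanceCounter(&start);
        runLayer();                            // the measured prediction computation
        QueryPerformanceCounter(&stop);
        double seconds = static_cast<double>(stop.QuadPart - start.QuadPart) /
                         static_cast<double>(freq.QuadPart);
        std::printf("elapsed: %.6f s\n", seconds);
        return 0;
    }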
B. Result

Fig. 5 illustrates the speedup for layers of different sizes. In our case, we use three sizes: 64x64, 128x128 and 256x256. The results show that our accelerator achieves an average 30x speedup over the CPU, and this is the result of only one slave node. If we had 10 compute nodes performing data-parallel computation, the accelerator system speedup could possibly reach 300x.

Figure 5. The speedup of different size layers

VI. CONCLUSION

In this paper, we present an FPGA-based accelerator for the prediction process of deep learning. The accelerator focuses on data access optimization and a high-throughput pipeline structure, and we use piecewise linear interpolation to realize the activation function. Compared with a 2.3 GHz Core 2 CPU, our single accelerator achieves an average 30x speedup. Future work will start with the testing of much larger-scale neural networks, the design of the complete accelerator system together with a cloud system, and business application tests.

VII. ACKNOWLEDGMENTS

This work was supported by the National Science Foundation of China under grants No. 61379040, No. 61272131 and No. 61202053, the Jiangsu Provincial Natural Science Foundation (No. SBK201240198), the Fundamental Research Funds for the Central Universities (No. WK0110000034), the Open Project of the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences (No. CARCH201407), and the Strategic Priority Research Program of CAS (No. XDA06010403). The authors deeply appreciate the reviewers for their insightful comments and suggestions.

REFERENCES

[1] Y. Bengio and O. Delalleau, "On the expressive power of deep architectures," in Algorithmic Learning Theory, 2011, pp. 18-36.
[2] G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, vol. 19, p. 153, 2007.
[4] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, p. 926, 2010.
[5] D. Le Ly and P. Chow, "High-performance reconfigurable hardware architecture for restricted Boltzmann machines," IEEE Transactions on Neural Networks, vol. 21, pp. 1780-1792, 2010.
[6] S. K. Kim, P. L. McMahon, and K. Olukotun, "A large-scale architecture for restricted Boltzmann machines," in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, 2010, pp. 201-208.
[7] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Akselrod, et al., "Large-scale FPGA-based convolutional networks," Machine Learning on Very Large Data Sets, 2011.
[8] C. Wang, X. Li, P. Chen, A. Wang, X. Zhou, and H. Yu, "Heterogeneous Cloud Framework for Big Data Genome Sequencing," IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp. 1-1, 2014.
[9] C. Wang, X. Li, J. Zhang, P. Chen, Y. Chen, X. Zhou, et al., "Architecture Support for Task Out-of-order Execution in MPSoCs," IEEE Transactions on Computers, pp. 1-1, 2014.
[10] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, et al., "DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 2014, pp. 269-284.

