
TITLE: Cost-Effective Approach for Multi-Level Filter Design for

Reconstruction of Stable and Dynamic Applications

ABSTRACT:

Transpose-form finite impulse response (FIR) structures are inherently pipelined, and their support for multiple constant multiplication (MCM) results in significant savings in computation. However, unlike the direct-form configuration, the transpose-form configuration does not directly support block processing.

In this paper, we explore the possibility of realizing a block FIR filter in the transpose-form configuration for area-delay-efficient realization of large-order FIR filters for both fixed and reconfigurable applications. Based on a detailed computational analysis of the transpose-form configuration of the FIR filter, we have derived a flow graph for the transpose-form block FIR filter with optimized register complexity.

A generalized block formulation is presented for the transpose-form FIR filter. We have derived a general multiplier-based architecture for the proposed transpose-form block filter for reconfigurable applications. A low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters.

Performance comparison shows that the proposed structure involves significantly less area-delay product (ADP) and less energy per sample (EPS) than the existing block direct-form structure for medium or large filter lengths, while for short filter lengths the existing block direct-form FIR structure has less ADP and less EPS than the proposed structure.

ASIC synthesis results show that the proposed structure, for block size 4 and filter length 64, involves 42% less ADP and 40% less EPS than the best available FIR structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than the existing direct-form block FIR structure. Based on these findings, we present a scheme for the selection of the direct-form or transpose-form configuration, based on the filter length and block length, for obtaining area-delay- and energy-efficient block FIR structures.

INTRODUCTION:

FINITE impulse response (FIR) digital filters are widely used in several digital signal processing applications such as speech processing, loudspeaker equalization, echo cancellation, adaptive noise cancellation, and various communication applications including software-defined radio (SDR) [1]. Many of these applications require FIR filters of large order to meet stringent frequency specifications [2]–[4]. Very often
these filters need to support high sampling rate for high-speed digital communication [5].
The number of multiplications and additions required for each filter output, however,
increases linearly with the filter order.

Since there is no redundant computation available in the FIR filter algorithm, real-time
implementation of a large order FIR filter in a resource constrained environment is a
challenging task. Filter coefficients very often remain constant and known a priori in signal
processing applications. This feature has been utilized to reduce the complexity of
realization of multiplications. Several designs have been suggested by various
researchers for efficient realization of FIR filters (having fixed coefficients) using
distributed arithmetic (DA) [22] and multiple constant multiplication (MCM) methods [10],
[13]–[16].

DA-based designs use look-up tables (LUTs) to store pre-computed results to reduce the computational complexity. The MCM method, on the other hand, reduces the number of additions required for the realization of multiplications by common subexpression sharing when a given input is multiplied with a set of constants. The MCM scheme is more effective when a common operand is multiplied with a larger number of constants. Therefore, the MCM scheme is suitable for the implementation of large-order FIR filters with fixed coefficients.
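As a rough illustration of the MCM idea (this is an illustrative sketch, not the specific MCM algorithm of any cited work), the products of one input with the constants 5 and 13 can share the subexpression 5x, so the pair costs two additions instead of three:

```python
def mcm_5_13(x):
    """Multiply x by the constant set {5, 13} using shared shift-and-add terms."""
    t5 = (x << 2) + x      # 5x = 4x + x (one addition)
    t13 = (x << 3) + t5    # 13x = 8x + 5x, reusing 5x (one more addition)
    return t5, t13
```

A naive realization would compute 13x = 8x + 4x + x with two additions of its own; sharing the common term is what makes MCM attractive when one operand multiplies many constants.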

However, MCM blocks can be formed only in the transpose-form configuration of FIR filters. Block processing is popularly used to derive high-throughput hardware structures. Not only does it provide a throughput-scalable design, but it also improves area-delay efficiency. The derivation of a block-based FIR structure is straightforward when the direct-form configuration is used [21], whereas the transpose-form configuration does not directly support block processing.
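As a behavioral sketch only (the block size L and function names here are illustrative, not the paper's architecture), a direct-form block FIR filter computes L outputs per block from the same set of coefficients:

```python
def block_fir_direct(x, h, L):
    """Compute FIR outputs y[n] = sum_k h[k] * x[n-k] in blocks of L samples.

    Input samples before n = 0 are taken as zero.
    """
    blocks = []
    for start in range(0, len(x), L):
        block = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
                 for n in range(start, min(start + L, len(x)))]
        blocks.append(block)
    return blocks
```

Each block reuses the same coefficient set, which is why block processing scales throughput without duplicating the whole filter.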

However, to take computational advantage of MCM, the FIR filter is required to be realized in the transpose-form configuration. Apart from that, transpose-form structures are inherently pipelined and are expected to offer a higher operating frequency to support higher sampling rates. There are some applications, such as the SDR channelizer, where FIR filters need to be implemented in reconfigurable hardware to support multi-standard wireless communication [6]. Several designs have been suggested during the last decade for efficient realization of reconfigurable FIR (RFIR) filters using general multipliers and constant multiplication schemes [7]–[12], [17], [18].

A programmable multiply-accumulate based processor is proposed in [7] for FIR filtering.

The area and power requirements of these architectures are significantly large and, therefore, they are not suitable for an SDR channelizer. The structure of [9] is multiplier-based and uses a poly-phase decomposition scheme. In [10], a reconfigurable FIR filter architecture using the computation-sharing vector-scaling technique of [8] has been proposed.

In [11], a programmable canonical signed digit (CSD) based architecture was proposed using Booth encoding to generate partial products and a Wallace tree adder to sum the partial products. Chen et al. [12] have proposed a CSD-based reconfigurable FIR filter where the non-zero CSD values are modified to reduce the precision of the filter coefficients without significant impact on the filter behavior. However, its reconfiguration overhead is significantly large, and it does not provide an area-delay-efficient structure.

The architectures of [8]–[12] are more appropriate for lower-order filters and are not suitable for channel filters due to their large area complexity. The constant shift method (CSM) and programmable shift method (PSM) have been proposed in [16], [17] for RFIR filters, specifically for the SDR channelizer. Recently, Park et al. [18] have proposed an interesting DA-based architecture for the RFIR filter.
The existing multiplier-based structures use either the direct-form or the transpose-form configuration. The multiplier-less structures of [16], [17] use the transpose-form configuration, whereas the DA-based structure of [18] uses the direct-form configuration. However, we do not find any specific block-based design for RFIR filters in the literature. A block-based RFIR structure can easily be derived using the scheme proposed in [20], [21], but we find that the block structure obtained from [20], [21] is not efficient for large filter lengths and variable filter coefficients, as required in an SDR channelizer.

Therefore, the design methods proposed in [20], [21] are more suitable for 2-D FIR and BLMS adaptive filters. In this paper, we explore the possibility of realizing a block FIR filter in the transpose-form configuration in order to take advantage of MCM schemes and the inherent pipelining for area-delay-efficient realization of large-order FIR filters for both fixed and reconfigurable applications.

The main contributions of this paper are as follows:

• Computational analysis of the transpose-form configuration of the FIR filter and derivation of a flow graph for the transpose-form block FIR filter with reduced register complexity.

• Block formulation for the transpose-form FIR filter.

• Design of a transpose-form block filter for reconfigurable applications.

• A low-complexity design method using the MCM scheme for the block implementation of fixed FIR filters.
CHAPTER 1:

INTRODUCTION TO DIGITAL FILTERS

Digital filters are a very important part of DSP. In fact, their extraordinary performance is
one of the key reasons that DSP has become so popular. As mentioned in the
introduction, filters have two uses: signal separation and signal restoration. Signal
separation is needed when a signal has been contaminated with interference, noise, or
other signals. For example, imagine a device for measuring the electrical activity of a
baby's heart (EKG) while still in the womb. The raw signal will likely be corrupted by the
breathing and heartbeat of the mother. A filter might be used to separate these signals
so that they can be individually analyzed.

Signal restoration is used when a signal has been distorted in some way. For example,
an audio recording made with poor equipment may be filtered to better represent the
sound as it actually occurred. Another example is the deblurring of an image acquired
with an improperly focused lens, or a shaky camera.

These problems can be attacked with either analog or digital filters. Which is better?
Analog filters are cheap, fast, and have a large dynamic range in both amplitude and
frequency. Digital filters, in comparison, are vastly superior in the level of performance
that can be achieved. For example, a low-pass digital filter presented in Chapter 16 has
a gain of 1 +/- 0.0002 from DC to 1000 hertz, and a gain of less than 0.0002 for
frequencies above 1001 hertz. The entire transition occurs within only 1 hertz. Don't
expect this from an op amp circuit! Digital filters can achieve thousands of times better
performance than analog filters. This makes a dramatic difference in how filtering
problems are approached. With analog filters, the emphasis is on handling limitations of
the electronics, such as the accuracy and stability of the resistors and capacitors. In
comparison, digital filters are so good that the performance of the filter is frequently
ignored. The emphasis shifts to the limitations of the signals, and the theoretical issues
regarding their processing.

It is common in DSP to say that a filter's input and output signals are in the time domain.
This is because signals are usually created by sampling at regular intervals of time. But
this is not the only way sampling can take place. The second most common way of
sampling is at equal intervals in space. For example, imagine taking simultaneous
readings from an array of strain sensors mounted at one centimeter increments along the
length of an aircraft wing. Many other domains are possible; however, time and space are
by far the most common. When you see the term time domain in DSP, remember that it
may actually refer to samples taken over time, or it may be a general reference to any
domain that the samples are taken in.

As shown in Fig. 14-1, every linear filter has an impulse response, a step response and
a frequency response. Each of these responses contains complete information about
the filter, but in a different form. If one of the three is specified, the other two are fixed and
can be directly calculated. All three of these representations are important, because they
describe how the filter will react under different circumstances.

The most straightforward way to implement a digital filter is by convolving the input signal
with the digital filter's impulse response. All possible linear filters can be made in this
manner. (This should be obvious. If it isn't, you probably don't have the background to
understand this section on filter design. Try reviewing the previous section on DSP
fundamentals). When the impulse response is used in this way, filter designers give it a
special name: the filter kernel.
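Filtering by convolution with the kernel can be sketched as follows (a minimal illustration; the function name is ours):

```python
def convolve(x, kernel):
    """Output y[i+j] accumulates x[i] * kernel[j]: each input sample
    scatters a scaled copy of the kernel into the output."""
    y = [0.0] * (len(x) + len(kernel) - 1)
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            y[i + j] += xi * kj
    return y
```

For example, a two-point moving-average kernel [0.5, 0.5] applied to [1, 1, 1] smears each sample over two output points.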

There is also another way to make digital filters, called recursion. When a filter is
implemented by convolution, each sample in the output is calculated by weighting the
samples in the input, and adding them together. Recursive filters are an extension of this,
using previously calculated values from the output, besides points from the input. Instead
of using a filter kernel, recursive filters are defined by a set of recursion coefficients.
This method will be discussed in detail in Chapter 19. For now, the important point is that
all linear filters have an impulse response, even if you don't use it to implement the filter.
To find the impulse response of a recursive filter, simply feed in an impulse, and see what
comes out. The impulse responses of recursive filters are composed of sinusoids that
exponentially decay in amplitude. In principle, this makes their impulse
responses infinitely long. However, the amplitude eventually drops below the round-off
noise of the system, and the remaining samples can be ignored. Because

of this characteristic, recursive filters are also called Infinite Impulse Response or
IIR filters. In comparison, filters carried out by convolution are called Finite Impulse
Response or FIR filters.
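To see the "feed in an impulse" idea concretely, here is a sketch of a first-order recursive filter, y[n] = a0*x[n] + b1*y[n-1] (the coefficient names are illustrative); its impulse response decays exponentially but never formally reaches zero:

```python
def impulse_response(a0, b1, n):
    """First n samples of the impulse response of y[n] = a0*x[n] + b1*y[n-1]."""
    y, prev = [], 0.0
    for k in range(n):
        x = 1.0 if k == 0 else 0.0   # the impulse
        prev = a0 * x + b1 * prev    # recursion on the previous output
        y.append(prev)
    return y
```

With a0 = 1 and b1 = 0.5 the response is 1, 0.5, 0.25, ...: infinitely long in principle, but soon below any round-off noise floor.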
As you know, the impulse response is the output of a system when the input is
an impulse. In this same manner, the step response is the output when the input is
a step (also called an edge, and an edge response). Since the step is the integral of the
impulse, the step response is the integral of the impulse response. This provides two
ways to find the step response: (1) feed a step waveform into the filter and see what
comes out, or (2) integrate the impulse response. (To be mathematically
correct: integration is used with continuous signals, while discrete integration, i.e., a
running sum, is used with discrete signals). The frequency response can be found by
taking the DFT (using the FFT algorithm) of the impulse response. This will be reviewed
later in this chapter. The frequency response can be plotted on a linear vertical axis, such
as in (c), or on a logarithmic scale (decibels), as shown in (d). The linear scale is best at
showing the passband ripple and roll-off, while the decibel scale is needed to show the
stopband attenuation.
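The two relationships in this paragraph, the step response as the running sum of the impulse response and the frequency response as its DFT, can be sketched as (a direct DFT is used here instead of the FFT for clarity):

```python
import cmath

def step_response(h):
    """Running sum (discrete integration) of the impulse response h."""
    out, acc = [], 0.0
    for v in h:
        acc += v
        out.append(acc)
    return out

def magnitude_response(h, n):
    """Magnitude of the n-point DFT of h (h is zero-padded to length n)."""
    hp = list(h) + [0.0] * (n - len(h))
    return [abs(sum(hp[k] * cmath.exp(-2j * cmath.pi * f * k / n)
                    for k in range(n)))
            for f in range(n)]
```

For a four-point moving average the step response climbs to 1.0, and the magnitude response at DC (frequency index 0) equals the sum of the kernel, i.e. unity gain.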

Don't remember decibels? Here is a quick review. A bel (in honor of Alexander Graham
Bell) means that the power is changed by a factor of ten. For example, an electronic circuit
that has 3 bels of amplification produces an output signal with 10 × 10 × 10 = 1000 times
the power of the input. A decibel (dB) is one-tenth of a bel. Therefore, the decibel values
of: -20dB, -10dB, 0dB, 10dB & 20dB, mean the power ratios: 0.01, 0.1, 1, 10, & 100,
respectively. In other words, every ten decibels mean that the power has changed by a
factor of ten.

Here's the catch: you usually want to work with a signal's amplitude, not its power. For
example, imagine an amplifier with 20dB of gain. By definition, this means that the power
in the signal has increased by a factor of 100. Since amplitude is proportional to the
square-root of power, the amplitude of the output is 10 times the amplitude of the input.
While 20dB means a factor of 100 in power, it only means a factor of 10 in amplitude.
Every twenty decibels mean that the amplitude has changed by a factor of ten. In
equation form:

    dB = 10 log10(P2/P1)        dB = 20 log10(A2/A1)

The above equations use the base-10 logarithm; however, many computer languages only provide a function for the base-e logarithm (the natural log, written loge x or ln x). The natural log can be used by modifying the above equations: dB = 4.342945 loge(P2/P1) and dB = 8.685890 loge(A2/A1).
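These conversions can be sketched directly (and used to check the natural-log coefficients quoted above):

```python
import math

def power_db(p2, p1):
    """Decibel value of a power ratio: 10 log10(P2/P1)."""
    return 10.0 * math.log10(p2 / p1)

def amplitude_db(a2, a1):
    """Decibel value of an amplitude ratio: 20 log10(A2/A1)."""
    return 20.0 * math.log10(a2 / a1)
```

A 100x power gain and a 10x amplitude gain are both 20 dB, and 10/ln(10) ≈ 4.342945 reproduces the natural-log form.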

Since decibels are a way of expressing the ratio between two signals, they are ideal for
describing the gain of a system, i.e., the ratio between the output and the input signal.
However, engineers also use decibels to specify the amplitude (or power) of
a single signal, by referencing it to some standard. For example, the term: dBV means
that the signal is being referenced to a 1 volt rms signal. Likewise, dBm indicates a
reference signal producing 1 mW into a 600 ohms load (about 0.78 volts rms).

Digital filters are used for two general purposes: (1) separation of signals that have been
combined, and (2) restoration of signals that have been distorted in some way. Analog
(electronic) filters can be used for these same tasks; however, digital filters can achieve
far superior results. The most popular digital filters are described and compared in the
next seven chapters. This introductory chapter describes the parameters you want to look
for when learning about each of these filters.

How Information is Represented in Signals

The most important part of any DSP task is understanding how information is contained
in the signals you are working with. There are many ways that information can be
contained in a signal. This is especially true if the signal is manmade. For instance,
consider all of the modulation schemes that have been devised: AM, FM, single-sideband,
pulse-code modulation, pulse-width modulation, etc. The list goes on and on. Fortunately,
there are only two ways that are common for information to be represented in naturally
occurring signals. We will call these: information represented in the time domain,
and information represented in the frequency domain.

Information represented in the time domain describes when something occurs and what
the amplitude of the occurrence is. For example, imagine an experiment to study the light
output from the sun. The light output is measured and recorded once each second. Each
sample in the signal indicates what is happening at that instant, and the level of the event.
If a solar flare occurs, the signal directly provides information on the time it occurred, the
duration, the development over time, etc. Each sample contains information that is
interpretable without reference to any other sample. Even if you have only one sample
from this signal, you still know something about what you are measuring. This is the
simplest way for information to be contained in a signal.

In contrast, information represented in the frequency domain is more indirect. Many things
in our universe show periodic motion. For example, a wine glass struck with a fingernail
will vibrate, producing a ringing sound; the pendulum of a grandfather clock swings back
and forth; stars and planets rotate on their axis and revolve around each other, and so
forth. By measuring the frequency, phase, and amplitude of this periodic motion,
information can often be obtained about the system producing the motion. Suppose we
sample the sound produced by the ringing wine glass. The fundamental frequency and
harmonics of the periodic vibration relate to the mass and elasticity of the material. A
single sample, in itself, contains no information about the periodic motion, and therefore
no information about the wine glass. The information is contained in
the relationship between many points in the signal.

This brings us to the importance of the step and frequency responses. The step
response describes how information represented in the time domain is being modified by
the system. In contrast, the frequency response shows how information represented in
the frequency domain is being changed. This distinction is absolutely critical in filter
design because it is not possible to optimize a filter for both applications. Good
performance in the time domain results in poor performance in the frequency domain,
and vice versa. If you are designing a filter to remove noise from an EKG signal
(information represented in the time domain), the step response is the important
parameter, and the frequency response is of little concern. If your task is to design a
digital filter for a hearing aid (with the information in the frequency domain), the frequency
response is all important, while the step response doesn't matter. Now let's look at what
makes a filter optimal for time domain or frequency domain applications.

Time Domain Parameters

It may not be obvious why the step response is of such concern in time domain filters.
You may be wondering why the impulse response isn't the important parameter. The
answer lies in the way that the human mind understands and processes information.
Remember that the step, impulse and frequency responses all contain identical
information, just in different arrangements. The step response is useful in time domain
analysis because it matches the way humans view the information contained in the
signals.

For example, suppose we are given a signal of some unknown origin and asked to
analyze it. The first thing we need to do is divide the signal into regions of similar
characteristics. You can't stop from doing this; your mind will do it automatically. Some of
the regions may be smooth; others may have large amplitude peaks; others may be noisy.
This segmentation is accomplished by identifying the points that separate the regions.
This is where the step function comes in. The step function is the purest way of
representing a division between two dissimilar regions. It can mark when an event starts,
or when an event ends. It tells you that whatever is on the left is somehow different from
whatever is on the right. This is how the human mind views time domain information: a
group of step functions dividing the information into regions of similar characteristics. The
step response, in turn, is important because it describes how the dividing lines are being
modified by the filter.

The step response parameters that are important in filter design are shown in Fig. 14-2.
To distinguish events in a signal, the duration of the step response must be shorter than
the spacing of the events. This dictates that the step response should be as fast (the DSP
jargon) as possible. This is shown in Figs. (a) & (b). The most common way to specify
the risetime (more jargon) is to quote the number of samples between the 10% and 90%
amplitude levels. Why isn't a very fast risetime always possible? There are many reasons: noise reduction, inherent limitations of the data acquisition system, avoiding aliasing, etc. Figures (c) and (d) show the next parameter that is important: overshoot in the step response. Overshoot must generally be eliminated because it changes the amplitude of samples in the signal; this is a basic distortion of the information contained in the time domain. This can be summed up in one question:

Is the overshoot you observe in a signal coming from the thing you are trying to measure, or from the filter you have used?
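The 10% to 90% risetime measurement described above can be sketched as follows (assuming a step response that settles to its final sample value):

```python
def risetime_samples(step, low=0.1, high=0.9):
    """Number of samples between the 10% and 90% amplitude levels
    of a step response, relative to its final value."""
    final = step[-1]
    i_low = next(i for i, v in enumerate(step) if v >= low * final)
    i_high = next(i for i, v in enumerate(step) if v >= high * final)
    return i_high - i_low
```

A faster filter yields a smaller count, letting closely spaced events in the signal remain distinguishable.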

Finally, it is often desired that the upper half of the step response be symmetrical with the
lower half, as illustrated in (e) and (f). This symmetry is needed to make the rising
edges look the same as the falling edges. This symmetry is called linear phase, because
the frequency response has a phase that is a straight line.

Frequency Domain Parameters

Figure 14-3 shows the four basic frequency responses. The purpose of these filters is to
allow some frequencies to pass unaltered, while completely blocking other frequencies.
The passband refers to those frequencies that are passed, while the stopband contains
those frequencies that are blocked. The transition band is between. A fast roll-
off means that the transition band is very narrow. The division between the passband
and transition band is called the cutoff frequency. In analog filter design, the cutoff
frequency is usually defined to be where the amplitude is reduced to 0.707 (i.e., -3dB).
Digital filters are less standardized, and it is common to see 99%, 90%, 70.7%, and 50%
amplitude levels defined to be the cutoff frequency.
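Given a sampled magnitude response, any of these amplitude conventions can be applied the same way; a sketch using the 70.7% (-3 dB) convention:

```python
def cutoff_index(magnitude, level=0.707):
    """Index of the first frequency sample whose magnitude drops below `level`."""
    for i, m in enumerate(magnitude):
        if m < level:
            return i
    return None  # response never drops below the chosen level
```

Changing `level` to 0.99, 0.9, or 0.5 selects one of the other common digital-filter conventions mentioned above.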

Figure 14-4 shows three parameters that measure how well a filter performs in the
frequency domain. To separate closely spaced frequencies, the filter must have a fast
roll-off, as illustrated in (a) and (b). For the passband frequencies to move through the
filter unaltered, there must be no passband ripple, as shown in (c) and (d). Lastly, to
adequately block the stopband frequencies, it is necessary to have good stopband
attenuation, displayed in (e) and (f).
The number of samples used to represent the impulse response can be arbitrarily large.
For instance, suppose you want to find the frequency response of a filter kernel that
consists of 80 points. Since the FFT only works with signals that are a power of two, you
need to add 48 zeros to the signal to bring it to a length of 128 samples. This padding
with zeros does not change the impulse response. To understand why this is so, think
about what happens to these added zeros when the input signal is convolved with the
system's impulse response. The added zeros simply vanish in the convolution, and do
not affect the outcome.

Taking this a step further, you could add many zeros to the impulse response to make it,
say, 256, 512, or 1024 points long. The important idea is that longer impulse responses
result in a closer spacing of the data points in the frequency response. That is, there are
more samples spread between DC and one-half of the sampling rate. Taking this to the
extreme, if the impulse response is padded with an infinite number of zeros, the data
points in the frequency response are infinitesimally close together, i.e., a continuous line.
In other words, the frequency response of a filter is really a continuous signal between
DC and one-half of the sampling rate. The output of the DFT is a sampling of this
continuous line. What length of impulse response should you use when calculating a
filter's frequency response? As a first thought, try , but don't be afraid to change it if
needed (such as insufficient resolution or excessive computation time).
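The padding step in the 80-point example can be sketched as:

```python
def pad_to_power_of_two(kernel):
    """Zero-pad a filter kernel up to the next power-of-two length for the FFT."""
    n = 1
    while n < len(kernel):
        n *= 2
    return list(kernel) + [0.0] * (n - len(kernel))
```

An 80-point kernel becomes 128 points: the 48 appended zeros vanish in the convolution, so the filter itself is unchanged while the frequency response is sampled more finely.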

Keep in mind that the "good" and "bad" parameters discussed in this chapter are only
generalizations. Many signals don't fall neatly into categories. For example, consider an
EKG signal contaminated with 60 hertz interference. The information is encoded in
the time domain, but the interference is best dealt with in the frequency domain. The best
design for this application is
CHAPTER 2:

LITERATURE SURVEY:

In signal processing, a digital filter is a system that performs mathematical operations on


a sampled, discrete-time signal to reduce or enhance certain aspects of that signal. This
is in contrast to the other major type of electronic filter, the analog filter, which is an
electronic circuit operating on continuous-time analog signals.

A digital filter system usually consists of an analog-to-digital converter to sample the input signal, followed by a microprocessor and some peripheral components such as memory to store data and filter coefficients, and finally a digital-to-analog converter to complete the output stage. Program instructions (software) running on the microprocessor
implement the digital filter by performing the necessary mathematical operations on the
numbers received from the ADC. In some high performance applications, an FPGA or
ASIC is used instead of a general purpose microprocessor, or a specialized DSP with
specific paralleled architecture for expediting operations such as filtering.

Digital filters may be more expensive than an equivalent analog filter due to their
increased complexity, but they make practical many designs that are impractical or
impossible as analog filters. When used in the context of real-time analog systems, digital
filters sometimes have problematic latency (the difference in time between the input and
the response) due to the associated analog-to-digital and digital-to-analog conversions
and anti-aliasing filters, or due to other delays in their implementation.

FUNDAMENTALS AND ALGORITHMS:

Digital filter

A digital filter is a system that performs mathematical operations on a sampled, digitized


signal to reduce or enhance certain features of the processed signal. A digital filter scheme consists of a prefilter, or anti-aliasing filter, which low-pass filters the input signal; this is required to restrict the bandwidth of the signal so as to satisfy the sampling theorem. An interface is needed between the analog signal and the digital filter; this interface is known as the analog-to-digital converter (ADC). After sampling and conversion, the digital signal is ready for further processing by an appropriate digital signal processor. The digitized output signal is usually changed back into analog form using a digital-to-analog converter (DAC). The digital filtering process is shown in
Figure 1 (Alam & Gustafsson, 2014). Digital filtering is a major topic in the field of digital signal processing (DSP). Over the past few years, DSP has become popular both technologically and theoretically; a major reason for its success in industry is the low cost and rapid development of software and hardware. Applications of DSP are mainly algorithms that are implemented either in software, using interactive tools such as MATLAB, or on a processor. In high-bandwidth applications, an FPGA, an ASIC, or a specialized digital signal processor is used to expedite filtering operations. Digital filters are preferred because they eliminate several problems associated with analog filters. There are two fundamental types of digital filters: Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) (Tian, Li, & Li, 2013).

FIR filters

FIR filters, also known as non-recursive digital filters, have a finite impulse response: after a finite time, the response of an FIR filter settles to zero. A block diagram of an FIR filter is shown in Figure 2. The basic structure of an FIR filter consists of adders, multipliers, and delay elements, as shown in Figure 3. The difference equation of an nth-order FIR digital filter can be represented as:

    y[m] = b0 x[m] + b1 x[m-1] + ... + bn x[m-n]

or, equivalently in the z-domain,

    Y(z) = (b0 + b1 z^-1 + ... + bn z^-n) X(z)

where X(z) is the filter's input and Y(z) is the filter's output. In a realization, a given transfer function is converted into a suitable filter structure (Xu, Yin, Qin, & Zou, 2013).
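The difference equation above can be evaluated directly, as in this sketch (the coefficient list b is illustrative):

```python
def fir_output(b, x, n):
    """y[n] = sum_k b[k] * x[n-k], with x taken as zero outside its range."""
    return sum(b[k] * x[n - k]
               for k in range(len(b)) if 0 <= n - k < len(x))
```

Each output sample is a weighted sum of the current and past inputs only, which is what makes the response finite.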

The main advantages of the FIR filter design over their IIR equivalents are the following:

(1) FIR filters with exactly linear phase can easily be designed.

(2) There exist computationally efficient realizations for implementing FIR filters.
(3) FIR filters realized non-recursively are inherently stable and free of limit cycle
oscillations when implemented on a finite-word length digital system.

(4) Excellent design methods are available for various kinds of FIR filters with arbitrary specifications.

(5) Design and noise issues are less complex than for IIR filters (Rabiner, Kaiser, Herrmann, & Dolan, 1974).

Design stages of digital filters

Design of a digital filter involves the following five steps:

(1) Filter specification

(2) Filter coefficient calculation

(3) Realization

(4) Analysis of finite word length effect and

(5) Implementation

These five stages are interlinked as shown in Figure 4.

In the first stage, the required specifications of the FIR filter are defined. In the second stage, the window method is selected for coefficient calculation because, with its well-defined equations, it offers a simple and flexible way of calculating the FIR filter coefficients.

Filter designing

The filter design and analysis tool (FDATool) is used for designing the digital filters. It is a powerful user interface for quickly designing filters and analyzing their behavior in signal processing. It is used to realize a quantized direct-form FIR filter Simulink model (Siauw & Bayen, 2014). To analyze the behavior of the FIR digital filter, different window functions are applied using the specifications shown in Table 1.
Magnitude and phase responses of a 15th-order digital band-pass filter using the Hanning, Hamming, Blackman, and Kaiser window functions are observed and investigated, as shown in Figures 5–8 (Jieshan & Shizhen, 2009). Table 2 gives a brief comparison among the different window functions used in the designs. From this comparison, it is observed that the Kaiser window is more reliable and results in higher gain.

Generally, the width of the main lobe determines the transition bandwidth, while the
relative heights of the side lobes control the size of the ripples in the amplitude response.
There is a trade-off between main-lobe width and side-lobe height; in other words, both
quantities cannot be reduced at the same time. Calculations show that the Kaiser window
gives the minimum normalized main-lobe transition width, i.e. 0.11719, and a sharp cutoff,
which means this window has a narrower transition band at the cost of more ripple. This
window gives simple and fast results (Patel, Kumar, Jaiswal, & Saxena, 2013).
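The main-lobe/side-lobe trade-off described above can be reproduced numerically. The
following self-contained Python sketch (an illustration, not part of the thesis's
MATLAB-based flow) builds Hamming and Kaiser windows from their standard formulas and
estimates each window's peak side-lobe level; the 16-tap length and beta = 8.6 are
illustrative choices, not the thesis's exact parameters.

```python
import math

def i0(x, terms=30):
    """Modified Bessel function I0(x) via its power series."""
    s, t = 1.0, 1.0
    for k in range(1, terms):
        t *= (x / (2.0 * k)) ** 2
        s += t
    return s

def kaiser(n, beta):
    """n-point Kaiser window (larger beta: lower side lobes, wider main lobe)."""
    return [i0(beta * math.sqrt(1.0 - (2.0 * k / (n - 1) - 1.0) ** 2)) / i0(beta)
            for k in range(n)]

def hamming(n):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def peak_sidelobe_db(w, nfft=2048):
    """Peak side-lobe level in dB relative to the main lobe (naive DFT)."""
    mags = []
    for m in range(nfft // 2):
        re = sum(w[k] * math.cos(2.0 * math.pi * m * k / nfft) for k in range(len(w)))
        im = sum(w[k] * math.sin(2.0 * math.pi * m * k / nfft) for k in range(len(w)))
        mags.append(math.hypot(re, im))
    i = 1
    while i + 1 < len(mags) and mags[i + 1] < mags[i]:
        i += 1  # walk down the main lobe to its first null
    return 20.0 * math.log10(max(mags[i:]) / mags[0])

n = 16  # taps of a 15th-order filter (illustrative)
print("Hamming      :", round(peak_sidelobe_db(hamming(n)), 1), "dB")
print("Kaiser b=8.6 :", round(peak_sidelobe_db(kaiser(n, 8.6)), 1), "dB")
```

With these settings the Kaiser window's side lobes come out far below the Hamming
window's, at the price of a wider main lobe, which is exactly the trade-off the paragraph
above describes.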

Hardware realization

The focus of this research work is to minimize the hardware implementation cost. To this
end, the effects of quantization are observed by varying the number of quantization bits
and analyzing the corresponding frequency responses (Mehboob, Khan, & Qamar, 2009).
A 15th-order band-pass filter using the Kaiser window is realized as a double-precision
floating-point implementation (see Figure 4) and then converted to a quantized filter.
Float-to-fixed-point conversion is required to target ASIC and fixed-point digital signal
processor cores, which will lessen truncation and calculation complexity.
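The float-to-fixed conversion above can be sketched in a few lines of Python. This is an
illustrative model only: the coefficient values and the signed fixed-point format (b total
bits, scale 2^(b-1)) are assumptions, not the exact quantizer used in FDATool.

```python
def quantize(coeffs, bits):
    """Round to signed fixed point with `bits` total bits (values in [-1, 1))."""
    scale = 2 ** (bits - 1)
    return [max(-scale, min(scale - 1, round(c * scale))) / scale for c in coeffs]

# Illustrative (made-up) low-pass coefficients, not the thesis's actual filter.
coeffs = [0.0132, -0.0674, 0.1241, 0.4887, 0.4887, 0.1241, -0.0674, 0.0132]

for bits in (16, 8, 4):
    q = quantize(coeffs, bits)
    err = max(abs(a - b) for a, b in zip(coeffs, q))
    print(f"{bits:2d} bits: max coefficient error = {err:.6f}")
```

Shrinking the word length raises the coefficient error, which shows up in the frequency
response as increased pass-band ripple and reduced stop-band attenuation.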

CHAPTER 3:
HARDWARE & SOFTWARE MODELLING

3.1 HARDWARE MODELLING:


INTRODUCTION TO ASIC DESIGN:

ASIC design is based on a flow that uses an HDL (either Verilog or VHDL) as the design
entry level. The following describes the flow from design specification up to tapeout, the
form sent to the silicon foundry for fabrication.

The steps of the flow are:

1. Specification: This is the beginning and most important step towards designing a
   chip, as the features and functionalities of the chip are defined. The macro- and
   micro-level architecture is derived from the required features and functionalities.
   Acceptable ranges of values are specified for speed, size, power consumption, and
   similar considerations. Other performance criteria are also set at this point and
   their viability deliberated; some form of simulation might be possible to check this.
2. RTL Coding: The microarchitecture from the specification is then transformed into
   RTL code, which marks the beginning of the real design phase towards realising a
   chip. As a real chip is expected, the code has to be synthesizable RTL code.

3. Simulation and Testbench: The RTL code and testbench are simulated using HDL
   simulators to check the functionality of the design. If Verilog is the language used,
   a Verilog simulator is required, while VHDL code needs a VHDL simulator. Some
   of the tools available at CEDEC include Cadence's Verilog-XL, Synopsys's VCS,
   and Mentor Graphics' ModelSim. If the simulation results do not agree with the
   intended function, either the testbench file or the RTL code could be the cause.
   If the RTL code is the source of error, the design has to be debugged. The
   simulation is repeated once either one of the two causes, or both, have been
   corrected. This loop continues until the RTL code correctly describes the required
   logical behaviour of the design.

4. Synthesis: This process is conducted on the RTL code, converting it into logic
   gates that are the functional equivalent of the RTL code as intended in the design.
   The synthesis process requires two input files: the "standard cell technology files"
   and the "constraints file". A synthesized database of the design is created in the
   system.

5. Pre-Layout Timing Analysis: When synthesis is completed, the synthesized
   database, along with timing information from the synthesis process, is used to
   perform static timing analysis (STA). Tweaking (making small changes) has to be
   done to correct any timing issues.

6. APR: This is the Automatic Place and Route process whereby the layout is
   produced. In this process, the synthesized database together with timing
   information from synthesis is used to place the logic gates. Most designs have
   critical paths whose timing requires them to be routed first. The process of
   placement and routing normally has some degree of flexibility.
7. Back Annotation: This is the process in which RC parasitics are extracted from
   the layout, and path delays are calculated from these parasitics. Long routing
   lines can significantly increase the interconnect delay of a path, and in sub-micron
   designs parasitics cause a significant increase in delay. Back annotation is the
   step that bridges synthesis and physical layout.

8. Post-Layout Timing Analysis: This step in the ASIC flow allows real timing
   violations, such as setup and hold, to be detected. The net interconnect delay
   information is fed into the timing analysis; any setup violation is fixed by
   optimizing the failing paths, while hold violations are fixed by inserting buffers
   into the path to increase its delay. The process cycles between APR, back
   annotation and post-layout timing analysis until the design is cleared of all
   violations. It is then ready for logic verification.

9. Logic Verification: This step acts as the final check that the design is functionally
   correct after the additional timing information from layout. If it fails, changes have
   to be made to the RTL code or the post-layout synthesis results to correct it.

10. Tapeout: Once the design has passed the logic verification check, it is ready for
    fabrication. The tapeout design is in the form of a GDSII file, which is accepted
    by the foundry.

FPGA ARCHITECTURE:

Logic blocks

(Figure: simplified example illustration of a logic cell)


The most common FPGA architecture consists of an array of logic blocks (called
configurable logic block, CLB, or logic array block, LAB, depending on vendor), I/O pads,
and routing channels. Generally, all the routing channels have the same width (number
of wires). Multiple I/O pads may fit into the height of one row or the width of one column
in the array.

An application circuit must be mapped into an FPGA with adequate resources. While the
number of CLBs/LABs and I/Os required is easily determined from the design, the number
of routing tracks needed may vary considerably even among designs with the same
amount of logic. For example, a crossbar switch requires much more routing than
a systolic array with the same gate count. Since unused routing tracks increase the cost
(and decrease the performance) of the part without providing any benefit, FPGA
manufacturers try to provide just enough tracks so that most designs that will fit in terms
of lookup tables (LUTs) and I/Os can be routed. This is determined by estimates such as
those derived from Rent's rule or by experiments with existing designs.

In general, a logic block (CLB or LAB) consists of a few logical cells (called ALM, LE, slice,
etc.). A typical cell consists of a 4-input LUT, a full adder (FA) and a D-type flip-flop, as
shown below. The LUTs in this figure are split into two 3-input LUTs. In normal mode
those are combined into a 4-input LUT through the left mux. In arithmetic mode, their
outputs are fed to the FA. The selection of mode is programmed into the middle
multiplexer. The output can be either synchronous or asynchronous, depending on the
programming of the mux to the right in the figure example. In practice, the entire FA or
parts of it are put as functions into the LUTs in order to save space.
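The LUT described above is, in effect, a small memory addressed by the logic inputs. The
hypothetical Python model below (make_lut is an invented helper, not any vendor's API)
shows how a 2^k-entry configuration table implements any k-input Boolean function:

```python
def make_lut(truth_bits):
    """A k-input LUT: truth_bits[i] is the output for input pattern i
    (input 0 is the least significant address bit)."""
    def lut(*inputs):
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return truth_bits[index]
    return lut

# Configure a 2-input LUT as XOR ...
xor2 = make_lut([0, 1, 1, 0])
print(xor2(0, 1))  # 1

# ... and a 4-input LUT as an "at least 3 of 4 inputs high" gate.
maj4 = make_lut([1 if bin(i).count("1") >= 3 else 0 for i in range(16)])
print(maj4(1, 1, 1, 0))  # 1
```

Programming an FPGA amounts to loading such truth tables (plus routing configuration)
into the device, which is why any small combinational function fits in one cell.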

Hard blocks

Modern FPGA families expand upon the above capabilities to include higher level
functionality fixed into the silicon. Having these common functions embedded into the
silicon reduces the area required and gives those functions increased speed compared
to building them from primitives. Examples of these include multipliers, generic DSP
blocks, embedded processors, high speed I/O logic and embedded memories.
Higher-end FPGAs can contain high speed multi-gigabit transceivers and hard IP
cores such as processor cores, Ethernet MACs, PCI/PCI Express controllers, and
external memory controllers. These cores exist alongside the programmable fabric, but
they are built out of transistors instead of LUTs so they have ASIC level performance and
power consumption while not consuming a significant amount of fabric resources, leaving
more of the fabric free for the application-specific logic. The multi-gigabit transceivers also
contain high performance analog input and output circuitry along with high-speed
serializers and deserializers, components which cannot be built out of LUTs. Higher-level
PHY layer functionality such as line coding may or may not be implemented alongside
the serializers and deserializers in hard logic, depending on the FPGA.

Clocking

Most of the circuitry built inside of an FPGA is synchronous circuitry that requires a clock
signal. FPGAs contain dedicated global and regional routing networks for clock and reset
so they can be delivered with minimal skew. Also, FPGAs generally contain
analog PLL and/or DLL components to synthesize new clock frequencies as well as
attenuate jitter. Complex designs can use multiple clocks with different frequency and
phase relationships, each forming separate clock domains. These clock signals can be
generated locally by an oscillator or they can be recovered from a high speed serial data
stream. Care must be taken when building clock domain crossing circuitry to avoid
metastability. FPGAs generally contain block RAMs that are capable of working as dual-
port RAMs with different clocks, aiding in the construction of FIFOs and dual-port
buffers that connect differing clock domains.

3D architectures

To shrink the size and power consumption of FPGAs, vendors such as Tabula and Xilinx
have introduced 3D or stacked architectures. Following the introduction of its 28 nm
7-series FPGAs, Xilinx revealed that several of the highest-density parts in those FPGA
product lines will be constructed using multiple dies in one package, employing
technology developed for 3D construction and stacked-die assemblies.
Xilinx's approach stacks several (three or four) active FPGA dies side-by-side on a
silicon interposer – a single piece of silicon that carries passive interconnect. The
multi-die construction also allows different parts of the FPGA to be created with different
process technologies, as the process requirements differ between the FPGA fabric
itself and the very high-speed 28 Gbit/s serial transceivers. An FPGA built in this way is
called a heterogeneous FPGA.

Altera's heterogeneous approach involves using a single monolithic FPGA die and
connecting other die/technologies to the FPGA using Intel's embedded multi-die
interconnect bridge (EMIB) technology.

DESIGN AND PROGRAMMING:

To define the behavior of the FPGA, the user provides a design in a hardware description
language (HDL) or as a schematic design. The HDL form is more suited to work with large
structures because it's possible to just specify them numerically rather than having to
draw every piece by hand. However, schematic entry can allow for easier visualisation of
a design.

Then, using an electronic design automation tool, a technology-mapped netlist is
generated. The netlist can then be fit to the actual FPGA architecture using a process
called place-and-route, usually performed by the FPGA company's proprietary place-and-
route software. The user validates the map, place and route results via timing
analysis, simulation, and other verification methodologies. Once the design and
validation process is complete, the binary file generated (also using the FPGA company's
proprietary software) is used to (re)configure the FPGA. This file is transferred to the
FPGA/CPLD via a serial interface (JTAG) or to an external memory device such as
an EEPROM.

The most common HDLs are VHDL and Verilog, although in an attempt to reduce the
complexity of designing in HDLs, which have been compared to the equivalent
of assembly languages, there are moves to raise the abstraction level through the
introduction of alternative languages. National Instruments' LabVIEW graphical
programming language (sometimes referred to as "G") has an FPGA add-in module
available to target and program FPGA hardware.

To simplify the design of complex systems in FPGAs, there exist libraries of predefined
complex functions and circuits that have been tested and optimized to speed up the
design process. These predefined circuits are commonly called IP cores, and are
available from FPGA vendors and third-party IP suppliers (rarely free, and typically
released under proprietary licenses). Other predefined circuits are available from
developer communities such as OpenCores (typically released under free and open
source licenses such as the GPL, BSD or similar license), and other sources.

In a typical design flow, an FPGA application developer will simulate the design at multiple
stages throughout the design process. Initially the RTL description in VHDL or Verilog is
simulated by creating test benches to simulate the system and observe results. Then,
after the synthesis engine has mapped the design to a netlist, the netlist is translated to
a gate level description where simulation is repeated to confirm the synthesis proceeded
without errors. Finally the design is laid out in the FPGA at which point propagation delays
can be added and the simulation run again with these values back-annotated onto the
netlist.

More recently, OpenCL is being used by programmers to take advantage of the
performance and power efficiencies that FPGAs provide. OpenCL allows programmers
to develop code in the C programming language and target FPGA functions as OpenCL
kernels using OpenCL constructs.

3.2 SOFTWARE MODELLING:

VERILOG MODELLING FOR VLSI DESIGNS:

Verilog, standardized as IEEE 1364, is a hardware description language (HDL) used to
model electronic systems. It is most commonly used in the design and verification of
digital circuits at the register-transfer level of abstraction. It is also used in the verification
of analog circuits and mixed-signal circuits, as well as in the design of genetic circuits.

Overview

Hardware description languages such as Verilog differ from software programming
languages because they include ways of describing the propagation time and signal
strengths (sensitivity). There are two types of assignment operators: a blocking
assignment (=), and a non-blocking (<=) assignment. The non-blocking assignment
allows designers to describe a state-machine update without needing to declare and
use temporary storage variables. Since these concepts are part of Verilog's language
semantics, designers could quickly write descriptions of large circuits in a relatively
compact and concise form. At the time of Verilog's introduction (1984), Verilog
represented a tremendous productivity improvement for circuit designers who were
already using graphical schematic capture software and specially written software
programs to document and simulate electronic circuits.

The designers of Verilog wanted a language with syntax similar to the C programming
language, which was already widely used in engineering software development. Like C,
Verilog is case-sensitive and has a basic preprocessor (though less sophisticated than
that of ANSI C/C++). Its control flow keywords (if/else, for, while, case, etc.) are
equivalent, and its operator precedence is compatible with C. Syntactic differences
include: required bit-widths for variable declarations, demarcation of procedural blocks
(Verilog uses begin/end instead of curly braces {}), and many other minor differences.
Verilog requires that variables be given a definite size. In C these sizes are assumed from
the 'type' of the variable (for instance an integer type may be 8 bits).

A Verilog design consists of a hierarchy of modules. Modules encapsulate design
hierarchy, and communicate with other modules through a set of declared input, output,
and bidirectional ports. Internally, a module can contain any combination of the following:
net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement
blocks, and instances of other modules (sub-hierarchies). Sequential statements are
placed inside a begin/end block and executed in sequential order within the block.
However, the blocks themselves are executed concurrently, making Verilog a dataflow
language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined") and signal strengths (strong, weak, etc.). This system allows abstract
modeling of shared signal lines, where multiple sources drive a common net. When a wire
has multiple drivers, the wire's (readable) value is resolved by a function of the source
drivers and their strengths.

A subset of statements in the Verilog language are synthesizable. Verilog modules that
conform to a synthesizable coding style, known as RTL (register-transfer level), can be
physically realized by synthesis software. Synthesis software algorithmically transforms
the (abstract) Verilog source into a netlist, a logically equivalent description consisting
only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a
specific FPGA or VLSI technology. Further manipulations to the netlist ultimately lead to
a circuit fabrication blueprint (such as a photo mask set for an ASIC or a bitstream file for
an FPGA).

History

Beginning

Verilog was one of the first popular hardware description languages to be invented. It was
created by Prabhu Goel, Phil Moorby, Chi-Lai Huang and Douglas Warmke between late
1983 and early 1984. Chi-Lai Huang had earlier worked on a hardware description
language, LALSD, developed by Professor S.Y.H. Su, for his PhD work. Verilog was
created at Automated Integrated Design Systems (renamed Gateway Design Automation
in 1985) as a hardware modeling language. Gateway Design Automation was purchased
by Cadence Design Systems in 1990. Cadence now has full proprietary rights to
Gateway's Verilog and to Verilog-XL, the HDL simulator that would become the de facto
standard of Verilog logic simulators for the next decade. Originally, Verilog was intended
only to describe and simulate designs; the automated synthesis of subsets of the
language to physically realizable structures (gates etc.) was developed after the language
had achieved widespread usage.

Verilog is a portmanteau of the words "verification" and "logic".

Verilog-95

With the increasing success of VHDL at the time, Cadence decided to make the language
available for open standardization. Cadence transferred Verilog into the public domain
under the Open Verilog International (OVI) (now known as Accellera) organization.
Verilog was later submitted to IEEE and became IEEE Standard 1364-1995, commonly
referred to as Verilog-95.

In the same time frame Cadence initiated the creation of Verilog-A to put standards
support behind its analog simulator Spectre. Verilog-A was never intended to be a
standalone language and is a subset of Verilog-AMS which encompassed Verilog-95.

Verilog 2001

Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that users
had found in the original Verilog standard. These extensions became IEEE Standard
1364-2001 known as Verilog-2001.

Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support for (2's
complement) signed nets and variables. Previously, code authors had to perform signed
operations using awkward bit-level manipulations (for example, the carry-out bit of a
simple 8-bit addition required an explicit description of the Boolean algebra to determine
its correct value). The same function under Verilog-2001 can be more succinctly
described by one of the built-in operators: +, -, /, *, >>>. A generate/endgenerate construct
(similar to VHDL's generate/endgenerate) allows Verilog-2001 to control instance and
statement instantiation through normal decision operators (case/if/else). Using
generate/endgenerate, Verilog-2001 can instantiate an array of instances, with control
over the connectivity of the individual instances. File I/O has been improved by several
new system tasks. And finally, a few syntax additions were introduced to improve code
readability (e.g. always @*, named parameter override, C-style function/task/module
header declaration).

Verilog-2001 is the version of Verilog supported by the majority of
commercial EDA software packages.

Verilog 2005

Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)
consists of minor corrections, spec clarifications, and a few new language features (such
as the uwire keyword).

A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and
mixed signal modeling with traditional Verilog.

SystemVerilog


The advent of hardware verification languages such as OpenVera and Verisity's e
language encouraged the development of Superlog by Co-Design Automation Inc.
(acquired by Synopsys). The foundations of Superlog and Vera were donated
to Accellera, which later became the IEEE standard P1800-2005: SystemVerilog.

SystemVerilog is a superset of Verilog-2005, with many new features and capabilities to
aid design verification and design modeling. As of 2009, the SystemVerilog and Verilog
language standards were merged into SystemVerilog 2009 (IEEE Standard 1800-2009).
The current version is IEEE Standard 1800-2012.

Example

A simple example of two flip-flops follows:

module toplevel(clock, reset);
  input clock;
  input reset;

  reg flop1;
  reg flop2;

  always @ (posedge reset or posedge clock)
    if (reset)
      begin
        flop1 <= 0;
        flop2 <= 1;
      end
    else
      begin
        flop1 <= flop2;
        flop2 <= flop1;
      end
endmodule

The "<=" operator in Verilog is another aspect of its being a hardware description
language as opposed to a normal procedural language. This is known as a "non-blocking"
assignment. Its action doesn't register until after the always block has executed. This
means that the order of the assignments is irrelevant and will produce the same result:
flop1 and flop2 will swap values every clock.

The other assignment operator, "=", is referred to as a blocking assignment. When "="
assignment is used, for the purposes of logic, the target variable is updated immediately.
In the above example, had the statements used the "=" blocking operator instead of "<=",
flop1 and flop2 would not have been swapped. Instead, as in traditional programming, the
compiler would understand to simply set flop1 equal to flop2 (and subsequently ignore
the redundant logic to set flop2 equal to flop1).
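The difference between the two operators can be modeled outside Verilog. In this
hypothetical Python sketch (the dictionary-based "state" is purely illustrative),
non-blocking assignment means "compute all right-hand sides from the old state, then
commit together", while blocking assignment updates the state as it goes:

```python
def clock_edge_nonblocking(state):
    # Non-blocking (<=): every right-hand side reads the OLD state,
    # and all updates commit together at the end of the time step.
    updates = {"flop1": state["flop2"], "flop2": state["flop1"]}
    state.update(updates)

def clock_edge_blocking(state):
    # Blocking (=): each assignment takes effect immediately,
    # so the second statement reads the NEW value of flop1.
    state["flop1"] = state["flop2"]
    state["flop2"] = state["flop1"]

s = {"flop1": 0, "flop2": 1}
clock_edge_nonblocking(s)
print(s)  # {'flop1': 1, 'flop2': 0} -- the values swap

s = {"flop1": 0, "flop2": 1}
clock_edge_blocking(s)
print(s)  # {'flop1': 1, 'flop2': 1} -- no swap
```

The deferred-commit behavior is what lets two always blocks exchange values on the
same clock edge without an explicit temporary variable.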

An example counter circuit follows:

module Div20x (rst, clk, cet, cep, count, tc);
  // TITLE 'Divide-by-20 Counter with enables'
  // enable CEP is a clock enable only
  // enable CET is a clock enable and
  // enables the TC output
  // a counter using the Verilog language

  parameter size = 5;
  parameter length = 20;

  input rst;  // These inputs/outputs represent
  input clk;  // connections to the module.
  input cet;
  input cep;

  output [size-1:0] count;
  output tc;

  reg [size-1:0] count;  // Signals assigned within an always
                         // (or initial) block must be of type reg

  wire tc;  // Other signals are of type wire

  // The always statement below is a parallel execution statement that
  // executes any time the signals rst or clk transition from low to high
  always @ (posedge clk or posedge rst)
    if (rst)  // This causes reset of the cntr
      count <= {size{1'b0}};
    else
      if (cet && cep)  // Enables both true
        begin
          if (count == length-1)
            count <= {size{1'b0}};
          else
            count <= count + 1'b1;
        end

  // the value of tc is continuously assigned the value of the expression
  assign tc = (cet && (count == length-1));

endmodule

An example of delays:

...
reg a, b, c, d;
wire e;
...

always @(b or e)
  begin
    a = b & e;
    b = a | b;
    #5 c = b;
    d = #6 c ^ e;
  end

The always clause above illustrates the other method of use, i.e. it executes whenever
any of the entities in its sensitivity list (b or e) changes. When one of these changes, a is
immediately assigned a new value, and, due to the blocking assignment, b is assigned a
new value afterward (taking into account the new value of a).
After a delay of 5 time units, c is assigned the value of b and the value of c ^ e is tucked
away in an invisible store. Then after 6 more time units, d is assigned the value that was
tucked away.
Signals that are driven from within a process (an initial or always block) must be of
type reg. Signals that are driven from outside a process must be of type wire. The
keyword reg does not necessarily imply a hardware register.

Definition of constants

The definition of constants in Verilog supports the addition of a width parameter. The
basic syntax is:

<Width in bits>'<base letter><number>

Examples:

- 12'h123 — Hexadecimal 123 (using 12 bits)
- 20'd44 — Decimal 44 (using 20 bits — 0 extension is automatic)
- 4'b1010 — Binary 1010 (using 4 bits)
- 6'o77 — Octal 77 (using 6 bits)
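The syntax above is regular enough to parse mechanically. The following Python helper
(parse_sized_constant is an invented name for illustration, not part of any Verilog tool)
decodes a sized constant into its width and value:

```python
def parse_sized_constant(text):
    """Return (width, value) for a <width>'<base letter><digits> literal."""
    width_str, rest = text.split("'", 1)
    base = {"h": 16, "d": 10, "b": 2, "o": 8}[rest[0].lower()]
    width = int(width_str)
    value = int(rest[1:], base)
    return width, value & ((1 << width) - 1)  # keep only the declared width

print(parse_sized_constant("12'h123"))  # (12, 291)
print(parse_sized_constant("4'b1010"))  # (4, 10)
```

The final mask mirrors Verilog's behavior of truncating a literal to its declared width.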

Synthesizable constructs

There are several statements in Verilog that have no analog in real hardware, e.g.
$display. Consequently, much of the language can not be used to describe hardware.
The examples presented here are the classic subset of the language that has a direct
mapping to real gates.

// Mux examples — Three ways to do the same thing.

// The first example uses continuous assignment
wire out;
assign out = sel ? a : b;

// The second example uses a procedure
// to accomplish the same thing.
reg out;
always @(a or b or sel)
  begin
    case(sel)
      1'b0: out = b;
      1'b1: out = a;
    endcase
  end

// Finally — you can use if/else in a
// procedural structure.
reg out;
always @(a or b or sel)
  if (sel)
    out = a;
  else
    out = b;

The next interesting structure is a transparent latch; it will pass the input to the output
when the gate signal is set for "pass-through", and captures the input and stores it upon
transition of the gate signal to "hold". The output will remain stable regardless of the input
signal while the gate is set to "hold". In the example below the "pass-through" level of the
gate would be when the value of the if clause is true, i.e. gate = 1. This is read "if gate is
true, the din is fed to latch_out continuously." Once the if clause is false, the last value at
latch_out will remain and is independent of the value of din.

// Transparent latch example
reg latch_out;

always @(gate or din)
  if (gate)
    latch_out = din;  // Pass through state
// Note that the else isn't required here. The variable
// latch_out will follow the value of din while gate is
// high. When gate goes low, latch_out will remain constant.

The flip-flop is the next significant template; in Verilog, the D-flop is the simplest, and it
can be modeled as:

reg q;

always @(posedge clk)
  q <= d;

The significant thing to notice in the example is the use of the non-blocking assignment.
A basic rule of thumb is to use <= when there is a posedge or negedge statement within
the always clause.

A variant of the D-flop is one with an asynchronous reset; there is a convention that the
reset state will be the first if clause within the statement.

reg q;

always @(posedge clk or posedge reset)
  if (reset)
    q <= 0;
  else
    q <= d;

The next variant is including both an asynchronous reset and asynchronous set condition;
again the convention comes into play, i.e. the reset term is followed by the set term.

reg q;

always @(posedge clk or posedge reset or posedge set)
  if (reset)
    q <= 0;
  else if (set)
    q <= 1;
  else
    q <= d;

Note: If this model is used to model a Set/Reset flip flop then simulation errors can result.
Consider the following test sequence of events. 1) reset goes high 2) clk goes high 3) set
goes high 4) clk goes high again 5) reset goes low followed by 6) set going low. Assume
no setup and hold violations.

In this example the always @ statement would first execute when the rising edge of reset
occurs which would place q to a value of 0. The next time the always block executes
would be the rising edge of clk which again would keep q at a value of 0. The always
block then executes when set goes high which because reset is high forces q to remain
at 0. This condition may or may not be correct depending on the actual flip flop. However,
this is not the main problem with this model. Notice that when reset goes low, that set is
still high. In a real flip flop this will cause the output to go to a 1. However, in this model it
will not occur because the always block is triggered by rising edges of set and reset —
not levels. A different approach may be necessary for set/reset flip flops.

The final basic variant is one that implements a D-flop with a mux feeding its input. The
mux has a d-input and feedback from the flop itself. This allows a gated load function.

// Basic structure with an EXPLICIT feedback path
always @(posedge clk)
  if (gate)
    q <= d;
  else
    q <= q;  // explicit feedback path

// The more common structure ASSUMES the feedback is present
// This is a safe assumption since this is how the
// hardware compiler will interpret it. This structure
// looks much like a latch. The differences are the
// '''@(posedge clk)''' and the non-blocking '''<='''
always @(posedge clk)
  if (gate)
    q <= d;  // the "else" mux is "implied"

Note that there are no "initial" blocks mentioned in this description. There is a split
between FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial blocks
where reg values are established instead of using a "reset" signal. ASIC synthesis tools
don't support such a statement. The reason is that an FPGA's initial state is something
that is downloaded into the memory tables of the FPGA. An ASIC is an actual hardware
implementation.

Initial and always

There are two separate ways of declaring a Verilog process. These are the always and
the initial keywords. The always keyword indicates a free-running process.
The initial keyword indicates a process executes exactly once. Both constructs begin
execution at simulator time 0, and both execute until the end of the block. Once
an always block has reached its end, it is rescheduled (again). It is a common
misconception to believe that an initial block will execute before an always block. In fact,
it is better to think of the initial-block as a special-case of the always-block, one which
terminates after it completes for the first time.

// Examples:

initial
  begin
    a = 1;  // Assign a value to reg a at time 0
    #1;     // Wait 1 time unit
    b = a;  // Assign the value of reg a to reg b
  end

always @(a or b)  // Any time a or b CHANGE, run the process
  begin
    if (a)
      c = b;
    else
      d = ~b;
  end  // Done with this block, now return to the top (i.e. the @ event-control)

always @(posedge a)  // Run whenever reg a has a low to high change
  a <= b;

These are the classic uses for these two keywords, but there are two significant additional
uses. The most common of these is an always keyword without the @(...) sensitivity list.
It is possible to use always as shown below:

always
  begin  // Always begins executing at time 0 and NEVER stops
    clk = 0;  // Set clk to 0
    #1;       // Wait for 1 time unit
    clk = 1;  // Set clk to 1
    #1;       // Wait 1 time unit
  end  // Keeps executing — so continue back at the top of the begin

The always keyword acts like the C language construct while(1) {..} in the sense that it
will execute forever.

The other interesting exception is the use of the initial keyword with the addition of
the forever keyword.

The example below is functionally identical to the always example above.

initial forever  // Start at time 0 and repeat the begin/end forever
  begin
    clk = 0;  // Set clk to 0
    #1;       // Wait for 1 time unit
    clk = 1;  // Set clk to 1
    #1;       // Wait 1 time unit
  end

Fork/join

The fork/join pair is used by Verilog to create parallel processes. All statements (or
blocks) between a fork/join pair begin execution simultaneously when execution flow
hits the fork. Execution continues after the join upon completion of the longest-running
statement or block between the fork and join.

initial
  fork
    $write("A");    // Print char A
    $write("B");    // Print char B
    begin
      #1;           // Wait 1 time unit
      $write("C");  // Print char C
    end
  join

As written above, either the sequence "ABC" or "BAC" may print out. The order of
simulation between the first $write and the second $write depends on the simulator
implementation, and may purposefully be randomized by the simulator. This allows the
simulation to contain both accidental race conditions as well as intentional
non-deterministic behavior.

Notice that VHDL cannot dynamically spawn multiple processes the way Verilog can.[5]

Race conditions

The order of execution isn't always guaranteed within Verilog. This can best be illustrated
by a classic example. Consider the code snippet below:
initial
  a = 0;

initial
  b = a;

initial
  begin
    #1;
    $display("Value a=%d Value of b=%d", a, b);
  end

What will be printed out for the values of a and b? Depending on the order of execution
of the initial blocks, it could be zero and zero, or alternately zero and some other arbitrary
uninitialized value. The $display statement will always execute after both assignment
blocks have completed, due to the #1 delay.

Operators

Note: These operators are not shown in order of precedence.

Operator type   Operator symbols   Operation performed

Bitwise         ~                  Bitwise NOT (1's complement)
                &                  Bitwise AND
                |                  Bitwise OR
                ^                  Bitwise XOR
                ~^ or ^~           Bitwise XNOR

Logical         !                  NOT
                &&                 AND
                ||                 OR

Reduction       &                  Reduction AND
                ~&                 Reduction NAND
                |                  Reduction OR
                ~|                 Reduction NOR
                ^                  Reduction XOR
                ~^ or ^~           Reduction XNOR

Arithmetic      +                  Addition
                -                  Subtraction
                -                  2's complement (unary)
                *                  Multiplication
                /                  Division
                **                 Exponentiation (*Verilog-2001)

Relational      >                  Greater than
                <                  Less than
                >=                 Greater than or equal to
                <=                 Less than or equal to
                ==                 Logical equality (bit-value 1'bX is removed from comparison)
                !=                 Logical inequality (bit-value 1'bX is removed from comparison)
                ===                4-state logical equality (bit-value 1'bX is taken as literal)
                !==                4-state logical inequality (bit-value 1'bX is taken as literal)

Shift           >>                 Logical right shift
                <<                 Logical left shift
                >>>                Arithmetic right shift (*Verilog-2001)
                <<<                Arithmetic left shift (*Verilog-2001)

Concatenation   {, }               Concatenation

Replication     {n{m}}             Replicate value m for n times

Conditional     ?:                 Conditional
CHAPTER 4
INTRODUCTION TO FIR FILTER ARCHITECTURES
The finite impulse response (FIR) filter is used in many digital signal processing (DSP)
systems to perform signal preconditioning, antialiasing, band selection,
decimation/interpolation, low-pass filtering, and video convolution functions. Only a
limited selection of off-the-shelf FIR filter circuits is available; these circuits often limit
system performance. Therefore, programmable logic devices (PLDs) are an ideal choice
for implementing FIR filters. Altera FLEX devices, including the FLEX 10K and FLEX 8000
families, are flexible, high-performance devices that can easily implement FIR filters. For
example, you can use a FLEX device for one or more critical filtering functions in a DSP
microprocessor-based application, freeing the DSP processor to perform the lower-bit-
rate, algorithmically complex operations. A DSP microprocessor can implement an 8-tap
FIR filter at 5 million samples per second (MSPS), while an off-the-shelf FIR filter circuit
can deliver 30 MSPS. In contrast, FLEX devices can implement the same filter at over
100 MSPS. This application note describes how to map the mathematical operations of
the FIR filter into the FLEX architecture and compares this implementation to a hard-wired
design. Implementation details, including performance/device resource tradeoffs
through serialization, pipelining, and precision, are also discussed.

CONVENTIONAL FIR STRUCTURES:

The output of each register is called a tap and is represented by x(n), where n is the tap
number. Each tap is multiplied by a coefficient h(n), and all the products are summed.
The equation for this filter is:

y(n) = h(0)x(n) + h(1)x(n-1) + ... + h(N-1)x(n-N+1)

For a linear-phase FIR filter, the coefficients are symmetric around the center
values. This symmetry allows the symmetric taps to be added together before they are
multiplied by the coefficients. See Figure 2. For an 8-tap filter, taking advantage of the
symmetry lowers the number of multiplies from eight to four, which reduces the circuitry
required to implement the filter.
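As a behavioral illustration of the tapped-delay-line computation and the symmetry optimization described above, the following Python sketch compares the two forms. The 8-tap coefficients and input samples are made-up example values, not taken from the text:

```python
def fir_direct(x, h):
    """Direct-form FIR: y(n) = sum_k h(k) * x(n-k)."""
    N = len(h)
    taps = [0] * N            # tapped delay line, taps[k] holds x(n-k)
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]   # shift in the new sample
        y.append(sum(h[k] * taps[k] for k in range(N)))
    return y

def fir_symmetric(x, h):
    """Linear-phase FIR: add symmetric taps before multiplying,
    halving the multiplies (eight to four for an 8-tap filter)."""
    N = len(h)
    assert h == h[::-1], "coefficients must be symmetric"
    taps = [0] * N
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        acc = sum(h[k] * (taps[k] + taps[N - 1 - k]) for k in range(N // 2))
        if N % 2:                     # middle tap of an odd-length filter
            acc += h[N // 2] * taps[N // 2]
        y.append(acc)
    return y

h = [1, 2, 3, 4, 4, 3, 2, 1]          # 8-tap linear-phase example filter
x = [5, -1, 3, 0, 2, 7, 1, -4, 6, 2]
print(fir_direct(x, h) == fir_symmetric(x, h))  # True
```

With exact integer arithmetic the folded form produces identical outputs; in hardware the saving is the four removed multipliers, at the cost of the extra pre-adders.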
RECONFIGURABLE ARCHITECTURE:

The architecture of the block FIR filter for reconfigurable applications is shown in Fig.1
for block size L=4. The main blocks are one Register Unit (RU), one Coefficient Storage
Unit (CSU), one Pipeline Adder Unit (PAU), and M Inner Product Units (IPUs).
Fig.1. Block FIR filter for reconfigurable applications.

The Coefficient Storage Unit (CSU) stores the coefficients of all the filters used in the
reconfigurable application. It has N Read-Only Memory (ROM) Lookup Tables (LUTs),
where N is the length of the filter (N = ML). The Register Unit (RU), shown in Fig.2,
stores the input samples. It contains (L-1) registers. During the k-th cycle, the RU
accepts the input sample x_k and computes the L rows of S_k^0 in parallel. The outputs
of the RU are fed to the M Inner Product Units.
Fig.2. Internal structure of Register Unit (RU) for block size L=4

The Inner Product Unit (IPU), shown in Fig.3, multiplies S_k^0 with the small weight
vector c_m. The M Inner Product Units accept the L rows of S_k^0 from the RU and the
M small weight vectors from the CSU. Each IPU contains L Inner Product Cells (IPCs),
which perform the L inner-product computations of the L rows of S_k^0 with the
coefficient vector c_m and produce a block of L partial inner products. All M IPUs work
simultaneously, and M blocks of results are obtained.

Fig.3. Internal structure of (m+1)th IPU.


The internal structure of the (l+1)-th IPC is shown in Fig.4. The Inner Product Cell (IPC)
accepts the (l+1)-th row of S_k^0 and the small weight vector c_m and produces the
partial inner-product result r(kL - l), for 0 ≤ l ≤ L - 1. Each IPC consists of L multipliers
and (L-1) adders.

Fig.4. Internal structure of (l + 1)th IPC.
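The RU/IPU/PAU dataflow above can be modeled behaviorally, as in the Python sketch below. This is an illustrative software model, not the hardware description; it reuses the text's S_k^0 and c_m notation, while the concrete coefficient and input values are arbitrary:

```python
def block_fir(x, h, L):
    """Block FIR: each cycle consumes L input samples and produces L
    outputs.  The N coefficients are split into M = N // L small weight
    vectors c_m (the CSU contents); each c_m feeds one IPU that forms
    inner products with rows of S_k^0; the PAU sums the M partial results."""
    N = len(h)
    assert N % L == 0, "filter length must be a multiple of the block size"
    M = N // L
    c = [h[m * L:(m + 1) * L] for m in range(M)]   # small weight vectors
    history = [0] * (N + L - 1)   # recent samples, history[0] = newest (the RU)
    out = []
    for k in range(0, len(x), L):
        for sample in x[k:k + L]:            # shift the new block into the RU
            history = [sample] + history[:-1]
        for l in range(L - 1, -1, -1):       # L outputs per cycle, oldest first
            row = history[l:l + N]           # row of S_k^0 seen by the IPCs
            # M IPUs: inner product of each L-sample sub-row with c_m
            partial = [sum(c[m][j] * row[m * L + j] for j in range(L))
                       for m in range(M)]
            out.append(sum(partial))         # PAU: add the M partial products
    return out
```

For any block size L dividing N, the model reproduces the plain convolution output block by block, which is the property the hardware structure implements with parallel IPUs instead of a sequential loop.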

The Pipelined Adder Unit (PAU) receives the partial products from all M IPUs. An array
of Kogge-Stone Adders (KSAs), shown in Fig.5, is used in the PAU to add all the partial
products. The KSA is a carry-tree (parallel-prefix) adder. Kogge-Stone Adders are
favored among this family of adders because of their high performance.
Fig.5. Different Stages in Kogge Stone Adder

The KSA is implemented in three stages: the pre-computation stage, the carry-generation
network, and the final computation stage. In the pre-computation stage, generate and
propagate signals are computed for each pair of input bits A and B. The second stage
computes the carry corresponding to each bit; these operations are partitioned into
smaller pieces and performed in parallel. The group generate and propagate bits
computed in the first stage are used as intermediate signals in the carry-generation
network. The final computation stage, which is common to all adders of this family,
produces the sum of the input bits.
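The three stages can be sketched as a bit-level Python model of the prefix network. This is an illustrative software model assuming unsigned operands of a fixed width (8 bits here), not the hardware description:

```python
def kogge_stone_add(a, b, width=8):
    """Kogge-Stone parallel-prefix addition of two unsigned integers.
    Stage 1: per-bit generate/propagate.  Stage 2: log2(width) levels of
    the carry-generation network.  Stage 3: final sum bits."""
    # Stage 1: pre-computation of generate and propagate for each bit pair
    g = [(a >> i & 1) & (b >> i & 1) for i in range(width)]   # generate
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(width)]   # propagate
    # Stage 2: carry-generation network; the combination span doubles
    # each level, so all prefixes are ready after log2(width) levels
    G, P = g[:], p[:]
    span = 1
    while span < width:
        G = [G[i] | (P[i] & G[i - span]) if i >= span else G[i]
             for i in range(width)]
        P = [P[i] & P[i - span] if i >= span else P[i]
             for i in range(width)]
        span *= 2
    # Stage 3: final computation; carry into bit i is the group
    # generate over bits 0..i-1, and sum(i) = p(i) XOR carry_in(i)
    carry_in = [0] + G[:-1]
    s = 0
    for i in range(width):
        s |= (p[i] ^ carry_in[i]) << i
    return s   # result modulo 2**width

print(kogge_stone_add(13, 29))   # 42
```

The point of the structure is that every level runs in parallel, so the carry chain resolves in O(log width) logic depth instead of the O(width) depth of a ripple adder.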

Fixed FIR filter architecture

The architecture of the block FIR filter for fixed applications is shown in Fig.6. For a
fixed FIR filter implementation the CSU is not necessary, since the filter coefficients are
fixed. Similarly, IPUs are not used, because the multiplications are performed by
Multiple Constant Multiplication (MCM) units to reduce the complexity of the
architecture.

Fig.6. Fixed FIR filter using MCM.


The MCM-based method, shown in Fig.7, is efficient when a given input variable is
multiplied by a number of fixed constants using the shift-and-add method. It can be
implemented using adders/subtractors and shifters. Initially, the constants are
expressed in binary form. Then, for every non-zero digit in the binary representation of a
constant, the input variable is shifted according to the digit position, and the shifted
copies are added up to obtain the result. MCM is employed in many applications, such
as error-correcting codes, frequency multiplication, and Multiple-Input Multiple-Output
(MIMO) systems.

Fig.7. Block Diagram of MCM.

The MCM-based technique for a fixed FIR filter with block size L=4 exploits the
symmetry in the input matrix S_k^0 to perform vertical and horizontal common-
subexpression elimination and to reduce the number of shift-and-add operations in the
MCM units. MCM can be applied along both the vertical and horizontal order of the
coefficient matrix. The MCM-based design takes six input samples, corresponding to six
MCM blocks. All MCM blocks compute the required product terms using the shift-and-
add method. The outputs of all MCM blocks are fed to the adder network to produce the
inner-product terms. In the Pipelined Adder Unit (PAU), an array of KSAs is used to add
the inner-product values and produce a block of filter outputs.
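A minimal Python sketch of the shift-and-add idea follows. The constants 3, 6, and 13 are hypothetical example coefficients, and the caching of shifted terms stands in for the common-subexpression sharing described above:

```python
def mcm_products(x, constants):
    """Multiple Constant Multiplication: multiply one input sample x by
    several fixed constants using only shifts and adds.  For every
    non-zero bit of a constant, the input is shifted by that bit position
    and the shifted copies are accumulated; shifted terms are cached so
    copies shared between constants are computed only once."""
    shifted = {}                 # shift amount -> x << shift (shared terms)
    results = {}
    for c in constants:
        acc = 0
        for pos in range(c.bit_length()):
            if (c >> pos) & 1:   # non-zero digit at this bit position
                if pos not in shifted:
                    shifted[pos] = x << pos
                acc += shifted[pos]   # adder in the shift-and-add network
        results[c] = acc
    return results

# e.g. multiply x = 5 by the fixed coefficients 3, 6, and 13:
# 3 = 0b0011, 6 = 0b0110, 13 = 0b1101
print(mcm_products(5, [3, 6, 13]))   # {3: 15, 6: 30, 13: 65}
```

Here the shifted copies x<<1 and x<<2 are each reused by two constants, which is the kind of sharing the hardware MCM blocks are built to exploit.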

CHAPTER 5
RESULTS AND DISCUSSION

RESULTS:
Conclusion

FIR filters are extensively used in wired and wireless communications, video and audio
processing, and handheld devices, and are preferred because of their stability and
linear-phase properties. This paper presents a novel design methodology for optimized
FIR digital filters, from the software level to the hardware level.

The main goal is to encompass all the aspects involved in the efficient hardware
realization of filters, i.e., the design method, the selection of the structure, and the
algorithm used to reduce the arithmetic complexity of FIR filtering.

Theoretical and experimental results suggest that the Kaiser window gives the minimum
mainlobe width (0.11719) and a sharp cutoff, which means this window has a small
transition width. The study also showed that the direct-form structure is simpler, more
robust against quantization errors, and lower in cost, and offers better performance than
other common structures.

The proposed optimized filter implementation, using an appropriate quantization
scheme, reduces arithmetic complexity, area, and hardware resources. Comparison
revealed that the optimized filter implementation requires 42% less hardware resources
than the normal filter implementation.
References

[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms
and Applications. Upper Saddle River, NJ, USA: Prentice-Hall, 1996.

[2] J. Mitola, Software Radio Architecture: Object-Oriented Approaches to Wireless
Systems Engineering. New York, NY, USA: Wiley, 2000.

[3] P. K. Meher, S. Chandrasekaran, and A. Amira, “FPGA realization of FIR filters by
efficient and flexible systolization using distributed arithmetic,” IEEE Trans. Signal
Process., vol. 56, no. 7, pp. 3009–3017, Jul. 2008.

[4] P. K. Meher, “New approach to look-up-table design and memory-based realization
of FIR digital filter,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp.
592–603, Mar. 2010.

[5] J. Park, W. Jeong, H. Mahmoodi-Meimand, Y. Wang, H. Choo, and K. Roy,
“Computation sharing programmable FIR filter for low-power and high-performance
applications,” IEEE J. Solid-State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004.

[6] K.-H. Chen and T.-D. Chiueh, “A low-power digit-based reconfigurable FIR filter,”
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 8, pp. 617–621, Aug. 2006.

[7] R. Mahesh and A. P. Vinod, “New reconfigurable architectures for implementing FIR
filters with low complexity,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.,
vol. 29, no. 2, pp. 275–288, Feb. 2010.

[8] B. K. Mohanty and P. K. Meher, “A high-performance energy-efficient architecture for
FIR adaptive filter based on new distributed arithmetic formulation of block LMS
algorithm,” IEEE Trans. Signal Process., vol. 61, no. 4, pp. 921–932, Feb. 2013.

[9] B. K. Mohanty, P. K. Meher, S. Al-Maadeed, and A. Amira, “Memory footprint
reduction for power-efficient realization of 2-D finite-impulse response filters,” IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 1, pp. 120–133, Jan. 2014.

[10] S. Y. Park and P. K. Meher, “Efficient FPGA and ASIC realizations of a DA-based
reconfigurable FIR digital filter,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 7,
pp. 511–515, Jul. 2014.
