
3/20/2014

DSP algorithm requirements & Processor Architecture

Prof. Hardip Shah
Associate Professor
Department of Electronics & Communication Engineering
Dharmsinh Desai University, Nadiad.

DSP algorithm requirements

Large number of samples being continuously fed to the system (samples or blocks).
Repetitive Operations:
The same operation being applied to different sets of samples
Stream processing or Block processing
Vector and Matrix Operations

Prof. Hardip Shah, EC Dept. DDU

DSP algorithm requirements

Real time operations
ASAP but within specified time
Process the present sample before arrival of the next sample
High speed processing with increase in sampling frequency
Example
Processor clocked at 120 MHz and can perform 120 MIPS
Sampling rate = 48 kHz (Digital Audio Tape - DAT): number of instructions per sample = (120 x 10^6)/(48 x 10^3) = 2500.
Sampling rate = 8 kHz (voice-band, telephony): number of instructions per sample = 15000.
Sampling rate = 75 MHz (CIF 360x288 video at 30 frames per second): number of instructions per sample = 1.6.
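The instruction budget in the example above is simply the instruction rate divided by the sampling rate. A minimal sketch of that arithmetic follows; the function name `instructions_per_sample` is an illustrative choice, not something from the slides.

```python
def instructions_per_sample(mips: float, sample_rate_hz: float) -> float:
    """Number of instructions the processor can run between two samples."""
    return (mips * 1e6) / sample_rate_hz


# The three cases from the slide, all for a 120 MIPS processor.
for label, fs in [("DAT audio, 48 kHz", 48e3),
                  ("telephony, 8 kHz", 8e3),
                  ("video, 75 MHz", 75e6)]:
    print(f"{label}: {instructions_per_sample(120, fs):.1f} instructions per sample")
```

The smaller this budget, the more each instruction has to accomplish, which is the motivation for the architectural features discussed later.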

DSP algorithm requirements

Multiply
Multiply and Sum A=B*C+D
Filtering

$$y(n) = \sum_{i=0}^{M-1} B_i \, x(n-i) + \sum_{i=1}^{N-1} A_i \, y(n-i)$$

Convolution

$$y(n) = \sum_{m=0}^{\infty} h(m) \, x(n-m)$$

FFT

$$y(k) = \sum_{n=0}^{N-1} x(n) \exp\left(-\frac{j 2 \pi k n}{N}\right)$$
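All three formulas above reduce to repeated multiply-and-sum steps. As an illustration, here is a minimal sketch of the filtering equation computed sample by sample. The function name `difference_equation` and the convention that `a[0]` is unused (the feedback sum starts at i = 1) are assumptions made for this example.

```python
def difference_equation(x, b, a):
    """Evaluate y(n) = sum_i b[i]*x(n-i) + sum_{i>=1} a[i]*y(n-i).

    x: input samples, b: feedforward coefficients B_i,
    a: feedback coefficients A_i (a[0] is ignored).
    Samples before n = 0 are taken as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(len(b)):            # feedforward multiply-accumulate
            if n - i >= 0:
                acc += b[i] * x[n - i]
        for i in range(1, len(a)):         # feedback multiply-accumulate
            if n - i >= 0:
                acc += a[i] * y[n - i]
        y.append(acc)
    return y


# Impulse response of y(n) = x(n) + 0.5*y(n-1)
print(difference_equation([1, 0, 0, 0], [1.0], [0.0, 0.5]))
```

Every inner-loop step is one multiply followed by one add into an accumulator, which is the operation DSP hardware is built around.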


Microprocessors/controllers

Micro Processor
General Purpose Processor (GPP)
Micro Controller
GPP + Peripherals
Digital Signal Processor
GPP + Peripherals + Math Co-processor
Architecture optimized for signal processing jobs requiring extensive arithmetic operations in the smallest number of cycles.
Specifically designed to perform fast DSP operations (e.g., Fast Fourier Transforms, inner products, Multiply & Accumulate)

Limitations of GPP

Small dynamic range
E.g. for a 16-bit processor the signed range is -32768 to 32767. Such a small dynamic range can easily create overflows. For example, 200 x 350 = 70000, which is an overflow!
To solve this problem, GPPs provide the result of a 16-bit multiplication using two 16-bit registers.
Digital signal processing algorithms are multiplication and addition intensive.
Overflow due to cumulative multiplications and additions after multiplication
An overflow can have serious consequences (e.g., unintentionally clipping a large signal).
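The overflow in the 200 x 350 example can be reproduced in a few lines. This sketch shows what a 16-bit result would hold under two's-complement wraparound, under saturation, and when the product is split across two 16-bit registers. The helper names `wrap16`, `saturate16`, and `split_registers` are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap16(value: int) -> int:
    """Keep only the low 16 bits, interpreted as a signed number."""
    return (value + 32768) % 65536 - 32768


def saturate16(value: int) -> int:
    """Clip to the signed 16-bit range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, value))


def split_registers(value: int) -> tuple:
    """Split a non-negative 32-bit product into (high, low) 16-bit halves."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


product = 200 * 350
print(product, wrap16(product), saturate16(product), split_registers(product))
```

Wraparound silently yields a wrong value, saturation clips it, and only the two-register form preserves the full product.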


Architectural Differences

Memory access
Harvard Architecture
Pipelining
Number Representation
Special Instructions
MAC unit
Extended Parallelism - VLIW

Memory Access

General Purpose Processors
Common Memory for data and program
Von Neumann Architecture
Limited bus/memory bandwidth

[Figure: GPP data path only. Memory, Memory Data Bus, Register 1, Register 2, ALU.]


Harvard Architecture

Program and data memory in separate spaces
Full overlap of instruction fetch and decode
Modified Harvard architecture
Additionally, communication between the two memory spaces is permissible

[Figure: Harvard architecture. Program Memory on the Program Memory Data Bus, Data Memory on the Data Memory Data Bus, two Multiplexers, ALU, Accumulator.]


Pipelining

Three stage pipelining
More stages by using a number of registers
A prefetch counter holds the address of the next instruction to be fetched,
an IR holds the instruction to be executed,
a queue IR stores the instructions to be executed if the current instruction is still executing,
the Program Counter contains the address of the next instruction to execute
Reduces the average execution time per instruction by exploiting parallelism

Number Representation

Fixed point representation
Number range limited (or scaled) to +1 to -1
Q format
An implied binary point to represent binary fractions
Floating point representation
M x 2^E
Larger dynamic range
Speed may reduce
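To make the Q format concrete, the sketch below uses Q15, a common 16-bit choice with 1 sign bit and 15 fractional bits. The slides do not name a specific Q format, so Q15 is an assumption here, as are the helper names.

```python
Q15_SCALE = 1 << 15          # implied binary point after the sign bit


def float_to_q15(x: float) -> int:
    """Scale a fraction in [-1, +1) to a signed 16-bit integer, saturating."""
    n = int(round(x * Q15_SCALE))
    return max(-32768, min(32767, n))


def q15_to_float(n: int) -> float:
    """Interpret a signed 16-bit integer as a Q15 fraction."""
    return n / Q15_SCALE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers: the 32-bit product is shifted back by 15."""
    return max(-32768, min(32767, (a * b) >> 15))


half = float_to_q15(0.5)
print(half, q15_to_float(q15_mul(half, half)))
```

Because every value is a fraction with magnitude at most 1, a product of two values never grows beyond that range, which is why the scaled representation suits multiply-heavy algorithms. The one exception, -1 times -1, is handled by saturation in this sketch.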


Fixed Point Vs Floating Point

Floating Point Applications
Modems
Digital Subscriber Line (DSL)
Wireless Basestations
Central Office Switches
Private Branch Exchange (PBX)
Digital Imaging
3D Graphics
Speech Recognition
Voice over IP

Fixed Point Applications
Portable Products
2G, 2.5G and 3G Cell Phones
Digital Audio Players
Digital Still Cameras
Electronic Books
Voice Recognition
GPS Receivers
Headsets
Biometrics
Fingerprint Recognition

Number Representation

Floating point representation is similar to scientific notation.
The most common format is ANSI/IEEE Std. 754-1985.
It defines 32-bit numbers, called single precision, as well as 64-bit numbers, called double precision.
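As a small illustration of the IEEE 754 single-precision format mentioned in these slides, the sketch below splits a 32-bit float into its sign, exponent, and fraction fields using Python's standard `struct` module. The function names are illustrative, and the reconstruction formula shown applies to normalized numbers only.

```python
import struct


def decode_float32(x: float) -> tuple:
    """Return (sign, biased exponent, fraction) of x stored as IEEE 754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 bits of mantissa
    return sign, exponent, fraction


def rebuild_normalized(sign: int, exponent: int, fraction: int) -> float:
    """M x 2^E for a normalized number: M = 1 + fraction/2^23, E = exponent - 127."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)


print(decode_float32(-2.5), rebuild_normalized(*decode_float32(-2.5)))
```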


MAC unit in DSP-µp

Basic operation in DSP: multiplications and additions
Harvard Architecture allows multiple memory reads

[Figure: MAC unit block diagram with X Register, Y Register, multiplier, P Register, and R Register.]

MAC using General Purpose Processor (GPP)

[Figure: memory map. R0 and R1 point to the operand data in memory; R2 points to the result location (44).]

        Clr   A            ; Clear Accumulator A
        Clr   B            ; Clear Accumulator B
Loop    Mov   *R0, Y0      ; Move data from memory location 1 to register Y0
        Mov   *R1, X0      ; Move data from memory location 2 to register X0
        Mpy   X0, Y0, A    ; X0*Y0 -> A
        Add   A, B         ; A + B -> B
        Inc   R0           ; R0 + 1 -> R0
        Inc   R1           ; R1 + 1 -> R1
        Dec   N            ; Dec N (initially equals 3)
        Tst   N            ; Test for the value
        Jnz   Loop         ; If different from zero, loop again
        Mov   B, *R2       ; Move result to memory


MAC using DSP-µp

Harvard Architecture allows multiple memory reads

[Figure: memory map. R0 and R1 point to the operand data in memory; R2 points to the result location (44).]

        Clr   A                    ; Clear Accumulator A
        Rep   N                    ; Repeat the next instruction N times
        MAC   *(R0)+, *(R1)+, A    ; Fetch the two memory locations pointed to by R0 and R1,
                                   ; multiply them together and add the result to A; the final
                                   ; result is stored back in A
        Mov   A, *R2               ; Move result to memory

HARDWARE LOOPING

DSP algorithms: repetitive computations
Hardware support for efficient looping.
There is a loop or repeat instruction, which allows loops to be implemented without spending any extra clock cycles for testing and updating the loop counter, or for jumping back to the start of the loop.
This is an advantage compared to GPPs, where each loop has to have a test-and-branch operation which requires at least one more clock cycle.
Also, nested loops being very common, DSPs support hardware for several levels of nested loops.
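The saving from the MAC and repeat instructions can be estimated by counting executed instructions for the two listings in these slides. The sketch below mirrors each listing in Python. The operand values are illustrative (chosen so that the sum is 44), the count is of instructions rather than clock cycles, and the inputs are assumed to be non-empty and of equal length.

```python
def mac_gpp(x, h):
    """Mirror the GPP listing; return (result, instructions executed)."""
    executed = 2                # Clr A, Clr B
    b = 0
    r0 = r1 = 0                 # indices standing in for pointers R0 and R1
    n = len(x)                  # loop counter N
    while True:
        y0 = x[r0]              # Mov *R0, Y0
        x0 = h[r1]              # Mov *R1, X0
        a = x0 * y0             # Mpy X0, Y0, A
        b = a + b               # Add A, B
        r0 += 1                 # Inc R0
        r1 += 1                 # Inc R1
        n -= 1                  # Dec N
        executed += 9           # the seven above plus Tst N and Jnz Loop
        if n == 0:
            break
    executed += 1               # Mov B, *R2
    return b, executed


def mac_dsp(x, h):
    """Mirror the DSP listing; return (result, instructions executed)."""
    executed = 2                # Clr A, Rep N
    a = 0
    for xi, hi in zip(x, h):
        a += xi * hi            # MAC *(R0)+, *(R1)+, A
        executed += 1
    executed += 1               # Mov A, *R2
    return a, executed


print(mac_gpp([11, 12, 3], [1, 2, 3]), mac_dsp([11, 12, 3], [1, 2, 3]))
```

For three taps the GPP version executes 30 instructions against 6 for the DSP version, and the gap grows by 8 instructions for every additional tap.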


Control Unit Architecture

Digital Signal Processors (DSP-µp) are designed for real-time calculation. A fixed sampling rate leads to the necessity of having a regular instruction cycle.
Such regular instruction cycles are achieved in RISC (Reduced Instruction Set Computer) microprocessors by restricting the instruction set. So, DSP-µp should use RISC.
Basic RISC may be too slow for DSP-specific complex operations (FFT etc.).
In DSP this is carried out through hardware

Special Instructions

Instructions that support basic DSP operations
Instructions that reduce the overhead in loops
Application oriented instructions
Benefits
More compact code: less space in memory
Increased speed of execution


Special Instructions

Examples
2nd gen. TMS320 uses LTD and MPY instructions
Permits simultaneous loading of data into the temporary register for the multiplier, data shifting (delay) and accumulation of the product
Repeat
Bit Reverse Addressing

Additional features

Replication
More than one ALU, memory or multiplier unit
On Chip memory/Cache
On chip data RAM and ROMs
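Bit-reversed addressing, listed among the special addressing features, produces the sample ordering needed by radix-2 FFT algorithms. On a DSP the address generator does this in hardware; the sketch below only shows the index mapping in software, with `bit_reverse` as an illustrative helper name.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit of index into result
        index >>= 1
    return result


# Access order for an 8-point FFT (3 address bits)
print([bit_reverse(i, 3) for i in range(8)])
```

A GPP would need a loop like this for every access, whereas a DSP with bit-reversed addressing obtains the reordered address at no extra instruction cost.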

Extended Parallelism

SIMD - Single Instruction Multiple Data
Increased number of operations performed per instruction
Multiple data paths and multiple execution units
VLIW
Increase number of instructions per cycle
VLIW is a concatenation of several short instructions
Requires several execution units

TMS320C67x CPU Core

[Figure: C67x floating-point CPU core. Program Fetch, Instruction Dispatch, Instruction Decode; Data Path 1 with A Register File and units L1, S1, M1, D1; Data Path 2 with B Register File and units D2, M2, S2, L2; Control Registers, Control Logic, Test, Emulation, Interrupts. Unit types: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit. Floating-point capabilities.]


Very Large Instruction Word (VLIW)

VLIW: a CPU architecture that reads a group of instructions and executes them at the same time. For example, the group (word) might contain four instructions, and the compiler ensures that those four instructions are not dependent on each other so they can be executed simultaneously. Otherwise it places no-ops in the word as necessary.
VLIW architectures execute multiple instructions per cycle and use simple, regular instruction sets
More parallelism, higher performance
Multiple independent instructions per cycle, packed into a single large "instruction word" or "packet"

VLIW Simplified Architecture Example

[Figure: Program Memory delivers 256 bits consisting of 8 instructions, each instruction 32 bits, to 8 Execution Units, each unit executing one instruction.]
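The packing rule described in these slides (independent instructions share a word, no-ops fill the rest) can be sketched with a simple in-order packer. This is a toy model under stated assumptions: instructions are hypothetical `(name, dest, sources)` tuples, only dependences on registers written earlier in the same word are checked, and instructions are never reordered, which a real VLIW compiler would do to fill more slots.

```python
NOP = ("nop", None, ())


def pack_vliw(instructions, width=4):
    """Greedily pack instructions, in program order, into words of `width` slots."""
    words, current, written = [], [], set()
    for instr in instructions:
        _name, dest, sources = instr
        # Conflict: instr reads or overwrites a register produced in this word.
        conflict = dest in written or any(s in written for s in sources)
        if conflict or len(current) == width:
            words.append(current + [NOP] * (width - len(current)))
            current, written = [], set()
        current.append(instr)
        written.add(dest)
    if current:
        words.append(current + [NOP] * (width - len(current)))
    return words


program = [
    ("mul", "r1", ("a", "b")),
    ("mul", "r2", ("c", "d")),
    ("add", "r3", ("r1", "r2")),   # depends on both multiplies
    ("sub", "r4", ("a", "c")),
]
for word in pack_vliw(program):
    print([name for name, _dest, _src in word])
```

Here the two multiplies share the first word, and the dependent add forces a second word, with no-ops padding the unused slots in each.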

Types of DSP

Low End Fixed Point
TMS320C2XX, ADSP21XX, DSP56XXX
High End Fixed Point
TMS320C55XX, DSP16XXX, ADSP215XX, DSP56800
MSC8101 - StarPro2000 (using SC140 from Starcore)
Floating Point
TMS320C3X, C67XX, ADSP210XX, DSP96000, DSP32XX

TMS320C67x DSP Block Diagram

[Figure: TMS320C67x DSP block diagram. Program Cache/Program Memory (32-bit address, 256-bit data, 512K RAM); C67x floating-point CPU core (Program Fetch, Instruction Dispatch, Instruction Decode; Data Path 1 with A Register File and units L1, S1, M1, D1; Data Path 2 with B Register File and units D2, M2, S2, L2; Control Registers, Control Logic, Test, Emulation, Interrupts); Data Memory (32-bit address; 8-, 16-, 32-bit data; 512K RAM); peripherals: Power Down, Host Port Interface, 4 Channel DMA, External Memory Interface, 2 Timers, 2 Multichannel buffered serial ports (T1/E1).]

Reference
Digital Signal Processing: A Practical Approach, by E. C. Ifeachor and B. W. Jervis, 2nd Edition, Pearson
TI reference manuals/data sheets


Thank You
