Overview: Digital Signal Processors (Expert Talk)

3/20/2014

DSP Algorithm Requirements & Processor Architecture

Prof. Hardip Shah
Associate Professor
Department of Electronics & Communication Engineering
Dharmsinh Desai University, Nadiad.

DSP algorithm requirements

Large number of samples being continuously fed to the system (samples or blocks).
Repetitive operations:
The same operation being applied to different sets of samples
Stream processing or block processing
Vector and matrix operations

Prof. Hardip Shah, EC Dept. DDU

Real time operations
ASAP but within specified time
Process the present sample before the next sample arrives
Higher processing speed is needed as the sampling frequency increases
Example
Processor clocked at 120 MHz and can perform 120 MIPS
Sampling rate = 48 kHz (Digital Audio Tape - DAT): number of instructions per sample = (120 × 10^6)/(48 × 10^3) = 2500.
Sampling rate = 8 kHz (voice-band, telephony): number of instructions per sample = 15000.
Sampling rate = 75 MHz (CIF 360x288 video at 30 frames per second): number of instructions per sample = 1.6.
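The per-sample budget in the example above is the instruction rate divided by the sampling rate. The following is a minimal Python sketch of that arithmetic, assuming (as the example does) that the 120 MIPS figure is sustained; the function name `instructions_per_sample` and the dictionary labels are illustrative and not part of the talk.

```python
def instructions_per_sample(mips: float, sampling_rate_hz: float) -> float:
    """Return how many instructions fit between two consecutive samples."""
    return (mips * 1e6) / sampling_rate_hz


# Sampling rates quoted in the example above, for a 120 MIPS processor.
rates_hz = {
    "DAT audio, 48 kHz": 48e3,
    "Telephony, 8 kHz": 8e3,
    "CIF video, 75 MHz": 75e6,
}

budgets = {name: instructions_per_sample(120, fs) for name, fs in rates_hz.items()}

for name, budget in budgets.items():
    print(f"{name}: {budget:g} instructions per sample")
```

The budget shrinks in direct proportion to the sampling rate, which is why the video case leaves fewer than two instructions per sample.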

Multiply
Multiply and Sum: A = B*C + D
Filtering
$$y(n) = \sum_{i=0}^{M-1} B_i \, x(n-i) + \sum_{i=1}^{N-1} A_i \, y(n-i)$$
Convolution
$$y(n) = \sum_{m=0}^{\infty} h(m) \, x(n-m)$$
FFT
$$y(k) = \sum_{n=0}^{N-1} x(n) \exp\left(-\frac{j 2 \pi k n}{N}\right)$$
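Each of the sums above reduces to a chain of multiply-and-sum steps. As a rough illustration, the filtering equation can be evaluated directly in Python. This is a sketch under stated assumptions, not the talk's implementation: the function name `difference_equation` is hypothetical, the coefficient lists `b` and `a` stand for B_i and A_i, and samples before n = 0 are taken as zero.

```python
def difference_equation(b, a, x):
    """Evaluate y(n) = sum_i b[i]*x(n-i) + sum_{i>=1} a[i]*y(n-i).

    a[0] is a placeholder so that a[i] lines up with y(n-i); it is never read.
    Samples before n = 0 are assumed to be zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(len(b)):          # feed-forward taps
            if n - i >= 0:
                acc += b[i] * x[n - i]   # one multiply-accumulate
        for i in range(1, len(a)):       # feedback taps
            if n - i >= 0:
                acc += a[i] * y[n - i]   # one multiply-accumulate
        y.append(acc)
    return y


# Two-tap moving average (no feedback) and a one-pole decaying response.
moving_average = difference_equation([0.5, 0.5], [0.0], [2.0, 4.0, 6.0])
decaying = difference_equation([1.0], [0.0, 0.5], [1.0, 0.0, 0.0])
```

Every pass through an inner loop is one multiply followed by one addition into `acc`, which is the operation the rest of the talk keeps returning to.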
Microprocessors/controllers

Micro Processor
General Purpose Processor (GPP)
Micro Controller
GPP + Peripherals
Digital Signal Processor
GPP + Peripherals + Math Co-processor
Architecture optimized for signal processing jobs requiring extensive arithmetic operations in the smallest number of cycles.
Specifically designed to perform fast DSP operations (e.g., Fast Fourier Transforms, inner products, Multiply & Accumulate)

Limitations of GPP

Small dynamic range
E.g., for a 16-bit processor the dynamic range is 32767 to -32768. Such a small dynamic range can easily create overflows. For example, 200 × 350 = 70000, which is an overflow!
To solve this problem, GPPs provide the result of a 16-bit multiplication in two 16-bit registers.
Digital signal processing algorithms are multiplication and addition intensive.
Overflow can arise from cumulative multiplications and from additions after multiplication.
An overflow can have serious consequences (e.g., unintentionally clipping a large signal).
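The 200 × 350 case can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, and the saturating variant is a common alternative to wraparound that the talk does not describe. It shows what a 16-bit two's-complement register would hold, and how the full product fits into a high and a low 16-bit register.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_int16(value: int) -> int:
    """Keep only the low 16 bits, as a two's-complement register would."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def saturate_int16(value: int) -> int:
    """Clamp to the 16-bit range instead of wrapping (assumed alternative)."""
    return max(INT16_MIN, min(INT16_MAX, value))


def split_32bit(value: int):
    """Split a 32-bit product into (high, low) 16-bit register contents."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


product = 200 * 350                  # 70000, outside the 16-bit range
wrapped = wrap_int16(product)        # what a single 16-bit register keeps
saturated = saturate_int16(product)  # clipped to the largest positive value
high, low = split_32bit(product)     # the two-register form of the result
```

Recombining the pair as `(high << 16) + low` gives back 70000, which is why the two-register result avoids the overflow for a single multiplication but not for a long chain of accumulations.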

Architectural Differences

Memory access
Harvard Architecture
Pipelining
Number Representation
Special Instructions
MAC unit
Extended Parallelism - VLIW

Memory Access

General Purpose Processors
Common memory for data and program
Von Neumann Architecture
Limited bus/memory bandwidth

[Figure: GPP Data Path Only, showing Memory, Memory Data Bus, Register 1, Register 2 and ALU]
Harvard Architecture

Program and data memory in separate spaces
Full overlap of instruction fetch and decode
Modified Harvard architecture
Additionally, communication between the two memory spaces is permissible

[Figure: Harvard data path, showing Program Memory with its Program Memory Data Bus, Data Memory with its Data Memory Data Bus, two Multiplexers, ALU and Accumulator]
Pipelining

Three-stage pipelining
More stages are possible by using a number of registers:
A prefetch counter holds the address of the next instruction to be fetched,
an instruction register (IR) holds the instruction to be executed,
a queue IR stores the instructions to be executed if the current instruction is still executing,
the Program Counter contains the address of the next instruction to execute
Reduces the average execution time per instruction by exploiting parallelism

Number Representation

Fixed point representation
Number range limited (or scaled) to +1 to -1
Q format
An implied binary point to represent binary fractions
Floating point representation
M × 2^E
Larger dynamic range
Speed may reduce
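To make the implied binary point concrete, here is a small Python sketch of one particular Q format. It assumes Q15 (one sign bit and 15 fractional bits in a 16-bit word), which is a common choice but is not named in the talk; the helper names are hypothetical.

```python
Q15_ONE = 1 << 15  # 32768 steps per unit: binary point right after the sign bit


def to_q15(x: float) -> int:
    """Convert a real number in [-1, 1) to a 16-bit Q15 integer, saturating."""
    return max(-Q15_ONE, min(Q15_ONE - 1, round(x * Q15_ONE)))


def from_q15(q: int) -> float:
    """Interpret a Q15 integer as a fraction."""
    return q / Q15_ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers: the Q30 product is shifted back to Q15."""
    return max(-Q15_ONE, min(Q15_ONE - 1, (a * b) >> 15))


half = to_q15(0.5)
quarter = q15_mul(half, half)
```

Because both operands are below 1 in magnitude, their product is too, so scaling the number range to ±1 keeps a multiplication from overflowing the word; only the corner case (-1) × (-1) needs the saturation shown in `q15_mul`.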

Fixed Point Vs Floating Point
Floating point representation is similar to scientific notation.
The most common format is ANSI/IEEE Std. 754-1985.
It defines 32-bit numbers, called single precision, as well as 64-bit numbers, called double precision.
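The scientific-notation analogy can be seen by pulling a single-precision number apart into its fields. The sketch below uses Python's standard `struct` module; the function names are illustrative, and the reconstruction formula covers normalized numbers only (zeros, denormals, infinities and NaNs are left out).

```python
import struct


def float32_fields(x: float):
    """Return (sign, biased exponent, fraction) of x as an IEEE 754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def float32_value(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a normalized single from its fields.

    value = (-1)^sign * (1 + fraction / 2^23) * 2^(exponent - 127)
    """
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


fields = float32_fields(-2.5)       # -2.5 = -1.25 * 2^1
rebuilt = float32_value(*fields)

single_bytes = struct.calcsize(">f")  # storage size of single precision
double_bytes = struct.calcsize(">d")  # storage size of double precision
```

The 8-bit exponent field is what gives floating point its large dynamic range, at the cost of extra work on every operation to align and normalize the mantissa.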
Floating Point Applications
Modems
Digital Subscriber Line (DSL)
Wireless Basestations
Central Office Switches
Private Branch Exchange (PBX)
Digital Imaging
3D Graphics
Speech Recognition
Voice over IP

Fixed Point Applications
Portable Products
2G, 2.5G and 3G Cell Phones
Digital Audio Players
Digital Still Cameras
Electronic Books
Voice Recognition
GPS Receivers
Headsets
Biometrics
Fingerprint Recognition
MAC unit in DSP-p

Basic operation in DSP: multiplications and additions
Harvard Architecture allows multiple memory reads

[Figure: MAC unit with X Register, Y Register, multiplier, P Register and R Register]

MAC using General Purpose Processor (GPP)

[Figure: R0 and R1 point to the operand arrays in memory (values shown include 11, 12, 3), R2 points to the result (44)]

        Clr   A            ; Clear accumulator A
        Clr   B            ; Clear accumulator B
Loop    Mov   *R0, Y0      ; Move data from memory location 1 to register Y0
        Mov   *R1, X0      ; Move data from memory location 2 to register X0
        Mpy   X0, Y0, A    ; X0 * Y0 -> A
        Add   A, B         ; A + B -> B
        Inc   R0           ; R0 + 1 -> R0
        Inc   R1           ; R1 + 1 -> R1
        Dec   N            ; Decrement N (initially equal to 3)
        Tst   N            ; Test the value of N
        Jnz   Loop         ; If different from zero, loop again
        Mov   B, *R2       ; Move result to memory
MAC using DSP-p

Harvard Architecture allows multiple memory reads

[Figure: R0 and R1 point to the operand arrays in memory (values shown include 11, 12, 3), R2 points to the result (44)]

        Clr   A                    ; Clear accumulator A
        Rep   N                    ; Repeat the next instruction N times
        MAC   *(R0)+, *(R1)+, A    ; Fetch the two memory locations pointed to by R0 and R1,
                                   ; multiply them together and add the result to A;
                                   ; the final result is stored back in A
        Mov   A, *R2               ; Move result to memory

HARDWARE LOOPING

DSP algorithms involve repetitive computations
Hardware support for efficient looping.
There is a loop or repeat instruction, which allows loops to be implemented without spending any extra clock cycles on testing and updating the loop counter, or on jumping back to the start of the loop.
This is an advantage over GPPs, where each loop iteration needs a test-and-branch operation that costs at least one more clock cycle.
Also, since nested loops are very common, DSPs provide hardware support for several levels of nested loops.
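The saving from the repeat instruction can be estimated by counting executed instructions for the GPP-style loop and for the repeat-plus-MAC sequence. The Python model below is only a sketch: it counts instructions rather than clock cycles, the function names are hypothetical, and the second operand array [1, 2, 3] is an assumption chosen so that the result matches the 44 visible in the figure.

```python
def mac_on_gpp(x, y):
    """Model the GPP loop; returns (result, instructions executed)."""
    b = 0
    executed = 2                 # Clr A, Clr B
    for xi, yi in zip(x, y):
        a = xi * yi              # Mov, Mov, Mpy
        b += a                   # Add
        executed += 9            # the 4 above + Inc, Inc, Dec, Tst, Jnz
    executed += 1                # Mov B, *R2
    return b, executed


def mac_on_dsp(x, y):
    """Model the Rep + MAC sequence; returns (result, instructions executed)."""
    a = 0
    executed = 2                 # Clr A, Rep N
    for xi, yi in zip(x, y):
        a += xi * yi             # one MAC with post-incremented pointers
        executed += 1
    executed += 1                # Mov A, *R2
    return a, executed


x = [11, 12, 3]                  # operands addressed through R0
y = [1, 2, 3]                    # operands addressed through R1 (assumed values)
gpp_result, gpp_count = mac_on_gpp(x, y)
dsp_result, dsp_count = mac_on_dsp(x, y)
```

In this model the GPP version costs 9 instructions per element against 1 for the DSP version, so the gap widens linearly with the vector length.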

Control Unit Architecture

Digital Signal Processors (DSP-p) are designed for real-time calculation. A fixed sampling rate makes a regular instruction cycle necessary.
Such regular instruction cycles are achieved in RISC (Reduced Instruction Set Computer) microprocessors by restricting the instruction set. So, DSP-p should use RISC.
Basic RISC may be too slow for DSP-specific complex operations (FFT etc.).
In a DSP these are carried out in hardware.

Special Instructions

Instructions that support basic DSP operations
Instructions that reduce the overhead in loops
Application-oriented instructions
Benefits
More compact code: less space in memory
Increased speed of execution
Examples

2nd gen. TMS320 uses LTD and MPY instructions
Permits simultaneous loading of data into the temporary register for the multiplier, data shifting (delay) and accumulation of the product
Repeat
Bit Reverse Addressing

Additional features

Replication
More than one ALU, memory or multiplier unit
On-chip memory/cache
On-chip data RAM and ROMs
Extended Parallelism

SIMD - Single Instruction Multiple Data
Increased number of operations performed per instruction
Multiple data paths and multiple execution units
VLIW
Increased number of instructions per cycle
VLIW is a concatenation of several short instructions
Requires several execution units

TMS320C67x CPU Core

[Figure: C67x Floating-Point CPU Core, showing Program Fetch, Instruction Dispatch, Instruction Decode, Data Path 1 (A Register File; units L1, S1, M1, D1), Data Path 2 (B Register File; units D2, M2, S2, L2), Control Registers, Control Logic, Test, Emulation and Interrupts. Callouts: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit, Floating-Point Capabilities]
Very Large Instruction Word (VLIW)

VLIW
A CPU architecture that reads a group of instructions and executes them at the same time. For example, the group (word) might contain four instructions, and the compiler ensures that those four instructions are not dependent on each other so they can be executed simultaneously. Otherwise it places no-ops in the word where necessary.
VLIW architectures execute multiple instructions per cycle and use simple, regular instruction sets
More parallelism, higher performance
Multiple independent instructions per cycle, packed into a single large "instruction word" or "packet"

VLIW Simplified Architecture

Example
[Figure: Program Memory delivers 256 bits consisting of 8 instructions, each instruction 32 bits, to eight Execution Units, each unit executing one instruction]
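The compiler's job of filling an instruction word with independent instructions and padding the rest with no-ops can be sketched in a few lines of Python. This is a deliberately simplified, in-order greedy packer and not how any real VLIW compiler schedules code: the `Instr` type, the `pack_vliw` function and the register names are all hypothetical, and only two kinds of dependence are checked (reading or rewriting a register written earlier in the same packet).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str      # label used in the packet listing
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction


def pack_vliw(instrs, width=8):
    """Greedily pack instructions, in order, into packets of `width` slots.

    A new packet is started when the next instruction depends on a result
    produced inside the current packet, or when the packet is full.
    Unused slots are filled with "NOP".
    """
    packets, current, written = [], [], set()
    for ins in instrs:
        depends = ins.dest in written or any(s in written for s in ins.srcs)
        if depends or len(current) == width:
            packets.append(current + ["NOP"] * (width - len(current)))
            current, written = [], set()
        current.append(ins.name)
        written.add(ins.dest)
    if current:
        packets.append(current + ["NOP"] * (width - len(current)))
    return packets


program = [
    Instr("mul1", "r4", ("r0", "r1")),
    Instr("mul2", "r5", ("r2", "r3")),
    Instr("add", "r6", ("r4", "r5")),   # needs both products
]
packets4 = pack_vliw(program, width=4)
packets8 = pack_vliw(program)           # 8 slots, as in the 256-bit example
```

The two multiplies share a packet because neither uses the other's result, while the add has to wait for the next packet; the remaining slots carry no-ops, which is the code-size cost of this style of parallelism.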
Types of DSP

Low End Fixed Point
TMS320C2XX, ADSP21XX, DSP56XXX
High End Fixed Point
TMS320C55XX, DSP16XXX, ADSP215XX, DSP56800
MSC8101 - StarPro2000 (using SC140 from Starcore)
Floating Point
TMS320C3X, C67XX, ADSP210XX, DSP96000, DSP32XX

TMS320C67x DSP Block Diagram

[Figure: TMS320C67x DSP block diagram. Program Cache/Program Memory (32-bit address, 256-bit data, 512K RAM); Data Memory (32-bit address; 8-, 16-, 32-bit data; 512K RAM); C67x Floating-Point CPU Core (Program Fetch, Instruction Dispatch, Instruction Decode; Data Path 1 with A Register File and units L1, S1, M1, D1; Data Path 2 with B Register File and units D2, M2, S2, L2; Control Registers, Control Logic, Test, Emulation, Interrupts); peripherals: Power Down, Host Port Interface, 4 Channel DMA, External Memory Interface, 2 Timers, 2 Multichannel buffered serial ports (T1/E1)]
Reference
E. C. Ifeachor and B. W. Jervis, Digital Signal Processing: A Practical Approach, 2nd Edition, Pearson
TI reference manual/data sheets

Thank You
