
UNIT : 3 Architectures For Programmable DSP Devices

Architectures For Programmable DSP Devices


V. R. Gupta
Assistant Professor Department of Electronics & Telecomm. Engg. Y. C. College of Engineering, Nagpur.

Mr. Vikas R. Gupta, Assistant Professor, ET, YCCE, Nagpur

Outline
Digital Signal Processors (DSPs) vs. General Purpose Processors (GPPs)
Basic Architectural features
Multiplier and Multiplier Accumulator (MAC)
Modified Bus Structure and Memory Access Schemes in P-DSPs
Multiple Access Memory
Multiported Memory
VLIW architecture

Outline
Pipelining:
  1) Pipelining and Performance
  2) Pipeline Depth
  3) Interlocking
  4) Branching effects
  5) Interrupt effects
  6) Pipeline Programming models
Special Addressing Modes in P-DSPs
On-Chip Peripherals

Why do we need DSP processors?


Why not use a General Purpose Processor (GPP) such as a Pentium instead of a DSP processor?
What is the power consumption of a Pentium and a DSP processor?
What is the cost of a Pentium and a DSP processor?

Why do we need DSP processors?


Use a DSP processor when the following are required:
  Cost saving.
  Smaller size.
  Low power consumption.
  Processing of many high-frequency signals in real time.

Use a GPP when the following are required:
  Large memory.
  Advanced operating systems.

What are the typical DSP algorithms?


The Sum of Products (SOP) is the key element in most DSP algorithms:
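As a minimal sketch (the names `sum_of_products`, `a` and `x` are illustrative and not taken from the slides), a sum of products over N coefficient/sample pairs, Y = a1*x1 + a2*x2 + ... + aN*xN, can be written in Python with one multiply-accumulate step per term:

```python
def sum_of_products(a, x):
    """Return a[0]*x[0] + a[1]*x[1] + ... using one MAC step per term."""
    if len(a) != len(x):
        raise ValueError("a and x must have the same length")
    acc = 0  # plays the role of the accumulator
    for coeff, sample in zip(a, x):
        acc += coeff * sample  # one multiply-accumulate (MAC) step
    return acc


print(sum_of_products([1, 2, 3], [4, 5, 6]))
```

On a DSP each loop iteration would ideally map to a single hardware MAC operation; the Python loop only illustrates the arithmetic.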


Hardware vs. Microcode multiplication


DSP processors are optimized to perform multiplication and addition operations. Multiplication and addition are done in hardware and in one cycle.

Example: 4-bit multiply (unsigned).

Hardware (one cycle):

      1011
    x 1110
  --------
  10011010

Microcode (shift and add, five cycles):

      1011
    x 1110
  --------
      0000      Cycle 1
     1011.      Cycle 2
    1011..      Cycle 3
   1011...      Cycle 4
  --------
  10011010      Cycle 5
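The microcoded (shift-and-add) multiplication can be sketched in Python as follows. This is only an illustration of the arithmetic, with one partial product per loop pass; the function name `shift_add_multiply` and its `bits` parameter are assumptions, not part of any DSP instruction set.

```python
def shift_add_multiply(a, b, bits=4):
    """Multiply two unsigned integers by adding shifted partial products.

    Returns the product and the list of partial products, one per bit of b
    (least significant bit first).
    """
    product = 0
    partials = []
    for i in range(bits):
        bit = (b >> i) & 1                # i-th bit of the multiplier
        partial = (a << i) if bit else 0  # shifted multiplicand or zero
        partials.append(partial)
        product += partial                # accumulate the partial product
    return product, partials


product, partials = shift_add_multiply(0b1011, 0b1110)
print(format(product, "08b"), [format(p, "b") for p in partials])
```

A hardware multiplier forms and sums all partial products within a single cycle, which is why the hardware column of the example needs only one cycle.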

Parameters to consider when choosing a DSP processor

Parameter                          TMS320C6211 (@150MHz)                TMS320C6711 (@150MHz)
Arithmetic format                  32-bit                               32-bit
Extended floating point            N/A                                  64-bit
Extended arithmetic                40-bit                               40-bit
Performance (peak)                 1200 MIPS                            1200 MFLOPS
Number of hardware multipliers     2 (16 x 16-bit) with 32-bit result   2 (32 x 32-bit) with 32 or 64-bit result
Number of registers                32                                   32
Internal L1 program memory cache   32K                                  32K
Internal L1 data memory cache      32K                                  32K
Internal L2 cache                  512K                                 512K

C6711 Datasheet: \Links\TMS320C6711.pdf C6211 Datasheet: \Links\TMS320C6211.pdf


Parameters to consider when choosing a DSP processor

Parameter                                     TMS320C6211 (@150MHz)   TMS320C6711 (@150MHz)
I/O bandwidth: serial ports (number/speed)    2 x 75 Mbps             2 x 75 Mbps
DMA channels                                  16                      16
Multiprocessor support                        Not inherent            Not inherent
Supply voltage                                3.3V I/O, 1.8V Core     3.3V I/O, 1.8V Core
Power management                              Yes                     Yes
On-chip timers (number/width)                 2 x 32-bit              2 x 32-bit
Cost                                          US$ 21.54               US$ 21.54
Package                                       256 Pin BGA             256 Pin BGA
External memory interface controller          Yes                     Yes
JTAG                                          Yes                     Yes

Floating vs. Fixed point processors


Applications which require:
  High precision.
  Wide dynamic range.
  High signal-to-noise ratio.
  Ease of use.
need a floating point processor.

Drawbacks of floating point processors:
  Higher power consumption.
  Can be more expensive.
  Can be slower than fixed-point counterparts and larger in size.

Floating vs. Fixed point processors


It is the application that dictates which device and platform to use in order to achieve optimum performance at a low cost.

For educational purposes, use the floating-point device (C6711) as it can support both fixed and floating point operations.

General Purpose DSP vs. DSP in ASIC


Application Specific Integrated Circuits (ASICs) are semiconductors designed for dedicated functions. The advantages and disadvantages of using ASICs are listed below:

Advantages:
  High throughput
  Lower silicon area
  Lower power consumption
  Improved reliability
  Reduction in system noise
  Low overall system cost

Disadvantages:
  High investment cost
  Less flexibility
  Long time from design to market

Texas Instruments TMS320 family


Different families and sub-families exist to support different markets.

C2000: Lowest Cost
  Control Systems
  Motor Control
  Storage
  Digital Ctrl Systems

C5000: Efficiency (Best MIPS per Watt / Dollar / Size)
  Wireless phones
  Internet audio players
  Digital still cameras
  Modems
  Telephony
  VoIP

C6000: Performance & Best Ease-of-Use
  Multi Channel and Multi Function App's
  Comm Infrastructure
  Wireless Base-stations
  DSL
  Imaging
  Multi-media Servers
  Video

TMS320C64x: The C64x fixed-point DSPs offer the industry's highest level of performance to address the demands of the digital age. At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS with costs as low as $19.95. In addition to a high clock rate, C64x DSPs can do more work each cycle with built-in extensions. These extensions include new instructions to accelerate performance in key application areas such as digital communications infrastructure and video and image processing.

TMS320C62x: These first-generation fixed-point DSPs represent breakthrough technology that enables new equipment and energizes existing implementations for multi-channel, multi-function applications, such as wireless base stations, remote access servers (RAS), digital subscriber loop (xDSL) systems, personalized home security systems, advanced imaging/biometrics, industrial scanners, precision instrumentation and multichannel telephony systems.

TMS320C67x: For designers of high-precision applications, C67x floating-point DSPs offer the speed, precision, power savings and dynamic range to meet a wide variety of design needs. These dynamic DSPs are the ideal solution for demanding applications like audio, medical imaging, instrumentation and automotive.

C6000 Roadmap
Figure: C6000 roadmap, plotting performance against time, with object code software compatibility across generations.
  1st Generation: C6201, C6202, C6203, C6204, C6205, C6211, C6701, C6711, C6712, C6713
  2nd Generation: C6411, C6412, C6414, C6415, C6416, DM642
  Further along the roadmap: C64x DSP at 1.1 GHz, Floating Point, Multi-core
  C62x/C64x/DM642: Fixed Point. C67x: Floating Point.

DSPs Features
High speed DSP computations
Specialized instruction set
High performance repetitive numeric calculations
Fast & efficient memory accesses
Special mechanism for real-time I/O
Low power consumption
Low cost in comparison with GPPs

DSPs General Applications


Digital cellular phones
Satellite communications
Seismic analysis
Vehicle collision avoidance
Secure communications
Voice mail
Digital cameras
Navigation equipment
Modems (ISDN, cable, ...)
Audio production
Noise cancellation
Videoconferencing
Medical ultrasound
Music synthesis, effects
Radar
Voice over Internet
Motor control
Sonar

DSPs Applications
Speech and audio compression
Filtering
Modulation and demodulation
Error correction coding and decoding
Audio processing (e.g., surround sound, noise reduction, equalization, sample rate conversion, echo cancellation)
Signaling (e.g., DTMF detection)
Speech recognition
Signal synthesis (e.g., music, speech synthesis)

DSPs Characteristics
1. Data path & internal ALU architecture
2. Specialized instruction set
3. External memory architecture
4. Specialized addressing modes
5. Specialized execution control
6. Specialized peripherals for DSP

Data Path
DSPs:
  Performs all key arithmetic operations in 1 cycle.
  Hardware support for managing numeric fidelity:
    Shifters
    Guard bits
    Saturation

GPPs:
  Multiplication often takes >1 cycle.
  Shifts often take >1 cycle.
  Other operations (e.g. saturation, rounding) typically take multiple cycles.

DSPs Data Path Example

Figure: A representative conventional fixed-point DSP processor data path (from the Motorola DSP560xx, a 24-bit, fixed-point processor family).

Instruction Set
DSPs:
  Specialized, complex instructions
  Multiple operations per instruction (e.g. using VLIW)

GPPs:
  General-purpose instructions
  Typically only one operation per instruction

VLIW
Very long instruction word (VLIW) architectures are garnering increased attention for DSP applications.

Major features:
  Multiple independent operations per cycle
  Packed into a single large instruction or packet
  More regular, orthogonal, RISC-like operations
  Large, uniform register sets

Memory Architecture
DSPs:
  Harvard architecture
  2-4 memory accesses/cycle
  No caches; on-chip SRAM

GPPs:
  Von Neumann architecture
  Typically 1 access/cycle
  May use caches

Von Neumann Architecture


The Von Neumann memory architecture is common among microcontrollers. Since there is only one data bus, operands cannot be loaded while instructions are fetched, creating a bottleneck that slows the execution of DSP algorithms.

Harvard Architecture
The Harvard architecture is common to many DSP processors. The processor can simultaneously access the two memory banks using two independent sets of buses, allowing operands to be loaded while fetching instructions.

Addressing Modes
DSPs:
  Dedicated address generation units
  Specialized addressing modes, e.g.:
    Auto-increment
    Modulo (circular)
    Bit-reversed (for FFT)
  Good immediate data support

GPPs:
  Often, no separate address generation unit
  General-purpose addressing modes

Execution Control
Hardware support for fast looping
Fast interrupts for I/O handling
Real-time debugging support

DSPs classifications (1)


By arithmetic format
  Fixed-point
  Floating-point
By data width
  Typical fixed-point DSPs: 16-bit
  Typical floating-point DSPs: 32-bit
By memory organization
By multiprocessor support

DSPs classifications (2)


By speed
  Millions of instructions per second (MIPS)
  A basic operation (e.g. MAC)
  A basic algorithm (e.g. FFT, FIR or IIR filter)
By power consumption
  Operating voltage
  Sleep or idle mode
  Programmable clock dividers
  Peripheral control

DSPs Evolution
First generation (TI TMS32010)
Second generation (Motorola DSP56001, AT&T DSP16A, Analog Devices ADSP-2100, TI TMS320C50)
Third generation (Motorola DSP56301, TI TMS320C541, TI TMS320C80, Motorola MC68356)
Fourth generation (TI TMS320C6201, Intel Pentium MMX)

First Generation (1982)


16-bit fixed-point
Harvard architecture
Accumulator
Specialized instruction set
390 ns MAC time (228 ns today)

Second Generation (1987)


24-bit data, instructions
3 memory spaces (X, Y, P)
Parallel moves
Single- and multi-instruction hardware loops
Modulo addressing
75 ns MAC (21 ns today)

Third Generation (1995)


Enhanced conventional DSP architectures
3.0 or 3.3 volts
More on-chip memory
Application-specific function units in data path or as coprocessors
More sophisticated debugging and application development tools
DSP cores (Pine & Oak from DSP Group, cDSP from TI)
20 ns MAC (10 ns today)

Fourth Generation (1998)


Blazing clock speeds and superscalar architectures
VLIW-like architectures achieve top performance via high parallelism and increased clock speeds
3 ns MAC throughput
Expensive, power-hungry

DSPs Evolution Chart


Multiplier and Multiplier Accumulator (MAC)


The most common operation required in DSP applications is array multiplication, e.g. convolution and correlation.

An important requirement of array multipliers is that they process the signal in real time. This requires multiplication as well as accumulation to be carried out using hardware elements. Two approaches to solve this problem are:

1) Implement a dedicated MAC unit in hardware (Motorola DSP5600X)
2) Have the multiplier and accumulator separate (TI TMS320C5X)

Multiplier and Multiplier Accumulator (MAC)


In both approaches the MAC operation can be completed in one clock cycle. The presence of a hardware multiplier and/or MAC is one of the mandatory requirements of P-DSPs.

Figure: Implementation of convolver with single multiplier/adder. The input samples $x_n, x_{n-1}, x_{n-2}, \ldots, x_{n-M+3}, x_{n-M+2}, x_{n-M+1}$ and the coefficients $h_0, h_1, \ldots, h_{M-3}, h_{M-2}, h_{M-1}$ feed the multiplier, which is followed by an adder and a register.

Multiplier and Multiplier Accumulator (MAC)


The output $y_n$ at the $n$th sampling interval is obtained by multiplying the array

$$\mathbf{x}_n = [x_n, x_{n-1}, x_{n-2}, \ldots, x_{n-M+3}, x_{n-M+2}, x_{n-M+1}]$$

corresponding to the present and past $M-1$ samples of the input with the array

$$\mathbf{h} = [h_0, h_1, h_2, h_3, \ldots, h_{M-3}, h_{M-2}, h_{M-1}]$$

corresponding to the impulse response sequence, that is, $y_n = \sum_{k=0}^{M-1} h_k x_{n-k}$.

To obtain $y_{n+1}$, the input signal array $\mathbf{x}_{n+1}$ is multiplied with the array $\mathbf{h}$.

After obtaining the product $x_{n-M+1} h_{M-1}$, the element $x_{n-M}$ may be made equal to $x_{n-M+1}$. Similarly, after obtaining the product $x_{n-M+2} h_{M-2}$, the element $x_{n-M+1}$ may be made equal to $x_{n-M+2}$, and so on.

Multiplier and Multiplier Accumulator (MAC)


In P-DSPs this can be achieved by using a special instruction, MACD: multiply and accumulate with data move. For example, in the TMS320C5X:

    MACD pma, dma

This instruction multiplies the content of the program memory location with address pma by the content of the data memory location with address dma and stores the result in the product register. The content of the product register is added to the accumulator before the new product is stored. The content of dma is copied to the next location, whose address is dma+1.
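The combination of multiply, accumulate and data move can be sketched in Python as one FIR filter step. This is a behavioural illustration only, not a model of the TMS320C5X instruction timing; the function name `fir_step_with_data_move` and the layout of `delay_line` (index k holds the sample x[n-k], plus one spare slot at the end) are assumptions made for the example.

```python
def fir_step_with_data_move(delay_line, coeffs, new_sample):
    """Compute one FIR output and shift the delay line in the same pass.

    delay_line must have len(coeffs) + 1 entries; delay_line[k] holds the
    sample that is k steps old once new_sample has been stored at index 0.
    """
    taps = len(coeffs)
    if len(delay_line) != taps + 1:
        raise ValueError("delay_line needs len(coeffs) + 1 entries")
    delay_line[0] = new_sample
    acc = 0
    # Start from the oldest sample so that the copy to the next address
    # never overwrites a sample that has not been used yet.
    for k in range(taps - 1, -1, -1):
        acc += coeffs[k] * delay_line[k]   # multiply and accumulate
        delay_line[k + 1] = delay_line[k]  # data move: address -> address + 1
    return acc


coeffs = [1, 2, 3]
line = [0, 0, 0, 0]
outputs = [fir_step_with_data_move(line, coeffs, s) for s in [1, 0, 0, 0]]
print(outputs)
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick way to check that the data move keeps the delay line aligned.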

Modified Bus Structure in P-DSPs


DSP processors use special memory architectures, namely the Harvard architecture or a modified Von Neumann architecture, which allow fetching multiple data and/or instructions at the same time.

GPPs have used the Von Neumann architecture, in which there is one memory space connected to the processor core by one bus set consisting of an address bus and a data bus. The Von Neumann architecture is not good for DSP applications, as some DSP algorithms require more memory bandwidth.

Modified Bus Structure in P-DSPs


Note that the MAC operation with data move (i.e. the MACD instruction) requires 4 memory accesses per instruction cycle. The 4 memory accesses/clock period required for the MACD instruction are as follows:
1) Fetch the MACD instruction from the program memory.
2) Fetch one of the operands from the program memory.
3) Fetch the second operand from the data memory.
4) Write the content of the data memory with address dma into the location with address dma+1.

Modified Bus Structure in P-DSPs


Figure: Von Neumann Architecture. A processing unit and a control unit share a single memory that holds both data and program. Operands, results and instructions travel over one data bus, the control unit supplies the address, and opcode and status signals pass between the control unit and the processing unit.

Modified Bus Structure in P-DSPs


Figure: Harvard Architecture. The processing unit exchanges results/operands with a data memory, while the control unit fetches instructions from a separate program memory. Each memory has its own address lines, and opcode and status signals pass between the control unit and the processing unit.

Modified Bus Structure in P-DSPs


Figure: Modified Harvard Architecture. As in the Harvard architecture, the processing unit exchanges results/operands with a data memory, but the second memory is a program/data memory that can hold both instructions and data. Each memory has its own address lines, and opcode and status signals pass between the control unit and the processing unit.

Memory Access Schemes in P-DSPs


The number of memory accesses/clock period can also be increased by using a high speed memory that permits more than one memory access/clock period. For example, DARAM (dual access RAM) permits two memory accesses/clock period.

Multiple access RAM may be connected to the processing unit of the P-DSP by using the Harvard architecture.

Example: DARAM connected to a P-DSP with two independent data and address buses can be used to achieve four memory accesses/clock period.

Memory Access Schemes in P-DSPs


Multiported memory: adopted for increasing the number of accesses/clock period.

The major limitation of dual ported memory is its higher cost compared with two single port memories of the same capacity, since the number of pins and the chip area are increased.

Motorola DSP561XX processors have a single ported program memory and a dual ported data memory.

Figure: Block diagram of a dual ported memory (Address bus 1, Data bus 1, Address bus 2, Data bus 2).

VLIW Architecture


Pipelining
It is a technique for increasing the performance of a processor by breaking a sequence of operations into smaller pieces and executing these pieces in parallel when possible, thereby decreasing the overall time required to complete the set of operations. It represents the trade-off between efficiency and ease of use. Let's see a real-life example of pipelining.

Traditional Pipeline Concept


Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
  Washer takes 30 minutes
  Dryer takes 40 minutes
  Folder takes 20 minutes

Traditional Pipeline Concept


Figure: Sequential laundry timeline from 6 PM to midnight. Loads A to D each take 30, 40 and 20 minutes, one load after another.

Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

Traditional Pipeline Concept


Figure: Pipelined laundry timeline starting at 6 PM, with task order A, B, C, D. The schedule is 30 minutes of washing, four back-to-back 40-minute dryer cycles, and a final 20 minutes of folding.

Pipelined laundry takes 3.5 hours for 4 loads.

Traditional Pipeline Concept


Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously using different resources
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
Stall for dependences
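The laundry figures (6 hours sequential, 3.5 hours pipelined) can be reproduced with a small scheduling sketch in Python. The function names are illustrative, and the pipelined version assumes a finished load can wait between machines for as long as necessary.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage starts a load as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(stage_minutes)  # when each machine is next free
    for _ in range(loads):
        load_ready_at = 0
        for s, duration in enumerate(stage_minutes):
            start = max(load_ready_at, stage_free_at[s])
            stage_free_at[s] = start + duration
            load_ready_at = stage_free_at[s]
    return stage_free_at[-1]


stages = [30, 40, 20]  # washer, dryer, folder
print(sequential_minutes(stages, 4) / 60, pipelined_minutes(stages, 4) / 60)
```

The pipelined total is dominated by the 40-minute dryer, which illustrates that the pipeline rate is limited by the slowest stage.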

Pipelining
An instruction cycle can be divided into a number of microinstructions. Execution of each microinstruction is referred to as one phase of an instruction. For example, an instruction cycle requiring four microinstructions can be said to be in four phases as follows:
1) Fetch phase: The instruction is fetched from the program memory.
2) Decode phase: The instruction is decoded.
3) Memory read phase: The operand required for the execution of the instruction may be read from the data memory.
4) Execution phase: Execution as well as the storage of the results in either a register or memory is carried out.

Instruction cycles of processor with no Pipelining


Value of T  1    2    3    4    5    6    7    8    9    10   11   12
Fetch       I-1                 I-2                 I-3
Decode           I-1                 I-2                 I-3
Read                  I-1                 I-2                 I-3
Execute                    I-1                 I-2                 I-3



Instruction cycles of processor with Pipelining


Value of T  1    2    3    4    5    6    7    8    9    10   11   12
Fetch       I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9
Decode           I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9
Read                  I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9
Execute                    I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9

Pipeline Performance
Let T denote the time required for each phase of the instruction. One clock cycle of the processor corresponds to T.

In a period of 12T only three instructions can be executed in a machine with no pipeline.

In the same period nine instructions can be executed in a machine with a pipeline.

Hence the throughput is increased by a factor of 3 in this case.

Also note that the initial latency of a machine with 4 phases is 4T. Hence the time required for executing a program with N instructions is (N+3)T: 4T for the first instruction and T for each of the remaining N-1 instructions. With a non-pipelined machine, the time required for executing N instructions is 4NT.
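The cycle counts read off the two timing charts can be checked with a short Python sketch. It assumes an ideal pipeline with no stalls; the function names and the `phases` parameter are illustrative. The first instruction completes after `phases` clock periods and every further instruction completes one period later, which gives (N+3)T for four phases.

```python
def periods_without_pipeline(n_instructions, phases=4):
    """Every instruction occupies the processor for all of its phases."""
    return n_instructions * phases


def periods_with_pipeline(n_instructions, phases=4):
    """Ideal pipeline: fill latency, then one instruction completes per period."""
    if n_instructions == 0:
        return 0
    return phases + (n_instructions - 1)


print(periods_without_pipeline(3), periods_with_pipeline(9))
```

Both calls cover the same 12T window, matching the three instructions of the non-pipelined chart and the nine instructions of the pipelined chart.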

Branching Effect on Pipelining


Branching effects: A branch changes the flow of the program, so the instructions that have already entered the pipeline behind the branch must be discarded and the pipeline refilled, which reduces throughput. To overcome this problem, some of the P-DSPs have special branch/call and return instructions, called delayed branch/call/return instructions.

The throughput efficiency may also be reduced because of conflicts between instructions that are in different phases of the instruction pipeline. This happens if the same memory is used to store the data and the program and there is only a single address bus for addressing both the program and data memory. For example, an instruction in the fetch phase may try to fetch the instruction code from a memory chip that is also accessed by another instruction that is in the operand read phase. To avoid the conflict, the operand read is done first and the opcode fetch is repeated until there is no conflict.

Pipeline Hazards
Data hazards: an instruction uses the result of a previous instruction (RAW)
    ADD R1, R2, R3
    ADD R4, R1, R5

Control hazards: the location of an instruction depends on a previous instruction
    JMP LOOP
    LOOP: ADD R1, R2, R3

Structural hazards: two instructions need access to the same resource
    e.g., single memory shared for instruction fetch and load/store
    collision in reservation table
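A read-after-write (RAW) data hazard between neighbouring instructions can be detected with a few lines of Python. This is a simplified sketch: it assumes each instruction is a tuple of (destination, source 1, source 2), as in `ADD R1, R2, R3`, and it only compares an instruction with the one immediately before it. The function name `find_raw_hazards` is illustrative.

```python
def find_raw_hazards(program):
    """Return (producer index, consumer index, register) for adjacent RAW hazards."""
    hazards = []
    for i in range(1, len(program)):
        previous_dest = program[i - 1][0]  # register written by the earlier instruction
        sources = program[i][1:]           # registers read by the later instruction
        if previous_dest in sources:
            hazards.append((i - 1, i, previous_dest))
    return hazards


program = [
    ("R1", "R2", "R3"),  # ADD R1, R2, R3
    ("R4", "R1", "R5"),  # ADD R4, R1, R5 reads R1 before it has been written back
]
print(find_raw_hazards(program))
```

A real pipeline would compare against every instruction still in flight, not just the previous one; the window of one is chosen here only to keep the example short.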

Data Hazards (RAW)

Figure: Pipeline stages F, R, X, M, W plotted against cycle for the instructions ADD R1, R2, R3 and ADD R4, R1, R5, marking where the first instruction writes data to R1 and where the second instruction reads from R1.

Pipeline Depth
The number of instructions that are processed simultaneously in the CPU is referred to as the depth of the instruction pipeline. It differs between families of P-DSPs. The pipeline depths of some of the P-DSPs are given below:

P-DSP Name/Family     Pipeline Depth
Analog Devices        2
Motorola DSP5600X     3
TI TMS320C5X          4
TI TMS320C54X         6

Special addressing Modes in P-DSPs


The P-DSPs have special addressing modes that permit a single word/instruction format, thereby speeding up execution by making effective use of pipelining.

1) Short Immediate Addressing
2) Short Direct Addressing
3) Memory Mapped Addressing
4) Indirect Addressing
5) Bit Reversed Addressing
6) Circular Addressing

Short Immediate Addressing


Permits the operand to be specified using a short constant that forms part of a single word instruction. The length of the short constant depends on the instruction type and the P-DSP. For example, in the case of the TMS320C5X, an 8-bit constant can be specified as one of the operands in the single word instructions for addition, subtraction, AND, OR, XOR, etc.

Short Direct Addressing


Permits the lower order address bits of the operand of an instruction to be specified in the single word instruction. For example, in the TMS320 DSPs, the higher order 9 bits of the memory address are stored in the data page pointer and only the lower 7 bits are specified as a part of the instruction.

Each contiguous block of 128 words is referred to as one page in the TI DSPs. The argument in the instruction specifies only the location within the current page.
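The way a 9-bit data page pointer and a 7-bit in-page offset combine into a 16-bit data address can be sketched in Python. The function name `direct_address` and its argument names are illustrative, not register or instruction names from the TI documentation.

```python
PAGE_BITS = 9    # higher order bits held in the data page pointer
OFFSET_BITS = 7  # lower order bits carried in the instruction word


def direct_address(data_page, offset):
    """Concatenate the page number and the in-page offset into one address."""
    if not 0 <= data_page < (1 << PAGE_BITS):
        raise ValueError("data page must fit in 9 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset must fit in 7 bits")
    return (data_page << OFFSET_BITS) | offset


print(direct_address(1, 0), direct_address(511, 127))
```

Page 1 starts at address 128, consistent with each page being a contiguous block of 128 words.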

Memory Mapped Addressing


The CPU registers and the I/O registers of the P-DSPs are also accessible as memory locations. This is achieved by mapping them into either the starting page or the final page of the memory space. For example, in the TMS320C5X, page 0 corresponds to the CPU registers and I/O registers.

In the case of the Motorola DSP5600X, the last page of the memory space, containing 64 locations, is used as the memory map for the CPU and I/O registers.

Indirect Addressing
Permits an array of data to be processed in a P-DSP to be efficiently fetched and stored. The address of the operand can be stored in one of the registers called indirect address registers. In the case of TI processors, the indirect address registers are called auxiliary registers (ARs).

The content of the ARs may be incremented or decremented either in steps of 1 or in steps specified by the content of the offset register (TI processors: INDX register).

There is an additional ALU in the CPU core for the indirect address registers (ARs).

Indirect Addressing
In the P-DSPs from Analog Devices, the offset register is called the modifier register.

The content of the indirect address registers may also be updated by a constant using the bit reversed addressing mode.

In the TI 5X processors, the new address computed by the auxiliary ALU is not used for fetching the operand of the current instruction that is being decoded and executed. It is used for fetching the operand of the next instruction that uses the indirect addressing mode with this particular AR.

E.g. indirect addressing mode with post-increment and post-decrement:
    A0 = A0 + *R5++
    A0 = A0 + *R5--
    A0 = A0 + *R5++R17

Bit Reversed Addressing Mode


The most unusual of the addressing modes, bit reversed addressing is used only in very specialized circumstances. For the computation of the FFT, the data is to be arranged in bit reversed order and the 2-point DFT of the resulting sequence is to be computed first.

For example, when an 8-point FFT is to be computed, the 2-point DFT of X(0) and X(4) is to be found, then similarly the 2-point DFT of X(2) and X(6), and so on.

Note that the values 0, 4, 2, 6, 1, 5, 3, 7 correspond to the consecutive numbers in the bit reversed number representation.

Bit Reversed Addressing Mode


In the bit reversed addressing mode, the address is incremented/decremented by the number represented in the bit reversed form.

Decimal No.   Binary Representation   Reversed Binary Representation   Bit reversed addresses
0             000                     000                              0
1             001                     100                              4
2             010                     010                              2
3             011                     110                              6
4             100                     001                              1
5             101                     101                              5
6             110                     011                              3
7             111                     111                              7
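The bit reversed address sequence for an 8-point FFT can be generated in software with a short Python sketch. On a P-DSP the address generation unit produces this order in hardware; the function name `bit_reverse` is illustrative.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)  # move the lowest bit of value into result
        value >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

With 3 address bits the sequence matches the bit reversed addresses listed for the decimal numbers 0 to 7.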

Circular Addressing Mode


Let x(n) = [1 2 3 4 5 6 7 8 9 9 8 7 6 5 4 3 2 1]
    Y(n) = [9 8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9]

Figure: the samples of x(n) and Y(n) together with an LCD display.

On-Chip Peripherals
The P-DSPs have a number of on-chip peripherals that relieve the CPU from routine functions. They also help to reduce the chip count in a DSP system based around a P-DSP. Some of the on-chip peripherals in the P-DSPs are as follows:
  On-chip Timer
  Serial Port
  TDM serial port
  Parallel Port
  Bit I/O Ports
  Host Port
  Comm ports
  On-chip A/D, D/A converters
  P-DSPs with RISC and CISC

On chip Timer
Two common applications of on-chip timers are:
1) Generation of periodic interrupts to the P-DSPs
2) Generation of sampling clocks for the A/D converters

The timer mode can be programmed by the P-DSP. The timers can generate a single pulse or a periodic train of pulses. They can also generate a single square wave or a periodic square wave. The period of the timer is also programmable.

Serial Port
This enables data communication between the P-DSP and an external peripheral such as an A/D converter, a D/A converter or an RS232C device. These ports have input and output buffers, so that the P-DSP writes to or reads from the serial port in parallel form and the serial port sends and receives the data to and from the peripherals in serial form. These ports have parallel-to-serial and serial-to-parallel converters built into them. The shift clock can be fed from the P-DSP or from an external clock generator. The serial port can operate in synchronous mode or asynchronous mode.

TDM Serial Port


The P-DSPs have a special serial port called the TDM serial port. It permits a P-DSP to communicate with other devices or P-DSPs by using Time Division Multiplexing (TDM).

One of the devices can generate the frame sync pulse, which indicates the beginning of a TDM frame, and the bit clock, which sets the duration for which a bit is to be transmitted. The TDM frame is split into a number of equal slots and each slot can be allotted to one of the devices.

Figure: One TDM frame (Ch 1, Ch 2, Ch 3, Ch 4, Ch 5, Ch 6, Ch 7, Ch 8)

TDM Serial Port


There are 8 slots/frame, and this is referred to as a TDM with eight channels. In each of the slots, a number of bits may be transmitted by a channel. The TDM serial port normally uses four lines for the purpose of serial communication. They are:
  TFRM:    The frame sync signal
  TClock:  The bit clock
  TADD:    The address of the serial device
  TDAT:    The data transmitted into the TDM channel by the authorized device

Figure: One TDM frame (Ch 1, Ch 2, Ch 3, Ch 4, Ch 5, Ch 6, Ch 7, Ch 8)

TDM Serial Port


Parallel Port

Bit I/O Port

Host Port

Comm Port

On chip A/D & D/A Converters

P-DSPs with RISC and CISC

Comparison: CISC, RISC, VLIW



THANK YOU !!

HAVE A NICE DAY!!!

