Outline
Digital Signal Processors (DSPs) vs. General Purpose Processors (GPPs)
Basic Architectural features
Multiplier and Multiplier Accumulator (MAC)
Modified Bus Structure and Memory Access Schemes in P-DSPs
Multiple Access Memory
Multiported Memory
VLIW architecture
Mr. Vikas R. Gupta, Assistant Professor, ET, YCCE, Nagpur
Outline
Pipelining:
processor?
1011 x 1110 = 10011010

1011 x 1110 by shift and add, one step per cycle:
Cycle 1:     0000
Cycle 2:    1011.
Cycle 3:   1011..
Cycle 4:  1011...
Cycle 5: 10011010
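The shift-and-add steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a model of any particular processor; the function name `shift_add_multiply` and the `width` parameter are assumptions introduced here.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, width: int = 4):
    """Multiply two unsigned integers by summing shifted partial products."""
    partial_products = []
    for bit in range(width):
        if (multiplier >> bit) & 1:
            # Multiplier bit is 1: add the multiplicand shifted left by `bit`.
            partial_products.append(multiplicand << bit)
        else:
            # Multiplier bit is 0: the partial product is zero.
            partial_products.append(0)
    return partial_products, sum(partial_products)


partials, product = shift_add_multiply(0b1011, 0b1110)
for cycle, value in enumerate(partials, start=1):
    print(f"cycle {cycle}: {value:08b}")
print(f"product: {product:08b}")  # prints "product: 10011010"
```

A software loop like this needs one step per multiplier bit plus the final sum, which is the cost a single-cycle hardware multiplier avoids.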
Parameter                            TMS320C6211 (@150MHz)                 TMS320C6711 (@150MHz)
Arithmetic format                    32-bit                                32-bit
Extended floating point              N/A                                   64-bit
Extended Arithmetic                  40-bit                                40-bit
Performance (peak)                   1200 MIPS                             1200 MFLOPS
Number of hardware multipliers       2 (16 x 16-bit) with 32-bit result    2 (32 x 32-bit) with 32 or 64-bit result
Number of registers                  32                                    32
Internal L1 program memory cache     32K                                   32K
Internal L1 data memory cache        32K                                   32K
Internal L2 cache                    512K                                  512K
TMS320C6211 (@150MHz) vs. TMS320C6711 (@150MHz)
Supply voltage
Power management
On-chip timers (number/width)
Cost
Package
External memory interface controller
JTAG
C2000
Lowest Cost
  Control Systems
  Motor Control
  Storage
  Digital Ctrl Systems
C5000
Efficiency
Best MIPS per Watt / Dollar / Size
  Wireless phones
  Internet audio players
  Digital still cameras
  Modems
  Telephony
  VoIP
C6000
Multi Channel and Multi Function Apps
  Comm Infrastructure
  Wireless Base-stations
  DSL
  Imaging
  Multi-media Servers
  Video
TMS320C64x: The C64x fixed-point DSPs offer the industry's highest level of performance to address the demands of the digital age. At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS with costs as low as $19.95. In addition to a high clock rate, C64x DSPs can do more work each cycle with built-in extensions. These extensions include new instructions to accelerate performance in key application areas such as digital communications infrastructure and video and image processing.
TMS320C62x: These first-generation fixed-point DSPs represent breakthrough technology that enables new equipment and energizes existing implementations for multi-channel, multi-function applications, such as wireless base stations, remote access servers (RAS), digital subscriber loop (xDSL) systems, personalized home security systems, advanced imaging/biometrics, industrial scanners, precision instrumentation and multichannel telephony systems.
TMS320C67x: For designers of high-precision applications, C67x floating-point DSPs offer the speed, precision, power savings and dynamic range to meet a wide variety of design needs. These dynamic DSPs are the ideal solution for demanding applications like audio, medical imaging, instrumentation and automotive.
C6000 Roadmap
Figure: roadmap with a Performance axis, showing Object Code Software Compatibility across generations.
Multi-core, Floating Point, C64x DSP 1.1 GHz
2nd Generation: C6416, C6414, C6412, C6411, C6415, DM642
1st Generation: C6203, C6202, C6713, C6201, C6701, C6211
DSPs Features
High speed DSP computations
Specialized instruction set
Sonar
DSPs Applications
Speech and audio compression
Filtering
Modulation and demodulation
Error correction coding and decoding
Audio processing (e.g., surround sound, noise
DSPs Characteristics
1. Data path & internal ALU architecture
2. Specialized instruction set
Data Path
DSPs
Performs all key arithmetic operations in 1 cycle
Hardware support for managing numeric fidelity:
  Shifters
  Guard bits
  Saturation
GPPs
Multiplication often takes >1 cycle
Shifts often take >1 cycle
Other operations (e.g. saturation, rounding) typically take multiple cycles
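To make saturation and guard bits concrete, here is a minimal Python sketch for 16-bit signed values. It is an illustration only: the helper names (`wrap16`, `saturate16`, `accumulate_saturating`, `accumulate_with_guard_bits`) are hypothetical, and Python's unbounded integers stand in for a wider hardware accumulator.

```python
INT16_MAX = 2**15 - 1
INT16_MIN = -(2**15)


def wrap16(value: int) -> int:
    """Two's-complement wraparound, as in plain 16-bit arithmetic."""
    return (value + 2**15) % 2**16 - 2**15


def saturate16(value: int) -> int:
    """Clamp to the 16-bit signed range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, value))


def accumulate_saturating(values):
    """Saturate after every addition (no guard bits)."""
    acc = 0
    for v in values:
        acc = saturate16(acc + v)
    return acc


def accumulate_with_guard_bits(values):
    """Keep extra headroom during the sum and saturate only at the end.

    The unbounded Python int models an accumulator with guard bits.
    """
    acc = 0
    for v in values:
        acc += v
    return saturate16(acc)


print(wrap16(30000 + 10000))      # -25536: overflow wraps to a large negative value
print(saturate16(30000 + 10000))  # 32767: overflow clamps to the maximum
print(accumulate_saturating([30000, 10000, -20000]))       # 12767
print(accumulate_with_guard_bits([30000, 10000, -20000]))  # 20000
```

The last two lines show why guard bits matter: an intermediate overflow that is later cancelled does not corrupt the final result when the accumulator has headroom.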
A representative conventional fixed-point DSP processor data path (from the Motorola DSP560xx, a 24-bit, fixed-point processor family)
Instruction Set
DSPs
Specialized, complex instructions Multiple operations per instruction (e.g. using VLIW)
GPPs
General-purpose instructions Typically only one operation per instruction
VLIW
Very long instruction word (VLIW) architectures are garnering increased attention for DSP applications.
Major features:
Multiple independent operations per cycle
Packed into a single large instruction or packet
More regular, orthogonal, RISC-like operations
Large, uniform register sets
Memory Architecture
DSPs
Harvard architecture
2-4 memory accesses/cycle
No caches: on-chip SRAM
GPPs
Von Neumann architecture
Typically 1 access/cycle
Harvard Architecture
A Harvard architecture, common to many DSP processors. The processor can simultaneously access the two memory banks using two independent sets of buses, allowing operands to be loaded while fetching instructions.
Addressing Modes
DSPs
Dedicated address generation units
Specialized addressing modes; e.g.:
  Auto-increment
  Modulo (circular)
  Bit-reversed (for FFT)
Good immediate data support
GPPs
Often, no separate address generation unit
General-purpose addressing modes
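Modulo (circular) addressing with auto-increment can be illustrated in software as follows. This is a sketch under stated assumptions: the class `CircularBuffer` and its methods are hypothetical names, and the explicit `%` operation stands in for what a dedicated address generation unit does without extra instructions.

```python
class CircularBuffer:
    """Fixed-length buffer whose write index wraps around (modulo addressing)."""

    def __init__(self, length: int):
        self.data = [0] * length
        self.index = 0  # plays the role of the address register

    def write(self, sample):
        self.data[self.index] = sample
        # Post-increment with wraparound: after the last slot comes slot 0.
        self.index = (self.index + 1) % len(self.data)

    def latest(self, count: int):
        """Return the `count` most recent samples, newest first."""
        n = len(self.data)
        return [self.data[(self.index - 1 - k) % n] for k in range(count)]


buf = CircularBuffer(4)
for sample in [1, 2, 3, 4, 5, 6]:
    buf.write(sample)

print(buf.data)       # [5, 6, 3, 4]: samples 1 and 2 were overwritten
print(buf.latest(4))  # [6, 5, 4, 3]
```

A GPP typically spends explicit instructions on the wraparound test, while a DSP address generation unit performs it as part of the memory access.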
Execution Control
Hardware support for fast looping
Fast interrupts for I/O handling
DSPs Evolution
First generation (TI TMS32010)
Second generation (Motorola DSP56001, AT&T DSP16A, Analog Dev. ADSP-2100, TI TMS320C50)
Third generation (Motorola DSP56301, TI TMS320C541, TI TMS320C80, Motorola MC68356)
Fourth generation (TI TMS320C6201, Intel Pentium MMX)
Accumulator
Specialized instruction set
(TI 320C5X)
Figure: accumulator built from an adder (+) and a register.

$x = [x_n, x_{n-1}, x_{n-2}, \ldots, x_{n-M+3}, x_{n-M+2}, x_{n-M+1}]$
corresponding to the present and past M-1 samples of the input with
$h = [h_0, h_1, h_2, h_3, \ldots, h_{M-3}, h_{M-2}, h_{M-1}]$

After obtaining the product $x_{n-M+1} h_{M-1}$, the element $x_{n-M}$ may be made equal to $x_{n-M+1}$. Similarly, after obtaining the product $x_{n-M+2} h_{M-2}$, the element $x_{n-M+1}$ may be made equal to $x_{n-M+2}$, and so on.
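The multiply, accumulate and data-move sequence described above can be sketched in Python. This is a minimal illustration, not the instruction sequence of any specific processor. The function name `fir_step` and the list layout (`delay_line[k]` holds the sample k steps in the past) are assumptions, and the delay line here has only M slots, so the oldest sample is simply dropped instead of being moved.

```python
def fir_step(delay_line, coeffs, new_sample):
    """Compute one FIR output and shift the delay line in the same pass.

    delay_line[k] holds the input sample from k steps ago and is updated in place.
    """
    taps = len(coeffs)
    delay_line[0] = new_sample
    acc = 0
    # Work from the oldest sample to the newest, so each sample is used
    # in its product before its slot is overwritten by the data move.
    for k in range(taps - 1, -1, -1):
        acc += delay_line[k] * coeffs[k]        # multiply-accumulate
        if k + 1 < taps:
            delay_line[k + 1] = delay_line[k]   # data move for the next sample
    return acc


coeffs = [1, 2, 3]
delay_line = [0, 0, 0]
outputs = [fir_step(delay_line, coeffs, x) for x in [1, 0, 0, 0]]
print(outputs)  # [1, 2, 3, 0]: the impulse response reproduces the coefficients
```

Combining the data move with the multiply-accumulate means the delay line is already shifted when the next input sample arrives.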
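Figure: block diagram of a Processing Unit and a Control Unit sharing one memory for Data/Instructions (labels: Operands, Status Bus, Data Bus, Opcode, Instructions, Address).
Figure: block diagram of a Processing Unit with Data Memory and a Control Unit with Program Memory (labels: Address, Status Bus, Opcode, Instructions).
Figure: block diagram of a Processing Unit, Control Unit and Data Memory with a second bus pair (labels: Address, Status Bus, Opcode, Instructions, Address bus 2, Data bus 2).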
VLIW Architecture
8 July 2013
Pipelining
It is a technique for increasing the performance of a processor by breaking a sequence of operations into smaller pieces and executing these pieces in parallel when possible, thereby decreasing the overall time required to complete the set of operations. It represents the trade-off between efficiency and ease of use. Let's see a real-life example of pipelining.
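Figure: sequential laundry timeline running up to midnight (time marks 10, 11, Midnight). Each load, starting with load A, passes through stages of 30, 40 and 20 minutes, repeated for four loads.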
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?
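Figure: pipelined laundry timeline, with task order on the vertical axis. Stage times along the time axis: 30, 40, 40, 40, 40, 20 minutes.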
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously using different resources
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
Stall for dependences
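The laundry numbers can be checked with a short Python sketch. It assumes that each stage handles one load at a time and that a load waits if the next stage is still busy. The function names and this scheduling model are assumptions made for illustration.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Loads overlap: a stage starts as soon as the load and the stage are both free."""
    finish = [0] * len(stage_minutes)  # when each stage finished its previous load
    for _ in range(loads):
        ready = 0  # when the current load left the previous stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


stages = [30, 40, 20]  # stage times in minutes, as in the laundry example
seq = sequential_minutes(stages, 4)
pipe = pipelined_minutes(stages, 4)
print(seq / 60)   # 6.0 hours
print(pipe / 60)  # 3.5 hours
print(round(seq / pipe, 2))  # 1.71, well below the 3 stages because they are unbalanced
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40-minute stage sets the rate, and the fill and drain times are added once.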
Pipelining
An instruction cycle can be divided into a number of microinstructions. Execution of each microinstruction is referred to as one phase of an instruction. For example, an instruction cycle requiring four microinstructions can be said to be in four phases as follows:
1) Fetch Phase: The instruction is fetched from the program memory.
2) Decode Phase: The instruction is decoded.
3) Memory Read Phase: The operand required for the execution of the instruction may be read from the data memory.
4) Execution Phase: Execution, as well as the storage of the results in either a register or memory, is carried out.
Figure: instruction execution without pipelining
Cycle     1    2    3    4    5    6    7    8    9    10   11   12
Fetch     I-1                 I-2                 I-3
Decode         I-1                 I-2                 I-3
Read                I-1                 I-2                 I-3
Execute                  I-1                 I-2                 I-3
Figure: instruction execution with pipelining
Cycle     1    2    3    4    5    6    7    8    9    10   11   12
Fetch     I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9
Decode         I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9
Read                I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9
Execute                  I-1  I-2  I-3  I-4  I-5  I-6  I-7  I-8  I-9
Pipeline Performance
Let T denote the time required for each phase of the instruction. One clock cycle of the processor corresponds to T. In a period of 12T, only three instructions can be executed in a non-pipelined processor.
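The instruction counts for a four-phase instruction cycle over a period of 12T can be reproduced with a small Python sketch. It assumes an ideal pipeline with no stalls; the function name `completed_instructions` is hypothetical.

```python
def completed_instructions(cycles: int, phases: int, pipelined: bool) -> int:
    """Number of instructions that finish within `cycles` clock periods of length T."""
    if pipelined:
        # The first instruction needs `phases` cycles to pass through the pipeline;
        # after that, one instruction completes every cycle (ideal case, no stalls).
        return max(0, cycles - phases + 1)
    # Without pipelining, each instruction occupies the processor for all its phases.
    return cycles // phases


non_pipelined = completed_instructions(12, 4, pipelined=False)
pipelined = completed_instructions(12, 4, pipelined=True)
print(non_pipelined)  # 3
print(pipelined)      # 9
```

As the number of cycles grows, the ratio of the two counts approaches the number of phases, which is the potential speedup of an ideal pipeline.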
Pipeline Hazards
Data hazards: an instruction uses the result of a previous instruction (RAW)
  ADD R1, R2, R3
  ADD R4, R1, R5
Control hazards: the location of an instruction depends on a previous instruction
  JMP LOOP
  LOOP: ADD R1, R2, R3
Structural hazards: two instructions need access to the same resource
  e.g., single memory shared for instruction fetch and load/store
  collision in reservation table
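A read-after-write (RAW) dependence like the one in the ADD example can be detected with a simple scan. The sketch below assumes three-operand instructions written as `OP dest, src1, src2`, with the first operand as the destination. The helper names are hypothetical, and the scan reports dependences only; whether a stall actually occurs depends on the pipeline depth and any forwarding hardware.

```python
def parse(instruction: str):
    """Split 'OP dest, src1, src2' into (op, dest, [sources])."""
    op, operands = instruction.split(None, 1)
    registers = [r.strip() for r in operands.split(",")]
    return op, registers[0], registers[1:]


def raw_hazards(program):
    """Return (writer_index, reader_index, register) for each RAW dependence."""
    hazards = []
    for i, earlier in enumerate(program):
        _, dest, _ = parse(earlier)
        for j in range(i + 1, len(program)):
            _, later_dest, sources = parse(program[j])
            if dest in sources:
                hazards.append((i, j, dest))
            if later_dest == dest:
                break  # register is rewritten; later reads depend on the new writer
    return hazards


program = ["ADD R1, R2, R3", "ADD R4, R1, R5"]
print(raw_hazards(program))  # [(0, 1, 'R1')]
```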
Figure: pipeline timing diagram (Instruction vs. Cycle) with stages F, R, X, M, W; annotation: "Write Data to R1 Here".
Pipeline Depth
The number of instructions that are processed simultaneously in the CPU is referred to as the depth of the instruction pipeline. It differs between families of P-DSPs. The pipeline depths of some of the P-DSPs are given below:

P-DSP Name/Family     Pipeline Depth
Analog Devices        2
Motorola DSP5600X     3
TI TMS320C5X          4
TI TMS320C54X         6
1) Short Immediate Addressing
2) Short Direct Addressing
3) Memory Mapped Addressing
4) Indirect Addressing
5) Bit Reversed Addressing
6) Circular Addressing
Indirect Addressing
Permits an array of data to be processed in the P-DSP to be efficiently fetched and stored. The address of the operand can be stored in one of the registers called indirect address registers. In the case of TI processors, the indirect address registers are
Indirect Addressing
In the P-DSPs from Analog Devices it is called the modifier register.
The content of the indirect address registers may also be updated by a constant using the bit-reversed addressing mode. In the TI 5X processors the new address computed by the auxiliary ALU is not used for fetching the operand for the current
Index   Binary Representation   Bit-Reversed Binary   Bit-Reversed Index
0       000                     000                   0
1       001                     100                   4
2       010                     010                   2
3       011                     110                   6
4       100                     001                   1
5       101                     101                   5
6       110                     011                   3
7       111                     111                   7
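The bit-reversed index order for an 8-point sequence can be generated with a few lines of Python. This is a software illustration of the mapping only; the function name `bit_reverse` is an assumption, and a P-DSP would produce these addresses in its address generation hardware.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index into result
        index >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]

for i in range(8):
    print(f"{i}  {i:03b}  {bit_reverse(i, 3):03b}  {bit_reverse(i, 3)}")
```

Applying the reversal twice returns the original index, so the same addressing mode can be used to scramble FFT inputs or to unscramble FFT outputs.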
LCD Display
On-Chip Peripherals
The P-DSPs have a number of on-chip peripherals that relieve the CPU from routine functions. They also help to reduce the chip count in the DSP system based around the P-DSP. Some of the on-chip peripherals in the P-DSPs are as follows:
On-chip Timer
Serial Port
TDM serial port
Parallel Port
Bit I/O Ports
Host Port
Comm ports
On-chip A/D, D/A converters
P-DSPs with RISC and CISC
On chip Timer
Two common applications of on-chip Timers are
1) Generation of periodic interrupts to the P-DSPs
2) Generation of sampling clocks for the A/D converters
The Timer mode can be programmed by the P-DSPs. The timers can generate a single pulse or periodic train of pulses.
Serial Port
This enables data communication between the P-DSP and an external peripheral such as an A/D converter, D/A converter or RS-232C device. These ports have input and output buffers, so that the P-DSP writes to or reads from the serial port in parallel form, and the serial port sends and receives the data to and from the peripherals in serial form. These devices have parallel-to-serial and serial-to-parallel converters built into them. The shift clock can be fed from the P-DSP or an external clock generator. The port can operate in synchronous or asynchronous mode.
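The parallel-to-serial and serial-to-parallel conversion described above can be modelled in Python. This is a behavioural sketch only: the function names are hypothetical, and the 8-bit word size and MSB-first bit order are assumptions, since real serial ports make both configurable.

```python
def parallel_to_serial(word: int, width: int = 8):
    """Shift a word out one bit per shift-clock tick, most significant bit first."""
    return [(word >> bit) & 1 for bit in range(width - 1, -1, -1)]


def serial_to_parallel(bits):
    """Shift received bits into a word, most significant bit first."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


tx_bits = parallel_to_serial(0xA5)
print(tx_bits)                           # [1, 0, 1, 0, 0, 1, 0, 1]
print(hex(serial_to_parallel(tx_bits)))  # 0xa5
```

The processor only ever touches the whole word on each side; the bit-by-bit transfer is handled by the port's shift registers.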
pulse that indicates the beginning of a TDM frame and a bit clock, which sets the duration for which a bit is to be transmitted. The TDM frame is split into a number of equal slots and each slot can be allotted to one of the devices.
Figure: one TDM frame divided into eight channel slots (Ch 1 to Ch 8). Signals shown: the frame sync signal, the bit clock, the address of the serial device, and the data Tx into the TDM channel by the authorized device.
Parallel Port
Host Port
Comm Port
THANK YOU !!