
A Low Power Implementation of a W-CDMA Receiver on an Ultra Low Power DSP

Alice Wang
Massachusetts Institute of Technology 50 Vassar Street, Rm. 38-107, Cambridge, MA
Abstract: This paper presents results from a Wideband-CDMA downlink receiver implementation on the newly developed low-power Texas Instruments TMS320C55x DSP for wireless applications. The C55x includes new instructions and functional units, such as the Dual Multiply-Accumulate (DualMAC) hardware, implicit/explicit parallelism, and an added pipeline stage. It also includes hardware in the Address Unit exclusively for memory address computation. This new architecture is suitable for implementation of systems that have large computational tasks and low-power requirements, such as the baseband processing in the W-CDMA downlink receiver. The chip dissipates 0.25 mW/MIPS, which is lower than the power dissipated in the current low-power DSP, the C54x. We show that the C55x has a 40-60% improvement over the C54x in cycle efficiency in our implementation of a W-CDMA receiver.

Rahmi Hezar, Wanda Gass
Texas Instruments 8085 Forest Lane Dallas, TX

I. INTRODUCTION

The market for mobile communication services is growing at a rapid rate. Soon, the demand will shift from voice and low bit-rate wireless applications to applications requiring high bit-rate data transmission over a wireless link. New communication standards, such as the 3rd Generation Wideband-Code Division Multiple Access standard (3G W-CDMA), are being developed for both high rate data and voice transmission [1]. Compared to current standards, W-CDMA has an added computation cost, leading to the difficult task of providing high performance communication at the mobile without increasing power dissipation.

TI's newly developed low-power DSP, the TMS320C55x, is specially designed for wireless applications. The C55x architecture evolved from its predecessor, the TMS320C54x family of programmable DSPs [2]. The chip dissipates 0.25 mW/MIPS, which is lower than the power dissipated in the current low-power DSP, the C54x. The new DSP has seven protected pipeline stages, one more than found on the C54x. In the Address Computation stage there is additional hardware in the Address Data Flow Unit, used to generate memory addresses. Another key feature of the new DSP is the Dual Multiply-Accumulate (DualMAC) hardware, which allows two multiply-accumulates in one cycle. The new DSP also allows implicit and explicit parallelism of non-conflicting instructions. These, plus other new features, provide the performance needed to implement computationally intensive tasks within the low-power requirements.

This paper presents benchmarking results from our implementation of a W-CDMA downlink receiver using the C55x. Results show that the new DSP has on average a 40-60% reduction in the number of cycles required to execute W-CDMA functions. They also show that the C55x architecture is designed to reduce the number of program and data memory accesses and also provides improved code density.

II. W-CDMA BENCHMARKS

Fig. 1 shows a block diagram of our implementation of the W-CDMA downlink receiver. The correlators are used to separate the received signal into multipaths, or fingers, which are then passed to the DSP. The correlators are clocked at the chip rate (4.096 Mcps) and are implemented on dedicated hardware. The baseband processing done at the DSP includes Maximal Ratio Combine (MRC), Digital Automatic Gain Control (Dig AGC), the deinterleaver and rate matching. These functions operate at the user's symbol rate (8 to 1024 ksps). The outputs of the rate matching block are passed to the decoder. We benchmarked the baseband processing functions on the new DSP architecture.

A. Channel Estimation and Maximal Ratio Combine (MRC)

The Channel Estimation and MRC functions in a CDMA system provide time diversity by optimally combining the received signal over all multipaths. Time diversity schemes are employed to mitigate fading effects and also to improve SINR, which is reduced due to co-channel interference.

Fig. 1. A block diagram of the W-CDMA receiver. The baseband processing is performed on the DSP and the correlators are implemented in dedicated hardware.

0-7803-6451-1/00/$10.00 © 2000 IEEE


The 3G W-CDMA communications standard also introduces Space-Time block coding based Transmit Diversity (STTD) principles. STTD further improves performance by employing multiple transmit antennas at the basestation to have both time and space diversity. Our implementation of the downlink receiver includes STTD capabilities, as well as traditional MRC time diversity schemes (IS-95).

Fig. 2: A block diagram of the MRC (non-STTD).

1. Maximal Ratio Combine (non-STTD)

In traditional MRC, a strong unmodulated pilot sequence is transmitted with the traffic data. In the baseband processing done at the receiver, the pilot sequence is used to detect multipaths and is also used in channel estimation and phase correction. As shown in Fig. 2, the pilot data is combined and smoothed to give the phase correction estimate for the slot. In the MRC block, the phase estimate, α, is multiplied with the traffic data, and the result is summed over all fingers. This process is called the Maximal Ratio Combine for an unmodulated pilot sequence.
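The non-STTD combining step can be sketched in a few lines of Python. This is an illustrative sketch rather than the benchmarked DSP code: the function names, the plain slot average used for the "combine and smooth" step, the unit-magnitude pilot reference, and the noise-free three-finger test channel are all assumptions made here. It also assumes the usual MRC convention of weighting each finger by the complex conjugate of its estimate.

```python
import cmath
import math

def estimate_channel(pilot_rx, pilot_ref):
    """Return one complex estimate per finger for a slot.

    pilot_rx  : list of fingers, each a list of despread pilot symbols
    pilot_ref : known (unit-magnitude) pilot symbols for the slot
    Averaging over the slot stands in for the combine-and-smooth step.
    """
    estimates = []
    for finger in pilot_rx:
        acc = sum(r * p.conjugate() for r, p in zip(finger, pilot_ref))
        estimates.append(acc / len(pilot_ref))
    return estimates

def mrc_combine(traffic_rx, alpha):
    """Weight each finger by conj(alpha) and sum over all fingers."""
    n_sym = len(traffic_rx[0])
    return [
        sum(a.conjugate() * finger[k] for a, finger in zip(alpha, traffic_rx))
        for k in range(n_sym)
    ]

# Hypothetical noise-free channel with three fingers (magnitude, phase).
h = [cmath.rect(0.9, 0.7), cmath.rect(0.5, -2.1), cmath.rect(0.3, 1.3)]
pilot_ref = [complex(1, 1) / math.sqrt(2)] * 4
symbols = [complex(1, 1), complex(-1, 1), complex(-1, -1), complex(1, -1)]

pilot_rx = [[g * p for p in pilot_ref] for g in h]
traffic_rx = [[g * s for s in symbols] for g in h]

alpha = estimate_channel(pilot_rx, pilot_ref)
combined = mrc_combine(traffic_rx, alpha)
total_gain = sum(abs(g) ** 2 for g in h)  # each symbol is scaled by sum |h|^2
```

In this noise-free case every combined symbol equals the transmitted symbol scaled by the total finger energy, so the per-finger phase rotations cancel. The inner loop is a sum of complex products, which is the kind of arithmetic a dual multiply-accumulate unit is meant to speed up.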

2. The STTD receiver

Space Time Transmit Diversity (STTD) is a scheme that uses multiple transmit antennas at the basestation to increase diversity and improve performance. As shown in Fig. 3, at the transmitter, while the data stream <S1, S2, ...> is being transmitted through antenna #1, the sequence <-S2, S1, ...> is transmitted simultaneously through antenna #2. This is performed in the STTD encoder.

At the mobile, each symbol of the received data (e.g. R1, R2) is a combination of each transmitted symbol pair (S1, S2). A different channel estimation scheme is needed to de-rotate and recover the STTD encoded bits. Fig. 4 shows the block diagram for the STTD receiver. In the STTD receiver, the received pilot symbols are combined and smoothed to provide the phase correction estimates, α1 and α2. Then, in the STTD MRC block, the traffic data is de-rotated in pairs and each symbol is combined over all fingers.

Fig. 3: The STTD encoder at the basestation.

Fig. 4: The STTD receiver block diagram.
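The pairwise de-rotation can be illustrated with a short Python sketch. Several things here are assumptions rather than details taken from the paper: the extracted text shows the antenna #2 sequence without conjugation marks, and the sketch assumes the standard Alamouti-style form (-S2*, S1*). The function names, the noise-free channel, and the per-finger estimates (α1, α2) are also hypothetical.

```python
import cmath

def sttd_encode(s1, s2):
    """Return the symbol pairs sent on antenna #1 and antenna #2.

    Assumption: antenna #2 carries (-conj(S2), conj(S1)), the usual
    Alamouti-style STTD mapping.
    """
    return (s1, s2), (-s2.conjugate(), s1.conjugate())

def sttd_channel(ant1, ant2, a1, a2):
    """Noise-free received pair (R1, R2) on one finger with gains a1, a2."""
    r1 = a1 * ant1[0] + a2 * ant2[0]
    r2 = a1 * ant1[1] + a2 * ant2[1]
    return r1, r2

def sttd_mrc(received, alphas):
    """De-rotate each received pair and sum the result over all fingers."""
    s1_hat = 0j
    s2_hat = 0j
    for (r1, r2), (a1, a2) in zip(received, alphas):
        s1_hat += a1.conjugate() * r1 + a2 * r2.conjugate()
        s2_hat += -a2 * r1.conjugate() + a1.conjugate() * r2
    return s1_hat, s2_hat

# Hypothetical two-finger example.
s1, s2 = complex(1, 1), complex(-1, 1)
ant1, ant2 = sttd_encode(s1, s2)
alphas = [
    (cmath.rect(0.8, 0.4), cmath.rect(0.6, -1.1)),
    (cmath.rect(0.5, 2.0), cmath.rect(0.3, 0.9)),
]
received = [sttd_channel(ant1, ant2, a1, a2) for a1, a2 in alphas]
s1_hat, s2_hat = sttd_mrc(received, alphas)
gain = sum(abs(a1) ** 2 + abs(a2) ** 2 for a1, a2 in alphas)
```

Under these assumptions the cross terms between the two antennas cancel, and each recovered symbol is the transmitted one scaled by the summed channel energy of both antennas over all fingers.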

B. Dig-AGC and Deinterleaver


In our W-CDMA receiver, a Digital Automatic Gain Control (Dig AGC) block is needed to keep the data symbols within the correct dynamic range. The output from the MRC block after phase correction is 32-bit, but the input to the decoder block is required to be 8-bit. Thus, the Dig AGC block is inserted between the two to normalize the traffic data in each data slot. After AGC, the data is passed through the Deinterleaver. A multiple interleaving scheme is used at the transmitter in order to decrease fading effects and burst errors introduced in the channel. At the receiver end, the data is deinterleaved once before rate matching, and again in the decoder.

C. Rate Matching

Different users not only have different average symbol rates, but may also have different payload requirements per frame. Since the transmitted frame size is fixed, the transmitted user's data payload is either repeated to fill up an entire frame, or punctured and shrunk to fit into the correct frame size. Consequently, in the Rate Matching block at the receiver, the received data is de-repeated or de-punctured to recover the user's original payload size.
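The slot normalization and the rate-matching reversal can be sketched as follows. The paper does not give the AGC rule or the repetition and puncturing patterns, so everything specific here is an assumption for illustration: peak-based scaling to the 8-bit range, a uniform repetition factor, summing of repeated soft values, and a known keep/puncture mask. All names are hypothetical.

```python
def agc_normalize(slot, out_bits=8):
    """Scale one slot of wide (e.g. 32-bit) values into a signed out_bits range.

    Assumption: the slot peak is mapped to positive full scale.
    """
    peak = max(abs(x) for x in slot)
    if peak == 0:
        return [0] * len(slot)
    full_scale = (1 << (out_bits - 1)) - 1  # 127 for 8 bits
    scaled = (round(x * full_scale / peak) for x in slot)
    return [max(-full_scale - 1, min(full_scale, v)) for v in scaled]

def derepeat(soft, repeat):
    """Collapse each group of `repeat` consecutive soft values into one.

    Assumption: repeated symbols are adjacent and are combined by summing.
    """
    return [sum(soft[i:i + repeat]) for i in range(0, len(soft), repeat)]

def depuncture(soft, keep_mask):
    """Re-insert zeros (erasures) at the positions the transmitter punctured."""
    values = iter(soft)
    return [next(values) if keep else 0 for keep in keep_mask]

normalized = agc_normalize([1_000_000, -250_000, 400_000, -1_000_000])
restored_from_repeat = derepeat([3, 5, -2, -4], repeat=2)
restored_from_puncture = depuncture([7, -3, 5], keep_mask=[1, 0, 1, 1])
```

A zero is used for punctured positions because it carries no information about the missing bit, which leaves the decision to the decoder.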

III. LOW-POWER DSP ARCHITECTURE


The C55x is a programmable fixed-point DSP core with a variable-length instruction set and parallel execution of instructions. The architecture and the instruction set are optimized for very low power consumption. The C55x architecture is well suited to DSP-intensive telecommunication algorithms, because it offers both high code density and efficient execution. The functional block diagram of the core shown in Fig. 5 describes the principal blocks and bus structure. The four main blocks are:

Instruction Buffer Unit (IU): The IU is a 32x16-bit buffer queue that stores and organizes code execution. It improves the performance of loop code by allowing speculative fetch for conditional branching. The IU also determines whether two variable-length instructions can execute in parallel, and dispatches the instruction(s) to be executed on the three other processing units.

Program Flow Unit (PU): The PU generates the program memory address. With a hardware loop controller it permits 3 levels of looping with zero overhead (two block-repeats and a single-repeat). Block-repeat and single-repeat mechanisms are interruptible.

Address Data Flow Unit (AU): The AU is used to generate addresses to the segmented 16 Mbyte (8 Mword) word-addressable data space. This unit includes three 24-bit data address generation units, eight 23-bit generic address registers, one 16-bit coefficient data pointer register, four 16-bit generic data registers, and one data stack pointer register. On top of these addressing resources it includes a general purpose 16-bit ALU with shifting and bit manipulation capabilities.

Data Computation Unit (DU): This main data execution unit contains rich computation resources with parallelism. They include four 40-bit generic accumulators, two 16-bit transition registers, two single-cycle 17x17 MAC units, a 40-bit ALU with two modes (40-bit and dual 16-bit), and a 40-bit shifter. Dedicated hardware permits dual compare and select of max or min, which is useful for Viterbi butterflies, and enhanced bit manipulation operations.

In addition to these processing units, the C55x has a 7-stage protected pipeline.
The benefits of this feature are 1) code density for compiled code, and 2) easier code development. The C55x architecture is designed with advanced power management, such that different modules and peripherals (e.g. memory, functional units) are powered down when unused. The C55x architecture exploits parallelism (DualMAC, IU) to double the number of instructions executed per cycle, which effectively halves the number of cycles/sec required. This allows the DSP to be run at a lower supply voltage, to further reduce power dissipation. The C55x can operate at as low as 0.9 V. Other functionality built into the C55x that further improves execution efficiency includes the Address Data Flow Unit, which is used to generate addresses in parallel with data computation, and additional accumulators, buses and dedicated hardware added to accelerate typical communications applications.

Fig. 5: TMS320C55x core structure. Bus labels recoverable from the figure: data read buses (3x16), program address, data write address buses (2x24), data write buses (2x16).

The C55x core also includes features to reduce memory accesses. The IU is an instruction cache which stores frequently accessed instructions (loops, repeats), eliminating power dissipated during external memory accesses. Variable instruction lengths improve code density and memory efficiency.
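The link between parallelism, supply voltage and power can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the common first-order CMOS model in which dynamic power scales with clock rate and with the square of the supply voltage. The paper does not state this model, and it ignores leakage and the dependence of the maximum clock rate on voltage. The 1.5 V reference supply and the 100 MIPS workload are hypothetical numbers chosen only for illustration; the 0.9 V and 0.25 mW/MIPS figures come from the text.

```python
def relative_dynamic_power(freq_ratio, v_new, v_ref):
    """First-order estimate of P_new / P_ref, assuming P ~ C * V^2 * f."""
    return freq_ratio * (v_new / v_ref) ** 2

def power_mw(mips, mw_per_mips=0.25):
    """Power for a workload at the quoted 0.25 mW/MIPS figure."""
    return mips * mw_per_mips

# Halving the cycles/sec at an unchanged (hypothetical) 1.5 V supply.
same_voltage = relative_dynamic_power(0.5, 1.5, 1.5)

# Halving the cycles/sec and also dropping the supply to 0.9 V.
scaled_voltage = relative_dynamic_power(0.5, 0.9, 1.5)

# Hypothetical 100 MIPS baseband workload.
workload_power = power_mw(100)
```

Under this model, halving the cycle requirement alone halves dynamic power, while combining it with the lower supply voltage brings the estimate down to roughly 18% of the reference. This is why cycle efficiency and voltage scaling are presented together.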

IV. RESULTS

The W-CDMA baseband processing functions were implemented on the C55x and the benchmark results are shown in this section. The important metrics are the number of cycles/sec for each benchmark, the code size (bytes) and the program and data memory accesses (bytes). We also compare the performance of the C55x to its predecessor, the C54x. Figs. 6-9 show the relative improvement of the C55x over the C54x for each benchmark metric of the W-CDMA functions.

Fig. 6 shows that the C55x requirement is 40-60% of the C54x requirement in cycles/sec for slot processing functions. The large improvement of the C55x over the C54x is mainly due to its rich computational resources. For example, in W-CDMA there is a great deal of complex arithmetic and multiplication, which makes the DualMAC instrumental in increasing the cycle efficiency. Secondly, the IU was utilized to execute non-conflicting instructions in parallel to further reduce the cycles/sec requirement. Finally, the AU allowed the processor to both compute the memory address and perform the computation in one cycle.

In Fig. 7, we compare the code size for the C55x and the C54x kernel benchmarks. These results show that we were able to achieve the 40-60% improvement in cycle efficiency even without a significant increase in code density for the individual kernels. Results have shown that code density can improve up to 30% in a system implementation, which would contain many kernel applications. The reduction in code size is primarily due to the variable-length instruction set and the protected pipelining capabilities of the C55x. The protected pipeline not only provides reduced code size, but also allows for easier code development.

The C55x shows a large improvement in both data and program memory accesses. Fig. 8 shows that the C55x had fewer program memory fetches during execution than the C54x. The large improvement in program memory accesses is due to the IU, in which loop code for the block-repeats and single-repeat can be cached to eliminate multiple accesses to the program memory. In Fig. 9, we show that the C55x implementation of the W-CDMA functions requires fewer data memory accesses than the C54x. This is due to the increased number of registers and accumulators in the Data Computation Unit, which store intermediate data and thus reduce unnecessary data memory accesses. Reducing external memory accesses leads to a reduction in power dissipation, since there is a time penalty for external memory accesses which translates to increased power dissipation.

V. SUMMARY

We benchmarked the 3G W-CDMA baseband functions on the C55x to measure the cycle efficiency, the code size, and the number of program and data memory accesses. We also compared the performance of the C55x with that of its predecessor, the C54x. Various new features and the rich computational resources are instrumental in bringing about a 40-60% improvement in performance, namely lower runtime, lower code size, and lower power compared to the C54x architecture. These results demonstrate that the baseband functions of the W-CDMA application can be implemented on the C55x rather than in dedicated hardware. This, in turn, lowers the overall system cost and design time of our wireless communication system.

Fig. 7: Fraction of processor cycles/sec for the C55x vs. the C54x. Benchmarks shown: non-STTD MRC, STTD MRC, Dig AGC and Deinterleaver, and Rate Matching.


Fig. 8: Code size (bytes) comparison between the C55x and the C54x. Benchmarks shown: non-STTD MRC, STTD MRC, Dig AGC and Deinterleaver, and Rate Matching.


Fig. 9: Fraction of program memory accesses for the C55x vs. the C54x for one frame of data. Benchmarks shown: non-STTD MRC, STTD MRC, Dig AGC and Deinterleaver, and Rate Matching.

Communications. IEEE Journal of Solid-State, Vol. 32, No. 11, November 1997.

Fig. 10: Fraction of data memory accesses for the C55x vs. the C54x for one frame of data. Benchmarks shown: non-STTD MRC, STTD MRC, Dig AGC and Deinterleaver, and Rate Matching.
