WCDMA RNC HSPA User Plane Acceleration Using LSI Networking Solution
Version 3.0 Feb 3, 2009
Abstract
As High Speed Packet Access (HSPA) peak data rates increase, current Radio Network Controller (RNC) platforms, which rely on a collection of general purpose processors (CPUs) for user plane processing, cannot scale to meet increased traffic workloads. RNCs therefore require a new approach for processing user plane wireless protocols such as PDCP, Radio Link Control (RLC), MAC, and FP. RNCs need HSPA acceleration because of both the higher peak rate demands and the overall increase in the number of HSPA users and the associated traffic in WCDMA networks. This paper describes how to accelerate current HSPA user plane designs, regardless of the hardware they run on today, by using the LSI APP650 Advanced PayloadPlus network processor. By offloading user plane processing such as RLC segmentation/concatenation and reassembly to the APP650, RNCs can achieve a user peak data rate of 100+ Mb/s for small RLC Service Data Units (SDUs) and an aggregate throughput across 30k users of over 700 Mb/s. Using an APP650 offload approach with flexible RLC (3GPP Release 7+), the peak rate throughput per user can be more than 200 Mb/s.
Introduction
An important requirement for cellular systems is to provide high data rates for packet data services. To meet this requirement, HSPA was introduced in Releases 5 and 6 of the 3GPP/WCDMA specifications [1, 2]. Although packet data communication is supported in the first release of the 3GPP/WCDMA standard, HSPA brings further enhancements, including higher order modulation, fast scheduling, and rate control, to support higher peak data rates per end user. As HSPA continues to evolve, peak data rates will keep increasing. In 3GPP Release 7+ [3], the peak data rate is expected to exceed 200 Mb/s per user.
In a 3G network with HSPA, an RNC typically controls several hundred base stations. The RNC is in charge of call setup and radio resource management for the cells under its control. The WCDMA user plane protocol layers, including PDCP, RLC, MAC, and FP, are initiated in the RNC in the downlink direction and terminated in the RNC in the uplink direction. The RLC protocol layer is the only layer of the RNC user plane layers that is terminated in the user equipment (the mobile device) in a 3G network. All the other layers run only between the RNC and the base station.
Existing RNC platforms typically use several general purpose CPUs to process WCDMA user plane protocol stacks. With the evolution of HSPA and the increase in cell data rates, existing RNC architectures do not scale to meet the increased traffic workload in WCDMA networks. Moore's Law calls for 2x scaling of CPU performance every 18 months. According to several market research studies, network demand for RNC user plane processing capacity is estimated to increase by roughly 3x every 12 months. The resulting gap between demand and CPU capability is a significant problem for current RNC user plane processing approaches (Figure 1). LSI provides both short term and long term solutions for this problem.
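The scaling mismatch between CPU performance and network demand can be made concrete with a short calculation. The sketch below is illustrative only: it uses the growth rates quoted in the Introduction (2x every 18 months for CPU performance, roughly 3x every 12 months for demand), assumes both curves start from the same normalized level, and uses hypothetical function names.

```python
def cpu_growth(years: float) -> float:
    """CPU performance multiple after `years`, doubling every 18 months."""
    return 2.0 ** (years / 1.5)


def demand_growth(years: float) -> float:
    """User plane demand multiple after `years`, tripling every 12 months."""
    return 3.0 ** years


def capacity_gap(years: float) -> float:
    """Ratio of demand to CPU performance, both normalized to 1.0 at year 0."""
    return demand_growth(years) / cpu_growth(years)


if __name__ == "__main__":
    for years in (1, 2, 3):
        print(f"after {years} year(s): demand / CPU capacity = {capacity_gap(years):.2f}")
```

Under these assumptions, three years of growth gives 27x demand against 4x CPU performance, a ratio of 6.75. The exact figure depends on the assumed common starting point, but the ratio grows every year regardless.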
[Figure 1: Gap between network demand for RNC user plane processing capacity and CPU performance scaling; horizontal axis 2000 to 2012.]
This paper proposes a solution that offloads RLC segmentation/concatenation to substantially reduce the load on existing CPUs. Segmentation offload has previously been used for other protocols (e.g., TCP) to save CPU cycles in server platforms. This paper applies similar concepts to accelerate WCDMA user plane processing, specifically the RLC protocol. The LSI APP650 processor [5] offloads segmentation/concatenation and reassembly for up to 30k RLC connections. Figure 2 shows several CPUs using a single APP650 processor as the acceleration engine. This acceleration approach provides significant advantages over non-accelerated implementations, which are typically limited by single core or single thread performance. With a typical CPU and operating system model, in which CPUs perform all user plane processing, supporting higher HSPA peak rates and more users would require single-user processing software to be parallelized or pipelined across multiple processors. Such a software effort would be extremely complex, expensive, and error prone. In contrast, moving some of the most CPU-intensive processing to the LSI APP650 processor can eliminate 50% or more of the CPU processing load, allowing peak rates and overall aggregate throughput to more than double on the same hardware.
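[Figure 2: Several CPUs running RLC use a single APP650 processor as the acceleration engine. The APP650 performs RLC segmentation and concatenation and RLC reassembly, exchanging RLC SDUs and RLC PDUs with the CPUs.]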
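To illustrate the kind of work being offloaded, the following is a minimal sketch of segmentation/concatenation and reassembly. It is not the APP650 implementation or the 3GPP PDU format: the `Pdu` type, the `sdu_ends` field (a stand-in for RLC length indicators), and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pdu:
    payload: bytes
    # Offsets within `payload` at which an SDU ends. This is a hypothetical
    # stand-in for the length indicators carried by real RLC PDUs.
    sdu_ends: List[int] = field(default_factory=list)


def segment_and_concatenate(sdus: List[bytes], pdu_size: int) -> List[Pdu]:
    """Pack SDUs back to back into PDUs of at most `pdu_size` payload bytes.

    Large SDUs are split across PDUs (segmentation); small SDUs share a PDU
    (concatenation). Empty SDUs are skipped in this sketch.
    """
    if pdu_size <= 0:
        raise ValueError("pdu_size must be positive")
    pdus: List[Pdu] = []
    buf = bytearray()
    ends: List[int] = []
    for sdu in sdus:
        pos = 0
        while pos < len(sdu):
            room = pdu_size - len(buf)
            chunk = sdu[pos:pos + room]
            buf.extend(chunk)
            pos += len(chunk)
            if pos == len(sdu):
                ends.append(len(buf))  # this SDU ends inside the current PDU
            if len(buf) == pdu_size:
                pdus.append(Pdu(bytes(buf), ends))
                buf, ends = bytearray(), []
    if buf:
        pdus.append(Pdu(bytes(buf), ends))  # last PDU may be partly filled
    return pdus


def reassemble(pdus: List[Pdu]) -> List[bytes]:
    """Rebuild the original SDUs from an in-order, loss-free PDU sequence."""
    sdus: List[bytes] = []
    partial = bytearray()
    for pdu in pdus:
        start = 0
        for end in pdu.sdu_ends:
            partial.extend(pdu.payload[start:end])
            sdus.append(bytes(partial))
            partial = bytearray()
            start = end
        partial.extend(pdu.payload[start:])  # SDU continues in the next PDU
    return sdus
```

Even in this simplified form, every payload byte is copied and every SDU boundary is tracked per PDU, which is the per-byte, per-packet work the paper proposes moving off the CPUs.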
single context and the context executing in the pipeline is switched only when the context encounters a stall (cache miss, memory access, branch misprediction, etc.). In a conventional single-threaded architecture, keeping the execution pipeline busy is challenging because all the instructions in the pipeline belong to a single thread. In the APP650 architecture, if a context executes a high latency function call, its place in the pipeline is assigned to another context. Consequently, the APP650 multi-threaded architecture provides a zero-cycle context switch capability that is not present in single-threaded multi-core architectures. The pattern processing engine has 144 separate contexts, which allow it to fully utilize hardware resources and hide memory latency effects. In contrast, the memory bottleneck in CPUs prevents resources from being fully utilized and causes CPU cycles to be wasted. The APP650 network processor allocates a context to each incoming packet, and many packets are processed concurrently. By processing many packets at the same time, resources are fully utilized and data rates of up to 5.9 Gb/s can be achieved.
In the APP650 architecture, mechanisms are separated from policies. Hardware is in charge of providing mechanisms and software is in charge of providing policies. The APP650 architecture implements packet memory management and data movement in hardware. Consequently, software does not consume cycles to allocate or free memory, keep track of packet pointers, or copy data to different memory addresses. For each packet, the APP650 hardware invokes the software to provide policy decisions, eliminating cycles wasted on processing interrupts or polling.
The APP650 network processor also includes a Pre-Queuing Modification (PQM) engine, which can insert data into or delete data from different parts of a packet. The PQM engine can also segment a packet into many subpackets. These features of the PQM engine significantly accelerate RLC segmentation/concatenation.
Another important feature of the APP650 network processor is hardware assist for multi-field packet classification. Packet classification can take significant cycles on CPUs but is highly efficient on the APP650 network processor.
The APP650 state engine provides a mechanism to keep track of states associated with packets. In RLC processing, this engine is used to keep track of RLC connection states. For example, the 12-bit sequence number associated with each RLC connection is one piece of protocol state kept by the state engine.
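The per-connection state mentioned above can be pictured with a small software model. This is only a sketch of the idea of tracking a 12-bit sequence number per RLC connection. The class and method names are hypothetical, and the model says nothing about how the APP650 state engine is actually programmed.

```python
from typing import Dict

SN_BITS = 12
SN_MODULUS = 1 << SN_BITS  # 4096 distinct sequence numbers (0..4095)


class ConnectionStateTable:
    """Hypothetical model: next sequence number to assign, per connection."""

    def __init__(self) -> None:
        self._next_sn: Dict[int, int] = {}

    def allocate_sn(self, connection_id: int) -> int:
        """Return the sequence number for the next PDU on this connection."""
        sn = self._next_sn.get(connection_id, 0)
        # A 12-bit counter wraps back to 0 after 4095.
        self._next_sn[connection_id] = (sn + 1) % SN_MODULUS
        return sn


if __name__ == "__main__":
    table = ConnectionStateTable()
    print([table.allocate_sn(connection_id=7) for _ in range(3)])
```

Each connection advances independently, so with tens of thousands of connections the table lookup and update happen on every PDU, which is why keeping this state in a dedicated engine is attractive.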
In the APP650 network processor, hardware invokes software as a subroutine to provide policy decisions on buffer management, traffic shaping/scheduling, and packet modification. The software runs on three compute engines based on a very long instruction word (VLIW) architecture. The buffer management compute engine enforces packet discard policies and keeps queue statistics. The traffic shaper compute engine determines Quality of Service (QoS) and Class of Service (CoS) treatment for each queue. The stream editor (SED) compute engine performs Protocol Data Unit (PDU) modification.
The APP650 network processor's hardware-assisted traffic management supports deterministic traffic management behavior across thousands of queues, while providing a framework to customize traffic management algorithms in software using a subset of the C programming language. Because traffic management functions run in separate engines, the classification workload does not affect traffic management determinism. In contrast, CPU architectures execute traffic management algorithms either on the same processor pool that supports packet processing applications or on a separately allocated core. In both cases, hardware resources must be underutilized to achieve determinism. In addition, software programmers are responsible for all aspects of developing a traffic management solution.
The APP650 architecture hides most of this complexity in a framework that is implemented in hardware and exposes only policy decisions to software programmers. The architecture also hides hardware multithreading and parallel processing from the software developer. Consequently, the APP650 architecture requires far fewer lines of software and provides significantly higher throughput than existing CPU-based wireless user plane solutions.
LSI provides a rich software development environment, including a cycle-accurate simulator that can be used for functional debugging and performance analysis of applications. The simulator can also be used to determine the utilization of different hardware resources.
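The separation of mechanism from policy can be illustrated with a small sketch. It is written in Python for readability rather than in the C subset used on the APP650, and it does not reflect the LSI programming interface: `Queue`, `AdmitPolicy`, and `tail_drop` are hypothetical names. The queue (the mechanism) stores packets and keeps statistics, and it calls a policy function on each packet to decide whether to accept or discard it.

```python
from collections import deque
from typing import Callable, Deque, Optional

# A policy maps (current queue depth in bytes, incoming packet length in
# bytes) to True if the packet should be accepted.
AdmitPolicy = Callable[[int, int], bool]


def tail_drop(limit_bytes: int) -> AdmitPolicy:
    """Example policy: discard any packet that would exceed `limit_bytes`."""
    def policy(depth_bytes: int, packet_len: int) -> bool:
        return depth_bytes + packet_len <= limit_bytes
    return policy


class Queue:
    """Mechanism: stores packets, keeps statistics, asks the policy per packet."""

    def __init__(self, admit: AdmitPolicy) -> None:
        self._admit = admit
        self._packets: Deque[bytes] = deque()
        self.depth_bytes = 0
        self.accepted = 0
        self.discarded = 0

    def enqueue(self, packet: bytes) -> bool:
        if not self._admit(self.depth_bytes, len(packet)):
            self.discarded += 1
            return False
        self._packets.append(packet)
        self.depth_bytes += len(packet)
        self.accepted += 1
        return True

    def dequeue(self) -> Optional[bytes]:
        if not self._packets:
            return None
        packet = self._packets.popleft()
        self.depth_bytes -= len(packet)
        return packet
```

Swapping `tail_drop` for a different policy function changes the discard behavior without touching the queue code, which is the property the architecture description emphasizes.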
[Figure: APP650 block diagram. Labeled elements: Inputs (Port 0-3, Port 4), PPE + Classifier, Reassembly Buffer, PDU Buffer, PQM, Traffic Manager, Buffer Mgr CE, TS CE, SED CE, State Memory, Context Memories, Program Memory, PCI, External Host CPU, Outputs (Port 0-3, Port 4).]
Figure : Partitioning RLC Processing Between CPU Farm and Acceleration Engine
[Diagram labels: SDUs, SDU Buffer, Reassembly, PDUs.]
[Figure: test configuration diagram. Labeled elements: Embedded Processor, System Memory, PCI Bus, APP6xx (Segmentation/Concatenation), GE links, RLC PDUs, RLC/Eth, Port 1, Tester/Analyzer in Echo Mode.]
The APP650 simulator is used to provide resource utilization information (Table 2). The results show the APP650 context utilization for high RLC channel counts and an aggregate throughput of 700 Mb/s. Average first pass and second pass context utilization are approximately 51.7% and 10.8%, respectively. This shows that even at such a high rate, the APP650 network processor still has plenty of headroom for additional functionality.
Table : Resource Utilization for High Channel Counts and 700 Mb/s Aggregate Throughput Provided by APP Simulator Environment
Metric                                                  Value
Average Flow Instructions                               9 instructions/pdu
Average Microroot Tree Instructions                     1 instructions/pdu
Average Internal Tree Instructions                      0 instructions/pdu
Average External Tree Instructions                      10 instructions/pdu
Flow Instruction Budget                                 66 instructions/pdu
Tree Instruction Budget (@100% efficiency)              66 instructions/pdu
Average Flow Engine Utilization                         0%
Average Tree Engine Utilization                         9%
Classification Program Memory Efficiency                9.506%
Classification Program Memory Utilization (Internal)    0%
Classification Program Memory Utilization (External)    1.15%
Classification PDU Buffer Memory Utilization            11.6%
Classification Control Memory Utilization               9.98%
Average First Pass Context Utilization                  51.7%
Average Second Pass Context Utilization                 10.788%
Maximum Active First Pass Contexts                      9
Maximum Active Second Pass Contexts                     17
Maximum Active PDUs                                     19
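A short calculation helps put these figures in perspective. The sketch below uses the context utilization values from the table and the 700 Mb/s aggregate throughput, together with the 30k channel count quoted elsewhere in this paper. Spreading the load evenly across channels is an assumption made only for illustration, and the function names are hypothetical.

```python
AGGREGATE_MBPS = 700.0           # aggregate throughput from the simulation
CHANNELS = 30_000                # RLC channel count quoted in the paper
FIRST_PASS_UTIL_PCT = 51.7       # average first pass context utilization
SECOND_PASS_UTIL_PCT = 10.788    # average second pass context utilization


def mean_rate_per_channel_kbps(aggregate_mbps: float, channels: int) -> float:
    """Average rate per channel in kb/s, assuming an even spread of load."""
    return aggregate_mbps * 1000.0 / channels


def headroom_pct(utilization_pct: float) -> float:
    """Unused share of a resource, in percent."""
    return 100.0 - utilization_pct


if __name__ == "__main__":
    rate = mean_rate_per_channel_kbps(AGGREGATE_MBPS, CHANNELS)
    print(f"mean rate per channel: {rate:.1f} kb/s")
    print(f"first pass headroom:   {headroom_pct(FIRST_PASS_UTIL_PCT):.1f}%")
    print(f"second pass headroom:  {headroom_pct(SECOND_PASS_UTIL_PCT):.1f}%")
```

With these inputs the mean rate is about 23.3 kb/s per channel, and roughly 48% of first pass context capacity remains unused.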
Conclusion
As HSPA peak data rates increase, existing RNC platforms that rely on a collection of CPU cores for WCDMA user plane processing do not scale to meet increased traffic workloads. The problem with existing RNC platforms is that user plane processing, which is mostly data processing, is not well suited to general purpose CPU architectures. Wireless user plane processing requires optimization of cycle-consuming functions such as RLC segmentation/concatenation and reassembly. This paper describes an approach that accelerates an existing RNC WCDMA user plane stack by offloading RLC segmentation/concatenation and reassembly to an APP650 network processor. The paper discusses the advantages of the APP650 architecture, such as determinism in efficiently processing packets. Simulation and prototyping show that the APP650 network processor can sustain up to 700 Mb/s aggregate throughput for 30k RLC channels. For flexible RLC, a single RLC channel peak rate of more than 200 Mb/s can be achieved. In short, the APP650 network processor can be used as a user plane accelerator to solve both the peak rate and aggregate rate challenges in today's RNC systems.
Revision History
Version   Date        Description
.0        0/0/009     Merging comments. (Reza Etemadi)
.1        0/0/009     Final edits. (Henri Tervonen)
.0        0/0/009     Merging comments from Henri Tervonen, Curtis Hillier, Robert Munoz, Tareq Bustami and Jas Tremblay. (Reza Etemadi)
1.        01/1/009    Edits. (Henri Tervonen)
1.        01/0/009    Edits. (Robert Munoz)
1.1       01/9/009    Edits. (Curtis Hillier)
1.0       01/7/009    First draft. (Reza Etemadi)
0.0       01/1/009    Edits focusing on benefits of LSI solution. (Henri Tervonen)
0.01      01/0/009    Initial Draft. Abstract Added (Reza Etemadi)
References
1. 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; High Speed Downlink Packet Access (HSDPA) Overall Description (Release 5), 3GPP TS 25.308.
2. 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; FDD Enhanced Uplink; Overall Description (Release 6), 3GPP TS 25.309.
3. 3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; 3GPP System Architecture Evolution; Report on Technical Options and Conclusions (Release 7), 3GPP TR 23.882.
4. 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Radio Link Control (RLC) Protocol Specification (Release 7), 3GPP TS 25.322 V7.3.0.
5. APP650 Product Brief, LSI Corporation.
For more information and sales office locations, please visit the LSI web sites at: lsi.com lsi.com/contacts
LSI and LSI logo design are trademarks or registered trademarks of LSI Corporation or its subsidiaries. All other brand and product names may be trademarks of their respective companies. LSI Corporation reserves the right to make changes to any products and services herein at any time without notice. LSI does not assume any responsibility or liability arising out of the application or use of any product or service described herein, except as expressly agreed to in writing by LSI; nor does the purchase, lease, or use of a product or service from LSI convey a license under any patent rights, copyrights, trademark rights, or any other of the intellectual property rights of LSI or of third parties. Copyright 2009 by LSI Corporation. All rights reserved. February 2009 PB06-028CMPR