
Networking Concepts and Technology: A Designer's Resource

Deepak Kakadia and Francesco DiMambro

Sun Microsystems, Inc. www.sun.com

Part No. 817-1046-10
June 2004, Revision A
Submit comments about this document at: http://www.sun.com/hwdocs/feedback

Copyright 2004 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved.

Sun Microsystems, Inc. has intellectual property rights relating to technology that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the U.S. and in other countries.

This document and the product to which it pertains are distributed under licenses restricting their use, copying, distribution, and decompilation. No part of the product or of this document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd.

Sun, Sun Microsystems, the Sun logo, AnswerBook2, docs.sun.com, iPlanet, Java, JavaDataBaseConnectivity, JavaServer Pages, Enterprise JavaBeans, Netra, Sun ONE, Sun Trunking, JumpStart, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and in other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and in other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK and Sun Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.

U.S. Government Rights - Commercial use. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements.

DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.


Acknowledgements
Deepak says, "I would like to thank the many people who have gone out of their way to help me, not only with writing this book, but also with teaching me about various aspects of my professional and academic career. First, I am very grateful for the tremendous corporate support I received from Scott McNealy, Clark Masters, Gary Beck, Brad Carlile, and Bill Sprouse. I feel extremely fortunate to be a part of a team with the greatest, most unselfish corporate leadership of modern times. I would like to thank Kemer Thomson, Vicky Hardman, Gary Rush, Barb Jugo, Alice Kemp, a veteran book writer and also my book mentor, and Diana Lins, who did the illustrations in this book. Much of the data center work developed in this book was built on the shoulders of giants: Richard Croucher, a world-renowned expert in data center technologies, Dr. Jim Baty, Dr. Joseph Williams, Mikael Lofstrand, and Jason Carolan. Frank and I are very grateful to the technical reviewers who spent considerable time reviewing and providing feedback and comments: David Auslander, Martin Lorenz, Ken Pepple, Mukund Buddhikot, and my good friends and colleagues Mark Garner, who sacrificed many pub nights for me, David Deeths, Don Devitt, and John Howard. I would also like to thank Dr. Nick McKeown, professor at Stanford University; Rui Zhang-Shen, Ph.D. student at Stanford University; John Fong of Nortel Networks; John Reuter and David Bell of Foundry Networks; Dan Mercado and Bill Cormier of Extreme Networks; and Sunil Cherian of Array Networks." Above all, says Deepak, "I thank my wife, Jagruti, and daughters, Angeli and Kristina, for their patience, sacrifice, and understanding of why I had to miss both school programs and family events. Finally, I want to thank my mother, mother-in-law, and father-in-law for helping out at home during my many absences."


Frank says, "We must also remember the unsung heroes who implement, test, sustain, and measure the performance of the Sun network technology, device drivers, and device driver framework, for their outstanding contribution to Sun networking technology. Their assistance and support helped us gain the collective experience that has been essential to providing the best possible networking capability for Sun and, in turn, the material necessary for some parts of this book. These include the network ASIC development team, in particular Shimon Muller, Binh Pham, George Chu, and Carlos Castil; the device driver development team, including Sumanth Kamatala, Joyce Yu, Paul Simons, Raghunath Shenbagam, David Gordon, and Paul Lodrige; the networking quality assurance team, including Lalit Bhola, Benny Chin, Jie Zhu, Alan Hanson, Neeraj Gupta, Deb Banerjee, Charleen Yee, and Ovid Jacob; the Solaris device driver framework development team, including Adi Masputra, Jerry Chu, Priyanka Agarwal, and Paul Durrant; and the system performance measurement team: Patrick Ong, Jian Huang, Paul Rithmuller, Charles Suresh, and Roch Borbonnais." "I would love to say a big thank you to my wife, Bridget," Frank says, "for her patience and encouragement as I progressed with my contribution to this book. Thanks to my two sons, Francesco and Antonio, for distracting me from time to time and forcing me to take a break from the book and play. For them, Dad's writing his book was not a reasonable excuse. God bless them, for they were right. Who would imagine someone three feet tall could have that much insight? Those breaks made all the difference."


Contents

1. Overview
   Evolution of Web Services Infrastructures
   The Data Center IP Network
   Network Traffic Characteristics
   End-to-End Session: Tuning the Transport Layer
   Network Edge Traffic Steering: IP Services
   Server Networking Internals
   Network Availability Design Patterns
   Reference Implementations

2. Network Traffic Patterns: Application Layer
   Services on Demand Architecture
   Multi-Tier Architecture and Traffic Patterns
      Mapping Tiers to the Network Architecture
      Inter-tier Traffic Flows
   Web Services Tier
   Application Services Tier
   Architecture Examples
      Designing for Vertical Scalability and Performance
      Designing for Security and Vertical Scalability
      Designing for Security and Horizontal Scalability
   Example Solution

3. Tuning TCP: Transport Layer
   TCP Tuning Domains
   TCP Queueing System Model
   Why the Need to Tune TCP
   TCP Packet Processing Overview
   TCP STREAMS Module Tunable Parameters
   TCP State Model
      Connection Setup
      Connection Established
      Connection Shutdown
   TCP Tuning on the Sender Side
      Startup Phase
      Steady State Phase
   TCP Congestion Control and Flow Control - Sliding Windows
   TCP Tuning for ACK Control
   TCP Example Tuning Scenarios
      Tuning TCP for Optical Networks - WANs
      Tuning TCP for Slow Links
   TCP and RDMA Future Data Center Transport Protocols

4. Routers, Switches, and Appliances - IP-Based Services: Network Layer
   Packet Switch Internals
   Emerging Network Services and Appliances
   Server Load Balancing
      Hash
      Round-Robin
      Smallest Queue First / Least Connections
      Finding the Best SLB Algorithm
      How the Proxy Mode Works
         Advantages of Using Proxy Mode
         Disadvantages of Using Proxy Mode
      How Direct Server Return Works
         Advantages of Direct Server Return
         Disadvantages of Direct Server Return
      Server Monitoring
      Persistence
      Commercial Server Load Balancing Solutions
         Foundry ServerIron XL - Direct Server Return Mode
         Extreme Networks BlackDiamond 6800 - Integrated SLB Proxy Mode
   Layer 7 Switching
   Network Address Translation
   Quality of Service
      The Need for QoS
      Classes of Applications
         Data Transfers
         Video and Voice Streaming
         Interactive Video and Voice
         Mission-Critical Applications
         Web-Based Applications
      Service Requirements for Applications
      QoS Components
         Implementation Functions
         QoS Metrics
      Network and Systems Architecture Overview
      Implementing QoS
         ATM QoS Services
         Sources of Unpredictable Delay
         QoS-Capable Devices
         Implementation Approaches
         Functional Components - High-Level Overview
         QoS Profile
         Deployment of Data and Control Planes
         Packet Classifier
         Metering
         Marking
         Policing and Shaping
         IP Forwarding Module
         Queuing
         Congestion Control
         Packet Scheduler
   Secure Sockets Layer
      SSL Protocol Overview
      SSL Acceleration Deployment Considerations
         Software SSL Libraries - Packet Flow
         The Crypto Accelerator Board - Packet Flow
         SSL Accelerator Appliance - Packet Flow
      SSL Performance Tests
         Test 1: SSL Software Libraries versus SSL Accelerator Appliance - Netscaler 9000
         Test 2: Sun Crypto Accelerator 1000 Board
         Test 3: SSL Software Libraries versus SSL Accelerator Appliance - Array Networks
      Conclusions Drawn from the Tests

5. Server Network Interface Cards: Datalink and Physical Layer
   Token Ring Networks
      Token Ring Interfaces
      Configuring the SunTRI/S Adapter with TCP/IP
         Setting the Maximum Transmission Unit
         Disabling Source Routing
         Disabling ARI/FCI Soft Error Reporting
         Configuring the Operating Mode
         Resource Configuration Parameter Tuning
      Configuring the SunTRI/P Adapter with TCP/IP
         Setting the Maximum Transmission Unit
         Configuring the Ring Speed
         Configuring the Locally Administered Address
   Fiber Distributed Data Interface Networks
      FDDI Stations
         Single-Attached Station
         Dual-Attached Station
      FDDI Concentrators
         Single-Attached Concentrator
         Dual-Attached Concentrator
      FDDI Interfaces
      Configuring the SunFDDI/S Adapter with TCP/IP
         Setting the Maximum Transmission Unit
         Target Token Rotation Time
      Configuring the SunFDDI/P Adapter with TCP/IP
         Setting the Maximum Transmission Unit
         Target Token Rotation Time
   Ethernet Technology
      Software Device Driver Layer
         Transmit
         Receive
         Jumbo Frames
      Ethernet Physical Layer
         Basic Mode Control Layer
         Basic Mode Status Register
         Link-Partner Auto-negotiation Advertisement Register
         Gigabit Media Independent Interface
         Ethernet Flow Control
            Example 1
            Example 2
   Fast Ethernet Interfaces
      10/100 hme Fast Ethernet
         Current Device Instance in View for ndd
         Operational Mode Parameters
         Transceiver Control Parameter
         Inter-Packet Gap Parameters
         Local Transceiver Auto-negotiation Capability
         Link Partner Capability
         Current Physical Layer Status
      10/100 qfe Quad Fast Ethernet
         Current Device Instance in View for ndd
         Operational Mode Parameters
         Transceiver Control Parameter
         Inter-Packet Gap Parameters
         Local Transceiver Auto-negotiation Capability
         Link Partner Capability
         Current Physical Layer Status
      10/100 eri Fast Ethernet
         Current Device Instance in View for ndd
         Operational Mode Parameters
         Transceiver Control Parameter
         Inter-Packet Gap Parameters
         Receive Interrupt Blanking Parameters
         Local Transceiver Auto-negotiation Capability
         Link Partner Capability
         Current Physical Layer Status
      10/100 dmfe Fast Ethernet
         Operational Mode Parameters
         Local Transceiver Auto-negotiation Capability
         Link Partner Capability
         Current Physical Layer Status
   Fiber Gigabit Ethernet
      1000 vge Gigabit Ethernet
      1000 ge Gigabit Ethernet
         Current Device Instance in View for ndd
         Operational Mode Parameters
         Transceiver Control Parameter
         Inter-Packet Gap Parameters
         Receive Interrupt Blanking Parameters
         Local Transceiver Auto-negotiation Capability
         Link Partner Capability
         Current Physical Layer Status
         Performance Tunable Parameters
      10/100/1000 ce GigaSwift Gigabit Ethernet
         Current Device Instance in View for ndd
         Operational Mode Parameters
         Flow Control Parameters
         Gigabit Link Clock Mastership Controls
         Transceiver Control Parameter
         Inter-Packet Gap Parameters
         Receive Interrupt Blanking Parameters
         Random Early Drop Parameters
         PCI Bus Interface Parameters
         Jumbo Frames Enable Parameter
         Performance Tunables
      10/100/1000 bge Broadcom BCM 5704 Gigabit Ethernet
         Operational Mode Parameters
         Local Transceiver Auto-negotiation Capability
         Link Partner Capability
         Current Physical Layer Status
   Sun VLAN Technology
      VLAN Configuration
   Sun Trunking Technology
      Trunking Configuration
      Trunking Policies
   Network Configuration
      Configuring the System to Use the Embedded MAC Address
      Configuring the Network Host Files
      Setting Up a GigaSwift Ethernet Network on a Diskless Client System
      Installing the Solaris Operating System Over a Network
   Configuring Driver Parameters
      Setting Network Driver Parameters Using the ndd Utility
         Using the ndd Utility in Non-interactive Mode
         Using the ndd Utility in Interactive Mode
      Reboot Persistence Using driver.conf
         Global driver.conf Parameters
         Per-Instance driver.conf Parameters
      Using /etc/system to Tune Parameters
   Network Interface Card General Statistics
      Ethernet Media Independent Interface Kernel Statistics
   Maximizing the Performance of an Ethernet NIC Interface
      Ethernet Physical Layer Troubleshooting
      Deviation from General Ethernet MII/GMII Conventions
      Ethernet Performance Troubleshooting
         ge Gigabit Ethernet
         ce Gigabit Ethernet

6. Network Availability Design Strategies
   Network Architecture and Availability
   Layer 2 Strategies
      Trunking Approach to Availability
         Theory of Operation
         Availability Issues
         Load-Sharing Principles
      Availability Strategies Using SMLT and DMLT
      Availability Using Spanning Tree Protocol
         Availability Issues
   Layer 3 Strategies
      VRRP Router Redundancy
      IPMP - Host Network Interface Redundancy
      Integrated VRRP and IPMP
      OSPF Network Redundancy - Rapid Convergence
      RIP Network Redundancy
   Conclusions Drawn from Evaluating Fault Detection and Recovery Times

7. Reference Design Implementations
   Logical Network Architecture
   IP Services
      Stateless Server Load Balancing
      Stateless Layer 7 Switching
      Stateful Layer 7 Switching
      Stateful Network Address Translation
      Stateful Secure Sockets Layer Session ID Persistence
      Stateful Cookie Persistence
   Design Considerations: Availability
      Collapsed Layer 2/Layer 3 Network Design
   Multi-Tier Data Center Logical Design
      How Data Flows Through the Service Modules
   Physical Network Implementations
      Secure Multi-Tier
      Multi-Level Architecture Using Many Small Switches
      Flat Architecture Using Collapsed Large Chassis Switches
   Physical Network - Connectivity
   Switch Configuration
      Configuring the Extreme Networks Switches
      Configuring the Foundry Networks Switches
         Master Core Switch Configuration
         Standby Core Switch Configuration
         Server Load Balancer
         Server Load Balancer
   Network Security
      Netscreen Firewall

A. Lyapunov Analysis

Glossary

Index

Figures

FIGURE 1-1 Web Services Infrastructure Impact on Data Center Network Architectures
FIGURE 1-2 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (a)
FIGURE 1-3 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (b)
FIGURE 1-4 Influence of Multi-Tier Software Architectures on Network Architecture
FIGURE 1-5 Transport Layer Traffic Flows Tuned According to Client Links
FIGURE 1-6 Data Center Edge IP Services
FIGURE 1-7 Data Center Networking Considerations on the Server
FIGURE 1-8 Availability Strategies in the Data Center
FIGURE 1-9 Example Implementation of an Enterprise Multi-Tier Data Center
FIGURE 2-1 Main Components of Multi-Tier Architecture
FIGURE 2-2 Logical View of Multi-Tier Service on Demand Architecture
FIGURE 2-3 Network Inter-tier Traffic Flows of a Web-based Transaction
FIGURE 2-4 Model of Presentation/Web Tier Components and Interfacing Elements
FIGURE 2-5 High-Level Survey of EJB Availability Mechanisms
FIGURE 2-6 Decoupled Web Tier and Application Server Tier - Vertically Scaled
FIGURE 2-7 Tightly Coupled Web Tier and Application Server Tier - Vertically Scaled
FIGURE 2-8 Decoupled Web Tier and Application Server Tier - Horizontally Scaled
FIGURE 2-9 Tested and Implemented Architecture Solution
FIGURE 3-1 Overview of Overlapping Tuning Domains
FIGURE 3-2 Closed-Loop TCP System Model
FIGURE 3-3 Perfectly Tuned TCP/IP System
FIGURE 3-4 Tuning Required to Compensate for Faster Links
FIGURE 3-5 Tuning Required to Compensate for Slower Links
FIGURE 3-6 Complete TCP/IP Stack on Computing Nodes
FIGURE 3-7 TCP and STREAM Head Data Structures Tunable Parameters
FIGURE 3-8 TCP State Engine Server and Client Node
FIGURE 3-9 TCP Startup Phase
FIGURE 3-10 TCP Tuning for ACK Control
FIGURE 3-11 Comparison between Normal LAN and WAN Packet Traffic
FIGURE 3-12 Tuning Required to Compensate for Optical WAN
FIGURE 3-13 Comparison between Normal LAN and WAN Packet Traffic - Long Low Bandwidth Pipe
FIGURE 3-14 Increased Performance of InfiniBand/RDMA Stack
FIGURE 4-1 Internal Architecture of a Multi-Layer Switch
FIGURE 4-2 High-Level Model of Server Load Balancing
FIGURE 4-3 High-Level Model of the Shortest Queue First Technique
FIGURE 4-4 Round-Robin and Weighted Round-Robin
FIGURE 4-5 Server Load Balanced System Modeled as N - M/M/1 Queues
FIGURE 4-6 System Model of One Queue
FIGURE 4-7 Server Load Balance - Packet Flow: Proxy Mode
FIGURE 4-8 Direct Server Return Packet Flow
FIGURE 4-9 Content Switching Functional Model
FIGURE 4-10 Overview of End-to-End Network and Systems Architecture
FIGURE 4-11 One-Way End-to-End Packet Data Path Transversal
FIGURE 4-12 QoS Functional Components
FIGURE 4-13 Traffic Burst Graphic
FIGURE 4-14 Congestion Control: RED, WRED Packet Discard Algorithms
FIGURE 4-15 High-Level Condensed Protocol Overview
FIGURE 4-16 Packet Flow for Software-based Approach to SSL Processing
FIGURE 4-17 PCI Accelerator Card Approach to SSL Processing - Partial Offload
FIGURE 4-18 SSL Appliance Offloads Frontend Client SSL Processing
FIGURE 4-19 SSL Test Setup with No Offload
FIGURE 4-20 Throughput Increases Linearly with More Processors
FIGURE 4-21 SSL Test Setup for SSL Software Libraries
FIGURE 4-22 SSL Test Setup for an SSL Accelerator Appliance
FIGURE 4-23 Effect of Number of Threads on SSL Performance
FIGURE 4-24 Effect of File Size on SSL Performance
FIGURE 5-1 Token Ring Network
FIGURE 5-2 Typical FDDI Dual Counter-Rotating Ring
FIGURE 5-3 SAS Showing Primary Output and Input
FIGURE 5-4 DAS Showing Primary Input and Output
FIGURE 5-5 SAC Showing Multiple M-ports with Single-Attached Stations
FIGURE 5-6 DAC Showing Multiple M-ports with Single-Attached Stations
FIGURE 5-7 Communication Process between the NIC Software and Hardware
FIGURE 5-8 Transmit Architecture
FIGURE 5-9 Basic Receive Architecture
FIGURE 5-10 Hardware Transmit Checksum
FIGURE 5-11 Hardware Receive Checksum
FIGURE 5-12 Software Load Balancing
FIGURE 5-13 Hardware Load Balancing
FIGURE 5-14 Basic Mode Control Register
FIGURE 5-15 Basic Mode Status Register
FIGURE 5-16 Link Partner Auto-negotiation Advertisement
FIGURE 5-17 Link Partner Priority for Hardware Decision Process
FIGURE 5-18 Auto-negotiation Expansion Register
FIGURE 5-19 Extended Basic Mode Control Register
FIGURE 5-20 Basic Mode Status Register
FIGURE 5-21 Gigabit Extended Status Register
FIGURE 5-22 Gigabit Control Status
FIGURE 5-23 Gigabit Status Register
FIGURE 5-24 GMII Mode Link Partner Priority
FIGURE 5-25 Flow Control Pause Frame Format
FIGURE 5-26 Link Partner Auto-negotiation Advertisement Register
FIGURE 5-27 Rx/Tx Flow Control in Action
FIGURE 5-28 Typical hme External Connectors
FIGURE 5-29 Typical qfe External Connectors
FIGURE 5-30 Typical vge and ge MMF External Connectors
FIGURE 5-31 Sun GigaSwift Ethernet MMF Adapter Connectors
FIGURE 5-32 Sun GigaSwift Ethernet UTP Adapter Connectors
FIGURE 5-33 Example of Servers Supporting Multiple VLANs with Tagging Adapters
FIGURE 6-1 Network Topologies and Impact on Availability
FIGURE 6-2 Trunking Software Architecture
FIGURE 6-3 Trunking Failover Test Setup
FIGURE 6-4 Correct Trunking Policy on Switch
FIGURE 6-5 Incorrect Trunking Policy on Switch
FIGURE 6-6 Correct Trunking Policy on Server
FIGURE 6-7 Incorrect Trunking Policy on a Server
FIGURE 6-8 Incorrect Trunking Policy on a Server
FIGURE 6-9 Layer 2 High-Availability Design Using SMLT
FIGURE 6-10 Layer 2 High-Availability Design Using DMLT
FIGURE 6-11 Spanning Tree Network Setup
FIGURE 6-12 High-Availability Network Interface Cards on Sun Servers
FIGURE 6-13 Design Pattern - IPMP and VRRP Integrated Availability Solution
FIGURE 6-14 Design Pattern - OSPF Network
FIGURE 6-15 RIP Network Setup
FIGURE 7-1 Logical Network Architecture Overview
FIGURE 7-2 IP Services - Switch Functions Operate on Incoming Packets
FIGURE 7-3 Application Redirection Functional Model
FIGURE 7-4 Content Switching Functional Model
FIGURE 7-5 Network Switch with Persistence Based on SSL Session ID
FIGURE 7-6 Tested SSL Accelerator Configuration - RSA Handshake and Bulk Encryption
FIGURE 7-7 Network Availability Strategies
FIGURE 7-8 Logical Network Architecture - Design Details
FIGURE 7-9 Traditional Availability Network Design Using Separate Layer 2 Switches
FIGURE 7-10 Availability Network Design Using Large Chassis-Based Switches
FIGURE 7-11 Logical Network Architecture with Virtual Routers, VLANs, and Networks
FIGURE 7-12 Logical Network Secure Multi-Tier
FIGURE 7-13 Multi-Tier Data Center Architecture Using Many Small Switches
FIGURE 7-15 Network Configuration with Extreme Networks Equipment
FIGURE 7-16 Sun ONE Network Configuration with Foundry Networks Equipment
FIGURE 7-17 Physical Network Connections and Addressing
FIGURE 7-18 Collapsed Design Without Layer 2 Switches
FIGURE 7-19 Foundry Networks Implementation
FIGURE 7-20 Firewalls between Service Modules
FIGURE 7-21 Virtual Firewall Architecture Using Netscreen and Foundry Networks Products

Tables

TABLE 2-1 Network Inter-tier Traffic Flows of a Web-based Transaction
TABLE 5-1 tr.conf Parameters
TABLE 5-2 MTU Sizes
TABLE 5-3 Source Routing Values
TABLE 5-4 ARI/FCI Soft Error Reporting Values
TABLE 5-5 Operating Mode Values
TABLE 5-6 trp.conf Parameters
TABLE 5-7 Maximum Transmission Unit
TABLE 5-8 Ring Speed
TABLE 5-9 nf.conf Parameters
TABLE 5-10 Maximum Transmission Unit
TABLE 5-11 Request Operating TTRT
TABLE 5-12 pf.conf Parameters
TABLE 5-13 Maximum Transmission Unit
TABLE 5-14 Request Operating Target Token Rotation Time
TABLE 5-15 Multi-Data Transmit Tunable Parameter
TABLE 5-16 Possibilities for Resolving Pause Capabilities for a Link
TABLE 5-17 Driver Parameters and Status
TABLE 5-18 Instance Parameter
TABLE 5-19 Operational Mode Parameters
TABLE 5-20 Transceiver Control Parameter
TABLE 5-21 Inter-Packet Gap Parameter
TABLE 5-22 Local Transceiver Auto-negotiation Capability Parameters
TABLE 5-23 Link Partner Capability Parameters
TABLE 5-24 Current Physical Layer Status Parameters
TABLE 5-25 Driver Parameters and Status
TABLE 5-26 Instance Parameter
TABLE 5-27 Operational Mode Parameters
TABLE 5-28 Inter-Packet Gap Parameter
TABLE 5-29 Local Transceiver Auto-negotiation Capability Parameters
TABLE 5-30 Link Partner Capability Parameters
TABLE 5-31 Current Physical Layer Status Parameters
TABLE 5-32 Driver Parameters and Status
TABLE 5-33 Instance Parameter
TABLE 5-34 Operational Mode Parameters
TABLE 5-35 Inter-Packet Gap Parameters
TABLE 5-36 Receive Interrupt Blanking Parameters
TABLE 5-37 Local Transceiver Auto-negotiation Capability Parameters
TABLE 5-38 Link Partner Capability Parameters
TABLE 5-39 Current Physical Layer Status Parameters
TABLE 5-40 Driver Parameters and Status
TABLE 5-41 Operational Mode Parameters
TABLE 5-42 Local Transceiver Auto-negotiation Capability Parameters
TABLE 5-43 Link Partner Capability Parameters
TABLE 5-44 Current Physical Layer Status Parameters
TABLE 5-45 Driver Parameters and Status
TABLE 5-46 Instance Parameter
TABLE 5-47 Operational Mode Parameters
TABLE 5-48 Inter-Packet Gap Parameter
TABLE 5-49 Receive Interrupt Blanking Parameters
TABLE 5-50 Local Transceiver Auto-negotiation Capability Parameters
TABLE 5-51 Link Partner Capability Parameters
TABLE 5-52 Current Physical Layer Status Parameters
TABLE 5-53 Performance Tunable Parameters
TABLE 5-54 Driver Parameters and Status
TABLE 5-55 Instance Parameter
TABLE 5-56 Operational Mode Parameters
TABLE 5-57 Read-Write Flow Control Keyword Descriptions
TABLE 5-58 Gigabit Link Clock Mastership Controls
TABLE 5-59 Inter-Packet Gap Parameter
TABLE 5-60 Receive Interrupt Blanking Parameters
TABLE 5-61 Rx Random Early Detecting 8-Bit Vectors
TABLE 5-62 PCI Bus Interface Parameters
TABLE 5-63 Jumbo Frames Enable Parameter
TABLE 5-64 Performance Tunable Parameters
TABLE 5-65 Driver Parameters and Status
TABLE 5-66 Operational Mode Parameters
TABLE 5-67 Local Transceiver Auto-negotiation Capability Parameters
TABLE 5-68 Link Partner Capability Parameters
TABLE 5-69 Current Physical Layer Status Parameters
TABLE 5-70 General Network Interface Statistics
TABLE 5-71 General Network Interface Statistics
TABLE 34 Physical Layer Configuration Properties
TABLE 5-72 List of ge Specific Interface Statistics
TABLE 5-73 List of ce Specific Interface Statistics
TABLE 7-1 Network and VLAN Design
TABLE 7-2 Sequence of Events for FIGURE 7-12
TABLE 7-3 Physical Network Connections and Addressing

Preface
Networking Concepts and Technology: A Designer's Resource is a resource for network architects who must create solutions for emerging network environments in enterprise data centers. You'll find information on how to leverage Sun Open Network Environment (Sun ONE) technologies to create Services on Demand solutions, as well as technical details about the networking internals. You'll also learn how to integrate your environment with advanced network switching equipment, providing sophisticated Internet Protocol (IP) services beyond plain vanilla Layer 2 and Layer 3 routing. Based upon industry standards, expert knowledge, and hands-on experience, this book provides a detailed technical overview of the following:
- Design of highly available, scalable, manageable gigabit network architectures with a focus on the server-to-switch tier. We will share key ingredients for successful deployments based on actual experiences.
- Emerging IP services that vastly improve Sun ONE-based solutions, giving you a centralized source of concise information about these services, the benefits they provide, how to implement them, and where to use them. Example services include quality of service (QoS), server load balancing (SLB), Secure Sockets Layer (SSL), and IPSec.
- Sun networking software and hardware technologies available. We describe and explain how Sun differs from the competition in the networking arena, and then summarize the internal operations and describe technical details that lead into the tuning section. Currently there are only blind recommendations for tuning, with no explanations. This book fills that void by first describing the networking technology, which variables serve what purpose, what tuning will do, and why.

The Sun BluePrints Program


The mission of the Sun BluePrints program is to empower Sun's customers with the technical knowledge required to implement reliable, extensible, and secure information systems within the data center using Sun products. This program provides a framework to identify, develop, and distribute preferred practices information that applies across the Sun product lines. Experts in technical subjects in various areas contribute to the program and focus on the scope and advantages of the information. The Sun BluePrints program includes books, guides, and online articles. Through these vehicles, Sun can provide guidance, installation and implementation experiences, real-life scenarios, and late-breaking technical information. The monthly electronic magazine, Sun BluePrints OnLine, is located on the Web at: http://www.sun.com/blueprints. To be notified about updates to the Sun BluePrints program, please register on this site.

Who Should Read This Book


This book is intended for readers with varying degrees of experience with and knowledge of computer system and server technology, who are designing, deploying, and managing a data center within their organizations. Typically these individuals already have UNIX knowledge and a clear understanding of their IP network architectural needs. The book is targeted at network architects who must design and implement highly available, scalable data centers.

How This Book Is Organized


This book is organized into the following chapters:
- Chapter 1 provides an overview of this book and its concepts.
- Chapter 2 explores the main components of a typical enterprise Services on Demand network architecture and some of the more important underlying issues that impact network architecture design decisions.
- Chapter 3 describes some of the key Transmission Control Protocol (TCP) tunable parameters related to performance tuning: how these tunables work, how they interact with each other, and how they impact network traffic when they are modified.
- Chapter 4 describes the internal architecture of a basic network switch and provides a comprehensive discussion of server load balancing.
- Chapter 5 discusses the networking technologies that are regularly found in a data center.
- Chapter 6 provides an overview of the various approaches to network availability and describes where it makes sense to apply each solution.
- Chapter 7 describes network implementation concepts and details.
- Appendix A provides an example of the Lyapunov function.
- Glossary provides definitions for the technical terms and acronyms used in this book.

Shell Prompts

Shell                                     Prompt
C shell                                   machine-name%
C shell superuser                         machine-name#
Bourne shell and Korn shell               $
Bourne shell and Korn shell superuser     #


Typographic Conventions

Typeface                Meaning                                          Examples
AaBbCc123 (monospace)   The names of commands, files, and directories;   Edit your .login file.
                        on-screen computer output                        Use ls -a to list all files.
                                                                         % You have mail.
AaBbCc123 (bold)        What you type, when contrasted with              % su
                        on-screen computer output                        Password:
AaBbCc123 (italic)      Book titles, new words or terms, words to be     Read Chapter 6 in the User's Guide.
                        emphasized. Replace command-line variables       These are called class options.
                        with real names or values.                       You must be superuser to do this.
                                                                         To delete a file, type rm filename.

Accessing Sun Documentation


You can view, print, or purchase a broad selection of Sun documentation, including localized versions, at: http://www.sun.com/documentation


CHAPTER 1

Overview
This book provides a resource for network architects who design IP network architectures for the typical data center. It provides abstractions as well as detailed insights based on actual network engineering experience that includes network product development and real-world customer experiences. The focus of this book is limited to network architectures that support Web services-based multi-tier architectures. However, this includes everything from the edge data center switch (which connects the data center to the existing backbone network) to the server network protocol stacks. While there is tremendous acceptance of Web services technologies and multi-tier software architectures, there is limited information about how to create the network infrastructures required in the data center to optimally support these new architectures. This book also provides a new perspective on how to think about solving this problem, leveraging new emerging technologies that help create superior solutions. It explains in detail how certain key technologies work and why certain procedures are recommended. One of the complexities of networking in general is that the technology requires breadth of knowledge. Networking connectivity spans many completely different technologies. It is a complex, interconnected, commingled, interrelated set of components and devices, including hardware, software, and different solution approaches for each segment. We try to simplify this complexity by extracting key segments and taking a layered approach in describing an end-to-end solution while limiting the scope of the material to the data center.

Evolution of Web Services Infrastructures


The Web service infrastructure is middleware, which is embraced by both early adopters and mainstream enterprises interested in reducing costs and integrating legacy applications. The Web service paradigm is platform neutral as well as hardware and software neutral. Thus enterprises can easily communicate with employees, customers, business partners, and vendors while maintaining the specific security requirements, access, and privilege needs of all. The examples used in this book will focus on Web-based systems, which simplifies our discussion from a networking perspective.
FIGURE 1-1 shows a conceptual model of how this new paradigm impacts the data center network architecture, which must efficiently support this infrastructure. The Web services-based infrastructure allows different applications to integrate through the exposed interface, which is the advertised service. The internal details of the service and subservices required to provide the exposed service are hidden. This approach has a profound impact on the various networks, including the service provider network and the data center network that support the bulk of the intelligence required to deliver these Web-based services.

[FIGURE 1-1: Web Services Infrastructure Impact on Data Center Network Architectures]
Data center network architectures are driven by computing paradigms. One can argue that the computing paradigm has now come full circle. From the 1960s to the 1980s, the industry was dominated by a centralized data center architecture that revolved around a mainframe with remote terminal clients. Systems Network Architecture (SNA) and Binary Synchronous Communication (BSC) were dominant protocols. In the early to mid 1990s, client-server computing influenced a distributed network architecture. Departments had their local workgroup server with local clients and an occasional link to the corporate database or mainframe. Now, computing has returned to a centralized architecture (where the enterprise data center is more consolidated) for improved manageability and security. This centralized data center architecture is required to provide access to intranet and Internet clients, with different devices, link speeds, protocols, and security levels. Clients include internal corporate employees, external customers, partners, and vendors, each with different security requirements. A single flexible and scalable architecture is required to provide all these different services. Now the network architect requires a wider and deeper range of knowledge, including Layer 2 and Layer 3 networking equipment vendors, emerging startup appliance makers, and server-side networking features. Creating optimal data center edge architectures is not only about routing packets from the client to the target server or set of servers that collectively expose a service, but also about processing, steering, and providing cascading services at various layers. For the purposes of this book, we distinguish network design from architecture as follows:
- Architecture is a high-level description of how the major components of the system interconnect from a logical and physical perspective.
- Design is a process that specifies, in sufficient detail for implementation, how to construct a network of interconnected nodes that meets or exceeds functional and non-functional requirements (performance, availability, scalability, and such).

Advances in networking technologies, combined with the rapid deployment of Web-based, mission-critical applications, brought growth and significant changes in enterprise IP network architectures. The ubiquitous deployment of Web-based applications that has streamlined business processes has further accelerated Web services deployments. These deployments have a profound impact on the supporting infrastructures, often requiring a complete paradigm shift in the way we think about building the network architectures. Early client-server deployments had network traffic pattern characteristics that were predominantly localized traffic over large Layer 2 networks. As the migration towards Web-based applications accelerated, client-server deployments evolved to multi-tier architectures, resulting in different network traffic patterns, often outgrowing the old architectures. Traditional network architectures were designed on the assumption that the bulk of the traffic would be local or Layer 2, with proportionately less inter-company or Internet traffic. Now, traffic is very different due to the changing landscape of corporate business policies towards virtual private networks (VPNs), consumer-to-business, and business-to-business e-commerce. These innovations have also given rise to new challenges and opportunities in the design and deployment of emerging enterprise data center IP network architectures.


This book describes why these network traffic patterns have changed, defining multi-tier data centers that support these emerging applications and then describing how to design and build suitable network architectures that will optimally support multi-tier data centers. The focus of this book spans the edge of the data center network to the servers. The scope of this book is limited to the data center edge, so it does not cover the core of the enterprise network.

The Data Center IP Network


Together FIGURE 1-2 and FIGURE 1-3 provide a high-level overview of the various interconnected networks, collectively referred to as the Internet. These illustrations also show the relationship between the client-side and enterprise-side networks. They show two distinct networks that can be segregated based on business entities:
- The Internet Service Provider (ISP), which provides connectivity to the public Internet for both clients and enterprises
- The owners of the physical plant and communications equipment, which fall into one of the following categories:
  - The Incumbent Local Exchange Carrier (ILEC), which provides local access to subscribers in a local region
  - The Inter Exchange Carrier (IXC), which provides national and international access to subscribers
  - The Tier 2 ISP, which is usually a private company that leases lines and cage space at an ILEC facility, or it can be the ILEC or IXC itself

The diagram shows Tier 2 ISPs as being relatively local ISPs, situated in the access networks, whereas Tier 1 ISPs have their own long-haul backbone and provide wider regional coverage, situated in the Core, or national backbone. Tier 1 often aggregates the traffic of many Tier 2 ISPs, in addition to providing services directly to individual subscribers. Large networks connect to each other through peering points, such as MAE-East/MAE-West, which are public peering points, or through Network Access Points (NAPs), such as Sprint's NAP, which are private.

[FIGURE 1-2: High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (a)]

[FIGURE 1-3: High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (b)]


A client can be any software that initiates a request for a service, which means that a Web server itself can be a client. For example, while replying to a client with a Web page, a Web server may in turn need to fetch images from an image server. FIGURE 1-2 and FIGURE 1-3 show remote dial-up clients as well as corporate clients. Depending on the distances between the client and the server hosting the Web service, data might need to traverse a variety of networks for end-to-end communication. The focus of this book is the network that interconnects the servers located in an enterprise data center or the data center of an ISP offering collocation services. We describe the features and functions of the networking equipment and servers in sufficient depth to help a network architect in the design of enterprise IP network architectures. We take a layered approach, starting from the application layer down to the physical layer, to describe the implications for the design of network architectures. We describe not only high-level architectural principles, but also key details such as tuning the transport layer for an optimally working network. We discuss in detail how the building blocks of the network architecture are constructed and how they work, so that you can make more informed design decisions. Finally, we present actual tested configurations, providing a baseline for customizing and extending these configurations to meet actual customer requirements.

Network Traffic Characteristics


One of the first steps in designing network architectures for the data center is understanding network traffic patterns. FIGURE 1-4 shows how the application layer fits within the overall networking infrastructure. The network architecture must be designed to meet the bandwidth and latency requirements of the enterprise network applications, both at steady state and during episodes of congestion. We will provide an overview of some of the tiers that are typically deployed in all multi-tier solutions. There are several reasons for partitioning the solution into tiers, including the ability to control network traffic access between tiers. Because switches look at traffic from source to destination, dividing the application layer into tiers allows each tier to be mapped one-to-one to a corresponding virtual local area network (VLAN), which in turn maps directly to a specific network or subnet.
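To make that mapping concrete, the following is a hypothetical sketch. The VLAN IDs are invented for illustration; the subnets follow the addressing used in the example implementation shown in FIGURE 1-9:

    Tier                       VLAN ID    Subnet
    Web service tier           10         10.10.0.0/24
    Directory service tier     20         10.20.0.0/24
    Database service tier      30         10.30.0.0/24
    Application service tier   40         10.40.0.0/24

Because each tier sits in its own VLAN and subnet, any traffic crossing tiers must pass through a routed boundary, which is exactly where access can be controlled.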


[FIGURE 1-4: Influence of Multi-Tier Software Architectures on Network Architecture]

Chapter 2 offers insight into the applications that generate the traffic flows across the tiers. Inter-tier traffic starts with a client request, which can originate from remote dial-up, an intranet corporate employee, an Internet partner, and so on. This HyperText Transfer Protocol (HTTP) or HTTP over SSL (HTTPS) packet is usually about a hundred bytes. The server response is usually a 1000-byte to 200-kilobyte file, often consisting of Web page images. Chapter 2 describes the key components and technologies used at the application layer and provides some deeper insights into achieving availability.
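To put those sizes in perspective, a complete client request is often little more than a request line and a few headers; the host name and path below are invented for illustration:

    GET /catalog/index.jsp HTTP/1.1
    Host: store.example.com
    Accept: text/html

On the wire this is on the order of a hundred bytes, while the HTML and images returned in the response can run from a kilobyte to hundreds of kilobytes, so the server-to-client direction dominates inter-tier bandwidth planning.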

The processing of client Web requests and generation of an HTTP response may require significant processing across various Web and application or legacy servers. Examples of typical applications include business applications implemented using Enterprise JavaBeans (EJBs) on an application server, mail messaging, and dynamic Web page generation using JavaServer Pages (JSP) and servlets. The nature of the traffic requirements should be clearly identified and quantified. Most important are the identification and specification of handling peaks or bursts. We provide detailed Web, application, and database tier traffic flows and availability strategies, which directly impact inter-tier traffic flows.

End-to-End Session: Tuning the Transport Layer


One key factor in successful deployments is an optimally working network. In Chapter 3, we describe how to tune server-side Transmission Control Protocol (TCP) to meet the challenges that arise as a result of clients connecting at different network bandwidths and latencies. TCP is a complex window-based protocol that must be tuned in order to achieve good throughput. FIGURE 1-5 shows the importance of end-to-end connectivity and tuning TCP due to the widely different latencies, bandwidths, and congestion that each client may be subjected to when connecting to the data center network server. Almost all current material provides blind recommendations about which parameters to tune. We describe exactly which parameters are important to tune and the impact that tuning has on other parts of TCP. A network architect is often consulted about how to improve performance after a network has been designed and implemented. Chapter 3, although used after the design phase, was included to fill a void in this area.
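On Solaris, the TCP tunables covered in Chapter 3 are adjusted with the ndd utility. The fragment below is only a sketch of the mechanism, not a recommendation; the values are placeholders, and Chapter 3 explains how to derive appropriate ones from link bandwidth and round-trip time:

    # Raise the TCP transmit and receive buffer high-water marks so that a
    # high bandwidth-delay-product path can be kept full (values illustrative).
    ndd -set /dev/tcp tcp_xmit_hiwat 400000
    ndd -set /dev/tcp tcp_recv_hiwat 400000

    # Let the congestion window grow to match the larger buffers.
    ndd -set /dev/tcp tcp_cwnd_max 400000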


[FIGURE 1-5: Transport Layer Traffic Flows Tuned According to Client Links]

Network Edge Traffic Steering: IP Services


FIGURE 1-6 shows an overview of the various IP services that can be used in creating a multi-tier data center network architecture. These services are essentially packet-processing functions that alter the flow of traffic from client to server to improve certain aspects of the architecture. Firewalls and Secure Sockets Layer (SSL) are added to increase security. Server load balancing (SLB) is used to increase availability, scalability, flexibility, and performance. In Chapter 4, we describe key services that the architect can leverage, including in-depth explanations of how they work and which variant is the best and why. A question that is often asked is which server load balancing algorithm is best and why. We provide a detailed technical analysis explaining exactly why one algorithm is the best. We also provide a detailed explanation of the new emerging quality of service (QoS), which has gained more importance because of the increasing deployment of time-dependent applications, such as Voice over IP (VoIP) or multimedia. Most enterprise networks are overprovisioned. Normal steady-state network flows are usually not an issue. What really concerns most competent network architects is how to handle peaks or bursts. Every potential incoming HTTP request could be a revenue-producing opportunity that cannot be discarded. Here is where QoS plays an essential role. One of the missing pieces in most network architectures is planning for handling peak workloads and providing differentiated services. When there is congestion, we absolutely must prioritize and service the important customers ahead of casual browsing Web surfers. Quality of service will be discussed in detail: its importance, where to use it, and how it works.

[FIGURE 1-6: Data Center Edge IP Services]


Server Networking Internals


The design of data center network architectures involves not only an understanding of networking equipment but, equally important, of the network interface cards (NICs) and the software protocols and drivers on the servers that provide services (see FIGURE 1-7). There are a variety of different cards that the architect can choose from. Depending on the requirements, appropriate design choices dictate which NIC is most suitable. Further tuning of the drivers that directly interface with NICs and the rest of the server protocol stack requires an understanding of server networking internals. Chapter 5 presents a survey of available NICs and then provides an overview of the internal operations and insights into tuning key parameters.
[FIGURE 1-7: Data Center Networking Considerations on the Server]


Network Availability Design Patterns


Chapter 6 presents the various approaches to achieving high-availability network designs, including trade-offs and recommendations. We describe various techniques, provide some detailed configuration examples of the different ways to connect servers to the edge switches, and touch on how data center switches can be configured for increased availability, as shown in FIGURE 1-8. The material in Chapter 6 is based on actual customer experiences and has proven to be quite valuable to the network architect. We describe the following Layer 2 approaches:
- Trunking - NIC, server side
- Trunking - switch side, including Distributed Multi-link Trunking (DMLT)
- Spanning Tree Protocol (STP)

The following Layer 3 strategies are described:


- Virtual Router Redundancy Protocol (VRRP) default router redundancy mechanisms
- IP Multipathing (IPMP) NIC redundancy
- Open Shortest Path First (OSPF) and Routing Information Protocol (RIP) data center routing protocol availability features

The advantages and disadvantages will be described, along with suggestions on which approach makes sense for which situation.
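As a small preview of the Layer 3 material, the sketch below shows the general shape of an IPMP configuration on a Solaris server: two interfaces are placed in the same multipathing group so that if one loses its link, its address fails over to the other. The interface names, group name, and addresses are invented for illustration; Chapter 6 discusses the real design trade-offs:

    # /etc/hostname.ce0 -- data address, member of the IPMP group
    10.10.0.100 netmask + broadcast + group webtier-ipmp up

    # /etc/hostname.ce1 -- standby interface in the same group; its address is
    # a test address marked deprecated and -failover so it never migrates
    10.10.0.101 netmask + broadcast + deprecated -failover group webtier-ipmp standby up

A server-side mechanism such as IPMP is typically paired with a switch-side mechanism such as VRRP on the default routers, since each covers failure modes the other cannot.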


[FIGURE 1-8: Availability Strategies in the Data Center]

Reference Implementations
The final chapter ties together all the concepts from previous chapters and describes actual network architectures from a complete solution standpoint. The material in Chapter 7 is based on actual tested configurations. FIGURE 1-9 shows an example of the tested configurations. The solution described in Chapter 7 is generic enough to be useful for actual solutions, yet customizable for specific requirements. The logical network architecture describes a high-level overview of the Layer 3 networks and segregates the various service tiers. IP services, which are implemented at key boundary points of the architecture, are then reviewed. We describe the design considerations that lead to the physical architecture. We discuss two different architectural approaches and show how different network switch vendors can be deployed, including the advantages and disadvantages of each. We present detailed descriptions of configurations and implementations. A hardware-based firewall is used to show a logical firewall solution, providing security between each tier, yet using only one appliance. For increased availability, a second appliance is optionally added.


[FIGURE 1-9: Example Implementation of an Enterprise Multi-Tier Data Center]

CHAPTER 2

Network Traffic Patterns: Application Layer


In the design of network architectures, it is essential to understand the applications used and the resulting network traffic injected into the environment. There are many classes of networked applications. This chapter focuses on Web-based applications that are predominant in the data center. Most Web-based applications can be partitioned into the following functional tiers:
- Presentation Tier
- Web Tier
- Application Tier
- Naming Services Tier
- Data Tier

In this chapter we will explore the main components of a typical enterprise Services on Demand network architecture and some of the more important underlying issues that impact network architecture design decisions. We describe in detail the Web tier and Application tier, pointing out issues that impact design decisions for availability, performance, manageability, security, and scalability. We then describe some example architectures that were actually deployed in industry. Topics include:
- "Services on Demand Architecture" on page 18 describes the overall architecture from a software perspective, showing the applications that generate the network traffic.

- "Multi-Tier Architecture and Traffic Patterns" on page 20 describes the mapping process from the logical architecture to the physical realization onto the network architecture. It also describes the inter-tier traffic patterns and the reasons behind the network traffic.

- "Web Services Tier" on page 23 describes the most important tier, which is present in all Web-based architectures. This section provides detailed insights into the applications that run on this tier and directly impact the network architecture.

- "Application Services Tier" on page 26 describes the Application tier and its relationship to the Web services tier. This section provides detailed insights into the applications that run on this tier and directly impact the network architecture.

- "Architecture Examples" on page 29 provides examples of various architectures based on design trade-offs and the reasons behind them. It is important to note that it is the application characteristics that influence the design of the network architecture.

- "Example Solution" on page 34 describes an actual tested and implemented multi-tier architecture, using the design concepts and principles described in this chapter.

Services on Demand Architecture


Enterprise architects are faced with a wide and often confusing variety of deployment options for configuring corporate applications that are accessed not only by employees but also by customers and business partners. These applications can include legacy mainframe, traditional client-server, Web-based, and other applications. The challenge is to create a unified framework where all these different software technologies can be easily managed, deployed, and universally accessible. Conceptually, all these applications can be considered as services. When a user needs a service, it should be immediately available on demand. A Services on Demand architecture consists of a framework of completely integrated technologies that empowers enterprise architects to develop, manage, integrate, and deploy any application within an open standards-based set of protocols that can be accessed by any Web client running on any device such as PC desktops, laptops, PDAs, or cell phones. Sun ONE is a set of products that uses an open standards-based software architecture. Sun ONE is designed for implementing and deploying Web-based applications, also known as Services on Demand. The tiers are usually segregated based on functionality and security. The front tier usually performs some form of presentation functionality. The Application tier usually performs the business logic, and the Data tier maintains persistent storage of data. In actual practice, the tiers are usually less distinct because of optimizations for security or performance. In addition, designs will vary due to optimizations for performance, availability, and security. These directly impact the network architecture. The Sun ONE architecture spans the Web, Application, and Data tiers. However, the focus is on the Application tier. Sun ONE provides a common framework where all these components can be developed, implemented, and deployed, being assured of tested and proven integration capabilities.
FIGURE 2-1 illustrates the main components of a typical multi-tier architecture for deploying Web-based applications. Note the firewall component is not included in this illustration to simplify the focus of this discussion.

FIGURE 2-1 Main Components of Multi-Tier Architecture

Multi-Tier Architecture and Traffic Patterns


We first take a look at the high-level logical view of a typical multi-tier architecture. FIGURE 2-2 shows the main logical networks that support an enterprise multi-tier based service infrastructure:

- Client Network: 172.16.0.0
- External Network: 192.168.10.0
- Web Tier Network: 10.10.0.0
- Directory Tier Network: 10.20.0.0
- App Serv Tier Network: 10.30.0.0
- Database Tier Network: 10.100.0.0
- Management Network: 10.100.0.0 (access to all networks)
- Backup Network: 10.110.0.0
- SAN Network: 10.50.0.0

FIGURE 2-2 Logical View of Multi-Tier Service on Demand Architecture

FIGURE 2-2 shows how the various services map directly to a corresponding logical Layer 3 network cloud, which then maps directly onto a Layer 2 VLAN. The mapping process starts with the high-level model of the services to be deployed onto the physical model. This top-down approach allows network architects to maintain some degree of platform neutrality. The target hardware can change or scale, but the high-level model remains intact.(1)

Mapping Tiers to the Network Architecture


This mapping process allows the software architecture to be decoupled from the hardware architecture, resulting in a flexible modular solution. From a network architecture perspective, there are two key tools you can use:
- Layer 2 VLANs segregate Layer 2 broadcast domains and service domains. An example of a service domain would be a group of Web servers, load balanced, horizontally scaled, and aggregated to provide a highly available service with a single IP access point, commonly deployed in actual practice as a VIP on a load balancer.

- Layer 3 IP networking segregates Layer 3 routed domains and service domains. Segregating service domains based on IP addresses makes a service network accessible to any host on any Layer 3 IP network. One advantage of this approach is that the service interface for each cloud only needs to be one endpoint, which is easily implemented by a virtual IP (VIP) address. The service is actually provided across many subservice instances running on physically separated servers, collectively forming a logical cluster. The external world does not need to know (and should not know, for many reasons, especially security) about the individual servers that provide the service. By creating a layer of indirection, the requesting client need not be modified if any one server is removed or replaced. This decoupling improves manageability and serviceability.

This mapping process allows better control of the network traffic by providing a mechanism for routers and switches to steer the traffic according to user-defined rules. In actual practice, these user-defined rules are implemented by configuring VLANs, static routes, and access control lists (ACLs). A further benefit is that traffic can be filtered at wire speed to identify flows for other services such as Quality of Service (QoS).

(1) Keep in mind the physical constraints imposed by the actual target hardware. Examples of physical constraints include the number of ports on a network switch and computing capacities.

Inter-tier Traffic Flows


Understanding the traffic flows is important when determining the inter-tier link bandwidth requirements. FIGURE 2-3 illustrates the typical network traffic flows as a result of a Web-based transaction. TABLE 2-1 describes each flow in detail, corresponding to the numbers in the illustration.

FIGURE 2-3 Network Inter-tier Traffic Flows of a Web-based Transaction

The Item column in TABLE 2-1 corresponds with the numbers in FIGURE 2-3.
TABLE 2-1   Network Inter-tier Traffic Flows of a Web-based Transaction

Item  Interface1           Interface2           Protocol                Description
1     Client               Switch               HTTP                    Client initiates Web request.
2     Switch               Web service          HTTP                    Switch redirects client request to a
                                                                        particular Web server based on L2-L7
                                                                        and SLB configuration.
3     Web service          Directory service    LDAP                    Web service requests directory service.
4     Directory service    Web service          LDAP                    Directory service resolves request.
5     Web service          Application service  RMI                     Servlet obtains handle to EJB bean and
                                                                        invokes a method on the remote object.
                                                                        The Web server talks to the iAS through
                                                                        a Web connector, which uses NSAPI,
                                                                        ISAPI, or optimized CGI.
6     Application service  Database service     Oracle proprietary TNS  Entity Bean requests to retrieve or
                                                                        update a row in a DB table.
7     Database service     Application service  Oracle proprietary TNS  Entity Bean request completed.
8     Application service  Web service          RMI                     Application server returns dynamic
                                                                        content to Web server.
9     Web service          Switch               HTTP                    Switch receives reply from Web server.
10    Switch               Client               HTTP                    Switch rewrites IP header, returns HTTP
                                                                        response to client.

Web Services Tier


Design strategies for the Web Services tier are directly influenced by the characteristics of the software components that run on the Web server. These components include static Web pages, JavaServer Pages (JSP), and servlets. In this section we describe the important characteristics, how they work, and how they impact the design of the network architecture. The Sun ONE Web server is used to illustrate the concepts presented in this chapter. It is important to note that the Sun ONE Web server can be deployed as a standalone product or as an integrated component within the Sun ONE Application Server. This has implications for the design strategies detailed in this chapter.

FIGURE 2-4 Model of Presentation/Web Tier Components and Interfacing Elements

FIGURE 2-4 provides an overview of a high-level model of the Presentation, Web, Application, and Data-tier components and interfacing elements. The following describes the sequence of interaction between the client and the multi-tier architecture:

1. Client initiates a Web request; the HTTP request reaches the Web server.

2. The Web server processes the client request and passes it to the backend Application Server, which contains another Web server with a servlet engine.

3. The servlet processes a portion of the request and requests further service from an Enterprise Java Bean (EJB) running on an Application Server containing an EJB container.

4. The EJB retrieves data from the database.

The Sun ONE Application Server comes with a bundled Web server container. However, reasons for deploying a separate Web tier include security, load distribution, and functional distribution. There are two availability strategies that depend on the type of operations that are executed between the client and Web server processes:

- Stateless and idempotent: If the nature of transactions is idempotent (where transactions do not depend on one another), then the availability strategy at the Web tier is trivial. Both availability and scalability are achieved by replication: Web servers are added behind a load-balancer switch. This class of transactions includes static Web pages and simple servlets that perform a single computation.

- Stateful: If the transactions between the client and server require that state be maintained between individual client HTTP requests and server HTTP responses, then the problem of availability is more complicated, as discussed in this section. Examples of this class of applications include shopping carts, banking transactions, and the like.

The Sun ONE Web servers provide various services including SSL, a Web container that serves static content, JSP software, and a servlet engine. Availability strategies include a front-end multilayer switch with load-balancing capabilities and the ability to switch based on SSL session IDs and cookies. If the Web servers are only serving static pages, then the load balancer provides sufficient availability; if any Web server fails, subsequent client requests are forwarded to the remaining surviving servers. However, if the Web servers are running JSP software or servlets that require session persistence, the availability strategy is more complex. Implementing session failover capabilities can be accomplished by coding, Web container support, or a combination of both. There are several complications, including the fact that even if the transparent session failover problem is solved for failures that occur at the beginning of transactions, non-idempotent transactions still pose a problem when they have started and then failed, because the client is unaware of the server state. A programmatic session failover solution can involve leveraging the javax.servlet.http.HttpSession object, storing and retrieving user session state to or from an LDAP directory or database using cookies in the client's HTTP request. Some Web containers provide the ability to cluster HttpSession objects using elaborate schemes, but these still have flaws, such as failures in the middle of a transaction. These clustering schemes involve memory-based or database-based session persistence and a replicated HttpSession object on a backup server. If the primary server fails, the replica takes over. The Sun ONE Web server availability strategy for HttpSession persistence is to extend the IWSSessionManager, which in multiprocess mode can share session information across multiple processes running on multiple Web servers. This means that a client request has an associated session ID, which identifies the specific client. This information can be saved and subsequently retrieved either in a file that resides on a Network File System (NFS) mounted directory or in a database; the IWSSessionManager creates an IWSHttpSession object for each client session. The IWSSessionManager requires some coding effort to support distributed sessions, so that if the primary server that maintained a particular session fails, the standby server running another IWSSessionManager can retrieve the persistent session information from the persistent store based on the session ID. Logic is also required to ensure the load balancer redirects the client's HTTP request to the backup Web server based on additional cookie information.

Currently there is no support for SSL session failover in Sun ONE Web Server 6.0. HttpSession failover can be implemented by extending IWSSessionManager using a shared NFS file or database session persistence strategies, providing user control and flexibility.
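To make the session-state discussion concrete, the following minimal servlet sketch shows the HttpSession usage that any of these persistence strategies must ultimately back. It uses only the standard javax.servlet API; the servlet class and the "visits" attribute name are hypothetical, and a real deployment would rely on the container (or an extended IWSSessionManager) to persist the attributes keyed by the session ID.

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    public class VisitCounterServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // getSession(true) returns the session identified by the client's
            // session cookie, creating a new session (and cookie) if needed.
            HttpSession session = req.getSession(true);

            // State stored here is exactly what a session manager must
            // persist (to an NFS file, LDAP, or a database) for a standby
            // server to recover it after a failover.
            Integer visits = (Integer) session.getAttribute("visits");
            int count = (visits == null) ? 1 : visits.intValue() + 1;
            session.setAttribute("visits", new Integer(count));

            resp.setContentType("text/plain");
            resp.getWriter().println("session " + session.getId()
                    + " visit " + count);
        }
    }

The session ID printed above is the key a standby session manager would use to fetch the persisted attributes from the shared store after a failover.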

Application Services Tier


Design strategies for increased availability for the Application Services tier can become complex because of the different entities with varying availability requirements. Some entities require failover mechanisms and some do not. This section presents a survey of various availability strategies implemented by various vendors and then discusses examples of the various Sun ONE Application Server architectures and associated availability strategies. At the Application Server tier, you are working with the following entities:
- HttpSession: This is the client session object that the Web container creates and manages for each client HTTP request. Session failover mechanisms were described in the previous section.

- Stateless session bean: This type of EJB does not require any session failover services. If a client request requires logic to be executed in a stateless session bean, and the server where that bean is deployed fails, an alternative server can redo the operation correctly without any knowledge of the failed bean. Failure detection by the client plug-in or application logic must detect when the operation has failed and reinitiate the same operation on a secondary server with the appropriately deployed EJB component.

- Stateful session bean: This type of EJB component requires sophisticated mechanisms to maintain state between the primary and backup, in addition to the failover mechanisms required in the stateless session bean case.

- Entity bean: There are two types of Entity Beans: Container Managed Persistence (CMP) and Bean Managed Persistence (BMP). These essentially differ in whether the container or the user code is responsible for ensuring persistence. In either case, session failover mechanisms other than those already provided in the EJB 2.0 specification are not required, because Entity Beans represent a row in a database and the notion of session is replaced by transaction. Clients usually access Entity Beans at the start of transactions. If a failure occurs, the entire transaction is rolled back. An alternative server can redo the transaction, resulting in correct operation.
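A short sketch may help show why the stateless case is the easy one. The bean below is a hypothetical EJB 2.0 stateless session bean (home and remote interfaces omitted); because it keeps no conversational state, a failed call can simply be reissued against the same bean deployed on any surviving server.

    import javax.ejb.SessionBean;
    import javax.ejb.SessionContext;

    public class QuoteBean implements SessionBean {

        // Business method: an idempotent read that is safe to re-execute
        // on a secondary server if the primary fails mid-call.
        public double getQuote(String symbol) {
            return lookupPrice(symbol);
        }

        private double lookupPrice(String symbol) {
            return 42.0; // placeholder for a real data-source lookup
        }

        // EJB 2.0 lifecycle callbacks; empty because there is no state
        // to save, restore, or synchronize with a backup instance.
        public void ejbCreate() {}
        public void ejbRemove() {}
        public void ejbActivate() {}
        public void ejbPassivate() {}
        public void setSessionContext(SessionContext ctx) {}
    }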

The degree of transparency of the failover requires some consideration. In some cases, the client is completely unaware that a failure occurred and an automatic failover action took place. In other situations, the client times out and must reinitiate a transaction.

FIGURE 2-5 High-Level Survey of EJB Availability Mechanisms


FIGURE 2-5 shows an abstract logical overview of the various transactions that can transpire among a client, a Web server instance, and an Application server instance. Note that the firewall is not shown to simplify our discussion. FIGURE 2-5 illustrates three scenarios: Points 1 through 7 depict one scenario, Point 8 depicts the second scenario, and Point 9 depicts the third scenario. The numbered arrows in the figure correspond to the following:

Scenario 1
1. A client makes an HTTP request, which may contain some cookie state information to preserve state between that individual's HTTP requests to a particular server.

2. The load-balancer switch ensures that the client's request is forwarded to the appropriate server.

3. The JSP software or servlet retrieves a handle to a remote EJB object residing in the application server instance.

4. The client must first find the home object using a naming service such as the Java Naming and Directory Interface (JNDI). The returned object is cast to the home interface type (see the sketch after this list).

5. The client uses this home interface reference to create instances.

6. The client continues to create instances.

7. The application server provides replication services. When an EJB object is updated on the active application server instance, the standby server updates the corresponding backup EJB object's state information. These replication services are provided by the application server's system services.
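Steps 4 through 6 correspond to the standard EJB 2.0 client idiom sketched below. The Cart and CartHome interfaces and the JNDI name are hypothetical; only the JNDI and PortableRemoteObject calls are standard API, and the code assumes an application server with such a bean deployed.

    import javax.naming.Context;
    import javax.naming.InitialContext;
    import javax.rmi.PortableRemoteObject;

    public class CartClient {

        // Hypothetical EJB 2.0 remote and home interfaces for illustration.
        interface Cart extends javax.ejb.EJBObject {
            void addItem(String sku) throws java.rmi.RemoteException;
        }

        interface CartHome extends javax.ejb.EJBHome {
            Cart create() throws javax.ejb.CreateException,
                    java.rmi.RemoteException;
        }

        public static void main(String[] args) throws Exception {
            // Step 4: find the home object through JNDI and cast (narrow)
            // the returned reference to the home interface type.
            Context ctx = new InitialContext();
            Object ref = ctx.lookup("java:comp/env/ejb/Cart");
            CartHome home =
                    (CartHome) PortableRemoteObject.narrow(ref, CartHome.class);

            // Steps 5 and 6: use the home reference to create instances.
            Cart cart = home.create();
            cart.addItem("sku-1234");
            cart.remove();
        }
    }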

Scenario 2
8. A JNDI tree cluster manages replication of the EJB state updates and keeps track of the primary and replicated objects. This scenario occurs when vendor implementations use a modified JNDI as a clustering mechanism. In the standard JNDI implementation, multiple objects cannot bind to a single name, but with added logic each member of a cluster can have a local and a shared global JNDI tree. If the primary object fails, JNDI returns the backup object bound to a particular name. If a failure occurs after a client has performed a JNDI lookup, the client will hang or time out and try again. The subsequent request will be directed to a secondary server, which will have the correct state of the failed node for a particular entity.

Scenario 3
9. This scenario simply forwards the HTTP request to the Application Server using a plug-in. The HTTP request is received by the Application Server's HTTP server and recursively arrives at point 2 in FIGURE 2-5.

Another mechanism includes adding a replica-aware or cluster-aware stub to the EJB objects and system services, including a cluster module that runs on the appserver and is loaded through the deployment descriptor, if specified. The cluster module might consist of various subsystems that provide data synchronization services, keep the state of the backup EJB object synchronized with the primary, manage cluster failovers, and monitor the health of the appserver instances. If the primary appserver instance fails, the cluster failover manager can redirect client-side EJB method invocations to the backup node. Another approach involves the primary and secondary cluster nodes inserting and altering a cookie on the client's HTTP request, which applies in the case where the Web server and app server reside on the same server. If the primary node of the cluster fails, the load-balancing switch must be configured to redirect the request to the backup node of the cluster. The backup node must look at the client's cookie and retrieve state information. However, most of these solutions suffer one drawback: non-idempotent transactions are not handled transparently or properly in the event that a failure occurs after a method invocation has commenced. At the time of this writing, the Sun ONE Application Server 7 Enterprise Edition is expected to provide a highly available and scalable EJB clustering solution that allows enterprise customers to create solutions with minimal downtime.

Architecture Examples
This section describes three architecture designs. Deciding which architecture to choose can be reduced to identifying the following design objectives:
- Application partitioning: The application itself might make better use of resources by segregating or collapsing the Web tier and the Application tier. If an application makes heavy use of static Web pages, JSP software, or servlet code, and minimal use of the EJB architecture, it might make sense to horizontally scale the Web tier and have only one or two small application servers. Similarly, at the other end of the spectrum, it might make sense to deploy all the servlet and EJB WAR, JAR, and EAR files on the same application server if there is a lot of servlet-to-EJB communication.

- Security level: Separating the Web tier and Application Server tier with a firewall creates a more secure solution. The potential drawbacks include hardware and software costs, increased communication latencies between servlets and EJB components, and increased manageability costs.

- Performance: In some cases, customers are willing to forego tight security advantages for increased performance. For example, the firewall between the Web tier and the Application Server tier might be considered overkill because the ingress traffic is already firewalled in front of the Web tier.

- Scalability: Applications can be partitioned and deployed in two ways:

  - Horizontally scaled, where many small separate Web systems are utilized
  - Vertically scaled, where a few monolithic systems support many instances of Web servers

- Manageability: In general, the fewer the number of servers, the lower the total cost of operation (TCO).

The next three sections describe three architecture designs.


- "Designing for Vertical Scalability and Performance" on page 31 describes a vertically scaled design where the primary objectives are security and vertical scalability.

- "Designing for Security and Vertical Scalability" on page 32 describes a tightly coupled design where the primary objectives are performance between the Web tier and Application tier and vertical scalability.

- "Designing for Security and Horizontal Scalability" on page 33 describes a highly distributed solution whose primary design objectives are horizontal scalability and security.

It is the application characteristics that directly influence the network architecture.

Designing for Vertical Scalability and Performance


FIGURE 2-6 Decoupled Web Tier and Application Server Tier - Vertically Scaled

The architecture example shown in FIGURE 2-6 provides enhanced security. The Web server can be configured as a reverse proxy: it receives an HTTP request on the ingress network side from a client, then opens another socket connection on the appserver side to send an HTTP request to the Web server running inside the Sun ONE Application Server instance. Alternatively, the Web server instance could instantiate EJB components after performing a lookup on the home interface of a particular EJB component. One advantage of this decoupled architecture is independent scaling. If it turns out that the Web server servlets need to scale horizontally, they can do so independently of the application server logic. Similarly, if the EJB architecture's logic needs to scale or be modified, it can do so independently of the Web tier. Potential disadvantages include increased latency in Web tier to Application Server tier communications and increased maintenance.
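The reverse-proxy pattern described above reduces to accepting a socket on the ingress side and opening a second socket toward the application server. The sketch below shows just that two-socket relay; it is single-threaded, does no HTTP parsing, and the backend host name and ports are illustrative assumptions only.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class TinyReverseProxy {
        public static void main(String[] args) throws Exception {
            ServerSocket ingress = new ServerSocket(8080); // ingress side
            while (true) {
                Socket client = ingress.accept();
                // Second connection, opened toward the appserver side.
                Socket backend = new Socket("appserver.example.com", 8000);
                relay(client.getInputStream(), backend.getOutputStream());
                relay(backend.getInputStream(), client.getOutputStream());
                backend.close();
                client.close();
            }
        }

        // Copy one burst of bytes from in to out; a real proxy would parse
        // HTTP message boundaries instead of relying on available().
        private static void relay(InputStream in, OutputStream out)
                throws Exception {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
                if (in.available() == 0) {
                    break;
                }
            }
            out.flush();
        }
    }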

Designing for Security and Vertical Scalability


FIGURE 2-7 Tightly Coupled Web Tier and Application Server Tier - Vertically Scaled

The example shown in FIGURE 2-7 represents a collapsed architecture that takes advantage of the Web server already included in the Sun ONE Application Server instance process. This architecture is suitable for applications that have relatively intensive servlet-to-EJB communications and less stringent security requirements.

From an availability standpoint, fewer horizontal servers result in lower availability. A potential advantage of this architecture is lower maintenance cost because there are fewer servers to manage and configure.

Designing for Security and Horizontal Scalability


FIGURE 2-8 Decoupled Web Tier and Application Server Tier - Horizontally Scaled

The architecture shown in FIGURE 2-8 is a more horizontally scaled variant of the architecture shown in FIGURE 2-6. This results in increased availability. More server failures can be tolerated without bringing down services in this configuration.

Example Solution
This section describes an example of a tested and implemented multi-tier data center network architecture, shown in FIGURE 2-9. The network design is composed of segregated networks, implemented physically using VLANs configured by the network switches. This internal network used the 10.0.0.0 private IP address space for security and portability advantages. This design is an implementation of the design described in "Designing for Security and Horizontal Scalability" on page 33. It includes availability design principles, which are discussed further in Chapter 6. The management network allows centralized data collection and management of all devices. Each device has a separate interface to the management network to avoid contaminating the production network performance measurements. The management network is also used for JumpStart installation and terminal server access. Although several networks physically reside on a single active core switch, network traffic is segregated and secured using static routes, access control lists (ACLs), and VLANs. From a practical perspective, this can be as secure as separate individual switches, depending on the switch manufacturer's implementation of VLANs.

FIGURE 2-9 Tested and Implemented Architecture Solution

CHAPTER 3

Tuning TCP: Transport Layer


This chapter describes some of the key Transmission Control Protocol (TCP) tunable parameters related to performance tuning. More importantly, it describes how these tunables work, how they interact with each other, and how they impact network traffic when they are modified. Applications often recommend TCP settings for tunable parameters, but offer few details on the meaning of the parameters and the adverse effects that might result from the recommended settings. This chapter is intended as a guide to understanding those recommendations. It is intended for network architects and administrators who have an intermediate knowledge of networking and TCP; it is not an introduction to TCP terminology. The concepts discussed in this chapter build on basic terminology, concepts, and definitions. For an excellent resource, refer to Internetworking with TCP/IP Volume 1: Principles, Protocols, and Architecture by Douglas Comer, Prentice Hall, New Jersey.

Network architects responsible for designing optimal backbone and distribution IP network architectures for the corporate infrastructure are primarily concerned with issues at or below the IP layer: network topology, routing, and so on. However, in data center networks, servers connect either to the corporate infrastructure or to service provider networks, which host applications. These applications provide networked application services with additional requirements in the areas of networking and computer systems, where the goal is to move data as fast as possible from the application out to the network interface card (NIC) and onto the network. Designing network architectures for performance at the data center includes looking at protocol processing above Layer 3, into the transport and application layers. Further, the problem becomes more complicated because many clients' stateful connections are aggregated onto one server. Each client connection might have vastly different characteristics, such as bandwidth, latencies, or probability of packet loss. You must identify the predominant traffic characteristics and tune the protocol stack for optimal performance. Depending on the server hardware, operating system, and device driver implementations, there could be many possible tuning configurations and recommendations. However, tuning the connection-oriented transport layer protocol is often the most challenging.

This chapter includes the following topics:


- "TCP Tuning Domains" on page 38 provides an overview of TCP from a tuning perspective, describing the various components that contain tunable parameters and where they fit together from a high level, thus showing the complexities of tuning TCP.

- "TCP State Model" on page 48 proposes a model of TCP that illustrates the behavior of TCP and the impact of tunable parameters. The model then projects a network traffic diagram for a baseline case showing an ideal scenario.

- "TCP Congestion Control and Flow Control - Sliding Windows" on page 53 shows various conditions to help explain how and why TCP tuning is needed and which TCP tunable parameters are most effective at compensating for adverse conditions.

- "TCP and RDMA - Future Data Center Transport Protocols" on page 62 describes TCP and RDMA, promising future networking protocols that may overcome the limitations of TCP.

TCP Tuning Domains


Transmission Control Protocol (TCP) tuning is complicated because there are many algorithms running and controlling TCP data transmissions concurrently, each with slightly different purposes.

FIGURE 3-1 Overview of Overlapping Tuning Domains

FIGURE 3-1 shows a high-level view of the different components that impact TCP processing and performance. While the components are interrelated, each has its own function and optimization strategy.
- The STREAMS framework looks at raw bytes flowing up and down the STREAMS modules. It has no notion of TCP, congestion in the network, or the client load. It only looks at how congested the STREAMS queues are, and it has its own flow control mechanisms.

- TCP-specific control mechanisms are not tunable, but they are computed based on algorithms that are tunable.

- Flow control mechanisms and congestion control mechanisms are functionally completely different. One is concerned with the endpoints, and the other is concerned with the network. Both impact how TCP data is transmitted.

- Tunable parameters control scalability. TCP requires certain static data structures that are backed by non-swappable kernel memory. Avoid the following two scenarios:

  - Allocating large amounts of memory. If the actual number of simultaneous connections is fewer than anticipated, memory that could have been used by other applications is wasted.

  - Allocating insufficient memory. If the actual number of connections exceeds the anticipated TCP load, there will not be enough free TCP data structures to handle the peak load.

This class of tunable parameters directly impacts the number of simultaneous TCP connections a server can handle at peak load and thus controls scalability.

TCP Queueing System Model


The goal of TCP tuning can be reduced to maximizing the throughput of a closed loop system, as shown in FIGURE 3-2. This system abstracts all the main components of a complete TCP system, which consists of the following components:
- Server: the focus of this chapter.

- Network: the endpoints can only infer the state of the network by measuring and computing various delays, such as round-trip times, timers, receipt of acknowledgments, and so on.

- Client: the remote client endpoint of the TCP connection.

FIGURE 3-2 Closed-Loop TCP System Model

This section requires basic background in queueing theory. For more information, refer to Queueing Systems, Volume 1, by Leonard Kleinrock, 1975, Wiley, New York. In FIGURE 3-2, we model each component as an M/M/1 queue. An M/M/1 queue is a simple queue where packets arrive at a certain rate, which we've designated as λ, and are processed at a certain rate, which we've designated as μ. TCP is a full-duplex protocol; for the sake of simplicity, only one side of the duplex communication process is shown. Starting from the server side on the left in FIGURE 3-2, the server application writes a byte stream to a TCP socket. This is modeled as messages arriving at the M/M/1 queue at the rate λ. These messages are queued and processed by the TCP engine, which implements the TCP protocol and consists of various timers, algorithms, retransmit queues, and so on. It is modeled as the server process μ and is also controlled by the feedback loop shown in FIGURE 3-2. The feedback loop represents acknowledgements (ACKs) from the client side and receive windows. The server process sends packets to the network, which is also modeled as an M/M/1 queue. The network can be congested, hence packets are queued up. This captures latency issues in the network, which are a result of propagation delays, bandwidth limitations, or congested routers. In FIGURE 3-2 the client side is also represented as an M/M/1 queue, which receives packets from the network and the client TCP stack, processes the packets as quickly as possible, forwards them to the client application process, and sends feedback information to the server. The feedback represents the ACKs and receive window, which provide flow control capabilities to this system.
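The standard M/M/1 mean-value formulas make the model concrete: utilization rho = λ/μ, mean time in system T = 1/(μ - λ), and mean number in system N = rho/(1 - rho). The small program below evaluates them for assumed arrival and service rates.

    public class MM1Model {
        public static void main(String[] args) {
            double lambda = 8000.0;  // arrival rate, packets/sec (assumed)
            double mu = 10000.0;     // service rate, packets/sec (assumed)

            double rho = lambda / mu;                // utilization = 0.8
            double meanTime = 1.0 / (mu - lambda);   // 0.0005 sec in system
            double meanCount = rho / (1.0 - rho);    // 4 packets on average

            System.out.println("utilization            = " + rho);
            System.out.println("mean time in system    = " + meanTime + " sec");
            System.out.println("mean packets in system = " + meanCount);
            // As lambda approaches mu, delay grows without bound, which is
            // why TCP tries to keep the injection rate below the bottleneck
            // service rate rather than filling queues.
        }
    }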

Why the Need to Tune TCP


FIGURE 3-3 shows a cross-section view of the sequence of packets sent from the server to the client of an ideally tuned system. Send window-sized packets are sent one after another in a pipelined fashion, continuously, to the client receiver. Simultaneously, the client sends back ACKs and receive windows in unison with the server. This is the goal we are trying to achieve by tuning TCP parameters. Problems crop up when delays vary because of network congestion, asymmetric network capacities, dropped packets, or asymmetric server/client processing capacities. Hence, tuning is required. To see the TCP default values for your version of Solaris, refer to the Solaris documentation at docs.sun.com.


FIGURE 3-3 Perfectly Tuned TCP/IP System

In a perfectly tuned TCP system spanning several network links of varying distances and bandwidths, the client sends back ACKs to the sender in perfect synchronization with the start of the next send window. The objective of an optimal system is to maximize throughput. In the real world, asymmetric capacities require tuning on both the server and client sides to achieve optimal throughput. For example, if the network latency is excessive, the amount of traffic injected into the network is reduced to more closely match the capacity of the network. If the network is fast enough but the client is slow, the feedback loop alerts the sender TCP process to reduce the amount of traffic injected into the network. Later sections build on these concepts to describe how to tune for wireless, high-speed wide area networks (WANs), and other types of networks that vary in bandwidth and distance.

FIGURE 3-4 shows the impact of the links increasing in bandwidth; therefore, tuning is needed to improve performance. The opposite case is shown in FIGURE 3-5, where the links are slower. Similarly, if the distances increase or decrease, the resulting propagation delays require tuning for optimal performance.
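A useful rule of thumb behind this tuning is the bandwidth-delay product: the window needed to keep a link busy is the link bandwidth multiplied by the round-trip time. A minimal calculation, with assumed link numbers, is sketched below.

    public class WindowSizing {
        public static void main(String[] args) {
            double bandwidthBitsPerSec = 100e6; // assumed 100 Mbit/sec link
            double rttSeconds = 0.050;          // assumed 50 ms round trip

            // Window (bytes) = bandwidth (bytes/sec) * round-trip time (sec)
            double windowBytes = (bandwidthBitsPerSec / 8.0) * rttSeconds;
            System.out.println("window to fill the pipe: "
                    + windowBytes + " bytes");
            // 100 Mbit/sec x 50 ms = 625,000 bytes, far larger than small
            // default windows, which is why long fast links need tuning.
        }
    }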

FIGURE 3-4 Tuning Required to Compensate for Faster Links


FIGURE 3-5 Tuning Required to Compensate for Slower Links

TCP Packet Processing Overview


Now let's take a look at the internals of the TCP stack inside the computing node. We limit the scope to the server on the data center side for TCP tuning purposes; since the clients are symmetrical, we can tune them using the exact same concepts. In a large enterprise data center, there could be thousands of clients, each with a diverse set of characteristics that impact network performance. Each characteristic has a direct impact on TCP tuning and hence on overall network performance. By focusing on the server and considering different network deployment technologies, we essentially cover the most common cases.

FIGURE 3-6 Complete TCP/IP Stack on Computing Nodes


FIGURE 3-6 shows the internals of the server and client nodes in more detail.

To gain a better understanding of TCP protocol processing, we will describe how a packet is sent up and down a typical STREAMS-based TCP implementation. Consider the server application on the left side of FIGURE 3-6 as a starting point. The following describes how data is moved from the server to the client on the right.

1. The server application opens a socket. (This triggers the operating system to set up the STREAMS stack, as shown.) The server then binds to a transport layer port, executes listen, and waits for a client to connect. Once the client connects, the server completes the TCP three-way handshake, establishes the socket, and both server and client can communicate.

2. The server sends a message by filling a buffer, then writing to the socket.

3. The message is broken up and packets are created, then sent down the stream (down the write side of each STREAMS module) by invoking the wput routine. If a module is congested, the packets are placed on its queue for deferred processing by the service routine. Each network module prepends the packet with an appropriate header.

4. Once the packet reaches the NIC, the packet is copied from system memory to the NIC memory, transmitted out of the physical interface, and sent into the network.

5. The client reads the packet into the NIC memory; an interrupt is generated that copies the packet into system memory, and the packet goes up the protocol stack, as shown on the right for the client node.

6. The STREAMS modules read the corresponding header to determine the processing instructions and where to forward the packet. Headers are stripped off as the packet is moved upward on the read side of each module.

7. The client application reads in the message as the packet is processed and translated into a message, filling the client read buffer.

The Solaris operating system (Solaris OS) offers many tunable parameters in the TCP, User Datagram Protocol (UDP), and IP STREAMS module implementations of these protocols. It is important to understand the goals you want to achieve so that you can tune accordingly. In the following sections, we provide a high-level model of the various protocols and provide deployment scenarios to better understand which parameters are important to tune and how to go about tuning them. We start with TCP, which is by far the most complicated module to tune and has the greatest impact on performance. We then describe how to modify these tunable parameters for different types of deployments. Finally, we describe IP and UDP tuning.
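Steps 1 and 2 map directly onto the standard sockets API. The minimal server below (a sketch, with an arbitrary port) performs the socket, bind, listen, accept, read, and write sequence described above; in Java, bind and listen both happen inside the ServerSocket constructor.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class EchoServer {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(9000); // bind + listen
            Socket conn = server.accept();  // returns after the three-way
                                            // handshake completes
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            PrintWriter out = new PrintWriter(conn.getOutputStream(), true);

            String line = in.readLine();    // bytes coming up the stack
            out.println("echo: " + line);   // bytes going down the stack
            conn.close();
            server.close();
        }
    }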

TCP STREAMS Module Tunable Parameters


The TCP stack is implemented using existing operating system application programming interfaces (APIs). The Solaris OS offers a STREAMS framework, originating from AT&T, which was originally designed to allow a flexible modular software framework for network protocols. The STREAMS framework has its own tunable parameters, for example sq_max_size, which controls the depth of a STREAMS syncq. This impacts how raw data messages are processed for TCP. FIGURE 3-7 provides a more detailed view of the facilities provided by the Solaris STREAMS framework.

FIGURE 3-7 TCP and STREAM Head Data Structures Tunable Parameters (streamhead watermarks tcp_sth_rcv_hiwat and tcp_sth_rcv_lowat; TCP lookup hash tables; the pending three-way-handshake queue tcp_conn_req_max_q0 and the listener backlog queue tcp_conn_req_max_q)


FIGURE 3-7 shows some key tunable parameters for the TCP-related data path. At the top is the streamhead, which has a separate queue for TCP traffic, where an application reads data. STREAMS flow control starts here. If the operating system is sending data up the stack to the application and the application cannot read it as fast as the sender is sending it, the stream read queue starts to fill. Once the number of packets in the queue exceeds the high-water mark, tcp_sth_rcv_hiwat, STREAMS-based flow control triggers and prevents the TCP module from sending any more packets up to the streamhead. There is some space available for critical control messages (M_PROTO, M_PCPROTO). The TCP module will be flow controlled as long as the number of packets is above tcp_sth_rcv_lowat. In other words, the streamhead queue must drain below the low-water mark to reactivate TCP to forward data messages destined for the application. Note that the write side of the streamhead does not require any high-water or low-water marks because it is injecting packets into the downstream, and TCP flow controls the streamhead write side by its own high-water and low-water marks, tcp_xmit_hiwat and tcp_xmit_lowat. Refer to the Solaris AnswerBook2 at docs.sun.com for the default values of your version of the Solaris OS.

TCP has a set of hash tables. These tables are used to search for the associated TCP socket state information on each incoming TCP packet, to maintain the state engine for each socket, and to perform other TCP tasks needed to maintain that connection, such as updating sequence numbers, updating windows, computing round-trip time (RTT), managing timers, and so on.

The TCP module has two queues for server processes. The first queue, shown on the left in FIGURE 3-7, is the set of packets belonging to sockets that have not yet established a connection: the server side has not yet received and processed a client-side ACK. If the client does not send an ACK within a certain window of time, the packet is dropped. This was designed to prevent synchronization (SYN) flood attacks, where a flood of unacknowledged client SYN requests could overwhelm a server and prevent valid client connections from being processed. The second queue is the listen backlog queue, where the client has sent back the final ACK, completing the three-way handshake; the server socket for this client moves the connection from LISTEN to ACCEPT, but the server has not yet processed this packet. If the server is slow, this queue fills up. The server can override this queue size with the listen backlog parameter. TCP flow controls IP on the read side with the parameters tcp_recv_lowat and tcp_recv_hiwat, similar to the streamhead read side.
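The listen backlog mentioned above is one of the few of these queue depths an application can request directly. In Java, for example, the second ServerSocket constructor argument asks for a larger backlog, which the kernel may still cap (on Solaris, by tcp_conn_req_max_q); the port and sizes below are illustrative assumptions.

    import java.net.ServerSocket;
    import java.net.Socket;

    public class BacklogServer {
        public static void main(String[] args) throws Exception {
            // Request a backlog of 1024 completed-but-unaccepted
            // connections; the kernel may silently reduce this value.
            ServerSocket server = new ServerSocket(9000, 1024);

            while (true) {
                Socket conn = server.accept(); // dequeues from the backlog
                // A larger receive buffer plays a role for data similar to
                // the one the backlog plays for connections: it absorbs
                // bursts while the application catches up.
                conn.setReceiveBufferSize(64 * 1024);
                conn.close();
            }
        }
    }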

TCP State Model


TCP is a reliable transport layer protocol that offers a full-duplex connection byte stream service. This reliability makes TCP appropriate for wide area IP networks, where there is a higher chance of packet loss or reordering. What really complicates TCP are the flow control and congestion control mechanisms. These mechanisms often interfere with each other, so proper tuning is critical for high-performance networks. We start by explaining the TCP state machine, then describe in detail how to tune TCP depending on the actual deployment. We also describe how to scale the TCP connection-handling capacity of servers by increasing the size of the TCP connection state data structures.
FIGURE 3-8 presents an alternative view of the TCP state engine.


FIGURE 3-8 TCP State Engine Server and Client Node

This figure shows the server and client socket API at the top and the TCP module with the following three main states:

Connection Setup
This includes the collection of substates that collectively set up the socket connection between the two peer nodes. In this phase, the set of tunable parameters includes:

- tcp_ip_abort_cinterval: the time a connection can remain in a half-open state during the initial three-way handshake, just prior to entering the established state. This is used on the client connect side.

- tcp_ip_abort_linterval: the time a connection can remain in a half-open state during the initial three-way handshake, just prior to entering the established state. This is used on the server passive listen side.

For a server, there are two trade-offs to consider:

- Long abort intervals: The longer the abort interval, the longer the server will wait for the client to send information pertaining to the socket connection. This might result in increased kernel consumption and possibly kernel memory exhaustion, because each client socket connection requires state information using approximately 12 kilobytes of kernel memory. Remember that kernel memory is not swappable; as the number of connections increases, so do the amount of consumed memory and the time delays for connection lookups. Hackers exploit this fact to initiate Denial of Service (DoS) attacks, in which attacking clients constantly send only SYN packets to a server, eventually tying up all kernel memory and preventing real clients from connecting.

- Short abort intervals: If the interval is too short, valid clients that have a slow connection or that go through slow proxies and firewalls could be aborted prematurely. This might help reduce the chances of DoS attacks, but slow clients might also be mistakenly terminated.

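Both intervals are set in milliseconds with ndd(1M). As a hedged illustration (the values shown are examples, not recommendations; defaults vary by Solaris release):

    # Inspect the current abort intervals (milliseconds)
    ndd -get /dev/tcp tcp_ip_abort_cinterval
    ndd -get /dev/tcp tcp_ip_abort_linterval

    # Example only: shorten the listen-side interval to 60 seconds so that
    # half-open connections from a SYN flood are reaped more aggressively
    ndd -set /dev/tcp tcp_ip_abort_linterval 60000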
Connection Established
This includes the main data transfer state (the focus of our tuning explanations in this chapter). The tuning parameters for congestion control, latency, and flow control will be described in more detail. FIGURE 3-8 shows two concurrent processes that read and write to the bidirectional full-duplex socket connection.

Connection Shutdown
This includes the set of substates that work together to shut down the connection in an orderly fashion. We will see important tuning parameters related to memory. Tunable parameters include:

- tcp_time_wait_interval: how long a connection remains in the TIME_WAIT state after an orderly close. A new connection reusing the same address and port pair cannot be created until this time has expired. However, if this value is too short and there have been many routing changes, lingering packets in the network might be lost.
- tcp_fin_wait_2_flush_interval: how long this side will wait for the remote side to close its side of the connection and send a FIN packet. There are cases where the remote side crashes and never sends a FIN. So to free up resources, this value puts a limit on the time the remote side has to close the socket. This means that half-open sockets cannot remain open indefinitely.

Note - tcp_close_wait is no longer a tunable parameter. Instead, use tcp_time_wait_interval.
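As an illustrative sketch (values in milliseconds are examples only), both shutdown-related timers are adjusted the same way:

    # How long a closed connection lingers in TIME_WAIT
    ndd -get /dev/tcp tcp_time_wait_interval

    # Example only: 60 seconds, to recycle connection state faster
    # on a busy HTTP server that opens and closes many connections
    ndd -set /dev/tcp tcp_time_wait_interval 60000

    # Example only: bound the wait in FIN_WAIT_2 for the peer's FIN
    ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500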

TCP Tuning on the Sender Side


TCP tuning on the sender side controls how much data is injected into the network toward the remote client end. Several concurrent schemes complicate tuning, so to better understand them we will separate the various components and then describe how these mechanisms work together. We will describe two phases: Startup and Steady State. Startup Phase tuning is concerned with how fast we can ramp up sending packets into the network. Steady State Phase tuning is concerned with other facets of TCP communication, such as tuning timers, maximum window sizes, and so on.

Startup Phase
In Startup Phase tuning, we describe how the TCP sender starts to initially send data on a particular connection. One of the issues with a new connection is that there is no information about the capabilities of the network pipe. So we start by blindly injecting packets at a faster and faster rate until we understand the capabilities and adjust accordingly. Manual TCP tuning is required to change macro behavior, such as when we have very slow pipes as in wireless or very fast pipes such as 10 Gbit/sec. Sending an initial maximum burst has proven disastrous. It is better to slowly increase the rate at which traffic is injected based on how well the traffic is absorbed. This is similar to starting from a standstill on ice. If we initially floor the gas pedal, we will skid, and then it is hard to move at all. If, on the other hand, we start slowly and gradually increase speed, we can eventually reach a very fast speed. In networking, the key concept is that we do not want to fill buffers. We want to inject traffic as close as possible to the rate at which the network and target receiver can service the incoming traffic. During this phase, the congestion window is much smaller than the receive window. This means the sender controls the traffic injected into the receiver by computing the congestion window and capping the injected traffic amount by the size of the congestion window. Any minor bursts can be absorbed by queues. FIGURE 3-9 shows what happens during a typical TCP session starting from idle.


[FIGURE 3-9: TCP Startup Phase. The plot shows congestion window size (Kbytes) over time: from tcp_slow_start_initial the window doubles each round trip until packet loss or ssthresh, then increases additively, capped by tcp_cwnd_max; a timeout restarts the cycle with a new ssthresh, and after an idle period the window restarts from tcp_slow_start_after_idle.]

The sender does not know the capacity of the network, so it starts to slowly send more and more packets into the network, trying to estimate the state of the network by measuring the arrival time of the ACKs and the computed RTT values. This results in a self-clocking effect. In FIGURE 3-9, the congestion window initially starts at a minimum size of one maximum segment size (MSS), as negotiated in the three-way handshake during the socket connection phase. The congestion window is doubled every time an ACK is returned within the timeout, and is capped by the TCP tunable variable tcp_cwnd_max, or until a timeout occurs. At that point, the ssthresh internal variable is set to half of tcp_cwnd_max. Below ssthresh, the congestion window grows exponentially on each acknowledged transmission; above ssthresh, it grows additively, as shown in FIGURE 3-9. Once a timeout occurs, the packet is retransmitted and the cycle repeats.
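To make the exponential phase concrete: under a simplified model that ignores delayed ACKs, if the congestion window starts at one MSS and doubles on each successfully acknowledged round trip, then after n round trips

    cwnd(n) = min(MSS x 2^n, tcp_cwnd_max)

For example, with a 1460-byte MSS, roughly six round trips are needed before the window first exceeds 64 Kbytes (2^6 x 1460 = 93440 bytes).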
FIGURE 3-9 shows that there are three important TCP tunable parameters:
- tcp_slow_start_initial: sets the initial congestion window just after the socket connection is established.
- tcp_slow_start_after_idle: initializes the congestion window after a period of inactivity. Since there is now some knowledge about the capabilities of the network, we can take a shortcut to grow the congestion window rather than starting from the minimum, which would be unnecessarily conservative.
- tcp_cwnd_max: places a cap on the running maximum congestion window. If the receive window grows, then tcp_cwnd_max grows to the receive window size.

In different types of networks, you can tune these values slightly to impact the rate at which you can ramp up. If you have a small network pipe, you want to reduce the packet flow, whereas if you have a large pipe, you can fill it up faster and inject packets more aggressively.
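For instance (illustrative values only; valid ranges vary by release), a host feeding a fat pipe might ramp up more aggressively:

    # Start slow start with a larger initial window (in segments)
    ndd -set /dev/tcp tcp_slow_start_initial 4
    ndd -set /dev/tcp tcp_slow_start_after_idle 4

    # Allow the congestion window to grow to 1 Mbyte
    ndd -set /dev/tcp tcp_cwnd_max 1048576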

Steady State Phase


In the Steady State Phase, after the connection has completed the initial startup phase and stabilized, tuning is limited to reducing delays due to network and client congestion. An average condition must be used because there are always some fluctuations in the network and client load that can be absorbed. To tune TCP in this phase, we look at the following network properties:

- Propagation Delay: primarily influenced by distance, this is the time it takes one packet to traverse the network. In WANs, tuning is required to keep the pipe as full as possible, increasing the number of allowable outstanding packets.
- Link Speed: the bandwidth of the network pipe. Tuning guidelines for a 56 kbit/sec dial-up connection differ from those for a 10 Gbit/sec optical local area network (LAN).

In short, tuning is adjusted according to the type of network and its associated key properties: propagation delay, link speed, and error rate. These properties actually self-adjust in some instances by measuring the return of acknowledgments. We will look at various emerging network technologies (optical WAN, LAN, wireless, and so on) and describe how to tune TCP accordingly.

TCP Congestion Control and Flow Control Sliding Windows


One of the main principles for congestion control is avoidance. TCP tries to detect signs of congestion before it happens and to reduce or increase the load into the network accordingly. The alternative of waiting for congestion and then reacting is much worse because once a network saturates, it does so at an exponential growth rate and reduces overall throughput enormously. It takes a long time for the queues to drain, and then all senders again repeat this cycle. By taking a proactive congestion avoidance approach, the pipe is kept as full as possible without the danger of network saturation. The key is for the sender to understand the state of the network and client and to control the amount of traffic injected into the system.

Flow control is accomplished by the receiver sending back a window to the sender. The size of this window, called the receive window, tells the sender how much data to send. Often, when the client is saturated, it might not be able to send back a receive window to the sender to signal it to slow down transmission. However, the sliding windows protocol is designed to let the sender know, before reaching a meltdown, to start slowing down transmission by a steadily decreasing window size. At the same time these flow control windows are going back and forth, the speed at which ACKs come back from the receiver to the sender provides additional information to the sender that caps the amount of data to send to the client. This is computed indirectly. The amount of data that is to be sent to the remote peer on a specific connection is controlled by two concurrent mechanisms:
- The congestion in the network: the degree of network congestion is inferred from changes in the Round Trip Time (RTT), that is, the amount of delay attributed to the network. This is measured by computing how long it takes a packet to travel from the sender to the receiver and back. Because of the large variance in samples, this figure is calculated using a running smoothing algorithm (sketched after this list). The RTT value is important for determining the congestion window, which is used to control the amount of data sent to the remote client. It tells the sender how much traffic should be sent on this particular connection based on network congestion.
- Client load: the rate at which the client can receive and process incoming traffic. The client sends a receive window that tells the sender how much traffic to send to this connection based on client load.
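The running smoothing algorithm referred to above is, in most TCP implementations, Jacobson-style exponential averaging. A sketch of the standard form follows (the gains g and h are implementation constants, typically 1/8 and 1/4; this is illustrative rather than Solaris-specific):

    err    = rtt_sample - srtt               (error in the current sample)
    srtt   = srtt + g x err                  (smoothed round-trip time)
    rttvar = rttvar + h x (|err| - rttvar)   (smoothed mean deviation)
    RTO    = srtt + 4 x rttvar               (retransmission timeout)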

TCP Tuning for ACK Control


FIGURE 3-10 shows how senders and receivers control ACK waiting and generation. The general strategy is to avoid exchanging many small packets. Receivers try to buffer up a number of received packets before sending back an acknowledgment (ACK) to the sender, which will trigger the sender to send more packets. The hope is that the sender will also buffer up more packets to send in one large chunk rather than many small chunks. The problem with small chunks is that the efficiency ratio, or useful link utilization, is reduced. For example, a one-byte data packet requires 40 bytes of IP and TCP header information and 48 bytes of Ethernet header information, so the ratio works out to 1/(88+1) = 1.1 percent utilization. When a 1500-byte packet is sent, however, the utilization can be 1500/(88+1500) = 94.5 percent. Now consider many flows on the same Ethernet segment: if all flows send small packets, the overall throughput is low. Hence, any effort to bias transmissions toward larger chunks without incurring excessive delays is a good thing, especially for interactive traffic such as Telnet. The arithmetic is worked out below.
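The utilization figures above follow directly from the per-packet overhead assumed in the text (88 bytes of combined headers per packet):

    utilization = payload / (overhead + payload)

    payload = 1 byte:     1 / (88 + 1)       ≈ 0.011  (1.1 percent)
    payload = 1500 bytes: 1500 / (88 + 1500) ≈ 0.945  (94.5 percent)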


[FIGURE 3-10: TCP Tuning for ACK Control. The figure annotates a sender/receiver exchange with the sender-side retransmission and abort timers (tcp_rexmit_interval_min; tcp_rexmit_interval_initial, 3 s, range 1 ms to 20 s; tcp_rexmit_interval_max, 60 s, range 1 ms to 120 min; tcp_ip_abort_interval, 8 min, range 500 ms to 1193 hrs) and the receiver-side delayed-ACK controls: tcp_deferred_ack_interval, 100 ms, range 1 ms to 60 s, for non-directly connected endpoints; tcp_local_dack_interval, 50 ms, range 1 ms to 60 s, for directly connected endpoints; and tcp_deferred_acks_max, 2, range 1 to 16, with tcp_local_dacks_max, 8, range 0 to 16, the maximum number of received segments (multiples of the MSS) before an ACK is forced out.]


FIGURE 3-10 provides an overview of the various TCP parameters. For a complete detailed description of the tunable parameters and recommended sizes, refer to your product documentation or the Solaris AnswerBooks at docs.sun.com.

There are two mechanisms that are used by senders and receivers to control performance:
- Senders: timeouts waiting for an ACK. This class of tunable parameters controls various aspects of how long to wait for the receiver to acknowledge the data that was sent. If tuned too short, excessive retransmissions occur. If tuned too long, excess idle time is wasted before the sender realizes the packet was lost and retransmits.
- Receivers: timeouts and number of bytes received before sending an ACK. This class of tunable parameters allows the receiver to control the rate at which the sender sends data. The receiver does not want to send an ACK for every packet received, because the sender would then send many small packets, increasing the ratio of overhead to useful data and reducing the efficiency of the transmission. However, if the receiver waits too long, there is excess latency that increases the burstiness of the communication. The receiver side can control ACKs with two overlapping mechanisms, based on timers and on the number of bytes received, as illustrated after this list.
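As an illustration of the receiver-side controls summarized in FIGURE 3-10 (the values shown are the defaults cited there; verify them against your release before changing anything):

    # Timer-based controls: how long an ACK may be deferred (milliseconds)
    ndd -set /dev/tcp tcp_deferred_ack_interval 100   # remote endpoints
    ndd -set /dev/tcp tcp_local_dack_interval 50      # directly connected

    # Byte-based controls: MSS-sized segments received before forcing an ACK
    ndd -set /dev/tcp tcp_deferred_acks_max 2
    ndd -set /dev/tcp tcp_local_dacks_max 8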


TCP Example Tuning Scenarios


The following sections describe example scenarios where TCP requires tuning, depending on the characteristics of the underlying physical media.

Tuning TCP for Optical Networks (WANs)


Typically, WANs are high-speed, long-haul network segments. These segments introduce some interesting challenges because of their properties. FIGURE 3-11 shows how the traffic changes as a result of a longer, yet faster, link, comparing a normal LAN and an optical WAN. The line rate has increased, resulting in more packets per unit time, but the delay from the time a packet leaves the sender to the time it reaches the receiver has also increased. This has the strange effect that far more packets are now in flight.


[FIGURE 3-11: Comparison between Normal LAN and WAN Packet Traffic. Upper panel: on an Ethernet LAN (100 Mbit/sec, 100 meters) the sender continuously sends Data1, Data2, Data3, and the receiver's ACK1, ACK2, ACK3 return while the send window (one RTT of data) is still draining, so data and ACK timings stay synchronized. Lower panel: on a Packet over Sonet (POS) WAN (1 Gbit/sec, 2500 miles) few packets are in flight relative to the pipe, time slots are wasted until the first ACK returns, and line delay makes each retransmission expensive, hence selective ACK is a major improvement.]


FIGURE 3-11 compares the number of packets in the pipe for a typical LAN (10 Mbit/sec over 100 meters, with an RTT of 71 microseconds), which is what TCP was originally designed for, and an optical WAN spanning New York to San Francisco at 1 Gbit/sec with an RTT of 100 milliseconds. The bandwidth-delay product represents the number of packets that are actually in the network and implies the amount of buffering the network must provide. It also gives some insight into the minimum window size, which we discussed earlier. Because the optical WAN has a very large bandwidth-delay product compared to a normal network, it requires tuning as follows:

The window size must be much larger. The current window size field allows for 2^16 bytes. To achieve larger windows, RFC 1323 was introduced to allow the window size to scale to larger sizes while maintaining backwards compatibility. This is achieved during the initial socket connection: during the SYN-ACK three-way handshake, window scaling capabilities are exchanged by both sides, which try to agree on the largest common capability. The scaling parameter is an exponent of base 2. The maximum scaling factor is 14, hence allowing a maximum window size of 2^30 bytes. The window scale value is used to shift the window size field value up to a maximum of 1 gigabyte. Like the MSS option, the window scale option should appear only in SYN and SYN-ACK packets during the initial three-way handshake. Tunable parameters include (a configuration example follows this list):

- tcp_wscale_always: controls who should ask for scaling. If set to zero, the remote side needs to request scaling; otherwise, the receiver requests it.
- tcp_tstamp_if_wscale: controls adding timestamps to the window scale. This parameter is defined in RFC 1323 and is used to track the round-trip delivery time for data in order to detect variations in latency, which impact timeout values. Both ends of the connection must support this option.
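A minimal sketch of enabling both RFC 1323 options on a Solaris host (both are on/off flags):

    # Always request window scaling, even if our own window fits in 16 bits
    ndd -set /dev/tcp tcp_wscale_always 1

    # Negotiate timestamps whenever window scaling is negotiated
    ndd -set /dev/tcp tcp_tstamp_if_wscale 1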

During slow start and retransmissions, the minimum initial window size, which can be as small as one MSS, is too conservative. The send window grows exponentially, but starting at the minimum is too small for such a large pipe. Tuning in this case requires adjusting the following parameters to increase the minimum starting window size:

- tcp_slow_start_initial: controls the starting window just after the connection is established.
- tcp_slow_start_after_idle: controls the starting window after a lengthy period of inactivity on the sender side.

Both of these parameters must be manually increased according to the actual WAN characteristics. Delayed ACKs on the receiver side should also be minimized because they slow the growth of the window size while the sender is trying to ramp up. RTT measurements require adjustment less frequently; due to the long RTT times, interim additional RTT values should be computed. The related tunable tcp_rtt_updates controls this: the TCP implementation knows when enough RTT values have been sampled, and then the value is cached. tcp_rtt_updates is on by default, but a value of 0 forces the value never to be cached, which is the same as not having enough samples for an accurate estimate of the RTT for this particular connection.


- tcp_recv_hiwat and tcp_xmit_hiwat: control the size of the STREAMS queues before STREAMS-based flow control is activated. With more packets in flight, the queue sizes must be increased to handle the larger number of outstanding packets in the system. (An example follows.)
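For example (sizes are illustrative; on some releases tcp_max_buf caps how large the high-water marks may be set):

    # Raise the per-connection buffer ceiling first (bytes)
    ndd -set /dev/tcp tcp_max_buf 4194304

    # Raise the transmit and receive STREAMS high-water marks
    ndd -set /dev/tcp tcp_xmit_hiwat 1048576
    ndd -set /dev/tcp tcp_recv_hiwat 1048576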

[FIGURE 3-12: Tuning Required to Compensate for Optical WAN. The figure contrasts the two bandwidth-delay products. For the LAN (100 meters): propagation delay = 2 x 100 m / 2.8x10^8 m/s = 7.14x10^-7 s, so bandwidth-delay = 7.14x10^-7 s x 100x10^6 bit/s ≈ 71 bits. For the WAN (1 Gbit/sec, 100 ms round trip): bandwidth-delay = 1x10^-1 s x 1x10^9 bit/s = 1x10^8 bits.]

Tuning TCP for Slow Links


Wireless and satellite networks share a common problem of a higher bit-error rate. One tuning strategy to compensate for the lengthy delays is to increase the send window, sending as much data as possible until the first ACK arrives; this way, the link is utilized as much as possible. FIGURE 3-13 shows how slow links and normal links differ. If the send window is small, there will be significant dead time between the time the send window's packets cross the link and the time an ACK arrives, allowing the sender to either retransmit or send the next window of packets in the send buffer. But due to the increased error probability, if one byte is not acknowledged by the receiver, the entire buffer must be re-sent. Hence, there is a trade-off: increasing the buffer increases throughput, but you don't want to increase it so much that a single error degrades performance, through retransmissions, by more than was gained. This is where manual tuning comes in: you'll need to try various settings based on an estimation of the link characteristics. One major improvement in TCP is the selective acknowledgement (SACK), with which only the bytes that were not received are retransmitted, not the entire buffer.


[FIGURE 3-13: Comparison between Normal LAN and WAN Packet Traffic, Long Low-Bandwidth Pipe. Upper panel: on an Ethernet LAN (100 Mbit/sec, 100 meters) the sender continuously sends Data1, Data2, Data3 within one RTT's send window, and the returning ACK1, ACK2, ACK3 keep data and ACK timings synchronized. Lower panel: on a Packet over Sonet (POS) WAN (1 Gbit/sec, 2500 miles) the sender sends fewer data packets due to higher error rates, time slots are wasted until the first ACK returns, and with more packets in flight the line delay makes each retransmission expensive, hence selective ACK is a major improvement.]

Another problem introduced on these slow links is that the ACKs play a major role: if ACKs are not received by the sender in a timely manner, the growth of the windows is impacted. During initial slow start, and even slow start after an idle period, the send window needs to grow exponentially, adjusting to the link speed as quickly as possible for coarse tuning; it then grows linearly after reaching ssthresh, for finer-grained tuning. However, if an ACK is lost, which has a higher probability on these types of links, performance throughput is again degraded. Tuning TCP for slow links includes the following parameters:

- tcp_sack_permitted: activates and controls how SACK will be negotiated during the initial three-way handshake:
  - 0 = SACK disabled.
  - 1 = TCP will not initiate a connection with SACK information, but if an incoming connection has the SACK-permitted option, TCP will respond with SACK information.
  - 2 = TCP will both initiate and accept connections with SACK information.

TCP SACK is specified in RFC 2018, TCP Selective Acknowledgment Options. With SACK, TCP need not retransmit the entire send buffer, only the missing bytes. Due to the higher cost of retransmission, it is far more efficient to re-send only the missing bytes to the receiver. Like optical WANs, satellite links also require the window scale option to increase the number of packets in flight to achieve higher overall throughput. However, satellite links are more susceptible to bit errors, so too large a window is not a good idea, because one bad byte will force a retransmission of one enormous window. TCP SACK is particularly useful in satellite transmissions to avoid this problem, because it allows the sender to select which packets to retransmit without requiring retransmission of the entire window that contained the one bad byte.
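For example, to have a host both initiate and accept SACK (mode 2, described above):

    # 0 = disabled, 1 = respond only, 2 = initiate and accept
    ndd -set /dev/tcp tcp_sack_permitted 2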
- tcp_dupack_fast_retransmit: controls the number of duplicate ACKs received before triggering the fast recovery algorithm. Instead of waiting for lengthy timeouts, fast recovery allows the sender to retransmit certain packets, depending on the number of duplicate ACKs received from the receiver. Duplicate ACKs are an indication that later packets have possibly been received, but that the packet immediately after the acknowledged byte might have been corrupted or lost.

Finally, all timeout values must be adjusted to compensate for long-delay satellite transmissions and possibly longer-distance WANs.


TCP and RDMA: Future Data Center Transport Protocols


TCP is ideally suited for reliable end-to-end communications over disparate distances. However, it is less than ideal for intra-data center networking, primarily because its conservative reliability processing drains CPU and memory resources, thus impacting performance. During the last few years, networks have grown faster in speed while dropping in cost. This implies that the computing systems, not the network, are now the bottleneck, which was not the case prior to the mid-1990s. Two issues have resulted from multi-gigabit network speeds:

- Interrupts generated to the CPU: the CPU must be fast enough to service all incoming interrupts to prevent losing any packets. Multi-CPU machines can be used to scale, but the PCI bus then introduces some limitations. It turns out that the real bottleneck is memory.
- Memory speed: an incoming packet must be written and read from the NIC to the operating system kernel address space to the user address space. You can reduce the number of memory-to-memory copies, approaching zero-copy TCP, by using workarounds such as page flipping, direct data placement, and scatter-gather I/O. However, as we approach 10-gigabit Ethernet interfaces, memory speed continues to be a source of performance issues. The main problem is that over the last few years memory densities have increased, but not memory speeds. Dynamic random access memory (DRAM) is cheap but slow; static random access memory (SRAM) is fast but expensive. New technologies such as reduced-latency DRAM (RLDRAM) show promise, but these gains seem to be dwarfed by the increases in network speeds.

To address this concern, there have been innovative approaches to increasing speed and reducing network protocol processing latencies in the areas of remote direct memory access (RDMA) and InfiniBand. Startup companies such as Topspin are developing high-speed server interconnect switches based on InfiniBand, along with network cards, drivers, and libraries that support RDMA, the Direct Access Programming Library (DAPL), and the Sockets Direct Protocol (SDP). TCP was originally designed for systems where the networks were relatively slow compared to the CPU processing power. As networks grew at a faster rate than CPUs, TCP processing became a bottleneck. RDMA fixes some of these latency issues.


[FIGURE 3-14: Increased Performance of InfiniBand/RDMA Stack. The figure contrasts the traditional TCP/IP stack (application, stream head, TCP, IP, PCI driver and PCI bus, network interface card with MAC and PHY), in which network traffic crosses several copies between kernel memory and user memory via the CPU, with an InfiniBand/RDMA stack (application, uDAPL, SDP driver, InfiniBand HCA) that moves network traffic between user memory and the HCA with far fewer copies.]

FIGURE 3-14 shows the difference between the current network stack and the new-generation stack. The main bottleneck in the traditional TCP stack is the number of memory copies. Memory access for DRAM takes approximately 50 ns for setup and then 9 ns for each subsequent write or read cycle. This is orders of magnitude longer than the CPU processing cycle time, so we can neglect the TCP processing time itself. Saving one memory access on every 64 bits results in huge savings in message transfers. InfiniBand is well suited for data center local networking architectures, as both sides must support the same RDMA technology.


CHAPTER 4

Routers, Switches, and Appliances - IP-Based Services: Network Layer


Traditional Ethernet packet forwarding decisions were based on Layer 2 and Layer 3 destination Media Access Control (MAC) and IP addresses. As performance, availability, and scalability requirements grew, advances in switching decisions based on intelligent packet processing tried to keep pace by offloading functions traditionally implemented in software and executed on general-purpose RISC processors onto network processors, Field Programmable Gate Arrays (FPGAs), or Application-Specific Integrated Circuits (ASICs). Early server load balancing implementations ran in software on general-purpose RISC processors, then evolved into services implemented in the data plane and control plane of packet switches. For example, a server load-balancing implementation now involves health checks implemented in the control plane; the health check results update specialized forwarding tables, enabling forwarding decisions to be performed at wirespeed by consulting these specialized forwarding tables and rewriting the packet. SSL was first implemented by Netscape as software libraries, originally executed on general-purpose CPUs. Performance was then improved somewhat by offloading the mathematical computations onto ASICs, delivered on PCI cards installed in servers. Recent startup companies are now working on performing all SSL processing in ASICs, allowing SSL to be a data-plane service. This chapter reviews internal switching architectures as well as some of the new features that have been integrated into multilayer Ethernet switches due to evolving requirements that surfaced during deployment of Internet Web-based applications. It discusses, in varying detail, the following IP services:
- Server Load Balancing: a mechanism to distribute load across a group of servers that host identical applications and logically behave as one application
- Layer 7 Switching: packet forwarding decisions based on packet payload
- Network Address Translation (NAT): rewriting packet source and destination addresses and ports for the purpose of decoupling the external public interface from the internal interfaces of servers, in particular IP addresses and ports
- Quality of Service (QoS): providing differentiated services to packet flows
- Secure Socket Layers (SSL): encrypting traffic at the application layer for HTTP-based traffic

This chapter first describes the internal architecture of a basic network switch and then describes more advanced features. It also provides a comprehensive discussion of server load balancing from a detailed conceptual perspective to actual practical switch configuration details. Because of the stateless nature of HTTP, server load balancing (SLB) has proven to be ideal for scaling the Web tier. However, there are many different flavors of SLB in terms of fundamental algorithm and deployment strategies that this chapter discusses and describes in detail. This chapter also answers a question that crops up over and over and is rarely answered: How do we know which is the best SLB algorithm, and what is the proof? The chapter then briefly describes Layer 7 switching and NAT and variants thereof. This is followed by a detailed look at QoS, showing where and how to use it and how it works. Finally, we look at SSL from a conceptual layer and describe configuring a commercially available SSL appliance.

Packet Switch Internals


The terms router and switch are often confused because of marketing adaptations from vendor to vendor. Original routers performed Layer 3 packet forwarding decisions on general-purpose computing devices with multiple network interface cards, and the bulk of the packet processing was performed in software. The inherent design of Ethernet has limited scalability: as the number of nodes increased on an Ethernet segment belonging to one collision domain, the latency increased exponentially, hence bridges were introduced. Bridges segregated Ethernet segments by learning MAC addresses and only allowing broadcasts to certain segments, which allowed the number of nodes to increase. The next advance came in the early 1990s with the introduction of the Ethernet switch, which allowed multiple simultaneous forwardings of packets based on Layer 2 MAC addresses. This increased the throughput of networks dramatically, since single-talker shared-bus Ethernet allowed only one flow to communicate at any single instant in time. Packet switches then evolved to make forwarding decisions not only on Layer 2, but also on Layers 3 and 4. These higher-layer packet switches are more complicated because more complex software is required to update the corresponding forwarding tables, and more memory is needed; memory bandwidth is a significant bottleneck for wirespeed packet forwarding. Another advance in network switches was cut-through mode, which allows the switch to make a forwarding decision even before the entire packet has been read into the switch's memory. Traditional switches were of the store-and-forward type, which needed to read the entire packet before making a forwarding decision.
FIGURE 4-1 shows the internal architecture of a multi-layer switch, including a significant amount of integration of functions. Most of the important repetitive tasks are implemented in ASIC components, in contrast to the early routers described previously, which performed forwarding tasks in software on a general-purpose computer CPU card. Here the CPU mostly runs control plane and background tasks and does very little data forwarding. Modern network switches separate tasks that must be completed quickly from those that need not be performed in real time, organized into layers or planes as follows:

- Control Plane: the set of functions that controls how incoming packets should be processed and how the data path is managed. This includes routing processes and protocols that populate the forwarding tables containing routing (Layer 3) and switching (Layer 2) entries. This is commonly referred to as the slow path, because timing is not as crucial as in the data path.
- Data Plane: the set of functions that operates on the data packets themselves, such as route lookup and rewriting of the destination MAC address. This is commonly referred to as the fast path. Packets must be forwarded at wire speed, hence packet processing has a much higher priority than control processing, and speed is of the essence.

The following sections describe the common components and features of a modern network switch.


[FIGURE 4-1: Internal Architecture of a Multi-Layer Switch. The diagram shows RISC processor(s) running control-plane software (routing protocols, spanning tree/BPDU handling, trunking/LACP, flow control, SNMP management, CLI), memory holding addressing tables, flow data structures, QoS queues, packet/frame buffers, and Tx/Rx descriptors, a switching fabric with packet scheduler, and per-port data-plane pipelines (PHY transceiver, MAC, Rx/Tx FIFOs, packet classification, VLAN lookup, FIB lookup). The numbered components are described in the sections that follow.]

The following numbered sections describe the main functional components of a typical network switch and correlate to the numbers in FIGURE 4-1.

1. PHY Transceiver

FIGURE 4-1 shows that as a packet enters a port, the physical layer (PHY) chip is in Receive Mode (Rx). The data stream arrives in a serialized, encoded format; a 4-bit nibble is built and sent to the MAC to construct a complete Ethernet frame. The PHY chip implements critical functions such as collision detection (needed only in half-duplex mode), link monitoring to detect the TPE link test, and auto-negotiation to synchronize with the sender.


2. Media Access Control

The Media Access Control (MAC) ASIC takes the 4-bit nibble from the PHY and constructs a complete Ethernet frame. The MAC chip inserts a Start Frame Delimiter and Preamble when in Transmit Mode (Tx) and strips off these bytes when in Rx mode. The MAC implements the 802.3u/z functions, depending on the link speed, as well as functions such as collision backoff and flow control. The flow control feature prevents slower link queues from being overrun, which is an important feature. For example, when a 1 Gbit/sec link is transmitting to a slower 100 Mbit/sec link, only a finite amount of buffer or queue memory is available; by sending PAUSE frames, the receiver slows the sender down, hence using fewer switch resources to accommodate fast senders and slow receivers. Once a frame is constructed, the MAC first checks whether the destination MAC address is in the range 01-80-C2-00-00-00 to 01-80-C2-00-00-0F. These are special reserved multicast addresses used for MAC functions such as link aggregation, spanning tree, or pause frames for flow control.

3. Flow Control (MAC Pause Frames)

When a flow control frame is received, a timer module is invoked to wait until a certain time elapses before sending out the subsequent frame. For example, a flow control frame is sent out when the queues are being overrun, so that the MAC is free to catch up and allow the switch to process the ingress frames that are queued up.

4. Spanning Tree

When Bridge Protocol Data Units (BPDUs) are received by the MAC, the spanning tree process parses the BPDU, determines the advertised information, and compares it with stored state. This allows the process to compute the spanning tree and control which ports to block or unblock.

5. Trunking

When a Link Aggregation Control Protocol (LACP) frame is received, a link aggregation sublayer parses the LACP frame, processes the information, and configures the collector and distributor functions. The LACP frame contains information about the peer trunk device, such as aggregation capabilities and state information, which is used to control the data packets across the trunked ports. The collector is an ingress module that aggregates frames across the ports of a trunk; the distributor spreads frames out across the trunked ports on egress.

6. Receive FIFO

If the MAC frame is not a control frame, it is stored in the Receive Queue, or Rx FIFO (first in, first out): buffers referenced by Rx descriptors. These descriptors are simply pointers, so when packets are moved around for processing, small 16-bit pointers are moved instead of 1500-byte frames.


7. Flow Structures

The first thing that occurs after the Ethernet frame is completely constructed is that a flow structure is looked up. This flow structure has a pointer to an address table that immediately identifies the egress port, so that the packet can be quickly stored, queued, and forwarded out the egress port. On the first packet of a flow, this flow data structure does not yet exist, so the lookup returns a failure and the CPU must be interrupted to create the flow structure and return to the caller. The flow structure has enough information about where to store the packet in a region of memory used for storing entire packets. The associated data structures, the Tx and Rx descriptors, are handles to the packet itself. As with the FIFO descriptors, the reason for these data structures is speed: instead of moving large 1500-byte packets around for queuing, only 32-bit pointers are moved.

8. Packet Classification

A switch has many flow-based rules for firewalls, NAT, VPN, and so on. Packet classification performs a quick lookup for all the rules that apply to this packet. There are many algorithms and implementations, which basically inspect the IP header and try to find a match in the table that contains all the rules for this packet.

9. VLAN Lookup

The VLAN module needs to identify the VLAN membership of this frame by looking at the VLAN ID (VID) in the tag. If the frame is untagged, then, depending on whether the VLAN is port based or MAC address based, the set of output ports needs to be looked up. This is usually implemented by vendors in ASICs due to wirespeed timing requirements.

10. Forwarding Information Base (FIB) Lookup

After a packet has passed through all the Layer 2 processing, the next step is to determine the egress ports this packet must be forwarded to. The routing tables, populated in the control plane, determine the next hop. There are two approaches to implementing this function:

- Centralized: one central database contains all the forwarding entries.
- Distributed: each port has a local database for quick lookups.

The distributed implementation is much faster. It is discussed further later in this chapter.

11. Routing Protocols

All routing packets are sent to the appropriate routing process, such as RIP, OSPF, or BGP, and this process populates the routing tables. This work is performed in the control plane, or slow path. The routing tables are used to populate the Forwarding Information Base (FIB), which can reside in a central memory area or be downloaded to each port's local memory, providing faster data path performance in the FIB lookup phase. The next step occurs when the packet is ready to be scheduled for transmission: the packet scheduler pulls the descriptor out of the appropriate QoS queue, and finally the packet is sent out the egress port.

12. Switch Fabric Module (SFM)

Once the FIB lookup is completed, the packet scheduler must queue the packet onto the output queues, which can be implemented as a set of multiple queues, each with a certain priority, to implement different classes of service. The SFM links ingress processing to egress processing. An SFM can be implemented using Shared Memory or CrossPoint architectures. In a shared memory approach, packets are written to and read from a shared memory location, with an arbitrator module controlling access. In a CrossPoint architecture there is no storage of packets; instead, there is a connection from one port to another, and the packet must be broken into fixed-sized cells. CrossPoint fabrics usually have very high bandwidth and are used only for backplanes; the bandwidth must be higher because of the extra overhead and padding required in the construction and destruction of fixed-sized cells. Both approaches suffer from Head of Line (HOL) blocking, but usually use some form of virtual output queue workaround to mitigate the effects. HOL blocking occurs when a large packet holds up smaller packets farther down the queue during scheduling.

13. Packet Scheduler

The packet scheduler chooses packets that need to be moved from one set of queues to another based on some algorithm, and is usually implemented in an ASIC. Instead of moving entire frames, sometimes 1500 bytes, only 16-bit or 32-bit descriptors are moved.

14. Transmit FIFO

The transmit queue, or Tx FIFO, is the final store before the frame is sent out the egress port. The same functions are performed as those described for the ingress (Rx FIFO), but in the opposite direction.

Emerging Network Services and Appliances


Over the past years, enterprise networks have evolved significantly to handle Web traffic. Enterprise customers are realizing the benefits as a result, embracing intelligent IP-based services in addition to traditional stateless Layer 2 and Layer 3 services at the data center edge. Services such as SLB, Web caching, SSL accelerators, NAT, QoS, firewalls, and others are now common in every data center edge. These devices are either deployed adjacent to network switches or integrated as an added service inside the network switch, and often a multitude of vendors can implement a particular set of functions. The following sections describe some of the key IP services you can use in the process of crafting high-quality network designs.

Server Load Balancing


Network SLB is essentially the distribution of load across a pool of servers. Incoming client requests destined to a specific IP address and port are redirected to a pool of servers, and the SLB algorithm determines the actual target server. The first form of server load balancing was DNS round-robin, where a Domain Name Service (DNS) resource record allowed multiple IP addresses to be mapped to a single domain name, and the DNS server returned one of the IP addresses using a round-robin scheme. Round-robin provides a crude way to distribute load across different servers. The limitations of this scheme include the need for a service provider to register each IP address. Some Web farms now grow to hundreds of front-end servers, and every client might inject a different load, resulting in uneven distribution of load. Modern SLB, where one virtual IP address maps to a pool of real servers, was introduced in the mid-1990s. One of the early successes was the Cisco LocalDirector, where it became apparent that round-robin was an ideal solution for increasing not only the availability but also the aggregate service capacity for HTTP-based Web requests.
FIGURE 4-2 describes a high-level model of server load balancing.


[FIGURE 4-2: High-Level Model of Server Load Balancing. Total incoming client requests arrive at the SLB at rate λ and are distributed across N servers; ideally each server receives λ/N, but in practice the distributed rates are uneven.]

In FIGURE 4-2 the incoming load is λ. It is spread out evenly across N servers, each having a service capacity rate µ. How does the SLB device determine where to forward the client request? The answer depends on the algorithm. One of the challenges faced by network architects is choosing the right SLB algorithm from the plethora of SLB algorithms and techniques available. The following sections explore the more important SLB derivatives, as well as which technique is best for which problem.

Hash
The hash algorithm pulls certain key fields from the client incoming request packet, usually the source/destination IP address and TCP/UDP port numbers, and uses their values as an index to a table that maps to the target server and port. This is a highly efficient operation because the network processor can execute this instruction in very few clock cycles, only performing expensive read operations for the index table lookup. However, the network architect needs to be careful about the following pitfalls:
- Megaproxy architectures, such as those used by some ISPs, remap the dial-in client's source IP address to that of the megaproxy, not the client's actual dynamically allocated IP address, which might not be routable. So be careful not to assume stickiness properties for the hash algorithm.


- Hashing bases its assumption of even load distribution on heuristics, which require careful monitoring. It is entirely possible that, due to the mathematics, the hash values will skew the load distribution, resulting in worse performance than round-robin.
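As a sketch of the decision just described (the selected fields and the hash function H vary by vendor; this is not any specific product's algorithm), the forwarding choice reduces to an index computation:

    i = H(src IP, src port, dst IP, dst port) mod N

where N is the number of real servers. The megaproxy pitfall arises because every client behind the proxy presents the same source IP address, collapsing many distinct clients onto a few index values.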

Round-Robin
Round-robin (RR), or weighted round-robin (WRR), is the most widely used SLB algorithm because it is simple to implement efficiently. The RR/WRR algorithm looks at the incoming packet and remaps the destination IP address/port combination to the target IP/port from a fixed table and a moving pointer. (The Least Connections algorithm, by contrast, requires at least one more process to continually monitor the requests sent to or received from each server, hence estimating queue occupancy, before determining the target IP/port for an incoming packet.) The major flaw with round-robin is that the servers must be evenly loaded, or the resulting architecture will be unstable, as requests can build up on one server and eventually overload it.

Smallest Queue First / Least Connections


Smallest Queue First (SQF) is one of the best SLB algorithms because it is self-adapting. This method considers the actual capabilities of the servers and knows exactly which server can best absorb the next request. It also provides the least average delay and, above all, is stable. In commercial switches, this is close to what is referred to as Least Connections; however, commercial implementations take some cost-reduction shortcuts that only approximate SQF. FIGURE 4-3 provides a high-level model of the SQF algorithm.


[FIGURE 4-3: High-Level Model of the Shortest Queue First Technique, with unequal server capacities. Total incoming client requests arrive at the SLB at rate λ; the SLB inspects queue occupancies and forwards each request to the least-occupied queue, so the servers, whose capacities differ, receive unevenly distributed rates.]

Data centers often have servers that all perform the same function but vary in processing speed. Even when the servers have identical hardware and software, the actual client requests may exercise different code paths on the servers, hence injecting different loads on each server. This results in an uneven distribution of load. The SQF algorithm determines where to direct the incoming load by looking at the queue occupancies. If server i is more heavily loaded than the other servers, its queue Qi begins to build up; the SQF algorithm automatically adjusts and stops forwarding requests to server i. Because the other SLB variations do not have this crucial property, SQF is the best SLB algorithm. Further analysis shows that SQF has another, more important property: stability. Stability describes the long-term behavior of the system.


[FIGURE 4-4: Round-Robin and Weighted Round-Robin, with equal server capacities. Total incoming client requests arrive at rate λ and are forwarded blindly to the servers; the weights W1..WN determine the proportion λ/N x Wi of the incoming load sent to each server.]

Finding the Best SLB Algorithm


Recently, savvy customers have begun to ask network architects to substantiate why one SLB algorithm is better than another. Although this section requires significant technical background knowledge, it provides definite proof and explains why SQF is the best algorithm in terms of system stability. The SLB system, which is composed of client requests and the servers, can be abstracted for the purposes of analysis as shown in FIGURE 4-5. Initial client Web requests (that is, when the client picks the first home page, excluding correlated subsequent requests) can be modeled as a Poisson process with rate λ. The Poisson process is a probability function with an exponential distribution, and it is reasonably accurate for telecommunication network theory as well as Internet session-initiation traffic analysis. The Web servers or application servers can be modeled as M/M/1 queues. We have a number N of independent servers with potentially different capacities, hence we can model each one with its own range of service times and corresponding average. This model is reasonable because it captures the fact that client requests can invoke software code path traversals that vary, as well as hardware configuration differences. The SLB is subjected to an aggregate load from many clients, each with its own Poisson request process. However, because one fundamental property of the Poisson process is that the sum of Poisson processes is also a Poisson process, we can simplify the complete client side and model it as one large Poisson process of rate λ. The SLB device forwards the initial client request to the least-occupied queue. There are N queues, each with a Poisson arrival process and an exponential service time, hence we can model all the servers as N M/M/1 queues. To prove that this system is stable, we must show that under all admissible time and injected-load conditions the queues never grow without bound. There are two approaches we can take:

- Model the state of the queues as a stochastic process, determine the Markov chain, and then solve for the long-term equilibrium distribution π.
- Craft a Lyapunov function L(t) that accurately models the growth of the queues, and then show that over the long term (that is, after the system has had time to warm up and reach a steady state) the rate of change of the queue size is negative, and remains negative, for all L(t) greater than some threshold. This is a common and proven technique found in many network-analysis research papers. It turns out that the expected value of the single-step drift is equivalent to, but much easier to calculate than, the continuous condition dL/dt < 0, and that is the technique we will use (written out below).
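Formally, the negative-drift condition to be established can be written as a bound on the expected one-step drift, where epsilon > 0 and the threshold B are constants determined by the arrival and service rates:

    E[ L(t+1) - L(t) | L(t) ] <= -epsilon    whenever L(t) > B

By the Foster-Lyapunov criterion, such a bound implies that the queue-length process is positive recurrent, that is, stable.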

[FIGURE 4-5: Server Load Balanced System Modeled as N M/M/1 Queues. Client session initiations, modeled as a Poisson process with rate λ, arrive at the SLB, which feeds N M/M/1 queues; each server i is a process with exponential service rate µi.]


We will perform this analysis by first obtaining the discrete time model of one particular queue and then generalizing the result to all the N queues, as shown in the system model. If we take the discrete model, the state of one of the queues can be modeled as shown in FIGURE 4-6.

[FIGURE 4-6: System Model of One Queue. The queue state is the number of queued client requests at time t; requests arrive at rate λi and the Web server drains the queue at service rate µi.]

The queue occupancy at time t+1 equals the queue occupancy at time t, plus the number of arrivals in the interval, minus the number of departures (serviced requests) in the interval:

    Q(t+1) = Q(t) + A(t+1) - D(t+1)

Because the state of the queue depends only on the previous state, this is easily modeled as a valid Markov process, for which there are known, proven methods of analysis to find the steady-state distribution. However, since we have N queues, the actual mathematics is very complex. The Lyapunov function is an extremely powerful and accurate method to obtain the same results, and it is far simpler. See Appendix A for more information about the Lyapunov analysis.
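As a sketch of the kind of function used (one standard choice among many; the full derivation is in Appendix A), take the quadratic Lyapunov function over all N queues:

    L(t) = Q_1(t)^2 + Q_2(t)^2 + ... + Q_N(t)^2

Substituting Q_i(t+1) = Q_i(t) + A_i(t+1) - D_i(t+1) and taking expectations, the drift for large queues is dominated by the cross terms 2 Q_i(t) E[A_i - D_i]. Because SQF sends arrivals only to the shortest queue, any queue that grows ahead of the others sees departures but no arrivals, making its cross term negative; this yields the negative drift required for stability whenever the total load λ is less than the total service capacity.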

How the Proxy Mode Works


The SQF algorithm is only one component in understanding how to best deploy SLB in network architectures. There are several different deployment scenarios available for creating solutions. In Proxy Mode, the client points to the server load balancing device, and the server load balancer remaps the destination IP address and port to the target server as selected by the SLB algorithm. Additionally, the source IP/port is changed so that the server will return the response to the server load balancer and not to the client directly. The server load balancer keeps state information to return the packet to the correct client.


[FIGURE 4-7: Server Load Balance Packet Flow, Proxy Mode. The client (192.191.3.89, source port 3201) sends GET://www.abc.com/index.html to the VIP 120.141.0.19, port 80 (1, 2). The SLB consults its table and rewrites the packet with source 120.141.0.19, source port 33, and destination 10.0.0.1, port 80 (3). The server's reply (4) returns to the SLB at 120.141.0.19, port 33, and the SLB rewrites it back to the client.]


FIGURE 4-7 illustrates how the packet is modified from client to SLB to server, back to SLB, and finally back to the client. The following numbered list correlates with the numbers in FIGURE 4-7.

1. The client submits an initial service request targeted to the virtual IP (VIP) address of 120.141.0.19 on port 80. This VIP address is configured as the IP address of the SLB appliance.


2. The SLB receives this packet from the client and recognizes that this incoming packet must be forwarded to a server selected by the SLB algorithm.

3. The SLB algorithm identifies server 10.0.0.1 at port 80 to receive this client request and modifies the packet so that the server will reply to the SLB and not to the client. Hence, the source address and port are also modified.

4. The server receives the client request.

5. Perceiving that the request has come from the SLB, the server returns the requested Web page to the SLB device.

6. The SLB receives this packet from the server. Based on its state information, it knows that this packet must be sent back to client 192.191.3.89.

7. The SLB device rewrites the packet and sends it out the appropriate egress port.

8. The client receives the response packet.

Advantages of Using Proxy Mode


- Increases security and flexibility by decoupling the client from the back-end servers
- Increases switch manageability because servers can be added and removed dynamically without any modifications to the SLB device configuration after it is initially configured
- Increases server manageability because any IP address can be used

Disadvantages of Using Proxy Mode


- Limits throughput because the SLB must process packets on ingress as well as return traffic from server to client
- Increases client delays because each packet requires more processing

How Direct Server Return Works


One of the main limitations of Proxy Mode is performance. Proxy Mode requires double work in the sense that traffic must be intercepted and processed in both directions: from clients to servers and from servers back to clients. Direct Server Return (DSR) addresses this limitation by requiring that only incoming traffic be processed by the SLB, thereby increasing performance considerably. To better understand how this works, see FIGURE 4-8. In DSR Mode, the client points to the SLB device, which remaps only the destination MAC address. This is accomplished by leveraging the loopback interface of Sun Solaris servers and other servers that support loopback. Every server has a regular unique IP address and a loopback IP address, which is the same as the external VIP address of the SLB. When the SLB forwards a packet to a particular server, the server looks at the MAC address to determine whether the packet should be forwarded up to the IP stack. The IP stack recognizes that the destination IP address of this packet is not the same as that of the physical interface but is identical to the loopback IP address. Hence, the stack forwards the packet to the listening port.
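On a Solaris server, the VIP can be configured as a loopback alias so that the stack accepts packets addressed to it. The commands below are a minimal sketch using the VIP from FIGURE 4-8; the all-ones netmask is an assumption, so consult your SLB vendor's DSR documentation for the exact recipe:

# ifconfig lo0:1 plumb
# ifconfig lo0:1 120.141.0.19 netmask 255.255.255.255 up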

FIGURE 4-8    Direct Server Return Packet Flow. (The figure shows client 192.191.3.89 sending to VIP 120.141.0.19 port 80; the SLB consults its state table and forwards the request to real server 10.0.0.1, which is configured with loopback lo0:120.141.0.19 and MAC 0:8:3e:4:4c:84; the server replies directly to the client from 120.141.0.19.)


FIGURE 4-8 shows the DSR packet flow process. The following numbered list correlates with the numbers in FIGURE 4-8.

1. The client submits an initial service request targeted to the VIP address 120.141.0.19 on port 80. This VIP address is configured as the IP address of the SLB appliance.

2. The SLB receives this packet from the client and forwards this incoming packet to a server selected by the SLB algorithm.

3. The SLB algorithm identifies server 10.0.0.1 port 80 to receive this client request and modifies the packet by changing only the destination MAC address to 0:8:3e:4:4c:84, which is the MAC address of the real server.

Note  Statement 3 implies that the SLB and the servers must be on the same Layer 2 VLAN. Hence, DSR is less secure than the Proxy Mode approach.

4. The server receives the client request and processes the incoming packet.

5. The server returns the packet directly to the client by swapping the destination/source IP address and TCP port pair.

6. Because the destination IP address is the same as that configured on the loopback, the response is sent back directly to the client.

Advantages of Direct Server Return


- Increases security and flexibility by decoupling the client from the back-end servers.
- Increases switch manageability because servers can be added and removed dynamically without any modifications to the SLB device configuration after it is initially configured.
- Increases performance and scalability. The server load-balancing work is reduced by half because return traffic bypasses the SLB entirely. Thus, more cycles are free to process more incoming traffic.

Disadvantages of Direct Server Return


- The SLB must be on the same Layer 2 network as the servers because they share the same IP network number, differing only by MAC address.
- All the servers must be configured with the same loopback address as the SLB VIP. This might be an issue for securing critical servers.


Server Monitoring
All SLB algorithms, except the family of fixed round-robin algorithms, require knowledge of the state of the servers. SLB implementations vary enormously from vendor to vendor. Some poor implementations simply monitor link state on the port to which the real server is attached. Others probe with ping requests at Layer 3. Port-based health checks are superior because the actual target application is verified for availability and response time. In some cases, the Layer 2 state might be fine, but the actual application has failed, and the SLB device mistakenly forwards requests to that failed real server. The features and capabilities of switches are changing rapidly, often through simple flash updates, so you must be aware of the limitations of the particular firmware you deploy.
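As a rough illustration of the difference, the following Python sketch performs a port-based health check by actually exercising the HTTP service rather than merely testing link state or ping reachability (the host name, timeout, and response parsing are hypothetical simplifications):

import socket

def http_health_check(host, port=80, timeout=2.0):
    # Connect to the real server's service port and issue a HEAD
    # request; only a valid HTTP response counts as healthy.
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(("HEAD / HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode())
            return s.recv(64).startswith(b"HTTP/")
    except OSError:
        return False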

Persistence
Often when a client is initially load balanced to a specific server, it is crucial that subsequent requests be forwarded to the same server within the pool. There are several approaches to accomplishing this (a sketch of source IP hashing follows the list):

- Allow the server to insert a cookie in the client's HTTP response, and configure the SLB to look for a cookie pattern and make a forwarding decision based on the cookie. The first request from the client carries no cookie, so the SLB forwards it to the best server based on the algorithm. The server installs a cookie, which is a name-value pair. On the packet's return, the SLB reads the cookie value and records the client-server pair. Subsequent requests from the same client carry the cookie, which triggers the SLB to forward based on the recorded cookie information rather than on the SLB algorithm.
- Hash on the client's source IP address. This is risky if the client requests come from a megaproxy, because many distinct clients then share one source address.
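The source IP hashing approach is simple enough to sketch in a few lines of Python; the crc32 choice is illustrative, and the megaproxy caveat applies because every client behind the proxy hashes to the same server:

import zlib

def pick_server(src_ip, servers):
    # The same source IP always hashes to the same real server.
    return servers[zlib.crc32(src_ip.encode()) % len(servers)]

pick_server("192.191.3.89", ["10.0.0.1", "10.0.0.2"])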

It is best to avoid persistence because HTTP was designed to be stateless. Trying to maintain state across many stateless transactions causes serious issues if there are failures. In many cases, the application software can maintain state. For example, when a servlet receives a request, it can identify the client based on its own cookie value and retrieve state information from the database. However, switch persistence might be required. If so, you should look at the exact capabilities of each vendor and decide which features are most critical.


Commercial Server Load Balancing Solutions


Many commercial SLB implementations are available, both hardware and software.

- Resonate provides a Solaris library offering, where a STREAMS module/driver is installed on a server that accepts all traffic, inspects the ingress packet, and forwards it to another server that actually services the request. As the cost of hardware devices falls and their performance increases, the Resonate product has become less popular.
- Various companies, such as Cisco, F5, and Foundry with its ServerIron, sell hardware appliances that perform only server load balancing. One important factor to examine carefully is the method used to implement the server load-balancing function. The F5 device is limited because it is a PC Intel box, running BSD UNIX, with two or more network interface cards.

Wirespeed performance can be limited because these general-purpose computer-based appliances are not optimized for packet forwarding. When a packet arrives at a NIC, an interrupt must first be generated and serviced by the CPU. Then the PCI bus arbitration process grants access to traverse the bus. Finally, the packet is copied into memory. These events cumulatively contribute to significant delays. In some newer implementations, wirespeed SLB forwarding can be achieved: data plane Layer 2/Layer 3 forwarding tables are integrated with the server load-balancing updates. Hence, as soon as a packet is received, a packet classifier immediately performs an SLB lookup in the data plane, in hardware, using tables populated and maintained by the SLB process that resides in the control plane, which also monitors the health of the servers.

Foundry ServerIron XL: Direct Server Return Mode


CODE EXAMPLE 4-1 shows the configuration file for the setup of a simple server load balancer. Refer to the Foundry ServerIron XL user guide for detailed explanations of the configuration parameters. It shows the level of complexity involved in configuring a typical SLB device. This device is assigned a VIP address of 172.0.0.11, which is the IP address exposed to the outside world. On the internal LAN, this SLB device is assigned an IP address of 20.20.0.50, which can be used as the source IP address sent to the servers if you are using Proxy Mode. However, this device is configured in DSR mode, where the SLB forwards to the servers, which then return traffic directly to the client. Notice that the servers are on the same VLAN as this SLB device on the internal LAN side of the 20.0.0.0 network.
CODE EXAMPLE 4-1

Configuration for a Simple Server Load Balancer

!
ver 07.3.05T12
global-protocol-vlan
!
!
server source-ip 20.20.0.50 255.255.255.0 172.0.0.10
!
!
server real s1 20.20.0.1
 port http
 port http url "HEAD /"
!
server real s2 20.20.0.2
 port http
 port http url "HEAD /"
!
!
server virtual vip1 172.0.0.11
 port http
 port http dsr
 bind http s1 http s2 http
!
vlan 1 name DEFAULT-VLAN by port
 no spanning-tree
!
hostname SLB0
ip address 172.0.0.111 255.255.255.0
ip default-gateway 172.0.0.10
web-management allow-no-password
banner motd ^C
Reference Architecture -- Enterprise Engineering^C
Server Load Balancer -- SLB0 129.146.138.12/24^C
!
!


Extreme Networks BlackDiamond 6800: Integrated SLB Proxy Mode


CODE EXAMPLE 4-2 shows an excerpt of the SLB configuration for a large chassis-based Layer 2/Layer 3 switch with integrated SLB capabilities. Various VLANs and IP addresses are configured on this switch in addition to the SLB, and pools of servers with real IP addresses are configured. The difference is that this switch is configured in the more secure Proxy Mode instead of the DSR mode shown in the previous example.

CODE EXAMPLE 4-2

SLB Configuration for a Chassis-based Switch

#
# MSM64 Configuration generated Thu Dec 6 21:27:26 2001
# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/01 11:34:27
..
# Config information for VLAN app.
config vlan "app" tag 40    # VLAN-ID=0x28 Global Tag 8
config vlan "app" protocol "ANY"
config vlan "app" qosprofile "QP1"
config vlan "app" ipaddress 10.40.0.1 255.255.255.0
configure vlan "app" add port 4:1 untagged
..
#
# Config information for VLAN dns.
..
configure vlan "dns" add port 5:3 untagged
configure vlan "dns" add port 5:4 untagged
configure vlan "dns" add port 5:5 untagged
..
configure vlan "dns" add port 8:8 untagged
config vlan "dns" add port 6:1 tagged
#
# Config information for VLAN super.
config vlan "super" tag 1111    # VLAN-ID=0x457 Global Tag 10
config vlan "super" protocol "ANY"
config vlan "super" qosprofile "QP1"
# No IP address is configured for VLAN super.
config vlan "super" add port 1:1 tagged
config vlan "super" add port 1:2 tagged
config vlan "super" add port 1:3 tagged
config vlan "super" add port 1:4 tagged
config vlan "super" add port 1:5 tagged
config vlan "super" add port 1:6 tagged
config vlan "super" add port 1:7 tagged
config vlan "super" add port 1:8 tagged
..
config vlan "super" add port 6:4 tagged
config vlan "super" add port 6:5 tagged
config vlan "super" add port 6:6 tagged
config vlan "super" add port 6:7 tagged
config vlan "super" add port 6:8 tagged
..
enable web access-profile none port 80
configure snmp access-profile readonly None
configure snmp access-profile readwrite None
enable snmp access
disable snmp dot1dTpFdbTable
enable snmp trap
configure snmp community readwrite encrypted "r~`|kug"
configure snmp community readonly encrypted "rykfcb"
configure snmp sysName "MLS1"
configure snmp sysLocation ""
configure snmp sysContact "Deepak Kakadia, Enterprise Engineering"
..
# ESRP Interface Configuration
config vlan "edge" esrp priority 0
config vlan "edge" esrp group 0
config vlan "edge" esrp timer 2
config vlan "edge" esrp esrp-election ports-track-priority-mac
..
..
# SLB Configuration
enable slb
config slb global ping-check frequency 1 timeout 2
config vlan "dns" slb-type server
config vlan "app" slb-type server
config vlan "db" slb-type server
config vlan "ds" slb-type server
config vlan "web" slb-type server
config vlan "edge" slb-type client
create slb pool webpool lb-method round-robin
config slb pool webpool add 10.10.0.10 : 0
config slb pool webpool add 10.10.0.11 : 0
create slb pool dspool lb-method least-connection
config slb pool dspool add 10.20.0.20 : 0
config slb pool dspool add 10.20.0.21 : 0
create slb pool dbpool lb-method least-connection
config slb pool dbpool add 10.30.0.30 : 0
config slb pool dbpool add 10.30.0.31 : 0
create slb pool apppool lb-method least-connection
config slb pool apppool add 10.40.0.40 : 0
config slb pool apppool add 10.40.0.41 : 0
create slb pool dnspool lb-method least-connection
config slb pool dnspool add 10.50.0.50 : 0
config slb pool dnspool add 10.50.0.51 : 0
create slb vip webvip pool webpool mode translation 10.10.0.200 : 0 unit 1
create slb vip dsvip pool dspool mode translation 10.20.0.200 : 0 unit 1
create slb vip dbvip pool dbpool mode translation 10.30.0.200 : 0 unit 1
create slb vip appvip pool apppool mode translation 10.40.0.200 : 1 unit 1
create slb vip dnsvip pool dnspool mode translation 10.50.0.200 : 1 unit 1
..
..

Layer 7 Switching
The recent explosive demand for application hosting and increased security fueled the demand for a new concept called content switching, also known as Layer 7 switching, proxy switching, or URL switching. This switching technology inspects the payload, which is expected to be some HTTP request, such as a request for a static or dynamic Web page. The content switch searches for a certain string, and if there is a match, it takes some type of action. For example, the content switch might rewrite the content or redirect it to a pool of servers that specializes in those services, or to a caching server for increased performance. The main idea is that a forwarding decision is made based on the application data, not on traditional Layer 2 or Layer 3 destination network addresses.

Some major technical challenges arise in performing this type of processing. The first is a tremendous performance impact. In traditional Layer 2 and Layer 3 processing, the destination addresses and corresponding egress port are found by looking at a fixed offset in the packet. This allows for extremely cheap and fast ASICs. Usually, the packet header is read in from the MAC and copied into SRAM, which has an access time of around five nanoseconds. The variable-size and bulky payload is usually copied into DRAM, which has a higher initial setup time. The forwarding decision requires two SRAM memory accesses, where the header is read, modified, and written, and a quick lookup is performed, usually a ternary content-addressable memory (TCAM) or Patricia tree lookup in SRAM, which takes a few nanoseconds.

However, for Layer 7 forwarding decisions, almost all commercial switches, except the Extreme Px1, must perform this function in a much slower CPU running a real-time operating system such as VxWorks. The payload, which resides in DRAM, must be read, processed, and written, and the string search itself is time intensive. (There have been recent advances in Layer 7 technology, such as that offered by Solidum and PMC-Sierra's ClassiPI, which perform this at wirespeed rates. However, at the time of this writing, we are not aware of any major switch manufacturer using this technology.) This operation takes orders of magnitude more time.

NAT can be extended not only to hide internal private IP addresses but also to base packet forwarding decisions on the payload. There are two approaches to accomplishing this function:
- Application Gateway: This approach terminates the socket connection on the client side and creates another connection on the server side, providing complete isolation between the client and the server. This requires more processing time and resources on the switch. However, it allows the switch to make a comprehensive application-layer forwarding decision.
- TCP Splicing: This approach simply rewrites the TCP/IP packet headers, thereby reducing the amount of processing required on the switch. This makes it more difficult for the switch to make application-layer forwarding decisions if the complete payload spans many small TCP packets.

This section describes an application gateway approach to NAT and Layer 7 processing.
FIGURE 4-9 shows an overview of the functional content switching model.


FIGURE 4-9    Content Switching Functional Model. (The proxy switching function terminates the client socket connection, gets the URL, checks it against rules, and either forwards to a server group or SLB function or uses a valid cookie with a server ID to forward to the same server. Example rules map http://www.a.com/SMA/stata/index.html to servergroup 1 (stata), .../dnsa/... to servergroup 2 (dnsa), .../statb/... to servergroup 3 (statb), .../cacheb/... to servergroup 4 (cacheb), and .../dyna/... to servergroup 5 (dynab).)

Content switching with full network address translation (NAT) serves the following purposes:
- Isolates internal IP addresses from being exposed to the public Internet.
- Allows reuse of a single IP address. For example, clients can send their Web requests to www.a.com or www.b.com, where DNS maps both domains to a single IP address. The proxy switch receives this request with the packet containing an HTTP header in the payload that contains the target domain, for example a.com or b.com, and determines to which group of servers to redirect this request.

- Allows parallel fetching of different parts of Web pages from servers optimized and tuned for that type of data. For example, a complex Web page might need GIFs, dynamic content, cached content, and so on. With content switching, one set of Web servers can hold the GIFs, while another can hold the dynamic content or cached content. The proxy switch can make parallel fetches and retrieve the entire page at a faster rate than would otherwise be possible.
- Ensures that requests with cookies or SSL session IDs are redirected to the same server to take advantage of persistence.

FIGURE 4-9 shows that the client's socket connection is terminated by the proxy function. The proxy retrieves as much of the URL as is needed to make a forwarding decision. In FIGURE 4-9, various URLs map to various server groups, which are VIP addresses. The proxy determines whether to forward the URL directly or pass it off to a server load-balancing function that is waiting for traffic destined to the server group. A minimal sketch of this dispatch follows.
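The following Python sketch illustrates the URL-to-server-group dispatch with hypothetical rules modeled on FIGURE 4-9; real content switches compile such rules into much faster matching structures:

RULES = [
    ("/SMA/stata/",  "servergroup1"),
    ("/SMA/dnsa/",   "servergroup2"),
    ("/SMA/statb/",  "servergroup3"),
    ("/SMA/cacheb/", "servergroup4"),
    ("/SMA/dyna/",   "servergroup5"),
]

def dispatch(url, default_group="servergroup1"):
    # First matching rule wins, as in most content-switch rule tables.
    for substring, group in RULES:
        if substring in url.lower():
            return group
    return default_group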

The proxy is configured with a VIP address, so the switch forwards all client requests destined to this VIP address to the proxy function. The proxy function also rewrites the IP header, particularly the source IP and port, so that the server sends back the requested data to the proxy, not to the client directly.

Network Address Translation


Network Address Translation (NAT) is a critical component for security and proper traffic direction. There are two basic types of NAT: half and full. Half NAT rewrites the destination IP address and MAC address to a redirected location, such as a Web cache, which returns the packet directly to the client because the source IP address is unchanged. In full NAT, the socket connection is terminated by a proxy, so the source IP and MAC addresses are changed to those of the proxy server. NAT serves the following purposes:

- Security: Prevents exposing internal private IP addresses to the public.
- IP Address Conservation: Requires only one valid exposed IP address to fetch Internet traffic for internal networks with invalid IP addresses.
- Redirection: Intercepts traffic destined to one set of servers and redirects it to another by rewriting the destination IP and MAC addresses. With half NAT-translated traffic, the redirected servers can send the response directly back to the clients because the original source IP has not been rewritten.

NAT is configured with a set of filters, usually 5-tuple Layer 3 rules. If the incoming traffic matches a certain filter rule, the packet IP header is rewritten, or another socket connection is initiated to the target server, which itself can be changed depending on the particular rule. NAT is often combined with other IP services such as SLB and content switching. The basic idea is that the client and servers are completely decoupled from each other: the NAT device manages the IP address conversions, while the partner service is responsible for another decision, such as determining which server will handle the request based on load or other rules. A toy sketch of the two translation styles follows.
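A toy Python illustration, with hypothetical field names, of the difference between the two NAT styles described above:

def half_nat(pkt, cache_ip, cache_mac):
    # Half NAT: rewrite only the destination; the source is untouched,
    # so the redirected server can reply straight to the client.
    pkt.update(dst_ip=cache_ip, dst_mac=cache_mac)
    return pkt

def full_nat(pkt, proxy_ip, proxy_port, server_ip):
    # Full NAT: the source is rewritten to the proxy's own address,
    # so both directions must pass through the proxy.
    pkt.update(dst_ip=server_ip, src_ip=proxy_ip, src_port=proxy_port)
    return pkt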

Quality of Service
As a result of emerging real-time and mission-critical applications, enterprise customers have realized that the traditional Best Effort IP network service model is unsuitable. The main concern is that poorly behaved flows adversely affect other flows that share the same resources, and it is difficult to tune resources to meet the requirements of all deployed applications. Quality of Service (QoS) measures the ability of network and computing systems to provide different levels of service to selected applications and their associated network flows. Customers that deploy mission-critical and real-time applications have an economic incentive to invest in QoS capabilities so that acceptable response times are guaranteed within certain tolerances.

The Need for QoS


To understand why QoS is critical, it helps to understand what has happened to enterprise applications over the past decade. In the late 1980s and early 1990s, client/server was the dominant architecture. The main principle involved a thick client and a local server, where 80 percent of the traffic went from the client to a local server and only 20 percent of the client traffic needed to traverse the corporate backbone. In the late 1990s, with the rapid adoption of Internet-based applications, the architecture changed to a thin client, with servers located anywhere and everywhere. This had one significant implication: the network became a critical shared resource, where priority traffic could be dangerously impacted by nonessential traffic. A common example is the difference between downloading images and processing sales orders. Different applications have different resource needs. The following section describes why different applications have different QoS requirements and why QoS is a critical capability for enterprise data centers and service providers whose customers drive the demand for QoS.

Classes of Applications
There are five classes of applications, each with different network and computing requirements:


- Data transfers
- Video and voice streaming
- Interactive video and voice
- Mission-critical
- Web-based

These classes are important in classifying, prioritizing, and implementing QoS. The following sections detail these five classes.

Data Transfers
Data transfers include applications such as FTP, email, and database backup. These transfers tend to have zero tolerance for packet loss and high tolerance for delay and jitter. Typical acceptable response times range from a few seconds for FTP transfers to hours for email. Bandwidth requirements on the order of Kbytes/sec are acceptable, depending on the file size, which keeps response times to a few seconds. Depending on the characteristics of the application (for example, the size of a file), disk I/O transfer times can contribute cumulatively to delays along with network bottlenecks.

Video and Voice Streaming


Video and voice streaming includes applications such as Apple QuickTime Streaming or Real Networks streaming video and voice products. Video and voice streams have low tolerance for packet loss and medium tolerance for delay and jitter. Typical acceptable response times are only a few seconds. This is possible because the server can pre-buffer multimedia data on the client to a certain degree. This buffer drains at a constant rate on the client side while simultaneously receiving bursty streaming data from the server with variations in delay. As long as the buffer can absorb all variations without draining to empty, the client receives a constant stream of video and voice. Typical bandwidth requirements are about one Mbyte/sec, depending on the frame rate, the compression/decompression algorithms, and the size of images. Disk I/O and CPU also contribute to delays: large MPEG files must be read from disk, and the compression/decompression algorithms consume CPU cycles.

Interactive Video and Voice


Interactive video and voice tends to have low tolerance for packet loss and low tolerance for delay and jitter. Bandwidth requirements are tremendous, growing rapidly with the number of simultaneous participants in the conference. Due to the interactive nature of the data being transferred, tolerances for delay and jitter are very low: as soon as one participant moves or talks, all other participants need to see and hear this change immediately. Response time requirements range from 250 to 500 milliseconds. This requirement is compounded by the bandwidth requirements, with each stream requiring a few Mbit/sec. In a conference of five participants, each participant pumps out a voice and video stream while at the same time receiving streams from the other participants.

Mission-Critical Applications
Mission-critical applications vary in bandwidth requirements, but they tend to have zero tolerance for packet loss. Depending on the application, bandwidth requirements are about one Kbyte/sec. Response times range from 500 ms to a few seconds. Server resource requirements (CPU, disk, and memory) vary, depending on the application.

Web-Based Applications
Web-based applications tend to have low bandwidth requirements (unless large image files are associated with the requested Web page) but grow in CPU and disk requirements because of dynamically generated Web pages and Web transaction-based applications. Response time requirements range from 500 milliseconds to one second.

Different classes of applications thus have different network and computing requirements. The challenge is to align the network and computing services with the applications' service requirements from a performance perspective.

Service Requirements for Applications


The two most common approaches used to satisfy the service requirements for applications are:

- Overprovisioning
- Managing and controlling

Overprovisioning allocates resources to meet or exceed peak load requirements. Depending on the deployment, overprovisioning can be viable if it is a simple matter of upgrading to faster LAN switches and NICs or adding memory, CPUs, or disks. However, overprovisioning might not be viable in certain cases, for example when dealing with relatively expensive long-haul WAN links or resources that are underutilized on average and busy only during short peak periods. Managing and controlling allows allocation of network and computing resources. Better management of existing resources attempts to optimize the utilization of what is already deployed, such as limited bandwidth, CPU cycles, and network switch buffer memory.

QoS Components
To give you enough background on the fundamentals and an implementation perspective, this section describes the overall network and systems architecture and identifies the sources of delays. It also explains why QoS is essentially about controlling network and system resources in order to achieve more predictable delays for preferred applications.

Implementation Functions
Three necessary implementation functions are:
- Traffic Rate Limiting and Traffic Shaping (token and leaky bucket algorithms): Network traffic is always bursty; the level of burstiness observed depends on the time resolution of the measurements. Rate limiting controls the burstiness of the traffic coming into a switch or server, while shaping refers to the smoothing of egress traffic. Although these two functions are opposites, the same class of algorithms is used to implement both (a token bucket sketch follows this list).
- Packet Classification: Individual flows must be identified and classified at line rate. Fast packet classification algorithms are crucial because every packet must be inspected and matched against a set of rules that determine the class of service the packet should receive. Packet classification has serious scalability issues: as the number of rules increases, classifying a packet takes longer.
- Packet Scheduling: To provide differentiated services, the packet scheduler must decide quickly which packet to schedule and when. The simplest packet scheduling algorithm is strict priority. However, this often does not work well because low-priority packets are starved and might never be scheduled.
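The token bucket family mentioned above can be sketched in a few lines of Python; the parameter names are illustrative. A policer drops a nonconforming packet, while a shaper queues it until enough tokens accumulate:

import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate            # token refill rate, bytes/sec (the CIR)
        self.burst = burst          # bucket depth, bytes (maximum burst)
        self.tokens = burst
        self.stamp = time.monotonic()

    def conforms(self, nbytes):
        now = time.monotonic()
        # Refill for the elapsed interval, capped at the bucket depth.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if nbytes <= self.tokens:   # conforming packet: spend tokens
            self.tokens -= nbytes
            return True
        return False                # police (drop) or queue to shape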

QoS Metrics
QoS is defined by a multitude of metrics. The simplest is bandwidth, which can be conceptually viewed as a logical pipe within a larger pipe. However, actual network traffic is bursty, so a fixed bandwidth allocation would be wasteful: at one instant a flow might use 1 percent of its pipe, while at another it might need 110 percent of the allocated pipe. To reduce waste, certain burst metrics are used to determine how large a burst and how long a burst can be tolerated. Other important metrics that directly impact the quality of service include packet loss rate, delay, and jitter (variation in delay). The network and computing components that control these metrics are described later in this chapter.


Network and Systems Architecture Overview


To fully understand where QoS fits into the overall picture of network resources, it is useful to look at the details of a complete network path traversal, starting from the point where a client sends a request, through the various network devices it traverses, to the destination where the server processes the request. Different classes of applications have different characteristics and requirements (see The Need for QoS on page 92 for additional details). Because several federated networks with different traffic characteristics are combined, end-to-end QoS is a complex issue.

FIGURE 4-10 illustrates a high-level overview of the components involved in an end-to-end packet traversal for an enterprise that relies on a service provider. Two different paths are shown; both originate from the client and end at a server.


FIGURE 4-10    Overview of End-to-End Network and Systems Architecture. (Remote access users reach the Internet over dial-up, DSL, cable, or mobile wireless links, through Tier 2 ISP access networks, Tier 1 ISP backbone networks, and peering points (NAP, MAE), to an enterprise network behind a CSU/DSU and firewall. Switch delay comprises queueing, scheduling, packet classification and route lookup times, congestion, and backplane delay; link delay depends on propagation and line rate; server delay depends on CPU, memory, and disk.)

Path A-H is a typical scenario, where the client and server are connected to different local ISPs and traffic must traverse different ISP networks. Multiple Tier 1 ISPs can be traversed, connected together by public peering points such as MAE-East or private peering points such as Sprint's NAP. Path 1-4 shows an example of the client and server connected to the same local Tier 2 ISP, when both client and server are physically located in the same geographical area. In either case, the majority of the delays are attributed to the switches. In the Tier 2 ISPs, the links from the end-user customers tend to be slow links, but the Tier 2 ISP aggregates many of them, hoping that not all subscribers will use the links at the same time. If they do, packets get buffered up and eventually are dropped.

Implementing QoS
You can implement QoS in many different ways. Each domain has control over its resources and can implement QoS on its portion of the end-to-end path using different technologies. Two domains of implementation are enterprises and network service providers.

- Enterprise: Enterprises can control their own networks and systems. From a local Ethernet or token ring LAN perspective, IEEE 802.1p can be used to mark frames according to priorities. These marks allow the switch to offer preferential treatment to certain flows across VLANs. For computing devices, there are facilities that allow processes to run at higher priorities, thus obtaining differentiated services from a process computing perspective.
- Network Service Provider (NSP): The NSP aggregates traffic and either forwards it within its own network or hands it off to another NSP. The NSP can use technologies such as DiffServ or IntServ to prioritize traffic handling within its networks. Service Level Agreements (SLAs) are required between NSPs to obtain a certain level of QoS for transit traffic.

ATM QoS Services


It is interesting that NSPs implement QoS at both the IP layer and the asynchronous transfer mode (ATM) layer. Most ISPs still have ATM networks that carry IP traffic. ATM itself offers six types of QoS services:
- Constant Bit Rate (CBR): Provides constant bandwidth, delay, and jitter throughout the life of the ATM connection.
- Variable Bit Rate-Real Time (VBR-rt): Provides constant delay and jitter, but variations in bandwidth.
- Variable Bit Rate-Non Real Time (VBR-nrt): Provides variable bandwidth, delay, and jitter, but a low cell loss rate.
- Unspecified Bit Rate (UBR): Provides Best Effort service with no guarantees.
- Available Bit Rate (ABR): Provides no guarantees and expects the applications to adapt according to network availability.
- Guaranteed Frame Rate (GFR): Provides some minimum frame rate, delivers entire frames or none, and is used with ATM Adaptation Layer 5 (AAL5).

One of the main difficulties in providing an end-to-end QoS solution is that so many private networks must be traversed, and each network has its own QoS implementations and business objectives. The Internet is constructed so that networks interconnect or peer with other networks. One network might need to forward traffic of other networks. Depending on the arrangements, competitors might not forward the traffic in the most optimal manner. This is what is meant by business objectives.

Sources of Unpredictable Delay


From a system computing perspective, unpredictable delays are often due to limited CPU resources or disk I/O latencies, both of which degrade under heavy load. From a network perspective, many components add up to the cumulative end-to-end delay. This section describes some of the important components that contribute to delay and explains the choke points at the access networks, where traffic is aggregated and forwarded to a backbone or core. Service providers oversubscribe their networks to increase profits, hoping that not all subscribers will access the network at the same time.
FIGURE 4-11 was constructed by taking path A-G from FIGURE 4-10 and projecting it onto a Time-Distance plane. This is a typical Web client accessing the Internet site of an enterprise. The vertical axis indicates the time that elapses for a packet to travel a certain link segment. The horizontal axis indicates the link segment that the packet traverses. At the top are the network devices, with vertical lines projecting down to the distance axis to show the corresponding link segment. In this illustration, an IP packet's journey starts when a user clicks on a Web page. The HTTP request maps first to a TCP three-way handshake to create a socket connection. The first TCP packet is the initial SYN packet, which first traverses segment 1 and is usually quite slow because this link typically runs at about 30 kbit/sec over a 56 kbit/sec modem, depending on the quality and distance of the last mile wiring.


FIGURE 4-11    One-Way End-to-End Packet Data Path Traversal. (Path A-G of FIGURE 4-10 projected onto a Time-Distance plane, not to scale: a dial-up 56-Kbps POTS link into a Tier 2 ISP access network, T1-OC3 links into a Tier 1 MPLS core of OC-48 links, T1-OC3 links into another Tier 2 ISP access network, and a T1 leased line into the enterprise Ethernet, with time delays shown for each of the 16 segments.)

Network Delay is composed of the following components:


- Propagation delay, which depends on the media and distance
- Line rate, which primarily depends on the link rate and the loss rate or Bit Error Rate (BER)
- Node transit delay, the time it takes a packet to traverse an intermediate network switch or router

The odd-numbered links of FIGURE 4-11 represent the link delays. Note that segment and link are used interchangeably.
- Link 1, in a typical deployment, is the copper wire, or last mile connection, from the home or Small Office/Home Office (SOHO) to the Regional Bell Operating Company (RBOC). This is how a large portion of consumer clients connect to the Internet.
- Link 3 is an ATM link inside the carrier's internal network, usually a Metropolitan Area Network link.
- Link 5 connects the Tier 2 ISP to the Tier 1 ISP, which provides the backbone network. This link is a larger pipe, which can range from T1 to OC-3 and is growing.
- Link 7 is the core network of the backbone Tier 1 provider. Typically, this core is extremely fast, consisting of DS3 links (the same ones used by IDT) or more modern links (such as the OC-48 links used by vBNS), with some providers beta testing OC-192 links running Packet over SONET and eliminating the inefficiencies of ATM altogether.
- Links 9 and 11 are a reflection of Links 5 and 3.
- Link 13 is a typical leased line, a T1 link to the enterprise. This is how most enterprises connect to the Internet. However, after the 1996 Telecommunications Act, competitive local exchange carriers (CLECs) emerged, providing superior service offerings at lower prices. Providers such as Qwest and Telseon provide gigabit Ethernet connectivity at prices that are often below OC-3 costs.
- Link 15 is the enterprise's internal network. A channel service unit/data service unit (the channel service unit on the time-division multiplexing side, the data service unit on the data side) terminates the T1 line and converts it to Ethernet.

The even-numbered links of FIGURE 4-11 represent the delays experienced in switches. These delays are composed of switching delays, route lookups, packet classification, queueing, packet scheduling, and internal switch forwarding delays, such as sending a packet from the ingress unit through the backplane to the egress unit.

As FIGURE 4-11 illustrates, QoS is needed to control access to shared resources during episodes of congestion. The shared resources are servers and specific links. For example, Link 1 is a dedicated point-to-point link, where a dedicated voice channel is set up at call time with a fixed bandwidth and delay. Link 13 is a permanent circuit, as opposed to a switched dedicated circuit, but a digital line.

QoS is usually implemented in front of a congestion point, restricting the traffic that is injected into it. Enterprises have QoS functions that restrict the traffic being injected into their service provider, and the ISP has QoS functions that restrict the traffic injected into its core. Tier 2 ISPs oversubscribe their bandwidth capacities, hoping that not all their customers will need bandwidth at the same time. During episodes of congestion, switches buffer packets until they can be transmitted. Links 5 and 9 are boundary links that connect two untrusted parties: the Tier 2 ISP must control the traffic injected into the network that must be handled by the Tier 1 ISP's core network, and Tier 1 polices the traffic that customers inject at Links 5 and 9. At the enterprise, many clients need to access the servers.


QoS-Capable Devices
This section describes the internals of QoS-capable devices. One of the difficulties of describing QoS implementations is the number of different perspectives that can be used to describe all the features. The scope of this section is limited to the priority-based model and the functional components related to implementing it. The priority-based model is the most common implementation approach because of its scalability advantage.

Implementation Approaches
There are two completely different approaches to implementing a QoS-capable IP switch or server.

The Reservation Model, also known as Integrated Services/RSVP or ATM, is the original approach, requiring applications to signal their traffic handling requirements. After signaling, each switch in the path from source to destination reserves resources, such as bandwidth and buffer space, that either guarantee the desired QoS service or ensure that the desired service is provided. This model is not widely deployed because of scalability limitations: each switch has to keep track of this information for every flow, and as the number of flows increases, the amount of memory and processing required increases as well.

The Precedence Priority Model, also known as Differentiated Services, IP Precedence TOS, or IEEE 802.1p/Q, takes aggregated traffic, segregates the traffic flows into classes, and provides preferential treatment of classes. Only during episodes of congestion are noticeable differentiated services effects realized. Packets are marked or tagged according to priority. Switches then read these markings and treat the packets according to their priority. The interpretation of the markings must be consistent within the autonomous domain.

Functional Components: High-Level Overview


Implementation Functions on page 95 describes the three high-level QoS components: traffic shaping, packet classification, and packet scheduling. This section describes these QoS components in further detail. A QoS-capable device consists of the following functions:

- Admission Control accepts or rejects access to a shared resource. This is a key component for Integrated Services and ATM networks. Admission control ensures that resources are not oversubscribed, which makes it more expensive and less scalable than the other components.
- Congestion Management prioritizes and queues traffic access to a shared resource during congestion periods.
- Congestion Avoidance prevents congestion early, using preventive measures. Algorithms such as Weighted Random Early Detection (WRED) exploit TCP's congestion avoidance algorithms to reduce the traffic injected into the network, preventing congestion.
- Traffic Shaping reduces the burstiness of egress network traffic by smoothing the traffic and then forwarding it out the egress link.
- Traffic Rate Limiting controls ingress traffic by dropping packets that exceed burst thresholds, thereby reducing device resource consumption such as buffer memory.
- Packet Scheduling schedules packets out the egress port so that differentiated services are effectively achieved.

The next section describes the modules that implement these high-level functions in more detail.

QoS Profile
The QoS profile contains information, entered by the network or systems administrator, that defines classes of traffic flows and how those flows should be treated in terms of QoS. For example, a QoS profile might specify that Web traffic from the CEO should be given the EF DiffServ marking, a Committed Information Rate (CIR) of 1 Mbit/sec, a Peak Information Rate (PIR) of 5 Mbit/sec, an Excess Burst Size (EBS) of 100 Kbyte, and a Committed Burst Size (CBS) of 50 Kbyte. This profile defines the flow and the level of QoS that the Web traffic from the CEO should receive. The profile is compared against the actual measured traffic flow. Depending on how the actual traffic flow compares against the profile, the type of service (TOS) field of the IP header is re-marked or an internal tag is attached to the packet header, which controls how the packet is handled inside the device.
FIGURE 4-12 shows the main functional components involved in delivering prioritized differentiated services, whether in a switch or a server: the packet classification engine, the metering and marker functions, policing/shaping, the IP forwarding module, queuing, congestion control management, and the packet scheduling function.


FIGURE 4-12    QoS Functional Components. (Flows enter the packet classification engine, then pass through meters and markers driven by QoS profiles, policer/shapers, IP forwarding based on the forwarding information base, queuing and congestion control, and finally the packet scheduler; a control and management plane configures these data plane components.)

Deployment of Data and Control Planes


Typically, if the example in FIGURE 4-12 were deployed on a network switch, there would be an ingress board and an egress board connected together through a backplane. The same functions can also be deployed on a server, implemented in the network protocol stack, either in the IP module, adjacent to the IP module, or possibly on the network interface card, the last offering superior performance due to an ASIC/FPGA implementation. There are two planes:

- The Data Plane operates the functional components that actually read and write the IP header.
- The Control Plane operates the functional components that govern how the functional units behave, taking information from the network administrator, directly or indirectly.

Packet Classifier
The packet classifier is the functional component responsible for identifying a flow and matching it with a filter. The filter is composed of the source and destination IP addresses, ports, protocol, and type of service field, all in the IP header. The filter is also associated with information that describes the treatment of matching packets. Aggregate ingress traffic flows are compared against these filters. Once a packet header is matched with a filter, the associated QoS profile is used by the meter, marker, policing, and shaping functions.
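A linear-scan Python sketch of 5-tuple classification follows; the rule format is hypothetical, with None acting as a wildcard. The scan also makes the scalability problem visible: lookup time grows with the number of rules, which is why hardware classifiers use TCAMs instead:

RULES = [
    # (src_ip, dst_ip, src_port, dst_port, proto) -> QoS profile
    ((None, "120.141.0.19", None, 80, "tcp"), "gold"),
    ((None, None, None, None, None), "best-effort"),   # catch-all
]

def classify(pkt):
    fields = (pkt["src_ip"], pkt["dst_ip"],
              pkt["src_port"], pkt["dst_port"], pkt["proto"])
    for pattern, profile in RULES:
        if all(p is None or p == f for p, f in zip(pattern, fields)):
            return profile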

Metering
The metering function compares the actual traffic flow against the QoS profile definition. FIGURE 4-13 illustrates the different measurement points. On average, the input traffic arrives at 100 Kbyte/sec. However, for a short period of time, the switch or server allows the input flow rate to reach 200 Kbyte/sec for one second, which computes to a buffer of 200 Kbyte. For the time period t=3 to t=5, the buffer drains at a rate of 50 Kbyte/sec as long as the input packets arrive at 50 Kbyte/sec, keeping the output constant. Another, more aggressive burst then arrives at the rate of 400 Kbyte/sec, filling up the 200-Kbyte buffer by t=5.5 sec. From t=5.0 to 5.5, however, 50 Kbyte are drained, leaving 150 Kbyte at t=5.5 sec. This buffer then drains for 1.5 sec at a rate of 100 Kbyte/sec. This example is simplified, so the real figures need to be adjusted to account for the fact that the buffer is not completely filled at t=5.5 sec because of the concurrent draining. Notice that the area under the graph, or the integral, represents the approximate number of bytes in the buffer, and that bursts appear as steeply sloped lines above the dotted line representing the average rate, or CIR.


FIGURE 4-13    Traffic Burst Graphic. (Kbytes versus time from t=0 to t=9 sec: CIR = 100 Kbyte/sec, PIR = 400 Kbyte/sec, CBS = 200 Kbyte, EBS = 200 Kbyte; bursts above the CIR must be buffered, and the buffered packets drain between bursts.)

Marking
Marking is tied in with metering: when the metering function compares the actual measured traffic against the agreed QoS profile, the traffic can be handled appropriately. The meter measures the actual burst rate and the amount of buffered data against the CIR, PIR, CBS, and EBS. The Two Rate Three Color Marker (trTCM) is a common algorithm that marks packets green if the actual traffic is within the agreed-upon CIR. If the actual traffic is between CIR and PIR, the packets are marked yellow. Finally, if the actual metered traffic is at PIR or above, the packets are marked red. The device then uses these markings in the policing and shaping functions to determine how the packets are treated (for example, whether they should be dropped, shaped, or queued in a lower-priority queue).
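A simplified two-bucket sketch of this marking logic, reusing the TokenBucket class from the Implementation Functions sketch; this captures the color-blind idea only, not RFC 2698 verbatim:

def mark(nbytes, cir_bucket, pir_bucket):
    # cir_bucket is sized by CIR/CBS, pir_bucket by PIR/EBS.
    if not pir_bucket.conforms(nbytes):
        return "red"        # above PIR
    if cir_bucket.conforms(nbytes):
        return "green"      # within the committed rate
    return "yellow"         # between CIR and PIR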


Policing and Shaping


The policing functional component uses the metering information to determine whether ingress traffic should be buffered or dropped. Shaping sends packets out at a constant rate, buffering them as necessary to achieve that constant output rate. The token bucket algorithm is commonly used both to shape egress traffic and to police ingress traffic.

IP Forwarding Module
The IP forwarding module inspects the destination IP address and determines the next hop using the forwarding information base. The forwarding information base is a set of tables populated by routing protocols and/or static routes. The packet is then forwarded internally to the egress board, which places the packet in the appropriate queue.

Queuing
Queuing encompasses two functions. The first is congestion control, which limits the number of packets queued up in a particular queue (see the following section). The second is differentiated services: queues are serviced by the packet scheduler in a manner that provides preferential treatment to preselected flows, by servicing packets in certain queues more often than others.

Congestion Control
There is a finite amount of buffer space or memory, so the number of packets that can be buffered within a queue must be controlled. The switch or server forwards packets at line rate. However, when a burst occurs, or if the switch is oversubscribed and congestion occurs, packets are buffered. There are several packet discard algorithms. The simplest is Tail Drop: once the queue fills up, any new packets are dropped. This works well for UDP packets but has severe disadvantages for TCP traffic. Tail Drop causes already-established TCP flows to go quickly into congestion avoidance mode, exponentially dropping the rate at which packets are sent. The resulting problem is called global synchronization: all TCP flows simultaneously decrease and then increase their rates. What is needed is to have some of the flows slow down so that the other flows can take advantage of the freed-up buffer space. Random Early Detection (RED) is an active queue management algorithm that drops randomly selected packets before the buffers fill up, reducing global synchronization.


FIGURE 4-14 describes the RED algorithm. Looking at line C on the far right, when the average queue occupancy goes from empty up to 75 percent full, no packets are dropped. However, as the queue grows past 75 percent, the probability that random packets are discarded increases quickly until the queue is full, where the probability reaches certainty. Weighted Random Early Detection (WRED) takes RED one step further by giving some packets different thresholds at which the probability of discard starts. As illustrated in FIGURE 4-14, line A starts to have random packets dropped at only 25 percent average queue occupancy, making room for the higher-priority flows B and C. A small sketch of the per-packet drop decision follows.
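In Python, the drop decision for one arriving packet might look like the following sketch; the 75 percent threshold corresponds to line C in FIGURE 4-14, and a WRED class would simply use a lower threshold, such as 25 percent for line A:

import random

def red_drop(avg_occupancy, min_th=0.75, max_p=1.0):
    # Below the threshold nothing is dropped; above it, the drop
    # probability ramps linearly toward max_p at a full queue.
    if avg_occupancy < min_th:
        return False
    p = max_p * (avg_occupancy - min_th) / (1.0 - min_th)
    return random.random() < p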

FIGURE 4-14    Congestion Control: RED, WRED Packet Discard Algorithms. (Drop probability from 0 to 1.0 versus average queue occupancy: line A begins dropping at 25 percent occupancy, line B at 50 percent, and line C at 75 percent, with each reaching certainty at a full queue.)


Packet Scheduler
The packet scheduler is one of the most important QoS functional components. The packet scheduler pulls packets from the queues and sends them out the egress port or forwards them to the adjacent STREAMS module, depending on the implementation. Several packet scheduling algorithms service the queues in different manners.

Weighted Round-Robin (WRR) scans each queue and, depending on the weight assigned to that queue, allows a certain number of packets to be pulled from it and sent out. The weights represent a certain percentage of the bandwidth. In actual practice, unpredictable delays are still experienced because a large packet at the front of a queue can hold up smaller packets behind it.

Weighted Fair Queuing (WFQ) is a more sophisticated packet scheduling algorithm that computes the time each packet arrives and the time required to actually send out the entire packet. WFQ is thus able to handle varying-sized packets and optimally select packets for scheduling. WFQ is work-conserving, meaning that the scheduler never sits idle while packets are waiting. WFQ can also put a bound on the delay, as long as the input flows are policed and the lengths of the queues are bounded.

In Class-Based Queuing (CBQ), used in many commercial products, each queue is associated with a class, where higher classes are assigned a higher weight, translating to relatively more service time from the scheduler than the lower-priority queues receive. Competitive product offerings by Packeteer and Allot are hardware solutions that sit between the clients and servers. These products offer pure QoS solutions, but they use the term policy for a specific QoS rule, and they are limited in their flexibility and integration with policy servers. A short WRR sketch follows.
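Weighted round-robin is the simplest of these schedulers to sketch in Python; the queues and weights below are illustrative, and the output shows how the weights approximate bandwidth shares per scan:

from collections import deque

def wrr_scan(queues, weights):
    # Per scan, pull up to `weight` packets from each queue in turn.
    sent = []
    for q, w in zip(queues, weights):
        for _ in range(min(w, len(q))):
            sent.append(q.popleft())
    return sent

high = deque(["h1", "h2", "h3"])
low = deque(["l1", "l2"])
print(wrr_scan([high, low], [3, 1]))   # ['h1', 'h2', 'h3', 'l1']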

Secure Sockets Layer


In 1994, Netscape Communications proposed SSL v1 and shipped the first products with SSL v2. SSL v3 was introduced to address some of the limitations of SSL v2 in the areas of cryptographic security and functionality. Transport Layer Security (TLS) was created as an open standard to prevent any one company from controlling this important technology. Even though Netscape was granted a patent for SSL, SSL is now the de facto standard for secured Web transactions. This section provides a brief overview of the SSL protocol and then describes strategies for deploying SSL processing in the design of data center network architectures.


SSL Protocol Overview


The basic operation of SSL includes the following phases: 1. Initial Full Handshake The client and server authenticate each other, exchange keys, negotiate preferred cryptographic algorithms (such as RSA or 3DES) and perform a CPU-intensive public key cryptographic mathematical computation. This full handshake can occur again during the life of a client server communication if the session information is not cached or reused and needs to be regenerated. More details are described below in the Handshake Layer. 2. Data Transfer PhaseBulk Encryption Once the session is established, data is authenticated and encrypted using the master secret. A typical Web request can span many HTTP requests, requiring that each HTTP session establish an individual SSL session. The resulting higher performance impact might not outweigh the marginal incremental security benefit. Hence, a technique called SSL resumption can be exploited to save the session information for a particular client connection that has already been authenticated at least once. SSL is composed of two sublayers:
- Record Layer: This layer operates in two directions:

  - Downstream: The record layer receives clear messages from the handshake layer. It fragments and compresses the messages, applies the Message Authentication Code (MAC), and encrypts and encapsulates them before sending the result downstream to the TCP layer.

  - Upstream: The record layer receives TCP packets from the TCP layer and decapsulates, decrypts, runs a MAC verification on, uncompresses, and reassembles them before sending the messages to higher layers.

- Handshake Layer: This layer exchanges messages between client and server in order to exchange public keys, negotiate and advertise capabilities, and agree on:

  - SSL version
  - Cryptographic algorithm
  - Cipher suite

  The cipher suite contains the key exchange method, the data transfer cipher, and the Message Digest for the Message Authentication Code (MAC). SSL 3.0 supports a variety of key exchange algorithms.
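As a concrete illustration of SSL resumption, the following sketch uses the OpenSSL API to cache a negotiated session and offer it on a second connection so the server can perform an abbreviated handshake. This is a minimal sketch, not the library or appliance code discussed in this chapter; error handling is omitted, and fd and fd2 are assumed to be already-connected TCP sockets.

#include <openssl/ssl.h>

void resumed_connection(int fd, int fd2)
{
    SSL_library_init();                /* one-time library initialization */
    SSL_CTX *ctx = SSL_CTX_new(SSLv23_client_method());

    /* First connection: pay the full handshake cost once. */
    SSL *ssl = SSL_new(ctx);
    SSL_set_fd(ssl, fd);
    SSL_connect(ssl);

    /* Save the negotiated session (master secret, cipher suite). */
    SSL_SESSION *sess = SSL_get1_session(ssl);
    SSL_shutdown(ssl);
    SSL_free(ssl);

    /* Second connection: offer the cached session so the server can
     * resume it with an abbreviated, non-CPU-intensive handshake. */
    SSL *ssl2 = SSL_new(ctx);
    SSL_set_session(ssl2, sess);       /* must precede SSL_connect() */
    SSL_set_fd(ssl2, fd2);
    SSL_connect(ssl2);

    /* ... transfer data, then shut down and free as above ... */
    SSL_SESSION_free(sess);
    SSL_CTX_free(ctx);
}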

FIGURE 4-15 illustrates a condensed overview of the SSL protocol exchanges.


FIGURE 4-15   High-Level Condensed Protocol Overview (client and server handshake messages, record sublayer operations, and the underlying TCP/IP stack)

Once the first set of messages is successfully completed, an encrypted communication channel is established. The following sections describe the differences between using a pure software solution and an SSL accelerator appliance in terms of packet processing and throughput.


This section does not discuss SSL in depth. Its purpose is to describe the different network architectural deployment scenarios you can apply to SSL processing; the following sections describe various approaches to scaling SSL processing capability from a network architecture perspective.

SSL Acceleration Deployment Considerations


One of the fundamental limitations of SSL is performance. When SSL is added to a Web server, performance drops dramatically because of the strain that the mathematical computations and the constant session setup place on the CPU. There are three common SSL approaches:

- Software-SSL libraries: This approach uses the bundled SSL libraries and offers the most cost-effective option for processing SSL transactions.

- Crypto accelerator board: This approach can offer a massive improvement in SSL processing performance for certain types of SSL traffic. Conclusions Drawn from the Tests on page 121 suggests when best to use the Sun Crypto Accelerator 1000 board, for example.

- SSL accelerator appliance: This solution might have a high initial cost, but it proves to be very effective and manageable for large-scale SSL Web server farms. Conclusions Drawn from the Tests on page 121 suggests when best to deploy an appliance such as Netscaler or ArrayNetworks.

There are several deployment options for SSL acceleration. This section describes where it makes sense to deploy the different options. It is important to consider certain characteristics, including:

- The level or degree of security required
- The number of client SSL transactions
- The volume of bulk encrypted data to be transferred in the secure channel
- Cost
- The number of horizontally scaled SSL Web servers

Software-SSL Libraries: Packet Flow


FIGURE 4-16 shows the packet flow for a software-based approach to SSL processing. Although the path seems direct, SSL processing is bottlenecked by the millions of CPU cycles consumed in processing cryptographic algorithms such as RSA and 3DES.


FIGURE 4-16   Packet Flow for Software-Based Approach to SSL Processing

The Crypto Accelerator Board: Packet Flow


FIGURE 4-17 shows the packet flow using a PCI accelerator card for SSL acceleration. In this case, the incoming encrypted packet reaches the SSL libraries. The SSL libraries maintain the various session information and security associations, but the mathematical computations are offloaded to the PCI accelerator card, which contains an ASIC that can compute the cryptographic algorithms in very few clock cycles. However, there is an overhead in transferring data to the card, as the PCI bus must first be arbitrated and traversed. Note that for small data transfers, this PCI transfer overhead might outweigh the benefit of the cryptographic computation acceleration offered by the card. Further, it is important to make sure the PCI slot used is 64-bit, 66-MHz. Using a 32-bit slot could have a performance impact.
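The trade-off just described can be captured as a simple size-based dispatch. Everything in this sketch is hypothetical (the threshold value and both function names); the 1000-byte figure is only illustrative, in line with the message-size observation reported for the Sun Crypto Accelerator 1000 tests later in this chapter.

#include <stddef.h>

/* Hypothetical back ends: CPU-based crypto versus the PCI board. */
extern void encrypt_in_software(const unsigned char *buf, size_t len);
extern void encrypt_on_board(const unsigned char *buf, size_t len);

#define OFFLOAD_MIN_BYTES 1000   /* illustrative threshold only */

void encrypt_dispatch(const unsigned char *buf, size_t len)
{
    if (len < OFFLOAD_MIN_BYTES)
        encrypt_in_software(buf, len);  /* no PCI round trip to amortize */
    else
        encrypt_on_board(buf, len);     /* ASIC wins once transfers are large */
}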

FIGURE 4-17   PCI Accelerator Card Approach to SSL Processing: Partial Offload


SSL Accelerator Appliance: Packet Flow


FIGURE 4-18 illustrates how a typical SSL accelerator appliance can be exploited to reduce load on servers by offloading front-end client SSL processing. Commercial SSL accelerators at the time of this writing are all PC-based boxes with PCI accelerator cards, with the operating system and network protocol stack optimized for SSL processing. The major benefit to the backend servers is that CPU cycles are freed up by not having to process thousands of client SSL transactions. Depending on the customer's requirements, the accelerator can either offload all SSL processing and forward cleartext to the server or terminate all client SSL connections and maintain only one SSL connection to the target server.


FIGURE 4-18   SSL Appliance Offloads Front-End Client SSL Processing (the appliance terminates thousands of client connections and maintains a persistent SSL connection to each server)


SSL Performance Tests


To gain a better understanding of the trade-offs of the three approaches to SSL acceleration, we ran various tests using the Sun Crypto Accelerator 1000 board, Netscaler 9000 SSL Accelerator appliance, and ArrayNetworks SSL accelerator appliance. Due to limited time and resources, tests were selected that enabled us to compare key attributes among approaches. In the first test, we compare raw SSL processing differences between SSL libraries and an appliance.

Test 1: SSL Software Libraries versus SSL Accelerator Appliance (Netscaler 9000)


In this test, we looked closely at CPU utilization and network traffic with a software solution. We found a tremendous load on the CPU, which was completely pinned; it took over two minutes to complete 100 SSL transactions. We then looked at CPU utilization and network traffic using an SSL appliance with the exact same server and client used in the first example. With this setup, it took under one second to complete 100 SSL transactions. The main reason the SSL appliance is so much faster is that the appliance maintains a few long-lived SSL connections to the server. Hence the server is less burdened with recalculating cryptographic computations, which are CPU intensive, as is setting up and tearing down SSL sessions. The appliance terminates the SSL session between the client and the appliance and then reuses the SSL connection at the backend with the servers.

FIGURE 4-19   SSL Test Setup with No Offload (benchmark client deepak2, 129.146.138.98, connected through an Alteon switch to a Sun ONE Web Server with SSL libraries on deepak, 129.146.138.99)


Test 1 (A): Software-SSL Libraries


We used an industry-standard benchmark load generator on the client to generate SSL traffic. Both tests used the same 100-megabyte server file. One hundred requests were injected into the SSL Web server with a concurrency of 10 requests.

Test 1 (B): SSL Accelerator Appliance


For this test, the client traffic was directed through the Netscaler 9000 SSL Accelerator device. Both tests ran on the same 100-megabyte server file. The performance gains using the SSL offload device were significant. Some of the key reasons include:

- Hardware SSL implementation, including a hardware coprocessor for the mathematically intensive computations of cryptographic algorithms.
- Reuse of the backend SSL tunnel. By keeping one SSL tunnel alive and reusing it, the appliance achieves a massive server SSL offload.

We ran the benchmark load generator on the client (deepak2). The client points to the VIP on the Netscaler, which terminates one side of the SSL connection. The Netscaler then reuses the backend SSL connection. This is also more secure because the client is unaware of the backend servers and hence can do less damage:

# abc -n 100 -c 10 -v 4 http://129.146.138.52:443/100m.file1 >./netscaler100mfel1n100c10.softwareonly

612 packets were transferred to complete 100 SSL handshakes in less than one second.

Test 2: Sun Crypto Accelerator 1000 Board


In this test set, we leveraged the work done by the Performance Availability and Engineering group on performance tests of the Sun Crypto Accelerator 1000 board. The test setup consisted of a Sun Fire 6800 with eight 900-MHz UltraSPARC III processors and a single Sun Crypto Accelerator 1000 board. FIGURE 4-20 shows that throughput with the software approach increases linearly as the number of processors increases, versus near-constant performance at 500 Mbit/sec using the Sun Crypto Accelerator 1000 board. Tests show that the accelerator board yields its ideal benefit when the minimum message size exceeds 1000 bytes. If the messages are too small, the benefit of the card acceleration does not outweigh the overhead of diverting SSL computations from the CPU to the board and back.


FIGURE 4-20   Throughput Increases Linearly with More Processors (SSL throughput in Mbit/sec versus number of 900-MHz UltraSPARC III CPUs)

Test 3: SSL Software Libraries versus SSL Accelerator Appliance (Array Networks)


In this set of tests, we performed more detailed tests to better understand not only the value of the SSL appliance but also the impact of threads and file size. FIGURE 4-21 shows the basic test setup for the SSL software test, where a Sun Enterprise 450 server used as the client was sufficient to saturate the Sun Blade server.

FIGURE 4-21   SSL Test Setup for SSL Software Libraries (E450 client with four 450-MHz CPUs, Foundry BigIron 4000 switch, 650-MHz B1600 Sun Blade server)

FIGURE 4-22 shows the SSL appliance tests. Larger clients were required to saturate the servers. We used two additional Sun Fire 3800 servers in addition to the Enterprise 450 server. The reason for this was that the SSL appliance terminated the SSL connection, performed all SSL processing, and maintained very few socket connections to the backend servers, thereby reducing the load on the servers.


FIGURE 4-22   SSL Test Setup for an SSL Accelerator Appliance (the E450 plus two 8-CPU, 900-MHz SF3800 clients drive the Array Networks SSL appliance in front of the B1600 Sun Blade server)

FIGURE 4-23 suggests that there is a sweet spot for the number of threads used by the client load generator: after a certain point, performance drops. This suggests that software-only SSL processing benefits from increased threads only up to a certain maximum. These are initial tests and not comprehensive by any means; our intent is to show that this is one potentially important configuration consideration, which might be beyond the scope of pure design.

FIGURE 4-23   Effect of Number of Threads on SSL Performance (SSL fetches per second versus 10 to 50 load-generator threads)

FIGURE 4-24 shows the impact of file size on SSL performance. Note that these are SSL-encrypted bulk files. The SSL appliance has a dramatic impact on increasing SSL throughput for large files. However, the number of transactions decreases in direct proportion to the file size. The link was a 1-gigabit pipe, which can support 125 Mbyte/sec of throughput. The results show that the limiting factor is actually not the network pipe.
FIGURE 4-24   Effect of File Size on SSL Performance (fetches per second and kilobytes transferred per second versus file sizes from 1 KB to 1 MB)

Conclusions Drawn from the Tests


The software solution is best used in situations that require relatively low SSL transaction throughput, which is typical for sign-on and credit card Web-based transactions, where only certain aspects require SSL encryption. The PCI accelerator card dramatically increases performance at relatively low cost. The PCI card also offers true end-to-end security and is often desirable for highly secure environments.


The accelerator device can be installed in an existing infrastructure and can offer very good performance. The servers do not need to be modified, so only one device must be managed for SSL acceleration. Another benefit is that the appliance exploits the fact that not every server will be loaded with SSL at the same time; from a utilization standpoint, therefore, an appliance is more economically feasible.


CHAPTER 5

Server Network Interface Cards: Datalink and Physical Layer


This chapter discusses the networking technologies available through Sun Microsystems that are regularly found in a data center. In many cases, a high-level overview of the networking technology is provided for completeness. The technologies covered include:
- Token Ring Networks
- Fiber Distributed Data Interface (FDDI) Networking
- Ethernet Networking

Token Ring Networks


A token ring network is a physically star-wired local area network that interconnects various devices such as personal computers and workstations into a logical ring configuration. The cabling system consists of wiring concentrators, connectors, and end stations. The Sun token ring protocol conforms to the IEEE 802.5-1988 standard. Token ring refers to the media access control (MAC) portion of the data link layer (DLC) as well as the entire physical layer (PHY).

Access to the ring is controlled by a bit pattern, called a token, that circulates from station to station around the ring. Any station can use the ring. Capturing the token means that a station changes the token bit pattern so that it is no longer that of a token but that of a data frame. The sending station then sends its data within the information field of the frame. The frame also includes the address of the destination station. The frame is passed from station to station until it arrives at the proper destination. At the destination station, the frame is altered to indicate that the address was recognized and that the data was copied. The frame is then passed back to the original sending station, where the sending station checks to see that the destination station copied the data. If there is no more data to be sent, the sending station alters the frame's bit configuration so that it once again functions as a free token available to another station on the ring.

If a station fails, it is dynamically switched out of the ring, and the ring is automatically reconfigured. When the station has been repaired, the ring is again automatically reconfigured to include the added station.

FIGURE 5-1   Token Ring Network (station A waits for the free token, changes it to busy, and sends data; stations B and D repeat the data and station C receives it; station A then receives its own busy token and generates a new free token)


Token Ring Interfaces


Sun supports two token ring drivers for its range of SPARC platforms. This section describes the token ring interfaces in detail. The SBus-based and PCI token ring interfaces provide access to 4-Mbit/sec or 16-Mbit/sec token ring local area networks. SunTRI/S software supports the IEEE 802.5 standards for token ring networks.

The IEEE standard specifies the lower two layers of the OSI 7-layer model: the Physical layer (Layer 1) and the Data Link layer (Layer 2). The Data Link layer is further divided into the Logical Link Control (LLC) sublayer and the Media Access Control (MAC) sublayer.

The token ring driver is a multithreaded, loadable, clonable, STREAMS hardware driver that supports the connectionless Data Link Provider Interface, dlpi(7p), over a token ring controller. The driver supports multiple token ring controllers installed within the system. SunTRI/S software can support different protocol architectures concurrently through the SNAP encapsulation technique of RFC 1042, so high-level applications can communicate through their different protocols over the same SunTRI/S interface. Support also exists for adding different protocol packages (not included with SunTRI/S), such as OSI and other protocols available directly from Sun or through third-party vendors. TCP/IP is implicit with the Solaris operating system.

The software driver also provides source routing, which enables the workstation to access multiple ring networks connected by source-route bridges. Locally administered addressing is also supported and aids in the management of certain user-specific and vendor-specific network configurations.

Support for IBM LAN Manager is provided by the TMS380 MAC-level firmware, which complies with the IEEE 802.5 standard.

Configuring the SunTRI/S Adapter with TCP/IP


The SBus token ring driver is called tr and can be configured using ifconfig once you have established that the interface is physically present in the system and the device driver is installed. Refer to Configuring the Network Host Files on page 234. The rest of this section describes the configuration of individual parameters of the tr device that can be altered in the driver.conf file and global parameters that can be altered using /etc/system. TABLE 5-1 describes the tr.conf parameters.
TABLE 5-1   tr.conf Parameters

  Parameter   Description
  mtu         Maximum transmission unit index
  sr          Source routing enable
  ari         ARI/FCI soft error reporting enable

Setting the Maximum Transmission Unit


Sun supports the IEEE 802.5 Token Ring Standard Maximum Transmission Unit (MTU) size of 17800 bytes. All hosts should use the same MTU size on any particular network. Additionally, if different types of IEEE 802 networks are connected by transparent link layer bridges, all hosts on all of these networks should use the same MTU size. The maximum MTU sizes supported are 4472 for 4 Mbit/sec operation and 17800 for 16 Mbit/sec operation. These are the rates specified by the token ring chip set on the SunTRI/S adapter. TABLE 5-2 lists the MTU indices and their corresponding sizes.
TABLE 5-2   MTU Sizes

  MTU Index   MTU Size (bytes)
  0           516
  1           1470
  2           2052
  3           4472
  4           8144
  5           11407
  6           17800
The default value of the MTU index is 3 (4472 bytes).


Disabling Source Routing


Source routing is the method used within the token ring network architecture to route frames through a multiple-ring local area network. A route is the path taken by a frame as it travels through a network from the originating station to a destination station.

By default, source routing is enabled. To disable source routing, set the sr value in the tr.conf file to 1.
TABLE 5-3   Source Routing Values

  Parameter   Description
  sr          0 = Enables source routing (default)
              1 = Disables source routing

Disabling ARI/FCI Soft Error Reporting


In 1989, the Token Ring committee changed its recommendations on the use of the Address Recognized Indicator/Frame Copied Indicator (ARI/FCI). The old recommendation was to use the bits to confirm the receipt or delivery of frames. The new recommendation is to use the bits to report soft errors. This recommendation gave rise to issues in networks that had devices developed based on the old recommendation and in networks with devices developed since 1989 that did not adhere to the new recommendation. The ari parameter in the tr.conf file can be used to set ARI/FCI soft error reporting. By default, the ARI/FCI soft error reporting parameter is enabled. If you have an older network device, you might need to disable ARI/FCI error reporting by setting the ari parameter to 1.
TABLE 5-4   ARI/FCI Soft Error Reporting Values

  Parameter   Description
  ari         0 = Enables ARI/FCI error reporting (default)
              1 = Disables ARI/FCI error reporting

Configuring the Operating Mode


The SunTRI/S adapter supports both classic and Dedicated Token Ring (DTR) modes of operation. You can use the mode parameter to set the operating mode.


By default, the adapter is set to classic mode (half duplex). If the mode is set to DTR, the adapter will come up in full duplex mode. If the mode is set to auto, the adapter will automatically choose between classic and DTR mode, depending on the capabilities of the switch or media access unit (MAU).
TABLE 5-5   Operating Mode Values

  Parameter   Description
  mode        0 = Classic mode (default)
              1 = Auto mode
              2 = DTR mode

Resource Configuration Parameter Tuning


The SunTRI/S driver is shipped with 64 2-kilobyte buffers for receiving and transmitting packets. This configuration should be adequate under normal situations. However, the token ring interface throughput can be sluggish under heavy load and might even lock up for an indefinite time, especially under NFS-related operations. This problem can be resolved by increasing the number of buffers available in the driver. The tunable parameter tr_nbufs can be set in the /etc/system file. Add this line to the file if it does not already exist:

set tr:tr_nbufs=<xxx>

where xxx is the desired number of 2-kilobyte buffers. Do not set a value less than the default of 64. Proper setting of this parameter requires tuning; numbers between 400 and 500 should be reasonable for a medium load. You must reboot the system after updating the /etc/system file for the changes to take effect.

Configuring the SunTRI/P Adapter with TCP/IP


The PCI bus token ring driver is called trp and can be configured using ifconfig once you have established that the interface is physically present in the system and the device driver is installed. Refer to Configuring the Network Host Files on page 234. The rest of this section describes the configuration of individual parameters of the trp device that can be altered in the driver.conf file and global parameters that can be altered using /etc/system.
TABLE 5-6   trp.conf Parameters

  Parameter   Description
  mtu         Maximum transmission unit
  sr          Source routing enable
  ari         ARI/FCI soft error reporting enable

Setting the Maximum Transmission Unit


Sun supports the IEEE 802.5 Token Ring Standard Maximum Transmission Unit (MTU) size of 17800 bytes.
TABLE 5-7   Maximum Transmission Unit

  Parameter   Description
  mtu         The maximum MTU sizes supported are 4472 bytes for 4-Mbit/sec
              operation and 17800 bytes for 16-Mbit/sec operation. The default
              MTU size is 4472 bytes.

Configuring the Ring Speed


The ring speed is the number of megabits per second (Mbit/sec) at which the adapter transmits and receives data. The SunTRI/P software sets the ring speed to auto-detect by default: when the workstation enters the token ring, it automatically detects the speed at which the ring is running and sets itself to that ring speed. If your workstation is the first workstation on the token ring, the ring speed is set by the hub. However, if your workstation is the first workstation on the token ring and the token ring has no active hubs, you must set the ring speed manually. Additional workstations that join the token ring will set their ring speed automatically. You can set the ring speed using the trpinstance_ring_speed parameter in the trp.conf file. This parameter can be set for each interface; for example, setting the trp0_ring_speed parameter affects the trp0 adapter.


This parameter can be changed to the following settings.

TABLE 5-8   Ring Speed

  Parameter                Description
  trpinstance_ring_speed   The ring speed setting applied to the node:
                           0 = auto-detect (default)
                           4 = 4 Mbit/sec
                           16 = 16 Mbit/sec

To change the value of the ring speed on trp0 to 4 Mbit/sec and the ring speed on trp1 to 16 Mbit/sec, change the following settings in the trp.conf file:
trp0_ring_speed = 4
trp1_ring_speed = 16

Configuring the Locally Administered Address


The Locally Administered Address (LAA) is part of the token ring standard specification. You might need to use an LAA for some protocols, such as DECNET or SNA. To use an LAA, create a file with execute permission in the /etc/rcS.d directory, such as /etc/rcS.d/S20trLAA, with the ifconfig trinstance ether XX:XX:XX:XX:XX:XX command. The adapter instance is represented by trinstance and the LAA for that adapter is used in place of XX:XX:XX:XX:XX:XX.
#!/sbin/sh
case "$1" in
'start')
        echo "Configuring Token Ring LAA..."
        /sbin/ifconfig trX ether XX:XX:XX:XX:XX:XX
        ;;
'stop')
        echo "Stop of Token Ring LAA is not implemented."
        ;;
*)
        echo "Usage: $0 { start | stop }"
        ;;
esac


For example, to use an LAA of 04:00:ab:cd:11:12 on the tr0 interface, use the following command within the /etc/rcS.d/S20trLAA file:

# /sbin/ifconfig tr0 ether 04:00:ab:cd:11:12

The least significant bit of the most significant byte of the address used in the above command should never be 1. That bit is the individual/group bit and is used for multicasting. For example, the address 09:00:ab:cd:11:12 would be invalid and would cause unexpected networking problems.

Fiber Distributed Data Interface Networks


A typical Fiber Distributed Data Interface (FDDI) network is based on a dual counter-rotating ring, as illustrated in FIGURE 5-2. Each FDDI station is connected in sequence to two rings simultaneously: a primary ring and a secondary ring. Data flows in one direction on the primary ring and in the other direction on the secondary ring. The secondary ring serves as a redundant path; it is used during station initialization and can act as a backup to the primary ring in the event of a station or cable failure. When a failure occurs, the dual ring is wrapped around to isolate the fault and to create a single one-way ring. The components of a typical FDDI network and the failure recovery mechanism are described in more detail in the following sections.


FIGURE 5-2   Typical FDDI Dual Counter-Rotating Ring (FDDI stations connected by primary and secondary rings)

FDDI Stations
An FDDI station is any device that can be attached to a fiber FDDI network through an FDDI interface. The FDDI protocols define two types of FDDI stations:

- Single-attached station (SAS)
- Dual-attached station (DAS)

Single-Attached Station
A SAS is attached to the FDDI network through a single connector, called the S-port. The S-port has a primary input (Pin) and a primary output (Pout). Data from an upstream station enters through Pin and exits from Pout to a downstream station, as shown in FIGURE 5-3. Single-attached stations are normally attached to single- and dual-attached concentrators as described in FDDI Concentrators on page 134.


FIGURE 5-3   SAS Showing Primary Output and Input

Dual-Attached Station
A DAS is attached to the FDDI network through two connectors, called the A-port and the B-port, respectively. The A-port has a primary input (Pin) and a secondary output (Sout); the B-port has a primary output (Pout) and a secondary input (Sin). The primary input/output is attached to the primary ring and the secondary input/output is attached to the secondary ring. The flow of data during normal operation is shown in FIGURE 5-4. To complete the ring, you must ensure that the B-port of an upstream station is always connected to the A-port of a downstream station. For this reason, most FDDI DAS connectors are keyed to prevent connections between two ports of the same type.


FIGURE 5-4   DAS Showing Primary Input and Output

FDDI Concentrators
FDDI concentrators are multiplexers that attach multiple single-attached stations to the FDDI ring. An FDDI concentrator is analogous to an Ethernet hub. The FDDI protocols define two types of concentrator:

- Single-attached concentrator (SAC)
- Dual-attached concentrator (DAC)

Single-Attached Concentrator
A SAC is attached to the FDDI network through a single connector, which is identical to the S-port on a single-attached station. It has multiple M-ports to which single-attached stations are connected, as shown in FIGURE 5-5.


FIGURE 5-5   SAC Showing Multiple M-ports with Single-Attached Stations

Dual-Attached Concentrator
A DAC is attached to the FDDI network through two ports, the A-port and the B-port, which are identical to the ports on a dual-attached station. A DAC has multiple M-ports, to which single-attached stations are connected as shown in FIGURE 5-6. Dual-attached concentrators and FDDI stations are often arranged in a flexible network topology called the ring of trees. Additionally, many failover capabilities are built into the FDDI network to ensure that it is robust.


FIGURE 5-6   DAC Showing Multiple M-ports with Single-Attached Stations

FDDI Interfaces
Sun supports two FDDI drivers for its range of SPARC platforms: the SBus driver, known as SunFDDI/S, and the PCI driver, known as SunFDDI/P. The SBus-based and PCI FDDI interfaces provide access to 100-Mbit/sec FDDI local area networks.


Configuring the SunFDDI/S Adapter with TCP/IP


The SBus FDDI driver is called nf and can be configured using ifconfig once you have established that the interface is physically present in the system and the device driver is installed. Refer to Configuring the Network Host Files on page 234. The rest of this section describes the configuration of individual parameters of the nf device that can be altered in the driver.conf file.
TABLE 5-9   nf.conf Parameters

  Parameter   Description
  nf_mtu      Maximum transmission unit
  nf_treq     Target token rotation time

Setting the Maximum Transmission Unit


Sun supports the FDDI maximum transmission unit (MTU), which has been optimized for pure FDDI networks.
TABLE 5-10   Maximum Transmission Unit

  Parameter   Description
  nf_mtu      The MTU size can be set to a maximum of 4500 bytes.

Target Token Rotation Time


Target token rotation time (TTRT) is the key FDDI parameter used for network performance tuning. In general, increasing the TTRT increases throughput but also increases access delay. For SunFDDI, the TTRT must be between 4000 and 165,000 microseconds; it is set to 8000 microseconds by default. The optimum value for the TTRT depends on the application and the type of traffic on the network:

- If the network load is irregular (bursty traffic), the TTRT should be set as high as possible to avoid lengthy queueing at any one station.
- If the network is used for the bulk transfer of large data files, the TTRT should be set relatively high to obtain maximum throughput without allowing any one station to monopolize the network resources.
- If the network is used for voice, video, or real-time control applications, the TTRT should be set low to decrease access delay.

The TTRT is established during the claim process. Each station on the ring bids a value (T_req) for the operating value of the TTRT (T_opr), and the station with the lowest bid wins the claim. For example, if three stations bid 8000, 12,000, and 100,000, the ring operates with T_opr = 8000. Setting the value of T_req on a single station therefore does not guarantee that this bid will win the claim process.
TABLE 5-11   Requested Operating TTRT

  Parameter   Description
  nf_treq     Requested TTRT; a value in the range 4000 through 165,000.

Configuring the SunFDDI/P Adapter with TCP/IP


The PCI bus FDDI driver is called pf and can be configured using ifconfig once you have established that the interface is physically present in the system and the device driver is installed. Refer to Configuring the Network Host Files on page 234. The rest of this section describes the configuration of individual parameters of the pf device that can be altered in the driver.conf file.
TABLE 5-12   pf.conf Parameters

  Parameter   Description
  pf_mtu      Maximum transmission unit
  pf_treq     Target token rotation time

Setting the Maximum Transmission Unit


Sun supports the FDDI maximum transmission unit (MTU) that has been optimized for pure FDDI networks.
TABLE 5-13   Maximum Transmission Unit

  Parameter   Description
  pf_mtu      The MTU size can be set to a maximum of 4500 bytes.


Target Token Rotation Time


The target token rotation time (TTRT) can also be programmed with the SunFDDI/P. A detailed explanation of the TTRT is provided above with the SunFDDI/S.
TABLE 5-14   Requested Operating TTRT

  Parameter   Description
  pf_treq     Requested TTRT; a value in the range 4000 through 165,000.

Ethernet Technology
This section discusses low-level network interface controller (NIC) architecture features. It explains the elements that make up a NIC adapter, breaking it down into the transmit (Tx) data path and the receive (Rx) data path, followed by the acceleration features available with more modern NICs. The components are broken down in this manner to provide the high-level understanding required to discuss the finer details of the Sun NIC devices available. These broad concepts, plus the finer details included in this explanation, will help you understand the operation of these devices and how to tune them for maximum benefit in throughput, request/response performance, and CPU utilization. These concepts are also useful in explaining the development path that Sun took for its NIC technology: each concept is retained from one Sun NIC to the next as each new product builds on the strengths of the last.

With the NIC architecture concepts in place, the next area of discussion is by far the largest source of customer discomfort with Ethernet technology: the physical layer. The original ubiquitous Ethernet technology was 10 Mbit/sec. Ethernet technology has been improved continuously over the years, going from 10 Mbit/sec to 100 Mbit/sec and most recently to 1 Gbit/sec. Along the way, Ethernet always promised to be backward compatible, accomplishing this with a technology called auto-negotiation, which allows new Ethernet arrivals to connect to the existing infrastructure and establish the correct operating speed. On the whole the technology works very well, but there are some difficulties with understanding the Ethernet physical layer; our explanation of this layer should facilitate better use of this feature.

The last addition to Ethernet technology is network congestion control using pause flow control. This is a useful but under-utilized feature of Ethernet that we hope to demystify.

Software Device Driver Layer


This section discusses the low-level NIC architecture features required to understand the tuning capabilities of each NIC. To discuss this we will divide the process of communication into the software device driver layer relative to TCP/IP and then further into Transmit and Receive. The software device driver layer conforms to the data link provider interface (DLPI). The DLPI interface layer is how protocols like TCP/IP, Appletalk, and so on talk to the software driving the Ethernet device. This is illustrated further in FIGURE 5-7.

FIGURE 5-7   Communication Process between the NIC Software and Hardware (the TCP/IP protocol stack talks through the DLPI interface to the NIC device driver, which drives the NIC device's receive and transmit paths)

Transmit
The Transmit portion of the software device driver level is the simpler of the two and basically is made up of a Media Access Control module (MAC), a direct memory access (DMA) engine, and a descriptor ring and buffers. FIGURE 5-8 illustrates these items in relation to the computer system.


FIGURE 5-8   Transmit Architecture (the Tx DMA engine moves data from the Tx descriptor ring and data buffers in system memory through the MAC module and PHY device)

The key element of this transmit architecture is the descriptor ring. This is where the transmit hardware and the device driver's transmit software share the information required to move data from system memory to the Ethernet network connection.

The transmit descriptor ring is a circular array of descriptor elements that are constantly being used by the hardware to find data to be transmitted from main memory to the Ethernet media. At a minimum, a transmit descriptor element contains the length of the Ethernet packet data to be transmitted and a physical pointer to a location in system physical memory where the data can be found.

The transmit descriptor element is created by the NIC device driver as a result of a request at the DLPI interface layer to transmit a packet. The element is placed on the descriptor ring at the next available free location in the array, and the hardware is notified that a new element is available. The hardware fetches the new descriptor and, using the pointer to the packet data's physical memory, moves the data from physical memory to the Ethernet media for the packet length given in the Tx descriptor. Requests for more packets to be transmitted by the DLPI interface continue while the hardware is transmitting the packets already posted on the descriptor ring.

Sometimes the arrival rate of transmit packets at the DLPI interface exceeds the rate at which the hardware can transmit packets to the media. In that case, the descriptor ring fills up, and further attempts to transmit must be postponed until previously posted transmissions are completed by the hardware and more descriptor elements are made available by the device driver software. This is a typical producer-consumer effect: the DLPI interface produces requests for the transmit descriptor ring, and the hardware consumes those requests, moving data to the media.


This producer-consumer effect can be reduced by increasing the size of the transmit descriptor ring to accommodate the delay that the hardware or the underlying media imposes on the movement of the data. This delay is also known as transmission latency. Later sections describe how many of the device drivers give a measurement of how often the transmission latency becomes so large that data transmission is postponed while awaiting transmit descriptor ring space. The aim is to avoid this situation. In some cases, NIC hardware allows you to increase the size of the descriptor ring, tolerating a larger transmit latency. In other cases, the hardware has a fixed upper limit for the size of the transmit descriptor ring; there is then a hard limit to how much latency the transmit path can endure before postponing packets becomes inevitable.
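The descriptor ring mechanics described above can be summarized in a small sketch. This is a toy model under assumed names (struct tx_desc, tx_post, the flags field, and the ring size are all hypothetical); real NICs define their own descriptor layouts and doorbell registers.

#include <stdint.h>
#include <stdbool.h>

#define TX_RING_SIZE 256           /* a power of two simplifies wraparound */

struct tx_desc {
    uint64_t buf_paddr;            /* physical address of the packet data */
    uint32_t len;                  /* number of bytes to transmit */
    uint32_t flags;                /* e.g., an ownership bit */
};

struct tx_ring {
    struct tx_desc desc[TX_RING_SIZE];
    unsigned       head;           /* next slot the driver fills (producer) */
    unsigned       tail;           /* next slot the hardware drains (consumer) */
};

/* Returns false when the ring is full: transmission must be postponed
 * until the hardware completes previously posted descriptors. */
bool tx_post(struct tx_ring *r, uint64_t paddr, uint32_t len)
{
    unsigned next = (r->head + 1) % TX_RING_SIZE;

    if (next == r->tail)
        return false;              /* producer has caught up with consumer */

    r->desc[r->head].buf_paddr = paddr;
    r->desc[r->head].len       = len;
    r->desc[r->head].flags     = 1;   /* hand ownership to the hardware */
    r->head = next;
    /* A real driver would now write a doorbell register to notify
     * the hardware that a new descriptor is available. */
    return true;
}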

Transmit DMA Buffer Method Thresholds


The packets staged for transmission are buffers present in the kernel virtual address space. A mapping is created that provides a physical address the hardware uses as the base address of the bytes to be fetched from main memory for transmission. The minimum granularity of a buffer is an 8-kilobyte page, so if an Ethernet packet crosses an 8-kilobyte page boundary in the virtual address space, there is no guarantee that the two pages will also be adjacent in the physical address space. To make sure the physical pages are adjacent in the physical address space, SPARC systems provide an input/output memory management unit (IOMMU), which is designed to make sure that the device's view of main memory matches that of the CPU, simplifying DMA mapping. The IOMMU simplifies the mapping of virtual to physical address space and provides a level of error protection against rogue devices that exceed their allowed memory regions, but it does so at some cost to mapping setup and teardown.

The generic device driver interface for creating this mapping is known as ddi_dma. On SPARC platforms, a newer set of functions for doing the mapping, known as fast dvma, is now available. With fast dvma it is possible to further optimize the mapping functions. The availability of fast dvma is limited, so when that resource is unavailable, falling back to ddi_dma is necessary.

Another aspect of DMA is CPU cache coherence. DMA buffers on bridges in the data path between the NIC device and main memory must be synchronized with the CPU cache before the CPU reads or writes data. Two different modes of maintaining DMA-to-CPU cache coherency form two types of DMA transaction, known as consistent mode and streaming mode:
- Consistent mode uses a consistency protocol in hardware, which is common on both x86 and SPARC platforms.
- Streaming mode uses a software synchronization method.


The trade-offs between consistent and streaming modes are largely due to the pre-fetch capability of the DMA transaction. In consistent mode there is no pre-fetch, so when a DMA transaction is started by the device, each cache line of data is requested individually. In streaming mode, a few extra cache lines can be pre-fetched in anticipation of being required by the hardware, reducing per-cache-line re-arbitration costs. These trade-offs lead to the following rules for using ddi_dma, fast dvma, and consistent versus streaming mode (see the sketch after the next paragraph):

- If the packets are small, avoid setting up a mapping on a per-packet basis. This means that small packets are copied out of the message passed down from the upper layer into a pre-mapped buffer. That pre-mapped buffer is usually a consistent mode buffer, as the benefits of streaming mode are difficult to realize for small packets.
- Large packets should use the fast dvma mapping interface. Streaming mode is assumed in this mode. On x86 platforms, streaming mode is not available.
- Mid-range packets should use the ddi_dma mapping interface. This range applies to all cases where fast dvma is not available. The mid-range can be further split, as one can explicitly control whether the DMA transaction uses consistent mode or streaming mode. Given that streaming mode pre-fetch works best for larger transactions, the upper half should use streaming mode while the lower half uses consistent mode.

Setting the thresholds for these rules requires a clear understanding of the memory latencies of the system and the distance between the I/O expander card and the CPU card in a system. The rule of thumb is: the larger the system, the larger the memory latency. Once this coarse-grained tuning is applied, more fine-grained tuning is required. The best tuning is established by experimentation; a good way to do this is by running FTP or NFS transfers of large files and measuring the throughput.
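The following sketch condenses the rules above into a single dispatch function. The threshold values are illustrative tunables only, not values from any Sun driver, and the enum names are hypothetical stand-ins for copying into a pre-mapped consistent buffer, binding through ddi_dma, or mapping with fast dvma.

#include <stddef.h>

enum dma_strategy {
    COPY_PREMAPPED,        /* small: copy into a consistent-mode buffer */
    DDI_DMA_CONSISTENT,    /* lower mid-range: per-line requests are fine */
    DDI_DMA_STREAMING,     /* upper mid-range: pre-fetch pays off */
    FAST_DVMA              /* large packets, SPARC only */
};

/* Illustrative thresholds; real values need per-platform tuning. */
#define SMALL_MAX   256
#define MIDLOW_MAX  1024
#define LARGE_MIN   4096

enum dma_strategy pick_strategy(size_t pktlen, int have_fast_dvma)
{
    if (pktlen <= SMALL_MAX)
        return COPY_PREMAPPED;
    if (pktlen >= LARGE_MIN && have_fast_dvma)
        return FAST_DVMA;
    if (pktlen <= MIDLOW_MAX)
        return DDI_DMA_CONSISTENT;
    return DDI_DMA_STREAMING;      /* also covers large packets when
                                      fast dvma is unavailable */
}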

Multi-data Transmit Capability


A new development in the transmission process of the Solaris TCP/IP protocol stack is known as multi-data transmission (MDT). Delivering one packet at a time to the transmit driver caused many networking layer traversals, and the individual cost of setting up the network device DMA hardware to transmit one packet at a time was too expensive. Therefore, a method was devised that allows multiple packets to be passed to the driver for transmission. At the same time, every effort was made to ensure that the data for those packets remains in one contiguous buffer that can be enabled for transmission with one setup of the network device DMA hardware.


This feature requires a new interface to the driver, so only the most recent devices have implemented it. Furthermore, it can be enabled only if TCP/IP is configured to allow it, and even then the stack will attempt to build an MDT transaction to the driver only if the TCP connection is operating in a bulk transfer mode such as FTP or NFS. The multi-data transmit capability is also included as part of the performance enhancements provided in the ce driver. This feature is negotiated with the upper layer protocol, so it must be enabled in the ce driver as well as the upper layer protocol; if there is no negotiation, the feature is disabled. The TCP/IP protocol began supporting the multi-data transmit capability in the Solaris 9 8/03 operating system, but by default it will not negotiate with the driver to enable it. The first step in making this capability available is to enable the negotiations through an /etc/system tunable parameter.
TABLE 5-15   Multi-Data Transmit Tunable Parameter

  Parameter       Value   Description
  ip_use_dl_cap   0-1     Enables negotiation of special hardware
                          accelerations with a lower layer.
                          1 = Enable; 0 = Disable. Default: 0.

To enable the multi-data transmit capability, add the following line to the /etc/system file:

set ip:ip_use_dl_cap = 1

Receive
The receive side of the interface looks much like the transmission side, but it requires more from the device driver to ensure that packets are passed to the correct stream. There are also multithreading techniques to ensure that the best advantage is made of multiprocessor environments. FIGURE 5-9 shows the basic Rx architecture.


FIGURE 5-9   Basic Receive Architecture (the Rx DMA engine moves data arriving through the PHY device and MAC module into the Rx descriptor ring and data buffers in system memory)

The receive descriptor plays a key role in the process of receiving packets. Unlike transmission, receive packets originate from remote systems, so the Rx descriptor ring refers to buffers where those incoming packets can be placed. At a minimum, a receive descriptor element provides a buffer length and a pointer to an available buffer.

When a packet arrives, it is received first by the PHY device and then passed to the MAC, which notifies the Rx DMA engine of an incoming packet. The Rx DMA takes that notification and uses it to initiate an Rx descriptor element fetch. The descriptor is then used by the Rx DMA to post the data from the MAC device's internal FIFOs to system main memory. The length provided by the descriptor ensures that the Rx DMA does not exceed the buffer space provided for the incoming packet. The Rx DMA continues to move data until the packet is complete. Then it places in the current descriptor location a new completion descriptor containing the size of the packet that was just received. In some cases, depending on the hardware capability, the completion descriptor might carry more information associated with the incoming packet (for example, a TCP/IP partial checksum). When the completion descriptor is placed back onto the Rx descriptor ring, the hardware advances its pointer to the next free Rx descriptor and interrupts the CPU to notify the device driver that it has a packet to be passed to the DLPI layer.

Once the device driver receives the packet, it is responsible for replenishing the Rx descriptor ring. That process requires the driver to allocate and map a new buffer for DMA and post it to the ring, notifying the hardware that the new descriptor is available. Once the buffer is replenished, the current packet can be passed up for classification to the stream expecting that packet's arrival.
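The receive flow above, including the replenish-or-drop decision discussed next, can be sketched as a small completion loop. All names here are hypothetical; alloc_mapped_buf() stands in for the buffer allocation plus DMA mapping step, which can fail, and deliver_upstream() stands in for classification and delivery to the DLPI layer.

#include <stdint.h>

#define RX_BUF_SIZE 2048u              /* illustrative buffer size */

struct rx_desc {
    uint64_t buf_paddr;                /* buffer the hardware may fill */
    uint32_t len;                      /* buffer size, or packet size on completion */
    uint32_t done;                     /* set by hardware on completion */
};

extern uint64_t alloc_mapped_buf(void);                   /* 0 on failure */
extern void     deliver_upstream(uint64_t paddr, uint32_t len);

void rx_service(struct rx_desc ring[], unsigned ring_size, unsigned *idx)
{
    while (ring[*idx].done) {
        struct rx_desc *d = &ring[*idx];
        uint64_t filled = d->buf_paddr;
        uint32_t pktlen = d->len;

        uint64_t fresh = alloc_mapped_buf();
        if (fresh != 0) {
            d->buf_paddr = fresh;      /* replenish, then pass the packet up */
            deliver_upstream(filled, pktlen);
        }
        /* else: allocation failed, so the same buffer is reposted and the
         * packet is dropped, letting the hardware keep receiving. */

        d->len  = RX_BUF_SIZE;
        d->done = 0;                   /* hand descriptor back to hardware */
        *idx = (*idx + 1) % ring_size;
    }
}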


It is possible for the allocation and mapping to fail. In that case, the current packet cannot be received; its buffer is reposted to the ring to allow the hardware to continue to receive packets. This condition is not very likely, but it is one example of an overflow condition. Other overflow conditions can occur on the Rx path, starting from the DLPI layer:

- Overflow can occur when the DLPI layer cannot receive the incoming packet. In that case, packets are typically dropped even though they were successfully received by the hardware.
- Overflow can occur when the device driver software is unable to replenish the Rx descriptor elements as fast as the NIC hardware consumes them. This usually happens because the system does not have enough CPU performance to keep up with the network traffic.

Overflow also occurs within the NIC device, between the MAC and the Rx DMA interface. This is known as MAC overflow. It is caused when the descriptor ring overflows and backfill occurs because of that condition. MAC overflow can also occur when a high-latency system bus makes the MAC overflow its internal buffer while waiting for the Rx DMA to get access to the bus and move the data from the MAC buffer to main memory. Finally, if a MAC overflow condition exists, any packet coming in cannot be received; that packet is considered missed.

In some cases, overflow conditions can be avoided by careful tuning of the device driver software. The extent of available tuning depends on the NIC hardware. In cases where the Rx descriptor ring is overflowed, many devices allow increases in the number of descriptor elements available. This is discussed further with respect to example NIC cards in later sections.

You can avoid the MAC overflow condition by careful system configuration, which can require more memory, faster CPUs, or more CPUs. It might also require that NIC cards not share the system bus with other devices. Newer devices have the ability to adjust the priority of the Rx DMA versus the Tx DMA, giving one a more favorable opportunity to access the system bus than the other. Therefore, if the MAC overflow condition occurs, it might be possible to raise the Rx DMA priority so that Rx accesses to the system bus are favored over the Tx DMA, reducing the likelihood of MAC overflow.

The overflow condition at the DLPI layer is caused by an overwhelmed CPU. A few newer hardware features help reduce this effect: hardware checksumming, interrupt blanking, and CPU load balancing.

Checksumming
The hardware checksumming feature accelerates the one's complement checksum applied to TCP/IP packets. The TCP/IP checksum is applied to each packet sent by the TCP/IP protocol and is a one's complement addition of the bytes in the pseudo header (the source and destination IP addresses, the protocol number, and the TCP length) plus all the bytes of the TCP header, including the source and destination port numbers, and the payload.

The hardware checksumming feature is merely an acceleration. Most hardware designs do not implement the TCP/IP checksum directly. Instead, the hardware does the bulk of the one's complement additions over the data and allows the software to take that result and mathematically adjust it so that it matches the complete checksum. On transmission, the TCP checksum field is filled with an adjustment value that the hardware treats as just another two bytes of data during its one's complement addition over all the bytes of the packet. The end result of that sequence is a mathematically correct checksum that the MAC can place in the TCP header as the packet is transmitted to the network.

FIGURE 5-10   Hardware Transmit Checksum (the hardware sums from a start point in the TCP header through the end of the payload and places the checksum result in the TCP header)

On the Rx path, the hardware computes the one's complement sum from a starting point in the packet. That starting point is passed to TCP/IP along with the one's complement sum of the bytes in the incoming packet. The TCP/IP software again performs a mathematical fix-up using this information before it finally compares the result with the TCP/IP checksum bytes that arrived as part of the packet. The main advantage of hardware checksumming is that it relieves the system CPU of calculating the checksum for large packets, allowing the majority of the checksum calculation to be completed by the NIC hardware. Because the hardware does not do the complete TCP/IP checksum calculation, this form of TCP/IP checksum acceleration is called partial checksumming.
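The arithmetic the hardware accelerates is the one's complement sum itself, shown below in the style of RFC 1071. This is a generic sketch of the calculation, not driver code; the driver's fix-up adjusts such a partial sum to account for bytes (such as the pseudo header) that the hardware did not see.

#include <stdint.h>
#include <stddef.h>

/* One's complement sum over a byte range, folded to 16 bits. The
 * complement of this value is what goes in the TCP checksum field. */
uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Sum 16-bit words; carries accumulate in the upper bits. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;   /* pad a trailing odd byte */

    /* Fold the carries back in until the sum fits in 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}

A useful property of this arithmetic is that sums over separate byte ranges can be combined later, which is exactly what makes the software fix-up of a hardware partial sum possible.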


FIGURE 5-11   Hardware Receive Checksum (the hardware sums from a start point through the end of the payload and passes the result, with the packet buffer and length, to the software)

Interrupt Blanking
Interrupt blanking is another hardware acceleration. With typical NIC devices, the CPU is interrupted when a receive packet arrives, so the CPU is interrupted on a per-packet basis. While this is reasonable for transactional requests, where an immediate response to a request is expected, it is not always required, especially in large bulk data transfers. In the single-interrupt-per-packet case, each packet arrival adds the cost of processing that individual packet to the overhead of the interrupt processing. The interrupt blanking technique allows a set number of packets to arrive before the next receive interrupt is generated, so the overhead of interrupt processing is distributed, or amortized, across that number of received packets. If that packet count is not reached, the packets that have arrived so far would not generate an interrupt and hence would not be processed; a timeout therefore ensures that the receive packet interrupt is eventually generated and those received packets are processed. The best setting for the interrupt blanking depends on the type of traffic (transactional versus bulk data transfers) and the speed of the system. These parameters are best tuned empirically when the traffic characteristics are well known, so the interrupt blanking can be tuned dynamically to match. This is discussed further in the context of the individual NICs that provide this feature.
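The packets-or-timeout decision can be sketched as follows. The structure and names are hypothetical; real NICs expose equivalent packet-count and timeout thresholds through device-specific registers or driver tunables.

#include <stdint.h>
#include <stdbool.h>

struct blanking {
    uint32_t pkt_threshold;    /* interrupt after this many packets ... */
    uint32_t usec_timeout;     /* ... or this much time, whichever is first */
    uint32_t pkts_pending;
    uint64_t first_pkt_usec;   /* arrival time of oldest unserviced packet */
};

/* Called per received packet; returns true when the device should
 * raise a receive interrupt. */
bool should_interrupt(struct blanking *b, uint64_t now_usec)
{
    if (b->pkts_pending == 0)
        b->first_pkt_usec = now_usec;
    b->pkts_pending++;

    if (b->pkts_pending >= b->pkt_threshold ||
        now_usec - b->first_pkt_usec >= b->usec_timeout) {
        b->pkts_pending = 0;   /* one interrupt amortized over the batch */
        return true;
    }
    return false;              /* keep blanking */
}

Lowering pkt_threshold approaches per-packet interrupts (good for transactional traffic); raising it amortizes interrupt overhead for bulk transfers, with latency bounded by usec_timeout.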

CPU Load Balancing


CPU load balancing is the latest hardware acceleration to become available. It is designed to take maximum advantage of the large number of CPUs available in many UltraSPARC-based systems. There are two forms of CPU load balancing: software load balancing and hardware load balancing.


Software load balancing can be enhanced with hardware support, but it can also be implemented without it. Essentially, it requires the ability to separate the workload of different connections from the same protocol stack into flows that can then be processed on different CPUs. The interrupt thread is then required only to replenish buffers for the descriptor ring, allowing more packets to arrive. Packets taken off the receive rings are load balanced into packet flows based on connection information from the packet. A packet flow has a circular array that is updated with receive packets from the interrupt service routine while packets posted earlier are being removed and post-processed in the protocol stack by the kernel worker thread.

Usually more than one flow is set up within a system, each made up of a circular array and a corresponding kernel worker thread. The more CPUs available, the more flows can be allowed. The kernel worker threads are available to run whenever packet data is available on a flow's array. The system scheduler participates using its own CPU load-balancing technique to ensure a fair distribution of workload for incoming data.
FIGURE 5-12 demonstrates the architecture of software load balancing.

FIGURE 5-12   Software Load Balancing. The Rx interrupt service routine feeds per-flow circular arrays; each flow's kernel worker thread drains its array into the protocol stack.
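The dispatch step in FIGURE 5-12 amounts to hashing each packet's connection identity onto a flow. The sketch below is illustrative only (the names flow_hash, rx_dispatch, NFLOWS, and the ring layout are invented, not taken from any Sun driver); the key property is that one connection always hashes to the same flow, so its packets stay ordered.

#include <stdint.h>

#define NFLOWS 4                     /* typically scaled to available CPUs */

struct packet;                       /* opaque packet handle */

struct flow {
    struct packet *ring[256];        /* circular array */
    unsigned head, tail;             /* producer (ISR) / consumer (worker) */
};

static struct flow flows[NFLOWS];

static unsigned flow_hash(uint32_t src_ip, uint32_t dst_ip,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)sport << 16 | dport);
    h ^= h >> 16;                    /* mix so one connection -> one flow */
    return h % NFLOWS;
}

/* Called from the Rx interrupt service routine; the flow's worker thread
 * is woken elsewhere. Overflow handling is omitted from this sketch. */
static void rx_dispatch(struct packet *pkt, uint32_t s, uint32_t d,
                        uint16_t sp, uint16_t dp)
{
    struct flow *f = &flows[flow_hash(s, d, sp, dp)];
    f->ring[f->head++ % 256] = pkt;
}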

Hardware load balancing requires that the hardware provide built-in load balancing capability. The PCI bus enables receive hardware load balancing by using its four available interrupt lines together with the ability of UltraSPARC III systems to allow each of those four interrupt lines to be serviced by a different CPU. The advantage of having the four receive interrupt lines serviced by different CPUs is that it allows not only the protocol post-processing to happen in parallel, as in the case of software load balancing, but also the processing of the descriptor rings in the interrupt service routines, as shown in FIGURE 5-13.

FIGURE 5-13   Hardware Load Balancing. The NIC hardware spreads receive traffic across four PCI interrupt lines; each Rx interrupt service routine runs on its own CPU and feeds the protocol stack in parallel.

It is possible to combine the concept of software load balancing with the concept of hardware load balancing if enough CPUs are available to allow all the parallel Rx processing to happen. However, there is a gotcha with this load balancing capability: to realize its benefit, you must have multiple connections to spread across CPUs in the first place.

Received Packet Delivery Method


The received packet delivery method is the way packets are posted to the upper layer for protocol processing and refers to which thread of execution takes responsibility for that last leg of the journey for the received packet. CPU software load balancing is an example of a received packet delivery method where the interrupt processing is decoupled from the protocol stack processing. A hint provided by the hardware helps decide which worker thread completes the delivery of the packet to the protocol. In this model, many CPUs get to participate in the protocol stack processing.


The Streams Service Queue model also requires the driver to decouple interrupt processing from protocol stack processing. In this model, there's no requirement to provide a hint because there is only one protocol processing thread per queue open to the driver; with respect to TCP/IP, that's only one stream. This method works best on systems with a small number of CPUs, but more than one. Like CPU load balancing, it compromises on latency.

The most common received packet delivery method is to do all the interrupt processing and protocol processing in the interrupt thread. This is a widely accepted method, but it is limited by the CPU bandwidth available for taking all the NIC driver interrupts. It is really the only option on a single-CPU system. On a multi-CPU system, you can choose one of the other two methods if it's established that the CPU taking the NIC interrupts is being overwhelmed. That situation becomes apparent when the system starts to become unresponsive.

Random Early Discard


The Random Early Discard feature was introduced recently to reduce the ill effects of a network card going into an overflow state. There are two overflow possibilities:

- The internal device memory is full and the adapter is unable to get timely access to the system bus in order to move data from that device memory to system memory.
- The system is so busy servicing packets that the descriptor rings fill up with inbound packets and no further packets can be received. This overflow condition is very likely to also trigger the first overflow condition at the same time.

When these overflow conditions occur, the upper layer connections effectively stop receiving packets, and a connection appears to have stalled, at least for the duration of the overflow condition. With TCP/IP in particular, this leads to many packets being lost. The connection state is modified to assume a less reliable connection, and in some cases connections might be lost completely. The impact of a lost connection is obvious, but when TCP/IP assumes a less reliable connection, it further contributes to the congestion on the network by reducing the number of packets outstanding without an ACK from the regular eight to a smaller value. A technique that avoids this scenario can take advantage of TCP/IP's ability to tolerate the occasional single packet loss on a connection while still maintaining the same number of packets outstanding without an ACK. The lost packet is simply requested again, and the transmitting end of the connection performs a retry. Completely avoiding the overflow scenario is impossible, but you can reduce its likelihood by beginning to drop random packets already received in the device memory, avoiding propagating them further into the system and adding to the workload already piled up for the system. This technique, known as Random Early Discard (RED), has the desired effect of avoiding overwhelming the system while having minimal negative effect on the TCP/IP connections. The rate of random discard is relative to how many bytes of packet data occupy the device internal memory. The internal memory is split into regions. As one region fills up with packet data, it spills into the next, until all regions of memory are filled and overflow occurs. Each spill from one region to the next is the trigger to randomly discard. The number of packets discarded is based on the number of regions filled; the more regions filled, the more you need to discard, as you're getting closer to the overflow state.
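A minimal sketch of the region-based discard decision follows. The region count and drop percentages are invented for illustration; real devices hard-wire their own thresholds.

#include <stdlib.h>
#include <stddef.h>

#define NREGIONS 4

/* Drop probability grows as occupancy spills into higher regions. */
static const unsigned drop_percent[NREGIONS] = { 0, 12, 25, 50 };

static int red_should_drop(size_t bytes_queued, size_t mem_size)
{
    unsigned region = (unsigned)((bytes_queued * NREGIONS) / mem_size);
    if (region >= NREGIONS)
        region = NREGIONS - 1;
    return (unsigned)(rand() % 100) < drop_percent[region];
}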

Jumbo Frames
Jumbo frames technology allows the size of an Ethernet data packet to be extended past the current 1514-byte standard limit, which is the norm for Ethernet networks. The typical size of a jumbo frame has been set to 9000 bytes when viewed from the IP layer. Once the Ethernet header is applied, that grows by 14 bytes for the regular case or 18 bytes for VLAN packets. When jumbo frames are enabled on a subnet or VLAN, every member of that subnet or VLAN should be enabled to support jumbo frames. To ensure that this is the case, configure each node for jumbo frames. The details of how to set and check a node for jumbo frames capability tend to be NIC device- and driver-specific and are discussed below for the interfaces that support them. If any one node in the subnet is not enabled for jumbo frames, no members of the subnet can operate in jumbo frame mode, regardless of their preconfiguration to support jumbo frames.

The big advantage of jumbo frames is similar to that provided by MDT: a huge improvement in bulk data transfer throughput with a corresponding reduction in CPU utilization, with the addition that the same level of improvement is also available in the receive direction. Therefore, the best bulk transfer results can be achieved using this mode. The jumbo frames mode should be used with care because not all switches or networking infrastructure elements are jumbo frames capable. When you enable jumbo frames, make sure that they're contained within a subnet or VLAN where all the components of that subnet or VLAN are jumbo frames capable.

Ethernet Physical Layer


The Ethernet Physical layer has developed along with the Ethernet technology. When Ethernet moved from 10 Mbit/sec to 100 Mbit/sec, there were a number of technologies and media available to provide the 100 Mbit/sec line rate. To allow those media and technologies to develop without altering the long-established Ethernet protocol, a partition was made between the media-specific portion and the Ethernet protocol portion of the overall Ethernet technology. At that partition was placed the Media Independent Interface (MII). The MII allowed Ethernet to operate over fiber-optic cables to switches built to support fiber. It also allowed the introduction of a new twisted-pair copper technology, 100BASE-T4. These differing technologies for supporting 100 Mbit/sec ultimately did not survive the test of time, leaving 100BASE-T as the standard 100 Mbit/sec media type. The existing widespread adoption of 10 Mbit/sec Ethernet brought with it a requirement that Ethernet media for 100 Mbit/sec allow for backward compatibility with existing 10 Mbit/sec networks. Therefore, the MII was required to support 10 Mbit/sec operation as well as 100 Mbit/sec and to allow the speed to be user-selectable or automatically detected or negotiated. Those requirements led to the ability to force a particular speed setting for a link, known as Forced mode; to set a particular speed based on link speed signaling, known as auto-sensing; and to have both sides of the link share information about their link speed and duplex capabilities and negotiate the best speed and duplex to set for the link, known as Auto-negotiation.

Basic Mode Control Register


For the benefit of this discussion, the MII is restricted to the registers and bits used by software to allow the Ethernet physical layer to operate in Forced or Auto-negotiation mode. The first register of interest is the Basic Mode Control Register (BMCR). This register controls whether the link will auto-negotiate or use Forced mode. If Forced mode is chosen, then auto-negotiation is disabled and the remaining bits in the register become meaningful.

FIGURE 5-14   Basic Mode Control Register. Bits: Reset, Speed Select, Auto-Negotiation Enable, Restart Auto-Negotiation, Duplex Mode.

The Reset bit is a self-clearing bit that allows the software to reset the physical layer. This is usually the first bit touched by the software in order to begin the process of synchronizing the software state with the hardware link state. Speed Selection is a single bit, and it is only meaningful in Forced mode. Forced mode of operation is available when auto-negotiation is disabled. If this bit is set to 0, then the speed selected is 10 Mbit/sec. If set to 1, then the speed selected is 100 Mbit/sec.

Chapter 5

Server Network Interface Cards: Datalink and Physical Layer

153

When the Auto-negotiation Enable bit is set to 1, auto-negotiation is enabled, and the Speed Selection and Duplex Mode bits are no longer meaningful. The speed and duplex mode of the link are established based on auto-sensing or the auto-negotiation advertisement register exchange. The Restart Auto-negotiation bit is used to restart auto-negotiation. This is required during the transition from Forced mode to Auto-negotiation mode or when the Advertisement register has been updated with a different set of auto-negotiation parameters. The Duplex Mode bit is only meaningful in Forced mode. When set to 1, the link is set up for full-duplex mode. When set to 0, the link operates in half-duplex mode.
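In code, the two modes reduce to two BMCR writes. The bit positions below are the IEEE 802.3 clause 22 definitions for MII register 0; mii_write() is a stand-in for whatever MDIO access routine a given driver provides.

#include <stdint.h>

#define MII_BMCR        0
#define BMCR_RESET      0x8000
#define BMCR_SPEED100   0x2000   /* Speed Select: 1 = 100 Mbit/sec */
#define BMCR_ANENABLE   0x1000   /* Auto-Negotiation Enable        */
#define BMCR_ANRESTART  0x0200   /* Restart Auto-Negotiation       */
#define BMCR_FULLDPLX   0x0100   /* Duplex Mode: 1 = full duplex   */

extern void mii_write(int phy, int reg, uint16_t val);

/* Forced mode: Auto-Negotiation Enable left clear, so the speed and
 * duplex bits take effect. */
void phy_force_100fdx(int phy)
{
    mii_write(phy, MII_BMCR, BMCR_SPEED100 | BMCR_FULLDPLX);
}

/* Auto-negotiation: the speed and duplex bits are now ignored; restart
 * so the updated Advertisement register contents are exchanged. */
void phy_autoneg(int phy)
{
    mii_write(phy, MII_BMCR, BMCR_ANENABLE | BMCR_ANRESTART);
}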

Basic Mode Status Register


The next register of interest is the Basic Mode Status Register (BMSR). This read-only register provides the overall capabilities of the MII physical layer device. From these capabilities you can choose a subset to advertise, using the Auto-negotiation Advertisement register during the auto-negotiation process.

FIGURE 5-15   Basic Mode Status Register. Bits: 100BASE-T4, 100BASE-T Full Duplex, 100BASE-T Half Duplex, 10BASE-T Full Duplex, 10BASE-T Half Duplex, Auto-negotiation Complete, Link Status, Auto-negotiation Capable.

When the 100BASE-T4 bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T4 networking. When set to 0, it is not. When the 100BASE-T Full-duplex bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T full-duplex networking. When set to 0, it is not capable. When the 100BASE-T Half-duplex bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T half-duplex networking. When set to 0, it is not capable. When the 10BASE-T Full-duplex bit is set to 1, it indicates that the physical layer device is capable of 10BASE-T full-duplex networking. When set to 0, it is not. When the 10BASE-T Half-duplex bit is set to 1, it indicates that the physical layer device is capable of 10BASE-T half-duplex networking. When set to 0, it is not.


The Auto-negotiation Complete bit is only meaningful when the physical layer device is capable of auto-negotiation and auto-negotiation is enabled. It indicates that the auto-negotiation process has completed and that the information in the link partner auto-negotiation advertisement register accurately reflects the link capabilities of the link partner. When the Link Status bit is set to 1, it indicates that the physical link is up. When set to 0, the link is down. When used in conjunction with auto-negotiation, this bit must be set together with the Auto-negotiation Complete bit before the software can establish that the link is actually up. In Forced mode, as soon as this bit is set to 1, the software can assume the link is up. When the Auto-negotiation Capable bit is set to 1, it indicates that the physical layer device is capable of auto-negotiation. When set to 0, the physical layer device is not capable. This bit is used by the software to establish what further auto-negotiation processing should occur.
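That two-bit link-up rule can be sketched as follows, using the IEEE 802.3 clause 22 bit positions for MII register 1; mii_read() is a stand-in for the driver's MDIO read routine.

#include <stdint.h>

#define MII_BMSR        1
#define BMSR_LSTATUS    0x0004   /* Link Status               */
#define BMSR_ANEGCAP    0x0008   /* Auto-negotiation Capable  */
#define BMSR_ANEGDONE   0x0020   /* Auto-negotiation Complete */

extern uint16_t mii_read(int phy, int reg);

/* In auto-negotiation mode, trust the link only once both Link Status and
 * Auto-negotiation Complete are set; in Forced mode, Link Status alone is
 * enough. */
int link_is_up(int phy, int autoneg_enabled)
{
    uint16_t bmsr = mii_read(phy, MII_BMSR);
    if (!(bmsr & BMSR_LSTATUS))
        return 0;
    return autoneg_enabled ? (bmsr & BMSR_ANEGDONE) != 0 : 1;
}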

Link-Partner Auto-negotiation Advertisement Register


The next registers of interest are the Auto-negotiation Advertisement Register (ANAR) and the Link-Partner Auto-negotiation Advertisement Register (LPANAR). These two registers are at the heart of the auto-negotiation process, and they share the same bit definitions. The ANAR is a read/write register that can be programmed to control the link partner's view of the local link capabilities. The LPANAR is a read-only register that is used to discover the remote link capabilities. Using the information in the LPANAR together with the information in the ANAR, the software can establish what shared link capability has been agreed on once auto-negotiation has completed.

FIGURE 5-16   Link Partner Auto-negotiation Advertisement Register. Bits: 100BASE-T4, 100BASE-T Full Duplex, 100BASE-T Half Duplex, 10BASE-T Full Duplex, 10BASE-T Half Duplex.

When the 100BASE-T4 bit is set to 1 in the ANAR, it advertises the intention of the local physical layer device to use 100BASE-T4. When set to 0, this capability is not advertised. The same bit, when set to 1 in the LPANAR, indicates that the link partner physical layer device has advertised 100BASE-T4 capability. When set to 0, the link partner is not advertising this capability.


The 100BASE-T Full-duplex, 100BASE-T Half-duplex, 10BASE-T Full-duplex, and 10BASE-T Half-duplex bits all work the same way as the 100BASE-T4 bit and provide the ability to decide what link capabilities should be shared for the link. The decision is made by the physical layer hardware based on priority, as shown in FIGURE 5-17, and is the result of logically ANDing the ANAR and the LPANAR on completion of auto-negotiation.

FIGURE 5-17   Link Partner Priority for Hardware Decision Process (highest to lowest): 100BASE-T Full Duplex, 100BASE-T4, 100BASE-T Half Duplex, 10BASE-T Full Duplex, 10BASE-T Half Duplex.
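The hardware decision reduces to an AND followed by a priority scan, sketched below with the IEEE 802.3 clause 28 ANAR/LPANAR bit positions and the FIGURE 5-17 ordering (the function name is illustrative).

#include <stdint.h>

#define ANAR_100T4   0x0200
#define ANAR_100FD   0x0100
#define ANAR_100HD   0x0080
#define ANAR_10FD    0x0040
#define ANAR_10HD    0x0020

static const uint16_t priority[] = {   /* highest to lowest */
    ANAR_100FD, ANAR_100T4, ANAR_100HD, ANAR_10FD, ANAR_10HD
};

/* Returns the highest-priority capability advertised by both sides,
 * or 0 when no capability is shared. */
uint16_t resolve_link(uint16_t anar, uint16_t lpanar)
{
    uint16_t common = anar & lpanar;
    for (unsigned i = 0; i < sizeof priority / sizeof priority[0]; i++)
        if (common & priority[i])
            return priority[i];
    return 0;
}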

Auto-negotiation in the purest sense requires that both sides participate in the exchange of ANAR contents. This allows both sides to complete loading of the LPANAR and establish a link that operates at the best negotiated value. It is possible that one side, or even both sides, of the link might be operating in Forced mode instead of Auto-negotiation mode. This can happen because a new device is connected to an existing 10/100 Mbit/sec link that was never designed to support auto-negotiation or because auto-negotiation is switched off on one or both sides. If both sides are in Forced mode, you need to set the correct speed and duplex on both sides. If the speed is not matched, the link will not come up, so speed mismatches can be easily tracked down once the physical connection is checked and considered good. If the duplex is not matched but the speed is, the link will come up, but there's an often unnoticed gotcha in that. If one side is set to half duplex while the other is set to full duplex, then the half-duplex side will operate with the Ethernet protocol Carrier Sense Multiple Access with Collision Detection (CSMA/CD) while the full-duplex side will not. To the physical layer, this means that the full-duplex side is not adhering to the half-duplex CSMA/CD protocol and will not back off if someone is currently transmitting. To the half-duplex side of the connection, this appears as a collision, and its transmit is stopped. These collisions occur frequently, preventing the link from operating at its best capacity.

If one side of the connection is running Auto-negotiation mode, the other is running Forced mode, and the auto-negotiating side is capable of and advertising all available MII speeds and duplex settings, the link speed will always be negotiated successfully by the auto-sensing mechanism provided as part of the auto-negotiation protocol. Auto-sensing uses physical layer signaling to establish the operating speed of the Forced side of the link. This allows the link to at least come up at the correct speed. The link duplex, on the other hand, needs the Advertisement register exchange and cannot be established by auto-sensing. Therefore, if the link duplex setting on the Forced mode side of the link is full duplex, the best guess the auto-negotiating side can make is half duplex. This gives rise to the same effect discussed when both sides are in Forced mode and there's a duplex mismatch.

The only solution to the issue of duplex mismatch is to be aware that it can happen and to make every attempt to configure both sides of the link to avoid it. In most cases, enabling auto-negotiation on both sides wherever possible will eliminate the duplex mismatch issue. The alternative, Forced mode, should only be employed in infrastructures that are entirely full-duplex configurations. Where possible, those configurations should be replaced with an auto-negotiation configuration.

There's one more MII register worthy of note. The Auto-negotiation Expansion Register (ANER) can be useful in establishing whether a link partner is capable of auto-negotiation and provides information about the auto-negotiation algorithm.

FIGURE 5-18   Auto-negotiation Expansion Register. Bits: Link Partner Auto-negotiation Able, Parallel Detection Fault.

The Parallel Detection Fault bit indicates that the auto-sensing part of the auto-negotiation protocol was unable to establish the link speed and that the regular ANAR exchange was also unsuccessful in establishing a common link parameter; therefore, auto-negotiation failed. If this condition occurs, the best course of action is to check each side of the link manually and ensure that the settings are mutually compatible.

Gigabit Media Independent Interface


As time progressed, Ethernet was increased in speed by another multiple of 10 to give 1000 Mbit/sec or 1 Gbit/sec. The MII remained and was extended to support the new 1 Gbit/sec operation, giving rise to the Gigabit Media Independent Interface (GMII).


The GMII was first implemented using a fiber-optic physical layer known as 1000BASE-X and was later extended to support twisted-pair copper, known as 1000BASE-T. Those extensions led to additional bits in registers in the MII specification and some completely new registers, giving a GMII register set definition. The first register to be extended was the BMCR, because it can be used to force speed and the ability to force 1-gigabit operation needed to be added. All existing bit definitions were maintained, with one bit taken from the existing reserved bits to allow the enumeration of the different speeds that can now be forced with GMII devices.

FIGURE 5-19   Extended Basic Mode Control Register. Bits: Reset, Speed Select (0), Speed Select (1), Auto-Negotiation Enable, Restart Auto-Negotiation, Duplex Mode.

The next register of interest was the BMSR. This register was extended to indicate to the driver software that there are more registers that apply to 1-gigabit operation.

FIGURE 5-20   Basic Mode Status Register. Bits: 100BASE-T4, 100BASE-T Full Duplex, 100BASE-T Half Duplex, 10BASE-T Full Duplex, 10BASE-T Half Duplex, 1000BASE-T Extended Status, Auto-negotiation Complete, Link Status, Auto-negotiation Capable.

When the 1000BASE-T Extended Status bit is set, that's the indication to the driver software to look at the new 1-gigabit operating registers, whose function is similar to that of the Basic Mode Status Register and the ANAR. The Gigabit Extended Status Register (GESR) is the first of the gigabit operating registers. Like the BMSR, it gives an indication of the types of gigabit operation the physical layer device is capable of.


FIGURE 5-21   Gigabit Extended Status Register. Bits: 1000BASE-X Full Duplex, 1000BASE-X Half Duplex, 1000BASE-T Full Duplex, 1000BASE-T Half Duplex.

The 1000BASE-X full duplex indicates that the physical layer device is capable of operating with 1000BASE-X fiber media with full-duplex operation. The 1000BASE-X half duplex indicates that the physical layer device is capable of operating with 1000BASE-X fiber media with half-duplex operation. The 1000BASE-T full duplex indicates that the physical layer device is capable of operating with 1000BASE-T twisted-pair copper media with full-duplex operation. The 1000BASE-T half duplex indicates that the physical layer device is capable of operating with 1000BASE-T twisted-pair copper media with half-duplex operation. The information provided by the GESR gives the possible 1-gigabit capabilities of the physical layer device. From that information you can choose the gigabit capabilities that will be advertised through the Gigabit Control Register (GCR). In the case of twisted-pair copper physical layer, there is also the ability to advertise the Clock Mastership.

FIGURE 5-22   Gigabit Control Register. Bits: Master/Slave Manual Config Enable, Master/Slave Config Value, 1000BASE-T Full Duplex, 1000BASE-T Half Duplex.

Clock Mastership is a new concept that only applies to copper media running at 1 gigabit. At such high signaling frequencies, it becomes increasingly difficult to maintain separate clocking for the remote and local physical layer devices. Hence, a single clocking domain was introduced, which the remote and local physical layer devices share while a link is established. To achieve the single clocking domain, only one end of the connection provides the clock (the link master), and the other (the link slave) simply uses it. The Gigabit Control Register bits Master/Slave Manual Config Enable and Master/Slave Config Value control how your local physical layer device behaves in this master/slave relationship.


When the Master/Slave Manual Config Enable bit is set, the master/slave configuration is controlled by the Master/Slave Config Value. When it is cleared, the master/slave configuration is established during auto-negotiation by a clock-learning sequence, which automatically establishes a clock master and slave for the link. Typically in a network, the master is the switch port and the slave is the end port or NIC. The Master/Slave Config Value setting is only meaningful when the Master/Slave Manual Config Enable bit is set. If set to 1, it forces the local clock mastership setting to Master. If set to 0, the local clock becomes the Slave. When using the master/slave manual configuration, take care to ensure that the link partner is set accordingly. For example, if 1-gigabit Ethernet switches are set up to operate as link masters, then the computer systems attached to the switches should be set up as slaves.

FIGURE 5-23   Gigabit Status Register. Bits: Master/Slave Manual Config Enable, Master/Slave Config Value, 1000BASE-T Full Duplex, 1000BASE-T Half Duplex.

When the driver fills in the bits of the GCR, it's equivalent to filling in the ANAR in MII: it controls the 1-gigabit capabilities that are advertised. Likewise, the GSR is like the LPANAR, providing the capabilities of the link partner, and its register definition is similar to that of the GCR. With GMII operation, once auto-negotiation is complete, the contents of the GCR are compared with those of the GSR, and the highest-priority shared capability is used to decide the gigabit speed and duplex. It is also possible to disable 1-gigabit operation; in that case, the shared capabilities must be found in the MII registers as described above. In GMII mode, at the end of auto-negotiation, once the GCR and GSR are compared and the ANAR and LPANAR are compared, the choice of operating speed and duplex is established by the hardware based on the following descending priority:


FIGURE 5-24   GMII Mode Link Partner Priority (highest to lowest): 1000BASE-T Full Duplex, 1000BASE-T Half Duplex, 100BASE-T Full Duplex, 100BASE-T4, 100BASE-T Half Duplex, 10BASE-T Full Duplex, 10BASE-T Half Duplex.

Once the correct setting is established, the device software makes that setting known to the user through kernel statistics. It is also possible to manipulate the configuration using the ndd utility.

Ethernet Flow Control


One area of MII/GMII that appeared after the initial definition of MII but before GMII was the introduction of Ethernet Flow Control. Ethernet Flow Control is a MAC Layer feature that controls the rate of packet transmission in both directions.

FIGURE 5-25   Flow Control Pause Frame Format. Fields: Destination Address, Source Address, Protocol Type Field, MAC Pause Opcode, MAC Pause Parameter, Pad to 42 bytes, Frame CRC.

The key to this feature is the use of MAC control frames known as pause frames, which have the following format:

- The Destination Address is a 6-byte address defined for Ethernet Flow Control as the multicast address 01:80:C2:00:00:01.
- The Source Address is a 6-byte address that is the same as the Ethernet station address of the producer of the pause frame.
- The Protocol Type Field is a 2-byte field set to the MAC control protocol type, 0x8808. Pause capability is one example of the use of the MAC control protocol.


- The MAC Control Pause Opcode is a 2-byte value, 0x0001, that indicates the type of MAC control feature to be used, in this case pause.
- The MAC Control Pause Parameter is a 2-byte value, in units of slot time, that indicates whether flow control is being asserted. When the parameter is non-zero, you have an XOFF pause frame; when the value is 0, you have an XON pause frame.
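The layout above maps naturally onto a C structure. This is only for illustration: a real implementation has to deal with structure packing, network byte order, and hardware that usually builds these frames itself.

#include <stdint.h>

struct pause_frame {
    uint8_t  dst[6];        /* 01:80:C2:00:00:01 multicast          */
    uint8_t  src[6];        /* station address of the sender        */
    uint16_t type;          /* 0x8808, MAC control                  */
    uint16_t opcode;        /* 0x0001, pause                        */
    uint16_t pause_time;    /* slot times: non-zero = XOFF, 0 = XON */
    uint8_t  pad[42];       /* pad to minimum frame size            */
    uint32_t fcs;           /* frame CRC                            */
};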

To understand the flow control capability, consider symmetric flow control first. With symmetric flow control, a network node can both generate flow control frames and react to them. Generating a flow control frame is known as Transmit Pause capability and is triggered by congestion on the Rx side. The transmit pause sends an XOFF flow control message to the link partner, which should react to pause frames (Receive Pause capability). By reacting to pause frames, the link partner uses the transmitted pause parameter as the duration its transmitter should remain silent while the Rx congestion clears. If the Rx congestion clears within that pause parameter period, an XON flow control message can be transmitted, telling the link partner that the congestion has cleared and transmission can continue as normal.

In many cases, flow control capability is available in only one direction. This is known as Asymmetric Flow Control and might be a configuration choice or simply a result of a hardware design. Therefore, the MII/GMII specification was altered to allow flow control capability to be advertised to a link partner along with the best type of flow control to be used for the shared link. The changes were applied to the ANAR in the form of two new bits: Pause Capability and Asymmetric Pause Capability. FIGURE 5-26 shows the updated register.

FIGURE 5-26   Link Partner Auto-negotiation Advertisement Register (updated). Bits: 100BASE-T4, 100BASE-T Full Duplex, 100BASE-T Half Duplex, 10BASE-T Full Duplex, 10BASE-T Half Duplex, Pause Capability, Asymmetric Pause Capability.

Starting with the Asymmetric Pause Capability bit: if this bit is set to 0, the ability to pause in both the Rx and Tx directions is governed by the Pause Capability bit; if Pause Capability is set to 1, the local device can pause in both directions. If Asymmetric Pause Capability is set to 1, the local device can pause in only one direction, and the Pause Capability bit selects which. When Pause Capability is set to 1, it indicates that the local setting is receive flow control; in other words, reception of XOFF can stop transmitting, and XON can restart it. When set to 0, it indicates transmit flow control, which means that when the Rx side becomes congested, the device will transmit XOFF, and once the congestion clears, it can transmit XON.

FIGURE 5-27   Rx/Tx Flow Control in Action. When the local machine's Rx FIFO occupancy exceeds the pause threshold, it sends XOFF; the remote machine stops transmitting until the pause time has elapsed or an XON arrives. Once enough packets have been serviced to reduce the Rx FIFO occupancy below the threshold, the local machine sends XON.

Now that the Pause Capability and Asymmetric Pause Capability are established, these parameters must be advertised to the link partner to negotiate the pause setting to be used for the link.
TABLE 5-16 enumerates all the possibilities for resolving the pause capabilities for a link.

TABLE 5-16   Possibilities for Resolving Pause Capabilities for a Link

Local Device                Remote Device                     Link Resolution
cap_pause   cap_asmpause    lp_cap_pause   lp_cap_asmpause    link_pause   link_asmpause
0           0               X              X                  0            X
0           1               0              X                  0            0
0           1               1              0                  0            0
1           0               0              X                  0            0
1           0               1              X                  1            0
1           1               0              0                  0            0
1           1               1              X                  1            0

Chapter 5

Server Network Interface Cards: Datalink and Physical Layer

163

The link_pause and link_asmpause parameters have the same meanings as the cap_pause and cap_asmpause parameters and enumerate meaningful information for a link given the pause capabilities available for both sides of the link.

Example 1

cap_asmpause = 1, cap_pause = 0 — The device is capable of asymmetric pause. The device will send pauses if the receive side becomes congested.
lp_cap_asmpause = 0, lp_cap_pause = 1 — The device is capable of symmetric pause. The device will send pauses if the receive side becomes congested, and it will respond to pause by disabling transmit.
link_asmpause = 0, link_pause = 0 — Because both the local and remote partner are set to send a pause on congestion, only the remote partner will respond to that pause. This is equivalent to no flow control, as it requires both ends to stop transmitting to alleviate the Rx congestion. The zero link values are a further indication that no meaningful flow control is happening on the link.

Example 2

cap_asmpause = 1, cap_pause = 1 — The device is capable of asymmetric pause. The device will send pauses if the receive side becomes congested.
lp_cap_asmpause = 0, lp_cap_pause = 1 — The device is capable of symmetric pause. The device will send pauses if the receive side becomes congested, and it will respond to pause by disabling transmit.
link_asmpause = 1, link_pause = 1 — Because the local setting is to stop sending on arrival of a flow control message and the remote end is set to send flow control messages when it gets congested, we have flow control on the receive direction of the link. Hence it's asymmetric, and the direction of the pauses is incoming.

164

Networking Concepts and Technology: A Designers Resource

More of the flow control combinations from the table can be discussed in terms of flow control in action; we'll return to this topic when discussing the individual devices that support this feature. This concludes the options for controlling the configuration of the Ethernet physical layer MII/GMII. The preceding information should prove useful when configuring your network and making sure that each Ethernet link comes up as required by the configuration. Finally, there is a perception that auto-negotiation has difficulties, but most of these were cleared up with the introduction of Gigabit Ethernet technology. Therefore, it is no longer necessary to disable auto-negotiation to achieve reliable operation with available Gigabit switches.

Fast Ethernet Interfaces


Sun supports four Fast Ethernet drivers for its range of SPARC platforms:

- 10/100 hme Fast Ethernet
- 10/100 qfe Quad Fast Ethernet
- 10/100 eri Fast Ethernet
- 10/100 dmfe Fast Ethernet

The following sections describe the details of these interfaces.

10/100 hme Fast Ethernet


The hme interface is available for SBus systems and PCI bus systems; in both cases they share the same device driver. At the lowest level, the interface provides an RJ-45 twisted-pair Ethernet physical layer that supports auto-negotiation. There are also motherboard implementations of hme that provide an MII connection. This allows an external transceiver to be attached and alternative physical layer interfaces, like 100BASE-T4 or 100BASE-SX, to be made available.


FIGURE 5-28   Typical hme External Connectors: an RJ-45 100BASE-T connector and an MII connector. A media attachment unit on the MII connector allows a connection to Ethernet via an alternative physical layer media type.

At the MAC level, the interface stages packets for transmission using a single 256-element descriptor ring. Once that ring is exhausted, further packets are queued in the streams queue. When the hardware completes transmission of the packets currently occupying space on the ring, the packets waiting on the streams queue are moved to the descriptor ring. The Rx side of the descriptor ring is again a maximum of 256 elements. Once those elements are exhausted, no further buffering is available for incoming packets and overflows begin to occur.

When hme was introduced to the market, 256 descriptors for Tx and Rx were reasonable because CPU frequencies of that time were around 100 MHz to 300 MHz, so the arrival rate of transmit packets posted to the descriptor rings closely matched the transmission capability of the physical media. As CPUs became faster, this number of descriptors became inadequate: the interface often exhausted the elements in the transmit ring and incurred more scheduling overhead for transmission. On the Rx side, as CPUs became faster, receiving packets became easier because less time was required to service each packet, so the occupancy of packets needing to be serviced on the Rx ring diminished.

The hme interface is limited in tuning capability. If you experience low performance because of overflows or the transmit ring being constantly full, no corrective action is possible.


The physical layer of hme is fully configurable using the driver.conf file and ndd command.
TABLE 5-17   Driver Parameters and Status

Parameter           Type            Description
instance            Read and Write  Current device instance in view for ndd
adv_autoneg_cap     Read and Write  Operational mode parameter
adv_100T4_cap       Read and Write  Operational mode parameter
adv_100fdx_cap      Read and Write  Operational mode parameter
adv_100hdx_cap      Read and Write  Operational mode parameter
adv_10fdx_cap       Read and Write  Operational mode parameter
adv_10hdx_cap       Read and Write  Operational mode parameter
use_int_xcvr        Read and Write  Transceiver control parameter
lance_mode          Read and Write  Inter-packet gap parameter
ipg0                Read and Write  Inter-packet gap parameter
ipg1                Read and Write  Inter-packet gap parameter
ipg2                Read and Write  Inter-packet gap parameter
autoneg_cap         Read only       Local transceiver auto-negotiation capability
100T4_cap           Read only       Local transceiver auto-negotiation capability
100fdx_cap          Read only       Local transceiver auto-negotiation capability
100hdx_cap          Read only       Local transceiver auto-negotiation capability
10fdx_cap           Read only       Local transceiver auto-negotiation capability
10hdx_cap           Read only       Local transceiver auto-negotiation capability
lp_autoneg_cap      Read only       Link partner capability
lp_100T4_cap        Read only       Link partner capability
lp_100fdx_cap       Read only       Link partner capability
lp_100hdx_cap       Read only       Link partner capability
lp_10fdx_cap        Read only       Link partner capability
lp_10hdx_cap        Read only       Link partner capability
transceiver_inuse   Read only       Current physical layer status
link_status         Read only       Current physical layer status
link_speed          Read only       Current physical layer status
link_mode           Read only       Current physical layer status


Current Device Instance in View for ndd


The current device instance in view allows you to point ndd to a specific device instance that needs configuration. This must be done prior to altering or viewing any of the other parameters or you might not be viewing or altering the correct parameters.
TABLE 5-18   Instance Parameter

Parameter   Values   Description
instance    0-1000   Current device instance in view for the rest of the ndd configuration variables

Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.

Operational Mode Parameters


The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest-priority advertised value is taken as the mode of operation. See Ethernet Physical Layer on page 152 regarding MII.
TABLE 5-19   Operational Mode Parameters

Parameter         Values   Description
adv_autoneg_cap   0-1      Local interface capability of auto-negotiation signaling advertised by the hardware.
                           0 = Forced mode; 1 = Auto-negotiation.
                           Default is set to the autoneg_cap parameter.
adv_100T4_cap     0-1      Local interface capability of 100-T4 advertised by the hardware.
                           0 = Not 100 Mbit/sec T4 capable; 1 = 100 Mbit/sec T4 capable.
                           Default is set to the 100T4_cap parameter.
adv_100fdx_cap    0-1      Local interface capability of 100 full duplex advertised by the hardware.
                           0 = Not 100 Mbit/sec full-duplex capable; 1 = 100 Mbit/sec full-duplex capable.
                           Default is set based on the 100fdx_cap parameter.
adv_100hdx_cap    0-1      Local interface capability of 100 half duplex advertised by the hardware.
                           0 = Not 100 Mbit/sec half-duplex capable; 1 = 100 Mbit/sec half-duplex capable.
                           Default is set based on the 100hdx_cap parameter.
adv_10fdx_cap     0-1      Local interface capability of 10 full duplex advertised by the hardware.
                           0 = Not 10 Mbit/sec full-duplex capable; 1 = 10 Mbit/sec full-duplex capable.
                           Default is set based on the 10fdx_cap parameter.
adv_10hdx_cap     0-1      Local interface capability of 10 half duplex advertised by the hardware.
                           0 = Not 10 Mbit/sec half-duplex capable; 1 = 10 Mbit/sec half-duplex capable.
                           Default is set based on the 10hdx_cap parameter.

If you use the interactive mode of ndd with this device to alter any of the parameters from adv_100fdx_cap to adv_10hdx_cap, the changes are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.

Transceiver Control Parameter


The hme driver can have an external MII physical layer device connected through the MII interface. In this case, the driver has two choices of physical layer connection to the network, and either connection is valid. Therefore, a policy is implemented that assumes that if an external physical layer is attached, it's because the internal physical layer device doesn't provide the required media, so the driver assumes that the external physical layer device is the one to use.


In some cases it might be necessary to override that policy. Therefore, the ndd parameter use_int_xcvr is provided.

TABLE 5-20   Transceiver Control Parameter

Parameter      Values   Description
use_int_xcvr   0-1      Override for the policy that the external transceiver takes priority over the internal transceiver.
                        0 = If an external transceiver is present, use it instead of the internal (default).
                        1 = If an external transceiver is present, ignore it and continue to use the internal.

Inter-Packet Gap Parameters


The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG is the sum of ipg1 and ipg2, plus an optional ipg0 that is only added when the lance_mode parameter is set. The total default IPG is 9.6 microseconds when the link speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speed is 100 Mbit/sec, the total IPG is 0.96 microseconds.

The additional delay set by ipg0 helps to reduce collisions. Systems that have lance_mode enabled might not get enough time on the network. If lance_mode is disabled, the value of ipg0 is ignored and no additional delay is set; only the delays set by ipg1 and ipg2 are used. Disable lance_mode if other systems keep sending a large number of back-to-back packets. You can add the additional delay by setting the ipg0 parameter, which is a delay of 0 to 31 nibble times. Nibble time is the time it takes to transfer four bits on the link: if the link speed is 10 Mbit/sec, nibble time is 400 ns; if the link speed is 100 Mbit/sec, nibble time is 40 ns.

Note – IPG is sometimes increased on older systems using slower NICs where newer NICs and systems are hogging the network. When a server dominates a half-duplex network, this is known as the server capture effect.


For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.
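The nibble-time arithmetic above reduces to a one-line computation; this helper is a sketch (the function name is invented for illustration).

/* Delay contributed by ipg0, in nanoseconds: nibble time is 400 ns at
 * 10 Mbit/sec and 40 ns at 100 Mbit/sec (the time to transfer four bits). */
unsigned ipg0_delay_ns(unsigned ipg0_nibbles, int speed_mbps)
{
    return ipg0_nibbles * (speed_mbps == 100 ? 40u : 400u);
}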

TABLE 5-21   Inter-Packet Gap Parameters

Parameter    Values   Description
lance_mode   0-1      0 = lance_mode disabled; 1 = lance_mode enabled (default)
ipg0         0-31     Additional IPG before transmitting a packet. Default = 4
ipg1         0-255    First inter-packet gap parameter. Default = 8
ipg2         0-255    Second inter-packet gap parameter. Default = 8

All of the IPG parameters can be set using ndd or can be hard-coded into the hme.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.

Local Transceiver Auto-negotiation Capability


The local transceiver auto-negotiation capability parameters are read-only parameters and represent the fixed set of capabilities associated with the PHY currently in use. An external MII PHY device can be attached to the external MII port, so the capabilities presented in these statistics might vary according to the capabilities of the external MII physical layer device that is attached.

TABLE 5-22   Local Transceiver Auto-negotiation Capability Parameters

Parameter     Values   Description
autoneg_cap   0-1      Local interface is capable of auto-negotiation signaling.
                       0 = Can only operate in Forced mode; 1 = Capable of auto-negotiation
100T4_cap     0-1      Local interface is capable of 100-T4 operation.
                       0 = Not 100 Mbit/sec T4 capable; 1 = 100 Mbit/sec T4 capable
100fdx_cap    0-1      Local interface is capable of 100 full-duplex operation.
                       0 = Not 100 Mbit/sec full-duplex capable; 1 = 100 Mbit/sec full-duplex capable
100hdx_cap    0-1      Local interface is capable of 100 half-duplex operation.
                       0 = Not 100 Mbit/sec half-duplex capable; 1 = 100 Mbit/sec half-duplex capable
10fdx_cap     0-1      Local interface is capable of 10 full-duplex operation.
                       0 = Not 10 Mbit/sec full-duplex capable; 1 = 10 Mbit/sec full-duplex capable
10hdx_cap     0-1      Local interface is capable of 10 half-duplex operation.
                       0 = Not 10 Mbit/sec half-duplex capable; 1 = 10 Mbit/sec half-duplex capable


Link Partner Capability


The link partner capability parameters are read-only parameters and represent the fixed set of capabilities in the link partner's advertised auto-negotiation parameters. These parameters are only meaningful when auto-negotiation is enabled and can be used in conjunction with the operational mode parameters to establish why there might be problems bringing up the link.

TABLE 5-23   Link Partner Capability Parameters

Parameter        Values   Description
lp_autoneg_cap   0-1      Link partner interface is capable of auto-negotiation signaling.
                          0 = Can only operate in Forced mode; 1 = Capable of auto-negotiation
lp_100T4_cap     0-1      Link partner interface is capable of 100-T4 operation.
                          0 = Not 100 Mbit/sec T4 capable; 1 = 100 Mbit/sec T4 capable
lp_100fdx_cap    0-1      Link partner interface is capable of 100 full-duplex operation.
                          0 = Not 100 Mbit/sec full-duplex capable; 1 = 100 Mbit/sec full-duplex capable
lp_100hdx_cap    0-1      Link partner interface is capable of 100 half-duplex operation.
                          0 = Not 100 Mbit/sec half-duplex capable; 1 = 100 Mbit/sec half-duplex capable
lp_10fdx_cap     0-1      Link partner interface is capable of 10 full-duplex operation.
                          0 = Not 10 Mbit/sec full-duplex capable; 1 = 10 Mbit/sec full-duplex capable
lp_10hdx_cap     0-1      Link partner interface is capable of 10 half-duplex operation.
                          0 = Not 10 Mbit/sec half-duplex capable; 1 = 10 Mbit/sec half-duplex capable


Current Physical Layer Status


The current physical layer status gives an indication of the state of the link: whether it's up or down and what speed and duplex it's operating at. These parameters are derived from the result of establishing the highest-priority shared speed and duplex capability when auto-negotiation is enabled; with Forced mode, they can be preconfigured.
TABLE 5-24   Current Physical Layer Status Parameters

Parameter           Values   Description
transceiver_inuse   0-1      Indicates which transceiver is currently in use.
                             0 = Internal transceiver is in use; 1 = External transceiver is in use
link_status         0-1      Current link status.
                             0 = Link down; 1 = Link up
link_speed          0-1      The link speed; only valid if the link is up.
                             0 = 10 Mbit/sec; 1 = 100 Mbit/sec
link_mode           0-1      The link duplex; only valid if the link is up.
                             0 = Half duplex; 1 = Full duplex
Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode or while the interface being viewed has already been initialized by virtue of the presence of open streams, as created by, for example, snoop -d hme0 or ifconfig hme0 plumb inet up. If no such streams exist, then the device is uninitialized and the state is only set up when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes these parameters unreliable unless an existing stream is associated with an instance prior to checking. A good rule to follow is to trust these parameters only if the interface is configured up using the ifconfig command.


10/100 qfe Quad Fast Ethernet


The qfe interface was developed around the same ASIC as hme, and the driver is very similar in terms of capabilities and limitations. One key difference is that there's no external MII connector and therefore no possibility of using any physical media other than 100BASE-T.

FIGURE 5-29   Typical qfe External Connectors

With the introduction of qfe came the introduction of trunking technology, which will be discussed later. The physical layer of qfe is fully configurable using the driver.conf file and ndd command.
TABLE 5-25   Driver Parameters and Status

Parameter           Type            Description
instance            Read and Write  Current device instance in view for ndd
adv_autoneg_cap     Read and Write  Operational mode parameter
adv_100T4_cap       Read and Write  Operational mode parameter
adv_100fdx_cap      Read and Write  Operational mode parameter
adv_100hdx_cap      Read and Write  Operational mode parameter
adv_10fdx_cap       Read and Write  Operational mode parameter
adv_10hdx_cap       Read and Write  Operational mode parameter
use_int_xcvr        Read and Write  Transceiver control parameter
lance_mode          Read and Write  Inter-packet gap parameter
ipg0                Read and Write  Inter-packet gap parameter
ipg1                Read and Write  Inter-packet gap parameter
ipg2                Read and Write  Inter-packet gap parameter
autoneg_cap         Read only       Local transceiver auto-negotiation capability
100T4_cap           Read only       Local transceiver auto-negotiation capability
100fdx_cap          Read only       Local transceiver auto-negotiation capability
100hdx_cap          Read only       Local transceiver auto-negotiation capability
10fdx_cap           Read only       Local transceiver auto-negotiation capability
10hdx_cap           Read only       Local transceiver auto-negotiation capability
lp_autoneg_cap      Read only       Link partner capability
lp_100T4_cap        Read only       Link partner capability
lp_100fdx_cap       Read only       Link partner capability
lp_100hdx_cap       Read only       Link partner capability
lp_10fdx_cap        Read only       Link partner capability
lp_10hdx_cap        Read only       Link partner capability
transceiver_inuse   Read only       Current physical layer status
link_status         Read only       Current physical layer status
link_speed          Read only       Current physical layer status
link_mode           Read only       Current physical layer status

Current Device Instance in View for ndd


The current device instance in view allows you to point ndd to a particular device instance that needs configuration. This must be applied prior to altering or viewing any of the other parameters or you might not be viewing or altering the correct parameters.
TABLE 5-26   Instance Parameter

Parameter   Values   Description
instance    0-1000   Current device instance in view for the rest of the ndd configuration variables

Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.


Operational Mode Parameters


The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest-priority advertised value is taken as the mode of operation. See Fast Ethernet Interfaces on page 165 regarding MII.
TABLE 5-27   Operational Mode Parameters

Parameter         Values   Description
adv_autoneg_cap   0-1      Local interface capability of auto-negotiation signaling advertised by the hardware.
                           0 = Forced mode; 1 = Auto-negotiation.
                           Default is set to the autoneg_cap parameter.
adv_100T4_cap     0-1      Local interface capability of 100-T4 advertised by the hardware.
                           0 = Not 100 Mbit/sec T4 capable; 1 = 100 Mbit/sec T4 capable.
                           Default is set to the 100T4_cap parameter.
adv_100fdx_cap    0-1      Local interface capability of 100 full duplex advertised by the hardware.
                           0 = Not 100 Mbit/sec full-duplex capable; 1 = 100 Mbit/sec full-duplex capable.
                           Default is set based on the 100fdx_cap parameter.
adv_100hdx_cap    0-1      Local interface capability of 100 half duplex advertised by the hardware.
                           0 = Not 100 Mbit/sec half-duplex capable; 1 = 100 Mbit/sec half-duplex capable.
                           Default is set based on the 100hdx_cap parameter.
adv_10fdx_cap     0-1      Local interface capability of 10 full duplex advertised by the hardware.
                           0 = Not 10 Mbit/sec full-duplex capable; 1 = 10 Mbit/sec full-duplex capable.
                           Default is set based on the 10fdx_cap parameter.
adv_10hdx_cap     0-1      Local interface capability of 10 half duplex advertised by the hardware.
                           0 = Not 10 Mbit/sec half-duplex capable; 1 = 10 Mbit/sec half-duplex capable.
                           Default is set based on the 10hdx_cap parameter.

Chapter 5

Server Network Interface Cards: Datalink and Physical Layer

177

If you use the interactive mode of ndd with this device to alter any of the parameters from adv_100fdx_cap to adv_10hdx_cap, the changes are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.

Transceiver Control Parameter


The qfe driver has the capability to support an external MII physical layer device, but there's no hardware implemented to allow this feature to be used. The use_int_xcvr parameter should never be altered in the case of qfe.

Inter-Packet Gap Parameters


The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG is the sum of ipg1 and ipg2, plus an optional ipg0 that is only added when the lance_mode parameter is set. The total default IPG is 9.6 microseconds when the link speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speed is 100 Mbit/sec, the total IPG is 0.96 microseconds. The additional delay set by ipg0 helps to reduce collisions. Systems that have lance_mode enabled might not get enough time on the network. If lance_mode is disabled, the value of ipg0 is ignored and no additional delay is set; only the delays set by ipg1 and ipg2 are used. Disable lance_mode if other systems keep sending a large number of back-to-back packets. You can add the additional delay by setting the ipg0 parameter, which is a delay of 0 to 31 nibble times. Nibble time is the time it takes to transfer four bits on the link: if the link speed is 10 Mbit/sec, nibble time is 400 ns; if the link speed is 100 Mbit/sec, nibble time is 40 ns. For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.
TABLE 5-28   Inter-Packet Gap Parameters

Parameter    Values   Description
lance_mode   0-1      0 = lance_mode disabled; 1 = lance_mode enabled (default)
ipg0         0-31     Additional IPG before transmitting a packet. Default = 4
ipg1         0-255    First IPG parameter. Default = 8
ipg2         0-255    Second IPG parameter. Default = 8

All of the IPG parameters can be set using ndd or can be hard-coded into the qfe.conf files. Details about setting these parameters are provided in Reboot Persistence Using driver.conf on page 242.
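
As a sketch of the two methods (the exact driver.conf syntax is covered in the section cited above; the property spelling below is an assumption based on the ndd parameter names, and the values are illustrative): interactively,

    # ndd -set /dev/qfe instance 0
    # ndd -set /dev/qfe lance_mode 1
    # ndd -set /dev/qfe ipg0 16

or persistently, with a line such as the following in qfe.conf so the values survive a reboot:

    ipg0=16 ipg1=8 ipg2=8;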

Local Transceiver Auto-negotiation Capability


The local transceiver auto-negotiation capability parameters are read-only parameters and represent the fixed set of capabilities associated with the current PHY that is in use. This device allows an external MII PHY device to be attached to the external MII port. Therefore the capabilities presented in these statistics might vary according to the capabilities of the external MII physical layer device that is attached.
TABLE 5-29   Local Transceiver Auto-negotiation Capability Parameters

autoneg_cap (values 0-1)
    Local interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

100T4_cap (values 0-1)
    Local interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

100fdx_cap (values 0-1)
    Local interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

100hdx_cap (values 0-1)
    Local interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

10fdx_cap (values 0-1)
    Local interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

10hdx_cap (values 0-1)
    Local interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable

Link Partner Capability


The link partner capability parameters are read-only parameters and represent the fixed set of capabilities associated with the attached link partner's set of advertised auto-negotiation parameters. These parameters are only meaningful when auto-negotiation is enabled and can be used in conjunction with the operational mode parameters to establish why there might be problems bringing up the link.

TABLE 5-30   Link Partner Capability Parameters

lp_autoneg_cap (values 0-1)
    Link partner interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

lp_100T4_cap (values 0-1)
    Link partner interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

lp_100fdx_cap (values 0-1)
    Link partner interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

lp_100hdx_cap (values 0-1)
    Link partner interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

lp_10fdx_cap (values 0-1)
    Link partner interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

lp_10hdx_cap (values 0-1)
    Link partner interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable

Current Physical Layer Status


The current physical layer status parameters give an indication of the state of the link: whether it's up or down, and what speed and duplex it's operating at. These values are derived from the highest-priority speed and duplex capability shared with the link partner when auto-negotiation is enabled, or from the pre-configured settings when Forced mode is used.

TABLE 5-31   Current Physical Layer Status Parameters

transceiver_inuse (values 0-1)
    Indicates which transceiver is currently in use.
    0 = Internal transceiver is in use.
    1 = External transceiver is in use.

link_status (values 0-1)
    Current link status.
    0 = Link down
    1 = Link up

link_speed (values 0-1)
    This parameter provides the link speed and is only valid if the link is up.
    0 = 10 Mbit/sec
    1 = 100 Mbit/sec

link_mode (values 0-1)
    This parameter provides the link duplex and is only valid if the link is up.
    0 = Half duplex
    1 = Full duplex


Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode, or when the interface being viewed has already been initialized by virtue of open streams such as those created by snoop -d qfe0 or ifconfig qfe0 plumb inet up. If no such streams exist, the device is uninitialized, and its state is only set up when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes them unreliable unless an existing stream is associated with the instance before checking. A good rule to follow is to trust these parameters only if the interface has been configured up using the ifconfig command.
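
A hedged sketch of that rule in practice; the sample output assumes a qfe0 link that is up at 100 Mbit/sec full duplex:

    # ifconfig qfe0 plumb inet up
    # ndd -set /dev/qfe instance 0
    # ndd -get /dev/qfe link_status
    1
    # ndd -get /dev/qfe link_speed
    1
    # ndd -get /dev/qfe link_mode
    1

Because the interface was configured up first, the three status values can be trusted: link up, 100 Mbit/sec, full duplex.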

10/100 eri Fast Ethernet


When the eri interface was applied to the UltraSPARC III desktop systems, it addressed many of the shortcomings of the hme interface and also eliminated the external MII interface. The detailed architecture of the eri interface is again made up of a single transmit descriptor ring and a single receive descriptor ring. However, with eri, the maximum size of the descriptor ring was increased to 8K elements. The opportunity to store packets for transmission is therefore much larger, and the probability of not having a descriptor element available when attempting to transmit is reduced, along with the need to use the streams queue behind the transmit descriptor ring. Overall, the scheduling overhead described as a problem with hme and qfe is vastly reduced with eri. The eri interface is also the first interface capable of supporting the hardware checksumming capability, allowing it to be more efficient in bulk transfer applications. The physical layer and performance features of eri are fully configurable using the driver.conf file and the ndd command.
TABLE 5-32   Driver Parameters and Status

Parameter           Type             Description
instance            Read and Write   Current device instance in view for ndd
adv_autoneg_cap     Read and Write   Operational mode parameter
adv_100T4_cap       Read and Write   Operational mode parameter
adv_100fdx_cap      Read and Write   Operational mode parameter
adv_100hdx_cap      Read and Write   Operational mode parameter
adv_10fdx_cap       Read and Write   Operational mode parameter
adv_10hdx_cap       Read and Write   Operational mode parameter
use_int_xcvr        Read and Write   Transceiver control parameter
lance_mode          Read and Write   Inter-packet gap parameter
ipg0                Read and Write   Inter-packet gap parameter
ipg1                Read and Write   Inter-packet gap parameter
ipg2                Read and Write   Inter-packet gap parameter
autoneg_cap         Read only        Local transceiver auto-negotiation capability
100T4_cap           Read only        Local transceiver auto-negotiation capability
100fdx_cap          Read only        Local transceiver auto-negotiation capability
100hdx_cap          Read only        Local transceiver auto-negotiation capability
10fdx_cap           Read only        Local transceiver auto-negotiation capability
10hdx_cap           Read only        Local transceiver auto-negotiation capability
lp_autoneg_cap      Read only        Link partner capability
lp_100T4_cap        Read only        Link partner capability
lp_100fdx_cap       Read only        Link partner capability
lp_100hdx_cap       Read only        Link partner capability
lp_10fdx_cap        Read only        Link partner capability
lp_10hdx_cap        Read only        Link partner capability
transceiver_inuse   Read only        Current physical layer status
link_status         Read only        Current physical layer status
link_speed          Read only        Current physical layer status
link_mode           Read only        Current physical layer status


Current Device Instance in View for ndd


The current device instance in view allows you to point ndd to a particular device instance that needs configuration. This must be applied prior to altering or viewing any of the other parameters or you might not be viewing or altering the correct parameters.
TABLE 5-33   Instance Parameter

instance (values 0-1000)
    Current device instance in view for the rest of the ndd configuration variables

Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.
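
For example, a quick hedged check before touching an eri parameter (the instance numbers are illustrative):

    # ndd -get /dev/eri instance
    0
    # ndd -set /dev/eri instance 1

The first command confirms which instance is in view; the second switches the view to eri1 so that subsequent gets and sets apply to that port.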

Operational Mode Parameters


The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest priority value is taken as the mode of operation. See "Ethernet Physical Layer" on page 152 regarding MII.
TABLE 5-34   Operational Mode Parameters

adv_autoneg_cap (values 0-1)
    Local interface capability of auto-negotiation signaling is advertised by the hardware.
    0 = Forced mode
    1 = Auto-negotiation
    Default is set to the autoneg_cap parameter.

adv_100T4_cap (values 0-1)
    Local interface capability of 100-T4 is advertised by the hardware.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable
    Default is set to the 100T4_cap parameter.

adv_100fdx_cap (values 0-1)
    Local interface capability of 100 full duplex is advertised by the hardware.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable
    Default is set based on the 100fdx_cap parameter.

adv_100hdx_cap (values 0-1)
    Local interface capability of 100 half duplex is advertised by the hardware.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable
    Default is set based on the 100hdx_cap parameter.

adv_10fdx_cap (values 0-1)
    Local interface capability of 10 full duplex is advertised by the hardware.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable
    Default is set based on the 10fdx_cap parameter.

adv_10hdx_cap (values 0-1)
    Local interface capability of 10 half duplex is advertised by the hardware.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable
    Default is set based on the 10hdx_cap parameter.

If you use the interactive mode of ndd with this device to alter any of the parameters from adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.

Transceiver Control Parameter


The eri driver is capable of having an external MII physical layer device connected, but there's no implemented hardware to allow this feature to be utilized. The use_int_xcvr parameter should never be altered in the case of eri.

Inter-Packet Gap Parameters


The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG is the sum of ipg1 and ipg2 plus an optional ipg0 that will only be present when the lance_mode parameter is set. The total default IPG is 9.6 microseconds when the link speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speed is 100 Mbit/sec, the total IPG is 0.96 microseconds.


The additional delay set by ipg0 helps to reduce collisions. Systems that have lance_mode enabled might not have enough time on the network. If lance_mode is disabled, the value of ipg0 is ignored and no additional delay is set; only the delays set by ipg1 and ipg2 are used. Disable lance_mode if other systems keep sending a large number of back-to-back packets.

You can add the additional delay by setting the ipg0 parameter, which is the nibble time delay, from 0 to 31. Note that nibble time is the time it takes to transfer four bits on the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the link speed is 100 Mbit/sec, nibble time is equal to 40 ns. For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.
TABLE 5-35   Inter-Packet Gap Parameters

lance_mode (values 0-1)
    0 = lance_mode disabled
    1 = lance_mode enabled (default)

ipg0 (values 0-31)
    Additional IPG before transmitting a packet. Default = 4

ipg1 (values 0-255)
    First inter-packet gap parameter. Default = 8

ipg2 (values 0-255)
    Second inter-packet gap parameter. Default = 8

All of the IPG parameters can be set using ndd or can be hard-coded into the eri.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.


Receive Interrupt Blanking Parameters


The eri device introduces the receive interrupt blanking capability to 10/100 Mbit/sec ports on the UltraSPARC III desktop systems. The following table provides the parameter names, value range, and given defaults.
TABLE 5-36   Receive Interrupt Blanking Parameters

intr_blank_time (values 0-127)
    Interrupt after this number of clock cycles has passed and the packets pending have not reached the number of intr_blank_packets. One clock cycle equals 2048 PCI clock cycles. (Default = 6)

intr_blank_packets (values 0-255)
    Interrupt after this number of packets has arrived since the last packet was serviced. A value of zero indicates no packet blanking. (Default = 8)
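
A hedged example of trading interrupt rate against latency with these parameters (the device instance and values are illustrative, not recommendations): raising intr_blank_packets batches more packets per interrupt on a busy receive path, while setting it to zero disables packet blanking entirely.

    # ndd -set /dev/eri instance 0
    # ndd -set /dev/eri intr_blank_packets 32
    # ndd -set /dev/eri intr_blank_time 12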

Local Transceiver Auto-negotiation Capability


The local transceiver auto-negotiation capability parameters are read-only parameters and represent the fixed set of capabilities associated with the current PHY that is in use.

TABLE 5-37   Local Transceiver Auto-negotiation Capability Parameters

autoneg_cap (values 0-1)
    Local interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

100T4_cap (values 0-1)
    Local interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

100fdx_cap (values 0-1)
    Local interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

100hdx_cap (values 0-1)
    Local interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

10fdx_cap (values 0-1)
    Local interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

10hdx_cap (values 0-1)
    Local interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable

Link Partner Capability


The link partner capability parameters are read-only parameters and represent the fixed set of capabilities associated with the attached link partner's set of advertised auto-negotiation parameters. These parameters are only meaningful when auto-negotiation is enabled and can be used in conjunction with the operational mode parameters to establish why there might be problems bringing up the link.

TABLE 5-38   Link Partner Capability Parameters

lp_autoneg_cap (values 0-1)
    Link partner interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

lp_100T4_cap (values 0-1)
    Link partner interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

lp_100fdx_cap (values 0-1)
    Link partner interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

lp_100hdx_cap (values 0-1)
    Link partner interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

lp_10fdx_cap (values 0-1)
    Link partner interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

lp_10hdx_cap (values 0-1)
    Link partner interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable

Current Physical Layer Status


The current physical layer status parameters give an indication of the state of the link: whether it's up or down, and what speed and duplex it's operating at. These values are derived from the highest-priority speed and duplex capability shared with the link partner when auto-negotiation is enabled, or from the pre-configured settings when Forced mode is used.

TABLE 5-39   Current Physical Layer Status Parameters

transceiver_inuse (values 0-1)
    Indicates which transceiver is currently in use.
    0 = Internal transceiver is in use.
    1 = External transceiver is in use.

link_status (values 0-1)
    Current link status.
    0 = Link down
    1 = Link up

link_speed (values 0-1)
    This parameter provides the link speed and is only valid if the link is up.
    0 = 10 Mbit/sec
    1 = 100 Mbit/sec

link_mode (values 0-1)
    This parameter provides the link duplex and is only valid if the link is up.
    0 = Half duplex
    1 = Full duplex


Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode, or when the interface being viewed has already been initialized by virtue of open streams such as those created by snoop -d eri0 or ifconfig eri0 plumb inet up. If no such streams exist, the device is uninitialized, and its state is only set up when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes them unreliable unless an existing stream is associated with the instance before checking. A good rule to follow is to trust these parameters only if the interface has been configured up using the ifconfig command.

10/100 dmfe Fast Ethernet


The dmfe interface is another Ethernet system interface, applied to the UltraSPARC rack-mounted Netra X1 and Sun Fire V100 server systems. This interface is much like the others in that its architecture supports a single transmit and a single receive descriptor ring. The number of elements in its descriptor ring is fixed at 32. The physical layer of dmfe is fully configurable using the driver.conf file and the ndd command.
TABLE 5-40   Driver Parameters and Status

Parameter           Type             Description
adv_autoneg_cap     Read and Write   Operational mode parameter
adv_100T4_cap       Read and Write   Operational mode parameter
adv_100fdx_cap      Read and Write   Operational mode parameter
adv_100hdx_cap      Read and Write   Operational mode parameter
adv_10fdx_cap       Read and Write   Operational mode parameter
adv_10hdx_cap       Read and Write   Operational mode parameter
autoneg_cap         Read only        Local transceiver auto-negotiation capability
100T4_cap           Read only        Local transceiver auto-negotiation capability
100fdx_cap          Read only        Local transceiver auto-negotiation capability
100hdx_cap          Read only        Local transceiver auto-negotiation capability
10fdx_cap           Read only        Local transceiver auto-negotiation capability
10hdx_cap           Read only        Local transceiver auto-negotiation capability
lp_autoneg_cap      Read only        Link partner capability
lp_100T4_cap        Read only        Link partner capability
lp_100fdx_cap       Read only        Link partner capability
lp_100hdx_cap       Read only        Link partner capability
lp_10fdx_cap        Read only        Link partner capability
lp_10hdx_cap        Read only        Link partner capability
link_status         Read only        Current physical layer status
link_speed          Read only        Current physical layer status
link_mode           Read only        Current physical layer status

Operational Mode Parameters


The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest priority value is taken as the mode of operation. See "Ethernet Physical Layer" on page 152 regarding MII.
TABLE 5-41   Operational Mode Parameters

adv_autoneg_cap (values 0-1)
    Local interface capability of auto-negotiation signaling is advertised by the hardware.
    0 = Forced mode
    1 = Auto-negotiation
    Default is set to the autoneg_cap parameter.

adv_100T4_cap (values 0-1)
    Local interface capability of 100-T4 is advertised by the hardware.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable
    Default is set to the 100T4_cap parameter.

adv_100fdx_cap (values 0-1)
    Local interface capability of 100 full duplex is advertised by the hardware.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable
    Default is set based on the 100fdx_cap parameter.

adv_100hdx_cap (values 0-1)
    Local interface capability of 100 half duplex is advertised by the hardware.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable
    Default is set based on the 100hdx_cap parameter.

adv_10fdx_cap (values 0-1)
    Local interface capability of 10 full duplex is advertised by the hardware.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable
    Default is set based on the 10fdx_cap parameter.

adv_10hdx_cap (values 0-1)
    Local interface capability of 10 half duplex is advertised by the hardware.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable
    Default is set based on the 10hdx_cap parameter.

If you use the interactive mode of ndd with this device to alter any of the parameters from adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.

Local Transceiver Auto-negotiation Capability


The local transceiver auto-negotiation capability parameters are read-only parameters and represent the fixed set of capabilities associated with the current PHY that is in use. This device allows an external MII PHY device to be attached to the external MII port. Therefore, the capabilities presented in these statistics might vary according to the capabilities of the external MII physical layer device that is attached.

TABLE 5-42   Local Transceiver Auto-negotiation Capability Parameters

autoneg_cap (values 0-1)
    Local interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

100T4_cap (values 0-1)
    Local interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

100fdx_cap (values 0-1)
    Local interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

100hdx_cap (values 0-1)
    Local interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

10fdx_cap (values 0-1)
    Local interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

10hdx_cap (values 0-1)
    Local interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable


Link Partner Capability


The link partner capability parameters are read-only parameters and represent the fixed set of capabilities associated with the attached link partner's set of advertised auto-negotiation parameters. These parameters are only meaningful when auto-negotiation is enabled and can be used in conjunction with the operational mode parameters to establish why there might be problems bringing up the link.

TABLE 5-43   Link Partner Capability Parameters

lp_autoneg_cap (values 0-1)
    Link partner interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

lp_100T4_cap (values 0-1)
    Link partner interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

lp_100fdx_cap (values 0-1)
    Link partner interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

lp_100hdx_cap (values 0-1)
    Link partner interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

lp_10fdx_cap (values 0-1)
    Link partner interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

lp_10hdx_cap (values 0-1)
    Link partner interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable


Current Physical Layer Status


The current physical layer status parameters give an indication of the state of the link: whether it's up or down, and what speed and duplex it's operating at. These values are derived from the highest-priority speed and duplex capability shared with the link partner when auto-negotiation is enabled, or from the pre-configured settings when Forced mode is used.

TABLE 5-44   Current Physical Layer Status Parameters

link_status (values 0-1)
    Current link status.
    0 = Link down
    1 = Link up

link_speed (values 0-1)
    This parameter provides the link speed and is only valid if the link is up.
    0 = 10 Mbit/sec
    1 = 100 Mbit/sec

link_mode (values 0-1)
    This parameter provides the link duplex and is only valid if the link is up.
    0 = Half duplex
    1 = Full duplex

Fiber Gigabit Ethernet


The first physical medium available at 1 gigabit was fiber. This medium allows Ethernet to stretch to the one-kilometer range using fiber-optic cable. The first interface introduced to provide the 1-gigabit capability was the Sun Gigabit Ethernet adapter, vge. This was quickly followed by the ge interface, which was then followed by the high-performance ce interface. This section describes these interfaces in detail and explains how they can be best utilized to maximize the performance of the network that they drive or are simply part of.


FIGURE 5-30   Typical vge and ge MMF External Connectors

1000 vge Gigabit Ethernet


The vge gigabit interface exists only as a 1000BASE-SX fiber interface and is available to support existing SBus-capable systems or PCI bus systems. The vge interface was also the first available interface to support VLAN capability.

1000 ge Gigabit Ethernet


The ge gigabit interface exists only as a 1000BASE-SX fiber interface and is available to support existing SBus-capable systems or PCI bus systems. The architecture is the same as the eri interface, with one transmit descriptor ring and one receive descriptor ring. The ge interface employs the hardware checksumming capability described above to reduce the cost of the TCP/IP checksum calculation. During its development, the interface was always challenging the limits of the SPARC systems, so it has many tunable features that can be set to provide the best system and application performance.


The ge interface also provides Layer 2 flow control capability. The physical layer and performance features of ge are fully configurable using the driver.conf file and ndd command.
TABLE 5-45   Driver Parameters and Status

Parameter            Type             Description
instance             Read and Write   Current device instance in view for ndd
adv_autoneg_cap      Read and Write   Operational mode parameter
adv_1000fdx_cap      Read and Write   Operational mode parameter
adv_1000hdx_cap      Read and Write   Operational mode parameter
adv_100T4_cap        Read and Write   Operational mode parameter
adv_100fdx_cap       Read and Write   Operational mode parameter
adv_100hdx_cap       Read and Write   Operational mode parameter
adv_10fdx_cap        Read and Write   Operational mode parameter
adv_10hdx_cap        Read and Write   Operational mode parameter
use_int_xcvr         Read and Write   Transceiver control parameter
lance_mode           Read and Write   Inter-packet gap parameter
ipg0                 Read and Write   Inter-packet gap parameter
ipg1                 Read and Write   Inter-packet gap parameter
ipg2                 Read and Write   Inter-packet gap parameter
intr_blank_time      Read and Write   Receive interrupt blanking parameter
intr_blank_packets   Read and Write   Receive interrupt blanking parameter
autoneg_cap          Read only        Local transceiver auto-negotiation capability
1000fdx_cap          Read only        Local transceiver auto-negotiation capability
1000hdx_cap          Read only        Local transceiver auto-negotiation capability
100T4_cap            Read only        Local transceiver auto-negotiation capability
100fdx_cap           Read only        Local transceiver auto-negotiation capability
100hdx_cap           Read only        Local transceiver auto-negotiation capability
10fdx_cap            Read only        Local transceiver auto-negotiation capability
10hdx_cap            Read only        Local transceiver auto-negotiation capability
lp_autoneg_cap       Read only        Link partner capability
lp_1000fdx_cap       Read only        Link partner capability
lp_1000hdx_cap       Read only        Link partner capability
lp_100T4_cap         Read only        Link partner capability
lp_100fdx_cap        Read only        Link partner capability
lp_100hdx_cap        Read only        Link partner capability
lp_10fdx_cap         Read only        Link partner capability
lp_10hdx_cap         Read only        Link partner capability
transceiver_inuse    Read only        Current physical layer status
link_status          Read only        Current physical layer status
link_speed           Read only        Current physical layer status
link_mode            Read only        Current physical layer status

Current Device Instance in View for ndd


The current device instance in view allows you to point ndd to a particular device instance that needs configuration. This must be applied prior to altering or viewing any of the other parameters or you might not be viewing or altering the correct parameters.
TABLE 5-46   Instance Parameter

instance (values 0-1000)
    Current device instance in view for the rest of the ndd configuration variables

Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.


Operational Mode Parameters


The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest priority value is taken as the mode of operation. See "Ethernet Physical Layer" on page 152 regarding MII.
TABLE 5-47   Operational Mode Parameters

adv_autoneg_cap (values 0-1)
    Local interface capability of auto-negotiation signaling is advertised by the hardware.
    0 = Forced mode
    1 = Auto-negotiation
    Default is set to the autoneg_cap parameter.

adv_1000fdx_cap (values 0-1)
    Local interface capability of 1000 full duplex is advertised by the hardware.
    0 = Not 1000 Mbit/sec full-duplex capable
    1 = 1000 Mbit/sec full-duplex capable
    Default is set to the 1000fdx_cap parameter.

adv_1000hdx_cap (values 0-1)
    Local interface capability of 1000 half duplex is advertised by the hardware.
    0 = Not 1000 Mbit/sec half-duplex capable
    1 = 1000 Mbit/sec half-duplex capable
    Default is set to the 1000hdx_cap parameter.

adv_100T4_cap (values 0-1)
    Local interface capability of 100-T4 is advertised by the hardware.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable
    Default is set to the 100T4_cap parameter.

adv_100fdx_cap (values 0-1)
    Local interface capability of 100 full duplex is advertised by the hardware.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable
    Default is set based on the 100fdx_cap parameter.

adv_100hdx_cap (values 0-1)
    Local interface capability of 100 half duplex is advertised by the hardware.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable
    Default is set based on the 100hdx_cap parameter.

adv_10fdx_cap (values 0-1)
    Local interface capability of 10 full duplex is advertised by the hardware.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable
    Default is set based on the 10fdx_cap parameter.

adv_10hdx_cap (values 0-1)
    Local interface capability of 10 half duplex is advertised by the hardware.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable
    Default is set based on the 10hdx_cap parameter.

If you use the interactive mode of ndd with this device to alter any of the parameters from adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.

Transceiver Control Parameter


The ge driver has the capability to have an external MII physical layer device connected, but there's no implemented hardware to allow this feature to be utilized. The use_int_xcvr parameter should never be altered in the case of ge.

Inter-Packet Gap Parameters


The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG is the sum of ipg1 and ipg2 plus an optional ipg0 that will only be present when the lance_mode parameter is set. The total default IPG is 9.6 microseconds when the link speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speed is 100 Mbit/sec, the total IPG is 0.96 microseconds. When the link speed is 1000 Mbit/sec, the total IPG is 0.096 microseconds.


The additional delay set by ipg0 helps to reduce collisions. Systems that have lance_mode enabled might not have enough time on the network. If lance_mode is disabled, the value of ipg0 is ignored and no additional delay is set; only the delays set by ipg1 and ipg2 are used. Disable lance_mode if other systems keep sending a large number of back-to-back packets.

You can add the additional delay by setting the ipg0 parameter, which is the nibble time delay, from 0 to 31. Note that nibble time is the time it takes to transfer four bits on the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the link speed is 100 Mbit/sec, nibble time is equal to 40 ns. If the link speed is 1000 Mbit/sec, nibble time is equal to 4 ns. For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.
TABLE 5-48   Inter-Packet Gap Parameters

lance_mode (values 0-1)
    0 = lance_mode disabled
    1 = lance_mode enabled (default)

ipg0 (values 0-31)
    Additional IPG before transmitting a packet. Default = 4

ipg1 (values 0-255)
    First inter-packet gap parameter. Default = 8

ipg2 (values 0-255)
    Second inter-packet gap parameter. Default = 8

All of the IPG parameters can be set using ndd or can be hard-coded into the ge.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.


Receive Interrupt Blanking Parameters


The ge device introduces the receive interrupt blanking capability to 1-Gbit/sec ports. TABLE 5-49 lists and describes the parameters.
TABLE 5-49   Receive Interrupt Blanking Parameters

intr_blank_time (values 0-127)
    Interrupt after this number of clock cycles has passed and the packets pending have not reached the number of intr_blank_packets. One clock cycle equals 2048 PCI clock cycles. Note: Because this time is tied to the PCI clock, an adapter plugged into a 66-MHz PCI slot has a blanking time half that of the same setting in a 33-MHz slot. (Default = 6)

intr_blank_packets (values 0-255)
    Interrupt after this number of packets has arrived since the last packet was serviced. A value of zero indicates no packet blanking. (Default = 8)

Note: ge and ce fiber devices do not support 100 Mbit/sec capabilities; they support 1000 Mbit/sec only.
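
To put the clock-cycle unit in perspective (my arithmetic, using the 2048-PCI-clock unit from TABLE 5-49): in a 33-MHz slot one unit is 2048 / 33 MHz, or about 62 microseconds, so the default intr_blank_time of 6 corresponds to roughly 372 microseconds; in a 66-MHz slot the same setting yields about half that, roughly 186 microseconds.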


Local Transceiver Auto-negotiation Capability


The local transceiver auto-negotiation capability parameters are read-only parameters and represent the fixed set of capabilities associated with the current PHY that is in use.

TABLE 5-50   Local Transceiver Auto-negotiation Capability Parameters

autoneg_cap (values 0-1)
    Local interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

1000fdx_cap (values 0-1)
    Local interface is capable of 1000 full-duplex operation.
    0 = Not 1000 Mbit/sec full-duplex capable
    1 = 1000 Mbit/sec full-duplex capable

1000hdx_cap (values 0-1)
    Local interface is capable of 1000 half-duplex operation.
    0 = Not 1000 Mbit/sec half-duplex capable
    1 = 1000 Mbit/sec half-duplex capable

100T4_cap (values 0-1)
    Local interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

100fdx_cap (values 0-1)
    Local interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

100hdx_cap (values 0-1)
    Local interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

10fdx_cap (values 0-1)
    Local interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

10hdx_cap (values 0-1)
    Local interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable


Link Partner Capability


The link partner capability parameters are read-only parameters and represent the fixed set of capabilities associated with the attached link partner's set of advertised auto-negotiation parameters. These parameters are only meaningful when auto-negotiation is enabled and can be used in conjunction with the operational mode parameters to establish why there might be problems bringing up the link.

TABLE 5-51   Link Partner Capability Parameters

lp_autoneg_cap (values 0-1)
    Link partner interface is capable of auto-negotiation signaling.
    0 = Can only operate in forced mode
    1 = Capable of auto-negotiation

lp_1000fdx_cap (values 0-1)
    Link partner interface is capable of 1000 full-duplex operation.
    0 = Not 1000 Mbit/sec full-duplex capable
    1 = 1000 Mbit/sec full-duplex capable

lp_1000hdx_cap (values 0-1)
    Link partner interface is capable of 1000 half-duplex operation.
    0 = Not 1000 Mbit/sec half-duplex capable
    1 = 1000 Mbit/sec half-duplex capable

lp_100T4_cap (values 0-1)
    Link partner interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

lp_100fdx_cap (values 0-1)
    Link partner interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

lp_100hdx_cap (values 0-1)
    Link partner interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

lp_10fdx_cap (values 0-1)
    Link partner interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable

lp_10hdx_cap (values 0-1)
    Link partner interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable

Current Physical Layer Status


The current physical layer status parameters give an indication of the state of the link: whether it's up or down, and what speed and duplex it's operating at. These values are derived from the highest-priority speed and duplex capability shared with the link partner when auto-negotiation is enabled, or from the pre-configured settings when Forced mode is used.

TABLE 5-52   Current Physical Layer Status Parameters

transceiver_inuse (values 0-1)
    This parameter indicates which transceiver is currently in use.
    0 = Internal transceiver is in use.
    1 = External transceiver is in use.

link_status (values 0-1)
    Current link status.
    0 = Link down
    1 = Link up

link_speed (values 0-1)
    This parameter provides the link speed and is only valid if the link is up.
    0 = Link speed below 1000 Mbit/sec
    1 = 1000 Mbit/sec

link_mode (values 0-1)
    This parameter provides the link duplex and is only valid if the link is up.
    0 = Half duplex
    1 = Full duplex

Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode, or when the interface being viewed has already been initialized by virtue of open streams such as those created by snoop -d ge0 or ifconfig ge0 plumb inet up. If no such streams exist, the device is uninitialized, and its state is only set up when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes them unreliable unless an existing stream is associated with the instance before checking. A good rule to follow is to trust these parameters only if the interface has been configured up using the ifconfig command.

Performance Tunable Parameters


Gigabit Ethernet pushes systems to their limits, and in some cases it can overwhelm them. Therefore, much analysis has occurred and special system parameters are available that help in tuning the ge card for a particular system or application.


Note that just as the tunables can be used to enhance performance, they can also degrade performance.
TABLE 5-53   Performance Tunable Parameters

ge_intr_mode (values 0-1)
    Enables the ge driver to send packets directly to the upper communication layers rather than queueing.
    0 = Packets are not passed in the interrupt service routine but are placed in a streams service queue and passed to the protocol stack later, when the streams service routine runs. (default)
    1 = Packets are passed directly to the protocol stack in the interrupt context.

ge_dmaburst_mode (values 0-1)
    Enables infinite burst mode for PCI DMA transactions rather than using cache-line-size PCI DMA transfers. This feature is supported only on Sun platforms with the UltraSPARC III CPU.
    0 = Disabled (default)
    1 = Enabled

ge_tx_fastdvma_min (values 59-1500)
    Minimum packet size to use fast dvma interfaces rather than standard dma interfaces. Default = 1024

ge_nos_tmd (values 32-8192)
    Number of transmit descriptors used by the driver. Default = 512

ge_tx_bcopy_max (values 60-256)
    Maximum packet size to use copy of buffer into a premapped dma buffer rather than remapping. Default = 256

ge_nos_txdvma (values 0-8192)
    Number of dvma buffers (for transmit) used in the driver. Default = 256

ge_tx_onemblk (values 1-100)
    Number of fragments that must exist in any one packet before ge_tx_onemblk coalesces them into a fresh mblk. Default = 2

ge_tx_stream_min (values 256-1000)
    For DMA, this parameter determines whether to use DDI_DMA_CONSISTENT or DDI_DMA_STREAMING. If the packet length is less than ge_tx_stream_min, then DDI_DMA_CONSISTENT is used. Default = 512
The ge tunable parameters require that the /etc/system file be modified and the system rebooted to apply the changes. See "Using /etc/system to Tune Parameters" on page 244.

The tuning variables ge_use_rx_dvma and ge_do_fastdvma are of particular interest because they control whether the ge driver uses fast dvma or the regular ddi_dma interface. Currently the setting applied is fast dvma, but with every new operating system release the ddi_dma interface is being improved, and the performance difference between the two interfaces might be eliminated.

The ge_nos_tmd parameter can be used to adjust the size of the transmit descriptor ring. This might be required if the driver is experiencing a large number of notmd events, as this indicates that the arrival rate of packets for the descriptor ring exceeds the rate at which the hardware can transmit. In that case, increasing the descriptor ring size might be a remedy.

The ge_put_cfg parameter, in conjunction with ge_intr_mode, controls the receive packet delivery model. When ge_intr_mode is 1, the interface passes packets to the protocol stack in the interrupt context. When it is set to 0, the delivery model is controlled by ge_put_cfg: when ge_put_cfg is 0, the ge driver provides a special-case software load balancing where there's only one worker thread; when it is 1, the driver uses the regular streams service routine.

The transmit control tunables, ge_tx_bcopy_max, ge_tx_stream_min, and ge_tx_fastdvma_min, define the thresholds for the transmit buffer method. The ge_tx_onemblk parameter controls coalescing of multiple message blocks that make up a single packet into one message block. In many cases where system memory latency is high, it makes sense to avoid individually mapping packet fragments. Instead, you can have the driver create a new buffer, bring all the fragments together, and use only one DMA buffer. This feature is especially useful for HTTP server applications.

The ge_nos_txdvma parameter controls the pool of fast dvma resources associated with a driver. Since fast dvma resources are finite within a system, it is possible for one device to monopolize all of those resources. The tunable is designed to avoid this scenario and to allow the ge driver to allocate a limited number of resources that can be shared at runtime, with instances switching to transmit packets using the dvma interface. A clearer description of this is presented later, based on kstat information feedback.
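
A minimal sketch of applying these tunables, assuming the standard /etc/system set module:variable=value convention; the values are illustrative, not recommendations, and a reboot is required for them to take effect:

    set ge:ge_intr_mode=1
    set ge:ge_nos_tmd=1024
    set ge:ge_tx_onemblk=3

The first line passes received packets to the protocol stack in interrupt context, the second enlarges the transmit descriptor ring to absorb bursts, and the third coalesces three-fragment packets into a single mblk before mapping.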

10/100/1000 ce GigaSwift Gigabit Ethernet


The Sun GigaSwift Ethernet adapter relieves congestion experienced at the backbone and server levels by today's networks, and provides a future upgrade path for high-end workstations that require more bandwidth than fast Ethernet can provide.

The Sun GigaSwift Ethernet MMF adapter is a single-port gigabit Ethernet fiber-optic PCI bus card. It operates in 1000 Mbit/sec Ethernet networks only. The configuration capability of the GigaSwift Ethernet MMF adapter is exactly the same as that of the copper GigaSwift adapter, except that it is unable to negotiate any speeds other than 1000 Mbit/sec. The detailed discussion of the copper GigaSwift adapter covers any configuration details that apply to the MMF interface.

FIGURE 5-31   Sun GigaSwift Ethernet MMF Adapter Connectors

The Sun GigaSwift Ethernet UTP adapter is a single-port gigabit Ethernet copper-based PCI bus card. It can be configured to operate in 10 Mbit/sec, 100 Mbit/sec, or 1000 Mbit/sec Ethernet networks.

FIGURE 5-32   Sun GigaSwift Ethernet UTP Adapter Connectors

There is also a Dual Fast Ethernet/Dual SCSI PCI adapter card that is supported by the GigaSwift Ethernet device driver but is limited to 100BASE-TX capability.


The ce interface employs the hardware checksumming capability described above to reduce the cost of the TCP/IP checksum calculation. The ce interface also provides Layer 2 flow control capability, RED, and Infinite Burst. The physical layer and performance features of ce are configurable using the driver.conf file and ndd command.
TABLE 5-54   Driver Parameters and Status

Parameter           Type             Description
instance            Read and Write   Current device instance in view for ndd
adv-autoneg-cap     Read and Write   Operational mode parameter
adv-1000fdx-cap     Read and Write   Operational mode parameter
adv-1000hdx-cap     Read and Write   Operational mode parameter
adv-100T4-cap       Read and Write   Operational mode parameter
adv-100fdx-cap      Read and Write   Operational mode parameter
adv-100hdx-cap      Read and Write   Operational mode parameter
adv-10fdx-cap       Read and Write   Operational mode parameter
adv-10hdx-cap       Read and Write   Operational mode parameter
adv-asmpause-cap    Read and Write   Flow control parameter
adv-pause-cap       Read and Write   Flow control parameter
master-cfg-enable   Read and Write   Gigabit link clock mastership control
master-cfg-value    Read and Write   Gigabit link clock mastership control
use-int-xcvr        Read and Write   Transceiver control parameter
enable-ipg0         Read and Write   Inter-packet gap parameter
ipg0                Read and Write   Inter-packet gap parameter
ipg1                Read and Write   Inter-packet gap parameter
ipg2                Read and Write   Inter-packet gap parameter
rx-intr-pkts        Read and Write   Receive interrupt blanking parameter
rx-intr-time        Read and Write   Receive interrupt blanking parameter
red-dv4to6k         Read and Write   Random early detection and packet drop vector
red-dv6to8k         Read and Write   Random early detection and packet drop vector
red-dv8to10k        Read and Write   Random early detection and packet drop vector
red-dv10to12k       Read and Write   Random early detection and packet drop vector
tx-dma-weight       Read and Write   PCI interface parameter
rx-dma-weight       Read and Write   PCI interface parameter
infinite-burst      Read and Write   PCI interface parameter
disable-64bit       Read and Write   PCI interface parameter
accept-jumbo        Read and Write   Jumbo frames enable parameter

With the ce driver, any changes applied to the above parameters take effect immediately.

Current Device Instance in View for ndd


The current device instance in view allows you to point ndd to a particular device instance for configuration. This must be applied prior to altering or viewing any of the other parameters or you might not be able to view or alter the correct parameters.
TABLE 5-55   Instance Parameter

instance (values 0-1000)
    Current device instance in view for the rest of the ndd configuration variables

Before viewing or altering any of the other parameters, be sure to check the value of instance to ensure that it is actually pointing to the device you want to configure.


Operational Mode Parameters


The following parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest priority value is taken as the mode of operation. See "Ethernet Physical Layer" on page 152 regarding MII.
TABLE 5-56   Operational Mode Parameters

adv-autoneg-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Forced mode
    1 = Auto-negotiation (default)

adv-1000fdx-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Not 1000 Mbit/sec full-duplex capable
    1 = 1000 Mbit/sec full-duplex capable (default)

adv-1000hdx-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Not 1000 Mbit/sec half-duplex capable
    1 = 1000 Mbit/sec half-duplex capable (default)

adv-100T4-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Not 100-T4 capable (default)
    1 = 100-T4 capable

adv-100fdx-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable (default)

adv-100hdx-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable (default)

adv-10fdx-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable (default)

adv-10hdx-cap (values 0-1)
    Local interface capability is advertised by the hardware.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable (default)


Flow Control Parameters


The ce device is capable of sourcing (transmitting) and terminating (receiving) pause frames conforming to the IEEE 802.3x Frame Based Link Level Flow Control Protocol. In response to received flow control frames, the ce device can slow down its transmit rate. On the other hand, the ce device is capable of sourcing flow control frames, requesting the link partner to slow down, provided that the link partner supports this feature. By default, the driver advertises both transmit and receive pause capability during auto-negotiation.
TABLE 5-57 provides flow control keywords and describes their function.

TABLE 5-57   Read-Write Flow Control Keyword Descriptions

adv-asmpause-cap
    The adapter supports asymmetric pause, which means it can pause only in one direction.
    0 = Off (default)
    1 = On

adv-pause-cap
    This parameter has two meanings, depending on the value of adv-asmpause-cap. (Default = 0)
    If adv-asmpause-cap = 1 while adv-pause-cap = 1, pauses are received.
    If adv-asmpause-cap = 1 while adv-pause-cap = 0, pauses are transmitted.
    If adv-asmpause-cap = 0 while adv-pause-cap = 1, pauses are sent and received.
    If adv-asmpause-cap = 0, then adv-pause-cap determines whether pause capability is on or off.
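
As a hedged sketch of how these keywords combine: to advertise receive-only pause (the local port honors pause frames from the link partner but does not send any), set both keywords to 1, per the semantics above. The instance number is illustrative:

    # ndd -set /dev/ce instance 0
    # ndd -set /dev/ce adv-asmpause-cap 1
    # ndd -set /dev/ce adv-pause-cap 1

Because these are advertised capabilities, the behavior that actually takes effect still depends on what the link partner advertises during auto-negotiation.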

Gigabit Link Clock Mastership Controls


The concept of link clock mastership was introduced with one-gigabit twisted-pair technology. This concept requires one side of the link to be the master that provides the link clock and the other to be the slave that uses the link clock. Once this relationship is established, the link is up and data can be communicated. Two physical layer parameters control whether a side is the master or the slave, or whether mastership is negotiated with the link partner. Those parameters are as follows.
TABLE 5-58   Gigabit Link Clock Mastership Controls

master-cfg-enable
    Determines whether or not the link clock mastership is set up automatically during the auto-negotiation process. If master-cfg-enable is set, mastership is not set up automatically but is dependent on the value of master-cfg-value.

master-cfg-value
    If master-cfg-value is set, the physical layer expects the local device to be the link master; if it is not set, the physical layer expects the link partner to be the master. If auto-negotiation is not enabled, the value of master-cfg-enable is ignored and the value of master-cfg-value determines the link clock mastership.
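
For example, a hedged sketch of forcing the local port to be the link clock master instead of letting auto-negotiation decide (instance and values illustrative):

    # ndd -set /dev/ce instance 0
    # ndd -set /dev/ce master-cfg-enable 1
    # ndd -set /dev/ce master-cfg-value 1

The link partner must then be configured, or must negotiate, to take the slave role, or the link will not come up.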

Transceiver Control Parameter


The ce driver is capable of having an external MII physical layer device connected, but there's no implemented hardware to allow this feature to be utilized. The use-int-xcvr parameter should never be altered.

Inter-Packet Gap Parameters


The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG is the sum of ipg1 and ipg2, plus an optional ipg0 that is only present when the enable-ipg0 parameter is set. The total default IPG is 9.6 microseconds when the link speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speed is 100 Mbit/sec, the total IPG is 0.96 microseconds, and for 1 Gbit/sec it drops down to 0.096 microseconds. The additional delay set by ipg0 helps to reduce collisions. Systems that have enable-ipg0 set might not have enough time on the network. If enable-ipg0 is cleared, the value of ipg0 is ignored and no additional delay is set; only the delays set by ipg1 and ipg2 are used. Clear enable-ipg0 if other systems keep sending a large number of back-to-back packets.


You can add the additional delay by setting the ipg0 parameter, which is the nibble time delay, from 0 to 255. Note that nibble time is the time it takes to transfer four bits on the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the link speed is 100 Mbit/sec, nibble time is equal to 40 ns. For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns. If the link speed is 1000 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 4 ns to get 120 ns.
TABLE 5-59   Inter-Packet Gap Parameters

enable-ipg0 (Values: 0-1)
    Enables ipg0.
    0 = ipg0 disabled
    1 = ipg0 enabled
    (Default = 1)

ipg0 (Values: 0-255)
    Additional IPG before transmitting a packet. (Default = 8)

ipg1 (Values: 0-255)
    First inter-packet gap parameter. (Default = 8)

ipg2 (Values: 0-255)
    Second inter-packet gap parameter. (Default = 4)

All of the IPG parameters can be set using ndd or can be hard-coded into the ce.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.
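As a concrete sketch, enabling the extra gap with ndd looks like the following (instance 0 assumed, values illustrative; the ndd names use underscores):

# ndd /dev/ce -set instance 0
# ndd /dev/ce -set enable_ipg0 1
# ndd /dev/ce -set ipg0 16

At 100 Mbit/sec this adds 16 nibble times, or 16 * 40 ns = 640 ns, of additional gap before each transmitted packet.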

Receive Interrupt Blanking Parameters


The ce device introduces the receive interrupt blanking capability to 1-Gbit/sec ports. TABLE 5-60 describes the receive interrupt blanking values.
TABLE 5-60   Receive Interrupt Blanking Parameters

rx-intr-pkts (Values: 0 to 511)
    Interrupt after this number of packets have arrived since the last
    packet was serviced. A value of zero indicates no packet blanking.
    (Default = 8)

rx-intr-time (Values: 0 to 524287)
    Interrupt after this number of 4.5-microsecond ticks have elapsed since
    the last packet was serviced. A value of zero indicates no time
    blanking. (Default = 3)
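For example, to trade a little latency for fewer interrupts under load, the blanking thresholds can be raised with ndd (a sketch with illustrative values, instance 0 assumed):

# ndd /dev/ce -set instance 0
# ndd /dev/ce -set rx_intr_pkts 32
# ndd /dev/ce -set rx_intr_time 30

With these values, the device interrupts after 32 received packets, or after 30 ticks of 4.5 microseconds (135 microseconds) if fewer packets arrive.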


Random Early Drop Parameters


TABLE 5-61 describes the Rx random early detection 8-bit vectors, which allow you to enable random early drop (RED) thresholds. When received packets reach the RED range, packets are dropped according to the preset probability. The probability should increase as the FIFO level increases. Control packets are never dropped and are not counted in the statistics.
TABLE 5-61   Rx Random Early Detecting 8-Bit Vectors

red-dv4to6k (Values: 0 to 255)
    Random early detection and packet drop vectors when the FIFO threshold
    is greater than 4096 bytes and less than 6144 bytes. Probability of drop
    can be programmed on a 12.5 percent granularity. For example, if bit 0
    is set, the first packet out of every eight will be dropped in this
    region. (Default = 0)

red-dv6to8k (Values: 0 to 255)
    Random early detection and packet drop vectors when the FIFO threshold
    is greater than 6144 bytes and less than 8192 bytes. Probability of drop
    can be programmed on a 12.5 percent granularity. For example, if bit 0
    is set, the first packet out of every eight will be dropped in this
    region. (Default = 0)

red-dv8to10k (Values: 0 to 255)
    Random early detection and packet drop vectors when the FIFO threshold
    is greater than 8192 bytes and less than 10,240 bytes. Probability of
    drop can be programmed on a 12.5 percent granularity. For example, if
    bits 1 and 6 are set, the second and seventh packets out of every eight
    will be dropped in this region. (Default = 0)

red-dv10to12k (Values: 0 to 255)
    Random early detection and packet drop vectors when the FIFO threshold
    is greater than 10,240 bytes and less than 12,288 bytes. Probability of
    drop can be programmed on a 12.5 percent granularity. If bits 2, 4, and
    6 are set, then the third, fifth, and seventh packets out of every eight
    will be dropped in this region. (Default = 0)
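Because each set bit adds 12.5 percent of drop probability, a vector value is simply the bit pattern expressed in decimal. A sketch using the table's own example of bits 1 and 6 (2^1 + 2^6 = 66), applied to the 8192-10,240 byte region of an assumed instance 0:

# ndd /dev/ce -set instance 0
# ndd /dev/ce -set red_dv8to10k 66

This drops two packets out of every eight (25 percent) while the Rx FIFO level stays in that range.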


PCI Bus Interface Parameters


These parameters allow you to modify PCI interface features to gain better PCI performance for a given application.
TABLE 5-62   PCI Bus Interface Parameters

tx-dma-weight (Values: 0-3)
    Determines the multiplication factor for granting credit to the Tx side
    during a weighted round-robin arbitration. Zero means no extra
    weighting; the other values are powers of 2 of extra weighting on that
    traffic. For example, if tx-dma-weight = 0 and rx-dma-weight = 3, then
    as long as Rx traffic is continuously arriving, its priority to access
    the PCI bus will be eight times greater than that of Tx. (Default = 0)

rx-dma-weight (Values: 0-3)
    Determines the multiplication factor for granting credit to the Rx side
    during a weighted round-robin arbitration. (Default = 0)

infinite-burst (Values: 0-1)
    Allows the infinite burst capability to be utilized. When this is in
    effect and the system supports infinite burst, the adapter will not
    free the bus until complete packets are transferred across the bus.
    (Default = 0)

disable-64bit (Values: 0-1)
    Switches off the 64-bit capability of the adapter. In some cases it is
    useful to switch off this feature. (Default = 0, which enables 64-bit
    capability)

Jumbo Frames Enable Parameter


This new feature, only recently added to the GigaSwift driver, allows the ce device to communicate with larger MTU frames.
TABLE 5-63   Jumbo Frames Enable Parameter

accept-jumbo (Values: 0-1)
    0 = Jumbo frames are disabled
    1 = Jumbo frames are enabled
    (Default = 0)

Once jumbo frames capability is enabled, the MTU can be controlled using ifconfig. The MTU can be raised to 9000 or reduced to the regular 1500-byte frames.
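A minimal sketch of the whole sequence, assuming instance 0 and that the driver exposes the parameter under the accept-jumbo name shown in TABLE 5-63:

# ndd /dev/ce -set instance 0
# ndd /dev/ce -set accept-jumbo 1
# ifconfig ce0 mtu 9000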


Performance Tunables
GigaSwift Ethernet pushes systems even further than ge did. Many lessons were learned from ge, leading to a collection of special system tunables that assist in tuning the ce card for a specific system or application. Note that just as the tunables can be used to enhance performance, they can also degrade performance. Handle with great care.
TABLE 5-64   Performance Tunable Parameters

ce_taskq_disable (Values: 0-1)
    Disables the use of task queues and forces all packets to go up to
    Layer 3 in the interrupt context. The default depends on whether the
    number of CPUs in the system exceeds ce_cpu_threshold.

ce_inst_taskqs (Values: 0-64)
    Controls the number of taskqs set up per ce device instance. This value
    is meaningful only if ce_taskq_disable is false. Any value less than 64
    is meaningful. (Default = 4)

ce_srv_fifo_depth (Values: 30-100000)
    The size of the service FIFO, in number of elements. This variable can
    be any integer value. (Default = 2048)

ce_cpu_threshold (Values: 1-1000)
    The threshold for the number of CPUs required in the system and online
    before the taskqs are utilized to receive packets. (Default = 4)

ce_start_cfg (Values: 0-1)
    An enumerated type that can have a value of 0 or 1.
    0 = Transmit algorithm does not do serialization.
    1 = Transmit algorithm does serialization.
    (Default = 0)

ce_put_cfg (Values: 0-2)
    An enumerated type that can have a value of 0, 1, or 2.
    0 = Receive processing occurs in the interrupt context.
    1 = Receive processing occurs in the worker threads.
    2 = Receive processing occurs in the streams service queues routine.
    (Default = 0)


TABLE 5-64   Performance Tunable Parameters (Continued)

ce_reclaim_pending (Values: 1-4094)
    The threshold at which reclaims start happening. Currently 32 for both
    the ge and ce drivers. Keep it less than ce_tx_ring_size/3.
    (Default = 32)

ce_ring_size (Values: 32-8216)
    The size of the Rx buffer ring, a ring of buffer descriptors for Rx.
    One buffer = 8K. This value must be Modulo 2, and its maximum value is
    8K. (Default = 256)

ce_comp_ring_size (Values: 0-8216)
    The size of each Rx completion descriptor ring. It also is Modulo 2.
    (Default = 2048)

ce_tx_ring_size (Values: 0-8216)
    The size of each Tx descriptor ring. It also is Modulo 2.
    (Default = 2048)

ce_tx_ring_mask (Values: 0-3)
    A mask to control which Tx rings are used. (Default = 3)

ce_no_tx_lb (Values: 0-1)
    Disables the Tx load balancing and forces all transmission to be posted
    to a single descriptor ring.
    0 = Tx load balancing is enabled.
    1 = Tx load balancing is disabled.
    (Default = 1)

ce_bcopy_thresh (Values: 0-8216)
    The mblk size threshold used to decide when to copy an mblk into a
    pre-mapped buffer, as opposed to using DMA or other methods.
    (Default = 256)


TABLE 5-64   Performance Tunable Parameters (Continued)

ce_dvma_thresh (Values: 0-8216)
    The mblk size threshold used to decide when to use the fast path DVMA
    interface to transmit an mblk. (Default = 1024)

ce_dma_stream_thresh (Values: 0-8216)
    This global variable splits the ddi_dma mapping method further by
    providing Consistent mapping and Streaming mapping. In the Tx
    direction, Streaming is better for larger transmissions than Consistent
    mapping. If the mblk size falls in the range greater than 256 bytes but
    less than 1024 bytes, the mblk fragment is transmitted using ddi_dma
    methods. (Default = 512)

ce_max_rx_pkts (Values: 32-1000000)
    The number of receive packets that can be processed in one interrupt
    before it must exit. (Default = 512)

To be used successfully, the performance tunables require an understanding of some key kernel statistics from the ce driver. There might also be an opportunity to use the RED features and interrupt blanking, both configurable using ndd commands. A clearer description of this is presented later, based on kstat information feedback.
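Unlike the ndd parameters, these tunables are global variables in the ce driver, so they are set in the /etc/system file, described in Using /etc/system to Tune Parameters later in this chapter, and take effect after a reboot. A minimal sketch with illustrative values only:

* /etc/system entries for the ce driver (example values, not recommendations)
set ce:ce_taskq_disable = 1
set ce:ce_srv_fifo_depth = 4096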

10/100/1000 bge Broadcom BCM 5704 Gigabit Ethernet


The bge interface is another Ethernet system interface, used on the UltraSPARC III rack-mounted Sun Fire V210 and Sun Fire V240 server systems. This interface is much like the others in that its architecture supports a single Tx and a single Rx descriptor ring.


The physical layer of bge is fully configurable using the bge.conf file and ndd commands.
TABLE 5-65   Driver Parameters and Status

Parameter           Type            Description
adv_autoneg_cap     Read and Write  Operational mode parameter
adv_1000fdx_cap     Read and Write  Operational mode parameter
adv_1000hdx_cap     Read and Write  Operational mode parameter
adv_100T4_cap       Read and Write  Operational mode parameter
adv_100fdx_cap      Read and Write  Operational mode parameter
adv_100hdx_cap      Read and Write  Operational mode parameter
adv_10fdx_cap       Read and Write  Operational mode parameter
adv_10hdx_cap       Read and Write  Operational mode parameter
adv_asm_pause_cap   Read and Write  Operational mode parameter
adv_pause_cap       Read and Write  Operational mode parameter
autoneg_cap         Read only       Local transceiver auto-negotiation capability
100T4_cap           Read only       Local transceiver auto-negotiation capability
100fdx_cap          Read only       Local transceiver auto-negotiation capability
100hdx_cap          Read only       Local transceiver auto-negotiation capability
10fdx_cap           Read only       Local transceiver auto-negotiation capability
10hdx_cap           Read only       Local transceiver auto-negotiation capability
asm_pause_cap       Read and Write  Local transceiver auto-negotiation capability
pause_cap           Read and Write  Local transceiver auto-negotiation capability
lp_autoneg_cap      Read only       Link partner capability
lp_100T4_cap        Read only       Link partner capability
lp_100fdx_cap       Read only       Link partner capability
lp_100hdx_cap       Read only       Link partner capability
lp_10fdx_cap        Read only       Link partner capability
lp_10hdx_cap        Read only       Link partner capability
lp_asm_pause_cap    Read only       Link partner capability
lp_pause_cap        Read only       Link partner capability


TABLE 5-65   Driver Parameters and Status (Continued)

Parameter     Type       Description
link_status   Read only  Current physical layer status
link_speed    Read only  Current physical layer status
link_mode     Read only  Current physical layer status

Operational Mode Parameters


The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest-priority value is taken as the mode of operation. See Ethernet Physical Layer on page 152 regarding MII.
TABLE 5-66   Operational Mode Parameters

adv_autoneg_cap (Values: 0-1)
    Local interface capability of auto-negotiation signaling advertised by
    the hardware.
    0 = Forced mode
    1 = Auto-negotiation
    Default is set to the autoneg_cap parameter.

adv_100T4_cap (Values: 0-1)
    Local interface capability of 100-T4 advertised by the hardware.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable
    Default is set to the 100T4_cap parameter.

adv_100fdx_cap (Values: 0-1)
    Local interface capability of 100 full duplex advertised by the
    hardware.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable
    Default is set based on the 100fdx_cap parameter.

adv_100hdx_cap (Values: 0-1)
    Local interface capability of 100 half duplex advertised by the
    hardware.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable
    Default is set based on the 100hdx_cap parameter.

adv_10fdx_cap (Values: 0-1)
    Local interface capability of 10 full duplex advertised by the
    hardware.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable
    Default is set based on the 10fdx_cap parameter.


TABLE 5-66   Operational Mode Parameters (Continued)

adv_10hdx_cap (Values: 0-1)
    Local interface capability of 10 half duplex advertised by the
    hardware.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable
    Default is set based on the 10hdx_cap parameter.

adv_asm_pause_cap (Values: 0-1)
    The adapter supports asymmetric pause, which means it can pause in only
    one direction.
    0 = Off
    1 = On
    (Default = 1)

adv_pause_cap (Values: 0-1)
    This parameter has two meanings, depending on the value of
    adv_asm_pause_cap. (Default = 0)
    If adv_asm_pause_cap = 1 while adv_pause_cap = 1, pauses are received
    and transmit is limited.
    If adv_asm_pause_cap = 1 while adv_pause_cap = 0, pauses are
    transmitted.
    If adv_asm_pause_cap = 0 while adv_pause_cap = 1, pauses are sent and
    received.
    If adv_asm_pause_cap = 0, adv_pause_cap determines whether pause
    capability is on or off.

If you are using ndd with this device to alter any of the parameters from adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.
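A sketch of that toggle sequence on a bge0 interface, restricting the advertisement to 100 Mbit/sec full duplex (values are illustrative; any other adv_* capabilities would be cleared the same way):

# ndd /dev/bge0 -set adv_100fdx_cap 1
# ndd /dev/bge0 -set adv_100hdx_cap 0
# ndd /dev/bge0 -set adv_autoneg_cap 0
# ndd /dev/bge0 -set adv_autoneg_cap 1

Toggling adv_autoneg_cap away from and back to its current value forces the driver to reload the advertised capabilities into the hardware and renegotiate the link.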


Local Transceiver Auto-negotiation Capability


The local transceiver auto-negotiation capability parameters are read-only parameters and represent the fixed set of capabilities associated with the PHY that is currently in use.
TABLE 5-67   Local Transceiver Auto-negotiation Capability Parameters

autoneg_cap (Values: 0-1)
    Local interface is capable of auto-negotiation signaling.
    0 = Can only operate in Forced mode
    1 = Capable of auto-negotiation

100T4_cap (Values: 0-1)
    Local interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

100fdx_cap (Values: 0-1)
    Local interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

100hdx_cap (Values: 0-1)
    Local interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

10fdx_cap (Values: 0-1)
    Local interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable


TABLE 5-67   Local Transceiver Auto-negotiation Capability Parameters (Continued)

10hdx_cap (Values: 0-1)
    Local interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable

asm_pause_cap (Values: 0-1)
    The adapter supports asymmetric pause, which means it can pause in only
    one direction.
    0 = Off
    1 = On
    (Default = 1)

pause_cap (Values: 0-1)
    This parameter has two meanings depending on the value of
    asm_pause_cap. (Default = 0)
    If asm_pause_cap = 1 while pause_cap = 1, pauses are received and
    transmit is limited.
    If asm_pause_cap = 1 while pause_cap = 0, pauses are transmitted.
    If asm_pause_cap = 0 while pause_cap = 1, pauses are sent and received.
    If asm_pause_cap = 0, pause_cap determines whether pause capability is
    on or off.


Link Partner Capability


The link partner capability parameters are read-only parameters representing the fixed set of capabilities associated with the attached link partner's advertised auto-negotiation parameters. These parameters are meaningful only when auto-negotiation is enabled, and they can be used in conjunction with the operational mode parameters to establish why there might be problems bringing up the link.
TABLE 5-68   Link Partner Capability Parameters

lp_autoneg_cap (Values: 0-1)
    Link partner interface is capable of auto-negotiation signaling.
    0 = Can only operate in Forced mode
    1 = Capable of auto-negotiation

lp_100T4_cap (Values: 0-1)
    Link partner interface is capable of 100-T4 operation.
    0 = Not 100 Mbit/sec T4 capable
    1 = 100 Mbit/sec T4 capable

lp_100fdx_cap (Values: 0-1)
    Link partner interface is capable of 100 full-duplex operation.
    0 = Not 100 Mbit/sec full-duplex capable
    1 = 100 Mbit/sec full-duplex capable

lp_100hdx_cap (Values: 0-1)
    Link partner interface is capable of 100 half-duplex operation.
    0 = Not 100 Mbit/sec half-duplex capable
    1 = 100 Mbit/sec half-duplex capable

lp_10fdx_cap (Values: 0-1)
    Link partner interface is capable of 10 full-duplex operation.
    0 = Not 10 Mbit/sec full-duplex capable
    1 = 10 Mbit/sec full-duplex capable


TABLE 5-68   Link Partner Capability Parameters (Continued)

lp_10hdx_cap (Values: 0-1)
    Link partner interface is capable of 10 half-duplex operation.
    0 = Not 10 Mbit/sec half-duplex capable
    1 = 10 Mbit/sec half-duplex capable

lp_asm_pause_cap (Values: 0-1)
    The link partner supports asymmetric pause, which means it can pause in
    only one direction.
    0 = Off
    1 = On
    (Default = 1)

lp_pause_cap (Values: 0-1)
    This parameter has two meanings depending on the value of
    lp_asm_pause_cap. (Default = 0)
    If lp_asm_pause_cap = 1 while lp_pause_cap = 1, pauses are received and
    transmit is limited.
    If lp_asm_pause_cap = 1 while lp_pause_cap = 0, pauses are transmitted.
    If lp_asm_pause_cap = 0 while lp_pause_cap = 1, pauses are sent and
    received.
    If lp_asm_pause_cap = 0, then lp_pause_cap determines whether pause
    capability is on or off.


Current Physical Layer Status


The current physical layer status gives an indication of the state of the link: whether it is up or down, and at what speed and duplex it is operating. These parameters are derived from the result of establishing the highest-priority shared speed and duplex capability when auto-negotiation is enabled, or they can be pre-configured with Forced mode.
TABLE 5-69   Current Physical Layer Status Parameters

link_status (Values: 0-1)
    Current link status.
    0 = Link down
    1 = Link up

link_speed
    This parameter provides the link speed, in Mbit/sec, and is valid only
    if the link is up.

link_mode (Values: 0-1)
    This parameter provides the link duplex and is valid only if the link
    is up.
    0 = Half duplex
    1 = Full duplex

Sun VLAN Technology


VLANs allow you to split your physical LAN into logical subparts, providing an essential tool for increasing the efficiency and flexibility of your network. VLANs are commonly used to separate groups of network users into manageable broadcast domains, to create logical segmentation of workgroups, and to enforce security policies within each logical segment. Each defined VLAN behaves as its own separate network, with its traffic and broadcasts isolated from the others, increasing the bandwidth efficiency within each logical group. VLAN technology is also useful for containing jumbo frames. If a VLAN is configured with the ability to use jumbo frames, then the fact that the jumbo frame configuration is part of a VLAN ensures that the jumbo frames never leave the VLAN network.


Although VLANs are commonly used to create individual broadcast domains and/or separate IP subnets, it is sometimes useful for a server to have a presence on more than one VLAN simultaneously. Several Sun products support multiple VLANs on a per-port or per-interface basis, allowing very flexible network configurations.
FIGURE 5-33 shows an example network that uses VLANs.

FIGURE 5-33   Example of Servers Supporting Multiple VLANs with Tagging Adapters

[The figure shows a shared media segment carrying VLANs 1, 2, and 3: a main server with a tagged gigabit adapter on all VLANs, an accounting server on VLAN 3, software PCs 1 and 2 on VLAN 2, engineering PC 3 on VLAN 1, accounting PC 4 on VLAN 3, and engineering/software PC 5 with a tagged gigabit adapter on VLANs 1 and 2.]


VLAN Configuration
VLANs can be created according to various criteria, but each VLAN must be assigned a VLAN tag or VLAN ID (VID). The VID is a 12-bit identifier between 1 and 4094 that identifies a unique VLAN. For each network interface (ce0, ce1, ce2, and so on), 4094 possible VLAN IDs can be selected over an individual ce instance. Once the VLAN tag is chosen, a VLAN can be configured on a subnet using a ce interface with the ifconfig command. The VLAN tag is multiplied by 1000 and the instance number of the device, also the device Primary Point of Attachment (PPA), is added to give a VLAN PPA. For a VLAN with VID 123 that needs to be configured over ce0, the new VLAN PPA would be 123000. With this new PPA you can proceed to configure the ce interface within the VLAN.
# ifconfig ce123000 plumb ip_address up

You can also set up a configuration that is persistent through a reboot by creating an /etc/hostname.ce123000 file containing the host name for the VLAN interface.

In summary, the VLAN PPA is calculated using the simple formula: VLAN PPA = VID * 1000 + Device PPA
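As another worked example of the formula, VID 2 over device instance ce1 gives VLAN PPA 2 * 1000 + 1 = 2001, so the interface would be configured as follows (ip_address is a placeholder, as elsewhere in this chapter):

# ifconfig ce2001 plumb ip_address up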

Note - Only GigaSwift NICs using the ce driver and the Solaris 8 VLAN packages have VLAN tagging capabilities. Other NICs do not.

Sun Trunking Technology


Sun Trunking software provides the ability to aggregate multiple links between a pair of devices so that they work in parallel as if they were a single link. Once aggregated, these point-to-point links operate as a single highly available fat pipe, providing increased network bandwidth as well as high availability. For a given link level connection, trunking enables you to add bandwidth up to the maximum number of network interface links supported.


Note - Sun Trunking is not included with the Solaris operating system. It is an unbundled software product.

Sun Trunking provides trunking support for the following network interface cards:

- Sun Quad FastEthernet adapter, qfe
- Sun GigabitEthernet adapter, ge
- Sun GigaSwift Ethernet UTP or MMF adapter, ce
- Sun Dual FastEthernet and Dual SCSI/P adapter, ce
- Sun Quad GigaSwift Ethernet adapter, ce

The key to enabling the trunking capability is the nettr command, which can be used to trunk devices of the same technology together. Once the devices are trunked, a trunk head interface is established, and that interface is used by ifconfig to complete the configuration. For example, if two qfe instances (qfe0 and qfe1) are trunked, the trunk head is assigned to qfe0 once the nettr command completes. You can then use ifconfig on qfe0 to make the trunk operate under the TCP/IP protocol stack.

Trunking Configuration
The nettr(1M) utility is used to configure trunking. nettr(1M) can be used to:

- Set up a trunk
- Release a trunk
- Display a trunk configuration
- Display statistics of trunked interfaces

Following is the command syntax for nettr for setting up a trunk or modifying the configuration of the trunk members. The items in the square brackets are optional.

nettr -setup head-instance device=<qfe | ce | ge> \
    members=<instance,instance,...> [ policy=<number> ]
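As a sketch of the qfe0/qfe1 example mentioned earlier, and assuming that the member list includes the head instance and that the default MAC policy is acceptable (ip_address is a placeholder):

# nettr -setup 0 device=qfe members=0,1
# ifconfig qfe0 plumb ip_address up

Once the trunk is set up, qfe0 is the trunk head, and configuring it with ifconfig brings the whole aggregate under the TCP/IP stack.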


Trunking Policies

MAC

- The default policy used by the Sun Trunking software. MAC is the preferred policy to use with switches. Most trunking-capable switches require use of the MAC hashing policy, but check your switch documentation.
- Uses the last three bits of the MAC address of both the source and destination. For two ports, the MAC addresses of the source and destination are first XORed: Result = 00, 01, which selects the port.
- Favors a large population of clients. For example, using MAC ensures that 50 percent of the client connections will go through one of two ports in a two-port trunk.

Round-Robin

- The preferred policy with a back-to-back connection used between the output of a transmitting device and the input of an associated receiving device.
- Uses each network interface of the trunk in turn as a method of distributing packets over the assigned number of trunking interfaces.
- Can have an impact on performance because the temporal ordering of packets is not observed.

IP Destination Address

- Uses the four bytes of the IP destination address to determine the transmission path. If a trunking interface host has one IP source address and it is necessary to communicate with multiple IP clients connected to the same router, the IP Destination Address policy is the preferred policy to use.

IP Source Address/IP Destination Address

- Connects the source server to the destination based on where the connection originated or terminated.
- Uses the four bytes of the source and destination IP addresses to determine the transmission path.
- The primary use of the IP Source Address/IP Destination Address policy occurs where you use the IP virtual address feature to give multiple IP addresses to a single physical interface.


For example, you might have a cluster of servers providing network services in which each service is associated with a virtual IP address over a given interface. If a service associated with an interface fails, the virtual IP address migrates to a physical interface on a different machine in the cluster. In such an arrangement, the IP Source Address/IP Destination Address policy gives you a greater chance of using more different links within the trunk than the IP Destination Address policy would.

Network Configuration
This section describes how to edit the network host files after any of the Sun adapters have been installed on your system. The section contains the following topics:

- Configuring the System to Use the Embedded MAC Address on page 233
- Configuring the Network Host Files on page 234
- Setting Up a GigaSwift Ethernet Network on a Diskless Client System on page 235
- Installing the Solaris Operating System Over a Network on page 236

Configuring the System to Use the Embedded MAC Address


All Sun networking adapters have a MAC address embedded in their PROM, associated with each port available on the adapter. To use the adapter's embedded MAC address instead of the MAC address in the system's IDPROM, set the local-mac-address? OBP property to true. You must reboot your system for this change to become active. As a rule, this property should be set so that the adapters operate effectively with Solaris software.

As superuser, set the local-mac-address? OBP property to true:

# eeprom local-mac-address\?=true

Alternatively, the property can be set at the OpenBoot PROM level:

ok setenv local-mac-address? true


Configuring the Network Host Files


After installing the driver software, you must create a hostname.ceinstance file for the network interface. You must also create both an IP address and a host name for that interface in the /etc/hosts file. For the remainder of this discussion, the ce interface is used as the example interface.

To Configure the Network Host Files


1. At the command line, use the grep command to search the /etc/path_to_inst file for ce interfaces.

# grep ce /etc/path_to_inst
"/pci@1f,4000/network@4" 0 "ce"

In the previous example, the device instance, 0, is from a Sun GigaSwift Ethernet adapter installed in slot 1. Be sure to write down your device path and instance, which in this example is /pci@1f,4000/network@4 0. While your device path and instance might be different, they will be similar. You will need this information to make changes to the ce.conf file. See Setting Network Driver Parameters Using the ndd Utility on page 238.

2. Use the ifconfig command to set up the adapter's ce interface.

3. Use the ifconfig command to assign an IP address to the network interface. Type the following at the command line, replacing ip_address with the adapter's IP address:
# ifconfig ce0 plumb ip_address up

Refer to the ifconfig(1M) man page and the Solaris documentation for more information. If you want a setup that remains the same after you reboot, create an /etc/hostname.ceinstance file, where instance corresponds to the instance number of the ce interface you plan to use. To use the ce interface from the Step 1 example, create an /etc/hostname.ce0 file, where 0 is the instance number of the ce interface. If the instance number were 1, the filename would be /etc/hostname.ce1.
Do not create an /etc/hostname.ceinstance file for a Sun GigaSwift Ethernet adapter interface you plan to leave unused.

- The /etc/hostname.ceinstance file must contain the host name for the appropriate ce interface.
- The host name should have an IP address and should be listed in the /etc/hosts file.
- The host name should be different from the host name of any other interface; for example, /etc/hostname.ce0 and /etc/hostname.ce1 cannot share the same host name.

The following example shows the /etc/hostname.ceinstance file required for a system called zardoz that has a Sun GigaSwift Ethernet adapter (zardoz-11).

# cat /etc/hostname.hme0
zardoz
# cat /etc/hostname.ce0
zardoz-11

4. Create an appropriate entry in the /etc/hosts file for each active ce interface. For example:

# cat /etc/hosts
#
# Internet host table
#
127.0.0.1      localhost
129.144.10.57  zardoz loghost
129.144.11.83  zardoz-11

Setting Up a GigaSwift Ethernet Network on a Diskless Client System


Before you can boot and operate a diskless client system across a network, you must first install the network device driver software packages into the root directory of the diskless client.

To Set Up a Network Port on a Diskless Client


1. Locate the root directory of the diskless client on the host server. The root directory of the diskless client system is commonly installed in the host servers /export/root/client_name directory, where client_name is the diskless clients host name. In this procedure, the root directory will be:
/export/root/client_name


2. Use the pkgadd -R command to install the network device driver software packages to the diskless client's root directory on the server.
# pkgadd -R root_directory/Solaris_2.7/Tools/Boot -d . SUNWced

3. Create a hostname.ceinstance file in the diskless client's root directory. You will need to create an /export/root/client_name/etc/hostname.ceinstance file for the network interface. See Configuring the Network Host Files on page 234 for instructions.

4. Edit the hosts file in the diskless client's root directory. You will need to edit the /export/root/client_name/etc/hosts file to include the IP address of the network interface. See Configuring the Network Host Files on page 234 for instructions.

- Be sure to set the MAC address on the server side and rebuild the device tree if you want to boot from the GigaSwift Ethernet port.

5. To boot the diskless client from the network interface port, type the following boot command:

ok boot path-to-device:link-param, -v

Installing the Solaris Operating System Over a Network


The following procedure assumes that you have created an install server, which contains the image of the Solaris CD, and that you have set up the client system to be installed over the network. Before you can install the Solaris operating system on a client system with a given network interface, you must first add the driver software packages to the install server. These software packages are generally available on the driver installation CD.


To Install the Solaris Software Over a GigaSwift Ethernet Network

1. Prepare the install server and client system to install the Solaris operating system over the network.

2. Find the root directory of the client system. The client system's root directory can be found in the install server's /etc/bootparams file. Use the grep command to search this file for the root directory.

# grep client_name /etc/bootparams
client_name root=server_name:/netinstall/Solaris_2.7/Tools/Boot
install=server_name:/netinstall boottype=:in rootopts=:rsize=32768

In the previous example, the root directory for the Solaris 7 client is /netinstall. In Step 3, you would replace root_directory with /netinstall.

3. Use the pkgadd -R command to install the network device driver software packages to the client's root directory on the server.
# pkgadd -R root_directory/Solaris_2.7/Tools/Boot -d . SUNWced

4. Shut down and halt the client system.

5. At the ok prompt, boot the client system using the full device path of the network device.

6. Proceed with the Solaris operating system installation.

7. After installing the Solaris operating system, install the network interface software on the client system. This step is required because the software installed in Step 3 was used only to boot the client system over the network interface. Often, network interface cards are not a bundled option with Solaris; therefore, after installation is complete, you will need to install the software in order for the operating system to use the client's network interfaces in normal operation.

8. Confirm that the network host files have been configured correctly during the Solaris installation. Although the Solaris software installation creates the client's network configuration files, you might need to edit these files to match your specific networking environment. See Configuring the Network Host Files on page 234 for more information about editing these files.


Configuring Driver Parameters


This section describes how to configure the driver parameters used by the networking adapter. This section contains the following topics:

- Setting Network Driver Parameters Using the ndd Utility
- Reboot Persistence Using driver.conf

Setting Network Driver Parameters Using the ndd Utility


Many of the network drivers allow you to configure device driver parameters dynamically while the system is running, using the ndd utility. Parameters configured with ndd remain valid only until you reboot the system; hence the need for reboot persistence, which is provided by driver.conf. The following sections describe how you can use the ndd utility to modify (with the -set option) or display (with the -get option) the parameters for a network driver and individual devices.

To Specify Device Instances for the ndd Utility


There are two ways to specify the ndd command line, based on the style of the networking driver. The style can be established by looking at the /dev directory for the driver node.

- Style 1 drivers have a /dev/nameinstance symbolic link for each physical network device instance.
- Style 2 drivers have a single /dev/name symbolic link, and the device instance must be selected through ndd itself.

Once the style is established, the way you use the ndd command has to be adjusted, because the way of getting exclusive access to the device instance with ndd differs between the styles.

1. Determine the style of driver you're using.


a. If there exists a Style 1 node /dev/nameinstance, then you can use the Style 1 command form.

# ndd /dev/bge0 -get adv_autoneg_cap
1
# ndd /dev/bge0 -set adv_autoneg_cap 0
# ndd /dev/bge0 -get adv_autoneg_cap
0

b. If there exists a Style 2 node /dev/name, then you cannot use the Style 1 form. You must use the Style 2 form, which requires an initial step: setting the configuration context to the desired instance.

# ndd /dev/hme -set instance 0

2. Once you are pointing to the correct instance, you can alter as many parameters as required for that instance, for example:

# ndd /dev/hme -set adv_autoneg_cap 0

In all networking drivers, the instance number is allocated at the time of enumeration once the adapter is installed. The instance number is recorded permanently in the /etc/path_to_inst file. Take note of the instance numbers in /etc/path_to_inst so you can configure the instance using ndd.
# grep ce /etc/path_to_inst
"/pci@1f,2000/pci@1/network@0" 2 "ce"
"/pci@1f,2000/pci@2/network@0" 1 "ce"
"/pci@1f,2000/pci@4/network@0" 0 "ce"

The instance number is the second field on each line and can be used with both configuration styles. The preceding examples show the ndd utility being used in the non-interactive mode, where only one parameter can be modified per command line. There is also an interactive mode that allows you to enter an ndd shell in which you can read ndd parameters or issue writes to them for a device.


Using the ndd Utility in Non-interactive Mode


In the non-interactive mode, the command line allows you to read (get) the current setting of a parameter or write (set) a new setting to a parameter.

# ndd device_node -get|-set parameter [value]

This mode assumes that you remember all the parameter options of the network interface.

Using the ndd Utility in Interactive Mode


The ndd utility offers an interactive mode that allows you to query the parameters that you can read or write. Enter the interactive mode by pointing ndd at a particular device instance or, in the case of Style 2, a driver node.

# ndd device_node

If ndd is pointed to a device node that can only be a Style 1 device, then ndd is already pointing to a device instance.
# ndd /dev/bge0

If the device node can be a Style 2 device, then ndd is pointing to a driver and not necessarily a device instance. Therefore, you must always first set the instance variable to ensure that ndd is pointing to the right device instance before configuration begins.

# ndd /dev/ce
name to get/set ? instance
value ? 0


A very useful feature of ndd is the ? query, which you can use to get a list of the parameters that a particular driver supports.

# ndd /dev/ce
name to get/set ? ?
?                        (read only)
instance                 (read and write)
adv_autoneg_cap          (read and write)
adv_1000fdx_cap          (read and write)
adv_1000hdx_cap          (read and write)
adv_100T4_cap            (read and write)
adv_100fdx_cap           (read and write)
adv_100hdx_cap           (read and write)
adv_10fdx_cap            (read and write)
adv_10hdx_cap            (read and write)
adv_asmpause_cap         (read and write)
adv_pause_cap            (read and write)
master_cfg_enable        (read and write)
master_cfg_value         (read and write)
use_int_xcvr             (read and write)
enable_ipg0              (read and write)
ipg0                     (read and write)
ipg1                     (read and write)
ipg2                     (read and write)
rx_intr_pkts             (read and write)
rx_intr_time             (read and write)
red_dv4to6k              (read and write)
red_dv6to8k              (read and write)
red_dv8to10k             (read and write)
red_dv10to12k            (read and write)
tx_dma_weight            (read and write)
rx_dma_weight            (read and write)
infinite_burst           (read and write)
disable_64bit            (read and write)
name to get/set ?
#

Once you have set the desired parameters, they will persist until reboot. To have them persist through a reboot, you must set those parameters in the driver.conf file.


Reboot Persistence Using driver.conf


Reboot persistence is often required to avoid repeatedly invoking ndd to adjust device parameters after each reboot. There are two common ways to specify driver parameters in driver.conf:

- Global driver.conf parameters
- Per-instance driver.conf parameters

In both cases, the driver.conf file resides in the same directory as the device driver. For example, ge resides in /kernel/drv. Therefore, the ge.conf file also resides in /kernel/drv. Note that even when a system is booted in 64-bit mode, the driver.conf file is still located in the same directory as the 32-bit driver.

Global driver.conf Parameters


Global driver.conf parameters apply to all instances of the driver and to all devices. Parameters specified alone apply globally. For example, you can disable auto-negotiation for all ce devices in the system using a global property in the ce.conf file:

ce.conf
adv_autoneg_cap = 0;

There are older examples of global driver.conf parameters, a design choice of the driver developer, in which the configuration parameter embodies the device name and instance in the property itself:

trp.conf
trp0_ring_speed = 4;

A more common method is to take advantage of the driver.conf framework to identify a unique device instance.


Per-Instance driver.conf Parameters


This method requires you to identify a unique device instance. It uses three device driver properties associated with each device in the system: the name, the parent, and the unit-address. Once these properties are established, every property-value pair that follows is a parameter familiar to the driver. The following example illustrates disabling auto-negotiation on an hme instance.

hme.conf
name = "hme" parent = "/pci@1f,2000" unit-address = "0" adv_autoneg_cap = 0;

The name is simply the driver name. In the previous example, hme is the name. The parent and unit address are found using the /etc/path_to_inst file. It is assumed that when you write a driver.conf file and apply instance properties, you know the instance to which you are applying the parameters. The instance becomes the key to finding the parent and unit address from the /etc/path_to_inst file.
# grep hme /etc/path_to_inst
"/pci@1f,2000/network@2" 2 "hme"
"/pci@1f,2000/network@1" 1 "hme"
"/pci@1f,2000/network@0" 0 "hme"

In the example above, the instance number being configured is 1. Taking the second line as the line associated with hme1, you can begin to extract the information required to write the driver.conf file. The unit-address and parent are part of the leaf node information, which is the first string in quotes:

"/pci@1f,2000/network@1"    1         "hme"
       (leaf node)      (instance)   (name)

The leaf node can be thought of as a file in a directory structure, so you can address it relative to root or relative to a parent. If it is addressed relative to a parent, then the leaf node is the string to the right of the last /, and the string remaining to the left of that / is the parent:

"/pci@1f,2000/network@1"
(parent = /pci@1f,2000, leaf node = network@1)

Therefore, in the above example the parent is /pci@1f,2000. The unit address is the number or byte sequence to the right of the @ in the remaining leaf node, so the unit address is 1. The resulting driver.conf file to disable auto-negotiation for instance 1 is as follows:

hme.conf
name = "hme" parent = "/pci@1f,2000"
unit-address = "1" adv_autoneg_cap = 0;
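The same recipe applies to other drivers. As a hypothetical sketch, reusing the ce paths from the path_to_inst example earlier in this chapter, a ce.conf entry disabling auto-negotiation only for ce instance 0 might look like this:

ce.conf
# hypothetical per-instance entry; parent and unit-address are taken from
# the "/pci@1f,2000/pci@4/network@0" 0 "ce" line of /etc/path_to_inst
name = "ce" parent = "/pci@1f,2000/pci@4" unit-address = "0" adv_autoneg_cap = 0;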

Using /etc/system to Tune Parameters


Solaris software provides a system-wide file for tuning system parameters. In this section, we will look only at how to set tuning parameters in the NIC adapters that have been described. Exercise great care when using this file because direct access is given to global variables of the drivers. You cannot assume that drivers will compensate for exceeding minimum and maximum values set here. Each parameter is added to the end of the file and is made of the set command followed by the driver module name, a colon (:), and then the parameter and its value. The following example illustrates this.

/etc/system
set ge:ge_dmaburst_mod = 1

In this entry, ge is the driver module name, ge_dmaburst_mod is the parameter, and 1 is the value.

Once this file has been modified, the system must be rebooted for the changes to take effect.


Network Interface Card General Statistics


All the network interface cards described in this chapter export a collection of kernel statistics information. In the previous sections describing individual network interfaces, we described kernel statistics that are unique to each interface. This section describes the kernel statistics common to all interfaces. You can use either the kstat(1M) utility or the netstat(1M) utility to gather statistics about each interface. In many cases, these statistics help establish whether packets are moving properly through the interfaces or whether the interface is in a state that will even allow packets to be communicated properly.
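Both utilities read the same underlying counters. A brief sketch, assuming a ce interface at instance 0: kstat dumps every statistic the driver exports, while netstat -i summarizes per-interface packet and error counts.

# kstat ce:0
# netstat -i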
TABLE 5-70   General Network Interface Statistics

kstat name   Type     Description
ipackets     counter  The number of packets received by the interface
ipackets64   counter  A 64-bit version of ipackets so a larger count can be kept
rbytes       counter  The number of bytes received by the interface
rbytes64     counter  A 64-bit version of rbytes so a larger count can be kept
multircv     counter  The number of multicast packets received by the interface
brdcstrcv    counter  The number of broadcast packets received by the interface
unknowns     counter  The number of packets that are received by an interface
                      but cannot be classified to any Layer 3 or above protocol
                      available in the system
ierrors      counter  The number of receive packet errors that led to a packet
                      being discarded
norcvbuf     counter  The number of receive packets that could not be received
                      because the NIC had no buffers available
opackets     counter  The number of packets transmitted by the interface
opackets64   counter  A 64-bit version of opackets so a larger count can be kept
obytes       counter  The number of bytes transmitted by the interface


TABLE 5-70   General Network Interface Statistics (Continued)

kstat name   Type     Description
obytes64     counter  A 64-bit version of obytes so a larger count can be kept
multixmt     counter  The number of multicast packets transmitted by the
                      interface
brdcstxmt    counter  The number of broadcast packets transmitted by the
                      interface
oerrors      counter  The number of packets that encountered an error on
                      transmission, causing the packets to be dropped
noxmtbuf     counter  The number of transmit packets that were stalled for
                      transmission because the NIC had no buffers available
collisions   counter  The number of collisions encountered while transmitting
                      packets
ifspeed      state    The current speed of the network connection in megabits
                      per second
mac_mtu      state    The current MTU allowed by the driver, including the
                      Ethernet header and 4-byte CRC

Ethernet Media Independent Interface Kernel Statistics


The Ethernet Media Independent Interface (MII) kernel statistics track the state of the link. They are very useful for Ethernet interfaces because they can be used to troubleshoot physical layer problems with a network connection. These statistics cover both MII and GMII. Note that in some cases, with Fast Ethernet or fiber-only devices, some of the statistics might not apply.
TABLE 5-71   General Network Interface Statistics

kstat name        Type   Description
xcvr_addr         state  Provides the MII address of the transceiver currently
                         in use.
xcvr_id           state  Provides the specific Vendor/Device ID of the
                         transceiver currently in use.
xcvr_inuse        state  Indicates the type of transceiver currently in use.
cap_1000fdx       state  Indicates the device is 1 Gbit/sec full-duplex capable.
cap_1000hdx       state  Indicates the device is 1 Gbit/sec half-duplex capable.
cap_100fdx        state  Indicates the device is 100 Mbit/sec full-duplex
                         capable.
cap_100hdx        state  Indicates the device is 100 Mbit/sec half-duplex
                         capable.
cap_10fdx         state  Indicates the device is 10 Mbit/sec full-duplex
                         capable.
cap_10hdx         state  Indicates the device is 10 Mbit/sec half-duplex
                         capable.
cap_asmpause      state  Indicates the device is capable of asymmetric pause
                         Ethernet flow control.
cap_pause         state  Indicates the device is capable of symmetric pause
                         Ethernet flow control when set to 1 and cap_asmpause
                         is 0. If cap_asmpause = 1 while cap_pause = 0,
                         transmit pauses based on receive congestion; if
                         cap_pause = 1, receive pauses and slows down transmit
                         to avoid congestion.
cap_rem_fault     state  Indicates the device is capable of remote fault
                         indication.
cap_autoneg       state  Indicates the device is capable of auto-negotiation.
adv_cap_1000fdx   state  Indicates the device is advertising 1 Gbit/sec
                         full-duplex capability.
adv_cap_1000hdx   state  Indicates the device is advertising 1 Gbit/sec
                         half-duplex capability.
adv_cap_100fdx    state  Indicates the device is advertising 100 Mbit/sec
                         full-duplex capability.
adv_cap_100hdx    state  Indicates the device is advertising 100 Mbit/sec
                         half-duplex capability.
adv_cap_10fdx     state  Indicates the device is advertising 10 Mbit/sec
                         full-duplex capability.
adv_cap_10hdx     state  Indicates the device is advertising 10 Mbit/sec
                         half-duplex capability.
adv_cap_asmpause  state  Indicates the device is advertising the capability of
                         asymmetric pause Ethernet flow control.


TABLE 5-71   General Network Interface Statistics (Continued)

kstat name         Type   Description
adv_cap_pause      state  Indicates the device is advertising the capability
                          of symmetric pause Ethernet flow control when
                          adv_cap_pause = 1 and adv_cap_asmpause = 0. If
                          adv_cap_asmpause = 1 while adv_cap_pause = 0,
                          transmit pauses based on receive congestion; if
                          adv_cap_pause = 1, receive pauses and slows down
                          transmit to avoid congestion.
adv_cap_rem_fault  state  Indicates the device is experiencing a fault that it
                          is going to forward to the link partner.
adv_cap_autoneg    state  Indicates the device is advertising the capability
                          of auto-negotiation.
lp_cap_1000fdx     state  Indicates the link partner device is 1 Gbit/sec
                          full-duplex capable.
lp_cap_1000hdx     state  Indicates the link partner device is 1 Gbit/sec
                          half-duplex capable.
lp_cap_100fdx      state  Indicates the link partner device is 100 Mbit/sec
                          full-duplex capable.
lp_cap_100hdx      state  Indicates the link partner device is 100 Mbit/sec
                          half-duplex capable.
lp_cap_10fdx       state  Indicates the link partner device is 10 Mbit/sec
                          full-duplex capable.
lp_cap_10hdx       state  Indicates the link partner device is 10 Mbit/sec
                          half-duplex capable.
lp_cap_asmpause    state  Indicates the link partner device is capable of
                          asymmetric pause Ethernet flow control.
lp_cap_pause       state  Indicates the link partner device is capable of
                          symmetric pause Ethernet flow control when set to 1
                          and lp_cap_asmpause is 0. If lp_cap_asmpause = 1
                          while lp_cap_pause = 0, transmit pauses based on
                          receive congestion; if lp_cap_pause = 1, receive
                          pauses and slows down transmit to avoid congestion.
lp_cap_rem_fault   state  Indicates the link partner is experiencing a fault
                          with the link.
lp_cap_autoneg     state  Indicates the link partner device is capable of
                          auto-negotiation.


TABLE 5-71   General Network Interface Statistics (Continued)

kstat name     Type   Description
link_asmpause  state  Indicates the shared link asymmetric pause setting. The
                      value is based on the local resolution column of Table
                      37-4 of the IEEE 802.3 specification.
                      link_asmpause = 0: Link is symmetric pause.
                      link_asmpause = 1: Link is asymmetric pause.
link_pause     state  Indicates the shared link pause setting. The value is
                      based on the local resolution shown above.
                      If link_asmpause = 0 while link_pause = 0, the link has
                      no flow control; if link_pause = 1, the link can flow
                      control in both directions.
                      If link_asmpause = 1 while link_pause = 0, the local
                      flow control setting can limit the link partner; if
                      link_pause = 1, the link will flow control local Tx.
link_speed     state  The current speed of the network connection in megabits
                      per second.
link_duplex    state  Indicates the link duplex.
                      link_duplex = 0: Link is down and duplex is unknown.
                      link_duplex = 1: Link is up and in half-duplex mode.
                      link_duplex = 2: Link is up and in full-duplex mode.
link_up        state  Indicates whether the link is up or down.
                      link_up = 0: Link is down.
                      link_up = 1: Link is up.

Maximizing the Performance of an Ethernet NIC Interface


There are many ways to maximize the performance of your Ethernet NIC interface, and a few tools are valuable in achieving that. The ndd parameters and kernel statistics provide a means to get the best out of your NIC, but there are also other tools for examining system behavior and establishing whether more tuning can better utilize the system as well as the NIC.


The starting point for this discussion is the physical layer, because that layer is the most important with respect to creating the link between two systems. At the physical layer, failures can prevent the link from coming up or, worse, allow the link to come up with a mismatched duplex, giving rise to less-visible problems. The discussion then moves to the data link layer, where most problems are performance related. During that discussion, the architecture features described above can be used to address many of these performance problems.

Ethernet Physical Layer Troubleshooting


Many problems can occur at the physical layer, ranging from a missing cable to a duplex mismatch. The key tool for examining the physical layer is the kstat command (see the kstat man page). The first step in checking the physical layer is to check whether the link is up.
kstat ce:0 | grep link_
link_asmpause   0
link_duplex     2
link_pause      0
link_speed      1000
link_up         1

If the link_up variable is set, a physical connection is present. But also check that the speed matches your expectation. For example, if the interface is a 1000BASE-T interface and you expect it to run at 1000 Mbit/sec, the link_speed parameter should indicate 1000. If this is not the case, a check of the link partner capabilities might be required to establish whether they are the limiting factor. A kstat command line such as the following shows the link partner capabilities:
kstat ce:0 | grep lp_cap
lp_cap_1000fdx    1
lp_cap_1000hdx    1
lp_cap_100T4      1
lp_cap_100fdx     1
lp_cap_100hdx     1
lp_cap_10fdx      1
lp_cap_10hdx      1
lp_cap_asmpause   0
lp_cap_autoneg    1
lp_cap_pause      0


If the link partner appears to be capable of all the desired speeds, then the problem might be local. There are two possibilities: either the NIC itself is not capable of the desired speed, or the configuration has no shared capabilities that can be agreed on, in which case the link will not come up. You can check this using the following kstat command line.

kstat ce:0 | grep cap_
cap_1000fdx    1
cap_1000hdx    1
cap_100T4      1
cap_100fdx     1
cap_100hdx     1
cap_10fdx      1
cap_10hdx      1
cap_asmpause   0
cap_autoneg    1
cap_pause      0
...

If all the required capabilities are available for the desired speed and duplex, yet there remains a problem achieving the desired speed, the only remaining possibility is an incorrect configuration. You can check this by looking at the individual ndd adv_cap_* parameters, or you can use the kstat command:

kstat ce:0 | grep adv_cap_
adv_cap_1000fdx    1
adv_cap_1000hdx    1
adv_cap_100T4      1
adv_cap_100fdx     1
adv_cap_100hdx     1
adv_cap_10fdx      1
adv_cap_10hdx      1
adv_cap_asmpause   0
adv_cap_autoneg    1
adv_cap_pause      0

Configuration issues are where most problems lie. All of these issues can be addressed by using the kstat command shown above to establish the local and remote configuration and then adjusting the adv_cap_* parameters using ndd to correct the problem.
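For example, a configuration correction might look like the following sketch. The adv_* parameter names follow the naming used elsewhere in this chapter, and the values are illustrative only, so verify them against your driver version:

# Sketch: restrict the advertised capabilities of ce instance 0 so
# that only 1000 Mbit/sec full duplex is negotiated
ndd -set /dev/ce instance 0
ndd -set /dev/ce adv_1000hdx_cap 0
ndd -set /dev/ce adv_100fdx_cap 0
ndd -set /dev/ce adv_100hdx_cap 0
ndd -set /dev/ce adv_10fdx_cap 0
ndd -set /dev/ce adv_10hdx_cap 0
ndd -set /dev/ce adv_1000fdx_cap 1
ndd -set /dev/ce adv_autoneg_cap 1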

Chapter 5

Server Network Interface Cards: Datalink and Physical Layer

251

The most common configuration problem is duplex mismatch, which is induced when one side of a link is auto-negotiating and the other is not. Running with auto-negotiation disabled is known as Forced mode, and it can be guaranteed only for 10/100 operation. For 1000BASE-T UTP operation, Forced mode (auto-negotiation disabled) is not guaranteed because not all vendors support it. If auto-negotiation is turned off, you must ensure that both ends of the connection are in Forced mode and that the speed and duplex are matched perfectly.

If you fail to match Forced mode in gigabit operation, the link will not come up at all. Note that this result is quite different from the 10/100 case. In 10/100 operation, if only one end of the connection is auto-negotiating (with full capabilities advertised), the link will come up with the correct speed, but the auto-negotiating end will always fall back to half duplex, creating the potential for a duplex mismatch if the forced end is set to full duplex. If both sides are set to Forced mode and you fail to match speeds, the link will never come up. If both sides are set to Forced mode and you fail to match duplex, the link will come up, but you will have a duplex mismatch.

Duplex mismatch is a silent failure that manifests itself, from an upper layer point of view, as very poor performance: many packets are lost to collisions and late collisions occurring on the half-duplex end of the connection, due to violations of the Ethernet protocol induced by the full-duplex end. The half-duplex end experiences collisions and late collisions, while the full-duplex end experiences all manner of corrupted packets, causing the MIB counters for CRC errors, runts, giants, and alignment errors to increment. If the node experiencing poor performance is the half-duplex end of the connection, you can look at the kstat values for collisions and late_collisions.
# kstat ce:0 | grep collisions
        collisions      22332
        late_collisions 15432

If the node experiencing poor performance is the full-duplex end of the connection, you can look at the packet corruption counters, for example, crc_err and alignment_err.

# kstat ce:0 | grep crc_err
        crc_err         22332
# kstat ce:0 | grep alignment_err
        alignment_err   224532

252

Networking Concepts and Technology: A Designers Resource

Depending on the capability of the switch or remote end of the connection, it may be possible to make similar measurements there. Forced mode, in addition to creating a potential duplex mismatch, also has the drawback of isolating the link partner capabilities from the local station: in Forced mode, you cannot view the lp_cap_* values and determine the capabilities of the remote link partner locally. Where possible, use the default of auto-negotiation with all capabilities advertised and avoid tuning the physical link parameters. Given the maturity of the auto-negotiation protocol and its requirement in the IEEE 802.3 specification for one-gigabit UTP physical implementations, ensure that auto-negotiation is enabled.

Deviation from General Ethernet MII/GMII Conventions


We must address some remaining deviations from the general Ethernet MII/GMII kernel statistics. In the case of the ge interface, all of the statistics for getting local capabilities and link partner capabilities are read-only ndd properties, so they cannot be read using the kstat command as described previously, although the debug mechanism is still valid. To read the corresponding lp_cap_* values using ge, use the following commands:
hostname# ndd -set /dev/ge instance 0
hostname# ndd -get /dev/ge lp_1000fdx_cap

Or you could use the interactive mode, described previously. The mechanism used for enabling Ethernet Flow control on the ge interface is also different, using the parameters in the table below.
TABLE 34   Physical Layer Configuration Properties

Statistic     Values   Description

adv_pauseTX   0-1      Transmit a pause if the Rx buffer is full.
adv_pauseRX   0-1      When a pause is received, slow down the Tx.


There's also a deviation in ge for adjusting ndd parameters. For example, when you modify ndd parameters such as adv_1000fdx_cap, the changes will not take effect until the adv_autoneg_cap parameter is toggled to change state (from 0 to 1 or from 1 to 0). This is a deviation from the general Ethernet MII/GMII convention that ndd changes take effect immediately.
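Putting the two ge deviations together, a minimal sketch of enabling Ethernet flow control might look like the following; the parameter values are illustrative:

# Enable Tx pause on ge instance 0, then toggle adv_autoneg_cap
# so that the change takes effect (the ge deviation noted above)
ndd -set /dev/ge instance 0
ndd -set /dev/ge adv_pauseTX 1
ndd -set /dev/ge adv_autoneg_cap 0
ndd -set /dev/ge adv_autoneg_cap 1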

Ethernet Performance Troubleshooting


Ethernet performance troubleshooting is device specific because not all devices have the same architecture capabilities. Therefore, the discussion of troubleshooting performance issues will have to be tackled on a per-device basis. The following Solaris tools aid in the analysis of performance issues:
- kstat, to view device-specific statistics
- mpstat, to view system utilization information
- lockstat, to show areas of contention

You can use the information from these tools to tune specific parameters. The tuning examples that follow describe where this information is most useful. You have two options for tuning: the /etc/system file or the ndd utility. Using the /etc/system file to modify the initial value of the driver variables requires a system reboot for the changes to take effect. If you use the ndd utility, the changes take effect immediately; however, any modifications you make using the ndd utility are lost when the system goes down. If you want the ndd tuning properties to persist through a reboot, add these properties to the respective driver.conf file. Parameters that have kernel statistics but no corresponding tunable are omitted from this discussion because no troubleshooting capability is provided in those cases.
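As a rough illustration of the two styles, using parameters that appear later in this section (the names and values here are examples, not recommendations):

# ndd: takes effect immediately, but is lost at the next reboot
ndd -set /dev/ce instance 0
ndd -set /dev/ce infinite-burst 1

* /etc/system: persists, but requires a reboot to take effect
set ce:ce_taskq_disable=1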

254

Networking Concepts and Technology: A Designers Resource

ge Gigabit Ethernet
The ge interface provides some kstats that can be used to measure the performance bottlenecks in the driver in the Tx or the Rx. The kstats allow you to decide what corrective tuning can be applied based on the tuning parameters previously described. The useful statistics are shown in TABLE 5-72.
TABLE 5-72   List of ge Specific Interface Statistics

kstat name        Type      Description

rx_overflow       counter   Number of times the hardware is unable to receive
                            a packet due to the internal FIFOs being full.
no_free_rx_desc   counter   Number of times the hardware is unable to post a
                            packet because no more Rx descriptors are
                            available.
no_tmds           counter   Number of times transmit packets are posted on the
                            driver streams queue for processing some time
                            later, by the queue's service routine.
nocanput          counter   Number of times a packet is simply dropped by the
                            driver because the module above the driver cannot
                            accept the packet.
pci_bus_speed     value     The PCI bus speed that is driving the card.

When rx_overflow is incrementing, packet processing is not keeping up with the packet arrival rate. If rx_overflow is incrementing and no_free_rx_desc is not, this indicates that the PCI or SBus I/O bus is presenting an issue to the flow of packets through the device. This could be because the ge card is plugged into a slower I/O bus. You can confirm the bus speed by looking at the pci_bus_speed statistic. An SBus speed of 40 MHz or a PCI bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic. Another scenario that can lead to rx_overflow incrementing on its own is sharing the I/O bus with another device that has similar bandwidth requirements to those of the ge card. These scenarios are hardware limitations. There is no solution for SBus. For PCI, a first step in addressing them is to enable the infinite burst capability on the PCI bus. You can achieve that by using the /etc/system tuning parameter ge_dmaburst_mode. Alternatively, you can reorganize the system to give the ge interface a 66-MHz PCI slot, or you can separate devices that contend for a shared bus segment by giving each of them its own bus segment.
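A minimal /etc/system sketch, assuming the ge_dmaburst_mode tunable described above (the value shown is an illustrative assumption; check your driver release for the value that selects infinite burst):

* /etc/system: enable infinite-burst DMA for the ge driver (illustrative)
set ge:ge_dmaburst_mode=1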

Chapter 5

Server Network Interface Cards: Datalink and Physical Layer

255

The probability that rx_overflow incrementing is the only problem is small. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind and the Rx descriptor ring is exhausted of free elements with which to receive more packets. If this happens, the kstat parameter no_free_rx_desc will begin to increment, meaning that the CPU cannot absorb the incoming packets (in the case of a single CPU).

If more than one CPU is available, it is still possible to overwhelm a single CPU. But given that the Rx processing can be split using the alternative Rx data delivery models provided by ge, it might be possible to distribute the processing of incoming packets to more than one CPU. You can do this by first ensuring that ge_intr_mode is not set to 1, and by tuning ge_put_cfg to enable the load-balancing worker thread or streams service routine.

Another possible scenario is that the ge device is adequately handling the rate of incoming packets, but the upper layer is unable to deal with the packets at that rate. In this case, the kstat nocanput parameter will be incrementing. The tuning that can be applied to this condition is available in the upper layer protocols. If you're running the Solaris 8 operating system or an earlier version, upgrading to the Solaris 9 version will help your application experience fewer nocanputs, because of the multithreading and IP scalability performance improvements in the Solaris 9 operating system.

While the Tx side is also subject to an overwhelmed condition, this is less likely than any Rx-side condition. If the Tx side is overwhelmed, it will be visible when the no_tmds parameter begins to increment. If the Tx descriptor ring size needs to be increased, the /etc/system tunable parameter ge_nos_tmd provides that capability.
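The corresponding /etc/system entries might look like the following sketch. The tunable names are the ones given above; every value is an illustrative assumption, not a recommendation:

* Keep ge out of interrupt-only Rx mode so processing can be split
set ge:ge_intr_mode=0
* Enable the load-balancing worker thread / streams service routine
set ge:ge_put_cfg=1
* Enlarge the Tx descriptor ring (illustrative size)
set ge:ge_nos_tmd=1024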

ce Gigabit Ethernet
The ce interface provides a far more extensive list of kstats that can be used to measure the performance bottlenecks in the driver in the Tx or the Rx. The kstats allow you to decide what corrective tuning can be applied based on the tuning parameters described previously. The useful statistics are shown in TABLE 5-73.
TABLE 5-73   List of ce Specific Interface Statistics

kstat name        Type      Description

rx_ov_flow        counter   Number of times the hardware is unable to receive
                            a packet due to the internal FIFOs being full.
rx_no_buf         counter   Number of times the hardware is unable to receive
                            a packet due to Rx buffers being unavailable.
rx_no_comp_wb     counter   Number of times the hardware is unable to receive
                            a packet due to no space in the completion ring to
                            post a received packet descriptor.
ipackets_cpuXX    counter   Number of packets being directed to load-balancing
                            thread XX.
mdt_pkts          counter   Number of packets sent using the Multidata
                            interface.
rx_hdr_pkts       counter   Number of packets arriving that are less than 252
                            bytes in length.
rx_mtu_pkts       counter   Number of packets arriving that are greater than
                            252 bytes in length.
rx_jumbo_pkts     counter   Number of packets arriving that are greater than
                            1522 bytes in length.
rx_nocanput       counter   Number of times a packet is simply dropped by the
                            driver because the module above the driver cannot
                            accept the packet.
rx_pkts_dropped   counter   Number of packets dropped due to the service FIFO
                            queue being full.
tx_hdr_pkts       counter   Number of packets hitting the small packet
                            transmission method: packets are copied into a
                            premapped DMA buffer.
tx_ddi_pkts       counter   Number of packets hitting the mid-range DDI DMA
                            transmission method.
tx_dvma_pkts      counter   Number of packets hitting the top-range DVMA fast
                            path DMA transmission method.
tx_jumbo_pkts     counter   Number of packets being sent that are greater than
                            1522 bytes in length.
tx_max_pend       counter   Measure of the maximum number of packets ever
                            queued on a Tx ring.
tx_no_desc        counter   Number of times a packet transmit was attempted
                            and Tx descriptor elements were not available. The
                            packet is postponed until later.
tx_queueX         counter   Number of packets transmitted on a particular
                            queue.
mac_mtu           value     The maximum packet size allowed past the MAC.
pci_bus_speed     value     The PCI bus speed that is driving the card.

When rx_ov_flow is incrementing, it indicates that packet processing is not keeping up with the packet arrival rate. If rx_ov_flow is incrementing while rx_no_buf or rx_no_comp_wb is not, this indicates that the PCI bus is presenting an issue to the flow of packets through the device. This could be because the ce card is


plugged into a slower PCI bus. You can establish this by looking at the pci_bus_speed statistic. A bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic. Another scenario that can lead to rx_ov_flow incrementing on its own is sharing the PCI bus with another device that has bandwidth requirements similar to those of the ce card.

These scenarios are hardware limitations. A first step in addressing them is to enable the infinite burst capability on the PCI bus, using the ndd tuning parameter infinite-burst. Infinite burst will give ce more bandwidth, but the Tx and Rx of the ce device will still be competing for that PCI bandwidth. Therefore, if the traffic profile shows a bias toward Rx traffic and this condition is leading to rx_ov_flow, you can adjust the bias of PCI transactions in favor of the Rx DMA channel relative to the Tx DMA channel, using the ndd parameters rx-dma-weight and tx-dma-weight. Alternatively, you can reorganize the system by giving the ce interface a 66-MHz PCI slot, or you can separate devices that contend for a shared bus segment by giving each of them its own bus segment.

If this doesn't contribute much to reducing the problem, then you should consider using Random Early Detection (RED) to ensure that the impact of dropping packets is minimized with respect to keeping alive connections that would otherwise be terminated by regular overflow. The parameters that enable RED are configurable using ndd: red-dv4to6k, red-dv6to8k, red-dv8to10k, and red-dv10to12k.

The probability that rx_ov_flow incrementing is the only problem is small. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind and the Rx buffers or completion descriptor ring are exhausted of free elements with which to receive more packets. If this happens, the kstat parameters rx_no_buf and rx_no_comp_wb will begin to increment. This can mean that there's not enough CPU power to absorb the packets, but it can also be due to a bad balance of the buffer ring size versus the completion ring size, leading to rx_no_comp_wb incrementing without rx_no_buf incrementing.

The default configuration is one buffer to four completion elements. This works well provided that the packets arriving are larger than 256 bytes. If they are not and that traffic dominates, then 32 packets will be packed into a buffer, leading to a greater probability that a configuration imbalance will occur. For that case, more completion elements need to be made available. This can be addressed using the /etc/system tunables ce_ring_size, to adjust the number of available Rx buffers, and ce_comp_ring_size, to adjust the number of Rx packet completion elements. To understand the traffic profile of the Rx so you can tune these parameters, use kstat to look at the distribution of Rx packets across rx_hdr_pkts and rx_mtu_pkts.
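Gathered into one place, a tuning sketch for the Rx-side scenarios above might look like this. The parameter names are the ones discussed in the text; every value is an illustrative assumption rather than a recommendation:

# ndd: immediate PCI-side relief for ce instance 0
ndd -set /dev/ce instance 0
ndd -set /dev/ce infinite-burst 1
ndd -set /dev/ce rx-dma-weight 3
ndd -set /dev/ce tx-dma-weight 1

* /etc/system: rebalance Rx buffers versus completion elements
* (takes effect at the next reboot; sizes are illustrative)
set ce:ce_ring_size=256
set ce:ce_comp_ring_size=2048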

258

Networking Concepts and Technology: A Designers Resource

If ce is being run on a single-CPU system and rx_no_buf and rx_no_comp_wb are incrementing, then you will have to resort again to RED or enable Ethernet flow control.

If more than one CPU is available, it is still possible to overwhelm a single CPU. Given that the Rx processing can be split using the alternative Rx data delivery models provided by ce, it might be possible to distribute the processing of incoming packets to more than one CPU, described earlier as Rx load balancing. This happens by default if the system has four or more CPUs, in which case four load-balancing worker threads are enabled. The threshold of CPUs in the system and the number of load-balancing worker threads enabled can be managed using the /etc/system tunables ce_cpu_threshold and ce_inst_taskqs.

The number of load-balancing worker threads, and how evenly the Rx load is being distributed to each of them, can be viewed with the ipackets_cpuXX kstats. The highest number XX tells you how many load-balancing worker threads are running, while the values of these statistics give you the spread of the work across the instantiated worker threads. This, in turn, indicates whether the load balancing is yielding a benefit. For example, if all ipackets_cpuXX kstats show an approximately even number of packets, then the load balancing is optimal. On the other hand, if only one is incrementing and the others are not, then the benefit of Rx load balancing is nullified. It is also possible to measure whether the system is experiencing an even spread of CPU activity using mpstat. In the ideal case, good load balancing as shown by the ipackets_cpuXX kstats should also be visible in mpstat as a workload evenly distributed across multiple CPUs. If none of this benefit is visible, then disable the load-balancing capability completely, using the /etc/system variable ce_taskq_disable.

The Rx load balancing provides packet queues, also known as service FIFOs, between the interrupt threads that fan out the workload and the service FIFO worker threads that drain the FIFOs and complete the workload. These service FIFOs are of fixed size, controlled by the /etc/system variable ce_srv_fifo_depth. It is possible for the service FIFOs to overflow and drop packets when the rate of packet arrival exceeds the rate at which the service FIFO draining thread can complete the post-processing. These dropped packets can be measured using the rx_pkts_dropped kstat. If this is occurring, you can increase the size of the service FIFOs or increase the number of service FIFOs, allowing more Rx load balancing. In some cases it may be possible to eliminate increments in rx_pkts_dropped, but the problem may move to rx_nocanput, which is generally only addressable by tuning the upper layer protocols. If you're running the Solaris 8 operating system or an earlier version, upgrading to the Solaris 9 version will help your application experience fewer nocanputs, because of the multithreading and IP scalability performance improvements in the Solaris 9 operating system.
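The following sketch shows how you might inspect and adjust the Rx load balancing. The kstat names come from TABLE 5-73; the /etc/system values are illustrative assumptions:

# Inspect the spread of Rx work across the worker threads
kstat ce:0 | grep ipackets_cpu

* /etc/system: load-balancing knobs discussed above (illustrative values)
set ce:ce_cpu_threshold=4
set ce:ce_inst_taskqs=4
set ce:ce_srv_fifo_depth=2048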

Chapter 5

Server Network Interface Cards: Datalink and Physical Layer

259

There is a difficulty in maximizing the Rx load balancing: it is contingent on the Tx ring processing. This is measurable using the lockstat command, which will show contention on the ce_start routine as the most contended driver function. This contention cannot be eliminated, but it is possible to employ a new Tx method known as transmit serialization, which keeps contention to a minimum while forcing the Tx processing onto a fixed set of CPUs. Keeping the Tx process on a fixed CPU reduces the risk of CPUs spinning while waiting for other CPUs to complete their Tx activity, ensuring that CPUs are always kept busy doing useful work. This transmission method can be enabled by setting the /etc/system variable ce_start_cfg to 1. When you enable transmit serialization, you trade off transmit latency against the mutex spins induced by contention.

The Tx side is also subject to an overwhelmed condition, although this is less likely than any Rx-side condition. It becomes visible when the tx_max_pend value matches the size of the /etc/system variable ce_tx_ring_size. If this occurs, you know that packets are being postponed because Tx descriptors are being exhausted, and the value of ce_tx_ring_size should be increased.

The tx_hdr_pkts, tx_ddi_pkts, and tx_dvma_pkts statistics are useful for establishing the traffic profile of an application and matching it with the capabilities of a system. For example, many small systems have very fast memory access times, making the cost of setting up DMA transactions more expensive than transmitting directly from a premapped DMA buffer. In that case you can adjust the DMA thresholds, programmable via /etc/system, to push more packets into the premapped DMA buffer versus per-packet DMA programming. Once the tuning is complete, these statistics can be viewed again to see whether the tuning took effect.

The tx_queueX kstats give a good indication of whether Tx load balancing matches the Rx side. If no load balancing is visible, meaning all the packets appear to be counted by only one tx_queue, then it may make sense to switch this feature off. The /etc/system variable that does that is ce_no_tx_lb.

The mac_mtu statistic gives the maximum size of packet that will make it through the ce device. It is useful for knowing whether jumbo frames are enabled at the DLPI layer below TCP/IP. If jumbo frames are enabled, the MTU indicated by mac_mtu will be 9216. This helps show whether there's a mismatch between the DLPI layer MTU and the IP layer MTU, allowing troubleshooting to occur in a layered manner. Once jumbo frames are successfully configured at the driver layer and the TCP/IP layer, use rx_jumbo_pkts and tx_jumbo_pkts to ensure that jumbo frame packets are being received and transmitted correctly.
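A compact sketch of the Tx-side checks and tunables discussed above follows; the names are those given in the text, and all values are illustrative assumptions:

# Check Tx queue balance, jumbo frame activity, and the DLPI MTU
kstat ce:0 | grep tx_queue
kstat ce:0 | grep jumbo
kstat ce:0 | grep mac_mtu

* /etc/system: Tx-side knobs (illustrative; take effect at reboot)
set ce:ce_start_cfg=1
set ce:ce_tx_ring_size=4096
set ce:ce_no_tx_lb=1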

260

Networking Concepts and Technology: A Designers Resource

CHAPTER 6

Network Availability Design Strategies


This chapter provides a survey of availability strategies from a networking perspective. Keep in mind the required degree of availability during the network design process. Availability has always been an important design goal for network architectures. As enterprise customers increasingly deploy mission-critical Web-based services, they require a deeper understanding of designing optimal network availability solutions. There are several approaches to implementing high-availability network solutions. This chapter provides a high-level survey of possible approaches to increasing network availability and shows possible deployments using actual implemented examples.

Network Architecture and Availability


One of the first items to consider for network availability is the architecture itself. Network architectures fall into two basic categories: flat and multi-level.
- A flat architecture is composed of a multi-layer switch that performs multiple switching functions in one physical network device. This implies that a packet will traverse fewer network switching devices when communicating from the client to the server, which results in higher availability.
- A multi-level architecture is composed of multiple small switches where each switch performs one or two switching functions. This implies that a packet will traverse more network switching devices when communicating from the client to the server, which results in lower availability.


Serial components reduce availability and parallel components increase availability. A serial design requires that every component is functioning at the same time. If any one component fails, the entire system fails. A parallel design offers multiple paths in case one path fails. In a parallel design, if any one component fails, the entire system still survives by using the backup path. Three network architecture aspects impact network availability:
- Component failure: This aspect is the probability of the device failing. It is measured statistically as the average time the device works divided by the average time the device works plus the failed time. This value is called the MTBF. In calculating the MTBF, components that are connected serially dramatically reduce the MTBF, while components that are in parallel increase the MTBF (the formulas are sketched after this list).
- System failure: This aspect captures failures that are caused by external factors, such as a technician accidentally pulling out a cable. The number of components that are potential candidates for failure is directly proportional to the complexity of the system. Design B in FIGURE 6-1 has more components that can go wrong, which contributes to the increased probability of failure.
- Single points of failure: This aspect captures the number of devices that can fail and still have the system functioning. Neither Design A nor Design B shown in FIGURE 6-1 has a single point of failure (SPOF), so they are equal in this regard. However, Design B is somewhat more resilient because if a network interface card (NIC) fails, that failure is isolated by the Layer 2 switch and does not impact the rest of the architecture. This issue has a trade-off to consider, where availability is sacrificed for increased resiliency and isolation of failures.
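As a rough guide to the MTBF discussion above, the standard availability formulas, assuming independent component failures, are:

   A(i)        = MTBF(i) / (MTBF(i) + MTTR(i))         availability of component i
   A(serial)   = A(1) x A(2) x ... x A(n)              all n components must work
   A(parallel) = 1 - (1 - A(1)) x ... x (1 - A(n))     any one component suffices

Adding serial components therefore always lowers the overall availability, while adding parallel components raises it.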
FIGURE 6-1 shows two network designs. In both designs, Layer 2 switches provide physical connectivity for one virtual local area network (VLAN) domain. Layer 2-7 switches are multilayer devices providing routing, load balancing, and other IP services in addition to physical connectivity.

Design A shows a flat architecture, often seen with multilayer chassis-based switches using Extreme Networks Black Diamond, Foundry Networks BigIron, or Cisco switches. The switch can be partitioned into VLANs, isolating traffic from one segment to another, yet providing a much better overall solution. In this approach, the availability will be relatively high because there are two parallel paths from the ingress to each server and only two serial components that a packet must traverse to reach the target server. In Design B, the architecture provides the same functionality, but across many small switches. From an availability perspective, this solution will have a relatively lower mean time between failures (MTBF) because there are more serial components that a packet must traverse to reach a target server. Other disadvantages of this approach include manageability, scalability, and performance. However, one can argue that there might be increased security using this approach, which for some customers outweighs all other factors. In Design B, multiple switches must be hacked to control the network, whereas in Design A, only one switch needs to be hacked to bring down the entire network.

FIGURE 6-1   Network Topologies and Impact on Availability
[Figure: Design A is a flat architecture with a higher MTBF: redundant Layer 2-7 multilayer switches connect directly to the Web, directory, application, database, and integration service tiers. Design B is a multilayered architecture with a lower MTBF: Layer 2 switches at the service modules feed Layer 3 switches at the distribution layer.]

Layer 2 Strategies
There are several Layer 2 availability design options. Layer 2 availability designs are desirable because any fault detection and recovery is transparent to the IP layer. Further, the fault detection and recovery can be relatively fast if the correct approach is taken. In this section, we explain the operation and recovery times for three approaches:
- Trunking and variants, based on IEEE 802.3ad
- SMLT and DMLT, a relatively new and promising approach available from Nortel Networks
- Spanning Tree, a time-tested and proven Layer 2 availability strategy, originally designed for bridged networks by the brilliant Dr. Radia Perlman, formerly of DEC and now with Sun Microsystems

Trunking Approach to Availability


Link aggregation or trunking increases availability by distributing network traffic over multiple physical links. If one link breaks, the load on the broken link is transferred to the remaining links. IEEE 802.3ad is an industry standard created to allow the trunking solutions of various vendors to interoperate. Like most standards, there are many ways to implement the specifications. Link aggregation can be thought of as a layer of indirection between the MAC and PHY layers. Instead of having one fixed MAC address that is bound to a physical port, a logical MAC address is exposed to the IP layer and implements the Data Link Provider Interface (DLPI). This logical MAC address can be bound to many physical ports. The remote side must have the same capabilities and algorithm for distributing packets among the physical ports. FIGURE 6-2 shows a breakdown of the sublayers.

264

Networking Concepts and Technology: A Designers Resource

FIGURE 6-2   Trunking Software Architecture
[Figure: The IP layer (MAC client) sits on a logical MAC (LMAC) that implements LACP, a frame collector, and a frame distributor; aggregator parser/Mux blocks bind the logical MAC to the physical MACs (PMAC) at ports PHY1 and PHY2.]

Theory of Operation
The Link Aggregation Control Protocol (LACP) allows both ends of the trunk to communicate trunking or link aggregation information. The first command sent is the Query command, with which each link partner discovers the link aggregation capabilities of the other. If both partners are willing and capable, a Start Group command is sent, indicating that a link aggregation group is to be created, followed by commands adding segments to this group that include link identifiers tied to the ports participating in the aggregation. The LACP can also delete a link, which might be due to the detection of a failed link. Instead of rebalancing the load across the remaining ports, the algorithm simply places the failed link's traffic onto one of the remaining links. The collector reassembles traffic coming from the different links. The distributor takes an input stream and spreads the traffic across the ports belonging to a trunk group or link aggregation group.

Availability Issues
To understand suitability for network availability, Sun Trunking 1.2 software was installed on several quad fast Ethernet cards. The client has four trunks connected to the switch. The server also has four links connected to the switch. This setup allows the load to be distributed across the four links, as shown in FIGURE 6-3.

Chapter 6

Network Availability Design Strategies

265

FIGURE 6-3   Trunking Failover Test Setup
[Figure: Client ports qfe0 through qfe3 and server ports qfe0 through qfe3 are connected by point-to-point trunked links through a trunking-capable switch.]

The highlighted line (in bold italic) in the CODE EXAMPLE 6-1 output shows the traffic from the client qfe0 moved to the server qfe1 under load balancing.
CODE EXAMPLE 6-1   Output Showing Traffic from Client qfe0 to Server qfe1

Jan 10 14:22:05 2002
Name   Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0     210      0    130      0       0    0  100.00   25.00
qfe1       0      0    130      0       0    0    0.00   25.00
qfe2       0      0    130      0       0    0    0.00   25.00
qfe3       0      0    130      0       0    0    0.00   25.00
(Aggregate Throughput(Mb/sec): 5.73(New Peak) 31.51(Past Peak) 18.18%(New/Past))

Jan 10 14:22:06 2002
Name   Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0       0      0      0      0       0    0    0.00    0.00
qfe1       0      0      0      0       0    0    0.00    0.00
qfe2       0      0      0      0       0    0    0.00    0.00
qfe3       0      0      0      0       0    0    0.00    0.00
(Aggregate Throughput(Mb/sec): 0.00(New Peak) 31.51(Past Peak) 0.00%(New/Past))

Jan 10 14:22:07 2002
Name   Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0       0      0      0      0       0    0    0.00    0.00
qfe1       0      0      0      0       0    0    0.00    0.00
qfe2       0      0      0      0       0    0    0.00    0.00
qfe3       0      0      0      0       0    0    0.00    0.00
(Aggregate Throughput(Mb/sec): 0.00(New Peak) 31.51(Past Peak) 0.00%(New/Past))

Jan 10 14:22:08 2002
Name   Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0       0      0      0      0       0    0    0.00    0.00
qfe1    1028      0   1105      0       0    0  100.00   51.52
qfe2       0      0    520      0       0    0    0.00   24.24
qfe3       0      0    520      0       0    0    0.00   24.24
(Aggregate Throughput(Mb/sec): 23.70(New Peak) 31.51(Past Peak) 75.21%(New/Past))

Several Test TCP (TTCP) streams were pumped from one host to the other. When all links were up, the load was balanced evenly and each port experienced a 25 percent load. When one link was cut, the traffic of the failed link (qfe0) was transferred onto one of the remaining links (qfe1), which then showed a 51 percent load. The failover took three seconds. However, if all links were heavily loaded, the algorithm might force one link to be saturated with its original load in addition to the failed link's traffic. For example, if all links were running at 55 percent capacity and one link failed, one link would be offered 55 percent + 55 percent = 110 percent of its capacity. Link aggregation is suitable for point-to-point links for increased availability, where nodes are on the same segment. However, there is a trade-off of port cost on the switch side as well as the host side.

Load-Sharing Principles
The trunking layer distributes traffic on frame boundaries. This means that as long as the server and switch know that a trunk spans certain physical ports, neither side needs to know which algorithm the other uses to distribute the load across the trunked ports. What is important is to understand the traffic characteristics in order to distribute the load as evenly as possible across the trunked ports. The following figures describe how load sharing across trunks should be configured based on the nature of the traffic, which is often asymmetric.

Chapter 6

Network Availability Design Strategies

267

FIGURE 6-4   Correct Trunking Policy on Switch
[Figure: Clients 1-3 (IP 10.0.0.1-3, MACs 0:8:c:a:b:1-3) reach a Sun server (IP 20.0.0.2, MAC 8:0:20:1:a:1) through Router-int10 (IP 10.0.0.100, MAC 0:0:8:8:8:2) and Router-int20 (IP 20.0.0.1, MAC 0:0:8:8:8:1); ingress traffic is distributed across the trunked ports.]


FIGURE 6-4 shows a correct trunking policy on a switch: because the ingress traffic has distributed source IP addresses and source MAC addresses, the switch can use a trunking policy based on round-robin, source/destination MAC address, or source/destination IP address. Such a policy will distribute the load evenly across the physically trunked links.

FIGURE 6-5   Incorrect Trunking Policy on Switch
[Figure: Same topology as FIGURE 6-4, but the ingress traffic is not distributed across the trunked ports.]

268

Networking Concepts and Technology: A Designers Resource

FIGURE 6-5 shows an incorrect trunking policy on a switch. In this case, the ingress traffic, which has a single target IP address and a single target MAC, should not use a trunking policy based solely on the destination IP address or on the destination or source MAC.

FIGURE 6-6   Correct Trunking Policy on Server
[Figure: Same topology as FIGURE 6-4; egress return traffic is distributed across the trunked ports.]


FIGURE 6-6 shows a correct trunking policy on a server whose egress traffic has distributed target IP addresses but a single target MAC, that of the default router. Such a server should use a trunking policy based only on round-robin or on the destination IP address. A destination MAC policy will not work because the destination MAC always points to the default router (0:0:8:8:8:1), not the actual client MAC.


FIGURE 6-7   Incorrect Trunking Policy on a Server
[Figure: Same topology as FIGURE 6-4; egress return traffic is not distributed across the trunked ports.]

FIGURE 6-7 shows an incorrect trunking policy on a server. In this example, the egress traffic has distributed target IP addresses, but the single target MAC of the default router means that a trunking policy based on the destination MAC will not work: the destination MAC always points to the default router (0:0:8:8:8:1), not the actual client MAC. The trunking policy should not use the source IP address or the source MAC either. It should use the target IP addresses, because that will spread the load evenly across the physical interfaces.

270

Networking Concepts and Technology: A Designers Resource

FIGURE 6-8   Incorrect Trunking Policy on a Server
[Figure: Same topology as FIGURE 6-4; because all egress return traffic belongs to one session, it is not distributed across the trunked ports.]


FIGURE 6-8 shows an incorrect trunking policy on a server. Even though the egress traffic is using round-robin, it is not distributing the load evenly because all the traffic belongs to the same session. In this case, trunking is not effective in distributing load across physical interfaces.

Availability Strategies Using SMLT and DMLT


In the past, server network resiliency leveraged IPMP and VRRP. Our deployments revealed that network switches with relatively low-powered CPUs had problems processing large numbers of ping requests generated by ICMP health checks when these were combined with other control processing such as VRRP routing calculations. Network switches were not designed to process a steady stream of ping requests in a timely manner: ping requests were traditionally used occasionally to troubleshoot network issues, so processing them was a lower priority than processing routing updates and other control plane network tasks. As the number of IPMP nodes increases, the network switch soon runs out of CPU processing resources and drops ping requests. This results in IPMP nodes falsely detecting router failures, which often results in a ping-pong effect of failing over back and forth across interfaces. One recent advance, introduced in Nortel Networks switches, is called Split MultiLink Trunking (SMLT) and Distributed MultiLink Trunking (DMLT). In this section, we describe several key tested configurations using Nortel Networks Passport 8600 core switches and the smaller Nortel Networks


Business Policy Switch 2000 Layer 2 switches. These configurations illustrate how network high availability can be achieved without encountering the scalability issues that have plagued IPMP and VRRP deployments. SMLT is a Layer 2 trunking redundancy mechanism. It is similar to plain trunking except that it spans two physical devices. FIGURE 6-9 shows a typical SMLT deployment using two Nortel Networks Passport 8600 switches and a Sun server with dual GigaSwift cards. The trunk spans both cards, but each card is connected to a separate switch. SMLT technology, in effect, exposes one logical trunk to the Sun server, when actually there are two physically separate devices.

SMLT BLOCK 1- 10.0.0.0/24


Passport8600 Core Passport8600 Core

SW1
IST12

SW2

Passport 8600 Core

Passport 8600 Core

SW3
IST34

SW4

SMLT BLOCK 2 - 20.0.0.0/24


SMLT trunk - head ce

ce0

ce1

FIGURE 6-9

Layer 2 High-Availability Design Using SMLT

FIGURE 6-10 shows another integration point, where workgroup servers connect to the corporate network at an edge point. In this case, instead of integrating directly into the enterprise core, the servers connect to a smaller Layer 2 switch, which runs DMLT, a scaled-down version of SMLT that is similar in functionality. DMLT has fewer features and a smaller binary image than SMLT, so it can run on smaller network devices. The switches are viewed as one logical trunking device even though packets are load shared across the links, with the switches ensuring that

272

Networking Concepts and Technology: A Designers Resource

packets arrive in order at the remote destination. FIGURE 6-10 illustrates a server-to-edge integration of a Layer 2 high-availability design using Sun Trunking 1.3 and Nortel Networks Business Policy Switch 2000 wiring closet edge switches.

FIGURE 6-10   Layer 2 High-Availability Design Using DMLT
[Figure: The same SMLT core as FIGURE 6-9 (SMLT Blocks 1 and 2), with two Business Policy Switch 2000 edge switches attached to the core via SMLT; a DMLT trunk from the two edge switches terminates on the server's ce0 and ce1 interfaces.]

CODE EXAMPLE 6-2 shows a sample configuration of the Passport 8600.

CODE EXAMPLE 6-2   Sample Configuration of the Passport 8600

#
# MLT CONFIGURATION PASSPORT 8600
#


mlt 1 create
mlt 1 add ports 1/1,1/8
mlt 1 name "IST Trunk"
mlt 1 perform-tagging enable
mlt 1 ist create ip 10.19.10.2 vlan-id 10
mlt 1 ist enable
mlt 2 create
mlt 2 add ports 1/6
mlt 2 name "SMLT-1"
mlt 2 perform-tagging enable
mlt 2 smlt create smlt-id 1
#

Availability Using Spanning Tree Protocol


The spanning tree algorithm was developed by Radia Perlman, currently with Sun Microsystems. The Spanning Tree Protocol (STP) is used on Layer 2 networks to eliminate loops. For added availability, redundant Layer 2 links can be added; however, these redundant links introduce loops, which cause bridges to forward frames indefinitely. With STP, bridges communicate with each other by sending bridge protocol data units (BPDUs), which contain the information a bridge uses to determine, based on the spanning tree algorithm, which ports forward traffic and which ports don't. A typical BPDU contains a unique bridge identifier, the port identifier, and the cost to the root bridge, which is the top of the spanning tree. From these BPDUs, each bridge can compute a spanning tree and decide which ports forward traffic. If a link fails, this tree is recomputed, and redundant links are activated by turning on certain ports, hence creating increased availability. A network needs to be designed so that every possible link that could fail has some redundant link. In older networks, bridges are still used; however, with recent advances in network switch technology and smaller Layer 2 networks, bridges are not used as much.

Availability Issues
To better understand failure detection and recovery, a testbed was created, as shown in FIGURE 6-11.

274

Networking Concepts and Technology: A Designers Resource

FIGURE 6-11   Spanning Tree Network Setup
[Figure: Server 11.0.0.51 connects through switch s48t to a loop of four switches, sw1 through sw4, interconnected on ports 7 and 8 and running the spanning tree; client 16.0.0.51 connects through switch s48b. STP has blocked port 8 on sw4.]

The switches sw1, sw2, sw3, and sw4 were configured in a Layer 2 network with an obvious loop, which was controlled by running the STP among these switches. On the client, we ran the traceroute server command, resulting in the following output, which shows that the client sees only two Layer 3 networks: the 11.0.0.0 and the 16.0.0.0 network.
client># traceroute server
traceroute: Warning: Multiple interfaces found; using 16.0.0.51 @ hme0
traceroute to server (11.0.0.51), 30 hops max, 40 byte packets
 1  16.0.0.1 (16.0.0.1)  1.177 ms  0.524 ms  0.512 ms
 2  16.0.0.1 (16.0.0.1)  0.534 ms !N  0.535 ms !N  0.529 ms !N

Chapter 6

Network Availability Design Strategies

275

Similarly, the server sees only two Layer 3 networks. We ran the traceroute client command on the server and got the following output:
server># traceroute client
traceroute: Warning: Multiple interfaces found; using 11.0.0.51 @ hme0
traceroute to client (16.0.0.51), 30 hops max, 40 byte packets
 1  11.0.0.1 (11.0.0.1)  0.756 ms  0.527 ms  0.514 ms
 2  11.0.0.1 (11.0.0.1)  0.557 ms !N  0.546 ms !N  0.531 ms !N

The following outputs show the STP configuration and port status of the participating switches, showing the port MAC address of the root switches.

* sw1:17 # sh s0 ports 7-8
Stpd: s0    Port: 7    PortId: 4007    Stp: ENABLED    Path Cost: 4
Port State: FORWARDING    Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00    Designated Cost: 0
Designated Bridge: 80:00:00:01:30:92:3f:00    Designated Port Id: 4007

Stpd: s0    Port: 8    PortId: 4008    Stp: ENABLED    Path Cost: 4
Port State: FORWARDING    Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00    Designated Cost: 0
Designated Bridge: 80:00:00:01:30:92:3f:00    Designated Port Id: 4008

* sw2:12 # sh s0 ports 7-8
Port  Mode    State       Cost  Flags  Priority  Port ID  Designated Bridge
7     802.1D  FORWARDING  4     e-R--  16        16391    80:00:00:01:30:92:3f:00
8     802.1D  FORWARDING  4     e-D--  16        16392    80:00:00:01:30:92:3f:00
Total Ports: 8
Flags: e=Enable, d=Disable, T=Topology Change Ack
       R=Root Port, D=Designated Port, A=Alternative Port

276

Networking Concepts and Technology: A Designers Resource

* sw3:5 # sh s0 ports 7-8
Stpd: s0    Port: 7    PortId: 4007    Stp: ENABLED    Path Cost: 4
Port State: FORWARDING    Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00    Designated Cost: 0
Designated Bridge: 80:00:00:01:30:92:3f:00    Designated Port Id: 4001

Stpd: s0    Port: 8    PortId: 4008    Stp: ENABLED    Path Cost: 4
Port State: FORWARDING    Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00    Designated Cost: 4
Designated Bridge: 80:00:00:e0:2b:98:96:00    Designated Port Id: 4008

The following output shows that STP has blocked Port 8 on sw4.

* sw4:10 # sh s0 ports 7-8
Stpd: s0    Port: 7    PortId: 4007    Stp: ENABLED    Path Cost: 4
Port State: FORWARDING    Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00    Designated Cost: 4
Designated Bridge: 80:00:00:01:30:f4:16:a0    Designated Port Id: 4008

Stpd: s0    Port: 8    PortId: 4008    Stp: ENABLED    Path Cost: 4
Port State: BLOCKING    Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00    Designated Cost: 4
Designated Bridge: 80:00:00:e0:2b:98:96:00    Designated Port Id: 4008

To get a better understanding of failure detection and fault recovery, we conducted a test where the client continually sent a ping to the server, and we pulled a cable on the spanning tree path.

Chapter 6

Network Availability Design Strategies

277

The following output shows that it took approximately 58 seconds for failure detection and recovery, which is not acceptable in most mission-critical environments. (Each ping takes about one second; from icmp_seq=16 to icmp_seq=74, the pings did not succeed.)

on client
---------
64 bytes from server (11.0.0.51): icmp_seq=12. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=13. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=14. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=15. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=16. time=1. ms
ICMP Net Unreachable from gateway 16.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
...
...
ICMP Net Unreachable from gateway 16.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
ICMP Net Unreachable from gateway 16.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
64 bytes from server (11.0.0.51): icmp_seq=74. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=75. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=76.

Layer 3 Strategies
There are several Layer 3 availability design options. Layer 3 availability designs are desirable because there could be a fault at the IP layer, but not at the lower layers. By implementing a Layer 3 availability strategy, we can infer the status of the network at all layers below, but not at the layers above. The fault detection and recovery can be relatively slower than Layer 2 strategies, depending on the strategy. In this section we explain the operation and recovery times for three approaches:
- VRRP and IPMP, proven to be very useful at the server-to-default-router network connectivity segment of the data center network
- OSPF, a proven and effective link-state routing protocol, suitable for inter-switch connectivity
- RIP, a time-tested distance vector routing protocol, suitable for inter-switch connectivity

278

Networking Concepts and Technology: A Designers Resource

We describe how these network design strategies work and show actual tested configurations.

VRRP Router Redundancy


The Virtual Router Redundancy Protocol (VRRP) was designed to remove a single point of failure where hosts connect to the rest of the enterprise network or Internet through one default router. VRRP is based on an election algorithm involving two routers: a master that owns both a MAC address and an IP address, and a backup. Both routers reside on one LAN or VLAN segment. The hosts all point to one IP address, which belongs to the master router. The master and backup constantly send multicast messages to each other. Depending on the vendor-specific implementation, the backup will assume the master role if the master is no longer functioning or has lowered its priority based on some criteria. The new master also assumes the same MAC address, so the clients do not need to update their Address Resolution Protocol (ARP) caches. VRRP by itself has left many aspects open so that switch manufacturers can implement and add features to differentiate themselves. All vendors offer a variety of features that alter the priority, which can be tied to server health checks, the number of active ports, and so on. Whichever router has the highest priority becomes the master. These configurations need to be closely monitored to prevent oscillations: often a switch is configured to be too sensitive, causing it to constantly change priority and hence fluctuate between master and backup.

IPMP: Host Network Interface Redundancy


The purpose of the server's redundant network interface capability is to increase overall system availability: if one server NIC fails, the backup takes over within two seconds. On the Solaris operating system, this is IP Multipathing (IPMP), a bundled feature that is crucial in creating highly available network designs. IPMP has a daemon that constantly sends pings to the default router, which is intelligently pulled from the kernel routing tables. If that router is not reachable, another standby interface in the same IPMP group assumes ownership of the floating IP address. The switch re-runs the ARP for the new MAC address and can contact the server again. A typical highly available configuration includes a Sun server with dual NIC cards, which increases the availability of these components by several orders of magnitude. For example, the GigabitEthernet card, part number 595-5414-01, by itself has an MTBF of 199156 hours; assuming approximately two hours mean time to


recovery (MTTR), it has an availability of 0.999989958. With two cards in parallel, the availability becomes nine 9s, at .9999999996. This small incremental cost has a big impact on the overall availability computation.
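As a rough check of these figures, applying the availability formulas given earlier in this chapter, and assuming independent card failures and a two-hour MTTR (a simplification; the exact repair model affects the final digits):

   A(one card)  = 199156 / (199156 + 2)    = 0.999989958    (about five 9s)
   A(two cards) = 1 - (1 - 0.999989958)^2  = 0.9999999999   (roughly nine to ten 9s)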
FIGURE 6-12 shows the Sun server redundant NIC model using IPMP. The server has

two NICs, ge0 and ge1, with fixed IP addresses of a.b.c.d and e.f.g.h. The virtual IP address w.x.y.z is the IP address of the service; client requests use it as their destination. This IP address floats between the two interfaces, ge0 and ge1, and only one interface can be associated with it at any one time. If the ge0 interface owns the virtual IP address, then data traffic follows the P1 path. If the ge0 interface fails, then the ge1 interface takes over the virtual IP address, and data traffic follows the P2 path. Failures can be detected within two seconds, depending on the configuration.

FIGURE 6-12   High-Availability Network Interface Cards on Sun Servers
[Figure: A server with IPMP dual NICs ge0 (a.b.c.d) and ge1 (e.f.g.h) shares the floating virtual IP address w.x.y.z; paths P1 and P2 run through the VLAN to the two HCS switches.]
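A minimal IPMP configuration sketch for the model in FIGURE 6-12 follows. The interface names and addresses are the placeholders used above, the group name prod0 is hypothetical, and the exact flags can vary by Solaris release:

# Place ge0 and ge1 in one IPMP group; the fixed addresses serve as
# test addresses (deprecated -failover), and the virtual service
# address w.x.y.z floats on ge0 until a failure moves it to ge1
ifconfig ge0 plumb a.b.c.d netmask + broadcast + group prod0 deprecated -failover up
ifconfig ge0 addif w.x.y.z netmask + broadcast + up
ifconfig ge1 plumb e.f.g.h netmask + broadcast + group prod0 deprecated -failover up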

Integrated VRRP and IPMP


By combining the availability technologies of routers and server NICs, we can create a cell that can be reused in any deployment where servers are connected to routers. This reusable cell is highly available and scalable. FIGURE 6-13 shows how this is implemented. Lines 1 and 2 show the VRRP protocol used by the routers to monitor each other. If one router detects that the other has failed, the surviving router assumes the role of master and inherits the IP address and MAC address of the master. Lines 3 and 5 in FIGURE 6-13 show how a switch can verify that a particular connection is up and running. This verification can be port-based, link-based, or based on Layers 3, 4, and 7. The router can make synthetic requests to the server and

280

Networking Concepts and Technology: A Designers Resource

verify that a particular service is up and running. If the router detects that the service has failed, then on some switches VRRP can be configured to take this into account, tying the failure into the election algorithm and the priority of the VRRP router. Simultaneously, the server also monitors links. Currently, IPMP consists of a daemon, in.mpathd, that constantly sends pings to the default router. As long as the default router answers the pings, the master interface (ge0) retains ownership of the IP address. If the in.mpathd daemon detects that the default router is not reachable, automatic failover occurs, bringing down the link and floating the server's IP address to the surviving interface (ge1). In the lab, we can tune IPMP and the Extreme Standby Routing Protocol (ESRP) to achieve failure detection and recovery within one second. Because ESRP is a CPU-intensive task and the control packets travel on the same network as the production traffic, the trade-off is that if the switches, networks, or servers become overloaded, false failures can occur: a device can take longer than the strict timeout to respond to its peer's heartbeat.

FIGURE 6-13   Design Pattern: IPMP and VRRP Integrated Availability Solution
[Figure: Two routers exchange VRRP heartbeats (lines 1 and 2); health-check paths (lines 3 through 6) run between the routers and the server's ge0 and ge1 interfaces, which are monitored by in.mpathd.]
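The failure detection speed mentioned above is governed by the in.mpathd failure detection time. A sketch of the relevant setting, assuming the stock /etc/default/mpathd file (the one-second value reflects the aggressive lab tuning discussed above, not a general recommendation):

# /etc/default/mpathd
# Time taken by in.mpathd to detect a NIC failure, in milliseconds
FAILURE_DETECTION_TIME=1000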

OSPF Network Redundancy: Rapid Convergence


Open Shortest Path First (OSPF) is an intra-domain, link-state routing protocol. The main idea of OSPF is that each OSPF router can determine the state of the link to all neighbor routers and the costs associated with sending data over that link. One property of this routing protocol is that each OSPF router has a view of the entire network, which allows it to find the best path to all participating routers. All OSPF routers in the domain flood each other with link state packets (LSPs), which contain the unique ID of the sending router; a list of directly connected neighbor routers and associated costs; a sequence number and a time to live;

and authentication and checksum information. (OSPF also supports hierarchy and load balancing.) From this information, each node can reliably determine whether an LSP is the most recent by comparing sequence numbers, and can compute the shortest path to every node by collecting the LSPs from all nodes and comparing costs using Dijkstra's shortest path algorithm. To prevent continuous flooding, the sender never receives the same LSP packet that it sent out. To better understand the suitability of OSPF from an availability perspective, the following lab network was set up, consisting of Extreme Networks switches and Sun servers. FIGURE 6-14 shows the actual setup used to demonstrate the availability characteristics of the interior routing protocol OSPF.

FIGURE 6-14   Design Pattern: OSPF Network
[Figure: Server 11.0.0.51 connects through switch s48t and network 12.0.0.0 to core switches sw1 through sw4, which are interconnected by networks 13.0.0.0, 14.0.0.0, 17.0.0.0, and 18.0.0.0; client 16.0.0.51 connects through switch s48b and network 15.0.0.0. A default path and a backup path exist between client and server.]


To confirm correct configuration, traceroute commands were issued from client to server. In the following output, the highlighted lines show the path through sw2:
client># traceroute server
traceroute: Warning: Multiple interfaces found; using 16.0.0.51 @ hme0
traceroute to server (11.0.0.51), 30 hops max, 40 byte packets
 1  16.0.0.1 (16.0.0.1)  1.168 ms  0.661 ms  0.523 ms
 2  15.0.0.1 (15.0.0.1)  1.619 ms  1.104 ms  1.041 ms
 3  17.0.0.1 (17.0.0.1)  1.527 ms  1.197 ms  1.043 ms
 4  18.0.0.1 (18.0.0.1)  1.444 ms  1.208 ms  1.106 ms
 5  12.0.0.1 (12.0.0.1)  1.237 ms  1.274 ms  1.083 ms
 6  server (11.0.0.51)  0.390 ms  0.349 ms  0.340 ms

The following tables show the initial routing tables of the core routers. The first two highlighted lines in CODE EXAMPLE 6-3 show the route to the client through sw2; the second two highlighted lines show the alternate path through sw3.
CODE EXAMPLE 6-3   Router sw1 Routing Table

OR   Destination      Gateway     Mtr  Flags      Use   M-Use  VLAN     Acct-1
*s   10.100.0.0/24    12.0.0.1     1   UG---S-um    63      0  net12    0
*oa  11.0.0.0/8       12.0.0.1     5   UG-----um    98      0  net12    0
*d   12.0.0.0/8       12.0.0.2     1   U------u-  1057      0  net12    0
*d   13.0.0.0/8       13.0.0.1     1   U------u-    40      0  net13    0
*oa  14.0.0.0/8       13.0.0.2     8   UG-----um     4      0  net13    0
*oa  15.0.0.0/8       18.0.0.2    12   UG-----um     0      0  net18    0
*oa  15.0.0.0/8       13.0.0.2    12   UG-----um     0      0  net13    0
*oa  16.0.0.0/8       18.0.0.2    13   UG-----um     0      0  net18    0
*oa  16.0.0.0/8       13.0.0.2    13   UG-----um     0      0  net13    0
*oa  17.0.0.0/8       18.0.0.2     8   UG-----um     0      0  net18    0
*d   18.0.0.0/8       18.0.0.1     1   U------u-   495      0  net18    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um     0      0  Default  0

Origin(OR): b - BlackHole, bg - BGP, be - EBGP, bi - IBGP, bo - BOOTP, ct - CBT,
d - Direct, df - DownIF, dv - DVMRP, h - Hardcoded, i - ICMP, mo - MOSPF,
o - OSPF, oa - OSPFIntra, or - OSPFInter, oe - OSPFAsExt, o1 - OSPFExt1,
o2 - OSPFExt2, pd - PIM-DM, ps - PIM-SM, r - RIP, ra - RtAdvrt, s - Static,
sv - SLB_VIP, un - UnKnown.
Flags: U - Up, G - Gateway, H - Host Route, D - Dynamic, R - Modified,
S - Static, B - BlackHole, u - Unicast, m - Multicast.
Total number of routes = 12.
Mask distribution: 11 routes at length 8, 1 route at length 24.

CODE EXAMPLE 6-4   Router sw2 Routing Table

sw2:8 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    18.0.0.1     1   UG---S-um   27      0  net18    0
*oa  11.0.0.0/8       18.0.0.1     9   UG-----um   98      0  net18    0
*oa  12.0.0.0/8       18.0.0.1     8   UG-----um    0      0  net18    0
*oa  13.0.0.0/8       18.0.0.1     8   UG-----um    0      0  net18    0
*oa  14.0.0.0/8       17.0.0.2     8   UG-----um    0      0  net17    0
*oa  15.0.0.0/8       17.0.0.2     8   UG-----um    9      0  net17    0
*oa  16.0.0.0/8       17.0.0.2     9   UG-----um    0      0  net17    0
*d   17.0.0.0/8       17.0.0.1     1   U------u-   10      0  net17    0
*d   18.0.0.0/8       18.0.0.2     1   U------u-  403      0  net18    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

CODE EXAMPLE 6-5   Router sw3 Routing Table

sw3:5 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    13.0.0.1     1   UG---S-um   26      0  net13    0
*oa  11.0.0.0/8       13.0.0.1     9   UG-----um    0      0  net13    0
*oa  12.0.0.0/8       13.0.0.1     8   UG-----um  121      0  net13    0
*d   13.0.0.0/8       13.0.0.2     1   U------u-   28      0  net13    0
*d   14.0.0.0/8       14.0.0.1     1   U------u-   20      0  net14    0
*oa  15.0.0.0/8       14.0.0.2     8   UG-----um    0      0  net14    0
*oa  16.0.0.0/8       14.0.0.2     9   UG-----um    0      0  net14    0
*oa  17.0.0.0/8       14.0.0.2     8   UG-----um    0      0  net14    0
*oa  18.0.0.0/8       13.0.0.1     8   UG-----um    0      0  net13    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0


CODE EXAMPLE 6-6 shows switch sw4's routing table, with routes back to the server both through sw2 (gateway 17.0.0.1) and through sw3 (gateway 14.0.0.1).
CODE EXAMPLE 6-6   Switch sw4 Routing Table

sw4:8 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    14.0.0.1     1   UG---S-um   29      0  net14    0
*oa  11.0.0.0/8       17.0.0.1    13   UG-----um    0      0  net17    0
*oa  11.0.0.0/8       14.0.0.1    13   UG-----um    0      0  net14    0
*oa  12.0.0.0/8       17.0.0.1    12   UG-----um    0      0  net17    0
*oa  12.0.0.0/8       14.0.0.1    12   UG-----um    0      0  net14    0
*oa  13.0.0.0/8       14.0.0.1     8   UG-----um    0      0  net14    0
*d   14.0.0.0/8       14.0.0.2     1   U------u-   12      0  net14    0
*d   15.0.0.0/8       15.0.0.1     1   U------u-  204      0  net15    0
*oa  16.0.0.0/8       15.0.0.2     5   UG-----um    0      0  net15    0
*d   17.0.0.0/8       17.0.0.2     1   U------u-   11      0  net17    0
*oa  18.0.0.0/8       17.0.0.1     8   UG-----um    0      0  net17    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

To check OSPF failover capabilities, a constant ping command was run from the client to the server while an interface on switch sw2 was disabled to create a failure. The failover measurement is shown in the following output. The first highlighted lines show when the sw2 interface fails; the second highlighted line shows that the new route through switch sw3 is established within two seconds.
client reading:
64 bytes from server (11.0.0.51): icmp_seq=11. time=2. ms
64 bytes from server (11.0.0.51): icmp_seq=12. time=2. ms
ICMP Net Unreachable from gateway 17.0.0.1 for icmp from client (16.0.0.51) to server (11.0.0.51)
ICMP Net Unreachable from gateway 17.0.0.1 for icmp from client (16.0.0.51) to server (11.0.0.51)
64 bytes from server (11.0.0.51): icmp_seq=15. time=2. ms
64 bytes from server (11.

OSPF took approximately two seconds to detect and recover from the failed node.


The highlighted lines in the following output from the traceroute server command show the new path from the client to the server through switch sw3.
client># traceroute server
traceroute: Warning: Multiple interfaces found; using 16.0.0.51 @ hme0
traceroute to server (11.0.0.51), 30 hops max, 40 byte packets
 1  16.0.0.1 (16.0.0.1)  0.699 ms  0.535 ms  0.581 ms
 2  15.0.0.1 (15.0.0.1)  1.481 ms  0.990 ms  0.986 ms
 3  14.0.0.1 (14.0.0.1)  1.214 ms  1.021 ms  1.002 ms
 4  13.0.0.1 (13.0.0.1)  1.322 ms  1.088 ms  1.100 ms
 5  12.0.0.1 (12.0.0.1)  1.245 ms  1.131 ms  1.220 ms
 6  server (11.0.0.51)  1.631 ms  1.200 ms  1.314 ms

The following code examples show the routing tables after the node failure. The first highlighted line in CODE EXAMPLE 6-7 shows the new route to the client through switch sw3. The second highlighted line shows that the switch sw2 link is down.
CODE EXAMPLE 6-7   Switch sw1 Routing Table After Node Failure

sw1:27 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use   M-Use  VLAN      Acct-1
*s   10.100.0.0/24    12.0.0.1     1   UG---S-um    63      0  net12     0
*oa  11.0.0.0/8       12.0.0.1     5   UG-----um   168      0  net12     0
*d   12.0.0.0/8       12.0.0.2     1   U------u-  1083      0  net12     0
*d   13.0.0.0/8       13.0.0.1     1   U------u-    41      0  net13     0
*oa  14.0.0.0/8       13.0.0.2     8   UG-----um     4      0  net13     0
*oa  15.0.0.0/8       13.0.0.2    12   UG-----um     0      0  net13     0
*oa  16.0.0.0/8       13.0.0.2    13   UG-----um    22      0  net13     0
*oa  17.0.0.0/8       13.0.0.2    12   UG-----um     0      0  net13     0
d    18.0.0.0/8       18.0.0.1     1   ---------   515      0  --------  0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um     0      0  Default   0

CODE EXAMPLE 6-8   Switch sw2 Routing Table After Node Failure

sw1:4 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use   M-Use  VLAN      Acct-1
*s   10.100.0.0/24    12.0.0.1     1   UG---S-um    63      0  net12     0
*oa  11.0.0.0/8       12.0.0.1     5   UG-----um   168      0  net12     0
*d   12.0.0.0/8       12.0.0.2     1   U------u-  1102      0  net12     0
*d   13.0.0.0/8       13.0.0.1     1   U------u-    41      0  net13     0
*oa  14.0.0.0/8       13.0.0.2     8   UG-----um     4      0  net13     0
*oa  15.0.0.0/8       13.0.0.2    12   UG-----um     0      0  net13     0
*oa  16.0.0.0/8       13.0.0.2    13   UG-----um    22      0  net13     0
*oa  17.0.0.0/8       13.0.0.2    12   UG-----um     0      0  net13     0
d    18.0.0.0/8       18.0.0.1     1   ---------   515      0  --------  0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um     0      0  Default   0

CODE EXAMPLE 6-9   Switch sw3 Routing Table After Node Failure

sw3:6 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    13.0.0.1     1   UG---S-um   26      0  net13    0
*oa  11.0.0.0/8       13.0.0.1     9   UG-----um   24      0  net13    0
*oa  12.0.0.0/8       13.0.0.1     8   UG-----um  134      0  net13    0
*d   13.0.0.0/8       13.0.0.2     1   U------u-   29      0  net13    0
*d   14.0.0.0/8       14.0.0.1     1   U------u-   20      0  net14    0
*oa  15.0.0.0/8       14.0.0.2     8   UG-----um    0      0  net14    0
*oa  16.0.0.0/8       14.0.0.2     9   UG-----um   25      0  net14    0
*oa  17.0.0.0/8       14.0.0.2     8   UG-----um    0      0  net14    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

The highlighted line in CODE EXAMPLE 6-10 shows the new route back to the client through sw3.
CODE EXAMPLE 6-10   Switch sw4 Routing Table After Node Failure

sw4:9 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    14.0.0.1     1   UG---S-um   29      0  net14    0
*oa  11.0.0.0/8       14.0.0.1    13   UG-----um   21      0  net14    0
*oa  12.0.0.0/8       14.0.0.1    12   UG-----um    0      0  net14    0
*oa  13.0.0.0/8       14.0.0.1     8   UG-----um    0      0  net14    0
*d   14.0.0.0/8       14.0.0.2     1   U------u-   12      0  net14    0
*d   15.0.0.0/8       15.0.0.1     1   U------u-  216      0  net15    0
*oa  16.0.0.0/8       15.0.0.2     5   UG-----um   70      0  net15    0
*d   17.0.0.0/8       17.0.0.2     1   U------u-   12      0  net17    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

OSPF is a good routing protocol for enterprise networks, with fast failure detection and recovery.


RIP Network Redundancy


The Routing Information Protocol (RIP) is based on the Bellman-Ford distance-vector algorithm. The idea behind RIP is that each RIP router builds a one-dimensional array that contains a scalar notion of hops to reach all other nodes. (In theory, OSPF can use the notion of cost with greater accuracy, capturing information such as link speed. In actual practice, this might not be practical because of the increased burden of maintaining correct link costs in large, changing environments.) RIP routers flood each other with their view of the network, starting with directly connected neighbor routers and then modifying their vectors whenever peer updates show a shorter path. After a few updates, a complete routing table is constructed. When a router detects a failure, the distance is updated to infinity. Ideally, all routers would eventually receive the proper update and adjust their tables accordingly. However, if the network is designed with redundancy, there can be issues in properly updating the tables to reflect a failed link. Problems such as count-to-infinity have fixes such as split horizon and poison reverse. RIP was the first implementation of the distance-vector algorithm. RIPv2, now the most common version, addresses scalability and other limitations of the original RIP. To better understand the failover capabilities of RIPv2, the test network shown in FIGURE 6-15 was set up.
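As with OSPF, only a few commands are needed to run RIPv2 on the lab switches. The following ExtremeWare-style excerpt is a minimal sketch (applying RIP to all VLANs and forcing version-2 packets are illustrative choices, not the exact lab configuration):

# Run RIP on every routed VLAN, sending and accepting RIPv2 only
configure rip add vlan all
configure rip txmode v2only
configure rip rxmode v2only
enable rip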


[Figure: the same lab topology as FIGURE 6-14: server 11.0.0.51 behind s48t (12.0.0.0) and sw1, client 16.0.0.51 behind s48b (15.0.0.0) and sw4. The default path from server to client runs through sw3; if that path fails, the backup path through sw2 becomes the active route.]
FIGURE 6-15   RIP Network Setup

The following output shows the server-to-client path before node failure. The highlighted lines show the path through the switch sw3.

server># traceroute client
traceroute: Warning: Multiple interfaces found; using 11.0.0.51 @ hme0
traceroute to client (16.0.0.51), 30 hops max, 40 byte packets
 1  11.0.0.1 (11.0.0.1)  0.711 ms  0.524 ms  0.507 ms
 2  12.0.0.2 (12.0.0.2)  1.448 ms  0.919 ms  0.875 ms
 3  13.0.0.2 (13.0.0.2)  1.304 ms  0.977 ms  0.964 ms
 4  14.0.0.2 (14.0.0.2)  1.963 ms  1.091 ms  1.151 ms
 5  15.0.0.2 (15.0.0.2)  1.158 ms  1.059 ms  1.037 ms
 6  client (16.0.0.51)  1.560 ms  1.170 ms  1.107 ms


The following code examples show the initial routing tables. The highlighted line in CODE EXAMPLE 6-11 shows the path to the client through the switch sw3.
CODE EXAMPLE 6-11   Switch sw1 Initial Routing Table

OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    12.0.0.1     1   UG---S-um   32      0  net12    0
*r   11.0.0.0/8       12.0.0.1     2   UG-----um   15      0  net12    0
*d   12.0.0.0/8       12.0.0.2     1   U------u-  184      0  net12    0
*d   13.0.0.0/8       13.0.0.1     1   U------u-   52      0  net13    0
*r   14.0.0.0/8       13.0.0.2     2   UG-----um    1      0  net13    0
*r   15.0.0.0/8       18.0.0.2     3   UG-----um    0      0  net18    0
*r   16.0.0.0/8       13.0.0.2     4   UG-----um   10      0  net13    0
*r   17.0.0.0/8       18.0.0.2     2   UG-----um    0      0  net18    0
*d   18.0.0.0/8       18.0.0.1     1   U------u-   12      0  net18    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

CODE EXAMPLE 6-12   Switch sw2 Initial Routing Table

sw2:3 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    18.0.0.1     1   UG---S-um   81      0  net18    0
*r   11.0.0.0/8       18.0.0.1     3   UG-----um    9      0  net18    0
*r   12.0.0.0/8       18.0.0.1     2   UG-----um   44      0  net18    0
*r   13.0.0.0/8       18.0.0.1     2   UG-----um    0      0  net18    0
*r   14.0.0.0/8       17.0.0.2     2   UG-----um    0      0  net17    0
*r   15.0.0.0/8       17.0.0.2     2   UG-----um    0      0  net17    0
*r   16.0.0.0/8       17.0.0.2     3   UG-----um    3      0  net17    0
*d   17.0.0.0/8       17.0.0.1     1   U------u-   17      0  net17    0
*d   18.0.0.0/8       18.0.0.2     1   U------u-  478      0  net18    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

CODE EXAMPLE 6-13   Switch sw3 Initial Routing Table

sw3:3 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    13.0.0.1     1   UG---S-um   79      0  net13    0
*r   11.0.0.0/8       13.0.0.1     3   UG-----um    3      0  net13    0
*r   12.0.0.0/8       13.0.0.1     2   UG-----um   44      0  net13    0
*d   13.0.0.0/8       13.0.0.2     1   U------u-   85      0  net13    0
*d   14.0.0.0/8       14.0.0.1     1   U------u-   33      0  net14    0
*r   15.0.0.0/8       14.0.0.2     2   UG-----um    0      0  net14    0
*r   16.0.0.0/8       14.0.0.2     3   UG-----um   10      0  net14    0
*r   17.0.0.0/8       14.0.0.2     2   UG-----um    0      0  net14    0
*r   18.0.0.0/8       13.0.0.1     2   UG-----um    0      0  net13    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

The highlighted line in CODE EXAMPLE 6-14 shows the path to the server through the switch sw3.
CODE EXAMPLE 6-14   Switch sw4 Initial Routing Table

sw4:7 # sh ipr
OR   Destination      Gateway     Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24    14.0.0.1     1   UG---S-um   29      0  net14    0
*r   11.0.0.0/8       14.0.0.1     4   UG-----um    9      0  net14    0
*r   12.0.0.0/8       14.0.0.1     3   UG-----um    0      0  net14    0
*r   13.0.0.0/8       14.0.0.1     2   UG-----um    0      0  net14    0
*d   14.0.0.0/8       14.0.0.2     1   U------u-   13      0  net14    0
*d   15.0.0.0/8       15.0.0.1     1   U------u-  310      0  net15    0
*r   16.0.0.0/8       15.0.0.2     2   UG-----um   16      0  net15    0
*d   17.0.0.0/8       17.0.0.2     1   U------u-    3      0  net17    0
*r   18.0.0.0/8       17.0.0.1     2   UG-----um    0      0  net17    0
*d   127.0.0.1/8      127.0.0.1    0   U-H----um    0      0  Default  0

The highlighted lines in the following output from running the traceroute client command show the new path from the server to the client through the switch sw2 after the switch sw3 fails.
server># traceroute client
traceroute: Warning: Multiple interfaces found; using 11.0.0.51 @ hme0
traceroute to client (16.0.0.51), 30 hops max, 40 byte packets
 1  11.0.0.1 (11.0.0.1)  0.678 ms  0.479 ms  0.465 ms
 2  12.0.0.2 (12.0.0.2)  1.331 ms  0.899 ms  0.833 ms
 3  18.0.0.2 (18.0.0.2)  1.183 ms  0.966 ms  0.953 ms
 4  17.0.0.2 (17.0.0.2)  1.379 ms  1.082 ms  1.062 ms
 5  15.0.0.2 (15.0.0.2)  1.101 ms  1.024 ms  0.993 ms
 6  client (16.0.0.51)  1.209 ms  1.086 ms  1.074 ms


The following output shows the result of the server ping commands.
64 bytes from client (16.0.0.51): icmp_seq=18. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=19. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=20. time=2. ms
ICMP Net Unreachable from gateway 12.0.0.2 for icmp from server (11.0.0.51) to client (16.0.0.51)
ICMP Net Unreachable from gateway 12.0.0.2 for icmp from server (11.0.0.51) to client (16.0.0.51)
..
..
ICMP Net Unreachable from gateway 12.0.0.2 for icmp from server (11.0.0.51) to client (16.0.0.51)
ICMP Net Unreachable from gateway 12.0.0.2 for icmp from server (11.0.0.51) to client (16.0.0.51)
64 bytes from client (16.0.0.51): icmp_seq=41. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=42. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=43. time=2. ms

Fault detection and recovery took in excess of 21 seconds. RIPv2 is widely available, but its failure detection and recovery are not optimal.

Conclusions Drawn from Evaluating Fault Detection and Recovery Times


We presented several approaches for increasing availability in network designs by evaluating fault detection and recovery times and the adverse impact on computing and memory resources. In comparing these networking designs, we drew the following conclusions:

- Link aggregation is suitable for increasing bandwidth capacity and availability on point-to-point links only. Layer 2 availability designs using Sun Trunking 1.3 and Split MultiLink Trunking, available on Nortel Networks Passport 8600 switches, were configured and tested. Distributed MultiLink Trunking was also configured and tested using Nortel's smaller Layer 2 Business Policy switches. Both switches were found to provide rapid failure detection and failover recovery within two to five seconds. A further benefit of this approach was that the failure and recovery events were transparent to the IP layer.
- Spanning Tree Protocol is not suitable because failure detection and recovery are slow. A recent improvement, IEEE 802.1w Rapid Spanning Tree, designed to address these limitations, might be worth considering in the future.
- Layer 3 availability designs using VRRP and IPMP offer an alternative availability strategy for the server-to-network connection. This approach provides rapid failure detection and recovery and is economically feasible when the increased MTBF calculations are considered. Be sure to investigate the processing capabilities of the control processor and consult with the vendor on the impact of the additional load imposed by the ICMP pings that IPMP generates.


CHAPTER 7

Reference Design Implementations


This chapter describes network implementation concepts and details. It first describes how the multi-tier services map to networks and VLANs. Then it describes some of the more important IP services to consider when crafting architectures for multi-tier data centers:

- Server Load Balancing: how to achieve increased availability and performance by redundancy of stateless applications
- Layer 7 Switching: how to decouple internal applications from external references
- Network Address Translation: how to decouple internal IP addresses from external references
- Cookie Persistence: how to achieve stateful transactions over a stateless protocol
- Secure Sockets Layer (SSL): how to achieve secure transactions over a public network
- IPMP: how to achieve network interface redundancy on servers that is transparent to applications
- VRRP: how to achieve router redundancy
The chapter then describes the logical network architecture and various physical realizations. Most important, it describes actual tested network reference implementations. It first describes the original secure multi-tier architecture and its limitations. Then it describes a second architecture based on many small multilayer and simple Layer 2 switches and its limitations. Finally, it describes in detail a collapsed network architecture based on large chassis-based switches. It is important to note that these designs are vendor independent and could have been realized with equipment from Cisco, Nortel, and other similar vendors or combinations thereof. Network equipment providers usually implement standard Layer 2 and Layer 3 functions using ASICs, and there are few differences in their basic implementations. However, additional features such as load balancing can differentiate vendors significantly in how their products actually impact the network architecture. We explore two vendors and describe reference implementations that were configured and tested. We then describe where it makes sense to use each design. We also discuss how to create virtual firewalls between tiers to increase the level of security without sacrificing wire-speed performance. In particular, we describe the tested configuration of the NetScreen firewall and show how one box can be configured to create virtual firewalls, segregating and filtering inter-tier network traffic.

Logical Network Architecture


The logical network design is composed of segregated networks that are implemented physically using virtual local area networks (VLANs) defined by network switches. The internal network uses private IP address space (10.0.0.0) for security and portability advantages. FIGURE 7-1 shows a high-level overview of the logical network architecture. The management network provides centralized data collection and management of all devices. Each device has a separate interface to the management network to avoid contaminating the production network in terms of security and performance. The management network is also used for automating the installation of the software using Solaris JumpStart technology. Although several networks physically reside on a single active core switch, network traffic is segregated and secured using static routes, ACLs, and VLANs. From a practical perspective, this is as secure as separate individual switches.


[Figure: the client network (172.16.0.0) reaches the production networks through the external network (192.168.10.0); the production side comprises the Web service, naming services, application services, database service, SAN, and backup networks, with a management network that has access to all networks.]
FIGURE 7-1   Logical Network Architecture Overview


IP Services
The following subsections provide a description of some emerging IP services that are often an important component in a complete network design for a Sun ONE deployment. The IP services are divided into two categories:

- Stateful Session Based: This class of IP services requires that the switch maintain session state information so that a particular client's session state is maintained across all packets. This requirement has severe implications for highly available solutions and limits scalability and performance.
- Stateless Session Based: This class of IP services does not require that the switch maintain any state information associated with a particular flow.

Many functions can be implemented either by network switches and appliances or by the Sun ONE software stack. This section describes how these new IP services work and the benefits they provide. It then discusses availability strategies. Later sections describe similar functions that are included in the Sun ONE integrated stack. Modern multilayer network switches perform many Layer 3 IP services in addition to vanilla routing. These services are implemented as functions that operate on a packet by modifying the packet headers and controlling the rate at which the packet is forwarded. IP services include functions such as QoS, server load balancing, application redirection, and network address translation. This section starts the discussion with an important service for data centers, server load balancing, and then describes adjacent services that can be cascaded.

Stateless Server Load Balancing


The server load balancing (SLB) function maps incoming client requests destined to a virtual IP (VIP) address and port to a real server IP address and port. The target server is selected from a set of identically configured servers based on a predefined algorithm that considers the loads on the servers as criteria for choosing the best server at any instant in time. The purpose of SLB is to provide one layer of indirection to decouple servers from the network service that clients interface with. Thus, the server load balancer can choose the best server to service a client request. Decoupling increases availability because if some servers fail, the service is still available from the remaining functioning servers. Flexibility is increased because servers can be added or removed without impacting the service. Other redirection functions can be cascaded to provide compound functionality. SLB mapping functions differ from other mapping functions such as redirection, which makes mapping decisions based on criteria such as ensuring that a particular client is redirected to the same server to take advantage of caches or cookie persistence. FIGURE 7-2 shows an overview of the various mapping functions and how the IP header is rewritten by various functions.
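To make the mapping concrete, the following ExtremeWare-style excerpt sketches a VIP backed by two real servers (the pool and VIP names, the addresses, transparent mode, and the round-robin policy are illustrative assumptions, not a tested configuration):

# Define the set of identically configured real servers
create slb pool webpool lb-method round-robin
configure slb pool webpool add 10.10.0.100:80
configure slb pool webpool add 10.10.0.101:80
# Expose the pool through a single virtual IP address and start the service
create slb vip webvip pool webpool mode transparent 192.168.10.10:80
enable slb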


[Figure: a client packet (srcIP a.b.c.d, srcPort 123, dstIP VIP1, requesting http://www.a.com/index.html) can be matched on URL string, HTTP header, SSL session ID, cookie, or cache criteria; the first function rewrites the source IP and port to those of the switch (e.f.g.h:456) and sets the destination to the SLB function (VIP2); the SLB function, using a custom, round-robin, or least-connections algorithm, then rewrites the destination IP to the chosen real server IP.]
FIGURE 7-2   IP Services: Switch Functions Operate on Incoming Packets

FIGURE 7-2 shows that a typical client request is destined for an external VIP with IP address a.b.c.d and port 123. Various functions, as shown, can intercept this request and rewrite it according to the provisioned configuration rules. The SLB function eventually intercepts the packet and rewrites the destination IP address to that of the real server chosen by a particular algorithm. The reply is then returned as indicated by the source IP address.

Stateless Layer 7 Switching


Stateless Layer 7 switching, also called the application redirection function, intercepts a client's HTTP request and redirects the request to another destination, usually a group of cache servers. Application redirection rewrites the IP destination field. This differs from proxy switching, where the socket connection is terminated and a new one is created to the server to fetch the requested Web page. Application redirection serves the following purposes (a configuration sketch follows the list):

- Reduces the load on one set of Web servers by redirecting requests to another set, usually cache servers for specific content
- Intercepts client requests and redirects them to another destination to control certain types of traffic based on filter criteria
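On the Alteon multilayer switches used in the tested configurations described later in this chapter, application redirection is expressed as a filter. The following WebOS-style excerpt is a minimal sketch of redirecting inbound HTTP to a cache server group (the filter number, group number, and port binding are illustrative assumptions):

/cfg/slb/filt 10        (define filter 10)
        ena             (enable the filter)
        proto tcp       (match TCP traffic)
        dport http      (match destination port 80)
        action redir    (rewrite the destination to the redirect group)
        group 2         (the cache server group)
/cfg/slb/port 1         (bind the filter to the ingress port)
        filt ena
        add 10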


FIGURE 7-3 illustrates the functional model of application redirection, which only rewrites the IP header.

[Figure: a client HTTP request destined to A meets the criteria of a filter with a defined destination IP; the request is intercepted and its IP destination is rewritten to the new destination, so traffic bound for server group 1 (DEST = A) is redirected to server group 2 (DEST = B).]
FIGURE 7-3   Application Redirection Functional Model

Stateful Layer 7 Switching


Stateful Layer 7 switching, which is also called content switching, proxy switching, or URL switching, accepts a client's incoming HTTP request, terminates the socket connection, and creates another socket connection to the target Web server, which is chosen based on a user-defined rule. The difference between this and application redirection is the maintenance of state information. In application redirection, the packet is rewritten and continues on its way. In content switching, state information is required to keep track of client requests and server responses and to make sure they are tied together. The content switching function fetches the requested Web page and returns it to the client.
FIGURE 7-4 shows an overview of the functional content switching model.


[Figure: the Layer 7 switching function terminates the client's socket connection, gets the URL, checks it against the rules, and either forwards the request to a server group or SLB function or uses a valid cookie with a server ID to forward it to the same server. Five server groups are shown (stata, dnsa, statb, cacheb, dynab), with rules mapping http://www.a.com/SMA/stata/index.html to servergroup1, /SMA/dnsa/ to servergroup2, /SMB/statb/ to servergroup3, /SMB/CACHEB/ to servergroup4, and /SMB/DYNA/ to servergroup1.]
FIGURE 7-4   Content Switching Functional Model

Content switching with full NAT serves the following purposes:

- Isolates internal IP addresses from being exposed to the public Internet.
- Allows reuse of a single IP address. For example, clients can send their Web requests to www.a.com or www.b.com, where DNS maps both domains to a single IP address. The proxy switch receives this request with the packet containing an HTTP header in the payload that contains the target domain, for example a.com or b.com, and decides to which group of servers to redirect the request.
- Allows parallel fetching of different parts of Web pages from servers optimized and tuned for that type of data. For example, a complex Web page might need GIFs, dynamic content, or cached content. With content switching, one set of Web servers can hold the GIFs and another can hold the dynamic content. The proxy switch can make parallel fetches and retrieve the entire page faster than would otherwise be possible.
- Ensures that requests with cookies or SSL session IDs are redirected to the same server to take advantage of persistence.

FIGURE 7-4 shows that the client's socket connection is terminated by the proxy function. The proxy retrieves as much of the URL as needed to make a decision based on the retrieved URL. FIGURE 7-4 also shows various URLs mapped to various server groups, which are VIP addresses. The next step is to forward the URL directly or pass it off to the SLB function that is waiting for traffic destined to the server group. The proxy is configured with a VIP, so the switch forwards all client requests destined to this VIP to the proxy function. The proxy function rewrites the IP header, particularly the source IP and port, so that the server sends the requested data back to the proxy, not directly to the client.

Stateful Network Address Translation


Network Address Translation (NAT) is a critical component for security and proper traffic direction. There are two basic types of NAT: half and full. Half NAT rewrites the destination IP address and MAC address to a redirected location such as a Web cache, which returns the packet directly to the client because the source IP address is unchanged. With full NAT, the socket connection is terminated by a proxy, so the source IP and MAC addresses are changed to those of the proxy server. NAT serves the following purposes:

- Security: Prevents exposing internal private IP addresses to the public.
- IP Address Conservation: Requires only one valid exposed IP address to fetch Internet traffic from internal networks with non-valid IP addresses.
- Redirection: Intercepts traffic destined to one set of servers and redirects it to another by rewriting the destination IP and MAC addresses. With half-NAT translated traffic, the redirected servers can send the reply directly back to the clients because the original source IP address has not been rewritten.

NAT is configured with a set of filters, usually 5-tuple Layer 3 rules. If the incoming traffic matches a filter rule, the packet's IP header is rewritten, or another socket connection is initiated to the target server, which itself can be changed, depending on the rule.

Stateful Secure Sockets Layer Session ID Persistence


Secure Sockets Layer (SSL) can be implemented in software, hardware, or both. SSL can be terminated at the target server, an intermediate server, an SSL network appliance, or an SSL-capable network switch. SSL appliances, such as those from NetScaler or Array Networks, tend to be built on a PC platform with a PCI-based card that contains the SSL accelerator ASIC. Hence, the SSL acceleration is implemented in libraries that offload only the mathematical computations; the rest of the SSL processing is implemented in software, with selective functions directed to the hardware accelerator. Clearly, one immediate limitation is the PCI bus. Other, newer SSL devices have an SSL accelerator integrated in the datapath of the network switch. These advanced products are just emerging from startups such as Wincom Systems. This section discusses the switch and appliance interactions. A later section covers the server SSL implementation.
FIGURE 7-5 shows that once a client makes initial contact with a particular server, which may have been selected by SLB, the switch ensures that subsequent requests are forwarded to the same SSL server based on the SSL session ID that the switch stored during the initial SSL handshake. The switch keeps state information about the client's initial request, based on HTTPS and port 443, which contains a hello message. This first request is forwarded to the server selected by the SLB algorithm or by another function. The server responds to the client's hello message with an SSL session ID. The switch intercepts this SSL session ID and stores it in a table. The switch then forwards all of the client's subsequent requests to the same server as long as each request contains the SSL session ID in the HTTP header. FIGURE 7-5 shows that several different TCP socket connections may span the same SSL session. State is maintained by the SSL session ID in each HTTP request sent by the same client.

[Figure: four HTTP sessions from one client travel inside a single SSL session identified by its SSL session ID; the switch stores the SSL session ID and switches the client to the same SSL server (SSL server 1) rather than SSL server 2.]
FIGURE 7-5   Network Switch with Persistence Based on SSL Session ID

An appliance can be added for increased performance in terms of SSL handshakes and bulk encryption throughput. FIGURE 7-6 illustrates how an SSL appliance would potentially be deployed. Client requests first come in on a specific URL with the HTTPS protocol on port 443. The switch recognizes that these requests must be directed to the appliance, which is configured to provide that SSL service. A typical appliance such as the NetScaler can also be configured to provide content switching and load balancing in addition to SSL acceleration. The appliance then reads or inserts cookies and resubmits the HTTP request to an appropriate server, which can maintain state based on the cookie in the HTTP header.


[Figure: client HTTPS traffic from the Internet arrives at a multilayer switch, which directs HTTPS (port 443) to an SSL accelerator appliance for key exchange and bulk encryption with session persistence based on SSL session ID; decrypted HTTP flows through a load balancer to the Sun ONE Web servers with session persistence based on a cookie.]
FIGURE 7-6   Tested SSL Accelerator Configuration: RSA Handshake and Bulk Encryption

Stateful Cookie Persistence


The HTTP 1.0 protocol was originally designed to deliver static pages in one transaction. As more complex Web sites evolved that required multiple HTTP requests to reach the same server, performance was severely limited by the closing and opening of TCP socket connections. This was solved by HTTP 1.1, which allows persistent connections: immediately after a socket connection is established, the client can pipeline multiple requests. However, Web sites grew to include applications such as the shopping cart, which require persistence across multiple HTTP 1.1 requests and are further complicated by proxies and load balancers that interfere with traffic being redirected to the same Web server. Another mechanism was therefore needed to maintain state across multiple HTTP 1.1 requests. The solution was the introduction of two new headers in the HTTP exchange, Set-Cookie and Cookie, as defined in RFC 2109. These headers carry the state information between the client and server. Typically, most load-balancing switches have enough intelligence to ensure that a particular client's session with a particular server is maintained, based on the cookie inserted by the server and returned by the client.
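The mechanism is visible in the HTTP headers themselves. In the following illustrative exchange (the cookie name and value are made up), the server tags the client on the first response, and a load-balancing switch can then key persistence on the cookie that the client returns:

HTTP/1.1 200 OK
Set-Cookie: SERVERID=web3; path=/

GET /cart.jsp HTTP/1.1
Host: www.a.com
Cookie: SERVERID=web3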


[Figure: cross section of the tiers. Switches SW1 and SW2 provide external network connectivity for the client tier; the network tier provides Layer 3 redundancy, where session-based services require session sharing while stateless services fail over with no problem; the services tier consists of Sun servers with IPMP dual NICs.]
FIGURE 7-7   Network Availability Strategies

Design Considerations: Availability


FIGURE 7-7 shows a cross section of the tier types and the functions performed at each tier, along with the availability strategies for the network and Web tiers. External tier availability strategies are outside the scope of this book. We limit our discussion to the services tiers, which include Web, application services, naming, and so on. Designing network architectures for optimal availability requires maximizing two orthogonal components:

- Intra Availability: Refers to maximizing the function that estimates the availability of the components themselves. Only failures of the components are considered, as captured by the following equation:

  F_Availability = MTBF / (MTBF + MTTR)

  For example, a component with an MTBF of 10,000 hours and an MTTR of 1 hour yields 10,000/10,001, or roughly 99.99 percent availability.

- Inter Availability: Refers to minimizing the impact of failures caused by factors external to the system, such as single points of failure (SPOFs), power outages, or a technician accidentally pulling out a cable.


It is not sufficient to simply maximize the F_Availability function; the SPOF and environmental factors must also be considered. The networks designed in this chapter describe a highly available architecture that conforms to these design principles and is described in further detail later.

[Figure: the logical network architecture of FIGURE 7-1, repeated: the client network (172.16.0.0) and external network (192.168.10.0) in front of the production service networks (Web, naming, application, database, and SAN), with the management network (10.100.0.0) having access to all networks and a backup network (10.110.0.0).]
FIGURE 7-8   Logical Network Architecture: Design Details


FIGURE 7-8 repeats the logical architecture to simplify a detailed discussion. The diagram shows an overview of the logical network architecture and how the tiers map to the different networks, which in turn map to segregated VLANs. This segregation allows inter-tier traffic to be controlled by filters on the switch or by a firewall, which is the only bridge point between VLANs. The following describes each subnetwork:

- External network: The external-facing network that directly connects to the Internet. All IP addresses must be registered and should be secured with a firewall. The following networks are assigned non-routable IP addresses based on RFC 1918, which can be drawn from 10.0.0.0 to 10.255.255.255 (10/8 prefix), 172.16.0.0 to 172.31.255.255 (172.16/12 prefix), and 192.168.0.0 to 192.168.255.255 (192.168/16 prefix).
- Web services network: A dedicated network that contains Web servers. Typical configurations include a load-balancing switch, which can be configured to allow the Web server to answer the client's HTTP request directly or to require the load-balancing device to return the reply on behalf of the Web server.
- Naming services network: A dedicated network of servers that provide LDAP, DNS, NIS, and other naming services. The services are for internal use only and should be highly secure. Internal infrastructure support services must ensure that requests originate from and are destined to internal servers. Most requests tend to be read intensive, which makes caching strategies attractive for increased performance.
- Management network: A dedicated service network that provides management and configuration of all servers, including JumpStart installation of new systems.
- Backup network: A dedicated service network that provides backup and restore operations, pivotal to minimizing disturbances to the production service networks during backup and other bandwidth-intensive operations.
- Device network: A dedicated network that attaches IP storage devices and other devices.
- Application services network: A dedicated network that typically consists of large multi-CPU servers hosting multiple instances of the Sun ONE Application Server software image. These requests tend to be low in network bandwidth but may span multiple protocols, including HTTP, CORBA, proprietary TCP, and UDP. The network traffic can also be significant when Sun ONE Application Server clustering is enabled: every update to a stateful session bean triggers a multicast update to all servers on this dedicated network so that participating cluster nodes update the appropriate stateful session bean. Network utilization increases in direct proportion to the intensity of session bean updates.


- Database network: A dedicated network that typically consists of one or two multi-CPU database servers. The network traffic consists mainly of Java DataBase Connectivity (JDBC) traffic from the application server or the Web server.

Collapsed Layer 2/Layer 3 Network Design


Each service is deployed in a dedicated Class C network where the first three octets represent the network number. The design represents an innovative approach where separate Layer 2 devices are not required because the functionality is collapsed into the core switch. Decreasing the management and configuration of separate devices while maintaining the same functionality is a major step toward cutting costs and increasing reliability.
FIGURE 7-9 shows how a traditional configuration requires two Layer 2 switches. A specific VLAN spans the six segments that give each interface access to the VLAN on failover.

[Figure: an edge switch on the client network feeds a master Layer 3 switch and a standby Layer 3 switch; each connects through its own Layer 2 switch to network interfaces 0 and 1 of the Sun server.]
FIGURE 7-9   Traditional Availability Network Design Using Separate Layer 2 Switches


The design shown in FIGURE 7-10 provides the same network functionality but eliminates the need for the two Layer 2 devices. This is accomplished using a tagged VLAN interconnect between the two core switches. Collapsing the Layer 2 functionality reduces the number of network devices, providing fewer units that might fail, lower cost, and fewer management issues.
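In ExtremeWare terms, collapsing the Layer 2 function amounts to tagging the inter-switch ports. The following excerpt is a minimal sketch for one service VLAN on a core switch (the tag value and port numbers are illustrative assumptions):

create vlan web
configure vlan web tag 10
configure vlan web ipaddress 10.10.0.1 255.255.255.0
# The interconnect port carries all VLANs, so it is tagged
configure vlan web add port 8:1 tagged
# The server-facing port carries only this VLAN
configure vlan web add port 3:1 untagged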

[Figure: the edge switch on the client network connects to master core switch 1 and standby core switch 2, which attach directly to network interfaces 0 and 1 of the Sun server; a tagged VLAN interconnect between the core switches replaces the separate Layer 2 switches.]
FIGURE 7-10   Availability Network Design Using Large Chassis-Based Switches

Multi-Tier Data Center Logical Design


The logical network design for the multi-tier data center (FIGURE 7-11) incorporates redundant server network interfaces and integrated VRRP and IPMP. See "Integrated VRRP and IPMP" on page 280 for more information.


[Figure: clients (172.16.0.1) reach a master virtual router (192.168.0.1) and a slave (192.168.0.2); behind them, paired virtual router addresses 10.10.0.1, 10.20.0.1, 10.30.0.1, 10.40.0.1, and 10.50.0.1 on the two core switches front the server networks.]
FIGURE 7-11   Logical Network Architecture with Virtual Routers, VLANs, and Networks


TABLE 7-1 summarizes the eight separate networks and associated VLANs.

TABLE 7-1   Network and VLAN Design

Name     Network       Default Router   VLAN     Purpose
client   172.16.0.0    172.16.0.1       client   Client load generation
edge     192.16.0.0    192.16.0.1       edge     Connects client network to the data center
web      10.10.0.0     10.10.0.1        web      Web services
ds       10.20.0.0     10.20.0.1        ds       Directory services
db       10.30.0.0     10.30.0.1        db       Database services
app      10.40.0.0     10.40.0.1        app      Application services
dns      10.50.0.0     10.50.0.1        dns      DNS services
mgt      10.100.0.0    10.100.0.1       mgt      Management and administration

The edge network connects to the internal network in a redundant manner. One of the core switches owns the 192.16.0.2 IP address, which means that switch is the master and the other is in slave mode. A switch in slave mode does not respond to any traffic, including ARPs. The master also assumes ownership of the MAC address that floats along with the virtual IP address of 192.16.0.2.
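On Extreme equipment, this master/slave behavior is provided by ESRP on the shared VLANs. The following is a minimal sketch (the priority values are illustrative; the switch with the higher priority wins the election and answers for the floating IP and MAC addresses):

# On the intended master (core switch 1)
configure vlan edge esrp priority 100
enable esrp vlan edge

# On the standby (core switch 2)
configure vlan edge esrp priority 50
enable esrp vlan edge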

Note: If you have multiple NICs, make sure each NIC uses its own unique MAC address.

Each switch is configured with the identical networks and associated VLANs, as shown in TABLE 7-1. An interconnect between the switches extends each VLAN but is tagged to allow multiple VLAN traffic to share a physical link (this requires a network interface, such as the Sun ge, that supports tagged VLANs). The Sun servers connect to both switches in the appropriate slot, where only one of the two interfaces is active at a time. Although most switches support Routing Information Protocol (RIP and RIPv2), Open Shortest Path First (OSPF), and Border Gateway Protocol v4 (BGP4), static routes provide a more secure environment. A redundancy protocol based on the Virtual Router Redundancy Protocol (VRRP, RFC 2338) runs between the virtual routers. The MAC address of the virtual routers floats among the active virtual routers so that the ARP caches of the servers do not need any updates when a failover occurs.
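Two Solaris-side settings follow from this paragraph. The following is a minimal sketch (the interface instance, VLAN tag, and address are illustrative assumptions):

# Give each NIC its own factory MAC address instead of the shared system MAC
eeprom 'local-mac-address?=true'

# Plumb a tagged VLAN interface: the logical instance number is
# (VLAN tag x 1000) + device instance, so VLAN 10 on ge0 is ge10000
ifconfig ge10000 plumb 10.10.0.101 netmask 255.255.255.0 up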


How Data Flows Through the Service Modules


When a client makes a request, it can be handled in one of two ways, depending on the type of request. A Web server might return information to the client directly, or it might forward the request to an application server for further processing. In the case where the client's request is for static content such as images, the request is handled directly by the Web server module. These requests are handled quickly and do not present a heavy load to the client or server. In the case where the client requests dynamically generated content that requires JavaServer Pages (JSP) or servlet processing, the request is passed to the application service module for processing. This is often the bottleneck in large-scale environments. The application server runs the core of the application that handles the business logic to service the client request, either directly or indirectly. Over the course of handling the business logic, the application server can use many supporting resources, including directory servers, databases, and perhaps even other Web application services.
FIGURE 7-12 illustrates how the data flows through the various system interfaces during a typical application services request. TABLE 7-2 provides a description of each numbered interaction.


[Figure: a client request flows through switching services to Web services, application services, directory services, and database services and back; the numbered interactions are described in TABLE 7-2.]
FIGURE 7-12   Logical Network


TABLE 7-2   Sequence of Events for FIGURE 7-12 (Item: Interface1 to Interface2, Protocol; Description)

1. Client to switch (HTTP/HTTPS): Client initiates Web request. Client communication can be HTTP or HTTPS (HTTP with Secure Sockets Layer). HTTPS can be terminated at the switch or at the Web server.
2. Switch to Web server (HTTP/HTTPS): Switch redirects client request to the appropriate Web server.
3. Web server to application server (application server Web connector over TCP): The Web server redirects the request to the application server for processing. Communication passes through a Web server plug-in over a proprietary TCP-based protocol.
4. Application server to directory server (LDAP): The Java 2 Enterprise Edition (J2EE) application hosted by the application server identifies the requested process as requiring specific authorization. It sends a request to the directory server to verify that the user has valid authorization.
5. Directory server to application server (LDAP): The directory server successfully verifies the authorization through the user's LDAP role. The validated response is returned to the application server. The application server then processes the business logic represented in the J2EE application.
6. Application server to database server (JDBC): The business logic requests data from a database as input for processing. The requests may come from servlets, Java Data Objects, or Enterprise JavaBeans (EJBs) that in turn use Java DataBase Connectivity (JDBC) to access the database. The JDBC request can contain any valid SQL statement.
7. Database server to application server (JDBC): The database processes the request natively and returns the appropriate result through JDBC to the application server.
8. Application server to Web server (application server Web connector over TCP): The J2EE application completes the business logic processing, packages the data for display (usually through a JSP that renders HTML), and returns the response to the Web server.
9. Web server to switch (HTTP/HTTPS): Switch receives reply from Web server.
10. Switch to client (HTTP/HTTPS): Switch rewrites IP header and returns the reply to the client.


Physical Network Implementations


The next step involves constructing a real network based on the logical network architecture. Several approaches can be used to realize a network that functionally satisfies the logical architectural requirements. The multi-tier data center is vendor independent, and you can use the network equipment that best suits your environment. We briefly describe the original multi-tier data center implementation (the secure multi-tier architecture), then the multi-switch approach, and finally the collapsed approach.

Secure Multi-Tier
FIGURE 7-13 shows the overall structure of a classic multi-tier design.

[Figure: paired Web, application, and database servers stacked in serial tiers; each tier is reachable only through the tier above it.]
FIGURE 7-13   Secure Multi-Tier

The advantages of this approach are simplicity and security. Clearly the only way to access the data tier is through the application servers; there are no other possible network paths to the data tier. The drawbacks are limited flexibility and manageability. If an application running on the Web server needs to connect to an LDAP server or a database through a JDBC connection, a fundamental change to the architecture is needed. As the number of tiers increases, so does the number of switches, which becomes a management issue.

Multi-Level Architecture Using Many Small Switches


FIGURE 7-14 shows the overall structure of a multi-level architecture that is composed of many switches of small port density.

[Figure: two multilayer switches at the top tier feed pairs of Layer 2 switches, which feed further multilayer switches and rows of Layer 2 switches below them.]
FIGURE 7-14   Multi-Tier Data Center Architecture Using Many Small Switches


This approach has few advantages and many disadvantages. One advantage is that the entry cost is low. One can start from a very small deployment, procuring small eight-port multilayer switches and Layer 2 switches, and increase the tiers and servers to the point where the ingress links become a bottleneck or the port density of the small multilayer switches becomes an issue. Actual tested configurations leveraged Alteon 180 switches as the multilayer switches and Extreme Networks Summit 48i switches, which had Gigabit uplinks and 10/100 ports for server connections, at Layer 2. This architecture has the following disadvantages:

- Lower Availability: Because of the number of links and devices, more things can go wrong. In particular, the serial connections drastically reduce the MTBF. The links are often prone to accidents and should be kept to a minimum. Because of the number of layers, link failure detection and recovery are much slower.
- Waste: In any network architecture, stateless functionality should be deployed toward the center of the network and complex processing should be deployed at the outermost edge. Having two layers of multilayer switches is a tremendous waste in terms of packet processing and equipment cost. When a packet undergoes Layer 7 processing, especially in software, it is extremely slow. A multilayer switch also costs much more than a plain Layer 2 or Layer 3 device.
- Manageability: As the number of switches increases, so does the management workload.

Flat Architecture Using Collapsed Large Chassis Switches


The flat network architecture using collapsed large chassis switches was found to be the best design for large-scale multi-tier deployments in availability, performance, and manageability. In the lab, we built two different network configurations. One configuration used Extreme Networks equipment (FIGURE 7-15), and the other used Foundry Networks equipment (FIGURE 7-16). The Extreme Networks switch that we used has built-in load balancing, so there was no need for an external load-balancing device. The Foundry Networks products required use of a separate load-balancing switch.


[Figure: clients 1 and 2 attach through a Layer 2-Layer 3 edge switch (192.168.10.1) to two Extreme core switches (192.168.10.2 master and 192.168.10.3), which host the service router addresses 10.10.0.1 through 10.50.0.1 and connect the Web service tier (Sun Fire 280R servers 10.10.0.100-103), the directory service tier (Sun Fire 280R servers 10.20.0.100-103), the application service tier (Sun Fire 6800 servers 10.40.0.100-101), and the database service tier (Sun Fire 6800 servers 10.30.0.100-101 with T3 storage arrays).]
FIGURE 7-15   Network Configuration with Extreme Networks Equipment


[Figure: the same tier layout as FIGURE 7-15, with Foundry master and standby core switches (192.168.10.2 and 192.168.10.3) behind the edge switch and separate server load-balancer switches in front of the Web, directory, application, and database service tiers.]
FIGURE 7-16   Sun ONE Network Configuration with Foundry Networks Equipment



Physical Network: Connectivity

The physical wiring of the architecture is shown in FIGURE 7-17 and described in TABLE 7-3.

TABLE 7-3   Physical Network Connections and Addressing

Switch  Description                                 Port     PHY  Base Address  Netmask
edge    Client network to external network router   1,2,3,4  ge   172.16.0.1    255.255.255.0
edge    External network - mls1                     5,6      ge   192.168.10.1  255.255.255.0
mls1    External network                            1        ge   192.168.10.2  255.255.255.0
mls1    Web/app service router                      3,4,5,6  ge   10.10.0.1     255.255.255.0
mls1    Directory service router                    7,8      ge   10.20.0.1     255.255.255.0
mls1    Database services router                    9,10     ge   10.30.0.1     255.255.255.0
mls2    External network                            1        ge   192.168.10.2  255.255.255.0
mls2    Web/app service router                      3,4,5,6  ge   10.10.0.1     255.255.255.0
mls2    Directory services router                   7,8      ge   10.20.0.1     255.255.255.0
mls2    Database services router                    9,10     ge   10.30.0.1     255.255.255.0


[Figure: physical wiring detail. Clients 1 and 2 (172.16.0.101 and 172.16.0.102) attach to the edge switch, which connects to core switches mls1 and mls2 (192.168.0.x); the web servers (ge0/ge1 pairs in 10.10.0.0), application servers (10.40.0.0), directory servers (10.20.0.0), and database servers (10.30.0.0) dual-attach to both core switches, with hme0 interfaces on the 10.100.0.0 management network.]
FIGURE 7-17   Physical Network Connections and Addressing



Switch Configuration
A high-level overview of the switch configuration is shown in FIGURE 7-18.

FIGURE 7-18  Collapsed Design Without Layer 2 Switches

[Figure: an Extreme Networks Summit 7i edge switch (client 172.16.0.1, edge 192.168.0.2) connects to two Extreme Networks BlackDiamond 6808 core switches. Each BlackDiamond hosts the edge (192.168.0.2), web (10.10.0.1), ds (10.20.0.1), db (10.30.0.1), app (10.40.0.1), dns (10.50.0.1), and mgt (10.100.0.1) VLANs across its eight slots, and the two cores are joined by an ESRP interconnect carrying the web, ds, db, app, and dns VLANs.]


Configuring the Extreme Networks Switches


For the multi-tier data center, two Extreme Networks BlackDiamond switches served as the core switches, and one Summit 7i switch served as the edge switch.

Note - Network equipment from Foundry Networks can be used instead. See "Configuring the Foundry Networks Switches" on page 324.

To Configure the Extreme Networks Switches

1. Configure the core switches.
The following example shows an excerpt of the switch configuration file:

#
# MSM64 Configuration generated Thu Dec 6 20:19:20 2001
# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/01 11:34:27
configure slot 1 module g8x
configure slot 2 module g8x
configure slot 3 module g8x
configure slot 4 module g8x
configure slot 5 module g8x
configure slot 6 module g8x
configure slot 7 module f48t
configure slot 8 module f48t
.....................................................
configure dot1q ethertype 8100
configure dot1p type dot1p_priority 0 qosprofile QP1
configure dot1p type dot1p_priority 1 qosprofile QP2
configure dot1p type dot1p_priority 2 qosprofile QP3
configure dot1p type dot1p_priority 3 qosprofile QP4
.....................................................
enable sys-health-check
configure sys-health-check alarm-level log
enable system-watchdog
config qosprofile QP1 minbw 0% maxbw 100% priority Low minbuf 0% maxbuf 0 K
config qosprofile QP2 minbw 0% maxbw 100% priority LowHi minbuf 0% maxbuf 0 K


2. Configure the edge switch.
The following example shows an excerpt of the switch configuration file:

#
# Summit7i Configuration generated Mon Dec 10 14:39:46 2001
# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/01 11:34:27
configure dot1q ethertype 8100
configure dot1p type dot1p_priority 0 qosprofile QP1
....................................................
enable system-watchdog
config qosprofile QP1 minbw 0% maxbw 100% priority Low minbuf 0% maxbuf 0 K
....................................................
delete protocol ip
delete protocol ipx
delete protocol netbios
delete protocol decnet
delete protocol appletalk
....................................................
# Config information for VLAN Default.
config vlan Default tag 1    # VLAN-ID=0x1 Global Tag 1
config vlan Default protocol ANY
config vlan Default qosprofile QP1
enable bootp vlan Default
....................................................
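The excerpts above omit the per-tier VLAN definitions shown in FIGURE 7-18. As a rough sketch of what one such definition might look like in ExtremeWare-style syntax (the port list here is an illustrative assumption, not taken from the lab configuration files; the VLAN name and address follow FIGURE 7-18):

create vlan web
configure vlan web ipaddress 10.10.0.1 255.255.255.0
configure vlan web add port 1:1-1:4
enable esrp vlan web

Each remaining tier (ds, db, app, dns) would be defined the same way, and enabling ESRP on the VLANs is what lets the two BlackDiamond cores provide the active/standby redundancy shown in the figure.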

Configuring the Foundry Networks Switches


This section describes the network architecture implementation using Foundry Networks equipment instead of Extreme Networks equipment. The overall setup is shown in FIGURE 7-19.


FIGURE 7-19  Foundry Networks Implementation

[Figure: clients enter through NS5200 Netscreen firewalls and an S7i Extreme Networks Summit 7i edge switch, then through the SLB0 and SLB1 server load balancers and the MLS0 and MLS1 BigIron Layer 2/3 switches, which connect to the server groups making up the Web service, directory service, application service, and database service modules.]


Master Core Switch Configuration


CODE EXAMPLE 7-1 shows an example of the configuration file for the master core switch (called MLS0 in the lab). We used the Foundry Networks BigIron switch.

CODE EXAMPLE 7-1  MLS0 Configuration File

module 1 bi-jc-8-port-gig-m4-management-module
module 3 bi-jc-48e-port-100-module
!
global-protocol-vlan
!
vlan 1 name DEFAULT-VLAN by port
vlan 10 name refarch by port
 untagged ethe 1/1 ethe 3/1 to 3/16
 router-interface ve 10
vlan 99 name mgmt by port
 untagged ethe 3/47 to 3/48
 router-interface ve 99
!
hostname MLS0
ip default-network 129.146.138.0/16
ip route 192.168.0.0 255.255.255.0 172.0.0.1
ip route 129.148.181.0 255.255.255.0 129.146.138.1
ip route 0.0.0.0 0.0.0.0 129.146.138.1
!
router vrrp-extended
interface ve 10
 ip address 20.20.0.102 255.255.255.0
 ip address 172.0.0.70 255.255.255.0
 ip vrrp-extended vrid 1
  backup priority 100 track-priority 20
  advertise backup
  ip-address 172.0.0.10
  dead-interval 1
  track-port e 3/1
  enable
 ip vrrp-extended vrid 2
  backup priority 100 track-priority 20
  advertise backup
  ip-address 20.20.0.100
  dead-interval 1
  track-port e 3/13
  enable
!
interface ve 99
 ip address 129.146.138.10 255.255.255.0
end


Standby Core Switch Configuration


CODE EXAMPLE 7-2 shows a partial listing of the configuration file for the standby core switch (called MLS1 in the lab). Again we used the Foundry Networks BigIron switch.

CODE EXAMPLE 7-2  MLS1 Configuration File

ver 07.5.05cT53
!
module 1 bi-jc-8-port-gig-m4-management-module
module 3 bi-jc-48e-port-100-module
!
global-protocol-vlan
!
vlan 1 name DEFAULT-VLAN by port
!
vlan 99 name swan by port
 untagged ethe 1/6 to 1/8
 router-interface ve 99
!
vlan 10 name refarch by port
 untagged ethe 3/1 to 3/16
 router-interface ve 10
!
hostname MLS1
ip default-network 129.146.138.0/1
ip route 192.168.0.0 255.255.255.0 172.0.0.1
ip route 0.0.0.0 0.0.0.0 129.146.138.1
!
router vrrp-extended
interface ve 10
 ip address 20.20.0.102 255.255.255.0
 ip address 172.0.0.71 255.255.255.0
 ip vrrp-extended vrid 1
  backup priority 100 track-priority 20
  advertise backup
  ip-address 172.0.0.10
  dead-interval 1
  track-port e 3/1
  enable
 ip vrrp-extended vrid 2
  backup priority 100 track-priority 20
  advertise backup
  ip-address 20.20.0.100
  dead-interval 1
  track-port e 3/13
  enable

interface ve 99


 ip address 129.146.138.11 255.255.255.0
!
!
sflow sample 512
sflow source ethernet 3/1
sflow enable
!
!
end

Server Load Balancer Configuration

CODE EXAMPLE 7-3 shows a partial listing of the configuration file used for the server load balancer (called SLB0 in the lab). We used the Foundry Networks ServerIron XL.
CODE EXAMPLE 7-3  SLB0 Configuration File

ver 07.3.05T12
global-protocol-vlan
!
!
server source-ip 20.20.0.50 255.255.255.0 172.0.0.10
!
!
server real web1 10.20.0.1
 port http
 port http url "HEAD /"
!
server real web2 10.20.0.2
 port http
 port http url "HEAD /"
!
!
server virtual WebVip1 192.168.0.100
 port http
 port http dsr
 bind http web1 http web2 http
!
!
vlan 1 name DEFAULT-VLAN by port

no spanning-tree
!
hostname SLB0
ip address 192.168.0.111 255.255.255.0
ip default-gateway 192.168.0.10
web-management allow-no-password
banner motd ^C
Reference Architecture -- Enterprise Engineering^C
Server Load Balancer -- SLB0 129.146.138.12/24^C
!
!
end
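Two details in this listing carry the load-balancing behavior: the port http url "HEAD /" lines define the HTTP health check run against each real server, and port http dsr enables direct server return so that replies bypass the load balancer. Expanding the pool follows the same pattern. As an illustrative sketch only (web3 and its address are hypothetical, not part of the lab setup), a third real server would be declared and bound to the existing VIP like this:

server real web3 10.20.0.3
 port http
 port http url "HEAD /"
!
server virtual WebVip1 192.168.0.100
 port http
 port http dsr
 bind http web1 http web2 http web3 http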

Standby Server Load Balancer Configuration

CODE EXAMPLE 7-4 shows a partial listing of the configuration file used for the standby server load balancer (called SLB1 in the lab). Again, we used the Foundry Networks ServerIron XL.
CODE EXAMPLE 7-4  SLB1 Configuration File

ver 07.3.05T12
global-protocol-vlan
!
!
server source-ip 20.20.0.51 255.255.255.0 172.0.0.10
!
!
server real s1 20.20.0.1
 port http
 port http url "HEAD /"
!
server real s2 20.20.0.2
 port http
 port http url "HEAD /"
!
!
server virtual vip1 172.0.0.11
 port http
 port http dsr
 bind http s1 http s2 http
!
!
vlan 1 name DEFAULT-VLAN by port

!
hostname SLB1
ip address 172.0.0.112 255.255.255.0
ip default-gateway 172.0.0.10
web-management allow-no-password
banner motd ^C
Reference Architecture - Enterprise Engineering^C
Server Load Balancer - SLB1 - 129.146.138.13/24^C
!

Network Security
For the Sun ONE network configuration, firewalls were configured between each service module to provide network security. FIGURE 7-20 shows the relationship between the firewalls and the service modules.


FIGURE 7-20  Firewalls Between Service Modules

[Figure: clients reach the edge switch from the intranet/Internet; a firewall sits in front of the Web service tier, a second firewall between the Web service tier and the application service tier, and a third between the application service tier and the database service tier.]

In the lab, one physical firewall device was used to create multiple virtual firewalls, and network traffic was directed to pass through them between the service modules, as shown in FIGURE 7-21. The core switch is configured for Layer 2 only, with separate port-based VLANs, and the connection between the Netscreen and the core switch uses tagged VLANs. Trust zones created on the Netscreen device map directly to the tagged VLANs, and the Netscreen firewall device performs the Layer 3 routing. This configuration forces all inter-module traffic through the firewall, providing firewall protection between each service module.
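To make the Layer 2 side of this concrete, the following is a minimal sketch in the same BigIron configuration style as CODE EXAMPLE 7-1. The VLAN numbers and port assignments here are illustrative assumptions, not taken from the lab configuration files; the essential points are that each service module gets its own port-based VLAN, the single Netscreen-facing port carries all of them tagged, and no router-interface is defined, which leaves the Netscreen as the only Layer 3 hop between modules:

vlan 10 name web by port
 untagged ethe 3/1 to 3/8
 tagged ethe 1/1
!
vlan 20 name appsrvr by port
 untagged ethe 3/9 to 3/16
 tagged ethe 1/1
!
vlan 30 name db by port
 untagged ethe 3/17 to 3/24
 tagged ethe 1/1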


FIGURE 7-21  Virtual Firewall Architecture Using Netscreen and Foundry Networks Products

[Figure: client traffic from the intranet/Internet passes through the edge switch to a Netscreen device on each side. Between each Netscreen device and its core switch, the Web, application, and database traffic is multiplexed on one tagged VLAN; each core switch then breaks the traffic out onto separate Web, application, and database VLANs that connect to the Web service, application service, and database service tiers.]


Netscreen Firewall
CODE EXAMPLE 7-5 shows a partial example of a configuration file used to configure the Netscreen device.

CODE EXAMPLE 7-5  Configuration File Used for Netscreen Device

set auth timeout 10
set clock "timezone" 0
set admin format dos
set admin name "netscreen"
set admin password nKVUM2rwMUzPcrkG5sWIHdCtqkAibn
set admin sys-ip 0.0.0.0
set admin auth timeout 0
set admin auth type Local
set zone id 1000 "DMZ1"
set zone id 1001 "web"
set zone id 1002 "appsrvr"
set zone "Untrust" block
set zone "DMZ" vrouter untrust-vr
set zone "MGT" block
set zone "DMZ1" vrouter trust-vr
set zone "web" vrouter trust-vr
set zone "appsrvr" vrouter trust-vr
set ip tftp retry 10
set ip tftp timeout 2
set interface ethernet1 zone DMZ1
set interface ethernet2 zone web
set interface ethernet3 zone appsrvr
set interface ethernet1 ip 192.168.0.253/24
set interface ethernet1 route
set interface ethernet2 ip 10.10.0.253/24
set interface ethernet2 route
set interface ethernet3 ip 20.20.0.253/24
set interface ethernet3 route
unset interface vlan1 bypass-others-ipsec
unset interface vlan1 bypass-non-ip
set interface ethernet1 manage ping
unset interface ethernet1 manage scs
unset interface ethernet1 manage telnet
unset interface ethernet1 manage snmp
unset interface ethernet1 manage global
unset interface ethernet1 manage global-pro
unset interface ethernet1 manage ssl
set interface ethernet1 manage web


unset interface ethernet1 ident-reset
set interface vlan1 manage ping
set interface vlan1 manage scs
set interface vlan1 manage telnet
set interface vlan1 manage snmp
set interface vlan1 manage global
set interface vlan1 manage global-pro
set interface vlan1 manage ssl
set interface vlan1 manage web
set interface v1-trust manage ping
set interface v1-trust manage scs
set interface v1-trust manage telnet
set interface v1-trust manage snmp
set interface v1-trust manage global
set interface v1-trust manage global-pro
set interface v1-trust manage ssl
set interface v1-trust manage web
unset interface v1-trust ident-reset
unset interface v1-untrust manage ping
unset interface v1-untrust manage scs
unset interface v1-untrust manage telnet
unset interface v1-untrust manage snmp
unset interface v1-untrust manage global
unset interface v1-untrust manage global-pro
unset interface v1-untrust manage ssl
unset interface v1-untrust manage web
unset interface v1-untrust ident-reset
set interface v1-dmz manage ping
unset interface v1-dmz manage scs
unset interface v1-dmz manage telnet
unset interface v1-dmz manage snmp
unset interface v1-dmz manage global
unset interface v1-dmz manage global-pro
unset interface v1-dmz manage ssl
unset interface v1-dmz manage web
unset interface v1-dmz ident-reset
set interface ethernet2 manage ping
unset interface ethernet2 manage scs
unset interface ethernet2 manage telnet
unset interface ethernet2 manage snmp
unset interface ethernet2 manage global
unset interface ethernet2 manage global-pro
unset interface ethernet2 manage ssl


unset interface ethernet2 manage web
unset interface ethernet2 ident-reset
set interface ethernet3 manage ping
unset interface ethernet3 manage scs
unset interface ethernet3 manage telnet
unset interface ethernet3 manage snmp
unset interface ethernet3 manage global
unset interface ethernet3 manage global-pro
unset interface ethernet3 manage ssl
unset interface ethernet3 manage web
unset interface ethernet3 ident-reset
set interface v1-untrust screen tear-drop
set interface v1-untrust screen syn-flood
set interface v1-untrust screen ping-death
set interface v1-untrust screen ip-filter-src
set interface v1-untrust screen land
set flow mac-flooding
set flow check-session
set address DMZ1 "dmznet" 192.168.0.0 255.255.255.0
set address web "webnet" 10.10.0.0 255.255.255.0
set address appsrvr "appnet" 20.20.0.0 255.255.255.0
set snmp name "ns208"
set traffic-shaping ip_precedence 7 6 5 4 3 2 1 0
set ike policy-checking
set ike respond-bad-spi 1
set ike id-mode subnet
set l2tp default auth local
set l2tp default ppp-auth any
set l2tp default radius-port 1645
set policy id 0 from DMZ1 to web "dmznet" "webnet" "ANY" Permit
set policy id 1 from web to DMZ1 "webnet" "dmznet" "ANY" Permit
set policy id 2 from DMZ1 to appsrvr "dmznet" "appnet" "ANY" Permit
set policy id 3 from appsrvr to DMZ1 "appnet" "dmznet" "ANY" Permit
set ha interface ethernet8
set ha track threshold 255
set pki authority default scep mode "auto"
set pki x509 default cert-path partial
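The policies near the end of the listing permit "ANY" service between the zones, which keeps a lab exercise simple but is broader than a production deployment needs. As a sketch of how the same ScreenOS policy syntax could be narrowed (the policy IDs here are illustrative; HTTP is one of the ScreenOS predefined services):

set policy id 0 from DMZ1 to web "dmznet" "webnet" "HTTP" Permit
set policy id 1 from web to DMZ1 "webnet" "dmznet" "HTTP" Permit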


APPENDIX A

Lyapunov Analysis
This appendix provides an outline of the mathematical proof that shows why the least connections server load balancing (SLB) algorithm is inherently stable. This means that over a long period of time, the system will ensure that the load is evenly balanced. This analysis can be used to model and verify the stability of any network design, which may be of tremendous value if you are an advanced network architect. Building on what was discussed in Chapter 3, we will extend the model of the single queue to that of the entire system and then show that the entire system is stable. The entire system consists of an aggregate ingress load λ and N server processes of varying service rates μ1, μ2, ..., μN, hence we get the following equation:

EQN 1: S = λ + μ1 + μ2 + ... + μN

We will use this equation later. It states that the value S is the sum of the aggregate load and the sum of all the service rates. This means that in one time slot:

λ/S is the average share of all incoming load
μi/S is the average share of server i's processing capacity

Since the incoming packets are modeled as Poisson arrivals, which is in continuous time, we map the time domain to a discrete index t that increases whenever the state of the system changes. The state is defined as the queue occupancy: if a packet arrives, it increases the size of one of the queues in the system; if a packet is serviced, the size of one queue decreases.

Let Qs(t) = min(Q1(t), Q2(t), ..., QN(t)); this Qs is the least occupied of the N queues. Let Qb(t) = {Q1(t), Q2(t), ..., QN(t)} - {Qs(t)}, which is all the queues except the least occupied. Let Qa(t) = Qb(t) + Qs(t), which is all the queues.


We know that the next state of each queue in the set Qb(t) can change only because of a Web service, which reduces it by one request. These queues cannot grow in the next time slot because the SLB will not forward any new requests to them, so we get:

Qib(t+1) = Qib(t) - 1 with probability μib/S

We can also work out the next possible state of Qs(t), which can change either because of a Web service, reducing the queue size by 1, or because the SLB forwards a request to this queue, increasing it by 1. Hence we get the next state as follows:

Qs(t+1) = Qs(t) - 1 with probability μs/S, or Qs(t) + 1 with probability λ/S

We assign as the Lyapunov function the sum of the occupancies of all N queues, using t to represent a particular time slot:

L(t) = Q1(t) + Q2(t) + ... + QN(t) = Σb Qib(t) + Qs(t)

Now if we look at one particular queue Qi(t), keeping time discrete, the state of Qi(t) changes only on arrival and/or departure events, so we can see exactly how this queue increases and decreases in occupancy. For stability, we need to show:

EL = E[ L(t+1) - L(t) | L(t) ] <= -ε ||Q|| + k

This says that the expected value of the single-step drift (the Lyapunov function at time t+1 minus the Lyapunov function at time t, given the Lyapunov function at time t) must be bounded by a negative constant times the queue size plus some constant k. The value of EL becomes negative once the queue size times ε is larger than k. This is typical of almost all systems: before the system reaches a steady state there is an initial unstable period, but after some time a steady state is reached, and it is the steady-state behavior we need to examine.

EL = E[ L(t+1) - L(t) | L(t) ]
   = E[ Σb (μib/S)(Qib(t) - 1) + (μs/S)(Qs(t) - 1) + (λ/S)(Qs(t) + 1) - Σb (μib/S)Qib(t) - ((μs + λ)/S)Qs(t) | L(t) ]
   = E[ (λ/S)(Qs(t) + 1) - (λ/S)Qs(t) + (μs/S)(Qs(t) - 1) - (μs/S)Qs(t) + Σb (μib/S)(Qib(t) - 1) - Σb (μib/S)Qib(t) ]


   = E[ λ/S - μs/S - Σb μib/S ]
   = λ/S - μs/S - Σb μib/S

This is always negative as long as:

λ < μs + Σb μib
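For readers who prefer the criterion in compact form, the derivation above can be restated as a standard negative-drift (Foster-Lyapunov) condition; this adds nothing beyond the text, it only rewrites it in conventional notation:

\[
L(t) = \sum_{i=1}^{N} Q_i(t), \qquad
\mathbb{E}\bigl[L(t+1) - L(t) \,\big|\, L(t)\bigr]
= \frac{\lambda}{S} - \frac{\mu_s}{S} - \sum_{b} \frac{\mu_{i_b}}{S} < 0
\quad\Longleftrightarrow\quad \lambda < \sum_{i=1}^{N} \mu_i .
\]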

This means that since:

- all incoming traffic is directed by the SLB algorithm to the least occupied queue, the total arrival rate is λ, and
- all Web server capacity at time slot t is μs + Σb μib = μ,

we conclude that as long as the incoming traffic is admissible, that is, λ < μ, the system is stable. This proves that the SQF (smallest queue first) algorithm is guaranteed to drain all queues in such a way that the system remains stable. If we had a round-robin SLB algorithm instead, we would not get this mathematical result. In particular, there is no way to enforce λi < μi for every individual server, so some Qi(t) can overflow even though the overall average incoming traffic is less than the overall average server capacity (λ < μ). The round-robin SLB blindly forwards incoming traffic to servers without considering the occupancy of Qi(t); it can easily leave some servers idle while continuing to forward traffic to an overloaded server, resulting in instability. In the SQF algorithm, only the shortest queue is forwarded traffic, and the other queues can only drain. As long as Qs does not overflow, the entire system is stable, and since λ < μ, we know that Qs(t) is stable.


Glossary
This glossary defines terms and acronyms used in this book.

A

ABR  Available Bit Rate
ACK  Acknowledgement flag, TCP header
ACL  Access Control List
ANAR  Auto-Negotiation Advertisement Register
ANER  Auto-Negotiation Expansion Register
API  Application Programming Interface
Application Server  A host computer that provides access to a software application. In the context of this Reference Architecture, it is used to mean a J2EE Application Server, which essentially serves as an enterprise platform for Java applications. See J2EE.
ARI/FCI  Address Recognition Indicator/Frame Copier Indicator
ARP  Address Resolution Protocol
ASIC  Application Specific Integrated Circuit
ATM  Asynchronous Transfer Mode

B

BER  Bit Error Rate
BGP  Border Gateway Protocol
BMCR  Basic Mode Control Register
BMP  Bean Managed Persistence
BMSR  Basic Mode Status Register
BPDU  Bridge Protocol Data Unit

C

CBQ  Class-Based Queuing
CBR  Constant Bit Rate
CBS  Committed Burst Size
CGI  Common Gateway Interface
CIR  Committed Information Rate
CLEC  Competitive Local Exchange Carrier
CMP  Container Managed Persistence
Congestion Window  The window added by slow start to the sender's TCP, called cwnd. When a new connection is established with a host on another network, the congestion window is initialized to one segment (that is, the segment size announced by the other end).

D

DAC  Dual-Attached Concentrator
DAPL  Direct Access Programming Library
DAS  Dual-Attached Station
DLPI  Data Link Provider Interface
DMA  Direct Memory Access
DMLT  Distributed Multilink Trunking
DNS  Domain Name Service
DoS  Denial of Service
DRAM  Dynamic Random Access Memory
DSR  Direct Server Return
DTR  Dedicated Token Ring

E

EBS  Excess Burst Size
EJB  Enterprise JavaBean
ESRP  Extreme Standby Routing Protocol
Edge data center switch  The integration point to the customer's existing backbone network. This is the switch that connects the data center to the customer's backbone network.

F

Failover  A characteristic of a highly available component or service that describes the ability to switch to another equivalent component or service so that the overall availability is still maintained. See High Availability.
FDDI  Fiber Distributed Data Interface
FIB  Forwarding Information Base
FIFO  First In, First Out
FPGA  Field Programmable Gate Array

G

GCR  Gigabit Control Register
GESR  Gigabit Extended Status Register
GFR  Guaranteed Frame Rate
GMII  Gigabit Media Independent Interface
GSR  Gigabit Status Register

H

High Availability (HA)  General term used to describe the ability of a component or service to be running and therefore available.
HOL  Head of Line Blocking
HTTP  (Hypertext Transfer Protocol) The Internet protocol based on TCP/IP that fetches hypertext objects from remote hosts.
HTTPS  HTTP over SSL

I

ias  iPlanet Application Server
ILEC  Incumbent Local Exchange Carrier
Integratable  In the context of an integrated stack, represents a mixture of third-party software products that support open standards such as Java and Java technologies for SOAP, UDDI, XML, and WSDL. These products can be combined to deliver a customer solution and should work together given their support of these open standards.
Integrated  In the context of an integrated stack, represents Sun's software products that implement the Sun ONE architecture to deliver a fully optimized, tested, and supported system to maximize value to customers.
IOMMU  input/output memory management unit
IP  Internet Protocol
IPG  Inter-Packet Gap
IPMP  Internet Protocol Multipathing
isapi  Microsoft's Internet server application programming interface
ISP  Internet Service Provider
IXC  Inter Exchange Carrier

J

J2EE  (Java 2 Platform, Enterprise Edition) Set of standards that leverages J2SE technology and simplifies Java development by offering standardized, modular components, by providing a complete set of services to those components, and by handling many details of application behavior automatically, without complex programming. This is the standard on which the Sun ONE Application Server is based. See http://java.sun.com/j2ee/.
J2SE  (Java 2 Platform, Standard Edition) Represents the set of technologies that provides the runtime environment and Software Development Kit for Java development. See http://java.sun.com/j2se/.
Java  An object-oriented programming language developed by Sun Microsystems. The Write Once, Run Anywhere programming language.
Java 2 SDK  The software development kit that developers need to build applications for the Java 2 Platform, Standard Edition, v. 1.2. See also JDK.
JavaBeans  A portable, platform-independent reusable component model. See http://java.sun.com/.
Java RMI  (Java Remote Method Invocation) (n.) A distributed object model for Java program to Java program, in which the methods of remote objects written in the Java programming language can be invoked from other Java virtual machines, possibly on different hosts.
JAXM  (Java API for XML Messaging) Enables applications to send and receive document-oriented XML messages using a pure Java API. JAXM implements Simple Object Access Protocol (SOAP) 1.1 with Attachments messaging so that developers can focus on building, sending, receiving, and decomposing messages for their applications instead of programming low-level XML communications routines. See http://java.sun.com/xml/jaxm/index.html.
JAXR  (Java API for XML Registries) Provides a uniform and standard Java API for accessing different kinds of XML Registries. See http://java.sun.com/xml/jaxr/index.html.
JAXRPC  (Java API for XML-based RPC) Enables Java technology developers to build Web applications and Web services incorporating XML-based RPC functionality according to the SOAP 1.1 specification. See http://java.sun.com/xml/jaxrpc/index.html.
JDBC  Java DataBase Connectivity
JDK  (Java Development Kit) The software that includes the APIs and tools that developers need to build applications for those versions of the Java platform that preceded the Java 2 Platform. See also Java 2 SDK.
JNDI  Java Naming and Directory Interface
JRE  (Java runtime environment) A subset of the Java Development Kit (JDK) for users and developers who want to redistribute the runtime environment. The Java runtime environment consists of the Java virtual machine (JVM), the Java core classes, and supporting files.
JSP  (JavaServer Pages) Technology that allows Web developers and designers to rapidly develop and easily maintain information-rich, dynamic Web pages that leverage existing business systems. See http://java.sun.com/products/jsp/.

L

LAA  Locally Administered Address
LACP  Link Aggregation Control Protocol
LAN  Local Area Network
LDAP  (Lightweight Directory Access Protocol) The Internet standard for directory lookups.
LLC  Logical Link Control
LPNAR  Link Partner Auto-negotiation Advertisement Register
LSP  Link State Packet

M

MAC  Media Access Control
MAU  Media Access Unit
MDT  Multi-Data Transmission
MII  Media Independent Interface
M/M/1 queue  A single-server queue with Markovian (Poisson) arrivals and exponentially distributed service times.
MSS  Maximum Segment Size
MTBF  Mean Time Between Failures
MTTR  Mean Time Till Recovery
MTU  Maximum Transmission Unit
Multi-tier architecture  An architecture organized as a series of tiers: 1. Client Tier, 2. Web Tier, 3. Application Tier, 4. Database Tier. For a given custom application, multiples of any of these tiers may be used (thus n-tier). There is no implied relationship between tiers and machines, but collapsing all the tiers onto a single machine would not be network centric.

N

NAP  Network Access Point
NAT  Network Address Translation
NFS  Network File System
NIC  Network Interface Card (or Controller)
NSAPI  Netscape Application Programming Interface
NSP  Network Service Provider

O

Operating System  A collection of programs that monitor the use of the system and supervise the other programs executed by it.
OSPF  Open Shortest Path First

P

PHY  Physical layer
ping  (1) (n.) (Packet Internet Groper) A small program (ICMP ECHO) that a computer sends to a host and times on its return path. (2) (v.) To test the reach of destinations by sending them an ICMP ECHO: "Ping host X to see if it is up!"
PIR  Peak Information Rate
PPA  Primary Point of Attachment
Presentation Service  Term used to describe a service that presents the data that is returned to the end user. In this context, the presentation service was delivered by a tier of Web servers that served up JSP/servlet traffic for viewing by the client Web browsers.
Protocol Manager  Enables communication between Sun ONE Application Server and a client. Manages and provides services for all active, loaded listeners. Supports HTTP, HTTPS (HTTP over SSL), and IIOP.

Q

QoS  (Quality of Service) Measures the ability of network and computing systems to provide different levels of services to selected applications and associated network flows.

R

RBOC  (n.) Regional Bell Operating Company
RDMA  Remote Direct Memory Access
RED  Random Early Detection
Remote system  (n.) A system other than the one on which you are working.
RLDRAM  reduced latency DRAM
RIP  (n.) Routing Information Protocol, an IGP supplied with Berkeley UNIX
RJ-45 connector  (n.) A modular cable connector standard used with consumer telecommunications equipment, such as systems equipped for ISDN connectivity.
RMI  (n.) Remote Method Invocation (See Java RMI.)
root  (n.) In a hierarchy of items, the one item from which all other items are descended. The root item has nothing above it in the hierarchy. See also class, hierarchy, package, root directory, root file system, and root user name.
root directory  (n.) The base directory from which all other directories stem, directly or indirectly.
root disk  (n.) On Sun server systems, the disk drive where the operating system resides. The root disk is located in the SCSI tray behind the front panel.
root file system  (n.) A file system residing on the root device (a device predefined by the system at initialization) that anchors the overall file system.
root user name  (n.) The SunOS user name that grants special privileges to the person who logs in with that ID. The user who can supply the correct password for the root user name is given superuser privileges for the particular machine.
root window  (1) (n.) In the X protocol, a window with no parent window. Each screen has a root window that covers it. (2) (adj.) Characteristic of an input method that uses a pre-editing window that is a child of the root window.
Router  A system that assigns a path for network (or Internet) traffic to follow based on IP address.
RR  Round-robin method of load balancing.
RTT  Round Trip Time

S

SAC  single-attached concentrator
SACK  selective acknowledgement
SAS  single-attached station
SBus  (n.) A 32-bit self-identifying bus used mainly on SPARC workstations. The SBus provides information to the system so that it can identify the device driver that needs to be used. An SBus device might need to use hardware configuration files to augment the information provided by the SBus card. See also PCI bus.
SBus bridge  (n.) A device providing additional SBus slots by connecting two SBuses. Generally, a bus bridge is functionally transparent to devices on the SBus. However, there are cases (for example, bus sizing) in which bus bridges can change the exact way a series of bus cycles is performed. Also called an SBus coupler.
SBus controller  (n.) The hardware responsible for performing arbitration, addressing translation and decoding, driving slave selects and address strobe, and generating timeouts.
SBus device  (n.) A logical device attached to the SBus. This device might be on the motherboard or on an SBus expansion card.
SBus expansion card  (n.) A physical printed circuit assembly that conforms to the single- or double-width mechanical specifications and that contains one or more SBus devices.
SBus expansion slot  (n.) An SBus slot into which you can install an SBus expansion card.
SBus ID  (n.) A special series of bytes at address 0 of each SBus slave that identifies the SBus device.
SDP  Sockets Direct Protocol
Security (SSL)  (Secure Sockets Layer) A protocol developed for transmitting private documents via the Internet. SSL works by using a public key to encrypt data that's transferred over the SSL connection.
Services on Demand  Ability to provide information, data, and applications to anyone, anytime, anywhere on any device. Includes Web services technology, but also includes technology you are using today and could use in the future.
SFM  Switch Fabric Module
SLA  Service Level Agreement
SLB  server load balancing
Sliding window  A TCP flow control protocol that allows the sender to transmit multiple packets before it stops and waits for an acknowledgment.
SMLT  Split Multilink Trunking
SNA  Systems Network Architecture
SOHO  Small Office/Home Office
Solaris Operating System  The Sun Microsystems open standards-based UNIX operating system. The Solaris Operating System, the foundation for the Sun ONE software architecture, delivers security, manageability, and performance.
SPOF  single point of failure
SQF  Smallest Queue First
SRAM  static random access memory
STP  Spanning Tree Protocol
Stream  (n.) A kernel aggregate created by connecting STREAMS components, resulting from an application of the STREAMS mechanism. The primary components are the Stream head, the driver, and zero or more pushable modules between the Stream head and driver.
Stream end  (n.) A Stream component that is farthest from the user process and contains a driver.
Stream head  (n.) A Stream component closest to the user process. It provides the interface between the Stream and the user process.
Streaming server  Handles data streams from the Sun ONE Application Server to the Web server and to the Web browser. A streaming service improves performance by allowing users to begin viewing results of requests sooner rather than waiting until the complete operation has been processed.
STREAMS  (n.) A kernel mechanism that supports development of network services and data communications drivers. STREAMS defines interface standards for character input/output within the kernel and between the kernel and user level. The STREAMS mechanism includes integral functions, utility routines, kernel facilities, and a set of structures.
STREAMS-based pipe  (n.) A mechanism for bidirectional data transfer implemented using STREAMS and sharing properties of STREAMS-based devices.
Sun ONE  (Sun Open Net Environment) The Sun Microsystems software strategy that comprises the vision, architecture, platform, and expertise for developing and deploying Services on Demand today. See http://www.sun.com/sunone.
Switch  Any device or mechanism that moves data from one network to another without any routing tables.
SYN  synchronization

T

TCAM  Telecommunications Access Method
TDM  time division multiplexing
TTCP  Test Transmission Control Protocol
TTRT  Target Token Rotation Time

U

UDDI  Universal Description, Discovery, and Integration. The UDDI Project is an industry initiative that is working to enable businesses to quickly, easily, and dynamically find and transact with one another via Web services. UDDI enables a business to (i) describe its business and its services, (ii) discover other businesses that offer desired services, and (iii) integrate with these other businesses. See http://www.uddi.org. An alternative to UDDI is JAXR, created by Sun Microsystems. See JAXR.

W

Web connectors  Web connectors and listeners manage the passing of requests from the Web server to the Sun ONE Application Server. Listeners distribute and handle requests from the Web connectors. New listeners can be added with the HTTP handler.
Web server  The easy-to-use, extensible, easy-to-administer, secure, platform-independent solution to speed up and simplify the deployment and management of your Internet and intranet Web sites. It provides immediate productivity for full-featured, Java technology-based server applications.
Web service  A fine-grained, component-style service that is: advertised and described in a service registry; based on standardized protocols (JAXR, UDDI, JAXRPC, JAXM, SOAP, WSDL, and so on); and accessible programmatically by applications or other Web services.
WSDL  (Web Services Description Language) An XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information. See http://www.w3.org/TR/.

Index

A
access control lists 296
active switches 311
application redirection function 299
architecture 3
architecture, network security 330
asynchronous transfer mode (ATM) 98
Auto-negotiation 153
Auto-negotiation Advertisement Register 155
Auto-negotiation Expansion Register 157

B
Basic Mode Control Register 153
Basic Mode Status Register 154
BlackDiamond switches 323
border gateway protocol v4 (BGP4) 311
bridges 66
business logic 312

C
checksumming 146
cipher suite 110
class C network 308
client requests 312
configuring the Extreme Networks switches 323
configuring the Foundry Networks switches 324
congestion control 107
congestion window 51, 54
consistent mode 142
Constant Bit Rate (CBR) 98
content switching 300
Control Plane 67
CPU load balancing 148

D
data flow through service modules 312
Data Plane 67
descriptor ring 141
design 3
disable source routing 127
Dual-attached concentrator 134
Dual-attached station 132

E
Enterprise 98
Enterprise Java Beans (EJBs) 314
Extreme Networks equipment 317
Extreme Networks switches, configuring 323

F
FDDI concentrators 134
FDDI interfaces 136
FDDI station 132
Fiber Distributed Data Interface network 131
firewall architecture 332
firewalls between service modules (figure) 331
flat architecture 261
flow control keywords 213
Forced mode 153
Foundry Networks equipment 317
Foundry Networks switches, configuring 324
Full NAT 91, 302
functional tiers 17

G
Gigabit Media Independent Interface 157
global synchronization 107

H
Half NAT 91, 302
Handshake Layer 110

I
interface specifications 314
interrupt blanking 148
IP address space (private) 296
IP forwarding module 107
IP header 314

J
J2EE application 314
Java data access objects 314
Java DataBase Connectivity (JDBC) 314
Java Server Pages (JSP) 312
jumbo frames 152, 217
JumpStart, Solaris 296

L
Layer 3 routing 331
Link-partner Auto-negotiation Advertisement 155
load balancing, built-in 317
local area networks, virtual (VLANs) 296
logical network architecture 296
logical network architecture overview (figure) 297
logical network design 296

M
MAC overflow 146
management network 296
mapping process 21
media access unit 128
multi-data transmission 143
multi-level architecture 261

N
netmask values 320
Netscreen firewall configuration file 333
network
  configuration (Extreme Networks equipment) 318
  configuration (Foundry Networks equipment) 319
  physical 315
  security architecture 330
Network Address Translation (NAT) 91, 302
network architecture with virtual routers 310
network design 3
  traditional 308
  using chassis-based switches 309
Network Service Provider (NSP) 98

O
open shortest path first (OSPF) 311

P
Parallel Detection Fault 157
partial checksumming 147
pause frames 161
physical network 315
policing 107
Precedence Priority Model 102
private IP address space 296
proxy switching 300

Q
QoS Profile 103
Quality of Service (QoS) 92
queuing 107

R
random early detection register 216
Random Early Discard 151
receive interrupt blanking values 215
receive window 54
received packet delivery method 150
Reservation Model 102
ring of trees 135
ring speed 129
round-robin 74
router 66
routers 320
routing information protocol (RIP) 311

S
secure socket layer 314
sequence of events (data flow) 314
server load balancing 298
Service Level Agreements (SLAs) 98
Services on Demand architecture 18
shaping 107
Single-attached concentrator 134
Single-attached station 132
sliding windows 54
Startup Phase 51
stateful 25
Stateful Layer 7 switching 300
Stateful Session Based 298
stateless and idempotent 25
Stateless Session Based 298
static routes 296, 311
Steady State Phase 51
streaming mode 142
Streams Service Queue model 151
switch 66
switch configuration 322
switch configuration file (Extreme switch) 323
switch configuration file (Foundry switch) 326
symmetric flow control 162

T
tagged VLAN 309
Tail Drop 107
token ring interfaces 125
token ring network 123
transmission latency 142
Transmit Pause capability 162
Trunking Policies 232
trust zones 331

U
URL switching 300

V
Variable Bit Rate-Real Time (VBR-rt) 98
virtual firewalls 331
virtual local area networks (VLANs) 296
virtual routers 310
VLAN, tagged 309

W
Web-based applications 17
weighted round-robin 74