2 - PDF - 1177 - 1177459 526027 547954 013144753x ch03

Chapter Three.
Overview of Server Architectures
Table of Contents
Overview of Server Architectures ............................................................... 1
Introduction .................................................................................................................................................................................... 1 Linux Servers .................................................................................................................................................................................. 1 Processors and Multiprocessing .................................................................................................................................................... 2 Memory ........................................................................................................................................................................................... 5 I/O .................................................................................................................................................................................................. 6 Linux Enterprise Servers ................................................................................................................................................................ 7 Linux Clusters ................................................................................................................................................................................ 9 Examples of Server Systems ......................................................................................................................................................... 11 Summary ....................................................................................................................................................................................... 21
Chapter Three. Overview of Server Architectures

Performance Tuning for Linux Servers By Sandra K. Johnson, Gerrit Huizenga, Badari Prepared for babu krishnamurthy, Safari ID: babu_krishnamurthy@yahoo.com Pulavarty ISBN: 013144753X Publisher: IBM Press Print Publication Date: 2005/05/27 User number: 547954 2008 Safari Books Online, LLC. This PDF is made available for personal use only during the relevant subscription term, subject to the Safari Terms of Service. Any other use requires prior written consent from the copyright owner. Unauthorized use, reproduction and/or distribution are strictly prohibited and violate applicable laws. All rights reserved.
Page 1
Return to Table of Contents

By Michael Hohnbaum, Hanna Linder, and Chris McDermott
Introduction
Now that we've covered Linux kernel fundamentals, it's time to define the typical server processing model most common in the Linux Server market. We'll describe some of the basic building blocks and standard features that comprise today's server platforms. The focus in this chapter is on computers with multiple CPUs and large memory, and those that support large disks. We'll also focus on the architecture of the server more than the software algorithms designed to utilize the various capabilities of large computers. By the end of the chapter, you will have a better understanding of some of the main architectures and topologies available. You will understand the difference between SMP, NUMA, and clustered systems.
Licensed by babu krishnamurthy
Linux Servers
A computer consists of one or more processors, memory, and I/O devices. I/O devices are connected to the computer through I/O buses such as PCI. Additionally, some computers have service processors to assist in system control (booting, error handling) and monitoring (power, cooling). A server is a computer that provides services to other computers. For example, a DNS server is a computer that provides the domain name lookup service to other computers connected to it via a network. A computer functioning as a server is not limited to providing only one service, although for simplicity and security reasons, it is common to restrict a server to a single service. One of the most appealing features of the Linux kernel is its modular design. This design has made it relatively easy to port the Linux kernel to new architectures or enhance and extend existing architecture ports to support new features. It is also one of the fundamental reasons why Linux has become so popular. Today, Linux runs in computing environments that range from embedded systems to desktops to entry-level and enterprise-level servers to proprietary mainframe systems. Linux has probably made its biggest impact in the small- to mid-range server market. Linux runs well on servers with two to four CPUs,

Page 2
although current stable kernels support up to 32 CPUs. Two to four CPU servers are currently considered the Linux "sweet spot" from a performance point of view. In the following section, you will learn about systems and processors. We will discuss how a computer is configured to run as a server, as well as mixing processors within the same system.
Processors and Multiprocessing

Most server-class systems are designed to support more than one processor. The most common type of multiprocessor systems supported by Linux are the tightly coupled Symmetrical Multiprocessing or Shared Memory Multiprocessing architectures (SMP). These tightly coupled architectures are called SMP because each processor shares the same system bus, and therefore, each processor is symmetrical to or equidistant from system memory and I/O resources. In other words, memory access and I/O access times from any processor in the system are uniform. The advantage of SMP systems is that they provide more computing power, because there are more processors with which to schedule work. In a perfect world, an SMP system provides linear scalability as more processors are added to the system. To explain further, a workload on an n-processor system could perform n times faster than the same workload on a one-processor system. Realistically, because processors in an SMP server share system resources (the memory bus, the I/O bus, and so on), linear scalability is difficult to achieve. Achieving acceptable scalability in an SMP environment involves both optimized hardware and software. The hardware must be designed to exploit the system's parallel characteristics. System software must be written to take full advantage of the parallelism built into the hardware. On the other hand, the fact that processors share certain system resources places limitations on the amount of parallelism that can be achieved. Both system hardware and software must implement complicated locking logic and algorithms to provide mutual exclusion of shared system resources. The system must prevent concurrent access to any shared resource to preserve data consistency and correct program operation. Mutual exclusion is one of the primary factors that limit the scalability of any SMP operating system. SMP support in the Linux kernel has evolved from a model that completely serialized access to the entire kernel to a design that now supports multiple layers and types of locks within kernel components at every level of the kernel. The 2.6 Linux kernel continues to improve SMP scalability over previous kernel versions by implementing features that further exploit parallelism in SMP environments.
Server Topologies
Any size of computer can be configured to run as a server. Some services can be provided by a single processor computer, whereas other services, such as large databases, require more substantial computer hardware. Because the typical single-processor system with
Page 3
memory and a few disk drives should be common to anyone attempting to tune Linux on a server, this chapter focuses on larger server configurations with multiple processors and potentially large amounts of disk storage. Linux can support servers effectively with up to 16 processors on 2.4-based kernels and 32 processors on 2.6-based kernels (and up to 512 processors on some architectures). As the processor count scales up, a similar scaling up of memory and I/O capacity must occur. A 16-processor server with only 1GB of memory would most likely suffer performance problems from a lack of memory for the processors to make use of. Similarly, a server with a large memory would be hampered if only one disk drive were attached, or only one path to get to disk storage. An important consideration is the balance of a server's elements so that adequate resources exist for the work being performed. An important characteristic of multiprocessor configurations is the manner in which the processors are connectedthe server's topology. The basic multiprocessor system employs a large system bus that all processors connect to and that also connects the processors to memory and I/O buses, as depicted in Figure 3-1. Multiprocessor systems like these are referred to as Symmetric Multiprocessors (SMPs) because all processors are equal and have similar access to system resources.
Figure 3-1. The basic multiprocessor system.

Page 4
How many processors a server needs is determined by the workload. More processors provide more processing power and can provide additional throughput on CPU-bound jobs. If a workload is CPU-boundprocesses are waiting excessively for a turn on a processoradditional processor capacity might be warranted. SMP systems are fairly common and can be found with two to four processors as a commodity product. Larger configurations are possible, but as the processor count goes up, more memory is attached, and more I/O devices are used, the common system bus becomes a bottleneck. There is not enough capacity on the shared system bus to accommodate all the data movement associated with the quantity of processors, memory, and I/O devices. Scaling to larger systems then requires approaches other than SMP. Various approaches for larger scaling have been employed. The two most common approaches are clusters and Non-Uniform Memory Architecture (NUMA). Both of these approaches have a common basis in that they eliminate a shared system bus. A cluster is constructed of a collection of self-contained systems that are interconnected and have a central control point that manages the work that each system (node) within a cluster is performing. Each node within a cluster runs its own operating system (that is, Linux) and only has direct access to its own memory. NUMA systems, on the other hand, are constructed from nodes connected through a high-speed interconnect, but a common address space is shared across all nodes. Only one operating system image is present, and it controls all operations across the nodes. The memory, although local to each node, is accessible to all nodes in one large, cache-coherent physical address space. Clusters and NUMA systems are discussed in more detail in later sections.
Mixing Processors
Many modern server platforms support the mixing of processor speeds and steppings (revisions) within the same system. Special consideration must be taken to ensure optimal operation in such an environment. Usually, the processor vendor publishes specific guidelines that must be met when mixing processors with different speeds and different features or stepping levels. Some of the most common guidelines are as follows: The boot processor is selected from the set of processors having the lowest stepping and lowest feature set of all other processors in the system. The system software uses a common speed for all processors in the system, determined by the slowest speed of all processors configured in the system. All processors use the same cache size, determined by the smallest cache size of all processors configured in the system.
System software must implement and follow similar restrictions or guidelines to ensure correct program operation.

Page 5
As you have learned from this section, Linux can support servers with multiprocessors. However, you have also learned the memory has to be increased to avoid performance problems. The next section addresses the memory issue.
Memory
Servers tend to have large quantities of memory. Amounts of 1GB to 4GB of memory per processor are common. The amount of memory needed for a server varies depending on the type of work the server is doing. If a server is swapping excessively, additional memory should be considered. Some workloads perform substantially better if there is enough memory on the server to keep common, heavily used data locked in memory. Other workloads use small amounts of memory with transient data, so additional memory would not be a benefit. The maximum amount of memory a process on a server can address is limited by the processor's word size. Server processors have either 32-bit or 64-bit words. Registers on processors are the size of a word and are used to hold memory addresses. The maximum amount of memory that can be addressed by a processor is a function of the word size. 32bit processors have a 4GB limit on memory addressability (2 raised to the 32nd power). On Linux the user-space process is provided only 3GB of address space, with the last gigabyte of address space reserved for use by the kernel. On 64-bit processors, the 4GB limit goes away, but most 64-bit implementations put a restriction on the maximum address below the possible maximum (that is, 2 raised to the 64th power). Some 32-bit processors (for example, Pentium) implement additional address bits for accessing physical addresses greater than 32 bits, but these are accessible only via virtual addressing by use of additional bits in page table entries. x86-based processors currently support up to 64GB of physical memory through this mechanism, but the virtual addressability is still restricted to 4GB. 64-bit processors are appropriate for workloads that have processes that need to address large quantities of data. Large databases, for example, benefit from the additional memory addressability provided by 64-bit processors. 32-bit processors, on the other hand, are better for workloads that do not have large addressability requirements, because code compiled for 32-bit processors is more compact (because addresses used in the code are half the size of 64-bit addresses). The more compact code reduces cache usage. Processor speeds and memory speeds continue to increase. However, memory speed technology usually lags processor technology. Therefore, most server systems implement smaller high-speed memory subsystems called caches. Cache memory subsystems are implemented between the processors and memory subsystems to help bridge the gap
Page 6
between faster processor speeds and the slower memory access times. The advantage of implementing caches is that they can substantially improve system performance by exploiting a property called locality of reference. Most programs, at some point, continuously execute the same subset of instructions for extended periods of time. If the subset of instructions and the associated data can fit in the cache memory, expensive memory accesses can generally be eliminated, and overall workload performance can be substantially increased. Most processors today implement multiple levels of caches. In addition, some servers can also implement multiple cache hierarchies. The processor caches are typically much smaller and faster than caches implemented in the platform. Caches range in size from a few kilobytes to a few megabytes for on-chip caches and up to several megabytes for system caches. Caches are broken into same-sized entries called cache lines. Each cache line represents a number of contiguous words of main memory. Cache line sizes range from a few bytes (in processor caches) to hundreds of bytes (in system caches). Data is inserted into or evicted from caches on cache line boundaries. The Linux kernel exploits this fact by ensuring that data structures or portions of data structures that are accessed frequently are aligned on cache line boundaries. Cache lines are further organized into sets. The number of lines in a set represents the number of lines a hash routine must search to determine whether an address is available in cache. Caches implement different replacement policies to determine when data is evicted from a cache. Caches also implement different cache consistency algorithms to determine when data is written back to main memory and provide the capability to flush the total contents of a cache. Proper operating system management of system caches can have an impact on system performance. As much as memory is important to keeping things running smoothly, I/O capacity is also important. Along with memory, this is a component that keeps the processors effective.
I/O
One difference between server systems and other mid-range class computers is the I/O subsystem. Server systems support much larger numbers of I/O devices and more complex I/O topologies. Because servers typically host large amounts of data and need to make data available to consumers quickly, I/O throughput and bandwidth can play a key role in system performance. I/O buses serve as a connection from I/O devices to the system memory. The prevalent I/ O bus on modern servers is PCI. Servers have multiple PCI slots that allow PCI cards that interface to SCSI, Fibre Channel, networks, and other I/O devices. A server configuration can vary from a single disk up to thousands of disks. Although the number of system disks is limited to 256 in the 2.4 kernel series, this limit has been
Page 7
increased for the 2.6 kernel. Servers with only one or two disks might use IDE technology for the disks, but systems with more than a few disks mostly use SCSI technology. Fibre Channel is also used, especially for larger storage networks that might encompass multiple Fibre Channel adaptors, switches, and enterprise-level storage arrays. Multipath I/O (MPIO) provides more than one path to a storage device. In basic MPIO configurations, the extra path(s) to the storage device is (are) available for failover only that is, when the system detects an error attempting to reach the storage device over the primary path, it switches over and uses the secondary path to access the storage device. More advanced MPIO configurations use additional paths to increase the bandwidth between the storage device and memory. The operating system attempts to load- balance across the multiple paths to maximize the amount of data throughput between memory and disks. A server needs at least one network connection, but again, depending on the type of work the server is performing, it might have multiple network connections, either all to the same network (with the multiple connections providing increased network bandwidth) or to multiple networks. Ethernet is the network technology most in use. Although 10Mbps Ethernet is still supported, most Ethernet today operates at the rate of at least 100Mbps, with Gigabit Ethernet becoming the default on modern servers. Other network technologies, such as ATM, exist and are supported by Linux but are not as widespread as Ethernet. Other I/O devices are also connected to servers. Most servers have a CD-ROM device for loading software onto the system. They might also have a CD-writer. Some type of device to back up system data is also needed. Tape devices are usually used for this, although other backup mechanisms and strategies are possible. Large servers have some form of service processor that is used for controlling the initial power-on reset sequence, booting the operating system, and monitoring the system (power supply, cooling, and so on) during normal operation. Service processors often participate in machine fault handling and provide other services for the normal functioning of the server. The multiprocessors, memory, and I/O subsystems are the kernel components that make an enterprise server. The hardware of the Linux Enterprise Server provides the foundation to bring this high-powered functionality to life.
Linux Enterprise Servers

An enterprise server is a server intended for use in a data center. The term implies a certain level of functionality above that of a simple PC-class box running a hobbyist version of

Page 8
Linux. An enterprise server supports the computing needs of a larger organization, with needs that are larger and more complex than those of an individual user or department. Historically, mainframes served enterprises, and smaller systems served departments or users. However, over the past 20 years, there has been a significant change in computer technology. Now, enterprise-class servers are being built from commodity hardware. At this point, an important distinguishing factor in determining if a server is an enterprise server is the software functionality it provides. The functionality that is expected in an enterprise server includes the following:
Advanced management capabilities Performance monitoring and tuning Storage management, usually in the form of a logical volume manager Resource management Security monitoring
User account creation Performance and scalability Capability of the OS to make effective use of the system resources Raw system performance capable of doing enterprise-level computing Multiple processors Capable of supporting GBs of memory
Large I/O capabilities (disks, networks) Reliability, Availability, Serviceability (RAS) Five-nine's of availability (99.999% available) Problem diagnostics Debugging tools Error logging The criteria for calling a server an enterprise server are subjective, but the capabilities listed here are factors to consider.
Page 9
The enterprise server serves the complex needs of today's businesses. You can add to this complexity by including clusters. Linux cluster technology allows intensive tasks to run on multiple servers, allowing you to pool your resources.
Linux Clusters
There are two distinct types of clusters: high-performance clusters and high-availability clusters. The commonality between the two cluster types is that they are both made up of a set of independent computers that are interconnected and working on a common task. Each independent computer within a cluster is called a node. The goal of a high-performance cluster is to perform large computational tasks, spreading the work across a large number of nodes. The goal of a high-availability cluster is for an application (typically a database) to be able to continue functioning even through the failure of one or more nodes. A high-performance cluster is not typically considered an enterprise server, but rather is dedicated to specific computationally intensive tasks. A high-availability cluster, on the other hand, usually operates as an enterprise server. High-performance clusters (HPCs) tend to have a higher node count than high-availability clusters, with 100 node clusters being common. High-availability clusters tend to have smaller node counts, typically not exceeding 16 nodes, and more commonly only two to four nodes.
High-Performance Clusters
High-performance clusters are an inexpensive way of providing large computational power for problems that are divisible into multiple parallel components. The nodes typically are inexpensive single- or dual-processor computers. Because large numbers of nodes are involved, size is an important consideration. Computers that fit into a 1U form factor that allow for stacking large quantities per rack are commonly used for high performance clusters. Most major hardware vendors sell systems capable of being clustered. Nodes are headlessthat is, they have no keyboard, monitor, or mouse. Larger clusters might include a separate management LAN and/or a terminal server network to provide console capability to the nodes. Each node in an HPC has its own local disk storage to maintain the operating system, provide swap space, store programs, and so on. Some clusters have an additional type of nodea storage serverto provide access to common disk storage for shared data. There is also a master node that provides overall control of the cluster, coordinating the work across nodes and providing the interface between the cluster and local networks. The interconnect for nodes in an HPC can be Ethernet (10, 100, 1000MBps) or it can be a specialty interconnect that delivers higher performance, such as myrinet. The choice of

Page 10
the interconnect technology is a trade-off between price and speed (latency and bandwidth). The type of work a cluster is designed to do influences the choice of interconnect technology. Certain file systems are designed for use in cluster environments. These file systems provide a global, parallel cluster file systemfor example, GPFS from IBM or CXFS from SGI. These file systems provide concurrent read/write access to files located on a shared disk file system. Communication between HPC nodes often makes use of message-passing libraries such as MPI or PVM. These libraries are based on common standards and allow the easy porting of parallel applications to different cluster environments.
High-Availability Clusters
Some workloads are more sensitive to failurethat is, a failure of the workload can have expensive repercussions for a business. Sample workloads include customer relationship management (CRM), inventory control, messaging servers, databases, and file and print servers. Availability is critical to these workloads; availability requirements are often described as the five-nines of availability (99.999%). Providing that level of availability allows about 5 minutes of outage per year. One method of preventing downtime caused by failure of a system running these critical workloads is the use of high-availability (HA) clusters. An HA cluster consists minimally of two independent computers with a "heartbeat" monitoring program that monitors the health of the other node(s) in the cluster. If one node fails, another node detects the failure and automatically picks up the work of the failed node. It is common for HA clusters to be built from larger computers (four or more processors). Typical HA clusters have only a handful of nodes. Ideally, there is enough excess capacity on the nodes in a cluster to absorb the workload from one failed node. Thus, in a two-node cluster, each node should normally run at 50% capacity so that there is headroom to absorb the load of the other node. In a four-node cluster, each node could run at 75% capacity and there would be sufficient excess capacity to absorb the workload of a failed node. Thus, it is more efficient to have larger node counts. However, the efficiency of large node centers comes at the cost of additional complexity and administrative overhead. For HA clusters, all nodes need to have access to the data being used by the application, which is normally a database. Use of Fibre Channel adapters and switches is usually necessary to connect more than a few nodes to a common disk storage array. This can often be the limiting factor on the number of nodes in an HA cluster. Within the Linux community is an active group focused on high-availability clusters (see http://linux-ha.org). This site provides details on the design and configuration of Linux
Page 11
high-availability clusters and provides links to papers that describe various implementations and deployments of Linux HA clusters. Clusters are a way of consolidating resources. In the following section, see what consolidation means in terms of the mainframe.
Examples of Server Systems

Several server systems are available. The performance of these servers is critical to the success of the evolving information infrastructure. Here, we'll discuss a few examples of what is available.
IBM MainframeszSeries
Mainframe servers are the ideal platform for some Linux server consolidation efforts. Mainframe servers are large, extremely reliable platforms. These platforms have high memory and I/O bandwidth, low memory latencies, shared large L2 caches, dedicated I/ O processors, and very advanced RAS capabilities. IBM zSeries platforms are the mainstay of this technology. Mainframes support logical partitioning, which is the capability to carve up the machine memory, processor, and I/O resources into multiple machine images, each capable of running an independent operating system. The memory allocated to a partition is dedicated physical memory, whereas the processors can be dedicated or shared, and the I/O channels can also be either dedicated or shared among partitions. When I/O or processor resources are shared among partitions, this sharing is handled by firmware controlling the partitioning and is invisible to the operating systems running in the partitions. Although Linux can run in a partition on a mainframe, the real advantage is to run large numbers of Linux server images in virtual machines under z/VM. z/VM is an operating system that provides virtual machine environments to its "guests," thus giving each guest the view of an entire machine. This virtual machine implementation allows for supporting a large number of Linux images limited only by system resources. Real deployments have provided thousands of Linux servers running as guests in virtual machines hosted by z/ VM on a single mainframe. Resources can be shared among the Linux images, and high-speed virtual networks are possible. These virtual networks essentially run at memory speed, because there is no need to send packets onto a wire to reach other Linux guests. New instances of Linux virtual machines are created completely through software control and can be readily scripted. This simplifies system management and allows deployment of a new Linux server (as a guest to z/VM) in a matter of minutes, rather than the hours it
Page 12
would take to install a new physical server. Physical resources do not have to be reserved and dedicated for each guest; rather, under control of z/VM, all the system resources are shared. z/VM understands the hardware and can exploit the advanced RAS capabilities on behalf of the guest servers. A robust workload monitoring and control facility is provided that supplies an advanced resource management capability. z/VM also provides many debugging tools to assist in diagnosing software problems with a Linux guest, and it also aids in testing new servers. For details on the IBM mainframe technology circa 2000, see http://www.research.ibm.com/journal/rd43-56.html. These machines are optimal for server consolidation. They can support hundreds to thousands of discrete Linux images. Workload types suited to run in this environment exhibit the following characteristics:

I/O-intensive operations (for example, serving web pages) Lightly loaded servers
Computation-intensive, graphic-oriented workloads are not good matches for mainframe serversnor are workloads that check the system clock often. Heavily loaded servers also are not good candidates for consolidation on a mainframe server. Older mainframes did not support IEEE-format floating-point operations. Thus, all floating-point operations under Linux had to be converted to the IBM proprietary format, executed, and then converted back to IEEE format. Newer mainframes support IEEE format and do not pay this performance penalty. Advanced mainframe system management and programming skills are needed for planning and installing a Linux deployment under z/VM. However, after the system is installed and configured, adding new virtual servers is fairly simple.
Hardware Design Overview

The IBM zSeries architecture implements either 12 or 20 processors on a processor module. There are multiple types of processors, which are determined by the microcode that is loaded that controls the processor. The principal types of processors are the processor units, which are equivalent to CPUs on most servers, and the System Assist Processor (SAP), which handles control of the I/O operations. Each machine has at least two SAPs, and could have more depending on the I/O capacity needs. The modules have two large L2 caches, with each L2 cache shared among half the processors on the module. This differs from standard servers that have an L2 cache associated with each processor. The shared L2 cache allows a process to migrate between processors without losing its cache warmth.
Page 13
Memory bandwidth is 1.5GBps per processor with an aggregate system bandwidth of 24GBps. The very large memory bandwidth design favors applications that do not have cache-friendly working sets or behavior. Normal Linux servers usually run at no more than 50% average utilization to provide headroom for workload spikes. The design of the mainframes allows systems to run at 80% to 90% utilization.
Reliability
A significant feature of the mainframe servers is their extremely high reliability. Reliability on these systems is measured in terms of the overall availability. Five-nines of reliability that is, 99.999% availabilityis a common target that is achievable at the hardware level. This amounts to less than 5 minutes of downtime per year. This level of reliability is achieved through advanced hardware design techniques referred to as Continuous Reliable Operation (CRO). The goal of continuous reliable operation is to keep a customer's workload running without interruptions for error conditions, maintenance, or system change. While remaining available, the machine must also provide reliable results (data integrity) and stable performance. These requirements are met through constant error checking, hot-swap capabilities, and other advanced methods.
IBM Mainframe I/O Design

Each system has two or more processors dedicated to performing I/O operations. SAPs run specialized microcode and manage I/O operations, removing the work of managing I/O operations from the main processors. I/O cards supported are as follows:

ESCON-16 channel cards FICON channel cards Open Systems Adapter (OSA) Express Gigabit Ethernet (GbE) Asynchronous Transfer Mode (ATM) Fast Ethernet (FENET) PCI-Cryptographic Coprocessor (PCI-CC) cards
ESCON is a zSeries technology that supports 20MBps half-duplex serial bit transmission over fiber-optic cables. FICON is a newer zSeries technology capable of supporting 100MBps full-duplex serial interface over fiber. It supports multiple outstanding I/O operations at the same time to different channel control units. FICON provides the same I/O concurrency as up to eight ESCON channels.

Page 14
Open Systems Adapter cards (Ethernet, token ring, or ATM) provide network connectivity. The OSA-Express card implements Queued Direct I/O (QDIO), which uses shared memory queues and a signaling protocol to exchange data directly with the TCP/IP stack. Disk storage is connected via ESCON or FICON. The recommended storage device is the Enterprise Storage Server (ESS), commonly known as Shark. Shark is a full-featured diskstorage array that supports up to nearly 14 terabytes of disk storage. The ESS has large internal caches and multiple RISC processors to provide high-performance disk storage. Multiple servers can be connected to a single ESS using Fibre Channel, ESCON, FICON, or UltraSCSI technology. More information about the ESS is available at http:// www.storage.ibm.com/hardsoft/products/ess/ess.htm.
Blades
Blades are computers implemented in a small form factor, usually an entire computer, including one or two disk drives, on a single card. Multiple cards (blades) reside in a common chassis that provides power, cooling, and cabling to support network and system management connectivity. This packaging allows significant computing power to be provided in dense packaging, thus saving space. Blades are used heavily in data centers that need large quantities of relatively small independent computers, such as large web server environments. Blades are also candidates for use in clusters. Because size is a significant factor for blade designs, most blades are limited to single processors, although blades with dual processors are available. As processor technology and packaging continue to advance, it is likely that blades with high processor counts will become available. However, the practicality of large processor count blades is somewhat hampered by the amount of I/O connectivity available.
NUMA
Demand for greater computing capacity has led to the increased use of multiprocessor computers. Most multiprocessor computers are considered Symmetric Multiprocessors (SMPs) because each processor is equal and has equal access to all system resources (such as memory and I/O buses). SMP systems generally are built around a system bus that all system components are connected to and which is used to communicate between the components. As SMP systems have increased their processor count, the system bus has increasingly become a bottleneck. One solution that is gaining use by hardware designers is Non-Uniform Memory Architecture (NUMA). NUMA systems colocate a subset of the system's overall processors and memory into nodes and provide a high-speed, high-bandwidth interconnect between the nodes, as shown in Figure 3-2. Thus, there are multiple physical regions of memory, but all memory is tied together into a single cache-coherent physical address space. In the resulting system, some processors are closer to a given region of physical memory than are other processors.
Page 15
Conversely, for any processor, some memory is considered local (that is, it is close to the processor) and other memory is remote. Similar characteristics can also apply to the I/O busesthat is, I/O buses can be associated with nodes.
Figure 3-2. NUMA's high-bandwidth interconnect.
Although the key characteristic of NUMA systems is the variable distance of portions of memory from other system components, there are numerous NUMA system designs. At one end of the spectrum are designs where all nodes are symmetricalthey all contain memory, CPUs, and I/O buses. At the other end of the spectrum are systems where there are different types of nodesthe extreme case being separate CPU nodes, memory nodes, and I/O nodes. All NUMA hardware designs are characterized by regions of memory being at varying distances from other resources, thus having different access speeds. To maximize performance on a NUMA platform, Linux takes into account the way the system resources are physically laid out. This includes information such as which CPUs
Page 16
are on which node, which range of physical memory is on each node, and what node an I/ O bus is connected to. This type of information describes the system's topology. Linux running on a NUMA system obtains optimal performance by keeping memory accesses to the closest physical memory. For example, processors benefit by accessing memory on the same node (or closest memory node), and I/O throughput gains by using memory on the same (or closest) node to the bus the I/O is going through. At the process level, it is optimal to allocate all of a process's memory from the node containing the CPU(s) the process is executing on. However, this also requires keeping the process on the same node.
Hardware Implementations
Many design and implementation choices result in a wide variety of NUMA platforms. This section discusses hardware implementations and provides examples and descriptions of NUMA hardware implementations.
Types of Nodes
The most common implementation of NUMA systems consists of interconnecting symmetrical nodes. In this case, the node itself is an SMP system that has some form of high-speed and high-bandwidth interconnect linking it to other nodes. Each node contains some number of processors, physical memory, and I/O buses. Typically, there is a nodelevel cache. This type of NUMA system is depicted in Figure 3-3.
Figure 3-3. Hierarchical NUMA design.

Page 17

Page 18
A variant on this design is to put only the processors and memory on the main node, and then have the I/O buses be separate. Another design option is to have separate nodes for processors, memory, and I/O buses, which are all interconnected. It is also possible to have nodes that contain nodes, resulting in a hierarchical NUMA design. This is depicted in Figure 3-3.
Types of Interconnects
There is no standardization of interconnect technology. More relevant to Linux, however, is the topology of the interconnect. NUMA machines can use the following interconnect topologies: Ring topology, in which each node is connected to the node on either side of it. Memory access latencies can be nonsymmetricthat is, accesses from node A to node B might take longer than accesses from node B to node A. Crossbar interconnect, where all nodes connect to a common crossbar. Point-to-point, where each node has a number of ports to connect to other nodes. The number of nodes in the system is limited to the number of connection ports plus one, and each node is directly connected to each other node. This type of configuration is depicted in Figure 3-3. Mesh topologies, which are more complex topologies that, like point-to-point topologies, are built on each node having a number of connection ports. Unlike pointto-point topologies, however, there is no direct connection between each node. Figure 3-4 depicts a mesh topology for an 8-node NUMA system with each node having three interconnects. This allows direct connections to three "close" nodes. Access to other nodes requires an additional "hop," passing through a close node.
Figure 3-4. 8-node NUMA configuration.

Page 19
The topology provided by the interconnect affects the distance between nodes. This distance affects the access times for memory between the nodes.
Latency Ratios
An important measurement for determining a system's "NUMA-ness" is the latency ratio. This is the ratio of memory latency for on-node memory access to off-node memory access. Depending on the topology of the interconnect, there might be multiple off-node latencies.

Page 20
This latency is used to analyze the cost of memory references to different parts of the physical address space and thus influences decisions affecting memory usage.
Specific NUMA Implementations

Several hardware vendors are building NUMA machines that run the Linux operating system. This section briefly describes some of these machines, but it is not an all-inclusive survey of the existing implementations. One of the earlier commercial NUMA machines is the IBM NUMA-Q. This machine is based on nodes that contain four processors (i386), memory, and PCI buses. Each node also contains a management module to coordinate booting, monitor environmentals, and communicate with the system console. The nodes are interconnected using a ring topology. Up to 16 nodes can be connected for a maximum of 64 processors and 64GB of memory. Remote-to-local memory latency ratios range from 10:1 to 20:1. Each node has a large remote cache that helps compensate for the large remote memory latencies. Much of the Linux NUMA development has been on these systems because of their availability. NEC builds NUMA systems using the Intel Itanium processor. The most recent system in this line is the NEC TX7. The TX7 supports up to 32 Itanium2 processors in nodes of four processors each. The nodes are connected by a crossbar and grouped in two supernodes of four nodes each. The crossbar provides fast access to nonlocal memory with low latency and high bandwidth (12.8GBps per node). The memory latency ratio for remote-to-local memory in the same supernode is 1.6:1. The remote-to-local memory latency ratio for outside the supernode is 2.1:1. There is no node-level cache. I/O devices are connected through PCI-X buses to the crossbar interconnect and thus are all the same distance to any CPU/node. The large IBM xSeries boxes use Intel processors and the IBM XA-32 chipset. This chipset provides an architecture that supports four processors, memory, PCI buses, and three interconnect ports. These interconnect ports allow point-to-point connection of up to four nodes for a 16-processor system. The system depicted in Figure 3-2 approximates the configuration of a four-node x440. Also supported is a connection to an external system with additional PCI slots to increase the system's I/O capacity. The IBM eServer xSeries 440 is built on this architecture with Intel Xeon processors.
MPIO on NUMA Systems

Multipath I/O, as mentioned earlier in this chapter, can provide additional I/O bandwidth for servers. However, on NUMA platforms, it can provide even larger benefits. MPIO involves using multiple I/O adaptors (SCSI cards, network cards) to gain multiple paths to the underlying resource (hard disks, the network), thus increasing overall bandwidth. On SMP platforms, potential speedups due to MPIO are limited by the fact that all CPUs and memory typically share a bus, which has a maximum bandwidth. On NUMA platforms,
Page 21
however, different groups of CPUs, memory, and I/O buses have their own distinct interconnects. Because these distinct interconnects allow each node to independently reach its maximum bandwidth, larger I/O aggregate throughput is likely. An ideal MPIO on NUMA setup consists of an I/O card (SCSI, network, and so on) on each node connected to every I/O device, so that no matter where the requesting process runs, or where the memory is, there is always a local route to the I/O device. With this hardware configuration, it is possible to saturate several PCI buses with data. This is even further assisted by the fact that many machines of this size use RAID or other MD devices, thus increasing the potential bandwidth by using multiple disks.
Timers
On uniprocessor systems, the processor has a time source that is easily and quickly accessible, typically implemented as a register. On SMP systems, the processors' time source is usually synchronized because all the processors are clocked at the same rate. Therefore, synchronization of the time register between processors is a straightforward task. On NUMA systems, synchronization of the processors' time source is not practical because not only does each node have its own crystal providing the clock frequency, but there tends to be minute differences in the frequencies that the processors are driven at, which thus leads to time skew. On multiprocessor systems, it is imperative that there be a consistent system time. Otherwise, time stamps provided by different processors cannot be relied on for ordering. If a process is dispatched on a different processor, it is possible that there can be unexpected jumps (backward or forward) in time. Ideally, the hardware provides one global time source with quick access times. Unfortunately, global time sources tend to require off-chip access and often off-node access, which tend to be slow. Clock implementations are very architecture-specific, with no clear leading implementation among the NUMA platforms. On the IBM eServer xSeries 440, for example, the global time source is provided by node 0, and all other nodes must go off-node to get the time. In Linux 2.6, the i386 timer subsystem has an abstraction layer that simplifies the addition of a different time source provided by specific machine architecture. For standard i386 architecture machines, the timestamp counter (TSC) is used, which provides a very quick time reference. For NUMA machines, a global time source is used (for example, on the IBM eServer xSeries 440, the global chipset timer is used).

Page 22
Summary
This chapter has shown you that Linux has a lot to offer! While familiarizing you with all Linux has to offer as far as server architecture, the point here is the enhanced design of the Linux kernel. This design helps you shape your system to your needs. Each feature works together to form a powerhouse of software technology. From the power of the processors and multiprocessors, to the technique of configuration, to the versatility of the memory, Linux has your server!


2 - PDF - 1177 - 1177459 526027 547954 013144753x ch03

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2 - PDF - 1177 - 1177459 526027 547954 013144753x ch03

Uploaded by

Copyright:

Available Formats

Chapter Three.

Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Licensed by babu krishnamurthy

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Processors and Multiprocessing

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Linux Enterprise Servers

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Examples of Server Systems

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Hardware Design Overview

Chapter Three. Overview of Server Architectures

Return to Table of Contents

IBM Mainframe I/O Design

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Figure 3-4. 8-node NUMA configuration.

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Chapter Three. Overview of Server Architectures

Chapter Three. Overview of Server Architectures

Return to Table of Contents

Specific NUMA Implementations