Pathways to Open Petascale Computing
White Paper
November 2009
Sun Microsystems, Inc.
Table of Contents
Executive Summary
Chapter 1: Pathways to Open Petascale Computing
Chapter 2: Fast, Large, and Dense InfiniBand Infrastructure
Chapter 3: Deploying Dense and Scalable Modular Compute Nodes
Chapter 4: Scalable and Manageable Storage
Chapter 5: Sun HPC Software
Chapter 6: Deploying Supercomputing Clusters Rapidly with Less Risk
Chapter 7: Conclusion
Acknowledgements
For More Information
Executive Summary
From weather prediction and global climate modeling to minute sub-atomic analysis
and other grand-challenge problems, modern supercomputers often provide the key
technology for unlocking some of the most critical challenges in science and
engineering. These essential scientific, economic, and environmental issues are
complex and daunting — and many require answers that can only come from the
fastest available supercomputing technology. In the wake of the industry-wide
migration to terascale computing systems, an open and predictable path to petascale
supercomputing environments has become essential.
Unfortunately, the design, deployment, and management of very large terascale and
petascale clusters and grids have remained elusive and complex. While a few
organizations have accomplished petascale deployments, those systems have been
largely proprietary in nature and have come at a high cost. In fact, reaching petascale
is often difficult not because of inherent limitations, but because of the practicalities
of attempting to scale architectures to their full potential. Seemingly simple concerns — heat, power,
cooling, cabling, and weight — are rapidly overloading the vast majority of even the
most modern datacenters.
Sun understands that the key to building petascale supercomputers lies in a balanced
and systemic infrastructure design approach, along with careful application of the
latest technology advancements. Derived from Sun’s experience and innovation with
very large supercomputing deployments, the Sun™ Constellation System provides the
world's first open petascale computing environment — one built entirely with open
and standard hardware and software technologies. Cluster architects can use the Sun
Constellation System to design and rapidly deploy tightly-integrated, efficient, and cost-
effective supercomputing clusters that scale predictably from a few teraflops to over a
petaflop. With a completely modular approach, processors, memory, interconnect
fabric, and storage can all be scaled independently depending on individual needs.
1. http://www.tacc.utexas.edu/resources/hpcsystems/#constellation
Chapter 1
Pathways to Open Petascale Computing
1. www.top500.org
Figure 1. In the last five years, clusters have increasingly dominated the Top500
list architecture share (image courtesy www.top500.org). [The chart plots the
number of systems (0 to 400) in each Top500 release from 06/1993 through 06/2009
by architecture: Cluster, MPP, SMP, Constellations, Single Processor, and Others.]
Not only have clusters provided access to supercomputing resources for increasingly
larger groups of researchers and scientists, but the largest supercomputers in the world
are now built using cluster architectures. This trend has been assisted by an explosion
in performance, bandwidth, and capacity for key technologies, including:
• Faster processors, multicore processors, and multisocket rackmount and blade
systems
• Inexpensive memory and system support for larger memory capacity
• Faster standard interconnects such as InfiniBand
• Higher aggregated storage capacity from inexpensive commodity disk drives
Unfortunately, significant challenges remain that have stifled the growth of true open
petascale-class supercomputing clusters. Time-to-deployment constraints have resulted
from the complexity of deploying and managing large numbers of compute nodes,
switches, cables, and storage systems. The programmability of extremely large clusters
remains an issue. Environmental factors are also paramount, since deployments must
often take place in existing datacenter space with strict constraints on physical
footprint, power, and cooling.
In addition to these challenges, most petascale computational users also have unique
requirements for clustered environments beyond those of less demanding HPC users,
including:
• Scalability at the socket and core level — Some have espoused large grids of
relatively low-performance systems, but lower performance only increases the
number of nodes required to solve very large computational problems.
• Density in all things — Density is not just a requirement for compute nodes, but for
interconnect fabrics and storage solutions as well.
• A scalable programming and execution model — Programmers need to be able to
apply their programmatic challenges to massively-scalable computational resources
without special architecture-specific coding requirements.
• A lightweight grid model — Demanding applications need to be able to start
thousands of jobs quickly, distributing workloads across the available computational
resources through highly efficient distributed resource management (DRM) systems
(a sketch of this model follows this list).
• Open and standards-based solutions — Programmatic solutions must not cause
extensive porting efforts, or be dedicated to particular proprietary architectures or
environments, and datacenters must remain free to purchase the latest high-
performance computational gear without being locked into proprietary or dead-end
architectures.
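To make the lightweight grid model concrete: Sun Grid Engine implements DRMAA, the open Distributed Resource Management Application API, which lets an application hand an entire job array to the DRM system in a single call rather than submitting tasks one at a time. The following is a minimal sketch, assuming the Python drmaa binding and a configured Grid Engine cell; the binary path, arguments, and task count are hypothetical.

```python
# Minimal sketch: queueing a large job array through DRMAA, the open DRM
# API implemented by Sun Grid Engine. Assumes the Python 'drmaa' binding
# is installed and a Grid Engine cell is reachable; the command, its
# arguments, and the task count below are placeholders.
import drmaa

def submit_array(command, args, n_tasks):
    """Queue an array of n_tasks jobs and return their DRM job ids."""
    session = drmaa.Session()
    session.initialize()
    try:
        jt = session.createJobTemplate()
        jt.remoteCommand = command      # path to the application binary
        jt.args = list(args)
        jt.joinFiles = True             # merge each task's stdout/stderr
        # A single runBulkJobs() call hands the whole array to the DRM
        # system, which fans the tasks out across the cluster.
        job_ids = session.runBulkJobs(jt, 1, n_tasks, 1)
        session.deleteJobTemplate(jt)
        return job_ids
    finally:
        session.exit()

if __name__ == "__main__":
    ids = submit_array("/opt/apps/bin/solver", ["--input", "mesh.dat"], 4096)
    print("queued %d tasks" % len(ids))
```

Because the fan-out is performed by the DRM system rather than the submitting client, the same few lines scale from dozens of tasks on a departmental cluster to thousands on a petascale fabric.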
Many technologies have failed because the fundamental principles that worked in
small clusters simply could not scale effectively when re-cast in a run-time environment
thousands of times larger or faster than their initial implementations. For example, Ten
Gigabit Ethernet — though a significant accomplishment — exhibits latency variable
enough in the supercomputing realm to make it impractical where guaranteed low
latency and throughput dominate performance. Ultimately, building petascale-capable
systems is about being willing to
fundamentally rethink design, using the latest available components that are capable
of meeting or exceeding specified data rates and capacities.
Put simply, getting to petascale requires balance and massive scalability in all
dimensions, including scalable tools and frameworks, processors, systems,
interconnects, and storage — as well as the ability to accommodate changes that
allow software to scale accordingly.
These challenges serve as reminders that the value of genuine innovation in the
marketplace must never be underestimated — even as design-cycle times shrink and
the pressures of time to market grow with the demand for faster, cheaper, and
standards-based solutions.
Key technologies, along with the experience of helping design and deploy some of
the world’s largest supercomputing clusters, make Sun an ideal partner for
delivering open high-end terascale and petascale architectures.
Chapter 2
Fast, Large, and Dense InfiniBand Infrastructure
Building the largest supercomputing grids presents significant challenges, with fabric
technology paramount among them. Sun set out to design InfiniBand architecture for
maximum flexibility and fabric scalability, and to drastically reduce the cost and
complexity of delivering large-scale HPC solutions. Achieving these goals required a
delicate balancing act — one that weighed the speed and number of nodes along with
a sufficiently fast interconnect to provide minimal and predictable levels of latency.
Even with these advantages, building the largest InfiniBand clusters and grids has
remained complex and expensive — primarily because of the need to interconnect very
large numbers of computational nodes. Traditional large clusters require literally
thousands of cables and connections and hundreds of individual core and leaf switches
— adding considerable expense, weight, cable management complexity, and density.

Clos fabrics have the advantage of being non-blocking, in that each attached node has
a constant bandwidth. In addition, an equal number of stages between nodes provides
for uniform latency. Historically, the disadvantage of large Clos networks was that they
required more resources to build.

Sun has employed considerable innovation in all of these areas, and provides
both dual data rate (DDR) and quad data rate (QDR) scalable InfiniBand fabrics. For
example, as a part of the Sun Constellation System, Sun InfiniBand infrastructure can
provide both QDR Clos clusters that can scale up to 5,184 nodes as well as 3D Torus
configurations.
In contrast, the very dense InfiniBand fabric provided by Sun Datacenter InfiniBand
switches is able to potentially eliminate hundreds of switches and thousands of cables
— dramatically lowering acquisition costs. In addition, replacing physical switches and
cabling with switch chips and traces on printed circuit boards drastically improves
reliability. Standard 12x InfiniBand cables and connectors coupled with a specialized
Sun Blade 6048 Network Express Module can eliminate thousands of additional cables,
providing additional cost, complexity, and reliability improvements. Overall, these
switches provide radical simplification of InfiniBand infrastructure. Sun Datacenter
Switches are available to support both DDR and QDR data rates, with fabric capacities
enumerated in Table 1.
a. Eight switches are required. The Sun Datacenter InfiniBand Switch 648 is capable of supporting
clusters beyond 5,184 servers. The maximum number of nodes is currently determined by the
number of uplink ports (eight) provided by the Sun Blade 6048 InfiniBand QDR Switched NEM.
Together, up to eight of these switches support up to 5,184 nodes in a single cluster. As shown in Figure 2, the Sun Datacenter
InfiniBand Switch 648 also provides extensive cable support and management for clean
and efficient installations.
Figure 2. The Sun Datacenter InfiniBand Switch 648 offers up to 648 QDR/DDR/SDR
4x InfiniBand connections in an 11u rackmount chassis (shown with cable
management arms deployed).
The Sun Datacenter InfiniBand Switch 648 is ideal for deploying fast, dense, and
compact Clos fabrics when used as a part of the Sun Constellation System. Based on
the Mellanox InfiniScale IV 36-port InfiniBand switch device, each switch chassis
connects up to 648 nodes using 12x CXP connectors. The switch represents a full three-
stage Clos fabric, and up to eight Sun Datacenter InfiniBand Switch 648 can be used to
combine up to 54 Sun Blade 6048 chassis in a maximal 5,184-node fabric. Up to three
Sun Datacenter InfiniBand Switch 648 (and up to 1,944 QDR ports) can be deployed in a
single standard rack (Figure 3).
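These port counts follow directly from the radix of the switch silicon. As a back-of-envelope sketch (an illustration, not a vendor formula), the fragment below shows how a folded three-stage Clos built from 36-port chips tops out at 648 node ports per chassis and 5,184 ports across eight chassis, and estimates path latency from the roughly 100 ns per-chip switching latency of the InfiniScale IV cited later in this paper.

```python
# Back-of-envelope sketch: capacity and hop count of a folded three-stage
# Clos fabric built from radix-36 switch chips, as in the Sun Datacenter
# InfiniBand Switch 648. The per-hop latency figure is the approximate
# InfiniScale IV switching latency quoted elsewhere in this paper.

RADIX = 36        # ports per InfiniScale IV switch chip
HOP_NS = 100      # approximate per-chip switching latency (ns)

# In a non-blocking folded Clos, each leaf chip splits its ports evenly
# between node links and spine uplinks, and up to RADIX leaves can be
# joined, so the fabric terminates RADIX * (RADIX / 2) node ports.
ports_per_chassis = RADIX * RADIX // 2
print(ports_per_chassis)              # 648

# Up to eight such chassis combine 54 Sun Blade 6048 chassis of nodes.
print(8 * ports_per_chassis)          # 5,184 -- the maximal fabric

# Worst-case path inside the chassis: leaf -> spine -> leaf = 3 hops;
# the QDR Switched NEM adds one chip at each end, for 5 hops in total.
print(3 * HOP_NS, 5 * HOP_NS)         # roughly 300 ns and 500 ns
```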
The Sun Datacenter InfiniBand Switch 648 is tightly integrated with the Sun Blade 6048
InfiniBand QDR Switched Network Express Module (NEM). 12x cables and CXP
connectors provide a 3:1 cable consolidation ratio. Each dual-height NEM connects up
to 24 compute nodes in a single Sun Blade 6048 shelf to a QDR InfiniBand fabric. Sun’s
approach to InfiniBand networking is highly flexible in that both Clos and mesh/torus
interconnects can be built using the same components. The Sun Blade 6048 InfiniBand
Switched NEM can be used by itself to build mesh and torus fabrics, or in combination
with the Sun Datacenter InfiniBand Switch 648 to build Clos InfiniBand fabrics.
The Sun Datacenter InfiniBand Switch 648 employs a passive midplane. Fabric cards
install vertically and connect to the midplane from the rear of the chassis. Up to nine
line cards install horizontally from the front of the chassis. A three-dimensional
perspective of the fabric provided by the switch is shown in Figure 4, with an example
route overlaid. With this dense switch configuration, InfiniBand packets traverse only
three hops from ingress to egress of the switch, keeping latency very low. The Sun
Blade 6048 InfiniBand QDR Switched NEM adds only two hops, for a total of five. All
InfiniBand routing is managed using a redundant host-based subnet manager.

Figure 3. Up to three Sun Datacenter InfiniBand Switch 648 in a single 19-inch rack
deliver 1,944 QDR ports.
Figure 4. A path through a Sun Datacenter InfiniBand Switch 648 core switch
connects two nodes across horizontal line cards, a vertical fabric card, and the
passive orthogonal midplane.
Depicted in Figure 5, the Sun Datacenter InfiniBand Switch 72 occupies only one rack
unit, offering an ultraslim and ultradense complete switch fabric solution for clusters of
up to 72 nodes.
When used in conjunction with the Sun Blade 6048 Modular System, up to eight Sun
Datacenter InfiniBand Switch 72 can be combined to support clusters of up to 576
nodes. While similar solutions from competitors occupy over 17 rack units, eight 1U Sun
Datacenter InfiniBand Switch 72 save considerable space, and require roughly one third
the number of cables. In addition to simplification, this end-to-end supercomputing
solution offers extremely low latency using industry-standard transport, and
commodity processors including AMD Opteron™, Intel® Xeon®, and Sun SPARC®.
The Sun Datacenter InfiniBand Switch 36 is provisioned with redundant power and
cooling for high availability in demanding datacenter environments, and is shown
in Figure 6.

Figure 6. The Sun Datacenter InfiniBand Switch 36 offers 36 QDR InfiniBand ports
in a 1U form factor.
Chapter 3
Deploying Dense and Scalable Modular Compute
Nodes
Blade technology has offered considerable promise in these areas for some time, but
has often been constrained by legacy blade platforms that locked adopters into
expensive proprietary infrastructure. Power and cooling limitations often meant that
processors were limited to less powerful versions. Limited processing power, memory
capacity, and I/O bandwidth often severely restricted the applications that could be
deployed. Proprietary tie-ins and other constraints in chassis design dictated
networking and interconnect topologies, and I/O expansion options were limited to a
small number of expensive and proprietary modules.
Each server module provides significant I/O capacity as well, with up to 32 lanes of
PCI Express 2.0 bandwidth delivered from each server module to the multiple
available I/O expansion modules (a total of up to 207 Gb/sec supported per server
module).
Within the Sun Blade 6048 Modular System, a chassis monitoring module (CMM)
works in conjunction with the service processor on each server module to form a
complete and transparent management solution. Individual server modules
provide support for IPMI, SNMP, CLI (through serial console or SSH), and HTTP(S)
management methods. In addition, Sun Ops Center provides discovery,
aggregated management, and bulk deployment for multiple systems.
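As an illustration of scripting against these interfaces, the sketch below polls sensor readings from each server module's service processor using the standard ipmitool utility over the IPMI LAN interface. The host naming scheme and credentials are placeholders, not Sun conventions.

```python
# Illustrative sketch only: polling sensors across server modules with
# the standard ipmitool CLI over the IPMI LAN interface. Hostnames,
# credentials, and the management-network naming are placeholders.
import subprocess

def read_sensors(host, user="root", password="changeme"):
    """Return the raw 'ipmitool sensor' listing for one service processor."""
    out = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
         "sensor"],
        capture_output=True, text=True, check=True)
    return out.stdout

if __name__ == "__main__":
    # Hypothetical naming: one service processor per server module slot.
    for slot in range(12):
        print(read_sensors("blade-sp-%02d.mgmt.example.com" % slot))
```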
System Overview
The Sun Blade 6048 chassis provides space for up to 12 server modules in each of its
four shelves — for up to 48 Sun Blade 6000 server modules in a single chassis. This
design approach provides considerable density. Front and rear perspectives of the Sun
Blade 6048 Modular System are provided in Figure 7.
Figure 7. Front and rear perspectives of the Sun Blade 6048 Modular System
With four self-contained shelves per chassis, the Sun Blade 6048 Modular System
houses a wide range of components.
• Up to 48 Sun Blade 6000 server modules insert from the front of the chassis, with 12
modules supported by each shelf.
• A total of eight hot-swap power supply modules insert from the front of the chassis,
with two 8,400-watt 12-volt power supplies (N+N) provided for each shelf. Each
power supply module contains a dedicated fan module.
• Up to 96 hot-plug PCI Express ExpressModules (EMs) insert from the rear of the
chassis (24 per shelf), supporting industry-standard PCI Express interfaces with two
EM slots available for use by each server module.
• Up to four dual-height Sun Blade 6048 InfiniBand NEMs can be installed in a single
chassis (one per shelf). Alternately, up to eight single-height Network Express
Modules (NEMs) can be inserted from the rear, with two NEM slots serving each shelf
of the chassis.
• A chassis monitoring module (CMM) and power interface module are provided for
each shelf. The CMM provides for transparent management access to individual
server modules while the Power Interface Module provides six plugs for the power
supply modules in each shelf.
• Redundant (N+1) fan modules are provided at the rear of the chassis for efficient
front-to-back cooling.
Each server module is energized through the midplane from the redundant chassis
power grid. The midplane also provides connectivity to the I2C network in the chassis,
allowing each server module to directly monitor the chassis environment, including fan
and power supply status as well as various temperature sensors. A number of I/O links
are also routed through the midplane for each server module. Connection details differ
depending on the selected server module and associated NEMs. As an example,
Figure 8 illustrates the dual-node Sun Blade X6275 server module configured with the
Sun Blade 6048 InfiniBand QDR Switched NEM with connections that include:
• An x8 PCI Express 2.0 link connecting from each compute node to a dedicated EM
• Two gigabit Ethernet links to the NEM — one from each compute node
• Two 4x QDR InfiniBand connections to the NEM — one from each compute node
• An Ethernet connection from the server module to the CMM for management
For QDR InfiniBand interconnects, the Sun Blade 6048 InfiniBand QDR Switched NEM
offers the ability to connect up to 24 nodes in a single Sun Blade 6048 shelf. Each NEM
provides all of the connections necessary from the individual server modules to two 36-
port InfiniBand switch chips (Figure 9). The Mellanox InfiniScale IV 36-port switches
offer very low 100 ns latency, and QDR speeds across all ports.
Figure 9. Paired with the Sun Blade X6275 server module, the Sun Blade 6048
InfiniBand QDR Switched NEM connects up to 24 nodes per shelf and exposes 10
12x InfiniBand connections and 24 Gigabit Ethernet ports. [The diagram shows the
NEM’s two 36-port QDR InfiniBand switch chips, the ten external 12x InfiniBand
connectors, and the external Gigabit Ethernet connectors.]
The Sun Blade 6048 InfiniBand QDR Switched NEM is designed to work with a number
of Sun Blade 6000 server modules1. However, the Sun Blade X6275 server module was
optimized to derive maximum density and performance from the Sun Blade 6048
InfiniBand QDR Switched NEM. Ideal for HPC environments where performance and
1. As of this writing, the Sun Blade 6048 InfiniBand QDR Switched NEM supports the Sun Blade X6275 server
module through its on-board QDR HCAs. The Sun Blade X6440 server module is supported via an InfiniBand
Fabric Expansion Module (FEM) at DDR speeds only.
density are key, each Sun Blade X6275 server module features two compute nodes, with
each node supporting two sockets for Intel Xeon Processor 5500 Series CPUs and up to
96 GB of memory (Figure 10).
Figure 10. The Sun Blade X6275 server module provides two compute nodes on a
single server module.
Figure 11 shows a block-level representation of how the Sun Blade X6275 server module
connects to the Sun Blade 6048 InfiniBand QDR Switched NEM. In this configuration,
twelve ports from each switch chip (24 total) are used to communicate with the two
compute nodes on each Sun Blade X6275 server module, with nine ports used to
connect the two switches together. The 30 remaining ports (15 per switch chip) are
used as uplinks to either other QDR switched NEMS or external InfiniBand switches.
Sun Blade 6048 InfiniBand QDR Switched NEMs can be connected together directly to
provide mesh or 3D torus fabrics. Alternately, one or more Sun Datacenter InfiniBand
Switch 648 or 72 can be connected to provide Clos fabric implementations. The external
ports use industry-standard CXP connectors that aggregate three 4x ports into a single
12x connector.
Figure 11. The Sun Blade 6048 InfiniBand QDR Switched NEM connects directly to
Mellanox HCAs on both nodes of the Sun Blade X6275 server module. [The block
diagram shows up to 12 server modules per shelf, each node’s PCIe 2.0 InfiniBand
HCA linking by a 4x port to one of the NEM’s two 36-port QDR switch chips, nine
chip-to-chip links, and the 12x IB uplink cables.]
In the default configuration, and for clusters that utilize up to four Sun Datacenter
InfiniBand Switch 648s, the switch provides a non-blocking fabric. To maintain a non-
blocking fabric in configurations of larger than four switches, an external 12x cable can
link two of the external CXP connectors (one connected to each internal switch chip) to
interconnect the two switches with an additional three 4x connections. This
configuration fully meshes the InfiniScale IV chips on the Sun Blade 6048 InfiniBand
QDR Switched NEM, with a total of 12 ports communicating between the two Mellanox
InfiniScale IV InfiniBand switches, while still leaving 24 4x ports (eight 12x CXP
connectors) available as switch uplinks.
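The arithmetic behind these two configurations can be tallied directly. The sketch below is an illustrative check, not a configuration tool: it budgets the 36 ports of each InfiniScale IV chip on the NEM for the default and externally meshed cases, reproducing the ten 12x uplinks noted in Figure 9 and the eight that remain after meshing.

```python
# Sanity-check sketch of the per-chip port budget on the Sun Blade 6048
# InfiniBand QDR Switched NEM (two 36-port InfiniScale IV chips), in the
# default and the externally meshed configurations described above.

RADIX = 36

def port_budget(meshed=False):
    node_ports = 12                    # 4x links to 12 of the 24 nodes
    isl = 9 + (3 if meshed else 0)     # chip-to-chip links; +3 more via
                                       # the external 12x CXP cable
    uplinks = RADIX - node_ports - isl # what remains for fabric uplinks
    return node_ports, isl, uplinks

for meshed in (False, True):
    node, isl, up = port_budget(meshed)
    assert node + isl + up == RADIX
    # Two chips per NEM; each CXP connector carries three 4x ports.
    print("meshed=%s: %d uplink 4x ports = %d 12x CXP connectors"
          % (meshed, 2 * up, 2 * up // 3))
    # -> 30 4x / 10 CXPs by default, 24 4x / 8 CXPs when meshed
```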
Table 2. Maximum numbers of Sun Blade X6275 server modules and Sun Blade 6048 Modular Systems
supported by various numbers of Sun Datacenter InfiniBand Switch 648.

Sun Blade 6048 Chassis | Sun Datacenter InfiniBand Switch 648 | Line Cards Required | Sun Blade 6048 InfiniBand QDR Switched NEMs | 12x Cables Required | Total Compute Nodes Supported
 1 | 1 |  2 |   4 |    32 |    96
 2 | 1 |  3 |   8 |    64 |   192
 3 | 1 |  4 |  12 |    96 |   288
 4 | 1 |  6 |  16 |   128 |   384
 5 | 1 |  7 |  20 |   160 |   480
 6 | 1 |  8 |  24 |   192 |   576
 8 | 2 | 11 |  32 |   256 |   768
10 | 2 | 14 |  40 |   320 |   960
12 | 2 | 16 |  48 |   384 | 1,152
24 | 4 | 32 |  96 |   768 | 2,304
48 | 8 | 64 | 192 | 1,536 | 4,608
54 | 8 | 72 | 216 | 1,728 | 5,184
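Several columns of Table 2 are simple multiples of the chassis count, which the following sketch reproduces. The Switch 648 and line-card counts reflect non-blocking layout choices and are best read from the table itself.

```python
# Sketch reproducing the formulaic columns of Table 2. Each Sun Blade
# 6048 chassis holds 4 shelves; each shelf takes one QDR Switched NEM
# serving 24 compute nodes (12 dual-node X6275 modules) through 8 uplink
# 12x cables. (Switch-648 and line-card counts depend on non-blocking
# layout choices, so read those columns from the table.)

def fabric_quantities(chassis):
    nems = 4 * chassis      # one dual-height NEM per shelf
    cables = 8 * nems       # eight 12x CXP uplink cables per NEM
    nodes = 24 * nems       # 24 compute nodes per shelf
    return nems, cables, nodes

for chassis in (1, 6, 12, 54):
    print(chassis, fabric_quantities(chassis))
# 54 chassis -> 216 NEMs, 1,728 cables, 5,184 nodes, matching the table
```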
Chapter 4
Scalable and Manageable Storage
Even as the capacities of individual disk drives have risen, and prices have fallen, high-
volume parallel storage systems have remained expensive and complex. With
experience deploying petabytes of storage into large supercomputing clusters, Sun
understands the key issues needed to deliver high-capacity, high-throughput storage in
a cost-effective and manageable fashion. As an example, the Tokyo Institute of
Technology (TiTech) TSUBAME supercomputing cluster was initially deployed with 1.1
petabytes of storage provided by clustered Sun Fire X4500 storage servers and the
Lustre parallel file system.
Figure 12. The Sun Fire X4540 storage server provides up to 48 terabytes of
compact storage, on 48 high-performance SATA disk drives, in only four rack units
— ideal for configuration as cluster scratch space using the Lustre parallel file
system.
The Sun Fire X4540 storage server represents an innovative design that provides
throughput and high-speed access to the 48 directly-attached, hot-plug Serial ATA (SATA)
disk drives. Designed for datacenter deployment, the efficient system is cooled from
front to back across the components and disk drives. Each Sun Fire X4540 storage
server provides:
• Minimal cost per gigabyte utilizing SATA II storage and software RAID 6 with six
SATA II storage controllers connecting to 48 high-performance SATA disk drives
• High performance from an industry-standard x64 server based on two Quad-Core or
enhanced Quad-Core AMD Opteron processors
Parallel file systems are required for moving massive amounts of data through
supercomputing clusters. Given its strengths, the Sun Fire X4540 storage server is now a
standard component of many large supercomputing cluster deployments around the
world. Large grids and clusters need high-performance heterogeneous access to data,
and the Sun Fire X4540 storage server provides both high throughput and the
essential scalability that allows parallel file systems to perform at their best. Together
with the Lustre parallel file system and the Linux OS, the Sun Fire X4540 storage server
also serves as the key component for the Sun Lustre Storage System (Chapter 5).
Building on the strengths of the Lustre parallel file system, the Sun Lustre Storage
System is architected using Sun Open Storage systems that deliver exceptional
performance and provide additional value. The main components of a typical Lustre
architecture include:
• Lustre file system clients (Lustre clients)
• Metadata Servers (MDS)
• Object Storage Servers (OSS)
Metadata Servers and Object Storage Servers implement the file system and
communicate with the Lustre clients. The MDS manages and stores metadata, such as
file names, directories, permissions and file layout. Configurations also require one or
more Lustre Object Storage Server (OSS) modules, which provide scalable I/O
performance and storage capacity.
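In practice, clients exploit multiple OSS modules by striping files across object storage targets. The following is a minimal sketch using the standard Lustre lfs utility driven from Python; the mount point, directory, and stripe count are placeholders (the stripe count of 4 echoes the TACC measurements shown later in Figure 14).

```python
# Illustrative sketch: spreading a scratch directory across Lustre OSTs
# with the standard 'lfs' utility, so that large sequential I/O engages
# several object storage servers in parallel. The mount point and stripe
# count below are placeholders.
import subprocess

SCRATCH = "/mnt/lustre/scratch/run42"   # hypothetical Lustre directory

subprocess.run(["mkdir", "-p", SCRATCH], check=True)

# Stripe new files in this directory across 4 OSTs.
subprocess.run(["lfs", "setstripe", "-c", "4", SCRATCH], check=True)

# Report the resulting layout for verification.
subprocess.run(["lfs", "getstripe", SCRATCH], check=True)
```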
In addition to these standard components, all Sun Lustre Storage System configurations
include a High Availability Lustre Metadata Server (HA MDS) module that provides failover. For
maximum flexibility, the Sun Lustre Storage System defines two OSS modules: a
Standard OSS module for greatest density and economy, and an HA OSS module that
provides OSS failover for environments where automated recovery from OSS failure is
important (Figure 13).
Figure 13. High availability metadata servers (HA MDS) and high availability
object storage servers (HA OSS) allow for file system failover in Lustre
configurations. [The diagram shows Lustre clients reaching the file system servers
over simultaneously supported InfiniBand and Ethernet networks, with object
storage servers (OSS) attached to direct-connect storage arrays and to enterprise
storage arrays and SAN fabrics.]
• HA MDS Module
Designed to meet the critical requirement of high availability, the HA MDS
module is common to all Sun Lustre Storage System configurations. This module
includes a pair of Sun Fire X4270 servers with an attached Sun Storage J4200 array
acting as shared storage. Internal boot drives in the Sun Fire X4270 server are
mirrored for added protection. The Sun Fire X4270 server features two quad-core
Intel Xeon Processor 5500 Series (Nehalem) CPUs and is configured with
24 GB RAM.
Sun can reference many storage installations that have achieved impressive scalability
results. One such reference is the Texas Advanced Computing Center’s Ranger System
(see http://www.tacc.utexas.edu/resources/hpcsystems/#ranger), where Sun has
demonstrated near-linear scalability in a configuration encompassing fifty similar
previous-generation Sun OSS modules with a single HA MDS module supporting a file
system that was 1.2 petabytes in size. Figure 14 shows Lustre file system throughput
at TACC, where rates of 45 GB/sec, with peaks approaching 50 GB/sec, have been
observed. In addition, TACC has measured near-linear throughput of 35 GB/sec from
a single application’s use of the Lustre file system.
Figure 14. Measured Lustre file system throughput at TACC as a function of the
number of writing clients (10 to 10,000), at a stripe count of 4.
More information on implementing the Lustre parallel file system can be found in the
Sun BluePrints article titled Solving the HPC I/O Bottleneck: Sun Lustre Storage System
(http://wikis.sun.com/display/BluePrints/Solving+the+HPC+IO+Bottleneck+-
+Sun+Lustre+Storage+System).
To address this need, Sun Storage 7000 Unified Storage Systems incorporate an open-
source operating system, commodity hardware, and industry-standard technologies.
These systems represent low-cost, fully-functional network attached storage (NAS)
storage devices designed around the following core technologies:
• General-purpose x64-based servers (that function as the NAS head), and Sun Storage
products — proven high-performance commodity hardware solutions with
compelling price-performance points
• The ZFS file system, the world’s first 128-bit file system with unprecedented
availability and reliability features
• A high-performance networking stack using IPv4 or IPv6
• DTrace Analytics, which provide dynamic instrumentation for real-time performance
analysis and debugging
• Sun Fault Management Architecture (FMA) for built-in fault detection, diagnosis, and
self-healing for common hardware problems
• A large and adaptive two-tiered caching model, based on DRAM and enterprise-class
solid state devices (SSDs)
To meet varied needs for capacity, reliability, performance, and price, the product
family includes four models — the Sun Storage 7110, 7210, 7310, and 7410
Unified Storage Systems (Figure 15). Configured with appropriate data processing and
storage resources, these systems can support a wide range of requirements in HPC
environments.
Silent data corruption is corruption that goes undetected, and for which no error
messages are generated. This particular form of data corruption is of special concern to
HPC applications since they typically generate, store, and archive significant amounts
of data. In fact, a study by CERN1 has shown that silent data corruption, including disk
errors, RAID errors, and memory errors, is much more common than previously
imagined. ZFS provides end-to-end checksumming for all data, greatly reducing the risk
of silent data corruption.
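The mechanism can be illustrated with a toy model: keep each block's checksum with the pointer that references it, not with the block itself, and verify on every read, so a corrupted block cannot vouch for itself. The sketch below is a conceptual illustration only, not ZFS's actual on-disk format.

```python
# Toy model of end-to-end checksumming -- the idea behind ZFS's defense
# against silent corruption, NOT its actual on-disk format. The checksum
# lives with the block pointer, away from the data it covers.
import hashlib

class CheckedStore:
    def __init__(self):
        self.blocks = {}    # block id -> raw bytes (may rot silently)
        self.pointers = {}  # block id -> checksum kept with the parent

    def write(self, bid, data):
        self.blocks[bid] = data
        self.pointers[bid] = hashlib.sha256(data).digest()

    def read(self, bid):
        data = self.blocks[bid]
        if hashlib.sha256(data).digest() != self.pointers[bid]:
            raise IOError("checksum mismatch on block %r" % bid)
        return data

store = CheckedStore()
store.write("b0", b"simulation output")
store.blocks["b0"] = b"simulation outpuX"   # simulate a silent bit flip
try:
    store.read("b0")
except IOError as err:
    print(err)  # corruption is detected instead of returned as data
```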
Sun Storage 7000 Unified Storage Systems rely heavily on ZFS for key functionality such
as Hybrid Storage Pools. By automatically allocating space from pooled storage when
needed, ZFS simplifies storage management and gives organizations the flexibility to
optimize data for performance. Hybrid Storage Pools also effectively combine the
strengths of system memory, flash memory technology in the form of enterprise solid
state drives (SSDs), and conventional hard disk drives (HDDs).
Chapter 6 provides additional detail and a graphical depiction of how caching file
systems such as the Lustre parallel file system combine with SAM-QFS in a real-world
example to provide data management in large supercomputing installations.
Chapter 5
Sun HPC Software
As clusters move from the realm of supercomputing to the enterprise, cluster software
has never been more important. Organizations deploying clusters at all levels need
better ways to control and monitor often expansive cluster deployments in ways that
benefit their users and applications. Unfortunately, collecting, assembling, testing, and
patching all of the requisite software components for effective cluster operation has
proved challenging, to say the least.
Available in both a Linux Edition and a Solaris Developer Edition, Sun HPC Software is
designed to address these needs. Sun HPC Software, Linux Edition is detailed in the
sections that follow. For more information on Sun HPC Software, Solaris Developer
Edition, please see http://wikis.sun.com/display/hpc/
Sun+HPC+Software,+Solaris+Developer+Edition+1.0+Beta+1
From its inception, the project’s goals were to provide an “open” product — one that
uses as much open source software as possible, and one that depends on and enhances
the community aspects of software development and consolidation. The ongoing goals
for Sun HPC software are to:
• Provide simple, scalable provisioning of bare-metal systems into a running HPC
cluster
• Validate configurations
• Dramatically reduce time-to-results
• Offer integrated management and monitoring of the cluster
• Employ a community-driven process
Because clusters can vary widely in size, Sun HPC software is designed to be scalable,
and all of the components are selected with large numbers of nodes in mind. For
example, the Lustre parallel file system and OneSIS provisioning software are both well
known for working well with clusters comprised of thousands of nodes. Tools that
provision, verify, manage, and monitor the cluster were likewise selected for their
scalability to reduce the management cost as clusters grow.
Sun HPC software, Linux Edition is built to be completely modular so that organizations
can customize it according to their own preferences and requirements. The modular
framework provides a ready-made stack that contains the components required to
deploy an HPC cluster. Add-on components let organizations make specific choices
beyond the core software installed. Figure 16 provides a high-level perspective of Sun
HPC Software, Linux Edition. For more specific information on the components provided
at each level, please see www.sun.com/software/products/hpcsoftware, or send an e-
mail to linux_hpc_swstack-discuss@sun.com to join the community.
Sun HPC Software, Linux Edition 2.0.1 contains the components listed in Table 3.
The software stack itself can be downloaded from the Web, and is designed to fit onto a
single DVD. While installing the first system in a cluster might take place from the DVD,
it goes without saying that installing an entire large cluster in this fashion would
consume unacceptable amounts of time, not to mention the additional time to
maintain and update individual system images. With OneSIS, administrators can create
system images that define the behavior of the entire computing infrastructure.
The cluster can be running and ready to accept jobs in as little as 50 minutes.
Chapter 6
Deploying Supercomputing Clusters Rapidly
with Less Risk
With Datacenter Express, Sun’s Customer Ready Program builds and tests systems to
exact customer specifications, while Sun Services provides complementary expertise
based upon exact requirements. Having testing and component integration performed
at Sun ISO-certified facilities helps reduce HPC system deployment time and
installation issues, and minimizes unnecessary downtime. Organizations can also leverage the
expertise of Sun Services so that HPC solutions are easier to repair, maintain, and
support — and configurations are easier to scale and modify.
The sections that follow describe several Sun Customer Ready program offerings.
• Sun Compute Cluster
The Sun Compute Cluster offers a high-performance computing cluster solution
that can be customized to specific needs. With a choice of Sun Fire x64 rackmount
servers and Sun Blade Modular Systems, the Sun Compute Cluster allows
organizations to customize their cluster designs without compromise.
As a part of the TeraGrid program, the NSF in mid-2005 issued a request for bids on a
project to configure, implement, and operate a new supercomputer with peak
performance in excess of 400 teraflops, making it one of the most powerful
supercomputer systems in the world. The resulting supercomputer also provides over
100 terabytes of memory, and 1.7 petabytes of disk storage.
Designed as “fast scratch space” for large clusters, the Sun Lustre Storage System
provides the data storage capacity and throughput that these applications require. Key
components of this data storage solution include high-performance Sun Fire X4540 and
X4250 servers, the Lustre scalable cluster file system, and high-speed InfiniBand
interconnects — all integrated into one system. As deployed at TACC, the system scales
to over 1.729 petabytes of raw capacity. The configuration includes a scalable
storage cluster with 72 Sun Fire X4500 servers and over three thousand 500 GB disk
drives, yet occupies only eight physical racks.
The Sun Customer Ready Scalable Storage Cluster that was deployed to implement the
data cache was designed as part of an overall cluster solution (Figure 18). Sun Fire
X4500 storage clusters were tightly integrated with other complementary products,
including Sun StorageTek Storage Archive Manager and the QFS file system, that
provide long-term data archive capabilities to other Sun StorageTek devices. High-
performance Sun Fire servers act as Data Movers, efficiently moving data between the
fast scratch storage of the Sun Customer Ready Storage Cluster and long-term storage.
Figure 18. Data movers provide automated, policy-based data management and
migration between storage tiers. [The diagram shows a compute cluster and the
Lustre metadata and OSS nodes of the parallel file system on an InfiniBand
network, with data movers on an FC-AL SAN network connecting Tier 1 archive and
home directories, Tier 2 fixed-content archive, and Tier 2 near-line tape archive.]
Chapter 7
Conclusion
Coupled with the Sun Blade 6048 InfiniBand QDR Switched NEM, the Sun Datacenter
InfiniBand Switch 648 offers a compelling solution that supports up to 5,184 nodes in a
dense and consolidated InfiniBand Clos fabric. In configurations of up to eight
switches, the Sun Datacenter InfiniBand Switch 72 serves small and medium clusters
that scale up to 576 nodes. With innovative and robust 12x connectors and cables, these
switches drastically reduce the number of switches and cables required for large
supercomputing installations — potentially eliminating hundreds of switches, and
thousands of cables.
The Sun Blade 6048 Modular System is the first blade platform designed for extreme
density and performance. With a choice of the latest SPARC, Intel, and AMD processors,
the Sun Blade 6048 Modular System integrates tightly with the Sun Datacenter
InfiniBand Switch 648 and 72. Fully compatible with the Sun Blade 6000 Modular
System, server modules run standard open-source operating systems such as the Solaris
OS and Linux, and can deploy general-purpose software that does not require custom
coding, compilation, and tuning. A modular and efficient design realizes savings in both
power and cooling.
Along with a breadth of Sun StorageTek storage offerings, the Sun Fire X4540 server
provides one of the most economical and scalable parallel file system building blocks
available — serving effectively as an OSS for the Lustre file system. Supporting up to
48 TB in a single 4U chassis, the Sun Fire X4540 server effectively combines a powerful
multisocket, multicore x64 server with large-scale storage, and direct InfiniBand
connectivity. The Sun Fire X4270 server and optional Sun Storage J4400 arrays can be
combined to build highly effective high-availability MDS and OSS modules for use as
part of the Sun Lustre Storage System. Coupled with the Sun Storage 7000 Unified
Storage System, the scalable ZFS file system effectively implements hybrid storage
pools across system memory, solid state drives, and conventional hard disk drives —
offering scalable and robust storage for tier-1 archival and user home directories.
The Sun Constellation System also supports Sun HPC Software that benefits application
developers and cluster users alike. Integrated tools such as Sun Studio 12 provide the
fastest compilers available, tuned to get the most out of Sun platforms. Sun HPC Cluster
Tools enable the development of cluster-ready applications. Sun Grid Engine provides
distributed resource management along with policy enforcement as it distributes jobs
for execution. Sun Ops Center provides monitoring, patching, and simplified inventory
management for clusters. Together, these tools help ensure that the development and
administration of large clusters remain manageable as they scale toward petascale.
Acknowledgements
This work was inspired by and in part based on previous works from Andreas
Bechtolsheim (International Symposium on SuperComputing 2007), Jim Waldo (for his
Sun Labs paper “On System Design” SMLI-PS-2006-6), and Ivan Sutherland (for his paper
“Technology and Courage” SMLI-PS-96-1).
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2007-2009 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, Lustre, Solaris, Sun Fire, Sun Blade, StorageTek, and ZFS are trademarks or registered trademarks of Sun Microsystems, Inc. or its subsidiaries
in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the US and other countries. Products bearing SPARC
trademarks are based upon an architecture developed by Sun Microsystems, Inc. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. AMD
and Opteron are trademarks or registered trademarks of Advanced Micro Devices, Inc. Information subject to change without notice. Printed in USA. SunWIN #:537015 11/09