
PATHWAYS TO

OPEN PETASCALE COMPUTING


The Sun™ Constellation System — designed for performance

White Paper
November 2009

“Make everything as simple as possible, but not simpler”

— Albert Einstein
Sun Microsystems, Inc.

Table of Contents

Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Pathways to Open Petascale Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2


The Unstoppable Rise of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
The Importance of a Balanced and Rigorous Design Methodology . . . . . . . . . . . . . . 4
The Sun Constellation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

Fast, Large, and Dense InfiniBand Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 7


The Fabric Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Sun Datacenter Switches for InfiniBand Fabrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

Deploying Dense and Scalable Modular Compute Nodes . . . . . . . . . . . . . . . . . . . 15


Compute Node Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
The Sun Blade 6048 Modular System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Scaling to Multiple Sun Datacenter InfiniBand Switch 648. . . . . . . . . . . . . . . . . . . . 23

Scalable and Manageable Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


Storage for Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Clustered Sun Fire X4540 Servers as Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
The Sun Lustre Storage System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
ZFS and Sun Storage 7000 Unified Storage Systems . . . . . . . . . . . . . . . . . . . . . . . . . 30
Long-Term Retention and Archive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Sun HPC Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


Sun HPC Software, Linux Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Seamless and Scalable Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Simplified Cluster Provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Deploying Supercomputing Clusters Rapidly with Less Risk . . . . . . . . . . . . . . . . . 37


Sun Datacenter Express Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Sun Architected HPC Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A massive supercomputing cluster at the Texas Advanced Computing Center . . . . 38

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
For More Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Executive Summary

From weather prediction and global climate modeling to minute sub-atomic analysis
and other grand-challenge problems, modern supercomputers often provide the key
technology for unlocking some of the most critical challenges in science and
engineering. These essential scientific, economic, and environmental issues are
complex and daunting — and many require answers that can only come from the
fastest available supercomputing technology. In the wake of the industry-wide
migration to terascale computing systems, an open and predictable path to petascale
supercomputing environments has become essential.

Unfortunately, the design, deployment, and management of very large terascale and petascale clusters and grids have remained elusive and complex. While a few have accomplished petascale deployments, they have been largely proprietary in nature, and have come at a high cost. In fact, it is often difficult to reach petascale not because of inherent architectural limitations, but because of the practicalities of attempting to scale architectures to their full potential. Seemingly simple concerns — heat, power,
cooling, cabling, and weight — are rapidly overloading the vast majority of even the
most modern datacenters.

Sun understands that the key to building petascale supercomputers lies in a balanced
and systemic infrastructure design approach, along with careful application of the
latest technology advancements. Derived from Sun’s experience and innovation with
very large supercomputing deployments, the Sun™ Constellation System provides the
world's first open petascale computing environment — one built entirely with open
and standard hardware and software technologies. Cluster architects can use the Sun
Constellation System to design and rapidly deploy tightly-integrated, efficient, and cost-
effective supercomputing clusters that scale predictably from a few teraflops to over a
petaflop. With a completely modular approach, processors, memory, interconnect
fabric, and storage can all be scaled independently depending on individual needs.

Best of all, the Sun Constellation System is an enterprise-class Sun-supported offering comprised of general-purpose compute nodes, interconnects, and storage components
that can be deployed very rapidly. In fact, existing supercomputing clusters have
already been built using the system. For instance, the Texas Advanced Computing
Center (TACC) at the University of Texas at Austin partnered with Sun to deploy the Sun
Constellation System as their Ranger supercomputing cluster1 — with a peak
performance rating of over 500 teraflops. This document describes the key challenges
and constraints involved in the build-out of petascale supercomputing architectures,
including network fabrics, multicore modular compute systems, storage, open HPC
software, and general-purpose I/O.

1.http://www.tacc.utexas.edu/resources/hpcsystems/#constellation

Chapter 1
Pathways to Open Petascale Computing

Most practitioners in today's high-performance computing (HPC) marketplace would readily agree that the industry is well into the age of terascale systems.
Supercomputing systems capable of processing multiple teraflops are becoming
commonplace. These systems are readily being built using mostly commercial off-the-
shelf (COTS) components with the ability to address terabytes and petabytes of storage,
and more recently, terabytes of system memory (generally as distributed shared
memory and storage pools, or even as a single system image at the high end).

Only a few years ago, general-purpose terascale computing clusters constructed of COTS components were hard to imagine. Though they were on several industry roadmaps, such systems were widely regarded as impractical due to limitations in the
scalability of the interconnects and fabrics that tie disparate systems together. Through
competitive innovation and the race to be the fastest, the industry has been driven into
the realm of practical and commercially-viable terascale systems — and now to the
edge of pondering what similar limitations, if any, lie ahead in the design of open
petascale systems.

The Unstoppable Rise of Clusters


In the last five years, technologies used to build the world's fastest supercomputers
have evolved rapidly. In fact, clusters of smaller interconnected rackmount and blade
systems now represent a majority of the supercomputers on the Top500 list of
supercomputing sites1 — steadily replacing vector supercomputers and other large
systems that dominated previously. Figure 1 shows the relative shares of various
supercomputing architectures comprising the Top500 list from 1993 through 2009,
establishing clear acceptance of clusters as leading supercomputing technology.

1.www.top500.org

[Figure 1 chart: Top500 architecture share over time, 1993 - 2009 — number of systems (0 to 500) per semiannual Top500 release, broken out by MPP, Cluster, SMP, Constellations, Single Processor, and other architectures]

Figure 1. In the last five years, clusters have increasingly dominated the Top500
list architecture share (image courtesy www.top500.org)

Not only have clusters provided access to supercomputing resources for increasingly
larger groups of researchers and scientists, but the largest supercomputers in the world
are now built using cluster architectures. This trend has been assisted by an explosion
in performance, bandwidth, and capacity for key technologies, including:
• Faster processors, multicore processors, and multisocket rackmount and blade
systems
• Inexpensive memory and system support for larger memory capacity
• Faster standard interconnects such as InfiniBand
• Higher aggregated storage capacity from inexpensive commodity disk drives

Unfortunately, significant challenges remain that have stifled the growth of true open
petascale-class supercomputing clusters. Time-to-deployment constraints have resulted
from the complexity of deploying and managing large numbers of compute nodes,
switches, cables, and storage systems. The programmability of extremely large clusters remains an issue. Environmental factors are also paramount, since deployments must
often take place in existing datacenter space with strict constraints on physical
footprint, as well as power and cooling.

In addition to these challenges, most petascale computational users also have unique
requirements for clustered environments beyond those of less demanding HPC users,
including:
• Scalability at the socket and core level — Some have espoused large grids of
relatively low-performance systems, but lower performance only increases the number of nodes required to solve very large computational problems.
• Density in all things — Density is not just a requirement for compute nodes, but for
interconnect fabrics and storage solutions as well.
• A scalable programming and execution model — Programmers need to be able to
apply their programmatic challenges to massively-scalable computational resources
without special architecture-specific coding requirements.
• A lightweight grid model — Demanding applications need to be able to start
thousands of jobs quickly, distributing workloads across the available computational
resources through highly-efficient distributed resource management (DRM) systems.
• Open and standards-based solutions — Programmatic solutions must not cause
extensive porting efforts, or be dedicated to particular proprietary architectures or
environments, and datacenters must remain free to purchase the latest high-
performance computational gear without being locked into proprietary or dead-end
architectures.

The Importance of a Balanced and Rigorous Design Methodology

As anyone who has witnessed prior generations of supercomputing and HPC
architectures can attest, scaling gracefully is not simply a matter of accelerating
systems that already perform well. Bigger versions of existing technologies are not
always better. Regrettably, the pathways to teraflop systems are littered with the
products and technologies from dozens of companies that simply failed to adapt along
the way.

Many technologies have failed because the fundamental principles that worked in
small clusters simply could not scale effectively when re-cast in a run-time environment
thousands of times larger or faster than their initial implementations. For example, 10 Gigabit Ethernet — though a significant accomplishment — is known in the supercomputing realm to exhibit latency variable enough to make it impractical for situations where guaranteed low latency and throughput dominate
performance. Ultimately, building petascale-capable systems is about being willing to
fundamentally rethink design, using the latest available components that are capable
of meeting or exceeding specified data rates and capacities.
Put simply, getting to petascale requires balance and massive scalability in all
dimensions, including scalable tools and frameworks, processors, systems,
interconnects, and storage — as well as the ability to accommodate changes that
allow software to scale accordingly.

Key challenges for petascale environments include:


• Keeping floating-point operations (FLOPs) to memory bandwidth ratios balanced to minimize the effects of memory latency, with each FLOP representing at least two loads and one store (a back-of-envelope illustration follows this list)
• Allowing for the practical scaling of the interconnect fabric to allow the connection of
tens of thousands of nodes
• Exploiting the considerable investment, momentum, and cost savings of commodity
multicore x64 processors, tools, and software
• Overcoming software challenges such as the forward portability of HPC codes to new
architectures, scalability limitations, reliability, robustness, and being able to take
advantage of multicore multiprocessor system architectures
• Architecting to account for the opportunity to take advantage of external floating
point, vector, and/or general purpose processing on graphics processing unit
(GP/GPU) solutions within a cluster framework
• Designing the highest levels of density into compute nodes, interconnect fabrics, and
storage solutions in order to facilitate large and compact clusters
• Building systems with efficient power and cooling to accommodate the broadest
range of datacenter facilities and to help ensure the highest levels of reliability
• Architecting the cluster such that compute-intensive applications have access to fast cluster scratch storage space for a balanced computational approach
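
To make the first challenge concrete, the following back-of-envelope sketch (written in Python, with illustrative assumed values rather than measured figures) shows how quickly a raw FLOP rate outruns memory bandwidth when every operand must come from memory:

    # Illustrative balance check: if every floating-point operation needed two loads
    # and one store from memory, the bandwidth required to keep the FLOPs fed is
    #   bandwidth = flop_rate x operands_per_flop x bytes_per_operand

    def required_bandwidth_gbs(flop_rate_gflops, operands_per_flop=3, bytes_per_operand=8):
        """Memory bandwidth (GB/s) needed if no operand is ever reused from cache."""
        return flop_rate_gflops * operands_per_flop * bytes_per_operand

    # Example: a hypothetical 80 GFLOP/s compute node
    print(required_bandwidth_gbs(80.0))   # 1920 GB/s with zero cache reuse

Real nodes sustain only tens of gigabytes per second, which is exactly why cache reuse and a balanced FLOP-to-bandwidth ratio, rather than peak FLOPs alone, dominate delivered performance.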

These challenges serve as reminders that the value of genuine innovation in the
marketplace must never be underestimated — even as design-cycle times shrink and
the pressures of time to market grow with the demand for faster, cheaper, and standards-based solutions.

The Sun™ Constellation System


Since its inception, Sun has been focused on building balance and even elegance into
its system designs. The Sun Constellation System represents a tangible application of
this philosophy on a grand scale — in the form of a systematic approach to building
terascale and petascale supercomputing clusters. Specifically, the Sun Constellation
System delivers an open architecture that is designed to allow organizations to build
clusters that scale seamlessly from a few racks to teraflops or petaflops of performance.
With an overall datacenter focus, Sun is free to innovate at all levels of the system —
from switching fabric, to core system and storage elements, to HPC and file system
software. As a systems company, Sun looks beyond existing technologies toward
solutions that optimize the simultaneous equations of cost, space, practicality, and
complexity. In the form of the Sun Constellation System, this systemic focus combines a
massively-scalable InfiniBand interconnect with very dense computational and storage
solutions — in a single architecture that functions as a cohesive system. Organizations
can now obtain all of these tightly-integrated building blocks from a single vendor, and
benefit from a unified management approach.

Components of the Sun Constellation System include:


• The Sun Datacenter InfiniBand Switch 648, offering up to 648 QDR/DDR ports in a
single 11 rack unit (11U) chassis, and supporting clusters of up to 5,184 nodes with
multiple switches
• The Sun Datacenter InfiniBand Switch 72, offering up to 72 QDR/DDR ports in a
compact 1U form factor, and supporting clusters of up to 576 nodes with multiple
switches
• The Sun Datacenter InfiniBand Switch 36, offering up to 36 QDR/DDR ports in a 1U form factor
• The Sun Blade™ 6048 Modular System, providing an ultra-dense InfiniBand-connected
blade platform with support for up to 48 multiprocessor, multicore Sun Blade 6000
server modules and up to 96 compute nodes in a rack-sized chassis
• Sun Fire™ X4540 storage clusters, serving as an economical InfiniBand-connected
parallel file system building block, with support for up to 48 terabytes in only four
rack units and up to 480 terabytes in a single rack
• The Sun Storage 7000 Unified Storage System, integrating enterprise flash
technology through ZFS™ hybrid storage pools and DTrace Analytics to provide
economical, scalable, and transparent storage
• The Sun Lustre™ Storage System, a simple-to-deploy storage environment based on
the Lustre parallel file system, Sun Fire servers, and Sun Open Storage platforms
• Sun HPC Software, encompassing integrated developer tools, Sun™ Grid Engine
infrastructure, advanced ZFS and Lustre file systems, provisioning, monitoring,
patching, and simplified inventory management — available in both a Linux Edition
and a Solaris™ Operating System (OS) Developer Edition

The Sun Constellation System provides an open systems supercomputer architecture designed for petascale computing — as an integrated and Sun-supported product. This
holistic approach offers key advantages to those designing and constructing the largest
supercomputing clusters:
• Massive scalability in terms of optimized compute, storage, interconnect, and
software technologies and services
• Simplified cluster deployment with open HPC software that can rapidly turn bare-
metal systems into functioning clusters that are ready to run
• A dramatic reduction in complexity through integrated connectivity and management to reduce start-up, development, and operational costs
• Breakthrough economics from technical innovation that results in fewer, more reliable components and high-efficiency systems in a tightly-integrated solution

Along with key technologies and the experience of helping design and deploy some of
the world’s largest supercomputing clusters, these strengths make Sun an ideal partner
for delivering open high-end terascale and petascale architecture.

Chapter 2
Fast, Large, and Dense InfiniBand Infrastructure

Building the largest supercomputing grids presents significant challenges, with fabric
technology paramount among them. Sun set out to design an InfiniBand architecture for
maximum flexibility and fabric scalability, and to drastically reduce the cost and
complexity of delivering large-scale HPC solutions. Achieving these goals required a
delicate balancing act — one that weighed the speed and number of nodes along with
a sufficiently fast interconnect to provide minimal and predictable levels of latency.

The Fabric Challenge


For many applications, the interconnect fabric is already the element that limits
performance. One unavoidable driver is that faster processors require a faster
interconnect. Beyond merely employing a fast technology, the fabric must scale
effectively with both the speed and number of systems and processors. Interconnect
fabrics for large terascale and petascale deployments require:
• Low latency
• High bandwidth
• The ability to handle fabric congestion
• High reliability to avoid interruptions
• Open standards such as OpenFabrics and the OpenMPI software stack

InfiniBand technology has emerged as an attractive fabric for building large supercomputing clusters. As an open standard, InfiniBand presents a compelling choice
over proprietary interconnect technologies that depend on the success and innovation
of a single vendor. InfiniBand also presents a number of significant technical
advantages:
• A switched fabric offers considerable scalability, supporting large numbers of
simultaneous collision-free connections with virtually no increase in latency.
• Host channel adaptors (HCAs) with remote direct memory access (RDMA) support
offload communications processing from the processor and operating system,
leaving more processor resources available for computation.
• Fault isolation and troubleshooting are easier in switched environments since
problems can be isolated to a single connection.
• Applications that rely on bandwidth or quality of service are also well served, since
they each receive their own dedicated bandwidth.

Even with these advantages, building the largest InfiniBand clusters and grids has
remained complex and expensive — primarily because of the need to interconnect very
large numbers of computational nodes. Traditional large clusters require literally
thousands of cables and connections and hundreds of individual core and leaf switches
— adding considerable expense, weight, cable management complexity, and
consumption of valuable datacenter rack space. It is clear that density, consolidation,
and management efficiencies are important not just for computational platforms, but
for InfiniBand interconnect infrastructure as well.

Even with very significant accomplishments in terms of processor performance and computational density, large clusters are ultimately constrained by real estate and
the complexities and limitations of interconnect technologies. Cable length limitations
constrain how many systems can be connected together in a given physical space while
avoiding increased latency. Interconnect topologies play a vital role in determining the
properties that clustered systems exhibit. Mesh, torus (or toroidal), and Clos topologies
are popular choices for interconnected supercomputing clusters and grids.

Mesh and 3D Torus Topologies


In mesh and 3D torus topologies, each node connects to its neighbors in the x, y, and z
dimensions, with six connecting ports per node. Some of the most notable
supercomputers based upon torus topologies include IBM’s BlueGene and Cray’s XT3/
XT4 supercomputers. Torus fabrics have generally been easier to build than Clos topologies. Unfortunately, torus topologies represent a
blocking fabric, where interconnect bandwidth can vary between nodes. Torus fabrics
also provide variable latency due to variable hop count, and application deployment for
torus fabrics must carefully consider node locality as a result. For some specific
applications that express a nearest-neighbor type of communication pattern, torus
topologies are a good fit. Computational fluid dynamics (CFD) is one such application.
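
The latency variability of a torus can be made concrete with a short sketch (the torus dimensions below are assumed for illustration and do not describe a specific product):

    # Hop counts between nodes in a 3D torus vary with node placement, while a
    # multistage Clos delivers a constant number of switch hops by construction.
    from itertools import product

    def torus_hops(a, b, dims):
        """Minimal hop count between nodes a and b in a torus with wraparound links."""
        return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

    dims = (8, 8, 8)                      # a hypothetical 512-node 3D torus
    nodes = list(product(*[range(d) for d in dims]))
    origin = (0, 0, 0)
    hops = [torus_hops(origin, n, dims) for n in nodes if n != origin]
    print(min(hops), max(hops), sum(hops) / len(hops))   # 1, 12, roughly 6 hops

In a three-stage Clos fabric, by contrast, every path crosses the same number of stages, so hop count, and therefore base latency, is uniform regardless of which two nodes communicate.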

Clos Fat Tree Topologies


First described by Charles Clos in 1953, Clos networks have long formed the basis for
practical multistage telephone switching systems. Clos networks utilize a “fat tree”
topology, allowing complex switching networks to be built using many fewer
crosspoints than if the entire system were implemented as a single large crossbar
switch. Clos switches are typically comprised of multiple tiers and stages (hops), with
each tier built from a number of crossbar switches. Connectivity exists only between
switch chips on adjacent tiers.

Clos fabrics have the advantage of being non-blocking, in that each attached node has
a constant bandwidth. In addition, an equal number of stages between nodes provides
for uniform latency. Historically, the disadvantage of large Clos networks was that they
required more resources to build.
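
As a concrete illustration, the following sketch (a simplified model that assumes radix-36 crossbar elements and a folded, non-blocking, three-stage Clos) counts the switching elements and crosspoints needed for a 648-port fabric. The 54-element result is consistent with the switching-element count quoted for the Sun Datacenter InfiniBand Switch 648 later in this paper, and the crosspoint comparison shows why fat trees beat a single large crossbar:

    import math

    def folded_clos(total_ports, radix):
        """Element and crosspoint counts for a folded three-stage Clos fabric."""
        down = radix // 2                          # element ports facing end nodes
        leaves = math.ceil(total_ports / down)     # first/third (folded) stage
        spines = math.ceil(leaves * down / radix)  # middle stage
        crosspoints = (leaves + spines) * radix * radix
        return leaves, spines, crosspoints

    leaves, spines, xpts = folded_clos(648, 36)
    print(leaves, spines, leaves + spines)    # 36 leaf + 18 spine = 54 elements
    print(xpts, 648 * 648)                    # 69,984 vs. 419,904 crosspoints for one crossbar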

Constructing Large Switched Supercomputing Clusters


Constructing very large InfiniBand Clos switches in particular is governed by a number
of practical constraints, including the number of ports available in individual switch
elements, maximum achievable printed circuit board size, and maximum connector
density. Sun has employed considerable innovation in all of these areas, and provides
both dual data rate (DDR) and quad data rate (QDR) scalable InfiniBand fabrics. For
example, as a part of the Sun Constellation System, Sun InfiniBand infrastructure can
provide both QDR Clos clusters that can scale up to 5,184 nodes as well as 3D Torus
configurations.

Sun Datacenter Switches for InfiniBand Fabrics


Recognizing the considerable promise of InfiniBand interconnects, Sun has made
InfiniBand connectivity a core competency, and has set out to design scalable and
dense switches that avoid many of the conventional limitations. Not content to accept
the status quo in terms of available InfiniBand switching, cabling, and host adapters,
Sun engineers used their considerable networking and datacenter experience to view
InfiniBand technology from a systemic perspective.

Key Technical Innovations for Sun Datacenter InfiniBand Switches


Sun Datacenter InfiniBand Switches 36, 72, and 648 are components of a complete
system that is based on multiple technical innovations, including:
• The Sun Datacenter InfiniBand Switch 648 chassis implements a three-stage Clos fabric with up to 54 36-port Mellanox InfiniScale IV switching elements, integrated into a single 11U rackmount enclosure.
• Industry-standard 12x CXP connectors on Sun Datacenter InfiniBand Switch 72 and
648 consolidate three discrete InfiniBand 4x connectors, resulting in the ability to
host 72 4x ports through 24 physical 12x connectors.
• Complementing the 12x CXP connector, a 12x trunking cable carries signals from
three servers to a single switch connector, offering a 3:1 cable reduction when used
for server trunking, and reducing the number of cables needed to support 648 servers
to 216. A splitter cable that converts one 12x connection to three 4x connections is
provided for connectivity to systems and storage that require 4x QSFP connectors.
• A custom-designed double-height Network Express Module (NEM) for the Sun Blade
6048 Modular System provides seamless connectivity to both the Sun Datacenter
InfiniBand Switch 648 and 72. Using the same 12x CXP connectors, the Sun Blade
6048 InfiniBand QDR Switched NEM can trunk up to 12 Sun Blade 6000 server
modules (up to 24 compute nodes) in a single Sun Blade 6048 Modular System shelf.
The NEM together with the 12x CXP cable facilitates connectivity of up to 5,184
servers in a 5-stage Clos topology.
• Fabric topology for forwarding InfiniBand traffic is established by a redundant host-
based Subnet Manager. A host-based solution allows the Subnet Manager to take
advantage of the full resources of a general-purpose multicore server.

Massive Switch and Cable Consolidation


Given the scale involved with building supercomputing clusters and grids, cost and
complexity figure importantly. Regrettably, traditional approaches to using InfiniBand
for massive connectivity have required very large numbers of conventional switches
and cables. In these configurations, many cables and ports are consumed redundantly
connecting core and leaf switches together, making advertised per-port switch costs
relatively meaningless, and reducing reliability through extra cabling.

In contrast, the very dense InfiniBand fabric provided by Sun Datacenter InfiniBand
switches is able to potentially eliminate hundreds of switches and thousands of cables
— dramatically lowering acquisition costs. In addition, replacing physical switches and
cabling with switch chips and traces on printed circuit boards drastically improves
reliability. Standard 12x InfiniBand cables and connectors coupled with a specialized
Sun Blade 6048 Network Express Module can eliminate thousands of additional cables,
providing additional cost, complexity, and reliability improvements. Overall, these
switches provide radical simplification of InfiniBand infrastructure. Sun Datacenter
Switches are available to support both DDR and QDR data rates, with fabric capacities
enumerated in Table 1.

Table 1. Sun Datacenter InfiniBand Switch capacities

  InfiniBand Switch                      Data Rate (Connector)            Maximum Supported   Maximum
                                                                          Nodes per Switch    Clos Fabric
  Sun Datacenter InfiniBand Switch 648   QDR or DDR (up to 216 12x CXP)   648                 5,184 (a)
  Sun Datacenter InfiniBand Switch 72    QDR or DDR (24 12x CXP)          72                  576 (a)
  Sun Datacenter InfiniBand Switch 36    QDR or DDR (36 4x QSFP)          36                  —

  a. Eight switches are required. The Sun Datacenter InfiniBand Switch 648 is capable of supporting clusters beyond 5,184 servers. The maximum number of nodes is currently determined by the number of uplink ports (eight) provided by the Sun Blade 6048 InfiniBand QDR Switched NEM.
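
The cable consolidation behind these configurations is simple arithmetic; a small sketch (assuming one 4x link per node and three 4x links carried by each 12x CXP trunking cable):

    import math

    def trunking_cables(nodes, links_per_cable=3):
        """12x CXP cables needed to attach a given number of 4x-connected nodes."""
        return math.ceil(nodes / links_per_cable)

    for nodes in (72, 648, 5184):
        print(nodes, trunking_cables(nodes))
    # 72 -> 24, 648 -> 216, 5184 -> 1728 trunking cables, versus one discrete 4x
    # cable per node (plus separate core-to-leaf cabling) in a conventional build.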

The Sun Datacenter InfiniBand Switch 648


The Sun Datacenter InfiniBand Switch 648 is designed to drastically reduce the cost and
complexity of delivering large-scale HPC solutions, such as those scaled for leadership
in the Top500 list of supercomputing sites, as well as smaller and moderately-sized
enterprise and HPC applications in scientific, technical, and financial markets. Each Sun
Datacenter InfiniBand Switch 648 provides up to 648 QDR InfiniBand ports in only 11
rack units (11U). Up to eight Sun Datacenter InfiniBand Switch 648 can be combined to
support up to 5,184 nodes in a single cluster. As shown in Figure 2, the Sun Datacenter
InfiniBand Switch 648 also provides extensive cable support and management for clean
and efficient installations.

Figure 2. The Sun Datacenter InfiniBand Switch 648 offers up to 648 QDR/DDR/SDR
4x InfiniBand connections in an 11u rackmount chassis (shown with cable
management arms deployed).

The Sun Datacenter InfiniBand Switch 648 is ideal for deploying fast, dense, and
compact Clos fabrics when used as a part of the Sun Constellation System. Based on
the Mellanox InfiniScale IV 36-port InfiniBand switch device, each switch chassis
connects up to 648 nodes using 12x CXP connectors. The switch represents a full three-
stage Clos fabric, and up to eight Sun Datacenter InfiniBand Switch 648 can be used to
combine up to 54 Sun Blade 6048 chassis in a maximal 5,184-node fabric. Up to three
Sun Datacenter InfiniBand Switch 648 (and up to 1,944 QDR ports) can be deployed in a
single standard rack (Figure 3).

The Sun Datacenter InfiniBand Switch 648 is tightly integrated with the Sun Blade 6048
InfiniBand QDR Switched Network Express Module (NEM). 12x cables and CXP
connectors provide a 3:1 cable consolidation ratio. Each dual-height NEM connects up
to 24 compute nodes in a single Sun Blade 6048 shelf to a QDR InfiniBand fabric. Sun’s
approach to InfiniBand networking is highly flexible in that both Clos and mesh/torus
interconnects can be built using the same components. The Sun Blade 6048 InfiniBand
Switched NEM can be used by itself to build mesh and torus fabrics, or in combination
with the Sun Datacenter InfiniBand Switch 648 switch to build Clos InfiniBand fabrics.

Figure 3. Up to three Sun Datacenter InfiniBand Switch 648 in a single 19-inch rack deliver 1,944 QDR ports.

The Sun Datacenter InfiniBand Switch 648 employs a passive midplane. Fabric cards install vertically and connect to the midplane from the rear of the chassis. Up to nine line cards install horizontally from the front of the chassis. A three-dimensional perspective of the fabric provided by the switch is shown in Figure 4, with an example route overlaid. With this dense switch configuration, InfiniBand packets traverse only three hops from ingress to egress of the switch, keeping latency very low. The Sun Blade 6048 InfiniBand QDR Switched NEM adds only two hops, for a total of five. All InfiniBand routing is managed using a redundant host-based Subnet Manager.
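
Hop counts translate directly into latency. A rough estimate (assuming roughly 100 ns per InfiniScale IV switch hop, the device latency quoted later in this paper, and ignoring HCA and cable delays):

    def fabric_latency_ns(hops, per_hop_ns=100):
        """Very rough switch-fabric latency estimate from hop count alone."""
        return hops * per_hop_ns

    print(fabric_latency_ns(3))   # ~300 ns across the three stages of a Switch 648
    print(fabric_latency_ns(5))   # ~500 ns NEM-to-NEM through the core switch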

[Figure 4 diagram: nine fabric cards and nine line cards, with a path and an alternate path through the switch highlighted]

Figure 4. A path through a Sun Datacenter InfiniBand Switch 648 core switch
connects two nodes across horizontal line cards, a vertical fabric card, and the
passive orthogonal midplane.

The Sun Datacenter InfiniBand Switch 72


The Sun Datacenter InfiniBand Switch 72 leverages many of the innovations found in
the Sun Datacenter InfiniBand Switch 648, while offering support for smaller and mid-
sized configurations. Like the larger 648-port switch, the Sun Datacenter InfiniBand
Switch 72 offers QDR and DDR connectivity, extreme density, and unrivaled cable
aggregation for Sun Blade and Sun Fire servers as well as Sun storage solutions.

Depicted in Figure 5, the Sun Datacenter InfiniBand Switch 72 occupies only one rack
unit, offering an ultraslim and ultradense complete switch fabric solution for clusters of
up to 72 nodes.

Figure 5. The Sun Datacenter InfiniBand Switch 72 offers 72 4x QDR InfiniBand ports in a 1U form factor

When used in conjunction with the Sun Blade 6048 Modular System, up to eight Sun
Datacenter InfiniBand Switch 72 can be combined to support clusters of up to 576
nodes. While similar solutions from competitors occupy over 17 rack units, eight 1U Sun
Datacenter InfiniBand Switch 72 save considerable space, and require roughly one third
the number of cables. In addition to simplification, this end-to-end supercomputing
solution offers extremely low latency using industry-standard transport, and
commodity processors including AMD Opteron™, Intel® Xeon®, and Sun SPARC®.

The Sun Datacenter InfiniBand Switch 72 provides the following specifications:


• 72 QDR/DDR/SDR 4x InfiniBand ports (expressed through 24 12x CXP connectors)
• Data throughput of 4.6 Tb/sec.
• Port-to-port latency of 300ns (QDR)
• Eight data virtual lanes
• One management virtual lane
• 4096 byte MTU
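
The quoted aggregate throughput follows directly from the port count and the QDR line rate. A quick check (assuming 10 Gb/s signaling per lane with 8b/10b encoding, so 32 Gb/s of data per 4x port, counted in both directions):

    def aggregate_tbps(ports, lanes=4, lane_gbps=10.0, encoding=8 / 10, bidirectional=True):
        """Aggregate data throughput in Tb/s for a switch with the given 4x port count."""
        per_port = lanes * lane_gbps * encoding          # 32 Gb/s per 4x QDR port
        return ports * per_port * (2 if bidirectional else 1) / 1000.0

    print(aggregate_tbps(72))    # ~4.6 Tb/s -- Sun Datacenter InfiniBand Switch 72
    print(aggregate_tbps(36))    # ~2.3 Tb/s -- Sun Datacenter InfiniBand Switch 36
    print(aggregate_tbps(648))   # ~41.5 Tb/s by the same arithmetic for the Switch 648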

Sun Datacenter InfiniBand Switch 36


Leveraging the properties of the InfiniBand architecture, the Sun Datacenter InfiniBand
Switch 36 helps organizations deploy smaller high-performance fabrics in demanding
high-availability (HA) environments. The switch supports the creation of logically
isolated sub-clusters, as well as advanced features for traffic isolation and Quality of
Service (QoS) management — preventing faults from causing costly service disruptions.
The embedded InfiniBand fabric management module supports active/hot-standby
dual-manager configurations, helping to ensure a seamless migration of the fabric
management service in the event of a management module failure. The Sun
Datacenter InfiniBand Switch 36 is provisioned with redundant power and cooling for
high availability in demanding datacenter environments. The Sun Datacenter
InfiniBand Switch 36 is shown in Figure 6.

Figure 6. The Sun Datacenter InfiniBand Switch 36 offers 36 QDR InfiniBand ports
in a 1U form factor

The Sun Datacenter InfiniBand Switch 36 provides the following specifications:


• 36 QDR/DDR/SDR 4x InfiniBand ports (expressed through 36 4x QSFP connectors)
• Data throughput of 2.3 Tb/sec.
• Port-to-port latency of 100ns (QDR)
• Eight data virtual lanes
• One management virtual lane
• 4096 byte MTU

Chapter 3
Deploying Dense and Scalable Modular Compute
Nodes

Implementing terascale and petascale supercomputing clusters depends heavily on having access to large numbers of high-performance systems with large memory
support and high memory bandwidth. As a part of the Sun Constellation System, Sun’s
approach is to combine the considerable and constant performance gains in the
standard processor marketplace with the advantages of modular architecture. This
approach results in some of the fastest and most dense systems possible — all tightly
integrated with Sun Datacenter InfiniBand switches.

Compute Node Requirements


While some supercomputing architectures employ very large numbers of slower
proprietary nodes, this approach does not translate easily to petascale. The
programmatic implications alone of handling literally millions of nodes are not
particularly appealing — much less the physical realities of managing and housing
such systems. Instead, building large and open terascale and petascale systems
depends on key capabilities for compute nodes, including:
• High Performance
Compute nodes must provide very high peak levels of floating-point performance.
Likewise, because floating-point performance is dependent on multiple memory
operations, equally high levels of memory bandwidth must be provided. I/O
bandwidth is also crucial, yielding high-speed access to storage and other
compute nodes.
• Density, Power, and Cooling
The physical requirements of today’s ever more expensive datacenter real estate
dictate that any viable solutions take the best advantage of datacenter floor space
while staying within environmental realities. Solutions must be as energy efficient
as possible, and must provide effective cooling that fits well with the latest
energy-efficient datacenter practices.
• Superior Reliability and Serviceability
Due to their large numbers, computational systems must be as reliable and
servicable as possible. Not only must systems provide redundant hot-swap
processing, I/O, power, and cooling modules, but serviceability must be a key
component of their design and management. Interconnect schemes must allow
systems to be cabled once and reconfigured at will as required.

Blade technology has offered considerable promise in these areas for some time, but
has often been constrained by legacy blade platforms that locked adopters into
expensive proprietary infrastructure. Power and cooling limitations often meant that
processors were limited to less powerful versions. Limited processing power, memory
capacity, and I/O bandwidth often severely restricted the applications that could be
deployed. Proprietary tie-ins and other constraints in chassis design dictated
networking and interconnect topologies, and I/O expansion options were limited to a
small number of expensive and proprietary modules.

The Sun Blade™ 6048 Modular System


To address the shortcomings of earlier blade computing platforms, Sun started with a
design point focused on the needs of the datacenter and highly-scalable deployments,
rather than with preconceptions of chassis design. With this innovative and truly
modular approach, the Sun Blade 6048 Modular System offers an ultra-dense high-
performance solution for large HPC clusters. Organizations gain the promised benefits
of blades, and can deploy thousands of nodes within the cabling, power, and cooling
constraints of existing datacenters. Fully compatible with the Sun Blade 6000 Modular
System, the Sun Blade 6048 Modular System provides distinct advantages over other
approaches to modular architecture.
• Innovative Chassis Design for Industry-Leading Density and Environmentals
The Sun Blade 6048 Modular System features a standard rack-size chassis that
facilitates the deployment of high-density computational environments. By
eliminating all of the hardware typically used to rackmount individual blade
chassis, the Sun Blade 6048 Modular System provides 20% more usable space in
the same physical footprint. Up to 48 Sun Blade 6000 server modules can be
deployed in a single Sun Blade 6048 Modular System for up to 96 compute nodes
per rack. Innovative chassis features are carried forward from the Sun Blade 6000
Modular System.
• A Choice of Processors and Operating Systems
Each Sun Blade 6048 Modular System chassis supports up to 48 full performance
and full featured Sun Blade 6000 series server modules. Server modules based on
x86/x64 architectures, and ideal for HPC and supercomputing environments
include:
– The Sun Blade X6440 server module, with four sockets for Six-Core AMD Opteron
8000 Series processors, and support for up to 256 GB of memory
– The Sun Blade X6270 server module, with two sockets for Intel Xeon Processor
5500 Series (Nehalem) CPUs and 144 GB of memory per server module
– The Sun Blade X6275 server module, with two nodes, each with two sockets for
Intel Xeon Processor 5500 Series CPUs, 96 GB of memory per node, and an on-
board QDR Mellanox InfiniBand host channel adapter (HCA)

Each server module provides significant I/O capacity as well, with up to 32 lanes of
PCI Express 2.0 bandwidth delivered from each server module to the multiple
available I/O expansion modules (a total of up to 207 Gb/sec supported per server
module). To enhance availability, server modules don’t have separate power
supplies or fans. Some server modules feature up to four hot-swap hard disk drives
(HDDs) or solid state drives (SSDs) with hardware RAID options, while others
provide on-board flash technologies for fast and reliable I/O. Organizations can
deploy server modules based on the processors and operating systems that best
serve their applications or environment. Different server modules can be mixed
and matched in a single chassis, and deployed and redeployed as needs dictate.
The Solaris™ Operating System (OS), Linux, and Microsoft Windows are all
supported.

• Complete Separation Between CPU and I/O Modules
Sun Blade 6048 Modular System design avoids compromises because it provides a complete separation between CPU and I/O modules. Two types of I/O modules are supported.
– Up to two industry-standard PCI Express ExpressModule (EM) slots are dedicated to each server module.
– Up to two Network Express Modules (NEMs) provide bulk I/O for all of the server modules installed in each shelf (four shelves per chassis).
Through this flexible approach, server modules can be configured with different
I/O options depending on the applications they host. All I/O modules are hot-plug
capable, and customers can choose from Sun-branded or third-party adapters for
networking, storage, clustering, and other I/O functions.
• Sun Blade Transparent Management
Many blade vendors provide management solutions that lock organizations into
proprietary management tools. With the Sun Blade 6048 Modular System,
customers have the choice of using their existing management tools or Sun Blade
Transparent Management. Sun Blade Transparent Management is a standards-
based cross-platform tool that provides direct management over individual server
modules and direct management of chassis-level modules using Sun Integrated
Lights out Management (ILOM).

Within the Sun Blade 6048 Modular System, a chassis monitoring module (CMM)
works in conjunction with the service processor on each server module to form a
complete and transparent management solution. Individual server modules
provide support for IPMI, SNMP, CLI (through serial console or SSH), and HTTP(S)
management methods. In addition, Sun Ops Center provides discovery,
aggregated management, and bulk deployment for multiple systems.

System Overview
The Sun Blade 6048 chassis provides space for up to 12 server modules in each of its
four shelves — for up to 48 Sun Blade 6000 server modules in a single chassis. This
design approach provides considerable density. Front and rear perspectives of the Sun
Blade 6048 Modular System are provided in Figure 7.

[Figure callouts: chassis management module and power interface module; four self-contained shelves; hot-swappable N+N redundant power supply modules; up to 24 PCI Express ExpressModules (EMs); up to 12 Sun Blade 6000 server modules per shelf; up to two single-height Network Express Modules (NEMs) or one dual-height InfiniBand NEM; eight fan modules (N+1)]

Figure 7. Front and rear perspectives of the Sun Blade 6048 Modular System

With four self-contained shelves per chassis, the Sun Blade 6048 Modular System
houses a wide range of components.
• Up to 48 Sun Blade 6000 server modules insert from the front of the chassis, with 12
modules supported by each shelf.
• A total of eight hot-swap power supply modules insert from the front of the chassis, with two 8,400 Watt 12-volt power supplies (N+N) provided for each shelf. Each power supply module contains a dedicated fan module.
• Up to 96 hot-plug PCI Express ExpressModules (EMs) insert from the rear of the
chassis (24 per shelf), supporting industry-standard PCI Express interfaces with two
EM slots available for use by each server module.

• Up to four dual-height Sun Blade 6048 InfiniBand NEMs can be installed in a single
chassis (one per shelf). Alternately, up to eight single-height Network Express
Modules (NEMs) can be inserted from the rear, with two NEM slots serving each shelf
of the chassis.
• A chassis monitoring module (CMM) and power interface module are provided for
each shelf. The CMM provides for transparent management access to individual
server modules while the Power Interface Module provides six plugs for the power
supply modules in each shelf.
• Redundant (N+1) fan modules are provided at the rear of the chassis for efficient
front-to-back cooling.

Standard I/O Through a Passive Midplane


In essence, the passive midplane in the Sun Blade 6048 Modular System is a collection
of wires and connectors between different modules in the chassis. Since there are no
active components, the reliability of this printed circuit board is extremely high — in
the millions of hours. The passive midplane provides electrical connectivity between
the server modules and the I/O modules.
All front and rear modules connect directly to the passive midplane, with the exception
of the power supplies and the fan modules. The power supplies connect to the
midplane through a bus bar and to the AC inputs via a cable harness. The redundant
fan modules plug individually to a set of three fan boards, where fan speed control and
other chassis-level functions are implemented. The front fan modules that cool the PCI
Express ExpressModules each connect to the chassis via self-aligning, blind-mate
connections. The main functions of the midplane include:
• Providing a mechanical connection point for all of the server modules
• Providing 12 VDC from the power supplies to each customer-replaceable module
• Providing 3.3 VDC power used to power the System Management Bus devices on each
module, and to power the CMM
• Providing a PCI Express interconnect between the PCI Express root complexes on each
server module to the EMs and NEMs installed in the chassis
• Connecting the server modules, CMMs, and NEMs to the management network

Each server module is energized through the midplane from the redundant chassis
power grid. The midplane also provides connectivity to the I2C network in the chassis,
allowing each server module to directly monitor the chassis environment, including fan
and power supply status as well as various temperature sensors. A number of I/O links
are also routed through the midplane for each server module. Connection details differ
depending on the selected server module and associated NEMS. As an example,
Figure 8 illustrates the dual-node Sun Blade X6275 server module configured with the
Sun Blade 6048 InfiniBand QDR Switched NEM with connections that include:
• An x8 PCI Express 2.0 link connecting from each compute node to a dedicated EM
• Two gigabit Ethernet links to the NEM — one from each compute node

• Two 4x QDR InfiniBand connections to the NEM — one from each compute node
• An Ethernet connection from the server module to the CMM for management

[Figure diagram: each node (Node 0 and Node 1) of the Sun Blade X6275 server module connects to its EMs over PCI Express x8, to NEM 0 and NEM 1 over 4x QDR InfiniBand and Gigabit Ethernet, and to the CMM over Ethernet]

Figure 8. Distribution of communications links from a typical Sun Blade 6000 server module

Tight Integration with Sun Datacenter InfiniBand Switches


Providing dense connectivity to servers while minimizing cables is one of the issues
facing large HPC cluster deployments. The Sun Blade 6048 QDR Switched InfiniBand
NEM solves this challenge and improves both density and reliability by integrating
connections and switch components into a dual-height NEM form factor for the Sun
Blade 6048 chassis. As a part of the Sun Constellation System, the NEM uses common
components, cables, connectors, and architecture with the Sun Datacenter InfiniBand
Switch 648 and 72.

For QDR InfiniBand interconnects, the Sun Blade 6048 InfiniBand QDR Switched NEM
offers the ability to connect up to 24 nodes in a single Sun Blade 6048 shelf. Each NEM
provides all of the connections necessary from the individual server modules to two 36-
port InfiniBand switch chips (Figure 9). The Mellanox InfiniScale IV 36-port switches
offer very low 100 ns latency, and QDR speeds across all ports.

[Figure diagram: twelve blade slots, each hosting two server nodes, connect through the NEM to two 36-port QDR InfiniBand switch chips; external connectors provide ten 12x InfiniBand connections (A0-A2 through A12-A14 and B0-B2 through B12-B14) to the cluster IB fabric, plus Gigabit Ethernet ports for server administration]

Figure 9. Paired with the Sun Blade X6275 server module, the Sun Blade 6048
InfiniBand QDR Switched NEM connects up to 24 nodes per shelf and exposes 10
12x InfiniBand connections, and 24 Gigabit Ethernet ports

The Sun Blade 6048 InfiniBand QDR Switched NEM is designed to work with a number of Sun Blade 6000 server modules1. However, the Sun Blade X6275 server module was optimized to derive maximum density and performance from the Sun Blade 6048 InfiniBand QDR Switched NEM. Ideal for HPC environments where performance and density are key, each Sun Blade X6275 server module features two compute nodes, with each node supporting two sockets for Intel Xeon Processor 5500 Series CPUs and up to 96 GB of memory (Figure 10).

1. As of this writing, the Sun Blade 6048 InfiniBand QDR Switched NEM supports the Sun Blade X6275 server module through its on-board QDR HCAs. The Sun Blade X6440 server module is supported via an InfiniBand Fabric Expansion Module (FEM) at DDR speeds only.

Figure 10. The Sun Blade X6275 server module provides two compute nodes on a
single server module.

Figure 11 shows a block-level representation of how the Sun Blade X6275 server module
connects to the Sun Blade 6048 InfiniBand QDR Switched NEM. In this configuration,
twelve ports from each switch chip (24 total) are used to communicate with the two
compute nodes on each Sun Blade X6275 server module, with nine ports used to
connect the two switches together. The 30 remaining ports (15 per switch chip) are
used as uplinks to either other QDR switched NEMS or external InfiniBand switches.

Sun Blade 6048 InfiniBand QDR Switched NEMs can be connected together directly to
provide mesh or 3D torus fabrics. Alternately, one or more Sun Datacenter InfiniBand
Switch 648 or 72 can be connected to provide Clos fabric implementations. The external
ports use industry-standard CXP connectors that aggregate three 4x ports into a single
12x connector.
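
The per-chip port budget just described can be tallied directly (a simple sketch of the default configuration):

    CHIP_PORTS = 36
    TO_BLADES  = 12          # one 4x link to each of the 12 nodes served by this chip
    INTER_CHIP = 9           # links to the NEM's second switch chip

    uplinks = CHIP_PORTS - TO_BLADES - INTER_CHIP
    print(uplinks)                    # 15 uplink ports per chip, 30 per NEM
    print((uplinks * 2) // 3)         # 10 external 12x CXP connectors (three 4x links each)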

[Figure diagram: up to 12 Sun Blade X6275 server modules per shelf; each node's on-board PCIe 2.0 InfiniBand HCA connects over a 4x link to one of the NEM's two 36-port QDR InfiniBand switch chips, which are joined by 9 ports and each expose 15 uplink ports carried on 12x IB cables]

Figure 11. The Sun Blade 6048 InfiniBand QDR Switched NEM connects directly to
Mellanox HCAs on both nodes of the Sun Blade X6275 server module

In the default configuration, and for clusters that utilize up to four Sun Datacenter
InfiniBand Switch 648s, the switch provides a non-blocking fabric. To maintain a non-
blocking fabric in configurations of larger than four switches, an external 12x cable can
link two of the external CXP connectors (one connected to each internal switch chip) to
interconnect the two switches with an additional three 4x connections. This
configuration fully meshes the InfiniScale IV chips on the Sun Blade 6048 InfiniBand
QDR Switched NEM, with a total of 12 ports communicating between the two Mellanox
InfiniScale IV InfiniBand switches, while still leaving 24 4x ports (Eight 12x CXP
connectors) available as switch uplinks.
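
The same arithmetic covers this meshed configuration (a brief sketch, continuing the port budget above):

    UPLINKS_PER_CHIP = 15
    MESH_LINKS       = 3      # three 4x links re-purposed via the external 12x loopback

    print(9 + MESH_LINKS)                         # 12 links now join the two switch chips
    remaining = (UPLINKS_PER_CHIP - MESH_LINKS) * 2
    print(remaining, remaining // 3)              # 24 4x ports = eight 12x CXP uplinks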

Scaling to Multiple Sun Datacenter InfiniBand Switch 648


Designers need the ability to scale supercomputing deployments without being
constrained by arbitrary limitations. The Sun Datacenter InfiniBand Switch 648 lets
organizations scale from mid-sized InfiniBand deployments that may only populate a
portion of a single Sun Datacenter InfiniBand Switch 648 chassis, to very large
deployments built from multiple Sun Datacenter InfiniBand Switch 648. As with single-
switch configurations, a multiswitch system still functions and is managed as a single entity, greatly reducing management complexity.
• A single Sun Datacenter InfiniBand Switch 648 can be deployed for configurations
that require up to 648 compute nodes.
• Up to eight Sun Datacenter InfiniBand Switch 648 can be configured to serve up to
5,184 compute nodes.

Certain requirements exist for maintaining a non-blocking InfiniBand fabric. Table 2 lists various supported numbers of Sun Blade 6048 chassis, Sun Datacenter InfiniBand Switch 648, line cards, Sun Blade 6048 InfiniBand QDR Switched NEMs, and 12x cables to support various numbers of compute nodes via Sun Blade X6275 server modules. All listed configurations are non-blocking.

Table 2. Maximum numbers of Sun Blade X6275 server modules and Sun Blade 6048 Modular Systems supported by various numbers of Sun Datacenter InfiniBand Switch 648.

  Sun Blade 6048   Sun Datacenter InfiniBand   Line Cards   Sun Blade 6048 InfiniBand   12x Cables   Total Compute
  Chassis          Switch 648                  Required     QDR Switched NEMs           Required     Nodes Supported
  1                1                           2            4                           32           96
  2                1                           3            8                           64           192
  3                1                           4            12                          96           288
  4                1                           6            16                          128          384
  5                1                           7            20                          160          480
  6                1                           8            24                          192          576
  8                2                           11           32                          256          768
  10               2                           14           40                          320          960
  12               2                           16           48                          384          1,152
  24               4                           32           96                          768          2,304
  48               8                           64           192                         1,536        4,608
  54               8                           72           216                         1,728        5,184
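
The node, NEM, and cable columns of Table 2 follow from a few per-chassis constants; a sketch that reproduces them (line-card counts are omitted here, since they also depend on how cables are distributed across switches):

    NODES_PER_CHASSIS = 96    # 48 Sun Blade X6275 server modules x 2 nodes each
    NEMS_PER_CHASSIS  = 4     # one dual-height QDR Switched NEM per shelf
    CABLES_PER_NEM    = 8     # eight 12x uplink cables per NEM (as in Table 2)

    for chassis in (1, 2, 3, 12, 54):
        nems = chassis * NEMS_PER_CHASSIS
        print(chassis, nems, nems * CABLES_PER_NEM, chassis * NODES_PER_CHASSIS)
    # 54 chassis -> 216 NEMs, 1,728 cables, and 5,184 compute nodes, matching the
    # largest configuration in the table.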

Chapter 4
Scalable and Manageable Storage

Large-scale supercomputing clusters place significant demands on storage systems. The enormous computational performance gains that have been realized through
supercomputing clusters are capable of generating ever-larger quantities of data at very
high rates. Effective HPC storage solutions must provide cost-effective capacity, and
throughput must be able to scale along with the performance of cluster compute
nodes. In addition, users and systems alike need fast access to data and home
directories, and longer-term retention and archival are increasingly important in HPC
and supercomputing environments. These diverse demands require a robust range of
integrated storage offerings.

Storage for Clusters


Along with the general growth in storage capacity requirements and the sheer number
of files stored, large HPC environments are seeing significant growth in the numbers of
users needing convenient access to their files. All users want to access their essential
data quickly and easily without having to perform extraneous steps. Organizations also
want to get the best utilization possible from their computational systems.
Unfortunately, storage speeds have seriously lagged behind computational
performance for years, and HPC users are increasingly concerned about storage
benchmarks, the increasing complexity of the I/O path, and the range of solutions
required to provide complete storage solutions.

Of particular importance, large HPC environments need to be able to effectively
manage the flow of high volumes of data through their storage infrastructure,
requiring:
• Storage that acts as a resilient compute engine data cache to match the streaming
rates of applications running on the compute cluster
• Storage that provides longer-term retention and archive to store massive quantities
of essential data to tiered disk or tape hierarchies
• A range of scalable and parallel file systems and integrated data management
software to help move file system data from near-term cache to longer-term
retention and archive, and back, on demand

Even as the capacities of individual disk drives have risen, and prices have fallen, high-
volume parallel storage systems have remained expensive and complex. With
experience deploying petabytes of storage into large supercomputing clusters, Sun
understands the key issues needed to deliver high-capacity, high-throughput storage in
a cost-effective and manageable fashion. As an example, the Tokyo Institute of
Technology (TiTech) TSUBAME supercomputing cluster was initially deployed with 1.1
petabytes of storage provided by clustered Sun Fire X4500 storage servers and the
Lustre parallel file system.

Clustered Sun Fire™ X4540 Storage Servers as Data Cache


Ideal for building storage clusters to serve as cluster scratch space or data cache, the
Sun Fire X4540 storage server defines a new category of system. These innovative
systems closely couple a general-purpose enterprise-class x64 server with high-density
storage — all in a very compact form factor. Supporting up to 48 terabytes in only four
rack units, the Sun Fire X4540 storage server also provides considerable compute power
with dual sockets for Third-Generation Quad-Core and enhanced Quad-Core AMD
Opteron processors. The server can also be configured for high-throughput InfiniBand
networking — allowing it to be connected directly to Sun InfiniBand switches. With
support for up to 48 internal 500 GB or 1 TB disk drives, the Sun Fire X4540 storage
server is ideal for large cluster deployments running the Linux OS and the Lustre
parallel file system.

Figure 12. The Sun Fire X4540 storage server provides up to 48 terabytes of
compact storage in only four rack units — ideal for configuration as cluster
scratch space using the Lustre parallel file system.

The Sun Fire X4540 storage server represents an innovative design that provides
throughput and high-speed access to the 48 directly-attached, hot-plug Serial ATA (SATA)
disk drives. Designed for datacenter deployment, the efficient system is cooled from
front to back across the components and disk drives. Each Sun Fire X4540 storage
server provides:
• Minimal cost per gigabyte utilizing SATA II storage and software RAID 6 with six
SATA II storage controllers connecting to 48 high-performance SATA disk drives
• High performance from an industry-standard x64 server based on two Quad-Core or
enhanced Quad-Core AMD Opteron processors
• Maximum memory and bandwidth scaling from embedded single-channel DDR
memory controllers on each processor, delivering up to 64 GB of capacity
• High-performance I/O from two PCI-X slots delivering over 8.5 gigabits per second of
plug-in I/O bandwidth, including support for InfiniBand HCAs
• Easy maintenance and overall system reliability and availability from redundant hot-
pluggable disks, power supply units, fans, and I/O
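
As a rough illustration of the capacity arithmetic behind the list above, the sketch
below assumes the 48 drives are organized as eight 6-drive RAID 6 groups; this grouping
is an assumption chosen purely for illustration, not a statement of the product's actual
RAID layout.

    # Illustrative capacity arithmetic for a Sun Fire X4540 configuration.
    # The eight 6-drive RAID 6 groups below are an assumed layout for illustration;
    # real deployments may use different group sizes.
    DRIVES         = 48
    DRIVE_TB       = 1.0          # 1 TB SATA drives (500 GB drives are also supported)
    RAID6_GROUPS   = 8
    DRIVES_PER_GRP = DRIVES // RAID6_GROUPS
    PARITY_PER_GRP = 2            # RAID 6 dedicates two drives per group to parity

    raw_tb    = DRIVES * DRIVE_TB
    usable_tb = RAID6_GROUPS * (DRIVES_PER_GRP - PARITY_PER_GRP) * DRIVE_TB

    print(f"Raw capacity:    {raw_tb:.0f} TB in 4U")
    print(f"Usable (RAID 6): {usable_tb:.0f} TB ({usable_tb / raw_tb:.0%} efficiency)")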

Parallel file systems are required for moving massive amounts of data through
supercomputing clusters. Given its strengths, the Sun Fire X4540 storage server is now a
standard component of many large supercomputing cluster deployments around the
world. Large grids and clusters need high-performance heterogeneous access to data,
and the Sun Fire X4540 storage server provides both the high throughput and the
essential scalability that allow parallel file systems to perform at their best. Together
with the Lustre parallel file system and the Linux OS, the Sun Fire X4540 storage server
also serves as the key component of the Sun Lustre Storage System, described in the
next section.

The Sun Lustre™ Storage System


The Lustre file system is a software-only architecture that supports a number of
different hardware implementations. Lustre’s state-of-the-art object-based storage
architecture provides ground-breaking I/O and metadata throughput, with considerable
reliability, scalability, and performance advantages. The Lustre file system currently
scales to thousands of nodes and hundreds of terabytes of storage — with the potential
to support tens of thousands of nodes and petabytes of data.

Building on the strengths of the Lustre parallel file system, the Sun Lustre Storage
System is architected using Sun Open Storage systems that deliver exceptional
performance and provide additional value. The main components of a typical Lustre
architecture include:
• Lustre file system clients (Lustre clients)
• Metadata Servers (MDS)
• Object Storage Servers (OSS)

Metadata Servers and Object Storage Servers implement the file system and
communicate with the Lustre clients. The MDS manages and stores metadata, such as
file names, directories, permissions and file layout. Configurations also require one or
more Lustre Object Storage Server (OSS) modules, which provide scalable I/O
performance and storage capacity.
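
As a brief illustration of how this division of labor appears to applications, the sketch
below uses the standard Lustre lfs utility to set a default stripe layout on a scratch
directory; the mount point, directory name, and stripe count are placeholders for a real
installation.

    import os
    import subprocess

    # Hypothetical Lustre mount point and scratch directory; adjust for a real site.
    RUN_DIR = "/mnt/lustre/scratch/run042"
    os.makedirs(RUN_DIR, exist_ok=True)

    # Ask the MDS to stripe new files in this directory across four OSTs, so that
    # bulk I/O is spread over four object storage servers instead of one.
    subprocess.run(["lfs", "setstripe", "-c", "4", RUN_DIR], check=True)

    # Show the layout that new files created in the directory will inherit.
    subprocess.run(["lfs", "getstripe", RUN_DIR], check=True)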

Building on these standard components, all Sun Lustre Storage System configurations include
a High Availability Lustre Metadata Server (HA MDS) module that provides failover. For
maximum flexibility, the Sun Lustre Storage System defines two OSS modules: a
Standard OSS module for greatest density and economy, and an HA OSS module that
provides OSS failover for environments where automated recovery from OSS failure is
important (Figure 13).

[Figure omitted: Lustre clients on InfiniBand and Ethernet networks (multiple networks
supported simultaneously) connect to active/standby metadata servers (MDS) and to object
storage servers (OSS) backed by commodity storage, direct-connect storage arrays, or
enterprise storage arrays and SAN fabrics; file system fail-over paths link the server
pairs.]

Figure 13. High availability metadata servers (HA MDS) and high availability
object storage servers (HA OSS) allow for file system failover in Lustre
configurations.

• HA MDS Module
Designed to meet the critical requirement of high availability, the HA MDS
module is common to all Sun Lustre Storage System configurations. This module
includes a pair of Sun Fire X4270 servers with an attached Sun Storage J4200 array
acting as shared storage. Internal boot drives in the Sun Fire X4270 server are
mirrored for added protection. The Sun Fire X4270 server features two quad-core
Intel Xeon Processor 5500 Series (Nehalem) CPUs and is configured with
24 GB RAM.
• Standard OSS Module
The Sun Fire X4540 server was chosen for use as the Standard OSS module. As
discussed, the Sun Fire X4540 server features an innovative architecture that
combines a high-performance server, high I/O bandwidth, and very high density
storage in a single integrated system.
• HA OSS Module
Each HA OSS module includes two Sun Fire X4270 servers and four Sun Storage
J4400 arrays. Sun Fire X4270 servers were chosen because, with six PCI Express
slots, they can drive the high throughput required in Sun Lustre Storage System
environments. The Sun Storage J4400 array was chosen because it offers compelling
storage density, connectivity, and high availability at a very low price per
gigabyte. With redundant SAS I/O Modules and front-serviceable disk drives, the
Sun Storage J4400 array helps the Sun Lustre Storage System deliver price/
performance advantages without sacrificing RAS features.

Sun can reference many storage installations that have achieved impressive scalability
results. One such reference is the Texas Advanced Computing Center’s Ranger System
(see http://www.tacc.utexas.edu/resources/hpcsystems/#ranger), where Sun has
demonstrated near-linear scalability in a configuration encompassing fifty similar
previous-generation Sun OSS modules with a single HA MDS module supporting a file
system that was 1.2 petabytes in size. Figure 14 shows Lustre file system throughput
at TACC, where sustained rates of 45 GB/sec and peaks approaching 50 GB/sec have been
observed. In addition, TACC has experienced near-linear throughput
on a single application’s use of the Lustre file system at 35 GB/sec.
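
A quick back-of-envelope division of the figures quoted above suggests why this scaling
is plausible: with roughly fifty OSS modules behind the file system, 45 GB/sec of
sustained aggregate throughput corresponds to a little under 1 GB/sec per OSS.

    # Back-of-envelope check of the TACC figures quoted above (illustrative only).
    oss_modules          = 50      # previous-generation Sun OSS modules behind $SCRATCH
    sustained_gb_per_sec = 45.0
    peak_gb_per_sec      = 50.0

    print(f"Sustained per-OSS throughput: {sustained_gb_per_sec / oss_modules:.2f} GB/sec")
    print(f"Peak per-OSS throughput:      {peak_gb_per_sec / oss_modules:.2f} GB/sec")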

[Figure omitted: two panels, "$SCRATCH File System Throughput" and "$SCRATCH Application
Performance," plot write speed (GB/sec) against the number of writing clients for stripe
counts of 1 and 4.]

Figure 14. Lustre parallel file system performance at TACC

More information on implementing the Lustre parallel file system can be found in the
Sun BluePrints article titled Solving the HPC I/O Bottleneck: Sun Lustre Storage System
(http://wikis.sun.com/display/BluePrints/Solving+the+HPC+IO+Bottleneck+-
+Sun+Lustre+Storage+System).

ZFS™ and Sun Storage 7000 Unified Storage Systems


While high-throughput cluster scratch space is critical, clusters also need storage that
serves other needs. Some of an organization’s most important data includes completed
simulations and key source data. Clusters therefore need scalable, reliable, and robust
storage for tier-1 data archival and users’ home directories.

To address this need, Sun Storage 7000 Unified Storage Systems incorporate an open-
source operating system, commodity hardware, and industry-standard technologies.
These systems represent low-cost, fully-functional network attached storage (NAS)
devices designed around the following core technologies:
• General-purpose x64-based servers (that function as the NAS head), and Sun Storage
products — proven high-performance commodity hardware solutions with
compelling price-performance points
• The ZFS file system, the world’s first 128-bit file system with unprecedented
availability and reliability features
• A high-performance networking stack using IPv4 or IPv6
• DTrace Analytics, that provide dynamic instrumentation for real-time performance
analysis and debugging
• Sun Fault Management Architecture (FMA) for built-in fault detection, diagnosis, and
self-healing for common hardware problems
• A large and adaptive two-tiered caching model, based on DRAM and enterprise-class
solid state devices (SSDs)

To meet varied needs for capacity, reliability, performance, and price, the product
family includes four different models — the Sun Storage 7110, 7210, 7310, and 7410
Unified Storage Systems (Figure 15). Configured with appropriate data processing and
storage resources, these systems can support a wide range of requirements in HPC
environments.

[Figure omitted: product photographs of the Sun Storage 7110, 7210, 7310, and 7410
Unified Storage Systems.]
Figure 15. Sun Storage 7000 Unified Storage Systems



Tight integration of the ZFS scalable file system


Sun Storage 7000 Unified Storage Systems are powered by the ZFS scalable file system.
ZFS offers a dramatic advance in data management with an innovative approach to
data integrity, tremendous performance improvements, and a welcome integration of
both file system and volume management capabilities. A true 128-bit file system, ZFS
removes all practical limitations for scalable storage, and introduces pivotal new
concepts such as hybrid storage pools that de-couple the file system from physical
storage. This radical new architecture optimizes and simplifies code paths from the
application to the hardware, producing sustained throughput at near platter speeds.
New block allocation algorithms accelerate write operations, consolidating what would
traditionally be many small random writes into a single, more efficient write sequence.

Silent data corruption is corruption that goes undetected, and for which no error
messages are generated. This particular form of data corruption is of special concern to
HPC applications since they typically generate, store, and archive significant amounts
of data. In fact, a study by CERN1 has shown that silent data corruption, including disk
errors, RAID errors, and memory errors, is much more common than previously
imagined. ZFS provides end-to-end checksumming for all data, greatly reducing the risk
of silent data corruption.
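
The sketch below is a conceptual illustration of the end-to-end checksumming idea, not
ZFS code: a checksum is recorded when a block is written and verified on every read, so
corruption introduced anywhere in the I/O path is reported rather than silently returned.

    import hashlib

    # Conceptual illustration only: keep a checksum alongside each stored block and
    # verify it on read, so silent corruption is detected instead of returned as data.
    def write_block(storage, block_id, data: bytes):
        checksum = hashlib.sha256(data).hexdigest()
        storage[block_id] = (data, checksum)

    def read_block(storage, block_id) -> bytes:
        data, expected = storage[block_id]
        if hashlib.sha256(data).hexdigest() != expected:
            raise IOError(f"silent corruption detected in block {block_id}")
        return data

    storage = {}
    write_block(storage, 0, b"simulation results")
    assert read_block(storage, 0) == b"simulation results"

    # Simulate a bit flip on the media: the corruption is now caught on read.
    data, checksum = storage[0]
    storage[0] = (b"simulaXion results", checksum)
    try:
        read_block(storage, 0)
    except IOError as err:
        print(err)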

Sun Storage 7000 Unified Storage Systems rely heavily on ZFS for key functionality such
as Hybrid Storage Pools. By automatically allocating space from pooled storage when
needed, ZFS simplifies storage management and gives organizations the flexibility to
optimize data for performance. Hybrid Storage Pools also effectively combine the
strengths of system memory, flash memory technology in the form of enterprise solid
state drives (SSDs), and conventional hard disk drives (HDDs).

Key capabilities of ZFS related to Hybrid Storage Pools include:


• Virtual storage pools — Unlike traditional file systems that require a separate volume
manager, ZFS integrates volume management directly into the file system.
• Data integrity — ZFS uses several techniques to keep on-disk data self-consistent and
eliminate silent data corruption, such as copy-on-write and end-to-end
checksumming.
• High performance — ZFS simplifies the code paths from the application to the
hardware, delivering sustained throughput at near platter speeds.
• Simplified administration — ZFS automates many administrative tasks to speed
performance and eliminate common errors.
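
For illustration, the following sketch shows how a generic ZFS hybrid pool of this kind
might be assembled from standard zpool and zfs commands, wrapped in Python subprocess
calls. The device names are placeholders, and this is a generic ZFS layout rather than
the internal configuration of a Sun Storage 7000 system.

    import subprocess

    # Placeholder device names for a generic ZFS hybrid storage pool.
    data_disks = ["c1t0d0", "c1t1d0", "c1t2d0", "c1t3d0", "c1t4d0", "c1t5d0"]
    log_ssd    = "c2t0d0"    # low-latency SSD absorbs synchronous writes (intent log)
    cache_ssd  = "c2t1d0"    # read-optimized SSD extends the in-memory cache (L2ARC)

    # Create a RAID-Z2 pool with separate log and cache devices.
    cmd = (["zpool", "create", "hpcpool", "raidz2"] + data_disks
           + ["log", log_ssd, "cache", cache_ssd])
    subprocess.run(cmd, check=True)

    # File systems then draw space from the pool on demand; no separate volume manager.
    subprocess.run(["zfs", "create", "hpcpool/home"], check=True)
    subprocess.run(["zfs", "set", "compression=on", "hpcpool/home"], check=True)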
Sun Storage 7000 Unified Storage Systems utilize ZFS Hybrid Storage Pools to
automatically provide data placement, data protection, and data services such as RAID,
error correction, and system management. By placing data on the most appropriate
storage media, Hybrid Storage Pools help to optimize performance and contain costs.
Sun Storage 7000 Unified Storage Systems feature a common, easy-to-use management
interface, along with a comprehensive analytics environment to help isolate and
resolve issues. The systems support NFS, CIFS, and iSCSI data access protocols, mirrored
and parity-based data protection, local point-in-time (PIT) copy, remote replication, data
checksum, data compression, and data reconstruction.

1. Silent Corruptions, Peter Kelemen, CERN After C5, June 1st, 2007

Long-Term Retention and Archive


Staging, storing, and maintaining HPC data requires a massive repository of on-line and
near-line storage to support data retention and archival needs. High-speed data
movement must be provided between computational and archival environments. The
Sun Constellation System addresses this need by integrating with a wealth of
sophisticated Sun StorageTek™ options, including:
• Sun StorageTek SL8500 and SL500 Modular Library Systems
• Sun StorageTek 6540 and 6140 Modular Arrays
• High-speed data movers
• Sun StorageTek 5800 system fixed-content archive

The comprehensive Sun StorageTek software offering is key to facilitating seamless
migration of data between cache and archival.
• Sun StorageTek™ QFS
Sun StorageTek QFS software provides high-performance heterogeneous shared
access to data over a storage area network (SAN). Users across the enterprise get
shared access to the same large files or data sets simultaneously, speeding time to
results. Up to 256 systems running Sun StorageTek QFS technology can have
shared access to the same data while maintaining file integrity. Data can be
written and accessed at device-rated speeds, providing superior application I/O
rates. Sun StorageTek QFS software also provides heterogeneous file sharing using
NFS, CIFS, Apple Filing Protocol, FTP, and Samba.
• Sun StorageTek Storage Archive Manager (SAM) software
Large HPC installations must manage the considerable storage required by
multiple projects running large-scale computational applications on very large
datasets. Solutions must provide a seamless and transparent migration for
essential archival data between disk and tape storage systems. Sun StorageTek
Storage Archive Manager (SAM) addresses this need by providing data
classification and policy-driven data placement across tiers of storage.
Organizations can benefit from data protection as well as long-term retention and
data recovery to match their specific needs.

Chapter 6 provides additional detail and a graphical depiction of how caching file
systems such as the Lustre parallel file system combine with SAM-QFS in a real-world
example to provide data management in large supercomputing installations.

Chapter 5
Sun HPC Software

As clusters move from the realm of supercomputing to the enterprise, cluster software
has never been more important. Organizations deploying clusters at all levels need
better ways to control and monitor often expansive cluster deployments in ways that
benefit their users and applications. Unfortunately, collecting, assembling, testing, and
patching all of the requisite software components for effective cluster operation has
proved challenging, to say the least.

Available in both a Linux Edition and a Solaris Developer Edition, Sun HPC Software is
designed to address these needs. Sun HPC Software, Linux Edition is detailed in the
sections that follow. For more information on Sun HPC Software, Solaris Developer
Edition, please see
http://wikis.sun.com/display/hpc/Sun+HPC+Software,+Solaris+Developer+Edition+1.0+Beta+1

Sun HPC Software, Linux Edition


Many HPC customers are demanding Linux-based HPC solutions with open source
components. To answer these demands, Sun has introduced Sun HPC Software, Linux
Edition — an integrated solution for Linux HPC clusters based on Sun hardware. More
than a mere collection of software components, Sun HPC Software simplifies the entire
process of deploying and managing large-scale Linux HPC clusters, providing
considerable potential savings in maintenance time and expense.

From its inception, the project’s goals were to provide an “open” product — one that
uses as much open source software as possible, and one that depends on and enhances
the community aspects of software development and consolidation. The ongoing goals
for Sun HPC software are to:
• Provide simple, scalable provisioning of bare-metal systems into a running HPC
cluster
• Validate configurations
• Dramatically reduce time-to-results
• Offer integrated management and monitoring of the cluster
• Employ a community-driven process

Seamless and Scalable Integration


Sun HPC Software, Linux Edition covers the entire cluster life-cycle. The software
provides everything needed to provision the cluster nodes, verify that the software and
hardware are working correctly, manage the cluster, and monitor the cluster’s
performance and health. All of the components have been fully tested on Sun HPC
hardware, so that the likelihood of post-installation integration problems is significantly
reduced.

Because clusters can vary widely in size, Sun HPC software is designed to be scalable,
and all of the components are selected with large numbers of nodes in mind. For
example, the Lustre parallel file system and OneSIS provisioning software are both
known to work well with clusters of thousands of nodes. Tools that
provision, verify, manage, and monitor the cluster were likewise selected for their
scalability to reduce the management cost as clusters grow.

Sun HPC Software, Linux Edition is built to be completely modular so that organizations
can customize it according to their own preferences and requirements. The modular
framework provides a ready-made stack that contains the components required to
deploy an HPC cluster. Add-on components let organizations make specific choices
beyond the core software installed. Figure 16 provides a high-level perspective of Sun
HPC Software, Linux Edition. For more specific information on the components provided
at each level, please see www.sun.com/software/products/hpcsoftware, or send an e-
mail to linux_hpc_swstack-discuss@sun.com to join the community.

Figure 16. Sun HPC Software stack, Linux Edition

Sun HPC Software, Linux Edition 2.0.1 contains the components listed in Table 3.

Table 3. Sun HPC Software 2.0.1 components

Category                     Components

Operating System and kernel  Red Hat Enterprise Linux, CentOS Linux, Lustre parallel file
                             system, perfctr, SuSE Linux Enterprise Server
User space library           Allinea DDT, Env-switcher, genders, git, Heartbeat, Intel
                             Compiler, Mellanox firmware tools, Modules, MVAPICH and
                             MVAPICH2, OFED, OpenMPI, PGI compiler, RRDtool, Sun Studio,
                             Sun HPC Clustertools, Totalview
Verification                 HPCC Bench Suite, Lustre IOkit, IOR, Lnet selftest, NetPIPE
Schedulers                   Sun Grid Engine, PBS, LSF, SLURM, MOAB, MUNGE
Monitoring                   Ganglia
Provisioning                 OneSIS, Cobbler
Management                   CFEngine, Conman, FreeIPMI, gtdb, IBSRM, IPMItool, lshw,
                             OpenSM, pdsh, Powerman, Sun Ops Center
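
As a small illustration of the scheduler layer, the sketch below submits an MPI job
through Sun Grid Engine, one of the schedulers listed in Table 3. The parallel
environment name, slot count, and application name are site-specific placeholders.

    import subprocess

    # Hypothetical Sun Grid Engine job submission. The "mpi" parallel environment,
    # 64 slots, and the application binary are placeholders for a real site.
    job_script = """#!/bin/sh
    #$ -N sample_mpi_job
    #$ -pe mpi 64
    #$ -cwd
    mpirun -np $NSLOTS ./my_mpi_application
    """

    with open("job.sh", "w") as handle:
        handle.write(job_script)

    subprocess.run(["qsub", "job.sh"], check=True)   # queue the job
    subprocess.run(["qstat"], check=True)            # confirm it is pending or running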

Simplified Cluster Provisioning


Sun HPC Software, Linux Edition is specifically designed to simplify the complex task of
provisioning systems as a part of a clustered environment. For example, the software
stack includes the OneSIS open source software tool developed at Sandia National
Laboratories. This tool is specifically designed to ease system administration in large-
scale Linux cluster environments.

The software stack itself can be downloaded from the Web, and is designed to fit onto a
single DVD. While installing the first system in a cluster might take place from the DVD,
it goes without saying that installing an entire large cluster in this fashion would
consume unacceptable amounts of time, not to mention the additional time to
maintain and update individual system images. With OneSIS, administrators can create
system images that define the behavior of the entire computing infrastructure.

A typical installation process approximates the following:
• The software is first installed onto a management node; installing the system locally
via DVD typically takes about 20 minutes from bare metal to a login prompt
• Configuring the system requires another 20 minutes at most
• Other cluster systems are then booted onto the master image

The cluster can be running and ready to accept jobs in as little as 50 minutes.
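
The final step, booting the remaining systems onto the master image, is typically just a
network boot. The sketch below shows one hypothetical way to trigger it with the
standard ipmitool utility; it is not part of the Sun HPC Software tooling itself, and
the service processor addresses and credentials are placeholders.

    import subprocess

    # Placeholder service processor credentials and addresses for illustration only.
    BMC_USER = "admin"
    BMC_PASS = "changeme"
    nodes = [f"node{i:03d}-sp" for i in range(1, 5)]

    for bmc in nodes:
        base = ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", BMC_USER, "-P", BMC_PASS]
        # Request a one-time PXE boot, then power-cycle so the node picks up the
        # master image served by the management node.
        subprocess.run(base + ["chassis", "bootdev", "pxe"], check=True)
        subprocess.run(base + ["chassis", "power", "cycle"], check=True)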

Chapter 6
Deploying Supercomputing Clusters Rapidly
with Less Risk

Sun has considerable experience helping organizations deploy supercomputing clusters
specific to their computational, storage, and collaborative requirements.
Complementing the compelling capabilities of the Sun Constellation System, Sun
provides a range of services that are specifically aimed at delivering results for HPC-
focused organizations. Sun’s partnership with the Texas Advanced Computing Center
(TACC) at the University of Texas at Austin to deliver the Sun Constellation System in the
3,936-node Ranger supercomputing cluster is one such example.

Sun Datacenter Express Services


Sun’s new Datacenter Express Services provide a comprehensive, all-in-one systems and
services solution that takes the complexity and cost out of HPC infrastructure and
procurement. Sun delivers a uniquely flexible approach to managing IT resources that
allows organizations to maintain control of their environments. The offerings combine
the cost savings and improved quality of the Sun Customer Ready Program, with the
expertise of Sun Services to deliver:
• Improved availability and decreased cost
• Optimized system performance and reduced complexity
• Control of IT resources at all times

With Datacenter Express, Sun’s Customer Ready Program builds and tests systems to
exact customer specifications, while Sun Services provides complementary expertise
based upon exact requirements. Having testing and component integration performed
at Sun ISO-certified facilities helps reduce HPC system deployment time and
installation issues, and minimizes unnecessary downtime. Organizations can also leverage the
expertise of Sun Services so that HPC solutions are easier to repair, maintain, and
support — and configurations are easier to scale and modify.

Sun Customer Ready Architected Systems


The Sun Customer Ready program of factory-integrated systems provides powerful, cost-
effective, and energy-efficient computing solutions. The program helps reduce the risk
and time to deployment for clusters through pre-configured solutions that are factory
integrated and tested. As a part of this program, Sun Customer Ready architected
systems combine Sun’s powerful and cost-effective server and storage products with
leading infrastructure or application software to provide cluster building blocks that are
complete, easy-to-order, and fast-to-deploy.

The sections that follow describe several Sun Customer Ready program offerings.
• Sun Compute Cluster
The Sun Compute Cluster offers a high-performance computing cluster solution
that can be customized to specific needs. With a choice of Sun Fire x64 rackmount
servers and Sun Blade Modular Systems, the Sun Compute Cluster allows
organizations to customize their cluster designs without compromise.

• Sun Lustre Storage System
As discussed, the Sun Lustre Storage System (formerly Sun Storage Cluster)
provides a high-performance storage solution to support the demands of HPC
clusters for fast access to working data sets. Built around the Sun Fire X4540 and
X4250 servers, Sun Storage J4400 arrays, and the Lustre parallel cluster file system,
the Sun Lustre Storage System supports large clusters of compute nodes where
high data throughput rates, low-latency, and high-speed interconnects are
required. Both standard OSS and high-availability OSS modules can be configured
as a part of the Sun Lustre Storage System.

• Sun Storage and Archive Solution for HPC
Provided as an integrated set of hardware and software modules, the Sun Storage
and Archive Solution for HPC performs a number of key functions for an HPC
environment:
– Serving data to the compute cluster
– Storing and protecting data on various tiers
– Providing a policy-based hierarchical storage manager to automate the move-
ment of data across the storage tiers
– Implementing a continuous copy mechanism to protect critical data and allevi-
ate the need for traditional backup applications

A Massive Supercomputing Cluster at the Texas Advanced Computing Center
The Ranger supercomputing cluster now deployed at TACC is testament to Sun’s
commitment to help design and build the world’s largest clusters. TACC is a leading
research center for advanced computational science, engineering, and technology,
supporting research and education programs by providing comprehensive advanced
computational resources and support services to researchers in Texas, and across the
nation. TACC is one of several major research centers participating in the TeraGrid, a
program sponsored by the National Science Foundation (NSF) that makes high-
performance computing, data management, and visualization resources available to
the nation’s research scientists and engineers.

As a part of the TeraGrid program, the NSF in mid-2005 issued a request for bids on a
project to configure, implement, and operate a new supercomputer with peak
performance in excess of 400 teraflops, making it one of the most powerful
supercomputer systems in the world. The resulting supercomputer also provides over
100 terabytes of memory, and 1.7 petabytes of disk storage.

TACC Ranger Computation and Interconnect Architecture


TACC has deployed the Sun Constellation System to build the Ranger supercomputing
cluster. The 3,936-node cluster is expected to provide over 500 teraflops of peak
performance. Figure 17 illustrates the initial deployment floor-plan.

[Figure omitted: the floor plan shows two Sun DS 3456 core switches with 16 line cards
per switch, 82 Sun Blade 6048 chassis holding 3,936 server modules, 12 APC racks for
72 Sun Fire X4500 bulk storage nodes and 25 Sun Fire X4600 servers (19 support and 6
metadata nodes), 328 bundles of four 12x cables, 24 bundles of three splitter cables,
116 APC row coolers, and space for cable management arms and trays.]

Figure 17. TACC initial deployment floor-plan featuring dual previous-generation
Sun DDR core switches and 82 fully-populated Sun Blade 6048 Modular Systems

The initial TACC floor-plan consists of:


• Two previous-generation Sun DDR InfiniBand core switches
• 82 Sun Blade 6048 racks featuring 3,936 four-socket Sun Blade X6440 server modules
based on Third-Generation Quad-Core AMD Opteron processors
• 12 APC racks for 72 Sun Fire X4500 servers acting as bulk storage nodes, 19 Sun Fire
X4600 M2 servers acting as support nodes, and six Sun Fire X4600 M2 servers acting
as metadata nodes
• 116 APC row coolers and doors to facilitate an efficient hot/cold aisle configuration

TACC Ranger and Sun HPC Storage Solutions


For the TACC Ranger installation, it was essential to provide an effective data cache that
could keep pace with the massive high-speed cluster while deploying tiered storage
infrastructure for long-term retention and archival. Large capacities and throughput
where essential throughout. Tight integration and ease of data migration were also
required.

A Resilient Cluster Data Cache


HPC applications need fast access to high-capacity data storage for writing and
retrieving data sets. TACC needed storage infrastructure that functions as an effective
and resilient compute engine data cache — providing the maximum aggregate
throughput at the lowest possible costs. This cache had to be capable of scaling to
hundreds of petabytes, with very low latency for temporary storage, and it had to be
accessible by all of the compute nodes in the cluster. Ideally, the data cache had to be
easy to deploy, administer, and maintain as well.

Designed as “fast scratch space” for large clusters, the Sun Lustre Storage System
provides the data storage capacity and throughput that these applications require. Key
components of this data storage solution include high-performance Sun Fire X4540 and
X4250 servers, the Lustre scalable cluster file system, and high-speed InfiniBand
interconnects — all integrated into one system. As deployed at TACC, the system will
scale to over 1.729 petabytes of raw capacity. The configuration includes a scalable
storage cluster with 72 Sun Fire X4500 servers and over three thousand 500 GB disk
drives, but only occupies eight physical racks.
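
The raw-capacity figure follows directly from the drive counts quoted above, as the
short check below illustrates.

    # Quick check of the scratch storage raw capacity described above.
    servers           = 72      # Sun Fire X4500 storage servers
    drives_per_server = 48
    drive_tb          = 0.5     # 500 GB SATA drives

    total_drives = servers * drives_per_server
    raw_pb = total_drives * drive_tb / 1000.0
    print(f"{total_drives} drives, roughly {raw_pb:.2f} PB raw capacity")   # ~1.73 PB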

Long-Term Retention and Archive


At the other end of the spectrum, TACC needed storage infrastructure for long-term
retention and archival with a deep repository for massive amounts of data. Users
needed to be able to access their data regardless of location. Integral management was
required for the transparent movement of data from archival media (e.g. tape) in and
out of the compute engine data cache.

The Sun Customer Ready Scalable Storage Cluster that was deployed to implement the
data cache was designed as part of an overall cluster solution (Figure 18). Sun Fire
X4500 storage clusters were tightly integrated with other complementary products,
including Sun StorageTek Storage Archive Manager and the QFS file system, that
provide long-term data archive capabilities to other Sun StorageTek devices. High-
performance Sun Fire servers act as Data Movers, efficiently moving data between the
fast scratch storage of the Sun Customer Ready Storage Cluster and long-term storage.

[Figure omitted: compute nodes on the cluster InfiniBand network access Lustre metadata
and OSS nodes that implement the parallel file system; data movers bridge the InfiniBand
network and a SAN/FC-AL network that connects Tier 1 archive and home directory storage,
a Tier 2 fixed-content archive, and a Tier 2 near-line tape archive.]

Figure 18. Data movers provide automated, policy-based data management and
migration between storage tiers
As deployed at TACC, the system will scale to over:
• 200 petabytes of near-line storage
• 3.1 petabytes of on-line storage

The configuration includes:


• Five Sun StorageTek SL8500 Modular Library Systems
• 48 Sun StorageTek T10000 tape drives
• 10 Sun StorageTek 6540 Arrays
• Six Sun Fire Metadata servers with SAM-QFS

Chapter 7
Conclusion

In spite of advancements in technology, delivering petascale clusters and grids has
remained challenging. Large terascale and petascale systems are more than vast
collections of nodes, and Sun’s approach provides a systematic and careful design of
fabric, compute, software, and storage elements. The Sun Constellation System delivers an
open and standard supercomputing architecture that provides massive scalability, a
dramatic reduction in complexity, and breakthrough economics.

Coupled with the Sun Blade 6048 InfiniBand QDR Switched NEM, the Sun Datacenter
InfiniBand Switch 648 offers a compelling solution that supports up to 5,184 nodes in a
dense and consolidated InfiniBand Clos fabric. For smaller and medium-sized clusters,
up to eight Sun Datacenter InfiniBand Switch 72 switches can be combined to scale to
576 nodes. With the innovative and robust 12x connectors and cables, these
switches drastically reduce the number of switches and cables required for large
supercomputing installations — potentially eliminating hundreds of switches, and
thousands of cables.

The Sun Blade 6048 Modular System is the first blade platform designed for extreme
density and performance. With a choice of the latest SPARC, Intel, and AMD processors,
the Sun Blade 6048 Modular System integrates tightly with the Sun Datacenter
InfiniBand Switch 648 and 72. Fully compatible with the Sun Blade 6000 Modular
System, server modules run standard open-source operating systems such as the Solaris
OS and Linux, and can deploy general-purpose software that does not require custom
coding, compilation, or tuning. A modular and efficient design realizes savings in both
power and cooling.

Along with a breadth of Sun StorageTek storage offerings, the Sun Fire X4540 server
provides one of the most economical and scalable parallel file system building blocks
available — serving effectively as an OSS for the Lustre file system. Supporting up to
48 TB in a single 4U chassis, the Sun Fire X4540 server effectively combines a powerful
multisocket, multicore x64 server with large-scale storage, and direct InfiniBand
connectivity. The Sun Fire X4270 server and optional Sun Storage J4400 arrays can be
combined to build highly effective high-availability MDS and OSS modules for use as
part of the Sun Lustre Storage System. Coupled with the Sun Storage 7000 Unified
Storage System, the scalable ZFS file system effectively implements hybrid storage
pools across system memory, solid state drives, and conventional hard disk drives —
offering scalable and robust storage for tier-1 archival and user home directories.

The Sun Constellation System also supports Sun HPC Software that benefits application
developers and cluster users alike. Integrated tools such as Sun Studio 12 provide the
fastest compilers available, tuned to get the most out of Sun platforms. Sun HPC Cluster
Tools enable the development of cluster-ready applications. Sun Grid Engine provides
distributed resource management along with policy enforcement as it distributes jobs
for execution. Sun Ops Center provides monitoring, patching, and simplified inventory
management for clusters. Together these tools help ensure that development and
management of large clusters remain tractable as they scale towards petascale.

Acknowledgements
This work was inspired by and in part based on previous works from Andreas
Bechtolsheim (International Symposium on SuperComputing 2007), Jim Waldo (for his
Sun Labs paper “On System Design” SMLI-PS-2006-6), and Ivan Sutherland (for his paper
“Technology and Courage” SMLI-PS-96-1).

For More Information


To learn more about Sun products and the benefits of the Sun Constellation System,
contact a Sun sales representative, or consult the related documents and Web sites
listed in Table 4.

Table 4. Related Websites

Web Site URL Description


sun.com/sunconstellationsystem Sun Constellation System
sun.com/ds648 Sun Datacenter InfiniBand Switch 648
sun.com/ds72 Sun Datacenter InfiniBand Switch 72
sun.com/ds36 Sun Datacenter InfiniBand Switch 36
sun.com/blades/6000 Sun Blade 6000 and 6048 Modular Systems
sun.com/servers/x64/x4540 Sun Fire X4540 server
sun.com/servers/x64/x4270 Sun Fire X4270 server
sun.com/storagetek Sun StorageTek storage products
sun.com/software/products/hpcsoftware/ Sun HPC Software
sun.com/sge Sun Grid Engine software
sun.com/hpc Sun in HPC
Pathways to Open Petascale Computing On the Web sun.com/sunconstellationsystem

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2007-2009 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, Lustre, Solaris, Sun Fire, Sun Blade, StorageTek, and ZFS are trademarks or registered trademarks of Sun Microsystems, Inc. or its subsidiaries
in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the US and other countries. Products bearing SPARC
trademarks are based upon an architecture developed by Sun Microsystems, Inc. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. AMD
and Opteron are trademarks or registered trademarks of Advanced Micro Devices, Inc. Information subject to change without notice. Printed in USA. SunWIN #:537015 11/09
