
INTRODUCTION TO THE DESIGN

OF COMPUTING CLUSTERS
High Performance Infrastructures (HPC Master)
Roberto R. Expósito
Department of Computer Engineering
Universidade da Coruña
Contents

• What is a cluster?
• Types of clusters
• Computing clusters
• Commodity cluster architecture
• Head node
• Computing nodes
• Operating system and cluster middleware
• Distributed Resource Management (DRM) systems
• Software installation and management
• Storage and file systems
• Interconnection networks
• Cluster monitoring and optimization
• Cluster performance and benchmarking
WHAT IS A CLUSTER?
What is a cluster?

• A cluster is:
• A collection of individual computers
• Loosely or tightly interconnected through a local network
• Running software that enables them to collectively work together as a
group to serve a specific purpose
• Ranging from general-purpose business needs (e.g. web services) to
computation-intensive scientific calculations (e.g. weather forecasting)
• Support for clustering can be built directly into the OS or may sit above it (i.e.
at the application level)
• In short, a cluster is a distributed memory multicomputer
• Each computer has its own local (private) memory, which is not directly
accessible from other computers
• A communication network is needed to connect inter-processor memories
• When a processor needs to access data located in another processor's
memory, it is usually the task of the programmer to explicitly define how
and when data is communicated
4
What is a cluster?

• According to Flynn’s taxonomy, a cluster is a MIMD machine

Figure: Flynn's taxonomy of parallel processor architectures (source: https://slideplayer.com/slide/6115054/18/images/41/Flynn+Taxonomy+of+Parallel+Processor+Architectures.jpg)

5
TYPES OF CLUSTERS
Types of clusters

• High Availability (HA) or failover clusters


• Load Balancing (LB) clusters
• Computing clusters

7
Types of clusters

• HA or failover clusters
• HA clusters provide high-availability services used in mission-critical
applications
• Critical databases, business applications, and customer services such as
electronic commerce websites
• A subset of the computers in the cluster provide the primary service
• The minimum required to provide redundancy is a two-node cluster
• Redundant computers will be in standby mode to deliver a backup service
when the primary service fails
• Single Points Of Failure (SPOF) are eliminated by incorporating multiple
network connections, redundant storage and power supplies, etc
• HA clusters usually use a heartbeat private network connection which is
used to monitor the health and status of each node in the cluster
• Different HA configurations: Active/Active, Active/Passive, N+1…
• The Linux-HA project provides commonly used free software for building
HA clusters on Unix-like operating systems
8
Types of clusters

• LB clusters
• LB clusters distribute a workload among the computers in a cluster
• E.g. a Web server implemented using LB clustering assigns different client
requests to different computers
• The overall response time is optimized
• The load balancer or director is the frontend computer of the whole cluster
• The director balances requests from clients among a set of servers, so that
clients see all services as coming from a single IP address
• This can be accomplished using a simple round-robin algorithm

• The actual services (e.g. Web, Mail, FTP, DNS…) run in the servers
• Shared storage is required for the servers, so that it is easy for them to
have the same contents and provide the same services
• The director is a SPOF that is usually replicated to provide HA features
• HA/LB clusters
• The Linux Virtual Server (LVS) is an open-source project that provides
an IP load balancing software for Linux systems
• LVS implements several balancing schedulers, not only round-robin
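• As an illustrative sketch (not taken from the original material), an LVS director using the round-robin scheduler could be configured with the ipvsadm tool roughly as follows; the virtual IP and real server addresses are made up
# Create a virtual TCP service on the virtual IP using the round-robin (rr) scheduler
ipvsadm -A -t 192.0.2.10:80 -s rr
# Register two real web servers behind the virtual service (NAT/masquerading forwarding)
ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.11:80 -m
ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.12:80 -m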
Types of clusters

• Computing clusters
• Clusters mainly focused on running compute-intensive and/or data-
intensive applications
• Scientific computing (weather forecasting, molecular dynamics, industrial
design, astronomical modeling, vehicular accident modeling)
• Big Data analytics (machine learning, graph analytics, healthcare, fraud
detection, marketing, social media analysis, IoT)
• The main goal is to provide an overall high-performance computer by
aggregating the computational power of each individual computer
• They are mostly intended to solve complex computational problems as
fast as possible by exploiting multiple computing resources
• The ultimate goal of the HPC/Big Data paradigms
• Applications rely on parallel/distributed programming approaches to take
advantage of computer clusters
• E.g.: message-passing model, MapReduce paradigm

• This course is focused on computing clusters or just clusters from now on


• Mainly clusters intended for running HPC applications
10
COMPUTING CLUSTERS
Computing clusters
• Back in the day, scientific computing was only performed on dedicated,
specialized and expensive parallel systems called supercomputers
• Very powerful computers tightly integrated from top to bottom by a single
company (e.g. Cray, IBM, SGI), often built from specialized non-
commodity/proprietary components
• The vast majority of current supercomputers are very large computing clusters,
but this has not always been the case
• Early supercomputers relied on a small number of closely connected
processors that accessed shared memory (shared memory multiprocessors)
• Only national laboratories and big research groups could afford their cost
• Few supercomputers serving large user communities
• Users suffered from high queue waiting times

12
Computing clusters
• In the middle of the 90s, commodity clusters became popular as a cost-
effective alternative to supercomputers
• The rapid growth and maturation of x86 processors meant that commodity hardware
began to provide acceptable/good performance
• The advent of freely available software (i.e. GNU/Linux) to support the UNIX-like
environment provided by most supercomputers was another enabler of the
commodity cluster
• Commodity clusters are built from commercial off-the-shelf computers
(PCs/workstations/servers) and commodity hardware developed and marketed for
other standalone purposes

13
COMMODITY CLUSTER
ARCHITECTURE
Commodity cluster architecture
• Computers within a commodity cluster can be dedicated or can be
standalone computers that dynamically join and leave the cluster
• A Network of Workstations (NOW) or Cluster of Workstations
(COW) is composed of computers usable as individual workstations
• A computer laboratory at a university might become a NOW/COW on the
weekend when the laboratory is closed
• Office computers might join a cluster in the evening after the daytime
users leave the office
• The term Beowulf is generally used to describe a commodity cluster of
dedicated computers
• Often with minimal hardware and accessed only via remote login
• If no one is going to use a computer as a standalone machine, there is no need
for that computer to have a dedicated keyboard, mouse or monitor
• So, they generally use rackmount servers installed in standard computer racks
• Running Linux and (mostly) freely available, open-source software
• From now on, we will use the terms computer and node as synonyms
15
Commodity cluster architecture

• Symmetric cluster
• Each node can function as an individual computer
• Typically used in a NOW/COW, where each computer must be
independently usable
• This is straightforward to set up by
• Creating a subnetwork with the individual computers or simply adding the
computers to an existing network
• Adding any cluster-specific middleware you need

16
Commodity cluster architecture

• Asymmetric cluster
• One computer is the head/frontend/master node
• It serves as a gateway between the users and the remaining computers
• The remaining computers are the computing/worker/slave nodes
• They are dedicated exclusively to execute the computing tasks
• All traffic must pass through the head node
• Asymmetric clusters tend to provide a higher level of security
• If the computing nodes are physically secure and your users are trusted, you
will only need to harden the head node
• This architecture is most commonly used in dedicated clusters

17
Commodity cluster architecture

• Expanded asymmetric cluster


• For large clusters, the head node may be broken into several service nodes
for performance and scalability reasons:
• File servers (e.g. NFS servers), login nodes, monitoring nodes…
• It is often desirable to distribute a shared file system across the computing
nodes to allow parallel I/O access
• Cluster file systems generally require separate and dedicated I/O servers
• We will focus on (expanded) asymmetric clusters from now on
• Most dedicated HPC clusters are based on this architecture

18
HEAD NODE
Head node
• Node configured to act as a middle point between the actual cluster
and the outside network
• It serves as an access point to the cluster where users log in
• It requires (at least) two network interfaces
• A private interface directly connected to the internal network
• A public interface directly connected to the external network
• This node provides critical services for centralized cluster
administration and resource management
• Centralized user account management
• Job scheduling
• Node provisioning and cluster monitoring
• Tools and services to install, configure and manage all the computing nodes
in the cluster from a central point
• It may also contain a large amount of attached storage shared via the
internal network by all the computing nodes
• At a basic level, shared storage is typically based on NFS
Head node
• It is where users edit, develop, compile, submit and monitor their applications
• Users must not run their applications directly on the head node!
• Computing nodes are exclusively intended for computation
• Small interactive tasks (e.g. testing, visualization) may be allowed
• Instead, the head node serves as the launching central point for running
applications on the cluster
• To do so, it must provide users with appropriate tools and services to
• Submit their applications to the computing nodes for their execution
• Monitor and control their resources
• View and retrieve their results
• Distributed Resource Management (DRM) systems provide this kind of
resource management and job scheduling services
• Slurm, Son of Grid Engine, Torque are popular job schedulers used in HPC
• Ideally, users should never have any need to access any of the individual
computing nodes directly
• Most clusters block direct ssh access to (most of) the computing nodes
• This prevents users from running their applications on the computing nodes
without any resource control
Head node
• For large clusters and/or high number of users, some of the
functionality provided by the head node may be broken into several
nodes for performance and scalability reasons
• Login nodes
• Provide additional and redundant access points where users can log in
• For small-medium clusters, the head node serves as the only login node
• Very large clusters can have 2-4 login nodes
• Login nodes are used to edit, develop, compile and submit applications
through the job scheduler
• File servers
• For a high number of users and I/O file traffic, storage services can put a lot
of strain on the head node
• These nodes provide NAS-like storage for the cluster (e.g. using NFS)
• Monitoring nodes
• Dedicated nodes for monitoring the cluster that generally perform continuous
background logging and data collection
• They can provide Web- and CLI-based data browsing and visualization
22
COMPUTING NODES
Computing nodes
• Computing nodes are the place where all the computation is performed
• They provide the computational resources (e.g. CPU, memory) needed for
running the applications in the cluster
• Generally, they are systems that have the bare minimum hardware (i.e. no
keyboard, mouse or monitor) and run the bare minimum OS
• Unnecessary daemons are turned off
• Unnecessary packages are not installed
• Most of the nodes in a cluster are ordinarily computing nodes
• Each computing node can execute one or more tasks simultaneously
• Usually under the control of the job scheduling system (i.e. DRM)
• Direct ssh access to computing nodes is generally restricted
• They only require a private interface directly connected to the internal
network (but they are usually connected to multiple internal networks)
• Most clusters are based on a particular schema for naming these nodes
• Using Greek letters (alpha, beta, gamma…) or numbers (one, two, three…)
• Sometimes their names have a reference to their physical locations
• E.g.: compute-0-4 would be node 4 in rack zero
Computing nodes
• These nodes are managed as non-critical elements in the cluster
• Usually treated as disposable nodes with no backups
• Very fast hardware but usually with little or no redundancy
• They are just reinstalled when they suffer any software/hardware failure
• Homogeneous clusters
• All computing nodes are equal in terms of hardware resources
• Heterogeneous clusters provide different types of computing nodes
• Thin/skinny nodes
• Regular computing nodes that provide balanced cores/memory ratios
• Fat/big nodes
• Special computing nodes with more cores and memory than regular nodes
• GPU/accelerator/hybrid nodes
• Special computing nodes with manycore processors (e.g. GPUs, Xeon Phi)
• Interactive/visualization nodes
• Computing nodes exclusively dedicated to heavy interactive tasks that can
slow down the head/login nodes: testing, (remote) visualization, data pre-
processing and post-processing, parallel compilation…
Computing nodes
• Current dedicated, high-performance commodity clusters are
composed of a large number of computing nodes, in which:
• Each node can have more than one processor using a multiple-socket
motherboard
• Each node is a shared memory multiprocessor (usually a NUMA system)
• Each processor can provide a large number of execution cores
• Each processor provides a powerful multicore CPU
• Each node can have access to one or more manycore processors
• E.g.: GPUs, Xeon Phi
• Nodes are interconnected through a low-latency, high-speed network
• E.g.: InfiniBand FDR/EDR, 10/40/100 Gigabit Ethernet

26
Computing nodes
• A typical thin/skinny node
• Single/dual-socket motherboard (i.e., 1-2 processors per node)
• 8-32 cores per processor (i.e., 8-64 cores per node)
• 32-128 GB of memory per processor (i.e., 32-256 GB of memory per node)
• A typical fat/big node
• Quad/octa-socket motherboard (i.e., 4-8 processors per node)
• 16-32 cores per processor (i.e., 64-256 cores per node)
• 128-256 GB of memory per processor (i.e., 512-2048 GB of memory per node)
• A typical accelerator node
• Thin node as baseline plus 1-4 manycore processors (GPUs, Xeon Phi) that
provide very high floating-point performance for scientific applications
• A typical visualization node
• Thin node as baseline plus X windowing system installed plus 1-4 server-class
GPUs that provide hardware-accelerated OpenGL rendering features (e.g.
NVIDIA Quadro) for remote visualization and high demand 2D/3D applications
• Nowadays, most server-class GPUs for HPC (e.g. NVIDIA Tesla) also provide
OpenGL support, which allows using the same GPU hardware for both compute
and visualization tasks
Computing nodes
• Overview of 1U rackmount thin node
Figure: dual-socket motherboard, redundant power supplies, 8x 4cm system fans, 10x hot-swap HDD drives, 24x DIMM slots

28
Computing nodes
• Overview of 1U rackmount accelerator node
Figure: dual-socket motherboard, 9x 4cm fans, redundant power supplies, 2x internal HDD drives, 16x DIMM slots, 2x hot-swap HDD drives, 4x GPUs/Xeon Phi

29
Computing nodes
• Overview of 2U rackmount fat node
Figure: quad-socket motherboard, redundant power supplies, 4x 8cm fans, 24x hot-swap HDD drives, 48x DIMM slots
30
Computing nodes
• Current designs are moving away from individual 1U-4U “pizza box” nodes
to blade systems to improve the node density
• Blade systems also allow for shared cooling and management

Octa-socket blade system: up to 224 cores and 12 TB of memory in a rackmount 7U server

31
Computing nodes
• FinisTerrae II at CESGA
• 317 nodes, 7712 CPU cores, 44.8 TB RAM, 8 GPUs and 4 Xeon Phi

32
Computing nodes
• FinisTerrae II at CESGA
• 306 thin nodes
• 2x Intel Xeon E5-2680v3 12-core processors (24 cores per node)
• 128 GB of memory
• 6 accelerator nodes
• 2x Intel Xeon E5-2680v3 12-core processors (24 cores per node)
• 128 GB of memory
• 4 nodes equipped with 2x NVIDIA Tesla K80
• 2 nodes equipped with 2x Intel Xeon Phi 7120P
• 1 fat node
• 8x Intel Xeon E7-8867v3 16-core processors (128 cores)
• 4 TB of memory
• 4 login nodes
• 2x Intel Xeon E5-2680v3 12-core processors (24 cores per node)
• 128 GB of memory
• All nodes interconnected through InfiniBand and Gigabit Ethernet
33
OPERATING SYSTEM AND CLUSTER
MIDDLEWARE
OS and cluster middleware

• TOP500 list (June 2018)


• The predominant OS for HPC can be summed up in one word: Linux

35
OS and cluster middleware
• Prior to the advent of Linux, the HPC market used UNIX exclusively
• Linux is an open-source and plug-and-play alternative
• In addition to the Linux kernel, much of the important software has
been developed as part of the free-software GNU project
• Virtually all clusters use commercial/free GNU/Linux distributions
• Popular freely available distributions (and their derivatives):
• Debian (Ubuntu, Linux Mint), Slackware, Gentoo…
• Commercial enterprise-class distributions:
• Red Hat Enterprise Linux (RHEL) is developed and supported by Red Hat
• SUSE Linux Enterprise Server (SLES) is developed and supported by SUSE
• Community-supported free distributions forked from commercial versions:
• From RHEL: Fedora, CentOS, Scientific Linux, Oracle Linux…
• From SLES: openSUSE and its derivatives
• Distributions specifically developed by some vendors for their supercomputers:
• Cray: Compute Node Linux (CNL) and Cray Linux Environment (CLE)
• IBM: Compute Node Kernel (CNK) and I/O Node Kernel (INK)
36
OS and cluster middleware
• Standard Linux kernel and distributions do not contain all the software
and tools needed to deploy and manage computing clusters as a whole
• Special-purpose OS and/or cluster middleware is needed to turn a set of
interconnected computers into a functional cluster for end users
• The main goal is to present the group of computers that make up a
cluster as a single, big and more powerful computer
• Two basic ways of achieving this goal at the software level:
• By providing Single System Image (SSI) capabilities at the OS level
• SSI is a property of a system that hides the heterogeneous and
distributed nature of the available computing resources and presents them to
end users and applications as a single unified computing resource
• SSI-based solutions provide users with a “virtual”, bigger shared memory
multiprocessor machine on top of a cluster
• By combining all the needed tools, middleware and libraries to enable
clustering upon the OS of the computing nodes
• All usually integrated in such a way that it is “easy” to deploy, manage and use
all the nodes in the cluster
37
OS and cluster middleware
• SSI-based cluster middleware
• SSI clustering support is built directly into the OS at the kernel level
• This concept is often considered synonymous with that of a distributed OS
• SSI tries to make clustering transparent to applications by providing:
• Single process and memory spaces to provide the illusion that all processes
are running on the same machine (i.e. process management tools such as “ps”
or “kill” operate on all processes in the cluster)
• Single root file system to provide a single view of the file system
• Process migration for automatic work distribution among nodes
• Processes may start on one node and be moved to another node, possibly
for resource balancing, without the need for using any specific
programming API
• Single IPC space to allow processes on different nodes to communicate using
IPC mechanisms (e.g. pipes) as if they were running on the same machine
• A (more or less) complete illusion of a single machine is provided
depending on the software support for all these features
• Disadvantage: offering this transparency level is very hard to implement
• Most SSI-based projects are currently inactive/abandoned
OS and cluster middleware
• Examples of SSI-based cluster middleware
• MOSIX
• Proprietary distributed OS where there is no need to modify or to link
applications with any library or even to assign processes to different nodes (it
is all done automatically)
• Under active development
• openMOSIX
• Community-based version of MOSIX after the latter became proprietary
software in 2001
• This project was abandoned in 2008; the LinuxPMI project continued its
development, but it is currently also dead
• OpenSSI
• Open-source SSI project which originally had a strong commitment and
support from HP, but it has been inactive since 2010
• Kerrighed
• Another open-source project, started at INRIA, which is implemented by
extending the Linux kernel
• The last version was released in 2010
OS and cluster middleware
• The need for developing SSI-based cluster middleware decreased
during the last decade
• Due to the advent of multi-core processors, commodity hardware and the
increasing popularity of Beowulf clusters
• Current cluster middleware for HPC takes a very different approach
• The focus is now on orchestrating all the activities and computing resources of the
cluster nodes to present them to end users as one large computing resource
• So, applications must be cluster-aware in order to take advantage of multiple
computing nodes of the cluster
• Using programming models for distributed memory systems (e.g. MPI)
• All-in-one solutions are generally based on integrating a provisioning system for
software installation with development tools and libraries to provide all the
needed services above the OS and support the entire lifecycle of a cluster
• Basically, deployment, configuration, resource management and monitoring
of all the computing nodes from a unique, centralized point
• Some authors consider this approach a type of non-transparent SSI clustering
implemented on top of the OS
• We will focus on this type of cluster middleware from now on
40
OS and cluster middleware
• Current cluster middleware ranges from all-in-one integrated solutions
that allow you to (easily) build a cluster from scratch to more limited tools
that only provide some functionality
• Multiple both open-source (OS) and commercial (C) solutions exist
• Rocks Cluster - OS
• Qlustar - OS
• OSCAR (Open Source Cluster Application Resources) - OS
• xCAT (Extreme Cloud Administration Toolkit) - OS
• PERCEUS - OS
• Warewulf - OS
• OpenHPC - OS
• Bright Cluster Manager - C
• Scyld ClusterWare - C
• HPE Performance Cluster Manager - C
• IBM Platform HPC - C
• Lenovo Intelligent Computing Orchestration (LiCO) - C
• Bull SuperComputer Suite (Bull SCS) - C
OS and cluster middleware
• Rocks Cluster
• http://www.rocksclusters.org
• Rocks is an all-in-one cluster management system that simplifies the process of
deploying, managing and upgrading computing clusters
• The primary goal is to build clusters in a turnkey fashion
• Rocks does not require a steep learning curve to set up a cluster
• It is designed to perfectly match the asymmetric cluster architecture
• Current releases are based on CentOS together with a modified Anaconda installer
that simplifies massive network installation onto many computers
• Rocks supports stateful clusters where the OS is installed on the hard disk of
the computing nodes (stateless/diskless clusters are not possible)
• Stateful node provisioning is performed using a Kickstart-based system
• Rocks combines the underlying CentOS distribution with additional software
bundles or “rolls” that cover the administrative and users needs of an HPC cluster
• MPI (e.g. Open MPI) and scientific libraries
• Workload and resource management through Distributed Resource
Management Systems (DRMS)
• Cluster monitoring tools (e.g. Ganglia)
42
OS and cluster middleware
• Qlustar
• https://qlustar.com
• Qlustar is another all-in-one cluster management system specifically
designed for scalability and performance
• Both stateful/stateless clusters can be deployed in a turnkey fashion
• Qlustar Cluster OS: a full-fledged OS based on Debian/Ubuntu/CentOS
• Qlustar's core is a lightweight computing node OS kernel designed
such that OS images contain the exact minimum of programs/files
needed for the desired functionality
• Qlustar HPC stack includes a complete distribution that provides its own
software repositories with additional software packages relevant to HPC
• Including MPI libraries, workload and resource management systems
and monitoring tools in a similar way to Rocks
• Qlustar Data stack supports Lustre and BeeGFS parallel file systems
• Qlustar also provides a powerful cluster management GUI (QluMan)

43
OS and cluster middleware
• Warewulf
• http://warewulf.lbl.gov
• Warewulf: “ware” from software + “wulf” from Beowulf
• Management toolkit specifically designed for node OS provisioning and
management in order to ease large-scale cluster deployments
• Warewulf is not an all-in-one cluster solution in the sense that Rocks and Qlustar
are, but it can be combined with other tools (e.g. OpenHPC) to achieve a
similar functionality
• The Warewulf project pioneered the concept of stateless HPC clusters
• Nowadays, it supports both stateful and stateless approaches
• xCAT
• https://xcat.org
• xCAT makes it easy to manage a large number of servers for any type of
computing workload, not only for HPC clusters
• Similar to Warewulf, xCAT also provides node OS provisioning and
management features for stateful deployments
• wareCAT adds support for Warewulf to provide stateless clusters
• xCAT is much more powerful but also more complex to set up
OS and cluster middleware
• OpenHPC
• https://openhpc.community
• Community effort that aggregates a number of common ingredients required to
deploy and manage HPC clusters
• OpenHPC can be understood as:
• A software repository where packages have been pre-built with HPC
integration in mind and a goal to provide re-usable building blocks for the
HPC community
• A set of installation recipes provided to set up a simple cluster from scratch
using components from the OpenHPC software stack
• OpenHPC is intended to be installed on top of a base OS (CentOS or SLES)
• General approach: install the base OS on the master node, enable the
OpenHPC repository and then install the desired components following a
specific installation recipe for the selected base OS
• The components you choose to install can include:
• A provisioning system, currently supporting Warewulf and xCAT
• Workload and resource management tools, currently supporting SLURM and
PBS Pro
• Development tools, scientific/parallel libraries and applications
DISTRIBUTED RESOURCE
MANAGEMENT (DRM) SYSTEMS
DRM systems
• HPC clusters are expensive systems
• It is very important to make efficient use of them, especially on heavily shared
systems where multiple users are competing for the same resources
• Each simultaneous user gets a fraction of the cluster depending on his/her
resource requirements
• A “way” to specify such requirements to the cluster is needed
• Fairness between users must be ensured based on available resources and
scheduling policies
• “Something” that provides such a mechanism is also needed
• CPU time cannot be wasted waiting for input as in an interactive application
• Users cannot (generally) log in to computing nodes to run their applications
• HPC clusters provide a batch environment for non-interactive computation
• Distributed Resource Management (DRM) system (or DRMS) is the
software that HPC clusters rely on to provide such batch environments
• Interactive computations are also possible but they are usually more restricted
and only under the control of the DRM (more on this topic later)
• From now on, we will use the terms DRM/DRMS/job scheduler/resource
scheduler/workload manager/batch queuing system as synonyms
47
DRM systems

48
DRM systems
• Batch processing in HPC clusters
• A batch job is a computer program or set of programs processed in batch mode
• A sequence of commands to be executed is listed in a file (job script) and
submitted for execution as a single unit
• HPC users must write a job script and submit it from the head node
(or login node) to a job queue
• A job queue (or batch queue) is a data structure managed by the DRM system that
contains the pool of batch jobs available for execution on the cluster
• Required resources (CPU time, memory…) can be specified as part of the
submission process or embedded within the job script using special directives
• The DRM software will run users’ jobs on their behalf when the
required resources become available
• So, users do not have to sit around and wait for the cluster to be free before
running their applications
• Administrators are in charge of configuring the DRM, creating the job
queues and configuring the scheduling policies according to site-wide
requirements
49
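• As a minimal sketch of the embedded-directive approach for a Grid Engine-style DRM (the job name, runtime limit and application below are illustrative):
#!/bin/sh
#$ -N myjob                  # job name
#$ -l s_rt=00:10:00          # requested (soft) runtime limit
#$ -cwd                      # run the job from the submission directory
./my_application input.dat   # hypothetical application and input file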
DRM systems
• A DRM usually consists of:
• A resource manager, which knows the state of the computing resources
(cores, memory, etc) and maintains a list of the pending jobs (job queue)
that are requesting resources
• It must provide tools to submit, cancel and monitor jobs
• A scheduler, which uses the information from the resource manager to:
• Determine if there are enough resources available to execute any specific job
from the job queue
• Decide the order in which queued jobs are executed to “optimize” the
mapping of requests to the resources according to configurable policies
• The use of DRM software in HPC clusters gives these benefits:
• Efficient sharing of computing resources among many users
• Avoids idle resources without minute-by-minute human supervision
• Enables around-the-clock high resource utilization
• Automatic workload distribution across the resource pool
• Allows HPC researchers and users to actually get some sleep ;)
50
DRM systems
• Multiple open-source (OS) and commercial (C) DRM solutions exist
• Sun Grid Engine (SGE)/Oracle Grid Engine (OGE) – C and its derivatives
• Son of Grid Engine (SGE) - OS
• Open Grid Scheduler/Grid Engine (OGS/GE) - OS
• Univa Grid Engine (UGE) – C – Univa Corporation
• Slurm1 Workload Manager – OS/C – SchedMD LLC
• Portable Batch System (PBS) – OS and its derivatives
• OpenPBS – OS (deprecated)
• TORQUE – C (formerly OS) - Adaptive Computing, Inc
• PBS Professional (PBS Pro) – OS/C - Altair Engineering, Inc
• IBM LoadLeveler – C (superseded by IBM Spectrum LSF)
• IBM Spectrum Load Sharing Facility (LSF)2 – C and its derivatives
• OpenLava – OS

1formerly known as Simple Linux Utility for Resource Management (SLURM)


2formerly known as IBM Platform LSF
51
DRM systems
• Some DRMs provide simple schedulers
• E.g.: TORQUE provides a very simple, First Come First Served (FCFS)
scheduler
• In most cases, the integrated scheduler can be replaced with a more
advanced, compatible solution
• So, DRM only plays the role of the resource manager working together
with a separate scheduler
• Popular available schedulers:
• Maui Cluster Scheduler – OS
• Initially developed by Cluster Resources, Inc (now Adaptive
Computing, Inc)
• Maui is no longer actively developed, but still very useful
• Moab Workload Manager – C
• The commercial version of the Maui scheduler supported by Adaptive
Computing, Inc
• Maui/Moab can be integrated with most of the existing DRM software
(TORQUE, PBS Pro, SGE/OGE, Slurm, LSF…)
DRM systems
• Job scheduler
• Decides which job will run next based on continuously evolving priorities
• Works together with the resource manager to check available resources
• Tries to optimize long-term resource usage keeping fairness in mind
• The scheduler assigns a priority number to each job
• The highest priority job will be the next to run (not FCFS!)
• Priorities are (usually) calculated based on the time a job is queued
• The longer a job is waiting, the higher the priority
• Fairshare policies take into account historical resource utilization
• Users who have recently used a lot of resources get a lower priority

• Backfill can be used to allow lower priority jobs to be run “out of order”
with respect to the assigned priorities
• Only if there is a runnable job whose expected runtime is small enough to not
delay the start time of any higher priority jobs in the queue
• Backfill requires users to provide the scheduler with expected runtimes

• Resource reservations enable reserving system resources for specific
pending jobs (e.g. to meet a deadline)
DRM systems
• Typical workflow of HPC users interacting with DRM software
• Users log in to the head node (or login node) of the cluster via ssh
• They submit their jobs requesting resources such as CPU time, cores, memory…
• SGE/OGE, OGS/GE: qsub -l s_rt=00:10:00 script.sh
• Slurm: sbatch -t 00:10:00 script.sh
• DRM registers the job in the job queue and assigns an identifier (JOB_ID)
• As soon as requested resources become available, DRM runs the job with the
highest priority according to the configuration of the scheduler
• Users can monitor the status of the job queue
• SGE/OGE, OGS/GE: qstat
• Slurm: squeue
• Typical states of a job: pending or queued, running, completed, cancelled, failed…
• Users can monitor the status and progress of their jobs using their identifiers
• SGE/OGE, OGS/GE: qstat -j JOB_ID
• Slurm: scontrol show job JOB_ID
• Users can also delete their jobs from the job queue
• SGE/OGE, OGS/GE: qdel JOB_ID
• Slurm: scancel JOB_ID
54
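• The same resource requests can also be embedded in the job script itself; a minimal Slurm sketch (job name, time limit and application are illustrative), submitted with sbatch:
#!/bin/sh
#SBATCH -J myjob             # job name
#SBATCH -t 00:10:00          # time limit (hh:mm:ss)
#SBATCH -n 1                 # number of tasks
./my_application input.dat   # hypothetical application and input file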
DRM systems
• Example for SGE/OGE
• Simple job script to execute
[user@cluster ~]$ cat sleep.sh
#!/bin/sh
date
sleep 20
date
• Job submission requesting 60 seconds of CPU time
[user@cluster ~]$ qsub -l s_rt=00:00:60 sleep.sh
Your job 2 ("sleep.sh") has been submitted
• Status of the job queue
[user@cluster ~]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-------------------------------------------------------------------------------------------------------------
2 0.55500 sleep.sh user r 24/10/2018 17:37:19 all.q@c0-0.local 1

55
DRM systems
• Redirecting I/O
• The standard output (o) and error (e) streams of the job are redirected to files by
the DRM
• Most common naming convention:
• stdout: <JOB_NAME>.o<JOB_ID>
• stderr: <JOB_NAME>.e<JOB_ID>
• Default location of these files is the directory from which the job has been
submitted
• The default job name is the name of the job script (but it can be changed)
• In our previous example the files would be sleep.sh.o2 and sleep.sh.e2
[user@cluster ~]$ cat sleep.sh.o2
Wed Oct 24 10:19:20 CEST 2018
Wed Oct 24 10:19:40 CEST 2018

56
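• As an illustrative example, the job name and the output/error files can be overridden at submission time (the names below are made up):
[user@cluster ~]$ qsub -N mytest -o mytest.out -e mytest.err sleep.sh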
DRM systems
• Most DRM software also supports interactive jobs/sessions
• Interactive jobs give you an interactive shell on one of the computing nodes
• Accessing the compute nodes this way means that the DRM system guarantees the
resources that you have asked for
• DRM systems provide appropriate commands to allocate resources and put users
in an interactive shell on a specific computing node
• SGE/OGE, OGS/GE: qlogin
• Slurm: salloc
• Interactive jobs are useful for:
• Running graphical applications or software that requires user interaction
• Testing and debugging your code
• Compiling applications on a specific computing node architecture
• The execution of interactive jobs is usually more restricted
• Waiting for user input takes a very long time in the life of a CPU and does not
make efficient usage of the computing resources
• HPC clusters cannot afford to have idle nodes waiting for interactive commands
while other jobs are waiting to get CPU time!
57
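• Illustrative sketches of requesting an interactive session (the resource values are made up):
# SGE/OGE, OGS/GE: interactive login limited to one hour of runtime
[user@cluster ~]$ qlogin -l s_rt=01:00:00
# Slurm: allocate one node for one hour, then start an interactive shell on it
[user@cluster ~]$ salloc -N 1 -t 01:00:00
[user@cluster ~]$ srun --pty bash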
SOFTWARE INSTALLATION AND
MANAGEMENT
Software installation and management
• All Linux distributions provide:
• A specific file format in which software is packaged
• Some sort of package manager that automates the process of installing,
upgrading, configuring, and removing software
• A front-end to the package manager (the back-end) that eases the process
of obtaining and installing packages from repositories and help in
resolving their dependencies
• Multiple CLI/TUI/GUI front-ends can be available
• Common package formats
• Debian (Ubuntu, Linux Mint): deb
• RHEL (CentOS, Fedora, Scientific Linux) and SLES (openSUSE): rpm
• rpm is also the name of a package manager
• In other distributions packages are simply tar archive files that have been
compressed with GZip (tgz), XZ (txz) or BZip2 (tbz)
• E.g. Slackware

59
Software installation and management
• Common package managers
• Debian (Ubuntu, Linux Mint): dpkg (back-end) apt (CLI front-end)
• aptitude provides both CLI/TUI front-ends for apt
• Synaptic is a GTK+-based GUI for apt
• RHEL (CentOS, Fedora): rpm (back-end) and yum (CLI front-end)
• Since Fedora version 22, dnf is the default CLI front-end instead of yum
• SLES and derivatives (openSUSE): ZYpp (back-end) and zypper (CLI
front-end)
• YaST provides, among other things, a GUI front-end for Zypp
• Slackware: pkgtool (back-end)
• Package managers allow installing large software stacks, providing:
• Dependency tracking/resolution
• Software updates
• A means to uninstall software

60
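• For example, installing a compiler from the distribution repositories (gcc is just an example package name):
[root@desktop ~]# apt install gcc      # Debian/Ubuntu
[root@desktop ~]# yum install gcc      # RHEL/CentOS
[root@desktop ~]# zypper install gcc   # SLES/openSUSE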
Software installation and management
• In a typical desktop environment, it is usually enough to have a single
version of a software package installed in the system
• Using the package manager is the most common approach
• Software is installed in standard locations and works out-of-the-box
• For software not available in public repos, it can be manually installed
• By downloading the binaries and copying them into the desired
location
• By downloading the sources and building the software from them
using the configure-make-make install paradigm
• When software is installed in standard locations, no further
configuration is needed
• However, HPC systems are used by a large number of users with
widely varying demands
• In this context, the following requirements arise on HPC clusters:
• Support for multiple versions of a software package
• Installing software in non-standard locations
61
Software installation and management
• Why do we need to have multiple versions of a software package?
• There are multiple competing software that provide either identical or
significantly overlapping functionality, each available in multiple versions
• Compilers: GNU, Intel, PGI
• MPI libraries: Open MPI, MVAPICH2, Intel MPI
• Linear algebra libraries: OpenBLAS, Intel MKL
• In most cases, programs compiled with one version of a compiler cannot
safely link to libraries compiled with another
• Users might have specific version requirements for their code
• A user comes to you and says: “I need package XYZ built with library
ABC version 1.2”
• Why do we need to install software in a non-standard location?
• To install software in a global, shared location that all computing nodes
can access (“install once available everywhere”)
• To keep boot images as small as possible

62
Software installation and management
• How can we support multiple versions of a software package?
• It is tricky to achieve using the package manager
• Pre-packaged software generally assumes that it is the only version of
that software
• When possible, it requires careful configuration
• Easy to achieve when building software from sources, but
• It requires heavy effort from admins
• There is a need for tools that enable automated software
installation
• A specific version of a compiler must be selected, probably among the
multiple options available
• Any other dependency on external libraries (e.g. MPI, BLAS) only
increases the possible combinations!

63
Software installation and management
• How can we install software in a non-standard location?
• It is tricky to achieve using the package manager
• Most pre-packaged software is not relocatable!
• It generally assumes that it is going to be installed in a specific
path, usually a standard location
• Installing a non-relocatable package to a different location
requires rebuilding it from sources
• Even when installing relocatable packages or when building software
from sources, environment variables must be modified accordingly
• PATH to include the location of the binaries the software provides
• LD_LIBRARY_PATH for shared libraries required at runtime
• Additional package-specific variables (e.g. specify a license server)
• Relying on users to handle these environment variables by modifying
their .bashrc (or similar) files is the wrong approach
• Admins can only hope they get it correct!

64
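• A sketch of what this manual approach looks like, assuming a hypothetical package "foo" version 1.2 installed under /opt/apps; this is exactly the per-user burden that the next slides aim to remove:
# Build and install into a non-standard, shared prefix
./configure --prefix=/opt/apps/foo/1.2
make && make install
# Every user would then need something like this in their shell startup files
export PATH=/opt/apps/foo/1.2/bin:$PATH
export LD_LIBRARY_PATH=/opt/apps/foo/1.2/lib:$LD_LIBRARY_PATH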
Software installation and management
• On HPC clusters, it is necessary to look at software installation
differently from a standard Linux desktop system
• A simple yet powerful solution to these issues is based on the
Environment Modules system
• It allows users to easily load, unload and switch between software
packages by modifying the user’s environment on their behalf
• So, all users configure their environment in the same way!
• Users maintain control of their working environment as they switch
between different versions of compilers, MPI libraries, etc
• The environment modules system has become ubiquitous on HPC systems
since the late 1990s
• Be careful not to confuse “environment modules” with “kernel modules”
• There exist multiple implementations of the environment modules
system

65
Software installation and management
• Environment Modules
• http://modules.sourceforge.net/
• This tool lets users easily modify their environment using module files
• A module file is a shell-independent script written in TCL language that contains
the changes to a user’s environment required to support a particular software
• The environment can be modified on a per-module basis using the module
command which interprets module files
• Typically, module files instruct the module command to alter or set shell
environment variables such as PATH, LD_LIBRARY_PATH, etc
• Module files may be shared by many users on a system and users may have
their own collection to supplement or replace the shared module files
• Modules can be loaded and unloaded dynamically and atomically
• All popular shells are supported (bash, ksh, zsh, sh, csh, tcsh, fish), as well as
some scripting languages such as perl, ruby, tcl, python, cmake and R
• Environment modules allow users to specify the exact set of tools
they want or need and are key to operating an effective HPC system
• The bad news for admins is that they must write module files for users
66
Software installation and management
• Environment Modules
• At the heart of environment modules interaction reside the following
components:
• the MODULEPATH environment variable, which defines the list of
searched directories for module files
• A module file (an example on the next slide) associated with each available
software package
• Typically, module files are placed in a directory hierarchy (tree) in a
specific location in the file system
• Ideally, module files are placed in a global, shared location in the
cluster
• Otherwise, module files must be copied to the same hierarchy on all
nodes and maintained there, including making sure they are all identical

67
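• A sketch of a shared module file tree and how it is made visible to the module command (the /opt/shared path and the package versions are assumptions):
# Module files organized as <name>/<version> under a shared directory:
#   /opt/shared/modulefiles/gcc/6.1.1
#   /opt/shared/modulefiles/gcc/6.3.1
#   /opt/shared/modulefiles/openmpi/3.1.0
# Add the shared tree to the list of directories searched for module files
export MODULEPATH=/opt/shared/modulefiles:$MODULEPATH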
Software installation and management
• Environment Modules
• Example of a module file:
#%Module1.0#####################################
## This is an example module to access something
prereq somethingelse
conflict thatothermodule
set version 1.0
set topdir /opt/[installdir]/$version
append-path PATH $topdir/bin
append-path MANPATH $topdir/man
append-path LD_LIBRARY_PATH $topdir/lib64
• line 1: identifies this as a module file
• line 2: is a comment that explains what the module does
• line 3 expresses a prerequisite that the module somethingelse should be loaded
before this one.
• line 4 expresses a conflict, i.e., if thatothermodule is loaded, this one cannot be
loaded as well.
• lines 5 through 9 are the heart of the matter. They modify the environment
variables accordingly by adding directories to the paths
Software installation and management
• Environment Modules
• Loading a module on a Linux machine under bash
[user@cluster ~]$ module load gcc/6.1.1
[user@cluster ~]$ which gcc
/opt/gcc/6.1.1/linux-x86_64/bin/gcc
• Now we'll switch to a different version of the module
[user@cluster ~]$ module switch gcc/6.3.1
[user@cluster ~]$ which gcc
/opt/gcc/6.3.1/linux-x86_64/bin/gcc
• And now we'll unload the module altogether
[user@cluster ~]$ module unload gcc
[user@cluster ~]$ which gcc
gcc not found
• Other useful module commands are:
• module list: list loaded modules
• module avail: lists all the modules which are available to be loaded
• module purge: unload all modules
• module display <module>: display what a module does
Software installation and management
• Lmod: Environmental Modules System
• https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
• Lmod is a new, enhanced and backward-compatible implementation of
Environment Modules that uses module files written in Lua
• It is a drop-in replacement for TCL modules as it is able to read them directly
• Lmod adds some interesting features targeting a hierarchical module
organization and easier interaction (search, load etc.) for the users
• All the popular shells are supported (bash, ksh, csh, tcsh, zsh), and it is also
available for perl and python
• One of the key Lmod features is its ability to handle a module hierarchy
• It only lists available modules that are dependent on the currently loaded
modules, preventing users from loading incompatible modules
• This helps users understand what combinations of modules are available,
because admins might not build every possible combination
• However, if users want to see all possible modules, the Lmod module spider
command lets them see all modules
• The spider command is not available in Environment Modules
• Lmod is usually combined with EasyBuild: a software build and installation
framework written in Python that can generate module files automatically
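• A quick illustration of the spider command (the package name and version are made up):
[user@cluster ~]$ module spider openmpi         # list all OpenMPI modules in the tree
[user@cluster ~]$ module spider openmpi/3.1.0   # show details and prerequisites of this version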
STORAGE AND FILE SYSTEMS
Storage and file systems
• There are usually different kinds of storage available on a cluster, each
with different characteristics and usage restrictions
• HOME directory
• When you log in to a cluster you typically start out in your HOME directory
• HOME directories provide users with safe and long-term storage of important
and critical data (program source, scripts, small input files, etc)
• Backups are generally provided and usage per user is usually restricted
to a small disk quota (e.g. 20 GB per user)
• HOME directories are usually mounted on a shared location that can be used
to setup personal configuration and preferences, as well as test and build
programs to run on the cluster
• A distributed file system is needed so that users' files can be shared
across the cluster and every node can see the exact same set of files
• Most extended approach for storing HOME directories on commodity
clusters: NAS-based storage relying on NFS as distributed file system
• NFS allows the contents of a directory on the server (e.g. the head node)
to be accessed by the clients (the computing nodes) over the network
• For small-medium clusters, the head node provides the HOME storage
• For large clusters, dedicated file servers are preferred
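• A minimal sketch of NFS-shared HOME directories (the host name and private subnet are assumptions):
# On the NFS server (e.g. the head node), /etc/exports:
/home  10.1.1.0/24(rw,sync,no_subtree_check)
# On each computing node, /etc/fstab:
headnode:/home  /home  nfs  defaults  0  0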
Storage and file systems
• There are usually different kinds of storage available on a cluster, each
with different characteristics and usage restrictions
• Local scratch
• Storage provided by local hard disks physically attached to computing nodes
• Data on local scratch cannot be directly read by other computing nodes
• Not intended for long-term storage, but suitable for storing temporary or
intermediate data during execution
• At the end of the execution, final results must be copied (if needed) to a
location that can be globally accessed (e.g. HOME)
• Usually located at /tmp, /temp or /scratch in most systems
• Relying on common block-level local file systems (e.g. ext4, xfs)
• This is usually the fastest storage available to applications
• I/O is carried out within the compute node (no network involved)
• Computing nodes are usually equipped with fast SSD-based disks
• No backups are performed and usage per user is only limited to the available
local storage space
• But purging policies are usually applied (e.g. removing files that have
not been accessed or modified in the past 30 days)
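• A sketch of a job script that stages data through local scratch (the application and file names are illustrative; many DRMs export a per-job scratch directory through a variable such as $TMPDIR):
#!/bin/sh
cp $HOME/input.dat $TMPDIR/              # stage input data into local scratch
cd $TMPDIR
./my_application input.dat > output.dat
cp output.dat $HOME/results/             # copy final results back to shared storage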
Storage and file systems
• There are usually different kinds of storage available on a cluster, each
with different characteristics and usage restrictions
• Global store/scratch
• Some clusters provide additional shared storage locations to provide:
1. Long-term/short-term (store/scratch) storage of large amounts of data
2. High throughput for I/O-intensive parallel applications
3. A globally shared location where software can be installed for all nodes
• Backups are not generally provided and usage per user is either unlimited or
restricted to a large disk quota (e.g. 1 TB per user)
• Purging policies may be applied
• NFS may be enough for 1) and 3) but parallel file systems are required for 2)
• NFS's lack of scalability can be a limitation for I/O-intensive applications
• NFS was not designed for parallel file access
• Examples of parallel file systems are:
• Lustre, OrangeFS (formerly PVFS2), IBM Spectrum Scale (formerly IBM GPFS),
BeeGFS (formerly FhGFS), Ceph, Gfarm, XtreemFS, pNFS (in development as
part of the NFS v4.1 standard)
74
Storage and file systems
• A parallel file system is:
• A type of distributed file system that spreads file data across multiple
storage devices on multiple storage nodes to provide high-performance,
concurrent I/O access by multiple clients
• Multiple storage nodes, multiple storage devices and multiple network paths
to data are used to provide a high degree of parallelism
• A parallel file system usually consists of:
• Multiple Storage Servers that hold the actual file system data
• One or more Metadata Servers that store information about files (e.g.
directories) to help clients make sense of the data stored in the file system
• Metadata information can also be distributed over the storage servers
• And, optionally:
• Monitoring software that ensures continuous availability of all the
components
• A redundancy layer that replicates in some way information in the cluster,
so that the file system can survive the loss of some component server
75
Storage and file systems

• In a parallel file system:


• You add storage servers, not only disk space
• Each added storage server brings in more memory, more processing
power and more network and disk bandwidth
• However, software complexity is much higher
• There is no “easy” and standard solution like NFS
• In the next slides, we will review the architecture of the most
popular parallel file system for HPC: Lustre

76
Storage and file systems
• Lustre is an open-source, POSIX-compliant parallel file system
designed for scalability, high-performance and high-availability
• Lustre runs on Linux-based operating systems and employs a client-server
network architecture
• http://lustre.org
• Lustre servers can aggregate tens of petabytes of storage for a single
file system, providing high combined I/O bandwidths (> 1 TB/s)
• Lustre scales to meet the requirements of applications from small-scale
HPC environments up to the very largest supercomputers
• Redundant servers support storage failover, while metadata and data
are stored on separate servers
• Lustre can deliver fast I/O to parallel applications across high-speed
networks such as InfiniBand and Ethernet
• Used by many of the TOP500 supercomputers

77
Storage and file systems

• Lustre architecture
• Object Storage Device (OSD)
• Lustre is built on a distributed object storage model where all backend block
storage is abstracted by the OSD interface
• Lustre servers
• Management Server (MGS)
• MGS stores configuration information for all the Lustre file systems in a cluster
• Metadata Server (MDS)
• MDS manages all metadata operations for a Lustre file system (e.g. file
creation/deletion, renaming…)
• Object Storage Server (OSS)
• OSS provides bulk storage for the contents of files in a Lustre file system
• Lustre clients
• All application-level file system access is performed over a network fabric
between Lustre clients and the OSSs
• Lustre Networking (LNet)
• High-speed data network protocol that clients use to access the file system
78
Storage and file systems

• Lustre architecture
• Object Storage Device (OSD)
• Lustre has three types of OSD instance:
• Management Target (MGT): used by MGS to maintain file system
configuration data used by all hosts in the Lustre environment
• Metadata Target (MDT): used by MDS to record the file system
namespace (i.e. file and directory information)
• Object Storage Target (OST): used by OSS to record data objects
representing the contents of files
• OSD relies on a local file system on Lustre servers, supporting two different
backend file systems
• LDISKFS (based on EXT4)
• It will be necessary to patch the Linux kernel for Lustre servers
• ZFS
• ZFS does not require patches to the kernel
• ZFS is a combined file system and logical volume manager that
provides RAID-Z (“soft” RAID) for data integrity protection,
snapshots, automatic repair…
79
Storage and file systems

• Lustre architecture
• Management Server (MGS)
• MGS stores configuration information for all Lustre file systems in a cluster
• Servers and clients connect to the MGS on startup in order to retrieve the
configuration for the file system
• Persistent configuration information for all Lustre nodes is recorded by the
MGS onto an OSD called a Management Target (MGT)
• MGS does not participate in file system operations
• Metadata Server (MDS)
• MDS manages all metadata operations for a Lustre file system
• MDS stores this information about a Lustre file system on an OSD referred to
as Metadata Target (MDT)
• An MDT stores namespace metadata such as filenames, directories,
access permissions and file layout (i.e. striping)
• A Lustre file system will always have at least one MDS and corresponding
MDT (more can be added to meet the scaling needs of applications)
• Multiple MDSs are possible in an active/passive configuration, all of them
attached to a shared MDT storage
80
Storage and file systems

• Lustre architecture
• Object Storage Server (OSS)
• OSS provides bulk storage for the contents of files in a Lustre file system,
storing file data on an OSD called Object Storage Target (OST)
• A single OSS typically serves between 2 and 12 OSTs, which can be stored on
direct-attached storage or accessed through a SAN device
• A single Lustre file system can scale to hundreds of OSSs
• The capacity of a Lustre file system is the sum of the capacities provided
by the OSTs across all of the OSSs
• The aggregated bandwidth can approach the sum of the bandwidths
provided by all the OSSs
• OSSs are usually configured in pairs, with each pair connected to a shared
external storage enclosure that stores the OSTs
• OSTs in the enclosure are accessible via the two servers in an active-
passive failover configuration to provide high availability

81
Storage and file systems

• Lustre architecture
• Clients
• Lustre client combines the metadata and object storage into a single POSIX
file system
• Represented to the client OS as a file system mount point
• A Lustre file system mounted on the client looks much like any other POSIX
file system
• Applications do not need to be re-written to run with Lustre
• Lustre clients do not access storage directly: all I/O is sent over the network
• Lustre Networking (LNet)
• LNet: high-speed network protocol that clients use to access the file system
• LNet supports Ethernet, InfiniBand, Intel Omni-Path Architecture (OPA) and
specific compute fabrics such as Cray Gemini/Aries
• LNet allows for full RDMA throughput and zero-copy communications
when supported by the underlying network
• LNet supports routing, providing maximum flexibility for connecting
different network fabric technologies
• LNet supports multi-homing to improve performance and reliability
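• As a sketch, a client mounts a Lustre file system by pointing at the MGS over LNet; the NID, file system name and mount point below are made up:
# Mount the file system "lustre01" whose MGS is reachable at 10.2.0.1 over InfiniBand (o2ib)
[root@client ~]# mount -t lustre 10.2.0.1@o2ib:/lustre01 /mnt/lustre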
Storage and file systems

• Lustre architecture

83
Storage and file systems

• Lustre file layout


• Each Lustre file has its own layout allocated by the MDS upon creation
• File Layout is fixed once created
• File layout is selected by the client, either:
• By policy (inherited from parent directory)
• By the user or application
• The layout determines two basic properties of each Lustre file:
• Stripe count: the number of OSTs to stripe data across
• Stripe size: how much data is written to each OST before moving to the next

84
Storage and file systems
• Lustre I/O flow
• When the client opens a file, it sends a request to the MDS
• MDS responds to the client with information about the file layout
• Which OSTs are used, stripe size…
• Based on the file offset, client can calculate which OST holds the data
• Client directly contacts appropriate OST to read/write data

85
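To make the offset-based calculation on the previous slide concrete, here is a minimal C sketch (my own illustration, not Lustre client code); the 1 MiB stripe size, the stripe count of 4 and the example offsets are assumed values:

#include <stdio.h>
#include <stdint.h>

/* Minimal sketch: map a file offset to the stripe (i.e. the OST index within
 * the file layout) that holds it, given the stripe size and stripe count
 * returned by the MDS. Illustrative only, not actual Lustre client code. */
static int stripe_for_offset(uint64_t offset, uint64_t stripe_size,
                             unsigned stripe_count)
{
    return (int)((offset / stripe_size) % stripe_count);
}

int main(void)
{
    uint64_t stripe_size = 1 << 20;  /* 1 MiB stripes (assumed layout) */
    unsigned stripe_count = 4;       /* data striped across 4 OSTs (assumed) */
    uint64_t offsets[] = { 0, 512 << 10, 3 << 20, 9 << 20 };

    for (unsigned i = 0; i < sizeof offsets / sizeof offsets[0]; i++)
        printf("offset %llu -> stripe %d\n",
               (unsigned long long)offsets[i],
               stripe_for_offset(offsets[i], stripe_size, stripe_count));
    return 0;
}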
Storage and file systems
• FinisTerrae II at CESGA
• HOME
• Long-term storage of user critical data (accessible via $HOME)
• Quota: 10 GB and 100K files (backups are performed)
• Global store
• Long-term storage of user data (accessible via $STORE)
• Quota: 500 GB and 300K files (no backups)
• Long-term storage provided by the Lustre parallel file system intended for
I/O-intensive applications (accessible via $LUSTRE)
• Quota: 3 TB and 200K files (no backups)
• Local scratch
• Short-term storage provided by local hard disks physically attached to
computing nodes, which is automatically deleted when jobs are finished
• Accessible via $LOCAL_SCRATCH or $TMPDIR
• Global scratch
• Short-term storage provided by the Lustre parallel file system, which is
automatically deleted when jobs are finished
• Accessible via $LUSTRE_SCRATCH 86
Storage and file systems
• FinisTerrae II at CESGA
• Lustre parallel file system over InfiniBand
• 4x OSS servers and 2x MDS servers
• OSS servers are connected to 120-disk enclosures
• Each disk provides 2 TB of capacity
• The disks in each 120-disk enclosure are configured into 12
RAID6 groups using 10 disks (8 data disks + 2 parity disks)
• Each 10-disk RAID6 group provides 16 TB of usable storage
• OSS servers format each RAID6 group as an OST
• Each OSS serves 12 OSTs
• Each 120-disk enclosure contains 12 OSTs
• 48x OSTs and 480x 2TB disks in total
• Up to 750 TB of usable storage
• Up to 25.7 GB/s and 29.8 GB/s for write and read bandwidths

87
INTERCONNECTION NETWORKS
Interconnection networks
• Interconnection networks are fundamental components in HPC clusters
• Bandwidth and latency are the key factors for a network:
• Bandwidth: maximum throughput of a physical communication path
• Critical to move very large blocks of data
• Latency: the time from the source sending a packet to the destination receiving it
• Critical to move a large number of small blocks of data
• A cluster generates multiple types of traffic:
• Computation traffic between computing nodes (from parallel applications)
• Storage traffic (from an NFS server, from parallel file systems, etc)
• Administration traffic (from job control, node monitoring, etc)
• Management traffic (from remote node management through IPMI)
• A cluster may have as little as one private network
• All traffic going through a single connection, usually Gigabit Ethernet (1 Gbps)
• High-performance clusters provide separate private networks
• At least to separate computation/storage and administration/management traffic
• Different types of networks are used for different types of traffic
89
Interconnection networks

• Computation traffic
• Most parallel HPC applications rely on the MPI standard
• For providing scalable performance, MPI requires high-bandwidth and low-
latency networks, especially when using a large number of cores
• Current high-speed networks can provide sub-microsecond latencies and
bandwidth rates of up to 50-100 Gbps
• Popular choices in HPC:
• Commodity clusters and supercomputers: 10/40/100 Gigabit Ethernet,
InfiniBand QDR/FDR/EDR, Intel OmniPath Architecture (OPA)
• Other supercomputers can rely on custom/proprietary networks (e.g. Cray
Gemini/Aries, IBM Blue Gene/Q)
• Storage traffic
• NFS traffic can rely on Gigabit Ethernet when used for storing HOME directories
• Other NFS uses would require channel bonding over Gigabit Ethernet, 10
Gigabit Ethernet or even NFS over InfiniBand
• Parallel file systems have similar requirements to MPI applications
• The network for computation traffic can be shared for small deployments, but
for providing the highest performance a separate network is a must
90
Interconnection networks

• Administration traffic
• This traffic includes system administration and cluster in-band management tasks,
traffic from scheduling software (job management), monitoring tools, etc
• No latency or bandwidth requirements are generally needed
• Gigabit Ethernet is the network par excellence for this traffic

• Management traffic
• Mostly from remote node management tasks (e.g. rebooting, powering on/off)
performed through IPMI-like interfaces provided by the nodes
• The IPMI specification is a set of interfaces that provides out-of-band management and
monitoring capabilities independently of the host system's CPU and OS
• At the core of IPMI is a hardware chip that is known as the Baseboard Management
Controller (BMC) or Management Controller (MC)
• Vendors provide IPMI-compliant BMC implementations (accessible via web interface)
• Dell: integrated Dell Remote Access Controller (iDRAC)
• Hewlett-Packard Enterprise (HPE): integrated Lights-Out (iLO)
• Lenovo: Integrated Management Module (IMM)
• No latency or bandwidth requirements are generally needed
• It can share the administration network, but large deployments require a dedicated
network to isolate both types of traffic 91
Interconnection networks
• InfiniBand refers to two different things:
• A switch-based point-to-point interconnect architecture that defines
• A layered hardware protocol
• Physical, Link, Network, Transport and Upper Layer Protocols
• Multiple devices needed for system communication
• Channel adapters, switches, routers, etc
• A programming API called InfiniBand Verbs (IBV)
• The IBV API is an implementation of a Remote Direct Memory
Access (RDMA) technology
• Remote: data transfers between computers in a network
• Direct: no OS kernel involvement in data transfers (everything about the
transfer is offloaded onto the network adapter, bypassing the kernel)
• Memory: transfers between user-space application memory (no extra
copying or buffering)
• Access: send, receive, read, write and atomic operations are supported
• Both aspects (InfiniBand networks and RDMA) are covered next
92
Interconnection networks
• InfiniBand Architecture (IBA)
• The InfiniBand Architecture (IBA) is an industry standard defined and
developed by the InfiniBand Trade Association (IBTA)
• IBA building blocks
• Channel Adapters (CA)
• Device that terminates an InfiniBand link and executes transport-level
functions
• Host Channel Adapter (HCA)
• Equivalent to a network Ethernet card
• HCA provides an interface to a host device and supports the IBV
API
• Target Channel Adapter (TCA)
• TCA provides the connection to a storage I/O device that supports
a subset of HCA features
• Switches
• Device that moves packets from one link to another of the same subnet

93
Interconnection networks
• InfiniBand Architecture (IBA)
• IBA building blocks
• Routers
• Device that transports packets between different subnets
• Bridges/Gateways
• InfiniBand to Ethernet/Fibre Channel
• Subnet Manager (SM)
• SM configures the local subnet and ensures its continued operation
• At least one subnet manager must be present in a subnet
• SM manages all switch and router setups for subnet reconfiguration
when a link goes down or a new link comes up
• SM is responsible for discovering the physical topology and setting up the
logical topology by configuring the forwarding tables in each switch
• SM may reside within any of the devices on the subnet
• Typically running on a switch or on a server

94
Interconnection networks
• InfiniBand Architecture (IBA)
• IBA building blocks (diagram)
95
Interconnection networks
• InfiniBand Architecture (IBA)
• IBA is a stack divided into multiple layers where each layer operates
independently of one another
• Physical, Link, Network, Transport, Upper Layers

96
Interconnection networks
• InfiniBand physical layer
• Defines both electrical and mechanical characteristics
• Cables and receptacles for optical fiber and copper media
• Backplane connectors for rack-mounted systems
• Backplane connectors are electrical parts used to connect several Printed
Circuit Boards (PCBs) together within a computer system
• Physical links are bidirectional point-to-point communication channels
that provide a full duplex connection at a base speed of 2.5 Gbps
• Copper: four wires comprising a differential signaling pair for each direction
• Optical fiber: two optical fibres, one for each direction

97
Interconnection networks
• InfiniBand physical layer
• Links can be aggregated to achieve greater bandwidth using byte striping
• Link widths: 1X, 4X, 8X and 12X
• Link signaling rates
• 2001: Single Data Rate (SDR): 2.5 Gbps per lane
• 2005: Double Data Rate (DDR): 5 Gbps per lane
• 2007: Quad Data Rate (QDR): 10 Gbps per lane
• 2011: Fourteen Data Rate (FDR10): 10.3125 Gbps per lane
• 2011: Fourteen Data Rate (FDR): 14.0625 Gbps per lane
• 2014: Enhanced Data Rate (EDR): 25.78125 Gbps per lane
• 2017: High Data Rate (HDR): 51.5625 Gbps per lane
• Link encoding
• SDR, DDR, QDR: 8/10 bit encoding (8 data bits + 2 control bits)
• FDR10/FDR, EDR, HDR: 64/66 bit encoding (64 data bits + 2 control bits)
• Link speed
• Multiplication of the link width and link signaling rate
• Link encoding reduces the effective bandwidth (i.e. link data rate)
98
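As a quick way to reproduce the effective data rates listed in the table on the next slide, the following C sketch (an illustration of mine, not part of any InfiniBand software) multiplies link width, per-lane signaling rate and encoding efficiency:

#include <stdio.h>

/* Effective (data) rate of an aggregated link: width (lanes) x per-lane
 * signaling rate x encoding efficiency (8b/10b or 64b/66b). */
static double data_rate_gbps(int lanes, double lane_gbps, double encoding)
{
    return lanes * lane_gbps * encoding;
}

int main(void)
{
    printf("4X QDR: %.2f Gbps\n", data_rate_gbps(4, 10.0, 8.0 / 10.0));      /* 32    */
    printf("4X FDR: %.2f Gbps\n", data_rate_gbps(4, 14.0625, 64.0 / 66.0));  /* ~54.5 */
    printf("4X EDR: %.2f Gbps\n", data_rate_gbps(4, 25.78125, 64.0 / 66.0)); /* 100   */
    return 0;
}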
Interconnection networks
• InfiniBand physical layer
• The most common configuration today is 4X FDR/EDR links
                           SDR   DDR   QDR   FDR10    FDR      EDR       HDR
Link encoding              8/10  8/10  8/10  64/66    64/66    64/66     64/66
1X Link speed (Gbps)       2.5   5     10    10.3125  14.0625  25.78125  51.5625
1X Link data rate (Gbps)   2     4     8     10       13.64    25        50
4X Link speed (Gbps)       10    20    40    41.25    56.25    103.125   206.25
4X Link data rate (Gbps)   8     16    32    40       54.54    100       200
8X Link speed (Gbps)       20    40    80    82.5     112.5    206.25    412.5
8X Link data rate (Gbps)   16    32    64    80       109.09   200       400
12X Link speed (Gbps)      30    60    120   123.75   168.75   309.375   618.75
12X Link data rate (Gbps)  24    48    96    120      163.64   300       600
Latency (microseconds)     5     2.5   1.3   0.7      0.7      0.5       <0.5
99
Interconnection networks
• InfiniBand physical layer
• 4X QSFP+ active optical fiber cable and 4X QSFP+ passive copper cable (photos)
100
Interconnection networks
• InfiniBand link layer
• Responsible for the transmission and reception of packets
• Management packets: used for link configuration and maintenance
• Data packets: carry up to 4 KB of payload (MTU)
• Switching
• All devices within a subnet have a unique 16-bit Local IDentifier (LID)
assigned by the SM when link becomes active
• Not persistent through reboots!
• Within a subnet, packet forwarding and switching is handled at the link layer
using LID for addressing
• Link-level switching forwards packets to the device specified by a
destination LID contained in the Local Route Header (LRH) of the packet
• QoS
• Supported through Virtual Lanes (VL), which are separate logical links that
share a single physical link
• Each link can support up to 15 standard VLs and 1 management lane (VL15)
• VL15 is the highest priority and VL0 is the lowest
• Management packets use VL15 exclusively 101
Interconnection networks
• InfiniBand link layer
• Credit-based link-level flow control
• Flow control is used to manage data flow between two links in order to
guarantee a lossless fabric (no packet loss even in the presence of congestion)
• Each receiving link supplies credits to the sending link to specify the amount
of data that can be received without loss of data
• Credit passing between each device is managed by a dedicated management
packet to update the number of data packets the receiver can accept
• Data is not transmitted unless the receiver advertises credits indicating that
receive buffer space is available
• Flow control is handled on a per VL basis
• Congestion on one VL does not impact traffic with guaranteed QoS on another
VL even though they share the same physical link
• Error detection for data integrity
• Two CRCs per packet to provide link-level data integrity between two hops
and end-to-end data integrity
• Ethernet defines only a single CRC: an error can be introduced within a
device which then recalculates the CRC. Check at the next hop would reveal
a valid CRC even though the data has been corrupted 102
Interconnection networks
• InfiniBand network layer
• Network layer handles routing of packets from one subnet to another
• Each device has a constant 64-bit Global Unique IDentifier (GUID)
• Assigned by vendors (i.e. persistent through reboots)
• GUID is “like an Ethernet MAC address” (i.e. unique in the world)
• Packets sent between subnets contain a Global Route Header (GRH)
• A router reads the GRH to forward the packet based on the Global
IDentifier (GID) of the destination
• Each device within a fabric is identified by its GID
• The GID is a 128-bit address “similar” to an IPv6 address
• GID consists of the vendor-specified GUID plus a subnet identifier
• The router rebuilds each packet modifying the LRH with the proper LID
on the next subnet
• The last router in the path replaces the LID in the LRH with the LID of the
destination port
• Within a subnet, packets do not require the network layer information and
header overhead (the common scenario in computing clusters)
103
Interconnection networks
• InfiniBand transport layer
• IBA offers a significant improvement for the transport layer: all functions
in this layer are implemented in hardware by the CAs
• The transport layer is responsible for Segmentation and Reassembly (SAR)
• When sending, data coming from the application (i.e. a message) is divided
into multiple packets of the proper size based on the MTU
• When receiving, the receiver reassembles the packets into messages
• At this layer, CAs communicate using an asynchronous model based on
Work Queues (WQ) and Queue Pairs (QP)
• A QP is a bi-directional message transport engine that consists of two WQs:
• The Send queue is used for placing instructions which transfer data from the
consumer’s memory to the memory of another consumer
• The Receive queue contains incoming data and instructions which specify where
the data is to be placed in the memory of the receiving consumer
• Instructions are indicated by placing a WQ Element (WQE) into a QP
• WQEs are processed by the CA from the Send queue of a local QP and sent
out to the Receive queue of a remote QP
104
Interconnection networks
• InfiniBand transport layer
• User-level applications interact directly with CAs, bypassing the kernel (diagram)
105
Interconnection networks
• InfiniBand transport layer
• Types of transfer operations
• Send/Receive
• Data is sent from the local Send queue to the remote Receive queue
• The receiver must have previously posted a receive buffer to receive the data
• The sender does not have any control over where the data will reside in the
remote memory
• RDMA Write
• Data is written from local memory (active side) to remote memory (passive side)
• RDMA Read
• Data is read into local memory (active side) from remote memory (passive side)
• Atomic extensions to RDMA operations
• Fetch and Add, Compare and Swap
• RDMA operations allow implementing efficient zero-copy protocols
• Passive side uses NO CPU cycles during RDMA operations
• Prior to performing any RDMA operation
• The passive side must provide appropriate permissions to access its memory
• The active side must obtain the remote memory address 106
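As a hedged C sketch of how one of these operations is posted through the verbs API: the function below assumes an RC queue pair that is already connected and a remote address and rkey obtained out of band (connection setup and error handling are omitted), so it is an illustration rather than a complete program:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: 'qp' is assumed to be an RC QP already connected to the
 * remote side, 'mr' a registered local buffer, and the remote virtual
 * address and rkey exchanged beforehand (e.g. over a TCP socket). */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof sge);
    sge.addr   = (uintptr_t)mr->addr;   /* local buffer (active side) */
    sge.length = len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof wr);
    wr.opcode     = IBV_WR_RDMA_WRITE;  /* one-sided: passive side CPU not involved */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The WQE is placed on the Send queue and processed by the CA */
    return ibv_post_send(qp, &wr, &bad_wr);
}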
Interconnection networks
• InfiniBand transport layer
• Connected transport types
• A QP communicates with exactly one other QP
• Connection-oriented transport service
• Messages are transmitted by the Send queue of one QP to the Receive queue
of the other QP
• If a message is bigger than the MTU, it is fragmented on the sender
side and reassembled on the receiver side (multi-packet messages)
• Reliable Connection (RC)
• It is guaranteed that messages are delivered at most once, in order and
without corruption
• Reception of messages is acknowledged
• A RC connection is very similar to a TCP connection
• Unreliable Connection (UC)
• There is no guarantee that the messages will be received (packets
may be lost) or about the order of the packets
• Messages with errors are not retried by the transport: error handling
must be provided by a higher level protocol 107
Interconnection networks
• InfiniBand transport layer
• Unconnected transport types
• A QP can communicate with any number of other QPs of the same type
• Connection-less transport service
• Messages can be transmitted in either unicast or multicast (one to many) way
• Each message is limited to one packet (single-packet messages)
• The maximum message size is the MTU
• Reliable Datagram (RD)
• It is guaranteed that messages are delivered at most once, in order and
without corruption
• Reception of messages is acknowledged
• Not implemented in current hardware
• Unreliable Datagram (UD)
• There is no guarantee that the messages will be received (packets
may be lost)
• Messages with errors are not retried by the transport: error handling
must be provided by a higher level protocol
108
Interconnection networks
• InfiniBand Upper Layer Protocols (ULPs)
• ULPs connect InfiniBand to common interfaces
• Computing - Clustering
• Message Passing Interface (MPI)
• Reliable Datagram Socket (RDS)
• Connectionless protocol for delivering datagram sockets
• Network
• IP over InfiniBand (IPoIB)
• IPoIB defines how to encapsulate IP packets over InfiniBand
• Sockets Direct Protocol (SDP)
• TCP-bypass protocol to support stream sockets
• Storage
• iSCSI Extensions for RDMA (iSER)
• SCSI RDMA Protocol (SRP)
• NFSoRDMA (NFS over RDMA)
109
Interconnection networks
• InfiniBand Upper Layer Protocols (ULPs)
• IPoIB (diagram)
110
Interconnection networks
• InfiniBand network topologies
• Two-level fat-tree topology
• The most common topology for clusters based on InfiniBand
• Also known as Constant Bisectional Bandwidth (CBB) or CLOS networks
• Switches at the top level are called Level 2/Core
• Switches at the bottom level are called Level 1/Edges
• Internal connections are between core and edge switches
• External connections are between edge switches and servers
• Non-blocking fat-tree
• The number of external and internal connections is the same
• For any edge switch, the number of links going down to the servers is
equal to the number of links going up to the core switches
• Uplink and downlink bandwidths remain the same for edge switches
• Blocking fat-tree
• Higher number of external connections than internal ones
• Uplink bandwidth is lower than downlink bandwidth for edge switches
111
Interconnection networks
• InfiniBand network topologies
• Star topology
• When possible, small clusters can use a single “big” switch
• Current InfiniBand switches can provide up to 648 ports in 28Us
• The wiring is really simple by drawing cables from every rack to the switch
• Actually, those “big” switches implement a two-level fat-tree internally
• Using a single “big” switch is generally more expensive than building the
same fat-tree network manually with smaller switches
• Other topologies
• 2D/3D Mesh
• 2D: each node is connected to four other nodes: X and Y axis (positive
and negative), except nodes located at X and Y ends of the mesh
• 3D: each node is connected to six other nodes: X, Y and Z axis (positive
and negative), except nodes located at X, Y and Z ends of the mesh
• 2D/3D Torus
• The X, Y and Z ends of the 2D/3D meshes are “wrapped around” and
connected to the first node (all nodes are connected to four/six other
nodes) 112
Interconnection networks
• InfiniBand network topologies
• Example: two-level non-blocking fat-tree using six 36-port switches
• Each of the four switches on the edge level has 18 ports dedicated to
connecting servers (the small circles in the bottom of the picture)
• The other 18 ports of each edge switch are connected to the core layer
• Two bundles of 9 ports (the thick lines) go to the core switches
• Two servers connected to the same edge switch can communicate via that
particular edge switch without going up to the core level
• If two servers are connected to different edge switches, the packets will
travel up to any of the core switches and then down to the target edge switch
• The maximum traffic path includes 3 switches (3 hops)

113
Interconnection networks
• InfiniBand network topologies
• Two-level non-blocking fat-tree networks
• PE = number of ports of edge switches
• PC = number of ports of core switches
• The largest non-blocking network that can be built will connect:
• Up to PC*PE/2 servers
• Using PE/2 core switches and PC edge switches
• When PE = PC = P
• Up to 2*(P/2)2 servers
• Using P/2 core switches and P edge switches
• Examples
• Largest network using 36-port switches
• 648 servers using 18 core switches and 36 edge switches
• Largest network using 324-port switches
• 52488 servers using 162 core switches and 324 edge switches
• Largest network using 648-port core switches and 36-port edge switches
• 11664 servers using 18 core switches and 648 edge switches 114
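These switch counts can be checked with a small C sketch (my own, simply applying the PC*PE/2 formula above):

#include <stdio.h>

/* Largest two-level non-blocking fat-tree: each PE-port edge switch keeps
 * half of its ports for servers and half for uplinks, so PC edge switches
 * hang from PE/2 core switches, connecting PC * PE/2 servers in total. */
static void fat_tree(int pe, int pc)
{
    printf("PE=%d, PC=%d -> %d servers (%d core + %d edge switches)\n",
           pe, pc, pc * (pe / 2), pe / 2, pc);
}

int main(void)
{
    fat_tree(36, 36);    /* 648 servers   */
    fat_tree(324, 324);  /* 52488 servers */
    fat_tree(36, 648);   /* 11664 servers */
    return 0;
}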
Interconnection networks
• InfiniBand network topologies
• Two-level non-blocking fat-tree networks
• A “big” switch is built upon a two-level fat-tree internally using smaller,
“monolithic” (i.e. using a single ASIC chip) switches, all combined in a
single chassis
• These “big” switches are usually known as Director switches
• They are built to be modular and allow for easy expansion
• The number of ports represents the maximum capacity when it is fully
populated
• Director switches are easier to manage than many small, non-Director
switches needed to build the same fat-tree network manually
• The wiring pattern is simpler and more reliable: any cabling between edge and
core levels is now inside the Director switch (you also save on cables!)
• When using Director switches at the core level and smaller, non-Director
ones at the edge level you obtain in the end a three-level fat-tree network
• When using Director switches both at the core and edge levels you obtain in
the end a four-level fat-tree network
• Director switches are needed for building fat-tree networks that connect a
very large number of servers 115
Interconnection networks
• InfiniBand network topologies
• Two-level non-blocking fat-tree networks
• Example: a 648-port Director switch can be built upon a two-level
fat-tree network using 54 36-port “monolithic” switches

116
Interconnection networks
• InfiniBand network topologies
• Two-level non-blocking fat-tree networks
• Example: a two-level fat-tree network built upon two 648-port
Director core switches and 72 36-port “monolithic” edge switches
• It may also be seen as a three-level fat-tree network

117
Interconnection networks
• InfiniBand network topologies
• Non-blocking vs blocking fat-tree networks
• Non-blocking
• For any edge switch, the number of links going down to servers is equal
to the number of links going up to core switches
• Uplink and downlink bandwidths remain the same for all edge switches
• The numbers of uplink and downlink ports are in a 1:1 proportion
• Blocking
• Downlink and uplink ports are distributed in a different proportion
• Example: a blocking factor of 2 means that you have twice as many links
going down as going up (2:1)
• Uplink bandwidth is lower than downlink bandwidth for edge switches
• A blocking network means that two packets that would otherwise follow
separate paths, would be queued and one of them would have to wait
• This introduces latency which is never good for parallel computing
• Blocking is only used because it allows building somewhat cheaper
networks (fewer switches are necessary) or connecting more nodes with
the same hardware than a non-blocking network would allow
118
Interconnection networks

• RDMA networks
• RDMA enables high-throughput, low-latency networking with low CPU
utilization, which is especially useful in computing clusters
• Most widely used MPI libraries implement their communications directly
on top of API provided by the RDMA network
• Open MPI, MVAPICH, Intel MPI…
• InfiniBand is an RDMA-enabled network that has no standard API
• IBA only defines the functionality that must be provided by the RDMA
adapter in terms of “verbs”
• A “verb” is an abstract representation of a function that must exist
• The syntax of these functions is left to the vendors
• This is called the InfiniBand Verbs API (IBV)
• The de-facto standard implementation of the IBV API is developed and
maintained by the OpenFabrics Alliance (OFA)
• https://www.openfabrics.org
• OFA provides a full RDMA software stack released as part of the
OpenFabrics Enterprise Distribution (OFED), available for
Linux/Windows and FreeBSD 119
Interconnection networks
• RDMA-based vs IP-based communications
• Typical IP data transfer
• Application X on computer A sends some data to application Y on computer
B. As part of the data transfer, the kernel on B must:
• Receive the data and decode the packet headers
• Determine that the data belongs to application Y
• Wake up application Y
• Wait for application Y to perform a read system call into the kernel
• Manually copy the data from the kernel's own internal memory space into the
buffer provided by application Y
• Most network traffic must be copied across the system's main memory bus at
least twice
• Once when the host network adapter uses DMA to put the data into the kernel-
provided memory buffer, and again when the kernel moves the data to the
memory buffer provided by application Y
• The computer must execute a number of context switches to switch between
kernel context and application Y context
• Both things impose extremely high CPU loads on the system when network
traffic is flowing at very high rates and can slow down other tasks
120
Interconnection networks
• RDMA-based vs IP-based communications
• RDMA zero-copy data transfer
• RDMA allows the host network adapter in computer B to know:
• When a packet comes in from the network
• Which application should receive that packet
• Where in the application's memory space it should go
• So, the host network adapter can transfer the data directly to the buffer
provided by application Y, eliminating the need to copy data between
application memory and the kernel buffers in the OS

Source: http://www.datacenterjournal.com/maximize-your-software-defined-data-center-infrastructure-efficiency-with-rdma-enabled-
interconnects/
121
Interconnection networks
• RDMA-based vs IP-based communications
• Most widely used MPI libraries implement their communications directly
on top of the IBV API implemented by OFA
• Open MPI, MVAPICH, Intel MPI…
• However, existing IP-based networking applications are usually built
upon the standard Berkeley sockets API
• IPoIB allows socket-based applications to run over InfiniBand seamlessly
• However, network traffic using IPoIB goes through the normal IP stack!
• Applications running over IPoIB will work on top of the full speed of the
InfiniBand link, although the CPU will probably not be able to run the IP
stack fast enough to use all this speed (e.g. 56 Gbps for FDR)
• Latency is significantly increased and bandwidth is reduced
• SDP allows TCP-based applications to run over InfiniBand seamlessly
• However, SDP only deals with stream sockets
• SDP outperforms IPoIB but still far from native RDMA performance
• Why not RDMA over Ethernet (non-InfiniBand) hardware?
122
Interconnection networks
• RDMA networks over Ethernet
• Currently, there are two additional RDMA technologies that allow running
the OFA implementation of the IBV API on Ethernet hardware
• Internet Wide Area RDMA Protocol (iWARP)
• Layer 3 protocol that implements RDMA over TCP/IP
• iWARP adapters offload the processing of the entire TCP/IP stack onto the
hardware by providing a TCP Offload Engine (TOE)
• Zero-copy and kernel-bypass mechanisms are implemented by the iWARP
extensions to TCP/IP that were standardized by the IETF
• Direct Data Placement protocol (DDP)
• RDMA over Converged Ethernet (RoCE)
• RoCE benefits from advances defined by the Data Center Bridging (DCB)
task group that make Ethernet a lossless fabric like InfiniBand
• E.g. Priority Flow Control (PFC)
• So, it requires DCB-compliant Ethernet switches properly configured
• Current efforts are moving towards RoCE over lossy Ethernet (Resilient RoCE)
• There exist two different RoCE specifications, RoCE v1 and RoCE v2, which
implement layer 2 and layer 3 protocols, respectively
123
Interconnection networks
• RDMA networks over Ethernet
• RDMA over Converged Ethernet (RoCE)
• RoCE v1
• Layer 2 protocol that implements RDMA over Ethernet by placing the
InfiniBand transport protocol directly over the Ethernet link layer
• LRH header at the InfiniBand link layer is replaced by the Ethernet
MAC header
• EtherType field indicates the payload encapsulates the RoCE v1
protocol
• RoCE v1 allows communication between any two hosts in the same
Ethernet broadcast domain (RoCE v1 is not routable)
• RoCE v2
• Layer 3 protocol that implements RDMA over UDP/IP, which enables
RoCE v2 packets to be routed
• GRH header at the InfiniBand network layer is replaced by the
standard IP networking header
• UDP encapsulation of the layer 4 payload allows packets to be
forwarded efficiently by routers as a mainstream data path operation
124
Interconnection networks
• RDMA networks supported by OFA (diagram)
Source: https://www.snia.org 125
Interconnection networks

• FinisTerrae II at CESGA
• Computation traffic between computing nodes
• InfiniBand FDR (56 Gbps)
• Storage traffic
• Gigabit Ethernet for $HOME and $STORE
• InfiniBand FDR (56 Gbps) for $LUSTRE and $LUSTRE_SCRATCH
• Administration and management traffic
• 2x Gigabit Ethernet
• InfiniBand topology
• Two-level fat-tree using 36-port switches
• 6 core switches
• 17 edge switches
• Compute nodes using a 2:1 blocking configuration
• 12 uplink ports to core switches and 24 downlink ports to compute nodes
• I/O nodes (e.g. Lustre servers) using a non-blocking configuration
126
CLUSTER MONITORING AND
OPTIMIZATION
Cluster monitoring and optimization
• A crucial administration task is to keep the cluster up and running
• Why is it so important?
• To verify the proper operation of the whole cluster
• To detect performance bottlenecks and misuses
• Early detection of possible problems:
• Unplanned service downtimes
• Attacks/intrusions
Monitoring ↔ Optimization (diagram)
128
Cluster monitoring and optimization
• Monitoring a cluster involves, among other things:
• Checking the “health” of all the nodes
• Are any nodes down? Why?
• Checking the efficiency of the overall system
• Are the system resources being used efficiently? Why not?
• Aspects to be monitored
1. System resource usage of the nodes
• CPU load, memory used, disk space, swapping, network traffic…
2. Internal physical conditions of the nodes
• CPU/GPU/disk temperatures, power supplies and fan status, power
consumption…
• These parameters can be usually monitored through the web interface
provided by the IPMI-like interfaces of the nodes (see next three slides)
3. External physical conditions of the data center
• Data center temperature, status of the UPS and HVAC units…
• We will focus on tools to monitor aspect 1 129
Cluster monitoring and optimization
• Monitoring internal physical conditions of the nodes
• Example of the web interface provided by HPE iLO

130
Cluster monitoring and optimization
• Monitoring internal physical conditions of the nodes
• HPE iLO: fan monitoring (status and speed)

131
Cluster monitoring and optimization
• Monitoring internal physical conditions of the nodes
• HPE iLO: temperature monitoring (CPU, GPU, memory modules…)

132
Cluster monitoring and optimization
• Monitoring of system resource usage
• Why is a computing node running slow?
• It is (almost) never the fault of the CPU!
• Mandatory to analyze the four main resources
• CPU, memory, disk and network (in this order)
• Plenty of tools exist to do so:
• CLI-based tools for single-node monitoring
• top-like tools (atop, htop, vtop, iotop, ptop)
• stat-like tools (vmstat, iostat, mpstat, dmstat, netstat, dstat)
• Nmon, collectl, sar, etc
• GUI-based tools for cluster-wide monitoring
• Cacti: https://www.cacti.net
• Nagios: https://www.nagios.org
• Ganglia: http://ganglia.sourceforge.net
• Icinga: https://icinga.com
• Pandora FMS: https://pandorafms.com
133
Cluster monitoring and optimization
• CPU utilization vs CPU load
• CPU utilization is the percentage of the total available CPU cycles that
are consumed by each process
• A process consumes CPU cycles when it is executing program
instructions or when waiting for data from memory (stalled)
• CPU load is the average number of processes using the CPU (running
state) or waiting for CPU time (runnable state) over a period of time
• Caveat: Linux also includes processes in an uninterruptible sleep (D)
state (usually waiting for I/O)
• Single-core CPU: a load less than 1 means that, on average, every
process that needed the CPU could use it immediately without being
blocked
• Multi-core CPU: load must be divided by the number of cores
• Contention occurs when the load is greater than the number of
cores
• So, high CPU load values do not necessarily mean high CPU usage!
• There may be a disk bottleneck
134
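The load-versus-cores comparison described above can be sketched with the standard getloadavg() and sysconf() calls; a per-core load of 1 is used here as the rule-of-thumb threshold discussed above, not a hard limit:

#define _DEFAULT_SOURCE   /* for getloadavg() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Compare the 1/5/15-minute load averages against the number of online
 * cores. Remember that on Linux tasks in uninterruptible sleep (D state,
 * usually waiting for I/O) are counted in the load as well. */
int main(void)
{
    double load[3];
    long cores = sysconf(_SC_NPROCESSORS_ONLN);

    if (getloadavg(load, 3) != 3 || cores < 1) {
        fprintf(stderr, "could not read load averages or core count\n");
        return 1;
    }
    printf("cores: %ld, load: %.2f %.2f %.2f (per core: %.2f %.2f %.2f)\n",
           cores, load[0], load[1], load[2],
           load[0] / cores, load[1] / cores, load[2] / cores);
    if (load[0] / cores > 1.0)
        printf("load above core count: possible CPU contention or I/O wait\n");
    return 0;
}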
Cluster monitoring and optimization
• Example with top on a 32-core CPU
• %CPU: 4 out of 32 cores being used at 100% (4 / 32 *100 = 12.5%)
• CPU load (1, 5, 15 minutes): around 4

135
Cluster monitoring and optimization
• Example with top
• CPU utilization: only one process is using the CPU!
• The top command itself (only 0.3 %CPU)
• CPU load (1, 5, 15 minutes): around 3!
• Note 97.6%wa (iowait => CPU is waiting for I/O to complete)

136
Cluster monitoring and optimization
• Example with vmstat on a 32-core CPU
• Up to 12 runnable processes (r)
• No processes blocked (b) => No contention (load average is 11)
• swpd > 0 would mean that the system is swapping!
• si/so is swapping activity (swap in/swap out)
• bi/bo is disk I/O activity

137
Cluster monitoring and optimization
• Examples with iostat
• Useful to detect disk bottlenecks

138
Cluster monitoring and optimization
• Previous CLI-based tools are intended for single-node monitoring
• Very useful to perform in-depth analyses and detect performance issues
• Cluster-wide monitoring needs advanced tools to get the whole picture
• These tools usually provide user-friendly GUIs
• Ganglia and Nagios are two popular tools for this purpose
• Both tools have converged in some areas, but there are still differences
• Ganglia is more concerned with gathering metrics and tracking them over
time
• It does not have a built-in notification system
• Nagios is aimed at monitoring anything and everything: servers, services
on servers, switches and network bandwidth via SNMP…
• E.g. Nagios can be set up to monitor a DRM system to see how full
the queues are and see available nodes for running jobs
• Nagios has focused on health checking and alerting
• Nagios will send alerts based on a specified set of criteria
139
Cluster monitoring and optimization
• Ganglia is an open-source, scalable distributed monitoring system for
clusters
• Designed from scratch to scale to thousands of nodes
• Ganglia allows administrators to monitor a cluster from a single node
using a web-based interface
• It provides real-time metrics monitoring and keeps track of historical
data using a Round-Robin Database (RRD)
• Specifically, it uses RRDTool to store the data and generate graphs
• Client-server model
• Client: gmond (Ganglia Monitoring Daemon)
• It gathers and sends the metrics (e.g. memory usage)
• Very little overhead (no impact on performance)
• Installed on every node you want to monitor (e.g. computing nodes)
• Server: gmetad (Ganglia Meta Daemon)
• It is the backend for data collection, polling gmond sources
periodically and storing their metrics to local RRDs 140
Cluster monitoring and optimization

141
Cluster monitoring and optimization
• Nagios is an open-source tool that provides monitoring and alerting
services for systems, networks and infrastructure
• Monitoring of system resources (CPU, memory, disk…)
• Monitoring of network services (HTTP, SSH…)
• Monitoring of any hardware that has the ability to send collected data via
a network to specifically written Nagios plugins
• A simple plugin design that allows users to easily develop their own
service checks depending on needs
• The official Nagios plugins package contains over 50 plugins to monitor
all the basics
• More than 4000 plugins are provided by the community
• Currently two different editions:
• Nagios Core, which includes a CGI web interface
• Nagios XI: includes a built-in Web configuration GUI
• Icinga is an open-source fork of Nagios that maintains backwards
compatibility 142
Cluster monitoring and optimization

143
CLUSTER PERFORMANCE AND
BENCHMARKING
Cluster performance and benchmarking
• Scientific applications heavily rely on floating-point arithmetic, mostly
using double-precision numbers for higher accuracy
• Double precision: 64-bit floating-point real numbers
• Single precision: 32-bit floating-point real numbers
• FLOP/s is a popular unit of measure for the numerical computing
performance of a cluster, computer or processing unit (e.g. CPU, GPU)
• A FLOP is a Floating-Point Operation
• A + B is counted as a FLOP if both operands are floating-point real numbers
• FLOP/s or FLOPS: number of FLOPs executed per second
• FLOP/s represents the theoretical peak floating-point performance that
a single computer/CPU/accelerator can provide
• Peak FLOP/s for a single CPU can be calculated as follows:
FLOP/s = Clock frequency x #Cores x FLOPs/cycle
where,
FLOPs/cycle = floating-point instructions/cycle x FLOPs/instruction x
register width*
*Number of floating-point operands that can be stored in a register 145
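A small C sketch of this formula (my own illustration; the FLOPs/cycle values are the ones worked out for AVX and AVX2 CPUs on the following slides):

#include <stdio.h>

/* Theoretical peak: GFLOP/s = clock (GHz) x cores x FLOPs per cycle, where
 * FLOPs per cycle = FP instructions/cycle x FLOPs/instruction x SIMD width. */
static double peak_gflops(double ghz, int cores, int flops_per_cycle)
{
    return ghz * cores * flops_per_cycle;
}

int main(void)
{
    /* Xeon E5-2670 v2 (Ivy Bridge, AVX): 2 x 1 x 4 = 8 DP FLOPs/cycle */
    printf("E5-2670 v2: %.0f GFLOP/s\n", peak_gflops(2.5, 10, 2 * 1 * 4));
    /* Xeon E5-2650 v3 (Haswell, AVX2+FMA): 2 x 2 x 4 = 16 DP FLOPs/cycle */
    printf("E5-2650 v3: %.0f GFLOP/s\n", peak_gflops(2.3, 10, 2 * 2 * 4));
    return 0;
}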
Cluster performance and benchmarking
• Nowadays, CPU clock frequency is complex and governed by multiple
mechanisms that perform dynamic frequency scaling (e.g. Intel Turbo Boost)
• Modern CPUs can operate at two clock frequencies (at least):
• Base frequency: the minimum frequency guaranteed for a CPU core under sustained load
• Turbo frequency: the maximum frequency a single CPU core can reach
• Because different workloads exhibit different die thermal and electrical
characteristics, they can also run at different frequencies
• When executing intensive floating-point vector instructions (e.g. AVX2), the
clock speed may be reduced to keep the CPU within its power limits
• The clock speed at which a CPU core executes these intensive instructions can be
lower than the base and/or turbo frequencies advertised by the vendor!
• In fact, the CPU has specific base and turbo frequencies for these instructions
(usually 10-20% lower), but their values are only provided on the documentation*
• For the sake of simplicity, we will use the base frequency as advertised by the
vendor to calculate theoretical peak FLOP/s

*https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

146
Cluster performance and benchmarking
• FLOPs per cycle depends on the specific CPU microarchitecture
• Intel Xeon Sandy Bridge/Ivy Bridge CPUs supports AVX extensions
• 16 floating-point registers of 256 bits each that can operate on
• 8 single-precision floating-point numbers (256 / 32)
• 4 double-precision floating-point numbers (256 / 64)
• These CPUs can execute 2 AVX instructions (one addition + one multiplication) per
clock cycle (each AVX instruction, addition or multiplication, performs 1 FLOP)
• FLOPs/cycle = 8 (2*1*4) in double precision and 16 (2*1*8) in single precision
• Intel Xeon processors based on Haswell/Broadwell support AVX2 extensions
• New FMA3 instruction: three-operand Fused Multiply-Add operation
• A = A + B x C in a single machine instruction! (2 FLOPs per FMA instruction)
• Using the same 256-bit AVX registers, these CPUs can execute 2 FMA instructions per
clock cycle (each FMA instruction performs 2 FLOPs)
• FLOPs/cycle = 16 (2*2*4) in double precision and 32 (2*2*8) in single precision
• Examples:
• Intel Xeon E5-2670 v2 (Ivy Bridge) 10 cores at 2.5 Ghz (base frequency)
• GFLOP/s (double precision) = 2.5 x 10 x 8 = 200
• Intel Xeon E5-2650 v3 (Haswell): 10 cores at 2.3 Ghz (base frequency)
• GFLOP/s (double precision) = 2.3 x 10 x 16 = 368
147
Cluster performance and benchmarking
• FLOPs per cycle depends on the specific CPU microarchitecture
• Most Intel Xeon CPUs based on Skylake support AVX-512 extensions
• 32 registers of 512 bits each that can operate on
• 16 single-precision floating-point numbers (512 / 32)
• 8 double-precision floating-point numbers (512 / 64)
• Xeon Skylake Bronze/Silver
• These CPUs can only execute 1 FMA instruction per cycle
• FLOPs/cycle = 16 (1*2*8) in double precision and 32 (1*2*16) in single
precision
• Xeon Skylake Gold/Platinum
• These CPUs can execute 2 FMA instructions per cycle as Haswell/Broadwell
• FLOPs/cycle = 32 (2*2*8) in double precision and 64 (2*2*16) in single
precision
• Examples:
• Intel Xeon Silver 4114 (Skylake): 10 cores at 2.2 Ghz (base frequency)
• GFLOP/s (double precision) = 2.2 x 10 x 16 = 352
• Intel Xeon Gold 5115 (Skylake): 10 cores at 2.4 Ghz (base frequency)
• GFLOP/s (double precision) = 2.4 x 10 x 32 = 768
148
Cluster performance and benchmarking
• GFLOP/s for a typical thin/skinny node
• Based on a dual-socket Intel Xeon E5-2650 v4 (Broadwell)
• 2 CPUs with 12 cores each at 2.2 Ghz (base frequency)
• GFLOP/s (double precision) per CPU = 2.2 x 12 x 16 = 422.4
• GFLOP/s (double precision) per node = 2 x 422.4 = 844.8
• GFLOP/s for a typical fat/big node
• Based on a quad-socket Intel Xeon E5-4640 v4 (Broadwell)
• 4 CPUs with 12 cores each at 2.1 Ghz (base frequency)
• GFLOP/s (double precision) per CPU = 2.1 x 12 x 16 = 403.2
• GFLOP/s (double precision) per node = 4 x 403.2 = 1612.8
• GFLOP/s for a typical accelerator node
• Based on a dual-socket Intel Xeon E5-2650 v4 (Broadwell)
• 2 CPUs with 12 cores each at 2.2 Ghz (base frequency)
• Plus 2 NVIDIA Tesla P100 GPUs
• 4 TFLOP/s per GPU in double precision
• GFLOP/s (double precision) per node = 844.8 + 2 x 4000 = 8844.8
149
Cluster performance and benchmarking
• Theoretical peak FLOP/s for a whole cluster can be estimated by calculating
FLOP/s for each type of computing node and multiplying by their number
• But peak FLOP/s only represents an upper bound on performance!
• The actual performance is a function of many interrelated quantities
• The application
• The algorithm
• The size of the problem
• The high-level language
• The implementation
• The human level of effort used to optimize the program
• The compiler's ability to optimize
• The “age” (version) of the compiler
• The operating system and the kernel
• The architecture of the computer
• The hardware characteristics
• …
• So, actual performance is typically lower than peak performance 150
Cluster performance and benchmarking
• High-Performance Linpack (HPL) benchmark attempts to measure the
actual performance of distributed-memory systems
• HPL benchmark solves a dense system of linear equations in double-precision
arithmetic (64 bits)
• Linpack is a software library written in Fortran for performing numerical
linear algebra on computers
• HPL is a portable, MPI-based parallel implementation of the LINPACK
benchmark for distributed-memory systems
• HPL allows the user to scale the size of the problem and to optimize the software
in order to achieve the best performance for a given system
• Since the problem is very regular, the performance and scalability achieved is
generally quite high
• The FLOP/s numbers obtained by running HPL are a good point of
reference to compare against the theoretical peak performance
• HPL is used to benchmark and rank supercomputers for the TOP500 list
• http://www.netlib.org/benchmark/hpl

151
Cluster performance and benchmarking
• The TOP500 list ranks the 500 most powerful commercially available
supercomputers in the world as measured by the HPL benchmark
• https://www.top500.org
• The collection was started in 1993 and has been updated twice a year since then
• The best HPL FLOP/s performance achieved (Rmax) is used as a performance
measure in ranking the computers
• The theoretical peak performance (Rpeak) is also provided for each system
• The Green500 list ranks the top 500 computer systems in the world by
their energy efficiency
• https://www.top500.org/green500
• For decades, the notion of "performance" has been synonymous with "speed" as
measured in FLOP/s
• This focus of performance at any cost has led to the emergence of systems
that consume vast amounts of electrical power and produce so much heat that
large cooling facilities must be constructed to ensure their proper operation
• The inaugural list was announced on November 2007 in order to put a premium
on energy-efficient performance for sustainable supercomputing
• The list is ranked by the GFLOPS/watts ratio
152
Cluster performance and benchmarking
• TOP500 list (June 2018)
153
Cluster performance and benchmarking
• TOP500 list (June 2018)
• #1 Summit
• An IBM-built supercomputer running at the Department of Energy’s (DOE)
Oak Ridge National Laboratory (ORNL) in Tennessee (US)
• It has 4356 computing nodes, each one equipped with two 22-core IBM
Power9 processors and six NVIDIA Tesla V100 GPUs
• More than 26000 GPUs!
• Rmax/Rpeak: 122.3/187.6 PFLOP/s (Efficiency 65.19%)
• Power: 8.8 MW (Power Efficiency: 13.88 GFLOP/s per watt)
• #2 Sunway TaihuLight
• Supercomputer developed by China’s National Research Center of Parallel
Computer Engineering & Technology (NRCPC) and installed at the National
Supercomputing Center in Wuxi (China)
• It has 40960 computing nodes, each one equipped with a 260-core Sunway
SW26010 manycore processor
• More than 10 million cores!
• Rmax/Rpeak: 93/125.4 PFLOP/s (Efficiency 74.16%)
• Power: 15.3 MW (Power Efficiency: 6.05 GFLOP/s per watt) 154
Cluster performance and benchmarking
• Green500 list (June 2018)
155
Cluster performance and benchmarking
• FinisTerrae II at CESGA
• 317 nodes, 7712 CPU cores, 44.8 TB RAM, 8 GPUs and 4 Xeon Phi
• CPU peak performance: X+Y TFLOP/s
• 316 nodes based on 2x Intel Xeon E5-2680v3 12-core CPUs at 2.5 Ghz
• TFLOP/s (double precision) per node = Z
• 316 x Z = X TFLOP/s
• 1 node based on 8x Intel Xeon E7-8867v3 16-core CPUs at 2.5 GHz
• TFLOP/s (double precision) per node = Y
• Accelerator peak performance: A+B TFLOP/s
• 4 nodes equipped with 2x NVIDIA Tesla K80: A TFLOP/s
• TFLOP/s (double precision) per K80 GPU = 1.87
• 2 nodes equipped with 2x Intel Xeon Phi 7120P: B TFLOP/s
• TFLOP/s (double precision) per Xeon Phi = 1.21
• Rpeak: CPU peak (X+Y) + Accelerator peak (A+B) = ??? TFLOP/s
• Rmax: 240 TFLOP/s (Efficiency ???%)
• Power: 118 kW (Power Efficiency: ??? GFLOP/s per watt)
161
Cluster performance and benchmarking
• FinisTerrae II at CESGA
• 317 nodes, 7712 CPU cores, 44.8 TB RAM, 8 GPUs and 4 Xeon Phi
• CPU peak performance: 308.48 TFLOP/s
• 316 nodes based on 2x Intel Xeon E5-2680v3 12-core CPUs at 2.5 Ghz
• TFLOP/s (double precision) per node = 2.5 x 24 x 16 = 0.96
• 316 x 0.96 = 303.36 TFLOP/s
• 1 node based on 8x Intel Xeon E7-8867v3 16-core CPUs at 2.5 GHz
• TFLOP/s (double precision) per node = 2.5 x 128 x 16 = 5.12
• Accelerator peak performance: 19.8 TFLOP/s
• 4 nodes equipped with 2x NVIDIA Tesla K80: 14.96 TFLOP/s
• TFLOP/s (double precision) per K80 GPU = 1.87
• 2 nodes equipped with 2x Intel Xeon Phi 7120P: 4.84 TFLOP/s
• TFLOP/s (double precision) per Xeon Phi = 1.21
• Rpeak: 308.48 + 19.8 = 328.28 TFLOP/s
• Rmax: 240 TFLOP/s (Efficiency 73.17%)
• Power: 118 kW (Power Efficiency: 2.78 GFLOP/s per watt)
162
Cluster performance and benchmarking
• TOP500 list (June 2018)
• FinisTerrae II at CESGA (Rmax: 240 TFLOP/s, Power: 118 kW)
163
Cluster performance and benchmarking
• NAS Parallel Benchmarks (NPB) are a set of programs designed to
evaluate the performance of parallel computers
• https://www.nas.nasa.gov/publications/npb.html
• This suite mimics the computation and data movement from
computational fluid dynamics applications
• Conjugate Gradient (CG), Fast Fourier Transform (FT), Block Tri-diagonal (BT)..
• Most of them are intensive on floating-point operations
• Developed and maintained by the NASA Advanced Supercomputing
(NAS) Division
• Last NPB version includes MPI, OpenMP and serial implementations
• The Multi-zone variant (NPB-MZ) is designed to exploit multiple
levels of parallelism and to test the effectiveness of hybrid
parallelization paradigms
• NPB-MZ includes hybrid MPI+OpenMP implementations of some NPB programs
• There exists an unofficial implementation with OpenCL support
• http://aces.snu.ac.kr/software/snu-npb
164
Cluster performance and benchmarking
• NPB Block Tri-diagonal (BT) solver on a 512-core cluster
• MPI+OpenMP (MZ variant) using two different MPI libraries
• Million Operations Per Second (MOPS) and speedup

165
Cluster performance and benchmarking
• iperf
• Widely used open-source tool to measure TCP/UDP network performance,
reporting maximum bandwidth, delay jitter and datagram loss
• https://iperf.fr
• Intel MPI Benchmarks (IMB)
• IMB suite performs MPI performance measurements for point-to-point (e.g.
MPI_Send/MPI_Recv) and global communication operations (e.g. MPI_Bcast)
• IMB allows measuring the performance of a cluster system, including network
latency, bandwidth and the overall efficiency of the MPI implementation (a
minimal ping-pong sketch in the same spirit is shown after this list)
• https://software.intel.com/en-us/articles/intel-mpi-benchmarks
• IOZone
• Iozone is an I/O and file system benchmarking tool that generates and measures a
variety of file operations (e.g. read, write, re-read, fread…)
• http://www.iozone.org
• Flexible I/O Tester (FIO)
• FIO is a versatile I/O workload generator flexible enough to replicate real-world
environments (e.g. sequential/random read/write tests using various block sizes)
• https://github.com/axboe/fio 166
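As referenced above, a minimal MPI ping-pong in the spirit of the IMB point-to-point tests (my own sketch, not IMB code; the 1 MiB message size and 100 repetitions are arbitrary choices, and it must be run with at least 2 processes):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE (1 << 20)   /* 1 MiB payload (arbitrary) */
#define REPS     100

/* Rank 0 sends a message to rank 1 and waits for the echo, reporting the
 * average round-trip time and an effective bidirectional bandwidth. */
int main(int argc, char **argv)
{
    int rank;
    static char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("avg round trip: %.1f us, bandwidth: %.2f MB/s\n",
               elapsed / REPS * 1e6,
               2.0 * MSG_SIZE * REPS / elapsed / 1e6);
    MPI_Finalize();
    return 0;
}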
Cluster performance and benchmarking
• IOR benchmark
• IOR is a commonly used file system benchmarking tool particularly well-suited
for evaluating the I/O performance of parallel file systems using various interfaces
(e.g. MPI-IO, POSIX) and access patterns
• This benchmark performs writes and reads to/from files under several sets of
conditions and reports the resulting throughput rates
• https://github.com/hpc/ior
• STREAM
• STREAM is a simple synthetic benchmark designed to measure sustainable
memory bandwidth and the corresponding computation rate for simple vector
kernels (e.g. COPY: a(i) = b(i))
• http://www.cs.virginia.edu/stream/ref.html
• Scalable HeterOgeneous Computing (SHOC)
• SHOC is a collection of benchmark programs to evaluate the performance of
multicore-systems as well as computing devices with non-traditional architectures
for general-purpose computing (e.g. GPUs) and the software used to program
them (e.g. CUDA, OpenCL)
• https://github.com/vetter/shoc
167
References
• Sloan, J.D. (2004). High Performance Linux Clusters with OSCAR, Rocks,
openMosix & MPI: A Comprehensive Getting-Started Guide. O'Reilly Media, Inc
• Sterling, T., Anderson, M., & Brodowicz, M. (2017). High Performance Computing:
Modern Systems and Practices. Morgan Kaufmann Publishers
• Eadline, D. (2011). High Performance Computing for Dummies, 2nd Edition. Wiley
Publishing, Inc
• Geimer, M., Hoste, K., & McLay, R. (2014). Modern scientific software management
using EasyBuild and Lmod. Proceedings of the 1st International Workshop on HPC
User Support Tools (HUST'14). New Orleans, USA
• Jin, H., Buyya, R, & Cortes, T. (2002). High Performance Mass Storage and Parallel
I/O: Technologies and Applications, 1st Edition. John Wiley & Sons, Inc
• Frisch, A. (2002). Essential System Administration: Tools and techniques for Linux
and Unix administration, 3rd Edition. O'Reilly Media, Inc.
• Intel Corporation (2014). Architecting a High Performance Storage System. White
Paper, High Performance Data Division
• Shanley, T. (2002). InfiniBand Network Architecture, 1st Edition. Addison-Wesley
Professional
• Mellanox Technologies Inc (2003). Introduction to InfiniBand. White Paper
• Grun, P. (2010). Introduction to InfiniBand for End Users. White Paper, InfiniBand
Trade Association 168