ibm.com/redbooks
International Technical Support Organization
December 2005
SG24-7145-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page xvii.
This edition applies to the IBM TotalStorage DS6000 and its capabilities as of August 2005.
Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
The team that wrote this redbook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
5.4.2 Fibre Channel topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5 SAN implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.5.1 Description and characteristics of a SAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.5.2 Benefits of a SAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.5.3 SAN cabling for availability and performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.5.4 Importance of establishing zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.5.5 LUN masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.5.6 Configuring logical disks in a SAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.6 Subsystem Device Driver (SDD) - multipathing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.6.1 SDD load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.6.2 Concurrent LMC load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6.3 Single path mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6.4 Single FC adapter with multiple paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.6.5 Path failover and online recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.6.6 Using SDDPCM on an AIX host system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.6.7 SDD datapath command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.8 Disk bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.9 Other performance resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
11.5 Additional information about iSeries performance . . . . . . . . . . . . . . . . . . . . . . . . . . 405
11.5.1 Publications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
11.5.2 Web sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Caution using benchmark results to design production . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
10-2 Concurrent I/O with PAV and Multiple Allegiance . . . . . . . . . . . . . . . . . . . . . . . . . . 361
10-3 Concurrent read operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10-4 Concurrent write operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
10-5 Number of volumes on a (6+P) RAID 5 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
10-6 DB2 large volume performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
10-7 DSS dump large volume performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
10-8 Channel utilization limits for hypothetical workloads . . . . . . . . . . . . . . . . . . . . . . . . 368
10-9 FICON port and channel throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
10-10 Daisy chaining DS6000s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
10-11 Sample set of RMF Magic workload summary charts . . . . . . . . . . . . . . . . . . . . . . . 381
10-12 I/O and data rate summary for a single subsystem . . . . . . . . . . . . . . . . . . . . . . . . . 382
10-13 Cache summary for a single subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
10-14 Breakdown of measurement data by SSID within a subsystem . . . . . . . . . . . . . . . 384
10-15 Summary of subsystem response time components . . . . . . . . . . . . . . . . . . . . . . . . 385
11-1 Performance Tools Disk Utilization Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
13-1 DB2 UDB logical structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
13-2 Allocating DB2 containers using a “spread your data” approach. . . . . . . . . . . . . . . 428
13-3 IMS large volume performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
14-1 FlashCopy establish. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
14-2 FlashCopy interfaces and functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
14-3 Synchronous logical volume replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
14-4 Logical paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
14-5 Logical paths for Metro Mirror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
14-6 Symmetrical Metro Mirror configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
14-7 Asynchronous logical volume replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
14-8 Global Copy and Metro Mirror state change logic . . . . . . . . . . . . . . . . . . . . . . . . . . 453
14-9 Logical paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
14-10 Global Copy environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
14-11 Global Mirror overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
14-12 How Global Mirror works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
14-13 Global Copy with write hit at the remote site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
14-14 Application write I/O within two Consistency Group points . . . . . . . . . . . . . . . . . . . 462
14-15 Coordination time - how does it impact application write I/Os? . . . . . . . . . . . . . . . . 463
14-16 Remote storage server configuration, all Ranks contain equal numbers of volumes 465
14-17 Remote storage server with D volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
14-18 A three-site z/OS Metro/Global Mirror implementation . . . . . . . . . . . . . . . . . . . . . . 468
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are
inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and
distribute these sample programs in any form without payment to IBM for the purposes of developing, using,
marketing, or distributing application programs conforming to IBM's application programming interfaces.
Java, JDK, Solaris, Sun, Sun Microsystems, Ultra, and all Java-based trademarks are trademarks of Sun Microsystems,
Inc. in the United States, other countries, or both.
BackOffice, Excel, Microsoft, Windows server, Windows NT, Windows, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
i386, Intel, Itanium, Pentium, Xeon, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
This IBM® Redbook provides guidance about how to configure, monitor, and manage your
IBM TotalStorage® DS6000 to achieve optimum performance. We describe the DS6000
performance features and characteristics and how they can be exploited with the different
server platforms that can attach to it. Then in consecutive chapters we detail the specific
performance recommendations and discussions that apply for each server environment, as
well as for database and Copy Services environments.
We also outline the various tools available for monitoring and measuring I/O performance for
the different server environments, as well as how to monitor performance of the entire
DS6000 subsystem.
Cathy Warrick is a Project Leader and Certified IT Specialist in the IBM International
Technical Support Organization. She has over 27 years of experience in IBM with large
systems, open systems, and storage, including education on products internally and for the
field. Prior to joining the ITSO three years ago, she developed the Technical Leadership
education program for IBM and IBM Business Partner’s technical field force and was the
Program Manager for the Storage Top Gun classes.
Benoit Granier has been part of the IBM European Advanced Technical Support Center in
Montpellier, France, for three years. As an IT Specialist, he started working at the pSeries®
and TotalStorage Benchmark Center. He is now responsible for the Early Shipment Programs
for storage disk systems in EMEA. Benoit's areas of expertise include: mid-range/high-end
storage solutions (IBM DS4000/ESS/DS8000), virtualization (IBM SAN Volume Controller),
and high-end IBM eServer® pSeries servers. Benoit has a degree in Telecommunication
from ESIGETEL.
Keitaro Imai has been working in AP Advanced Technical Support in Japan, where he has
been engaged in open systems storage as an IT Specialist for three years. He mainly supported DS4000
(formerly FAStT) for two years. He was assigned to the DS6000 Support Team at the end of
last year, where he provides technical consultation, troubleshooting, and support, including
critical situations and skills transfer.
Brannen Proctor is a Senior IT Specialist with the IBM Storage Techline organization in
Atlanta, Georgia. He has been in Techline since 1998, providing pre-sales technical support
on IBM disk, tape, and SAN storage products. He is currently the Team Leader of the
Business Partner support team, and also coordinates training activities for the Storage
Techline team. Prior to joining the Techline organization, he was a Transition Leader in IBM
Global Services for five years. Prior to joining IBM, he held positions in the computer industry
as a programmer, systems programmer, performance analyst, and internal consultant.
Jim Sedgwick is a member of the Americas Storage Advanced Technical Support team in
Raleigh, North Carolina. His main responsibility is open systems storage performance. His 23
year career has included work in IBM Global Services, IBM Sales and Distribution, and IBM
Printer Advanced Development.
Paulus Usong started his IBM career in Indonesia decades ago. He moved to New York and
worked at a bank for a few years, before rejoining IBM at the Santa Teresa Lab (now called
the Silicon Valley Lab). In 1995 he joined the Advanced Technical Support group in San Jose.
Currently he is a Consulting IT Specialist and his main responsibility is handling DASD
performance CritSits and performing XRC sizing for customers who want to implement a
disaster recovery system using this option.
Mary Ann Vandermark is a Product Field Engineer (PFE) in the Washington D.C. area and
has worked for IBM for seven years. Her career began with a focus on quality assurance
processes and testing of hardware and software storage products. Mary Ann developed and
published a field Escape Analysis process used to improve product quality and test
effectiveness. Her current responsibilities include on-site support for ESS and DS6000/8000
for secure U.S. Government accounts in addition to remote PFE support for North American
accounts in the private sector.
John Wickes has more than 30 years of experience in the IT industry, with many years as a
mainframe MVS™ Operating Systems Specialist, several years as the IBM ANZ MVS
Instructor, and more recently as a Storage and Storage Area Network (SAN) design and
Implementation Specialist. John has been involved in several Copy Services projects,
including both FlashCopy® and Peer-to-Peer Remote Copy (PPRC) implementations.
Front row - Paulus, Keitaro, Cathy, Jim, John, MaryAnn; Back row - Benoit, Craig, Brannen, Rosemary,
John Amann
We want to thank John Amann for hosting this residency at the Washington Systems Center
in Gaithersburg, MD.
In addition, other members of the performance advisory group helped us with
presentations and reviewed our material:
Ime Archibong, Siebo Friesenborg, Joe Hyde, Carl Jones, Josh Martin, Henry May, Bruce
McNutt, Vernon Miller, Dharmendra Modha, Rick Ripberger, Mike Roll, and Sonny
Williams.
Many thanks to those people in IBM in Montpellier, France who helped us with access to
equipment as well as technical information and review:
Olivier Alluis (Manager of the ATS TotalStorage Benchmark Center), Philippe Jachymczyk,
Dominique Salomon, Christophe Majek and Jean-Armand Broyelle.
Mary Lovelace
International Technical Support Organization, San Jose Center
Martin Kammerer
IBM Germany
Steve Pratt
IBM Austin
Edward Holcombe
IBM Beaverton
Mike Downie
IBM Boulder
Dan Braden
IBM Dallas
Donald C. Laing
IBM Midland
Cathy Cronin
IBM Poughkeepsie
Jeffrey Berger
IBM San Jose
Mike Gonzales
IBM Santa Teresa
Kwai Wong
IBM Toronto
Andy Ruhl
IBM Tucson
Many thanks to:
Gilbert Houtekamer from Intellimagic
Pablo Clifton from CompuPro
Your efforts will help increase product acceptance and customer satisfaction. As a bonus,
you'll develop a network of contacts in IBM development labs, and increase your productivity
and marketability.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our Redbooks™ to be as helpful as possible. Send us your comments about this or
other Redbooks in one of the following ways:
Use the online Contact us review redbook form found at:
ibm.com/redbooks
Send your comments in an email to:
redbook@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. QXXE Building 80-E2
650 Harry Road
San Jose, California 95120-6099
The DS6000 is designed specifically for medium and large enterprise customers seeking new
ways to simplify their systems and storage infrastructures, improve the use of information
throughout its life cycle, and support business continuity. The continuing exponential growth
of data means that storage subsystems must be cost effective and flexible enough to
support a variety of working environments; that same flexibility helps enable your business
to accommodate this continuing growth of data.
The DS6000 series advanced functionality is shared with the DS8000 series. IBM provides an
enterprise storage continuum of disk products with compatible Copy Services, common
advanced functions, and common management interfaces.
With the additional advantages of IBM FlashCopy, data availability can be enhanced even
further; for instance, production workloads can continue execution concurrent with data
backups. Metro Mirror and Global Mirror business continuity solutions are designed to provide
synchronous and asynchronous mirroring of data to a remote site for disaster recovery.
The processors
The DS6800 utilizes two 64-bit PowerPC 750GX 1 GHz processors for the storage server and
the host adapters, respectively, and another PowerPC 750FX 500 MHz processor for the
device adapter on each controller card. The DS6800 is equipped with 2 GB memory in each
controller card, adding up to 4 GB. Some part of the memory is used for the operating system
and another part in each controller card acts as nonvolatile storage (NVS), but most of the
memory is used as cache. Because the cache resides in processor memory, cache accesses
are very fast.
When data is written to the DS6800, it is placed in cache and a copy of the write data is also
copied to the NVS of the other controller card, so there are always two copies of write data
until the updates have been destaged to the disks. On zSeries, this mirroring of write data can
be disabled by application programs, for example, when writing temporary data (Cache Fast
Write). The NVS is battery backed up and the battery can keep the data for at least 72 hours
if power is lost.
The DS6000 series controller’s Licensed Internal Code (LIC) is based on the DS8000 series
software. Since 97% of the functional code of the DS6000 is identical to the DS8000 series,
the DS6000 has a very good base to be a stable system.
Host adapters
The DS6800 has eight 2 Gbps Fibre Channel ports that can be equipped with from two up to
eight shortwave or longwave Small Form-factor Pluggable (SFP) transceivers; you order
SFPs in pairs. The 2 Gbps Fibre Channel host ports (when equipped with SFPs) can also
auto-negotiate down to 1 Gbps for attachment to older SAN components.
There are four paths from the DS6800 controllers to each disk drive to provide greater data
availability in the event of multiple failures along the data path. The DS6000 series systems
provide preferred path I/O steering and can automatically switch the data path used to
improve overall performance.
Dense packaging
Calibrated Vectored Cooling technology used in IBM Eserver xSeries® and BladeCenter®
to achieve dense space saving packaging is also used in the DS6800. The DS6800 weighs
only 49.6 kg (109 lbs.) with 16 drives. It connects to normal power outlets with its two power
supplies in each DS6800 or DS6000 expansion enclosure. All this provides savings in space,
cooling, and power consumption.
Aside from the drives, the DS6000 expansion enclosure contains two Fibre Channel switches
to connect to the drives and two power supplies with integrated fans.
The minimum storage capability with eight 73 GB DDMs is 584 GB. The maximum storage
capability with 16 300 GB DDMs for the DS6800 controller enclosure is 4.8 TB. If you want to
connect more than 16 disks, you can use the optional DS6000 expansion enclosures that
allow a maximum of 128 DDMs per storage system and provide a maximum storage
capability of 38.4 TB.
RAID 5
RAID 5 is a method of spreading volume data plus data parity across multiple disk drives.
RAID 5 increases performance by supporting concurrent accesses to the multiple DDMs
within each logical volume.
RAID 10
RAID 10 implementation provides data mirroring from one DDM to another DDM. RAID 10
stripes data across half of the disk drives in the RAID 10 configuration. The other half of the
array mirrors the first set of disk drives. RAID 10 offers faster random writes than RAID 5
because no parity must be calculated and written.
1.3.2 Resiliency
The DS6000 series has built-in resiliency features that are not generally found in small
storage devices. The DS6000 series is designed and implemented with component
redundancy to help reduce and avoid many potential single points of failure.
Within a DS6000 series controller unit, there are redundant RAID controller cards, power
supplies, fans, Fibre Channel switches, and Battery Backup Units (BBUs).
There are four paths to each disk drive. Using Predictive Failure Analysis®, the DS6000 can
identify a failing drive and replace it with a spare drive without customer interaction.
Copy Services has four interfaces: a Web-based interface (DS Storage Manager), a
command-line interface (DS CLI), an application programming interface (DS Open API), and
host I/O commands from zSeries servers.
Incremental FlashCopy
Incremental FlashCopy provides the capability to refresh a LUN or volume involved in a
FlashCopy relationship. When a subsequent FlashCopy establish is initiated, only the data
required to bring the target current to the source's newly established point-in-time is copied.
The direction of the refresh can also be reversed, in which case the LUN or volume previously
defined as the target becomes the source for the LUN or volume previously defined as the
source (and is now the target).
Global Copy was previously called PPRC-XD on the ESS. It is an asynchronous copy of
LUNs or zSeries CKD volumes. An I/O is signaled complete to the server as soon as the data
is in cache and mirrored to the other controller cache. The data is then sent to the remote
storage system. Global Copy allows for copying data to far away remote sites. However, if you
have more than one volume, there is no mechanism that guarantees that the data of different
volumes at the remote site is consistent in time.
Global Mirror is a long distance remote copy solution across two sites using asynchronous
technology. It is designed to provide the following:
- Support for virtually unlimited distances between the local and remote sites, with the distance typically limited only by the capabilities of the network and channel extension technology being used. This can better enable you to choose your remote site location based on business needs, and enables site separation to add protection from localized disasters.
- A consistent and restartable copy of the data at the remote site, created with little impact to applications at the local site.
- Data currency, where for many environments the remote site lags behind the local site by an average of three to five seconds, which helps to minimize the amount of data exposure in the event of an unplanned outage. The actual lag in data currency experienced will depend on a number of factors, including the workload characteristics and the network bandwidth between the local and remote sites.
The online configuration and Copy Services are available via a Web browser interface
installed on the DS management console.
IBM has consistently demonstrated its leadership in the open standards movement, and the
IBM TotalStorage DS Open API for the DS6000, which is compatible with the SMI-S
standard, offers a compelling proof point of IBM's commitment to the benefits that open
standards can offer.
The DS Open API supports routine LUN management activities, such as LUN creation,
mapping and masking, and the management of point-in-time copy and remote mirroring. It
supports these activities through the use of a standard interface as defined by the Storage
Networking Industry Association (SNIA) Storage Management Initiative Specification
(SMI-S).
The DS Open API is implemented through the IBM TotalStorage Common Information Model
Agent (CIM Agent) for the DS Open API, a middleware application designed to provide a
CIM-compliant interface. The interface allows Tivoli and third-party CIM-compliant software
management tools to discover, monitor, and control DS6000 series systems. The DS Open
API and CIM Agent are provided with the DS6000 series at no additional charge. The CIM
Agent is available for the AIX®, Linux®, and Microsoft® Windows operating system
environments.
The DS CLI can dynamically invoke Copy Services functions. This can help
enhance your productivity, since it eliminates the previous requirement to create and
save a task using the GUI. The DS CLI can also issue copy services commands to an ESS
Model 750, ESS Model 800, or DS8000 series systems.
The DS CLI client is available for the AIX, HP-UX, Linux, Novell NetWare, Sun™ Solaris™,
and Microsoft Windows operating system environments.
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Particularly for zSeries and iSeries customers, the DS6000 series will be an exciting product,
since for the first time it gives them the choice to buy a midrange priced storage system for
their environment with performance that is similar to or exceeds that of an IBM ESS.
Load balancing can reduce or eliminate I/O bottlenecks that occur when many I/O operations
are directed to common devices via the same I/O path. SDD also helps eliminate a potential
single point of failure by automatically rerouting I/O operations when a path failure occurs,
thereby supporting enhanced data availability.
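As a rough illustration of the load-balancing idea (this is not SDD's actual implementation; the path names, data structures, and selection policy are simplified assumptions), the following Python sketch routes each I/O down the healthy path with the fewest outstanding operations, and reroutes around a failed path:

```python
# Illustrative multipathing sketch: send each I/O down the eligible path
# with the fewest outstanding I/Os, skipping paths marked as failed.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    failed: bool = False
    outstanding: int = 0   # I/Os currently in flight on this path

def select_path(paths):
    """Pick the least-busy healthy path; raise if no path is available."""
    healthy = [p for p in paths if not p.failed]
    if not healthy:
        raise RuntimeError("no available path to device")
    return min(healthy, key=lambda p: p.outstanding)

paths = [Path("fscsi0"), Path("fscsi1")]   # hypothetical adapter names
paths[0].outstanding = 3                   # fscsi0 is busier right now
print(select_path(paths).name)             # -> fscsi1 (load balancing)
paths[1].failed = True                     # path failure occurs
print(select_path(paths).name)             # -> fscsi0 (failover rerouting)
```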
SDD is a standard feature and is provided with the DS6000 at no additional charge. Fibre
Channel attachment configurations are supported in the AIX, HP-UX, Windows 2000, and
Solaris environments.
If you want to keep your ESS, and it is a Model 800 or 750 with Fibre Channel adapters, you
can use it, for example, as a secondary for remote copy. With the ESS at the
appropriate LIC level, scripts or CLI commands written for Copy Services will work for both
the ESS and the DS6000.
Obviously the DS8000 series can deliver a higher throughput and scales higher than the
DS6000 series, but not all customers need this high throughput and capacity. You can choose
the system that fits your needs since both systems support the same SAN infrastructure and
the same host systems.
It is very easy to have a mixed environment, with DS8000 series systems where you need
them and DS6000 series systems where you need a very cost efficient solution.
Logical partitioning with some DS8000 models is not available on the DS6000.
The DS6000 is the entry product of the DS6000/DS8000 family, which forms the high end of
the IBM TotalStorage disk portfolio. This family outperforms the entire DS4000 family, even
though the DS4800 (the high-end product of the DS4000 family) overlaps the DS6800 in
terms of performance. In terms of server support, the DS4000 is dedicated to open servers
(Intel®, UNIX, Linux), while the DS6000 also supports iSeries and zSeries servers.
Within the DS4000 family you have the option to choose from different DS4000 models,
among them very low cost entry models. DS4000 series storage systems can also be
equipped with cost efficient, high capacity Serial ATA drives.
The DS4000 series products allow you to grow with a granularity of a single disk drive, while
with the DS6000 series you have to order at least four drives. Currently the DS4000 series
also is more flexible with respect to changing RAID arrays on the fly and changing LUN sizes.
Extension of IBM's dynamic provisioning technology within the DS6000 series is planned to
provide LUN/volume dynamic expansion, online data relocation, virtual capacity over
provisioning, and space-efficient FlashCopy requiring minimal reserved target capacity.
While the DS4000 series also offers remote copy solutions, these functions are not
compatible with the DS6000 series.
Currently, the DS4000 series product family (FAStT) is a popular choice for many customers
who buy the SAN Volume Controller. With the DS6000 series, they now have an attractive
alternative. Since the SAN Volume Controller already has a rich set of advanced copy
functions, clients were looking for a cost efficient but reliable storage system. The DS6000
series fits perfectly into this environment since it offers good performance for the price while
still delivering all the reliability functions needed to protect your data.
To be able to share data in a heterogeneous environment the storage system must support
the sharing of LUNs. The DS6000 series can do this and therefore is an ideal candidate for
your SAN File System data.
There are some choices when planning your DS6000 hardware configuration, and their
relevance is related to the workload characteristics and performance expectations. This
chapter discusses:
Performance related hardware components on the DS6000
How to improve response time and throughput
Recommendations about how to enhance I/O performance
Benchmarks can also look like an easy way to plan a disk configuration. But if you intend to
make a configuration decision based on benchmark performance results, you need to ensure
that the workload that is used for the benchmark resembles, as closely as possible, the
workload that you intend to run on your DS6000. You should also know both the physical and
the logical configuration of the DS6000s used during the benchmark, so your workload gets
the results you are expecting. Sometimes you will not be able to replicate the configuration
documented for the benchmark, and at other times you may not find a documented
benchmark that resembles your requirements, so benchmarks cannot always be used.
For these reasons, the recommended approach for correctly estimating the disk subsystem
configuration that is needed is to use and combine the following methods.
- Reference by industry: customers in the same industry often run comparable workloads, so their configurations can be used as a starting point.
- Reference to a lab measurement: lab measurements are typically taken with no read/write cache hits, so cache size does not matter for them.
- Using a ROT (rule of thumb): for example, a calculation such as capacity (TB) multiplied by access density is still effective (see the sketch after this list).
- Disk Magic: the I/O characteristics of an existing configuration can be used to estimate the new configuration.
- Benchmark: you can also use the benchmark center of each region.
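As an illustration of the capacity-times-access-density ROT named in the list above, the following Python sketch computes the I/O rate to plan for; the numbers are placeholders you would replace with measurements from a comparable workload, not DS6000 figures:

```python
# Hedged sketch of the "capacity x access density" rule of thumb.

def required_iops(capacity_tb: float, access_density: float) -> float:
    """Estimate the I/O rate a configuration must sustain.

    access_density is in I/Os per second per GB, taken from existing
    measurements of a comparable workload."""
    return capacity_tb * 1000 * access_density

# Example: 10 TB of data at an assumed 0.5 IOPS/GB access density
print(required_iops(10, 0.5))   # -> 5000.0 IOPS to plan for
```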
Also the following tools allow easy and accurate monitoring, analyzing, sizing, and modeling
for the required DS6000 configuration. Among these tools, we have:
Disk Magic
RMF™ Magic for zSeries
Capacity Magic
TPC (IBM TotalStorage Productivity Center)
DS6000 Performance Monitor on GUI
These tools will be discussed in detail in Chapter 4, “Planning and monitoring tools” on
page 85.
Another important characteristic of an I/O workload is whether the data is being remotely
copied or not.
If you are moving existing workloads to a new DS6000, then you have information that can be
used to model and estimate this new DS6000 configuration. You will also be able to model
any activity growth that you are planning in advance.
On the other hand, if the workload that you plan to run on the DS6000 is a new workload or
one that you do not have a good understanding of, then we recommend that you be
conservative when planning the disk storage subsystem hardware configuration.
If you will be running multiple heterogeneous servers, each server with different workload
characteristics, you will have the most complex case. You must ensure that your final
hardware configuration has enough capacity to cope with the maximum data rate, while
aggregating the whole set of applications.
When the workload demands of all the servers being consolidated are well understood, and
assuming they are predictable and consistent, it could be possible to manage the peaks and
thus get some resource savings.
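The following Python sketch illustrates that reasoning with invented hourly numbers: when peaks are predictable and do not coincide, the peak of the combined load is lower than the sum of the individual peaks, which is where the resource savings come from:

```python
# Hypothetical per-interval data rates (MB/s) for two servers being
# consolidated; server A peaks during the day, server B at night.
server_a = [200, 800, 1200, 400]
server_b = [900, 300, 200, 1000]

sum_of_peaks = max(server_a) + max(server_b)                  # worst-case sizing
peak_of_sum = max(a + b for a, b in zip(server_a, server_b))  # combined-load sizing

print(sum_of_peaks)   # 2200 MB/s if each peak is provisioned separately
print(peak_of_sum)    # 1400 MB/s -- the consolidated system can be smaller
```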
On the other hand, if you are combining workloads that are not well understood or whose
requirements fluctuate in an unpredictable manner, then a more conservative approach
must be taken when considering the peaks in your hardware capacity planning.
For further information, see Chapter 12, “Understanding your workload” on page 407.
[Figure 2-1: DS6000 enclosure layout - the controller enclosure (CTRL) with expansion enclosures EXP #1 through EXP #7]
From the performance perspective, the major components you need to consider when
planning the DS6000 hardware configuration are (refer to Figure 2-1):
DDM capacity and speed (RPM)
Number of arrays and RAID type
Number and type of host adapters
The hardware components presented in Figure 2-1 and their implications in the overall
DS6000 performance behavior are discussed in the following sections.
The DS6800 utilizes two 64-bit PowerPC 750GX 1 GHz processors for the storage server and
the host adapters, respectively, and another PowerPC 750FX 500 MHz processor for the
device adapter on each server card.
[Figure 2-2: DS6000 controller card hardware - Fibre Channel protocol engines, PowerPC 750GX processors with SDRAM, a PowerPC 750FX, data protection/data mover ASICs, flash, NVRAM, bridges, Ethernet, RAID logic, the Fibre Channel switch, SES, and the midplane]
If you can view Figure 2-2 in color, you can use the colors as indicators of how each DS6000
hardware component is configured: orange (gray in black and white) indicates the host
adapter (HA) components, green (dark gray in black and white) the device adapter (DA)
components, and yellow (white in black and white) the cache memory and server processor
components.
2.5.1 Cache
Cache is used to keep both the read and write data that the host server needs to process.
Having the cache as an intermediate repository, the host does not need to wait for the hard
disk drive to either obtain or store the data. Instead, the operations of reading from the hard
disk drive (stage) and writing to the hard disk drive (destage) are done by the DS6000
asynchronously from the host I/O processing. This allows host I/O operations to complete at
electronic cache speeds without waiting for the much slower hard disk drive operations.
Cache processing significantly improves the performance of the I/O operations done by the
host systems that attach to the DS6000. Cache size, together with the efficiency of the
internal cache management algorithms, determines how much benefit the cache provides.
In the DS6000 there is 4 GB of fixed cache. This cache is divided between the two servers of
the DS6000, giving the servers their own non-shared cache.
To protect the data that is written during I/O operations, the DS6000 stores two copies of the
data: one in the cache of one server and another in the persistent memory (NVS) of the other
server.
[Figure 2-3: Cache and NVS layout on Server 0 and Server 1 - each server has its own cache memory and holds the NVS for the other server]
The DS6000 uses the patent-pending Adaptive Replacement Cache (ARC) algorithm,
developed by IBM Storage Development in partnership with IBM Research. It is a self-tuning,
self-optimizing solution for a wide range of workloads with a varying mix of sequential and
random I/O streams. For a detailed description of ARC, see N. Megiddo and D. S. Modha,
“Outperforming LRU with an adaptive replacement cache algorithm,” IEEE Computer, vol. 37,
no. 4, pp. 58–65, 2004.
The decision to copy some amount of data into the DS6000 cache can be triggered from two
policies: demand paging and prefetching. Demand paging means that disk blocks are brought
in only on a cache miss. Demand paging is always active for all volumes and ensures that I/O
patterns with some locality find at least some recently used data in the cache.
Prefetching means that data is copied into the cache speculatively even before it is
requested. To prefetch, a prediction of likely future data accesses is needed. Because
effective, sophisticated prediction schemes need extensive history of page accesses (which is
not feasible in real-life systems), ARC uses prefetching for sequential workloads. Sequential
access patterns naturally arise in video-on-demand, database scans, copy, backup, and
recovery. The goal of sequential prefetching is to detect sequential access and effectively
pre-load the cache with data so as to minimize cache misses.
For prefetching, the cache management uses tracks. To detect a sequential access pattern,
counters are maintained with every track, to record if a track has been accessed together with
its predecessor. Sequential prefetching becomes active only when these counters suggest a
sequential access pattern. In this manner, the DS6000 monitors application read-I/O patterns
and dynamically determines whether it is optimal to stage into cache one of the following:
Just the page requested.
That page requested plus remaining data on the disk track.
An entire disk track (or a set of disk tracks) which has (have) not yet been requested.
The decision of when and what to prefetch is essentially made on a per-application basis
(rather than a system-wide basis) to be sensitive to the different data reference patterns of
different applications that can be running concurrently.
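The following Python sketch illustrates the per-track counter idea described above with an assumed threshold and simplified bookkeeping (the DS6000's actual detection logic is more sophisticated):

```python
# Illustrative sketch of sequential-access detection: a counter per track
# records whether the track was accessed together with its predecessor;
# prefetching is triggered once enough consecutive accesses are seen.

SEQ_THRESHOLD = 4    # assumed: consecutive-track hits before prefetching

counters = {}        # track number -> consecutive-access count

def on_track_access(track: int) -> bool:
    """Return True if the access pattern now looks sequential."""
    counters[track] = counters.get(track - 1, 0) + 1
    return counters[track] >= SEQ_THRESHOLD

for t in range(100, 106):        # a sequential run of track reads
    if on_track_access(t):
        print(f"prefetch ahead of track {t}")   # fires from track 103 on
```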
To decide which pages are evicted when the cache is full, sequential and random
(non-sequential) data is separated into different lists (see Figure 2-4 on page 24). A page
which has been brought into the cache by simple demand paging is added to the MRU (Most
Recently Used) head of the RANDOM list. Without further I/O access, it goes down to the
LRU (Least Recently Used) bottom. A page which has been brought into the cache by a
sequential access or by sequential prefetching is added to the MRU head of the SEQ list and
then moves down that list. Additional rules control the migration of pages between the lists
so as not to keep the same pages twice in memory.
Figure 2-4 Cache lists of the SARC algorithm for random and sequential data (the RANDOM and SEQ lists each run from an MRU head down to an LRU bottom, with a desired size marker on the SEQ list)
To follow workload changes, the algorithm trades cache space between the RANDOM and
SEQ lists dynamically and adaptively. This makes ARC scan-resistant, so that one-time
sequential requests do not pollute the whole cache. ARC maintains a desired size parameter
for the sequential list. The desired size is continually adapted in response to the workload.
Specifically, if the bottom portion of the SEQ list is found to be more valuable than the bottom
portion of the RANDOM list, then the desired size is increased; otherwise, the desired size is
decreased. The constant adaptation strives to make the optimal use of limited cache space
and delivers greater throughput and faster response times for a given cache size.
Additionally, the algorithm modifies dynamically not only the sizes of the two lists, but also the
rate at which the sizes are adapted. In a steady state, pages are evicted from the cache at the
rate of cache misses. A larger (respectively, a smaller) rate of misses effects a faster
(respectively, a slower) rate of adaptation.
Other implementation details take into account the relation of read and write (persistent
memory) cache, efficient destaging, and the cooperation with Copy Services. In this manner,
the DS6000 cache management goes far beyond the usual variants of the LRU/LFU (Least
Recently Used / Least Frequently Used) approaches.
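The following Python sketch is a much simplified illustration of the two-list scheme described above, not IBM's implementation: random and sequential pages sit on separate LRU lists, a hit near a list's LRU bottom nudges the adaptive desired size of the SEQ list toward that list, and eviction targets whichever list exceeds its target. The capacity, the "near bottom" test, and the adaptation step are all arbitrary choices for the sketch:

```python
from collections import OrderedDict

class TwoListCache:
    """Toy two-list cache with an adaptive desired size for the SEQ list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.random = OrderedDict()        # keys are pages; MRU at the end
        self.seq = OrderedDict()
        self.desired_seq = capacity // 2   # adaptive target size for SEQ

    def _evict(self):
        # Evict from SEQ if it exceeds its desired size, else from RANDOM.
        if len(self.seq) > self.desired_seq or not self.random:
            self.seq.popitem(last=False)       # drop SEQ's LRU bottom
        else:
            self.random.popitem(last=False)    # drop RANDOM's LRU bottom

    def access(self, page, sequential):
        lst, other = (self.seq, self.random) if sequential else (self.random, self.seq)
        if page in lst:
            # Check position before promoting: a hit near the LRU bottom
            # hints the list is too small, so adapt the SEQ target.
            if list(lst).index(page) < max(1, len(lst) // 4):
                step = 1 if sequential else -1
                self.desired_seq = min(self.capacity, max(0, self.desired_seq + step))
            lst.move_to_end(page)              # promote to MRU
        else:
            other.pop(page, None)              # never keep a page twice
            if len(self.random) + len(self.seq) >= self.capacity:
                self._evict()
            lst[page] = None                   # insert at MRU end

cache = TwoListCache(capacity=8)
for p in range(100):                           # a one-time sequential scan...
    cache.access(p, sequential=True)
cache.access(7, sequential=False)              # ...still leaves room for random pages
print(len(cache.seq), len(cache.random), cache.desired_seq)   # -> 7 1 4
```

Because evictions favor the over-target SEQ list, the one-time scan in the usage example cannot pollute the whole cache, which is the scan resistance property described above.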
The following two figures show the performance superiority of ARC compared with a cache
that does not use ARC. The performance difference seen in Figure 2-5 on page 25 and
Figure 2-6 on page 25 grows as the throughput increases; for example, at 4000 IOPS the
response time is 7.6 ms without ARC and 1.8 ms with ARC. With the ARC algorithm, cache
space effectiveness improves by 33%, peak throughput improves by 12.5%, and the cache
miss rate is reduced by 11%.
Basically, the larger the cache size the better the I/O performance characteristics. An
approximate and conservative rule of thumb (ROT), disregarding the access density and the
I/O operations specific characteristics, based solely on the backend total capacity, is to
estimate between 2 GB and 4 GB of cache per 1 TB of storage. This ROT can work for many
workloads, but it can also be very inaccurate for many other workloads.
Especially for cache-friendly workloads, the DS6000 may be slower than the ESS, which has
a bigger cache (8 GB or more). A simple approximation assumes that the read miss ratio
(1 - read hit ratio) will double for most z/OS and open systems workloads (see Table 1-1).
Table 1-1 Estimated read hit ratios when moving from an ESS to a DS6000

ESS read hit ratio    Estimated DS6000 read hit ratio
0.95                  0.90
0.90                  0.80
0.80                  0.60
0.70                  0.40
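The miss-ratio-doubling approximation behind Table 1-1 can be written as a short Python sketch (an illustration of the stated rule of thumb, not an IBM formula):

```python
# Assume the DS6000's smaller cache roughly doubles the read miss ratio
# observed on an ESS with a larger cache.

def ds6000_read_hit(ess_read_hit: float) -> float:
    """Estimate the DS6000 read hit ratio from an ESS read hit ratio."""
    return 1.0 - 2.0 * (1.0 - ess_read_hit)

for ess_hit in (0.95, 0.90, 0.80, 0.70):
    print(f"ESS {ess_hit:.2f} -> DS6000 {ds6000_read_hit(ess_hit):.2f}")
# Reproduces the table: 0.90, 0.80, 0.60, 0.40
```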
Consider that the cache size is not an isolated factor when estimating the overall DS6000
performance, but must be considered together with other important factors like the I/O
workload characteristics; the disk drives capacity and speed; the number and type of DS6000
host adapters; and the backend data layout and 1750-EX1 FC-AL loops.
For some z/OS environments, the processor memory or cache in the DS6000 may
contribute to high I/O rates and help to minimize I/O response time.
It is not just the pure cache size that accounts for good performance figures. Economical
use of cache and smart, adaptive caching algorithms are just as important to guarantee
outstanding performance. These are implemented in the DS6000 series, except for the cache
segment size, which is currently 68 KB.
Processor memory is subdivided into a data in cache portion, which holds data in volatile
memory, and a persistent part of the memory, which functions as persistent memory to hold
DASD fast write (DFW) data until destaged to disk.
Our recommendation is to use Disk Magic to properly determine the best hardware
configuration. For more information, see 4.1, “Disk Magic” on page 86.
The minimum available DS6000 configuration is 292 GB, configured with four 73 GB disk
drives contained in one four-pack. All increments of capacity are installed in four-packs; thus
the minimum capacity increment is a four-pack of either 73 GB, 146 GB, or 300 GB drives.
Or, as a RAID 10 Array: 1+1+2S or 2+2 from one four-pack, or 3+3+2S or 4+4 from two
four-packs.
The DS6000 Storage Manager will configure the four-pack on a loop with spare DDMs as
required. When the configuration includes an intermix of different capacity or speed drives,
this may result in the creation of additional DDM spares on a loop as compared to
non-intermixed configurations. Spare configurations and considerations are explained in
detail in 2.8, “RAID implementation” on page 33.
Currently with the DS6000 there is the choice of different disk drive capacities and speeds:
73 GB 15,000 rpm disks
146 GB 10,000 rpm disks
146 GB 15,000 rpm disks
300 GB 10,000 rpm disks
The four disk drives assembled in each four-pack unit are all of the same capacity and speed.
But it is possible to mix four-packs of different capacity and speed (rpm) within a DS6000,
within the guidelines described in 2.6.4, “Disk four-pack intermixing” on page 28.
Physical capacity
The physical capacity (or raw capacity) of the DS6000 is the sum of the physical capacities
of all the installed disk four-packs. The physical capacity of a four-pack is determined by the
capacity of the disk drives it holds (refer to Figure 2-7).
Effective capacity
The effective capacity of the DS6000 is the capacity available for user data. The combination
and sequence in which four-packs are added to the DS6000, and then how they are logically
configured, will determine the effective capacity of the DS6000 (refer to Figure 2-7).
The logical configuration alternatives and the resulting effective capacities are discussed later
in Chapter 3, “Logical configuration planning” on page 53.
RAID 5 Array

DDMs in 4-Pack   Physical Capacity   Effective Capacity   Effective Capacity   Effective Capacity   Effective Capacity
                 (1 Pack)            2+P+S (1 Pack)       3+P (1 Pack)         6+P+S (2 Packs)      7+P (2 Packs)
73 GB            292 GB              126 GB               190 GB               382 GB               445 GB
146 GB           584 GB              256 GB               386 GB               773 GB               902 GB
300 GB           1200 GB             522 GB               787 GB               1576 GB              1837 GB

RAID 10 Array

DDMs in 4-Pack   Physical Capacity   Effective Capacity   Effective Capacity   Effective Capacity   Effective Capacity
                 (1 Pack)            1+1+2S (1 Pack)      2+2 (1 Pack)         3+3+2S (2 Packs)     4+4 (2 Packs)
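As a rough aid for reading these tables, the following Python sketch maps each array type to its number of data drives. Note that the effective capacities shown above are lower than this simple product because of formatting overhead, so the sketch only gives an upper bound:

```python
# Data drives per array type (RAID 5 and RAID 10 on the DS6000).
DATA_DRIVES = {
    "2+P+S": 2, "3+P": 3, "6+P+S": 6, "7+P": 7,      # RAID 5
    "1+1+2S": 1, "2+2": 2, "3+3+2S": 3, "4+4": 4,    # RAID 10
}

def raw_data_capacity_gb(array_type: str, drive_gb: int) -> int:
    """Upper bound on usable capacity: data drives times drive size."""
    return DATA_DRIVES[array_type] * drive_gb

print(raw_data_capacity_gb("6+P+S", 73))   # 438 GB raw; table shows 382 GB effective
```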
Capacity intermix
Disk four-packs of different capacities can be installed within the same DS6800 Server
Enclosure and DS6000 Expansion Enclosure, in the same or in different loops and boxes. In
the FC-AL loops of the DS6000, it is possible to intermix:
73 GB, 146 GB, and 300 GB capacity disk four-packs
RAID 5 and RAID 10 array configurations
Two hot spare disks are created per FC-AL loop for each drive capacity and speed installed,
taken from the arrays that are configured first. The spares are reserved from either two
6+P arrays (RAID 5) or one 3+3 array (RAID 10). Sparing is discussed in detail in “RAID
implementation” on page 33.
Also, as technology evolves, new larger capacity disk drives usually arrive with improved
performance characteristics. Thus many installations now feel more confident about moving
to larger capacity disk drive configurations.
When choosing the capacity of your DS6000 disk drives, these considerations should be
regarded:
The characteristics of the I/O workload (cache friendly, unfriendly, standard; block size;
random versus sequential; read/write ratio; I/O rate) are key factors when deciding the
capacity and number of the disk drives that will be included in the DS6000 configuration.
For example, if the workload is cache friendly, more I/Os are completed in cache and less
activity is performed on the backend disk drives. This type of workload is a good candidate
for storing its data in the larger capacity of disk drives.
Some I/O workloads may be very cache unfriendly or have a very high random write
content. These workloads, where a larger part of the I/Os are completed in the backend
disks, may perform better when using more disk drives.
A disk drive by itself can only do a limited number of operations per second, so using more
disk drives can result in better performance.
Consider that the disk drive capacity is not an isolated factor when estimating the overall
DS6000 performance, but must be considered together with other important factors like the
I/O workload characteristics, the cache size, the disk drives speed, the number and type of
DS6000 host ports, and the backend data layout and FC-AL loops.
Our recommendation is to use Disk Magic to properly determine the most suitable mix of disk
drive capacities to include in your DS6000 hardware configuration. For details, see 4.1, “Disk
Magic” on page 86.
2.7.2 Disk Magic examples using 146 GB and 300 GB disk drives
In this section we present examples of DS6000 Disk Magic simulation results when using the
larger capacity 146 GB and 300 GB disk drives with the same rpm. The discussions for these
examples will help you better understand what the performance implications are when using
the larger disks.
For this example of an OLTP workload, where the total effective capacity was almost the
same on both configurations, the 300 GB disk drive configuration shows a higher response
time compared to the 146 GB disk drive configuration, which has twice as many drives.
Figure 2-8 OLTP workload on the 146 GB and 300 GB configurations (response time in ms versus throughput in IOPS; read:write = 7:3, cache hit ratio = 50%, I/O block size = 4 KB)
For this example of a read intensive workload, the maximum throughput is lower than for the
OLTP workload because of the lower cache hit ratio, but the overall result is much the same:
the 300 GB disk drive configuration is acceptable at about 2,300 IOPS or less, while the
146 GB disk drive configuration is acceptable at about twice that I/O rate.
Figure 2-9 Open read intensive workload on the 146 GB and 300 GB configurations (response time in ms versus throughput in IOPS; read:write = 2:1, cache hit ratio = 28%, write efficiency = 33%, I/O block size = 4 KB)
This being so, one of the simplest ways of improving the overall performance of a disk
subsystem is to install the highest speed (RPM) disk drives. This is especially relevant for
cache-unfriendly or cache-hostile workloads, which benefit more than others from DS6000
configurations that include the faster 15K rpm drives.
Consider that the disk drive speed is not an isolated factor when estimating the overall
DS6000 performance, but must be considered together with other important factors like the
I/O workload characteristics, the cache size, the disk drives capacity, the number and type of
DS6000 host ports, and the backend data layout and FC-AL loops.
2.7.4 Disk Magic examples using 15K rpm and 10K rpm disk drives
In this section we present examples of Disk Magic simulation results when using 15K rpm
and 10K rpm disk drives. The discussions for these examples will help you better understand
the performance implications of the disk drive speed factor.
Figure 2-10 illustrates an example of an OLTP workload that is run on two 146 GB disk drive
configurations: one configuration has 10K rpm disk drives and the other has 15K rpm disk
drives; each has 16 four-packs. You can see that the 15K rpm configuration performs better
than the 10K rpm configuration, always delivering better response times.
Figure 2-10 OLTP workload - 15K rpm versus 10K rpm disk drives
Figure 2-11 Read intensive - 15K rpm versus 10K rpm drives
Figure 2-11 illustrates the example when a read intensive workload was run on both the 10K
rpm and the 15K rpm configurations. The 15K rpm configuration performs better, always
delivering lower response times for all throughputs as compared to the 10K rpm configuration.
Comparing the results in Figure 2-10 on page 32 and Figure 2-11, you can see that the
performance gain from the faster drives is less significant when running a higher cache hit
ratio workload than when running a lower cache hit ratio workload.
For more detailed estimation, see 4.1, “Disk Magic” on page 86 and Chapter 12,
“Understanding your workload” on page 407.
Note: These results are only from Disk Magic simulations; your performance may vary
according to your environment.
Logically, four DDMs are automatically grouped into an Array Site. As illustrated in
Figure 2-12, initially four DDMs out of the first four four-packs (16 DDMs) are selected at
random to make up each Array Site. After the initial setup, the next Array Sites are configured
as more four-packs of DDMs are added.
The DS6000 disk Array is configured from one or two Array Sites in Redundant Array of
Independent Disks (RAID) implementations. When a RAID Array is configured from one Array
Site, this Array may be 2+P+S, 3+P (RAID 5), 1+1+2S (RAID 1) or 2+2 (RAID 10). And when
a RAID Array is configured from two Array Sites, this Array may be 6+P+S, 7+P (RAID 5),
3+3+2S or 4+4 (RAID 10).
Note: From a performance and capacity point of view, we recommend that you configure
Arrays from two Array Sites. More DDMs can handle more I/Os, and with RAID 5 the
smaller Arrays dedicate a larger share of their DDMs to parity. But from an availability point
of view, smaller Arrays are superior because the probability of two DDMs failing is lower.
Unlike the ESS Rank, a RAID Array does not fix the type of usage (FB or CKD), and it has no
relationship with the logical subsystem (LSS). With the DS6000, the processes relating an
Array to a Rank and to an LSS are separate. Making an Array only fixes the RAID type,
either RAID 5 or RAID 10.
Ranks are made from Arrays, and making a Rank decides the type of usage (FB or CKD).
Each Rank is then formatted into multiple Extents of 1 GB (for FB) or 0.94 GB (for CKD). One
or more Ranks are assigned to an Extent Pool, and logical volumes are configured from an
Extent Pool.
One or more Ranks are assigned to one or more storage pools (Extent Pools); that is, one
Extent Pool can contain one or more Ranks. Logical volumes are then configured from the
Extent Pool. If a LUN bigger than a Rank is needed, multiple Ranks must be assigned to the
Extent Pool.
Important: At the moment, we do not recommend that you assign multiple Ranks to a single
Extent Pool. When a LUN is created, it is not striped across the Ranks: Extents are taken
from one Rank, and only when more Extents are needed are they taken from the next Rank.
This causes unbalanced performance and a loss of availability. If available, we recommend
host-level striping. And from a management point of view, it is easier to assign one Rank to
one Extent Pool.
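The following Python sketch (illustrative only, not DS6000 code, with hypothetical pool sizes) shows the allocation behavior the note describes: extents for a new LUN come from one Rank until it is exhausted, so a 500 GB LUN in a two-Rank pool lands almost entirely on the first Rank rather than being striped:

```python
def allocate_lun(extent_pool, lun_extents):
    """extent_pool: list of free-extent counts per Rank (1 GB extents for FB).

    Fill from the first Rank with free extents; spill to the next only
    when the current Rank runs out -- no striping across Ranks."""
    placement = []
    for rank, free in enumerate(extent_pool):
        take = min(free, lun_extents)
        if take:
            placement.append((rank, take))
            extent_pool[rank] -= take
            lun_extents -= take
        if lun_extents == 0:
            return placement
    raise RuntimeError("not enough free extents in the pool")

pool = [400, 400]                 # two Ranks, 400 GB free on each
print(allocate_lun(pool, 500))    # [(0, 400), (1, 100)] -- mostly on Rank 0
```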
When configured, the logical volumes are striped across all the data disks and then mirrored,
if it is a RAID 10 Rank; or striped across all data disks in the Array along with the parity disk
(floating), if it is a RAID 5 Rank.
Because the DS6000 architecture for maximum availability is based on two spare drives per
FC-AL loop (and per capacity, per rpm), if the first two arrays that are configured in a loop are
defined as RAID 5 then they will be defined by the DS6000 with two or six data disks plus one
parity disk plus one spare disk—this is a 2+P+S, 6+P+S Rank configuration. This will happen
for the first two arrays of each capacity installed in the loop, if configured as RAID 5.
Once the two spares per capacity/speed rule is fulfilled, then further RAID 5 arrays in the loop
will be configured by the DS6000 as three or seven data disks and one parity disk—this is a
3+P, 7+P Rank configuration. Figure 2-13 to Figure 2-15 on page 36 illustrate the three
arrangements of disks possible in the DS6000 when configuring two four-packs RAID 5
arrays.
[Figures 2-13 through 2-15: arrangements of RAID 5 arrays and spares across the device adapter (DA) loops and the 1750-EX1 expansion enclosures]
In a RAID 5 implementation, each disk can be accessed, thus enabling multiple concurrent
accesses to the array. This results in multiple concurrent I/O requests being satisfied, thus
providing a higher random-transactions throughput.
The DS6000 architecture for maximum availability is based on two spare drives per loop (and
per capacity, per rpm). If the first array configured in a loop is defined as a RAID 10 Array, it
will be defined with one data disk plus one mirror plus two spares (made from one Array Site),
or three data disks plus three mirrors plus two spares (made from two Array Sites). This is a
1+1+2S or 3+3+2S array configuration. This will happen for the first array of each capacity
installed in the loop, if configured as RAID 10.
Once the two spares per capacity rule is fulfilled, then further RAID 10 arrays in the loop will be
configured by the DS6000 as two/four data disks plus two/four mirrors—this is a 2+2/4+4
array configuration. Figure 2-16 to Figure 2-18 on page 38 illustrate the three arrangements
of disks that can be found in the DS6000 when there are RAID 10 Ranks.
[Figures 2-16 to 2-18: RAID 10 arrangements on Loop 1 across the server enclosure and 1750-EX1-1/1750-EX1-2 expansion enclosures. Two spares are configured on the first two RAID 10 arrays per loop (3+3+2S); no spares are configured on the remaining arrays (4+4). If 146 GB arrays are configured initially and 300 GB DDMs are added later, two hot spares are configured for each capacity of array per loop. Legend: data, spare, mirror]
RAID 10 is also known as RAID 1+0, because it is a combination of RAID 1 (mirroring) and
RAID 0 (striping). The striping optimizes performance by striping volumes across several
disk drives (three or four DDMs in two Array Sites). RAID 1 provides the protection against a
disk drive failure by keeping a mirrored copy of the data.
Note: If you make a RAID 5 array first and then make a RAID 10 array in the first or second
enclosure, three hot spare disks are created, which reduces the total effective capacity. To
avoid this, create the RAID 10 Array first in the first or second enclosure.
[Figure: A RAID 10 (3+3+2S) first array and a RAID 5 (7+P) second array on Loop 1, across the 1750-EX1-1 and 1750-EX1-2 expansion enclosures]
For reads that must be satisfied from disk, the performance of RAID 5 and RAID 10 is roughly
equal, except at high I/O rates. If a RAID 5 array contains significantly more data than a
RAID 10 array, some of that data is located on the inner tracks of the disks, where longer
seek times can sometimes make RAID 5 slower than RAID 10; if both arrays contain about
the same amount of data, read performance is comparable. Viewed another way, if the full
space of each RAID array is used for data, RAID 10 has twice the number of DDMs for the
same usable capacity, which means more read operations can be handled by RAID 10.
Sequential writes
Sequential writes are handled the same way as random writes, from the standpoint of getting
the data into cache and persistent cache, and then acknowledging the I/O as complete to the
host server. As with random writes, provided there is room in the cache areas, the response
time seen by the application is the time to get data into the cache.
RAID 10 destages are handled the same way as random writes. However, with sequential
writes, the volume of data is generally much larger, and since data is striped across the array,
I/Os will be done to every disk in the array. And, as with random writes, the data is written to
both sets of volumes.
RAID 5 sequential writes are done a bit differently than random writes. Because the larger
volume of data requires striping the data across the entire array, RAID 5 does a full stripe
write across all DDMs in the array, writing the data and parity generated from the data in
cache, without requiring any read of existing data or parity. Writing one copy of the data plus
parity information, as RAID 5 does, requires fewer disk operations than writing the data twice,
as RAID 10 does. This means that for sequential writes, a RAID 5 destage completes faster
and thereby reduces the busy time of the disk subsystem.
Array rebuilds
In theory, RAID 10 is better for Array rebuilds since RAID 5 must read every disk in the Array,
reconstruct the data using parity calculations, and then write the reconstructed data. In
comparison, to rebuild a failed disk RAID 10 has only to copy the data from one DDM to
another. However, the DS6000 switched fibre network connecting the disks has sufficient
bandwidth to permit the reads from all DDMs in the RAID 5 Array to be done concurrently. So
the actual elapsed time to rebuild a RAID 5 array is approximately the same as the elapsed
time to rebuild a RAID 10 array. However, due to the larger number of disk operations, a RAID
5 rebuild would be more likely to impact other disk activity on the same disk loops than would
RAID 10.
For reads from disk, either random or sequential, there is no significant difference in RAID 5
and RAID 10 performance, except at high I/O rates.
For random writes to disk, RAID 10 performs better. The improvement is not seen until high
levels of activity.
For sequential reads/writes and random reads to disk, RAID 5 performs better, and the
difference is measurable even at low levels of activity.
Regardless of your workload characteristics (read versus write, random versus sequential), if
your workload’s access density (the ratio of I/O operations per second to gigabytes of
capacity) is well within the capabilities of your disk subsystem, then either RAID 5 or RAID 10
will work fine.
For workloads that perform better with RAID 5, the difference in RAID 5 performance over
RAID 10 is typically not large. However, for workloads that perform better with RAID 10, the
difference in RAID 10 performance over RAID 5 can be significant. RAID 10 is generally
considered the RAID type of choice for random write workloads which need the absolute best
I/O performance possible.
The major downside to RAID 10 is usable space efficiency. In a DS8000, 64 DDMs (the
number typically supported on one device adapter pair) in a RAID 10 configuration provide 30
DDMs of usable space. Those same 64 DDMs in a RAID 5 configuration provide 52 DDMs of
usable space, which is 73 percent more usable capacity.
Instead of asking which is better, RAID 5 or RAID 10, a more appropriate question is when to
use RAID 5 and when to use RAID 10. The selection of RAID type is made for each individual
Array Site, which for the DS6000 is four DDMs. So you can select RAID type based on the
performance requirements of the files that will be located there. The best way to compare a
workload’s performance using RAID 5 versus RAID 10 is to have a Disk Magic model run. For
additional information about the capabilities of this tool, see 4.1, “Disk Magic” on page 86.
These problems are solved with the switched FC-AL implementation on the DS6000.
The DS6000 architecture employs dual redundant switched FC-AL access to each of the disk
enclosures. The key benefits of doing this are:
Two independent switched networks to access the disk enclosures
Four access paths to each DDM
Each device adapter port operates independently
Double the bandwidth over traditional FC-AL loop implementations
In the DS6000, the switch chipset is completely integrated into the servers. Each server
contains one switch. Note, however, that the switch chipset itself is completely separate from
the server chipset. In Figure 2-20 each DDM is depicted as being attached to two separate
Fibre Channel switches. This means that with two servers, we have four effective data paths
to each disk; each path comes from a device adapter port and operates at 2 Gbps.
[Figure 2-20: Switched connections. Each DDM attaches to two Fibre Channel switches, which the device adapters in Controller 0 and Controller 1 access independently]
When a connection is made between the device adapter and a disk, the connection is a
switched connection that uses arbitrated loop protocol. This means that a mini-loop is created
between the device adapter and the disk.
Note: We highly recommend that you do not create an Extent Pool or LUN that spans the loops.
[Figure: Switched disk expansion cabling. Server 0 and Server 1 connect through the dual Fibre Channel switches on the server enclosure midplane to its 16 DDMs, and through the Disk EXP ports to the first through fifth SBOD expansion enclosures (16 DDMs each) on switched Loop 0 and Loop 1; the links auto-negotiate 1 Gbps or 2 Gbps]
The host ports can be either Fibre Channel or FICON (longwave or shortwave). You have to
change the port profile manually, by using the GUI or CLI, when connecting FCP or FICON
hosts. Each host port has an SFP module with an LC connector and negotiates 2 Gbps or
1 Gbps automatically.
The host servers supported by the DS6000 for each host port interface can be found at:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
Figure 2-23 shows the performance data of the host adapter and port compared with the host
adapter of ESS 800. The DS6000 host adapter performs much faster than the ESS 800 host
adapter.
Once the necessary information is ready, then Disk Magic can be run to evaluate the
alternatives for the DS6000 hardware configuration. As with other components of the DS6000
hardware configuration, consider that the DS6000 host adapters are not an isolated factor
when estimating the overall DS6000 performance. Instead, they must be considered
together with other important factors, like the I/O workload characteristics, the cache size, the
disk drive capacity and speed, and the back-end data layout and FC-AL loops.
It is the very rich connectivity options of Fibre Channel technology that have resulted in the
Storage Area Network (SAN) implementations. The limitations seen on SCSI in terms of
distance, performance, addressability, and connectivity are overcome with Fibre Channel and
a SAN.
The DS6000 with its Fibre Channel/FICON host adapters provides Fibre Channel Protocol
(FCP, which is SCSI traffic on a serial fiber implementation) interface, for attachment to open
systems that use Fibre Channel adapters for their connectivity.
Note: The Fibre Channel/FICON host port supports either FICON or FCP, but not
simultaneously; the protocol to be used is configurable on a port-by-port basis.
The DS6000 supports up to 8 host ports, which allows for a maximum of 8 FCP ports per
DS6000. Each Fibre Channel/FICON host port provides one port with an LC connector type.
There are cable options that can be ordered with the DS6000 to enable connection of the
adapter port to an existing cable infrastructure.
As SANs migrate to 2 Gbps technology, your storage should be able to exploit this bandwidth.
The DS6000 Fibre Channel/FICON ports operate at up to 2 Gbps. The adapter
auto-negotiates to either 2 Gbps or 1 Gbps link speed, and will operate at 1 Gbps unless both
ends of the link support 2 Gbps operation.
There are two types of host adapter ports you can select: Longwave or shortwave. With the
longwave laser, you can connect nodes at distances of up to 10 km (non-repeated). With the
shortwave laser, you can connect at distances of up to 300 meters. The distances can be
extended if using a SAN fabric.
When equipped with the Fibre Channel/FICON host ports, these ports can participate in three
types of Fibre Channel topology by setting the port topology:
Port topology: SCSI-FCP
– Fibre channel topology: Point-to-point or Switched fabric
Port topology: FC-AL
– Fibre channel topology: Arbitrated loop
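Changing a port between these topologies is done with the DS Storage Manager GUI or the DS CLI. As a minimal sketch using the DSCLI setioport command (the port IDs I0001 and I0002 are illustrative), a port could be set to FICON or to switched-fabric FCP as follows:
dscli> setioport -topology ficon I0001
dscli> setioport -topology scsi-fcp I0002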
[Figure: Channel performance comparison of ESCON, FICON, FICON Express, and 2 Gbps FICON Express2 on G5, G6, z890, z990, and z9 servers, in maximum 4K I/Os per second and maximum bandwidth (MB/s)]
As you can see, the FICON Express2 channel as first introduced on the zSeries z890 and
z990 represents a significant improvement in both 4K I/O per second throughput and
maximum bandwidth capability compared to ESCON and previous FICON offerings.
Note: This performance data was measured in a controlled environment running an I/O
driver program. The actual throughput or performance that any user will experience will
vary depending upon considerations such as the amount of multiprogramming in the user’s
job stream, the I/O configuration, the storage configuration, and the workload processed.
The DS6000 series provides only 2 Gbps FCP ports, which can be configured either as
FICON to connect to zSeries servers or as FCP ports to connect to Fibre Channel-attached
open systems hosts. For example, just two FICON Express2 channels have the potential to
provide roughly a bandwidth of 2 x 270 MB/s, which equals 540 MB per second. This is a very
conservative number.
I/O rates with 4 KB blocks are about 13000 I/Os per second per FICON Express2 channel,
again a conservative number. For example, two FICON Express2 channels have the potential
of over 26000 I/Os per second with the conservative numbers. These numbers vary
depending on the server type used.
Because of its greater bandwidth, the DS6800 can achieve higher data rates than an
ESS 800, and it can outperform the ESS 800 for some sequential workloads. But in a z/OS
environment, a typical transaction workload might perform better on an ESS 800 Turbo II with
a large cache configuration than on a DS6800.
[Figure: FICON attachment to zSeries: up to 8 FICON ports per DS6800, each port with an LC connector type, longwave or shortwave option, up to 200 MB/sec full duplex]
These characteristics allow more powerful and simpler configurations. The DS6000 supports
up to 8 Fibre Channel/FICON host ports, which allows for a maximum of 8 FICON ports per
machine.
Note: The Fibre Channel/FICON host port supports either FICON or FCP, but not
simultaneously; the protocol to be used is configurable on a port-by-port basis.
Each Fibre Channel/FICON host port provides one port with an LC connector type. The
adapter is a 2 Gbps card and provides a nominal 200 MB/s full-duplex data rate. The adapter
will auto-negotiate between 1 Gbps and 2 Gbps, depending upon the speed of the connection.
There are two types of host adapter cards you can select: Longwave and shortwave. With
longwave laser, you can connect nodes at distances of up to 10 km (without repeaters). With
shortwave laser, you can connect at distances of up to 300 m.
Each Fibre Channel/FICON host adapter provides one port with an LC connector type. There
are cable options that can be ordered with the DS6000 to enable connection of the adapter
port to an existing cable infrastructure.
Topologies
When configured with the FICON attachment, the DS6000 can participate in point-to-point
and switched topologies. The supported switch/directors for FICON connectivity can be found
at:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
For more information about host attachment see Chapter 5, “Host attachment” on page 143.
Figure 2-27 on page 50 to Figure 2-29 on page 51 show the preferred path I/O activity of the
DS6000. As illustrated in Figure 2-27 on page 50, when the host has multiple paths, but only
one path to each server, I/O is always active/standby for a LUN (Extent Pool); an alternate
path is used only if the server or path fails. When a host has multiple paths to each server,
I/O can be load-balanced or round-robin across the paths connected to one side of the
servers. If one of those paths fails, the remaining paths are used for I/O without causing a
failover; failover occurs only when a server fails. For a large-capacity, high-performance
configuration, which is required especially for sequential workloads, multiple paths to each
server may be effective. But if the configuration is small or the workload is random, a
multipath configuration to each server may not be needed, because the DDMs can saturate
earlier than the host ports.
Depending on the operating system and multipath driver, users have to configure the OS or
driver settings to determine the I/O activity (load-balancing, round-robin, or active/standby).
For example, the Subsystem Device Driver (SDD) default setting is load balancing; if you
want to set another policy, you have to use the datapath set command.
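As an illustrative sketch of the SDD commands (the device number 0 is hypothetical), the current policy can be displayed and then changed to round-robin as follows:
datapath query device
datapath set device 0 policy rr
With SDD, rr selects round-robin, lb selects load balancing (the default), and fo restricts the device to failover-only operation.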
Note: When configuring multipath, the host must have the same number of Host Bus
Adapters (HBAs) as the number of DS6000 host ports used, to get the most effective
performance. If the number of HBAs is less than the number of host ports, the HBAs may
become a bottleneck.
If a host has only a single path to one side of the servers, the host can still access the LUNs
related to the other side of the server, as shown in Figure 2-29 on page 51; the I/O goes
through the interconnect bridge between the servers. But from a performance and RAS point
of view, we strongly recommend against this type of configuration.
[Figures 2-27 to 2-29: Preferred path configurations. A host with host ports attached through Fibre Channel switches 0 and 1 to Server 0 and Server 1, with one or more paths to each server]
If you have LUNs (Extent Pools) that are related to only one side of the servers, only half of
the server enclosure resources are used. To use the servers effectively, create at least two
LUNs (Extent Pools), each related to a different server.
2.11.1 Whitepapers
IBM regularly publishes whitepapers which document the performance of specific DS6000
configurations. Typically, workloads are run on multiple configurations, and performance
results are compiled so the different configurations can be compared. For example,
workloads may be run using different numbers of host adapters, or different types of DDMs.
By reviewing these whitepapers, you can make inferences on the relative performance
benefits of different components. This will aid you in choosing the type and quantities of
components which would best fit your particular workload requirements. Your IBM
representative or IBM Business Partner has access to these whitepapers and can provide
them to you.
Logical configuration is the process of subdividing the physical storage devices comprising
your DS6000 into usable logical storage entities.
3.1.1 Isolation
Isolation means providing one set of applications with dedicated DS6000 hardware
resources to reduce the impact of other workloads. Looked at another way, isolation means
limiting one workload to a subset of DS6000 hardware resources so that it will not impact
other workloads. Isolation provides better resource availability for those hardware resources
dedicated to the single workload, and reduces contention with other applications for those
resources. However, isolation limits the single workload to a subset of the available DS6000
hardware, so its maximum potential performance may be reduced. Also, unless an application
has an entire DS6000 dedicated to its use, there is the potential for contention with other
applications for any resources which are not dedicated.
Note: Hosts must be connected to at least one I/O adapter from each of the two servers in
the DS6000 in order to provide connectivity if one path should fail.
3.1.2 Resource-sharing
Multiple resource-sharing workloads may have logical volumes on the same Ranks, and may
access the same DS6000 I/O adapters or even I/O ports through their SAN connections.
Resource sharing allows a workload to access more DS6000 hardware than could be
dedicated to the workload, providing greater potential performance, but this hardware sharing
may result in resource contention between applications that impacts performance at times.
3.1.3 Spreading
Spreading means distributing and balancing workload across all of the DS6000 hardware
resources available, including:
Server0 and server1 (including cache and processor resources)
Ranks
Host adapters
Spreading applies to both isolated workloads and resource-sharing workloads. The DS6000
hardware resources allocated to either one isolated workload or multiple resource-sharing
workloads should be balanced evenly across server0 and server1. That is, Ranks allocated
for either one isolated workload or multiple resource-sharing workloads should be assigned to
server0 and server1 in a balanced manner.
For either an isolated or a resource-sharing workload, volumes and host connections should
be distributed in a balanced manner across all DS6000 hardware resources available to that
workload.
One exception to the recommendation of spreading volumes is the case of files or datasets
which will never be accessed simultaneously, such as multiple log files for the same
application, where only one log file will be in use at a time.
Host connections should also be configured as evenly as possible across the I/O adapters
available to either an isolated or a resource-sharing workload. Where possible, do not share a
host adapter connection with remote mirroring traffic.
The next step is identifying balanced hardware resources that can be dedicated to the
isolated workload.
The third step is identifying the remaining DS6000 resources to be shared among the
resource-sharing workloads.
The final step is assigning volumes and host connections to the workloads in a way that is
balanced, and spread - either across all dedicated resources (for the isolated workload) or
across all shared resources (for the multiple resource-sharing workloads).
Workloads which require different disk drive types (capacity and speed), different RAID types
(RAID5 or RAID10), or different storage types (CKD or FB) require isolation to different
DS6000 arrays. Workloads that use different I/O protocols (FCP or FICON) require isolation
to different I/O ports. Organizational considerations may also dictate isolation.
However, even workloads that use the same disk drive types, RAID type, storage type and I/O
protocol (and without a user requirement for isolation) should still be evaluated for separation
or isolation requirements.
High priority workloads should be considered for isolation to dedicated DS6000 hardware
resources to ensure that they will not be subject to contention. Database online transaction
processing workloads may require dedicated resources in order to achieve better service
levels.
Workloads with very heavy, continuous I/O access patterns should be considered for isolation
to prevent them from consuming all available DS6000 hardware resources and impacting the
performance of other workloads.
Isolation of only a few known heavy-hitting workloads often allows the remaining workloads to
share hardware resources and achieve acceptable levels of performance. Some examples of
I/O workloads or files/datasets which often have heavy and continuous I/O access patterns
are:
Sequential workloads
Log files or datasets
Tape simulation on disk
Business Intelligence and Data Mining
Mail applications that require reading and possibly updating every mailbox
Sort/work datasets or files
Disk copies (including Point in Time Copy background copy or Remote Mirror volumes)
Engineering/scientific applications
Video/imaging applications
Batch update workloads
Workloads for all applications for which DS6000 storage will be allocated should be taken into
account, including current workloads that will be migrated from other storage subsystems,
new workloads planned for the DS6000, and projected growth in all application workloads.
For existing applications, historical experience should be considered first. For example, is
there an application where certain datasets or files are known to have heavy, continuous I/O
access patterns? Is there a combination of multiple I/O workloads that would cause
unacceptable performance if their peak times occurred simultaneously?
For existing applications, performance monitoring tools available for the existing storage
subsystems and server platforms can also be used to understand current application
workload characteristics such as:
Read/Write ratio
Random/sequential ratio
Peak workload (I/Os per second for random access, and MB per second for sequential
access)
Peak workload periods (time of day, time of month)
Requirements for new application workloads and for current application workload growth must
be projected.
The Disk Magic modeling tool can be used to model the current or projected workload and
estimate DS6000 hardware resources required.
For more information about performance monitoring tools, see Chapter 4, “Planning and
monitoring tools” on page 85.
For more information about workload characteristics, see Chapter 12, “Understanding your
workload” on page 407.
For more information about Disk Magic, see 4.1, “Disk Magic” on page 86.
The dedicated DS6000 resources should be balanced across DS6000 components, such as:
Ranks assigned to server0 Extent Pools and Ranks assigned to server1 Extent Pools
A pair of ports consisting of one port from an I/O adapter on server0 and one from an I/O
adapter on server1 within the DS6000
If a workload is to be assigned two dedicated I/O ports, one should be on an I/O adapter that
is managed by server0, and the other should be on an adapter managed by server1.
The DS6000 resources that will be shared should be balanced across DS6000 components,
such as:
Ranks assigned to server0 Extent Pools and Ranks assigned to server1 Extent Pools
Note: The disk drives in the DS6000 enclosures have a dual ported FC-AL interface.
Instead of forming an FC-AL loop, each disk drive is connected to two Fibre Channel
switches within each enclosure. With this switching technology there is a point-to-point
connection to each disk drive. This allows maximum bandwidth for data movement,
eliminates the bottlenecks of loop designs, and allows for specific disk drive fault indication.
The physical DDM attachment connectivity within the DS6000 enclosure is shown in
Figure 3-1.
Figure 3-2 on page 61 is a schematic illustration of a full capacity DS6000 with the maximum
of seven expansion enclosures (1750-EX1), providing a total of 16 Arrays which have been
configured here as 16 Extent Pools. Each expansion enclosure is attached to each of the two
DS6000 controllers by a pair of Fibre Channel connections as shown here. We suggest that
the DS6000 server enclosure be physically placed mid-way in its frame, as we have indicated
in this diagram, so that all the expansion enclosures on Switched Loop 0 can be placed above
the base unit, and all the expansion enclosures on Switched Loop 1 can be placed below the
base unit. This makes the expansion enclosure Fibre Channel cabling simpler to manage.
[Figure 3-2: A full-capacity DS6000. Each 1750-EX1 expansion enclosure (ids 10, 11, 12, and so on) contributes two Ranks, each configured as its own Extent Pool (for example, id 10 holds Rank 2/Ext Pool 2 and Rank 3/Ext Pool 3, and id 11 holds Rank 4/Ext Pool 4 and Rank 5/Ext Pool 5), balanced across Loop 0 and Loop 1]
Each expansion enclosure is added to the configuration in a particular location on its specific
drive I/O Fibre Channel loop in order to spread workload as evenly as possible across the two
DS6000 servers. The two Fibre Channel loops are evenly populated as the number of
expansion enclosures increases.
The DS6000 can contain up to 128 disk drives of different capacities. The DS6000 supports
72.8 GB and 145.6 GB disk drives at 10,000 or 15,000 rpm, and 300 GB disk drives at 10,000
rpm. The same disk technology and capacities are available for all DS6000
attachable servers via FCP and FICON attachment. The DS6000 drive sets that hold the disk
drives are installed in the DS6000 base frame and, if needed, up to seven expansion
enclosures may be used (as Figure 3-2 on page 61 illustrates). The base frame of the DS6000
can hold 8 or 16 disk drives, as two or four drive sets; likewise, each expansion enclosure can
hold 8 or 16 disk drives, in two or four drive sets.
The overriding consideration for configuring is the actual end user requirements. However,
before configuring a mix of disk geometries, you should consider the increased cost of
spares, and the increased complexity of your system.
While it is possible to mix drive sets with different geometries (speed and capacity) across
both drive loops, in general, we do not recommend it. Each drive geometry used will be
allocated its own pair of global spares of the same or greater capacity on its own I/O loop,
resulting in the possibility of inefficient use of your installed capacity if you spread the drive
types across drive loops. A global spare means that the spare is available to any drive
enclosure on the same drive loop.
For example, a DS6000 configured with a mixture of 300 GB, 145.6 GB and 72.8 GB DDMs to
achieve 13.5 TB of usable capacity could use 8 DDMs as spares in order to meet the
requirement of 2 spares of the same or greater capacity for each device geometry, and some
of these would be sparing smaller capacity drives on the loop. Overall, this configuration
utilizes approximately 73% of the raw capacity. A configuration with a similar raw disk capacity
comprising all 145.6 GB DDMs, only uses 4 DDMs as spares, utilizing approximately 83% of
the raw capacity.
A larger DDM may also be utilized as a spare for a smaller drive. In this case, when the
original failed DDM is repaired, the DS6000 will physically fail operations back from the larger
DDM, which was acting as a smaller spare, to the replacement DDM. This requires additional
I/O operations.
Important: The DS6000 is restricted to two address groups, and each of these can be
associated with either open systems data or with zSeries data, but not both. Refer to
“Address groups” on page 71 for further discussion.
CKD
In count-key-data (CKD) organization, the data field stores the user data. Also, because the
data records can be variable in length, they all have an associated count field that indicates
the user data record size. Then the key field is used to enable a hardware search based on a
key. However, this is not generally used for most data anymore. Extended count-key-data
(ECKD™) is a more recent version of CKD that uses an enhanced S/390® channel command
set.
Array Sites
The DS6000 is available with a configuration entity or Array Site consisting of one disk drive
set. A fully populated server enclosure or storage enclosure has two pairs of Array Sites, as
can be seen in Figure 3-3 on page 64 showing Array Site locations for a DS6000 server
enclosure. The DDMs selected for an Array Site will be selected from the same disk
enclosure string by the DS6000 and you have no way to influence this process. An example of
the relationship between DDMs and their associated Array Site is shown in Example 3-1 on
page 65. The Array Sites have been shown here as S1 through S4. An Array Site is the basic
building block for Array creation.
As we can see in Figure 3-3, there is a predetermined, but non-contiguous affinity between
the disk drive sets and Array Sites. In this example, Array Site S1 has DDMs from Drive Set 1
and Drive Set 3.
Note: When you first create your new logical 8 DDM Array from a pair of 4 DDM Array
Sites, it is good practice to select an adjacent odd/even pair of Array Sites for each Array. In
this example we chose Array Sites S1/S2 as our first pair and formed Array A0. Then we
chose Array Sites S3/S4 to form Array A1.
Important: When creating a new Array, the DS6000 configuration rules will prevent you
from inadvertently selecting a pair of Array Sites that are not within the same physical
enclosure.
Important: The Array Site numbering is selected by the DS6000 based on both the order
of cabling the Storage Enclosures together to form a DS6000 Storage subsystem, and on
the order in which Array Sites were populated with disk drive sets. There is no way to
pre-determine the Array Site numbering if one or more expansion enclosures are attached
to the server enclosure before you first power on the DS6000.
In all our examples we used a DS6000 that has two fully populated expansion enclosures that
were attached one at a time after the server enclosure was installed; we subsequently
added a third expansion enclosure with 8 DDMs towards the end of the configuration process.
Consequently, our Array Site numbering started with Sites S1-S4 in the server enclosure,
Sites S5-S8 in the first expansion enclosure, and Sites S9-S12 in the second expansion
enclosure. The third, half-populated expansion enclosure has Array Sites S13 and S14.
Figure 3-4 Sample of Array Site locations within three expansion enclosures (for example, 1750-EX1 id 11 holds Array Sites S9 and S10, and S11 and S12)
Here is another look at the example of the relationship between Array Sites and DDMs in a
DS6000 storage enclosure as seen in the arsite column in Example 3-1. You can see that the
DS6000 has apparently not chosen a strictly sequential arrangement for the location for each
DDM within an Array Site. This is because of the requirement to allocate one spare from each
of the first two Array Sites on each loop (S1 and S2). You can see that by the time we get to
the second expansion enclosure, we now see for Array Sites S9 - S13 that there is now a
one-to-one relationship between the Array Sites and the DDM position within the enclosure.
The spare DDM locations are likely to change over time, as sparing takes effect on each of
the two drive loops.
Array size
Your first decision is to choose either a 4 DDM Array or an 8 DDM Array as your starting size.
Based on this initial planning decision, one or two of the DS6000 4 DDM Array Sites will be
used to create an Array. That is to say, we can plan to use one or two disk drive sets in each
Array.
Note: The 8 DDM Arrays will provide you with more usable storage from your DDMs than 4
DDM Arrays. For example, a fully populated storage enclosure containing 16
146 GB DDMs will provide approximately 1.9 TB of usable storage when configured with 8
DDM RAID 5 Arrays, and 1.6 TB of usable storage when utilizing 4 DDM RAID 5 Arrays
(assuming no spares are required within this storage enclosure).
RAID 5 or RAID 10
Having performance in mind, we must determine which RAID organization we need, and
begin by selecting the number of drive sets we need in an Array. We do this by associating
one or two Array Sites, each with four DDMs associated with it, to become an Array. The
DS6000 Arrays can be defined as either RAID 5 or RAID 10.
We recommend the use of 8 DDM Arrays in RAID 5, if all the DDMs have similar speed and
capacity characteristics, as this will provide more usable capacity than a RAID 10
implementation, and acceptable performance for most applications with a normal range of I/O
requirements. If you have a specific performance requirement, such as providing for an
application with a high random write requirement, such as some data base applications, you
may need to consider creating some RAID 10 arrays utilizing single or dual Array Sites.
Always begin your configuration of a new DS6000 by verifying that all the expected Array
Sites are available, as seen in Example 3-2 on page 67 under the column headed arsite.
Follow this by ensuring that you group the Array Sites in adjacent odd/even pairs in order to
create an Array with disks from the correct disk sets within the DS6000.
Example 3-3 shows an Array creation utilizing the first two Array Sites.
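As an illustrative sketch of such a command (the confirmation message is omitted), an 8 DDM RAID 5 Array can be created from Array Sites S1 and S2 as follows:
dscli> mkarray -raidtype 5 -arsite S1,S2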
We should then confirm that we configured the appropriate Array Sites by checking that we
successfully changed the status of our Array Sites, and now have a newly created Array.
Review the arsite details in Example 3-4.
Important: The Ranks are created in the order in which that they are defined, so take care
to ensure that the first Rank you define uses Array A0, the second uses A1 and so on, if
you want to match Array numbers with their corresponding Rank. We recommend that you
maintain this simple Rank to Array relationship in order to simplify subsequent
performance management.
If you inadvertently make a mistake when assigning Arrays to Ranks, such as assigning A0 to
R1, and A1 to R0, you could possibly recover later in the logical building process by allocating
R1 to Extent Pool P0, and R0 to Extent Pool P1, but this would continue to be confusing for
other management personnel.
Note: If you are implementing a mixed open systems and z/OS environment, you need to
decide here if you want to separate each data type across the two Switched FC-AL loops.
You can then selectively set up the CKD and FB Ranks to achieve this separation across
the two loops if required.
Tip: You may want to separately manage the Arrays with different capacities, such as
those with spares assigned and those with differing DDM geometries.
In Example 3-5 we create our first Rank from Array A0. There is no parameter to allow you to
associate the R0 identifier with your Rank definition. The Ranks are created in sequential
order of definition, starting with R0, so exercise care at this stage, or you may inadvertently
introduce some potential performance bottlenecks by causing subsequent allocations to be
spread unevenly across the two storage servers in your DS6000.
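As a minimal sketch (assuming an open systems Rank, hence -stgtype fb), the Rank is created from Array A0 and then listed for verification:
dscli> mkrank -array A0 -stgtype fb
dscli> lsrank -l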
Now we can determine that the Rank was successfully defined and see for the first time the
available capacity in FB or CKD extents of our Rank as seen in Example 3-6, where we note
that we have 773 FB extents available in Rank R0.
If you have both open and zSeries hosts attached, you will need a separate Extent Pool for
each type of logical disk (FB or CKD), due to the different formatting of the Ranks that make
up each Extent Pool type.
Example 3-7 shows an example of defining Extent Pools using the CLI.
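As a minimal sketch, assuming one FB Extent Pool per server (rank group 0 belongs to server0 and rank group 1 to server1; the name ITSO_P1 is illustrative):
dscli> mkextpool -rankgrp 0 -stgtype fb ITSO_P0
dscli> mkextpool -rankgrp 1 -stgtype fb ITSO_P1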
Note: There is no parameter in the mkextpool command to allow you to refer to a specific
Extent Pool, so remember that the first one defined will be P0 and the next one will be P1
and so on. Be sure that you associate the even numbered Extent Pools with server0, and
the odd numbered Pools with server1.
Always verify that the Ranks were associated with the desired Extent Pools by issuing the
lsrank -l command, as shown here in Example 3-8.
We recommend that you put only Ranks associated with DDMs with the same capacity and
rotational speed into the same Extent Pool when adding more than one Rank to the same
Extent Pool.
Tip: For performance management we recommend you create Extent Pools comprising a
single Rank only, unless you need to define logical disks that are larger than a single-Rank
Extent Pool, or need to utilize all available capacity.
Here we need to ensure that we associate the Ranks with a preferred DS6000 server. In
keeping with our philosophy of keeping even numbered components associated with server0,
we matched the even numbered Ranks with an even numbered Extent Pool using the chrank
DSCLI command, as shown in Example 3-9.
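As a sketch of this assignment, keeping even Ranks with even Extent Pools and odd Ranks with odd Extent Pools:
dscli> chrank -extpool P0 R0
dscli> chrank -extpool P1 R1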
The capacity of one or more Ranks can be aggregated into a single Extent Pool and logical
volumes configured in that aggregated Extent Pool are not bound to any specific Rank. This
allows us to define a logical volume up to 2 TB for an FB volume, even when the capacity of a
single Rank is much less than 2 TB. As such, the available capacity of the storage facility can
be flexibly allocated across the set of defined logical subsystems and logical volumes.
Different logical volumes within the same logical subsystem can be configured from different
Extent Pools, although performance of this can be difficult to manage, as it is quite time
consuming to identify the physical location of a user’s LUN that may be experiencing
contention.
However there is one restriction with the LSS now having an affinity to one of the DS6000
servers. All even numbered LSSs (X’x0’, X’x2’, X’x4’, up to X’xE’) belong to server 0 and all
odd numbered LSSs (X’x1’, X’x3’, X’x5’, up to X’xF’) belong to server 1.
All devices in an LSS must be either count-key-data (CKD) for zSeries data or fixed block (FB)
for open systems data. This restriction goes even further. LSSs are grouped into address
groups of 16 LSSs.
Address groups
There is no command parameter to specifically define an address group. Address groups are
created automatically when the first LSS associated with the address group is created and
deleted automatically when the last LSS in the address group is deleted.
Restriction: The DS6000 supports two address groups: address group 0 and address
group 1.
LSSs are numbered X’ab’, where a is the address group and b denotes an LSS within the
address group. So, for example X’10’ to X’1F’ are LSSs in address group 1. All LSSs within
Note: zSeries clients are reminded that the DS6000 does not have support for ESCON
hosts attachment.
LCU
zSeries users are familiar with a logical control unit (LCU). zSeries operating systems
configure LCUs to create device addresses. There is a one to one relationship between an
LCU and a CKD LSS in a DS6000 (LSS X'ab' maps to LCU X'ab'). Logical CKD volumes have
a logical volume number X'abcd' where X'ab' identifies the LSS and X'cd' is one of the 256
logical volumes on the LSS. This logical volume number is assigned to a logical volume when
a logical volume is created and determines the LSS that it is associated with. The 256
possible logical volumes associated with an LSS are mapped to the 256 possible device
addresses on an LCU (logical volume X'abcd' maps to device address X'cd' on LCU X'ab').
When creating CKD logical volumes and assigning their logical volume numbers, users
should consider whether Parallel Access Volumes (PAVs) are required on the LCU and
reserve some of the addresses on the LCU for alias addresses.
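As an illustrative sketch (the Extent Pool, SSID, capacity, and volume ID shown are hypothetical), an LCU and a 3390-3 base volume might be created as follows, leaving the higher device addresses of the LCU free for PAV aliases:
dscli> mklcu -qty 1 -id 10 -ss 0010
dscli> mkckdvol -extpool P2 -cap 3339 -name zprod00 1000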
For open systems, LSSs do not play an important role except in determining which server the
LUN is managed by (and which Extent Pools it must be allocated in) and in certain aspects
related to Metro Mirror, Global Mirror, or any of the other remote copy implementations.
Note: LCUs must be specifically defined, by using the mklcu command or the DS Storage
Manager, before any associated CKD volumes are defined.
There is no command parameter to specifically define an LSS for open systems data. The
LSS definition is implied from the first two characters of the LUN identifier. The volume id of
0001 in Example 3-10 implies an LSS of 00.
Example 3-10 Implied LSS definition
dscli> lsfbvol
Date/Time: 26 July 2005 6:33:25 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
CMMCI9003W No FB Volume instances found in the system.
dscli>
dscli> mkfbvol -extpool p0 -cap 50 -name Test_50GB 0001
Date/Time: 26 July 2005 6:35:12 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
CMUC00025I mkfbvol: FB volume 0001 successfully created.
dscli> lslss
Date/Time: 26 July 2005 6:36:43 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
ID Group addrgrp stgtype confgvols
==================================
00 0 1 fb 1
dscli>
Some management actions in Metro Mirror, Global Mirror, or Global Copy operate at the LSS
level. For example, the freezing of pairs to preserve data consistency across all pairs, in case
you have a problem with one of the pairs, is done at the LSS level. With the option now to put
all or most of the volumes of a certain application in just one LSS, this can make the
management of remote copy operations easier. However, distributing logical volumes in an
LSS over multiple Extent Pools may make managing performance more difficult.
Tip: We recommend assigning one Rank to each Extent Pool, and don’t define LSSs that
span Extent Pools, in order to facilitate simpler performance management.
Figure 3-5 is an example of the relationship between logical volumes and Extent Pools.
Notice that volumes 0800, 0801... are backed by an Extent Pool with an even number (06),
and will be managed by server0 in the DS6000. These volumes are also associated with LSS
08 in our example.
[Figure 3-5: Logical volumes and Extent Pools. For example, volumes 0100 and 0101 (DB2 logs) belong to LSS X'01' and are managed by Controller 1; expansion enclosure 1750-EX1 id 11 contributes Rank 4/Extent Pool 4 and Rank 5/Extent Pool 5]
The DS6000 allocates each logical volume by aggregating the required number of available 1
GB Extents sequentially from the requested Extent Pool. Example 3-11 on page 74 shows a
50 GB logical volume being allocated from Extent Pool P0, which had 773 Extents available
before the allocation, and 723 available following the allocation.
dscli> lsextpool -l
Date/Time: 26 July 2005 6:35:23 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
Name ID stgtype rankgrp status availstor (2^30B) %allocated available reserved numvols numranks
====================================================================================================
ITSO_P0 P0 fb 0 exceeded 723 6 723 0 1 1
dscli>
Attention: It is important to balance the allocation of your logical volumes across both
storage servers (server0 and server1) to reduce the potential for I/O imbalance.
In the DS6000, host ports have a fixed assignment to a server (or controller card). In other
words, all data traffic that uses its preferred path to a server avoids having to cross through
the DS6000 inter-server connection to the other server. There is a small performance penalty
if data from a logical volume managed by one server is accessed from a port that is located
on the other server. The request for the logical volume and the data would have to be
transferred across the bridge interface that connects both servers. These transfers add some
latency to the response time. Furthermore, this interface carries other communication traffic
between the servers, such as being used to mirror the persistent memory and for other
inter-server communication. It could become a bottleneck if too many normal I/O requests
also run across it, although it is a high bandwidth, low latency, PCI-X connection.
Open systems hosts should ensure that they use multipath management software such as
IBM’s Multipath Subsystem Device Driver (SDD) that recognizes this preferred path usage
and can preferentially direct I/O requests to the preferred path.
When assigning host ports for open systems usage, always consider preferred pathing,
because the use of non-preferred paths will have a performance impact on your DS6000.
z/OS users already have this preferred path management capability inherent in the z/OS
operating system.
The modeling may be done using the Disk Magic modeling tool, as discussed in 4.1, “Disk
Magic” on page 86.
As is common for data placement and to optimize the DS6000 resources utilization, you
should:
Spread the logical disk allocations evenly across the two DS6000 servers by allocating
them equally from Extent Pools managed by server0 and server1, as this will balance the
I/O load distribution.
Spread the logical disks allocations for important applications across as many of the
DS6000 disks as possible.
Stripe your logical volumes across several Ranks when using a host based logical volume
manager.
Consider placing specific database objects (such as logs) on logical volumes that were
actually configured from different Ranks than those used for database user data and
tablespaces.
All disks in the storage subsystem should have roughly an equivalent utilization. Any disk that
is used more than the other disks is likely to become a bottleneck to performance. A practical
method is to make extensive use of host-based logical volume level striping across disk
drives.
DS6000 logical volumes are composed of Extents. An Extent Pool is a logical construct to
manage a set of Extents. One or more Ranks with the same attributes can be assigned to an
Extent Pool. One Rank can be assigned to only one Extent Pool.
Note: We recommend assigning one Rank per Extent Pool to control the placement of the
data. When creating a logical volume in an Extent Pool made up of several Ranks, the
Extents for this logical volume should be taken from the same Rank if possible. This
implies that you have no logical volumes that span Ranks, and can be done carefully with
the DSCLI. It is much more difficult to micromanage LUN placement while using the DS
GUI.
However, to be able to create very large logical volumes, you must consider having Extent
Pools that span more than one Rank.
Creating Extent Pools made up from one Rank, and then utilizing an open systems
host-based Logical Volume Manager (LVM) to stripe a host Logical Volume over LUNs
created on each Extent Pool, offers a balanced method to evenly spread open systems
data across the DS6000.
Note: z/OS does not provide any LVM functionality, and supports CKD volumes up to a
maximum size of 64 K cylinders (actually 64 K (65536) cylinders less 256, or 65280 cylinders).
Note: The logical volume stripe size has to be large enough to keep sequential data
relatively close together, but not so large that the data stays located on a single Array.
The recommended stripe sizes to define with your host’s logical volume manager are in
the range of 4 MB to 64 MB.
You should choose a stripe size close to 4 MB, if you have a large number of applications
sharing the Arrays, and a larger size, when you have very few host servers or applications
sharing the Arrays.
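As a minimal sketch of such host-level striping, assuming an AIX host on which four DS6000 LUNs from four single-Rank Extent Pools appear as hdisk4 through hdisk7 (supported strip sizes vary by AIX release):
# Volume group built from the four DS6000 LUNs, one per single-Rank Extent Pool
mkvg -y ds6kvg hdisk4 hdisk5 hdisk6 hdisk7
# Logical volume of 100 logical partitions, striped across all four LUNs
# with a 4 MB strip size
mklv -y datalv -S 4M ds6kvg 100 hdisk4 hdisk5 hdisk6 hdisk7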
We do not recommend sharing the PPRC paths with host I/O traffic.
The diagram shown in Figure 3-6 shows a zSeries z9 processor connected to a DS6000 with
two separate FICON Express2 paths to I/O ports that are each under the primary control of
separate DS6000 servers. This configuration is designed to spread the zSeries I/O load
across more resources in the DS6000 as well as enhancing data availability in the unlikely
event of a path failure.
[Figure 3-6: A zSeries host with a blue preferred path (FICON Express2) to Server 0 and a red preferred path (FICON Express2) to Server 1. Each server in the server enclosure has its own cache and NVS; Fibre Channel switches connect the servers to the 16 DDMs in the DS6800 server enclosure and in the expansion enclosure]
FICON channels in the IBM servers were initially operating at 1 Gbps. Subsequently the
FICON technology was enhanced to FICON Express channels in IBM 2064 and 2066
servers, operating at 2 Gbps, and further enhanced with FICON Express2 channels, which
also operate at 2 Gbps, but with an enhanced protocol, making them more efficient. See 10.7,
“FICON” on page 367 for a more detailed discussion of FICON Express2.
The recent announcement for IBM 2094 z9 servers also included FICON Express2
connectivity.
The DS6000 series provides 2 Gbps Host Adapter ports, which can be configured either as
FICON to connect to zSeries servers or as FCP ports to connect to Fibre Channel attached
open systems hosts. The example in Figure 3-6 shows only two FICON Express2 channels.
Two FICON Express2 channels have the potential to provide a bandwidth of approximately 2
x 175 MB/s, or an aggregate of 350 MB/second. This is still a conservative number. Some
I/O rates with 4 KB blocks are in the range of 35,000 I/Os per second or more with a single
DS6000 host port. A single FICON Express2 channel can actually perform up to about 9,000
read hit I/Os per second. Two FICON Express2 channels have the potential of over 13,000
I/Os per second with conservative numbers. These numbers will vary depending on the
server type used.
The ESS 750 has an aggregated bandwidth of about 500 MB/s for highly sequential reads
and about 350 MB/s for sequential writes. The DS6000 has achieved over 1000 MB/s with
64 KB data transfer reads and around 500 MB/s for sequential writes.
In a z/OS environment, a typical transaction workload might perform slightly better on an
ESS 800 Turbo II with a large cache configuration than on a DS6000. This is the only example
where the ESS 800 outperforms the DS6000. In open systems environments, the DS6000
performs better than the ESS 750. This is also true for sequential throughput in z/OS
environments.
The way to spread I/O is by assigning logical disks evenly to Extent Pools managed by each
server in the DS6000.
Sometimes though, you may want to dedicate an Extent Pool or several Extent Pools for a
given host server or application. The overall I/O performance in that case may not be as great
as spreading I/O evenly across all of the DS6000’s resources, but should still be predictable,
especially for the application (or host server) whose storage is isolated.
The DS6000 is very good at detecting I/O patterns. So if your environment does a lot of large
sequential file copying from A to B, you might want to split A-reads and B-writes to different
Extent Pools. Let the reads come from logical disks on one Extent Pool, and the writes go to a
separate set of Extent Pools.
The DS6000 is very good at detecting sequential I/O and adjusting its handling of it
accordingly; however, avoiding directing large reads and writes at the same Extent Pools,
and the underlying Ranks, will improve performance.
Tip: Try to strike a reasonable balance between flexibility and manageability for your
needs.
As you can see, the DS6000 gives you great flexibility when it comes to allocating logical
disk space, as shown in Figure 3-7.
For FB Ranks, logical disk sizes can vary from 1 GB to the full effective capacity of the Extent
Pool, in increments of 1 GB. A DS6000 Extent Pool with one Rank of 145.6 GB disk drives
has a full effective capacity of 773 GB, when configured as a 6+P+S RAID 5 Rank.
For CKD servers, logical disk sizes can vary from a 1 cylinder 3390 device (that is, 849.9 KB),
to a 64 K cylinder 3390 device (that is, 54.8 GB). For iSeries servers, there are presently six
different logical disk sizes supported, in both protected and unprotected mode. More details
on these two modes may be found in Chapter 11, “iSeries servers” on page 387.
SAN implementations
In a Storage Area Network (SAN) implementation, care must be taken in planning the
configuration to prevent the proliferation of disk devices presented to the attached hosts. In a
SAN environment, each path to a logical disk on the DS6000 presents that logical disk to the
host system as a unique physical device, leading to the requirement for a multipath manager,
such as the IBM Multipath Subsystem Device Driver to manage these different images of a
single logical disk. The SAN zones will also affect how many devices are presented to a
server.
The size of logical disks becomes very important when you want to re-assign DS6000 storage
capacity. For example, in an open systems environment, if you have a 200 GB logical disk,
and now you want to divide it into four 50 GB logical disks and assign them to different hosts,
you have to delete the original logical disk and wait for the DS6000 to return the Extents it
was occupying back to their Extent Pool before you can re-assign that space to new LUNs. If
you had chosen the four 50 GB LUNs originally, it would be a simpler process to re-assign
them to different host servers.
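As a sketch of that process (the volume IDs and name are illustrative), the original LUN is deleted and, once its Extents have returned to the pool, the four smaller LUNs are created in its place:
dscli> rmfbvol 0001
dscli> mkfbvol -extpool p0 -cap 50 -name split_#h 0004-0007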
In zSeries environments, we recommend that you use the bigger volumes (3390-9 and larger
devices), which you can do without compromising server performance if you use Parallel
Access Volumes (PAV). But as with open systems, if for any reason you later prefer to use a
different combination of capacity sizes within a specific Extent Pool, the extents will need to
be recovered before reuse.
Note: When allocating zSeries logical volumes, allow sufficient PAV addresses in each
LCU. This may require you to choose fewer, larger base devices to leave enough potential
alias addresses.
Tip: We recommend using dynamic PAV with the alias-to-base numbers shown in the table
in 10.2.2, “PAV and large volumes” on page 359.
Once created, logical disks can simply be deleted and removed from an Extent Pool without
any Rank or Array reformatting requirement on your part. Behind the scenes, the DS6000 will
recover these recently freed up Extents and return them to their Extent Pool.
The process of recovering Extents cannot be directly monitored but you will be able to
perform configuring operations from other, unaffected Extent Pools while the DS6000 is
processing freed up Extents.
The DS6000 supports a maximum of 256 host login IDs per Fibre Channel/FICON host
port, with a total of 1024 host logins per DS6000. This contrasts with the ESS 800, which
supports up to 128 host logins per adapter port and a maximum of 512 host logins per ESS,
and with the DS8000, which supports up to 509 host logins per adapter port and a maximum
of 8192 host logins per DS8000.
iSeries users are reminded that there is a maximum of 32 logical disks supported on each
Fibre Channel adapter in an iSeries server.
These numbers are important when considering the implementation of DS6000 Copy
Services, the maximum number of hosts to attach to a given DS6000, and the number of
logical disks to assign to each host.
When considering which logical disk size to use, it is also important to consider that the
DS6000 attachment type a host uses will limit the number of logical disks that can be
presented to the host. (Typically, a SCSI-attached host operating system can support 8 or 32
logical disks per adapter, while the limit for FCP-attached hosts is typically 256 logical disks.)
There is a performance penalty if data from a logical volume managed by one server is
accessed from a port that is located on the other server. The request for the logical volume
and the data would have to be transferred across the bridge interface that connects both
servers. These transfers add some latency to the response time. Furthermore, this interface
is also used to mirror the persistent memory and for other inter-server communication. It
could become a bottleneck if too many normal I/O requests ran across it, although it is a high
bandwidth, low latency connection.
(Figure: DS6000 controller architecture. Each of the two controller cards, Server0 and
Server1, contains a host adapter chipset, a PowerPC processor, volatile and persistent
memory, and a device adapter chipset; the two cards are connected by the processor
interconnect.)
If you need more than two paths from a host to the DS6000, spread the attached host I/O
paths evenly between the two sets of HAs on the DS6000 servers. This will ensure that you
achieve good aggregate I/O bandwidth, and that the host retains adequate access to its data.
We recommend the inclusion of some supported multipath management software, such as
the Multipath Subsystem Device Driver, which is included with each DS6000 and is
discussed briefly below in “Multipathing software” on page 83.
For best reliability and performance, it is recommended that each attached host has two
connections, one to each controller as depicted in Figure 3-9 on page 83. This allows it to
maintain connection to the DS6000 through both a controller failure and Host Adapter (HBA
or HA) failure.
We recommend that more than one SAN switch be provided to ensure continued availability.
For example, four of the eight fibre ports in a DS6000 could be configured to go through each
of two directors. The complete failure of either director leaves half of the paths still operating.
Multipathing software
Each host that is using more than a single path to the DS6000 requires a mechanism to allow
the attached operating system to manage multiple paths to the same device, and to also
show a preference in this routing so that I/O requests for each LUN are directed to the
preferred controller. Also, when a controller failover occurs, attached hosts that were routing
all I/O for a particular group of LUNs (LUNs on either even or odd LSSs) to a particular
controller (because it was the preferred controller) must have a mechanism to allow them to
detect that the preferred path is gone. It should then be able to re-route all I/O for those LUNs
to the alternate, previously non-preferred controller. Finally, it should be able to detect when a
controller comes back online so that I/O can now be directed back to the preferred controller
on a LUN by LUN basis (determined by which LSS a LUN is a member of). The mechanism
that will be used varies by the attached host operating system, as detailed in the next two
sections.
The Subsystem Device Driver (SDD) is available at:
http://www.ibm.com/servers/storage/support/software/sdd
SDD provides availability through automatic I/O path failover. If a failure occurs in the data
path between the host and the DS6000, SDD automatically switches the I/O to another
available path for that host. SDD will also set the failed path back online after a repair is made.
SDD also improves performance by sharing I/O operations to a common disk over multiple
active paths to distribute and balance the I/O workload for some open systems environments.
SDD also supports the use of the DS6000 preferred path to a LUN.
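As a brief illustration, the SDD datapath command can be used to verify that all paths are
available and that I/O is being routed to the preferred controller; this is a sketch, and the
exact output varies by host platform and SDD release:

datapath query adapter    (list host adapters and their state)
datapath query device     (list each vpath device, its paths, and path selection counts)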
SDD is not available for all supported open operating systems, so attention should be directed
to the IBM TotalStorage DS6000 Host Systems Attachment Guide, GC26-7680, and the
interoperability Web site for direction as to which multipathing software will be required.
Some devices, such as the IBM SAN Volume Controller, do not require any multipathing
software because the internal software in the device already supports multipathing and
preferred path. The interoperability Web site is located at:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
These functions are part of the zSeries architecture and are managed by the channel
subsystem on the host and the DS6000.
Logical paths are established through the FICON port between the host and some or all of the
LCUs in the DS6000, controlled by the hardware configuration definition (HCD) for that host.
This happens for each physical path between a zSeries processor and the DS6000. There
may be multiple system images or Logical Partitions (LPARs) in a zSeries processor and
logical paths are established for each system image. The DS6000 then knows which FICON
paths can be used to communicate between each LCU and each host.
Provided you have the correct maintenance level, all major zSeries operating systems should
support preferred path (z/OS, z/VM®, VSE/ESA™, TPF).
For RMF and RMF Magic for zSeries, see 10.9, “DS6000 performance monitoring tools” on
page 371.
Note: Disk Magic is for the IBM representative to use. Nevertheless, DS6000 capacity
and sizing planning works best when both the customer and the IBM representative are
familiar with the tool. Customers should contact their IBM representative to do the Disk
Magic runs when planning their DS6000 hardware configurations.
Disk Magic for Windows is a product of IntelliMagic, licensed exclusively to IBM and IBM
Business Partners.
When modeling an open systems workload, you will always start by entering data into the
Disk Magic dialogs. This should not be a problem since the amount of data entry is minimal.
The performance information you need to gather in this case is I/O rate, transfer size, read
percentage and read hit ratio.
For AIX and iSeries, automated input is also available.
For z/OS workload modeling, Disk Magic can model performance at either the subsystem
level or device level. Subsystem level performance modeling was designed to get realistic
results quickly, with a minimal amount of data entry, which can be obtained from RMF reports.
The data can also be obtained using RMF Magic, which is a z/OS only tool, and will be
described in 10.9.8, “RMF Magic for Windows” on page 378. The output from RMF Magic can
be used as an automatic input for Disk Magic, so no manual data entry needs to be done.
RMF Magic can produce the data at the disk subsystem level, the LCU level for each disk
subsystem, or the device/volume level. This Disk Magic input data is called the DMC file.
Disk Magic contains advanced algorithms that can substitute data that normally would have to
be entered manually, for both z/OS and open systems modeling. For instance, if cache
statistics are not provided, then the Automatic Cache Modeling feature will generate realistic
values based on other inputs provided.
Disk Magic is good for modeling random workloads. For sequential workloads Disk Magic
tends to be too optimistic.
When working with Disk Magic, always make sure to feed in accurate and representative
workload information, because Disk Magic results depend on the input data provided. Also
carefully estimate future demand growth, as this will be fed into Disk Magic for modeling
projections on which the hardware configuration decisions will be made.
Once the valid base model is created, you proceed with your projections. Essentially, you will
be changing hardware configuration options of the base model to decide on the best
DS6000 configuration for a given workload. Or you can modify the workload values that you
initially entered so that, for example, you can see what happens when your workload grows
or its characteristics change.
In this section we provide an overview of the more relevant dialog panels that Disk Magic
presents when doing zSeries modeling, and we discuss what information to complete in
those panels.
Here we start with a disk subsystem where the workload is currently running.
Hardware configuration
Figure 4-2 on page 89 shows the Disk Subsystem - DSS1 dialog window with the ESS F20
disk subsystem as an example. The General tab is used to enter hardware information like
the Hardware Type (that basically identifies the base model machine), as well as the cache
and NVS size information. In this dialog panel, the number of logical control units (LCUs)
within the disk subsystem is entered; if this is an existing ESS, then this would be the number
of CKD LSSs. The Parallel Access Volume box must be checked to indicate if aliases are
used for this ESS.
By clicking Hardware Details we open an ESS Configuration Details dialog window, shown in
Figure 4-3. This dialog window is used to provide further information about the disk
subsystem hardware configuration. The fields displayed in this dialog window will depend on
the hardware type (see Figure 4-2) that was initially selected. In this ESS F20 example, we
need to choose how many Host Adapters and Device Adapters are configured with the ESS.
Also we need to select the cache size. The NVS size is not selectable, because it is a fixed
size for the ESS.
The number of 8-packs will be calculated based on the number of logical volumes that will be
defined in the zSeries disk tab.
Interfaces panel
Figure 4-4 on page 90 shows the panel where you define the interface connections used
between the server and the disk subsystem. If Remote Copy is used, you should define the
Remote Copy function used, the connection to the Remote Copy site, and the distance
between the Primary and Secondary site.
zSeries workload
Figure 4-5 on page 91 is where you enter the workload characteristics of all the LCUs
associated to the ESS F20. There is a tab for every LCU defined in the General Panel. The
workload characteristics include:
I/O Rate
IOSQ Time
Pending Time
Disconnect Time
Connect Time
If Remote Copy is used, you can also define the percentage of the total workload that is in
a Remote Copy relationship.
Hardware configuration
Figure 4-9 on page 95 shows the Disk Subsystem - DSS1 dialog window with the DS6800
disk subsystem. The General tab is used to enter hardware information like the Hardware
Type (that basically identifies the base model machine), as well as the cache and persistent
memory size information.
Interfaces panel
Figure 4-10 on page 96 shows the panel where you define the interface connections used
between the server and the disk subsystem. If Remote Copy is used, you should define the
Remote Copy function used, the connection to the Remote Copy site, and the distance
between the Primary and Secondary site. Note that the distance here is defined in kilometers.
If you click Utilizations, you will get a new window that shows the utilizations of the various
components of the DS6800. Any resource that is a bottleneck will be shown with a red
background. If this happens, you should increase that resource to resolve the bottleneck. An
amber color should be considered a caution that if the workload grows, you may soon reach
the limit of that particular resource.
The main projection we should do is how the service time will change with the growth in the
workload. Figure 4-12 on page 98 shows the graph of the service time as the I/O rate grows.
In this particular example, we see that the response time jumps significantly as the I/O rate
approaches 4700 I/Os per second.
(Figure 4-12: Service time in msec plotted against I/O rates from 1700 to 4700 I/Os per
second.)
Next we should plot the HDD/DDM utilization. In Figure 4-13 on page 99 you can see that at
4700 IO/second this number reaches 93%. An HDD/DDM utilization greater than 50% will
have an impact on the service time.
With Disk Magic it is easy to do a reconfiguration and see its impact. For example, we
can increase the number of DDMs and observe the impact on the service time.
(Figure 4-13: HDD/DDM utilization in percent plotted against I/O rates from 1700 to 4700
I/Os per second.)
z/OS environment
For each control unit to be modelled (current and proposed), we need:
Control unit type and model
Cache size
NVS size
DDM size and speed
Number of channels
PAV: is it installed or not
For PPRC:
Distance
Number of links
In a z/OS environment, the SMF record types 70 through 78 are required to do a Disk Magic
modeling. The easiest way to send the SMF data to IBM is through ftp. To avoid huge dataset
sizes, the SMF data can be separated by SYSID or by date. The SMF dataset needs to be
tersed before putting it on the ftp site. Example 4-1 on page 100 shows the instructions on
how to terse the dataset.
UNIX environment
In a UNIX environment, we need the following information for each control unit to be included
in this study:
Vendor, machine and model type, for example IBM 2105-F20
How many disks are installed
How many servers are allocated and sharing these disks
What is the size and speed of the DDMs, for example 36 GB 15K rpm drives
Are there any issues regarding performance?
Number of SCSI channels, number of Fibre channels on each disk control unit and on
each server
Cache size
Direct attached or SAN attached
Data collection
For all servers attached to the disk to be modeled, we will need to collect iostat data. If there
are more than five servers with significant workload, we will need to determine an appropriate
strategy for collecting data from only a subset of these servers and extrapolating those
workload characteristics.
The data collection is done by setting up an iostat run with the appropriate flags set, so that
the proper data comes out of the report. Below are examples of the commands that should
be used.
Sample command for HP-UX; this one gives only total I/Os and average block size:
sar -d 900 10
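For AIX and Linux hosts, an iostat run along the following lines is a reasonable sketch; the
900-second interval matches the sar example above, while the sample count of 96 (covering
24 hours) and the output file name are illustrative:

iostat -d 900 96 > iostat_hostname.out    (disk-only statistics at 15-minute intervals)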
Note: Capacity Magic for Windows is a product of IntelliMagic, licensed exclusively to IBM
and IBM Business Partners. It may be used only by IBMers on IBM equipment or IBM
Business Partners on IBM Business Partner equipment. In particular, Capacity Magic
cannot be left with clients.
The purpose of the tool is to aid in planning the correct number of disk features that must
be ordered to meet client requirements for disk storage capacity, taking into consideration
the client’s choices of disk drive module capacity, RAID protection, and server platform.
Clients who want to have their DS6000 disk requirements computed by Capacity Magic
should ask their IBM representative or IBM Business Partner to perform this analysis.
With these options and combinations of options, it becomes a challenge to calculate the
physical capacity (also referred to as raw capacity) and effective capacity (also referred to as
usable capacity or net capacity) of a DS6000. The difference between physical and effective
capacity is taken up by:
Spare volumes
RAID 5 parity volumes
RAID 10 mirror volumes
Capacity Magic, an intuitive, easy-to-use tool that runs on a Windows 2000 or Windows XP
personal computer, is available to do these physical capacity and effective capacity
calculations, taking into consideration all applicable configuration rules.
4.2.2 Wizard
The configuration wizard allows you to quickly generate a configuration. For each server
platform, you may specify only one DDM type and one RAID type. The wizard allows you to
specify the effective capacity required for each server platform, and it will generate the
physical capacity to meet that requirement. For open systems and iSeries servers, you
specify the effective capacity in gigabytes. For zSeries servers, you specify the effective
capacity in terms of the numbers of each 3390 model type required.
Once you have completed going through the windows of the configuration wizard, Capacity
Magic displays the graphical interface to show how the DS6000 is physically configured. You
may then modify your configuration if desired, for example to include multiple DDM and RAID
types for a server platform, which is not supported in the wizard. Finally, you select the Report
tab to see the detailed reports.
Although there are actually other alternatives for creating Extent Pools on the DS6000, these
are the two most common options and are the only ones offered in Capacity Magic.
Then, for each Array Site (four DDMs on the DS6000), you specify:
DDM type
– 73 GB
– 146 GB
– 300 GB
RAID type
– RAID 5
– RAID 10
Server platform
– zSeries
– Open systems
– iSeries
Each number in the graphical interface (1 through 56) designates one disk drive set (4 DDMs)
and indicates the order in which you must configure them. Each drive set forms an Array Site.
An Array can be formed from one or two Array Sites.
Note: Effective August 30, 2005, the number of drive sets on the DS6000 is limited to a
maximum of 32 drive sets (128 DDMs).
4.2.4 Reports
After inputting the configuration, you switch to the Report tab, which provides a number of
reports on capacity metrics.
Each of these metrics is provided for all RAID Arrays and also split out by:
DDM type
RAID type
For zSeries capacity, there is an allocated logical device report, which shows the logical
device counts per type, as specified in the zSeries logical device configuration dialog window.
For zSeries capacity, there is also a zSeries logical device space report, showing the number
of RAID arrays and extents, and the maximum number of logical devices which could be
specified for each logical device type, assuming all logical device types were that one type.
These numbers are shown for all RAID Arrays and also split out by:
DDM type
RAID type
4.2.5 Examples
We will now see examples of the Capacity Magic wizard, graphical interface, and reports.
Figure 4-14 Specify effective capacity for zSeries servers in terms of 3390 volumes
Graphical interface
Figure 4-15 on page 105 shows the blank graphical interface for the DS6800 with space for
56 drive sets, before any DDMs are specified.
The IBM TotalStorage Productivity Center offering is a powerful set of tools designed to help
simplify the management of complex storage network environments. The IBM TotalStorage
Productivity Center consists of TotalStorage Productivity Center for Disk, TotalStorage
Productivity Center for Replication, TotalStorage Productivity Center for Data (formerly Tivoli
Storage Resource Manager) and TotalStorage Productivity Center for Fabric (formerly Tivoli
SAN Manager).
This section presents the IBM TotalStorage Productivity Center for Disk, which is the
component used to collect and monitor performance for the IBM TotalStorage DS6000. It
can be invoked from the IBM TotalStorage Productivity Center launch pad by double-clicking
the Manage Disk Performance and Replication icon as shown in Figure 4-21 on page 110.
Disk system monitoring and configuration needs are best covered by a comprehensive
management tool such as the TotalStorage Productivity Center for Disk. The requirements
addressed by the TotalStorage Productivity Center for Disk are shown in Figure 4-22 on
page 111.
In a SAN environment, multiple devices work together to create a storage solution. The
Productivity Center for Disk provides integrated administration and optimization for
interacting SAN devices, including:
IBM TotalStorage DS4000 family
IBM TotalStorage Enterprise Storage Server
IBM TotalStorage DS8000 and DS6000 Series
IBM TotalStorage SAN Volume Controller
It provides an integrated view of the underlying system so that administrators can drill down
through the virtualized layers to easily perform complex configuration tasks and more
productively manage the SAN infrastructure. Because the virtualization layers support
advanced replication configurations, the Productivity Center for Disk product offers features
that simplify the configuration and monitoring. In addition, specialized performance data
collection, analysis, and optimization features are provided.
As the SNIA standards mature, the Productivity Center view will be expanded to include
CIM-enabled devices from other vendors, in addition to IBM storage. Figure 4-23 on page 112
represents the Productivity Center for Disk operating environment.
The Productivity Center for Disk layers are open and can be accessed via GUI, CLI, or
standards-based Web Services. The Productivity Center for Disk provides the following
functions:
Device Manager
Performance Manager
Device Manager
The Device Manager is responsible for the discovery of supported devices; collecting asset,
configuration and availability data from the supported devices; and providing a limited
topology view of the storage usage relationships between those devices.
The Device Manager builds on the IBM Director discovery infrastructure. Discovery of storage
devices adheres to the SNIA SMI-S specification standards. Device Manager uses the
Service Location Protocol (SLP) to discover SMI-S enabled devices. The Device Manager
creates managed objects to represent these discovered devices. The discovered managed
objects are displayed as individual icons in the Group Contents pane of the IBM Director
Console as shown in Figure 4-24 on page 113.
Device Manager provides a subset of configuration functions for the managed devices,
primarily LUN allocation and assignment. These services communicate with the CIM Agents
that are associated with the particular devices to perform the required configuration. Devices
that are not SMI-S compliant are not supported.
The Device Manager health monitoring keeps you aware of all hardware status changes in
the discovered storage devices. You can drill down into the status of the hardware device, if
applicable. This enables you to understand which components of a device are malfunctioning
and causing an error status for the device.
Performance Manager
The Performance Manager function provides the raw capabilities of initiating and scheduling
performance data collection on the supported devices, of storing the received performance
statistics into database tables for later use, and of analyzing the stored data and generating
reports for various metrics of the monitored devices. In conjunction with data collection, the
Performance Manager is responsible for managing and monitoring the performance of the
supported storage devices. This includes the ability to configure performance thresholds for
the devices based on performance metrics, the generation of alerts when these thresholds
are exceeded, the collection and maintenance of historical performance data, and the
creation of gauges, or performance reports, for the various metrics to display the collected
historical data to the end user. The Performance Manager enables you to perform
sophisticated performance analysis for the supported storage devices.
Functions
TotalStorage Productivity Center for Disk provides the following functions:
There is a user interface that supports threshold setting, enabling a user to:
– Modify a threshold property for a set of devices of like type.
– Modify a threshold property for a single device.
– Reset a threshold property to the IBM-recommended value (if defined) for a set of
devices of like type. IBM-recommended critical and warning values will be provided for
all thresholds known to indicate potential performance problems for IBM storage
devices.
– Reset a threshold property to the IBM-recommended value (if defined) for a single
device.
– Show a summary of threshold properties for all of the devices of like type.
View performance data from the Performance Manager database.
Gauges
The Performance Manager supports a performance-type gauge. The performance-type
gauge presents sample-level performance data. The frequency at which performance data is
sampled on a device depends on the sampling frequency that you specify when you define
the performance collection task. The maximum and minimum values of the sampling
frequency depend on the device type. The static display presents historical data over time.
The refreshable display presents near real-time data from a device that is currently collecting
performance data.
The Performance Manager enables a Productivity Center for Disk user to access recent
performance data in terms of a series of values of one or more metrics associated with a finite
set of components per device. Only recent performance data is available for gauges. Data
that has been purged from the database cannot be viewed. You can define one or more
gauges by selecting certain gauge properties and saving them for later referral. Each gauge
is identified through a user-specified name and, when defined, a gauge can be started, which
means that it is then displayed in a separate window of the Productivity Center GUI. You can
have multiple gauges active at the same time. Gauge definition is accomplished through a
wizard to aid in entering a valid set of gauge properties. Gauges are saved in the Productivity
Center for Disk database and retrieved upon request. When you request data pertaining to a
defined gauge, the Performance Manager builds a query to the database, retrieves and
formats the data, and returns it to you. When started, a gauge is displayed in its own window,
and it displays all available performance data for the specified initial date/time range. The
date/time range can be changed after the initial gauge window is displayed.
You will need to provide a TCP/IP connection between the DS6000 and the IBM TotalStorage
Productivity Center so that performance information, in particular, can be sent from the
DS6000 to the IBM TotalStorage Productivity Center. When IBM TotalStorage Productivity
Center receives information from the DS6000, it is stored in tables within a DB2 database.
Thus, you can prepare and produce customized reports using traditional DB2 commands.
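As a hedged sketch of such a report, the DB2 command line processor can be used along
these lines; the database name TPCDB and the table and column names are hypothetical
placeholders, and the actual schema should be taken from the TotalStorage Productivity
Center documentation:

db2 connect to TPCDB
db2 "SELECT DEVICE_ID, SAMPLE_TIME, TOTAL_IO_RATE FROM PERF_SAMPLE ORDER BY SAMPLE_TIME"
db2 connect reset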
Installation environment
The storage management components of IBM TotalStorage Productivity Center can be
installed on a variety of platforms. However, for the IBM TotalStorage Productivity Center
suite, when all four manager components are installed on the same system, the only common
platforms for the managers are:
Windows 2000 Server with Service Pack 4
Windows 2000 Advanced Server
Windows 2003 Enterprise Server Edition
Hardware requirements:
Dual Pentium® 4 or Xeon™ 2.4 GHz or faster processors
Installation process
IBM TotalStorage Productivity Center provides a suite installer that helps guide you through
the installation process. You can also use the suite installer to install the components
standalone. One advantage of the suite installer is that it will interrogate your system and
install required prerequisites.
The suite installer will install the requisite products or components for IBM TotalStorage
Productivity Center for Disk in this order:
DB2 (required by all the managers)
IBM Director
WebSphere® Application Server
After you have completed the installation of IBM TotalStorage Productivity Center for Disk,
you need to install and configure the Common Information Model Object Manager (CIMOM)
and the Service Location Protocol (SLP) agents.
The IBM TotalStorage Productivity Center for Disk uses SLP as a method for the CIM client to
locate managed objects. The CIM client may have built-in or external CIM agents. When a
CIM agent implementation is available for a supported device, the device may be accessed
and configured by management applications using industry-standard XML-over-HTTP.
If you want the DS8000, DS6000, ESS, SAN Volume Controller, or FAStT storage
subsystems to be managed using IBM TotalStorage Productivity Center for Disk, you must
install the prerequisite I/O Subsystem Licensed Internal Code and CIM Agent for the devices.
For detailed installation, refer to the IBM redbook Managing Disk Subsystems using IBM
TotalStorage Productivity Center, SG24-7097.
Installation consideration
For general performance considerations, if you are installing the CIM agent for the DS6000,
you must install it on a separate machine from the Productivity Center for Disk code, as shown
in Figure 4-25. Attempting to run a full TotalStorage Productivity Center implementation
(Device Manager, Performance Manager, Data Manager, Replication Manager, DB2, IBM
Director and the WebSphere Application Server) on the same host as the CIM agent will
result in dramatically increased wait times for data retrieval.
To collect performance data, you have to create data collection tasks for the supported,
discovered storage devices. To create a task, you have to specify:
A task name
A brief description of the task
The sample frequency in minutes
The duration of the data collection task (in hours)
Figure 4-28 on page 119 is an example of the panel you should see when you create the
performance Data Collection task on a DS6000.
Note: All figures below are taken from the DS8000, but the DS6000 view is the same.
Once the task is created, you can execute it immediately or define a scheduled job for this
task.
Note: Data collection tasks can be defined with one or more storage devices. By selecting
several storage devices, you can collect performance data from different storage
subsystems. In this case, the collection data will have the same characteristics (sample
frequency and duration of data collection).
To check the status of all defined tasks, you can access the Task Status. This tool gives
details of tasks including task status (for example, running or completed), device ID, device
status, error message ID, and error message.
Gauges are used to drill down to the level of detail necessary to isolate performance issues
on the storage device. To view information collected by the Performance Manager, a gauge
must be created or a custom script written to access the DB2 tables and fields directly.
The metrics available change depending on whether the cluster, Rank Group or volume items
are selected.
Performance
DS6000 Server performance gauge values provide details on:
Total I/O rate: Cluster level (server0/server1) total I/O request rate
Figure 4-32 on page 123 presents an example of graphical output you get from TotalStorage
Productivity Center for Disk.
Exceptions
Exception gauges display data only for those DS6000 active thresholds that were crossed
during the reporting period. One Exception gauge displays threshold exceptions for the
entire storage device, based on the thresholds active at the time of the collection.
DS6000 thresholds
Thresholds are used to determine the high watermarks for warning and error indicators for
the storage subsystem. Figure 4-33 shows the available Performance Manager thresholds
for the DS6000.
You may only enable a particular threshold once minimum values for warning and error levels
have been defined. If you attempt to select a threshold and enable it without first modifying
this value, you will see a notification like Figure 4-34.
Tip: In TotalStorage Productivity Center for Disk, default threshold warning and error
values of -1.0 are indicators that there is no recommended minimum value for the
threshold, which is therefore entirely user defined. You may elect to provide any reasonable
value for these thresholds, keeping in mind the workload in your environment.
To modify the warning and error values for a given threshold, select the threshold and click
the Properties button. The panel in Figure 4-35 will be shown. You can modify the threshold
as appropriate and accept the new values by clicking OK.
Before exploiting gauges, collect samples over an appropriate time frame that covers both
high and low I/O rates.
For creating the gauge, launch the Performance gauges panel as shown in Figure 4-36 by
right-clicking on the DS6000 device.
Click Create to create a new gauge. You will see a panel similar to Figure 4-37. On the left
pane of this panel, you can choose to create a performance gauge at the Cluster
(server0/server1) level, the Rank Group (Rank) level, or the Volume level.
For example, when we select Cluster level, Total I/O, Reads/second, Writes/second and AVG
response time appear in the metrics box, and the pair of clusters appears in the component
box.
Enter the name and the description of the gauge. Select Data points or Date range to show
the historical data collection sampling period and check Display gauge. Upon clicking OK, we
can get the next panel as shown in Figure 4-38 on page 126.
The subsequent panel is shown in Figure 4-39 on page 127. When Rank level is selected on
the left pane, with the component RANK 0 and the metric Average response time as circled
on the figure, the resulting gauge is shown in Figure 4-40 on page 127.
The IBM TotalStorage Productivity Center for Disk helps you to see the overall performance of
your DS6000s. It supplies information at the DS6000 subsystem level; it does not directly
connect the host view with the disk subsystem view. Using it in conjunction with host system
monitors and available performance tools, you will receive the necessary picture of your
DS6000’s performance.
In general we recommend:
Use the Cluster level gauge and identify if the DS6000’s servers are busy and persistent
memory is sufficient.
Use the Rank Group level gauge to identify if the Rank or internal bus is busy.
Use the Volumes level gauge to identify the type of workload and to verify persistent
memory full conditions on logical disk (volume) level.
Use the threshold to identify the most recent threshold exceptions.
The Rank Group analysis shows how busy the DDMs are on your DS6000 at the RAID-Array
level. This information helps to determine where the most accessed data is located and what
performance you are achieving from the RAID Array. Rank Group analysis reports the global
performance workload of all the volumes defined in the selected RAID Array. Here is the
information provided:
Total I/O: DS6000 lower interface total I/O per second
Reads/second: DS6000 lower interface reads per second
Writes/second: DS6000 lower interface writes per second
Average response time: Average response time for the lower interface in milliseconds
This information accounts for all I/O activity against the DDMs, including host reads and
writes along with staging (disk-to-cache) and destaging (cache-to-disk) transfers.
Read and write requests per second at Rank level
This information shows the number of I/O requests made by server0 and server1 for
this Rank, including both read and write requests. This number is an indication of internal
DS6000 operations between the server cache and the Rank. This value is a sum of all read
and write activities applied to all volumes defined on this Rank. Analyzing reports of all
volumes defined on this Rank helps you to understand which host server generates the
workload; to monitor the workload generated by a host server, use the volumes report.
Average response time at Rank level
This information shows an average response time to complete each Rank request. The
number is not the average response time of a single DDM. Since a DS6000 Server makes
various kinds of I/O requests, some of these requests access all DDMs in a Rank, while
others may involve just one DDM. Taking this into consideration, you cannot use this
number to measure performance without knowing how each cluster makes I/O requests to
the Rank.
Consider that for read activity (4 KB I/O size), a DS6000 Rank can process more than:
1700 operations per second (using 15 Krpm DDM disks)
1200 operations per second (using 10 Krpm DDM disks)
In general, the DS6000 should show good performance when the average Rank read
response time is about 10 ms. Generally, the average response time should not exceed 35
ms.
There is a relationship between Rank operations, cache hit ratio, and percentage of read
requests. When the cache hit ratio is low, this indicates that the DS6000 has frequent
transfers from DDMs to cache (staging).
When percentage of read requests is high and cache hit ratio is also high, most of the I/O
requests can be satisfied without accessing the DDMs due to the SARC prefetching
algorithm.
When the percentage of read requests is low, the DS6000 write activity to the DDMs can be
high. This indicates that the DS6000 has frequent transfers from cache to DDMs (destaging).
Comparing the performance gauges of different Ranks helps you to understand whether your
global workload is spread equally across the DDMs of your DS6000. Spreading data across
multiple Ranks increases the number of DDMs used and optimizes the overall performance.
Important: Limiting write workload to one Rank can increase the persistent memory
destaging execution time and so impact all write activities on the same DS6000 server.
You also have to check the % DASD fast write delay due to NVS metrics at the volume
level gauges.
To avoid this situation, you should spread write I/O across multiple Ranks, add more
DS6000 servers, or consider replacing the DS6000 with a DS8000.
Note: Here metrics have been grouped with those metrics that have the same unit of
measurement and therefore can be combined and displayed using a single gauge.
Analysis of volume level metrics will show how busy the volumes are on your DS6000. This
information helps you to:
Determine where the most accessed data is located and what performance you get from
the volume.
Understand the type of workload your application generates (sequential or random, read
or write operation ratio).
Determine the cache benefits for read operations (SARC prefetching algorithm).
Determine any cache bottleneck for write operations.
I/O requests per second tells you how many I/O requests are processed. By comparing I/O
rates at the cluster level, Rank level, and volume level, it is possible to identify where the
greatest demand is occurring for storage system resources.
Whether an I/O rate is low or high depends on your I/O workload type. However, consider
that the performance of a DS6000 volume is limited by the performance of the Rank where it
is defined. To exceed the performance limitation of a Rank, you should create a volume
based on extents which belong to several DS6000 Ranks.
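A hedged DS CLI sketch of this approach follows; it assumes an extent pool populated with
two Ranks so that a volume's extents can come from both, and the pool ID, Rank IDs, and
capacity are illustrative:

dscli> mkextpool -rankgrp 0 -stgtype fb multi_rank_pool    (returns a pool ID, for example P4)
dscli> chrank -extpool P4 R0                               (assign the first Rank to the pool)
dscli> chrank -extpool P4 R1                               (assign the second Rank to the pool)
dscli> mkfbvol -extpool P4 -cap 100 0300                   (the volume draws extents from both Ranks)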
There are several host factors that can affect the I/O requests per second:
I/O contention on the host systems
Parallel Access Volume (PAV) for z/OS sharply reduces volume contention (IOSQ) within a
system. Multiple reads are executed in parallel and multiple writes are executed in parallel
and serialized by extent specification in the Define Extent command. You need to perform
additional subsystem tuning actions if your operating system does not allow parallel reads.
You can check the performance measurement tool available at the host system to
determine if there is a performance bottleneck on the host system side.
The IBM Subsystem Device Driver may not be installed
If you have configured your hosts to have multiple paths to the same DS6000 server, you
have to make sure that you have installed the IBM Subsystem Device Driver (SDD), which
comes with your DS6000. SDD can balance the workload among the paths to the same
server, though not between the two servers, and drive more I/O requests to the DS6000. For
further information, refer to IBM TotalStorage Subsystem Device Driver User’s Guide,
SC26-7478.
To get useful reference data, we recommend that you monitor your subsystem during regular
work days and peak workload activities when there are no reported user issues or
performance constraints. If subsystem I/O performance degrades and user complaints and
response times increase, you can then compare the performance data from intervals of
normal operation with the data from the degraded intervals.
If applications use a database, this is different. A database management system has its own
caching mechanism using the host processor’s memory (also called the database buffer
pool). A database management system can defer the write operation until the modified data
occupies a certain amount of this buffer pool. The duration between a read and the
corresponding write burst to the DS6000 subsystem can be long. You may first see a high
read request rate, and then a lower one because of the write requests that follow later.
For a logical volume that has sequential files, you need to understand what kind of
applications access those sequential files. Normally, these are used for either read-only or
write-only at the time of use. The DS6000 pre-fetching SARC algorithm determines whether
the data access pattern is sequential or not. If the access is sequential, then contiguous data
is pre-fetched into cache in anticipation of the next read request.
The DS6000 has a 100 percent write hit function. Due to this function, all write activity goes
to DS6000 cache before being written to disk. Therefore, all write requests against volumes
are completed without accessing the DDMs. The most important policy to consider in
exploiting the write cache function is protecting data integrity. For this reason, the DS6000
maintains a two-secured-copies policy. This ensures that modified data is stored in two
different places in the DS6000, so that a single component failure does not cause the loss of
data.
When the DS6000 accepts a write request, it will process it without physically writing to the
DDMs. The data is written into both the cache of the server that owns the volume and the
persistent memory of the second server in the DS6000. Later, the DS6000 asynchronously
destages the modified data out to the DDMs.
The DS6000’s lower interfaces use switched Fibre Channel connections, which provide a high
data transfer bandwidth. In addition, the destage operation is designed to avoid the write
penalty of RAID 5, if possible. For example, there is no write penalty when the modified data
to be destaged is contiguous enough to fill the unit of a RAID 5 stride. However, when all of
the write operations are completely random across a RAID 5 rank and the DS6000 cannot
avoid the write penalty, you could get some I/O contention at the DDM level.
To get more details regarding RAID 5 and RAID 10 difference, refer to 2.8.5, “RAID 5 versus
RAID 10 performance” on page 39.
Disk-to-cache operations show the number of data transfers from disks to cache, referred to
as staging, for a specific volume. Disk-to-cache operations are directly linked to read activity
from hosts. Data requested for reads is first staged from backend disks into the cache of the
DS6000 server and then transferred to the host.
Read hits occur when all the data requested for a read data access is located in cache. The
DS6000 improves the performance of read caching by using SARC staging algorithms to
store in cache data tracks which have the greatest probability of being accessed by a read
operation.
Cache-to-disk operations show the number of data transfers from cache to disks, referred to
as destaging, for a specific volume. Cache-to-disk operations are directly linked to write
activity from hosts to this volume. Data written is first stored in the persistent memory (also
known as NVS) of the DS6000 server and then destaged to the backend disks. DS6000
destaging is enhanced automatically by striping the volume across all the DDMs in one or
several Ranks (depending on your configuration). This automatically provides load balancing
across the DDMs in the Ranks and eliminates hot spots.
The DASD fast write delay percentage due to persistent memory allocation gives us
information about the cache usage for write activities. The DS6000 stores data in the
persistent memory before sending an acknowledgement to the host. If the persistent memory
is full of data (no space available), the host will receive a retry for its write request. In parallel,
the server has to destage data stored in its persistent memory to the backend disks before
accepting new write operations from any host.
If one of your volumes is experiencing delayed write operations due to a persistent memory
constraint, you should move the volume to a less-used Rank, or spread the volume over
multiple Ranks (increasing the number of DDMs used). If this does not fix the persistent
memory constraint, you can consider adding more DS6000 servers or replacing the DS6000
with a DS8000.
Read hit ratio shows how efficiently your cache works on the DS6000. For example, a value
of 1.00 indicates that all read requests are satisfied from the cache. If the DS6000 cannot
complete an I/O request from the cache, it transfers data from the DDMs, suspending the I/O
request until it has read the data. This situation is called a cache miss. For a cache miss, the
response time includes not only the data transfer time between host and cache, but also the
overhead of staging data from the DDMs.
The read hit ratio depends on the characteristics of data on your DS6000 and applications
that use the data. If you have a database and it has the locality of reference, it will show a high
cache hit ratio, as most of the data referenced could remain in the cache. If your database
does not have the locality of reference, but it has the appropriate sets of indexes, it will also
show a high cache hit ratio, as the entire index could remain in the cache.
We recommend that you monitor the read hit ratio over an extended period of time:
If the cache hit ratio has been low historically, it is most likely due to the nature of your
data, and you do not have much control over this. You can first try to perform
de-fragmentation on a file system, making indexes if none exist, rather than considering
increasing the cache size.
If you have a high cache hit ratio initially, and it is decreasing as you load more data with
the same characteristics, then moving some data to the other cluster, so that it uses the other
cluster’s cache, or adding a server enclosure could improve the situation.
4.3.8 IBM TotalStorage Productivity Center for Disk and other tools
The IBM TotalStorage Productivity Center for Disk provides storage subsystem metric
performance data. We receive, for example, the number of I/O requests the DS6000 has
processed at different levels, the cache usage and the persistent memory use conditions. We
cannot get the host system view from the Productivity Center for Disk reports, like I/O activity
rate, I/O response time, or data transfer rate. If there is a performance problem with your
applications, you could see a delay of batch jobs and slower response times during online
transaction processing.
To determine if the I/O behavior is the reason for the problem, you need to gather the
information about I/O profiles on the host systems. For example, it is possible that one
application cannot get I/O services, while another application dominates I/O services. The I/O
response time and its breakdown for each of the logical volumes helps you to isolate the
source of the performance problem.
The following sections describe how to use host-based performance measurements and
reporting tools, in conjunction with the IBM TotalStorage Productivity Center for Disk under
UNIX, Linux, Windows 2000, iSeries and z/OS environments.
IBM TotalStorage Productivity Center for Disk Report and UNIX / Linux
Most application I/O requests against disk subsystems are through either database
management systems or file systems. It can be difficult to associate the application or
operating system I/O performance with that of the I/O subsystems directly. Why? Because
they have their own internal caching mechanisms. An I/O request from an application does
not always go directly to the I/O subsystems. You may see an I/O subsystem experiencing
poor performance while applications are not affected.
To get host information about I/O subsystems, CPU activities, virtual memory, and physical
memory use, you can use the following commands:
iostat
vmstat
sar
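Representative invocations are sketched below; the 300-second interval and the sample
count are illustrative, and the available flags vary between UNIX flavors:

iostat 300 12    (device throughput and CPU utilization, 5-minute samples)
vmstat 300 12    (virtual memory, paging, and CPU activity)
sar -d 300 12    (per-device activity rates)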
When you see a downward trend in the effective data transfer rate for a volume on a host
while the I/O request rate on a Rank or volume is going up, you need to perform further
analysis, even if you have already concluded that your DS6000 is not performing well. If you
have other host systems, you also need to check them, as the source of poor performance
could be at one of the other hosts. Cache-unfriendly applications or host systems could be a
reason for a high I/O request rate. For this reason, we suggest that you check cache reports
for all volumes that are behind the Rank showing high utilization.
If the volume you are concerned about is in the Rank, and other volumes in the same Rank
show poor cache statistics (such as low read hit ratio or low percent read requests), moving
the volumes to another Rank would be worth considering. This may relieve the performance
degradation condition, as the Rank or below level of I/O delay probably caused the situation.
Performance Monitor gives you the flexibility to customize the monitoring to capture various
categories of Windows system resources, including CPU and memory. You can also monitor
disk I/O through Performance Monitor.
The Performance Monitor shows the response time (Avg. Disk sec/I/O). The minimum monitor
interval is one second when you log performance data. A one-second response time may not
be a valid reflection of your system’s performance. When you use the Performance Monitor in
real time, you can set the monitor interval in increments of one millisecond. If you set the
monitor interval to one millisecond, the value will be closer to the actual response time.
Increasing the sample count will impact your system’s performance and it will also affect the
accuracy of these performance counters. This is most likely not acceptable for your
production applications. In addition, it is not as convenient for historical analysis, since the
real time monitor provides just one screen of data and it wraps around. Although you can log
the performance data, the data is saved at a minimum of one second intervals, so the values
may not be as accurate.
We suggest that you use the same approach as for a UNIX/Linux system, that is, to monitor
data transfer rate trends over an extended period of time.
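On Windows Server 2003, the same counters can also be logged from the command line
with the typeperf utility; this is a sketch, and the counters, interval, and output file shown are
illustrative:

typeperf "\PhysicalDisk(*)\Avg. Disk sec/Transfer" "\PhysicalDisk(*)\Disk Transfers/sec" -si 15 -sc 240 -o ds6000_disk.csv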
For further discussion, refer to Chapter 11, “iSeries servers” on page 387.
Before beginning the diagnostic process, you must understand your workload and your
physical configuration. You need to know how your system resources are allocated, as well as
understand your path and channel configuration for all attached servers.
Let us assume that you have an environment with a DS6000 attached to a z/OS host, an AIX
pSeries host, and several Windows hosts. You have noticed that your z/OS online users
experience a performance degradation between 7:30 a.m. and 8:00 a.m. each morning.
You may notice that there are 3390 volumes indicating high disconnect times, or high device
busy delay time for several volumes, in the RMF device activity reports. Unlike UNIX or
Windows, z/OS reports the response time and its breakdown into connect, disconnect,
pending, and IOS queuing times.
Disconnect time is an indication of cache miss activity or destage wait (due to persistent
memory high utilization) for logical disks behind the DS6000s.
Device busy delay is an indication that another system locks up a volume, and an extent
conflict occurs among S/390 hosts or applications in the same host when using Parallel
Access Volumes. The DS6000’s multiple allegiance or Parallel Access Volume capability
allows it to process multiple I/Os against the same volume at the same time. However, if a
read or write request against an extent is pending while another I/O is writing to the extent, or
if a write request against an extent is pending while another I/O is reading or writing data from
the extent, the DS6000 will delay the I/O by queuing. This condition is referred to as extent
conflict. Queuing time due to extent conflict is accumulated to device busy (DB) delay time.
An extent is a sphere of access; the unit of increment is a track; usually I/O drivers or system
routines decide and declare the sphere.
To determine the possible cause of high disconnect times, you should check the read cache
hit ratios, read-to-write ratios, and bypass I/Os for those volumes. If you see that the cache hit
ratio is lower than usual while you have not added other workload in your S/390 environment,
I/Os against open systems fixed block volumes might be the cause of the problem. Possibly
fixed block (FB) volumes defined on the same server had a cache-unfriendly workload, thus
impacting your S/390 volumes’ hit ratio.
To get more information about cache usage, you can check the cache statistics of the Fixed
Block volumes that belong to the same server. You may be able to point out the Fixed Block
volumes that have a low read hit ratio and a short cache holding time. If you can move the
workload of those cache-unfriendly volumes to the other server, or reschedule it, the hit ratio
of your S/390 volumes should improve.
The approaches for using other tools’ data in conjunction with the IBM TotalStorage
Productivity Center for Disk, as described in this chapter, do not cover all the possible
situations you will encounter. But if you understand how to interpret the DS6000
performance reports, and you also have a good understanding of how the DS6000 works,
then you will be able to develop your own ideas on how to correlate the DS6000 performance
reports with other performance measurement tools when approaching specific situations in
your production environment.
We will look at four example SAN configurations where SAN statistics may be beneficial for
monitoring and analyzing DS6000 performance.
The first example configuration, shown in Figure 4-42 on page 139, has host server Host_1
connecting to DS6000_1 through two SAN switches or directors (SAN Switch/Director_1 and
SAN Switch/Director_2). There is a single inter-switch link (ISL) between the two SAN
switches.
In this configuration, the performance data available from the host and from the DS6000 will
not be able to show the performance of the ISL. If, for example, the Host_1 adapters and the
DS6000_1 adapters do not achieve the expected throughput, the SAN statistics for utilization
of the ISL should be checked to determine whether it is limiting I/O performance.
(Figure 4-42: Host_1 attached through SAN Switch/Director_1 and SAN Switch/Director_2,
connected by a single ISL, to DS6000_1 and its storage enclosures.)
A second type of configuration in which SAN statistics can be useful is shown in Figure 4-43
on page 140. In this configuration, host bus adapters or channels from multiple servers
access the same set of I/O ports on the DS6000 (server adapters 1-4 share access to
DS6000 I/O ports 5 and 6). In this environment, the performance data available from only the
host server or only the DS6000 may not be enough to confirm load balancing, or to identify
each server’s contributions to I/O port activity on the DS6000, because more than one host is
accessing the same DS6000 I/O ports.
If DS6000 I/O port 5 is highly utilized, it may not be clear whether Host_A, Host_B, or both
hosts are responsible for the high utilization. Taken together, the performance data available
from Host_A, Host_B and the DS6000 may be enough to isolate each server connection’s
contribution to I/O port utilization on the DS6000; however, the performance data available
from the SAN switch or director may make it easier to see load balancing and relationships
between I/O traffic on specific host server ports and DS6000 I/O ports at a glance, because it
can provide real-time utilization and traffic statistics for both host server SAN ports and
DS6000 SAN ports in a single view, with a common reporting interval and metrics.
(Figure 4-43: Adapters 1 through 4 on two host servers sharing access to I/O ports 5 and 6
on a single DS6000 and its storage enclosures.)
SAN statistics may also be helpful in isolating the individual contributions of multiple DS6000s
to I/O performance on a single server. In Figure 4-44 on page 141, host bus adapters or
channels 1 and 2 from a single host (Host_A) access I/O ports on multiple DS6000s (I/O
ports 3 and 4 on DS6000_1 and I/O ports 5 and 6 on DS6000_2).
In this configuration, the performance data available from either the host server or from the
DS6000 may not be enough to identify each DS6000’s contribution to adapter activity on the
host server, because the host server is accessing I/O ports on multiple DS6000s. For
example, if adapters on Host_A are highly utilized or if I/O delays are experienced, it may not
be clear whether this is due to traffic that is flowing between Host_A and DS6000_1, between
Host_A and DS6000_2, or between Host_A and both DS6000_1 and DS6000_2.
The performance data available from the host server and from both DS6000s may be used
together to identify the source of high utilization or I/O delays; additionally, the SAN switch or
director can provide real-time utilization and traffic statistics for both host server SAN ports
and DS6000 SAN ports in a single view, with a common reporting interval and metrics.
(Figure 4-44: Adapters 1 and 2 on a single host accessing I/O ports 3 and 4 on DS6000_1
and I/O ports 5 and 6 on DS6000_2.)
(Figure: The fourth example configuration, a Remote Copy setup with DS6000_1 at the
primary site connected through ports 1 through 4 and two SAN switches or directors to
DS6000_2 at the secondary site.)
SAN statistics should be checked to determine whether there are SAN bottlenecks limiting
DS6000 I/O traffic. SAN link utilization and throughput statistics can also be used to break
down the I/O activity contributed by adapters on different host servers to shared storage
subsystem I/O ports. Conversely, SAN statistics can be used to break down the I/O activity
contributed by different storage subsystems accessed by the same host server.
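For example, on a Brocade switch, per-port throughput can be sampled from the switch CLI with the portperfshow command; a minimal sketch follows (the 5-second interval is illustrative):

portperfshow 5

This prints the throughput of every switch port once per interval, so a host server port and the DS6000 port it is zoned to can be watched side by side.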
For additional information about monitoring performance through a SAN switch or director,
see:
http://www.brocade.com
http://www.cisco.com
http://www.mcdata.com
The DS6800 Model 511 contains two controller cards with four ports each for a total of eight
host attachment ports. You can configure the DS6000 host attachment ports for either Fibre
Channel Protocol (FCP) or Fibre Connection (FICON) protocol. For zSeries host
attachment, the DS6000 does not support ESCON.
The DS6000 supports 1 Gbps and 2 Gbps connections. The DS6000 negotiates the
connection speed automatically and determines whether to run the link at 1 Gbps or
2 Gbps.
Fibre Channel connections are established between Fibre Channel ports that reside in I/O
devices, host systems, and the network that interconnects them. Each of the eight host
adapter ports available on the DS6000 has a unique worldwide port name (WWPN). You can
configure the port to operate with the SCSI-FCP upper-layer protocol or with the FC-AL
upper-layer protocol. The DS6000 can be configured with either shortwave small form factor
pluggables (SFPs) or with longwave SFPs to be installed on the host adapter ports, as
discussed in Chapter 1, “Model characteristics” on page 1. Fibre Channel adapters for
SCSI-FCP support provide the following configurations:
A maximum of eight host ports
A maximum of 8192 host logins per Fibre Channel port
A maximum of 2000 N-port logins per storage unit
Access to all 8,192 LUNs per target (one target per host adapter), depending on host type
Either arbitrated loop, switched fabric, or point-to-point topologies
As with the open systems connections, each of the two controller cards in the DS6000
contains four host adapter ports, and each port has a unique world wide port name (WWPN).
You can configure the port to operate with the FICON upper-layer protocol. When configured
for FICON, the Fibre Channel port supports connections to a maximum of 128 FICON hosts.
With FICON, the host adapter port can operate with fabric or point-to-point topologies.
Figure: Windows, AIX, Linux, and zSeries (FICON) hosts attached through a SAN fabric to the two DS6000 controller cards, each with a PowerPC chipset, volatile and persistent memory, and a device adapter chipset
5.2 Multipathing
For whichever host attachment method you use, we recommend that whenever possible, you
use two or more paths from each FCP or FICON host to the DS6000, and balance the host
connections across both controller cards. For the DS6000, it is important that host systems
have attachment to both controller cards. See 2.3, “DS6000 major hardware components” on
page 19 for details on preferred pathing and why connectivity to both controller cards is
important for the DS6000.
By attaching a host with redundant paths to the DS6000, you can increase availability by
avoiding single points of failure. Additionally, over and above preferred pathing considerations,
I/O performance can be improved by configuring multiple physical paths to groups of heavily
used volumes.
5.3 FICON
FICON is the Fibre Connection protocol used with zSeries servers (see 2.10.3, “FICON attachment” on
page 47). The connection speeds are 100-200 MB/s, similar to Fibre Channel for open
systems.
FICON channels were introduced in the IBM 9672 G5 and G6 servers with the capability to
run at 1 Gbps. These channels were later enhanced to FICON Express channels in the IBM
zSeries 800, zSeries 900, and zSeries 990 servers and the IBM System z9 109, and were
capable of running at transfer speeds of 2 Gbps. FICON Express2 channels are a new
generation of FICON channels that offer improved performance over previous
generations of FICON and FICON Express channels. They are supported on the
IBM eServer zSeries 990 (z990), zSeries 890 (z890), and IBM System z9 109. A
comparison of the overall throughput capabilities of various generations of channel
technology is shown in Figure 5-2.
Figure 5-2 Channel throughput comparison of ESCON, FICON, 2 Gbps FICON Express, and FICON Express2 channels on G5, G6, z890, z990, and z9 servers
As you can see, the FICON Express2 channel, first introduced on the zSeries z890 and
z990, represents a significant improvement in both 4K I/O per second throughput and
maximum bandwidth capability compared to previous FICON offerings. The greater
performance capabilities of the FICON Express2 channel make it a good match for the
performance characteristics of the DS6000 host adapters.
When you use a Fibre Channel/FICON host adapter to attach to FICON channels, either
directly or through a switch, the port is dedicated to FICON attachment and may not be
simultaneously attached to FCP hosts. When you attach a DS6000 to FICON channels
through one or more switches, the maximum number of FICON logical paths is 2048 per
DS6000 host adapter port.
Figure 5-3 shows an example of FICON attachment to connect a zSeries server through
FICON switches, using 16 FICON channel paths to the eight host adapter ports on the
DS6000, and addressing eight Logical Control Units (LCUs). This channel consolidation may
be possible when your host workload does not exceed the performance capabilities of the
DS6000 host adapter, and would be most appropriate when connecting to the original
generation FICON channel. Depending on your workload, it is likely that FICON
Express2 channels should be configured one to one with DS6000 host adapter ports.
Figure 5-3 zSeries servers with 16 FICON (FC) channels attached through FICON switches to the eight DS6000 host adapter ports
Table 5-1 Platforms, operating systems and applications supported with DS6000

Server platforms | Operating systems | Clustering applications
pSeries, RS/6000®, IBM BladeCenter JS20 | IBM AIX, Linux (Red Hat, SuSE) | IBM HACMP™ (AIX only)
iSeries | OS/400®, i5/OS™, Linux (Red Hat, SuSE), AIX | IBM HACMP (AIX only)
HP PARisc, Itanium® II | HP UX | HP MC/Serviceguard
HP Alpha | OpenVMS, Tru64 UNIX | HP TruCluster (only DS8000)
Intel IA-32, IA-64, IBM BladeCenter HS20 and HS40 | Microsoft Windows, VMware, Novell NetWare, Linux (Red Hat, SuSE, Asianux, Red Flag Linux) | Microsoft Cluster Service including Microsoft Datacenter, Novell NetWare Cluster Services
Apple Macintosh | OS X | -
SGI | IRIX | -
For specific considerations that apply to each server platform, as well as for the most current
information about supported servers—the list is updated periodically—check:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Direct connect
This is the simplest of all the Fibre Channel topologies. Two Fibre Channel adapters (one
host and one DS6000) are connected using just a fiber cable. The Fibre Channel host
adapter card C in Figure 5-4 on page 149 is an example of a direct connection.
This topology supports the maximum bandwidth of Fibre Channel, but does not exploit any of
the benefits that come with SAN implementations.
Tip: When using the DS Storage Manager or DS CLI to connect directly to a host HBA, set
the Fibre Channel port topology attribute to match the requirements of the host HBA
configuration.
The DS6000 supports direct connect at a maximum distance of 500 m (1640 ft.) at 1 Gbps
and 300 m (984 ft.) at 2 Gbps with the shortwave SFP feature. The DS6000 supports direct
connect at a maximum distance of 10 km (6.2 mi) with the longwave SFP feature.
Arbitrated Loop
Fibre Channel Arbitrated Loop (FC-AL) is a uni-directional ring topology very much like token
ring. Information is routed around the loop and repeated by intermediate ports until it arrives
at its destination. If using this topology, all other Fibre Channel ports in the loop must be able
to perform these routing and repeating functions in addition to all the functions required by the
point-to-point ports.
Up to a maximum of 127 ports can be interconnected via a looped interface. All ports share
the FC-AL interface and therefore also share the bandwidth of the interface. Only one
connection may be active at a time, and the loop must be a private loop. An example of Fibre
Channel arbitrated loop topology is shown in Figure 5-5 on page 150. Note how the three
servers with host adapters X, Y, and Z share a single port to the DS6000.
Figure 5-5 Arbitrated loop topology
The DS6000 does not support FC-AL topology on adapters that are configured for FICON
protocol.
Tip: When using the DS Storage Manager or DS CLI to connect to a FC-AL loop, set the
Fibre Channel port topology attribute to fc-al.
The DS6000 supports up to 127 hosts or devices on a loop. However, the loop goes through a
loop initialization process (LIP) whenever you add or remove a Fibre Channel host or device
from the loop. LIP disrupts any I/O operations currently in progress. For this reason, we
recommend that you have only a single host and a single DS6000 on any loop, effectively
making it a direct connection as discussed under “Direct connect” on page 148.
Note: Because of the architecture of the DS6000, connection via arbitrated loop is not
recommended. Remember that the single fibre connection will only provide preferred path
access to the LUNs owned by the connecting controller card and non-preferred path access
to the LUNs owned by the other controller card. If the connecting controller card should fail,
you will lose data access to all LUNs in the DS6000.
Switched fabric
A switched fabric is an intelligent switching infrastructure that delivers data from any source to
any destination. Figure 5-4 on page 149, with Fibre Channel adapters A and B, shows an
example of a switched fabric. A switched fabric is the basis for a Storage Area Network
(SAN), as shown in Figure 5-6 on page 153.
Tip: When using the DS Storage Manager or DS CLI to configure a switched fabric, always
use fcp-scsi as the Fibre Channel port topology attribute.
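As a sketch, the topology attribute might be set per I/O port with the DS CLI similar to the following (the port ID I0001 is illustrative, and the exact command name and parameters should be verified against your DS CLI release):

setioport -topology fcp-scsi I0001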
Table 5-2 Distances supported by Fibre Channel cables for the DS6000

Fibre Channel host adapter SFP feature | Transfer rate | Cable type | Distance
Recommendations for implementing a switched fabric are covered in more detail in the
following section.
Figure 5-6 on page 153 shows an example of a SAN switched fabric. It is called a switched
fabric because the SAN switches allow any Fibre Channel port to connect to any other Fibre
Channel port. All the Fibre Channel adapters in the servers and storage in this example are
running in switched fabric mode. The main components of the SAN are:
The servers.
The storage subsystems, in this case a DS6000 and a tape library.
The SAN switches.
Notice each server is at least dual attached to the SAN for availability and load balancing. The
storage devices—the DS6000 and the tape library—have multiple SAN connections for
availability and performance.
This is just one example of a SAN; there are myriad other ways to create a SAN with
different types and numbers of servers, storage devices, and switches.
Figure 5-6 A SAN switched fabric: HP-UX and AIX hosts, a SAN Volume Controller, a tape library, and DS6000 disk storage connected through a Fibre Channel SAN
Notice in Figure 5-6 how many different types of servers there are sharing the same DS6000
storage and tape library.
For performance, a general rule of thumb for host adapter ports is to have a pair of
host adapter ports connected to the SAN for each peak throughput increment of
300-400 MB per second for large block sequential workloads. For small block random
workloads, plan a pair of host adapter ports for each increment of 15000 I/Os per second. For
a configuration that is expected to deliver 500 MB per second of large block sequential or 40000
IOPS of small block random work, for instance, you should plan to provide at least four
DS6000 host adapter ports into the SAN.
If a host adapter should go bad and start logging in and out of the switched fabric, or a server
must be rebooted several times, you do not want it to disturb I/O to other hosts. Figure 5-7 on
page 156 shows zones that only include a single host adapter and multiple DS6000 ports.
This is the recommended way to create zones to prevent interaction between server host
adapters.
Tip: Each zone should contain a single host system adapter with the desired number of
ports attached to the DS6000.
By establishing zones, you reduce the possibility of interactions between system adapters in
switched configurations. You can establish the zones by using either of two zoning methods:
Switch port number
Worldwide port name (WWPN)
You can configure switch ports that are attached to the DS6000 in more than one zone. This
enables multiple host system adapters to share access to the DS6000 host adapter ports.
Shared access to a DS6000 host adapter port might be from host platforms that support a
combination of bus adapter types and operating systems.
Note: A DS6000 host adapter port configured to run with the FICON topology cannot be
shared in a zone with hosts other than zSeries CKD hosts, and ports with a non-FICON
topology cannot be shared in a zone with zSeries CKD hosts.
While it is possible to limit the DS6000 host adapter ports through which a given WWPN
connects to Volume Groups, we recommend that you define the WWPNs to have access to all
available DS6000 host adapter ports. Then, using the recommended process of creating
Fibre Channel zones as discussed in “Importance of establishing zones” on page 153, you
can limit access to the desired host adapter ports through the Fibre Channel zones. In a switched
fabric with multiple connections to the DS6000, this concept of LUN affinity enables the host
to see the same LUNs on different paths.
If the host is not capable of recognizing that the set of LUNs seen via each path is the same,
this may present data integrity problems when the LUNs are used by the operating system. To
get around this problem, you should install the IBM Subsystem Device Driver (SDD). Aside
from preventing the above problem, SDD also provides multipathing and load balancing,
which improves performance and path availability. SDD is covered in 5.6, “Subsystem Device
Driver (SDD) - multipathing” on page 157.
The number of times a DS6000 logical disk is presented as a disk device to an open host
depends on the number of paths from each host adapter to the DS6000. The number of paths
from an open server to the DS6000 is determined by the following:
The number of host adapter cards installed in the server
The number of connections between the SAN switches and the DS6000
The zone definitions created by the SAN switch software
Note: Each physical path to a logical disk on the DS6000 is presented to the host
operating system as a disk device.
By cabling the SAN components and creating zones as shown in Figure 5-7 on page 156,
each logical disk on the DS6000 will be presented to the host server four times since there
are four unique physical paths from host to DS6000. In the figure, Zone A
shows that FC0 will have access through DS6000 host ports I0000 and I0101. Zone B shows
that FC1 will have access through DS6000 host ports I0003 and I0102. In combination, this
provides four paths to each logical disk presented by the DS6000. If Zone A and Zone B
were modified to include four paths each to the DS6000, then the host would have a total of
eight paths to the DS6000. In that case, each logical disk assigned to the host would be
presented as eight physical disks to the host operating system. Additional DS6000 paths are
shown as connected to Switch A and Switch B, but are not in use for this example.
Figure 5-7 Recommended cabling and zoning: each host adapter zoned through a SAN switch to multiple DS6000 host ports
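As a sketch, single-initiator zones like Zone A and Zone B might be defined on a Brocade switch as follows (the alias and configuration names are hypothetical, and the syntax varies by switch vendor and firmware release):

zonecreate "ZoneA", "host_fc0; ds6000_I0000; ds6000_I0101"
zonecreate "ZoneB", "host_fc1; ds6000_I0003; ds6000_I0102"
cfgcreate "prod_cfg", "ZoneA; ZoneB"
cfgenable "prod_cfg"

Each zone contains exactly one host adapter, following the recommendation above.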
In a SAN environment, Subsystem Device Driver (SDD) is used to provide load balancing and
failover. SDD also adds another device to the host operating system for each logical disk
presented from the DS6000. Figure 5-8 on page 156 shows how SDD adds a pseudo device
called a vpath (virtual path) on top of the disk devices. The host operating system issues I/O
calls to vpath0 in the example, and SDD in turn picks the best physical path (disk0, disk1,
disk2, or disk3) to use at a given time.
Figure 5-8 SDD vpath pseudo device: I/O calls from the host OS go to vpath0, and SDD (Subsystem Device Driver) provides load balance and failover across the underlying disk devices
In the example in Figure 5-8, the number of devices presented to the host operating system
for each DS6000 logical disk is limited to five (four disk devices + 1 vpath).
You can see how the number of logical devices presented to a host could increase rapidly in a
SAN environment if care is not taken in selecting the size of logical disks and the number of
paths from the host to the DS6000.
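On an AIX host, the resulting vpath-to-hdisk relationships can be listed with the SDD lsvpcfg command; the output below is illustrative (the serial number, volume group name, and hdisk numbers are hypothetical):

lsvpcfg
vpath0 (Avail pv datavg) 75065513000 = hdisk1 (Avail ) hdisk2 (Avail ) hdisk3 (Avail ) hdisk4 (Avail )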
For dual attached hosts, we typically recommend cabling the switches and creating zones in the
SAN switch software so that each server host adapter has two to four paths from the switch to
each controller of the DS6000. Figure 5-7 on page 156 shows an example of hosts using four
paths to the DS6000; two paths to each controller. With hosts configured this way, you can let
SDD balance the load across the two host adapter ports of the controller that owns the LUN
(the preferred path), and SDD will also allow balanced access across the non-preferred paths
if the preferred controller fails for any reason.
Some operating systems and file systems natively provide benefits similar to those provided by SDD,
for example, z/OS, OS/400, NUMA-Q® Dynix, and HP/UX.
SDD provides DS6000 attached hosts running Windows, AIX, HP/UX, NetWare, Sun Solaris,
or Linux with:
Dynamic load balancing between multiple paths when there is more than one path from a
host server to the DS6000. This may eliminate I/O bottlenecks that occur when many I/O
operations are directed to common devices via the same I/O path, thus improving the I/O
performance.
Automatic path failover protection and enhanced data availability for users that have more
than one path from a host server to the DS6000. It eliminates a potential single point of
failure by automatically rerouting I/O operations to remaining active paths from a failed
data path.
An example of a dual attached host that can benefit from SDD is shown in Figure 5-9.
For some servers, like selected pSeries and RS/6000 models running AIX or for Windows
environments, booting off the DS6000 is supported. In that case LUNs used for booting are
manually excluded from the SDD configuration by using the querysn command to create an
exclude file. More information can be found in “querysn for multi-booting AIX off the DS6000”
on page 222.
For more information about installing and using SDD, refer to IBM TotalStorage Multipath
Subsystem Device Driver User’s Guide, SC30-4096. This publication and other information
are available at:
http://www.ibm.com/servers/storage/support/
The path selected for an I/O operation is determined by the policy specified for the
device. The policies available are:
Load balancing (default). The path to use for an I/O operation is chosen by estimating the
load on the adapter to which each path is attached. The load is a function of the number of
I/O operations currently in process on each adapter.
Round robin. The path to use for each I/O operation is chosen at random from the paths
that were not used for the last I/O operation.
Failover only. All I/O operations for the device are sent over the same path until the path
fails.
Normally, path selection is performed on a global rotating basis; however, the same path is
used when two sequential write operations are detected.
However, SDD does support a single-path Fibre Channel connection from your host system
to a DS6000. It is possible to create a volume group or a vpath device with only a single path.
Note: With a single-path connection, SDD cannot provide failure protection and load
balancing and this is not recommended.
Figure 5-10 Single-path connection from the host through a SAN switch to a logical disk behind DS6000 host port I0001; the host adapter and the SAN switch are each a single point of failure
From an availability point of view, the configuration is not good because of the single fiber
cable from the host to the SAN switch. However, this configuration is better than a single path
from the host to the DS6000 and can be useful for preparing for maintenance on the DS6000.
Figure 5-11 SAN multi-path connection with single fiber
When a path failure occurs, the IBM SDD automatically reroutes the I/O operations from the
failed path to the other remaining paths. This eliminates the possibility of a data path being a
single point of failure.
We generally recommend using SDDPCM as the preferred multipathing solution for AIX,
because it runs as part of the storage device driver and thus has minor performance
advantages. Another benefit of it is that for each logical/virtual disk configured on the
DS6000, you end up having one hdisk device (rather than one vpath device plus one hdisk
per path).
datapath open device path - Dynamically opens a path that is in an Invalid or Close_Dead state.
datapath query adapstats - Displays performance information for all SCSI and FCP adapters that are attached to SDD devices.
datapath query adapter - Displays information about a single adapter or all adapters.
datapath query device - Displays information about a single SDD device or all SDD devices.
datapath query devstats - Displays performance information for a single SDD device or all SDD devices.
datapath query essmap - Displays each SDD vpath device, path, location, and attributes.
datapath query portmap - Displays the connection status of SDD devices with regard to the storage ports to which they are attached.
datapath query wwpn - Displays the World Wide Port Name of the host adapter.
datapath remove device path - Dynamically removes a path of an SDD vpath device.
datapath set adapter - Sets all device paths that are attached to an adapter to online or offline.
datapath set device policy - Dynamically changes the path-selection policy of the SDD devices. Choices are round-robin, load balance, default, failover.
datapath set device path - Sets the path of a device to online or offline.
A subset of these commands is described here for the purpose of understanding path
management from a performance perspective. For more information about these commands,
refer to IBM TotalStorage Multipath Subsystem Device Driver User’s Guide, SC30-4096.
Example 5-1 on page 163 illustrates the command datapath query adapter. Notice that this
host has two adapters, both functioning normally. Output similar to the following is returned
(the adapter names and counter values shown here are illustrative):
Active Adapters :2
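Adpt#  Adapter Name  State    Mode    Select    Errors  Paths  Active
    0  fscsi0        NORMAL   ACTIVE  1104321        0     32     32
    1  fscsi3        NORMAL   ACTIVE  1102876        0     32     32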
The terms used in the output of datapath query adapter are defined as follows:
Adpt# The number of the adapter.
Adapter Name The name of the adapter.
State The condition of the named adapter. It can be either:
-Normal, adapter is in use.
-Degraded, one or more paths are not functioning.
-Failed, the adapter is no longer being used by SDD.
Mode The mode of the named adapter, which is either Active or Offline.
Select The number of times this adapter was selected for input or output.
Errors The number of errors on all paths that are attached to this adapter.
Paths The number of paths that are attached to this adapter. In the Windows
NT® host system, this is the number of physical and logical devices
that are attached to this adapter.
Active The number of functional paths that are attached to this adapter. The
number of functional paths is equal to the number of paths attached to
this adapter minus any that are identified as failed or offline.
An example of the datapath query device command is shown in Example 5-2. The output
shows the status of paths for vpath4. Notice that it has eight paths that are functioning
normally. We have an AIX system that sees eight hdisks (hdisk18, hdisk26, hdisk34, hdisk42,
hdisk50, hdisk58, hdisk66, and hdisk74) and the query command was issued when the AIX
volume group was online, so that the State shows OPEN. There are two different Fibre
Channel adapters on the host: fscsi0 and fscsi3. The switch zones are configured to give
each Fibre Channel adapter four paths to the DS6000. Based upon the number of selects that
we see for each hdisk or path, we can see that each Fibre Channel adapter has two preferred
paths and two non-preferred paths, which is also shown in Example 5-4 on page 164 and
Example 5-5 on page 165.
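Output similar to the following is what such a query returns (the serial number and select counts are illustrative; note the higher select counts on the four preferred paths):

DEV#:   4  DEVICE NAME: vpath4  TYPE: 1750  POLICY: Optimized
SERIAL: 75065513000
Path#    Adapter/Hard Disk    State   Mode     Select   Errors
    0    fscsi0/hdisk18       OPEN    NORMAL    12001        0
    1    fscsi0/hdisk26       OPEN    NORMAL        3        0
    2    fscsi0/hdisk34       OPEN    NORMAL    11987        0
    3    fscsi0/hdisk42       OPEN    NORMAL        2        0
    4    fscsi3/hdisk50       OPEN    NORMAL    12010        0
    5    fscsi3/hdisk58       OPEN    NORMAL        4        0
    6    fscsi3/hdisk66       OPEN    NORMAL    11995        0
    7    fscsi3/hdisk74       OPEN    NORMAL        3        0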
Example 5-4 is an example of the datapath query essmap command and shows the host
adapter port connections that are used for each path to the DS6000. The example output has
been edited to reflect the same vpath4 that was used in the previous examples. The standard
output from this query command would show all vpaths that are defined to this AIX system.
The datapath query essmap command is only available on AIX platforms.
The columns that are of interest for looking at host adapter paths have the headings of
Connection and port. The Connection column for hdisk18 (R1-B2-H1-ZB) shows that the path
is through DS6000 Controller Card 1 (bottom card), port 1 (R1=Rack 1, B2=I/O enclosure 1,
H1=host adapter 1, ZB=port 1), which is reflected in the port column as 101, a
translation of the portname in dscli of I0101. The notation Rx-By-Hz-Za shows the relative
position of the adapter in the DS6000, where:
Rx - Rack position - Always R1 for DS6000.
By - RAID Controller Card, where B1=controller0 (top) and B2=controller1 (bottom).
Hz - Host Adapter position - Always H1 for DS6000.
Za - Relative position of the port on the Host Adapter. ZA is the left-most host port, ZD is
the right-most host port.
For hdisk18, you can interpret the Connection column R1-B2-H1-ZB as Controller 1 (the
bottom controller of the DS6000), only adapter, and second port from the left or a port_ID in
dscli of I0101. Note that this command represents the preferred path with an asterisk in the
column headed “ P ”.
Example 5-5 on page 165 is an example of the datapath query portmap command which
simply provides a different view of the host adapter port connections that are used for each
path to the DS6000. Note that this command represents the preferred path with capital “ Y ”
and alternate or non-preferred path with lowercase “ y ”. Once again, this is a command that
is available on the AIX platform only.
Note: A 2105 device's essid has 5 digits, while a 1750/2107 device's essid has 7 digits.
For more information about SAN Volume Controller, see the redbook, IBM TotalStorage SAN
Volume Controller, SG24-6423.
The SAN Volume Controller solution is designed to reduce both the complexity and costs of
managing your SAN-based storage. With the SAN Volume Controller you will be able to:
Simplify management and increase administrator productivity by consolidating storage
management intelligence from disparate storage controllers into a single view.
Improve application availability by enabling data migration between disparate disk storage
devices non-disruptively.
Improve disaster recovery and business continuance by applying and managing
copy services across disparate disk storage devices within the Storage Area Network
(SAN). These solutions include a Common Information Model (CIM) Agent, enabling
unified storage management based on open standards for units that comply with CIM
Agent standards.
Provide advanced features and functions to the entire SAN, such as:
Large scalable cache
Copy Services
Space management
Mapping based on desired performance characteristics
Quality of Service (QoS) metering and reporting
For I/O purposes, SAN Volume Controller nodes within the cluster are grouped into pairs
(called I/O groups), with a single pair being responsible for serving I/O on a given vDisk. One
node within the I/O Group will represent the preferred path for I/O to a given vDisk - the other
node representing the non-preferred path. This preference alternates between the nodes as
vDisks are created, balancing the workload across the I/O Group.
Note: The preferred node by no means signifies absolute ownership. The data will still be
accessed by the partner node in the I/O Group in the event of a failure or if the preferred
node workload becomes too high.
Beyond automatic configuration and cluster administration, the data transmitted from
attached application servers is also treated in the most reliable manner. When data is written
by the host, the preferred node within the I/O Group stores a write in its own write cache and
the write cache of its partner (non-preferred) node before sending an I/O complete status
back to the host application. To ensure that data is written in the event of a node failure, the
surviving node empties its write cache and proceeds in write-through mode until the cluster is
returned to a fully operational state.
Note: Write-through mode is where the data is not cached in the nodes, but written directly
to the disk subsystem instead. While operating in this mode, performance is somewhat
degraded; more importantly, however, it ensures that the data makes it to its destination
without the risk of data loss that a single copy of data in cache would expose you to.
Furthermore, each of the two nodes in the I/O group is protected by a different uninterruptible
power supply.
The SAN must be zoned in such a way that the application servers cannot see the backend
storage, preventing any possible conflict between SAN Volume Controller and the application
servers both trying to manage the backend storage. Two distinct zones are defined in the
fabric:
In the host zone, the host systems can identify and address the nodes. You can have more
than one host zone. Generally, you will create one host zone per operating system type.
In the disk zone, the nodes can identify the disk storage subsystems. Generally, you will
create only one disk zone, including all the storage subsystems.
The SAN Volume Controller I/O Groups are connected to the SAN in such a way that all
backend storage and all application servers are visible to all of the I/O Groups. The SAN
Volume Controller I/O Groups see the storage presented to the SAN by the backend
controllers as a number of disks, known as Managed Disks or mDisks. Because the SAN
Volume Controller does not attempt to provide recovery from physical disk failures within the
backend controllers, mDisks are usually, but not necessarily, part of a RAID array.
mDisks are collected into one or several groups, known as Managed Disk Groups or MDGs.
Once an mDisk is assigned to an MDG, the mDisk is divided into a number of extents (default
minimum size 16 MB, maximum size 512 MB), which are numbered sequentially from the
start to the end of each mDisk.
An MDG provides a pool of capacity (Extents) which is used to create volumes, known as
Virtual Disks or vDisks.
When creating vDisks, the default choice of striped allocation is normally the best choice. This
option helps to balance I/Os across all the managed disks in a MDG, which tends to optimize
overall performance and helps to reduce hot spots. Conceptually, this might be represented
as shown in Figure 6-1.
The virtualization function in the SAN Volume Controller maps the vDisks seen by the
application servers on to the mDisks provided by the backend controllers. I/O traffic for a
particular vDisk is, at any one time, handled exclusively by the nodes in a single I/O Group.
Thus, although a cluster could have many nodes within it, the nodes handle I/O in
independent pairs. This means that the I/O capability of the SAN Volume Controller scales
well (almost linearly), since additional throughput can be obtained by simply adding additional
I/O Groups.
Figure 6-2 on page 171 summarizes the various relationships that bridge the physical disks
through to the virtual disks within the SAN Volume Controller architecture.
The multi-pathing driver supported by the SAN Volume Controller is IBM’s Subsystem Device
Driver (SDD). It manages the multiple paths from the host to the SAN Volume Controller
making use of the preferred paths in a round robin manner before using any non-preferred
path. SDD performs data path failover in the event of a failure within the SAN Volume
Controller, or the host path while also masking out the additional disks that would otherwise
be seen by the hosts due to the redundant paths through the SAN fabric.
Note: The SDD code has been updated to support the SAN Volume Controller, the
ESS, the DS6000, and the DS8000, and provided the latest version is used, IBM supports
the concurrent connection of a host to both a SAN Volume Controller and “native” storage
environments. Refer to IBM SDD documentation: Multipath Subsystem Device Driver
User's Guide, SC30-4096.
Note: SAN Volume Controller copy services functions are not compatible with the DS6000
and DS8000 copy services.
FlashCopy
FlashCopy is a Copy Service available with the SAN Volume Controller. It copies the contents
of a source virtual disk (VDisk) to a target VDisk. Any data that existed on the target disk is
lost and is replaced by the copied data. After the copy operation has been completed, the
target virtual disks contain the contents of the source virtual disks as they existed at a single
point in time. Although the copy operation takes some time to complete, the resulting data on
the target is presented in such a way that the copy appears to have occurred immediately.
Consistency Groups address the issue that an application may have related data which
spans multiple Virtual Disks. FlashCopy must be performed in a way which preserves data
integrity across multiple Virtual Disks. One requirement for preserving the integrity of data
being written is to ensure that dependent writes are executed in the application's intended
sequence.
A FlashCopy mapping can be created between any two virtual disks in a cluster. It is not
necessary for the virtual disks to be in the same I/O group or even in the same managed disk
group. This functionality provides the ability to optimize your storage allocation by using a
secondary storage subsystem (with, for example, lower performance) as the target of the
FlashCopy. In this case, the resources of your high performance storage subsystem are
dedicated to production only, while your low-cost (lower performance) storage subsystem is
used for a secondary application (for example, backup or development). Figure 6-3 on
page 173 represents a FlashCopy relationship created between two vDisks defined in
different Managed Disk Groups from different backend disk subsystems.
The SAN Volume Controller assumes that the FC fabric to which it is attached contains
hardware that achieves the long distance requirement for the application. This hardware
makes storage at a distance accessible as though it were local storage. Specifically, it enables
two SAN Volume Controller clusters to connect to each other and establish communications
in the same way as though they were located nearby on the same fabric. The only difference
is in the expected latency of that communication, the bandwidth capability of the link, and the
availability of the link as compared with the local fabric.
The relationship between the two copies is not symmetric. One copy of the data set is
considered the primary copy (sometimes also known as the source). This copy provides the
reference for normal run-time operation. Updates to this copy are shadowed to a secondary
copy (sometimes known as the destination or even target). The secondary copy is not
normally referenced for performing I/O.
The remote copy can be maintained in one of two modes, synchronous or asynchronous.
Synchronous remote copy ensures that updates are committed at both primary and
secondary before the application is given completion to an update. This ensures that the
secondary is fully up-to-date should it be needed in a failover. However, this means that the
application is fully exposed to the latency and bandwidth limitations of the communication link
to the secondary. Where this is truly remote, this can have a significant adverse effect on
application performance.
Today SAN Volume Controller implements synchronous Remote Copy. Future releases
should incorporate asynchronous Remote Copy.
Figure 6-4 Synchronous remote copy relationship between 2 SAN Volume Controller clusters
In the following section, we present the IBM SAN Volume Controller concepts and discuss the
performance of the SAN Volume Controller. In this section, we assume there are no
bottlenecks in the SAN or on the disk subsystem.
For sequential read operations, such as database scans or backup operations, a single I/O
group can achieve up to 1 GB per second, given that the backend disk configuration is
properly configured to provide this level of throughput.
If you have information about the workloads that you plan to use with the SAN Volume
Controller, you can use this information to size the amount of capacity you can configure per
I/O group. To be conservative, assume a throughput ability of about 800 MB/s per I/O group.
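For example, a workload with an expected aggregate peak of 2400 MB/s of sequential read throughput would call for at least three I/O groups (2400 / 800 = 3).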
These guidelines show that grouping similar disks together is important. The following
guidelines should be followed when grouping similar disks:
Group equally performing managed disks (arrays) in a single group.
Group similar arrays, for example, all RAID 5 arrays, in one group.
Group managed disks from the same type of storage subsystem in a single managed disk
group.
Group managed disks that use the same type of underlying physical disk (for example, disk
capacity, RPM).
Note: When configuring managed disks with the SAN Volume Controller, create managed
disk groups to use the largest practical SAN Volume Controller extent size. Doing so
maximizes the learning ability of the SAN Volume Controller adaptive cache.
Redundant array of independent disks (RAID) is a method of configuring multiple disk drives
in a storage subsystem for high availability and high performance. The collection of two or
more disk drives presents the image of a single disk drive to the system. In the event of a
single device failure, data can be read or regenerated from the other disk drives in the array.
With RAID implementation, the Storage Unit offers fault-tolerant data storage. The Storage
Unit supports RAID implementation on the Storage Unit device adapters. The Storage Unit
supports groups of disk drive modules (DDMs) in both RAID 5 and RAID 10.
Array size
A DS6000 Array is a RAID 5 or RAID 10 Array made up of 4 or 8 DDMs.
We recommend configuring Array sites made up of 8 DDMs to get the maximum performance
out of your backend storage system. For further discussion, refer to Chapter 3, “Logical configuration
planning” on page 53.
A DS6000 Array is created from one 8-DDM Array Site. DS6000 RAID 5 arrays will be either
6+P+S or 7+P. DS6000 RAID 10 arrays will be either 3+3+2S or 4+4.
RAID 5 or RAID 10
There are a number of workload attributes that influence the relative performance of RAID 5
versus RAID 10, including the use of cache, the relative mix of read versus write operations,
and whether data is referenced randomly or sequentially.
Consider that:
For either sequential or random reads from disk, there is no significant difference in RAID
5 and RAID 10 performance, except at high I/O rates.
For random writes to disk, RAID 10 performs better.
For sequential writes to disk, RAID 5 performs better.
For more details regarding the differences between RAID 5 and RAID 10, refer to 2.8.5, “RAID 5 versus
RAID 10 performance” on page 39.
If you need to maximize the performance of your SAN Volume Controller configuration,
allocate each Rank to its own Extent Pool so that you configure one Rank per pool. This gives
you the ability to direct your allocations to a known location within the DS6000. Furthermore,
this configuration will help you manage and monitor the resultant logical disk performance
when required.
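A minimal DS CLI sketch of this one-Rank-per-Extent-Pool layout follows (the array site, rank, and pool IDs are hypothetical, and the command parameters should be verified against your DS CLI release):

mkarray -raidtype 5 -arsite S1
mkrank -array A0 -stgtype fb
mkextpool -rankgrp 0 -stgtype fb svc_pool_1
chrank -extpool P1 R0

Repeating these steps once per Rank keeps each Rank in its own Extent Pool.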
To clearly explain this performance limitation we can use as an example the configuration
presented in Figure 6-5 on page 178.
In this example, an Extent Pool (Extent Pool 0) is defined on a DS6000. This Extent Pool
includes 3 Ranks of 519 GB. The overall capacity of this Extent Pool is 1.5 TB. This capacity
is available through a set of 1 GB DS6000 Extents (standard DS6000 Extent size).
In this pool of available Extents, we create one DS6000 Logical Volume called volume0,
which contains all the Extents in the Extent Pool. volume0 is 1.5 TB. Due to the DS6000
internal Logical Volume creation algorithm, the Extents from Rank1 will be assigned first, then
the Extents of Rank2, and then the Extents of Rank3. In this case, the data stored on the
first third of volume0 will be physically located on Rank1, the second third on Rank2,
and the last third on Rank3.
When volume0 is assigned to the SAN Volume Controller, the Logical Volume is identified by
the SAN Volume Controller cluster as the Managed Disk, mDiskB. mDiskB is assigned to a
Managed Disk Group, MDG0, where the SAN Volume Controller Extent size is defined as 512
MB (the size can be from 16 MB to 512 MB). Two other Managed Disks, mDiskA and mDiskC,
are also defined in this Managed Disk Group. mDiskA and mDiskC are also 1.5 TB and come from
the same DS6000, but they come from different Extent Pools. These Extent Pools are
configured similarly to Extent Pool 0.
When vDisk0 was created, it was assigned one SAN Volume Controller Extent from
mDiskA, then one from mDiskB, then one from mDiskC, and so on. In total, vDisk0 was
assigned the first 34 Extents of mDiskA, the first 33 of mDiskB, and the first 33 of mDiskC.
Here is the bottleneck. All of the first 33 Extents used from mDiskB are physically located at
the beginning of volume0, which means that all of these Extents belong to DS6000 Rank1.
This configuration does not follow the performance recommendation that you spread
the workload assigned to vDisk0 across all the Ranks defined in the Extent Pool. In this case,
performance will be limited to the performance of a single Rank.
Furthermore, if the configurations of mDiskA and mDiskC are equivalent to mDiskB, the data
stored on vDisk0 is spread across only 3 of the 9 Ranks available within the three DS6000
Extent Pools used by the SAN Volume Controller.
This example shows the bottleneck for vDisk0, but more generally, almost all of the vDisks
created in this Managed Disk Group will be spread across only three Ranks instead of the nine
available.
Attention: The configuration presented in Figure 6-5 on page 178 is not optimized for
performance in a SAN Volume Controller environment.
To clearly explain this performance optimization, we can use as an example the configuration
presented in Figure 6-6 on page 180.
In this example, three Extent Pools (Extent Pools 1, 2, and 3) are defined on a DS6000. Each Extent
Pool includes only 1 Rank of 519 GB. The overall capacity of each Extent Pool is 519 GB.
This capacity is available through a set of 1 GB DS6000 Extents (standard DS6000 Extent
size).
In each Extent Pool we create one volume (volume1, volume2, and volume3) which is assigned all
the capacity of its Extent Pool (all the available Extents are assigned). The three volumes
created have a size of 519 GB each.
volume1, volume2, and volume3 are assigned to the SAN Volume Controller, and these Volumes
are identified by the SAN Volume Controller cluster as Managed Disks (in the example
mDiskA, mDiskB, mDiskC). These Managed Disks are assigned to a Managed Disk Group
(MDG0) where the SAN Volume Controller Extent size is defined as 512 MB (the size can be from
16 MB to 512 MB).
The overall capacity of the Managed Disk Group is 1.5 TB. This capacity is available through
a set of 512 MB SAN Volume Controller Extents. In this storage pool, a Virtual Disk
(vDisk0) of 50 GB (100 SAN Volume Controller Extents) is created. The Virtual Disk is created in
SAN Volume Controller Striped mode in order to obtain the greatest performance.
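Continuing the example, the striped vDisk might be created with the SVC CLI as in this sketch (the Managed Disk Group and I/O group names are from this example, and exact parameters can vary between SAN Volume Controller releases):

svctask mkvdisk -mdiskgrp MDG0 -iogrp io_grp0 -vtype striped -size 50 -unit gb

With -vtype striped, the 100 Extents for vDisk0 are allocated round-robin across mDiskA, mDiskB, and mDiskC.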
In this case, the SAN Volume Controller Extents assigned to vDisk0 are physically located
on Rank1, Rank2, and Rank3 of the DS6000. This configuration permits you to spread the
workload applied to vDisk0 across all three Ranks of the DS6000. In this case we efficiently use
the hardware available for each vDisk of the Managed Disk Group.
Important: The configuration presented in Figure 6-6 on page 180 is optimized for
performance in a SAN Volume Controller environment.
Note: For SAN Volume Controller attachment to the DS6000, we recommend using FC
ports from both Server0 and Server1 to improve DS6000 access availability.
IBM TotalStorage Productivity Center and its product IBM TotalStorage Productivity Center for
Disk are presented in 4.3, “IBM TotalStorage Productivity Center for Disk” on page 109.
Refer to that section for more details.
For more general information about TotalStorage Productivity Center, refer to the redbook,
IBM TotalStorage Productivity Center: Getting Started, SG24-6490.
Device management
IBM TotalStorage Productivity Center for Disk can provide access to single-device and
cross-device configuration functionality. It enables the user to view important information
about the storage devices that are discovered by IBM TotalStorage Productivity Center for
Disk, examine the relationships between those devices, or change their configurations. IBM
TotalStorage Productivity Center for Disk supports the discovery and logical unit number
(LUN) provisioning of IBM TotalStorage DS4000 series storage systems, IBM TotalStorage
ESS, IBM TotalStorage DS6000, IBM TotalStorage DS8000 and IBM TotalStorage SAN
Volume Controller.
The user can view essential information about the storage, view the associations of the
storage to other devices, and change the storage configuration. DS8000, DS6000, ESS and
DS4000 storage subsystems, attached to the SAN or attached behind the SAN Volume
Controller, can be managed by IBM TotalStorage Productivity Center for Disk.
IBM TotalStorage Productivity Center for Disk is designed to enable the IT administrator to:
Monitor performance metrics across storage subsystems from a single console
Receive timely alerts to enable event actions based on customer policies
Focus on storage optimization through the identification of LUN hot spots
6.3.2 Using IBM TotalStorage Productivity Center for Disk to monitor the SAN
Volume Controller
To install and configure TotalStorage Productivity Center for Disk to monitor IBM SAN Volume
Controller, refer to the IBM redbook, Managing Disk Subsystems using IBM TotalStorage
Productivity Center, SG24-7097 and to the Redpaper, Using IBM TotalStorage Productivity
Center for Disk to Monitor the SVC, REDP-3961.
You will need to set up a new Performance Data Collection Task for the SAN Volume
Controller device.
Performance metrics are collected at different levels within the SAN Volume Controller:
Virtual disk (VDisk): For a single VDisk or all VDisks combined
– Total and average number of reads and writes
– Number of 512-byte blocks read and written
Managed disk (MDisk): For a single MDisk or per MDisk group
– Total and average number of reads and writes
– Number of 512-byte blocks read and written
– Read and write transfer rates
– Total, average, minimum, and maximum response time
Once data collection is complete, you may use the gauges task to retrieve information about a
variety of storage device metrics. Gauges are used to drill down to the level of detail
necessary to isolate performance issues on the storage device. To view information collected
by the Performance Manager, a gauge must be created or a custom script written to access
the DB2 tables/fields directly.
The data samples you collect must cover the appropriate time period, corresponding with the
highs and lows of the I/O workload. They should also cover sufficient iterations of the peak
activity to permit analysis over a period of time.
If you plan to perform analysis for one specific instance of activity, ensure that the
performance data collection task covers that specific time period.
Note: The SAN Volume Controller can perform data collection at a minimum interval of
15 minutes.
You may only enable a particular threshold once the minimum values for warning and error
levels have been defined.
Tip: In TotalStorage Productivity Center for Disk, default threshold warning or error values
of -1.0 are indicators that there is no recommended minimum value for the threshold and
are therefore entirely user defined. You may elect to provide any reasonable value for these
thresholds, keeping in mind the workload in your environment.
See the following Web site for the latest supported configurations:
http://www-1.ibm.com/servers/storage/support/virtual/2145.html
6.4.1 Sharing the DS6000 between open systems server hosts and the IBM
SAN Volume Controller
If you have a mixed environment including IBM SAN Volume Controller and open systems
servers, we recommend sharing as much of the DS6000 resources as possible between both environments.
An example of a storage configuration recommendation is to create one Extent Pool per
Rank. In each Extent Pool, create one volume allocated to the IBM SAN Volume Controller
environment and one or more other volumes allocated to the open system server hosts. In
this configuration, each environment can benefit from the DS6000 overall performance.
IBM supports sharing a DS6000 between a SAN Volume Controller and open system server
hosts. However, if a DS6000 port is in the same zone as a SAN Volume Controller port, that
same DS6000 port should not be in the same zone as another host.
6.4.2 Sharing the DS6000 between iSeries host and the IBM SAN Volume
Controller
IBM SAN Volume Controller does not support iSeries host attachment. If you have a mixed
server environment including IBM SAN Volume Controller and iSeries servers, you have to
share your DS6000 to provide a direct access to iSeries volumes and access to open system
server volumes through the IBM SAN Volume Controller.
IBM supports sharing a DS6000 between a SAN Volume Controller and iSeries hosts.
However, if a DS6000 port is in the same zone as a SAN Volume Controller port, that same
DS6000 port should not be in the same zone as iSeries hosts.
6.4.3 Sharing the DS6000 between zSeries server host and the IBM SAN
Volume Controller
IBM SAN Volume Controller does not support zSeries host attachment. If you have a mixed
server environment including IBM SAN Volume Controller and zSeries servers, you have to
share your DS6000 to provide direct access to zSeries volumes and access to open system
server volumes through the IBM SAN Volume Controller.
In this case, you have to split your DS6000 resources between the two environments. Some of
the Ranks have to be created using CKD format (used for zSeries access) and the others
using fixed block (FB) format (used for open systems access through the SAN Volume Controller).
A DS6000 port will not support a shared attachment between zSeries and IBM SAN Volume
Controller because zSeries servers use ESCON or FICON connection and IBM SAN Volume
Controller only supports FC connection.
Attention: The new Cache Disable VDISK functionality included in the SAN Volume
Controller release 3.1 will provide the ability to use disk subsystem copy services for LUNs
that are managed by the SAN Volume Controller.
Before you delete or un-map a volume from the SAN Volume Controller, remove the logical
unit from the Managed Disk Group. The following is supported:
The supported volume size is 1 GB to 2 TB.
Logical units can be added dynamically.
Throughout this chapter, we refer to Ranks, Arrays and Extent Pools. These terms will be
used interchangeably and they all refer to the same thing. This assumes that there is one
Rank per Extent Pool as is recommended in Chapter 2, “Hardware configuration planning” on
page 17 and Chapter 5, “Host attachment” on page 143.
The tips and tools presented in this chapter will allow you to:
Collect host I/O stats for:
– Individual disk devices (paths to DS6000 LUNs)
– Vpaths
– Ranks
Develop an iostat report for all Ranks in the DS6000 (enterprise iostats) from a host
perspective.
Create baseline measurements of performance.
Test and improve sequential I/O.
Sometimes you will want to view performance of a specific host, and other times you will want
to view performance statistics of DS6000 components. Remember that multiple hosts can be
using logical disks (LUNs) from the DS6000 that reside on the same Array.
Keep in mind that the most important I/O measurements to gather from a server’s disk
subsystem are:
Number of I/O transactions per second (IOPS)
Total MB/s transferred
MB/s read
MB/s written
KB/transaction = [ (KB read/second + KB written/second) / (transactions/second ) ]
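On AIX, for example, these metrics can be gathered with iostat; a minimal sketch (the interval and count are illustrative):

iostat -d 30 4

This prints four 30-second samples of the disk-only report; the tps column gives I/O transactions per second, and the Kb_read and Kb_wrtn columns give the data transferred in each interval, from which KB/transaction can be derived using the formula above.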
For general purpose, random I/O applications, the approach described here is the single
most important step you can take toward optimum performance. Of course, there are
exceptions, like DB2, which uses algorithms unique to the application to balance I/O.
Note: The recommended method does not apply to certain applications like DB2. See
Chapter 13, “Databases” on page 415.
Many customers have SAN administrators and UNIX administrators. Eventually, these two
groups must get together to decide which servers get which LUNs. That decision is what
this section is about. For optimum performance, the UNIX administrator should request one
LUN per Rank, in a round robin way, until the required amount of storage is reached, and
each LUN should be the same size. For instance, if the application server needs 250 GB, this
total storage requirement should be satisfied by assigning four 72 GB LUNs to the server. And
most importantly, each LUN is physically located on a different Rank. See Figure 7-1 on
page 192. Notice that the host server has been allocated a piece of every DDM in our
hypothetical DS6000. Every spindle of every DDM has allocated storage to the host server.
Figure 7-1 One 72 GB LUN allocated from every Rank on Server 0 and Server 1
Note: For best performance, assign LUNs evenly from many available Ranks to your host
servers. This ultimately involves the maximum number of DDMs in doing the I/O work.
The above example uses two disk enclosures because, frankly, it would be difficult to cram
more Extent Pools into Figure 7-1. It is realistic to expect your configuration to include more
disk enclosures, each enclosure cabled to the two different disk adapter pairs or loops. So, to
expand upon our concept of the recommended method slightly: balance the LUNs assigned to
host systems between Ranks and disk adapter loops. The idea is to equally distribute your I/O
load evenly across all of the performance related resources (DDMs, Ranks, DA pairs) within
the DS6000 - get as much of this hardware working for you as possible!
It is also important to note that all the LUNs assigned to a host system should be the same
size, share a common DDM size and RAID Array type. In our hypothetical system, all the
LUNs were 72 GB, all the Arrays were RAID 5, and all the DDMs were 146 GB. The next host
system may only need 32 GB LUNs from every Rank. It is perfectly acceptable to have
different size LUNs sitting next to each other on the same Rank.
As this book is about performance and tuning, we end this section as we started it. The single
most important step you can take for optimum performance is to evenly distribute a host
system’s LUNs among multiple Ranks in the DS6000.
The next step for the UNIX system administrator is to begin the process of acquiring the LUNs
assigned and configuring them to the Logical Volume Manager of the operating system. This
discussion continues in 7.8.1, “Creating the volume group” on page 233, later in this chapter.
It is not necessary to have all the LUNs in a DS6000 be the same size. As stated in the
previous section, what is important is that a host system’s LUNs be evenly balanced among
Ranks. It is perfectly acceptable to have, for instance, one 72 GB LUN on every Rank
assigned to server A and one 8 GB LUN on every Rank assigned to server B. The balance is
still there.
Many of the considerations for LUN size, from a UNIX perspective, depend upon the functions
of a UNIX Logical Volume Manager. AIX and HP-UX have this function as part of the base
operating system. Solaris has Veritas Volume Manager. Here are some considerations when
choosing the DS6000 Logical Volume (LUN) size:
The recommended method advocates assigning one LUN from each Rank to your server.
Consider choosing your LUN size so that one LUN from every Rank gives your host
system the amount of storage it needs. For example, if you know that, on average, most of
your servers will require around 500 GB and your DS6000 has 16 Ranks, then a LUN size
of about 32 GB (500 GB / 16 Ranks) should be considered.
When filling up a LUN with logical volumes, always leave at least one physical partition
free on every LUN in the volume group. This leaves some extra room for the volume group
descriptor area (VGDA) to grow and enables the volume group to be expanded,
reorganized online, or changed from a standard volume group to a big volume group or to
a scalable volume group.
The concept of OS level striping will be recommended later in this chapter as a way to
further enhance the storage performance of a host server. OS level striping should be
done across same-size LUNs from different Ranks.
Choose a LUN size so that one to three LUNs from each Rank will satisfy the host
system’s storage requirements. This will prevent a huge number of LUNs from being
presented to the operating system. Too many LUNs are harder to manage, and they can
also impact boot times and HACMP failover events.
There are, of course, situations where larger LUN sizes (greater than 72 GB) should be
considered. For instance, in HACMP environments where failover time is important, consider
much larger LUN sizes - one LUN that uses the entire Rank. This is because during a failover
event, the failover time is largely determined by the total number of HACMP managed logical
volumes. Another example would be if you were preparing LUNs for the SAN Volume
Controller, where very large LUNs are preferred. Another example would be if the LUNs were
intended for DB2, which uses a containers concept to balance I/O. See Chapter 13,
“Databases” on page 415.
This step is worth doing! Figure 7-2 on page 195 shows an example of how to document
the storage allocation of a DS6000 using an Excel spreadsheet.
The legend for the above diagram is shown below in Figure 7-3 on page 196.
Here are the three most common multipathing solutions for UNIX operating systems that are
supported by the DS6000:
SDDPCM
– Available for AIX only
– Preferred multipath product for AIX
SDD
It is especially important in a SAN environment to limit the number of disk devices presented
to a host. In a SAN, every extra path from the host to the DS6000 will cause another disk
device to be presented to the host OS for every DS6000 LUN assigned to it with SDD. With
SDDPCM only one hdisk is presented to the host for a LUN.
It is important to understand the total bandwidth requirements of the host system when
choosing the number of paths from the DS6000. With 2 Gb Fibre Channel, four paths provide
four different 200 MB/s connections from the DS6000 - 800 MB/s total. This should
be adequate to supply the maximum 400 MB/s that can be used by the two HBAs on the host.
In a special situation where a host system has four HBAs (or more) it might be necessary to
increase the number of paths from the DS6000 to the SAN.
Note: Because the number of paths might influence performance, you should use the
minimum number of paths necessary to achieve your performance requirements. The
recommended number of paths is 2 to 4.
For more information about SAN zoning for performance and availability, refer to 5.5, “SAN
implementations” on page 151.
In a SAN environment, the microcode levels on the DS6000, the SCSI and Fibre Channel
adapters on the servers, and the SAN switch code all affect each other.
You can find information about microcode levels for RS/6000 and pSeries servers and
adapters at:
http://techsupport.services.ibm.com/server/mdownload
We will cover some specific SDD commands for AIX, HP-UX, and Sun Solaris in 7.6, “SDD
commands for AIX, HP-UX, and Solaris” on page 218. For more details on SDD see 5.6,
“Subsystem Device Driver (SDD) - multipathing” on page 157.
Useful information can also be found in the IBM TotalStorage DS6000: Host Systems
Attachment Guide, SC26-7628. Also, see the DS6000 Interoperability Matrix for equipment
that IBM has tested and supports attaching to the DS6000 at:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Keep in mind that when these tools were created, UNIX servers had their own locally
attached storage and did not use disk devices presented from centralized disk storage
servers like the DS6000, which is full of RAID arrays.
We would not call these tools legacy yet, but some of their features do not work well with
storage from RAID arrays. When looking at the output from these commands keep in mind
that the numbers presented are not for a single disk anymore, but for a logical disk (LUN) on a
DS6000 (RAID 5 or RAID 10) Rank.
These tools are worth discussing because they are almost always available and system
administrators are accustomed to using them. You may have to administer a server, and
these are the only tools you have available to use. These tools offer a quick way to tell if a
system is I/O bound.
7.3.1 iostat
The base tool for evaluating I/O performance of disk devices for UNIX operating systems is
iostat. Although available on most UNIX platforms, iostat varies in its implementation from
system to system.
The iostat command is a fast way to get a first impression of whether the system has an
I/O-bound performance problem or not. The tool reports I/O statistics for TTY devices, disks,
and CD-ROMs. It is used for monitoring system I/O device utilization by observing the time
physical disks are active in relation to their average transfer rates.
It would not be unusual to see a device reported by iostat as 90 percent to 100 percent busy
because a DS6000 volume that is spread across an array of multiple disks can sustain a
much higher I/O rate than for a single physical disk. Having a device 100 percent busy would
generally be a problem for a single device, but probably not for a RAID 5 device.
Tip: When using iostat on a server that is running SDD with multiple attachments to the
DS6000, each disk device is really just a single path to the same logical disk (LUN) on the
DS6000. To understand how busy a logical disk is, you need to sum up iostats for each
disk device making up a vpath.
Figure 7-4 shows an example of how multiple paths to the DS6000 affect information
presented by iostat. In the example, a server has two Fibre Channel adapters and is zoned
so that it uses four paths to the DS6000.
Figure 7-4 (diagram): I/O calls from the OS go to vpath0, presented by SDD (Subsystem
Device Driver), which load balances and fails over across the two host Fibre Channel
adapters (FC0 and FC1) and the DS6000 host ports. The OS sees four disk devices (disk1
through disk4), each reported on separately by iostat, but all four are paths to the same
LUN.
In order to determine the I/O statistics for vpath0 for the example given in Figure 7-4, you
would need to add up the iostats for hdisk1–4. One way to find out which disk devices make a
vpath is to use the datapath query essmap command included with SDD.
Another way is shown in Example 7-1 on page 200. The command datapath query device
0 lists the paths (hdisks) to vpath0, together with identifying information for the logical
disk on the DS6000.
For a system with a large number of disk devices presented from the DS6000, iostat can
lose its effectiveness. You may want to try running iostat and then sort the output by %busy.
If your AIX system is in a SAN environment, you may have so many hdisks that iostat
presents too much information. We recommend using nmon, which can report iostats based
on vpaths or Ranks, as discussed in 7.4, “AIX-specific I/O monitoring commands and tools”
on page 208.
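If you do want to sort iostat output by busy, a sketch for AIX follows (it assumes %tm_act
is the second column of the disk report; remember that the first stanza reports statistics
since boot):
iostat -d 5 2 | sort -nrk2 | head -20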
The tables that follow show sample iostat reports from IBM AIX, Sun Solaris, and HP-UX
systems.
Notice that the first stanza of the iostat output reports historical (since boot) statistics.
Together with the KB read and written, the output reports the following:
%tm_act column indicates the percentage of the measured interval time that the device
was busy.
tps column shows the transactions per second over the interval period for the device. The
I/O transaction is a variable length of work assigned to a device. This field may also
appear higher than would normally be acceptable for a single physical disk device.
%iowait has become somewhat misleading as a measure of disk performance with the
advent of faster and faster CPU speeds. This value is really an indication of the percent of
time the CPU is idle, waiting for I/O to complete. As such, it is only indirectly related to I/O
performance.
The r/s column shows 124.3 reads per second; the %b column shows 90 percent busy for the
device; but the svc_t column shows a service time of 15.7 ms, quite reasonable for 124 I/Os
per second.
The calculations for service time that iostat presents are based on a single physical volume,
and, as previously mentioned, the physical volume that the DS6000 presents to the host is in
reality composed of multiple physical disks.
With RAID disks, the %b figure can be misleading and should not be relied on. To figure out
how busy the individual disks are in a RAID array in the DS6000, we would need to add up all
the iostats for LUNs on that array and divide by the number of disks in the array.
Notice that for Sun Solaris, iostat uses disk aliases like sdX for disk devices like cXtYdZ.
Depending on which version of Sun Solaris you are running, you may be able to use an -n flag
for iostat to list devices in the cXtYdZ format. Example 7-4 shows the output of iostat -n for
a Sun Solaris server.
There are also scripts available from Sun or in Sun Solaris user groups to map from sdX
aliases to cxtydz devices. Search on the Internet for sd_to_cxtydz.sh.
The man page for the iostat command on HP-UX states that the msps field is set to 1.0. With
the advent of new disk technologies, such as data striping, where a single data transfer is
spread across several disks, the number of milliseconds per average seek becomes
impossible to compute accurately. At best it is only an approximation, varying greatly, based
on several dynamic system conditions. For this reason, and to maintain backward
compatibility, the milliseconds per average seek (msps) field is set to the value 1.0.
For HP-UX, you may prefer to use vmstat -d to view disk stats, or use both vmstat and
iostat. Details on the HP-UX vmstat output are shown in 7.3.3, “vmstat” on page 206.
iostat summary
In a SAN environment with the DS6000 presenting several disk devices to a host, iostat
output is not as easy to evaluate as when using individual SCSI disks. You will probably want
to use another tool that presents iostats based on vpaths or internal DS6000 performance
statistics. The use of SDDPCM avoids the issue of having reports for each path, and is part of
the reason why SDDPCM is preferred.
With a DS6000, also remember that typically the majority of random writes are happening at
cache speeds. Data is written to the DS6000 and stored in cache to be destaged to disks
later. For example, you can run a command in one window to copy a large file between file
systems on DS6000 disks. Then in another window, watch iostat output. You will see that the
write comes back as complete before the disk activity has stopped; this is due to the DS6000
reporting to the host system, that the write is complete as soon as all data was written to
DS6000 cache. iostat will show disk activity still taking place as data is destaged from cache
to disk.
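To see this for yourself, something like the following sketch works (the file and file
system names are hypothetical):
cp /ds6kfs1/bigfile /ds6kfs2/bigfile &     (window one: copy a large file)
iostat 2                                   (window two: watch disk activity continue after the copy returns)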
Taken alone, there is no unacceptable value for any of the above iostat fields because
statistics are too closely related to application characteristics and system configuration.
Therefore, when evaluating data, look for patterns and relationships. The most common
relationship is between disk utilization and data transfer rate.
To draw any valid conclusions from iostat data, you have to understand the application’s disk
data access patterns such as sequential, random, or combination, and the type of physical
disk drives and adapters on the system.
For example, if an application reads/writes sequentially, you should expect a high disk transfer
rate when you have a high disk busy rate. Kb_read and Kb_wrtn can confirm an understanding
of an application’s read/write behavior. However, they provide no information about the data
access patterns.
Generally you do not need to be concerned about a high disk busy rate as long as the disk
transfer rate is also high. However, if you get a high disk busy rate and a low disk transfer rate,
you may have a fragmented logical volume, file system, or individual file that is causing the
bottleneck.
7.3.2 SAR
System Activity Report (SAR) is a tool that reports the contents of certain cumulative activity
counters within the UNIX operating system. SAR has numerous options, providing paging,
TTY, CPU busy, and many other statistics. Used with the appropriate command flag (-u) SAR
provides a quick way to tell if a system is I/O bound.
There are three possible modes in which to use the sar command:
Real-time sampling and display
System activity accounting via cron
Display previously captured data
We will discuss these three modes of using the sar command. The following discussion uses
the AIX operating system as a platform for the examples. However, these commands are
common to HP-UX and Solaris as well.
Average      44      15       5      35    (closing line of the sar -u example output: %usr, %sys, %wio, %idle)
Not all sar options are the same for AIX, HP-UX, and Sun Solaris, but the sar -u output is the
same. The output in the example shows CPU information every 2 seconds, 5 times.
To check if a system is I/O bound, the important column to look at is %wio. The %wio includes
time spent waiting on I/O from all drives, including internal and DS6000 logical disks. If %wio
values exceed 40, this would give an indication that more investigation may be warranted to
understand storage I/O performance. The next thing to look at would be I/O service times
reported by the filemon command. You need to understand your workload, though, to make a
judgement. High %wio values may simply imply that the host system’s configuration has more
CPU power than its I/O subsystem can keep busy.
There are useful flags for sar on AIX, especially the -d flag; note that the sar -d output
changed at AIX 5.3.
The avwait and avserv are the average times spent in the wait queue and service queue
respectively. And avserv here would correspond to avgserv in the iostat output. The avque
value changed; at AIX 5.3, it represents the average number of IOs in the wait queue, and
prior to 5.3, it represents the average number of I/Os in the service queue.
Also remember that a system with busy CPUs can mask I/O wait. The definition of %wio is:
Idle with some process waiting for I/O (only block I/O, raw I/O, or VM pageins/swapins
indicated). If the system is CPU busy and also is waiting on I/O, the system accounting will
increment the CPU busy but not the %wio column.
The other column headings mean (refer to Example 7-6 on page 204):
%usr time system spent executing application code
%sys time system spent executing operating system calls
%idle time the system was idle with no outstanding I/O requests
To configure a system to collect data for sar, you can run the sadc command or the modified
sa1 and sa2 commands. Here is more information about the sa commands and how to
configure sar data collection:
The sa1 and sa2 commands are shell procedure variants of the sadc command.
The sa1 command collects and stores binary data in the /var/adm/sa/sadd file, where dd is
the day of the month.
The sa2 command is designed to be run automatically by the cron command and run
concurrently with the sa1 command. The sa2 command will generate a daily report called
/var/adm/sa/sardd. It will also remove a report more than one week old.
/var/adm/sa/sadd contains the daily data file, where dd represents the day of the month.
/var/adm/sa/sardd contains the daily report file, where dd represents the day of the month
(note the r in /var/adm/sa/sardd for the sa2 output).
To configure a system to collect data, edit the root crontab file. For our example, if we just
want to run sa1 every 15 minutes every day, and the sa2 program to generate ASCII versions
of the data just before midnight, we will change the cron schedule to look like the following:
0,15,30,45 * * * 0-6 /usr/lib/sa/sa1
55 23 * * 0-6 /usr/lib/sa/sa2 -A
You can view performance information from these files with:
sar -f /var/adm/sa/sadd      where dd is the day you are interested in.
You can also focus on a certain time period, say 8 a.m. to 5:15 p.m. with:
sar -s 8:00 -e 17:15 -f /var/adm/sa/sadd
You can save sar info to view later with the commands:
sar -A -o data.file interval count > /dev/null &     (saves sar data to data.file)
sar -f data.file                                     (reads sar info back from the saved file)
All data is captured in binary form and saved to a file (data.file). The data can then be
selectively displayed with the sar command using the -f option.
sar summary
sar helps to tell quickly if a system is I/O bound. Remember though, that a busy system can
mask I/O issues since io_wait counters are not increased if the CPUs are busy. Compare sar
-d to iostat on your system and check the man pages for the different options to use. You
may prefer the sar -d output to iostat.
sar can help to save a history of I/O performance so you have a baseline measurement for
each host. You can then verify whether tuning changes make a difference or not. You may want,
for example, to collect sar data for a week and create reports: 8 a.m.–5 p.m. Monday–Friday
if that is prime time for random I/O; 6 p.m.–6 a.m. Saturday–Sunday if those are batch/backup
windows.
7.3.3 vmstat
The vmstat utility is a useful tool for taking a quick snapshot or overview of the system’s
performance. It is easy to see what is happening to the CPU, paging, swapping, interrupts, I/O
wait, and much more. There are several reports that vmstat can provide. These reports vary
slightly between the different versions of UNIX. Some of the I/O-related system information
can be gathered by entering the following options:
vmstat scdisk13 scdisk14 To display a summary of the statistics since boot including
statistics for logical disks scdisk13 and scdisk14
Tip: vmstat presents an average-since-boot on the first line. When running vmstat over an
interval, just disregard the first line of the vmstat output.
An example of vmstat output (over an interval of 2 seconds with a count of 5) for Sun Solaris
is shown in Example 7-7.
HP-UX has similar vmstat output as shown in Example 7-8 on page 207. Notice that with the
-d flag, you can see transfer statistics for disks.
Disk Transfers
device xfer/sec
c0t6d0 0
The vmstat output for HP-UX, AIX, and Sun Solaris is similar on all three platforms. Some
important fields are:
r - runque       Shows the number of tasks waiting for CPU resources.
b - blocked      Indicates processes are waiting on a resource, usually I/O related.
pi - page in     Page-ins from paging space indicate a shortage of free memory and
                 that swapping is occurring. Swapping activity can incur I/O costs.
us - user CPU    Shows the amount of CPU used by user application code.
sy - system CPU  Shows the percent of CPU being used to service the operating
                 system.
id - idle        The percent of CPU that is idle.
wa - wait        The percent of time the CPUs are idle, waiting on I/O to complete (AIX
                 only).
vmstat reports are vital in determining what is happening to the system on a real-time basis.
Signs of an I/O problem include:
High I/O wait percent (AIX includes this information in vmstat output in the wa column),
which indicates that a majority of the CPU cycles are waiting for I/O operations to complete.
This value has been creeping up as CPU performance has outpaced storage performance. To
put this in perspective, it is common these days for the CPU to tick off over 10,000,000 cycles
waiting for one I/O.
High number of blocked processes. This normally indicates that a lot of
processes are waiting on a single resource; usually it is I/O related.
High paging space paging rate, which indicates an overload on the system memory.
High number of page faults, which could mean that the system is not making efficient use
of memory for caching files.
The vmstat command is only the first step to look for performance problems. It gives an
indication of where the performance problem could be located. With this in mind, choose a
resource-specific command and take a deeper look into the system behavior.
The topas and nmon tools are very thorough, providing an overall view of system
performance including such performance statistics as CPU busy, memory usage, disk I/O,
adapter I/O, top processes, and paging activity. The filemon and lvmstat tools look at I/O
performance in more detail and can be used to see which applications and file systems a host
spends the most time handling I/O for.
The nmon tool is especially good for monitoring DS6000 activity, because it can report iostats
based on either:
hdisks
vpaths
Ranks
Adapter statistics including SCSI and Fibre Channel adapters
7.4.1 topas
The interactive AIX tool, topas, is convenient if you want to get a quick overall view of the
system’s current activity. A fast snapshot of memory usage or user activity can be a helpful
starting point for further investigation. However, topas is of very limited use as a diagnostic
tool, when you are dealing with a large number of logical disks on a DS6000 since it reports
I/O on hdisks. Example 7-10 contains a sample topas output.
For monitoring DS6000 I/O on AIX hosts, we recommend the use of another tool called nmon,
which is discussed in the next section.
7.4.2 nmon
The nmon tool and analyzer for AIX and Linux is a great storage performance analysis
resource, and it is free! It is written by Nigel Griffiths who works for IBM in the United
Kingdom. This is one of the tools we use when performing customer benchmarks. It is
available at:
http://www.ibm.com/developerworks/eserver/articles/analyze_aix/
Note: The nmon tool is not formally supported. No warranty is given or implied, and you
cannot obtain help or maintenance from IBM.
nmon currently comes in two versions for running on different levels of AIX:
nmon version 10 for AIX 5L™
nmon version 9 for AIX 4.X. This version is functionally stabilized and will not be
developed further.
The interactive nmon tool is very similar to monitor or topas, which you may have used before
to monitor AIX, but it offers many more features that are useful for monitoring DS6000
performance. We will explore these interactive options below.
Unlike topas, the nmon tool can also record data that can be used to establish a baseline of
performance for comparison later. Recorded data can be saved in a file and imported into the
nmon analyzer (spreadsheet format) for easy analysis and graphing.
The different options you can select when running nmon version 10 are shown in
Example 7-11.
Then start nmon with the -g flag to point to the map file:
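For example, assuming a hypothetical map file named /tmp/diskgroups:
nmon -g /tmp/diskgroups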
When nmon starts, press the G key to view stats for your disk groups. An example of the output
is shown in Example 7-15.
Notice that:
nmon reports real-time iostats for the different disk groups.
In this case, the disk groups we created are for volume groups.
You can create logical groupings of hdisks for any kind of group you like.
You can make multiple disk-group map files and start nmon -g <map-file> to report on
different groups.
To enable nmon to report iostats based on Ranks, you can make a disk-group map file listing
Ranks with the associated hdisk members.
Use the SDD command datapath query essmap to provide a view of your host system’s
logical configuration on the DS6000. You could, for example, create an nmon disk
group by storage type, LSS, Rank, port, and so on, to give you unique views
into your storage performance.
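As a sketch, a disk-group map file derived from datapath query essmap output might look
like the following (the Rank IDs and hdisk names are hypothetical; each line gives a group
name followed by its member hdisks):
R0500 hdisk4 hdisk8 hdisk12 hdisk16
R0501 hdisk5 hdisk9 hdisk13 hdisk17
Starting nmon with -g pointing at this file and pressing G then reports iostats aggregated
per Rank.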
Recording nmon information for import into the nmon analyzer tool
A great benefit nmon provides is the ability to collect data over time to a file and then just
import the file into the nmon analyzer tool, which can be found at:
http://www.ibm.com/developerworks/eserver/articles/analyze_aix/
To collect nmon data in comma-separated format for easy spreadsheet import, do the
following:
1. Run nmon with the -f flag. See nmon -h for the details, but as an example, to run nmon for an
hour capturing data snapshots every 30 seconds, use:
nmon -f -s 30 -c 120
2. This will create the output file in the current directory called:
<hostname>_date_time.nmon
Many spreadsheets have fixed numbers of columns and rows. We suggest you collect a
maximum of 300 snapshots to avoid hitting these issues.
7.4.3 filemon
The filemon command monitors a trace of file system and I/O system events, and reports
performance statistics for files, virtual memory segments, logical volumes, and physical
volumes. The filemon command is useful to those whose applications are believed to be
disk-bound, and want to know where and why.
The filemon command provides a quick test to determine if there is an I/O problem by
measuring the I/O service times for reads and writes at the disk and logical volume level.
The filemon command resides in /usr/bin and is part of the bos.perf.tools file set, which can
be installed from the AIX base installation media.
filemon syntax
The syntax of the filemon command is as follows:
filemon [-d] [-i Trace_File -n Gennames_File] [-o File] [-O Levels] [-P] [-T n] [-u] [-v]
Flags:
-i Trace_File
Reads the I/O trace data from the specified Trace_File, instead of from the real-time trace
process. The filemon report summarizes the I/O activity for the system and period
represented by the trace file. The -n option must also be specified.
-n Gennames_File
Specifies a Gennames_File for offline trace processing. This file is created by running the
gennames command and redirecting the output to a file as follows (the -i option must also
be specified): gennames >file.
-o File
Writes the I/O activity report to the specified file instead of to the stdout file.
-d
Starts the filemon command, but defers tracing until the trcon command has been
executed by the user. By default, tracing is started immediately.
-T n
Sets the kernel’s trace buffer size to n bytes. The default size is 32,000 bytes. The buffer
size can be increased to accommodate larger bursts of events (a typical event record size
is 30 bytes).
-P
Pins monitor process in memory. The -P flag causes the filemon command's text and data
pages to be pinned in memory for the duration of the monitoring period. This flag can be
used to ensure that the real-time filemon process is not paged out when running in a
memory constrained environment.
-v
Prints extra information in the report. The most significant effect of the -v flag is that all
logical files and all segments that were accessed are included in the I/O activity report,
instead of only the 20 most active files and segments.
filemon measurements
To provide a more complete understanding of file system performance for an application, the
filemon command monitors file and I/O activity at four levels:
Logical file system
The filemon command monitors logical I/O operations on logical files. The monitored
operations include all read, write, open, and seek system calls, which may or may not
result in actual physical I/O depending on whether the files are already buffered in
memory. I/O statistics are kept on a per-file basis.
Virtual memory system
The filemon command monitors physical I/O operations (that is, paging) between
segments and their images on disk. I/O statistics are kept on a per segment basis.
Logical volumes
The filemon command monitors I/O operations on logical volumes. I/O statistics are kept
on a per-logical volume basis.
Physical volumes
The filemon command monitors I/O operations on physical volumes. At this level, physical
resource utilizations are obtained. I/O statistics are kept on a per-physical volume basis.
filemon examples
A simple way to use filemon is to run the command shown in Example 7-16, which will:
Run filemon for 2 minutes and stop the trace.
Store output in /tmp/fmon.out.
Just collect logical volume and physical volume output
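A command sequence that accomplishes these three steps would look something like this
sketch (run as root on AIX):
filemon -o /tmp/fmon.out -O lv,pv
sleep 120
trcstop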
To produce some sample output for filemon, we ran a sequential write test in the background,
and started a filemon trace, as shown in Example 7-17. We used the lmktemp command to
create a 2 GB file full of nulls while filemon gathered I/O stats.
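If lmktemp is not available, a roughly equivalent load generator using standard dd would be
this sketch (the file system name is hypothetical):
dd if=/dev/zero of=/testfs/2GBfile bs=1024k count=2048 &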
In Example 7-18, we look at parts of the /tmp/fmon.out file. When analyzing the output from
filemon, focus in on:
Most active physical volume.
– Look for balanced I/O across disks.
– Lack of balance may be a data layout problem.
Look at I/O service times at physical volume layer.
– Writes to cache that average less than 2 ms are good. Writes averaging significantly and
consistently higher indicate that the write cache is full, and there is a bottleneck in the disk.
– Reads that average less than 10 ms to 20 ms are good. The disk subsystem read cache hit
rate affects this value considerably. Higher read cache hit rates will result in lower I/O
service times, often near 5 ms or less. If reads average greater than 15 ms, it can
indicate that something between the host and the disk is a bottleneck, though it usually
indicates a bottleneck in the disk subsystem.
– Look for consistent I/O service times across physical volumes. Inconsistent I/O service
times can indicate unbalanced I/O or a data layout problem.
– Longer I/O service times can be expected for I/Os that average greater than 64 KB in
size.
– Look at the difference between the I/O service times between the logical volume and
the physical volume layers. A significant difference indicates queuing or serialization in
the AIX I/O stack.
The fields in the filemon report are as follows:
util Utilization of the volume (fraction of time busy). The rows are sorted by
this field, in decreasing order. The first number, 1.00, means 100
percent.
description Contents of volume; either a file system name, or logical volume type
(jfs2, paging, jfslog, jfs2log, boot, or sysdump). Also indicates if the file
system is fragmented or compressed.
(Example 7-18 shows excerpts of /tmp/fmon.out, including the sections Detailed Logical
Volume Stats (512 byte blocks) and Detailed Physical Volume Stats (512 byte blocks);
intervening output is skipped.)
The filemon command is a very useful tool to determine where a host is spending I/O. More
details on the filemon options and reports are available in the publication AIX 5L
Performance Tools Handbook, SG24-6039, which can be downloaded from:
http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/SG246039.html
7.4.4 lvmstat
A new performance monitoring tool was introduced in AIX 5L called lvmstat, which reports
input and output statistics for logical partitions, logical volumes, and volume groups. The
lvmstat command is useful in determining the I/O rates to LVM volume groups, logical
volumes and logical partitions. This is useful for dealing with unbalanced I/O situations where
data layout was not considered initially.
The lvmstat command generates reports that can be used to change the logical volume
configuration to better balance the input and output load between physical disks.
lvmstat resides in /usr/sbin and is part of the bos.rte.lvm file set, which is installed by default
from the AIX 5L base installation media.
Flags:
-c Count prints only the specified number of lines of statistics.
-C Causes the counters that keep track of the iocnt, Kb_read, and
Kb_wrtn to be cleared for the specified logical volume or volume
group.
-d Specifies that statistics collection should be disabled for the logical
volume or volume group specified.
-e Specifies that statistics collection should be enabled for the logical
volume or volume group specified.
Parameters:
Name Specifies the logical volume or volume group name to monitor.
Interval The interval parameter specifies the amount of time, in seconds,
between each report. If Interval is used to run lvmstat more than
once, no reports are printed if the statistics did not change since the
last run. A single period is printed instead.
The first report section generated by lvmstat provides statistics concerning the time since the
statistical collection was enabled. Each subsequent report section covers the time since the
previous report. All statistics are reported each time lvmstat runs. The report consists of a
header row, followed by a line of statistics for each logical partition or logical volume
depending on the flags specified.
If the statistics collection has not been enabled for the volume group or logical volume you
want to monitor, lvmstat will report an error like:
#lvmstat -v rootvg
0516-1309 lvmstat:Statistics collection is not enabled for this logical device.
Use -e option to enable.
To enable statistics collection for all logical volumes in a volume group (in this case the rootvg
volume group), use the -e option together with the -v <volume group> flag as the following
example shows:
#lvmstat -v rootvg -e
When you do not need to continue collecting statistics with lvmstat, it should be disabled
because it impacts the performance of the system. To disable statistics collection for all
logical volumes in a volume group (in this case the rootvg volume group), use the -d option
together with the -v <volume group> flag as the following example shows:
#lvmstat -v rootvg -d
This will disable the collection of statistics on all logical volumes in the volume group.
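Putting it together, a typical lvmstat session might look like the following sketch (the
interval and count values are arbitrary):
lvmstat -v rootvg -e       (enable statistics collection)
lvmstat -v rootvg 60 5     (report every 60 seconds, five times)
lvmstat -v rootvg -d       (disable collection when finished)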
The lvmstat tool has powerful options such as reporting on a specific logical volume, or only
reporting busy logical volumes in a volume group. For more information about using the
lvmstat command and other tuning commands in detail, check the publication AIX 5L
Performance Tools Handbook, SG24-6039, which can be downloaded from:
http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/SG246039.html
There are some commands SDD provides that are specific for each platform, and we will
cover some of the AIX, HP-UX, and Sun Solaris SDD commands here. All three platforms
have the useful SDD command datapath query available for use.
A summary of the SDD commands and the different operating system platforms that they are
available for is shown in Table 7-1 on page 219.
Command          AIX    HP-UX    Sun Solaris
addpaths         X
cfallvpath       X
chgvpath                X
ckvpath                 X
datapath         X      X        X
defvpath                X        X
dpovgfix         X
extendvg4vp      X
get_root_disks          X        X
gettrace                X
hd2vp            X      X
lquerypr         X
lsvpcfg          X
mkvg4vp          X
pathtest         X      X        X
querysn          X      X
restvg4vp        X
rmvpath                 X        X
savevg4vp        X
showvpath               X        X
vp2hd            X      X
vpathmkdev                       X
addpaths Dynamically adds paths to SDD devices while they are in the Available state.
dpovgfix Fixes a SDD volume group that has mixed vpath and hdisk physical volumes.
hd2vp The SDD script that converts a DS6000 hdisk device volume group to a
Subsystem Device Driver vpath device volume group.
vp2hd The SDD script that converts a SDD vpath device volume group to a
DS6000 hdisk device volume group.
querysn The SDD driver tool to query unique serial numbers of DS6000 devices. This is
used to exclude certain LUNs from SDD, e.g., boot disks.
savevg4vp Backs up all files belonging to a specified volume group with SDD devices.
cfallvpath Fast-path configuration method to configure the SDD pseudo-parent dpo and all
SDD vpath devices.
restvg4vp Restores all files belonging to a specified volume group with SDD devices.
addpaths
In a SAN environment, where servers are attached to SAN switches, the paths from the
server to the DS6000 are controlled by zones created with the SAN switch software. You may
want to add a new path and remove another for planned maintenance on the DS6000 or for
proper load balancing. You can take advantage of the addpaths command to make the
changes live.
lsvpcfg
To display which DS6000 vpath devices are available to provide fail over protection, run the
lsvpcfg command. You will see output similar to that shown in Example 7-20.
Notice in the example that vpath0, vpath1, and vpath2 all have a single path (hdisk device)
and, therefore, will not provide fail over protection because there is no alternate path to the
LUN.
The command lsvg -p vpathvg lists the physical volumes making up the volume group
vpathvg. Notice that hdisk46 is listed among the other vpath devices. This is not correct for fail
over and load balancing, because access to the DS6000 logical disk with serial number
02DFA067 is using a single path hdisk46 instead of vpath11. The system is operating in a
mixed-mode with vpath pseudo devices and partially uses hdisk devices.
lsvg -p vpathvg
vpathvg:
PV_NAME PV STATE TOTAL PPs FREE PPs FREE DISTRIBUTION
vpath10 active 29 4 00..00..00..00..04
hdisk46 active 29 4 00..00..00..00..04 ! MIXED MODE- HDISKs and VPATHS !
vpath12 active 29 4 00..00..00..00..04
vpath13 active 29 28 06..05..05..06..06
To fix this problem, run the command dpovgfix volume_group_name. Then re-run the lsvpcfg
or lsvg command to verify.
Note: In order for the dpovgfix shell script to be executed, all mounted file systems of this
volume group have to be unmounted. After successful completion of the dpovgfix shell
script, mount the file systems again.
These two conversion programs (hd2vp and vp2hd) require that a volume group contain either
all original DS6000 hdisks or all SDD vpaths. The program fails if a volume group contains
both kinds of device special files (a mixed volume group). You may need to use dpovgfix
first to fix a volume group so that it contains only one kind of device.
The SDD driver will automatically exclude any DS6000 devices from the SDD configuration, if
these DS6000 boot devices are the physical volumes of an active rootvg.
Tip: If you require dual or multiple boot capabilities on a server, and multiple operating
systems are installed on multiple DS6000 boot devices, you should use the querysn
command to manually exclude all DS6000 boot devices that belong to multiple non-active
rootvg volume groups on the server.
SDD V1.3.3.3 allows you to manually exclude DS6000 devices from the SDD configuration.
The querysn command reads the unique serial number of a DS6000 device (hdisk) and saves
the serial number in an exclude file, /etc/vpexclude.
During the SDD configuration, SDD configure methods read all the serial numbers in this
exclude file and exclude these DS6000 devices from the SDD configuration.
The exclude file, /etc/vpexclude, holds the serial numbers of all inactive DS6000 devices
(hdisks) in the system. If an exclude file exists, the querysn command will add the excluded
serial number to that file. If no exclude file exists, the querysn command will create one. There
is no user interface to this file.
Tip: You should not use the querysn command on the same logical device multiple times.
Using the querysn command on the same logical device multiple times results in duplicate
entries in the /etc/vpexclude file, and the system administrator will have to administer the
file and its content.
The benefit is multipathing to your paging spaces. All the same commands for
hdisk-based volume groups apply when using vpath-based volume groups for paging spaces.
Important: IBM does not recommend moving the primary paging space out of rootvg.
Doing so may mean that no paging space is available during the system startup. Do not
redefine your primary paging space using vpath devices.
lquerypr
The lquerypr command implements certain SCSI-3 persistent reservation commands on a
device. The device can be either an hdisk or an SDD vpath device. This command supports
the following persistent reserve service actions: read reservation key, release persistent
reservation, preempt-abort persistent reservation, and clear persistent reservation.
Flags:
-p    If the persistent reservation key on the device is different from the
      current host reservation key, it preempts the persistent reservation key
      on the device.
-c    If there is a persistent reservation key on the device, it removes any
      persistent reservation and clears all reservation key registrations on the
      device.
-r    Removes the persistent reservation key on the device made by this
      host.
-v    Displays the persistent reservation key if it exists on the device.
-V    Verbose mode. Prints detailed messages.
This command queries the persistent reservation on the device. If there is a persistent
reserve on a disk, it returns 0 if the device is reserved by the current host. It returns 1 if the
device is reserved by another host. Caution must be taken with the command, especially
when implementing preempt-abort or clear persistent reserve service action. With
preempt-abort service action not only the current persistent reserve key is preempted; it also
aborts tasks on the LUN that originated from the initiators that are registered with the
preempted key. With clear service action, both persistent reservation and reservation key
registrations are cleared from the device or LUN.
This command is useful if a disk was attached to one system and was not varied off, leaving
SCSI reserves on the disk and preventing another system from accessing it.
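As a sketch (the device name is hypothetical), you could first display a leftover
reservation and then release one held by this host:
lquerypr -vh /dev/vpath0     (display the persistent reservation key, if any)
lquerypr -rh /dev/vpath0     (release the reservation made by this host)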
It is a good idea to check periodically to make sure none of the volume groups are using
hdisks instead of vpaths. You can verify the path status several ways. Some commands are:
lspv (look for hdisk with volume group names listed)
lsvpcfg
lsvg -p <vgname>
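For the lspv check, a quick filter such as this sketch lists the hdisks that belong to a
volume group (on a healthy SDD configuration, only vpaths should carry volume group names):
lspv | grep hdisk | grep -v None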
Remember to change any scripts you may have that call savevg or restvg and change the
calls to savevg4vp and restvg4vp.
rmvpath [-all, -vpathname] Removes SDD vpath devices from the configuration.
showvpath
The showvpath command for HP-UX is similar to the lsvpcfg command for AIX. Use
showvpath to verify that an HP-UX vpath is using multiple paths to the DS6000. An example of
the output from showvpath is displayed in Example 7-22.
Notice that vpath1 in the example has four paths to the DS6000. vpath2, however, has a
single point of failure since it is only using a single path.
Tip: You can use the output from showvpath to modify iostat or sar information to report
stats based on vpaths instead of hdisks. Gather iostats to a file, and then replace the disk
names with the corresponding vpaths.
On Sun Solaris, SDD resides above the Sun SCSI disk driver (sd) in the protocol stack. For
more information about how SDD works, refer to 5.6, “Subsystem Device Driver (SDD) -
multipathing” on page 157. SDD is supported for the DS6000 on Solaris 8/9.
Some specific commands SDD provides to Sun Solaris are listed below as well as the steps
to update SDD after making DS6000 logical disk configuration changes for a Sun server.
cfgvpath
The cfgvpath command configures vpath devices using the following process:
Scan the host system to find all DS6000 devices (LUNs) that are accessible by the Sun
host.
Determine which DS6000 devices (LUNs) are the same devices that are accessible
through different paths.
Create configuration file /etc/vpath.cfg to save the information about DS6000 devices.
With the -c option: cfgvpath exits without initializing the SDD driver. The SDD driver will be
initialized after reboot. This option is used to reconfigure SDD after a hardware
reconfiguration.
Without the -c option: cfgvpath initializes the SDD device driver vpathdd with the
information stored in /etc/vpath.cfg and creates pseudo-vpath devices
/devices/pseudo/vpathdd*.
vpathmkdev
The vpathmkdev command creates files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories
by creating links to the pseudo-vpath devices /devices/pseudo/vpathdd*, which are created by
the SDD driver.
Files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories provide block and character access
to an application the same way as the cxtydzsn devices created by the system. The
vpathmkdev command is executed automatically during SDD package installation and should
be executed manually to update files vpathMsN after hardware reconfiguration.
showvpath
The showvpath command lists all SDD devices and their underlying disks. An example of the
showvpath command is displayed in Example 7-23.
Tip: Note that you can use the output from showvpath to modify iostat or sar information
to report stats based on vpaths instead of hdisks. Gather iostats to a file, and then replace
the disk device names with the corresponding vpaths.
For specific information about SDD commands, check IBM TotalStorage Multipath Subsystem
Device Driver User’s Guide, SC30-4096.
See Chapter 12, “Understanding your workload” on page 407 for an understanding of
workloads.
The UNIX dd command is a great tool to drive sequential read workloads or sequential write
workloads against the DS6000. It will be rare that you can actually drive the DS6000 at the
maximum data rates that you see in published performance benchmarks. But, once you
understand how your total configuration (for instance, a DS6000 attached with 4 SDD paths
through two SANs to your host with 2 HBAs, 4 CPUs, and 1 GB of memory) performs against
certain dd commands, you will have a baseline from which you can compare things like
operating system kernel parameter changes or different logical volume striping techniques in
order to improve performance.
While running the dd command in one host session, we recommend you use the UNIX
commands and shell scripts presented earlier in this chapter. We will assume that, at a
minimum, you will have the AIX nmon tool running with the c, a, e, and d features turned on.
Below, we will be running lots of different kinds of dd commands. If, at any time, you want to
make sure there are no dd processes running on your system, execute the following
kill-grep-awk command:
kill -kill `ps -ef | grep dd | awk '{ print $2 }'`
Caution: Use extreme caution when using the dd command to perform a sequential write
operation. Ensure the dd is not writing to a device file that is part of the UNIX operating
system.
7.7.1 Using the dd command to test sequential Rank reads and writes
To test the sequential read speed of a Rank, you can run the command:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
If you determine that the average read speed for your vpaths is, for example, 50 MB/s (the
command above reads about 100 MB, so an elapsed time of 2 seconds equates to 50 MB/s), then
you know you need to stripe your future logical volumes across at least 4 different Ranks to
achieve 200 MB/s sequential read speeds.
Let’s explore the dd command some more. Issue the following command:
dd if=/dev/rvpath0 of=/dev/null bs=128k
Your nmon monitor (the e option) should report that the above command has imposed a
sustained 100 MB/s bandwidth with a block size=128k on vpath0. Notice the xfers/sec
column; xfers/sec is IOPS. Now, if your dd command has not already errored out because it
reached the end of the disk, hit <ctrl-C> to stop the process. nmon reports idle. Next issue the
following dd command with a 4 KB block size and put it in the background:
dd if=/dev/rvpath0 of=/dev/null bs=4k &
For the above command, nmon should report a lower MB/s but a higher IOPS. That is the
nature of I/O as a function of block size. Use the above kill-grep-awk command to clear out
all the dd processes from your system. Try your dd sequential read command with bs=1024k
and you should see a high MB/s but a reduced IOPS. Now start several of these commands
and watch your throughput increase until it reaches a plateau - something in your
configuration (CPUs? HBAs? a DS6000 Rank?) has become a bottleneck. This is as fast as
your hardware configuration can perform sequential reads for a specific block size. The
kill-grep-awk script will clear everything out of the process table for you. Try loading up
another raw vpath device (vpath1) device. Watch the performance of your HBAs (nmon a
option) approach 200 MB/second.
You can perform the same kinds of tests against the block vpath device, vpath0. What is
interesting here is that you will always observe the same I/O characteristics, no matter what
block size you specify. That is because, in AIX anyway, the Logical Volume Manager breaks
everything up into 4 KB blocks for both reads and writes. Run the following two commands separately.
nmon should report about the same for both:
dd if=/dev/vpath0 of=/dev/null bs=128k
dd if=/dev/vpath0 of=/dev/null bs=4k
Use caution when using the dd command to test sequential writes. If LUNs have been
incorporated into the operating system using logical volume manager (LVM) commands, and
the dd command is used to write to the LUNs, they won’t be part of the operating system
anymore, and the operating system will not like that one bit. For example, if you want to write
to a vpath, that vpath should not be part of a LVM volume group. And if you want to write to a
LVM logical volume, it should not have a file system on it and if the logical volume has a
logical volume control block (LVCB), you should skip over the LVCB when writing to the logical
volume. It is possible to create a logical volume without a LVCB by using the mklv -T O option.
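For instance, a raw logical volume for dd write testing could be created without an LVCB as
in this sketch (the name and size are hypothetical):
mklv -T O -y ddtest_lv 6000vg 8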
Try different block sizes, different raw vpath devices, combinations of reads and writes. Run
the commands against the block device (/dev/vpath0) and notice that block size does not
affect performance.
2. The next thing to do is to run sequential reads and writes to all of the vpath devices (raw or
block) for about an hour. Use the commands discussed in 7.7.1, “Using the dd command
to test sequential Rank reads and writes” on page 227. Then take a look at your SAN
infrastructure to see how it is doing.
Look at the UNIX error report. Problems will show up as storage errors, disk errors, or
adapter errors. If there are problems, they will not be hard to find in the error report - there
will be a lot of them. Troubleshooting at this stage can be fun. The source of the problem
could be hardware problems on the storage side of the SAN, Fibre Channel cables or
connections, or down-level device drivers or device (HBA) microcode. If you see something
like the errors shown in Example 7-26, stop and get them fixed.
Ensure that, after running an hour’s worth of dd commands on all your vpaths, there are
no storage errors in the UNIX error report.
3. Next issue the following command to see if SDD is correctly load balancing across paths
to the LUNs:
datapath query device
Output from this command will look like Example 7-27.
Total Devices : 16
Check to make sure, for every LUN, the counters under the Select column are the same
and that there are no errors.
4. The next thing to do is spot check the sequential read speed of the raw vpath device. The
following command is an example of the command run against a LUN called vpath0. For
the LUNs you test, ensure they each yield the same results.
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
Tip: For the dd command above, the first time it is run against rvpath0, the I/O must be
read from disk and staged to the DS6000 cache. The second time it is run, the I/O is
already in cache. Notice the shorter read time when we get an I/O cache hit.
Of course, if any of these LUNs are on Ranks that are also being used by another
application, you should see a variation in the throughput. If there is a large variation in the
throughput, perhaps that LUN should be given back to the storage administrator; trade for
another one. You want all your LUNs to have the same performance.
If everything looks good, then continue with the configuration of volume groups and logical
volumes.
For HP-UX, use the prealloc command instead of lmktemp for AIX to create large files. For
Sun Solaris, use the mkfile command.
In the following sections, we will explore the considerations associated with creating volume
groups, logical volumes, and file systems. We will use the AIX Logical Volume Manager (LVM)
to create examples, but the topics discussed are applicable to all UNIX platforms.
Remember the recommended method discussed in 7.2.1, “I/O balanced across Extent Pools”
on page 191? We build upon that concept as we continue with the LVM configuration. So at
this point, the LUNs have already been created in the DS6000 and assigned to your host
system. The following sections explore what we need to do next.
Note: For AIX 5.2 and beyond, JFS2 and the 64 bit kernel are recommended. Note that the
nointegrity filesystem mount option is not supported in JFS2.
When creating the volume group, there are LVM limits to consider along with potential
expansion of the volume group. The main LVM limits for a volume group are shown in
Table 7-5.
To create the volume group, if you are using SDD, you use the mkvg4vp command. And if you
are using SDDPCM, you use the mkvg command. All the flags for the mkvg command apply to
the mkvg4vp command.
We recommend using the smallest physical partition (PP) size you can that will allow for
growth. A physical partition size in the 8 MB - 32 MB range is a good starting point for
planning. This will minimize wasted space and allow for expansion of the volume group later.
One reason to keep the physical partition size small is because, when creating an inter-disk
logical volume, the smallest logical volume that should be created is the product of multiplying
the physical partition size by the number of LUNs in the volume group. This is the minimum
unit of allocation (MUA). All of the logical volumes you create should be multiples of this MUA.
If the physical partition size is large, and the number of LUNs is large, then the minimum
logical volume that would result would be large. This situation could lead to wasted space if
many of the logical volumes for your application do not need to be at least as large as the
MUA.
Consider an example where we have one 100 GB LUN called vpath4. The following
command will create a volume group called 6000vg with a 128 MB physical partition size.
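A sketch of such a command (mkvg4vp chooses the physical partition size automatically when
-s is not specified):
mkvg4vp -y 6000vg vpath4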
If you look at the characteristics of the new volume group, you will see the output similar to
Example 7-28.
The above mkvg4vp command chose a physical partition size of 128 MB for two reasons: it
keeps the number of physical partitions per physical volume at 1016 or less, and it keeps
the maximum number of physical volumes for this volume group at no more than 32.
In the design of the AIX Logical Volume Manager (LVM), each logical partition maps to a
physical partition (except when using OS mirroring when one logical partition maps to two or
three physical partitions). Each physical partition maps to a number of disk sectors. The
design of LVM limits the number of physical partitions that LVM can track per disk to 1016. In
most cases, not all of the possible 1016 tracking partitions are used by a disk. In the above
example, (102272 / 128 =) 799 partitions are used.
In Example 7-29, notice the physical partition size was brought down into a more desirable
range, but at the expense of the number of LUNs allowed in the volume group! From
Table 7-6, notice that the t factor for this situation is 4. The relationship is that the maximum
number of physical volumes that can be included in a volume group will be reduced to (MAX
PVs / t factor). The t factor is between 1 and 16 for standard volume groups and it is between
1 and 64 for big volume groups. The manual page for the mkvg and the chvg commands talks
more about the t factor.
Also notice in Example 7-29 that the t factor rule will allow 8 LUNs of 100 GB into the
standard volume group. There is also a limit on the maximum number of logical partitions in a
logical volume - that limit is always 32,512 as shown in Table 7-5 on page 234.
t factor    Max PPs per PV    Max PVs (standard VG)    Max PVs (big VG)
1           1016              32                       128
2           2032              16                       64
4           4064              8                        32
8           8128              4                        16
16          16256             2                        8
32          32512             -                        4
64          65024             -                        2
Remember the recommended method described in 7.2.1, “I/O balanced across Extent Pools”
on page 191? It basically states that one LUN from each Rank (Array set) should be assigned
to your host system initially. Likewise, when planning for future expansion of your volume
group, it is reasonable to plan for adding another Array set to your volume group. But you may
not be able to add another Array set to the volume group because of LVM limitations, unless
you have planned accordingly. For example, 16 or more LUNs (an Array set) could be initially
assigned to a host server. If the physical partition size of these LUNs had to be reduced using
the t factor, then the AIX standard volume group could not grow by the addition of a second
Array set because the total size of the volume group would have been reduced below 32 by
the t factor. This is what we were referring to earlier in this section, when we recommended
that the smallest physical partition size should be used that would allow for growth. It is also a
good idea to always keep a couple of free PPs on each disk in the volume group so it can be
changed from a standard volume group to a big volume group, or to a scalable volume group
- this allows expansion of the VGDA structure on each disk in the volume group. It is
necessary for the VGDA to grow to convert from a standard volume group to a big or a
scalable volume group, or from a big to a scalable volume group.
Note:
Use the smallest physical partition size that will allow for growth of your volume group.
Always keep a couple of free PPs on each disk in the volume group.
It is obvious that the AIX standard volume group has limitations that could quickly become an
issue with large storage allocations from the DS6000. We recommend creating AIX big
volume groups to manage DS6000 LUNs. As Table 7-5 on page 234 shows, a big volume
group can accommodate up to 128 physical volumes. To create a big volume group, use the
following command format:
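A sketch of such a command, combining the -B (big volume group) flag with a t factor of 4:
mkvg4vp -B -t 4 -y 6000vg vpath4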
We are still working with 100 GB LUNs. The above command uses a t factor of 4 and yields
the results shown in Example 7-30:
If the volume group might grow beyond 128 disks, use the mkvg4vp -G option. Be aware that
because the volume group descriptor area (VGDA) is increased substantially in big volume
groups, you can expect VGDA update operations (creating a logical volume, changing a
logical volume, adding a physical volume, etc.) to take longer on big volume groups than it
takes in a standard volume group.
We have been considering large LUNs to test the limits of the AIX LVM and volume groups.
But large LUNs on single Ranks has nothing to do with the recommended method. Let’s
explore another example using eight LUNs that are 68 GB each.
mkvg4vp -B -y 6000vg vpath8 vpath9 vpath10 vpath11 vpath12 vpath13 vpath14 vpath15
The above command creates the volume group shown in Example 7-31.
Let us get the physical partition size down to, say, 16 MB and see what this volume group
looks like. Imposing a physical partition size and invoking the t factor, the command looks like:
mkvg4vp -L 256 -B -f -s 16 -y 6000vg vpath8 vpath9 vpath10 vpath11 vpath12 vpath13 vpath14
vpath15
The new volume group has a good physical partition size and room for three more sets of
LUNs, so there is plenty of room for growth. What more could you want out of a volume
group? Only that the LTG size did not change to 256 KB; we could not get this to work from
the command line.
However, the following command will fix this:
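A likely form, based on the chvg manual (the volume group name is assumed):

chvg -L 256 6000vg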
With AIX 5.3, the LTG size is set automatically, so this use of the chvg or mkvg commands
does not apply.
For randomly accessed logical volumes, we recommend using the maximum inter-policy,
which is also referred to as the inter-disk policy. To create logical volumes that use this
policy, use the -e x flag on the mklv command.
Note: We recommend the maximum inter-policy logical volume for randomly accessed
data. This logical volume is created using the -e x flag of the mklv command. Or in smit,
set the RANGE of physical volumes to maximum and specify all the vpaths in the volume
group.
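As a minimal sketch (the logical volume name, size in logical partitions, and volume group
are illustrative), a maximum inter-disk policy logical volume can be created like this:

mklv -e x -y randomlv 6000vg 640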
The DS6000 is capable of exceptional throughput for the different types of I/O. Understanding
your I/O workload characteristics (see Chapter 12, “Understanding your workload” on
page 407) will allow you to further maximize the performance gains you obtain from the
DS6000. In order to choose the best type of logical volumes to create, it will be necessary to
know which of the following is more predominant in each of your applications:
Lots of small random I/O operations
Large sequential I/O operations
A combination of the above
Figure: logical disk vpath0 is divided into 16 MB physical partitions (pp1 through pp500);
eight of these partitions (lp1 through lp8) form the 128 MB logical volume
/dev/non_striped_lv.
In this example, the logical volume manager (LVM) of the host operating system has created
a volume group which has divided logical disk vpath0 into 16 MB physical partitions. A
non-striped logical volume is simply a logical grouping of eight of these partitions to
create a 128 MB logical volume called /dev/non_striped_lv.
A non-striped logical volume would, however, be the kind of logical volume you would create
for applications like DB2 that use the concept of containers within logical volumes. This is
because DB2 can do application striping across containers.
Note: Consider using logical volumes of this type for DB2, which can randomize I/O across
containers.
Figure 7-6 shows an example of the inter-disk policy logical volume. The LVM has created a
volume group containing four LUNs and has created 16 MB physical partitions on the LUNs.
The logical volume in this example is a group of 16 MB physical partitions from four different
logical disks—vpath0, vpath1, vpath2, and vpath3.
Note: We recommend using the inter-disk logical volume for random access workloads.
vpath0, vpath1, vpath2, and vpath3 are hardware-striped LUNs on different DS6000 Extent Pools
8 GB / 16 MB partitions = 500 physical partitions per LUN (pp1–pp500)
/dev/inter-disk_lv is made up of 8 logical partitions
(lp1 + lp2 + lp3 + lp4 + lp5 + lp6 + lp7 + lp8) = 8 × 16 MB = 128 MB
For a discussion of striped file systems, it is necessary to first define a few terms:
On each DDM within a DS6000, the RAID 5 controllers create 256 KB strips on each DDM
in the Array. These strips are used by the RAID 5 hardware to create RAID 5 stripes. For a
6+P RAID 5 Array, the RAID 5 stripe is 1.5 MB. For a 7+P array, the RAID 5 stripe is 1.75
MB. We will always refer to this as a RAID 5 stripe.
With a RAID 5 Array, there is something called the RAID 5 write penalty. This is
experienced when I/O to the RAID array is smaller than the RAID 5 stripe size. The RAID 5
write penalty happens because, to do a logical write, the controller must (1) read the data
that is being overwritten, (2) read the associated parity, calculate the new parity
information using the data that is being written, then (3) write the new data and (4) write
the new parity. This causes a total of four disk I/Os per host write. I/O writes exceeding
the RAID 5 stripe size will not suffer the RAID 5 write penalty. All the data, plus the
parity, is written at one time. This is called a full stripe write.
Let us assume that we, as systems administrators, have just received the storage
requirements for a new application that needs to be put into production. The new application
folks need 5 logical volumes that are 16 GB, 37.3 GB, 305 GB, 34 GB, and 47 GB. We know
that our volume group uses a physical partition size of 16 MB and contains 8 LUNs, so the
MUA is 128 MB. Below is a summary of the requirements in Table 7-7.
Requested size (GB)  Physical partitions to allocate
16                   1032
38                   2440
305                  19528
34                   2184
47                   3016
The equation to determine the number of physical partitions that should be specified is shown
below. It adds one MUA (8) for a little extra space and uses the physical partition size of 16
MB and the fact that there are eight LUNs that we are striping across.
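A form of the equation consistent with the values in Table 7-7 (16 MB partitions, eight
LUNs):

PPs = ((requested GB × 1024 MB) ÷ 16 MB), rounded up to a multiple of 8, plus 8

For the 305 GB request: (305 × 1024) ÷ 16 = 19520, which is already a multiple of 8, so
19520 + 8 = 19528 PPs.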
Notice, in the output of the lvmap.ksh 305glv command (a shell script; see Appendix B, “UNIX
shell scripts” on page 481), the balance in the distribution of the logical volume 305glv
across the LUNs: each of the eight vpaths holds 2441 logical partitions of this logical
volume.
Notice that /dev/striped_lv is also made up of eight 16 MB physical partitions, but each
partition is then subdivided into 64 chunks of 256 KB—only 3 of the 256 KB chunks are
shown per logical partition for space reasons. Most operating systems include a -S flag or
similar to create a striped logical volume.
vpath0, vpath1, vpath2, and vpath3 are hardware-striped LUNs on different DS6000 Extent Pools
8 GB / 16 MB partitions = 500 physical partitions per LUN (pp1–pp500)
Access to /dev/striped_lv has advantages and disadvantages for performance. For random
I/O, there is an increased chance of I/Os spanning strip boundaries, thus increasing the
physical IOPS required to complete the I/O. For sequential I/O, striped logical volumes have
the advantage that I/O throughput is typically higher than for maximum inter-policy logical
volumes, because more physical volumes do I/O in parallel. Striped logical volumes can also
cause increased disk subsystem read ahead, which is an advantage when the pre-fetched data
will in fact be accessed shortly, but a disadvantage when it will not, since it occupies
memory that might be better used for other I/O.
Before discussing striped logical volumes, a few more terms must be defined:
LVM stripe is the size of the stripe that is specified by the mklv -S command. We will refer
to this stripe as an LVM stripe, LV stripe size, or a stripe.
On each DDM within a DS6000, the RAID 5 controllers create a 256 KB strip across each
DDM in the Array. This strip is used by the RAID 5 hardware to create RAID 5 stripes. For
a 6+P RAID 5 Array, the RAID 5 data stripe is 1.5 MB. For a 7+P array, the RAID 5 data
stripe is 1.75 MB. We will always refer to this as a RAID 5 stripe.
AIX release  LTG sizes                        LVM stripe sizes
AIX 5.2      128 KB, 256 KB, 512 KB, 1 MB     4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB,
                                              256 KB, 512 KB, 1 MB
AIX 5.3      128 KB, 256 KB, 512 KB, 1 MB,    4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB,
             2 MB, 4 MB, 16 MB                256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 16 MB,
                                              32 MB, 64 MB, 128 MB
Striping Considerations:
Only use LVM striped logical volumes when the logical volume must be sequentially
accessed and requires a high throughput rate. Otherwise, use the maximum inter-policy.
Striped logical volumes are good when data is accessed sequentially and high throughput
rates are required. For instance, consider the case where an application reads
single-threaded I/O from, say, a satellite. This example would be especially efficient if the I/O size matched
the full stripe on the DS6000 (I/O size = DS6000 strip size x DS6000 stripe width), or even
better if the I/O size matches the full stripe on the logical volume (I/O size = logical volume
strip size x logical volume stripe width). Also consider the case where a simple dd
command is used to characterize sequential reads and writes to raw devices.
Striping inhibits DS6000 read ahead algorithms for sequential I/O. When the DS6000
detects sequential reading from a LUN, it reads ahead and puts the data in read cache
which improves I/O service time. This is called a read hit. With a small strip size, it takes
two full logical volume stripes of I/O before the DS6000 realizes that sequential reading is
occurring from all the LUNs in the logical volume stripe.
A small stripe size is useful to spread the I/Os to very small structures across more disks.
In AIX 5.1 and AIX 5.2, it is not possible to dynamically increase the stripe width (the
number of physical volumes that the striped logical volume is striped across). To increase
the stripe width, data must be backed up and the striped logical volumes must be
recreated.
In AIX 5.3, the stripe width can be changed to multiples of the existing stripe width using a
technique called striped columns, though the LVM will fill the first set of disks (or columns)
before allocating data to the second set of disks (or columns).
{CCF-part2:root}/ -> mkvg4vp -L 256 -B -f -s 16 -y stripevg vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7
stripevg
{CCF-part2:root}/ -> lsvg stripevg
VOLUME GROUP: stripevg VG IDENTIFIER: 00e033c400004c000000010656cd3c3f
VG STATE: active PP SIZE: 16 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 5112 (81792 megabytes)
MAX LVs: 512 FREE PPs: 5112 (81792 megabytes)
LVs: 0 USED PPs: 0 (0 megabytes)
OPEN LVs: 0 QUORUM: 5
TOTAL PVs: 8 VG DESCRIPTORS: 8
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 8 AUTO ON: yes
MAX PPs per PV: 1016 MAX PVs: 128
LTG size: 128 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
Let us assume that we, as systems administrators, have just received the storage
requirements for a new application that needs to be put into production. The new application
folks need 3 logical volumes that are 16 GB, 37.3 GB, and 23 GB. They insist that the logical
volumes be striped, in spite of our recommendations to use inter-disk logical volumes. We
know that our volume group uses a physical partition size of 16 MB and contains eight LUNs,
so the MUA is 128 MB. See Table 7-9.
Requested size (GB)  Physical partitions to allocate
16                   1032
38                   2440
23                   1480
To create the striped logical volumes we used commands as shown in Example 7-36.
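Commands of roughly the following form would produce that result (the logical volume names
and the 256K stripe size are assumptions):

mklv -y 16glv -S 256K stripevg 1032 vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7
mklv -y 38glv -S 256K stripevg 2440 vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7
mklv -y 23glv -S 256K stripevg 1480 vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7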
Notice the balance in the distribution of the logical volume on the LUNs! To see this we used
the lvmap.ksh shell script in Appendix B, “UNIX shell scripts” on page 481.
Note: The use of INLINE jfs2 logs is preferred to outline (separate log device) logs. Inline
logs are created when the filesystem is created.
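A plausible form of that command (the logical volume and mount point are illustrative):

crfs -v jfs2 -d 16glv -m /data16 -A yes -a logname=INLINE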
In Example 7-37, we use the above command to create a file system and then we look at the
file system specifics.
When tuning the operating system, do one thing at a time and verify the I/O improvement at
every step of the way. Have a clear understanding of your current system settings before
making changes to the operating system.
Response time and throughput trade-offs exist that affect overall performance. Generally a
multi-user system will want to ensure good response time for users, while a system that runs
only batch jobs should be tuned for maximum throughput. The appropriate tuning parameters
are dependent upon the nature of the application, the number of CPUs, the amount of cache
in the DS6000, the write rate, and other factors. We suggest that you tune these values for
maximum disk throughput with reasonable response time for users.
The new tuning commands (vmo, ioo, and schedo) are part of the bos.perf.tune fileset in AIX.
They all use the same syntax and command options, and they manipulate files in the
/etc/tunables directory. Also, starting with AIX 5.2, SMIT provides full support for these
commands.
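As a brief sketch of the shared syntax (the tunables and values shown are illustrative
only):

# display the current value of a tunable
vmo -o minfree
# change a tunable on the running system
ioo -o maxpgahead=16
# change a tunable and persist it across reboots (recorded under /etc/tunables)
vmo -p -o minfree=960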
A full discussion of AIX tuning can be found in the AIX 5L Version 5.3 Performance
Management Guide which can be found by first selecting the AIX documentation link, then
selecting the Performance management and tuning link, at:
http://publib.boulder.ibm.com/infocenter/pseries/index.jsp
Function: Sets the number of frames on the free list at which page stealing is to stop.
Must be larger than minfree by a value at least as large as maxpgahead.
  JFS: maxfree (vmo). JFS2: maxfree (vmo). Dynamic: Yes.
Function: Sets the number of frames on the free list at which page stealing starts to
replenish the free list.
  JFS: minfree (vmo). JFS2: minfree (vmo). Dynamic: Yes.
Function: Sets a hard limit on memory for caching.
  JFS: maxperm with strict_maxperm (vmo). JFS2: maxclient with strict_maxclient (vmo);
  maxclient is always a hard limit. Dynamic: Yes.
Function: Sets the maximum pages used for sequential read ahead. Should be a power of 2 and
greater than or equal to minpgahead. Related to minfree and maxfree.
  JFS: maxpgahead (ioo). JFS2: j2_maxPageReadAhead (ioo). Dynamic: Yes.
Function: Sets the minimum pages used for sequential read ahead.
  JFS: minpgahead (ioo). JFS2: j2_minPageReadAhead (ioo). Dynamic: Yes.
Function: Sets the maximum number of pending I/Os to a file.
  JFS: chdev -l sys0 -a maxpout=value. JFS2: chdev -l sys0 -a maxpout=value. Dynamic: Yes.
Function: Sets the minimum number of pending I/Os to a file at which programs blocked by
maxpout may proceed.
  JFS: chdev -l sys0 -a minpout=value. JFS2: chdev -l sys0 -a minpout=value. Dynamic: Yes.
Function: Sets the amount of modified data cache for a file with random writes.
  JFS: maxrandwrt (ioo). JFS2: j2_maxRandomWrite (ioo). Dynamic: Yes.
Function: Controls the gathering of I/Os for sequential write behind.
  JFS: numclust (ioo). JFS2: j2_nPagesPerWriteBehindCluster (ioo), j2_nRandomCluster (ioo).
  Dynamic: Yes.
Function: Sets the number of file system bufstructs.
  JFS: numfsbufs (ioo). JFS2: j2_nBufferPerPagerDevice (ioo),
  j2_dynamicBufferPreallocation (ioo). Dynamic: mount option; requires a remount.
Note that there can be too much file system cache with systems having more than 24 GB of
RAM. Part of the time that the syncd daemon runs, interrupts are suspended and I/O is
halted. The syncd has to check all the file system cache to see if the data needs to be flushed
to disk. Reading large amounts of memory can take seconds, so too much cache can be bad
for I/O. We can use release behind filesystem mount options (rbr, rbw, and rbrw) to keep data
out of the file system cache that does not have to be there. We can even put a limit on file
system cache by setting maxperm, maxclient, and strict_maxperm values. Maxclient is a hard
limit, but maxperm is a soft limit unless strict_maxperm is set to 1. The downside of setting a
strict limit for maxperm is that it causes the page replacement algorithm (lrud) to run when
there is plenty of free memory in the system. So there is a trade-off here. Generally, prior to
AIX 5.3, you will want a hard limit for filesystem cache on systems with 24 GB of RAM and
up, depending on the system memory bandwidth and processor speed.
In AIX 5.3, I/O buffers can be tuned at the volume group level, rather than at the system
level. Use the lvmo -a command to view volume group level pbuf statistics. Tuning follows
the same principles as above.
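A sketch (the volume group name and value are illustrative):

# view pbuf statistics for all volume groups
lvmo -a
# raise the pbufs added per physical volume for one volume group
lvmo -v 6000vg -o pv_pbuf_count=1024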
Read ahead
Read ahead at the file system level detects that we are reading sequentially and puts the data
into filesystem cache before the application requests it. This is supposed to reduce the
amount of percent I/O wait (%iowait) or increases I/O throughput, as seen from the operating
system. Too much read ahead means you do I/O that you do not need. The VMM tunable
parameters that control read ahead are minpgahead and maxpgahead for JFS and
j2_minPageReadAhead and j2_maxPageReadAhead for JFS2. These parameters are related to
maxfree and are used to ensure sufficient memory is available for I/O and to ensure good
keyboard response times on systems with heavy I/O workloads.
Bear in mind that the DS6000 has algorithms that perform read ahead also. Sometimes these
algorithms work in harmony with the operating system read ahead parameters, and
sometimes they don’t.
I/O pacing
I/O pacing limits the number of write I/Os that can be outstanding to a file. When a process
exceeds the maxpout limit (high water mark) it is put to sleep until the number of outstanding
writes I/Os is less than minpout (low water mark). This allows another process to use the
CPU. Said another way, I/O pacing causes the CPU to stop performing I/O to a file after a
specified amount of time. This frees up the CPU to do something else. Turning I/O pacing off
(default) improves backup times and sequential throughput. Turning I/O pacing on ensures
that no process hogs the CPU for I/O. Typically, we recommend to leave I/O pacing turned off.
There are certain circumstances where it is appropriate to have I/O pacing turned on like if
you are using HACMP. If you turn it on, start with settings of maxpout=321 and minpout=240.
Also, with AIX 5.3, I/O pacing can be turned on at the file system level with the mount
command.
Write behind
This parameter is used to have the operating system initiate I/O that is normally controlled
by the syncd daemon when a specified number of sequential 16 KB clusters are updated. The
parameters for write behind are listed below; a usage sketch follows the list:
Sequential write behind
– numclust for JFS
– j2_nPagesPerWriteBehindCluster and j2_nRandomCluster for JFS2
Random write behind
– maxrandwrt for JFS
– j2_maxRandomWrite for JFS2
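As an illustrative sketch only (the values shown are not recommendations), the JFS2
write-behind tunables are set with ioo:

ioo -o j2_nPagesPerWriteBehindCluster=64
ioo -o j2_maxRandomWrite=128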
Mount options
Use release behind mount options where they make sense
Release behind mount options can reduce syncd and lrud overhead and should be used
where it makes sense. The options basically throw away the data that would otherwise be
held in JFS2 cache. You would use these options if you knew that data going into or out of
certain file systems would not be requested again by the application before the data is
likely to be paged out. This means that the lrud daemon has less work to do to free up
cache and eliminates any syncd overhead for this file system.
– -rbr for release behind after read
– -rbw for release behind after write
– -rbrw for release behind after read or write
I/O pacing can be specified for a specific filesystem with mount options and would be
useful where we do not want I/O from one filesystem to slow down other I/O or
applications:
mount -o minpout=40,maxpout=60
Direct I/O (DIO)
– Bypass JFS/JFS2 cache
– No read ahead
– An option of the mount command
– Useful for databases that use file systems rather than raw logical volumes, the idea
being that if an application has its own cache, then it does not make sense to also have
the data in file system cache.
Concurrent I/O (CIO)
– Same as DIO but without inode locking, so the application must ensure data integrity
for multiple simultaneous I/Os to a file.
lru_poll_interval
The lru_poll_interval parameter was introduced in ML4 of AIX 5.2. The parameter tells the
page stealer (lrud) whether it should stop working and poll for interrupts, or continue
processing.
The default maximum transfer size is 0x100000. Consider changing this value to 0x200000 or
larger. These values are adapter dependent. This changes the maximum I/O size that the
adapter will support and it also increases the DMA memory area used for data transfers by
the adapter. When the max_xfer_size=0x100000, then the memory area is 16 MB, and for
other values it is 128 MB.
The default number of simultaneous I/Os the adapter will handle is 200. The maximum for a 2
Gb HBA is 2048.
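On AIX, these adapter attributes can be changed with the chdev command; a sketch (the
adapter name and values are illustrative, and the adapter must not be in use, or use -P and
reboot):

chdev -l fcs0 -a max_xfer_size=0x200000 -a num_cmd_elems=1024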
There are two different file system types for HP-UX: VxFS and HFS. VxFS is preferred for
performance reasons.
vxfs_max_ra_kbytes Maximum amount of read-ahead data, in KB, that the kernel may
have outstanding for a single VxFS file system.
hfs_max_ra_blocks The maximum number of read-ahead blocks that the kernel may
have outstanding for a single HFS file system.
hfs_max_revra_blocks The maximum number of reverse read-ahead blocks that the kernel
may have outstanding for a single HFS file system.
hfs_ra_per_disk The amount of HFS file system read-ahead per disk drive, in KB.
Tip: Tuning the read ahead options varies from system to system depending on the
platform and amount of memory installed. Experiment with different values, making small
changes at a time.
How many pages of memory are allocated for buffer cache use at any given time is
determined by system needs, but the two parameters ensure that allocated memory never
drops below dbc_min_pct and cannot exceed dbc_max_pct percent of total system memory.
The default value for dbc_max_pct is 50 percent, which is usually overkill. If you want to use a
dynamic buffer cache, set the dbc_max_pct value to 25 percent. If you have 4 GB of memory
or more, start with an even smaller value.
With a large buffer cache, the system is likely to have to page out or shrink the buffer
cache to meet application memory needs, which causes I/Os to paging space. You want to
prevent that, so set memory buffers to favor applications over cached files.
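As a hedged sketch for HP-UX 11i (the value is an illustration; depending on the release,
the tunable may be dynamic or may require a kernel rebuild and reboot):

kmtune -s dbc_max_pct=25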
Check the following guide for an updated list of the updates Sun Solaris needs for different
attachment types to the DS6000: IBM TotalStorage Enterprise Storage Server Host System
Attachment Guide, SC26-7446. This guide can be downloaded from:
http://ssddom02.storage.ibm.com/techsup/webnav.nsf/support/2105
The Tunable Parameters Reference Manual for Solaris 8 can be found at:
http://docs.sun.com/app/docs/doc/816-0607
And the Tunable Parameters Reference Manual for Solaris 9 can be found at:
http://docs.sun.com/app/docs/doc/806-7009
maxphys
This parameter specifies the maximum number of bytes that you can transfer for each SCSI
transaction. The default value is 126976 (124 KB). If the I/O block size that you requested
exceeds the default value, the request is broken into more than one request. The value
should be tuned for the application requirements. For maximum bandwidth, set the maxphys
parameter by adding the following line to the /etc/system file (1048576 bytes is 1 MB):
set maxphys=1048576
Attention: Do not set the value for maxphys greater than 1048576 (1 MB). Doing so can
cause the system to hang.
vxio:vol_maxio
If you use the Veritas Volume Manager on the DS6000 LUNs, you must set the VxVM maximum
I/O size parameter (vol_maxio) to match the maxphys parameter. vol_maxio is specified in
512-byte units, so when you set the maxphys parameter to 1048576, set vol_maxio as in the
following /etc/system line (2048 × 512 bytes = 1 MB):
set vxio:vol_maxio=2048
sd_max_throttle
Note: Use this setting for JNI Fibre Channel adapters only.
The default value is 256, but you must set the parameter to a value less than or equal
to a maximum queue depth for each LUN connected. Determine the value by using the
following formula:
256 ÷ (LUNs per adapter)
Where LUNs per adapter is the largest number of LUNs assigned to a single adapter.
To set the sd_max_throttle parameter for the DS6000 LUNs in this example, you would
add the following line to the /etc/system file:
set sd:sd_max_throttle=5
The following settings should be set for all Fibre Channel adapter types (JNI, Emulex, or
QLogic).
– sd_io_time
In this chapter we also discuss the supported distributions of Linux when using the DS6000,
as well as the tools that can be helpful for the monitoring and tuning activity:
uptime
dmesg
top
iostat
vmstat
sar, isag
GKrellM
KDE System Guard
LVM
Bonnie
If problems are encountered with installed versions, you may be required to update your Linux
configuration to a higher supported level before problem determination can take place.
For further clarification and the most current information about DS6000-supported Linux
distributions and kernel support compatible with the DS6000, you can refer to the Web site:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
Once there, click the link for the PDF file: Download interoperability matrix.
It is the kernel’s job to manage all of these different memory spaces. When, for example, an
application is started, the kernel must transfer all the data from the hard disk to the buffer
space. After that, it must free some memory in the user space to load the application. Since
the user space will be divided into different chunks, it must sometimes rearrange certain
processes to get a big enough chunk for the application it is trying to load. When it has
freed a large enough chunk, it can load and start the application.
While virtual memory makes it possible for computers to more easily handle larger and more
complex applications, as with any powerful tool, it comes at a price. The price in this case is
one of overhead: An application that is 100 percent memory-resident will run faster than one
residing in virtual memory.
However, this is no reason to throw up one's hands and give up. The benefits of virtual
memory are too great to do that. And, with a bit of tuning, good performance is possible. The
thing that must be done is to look at the system resources that are impacted by heavy use of
the virtual memory subsystem.
The interrelated nature of these loads makes it easy to see how resource shortages can lead
to severe performance problems. All it takes is:
A system with too little RAM
Heavy page fault activity
A system running near its limit in terms of CPU or disk I/O
At this point, the system will be thrashing, with performance rapidly decreasing.
From this, the overall point to keep in mind is that the performance impact of virtual memory is
minimal when it is used as little as possible.
The primary determinant of good virtual memory subsystem performance is having enough
RAM. Next in line (but much lower in relative importance) are sufficient disk I/O and CPU
capacity. Adding disk and CPU capacity, however, does comparatively little for the virtual
memory subsystem itself (although it obviously can play a major role in overall system
performance).
Note: A reasonably active system will always experience some page faults, if for no other
reason than because a newly-launched application will experience page faults as it is
brought into memory.
If there is insufficient memory installed in a server, it will begin paging the least used data
from memory to the swap partitions on the disks. A general rule is that the swap partitions
should be on the fastest drives available. If the server has more than one array, it is always a
good idea to spread the swap partitions over all of the arrays. This will generally improve the
performance of the server.
Furthermore, there is a way to parallelize swap file read/writes. It is possible to give each
swap partition a priority setting in the /etc/fstab file. If you open the /etc/fstab file, you might
see something like in Example 8-1.
Under normal circumstances, Linux would use the swap partition /dev/sda2 first, then
/dev/sdb2, and so on, until it had allocated enough swapping space. This means that perhaps
only the first partition, /dev/sda2, will be used if there is no need for a large swap space.
Spreading the data over all available swap partitions will improve performance, because all
read/write requests will be performed simultaneously to all selected partitions. If you change
the file, as in Example 8-2, you will assign a higher priority level to the first three partitions.
Swap partitions are used from the highest priority to the lowest (where 32767 is the highest
and 0 the lowest). Giving the same priority to the first three disks causes the data to be written
to all three disks; the system does not wait until the first swap partition is full before it
starts writing to the next one. The fourth partition is used if the first three are
completely filled up and there is still additional
space needed for swapping. It is also possible to give all partitions the same priority to stripe
the data over all partitions, but if one drive is slower than the others (/dev/sdd2 in
Example 8-2), performance would decrease.
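The /etc/fstab entries described would look roughly like the following sketch, with the
first three partitions at the same priority and the slower fourth partition lower (device
names and priority values are illustrative):

/dev/sda2   swap   swap   defaults,pri=3   0 0
/dev/sdb2   swap   swap   defaults,pri=3   0 0
/dev/sdc2   swap   swap   defaults,pri=3   0 0
/dev/sdd2   swap   swap   defaults,pri=1   0 0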
If the server is running out of swap space and there is additional hard disk space left, it is
possible to create additional swap partitions with fdisk. However, if you cannot create a new
partition, you can create a swap file instead. There are two disadvantages to locating a swap
file outside a dedicated swap partition.
The performance of swap files in a data partition is slower than on a swap partition.
If the swap file gets damaged, the data on the whole partition may be lost.
For these reasons, we recommend that you not place the swap file on a data partition. In the
following example, we will create a 512 MB swap file with a block size of 2 KB:
1. Start by creating a directory for the swap file:
mkdir /swap
2. Create the swap file:
dd if=/dev/zero of=/swap/swapfile bs=2048 count=262144
This command creates a file called /swap/swapfile with a block size of 2 KB. The size
will be 512 MB (2048*262144=512 MB). The size is determined using the bs and count
parameters of dd, so the command could have also been:
dd if=/dev/zero of=/swap/swapfile bs=1M count=512
3. Initialize the swap file:
mkswap /swap/swapfile
4. Synchronize the file:
sync
5. Configure Linux to use the swap file:
swapon /swap/swapfile
6. If the swap file is no longer needed, you can instruct the system to stop using the swap file
and then delete the file:
swapoff /swap/swapfile
rm /swap/swapfile
It is also possible to use a swap file permanently. The information needs to be put into the
/etc/fstab file, which would look as illustrated in Example 8-3.
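As a sketch, the /etc/fstab entry for a permanent swap file would look like:

/swap/swapfile   swap   swap   defaults   0 0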
While swapping (writing modified pages out to the system swap space) is a normal part of a
Red Hat and Suse Linux system's operation, it is possible for a system to experience too
much swapping. The reason to be wary of excessive swapping is that the following situation
can easily occur, over and over again: Pages from a process are swapped; the process
becomes runnable and attempts to access a swapped page; the page is faulted back into
memory; a short time later, the page is swapped out again.
There are daemons running on every server that are probably not needed. Disabling these
daemons frees memory and decreases the number of processes the CPU has to handle.
linuxconf, chkconfig, and serviceconf are tools that make it easy, among other things, to
disable and enable daemons. If linuxconf is not found on your system it is available from:
http://www.solucorp.qc.ca/linuxconf
Figure 8-1 shows one interface for disabling daemons on Red Hat Linux, linuxconf.
chkconfig is a text type tool run from the command line. Example 8-4 shows output with the
chkconfig command.
Change a value (and check if the setting changed). For example, to turn off the sshd daemon,
type in the following command from the host system:
chkconfig --level 5 sshd off
Now turn the sshd daemon back on by typing in the following command:
chkconfig --level 5 sshd on
And now check to see if it has changed by typing in the following command:
chkconfig --list sshd
Also you can use serviceconf to disable unnecessary daemons, as illustrated in Figure 8-2
on page 269.
If you do not have the ability to run linuxconf, serviceconf, or chkconfig, or you do not
want to use them, it is also possible to disable or enable daemons from the command line. In
the following example, we will show how to stop the sendmail daemon. First log on as root
and enter the following command:
/etc/init.d/sendmail stop
Every daemon can be started and stopped in the same way. Some also provide further
functions such as restart, status, and so on.
If you do not want the daemon to start the next time the machine boots, you will need to
change the contents of the various run level directories:
1. Determine which run level the machine is running with the command runlevel.
This will print the previous and current run level (for example, N 3 means that there was no
previous run level (N) and that the current run level is 3).
2. To switch between run levels, use the init command. For example, to switch to run level 5,
enter the command init 5.
3. To prevent a daemon from starting, you will need to rename the appropriate file in the /etc
directory structure. For example, to disable the sendmail daemon in run level 3 at startup,
enter the command:
rename S80sendmail K80sendmail /etc/rc3.d/S80sendmail
Or,
mv /etc/rc3.d/S80sendmail /etc/rc3.d/K80sendmail
Daemons with an S at the beginning of the symbolic link name will be started; those
starting with a K will not be started in that specific run level. In our example, the sendmail
daemon will not be started on the next reboot. Note that you must select the correct run
level to change this.
Before you begin, you will need to know what hardware is installed in the server. You can
obtain a list by typing in the command lspci. The most important things to know are:
CPU type
Amount of memory installed
SCSI adapter
RAID controller
Fibre Channel adapter
Network adapter
Video adapter
The more information you have about the hardware used, the more easily the Linux kernel
can be configured.
This procedure can be tricky at some steps, so we refer you to a complete discussion of how
to compile the kernel in the IBM Redpaper, Running the Linux 2.4 Kernel on IBM Eserver
xSeries Servers, REDP0121, available from:
http://www.redbooks.ibm.com
Select Redpapers from the left navigation bar and do a search using the Redpaper form
number REDP0121.
Tip: By default, the kernel includes the necessary module to enable you to make changes
using sysctl without needing to reboot. However, if you choose to remove this support
(during the operating system installation), then you will have to reboot Linux before the
change can take effect.
SUSE LINUX offers a graphical method of modifying these sysctl parameters, illustrated in
Figure 8-3. To launch the powertweak tool, issue the following command:
/sbin/yast powertweak
Red Hat offers a graphical method of modifying these sysctl parameters. To launch the tool,
issue the following command:
/usr/bin/redhat-config-proc
Reading the files in the /proc directory tree provides a simple way to view configuration
parameters that are related to the kernel, processes, memory, network and other
components. Each process running in the system has a directory in /proc with the process ID
(PID) as name. Table 8-1 lists some of the files that contain kernel information.
/proc/loadavg Information about the load of the server in 1-minute, 5-minute, and 15-minute intervals.
The uptime command gets information from this file.
/proc/kcore (SUSE LINUX Enterprise Server only) Contains data to generate a core dump at run
time, for kernel debugging purposes. The command to create the core dump is gdb as
in:
#gdb /usr/src/linux/vmlinux /proc/kcore
/proc/meminfo Information about memory usage. The free command uses this information.
/proc/sys/abi/* Used to provide support for “foreign” binaries, not native to Linux: those compiled
under other UNIX variants such as SCO UnixWare 7, SCO OpenServer, and SUN
Solaris 2. By default, this support is installed, although it can be removed during
installation.
/proc/sys/fs/* Used to increase the number of open files the OS allows and to handle quota.
/proc/sys/kernel/* For tuning purposes, you can enable hotplug, manipulate shared memory, and specify
the maximum number of pid files and level of debug in syslog.
The next time you reboot, the parameter file will be read. You can do the same thing without
rebooting by issuing the following command:
#sysctl -p
Table 8-2 lists the SUSE Linux V2.4 kernel parameters that are most relevant to performance.
Table 8-2 List of the SUSE LINUX V2.4 kernel parameters that are most relevant
Parameter Description/example of use
kernel.shm-bigpages-per-file Normally used for tuning database servers. The default is 32768. To
calculate a suitable value, take the amount of System Global Area
(SGA) memory in GB and multiply by 1024. For example:
sysctl -w kernel.shm-bigpages-per-file=16384
kernel.sched_yield_scale Enables the dynamic resizing of time slices given to processes. When
enabled, the kernel reserves more time slices for busy processes and
fewer for idle processes. The parameters kernel.min-timeslice and
kernel.max-timeslice are used to specify the range of time slices that
the kernel can supply as needed. If disabled, the time slices given to
each process are the same.
sysctl -w kernel.sched_yield_scale=1
sysctl -w kernel.shm-use-bigpages=1
net.ipv4.conf.all.hidden All interface addresses are hidden from Address Resolution Protocol
(ARP) broadcasts and will be included in the ARP response of other
addresses. Default is 0 (disabled). For example:
sysctl -w net.ipv4.conf.all.hidden=1
sysctl -w net.ipv4.conf.default.hidden=1
net.ipv4.conf.eth0.hidden Enables only interface eth0 as hidden. Uses the ID of your network
card. Default is 0 (disabled).
sysctl -w net.ipv4.conf.eth0.hidden=1
net.ipv4.ip_conntrack_max This setting is the number of separate connections that can be tracked.
Default is 65536.
sysctl -w net.ipv4.ip_conntrack_max=32768
sysctl -w net.ipv6.conf.all.mtu=9000
net.ipv6.conf.all.router_solicitation_delay Determines whether to wait after interface opens before sending router
solicitations. Default is 1 (the kernel should wait). For example:
sysctl -w net.ipv6.conf.all.router_solicitation_delay=0
sysctl -w net.ipv6.conf.all.router_solicitation_interval=3
sysctl -w net.ipv6.conf.all.router_solicitations=2
sysctl -w net.ipv6.conf.all.temp_prefered_lft=259200
sysctl -w net.ipv6.conf.all.temp_valid_lft=302400
net.ipv6.conf.default.accept_redirects Accepts redirects sent by an IPv6 router. It cannot
be enabled if forwarding is enabled: always one or the other, never both together, because
setting both causes problems in IPv6 networks. Default is 1 (enabled).
sysctl -w net.ipv6.conf.default.accept_redirects=0
sysctl -w net.ipv6.conf.default.autoconf=0
sysctl -w net.ipv6.conf.default.dad_transmits=0
net.ipv6.conf.default.mtu Sets the default value for Maximum Transmission Unit (MTU). Default is
1280.
sysctl -w net.ipv6.conf.default.mtu=9000
sysctl -w net.ipv6.conf.default.regen_max_retry=3
net.ipv6.conf.default.router_solicitation_delay Number in seconds to wait, after the
interface is brought up, before sending a router request. Default is 1 (enabled).
sysctl -w net.ipv6.conf.default.router_solicitation_delay=0
vm.heap-stack-gap Enforces a gap between the heap (used to store information about the
status of processes and local variables) and the stack. You should disable this when you
need to run a server with the Java Development Kit (JDK™); otherwise your software will
crash. Default is 1 (enabled).
sysctl -w vm.heap-stack-gap=0
vm.vm_anon_lru Allows the virtual memory (vm) to always have visibility of anonymous
pages. Default is 1 (enabled).
sysctl -w vm.vm_anon_lru=0
vm.vm_lru_balance_ratio Balances the active and inactive sections of memory by defining the
amount of inactive memory that the kernel will rotate. Default is 2.
sysctl -w vm.vm_lru_balance_ratio=3
sysctl -w vm.vm_mapped_ratio=90
vm.vm_passes Number of passes that the kernel makes when trying to balance the active and
inactive sections of memory. Default is 60.
sysctl -w vm.vm_passes=30
sysctl -w vm.vm_shmem_swap=1
vm.vm_vfs_scan_ratio Proportion of the unused Virtual File System caches that the kernel
tries to scan in one VM freeing pass. Default is 6.
sysctl -w vm.vm_vfs_scan_ratio=6
Table 8-3 Red Hat parameters that are most relevant to performance tuning
Parameter Description / example of use
net.ipv4.inet_peer_gc_maxtime How often the garbage collector (gc) should pass over the inet peer
storage memory pool during low or absent memory pressure. Default is
120, measured in jiffies. For definition of jiffy, see:
http://www.kernelnewbies.org/glossary/#J
sysctl -w net.ipv4.inet_peer_gc_maxtime=240
net.ipv4.inet_peer_gc_mintime Sets the minimum time that the garbage collector can pass cleaning
memory. If your server is heavily loaded, you may want to increase this
value. Default is 10, measured in jiffies.
sysctl -w net.ipv4.inet_peer_gc_mintime=80
net.ipv4.inet_peer_maxttl The maximum time-to-live for the inet peer entries. New entries will
expire after this period of time. Default is 600, measured in jiffies.
sysctl -w net.ipv4.inet_peer_maxttl=500
net.ipv4.inet_peer_minttl The minimum time-to-live for inet peer entries. Set to a high
enough value to cover the fragment time-to-live on the reassembling side of fragmented
packets. This minimum time must be smaller than net.ipv4.inet_peer_threshold. Default is
120, measured in jiffies.
sysctl -w net.ipv4.inet_peer_minttl=80
net.ipv4.inet_peer_threshold Sets the size of inet peer storage. When this limit is reached,
peer entries will be thrown away, using the inet_peer_gc_mintime timeout.
Default is 65644.
sysctl -w net.ipv4.inet_peer_threshold=65644
vm.hugetlb_pool The hugetlb feature works in the same way as bigpages, but after
hugetlb allocates memory, only the physical memory can be accessed
by hugetlb or shm allocated with SHM_HUGETLB. It is normally used
with databases such as Oracle or DB2. Default is 0.
sysctl -w vm.hugetlb_pool=4608
sysctl -w vm.inactive_clean_percent=30
vm.pagecache Designates how much memory should be used for page cache. This is
important for databases such as Oracle and DB2. Default is 1 15 100.
Many different file systems are available for Linux that differ in performance and scalability.
Besides storing and managing data on the disks, file systems are also responsible for
guaranteeing data integrity. The newer Linux distributions include journaling file systems as
part of their default installation. Journaling, or logging, prevents data inconsistency in case of
a system crash. All modifications to the file system metadata are maintained in a
separate journal or log and can be applied after a system crash to bring the file system
back to its consistent state. Journaling also improves recovery time, because there is no need to perform
file system checks at system reboot.
As with other aspects of computing, you will find that there is a trade-off between performance
and integrity. However, as Linux servers make their way into corporate data centers and
enterprise environments, requirements such as high availability can be addressed.
In this section, we cover the default file systems available on Red Hat Enterprise Linux AS
and SUSE LINUX Enterprise Server and some simple ways to improve their performance.
ext2
ext2 is still a commonly used file system in the Linux community. It provides the standard
UNIX file semantics and advanced features. It is robust and offers excellent performance. The
ext2 standard features include:
Support for standard UNIX file types (regular files, directories, device special files, and
symbolic links)
Up to 4 TB of volume size
Support for long file names (up to 255 characters)
The ext2 kernel code contains many performance optimizations, which improve I/O speed
when accessing data on a disk. One of the optimizations is a read ahead algorithm. When a
block is read, the kernel code automatically requests the follow-on blocks. In this way, it
ensures that the next block is already in the buffer cache and available for further processing.
In addition, ext2 contains many allocation optimizations. Block groups are used to store
related inodes and data together. The kernel always tries to allocate data blocks for a file in
the same group as its inode. This results in fewer disk head seeks performed when the kernel
reads an inode and its data blocks.
One problem with ext2 is that if an unexpected power failure or an unclean shutdown occurs,
the file system may be in an inconsistent state. Therefore, an e2fsck is forced on the next
reboot of the system, which may or may not recover the file system from its inconsistent state.
Journaling file systems like ext3 greatly reduce the chance of getting an inconsistent file
system.
Since you cannot change the stripe size on the disks of the DS6000, to achieve optimal
performance your OS software stripe size should be a multiple of your file system block
size, or slightly larger. The actual file system block size for /dev/sda1 can be found with
the following command:
dumpe2fs -h /dev/sda1 |grep -F "Block size"
Example 8-7 Determining file system block size from the dumpe2fs command
dumpe2fs 1.23, 15-Aug-2001 for EXT2 FS 0.5b, 95/08/09
Block size: 1024
The block size cannot be changed when the partition is already formatted, so you have to
decide which block size you will use when formatting the partition. So, if you create a new ext2
partition on /dev/sda5 with a block size of 4096 bytes/block, the command will be:
mke2fs -b 4096 /dev/sda5
ext3
ext3 is the updated version of the ext2 file system. It has many new features and
enhancements compared to the previous ext2. Its main advantages are:
Availability: ext3 always writes data in a consistent way to the disks. So in case of an
unclean shutdown (unexpected power failure, system crash), the server does not need to
check the consistency of the data on a ext3 volume.
The time spent to recover the journal is about one second (depending on the hardware
used). On an ext2 volume, the e2fsck performed after an unclean shutdown may take
hours, depending on the size of the volume and the number of files.
Data integrity: You can choose the type and level of protection of your data. You can
choose to keep the file system consistent, but allow for damage to data on the file system
in case of unclean system shutdown. This can improve performance under some, but not
all, circumstances.
Alternatively, you can choose to ensure that the data is consistent with the state of the file
system. This second choice is the safer choice and is the default.
Speed: There are three different journaling modes available to optimize speed:
data=writeback (metadata only), data=ordered (the default), and data=journal (both data
and metadata are journaled).
If maximum performance is needed, use ext2 since it has generally less overhead than any
journaling file system. But keep in mind that your data may be inconsistent in the event of a
power failure or an unclean shutdown.
The current version of ReiserFS that is installed with SUSE LINUX Enterprise Server 8 is
V3.6. There is work underway to deliver the next release, Reiser4. The new Reiser4 file
system is expected to deliver an unbreakable file system by eliminating corruption with the
implementation of an atomic file system where I/O is guaranteed to complete, a 2x to 5x
speed improvement by implementing new access algorithms, and ease of third-party
upgrades without reformatting, through the use of plug-ins.
Testing performed with FTP transmissions has shown that with scalable window support
enabled and the TCP window size set to an appropriate level (depending on the network),
network throughput improves 100–500 percent on WAN links. There is less impact on local
area networks.
The default setting of 64 KB for most Linux configurations is fine for most LANs, but too low for
Internet connections. Set this to a value between 256 KB for T1 lines or lower, and 2 to 4 MB
for T3, OC-3, or even faster connections.
To determine the optimal buffer size for your environment, you can use the following formula:
buffer size = 2 × bandwidth × delay
Where bandwidth is the bandwidth of the slowest connection between the server and the
client, and delay is the round-trip time on that path.
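For example, with illustrative numbers: on a 100 Mb/s (12.5 MB/s) path with a 50 ms
round-trip delay, buffer size = 2 × 12.5 MB/s × 0.05 s ≈ 1.25 MB.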
For a Linux kernel 2.4.x system, add the following lines to /etc/rc.d/rc.local:
echo "4096 65536 4194304">/proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304">/proc/sys/net/ipv4/tcp_wmem
The three values describe the minimum, default, and maximum window sizes used by TCP.
The Linux kernel 2.4.x actually does a good job of adjusting the window size automatically,
depending on network conditions. You simply need to specify appropriate minimum and
maximum values.
8.3.1 uptime
The uptime command can be used to see how long the server has been running, how many
logged on users there are, and gives a quick overview of what average load the server has.
The system load average is displayed for the last one, five, and fifteen minute intervals. The
load average is not a percentage, but instead the number of processes in queue waiting to be
processed. If processes that request CPU time are blocked (which means the CPU has no
time to process them), the load average will increase. On the other hand, if each process
gets immediate access to CPU time and no CPU cycles are lost, the load will decrease.
The optimal value of the load would be 1, which means each process gets immediate access
to the CPU and there are no CPU cycles lost. The typical loads can vary from system to
system: For a uniprocessor workstation, 1–2 might be acceptable, whereas you will probably
see values of 8–10 on multiprocessor servers.
For more information about uptime, see the online help or the man page (man uptime).
Note: You can also use w, who, or finger instead of uptime. They also provide information
about who is currently logged onto the machine and what the user is doing.
8.3.2 dmesg
With dmesg, you can determine what hardware is installed in your server. During every boot,
Linux checks your hardware and logs this information. You can view these logs using dmesg.
You can see information about the CPU, DS6000 disk subsystem, network adapters, and
amount of memory that is installed. Example 8-9 illustrates the output of the dmesg command.
For more information about dmesg see the online help (man dmesg).
8.3.3 top
The top command shows you actual processor activity. By default, it displays the most
CPU-intensive tasks of the server and updates the list every five seconds. You can sort the
processes by PID (numerically), age (newest first), resident memory usage, and time (the
time the process has occupied the CPU since startup). Example 8-10 shows a sample of the
output of the top command.
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
12795 root 20 0 10592 2404 924 R 98.9 0.9 22:11 jre
13125 root 10 0 1028 1024 832 R 0.9 0.4 0:00 top
1 root 8 0 524 524 456 S 0.0 0.2 0:04 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 9 0 0 0 0 SW 0.0 0.0 0:03 kswapd
5 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreclaimd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
7 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
8 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
15 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_2
18 root 9 0 0 0 0 SW 0.0 0.0 0:08 kjournald
93 root 9 0 0 0 0 SW 0.0 0.0 0:00 khubd
185 root 9 0 0 0 0 SW 0.0 0.0 0:00 kjournald
628 root 9 0 620 620 524 S 0.0 0.2 0:00 syslogd
633 root 9 0 1100 1100 448 S 0.0 0.4 0:00 klogd
653 rpc 9 0 592 592 504 S 0.0 0.2 0:00 portmap
681 rpcuser 9 0 764 764 664 S 0.0 0.2 0:00 rpc.statd
You can further modify the processes using renice to give a new priority to each process. If a
process hangs or occupies too much CPU, you can kill the process. Of course you can also
use the standard commands renice or kill to perform these steps, but with top you have
one interface to perform all these tasks.
For more information about top, see the online help (man top).
Note: It may not always be possible to change the priority of a process via the nice level. If
a process is running too slowly, you can assign more CPU to it by giving it a lower nice
level. Of course, this means that all other programs have fewer processor cycles and will
run more slowly.
Linux supports nice levels from 19 (lowest or least nice—gets more CPU) to -20 (highest or
nicest). Without an option the default value is 10. To change the nice level of a program to a
negative number, it is necessary to log on as root.
To start the program xyz with a nice level of -5, issue the command:
nice -n -5 xyz
To change the nice level of a program already running, issue the command:
renice -10 pid
Where pid is the process identification of the process. The process will decrease its nice level
to -10.
Zombie processes
When a process has been terminated by receiving a signal to do so, it normally takes
some time to finish all its tasks (closing open files, and so on) before ending itself. In
that normally very short time frame, the process is a zombie.
After the process has finished all these shutdown tasks, it reports to the parent process that it
is about to terminate. Sometimes a zombie process is unable to terminate itself, in which
case, you will see processes with a status of Z (zombie).
It is not possible to kill such a process with the kill command, because it is already
considered dead. If you cannot get rid of a zombie, you can kill the parent process and then
the zombie disappears as well. However, if the parent process is the init process, you should
not kill it; in that case only a reboot will remove the zombie.
8.3.4 iostat
If the iostat command is not included in your distribution, you may get it here:
http://linux.inet.hr/
The iostat command lets you see average CPU times since the system was started, in a way
similar to uptime. In addition, however, iostat creates a report about the activities of the
DS6000 disk subsystem on the server. The report is split in CPU utilization and device
utilization, where device utilization means the disk subsystem. Example 8-11 illustrates a
sample output of the iostat command.
For more information about iostat see the online help (man iostat).
8.3.6 sar
The sar command, which is included in the sysstat package, uses the standard system
activity daily data file to generate a report.
To install the sysstat package, log in as root and mount the CD-ROM containing the package.
Then do the following steps:
mount -t iso9660 /dev/cdrom /mnt/cdrom
cd /mnt/cdrom/RedHat/RPMS
rpm -ivh sysstat-3.3.5-3.i386.rpm
The system has to be configured to grab the information and log it; therefore, a cron job must
be set up. Add the following lines to the /etc/crontab. Example 8-13 illustrates an example of
automatic log reporting with cron.
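As a sketch, typical sysstat entries in /etc/crontab look like the following (paths can
vary by distribution):

# collect system activity data every 10 minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1
# produce a daily summary report shortly before midnight
53 23 * * * root /usr/lib/sa/sa2 -A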
You get a detailed overview of your CPU utilization (%user, %nice, %system, %idle), memory
paging, network I/O and transfer statistics, process creation activity, activity for block devices,
and interrupts/second over time.
These are the main values that are displayed if you use sar -A (the -A is equivalent to
-bBcdqrRuvwWy -I SUM -I PROC -n FULL -U ALL, which selects the most relevant counters of
the system):
kbmemfree Free memory in KB
kbmemused Used memory in KB (without memory used by the kernel)
%memused Percentage of used memory
kbmemshrd Amount of memory shared by the system (always 0 with kernel 2.4)
kbbuffers Memory used for buffers by kernel in KB
kbcached Memory used for caching by kernel in KB
kbswpfree Free swap space in KB
kbswpused Used swap space in KB
%swpused Percentage of used swap space
intr/s Interrupts per second
05:00:01 PM proc/s
05:10:00 PM 13.16
05:20:00 PM 0.14
05:30:00 PM 0.05
05:40:00 PM 0.05
05:50:01 PM 0.05
06:00:01 PM 0.05
06:10:01 PM 0.07
06:20:01 PM 0.05
06:30:00 PM 0.05
8.3.7 isag
The output of sar is straight text and can be very time consuming to process. Instead, the
isag command (Interactive System Activity Grapher) can show the data gathered by sar in a
graphical format (see Figure 8-5).
When you start isag you must first select a data source. Click the - button to the right of data
source. A menu will appear showing the different data sources available. The data sources
are named sa01, sa02, sa03, etc., each standing for a day of the month when recorded (for
example, sa11 would mean the log file recorded on the 11th day of the current month).
However, only the last nine days are available for analysis.
The slider on the left of the window (see Figure 8-5) is used to adjust the vertical scale of the
graph. By default, isag will display the paging statistics, but you can change the view by
clicking Chart and then choosing the data you are interested in:
I/O transfer rate
Paging statistics
Process creation
Run queue
Memory and swap
Memory activities
CPU utilization
Inode status
System switching
System swapping
Note: isag keeps data for only one week. After one week, the collected data for the
seventh day will be deleted. This might not be enough to do a proper bottleneck analysis or
to make a trend analysis of the server.
Run Queue
Run Queue has the following counters:
runq-sz Run queue length (number of processes waiting for runtime)
plist-sz Number of processes in the process list
Figure 8-8 on page 291 illustrates a sample Memory and Swap graphic report.
Memory Activities
Memory Activities has the following counters:
frmpg/s Number of memory pages freed by the system per second. (A
negative value represents the number of pages allocated by the
system.)
shmpg/s Number of additional memory pages shared by the system per
second. A negative value means fewer pages shared by the system.
bufpg/s Number of additional memory pages used as buffers by the system
per second. A negative value means fewer pages used as buffers by
the system.
campg/s Number of additional memory pages cached by the system per
second. A negative value means fewer pages in the cache.
Figure 8-9 on page 292 illustrates a sample Memory Activities graphic report.
CPU Utilization
CPU Utilization has the following counters:
%user Percentage of CPU utilization that occurred while executing at the user
level (application)
%nice Percentage of CPU utilization that occurred while executing at the user
level with nice priority
%system Percentage of CPU utilization that occurred while executing at the
system level (kernel)
Figure 8-10 on page 293 illustrates a sample CPU Utilization graphic report.
System swapping
System swapping uses the following counters:
pswpin/s Total number of swap pages the system brought in per second
pswpout/s Total number of swap pages the system brought out per second
For more information about sar and isag, see the man pages (man sar, man isag).
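As a hedged sketch (the script locations and log directory are typical sysstat defaults and
differ between distributions), background data collection for sar and isag is usually driven
by cron entries like these:
# Collect counters every 10 minutes into /var/log/sa/saDD
*/10 * * * * root /usr/lib/sa/sa1 1 1
# Write a daily summary report at 23:53
53 23 * * * root /usr/lib/sa/sa2 -A
With the daily saDD files in place, start isag and select the day to graph from its data source
menu.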
Note: GKrellM is an Xwindows tool. Running X may impact your performance analysis.
Of course, you can get all of this information from several separate monitoring tools, but the
one big advantage of GKrellM is that it takes up only one process to monitor your system.
Furthermore, the charts have an autoscaling feature, but you can also use fixed scaling
modes. Figure 8-12 shows the output of GKrellM.
The graphical front end uses sensors to retrieve the information it displays. A sensor can
return simple values or more complex information such as tables. For each type of
information, one or more displays are provided. Displays are organized in worksheets that
can be saved and loaded independently from each other.
Note: KSysguard is an Xwindows tool. Running X may impact your performance analysis.
The KSysguard main window (see Figure 5-11) consists of a menu bar, an optional tool bar
and status bar, the sensor browser, and the work space. When first started, you see your local
machine listed as localhost in the sensor browser and two pages in the work space area. This
is the default setup.
The sensor browser displays the registered hosts and their sensors in a tree form, and
includes the type of data. Each sensor monitors a certain system value. All of the displayed
sensors can be dragged and dropped in the work space. There are two options:
You can delete and replace sensors in the current work space.
You can create a new worksheet and drop new sensors meeting your needs.
KSysguard is part of the KDE project and information and updates can be obtained at:
http://www.kde.org
For each HBA, there are BIOS and driver settings that are appropriate for connecting to the
DS6000. If these settings are not configured correctly, performance can suffer, or the
connection might not work properly.
You can get the driver, BIOS, and HBA-related information from the following link:
http://knowledge.storage.ibm.com/servers/storage/support/hbasearch/interop/hbaSearch.do
Note: When configuring the HBA, we strongly recommend installing the newest version of
the driver and BIOS. Newer versions include enhancements and problem fixes that
improve performance and RAS.
8.5.1 Implementation
LVM and LVM2 for Linux can be downloaded free of charge at the following Web site:
http://sources.redhat.com/lvm/
http://sourceware.org/lvm2/
Note: Because LVM is licensed free of charge, there is no warranty for the program, to the
extent permitted by applicable law. Except when otherwise stated in writing, the copyright
holders and/or other parties provide the program “as is” without warranty of any kind, either
expressed or implied, including, but not limited to, the implied warranties of merchantability
and fitness for a particular purpose. The entire risk as to the quality and performance of the
program is with you. Should the program prove defective, you assume the cost of all
necessary servicing, repair, or correction.
In order to use the LVM, you need to make sure that the kernel supports the tool; this may
require rebuilding and recompiling the kernel. Again, we refer you to the Redpaper Running
the Linux 2.4 Kernel on IBM Eserver xSeries Servers, REDP0121, available from:
http://www.redbooks.ibm.com
Select Redpapers from the left navigation bar and do a search using the Redpaper form
number REDP0121, which contains detailed steps to add the LVM to the source kernel.
For a more complete discussion on how to use the LVM tool, visit the Web site:
http://tldp.org/HOWTO/LVM-HOWTO/index.html
RAID
A striped RAID 5 or RAID 10 LUN is created on the DS6000 before it is assigned to the Linux
OS. When Linux boots with this configuration, it sees the LUN as a single disk. As far as LVM
is concerned, there is just one disk in the machine, and it is used as such. If one of the disks
within the LUN fails on the DS6000, LVM will not even know. When the IBM representative
replaces the disk (even on the fly), LVM will not know about that either; the controller rebuilds
the array and all will be well. This is where most users take a step back and ask: Then what
good does LVM do for me with this RAID controller? The easy answer is that in most cases,
after you define a logical drive in the DS6000, you cannot add more disks to that drive later.
So if you miscalculate the space requirements, or you simply need more space, you cannot
add a new disk or set of disks into a pre-existing OS-level software stripe set. Instead, you
create or assign a new RAID LUN in the DS6000 through the DS Storage Manager, and then
with LVM you simply extend the LVM logical volume so that it seamlessly spans both LUNs
on the host platform. However, this is only the case if you do not use striping at the LVM
level.
Data striping
For performance reasons, with the LVM it can be beneficial to spread data in a stripe over
multiple physical volumes. We can use this functionality to spread data across different
DS6000 LUNs. Figure 8-14 on page 298 illustrates an example where block 1 is on Physical
Volume 0 (PV 0), and block 2 is on PV 1, while block 3 is on PV 2. Of course, you can also
stripe over more than 3 LUNs.
Figure 8-14 A Logical Volume striped across three Physical Volumes (PV 0, PV 1, PV 2)
This arrangement means that you have more disk bandwidth available. It also means that
more spindles are potentially involved. We say potentially because having this Logical Volume
spread across three LUNs, as opposed to one LUN, would not involve more spindles if the
LUNs were all defined on the same Rank. This is because when LUNs are assigned from the
same Rank, each LUN is spread across, for example, the same 7 disks in that Rank.
Therefore, if you looked at Disk 1 (or DDM 1) within the DS6000 Rank, you would see data
from LUN 1, LUN 2, and LUN 3. This only involves the 7 spindles (in this example) that are
included in the disks on that Rank. Figure 8-15 on page 299 demonstrates this point, that
LUNs will store data on the same physical disks within a RAID array when assigned from one
Rank (as opposed to each LUN being assigned to separate Ranks). Don’t stripe from LUNs
on the same RAID array!
Figure 8-15 Three LUNs on the same DS6000 Rank will not optimize performance
If you assigned each of these LUNs to different Ranks, for example 3 different Ranks, you
would involve 21 different spindles (if they were built on 7+P arrays).
Note: One of the most important performance considerations when using the LVM on
Linux is that you should stripe the LV across LUNs on different Ranks to gain performance.
This means each LUN needs to be defined on a separate Rank; otherwise, if you stripe
across LUNs on the same Rank, you can actually worsen performance.
Considerations
We recommend that you do not create a stripe size less than 64 KB. If many applications are
addressing the same array, then software striping at the OS level will not help much for host
performance improvement, but it also will not hurt the array performance. For sequential I/O, it
is better to have larger stripe sizes so that the LVM will not have to split write requests. You
could take it to 512 MB, the maximum single I/O size, to fully utilize the FC connection. For
random I/O, the stripe size is not so important.
Note: Remember that once a striped LV has been created through the Linux LVM, you
cannot add a physical volume to this LV. If there is any possibility there will be a need to
later extend LVs, then you should not use striping.
With -i, we tell LVM how many physical volumes it should use to stripe across. Striping is not
really done on a bit-by-bit basis, but on blocks. With -I (uppercase i), we can specify the
stripe size in KB.
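A minimal sketch of these options follows (device names and sizes are illustrative only; your
DS6000 LUN device names will differ):
# Initialize three DS6000 LUNs as LVM physical volumes
pvcreate /dev/sdb /dev/sdc /dev/sdd
# Group them into a single volume group
vgcreate datavg /dev/sdb /dev/sdc /dev/sdd
# Create a 100 GB logical volume striped across all three PVs (-i 3)
# with a 64 KB stripe size (-I 64)
lvcreate -i 3 -I 64 -L 100G -n datalv datavg
Remember that each of the three LUNs should come from a different Rank, as discussed
above.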
8.6 Bonnie
Bonnie is a performance measurement tool written by Tim Bray. For a more complete
description and documentation on Bonnie go to the following Web site:
http://www.textuality.com/bonnie/
Bonnie performs a series of tests on a file of known size. If the size is not specified, Bonnie
uses 100 MB, but that probably is not enough for a big modern server. Bonnie works with
64-bit pointers if you have them.
For each test, Bonnie reports the bytes processed per elapsed second, per CPU second, and
the % CPU usage (user and system).
8.6.1 Benchmarks
Bonnie does the following benchmarks:
char output with putc() / putc_unlocked()
The result is the performance a program will see that uses putc() to write single
characters. On most systems, the speed for this is limited by the overhead of the library
calls into the libc, not by the underlying device. The _unlocked version (used if bonnie is
called with -u) may be considerably faster, as it involves less overhead.
char input with getc() / getc_unlocked()
The result is the performance a program will see that uses getc() to read single characters.
The same comments apply as to putc().
Block output with write()
This is the speed with which your program can output data to the underlying file system
and device writing blocks to a file with write(). As writes are buffered on most systems, you
will see numbers that are much higher than the actual speed of your device, unless you
sync() after the writes (option -y) or use a considerably larger size for your test file than
your OS will buffer. For Linux, this is almost all your main memory.
If called with the -o_direct option, this operation (and the ones described in the following
two paragraphs) is done with the O_DIRECT flag set, which results in direct DMA from
your hardware to userspace, thus avoiding CPU overhead copying buffers around. This
will prevent buffering, and gives a much better estimate of real hardware speed, also for
small test sizes.
Block input with read()
This is the speed with which you can read blocks of data from a file with read(). The same
comment as for block output regarding your OS doing buffering for you applies, with the
exception that using -y does not help to get realistic numbers for reading. You would need
to flush the buffers of the underlying block device, but this turns out to not be trivial, as you
first have to find out the block device. It would be a Linux-only feature anyway.
Block in/out rewrite
Bonnie does a read(), changes a few bytes, write()s the data back, and rereads it. This is
a pattern that occurs in some database applications. Its result tells you how well your
operating system's file system can handle such access patterns.
8.6.2 Downloading
For downloading Bonnie, go to the following Web site:
http://www.textuality.com/bonnie/download.html
Installation and compilation should be straightforward. For Linux, your easiest option is to use
rpm --rebuild on the source RPM. If you use Linux (preferably SuSE Linux) on an i386
machine, you can even use the binary RPM.
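As a hedged example (the mount point is an assumption; choose a test file size well above
installed RAM so that the operating system cannot cache the entire file):
# Run Bonnie against a DS6000-backed file system
# -d: working directory on the device under test
# -s: test file size in MB
# -y: sync() after the writes for more realistic block output figures
bonnie -d /mnt/ds6000 -s 2048 -y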
8.7 Bonnie++
Bonnie++ is a benchmark suite aimed at performing a number of simple tests of hard drive
and file system performance. After running it, you can decide which tests are important and
how to compare different systems.
The main program tests database type access to a single file (or a set of files if you want to
test more than 1 G of storage), and it tests creation, reading, and deleting of small files that
can simulate the usage of programs such as Squid, INN, or Maildir format e-mail.
The ZCAV program tests the performance of different zones of a hard drive. It does not write
any data (so you can use it on full file systems). It can show why comparing the speed of
Windows at the start of a hard drive to Linux at the end of the hard drive (typical dual-boot
scenario) is not a valid comparison.
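A hedged sketch of typical invocations follows (paths and the user name are illustrative; see
the Bonnie++ man pages for the full option list):
# Test database-style and small-file workloads on a DS6000 file system;
# Bonnie++ refuses to run as root unless a user is supplied with -u
bonnie++ -d /mnt/ds6000 -s 2048 -u nobody
# Read-only zone test of a drive; ZCAV writes nothing to the device
zcav /dev/sda > zcav-sda.log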
Bonnie++ was based on the code for Bonnie by Tim Bray. Go to the following Web site for a
summary of the differences between Bonnie 1.0 and Bonnie++.
http://www.coker.com.au/bonnie++/
The original author (Tim Bray) has also put a description of Bonnie on his pages.
The disk subsystem can be the most important aspect of I/O performance, but problems can
be hidden by other factors, such as a lack of memory. Finding disk bottlenecks is easier once
these other possible bottlenecks have been ruled out.
The disk subsystem’s speed affects the overall performance of the file server in the following
ways:
It usually improves the minimum sustained transaction rate.
It may only slightly affect performance under light loads because most requests are
serviced directly from the disk cache. In this case, network transfer time is a relatively
large component and disk transfer times are hidden by disk cache performance.
As the server disk performance improves, increased network adapter and CPU
performance is required to support greater disk I/O transaction rates.
When the I/O subsystem is well tuned and performing efficiently, more throughput and
transactions per second can be done by the system as users and workload increase (see
Figure 8-16).
The I/O operations per second counters in the tools discussed so far in this chapter can be
used to determine whether the server has disk bottlenecks. Collect logged data over a period
of time and then analyze it to see whether a trend can be detected that points to a future disk
bottleneck.
After verifying that the disk subsystem is causing a system bottleneck, a number of solutions
are possible. These solutions include the following:
Consider using faster disks. Allocating your application’s data on the 15K rpm disk drive
Ranks will deliver better performance as compared to the 10K rpm disk drive Ranks.
Consider changing the RAID implementation if this is relevant to the server's I/O workload
characteristics. For example, moving to RAID 10 when the activity is heavy random writes
may show observable gains.
Add more arrays. This will allow you to spread the data across multiple physical disks and
thus improve performance for both reads and writes. Also, use hardware RAID instead of
the software implementation provided by Linux. If hardware RAID is being used, the RAID
level is hidden from the operating system and is therefore more efficient.
Add more RAM. Adding memory will increase system memory disk cache, which in effect
improves disk response times.
Finally, if the previous actions do not provide the desired application performance, then
offload processing to another host system in the network (either users, applications, or
services).
The most current list of Windows servers that can attach to the DS6000 can be found in the
following Web site:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
Tuning all the components in a system is demanding and requires you not only to take a
benchmark before you change anything, but also to take periodic measurements as you go.
Nonetheless, this more comprehensive activity pays off with optimized system performance.
The various system components that can affect the disk performance and are discussed in
this chapter, are:
Priorities between foreground and background processes
Virtual memory
System cache
File system layout and management
There are recommended publications that can help you when tuning the whole system:
Tuning IBM eServer xSeries Servers for Performance, SG24-5287, and Tuning Windows
Server 2003 on IBM eServer xSeries Servers, REDP-3943.
The following list provides additional steps that can be taken to provide better disk
performance on the host:
Modify the priorities between foreground and background processes.
Applications that are CPU and memory intensive should be scheduled for after-hours
operation. Examples of these applications are virus scanners, backup software, and disk
fragmentation utilities. These types of applications should be scheduled to run when the
server is not being utilized.
Allocate virtual memory pro-actively.
Specify the server type to determine how system cache is allocated and used.
Disable unnecessary services.
There are documents on the Microsoft Web site that are helpful for understanding the
performance improvements in Windows Server 2003 and where to tune.
This setting lets you choose how processor resources are shared between the foreground
process and the background processes. Typically, for a server, you do not want the
foreground process to have more CPU cycles allocated to it than the background processes.
We recommend selecting Background services so that all programs receive equal amounts
of processor time.
To change this:
1. Open Control Panel.
2. Open System.
3. Select the Advanced tab.
4. Click the Performance Options button; the window shown in Figure 9-1 appears.
Under Processor scheduling, you can choose one of two settings to optimize performance:
Programs: more processor resources are given to the foreground process than to the
background processes.
Background services (recommended): all programs receive equal amounts of
processor resources.
We strongly recommend that you use only the GUI interface for these settings in order to
always get valid, appropriate, operating system revision-specific, and optimal values in the
registry.
Windows Server 2003 and 2000, as most other server operating systems, employ virtual
memory techniques that allow applications to address greater amounts of memory than what
is physically available. Memory pressure occurs when the demand for physical memory
exceeds the amount of installed memory, causing the operating system to page excess
memory onto a disk drive.
Paging is the process whereby blocks of data are swapped between physical memory and a
file on the hard disk. The paging file is pagefile.sys. The combination of the paging file and
the physical memory is known as virtual memory. Some paging is normal during ordinary
operation, but excessive, consistent paging, which is called thrashing, hurts system
performance. To avoid this, paging should be minimized.
You can control the size of the paging file and this can improve performance if you specify the
minimum value to be what the server normally allocates during the peak time of the day. This
ensures that no processing resources are lost to the allocation and segmentation of the
paging space.
Tip: Create a separate page file for each disk to improve system performance.
You can set the initial size and the maximum size of the paging file for every drive. The
maximum number of page files is 16, and the maximum size is 4 GB per page file, which
means the maximum total page file size is 64 GB. The operating system uses the combined
page files as a single logical paging space. For a file server, set the minimum to the
recommended value, as shown in the window. For other server applications, the
recommendation varies. For a discussion of recommended values, refer to the publication
Tuning IBM eServer xSeries Servers for Performance, SG24-5287.
In a production environment with well-written server applications, hard page faults should not
constantly occur. If there is any sustained paging, check the available bytes in the Task
Manager.
If Available Bytes is less than 20 percent of installed RAM, then add more RAM.
If Available Bytes is much greater than 20 percent of total installed RAM, then the
application cannot make use of additional RAM, so the only solution is to optimize the
page device.
Note: If you remove the page file from the boot partition, dump files (memory.dmp), which
contain debugging information, cannot be created when a blue screen occurs. If you need
a dump file, you must have a page file of at least the size of physical memory plus 1 MB on
the boot partition.
Having a page file size smaller than the current RAM size will affect performance of the
server. Our recommendation is to set the memory page file size to twice the size of the RAM
for a maximum performance gain. The only drawback of having a big page file is the
restriction in space available for files on the hard drives. Since the host will be using DS6000
disks this should not be a concern.
The best way to create a contiguous static page file is to follow this procedure:
1. Remove the current page file from your server by clearing the Initial and Maximum size
values in the Virtual Memory settings window, then click Set (refer to Figure 9-2 on
page 309).
2. Reboot the machine and click OK; ignore the warning message about the page file.
3. Defragment the disk you want to create the page file on. This step should give you enough
continuous space to avoid partitioning of your new page file.
4. Create a new static page file by setting the Initial and Maximum size with the same value.
If possible, use twice the size of your RAM.
5. Reboot the server.
The above procedure will leave you with a contiguous static page file.
Ideally, the paging device will be a separate physical drive. Having data that is being
accessed on the same drive as the paging drive can reduce performance, especially when
multiple logical drives are configured on one physical drive. This causes long seek operations
and slows performance.
If the page files and active data must reside on the same physical device, place them on the
same logical drive. This will keep the page file and data files physically close together, and will
improve performance by reducing the time spent seeking between the two logical drives. Of
course, you can ignore this issue if no I/O access is made to the data drive during normal
operation.
The file system cache is a dynamic memory pool used to store recently accessed data for all
cacheable peripheral devices, which includes data transfers between hard drives, networks
cards, and networks. The Windows Virtual Memory Manager copies data to and from the file
system cache as though it were an array in memory. When data resides in file system cache,
it will improve performance and reduce disk activity.
Tip: Windows Server 2003 has two applets to manage file system cache as compared with
the previous version of Windows which has just one applet.
Two applets of Windows Server 2003 determine how much system memory is available to be
allocated to the working set of file system cache versus how much memory is available to be
allocated to the working set of applications, and the priority with which they are managed
against one another.
To change the File and Printer Sharing for Microsoft Networks setting (both Windows Server
2003 and Windows 2000):
1. Click Start -> Settings -> Network and Dial-Up Connection.
Note: This setting affects all LAN connections, so which LAN connection you choose in the
above steps is not important. If you are not using this system as a file system server, then
you will not be able to modify the cache priorities here.
The file system cache has a working set of memory like any other process. The option chosen
in this dialog effectively determines how large the working set is allowed to grow to and with
what priority the file system cache is treated by the operating system relative to other
applications and processes running on the server.
You have four choices but typically only one of the bottom two options is selected for an
enterprise server implementation:
1. Minimize memory used.
This choice will minimize the memory used for disk cache and maximize the memory
available for the operating system. However, on file servers, the resulting performance
would not be desirable. Therefore, only use this choice for workstations.
The value of the registry entries will be set depending on the option selected in the control
panel, as listed in Table 9-1.
Table 9-1 Registry values set by each option
Option                                         Size   LargeSystemCache
Minimize memory used                           1      0
Balance                                        2      0
Maximize throughput for file sharing           3      1
Maximize throughput for network applications   3      0
The second control panel which is used to manage a file system cache of Windows Server
2003 is in the System applet (Windows Server 2003 only):
1. Click Start -> Control Panel -> System.
2. Select Advanced.
3. Within the Performance frame, click Settings.
4. Select Advanced. The window shown in Figure 9-4 on page 314 appears.
The System applet can also change the value of the LanmanServer LargeSystemCache
registry key, just as File and Printer Sharing in the Network applet does. However, the
System applet changes LargeSystemCache without affecting the Memory Management Size
value that File and Printer Sharing also sets.
Given that most users will only use the Maximize throughput for network applications
option or the Maximize throughput for file sharing option for enterprise servers, the Size
value remains the same, a value of 3. This means that using the System applet to adjust the
LargeSystemCache value is redundant as it is just as easily set using File and Print Sharing.
As a result, we recommend using the first control panel as described above and leave this
second control panel untouched. It would seem that the only advantage to using both Control
Panel applets in conjunction would be to enable you to have the applets actually indicate
Maximize throughput for network applications and simultaneously indicate memory usage
favors System cache. This same effect to the registry is achieved by selecting Maximize
throughput for file-sharing (as per Table 9-1 on page 313) — visually it simply does not say
“Maximize throughput for network applications”. If you do desire this change purely for
aesthetic reasons, then make sure you set the first Network applet before the second System
applet, as the first overrides the second selection, but the reverse does not occur.
With Windows Server 2003, when Maximize data throughput for file sharing is selected,
the file system cache can grow to 960 MB. When Maximize throughput for network
applications is selected, the file system cache can grow to 512 MB. (See Microsoft KB
837331; location below.) Depending on the selection made here, it is possible that adding
more physical memory will not increase the size of the file system cache beyond these limits.
On a server with a lot of physical memory (2 GB or more), it may be preferable to leave the
option Maximize data throughput for file sharing selected (that is, as long as the total
amount of memory used by the operating system and server applications does not exceed
the amount of physical RAM minus 960 MB). In fact, any application server that can have 960
MB or more of RAM unused, will likely improve performance by enabling the large system
cache.
By enabling this, all of the disk and network I/O performance benefits of using a large file
system cache are realized, and the applications running on the server continue to run without
being memory-constrained.
Some applications have their own memory management optimizers built into them, including
Microsoft SQL Server and Microsoft Exchange. In such instances, the setting above is best
set to Maximize throughput for network applications to let the applications manage
memory and their own internal caches as they see fit.
Services can be seen in the Computer Management console. To view services running on
Windows, right-click My Computer and select Manage. Then the Computer Management
window will appear. Select Services in the left pane of the window. Click the Standard tab at
the bottom of the right-side pane. Then a window similar to that shown in Figure 9-5 on
page 316 will appear.
You should stop services that are not needed to free additional memory to those that need it
most, such as the operating system and user applications. To do this, select a service from
the service list and click Stop.
Also examine the startup values of the installed services. Right-click the service and select
Properties. Select Disabled if you do not want this service to run at all on server startup, or
Manual if you want to start a service only at the time you need to use it.
Windows Server 2003 adds many services compared to Windows 2000, partly to strengthen
security. Most of them have a startup type of Disabled or Manual by default, but some are set
to Automatic. When the system boots, the services set to Automatic are started and consume
resources. Some of these services are not actually required, so you should stop them and
set their startup type to Disabled or Manual. For example, the Print Spooler service is
enabled by default, but this service is usually not required unless the server works as a print
spooler or has a local printer.
Table 9-2 lists the services that you should evaluate to determine whether your system
requires them on Windows Server 2003. This list is not applicable to all systems; it is just a
recommendation for a typical system. For example, the File Replication Service (FRS) is
normally required for an Active Directory domain controller, but for other servers this service
would not be required. These services are not disabled by default, so further investigation is
required before disabling them.
You can also stop processes using the Task Manager. Unneeded applications and processes
are those that you do not need running at the moment, for example, an application launched
at startup that does system maintenance, such as disk scanning and defragmentation.
To open the Task Manager, press the Ctrl + Shift + Esc keys. From the Applications tab, select
the unneeded application then click End Task. You can also do this from the Processes tab
by selecting the unneeded process then clicking End Process, as illustrated in Figure 9-6 on
page 318.
A process is considered unnecessary when it has nothing to do with your current server
function. It could have been invoked from the registry by some application that was not
correctly un-installed, for example.
Threads with the highest priority always run on the processor, even if this requires
preempting a thread of lower priority. This behavior ensures that Windows still pays attention
to critical system threads required to keep the operating system running. A thread runs on
the processor either for the duration of its CPU quantum (or time slice, described in 9.2.1,
“Foreground and background priorities” on page 307) or until it is preempted by a thread of
higher priority.
Task Manager allows you to easily see the priority of all threads running on a system. To do
so, open Task Manager, and click View -> Select Columns, then add a checkmark beside
Base Priority as shown in Figure 9-7 on page 319.
This displays a column in Task Manager as shown Figure 9-8 that enables you to see the
relative priority of processes running on the system.
Most applications that are loaded by users run at a normal priority, which has a base priority
value of 8. Task Manager also gives the administrator the ability to change the priority of a
process, either higher or lower.
To do so, right-click the process in question, and click Set Priority from the pull-down menu
as shown in Figure 9-9 on page 320. Click the new priority you want to assign to the process.
If you want to launch a process with a non-normal priority, you can do so using the start
command from a command prompt. Type start /? for more information about how to do this.
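For example (a hedged sketch; the executable names are illustrative):
REM Launch a process at low priority
start /low notepad.exe
REM Launch a process at above-normal priority
start /abovenormal myapp.exe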
Threads, as a subcomponent of processes, inherit the base priority of their parent process.
Each process’s priority class sets a range of priority values (between 1 and 31), and the
threads of that process have a priority within that range. If the priority class is Realtime
(priorities 16 to 31), the thread’s priority can never change while it is running. A single thread
running at priority 31 will prevent all other threads from running.
Conversely, threads running in all other priority classes are variable, meaning that the
thread’s priority can change while the thread is running. For threads in the Normal or High
priority classes (priorities 1 through 15), the thread’s priority can be raised or lowered by up to
a value of 2 but cannot fall below its original, program-defined base priority.
When should you modify the priority of a process? In most instances, you should do this as
rarely as possible. Windows normally does a very good job of scheduling processor time to
threads. Changing process priority is not an appropriate long-term solution to a bottleneck on
a system. Eventually, additional or faster processors will be required to improve system
performance.
Normally, the only conditions under which the priority of a process should be modified are
when the system is CPU-bound. Processor utilization, queue length, and context switching
can all be measured using System Monitor to help identify processor bottlenecks.
Important: Changing priorities might destabilize the system. Increasing the priority of a
process might prevent other processes, including system services, from running. In
particular, be careful not to schedule many processes with the High priority and avoid
using the Realtime priority altogether. Setting a processor-bound process to Realtime
could cause the computer to stop responding altogether.
Decreasing the priority of a process might prevent it from running, not merely force it to run
less frequently. In addition, lowering priority does not necessarily reduce the amount of
processor time a thread receives; this happens only if it is no longer the highest-priority
thread.
Hard affinity can be applied to permanently bind a process to a given CPU or set of CPUs,
forcing the designated process to always return to the same processor. The performance
advantage in doing this is best seen in systems with large Level 2 caches, as the cache hit
ratio (in the server, not storage) will improve dramatically.
Some applications, such as SQL Server, provide internal options to assign themselves to
specific CPUs. The other method for setting affinity is via Task Manager: right-click the
process in question and click Set Affinity, as shown in Figure 9-10 on page 322. Add check
marks next to the CPUs you want to restrict the process to and click OK.
Note that, as with changing the process’s priority, changing process affinity in this manner will
only last for the duration of the process. If the process ends or the system is rebooted, the
affinity will have to be reallocated as required. Note also that not all processes permit affinity
changes.
Specifically for server performance tuning purposes, the Interrupt-Affinity Filter tool (Intfiltr)
enables you to assign the interrupts generated by each network adapter to a specific CPU.
Of course, it is only useful on SMP systems with more than one network adapter installed.
Binding the individual network adapters in a server to a given CPU can offer large
performance efficiencies.
Intfiltr uses plug-and-play features of Windows that permit affinity for device interrupts to
particular processors. Intfiltr binds a filter driver to devices with interrupts and is then used to
set the affinity mask for the devices that have the filter driver associated with them. This
permits Windows to have specific device interrupts associated with nominated processors.
Interrupt filtering can affect the overall performance of your computer in both a positive and
negative manner. Under normal circumstances, there is no easy way to determine which
processor is best left to handle specific interrupts. Experimentation and analysis will be
required to determine whether interrupt affinity has yielded performance gains. To this end, by
default without tools like Intfiltr, Windows directs interrupts to any available processor.
Some considerations should be made when configuring Intfiltr on a server that supports
Hyper-Threading to ensure that the interrupts are assigned to the correct physical processors
desired, not the logical processors. Assigning interrupt affinity to two logical processors that
actually refer to the same physical processor will obviously offer no benefit.
Interrupt affinity for network cards can offer definite performance advantages on large, busy
servers with many CPUs. Our recommendation is to try Intfiltr in a test environment to
associate specific interrupts for network cards with selected processors. This enables you to
determine whether using interrupt affinity will offer a performance advantage for your network
interface cards.
Note that Intfiltr can be used for creating an affinity between CPUs and devices other than
network cards, such as disk controllers. Again, experimentation is the best way to determine
potential performance gains. To determine the interrupts of network cards or other devices,
use Windows Device Manager or run System Information (WINMSD.EXE).
The Intfiltr utility and documentation are available free of charge from Microsoft:
ftp://ftp.microsoft.com/bussys/winnt/winnt-public/tools/affinity/intfiltr.zip
Windows provides a /3GB parameter that can be added to the BOOT.INI file. It reallocates
the address space so that 3 GB is available for user-mode applications and reduces the
amount available to the system kernel to 1 GB. Some applications that are written to do so,
such as Microsoft Exchange and Microsoft SQL Server, can derive performance benefits
from having large amounts of addressable memory available to individual user-mode
processes.
To edit the BOOT.INI file to make this change, complete the following steps:
1. Open the System Control Panel.
2. Select Advanced.
3. In the Startup and Recovery frame, click Settings.
4. Click Edit. Notepad opens to edit the current BOOT.INI file.
This switch normally should be used only when a specific application recommends its use.
Typically this is where applications have been compiled to use more than 2 GB per process,
such as some components of Exchange.
Important: The /3GB switch actually works for all versions of Windows 2000 Server and
Windows Server 2003. However, you should use it only when running Advanced Edition or
Datacenter Edition.
Standard Edition can allocate to user-mode applications at most 2 GB. If the /3GB switch is
configured in the BOOT.INI file, then the privileged-mode kernel is restricted to 1 GB of
addressable memory without the corresponding increase for user-mode applications. This
effectively means 1 GB of address space is lost.
PAE requires appropriate hardware and operating system support to be implemented. Intel
introduced PAE 36-bit physical addressing with the Intel Pentium Pro processor. PAE is
supported with the Advanced and Datacenter Editions of Windows 2000 Server and the
Enterprise and Datacenter Editions of Windows Server 2003.
Windows uses 4 KB pages with PAE to map up to 64 GB of physical memory into a 32-bit (4
GB) virtual address space. The kernel effectively creates a map in the privileged mode
addressable memory space to manage the physical memory above 4 GB.
The Advanced and Datacenter Editions of Windows 2000 Server and Windows Server 2003
allow for PAE through use of a /PAE switch in the BOOT.INI file. This effectively allows the
operating system to use physical memory above 4 GB.
Even with PAE enabled, the underlying architecture of the system is still based on 32-bit
linear addresses. This effectively retains the usual 2 GB of application space per user-mode
process and the 2 GB of kernel mode space, because only 4 GB of addresses are available.
However, multiple processes can immediately benefit from the increased amount of physical
memory.
Address Windowing Extensions (AWE) is a set of Windows APIs that take advantage of the
PAE functionality of the underlying operating system and enable applications to directly
address physical memory above 4 GB. Some applications such as SQL Server 2000,
Enterprise Edition, have been written with these APIs and can harness the significant
performance advantages of being able to address more than 2 GB of memory per process.
To edit the BOOT.INI file to enable PAE, complete the following steps:
1. Open the System Control Panel.
2. Select Advanced.
3. In the Startup and Recovery frame, click Settings.
4. Click Edit. Notepad opens to edit the current BOOT.INI file.
5. Edit the current BOOT.INI file to include the /PAE switch as shown in Figure 17.
6. Restart the server for the change to take effect.
On a server with between 4 GB and 16 GB of RAM hosting applications that have been
compiled or written with AWE to use more than 2 GB of RAM per process or hosting many
applications (processes) that each contend for limited physical memory, it would be desirable
to use both the /3GB and /PAE switches.
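A hedged sketch of the resulting BOOT.INI entry follows (the ARC path and description are
illustrative and depend on your installation):
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /3GB /PAE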
NTFS should always be the file system of choice for servers. NTFS offers considerable
performance benefits over the FAT and FAT32 file systems and should be used exclusively on
Windows servers. In addition, NTFS offers many security, scalability, stability, and reliability
benefits over FAT.
Under previous versions of Windows, FAT and FAT32 were often implemented for smaller
volumes (say, less than 400 MB) because they were often faster in such situations. With disk
storage relatively inexpensive today, and with operating systems and applications pushing
drive capacity to the maximum, it is unlikely that such small volumes will be warranted. FAT32
scales better than FAT on larger volumes, but it is still not an appropriate file system for
Windows servers.
FAT and FAT32 have often been implemented in the past as they were seen as more easily
recoverable and manageable with native DOS tools in the event of a problem with a volume.
Today, with the various NTFS recoverability tools built both natively into the operating system
and as third-party utilities available, there should no longer be a valid argument for not using
NTFS for file systems.
NTFS was designed to provide reliability, security, and fault tolerance through data
redundancy. In addition, support was built into NTFS for large files and disks and for
Unicode-based names.
Windows 2003 uses default cluster sizes for NTFS as shown in Table 9-3, where the value for
the number of sectors assumes a standard 512-byte sector. On systems with sectors that are
not 512 bytes, the number of sectors per cluster may change, but the cluster size remains
fixed.
These values are only used if an allocation unit size is not specified at format time, using the
/A:<size> switch with the format command.
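For example (hedged; the drive letter is illustrative, and formatting destroys all data on the
volume):
REM Format drive E: as NTFS with a 64 KB allocation unit size
format E: /FS:NTFS /A:64K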
Note: The maximum NTFS volume size as implemented in Windows Server 2003 is 2^32
clusters minus 1 cluster. For example, using 64 KB clusters, the maximum NTFS volume
size is 256 terabytes minus 64 KB. Using the default cluster size of 4 KB, the maximum
NTFS volume size is 16 terabytes minus 4 KB.
If you have large numbers of files in an NTFS folder (300,000 or more), disable short-file-name
generation for better performance, especially if the first six characters of the long file
names are similar.
Before disabling short name generation, make sure that there is no DOS or 16-bit application
running on the server that requires 8.3 file names, nor are there any users accessing the files
on the server via 16-bit applications.
To disable the generation of 8.3 short names, edit the following registry parameter:
HKEY_LOCAL_MACHINE \SYSTEM \CurrentControlSet \Control \FileSystem
\NtfsDisable8dot3NameCreation
Change its value from 0 to 1. In Windows Server 2003, this parameter can also be set by
using the command:
fsutil behavior set disable8dot3 1
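As a hedged alternative to editing the registry by hand (reg.exe is included with Windows
Server 2003), the same registry value can be set from a command prompt:
REM Disable 8.3 short name generation via the registry
reg add "HKLM\SYSTEM\CurrentControlSet\Control\FileSystem" /v NtfsDisable8dot3NameCreation /t REG_DWORD /d 1 /f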
A related optimization is to disable updating of the last-access timestamp on files (the
NtfsDisableLastAccessUpdate parameter under the same registry key). In Windows Server
2003, this parameter can also be set by using the command:
fsutil behavior set disablelastaccess 1
Reliability
To ensure reliability of NTFS, three major areas were addressed: Recoverability, removal of
fatal single sector failures, and hot fixing.
The recoverability designed into NTFS is such that a user should never have to run any sort of
disk repair utility on an NTFS partition. This is because NTFS uses a journaled log to keep
track of transactions made against the file system. When a CHKDSK is performed on a FAT
file system, the consistency of pointers within the directory, allocation, and file tables is being
checked. Under NTFS, because a log of transactions against these components is
maintained, CHKDSK need only roll back transactions to the last commit point in order to
recover consistency within the file system.
Under FAT, if the sector that holds one of the file system's special objects fails, that single
sector failure can compromise the whole volume. NTFS avoids this in two ways: first, by not
using special fixed-location objects on the disk and by tracking and protecting all objects that
are on the disk; second, by keeping multiple copies (the number depends on the volume
size) of the Master File Table.
Similar to OS/2® versions of HPFS, NTFS supports hot fixing. NTFS will attempt to move the
data in a damaged cluster to a new location in a fashion that is transparent to the user. The
damaged cluster is then marked as unusable. Unfortunately, it is possible depending on what
damage has occurred, that the moved data may be unusable.
Under Windows, FAT is a careful-write file system: it allows only one write at a time and
updates its volume information after each write. This is a very secure form of writing, but it is
also a very slow process. To improve performance on FAT, you can opt to use the lazy-write
file system feature, which uses the system's memory cache. All writes are performed to this
cache, and the file system intelligently waits for the appropriate time to perform the writes to
disk.
This system gives the user faster access to the file system and prevents holdups due to
slower disk access. It is also possible, if the same file is being modified more than once, that it
may never actually be written to disk until the modifications are finished within the cache. Of
course, this can also lead to lost data if the system crashes and unwritten modifications are
still held in the cache.
NTFS provides the speed of a lazy-write file system along with additional recovery features.
Each write request to an NTFS partition generates both redo and undo information in the
transaction log. In the recovery process, this log ensures that only a few moments after a
reboot, the file system's integrity is fully restored, without the need to run a utility such as
CHKDSK, which requires scanning an entire volume. The overhead associated with this
recoverable file system is less than that of a careful-write file system.
Choosing a file system depends on your particular environment. Some of the factors for
choosing a file system include MS-DOS compatibility, file-level and file-system security,
performance, and recoverability. In general, NTFS is best for use on logical volumes of about
400 MB or more, because performance does not degrade under NTFS with larger volume
sizes, as it does under FAT.
Users seeking highly scalable solutions will use software and hardware solutions in
combination. For example, NTFS uses 64-bit addresses and file offsets. This allows for
theoretically immense file and volume sizes. Today, there are external limitations on volume
and file sizes imposed by the logical disk manager's disk partitioning system and by the
underlying hardware. However, NTFS will continue to scale as these limitations are broken
down.
Disk seek time is normally considerably longer than read or write activity. As noted above,
data is initially written to the outside edge of a disk. As demand for disk storage increases
and the disk fills up, data is written progressively closer to the center of the disk, where
access is slower. This means that monitoring disk space utilization is important, not just for
capacity reasons, but also for performance. It is neither practical nor realistic, however, to
have disks with excessive free space.
Tip: As a rule of thumb, work towards a goal of keeping disk free space between 20-25%
of total disk space. The DS6000 does not have a tool to monitor drive space utilization, so
you have to monitor it from the server side.
Warning: Using Registry Editor incorrectly can cause serious problems that may
require you to reinstall your operating system.
For information about how to edit the registry, view the “Change keys and Values” Help topic
in Registry Editor (Regedit.exe). Note that you should back up the registry before you edit it. If
you are running Windows, you should also update your Emergency Repair Disk.
Value: DisablePagingExecutive
Recommendation: 0x1
Setting DisablePagingExecutive to 1 keeps kernel-mode drivers and system code resident in
memory rather than allowing them to be paged to disk.
Performance and system stability can be seriously impacted if Windows experiences memory
resource constraints and is unable to assign memory to these pools. The amount of physical
memory assigned to these two pools is set dynamically at system boot time. Some
applications and workloads can demand more pooled memory than the system allocates by
default. Setting the PagedPoolSize registry value as listed in Table 9-4 may help ensure that
sufficient pooled memory is available.
Table 9-4 PagedPoolSize values
0x0 (default)
The system dynamically calculates an optimal value for the paged pool at system startup,
based on the amount of physical memory in the computer. This value changes if more
memory is installed. The system typically sets the size of the paged pool to approximately
twice that of the nonpaged pool size.
0x1 - 0x20000000 (512 MB)
Creates a paged pool of the specified size, in bytes. This takes precedence over the value
that the system calculates, and it prevents the system from adjusting the value dynamically.
Limiting the size of the paged pool to 192 MB (or smaller) lets the system expand the file
system (or system pages) virtual address space up to 960 MB. This setting is intended for file
servers and other systems that require an expanded file system address space (meaning
slightly faster access) at the expense of being able to cache less data. This only makes
sense if you know that the files your server frequently accesses already fit easily into the
cache.
0xFFFFFFFF
Windows calculates the maximum paged pool allowed for the system. For 32-bit systems,
this is 491 MB. This setting is typically used for servers that are attempting to cache a very
large number of frequently used small files, some number of very large files, or both. In these
cases, the file cache that relies on the paged pool to manage its caching is able to cache
more files (and for longer periods of time) if more paged pool is available.
Setting this value to 0xB71B000 (192 MB) provides the system with a large virtual address
space, expandable to up to 960 MB. Note that a corresponding entry of zero (0) is required in
the SystemPages registry value for this to take optimal effect, as described below.
Value: PagedPoolSize
Recommendation: 0xB71B000 (192 MB)
Value: SystemPages
Recommendation: 0x0
The value ranges listed in Table 9-5 equate to those calculated in Table 9-6, depending on
the exact amount of physical RAM in the machine. As most servers today have more than
512 MB of RAM, the calculations in Table 9-7 take into account only 512 MB of RAM and
above.
The appropriate value should be determined from Table 9-5 and then entered into the registry
value IoPageLockLimit. This value will then take precedence over the system default of 512
KB and will specify the maximum number of bytes that can be locked for I/O operations:
Value: IoPageLockLimit
Physical RAM   Recommended IoPageLockLimit
512 MB         0x1C000000
1 GB           0x3C000000
2 GB           0x80000000
4 GB           0xFC000000
8 GB           0xFFFFFFFF
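A hedged sketch of setting this value from the command line (the key path follows common
NT memory-management tuning references; verify it for your Windows release before
applying it):
REM Set IoPageLockLimit for a server with 1 GB of RAM
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v IoPageLockLimit /t REG_DWORD /d 0x3C000000 /f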
Click OK, quit Registry Editor, and then shut down and restart the computer.
For each HBA, there are BIOS and driver settings that are appropriate for connecting to your
DS6000. If these settings are not configured correctly, performance can suffer, or the
connection might not work properly.
To configure the HBA, see IBM TotalStorage DS6000 Host Systems Attachment Guide,
GC26-7680, which contains detailed procedures and recommended settings. You should
also read the readme files and manuals for the driver, BIOS, and HBA.
You can get the driver, BIOS, and HBA-related information from the following link:
http://knowledge.storage.ibm.com/servers/storage/support/hbasearch/interop/hbaSearch.do
Note: When configuring the HBA, we strongly recommend that you install the newest
version of the driver and BIOS. Newer versions include enhancements and problem fixes
that can improve performance and RAS.
The logging feature of the Performance console makes it possible to store, append, chart,
export, and analyze data captured over time. Products such as SQL Server and Exchange
provide additional monitors that allow the Performance console to extend its usefulness
beyond the operating system level.
The Performance console is a snap-in for Microsoft Management Console (MMC). The
Performance console is used to access the System Monitor and Performance Logs and Alerts
tools.
The Performance console can be opened by clicking Start -> Programs -> Administrative
Tools -> Performance or by typing PERFMON on the command line.
In Figure 9-15 on page 336 we see the System Monitor. The System Monitor can be used to
view real-time or logged data of objects and counters. Performance Logs and Alerts can be
used to log objects and counters, and create alerts.
Alerts can be configured to notify the user or write the condition to the system event log based
on thresholds.
There are three ways to view the real-time or logged data counters:
Chart
This view displays performance counters in response to real-time changes or processes
logged data to build a performance graph.
Histogram
This view displays bar graphics for performance counters in response to real-time
changes or logged performance data. It is useful for displaying peak values of the
counters.
Report
This view displays only numeric values of objects or counters. It can be used for displaying
real-time activity or displaying logged data results. It is useful for displaying many
counters.
Tuning these key objects will greatly improve the performance of disk I/O.
Physical disks: Average disk queue length
The number of requests for disk access. The general rule of thumb is that the total average
disk queue length should be less than or equal to three. It may be important to note the
actual number of spindles in a hardware RAID set and multiply the number of spindles by the
average disk queue length.
Logical disks: Current disk queue length
The current number of requests for access to the logical disk device.
Windows Server 2003 has both the logical and physical disk counters enabled by default.
In Windows 2000, physical disk counters are enabled by default. The logical disk performance
counters are disabled by default and may be required for some monitoring applications. If you
require the logical counters, you can enable them by typing the command DISKPERF -yv
then restarting the computer.
Keeping this setting on all the time draws about 2-3% CPU, but if your CPU is not a
bottleneck, this is irrelevant and can be ignored. Enter DISKPERF /? for more help on the
command.
Note: Physical drive counters should be used if the system is using hardware RAID, such
as the DS6000.
Performance console disk counters are available with either the LogicalDisk or PhysicalDisk
objects:
For non-DS6000 RAID disks, LogicalDisk monitors the operating system partitions of
physical drives. It is useful to determine which partition is causing the disk activity, possibly
indicating the application or service that is generating the requests. PhysicalDisk monitors
the individual hard disk drives, and is useful for monitoring disk drives as a whole.
For the DS6000 (all disks are RAID disks), LogicalDisk monitors the operating system
partitions (if any), while PhysicalDisk monitors the logical drives created from the DS6000
RAID arrays.
Tip: When attempting to analyze disk performance bottlenecks, you should always use
physical disk counters.
Physical Disk: Avg. Disk Queue Length
This is the average number of both read and write requests queued to the selected disk
during the sample interval.
If this value is consistently over 2-3 times the number of disks in the array
(for example, 8-12 for a 4-disk array), it indicates that the application is
waiting too long for disk I/O operations to complete. To confirm this
assumption, always check the Avg. Disk Second/Transfer counter.
Also, the Avg. Disk Queue Length counter is a key counter for determining if
a disk bottleneck can be alleviated by adding disks to the array. Remember,
adding disks to an array only results in increased throughput when the
application can issue enough multiple requests to the array to keep all disks
in the array busy. For optimal disk performance, we want the Avg. Disk
Queue Length to be no more than 2 or 3 times the number of physical disks
in the array.
Also, in most cases the application has no knowledge of how many disks are
in an array because this information is hidden from the application by the
disk array controller. So unless an application configuration parameter is
available to adjust the number of outstanding I/O commands, an application
will simply issue as many disk I/Os as it needs to accomplish its work, up to
the limit supported by the application and/or disk device driver.
Before adding disks to an array to improve performance, always check the
Avg. Disk Queue Length counter and only add enough disks to satisfy the
2-3 disk I/Os per physical disk rule. For example if the array shows an Avg.
Disk Queue Length of 30 then an array of at most 10-15 disks should be
used.
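The arithmetic behind this rule of thumb is simple enough to script. The following Python sketch is purely illustrative; the function name and sample value are ours, not part of any Microsoft or IBM tool:

def max_useful_disks(avg_disk_queue_length):
    # Rule of thumb: each physical disk can usefully absorb 2-3
    # outstanding I/Os, so dividing the queue length by 3 and by 2
    # brackets the array size the workload can actually keep busy.
    return (avg_disk_queue_length // 3, avg_disk_queue_length // 2)

# Example from the text: an Avg. Disk Queue Length of 30 justifies
# an array of at most 10-15 disks.
print(max_useful_disks(30))  # (10, 15)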
Physical Disk: Avg. Disk Bytes/Transfer
This is the average number of bytes transferred to or from the disk during write or read operations. This counter can be used as an indicator of the stripe size that should be used for optimal performance. For example, always create disk arrays with a stripe size that is at least as large as the average disk bytes per transfer counter value, as measured over an extended period of time.
Physical Disk: Avg. Disk sec/Transfer
The Avg. Disk sec/Transfer is a key counter that indicates the health of the disk subsystem. This is the time to complete a disk I/O operation. For optimal performance, this should be less than 20-25 ms for non-clustered systems, and no higher than 40-50 ms for clustered disk configurations. In general, this counter can grow very high when there are insufficient numbers of disks, slow disks, a poor physical disk layout, or severe disk fragmentation.
Memory: Pages/second
This is the number of pages read from the disk or written to the disk to resolve memory references to pages that were not in memory at the time of the reference. A high value indicates disk activity due to insufficient memory; add more RAM to your server.
The product of this counter and Physical Disk: Avg. Disk sec/Transfer is an approximation of the amount of disk time spent on paging file activity during the sampling period. If it exceeds 0.1 (10 percent), you may have excessive paging.
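Because this check is just the product of two counters, it is easy to script. A minimal Python sketch, using hypothetical sample values:

def paging_disk_fraction(pages_per_sec, avg_disk_sec_per_transfer):
    # Approximate fraction of disk time spent on paging file
    # activity during the sampling period.
    return pages_per_sec * avg_disk_sec_per_transfer

# Hypothetical sample: 120 pages/sec at 2 ms per transfer.
fraction = paging_disk_fraction(120, 0.002)
print(f"{fraction:.0%}")  # 24%
print("excessive paging" if fraction > 0.1 else "paging is acceptable")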
Note: Do not use the % Disk Time physical disk counter. This is the percentage of elapsed
time that the selected disk drive is busy servicing read or write requests. The counter is
only useful with IDE drives, which, unlike SCSI disks, can only perform one I/O operation at
a time. % Disk Time is derived by assuming the disk is 100 percent busy when it is
processing an I/O and 0 percent busy when it is not. The counter is a running average of
the 100 percent versus 0 percent count (binary).
The DS6000 can perform many hundreds or thousands of I/Os per second before it encounters bottlenecks. Most array controllers can perform two to three disk I/Os per drive
before a bottleneck occurs. For example, if an array controller with 60 drives has one disk
I/O to perform at all times it will be 100% utilized according to the % Disk Time counter.
However, that array could actually be issuing 120-180 I/Os before a true bottleneck occurs.
Figure 9-17 shows a sample chart setting for finding disk bottlenecks.
6. In the General tab, Sample data every: is used to set how frequently you capture the
data. If you capture many counters from a local or remote computer, you should use long
intervals; otherwise, you may run out of disk space or consume too much network
bandwidth.
7. In the Run As field, input the account with sufficient rights to collect the information about
the server to be monitored and then click Set Password to input the relevant password.
8. The Log Files tab, shown in Figure 9-19 on page 345, lets you set the type of the saved
file, the suffix that is appended to the file name and an optional comment. You can use two
types of suffix in a file name: numbers or dates. The log file types are listed in Table 13-3.
If you click Configure... then you can also set the location, file name, and file size for a log
file.
Text file - TSV: Tab-delimited log file (TSV extension). Use this format to export the data to a spreadsheet program.
9. The Schedule tab shown in Figure 9-20 on page 346 lets you specify when this log is
started and stopped. You can select the option box in the start log and stop log section to
manage this log manually using the Performance console shortcut menu. You can
configure to start a new log file or run a command when this log file closes.
This log settings file can then be opened with Internet Explorer. You can also use the pop-up menu to start, stop, and save the logs, as shown in Figure 9-21 on page 347.
3. At the System Monitor Properties dialog box, select the Data tab. You should now see any counters that you specified when setting up the Counter Log, as shown in Figure 9-23 on page 349. If you only selected counter objects, then the Counters section will be empty. To add counters from an object, simply click Add... and then select the appropriate ones.
Tip: Depending on how long the counter log file was running, there will be quite a lot of
data to observe. If you are interested in looking at a certain time frame when the log file
was recording data, complete these steps:
1. Click the Properties icon on the System Monitor toolbar.
2. The System Monitor Properties box will open; click the Source tab.
3. Select the time frame you want to view (see Figure 9-22) and click OK.
The Windows Resource Kit also contains INTFILTR, which is an interrupt binding tool that
allows you to bind device interrupts to specific processors on SMP servers. This is a useful
technique for maximizing performance, scaling, and partitioning of large servers. It can
provide a network performance increase of up to 20 percent.
Figure 9-24 shows that Task Manager has three views: Applications, Processes, and
Performance. The latter two are of interest to us in this discussion.
Processes tab
In this view (see Figure 9-24) you can see the resources being consumed by each of the processes currently running. You can click a column heading to sort the list by that column.
Click View -> Select Columns. This displays the window shown in Figure 9-25 on page 351,
from which you can select additional data to be displayed for each process.
Table 9-10 shows the columns available in the Windows Server 2003 operating system that
are related to disk I/O.
Paged Pool: The paged pool (user memory) usage of each process. The paged pool is virtual memory available to be paged to disk. It includes all of the user memory and a portion of the system memory.
Non-Paged Pool: The amount of memory reserved as system memory and not pageable for this process.
Base Priority: The process’s base priority level (low/normal/high). You can change the process’s base priority by right-clicking it and selecting Set Priority. This remains in effect until the process stops.
I/O Reads: The number of read input/output (file, network, and disk device) operations generated by the process.
I/O Read Bytes: The number of bytes read in input/output (file, network, and disk device) operations generated by the process.
I/O Writes: The number of write input/output (file, network, and disk device) operations generated by the process.
I/O Write Bytes: The number of bytes written in input/output (file, network, and device) operations generated by the process.
I/O Other: The number of input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).
I/O Other Bytes: The number of bytes transferred in input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).
The charts show you the CPU and memory usage of the system as a whole. The bar charts
on the left show the instantaneous values, and the line graphs on the right show the history
since Task Manager was started.
Many of the tools that used to be in the Windows 2000 support tools or Resource Kit have
now been included in the standard Windows Server 2003 build. For example, the typeperf
command that used to be part of the Windows 2000 resource kit is now included as standard
in Windows Server 2003. Table 13-6 lists a number of these tools and provides the
executable, where the tool is installed and a brief description.
Empty working set (empty.exe, Resource Kit): Frees the working set of a specified task or process.
9.9 Iometer
Iometer is an I/O subsystem measurement and characterization tool for single and clustered
systems. Formerly, Iometer was owned by Intel Corporation, but Intel has discontinued work
on Iometer and it was given to the Open Source Development Lab. For more information
about Iometer, go to:
http://www.iometer.org/
Iometer is both a workload generator (it performs I/O operations in order to stress the system) and a measurement tool (it examines and records the performance of its I/O operations and their impact on the system). It can be configured to emulate the disk or network I/O load of any program or benchmark, or it can be used to generate entirely synthetic I/O loads.
Iometer is the controlling program. Using Iometer’s graphical user interface, you configure the
workload, set operating parameters, and start and stop tests. Iometer tells Dynamo what to
do, collects the resulting data, and summarizes the results in output files. Only one copy of
Iometer should be running at a time. It is typically run on the server machine.
Dynamo is the workload generator. It has no user interface. At Iometer’s command, Dynamo
performs I/O operations and records performance information, then returns the data to
Iometer. There can be more than one copy of Dynamo running at a time. Typically one copy
runs on the server machine and one additional copy runs on each client machine. Dynamo is multithreaded; each copy can simulate the workload of multiple client programs. Each running copy of Dynamo is called a manager, and each thread within a copy of Dynamo is called a worker.
It also provides I/O load balancing. For each I/O request, SDD dynamically selects one of the
available paths to balance the load across all possible paths.
To receive the benefits of path balancing, ensure that the disk drive subsystem is configured so that there are multiple paths to each LUN. Doing this will not only enable performance benefits from the SDD path balancing, but also prevent loss of access to data in the event of a path failure.
The Subsystem Device Driver is discussed in further detail in 5.6, “Subsystem Device Driver
(SDD) - multipathing” on page 157.
In this chapter we describe these performance features and other enhancements that enable
performance improvements when migrating your workload to a DS6000. We also show some
monitoring tools and describe how to use them for the DS6000.
Specifically for the zSeries servers, the DS6000 is a disk subsystem with a very good price/performance ratio. It should be able to handle sequential workloads, such as data mining and work volumes, better than an ESS 800. Very large database applications may not work as well on the DS6000; in that case you will need to use a DS8000.
The DS6000 features that have performance implications in the application I/O activity are
described in the following sections:
Parallel Access Volumes
Multiple Allegiance
I/O Priority Queuing
Logical volume sizes
FICON
In the following sections of this chapter we describe these DS6000 features and discuss how
they can be used to boost the performance of your zSeries environment.
Traditionally, access to highly active volumes has involved manual tuning, splitting data across multiple volumes, and other techniques to avoid hot spots. With PAV and the z/OS
Workload Manager, you can now almost forget about manual device level performance tuning
or optimizers. The Workload Manager is able to automatically tune your PAV configuration
and adjust it to workload changes. The DS6000 in conjunction with z/OS has the ability to
meet the highest performance requirements.
PAV is implemented by defining alias addresses to the conventional base address. The alias
address provides the mechanism for z/OS to initiate parallel I/O to a volume. As its name
implies, an alias is just another address/UCB that can be used to access the volume defined
on the base address. An alias can only be associated with a base address defined in the
same LCU. The maximum number of addresses you can define in an LCU is 256.
Theoretically you can define 1 base address plus 255 aliases in an LCU.
With dynamic PAV, you do not need to assign as many aliases in an LCU as compared to a
static PAV environment, because the aliases will be moved around to the base addresses that
need an extra alias to satisfy an I/O request.
WLM manages PAVs across all the members of a Sysplex. When making decisions on alias
reassignment, WLM considers I/O from all systems in the Sysplex. By default, the function is
turned off, and must be explicitly activated for the Sysplex through an option in the WLM
service definition, and through a device level option in HCD. Dynamic alias management
requires your Sysplex to run in WLM Goal mode.
As a rule-of-thumb, the numbers in Table 10-1 can be used to determine how many aliases
you need for each volume in a dynamic or static PAV environment. When using large volumes
and these guidelines, you may be able to use less than 256 addresses per LCU.
Table 10-1 Rule-of-thumb for number of aliases for various 3390 sizes

Volume size (cylinders)    Dynamic PAV aliases    Static PAV aliases
1 - 3,339                  1/3                    1
6,679 - 10,017             1                      3
23,374 - 30,051            2                      6
50,086 - 60,102            3                      9
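As an illustration of applying this table when planning an LCU, the following Python sketch (our own helper; the cylinder ranges and alias counts are transcribed from Table 10-1) estimates how many of the 256 LCU addresses a set of volumes consumes:

import math

# (upper bound in cylinders, dynamic PAV aliases, static PAV aliases)
ALIAS_RULES = [
    (3_339, 1 / 3, 1),
    (10_017, 1, 3),
    (30_051, 2, 6),
    (60_102, 3, 9),
]

def aliases_per_volume(cylinders, dynamic=True):
    # Return the rule-of-thumb alias count for one volume; sizes that
    # fall between the table rows are rounded up to the next row.
    for upper, dyn, stat in ALIAS_RULES:
        if cylinders <= upper:
            return dyn if dynamic else stat
    raise ValueError("volume is larger than the table covers")

# 64 3390-9 volumes (10,017 cylinders) with dynamic PAV:
volumes = 64
addresses = volumes + math.ceil(volumes * aliases_per_volume(10_017))
print(addresses)  # 128 of the 256 addresses available in the LCU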
The DS6000 accepts multiple parallel I/O requests from different hosts to the same device
address, increasing parallelism and reducing channel overhead.
With Multiple Allegiance (MA), the requests are accepted by the DS6000 and all requests will
be processed in parallel, unless there is a conflict when writing data to the same extent of the
CKD logical volume. Still, good application access patterns can improve the global parallelism
by avoiding reserves, limiting the extent scope to a minimum, and setting an appropriate file
mask, for example, if no write is intended.
In systems without Multiple Allegiance, all except the first I/O request to a shared volume are
rejected, and the I/Os are queued in the zSeries channel subsystem, showing up as PEND
time in the RMF reports.
The DS6000's ability to run channel programs to the same device in parallel can dramatically
reduce the IOSQ and the PEND time components in shared environments.
First we will look at a disk subsystem that does not support both of these functions. If there is
an outstanding I/O operation to a volume, all subsequent I/Os will have to wait as illustrated in
Figure 10-1 on page 361. I/Os coming from the same LPAR will wait in the LPAR and this wait
time is recorded in IOSQ Time. I/Os coming from different LPARs will wait in the disk control
unit and be recorded in Device Busy Delay Time, which is part of PEND Time.
In the ESS and DS6000, all these I/Os will be executed concurrently using PAV and Multiple
Allegiance, as shown in Figure 10-2 on page 361. I/O from the same LPAR will be executed
concurrently using UCB 1FF that is an alias of base address 100. I/O from a different LPAR
will be accepted by the disk control unit and executed concurrently. All these I/O operations
will be satisfied from either the cache or one of the DDMs on a Rank where the volume
resides.
Figure 10-1 Without PAV or Multiple Allegiance: only one I/O to one volume at one time
Figure 10-2 With PAV and Multiple Allegiance: Appl.A and Appl.B on z/OS 1 access volume 100 concurrently through UCB 100 and its alias UCB 1FF, while Appl.C on z/OS 2 accesses the same volume through its own UCB 100
Note: The domain of an I/O covers the specified extents to which the I/O operation applies.
It is identified by the Define Extent command in the channel program. The domain covered
by the Define Extent used to be much larger than the domain covered by the I/O operation.
When concurrent I/Os to the same volume were not allowed, this was not an issue, since subsequent I/Os had to wait anyway.
With the availability of PAV and Multiple Allegiance, such a large domain could prevent multiple I/Os from being executed concurrently. This extent conflict can occur when multiple I/O operations try
to execute against the same domain on the volume. The solution is to update the channel
programs so that they minimize the domain that each channel program is covering. For a
random I/O operation the domain should be the one track where the data resides.
If a write operation is being executed, then any read or write to the same domain will have to
wait. The same case will happen if a read to a domain starts, then subsequent I/Os that want
to write to the same domain will have to wait until the read operation is done.
To summarize, all reads can be executed concurrently, even if they are going to the same
domain on the same volume. A write operation cannot be executed concurrently with any
other read or write operations that access the same domain on the same volume. The
purpose of serializing a write operation to the same domain is to maintain data integrity.
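These serialization rules can be captured in a few lines of Python. This is a sketch of the rules as just stated, not of the actual DS6000 microcode:

def can_execute_concurrently(op1, op2, same_domain):
    # Reads never conflict with reads; a write conflicts with any
    # other I/O that touches the same domain on the same volume.
    if not same_domain:
        return True
    return op1 == "read" and op2 == "read"

assert can_execute_concurrently("read", "read", same_domain=True)
assert not can_execute_concurrently("write", "read", same_domain=True)
assert can_execute_concurrently("write", "write", same_domain=False)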
Channel programs that cannot execute in parallel are processed in the order they are queued.
A fast system cannot monopolize access to a device also accessed from a slower system.
Each system gets a fair share.
The DS6000 can also queue I/Os from different z/OS system images in a priority order. z/OS
Workload Manager can make use of this and prioritize I/Os from one system against the
others. You can activate I/O Priority Queuing in WLM Goal mode with the I/O priority
management option in the WLM’s Service Definition settings.
When a channel program with a higher priority comes in and is put ahead of the queue of
channel programs with lower priorities, the priorities of the lower priority programs will be
increased. This prevents high priority channel programs from dominating lower priority ones
and gives each system a fair share.
When planning the configuration, you should also consider future growth. This means that you may want to define more alias addresses than currently needed, so that in the future you can add an additional Rank to this LCU, if needed.
Figure 10-5 shows the number of volumes that can be defined on a (6+P) RAID 5 Rank for
different 3390 models. It is obvious that if you define 3390-3 volumes on a 146 GB DDM
Rank, you cannot define all the 291 volumes on one LCU due to the 256 address limitation on
the LCU. In this case you will have to define multiple LCUs on that Rank. A better option
would be to use the bigger 3390 models, especially if you have multiple Ranks that you want
to define under one LCU.
Figure 10-5 Number of volumes that fit on a (6+P) RAID 5 rank for 73 GB, 146 GB, and 300 GB DDMs, by 3390 model (3390-3, 3390-9, 3390-27, and 3390-54)
Note: Even though the benchmarks were performed on an ESS F20, the comparative
results should be similar on the DS6000.
Random workload
The measurements for DB2 and IMS™ online transaction workloads showed that there was only a slight difference in device response time between a configuration of six 3390-27 volumes and a configuration of sixty 3390-3 volumes of equal capacity on the ESS F20 using FICON channels.
The measurements for DB2 are shown in Figure 10-6. It should be noted that even when the
device response time for a large volume configuration is higher, the online transaction
response time could sometimes be lower due to the reduced system overhead of managing
fewer volumes.
Figure 10-6 DB2 workload: device response time (msec) versus total I/O rate (I/Os per second) for the 3390-3 and 3390-27 configurations
The measurements were carried out so that all volumes were initially assigned with zero or
one alias. WLM dynamic alias management then assigned additional aliases as needed. The
number of aliases at the end of the test run reflects the number that was adequate to keep
IOSQ down. For this DB2 benchmark, the alias assignment done by WLM resulted in an
approximately 4:1 reduction in the total number of UCBs used.
Sequential workload
Figure 10-7 on page 366 shows elapsed time comparisons between nine 3390-3s versus one
3390-27 when a DFSMSdss™ full volume physical dump and full volume physical restore are
executed. The workloads were run on a 9672-XZ7 processor connected to an ESS F20 with
eight FICON channels. The volumes are dumped to or restored from a single 3590E tape with
an A60 Control Unit with one FICON channel. No PAV aliases were assigned to any volumes
for this test, even though an alias could have improved the performance.
Figure 10-7 Elapsed time for full volume dump and full volume restore: nine 3390-3s versus one 3390-27
Larger volumes
To avoid potential I/O bottlenecks when using large volumes you may also consider the
following recommendations:
Use of PAVs to reduce IOS queuing.
Parallel Access Volume (PAV) is of key importance when using large volumes. PAV
enables one z/OS system image to initiate multiple I/Os to a device concurrently. This
keeps IOSQ times down even with many active data sets on the same volume. PAV is a
practical must with large volumes. In particular, we recommend using dynamic PAVs.
Multiple Allegiance is a function that the DS6000 automatically provides.
Multiple Allegiance automatically allows multiple I/Os from different z/OS systems to be
executed concurrently. This will reduce the Device Busy Delay time, which is part of PEND
time.
Eliminate unnecessary reserves.
As the volume sizes grow larger, more data and data sets will reside on a single CKD
device address. Thus, the larger the volume, the greater the multi-system performance
impact will be when serializing volumes with RESERVE processing. You need to exploit a
Global Resource Serialization (GRS) Star Configuration and convert all RESERVEs
possible into system ENQ requests.
10.7 FICON
FICON provides several benefits as compared to ESCON, from the simplified system
connectivity to the greater throughput that can be achieved when using FICON to attach the
host to the DS6000.
FICON allows you to significantly reduce the batch window processing time. Response time
improvements may accrue particularly for data stored using larger block sizes. The data
transfer portion of response time is greatly reduced because of the much higher data rate
during transfer with FICON. This improvement leads to significant reductions in the connect
time component of the response time. The larger the transfer, the greater the reduction as a
percentage of the total I/O service time.
The pending time component of the response time, that is caused by director port busy, is
totally eliminated because collisions in the director are eliminated with the FICON
architecture. For users whose ESCON directors are experiencing as much as 45–50 percent
busy conditions, this will provide significant response time reduction.
Another performance advantage delivered by FICON is that the DS6000 accepts multiple
channel command words (CCWs) concurrently without waiting for completion of the previous
CCW. This allows setup and execution of multiple CCWs from a single channel to happen
concurrently. Contention among multiple I/Os accessing the same data is now handled in the
FICON host adapter, and queued according to the I/O priority indicated by the Workload
Manager.
Significant performance advantages can be realized by users accessing the data remotely.
FICON eliminates data rate droop effect for distances up to 100 km for both read and write
operations by using enhanced data buffering and pacing schemes. FICON thus extends the
DS6000’s ability to deliver high bandwidth potential to the logical volumes needing it, when
they need it.
For additional information about FICON, see 5.3, “FICON” on page 146.
The MIDAW facility is a modification to a channel programming technique that has existed
since S/360™ days. MIDAWs are a new method of gathering/scattering data into/from
non-contiguous storage locations during an I/O operation. There is no tuning needed to use
this MIDAW facility. The requirements to be able to take advantage of this MIDAW facility are:
z9 server.
Applications that use Media Manager.
Applications that use long chains of small blocks.
The biggest performance benefit comes with FICON Express2 channels running on 2 Gb
links.
Compared to ESCON channels, using FICON channels will improve performance. This
performance improvement is more significant for I/Os with bigger block sizes, because FICON
channels can transfer data much faster, which will reduce the connect time. The improvement
for I/Os with smaller block sizes is not as significant. In these cases where chains of small
records are processed, MIDAWs can significantly improve FICON Express2 performance if
the I/Os use Media Manager.
Figure 10-8 shows the hypothetical performance of long chains of short blocks (lcsb)
workload. Here we can see the effect of MIDAWs on lcsb workload. As the chart shows,
MIDAWs can double the throughput of lcsb as compared to when MIDAWs are not used.
Figure 10-8 Throughput of the lcsb workload versus channel utilization (%)
Figure 10-9 shows the maximum throughput of a FICON port on the DS6000 as compared to
the maximum throughput of FICON channels on the zSeries servers. Considering that the maximum throughput of a DS6000 FICON port is higher than that of a FICON Express channel, and not much lower than that of a FICON Express2 channel, in general we do not recommend daisy chaining several DS6000s to the same FICON channels on the zSeries host.
Note: Daisy chaining is connecting FICON ports from multiple DS6000s to the same
FICON channel on the zSeries server.
Figure 10-9 Maximum throughput (MB/sec): DS6000 FICON port, FICON Express, and FICON Express2
Figure 10-10 shows configuration A with no daisy chaining. In this configuration we see that
each DS6000 uses four FICON ports and each port is connected to a separate FICON
channel on the host. In this case, we have two sets of four FICON ports connected to eight
FICON channels on the zSeries host.
In configuration B, we double the number of FICON ports on both DS6000s and keep the same number of FICON channels on the zSeries server. We can now connect each FICON channel to two FICON ports, one on each DS6000. The advantages of configuration B are:
Workload from each DS6000 will now be spread across more FICON ports. This should
lower the load on the FICON ports and FICON Host Adapters.
Any imbalance in the load that is going to the two DS6000s will now be spread more
evenly across the eight FICON channels.
Figure 10-10 Configuration A (no daisy chaining) and configuration B. Assumption: each line from a FICON channel in the CEC and each line from a FICON port in the DS6000 represents a set of four paths
An LCU can have all of its volumes defined in one Extent Pool, or it can be defined to span multiple Extent Pools. A setup with one Rank per Extent Pool will make it easier to monitor performance, because the performance monitoring tools, like RMF, produce performance statistics by Extent Pool and by Rank. This way it is simpler to identify which Rank belongs to which LCU.
Sharing resources in a DS6000 has advantages from a storage administration and resource
sharing perspective, but does have some implications for workload planning. Resource
sharing has the benefit that a larger resource pool (for example, disk drives or cache) is
available for critical applications. However, some care should be taken to ensure that
uncontrolled or unpredictable applications do not interfere with mission-critical work.
If you have a workload that is truly mission-critical, you may want to consider isolating it from
other workloads, particularly if those other workloads are very unpredictable in their
demands. There are several ways to isolate the workloads:
Place the data on separate DS6000s. This is, of course, the best choice.
Place the data on separate DS6000 servers. This will isolate use of memory buses, microprocessors, and cache resources. However, before doing that, make sure that half a DS6000 provides sufficient performance to meet the needs of your important application. Note that Disk Magic provides a way to model the performance of half a DS6000 by specifying the Failover Mode. Consult your IBM representative for a Disk Magic analysis.
Place the data behind separate device adapters.
Place the data on separate Ranks. This will reduce contention for use of DDMs.
Note: z/OS and open systems data can only be placed on separate Extent Pools.
10.9.1 RMF
RMF provides performance information for the DS6000 and other disk subsystems for z/OS users. RMF Device Activity reports account for all activity to a base address and all of its associated alias addresses. Activity on alias addresses is not reported separately; it is accumulated into the base address. RMF also reports the number of PAV addresses in use for each device (the base plus its currently assigned aliases; see the description of the PAV field below).
RMF cache statistics are collected by volume and reported by volume and by LCU. To check the status of the whole cache, you have to check the cache reports of all the LCUs defined on the DS6000.
An Extent Pool, which is a new concept that comes with the DS6000, also has performance
statistics related to it.
TOTAL SAMPLES = 300 IODF = 99 CR-DATE: 07/15/2005 CR-TIME: 11.17.22 ACT: ACTIVATE
DEVICE AVG AVG AVG AVG AVG AVG AVG % % % AVG % %
STORAGE DEV DEVICE VOLUME PAV LCU ACTIVITY RESP IOSQ CMR DB PEND DISC CONN DEV DEV DEV NUMBER ANY MT
GROUP NUM TYPE SERIAL RATE TIME TIME DLY DLY TIME TIME TIME CONN UTIL RESV ALLOC ALLOC PEND
6900 33909 DS6900 6 013E 348.056 5.1 0.0 0.2 0.0 0.2 4.3 0.5 3.16 28.30 0.0 16.4 100.0 0.0
6901 33909 DS6901 2 013E 177.213 6.9 2.0 0.2 0.0 0.2 4.2 0.5 4.41 41.51 0.0 16.4 100.0 0.0
6902 33909 DS6902 2 013E 177.926 6.3 1.6 0.2 0.0 0.2 4.1 0.5 4.39 40.86 0.0 16.4 100.0 0.0
6903 33909 DS6903 2 013E 178.203 6.9 2.0 0.2 0.0 0.2 4.2 0.5 4.64 41.64 0.0 16.4 100.0 0.0
6904 33909 DS6904 1 013E 58.675 11.2 5.7 0.2 0.0 0.2 4.6 0.6 3.81 30.86 0.0 16.4 100.0 0.0
6905 33909 DS6905 1 013E 59.339 11.0 5.7 0.2 0.0 0.2 4.4 0.6 3.72 30.10 0.0 16.4 100.0 0.0
6906 33909 DS6906 1 013E 59.362 9.5 4.4 0.2 0.0 0.2 4.3 0.6 3.37 29.15 0.0 16.4 100.0 0.0
6907 33909 DS6907 1 013E 59.582 10.7 5.5 0.2 0.0 0.2 4.4 0.6 3.70 29.93 0.0 16.4 100.0 0.0
6908 3390 DS6908 1 013E 58.519 12.4 7.0 0.2 0.0 0.2 4.5 0.7 4.00 30.41 0.0 16.4 100.0 0.0
6909 3390 DS6909 1 013E 59.022 11.1 5.5 0.2 0.0 0.2 4.7 0.7 4.15 31.88 0.0 16.4 100.0 0.0
PAV
This is the base address plus the number of aliases assigned to that base address. An
asterisk (*) following the PAV number indicates that during this RMF interval, the number of
aliases assigned to that base address has changed, either increased or decreased.
Increases and decreases are done on demand. You might find a large number of aliases
assigned to a device with a zero I/O rate if no other volume on the LSS needs aliases. To
determine the average number of UCB / Aliases held during the measurement, multiply the
DEVICE ACTIVITY RATE by (PEND+DISC+CONN) and divide by 1000. In Example 10-1, SC2A00 did 2.033 operations per second, each holding the UCB or one of the 2 aliases assigned for 16.7 ms per operation. That is 33.95 milliseconds per second, or 0.03395 seconds per second, or an average of 0.03395 UCBs/aliases in use. % DEV UTIL shows that 0.03395 out of 3 PAV (about 1.1 percent) of the PAV is actually used. It is not likely there will be much IOSQ when operating at such low levels. However, there are Database Management Systems (DBMSs) that will cause queuing activity at very low levels of activity. For DB2 work files, hundreds of requests might be made to the same tablespace instantaneously.
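The UCB calculation above is easy to reproduce. In the following Python sketch, the rate and the 16.7 ms total are the figures quoted in the text; the split of that total across PEND, DISC, and CONN is a hypothetical example:

def avg_ucbs_in_use(io_rate, pend_ms, disc_ms, conn_ms):
    # DEVICE ACTIVITY RATE is in I/Os per second; PEND, DISC, and
    # CONN are in milliseconds, hence the division by 1000.
    return io_rate * (pend_ms + disc_ms + conn_ms) / 1000

# 2.033 I/Os per second, each holding a UCB/alias for 16.7 ms.
in_use = avg_ucbs_in_use(2.033, pend_ms=0.7, disc_ms=11.0, conn_ms=5.0)
print(round(in_use, 5))      # 0.03395 UCBs/aliases in use on average
print(f"{in_use / 3:.1%}")   # about 1.1% of a 3-address PAV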
PEND time
Pend time represents the time an I/O request waits in the hardware. This PEND time can be
increased by:
High channel utilization. More channels will be required.
I/O Processor (IOP) contention at the zSeries host. More IOPs may be needed. The IOP is the processor that is assigned to handle I/Os. If only certain IOPs are saturated, then redefining the channels used by the control units can help balance the load across the IOPs. For more information, see “Analyze I/O queuing activity” on page 374.
CMR Delay is part of PEND time. It is the initial selection time for the first command in a chain for a FICON channel. It can be elongated by contention downstream from the channel, such as a busy control unit.
Device Busy Delay is also part of PEND time. This is caused by a domain conflict,
because of a read or write operation against a domain that is in use for update. If there is
a high Device Busy Delay time it could have been caused by the domain of the I/O not
being limited to the track where the I/O operation is going to. If an ISV product is used,
asking the vendor for an updated version may help solve this problem.
DISC time
If the major cause of delay is the DISC time then you will need to do some further research to
find the cause. The most probable cause of high DISC time is having to wait while data is
being staged from the DS6000 Array into cache, because of a read miss operation. This time
can be elongated by:
Low read hit ratio. The lower the read hit ratio, the more read operations will have to wait
for the data to be staged from the DDMs to the cache.
High DDM utilization. This can be verified from the RMF Rank report. See “Analyze Rank
statistics” on page 376. Look at the Rank read response time. As a rule-of-thumb (RoT)
this number should be less than 35 msec. If it is higher than that, it is an indication that this
Rank is too busy. If this happens, consider spreading the busy volumes to other Ranks
that are not as busy.
Persistent memory (NVS) full condition can also elongate the DISC time, see “Analyze
cache statistics” on page 375.
CONN time
For each I/O operation, the channel subsystem measures the time the DS6000, channel and
CEC were connected. At high levels of utilization significant time can be spent in contention,
rather than transferring data.
The BUS utilization is always greater than 5%, even if there is no I/O activity at all on the
channel. For small block transfers, the BUS utilization is less than the FICON channel
utilization, and for large block transfers, the BUS utilization is greater than the FICON channel
utilization.
IODF = 99 CR-DATE: 07/15/2005 CR-TIME: 11.17.22 ACT: ACTIVATE MODE: LPAR CPMF: EXTENDED MODE
-------------------------------------------------------------------------------------------------------------------------
DETAILS FOR ALL CHANNELS
-------------------------------------------------------------------------------------------------------------------------
CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC) CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC)
ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL
2E FC_S 2 Y 1.12 1.75 4.92 0.04 0.17 1.75 1.81 36 FC_? 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00
2F FC_S 2 Y 1.10 1.73 4.90 0.04 0.17 1.71 1.77 37 FC_? 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00
30 FC_S 2 Y 0.00 0.14 3.96 0.00 0.00 0.00 0.00 38 FC_S 2 Y 0.00 4.11 4.35 0.00 0.00 0.00 0.00
39 FC_S 2 Y 0.00 3.63 4.29 0.00 0.00 0.00 0.00 43 FC_S 2 Y 11.17 11.17 7.24 2.43 2.43 2.36 2.36
3A FC_S 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00 4A FC_S 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00
3B FC_S 2 Y 0.00 0.14 3.96 0.00 0.00 0.00 0.00 4B FC_S 2 Y 0.00 0.14 3.97 0.00 0.00 0.00 0.00
3C FC_S 2 Y 0.00 1.30 4.07 0.00 0.00 0.00 0.00 4C FC_S 2 Y 4.21 4.21 5.46 0.80 0.80 1.77 1.77
3D FC_S 2 Y 0.00 0.96 4.06 0.00 0.00 0.00 0.00 4D FC_S 2 Y 11.15 11.15 7.26 2.42 2.42 2.43 2.43
Open exchanges = I/O rate x (CMR + CONN + DISC) / (N x 1000)

where N = the number of channels in the path group. It is multiplied by 1000 because the I/O rate unit is in seconds and the CMR, CONN, and DISC times are in milliseconds.
Because RMF reports total I/O rate by LCU, this formula needs to be calculated for all LCUs that share the same path group, and also for all LPARs that are on the same CEC.
The number of Open Exchanges is limited to 32 on all CECs, but on the z9 processor this limit
is increased to 64.
FICON port concurrency = I/O rate x CONN / (N x 1000)

where N = the number of FICON ports used by the same path group. It is multiplied by 1000 because the I/O rate unit is in seconds and the CONN time unit is in milliseconds.
Because RMF reports total I/O rate by LCU, this formula needs to be calculated for all LCUs that share the same path group, and also for all LPARs on all CECs.
The RoT for this number is to keep it under two, and if it exceeds four, it means that the
FICON port is very overloaded. High FICON concurrency will increase the CONN time.
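Assuming the two formulas above, a short Python sketch (with hypothetical LCU totals) computes both metrics:

def open_exchanges(io_rate, cmr_ms, conn_ms, disc_ms, n_channels):
    # Average open exchanges per channel in the path group.
    return io_rate * (cmr_ms + conn_ms + disc_ms) / (n_channels * 1000)

def ficon_port_concurrency(io_rate, conn_ms, n_ports):
    # RoT: keep under 2; above 4, the FICON port is very overloaded.
    return io_rate * conn_ms / (n_ports * 1000)

# Hypothetical path group: 3500 I/Os per second over 4 channels/ports.
print(open_exchanges(3500, cmr_ms=0.2, conn_ms=1.5, disc_ms=4.0,
                     n_channels=4))                          # ~5.0
print(ficon_port_concurrency(3500, conn_ms=1.5, n_ports=4))  # ~1.3, OK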
A high DFW BYPASS count is an indication that persistent memory is overcommitted. DFW BYPASS actually means DASD Fast Write I/Os that are retried because persistent memory is full. Calculate the quotient of DFW BYPASS divided by the total I/O rate. As a RoT, if this number is higher than 1 percent, the write retry operations will have a significant impact on the DISC time.
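A sketch of that check, with hypothetical rates:

def dfw_bypass_ratio(dfw_bypass_rate, total_io_rate):
    # Fraction of I/Os retried because persistent memory (NVS) is
    # full; above 1 percent, write retries significantly elongate
    # the DISC time.
    return dfw_bypass_rate / total_io_rate

# Hypothetical rates: 25 DFW bypasses/sec out of 1800 I/Os/sec.
ratio = dfw_bypass_ratio(25, 1800)
print(f"{ratio:.2%}")  # 1.39% - write retries are hurting DISC time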
Check the Disk Activity part of the report; the read response time should be less than 35 msec. If it is higher than 35 msec, it is an indication that the DDMs on the Rank where this LCU resides are saturated.
Following the above report is the report by volume serial number, as shown in Example 10-5.
Here you can see to which Extent Pool each volume belongs. In the case where we have the following setup:
One Extent Pool has one Rank.
All volumes on an LCU belong to the same Extent Pool.
then it would be easier to do the analysis if a performance problem happens on the LCU. If we
look at the Rank statistics, see Example 10-6 on page 377, we know that all the I/O activity on
that Rank is coming from the same LCU. So we can concentrate the analysis on the volumes
on that LCU only.
Note: Depending on the DDM size used and the 3390 model selected, you can put
multiple LCUs on one Rank or you may also have an LCU that spans more than one Rank.
Do not worry if write response times are high. When NVS usage reaches a high water mark, data is written furiously until NVS usage is reduced to a low water mark. Multiple requests are queued for the same HDD, meaning response time could be more than double counted. Remember, the application that wrote the data was given device end long ago. After this flurry of activity there is a relatively long period of time doing nothing. High write response times are usually not an indication of a performance problem.
If an LCU uses multiple Extent Pools, then we can still see the backend performance of each
individual Extent Pool/Rank.
If there is a problem in a DS6000 which has multiple LCUs defined on an Extent Pool, then it
will be harder to determine which LCU is causing the problem. The LCU with the highest
response time may just be a victim and not the perpetrator of the problem. The perpetrator is
usually the LCU that is flooding the Rank with I/Os.
Identifying the cause of a problem will become more complicated if you define multiple Ranks
in an Extent Pool. This is because in the cache report, the volume is associated with an
Extent Pool, and not a Rank.
RMF Magic provides consolidated performance reporting about your z/OS Disk Subsystems,
from the point of view of those disk subsystems, rather than from the host perspective, even
when disk subsystems are shared between multiple Sysplexes. This disk-centric approach
makes it much easier to analyze the I/O configuration and performance: RMF Magic
automatically determines the I/O configuration from your RMF data, showing the relationship
between the disk subsystem serial numbers, SSID, LCUs, device numbers and device types.
With RMF Magic there is no need to compare printed RMF reports.
While RMF Magic reports are based on information from RMF records, the analysis and
reporting goes beyond what RMF provides: in particular it computes accurate estimates for
the read and write bandwidth (MB/s) for each disk subsystem and down to the device level.
With this unique capability RMF Magic can size the links in a future remote copy configuration
as it knows the bandwidth required for the links, both in I/O requests and in megabytes per
second, for each point in time.
RMF Magic consolidates the information from RMF records with channel, disk, LCU and
cache information into one view per disk subsystem, per SSID (LCU), and per storage group.
RMF Magic gives insight into your performance and workload data for each individual RMF
interval within a period selected for the analysis that can span weeks. Where RMF
postprocessor reports are sorted by host and LCU, RMF Magic reports are sorted by disk
subsystem and SSID (LSS). With this information you can plan migrations and consolidations
more effectively, because RMF Magic provides a detailed insight in the workload, both from a
disk subsystem and a storage group perspective.
RMF Magic’s graphical capabilities let you quickly find any hot spots and tuning
opportunities in your disk configuration. Based on user-defined criteria, RMF Magic will
automatically identify peaks within the analysis period. On top of that, the graphical reports
will make all peaks and anomalies stand out immediately, which allows you to explain peak
behavior correctly.
RMF Magic can be used not only to analyze subsystem performance for z/OS hosts: if your DS6000 is also providing storage for open systems hosts, the tool will also report on RAID rank statistics and host port link statistics. The DS6000 storage subsystem provides these open systems statistics to RMF whenever performance data is reported to RMF, and they are then available for reporting through RMF Magic. Of course, if you have a DS6000 that has only open systems activity and does not include z/OS 3390 volumes, then this data cannot be collected by RMF and will not be reported on by RMF Magic.
RMF Magic consists of two components that offer three main functions:
1. A Graphical User Interface (GUI) for the Windows platform that provides:
– A Run Control Center that provides an easy interface allowing the user to prepare, initiate, and supervise the execution of the batch component when executed on the Windows platform. The Run Control Center is also used to load the data into a Reporting database.
– A Reporter Control Center where the user can interactively analyze the data in the
Reporting database by requesting the creation of Microsoft Excel tables and charts.
2. A batch component that validates, reduces, extracts, completes and summarizes the
input. All of this is done in two steps: Reduce and Analyze, as described in the RMF Magic
Reduce and RMF Magic Analyze steps below. The batch component can be executed on
either the z/OS or Windows platform.
When preparing the data for processing by RMF Magic for Windows, it is important to be sure that the data is sorted by time stamp; the RMF Magic package comes with a sample set of JCL that can be used for sorting the data. You will want to gather the RMF data for all systems that have access to the disk subsystems that you are going to be studying.
We recommend that the input to the SORT step contains only the RMF records that are recorded by the SMF subsystem. This subset of SMF data can be obtained by executing the SMF dump program (IFASMFDP).
If you choose to do the reduce on the workstation, RMF Magic for Windows provides a utility
which should be used to package the RMF data for transfer to the workstation. This utility will
not only compress the data for more efficient data transfer, but it will also package the data so
that the Variable Blocked Spanned (VBS) records that are used for collecting the data at the
host do not result in data transfer problems.
The JCL that is used for either option is provided with the RMF Magic for Windows product.
The specific data that is presented in Figure 10-11 is not really relevant, but this set of charts
shows the power of being able to see a graphical representation of your storage subsystem
over time. These particular summary charts show I/O activity to a particular storage
subsystem in terms of data transfer (megabytes per second), I/O rates and response times,
and a view of the number of concurrent I/Os over time. This is a sampling of the standard
summary charts that are automatically created by the RMF Magic reporting tool. Additional
standard reports include backend RAID activity rates, read hit percentages, and a variety of
breakdowns of I/O response time components.
In addition to the graphical view of your performance data, RMF Magic also provides detailed spreadsheet views of important measurement data, with highlighting of spreadsheet cells for more visual access to the data. Figure 10-12 on page 382 is an example of the Performance Summary for a single disk subsystem (called a DSS in the RMF Magic tool).
For each of the data measurement points (for example, Column C shows I/O Rate while Column D shows Response Time), rows 10 through 15 show a summary of the information. Rows 10 through 12 show the maximum rate that was measured, the RMF interval in which that rate occurred, and which row of the spreadsheet shows that maximum number. Easy navigation to these maximum values is also provided.
For example, selecting cell D10 and then clicking Goto Max would move the spreadsheet
data view so that you would be looking at row 122 in about the middle of the screen to see the
average response times in the intervals surrounding the highest interval. This view would also
immediately provide you with a view of the I/O rate and data rate during those intervals of
higher response times.
Figure 10-12 also shows color-coded highlighting for cells that show the top intervals of
measurement data. For example, if you are viewing this text in color, you will see that column
F has cells that are highlighted in pink. The pink cells represent all of the measurement
intervals that have values higher than the 95th percentile. Once again, this is a feature of the
RMF Magic tool which provides visual access to potential performance hot spots represented
by your measurement data.
Figure 10-12 I/O and data rate summary for a single subsystem
Figure 10-13 on page 383 shows a spreadsheet similar in appearance to Figure 10-12 but in
this case shows the summary of the cache measurement data. Again, the tool will highlight
those cells which have the highest measurement intervals for ease of navigation within the
data.
Figure 10-14 on page 384 and Figure 10-15 on page 385 show additional views of how data can be analyzed within a single subsystem. Using the views shown in Figure 10-14 on page 384, it is possible to quickly look into the busiest logical control units within the subsystem. By using data as represented by Figure 10-15 on page 385, it is possible to see the different components of response time, over time, in order to identify specific intervals which may need closer analysis.
In summary, the RMF Magic for Windows tool is used to get a view of I/O activity from the disk subsystem point of view, versus the operating system point of view shown by the standard RMF reporting tools. This approach allows you to analyze the effect each of the operating system images has on the various disk subsystem resources. This subsystem view is presented in an easy-to-use graphical interface, with tools programmed into the interface to ease the analysis of data.
This chapter refers to the DS6000 on iSeries OS/400 and i5/OS operating systems only, not
AIX or Linux.
iSeries software versions V5R2 and earlier are OS/400 versions. Beginning with V5R3, the
software is i5/OS. However, in this chapter, when statements refer to the iSeries software in
general, we simply say OS/400. Note that the DS6000 is supported only on V5R2 and
subsequent releases.
One thing that distinguishes the iSeries is the LUN sizes it will be using. The LUNs will report
into the iSeries as the different models of the 1750 device type. The models will depend on
the size of LUNs that have been configured.
The other distinguishing characteristic of the iSeries sits in a layer on top of the architectural characteristics described so far: the single level storage concept that the iSeries uses. This is a powerful characteristic that makes the iSeries a unique server.
Storage management and caching of data into main storage is completely automated based
on Expert Cache algorithms. Storage management automatically spreads the data across the
disk arms or disk drives (or across LUNs for DS6000 disks) and continues to add records to
files until specified threshold levels are reached.
Single level storage is efficient. Regardless of how many application programs need to use an
object, only one copy of it is required to exist. This makes the entire main storage of an
iSeries server a fast cache for disk storage.
By caching in main storage, the system eliminates access to the storage devices and reduces
associated I/O traffic. Expert Cache works by minimizing the effect of synchronous disk I/O on
a job. The best candidates for performance improvement by using Expert Cache are jobs that
are most affected by synchronous disk I/Os.
This is very different from the way auxiliary storage (disk) was regarded prior to OS/400
V5R1. Until then, all iSeries disks were considered to be owned and usable only by a single
system. Enhancements made in this and later releases make using independent disk pools
an attractive option for many customers who are looking for higher levels of availability and
server consolidation. We use the terms independent disk pool and independent auxiliary
storage pool interchangeably.
There are three types of auxiliary storage pools (ASP):
System auxiliary storage pool (ASP 1): This storage pool contains OS/400 and licensed
program products, plus any user objects.
Basic user auxiliary storage pool (ASP 2-32): Prior to OS/400 V5R2, ASPs 2-32 were
known as user storage pools. Their function has not changed, but they are now referred to
as basic user ASPs. They allow the disk storage attached to a single iSeries server to be
grouped into separate pools. However, these pools have a close relationship to the system
ASP.
Independent auxiliary storage pool (ASP 33-255): This disk pool type contains objects,
directories, or libraries that contain the objects, and other object attributes such as
authorization and ownership attributes. An independent disk pool can be made available
(varied on) and made unavailable (varied off) to the server without restarting the system.
When an independent auxiliary storage pool is associated with a switchable hardware group,
it becomes a switchable auxiliary storage pool and can be switched between one iSeries
server and another iSeries server in a clustered environment. Note that with the required
hardware, internal iSeries disks can be switchable. External storage servers are not required
in order to switch storage from one iSeries server to another. Achieving the full benefits of an
external storage server in an iSeries environment requires the storage server be set up as its
own independent ASP.
When the iSeries disks are external, as when using a DS6000, the disk devices are mapped into the Logical Unit Numbers (LUNs) that are carved from the DS6000 Ranks. In the DS6000, LUNs are striped across a Rank, with the Ranks being either RAID 5 or RAID 10 protected.
The DS6000 can accommodate all iSeries disks, including the load source unit. Load source units on external storage devices are supported only on eServer i5 models and on i5/OS V5R3 or later.
Since iSeries servers already make use of cache in main storage, iSeries workloads
generally do not benefit from large cache in the DS6000 as much as other server platforms
do. Also, the large iSeries cache means that its sequential reads from DS6000 do not follow
the pattern of sequential reads typical of other server platforms, so the DS6000 Sequential
Adaptive Replacement Cache (SARC) algorithms may not provide the benefit in an iSeries
environment that they do on other server platforms.
Although iSeries performs well with internal storage, it also performs well with external
storage, providing clients the flexibility of separating server management from storage
management. In addition, external storage seamlessly becomes part of the iSeries single
level storage environment.
In iSeries terminology, a device adapter is referred to as an I/O adapter (IOA). Each adapter requires its own dedicated I/O processor (IOP) within the iSeries server. The adapters are auto-sensing and run at either 1 Gbps or 2 Gbps. Each IOA can support up to 32 LUNs.
Table 11-1 shows a comparison of the performance of the 2766 and 2787 IOAs. These
numbers are in a controlled test and should not be expected in a typical client environment.
However, they do show the relative differences in the 2766 and 2787 IOAs. Specifically, the
2787 has 38 percent greater throughput than the 2766. Response time as measured by
iSeries Collection Services is 31 percent faster with the 2787 as compared to the 2766.
With multi-target support with OS/400 V5R2, multiple DS6000s can be supported from a
single Fibre Channel adapter initiator, but the total number of addressable LUNs remains 32.
This means, for example, that a single iSeries Fibre Channel disk adapter can have 16 LUNs
on each of two DS6000s.
The RAID type can noticeably affect performance. For workloads with a high number of
random writes, RAID 10 generally supports a higher I/O rate than RAID 5. It does this since
for a random write, RAID 10 requires half as many disk operations as does RAID 5. However,
to provide the same amount of usable capacity, a RAID 10 Array requires significantly more
physical DDMs than does RAID 5.
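As a sketch of that difference, using the standard write penalties (four back-end disk operations per random write on RAID 5, two on RAID 10) and a hypothetical workload:

def backend_ops(read_rate, write_rate, raid):
    # A random write costs 4 disk operations on RAID 5 (read data,
    # read parity, write data, write parity) and 2 on RAID 10 (write
    # both mirror copies); a read miss costs 1.
    penalty = {"RAID5": 4, "RAID10": 2}[raid]
    return read_rate + write_rate * penalty

# Hypothetical cache-miss workload: 600 reads/sec, 400 writes/sec.
print(backend_ops(600, 400, "RAID5"))   # 2200 back-end ops/sec
print(backend_ops(600, 400, "RAID10"))  # 1400 back-end ops/sec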
The effects on performance of different disk drive modules and RAID options can be modeled
using the Disk Magic tool. For additional information about this tool, see 4.1, “Disk Magic” on
page 86.
Unlike other open systems using Fixed Block architecture, OS/400 only supports specific
volume sizes and these may not be an exact number of extents. In this case, some portion of
the last extent used in building a LUN may be unused space. Table 11-2 shows the iSeries device types that are emulated by the DS6000, the number of DS6000 Extents required for each device type, and the space efficiency of each device size (the percentage of available Extent space that is used by the iSeries LUN).
Note: As shown in Table 11-2, OS/400 volumes are defined in decimal gigabytes (GB: 10^9 bytes), while DS6000 Extents are defined in binary gigabytes (GiB: 2^30 bytes).
However, a way to reduce the impact of this limitation is to reduce the internal disk size or
storage server LUN size. For example, take a scenario where you reduce the LUN size by 50
percent, double the number of LUNs to maintain the same total capacity, and also keep the
same access density (number of I/O operations per second, per gigabyte of capacity). This
reduces the I/O rate to each LUN by 50 percent. In an environment with high I/O rates, spreading the I/Os across more, smaller disk devices (LUNs in the DS6000) can show a significant improvement in performance.
A good balance between small LUNs, which reduce OS/400 I/O request queues, and large
LUNs, which maximize the amount of data which can be accessed per IOA, is to use a LUN
size that will provide at least two LUNs per capacity of an individual DDM. For example, with
73 GB DDMs, you should use a maximum LUN size of 35.1 GB. With 146 GB DDMs, you
should use a maximum LUN size of 70.5 GB. And with 300 GB DDMs, you should use a
maximum LUN size of 141.1 GB.
Note: A DS6000 Array can contain iSeries LUNs of different sizes. For example, if you are
creating 141.1 GB LUNs, and have residual capacity in the Extent Pool of 100 Extents, you
can create three 35.1 GB LUNs, which require 33 Extents each.
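The GB-to-Extent arithmetic behind these numbers can be sketched in Python; the 35.1 GB value is the one quoted in the Note above:

import math

GB = 10**9    # OS/400 volume sizes are decimal gigabytes
GIB = 2**30   # DS6000 Extents are binary gigabytes

def extents_for(os400_gb):
    # Extents consumed by one OS/400 LUN, plus space efficiency
    # (the fraction of the allocated Extents actually used).
    exact = os400_gb * GB / GIB
    extents = math.ceil(exact)
    return extents, exact / extents

extents, efficiency = extents_for(35.1)
print(extents, f"{efficiency:.1%}")  # 33 Extents, 99.1% used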
For example, if you have an Array of 836 GB (the effective capacity of a RAID 5 6+P Array of
146 GB disks), from which you build an Extent Pool, and you have one specific workload
which requires that amount of capacity, then you would dedicate that Array to that workload. If
no one workload requires that capacity, but multiple workloads on a single server do, then you
would dedicate that Array to the workloads on that one server.
A more proactive solution would be to plan your DS6000 Array capacities with your workloads
in mind. For example, if you have a workload that requires 416 GB, a RAID 5 6+P Array of 73
GB disks matches that capacity, and you could dedicate that entire Array to that workload.
Even when workload capacities are so large that they require several Arrays, you should still dedicate Arrays to workloads. For example, if your workload requires 3345 GB, four RAID 5 6+P Arrays of 146 GB disks (836 GB each) would provide that capacity and could be dedicated to that workload.
We also recommend dedicating host adapter ports on the DS6000 to a specific iSeries server. That is, a given DS6000 host adapter port (or multiple ports which are part of a multipath group) is dedicated to the LUNs for a specific server. This is to ensure that another server does not dominate use of the host adapter port and cause degraded I/O response times to the iSeries server.
However, before deleting the logical volume on the DS6000, you must first remove it from the
OS/400 configuration (assuming it was still configured). This is an OS/400 task which is
disruptive if the disk is in the system ASP or user ASPs 2-32, because it requires an IPL of
OS/400 to completely remove the volume from the OS/400 configuration. This is no different
than removing an internal disk from an OS/400 configuration. Indeed, deleting a logical
volume on the DS6000 is similar to physically removing a disk drive from an iSeries. Disks
can be removed from an independent ASP with the ASP varied off without IPLing the system.
11.3 Multipath
Multipath support was added for external disks in V5R3 of i5/OS. Multiple connections
provide availability by allowing disk storage to be utilized even if a single path fails. Unlike
other server platforms which have an add-on software component for multipath, such as
Subsystem Device Driver (SDD), multipath is part of the base operating system. Up to eight
connections can be defined from multiple I/O adapters on an iSeries server to a single logical
volume in the DS6000. Each connection for a multipath disk unit functions independently and
can provide access to a logical volume if other connections to that volume fail.
Multipath is important for iSeries because it provides greater resilience to SAN failures, which
can be critical to OS/400 due to the single level storage architecture. Multipath is not available
for iSeries internal disk units, but the likelihood of path failure is much lower with internal
drives because there are fewer components, and therefore fewer points of failure.
If you are using two adapters to provide a multipath connection to a group of 32 LUNs in a
DS6000, this will use up the 32 LUNs available on each adapter.
Both of the iSeries adapters which support the DS6000 can be used for multipath. There is no
requirement for all paths to use the same type of adapter. Both adapters can address up to 32
LUNs.
The system enforces the following rules when you use multipath disk units in a
multiple-system environment:
If you move an IOP with a multipath connection to a different logical partition, you must
also move all other IOPs with connections to the same disk unit to the same logical
partition.
When you make an expansion unit switchable, make sure that all multipath connections to
a disk unit will switch with the expansion unit.
When you configure a switchable independent disk pool, make sure that all of the required
IOPs for multipath disk units will switch with the independent disk pool.
If a multipath configuration rule is violated, the system issues warnings or errors to alert you
of the condition.
The iSeries performance tools provide data for looking at the performance of the server
environment as a whole, since performance problems can come from many sources. Many of
the tools are interrelated, so before looking at each one in detail, we will provide an overview
so you can better see how they fit together:
Collection Services
Collection Services is a no-charge part of i5/OS that collects a set of iSeries metrics or
categories over a start/stop time period. Collection Services allows you to gather
performance data with minimal system resource consumption. Collection Services collects
sample data, which is summary data that is captured at regular time intervals. Collection
Services data is the foundation for basic performance analysis of your system.
Performance Explorer
Performance Explorer is a data collection tool that helps you identify the causes of
performance problems that cannot be identified by collecting data using Collection Services
or by doing general trend analysis. The collection functions and related commands of
Performance Explorer are part of i5/OS. The reporting function and its associated commands
are part of the base option in the Performance Tools licensed program.
Collection Services allows you to gather performance data with little or no observable impact
on system performance. You can use iSeries Navigator to configure Collection Services to
collect the data you want as frequently as you want to gather it. Once you have configured
and started Collection Services, performance data is continuously collected. When you need
to work with performance data, you can copy the data you need into a set of performance
database files.
The system monitors display the data stored in the collection objects that are generated and
maintained by Collection Services. You can use monitors to track and research many different
elements of system performance and can have many different monitors running
simultaneously. When used together, the monitors provide a sophisticated tool for observing
and managing system performance.
Performance Tools includes reports, interactive commands, and other functions, including the
following:
The Work with System Activity command allows you to work interactively with the jobs,
threads and tasks currently running in the system. The command reports system resource
utilization, including CPU utilization on a per-task basis for partitions that use a shared
processing pool.
The Display Performance Data graphical user interface allows you to view performance
data, summarize the data into reports, display graphs to show trends, and analyze the
details of your system performance all from within iSeries Navigator.
The Performance Tools reports organize Collection Services performance data in a logical
and useful format.
The Performance Tools graphics function allows you to work with performance data in a
graphical format. You can display the graphs interactively, or you can print, plot, or save
the data to a graphics data format file for use by other utilities.
Performance Explorer is a data collection tool that helps you identify the causes of
performance problems that cannot be identified by sample data that was collected by
Collection Services or by doing general trend analysis. Use Performance Explorer for
detailed application analysis at a program, procedure, module, or method level. You can
collect trace data on CPU and I/O activity for an individual program. Performance Explorer
is described in more detail in 11.4.7, “Performance Explorer” on page 400.
Figure 11-1 on page 400 shows a sample Performance Tools Disk Utilization report.
Performance Explorer and Collection Services are separate collecting agents. Each one
produces its own set of database files that contain grouped sets of collected data. You can
run both data collections at the same time.
Note: Performance Explorer is the tool you need to use after you have tried the other tools.
It gathers specific forms of data that can more easily isolate the factors involved in a
performance problem; however, when you collect this data, you can significantly affect the
performance of your system.
Like Collection Services, Performance Explorer collects data for later analysis. However, they
collect very different types of data. Collection Services collects a broad range of system data
at regularly scheduled intervals, with minimal system resource consumption. In contrast,
Performance Explorer starts a session that collects trace-level data. This trace generates a
large amount of detailed information about the resources consumed by an application, job, or
thread.
You can use Performance Explorer to answer specific questions about areas like
system-generated disk I/O, procedure calls, Java method calls, page faults, and other trace
events. It is the ability to collect very specific and very detailed information that makes the
Performance Explorer effective in helping isolate performance problems. For example, the
trace definition in Example 11-1 captures all disk events.
Example 11-1 Performance Explorer trace definition to show all disk events
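A definition of this kind is created with the ADDPEXDFN command and run under STRPEX and ENDPEX. The following is a representative sketch only; the definition name DSKEVTS, session ID DSKTRC, and output library MYLIB are hypothetical, not taken from this book:

ADDPEXDFN DFN(DSKEVTS) TYPE(*TRACE) JOB(*ALL) TASK(*ALL) +
          MAXSTG(100000) TRCTYPE(*SLTEVT) SLTEVT(*YES) DSKEVT((*ALL))
STRPEX    SSNID(DSKTRC) DFN(DSKEVTS)   /* start the trace session        */
ENDPEX    SSNID(DSKTRC) DTALIB(MYLIB)  /* end the session, save the data */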
iDoctor for iSeries is a suite of tools and services consisting of these components:
Consulting Services
Job Watcher
Heap Analysis Tools for Java
PEX Analyzer
Performance Trace Data Visualizer
Consulting Services are provided on a fee basis. Job Watcher and PEX Analyzer offer a
45-day free trial with the option to purchase the product after the trial period. Heap Analysis
Tools and PTDV are offered as a free service on an as-is basis.
Consulting Services
Consulting Services provide basic instruction on installation and use of the software tools
within the iDoctor product suite to collect necessary data from your system. It also includes an
analysis of data collected by the iDoctor tools and a final report that includes a detailed
description of IBM’s findings.
Job Watcher
Job Watcher displays real-time tables and graphical data that represent, in a very detailed
way, what a job is doing and why it is not running. Job Watcher provides several different
reports that provide detailed job statistics by interval. These statistics include items such as
CPU utilization, DASD counters, waits, faults, call stack information, and conflict information.
PEX Analyzer
PEX Analyzer evaluates the overall performance of your system and builds on what you have
done with the Performance Tools licensed program. The PEX Analyzer condenses volumes of
trace data into reports that can be graphed or viewed to help isolate performance issues and
reduce overall problem determination time. The PEX Analyzer provides an easy-to-use
graphical interface for analyzing CPU utilization, physical disk operations, logical disk
input/output, data areas, and data queues. The PEX Analyzer can also help you isolate the
cause of application slowdowns.
The Workload Estimator and PM iSeries have been enhanced to work with one another.
Through a Web-based application, you can size the upgrade to the required iSeries system
that accommodates your existing system’s utilization, performance, and growth as reported
by PM iSeries. As an additional option, sizings can also include capacity for adding specific
applications like Domino®, Java, and WebSphere, or the consolidation of multiple AS/400 or
iSeries traditional OS/400 workloads on one system. This capability allows you to plan for
future system requirements based on utilization data coming from your existing system.
When modeling a disk subsystem for iSeries, Disk Magic takes into consideration the unique
single level storage structure that is at the heart of the iSeries architecture, the Expert Cache
algorithms designed into OS/400 to manage the single level storage, and the implications this
has for modeling cache behavior of externally-attached disk subsystems.
Disk Magic allows you to create a model of your existing iSeries internal disk workload, and
project the performance impact of migrating the workload to a new or existing external disk
subsystem. By modeling different configurations in a what if mode, you can determine the
best DS6000 configuration for your I/O requirements when attaching to an iSeries.
The statistics below are needed as input to Disk Magic and can be extracted from collected
performance data:
Reads and writes per second
Average transfer size (KB per I/O)
Cache effectiveness
Average service times and unit utilization
Reports produced by Performance Tools (see “Performance Tools reports” on page 399) can
be processed by Disk Magic as TXT files to build a base system for modeling. The reports
needed for a Disk Magic analysis are:
Component Report - Disk Activity
Resource Interval Report - Disk Utilization Detail
System Report - Disk Utilization
System Report - Storage Pool Utilization
The reports may cover iSeries internal disks and/or external storage servers. Disk Magic
accepts I/O load statistics of a single or multiple iSeries hosts, produced for either internal
disks or an external storage server. The storage server can be either an IBM or non-IBM
model.
The required reports listed above may be concatenated into a single TXT file, optionally
embedded in any other Performance Tools reports, which can then be processed by Disk
Magic. However, if you prefer, Disk Magic can process each report individually, in which case
Disk Magic will prompt you for the name of each report until all of them have been processed.
Performance Tools reports for multiple iSeries servers may be further concatenated into a
single file, bearing in mind that Disk Magic will create a single disk subsystem for all internally
attached disks it finds in the concatenated reports, regardless of the number of hosts or
externally attached disk subsystems represented in the TXT file. If applicable, a second disk
subsystem is created if externally attached disks are found in the Performance Tools reports.
When Disk Magic detects more than one auxiliary storage pool (ASP), it asks if you want to
model at the ASP level.
The recommended way to input Performance Tools reports to Disk Magic is by first
concatenating them together into a single TXT file, sequenced as in the list above. This file
may contain other Performance Tools reports than the four mentioned above; Disk Magic will
select what it needs. Also, when you have multiple iSeries hosts, the preferred way of
processing their Performance Tools reports is by concatenating the reports together into a
single TXT file before processing it with Disk Magic.
If the iSeries host currently employs software mirroring to maintain a duplicate image of the
disks, Disk Magic detects this in the Performance Tools reports and asks whether it should
assume continued software mirroring, or whether data will be written just once to a DS6000
(which provides RAID 5 or RAID 10 data protection).
If you are planning to attach the iSeries to an existing DS6000, you must also have
performance data from the other workloads that are running on the DS6000. You need to
have the configuration of the DS6000, including cache size, number of Fibre Channel
connections, installed disks, and the mapping of the other workloads across the configuration.
User ASPs must also be included in capacity and performance sizings of storage solutions for
iSeries. Ensure collected sizing information includes the current and future use of user ASPs.
11.5.1 Publications
We recommend the following Redbooks and iSeries publications:
iSeries in Storage Area Networks, A Guide to Implementing FC Disk and Tape with
iSeries, SG24-6220
Performance Tools for iSeries Version 5, SC41-5340
iSeries Performance Capabilities Reference i5/OS Version 5 Release 3, SC41-0607
iSeries Performance Version 5 Release 3, which you can find at:
http://www.ibm.com/servers/eserver/iseries/perfmgmt/resource.html
http://publib.boulder.ibm.com/infocenter/iseries/v5r3/topic/rzahx/rzahx.pdf
Collecting and Analyzing PEX Trace Profile Data, which you can find at:
http://www.ibm.com/servers/eserver/iseries/perfmgmt/resource.html
http://www-03.ibm.com/servers/eserver/iseries/perfmgmt/pdf/tprof.pdf
iSeries Disk Arm Requirements Based on Processor Model Performance, which you can
find at:
http://www-03.ibm.com/servers/eserver/iseries/perfmgmt/pdf/V5R2FiSArmct.pdf
The information in this chapter is not dedicated solely to the IBM TotalStorage DS6000; it can
be applied more generally to other storage equipment.
Adding to the preceding list, Table 12-1 provides a summary of the characteristics of the
different workload types.
The DB2 environment can often be difficult to typify, since there can be wide differences in I/O
characteristics. DB2 Query has high read content and is of a sequential nature. Transaction
environments have more random content, and are sometimes very cache unfriendly, but
some other times have very good hit ratios. DB2 has also implemented several changes that
affect I/O characteristics, such as sequential pre-fetch and exploitation of I/O priority queuing.
Users need to understand the unique characteristics of their installation’s processing before
generalizing about DB2 performance.
A DB2 query workload should mostly have the same characteristics as a sequential read
workload. The storage subsystem implements sequential pre-fetch algorithms; this
functionality, which caches the data that is most likely to be accessed, provides very good
performance improvements for most DB2 queries.
The enhanced prefetch cache algorithms, together with the high storage backend bandwidth
and the minimal RAID 5 write penalty (none for RAID 10), provide high subsystem throughput
and high transaction rates for DB2 transaction-based workloads.
One of DB2’s main advantages is the exploitation of a large buffer pool in processor storage.
When managed properly, the buffer pool can avoid a large percentage of the accesses to
disk. Depending on the application and the size of the buffer pool, this can translate to poor
cache hit ratios for what in DB2 is called synchronous reads. Spreading data across several
RAID Arrays can be used to increase the throughput even if all accesses are read misses.
DB2 administrators often require that tablespaces and their indexes are placed on separate
volumes. This configuration improves both availability and performance.
These workload categories are summarized in Table 12-2, and the common applications that
can be found at any installation are classified following this categorization.
An example of a data warehouse is one designed around a financial institution and its
functions, such as loans, savings, bank cards, and trusts. In this application
there are basically three kinds of operations: The initial loading, the access, and the updating
of the data. However, due to the fundamental characteristics of a warehouse these operations
can occur simultaneously. At times this application could perform 100 percent reads when
accessing the warehouse; 70 percent reads and 30 percent writes when accessing data while
record updating occurs simultaneously; or even 50 percent reads and 50 percent writes when
the user load is heavy. Keep in mind that the data within the warehouse is a series of
snapshots and once the snapshot of data is made, the data in the warehouse does not
change. Therefore, there is typically a higher read ratio when using the data warehouse.
Object-Relational DBMSs (ORDBMS) are now being developed; they not only offer
traditional relational DBMS features, but additionally support complex data types. Objects
can be stored and manipulated, and complex queries can be performed at the database level.
Depending on the host and operating system used to perform this application, transfers are
typically medium to large in size and access is always sequential. Image processing consists
of moving huge image files for the purpose of editing. In these applications the user is
regularly moving huge high-resolution images between the storage device and the host
system. These applications service many desktop publishing and workstation applications.
Editing sessions can include loading large files of up to 16 MB into host memory, where users
edit, render, modify, and eventually store back onto the storage system. High interface
transfer rates are needed for these applications or the users will waste huge amounts of time
waiting to see results. If the interface can move data to and from the storage device at over 32
MB/second then an entire 16 MB image can be stored and retrieved in less than one second.
The need for throughput is all important to these applications and, along with the additional
load of many users, I/O operations per second are also a major requirement.
To monitor the workload applied on your DS6000, the monitoring tool available is the IBM
TotalStorage Productivity Center for Disk with Performance Manager. See 4.3, “IBM
TotalStorage Productivity Center for Disk” on page 109.
These three commands are standard tools available with most UNIX and UNIX-like (Linux)
systems. We recommend iostat for gathering the data you will need to evaluate your host I/O
levels. Specific monitoring tools are also available for AIX, Linux, HP-UX, and Sun Solaris.
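For example, on AIX the following invocation reports disk activity at a fixed interval; the interval and count values here are illustrative:

# Disk statistics only, every 60 seconds, five reports
iostat -d 60 5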
For more information, refer to Chapter 7, “Open systems servers - UNIX” on page 189 and to
Chapter 8, “Open system servers - Linux for xSeries” on page 261.
Performance Monitor gives you the flexibility to customize the monitoring to capture various
categories of Windows 2000 system resources, including CPU and memory. You can also
monitor disk I/O through Performance Monitor.
For more information, refer to Chapter 9, “Open system servers - Windows” on page 305.
iSeries environment
Here are the most popular tools:
Collection Services
iSeries Navigator Monitors
IBM Performance management for iSeries (PM iSeries)
Performance Tools for iSeries
Most of these are comprehensive planning tools in that they address the entire spectrum of
workload performance on iSeries including CPU, system memory, disks and adapters.
For more information, refer to Chapter 11, “iSeries servers” on page 387.
zSeries environment
The z/OS systems have proven performance monitoring and management tools available to
use for performance analysis. RMF, a z/OS performance tool, collects performance data and
reports it for the desired interval. It also provides cache reports. The cache reports are similar
to disk-to-cache and cache-to-disk reports available in the TotalStorage Productivity Center
for Disk, except that RMF’s cache reports are provided in text format. RMF provides the Rank
level statistics as SMF records. These SMF records are raw data that you can run your own
post processor against to generate reports.
For more information, refer to Chapter 10, “zSeries servers” on page 357.
The information and discussion contained in this chapter can further be complemented with
information at this Web site:
http://www.ibm.com/software/data/db2/udb
OLTP systems process the day-to-day operation of businesses and, therefore, have strict
user response and availability requirements. They also have very high throughput
requirements and are characterized by large amounts of database inserts and updates. They
typically serve hundreds, or even thousands, of concurrent users.
DSS systems typically deal with substantially larger volumes of data than OLTP systems due
to their role in supplying users with large amounts of historical data. Whereas 100 GB of data
would be considered large for an OLTP environment, a large DSS system could be 1 terabyte
of data or more. The increased storage requirements of DSS systems can also be attributed
to the fact that they often contain multiple, aggregated views of the same data.
While OLTP queries are mostly related to one specific business function, DSS queries are
often substantially more complex. The need to process large amounts of data results in many
CPU intensive database sort and join operations. The complexity and variability of these
types of queries must be given special consideration when estimating the performance of a
DSS system.
Data tablespaces can be divided in two groups: System tablespaces and user tablespaces.
Both of these have identical data attributes. The difference is that system tablespaces are
used to control and manage the DB2 subsystem and user data. System tablespaces require
the highest availability and some special considerations. User data cannot be accessed if the
system data is not available.
In addition to data tablespaces, DB2 requires a group of traditional datasets not associated to
tablespaces that are used by DB2 to provide data availability: The backup and recovery
datasets.
The following sections describe the different objects and datasets that DB2 uses.
TABLE
All data managed by DB2 is associated to a table. The table is the main object used by DB2
applications.
TABLESPACE
A tablespace is used to store one or more tables. A tablespace is physically implemented with
one or more datasets. Tablespaces are VSAM linear datasets (LDS). Because tablespaces
can be larger than the largest possible VSAM dataset, a DB2 tablespace may require more
than one VSAM dataset.
INDEX
A table can have one or more indexes (or can have no index). An index contains keys. Each
key may point to one or more data rows. The purpose of an index is to get direct and faster
access to the data in a table.
DATABASE
A database is a DB2 representation of a group of related objects. Each of the previously
named objects must belong to a database. DB2 databases are used to organize and manage
these objects.
STOGROUP
A DB2 storage group is a list of storage volumes. STOGROUPs are assigned to databases,
tablespaces, or index spaces when using DB2 managed objects. DB2 uses STOGROUPs for
disk allocation of the table and index spaces.
Installations that are SMS managed can define STOGROUP with VOLUME(*). This
specification implies that SMS assigns a volume to the table and index spaces in that
STOGROUP. In order to do this, SMS uses ACS routines to assign a storage class, a
management class and a storage group to the table or index space.
Application tablespaces and index spaces are VSAM LDS datasets with the same attributes
as DB2 system tablespaces and index spaces. The difference between system and
application data is made only because they have different performance and availability
requirements.
You may intermix tables and indexes, and also system, application, and recovery datasets, on
DS6000 Ranks. The overall I/O activity will be more evenly spread, and I/O skews will be
avoided.
VSAM data striping addresses this problem with two modifications to the traditional data
organization:
The records are not placed in key ranges along the volumes; instead they are organized in
stripes.
Parallel I/O operations are scheduled to sequential stripes in different volumes.
By striping data the VSAM control intervals (CIs) are spread across multiple devices. This
format allows a single application request for records in multiple tracks and CIs to be satisfied
by concurrent I/O requests to multiple volumes.
The result is improved data transfer to the application. The scheduling of I/O to multiple
volumes in order to satisfy a single application request is referred to as an I/O path packet.
We can stripe across Ranks, across device adapters, across servers, and across DS6000s.
If you plan to enable VSAM I/O striping refer to the following publication for additional
discussion: DB2 for z/OS and OS/390 Version 7 Performance Topics, SG24-6129.
Measurements oriented to determine how large volumes can impact DB2 performance have
shown that similar response times can be obtained when using larger volumes as when using
the smaller 3390-3 standard size volumes (refer to 10.6.2, “Larger versus smaller volumes
performance” on page 364 for a discussion).
Examples of DB2 applications that will benefit from MIDAWs are DB2 prefetch and DB2
utilities.
The information presented in this section is further discussed in detail in (and liberally
borrowed from) the redbook, IBM ESS and IBM DB2 UDB Working Together, SG24-6262.
Many of the concepts presented are applicable to the DS6000. We highly recommend this
redbook. However, based on customer solution experiences using SG24-6262, there are two
corrections we want to point out:
In IBM ESS and IBM DB2 UDB Working Together, SG24-6262, Section 3.2.2, “Balance
workload across ESS resources”, suggests that a data layout policy should be established
that allows partitions and containers within partitions to be spread evenly across ESS
resources. It further suggests that you can choose either a horizontal mapping, in which
every partition has containers on every available ESS Rank, or a vertical mapping in which
DB2 partitions are isolated to specific arrays, with containers spread evenly across those
Ranks. We now recommend the vertical mapping approach.
Another data placement consideration, suggests that it is important to manage where data
is placed on the disk, outer edge or middle. We no longer believe this is an important
consideration.
The database object that maps the physical storage is the tablespace. Figure 13-1 on
page 422 illustrates how DB2 UDB is logically structured and how the tablespace maps the
physical object.
Figure 13-1 DB2 UDB logical structure: instances contain databases, and databases contain
tablespaces (SMS or DMS), where tables, indexes, and long data are stored. In an SMS
tablespace, each container is a directory in the file space of the operating system; in a DMS
tablespace, each container is a fixed, pre-allocated file or a physical device such as a disk.
Instances
An instance is a logical database manager environment where databases are cataloged and
configuration parameters are set. An instance is similar to an image of the actual database
manager environment. You can have several instances of the database manager product on
the same database server. You can use these instances to separate the development
environment from the production environment, tune the database manager to a particular
environment, and protect sensitive information from a particular group of users.
For database partitioning features (DPF) of the DB2 Enterprise Server Edition (ESE), all data
partitions reside within a single instance.
Databases
A relational database structures data as a collection of database objects. The primary
database object is the table (a defined number of columns and any number of rows). Each
database includes a set of system catalog tables that describe the logical and physical
structure of the data, configuration files containing the parameter values allocated for the
database, and recovery logs.
DB2 UDB allows multiple databases to be defined within a single database instance.
Configuration parameters can also be set at the database level, thus allowing you to tune, for
example, memory usage and logging.
Database partitions
A partition number in DB2 UDB terminology is equivalent to a data partition. Databases with
multiple data partitions residing on an SMP system are also called multiple logical node
(MLN) databases.
The configuration information of the database is stored in the catalog partition. The catalog
partition is the partition from which you create the database.
Partitiongroups
A partitiongroup is a set of one or more database partitions. For non-partitioned
implementations (all editions except for DPF), the partitiongroup is always made up of a
single partition.
Partitioning map
When a partitiongroup is created, a partitioning map is associated with it. The partitioning
map in conjunction with the partitioning key and hashing algorithm is used by the database
manager to determine which database partition in the partitiongroup will store a given row of
data. Partitioning maps do not apply to non-partitioned databases.
Containers
A container is the way of defining where on the storage device the database objects will be
stored. Containers may be assigned from file systems by specifying a directory. These are
identified as PATH containers. Containers may also reference files that reside within a
directory. These are identified as FILE containers and a specific size must be identified.
Containers may also reference raw devices. These are identified as DEVICE containers, and
the device must already exist on the system before the container can be used.
All containers must be unique across all databases; a container can belong to only one
tablespace.
Tablespaces
A database is logically organized in tablespaces. A tablespace is a place to store tables. To
spread a tablespace over one or more disk devices you simply specify multiple containers.
For partitioned databases, the tablespaces reside in partitiongroups. In the create tablespace
command execution, the containers themselves are assigned to a specific partition in the
partitiongroup, thus maintaining the shared nothing character of DB2 UDB DPF.
Tablespaces can be either system managed space (SMS) or data managed space (DMS).
For an SMS tablespace, each container is a directory in the file system, and the operating
system file manager controls the storage space (LVM for AIX). For a DMS tablespace, each
container is either a fixed-size pre-allocated file, or a physical device such as a disk (or in the
case of the DS6000, a vpath), and the database manager controls the storage space.
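As a brief sketch of the two variants from the DB2 command line processor (the tablespace names, directories, and raw devices are hypothetical; DEVICE container sizes are given in pages):

# SMS: each container is a file system directory
db2 "CREATE TABLESPACE TS_SMS MANAGED BY SYSTEM \
  USING ('/db2/ts_sms/c0', '/db2/ts_sms/c1')"

# DMS: each container is a pre-allocated file or a raw device (vpath);
# 2621440 4 KB pages = 10 GB per container
db2 "CREATE TABLESPACE TS_DMS MANAGED BY DATABASE \
  USING (DEVICE '/dev/rvpath0' 2621440, DEVICE '/dev/rvpath1' 2621440)"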
There are three main types of user tablespaces: Regular (index and/or data), temporary, and
long. In addition to these user-defined tablespaces, DB2 requires a system tablespace, the
catalog tablespace, to be defined. For partitioned database systems this catalog tablespace
resides on the catalog partition.
When creating a table you can choose to have certain objects, such as indexes and large
object (LOB) data, stored separately from the rest of the table data. In order to do this, the
table must be defined to a DMS tablespace.
Indexes are defined for a specific table and assist in the efficient retrieval of data to satisfy
queries. They can also be used to assist in the clustering of data.
Large objects (LOBs) can be stored in columns of the table. These objects, although logically
referenced as part of the table, may be stored in their own tablespace when the base table is
defined to a DMS tablespace. This allows for more efficient access of both the LOB data and
the related table data.
Pages
Data is transferred to and from devices in discrete blocks that are buffered in memory. These
discrete blocks are called pages, and the memory reserved to buffer a page transfer is called
an I/O buffer. DB2 UDB supports various page sizes, including 4 KB, 8 KB, 16 KB, and 32 KB.
When an application accesses data randomly, the page size determines the amount of data
transferred. This corresponds to the size of data transfer request done to the DS6000, which
is sometimes referred to as the physical record.
Sequential read patterns can also influence the page size selected. Larger page sizes for
workloads with sequential read patterns can enhance performance by reducing the number of
I/Os.
Extents
An extent is a unit of space allocation within a container of a tablespace for a single
tablespace object. This allocation consists of multiple pages. The extent size (number of
pages) for an object is set when the tablespace is created.
An extent is a group of consecutive pages defined to the database. The data in the
tablespaces is striped by extent across all the containers in the tablespace.
Buffer pools
A buffer pool is main memory allocated in the host processor to cache table and index data
pages as they are being read from disk or being modified. The purpose of the buffer pool is to
improve system performance. Data can be accessed much faster from memory than from
disk; therefore, the fewer times the database manager needs to read from or write to disk
(I/O) the better the performance. Multiple buffer pools can be created.
Sequential prefetch reads consecutive pages into the buffer pool before they are needed by
DB2. List prefetch is more complex; in this case, the DB2 optimizer optimizes the retrieval of
randomly located data.
Page cleaners
Page cleaners make room in the buffer pool before prefetchers read pages from disk storage
and move them into the buffer pool. For example, if a large amount of data has been updated
in a table, many data pages in the buffer pool may be updated but not yet written to disk
storage (these pages are called dirty pages). Since prefetchers cannot place fetched data
pages onto the dirty pages in the buffer pool, these dirty pages must first be flushed to disk
storage and become clean pages, so that prefetchers can then place fetched data pages
from disk storage.
Logs
Changes to data pages in the buffer pool are logged. Agent processes updating a data record
in the database update the associated page in the buffer pool and write a log record into a log
buffer. The written log records in the log buffer will be flushed into the log files asynchronously
by the logger. With UNIX, you can see a logger process (db2loggr) for each active database
using the ps command.
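For example, a check of this kind on AIX or Linux might look like the following:

# Look for the db2loggr process of each active database
ps -ef | grep db2loggr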
To optimize performance neither the updated data pages in the buffer pool nor the log records
in the log buffer are written to disk immediately. They are written to disk by page cleaners and
the logger, respectively.
Parallel operations
DB2 UDB extensively uses parallelism to optimize performance when accessing a database.
DB2 supports several types of parallelism:
Query
I/O
Query parallelism
There are two dimensions of query parallelism: Inter-query parallelism and intra-query
parallelism. Inter-query parallelism refers to the ability of multiple applications to query a
database at the same time. Each query executes independently of the others, but they are all
executed at the same time. Intra-query parallelism refers to the simultaneous processing of
parts of a single query, using intra-partition parallelism, inter-partition parallelism, or both.
Intra-partition parallelism subdivides what is usually considered a single database
operation, such as index creation, database loading, or SQL queries, into multiple parts,
many or all of which can be run in parallel within a single database partition.
Inter-partition parallelism subdivides what is usually considered a single database
operation, such as index creation, database loading, or SQL queries, into multiple parts,
many or all of which can be run in parallel across multiple partitions of a partitioned
database on one machine or on multiple machines. Inter-partition parallelism only applies
to DPF.
I/O parallelism
When there are multiple containers for a tablespace, the database manager can exploit
parallel I/O. Parallel I/O refers to the process of writing to, or reading from, two or more I/O
devices simultaneously. This can result in significant improvements in throughput.
DB2 implements a form of data striping by spreading the data in a tablespace across multiple
containers. In storage terminology, the part of a stripe that is on a single device is a strip. The
DB2 term for strip is extent. If your tablespace has three containers, DB2 will write one extent
to container 0, the next extent to container 1, the next extent to container 2, then back to
container 0. The stripe width—a generic term not often used in DB2 literature—is equal to the
number of containers, or three in this case.
Containers for a tablespace would ordinarily be placed on separate physical disks, allowing
work to be spread across those disks, and allowing disks to operate in parallel. Since the
DS6000 logical disks are striped across the Rank, the database administrator can allocate
DB2 containers on separate logical disks residing on separate DS6000 Arrays. This will take
advantage of the parallelism both in DB2 and in the DS6000. For example, four DB2
containers residing on four DS6000 logical disks on four different 7+P Ranks will have data
spread across 32 physical disks.
For a more detailed and accurate approach that takes into consideration the particularities of
your DB2 UDB environment, you should contact your IBM representative, who can assist you
with the DS6000 capacity and configuration planning.
If you want optimal performance from the DS6000, do not treat it completely like a black box.
Establish a storage allocation policy that allocates data using several DS6000 Ranks.
Understand how DB2 tables map to underlying logical disks, and how the logical disks are
allocated across the DS6000 Ranks.
As a result, you can balance activity across DS6000 resources by following these rules:
Span DS6000 storage units.
Span Ranks (RAID Arrays) within a storage unit.
Engage as many arrays as possible.
Figure 13-2 on page 428 illustrates this technique for a single tablespace consisting of eight
containers.
Figure 13-2 Allocating DB2 containers using a “spread your data” approach
Look again at Figure 13-2. In this case, we are striping across arrays, across disk adapters,
across clusters, and across DS6000 boxes. This can all be done using the striping capabilities
of DB2’s container and shared nothing concept. This eliminates the need to employ AIX
logical volume striping.
Page size    Maximum tablespace size
4 KB         64 GB
8 KB         128 GB
16 KB        256 GB
32 KB        512 GB
Select a page size that can accommodate the total expected growth requirements of the
objects in the tablespace.
For OLTP applications that perform random row read and write operations, a smaller page
size is preferable, because it wastes less buffer pool space with unwanted rows. For DSS
applications that access large numbers of consecutive rows at a time, a larger page size is
better, because it reduces the number of I/O requests that are required to read a specific
number of rows.
Tip: Experience indicates that page size can be dictated to some degree by the type of
workload. For pure OLTP workloads a 4 KB page size is recommended. For a pure DSS
workload a 32 KB page size is recommended. For a mixture of OLTP and DSS workload
characteristics we recommend either an 8 KB page size or a 16 KB page size.
Extent size
If you want to stripe across multiple arrays in your DS6000, then assign a LUN from each
Rank to be used as a DB2 container. During writes, DB2 will write one extent to the first
container, the next extent to the second container, and so on until all eight containers have
been addressed before cycling back to the first container. DB2 stripes across containers at
the tablespace level.
Since the DS6000 stripes at a fairly fine granularity (256 KB), selecting a multiple of 256 KB
for the extent size makes sure that multiple DS6000 disks are used within a Rank when a
DB2 prefetch occurs. However, you should keep your extent size below 1 MB.
I/O performance is fairly insensitive to selection of extent sizes, mostly due to the fact that
DS6000 employs sequential detection and prefetch. For example, even if you picked an extent
size such as 128 KB, which is smaller than the full array width (it would involve accessing only
four disks in the array), the DS6000 sequential prefetch would keep the other disks in the
array busy.
Prefetch size
The tablespace prefetch size determines the degree to which separate containers can
operate in parallel.
It is worthwhile to note that prefetch size is tunable. By this we mean that prefetch size can be
altered after the tablespace has been defined and data loaded. This is not true for extent and
page size that are set at tablespace creation time and cannot be altered without re-defining
the tablespace and re-loading the data.
Tip: The prefetch size should be set so that as many arrays as desired can be working on
behalf of the prefetch request. For other than the DS6000, the general recommendation is
to calculate prefetch size to be equal to a multiple of the extent size times the number of
containers in your tablespace. For the DS6000 you may work with a multiple of the extent
size times the number of arrays underlying your tablespace.
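Pulling the page size, extent size, and prefetch size recommendations together, here is a hedged sketch for a tablespace spread over four containers, one LUN from each of four Ranks; the names, devices, and sizes are illustrative. With the default 4 KB page, EXTENTSIZE 64 gives a 256 KB extent matching the DS6000 strip size, and PREFETCHSIZE 256 equals the extent size times the four containers:

db2 "CREATE TABLESPACE TS_DATA MANAGED BY DATABASE \
  USING (DEVICE '/dev/rvpatha' 2621440, DEVICE '/dev/rvpathb' 2621440, \
         DEVICE '/dev/rvpathc' 2621440, DEVICE '/dev/rvpathd' 2621440) \
  EXTENTSIZE 64 PREFETCHSIZE 256"

# Prefetch size is tunable after data is loaded; extent and page size are not
db2 "ALTER TABLESPACE TS_DATA PREFETCHSIZE 512"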
The DS6000 supports a high degree of parallelism and concurrency on a single logical disk.
As a result, a single logical disk the size of an entire array achieves the same performance as
many smaller logical disks. However, you must consider how logical disk size affects both the
host I/O operations and the complexity of your organization’s systems administration.
Smaller logical disks provide more granularity, with their associated benefits, but they also
increase the number of logical disks seen by the operating system. Select a DS6000 logical
disk size that allows for granularity and growth without proliferating the number of logical
disks.
You should also take into account your container size and how the containers will map to AIX
logical volumes and DS6000 logical disks. In the simplest situation, the container, the AIX
logical volume, and the DS6000 logical disk will be the same size.
Tip: Try to strike a reasonable balance between flexibility and manageability for your
needs. Our general recommendation is that you create no fewer than two logical disks in
an array, and that the minimum logical disk size be between 10 GB and 20 GB. Unless you
have an extremely compelling reason, standardize on a single logical disk size throughout
the DS6000.
Larger and smaller logical disk sizes have the following advantages and disadvantages:
Advantages of smaller size logical disks:
– Easier to allocate storage for different applications and hosts.
– Greater flexibility in performance reporting; for example, PDCU reports statistics for
logical disks.
Disadvantages of smaller size logical disks:
Small logical disk sizes can contribute to proliferation of logical disks, particularly in SAN
environments and large configurations. Administration gets complex and confusing.
Advantages of larger size logical disks:
– Fewer logical disks for the operating system to discover and for the administrator to
manage, which simplifies administration and shortens reboots, as the examples below
illustrate.
Examples
Let us assume a 6+P array with 146 GB disk drives. Suppose you wanted to allocate disk
space on your 16-array DS6000 as flexibly as possible. You could carve each of the 16 arrays
up into 32 GB logical disks or LUNs, resulting in 27 logical disks per array (with a little left
over). This would yield a total of 16 * 27 = 432 LUNs. Then you could implement 4-way
multi-pathing, which in turn would make 4* 432 = 1728 hdisks visible to the operating system.
Not only would this create an administratively complex situation, but at every reboot the
operating system would query each of those 1728 disks. Reboots could take a long time.
Alternatively, you could have created just 16 large logical disks. With multi-pathing and
attachment of four Fibre Channel ports, you would have 4* 16 = 128 hdisks visible to the
operating system. Although this number is large, it is certainly more manageable; and reboots
would be much faster. Having overcome that problem, you could then use the operating
system logical volume manager to carve this space up into smaller pieces for use.
There are problems with this large logical disk approach as well, however. If the DS6000 is
connected to multiple hosts or it is on a SAN, then disk allocation options are limited when
you have so few logical disks. You would have to allocate entire arrays to a specific host; and
if you wanted to add additional space, you would have to add it in array-size increments.
This problem is less severe if you know your needs well enough to say that your DS6000 will
never be connected to more than one host. Nevertheless, in some versions of UNIX an hdisk
can be assigned to only one logical volume group. This means that if you want an operating
system volume group that spans all arrays of the DS6000, you are limited to a single volume
group for the entire DS6000.
DB2 can use containers from multiple volume groups, so this is not technically a problem for
DB2. However, if you want the ability to do disk administration at the volume group level
(exports, imports, backups, and so on), then you will not be very pleased with a volume group
that is three to eleven terabytes in size.
13.5.6 Multi-pathing
Use DS6000 multi-pathing along with DB2 striping to ensure balanced use of Fibre Channel
paths.
Multi-pathing is the hardware and software support that provides multiple avenues of access
to your data from the host computer. When using the DS6000, this means you need to
provide at least two Fibre Channel or SCSI connections to each host computer from any
component being multi-pathed. It also involves some additional considerations when
configuring the DS6000 host adapters and volumes.
DS6000 multi-pathing requires the installation of multipathing software. For AIX, you have two
choices: SDDPCM or the IBM Subsystem Device Driver (SDD). For AIX, we recommend
SDDPCM. These products are discussed in Chapter 7, “Open systems servers - UNIX” on
page 189 and in Chapter 5, “Host attachment” on page 143.
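Once the multipathing software is installed, the state of the paths can be verified from the AIX host; the following commands are a sketch, and the output varies with the configuration:

# With SDDPCM: list MPIO devices and the state of each path
pcmpath query device

# With SDD: list vpath devices and the state of each path
datapath query device

Each path to a LUN should show an open or normal state; a path shown in a failed state indicates a broken connection to the DS6000.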
IMS provides functions for preserving the integrity of the databases and maintaining them. It
allows multiple tasks to access and update the data, while ensuring the integrity of that data.
It also provides functions for reorganizing and restructuring the databases.
The IMS databases are organized internally using a number of IMS’s own internal database
organization access methods. The database data is stored on disk storage using the normal
operating system access methods.
During IMS execution, all information necessary to restart the system in the event of a failure
is recorded on a system log dataset. The IMS logs are made up of the following datasets.
The online log datasets (OLDS) are made up of multiple datasets that are used in a
wrap-around manner. At least three datasets must be allocated for the OLDS to allow IMS to
start, while an upper limit of 100 is supported.
Only complete log buffers are written to the OLDS, to enhance performance. Should any
incomplete buffers need to be written out, they are written to the write ahead datasets (WADS).
When IMS processing requires writing of a partially filled OLDS buffer, a portion of the buffer
is written to the WADS. If IMS or the system fails, the log data in the WADS is used to
terminate the OLDS, which can be done as part of an emergency restart, or as an option on
the IMS Log Recovery Utility.
The WADS space is continually reused after the appropriate log data has been written to the
OLDS. This dataset is required for all IMS systems, and must be pre-allocated and formatted
at IMS start-up when first used.
If you want optimal performance from the DS6000, do not treat it totally like a ‘black box’.
Understand how your IMS datasets map to underlying volumes, and how the volumes map to
RAID Arrays.
You may intermix IMS databases and log datasets on DS6000 Ranks. The overall I/O activity
will be more evenly spread, and I/O skews will be avoided.
Measurements to determine how large volumes can impact IMS performance have shown
that similar response times can be obtained when using larger volumes as when using the
smaller 3390-3 standard size volumes.
Figure 13-3 on page 435 illustrates the device response times when using 32 3390-3
volumes versus four large volumes 3390-27 on an ESS-F20 using FICON channels. Even
though the benchmark was performed on an ESS-F20, the results should be similar on the
DS6000. The results show that with the larger volumes the response times are similar to the
standard size 3390-3 volumes.
Figure 13-3 Device response time (ms) versus total I/O rate (I/O per second): 32 3390-3
volumes compared with four 3390-27 volumes, measured at 2905 and 4407 I/O per second
Copy Services has four interfaces: a Java-enabled Web-based interface (DS Storage
Manager), a command-line interface (DS CLI), an application programming interface (DS
Open API), and host I/O commands from zSeries servers.
This chapter discusses the functions, objectives, and performance related aspects of the
following Copy Services:
FlashCopy
Metro Mirror
Global Copy
Interoperability between IBM TotalStorage Enterprise Storage Server (ESS), the DS6000
and the DS8000
Managing performance
Note: Remote Mirror and Copy was referred to as Peer-to-Peer Remote Copy (PPRC) in
earlier documentation for the IBM TotalStorage Enterprise Storage Server.
You can manage Copy Services functions through the function-rich DS Command-Line
Interface (CLI) called the IBM TotalStorage DS CLI and the Web-based interface called the
IBM TotalStorage DS Storage Manager. The DS Storage Manager allows you to set up and
manage data copy features from anywhere that network access is available.
There are several points to consider when you are planning to use FlashCopy that may help
you minimize any impact that the FlashCopy operation may have on host I/O performance.
This section gives an overview of FlashCopy for a DS6000 in a z/OS environment from a
performance perspective. We will describe:
FlashCopy operational areas
FlashCopy basic concepts
Data set level FlashCopy
FlashCopy in combination with other Copy Services
As Figure 14-1 on page 440 illustrates, when FlashCopy is invoked, a relationship (or
session) is established between the source and target volumes of the FlashCopy pair. This
includes creation of the necessary bitmaps and metadata information needed to control the
copy operation. This FlashCopy establish process is very quick to complete, at which point:
The FlashCopy relationship is fully established.
Control returns to the operating system or task that requested the FlashCopy.
Both the source volume and its time zero (T0) target volume are available for full read/write
access.
At this time a background task within the DS6000 starts copying the tracks from the source to
the target volume. Optionally, you can suppress this background copy task. This is efficient,
for example, if you are doing a temporary copy just to take a backup from that copy to tape.
Figure 14-1 FlashCopy establish: read and write to both source and target are possible; the
optional T0 physical copy progresses in the background
For a straightforward FlashCopy, the FlashCopy relationship ends when the background copy
task completes. However, if the FlashCopy was requested with the no-background copy
option, or with the persistent option, then the relationship must be explicitly ended by a
FlashCopy withdraw command.
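For example, with the DS CLI, a relationship of this kind could be established and later withdrawn as follows; the storage image ID and the 0100:0200 source:target volume pair are illustrative, not taken from this book:

dscli> mkflash -dev IBM.1750-1300861 -nocp 0100:0200
dscli> rmflash -dev IBM.1750-1300861 0100:0200

The -nocp option suppresses the background copy, so the relationship persists until the rmflash (withdraw) command ends it.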
FlashCopy has several options, and not all options are available in all user interfaces. It is
important to know, right from the beginning, the purpose for which the target volume will be
used afterwards. Knowing this, you can identify the options to use with FlashCopy and select
the environment that supports those options.
Supported options within the environments are those identified in Figure 14-2 on page 441.
Figure 14-2 FlashCopy interfaces and functions (the functions include remote FlashCopy,
persistent FlashCopy, reverse restore, and fast reverse restore)
Note: Normal DS6000 application related I/O will affect the performance of FlashCopy
background copy.
Tip: For better performance with FlashCopy, do not FlashCopy between logical volumes
within the same Rank.
Copying within a single Rank means that the data reads and writes are all performed on
the same physical set of eight DDMs in the Rank, effectively doubling the I/O rates to that
Rank during the FlashCopy operation.
Better FlashCopy background copy performance can be expected if the source and
destination Ranks are managed by different DA Pairs.
With multiple background copy operations running concurrently, the throughput
performance of each operation will be less than the performance of one operation running
by itself. For instance, assume the background copy of one 36 GB volume completes in 12
minutes. It should be expected that the background copy of four 36 GB volumes, running
concurrently, will take more than 12 minutes to complete.
Slightly better FlashCopy background copy performance can be expected by ensuring the
source and target volumes are on the same RIO Loop.
FlashCopy relationships can exist between logical volumes that are backed by Ranks
consisting of differing geometries (speed, capacity and RAID type). These FlashCopy
relationships will also be slightly impacted by the performance capability of the underlying
devices, but this is unlikely to cause any noticeable performance degradation, unless the
supporting Ranks are already highly utilized by normal I/O activity.
Additionally, if you have many FlashCopy pairs to manage, try to balance the DS6000
source and target volumes across the two DS6000 servers, so that the copy tasks are able
to exploit more of the internal bandwidth within the DS6000 server.
Evenly distribute FlashCopy source volumes throughout available source Ranks. Likewise,
distribute target volumes evenly throughout target Ranks.
There is no requirement for source and target volumes to have the same RAID levels.
Metro Mirror
This function, formerly known as synchronous Peer-to-Peer Remote Copy, or PPRC, provides
a synchronous real-time mirror of logical volumes in a second DS6000 (or DS8000 or
ESS750 or ESS 800). Every host write to the source logical volume is copied to the target
logical volume before acknowledging write completion to the application, maintaining the pair
of volumes in a duplex relationship, as shown in Figure 14-3 on page 444.
Figure 14-3 Metro Mirror synchronous write sequence between the primary and secondary
DS6000
When the application performs a write update operation to a primary volume, this is what
happens:
1. Write to primary volume (DS6000 cache).
2. Write to secondary (DS6000 cache).
3. Signal write complete on the secondary DS6000.
4. Post I/O complete to host server.
Metro Mirror utilizes Fibre Channel Protocol to communicate between the pair of participating
DS6000s, or between the DS6000 and its remote partner storage subsystem. Care should be
taken to ensure that the paths have adequate bandwidth. FC paths are set up between the
LSS that the source volume resides in, and the LSS that contains the target volume.
Some decisions need to be made when setting up each Metro Mirror logical volume pair as to
which of the following actions should be taken if the paths become unavailable:
Keep accepting writes to the primary volume and allow the Metro Mirror process to track
the changes, resynchronizing the pair when the paths become available. This enables the
application to keep running; but in the event of a primary site failure while the paths are
unavailable, the secondary volume will not be current, and transactions that occurred
since the path interruption are likely to be lost.
Suspend all updates to primary volume. This is disruptive to the application, but maintains
the best data integrity.
The redbook IBM TotalStorage DS6000: Concepts and Architecture, SG24-6471 introduces
the concept of data consistency, and Consistency Groups. For Metro Mirror, consistency
requirements are managed through use of the Consistency Group option when you are
defining Metro Mirror paths between pairs of LSSs. Volumes or LUNs which are paired
between two LSSs whose paths are defined with the Consistency Group option can be
considered part of a Consistency Group.
Consistency is provided by means of the extended long busy (for z/OS) or queue full (for open
systems) conditions. These are triggered when the DS6000 detects a condition where it
cannot update the Metro Mirror secondary volume. The volume pair that first detects the error
will go into the extended long busy or queue full condition, such that it will not do any I/O. For
z/OS a system message will be issued (IEA494I state change message); for open systems an
SNMP trap message will be issued. These messages can be used as a trigger for automation
purposes - to provide data consistency by use of the Freeze/Run (or Unfreeze) commands.
For further discussion of these, refer to the Metro Mirror options that follow.
Important: The data on disk at the secondary site is an exact mirror of that at the primary
site. Remember that any data still in host system buffers or processor memory is not yet on
disk and so will not be mirrored to the secondary site. This is a similar situation to a power
failure in the primary site.
Freeze and unfreeze commands - The Metro Mirror freeze and run commands are used
by automation processes to ensure data consistency; we discuss their usage in this
section. The freeze and unfreeze commands are available only through the DS CLI, not
the DS GUI.
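As an illustration only (the storage image IDs and LSS pair are hypothetical), an automation process might issue the freeze and run commands through the DS CLI as follows:

freezepprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 01:01
unfreezepprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 01:01

The freezepprc command suspends the paths for the listed source:target LSS pair and holds write I/O, and unfreezepprc (the run command) releases the extended long busy or queue full condition so that application I/O can continue.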
Critical attribute - Consistency Group and critical mode combination - The -critmode
parameter of the mkpprc command and the -consistgrp parameter of the mkpprcpath
command control this behavior.
Important: The DS6000 supports Fibre Channel (FC) Metro Mirror links only.
Metro Mirror pairs are set up between volumes in LSSs, usually in different disk subsystems,
and these are normally in separate locations. To establish a Metro Mirror pair, there must be
a Metro Mirror path between the LSSs in which the volumes reside. These paths are
bidirectional and can be shared by any Metro Mirror pairs going between the same LSSs in
the same direction. For bandwidth and redundancy, more than one path can be created
between the same LSSs; Metro Mirror balances the workload across the available paths
between the primary and secondary.
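As a minimal sketch (the storage image IDs, remote WWNN, I/O port pair, and volume IDs are hypothetical), defining a path with the Consistency Group option and establishing a Metro Mirror pair through the DS CLI might look like this:

mkpprcpath -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 -remotewwnn 500507630EFFFC6D -srclss 01 -tgtlss 01 -consistgrp I0001:I0001
mkpprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 -type mmir 0100:0100

The mkpprcpath command creates the logical path between LSS 01 on each subsystem over the I/O port pair I0001:I0001, and mkpprc establishes volume 0100 as a synchronous (Metro Mirror) pair with volume 0100 on the secondary.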
Note: Remember that the LSS is not a physical construct in the DS6000. Volumes in an
LSS can come from multiple disk Arrays.
Metro Mirror pairs can only be established between storage control units of the same (or
similar) type and features. For example, a DS6000 can have a Metro Mirror pair with another
DS6000, a DS8000, an ESS 800, or an ESS 750. It cannot have a Metro Mirror pair with an
RVA or an ESS F20. Note that all disk subsystems must have the appropriate Metro Mirror
feature installed (for DS6000 the Remote Mirror and Copy 2244 function authorization model,
which is 2244 Model RMC). If your DS6000 is being mirrored to an ESS disk subsystem, the
ESS must have PPRC Version 2 (which supports Fibre Channel PPRC links).
A path (or group of paths) needs to be established from the LSS to each LSS with related
secondaries. Also, a path (or group of paths) must be established to the LSS from each LSS
with related primaries. These logical paths are transported over physical links between the
disk subsystems. The physical link includes the host adapter in the primary DS6000, the
cabling, switches or directors, any wide band or long distance transport devices (DWDM,
channel extenders, WAN) and the host adapters in the secondary disk subsystem. Physical
links can carry multiple logical Metro Mirror paths, as shown in Figure 14-4 on page 447.
Although one FCP link would have sufficient bandwidth for most Metro Mirror environments,
for redundancy reasons we recommend configuring two Fibre Channel links between each
primary and secondary disk subsystem.
Metro Mirror FCP links can be direct connected, or connected by up to two switches.
Dedicating Fibre Channel ports for Metro Mirror use guarantees no interference from host I/O
activity. This is recommended with Metro Mirror, which is time critical and should not be
impacted by host I/O activity. The Metro Mirror ports used provide connectivity for all LSSs
within the DS6000 and can carry multiple logical Metro Mirror paths.
Logical paths
A Metro Mirror logical path is a logical connection between the sending LSS and the receiving
LSS. An FCP link can accommodate multiple Metro Mirror logical paths.
Figure 14-5 on page 448 shows an example where we have a 1:1 mapping of source to target
LSSs, and where the three logical paths are accommodated in one Metro Mirror link:
LSS1 in DS6000 1 to LSS1 in DS6000 2
LSS2 in DS6000 1 to LSS2 in DS6000 2
LSS3 in DS6000 1 to LSS3 in DS6000 2
Alternatively, if the volumes in each of the LSSs of DS6000 1 map to volumes in all three
secondary LSSs in DS6000 2, there will be nine logical paths over the Metro Mirror link (not
fully illustrated in Figure 14-4). Note that we recommend a 1:1 LSS mapping.
[Figure 14-5: three LSS-to-LSS logical paths carried over one Metro Mirror link.]
Bandwidth
Prior to implementing your Metro Mirror solution, you should determine your peak write-rate
bandwidth requirement. This helps ensure that you have enough Metro Mirror links in place to
support that requirement. Remember that only writes are mirrored across to the secondary
volumes.
LSS design
Since the DS6000 has made the LSS a topological construct that is not tied to a physical
Array as in the ESS, the design of your LSS layout can be simplified. It is now possible, for
example, to assign LSSs to applications without concern about under- or over-allocation of
physical disk subsystem resources. This can also simplify the Metro Mirror environment,
because it is possible to reduce the number of commands that are required for data
consistency.
Distance
The distance between your primary and secondary DS6000 subsystems will have an effect
on the response time overhead of the Metro Mirror implementation. Your IBM Field Technical
Sales Specialist (FTSS) can be contacted to assist you in assessing your configuration and
the distance implications.
Figure 14-6 shows this idea in a graphical form. DS6000 #1 has Metro Mirror paths defined to
DS6000 # 2, which is in a remote location. On DS6000 #1, volumes defined in LSS 00 are
mirrored to volumes in LSS 00 on DS6000 #2 (volume P1 is paired with volume S1, P2 with
S2, P3 with S3, and so on). Volumes in LSS 01 on DS6000 #1 are mirrored to volumes in LSS
01 on DS6000 #2, and so on. Additional capacity can also be added in a symmetrical way,
by adding volumes to existing LSSs and by adding new LSSs when needed. For example,
adding two volumes in LSS 03 and LSS 05, and one volume in LSS 04, brings them to the
same number of volumes as the other LSSs; further volumes can then be distributed evenly
across all LSSs, or additional LSSs added.
As well as making the maintenance of the Metro Mirror configuration easier, this has the
added benefit of helping to balance the workload across the DS6000. Figure 14-6 shows a
logical configuration - this idea applies equally to the physical aspects of the DS6000. You
should attempt to balance workload and apply symmetrical concepts to other aspects of your
DS6000 (for example, the Extent Pools).
Consider a non-symmetrical configuration for a moment. For instance, the primary site could
have volumes defined on RAID 5 Arrays (Ranks) made up of 72 GB DDMs, while the
secondary site has Ranks made up of 300 GB DDMs. Because the capacity of the secondary
DDMs is larger, more volumes would be concentrated on fewer secondary Ranks, which
could make those Ranks a bottleneck for the mirrored writes.
Volumes
You will need to consider which volumes should be mirrored to the secondary site. One option
is to mirror all volumes. This is advantageous for the following reasons:
You will not need to consider whether any required data has been missed.
Users will not need to remember which logical pool of volumes is mirrored and which is
not.
Addition of volumes to the environment is simplified - you will not have to have two
processes for addition of disk (one for mirrored volumes, and another for non-mirrored
volumes).
You will be able to move data around your disk environment easily without a concern over
whether the target volume is a mirrored volume or not.
You may choose not to mirror all volumes. In this case you will need careful control over what
data is placed on the mirrored volumes (to avoid any capacity issues) and what is placed on
the non-mirrored volumes (to avoid missing any required data). One method of doing this
could be to place all mirrored volumes in a particular set of LSSs, in which all volumes have
Metro Mirror enabled, and direct all data requiring mirroring to these volumes.
Performance considerations
Some basic things you should consider:
The process of getting the primary and secondary Metro Mirror volumes into a
synchronized state is called the initial establish. Each link I/O port will provide a maximum
throughput. Multiple LUNs in the initial establish will quickly saturate the links. This is
referred to as the aggregate copy rate and is dependent primarily on the number of links,
or bandwidth between sites. It is important to understand this copy rate to have a realistic
expectation about how long the initial establish will take to complete.
Production I/O will be given priority over DS6000 replication I/O activity. High production
I/O activity will negatively affect both initial establish data rates and synchronous copy data
rates.
We recommend that you do not share the Metro Mirror link I/O ports with host attachment
ports. Sharing can result in unpredictable Metro Mirror performance and a much more
complicated diagnosis in case of performance problems.
Distance is an important factor for both the initial establish data rate and synchronous write
performance. Data must travel to the other site, and the acknowledgement must travel back;
add to this the latency of any active components along the way. A good rule of thumb is to
allow 1 ms of additional response time per 100 km for a write I/O.
Distance also affects the establish data rate.
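As a small sketch of this rule of thumb (the base service time and distance below are assumptions for illustration), the expected synchronous write response time can be estimated as follows:

#!/bin/ksh
# Estimate Metro Mirror write response time: base time + 1 ms per 100 km.
base_ms=0.5      # assumed zero-distance write service time in ms
km=50            # assumed one-way distance between the sites
echo "$base_ms $km" | awk '{printf "estimated write response time: %.2f ms\n", $1 + $2/100}'

For a 50 km link this yields 0.5 + 0.5 = 1.0 ms per write.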
Scalability
The DS6000 Metro Mirror environment can be scaled up or down as required. If new volumes
are added to the DS6000 that require mirroring, they can be dynamically added. If additional
Metro Mirror paths are required, they also can be dynamically added.
Addition of capacity
As we have previously mentioned, the logical nature of the LSS has made a Metro Mirror
implementation on the DS6000 easier to plan, implement, and manage. However, if you need
to add more volumes or paths to your Metro Mirror environment, your management and
automation solutions should be set up to handle this.
Global Copy
This function, formerly known as PPRC-Extended Distance, copies data non-synchronously
and over longer distances than is possible with the Metro Mirror implementation.
When operating in Global Copy mode, the source volume sends a periodic, incremental copy
of updated tracks to the target volume, instead of sending a constant stream of updates. This
causes less impact to application writes for source volumes and less demand for bandwidth
resources, while allowing a more flexible use of the available bandwidth.
Global Copy does not keep a strict sequence of write operations, but you can make a
consistent copy through a periodic synchronization process (called a go-to-sync operation).
The Global Copy logical process differs from the Metro Mirror process as shown here in
Figure 14-7 on page 452.
[Figure 14-7: Global Copy write sequence - (1) write to the primary volume in the DS6000, (2) the primary DS6000 acknowledges to the application system that the write is complete; the data is then transferred to the secondary asynchronously.]
Performance improvement
The new technology used within the DS6000 means that the response time penalty for
synchronous mirrored writes on a DS6000 is less than that observed on the ESS 800, which
was slightly greater than one millisecond for a zero-distance 4 KB Metro Mirror write. Under
similar conditions, the response time penalty on the DS6000 is slightly less than one
millisecond.
Figure 14-8 Global Copy and Metro Mirror state change logic
When you initially establish a mirror relationship from a volume in Simplex state, you have
the option to request that it become a Global Copy pair (establish Global Copy arrow in
Figure 14-8), or a Metro Mirror pair (establish Metro Mirror arrow in Figure 14-8).
Pairs can change from the Copy Pending state to the Full Duplex state when a go-to-sync
is commanded (go-to-sync arrow in Figure 14-8).
You can also request that a pair be suspended as soon as it reaches the Full Duplex state
(go-to-sync and Suspended in Figure 14-8).
Pairs cannot change directly from the Full Duplex state to the Copy Pending state. They
need to go through an intermediate Suspended state.
You can go from the Suspended state to the Global Copy state by doing an incremental
copy (copying out-of-sync tracks only). This is similar to the traditional transition from the
Suspended state to the synchronous state (Resync/copy out-of-sync arrow in Figure 14-8).
The DS6000 Interoperability Matrix is available at:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
In order to use the Global Copy function you have to purchase the Remote Mirror and Copy
feature for the primary and secondary DS6000 systems.
For Global Copy paths you need at least one Fibre Channel connection between the two
DS6000 subsystems you want to set up in a physical Global Copy relationship. For higher
availability you must use at least one host Fibre Channel connection from each of the two
DS6000 servers.
The Fibre Channel ports used for Global Copy can be dedicated, meaning that they are used
only for Global Copy paths, or they can be shared between Global Copy and Fibre Channel
host data traffic; in the latter case you also need Fibre Channel switches for connectivity.
For supported SAN switches you can refer to the DS6000 Interoperability Matrix.
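For illustration (the storage image and volume IDs are hypothetical), a Global Copy pair is established with the same DS CLI mkpprc command as Metro Mirror, but with -type gcp:

mkpprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 -type gcp 0200:0200

The pair remains in Copy Pending state, and the primary periodically sends the updated tracks to the secondary as described above.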
[Figure: a single physical Fibre Channel link between two DS6000s can carry up to 256 bidirectional logical paths between LSS pairs.]
The paths are defined between the pair of LSSs that contain your source and target logical
volumes.
When using channel extender products with Global Copy, the channel extender vendor will
determine the maximum distance supported between the primary and secondary DS6000.
The channel extender vendor should be contacted for their distance capability, line quality
requirements, and WAN attachment capabilities.
A complete and current list of Global Copy supported environments, configurations, networks,
and products is available in the DS6000 Interoperability Matrix.
The channel extender vendor should be contacted regarding hardware and software
prerequisites when using their products in a DS6000 Global Copy configuration. Evaluation,
qualification, approval, and support of Global Copy configurations using channel extender
products is the sole responsibility of the channel extender vendor.
A simple way to envision DWDM is to consider that at the primary end, multiple fibre optic
input channels such as ESCON, Fibre Channel, FICON, or Gbit Ethernet, are combined by
the DWDM into a single fiber optic cable. Each channel is encoded as light of a different
wavelength. You might think of each individual channel as an individual color: the DWDM
system is transmitting a rainbow. At the receiving end, the DWDM fans out the different optical
channels. DWDM, by the very nature of its operation, utilizes the full bandwidth capability of
the individual channel. As the wavelength of light is, from a practical perspective, infinitely
divisible, DWDM extension technology is only limited by the sensitivity of its receptors. Thus, a
high aggregate bandwidth is possible.
A complete and current list of Global Copy supported environments, configurations, networks,
and products is available in the DS6000 Interoperability Matrix.
The DWDM vendor should be contacted regarding hardware and software prerequisites when
using their products in a DS6000 Global Copy configuration.
If you are going to have tertiary copies, then within the target Storage Image you should have
an available set of volumes ready to become the FlashCopy target. If your next step is to
dump the tertiary volumes onto tapes, then you must ensure that the tape resources are
capable of handling these dump operations in between the point-in-time checkpoints, unless
you have additional sets of volumes ready to become alternate FlashCopy targets within the
secondary Storage Images.
14.4.7 Performance
As the distance between DS6000s increases, Metro Mirror response time is proportionally
affected, and this negatively impacts the application performance. When implementations
over extended distances are needed, Global Copy becomes an excellent trade-off solution.
You can estimate the application impact of Global Copy as roughly that of an application
working with suspended Metro Mirror volumes. For the DS6000, there is slightly more work to
do with Global Copy volumes than with suspended volumes, because with Global Copy the
changes have to be sent to the remote DS6000; but this is a negligible overhead for the
application compared with the typical synchronous overhead.
Your Global Copy volume pairs consume no host processor resources (CPU and memory),
excluding your management solution, because the copying is managed by your DS6000
subsystem.
14.4.8 Scalability
The DS6000 Global Copy environment can be scaled up or down as required. If new volumes
are added to the DS6000 that require mirroring, they can be dynamically added. If additional
Global Copy paths are required, they also can be dynamically added.
This chapter discusses performance aspects when planning and configuring for Global Mirror
together with the potential impact to application write I/Os caused by the process used to form
a Consistency Group.
We also consider distributing the target Global Copy and target FlashCopy volumes across
different Ranks to balance load over the entire target storage server and minimize the I/O
load for selected busy volumes.
[Figure: the Global Mirror automatic cycle in an active session - (1) the host writes to the A volume, (2) the write is acknowledged immediately; Global Copy replicates the A volume to the B volume asynchronously, and the B volume is FlashCopied to the C volume automatically.]
In this example, the A volumes at the local site are the production volumes and are used as
Global Copy primary volumes. The data from the A volumes is replicated to the B volumes,
which are Global Copy secondary volumes. At a certain point in time, a Consistency Group is
created using all of the A volumes, even if they are located in different DS6000 or ESS boxes.
This has minimal application impact because the creation of the Consistency Group is very
quick (in the order of milliseconds).
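As a sketch of how the A volumes join a Global Mirror session (the session number, LSS, and volume range are hypothetical), the DS CLI mksession and chsession commands might be used as follows:

mksession -dev IBM.1750-1300247 -lss 10 01
chsession -dev IBM.1750-1300247 -lss 10 -action add -volume 1000-1003 01

Here session 01 is opened on LSS 10 and four Global Copy primary volumes are added to it; Consistency Group formation then operates on all volumes in the session.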
Note: The copy created with Consistency Group is a power-fail consistent copy, not
necessarily an application-based consistent copy. When you use this copy for recovery,
you may need to perform additional recovery operations, such as the fsck command in an
AIX filesystem.
Once the Consistency Group is created, the application writes can continue updating the A
volumes. The incremental changes that are part of the consistent data are sent to the B
volumes using the existing Global Copy relationship. As soon as all the consistent data
reaches the B volumes, it is FlashCopied to the C volumes.
The C volumes now contain a consistent copy of data. Because the B volumes normally
contain a fuzzy copy of the data from the local site (but not when doing the FlashCopy), the C
volumes are used to hold the most recent point-in-time consistent data while the B volumes
continue to be updated by the Global Copy relationship.
If a disaster occurs during the FlashCopy of the data, special procedures are needed to
finalize the FlashCopy.
In the recovery phase, the consistent copy is created in the B volumes. You need to
consider developing some operational processes to check and create the consistent copy.
You need to check the status of the B volumes for the recovery operations. Generally, these
check and recovery operations are complicated and difficult to perform with the SM GUI or
DS CLI in a disaster situation. Therefore, you may want to use management tools (for
example, the Global Mirror Utilities) or management software (for example, TDP for Disk
Replication Manager) to automate this recovery procedure.
The data at the remote site is typically current within 3 to 5 seconds, but this recovery point
objective (RPO) depends on the workload and on the bandwidth available to the remote site.
Note: Global Mirror can also be used for failover and failback operations. A failover
operation is the process of temporarily switching production to a backup facility (normally
your recovery site) following a planned outage, such as a scheduled maintenance period, or
an unplanned outage, such as a disaster. A failback operation is the process of returning
production to its original location. These operations use Remote Mirror and Copy functions
to help reduce the time that is required to synchronize volumes after the sites are switched
during a planned or unplanned outage.
This means that we should analyze performance at the production site, at the recovery site,
as well as between both sites, with the objective of providing a stable Recovery Point
Objective without significantly impacting production.
At the production site, where production I/O always has a higher priority over DS6000
replication I/O activity, the storage server needs resources to handle both loads. If your
primary storage server is already overloaded with production I/O, the potential delay
before a Consistency Group can be formed may become unacceptable.
The bandwidth between both sites needs to be sized for production load peaks.
At the recovery site, even if there is no local production I/O workload, it is hosting the
target Global Copy volumes and handling the inherent FlashCopy processing and will
need some performance evaluation.
This section looks at the aggregate impact of Global Copy and FlashCopy in the overall
performance of Global Mirror.
Remember that Global Copy itself has minimal or no significant impact on the response time
of an application write I/O to a Global Copy primary volume.
Each Global Copy write to its secondary volume during the time period between the formation
of successive Consistency Groups causes an actual FlashCopy write I/O operation on the
target server. This is described in Figure 14-13 where we summarize approximately what
happens between two Consistency Group creation points when application writes come in.
Figure 14-13 Global Copy with write hit at the remote site
1. The application write I/O completes immediately to volume A1 at the local site.
2. Eventually Global Copy replicates the application I/O and reads the data at the local site to
send to the remote site.
3. The modified track gets written across the link to the remote B1 volume.
4. FlashCopy nocopy detects that the track is about to change.
5. The original track is copied from B1 to the C1 volume.
This is an approximation of the sequence of internal I/O events. There are optimization and
consolidation effects which make the entire process quite efficient.
Figure 14-13 showed the normal sequence of I/Os within a Global Mirror configuration. The
critical path is between (2) and (3). Usually (3) is simply a write hit in the Persistent Cache (or
NVS) in B1 and some time later, and after (3) completes, the original FlashCopy source track
is copied from B1 to C1.
If persistent memory (non-volatile cache) is overcommitted in the secondary storage server,
there is some potential impact on Global Copy data replication performance. Figure 14-14
summarizes roughly what happens when persistent cache or NVS in the remote storage
server is overcommitted: a read (3) and a write (4) to preserve the source track and write it to
the C volume are required before the write (5) can complete. Eventually the track gets
updated on the B1 volume to complete write (5). But usually all writes are quick writes to
cache and persistent memory and happen in the order that Figure 14-12 on page 459
outlines.
A write I/O to the FlashCopy source volume also triggers the maintenance of a bitmap for the
source volume, which is created when the FlashCopy volume pair is established with the start
change recording attribute. This allows only the change recording bitmap to be replicated to
the corresponding bitmap for the target volume in the course of forming a Consistency
Group. A more detailed explanation of this processing may be found in IBM TotalStorage
DS6000 Series: Copy Services in Open Environments, SG24-6783, and IBM TotalStorage
DS6000 Series: Copy Services with IBM eServer zSeries, SG24-6782.
Note: This all applies only to write I/Os to Global Mirror primary volumes.
[Figure 14-15 shows the three phases of forming a Consistency Group: (1) serialize all Global Copy primary volumes, briefly holding write I/Os (the coordination time), (2) drain the remaining data from the local to the remote site over the PPRC paths (the drain time), and (3) perform the FlashCopy from the secondary (B) volumes to the tertiary (C) volumes.]
Figure 14-15 Coordination time - how does it impact application write I/Os?
The coordination time, which you can limit by specifying a number of milliseconds, is the
maximum impact to an application’s write I/Os you will allow, when forming a Consistency
Group. The intention is to keep the coordination time as small as possible. The default of 50
ms may be a bit high in a transaction processing environment; a valid number may also be
in the single-digit range. The required communication between the Master storage server and
potential Subordinate storage servers is inband over PPRC paths between the Master and
Subordinates. This communication is highly optimized and allows you to minimize the
potential application write I/O impact to 3 ms, for example. There must be at least one PPRC
FCP link between a Master storage server and each Subordinate storage server, although for
redundancy we recommend you use two PPRC FCP links.
The following example addresses the impact of the coordination time when Consistency
Group formation starts and whether this impact has the potential to be significant or not.
Assume a total aggregate of 5,000 write I/Os per second across two primary storage servers,
with 2,500 write I/Os per second to each storage server. Each write I/O takes 0.5 ms. You
specified a maximum of 3 ms to coordinate between the Master storage server and its
Subordinate storage server. Assume further that a Consistency Group is created every 3
seconds, which is the goal with a Consistency Group interval time of zero.
5,000 write I/Os
0.5 ms response time for each write I/O
Maximum coordination time is 3 ms
Every 3 seconds a Consistency Group is created
This is 5 I/Os every millisecond, or 15 I/Os within 3 ms. So each of these 15 write I/Os
experiences a 3 ms delay, and this happens every 3 seconds. We therefore observe an
average response time delay of approximately:
(15 I/Os x 0.003 s) / (3 s x 5,000 I/O per second) = 0.000003 s, or 0.003 ms
The average response time increases from 0.5 ms to 0.503 ms in this example.
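The same arithmetic can be generalized in a small script; this is an illustrative sketch only, with the input values taken from the example above:

#!/bin/ksh
# Average write delay added by Consistency Group coordination.
iops=5000        # aggregate write I/Os per second (assumed)
coord_ms=3       # maximum coordination time in ms (assumed)
interval_s=3     # seconds between Consistency Groups (assumed)
echo "$iops $coord_ms $interval_s" | awk '{
  delayed = $1 / 1000 * $2     # writes arriving during the coordination window
  total   = $1 * $3            # writes per Consistency Group interval
  printf "average added delay: %.4f ms\n", delayed * $2 / total
}'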
This drain period is the time required to replicate all remaining data for the Consistency
Group from the primary to the secondary storage server. It needs to complete within the limit
set by the maximum drain time parameter. The default is 30 seconds, which may be too small
in an environment with a write-intensive workload; a value in the range of 300 to 600 seconds
may need to be considered.
The actual replication process usually does not impact the application write I/O. There is a
very slight chance that the very same track within a Consistency Group is updated before
that track has been replicated to the secondary site within the specified drain time period.
When this unlikely event happens, the affected track is immediately replicated to the
secondary storage server before the application write I/O modifies the original track. The
affected application write I/O experiences a response time similar to that of an I/O written to a
Metro Mirror primary volume.
Note that subsequent writes to this same track do not experience any delay, because the
track has already been replicated to the remote site.
There are two loads to consider: the production load at the production site, and the Global
Mirror load on both sites.
There are only production volumes at the production site. There, fast disk and cache can be
sized to fit production needs. At the same time, the storage server needs to be able to handle
both the production and the replication workload. But because only source Global Copy
volumes are hosted there, beyond balancing the full production load across all the storage
server Ranks and both servers of the cluster, nothing else can really be done.
On the other hand, at the recovery site only Global Mirror volumes have to be considered, but
they are of two types: target Global Copy and target FlashCopy. Because the target Global
Copy volumes are modified when the target FlashCopy volumes are not, and vice versa, both
types of volumes can share the same Ranks, possibly built from larger disks, since the
storage at the recovery site is consolidated and shared between target Global Copy and
target FlashCopy volumes that are not active at the same time.
Figure 14-16 Remote storage server configuration, all Ranks contain equal numbers of volumes
The goal is to put the same number of each volume type into each Rank. By volume type, we
refer here to the B volumes and C volumes within a Global Mirror configuration. To avoid
performance bottlenecks, spread busy volumes over multiple Ranks; otherwise, hot spots
may be concentrated on single Ranks when you put a B volume and its C volume on the very
same Rank. We recommend spreading B and C volumes as Figure 14-16 suggests.
Another approach is to focus on very busy volumes and keep those volumes on separate
Ranks from all the other volumes.
With mixed Disk Drive Module (DDM) capacities and speeds at the remote storage server,
consider spreading the B volumes not only over the fast DDMs but over all Ranks, following a
similar approach to the one Figure 14-16 recommends. You may keep particularly busy B and
C volumes on the faster DDMs.
[Figure 14-17: remote storage server configuration with B and C volumes spread across Ranks 1 to 3, and additional D volumes placed on a separate Rank 4.]
In addition to the three Global Mirror volume types, Figure 14-17 also shows D volumes,
which may be created from time to time for test or other purposes. Here we suggest, as an
alternative, a Rank with larger and perhaps slower DDMs. The D volumes may be read from
another host, and any other I/O to the D volumes does not impact the Global Mirror volumes
in the other Ranks. Note that a NOCOPY relationship between B and D volumes will read the
data from B when it is accessed through the D volume; so you may consider a physical
COPY when you create D volumes on a different Rank. This separates additional I/O to the D
volumes from I/O to the Ranks holding the B volumes.
An option here may be to spread all B volumes across all Ranks again and also configure the
same number of volumes in each Rank.
Still, put the B and C volumes in different Ranks. It is further recommended to configure
corresponding B and C volumes in such a way that these volumes have an affinity to the
same server. Ideally, the B volumes would also be connected to a different DA pair than the
C volumes.
Tip: We recommend starting with the default values for these parameters.
The default for the maximum drain time is 30 seconds, which is normally sufficient time to
transfer a consistency group to the target system while still allowing for intermittent
communications issues. The system will favor host I/O activity at the expense of forming
consistency groups.
If it has been unable to form consistency groups for 30 minutes, Global Mirror forms a
consistency group irrespective of the maximum drain time setting.
However, increasing the maximum drain time also increases the time between successive
FlashCopies at the remote site, and increasing this value may be counterproductive in
high-bandwidth environments, because frequent consistency group formation reduces the
overhead of copy-on-write processing.
The default for the Consistency Group Interval is 0 seconds, so Global Mirror continuously
forms consistency groups as fast as the environment allows. We recommend leaving this
parameter at the default and allowing Global Mirror to form consistency groups as fast as
possible for the workload, because it automatically changes to Global Copy mode for a
period of time if the drain time is exceeded.
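For illustration (the storage image ID, master LSS, and session number are hypothetical), these tuning parameters are specified when the Global Mirror session is started with the DS CLI mkgmir command; the values shown correspond to the defaults discussed above:

mkgmir -dev IBM.1750-1300247 -lss 10 -cginterval 0 -coordinate 50 -drain 30 -session 01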
Three-site solution
The combination of Metro Mirror and Global Copy, called Metro/Global Copy, is currently
available on the ESS 750 and ESS 800. It is a three-site approach that offers very good
recovery in the event of the failure of any one or two of the three sites. You first copy your
data synchronously to an intermediate site, and from there you copy it asynchronously to a
more distant site. For an overview of this solution, refer to “z/OS Metro/Global Mirror” in IBM
TotalStorage DS8000 Series: Copy Services with IBM eServer zSeries, SG24-6787.
z/OS Metro/Global Mirror uses z/OS Global Mirror to mirror primary site data to a remote
location, and also uses Metro Mirror for primary site data to a location within Metro Mirror
distance limits. This gives you a three-site high-availability and disaster recovery solution.
Figure 14-18 shows an example of a three-site z/OS Metro/Global Mirror solution.
[Figure 14-18: Metro Mirror at metropolitan distance between Site 1 and Site 2, and z/OS Global Mirror at unlimited distance from Site 2 to Site 3.]
In the example shown in Figure 14-18, we see that the zSeries server in Site 1 is normally
accessing DS8000 disk at Site 2. These disks are mirrored back to Site 1 to another DS8000
via Metro Mirror. At the same time, the Site 2 disks are using z/OS Global Mirror pairs
established to Site 3, which can be at continental distances from Site 2. This covers the
following potential failure scenarios:
Site 1 disaster - Site 3 can be brought up after the Site 2 updates have completed mirroring
across to Site 3. The Site 3 FlashCopy disk could be used to preserve a copy of the original
recovery point.
Site 2 disaster - Site 1 can readily switch to using the Site 1 disk. Mirroring to Site 3 will be
suspended.
Tip: z/OS Global Mirror, also known as extended remote copy or XRC, is available on the
DS8000, DS6000, ESS 750 and the ESS 800, but it is recommended that you only use the
DS6000 as an XRC target storage server.
For a more complete discussion of z/OS Global Mirror (XRC), refer to IBM TotalStorage
DS8000 Series: Performance Monitoring and Tuning, SG24-7146, and IBM TotalStorage
DS8000 Series: Copy Services with IBM eServer zSeries, SG24-6787.
DFSMSdss will implicitly or explicitly utilize the storage subsystem’s FlashCopy functionality
for volume and data set copies, if it finds that the source and targets are eligible. The user will
notice that a fast copy capability has been used in the SYSLOG messages from the utility.
There is no requirement for the user to change their JCL, or to code extra parameters to
achieve this, because the default option of FASTREPLICATION(PREFERRED) for the COPY
function attempts to use FlashCopy.
Here are some of the conditions that will result in DFSMSdss using FlashCopy within the
DS6000 to complete some volume replication functions:
The source and target devices must have the same track format.
The source volumes and target volumes are in the same DS6000.
The source and target volumes must be online.
No other FlashCopy relationship is active for the same source volume. If one is, DFSMSdss
performs the copy in host software instead.
The FASTREPLICATION(NONE) keyword must not be specified.
Detailed information can be found in the IBM publication z/OS DFSMSdss Storage
Administration Reference SC35-0424.
Tip: We recommend the use of DFSMSdss for full volume copies where possible, because
not all tracks on the volume will normally need to be copied. When DFSMSdss invokes
FlashCopy for a full volume copy, it requests FlashCopy for allocated extents only, which
can lead to more effective copying than with alternative invocations.
To balance excluding free space against limiting the number of FlashCopy relationships, up
to 255 copy relationships may be created for each full-volume copy. If there are more than
255 separate allocated extents on the source volume, the DFSMSdss copy function does
some initial extent management before starting the FlashCopy: some extents are merged (to
reduce the number of extents to copy), resulting in some free space being copied for a highly
fragmented logical volume.
Here are some high-level suggestions covering aspects of Copy Services you may want to
manage and evaluate.
Be aware that the DS6000 limits the possible impact that Copy Services tasks might have on
normal user I/O, in order to favor user I/O. This means that you should not expect all copy
tasks to complete at the same time, because they have restricted access to server resources
during their copy operations. Consider establishing your copy pairs at different times in order
to spread this server workload.
FlashCopy implementation
Consider the use of Incremental FlashCopy if you make regular copies of a volume: fewer
background copy operations are involved each time a fresh copy is made, and it completes
faster. This has the added optional feature of a reverse restore for recovery from the target
back to the source.
Consider the use of FlashCopy Consistency Groups for much faster database recovery
(recovery becomes possible in minutes rather than many hours).
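As an illustration of Incremental FlashCopy (the volume IDs are hypothetical), the initial copy is established with change recording enabled, and each refresh then copies only the changed tracks:

mkflash -dev IBM.1750-1300247 -record -persist 0100:0200
resyncflash -dev IBM.1750-1300247 -record -persist 0100:0200

The -record option starts change recording and -persist keeps the relationship after the background copy completes, which is what makes the subsequent resyncflash incremental.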
Descriptions of some of the available counters for showfbvol -metrics are shown in
Example 14-1, for showrank -metrics in Example 14-2 on page 473 and for showioport
-metrics in Example 14-3 on page 473.
You should also periodically monitor traffic through your host I/O ports with the showioport
-metrics command, so that you can identify increases in I/O.
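A minimal sketch of such periodic monitoring is shown below; the device and port IDs are hypothetical, and a configured DS CLI profile is assumed:

#!/bin/ksh
# Sample DS6000 I/O port metrics every 15 minutes.
while true
do
    date
    dscli showioport -dev IBM.1750-1300247 -metrics I0001
    sleep 900
done

Comparing successive samples of the read and write counters shows how traffic through each host port grows over time.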
Appendix A. Benchmarking
Benchmarking storage systems has become very complex over the years given all of the
hardware and software parts being used for storage systems. In this appendix, we discuss
the goals and the ways to conduct an effective storage benchmark.
To conduct a benchmark, you need a solid understanding of all parts of your environment.
This understanding includes not only the storage system requirements but also the SAN
infrastructure, the server environments, and the applications. Recreating a representative
emulation of the environment, including actual applications and data, along with user
simulation, provides an efficient and accurate analysis of the performance of the storage
system being tested. A key characteristic of a performance benchmark test is that its results
must be reproducible to validate the test's integrity.
Performance is not the only component that should be considered in benchmark results.
Reliability and cost effectiveness are also parameters that must be considered. Balancing
benchmark performance results against reliability features and the total cost of ownership of
the storage system gives you a global view of the storage product's value.
The popularity of such benchmarks depends on how representative their workloads are of
the main and newer workloads that companies are deploying today. If the generic benchmark
workloads are representative of your production, you can use the various benchmark results
to identify the product you should implement in your production environment. But if the
generic benchmark definition is not representative, or does not include your requirements or
restrictions, running a dedicated benchmark designed to be representative of your workload
gives you the ability to choose the right storage system.
The OLTP category typically has many users, all accessing the same disk storage system
and a common set of files. The requests are typically spread across many files, therefore the
file sizes are typically small and randomly accessed. Typical applications consist of a network
file server or disk subsystem being accessed by a sales department entering order
information.
To identify the specificity of your production workload, you can use monitoring tools available
at the operating system level.
The first way, and the most complex, is to set up the production environment, including the
application software and the application data. In this case, you have to ensure that the
application is well configured and optimized on the server operating system. The data volume
also has to be representative of the production environment. Depending on your application,
the workload can be generated using application scripts or an external transaction simulation
tool. These kinds of tools simulate users accessing your application; you use workload tools
to provide application stress from end to end. To configure an external simulation tool, you
first record a standard request from a single user and then generate this request several
times. This can emulate hundreds or thousands of concurrent users, putting the application
through the rigors of real-life user loads while measuring the response times of key business
processes. Examples of available software include IBM Rational Software and Mercury
LoadRunner.
The other way to generate the workload is to use a standard workload generator. These tools,
specific to each operating system, produce different kinds of workloads. You can configure
and tune these tools to match your application workload. The main tuning parameters include
the type of workload (sequential or random), the read/write ratio, the I/O block size, the
number of I/Os per second, and the test duration. With a minimum of setup, these simulation
tools can help you recreate your production workload without setting up all of the software
components. Examples of available software include iozone and iometer.
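For example (the test file location and sizes are arbitrary choices), a basic iozone run measuring sequential write and read throughput with a 64 KB record size against a 1 GB test file might look like this:

iozone -i 0 -i 1 -r 64k -s 1g -f /mnt/ds6k/iozone.tmp

Here -i 0 and -i 1 select the write and read tests, -r sets the record (I/O block) size, and -s sets the file size; the file should be large enough to exceed host caching effects.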
Attention: Each workload test must be defined with a minimum time duration in order to
eliminate any side effects or warm-up period, such as populating cache, which could
generate incorrect results.
Note that monitoring can have an impact on component performance. In that case, you
should use the monitoring tools in the first sequence of tests to understand your workload,
and then disable them in order to eliminate any impact that could distort performance results.
During a benchmark, each scenario has to be run several times: first, to understand how the
different components are performing (using monitoring tools) and to identify bottlenecks;
then, to test different ways of obtaining an overall performance improvement by tuning each
of the different components.
By downloading the Acrobat PDF version of this publication, you should be able to copy and
paste these scripts for easy installation on your host systems. To function properly, the scripts
presented here rely on:
An AIX host running AIX 5L
Subsystem Device Driver (SDD) for AIX Version 1.3.1.0 or later
Attention: These scripts are provided on an ‘as is’ basis. They are not supported or
maintained by IBM in any formal way. No warranty is given or implied, and you cannot
obtain help with these scripts from IBM.
vgmap
The vgmap script displays which vpaths a Volume Group uses and also which Rank each
vpath belongs to. Use this script to determine if a Volume Group is made up of vpaths on
several different Ranks and which vpaths to use for creating striped logical volumes.
Example output of the vgmap command is shown in Example B-1. The vgmap shell script is in
Example B-2.
# AIX - vgmap (excerpt)
# workfile and sortfile locations are assumed defaults; the Rank-lookup
# section of the full script (which uses sortfile) is omitted from this excerpt.
workfile=/tmp/vgmap.$$
sortfile=/tmp/vgmap.sort.$$
lsvg -p $1 | grep -v "PV_NAME" > $workfile
echo "\nPV_NAME RANK PV STATE TOTAL PPs FREE PPs Free D"
cat $workfile
rm -f $workfile
rm -f $sortfile
########################## THE END ######################
lvmap
The lvmap script displays which vpaths and Ranks a logical volume uses. Use this script to
determine if a logical volume spans vpaths on several different Ranks. The script does not tell
you if a logical volume is striped or not. Use lslv <lv_name> for that information or modify this
script.
An example output of the lvmap command is shown in Example B-3. The lvmap shell script is
in Example B-4.
lvmap.ksh 8000stripelv
LV_NAME RANK COPIES IN BAND DISTRIBUTION
8000stripelv:N/A
vpath4 0000 010:000:000 100% 010:000:000:000:000
vpath5 ffff 010:000:000 100% 010:000:000:000:000
# cleanup at the end of the lvmap script (excerpt)
rm $workfile
rm $sortfile
vpath_iostat
The vpath_iostat script is a wrapper program for AIX that converts iostat information based
on hdisk devices to vpaths instead.
The script first builds a map file to list hdisk devices and their associated vpaths and then
converts iostat information from hdisks to vpaths.
To run the script, make sure the SDD datapath query essmap command is working
properly—that is, all Volume Groups are using vpaths instead of hdisk devices.
Run the script with no arguments to accept the default interval and iteration count, or specify
them explicitly:
vpath_iostat <interval> <iteration>
An example of the output vpath_iostat produces is shown in Example B-5. The vpath_iostat
shell script is in Example B-6 on page 485.
##########################################################
# set the default period for number of seconds to collect
# iostat data before calculating average
period=5
iterations=1000
# ifile and essfile locations are assumed defaults for this excerpt;
# the full script in Example B-6 converts $ifile into the $essfile map
ifile=/tmp/lsvpath.out
essfile=/tmp/lsess.out
#############################################
# Create a list of the vpaths this system uses
# Format: hdisk DS-vpath
# datapath query essmap output MUST BE correct or the IO stats reported
# will not be correct
#############################################
if [ ! -f $ifile ]
then
echo "Collecting DS6000 info for disk to vpath map..."
datapath query essmap > $ifile
fi
#########################################
# ADD INTERNAL SCSI DISKS to VPATH list
#########################################
for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'`
do
echo "$internal $internal" >> $essfile
done
###############################################
# Set interval value or leave as default
if [[ $# -ge 1 ]]
then
period=$1
fi
##########################################
# Set <iteration> value
if [[ $# -eq 2 ]]
then
iterations=$2
fi
#################################################################
# ess_iostat <interval> <count>
i=0
while [[ $i -lt $iterations ]]
do
iostat $period 2 > $ofile # run 2 iterations of iostat
# first run is IO history since boot
grep hdisk $ofile > $ofile.temp # only gather hdisk info- not cd
# other devices
###########################################
#Converting hdisks to vpaths.... #
###########################################
for j in `cat $wfile | awk '{print $1}'`
do
vpath=`grep -w $j $essfile | awk '{print $2}'`
sed "s/$j /$vpath/g" $wfile > $wfile2
cp $wfile2 $wfile
done
###########################################
# Determine Number of different VPATHS used
###########################################
numvpaths=`cat $wfile | awk '{print $1} ' | grep -v hdisk | sort -u | wc -l`
print "\n$hname: Total VPATHS used: $numvpaths $dt $period sec interval"
printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Vpath:" "MBps" "tps" \
"KB/trans" "MB_read" "MB_wrtn"
###########################################
END {
if ( tpsum > 0 )
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
vpath, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000)
else
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
vpath, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000)
}' hname="$hname" vpath="$x" >> $wfile2.tmp
done
#############################################
# Sort VPATHS/hdisks by NUMBER of TRANSACTIONS
#############################################
if [[ -f $wfile2.tmp ]]
then
cat $wfile2.tmp | sort +3 -n -r
rm $wfile2.tmp
fi
##############################################################
# SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL
##############################################################
#Disks: % tm_act Kbps tps Kb_read Kb_wrtn
# field 5 read field 6 written
tail -n $pvcount $ofile.temp | grep -v "0.0 0.0 0.0 0 \
0" | awk 'BEGIN { }
{ rsum=rsum+$5 }
{ wsum=wsum+$6 }
END {
rsum=rsum/1000
wsum=wsum/1000
printf
("------------------------------------------------------------------------------------------\n")
if ( divider > 1 )
{
printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \
rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB")
}
printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n\n\n", hname, "READ SPEED: ", \
rsum/divider, "MB/sec", "WRITE SPEED: ", wsum/divider, "MB/sec" )
}' hname="$hname" divider="$period"
let i=$i+1
#
rm $ofile
rm $wfile
rm $wfile2
rm $essfile
ds_iostat
The ds_iostat script is a wrapper program for AIX that converts iostat information based on
hdisk devices to Ranks instead.
The ds_iostat script depends on the SDD datapath query essmap command and iostat.
The script first builds a map file to list hdisk devices and their associated Ranks and then
converts iostat information from hdisks to Ranks.
Run the script with no arguments to accept the default interval and iteration count, or specify
them explicitly:
ds_iostat <interval> <iteration>
An example of the ds_iostat output is shown in Example B-7. The ds_iostat shell script is in
Example B-8.
garmo-aix: Total RANKS used: 12 20:01 Sun 16 Feb 2003 5 sec interval
garmo-aix Ranks: MBps tps KB/trans MB_read MB_wrtn
garmo-aix 1403 9.552 71.2 134.2 47.8 0.0
garmo-aix 1603 6.779 53.8 126.0 34.0 0.0
garmo-aix 1703 5.743 43.0 133.6 28.8 0.0
garmo-aix 1503 5.809 42.8 135.7 29.1 0.0
garmo-aix 1301 3.665 32.4 113.1 18.4 0.0
garmo-aix 1601 3.206 27.2 117.9 16.1 0.0
garmo-aix 1201 2.734 22.8 119.9 13.7 0.0
garmo-aix 1101 2.479 22.0 112.7 12.4 0.0
garmo-aix 1401 2.299 20.4 112.7 11.5 0.0
garmo-aix 1501 2.180 19.8 110.1 10.9 0.0
garmo-aix 1001 2.246 19.4 115.8 11.3 0.0
garmo-aix 1701 2.088 18.8 111.1 10.5 0.0
------------------------------------------------------------------------------------------
garmo-aix TOTAL READ: 430.88 MB TOTAL WRITTEN: 0.06 MB
garmo-aix READ SPEED: 86.18 MB/sec WRITE SPEED: 0.01 MB/sec
##########################################################
# set the default period for number of seconds to collect
# iostat data before calculating average
period=5
iterations=1000
essfile=/tmp/lsess.out
#############################################
# Create a list of the ranks this system uses
# Format: hdisk DS-rank
# datapath query essmap output MUST BE correct or the IO stats reported
# will not be correct
#############################################
datapath query essmap|grep -v "*"|awk '{print $2 "\t" $11}' > $essfile
#########################################
# ADD INTERNAL SCSI DISKS to RANKS list
#########################################
for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'`
do
echo "$internal $internal" >> $essfile
done
###############################################
# Set interval value or leave as default
if [[ $# -ge 1 ]]
then
period=$1
fi
##########################################
# Set <iteration> value
if [[ $# -eq 2 ]]
then
iterations=$2
fi
#################################################################
# ess_iostat <interval> <count>
i=0
while [[ $i -lt $iterations ]]
do
iostat $period 2 > $ofile # run 2 iterations of iostat
# first run is IO history since boot
grep hdisk $ofile > $ofile.temp # only gather hdisk info- not cd
# other devices
###########################################
#Converting hdisks to ranks.... #
###########################################
for j in `cat $wfile | awk '{print $1}'`
do
rank=`grep -w $j $essfile | awk '{print $2}'`
sed "s/$j /$rank/g" $wfile > $wfile2
cp $wfile2 $wfile
done
###########################################
# Determine Number of different ranks used
###########################################
numranks=`cat $wfile | awk '{print $1} ' | grep -v hdisk | cut -c 1-4| sort -u -n | wc -l`
dt=`date +"%H:%M %a %d %h %Y"`
print "\n$hname: Total RANKS used: $numranks $dt $period sec interval"
printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Ranks:" "MBps" "tps" \
"KB/trans" "MB_read" "MB_wrtn"
###########################################
# Sum Usage for EACH RANK and Internal Hdisk
###########################################
for x in `cat $wfile | awk '{ print $1}' | sort -u`
do
cat $wfile | grep -w $x | awk '{ printf ("%4d\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" , \
$1, $2, $3, $4, $5, $6) }' | awk 'BEGIN {
}
{ tmsum=tmsum+$2 }
{ kbpsum=kbpsum+$3 }
{ tpsum=tpsum+$4 }
{ kbreadsum=kbreadsum+$5 }
{ kwrtnsum=kwrtnsum+$6 }
END {
if ( tpsum > 0 )
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
rank, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000)
else
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
rank, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000)
}' hname="$hname" rank="$x" >> $wfile2.tmp
done
#############################################
# Sort RANKS/hdisks by NUMBER of TRANSACTIONS
#############################################
if [[ -f $wfile2.tmp ]]
then
cat $wfile2.tmp | sort +3 -n -r
rm $wfile2.tmp
fi
##############################################################
# SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL
##############################################################
#Disks: % tm_act Kbps tps Kb_read Kb_wrtn
# field 5 read field 6 written
tail -n $pvcount $ofile.temp | grep -v "0.0 0.0 0.0 0 0" \
| awk 'BEGIN { }
{ rsum=rsum+$5 }
{ wsum=wsum+$6 }
END {
rsum=rsum/1000
wsum=wsum/1000
printf
("------------------------------------------------------------------------------------------\n")
if ( divider > 1 )
{
printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \
rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB")
}
let i=$i+1
done
rm $ofile
rm $wfile
rm $wfile2
rm $essfile
################################## THE END ##########################
test_disk_speeds
Use the test_disk_speeds script to run a 100 MB sequential read against one raw vpath
(rvpath0) and record the speed at different times throughout the day, to get the average read
speed that a Rank is capable of in your environment.
You can change the amount of data read, the block size, and the vpath by editing the script
and changing its variables.
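The core of such a test is equivalent to timing a raw sequential read with dd, roughly as follows (the block size and count shown are one possible choice that reads about 100 MB):

#!/bin/ksh
# Time a ~100 MB sequential read from a raw vpath device.
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=800

Dividing the amount of data read by the elapsed time gives the sequential read rate for that vpath, and hence for the underlying Rank.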
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this redbook.
IBM Redbooks
For information on ordering these publications, see “How to get IBM Redbooks” on page 495.
Note that some of the documents referenced here may be available in softcopy only.
IBM TotalStorage DS6000 Series: Implementation, SG24-6781
IBM TotalStorage DS6000 Series: Copy Services in Open Environments, SG24-6783
IBM TotalStorage DS6000 Series: Copy Services with IBM eServer zSeries, SG24-6782
IBM TotalStorage DS6000 Series: Concepts and Architecture, SG24-6471
IBM TotalStorage Solutions for Business Continuity Guide, SG24-6547
The IBM TotalStorage Solutions Handbook, SG24-5250
iSeries and IBM TotalStorage: A Guide to Implementing External Disk on IBM eServer i5,
SG24-7120
Managing Disk Subsystems using IBM TotalStorage Productivity Center, SG24-7097
Using IBM TotalStorage Productivity Center for Disk to Monitor the SVC, REDP-3961
You may find the following Redbooks related to the DS8000 and ESS useful, particularly if
you are implementing a mixed systems environment with Copy Services.
IBM TotalStorage DS8000 Series: Implementation, SG24-6786
IBM TotalStorage DS8000 Series: Concepts and Architecture, SG24-6452
IBM TotalStorage DS8000 Series: Copy Services in Open Environments, SG24-6788
IBM TotalStorage DS8000 Series: Copy Services with IBM eServer zSeries, SG24-6787
IBM TotalStorage Enterprise Storage Server: Implementing ESS Copy Services in Open
Environments, SG24-5757
IBM TotalStorage Enterprise Storage Server: Implementing ESS Copy Services with IBM
eServer zSeries, SG24-5680
DFSMShsm ABARS and Mainstar Solutions, SG24-5089
Practical Guide for SAN with pSeries, SG24-6050
Fault Tolerant Storage - Multipathing and Clustering Solutions for Open Systems for the
IBM ESS, SG24-6295
Implementing Linux with IBM Disk Storage, SG24-6261
Linux with zSeries and ESS: Essentials, SG24-7025
Other publications
These publications are also relevant as further information sources:
IBM TotalStorage DS6000 Installation, Troubleshooting, and Recovery Guide, GC26-7678
Online resources
These Web sites and URLs are also relevant as further information sources:
Documentation for DS6800
http://www.ibm.com/servers/storage/support/disk/ds6800/
SDD and Host Attachment scripts
http://www.ibm.com/support/
IBM Disk Storage Feature Activation (DSFA)
http://www.ibm.com/storage/dsfa
The PSP information
http://www-1.ibm.com/servers/resourcelink/svc03100.nsf?OpenDatabase
Documentation for the DS6000
http://www.ibm.com/servers/storage/support/disk/1750.html
The interoperability matrix
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Fibre Channel host bus adapter firmware and driver level matrix
http://knowledge.storage.ibm.com/servers/storage/support/hbasearch/interop/hbaSearch.do
ATTO
http://www.attotech.com/
Emulex
http://www.emulex.com/ts/dds.html
JNI
http://www.jni.com/OEM/oem.cfm?ID=4
QLogic
http://www.qlogic.com/support/ibm_page.html
IBM
http://www.ibm.com/storage/ibmsan/products/sanfabric.html
McDATA
http://www.mcdata.com/ibm/
Cisco
http://www.cisco.com/go/ibm/storage