ibm.com/redbooks
International Technical Support Organization
December 2005
SG24-7145-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page xvii.
This edition applies to the IBM TotalStorage DS6000 and its capabilities as of August 2005.
Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
The team that wrote this redbook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
5.4.2 Fibre Channel topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5 SAN implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.5.1 Description and characteristics of a SAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.5.2 Benefits of a SAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.5.3 SAN cabling for availability and performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.5.4 Importance of establishing zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.5.5 LUN masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.5.6 Configuring logical disks in a SAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.6 Subsystem Device Driver (SDD) - multipathing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.6.1 SDD load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.6.2 Concurrent LMC load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6.3 Single path mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6.4 Single FC adapter with multiple paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.6.5 Path failover and online recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.6.6 Using SDDPCM on an AIX host system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.6.7 SDD datapath command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.8 Disk bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.9 Other performance resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
11.5 Additional information about iSeries performance . . . . . . . . . . . . . . . . . . . . . . . . . . 405
11.5.1 Publications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
11.5.2 Web sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Caution using benchmark results to design production . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
10-2 Concurrent I/O with PAV and Multiple Allegiance . . . . . . . . . . . . . . . . . . . . . . . . . . 361
10-3 Concurrent read operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10-4 Concurrent write operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
10-5 Number of volumes on a (6+P) RAID 5 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
10-6 DB2 large volume performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
10-7 DSS dump large volume performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
10-8 Channel utilization limits for hypothetical workloads . . . . . . . . . . . . . . . . . . . . . . . . 368
10-9 FICON port and channel throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
10-10 Daisy chaining DS6000s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
10-11 Sample set of RMF Magic workload summary charts . . . . . . . . . . . . . . . . . . . . . . . 381
10-12 I/O and data rate summary for a single subsystem . . . . . . . . . . . . . . . . . . . . . . . . . 382
10-13 Cache summary for a single subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
10-14 Breakdown of measurement data by SSID within a subsystem . . . . . . . . . . . . . . . 384
10-15 Summary of subsystem response time components . . . . . . . . . . . . . . . . . . . . . . . . 385
11-1 Performance Tools Disk Utilization Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
13-1 DB2 UDB logical structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
13-2 Allocating DB2 containers using a “spread your data” approach. . . . . . . . . . . . . . . 428
13-3 IMS large volume performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
14-1 FlashCopy establish. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
14-2 FlashCopy interfaces and functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
14-3 Synchronous logical volume replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
14-4 Logical paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
14-5 Logical paths for Metro Mirror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
14-6 Symmetrical Metro Mirror configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
14-7 Asynchronous logical volume replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
14-8 Global Copy and Metro Mirror state change logic . . . . . . . . . . . . . . . . . . . . . . . . . . 453
14-9 Logical paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
14-10 Global Copy environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
14-11 Global Mirror overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
14-12 How Global Mirror works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
14-13 Global Copy with write hit at the remote site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
14-14 Application write I/O within two Consistency Group points . . . . . . . . . . . . . . . . . . . 462
14-15 Coordination time - how does it impact application write I/Os? . . . . . . . . . . . . . . . . 463
14-16 Remote storage server configuration, all Ranks contain equal numbers of volumes 465
14-17 Remote storage server with D volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
14-18 A three-site z/OS Metro/Global Mirror implementation . . . . . . . . . . . . . . . . . . . . . . 468
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are
inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and
distribute these sample programs in any form without payment to IBM for the purposes of developing, using,
marketing, or distributing application programs conforming to IBM's application programming interfaces.
Java, JDK, Solaris, Sun, Sun Microsystems, Ultra, and all Java-based trademarks are trademarks of Sun Microsystems,
Inc. in the United States, other countries, or both.
BackOffice, Excel, Microsoft, Windows server, Windows NT, Windows, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
i386, Intel, Itanium, Pentium, Xeon, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
This IBM® Redbook provides guidance about how to configure, monitor, and manage your
IBM TotalStorage® DS6000 to achieve optimum performance. We describe the DS6000
performance features and characteristics and how they can be exploited with the different
server platforms that can attach to it. Then in consecutive chapters we detail the specific
performance recommendations and discussions that apply for each server environment, as
well as for database and Copy Services environments.
We also outline the various tools available for monitoring and measuring I/O performance for
the different server environments, as well as how to monitor performance of the entire
DS6000 subsystem.
Cathy Warrick is a Project Leader and Certified IT Specialist in the IBM International
Technical Support Organization. She has over 27 years of experience in IBM with large
systems, open systems, and storage, including education on products internally and for the
field. Prior to joining the ITSO three years ago, she developed the Technical Leadership
education program for IBM and IBM Business Partner’s technical field force and was the
Program Manager for the Storage Top Gun classes.
Benoit Granier has been part of the IBM European Advanced Technical Support Center in
Montpellier, France, for three years. As an IT Specialist, he started working at the pSeries®
and TotalStorage Benchmark Center. He is now responsible for the Early Shipment Programs
for storage disk systems in EMEA. Benoit's areas of expertise include: mid-range/high-end
storage solutions (IBM DS4000/ESS/DS8000), virtualization (IBM SAN Volume Controller),
and high-end IBM eServer® pSeries servers. Benoit has a degree in Telecommunication
from ESIGETEL.
Keitaro Imai has been working in AP Advanced Technical Support in Japan, where he has
been engaged in open systems storage as an IT Specialist for three years. He mainly supported DS4000
(formerly FAStT) for two years. He was assigned to the DS6000 Support Team at the end of
last year, where he provides technical consultation, troubleshooting, and support, including
critical situations and skills transfer.
Brannen Proctor is a Senior IT Specialist with the IBM Storage Techline organization in
Atlanta, Georgia. He has been in Techline since 1998, providing pre-sales technical support
on IBM disk, tape, and SAN storage products. He is currently the Team Leader of the
Business Partner support team, and also coordinates training activities for the Storage
Techline team. Prior to joining the Techline organization, he was a Transition Leader in IBM
Global Services for five years. Prior to joining IBM, he held positions in the computer industry
as a programmer, systems programmer, performance analyst, and internal consultant.
Jim Sedgwick is a member of the Americas Storage Advanced Technical Support team in
Raleigh, North Carolina. His main responsibility is open systems storage performance. His 23
year career has included work in IBM Global Services, IBM Sales and Distribution, and IBM
Printer Advanced Development.
Paulus Usong started his IBM career in Indonesia decades ago. He moved to New York and
worked at a bank for a few years, before rejoining IBM at the Santa Teresa Lab (now called
the Silicon Valley Lab). In 1995 he joined the Advanced Technical Support group in San Jose.
Currently he is a Consulting IT Specialist and his main responsibility is handling DASD
performance CritSits and performing XRC sizing for customers who want to implement a
disaster recovery system using this option.
Mary Ann Vandermark is a Product Field Engineer (PFE) in the Washington D.C. area and
has worked for IBM for seven years. Her career began with a focus on quality assurance
processes and testing of hardware and software storage products. Mary Ann developed and
published a field Escape Analysis process used to improve product quality and test
effectiveness. Her current responsibilities include on-site support for ESS and DS6000/8000
for secure U.S. Government accounts in addition to remote PFE support for North American
accounts in the private sector.
John Wickes has more than 30 years of experience in the IT industry, with many years as a
mainframe MVS™ Operating Systems Specialist, several years as the IBM ANZ MVS
Instructor, and more recently as a Storage and Storage Area Network (SAN) design and
Implementation Specialist. John has been involved in several Copy Services projects,
including both FlashCopy® and Peer-to-Peer Remote Copy (PPRC) implementations.
Front row - Paulus, Keitaro, Cathy, Jim, John, MaryAnn; Back row - Benoit, Craig, Brannen, Rosemary,
John Amann
We want to thank John Amann for hosting this residency at the Washington Systems Center
in Gaithersburg, MD.
In addition, other members of the performance advisory group helped us with
presentations and reviewed our material:
Ime Archibong, Siebo Friesenborg, Joe Hyde, Carl Jones, Josh Martin, Henry May, Bruce
McNutt, Vernon Miller, Dharmendra Modha, Rick Ripberger, Mike Roll, and Sonny
Williams.
Many thanks to those people in IBM in Montpellier, France who helped us with access to
equipment as well as technical information and review:
Olivier Alluis (Manager of the ATS TotalStorage Benchmark Center), Philippe Jachymczyk,
Dominique Salomon, Christophe Majek and Jean-Armand Broyelle.
Mary Lovelace
International Technical Support Organization, San Jose Center
Martin Kammerer
IBM Germany
Steve Pratt
IBM Austin
Edward Holcombe
IBM Beaverton
Mike Downie
IBM Boulder
Dan Braden
IBM Dallas
Donald C. Laing
IBM Midland
Cathy Cronin
IBM Poughkeepsie
Jeffrey Berger
IBM San Jose
Mike Gonzales
IBM Santa Teresa
Kwai Wong
IBM Toronto
Andy Ruhl
IBM Tucson
Many thanks to:
Gilbert Houtekamer from Intellimagic
Pablo Clifton from CompuPro
Your efforts will help increase product acceptance and customer satisfaction. As a bonus,
you'll develop a network of contacts in IBM development labs, and increase your productivity
and marketability.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our Redbooks™ to be as helpful as possible. Send us your comments about this or
other Redbooks in one of the following ways:
Use the online Contact us review redbook form found at:
ibm.com/redbooks
Send your comments in an email to:
redbook@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. QXXE Building 80-E2
650 Harry Road
San Jose, California 95120-6099
The DS6000 is designed specifically for medium and large enterprise customers seeking new
ways to simplify their systems and storage infrastructures, improve the use of information
throughout its life cycle, and support business continuity. The continuing exponential growth
of data means that storage subsystems must be cost effective and flexible enough to
support a variety of working environments; that same flexibility helps enable your business
to accommodate this continuing growth of data.
The DS6000 series advanced functionality is shared with the DS8000 series. IBM provides an
enterprise storage continuum of disk products with compatible Copy Services, common
advanced functions, and common management interfaces.
With the additional advantages of IBM FlashCopy, data availability can be enhanced even
further; for instance, production workloads can continue execution concurrent with data
backups. Metro Mirror and Global Mirror business continuity solutions are designed to provide
synchronous and asynchronous mirroring of data to a remote site for disaster recovery.
The processors
The DS6800 utilizes two 64-bit PowerPC 750GX 1 GHz processors for the storage server and
the host adapters, respectively, and another PowerPC 750FX 500 MHz processor for the
device adapter on each controller card. The DS6800 is equipped with 2 GB memory in each
controller card, adding up to 4 GB. Some part of the memory is used for the operating system
and another part in each controller card acts as nonvolatile storage (NVS), but most of the
memory is used as cache. Because the cache resides in processor memory, cache accesses
are very fast.
When data is written to the DS6800, it is placed in cache and a copy of the write data is also
copied to the NVS of the other controller card, so there are always two copies of write data
until the updates have been destaged to the disks. On zSeries, this mirroring of write data can
be disabled by application programs, for example, when writing temporary data (Cache Fast
Write). The NVS is battery backed up and the battery can keep the data for at least 72 hours
if power is lost.
The DS6000 series controller’s Licensed Internal Code (LIC) is based on the DS8000 series
software. Since 97% of the functional code of the DS6000 is identical to the DS8000 series,
the DS6000 has a very good base to be a stable system.
Host adapters
The DS6800 has eight 2 Gbps Fibre Channel ports that can be equipped with from two up to
eight shortwave or longwave Small Form-factor Pluggable (SFP) transceivers; you order
SFPs in pairs. The 2 Gbps Fibre Channel host ports (when equipped with SFPs) can also
auto-negotiate down to 1 Gbps for attachment to older SAN components.
There are four paths from the DS6800 controllers to each disk drive to provide greater data
availability in the event of multiple failures along the data path. The DS6000 series systems
provide preferred path I/O steering and can automatically switch the data path used to
improve overall performance.
Dense packaging
Calibrated Vectored Cooling technology used in IBM Eserver xSeries® and BladeCenter®
to achieve dense space saving packaging is also used in the DS6800. The DS6800 weighs
only 49.6 kg (109 lbs.) with 16 drives. It connects to normal power outlets with its two power
supplies in each DS6800 or DS6000 expansion enclosure. All this provides savings in space,
cooling, and power consumption.
Aside from the drives, the DS6000 expansion enclosure contains two Fibre Channel switches
to connect to the drives and two power supplies with integrated fans.
The minimum storage capability with eight 73 GB DDMs is 584 GB. The maximum storage
capability with 16 300 GB DDMs for the DS6800 controller enclosure is 4.8 TB. If you want to
connect more than 16 disks, you can use the optional DS6000 expansion enclosures that
allow a maximum of 128 DDMs per storage system and provide a maximum storage
capability of 38.4 TB.
RAID 5
RAID 5 is a method of spreading volume data plus data parity across multiple disk drives.
RAID 5 increases performance by supporting concurrent accesses to the multiple DDMs
within each logical volume.
RAID 10
RAID 10 implementation provides data mirroring from one DDM to another DDM. RAID 10
stripes data across half of the disk drives in the RAID 10 configuration. The other half of the
array mirrors the first set of disk drives. RAID 10 offers faster random writes than RAID 5
because no parity must be calculated and written.
1.3.2 Resiliency
The DS6000 series has built-in resiliency features that are not generally found in small
storage devices. The DS6000 series is designed and implemented with component
redundancy to help reduce and avoid many potential single points of failure.
Within a DS6000 series controller unit, there are redundant RAID controller cards, power
supplies, fans, Fibre Channel switches, and Battery Backup Units (BBUs).
There are four paths to each disk drive. Using Predictive Failure Analysis®, the DS6000 can
identify a failing drive and replace it with a spare drive without customer interaction.
Copy Services has four interfaces: a Web-based interface (DS Storage Manager), a
command-line interface (DS CLI), an application programming interface (DS Open API), and
host I/O commands from zSeries servers.
Incremental FlashCopy
Incremental FlashCopy provides the capability to refresh a LUN or volume involved in a
FlashCopy relationship. When a subsequent FlashCopy establish is initiated, only the data
required to bring the target current to the source's newly established point-in-time is copied.
The direction of the refresh can also be reversed, in which case the LUN or volume previously
defined as the target becomes the source for the LUN or volume previously defined as the
source (and is now the target).
Global Copy was previously called PPRC-XD on the ESS. It is an asynchronous copy of
LUNs or zSeries CKD volumes. An I/O is signaled complete to the server as soon as the data
is in cache and mirrored to the other controller cache. The data is then sent to the remote
storage system. Global Copy allows for copying data to far away remote sites. However, if you
have more than one volume, there is no mechanism that guarantees that the data of different
volumes at the remote site is consistent in time.
Global Mirror is a long distance remote copy solution across two sites using asynchronous
technology. It is designed to provide the following:
- Support for virtually unlimited distances between the local and remote sites, with the distance typically limited only by the capabilities of the network and channel extension technology being used. This can better enable you to choose your remote site location based on business needs, and enables site separation to add protection from localized disasters.
- A consistent and restartable copy of the data at the remote site, created with little impact to applications at the local site.
- Data currency, where for many environments the remote site lags behind the local site by an average of three to five seconds, which helps to minimize the amount of data exposure in the event of an unplanned outage. The actual lag in data currency experienced will depend on a number of factors, including the workload characteristics and the network bandwidth between the local and remote sites.
The online configuration and Copy Services are available via a Web browser interface
installed on the DS management console.
IBM has consistently demonstrated its leadership in the open standards movement, and the
IBM TotalStorage DS Open API for the DS6000, which is compatible with the SMI-S
standard, offers a compelling proof point of IBM's commitment to the benefits that open
standards can offer.
The DS Open API supports routine LUN management activities, such as LUN creation,
mapping and masking, and the management of point-in-time copy and remote mirroring. It
supports these activities through the use of a standard interface as defined by the Storage
Networking Industry Association (SNIA) Storage Management Initiative Specification
(SMI-S).
The DS Open API is implemented through the IBM TotalStorage Common Information Model
Agent (CIM Agent) for the DS Open API, a middleware application designed to provide a
CIM-compliant interface. The interface allows Tivoli and third-party CIM-compliant software
management tools to discover, monitor, and control DS6000 series systems. The DS Open
API and CIM Agent are provided with the DS6000 series at no additional charge. The CIM
Agent is available for the AIX®, Linux®, and Microsoft® Windows operating system
environments.
The DS CLI can dynamically invoke Copy Services functions. This can help
enhance your productivity, since it eliminates the previous requirement to create and
save a task using the GUI. The DS CLI can also issue copy services commands to an ESS
Model 750, ESS Model 800, or DS8000 series systems.
The DS CLI client is available for the AIX, HP-UX, Linux, Novell NetWare, Sun™ Solaris™,
and Microsoft Windows operating system environments.
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Particularly for zSeries and iSeries customers, the DS6000 series will be an exciting product,
since for the first time it gives them the choice to buy a midrange priced storage system for
their environment with performance that is similar to or exceeds that of an IBM ESS.
Load balancing can reduce or eliminate I/O bottlenecks that occur when many I/O operations
are directed to common devices via the same I/O path. SDD also helps eliminate a potential
single point of failure by automatically rerouting I/O operations when a path failure occurs,
thereby supporting enhanced data availability.
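As a rough illustration of the load-balancing idea (this is not SDD's actual implementation; the path names, data structures, and selection policy are simplified assumptions), the following Python sketch routes each I/O down the healthy path with the fewest outstanding operations, and reroutes around a failed path:

```python
# Illustrative multipathing sketch: send each I/O down the eligible path
# with the fewest outstanding I/Os, skipping paths marked as failed.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    failed: bool = False
    outstanding: int = 0   # I/Os currently in flight on this path

def select_path(paths):
    """Pick the least-busy healthy path; raise if no path is available."""
    healthy = [p for p in paths if not p.failed]
    if not healthy:
        raise RuntimeError("no available path to device")
    return min(healthy, key=lambda p: p.outstanding)

paths = [Path("fscsi0"), Path("fscsi1")]   # hypothetical adapter names
paths[0].outstanding = 3                   # fscsi0 is busier right now
print(select_path(paths).name)             # -> fscsi1 (load balancing)
paths[1].failed = True                     # path failure occurs
print(select_path(paths).name)             # -> fscsi0 (failover rerouting)
```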
SDD is a standard feature and is provided with the DS6000 at no additional charge. Fibre
Channel attachment configurations are supported in the AIX, HP-UX, Windows 2000, and
Solaris environments.
If you want to keep your ESS, and it is a Model 800 or 750 with Fibre Channel adapters, you
can use it, for example, as a secondary for remote copy. With the ESS at the
appropriate LIC level, scripts or CLI commands written for Copy Services will work for both
the ESS and the DS6000.
Obviously the DS8000 series can deliver a higher throughput and scales higher than the
DS6000 series, but not all customers need this high throughput and capacity. You can choose
the system that fits your needs since both systems support the same SAN infrastructure and
the same host systems.
It is very easy to have a mixed environment, with DS8000 series systems where you need
them and DS6000 series systems where you need a very cost efficient solution.
Logical partitioning with some DS8000 models is not available on the DS6000.
The DS6000 is the entry product of the DS6000/DS8000 family, which forms the high end of
the IBM TotalStorage disk portfolio. This family outperforms the entire DS4000 family, even
though the DS4800 (the high-end product of the DS4000 family) overlaps the DS6800 in
terms of performance. In terms of server support, the DS4000 is dedicated to open servers
(Intel®, UNIX, Linux), while the DS6000 also supports iSeries and zSeries servers.
Within the DS4000 family you have the option to choose from different DS4000 models,
among them very low cost entry models. DS4000 series storage systems can also be
equipped with cost efficient, high capacity Serial ATA drives.
The DS4000 series products allow you to grow with a granularity of a single disk drive, while
with the DS6000 series you have to order at least four drives. Currently the DS4000 series
also is more flexible with respect to changing RAID arrays on the fly and changing LUN sizes.
Extension of IBM's dynamic provisioning technology within the DS6000 series is planned to
provide LUN/volume dynamic expansion, online data relocation, virtual capacity over
provisioning, and space-efficient FlashCopy requiring minimal reserved target capacity.
While the DS4000 series also offers remote copy solutions, these functions are not
compatible with the DS6000 series.
Currently, the DS4000 series product family (FAStT) is a popular choice for many customers
who buy the SAN Volume Controller. With the DS6000 series, they now have an attractive
alternative. Since the SAN Volume Controller already has a rich set of advanced copy
functions, clients were looking for a cost efficient but reliable storage system. The DS6000
series fits perfectly into this environment since it offers good performance for the price while
still delivering all the reliability functions needed to protect your data.
To be able to share data in a heterogeneous environment the storage system must support
the sharing of LUNs. The DS6000 series can do this and therefore is an ideal candidate for
your SAN File System data.
There are some choices when planning your DS6000 hardware configuration, and their
relevance is related to the workload characteristics and performance expectations. This
chapter discusses:
Performance related hardware components on the DS6000
How to improve response time and throughput
Recommendations about how to enhance I/O performance
Benchmarks can also look like an easy way to plan a disk configuration. But if you intend to
make a configuration decision based on benchmark performance results, you need to ensure
that the workload that is used for the benchmark resembles, as closely as possible, the
workload that you intend to run on your DS6000. You should also know both the physical and
the logical configuration of the DS6000s used during the benchmark, so your workload gets
the results you are expecting. Sometimes you will not be able to replicate the configuration
documented for the benchmark, and at other times you may not find a documented
benchmark that resembles your requirements, so benchmarks cannot always be used.
For these reasons, the recommended approach for correctly estimating the disk subsystem
configuration that is needed is to use and combine the following methods.
- Reference by industry: customers in the same industry often run comparable workloads, so their configurations can be used as a starting point.
- Reference to a lab measurement: lab measurements are typically taken with no read/write cache hits, so cache size does not matter for them.
- Using a ROT (rule of thumb): for example, a calculation such as capacity (TB) multiplied by access density is still effective (see the sketch after this list).
- Disk Magic: the I/O characteristics of an existing configuration can be used to estimate the new configuration.
- Benchmark: you can also use the benchmark center of each region.
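As an illustration of the capacity-times-access-density ROT named in the list above, the following Python sketch computes the I/O rate to plan for; the numbers are placeholders you would replace with measurements from a comparable workload, not DS6000 figures:

```python
# Hedged sketch of the "capacity x access density" rule of thumb.

def required_iops(capacity_tb: float, access_density: float) -> float:
    """Estimate the I/O rate a configuration must sustain.

    access_density is in I/Os per second per GB, taken from existing
    measurements of a comparable workload."""
    return capacity_tb * 1000 * access_density

# Example: 10 TB of data at an assumed 0.5 IOPS/GB access density
print(required_iops(10, 0.5))   # -> 5000.0 IOPS to plan for
```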
Also the following tools allow easy and accurate monitoring, analyzing, sizing, and modeling
for the required DS6000 configuration. Among these tools, we have:
Disk Magic
RMF™ Magic for zSeries
Capacity Magic
TPC (IBM TotalStorage Productivity Center)
DS6000 Performance Monitor on GUI
These tools will be discussed in detail in Chapter 4, “Planning and monitoring tools” on
page 85.
Another important characteristic of an I/O workload is whether the data is being remotely
copied or not.
If you are moving existing workloads to a new DS6000, then you have information that can be
used to model and estimate this new DS6000 configuration. You will also be able to model
any activity growth that you are planning in advance.
On the other hand, if the workload that you plan to run on the DS6000 is a new workload or
one that you do not have a good understanding of, then we recommend that you be
conservative when planning the disk storage subsystem hardware configuration.
If you will be running multiple heterogeneous servers, each server with different workload
characteristics, you will have the most complex case. You must ensure that your final
hardware configuration has enough capacity to cope with the maximum data rate, while
aggregating the whole set of applications.
When the workload demands of all the servers being consolidated are well understood, and
assuming they are predictable and consistent, it could be possible to manage the peaks and
thus get some resource savings.
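The following Python sketch illustrates that reasoning with invented hourly numbers: when peaks are predictable and do not coincide, the peak of the combined load is lower than the sum of the individual peaks, which is where the resource savings come from:

```python
# Hypothetical per-interval data rates (MB/s) for two servers being
# consolidated; server A peaks during the day, server B at night.
server_a = [200, 800, 1200, 400]
server_b = [900, 300, 200, 1000]

sum_of_peaks = max(server_a) + max(server_b)                  # worst-case sizing
peak_of_sum = max(a + b for a, b in zip(server_a, server_b))  # combined-load sizing

print(sum_of_peaks)   # 2200 MB/s if each peak is provisioned separately
print(peak_of_sum)    # 1400 MB/s -- the consolidated system can be smaller
```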
On the other hand, if you are combining workloads that are not well understood or whose
requirements fluctuate in an unpredictable manner, then a more conservative approach
must be taken when considering the peaks in your hardware capacity planning.
For further information, see Chapter 12, “Understanding your workload” on page 407.
[Figure 2-1: DS6000 enclosure layout - the controller enclosure (CTRL) with expansion enclosures EXP #1 through EXP #7]
From the performance perspective, the major components you need to consider when
planning the DS6000 hardware configuration are (refer to Figure 2-1):
DDM capacity and speed (RPM)
Number of arrays and RAID type
Number and type of host adapters
The hardware components presented in Figure 2-1 and their implications in the overall
DS6000 performance behavior are discussed in the following sections.
The DS6800 utilizes two 64-bit PowerPC 750GX 1 GHz processors for the storage server and
the host adapters, respectively, and another PowerPC 750FX 500 MHz processor for the
device adapter on each server card.
[Figure 2-2: DS6000 controller card hardware - Fibre Channel protocol engines, PowerPC 750GX processors with SDRAM, a PowerPC 750FX, data protection/data mover ASICs, flash, NVRAM, bridges, Ethernet, RAID logic, the Fibre Channel switch, SES, and the midplane]
If you can view Figure 2-2 in color, you can use the colors as indicators of how each DS6000
hardware component is configured: orange (gray in black and white) indicates the host
adapter (HA) components, green (dark gray in black and white) the device adapter (DA)
components, and yellow (white in black and white) the cache memory and server processor
components.
2.5.1 Cache
Cache is used to keep both the read and write data that the host server needs to process.
Having the cache as an intermediate repository, the host does not need to wait for the hard
disk drive to either obtain or store the data. Instead, the operations of reading from the hard
disk drive (stage) and writing to the hard disk drive (destage) are done by the DS6000
asynchronously from the host I/O processing. This allows host I/O operations to complete at
electronic cache speeds without waiting for the much slower hard disk drive operations.
Cache processing significantly improves the performance of the I/O operations done by the
host systems that attach to the DS6000. Cache size, together with the efficiency of the
internal cache management algorithms, determines how much benefit the cache provides.
In the DS6000 there is 4 GB of fixed cache. This cache is divided between the two servers of
the DS6000, giving the servers their own non-shared cache.
To protect the data that is written during I/O operations, the DS6000 stores two copies of the
data: one in the cache of one server and another in the persistent memory (NVS) of the other
server.
[Figure 2-3: Cache and NVS layout on Server 0 and Server 1 - each server has its own cache memory and holds the NVS for the other server]
The DS6000 uses the patent-pending Adaptive Replacement Cache (ARC) algorithm,
developed by IBM Storage Development in partnership with IBM Research. It is a self-tuning,
self-optimizing solution for a wide range of workloads with a varying mix of sequential and
random I/O streams. For a detailed description of ARC, see N. Megiddo and D. S. Modha,
“Outperforming LRU with an adaptive replacement cache algorithm,” IEEE Computer, vol. 37,
no. 4, pp. 58–65, 2004.
The decision to copy some amount of data into the DS6000 cache can be triggered from two
policies: demand paging and prefetching. Demand paging means that disk blocks are brought
in only on a cache miss. Demand paging is always active for all volumes and ensures that I/O
patterns with some locality find at least some recently used data in the cache.
Prefetching means that data is copied into the cache speculatively even before it is
requested. To prefetch, a prediction of likely future data accesses is needed. Because
effective, sophisticated prediction schemes need extensive history of page accesses (which is
not feasible in real-life systems), ARC uses prefetching for sequential workloads. Sequential
access patterns naturally arise in video-on-demand, database scans, copy, backup, and
recovery. The goal of sequential prefetching is to detect sequential access and effectively
pre-load the cache with data so as to minimize cache misses.
For prefetching, the cache management uses tracks. To detect a sequential access pattern,
counters are maintained with every track, to record if a track has been accessed together with
its predecessor. Sequential prefetching becomes active only when these counters suggest a
sequential access pattern. In this manner, the DS6000 monitors application read-I/O patterns
and dynamically determines whether it is optimal to stage into cache one of the following:
Just the page requested.
That page requested plus remaining data on the disk track.
An entire disk track (or a set of disk tracks) which has (have) not yet been requested.
The decision of when and what to prefetch is essentially made on a per-application basis
(rather than a system-wide basis) to be sensitive to the different data reference patterns of
different applications that can be running concurrently.
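The following Python sketch illustrates the per-track counter idea described above with an assumed threshold and simplified bookkeeping (the DS6000's actual detection logic is more sophisticated):

```python
# Illustrative sketch of sequential-access detection: a counter per track
# records whether the track was accessed together with its predecessor;
# prefetching is triggered once enough consecutive accesses are seen.

SEQ_THRESHOLD = 4    # assumed: consecutive-track hits before prefetching

counters = {}        # track number -> consecutive-access count

def on_track_access(track: int) -> bool:
    """Return True if the access pattern now looks sequential."""
    counters[track] = counters.get(track - 1, 0) + 1
    return counters[track] >= SEQ_THRESHOLD

for t in range(100, 106):        # a sequential run of track reads
    if on_track_access(t):
        print(f"prefetch ahead of track {t}")   # fires from track 103 on
```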
To decide which pages are evicted when the cache is full, sequential and random
(non-sequential) data is separated into different lists (see Figure 2-4 on page 24). A page
which has been brought into the cache by simple demand paging is added to the MRU (Most
Recently Used) head of the RANDOM list. Without further I/O access, it goes down to the
LRU (Least Recently Used) bottom. A page which has been brought into the cache by a
sequential access or by sequential prefetching is added to the MRU head of the SEQ list and
then moves down that list. Additional rules control the migration of pages between the lists
so as not to keep the same pages twice in memory.
Figure 2-4 Cache lists of the SARC algorithm for random and sequential data (the RANDOM and SEQ lists each run from an MRU head down to an LRU bottom, with a desired size marker on the SEQ list)
To follow workload changes, the algorithm trades cache space between the RANDOM and
SEQ lists dynamically and adaptively. This makes ARC scan-resistant, so that one-time
sequential requests do not pollute the whole cache. ARC maintains a desired size parameter
for the sequential list. The desired size is continually adapted in response to the workload.
Specifically, if the bottom portion of the SEQ list is found to be more valuable than the bottom
portion of the RANDOM list, then the desired size is increased; otherwise, the desired size is
decreased. The constant adaptation strives to make the optimal use of limited cache space
and delivers greater throughput and faster response times for a given cache size.
Additionally, the algorithm modifies dynamically not only the sizes of the two lists, but also the
rate at which the sizes are adapted. In a steady state, pages are evicted from the cache at the
rate of cache misses. A larger (respectively, a smaller) rate of misses effects a faster
(respectively, a slower) rate of adaptation.
Other implementation details take into account the relation of read and write (persistent
memory) cache, efficient destaging, and the cooperation with Copy Services. In this manner,
the DS6000 cache management goes far beyond the usual variants of the LRU/LFU (Least
Recently Used / Least Frequently Used) approaches.
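The following Python sketch is a much simplified illustration of the two-list scheme described above, not IBM's implementation: random and sequential pages sit on separate LRU lists, a hit near a list's LRU bottom nudges the adaptive desired size of the SEQ list toward that list, and eviction targets whichever list exceeds its target. The capacity, the "near bottom" test, and the adaptation step are all arbitrary choices for the sketch:

```python
from collections import OrderedDict

class TwoListCache:
    """Toy two-list cache with an adaptive desired size for the SEQ list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.random = OrderedDict()        # keys are pages; MRU at the end
        self.seq = OrderedDict()
        self.desired_seq = capacity // 2   # adaptive target size for SEQ

    def _evict(self):
        # Evict from SEQ if it exceeds its desired size, else from RANDOM.
        if len(self.seq) > self.desired_seq or not self.random:
            self.seq.popitem(last=False)       # drop SEQ's LRU bottom
        else:
            self.random.popitem(last=False)    # drop RANDOM's LRU bottom

    def access(self, page, sequential):
        lst, other = (self.seq, self.random) if sequential else (self.random, self.seq)
        if page in lst:
            # Check position before promoting: a hit near the LRU bottom
            # hints the list is too small, so adapt the SEQ target.
            if list(lst).index(page) < max(1, len(lst) // 4):
                step = 1 if sequential else -1
                self.desired_seq = min(self.capacity, max(0, self.desired_seq + step))
            lst.move_to_end(page)              # promote to MRU
        else:
            other.pop(page, None)              # never keep a page twice
            if len(self.random) + len(self.seq) >= self.capacity:
                self._evict()
            lst[page] = None                   # insert at MRU end

cache = TwoListCache(capacity=8)
for p in range(100):                           # a one-time sequential scan...
    cache.access(p, sequential=True)
cache.access(7, sequential=False)              # ...still leaves room for random pages
print(len(cache.seq), len(cache.random), cache.desired_seq)   # -> 7 1 4
```

Because evictions favor the over-target SEQ list, the one-time scan in the usage example cannot pollute the whole cache, which is the scan resistance property described above.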
The following two figures show the performance superiority of ARC compared with a cache
that does not use ARC. The performance difference seen in Figure 2-5 on page 25 and
Figure 2-6 on page 25 grows as the throughput increases; for example, at 4000 IOPS the
response time is 7.6 ms without ARC and 1.8 ms with ARC. With the ARC algorithm, cache
space effectiveness improves by 33%, peak throughput improves by 12.5%, and the cache
miss rate is reduced by 11%.
Basically, the larger the cache size the better the I/O performance characteristics. An
approximate and conservative rule of thumb (ROT), disregarding the access density and the
I/O operations specific characteristics, based solely on the backend total capacity, is to
estimate between 2 GB and 4 GB of cache per 1 TB of storage. This ROT can work for many
workloads, but it can also be very inaccurate for many other workloads.
Especially for cache-friendly workloads, the DS6000 may be slower than the ESS, which has
a bigger cache (8 GB or more). A simple approximation assumes that the read miss ratio
(1 - read hit ratio) will double for most z/OS and open systems workloads (see Table 1-1).
Table 1-1 Estimated read hit ratios when moving from an ESS to a DS6000

ESS read hit ratio    Estimated DS6000 read hit ratio
0.95                  0.90
0.90                  0.80
0.80                  0.60
0.70                  0.40
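The miss-ratio-doubling approximation behind Table 1-1 can be written as a short Python sketch (an illustration of the stated rule of thumb, not an IBM formula):

```python
# Assume the DS6000's smaller cache roughly doubles the read miss ratio
# observed on an ESS with a larger cache.

def ds6000_read_hit(ess_read_hit: float) -> float:
    """Estimate the DS6000 read hit ratio from an ESS read hit ratio."""
    return 1.0 - 2.0 * (1.0 - ess_read_hit)

for ess_hit in (0.95, 0.90, 0.80, 0.70):
    print(f"ESS {ess_hit:.2f} -> DS6000 {ds6000_read_hit(ess_hit):.2f}")
# Reproduces the table: 0.90, 0.80, 0.60, 0.40
```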
Consider that the cache size is not an isolated factor when estimating the overall DS6000
performance, but must be considered together with other important factors like the I/O
workload characteristics; the disk drives capacity and speed; the number and type of DS6000
host adapters; and the backend data layout and 1750-EX1 FC-AL loops.
For some z/OS environments, the processor memory or cache in the DS6000 may
contribute to high I/O rates and help to minimize I/O response time.
It is not just the pure cache size that accounts for good performance figures. Economical
use of cache and smart, adaptive caching algorithms are just as important to guarantee
outstanding performance. These are implemented in the DS6000 series, except for the cache
segment size, which is currently 68 KB.
Processor memory is subdivided into a data in cache portion, which holds data in volatile
memory, and a persistent part of the memory, which functions as persistent memory to hold
DASD fast write (DFW) data until destaged to disk.
Our recommendation is to use Disk Magic to properly determine the best hardware
configuration. For more information, see 4.1, “Disk Magic” on page 86.
The minimum available DS6000 configuration is 292 GB, configured with four 73 GB disk
drives contained in one four-pack. All increments of capacity are installed in four-packs; thus
the minimum capacity increment is a four-pack of either 73 GB, 146 GB, or 300 GB drives.
Or, as a RAID 10 Array: 1+1+2S or 2+2 from one four-pack, or 3+3+2S or 4+4 from two
four-packs.
The DS6000 Storage Manager will configure the four-pack on a loop with spare DDMs as
required. When the configuration includes an intermix of different capacity or speed drives,
this may result in the creation of additional DDM spares on a loop as compared to
non-intermixed configurations. Spare configurations and considerations are explained in
detail in 2.8, “RAID implementation” on page 33.
Currently with the DS6000 there is the choice of different disk drive capacities and speeds:
73 GB 15,000 rpm disks
146 GB 10,000 rpm disks
146 GB 15,000 rpm disks
300 GB 10,000 rpm disks
The four disk drives assembled in each four-pack unit are all of the same capacity and speed.
But it is possible to mix four-packs of different capacity and speed (rpm) within a DS6000,
within the guidelines described in 2.6.4, “Disk four-pack intermixing” on page 28.
Physical capacity
The physical capacity (or raw capacity) of the DS6000 is the sum of the physical capacities
of all the installed disk four-packs. The physical capacity of a four-pack is determined by the
capacity of the disk drives it holds (refer to Figure 2-7).
Effective capacity
The effective capacity of the DS6000 is the capacity available for user data. The combination
and sequence in which four-packs are added to the DS6000, and then how they are logically
configured, will determine the effective capacity of the DS6000 (refer to Figure 2-7).
The logical configuration alternatives and the resulting effective capacities are discussed later
in Chapter 3, “Logical configuration planning” on page 53.
RAID 5 Array

DDMs in 4-Pack   Physical Capacity   Effective Capacity   Effective Capacity   Effective Capacity   Effective Capacity
                 (1 Pack)            2+P+S (1 Pack)       3+P (1 Pack)         6+P+S (2 Packs)      7+P (2 Packs)
73 GB            292 GB              126 GB               190 GB               382 GB               445 GB
146 GB           584 GB              256 GB               386 GB               773 GB               902 GB
300 GB           1200 GB             522 GB               787 GB               1576 GB              1837 GB

RAID 10 Array

DDMs in 4-Pack   Physical Capacity   Effective Capacity   Effective Capacity   Effective Capacity   Effective Capacity
                 (1 Pack)            1+1+2S (1 Pack)      2+2 (1 Pack)         3+3+2S (2 Packs)     4+4 (2 Packs)
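As a rough aid for reading these tables, the following Python sketch maps each array type to its number of data drives. Note that the effective capacities shown above are lower than this simple product because of formatting overhead, so the sketch only gives an upper bound:

```python
# Data drives per array type (RAID 5 and RAID 10 on the DS6000).
DATA_DRIVES = {
    "2+P+S": 2, "3+P": 3, "6+P+S": 6, "7+P": 7,      # RAID 5
    "1+1+2S": 1, "2+2": 2, "3+3+2S": 3, "4+4": 4,    # RAID 10
}

def raw_data_capacity_gb(array_type: str, drive_gb: int) -> int:
    """Upper bound on usable capacity: data drives times drive size."""
    return DATA_DRIVES[array_type] * drive_gb

print(raw_data_capacity_gb("6+P+S", 73))   # 438 GB raw; table shows 382 GB effective
```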
Capacity intermix
Disk four-packs of different capacities can be installed within the same DS6800 Server
Enclosure and DS6000 Expansion Enclosure, in the same or in different loops and boxes. In
the FC-AL loops of the DS6000, it is possible to intermix:
73 GB, 146 GB, and 300 GB capacity disk four-packs
RAID 5 and RAID 10 array configurations
Two hot spare disks are created per FC-AL loop for each drive capacity and speed installed,
taken from the arrays that are configured first. The spares are reserved from either two
6+P arrays (RAID 5) or one 3+3 array (RAID 10). Sparing is discussed in detail in “RAID
implementation” on page 33.
Also, as technology evolves, new larger capacity disk drives usually arrive with improved
performance characteristics. Thus many installations now feel more confident about moving
to larger capacity disk drive configurations.
When choosing the capacity of your DS6000 disk drives, these considerations should be
regarded:
The characteristics of the I/O workload (cache friendly, unfriendly, standard; block size;
random versus sequential; read/write ratio; I/O rate) are key factors when deciding the
capacity and number of the disk drives that will be included in the DS6000 configuration.
For example, if the workload is cache friendly, more I/Os are completed in cache and less
activity is performed on the backend disk drives. This type of workload is a good candidate
for storing its data in the larger capacity of disk drives.
Some I/O workloads may be very cache unfriendly or have a very high random write
content. These workloads, where a larger part of the I/Os are completed in the backend
disks, may perform better when using more disk drives.
A disk drive by itself can only do a limited number of operations per second, so using more
disk drives can result in better performance.
Consider that the disk drive capacity is not an isolated factor when estimating the overall
DS6000 performance, but must be considered together with other important factors like the
I/O workload characteristics, the cache size, the disk drives speed, the number and type of
DS6000 host ports, and the backend data layout and FC-AL loops.
Our recommendation is to use Disk Magic to properly determine the most suitable mix of disk
drive capacities to include in your DS6000 hardware configuration. For details, see 4.1, “Disk
Magic” on page 86.
2.7.2 Disk Magic examples using 146 GB and 300 GB disk drives
In this section we present examples of DS6000 Disk Magic simulation results when using the
larger capacity 146 GB and 300 GB disk drives with the same rpm. The discussions for these
examples will help you better understand what the performance implications are when using
the larger disks.
For this example of an OLTP workload, where the total effective capacity was almost the
same on both configurations, the 300 GB disk drive configuration shows a higher response
time compared to the 146 GB disk drive configuration, which has twice as many drives.
Figure 2-8 OLTP workload on the 146 GB and 300 GB configurations (response time in ms versus throughput in IOPS; read:write = 7:3, cache hit ratio = 50%, I/O block size = 4 KB)
For this example of a read intensive workload, the maximum throughput is lower than for the
OLTP workload because of the lower cache hit ratio, but the overall result is much the same:
the 300 GB disk drive configuration is acceptable at about 2,300 IOPS or less, while the
146 GB disk drive configuration is acceptable at about twice that I/O rate.
Figure 2-9 Open read intensive workload on the 146 GB and 300 GB configurations (response time in ms versus throughput in IOPS; read:write = 2:1, cache hit ratio = 28%, write efficiency = 33%, I/O block size = 4 KB)
This being so, one of the simplest ways of improving the overall performance of a disk
subsystem is to install the highest speed (RPM) disk drives. This is especially relevant for
cache-unfriendly or cache-hostile workloads, which benefit more than others from DS6000
configurations that include the faster 15K rpm drives.
Consider that the disk drive speed is not an isolated factor when estimating the overall
DS6000 performance, but must be considered together with other important factors like the
I/O workload characteristics, the cache size, the disk drives capacity, the number and type of
DS6000 host ports, and the backend data layout and FC-AL loops.
2.7.4 Disk Magic examples using 15K rpm and 10K rpm disk drives
In this section we present examples of Disk Magic simulation results when using 15K rpm
and 10K rpm disk drives. The discussions for these examples will help you better understand
the performance implications of the disk drive speed factor.
Figure 2-10 illustrates an example of an OLTP workload that is run on two 146 GB disk drive
configurations: one configuration has 10K rpm disk drives and the other has 15K rpm disk
drives; each has 16 four-packs. You can see that the 15K rpm configuration performs better
than the 10K rpm configuration, always delivering better response times.
Figure 2-10 OLTP workload - 15K rpm versus 10K rpm disk drives
Figure 2-11 Read intensive - 15K rpm versus 10K rpm drives
Figure 2-11 illustrates the example when a read intensive workload was run on both the 10K
rpm and the 15K rpm configurations. The 15K rpm configuration performs better, always
delivering lower response times for all throughputs as compared to the 10K rpm configuration.
Comparing the results in Figure 2-10 on page 32 and Figure 2-11, you can see that the
performance gain from the faster drives is less significant when running a higher cache hit
ratio workload than when running a lower cache hit ratio workload.
For more detailed estimation, see 4.1, “Disk Magic” on page 86 and Chapter 12,
“Understanding your workload” on page 407.
Note: These results are only from Disk Magic simulations; your performance may vary
according to your environment.
Logically, four DDMs are automatically grouped into an Array Site. As illustrated in
Figure 2-12, initially four DDMs out of the first four four-packs (16 DDMs) are selected at
random to make up each Array Site. After the initial setup, the next Array Sites are configured
as more four-packs of DDMs are added.
The DS6000 disk Array is configured from one or two Array Sites in Redundant Array of
Independent Disks (RAID) implementations. When a RAID Array is configured from one Array
Site, this Array may be 2+P+S, 3+P (RAID 5), 1+1+2S (RAID 1) or 2+2 (RAID 10). And when
a RAID Array is configured from two Array Sites, this Array may be 6+P+S, 7+P (RAID 5),
3+3+2S or 4+4 (RAID 10).
Note: From a performance and capacity point of view, we recommend that you configure
Arrays from two Array Sites. More DDMs can handle more I/Os, and with RAID 5 the
smaller Arrays dedicate a larger share of their DDMs to parity. But from an availability point
of view, smaller Arrays are superior because the probability of two DDMs failing is lower.
Unlike the ESS Rank, a RAID Array does not fix the type of usage (FB or CKD), and it has no
relationship with the logical subsystem (LSS). With the DS6000, the processes relating an
Array to a Rank and to an LSS are separate. Making an Array only fixes the RAID type,
either RAID 5 or RAID 10.
Ranks are made from Arrays, and making a Rank decides the type of usage (FB or CKD).
Each Rank is then formatted into multiple Extents of 1 GB (for FB) or 0.94 GB (for CKD). One
or more Ranks are assigned to an Extent Pool, and logical volumes are configured from an
Extent Pool.
One or more Ranks are assigned to one or more storage pools (Extent Pools); that is, one
Extent Pool can contain one or more Ranks. Logical volumes are then configured from the
Extent Pool. If a LUN bigger than a Rank is needed, multiple Ranks must be assigned to the
Extent Pool.
Important: At the moment, we do not recommend that you assign multiple Ranks to a single
Extent Pool. When a LUN is created, it is not striped across the Ranks: Extents are taken
from one Rank, and only when more Extents are needed are they taken from the next Rank.
This causes unbalanced performance and a loss of availability. If available, we recommend
host-level striping. And from a management point of view, it is easier to assign one Rank to
one Extent Pool.
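The following Python sketch (illustrative only, not DS6000 code, with hypothetical pool sizes) shows the allocation behavior the note describes: extents for a new LUN come from one Rank until it is exhausted, so a 500 GB LUN in a two-Rank pool lands almost entirely on the first Rank rather than being striped:

```python
def allocate_lun(extent_pool, lun_extents):
    """extent_pool: list of free-extent counts per Rank (1 GB extents for FB).

    Fill from the first Rank with free extents; spill to the next only
    when the current Rank runs out -- no striping across Ranks."""
    placement = []
    for rank, free in enumerate(extent_pool):
        take = min(free, lun_extents)
        if take:
            placement.append((rank, take))
            extent_pool[rank] -= take
            lun_extents -= take
        if lun_extents == 0:
            return placement
    raise RuntimeError("not enough free extents in the pool")

pool = [400, 400]                 # two Ranks, 400 GB free on each
print(allocate_lun(pool, 500))    # [(0, 400), (1, 100)] -- mostly on Rank 0
```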
When configured, the logical volumes are striped across all the data disks and then mirrored,
if it is a RAID 10 Rank; or striped across all data disks in the Array along with the parity disk
(floating), if it is a RAID 5 Rank.
Because the DS6000 architecture for maximum availability is based on two spare drives per
FC-AL loop (and per capacity, per rpm), if the first two arrays that are configured in a loop are
defined as RAID 5 then they will be defined by the DS6000 with two or six data disks plus one
parity disk plus one spare disk—this is a 2+P+S, 6+P+S Rank configuration. This will happen
for the first two arrays of each capacity installed in the loop, if configured as RAID 5.
Once the two spares per capacity/speed rule is fulfilled, then further RAID 5 arrays in the loop
will be configured by the DS6000 as three or seven data disks and one parity disk—this is a
3+P, 7+P Rank configuration. Figure 2-13 to Figure 2-15 on page 36 illustrate the three
arrangements of disks possible in the DS6000 when configuring two four-packs RAID 5
arrays.
[Figures 2-13 through 2-15: arrangements of RAID 5 arrays and spares across the device adapter (DA) loops and the 1750-EX1 expansion enclosures]
In a RAID 5 implementation, each disk can be accessed, thus enabling multiple concurrent
accesses to the array. This results in multiple concurrent I/O requests being satisfied, thus
providing a higher random-transactions throughput.
The DS6000 architecture for maximum availability is based on two spare drives per loop (and
per capacity, per rpm). If the first array configured in a loop is defined as a RAID 10 Array, it
will be defined with one data disk plus one mirror plus two spares (made from one Array Site),
or three data disks plus three mirrors plus two spares (made from two Array Sites). This is a
1+1+2S or 3+3+2S array configuration. This will happen for the first array of each capacity
installed in the loop, if configured as RAID 10.
Once the two spares per capacity rule is fulfilled, then further RAID 10 arrays in the loop will be
configured by the DS6000 as two/four data disks plus two/four mirrors—this is a 2+2/4+4
array configuration. Figure 2-16 to Figure 2-18 on page 38 illustrate the three arrangements
of disks that can be found in the DS6000 when there are RAID 10 Ranks.
[Figures 2-16 to 2-18: RAID 10 arrangements on Loop 1 across the server enclosure and 1750-EX1-1/1750-EX1-2 expansion enclosures. Two spares are configured on the first two RAID 10 arrays per loop (3+3+2S); no spares are configured on the remaining arrays (4+4). If 146 GB arrays are configured initially and 300 GB DDMs are added later, two hot spares are configured for each capacity of array per loop. Legend: data, spare, mirror]
RAID 10 is also known as RAID 1+0, because it is a combination of RAID 1 (mirroring) and
RAID 0 (striping). The striping optimizes performance by striping volumes across several
disk drives (three or four DDMs in two Array Sites). RAID 1 provides the protection against a
disk drive failure by keeping a mirrored copy of the data.
Note: If you make a RAID 5 array first and then make a RAID 10 array in the first or second
enclosure, three hot spare disks are created, which reduces the total effective capacity. To
avoid this, create the RAID 10 Array first in the first or second enclosure.
[Figure: A RAID 10 (3+3+2S) first array and a RAID 5 (7+P) second array on Loop 1, across the 1750-EX1-1 and 1750-EX1-2 expansion enclosures]
For reads that must be satisfied from disk, the performance of RAID 5 and RAID 10 is roughly
equal, except at high I/O rates. If a RAID 5 array contains significantly more data than a
RAID 10 array, some of that data is located on the inner tracks of the disks, where longer
seek times can sometimes make RAID 5 slower than RAID 10; if both arrays contain about
the same amount of data, read performance is comparable. Viewed another way, if the full
space of each RAID array is used for data, RAID 10 has twice the number of DDMs for the
same usable capacity, which means more read operations can be handled by RAID 10.
Sequential writes
Sequential writes are handled the same way as random writes, from the standpoint of getting
the data into cache and persistent cache, and then acknowledging the I/O as complete to the
host server. As with random writes, provided there is room in the cache areas, the response
time seen by the application is the time to get data into the cache.
RAID 10 destages are handled the same way as random writes. However, with sequential
writes, the volume of data is generally much larger, and since data is striped across the array,
I/Os will be done to every disk in the array. And, as with random writes, the data is written to
both sets of volumes.
RAID 5 sequential writes are done a bit differently than random writes. Because the larger
volume of data requires striping the data across the entire array, RAID 5 does a full stripe
write across all DDMs in the array, writing the data and parity generated from the data in
cache, without requiring any read of existing data or parity. Writing one copy of the data plus
parity information, as RAID 5 does, requires fewer disk operations than writing the data twice,
as RAID 10 does. This means that for sequential writes, a RAID 5 destage completes faster
and thereby reduces the busy time of the disk subsystem.
Array rebuilds
In theory, RAID 10 is better for Array rebuilds since RAID 5 must read every disk in the Array,
reconstruct the data using parity calculations, and then write the reconstructed data. In
comparison, to rebuild a failed disk RAID 10 has only to copy the data from one DDM to
another. However, the DS6000 switched fibre network connecting the disks has sufficient
bandwidth to permit the reads from all DDMs in the RAID 5 Array to be done concurrently. So
the actual elapsed time to rebuild a RAID 5 array is approximately the same as the elapsed
time to rebuild a RAID 10 array. However, due to the larger number of disk operations, a RAID
5 rebuild would be more likely to impact other disk activity on the same disk loops than would
RAID 10.
For reads from disk, either random or sequential, there is no significant difference in RAID 5
and RAID 10 performance, except at high I/O rates.
For random writes to disk, RAID 10 performs better. The improvement is not seen until high
levels of activity.
For sequential reads/writes and random reads to disk, RAID 5 performs better, and the
difference is measurable even at low levels of activity.
Regardless of your workload characteristics (read versus write, random versus sequential), if
your workload’s access density (the ratio of I/O operations per second to gigabytes of
capacity) is well within the capabilities of your disk subsystem, then either RAID 5 or RAID 10
will work fine.
For workloads that perform better with RAID 5, the difference in RAID 5 performance over
RAID 10 is typically not large. However, for workloads that perform better with RAID 10, the
difference in RAID 10 performance over RAID 5 can be significant. RAID 10 is generally
considered the RAID type of choice for random write workloads which need the absolute best
I/O performance possible.
The major downside to RAID 10 is usable space efficiency. In a DS8000, 64 DDMs (the
number typically supported on one device adapter pair) in a RAID 10 configuration provide 30
DDMs of usable space. Those same 64 DDMs in a RAID 5 configuration provide 52 DDMs of
usable space, which is 73 percent more usable capacity.
Instead of asking which is better, RAID 5 or RAID 10, a more appropriate question is when to
use RAID 5 and when to use RAID 10. The selection of RAID type is made for each individual
Array Site, which for the DS6000 is four DDMs. So you can select RAID type based on the
performance requirements of the files that will be located there. The best way to compare a
workload’s performance using RAID 5 versus RAID 10 is to have a Disk Magic model run. For
additional information about the capabilities of this tool, see 4.1, “Disk Magic” on page 86.
These problems are solved with the switched FC-AL implementation on the DS6000.
The DS6000 architecture employs dual redundant switched FC-AL access to each of the disk
enclosures. The key benefits of doing this are:
Two independent switched networks to access the disk enclosures
Four access paths to each DDM
Each device adapter port operates independently
Double the bandwidth over traditional FC-AL loop implementations
In the DS6000, the switch chipset is completely integrated into the servers. Each server
contains one switch. Note, however, that the switch chipset itself is completely separate from
the server chipset. In Figure 2-20 each DDM is depicted as being attached to two separate
Fibre Channel switches. This means that with two servers, we have four effective data paths
to each disk; each path comes from a device adapter port and operates at 2 Gbps.
[Figure 2-20: Switched connections. Each DDM attaches to two Fibre Channel switches, which the device adapters in Controller 0 and Controller 1 access independently]
When a connection is made between the device adapter and a disk, the connection is a
switched connection that uses arbitrated loop protocol. This means that a mini-loop is created
between the device adapter and the disk.
Note: We highly recommend that you do not create an Extent Pool or LUN that spans the loops.
[Figure: Switched disk expansion cabling. Server 0 and Server 1 connect through the dual Fibre Channel switches on the server enclosure midplane to its 16 DDMs, and through the Disk EXP ports to the first through fifth SBOD expansion enclosures (16 DDMs each) on switched Loop 0 and Loop 1; the links auto-negotiate 1 Gbps or 2 Gbps]
The host ports can be either Fibre Channel or FICON (longwave or shortwave). You have to
change the port profile manually, by using the GUI or CLI, when connecting FCP or FICON
hosts. Each host port has an SFP module with an LC connector and negotiates 2 Gbps or
1 Gbps automatically.
The host servers supported by the DS6000 for each host port interface can be found at:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
Figure 2-23 shows the performance data of the host adapter and port compared with the host
adapter of ESS 800. The DS6000 host adapter performs much faster than the ESS 800 host
adapter.
Once the necessary information is ready, then Disk Magic can be run to evaluate the
alternatives for the DS6000 hardware configuration. As with other components of the DS6000
hardware configuration, consider that the DS6000 host adapters are not an isolated factor
when estimating the overall DS6000 performance. Instead, they must be considered
together with other important factors, like the I/O workload characteristics, the cache size, the
disk drive capacity and speed, and the back-end data layout and FC-AL loops.
It is the very rich connectivity options of Fibre Channel technology that have resulted in the
Storage Area Network (SAN) implementations. The limitations seen on SCSI in terms of
distance, performance, addressability, and connectivity are overcome with Fibre Channel and
a SAN.
The DS6000 with its Fibre Channel/FICON host adapters provides Fibre Channel Protocol
(FCP, which is SCSI traffic on a serial fiber implementation) interface, for attachment to open
systems that use Fibre Channel adapters for their connectivity.
Note: The Fibre Channel/FICON host port supports either FICON or FCP, but not
simultaneously; the protocol to be used is configurable on a port-by-port basis.
The DS6000 supports up to 8 host ports, which allows for a maximum of 8 FCP ports per
DS6000. Each Fibre Channel/FICON host port provides one port with an LC connector type.
There are cable options that can be ordered with the DS6000 to enable connection of the
adapter port to an existing cable infrastructure.
As SANs migrate to 2 Gbps technology, your storage should be able to exploit this bandwidth.
The DS6000 Fibre Channel/FICON ports operate at up to 2 Gbps. The adapter
auto-negotiates to either 2 Gbps or 1 Gbps link speed, and will operate at 1 Gbps unless both
ends of the link support 2 Gbps operation.
There are two types of host adapter ports you can select: Longwave or shortwave. With the
longwave laser, you can connect nodes at distances of up to 10 km (non-repeated). With the
shortwave laser, you can connect at distances of up to 300 meters. The distances can be
extended if using a SAN fabric.
When equipped with the Fibre Channel/FICON host ports, these ports can participate in three
types of Fibre Channel topology by setting the port topology:
Port topology: SCSI-FCP
– Fibre channel topology: Point-to-point or Switched fabric
Port topology: FC-AL
– Fibre channel topology: Arbitrated loop
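Changing a port between these topologies is done with the DS Storage Manager GUI or the DS CLI. As a minimal sketch using the DSCLI setioport command (the port IDs I0001 and I0002 are illustrative), a port could be set to FICON or to switched-fabric FCP as follows:
dscli> setioport -topology ficon I0001
dscli> setioport -topology scsi-fcp I0002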
[Figure: Channel performance comparison of ESCON, FICON, FICON Express, and 2 Gbps FICON Express2 on G5, G6, z890, z990, and z9 servers, in maximum 4K I/Os per second and maximum bandwidth (MB/s)]
As you can see, the FICON Express2 channel as first introduced on the zSeries z890 and
z990 represents a significant improvement in both 4K I/O per second throughput and
maximum bandwidth capability compared to ESCON and previous FICON offerings.
Note: This performance data was measured in a controlled environment running an I/O
driver program. The actual throughput or performance that any user will experience will
vary depending upon considerations such as the amount of multiprogramming in the user’s
job stream, the I/O configuration, the storage configuration, and the workload processed.
The DS6000 series provides only 2 Gbps FCP ports, which can be configured either as
FICON to connect to zSeries servers or as FCP ports to connect to Fibre Channel-attached
open systems hosts. For example, just two FICON Express2 channels have the potential to
provide roughly a bandwidth of 2 x 270 MB/s, which equals 540 MB per second. This is a very
conservative number.
I/O rates with 4 KB blocks are about 13000 I/Os per second per FICON Express2 channel,
again a conservative number. For example, two FICON Express2 channels have the potential
of over 26000 I/Os per second with the conservative numbers. These numbers vary
depending on the server type used.
Because of its greater bandwidth, the DS6800 can achieve higher data rates than an
ESS 800, and it can outperform the ESS 800 for some sequential workloads. But in a z/OS
environment, a typical transaction workload might perform better on an ESS 800 Turbo II with
a large cache configuration than on a DS6800.
[Figure: FICON attachment to zSeries: up to 8 FICON ports per DS6800, each port with an LC connector type, longwave or shortwave option, up to 200 MB/sec full duplex]
These characteristics allow more powerful and simpler configurations. The DS6000 supports
up to 8 Fibre Channel/FICON host ports, which allows for a maximum of 8 FICON ports per
machine.
Note: The Fibre Channel/FICON host port supports either FICON or FCP, but not
simultaneously; the protocol to be used is configurable on a port-by-port basis.
Each Fibre Channel/FICON host port provides one port with an LC connector type. The
adapter is a 2 Gbps card and provides a nominal 200 MB/s full-duplex data rate. The adapter
will auto-negotiate between 1 Gbps and 2 Gbps, depending upon the speed of the connection.
There are two types of host adapter cards you can select: Longwave and shortwave. With
longwave laser, you can connect nodes at distances of up to 10 km (without repeaters). With
shortwave laser, you can connect at distances of up to 300 m.
Each Fibre Channel/FICON host adapter provides one port with an LC connector type. There
are cable options that can be ordered with the DS6000 to enable connection of the adapter
port to an existing cable infrastructure.
Topologies
When configured with the FICON attachment, the DS6000 can participate in point-to-point
and switched topologies. The supported switch/directors for FICON connectivity can be found
at:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
For more information about host attachment see Chapter 5, “Host attachment” on page 143.
Figure 2-27 on page 50 to Figure 2-29 on page 51 show the preferred path I/O activity of the
DS6000. As illustrated in Figure 2-27 on page 50, when the host has multiple paths, but only
one path to each server, I/O is always active/standby for a LUN (Extent Pool); an alternate
path is used only if the server or path fails. When a host has multiple paths to each server,
I/O can be load-balanced or round-robin across the paths connected to one side of the
servers. If one of those paths fails, the remaining paths are used for I/O without causing a
failover; failover occurs only when a server fails. For a large-capacity, high-performance
configuration, which is required especially for sequential workloads, multiple paths to each
server may be effective. But if the configuration is small or the workload is random, a
multipath configuration to each server may not be needed, because the DDMs can saturate
earlier than the host ports.
Depending on the operating system and multipath driver, users have to configure the OS or
driver settings to determine the I/O activity (load-balancing, round-robin, or active/standby).
For example, the Subsystem Device Driver (SDD) default setting is load balancing; if you
want to set another policy, you have to use the datapath set command.
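As an illustrative sketch of the SDD commands (the device number 0 is hypothetical), the current policy can be displayed and then changed to round-robin as follows:
datapath query device
datapath set device 0 policy rr
With SDD, rr selects round-robin, lb selects load balancing (the default), and fo restricts the device to failover-only operation.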
Note: When configuring multipath, the host must have the same number of Host Bus
Adapters (HBAs) as the number of DS6000 host ports used, to get the most effective
performance. If the number of HBAs is less than the number of host ports, the HBAs may
become a bottleneck.
If a host has only a single path to one side of the servers, the host can still access the LUNs
related to the other side of the server, as shown in Figure 2-29 on page 51; the I/O goes
through the interconnect bridge between the servers. But from a performance and RAS point
of view, we strongly recommend against this type of configuration.
[Figures 2-27 to 2-29: Preferred path configurations. A host with host ports attached through Fibre Channel switches 0 and 1 to Server 0 and Server 1, with one or more paths to each server]
If you have LUNs (Extent Pools) that are related to only one side of the servers, only half of
the server enclosure resources are used. To use the servers effectively, create at least two
LUNs (Extent Pools), each related to a different server.
2.11.1 Whitepapers
IBM regularly publishes whitepapers which document the performance of specific DS6000
configurations. Typically, workloads are run on multiple configurations, and performance
results are compiled so the different configurations can be compared. For example,
workloads may be run using different numbers of host adapters, or different types of DDMs.
By reviewing these whitepapers, you can make inferences on the relative performance
benefits of different components. This will aid you in choosing the type and quantities of
components which would best fit your particular workload requirements. Your IBM
representative or IBM Business Partner has access to these whitepapers and can provide
them to you.
Logical configuration is the process of subdividing the physical storage devices comprising
your DS6000 into usable logical storage entities.
3.1.1 Isolation
Isolation means providing one set of applications with dedicated DS6000 hardware
resources to reduce the impact of other workloads. Looked at another way, isolation means
limiting one workload to a subset of DS6000 hardware resources so that it will not impact
other workloads. Isolation provides better resource availability for those hardware resources
dedicated to the single workload, and reduces contention with other applications for those
resources. However, isolation limits the single workload to a subset of the available DS6000
hardware, so its maximum potential performance may be reduced. Also, unless an application
has an entire DS6000 dedicated to its use, there is the potential for contention with other
applications for any resources which are not dedicated.
Note: Hosts must be connected to at least one I/O adapter from each of the two servers in
the DS6000 in order to provide connectivity if one path should fail.
3.1.2 Resource-sharing
Multiple resource-sharing workloads may have logical volumes on the same Ranks, and may
access the same DS6000 I/O adapters or even I/O ports through their SAN connections.
Resource sharing allows a workload to access more DS6000 hardware than could be
dedicated to the workload, providing greater potential performance, but this hardware sharing
may result in resource contention between applications that impacts performance at times.
3.1.3 Spreading
Spreading means distributing and balancing workload across all of the DS6000 hardware
resources available, including:
Server0 and server1 (including cache and processor resources)
Ranks
Host adapters
Spreading applies to both isolated workloads and resource-sharing workloads. The DS6000
hardware resources allocated to either one isolated workload or multiple resource-sharing
workloads should be balanced evenly across server0 and server1. That is, Ranks allocated
for either one isolated workload or multiple resource-sharing workloads should be assigned to
server0 and server1 in a balanced manner.
For either an isolated or a resource-sharing workload, volumes and host connections should
be distributed in a balanced manner across all DS6000 hardware resources available to that
workload.
One exception to the recommendation of spreading volumes is the case of files or datasets
which will never be accessed simultaneously, such as multiple log files for the same
application, where only one log file will be in use at a time.
Host connections should also be configured as evenly as possible across the I/O adapters
available to either an isolated or a resource-sharing workload. Where possible, do not share a
host adapter connection with remote mirroring traffic.
The next step is identifying balanced hardware resources that can be dedicated to the
isolated workload.
The third step is identifying the remaining DS6000 resources to be shared among the
resource-sharing workloads.
The final step is assigning volumes and host connections to the workloads in a way that is
balanced, and spread - either across all dedicated resources (for the isolated workload) or
across all shared resources (for the multiple resource-sharing workloads).
Workloads which require different disk drive types (capacity and speed), different RAID types
(RAID5 or RAID10), or different storage types (CKD or FB) require isolation to different
DS6000 arrays. Workloads that use different I/O protocols (FCP or FICON) require isolation
to different I/O ports. Organizational considerations may also dictate isolation.
However, even workloads that use the same disk drive types, RAID type, storage type and I/O
protocol (and without a user requirement for isolation) should still be evaluated for separation
or isolation requirements.
High priority workloads should be considered for isolation to dedicated DS6000 hardware
resources to ensure that they will not be subject to contention. Database online transaction
processing workloads may require dedicated resources in order to achieve better service
levels.
Workloads with very heavy, continuous I/O access patterns should be considered for isolation
to prevent them from consuming all available DS6000 hardware resources and impacting the
performance of other workloads.
Isolation of only a few known heavy-hitting workloads often allows the remaining workloads to
share hardware resources and achieve acceptable levels of performance. Some examples of
I/O workloads or files/datasets which often have heavy and continuous I/O access patterns
are:
Sequential workloads
Log files or datasets
Tape simulation on disk
Business Intelligence and Data Mining
Mail applications that require reading and possibly updating every mailbox
Sort/work datasets or files
Disk copies (including Point in Time Copy background copy or Remote Mirror volumes)
Engineering/scientific applications
Video/imaging applications
Batch update workloads
Workloads for all applications for which DS6000 storage will be allocated should be taken into
account, including current workloads that will be migrated from other storage subsystems,
new workloads planned for the DS6000, and projected growth in all application workloads.
For existing applications, historical experience should be considered first. For example, is
there an application where certain datasets or files are known to have heavy, continuous I/O
access patterns? Is there a combination of multiple I/O workloads that would cause
unacceptable performance if their peak times occurred simultaneously?
For existing applications, performance monitoring tools available for the existing storage
subsystems and server platforms can also be used to understand current application
workload characteristics such as:
Read/Write ratio
Random/sequential ratio
Peak workload (I/Os per second for random access, and MB per second for sequential
access)
Peak workload periods (time of day, time of month)
Requirements for new application workloads and for current application workload growth must
be projected.
The Disk Magic modeling tool can be used to model the current or projected workload and
estimate DS6000 hardware resources required.
For more information about performance monitoring tools, see Chapter 4, “Planning and
monitoring tools” on page 85.
For more information about workload characteristics, see Chapter 12, “Understanding your
workload” on page 407.
For more information about Disk Magic, see 4.1, “Disk Magic” on page 86.
The dedicated DS6000 resources should be balanced across DS6000 components, such as:
Ranks assigned to server0 Extent Pools and Ranks assigned to server1 Extent Pools
A pair of ports consisting of one port from an I/O adapter on server0 and one from an I/O
adapter on server1 within the DS6000
If a workload is to be assigned two dedicated I/O ports, one should be on an I/O adapter that
is managed by server0, and the other should be on an adapter managed by server1.
The DS6000 resources that will be shared should be balanced across DS6000 components,
such as:
Ranks assigned to server0 Extent Pools and Ranks assigned to server1 Extent Pools
Note: The disk drives in the DS6000 enclosures have a dual ported FC-AL interface.
Instead of forming an FC-AL loop, each disk drive is connected to two Fibre Channel
switches within each enclosure. With this switching technology there is a point-to-point
connection to each disk drive. This allows maximum bandwidth for data movement,
eliminates the bottlenecks of loop designs, and allows for specific disk drive fault indication.
The physical DDM attachment connectivity within the DS6000 enclosure is shown in
Figure 3-1.
Figure 3-2 on page 61 is a schematic illustration of a full capacity DS6000 with the maximum
of seven expansion enclosures (1750-EX1), providing a total of 16 Arrays which have been
configured here as 16 Extent Pools. Each expansion enclosure is attached to each of the two
DS6000 controllers by a pair of Fibre Channel connections as shown here. We suggest that
the DS6000 server enclosure be physically placed mid-way in its frame, as we have indicated
in this diagram, so that all the expansion enclosures on Switched Loop 0 can be placed above
the base unit, and all the expansion enclosures on Switched Loop 1 can be placed below the
base unit. This makes the expansion enclosure Fibre Channel cabling simpler to manage.
[Figure 3-2: A full-capacity DS6000. Each 1750-EX1 expansion enclosure (ids 10, 11, 12, and so on) contributes two Ranks, each configured as its own Extent Pool (for example, id 10 holds Rank 2/Ext Pool 2 and Rank 3/Ext Pool 3, and id 11 holds Rank 4/Ext Pool 4 and Rank 5/Ext Pool 5), balanced across Loop 0 and Loop 1]
Each expansion enclosure is added to the configuration in a particular location on its specific
drive I/O Fibre Channel loop in order to spread workload as evenly as possible across the two
DS6000 servers. The two Fibre Channel loops are evenly populated as the number of
expansion enclosures increases.
The DS6000 can contain up to 128 disk drives of different capacities. The DS6000 supports
72.8 GB and 145.6 GB disk drives at 10,000 or 15,000 rpm, and 300 GB disk drives at 10,000
rpm. The same disk technology and capacities are available for all DS6000
attachable servers via FCP and FICON attachment. The DS6000 drive sets that hold the disk
drives are installed in the DS6000 base frame and, if needed, up to seven expansion
enclosures may be used (as Figure 3-2 on page 61 illustrates). The base frame of the DS6000
can hold 8 or 16 disk drives, as two or four drive sets; likewise, each expansion enclosure can
hold 8 or 16 disk drives, in two or four drive sets.
The overriding consideration for configuring is the actual end user requirements. However,
before configuring a mix of disk geometries, you should consider the increased cost of
spares, and the increased complexity of your system.
While it is possible to mix drive sets with different geometries (speed and capacity) across
both drive loops, in general, we do not recommend it. Each drive geometry used will be
allocated its own pair of global spares of the same or greater capacity on its own I/O loop,
resulting in the possibility of inefficient use of your installed capacity if you spread the drive
types across drive loops. A global spare means that the spare is available to any drive
enclosure on the same drive loop.
For example, a DS6000 configured with a mixture of 300 GB, 145.6 GB and 72.8 GB DDMs to
achieve 13.5 TB of usable capacity could use 8 DDMs as spares in order to meet the
requirement of 2 spares of the same or greater capacity for each device geometry, and some
of these would be sparing smaller capacity drives on the loop. Overall, this configuration
utilizes approximately 73% of the raw capacity. A configuration with a similar raw disk capacity
comprising all 145.6 GB DDMs, only uses 4 DDMs as spares, utilizing approximately 83% of
the raw capacity.
A larger DDM may also be utilized as a spare for a smaller drive. In this case, when the
original failed DDM is repaired, the DS6000 will physically fail operations back from the larger
DDM, which was acting as a smaller spare, to the replacement DDM. This requires additional
I/O operations.
Important: The DS6000 is restricted to two address groups, and each of these can be
associated with either open systems data or with zSeries data, but not both. Refer to
“Address groups” on page 71 for further discussion.
CKD
In count-key-data (CKD) organization, the data field stores the user data. Also, because the
data records can be variable in length, they all have an associated count field that indicates
the user data record size. Then the key field is used to enable a hardware search based on a
key. However, this is not generally used for most data anymore. Extended count-key-data
(ECKD™) is a more recent version of CKD that uses an enhanced S/390® channel command
set.
Array Sites
The DS6000 is available with a configuration entity or Array Site consisting of one disk drive
set. A fully populated server enclosure or storage enclosure has two pairs of Array Sites, as
can be seen in Figure 3-3 on page 64 showing Array Site locations for a DS6000 server
enclosure. The DDMs selected for an Array Site will be selected from the same disk
enclosure string by the DS6000 and you have no way to influence this process. An example of
the relationship between DDMs and their associated Array Site is shown in Example 3-1 on
page 65. The Array Sites have been shown here as S1 through S4. An Array Site is the basic
building block for Array creation.
As we can see in Figure 3-3, there is a predetermined, but non-contiguous affinity between
the disk drive sets and Array Sites. In this example, Array Site S1 has DDMs from Drive Set 1
and Drive Set 3.
Note: When you first create your new logical 8 DDM Array from a pair of 4 DDM Array
Sites, it is good practice to select an adjacent odd/even pair of Array Sites for each Array. In
this example we chose Array Sites S1/S2 as our first pair and formed Array A0. Then we
chose Array Sites S3/S4 to form Array A1.
Important: When creating a new Array, the DS6000 configuration rules will prevent you
from inadvertently selecting a pair of Array Sites that are not within the same physical
enclosure.
Important: The Array Site numbering is selected by the DS6000 based on both the order
of cabling the Storage Enclosures together to form a DS6000 Storage subsystem, and on
the order in which Array Sites were populated with disk drive sets. There is no way to
pre-determine the Array Site numbering if one or more expansion enclosures are attached
to the server enclosure before you first power on the DS6000.
In all our examples we used a DS6000 that has two fully populated expansion enclosures that
were attached one at a time after the server enclosure was installed; we subsequently
added a third expansion enclosure with 8 DDMs towards the end of the configuration process.
Consequently, our Array Site numbering started with Sites S1-S4 in the server enclosure,
Sites S5-S8 in the first expansion enclosure, and Sites S9-S12 in the second expansion
enclosure. The third, half-populated expansion enclosure has Array Sites S13 and S14.
Figure 3-4 Sample of Array Site locations within three expansion enclosures (for example, 1750-EX1 id 11 holds Array Sites S9 and S10, and S11 and S12)
Here is another look at the example of the relationship between Array Sites and DDMs in a
DS6000 storage enclosure as seen in the arsite column in Example 3-1. You can see that the
DS6000 has apparently not chosen a strictly sequential arrangement for the location for each
DDM within an Array Site. This is because of the requirement to allocate one spare from each
of the first two Array Sites on each loop (S1 and S2). You can see that by the time we get to
the second expansion enclosure, we now see for Array Sites S9 - S13 that there is now a
one-to-one relationship between the Array Sites and the DDM position within the enclosure.
The spare DDM locations are likely to change over time, as sparing takes effect on each of
the two drive loops.
Array size
Your first decision is to choose either a 4 DDM Array or an 8 DDM Array as your starting size.
Based on this initial planning decision, one or two of the DS6000 4 DDM Array Sites will be
used to create an Array. That is to say, we can plan to use one or two disk drive sets in each
Array.
Note: The 8 DDM Arrays will provide you with more usable storage from your DDMs than 4
DDM Arrays. For example, a fully populated storage enclosure containing 16
146 GB DDMs will provide approximately 1.9 TB of usable storage when configured with 8
DDM RAID 5 Arrays, and 1.6 TB of usable storage when utilizing 4 DDM RAID 5 Arrays
(assuming no spares are required within this storage enclosure).
RAID 5 or RAID 10
Having performance in mind, we must determine which RAID organization we need, and
begin by selecting the number of drive sets we need in an Array. We do this by associating
one or two Array Sites, each with four DDMs associated with it, to become an Array. The
DS6000 Arrays can be defined as either RAID 5 or RAID 10.
We recommend the use of 8 DDM Arrays in RAID 5, if all the DDMs have similar speed and
capacity characteristics, as this will provide more usable capacity than a RAID 10
implementation, and acceptable performance for most applications with a normal range of I/O
requirements. If you have a specific performance requirement, such as providing for an
application with a high random write requirement, such as some data base applications, you
may need to consider creating some RAID 10 arrays utilizing single or dual Array Sites.
Always begin your configuration of a new DS6000 by verifying that all the expected Array
Sites are available, as seen in Example 3-2 on page 67 under the column headed arsite.
Follow this by ensuring that you group the Array Sites in adjacent odd/even pairs in order to
create an Array with disks from the correct disk sets within the DS6000.
Example 3-3 shows an Array creation utilizing the first two Array Sites.
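As an illustrative sketch of such a command (the confirmation message is omitted), an 8 DDM RAID 5 Array can be created from Array Sites S1 and S2 as follows:
dscli> mkarray -raidtype 5 -arsite S1,S2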
We should then confirm that we configured the appropriate Array Sites by checking that we
successfully changed the status of our Array Sites, and now have a newly created Array.
Review the arsite details in Example 3-4.
Important: The Ranks are created in the order in which that they are defined, so take care
to ensure that the first Rank you define uses Array A0, the second uses A1 and so on, if
you want to match Array numbers with their corresponding Rank. We recommend that you
maintain this simple Rank to Array relationship in order to simplify subsequent
performance management.
If you inadvertently make a mistake when assigning Arrays to Ranks, such as assigning A0 to
R1, and A1 to R0, you could possibly recover later in the logical building process by allocating
R1 to Extent Pool P0, and R0 to Extent Pool P1, but this would continue to be confusing for
other management personnel.
Note: If you are implementing a mixed open systems and z/OS environment, you need to
decide here if you want to separate each data type across the two Switched FC-AL loops.
You can then selectively set up the CKD and FB Ranks to achieve this separation across
the two loops if required.
Tip: You may want to separately manage the Arrays with different capacities, such as
those with spares assigned and those with differing DDM geometries.
In Example 3-5 we create our first Rank from Array A0. There is no parameter to allow you to
associate the R0 identifier with your Rank definition. The Ranks are created in sequential
order of definition, starting with R0, so exercise care at this stage, or you may inadvertently
introduce some potential performance bottlenecks by causing subsequent allocations to be
spread unevenly across the two storage servers in your DS6000.
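As a minimal sketch (assuming an open systems Rank, hence -stgtype fb), the Rank is created from Array A0 and then listed for verification:
dscli> mkrank -array A0 -stgtype fb
dscli> lsrank -l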
Now we can determine that the Rank was successfully defined and see for the first time the
available capacity in FB or CKD extents of our Rank as seen in Example 3-6, where we note
that we have 773 FB extents available in Rank R0.
If you have both open and zSeries hosts attached, you will need a separate Extent Pool for
each type of logical disk (FB or CKD), due to the different formatting of the Ranks that make
up each Extent Pool type.
Example 3-7 shows an example of defining Extent Pools using the CLI.
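As a minimal sketch, assuming one FB Extent Pool per server (rank group 0 belongs to server0 and rank group 1 to server1; the name ITSO_P1 is illustrative):
dscli> mkextpool -rankgrp 0 -stgtype fb ITSO_P0
dscli> mkextpool -rankgrp 1 -stgtype fb ITSO_P1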
Note: There is no parameter in the mkextpool command to allow you to refer to a specific
Extent Pool, so remember that the first one defined will be P0 and the next one will be P1
and so on. Be sure that you associate the even numbered Extent Pools with server0, and
the odd numbered Pools with server1.
Always verify that the Ranks were associated with the desired Extent Pools by issuing the
lsrank -l command, as shown here in Example 3-8.
We recommend that you put only Ranks associated with DDMs with the same capacity and
rotational speed into the same Extent Pool when adding more than one Rank to the same
Extent Pool.
Tip: For performance management we recommend you create Extent Pools comprising a
single Rank only, unless you need to define logical disks that are larger than a single-Rank
Extent Pool, or need to utilize all available capacity.
Here we need to ensure that we associate the Ranks with a preferred DS6000 server. In
keeping with our philosophy of keeping even numbered components associated with server0,
we matched the even numbered Ranks with an even numbered Extent Pool using the chrank
DSCLI command, as shown in Example 3-9.
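As a sketch of this assignment, keeping even Ranks with even Extent Pools and odd Ranks with odd Extent Pools:
dscli> chrank -extpool P0 R0
dscli> chrank -extpool P1 R1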
The capacity of one or more Ranks can be aggregated into a single Extent Pool and logical
volumes configured in that aggregated Extent Pool are not bound to any specific Rank. This
allows us to define a logical volume up to 2 TB for an FB volume, even when the capacity of a
single Rank is much less than 2 TB. As such, the available capacity of the storage facility can
be flexibly allocated across the set of defined logical subsystems and logical volumes.
Different logical volumes within the same logical subsystem can be configured from different
Extent Pools, although performance of this can be difficult to manage, as it is quite time
consuming to identify the physical location of a user’s LUN that may be experiencing
contention.
However there is one restriction with the LSS now having an affinity to one of the DS6000
servers. All even numbered LSSs (X’x0’, X’x2’, X’x4’, up to X’xE’) belong to server 0 and all
odd numbered LSSs (X’x1’, X’x3’, X’x5’, up to X’xF’) belong to server 1.
All devices in an LSS must be either count-key-data (CKD) for zSeries data or fixed block (FB)
for open systems data. This restriction goes even further. LSSs are grouped into address
groups of 16 LSSs.
Address groups
There is no command parameter to specifically define an address group. Address groups are
created automatically when the first LSS associated with the address group is created and
deleted automatically when the last LSS in the address group is deleted.
Restriction: The DS6000 supports two address groups: address group 0 and address
group 1.
LSSs are numbered X’ab’, where a is the address group and b denotes an LSS within the
address group. So, for example X’10’ to X’1F’ are LSSs in address group 1. All LSSs within
Note: zSeries clients are reminded that the DS6000 does not have support for ESCON
hosts attachment.
LCU
zSeries users are familiar with a logical control unit (LCU). zSeries operating systems
configure LCUs to create device addresses. There is a one to one relationship between an
LCU and a CKD LSS in a DS6000 (LSS X'ab' maps to LCU X'ab'). Logical CKD volumes have
a logical volume number X'abcd' where X'ab' identifies the LSS and X'cd' is one of the 256
logical volumes on the LSS. This logical volume number is assigned to a logical volume when
a logical volume is created and determines the LSS that it is associated with. The 256
possible logical volumes associated with an LSS are mapped to the 256 possible device
addresses on an LCU (logical volume X'abcd' maps to device address X'cd' on LCU X'ab').
When creating CKD logical volumes and assigning their logical volume numbers, users
should consider whether Parallel Access Volumes (PAVs) are required on the LCU and
reserve some of the addresses on the LCU for alias addresses.
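As an illustrative sketch (the Extent Pool, SSID, capacity, and volume ID shown are hypothetical), an LCU and a 3390-3 base volume might be created as follows, leaving the higher device addresses of the LCU free for PAV aliases:
dscli> mklcu -qty 1 -id 10 -ss 0010
dscli> mkckdvol -extpool P2 -cap 3339 -name zprod00 1000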
For open systems, LSSs do not play an important role except in determining which server the
LUN is managed by (and which Extent Pools it must be allocated in) and in certain aspects
related to Metro Mirror, Global Mirror, or any of the other remote copy implementations.
Note: LCUs must be specifically defined, by using the mklcu command or the DS Storage
Manager, before any associated CKD volumes are defined.
There is no command parameter to specifically define an LSS for open systems data. The
LSS definition is implied from the first two characters of the LUN identifier. The volume id of
0001 in Example 3-10 implies an LSS of 00.
Example 3-10 Implied LSS definition
dscli> lsfbvol
Date/Time: 26 July 2005 6:33:25 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
CMMCI9003W No FB Volume instances found in the system.
dscli>
dscli> mkfbvol -extpool p0 -cap 50 -name Test_50GB 0001
Date/Time: 26 July 2005 6:35:12 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
CMUC00025I mkfbvol: FB volume 0001 successfully created.
dscli> lslss
Date/Time: 26 July 2005 6:36:43 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
ID Group addrgrp stgtype confgvols
==================================
00 0 1 fb 1
dscli>
Some management actions in Metro Mirror, Global Mirror, or Global Copy operate at the LSS
level. For example, the freezing of pairs to preserve data consistency across all pairs, in case
you have a problem with one of the pairs, is done at the LSS level. With the option now to put
all or most of the volumes of a certain application in just one LSS, this can make the
management of remote copy operations easier. However, distributing logical volumes in an
LSS over multiple Extent Pools may make managing performance more difficult.
Tip: We recommend assigning one Rank to each Extent Pool, and don’t define LSSs that
span Extent Pools, in order to facilitate simpler performance management.
Figure 3-5 is an example of the relationship between logical volumes and Extent Pools.
Notice that volumes 0800, 0801... are backed by an Extent Pool with an even number (06),
and will be managed by server0 in the DS6000. These volumes are also associated with LSS
08 in our example.
[Figure 3-5: Logical volumes and Extent Pools. For example, volumes 0100 and 0101 (DB2 logs) belong to LSS X'01' and are managed by Controller 1; expansion enclosure 1750-EX1 id 11 contributes Rank 4/Extent Pool 4 and Rank 5/Extent Pool 5]
The DS6000 allocates each logical volume by aggregating the required number of available 1
GB Extents sequentially from the requested Extent Pool. Example 3-11 on page 74 shows a
50 GB logical volume being allocated from Extent Pool P0, which had 773 Extents available
before the allocation, and 723 available following the allocation.
dscli> lsextpool -l
Date/Time: 26 July 2005 6:35:23 IBM DSCLI Version: 5.0.5.6 DS: IBM.1750-1301234
Name ID stgtype rankgrp status availstor (2^30B) %allocated available reserved numvols numranks
====================================================================================================
ITSO_P0 P0 fb 0 exceeded 723 6 723 0 1 1
dscli>
Attention: It is important to balance the allocation of your logical volumes across both
storage servers (server0 and server1) to reduce the potential for I/O imbalance.
In the DS6000, host ports have a fixed assignment to a server (or controller card). In other
words, all data traffic that uses its preferred path to a server avoids having to cross through
the DS6000 inter-server connection to the other server. There is a small performance penalty
if data from a logical volume managed by one server is accessed from a port that is located
on the other server. The request for the logical volume and the data would have to be
transferred across the bridge interface that connects both servers. These transfers add some
latency to the response time. Furthermore, this interface carries other communication traffic
between the servers, such as being used to mirror the persistent memory and for other
inter-server communication. It could become a bottleneck if too many normal I/O requests
also run across it, although it is a high bandwidth, low latency, PCI-X connection.
Open systems hosts should ensure that they use multipath management software such as
IBM’s Multipath Subsystem Device Driver (SDD) that recognizes this preferred path usage
and can preferentially direct I/O requests to the preferred path.
When assigning host ports for open systems usage, always consider preferred pathing,
because the use of non-preferred paths will have a performance impact on your DS6000.
z/OS users already have this preferred path management capability inherent in the z/OS
operating system.
The modeling may be done using the Disk Magic modeling tool, as discussed in 4.1, “Disk
Magic” on page 86.
As is common for data placement and to optimize the DS6000 resources utilization, you
should:
Spread the logical disk allocations evenly across the two DS6000 servers by allocating
them equally from Extent Pools managed by server0 and server1, as this will balance the
I/O load distribution.
Spread the logical disks allocations for important applications across as many of the
DS6000 disks as possible.
Stripe your logical volumes across several Ranks when using a host based logical volume
manager.
Consider placing specific database objects (such as logs) on logical volumes that were
actually configured from different Ranks than those used for database user data and
tablespaces.
All disks in the storage subsystem should have roughly an equivalent utilization. Any disk that
is used more than the other disks is likely to become a bottleneck to performance. A practical
method is to make extensive use of host-based logical volume level striping across disk
drives.
DS6000 logical volumes are composed of Extents. An Extent Pool is a logical construct to
manage a set of Extents. One or more Ranks with the same attributes can be assigned to an
Extent Pool. One Rank can be assigned to only one Extent Pool.
Note: We recommend assigning one Rank per Extent Pool to control the placement of the
data. When creating a logical volume in an Extent Pool made up of several Ranks, the
Extents for this logical volume should be taken from the same Rank if possible. This
implies that you have no logical volumes that span Ranks, and can be done carefully with
the DSCLI. It is much more difficult to micromanage LUN placement while using the DS
GUI.
However, to be able to create very large logical volumes, you must consider having Extent
Pools that span more than one Rank.
Creating Extent Pools made up from one Rank, and then utilizing an open systems
host-based Logical Volume Manager (LVM) to stripe a host Logical Volume over LUNs
created on each Extent Pool, offers a balanced method to evenly spread open systems
data across the DS6000.
Note: z/OS does not provide any LVM functionality, and supports CKD volumes up to a
maximum size of 64 K cylinders (actually 64 K (65536) cylinders less 256, or 65280 cylinders).
Note: The logical volume stripe size has to be large enough to keep sequential data
relatively close together, but not so large that the data stays located on a single Array.
The recommended stripe sizes to define with your host’s logical volume manager are in
the range of 4 MB to 64 MB.
You should choose a stripe size close to 4 MB, if you have a large number of applications
sharing the Arrays, and a larger size, when you have very few host servers or applications
sharing the Arrays.
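As a minimal sketch of such host-level striping, assuming an AIX host on which four DS6000 LUNs from four single-Rank Extent Pools appear as hdisk4 through hdisk7 (supported strip sizes vary by AIX release):
# Volume group built from the four DS6000 LUNs, one per single-Rank Extent Pool
mkvg -y ds6kvg hdisk4 hdisk5 hdisk6 hdisk7
# Logical volume of 100 logical partitions, striped across all four LUNs
# with a 4 MB strip size
mklv -y datalv -S 4M ds6kvg 100 hdisk4 hdisk5 hdisk6 hdisk7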
We do not recommend sharing the PPRC paths with host I/O traffic.
The diagram shown in Figure 3-6 shows a zSeries z9 processor connected to a DS6000 with
two separate FICON Express2 paths to I/O ports that are each under the primary control of
separate DS6000 servers. This configuration is designed to spread the zSeries I/O load
across more resources in the DS6000 as well as enhancing data availability in the unlikely
event of a path failure.
[Figure 3-6: A zSeries host with a blue preferred path (FICON Express2) to Server 0 and a red preferred path (FICON Express2) to Server 1. Each server in the server enclosure has its own cache and NVS; Fibre Channel switches connect the servers to the 16 DDMs in the DS6800 server enclosure and in the expansion enclosure]
FICON channels in the IBM servers were initially operating at 1 Gbps. Subsequently the
FICON technology was enhanced to FICON Express channels in IBM 2064 and 2066
servers, operating at 2 Gbps, and further enhanced with FICON Express2 channels, which
also operate at 2 Gbps, but with an enhanced protocol, making them more efficient. See 10.7,
“FICON” on page 367 for a more detailed discussion of FICON Express2.
The recent announcement for IBM 2094 z9 servers also included FICON Express2
connectivity.
The DS6000 series provides 2 Gbps Host Adapter ports, which can be configured either as
FICON to connect to zSeries servers or as FCP ports to connect to Fibre Channel attached
open systems hosts. The example in Figure 3-6 shows only two FICON Express2 channels.
Two FICON Express2 channels have the potential to provide a bandwidth of approximately 2
x 175 MB/s, or an aggregate of 350 MB/second. This is still a conservative number. Some
I/O rates with 4 KB blocks are in the range of 35,000 I/Os per second or more with a single
DS6000 host port. A single FICON Express2 channel can actually perform up to about 9,000
read hit I/Os per second. Two FICON Express2 channels have the potential of over 13,000
I/Os per second with conservative numbers. These numbers will vary depending on the
server type used.
The ESS 750 has an aggregated bandwidth of about 500 MB/s for highly sequential reads
and about 350 MB/s for sequential writes. The DS6000 has achieved over 1000 MB/s with
64 KB data transfer reads and around 500 MB/s for sequential writes.
In a z/OS environment, a typical transaction workload might perform slightly better on an
ESS 800 Turbo II with a large cache configuration than on a DS6000. This is the only example
where the ESS 800 outperforms the DS6000. In open systems environments, the DS6000
performs better than the ESS 750. This is also true for sequential throughput in z/OS
environments.
The way to spread I/O is by assigning logical disks evenly to Extent Pools managed by each
server in the DS6000.
Sometimes though, you may want to dedicate an Extent Pool or several Extent Pools for a
given host server or application. The overall I/O performance in that case may not be as great
as spreading I/O evenly across all of the DS6000’s resources, but should still be predictable,
especially for the application (or host server) whose storage is isolated.
The DS6000 is very good at detecting I/O patterns. So if your environment does a lot of large
sequential file copying from A to B, you might want to split A-reads and B-writes to different
Extent Pools. Let the reads come from logical disks on one Extent Pool, and the writes go to a
separate set of Extent Pools.
The DS6000 is very good at detecting sequential I/O and adjusting its handling of it
accordingly; however, avoiding directing large reads and writes at the same Extent Pools,
and the underlying Ranks, will improve performance.
Tip: Try to strike a reasonable balance between flexibility and manageability for your
needs.
As you can see, the DS6000 gives you great flexibility when it comes to allocating logical
disk space, as shown in Figure 3-7.
For FB Ranks, logical disk sizes can vary from 1 GB to the full effective capacity of the Extent
Pool, in increments of 1 GB. A DS6000 Extent Pool with one Rank of 145.6 GB disk drives
has a full effective capacity of 773 GB, when configured as a 6+P+S RAID 5 Rank.
For CKD servers, logical disk sizes can vary from a 1 cylinder 3390 device (that is, 849.9 KB),
to a 64 K cylinder 3390 device (that is, 54.8 GB). For iSeries servers, there are presently six
different logical disk sizes supported, in both protected and unprotected mode. More details
on these two modes may be found in Chapter 11, “iSeries servers” on page 387.
SAN implementations
In a Storage Area Network (SAN) implementation, care must be taken in planning the
configuration to prevent the proliferation of disk devices presented to the attached hosts. In a
SAN environment, each path to a logical disk on the DS6000 presents that logical disk to the
host system as a unique physical device, leading to the requirement for a multipath manager,
such as the IBM Multipath Subsystem Device Driver to manage these different images of a
single logical disk. The SAN zones will also affect how many devices are presented to a
server.
The size of logical disks becomes very important when you want to re-assign DS6000 storage
capacity. For example, in an open systems environment, if you have a 200 GB logical disk,
and now you want to divide it into four 50 GB logical disks and assign them to different hosts,
you have to delete the original logical disk and wait for the DS6000 to return the Extents it
was occupying back to their Extent Pool before you can re-assign that space to new LUNs. If
you had chosen the four 50 GB LUNs originally, it would be a simpler process to re-assign
them to different host servers.
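As a sketch of that process (the volume IDs and name are illustrative), the original LUN is deleted and, once its Extents have returned to the pool, the four smaller LUNs are created in its place:
dscli> rmfbvol 0001
dscli> mkfbvol -extpool p0 -cap 50 -name split_#h 0004-0007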
In zSeries environments, we recommend that you use the bigger volumes (3390-9 and larger
devices), which you can do without compromising server performance if you use Parallel
Access Volumes (PAV). But as with open systems, if for any reason you later prefer to use a
different combination of capacity sizes within a specific Extent Pool, the extents will need to
be recovered before reuse.
Note: When allocating zSeries logical volumes, allow sufficient PAV addresses in each
LCU. This may require you to choose fewer, larger base devices to leave enough potential
alias addresses.
Tip: We recommend using dynamic PAV with the alias-to-base numbers shown in the table
in 10.2.2, “PAV and large volumes” on page 359.
Once created, logical disks can simply be deleted and removed from an Extent Pool without
any Rank or Array reformatting requirement on your part. Behind the scenes, the DS6000 will
recover these recently freed up Extents and return them to their Extent Pool.
The process of recovering Extents cannot be directly monitored but you will be able to
perform configuring operations from other, unaffected Extent Pools while the DS6000 is
processing freed up Extents.
The DS6000 supports a maximum of 256 host login IDs per Fibre Channel/FICON host
port, with a total of 1024 host logins per DS6000. This contrasts with the ESS 800, which
supports up to 128 host logins per adapter port and a maximum of 512 host logins per ESS,
and with the DS8000, which supports up to 509 host logins per adapter port and a maximum
of 8192 host logins per DS8000.
iSeries users are reminded that there is a maximum of 32 logical disks supported on each
Fibre Channel adapter in an iSeries server.
These numbers are important when considering the implementation of DS6000 Copy
Services, the maximum number of hosts to attach to a given DS6000, and the number of
logical disks to assign to each host.
When considering which logical disk size to use, it is also important to consider that the
DS6000 attachment type a host uses will limit the number of logical disks that can be
presented to the host. (Typically, a SCSI-attached host operating system can support 8 or 32
logical disks per adapter, while the limit for FCP-attached hosts is typically 256 logical disks.)
There is a performance penalty if data from a logical volume managed by one server is
accessed from a port that is located on the other server. The request for the logical volume
and the data would have to be transferred across the bridge interface that connects both
servers. These transfers add some latency to the response time. Furthermore, this interface
is also used to mirror the persistent memory and for other inter-server communication. It
could become a bottleneck if too many normal I/O requests ran across it, although it is a high
bandwidth, low latency connection.
(Figure: DS6000 controller architecture. Each of the two controller cards, Server0 and
Server1, contains a host adapter chipset, a PowerPC processor, volatile and persistent
memory, and a device adapter chipset; the two cards are connected by the processor
interconnect.)
If you need more than two paths from a host to the DS6000, spread the attached host I/O
paths evenly between the two sets of HAs on the DS6000 servers. This will ensure that you
achieve good aggregate I/O bandwidth, and that the host retains adequate access to its data.
We recommend the inclusion of some supported multipath management software, such as
the Multipath Subsystem Device Driver, which is included with each DS6000 and is
discussed briefly below in “Multipathing software” on page 83.
For best reliability and performance, it is recommended that each attached host has two
connections, one to each controller as depicted in Figure 3-9 on page 83. This allows it to
maintain connection to the DS6000 through both a controller failure and Host Adapter (HBA
or HA) failure.
We recommend that more than one SAN switch be provided to ensure continued availability.
For example, four of the eight fibre ports in a DS6000 could be configured to go through each
of two directors. The complete failure of either director leaves half of the paths still operating.
Multipathing software
Each host that is using more than a single path to the DS6000 requires a mechanism to allow
the attached operating system to manage multiple paths to the same device, and to also
show a preference in this routing so that I/O requests for each LUN are directed to the
preferred controller. Also, when a controller failover occurs, attached hosts that were routing
all I/O for a particular group of LUNs (LUNs on either even or odd LSSs) to a particular
controller (because it was the preferred controller) must have a mechanism to allow them to
detect that the preferred path is gone. It should then be able to re-route all I/O for those LUNs
to the alternate, previously non-preferred controller. Finally, it should be able to detect when a
controller comes back online so that I/O can now be directed back to the preferred controller
on a LUN by LUN basis (determined by which LSS a LUN is a member of). The mechanism
that will be used varies by the attached host operating system, as detailed in the next two
sections.
The Subsystem Device Driver (SDD) is available at:
http://www.ibm.com/servers/storage/support/software/sdd
SDD provides availability through automatic I/O path failover. If a failure occurs in the data
path between the host and the DS6000, SDD automatically switches the I/O to another
available path for that host. SDD will also set the failed path back online after a repair is made.
SDD also improves performance by sharing I/O operations to a common disk over multiple
active paths to distribute and balance the I/O workload for some open systems environments.
SDD also supports the use of the DS6000 preferred path to a LUN.
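As a brief illustration, the SDD datapath command can be used to verify that all paths are
available and that I/O is being routed to the preferred controller; this is a sketch, and the
exact output varies by host platform and SDD release:

datapath query adapter    (list host adapters and their state)
datapath query device     (list each vpath device, its paths, and path selection counts)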
SDD is not available for all supported open operating systems, so attention should be directed
to the IBM TotalStorage DS6000 Host Systems Attachment Guide, GC26-7680, and the
interoperability Web site for direction as to which multipathing software will be required.
Some devices, such as the IBM SAN Volume Controller, do not require any multipathing
software because the internal software in the device already supports multipathing and
preferred path. The interoperability Web site is located at:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
These functions are part of the zSeries architecture and are managed by the channel
subsystem on the host and the DS6000.
Logical paths are established through the FICON port between the host and some or all of the
LCUs in the DS6000, controlled by the hardware configuration definition (HCD) for that host.
This happens for each physical path between a zSeries processor and the DS6000. There
may be multiple system images or Logical Partitions (LPARs) in a zSeries processor and
logical paths are established for each system image. The DS6000 then knows which FICON
paths can be used to communicate between each LCU and each host.
Provided you have the correct maintenance level, all major zSeries operating systems should
support preferred path (z/OS, z/VM®, VSE/ESA™, TPF).
For RMF and RMF Magic for zSeries, see 10.9, “DS6000 performance monitoring tools” on
page 371.
Note: Disk Magic is for the IBM representative to use. Nevertheless, DS6000 capacity
and sizing planning works best when both the customer and the IBM representative are
familiar with the tool. Customers should contact their IBM representative to do the Disk
Magic runs when planning their DS6000 hardware configurations.
Disk Magic for Windows is a product of IntelliMagic, licensed exclusively to IBM and IBM
Business Partners.
When modeling an open systems workload, you will always start by entering data into the
Disk Magic dialogs. This should not be a problem since the amount of data entry is minimal.
The performance information you need to gather in this case is I/O rate, transfer size, read
percentage and read hit ratio.
For AIX and iSeries, automated input is also available.
For z/OS workload modeling, Disk Magic can model performance at either the subsystem
level or device level. Subsystem level performance modeling was designed to get realistic
results quickly, with a minimal amount of data entry, which can be obtained from RMF reports.
The data can also be obtained using RMF Magic, which is a z/OS only tool, and will be
described in 10.9.8, “RMF Magic for Windows” on page 378. The output from RMF Magic can
be used as an automatic input for Disk Magic, so no manual data entry needs to be done.
RMF Magic can produce the data at the disk subsystem level, the LCU level for each disk
subsystem, or the device/volume level. This Disk Magic input data is called the DMC file.
Disk Magic contains advanced algorithms that can substitute data that normally would have to
be entered manually, for both z/OS and open systems modeling. For instance, if cache
statistics are not provided, then the Automatic Cache Modeling feature will generate realistic
values based on other inputs provided.
Disk Magic is good for modeling random workloads. For sequential workloads Disk Magic
tends to be too optimistic.
When working with Disk Magic, always make sure to feed in accurate and representative
workload information, because Disk Magic results depend on the input data provided. Also
carefully estimate future demand growth, as this will be fed into Disk Magic for modeling
projections on which the hardware configuration decisions will be made.
Once the valid base model is created, you proceed with your projections. Essentially, you will
be changing hardware configuration options of the base model to decide on the best
DS6000 configuration for a given workload. Or you can modify the workload values that you
initially entered so that, for example, you can see what happens when your workload grows
or its characteristics change.
In this section we provide an overview of the more relevant dialog panels that Disk Magic
presents when doing zSeries modeling, and we discuss what information to complete in
those panels.
Here we start with a disk subsystem where the workload is currently running.
Hardware configuration
Figure 4-2 on page 89 shows the Disk Subsystem - DSS1 dialog window with the ESS F20
disk subsystem as an example. The General tab is used to enter hardware information like
the Hardware Type (that basically identifies the base model machine), as well as the cache
and NVS size information. In this dialog panel, the number of logical control units (LCUs)
within the disk subsystem is entered; if this is an existing ESS, then this would be the number
of CKD LSSs. The Parallel Access Volume box must be checked to indicate if aliases are
used for this ESS.
By clicking Hardware Details we open an ESS Configuration Details dialog window, shown in
Figure 4-3. This dialog window is used to provide further information about the disk
subsystem hardware configuration. The fields displayed in this dialog window will depend on
the hardware type (see Figure 4-2) that was initially selected. In this ESS F20 example, we
need to choose how many Host Adapters and Device Adapters are configured with the ESS.
Also we need to select the cache size. The NVS size is not selectable, because it is a fixed
size for the ESS.
The number of 8-packs will be calculated based on the number of logical volumes that will be
defined in the zSeries disk tab.
Interfaces panel
Figure 4-4 on page 90 shows the panel where you define the interface connections used
between the server and the disk subsystem. If Remote Copy is used, you should define the
Remote Copy function used, the connection to the Remote Copy site, and the distance
between the Primary and Secondary site.
zSeries workload
Figure 4-5 on page 91 is where you enter the workload characteristics of all the LCUs
associated to the ESS F20. There is a tab for every LCU defined in the General Panel. The
workload characteristics include:
I/O Rate
IOSQ Time
Pending Time
Disconnect Time
Connect Time
If Remote Copy is used, you can also define the percentage of the total workload that is in
a Remote Copy relationship.
Hardware configuration
Figure 4-9 on page 95 shows the Disk Subsystem - DSS1 dialog window with the DS6800
disk subsystem. The General tab is used to enter hardware information like the Hardware
Type (that basically identifies the base model machine), as well as the cache and persistent
memory size information.
Interfaces panel
Figure 4-10 on page 96 shows the panel where you define the interface connections used
between the server and the disk subsystem. If Remote Copy is used, you should define the
Remote Copy function used, the connection to the Remote Copy site, and the distance
between the Primary and Secondary site. Note that the distance here is defined in kilometers.
If you click Utilizations, you will get a new window that shows the utilizations of the various
components of the DS6800. Any resource that is a bottleneck will be shown with a red
background. If this happens, you should increase that resource to resolve the bottleneck. An
amber color should be considered a caution that if the workload grows, you may soon reach
the limit of that particular resource.
The main projection we should do is how the service time will change with the growth in the
workload. Figure 4-12 on page 98 shows the graph of the service time as the I/O rate grows.
In this particular example, we see that the response time jumps significantly as the I/O rate
approaches 4700 I/Os per second.
(Figure 4-12: Service time in msec plotted against I/O rates from 1700 to 4700 I/Os per
second.)
Next we should plot the HDD/DDM utilization. In Figure 4-13 on page 99 you can see that at
4700 IO/second this number reaches 93%. An HDD/DDM utilization greater than 50% will
have an impact on the service time.
With Disk Magic it is easy to do a reconfiguration and see its impact. For example, we
can increase the number of DDMs and observe the impact on the service time.
(Figure 4-13: HDD/DDM utilization in percent plotted against I/O rates from 1700 to 4700
I/Os per second.)
z/OS environment
For each control unit to be modelled (current and proposed), we need:
Control unit type and model
Cache size
NVS size
DDM size and speed
Number of channels
PAV: is it installed or not
For PPRC:
Distance
Number of links
In a z/OS environment, the SMF record types 70 through 78 are required to do a Disk Magic
modeling. The easiest way to send the SMF data to IBM is through ftp. To avoid huge dataset
sizes, the SMF data can be separated by SYSID or by date. The SMF dataset needs to be
tersed before putting it on the ftp site. Example 4-1 on page 100 shows the instructions on
how to terse the dataset.
UNIX environment
In a UNIX environment, we need the following information for each control unit to be included
in this study:
Vendor, machine and model type, for example IBM 2105-F20
How many disks are installed
How many servers are allocated and sharing these disks
What is the size and speed of the DDMs, for example 36 GB 15K rpm drives
Are there any issues regarding performance?
Number of SCSI channels, number of Fibre channels on each disk control unit and on
each server
Cache size
Direct attached or SAN attached
Data collection
For all servers attached to the disk to be modeled, we will need to collect iostat data. If there
are more than five servers with significant workload, we will need to determine an appropriate
strategy for collecting data from only a subset of these servers and extrapolating those
workload characteristics.
The data collection is done by setting up an iostat run with the appropriate flags set, so that
the proper data comes out of the report. Below are examples of the commands that should
be used.
Sample command for HP-UX; this one gives only total I/Os and average block size:
sar -d 900 10
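For AIX and Linux hosts, an iostat run along the following lines is a reasonable sketch; the
900-second interval matches the sar example above, while the sample count of 96 (covering
24 hours) and the output file name are illustrative:

iostat -d 900 96 > iostat_hostname.out    (disk-only statistics at 15-minute intervals)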
Note: Capacity Magic for Windows is a product of IntelliMagic, licensed exclusively to IBM
and IBM Business Partners. It may be used only by IBMers on IBM equipment or IBM
Business Partners on IBM Business Partner equipment. In particular, Capacity Magic
cannot be left with clients.
The purpose of the tool is to aid in planning the correct number of disk features that must
be ordered to meet client requirements for disk storage capacity, taking into consideration
the client’s choices of disk drive module capacity, RAID protection, and server platform.
Clients who want to have their DS6000 disk requirements computed by Capacity Magic
should ask their IBM representative or IBM Business Partner to perform this analysis.
With these options and combinations of options, it becomes a challenge to calculate the
physical capacity (also referred to as raw capacity) and effective capacity (also referred to as
usable capacity or net capacity) of a DS6000. The difference between physical and effective
capacity is taken up by:
Spare volumes
RAID 5 parity volumes
RAID 10 mirror volumes
Capacity Magic, an intuitive, easy-to-use tool that runs on a Windows 2000 or Windows XP
personal computer, is available to do these physical capacity and effective capacity
calculations, taking into consideration all applicable configuration rules.
4.2.2 Wizard
The configuration wizard allows you to quickly generate a configuration. For each server
platform, you may specify only one DDM type and one RAID type. The wizard allows you to
specify the effective capacity required for each server platform, and it will generate the
physical capacity to meet that requirement. For open systems and iSeries servers, you
specify the effective capacity in gigabytes. For zSeries servers, you specify the effective
capacity in terms of the numbers of each 3390 model type required.
Once you have completed going through the windows of the configuration wizard, Capacity
Magic displays the graphical interface to show how the DS6000 is physically configured. You
may then modify your configuration if desired, for example to include multiple DDM and RAID
types for a server platform, which is not supported in the wizard. Finally, you select the Report
tab to see the detailed reports.
Although there are actually other alternatives for creating Extent Pools on the DS6000, these
are the two most common options and are the only ones offered in Capacity Magic.
Then, for each Array Site (four DDMs on the DS6000), you specify:
DDM type
– 73 GB
– 146 GB
– 300 GB
RAID type
– RAID 5
– RAID 10
Server platform
– zSeries
– Open systems
– iSeries
Each number in the graphical interface (1 through 56) designates one disk drive set (4 DDMs)
and indicates the order in which you must configure them. Each drive set forms an Array Site.
An Array can be formed from one or two Array Sites.
Note: Effective August 30, 2005, the number of drive sets on the DS6000 is limited to a
maximum of 32 drive sets (128 DDMs).
4.2.4 Reports
After inputting the configuration, you switch to the Report tab, which provides a number of
reports on capacity metrics.
Each of these metrics is provided for all RAID Arrays and also split out by:
DDM type
RAID type
For zSeries capacity, there is an allocated logical device report, which shows the logical
device counts per type, as specified in the zSeries logical device configuration dialog window.
For zSeries capacity, there is also a zSeries logical device space report, showing the number
of RAID arrays and extents, and the maximum number of logical devices which could be
specified for each logical device type, assuming all logical device types were that one type.
These numbers are shown for all RAID Arrays and also split out by:
DDM type
RAID type
4.2.5 Examples
We will now see examples of the Capacity Magic wizard, graphical interface, and reports.
Figure 4-14 Specify effective capacity for zSeries servers in terms of 3390 volumes
Graphical interface
Figure 4-15 on page 105 shows the blank graphical interface for the DS6800 with space for
56 drive sets, before any DDMs are specified.
The IBM TotalStorage Productivity Center offering is a powerful set of tools designed to help
simplify the management of complex storage network environments. The IBM TotalStorage
Productivity Center consists of TotalStorage Productivity Center for Disk, TotalStorage
Productivity Center for Replication, TotalStorage Productivity Center for Data (formerly Tivoli
Storage Resource Manager) and TotalStorage Productivity Center for Fabric (formerly Tivoli
SAN Manager).
This section presents the IBM TotalStorage Productivity Center for Disk, which is the
component used to collect and monitor performance for the IBM TotalStorage DS6000. It
can be invoked from the IBM TotalStorage Productivity Center launch pad by double-clicking
the Manage Disk Performance and Replication icon as shown in Figure 4-21 on page 110.
Disk system monitoring and configuration needs are best covered by a comprehensive
management tool such as the TotalStorage Productivity Center for Disk. The requirements
addressed by the TotalStorage Productivity Center for Disk are shown in Figure 4-22 on
page 111.
In a SAN environment, multiple devices work together to create a storage solution. The
Productivity Center for Disk provides integrated administration and optimization for
interacting SAN devices, including:
IBM TotalStorage DS4000 family
IBM TotalStorage Enterprise Storage Server
IBM TotalStorage DS8000 and DS6000 Series
IBM TotalStorage SAN Volume Controller
It provides an integrated view of the underlying system so that administrators can drill down
through the virtualized layers to easily perform complex configuration tasks and more
productively manage the SAN infrastructure. Because the virtualization layers support
advanced replication configurations, the Productivity Center for Disk product offers features
that simplify the configuration and monitoring. In addition, specialized performance data
collection, analysis, and optimization features are provided.
As the SNIA standards mature, the Productivity Center view will be expanded to include
CIM-enabled devices from other vendors, in addition to IBM storage. Figure 4-23 on page 112
represents the Productivity Center for Disk operating environment.
The Productivity Center for Disk layers are open and can be accessed via GUI, CLI, or
standards-based Web Services. The Productivity Center for Disk provides the following
functions:
Device Manager
Performance Manager
Device Manager
The Device Manager is responsible for the discovery of supported devices; collecting asset,
configuration and availability data from the supported devices; and providing a limited
topology view of the storage usage relationships between those devices.
The Device Manager builds on the IBM Director discovery infrastructure. Discovery of storage
devices adheres to the SNIA SMI-S specification standards. Device Manager uses the
Service Location Protocol (SLP) to discover SMI-S enabled devices. The Device Manager
creates managed objects to represent these discovered devices. The discovered managed
objects are displayed as individual icons in the Group Contents pane of the IBM Director
Console as shown in Figure 4-24 on page 113.
Device Manager provides a subset of configuration functions for the managed devices,
primarily LUN allocation and assignment. These services communicate with the CIM Agents
that are associated with the particular devices to perform the required configuration. Devices
that are not SMI-S compliant are not supported.
The Device Manager health monitoring keeps you aware of all hardware status changes in
the discovered storage devices. You can drill down into the status of the hardware device, if
applicable. This enables you to understand which components of a device are malfunctioning
and causing an error status for the device.
Performance Manager
The Performance Manager function provides the raw capabilities of initiating and scheduling
performance data collection on the supported devices, of storing the received performance
statistics into database tables for later use, and of analyzing the stored data and generating
reports for various metrics of the monitored devices. In conjunction with data collection, the
Performance Manager is responsible for managing and monitoring the performance of the
supported storage devices. This includes the ability to configure performance thresholds for
the devices based on performance metrics, the generation of alerts when these thresholds
are exceeded, the collection and maintenance of historical performance data, and the
creation of gauges, or performance reports, for the various metrics to display the collected
historical data to the end user. The Performance Manager enables you to perform
sophisticated performance analysis for the supported storage devices.
Functions
TotalStorage Productivity Center for Disk provides the following functions:
There is a user interface that supports threshold setting, enabling a user to:
– Modify a threshold property for a set of devices of like type.
– Modify a threshold property for a single device.
– Reset a threshold property to the IBM-recommended value (if defined) for a set of
devices of like type. IBM-recommended critical and warning values will be provided for
all thresholds known to indicate potential performance problems for IBM storage
devices.
– Reset a threshold property to the IBM-recommended value (if defined) for a single
device.
– Show a summary of threshold properties for all of the devices of like type.
View performance data from the Performance Manager database.
Gauges
The Performance Manager supports a performance-type gauge. The performance-type
gauge presents sample-level performance data. The frequency at which performance data is
sampled on a device depends on the sampling frequency that you specify when you define
the performance collection task. The maximum and minimum values of the sampling
frequency depend on the device type. The static display presents historical data over time.
The refreshable display presents near real-time data from a device that is currently collecting
performance data.
The Performance Manager enables a Productivity Center for Disk user to access recent
performance data in terms of a series of values of one or more metrics associated with a finite
set of components per device. Only recent performance data is available for gauges. Data
that has been purged from the database cannot be viewed. You can define one or more
gauges by selecting certain gauge properties and saving them for later referral. Each gauge
is identified through a user-specified name and, when defined, a gauge can be started, which
means that it is then displayed in a separate window of the Productivity Center GUI. You can
have multiple gauges active at the same time. Gauge definition is accomplished through a
wizard to aid in entering a valid set of gauge properties. Gauges are saved in the Productivity
Center for Disk database and retrieved upon request. When you request data pertaining to a
defined gauge, the Performance Manager builds a query to the database, retrieves and
formats the data, and returns it to you. When started, a gauge is displayed in its own window,
and it displays all available performance data for the specified initial date/time range. The
date/time range can be changed after the initial gauge window is displayed.
You will need to provide a TCP/IP connection between the DS6000 and the IBM TotalStorage
Productivity Center so that performance information, in particular, can be sent from the
DS6000 to the IBM TotalStorage Productivity Center. When IBM TotalStorage Productivity
Center receives information from the DS6000, it is stored in tables within a DB2 database.
Thus, you can prepare and produce customized reports using traditional DB2 commands.
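As a hedged sketch of such a report, the DB2 command line processor can be used along
these lines; the database name TPCDB and the table and column names are hypothetical
placeholders, and the actual schema should be taken from the TotalStorage Productivity
Center documentation:

db2 connect to TPCDB
db2 "SELECT DEVICE_ID, SAMPLE_TIME, TOTAL_IO_RATE FROM PERF_SAMPLE ORDER BY SAMPLE_TIME"
db2 connect reset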
Installation environment
The storage management components of IBM TotalStorage Productivity Center can be
installed on a variety of platforms. However, for the IBM TotalStorage Productivity Center
suite, when all four manager components are installed on the same system, the only common
platforms for the managers are:
Windows 2000 Server with Service Pack 4
Windows 2000 Advanced Server
Windows 2003 Enterprise Server Edition
Hardware requirements:
Dual Pentium® 4 or Xeon™ 2.4 GHz or faster processors
Installation process
IBM TotalStorage Productivity Center provides a suite installer that helps guide you through
the installation process. You can also use the suite installer to install the components
standalone. One advantage of the suite installer is that it will interrogate your system and
install required prerequisites.
The suite installer will install the requisite products or components for IBM TotalStorage
Productivity Center for Disk in this order:
DB2 (required by all the managers)
IBM Director
WebSphere® Application Server
After you have completed the installation of IBM TotalStorage Productivity Center for Disk,
you need to install and configure the Common Information Model Object Manager (CIMOM)
and the Service Location Protocol (SLP) agents.
The IBM TotalStorage Productivity Center for Disk uses SLP as a method for the CIM client to
locate managed objects. The CIM client may have built-in or external CIM agents. When a
CIM agent implementation is available for a supported device, the device may be accessed
and configured by management applications using industry-standard XML-over-HTTP.
If you want the DS8000, DS6000, ESS, SAN Volume Controller, or FAStT storage
subsystems to be managed using IBM TotalStorage Productivity Center for Disk, you must
install the prerequisite I/O Subsystem Licensed Internal Code and CIM Agent for the devices.
For detailed installation, refer to the IBM redbook Managing Disk Subsystems using IBM
TotalStorage Productivity Center, SG24-7097.
Installation consideration
For general performance considerations, if you are installing the CIM agent for the DS6000,
you must install it on a separate machine from the Productivity Center for Disk code, as shown
in Figure 4-25. Attempting to run a full TotalStorage Productivity Center implementation
(Device Manager, Performance Manager, Data Manager, Replication Manager, DB2, IBM
Director and the WebSphere Application Server) on the same host as the CIM agent will
result in dramatically increased wait times for data retrieval.
To collect performance data, you have to create data collection tasks for the supported,
discovered storage devices. To create a task, you have to specify:
A task name
A brief description of the task
The sample frequency in minutes
The duration of the data collection task (in hours)
Figure 4-28 on page 119 is an example of the panel you should see when you create the
performance Data Collection task on a DS6000.
Note: All figures below are taken from the DS8000, but the DS6000 view is the same.
Once the task is created, you can execute it immediately or define a scheduled job for this
task.
Note: Data collection tasks can be defined with one or more storage devices. By selecting
several storage devices, you can collect performance data from different storage
subsystems. In this case, the collection data will have the same characteristics (sample
frequency and duration of data collection).
To check the status of all defined tasks, you can access the Task Status. This tool gives
details of tasks including task status (for example, running or completed), device ID, device
status, error message ID, and error message.
Gauges are used to drill down to the level of detail necessary to isolate performance issues
on the storage device. To view information collected by the Performance Manager, a gauge
must be created or a custom script written to access the DB2 tables and fields directly.
The metrics available change depending on whether the cluster, Rank Group or volume items
are selected.
Performance
DS6000 Server performance gauge values provide details on:
Total I/O rate: Cluster level (server0/server1) total I/O request rate
Figure 4-32 on page 123 presents an example of graphical output you get from TotalStorage
Productivity Center for Disk.
Exceptions
Exception gauges display data only for those DS6000 active thresholds that were crossed
during the reporting period. One Exception gauge displays threshold exceptions for the
entire storage device, based on the thresholds active at the time of the collection.
DS6000 thresholds
Thresholds are used to determine the high watermarks for warning and error indicators for
the storage subsystem. Figure 4-33 shows the available Performance Manager thresholds
for the DS6000.
You may only enable a particular threshold once minimum values for warning and error levels
have been defined. If you attempt to select a threshold and enable it without first modifying
this value, you will see a notification like Figure 4-34.
Tip: In TotalStorage Productivity Center for Disk, default threshold warning and error
values of -1.0 are indicators that there is no recommended minimum value for the
threshold, which is therefore entirely user defined. You may elect to provide any reasonable
value for these thresholds, keeping in mind the workload in your environment.
To modify the warning and error values for a given threshold, select the threshold and click
the Properties button. The panel in Figure 4-35 will be shown. You can modify the threshold
as appropriate and accept the new values by clicking OK.
Before exploiting gauges, collect samples over an appropriate time frame that covers both
high and low I/O rates.
For creating the gauge, launch the Performance gauges panel as shown in Figure 4-36 by
right-clicking on the DS6000 device.
Click Create to create a new gauge. You will see a panel similar to Figure 4-37. On the left
pane of this panel, you can choose to create a performance gauge at the Cluster
(server0/server1) level, the Rank Group (Rank) level, or the Volume level.
For example, when we select Cluster level, Total I/O, Reads/second, Writes/second and AVG
response time appear in the metrics box, and the pair of clusters appears in the component
box.
Enter the name and the description of the gauge. Select Data points or Date range to show
the historical data collection sampling period and check Display gauge. Upon clicking OK, we
can get the next panel as shown in Figure 4-38 on page 126.
The subsequent panel is shown in Figure 4-39 on page 127. When Rank level is selected on
the left pane, with the component RANK 0 and the metric Average response time as circled
on the figure, the resulting gauge is shown in Figure 4-40 on page 127.
The IBM TotalStorage Productivity Center for Disk helps you to see the overall performance of
your DS6000s. It supplies information at the DS6000 subsystem level; it does not directly
connect the host view with the disk subsystem view. Using it in conjunction with host system
monitors and available performance tools, you will receive the necessary picture of your
DS6000’s performance.
In general we recommend:
Use the Cluster level gauge and identify if the DS6000’s servers are busy and persistent
memory is sufficient.
Use the Rank Group level gauge to identify if the Rank or internal bus is busy.
Use the Volumes level gauge to identify the type of workload and to verify persistent
memory full conditions on logical disk (volume) level.
Use the threshold to identify the most recent threshold exceptions.
The Rank Group analysis shows how busy the DDMs are on your DS6000 at the RAID-Array
level. This information helps to determine where the most accessed data is located and what
performance you are achieving from the RAID Array. Rank Group analysis reports the global
performance workload of all the volumes defined in the selected RAID Array. Here is the
information provided:
Total I/O: DS6000 lower interface total I/O per second
Reads/second: DS6000 lower interface reads per second
Writes/second: DS6000 lower interface writes per second
Average response time: Average response time for the lower interface in milliseconds
This information accounts for all I/O activity against the DDMs, including host reads and
writes along with staging (disk-to-cache) and destaging (cache-to-disk) transfers.
Read and write requests per second at Rank level
This information shows the number of I/O requests made by server0 and server1 for
this Rank, including both read and write requests. This number is an indication of internal
DS6000 operations between the server cache and the Rank. This value is a sum of all read
and write activities applied to all volumes defined on this Rank. Analyzing reports of all
volumes defined on this Rank helps you to understand which host server generates the
workload; to monitor the workload generated by a host server, use the volumes report.
Average response time at Rank level
This information shows an average response time to complete each Rank request. The
number is not the average response time of a single DDM. Since a DS6000 Server makes
various kinds of I/O requests, some of these requests access all DDMs in a Rank, while
others may involve just one DDM. Taking this into consideration, you cannot use this
number to measure performance without knowing how each cluster makes I/O requests to
the Rank.
Consider that for read activity (4 KB I/O size), a DS6000 Rank can process more than:
1700 operations per second (using 15 Krpm DDM disks)
1200 operations per second (using 10 Krpm DDM disks)
In general, the DS6000 should show good performance when the average Rank read
response time is about 10 ms. Generally, the average response time should not exceed 35
ms.
There is a relationship between Rank operations, cache hit ratio, and percentage of read
requests. When the cache hit ratio is low, this indicates that the DS6000 has frequent
transfers from DDMs to cache (staging).
When percentage of read requests is high and cache hit ratio is also high, most of the I/O
requests can be satisfied without accessing the DDMs due to the SARC prefetching
algorithm.
When the percentage of read requests is low, the DS6000 write activity to the DDMs can be
high. This indicates that the DS6000 has frequent transfers from cache to DDMs (destaging).
Comparing the performance gauges of different Ranks helps you to understand whether your
global workload is spread equally across the DDMs of your DS6000. Spreading data across
multiple Ranks increases the number of DDMs used and optimizes the overall performance.
Important: Limiting write workload to one Rank can increase the persistent memory
destaging execution time and so impact all write activities on the same DS6000 server.
You also have to check the % DASD fast write delay due to NVS metrics at the volume
level gauges.
To avoid this situation, you should spread write I/O across multiple Ranks, add more
DS6000 servers, or consider replacing the DS6000 with a DS8000.
Note: Here metrics have been grouped with those metrics that have the same unit of
measurement and therefore can be combined and displayed using a single gauge.
Analysis of volume level metrics will show how busy the volumes are on your DS6000. This
information helps you to:
Determine where the most accessed data is located and what performance you get from
the volume.
Understand the type of workload your application generates (sequential or random, read
or write operation ratio).
Determine the cache benefits for read operations (SARC prefetching algorithm).
Determine any cache bottleneck for write operations.
I/O requests per second tells you how many I/O requests are processed. By comparing I/O
rates at the cluster level, Rank level, and volume level, it is possible to identify where the
greatest demand is occurring for storage system resources.
Whether an I/O rate is low or high depends on your I/O workload type. However, consider
that the performance of a DS6000 volume is limited by the performance of the Rank where it
is defined. To exceed the performance limitation of a Rank, you should create a volume
based on extents which belong to several DS6000 Ranks.
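A hedged DS CLI sketch of this approach follows; it assumes an extent pool populated with
two Ranks so that a volume's extents can come from both, and the pool ID, Rank IDs, and
capacity are illustrative:

dscli> mkextpool -rankgrp 0 -stgtype fb multi_rank_pool    (returns a pool ID, for example P4)
dscli> chrank -extpool P4 R0                               (assign the first Rank to the pool)
dscli> chrank -extpool P4 R1                               (assign the second Rank to the pool)
dscli> mkfbvol -extpool P4 -cap 100 0300                   (the volume draws extents from both Ranks)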
There are several host factors that can affect the I/O requests per second:
I/O contention on the host systems
Parallel Access Volume (PAV) for z/OS sharply reduces volume contention (IOSQ) within a
system. Multiple reads are executed in parallel and multiple writes are executed in parallel
and serialized by extent specification in the Define Extent command. You need to perform
additional subsystem tuning actions if your operating system does not allow parallel reads.
You can check the performance measurement tool available at the host system to
determine if there is a performance bottleneck on the host system side.
The IBM Subsystem Device Driver may not be installed
If you have configured your hosts to have multiple paths to the same DS6000 server, you
have to make sure that you have installed the IBM Subsystem Device Driver (SDD), which
comes with your DS6000. SDD can balance the workload among the paths to the same
server, though not between the two servers, and drive more I/O requests to the DS6000. For
further information, refer to IBM TotalStorage Subsystem Device Driver User’s Guide,
SC26-7478.
To get useful reference data, we recommend that you monitor your subsystem during regular
work days and peak workload activities when there are no reported user issues or
performance constraints. If subsystem I/O performance degrades and user complaints and
response times increase, you can then compare the performance data from intervals of
normal operation with the data from the degraded intervals.
If applications use a database, this is different. A database management system has its own
caching mechanism using the host processor’s memory (also called the database buffer
pool). A database management system can defer the write operation until the modified data
occupies a certain amount of this buffer pool. The duration between a read and the
corresponding write burst to the DS6000 subsystem can be long. You may first see a high
read request rate, and then a lower one because of the write requests that follow later.
For a logical volume that has sequential files, you need to understand what kind of
applications access those sequential files. Normally, these are used for either read-only or
write-only at the time of use. The DS6000 pre-fetching SARC algorithm determines whether
the data access pattern is sequential or not. If the access is sequential, then contiguous data
is pre-fetched into cache in anticipation of the next read request.
The DS6000 has a 100 percent write hit function. Due to this function, all write activity goes
to DS6000 cache before being written to disk. Therefore, all write requests against volumes
are completed without accessing the DDMs. The most important policy to consider in
exploiting the write cache function is protecting data integrity. For this reason, the DS6000
maintains a two-secured-copies policy. This ensures that modified data is stored in two
different places in the DS6000, so that a single component failure does not cause the loss of
data.
When the DS6000 accepts a write request, it will process it without physically writing to the
DDMs. The data is written into both the cache of the server that owns the volume and the
persistent memory of the second server in the DS6000. Later, the DS6000 asynchronously
destages the modified data out to the DDMs.
The DS6000’s lower interfaces use switched Fibre Channel connections, which provide a high
data transfer bandwidth. In addition, the destage operation is designed to avoid the write
penalty of RAID 5, if possible. For example, there is no write penalty when the modified data
to be destaged is contiguous enough to fill the unit of a RAID 5 stride. However, when all of
the write operations are completely random across a RAID 5 rank and the DS6000 cannot
avoid the write penalty, you could get some I/O contention at the DDM level.
To get more details regarding RAID 5 and RAID 10 difference, refer to 2.8.5, “RAID 5 versus
RAID 10 performance” on page 39.
Disk-to-cache operations show the number of data transfers from disks to cache, referred to
as staging, for a specific volume. Disk-to-cache operations are directly linked to read activity
from hosts. Data requested for reads is first staged from backend disks into the cache of the
DS6000 server and then transferred to the host.
Read hits occur when all the data requested for a read data access is located in cache. The
DS6000 improves the performance of read caching by using SARC staging algorithms to
store in cache data tracks which have the greatest probability of being accessed by a read
operation.
Cache-to-disk operations show the number of data transfers from cache to disks, referred to
as destaging, for a specific volume. Cache-to-disk operations are directly linked to write
activity from hosts to this volume. Data written is first stored in the persistent memory (also
known as NVS) of the DS6000 server and then destaged to the backend disks. DS6000
destaging is enhanced automatically by striping the volume across all the DDMs in one or
several Ranks (depending on your configuration). This automatically provides load balancing
across the DDMs in the Ranks and eliminates hot spots.
The DASD fast write delay percentage due to persistent memory allocation gives us
information about the cache usage for write activities. The DS6000 stores data in the
persistent memory before sending an acknowledgement to the host. If the persistent memory
is full of data (no space available), the host will receive a retry for its write request. In parallel,
the server has to destage data stored in its persistent memory to the backend disks before
accepting new write operations from any host.
If one of your volumes is experiencing delayed write operations due to a persistent memory
constraint, you should move the volume to a less-used Rank, or spread the volume over
multiple Ranks (increasing the number of DDMs used). If this does not fix the persistent
memory constraint, you can consider adding more DS6000 servers or replacing the DS6000
with a DS8000.
Read hit ratio shows how efficiently your cache works on the DS6000. For example, a value
of 1.00 indicates that all read requests are satisfied from the cache. If the DS6000 cannot
complete an I/O request from the cache, it transfers data from the DDMs, suspending the I/O
request until it has read the data. This situation is called a cache miss. For a cache miss, the
response time includes not only the data transfer time between host and cache, but also the
overhead of staging data from the DDMs.
The read hit ratio depends on the characteristics of data on your DS6000 and applications
that use the data. If you have a database and it has the locality of reference, it will show a high
cache hit ratio, as most of the data referenced could remain in the cache. If your database
does not have the locality of reference, but it has the appropriate sets of indexes, it will also
show a high cache hit ratio, as the entire index could remain in the cache.
We recommend that you monitor the read hit ratio over an extended period of time:
If the cache hit ratio has been low historically, it is most likely due to the nature of your
data, and you do not have much control over this. You can first try to perform
de-fragmentation on a file system, making indexes if none exist, rather than considering
increasing the cache size.
If you have a high cache hit ratio initially, and it is decreasing as you load more data with
the same characteristics, then moving some data to the other cluster, so that it uses the other
cluster’s cache, or adding a server enclosure could improve the situation.
4.3.8 IBM TotalStorage Productivity Center for Disk and other tools
The IBM TotalStorage Productivity Center for Disk provides storage subsystem metric
performance data. We receive, for example, the number of I/O requests the DS6000 has
processed at different levels, the cache usage and the persistent memory use conditions. We
cannot get the host system view from the Productivity Center for Disk reports, like I/O activity
rate, I/O response time, or data transfer rate. If there is a performance problem with your
applications, you could see a delay of batch jobs and slower response times during online
transaction processing.
To determine if the I/O behavior is the reason for the problem, you need to gather the
information about I/O profiles on the host systems. For example, it is possible that one
application cannot get I/O services, while another application dominates I/O services. The I/O
response time and its breakdown for each of the logical volumes helps you to isolate the
source of the performance problem.
The following sections describe how to use host-based performance measurements and
reporting tools, in conjunction with the IBM TotalStorage Productivity Center for Disk under
UNIX, Linux, Windows 2000, iSeries and z/OS environments.
IBM TotalStorage Productivity Center for Disk Report and UNIX / Linux
Most application I/O requests against disk subsystems are through either database
management systems or file systems. It can be difficult to associate the application or
operating system I/O performance with that of the I/O subsystems directly. Why? Because
they have their own internal caching mechanisms. An I/O request from an application does
not always go directly to the I/O subsystems. You may see an I/O subsystem experiencing
poor performance while applications are not affected.
To get host information about I/O subsystems, CPU activities, virtual memory, and physical
memory use, you can use the following commands:
iostat
vmstat
sar
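Representative invocations are sketched below; the 300-second interval and the sample
count are illustrative, and the available flags vary between UNIX flavors:

iostat 300 12    (device throughput and CPU utilization, 5-minute samples)
vmstat 300 12    (virtual memory, paging, and CPU activity)
sar -d 300 12    (per-device activity rates)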
When you see a downward trend in the effective data transfer rate for a volume on a host
while the I/O request rate on a Rank or volume is going up, you need to perform further
analysis, even if you have already concluded that your DS6000 is not performing well. If you
have other host systems, you also need to check them, as the source of poor performance
could be at one of the other hosts. Cache-unfriendly applications or host systems could be a
reason for a high I/O request rate. For this reason, we suggest that you check cache reports
for all volumes that are behind the Rank showing high utilization.
If the volume you are concerned about is in the Rank, and other volumes in the same Rank
show poor cache statistics (such as low read hit ratio or low percent read requests), moving
the volumes to another Rank would be worth considering. This may relieve the performance
degradation condition, as the Rank or below level of I/O delay probably caused the situation.
Performance Monitor gives you the flexibility to customize the monitoring to capture various
categories of Windows system resources, including CPU and memory. You can also monitor
disk I/O through Performance Monitor.
The Performance Monitor shows the response time (Avg. Disk sec/I/O). The minimum monitor
interval is one second when you log performance data. A one-second response time may not
be a valid reflection of your system’s performance. When you use the Performance Monitor in
real time, you can set the monitor interval in increments of one millisecond. If you set the
monitor interval to one millisecond, the value will be closer to the actual response time.
Increasing the sample count will impact your system’s performance and it will also affect the
accuracy of these performance counters. This is most likely not acceptable for your
production applications. In addition, it is not as convenient for historical analysis, since the
real time monitor provides just one screen of data and it wraps around. Although you can log
the performance data, the data is saved at a minimum of one second intervals, so the values
may not be as accurate.
We suggest that you use the same approach as for a UNIX/Linux system, that is, to monitor
data transfer rate trends over an extended period of time.
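On Windows Server 2003, the same counters can also be logged from the command line
with the typeperf utility; this is a sketch, and the counters, interval, and output file shown are
illustrative:

typeperf "\PhysicalDisk(*)\Avg. Disk sec/Transfer" "\PhysicalDisk(*)\Disk Transfers/sec" -si 15 -sc 240 -o ds6000_disk.csv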
For further discussion, refer to Chapter 11, “iSeries servers” on page 387.
Before beginning the diagnostic process, you must understand your workload and your
physical configuration. You need to know how your system resources are allocated, as well as
understand your path and channel configuration for all attached servers.
Let us assume that you have an environment with a DS6000 attached to a z/OS host, an AIX
pSeries host, and several Windows hosts. You have noticed that your z/OS online users
experience a performance degradation between 7:30 a.m. and 8:00 a.m. each morning.
You may notice that there are 3390 volumes indicating high disconnect times, or high device
busy delay time for several volumes, in the RMF device activity reports. Unlike UNIX or
Windows, z/OS reports the response time and its breakdown into connect, disconnect,
pending, and IOS queuing times.
Disconnect time is an indication of cache miss activity or destage wait (due to persistent
memory high utilization) for logical disks behind the DS6000s.
Device busy delay is an indication that another system locks up a volume, and an extent
conflict occurs among S/390 hosts or applications in the same host when using Parallel
Access Volumes. The DS6000’s multiple allegiance or Parallel Access Volume capability
allows it to process multiple I/Os against the same volume at the same time. However, if a
read or write request against an extent is pending while another I/O is writing to the extent, or
if a write request against an extent is pending while another I/O is reading or writing data from
the extent, the DS6000 will delay the I/O by queuing. This condition is referred to as extent
conflict. Queuing time due to extent conflict is accumulated to device busy (DB) delay time.
An extent is a sphere of access; the unit of increment is a track; usually I/O drivers or system
routines decide and declare the sphere.
To determine the possible cause of high disconnect times, you should check the read cache
hit ratios, read-to-write ratios, and bypass I/Os for those volumes. If you see that the cache hit
ratio is lower than usual while you have not added other workload in your S/390 environment,
I/Os against open systems fixed block volumes might be the cause of the problem. Possibly
fixed block (FB) volumes defined on the same server had a cache-unfriendly workload, thus
impacting your S/390 volumes’ hit ratio.
To get more information about cache usage, you can check the cache statistics of the Fixed
Block volumes that belong to the same server. You may be able to point out the Fixed Block
volumes that have a low read hit ratio and a short cache holding time. If you can move the
workload of those cache-unfriendly volumes to the other server, or reschedule it, the hit ratio
of your S/390 volumes should improve.
The approaches for using other tools’ data in conjunction with the IBM TotalStorage
Productivity Center for Disk, as described in this chapter, do not cover all the possible
situations you will encounter. But if you understand how to interpret the DS6000
performance reports, and you also have a good understanding of how the DS6000 works,
then you will be able to develop your own ideas on how to correlate the DS6000 performance
reports with other performance measurement tools when approaching specific situations in
your production environment.
We will look at four example SAN configurations where SAN statistics may be beneficial for
monitoring and analyzing DS6000 performance.
The first example configuration, shown in Figure 4-42 on page 139, has host server Host_1
connecting to DS6000_1 through two SAN switches or directors (SAN Switch/Director_1 and
SAN Switch/Director_2). There is a single inter-switch link (ISL) between the two SAN
switches.
In this configuration, the performance data available from the host and from the DS6000 will
not be able to show the performance of the ISL. If, for example, the Host_1 adapters and the
DS6000_1 adapters do not achieve the expected throughput, the SAN statistics for utilization
of the ISL should be checked to determine whether it is limiting I/O performance.
(Figure 4-42: Host_1 attached through SAN Switch/Director_1 and SAN Switch/Director_2,
connected by a single ISL, to DS6000_1 and its storage enclosures.)
A second type of configuration in which SAN statistics can be useful is shown in Figure 4-43
on page 140. In this configuration, host bus adapters or channels from multiple servers
access the same set of I/O ports on the DS6000 (server adapters 1-4 share access to
DS6000 I/O ports 5 and 6). In this environment, the performance data available from only the
host server or only the DS6000 may not be enough to confirm load balancing, or to identify
each server’s contributions to I/O port activity on the DS6000, because more than one host is
accessing the same DS6000 I/O ports.
If DS6000 I/O port 5 is highly utilized, it may not be clear whether Host_A, Host_B, or both
hosts are responsible for the high utilization. Taken together, the performance data available
from Host_A, Host_B and the DS6000 may be enough to isolate each server connection’s
contribution to I/O port utilization on the DS6000; however, the performance data available
from the SAN switch or director may make it easier to see load balancing and relationships
between I/O traffic on specific host server ports and DS6000 I/O ports at a glance, because it
can provide real-time utilization and traffic statistics for both host server SAN ports and
DS6000 SAN ports in a single view, with a common reporting interval and metrics.
(Figure 4-43: Adapters 1 through 4 on two host servers sharing access to I/O ports 5 and 6
on a single DS6000 and its storage enclosures.)
SAN statistics may also be helpful in isolating the individual contributions of multiple DS6000s
to I/O performance on a single server. In Figure 4-44 on page 141, host bus adapters or
channels 1 and 2 from a single host (Host_A) access I/O ports on multiple DS6000s (I/O
ports 3 and 4 on DS6000_1 and I/O ports 5 and 6 on DS6000_2).
In this configuration, the performance data available from either the host server or from the
DS6000 may not be enough to identify each DS6000’s contribution to adapter activity on the
host server, because the host server is accessing I/O ports on multiple DS6000s. For
example, if adapters on Host_A are highly utilized or if I/O delays are experienced, it may not
be clear whether this is due to traffic that is flowing between Host_A and DS6000_1, between
Host_A and DS6000_2, or between Host_A and both DS6000_1 and DS6000_2.
The performance data available from the host server and from both DS6000s may be used
together to identify the source of high utilization or I/O delays; additionally, the SAN switch or
director can provide real-time utilization and traffic statistics for both host server SAN ports
and DS6000 SAN ports in a single view, with a common reporting interval and metrics.
(Figure 4-44: Adapters 1 and 2 on a single host accessing I/O ports 3 and 4 on DS6000_1
and I/O ports 5 and 6 on DS6000_2.)
(Figure: The fourth example configuration, a Remote Copy setup with DS6000_1 at the
primary site connected through ports 1 through 4 and two SAN switches or directors to
DS6000_2 at the secondary site.)
SAN statistics should be checked to determine whether there are SAN bottlenecks limiting
DS6000 I/O traffic. SAN link utilization and throughput statistics can also be used to break
down the I/O activity contributed by adapters on different host servers to shared storage
subsystem I/O ports. Conversely, SAN statistics can be used to break down the I/O activity
contributed by different storage subsystems accessed by the same host server.
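For example, on a Brocade switch, per-port throughput can be sampled from the switch CLI with the portperfshow command; a minimal sketch follows (the 5-second interval is illustrative):

portperfshow 5

This prints the throughput of every switch port once per interval, so a host server port and the DS6000 port it is zoned to can be watched side by side.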
For additional information about monitoring performance through a SAN switch or director,
see:
http://www.brocade.com
http://www.cisco.com
http://www.mcdata.com
The DS6800 Model 511 contains two controller cards with four ports each for a total of eight
host attachment ports. You can configure the DS6000 host attachment ports for either Fibre
Channel Protocol (FCP) or Fibre Connection (FICON) protocol. For zSeries host
attachment, the DS6000 does not support ESCON.
The DS6000 supports 1 Gbps and 2 Gbps connections. The DS6000 negotiates the
connection speed automatically and determines whether to run the link at 1 Gbps or
2 Gbps.
Fibre Channel connections are established between Fibre Channel ports that reside in I/O
devices, host systems, and the network that interconnects them. Each of the eight host
adapter ports available on the DS6000 has a unique worldwide port name (WWPN). You can
configure the port to operate with the SCSI-FCP upper-layer protocol or with the FC-AL
upper-layer protocol. The DS6000 can be configured with either shortwave small form factor
pluggables (SFPs) or with longwave SFPs to be installed on the host adapter ports, as
discussed in Chapter 1, “Model characteristics” on page 1. Fibre Channel adapters for
SCSI-FCP support provide the following configurations:
A maximum of eight host ports
A maximum of 8192 host logins per Fibre Channel port
A maximum of 2000 N-port logins per storage unit
Access to all 8,192 LUNs per target (one target per host adapter), depending on host type
Either arbitrated loop, switched fabric, or point-to-point topologies
As with the open systems connections, each of the two controller cards in the DS6000
contains four host adapter ports, and each port has a unique world wide port name (WWPN).
You can configure the port to operate with the FICON upper-layer protocol. When configured
for FICON, the Fibre Channel port supports connections to a maximum of 128 FICON hosts.
With FICON, the host adapter port can operate with fabric or point-to-point topologies.
Figure: Windows, AIX, Linux, and zSeries (FICON) hosts attached through a SAN fabric to the two DS6000 controller cards, each with a PowerPC chipset, volatile and persistent memory, and a device adapter chipset
5.2 Multipathing
For whichever host attachment method you use, we recommend that whenever possible, you
use two or more paths from each FCP or FICON host to the DS6000, and balance the host
connections across both controller cards. For the DS6000, it is important that host systems
have attachment to both controller cards. See 2.3, “DS6000 major hardware components” on
page 19 for details on preferred pathing and why connectivity to both controller cards is
important for the DS6000.
By attaching a host with redundant paths to the DS6000, you can increase availability by
avoiding single points of failure. Additionally, over and above preferred pathing considerations,
I/O performance can be improved by configuring multiple physical paths to groups of heavily
used volumes.
5.3 FICON
FICON is the Fibre Connection protocol used with zSeries servers (see 2.10.3, “FICON attachment” on
page 47). The connection speeds are 100-200 MB/s, similar to Fibre Channel for open
systems.
FICON channels were introduced in the IBM 9672 G5 and G6 servers with the capability to
run at 1 Gbps. These channels were later enhanced to FICON Express channels in the IBM
zSeries 800, zSeries 900, and zSeries 990 servers and the IBM System z9 109, and were
capable of running at transfer speeds of 2 Gbps. FICON Express2 channels are a new
generation of FICON channels that offer improved performance over previous
generations of FICON and FICON Express channels. They are supported on the
IBM eServer zSeries 990 (z990), zSeries 890 (z890), and IBM System z9 109. A
comparison of the overall throughput capabilities of various generations of channel
technology is shown in Figure 5-2.
Figure 5-2 Channel throughput comparison of ESCON, FICON, 2 Gbps FICON Express, and FICON Express2 channels on G5, G6, z890, z990, and z9 servers
As you can see, the FICON Express2 channel, first introduced on the zSeries z890 and
z990, represents a significant improvement in both 4K I/O per second throughput and
maximum bandwidth capability compared to previous FICON offerings. The greater
performance capabilities of the FICON Express2 channel make it a good match for the
performance characteristics of the DS6000 host adapters.
When you use a Fibre Channel/FICON host adapter to attach to FICON channels, either
directly or through a switch, the port is dedicated to FICON attachment and may not be
simultaneously attached to FCP hosts. When you attach a DS6000 to FICON channels
through one or more switches, the maximum number of FICON logical paths is 2048 per
DS6000 host adapter port.
Figure 5-3 shows an example of FICON attachment to connect a zSeries server through
FICON switches, using 16 FICON channel paths to the eight host adapter ports on the
DS6000, and addressing eight Logical Control Units (LCUs). This channel consolidation may
be possible when your host workload does not exceed the performance capabilities of the
DS6000 host adapter, and would be most appropriate when connecting to the original
generation FICON channel. Depending on your workload, it is likely that FICON
Express2 channels should be configured one to one with DS6000 host adapter ports.
Figure 5-3 zSeries servers with 16 FICON (FC) channels attached through FICON switches to the eight DS6000 host adapter ports
Table 5-1 Platforms, operating systems and applications supported with DS6000

Server platforms | Operating systems | Clustering applications
pSeries, RS/6000®, IBM BladeCenter JS20 | IBM AIX, Linux (Red Hat, SuSE) | IBM HACMP™ (AIX only)
iSeries | OS/400®, i5/OS™, Linux (Red Hat, SuSE), AIX | IBM HACMP (AIX only)
HP PARisc, Itanium® II | HP UX | HP MC/Serviceguard
HP Alpha | OpenVMS, Tru64 UNIX | HP TruCluster (only DS8000)
Intel IA-32, IA-64, IBM BladeCenter HS20 and HS40 | Microsoft Windows, VMware, Novell NetWare, Linux (Red Hat, SuSE, Asianux, Red Flag Linux) | Microsoft Cluster Service including Microsoft Datacenter, Novell NetWare Cluster Services
Apple Macintosh | OS X | -
SGI | IRIX | -
For specific considerations that apply to each server platform, as well as for the most current
information about supported servers—the list is updated periodically—check:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Direct connect
This is the simplest of all the Fibre Channel topologies. Two Fibre Channel adapters (one
host and one DS6000) are connected using just a fiber cable. The Fibre Channel host
adapter card C in Figure 5-4 on page 149 is an example of a direct connection.
This topology supports the maximum bandwidth of Fibre Channel, but does not exploit any of
the benefits that come with SAN implementations.
Tip: When using the DS Storage Manager or DS CLI to connect directly to a host HBA, set
the Fibre Channel port topology attribute to match the requirements of the host HBA
configuration.
The DS6000 supports direct connect at a maximum distance of 500 m (1640 ft.) at 1 Gbps
and 300 m (984 ft.) at 2 Gbps with the shortwave SFP feature. The DS6000 supports direct
connect at a maximum distance of 10 km (6.2 mi) with the longwave SFP feature.
Arbitrated Loop
Fibre Channel Arbitrated Loop (FC-AL) is a uni-directional ring topology very much like token
ring. Information is routed around the loop and repeated by intermediate ports until it arrives
at its destination. If using this topology, all other Fibre Channel ports in the loop must be able
to perform these routing and repeating functions in addition to all the functions required by the
point-to-point ports.
Up to a maximum of 127 ports can be interconnected via a looped interface. All ports share
the FC-AL interface and therefore also share the bandwidth of the interface. Only one
connection may be active at a time, and the loop must be a private loop. An example of Fibre
Channel arbitrated loop topology is shown in Figure 5-5 on page 150. Note how the three
servers with host adapters X, Y, and Z share a single port to the DS6000.
Figure 5-5 Arbitrated loop topology
The DS6000 does not support FC-AL topology on adapters that are configured for FICON
protocol.
Tip: When using the DS Storage Manager or DS CLI to connect to a FC-AL loop, set the
Fibre Channel port topology attribute to fc-al.
The DS6000 supports up to 127 hosts or devices on a loop. However, the loop goes through a
loop initialization process (LIP) whenever you add or remove a Fibre Channel host or device
from the loop. LIP disrupts any I/O operations currently in progress. For this reason, we
recommend that you have only a single host and a single DS6000 on any loop, effectively
making it a direct connection as discussed under “Direct connect” on page 148.
Note: Because of the architecture of the DS6000, connection via arbitrated loop is not
recommended. Remember that the single fibre connection will only provide preferred path
access to the LUNs owned by the connecting controller card and non-preferred path access
to the LUNs owned by the other controller card. If the connecting controller card should fail,
you will lose data access to all LUNs in the DS6000.
Switched fabric
A switched fabric is an intelligent switching infrastructure that delivers data from any source to
any destination. Figure 5-4 on page 149, with Fibre Channel adapters A and B, shows an
example of a switched fabric. A switched fabric is the basis for a Storage Area Network
(SAN), as shown in Figure 5-6 on page 153.
Tip: When using the DS Storage Manager or DS CLI to configure a switched fabric, always
use fcp-scsi as the Fibre Channel port topology attribute.
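As a sketch, the topology attribute might be set per I/O port with the DS CLI similar to the following (the port ID I0001 is illustrative, and the exact command name and parameters should be verified against your DS CLI release):

setioport -topology fcp-scsi I0001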
Table 5-2 Distances supported by Fibre Channel cables for the DS6000

Fibre Channel host adapter SFP feature | Transfer rate | Cable type | Distance
Recommendations for implementing a switched fabric are covered in more detail in the
following section.
Figure 5-6 on page 153 shows an example of a SAN switched fabric. It is called a switched
fabric because the SAN switches allow any Fibre Channel port to connect to any other Fibre
Channel port. All the Fibre Channel adapters in the servers and storage in this example are
running in switched fabric mode. The main components of the SAN are:
The servers.
The storage subsystems, in this case a DS6000 and a tape library.
The SAN switches.
Notice each server is at least dual attached to the SAN for availability and load balancing. The
storage devices—the DS6000 and the tape library—have multiple SAN connections for
availability and performance.
This is just one example of a SAN; there are myriad other ways to create a SAN with
different types and numbers of servers, storage devices, and switches.
Figure 5-6 A SAN switched fabric: HP-UX and AIX hosts, a SAN Volume Controller, a tape library, and DS6000 disk storage connected through a Fibre Channel SAN
Notice in Figure 5-6 how many different types of servers there are sharing the same DS6000
storage and tape library.
For performance, a general rule of thumb for host adapter ports is to have a pair of
host adapter ports connected to the SAN for each peak throughput increment of
300-400 MB per second for large block sequential workloads. For small block random
workloads, plan a pair of host adapter ports for each increment of 15000 I/Os per second. For
a configuration that is expected to deliver 500 MB per second of large block sequential or 40000
IOPS of small block random work, for instance, you should plan to provide at least four
DS6000 host adapter ports into the SAN.
If a host adapter should go bad and start logging in and out of the switched fabric, or a server
must be rebooted several times, you do not want it to disturb I/O to other hosts. Figure 5-7 on
page 156 shows zones that only include a single host adapter and multiple DS6000 ports.
This is the recommended way to create zones to prevent interaction between server host
adapters.
Tip: Each zone should contain a single host system adapter with the desired number of
ports attached to the DS6000.
By establishing zones, you reduce the possibility of interactions between system adapters in
switched configurations. You can establish the zones by using either of two zoning methods:
Switch port number
Worldwide port name (WWPN)
You can configure switch ports that are attached to the DS6000 in more than one zone. This
enables multiple host system adapters to share access to the DS6000 host adapter ports.
Shared access to a DS6000 host adapter port might be from host platforms that support a
combination of bus adapter types and operating systems.
Note: A DS6000 host adapter port configured to run with the FICON topology cannot be
shared in a zone with hosts other than zSeries CKD hosts, and ports with a non-FICON
topology cannot be shared in a zone with zSeries CKD hosts.
While it is possible to limit the DS6000 host adapter ports through which a given WWPN
connects to Volume Groups, we recommend that you define the WWPNs to have access to all
available DS6000 host adapter ports. Then, using the recommended process of creating
Fibre Channel zones as discussed in “Importance of establishing zones” on page 153, you
can limit access to the desired host adapter ports through the Fibre Channel zones. In a switched
fabric with multiple connections to the DS6000, this concept of LUN affinity enables the host
to see the same LUNs on different paths.
If the host is not capable of recognizing that the set of LUNs seen via each path is the same,
this may present data integrity problems when the LUNs are used by the operating system. To
get around this problem, you should install the IBM Subsystem Device Driver (SDD). Aside
from preventing the above problem, SDD also provides multipathing and load balancing,
which improves performance and path availability. SDD is covered in 5.6, “Subsystem Device
Driver (SDD) - multipathing” on page 157.
The number of times a DS6000 logical disk is presented as a disk device to an open host
depends on the number of paths from each host adapter to the DS6000. The number of paths
from an open server to the DS6000 is determined by the following:
The number of host adapter cards installed in the server
The number of connections between the SAN switches and the DS6000
The zone definitions created by the SAN switch software
Note: Each physical path to a logical disk on the DS6000 is presented to the host
operating system as a disk device.
By cabling the SAN components and creating zones as shown in Figure 5-7 on page 156,
each logical disk on the DS6000 will be presented to the host server four times since there
are four unique physical paths from host to DS6000. In the figure, Zone A
shows that FC0 will have access through DS6000 host ports I0000 and I0101. Zone B shows
that FC1 will have access through DS6000 host ports I0003 and I0102. In combination, this
provides four paths to each logical disk presented by the DS6000. If Zone A and Zone B
were modified to include four paths each to the DS6000, then the host would have a total of
eight paths to the DS6000. In that case, each logical disk assigned to the host would be
presented as eight physical disks to the host operating system. Additional DS6000 paths are
shown as connected to Switch A and Switch B, but are not in use for this example.
Figure 5-7 Recommended cabling and zoning: each host adapter zoned through a SAN switch to multiple DS6000 host ports
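As a sketch, single-initiator zones like Zone A and Zone B might be defined on a Brocade switch as follows (the alias and configuration names are hypothetical, and the syntax varies by switch vendor and firmware release):

zonecreate "ZoneA", "host_fc0; ds6000_I0000; ds6000_I0101"
zonecreate "ZoneB", "host_fc1; ds6000_I0003; ds6000_I0102"
cfgcreate "prod_cfg", "ZoneA; ZoneB"
cfgenable "prod_cfg"

Each zone contains exactly one host adapter, following the recommendation above.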
In a SAN environment, Subsystem Device Driver (SDD) is used to provide load balancing and
failover. SDD also adds another device to the host operating system for each logical disk
presented from the DS6000. Figure 5-8 on page 156 shows how SDD adds a pseudo device
called a vpath (virtual path) on top of the disk devices. The host operating system issues I/O
calls to vpath0 in the example, and SDD in turn picks the best physical path (disk0, disk1,
disk2, or disk3) to use at a given time.
Figure 5-8 SDD vpath pseudo device: I/O calls from the host OS go to vpath0, and SDD (Subsystem Device Driver) provides load balance and failover across the underlying disk devices
In the example in Figure 5-8, the number of devices presented to the host operating system
for each DS6000 logical disk is limited to five (four disk devices + 1 vpath).
You can see how the number of logical devices presented to a host could increase rapidly in a
SAN environment if care is not taken in selecting the size of logical disks and the number of
paths from the host to the DS6000.
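On an AIX host, the resulting vpath-to-hdisk relationships can be listed with the SDD lsvpcfg command; the output below is illustrative (the serial number, volume group name, and hdisk numbers are hypothetical):

lsvpcfg
vpath0 (Avail pv datavg) 75065513000 = hdisk1 (Avail ) hdisk2 (Avail ) hdisk3 (Avail ) hdisk4 (Avail )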
For dual attached hosts, we typically recommend cabling the switches and creating zones in the
SAN switch software so that each server host adapter has two to four paths from the switch to
each controller of the DS6000. Figure 5-7 on page 156 shows an example of hosts using four
paths to the DS6000; two paths to each controller. With hosts configured this way, you can let
SDD balance the load across the two host adapter ports of the controller that owns the LUN
(the preferred path), and SDD will also allow balanced access across the non-preferred paths
if the preferred controller fails for any reason.
Some operating systems and file systems natively provide benefits similar to those provided by SDD,
for example, z/OS, OS/400, NUMA-Q® Dynix, and HP/UX.
SDD provides DS6000 attached hosts running Windows, AIX, HP/UX, NetWare, Sun Solaris,
or Linux with:
Dynamic load balancing between multiple paths when there is more than one path from a
host server to the DS6000. This may eliminate I/O bottlenecks that occur when many I/O
operations are directed to common devices via the same I/O path, thus improving the I/O
performance.
Automatic path failover protection and enhanced data availability for users that have more
than one path from a host server to the DS6000. It eliminates a potential single point of
failure by automatically rerouting I/O operations to remaining active paths from a failed
data path.
An example of a dual attached host that can benefit from SDD is shown in Figure 5-9.
For some servers, like selected pSeries and RS/6000 models running AIX or for Windows
environments, booting off the DS6000 is supported. In that case LUNs used for booting are
manually excluded from the SDD configuration by using the querysn command to create an
exclude file. More information can be found in “querysn for multi-booting AIX off the DS6000”
on page 222.
For more information about installing and using SDD, refer to IBM TotalStorage Multipath
Subsystem Device Driver User’s Guide, SC30-4096. This publication and other information
are available at:
http://www.ibm.com/servers/storage/support/
The path selected for an I/O operation is determined by the policy specified for the
device. The policies available are:
Load balancing (default). The path to use for an I/O operation is chosen by estimating the
load on the adapter to which each path is attached. The load is a function of the number of
I/O operations currently in process on each adapter.
Round robin. The path to use for each I/O operation is chosen at random from the paths
that were not used for the last I/O operation.
Failover only. All I/O operations for the device are sent over the same path until the path
fails.
Normally, path selection is performed on a global rotating basis; however, the same path is
used when two sequential write operations are detected.
However, SDD does support a single-path Fibre Channel connection from your host system
to a DS6000. It is possible to create a volume group or a vpath device with only a single path.
Note: With a single-path connection, SDD cannot provide failure protection and load
balancing and this is not recommended.
Figure 5-10 Single-path connection from the host through a SAN switch to a logical disk behind DS6000 host port I0001; the host adapter and the SAN switch are each a single point of failure
From an availability point of view, the configuration is not good because of the single fiber
cable from the host to the SAN switch. However, this configuration is better than a single path
from the host to the DS6000 and can be useful for preparing for maintenance on the DS6000.
Figure 5-11 SAN multi-path connection with single fiber
When a path failure occurs, the IBM SDD automatically reroutes the I/O operations from the
failed path to the other remaining paths. This eliminates the possibility of a data path being a
single point of failure.
We generally recommend using SDDPCM as the preferred multipathing solution for AIX,
because it runs as part of the storage device driver and thus has minor performance
advantages. Another benefit of it is that for each logical/virtual disk configured on the
DS6000, you end up having one hdisk device (rather than one vpath device plus one hdisk
per path).
datapath open device path - Dynamically opens a path that is in an Invalid or Close_Dead state.
datapath query adapstats - Displays performance information for all SCSI and FCP adapters that are attached to SDD devices.
datapath query adapter - Displays information about a single adapter or all adapters.
datapath query device - Displays information about a single SDD device or all SDD devices.
datapath query devstats - Displays performance information for a single SDD device or all SDD devices.
datapath query essmap - Displays each SDD vpath device, path, location, and attributes.
datapath query portmap - Displays the connection status of SDD devices with regard to the storage ports to which they are attached.
datapath query wwpn - Displays the World Wide Port Name of the host adapter.
datapath remove device path - Dynamically removes a path of an SDD vpath device.
datapath set adapter - Sets all device paths that are attached to an adapter to online or offline.
datapath set device policy - Dynamically changes the path-selection policy of the SDD devices. Choices are round-robin, load balance, default, failover.
datapath set device path - Sets the path of a device to online or offline.
A subset of these commands is described here for the purpose of understanding path
management from a performance perspective. For more information about these commands,
refer to IBM TotalStorage Multipath Subsystem Device Driver User’s Guide, SC30-4096.
Example 5-1 on page 163 illustrates the command datapath query adapter. Notice that this
host has two adapters, both functioning normally. Output similar to the following is returned
(the adapter names and counter values shown here are illustrative):
Active Adapters :2
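Adpt#  Adapter Name  State    Mode    Select    Errors  Paths  Active
    0  fscsi0        NORMAL   ACTIVE  1104321        0     32     32
    1  fscsi3        NORMAL   ACTIVE  1102876        0     32     32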
The terms used in the output of datapath query adapter are defined as follows:
Adpt# The number of the adapter.
Adapter Name The name of the adapter.
State The condition of the named adapter. It can be either:
-Normal, adapter is in use.
-Degraded, one or more paths are not functioning.
-Failed, the adapter is no longer being used by SDD.
Mode The mode of the named adapter, which is either Active or Offline.
Select The number of times this adapter was selected for input or output.
Errors The number of errors on all paths that are attached to this adapter.
Paths The number of paths that are attached to this adapter. In the Windows
NT® host system, this is the number of physical and logical devices
that are attached to this adapter.
Active The number of functional paths that are attached to this adapter. The
number of functional paths is equal to the number of paths attached to
this adapter minus any that are identified as failed or offline.
An example of the datapath query device command is shown in Example 5-2. The output
shows the status of paths for vpath4. Notice that it has eight paths that are functioning
normally. We have an AIX system that sees eight hdisks (hdisk18, hdisk26, hdisk34, hdisk42,
hdisk50, hdisk58, hdisk66, and hdisk74) and the query command was issued when the AIX
volume group was online, so that the State shows OPEN. There are two different Fibre
Channel adapters on the host: fscsi0 and fscsi3. The switch zones are configured to give
each Fibre Channel adapter four paths to the DS6000. Based upon the number of selects that
we see for each hdisk or path, we can see that each Fibre Channel adapter has two preferred
paths and two non-preferred paths, which is also shown in Example 5-4 on page 164 and
Example 5-5 on page 165.
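Output similar to the following is what such a query returns (the serial number and select counts are illustrative; note the higher select counts on the four preferred paths):

DEV#:   4  DEVICE NAME: vpath4  TYPE: 1750  POLICY: Optimized
SERIAL: 75065513000
Path#    Adapter/Hard Disk    State   Mode     Select   Errors
    0    fscsi0/hdisk18       OPEN    NORMAL    12001        0
    1    fscsi0/hdisk26       OPEN    NORMAL        3        0
    2    fscsi0/hdisk34       OPEN    NORMAL    11987        0
    3    fscsi0/hdisk42       OPEN    NORMAL        2        0
    4    fscsi3/hdisk50       OPEN    NORMAL    12010        0
    5    fscsi3/hdisk58       OPEN    NORMAL        4        0
    6    fscsi3/hdisk66       OPEN    NORMAL    11995        0
    7    fscsi3/hdisk74       OPEN    NORMAL        3        0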
Example 5-4 is an example of the datapath query essmap command and shows the host
adapter port connections that are used for each path to the DS6000. The example output has
been edited to reflect the same vpath4 that was used in the previous examples. The standard
output from this query command would show all vpaths that are defined to this AIX system.
The datapath query essmap command is only available on AIX platforms.
The columns that are of interest for looking at host adapter paths have the headings of
Connection and port. The Connection column for hdisk18 (R1-B2-H1-ZB) shows that the path
is through DS6000 Controller Card 1 (bottom card), port 1 (R1=Rack 1, B2=I/O enclosure 1,
H1=host adapter 1, ZB=port 1), which is reflected in the port column as 101, a
translation of the portname in dscli of I0101. The notation Rx-By-Hz-Za shows the relative
position of the adapter in the DS6000, where:
Rx - Rack position - Always R1 for DS6000.
By - RAID Controller Card, where B1=controller0 (top) and B2=controller1 (bottom).
Hz - Host Adapter position - Always H1 for DS6000.
Za - Relative position of the port on the Host Adapter. ZA is the left-most host port, ZD is
the right-most host port.
For hdisk18, you can interpret the Connection column R1-B2-H1-ZB as Controller 1 (the
bottom controller of the DS6000), only adapter, and second port from the left or a port_ID in
dscli of I0101. Note that this command represents the preferred path with an asterisk in the
column headed “ P ”.
Example 5-5 on page 165 is an example of the datapath query portmap command which
simply provides a different view of the host adapter port connections that are used for each
path to the DS6000. Note that this command represents the preferred path with capital “ Y ”
and alternate or non-preferred path with lowercase “ y ”. Once again, this is a command that
is available on the AIX platform only.
Note: A 2105 device's essid has 5 digits, while a 1750/2107 device's essid has 7 digits.
For more information about SAN Volume Controller, see the redbook, IBM TotalStorage SAN
Volume Controller, SG24-6423.
The SAN Volume Controller solution is designed to reduce both the complexity and costs of
managing your SAN-based storage. With the SAN Volume Controller you will be able to:
Simplify management and increase administrator productivity by consolidating storage
management intelligence from disparate storage controllers into a single view.
Improve application availability by enabling data migration between disparate disk storage
devices non-disruptively.
Improve disaster recovery and business continuance by applying and managing
copy services across disparate disk storage devices within the Storage Area Network
(SAN). These solutions include a Common Information Model (CIM) Agent, enabling
unified storage management based on open standards for units that comply with CIM
Agent standards.
Provide advanced features and functions to the entire SAN, such as:
Large scalable cache
Copy Services
Space management
Mapping based on desired performance characteristics
Quality of Service (QoS) metering and reporting
For I/O purposes, SAN Volume Controller nodes within the cluster are grouped into pairs
(called I/O groups), with a single pair being responsible for serving I/O on a given vDisk. One
node within the I/O Group will represent the preferred path for I/O to a given vDisk - the other
node representing the non-preferred path. This preference alternates between the nodes as
vDisks are created, balancing the workload across the I/O Group.
Note: The preferred node by no means signifies absolute ownership. The data will still be
accessed by the partner node in the I/O Group in the event of a failure or if the preferred
node workload becomes too high.
Beyond automatic configuration and cluster administration, the data transmitted from
attached application servers is also treated in the most reliable manner. When data is written
by the host, the preferred node within the I/O Group stores a write in its own write cache and
the write cache of its partner (non-preferred) node before sending an I/O complete status
back to the host application. To ensure that data is written in the event of a node failure, the
surviving node empties its write cache and proceeds in write-through mode until the cluster is
returned to a fully operational state.
Note: Write-through mode is where the data is not cached in the nodes, but written directly
to the disk subsystem instead. While operating in this mode, performance is somewhat
degraded; more importantly, however, it ensures that the data makes it to its destination
without the risk of data loss that a single copy of data in cache would expose you to.
Furthermore, each of the two nodes in the I/O group is protected by a different uninterruptible
power supply.
The SAN must be zoned in such a way that the application servers cannot see the backend
storage, preventing any possible conflict between SAN Volume Controller and the application
servers both trying to manage the backend storage. Two distinct zones are defined in the
fabric:
In the host zone, the host systems can identify and address the nodes. You can have more
than one host zone. Generally, you will create one host zone per operating system type.
In the disk zone, the nodes can identify the disk storage subsystems. Generally, you will
create only one disk zone, including all the storage subsystems.
The SAN Volume Controller I/O Groups are connected to the SAN in such a way that all
backend storage and all application servers are visible to all of the I/O Groups. The SAN
Volume Controller I/O Groups see the storage presented to the SAN by the backend
controllers as a number of disks, known as Managed Disks or mDisks. Because the SAN
Volume Controller does not attempt to provide recovery from physical disk failures within the
backend controllers, mDisks are usually, but not necessarily, part of a RAID array.
mDisks are collected into one or several groups, known as Managed Disk Groups or MDGs.
Once an mDisk is assigned to an MDG, the mDisk is divided into a number of extents (default
minimum size 16 MB, maximum size 512 MB), which are numbered sequentially from the
start to the end of each mDisk.
An MDG provides a pool of capacity (Extents) which is used to create volumes, known as
Virtual Disks or vDisks.
When creating vDisks, the default choice of striped allocation is normally the best choice. This
option helps to balance I/Os across all the managed disks in a MDG, which tends to optimize
overall performance and helps to reduce hot spots. Conceptually, this might be represented
as shown in Figure 6-1.
The virtualization function in the SAN Volume Controller maps the vDisks seen by the
application servers on to the mDisks provided by the backend controllers. I/O traffic for a
particular vDisk is, at any one time, handled exclusively by the nodes in a single I/O Group.
Thus, although a cluster could have many nodes within it, the nodes handle I/O in
independent pairs. This means that the I/O capability of the SAN Volume Controller scales
well (almost linearly), since additional throughput can be obtained by simply adding additional
I/O Groups.
Figure 6-2 on page 171 summarizes the various relationships that bridge the physical disks
through to the virtual disks within the SAN Volume Controller architecture.
The multi-pathing driver supported by the SAN Volume Controller is IBM’s Subsystem Device
Driver (SDD). It manages the multiple paths from the host to the SAN Volume Controller
making use of the preferred paths in a round robin manner before using any non-preferred
path. SDD performs data path failover in the event of a failure within the SAN Volume
Controller, or the host path while also masking out the additional disks that would otherwise
be seen by the hosts due to the redundant paths through the SAN fabric.
Note: The SDD code has been updated to support the SAN Volume Controller, the
ESS, the DS6000, and the DS8000, and provided the latest version is used, IBM supports
the concurrent connection of a host to both a SAN Volume Controller and “native” storage
environments. Refer to IBM SDD documentation: Multipath Subsystem Device Driver
User's Guide, SC30-4096.
Note: SAN Volume Controller copy services functions are not compatible with the DS6000
and DS8000 copy services.
FlashCopy
FlashCopy is a Copy Service available with the SAN Volume Controller. It copies the contents
of a source virtual disk (VDisk) to a target VDisk. Any data that existed on the target disk is
lost and is replaced by the copied data. After the copy operation has been completed, the
target virtual disks contain the contents of the source virtual disks as they existed at a single
point in time. Although the copy operation takes some time to complete, the resulting data on
the target is presented in such a way that the copy appears to have occurred immediately.
Consistency Groups address the issue that an application may have related data which
spans multiple Virtual Disks. FlashCopy must be performed in a way which preserves data
integrity across multiple Virtual Disks. One requirement for preserving the integrity of data
being written is to ensure that dependent writes are executed in the application's intended
sequence.
A FlashCopy mapping can be created between any two virtual disks in a cluster. It is not
necessary for the virtual disks to be in the same I/O group or even in the same managed disk
group. This functionality provides the ability to optimize your storage allocation by using a
secondary storage subsystem (with, for example, lower performance) as the target of the
FlashCopy. In this case, the resources of your high performance storage subsystem are
dedicated to production only, while your low-cost (lower performance) storage subsystem is
used for a secondary application (for example, backup or development). Figure 6-3 on
page 173 represents a FlashCopy relationship created between two vDisks defined in
different Managed Disk Groups from different backend disk subsystems.
The SAN Volume Controller assumes that the FC fabric to which it is attached contains
hardware that achieves the long distance requirement for the application. This hardware
makes storage at a distance accessible as though it were local storage. Specifically, it enables
two SAN Volume Controller clusters to connect to each other and establish communications
in the same way as though they were located nearby on the same fabric. The only difference
is in the expected latency of that communication, the bandwidth capability of the link, and the
availability of the link as compared with the local fabric.
The relationship between the two copies is not symmetric. One copy of the data set is
considered the primary copy (sometimes also known as the source). This copy provides the
reference for normal run-time operation. Updates to this copy are shadowed to a secondary
copy (sometimes known as the destination or even target). The secondary copy is not
normally referenced for performing I/O.
The remote copy can be maintained in one of two modes, synchronous or asynchronous.
Synchronous remote copy ensures that updates are committed at both primary and
secondary before the application is given completion to an update. This ensures that the
secondary is fully up-to-date should it be needed in a failover. However, this means that the
application is fully exposed to the latency and bandwidth limitations of the communication link
to the secondary. Where this is truly remote, this can have a significant adverse effect on
application performance.
Today SAN Volume Controller implements synchronous Remote Copy. Future releases
should incorporate asynchronous Remote Copy.
Figure 6-4 Synchronous remote copy relationship between 2 SAN Volume Controller clusters
In the following section, we present the IBM SAN Volume Controller concepts and discuss the
performance of the SAN Volume Controller. In this section, we assume there are no
bottlenecks in the SAN or on the disk subsystem.
For sequential read operations, such as database scans or backup operations, a single I/O
group can achieve up to 1 GB per second, given that the backend disk configuration is
properly configured to provide this level of throughput.
If you have information about the workloads that you plan to use with the SAN Volume
Controller, you can use this information to size the amount of capacity you can configure per
I/O group. To be conservative, assume a throughput ability of about 800 MB/s per I/O group.
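For example, a workload with an expected aggregate peak of 2400 MB/s of sequential read throughput would call for at least three I/O groups (2400 / 800 = 3).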
These guidelines show that grouping similar disks together is important. The following
guidelines should be followed when grouping similar disks:
Group equally performing managed disks (arrays) in a single group.
Group similar arrays, for example, all RAID 5 arrays, in one group.
Group managed disks from the same type of storage subsystem in a single managed disk
group.
Group managed disks that use the same type of underlying physical disk (for example, disk
capacity, RPM).
Note: When configuring managed disks with the SAN Volume Controller, create managed
disk groups to use the largest practical SAN Volume Controller extent size. Doing so
maximizes the learning ability of the SAN Volume Controller adaptive cache.
Redundant array of independent disks (RAID) is a method of configuring multiple disk drives
in a storage subsystem for high availability and high performance. The collection of two or
more disk drives presents the image of a single disk drive to the system. In the event of a
single device failure, data can be read or regenerated from the other disk drives in the array.
With RAID implementation, the Storage Unit offers fault-tolerant data storage. The Storage
Unit supports RAID implementation on the Storage Unit device adapters. The Storage Unit
supports groups of disk drive modules (DDMs) in both RAID 5 and RAID 10.
Array size
A DS6000 Array is a RAID 5 or RAID 10 Array made up of 4 or 8 DDMs.
We recommend configuring Array sites made up of 8 DDMs to get the maximum performance
out of your backend storage system. For further discussion, refer to Chapter 3, “Logical configuration
planning” on page 53.
A DS6000 Array is created from one 8-DDM Array Site. DS6000 RAID 5 arrays will be either
6+P+S or 7+P. DS6000 RAID 10 arrays will be either 3+3+2S or 4+4.
RAID 5 or RAID 10
There are a number of workload attributes that influence the relative performance of RAID 5
versus RAID 10, including the use of cache, the relative mix of read versus write operations,
and whether data is referenced randomly or sequentially.
Consider that:
For either sequential or random reads from disk, there is no significant difference in RAID
5 and RAID 10 performance, except at high I/O rates.
For random writes to disk, RAID 10 performs better.
For sequential writes to disk, RAID 5 performs better.
For more details regarding the differences between RAID 5 and RAID 10, refer to 2.8.5, “RAID 5 versus
RAID 10 performance” on page 39.
If you need to maximize the performance of your SAN Volume Controller configuration,
allocate each Rank to its own Extent Pool so that you configure one Rank per pool. This gives
you the ability to direct your allocations to a known location within the DS6000. Furthermore,
this configuration will help you manage and monitor the resultant logical disk performance
when required.
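A minimal DS CLI sketch of this one-Rank-per-Extent-Pool layout follows (the array site, rank, and pool IDs are hypothetical, and the command parameters should be verified against your DS CLI release):

mkarray -raidtype 5 -arsite S1
mkrank -array A0 -stgtype fb
mkextpool -rankgrp 0 -stgtype fb svc_pool_1
chrank -extpool P1 R0

Repeating these steps once per Rank keeps each Rank in its own Extent Pool.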
To clearly explain this performance limitation we can use as an example the configuration
presented in Figure 6-5 on page 178.
In this example, an Extent Pool (Extent Pool 0) is defined on a DS6000. This Extent Pool
includes 3 Ranks of 519 GB. The overall capacity of this Extent Pool is 1.5 TB. This capacity
is available through a set of 1 GB DS6000 Extents (standard DS6000 Extent size).
In this pool of available Extents, we create one DS6000 Logical Volume called volume0,
which contains all the Extents in the Extent Pool. volume0 is 1.5 TB. Due to the DS6000
internal Logical Volume creation algorithm, the Extents from Rank1 will be assigned first, then
the Extents of Rank2, and then the Extents of Rank3. In this case, the data stored on the
first third of volume0 will be physically located on Rank1, the second third on Rank2,
and the last third on Rank3.
When volume0 is assigned to the SAN Volume Controller, the Logical Volume is identified by
the SAN Volume Controller cluster as the Managed Disk, mDiskB. mDiskB is assigned to a
Managed Disk Group, MDG0, where the SAN Volume Controller Extent size is defined as 512
MB (the size can be from 16 MB to 512 MB). Two other Managed Disks, mDiskA and mDiskC,
are also defined in this Managed Disk Group. mDiskA and mDiskC are also 1.5 TB and come from
the same DS6000, but they come from different Extent Pools. These Extent Pools are
configured similarly to Extent Pool 0.
When vDisk0 was created, it was assigned one SAN Volume Controller Extent from
mDiskA, then one from mDiskB, then one from mDiskC, and so on. In total, vDisk0 was
assigned the first 34 Extents of mDiskA, the first 33 of mDiskB, and the first 33 of mDiskC.
Here is the bottleneck. All of the first 33 Extents used from mDiskB are physically located at
the beginning of volume0, which means that all of these Extents belong to DS6000 Rank1.
This configuration does not follow the performance recommendation that you spread
the workload assigned to vDisk0 across all the Ranks defined in the Extent Pool. In this case,
performance will be limited to the performance of a single Rank.
Furthermore, if the configurations of mDiskA and mDiskC are equivalent to mDiskB, the data
stored on vDisk0 is spread across only 3 of the 9 Ranks available within the three DS6000
Extent Pools used by the SAN Volume Controller.
This example shows the bottleneck for vDisk0, but more generally, almost all of the vDisks
created in this Managed Disk Group will be spread across only three Ranks instead of the nine
available.
Attention: The configuration presented in Figure 6-5 on page 178 is not optimized for
performance in a SAN Volume Controller environment.
To clearly explain this performance optimization, we can use as an example the configuration
presented in Figure 6-6 on page 180.
In this example, three Extent Pools (Extent Pools 1, 2, and 3) are defined on a DS6000. Each Extent
Pool includes only 1 Rank of 519 GB. The overall capacity of each Extent Pool is 519 GB.
This capacity is available through a set of 1 GB DS6000 Extents (standard DS6000 Extent
size).
In each Extent Pool we create one volume (volume1, volume2, and volume3) which is assigned all
the capacity of its Extent Pool (all the available Extents are assigned). The three volumes
created have a size of 519 GB each.
volume1, volume2, and volume3 are assigned to the SAN Volume Controller, and these Volumes
are identified by the SAN Volume Controller cluster as Managed Disks (in the example
mDiskA, mDiskB, mDiskC). These Managed Disks are assigned to a Managed Disk Group
(MDG0) where the SAN Volume Controller Extent size is defined as 512 MB (the size can be from
16 MB to 512 MB).
The overall capacity of the Managed Disk Group is 1.5 TB. This capacity is available through
a set of 512 MB SAN Volume Controller Extents. In this storage pool, a Virtual Disk
(vDisk0) of 50 GB (100 SAN Volume Controller Extents) is created. The Virtual Disk is created in
SAN Volume Controller Striped mode in order to obtain the greatest performance.
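Continuing the example, the striped vDisk might be created with the SVC CLI as in this sketch (the Managed Disk Group and I/O group names are from this example, and exact parameters can vary between SAN Volume Controller releases):

svctask mkvdisk -mdiskgrp MDG0 -iogrp io_grp0 -vtype striped -size 50 -unit gb

With -vtype striped, the 100 Extents for vDisk0 are allocated round-robin across mDiskA, mDiskB, and mDiskC.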
In this case, the SAN Volume Controller Extents assigned to vDisk0 are physically located
on Rank1, Rank2, and Rank3 of the DS6000. This configuration permits you to spread the
workload applied to vDisk0 across all three Ranks of the DS6000. In this case we efficiently use
the hardware available for each vDisk of the Managed Disk Group.
Important: The configuration presented in Figure 6-6 on page 180 is optimized for
performance in a SAN Volume Controller environment.
Note: For SAN Volume Controller attachment to the DS6000, we recommend using FC
ports from both Server0 and Server1 to improve DS6000 access availability.
IBM TotalStorage Productivity Center and its product IBM TotalStorage Productivity Center for
Disk are presented in 4.3, “IBM TotalStorage Productivity Center for Disk” on page 109.
Refer to that section for more details.
For more general information about TotalStorage Productivity Center, refer to the redbook,
IBM TotalStorage Productivity Center: Getting Started, SG24-6490.
Device management
IBM TotalStorage Productivity Center for Disk can provide access to single-device and
cross-device configuration functionality. It enables the user to view important information
about the storage devices that are discovered by IBM TotalStorage Productivity Center for
Disk, examine the relationships between those devices, or change their configurations. IBM
TotalStorage Productivity Center for Disk supports the discovery and logical unit number
(LUN) provisioning of IBM TotalStorage DS4000 series storage systems, IBM TotalStorage
ESS, IBM TotalStorage DS6000, IBM TotalStorage DS8000 and IBM TotalStorage SAN
Volume Controller.
The user can view essential information about the storage, view the associations of the
storage to other devices, and change the storage configuration. DS8000, DS6000, ESS and
DS4000 storage subsystems, attached to the SAN or attached behind the SAN Volume
Controller, can be managed by IBM TotalStorage Productivity Center for Disk.
IBM TotalStorage Productivity Center for Disk is designed to enable the IT administrator to:
Monitor performance metrics across storage subsystems from a single console
Receive timely alerts to enable event actions based on customer policies
Focus on storage optimization through the identification of LUN hot spots
6.3.2 Using IBM TotalStorage Productivity Center for Disk to monitor the SAN
Volume Controller
To install and configure TotalStorage Productivity Center for Disk to monitor IBM SAN Volume
Controller, refer to the IBM redbook, Managing Disk Subsystems using IBM TotalStorage
Productivity Center, SG24-7097 and to the Redpaper, Using IBM TotalStorage Productivity
Center for Disk to Monitor the SVC, REDP-3961.
You will need to set up a new Performance Data Collection Task for the SAN Volume
Controller device.
Performance metrics are collected at different levels within the SAN Volume Controller:
Virtual disk (VDisk): For a single VDisk or all VDisks combined
– Total and average number of reads and writes
– Number of 512-byte blocks read and written
Managed disk (MDisk): For a single MDisk or per MDisk group
– Total and average number of reads and writes
– Number of 512-byte blocks read and written
– Read and write transfer rates
– Total, average, minimum, and maximum response time
Once data collection is complete, you may use the gauges task to retrieve information about a
variety of storage device metrics. Gauges are used to drill down to the level of detail
necessary to isolate performance issues on the storage device. To view information collected
by the Performance Manager, a gauge must be created or a custom script written to access
the DB2 tables/fields directly.
The data samples you collect must cover the appropriate time period, corresponding with the
highs and lows of the I/O workload. They should also cover sufficient iterations of the peak
activity to permit analysis over a period of time.
If you plan to perform analysis for one specific instance of activity, ensure that the
performance data collection task covers that specific time period.
Note: The SAN Volume Controller can perform data collection at a minimum interval of
15 minutes.
You may only enable a particular threshold once the minimum values for warning and error
levels have been defined.
Tip: In TotalStorage Productivity Center for Disk, default threshold warning or error values
of -1.0 are indicators that there is no recommended minimum value for the threshold and
are therefore entirely user defined. You may elect to provide any reasonable value for these
thresholds, keeping in mind the workload in your environment.
See the following Web site for the latest supported configurations:
http://www-1.ibm.com/servers/storage/support/virtual/2145.html
6.4.1 Sharing the DS6000 between open systems server hosts and the IBM
SAN Volume Controller
If you have a mixed environment including IBM SAN Volume Controller and open systems
servers, we recommend sharing as much of the DS6000 resources as possible between both environments.
An example of a storage configuration recommendation is to create one Extent Pool per
Rank. In each Extent Pool, create one volume allocated to the IBM SAN Volume Controller
environment and one or more other volumes allocated to the open system server hosts. In
this configuration, each environment can benefit from the DS6000 overall performance.
IBM supports sharing a DS6000 between a SAN Volume Controller and open system server
hosts. However, if a DS6000 port is in the same zone as a SAN Volume Controller port, that
same DS6000 port should not be in the same zone as another host.
6.4.2 Sharing the DS6000 between iSeries host and the IBM SAN Volume
Controller
IBM SAN Volume Controller does not support iSeries host attachment. If you have a mixed
server environment including IBM SAN Volume Controller and iSeries servers, you have to
share your DS6000 to provide a direct access to iSeries volumes and access to open system
server volumes through the IBM SAN Volume Controller.
IBM supports sharing a DS6000 between a SAN Volume Controller and iSeries hosts.
However, if a DS6000 port is in the same zone as a SAN Volume Controller port, that same
DS6000 port should not be in the same zone as iSeries hosts.
6.4.3 Sharing the DS6000 between zSeries server host and the IBM SAN
Volume Controller
IBM SAN Volume Controller does not support zSeries host attachment. If you have a mixed
server environment including IBM SAN Volume Controller and zSeries servers, you have to
share your DS6000 to provide direct access to zSeries volumes and access to open system
server volumes through the IBM SAN Volume Controller.
In this case, you have to split your DS6000 resources between the two environments. Some of
the Ranks have to be created using CKD format (used for zSeries access) and the others
using fixed block (FB) format (used for open systems access through the SAN Volume Controller).
A DS6000 port will not support a shared attachment between zSeries and IBM SAN Volume
Controller because zSeries servers use ESCON or FICON connection and IBM SAN Volume
Controller only supports FC connection.
Attention: The new Cache Disable VDISK functionality included in the SAN Volume
Controller release 3.1 will provide the ability to use disk subsystem copy services for LUNs
that are managed by the SAN Volume Controller.
Before you delete or un-map a volume from the SAN Volume Controller, remove the logical
unit from the Managed Disk Group. The following is supported:
The supported volume size is 1 GB to 2 TB.
Logical units can be added dynamically.
Throughout this chapter, we refer to Ranks, Arrays and Extent Pools. These terms will be
used interchangeably and they all refer to the same thing. This assumes that there is one
Rank per Extent Pool as is recommended in Chapter 2, “Hardware configuration planning” on
page 17 and Chapter 5, “Host attachment” on page 143.
The tips and tools presented in this chapter will allow you to:
Collect host I/O stats for:
– Individual disk devices (paths to DS6000 LUNs)
– Vpaths
– Ranks
Develop an iostat report for all Ranks in the DS6000 (enterprise iostats) from a host
perspective.
Create baseline measurements of performance.
Test and improve sequential I/O.
Sometimes you will want to view performance of a specific host, and other times you will want
to view performance statistics of DS6000 components. Remember that multiple hosts can be
using logical disks (LUNs) from the DS6000 that reside on the same Array.
Keep in mind that the most important I/O measurements to gather from a server’s disk
subsystem are:
Number of I/O transactions per second (IOPS)
Total MB/s transferred
MB/s read
MB/s written
KB/transaction = [ (KB read/second + KB written/second) / (transactions/second ) ]
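On AIX, for example, these metrics can be gathered with iostat; a minimal sketch (the interval and count are illustrative):

iostat -d 30 4

This prints four 30-second samples of the disk-only report; the tps column gives I/O transactions per second, and the Kb_read and Kb_wrtn columns give the data transferred in each interval, from which KB/transaction can be derived using the formula above.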
For general purpose, random I/O applications, the approach described here is the single
most important step you can take toward optimum performance. Of course, there are
exceptions, like DB2, which uses algorithms unique to the application to balance I/O.
Note: The recommended method does not apply to certain applications like DB2. See
Chapter 13, “Databases” on page 415.
Many customers have SAN administrators and UNIX administrators. Eventually, these two
groups must get together to decide which servers get which LUNs. That decision is what
this section is about. For optimum performance, the UNIX administrator should request one
LUN per Rank, in a round robin way, until the required amount of storage is reached, and
each LUN should be the same size. For instance, if the application server needs 250 GB, this
total storage requirement should be satisfied by assigning four 72 GB LUNs to the server. And
most importantly, each LUN is physically located on a different Rank. See Figure 7-1 on
page 192. Notice that the host server has been allocated a piece of every DDM in our
hypothetical DS6000. Every spindle of every DDM has allocated storage to the host server.
Figure 7-1 One 72 GB LUN allocated from every Rank on Server 0 and Server 1
Note: For best performance, assign LUNs evenly from many available Ranks to your host
servers. This ultimately involves the maximum number of DDMs in doing the I/O work.
The above example uses two disk enclosures because, frankly, it would be difficult to cram
more Extent Pools into Figure 7-1. It is realistic to expect your configuration to include more
disk enclosures, each enclosure cabled to the two different disk adapter pairs or loops. So, to
expand upon our concept of the recommended method slightly: balance the LUNs assigned to
host systems between Ranks and disk adapter loops. The idea is to equally distribute your I/O
load evenly across all of the performance related resources (DDMs, Ranks, DA pairs) within
the DS6000 - get as much of this hardware working for you as possible!
It is also important to note that all the LUNs assigned to a host system should be the same
size, share a common DDM size and RAID Array type. In our hypothetical system, all the
LUNs were 72 GB, all the Arrays were RAID 5, and all the DDMs were 146 GB. The next host
system may only need 32 GB LUNs from every Rank. It is perfectly acceptable to have
different size LUNs sitting next to each other on the same Rank.
As this book is about performance and tuning, we end this section as we started it. The single
most important step you can take for optimum performance is to evenly distribute a host
system’s LUNs among multiple Ranks in the DS6000.
The next step for the UNIX system administrator is to begin the process of acquiring the LUNs
assigned and configuring them to the Logical Volume Manager of the operating system. This
discussion continues in 7.8.1, “Creating the volume group” on page 233, later in this chapter.
It is not necessary to have all the LUNs in a DS6000 be the same size. As stated in the
previous section, what is important is that a host system’s LUNs be evenly balanced among
Ranks. It is perfectly acceptable to have, for instance, one 72 GB LUN on every Rank
assigned to server A and one 8 GB LUN on every Rank assigned to server B. The balance is
still there.
Many of the considerations for LUN size, from a UNIX perspective, depend upon the functions
of a UNIX Logical Volume Manager. AIX and HP-UX have this function as part of the base
operating system. Solaris has Veritas Volume Manager. Here are some considerations when
choosing the DS6000 Logical Volume (LUN) size:
The recommended method advocates assigning one LUN from each Rank to your server.
Consider choosing your LUN size so that one LUN from every Rank gives your host
system the amount of storage it needs. For example, if you know that, on average, most of
your servers will require around 500 GB and your DS6000 has 16 Ranks, then a LUN size
of about 32 GB (500 GB / 16 Ranks) should be considered.
When filling up a LUN with logical volumes, always leave at least one physical partition
free on every LUN in the volume group. This leaves some extra room for the volume group
descriptor area (VGDA) to grow and enables the volume group to be expanded,
reorganized online, or changed from a standard volume group to a big volume group or to
a scalable volume group.
The concept of OS level striping will be recommended later in this chapter as a way to
further enhance the storage performance of a host server. OS level striping should be
done across same-size LUNs from different Ranks.
Choose a LUN size so that one to three LUNs from each Rank will satisfy the host
system’s storage requirements. This will prevent a huge number of LUNs from being
presented to the operating system. Too many LUNs are harder to manage, and they can
also impact boot times and HACMP failover events.
There are, of course, situations where larger LUN sizes (greater than 72 GB) should be
considered. For instance, in HACMP environments where failover time is important, consider
much larger LUN sizes - one LUN that uses the entire Rank. This is because during a failover
event, the failover time is largely determined by the total number of HACMP managed logical
volumes. Another example would be if you were preparing LUNs for the SAN Volume
Controller, where very large LUNs are preferred. Another example would be if the LUNs were
intended for DB2, which uses a containers concept to balance I/O. See Chapter 13,
“Databases” on page 415.
This step is worth doing! Figure 7-2 on page 195 shows an example of how to document
the storage allocation of a DS6000 using an Excel spreadsheet.
The legend for the above diagram is shown below in Figure 7-3 on page 196.
Here are the three most common multipathing solutions for UNIX operating systems that are
supported by the DS6000:
SDDPCM
– Available for AIX only
– Preferred multipath product for AIX
SDD
It is especially important in a SAN environment to limit the number of disk devices presented
to a host. In a SAN, every extra path from the host to the DS6000 will cause another disk
device to be presented to the host OS for every DS6000 LUN assigned to it with SDD. With
SDDPCM only one hdisk is presented to the host for a LUN.
It is important to understand the total bandwidth requirements of the host system when
choosing the number of paths from the DS6000. With 2 Gb Fibre Channel, four paths provide
four different 200 MB/s connections from the DS6000 - 800 MB/s total. This should
be adequate to supply the maximum 400 MB/s that can be used by the two HBAs on the host.
In a special situation where a host system has four HBAs (or more) it might be necessary to
increase the number of paths from the DS6000 to the SAN.
Note: Because the number of paths might influence performance, you should use the
minimum number of paths necessary to achieve your performance requirements. The
recommended number of paths is 2 to 4.
For more information about SAN zoning for performance and availability, refer to 5.5, “SAN
implementations” on page 151.
In a SAN environment, the microcode levels on the DS6000, the SCSI and Fibre Channel
adapters on the servers, and the SAN switch code all affect each other.
You can find information about microcode levels for RS/6000 and pSeries servers and
adapters at:
http://techsupport.services.ibm.com/server/mdownload
We will cover some specific SDD commands for AIX, HP-UX, and Sun Solaris in 7.6, “SDD
commands for AIX, HP-UX, and Solaris” on page 218. For more details on SDD see 5.6,
“Subsystem Device Driver (SDD) - multipathing” on page 157.
Useful information can also be found in the IBM TotalStorage DS6000: Host Systems
Attachment Guide, SC26-7628. Also, see the DS6000 Interoperability Matrix for equipment
that IBM has tested and supports attaching to the DS6000 at:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Keep in mind that when these tools were created, UNIX servers had their own locally
attached storage and did not use disk devices presented from centralized disk storage
servers like the DS6000, which is full of RAID arrays.
We would not call these tools legacy yet, but some of their features do not work well with
storage from RAID arrays. When looking at the output from these commands keep in mind
that the numbers presented are not for a single disk anymore, but for a logical disk (LUN) on a
DS6000 (RAID 5 or RAID 10) Rank.
These tools are worth discussing because they are almost always available and system
administrators are accustomed to using them. You may have to administer a server, and
these are the only tools you have available to use. These tools offer a quick way to tell if a
system is I/O bound.
7.3.1 iostat
The base tool for evaluating I/O performance of disk devices for UNIX operating systems is
iostat. Although available on most UNIX platforms, iostat varies in its implementation from
system to system.
The iostat command is a fast way to get a first impression of whether the system has an
I/O-bound performance problem or not. The tool reports I/O statistics for TTY devices, disks,
and CD-ROMs. It is used for monitoring system I/O device utilization by observing the time
physical disks are active in relation to their average transfer rates.
It would not be unusual to see a device reported by iostat as 90 percent to 100 percent busy
because a DS6000 volume that is spread across an array of multiple disks can sustain a
much higher I/O rate than for a single physical disk. Having a device 100 percent busy would
generally be a problem for a single device, but probably not for a RAID 5 device.
Tip: When using iostat on a server that is running SDD with multiple attachments to the
DS6000, each disk device is really just a single path to the same logical disk (LUN) on the
DS6000. To understand how busy a logical disk is, you need to sum up iostats for each
disk device making up a vpath.
Figure 7-4 shows an example of how multiple paths to the DS6000 affect information
presented by iostat. In the example, a server has two Fibre Channel adapters and is zoned
so that it uses four paths to the DS6000.
Figure 7-4 (diagram): I/O calls from the OS go to vpath0, presented by SDD (Subsystem
Device Driver), which load balances and fails over across the two host Fibre Channel
adapters (FC0 and FC1) and the DS6000 host ports. The OS sees four disk devices (disk1
through disk4), each reported on separately by iostat, but all four are paths to the same
LUN.
In order to determine the I/O statistics for vpath0 for the example given in Figure 7-4, you
would need to add up the iostats for hdisk1–4. One way to find out which disk devices make a
vpath is to use the datapath query essmap command included with SDD.
Another way is shown in Example 7-1 on page 200. The command datapath query device
0 lists the paths (hdisks) to vpath0, together with identifying information for the logical
disk on the DS6000.
For a system with a large number of disk devices presented from the DS6000, iostat can
lose its effectiveness. You may want to try running iostat and then sort the output by %busy.
If your AIX system is in a SAN environment, you may have so many hdisks that iostat
presents too much information. We recommend using nmon, which can report iostats based
on vpaths or Ranks, as discussed in 7.4, “AIX-specific I/O monitoring commands and tools”
on page 208.
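If you do want to sort iostat output by busy, a sketch for AIX follows (it assumes %tm_act
is the second column of the disk report; remember that the first stanza reports statistics
since boot):
iostat -d 5 2 | sort -nrk2 | head -20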
The tables that follow show sample iostat reports from IBM AIX, Sun Solaris, and HP-UX
systems.
Notice that the first stanza of the iostat output reports historical (since boot) statistics.
Together with the KB read and written, the output reports the following:
%tm_act column indicates the percentage of the measured interval time that the device
was busy.
tps column shows the transactions per second over the interval period for the device. The
I/O transaction is a variable length of work assigned to a device. This field may also
appear higher than would normally be acceptable for a single physical disk device.
%iowait has become somewhat misleading as a measure of disk performance with the
advent of faster and faster CPU speeds. This value is really an indication of the percent of
time the CPU is idle, waiting for I/O to complete. As such, it is only indirectly related to I/O
performance.
The r/s column shows 124.3 reads per second; the %b column shows 90 percent busy for the
device; but the svc_t column shows a service time of 15.7 ms, quite reasonable for 124 I/Os
per second.
The calculations for service time that iostat presents are based on a single physical volume,
and, as previously mentioned, the physical volume that the DS6000 presents to the host is in
reality composed of multiple physical disks.
With RAID disks, the %b figure can be misleading and should not be relied on. To figure out
how busy the individual disks are in a RAID array in the DS6000, we would need to add up all
the iostats for LUNs on that array and divide by the number of disks in the array.
Notice that for Sun Solaris, iostat uses disk aliases like sdX for disk devices like cXtYdZ.
Depending on which version of Sun Solaris you are running, you may be able to use an -n flag
for iostat to list devices in the cXtYdZ format. Example 7-4 shows the output of iostat -n for
a Sun Solaris server.
There are also scripts available from Sun or in Sun Solaris user groups to map from sdX
aliases to cxtydz devices. Search on the Internet for sd_to_cxtydz.sh.
The man page for the iostat command on HP-UX states that the msps field is set to 1.0. With
the advent of new disk technologies, such as data striping, where a single data transfer is
spread across several disks, the number of milliseconds per average seek becomes
impossible to compute accurately. At best it is only an approximation, varying greatly, based
on several dynamic system conditions. For this reason, and to maintain backward
compatibility, the milliseconds per average seek (msps) field is set to the value 1.0.
For HP-UX, you may prefer to use vmstat -d to view disk stats, or use both vmstat and
iostat. Details on the HP-UX vmstat output are shown in 7.3.3, “vmstat” on page 206.
iostat summary
In a SAN environment with the DS6000 presenting several disk devices to a host, iostat
output is not as easy to evaluate as when using individual SCSI disks. You will probably want
to use another tool that presents iostats based on vpaths or internal DS6000 performance
statistics. The use of SDDPCM avoids the issue of having reports for each path, and is part of
the reason why SDDPCM is preferred.
With a DS6000, also remember that typically the majority of random writes are happening at
cache speeds. Data is written to the DS6000 and stored in cache to be destaged to disks
later. For example, you can run a command in one window to copy a large file between file
systems on DS6000 disks. Then in another window, watch iostat output. You will see that the
write comes back as complete before the disk activity has stopped; this is due to the DS6000
reporting to the host system, that the write is complete as soon as all data was written to
DS6000 cache. iostat will show disk activity still taking place as data is destaged from cache
to disk.
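To see this for yourself, something like the following sketch works (the file and file
system names are hypothetical):
cp /ds6kfs1/bigfile /ds6kfs2/bigfile &     (window one: copy a large file)
iostat 2                                   (window two: watch disk activity continue after the copy returns)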
Taken alone, there is no unacceptable value for any of the above iostat fields because
statistics are too closely related to application characteristics and system configuration.
Therefore, when evaluating data, look for patterns and relationships. The most common
relationship is between disk utilization and data transfer rate.
To draw any valid conclusions from iostat data, you have to understand the application’s disk
data access patterns such as sequential, random, or combination, and the type of physical
disk drives and adapters on the system.
For example, if an application reads/writes sequentially, you should expect a high disk transfer
rate when you have a high disk busy rate. Kb_read and Kb_wrtn can confirm an understanding
of an application’s read/write behavior. However, they provide no information about the data
access patterns.
Generally you do not need to be concerned about a high disk busy rate as long as the disk
transfer rate is also high. However, if you get a high disk busy rate and a low disk transfer rate,
you may have a fragmented logical volume, file system, or individual file that is causing the
bottleneck.
7.3.2 SAR
System Activity Report (SAR) is a tool that reports the contents of certain cumulative activity
counters within the UNIX operating system. SAR has numerous options, providing paging,
TTY, CPU busy, and many other statistics. Used with the appropriate command flag (-u) SAR
provides a quick way to tell if a system is I/O bound.
There are three possible modes in which to use the sar command:
Real-time sampling and display
System activity accounting via cron
Display previously captured data
We will discuss these three modes of using the sar command. The following discussion uses
the AIX operating system as a platform for the examples. However, these commands are
common to HP-UX and Solaris as well.
Average      44      15       5      35    (closing line of the sar -u example output: %usr, %sys, %wio, %idle)
Not all sar options are the same for AIX, HP-UX, and Sun Solaris, but the sar -u output is the
same. The output in the example shows CPU information every 2 seconds, 5 times.
To check if a system is I/O bound, the important column to look at is %wio. The %wio includes
time spent waiting on I/O from all drives, including internal and DS6000 logical disks. If %wio
values exceed 40, this would give an indication that more investigation may be warranted to
understand storage I/O performance. The next thing to look at would be I/O service times
reported by the filemon command. You need to understand your workload, though, to make a
judgement. High %wio values may simply imply that the host system’s configuration has more
CPU power than its I/O subsystem can keep busy.
There are useful flags for sar on AIX, especially the -d flag; note that the sar -d output
changed at AIX 5.3.
The avwait and avserv are the average times spent in the wait queue and service queue
respectively. And avserv here would correspond to avgserv in the iostat output. The avque
value changed; at AIX 5.3, it represents the average number of IOs in the wait queue, and
prior to 5.3, it represents the average number of I/Os in the service queue.
Also remember that a system with busy CPUs can mask I/O wait. The definition of %wio is:
Idle with some process waiting for I/O (only block I/O, raw I/O, or VM pageins/swapins
indicated). If the system is CPU busy and also is waiting on I/O, the system accounting will
increment the CPU busy but not the %wio column.
The other column headings mean (refer to Example 7-6 on page 204):
%usr time system spent executing application code
%sys time system spent executing operating system calls
%idle time the system was idle with no outstanding I/O requests
To configure a system to collect data for sar, you can run the sadc command or the modified
sa1 and sa2 commands. Here is more information about the sa commands and how to
configure sar data collection:
The sa1 and sa2 commands are shell procedure variants of the sadc command.
The sa1 command collects and stores binary data in the /var/adm/sa/sadd file, where dd is
the day of the month.
The sa2 command is designed to be run automatically by the cron command and run
concurrently with the sa1 command. The sa2 command will generate a daily report called
/var/adm/sa/sardd. It will also remove a report more than one week old.
/var/adm/sa/sadd contains the daily data file, where dd represents the day of the month.
/var/adm/sa/sardd contains the daily report file, where dd represents the day of the month
(note the r in /var/adm/sa/sardd for the sa2 output).
To configure a system to collect data, edit the root crontab file. For our example, if we just
want to run sa1 every 15 minutes every day, and the sa2 program to generate ASCII versions
of the data just before midnight, we will change the cron schedule to look like the following:
0,15,30,45 * * * 0-6 /usr/lib/sa/sa1
55 23 * * 0-6 /usr/lib/sa/sa2 -A
You can view performance information from these files with:
sar -f /var/adm/sa/sadd      where dd is the day you are interested in.
You can also focus on a certain time period, say 8 a.m. to 5:15 p.m. with:
sar -s 8:00 -e 17:15 -f /var/adm/sa/sadd
You can save sar info to view later with the commands:
sar -A -o data.file interval count > /dev/null &     (saves sar data to data.file)
sar -f data.file                                     (reads sar info back from the saved file)
All data is captured in binary form and saved to a file (data.file). The data can then be
selectively displayed with the sar command using the -f option.
sar summary
sar helps to tell quickly if a system is I/O bound. Remember though, that a busy system can
mask I/O issues since io_wait counters are not increased if the CPUs are busy. Compare sar
-d to iostat on your system and check the man pages for the different options to use. You
may prefer the sar -d output to iostat.
sar can help to save a history of I/O performance so you have a baseline measurement for
each host. You can then verify whether tuning changes make a difference or not. You may want,
for example, to collect sar data for a week and create reports: 8 a.m.–5 p.m. Monday–Friday
if that is prime time for random I/O; 6 p.m.–6 a.m. Saturday–Sunday if those are batch/backup
windows.
7.3.3 vmstat
The vmstat utility is a useful tool for taking a quick snapshot or overview of the system’s
performance. It is easy to see what is happening to the CPU, paging, swapping, interrupts, I/O
wait, and much more. There are several reports that vmstat can provide. These reports vary
slightly between the different versions of UNIX. Some of the I/O-related system information
can be gathered by entering the following options:
vmstat scdisk13 scdisk14 To display a summary of the statistics since boot including
statistics for logical disks scdisk13 and scdisk14
Tip: vmstat presents an average-since-boot on the first line. When running vmstat over an
interval, just disregard the first line of the vmstat output.
An example of vmstat output (over an interval of 2 seconds with a count of 5) for Sun Solaris
is shown in Example 7-7.
HP-UX has similar vmstat output as shown in Example 7-8 on page 207. Notice that with the
-d flag, you can see transfer statistics for disks.
Disk Transfers
device xfer/sec
c0t6d0 0
The vmstat output for HP-UX, AIX, and Sun Solaris is similar on all three platforms. Some
important fields are:
r - runque       Shows the number of tasks waiting for CPU resources.
b - blocked      Indicates processes are waiting on a resource, usually I/O related.
pi - page in     Page-ins from paging space indicate a shortage of free memory and
                 that swapping is occurring. Swapping activity can incur I/O costs.
us - user CPU    Shows the amount of CPU used by user application code.
sy - system CPU  Shows the percent of CPU being used to service the operating
                 system.
id - idle        The percent of CPU that is idle.
wa - wait        The percent of time the CPUs are idle, waiting on I/O to complete (AIX
                 only).
vmstat reports are vital in determining what is happening to the system on a real-time basis.
Signs of an I/O problem include:
High I/O wait percent (AIX includes this information in vmstat output in the wa column),
which indicates that a majority of the CPU cycles are waiting for I/O operations to complete.
This value has been creeping up as CPU performance has outpaced storage performance. To
put this in perspective, it is common these days for the CPU to tick off over 10,000,000 cycles
waiting for one I/O.
High number of blocked processes. This normally indicates that a lot of
processes are waiting on a single resource; usually it is I/O related.
High paging space paging rate, which indicates an overload on the system memory.
High number of page faults, which could mean that the system is not making efficient use
of memory for caching files.
The vmstat command is only the first step to look for performance problems. It gives an
indication of where the performance problem could be located. With this in mind, choose a
resource-specific command and take a deeper look into the system behavior.
The topas and nmon tools are very thorough, providing an overall view of system
performance including such performance statistics as CPU busy, memory usage, disk I/O,
adapter I/O, top processes, and paging activity. The filemon and lvmstat tools look at I/O
performance in more detail and can be used to see which applications and file systems a host
spends the most time handling I/O for.
The nmon tool is especially good for monitoring DS6000 activity, because it can report iostats
based on either:
hdisks
vpaths
Ranks
Adapter statistics including SCSI and Fibre Channel adapters
7.4.1 topas
The interactive AIX tool, topas, is convenient if you want to get a quick overall view of the
system’s current activity. A fast snapshot of memory usage or user activity can be a helpful
starting point for further investigation. However, topas is of very limited use as a diagnostic
tool, when you are dealing with a large number of logical disks on a DS6000 since it reports
I/O on hdisks. Example 7-10 contains a sample topas output.
For monitoring DS6000 I/O on AIX hosts, we recommend the use of another tool called nmon,
which is discussed in the next section.
7.4.2 nmon
The nmon tool and analyzer for AIX and Linux is a great storage performance analysis
resource, and it is free! It is written by Nigel Griffiths who works for IBM in the United
Kingdom. This is one of the tools we use when performing customer benchmarks. It is
available at:
http://www.ibm.com/developerworks/eserver/articles/analyze_aix/
Note: The nmon tool is not formally supported. No warranty is given or implied, and you
cannot obtain help or maintenance from IBM.
nmon currently comes in two versions for running on different levels of AIX:
nmon version 10 for AIX 5L™
nmon version 9 for AIX 4.X. This version is functionally stabilized and will not be
developed further.
The interactive nmon tool is very similar to monitor or topas, which you may have used before
to monitor AIX, but it offers many more features that are useful for monitoring DS6000
performance. We will explore these interactive options below.
Unlike topas, the nmon tool can also record data that can be used to establish a baseline of
performance for comparison later. Recorded data can be saved in a file and imported into the
nmon analyzer (spreadsheet format) for easy analysis and graphing.
The different options you can select when running nmon version 10 are shown in
Example 7-11.
Then start nmon with the -g flag to point to the map file:
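For example, assuming a hypothetical map file named /tmp/diskgroups:
nmon -g /tmp/diskgroups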
When nmon starts, press the G key to view stats for your disk groups. An example of the output
is shown in Example 7-15.
Notice that:
nmon reports real-time iostats for the different disk groups.
In this case, the disk groups we created are for volume groups.
You can create logical groupings of hdisks for any kind of group you like.
You can make multiple disk-group map files and start nmon -g <map-file> to report on
different groups.
To enable nmon to report iostats based on Ranks, you can make a disk-group map file listing
Ranks with the associated hdisk members.
Use the SDD command datapath query essmap to provide a view of your host system’s
logical configuration on the DS6000. You could, for example, create an nmon disk
group by storage type, LSS, Rank, port, and so on, to give you unique views
into your storage performance.
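As a sketch, a disk-group map file derived from datapath query essmap output might look
like the following (the Rank IDs and hdisk names are hypothetical; each line gives a group
name followed by its member hdisks):
R0500 hdisk4 hdisk8 hdisk12 hdisk16
R0501 hdisk5 hdisk9 hdisk13 hdisk17
Starting nmon with -g pointing at this file and pressing G then reports iostats aggregated
per Rank.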
Recording nmon information for import into the nmon analyzer tool
A great benefit nmon provides is the ability to collect data over time to a file and then just
import the file into the nmon analyzer tool, which can be found at:
http://www.ibm.com/developerworks/eserver/articles/analyze_aix/
To collect nmon data in comma-separated format for easy spreadsheet import, do the
following:
1. Run nmon with the -f flag. See nmon -h for the details, but as an example, to run nmon for an
hour capturing data snapshots every 30 seconds, use:
nmon -f -s 30 -c 120
2. This will create the output file in the current directory called:
<hostname>_date_time.nmon
Many spreadsheets have fixed numbers of columns and rows. We suggest you collect a
maximum of 300 snapshots to avoid hitting these issues.
7.4.3 filemon
The filemon command monitors a trace of file system and I/O system events, and reports
performance statistics for files, virtual memory segments, logical volumes, and physical
volumes. The filemon command is useful to those whose applications are believed to be
disk-bound, and want to know where and why.
The filemon command provides a quick test to determine if there is an I/O problem by
measuring the I/O service times for reads and writes at the disk and logical volume level.
The filemon command resides in /usr/bin and is part of the bos.perf.tools file set, which can
be installed from the AIX base installation media.
filemon syntax
The syntax of the filemon command is as follows:
filemon [-d] [-i Trace_File -n Gennames_File] [-o File] [-O Levels] [-P] [-T n] [-u] [-v]
Flags:
-i Trace_File
Reads the I/O trace data from the specified Trace_File, instead of from the real-time trace
process. The filemon report summarizes the I/O activity for the system and period
represented by the trace file. The -n option must also be specified.
-n Gennames_File
Specifies a Gennames_File for offline trace processing. This file is created by running the
gennames command and redirecting the output to a file as follows (the -i option must also
be specified): gennames >file.
-o File
Writes the I/O activity report to the specified file instead of to the stdout file.
-d
Starts the filemon command, but defers tracing until the trcon command has been
executed by the user. By default, tracing is started immediately.
-T n
Sets the kernel’s trace buffer size to n bytes. The default size is 32,000 bytes. The buffer
size can be increased to accommodate larger bursts of events (a typical event record size
is 30 bytes).
-P
Pins monitor process in memory. The -P flag causes the filemon command's text and data
pages to be pinned in memory for the duration of the monitoring period. This flag can be
used to ensure that the real-time filemon process is not paged out when running in a
memory constrained environment.
-v
Prints extra information in the report. The most significant effect of the -v flag is that all
logical files and all segments that were accessed are included in the I/O activity report,
instead of only the 20 most active files and segments.
filemon measurements
To provide a more complete understanding of file system performance for an application, the
filemon command monitors file and I/O activity at four levels:
Logical file system
The filemon command monitors logical I/O operations on logical files. The monitored
operations include all read, write, open, and seek system calls, which may or may not
result in actual physical I/O depending on whether the files are already buffered in
memory. I/O statistics are kept on a per-file basis.
Virtual memory system
The filemon command monitors physical I/O operations (that is, paging) between
segments and their images on disk. I/O statistics are kept on a per segment basis.
Logical volumes
The filemon command monitors I/O operations on logical volumes. I/O statistics are kept
on a per-logical volume basis.
Physical volumes
The filemon command monitors I/O operations on physical volumes. At this level, physical
resource utilizations are obtained. I/O statistics are kept on a per-physical volume basis.
filemon examples
A simple way to use filemon is to run the command shown in Example 7-16, which will:
Run filemon for 2 minutes and stop the trace.
Store output in /tmp/fmon.out.
Just collect logical volume and physical volume output
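A command sequence that accomplishes these three steps would look something like this
sketch (run as root on AIX):
filemon -o /tmp/fmon.out -O lv,pv
sleep 120
trcstop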
To produce some sample output for filemon, we ran a sequential write test in the background,
and started a filemon trace, as shown in Example 7-17. We used the lmktemp command to
create a 2 GB file full of nulls while filemon gathered I/O stats.
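If lmktemp is not available, a roughly equivalent load generator using standard dd would be
this sketch (the file system name is hypothetical):
dd if=/dev/zero of=/testfs/2GBfile bs=1024k count=2048 &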
In Example 7-18, we look at parts of the /tmp/fmon.out file. When analyzing the output from
filemon, focus in on:
Most active physical volume.
– Look for balanced I/O across disks.
– Lack of balance may be a data layout problem.
Look at I/O service times at physical volume layer.
– Writes to cache that average less than 2 ms are good. Writes averaging significantly and
consistently higher indicate that the write cache is full, and there is a bottleneck in the disk.
– Reads that average less than 10 ms to 20 ms are good. The disk subsystem read cache hit
rate affects this value considerably. Higher read cache hit rates will result in lower I/O
service times, often near 5 ms or less. If reads average greater than 15 ms, it can
indicate that something between the host and the disk is a bottleneck, though it usually
indicates a bottleneck in the disk subsystem.
– Look for consistent I/O service times across physical volumes. Inconsistent I/O service
times can indicate unbalanced I/O or a data layout problem.
– Longer I/O service times can be expected for I/Os that average greater than 64 KB in
size.
– Look at the difference between the I/O service times between the logical volume and
the physical volume layers. A significant difference indicates queuing or serialization in
the AIX I/O stack.
The fields in the filemon report are as follows:
util Utilization of the volume (fraction of time busy). The rows are sorted by
this field, in decreasing order. The first number, 1.00, means 100
percent.
description Contents of volume; either a file system name, or logical volume type
(jfs2, paging, jfslog, jfs2log, boot, or sysdump). Also indicates if the file
system is fragmented or compressed.
(Example 7-18 shows excerpts of /tmp/fmon.out, including the sections Detailed Logical
Volume Stats (512 byte blocks) and Detailed Physical Volume Stats (512 byte blocks);
intervening output is skipped.)
The filemon command is a very useful tool to determine where a host is spending I/O. More
details on the filemon options and reports are available in the publication AIX 5L
Performance Tools Handbook, SG24-6039, which can be downloaded from:
http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/SG246039.html
7.4.4 lvmstat
A new performance monitoring tool was introduced in AIX 5L called lvmstat, which reports
input and output statistics for logical partitions, logical volumes, and volume groups. The
lvmstat command is useful in determining the I/O rates to LVM volume groups, logical
volumes and logical partitions. This is useful for dealing with unbalanced I/O situations where
data layout was not considered initially.
The lvmstat command generates reports that can be used to change the logical volume
configuration to better balance the input and output load between physical disks.
lvmstat resides in /usr/sbin and is part of the bos.rte.lvm file set, which is installed by default
from the AIX 5L base installation media.
Flags:
-c Count prints only the specified number of lines of statistics.
-C Causes the counters that keep track of the iocnt, Kb_read, and
Kb_wrtn to be cleared for the specified logical volume or volume
group.
-d Specifies that statistics collection should be disabled for the logical
volume or volume group specified.
-e Specifies that statistics collection should be enabled for the logical
volume or volume group specified.
Parameters:
Name Specifies the logical volume or volume group name to monitor.
Interval The interval parameter specifies the amount of time, in seconds,
between each report. If Interval is used to run lvmstat more than
once, no reports are printed if the statistics did not change since the
last run. A single period is printed instead.
The first report section generated by lvmstat provides statistics concerning the time since the
statistical collection was enabled. Each subsequent report section covers the time since the
previous report. All statistics are reported each time lvmstat runs. The report consists of a
header row, followed by a line of statistics for each logical partition or logical volume
depending on the flags specified.
If the statistics collection has not been enabled for the volume group or logical volume you
want to monitor, lvmstat will report an error like:
#lvmstat -v rootvg
0516-1309 lvmstat:Statistics collection is not enabled for this logical device.
Use -e option to enable.
To enable statistics collection for all logical volumes in a volume group (in this case the rootvg
volume group), use the -e option together with the -v <volume group> flag as the following
example shows:
#lvmstat -v rootvg -e
When you do not need to continue collecting statistics with lvmstat, it should be disabled
because it impacts the performance of the system. To disable statistics collection for all
logical volumes in a volume group (in this case the rootvg volume group), use the -d option
together with the -v <volume group> flag as the following example shows:
#lvmstat -v rootvg -d
This will disable the collection of statistics on all logical volumes in the volume group.
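Putting it together, a typical lvmstat session might look like the following sketch (the
interval and count values are arbitrary):
lvmstat -v rootvg -e       (enable statistics collection)
lvmstat -v rootvg 60 5     (report every 60 seconds, five times)
lvmstat -v rootvg -d       (disable collection when finished)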
The lvmstat tool has powerful options such as reporting on a specific logical volume, or only
reporting busy logical volumes in a volume group. For more information about using the
lvmstat command and other tuning commands in detail, check the publication AIX 5L
Performance Tools Handbook, SG24-6039, which can be downloaded from:
http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/SG246039.html
There are some commands SDD provides that are specific for each platform, and we will
cover some of the AIX, HP-UX, and Sun Solaris SDD commands here. All three platforms
have the useful SDD command datapath query available for use.
A summary of the SDD commands and the different operating system platforms that they are
available for is shown in Table 7-1 on page 219.
Command          AIX    HP-UX    Sun Solaris
addpaths         X
cfallvpath       X
chgvpath                X
ckvpath                 X
datapath         X      X        X
defvpath                X        X
dpovgfix         X
extendvg4vp      X
get_root_disks          X        X
gettrace                X
hd2vp            X      X
lquerypr         X
lsvpcfg          X
mkvg4vp          X
pathtest         X      X        X
querysn          X      X
restvg4vp        X
rmvpath                 X        X
savevg4vp        X
showvpath               X        X
vp2hd            X      X
vpathmkdev                       X
addpaths Dynamically adds paths to SDD devices while they are in the Available state.
dpovgfix Fixes a SDD volume group that has mixed vpath and hdisk physical volumes.
hd2vp The SDD script that converts a DS6000 hdisk device volume group to a
Subsystem Device Driver vpath device volume group.
vp2hd The SDD script that converts a SDD vpath device volume group to a
DS6000 hdisk device volume group.
querysn The SDD driver tool to query unique serial numbers of DS6000 devices. This is
used to exclude certain LUNs from SDD, e.g., boot disks.
savevg4vp Backs up all files belonging to a specified volume group with SDD devices.
cfallvpath Fast-path configuration method to configure the SDD pseudo-parent dpo and all
SDD vpath devices.
restvg4vp Restores all files belonging to a specified volume group with SDD devices.
addpaths
In a SAN environment, where servers are attached to SAN switches, the paths from the
server to the DS6000 are controlled by zones created with the SAN switch software. You may
want to add a new path and remove another for planned maintenance on the DS6000 or for
proper load balancing. You can take advantage of the addpaths command to make the
changes live.
lsvpcfg
To display which DS6000 vpath devices are available to provide fail over protection, run the
lsvpcfg command. You will see output similar to that shown in Example 7-20.
Notice in the example that vpath0, vpath1, and vpath2 all have a single path (hdisk device)
and, therefore, will not provide fail over protection because there is no alternate path to the
LUN.
The command lsvg -p vpathvg lists the physical volumes making up the volume group
vpathvg. Notice that hdisk46 is listed among the other vpath devices. This is not correct for fail
over and load balancing, because access to the DS6000 logical disk with serial number
02DFA067 is using a single path hdisk46 instead of vpath11. The system is operating in a
mixed-mode with vpath pseudo devices and partially uses hdisk devices.
lsvg -p vpathvg
vpathvg:
PV_NAME PV STATE TOTAL PPs FREE PPs FREE DISTRIBUTION
vpath10 active 29 4 00..00..00..00..04
hdisk46 active 29 4 00..00..00..00..04 ! MIXED MODE- HDISKs and VPATHS !
vpath12 active 29 4 00..00..00..00..04
vpath13 active 29 28 06..05..05..06..06
To fix this problem, run the command dpovgfix volume_group_name. Then re-run the lsvpcfg
or lsvg command to verify.
Note: In order for the dpovgfix shell script to be executed, all mounted file systems of this
volume group have to be unmounted. After successful completion of the dpovgfix shell
script, mount the file systems again.
These two conversion programs (hd2vp and vp2hd) require that a volume group contain either
all original DS6000 hdisks or all SDD vpaths. The program fails if a volume group contains
both kinds of device special files (a mixed volume group). You may need to use dpovgfix
first to fix a volume group so that it contains only one kind of device.
The SDD driver will automatically exclude any DS6000 devices from the SDD configuration, if
these DS6000 boot devices are the physical volumes of an active rootvg.
Tip: If you require dual or multiple boot capabilities on a server, and multiple operating
systems are installed on multiple DS6000 boot devices, you should use the querysn
command to manually exclude all DS6000 boot devices that belong to multiple non-active
rootvg volume groups on the server.
SDD V1.3.3.3 allows you to manually exclude DS6000 devices from the SDD configuration.
The querysn command reads the unique serial number of a DS6000 device (hdisk) and saves
the serial number in an exclude file, /etc/vpexclude.
During the SDD configuration, SDD configure methods read all the serial numbers in this
exclude file and exclude these DS6000 devices from the SDD configuration.
The exclude file, /etc/vpexclude, holds the serial numbers of all inactive DS6000 devices
(hdisks) in the system. If an exclude file exists, the querysn command will add the excluded
serial number to that file. If no exclude file exists, the querysn command will create one. There
is no user interface to this file.
Tip: You should not use the querysn command on the same logical device multiple times.
Using the querysn command on the same logical device multiple times results in duplicate
entries in the /etc/vpexclude file, and the system administrator will have to administer the
file and its content.
The benefit is multipathing to your paging spaces. All the same commands for
hdisk-based volume groups apply when using vpath-based volume groups for paging spaces.
Important: IBM does not recommend moving the primary paging space out of rootvg.
Doing so may mean that no paging space is available during the system startup. Do not
redefine your primary paging space using vpath devices.
lquerypr
The lquerypr command implements certain SCSI-3 persistent reservation commands on a
device. The device can be either an hdisk or an SDD vpath device. This command supports
the following persistent reserve service actions: read reservation key, release persistent
reservation, preempt-abort persistent reservation, and clear persistent reservation.
Flags:
-p    If the persistent reservation key on the device is different from the
      current host reservation key, it preempts the persistent reservation key
      on the device.
-c    If there is a persistent reservation key on the device, it removes any
      persistent reservation and clears all reservation key registrations on the
      device.
-r    Removes the persistent reservation key on the device made by this
      host.
-v    Displays the persistent reservation key if it exists on the device.
-V    Verbose mode. Prints detailed messages.
This command queries the persistent reservation on the device. If there is a persistent
reserve on a disk, it returns 0 if the device is reserved by the current host. It returns 1 if the
device is reserved by another host. Caution must be taken with the command, especially
when implementing preempt-abort or clear persistent reserve service action. With
preempt-abort service action not only the current persistent reserve key is preempted; it also
aborts tasks on the LUN that originated from the initiators that are registered with the
preempted key. With clear service action, both persistent reservation and reservation key
registrations are cleared from the device or LUN.
This command is useful if a disk was attached to one system and was not varied off, leaving
SCSI reserves on the disk and preventing another system from accessing it.
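As a sketch (the device name is hypothetical), you could first display a leftover
reservation and then release one held by this host:
lquerypr -vh /dev/vpath0     (display the persistent reservation key, if any)
lquerypr -rh /dev/vpath0     (release the reservation made by this host)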
It is a good idea to check periodically to make sure none of the volume groups are using
hdisks instead of vpaths. You can verify the path status several ways. Some commands are:
lspv (look for hdisk with volume group names listed)
lsvpcfg
lsvg -p <vgname>
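For the lspv check, a quick filter such as this sketch lists the hdisks that belong to a
volume group (on a healthy SDD configuration, only vpaths should carry volume group names):
lspv | grep hdisk | grep -v None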
Remember to change any scripts you may have that call savevg or restvg and change the
calls to savevg4vp and restvg4vp.
rmvpath [-all, -vpathname] Removes SDD vpath devices from the configuration.
showvpath
The showvpath command for HP-UX is similar to the lsvpcfg command for AIX. Use
showvpath to verify that an HP-UX vpath is using multiple paths to the DS6000. An example of
the output from showvpath is displayed in Example 7-22.
Notice that vpath1 in the example has four paths to the DS6000. vpath2, however, has a
single point of failure since it is only using a single path.
Tip: You can use the output from showvpath to modify iostat or sar information to report
stats based on vpaths instead of hdisks. Gather iostats to a file, and then replace the disk
names with the corresponding vpaths.
On Sun Solaris, SDD resides above the Sun SCSI disk driver (sd) in the protocol stack. For
more information about how SDD works, refer to 5.6, “Subsystem Device Driver (SDD) -
multipathing” on page 157. SDD is supported for the DS6000 on Solaris 8/9.
Some specific commands SDD provides to Sun Solaris are listed below as well as the steps
to update SDD after making DS6000 logical disk configuration changes for a Sun server.
cfgvpath
The cfgvpath command configures vpath devices using the following process:
Scan the host system to find all DS6000 devices (LUNs) that are accessible by the Sun
host.
Determine which DS6000 devices (LUNs) are the same devices that are accessible
through different paths.
Create configuration file /etc/vpath.cfg to save the information about DS6000 devices.
With the -c option: cfgvpath exits without initializing the SDD driver. The SDD driver will be
initialized after reboot. This option is used to reconfigure SDD after a hardware
reconfiguration.
Without the -c option: cfgvpath initializes the SDD device driver vpathdd with the
information stored in /etc/vpath.cfg and creates pseudo-vpath devices
/devices/pseudo/vpathdd*.
vpathmkdev
The vpathmkdev command creates files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories
by creating links to the pseudo-vpath devices /devices/pseudo/vpathdd*, which are created by
the SDD driver.
Files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories provide block and character access
to an application the same way as the cxtydzsn devices created by the system. The
vpathmkdev command is executed automatically during SDD package installation and should
be executed manually to update files vpathMsN after hardware reconfiguration.
showvpath
The showvpath command lists all SDD devices and their underlying disks. An example of the
showvpath command is displayed in Example 7-23.
Tip: Note that you can use the output from showvpath to modify iostat or sar information
to report stats based on vpaths instead of hdisks. Gather iostats to a file, and then replace
the disk device names with the corresponding vpaths.
For specific information about SDD commands, check IBM TotalStorage Multipath Subsystem
Device Driver User’s Guide, SC30-4096.
See Chapter 12, “Understanding your workload” on page 407 for an understanding of
workloads.
The UNIX dd command is a great tool to drive sequential read workloads or sequential write
workloads against the DS6000. It will be rare that you can actually drive the DS6000 at the
maximum data rates that you see in published performance benchmarks. But, once you
understand how your total configuration (for instance, a DS6000 attached with 4 SDD paths
through two SANs to your host with 2 HBAs, 4 CPUs, and 1 GB of memory) performs against
certain dd commands, you will have a baseline from which you can compare things like
operating system kernel parameter changes or different logical volume striping techniques in
order to improve performance.
While running the dd command in one host session, we recommend you use the UNIX
commands and shell scripts presented earlier in this chapter. We will assume that, at a
minimum, you will have the AIX nmon tool running with the c, a, e, and d features turned on.
Below, we will be running lots of different kinds of dd commands. If, at any time, you want to
make sure there are no dd processes running on your system, execute the following
kill-grep-awk command:
kill -kill `ps -ef | grep dd | awk '{ print $2 }'`
Caution: Use extreme caution when using the dd command to perform a sequential write
operation. Ensure the dd is not writing to a device file that is part of the UNIX operating
system.
7.7.1 Using the dd command to test sequential Rank reads and writes
To test the sequential read speed of a Rank, you can run the command:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
If you determine that the average read speed for your vpaths is, for example, 50 MB/s (the
command above reads about 100 MB, so an elapsed time of 2 seconds equates to 50 MB/s), then
you know you need to stripe your future logical volumes across at least 4 different Ranks to
achieve 200 MB/s sequential read speeds.
Let’s explore the dd command some more. Issue the following command:
dd if=/dev/rvpath0 of=/dev/null bs=128k
Your nmon monitor (the e option) should report that the above command has imposed a
sustained 100 MB/s bandwidth with a block size=128k on vpath0. Notice the xfers/sec
column; xfers/sec is IOPS. Now, if your dd command has not already errored out because it
reached the end of the disk, hit <ctrl-C> to stop the process. nmon reports idle. Next issue the
following dd command with a 4 KB block size and put it in the background:
dd if=/dev/rvpath0 of=/dev/null bs=4k &
For the above command, nmon should report a lower MB/s but a higher IOPS. That is the
nature of I/O as a function of block size. Use the above kill-grep-awk command to clear out
all the dd processes from your system. Try your dd sequential read command with bs=1024k
and you should see a high MB/s but a reduced IOPS. Now start several of these commands
and watch your throughput increase until it reaches a plateau - something in your
configuration (CPUs? HBAs? a DS6000 Rank?) has become a bottleneck. This is as fast as
your hardware configuration can perform sequential reads for a specific block size. The
kill-grep-awk script will clear everything out of the process table for you. Try loading up
another raw vpath device (vpath1) device. Watch the performance of your HBAs (nmon a
option) approach 200 MB/second.
You can perform the same kinds of tests against the block vpath device, vpath0. What is
interesting here is that you will always observe the same I/O characteristics, no matter what
block size you specify. That is because, in AIX anyway, the Logical Volume Manager breaks
everything up into 4 KB blocks for both reads and writes. Run the following two commands separately.
nmon should report about the same for both:
dd if=/dev/vpath0 of=/dev/null bs=128k
dd if=/dev/vpath0 of=/dev/null bs=4k
Use caution when using the dd command to test sequential writes. If LUNs have been
incorporated into the operating system using logical volume manager (LVM) commands, and
the dd command is used to write to the LUNs, they won’t be part of the operating system
anymore, and the operating system will not like that one bit. For example, if you want to write
to a vpath, that vpath should not be part of a LVM volume group. And if you want to write to a
LVM logical volume, it should not have a file system on it and if the logical volume has a
logical volume control block (LVCB), you should skip over the LVCB when writing to the logical
volume. It is possible to create a logical volume without a LVCB by using the mklv -T O option.
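For instance, a raw logical volume for dd write testing could be created without an LVCB as
in this sketch (the name and size are hypothetical):
mklv -T O -y ddtest_lv 6000vg 8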
Try different block sizes, different raw vpath devices, combinations of reads and writes. Run
the commands against the block device (/dev/vpath0) and notice that block size does not
affect performance.
2. The next thing to do is to run sequential reads and writes to all of the vpath devices (raw or
block) for about an hour. Use the commands discussed in 7.7.1, “Using the dd command
to test sequential Rank reads and writes” on page 227. Then take a look at your SAN
infrastructure to see how it is doing.
Look at the UNIX error report. Problems will show up as storage errors, disk errors, or
adapter errors. If there are problems, they will not be hard to find in the error report - there
will be a lot of them. Troubleshooting at this stage can be fun. The source of the problem
could be hardware problems on the storage side of the SAN, Fibre Channel cables or
connections, or down-level device drivers or device (HBA) microcode. If you see something
like the errors shown in Example 7-26, stop and get them fixed.
Ensure that, after running an hour’s worth of dd commands on all your vpaths, there are
no storage errors in the UNIX error report.
3. Next issue the following command to see if SDD is correctly load balancing across paths
to the LUNs:
datapath query device
Output from this command will look like Example 7-27.
Total Devices : 16
Check to make sure, for every LUN, the counters under the Select column are the same
and that there are no errors.
4. The next thing to do is spot check the sequential read speed of the raw vpath device. The
following command is an example of the command run against a LUN called vpath0. For
the LUNs you test, ensure they each yield the same results.
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
Tip: For the dd command above, the first time it is run against rvpath0, the I/O must be
read from disk and staged to the DS6000 cache. The second time it is run, the I/O is
already in cache. Notice the shorter read time when we get an I/O cache hit.
Of course, if any of these LUNs are on Ranks that are also being used by another
application, you should see a variation in the throughput. If there is a large variation in the
throughput, perhaps that LUN should be given back to the storage administrator; trade for
another one. You want all your LUNs to have the same performance.
If everything looks good, then continue with the configuration of volume groups and logical
volumes.
For HP-UX, use the prealloc command instead of lmktemp for AIX to create large files. For
Sun Solaris, use the mkfile command.
In the following sections, we will explore the considerations associated with creating volume
groups, logical volumes, and file systems. We will use the AIX Logical Volume Manager (LVM)
to create examples, but the topics discussed are applicable to all UNIX platforms.
Remember the recommended method discussed in 7.2.1, “I/O balanced across Extent Pools”
on page 191? We build upon that concept as we continue with the LVM configuration. So at
this point, the LUNs have already been created in the DS6000 and assigned to your host
system. The following sections explore what we need to do next.
Note: For AIX 5.2 and beyond, JFS2 and the 64 bit kernel are recommended. Note that the
nointegrity filesystem mount option is not supported in JFS2.
When creating the volume group, there are LVM limits to consider along with potential
expansion of the volume group. The main LVM limits for a volume group are shown in
Table 7-5.
To create the volume group, if you are using SDD, you use the mkvg4vp command. And if you
are using SDDPCM, you use the mkvg command. All the flags for the mkvg command apply to
the mkvg4vp command.
We recommend using the smallest physical partition (PP) size you can that will allow for
growth. A physical partition size in the 8 MB - 32 MB range is a good starting point for
planning. This will minimize wasted space and allow for expansion of the volume group later.
One reason to keep the physical partition size small is because, when creating an inter-disk
logical volume, the smallest logical volume that should be created is the product of multiplying
the physical partition size by the number of LUNs in the volume group. This is the minimum
unit of allocation (MUA). All of the logical volumes you create should be multiples of this MUA.
If the physical partition size is large, and the number of LUNs is large, then the minimum
logical volume that would result would be large. This situation could lead to wasted space if
many of the logical volumes for your application do not need to be at least as large as the
MUA.
Consider an example where we have one 100 GB LUN called vpath4. The following
command will create a volume group called 6000vg with a 128 MB physical partition size.
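A sketch of such a command (mkvg4vp chooses the physical partition size automatically when
-s is not specified):
mkvg4vp -y 6000vg vpath4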
If you look at the characteristics of the new volume group, you will see the output similar to
Example 7-28.
The above mkvg4vp command chose a physical partition size of 128 MB for two reasons: it
keeps the number of physical partitions per physical volume at 1016 or less, and it keeps
the maximum number of physical volumes for this volume group at no more than 32.
In the design of the AIX Logical Volume Manager (LVM), each logical partition maps to a
physical partition (except when using OS mirroring when one logical partition maps to two or
three physical partitions). Each physical partition maps to a number of disk sectors. The
design of LVM limits the number of physical partitions that LVM can track per disk to 1016. In
most cases, not all of the possible 1016 tracking partitions are used by a disk. In the above
example, (102272 / 128 =) 799 partitions are used.
In Example 7-29, notice the physical partition size was brought down into a more desirable
range, but at the expense of the number of LUNs allowed in the volume group! From
Table 7-6, notice that the t factor for this situation is 4. The relationship is that the maximum
number of physical volumes that can be included in a volume group will be reduced to (MAX
PVs / t factor). The t factor is between 1 and 16 for standard volume groups and it is between
1 and 64 for big volume groups. The manual page for the mkvg and the chvg commands talks
more about the t factor.
Also notice in Example 7-29 that the t factor rule will allow 8 LUNs of 100 GB into the
standard volume group. There is also a limit on the maximum number of logical partitions in a
logical volume - that limit is always 32,512 as shown in Table 7-5 on page 234.
t factor    Max PPs per PV    Max PVs (standard VG)    Max PVs (big VG)
1           1016              32                       128
2           2032              16                       64
4           4064              8                        32
8           8128              4                        16
16          16256             2                        8
32          32512             -                        4
64          65024             -                        2
Remember the recommended method described in 7.2.1, “I/O balanced across Extent Pools”
on page 191? It basically states that one LUN from each Rank (Array set) should be assigned
to your host system initially. Likewise, when planning for future expansion of your volume
group, it is reasonable to plan for adding another Array set to your volume group. But you may
not be able to add another Array set to the volume group because of LVM limitations, unless
you have planned accordingly. For example, 16 or more LUNs (an Array set) could be initially
assigned to a host server. If the physical partition size of these LUNs had to be reduced using
the t factor, then the AIX standard volume group could not grow by the addition of a second
Array set because the total size of the volume group would have been reduced below 32 by
the t factor. This is what we were referring to earlier in this section, when we recommended
that the smallest physical partition size should be used that would allow for growth. It is also a
good idea to always keep a couple of free PPs on each disk in the volume group so it can be
changed from a standard volume group to a big volume group, or to a scalable volume group
- this allows expansion of the VGDA structure on each disk in the volume group. It is
necessary for the VGDA to grow to convert from a standard volume group to a big or a
scalable volume group, or from a big to a scalable volume group.
Note:
Use the smallest physical partition size that will allow for growth of your volume group.
Always keep a couple of free PPs on each disk in the volume group.
It is obvious that the AIX standard volume group has limitations that could quickly become an
issue with large storage allocations from the DS6000. We recommend creating AIX big
volume groups to manage DS6000 LUNs. As Table 7-5 on page 234 shows, a big volume
group can accommodate up to 128 physical volumes. To create a big volume group, use the
following command format:
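A sketch of such a command, combining the -B (big volume group) flag with a t factor of 4:
mkvg4vp -B -t 4 -y 6000vg vpath4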
We are still working with 100 GB LUNs. The above command uses a t factor of 4 and yields
the results shown in Example 7-30:
If the volume group might grow beyond 128 disks, use the mkvg4vp -G option. Be aware that
because the volume group descriptor area (VGDA) is increased substantially in big volume
groups, you can expect VGDA update operations (creating a logical volume, changing a
logical volume, adding a physical volume, etc.) to take longer on big volume groups than it
takes in a standard volume group.
We have been considering large LUNs to test the limits of the AIX LVM and volume groups.
But large LUNs on single Ranks has nothing to do with the recommended method. Let’s
explore another example using eight LUNs that are 68 GB each.
mkvg4vp -B -y 6000vg vpath8 vpath9 vpath10 vpath11 vpath12 vpath13 vpath14 vpath15
The above command creates the volume group shown in Example 7-31.
Let us get the physical partition size down to, say, 16 MB and see what this volume group
looks like. Imposing a physical partition size and invoking the t factor, the command looks like:
mkvg4vp -L 256 -B -f -s 16 -y 6000vg vpath8 vpath9 vpath10 vpath11 vpath12 vpath13 vpath14
vpath15
The new volume group has a good physical partition size and room for three more sets of
LUNs, so there is plenty of room for growth. What more could you want out of a volume
group? Only that the LTG size did not change to 256 KB; we could not get this to work from
the command line.
However, the following command will fix this:
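A likely form, based on the chvg manual (the volume group name is assumed):

chvg -L 256 6000vg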
With AIX 5.3, the LTG size is set automatically, so this use of the chvg or mkvg commands
does not apply.
For randomly accessed logical volumes, we recommend using the maximum inter-policy,
which is also referred to as the inter-disk policy. To create logical volumes that use this
policy, use the -e x flag on the mklv command.
Note: We recommend the maximum inter-policy logical volume for randomly accessed
data. This logical volume is created using the -e x flag of the mklv command. Or in smit,
set the RANGE of physical volumes to maximum and specify all the vpaths in the volume
group.
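As a minimal sketch (the logical volume name, size in logical partitions, and volume group
are illustrative), a maximum inter-disk policy logical volume can be created like this:

mklv -e x -y randomlv 6000vg 640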
The DS6000 is capable of exceptional throughput for the different types of I/O. Understanding
your I/O workload characteristics (see Chapter 12, “Understanding your workload” on
page 407) will allow you to further maximize the performance gains you obtain from the
DS6000. In order to choose the best type of logical volumes to create, it will be necessary to
know which of the following is more predominant in each of your applications:
Lots of small random I/O operations
Large sequential I/O operations
A combination of the above
Figure: logical disk vpath0 is divided into 16 MB physical partitions (pp1 through pp500);
eight of these partitions (lp1 through lp8) form the 128 MB logical volume
/dev/non_striped_lv.
In this example, the logical volume manager (LVM) of the host operating system has created
a volume group which has divided logical disk vpath0 into 16 MB physical partitions. A
non-striped logical volume is simply a logical grouping of eight of these partitions to
create a 128 MB logical volume called /dev/non_striped_lv.
A non-striped logical volume would, however, be the kind of logical volume you would create
for applications like DB2 that use the concept of containers within logical volumes. This is
because DB2 can do application striping across containers.
Note: Consider using logical volumes of this type for DB2, which can randomize I/O across
containers.
Figure 7-6 shows an example of the inter-disk policy logical volume. The LVM has created a
volume group containing four LUNs and has created 16 MB physical partitions on the LUNs.
The logical volume in this example is a group of 16 MB physical partitions from four different
logical disks—vpath0, vpath1, vpath2, and vpath3.
Note: We recommend using the inter-disk logical volume for random access workloads.
vpath0, vpath1, vpath2, and vpath3 are hardware-striped LUNs on different DS6000 Extent Pools
8 GB / 16 MB partitions = 500 physical partitions per LUN (pp1–pp500)
/dev/inter-disk_lv is made up of 8 logical partitions
(lp1 + lp2 + lp3 + lp4 + lp5 + lp6 + lp7 + lp8) = 8 × 16 MB = 128 MB
For a discussion of striped file systems, it is necessary to first define a few terms:
On each DDM within a DS6000, the RAID 5 controllers create 256 KB strips on each DDM
in the Array. These strips are used by the RAID 5 hardware to create RAID 5 stripes. For a
6+P RAID 5 Array, the RAID 5 stripe is 1.5 MB. For a 7+P array, the RAID 5 stripe is 1.75
MB. We will always refer to this as a RAID 5 stripe.
With a RAID 5 Array, there is something called the RAID 5 write penalty. This is
experienced when I/O to the RAID array is smaller than the RAID 5 stripe size. The RAID 5
write penalty happens because, to do a logical write, the controller must (1) read the data
that is being overwritten, (2) read the associated parity, calculate the new parity
information using the data that is being written, then (3) write the new data and (4) write
the new parity. This causes a total of four disk I/Os per host write. I/O writes exceeding
the RAID 5 stripe size will not suffer the RAID 5 write penalty. All the data, plus the
parity, is written at one time. This is called a full stripe write.
Let us assume that we, as systems administrators, have just received the storage
requirements for a new application that needs to be put into production. The new application
folks need 5 logical volumes that are 16 GB, 37.3 GB, 305 GB, 34 GB, and 47 GB. We know
that our volume group uses a physical partition size of 16 MB and contains 8 LUNs, so the
MUA is 128 MB. Below is a summary of the requirements in Table 7-7.
Requested size (GB)  Physical partitions to allocate
16                   1032
38                   2440
305                  19528
34                   2184
47                   3016
The equation to determine the number of physical partitions that should be specified is shown
below. It adds one MUA (8) for a little extra space and uses the physical partition size of 16
MB and the fact that there are eight LUNs that we are striping across.
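A form of the equation consistent with the values in Table 7-7 (16 MB partitions, eight
LUNs):

PPs = ((requested GB × 1024 MB) ÷ 16 MB), rounded up to a multiple of 8, plus 8

For the 305 GB request: (305 × 1024) ÷ 16 = 19520, which is already a multiple of 8, so
19520 + 8 = 19528 PPs.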
Notice, in the output of the lvmap.ksh 305glv command (a shell script; see Appendix B, “UNIX
shell scripts” on page 481), the balance in the distribution of the logical volume 305glv
across the LUNs: each of the eight vpaths holds 2441 logical partitions of this logical
volume.
Notice that /dev/striped_lv is also made up of eight 16 MB physical partitions, but each
partition is then subdivided into 64 chunks of 256 KB—only 3 of the 256 KB chunks are
shown per logical partition for space reasons. Most operating systems include a -S flag or
similar to create a striped logical volume.
vpath0, vpath1, vpath2, and vpath3 are hardware-striped LUNs on different DS6000 Extent Pools
8 GB / 16 MB partitions = 500 physical partitions per LUN (pp1–pp500)
Access to /dev/striped_lv has advantages and disadvantages for performance. For random
I/O, there is an increased chance of I/Os spanning strip boundaries, thus increasing the
physical IOPS required to complete the I/O. For sequential I/O, striped logical volumes have
the advantage that I/O throughput is typically higher than for maximum inter-policy logical
volumes, because more physical volumes do I/O in parallel. Striped logical volumes can also
cause increased disk subsystem read ahead, which is an advantage when the pre-fetched data
will in fact be accessed shortly, but a disadvantage when it will not, since it occupies
memory that might be better used for other I/O.
Before discussing striped logical volumes, a few more terms must be defined:
LVM stripe is the size of the stripe that is specified by the mklv -S command. We will refer
to this stripe as an LVM stripe, LV stripe size, or a stripe.
On each DDM within a DS6000, the RAID 5 controllers create a 256 KB strip across each
DDM in the Array. This strip is used by the RAID 5 hardware to create RAID 5 stripes. For
a 6+P RAID 5 Array, the RAID 5 data stripe is 1.5 MB. For a 7+P array, the RAID 5 data
stripe is 1.75 MB. We will always refer to this as a RAID 5 stripe.
AIX release  LTG sizes                        LVM stripe sizes
AIX 5.2      128 KB, 256 KB, 512 KB, 1 MB     4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB,
                                              256 KB, 512 KB, 1 MB
AIX 5.3      128 KB, 256 KB, 512 KB, 1 MB,    4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB,
             2 MB, 4 MB, 16 MB                256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 16 MB,
                                              32 MB, 64 MB, 128 MB
Striping Considerations:
Only use LVM striped logical volumes when the logical volume must be sequentially
accessed and requires a high throughput rate. Otherwise, use the maximum inter-policy.
Striped logical volumes are good when data is accessed sequentially and high throughput
rates are required. For instance, consider the case where an application reads
single-threaded I/O from, say, a satellite. This example would be especially efficient if the I/O size matched
the full stripe on the DS6000 (I/O size = DS6000 strip size x DS6000 stripe width), or even
better if the I/O size matches the full stripe on the logical volume (I/O size = logical volume
strip size x logical volume stripe width). Also consider the case where a simple dd
command is used to characterize sequential reads and writes to raw devices.
Striping inhibits DS6000 read ahead algorithms for sequential I/O. When the DS6000
detects sequential reading from a LUN, it reads ahead and puts the data in read cache
which improves I/O service time. This is called a read hit. With a small strip size, it takes
two full logical volume stripes of I/O before the DS6000 realizes that sequential reading is
occurring from all the LUNs in the logical volume stripe.
A small stripe size is useful to spread the I/Os to very small structures across more disks.
In AIX 5.1 and AIX 5.2, it is not possible to dynamically increase the stripe width (the
number of physical volumes that the striped logical volume is striped across). To increase
the stripe width, data must be backed up and the striped logical volumes must be
recreated.
In AIX 5.3, the stripe width can be changed to multiples of the existing stripe width using a
technique called striped columns, though the LVM will fill the first set of disks (or columns)
before allocating data to the second set of disks (or columns).
{CCF-part2:root}/ -> mkvg4vp -L 256 -B -f -s 16 -y stripevg vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7
stripevg
{CCF-part2:root}/ -> lsvg stripevg
VOLUME GROUP: stripevg VG IDENTIFIER: 00e033c400004c000000010656cd3c3f
VG STATE: active PP SIZE: 16 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 5112 (81792 megabytes)
MAX LVs: 512 FREE PPs: 5112 (81792 megabytes)
LVs: 0 USED PPs: 0 (0 megabytes)
OPEN LVs: 0 QUORUM: 5
TOTAL PVs: 8 VG DESCRIPTORS: 8
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 8 AUTO ON: yes
MAX PPs per PV: 1016 MAX PVs: 128
LTG size: 128 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
Let us assume that we, as systems administrators, have just received the storage
requirements for a new application that needs to be put into production. The new application
folks need 3 logical volumes that are 16 GB, 37.3 GB, and 23 GB. They insist that the logical
volumes be striped, in spite of our recommendations to use inter-disk logical volumes. We
know that our volume group uses a physical partition size of 16 MB and contains eight LUNs,
so the MUA is 128 MB. See Table 7-9.
Requested size (GB)  Physical partitions to allocate
16                   1032
38                   2440
23                   1480
To create the striped logical volumes we used commands as shown in Example 7-36.
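Commands of roughly the following form would produce that result (the logical volume names
and the 256K stripe size are assumptions):

mklv -y 16glv -S 256K stripevg 1032 vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7
mklv -y 38glv -S 256K stripevg 2440 vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7
mklv -y 23glv -S 256K stripevg 1480 vpath0 vpath1 vpath2 vpath3 vpath4 vpath5 vpath6 vpath7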
Notice the balance in the distribution of the logical volume on the LUNs! To see this we used
the lvmap.ksh shell script in Appendix B, “UNIX shell scripts” on page 481.
Note: The use of INLINE jfs2 logs is preferred to outline (separate log device) logs. Inline
logs are created when the filesystem is created.
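A plausible form of that command (the logical volume and mount point are illustrative):

crfs -v jfs2 -d 16glv -m /data16 -A yes -a logname=INLINE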
In Example 7-37, we use the above command to create a file system and then we look at the
file system specifics.
When tuning the operating system, do one thing at a time and verify the I/O improvement at
every step of the way. Have a clear understanding of your current system settings before
making changes to the operating system.
Response time and throughput trade-offs exist that affect overall performance. Generally a
multi-user system will want to ensure good response time for users, while a system that runs
only batch jobs should be tuned for maximum throughput. The appropriate tuning parameters
are dependent upon the nature of the application, the number of CPUs, the amount of cache
in the DS6000, the write rate, and other factors. We suggest that you tune these values for
maximum disk throughput with reasonable response time for users.
The new tuning commands (vmo, ioo, and schedo) are part of the bos.perf.tune fileset in AIX.
They all use the same syntax and command options, and they manipulate files in the
/etc/tunables directory. Also, starting with AIX 5.2, SMIT provides full support for these
commands.
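As a brief sketch of the shared syntax (the tunables and values shown are illustrative
only):

# display the current value of a tunable
vmo -o minfree
# change a tunable on the running system
ioo -o maxpgahead=16
# change a tunable and persist it across reboots (recorded under /etc/tunables)
vmo -p -o minfree=960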
A full discussion of AIX tuning can be found in the AIX 5L Version 5.3 Performance
Management Guide which can be found by first selecting the AIX documentation link, then
selecting the Performance management and tuning link, at:
http://publib.boulder.ibm.com/infocenter/pseries/index.jsp
Function: Sets the number of frames on the free list at which page stealing is to stop.
Must be larger than minfree by a value at least as large as maxpgahead.
  JFS: maxfree (vmo). JFS2: maxfree (vmo). Dynamic: Yes.
Function: Sets the number of frames on the free list at which page stealing starts to
replenish the free list.
  JFS: minfree (vmo). JFS2: minfree (vmo). Dynamic: Yes.
Function: Sets a hard limit on memory for caching.
  JFS: maxperm with strict_maxperm (vmo). JFS2: maxclient with strict_maxclient (vmo);
  maxclient is always a hard limit. Dynamic: Yes.
Function: Sets the maximum pages used for sequential read ahead. Should be a power of 2 and
greater than or equal to minpgahead. Related to minfree and maxfree.
  JFS: maxpgahead (ioo). JFS2: j2_maxPageReadAhead (ioo). Dynamic: Yes.
Function: Sets the minimum pages used for sequential read ahead.
  JFS: minpgahead (ioo). JFS2: j2_minPageReadAhead (ioo). Dynamic: Yes.
Function: Sets the maximum number of pending I/Os to a file.
  JFS: chdev -l sys0 -a maxpout=value. JFS2: chdev -l sys0 -a maxpout=value. Dynamic: Yes.
Function: Sets the minimum number of pending I/Os to a file at which programs blocked by
maxpout may proceed.
  JFS: chdev -l sys0 -a minpout=value. JFS2: chdev -l sys0 -a minpout=value. Dynamic: Yes.
Function: Sets the amount of modified data cache for a file with random writes.
  JFS: maxrandwrt (ioo). JFS2: j2_maxRandomWrite (ioo). Dynamic: Yes.
Function: Controls the gathering of I/Os for sequential write behind.
  JFS: numclust (ioo). JFS2: j2_nPagesPerWriteBehindCluster (ioo), j2_nRandomCluster (ioo).
  Dynamic: Yes.
Function: Sets the number of file system bufstructs.
  JFS: numfsbufs (ioo). JFS2: j2_nBufferPerPagerDevice (ioo),
  j2_dynamicBufferPreallocation (ioo). Dynamic: mount option; requires a remount.
Note that there can be too much file system cache with systems having more than 24 GB of
RAM. Part of the time that the syncd daemon runs, interrupts are suspended and I/O is
halted. The syncd has to check all the file system cache to see if the data needs to be flushed
to disk. Reading large amounts of memory can take seconds, so too much cache can be bad
for I/O. We can use release behind filesystem mount options (rbr, rbw, and rbrw) to keep data
out of the file system cache that does not have to be there. We can even put a limit on file
system cache by setting maxperm, maxclient, and strict_maxperm values. Maxclient is a hard
limit, but maxperm is a soft limit unless strict_maxperm is set to 1. The downside of setting a
strict limit for maxperm is that it causes the page replacement algorithm (lrud) to run when
there is plenty of free memory in the system. So there is a trade-off here. Generally, prior to
AIX 5.3, you will want a hard limit for filesystem cache on systems with 24 GB of RAM and
up, depending on the system memory bandwidth and processor speed.
In AIX 5.3, I/O buffers can be tuned at the volume group level, rather than at the system
level. Use the lvmo -a command to view volume group level pbuf statistics. Tuning follows
the same principles as above.
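A sketch (the volume group name and value are illustrative):

# view pbuf statistics for all volume groups
lvmo -a
# raise the pbufs added per physical volume for one volume group
lvmo -v 6000vg -o pv_pbuf_count=1024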
Read ahead
Read ahead at the file system level detects that we are reading sequentially and puts the data
into filesystem cache before the application requests it. This is supposed to reduce the
amount of percent I/O wait (%iowait) or increases I/O throughput, as seen from the operating
system. Too much read ahead means you do I/O that you do not need. The VMM tunable
parameters that control read ahead are minpgahead and maxpgahead for JFS and
j2_minPageReadAhead and j2_maxPageReadAhead for JFS2. These parameters are related to
maxfree and are used to ensure sufficient memory is available for I/O and to ensure good
keyboard response times on systems with heavy I/O workloads.
Bear in mind that the DS6000 has algorithms that perform read ahead also. Sometimes these
algorithms work in harmony with the operating system read ahead parameters, and
sometimes they don’t.
I/O pacing
I/O pacing limits the number of write I/Os that can be outstanding to a file. When a process
exceeds the maxpout limit (high water mark) it is put to sleep until the number of outstanding
writes I/Os is less than minpout (low water mark). This allows another process to use the
CPU. Said another way, I/O pacing causes the CPU to stop performing I/O to a file after a
specified amount of time. This frees up the CPU to do something else. Turning I/O pacing off
(default) improves backup times and sequential throughput. Turning I/O pacing on ensures
that no process hogs the CPU for I/O. Typically, we recommend to leave I/O pacing turned off.
There are certain circumstances where it is appropriate to have I/O pacing turned on like if
you are using HACMP. If you turn it on, start with settings of maxpout=321 and minpout=240.
Also, with AIX 5.3, I/O pacing can be turned on at the file system level with the mount
command.
Write behind
This parameter is used to have the operating system initiate I/O that is normally controlled
by the syncd daemon when a specified number of sequential 16 KB clusters are updated. The
parameters for write behind are listed below; a usage sketch follows the list:
Sequential write behind
– numclust for JFS
– j2_nPagesPerWriteBehindCluster and j2_nRandomCluster for JFS2
Random write behind
– maxrandwrt for JFS
– j2_maxRandomWrite for JFS2
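As an illustrative sketch only (the values shown are not recommendations), the JFS2
write-behind tunables are set with ioo:

ioo -o j2_nPagesPerWriteBehindCluster=64
ioo -o j2_maxRandomWrite=128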
Mount options
Use release behind mount options where they make sense
Release behind mount options can reduce syncd and lrud overhead and should be used
where it makes sense. The options basically throw away the data that would otherwise be
held in JFS2 cache. You would use these options if you knew that data going into or out of
certain file systems would not be requested again by the application before the data is
likely to be paged out. This means that the lrud daemon has less work to do to free up
cache and eliminates any syncd overhead for this file system.
– -rbr for release behind after read
– -rbw for release behind after write
– -rbrw for release behind after read or write
I/O pacing can be specified for a specific filesystem with mount options and would be
useful where we do not want I/O from one filesystem to slow down other I/O or
applications:
mount -o minpout=40,maxpout=60
Direct I/O (DIO)
– Bypass JFS/JFS2 cache
– No read ahead
– An option of the mount command
– Useful for databases that use file systems rather than raw logical volumes, the idea
being that if an application has its own cache, then it does not make sense to also have
the data in file system cache.
Concurrent I/O (CIO)
– Same as DIO but without inode locking, so the application must ensure data integrity
for multiple simultaneous I/Os to a file.
lru_poll_interval
The lru_poll_interval parameter was introduced in ML4 of AIX 5.2. The parameter tells the
page stealer (lrud) whether it should stop working and poll for interrupts, or continue
processing.
The default maximum transfer size is 0x100000. Consider changing this value to 0x200000 or
larger. These values are adapter dependent. This changes the maximum I/O size that the
adapter will support and it also increases the DMA memory area used for data transfers by
the adapter. When the max_xfer_size=0x100000, then the memory area is 16 MB, and for
other values it is 128 MB.
The default number of simultaneous I/Os the adapter will handle is 200. The maximum for a 2
Gb HBA is 2048.
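On AIX, these adapter attributes can be changed with the chdev command; a sketch (the
adapter name and values are illustrative, and the adapter must not be in use, or use -P and
reboot):

chdev -l fcs0 -a max_xfer_size=0x200000 -a num_cmd_elems=1024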
There are two different file system types for HP-UX: VxFS and HFS. VxFS is preferred for
performance reasons.
vxfs_max_ra_kbytes Maximum amount of read-ahead data, in KB, that the kernel may
have outstanding for a single VxFS file system.
hfs_max_ra_blocks The maximum number of read-ahead blocks that the kernel may
have outstanding for a single HFS file system.
hfs_max_revra_blocks The maximum number of reverse read-ahead blocks that the kernel
may have outstanding for a single HFS file system.
hfs_ra_per_disk The amount of HFS file system read-ahead per disk drive, in KB.
Tip: Tuning the read ahead options varies from system to system depending on the
platform and amount of memory installed. Experiment with different values, making small
changes at a time.
How many pages of memory are allocated for buffer cache use at any given time is
determined by system needs, but the two parameters ensure that allocated memory never
drops below dbc_min_pct and cannot exceed dbc_max_pct percent of total system memory.
The default value for dbc_max_pct is 50 percent, which is usually overkill. If you want to use a
dynamic buffer cache, set the dbc_max_pct value to 25 percent. If you have 4 GB of memory
or more, start with an even smaller value.
With a large buffer cache, the system is likely to have to page out or shrink the buffer
cache to meet application memory needs, which causes I/Os to paging space. You want to
prevent that, so set memory buffers to favor applications over cached files.
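As a hedged sketch for HP-UX 11i (the value is an illustration; depending on the release,
the tunable may be dynamic or may require a kernel rebuild and reboot):

kmtune -s dbc_max_pct=25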
Check the following guide for an updated list of the updates Sun Solaris needs for different
attachment types to the DS6000: IBM TotalStorage Enterprise Storage Server Host System
Attachment Guide, SC26-7446. This guide can be downloaded from:
http://ssddom02.storage.ibm.com/techsup/webnav.nsf/support/2105
The Tunable Parameters Reference Manual for Solaris 8 can be found at:
http://docs.sun.com/app/docs/doc/816-0607
And the Tunable Parameters Reference Manual for Solaris 9 can be found at:
http://docs.sun.com/app/docs/doc/806-7009
maxphys
This parameter specifies the maximum number of bytes that you can transfer for each SCSI
transaction. The default value is 126976 (124 KB). If the I/O block size that you requested
exceeds the default value, the request is broken into more than one request. The value
should be tuned for the application requirements. For maximum bandwidth, set the maxphys
parameter by adding the following line to the /etc/system file (1048576 bytes is 1 MB):
set maxphys=1048576
Attention: Do not set the value for maxphys greater than 1048576 (1 MB). Doing so can
cause the system to hang.
vxio:vol_maxio
If you use the Veritas Volume Manager on the DS6000 LUNs, you must set the VxVM maximum
I/O size parameter (vol_maxio) to match the maxphys parameter. vol_maxio is specified in
512-byte units, so when you set the maxphys parameter to 1048576, set vol_maxio as in the
following /etc/system line (2048 × 512 bytes = 1 MB):
set vxio:vol_maxio=2048
sd_max_throttle
Note: Use this setting for JNI Fibre Channel adapters only.
The default value is 256, but you must set the parameter to a value less than or equal
to a maximum queue depth for each LUN connected. Determine the value by using the
following formula:
256 ÷ (LUNs per adapter)
Where LUNs per adapter is the largest number of LUNs assigned to a single adapter.
To set the sd_max_throttle parameter for the DS6000 LUNs in this example, you would
add the following line to the /etc/system file:
set sd:sd_max_throttle=5
The following settings should be set for all Fibre Channel adapter types (JNI, Emulex, or
QLogic).
– sd_io_time
In this chapter we also discuss the supported distributions of Linux when using the DS6000,
as well as the tools that can be helpful for the monitoring and tuning activity:
uptime
dmesg
top
iostat
vmstat
sar, isag
GKrellM
KDE System Guard
LVM
Bonnie
If problems are encountered with installed versions, you may be required to update your Linux
configuration to a higher supported level before problem determination can take place.
For further clarification and the most current information about DS6000-supported Linux
distributions and kernel support compatible with the DS6000, you can refer to the Web site:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
Once there, click the link for the PDF file: Download interoperability matrix.
It is the kernel’s job to manage all of these different memory spaces. When, for example, an
application is started, the kernel must transfer all the data from the hard disk to the buffer
space. After that, it must free some memory in the user space to load the application. Since
the user space will be divided into different chunks, it must sometimes rearrange certain
processes to get a big enough chunk for the application it is trying to load. When it has
freed a large enough chunk, it can load and start the application.
While virtual memory makes it possible for computers to more easily handle larger and more
complex applications, as with any powerful tool, it comes at a price. The price in this case is
one of overhead: An application that is 100 percent memory-resident will run faster than one
residing in virtual memory.
However, this is no reason to throw up one's hands and give up. The benefits of virtual
memory are too great to do that. And, with a bit of tuning, good performance is possible. The
thing that must be done is to look at the system resources that are impacted by heavy use of
the virtual memory subsystem.
The interrelated nature of these loads makes it easy to see how resource shortages can lead
to severe performance problems. All it takes is:
A system with too little RAM
Heavy page fault activity
A system running near its limit in terms of CPU or disk I/O
At this point, the system will be thrashing, with performance rapidly decreasing.
From this, the overall point to keep in mind is that the performance impact of virtual memory is
minimal when it is used as little as possible.
The primary determinant of good virtual memory subsystem performance is having enough
RAM. Next in line (but much lower in relative importance) are sufficient disk I/O and CPU
capacity. Adding disk and CPU capacity, however, does comparatively little for the virtual
memory subsystem itself (although it obviously can play a major role in overall system
performance).
Note: A reasonably active system will always experience some page faults, if for no other
reason than because a newly-launched application will experience page faults as it is
brought into memory.
If there is insufficient memory installed in a server, it will begin paging the least used data
from memory to the swap partitions on the disks. A general rule is that the swap partitions
should be on the fastest drives available. If the server has more than one array, it is always a
good idea to spread the swap partitions over all of the arrays. This will generally improve the
performance of the server.
Furthermore, there is a way to parallelize swap file read/writes. It is possible to give each
swap partition a priority setting in the /etc/fstab file. If you open the /etc/fstab file, you might
see something like in Example 8-1.
Under normal circumstances, Linux would use the swap partition /dev/sda2 first, then
/dev/sdb2, and so on, until it had allocated enough swapping space. This means that perhaps
only the first partition, /dev/sda2, will be used if there is no need for a large swap space.
Spreading the data over all available swap partitions will improve performance, because all
read/write requests will be performed simultaneously to all selected partitions. If you change
the file, as in Example 8-2, you will assign a higher priority level to the first three partitions.
Swap partitions are used from the highest priority to the lowest (where 32767 is the highest
and 0 the lowest). Giving the same priority to the first three disks causes the data to be written
to all three disks; the system does not wait until the first swap partition is full before it
starts writing to the next one. The fourth partition is used if the first three are
completely filled up and there is still additional
space needed for swapping. It is also possible to give all partitions the same priority to stripe
the data over all partitions, but if one drive is slower than the others (/dev/sdd2 in
Example 8-2), performance would decrease.
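The /etc/fstab entries described would look roughly like the following sketch, with the
first three partitions at the same priority and the slower fourth partition lower (device
names and priority values are illustrative):

/dev/sda2   swap   swap   defaults,pri=3   0 0
/dev/sdb2   swap   swap   defaults,pri=3   0 0
/dev/sdc2   swap   swap   defaults,pri=3   0 0
/dev/sdd2   swap   swap   defaults,pri=1   0 0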
If the server is running out of swap space and there is additional hard disk space left, it is
possible to create additional swap partitions with fdisk. However, if you cannot create a new
partition, you can create a swap file instead. There are two disadvantages to locating a swap
file outside a dedicated swap partition.
The performance of swap files in a data partition is slower than on a swap partition.
If the swap file gets damaged, the data on the whole partition may be lost.
For these reasons, we recommend that you not place the swap file on a data partition. In the
following example, we will create a 512 MB swap file with a block size of 2 KB:
1. Start by creating a directory for the swap file:
mkdir /swap
2. Create the swap file:
dd if=/dev/zero of=/swap/swapfile bs=2048 count=262144
This command creates a file called /swap/swapfile with a block size of 2 KB. The size
will be 512 MB (2048*262144=512 MB). The size is determined using the bs and count
parameters of dd, so the command could have also been:
dd if=/dev/zero of=/swap/swapfile bs=1M count=512
3. Initialize the swap file:
mkswap /swap/swapfile
4. Synchronize the file:
sync
5. Configure Linux to use the swap file:
swapon /swap/swapfile
6. If the swap file is no longer needed, you can instruct the system to stop using the swap file
and then delete the file:
swapoff /swap/swapfile
rm /swap/swapfile
It is also possible to use a swap file permanently. The information needs to be put into the
/etc/fstab file, which would look as illustrated in Example 8-3.
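As a sketch, the /etc/fstab entry for a permanent swap file would look like:

/swap/swapfile   swap   swap   defaults   0 0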
While swapping (writing modified pages out to the system swap space) is a normal part of a
Red Hat and Suse Linux system's operation, it is possible for a system to experience too
much swapping. The reason to be wary of excessive swapping is that the following situation
can easily occur, over and over again: Pages from a process are swapped; the process
becomes runnable and attempts to access a swapped page; the page is faulted back into
memory; a short time later, the page is swapped out again.
There are daemons running on every server that are probably not needed. Disabling these
daemons frees memory and decreases the number of processes the CPU has to handle.
linuxconf, chkconfig, and serviceconf are tools that make it easy, among other things, to
disable and enable daemons. If linuxconf is not found on your system it is available from:
http://www.solucorp.qc.ca/linuxconf
Figure 8-1 shows one interface for disabling daemons on Red Hat Linux, linuxconf.
chkconfig is a text type tool run from the command line. Example 8-4 shows output with the
chkconfig command.
Change a value (and check if the setting changed). For example, to turn off the sshd daemon,
type in the following command from the host system:
chkconfig --level 5 sshd off
Now turn the sshd daemon back on by typing in the following command:
chkconfig --level 5 sshd on
And now check to see if it has changed by typing in the following command:
chkconfig --list sshd
Also you can use serviceconf to disable unnecessary daemons, as illustrated in Figure 8-2
on page 269.
If you do not have the ability to run linuxconf, serviceconf, or chkconfig, or you do not
want to use them, it is also possible to disable or enable daemons from the command line. In
the following example, we will show how to stop the sendmail daemon. First log on as root
and enter the following command:
/etc/init.d/sendmail stop
Every daemon can be started and stopped in the same way. Some also provide further
functions such as restart, status, and so on.
If you do not want the daemon to start the next time the machine boots, you will need to
change the contents of the various run level directories:
1. Determine which run level the machine is running with the command runlevel.
This will print the previous and current run level (for example, N 3 means that there was no
previous run level (N) and that the current run level is 3).
2. To switch between run levels, use the init command. For example, to switch to run level 5,
enter the command init 5.
3. To prevent a daemon from starting, you will need to rename the appropriate file in the /etc
directory structure. For example, to disable the sendmail daemon in run level 3 at startup,
enter the command:
rename S80sendmail K80sendmail /etc/rc3.d/S80sendmail
Or,
mv /etc/rc3.d/S80sendmail /etc/rc3.d/K80sendmail
Daemons with an S at the beginning of the symbolic link name will be started; those
starting with a K will not be started in that specific run level. In our example, the sendmail
daemon will not be started on the next reboot. Note that you must select the correct run
level to change this.
Before you begin, you will need to know what hardware is installed in the server. You can
obtain a list by typing in the command lspci. The most important things to know are:
CPU type
Amount of memory installed
SCSI adapter
RAID controller
Fibre Channel adapter
Network adapter
Video adapter
The more information you have about the hardware used, the more easily the Linux kernel
can be configured.
This procedure can be tricky at some steps, so we refer you to a complete discussion of how
to compile the kernel in the IBM Redpaper, Running the Linux 2.4 Kernel on IBM Eserver
xSeries Servers, REDP0121, available from:
http://www.redbooks.ibm.com
Select Redpapers from the left navigation bar and do a search using the Redpaper form
number REDP0121.
Tip: By default, the kernel includes the necessary module to enable you to make changes
using sysctl without needing to reboot. However, if you choose to remove this support
(during the operating system installation), then you will have to reboot Linux before the
change can take effect.
SUSE LINUX offers a graphical method of modifying these sysctl parameters, illustrated in
Figure 8-3. To launch the powertweak tool, issue the following command:
/sbin/yast powertweak
Red Hat offers a graphical method of modifying these sysctl parameters. To launch the tool,
issue the following command:
/usr/bin/redhat-config-proc
Reading the files in the /proc directory tree provides a simple way to view configuration
parameters that are related to the kernel, processes, memory, network and other
components. Each process running in the system has a directory in /proc with the process ID
(PID) as name. Table 8-1 lists some of the files that contain kernel information.
/proc/loadavg Information about the load of the server in 1-minute, 5-minute, and 15-minute intervals.
The uptime command gets information from this file.
/proc/kcore (SUSE LINUX Enterprise Server only) Contains data to generate a core dump at run
time, for kernel debugging purposes. The command to create the core dump is gdb as
in:
#gdb /usr/src/linux/vmlinux /proc/kcore
/proc/meminfo Information about memory usage. The free command uses this information.
/proc/sys/abi/* Used to provide support for “foreign” binaries, not native to Linux: those compiled
under other UNIX variants such as SCO UnixWare 7, SCO OpenServer, and SUN
Solaris 2. By default, this support is installed, although it can be removed during
installation.
/proc/sys/fs/* Used to increase the number of open files the OS allows and to handle quota.
/proc/sys/kernel/* For tuning purposes, you can enable hotplug, manipulate shared memory, and specify
the maximum number of pid files and level of debug in syslog.
The next time you reboot, the parameter file will be read. You can do the same thing without
rebooting by issuing the following command:
#sysctl -p
Table 8-2 lists the SUSE Linux V2.4 kernel parameters that are most relevant to performance.
Table 8-2 List of the SUSE LINUX V2.4 kernel parameters that are most relevant
Parameter Description/example of use
kernel.shm-bigpages-per-file Normally used for tuning database servers. The default is 32768. To
calculate a suitable value, take the amount of System Global Area
(SGA) memory in GB and multiply by 1024. For example:
sysctl -w kernel.shm-bigpages-per-file=16384
kernel.sched_yield_scale Enables the dynamic resizing of time slices given to processes. When
enabled, the kernel reserves more time slices for busy processes and
fewer for idle processes. The parameters kernel.min-timeslice and
kernel.max-timeslice are used to specify the range of time slices that
the kernel can supply as needed. If disabled, the time slices given to
each process are the same.
sysctl -w kernel.sched_yield_scale=1
sysctl -w kernel.shm-use-bigpages=1
net.ipv4.conf.all.hidden All interface addresses are hidden from Address Resolution Protocol
(ARP) broadcasts and will be included in the ARP response of other
addresses. Default is 0 (disabled). For example:
sysctl -w net.ipv4.conf.all.hidden=1
sysctl -w net.ipv4.conf.default.hidden=1
net.ipv4.conf.eth0.hidden Enables only interface eth0 as hidden. Uses the ID of your network
card. Default is 0 (disabled).
sysctl -w net.ipv4.conf.eth0.hidden=1
net.ipv4.ip_conntrack_max This setting is the number of separate connections that can be tracked.
Default is 65536.
sysctl -w net.ipv4.ip_conntrack_max=32768
sysctl -w net.ipv6.conf.all.mtu=9000
net.ipv6.conf.all.router_solicitation_delay Determines whether to wait after interface opens before sending router
solicitations. Default is 1 (the kernel should wait). For example:
sysctl -w net.ipv6.conf.all.router_solicitation_delay=0
sysctl -w net.ipv6.conf.all.router_solicitation_interval=3
sysctl -w net.ipv6.conf.all.router_solicitations=2
sysctl -w net.ipv6.conf.all.temp_prefered_lft=259200
sysctl -w net.ipv6.conf.all.temp_valid_lft=302400
net.ipv6.conf.default.accept_redirects Accepts redirects sent by an IPv6 router. It cannot
be enabled if forwarding is enabled: always one or the other, never both together, because
setting both causes problems in IPv6 networks. Default is 1 (enabled).
sysctl -w net.ipv6.conf.default.accept_redirects=0
sysctl -w net.ipv6.conf.default.autoconf=0
sysctl -w net.ipv6.conf.default.dad_transmits=0
net.ipv6.conf.default.mtu Sets the default value for Maximum Transmission Unit (MTU). Default is
1280.
sysctl -w net.ipv6.conf.default.mtu=9000
sysctl -w net.ipv6.conf.default.regen_max_retry=3
net.ipv6.conf.default.router_solicitation_delay Number in seconds to wait, after the
interface is brought up, before sending a router request. Default is 1 (enabled).
sysctl -w net.ipv6.conf.default.router_solicitation_delay=0
vm.heap-stack-gap Enforces a gap between the heap (used to store information about the
status of processes and local variables) and the stack. You should disable this when you
need to run a server with the Java Development Kit (JDK™); otherwise your software will
crash. Default is 1 (enabled).
sysctl -w vm.heap-stack-gap=0
vm.vm_anon_lru Allows the virtual memory (vm) to always have visibility of anonymous
pages. Default is 1 (enabled).
sysctl -w vm.vm_anon_lru=0
vm.vm_lru_balance_ratio Balances the active and inactive sections of memory by defining the
amount of inactive memory that the kernel will rotate. Default is 2.
sysctl -w vm.vm_lru_balance_ratio=3
sysctl -w vm.vm_mapped_ratio=90
vm.vm_passes Number of passes that the kernel makes when trying to balance the active and
inactive sections of memory. Default is 60.
sysctl -w vm.vm_passes=30
sysctl -w vm.vm_shmem_swap=1
vm.vm_vfs_scan_ratio Proportion of the unused Virtual File System caches that the kernel
tries to scan in one VM freeing pass. Default is 6.
sysctl -w vm.vm_vfs_scan_ratio=6
Table 8-3 Red Hat parameters that are most relevant to performance tuning
Parameter Description / example of use
net.ipv4.inet_peer_gc_maxtime How often the garbage collector (gc) should pass over the inet peer
storage memory pool during low or absent memory pressure. Default is
120, measured in jiffies. For definition of jiffy, see:
http://www.kernelnewbies.org/glossary/#J
sysctl -w net.ipv4.inet_peer_gc_maxtime=240
net.ipv4.inet_peer_gc_mintime Sets the minimum time that the garbage collector can pass cleaning
memory. If your server is heavily loaded, you may want to increase this
value. Default is 10, measured in jiffies.
sysctl -w net.ipv4.inet_peer_gc_mintime=80
net.ipv4.inet_peer_maxttl The maximum time-to-live for the inet peer entries. New entries will
expire after this period of time. Default is 600, measured in jiffies.
sysctl -w net.ipv4.inet_peer_maxttl=500
net.ipv4.inet_peer_minttl The minimum time-to-live for inet peer entries. Set to a high
enough value to cover the fragment time-to-live on the reassembling side of fragmented
packets. This minimum time must be smaller than net.ipv4.inet_peer_threshold. Default is
120, measured in jiffies.
sysctl -w net.ipv4.inet_peer_minttl=80
net.ipv4.inet_peer_threshold Sets the size of inet peer storage. When this limit is reached,
peer entries will be thrown away, using the inet_peer_gc_mintime timeout.
Default is 65644.
sysctl -w net.ipv4.inet_peer_threshold=65644
vm.hugetlb_pool The hugetlb feature works in the same way as bigpages, but after
hugetlb allocates memory, only the physical memory can be accessed
by hugetlb or shm allocated with SHM_HUGETLB. It is normally used
with databases such as Oracle or DB2. Default is 0.
sysctl -w vm.hugetlb_pool=4608
sysctl -w vm.inactive_clean_percent=30
vm.pagecache Designates how much memory should be used for page cache. This is
important for databases such as Oracle and DB2. Default is 1 15 100.
Many different file systems are available for Linux that differ in performance and scalability.
Besides storing and managing data on the disks, file systems are also responsible for
guaranteeing data integrity. The newer Linux distributions include journaling file systems as
part of their default installation. Journaling, or logging, prevents data inconsistency in case of
a system crash. All modifications to the file system metadata are maintained in a
separate journal or log and can be applied after a system crash to bring the file system
back to its consistent state. Journaling also improves recovery time, because there is no need to perform
file system checks at system reboot.
As with other aspects of computing, you will find that there is a trade-off between performance
and integrity. However, as Linux servers make their way into corporate data centers and
enterprise environments, requirements such as high availability can be addressed.
In this section, we cover the default file systems available on Red Hat Enterprise Linux AS
and SUSE LINUX Enterprise Server and some simple ways to improve their performance.
ext2
ext2 is still a commonly used file system in the Linux community. It provides the standard
UNIX file semantics and advanced features. It is robust and offers excellent performance. The
ext2 standard features include:
Support for standard UNIX file types (regular files, directories, device special files, and
symbolic links)
Up to 4 TB of volume size
Support for long file names (up to 255 characters)
The ext2 kernel code contains many performance optimizations, which improve I/O speed
when accessing data on a disk. One of the optimizations is a read ahead algorithm. When a
block is read, the kernel code automatically requests the follow-on blocks. In this way, it
ensures that the next block is already in the buffer cache and available for further processing.
In addition, ext2 contains many allocation optimizations. Block groups are used to store
related inodes and data together. The kernel always tries to allocate data blocks for a file in
the same group as its inode. This results in fewer disk head seeks performed when the kernel
reads an inode and its data blocks.
One problem with ext2 is that if an unexpected power failure or an unclean shutdown occurs,
the file system may be in an inconsistent state. Therefore, an e2fsck is forced on the next
reboot of the system, which may or may not recover the file system from its inconsistent state.
Journaling file systems like ext3 greatly reduce the chance of getting an inconsistent file
system.
Since you cannot change the stripe size on the disks of the DS6000, to achieve optimal
performance your OS software stripe size should be a multiple of your file system block
size, or slightly larger. The actual file system block size for /dev/sda1 can be found with
the following command:
dumpe2fs -h /dev/sda1 |grep -F "Block size"
Example 8-7 Determining file system block size from the dumpe2fs command
dumpe2fs 1.23, 15-Aug-2001 for EXT2 FS 0.5b, 95/08/09
Block size: 1024
The block size cannot be changed when the partition is already formatted, so you have to
decide which block size you will use when formatting the partition. So, if you create a new ext2
partition on /dev/sda5 with a block size of 4096 bytes/block, the command will be:
mke2fs -b 4096 /dev/sda5
ext3
ext3 is the updated version of the ext2 file system. It has many new features and
enhancements compared to the previous ext2. Its main advantages are:
Availability: ext3 always writes data in a consistent way to the disks. So in case of an
unclean shutdown (unexpected power failure, system crash), the server does not need to
check the consistency of the data on a ext3 volume.
The time spent to recover the journal is about one second (depending on the hardware
used). On an ext2 volume, the e2fsck performed after an unclean shutdown may take
hours, depending on the size of the volume and the number of files.
Data integrity: You can choose the type and level of protection of your data. You can
choose to keep the file system consistent, but allow for damage to data on the file system
in case of unclean system shutdown. This can improve performance under some, but not
all, circumstances.
Alternatively, you can choose to ensure that the data is consistent with the state of the file
system. This second choice is the safer choice and is the default.
Speed: There are three different journaling modes available to optimize speed:
data=writeback (metadata only), data=ordered (the default), and data=journal (both data
and metadata are journaled).
If maximum performance is needed, use ext2 since it has generally less overhead than any
journaling file system. But keep in mind that your data may be inconsistent in the event of a
power failure or an unclean shutdown.
The current version of ReiserFS that is installed with SUSE LINUX Enterprise Server 8 is
V3.6. There is work underway to deliver the next release, Reiser4. The new Reiser4 file
system is expected to deliver an unbreakable file system by eliminating corruption with the
implementation of an atomic file system where I/O is guaranteed to complete, a 2x to 5x
speed improvement by implementing new access algorithms, and ease of third-party
upgrades without reformatting, through the use of plug-ins.
Testing performed with FTP transmissions has shown that with scalable window support
enabled and the TCP window size set to an appropriate level (depending on the network),
network throughput improves 100–500 percent on WAN links. There is less impact on local
area networks.
The default setting of 64 KB for most Linux configurations is fine for most LANs, but too low for
Internet connections. Set this to a value between 256 KB for T1 lines or lower, and 2 to 4 MB
for T3, OC-3, or even faster connections.
To determine the optimal buffer size for your environment, you can use the following formula:
buffer size = 2 × bandwidth × delay
Where bandwidth is the bandwidth of the slowest connection between the server and the
client, and delay is the round-trip time on that path.
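For example, with illustrative numbers: on a 100 Mb/s (12.5 MB/s) path with a 50 ms
round-trip delay, buffer size = 2 × 12.5 MB/s × 0.05 s ≈ 1.25 MB.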
For a Linux kernel 2.4.x system, add the following lines to /etc/rc.d/rc.local:
echo "4096 65536 4194304">/proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304">/proc/sys/net/ipv4/tcp_wmem
The three values describe the minimum, default, and maximum window sizes used by TCP.
The Linux kernel 2.4.x actually does a good job of adjusting the window size automatically,
depending on network conditions. You simply need to specify appropriate minimum and
maximum values.
8.3.1 uptime
The uptime command can be used to see how long the server has been running, how many
logged on users there are, and gives a quick overview of what average load the server has.
The system load average is displayed for the last one, five, and fifteen minute intervals. The
load average is not a percentage, but instead the number of processes in queue waiting to be
processed. If processes that request CPU time are blocked (which means the CPU has no
time to process them), the load average will increase. On the other hand, if each process
gets immediate access to CPU time and no CPU cycles are lost, the load will decrease.
The optimal value of the load would be 1, which means each process gets immediate access
to the CPU and there are no CPU cycles lost. The typical loads can vary from system to
system: For a uniprocessor workstation, 1–2 might be acceptable, whereas you will probably
see values of 8–10 on multiprocessor servers.
For more information about uptime, see the online help or the man page (man uptime).
Note: You can also use w, who, or finger instead of uptime. They also provide information
about who is currently logged onto the machine and what the user is doing.
8.3.2 dmesg
With dmesg, you can determine what hardware is installed in your server. During every boot,
Linux checks your hardware and logs this information. You can view these logs using dmesg.
You can see information about the CPU, DS6000 disk subsystem, network adapters, and
amount of memory that is installed. Example 8-9 illustrates the output of the dmesg command.
For more information about dmesg see the online help (man dmesg).
8.3.3 top
The top command shows you actual processor activity. By default, it displays the most
CPU-intensive tasks of the server and updates the list every five seconds. You can sort the
processes by PID (numerically), age (newest first), resident memory usage, and time (the
time the process has occupied the CPU since startup). Example 8-10 shows a sample of the
output of the top command.
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
12795 root 20 0 10592 2404 924 R 98.9 0.9 22:11 jre
13125 root 10 0 1028 1024 832 R 0.9 0.4 0:00 top
1 root 8 0 524 524 456 S 0.0 0.2 0:04 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 9 0 0 0 0 SW 0.0 0.0 0:03 kswapd
5 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreclaimd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
7 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
8 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
15 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_2
18 root 9 0 0 0 0 SW 0.0 0.0 0:08 kjournald
93 root 9 0 0 0 0 SW 0.0 0.0 0:00 khubd
185 root 9 0 0 0 0 SW 0.0 0.0 0:00 kjournald
628 root 9 0 620 620 524 S 0.0 0.2 0:00 syslogd
633 root 9 0 1100 1100 448 S 0.0 0.4 0:00 klogd
653 rpc 9 0 592 592 504 S 0.0 0.2 0:00 portmap
681 rpcuser 9 0 764 764 664 S 0.0 0.2 0:00 rpc.statd
You can further modify the processes using renice to give a new priority to each process. If a
process hangs or occupies too much CPU, you can kill the process. Of course you can also
use the standard commands renice or kill to perform these steps, but with top you have
one interface to perform all these tasks.
For more information about top, see the online help (man top).
Note: It may not always be possible to change the priority of a process via the nice level. If
a process is running too slowly, you can assign more CPU to it by giving it a lower nice
level. Of course, this means that all other programs have fewer processor cycles and will
run more slowly.
Linux supports nice levels from 19 (lowest or least nice—gets more CPU) to -20 (highest or
nicest). Without an option the default value is 10. To change the nice level of a program to a
negative number, it is necessary to log on as root.
To start the program xyz with a nice level of -5, issue the command:
nice -n -5 xyz
To change the nice level of a program already running, issue the command:
renice -10 pid
Where pid is the process identification of the process. The process will decrease its nice level
to -10.
Zombie processes
When a process has been terminated by receiving a signal to do so, it normally takes
some time to finish all its tasks (closing open files, and so on) before ending itself. In
that normally very short time frame, the process is a zombie.
After the process has finished all these shutdown tasks, it reports to the parent process that it
is about to terminate. Sometimes a zombie process is unable to terminate itself, in which
case, you will see processes with a status of Z (zombie).
It is not possible to kill such a process with the kill command, because it is already
considered dead. If you cannot get rid of a zombie, you can kill the parent process and then
the zombie disappears as well. However, if the parent process is the init process, you should
not kill it; in that case only a reboot will remove the zombie.
8.3.4 iostat
If the iostat command is not included in your distribution, you may get it here:
http://linux.inet.hr/
The iostat command lets you see average CPU times since the system was started, in a way
similar to uptime. In addition, however, iostat creates a report about the activities of the
DS6000 disk subsystem on the server. The report is split in CPU utilization and device
utilization, where device utilization means the disk subsystem. Example 8-11 illustrates a
sample output of the iostat command.
For more information about iostat see the online help (man iostat).
8.3.6 sar
The sar command, which is included in the sysstat package, uses the standard system
activity daily data file to generate a report.
To install the sysstat package, log in as root and mount the CD-ROM containing the package.
Then do the following steps:
mount -t iso9660 /dev/cdrom /mnt/cdrom
cd /mnt/cdrom/RedHat/RPMS
rpm -ivh sysstat-3.3.5-3.i386.rpm
The system has to be configured to grab the information and log it; therefore, a cron job must
be set up. Add the following lines to the /etc/crontab. Example 8-13 illustrates an example of
automatic log reporting with cron.
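As a sketch, typical sysstat entries in /etc/crontab look like the following (paths can
vary by distribution):

# collect system activity data every 10 minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1
# produce a daily summary report shortly before midnight
53 23 * * * root /usr/lib/sa/sa2 -A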
You get a detailed overview of your CPU utilization (%user, %nice, %system, %idle), memory
paging, network I/O and transfer statistics, process creation activity, activity for block devices,
and interrupts/second over time.
These are the main values that are displayed if you use sar -A (the -A is equivalent to
-bBcdqrRuvwWy -I SUM -I PROC -n FULL -U ALL, which selects the most relevant counters of
the system):
kbmemfree Free memory in KB
kbmemused Used memory in KB (without memory used by the kernel)
%memused Percentage of used memory
kbmemshrd Amount of memory shared by the system (always 0 with kernel 2.4)
kbbuffers Memory used for buffers by kernel in KB
kbcached Memory used for caching by kernel in KB
kbswpfree Free swap space in KB
kbswpused Used swap space in KB
%swpused Percentage of used swap space
intr/s Interrupts per second
05:00:01 PM proc/s
05:10:00 PM 13.16
05:20:00 PM 0.14
05:30:00 PM 0.05
05:40:00 PM 0.05
05:50:01 PM 0.05
06:00:01 PM 0.05
06:10:01 PM 0.07
06:20:01 PM 0.05
06:30:00 PM 0.05
8.3.7 isag
The output of sar is straight text and can be very time consuming to process. Instead, the
isag command (Interactive System Activity Grapher) can show the data gathered by sar in a
graphical format (see Figure 8-5).
When you start isag you must first select a data source. Click the - button to the right of data
source. A menu will appear showing the different data sources available. The data sources
are named sa01, sa02, sa03, etc., each standing for a day of the month when recorded (for
example, sa11 would mean the log file recorded on the 11th day of the current month).
However, only the last nine days are available for analysis.
The slider on the left of the window (see Figure 8-5) is used to adjust the vertical scale of the
graph. By default, isag will display the paging statistics, but you can change the view by
clicking Chart and then choosing the data you are interested in:
I/O transfer rate
Paging statistics
Process creation
Run queue
Memory and swap
Memory activities
CPU utilization
Inode status
System switching
System swapping
Note: isag keeps data for only one week. After one week, the collected data for the
seventh day will be deleted. This might not be enough to do a proper bottleneck analysis or
to make a trend analysis of the server.
Run Queue
Run Queue has the following counters:
runq-sz Run queue length (number of processes waiting for runtime)
plist-sz Number of processes in the process list
Figure 8-8 on page 291 illustrates a sample Memory and Swap graphic report.
Memory Activities
Memory Activities has the following counters:
frmpg/s Number of memory pages freed by the system per second. (A
negative value represents the number of pages allocated by the
system.)
shmpg/s Number of additional memory pages shared by the system per
second. A negative value means fewer pages shared by the system.
bufpg/s Number of additional memory pages used as buffers by the system
per second. A negative value means fewer pages used as buffers by
the system.
campg/s Number of additional memory pages cached by the system per
second. A negative value means fewer pages in the cache.
Figure 8-9 on page 292 illustrates a sample Memory Activities graphic report.
CPU Utilization
CPU Utilization has the following counters:
%user Percentage of CPU utilization that occurred while executing at the user
level (application)
%nice Percentage of CPU utilization that occurred while executing at the user
level with nice priority
%system Percentage of CPU utilization that occurred while executing at the
system level (kernel)
Figure 8-10 on page 293 illustrates a sample CPU Utilization graphic report.
System swapping
System swapping uses the following counters:
pswpin/s Total number of swap pages the system brought in per second
pswpout/s Total number of swap pages the system brought out per second
For more information about sar and isag, see the man pages (man sar, man isag).
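As a hedged sketch (the script locations and log directory are typical sysstat defaults and
differ between distributions), background data collection for sar and isag is usually driven
by cron entries like these:
# Collect counters every 10 minutes into /var/log/sa/saDD
*/10 * * * * root /usr/lib/sa/sa1 1 1
# Write a daily summary report at 23:53
53 23 * * * root /usr/lib/sa/sa2 -A
With the daily saDD files in place, start isag and select the day to graph from its data source
menu.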
Note: GKrellM is an Xwindows tool. Running X may impact your performance analysis.
Of course, you can get all of this information from several separate monitoring tools, but the
one big advantage of GKrellM is that it takes up only one process to monitor your system.
Furthermore, the charts have an autoscaling feature, but you can also use fixed scaling
modes. Figure 8-12 shows the output of GKrellM.
The graphical front end uses sensors to retrieve the information it displays. A sensor can
return simple values or more complex information such as tables. For each type of
information, one or more displays are provided. Displays are organized in worksheets that
can be saved and loaded independently from each other.
Note: KSysguard is an Xwindows tool. Running X may impact your performance analysis.
The KSysguard main window (see Figure 5-11) consists of a menu bar, an optional tool bar
and status bar, the sensor browser, and the work space. When first started, you see your local
machine listed as localhost in the sensor browser and two pages in the work space area. This
is the default setup.
The sensor browser displays the registered hosts and their sensors in a tree form, and
includes the type of data. Each sensor monitors a certain system value. All of the displayed
sensors can be dragged and dropped in the work space. There are two options:
You can delete and replace sensors in the current work space.
You can create a new worksheet and drop new sensors meeting your needs.
KSysguard is part of the KDE project and information and updates can be obtained at:
http://www.kde.org
For each HBA, there are BIOS and driver settings that are appropriate for connecting to the
DS6000. If these settings are not configured correctly, performance can suffer, or the
connection might not work properly.
You can get the driver, BIOS, and HBA-related information from the following link:
http://knowledge.storage.ibm.com/servers/storage/support/hbasearch/interop/hbaSearch.do
Note: When configuring the HBA, we strongly recommend installing the newest version of
the driver and BIOS. Newer versions include enhancements and problem fixes that
improve performance and RAS.
8.5.1 Implementation
LVM and LVM2 for Linux can be downloaded free of charge at the following Web site:
http://sources.redhat.com/lvm/
http://sourceware.org/lvm2/
Note: Because LVM is licensed free of charge, there is no warranty for the program, to the
extent permitted by applicable law. Except when otherwise stated in writing, the copyright
holders and/or other parties provide the program “as is” without warranty of any kind, either
expressed or implied, including, but not limited to, the implied warranties of merchantability
and fitness for a particular purpose. The entire risk as to the quality and performance of the
program is with you. Should the program prove defective, you assume the cost of all
necessary servicing, repair, or correction.
In order to use the LVM, you need to make sure that the kernel supports the tool; this may
require rebuilding and recompiling the kernel. Again, we refer you to the Redpaper Running
the Linux 2.4 Kernel on IBM Eserver xSeries Servers, REDP0121, available from:
http://www.redbooks.ibm.com
Select Redpapers from the left navigation bar and do a search using the Redpaper form
number REDP0121, which contains detailed steps to add the LVM to the source kernel.
For a more complete discussion on how to use the LVM tool, visit the Web site:
http://tldp.org/HOWTO/LVM-HOWTO/index.html
RAID
A striped RAID 5 or RAID 10 LUN is created on the DS6000 before it is assigned to the Linux
OS. When Linux boots with this configuration, it sees the LUN as a single disk. As far as LVM
is concerned, there is just one disk in the machine, and it is used as such. If one of the disks
within the LUN fails on the DS6000, LVM will not even know. When the IBM representative
replaces the disk (even on the fly), LVM will not know about that either; the controller rebuilds
the array and all will be well. This is where most users take a step back and ask: Then what
good does LVM do for me with this RAID controller? The easy answer is that in most cases,
after you define a logical drive in the DS6000, you cannot add more disks to that drive later.
So if you miscalculate the space requirements, or you simply need more space, you cannot
add a new disk or set of disks into a pre-existing OS-level software stripe set. Instead, you
create or assign a new RAID LUN in the DS6000 through the DS Storage Manager, and then
with LVM you simply extend the LVM logical volume so that it seamlessly spans both LUNs
on the host platform. However, this is only the case if you do not use striping at the LVM
level.
Data striping
For performance reasons, with the LVM it can be beneficial to spread data in a stripe over
multiple physical volumes. We can use this functionality to spread data across different
DS6000 LUNs. Figure 8-14 on page 298 illustrates an example where block 1 is on Physical
Volume 0 (PV 0), and block 2 is on PV 1, while block 3 is on PV 2. Of course, you can also
stripe over more than 3 LUNs.
Figure 8-14 A Logical Volume striped across three Physical Volumes (PV 0, PV 1, PV 2)
This arrangement means that you have more disk bandwidth available. It also means that
more spindles are potentially involved. We say potentially because having this Logical Volume
spread across three LUNs, as opposed to one LUN, would not involve more spindles if the
LUNs were all defined on the same Rank. This is because when LUNs are assigned from the
same Rank, each LUN is spread across, for example, the same 7 disks in that Rank.
Therefore, if you looked at Disk 1 (or DDM 1) within the DS6000 Rank, you would see data
from LUN 1, LUN 2, and LUN 3. This only involves the 7 spindles (in this example) that are
included in the disks on that Rank. Figure 8-15 on page 299 demonstrates this point, that
LUNs will store data on the same physical disks within a RAID array when assigned from one
Rank (as opposed to each LUN being assigned to separate Ranks). Don’t stripe from LUNs
on the same RAID array!
Figure 8-15 Three LUNs on the same DS6000 Rank will not optimize performance
If you assigned each of these LUNs to different Ranks, for example 3 different Ranks, you
would involve 21 different spindles (if they were built on 7+P arrays).
Note: One of the most important performance considerations when using the LVM on
Linux is that you should stripe the LV across LUNs on different Ranks to gain performance.
This means each LUN needs to be defined on a separate Rank; otherwise, if you stripe
across LUNs on the same Rank, you can actually worsen performance.
Considerations
We recommend that you do not create a stripe size less than 64 KB. If many applications are
addressing the same array, then software striping at the OS level will not help much for host
performance improvement, but it also will not hurt the array performance. For sequential I/O, it
is better to have larger stripe sizes so that the LVM will not have to split write requests. You
could take it to 512 MB, the maximum single I/O size, to fully utilize the FC connection. For
random I/O, the stripe size is not so important.
Note: Remember that once a striped LV has been created through the Linux LVM, you
cannot add a physical volume to this LV. If there is any possibility there will be a need to
later extend LVs, then you should not use striping.
With -i, we tell LVM how many physical volumes it should use to stripe across. Striping is not
really done on a bit-by-bit basis, but on blocks. With -I (uppercase i), we can specify the
stripe size in KB.
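A minimal sketch of these options follows (device names and sizes are illustrative only; your
DS6000 LUN device names will differ):
# Initialize three DS6000 LUNs as LVM physical volumes
pvcreate /dev/sdb /dev/sdc /dev/sdd
# Group them into a single volume group
vgcreate datavg /dev/sdb /dev/sdc /dev/sdd
# Create a 100 GB logical volume striped across all three PVs (-i 3)
# with a 64 KB stripe size (-I 64)
lvcreate -i 3 -I 64 -L 100G -n datalv datavg
Remember that each of the three LUNs should come from a different Rank, as discussed
above.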
8.6 Bonnie
Bonnie is a performance measurement tool written by Tim Bray. For a more complete
description and documentation on Bonnie go to the following Web site:
http://www.textuality.com/bonnie/
Bonnie performs a series of tests on a file of known size. If the size is not specified, Bonnie
uses 100 MB, but that probably is not enough for a big modern server. Bonnie works with
64-bit pointers if you have them.
For each test, Bonnie reports the bytes processed per elapsed second, per CPU second, and
the % CPU usage (user and system).
8.6.1 Benchmarks
Bonnie does the following benchmarks:
char output with putc() / putc_unlocked()
The result is the performance a program will see that uses putc() to write single
characters. On most systems, the speed for this is limited by the overhead of the library
calls into the libc, not by the underlying device. The _unlocked version (used if bonnie is
called with -u) may be considerably faster, as it involves less overhead.
char input with getc() / getc_unlocked()
The result is the performance a program will see that uses getc() to read single characters.
The same comments apply as to putc().
Block output with write()
This is the speed with which your program can output data to the underlying file system
and device writing blocks to a file with write(). As writes are buffered on most systems, you
will see numbers that are much higher than the actual speed of your device, unless you
sync() after the writes (option -y) or use a considerably larger size for your test file than
your OS will buffer. For Linux, this is almost all your main memory.
If called with the -o_direct option, this operation (and the ones described in the following
two paragraphs) is done with the O_DIRECT flag set, which results in direct DMA from
your hardware to userspace, thus avoiding CPU overhead copying buffers around. This
will prevent buffering, and gives a much better estimate of real hardware speed, also for
small test sizes.
Block input with read()
This is the speed with which you can read blocks of data from a file with read(). The same
comment as for block output regarding your OS doing buffering for you applies, with the
exception that using -y does not help to get realistic numbers for reading. You would need
to flush the buffers of the underlying block device, but this turns out to not be trivial, as you
first have to find out the block device. It would be a Linux-only feature anyway.
Block in/out rewrite
Bonnie does a read(), changes a few bytes, write()s the data back, and rereads it. This is
a pattern that occurs in some database applications. Its result tells you how well your
operating system's file system can handle such access patterns.
8.6.2 Downloading
For downloading Bonnie, go to the following Web site:
http://www.textuality.com/bonnie/download.html
Installation and compilation should be straightforward. For Linux, your easiest option is to use
rpm --rebuild on the source RPM. If you use Linux (preferably SuSE Linux) on an i386
machine, you can even use the binary RPM.
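As a hedged example (the mount point is an assumption; choose a test file size well above
installed RAM so that the operating system cannot cache the entire file):
# Run Bonnie against a DS6000-backed file system
# -d: working directory on the device under test
# -s: test file size in MB
# -y: sync() after the writes for more realistic block output figures
bonnie -d /mnt/ds6000 -s 2048 -y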
8.7 Bonnie++
Bonnie++ is a benchmark suite aimed at performing a number of simple tests of hard drive
and file system performance. After running it, you can decide which tests are important and
how to compare different systems.
The main program tests database type access to a single file (or a set of files if you want to
test more than 1 G of storage), and it tests creation, reading, and deleting of small files that
can simulate the usage of programs such as Squid, INN, or Maildir format e-mail.
The ZCAV program tests the performance of different zones of a hard drive. It does not write
any data (so you can use it on full file systems). It can show why comparing the speed of
Windows at the start of a hard drive to Linux at the end of the hard drive (typical dual-boot
scenario) is not a valid comparison.
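A hedged sketch of typical invocations follows (paths and the user name are illustrative; see
the Bonnie++ man pages for the full option list):
# Test database-style and small-file workloads on a DS6000 file system;
# Bonnie++ refuses to run as root unless a user is supplied with -u
bonnie++ -d /mnt/ds6000 -s 2048 -u nobody
# Read-only zone test of a drive; ZCAV writes nothing to the device
zcav /dev/sda > zcav-sda.log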
Bonnie++ was based on the code for Bonnie by Tim Bray. Go to the following Web site for a
summary of the differences between Bonnie 1.0 and Bonnie++.
http://www.coker.com.au/bonnie++/
The original author (Tim Bray) has also put a description of Bonnie on his pages.
The disk subsystem can be the most important aspect of I/O performance, but problems can
be hidden by other factors, such as a lack of memory. Finding disk bottlenecks is easier once
these other possible bottlenecks have been ruled out.
The disk subsystem’s speed affects the overall performance of the file server in the following
ways:
It usually improves the minimum sustained transaction rate.
It may only slightly affect performance under light loads because most requests are
serviced directly from the disk cache. In this case, network transfer time is a relatively
large component and disk transfer times are hidden by disk cache performance.
As the server disk performance improves, increased network adapter and CPU
performance is required to support greater disk I/O transaction rates.
When the I/O subsystem is well tuned and performing efficiently, more throughput and
transactions per second can be done by the system as users and workload increase (see
Figure 8-16).
The I/O operations per second counters in the tools discussed so far in this chapter can be
used to determine whether the server has disk bottlenecks. Collect logged data over a period
of time and then analyze it to see whether a trend can be detected that points to a future disk
bottleneck.
After verifying that the disk subsystem is causing a system bottleneck, a number of solutions
are possible. These solutions include the following:
Consider using faster disks. Allocating your application’s data on the 15K rpm disk drive
Ranks will deliver better performance as compared to the 10K rpm disk drive Ranks.
Consider changing the RAID implementation if this is relevant to the server's I/O workload
characteristics. For example, moving to RAID 10 when the activity is heavy random writes
may show observable gains.
Add more arrays. This will allow you to spread the data across multiple physical disks and
thus improve performance for both reads and writes. Also, use hardware RAID instead of
the software implementation provided by Linux. If hardware RAID is being used, the RAID
level is hidden from the operating system and is therefore more efficient.
Add more RAM. Adding memory will increase system memory disk cache, which in effect
improves disk response times.
Finally, if the previous actions do not provide the desired application performance, then
offload processing to another host system in the network (either users, applications, or
services).
The most current list of Windows servers that can attach to the DS6000 can be found in the
following Web site:
http://www-03.ibm.com/servers/storage/disk/ds6000/interop.html
Tuning all the components in a system is demanding and requires you not only to take a
benchmark before you change anything, but also to take periodic measurements as you go.
Nonetheless, this more comprehensive activity pays off with optimized system performance.
The various system components that can affect the disk performance and are discussed in
this chapter, are:
Priorities between foreground and background processes
Virtual memory
System cache
File system layout and management
There are recommended publications that can help you when tuning the whole system:
Tuning IBM eServer xSeries Servers for Performance, SG24-5287, and Tuning Windows
Server 2003 on IBM eServer xSeries Servers, REDP-3943.
The following list provides additional steps that can be taken to provide better disk
performance on the host:
Modify the priorities between foreground and background processes.
Applications that are CPU and memory intensive should be scheduled for after-hours
operation. Examples of these applications are virus scanners, backup software, and disk
fragmentation utilities. These types of applications should be scheduled to run when the
server is not being utilized.
Allocate virtual memory pro-actively.
Specify the server type to determine how system cache is allocated and used.
Disable unnecessary services.
There are documents on the Microsoft Web site that are helpful for understanding the
performance improvements in Windows Server 2003 and where to tune.
This setting lets you choose how processor resources are shared between the foreground
process and the background processes. Typically, for a server, you do not want the
foreground process to have more CPU cycles allocated to it than the background processes.
We recommend selecting Background services so that all programs receive equal amounts
of processor time.
To change this:
1. Open Control Panel.
2. Open System.
3. Select the Advanced tab.
4. Click the Performance Options button; the window shown in Figure 9-1 appears.
Under Processor scheduling, you can choose one of two settings to optimize performance:
Programs: more processor resources are given to the foreground process than to the
background processes.
Background services (recommended): all programs receive equal amounts of
processor resources.
We strongly recommend that you use only the GUI interface for these settings in order to
always get valid, appropriate, operating system revision-specific, and optimal values in the
registry.
Windows Server 2003 and 2000, as most other server operating systems, employ virtual
memory techniques that allow applications to address greater amounts of memory than what
is physically available. Memory pressure occurs when the demand for physical memory
exceeds the amount of installed memory, causing the operating system to page excess
memory onto a disk drive.
Paging is the process whereby blocks of data are swapped between physical memory and a
file on the hard disk. The paging file is pagefile.sys. The combination of the paging file and
the physical memory is known as virtual memory. Some paging is normal during ordinary
operation, but excessive, consistent paging, which is called thrashing, hurts system
performance. To avoid this, paging should be minimized.
You can control the size of the paging file and this can improve performance if you specify the
minimum value to be what the server normally allocates during the peak time of the day. This
ensures that no processing resources are lost to the allocation and segmentation of the
paging space.
Tip: Create a separate page file for each disk to improve system performance.
You can set the initial size and the maximum size of the paging file for every drive. The
maximum number of page files is 16, and the maximum size is 4 GB per page file, which
means the maximum total page file size is 64 GB. The operating system uses the combined
page files as a single logical paging space. For a file server, set the minimum to the
recommended value, as shown in the window. For other server applications, the
recommendation varies. For a discussion of recommended values, refer to the publication
Tuning IBM eServer xSeries Servers for Performance, SG24-5287.
In a production environment with well-written server applications, hard page faults should not
constantly occur. If there is any sustained paging, check the available bytes in the Task
Manager.
If Available Bytes is less than 20 percent of installed RAM, then add more RAM.
If Available Bytes is much greater than 20 percent of total installed RAM, then the
application cannot make use of additional RAM, so the only solution is to optimize the
page device.
Note: If you remove the page file from the boot partition, dump files (memory.dmp), which
contain debugging information, cannot be created when a blue screen occurs. If you need
a dump file, you must have a page file of at least the size of physical memory plus 1 MB on
the boot partition.
Having a page file size smaller than the current RAM size will affect performance of the
server. Our recommendation is to set the memory page file size to twice the size of the RAM
for a maximum performance gain. The only drawback of having a big page file is the
restriction in space available for files on the hard drives. Since the host will be using DS6000
disks this should not be a concern.
The best way to create a contiguous static page file is to follow this procedure:
1. Remove the current page file from your server by clearing the Initial and Maximum size
values in the Virtual Memory settings window, then click Set (refer to Figure 9-2 on
page 309).
2. Reboot the machine and click OK; ignore the warning message about the page file.
3. Defragment the disk you want to create the page file on. This step should give you enough
continuous space to avoid partitioning of your new page file.
4. Create a new static page file by setting the Initial and Maximum size with the same value.
If possible, use twice the size of your RAM.
5. Reboot the server.
The above procedure will leave you with a contiguous static page file.
Ideally, the paging device will be a separate physical drive. Having data that is being
accessed on the same drive as the paging drive can reduce performance, especially when
multiple logical drives are configured on one physical drive. This causes long seek operations
and slows performance.
If the page files and active data must reside on the same physical device, place them on the
same logical drive. This will keep the page file and data files physically close together, and will
improve performance by reducing the time spent seeking between the two logical drives. Of
course, you can ignore this issue if no I/O access is made to the data drive during normal
operation.
The file system cache is a dynamic memory pool used to store recently accessed data for all
cacheable peripheral devices, which includes data transfers between hard drives, networks
cards, and networks. The Windows Virtual Memory Manager copies data to and from the file
system cache as though it were an array in memory. When data resides in file system cache,
it will improve performance and reduce disk activity.
Tip: Windows Server 2003 has two applets to manage file system cache as compared with
the previous version of Windows which has just one applet.
Two applets of Windows Server 2003 determine how much system memory is available to be
allocated to the working set of file system cache versus how much memory is available to be
allocated to the working set of applications, and the priority with which they are managed
against one another.
To change the File and Printer Sharing for Microsoft Networks setting (both Windows Server
2003 and Windows 2000):
1. Click Start -> Settings -> Network and Dial-Up Connection.
Note: This setting affects all LAN connections, so which LAN connection you choose in the
above steps is not important. If you are not using this system as a file system server, then
you will not be able to modify the cache priorities here.
The file system cache has a working set of memory like any other process. The option chosen
in this dialog effectively determines how large the working set is allowed to grow to and with
what priority the file system cache is treated by the operating system relative to other
applications and processes running on the server.
You have four choices but typically only one of the bottom two options is selected for an
enterprise server implementation:
1. Minimize memory used.
This choice will minimize the memory used for disk cache and maximize the memory
available for the operating system. However, on file servers, the resulting performance
would not be desirable. Therefore, only use this choice for workstations.
The value of the registry entries will be set depending on the option selected in the control
panel, as listed in Table 9-1.
Table 9-1 Registry values set by each option
Option                                         Size   LargeSystemCache
Minimize memory used                           1      0
Balance                                        2      0
Maximize throughput for file sharing           3      1
Maximize throughput for network applications   3      0
The second control panel which is used to manage a file system cache of Windows Server
2003 is in the System applet (Windows Server 2003 only):
1. Click Start -> Control Panel -> System.
2. Select Advanced.
3. Within the Performance frame, click Settings.
4. Select Advanced. The window shown in Figure 9-4 on page 314 appears.
The System applet can also change the value of the LanmanServer LargeSystemCache
registry key, just as File and Printer Sharing in the Network applet does. However, the
System applet changes LargeSystemCache without affecting the Memory Management Size
value that File and Printer Sharing also sets.
Given that most users will only use the Maximize throughput for network applications
option or the Maximize throughput for file sharing option for enterprise servers, the Size
value remains the same, a value of 3. This means that using the System applet to adjust the
LargeSystemCache value is redundant as it is just as easily set using File and Print Sharing.
As a result, we recommend using the first control panel as described above and leave this
second control panel untouched. It would seem that the only advantage to using both Control
Panel applets in conjunction would be to enable you to have the applets actually indicate
Maximize throughput for network applications and simultaneously indicate memory usage
favors System cache. This same effect to the registry is achieved by selecting Maximize
throughput for file-sharing (as per Table 9-1 on page 313) — visually it simply does not say
“Maximize throughput for network applications”. If you do desire this change purely for
aesthetic reasons, then make sure you set the first Network applet before the second System
applet, as the first overrides the second selection, but the reverse does not occur.
With Windows Server 2003, when Maximize data throughput for file sharing is selected,
the file system cache can grow to 960 MB. When Maximize throughput for network
applications is selected, the file system cache can grow to 512 MB. (See Microsoft KB
837331; location below.) Depending on the selection made here, it is possible that adding
more physical memory will not increase the size of the file system cache beyond these limits.
On a server with a lot of physical memory (2 GB or more), it may be preferable to leave the
option Maximize data throughput for file sharing selected (that is, as long as the total
amount of memory used by the operating system and server applications does not exceed
the amount of physical RAM minus 960 MB). In fact, any application server that can have 960
MB or more of RAM unused, will likely improve performance by enabling the large system
cache.
By enabling this, all of the disk and network I/O performance benefits of using a large file
system cache are realized, and the applications running on the server continue to run without
being memory-constrained.
Some applications have their own memory management optimizers built into them, including
Microsoft SQL Server and Microsoft Exchange. In such instances, the setting above is best
set to Maximize throughput for network applications to let the applications manage
memory and their own internal caches as they see fit.
Services can be seen in the Computer Management console. To view services running on
Windows, right-click My Computer and select Manage. Then the Computer Management
window will appear. Select Services in the left pane of the window. Click the Standard tab at
the bottom of the right-side pane. Then a window similar to that shown in Figure 9-5 on
page 316 will appear.
You should stop services that are not needed to free additional memory to those that need it
most, such as the operating system and user applications. To do this, select a service from
the service list and click Stop.
Also examine the startup values of the installed services. Right-click the service and select
Properties. Select Disabled if you do not want this service to run at all on server startup, or
Manual if you want to start a service only at the time you need to use it.
Windows Server 2003 adds many services compared to Windows 2000, partly to strengthen
security. Most of them have a startup type of Disabled or Manual by default, but some are set
to Automatic. When the system boots, the services set to Automatic are started and consume
resources. Some of these services are not actually required, so you should stop them and
set their startup type to Disabled or Manual. For example, the Print Spooler service is
enabled by default, but this service is usually not required unless the server works as a print
spooler or has a local printer.
Table 9-2 lists the services that you should evaluate to determine whether your system
requires them on Windows Server 2003. This list is not applicable to all systems; it is just a
recommendation for a typical system. For example, the File Replication Service (FRS) is
normally required for an Active Directory domain controller, but for other servers this service
would not be required. These services are not disabled by default, so further investigation is
required before disabling them.
You can also stop processes using the Task Manager. Unneeded applications and processes
are those that you do not need running at the moment, for example, an application launched
at startup that does system maintenance, such as disk scanning and defragmentation.
To open the Task Manager, press the Ctrl + Shift + Esc keys. From the Applications tab, select
the unneeded application then click End Task. You can also do this from the Processes tab
by selecting the unneeded process then clicking End Process, as illustrated in Figure 9-6 on
page 318.
A process is considered unnecessary when it has nothing to do with your current server
function. It could have been invoked from the registry by some application that was not
correctly un-installed, for example.
Threads with the highest priority always run on the processor, even if this requires
preempting a thread of lower priority. This behavior ensures that Windows still pays attention
to critical system threads required to keep the operating system running. A thread runs on
the processor either for the duration of its CPU quantum (or time slice, described in 9.2.1,
“Foreground and background priorities” on page 307) or until it is preempted by a thread of
higher priority.
Task Manager allows you to easily see the priority of all threads running on a system. To do
so, open Task Manager, and click View -> Select Columns, then add a checkmark beside
Base Priority as shown in Figure 9-7 on page 319.
This displays a column in Task Manager as shown Figure 9-8 that enables you to see the
relative priority of processes running on the system.
Most applications that are loaded by users run at a normal priority, which has a base priority
value of 8. Task Manager also gives the administrator the ability to change the priority of a
process, either higher or lower.
To do so, right-click the process in question, and click Set Priority from the pull-down menu
as shown in Figure 9-9 on page 320. Click the new priority you want to assign to the process.
If you want to launch a process with a non-normal priority, you can do so using the start
command from a command prompt. Type start /? for more information about how to do this.
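For example (a hedged sketch; the executable names are illustrative):
REM Launch a process at low priority
start /low notepad.exe
REM Launch a process at above-normal priority
start /abovenormal myapp.exe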
Threads, as a subcomponent of processes, inherit the base priority of their parent process.
Each process’s priority class sets a range of priority values (between 1 and 31), and the
threads of that process have a priority within that range. If the priority class is Realtime
(priorities 16 to 31), the thread’s priority can never change while it is running. A single thread
running at priority 31 will prevent all other threads from running.
Conversely, threads running in all other priority classes are variable, meaning that the
thread’s priority can change while the thread is running. For threads in the Normal or High
priority classes (priorities 1 through 15), the thread’s priority can be raised or lowered by up to
a value of 2 but cannot fall below its original, program-defined base priority.
When should you modify the priority of a process? In most instances, you should do this as
rarely as possible. Windows normally does a very good job of scheduling processor time to
threads. Changing process priority is not an appropriate long-term solution to a bottleneck on
a system. Eventually, additional or faster processors will be required to improve system
performance.
Normally, the only conditions under which the priority of a process should be modified are
when the system is CPU-bound. Processor utilization, queue length, and context switching
can all be measured using System Monitor to help identify processor bottlenecks.
Important: Changing priorities might destabilize the system. Increasing the priority of a
process might prevent other processes, including system services, from running. In
particular, be careful not to schedule many processes with the High priority and avoid
using the Realtime priority altogether. Setting a processor-bound process to Realtime
could cause the computer to stop responding altogether.
Decreasing the priority of a process might prevent it from running, not merely force it to run
less frequently. In addition, lowering priority does not necessarily reduce the amount of
processor time a thread receives; this happens only if it is no longer the highest-priority
thread.
Hard affinity can be applied to permanently bind a process to a given CPU or set of CPUs,
forcing the designated process to always return to the same processor. The performance
advantage in doing this is best seen in systems with large Level 2 caches, as the cache hit
ratio (in the server, not storage) will improve dramatically.
Some applications, such as SQL Server, provide internal options to assign themselves to
specific CPUs. The other method for setting affinity is via Task Manager: right-click the
process in question and click Set Affinity, as shown in Figure 9-10 on page 322. Add check
marks next to the CPUs you want to restrict the process to and click OK.
Note that, as with changing the process’s priority, changing process affinity in this manner will
only last for the duration of the process. If the process ends or the system is rebooted, the
affinity will have to be reallocated as required. Note also that not all processes permit affinity
changes.
Specifically for server performance tuning purposes, the Interrupt-Affinity Filter tool (Intfiltr)
enables you to assign the interrupts generated by each network adapter to a specific CPU.
Of course, it is only useful on SMP systems with more than one network adapter installed.
Binding the individual network adapters in a server to a given CPU can offer large
performance efficiencies.
Intfiltr uses plug-and-play features of Windows that permit affinity for device interrupts to
particular processors. Intfiltr binds a filter driver to devices with interrupts and is then used to
set the affinity mask for the devices that have the filter driver associated with them. This
permits Windows to have specific device interrupts associated with nominated processors.
Interrupt filtering can affect the overall performance of your computer in both a positive and
negative manner. Under normal circumstances, there is no easy way to determine which
processor is best left to handle specific interrupts. Experimentation and analysis will be
required to determine whether interrupt affinity has yielded performance gains. To this end, by
default without tools like Intfiltr, Windows directs interrupts to any available processor.
Some considerations should be made when configuring Intfiltr on a server that supports
Hyper-Threading to ensure that the interrupts are assigned to the correct physical processors
desired, not the logical processors. Assigning interrupt affinity to two logical processors that
actually refer to the same physical processor will obviously offer no benefit.
Interrupt affinity for network cards can offer definite performance advantages on large, busy
servers with many CPUs. Our recommendation is to try Intfiltr in a test environment to
associate specific interrupts for network cards with selected processors. This enables you to
determine whether using interrupt affinity will offer a performance advantage for your network
interface cards.
Note that Intfiltr can be used for creating an affinity between CPUs and devices other than
network cards, such as disk controllers. Again, experimentation is the best way to determine
potential performance gains. To determine the interrupts of network cards or other devices,
use Windows Device Manager or run System Information (WINMSD.EXE).
The Intfiltr utility and documentation are available free of charge from Microsoft:
ftp://ftp.microsoft.com/bussys/winnt/winnt-public/tools/affinity/intfiltr.zip
Windows provides a /3GB parameter that can be added to the BOOT.INI file. It reallocates
the address space so that 3 GB is available for user-mode applications and reduces the
amount available to the system kernel to 1 GB. Some applications that are written to do so,
such as Microsoft Exchange and Microsoft SQL Server, can derive performance benefits
from having large amounts of addressable memory available to individual user-mode
processes.
To edit the BOOT.INI file to make this change, complete the following steps:
1. Open the System Control Panel.
2. Select Advanced.
3. In the Startup and Recovery frame, click Settings.
4. Click Edit. Notepad opens to edit the current BOOT.INI file.
This switch normally should be used only when a specific application recommends its use.
Typically this is where applications have been compiled to use more than 2 GB per process,
such as some components of Exchange.
Important: The /3GB switch actually works for all versions of Windows 2000 Server and
Windows Server 2003. However, you should use it only when running Advanced Edition or
Datacenter Edition.
Standard Edition can allocate to user-mode applications at most 2 GB. If the /3GB switch is
configured in the BOOT.INI file, then the privileged-mode kernel is restricted to 1 GB of
addressable memory without the corresponding increase for user-mode applications. This
effectively means 1 GB of address space is lost.
PAE requires appropriate hardware and operating system support to be implemented. Intel
introduced PAE 36-bit physical addressing with the Intel Pentium Pro processor. PAE is
supported with the Advanced and Datacenter Editions of Windows 2000 Server and the
Enterprise and Datacenter Editions of Windows Server 2003.
Windows uses 4 KB pages with PAE to map up to 64 GB of physical memory into a 32-bit (4
GB) virtual address space. The kernel effectively creates a map in the privileged mode
addressable memory space to manage the physical memory above 4 GB.
The Advanced and Datacenter Editions of Windows 2000 Server and Windows Server 2003
allow for PAE through use of a /PAE switch in the BOOT.INI file. This effectively allows the
operating system to use physical memory above 4 GB.
Even with PAE enabled, the underlying architecture of the system is still based on 32-bit
linear addresses. This effectively retains the usual 2 GB of application space per user-mode
process and the 2 GB of kernel mode space, because only 4 GB of addresses are available.
However, multiple processes can immediately benefit from the increased amount of physical
memory.
Address Windowing Extensions (AWE) is a set of Windows APIs that take advantage of the
PAE functionality of the underlying operating system and enable applications to directly
address physical memory above 4 GB. Some applications such as SQL Server 2000,
Enterprise Edition, have been written with these APIs and can harness the significant
performance advantages of being able to address more than 2 GB of memory per process.
To edit the BOOT.INI file to enable PAE, complete the following steps:
1. Open the System Control Panel.
2. Select Advanced.
3. In the Startup and Recovery frame, click Settings.
4. Click Edit. Notepad opens to edit the current BOOT.INI file.
5. Edit the current BOOT.INI file to include the /PAE switch as shown in Figure 17.
6. Restart the server for the change to take effect.
On a server with between 4 GB and 16 GB of RAM hosting applications that have been
compiled or written with AWE to use more than 2 GB of RAM per process or hosting many
applications (processes) that each contend for limited physical memory, it would be desirable
to use both the /3GB and /PAE switches.
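A hedged sketch of the resulting BOOT.INI entry follows (the ARC path and description are
illustrative and depend on your installation):
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /3GB /PAE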
NTFS should always be the file system of choice for servers. NTFS offers considerable
performance benefits over the FAT and FAT32 file systems and should be used exclusively on
Windows servers. In addition, NTFS offers many security, scalability, stability, and reliability
benefits over FAT.
Under previous versions of Windows, FAT and FAT32 were often implemented for smaller
volumes (say, less than 400 MB) because they were often faster in such situations. With disk
storage relatively inexpensive today, and with operating systems and applications pushing
drive capacity to the maximum, it is unlikely that such small volumes will be warranted. FAT32
scales better than FAT on larger volumes, but it is still not an appropriate file system for
Windows servers.
FAT and FAT32 have often been implemented in the past as they were seen as more easily
recoverable and manageable with native DOS tools in the event of a problem with a volume.
Today, with the various NTFS recoverability tools built both natively into the operating system
and as third-party utilities available, there should no longer be a valid argument for not using
NTFS for file systems.
NTFS was designed to provide reliability, security, and fault tolerance through data
redundancy. In addition, support was built into NTFS for large files and disks and for
Unicode-based names.
Windows 2003 uses default cluster sizes for NTFS as shown in Table 9-3, where the value for
the number of sectors assumes a standard 512-byte sector. On systems with sectors that are
not 512 bytes, the number of sectors per cluster may change, but the cluster size remains
fixed.
These values are only used if an allocation unit size is not specified at format time, using the
/A:<size> switch with the format command.
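For example (hedged; the drive letter is illustrative, and formatting destroys all data on the
volume):
REM Format drive E: as NTFS with a 64 KB allocation unit size
format E: /FS:NTFS /A:64K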
Note: The maximum NTFS volume size as implemented in Windows Server 2003 is 2^32
clusters minus 1 cluster. For example, using 64 KB clusters, the maximum NTFS volume
size is 256 terabytes minus 64 KB. Using the default cluster size of 4 KB, the maximum
NTFS volume size is 16 terabytes minus 4 KB.
If you have large numbers of files in an NTFS folder (300,000 or more), disable short-file-name
generation for better performance, especially if the first six characters of the long file
names are similar.
Before disabling short name generation, make sure that there is no DOS or 16-bit application
running on the server that requires 8.3 file names, nor are there any users accessing the files
on the server via 16-bit applications.
To disable the generation of 8.3 short names, edit the following registry parameter:
HKEY_LOCAL_MACHINE \SYSTEM \CurrentControlSet \Control \FileSystem
\NtfsDisable8dot3NameCreation
Change its value from 0 to 1. In Windows Server 2003, this parameter can also be set by
using the command:
fsutil behavior set disable8dot3 1
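As a hedged alternative to editing the registry by hand (reg.exe is included with Windows
Server 2003), the same registry value can be set from a command prompt:
REM Disable 8.3 short name generation via the registry
reg add "HKLM\SYSTEM\CurrentControlSet\Control\FileSystem" /v NtfsDisable8dot3NameCreation /t REG_DWORD /d 1 /f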
A related optimization is to disable updating of the last-access timestamp on files (the
NtfsDisableLastAccessUpdate parameter under the same registry key). In Windows Server
2003, this parameter can also be set by using the command:
fsutil behavior set disablelastaccess 1
Reliability
To ensure reliability of NTFS, three major areas were addressed: Recoverability, removal of
fatal single sector failures, and hot fixing.
The recoverability designed into NTFS is such that a user should never have to run any sort of
disk repair utility on an NTFS partition. This is because NTFS uses a journaled log to keep
track of transactions made against the file system. When a CHKDSK is performed on a FAT
file system, the consistency of pointers within the directory, allocation, and file tables is being
checked. Under NTFS, because a log of transactions against these components is
maintained, CHKDSK need only roll back transactions to the last commit point in order to
recover consistency within the file system.
Under FAT, if the sector that holds one of the file system's special objects fails, that single
sector failure can compromise the whole volume. NTFS avoids this in two ways: first, by not
using special fixed-location objects on the disk and by tracking and protecting all objects that
are on the disk; second, by keeping multiple copies (the number depends on the volume
size) of the Master File Table.
Similar to OS/2® versions of HPFS, NTFS supports hot fixing. NTFS will attempt to move the
data in a damaged cluster to a new location in a fashion that is transparent to the user. The
damaged cluster is then marked as unusable. Unfortunately, it is possible depending on what
damage has occurred, that the moved data may be unusable.
Under Windows, FAT is a careful-write file system: it allows only one write at a time and
updates its volume information after each write. This is a very secure form of writing, but it is
also a very slow process. To improve performance on FAT, you can opt to use the lazy-write
file system feature, which uses the system's memory cache. All writes are performed to this
cache, and the file system intelligently waits for the appropriate time to perform the writes to
disk.
This system gives the user faster access to the file system and prevents holdups due to
slower disk access. It is also possible, if the same file is being modified more than once, that it
may never actually be written to disk until the modifications are finished within the cache. Of
course, this can also lead to lost data if the system crashes and unwritten modifications are
still held in the cache.
NTFS provides the speed of a lazy-write file system along with additional recovery features.
Each write request to an NTFS partition generates both redo and undo information in the
transaction log. In the recovery process, this log ensures that only a few moments after a
reboot, the file system's integrity is fully restored, without the need to run a utility such as
CHKDSK, which requires scanning an entire volume. The overhead associated with this
recoverable file system is less than that of a careful-write file system.
Choosing a file system depends on your particular environment. Some of the factors for
choosing a file system include MS-DOS compatibility, file-level and file-system security,
performance, and recoverability. In general, NTFS is best for use on logical volumes of about
400 MB or more, because performance does not degrade under NTFS with larger volume
sizes, as it does under FAT.
Users seeking highly scalable solutions will use software and hardware solutions in
combination. For example, NTFS uses 64-bit addresses and file offsets. This allows for
theoretically immense file and volume sizes. Today, there are external limitations on volume
and file sizes imposed by the logical disk manager's disk partitioning system and by the
underlying hardware. However, NTFS will continue to scale as these limitations are broken
down.
Disk seek time is normally considerably longer than read or write activity. As noted above,
data is initially written to the outside edge of a disk. As demand for disk storage increases
and the disk fills up, data is written progressively closer to the center of the disk, where
access is slower. This means that monitoring disk space utilization is important, not just for
capacity reasons, but also for performance. It is neither practical nor realistic, however, to
have disks with excessive free space.
Tip: As a rule of thumb, work towards a goal of keeping disk free space between 20-25%
of total disk space. The DS6000 does not have a tool to monitor drive space utilization, so
you have to monitor it from the server side.
Warning: Using Registry Editor incorrectly can cause serious problems that may
require you to reinstall your operating system.
For information about how to edit the registry, view the “Change keys and Values” Help topic
in Registry Editor (Regedit.exe). Note that you should back up the registry before you edit it. If
you are running Windows, you should also update your Emergency Repair Disk.
Value: DisablePagingExecutive
Recommendation: 0x1
Setting DisablePagingExecutive to 1 keeps kernel-mode drivers and system code resident in
memory rather than allowing them to be paged to disk.
Performance and system stability can be seriously impacted if Windows experiences memory
resource constraints and is unable to assign memory to these pools. The amount of physical
memory assigned to these two pools is set dynamically at system boot time. Some
applications and workloads can demand more pooled memory than the system allocates by
default. Setting the PagedPoolSize registry value as listed in Table 9-4 may help ensure that
sufficient pooled memory is available.
Table 9-4 PagedPoolSize values
0x0 (default)
The system dynamically calculates an optimal value for the paged pool at system startup,
based on the amount of physical memory in the computer. This value changes if more
memory is installed. The system typically sets the size of the paged pool to approximately
twice that of the nonpaged pool size.
0x1 - 0x20000000 (512 MB)
Creates a paged pool of the specified size, in bytes. This takes precedence over the value
that the system calculates, and it prevents the system from adjusting the value dynamically.
Limiting the size of the paged pool to 192 MB (or smaller) lets the system expand the file
system (or system pages) virtual address space up to 960 MB. This setting is intended for file
servers and other systems that require an expanded file system address space (meaning
slightly faster access) at the expense of being able to cache less data. This only makes
sense if you know that the files your server frequently accesses already fit easily into the
cache.
0xFFFFFFFF
Windows calculates the maximum paged pool allowed for the system. For 32-bit systems,
this is 491 MB. This setting is typically used for servers that are attempting to cache a very
large number of frequently used small files, some number of very large files, or both. In these
cases, the file cache that relies on the paged pool to manage its caching is able to cache
more files (and for longer periods of time) if more paged pool is available.
Setting this value to 0xB71B000 (192 MB) provides the system with a large virtual address
space, expandable to up to 960 MB. Note that a corresponding entry of zero (0) is required in
the SystemPages registry value for this to take optimal effect, as described below.
Value: PagedPoolSize
Recommendation: 0xB71B000 (192 MB)
Value: SystemPages
Recommendation: 0x0
The value ranges listed in Table 9-5 equate to those calculated in Table 9-6, depending on
the exact amount of physical RAM in the machine. As most servers today have more than
512 MB of RAM, the calculations in Table 9-7 take into account only 512 MB of RAM and
above.
The appropriate value should be determined from Table 9-5 and then entered into the registry
value IoPageLockLimit. This value will then take precedence over the system default of 512
KB and will specify the maximum number of bytes that can be locked for I/O operations:
Value: IoPageLockLimit
Physical RAM   Recommended IoPageLockLimit
512 MB         0x1C000000
1 GB           0x3C000000
2 GB           0x80000000
4 GB           0xFC000000
8 GB           0xFFFFFFFF
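A hedged sketch of setting this value from the command line (the key path follows common
NT memory-management tuning references; verify it for your Windows release before
applying it):
REM Set IoPageLockLimit for a server with 1 GB of RAM
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v IoPageLockLimit /t REG_DWORD /d 0x3C000000 /f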
Click OK, quit Registry Editor, and then shut down and restart the computer.
For each HBA, there are BIOS and driver settings that are appropriate for connecting to your
DS6000. If these settings are not configured correctly, performance can suffer, or the
connection might not work properly.
To configure the HBA, see IBM TotalStorage DS6000 Host Systems Attachment Guide,
GC26-7680, which contains detailed procedures and recommended settings. You should
also read the readme files and manuals for the driver, BIOS, and HBA.
You can get the driver, BIOS, and HBA-related information from the following link:
http://knowledge.storage.ibm.com/servers/storage/support/hbasearch/interop/hbaSearch.do
Note: When configuring the HBA, we strongly recommend that you install the newest
version of the driver and BIOS. Newer versions include enhancements and problem fixes
that can improve performance and RAS.
The logging feature of the Performance console makes it possible to store, append, chart,
export, and analyze data captured over time. Products such as SQL Server and Exchange
provide additional monitors that allow the Performance console to extend its usefulness
beyond the operating system level.
The Performance console is a snap-in for Microsoft Management Console (MMC). The
Performance console is used to access the System Monitor and Performance Logs and Alerts
tools.
The Performance console can be opened by clicking Start -> Programs -> Administrative
Tools -> Performance or by typing PERFMON on the command line.
In Figure 9-15 on page 336 we see the System Monitor. The System Monitor can be used to
view real-time or logged data of objects and counters. Performance Logs and Alerts can be
used to log objects and counters, and create alerts.
Alerts can be configured to notify the user or write the condition to the system event log based
on thresholds.
There are three ways to view the real-time or logged data counters:
Chart
This view displays performance counters in response to real-time changes or processes
logged data to build a performance graph.
Histogram
This view displays bar graphics for performance counters in response to real-time
changes or logged performance data. It is useful for displaying peak values of the
counters.
Report
This view displays only numeric values of objects or counters. It can be used for displaying
real-time activity or displaying logged data results. It is useful for displaying many
counters.
Tuning these key objects will greatly improve the performance of disk I/O.
Physical disks: Average disk queue length
The number of requests for disk access. The general rule of thumb is that the total average
disk queue length should be less than or equal to three. It may be important to note the
actual number of spindles in a hardware RAID set and multiply the number of spindles by the
average disk queue length.
Logical disks: Current disk queue length
The current number of requests for access to the logical disk device.
Windows Server 2003 has both the logical and physical disk counters enabled by default.
In Windows 2000, physical disk counters are enabled by default. The logical disk performance
counters are disabled by default and may be required for some monitoring applications. If you
require the logical counters, you can enable them by typing the command DISKPERF -yv
then restarting the computer.
Keeping this setting on all the time draws about 2-3% CPU, but if your CPU is not a
bottleneck, this is irrelevant and can be ignored. Enter DISKPERF /? for more help on the
command.
Note: Physical drive counters should be used if the system is using hardware RAID, such
as the DS6000.
Performance console disk counters are available with either the LogicalDisk or PhysicalDisk
objects:
For non-DS6000 RAID disks, LogicalDisk monitors the operating system partitions of
physical drives. It is useful to determine which partition is causing the disk activity, possibly
indicating the application or service that is generating the requests. PhysicalDisk monitors
the individual hard disk drives, and is useful for monitoring disk drives as a whole.
For the DS6000 (all disks are RAID disks), LogicalDisk monitors the operating system
partitions (if any), while PhysicalDisk monitors the logical drives created from the DS6000
RAID arrays.
Tip: When attempting to analyze disk performance bottlenecks, you should always use
physical disk counters.
Physical Disk: Avg. Disk Queue Length
This is the average number of both read and write requests queued to the selected disk
during the sample interval.
If this value is consistently over 2-3 times the number of disks in the array
(for example, 8-12 for a 4-disk array), it indicates that the application is
waiting too long for disk I/O operations to complete. To confirm this
assumption, always check the Avg. Disk Second/Transfer counter.
Also, the Avg. Disk Queue Length counter is a key counter for determining if
a disk bottleneck can be alleviated by adding disks to the array. Remember,
adding disks to an array only results in increased throughput when the
application can issue enough multiple requests to the array to keep all disks
in the array busy. For optimal disk performance, we want the Avg. Disk
Queue Length to be no more than 2 or 3 times the number of physical disks
in the array.
Also, in most cases the application has no knowledge of how many disks are
in an array because this information is hidden from the application by the
disk array controller. So unless an application configuration parameter is
available to adjust the number of outstanding I/O commands, an application
will simply issue as many disk I/Os as it needs to accomplish its work, up to
the limit supported by the application and/or disk device driver.
Before adding disks to an array to improve performance, always check the
Avg. Disk Queue Length counter and only add enough disks to satisfy the
2-3 disk I/Os per physical disk rule. For example if the array shows an Avg.
Disk Queue Length of 30 then an array of at most 10-15 disks should be
used.
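The arithmetic behind this rule of thumb is simple enough to script. The following Python sketch is purely illustrative; the function name and sample value are ours, not part of any Microsoft or IBM tool:

def max_useful_disks(avg_disk_queue_length):
    # Rule of thumb: each physical disk can usefully absorb 2-3
    # outstanding I/Os, so dividing the queue length by 3 and by 2
    # brackets the array size the workload can actually keep busy.
    return (avg_disk_queue_length // 3, avg_disk_queue_length // 2)

# Example from the text: an Avg. Disk Queue Length of 30 justifies
# an array of at most 10-15 disks.
print(max_useful_disks(30))  # (10, 15)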
Physical Disk: Avg. Disk Bytes/Transfer
This is the average number of bytes transferred to or from the disk during write or read operations. This counter can be used as an indicator of the stripe size that should be used for optimal performance. For example, always create disk arrays with a stripe size that is at least as large as the average disk bytes per transfer counter value, as measured over an extended period of time.
Physical Disk: Avg. Disk sec/Transfer
The Avg. Disk sec/Transfer is a key counter that indicates the health of the disk subsystem. This is the time to complete a disk I/O operation. For optimal performance, this should be less than 20-25 ms for non-clustered systems, and no higher than 40-50 ms for clustered disk configurations. In general, this counter can grow very high when there are insufficient numbers of disks, slow disks, a poor physical disk layout, or severe disk fragmentation.
Memory: Pages/second
This is the number of pages read from the disk or written to the disk to resolve memory references to pages that were not in memory at the time of the reference. A high value indicates disk activity due to insufficient memory; add more RAM to your server.
The product of this counter and Physical Disk: Avg. Disk sec/Transfer is an approximation of the amount of disk time spent on paging file activity during the sampling period. If it exceeds 0.1 (10 percent), you may have excessive paging.
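Because this check is just the product of two counters, it is easy to script. A minimal Python sketch, using hypothetical sample values:

def paging_disk_fraction(pages_per_sec, avg_disk_sec_per_transfer):
    # Approximate fraction of disk time spent on paging file
    # activity during the sampling period.
    return pages_per_sec * avg_disk_sec_per_transfer

# Hypothetical sample: 120 pages/sec at 2 ms per transfer.
fraction = paging_disk_fraction(120, 0.002)
print(f"{fraction:.0%}")  # 24%
print("excessive paging" if fraction > 0.1 else "paging is acceptable")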
Note: Do not use the % Disk Time physical disk counter. This is the percentage of elapsed
time that the selected disk drive is busy servicing read or write requests. The counter is
only useful with IDE drives, which, unlike SCSI disks, can only perform one I/O operation at
a time. % Disk Time is derived by assuming the disk is 100 percent busy when it is
processing an I/O and 0 percent busy when it is not. The counter is a running average of
the 100 percent versus 0 percent count (binary).
The DS6000 can perform many hundreds or thousands of I/Os per second before it encounters bottlenecks. Most array controllers can perform two to three disk I/Os per drive
before a bottleneck occurs. For example, if an array controller with 60 drives has one disk
I/O to perform at all times it will be 100% utilized according to the % Disk Time counter.
However, that array could actually be issuing 120-180 I/Os before a true bottleneck occurs.
Figure 9-17 shows a sample chart setting for finding disk bottlenecks.
6. In the General tab, Sample data every: is used to set how frequently you capture the
data. If you capture many counters from a local or remote computer, you should use long
intervals; otherwise, you may run out of disk space or consume too much network
bandwidth.
7. In the Run As field, input the account with sufficient rights to collect the information about
the server to be monitored and then click Set Password to input the relevant password.
8. The Log Files tab, shown in Figure 9-19 on page 345, lets you set the type of the saved
file, the suffix that is appended to the file name and an optional comment. You can use two
types of suffix in a file name: numbers or dates. The log file types are listed in Table 13-3.
If you click Configure... then you can also set the location, file name, and file size for a log
file.
Text file - TSV: Tab-delimited log file (TSV extension). Use this format to export the data to a spreadsheet program.
9. The Schedule tab shown in Figure 9-20 on page 346 lets you specify when this log is
started and stopped. You can select the option box in the start log and stop log section to
manage this log manually using the Performance console shortcut menu. You can
configure to start a new log file or run a command when this log file closes.
This log settings file can then be opened with Internet Explorer. You can also use the pop-up menu to start, stop, and save the logs, as shown in Figure 9-21 on page 347.
3. At the System Monitor Properties dialog box, select the Data tab. You should now see any counters that you specified when setting up the Counter Log, as shown in Figure 9-23 on page 349. If you only selected counter objects, then the Counters section will be empty. To add counters from an object, simply click Add... and then select the appropriate ones.
Tip: Depending on how long the counter log file was running, there will be quite a lot of
data to observe. If you are interested in looking at a certain time frame when the log file
was recording data, complete these steps:
1. Click the Properties icon on the System Monitor toolbar.
2. The System Monitor Properties box will open; click the Source tab.
3. Select the time frame you want to view (see Figure 9-22) and click OK.
The Windows Resource Kit also contains INTFILTR, which is an interrupt binding tool that
allows you to bind device interrupts to specific processors on SMP servers. This is a useful
technique for maximizing performance, scaling, and partitioning of large servers. It can
provide a network performance increase of up to 20 percent.
Figure 9-24 shows that Task Manager has three views: Applications, Processes, and
Performance. The latter two are of interest to us in this discussion.
Processes tab
In this view (see Figure 9-24) you can see the resources being consumed by each of the processes currently running. You can click a column heading to sort the list by that column.
Click View -> Select Columns. This displays the window shown in Figure 9-25 on page 351,
from which you can select additional data to be displayed for each process.
Table 9-10 shows the columns available in the Windows Server 2003 operating system that
are related to disk I/O.
Paged Pool: The paged pool (user memory) usage of each process. The paged pool is virtual memory available to be paged to disk. It includes all of the user memory and a portion of the system memory.
Non-Paged Pool: The amount of memory reserved as system memory and not pageable for this process.
Base Priority: The process’s base priority level (low/normal/high). You can change the process’s base priority by right-clicking it and selecting Set Priority. This remains in effect until the process stops.
I/O Reads: The number of read input/output (file, network, and disk device) operations generated by the process.
I/O Read Bytes: The number of bytes read in input/output (file, network, and disk device) operations generated by the process.
I/O Writes: The number of write input/output (file, network, and disk device) operations generated by the process.
I/O Write Bytes: The number of bytes written in input/output (file, network, and device) operations generated by the process.
I/O Other: The number of input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).
I/O Other Bytes: The number of bytes transferred in input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).
The charts show you the CPU and memory usage of the system as a whole. The bar charts
on the left show the instantaneous values, and the line graphs on the right show the history
since Task Manager was started.
Many of the tools that used to be in the Windows 2000 support tools or Resource Kit have
now been included in the standard Windows Server 2003 build. For example, the typeperf
command that used to be part of the Windows 2000 resource kit is now included as standard
in Windows Server 2003. Table 13-6 lists a number of these tools and provides the
executable, where the tool is installed and a brief description.
Empty working set (empty.exe, Resource Kit): Frees the working set of a specified task or process.
9.9 Iometer
Iometer is an I/O subsystem measurement and characterization tool for single and clustered
systems. Formerly, Iometer was owned by Intel Corporation, but Intel has discontinued work
on Iometer and it was given to the Open Source Development Lab. For more information
about Iometer, go to:
http://www.iometer.org/
Iometer is both a workload generator (it performs I/O operations in order to stress the system) and a measurement tool (it examines and records the performance of its I/O operations and their impact on the system). It can be configured to emulate the disk or network I/O load of any program or benchmark, or it can be used to generate entirely synthetic I/O loads.
Iometer is the controlling program. Using Iometer’s graphical user interface, you configure the
workload, set operating parameters, and start and stop tests. Iometer tells Dynamo what to
do, collects the resulting data, and summarizes the results in output files. Only one copy of
Iometer should be running at a time. It is typically run on the server machine.
Dynamo is the workload generator. It has no user interface. At Iometer’s command, Dynamo
performs I/O operations and records performance information, then returns the data to
Iometer. There can be more than one copy of Dynamo running at a time. Typically one copy
runs on the server machine and one additional copy runs on each client machine. Dynamo is multithreaded; each copy can simulate the workload of multiple client programs. Each running copy of Dynamo is called a manager, and each thread within a copy of Dynamo is called a worker.
It also provides I/O load balancing. For each I/O request, SDD dynamically selects one of the
available paths to balance the load across all possible paths.
To receive the benefits of path balancing, ensure that the disk drive subsystem is configured so that there are multiple paths to each LUN. Doing this will not only enable performance benefits from the SDD path balancing, but also prevent loss of access to data in the event of a path failure.
The Subsystem Device Driver is discussed in further detail in 5.6, “Subsystem Device Driver
(SDD) - multipathing” on page 157.
In this chapter we describe these performance features and other enhancements that enable
performance improvements when migrating your workload to a DS6000. We also show some
monitoring tools and describe how to use them for the DS6000.
Specifically for the zSeries servers, the DS6000 is a disk subsystem with a very good price/performance ratio. It should be able to handle sequential workloads, such as data mining and work volumes, better than an ESS 800. Very large database applications may not work as well on the DS6000; in that case you will need to use a DS8000.
The DS6000 features that have performance implications in the application I/O activity are
described in the following sections:
Parallel Access Volumes
Multiple Allegiance
I/O Priority Queuing
Logical volume sizes
FICON
In the following sections of this chapter we describe these DS6000 features and discuss how
they can be used to boost the performance of your zSeries environment.
Traditionally, access to highly active volumes has involved manual tuning, splitting data across multiple volumes, and other techniques to avoid hot spots. With PAV and the z/OS
Workload Manager, you can now almost forget about manual device level performance tuning
or optimizers. The Workload Manager is able to automatically tune your PAV configuration
and adjust it to workload changes. The DS6000 in conjunction with z/OS has the ability to
meet the highest performance requirements.
PAV is implemented by defining alias addresses to the conventional base address. The alias
address provides the mechanism for z/OS to initiate parallel I/O to a volume. As its name
implies, an alias is just another address/UCB that can be used to access the volume defined
on the base address. An alias can only be associated with a base address defined in the
same LCU. The maximum number of addresses you can define in an LCU is 256.
Theoretically you can define 1 base address plus 255 aliases in an LCU.
With dynamic PAV, you do not need to assign as many aliases in an LCU as compared to a
static PAV environment, because the aliases will be moved around to the base addresses that
need an extra alias to satisfy an I/O request.
WLM manages PAVs across all the members of a Sysplex. When making decisions on alias
reassignment, WLM considers I/O from all systems in the Sysplex. By default, the function is
turned off, and must be explicitly activated for the Sysplex through an option in the WLM
service definition, and through a device level option in HCD. Dynamic alias management
requires your Sysplex to run in WLM Goal mode.
As a rule-of-thumb, the numbers in Table 10-1 can be used to determine how many aliases
you need for each volume in a dynamic or static PAV environment. When using large volumes
and these guidelines, you may be able to use less than 256 addresses per LCU.
Table 10-1 Rule-of-thumb for number of aliases for various 3390 sizes

Volume size (cylinders)    Dynamic PAV aliases    Static PAV aliases
1 - 3,339                  1/3                    1
6,679 - 10,017             1                      3
23,374 - 30,051            2                      6
50,086 - 60,102            3                      9
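As an illustration of applying this table when planning an LCU, the following Python sketch (our own helper; the cylinder ranges and alias counts are transcribed from Table 10-1) estimates how many of the 256 LCU addresses a set of volumes consumes:

import math

# (upper bound in cylinders, dynamic PAV aliases, static PAV aliases)
ALIAS_RULES = [
    (3_339, 1 / 3, 1),
    (10_017, 1, 3),
    (30_051, 2, 6),
    (60_102, 3, 9),
]

def aliases_per_volume(cylinders, dynamic=True):
    # Return the rule-of-thumb alias count for one volume; sizes that
    # fall between the table rows are rounded up to the next row.
    for upper, dyn, stat in ALIAS_RULES:
        if cylinders <= upper:
            return dyn if dynamic else stat
    raise ValueError("volume is larger than the table covers")

# 64 3390-9 volumes (10,017 cylinders) with dynamic PAV:
volumes = 64
addresses = volumes + math.ceil(volumes * aliases_per_volume(10_017))
print(addresses)  # 128 of the 256 addresses available in the LCU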
The DS6000 accepts multiple parallel I/O requests from different hosts to the same device
address, increasing parallelism and reducing channel overhead.
With Multiple Allegiance (MA), the requests are accepted by the DS6000 and all requests will
be processed in parallel, unless there is a conflict when writing data to the same extent of the
CKD logical volume. Still, good application access patterns can improve the global parallelism
by avoiding reserves, limiting the extent scope to a minimum, and setting an appropriate file
mask, for example, if no write is intended.
In systems without Multiple Allegiance, all except the first I/O request to a shared volume are
rejected, and the I/Os are queued in the zSeries channel subsystem, showing up as PEND
time in the RMF reports.
The DS6000's ability to run channel programs to the same device in parallel can dramatically
reduce the IOSQ and the PEND time components in shared environments.
First we will look at a disk subsystem that does not support both of these functions. If there is
an outstanding I/O operation to a volume, all subsequent I/Os will have to wait as illustrated in
Figure 10-1 on page 361. I/Os coming from the same LPAR will wait in the LPAR and this wait
time is recorded in IOSQ Time. I/Os coming from different LPARs will wait in the disk control
unit and be recorded in Device Busy Delay Time, which is part of PEND Time.
In the ESS and DS6000, all these I/Os will be executed concurrently using PAV and Multiple
Allegiance, as shown in Figure 10-2 on page 361. I/O from the same LPAR will be executed
concurrently using UCB 1FF that is an alias of base address 100. I/O from a different LPAR
will be accepted by the disk control unit and executed concurrently. All these I/O operations
will be satisfied from either the cache or one of the DDMs on a Rank where the volume
resides.
Figure 10-1 Without PAV or Multiple Allegiance: only one I/O to one volume at one time
Figure 10-2 With PAV and Multiple Allegiance: Appl.A and Appl.B on z/OS 1 access volume 100 concurrently through UCB 100 and its alias UCB 1FF, while Appl.C on z/OS 2 accesses the same volume through its own UCB 100
Note: The domain of an I/O covers the specified extents to which the I/O operation applies.
It is identified by the Define Extent command in the channel program. The domain covered
by the Define Extent used to be much larger than the domain covered by the I/O operation.
When concurrent I/Os to the same volume were not allowed, this was not an issue, since subsequent I/Os had to wait anyway.
With the availability of PAV and Multiple Allegiance, such a large domain could prevent multiple I/Os from being executed concurrently. This extent conflict can occur when multiple I/O operations try
to execute against the same domain on the volume. The solution is to update the channel
programs so that they minimize the domain that each channel program is covering. For a
random I/O operation the domain should be the one track where the data resides.
If a write operation is being executed, then any read or write to the same domain will have to
wait. The same case will happen if a read to a domain starts, then subsequent I/Os that want
to write to the same domain will have to wait until the read operation is done.
To summarize, all reads can be executed concurrently, even if they are going to the same
domain on the same volume. A write operation cannot be executed concurrently with any
other read or write operations that access the same domain on the same volume. The
purpose of serializing a write operation to the same domain is to maintain data integrity.
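These serialization rules can be captured in a few lines of Python. This is a sketch of the rules as just stated, not of the actual DS6000 microcode:

def can_execute_concurrently(op1, op2, same_domain):
    # Reads never conflict with reads; a write conflicts with any
    # other I/O that touches the same domain on the same volume.
    if not same_domain:
        return True
    return op1 == "read" and op2 == "read"

assert can_execute_concurrently("read", "read", same_domain=True)
assert not can_execute_concurrently("write", "read", same_domain=True)
assert can_execute_concurrently("write", "write", same_domain=False)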
Channel programs that cannot execute in parallel are processed in the order they are queued.
A fast system cannot monopolize access to a device also accessed from a slower system.
Each system gets a fair share.
The DS6000 can also queue I/Os from different z/OS system images in a priority order. z/OS
Workload Manager can make use of this and prioritize I/Os from one system against the
others. You can activate I/O Priority Queuing in WLM Goal mode with the I/O priority
management option in the WLM’s Service Definition settings.
When a channel program with a higher priority comes in and is put ahead of the queue of
channel programs with lower priorities, the priorities of the lower priority programs will be
increased. This prevents high priority channel programs from dominating lower priority ones
and gives each system a fair share.
When planning the configuration, you should also consider future growth. This means that you may want to define more alias addresses than currently needed, so that in the future you can add an additional Rank to this LCU, if needed.
Figure 10-5 shows the number of volumes that can be defined on a (6+P) RAID 5 Rank for
different 3390 models. It is obvious that if you define 3390-3 volumes on a 146 GB DDM
Rank, you cannot define all the 291 volumes on one LCU due to the 256 address limitation on
the LCU. In this case you will have to define multiple LCUs on that Rank. A better option
would be to use the bigger 3390 models, especially if you have multiple Ranks that you want
to define under one LCU.
Figure 10-5 Number of volumes that fit on a (6+P) RAID 5 rank for 73 GB, 146 GB, and 300 GB DDMs, by 3390 model (3390-3, 3390-9, 3390-27, and 3390-54)
Note: Even though the benchmarks were performed on an ESS F20, the comparative
results should be similar on the DS6000.
Random workload
The measurements for DB2 and IMS™ online transaction workloads showed that there was only a slight difference in device response time between a configuration of six 3390-27 volumes and a configuration of sixty 3390-3 volumes of equal capacity on the ESS F20 using FICON channels.
The measurements for DB2 are shown in Figure 10-6. It should be noted that even when the
device response time for a large volume configuration is higher, the online transaction
response time could sometimes be lower due to the reduced system overhead of managing
fewer volumes.
Figure 10-6 DB2 workload: device response time (msec) versus total I/O rate (I/Os per second) for the 3390-3 and 3390-27 configurations
The measurements were carried out so that all volumes were initially assigned with zero or
one alias. WLM dynamic alias management then assigned additional aliases as needed. The
number of aliases at the end of the test run reflects the number that was adequate to keep
IOSQ down. For this DB2 benchmark, the alias assignment done by WLM resulted in an
approximately 4:1 reduction in the total number of UCBs used.
Sequential workload
Figure 10-7 on page 366 shows elapsed time comparisons between nine 3390-3s versus one
3390-27 when a DFSMSdss™ full volume physical dump and full volume physical restore are
executed. The workloads were run on a 9672-XZ7 processor connected to an ESS F20 with
eight FICON channels. The volumes are dumped to or restored from a single 3590E tape with
an A60 Control Unit with one FICON channel. No PAV aliases were assigned to any volumes
for this test, even though an alias could have improved the performance.
Figure 10-7 Elapsed time for full volume dump and full volume restore: nine 3390-3s versus one 3390-27
Larger volumes
To avoid potential I/O bottlenecks when using large volumes you may also consider the
following recommendations:
Use of PAVs to reduce IOS queuing.
Parallel Access Volume (PAV) is of key importance when using large volumes. PAV
enables one z/OS system image to initiate multiple I/Os to a device concurrently. This
keeps IOSQ times down even with many active data sets on the same volume. PAV is a
practical must with large volumes. In particular, we recommend using dynamic PAVs.
Multiple Allegiance is a function that the DS6000 automatically provides.
Multiple Allegiance automatically allows multiple I/Os from different z/OS systems to be
executed concurrently. This will reduce the Device Busy Delay time, which is part of PEND
time.
Eliminate unnecessary reserves.
As the volume sizes grow larger, more data and data sets will reside on a single CKD
device address. Thus, the larger the volume, the greater the multi-system performance
impact will be when serializing volumes with RESERVE processing. You need to exploit a
Global Resource Serialization (GRS) Star Configuration and convert all RESERVEs
possible into system ENQ requests.
10.7 FICON
FICON provides several benefits as compared to ESCON, from the simplified system
connectivity to the greater throughput that can be achieved when using FICON to attach the
host to the DS6000.
FICON allows you to significantly reduce the batch window processing time. Response time
improvements may accrue particularly for data stored using larger block sizes. The data
transfer portion of response time is greatly reduced because of the much higher data rate
during transfer with FICON. This improvement leads to significant reductions in the connect
time component of the response time. The larger the transfer, the greater the reduction as a
percentage of the total I/O service time.
The pending time component of the response time, that is caused by director port busy, is
totally eliminated because collisions in the director are eliminated with the FICON
architecture. For users whose ESCON directors are experiencing as much as 45–50 percent
busy conditions, this will provide significant response time reduction.
Another performance advantage delivered by FICON is that the DS6000 accepts multiple
channel command words (CCWs) concurrently without waiting for completion of the previous
CCW. This allows setup and execution of multiple CCWs from a single channel to happen
concurrently. Contention among multiple I/Os accessing the same data is now handled in the
FICON host adapter, and queued according to the I/O priority indicated by the Workload
Manager.
Significant performance advantages can be realized by users accessing the data remotely.
FICON eliminates data rate droop effect for distances up to 100 km for both read and write
operations by using enhanced data buffering and pacing schemes. FICON thus extends the
DS6000’s ability to deliver high bandwidth potential to the logical volumes needing it, when
they need it.
For additional information about FICON, see 5.3, “FICON” on page 146.
The MIDAW facility is a modification to a channel programming technique that has existed
since S/360™ days. MIDAWs are a new method of gathering/scattering data into/from
non-contiguous storage locations during an I/O operation. There is no tuning needed to use
this MIDAW facility. The requirements to be able to take advantage of this MIDAW facility are:
z9 server.
Applications that use Media Manager.
Applications that use long chains of small blocks.
The biggest performance benefit comes with FICON Express2 channels running on 2 Gb
links.
Compared to ESCON channels, using FICON channels will improve performance. This
performance improvement is more significant for I/Os with bigger block sizes, because FICON
channels can transfer data much faster, which will reduce the connect time. The improvement
for I/Os with smaller block sizes is not as significant. In these cases where chains of small
records are processed, MIDAWs can significantly improve FICON Express2 performance if
the I/Os use Media Manager.
Figure 10-8 shows the hypothetical performance of long chains of short blocks (lcsb)
workload. Here we can see the effect of MIDAWs on lcsb workload. As the chart shows,
MIDAWs can double the throughput of lcsb as compared to when MIDAWs are not used.
Figure 10-8 Throughput of the lcsb workload versus channel utilization (%)
Figure 10-9 shows the maximum throughput of a FICON port on the DS6000 as compared to
the maximum throughput of FICON channels on the zSeries servers. Considering that the maximum throughput of a DS6000 FICON port is higher than that of a FICON Express channel, and not much lower than that of a FICON Express2 channel, in general we do not recommend daisy chaining several DS6000s to the same FICON channels on the zSeries host.
Note: Daisy chaining is connecting FICON ports from multiple DS6000s to the same
FICON channel on the zSeries server.
Figure 10-9 Maximum throughput (MB/sec): DS6000 FICON port, FICON Express, and FICON Express2
Figure 10-10 shows configuration A with no daisy chaining. In this configuration we see that
each DS6000 uses four FICON ports and each port is connected to a separate FICON
channel on the host. In this case, we have two sets of four FICON ports connected to eight
FICON channels on the zSeries host.
In configuration B, we double the number of FICON ports on both DS6000s and keep the same number of FICON channels on the zSeries server. We can now connect each FICON channel to two FICON ports, one on each DS6000. The advantages of configuration B are:
Workload from each DS6000 will now be spread across more FICON ports. This should
lower the load on the FICON ports and FICON Host Adapters.
Any imbalance in the load that is going to the two DS6000s will now be spread more
evenly across the eight FICON channels.
Figure 10-10 Configuration A (no daisy chaining) and configuration B. Assumption: each line from a FICON channel in the CEC and each line from a FICON port in the DS6000 represents a set of four paths
An LCU can have all of its volumes defined in one Extent Pool, or it can be defined to span multiple Extent Pools. A setup with one Rank per Extent Pool will make it easier to monitor performance, because the performance monitoring tools, like RMF, produce performance statistics by Extent Pool and by Rank. This way it is simpler to identify which Rank belongs to which LCU.
Sharing resources in a DS6000 has advantages from a storage administration and resource
sharing perspective, but does have some implications for workload planning. Resource
sharing has the benefit that a larger resource pool (for example, disk drives or cache) is
available for critical applications. However, some care should be taken to ensure that
uncontrolled or unpredictable applications do not interfere with mission-critical work.
If you have a workload that is truly mission-critical, you may want to consider isolating it from
other workloads, particularly if those other workloads are very unpredictable in their
demands. There are several ways to isolate the workloads:
Place the data on separate DS6000s. This is, of course, the best choice.
Place the data on separate DS6000 servers. This will isolate use of memory buses, microprocessors, and cache resources. However, before doing that, make sure that half a DS6000 provides sufficient performance to meet the needs of your important application. Note that Disk Magic provides a way to model the performance of half a DS6000 by specifying the Failover Mode. Consult your IBM representative for a Disk Magic analysis.
Place the data behind separate device adapters.
Place the data on separate Ranks. This will reduce contention for use of DDMs.
Note: z/OS and open systems data can only be placed on separate Extent Pools.
10.9.1 RMF
RMF provides performance information for the DS6000 and other disk subsystems for z/OS users. RMF Device Activity reports account for all activity to a base address and all of its associated alias addresses. Activity on alias addresses is not reported separately; it is accumulated into the base address. RMF also reports the number of PAV addresses in use for each device (the base plus its currently assigned aliases; see the description of the PAV field below).
RMF cache statistics are collected by volume and reported by volume and by LCU. To check the status of the whole cache, you have to check the cache reports of all the LCUs defined on the DS6000.
An Extent Pool, which is a new concept that comes with the DS6000, also has performance
statistics related to it.
TOTAL SAMPLES = 300 IODF = 99 CR-DATE: 07/15/2005 CR-TIME: 11.17.22 ACT: ACTIVATE
DEVICE AVG AVG AVG AVG AVG AVG AVG % % % AVG % %
STORAGE DEV DEVICE VOLUME PAV LCU ACTIVITY RESP IOSQ CMR DB PEND DISC CONN DEV DEV DEV NUMBER ANY MT
GROUP NUM TYPE SERIAL RATE TIME TIME DLY DLY TIME TIME TIME CONN UTIL RESV ALLOC ALLOC PEND
6900 33909 DS6900 6 013E 348.056 5.1 0.0 0.2 0.0 0.2 4.3 0.5 3.16 28.30 0.0 16.4 100.0 0.0
6901 33909 DS6901 2 013E 177.213 6.9 2.0 0.2 0.0 0.2 4.2 0.5 4.41 41.51 0.0 16.4 100.0 0.0
6902 33909 DS6902 2 013E 177.926 6.3 1.6 0.2 0.0 0.2 4.1 0.5 4.39 40.86 0.0 16.4 100.0 0.0
6903 33909 DS6903 2 013E 178.203 6.9 2.0 0.2 0.0 0.2 4.2 0.5 4.64 41.64 0.0 16.4 100.0 0.0
6904 33909 DS6904 1 013E 58.675 11.2 5.7 0.2 0.0 0.2 4.6 0.6 3.81 30.86 0.0 16.4 100.0 0.0
6905 33909 DS6905 1 013E 59.339 11.0 5.7 0.2 0.0 0.2 4.4 0.6 3.72 30.10 0.0 16.4 100.0 0.0
6906 33909 DS6906 1 013E 59.362 9.5 4.4 0.2 0.0 0.2 4.3 0.6 3.37 29.15 0.0 16.4 100.0 0.0
6907 33909 DS6907 1 013E 59.582 10.7 5.5 0.2 0.0 0.2 4.4 0.6 3.70 29.93 0.0 16.4 100.0 0.0
6908 3390 DS6908 1 013E 58.519 12.4 7.0 0.2 0.0 0.2 4.5 0.7 4.00 30.41 0.0 16.4 100.0 0.0
6909 3390 DS6909 1 013E 59.022 11.1 5.5 0.2 0.0 0.2 4.7 0.7 4.15 31.88 0.0 16.4 100.0 0.0
PAV
This is the base address plus the number of aliases assigned to that base address. An
asterisk (*) following the PAV number indicates that during this RMF interval, the number of
aliases assigned to that base address has changed, either increased or decreased.
Increases and decreases are done on demand. You might find a large number of aliases
assigned to a device with a zero I/O rate if no other volume on the LSS needs aliases. To
determine the average number of UCB / Aliases held during the measurement, multiply the
DEVICE ACTIVITY RATE by (PEND+DISC+CONN) and divide by 1000. In Example 10-1, SC2A00 did 2.033 operations per second, each holding the UCB or one of the 2 aliases assigned for 16.7 ms per operation. That is 33.95 milliseconds per second, or 0.03395 seconds per second, or an average of 0.03395 UCBs/aliases in use. % DEV UTIL shows that 0.03395 out of 3 PAV (about 1.1 percent) of the PAV is actually used. It is not likely there will be much IOSQ when operating at such low levels. However, there are Database Management Systems (DBMSs) that will cause queuing activity at very low levels of activity. For DB2 work files, hundreds of requests might be made to the same tablespace instantaneously.
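The UCB calculation above is easy to reproduce. In the following Python sketch, the rate and the 16.7 ms total are the figures quoted in the text; the split of that total across PEND, DISC, and CONN is a hypothetical example:

def avg_ucbs_in_use(io_rate, pend_ms, disc_ms, conn_ms):
    # DEVICE ACTIVITY RATE is in I/Os per second; PEND, DISC, and
    # CONN are in milliseconds, hence the division by 1000.
    return io_rate * (pend_ms + disc_ms + conn_ms) / 1000

# 2.033 I/Os per second, each holding a UCB/alias for 16.7 ms.
in_use = avg_ucbs_in_use(2.033, pend_ms=0.7, disc_ms=11.0, conn_ms=5.0)
print(round(in_use, 5))      # 0.03395 UCBs/aliases in use on average
print(f"{in_use / 3:.1%}")   # about 1.1% of a 3-address PAV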
PEND time
Pend time represents the time an I/O request waits in the hardware. This PEND time can be
increased by:
High channel utilization. More channels will be required.
I/O Processor (IOP) contention at the zSeries host. More IOPs may be needed. The IOP is the processor that is assigned to handle I/Os. If only certain IOPs are saturated, then redefining the channels used by the control units can help balance the load across the IOPs. For more information, see “Analyze I/O queuing activity” on page 374.
CMR Delay is part of PEND time. It is the initial selection time for the first command in a chain for a FICON channel. It can be elongated by contention downstream from the channel, such as a busy control unit.
Device Busy Delay is also part of PEND time. This is caused by a domain conflict,
because of a read or write operation against a domain that is in use for update. If there is
a high Device Busy Delay time it could have been caused by the domain of the I/O not
being limited to the track where the I/O operation is going to. If an ISV product is used,
asking the vendor for an updated version may help solve this problem.
DISC time
If the major cause of delay is the DISC time then you will need to do some further research to
find the cause. The most probable cause of high DISC time is having to wait while data is
being staged from the DS6000 Array into cache, because of a read miss operation. This time
can be elongated by:
Low read hit ratio. The lower the read hit ratio, the more read operations will have to wait
for the data to be staged from the DDMs to the cache.
High DDM utilization. This can be verified from the RMF Rank report. See “Analyze Rank
statistics” on page 376. Look at the Rank read response time. As a rule-of-thumb (RoT)
this number should be less than 35 msec. If it is higher than that, it is an indication that this
Rank is too busy. If this happens, consider spreading the busy volumes to other Ranks
that are not as busy.
Persistent memory (NVS) full condition can also elongate the DISC time, see “Analyze
cache statistics” on page 375.
CONN time
For each I/O operation, the channel subsystem measures the time the DS6000, channel and
CEC were connected. At high levels of utilization significant time can be spent in contention,
rather than transferring data.
The BUS utilization is always greater than 5%, even if there is no I/O activity at all on the
channel. For small block transfers, the BUS utilization is less than the FICON channel
utilization, and for large block transfers, the BUS utilization is greater than the FICON channel
utilization.
IODF = 99 CR-DATE: 07/15/2005 CR-TIME: 11.17.22 ACT: ACTIVATE MODE: LPAR CPMF: EXTENDED MODE
-------------------------------------------------------------------------------------------------------------------------
DETAILS FOR ALL CHANNELS
-------------------------------------------------------------------------------------------------------------------------
CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC) CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC)
ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL
2E FC_S 2 Y 1.12 1.75 4.92 0.04 0.17 1.75 1.81 36 FC_? 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00
2F FC_S 2 Y 1.10 1.73 4.90 0.04 0.17 1.71 1.77 37 FC_? 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00
30 FC_S 2 Y 0.00 0.14 3.96 0.00 0.00 0.00 0.00 38 FC_S 2 Y 0.00 4.11 4.35 0.00 0.00 0.00 0.00
39 FC_S 2 Y 0.00 3.63 4.29 0.00 0.00 0.00 0.00 43 FC_S 2 Y 11.17 11.17 7.24 2.43 2.43 2.36 2.36
3A FC_S 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00 4A FC_S 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00
3B FC_S 2 Y 0.00 0.14 3.96 0.00 0.00 0.00 0.00 4B FC_S 2 Y 0.00 0.14 3.97 0.00 0.00 0.00 0.00
3C FC_S 2 Y 0.00 1.30 4.07 0.00 0.00 0.00 0.00 4C FC_S 2 Y 4.21 4.21 5.46 0.80 0.80 1.77 1.77
3D FC_S 2 Y 0.00 0.96 4.06 0.00 0.00 0.00 0.00 4D FC_S 2 Y 11.15 11.15 7.26 2.42 2.42 2.43 2.43
Open exchanges = I/O rate x (CMR + CONN + DISC) / (N x 1000)

where N = the number of channels in the path group. It is multiplied by 1000 because the I/O rate unit is in seconds and the CMR, CONN, and DISC times are in milliseconds.
Because RMF reports total I/O rate by LCU, this formula needs to be calculated for all LCUs that share the same path group, and also for all LPARs that are on the same CEC.
The number of Open Exchanges is limited to 32 on all CECs, but on the z9 processor this limit
is increased to 64.
FICON port concurrency = I/O rate x CONN / (N x 1000)

where N = the number of FICON ports used by the same path group. It is multiplied by 1000 because the I/O rate unit is in seconds and the CONN time unit is in milliseconds.
Because RMF reports total I/O rate by LCU, this formula needs to be calculated for all LCUs that share the same path group, and also for all LPARs on all CECs.
The RoT for this number is to keep it under two, and if it exceeds four, it means that the
FICON port is very overloaded. High FICON concurrency will increase the CONN time.
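Assuming the two formulas above, a short Python sketch (with hypothetical LCU totals) computes both metrics:

def open_exchanges(io_rate, cmr_ms, conn_ms, disc_ms, n_channels):
    # Average open exchanges per channel in the path group.
    return io_rate * (cmr_ms + conn_ms + disc_ms) / (n_channels * 1000)

def ficon_port_concurrency(io_rate, conn_ms, n_ports):
    # RoT: keep under 2; above 4, the FICON port is very overloaded.
    return io_rate * conn_ms / (n_ports * 1000)

# Hypothetical path group: 3500 I/Os per second over 4 channels/ports.
print(open_exchanges(3500, cmr_ms=0.2, conn_ms=1.5, disc_ms=4.0,
                     n_channels=4))                          # ~5.0
print(ficon_port_concurrency(3500, conn_ms=1.5, n_ports=4))  # ~1.3, OK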
A high DFW BYPASS count is an indication that persistent memory is overcommitted. DFW BYPASS actually means DASD Fast Write I/Os that are retried because persistent memory is full. Calculate the quotient of DFW BYPASS divided by the total I/O rate. As a RoT, if this number is higher than 1 percent, the write retry operations will have a significant impact on the DISC time.
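A sketch of that check, with hypothetical rates:

def dfw_bypass_ratio(dfw_bypass_rate, total_io_rate):
    # Fraction of I/Os retried because persistent memory (NVS) is
    # full; above 1 percent, write retries significantly elongate
    # the DISC time.
    return dfw_bypass_rate / total_io_rate

# Hypothetical rates: 25 DFW bypasses/sec out of 1800 I/Os/sec.
ratio = dfw_bypass_ratio(25, 1800)
print(f"{ratio:.2%}")  # 1.39% - write retries are hurting DISC time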
Check the Disk Activity part of the report; the read response time should be less than 35 msec. If it is higher than 35 msec, it is an indication that the DDMs on the Rank where this LCU resides are saturated.
Following the above report is the report by volume serial number, as shown in Example 10-5.
Here you can see to which Extent Pool each volume belongs. In the case where we have the following setup:
One Extent Pool has one Rank.
All volumes on an LCU belong to the same Extent Pool.
then it would be easier to do the analysis if a performance problem happens on the LCU. If we
look at the Rank statistics, see Example 10-6 on page 377, we know that all the I/O activity on
that Rank is coming from the same LCU. So we can concentrate the analysis on the volumes
on that LCU only.
Note: Depending on the DDM size used and the 3390 model selected, you can put
multiple LCUs on one Rank or you may also have an LCU that spans more than one Rank.
Do not worry if write response times are high. When NVS usage reaches a high water mark, data is written furiously until NVS usage is reduced to a low water mark. Multiple requests are queued for the same HDD, meaning response time could be more than double counted. Remember, the application that wrote the data was given device end long ago. After this flurry of activity there is a relatively long period of time doing nothing. High write response times are usually not an indication of a performance problem.
If an LCU uses multiple Extent Pools, then we can still see the backend performance of each
individual Extent Pool/Rank.
If there is a problem in a DS6000 which has multiple LCUs defined on an Extent Pool, then it
will be harder to determine which LCU is causing the problem. The LCU with the highest
response time may just be a victim and not the perpetrator of the problem. The perpetrator is
usually the LCU that is flooding the Rank with I/Os.
Identifying the cause of a problem will become more complicated if you define multiple Ranks
in an Extent Pool. This is because in the cache report, the volume is associated with an
Extent Pool, and not a Rank.
RMF Magic provides consolidated performance reporting about your z/OS Disk Subsystems,
from the point of view of those disk subsystems, rather than from the host perspective, even
when disk subsystems are shared between multiple Sysplexes. This disk-centric approach
makes it much easier to analyze the I/O configuration and performance: RMF Magic
automatically determines the I/O configuration from your RMF data, showing the relationship
between the disk subsystem serial numbers, SSID, LCUs, device numbers and device types.
With RMF Magic there is no need to compare printed RMF reports.
While RMF Magic reports are based on information from RMF records, the analysis and
reporting goes beyond what RMF provides: in particular it computes accurate estimates for
the read and write bandwidth (MB/s) for each disk subsystem and down to the device level.
With this unique capability RMF Magic can size the links in a future remote copy configuration
as it knows the bandwidth required for the links, both in I/O requests and in megabytes per
second, for each point in time.
RMF Magic consolidates the information from RMF records with channel, disk, LCU and
cache information into one view per disk subsystem, per SSID (LCU), and per storage group.
RMF Magic gives insight into your performance and workload data for each individual RMF
interval within a period selected for the analysis that can span weeks. Where RMF
postprocessor reports are sorted by host and LCU, RMF Magic reports are sorted by disk
subsystem and SSID (LSS). With this information you can plan migrations and consolidations
more effectively, because RMF Magic provides a detailed insight in the workload, both from a
disk subsystem and a storage group perspective.
RMF Magic’s graphical capabilities let you quickly find any hot spots and tuning
opportunities in your disk configuration. Based on user-defined criteria, RMF Magic will
automatically identify peaks within the analysis period. On top of that, the graphical reports
will make all peaks and anomalies stand out immediately, which allows you to explain peak
behavior correctly.
RMF Magic can be used not only to analyze subsystem performance for z/OS hosts: if your DS6000 is also providing storage for open systems hosts, the tool will also report on RAID rank statistics and host port link statistics. The DS6000 storage subsystem provides these open systems statistics to RMF whenever performance data is reported to RMF, and they are then available for reporting through RMF Magic. Of course, if you have a DS6000 that has only open systems activity and does not include z/OS 3390 volumes, then this data cannot be collected by RMF and will not be reported on by RMF Magic.
RMF Magic consists of two components that offer three main functions:
1. A Graphical User Interface (GUI) for the Windows platform that provides:
– A Run Control Center that provides an easy interface allowing the user to prepare, initiate, and supervise the execution of the batch component when executed on the Windows platform. The Run Control Center is also used to load the data into a Reporting database.
– A Reporter Control Center where the user can interactively analyze the data in the
Reporting database by requesting the creation of Microsoft Excel tables and charts.
2. A batch component that validates, reduces, extracts, completes and summarizes the
input. All of this is done in two steps: Reduce and Analyze, as described in the RMF Magic
Reduce and RMF Magic Analyze steps below. The batch component can be executed on
either the z/OS or Windows platform.
When preparing the data for processing by RMF Magic for Windows, it is important to be sure that the data is sorted by time stamp; the RMF Magic package comes with a sample set of JCL that can be used for sorting the data. You will want to gather the RMF data for all systems that have access to the disk subsystems that you are going to be studying.
We recommend that the input to the SORT step contains only the RMF records that are recorded by the SMF subsystem. This subset of SMF data can be obtained by executing the SMF dump program (IFASMFDP).
If you choose to do the reduce on the workstation, RMF Magic for Windows provides a utility
which should be used to package the RMF data for transfer to the workstation. This utility will
not only compress the data for more efficient data transfer, but it will also package the data so
that the Variable Blocked Spanned (VBS) records that are used for collecting the data at the
host do not result in data transfer problems.
The JCL that is used for either option is provided with the RMF Magic for Windows product.
The specific data that is presented in Figure 10-11 is not really relevant, but this set of charts
shows the power of being able to see a graphical representation of your storage subsystem
over time. These particular summary charts show I/O activity to a particular storage
subsystem in terms of data transfer (megabytes per second), I/O rates and response times,
and a view of the number of concurrent I/Os over time. This is a sampling of the standard
summary charts that are automatically created by the RMF Magic reporting tool. Additional
standard reports include backend RAID activity rates, read hit percentages, and a variety of
breakdowns of I/O response time components.
In addition to the graphical view of your performance data, RMF Magic also provides detailed spreadsheet views of important measurement data, with highlighting of spreadsheet cells for more visual access to the data. Figure 10-12 on page 382 is an example of the Performance Summary for a single disk subsystem (called a DSS in the RMF Magic tool).
For each of the data measurement points (for example, Column C shows I/O Rate while Column D shows Response Time), rows 10 through 15 show a summary of the information. Rows 10 through 12 show the maximum rate that was measured, the RMF interval in which that rate occurred, and which row of the spreadsheet shows that maximum number. Easy navigation to these maximum values is also provided.
For example, selecting cell D10 and then clicking Goto Max would move the spreadsheet
data view so that you would be looking at row 122 in about the middle of the screen to see the
average response times in the intervals surrounding the highest interval. This view would also
immediately provide you with a view of the I/O rate and data rate during those intervals of
higher response times.
Figure 10-12 also shows color-coded highlighting for cells that show the top intervals of
measurement data. For example, if you are viewing this text in color, you will see that column
F has cells that are highlighted in pink. The pink cells represent all of the measurement
intervals that have values higher than the 95th percentile. Once again, this is a feature of the
RMF Magic tool which provides visual access to potential performance hot spots represented
by your measurement data.
Figure 10-12 I/O and data rate summary for a single subsystem
Figure 10-13 on page 383 shows a spreadsheet similar in appearance to Figure 10-12 but in
this case shows the summary of the cache measurement data. Again, the tool will highlight
those cells which have the highest measurement intervals for ease of navigation within the
data.
Figure 10-14 on page 384 and Figure 10-15 on page 385 show additional views of how data can be analyzed within a single subsystem. Using the views shown in Figure 10-14 on page 384, it is possible to quickly look into the busiest logical control units within the subsystem. By using data as represented by Figure 10-15 on page 385, it is possible to see the different components of response time, over time, in order to identify specific intervals which may need closer analysis.
In summary, the RMF Magic for Windows tool is used to get a view of I/O activity from the disk subsystem point of view, versus the operating system point of view shown by the standard RMF reporting tools. This approach allows you to analyze the effect each of the operating system images has on the various disk subsystem resources. This subsystem view is presented in an easy-to-use graphical interface, with tools programmed into the interface to ease the analysis of data.
This chapter refers to the DS6000 on iSeries OS/400 and i5/OS operating systems only, not
AIX or Linux.
iSeries software versions V5R2 and earlier are OS/400 versions. Beginning with V5R3, the
software is i5/OS. However, in this chapter, when statements refer to the iSeries software in
general, we simply say OS/400. Note that the DS6000 is supported only on V5R2 and
subsequent releases.
One thing that distinguishes the iSeries is the LUN sizes it will be using. The LUNs will report
into the iSeries as the different models of the 1750 device type. The models will depend on
the size of LUNs that have been configured.
The other distinguishing characteristic of the iSeries sits in a layer on top of the architectural characteristics described so far: the single level storage concept that the iSeries uses. This is a powerful characteristic that makes the iSeries a unique server.
Storage management and caching of data into main storage is completely automated based
on Expert Cache algorithms. Storage management automatically spreads the data across the
disk arms or disk drives (or across LUNs for DS6000 disks) and continues to add records to
files until specified threshold levels are reached.
Single level storage is efficient. Regardless of how many application programs need to use an
object, only one copy of it is required to exist. This makes the entire main storage of an
iSeries server a fast cache for disk storage.
By caching in main storage, the system eliminates access to the storage devices and reduces
associated I/O traffic. Expert Cache works by minimizing the effect of synchronous disk I/O on
a job. The best candidates for performance improvement by using Expert Cache are jobs that
are most affected by synchronous disk I/Os.
This is very different from the way auxiliary storage (disk) was regarded prior to OS/400
V5R1. Until then, all iSeries disks were considered to be owned and usable only by a single
system. Enhancements made in this and later releases make using independent disk pools
an attractive option for many customers who are looking for higher levels of availability and
server consolidation. We use the terms independent disk pool and independent auxiliary
storage pool interchangeably.
There are three types of auxiliary storage pools (ASP):
System auxiliary storage pool (ASP 1): This storage pool contains OS/400 and licensed
program products, plus any user objects.
Basic user auxiliary storage pool (ASP 2-32): Prior to OS/400 V5R2, ASPs 2-32 were
known as user storage pools. Their function has not changed, but they are now referred to
as basic user ASPs. They allow the disk storage attached to a single iSeries server to be
grouped into separate pools. However, these pools have a close relationship to the system
ASP.
Independent auxiliary storage pool (ASP 33-255): This disk pool type contains objects,
directories, or libraries that contain the objects, and other object attributes such as
authorization and ownership attributes. An independent disk pool can be made available
(varied on) and made unavailable (varied off) to the server without restarting the system.
When an independent auxiliary storage pool is associated with a switchable hardware group,
it becomes a switchable auxiliary storage pool and can be switched between one iSeries
server and another iSeries server in a clustered environment. Note that with the required
hardware, internal iSeries disks can be switchable. External storage servers are not required
in order to switch storage from one iSeries server to another. Achieving the full benefits of an
external storage server in an iSeries environment requires the storage server be set up as its
own independent ASP.
When the iSeries disks are external, as when using a DS6000, the disk devices are mapped into the Logical Unit Numbers (LUNs) that are carved from the DS6000 Ranks. In the DS6000, LUNs are striped across a Rank, with the Ranks being either RAID 5 or RAID 10 protected.
The DS6000 can accommodate all iSeries disks, including the load source unit. Load source units on external storage devices are supported only on eServer i5 models and on i5/OS V5R3 or later.
Since iSeries servers already make use of cache in main storage, iSeries workloads
generally do not benefit from large cache in the DS6000 as much as other server platforms
do. Also, the large iSeries cache means that its sequential reads from DS6000 do not follow
the pattern of sequential reads typical of other server platforms, so the DS6000 Sequential
Adaptive Replacement Cache (SARC) algorithms may not provide the benefit in an iSeries
environment that they do on other server platforms.
Although iSeries performs well with internal storage, it also performs well with external
storage, providing clients the flexibility of separating server management from storage
management. In addition, external storage seamlessly becomes part of the iSeries single
level storage environment.
In iSeries terminology, a device adapter is referred to as an I/O adapter (IOA). Each adapter requires its own dedicated I/O processor (IOP) within the iSeries server. The adapters are auto-sensing and run at either 1 Gbps or 2 Gbps. Each IOA can support up to 32 LUNs.
Table 11-1 shows a comparison of the performance of the 2766 and 2787 IOAs. These
numbers are in a controlled test and should not be expected in a typical client environment.
However, they do show the relative differences in the 2766 and 2787 IOAs. Specifically, the
2787 has 38 percent greater throughput than the 2766. Response time as measured by
iSeries Collection Services is 31 percent faster with the 2787 as compared to the 2766.
With multi-target support with OS/400 V5R2, multiple DS6000s can be supported from a
single Fibre Channel adapter initiator, but the total number of addressable LUNs remains 32.
This means, for example, that a single iSeries Fibre Channel disk adapter can have 16 LUNs
on each of two DS6000s.
The RAID type can noticeably affect performance. For workloads with a high number of
random writes, RAID 10 generally supports a higher I/O rate than RAID 5. It does this since
for a random write, RAID 10 requires half as many disk operations as does RAID 5. However,
to provide the same amount of usable capacity, a RAID 10 Array requires significantly more
physical DDMs than does RAID 5.
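As a sketch of that difference, using the standard write penalties (four back-end disk operations per random write on RAID 5, two on RAID 10) and a hypothetical workload:

def backend_ops(read_rate, write_rate, raid):
    # A random write costs 4 disk operations on RAID 5 (read data,
    # read parity, write data, write parity) and 2 on RAID 10 (write
    # both mirror copies); a read miss costs 1.
    penalty = {"RAID5": 4, "RAID10": 2}[raid]
    return read_rate + write_rate * penalty

# Hypothetical cache-miss workload: 600 reads/sec, 400 writes/sec.
print(backend_ops(600, 400, "RAID5"))   # 2200 back-end ops/sec
print(backend_ops(600, 400, "RAID10"))  # 1400 back-end ops/sec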
The effects on performance of different disk drive modules and RAID options can be modeled
using the Disk Magic tool. For additional information about this tool, see 4.1, “Disk Magic” on
page 86.
Unlike other open systems using Fixed Block architecture, OS/400 only supports specific
volume sizes and these may not be an exact number of extents. In this case, some portion of
the last extent used in building a LUN may be unused space. Table 11-2 shows the iSeries device types that are emulated by the DS6000, the number of DS6000 Extents required for each device type, and the space efficiency of each device size (the percentage of available Extent space that is used by the iSeries LUN).
Note: As shown in Table 11-2, OS/400 volumes are defined in decimal gigabytes (GB: 10^9 bytes), while DS6000 Extents are defined in binary gigabytes (GiB: 2^30 bytes).
However, a way to reduce the impact of this limitation is to reduce the internal disk size or
storage server LUN size. For example, take a scenario where you reduce the LUN size by 50
percent, double the number of LUNs to maintain the same total capacity, and also keep the
same access density (number of I/O operations per second, per gigabyte of capacity). This
reduces the I/O rate to each LUN by 50 percent. In an environment with high I/O rates, spreading the I/Os across more, smaller disk devices (LUNs in the DS6000) can show a significant improvement in performance.
A good balance between small LUNs, which reduce OS/400 I/O request queues, and large
LUNs, which maximize the amount of data which can be accessed per IOA, is to use a LUN
size that will provide at least two LUNs per capacity of an individual DDM. For example, with
73 GB DDMs, you should use a maximum LUN size of 35.1 GB. With 146 GB DDMs, you
should use a maximum LUN size of 70.5 GB. And with 300 GB DDMs, you should use a
maximum LUN size of 141.1 GB.
Note: A DS6000 Array can contain iSeries LUNs of different sizes. For example, if you are
creating 141.1 GB LUNs, and have residual capacity in the Extent Pool of 100 Extents, you
can create three 35.1 GB LUNs, which require 33 Extents each.
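The GB-to-Extent arithmetic behind these numbers can be sketched in Python; the 35.1 GB value is the one quoted in the Note above:

import math

GB = 10**9    # OS/400 volume sizes are decimal gigabytes
GIB = 2**30   # DS6000 Extents are binary gigabytes

def extents_for(os400_gb):
    # Extents consumed by one OS/400 LUN, plus space efficiency
    # (the fraction of the allocated Extents actually used).
    exact = os400_gb * GB / GIB
    extents = math.ceil(exact)
    return extents, exact / extents

extents, efficiency = extents_for(35.1)
print(extents, f"{efficiency:.1%}")  # 33 Extents, 99.1% used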
For example, if you have an Array of 836 GB (the effective capacity of a RAID 5 6+P Array of
146 GB disks), from which you build an Extent Pool, and you have one specific workload
which requires that amount of capacity, then you would dedicate that Array to that workload. If
no one workload requires that capacity, but multiple workloads on a single server do, then you
would dedicate that Array to the workloads on that one server.
A more proactive solution would be to plan your DS6000 Array capacities with your workloads
in mind. For example, if you have a workload that requires 416 GB, a RAID 5 6+P Array of 73
GB disks matches that capacity, and you could dedicate that entire Array to that workload.
Even when workload capacities are so large that they require several Arrays, you should still dedicate Arrays to workloads. For example, if your workload requires 3345 GB, four RAID 5 6+P Arrays of 146 GB disks (836 GB each) would provide that capacity and could be dedicated to that workload.
We also recommend dedicating host adapter ports on the DS6000 to a specific iSeries server. That is, a given DS6000 host adapter port (or multiple ports which are part of a multipath group) is dedicated to the LUNs for a specific server. This is to ensure that another server does not dominate use of the host adapter port and cause degraded I/O response times to the iSeries server.
However, before deleting the logical volume on the DS6000, you must first remove it from the
OS/400 configuration (assuming it was still configured). This is an OS/400 task which is
disruptive if the disk is in the system ASP or user ASPs 2-32, because it requires an IPL of
OS/400 to completely remove the volume from the OS/400 configuration. This is no different
than removing an internal disk from an OS/400 configuration. Indeed, deleting a logical
volume on the DS6000 is similar to physically removing a disk drive from an iSeries. Disks
can be removed from an independent ASP with the ASP varied off without IPLing the system.
11.3 Multipath
Multipath support was added for external disks in V5R3 of i5/OS. Multiple connections
provide availability by allowing disk storage to be utilized even if a single path fails. Unlike
other server platforms which have an add-on software component for multipath, such as
Subsystem Device Driver (SDD), multipath is part of the base operating system. Up to eight
connections can be defined from multiple I/O adapters on an iSeries server to a single logical
volume in the DS6000. Each connection for a multipath disk unit functions independently and
can provide access to a logical volume if other connections to that volume fail.
Multipath is important for iSeries because it provides greater resilience to SAN failures, which
can be critical to OS/400 due to the single level storage architecture. Multipath is not available
for iSeries internal disk units, but the likelihood of path failure is much lower with internal
drives because there are fewer components, and therefore fewer points of failure.
If you are using two adapters to provide a multipath connection to a group of 32 LUNs in a
DS6000, this will use up the 32 LUNs available on each adapter.
Both of the iSeries adapters which support the DS6000 can be used for multipath. There is no
requirement for all paths to use the same type of adapter. Both adapters can address up to 32
LUNs.
The system enforces the following rules when you use multipath disk units in a
multiple-system environment:
If you move an IOP with a multipath connection to a different logical partition, you must
also move all other IOPs with connections to the same disk unit to the same logical
partition.
When you make an expansion unit switchable, make sure that all multipath connections to
a disk unit will switch with the expansion unit.
When you configure a switchable independent disk pool, make sure that all of the required
IOPs for multipath disk units will switch with the independent disk pool.
If a multipath configuration rule is violated, the system issues warnings or errors to alert you
of the condition.
The iSeries performance tools provide data for looking at the performance of the server
environment as a whole, since performance problems can come from many sources. Many of
the tools are interrelated, so before looking at each one in detail, we will provide an overview
so you can better see how they fit together:
Collection Services
Collection Services is a no-charge part of i5/OS that collects a set of iSeries metrics or
categories over a start/stop time period. Collection Services allows you to gather
performance data with minimal system resource consumption. Collection Services collects
sample data, which is summary data that is captured at regular time intervals. Collection
Services data is the foundation for basic performance analysis of your system.
Performance Explorer
Performance Explorer is a data collection tool that helps you identify the causes of
performance problems that cannot be identified by collecting data using Collection Services
or by doing general trend analysis. The collection functions and related commands of
Performance Explorer are part of i5/OS. The reporting function and its associated commands
are part of the base option in the Performance Tools licensed program.
Collection Services allows you to gather performance data with little or no observable impact
on system performance. You can use iSeries Navigator to configure Collection Services to
collect the data you want as frequently as you want to gather it. Once you have configured
and started Collection Services, performance data is continuously collected. When you need
to work with performance data, you can copy the data you need into a set of performance
database files.
The system monitors display the data stored in the collection objects that are generated and
maintained by Collection Services. You can use monitors to track and research many different
elements of system performance and can have many different monitors running
simultaneously. When used together, the monitors provide a sophisticated tool for observing
and managing system performance.
Performance Tools includes reports, interactive commands, and other functions, including the
following:
The Work with System Activity command allows you to work interactively with the jobs,
threads and tasks currently running in the system. The command reports system resource
utilization, including CPU utilization on a per-task basis for partitions that use a shared
processing pool.
The Display Performance Data graphical user interface allows you to view performance
data, summarize the data into reports, display graphs to show trends, and analyze the
details of your system performance all from within iSeries Navigator.
The Performance Tools reports organize Collection Services performance data in a logical
and useful format.
The Performance Tools graphics function allows you to work with performance data in a
graphical format. You can display the graphs interactively, or you can print, plot, or save
the data to a graphics data format file for use by other utilities.
Performance Explorer is a data collection tool that helps you identify the causes of
performance problems that cannot be identified by sample data that was collected by
Collection Services or by doing general trend analysis. Use Performance Explorer for
detailed application analysis at a program, procedure, module, or method level. You can
collect trace data on CPU and I/O activity for an individual program. Performance Explorer
is described in more detail in 11.4.7, “Performance Explorer” on page 400.
Figure 11-1 on page 400 shows a sample Performance Tools Disk Utilization report.
Performance Explorer and Collection Services are separate collecting agents. Each one
produces its own set of database files that contain grouped sets of collected data. You can
run both data collections at the same time.
Note: Performance Explorer is the tool you need to use after you have tried the other tools.
It gathers specific forms of data that can more easily isolate the factors involved in a
performance problem; however, when you collect this data, you can significantly affect the
performance of your system.
Like Collection Services, Performance Explorer collects data for later analysis. However, they
collect very different types of data. Collection Services collects a broad range of system data
at regularly scheduled intervals, with minimal system resource consumption. In contrast,
Performance Explorer starts a session that collects trace-level data. This trace generates a
large amount of detailed information about the resources consumed by an application, job, or
thread.
You can use Performance Explorer to answer specific questions about areas like
system-generated disk I/O, procedure calls, Java method calls, page faults, and other trace
events. It is the ability to collect very specific and very detailed information that makes the
Performance Explorer effective in helping isolate performance problems. For example, the
trace definition in Example 11-1 captures all disk events.
Example 11-1 Performance Explorer trace definition to show all disk events
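A definition of this kind is created with the ADDPEXDFN command and run under STRPEX and ENDPEX. The following is a representative sketch only; the definition name DSKEVTS, session ID DSKTRC, and output library MYLIB are hypothetical, not taken from this book:

ADDPEXDFN DFN(DSKEVTS) TYPE(*TRACE) JOB(*ALL) TASK(*ALL) +
          MAXSTG(100000) TRCTYPE(*SLTEVT) SLTEVT(*YES) DSKEVT((*ALL))
STRPEX    SSNID(DSKTRC) DFN(DSKEVTS)   /* start the trace session        */
ENDPEX    SSNID(DSKTRC) DTALIB(MYLIB)  /* end the session, save the data */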
iDoctor for iSeries is a suite of tools and services consisting of these components:
Consulting Services
Job Watcher
Heap Analysis Tools for Java
PEX Analyzer
Performance Trace Data Visualizer
Consulting Services are provided on a fee basis. Job Watcher and PEX Analyzer offer a
45-day free trial with the option to purchase the product after the trial period. Heap Analysis
Tools and PTDV are offered as a free service on an as-is basis.
Consulting Services
Consulting Services provide basic instruction on installation and use of the software tools
within the iDoctor product suite to collect necessary data from your system. It also includes an
analysis of data collected by the iDoctor tools and a final report that includes a detailed
description of IBM’s findings.
Job Watcher
Job Watcher displays real-time tables and graphical data that represent, in a very detailed
way, what a job is doing and why it is not running. Job Watcher provides several different
reports that provide detailed job statistics by interval. These statistics include items such as
CPU utilization, DASD counters, waits, faults, call stack information, and conflict information.
PEX Analyzer
PEX Analyzer evaluates the overall performance of your system and builds on what you have
done with the Performance Tools licensed program. The PEX Analyzer condenses volumes of
trace data into reports that can be graphed or viewed to help isolate performance issues and
reduce overall problem determination time. The PEX Analyzer provides an easy-to-use
graphical interface for analyzing CPU utilization, physical disk operations, logical disk
input/output, data areas, and data queues. The PEX Analyzer can also help you isolate the
cause of application slowdowns.
The Workload Estimator and PM iSeries have been enhanced to work with one another.
Through a Web-based application, you can size the upgrade to the required iSeries system
that accommodates your existing system’s utilization, performance, and growth as reported
by PM iSeries. As an additional option, sizings can also include capacity for adding specific
applications like Domino®, Java, and WebSphere, or the consolidation of multiple AS/400 or
iSeries traditional OS/400 workloads on one system. This capability allows you to plan for
future system requirements based on utilization data coming from your existing system.
When modeling a disk subsystem for iSeries, Disk Magic takes into consideration the unique
single level storage structure that is at the heart of the iSeries architecture, the Expert Cache
algorithms designed into OS/400 to manage the single level storage, and the implications this
has for modeling cache behavior of externally-attached disk subsystems.
Disk Magic allows you to create a model of your existing iSeries internal disk workload, and
project the performance impact of migrating the workload to a new or existing external disk
subsystem. By modeling different configurations in a what if mode, you can determine the
best DS6000 configuration for your I/O requirements when attaching to an iSeries.
The statistics below are needed as input to Disk Magic and can be extracted from collected
performance data:
Reads and writes per second
Average transfer size (KB per I/O)
Cache effectiveness
Average service times and unit utilization
Reports produced by Performance Tools (see “Performance Tools reports” on page 399) can
be processed by Disk Magic as TXT files to build a base system for modeling. The reports
needed for a Disk Magic analysis are:
Component Report - Disk Activity
Resource Interval Report - Disk Utilization Detail
System Report - Disk Utilization
System Report - Storage Pool Utilization
The reports may cover iSeries internal disks and/or external storage servers. Disk Magic
accepts I/O load statistics of a single or multiple iSeries hosts, produced for either internal
disks or an external storage server. The storage server can be either an IBM or non-IBM
model.
The required reports listed above may be concatenated into a single TXT file, optionally
embedded in any other Performance Tools reports, which can then be processed by Disk
Magic. However, if you prefer, Disk Magic can process each report individually, in which case
Disk Magic will prompt you for the name of each report until all of them have been processed.
Performance Tools reports for multiple iSeries servers may be further concatenated into a
single file, bearing in mind that Disk Magic will create a single disk subsystem for all internally
attached disks it finds in the concatenated reports, regardless of the number of hosts or
externally attached disk subsystems represented in the TXT file. If applicable, a second disk
subsystem is created if externally attached disks are found in the Performance Tools reports.
When Disk Magic detects more than one auxiliary storage pool (ASP), it asks if you want to
model at the ASP level.
The recommended way to input Performance Tools reports to Disk Magic is by first
concatenating them together into a single TXT file, sequenced as in the list above. This file
may contain other Performance Tools reports than the four mentioned above; Disk Magic will
select what it needs. Also, when you have multiple iSeries hosts, the preferred way of
processing their Performance Tools reports is by concatenating the reports together into a
single TXT file before processing it with Disk Magic.
If the iSeries host currently employs software mirroring to maintain a duplicate image of the
disks, Disk Magic detects this in the Performance Tools reports and asks whether it should
assume continued software mirroring, or whether data will be written just once to a DS6000
(which provides RAID 5 or RAID 10 data protection).
If you are planning to attach the iSeries to an existing DS6000, you must also have
performance data from the other workloads that are running on the DS6000. You need to
have the configuration of the DS6000, including cache size, number of Fibre Channel
connections, installed disks, and the mapping of the other workloads across the configuration.
User ASPs must also be included in capacity and performance sizings of storage solutions for
iSeries. Ensure collected sizing information includes the current and future use of user ASPs.
11.5.1 Publications
We recommend the following Redbooks and iSeries publications:
iSeries in Storage Area Networks, A Guide to Implementing FC Disk and Tape with
iSeries, SG24-6220
Performance Tools for iSeries Version 5, SC41-5340
iSeries Performance Capabilities Reference i5/OS Version 5 Release 3, SC41-0607
iSeries Performance Version 5 Release 3, which you can find at:
http://www.ibm.com/servers/eserver/iseries/perfmgmt/resource.html
http://publib.boulder.ibm.com/infocenter/iseries/v5r3/topic/rzahx/rzahx.pdf
Collecting and Analyzing PEX Trace Profile Data, which you can find at:
http://www.ibm.com/servers/eserver/iseries/perfmgmt/resource.html
http://www-03.ibm.com/servers/eserver/iseries/perfmgmt/pdf/tprof.pdf
iSeries Disk Arm Requirements Based on Processor Model Performance, which you can
find at:
http://www-03.ibm.com/servers/eserver/iseries/perfmgmt/pdf/V5R2FiSArmct.pdf
The information in this chapter is not dedicated solely to the IBM TotalStorage DS6000; it can
be applied more generally to other storage equipment.
Adding to the preceding list, Table 12-1 provides a summary of the characteristics of the
different workload types.
The DB2 environment can often be difficult to typify, since there can be wide differences in I/O
characteristics. DB2 Query has high read content and is of a sequential nature. Transaction
environments have more random content, and are sometimes very cache unfriendly, but
some other times have very good hit ratios. DB2 has also implemented several changes that
affect I/O characteristics, such as sequential pre-fetch and exploitation of I/O priority queuing.
Users need to understand the unique characteristics of their installation’s processing before
generalizing about DB2 performance.
A DB2 query workload should mostly have the same characteristics as a sequential read
workload. The storage subsystem implements sequential pre-fetch algorithms; this
functionality, which caches the data that is most likely to be accessed, provides very good
performance improvements for most DB2 queries.
The enhanced prefetch cache algorithms, together with the high storage backend bandwidth
and the minimal RAID 5 write penalty (none for RAID 10), provide high subsystem throughput
and high transaction rates for DB2 transaction-based workloads.
One of DB2’s main advantages is the exploitation of a large buffer pool in processor storage.
When managed properly, the buffer pool can avoid a large percentage of the accesses to
disk. Depending on the application and the size of the buffer pool, this can translate to poor
cache hit ratios for what in DB2 is called synchronous reads. Spreading data across several
RAID Arrays can be used to increase the throughput even if all accesses are read misses.
DB2 administrators often require that tablespaces and their indexes are placed on separate
volumes. This configuration improves both availability and performance.
These workload categories are summarized in Table 12-2, and the common applications that
can be found at any installation are classified following this categorization.
An example of a data warehouse is one designed around a financial institution and its
functions, such as loans, savings, bank cards, and trusts. In this application
there are basically three kinds of operations: The initial loading, the access, and the updating
of the data. However, due to the fundamental characteristics of a warehouse these operations
can occur simultaneously. At times this application could perform 100 percent reads when
accessing the warehouse; 70 percent reads and 30 percent writes when accessing data while
record updating occurs simultaneously; or even 50 percent reads and 50 percent writes when
the user load is heavy. Keep in mind that the data within the warehouse is a series of
snapshots and once the snapshot of data is made, the data in the warehouse does not
change. Therefore, there is typically a higher read ratio when using the data warehouse.
Object-Relational DBMSs (ORDBMS) are now being developed; they not only offer
traditional relational DBMS features, but additionally support complex data types. Objects
can be stored and manipulated, and complex queries can be performed at the database level.
Depending on the host and operating system used to perform this application, transfers are
typically medium to large in size and access is always sequential. Image processing consists
of moving huge image files for the purpose of editing. In these applications the user is
regularly moving huge high-resolution images between the storage device and the host
system. These applications service many desktop publishing and workstation applications.
Editing sessions can include loading large files of up to 16 MB into host memory, where users
edit, render, modify, and eventually store back onto the storage system. High interface
transfer rates are needed for these applications or the users will waste huge amounts of time
waiting to see results. If the interface can move data to and from the storage device at over 32
MB/second then an entire 16 MB image can be stored and retrieved in less than one second.
The need for throughput is all important to these applications and, along with the additional
load of many users, I/O operations per second are also a major requirement.
To monitor the workload applied on your DS6000, the monitoring tool available is the IBM
TotalStorage Productivity Center for Disk with Performance Manager. See 4.3, “IBM
TotalStorage Productivity Center for Disk” on page 109.
These three commands are standard tools available with most UNIX and UNIX-like (Linux)
systems. We recommend iostat for gathering the data you will need to evaluate your host I/O
levels. Specific monitoring tools are also available for AIX, Linux, HP-UX, and Sun Solaris.
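For example, on AIX the following invocation reports disk activity at a fixed interval; the interval and count values here are illustrative:

# Disk statistics only, every 60 seconds, five reports
iostat -d 60 5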
For more information, refer to Chapter 7, “Open systems servers - UNIX” on page 189 and to
Chapter 8, “Open system servers - Linux for xSeries” on page 261.
Performance Monitor gives you the flexibility to customize the monitoring to capture various
categories of Windows 2000 system resources, including CPU and memory. You can also
monitor disk I/O through Performance Monitor.
For more information, refer to Chapter 9, “Open system servers - Windows” on page 305.
iSeries environment
Here are the most popular tools:
Collection Services
iSeries Navigator Monitors
IBM Performance management for iSeries (PM iSeries)
Performance Tools for iSeries
Most of these are comprehensive planning tools in that they address the entire spectrum of
workload performance on iSeries including CPU, system memory, disks and adapters.
For more information, refer to Chapter 11, “iSeries servers” on page 387.
zSeries environment
The z/OS systems have proven performance monitoring and management tools available to
use for performance analysis. RMF, a z/OS performance tool, collects performance data and
reports it for the desired interval. It also provides cache reports. The cache reports are similar
to disk-to-cache and cache-to-disk reports available in the TotalStorage Productivity Center
for Disk, except that RMF’s cache reports are provided in text format. RMF provides the Rank
level statistics as SMF records. These SMF records are raw data that you can run your own
post processor against to generate reports.
For more information, refer to Chapter 10, “zSeries servers” on page 357.
The information and discussion contained in this chapter can further be complemented with
information at this Web site:
http://www.ibm.com/software/data/db2/udb
OLTP systems process the day-to-day operation of businesses and, therefore, have strict
user response and availability requirements. They also have very high throughput
requirements and are characterized by large amounts of database inserts and updates. They
typically serve hundreds, or even thousands, of concurrent users.
DSS systems typically deal with substantially larger volumes of data than OLTP systems due
to their role in supplying users with large amounts of historical data. Whereas 100 GB of data
would be considered large for an OLTP environment, a large DSS system could be 1 terabyte
of data or more. The increased storage requirements of DSS systems can also be attributed
to the fact that they often contain multiple, aggregated views of the same data.
While OLTP queries are mostly related to one specific business function, DSS queries are
often substantially more complex. The need to process large amounts of data results in many
CPU intensive database sort and join operations. The complexity and variability of these
types of queries must be given special consideration when estimating the performance of a
DSS system.
Data tablespaces can be divided in two groups: System tablespaces and user tablespaces.
Both of these have identical data attributes. The difference is that system tablespaces are
used to control and manage the DB2 subsystem and user data. System tablespaces require
the highest availability and some special considerations. User data cannot be accessed if the
system data is not available.
In addition to data tablespaces, DB2 requires a group of traditional datasets not associated to
tablespaces that are used by DB2 to provide data availability: The backup and recovery
datasets.
The following sections describe the different objects and datasets that DB2 uses.
TABLE
All data managed by DB2 is associated to a table. The table is the main object used by DB2
applications.
TABLESPACE
A tablespace is used to store one or more tables. A tablespace is physically implemented with
one or more datasets. Tablespaces are VSAM linear datasets (LDS). Because tablespaces
can be larger than the largest possible VSAM dataset, a DB2 tablespace may require more
than one VSAM dataset.
INDEX
A table can have one or more indexes (or can have no index). An index contains keys. Each
key may point to one or more data rows. The purpose of an index is to get direct and faster
access to the data in a table.
DATABASE
A database is a DB2 representation of a group of related objects. Each of the previously
named objects must belong to a database. DB2 databases are used to organize and manage
these objects.
STOGROUP
A DB2 storage group is a list of storage volumes. STOGROUPs are assigned to databases,
tablespaces, or index spaces when using DB2 managed objects. DB2 uses STOGROUPs for
disk allocation of the table and index spaces.
Installations that are SMS managed can define STOGROUP with VOLUME(*). This
specification implies that SMS assigns a volume to the table and index spaces in that
STOGROUP. In order to do this, SMS uses ACS routines to assign a storage class, a
management class and a storage group to the table or index space.
Application tablespaces and index spaces are VSAM LDS datasets with the same attributes
as DB2 system tablespaces and index spaces. The difference between system and
application data is made only because they have different performance and availability
requirements.
You may intermix tables and indexes, and also system, application, and recovery datasets, on
DS6000 Ranks. The overall I/O activity will be more evenly spread, and I/O skews will be
avoided.
VSAM data striping addresses this problem with two modifications to the traditional data
organization:
The records are not placed in key ranges along the volumes; instead they are organized in
stripes.
Parallel I/O operations are scheduled to sequential stripes in different volumes.
By striping data the VSAM control intervals (CIs) are spread across multiple devices. This
format allows a single application request for records in multiple tracks and CIs to be satisfied
by concurrent I/O requests to multiple volumes.
The result is improved data transfer to the application. The scheduling of I/O to multiple
volumes in order to satisfy a single application request is referred to as an I/O path packet.
We can stripe across Ranks, across device adapters, across servers, and across DS6000s.
If you plan to enable VSAM I/O striping refer to the following publication for additional
discussion: DB2 for z/OS and OS/390 Version 7 Performance Topics, SG24-6129.
Measurements oriented to determine how large volumes can impact DB2 performance have
shown that similar response times can be obtained when using larger volumes as when using
the smaller 3390-3 standard size volumes (refer to 10.6.2, “Larger versus smaller volumes
performance” on page 364 for a discussion).
Examples of DB2 applications that will benefit from MIDAWs are DB2 prefetch and DB2
utilities.
The information presented in this section is further discussed in detail in (and liberally
borrowed from) the redbook, IBM ESS and IBM DB2 UDB Working Together, SG24-6262.
Many of the concepts presented are applicable to the DS6000. We highly recommend this
redbook. However, based on customer solution experiences using SG24-6262, there are two
corrections we want to point out:
In IBM ESS and IBM DB2 UDB Working Together, SG24-6262, Section 3.2.2, “Balance
workload across ESS resources”, suggests that a data layout policy should be established
that allows partitions and containers within partitions to be spread evenly across ESS
resources. It further suggests that you can choose either a horizontal mapping, in which
every partition has containers on every available ESS Rank, or a vertical mapping in which
DB2 partitions are isolated to specific arrays, with containers spread evenly across those
Ranks. We now recommend the vertical mapping approach.
Another data placement consideration, suggests that it is important to manage where data
is placed on the disk, outer edge or middle. We no longer believe this is an important
consideration.
The database object that maps the physical storage is the tablespace. Figure 13-1 on
page 422 illustrates how DB2 UDB is logically structured and how the tablespace maps the
physical object.
Figure 13-1 DB2 UDB logical structure: instances contain databases, and databases contain
tablespaces (SMS or DMS), where tables, indexes, and long data are stored. In an SMS
tablespace, each container is a directory in the file space of the operating system; in a DMS
tablespace, each container is a fixed, pre-allocated file or a physical device such as a disk.
Instances
An instance is a logical database manager environment where databases are cataloged and
configuration parameters are set. An instance is similar to an image of the actual database
manager environment. You can have several instances of the database manager product on
the same database server. You can use these instances to separate the development
environment from the production environment, tune the database manager to a particular
environment, and protect sensitive information from a particular group of users.
For database partitioning features (DPF) of the DB2 Enterprise Server Edition (ESE), all data
partitions reside within a single instance.
Databases
A relational database structures data as a collection of database objects. The primary
database object is the table (a defined number of columns and any number of rows). Each
database includes a set of system catalog tables that describe the logical and physical
structure of the data, configuration files containing the parameter values allocated for the
database, and recovery logs.
DB2 UDB allows multiple databases to be defined within a single database instance.
Configuration parameters can also be set at the database level, thus allowing you to tune, for
example, memory usage and logging.
Database partitions
A partition number in DB2 UDB terminology is equivalent to a data partition. Databases with
multiple data partitions residing on an SMP system are also called multiple logical node
(MLN) databases.
The configuration information of the database is stored in the catalog partition. The catalog
partition is the partition from which you create the database.
Partitiongroups
A partitiongroup is a set of one or more database partitions. For non-partitioned
implementations (all editions except for DPF), the partitiongroup is always made up of a
single partition.
Partitioning map
When a partitiongroup is created, a partitioning map is associated with it. The partitioning
map in conjunction with the partitioning key and hashing algorithm is used by the database
manager to determine which database partition in the partitiongroup will store a given row of
data. Partitioning maps do not apply to non-partitioned databases.
Containers
A container is the way of defining where on the storage device the database objects will be
stored. Containers may be assigned from file systems by specifying a directory. These are
identified as PATH containers. Containers may also reference files that reside within a
directory. These are identified as FILE containers and a specific size must be identified.
Containers may also reference raw devices. These are identified as DEVICE containers, and
the device must already exist on the system before the container can be used.
All containers must be unique across all databases; a container can belong to only one
tablespace.
Tablespaces
A database is logically organized in tablespaces. A tablespace is a place to store tables. To
spread a tablespace over one or more disk devices you simply specify multiple containers.
For partitioned databases, the tablespaces reside in partitiongroups. In the create tablespace
command execution, the containers themselves are assigned to a specific partition in the
partitiongroup, thus maintaining the shared nothing character of DB2 UDB DPF.
Tablespaces can be either system managed space (SMS) or data managed space (DMS).
For an SMS tablespace, each container is a directory in the file system, and the operating
system file manager controls the storage space (LVM for AIX). For a DMS tablespace, each
container is either a fixed-size pre-allocated file, or a physical device such as a disk (or in the
case of the DS6000, a vpath), and the database manager controls the storage space.
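As a brief sketch of the two variants from the DB2 command line processor (the tablespace names, directories, and raw devices are hypothetical; DEVICE container sizes are given in pages):

# SMS: each container is a file system directory
db2 "CREATE TABLESPACE TS_SMS MANAGED BY SYSTEM \
  USING ('/db2/ts_sms/c0', '/db2/ts_sms/c1')"

# DMS: each container is a pre-allocated file or a raw device (vpath);
# 2621440 4 KB pages = 10 GB per container
db2 "CREATE TABLESPACE TS_DMS MANAGED BY DATABASE \
  USING (DEVICE '/dev/rvpath0' 2621440, DEVICE '/dev/rvpath1' 2621440)"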
There are three main types of user tablespaces: Regular (index and/or data), temporary, and
long. In addition to these user-defined tablespaces, DB2 requires a system tablespace, the
catalog tablespace, to be defined. For partitioned database systems this catalog tablespace
resides on the catalog partition.
When creating a table you can choose to have certain objects, such as indexes and large
object (LOB) data, stored separately from the rest of the table data. In order to do this, the
table must be defined to a DMS tablespace.
Indexes are defined for a specific table and assist in the efficient retrieval of data to satisfy
queries. They can also be used to assist in the clustering of data.
Large objects (LOBs) can be stored in columns of the table. These objects, although logically
referenced as part of the table, may be stored in their own tablespace when the base table is
defined to a DMS tablespace. This allows for more efficient access of both the LOB data and
the related table data.
Pages
Data is transferred to and from devices in discrete blocks that are buffered in memory. These
discrete blocks are called pages, and the memory reserved to buffer a page transfer is called
an I/O buffer. DB2 UDB supports various page sizes, including 4 KB, 8 KB, 16 KB, and 32 KB.
When an application accesses data randomly, the page size determines the amount of data
transferred. This corresponds to the size of data transfer request done to the DS6000, which
is sometimes referred to as the physical record.
Sequential read patterns can also influence the page size selected. Larger page sizes for
workloads with sequential read patterns can enhance performance by reducing the number of
I/Os.
Extents
An extent is a unit of space allocation within a container of a tablespace for a single
tablespace object. This allocation consists of multiple pages. The extent size (number of
pages) for an object is set when the tablespace is created.
An extent is a group of consecutive pages defined to the database. The data in the
tablespaces is striped by extent across all the containers in the tablespace.
Buffer pools
A buffer pool is main memory allocated in the host processor to cache table and index data
pages as they are being read from disk or being modified. The purpose of the buffer pool is to
improve system performance. Data can be accessed much faster from memory than from
disk; therefore, the fewer times the database manager needs to read from or write to disk
(I/O) the better the performance. Multiple buffer pools can be created.
Sequential prefetch reads consecutive pages into the buffer pool before they are needed by
DB2. List prefetch is more complex; in this case, the DB2 optimizer optimizes the retrieval of
randomly located data.
Page cleaners
Page cleaners make room in the buffer pool before prefetchers read pages from disk storage
and move them into the buffer pool. For example, if a large amount of data has been updated
in a table, many data pages in the buffer pool may be updated but not yet written to disk
storage (these pages are called dirty pages). Since prefetchers cannot place fetched data
pages onto the dirty pages in the buffer pool, these dirty pages must first be flushed to disk
storage and become clean pages, so that prefetchers can then place fetched data pages
from disk storage.
Logs
Changes to data pages in the buffer pool are logged. Agent processes updating a data record
in the database update the associated page in the buffer pool and write a log record into a log
buffer. The written log records in the log buffer will be flushed into the log files asynchronously
by the logger. With UNIX, you can see a logger process (db2loggr) for each active database
using the ps command.
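For example, a check of this kind on AIX or Linux might look like the following:

# Look for the db2loggr process of each active database
ps -ef | grep db2loggr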
To optimize performance neither the updated data pages in the buffer pool nor the log records
in the log buffer are written to disk immediately. They are written to disk by page cleaners and
the logger, respectively.
Parallel operations
DB2 UDB extensively uses parallelism to optimize performance when accessing a database.
DB2 supports several types of parallelism:
Query
I/O
Query parallelism
There are two dimensions of query parallelism: Inter-query parallelism and intra-query
parallelism. Inter-query parallelism refers to the ability of multiple applications to query a
database at the same time. Each query executes independently of the others, but they are all
executed at the same time. Intra-query parallelism refers to the simultaneous processing of
parts of a single query, using intra-partition parallelism, inter-partition parallelism, or both.
Intra-partition parallelism subdivides what is usually considered a single database
operation, such as index creation, database loading, or SQL queries, into multiple parts,
many or all of which can be run in parallel within a single database partition.
Inter-partition parallelism subdivides what is usually considered a single database
operation, such as index creation, database loading, or SQL queries, into multiple parts,
many or all of which can be run in parallel across multiple partitions of a partitioned
database on one machine or on multiple machines. Inter-partition parallelism only applies
to DPF.
I/O parallelism
When there are multiple containers for a tablespace, the database manager can exploit
parallel I/O. Parallel I/O refers to the process of writing to, or reading from, two or more I/O
devices simultaneously. This can result in significant improvements in throughput.
DB2 implements a form of data striping by spreading the data in a tablespace across multiple
containers. In storage terminology, the part of a stripe that is on a single device is a strip. The
DB2 term for strip is extent. If your tablespace has three containers, DB2 will write one extent
to container 0, the next extent to container 1, the next extent to container 2, then back to
container 0. The stripe width—a generic term not often used in DB2 literature—is equal to the
number of containers, or three in this case.
Containers for a tablespace would ordinarily be placed on separate physical disks, allowing
work to be spread across those disks, and allowing disks to operate in parallel. Since the
DS6000 logical disks are striped across the Rank, the database administrator can allocate
DB2 containers on separate logical disks residing on separate DS6000 Arrays. This will take
advantage of the parallelism both in DB2 and in the DS6000. For example, four DB2
containers residing on four DS6000 logical disks on four different 7+P Ranks will have data
spread across 32 physical disks.
For a more detailed and accurate approach that takes into consideration the particularities of
your DB2 UDB environment, you should contact your IBM representative, who can assist you
with the DS6000 capacity and configuration planning.
If you want optimal performance from the DS6000, do not treat it completely like a black box.
Establish a storage allocation policy that allocates data using several DS6000 Ranks.
Understand how DB2 tables map to underlying logical disks, and how the logical disks are
allocated across the DS6000 Ranks.
As a result, you can balance activity across DS6000 resources by following these rules:
Span DS6000 storage units.
Span Ranks (RAID Arrays) within a storage unit.
Engage as many arrays as possible.
Figure 13-2 on page 428 illustrates this technique for a single tablespace consisting of eight
containers.
Figure 13-2 Allocating DB2 containers using a “spread your data” approach
Look again at Figure 13-2. In this case, we are striping across arrays, across disk adapters,
across clusters, and across DS6000 boxes. This can all be done using the striping capabilities
of DB2’s container and shared nothing concept. This eliminates the need to employ AIX
logical volume striping.
Page size    Maximum tablespace size
4 KB         64 GB
8 KB         128 GB
16 KB        256 GB
32 KB        512 GB
Select a page size that can accommodate the total expected growth requirements of the
objects in the tablespace.
For OLTP applications that perform random row read and write operations, a smaller page
size is preferable, because it wastes less buffer pool space with unwanted rows. For DSS
applications that access large numbers of consecutive rows at a time, a larger page size is
better, because it reduces the number of I/O requests that are required to read a specific
number of rows.
Tip: Experience indicates that page size can be dictated to some degree by the type of
workload. For pure OLTP workloads a 4 KB page size is recommended. For a pure DSS
workload a 32 KB page size is recommended. For a mixture of OLTP and DSS workload
characteristics we recommend either an 8 KB page size or a 16 KB page size.
Extent size
If you want to stripe across multiple arrays in your DS6000, then assign a LUN from each
Rank to be used as a DB2 container. During writes, DB2 will write one extent to the first
container, the next extent to the second container, and so on until all eight containers have
been addressed before cycling back to the first container. DB2 stripes across containers at
the tablespace level.
Since the DS6000 stripes at a fairly fine granularity (256 KB), selecting a multiple of 256 KB
for the extent size makes sure that multiple DS6000 disks are used within a Rank when a
DB2 prefetch occurs. However, you should keep your extent size below 1 MB.
I/O performance is fairly insensitive to selection of extent sizes, mostly due to the fact that
DS6000 employs sequential detection and prefetch. For example, even if you picked an extent
size such as 128 KB, which is smaller than the full array width (it would involve accessing only
four disks in the array), the DS6000 sequential prefetch would keep the other disks in the
array busy.
Prefetch size
The tablespace prefetch size determines the degree to which separate containers can
operate in parallel.
It is worthwhile to note that prefetch size is tunable. By this we mean that prefetch size can be
altered after the tablespace has been defined and data loaded. This is not true for extent and
page size that are set at tablespace creation time and cannot be altered without re-defining
the tablespace and re-loading the data.
Tip: The prefetch size should be set so that as many arrays as desired can be working on
behalf of the prefetch request. For other than the DS6000, the general recommendation is
to calculate prefetch size to be equal to a multiple of the extent size times the number of
containers in your tablespace. For the DS6000 you may work with a multiple of the extent
size times the number of arrays underlying your tablespace.
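Pulling the page size, extent size, and prefetch size recommendations together, here is a hedged sketch for a tablespace spread over four containers, one LUN from each of four Ranks; the names, devices, and sizes are illustrative. With the default 4 KB page, EXTENTSIZE 64 gives a 256 KB extent matching the DS6000 strip size, and PREFETCHSIZE 256 equals the extent size times the four containers:

db2 "CREATE TABLESPACE TS_DATA MANAGED BY DATABASE \
  USING (DEVICE '/dev/rvpatha' 2621440, DEVICE '/dev/rvpathb' 2621440, \
         DEVICE '/dev/rvpathc' 2621440, DEVICE '/dev/rvpathd' 2621440) \
  EXTENTSIZE 64 PREFETCHSIZE 256"

# Prefetch size is tunable after data is loaded; extent and page size are not
db2 "ALTER TABLESPACE TS_DATA PREFETCHSIZE 512"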
The DS6000 supports a high degree of parallelism and concurrency on a single logical disk.
As a result, a single logical disk the size of an entire array achieves the same performance as
many smaller logical disks. However, you must consider how logical disk size affects both the
host I/O operations and the complexity of your organization’s systems administration.
Smaller logical disks provide more granularity, with their associated benefits, but they also
increase the number of logical disks seen by the operating system. Select a DS6000 logical
disk size that allows for granularity and growth without proliferating the number of logical
disks.
You should also take into account your container size and how the containers will map to AIX
logical volumes and DS6000 logical disks. In the simplest situation, the container, the AIX
logical volume, and the DS6000 logical disk will be the same size.
Tip: Try to strike a reasonable balance between flexibility and manageability for your
needs. Our general recommendation is that you create no fewer than two logical disks in
an array, and that the minimum logical disk size be between 10 GB and 20 GB. Unless you
have an extremely compelling reason, standardize on a single logical disk size throughout
the DS6000.
Larger and smaller logical disk sizes have the following advantages and disadvantages:
Advantages of smaller size logical disks:
– Easier to allocate storage for different applications and hosts.
– Greater flexibility in performance reporting; for example, PDCU reports statistics for
logical disks.
Disadvantages of smaller size logical disks:
Small logical disk sizes can contribute to proliferation of logical disks, particularly in SAN
environments and large configurations. Administration gets complex and confusing.
Advantages of larger size logical disks:
– Fewer logical disks for the operating system to discover and for the administrator to
manage, which simplifies administration and shortens reboots, as the examples below
illustrate.
Examples
Let us assume a 6+P array with 146 GB disk drives. Suppose you wanted to allocate disk
space on your 16-array DS6000 as flexibly as possible. You could carve each of the 16 arrays
up into 32 GB logical disks or LUNs, resulting in 27 logical disks per array (with a little left
over). This would yield a total of 16 * 27 = 432 LUNs. Then you could implement 4-way
multi-pathing, which in turn would make 4* 432 = 1728 hdisks visible to the operating system.
Not only would this create an administratively complex situation, but at every reboot the
operating system would query each of those 1728 disks. Reboots could take a long time.
Alternatively, you could have created just 16 large logical disks. With multi-pathing and
attachment of four Fibre Channel ports, you would have 4* 16 = 128 hdisks visible to the
operating system. Although this number is large, it is certainly more manageable; and reboots
would be much faster. Having overcome that problem, you could then use the operating
system logical volume manager to carve this space up into smaller pieces for use.
There are problems with this large logical disk approach as well, however. If the DS6000 is
connected to multiple hosts or it is on a SAN, then disk allocation options are limited when
you have so few logical disks. You would have to allocate entire arrays to a specific host; and
if you wanted to add additional space, you would have to add it in array-size increments.
This problem is less severe if you know your needs well enough to say that your DS6000 will
never be connected to more than one host. Nevertheless, in some versions of UNIX an hdisk
can be assigned to only one logical volume group. This means that if you want an operating
system volume group that spans all arrays of the DS6000, you are limited to a single volume
group for the entire DS6000.
DB2 can use containers from multiple volume groups, so this is not technically a problem for
DB2. However, if you want the ability to do disk administration at the volume group level
(exports, imports, backups, and so on), then you will not be very pleased with a volume group
that is three to eleven terabytes in size.
13.5.6 Multi-pathing
Use DS6000 multi-pathing along with DB2 striping to ensure balanced use of Fibre Channel
paths.
Multi-pathing is the hardware and software support that provides multiple avenues of access
to your data from the host computer. When using the DS6000, this means you need to
provide at least two Fibre Channel or SCSI connections to each host computer from any
component being multi-pathed. It also involves some additional considerations when
configuring the DS6000 host adapters and volumes.
DS6000 multi-pathing requires the installation of multipathing software. For AIX, you have two
choices: SDDPCM or the IBM Subsystem Device Driver (SDD). For AIX, we recommend
SDDPCM. These products are discussed in Chapter 7, “Open systems servers - UNIX” on
page 189 and in Chapter 5, “Host attachment” on page 143.
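Once the multipathing software is installed, the state of the paths can be verified from the AIX host; the following commands are a sketch, and the output varies with the configuration:

# With SDDPCM: list MPIO devices and the state of each path
pcmpath query device

# With SDD: list vpath devices and the state of each path
datapath query device

Each path to a LUN should show an open or normal state; a path shown in a failed state indicates a broken connection to the DS6000.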
IMS provides functions for preserving the integrity of the databases and maintaining them. It
allows multiple tasks to access and update the data, while ensuring the integrity of that data.
It also provides functions for reorganizing and restructuring the databases.
The IMS databases are organized internally using a number of IMS’s own internal database
organization access methods. The database data is stored on disk storage using the normal
operating system access methods.
During IMS execution, all information necessary to restart the system in the event of a failure
is recorded on a system log dataset. The IMS logs are made up of the following datasets.
The online log datasets (OLDS) are made up of multiple datasets that are used in a
wrap-around manner. At least three datasets must be allocated for the OLDS to allow IMS to
start, while an upper limit of 100 is supported.
Only complete log buffers are written to the OLDS, to enhance performance. Should any
incomplete buffers need to be written out, they are written to the write ahead datasets (WADS).
When IMS processing requires writing of a partially filled OLDS buffer, a portion of the buffer
is written to the WADS. If IMS or the system fails, the log data in the WADS is used to
terminate the OLDS, which can be done as part of an emergency restart, or as an option on
the IMS Log Recovery Utility.
The WADS space is continually reused after the appropriate log data has been written to the
OLDS. This dataset is required for all IMS systems, and must be pre-allocated and formatted
at IMS start-up when first used.
If you want optimal performance from the DS6000, do not treat it totally like a ‘black box’.
Understand how your IMS datasets map to underlying volumes, and how the volumes map to
RAID Arrays.
You may intermix IMS databases and log datasets on DS6000 Ranks. The overall I/O activity
will be more evenly spread, and I/O skews will be avoided.
Measurements to determine how large volumes can impact IMS performance have shown
that similar response times can be obtained when using larger volumes as when using the
smaller 3390-3 standard size volumes.
Figure 13-3 on page 435 illustrates the device response times when using 32 3390-3
volumes versus four large volumes 3390-27 on an ESS-F20 using FICON channels. Even
though the benchmark was performed on an ESS-F20, the results should be similar on the
DS6000. The results show that with the larger volumes the response times are similar to the
standard size 3390-3 volumes.
Figure 13-3 Device response time (ms) versus total I/O rate (I/O per second): 32 3390-3
volumes compared with four 3390-27 volumes, measured at 2905 and 4407 I/O per second
Copy Services has four interfaces: a Java-enabled Web-based interface (DS Storage
Manager), a command-line interface (DS CLI), an application programming interface (DS
Open API), and host I/O commands from zSeries servers.
This chapter discusses the functions, objectives, and performance related aspects of the
following Copy Services:
FlashCopy
Metro Mirror
Global Copy
Interoperability between IBM TotalStorage Enterprise Storage Server (ESS), the DS6000
and the DS8000
Managing performance
Note: Remote Mirror and Copy was referred to as Peer-to-Peer Remote Copy (PPRC) in
earlier documentation for the IBM TotalStorage Enterprise Storage Server.
You can manage Copy Services functions through the function-rich DS Command-Line
Interface (CLI) called the IBM TotalStorage DS CLI and the Web-based interface called the
IBM TotalStorage DS Storage Manager. The DS Storage Manager allows you to set up and
manage data copy features from anywhere that network access is available.
There are several points to consider when you are planning to use FlashCopy that may help
you minimize any impact that the FlashCopy operation may have on host I/O performance.
This section gives an overview of FlashCopy for a DS6000 in a z/OS environment from a
performance perspective. We will describe:
FlashCopy operational areas
FlashCopy basic concepts
Data set level FlashCopy
FlashCopy in combination with other Copy Services
As Figure 14-1 on page 440 illustrates, when FlashCopy is invoked, a relationship (or
session) is established between the source and target volumes of the FlashCopy pair. This
includes creation of the necessary bitmaps and metadata information needed to control the
copy operation. This FlashCopy establish process is very quick to complete, at which point:
The FlashCopy relationship is fully established.
Control returns to the operating system or task that requested the FlashCopy.
Both the source volume and its time zero (T0) target volume are available for full read/write
access.
At this time a background task within the DS6000 starts copying the tracks from the source to
the target volume. Optionally, you can suppress this background copy task. This is efficient,
for example, if you are doing a temporary copy just to take a backup from that copy to tape.
Figure 14-1 FlashCopy establish: read and write to both source and target are possible; the
optional T0 physical copy progresses in the background
For a straightforward FlashCopy, the FlashCopy relationship ends when the background copy
task completes. However, if the FlashCopy was requested with the no-background copy
option, or with the persistent option, then the relationship must be explicitly ended by a
FlashCopy withdraw command.
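For example, with the DS CLI, a relationship of this kind could be established and later withdrawn as follows; the storage image ID and the 0100:0200 source:target volume pair are illustrative, not taken from this book:

dscli> mkflash -dev IBM.1750-1300861 -nocp 0100:0200
dscli> rmflash -dev IBM.1750-1300861 0100:0200

The -nocp option suppresses the background copy, so the relationship persists until the rmflash (withdraw) command ends it.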
FlashCopy has several options, and not all options are available in all user interfaces. It is
important to know, right from the beginning, the purpose for which the target volume will be
used afterwards. Knowing this, you can identify the options to use with FlashCopy and select
the environment that supports those options.
Supported options within the environments are those identified in Figure 14-2 on page 441.
Figure 14-2 FlashCopy interfaces and functions (the functions include remote FlashCopy,
persistent FlashCopy, reverse restore, and fast reverse restore)
Note: Normal DS6000 application related I/O will affect the performance of FlashCopy
background copy.
Tip: For better performance with FlashCopy, do not FlashCopy between logical volumes
within the same Rank.
Copying within a single Rank means that the data reads and writes are all performed on
the same physical set of eight DDMs in the Rank, effectively doubling the I/O rates to that
Rank during the FlashCopy operation.
Better FlashCopy background copy performance can be expected if the source and
destination Ranks are managed by different DA Pairs.
With multiple background copy operations running concurrently, the throughput
performance of each operation will be less than the performance of one operation running
by itself. For instance, assume the background copy of one 36 GB volume completes in 12
minutes. It should be expected that the background copy of four 36 GB volumes, running
concurrently, will take more than 12 minutes to complete.
Slightly better FlashCopy background copy performance can be expected by ensuring the
source and target volumes are on the same RIO Loop.
FlashCopy relationships can exist between logical volumes that are backed by Ranks
consisting of differing geometries (speed, capacity and RAID type). These FlashCopy
relationships will also be slightly impacted by the performance capability of the underlying
devices, but this is unlikely to cause any noticeable performance degradation, unless the
supporting Ranks are already highly utilized by normal I/O activity.
Additionally, if you have many FlashCopy pairs to manage, try to balance the DS6000
source and target volumes across the two DS6000 servers, so that the copy tasks are able
to exploit more of the internal bandwidth within the DS6000 server.
Evenly distribute FlashCopy source volumes throughout available source Ranks. Likewise,
distribute target volumes evenly throughout target Ranks.
There is no requirement for source and target volumes to have the same RAID levels.
Metro Mirror
This function, formerly known as synchronous Peer-to-Peer Remote Copy, or PPRC, provides
a synchronous real-time mirror of logical volumes in a second DS6000 (or DS8000 or
ESS750 or ESS 800). Every host write to the source logical volume is copied to the target
logical volume before acknowledging write completion to the application, maintaining the pair
of volumes in a duplex relationship, as shown in Figure 14-3 on page 444.
Figure 14-3 Metro Mirror synchronous write sequence between the primary and secondary
DS6000
When the application performs a write update operation to a primary volume, this is what
happens:
1. Write to primary volume (DS6000 cache).
2. Write to secondary (DS6000 cache).
3. Signal write complete on the secondary DS6000.
4. Post I/O complete to host server.
Metro Mirror utilizes Fibre Channel Protocol to communicate between the pair of participating
DS6000s, or between the DS6000 and its remote partner storage subsystem. Care should be
taken to ensure that the paths have adequate bandwidth. FC paths are set up between the
LSS that the source volume resides in, and the LSS that contains the target volume.
Some decisions need to be made when setting up each Metro Mirror logical volume pair as to
which of the following actions should be taken if the paths become unavailable:
Keep accepting writes to the primary volume and allow the Metro Mirror process to track
the changes, resynchronizing the pair when the paths become available. This enables the
application to keep running; but in the event of a primary site failure while the paths are
unavailable, the secondary volume will not be current, and transactions that occurred
since the path interruption are likely to be lost.
Suspend all updates to primary volume. This is disruptive to the application, but maintains
the best data integrity.
The redbook IBM TotalStorage DS6000: Concepts and Architecture, SG24-6471 introduces
the concept of data consistency, and Consistency Groups. For Metro Mirror, consistency
requirements are managed through use of the Consistency Group option when you are
defining Metro Mirror paths between pairs of LSSs. Volumes or LUNs which are paired
between two LSSs whose paths are defined with the Consistency Group option can be
considered part of a Consistency Group.
Consistency is provided by means of the extended long busy (for z/OS) or queue full (for open
systems) conditions. These are triggered when the DS6000 detects a condition where it
cannot update the Metro Mirror secondary volume. The volume pair that first detects the error
will go into the extended long busy or queue full condition, such that it will not do any I/O. For
z/OS a system message will be issued (IEA494I state change message); for open systems an
SNMP trap message will be issued. These messages can be used as a trigger for automation
purposes - to provide data consistency by use of the Freeze/Run (or Unfreeze) commands.
For further discussion of these, refer to the Metro Mirror options that follow.
Important: The data on disk at the secondary site is an exact mirror of that at the primary
site. Remember that any data still in host system buffers or processor memory is not yet on
disk and so will not be mirrored to the secondary site. This is a similar situation to a power
failure in the primary site.
Freeze and unfreeze commands - The Metro Mirror freeze and run commands are used
by automation processes to ensure data consistency; we discuss their usage in this
section. The freeze and unfreeze commands are available only through the DS CLI, not
the DS GUI.
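As an illustration only (the storage image IDs and LSS pair are hypothetical), an automation process might issue the freeze and run commands through the DS CLI as follows:

freezepprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 01:01
unfreezepprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 01:01

The freezepprc command suspends the paths for the listed source:target LSS pair and holds write I/O, and unfreezepprc (the run command) releases the extended long busy or queue full condition so that application I/O can continue.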
Critical attribute - Consistency Group and critical mode combination - The -critmode
parameter of the mkpprc command and the -consistgrp parameter of the mkpprcpath
command control this behavior.
Important: The DS6000 supports Fibre Channel (FC) Metro Mirror links only.
Metro Mirror pairs are set up between volumes in LSSs, usually in different disk subsystems,
and these are normally in separate locations. To establish a Metro Mirror pair, there must be
a Metro Mirror path between the LSSs in which the volumes reside. These paths are
bidirectional and can be shared by any Metro Mirror pairs going between the same LSSs in
the same direction. For bandwidth and redundancy, more than one path can be created
between the same LSSs; Metro Mirror balances the workload across the available paths
between the primary and secondary.
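As a minimal sketch (the storage image IDs, remote WWNN, I/O port pair, and volume IDs are hypothetical), defining a path with the Consistency Group option and establishing a Metro Mirror pair through the DS CLI might look like this:

mkpprcpath -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 -remotewwnn 500507630EFFFC6D -srclss 01 -tgtlss 01 -consistgrp I0001:I0001
mkpprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 -type mmir 0100:0100

The mkpprcpath command creates the logical path between LSS 01 on each subsystem over the I/O port pair I0001:I0001, and mkpprc establishes volume 0100 as a synchronous (Metro Mirror) pair with volume 0100 on the secondary.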
Note: Remember that the LSS is not a physical construct in the DS6000. Volumes in an
LSS can come from multiple disk Arrays.
Metro Mirror pairs can only be established between storage control units of the same (or
similar) type and features. For example, a DS6000 can have a Metro Mirror pair with another
DS6000, a DS8000, an ESS 800, or an ESS 750. It cannot have a Metro Mirror pair with an
RVA or an ESS F20. Note that all disk subsystems must have the appropriate Metro Mirror
feature installed (for DS6000 the Remote Mirror and Copy 2244 function authorization model,
which is 2244 Model RMC). If your DS6000 is being mirrored to an ESS disk subsystem, the
ESS must have PPRC Version 2 (which supports Fibre Channel PPRC links).
A path (or group of paths) needs to be established from the LSS to each LSS with related
secondaries. Also, a path (or group of paths) must be established to the LSS from each LSS
with related primaries. These logical paths are transported over physical links between the
disk subsystems. The physical link includes the host adapter in the primary DS6000, the
cabling, switches or directors, any wide band or long distance transport devices (DWDM,
channel extenders, WAN) and the host adapters in the secondary disk subsystem. Physical
links can carry multiple logical Metro Mirror paths, as shown in Figure 14-4 on page 447.
Although one FCP link would have sufficient bandwidth for most Metro Mirror environments,
for redundancy reasons we recommend configuring two Fibre Channel links between each
primary and secondary disk subsystem.
Metro Mirror FCP links can be direct connected, or connected by up to two switches.
Dedicating Fibre Channel ports for Metro Mirror use guarantees no interference from host I/O
activity. This is recommended with Metro Mirror, which is time critical and should not be
impacted by host I/O activity. The Metro Mirror ports used provide connectivity for all LSSs
within the DS6000 and can carry multiple logical Metro Mirror paths.
Logical paths
A Metro Mirror logical path is a logical connection between the sending LSS and the receiving
LSS. An FCP link can accommodate multiple Metro Mirror logical paths.
Figure 14-5 on page 448 shows an example where we have a 1:1 mapping of source to target
LSSs, and where the three logical paths are accommodated in one Metro Mirror link:
LSS1 in DS6000 1 to LSS1 in DS6000 2
LSS2 in DS6000 1 to LSS2 in DS6000 2
LSS3 in DS6000 1 to LSS3 in DS6000 2
Alternatively, if the volumes in each of the LSSs of DS6000 1 map to volumes in all three
secondary LSSs in DS6000 2, there will be nine logical paths over the Metro Mirror link (not
fully illustrated in Figure 14-4). Note that we recommend a 1:1 LSS mapping.
[Figure 14-5: three LSS-to-LSS logical paths carried over one Metro Mirror link.]
Bandwidth
Prior to implementing your Metro Mirror solution, you should determine your peak write-rate
bandwidth requirement. This helps ensure that you have enough Metro Mirror links in place to
support that requirement. Remember that only writes are mirrored across to the secondary
volumes.
LSS design
Since the DS6000 has made the LSS a topological construct that is not tied to a physical
Array as in the ESS, the design of your LSS layout can be simplified. It is now possible, for
example, to assign LSSs to applications without concern about under- or over-allocation of
physical disk subsystem resources. This can also simplify the Metro Mirror environment,
because it is possible to reduce the number of commands that are required for data
consistency.
Distance
The distance between your primary and secondary DS6000 subsystems will have an effect
on the response time overhead of the Metro Mirror implementation. Your IBM Field Technical
Sales Specialist (FTSS) can be contacted to assist you in assessing your configuration and
the distance implications.
Figure 14-6 shows this idea in a graphical form. DS6000 #1 has Metro Mirror paths defined to
DS6000 # 2, which is in a remote location. On DS6000 #1, volumes defined in LSS 00 are
mirrored to volumes in LSS 00 on DS6000 #2 (volume P1 is paired with volume S1, P2 with
S2, P3 with S3, and so on). Volumes in LSS 01 on DS6000 #1 are mirrored to volumes in LSS
01 on DS6000 #2, and so on. Additional capacity can also be added in a symmetrical way,
by adding volumes to existing LSSs and by adding new LSSs when needed. For example,
adding two volumes in LSS 03 and LSS 05, and one volume in LSS 04, brings them to the
same number of volumes as the other LSSs; further volumes can then be distributed evenly
across all LSSs, or additional LSSs added.
As well as making the maintenance of the Metro Mirror configuration easier, this has the
added benefit of helping to balance the workload across the DS6000. Figure 14-6 shows a
logical configuration - this idea applies equally to the physical aspects of the DS6000. You
should attempt to balance workload and apply symmetrical concepts to other aspects of your
DS6000 (for example, the Extent Pools).
Consider a non-symmetrical configuration for a moment. For instance, the primary site could
have volumes defined on RAID 5 Arrays (Ranks) made up of 72 GB DDMs, while the
secondary site has Ranks made up of 300 GB DDMs. Because the capacity of the secondary
DDMs is larger, more volumes would be concentrated on fewer secondary Ranks, which
could make those Ranks a bottleneck for the mirrored writes.
Volumes
You will need to consider which volumes should be mirrored to the secondary site. One option
is to mirror all volumes. This is advantageous for the following reasons:
You will not need to consider whether any required data has been missed.
Users will not need to remember which logical pool of volumes is mirrored and which is
not.
Addition of volumes to the environment is simplified - you will not have to have two
processes for addition of disk (one for mirrored volumes, and another for non-mirrored
volumes).
You will be able to move data around your disk environment easily without a concern over
whether the target volume is a mirrored volume or not.
You may choose not to mirror all volumes. In this case you will need careful control over what
data is placed on the mirrored volumes (to avoid any capacity issues) and what is placed on
the non-mirrored volumes (to avoid missing any required data). One method of doing this
could be to place all mirrored volumes in a particular set of LSSs, in which all volumes have
Metro Mirror enabled, and direct all data requiring mirroring to these volumes.
Performance considerations
Some basic things you should consider:
The process of getting the primary and secondary Metro Mirror volumes into a
synchronized state is called the initial establish. Each link I/O port will provide a maximum
throughput. Multiple LUNs in the initial establish will quickly saturate the links. This is
referred to as the aggregate copy rate and is dependent primarily on the number of links,
or bandwidth between sites. It is important to understand this copy rate to have a realistic
expectation about how long the initial establish will take to complete.
Production I/O will be given priority over DS6000 replication I/O activity. High production
I/O activity will negatively affect both initial establish data rates and synchronous copy data
rates.
We recommend that you do not share the Metro Mirror link I/O ports with host attachment
ports. Sharing can result in unpredictable Metro Mirror performance and a much more
complicated diagnosis in case of performance problems.
Distance is an important factor for both the initial establish data rate and synchronous write
performance. Data must travel to the other site, and the acknowledgement must travel back;
add to this the latency of any active components along the way. A good rule of thumb is to
allow 1 ms of additional response time per 100 km for a write I/O.
Distance also affects the establish data rate.
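As a small sketch of this rule of thumb (the base service time and distance below are assumptions for illustration), the expected synchronous write response time can be estimated as follows:

#!/bin/ksh
# Estimate Metro Mirror write response time: base time + 1 ms per 100 km.
base_ms=0.5      # assumed zero-distance write service time in ms
km=50            # assumed one-way distance between the sites
echo "$base_ms $km" | awk '{printf "estimated write response time: %.2f ms\n", $1 + $2/100}'

For a 50 km link this yields 0.5 + 0.5 = 1.0 ms per write.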
Scalability
The DS6000 Metro Mirror environment can be scaled up or down as required. If new volumes
are added to the DS6000 that require mirroring, they can be dynamically added. If additional
Metro Mirror paths are required, they also can be dynamically added.
Addition of capacity
As we have previously mentioned, the logical nature of the LSS has made a Metro Mirror
implementation on the DS6000 easier to plan, implement, and manage. However, if you need
to add more volumes or paths to your Metro Mirror environment, your management and
automation solutions should be set up to handle this.
Global Copy
This function, formerly known as PPRC-Extended Distance, copies data non-synchronously
and over longer distances than is possible with the Metro Mirror implementation.
When operating in Global Copy mode, the source volume sends a periodic, incremental copy
of updated tracks to the target volume, instead of sending a constant stream of updates. This
causes less impact to application writes for source volumes and less demand for bandwidth
resources, while allowing a more flexible use of the available bandwidth.
Global Copy does not keep a strict sequence of write operations, but you can make a
consistent copy through a periodic synchronization process (called a go-to-sync operation).
The Global Copy logical process differs from the Metro Mirror process as shown here in
Figure 14-7 on page 452.
[Figure 14-7: Global Copy write sequence - (1) write to the primary volume in the DS6000, (2) the primary DS6000 acknowledges to the application system that the write is complete; the data is then transferred to the secondary asynchronously.]
Performance improvement
The new technology used within the DS6000 means that the response time penalty for
synchronous mirrored writes on a DS6000 is less than that observed on the ESS 800, which
was slightly greater than one millisecond for a zero-distance 4 KB Metro Mirror write. Under
similar conditions, the response time penalty on the DS6000 is slightly less than one
millisecond.
Figure 14-8 Global Copy and Metro Mirror state change logic
When you initially establish a mirror relationship from a volume in Simplex state, you have
the option to request that it become a Global Copy pair (establish Global Copy arrow in
Figure 14-8), or a Metro Mirror pair (establish Metro Mirror arrow in Figure 14-8).
Pairs can change from the Copy Pending state to the Full Duplex state when a go-to-sync
is commanded (go-to-sync arrow in Figure 14-8).
You can also request that a pair be suspended as soon as it reaches the Full Duplex state
(go-to-sync and Suspended in Figure 14-8).
Pairs cannot change directly from the Full Duplex state to the Copy Pending state. They
need to go through an intermediate Suspended state.
You can go from the Suspended state to the Global Copy state by doing an incremental
copy (copying out-of-sync tracks only). This is similar to the traditional transition from the
Suspended state to the synchronous state (Resync/copy out-of-sync arrow in Figure 14-8).
The DS6000 Interoperability Matrix is available at:
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
In order to use the Global Copy function you have to purchase the Remote Mirror and Copy
feature for the primary and secondary DS6000 systems.
For Global Copy paths you need at least one Fibre Channel connection between the two
DS6000 subsystems you want to set up in a physical Global Copy relationship. For higher
availability you must use at least one host Fibre Channel connection from each of the two
DS6000 servers.
The Fibre Channel ports used for Global Copy can be dedicated, meaning that they are used
only for Global Copy paths, or they can be shared between Global Copy and Fibre Channel
host data traffic; in the latter case you also need Fibre Channel switches for connectivity.
For supported SAN switches you can refer to the DS6000 Interoperability Matrix.
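For illustration (the storage image and volume IDs are hypothetical), a Global Copy pair is established with the same DS CLI mkpprc command as Metro Mirror, but with -type gcp:

mkpprc -dev IBM.1750-1300247 -remotedev IBM.1750-1300811 -type gcp 0200:0200

The pair remains in Copy Pending state, and the primary periodically sends the updated tracks to the secondary as described above.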
[Figure: a single physical Fibre Channel link between two DS6000s can carry up to 256 bidirectional logical paths between LSS pairs.]
The paths are defined between the pair of LSSs that contain your source and target logical
volumes.
When using channel extender products with Global Copy, the channel extender vendor will
determine the maximum distance supported between the primary and secondary DS6000.
The channel extender vendor should be contacted for their distance capability, line quality
requirements, and WAN attachment capabilities.
A complete and current list of Global Copy supported environments, configurations, networks,
and products is available in the DS6000 Interoperability Matrix.
The channel extender vendor should be contacted regarding hardware and software
prerequisites when using their products in a DS6000 Global Copy configuration. Evaluation,
qualification, approval, and support of Global Copy configurations using channel extender
products is the sole responsibility of the channel extender vendor.
A simple way to envision DWDM is to consider that at the primary end, multiple fibre optic
input channels such as ESCON, Fibre Channel, FICON, or Gbit Ethernet, are combined by
the DWDM into a single fiber optic cable. Each channel is encoded as light of a different
wavelength. You might think of each individual channel as an individual color: the DWDM
system is transmitting a rainbow. At the receiving end, the DWDM fans out the different optical
channels. DWDM, by the very nature of its operation, utilizes the full bandwidth capability of
the individual channel. As the wavelength of light is, from a practical perspective, infinitely
divisible, DWDM extension technology is only limited by the sensitivity of its receptors. Thus, a
high aggregate bandwidth is possible.
A complete and current list of Global Copy supported environments, configurations, networks,
and products is available in the DS6000 Interoperability Matrix.
The DWDM vendor should be contacted regarding hardware and software prerequisites when
using their products in a DS6000 Global Copy configuration.
If you are going to have tertiary copies, then within the target Storage Image you should have
an available set of volumes ready to become the FlashCopy target. If your next step is to
dump the tertiary volumes onto tapes, then you must ensure that the tape resources are
capable of handling these dump operations in between the point-in-time checkpoints, unless
you have additional sets of volumes ready to become alternate FlashCopy targets within the
secondary Storage Images.
14.4.7 Performance
As the distance between DS6000s increases, Metro Mirror response time is proportionally
affected, and this negatively impacts the application performance. When implementations
over extended distances are needed, Global Copy becomes an excellent trade-off solution.
You can estimate the application impact of Global Copy as roughly that of an application
working with suspended Metro Mirror volumes. For the DS6000, there is slightly more work to
do with Global Copy volumes than with suspended volumes, because with Global Copy the
changes have to be sent to the remote DS6000; but this is a negligible overhead for the
application compared with the typical synchronous overhead.
Your Global Copy volume pairs consume no host processor resources (CPU and memory),
excluding your management solution, because the copying is managed by your DS6000
subsystem.
14.4.8 Scalability
The DS6000 Global Copy environment can be scaled up or down as required. If new volumes
are added to the DS6000 that require mirroring, they can be dynamically added. If additional
Global Copy paths are required, they also can be dynamically added.
This chapter discusses performance aspects when planning and configuring for Global Mirror
together with the potential impact to application write I/Os caused by the process used to form
a Consistency Group.
We also consider distributing the target Global Copy and target FlashCopy volumes across
different Ranks to balance load over the entire target storage server and minimize the I/O
load for selected busy volumes.
[Figure: the Global Mirror automatic cycle in an active session - (1) the host writes to the A volume, (2) the write is acknowledged immediately; Global Copy replicates the A volume to the B volume asynchronously, and the B volume is FlashCopied to the C volume automatically.]
In this example, the A volumes at the local site are the production volumes and are used as
Global Copy primary volumes. The data from the A volumes is replicated to the B volumes,
which are Global Copy secondary volumes. At a certain point in time, a Consistency Group is
created using all of the A volumes, even if they are located in different DS6000 or ESS boxes.
This has minimal application impact because the creation of the Consistency Group is very
quick (in the order of milliseconds).
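As a sketch of how the A volumes join a Global Mirror session (the session number, LSS, and volume range are hypothetical), the DS CLI mksession and chsession commands might be used as follows:

mksession -dev IBM.1750-1300247 -lss 10 01
chsession -dev IBM.1750-1300247 -lss 10 -action add -volume 1000-1003 01

Here session 01 is opened on LSS 10 and four Global Copy primary volumes are added to it; Consistency Group formation then operates on all volumes in the session.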
Note: The copy created with Consistency Group is a power-fail consistent copy, not
necessarily an application-based consistent copy. When you use this copy for recovery,
you may need to perform additional recovery operations, such as the fsck command in an
AIX filesystem.
Once the Consistency Group is created, the application writes can continue updating the A
volumes. The incremental changes that are part of the consistent data are sent to the B
volumes using the existing Global Copy relationship. As soon as all the consistent data
reaches the B volumes, it is FlashCopied to the C volumes.
The C volumes now contain a consistent copy of data. Because the B volumes normally
contain a fuzzy copy of the data from the local site (but not when doing the FlashCopy), the C
volumes are used to hold the most recent point-in-time consistent data while the B volumes
continue to be updated by the Global Copy relationship.
If a disaster occurs during the FlashCopy of the data, special procedures are needed to
finalize the FlashCopy.
In the recovery phase, the consistent copy is created in the B volumes. You need to
consider developing some operational processes to check and create the consistent copy.
You need to check the status of the B volumes for the recovery operations. Generally, these
check and recovery operations are complicated and difficult to perform with the SM GUI or
DS CLI in a disaster situation. Therefore, you may want to use management tools (for
example, the Global Mirror Utilities) or management software (for example, TDP for Disk
Replication Manager) to automate this recovery procedure.
The data at the remote site is typically current within 3 to 5 seconds, but this recovery point
objective (RPO) depends on the workload and on the bandwidth available to the remote site.
Note: Global Mirror can also be used for failover and failback operations. A failover
operation is the process of temporarily switching production to a backup facility (normally
your recovery site) following a planned outage, such as a scheduled maintenance period, or
an unplanned outage, such as a disaster. A failback operation is the process of returning
production to its original location. These operations use Remote Mirror and Copy functions
to help reduce the time that is required to synchronize volumes after the sites are switched
during a planned or unplanned outage.
This means that we should analyze performance at the production site, at the recovery site,
as well as between both sites, with the objective of providing a stable Recovery Point
Objective without significantly impacting production.
At the production site, where production I/O always has a higher priority over DS6000
replication I/O activity, the storage server needs resources to handle both loads. If your
primary storage server is already overloaded with production I/O, the potential delay
before a Consistency Group can be formed may become unacceptable.
The bandwidth between both sites needs to be sized for production load peaks.
At the recovery site, even if there is no local production I/O workload, it is hosting the
target Global Copy volumes and handling the inherent FlashCopy processing and will
need some performance evaluation.
This section looks at the aggregate impact of Global Copy and FlashCopy in the overall
performance of Global Mirror.
Remember that Global Copy itself has minimal or no significant impact on the response time
of an application write I/O to a Global Copy primary volume.
Each Global Copy write to its secondary volume during the time period between the formation
of successive Consistency Groups causes an actual FlashCopy write I/O operation on the
target server. This is described in Figure 14-13 where we summarize approximately what
happens between two Consistency Group creation points when application writes come in.
Figure 14-13 Global Copy with write hit at the remote site
1. The application write I/O completes immediately to volume A1 at the local site.
2. Eventually Global Copy replicates the application I/O and reads the data at the local site to
send to the remote site.
3. The modified track gets written across the link to the remote B1 volume.
4. FlashCopy nocopy detects that the track is about to change.
5. The original track is copied from B1 to the C1 volume.
This is an approximation of the sequence of internal I/O events. There are optimization and
consolidation effects which make the entire process quite efficient.
Figure 14-13 showed the normal sequence of I/Os within a Global Mirror configuration. The
critical path is between (2) and (3). Usually (3) is simply a write hit in the Persistent Cache (or
NVS) in B1 and some time later, and after (3) completes, the original FlashCopy source track
is copied from B1 to C1.
If persistent memory (non-volatile cache) is overcommitted in the secondary storage server,
there is some potential impact on Global Copy data replication performance. Figure 14-14
summarizes roughly what happens when persistent cache or NVS in the remote storage
server is overcommitted: a read (3) and a write (4) to preserve the source track and write it to
the C volume are required before the write (5) can complete. Eventually the track gets
updated on the B1 volume to complete write (5). But usually all writes are quick writes to
cache and persistent memory and happen in the order that Figure 14-12 on page 459
outlines.
A write I/O to the FlashCopy source volume also triggers the maintenance of a bitmap for the
source volume, which is created when the FlashCopy volume pair is established with the start
change recording attribute. This allows only the change recording bitmap to be replicated to
the corresponding bitmap for the target volume in the course of forming a Consistency
Group. A more detailed explanation of this processing may be found in IBM TotalStorage
DS6000 Series: Copy Services in Open Environments, SG24-6783, and IBM TotalStorage
DS6000 Series: Copy Services with IBM eServer zSeries, SG24-6782.
Note: This all applies only to write I/Os to Global Mirror primary volumes.
[Figure 14-15 shows the three phases of forming a Consistency Group: (1) serialize all Global Copy primary volumes, briefly holding write I/Os (the coordination time), (2) drain the remaining data from the local to the remote site over the PPRC paths (the drain time), and (3) perform the FlashCopy from the secondary (B) volumes to the tertiary (C) volumes.]
Figure 14-15 Coordination time - how does it impact application write I/Os?
The coordination time, which you can limit by specifying a number of milliseconds, is the
maximum impact to an application’s write I/Os you will allow, when forming a Consistency
Group. The intention is to keep the coordination time as small as possible. The default of 50
ms may be a bit high in a transaction processing environment; a valid number may also be
in the single-digit range. The required communication between the Master storage server and
potential Subordinate storage servers is inband over PPRC paths between the Master and
Subordinates. This communication is highly optimized and allows you to minimize the
potential application write I/O impact to 3 ms, for example. There must be at least one PPRC
FCP link between a Master storage server and each Subordinate storage server, although for
redundancy we recommend you use two PPRC FCP links.
The following example addresses the impact of the coordination time when Consistency
Group formation starts and whether this impact has the potential to be significant or not.
Assume a total aggregate of 5,000 write I/Os per second across two primary storage servers,
with 2,500 write I/Os per second to each storage server. Each write I/O takes 0.5 ms. You
specified a maximum of 3 ms to coordinate between the Master storage server and its
Subordinate storage server. Assume further that a Consistency Group is created every 3
seconds, which is the goal with a Consistency Group interval time of zero.
5,000 write I/Os
0.5 ms response time for each write I/O
Maximum coordination time is 3 ms
Every 3 seconds a Consistency Group is created
This is 5 I/Os every millisecond, or 15 I/Os within 3 ms. So each of these 15 write I/Os
experiences a 3 ms delay, and this happens every 3 seconds. We therefore observe an
average response time delay of approximately:
(15 I/Os x 0.003 s) / (3 s x 5,000 I/O per second) = 0.000003 s, or 0.003 ms
The average response time increases from 0.5 ms to 0.503 ms in this example.
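The same arithmetic can be generalized in a small script; this is an illustrative sketch only, with the input values taken from the example above:

#!/bin/ksh
# Average write delay added by Consistency Group coordination.
iops=5000        # aggregate write I/Os per second (assumed)
coord_ms=3       # maximum coordination time in ms (assumed)
interval_s=3     # seconds between Consistency Groups (assumed)
echo "$iops $coord_ms $interval_s" | awk '{
  delayed = $1 / 1000 * $2     # writes arriving during the coordination window
  total   = $1 * $3            # writes per Consistency Group interval
  printf "average added delay: %.4f ms\n", delayed * $2 / total
}'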
This drain period is the time required to replicate all remaining data for the Consistency
Group from the primary to the secondary storage server. It needs to complete within the limit
set by the maximum drain time parameter. The default is 30 seconds, which may be too small
in an environment with a write-intensive workload; a value in the range of 300 to 600 seconds
may need to be considered.
The actual replication process usually does not impact the application write I/O. There is a
very slight chance that the very same track within a Consistency Group is updated before
that track has been replicated to the secondary site within the specified drain time period.
When this unlikely event happens, the affected track is immediately replicated to the
secondary storage server before the application write I/O modifies the original track. The
affected application write I/O experiences a response time similar to that of an I/O written to a
Metro Mirror primary volume.
Note that subsequent writes to this same track do not experience any delay, because the
track has already been replicated to the remote site.
There are two loads to consider: the production load at the production site, and the Global
Mirror load on both sites.
There are only production volumes at the production site. There, fast disk and cache can be
sized to fit production needs. At the same time, the storage server needs to be able to handle
both the production and the replication workload. But because only source Global Copy
volumes are hosted there, beyond balancing the full production load across all the storage
server Ranks and both servers of the cluster, nothing else can really be done.
On the other hand, at the recovery site only Global Mirror volumes have to be considered, but
they are of two types: target Global Copy and target FlashCopy. Because the target Global
Copy volumes are modified when the target FlashCopy volumes are not, and vice versa, both
types of volumes can share the same Ranks, possibly built from larger disks, since the
storage at the recovery site is consolidated and shared between target Global Copy and
target FlashCopy volumes that are not active at the same time.
Figure 14-16 Remote storage server configuration, all Ranks contain equal numbers of volumes
The goal is to put the same number of each volume type into each Rank. By volume type, we
refer here to the B volumes and C volumes within a Global Mirror configuration. To avoid
performance bottlenecks, spread busy volumes over multiple Ranks; otherwise, hot spots
may be concentrated on single Ranks when you put a B volume and its C volume on the very
same Rank. We recommend spreading B and C volumes as Figure 14-16 suggests.
Another approach is to focus on very busy volumes and keep those volumes on separate
Ranks from all the other volumes.
With mixed Disk Drive Module (DDM) capacities and speeds at the remote storage server,
consider spreading the B volumes not only over the fast DDMs but over all Ranks, following a
similar approach to the one Figure 14-16 recommends. You may keep particularly busy B and
C volumes on the faster DDMs.
[Figure 14-17: remote storage server configuration with B and C volumes spread across Ranks 1 to 3, and additional D volumes placed on a separate Rank 4.]
In addition to the three Global Mirror volume types, Figure 14-17 also shows D volumes,
which may be created from time to time for test or other purposes. Here we suggest, as an
alternative, a Rank with larger and perhaps slower DDMs. The D volumes may be read from
another host, and any other I/O to the D volumes does not impact the Global Mirror volumes
in the other Ranks. Note that a NOCOPY relationship between B and D volumes will read the
data from B when it is accessed through the D volume; so you may consider a physical
COPY when you create D volumes on a different Rank. This separates additional I/O to the D
volumes from I/O to the Ranks holding the B volumes.
An option here may be to spread all B volumes across all Ranks again and also configure the
same number of volumes in each Rank.
Still, put the B and C volumes in different Ranks. It is further recommended to configure
corresponding B and C volumes in such a way that these volumes have an affinity to the
same server. Ideally, the B volumes would also be connected to a different DA pair than the
C volumes.
Tip: We recommend starting with the default values for these parameters.
The default for the maximum drain time is 30 seconds, which is normally sufficient time to
transfer a consistency group to the target system while still allowing for intermittent
communications issues. The system will favor host I/O activity at the expense of forming
consistency groups.
If it has been unable to form consistency groups for 30 minutes, Global Mirror forms a
consistency group irrespective of the maximum drain time setting.
However, increasing the maximum drain time also increases the time between successive
FlashCopies at the remote site, and increasing this value may be counterproductive in
high-bandwidth environments, because frequent consistency group formation reduces the
overhead of copy-on-write processing.
The default for the Consistency Group Interval is 0 seconds, so Global Mirror continuously
forms consistency groups as fast as the environment allows. We recommend leaving this
parameter at the default and allowing Global Mirror to form consistency groups as fast as
possible for the workload, because it automatically changes to Global Copy mode for a
period of time if the drain time is exceeded.
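For illustration (the storage image ID, master LSS, and session number are hypothetical), these tuning parameters are specified when the Global Mirror session is started with the DS CLI mkgmir command; the values shown correspond to the defaults discussed above:

mkgmir -dev IBM.1750-1300247 -lss 10 -cginterval 0 -coordinate 50 -drain 30 -session 01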
Three-site solution
The combination of Metro Mirror and Global Copy, called Metro/Global Copy, is currently
available on the ESS 750 and ESS 800. It is a three-site approach that offers very good
recovery in the event of the failure of any one or two of the three sites. You first copy your
data synchronously to an intermediate site, and from there you copy it asynchronously to a
more distant site. For an overview of this solution, refer to “z/OS Metro/Global Mirror” in IBM
TotalStorage DS8000 Series: Copy Services with IBM eServer zSeries, SG24-6787.
z/OS Metro/Global Mirror uses z/OS Global Mirror to mirror primary site data to a remote
location, and also uses Metro Mirror for primary site data to a location within Metro Mirror
distance limits. This gives you a three-site high-availability and disaster recovery solution.
Figure 14-18 shows an example of a three-site z/OS Metro/Global Mirror solution.
[Figure 14-18: Metro Mirror at metropolitan distance between Site 1 and Site 2, and z/OS Global Mirror at unlimited distance from Site 2 to Site 3.]
In the example shown in Figure 14-18, we see that the zSeries server in Site 1 is normally
accessing DS8000 disk at Site 2. These disks are mirrored back to Site 1 to another DS8000
via Metro Mirror. At the same time, the Site 2 disks are using z/OS Global Mirror pairs
established to Site 3, which can be at continental distances from Site 2. This covers the
following potential failure scenarios:
Site 1 disaster - Site 3 can be brought up after the Site 2 updates have completed mirroring
across to Site 3. The Site 3 FlashCopy disk could be used to preserve a copy of the original
recovery point.
Site 2 disaster - Site 1 can readily switch to using the Site 1 disk. Mirroring to Site 3 will be
suspended.
Tip: z/OS Global Mirror, also known as extended remote copy or XRC, is available on the
DS8000, DS6000, ESS 750 and the ESS 800, but it is recommended that you only use the
DS6000 as an XRC target storage server.
For a more complete discussion of z/OS Global Mirror (XRC), refer to IBM TotalStorage
DS8000 Series: Performance Monitoring and Tuning, SG24-7146, and IBM TotalStorage
DS8000 Series: Copy Services with IBM eServer zSeries, SG24-6787.
DFSMSdss will implicitly or explicitly utilize the storage subsystem’s FlashCopy functionality
for volume and data set copies, if it finds that the source and targets are eligible. The user will
notice that a fast copy capability has been used in the SYSLOG messages from the utility.
There is no requirement for the user to change their JCL, or to code extra parameters to
achieve this, because the default option of FASTREPLICATION(PREFERRED) for the COPY
function attempts to use FlashCopy.
Here are some of the conditions that will result in DFSMSdss using FlashCopy within the
DS6000 to complete some volume replication functions:
The source and target devices must have the same track format.
The source volumes and target volumes are in the same DS6000.
The source and target volumes must be online.
No other FlashCopy relationship is active for the same source volume. If one is, DFSMSdss
performs the copy in host software instead.
The FASTREPLICATION(NONE) keyword must not be specified.
Detailed information can be found in the IBM publication z/OS DFSMSdss Storage
Administration Reference SC35-0424.
Tip: We recommend the use of DFSMSdss for full volume copies where possible, because
not all tracks on the volume will normally need to be copied. When DFSMSdss invokes
FlashCopy for a full volume copy, it requests FlashCopy for allocated extents only, which
can lead to more effective copying than with alternative invocations.
To balance excluding free space against limiting the number of FlashCopy relationships, up
to 255 copy relationships may be created for each full-volume copy. If there are more than
255 separate allocated extents on the source volume, the DFSMSdss copy function does
some initial extent management before starting the FlashCopy: some extents are merged (to
reduce the number of extents to copy), resulting in some free space being copied for a highly
fragmented logical volume.
Here are some high-level suggestions covering aspects of Copy Services you may want to
manage and evaluate.
Be aware that the DS6000 limits the possible impact that Copy Services tasks might have on
normal user I/O, in order to favor user I/O. This means that you should not expect all copy
tasks to complete at the same time, because they have restricted access to server resources
during their copy operations. Consider establishing your copy pairs at different times in order
to spread this server workload.
FlashCopy implementation
Consider the use of Incremental FlashCopy if you make regular copies of a volume: fewer
background copy operations are involved each time a fresh copy is made, and it completes
faster. This has the added optional feature of a reverse restore for recovery from the target
back to the source.
Consider the use of FlashCopy Consistency Groups for much faster database recovery
(recovery becomes possible in minutes rather than many hours).
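As an illustration of Incremental FlashCopy (the volume IDs are hypothetical), the initial copy is established with change recording enabled, and each refresh then copies only the changed tracks:

mkflash -dev IBM.1750-1300247 -record -persist 0100:0200
resyncflash -dev IBM.1750-1300247 -record -persist 0100:0200

The -record option starts change recording and -persist keeps the relationship after the background copy completes, which is what makes the subsequent resyncflash incremental.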
Descriptions of some of the available counters for showfbvol -metrics are shown in
Example 14-1, for showrank -metrics in Example 14-2 on page 473 and for showioport
-metrics in Example 14-3 on page 473.
You should also periodically monitor traffic through your host I/O ports with the showioport
-metrics command, so that you can identify increases in I/O.
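A minimal sketch of such periodic monitoring is shown below; the device and port IDs are hypothetical, and a configured DS CLI profile is assumed:

#!/bin/ksh
# Sample DS6000 I/O port metrics every 15 minutes.
while true
do
    date
    dscli showioport -dev IBM.1750-1300247 -metrics I0001
    sleep 900
done

Comparing successive samples of the read and write counters shows how traffic through each host port grows over time.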
Appendix A. Benchmarking
Benchmarking storage systems has become very complex over the years given all of the
hardware and software parts being used for storage systems. In this appendix, we discuss
the goals and the ways to conduct an effective storage benchmark.
To conduct a benchmark, you need a solid understanding of all parts of your environment.
This understanding includes not only the storage system requirements but also the SAN
infrastructure, the server environments, and the applications. Recreating a representative
emulation of the environment, including actual applications and data, along with user
simulation, provides an efficient and accurate analysis of the performance of the storage
system being tested. A key characteristic of a performance benchmark test is that its results
must be reproducible to validate the test's integrity.
Performance is not the only component that should be considered in benchmark results.
Reliability and cost effectiveness are also parameters that must be considered. Balancing
benchmark performance results against reliability features and the total cost of ownership of
the storage system gives you a global view of the storage product's value.
The popularity of such benchmarks depends on how representative their workloads are of
the main and newer workloads that companies are deploying today. If the generic benchmark
workloads are representative of your production, you can use the various benchmark results
to identify the product you should implement in your production environment. But if the
generic benchmark definition is not representative, or does not include your requirements or
restrictions, running a dedicated benchmark designed to be representative of your workload
gives you the ability to choose the right storage system.
The OLTP category typically has many users, all accessing the same disk storage system
and a common set of files. The requests are typically spread across many files, therefore the
file sizes are typically small and randomly accessed. Typical applications consist of a network
file server or disk subsystem being accessed by a sales department entering order
information.
To identify the specificity of your production workload, you can use monitoring tools available
at the operating system level.
The first way, and the most complex, is to set up the production environment, including the
application software and the application data. In this case, you have to ensure that the
application is well configured and optimized on the server operating system. The data volume
also has to be representative of the production environment. Depending on your application,
the workload can be generated using application scripts or an external transaction simulation
tool. These kinds of tools simulate users accessing your application; you use workload tools
to provide application stress from end to end. To configure an external simulation tool, you
first record a standard request from a single user and then generate this request several
times. This can emulate hundreds or thousands of concurrent users, putting the application
through the rigors of real-life user loads while measuring the response times of key business
processes. Examples of available software include IBM Rational Software and Mercury
LoadRunner.
The other way to generate the workload is to use a standard workload generator. These tools,
specific to each operating system, produce different kinds of workloads. You can configure
and tune these tools to match your application workload. The main tuning parameters include
the type of workload (sequential or random), the read/write ratio, the I/O block size, the
number of I/Os per second, and the test duration. With a minimum of setup, these simulation
tools can help you recreate your production workload without setting up all of the software
components. Examples of available software include iozone and iometer.
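For example (the test file location and sizes are arbitrary choices), a basic iozone run measuring sequential write and read throughput with a 64 KB record size against a 1 GB test file might look like this:

iozone -i 0 -i 1 -r 64k -s 1g -f /mnt/ds6k/iozone.tmp

Here -i 0 and -i 1 select the write and read tests, -r sets the record (I/O block) size, and -s sets the file size; the file should be large enough to exceed host caching effects.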
Attention: Each workload test must be defined with a minimum time duration in order to
eliminate any side effects or warm-up period, such as populating cache, which could
generate incorrect results.
Note that monitoring can have an impact on component performance. In that case, you
should use the monitoring tools in the first sequence of tests to understand your workload,
and then disable them in order to eliminate any impact that could distort performance results.
During a benchmark, each scenario has to be run several times: first, to understand how the
different components are performing (using monitoring tools) and to identify bottlenecks;
then, to test different ways of obtaining an overall performance improvement by tuning each
of the different components.
By downloading the Acrobat PDF version of this publication, you should be able to copy and
paste these scripts for easy installation on your host systems. To function properly, the scripts
presented here rely on:
An AIX host running AIX 5L
Subsystem Device Driver (SDD) for AIX Version 1.3.1.0 or later
Attention: These scripts are provided on an ‘as is’ basis. They are not supported or
maintained by IBM in any formal way. No warranty is given or implied, and you cannot
obtain help with these scripts from IBM.
vgmap
The vgmap script displays which vpaths a Volume Group uses and also which Rank each
vpath belongs to. Use this script to determine if a Volume Group is made up of vpaths on
several different Ranks and which vpaths to use for creating striped logical volumes.
Example output of the vgmap command is shown in Example B-1. The vgmap shell script is in
Example B-2.
# AIX - vgmap (excerpt)
# workfile and sortfile locations are assumed defaults; the Rank-lookup
# section of the full script (which uses sortfile) is omitted from this excerpt.
workfile=/tmp/vgmap.$$
sortfile=/tmp/vgmap.sort.$$
lsvg -p $1 | grep -v "PV_NAME" > $workfile
echo "\nPV_NAME RANK PV STATE TOTAL PPs FREE PPs Free D"
cat $workfile
rm -f $workfile
rm -f $sortfile
########################## THE END ######################
lvmap
The lvmap script displays which vpaths and Ranks a logical volume uses. Use this script to
determine if a logical volume spans vpaths on several different Ranks. The script does not tell
you if a logical volume is striped or not. Use lslv <lv_name> for that information or modify this
script.
An example output of the lvmap command is shown in Example B-3. The lvmap shell script is
in Example B-4.
lvmap.ksh 8000stripelv
LV_NAME RANK COPIES IN BAND DISTRIBUTION
8000stripelv:N/A
vpath4 0000 010:000:000 100% 010:000:000:000:000
vpath5 ffff 010:000:000 100% 010:000:000:000:000
# cleanup at the end of the lvmap script (excerpt)
rm $workfile
rm $sortfile
vpath_iostat
The vpath_iostat script is a wrapper program for AIX that converts iostat information based
on hdisk devices to vpaths instead.
The script first builds a map file to list hdisk devices and their associated vpaths and then
converts iostat information from hdisks to vpaths.
To run the script, make sure the SDD datapath query essmap command is working
properly—that is, all Volume Groups are using vpaths instead of hdisk devices.
Run the script with no arguments to accept the default interval and iteration count, or specify
them explicitly:
vpath_iostat <interval> <iteration>
An example of the output vpath_iostat produces is shown in Example B-5. The vpath_iostat
shell script is in Example B-6 on page 485.
##########################################################
# set the default period for number of seconds to collect
# iostat data before calculating average
period=5
iterations=1000
# ifile and essfile locations are assumed defaults for this excerpt;
# the full script in Example B-6 converts $ifile into the $essfile map
ifile=/tmp/lsvpath.out
essfile=/tmp/lsess.out
#############################################
# Create a list of the vpaths this system uses
# Format: hdisk DS-vpath
# datapath query essmap output MUST BE correct or the IO stats reported
# will not be correct
#############################################
if [ ! -f $ifile ]
then
echo "Collecting DS6000 info for disk to vpath map..."
datapath query essmap > $ifile
fi
#########################################
# ADD INTERNAL SCSI DISKS to VPATH list
#########################################
for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'`
do
echo "$internal $internal" >> $essfile
done
###############################################
# Set interval value or leave as default
if [[ $# -ge 1 ]]
then
period=$1
fi
##########################################
# Set <iteration> value
if [[ $# -eq 2 ]]
then
iterations=$2
fi
#################################################################
# ess_iostat <interval> <count>
i=0
while [[ $i -lt $iterations ]]
do
iostat $period 2 > $ofile # run 2 iterations of iostat
# first run is IO history since boot
grep hdisk $ofile > $ofile.temp # only gather hdisk info- not cd
# other devices
###########################################
#Converting hdisks to vpaths.... #
###########################################
for j in `cat $wfile | awk '{print $1}'`
do
vpath=`grep -w $j $essfile | awk '{print $2}'`
sed "s/$j /$vpath/g" $wfile > $wfile2
cp $wfile2 $wfile
done
###########################################
# Determine Number of different VPATHS used
###########################################
numvpaths=`cat $wfile | awk '{print $1} ' | grep -v hdisk | sort -u | wc -l`
print "\n$hname: Total VPATHS used: $numvpaths $dt $period sec interval"
printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Vpath:" "MBps" "tps" \
"KB/trans" "MB_read" "MB_wrtn"
###########################################
END {
if ( tpsum > 0 )
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
vpath, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000)
else
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
vpath, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000)
}' hname="$hname" vpath="$x" >> $wfile2.tmp
done
#############################################
# Sort VPATHS/hdisks by NUMBER of TRANSACTIONS
#############################################
if [[ -f $wfile2.tmp ]]
then
cat $wfile2.tmp | sort +3 -n -r
rm $wfile2.tmp
fi
##############################################################
# SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL
##############################################################
#Disks: % tm_act Kbps tps Kb_read Kb_wrtn
# field 5 read field 6 written
tail -n $pvcount $ofile.temp | grep -v "0.0 0.0 0.0 0 \
0" | awk 'BEGIN { }
{ rsum=rsum+$5 }
{ wsum=wsum+$6 }
END {
rsum=rsum/1000
wsum=wsum/1000
printf
("------------------------------------------------------------------------------------------\n")
if ( divider > 1 )
{
printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \
rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB")
}
printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n\n\n", hname, "READ SPEED: ", \
rsum/divider, "MB/sec", "WRITE SPEED: ", wsum/divider, "MB/sec" )
}' hname="$hname" divider="$period"
let i=$i+1
#
rm $ofile
rm $wfile
rm $wfile2
rm $essfile
ds_iostat
The ds_iostat script is a wrapper program for AIX that converts iostat information based on
hdisk devices to Ranks instead.
The ds_iostat script depends on the SDD datapath query essmap command and iostat.
The script first builds a map file to list hdisk devices and their associated Ranks and then
converts iostat information from hdisks to Ranks.
Run the script with no arguments to accept the default interval and iteration count, or specify
them explicitly:
ds_iostat <interval> <iteration>
An example of the ds_iostat output is shown in Example B-7. The ds_iostat shell script is in
Example B-8.
garmo-aix: Total RANKS used: 12 20:01 Sun 16 Feb 2003 5 sec interval
garmo-aix Ranks: MBps tps KB/trans MB_read MB_wrtn
garmo-aix 1403 9.552 71.2 134.2 47.8 0.0
garmo-aix 1603 6.779 53.8 126.0 34.0 0.0
garmo-aix 1703 5.743 43.0 133.6 28.8 0.0
garmo-aix 1503 5.809 42.8 135.7 29.1 0.0
garmo-aix 1301 3.665 32.4 113.1 18.4 0.0
garmo-aix 1601 3.206 27.2 117.9 16.1 0.0
garmo-aix 1201 2.734 22.8 119.9 13.7 0.0
garmo-aix 1101 2.479 22.0 112.7 12.4 0.0
garmo-aix 1401 2.299 20.4 112.7 11.5 0.0
garmo-aix 1501 2.180 19.8 110.1 10.9 0.0
garmo-aix 1001 2.246 19.4 115.8 11.3 0.0
garmo-aix 1701 2.088 18.8 111.1 10.5 0.0
------------------------------------------------------------------------------------------
garmo-aix TOTAL READ: 430.88 MB TOTAL WRITTEN: 0.06 MB
garmo-aix READ SPEED: 86.18 MB/sec WRITE SPEED: 0.01 MB/sec
##########################################################
# set the default period for number of seconds to collect
# iostat data before calculating average
period=5
iterations=1000
essfile=/tmp/lsess.out
#############################################
# Create a list of the ranks this system uses
# Format: hdisk DS-rank
# datapath query essmap output MUST BE correct or the IO stats reported
# will not be correct
#############################################
datapath query essmap|grep -v "*"|awk '{print $2 "\t" $11}' > $essfile
#########################################
# ADD INTERNAL SCSI DISKS to RANKS list
#########################################
for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'`
do
echo "$internal $internal" >> $essfile
done
###############################################
# Set interval value or leave as default
if [[ $# -ge 1 ]]
then
period=$1
fi
##########################################
# Set <iteration> value
if [[ $# -eq 2 ]]
then
iterations=$2
fi
#################################################################
# ess_iostat <interval> <count>
i=0
while [[ $i -lt $iterations ]]
do
iostat $period 2 > $ofile # run 2 iterations of iostat
# first run is IO history since boot
grep hdisk $ofile > $ofile.temp # only gather hdisk info- not cd
# other devices
###########################################
#Converting hdisks to ranks.... #
###########################################
for j in `cat $wfile | awk '{print $1}'`
do
rank=`grep -w $j $essfile | awk '{print $2}'`
sed "s/$j /$rank/g" $wfile > $wfile2
cp $wfile2 $wfile
done
###########################################
# Determine Number of different ranks used
###########################################
numranks=`cat $wfile | awk '{print $1} ' | grep -v hdisk | cut -c 1-4| sort -u -n | wc -l`
dt=`date +"%H:%M %a %d %h %Y"`
print "\n$hname: Total RANKS used: $numranks $dt $period sec interval"
printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Ranks:" "MBps" "tps" \
"KB/trans" "MB_read" "MB_wrtn"
###########################################
# Sum Usage for EACH RANK and Internal Hdisk
###########################################
for x in `cat $wfile | awk '{ print $1}' | sort -u`
do
cat $wfile | grep -w $x | awk '{ printf ("%4d\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" , \
$1, $2, $3, $4, $5, $6) }' | awk 'BEGIN {
}
{ tmsum=tmsum+$2 }
{ kbpsum=kbpsum+$3 }
{ tpsum=tpsum+$4 }
{ kbreadsum=kbreadsum+$5 }
{ kwrtnsum=kwrtnsum+$6 }
END {
if ( tpsum > 0 )
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
rank, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000)
else
printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
rank, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000)
}' hname="$hname" rank="$x" >> $wfile2.tmp
done
#############################################
# Sort RANKS/hdisks by NUMBER of TRANSACTIONS
#############################################
if [[ -f $wfile2.tmp ]]
then
cat $wfile2.tmp | sort +3 -n -r
rm $wfile2.tmp
fi
##############################################################
# SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL
##############################################################
#Disks: % tm_act Kbps tps Kb_read Kb_wrtn
# field 5 read field 6 written
tail -n $pvcount $ofile.temp | grep -v "0.0 0.0 0.0 0 0" \
| awk 'BEGIN { }
{ rsum=rsum+$5 }
{ wsum=wsum+$6 }
END {
rsum=rsum/1000
wsum=wsum/1000
printf
("------------------------------------------------------------------------------------------\n")
if ( divider > 1 )
{
printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \
rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB")
}
let i=$i+1
done
rm $ofile
rm $wfile
rm $wfile2
rm $essfile
################################## THE END ##########################
test_disk_speeds
Use the test_disk_speeds script to run a 100 MB sequential read against one raw vpath
(rvpath0) and record the speed at different times throughout the day, to get the average read
speed that a Rank is capable of in your environment.
You can change the amount of data read, the block size, and the vpath by editing the script
and changing its variables.
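The core of such a test is equivalent to timing a raw sequential read with dd, roughly as follows (the block size and count shown are one possible choice that reads about 100 MB):

#!/bin/ksh
# Time a ~100 MB sequential read from a raw vpath device.
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=800

Dividing the amount of data read by the elapsed time gives the sequential read rate for that vpath, and hence for the underlying Rank.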
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this redbook.
IBM Redbooks
For information on ordering these publications, see “How to get IBM Redbooks” on page 495.
Note that some of the documents referenced here may be available in softcopy only.
IBM TotalStorage DS6000 Series: Implementation, SG24-6781
IBM TotalStorage DS6000 Series: Copy Services in Open Environments, SG24-6783
IBM TotalStorage DS6000 Series: Copy Services with IBM eServer zSeries, SG24-6782
IBM TotalStorage DS6000 Series: Concepts and Architecture, SG24-6471
IBM TotalStorage Solutions for Business Continuity Guide, SG24-6547
The IBM TotalStorage Solutions Handbook, SG24-5250
iSeries and IBM TotalStorage: A Guide to Implementing External Disk on IBM eServer i5,
SG24-7120
Managing Disk Subsystems using IBM TotalStorage Productivity Center, SG24-7097
Using IBM TotalStorage Productivity Center for Disk to Monitor the SVC, REDP-3961
You may find the following Redbooks related to the DS8000 and ESS useful, particularly if
you are implementing a mixed systems environment with Copy Services.
IBM TotalStorage DS8000 Series: Implementation, SG24-6786
IBM TotalStorage DS8000 Series: Concepts and Architecture, SG24-6452
IBM TotalStorage DS8000 Series: Copy Services in Open Environments, SG24-6788
IBM TotalStorage DS8000 Series: Copy Services with IBM eServer zSeries, SG24-6787
IBM TotalStorage Enterprise Storage Server: Implementing ESS Copy Services in Open
Environments, SG24-5757
IBM TotalStorage Enterprise Storage Server: Implementing ESS Copy Services with IBM
eServer zSeries, SG24-5680
DFSMShsm ABARS and Mainstar Solutions, SG24-5089
Practical Guide for SAN with pSeries, SG24-6050
Fault Tolerant Storage - Multipathing and Clustering Solutions for Open Systems for the
IBM ESS, SG24-6295
Implementing Linux with IBM Disk Storage, SG24-6261
Linux with zSeries and ESS: Essentials, SG24-7025
Other publications
These publications are also relevant as further information sources:
IBM TotalStorage DS6000 Installation, Troubleshooting, and Recovery Guide, GC26-7678
Online resources
These Web sites and URLs are also relevant as further information sources:
Documentation for DS6800
http://www.ibm.com/servers/storage/support/disk/ds6800/
SDD and Host Attachment scripts
http://www.ibm.com/support/
IBM Disk Storage Feature Activation (DSFA)
http://www.ibm.com/storage/dsfa
The PSP information
http://www-1.ibm.com/servers/resourcelink/svc03100.nsf?OpenDatabase
Documentation for the DS6000
http://www.ibm.com/servers/storage/support/disk/1750.html
The interoperability matrix
http://www.ibm.com/servers/storage/disk/ds6000/interop.html
Fibre Channel host bus adapter firmware and driver level matrix
http://knowledge.storage.ibm.com/servers/storage/support/hbasearch/interop/hbaSearch.do
ATTO
http://www.attotech.com/
Emulex
http://www.emulex.com/ts/dds.html
JNI
http://www.jni.com/OEM/oem.cfm?ID=4
QLogic
http://www.qlogic.com/support/ibm_page.html
IBM
http://www.ibm.com/storage/ibmsan/products/sanfabric.html
McDATA
http://www.mcdata.com/ibm/
Cisco
http://www.cisco.com/go/ibm/storage