Abstract
This white paper describes the performance
characteristics, metrics, and testing considerations
for the EMC VPLEX family of products. Its intent is to refine
performance expectations, review key planning
considerations, and describe testing best
practices for VPLEX Local, Metro, and Geo. This
paper is not intended for planning exceptional
situations. In configuring for performance, every
environment is unique and actual results may vary.
Table of Contents
Executive summary
Audience
Introduction
  Transaction-based workloads
  Throughput-based workloads
  The Role of Applications in Determining Acceptable Performance
Section 1: VPLEX Architecture
  VPLEX hardware platform
  VPLEX GeoSynchrony 5.1 System Configuration Limits
  Read/Write IO Limits
Section 2: VPLEX Performance Highlights
  Understanding VPLEX overhead
  Native vs. VPLEX Local Performance
  OLTP Workload Example
  Native vs. VPLEX Metro Performance
  Native vs. VPLEX Geo Performance
Section 3: Hosts and Front-end Connectivity
  Host Environment
  Host Paths
  Host to director connectivity
  Host Path Monitoring
  Policy based path monitoring
  VPLEX Real-time GUI Performance Monitoring Stats
  Remote Monitoring and Scripting
  Watch4Net
  Perpetual Logs
  Benchmarking Applications, Tools and Utilities
Section 4: Application Performance Considerations
  High Transaction environments
  High Throughput environments
  VPLEX Device Geometry
Section 5: Back-end Performance Considerations
  Storage Considerations
  Storage Array Block Size
  SAN Architecture for Storage Array Connectivity
Executive summary
For several years, businesses have relied on traditional physical storage to meet their
information needs. Developments such as server virtualization and the growth of
multiple sites throughout a business's network have placed new demands on how
storage is managed and how information is accessed.
To keep pace with these new requirements, storage must evolve to deliver new
methods of freeing data from a physical device. Storage must be able to connect
to virtual environments and still provide automation, integration with existing
infrastructure, consumption on demand, cost efficiency, availability, and security.
The EMC VPLEX family is the next generation solution for information mobility and
access within, across, and between data centers. It is the first platform in the world
that delivers both Local and Distributed Federation.
VPLEX completely changes the way IT is managed and delivered, particularly when
deployed with server virtualization. By enabling new models for operating and
managing IT, resources can be federated, pooled, and made to cooperate through
the stack, with the ability to dynamically move applications and data across
geographies and service providers. The VPLEX family breaks down technology silos
and enables IT to be delivered as a service.
VPLEX resides at the storage layer, where optimal performance is vital. This document
focuses on key considerations for VPLEX performance, performance metrics, and
testing best practices. The information provided is based on VPLEX Release 5.1. The
subject is advanced, and it is assumed the reader has a basic understanding of
VPLEX technology. For additional information on VPLEX best practices and detailed
technologies, see the appendix for a reference list and hyperlinks to relevant
documents.
Audience
This white paper is intended for storage, network and system administrators who
desire a deeper understanding of the performance aspects of EMC VPLEX, the
testing best practices, and/or the planning considerations for the future growth of
their VPLEX virtual storage environment(s).
This document outlines how VPLEX
technology interacts with existing storage environments, how existing environments
might impact VPLEX technology, and how to apply best practices through basic
guidelines and troubleshooting techniques as uncovered by EMC VPLEX
performance engineering and EMC field experiences.
Introduction
Before we begin our discussion, it is important to know why we are providing
guidance on interpretation of the performance data provided in this document. The
business unit that delivered VPLEX to the market has a guiding policy of being as
open and transparent as possible with EMC field resources, partners, and customers.
We believe that all modern storage products have limitations and constraints, and
therefore the most successful and satisfied customers are those that fully understand
the various constraints and limitations of the technology they intend to implement.
This approach leads our customers to success because there are fewer surprises and
the product expectations match the reality. Our intent is to be as candid as possible.
We ask readers to use the information to understand the performance aspects of
VPLEX implementations and to make better informed judgments about nominal
VPLEX capabilities rather than use the document as the final word on all VPLEX
performance (as competitors may be tempted to do). If you have questions about
any of the content in this document, please contact your local EMC Sales or
Technical representatives.
When considering a given solution from any vendor, there will undoubtedly be
strengths and weaknesses that need to be considered. There will always be a
specific, unique IO profile that poses challenges in servicing the application load; the
key is to understand the overall IO mix and how it will impact real production
workloads. It is misleading to extrapolate a specific IO profile to be representative of
an entire environment unless the environment homogeneously shares a single IO
profile.
Let's begin our discussion of VPLEX performance by considering performance in
general terms. What is good performance, anyway? Performance can be
considered a measure of the amount of work that is being accomplished in a
specific time period. Storage resource performance is frequently quoted in terms of
IOPS (IOs per second) and/or throughput (MB/s). While IOPS and throughput are both
measures of performance, they are not synonymous and are actually inversely
related, meaning that if you want high IOPS, you typically get low MB/s. This is driven in
large part by the size of the IO buffers used by each storage product and the time it
takes to load and unload each of them. This produces a relationship between IOPS
and throughput as shown in Figure 1 below.
Figure 1
For example, an application requests 1,000 IOPS at an 8KB IO size, which equals 8
MB/s of throughput (1,000 IOPS x 8KB = 8MB/s). Using 200MB/s Fibre Channel, 8 MB/s
doesn't intuitively appear to be good performance (8MB/s is only 4% utilization of the
Fibre Channel bus) if you're thinking of performance in terms of MB/s. However, if the
application is requesting 1,000 IOPS and the storage device is supplying 1,000 IOPS
without queuing (queue depth = 0), then the storage resource is servicing the
application's needs without delay, meaning the performance is actually good.
Conversely, if a video streaming application is sequentially reading data with a 64MB
IO size and 3 concurrent streams, it would realize 192MB/s aggregate performance
across the same 200MB/s Fibre Channel connection (64MB x 3 streams = 192MB/s).
While there's no doubt that 192 MB/s is good performance (96% utilization of the Fibre
Channel bus), it's equally important to note we're only supporting 3 IOPS in this
application environment.
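The arithmetic behind both examples can be checked in a few lines of shell; the numbers are simply those quoted above:

```shell
# Transaction example: 1,000 IOPS at an 8KB IO size
iops=1000; io_kb=8
mbps=$((iops * io_kb / 1000))   # 8,000 KB/s, i.e. 8 MB/s
echo "${mbps} MB/s"

# Throughput example: 3 concurrent 64MB sequential streams
streams=3; io_mb=64
agg=$((streams * io_mb))        # 192 MB/s aggregate, yet only 3 IOPS
echo "${agg} MB/s"
```

The same trade-off appears whichever way you run the numbers: small IOs maximize IOPS, large IOs maximize MB/s.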
These examples illustrate the context-dependent nature of performance; that is,
performance depends upon what you are trying to accomplish (MB/s or IOPS).
Knowing and understanding how your host servers and applications handle their IO
workload is the key to being successful with VPLEX performance optimization. In
general, there are two types of IO workloads:
Transaction-based
Throughput-based
As you saw in Figure 1, these workloads are quite different in terms of their objectives
and must be planned for in specific ways. We can describe these two types of
workloads in the following ways:
The point we are trying to make is that performance depends very much on the
point of view. Ultimately, performance can be considered good if the application is
not waiting on the storage frame. Understanding the application's performance
requirements and providing compatible storage resources ensures maximum
performance and application productivity. It goes without saying that you should
always be cautious about performance claims and spec-sheet speeds and feeds.
If the environment that generated the claims is not identical or does not closely
approximate your environment, you may very well not see the same performance
results.
In configurations with more than one engine, the cluster also contains:
As you add engines, you add cache, front-end, back-end, and WAN COM
connectivity capacity, as indicated in Table 2 below.
                                        Local                  Metro                  Geo
                                        No Known Limit         No Known Limit         No Known Limit
  Maximum virtual volumes               8,000                  16,000                 16,000
  Maximum storage elements              8,000                  16,000                 16,000
  Minimum/maximum virtual volume size   100MB / 32TB           100MB / 32TB           100MB / 32TB
  Minimum/maximum storage volume size   No VPLEX Limit / 32TB  No VPLEX Limit / 32TB  No VPLEX Limit / 32TB
  Number of host initiators             1,600                  1,600                  800

Table 1
  Engine Type   Model    Engines   Cache [GB]   FC Ports   Announced
  VPLEX VS1     Single   1         64           32         10-May-10
  VPLEX VS1     Dual     2         128          64         10-May-10
  VPLEX VS1     Quad     4         256          128        10-May-10
  VPLEX VS2     Single   1         72           16         23-May-11
  VPLEX VS2     Dual     2         144          32         23-May-11
  VPLEX VS2     Quad     4         288          64         23-May-11

Table 2
Table 1 and Table 2 show the current limits and hardware specifications for the VPLEX
VS1 and VS2 hardware versions. Although the VS2 engines have half as many
ports as the VS1, actual system throughput is improved because each VS2 port can supply
full line rate (8 Gbps) of throughput, whereas the VS1 ports are over-subscribed.
Several of the VPLEX maximums are determined by the limits of the externally
connected physical storage frames and are therefore unlimited in terms of VPLEX itself.
The latest configuration limits are published in the GeoSynchrony 5.1 Release Notes
which are available on Powerlink.EMC.com.
These latency values will vary slightly depending on the factors mentioned earlier. For
example, large-block IO requests may have to be broken up into smaller
parts (based on VPLEX or individual array capabilities) and then written serially in
smaller pieces to the storage array. Further, if you are comparing native array to
VPLEX performance, the comparison will be heavily dependent on the overall load on the array. If
you have an array that is under cache pressure, adding VPLEX to the environment
can actually improve read performance.
The additive cache from VPLEX may
offload a portion of read IO from the array, thereby reducing average IO latency.
Additional discussion on this topic is provided later in the subsequent host and
storage sections.
Native vs. VPLEX Local Performance
Native performance tests use a direct connection between a host and storage array. VPLEX Local testing inserts VPLEX in the path between the host and array.
4KB Random Read Hit
Random read hits are tested over a working set size that fits entirely into array or
VPLEX cache.
In this test, the application demonstrates slightly higher host latency with VPLEX
compared to native. The additional latency overhead is about 600 microseconds.
Native vs. VPLEX Metro Performance
VPLEX Metro write performance is highly dependent upon the WAN round-trip-time
latency (RTT latency). The general rule of thumb for Metro systems is that host write IO
latency will be approximately 1x-3x the WAN round-trip time. While some may view
this as an overly negative impact, we would caution against that view and highlight the
following points. First, VPLEX Metro uses a synchronous cache model and is therefore
subject to the laws of physics when it comes to data replication. In order to provide
a true active-active storage presentation, it is incumbent on VPLEX to provide a
consistent and up-to-date view of data at all times. Second, many workloads have a
considerable read component, so the net WAN latency impact can be masked by
the improvements in the read latency provided by VPLEX read cache. This is another
reason that we recommend a thorough understanding of the real application
workload so as to ensure that any testing that is done is applicable to the workload
and environment you are attempting to validate.
In comparing VPLEX Metro to native array performance, it is important to ensure that
the native array testing is also synchronously replicating data across a WAN link and
distance equal to those of VPLEX. Comparing Metro write performance to a single
array that is not doing synchronous replication is an apples-to-bananas comparison.
Recommended Policy

  PVLinks: set to Failover
  NMP: set policy to Fixed
  Native MPIO: set to Round Robin
  MPIO: set to Round Robin Load Balancing

Table 3
Note: The most current and detailed information for each host OS is provided in the
corresponding Host Connectivity Guides on Powerlink at: http://powerlink.emc.com
Host Paths
EMC recommends that you limit the total number of paths that the multipathing
software on each host is managing to four paths, even though the maximum
supported is considerably more than four. Following these rules helps prevent many
issues that might otherwise occur and leads to improved performance.
The major reason to limit the number of paths available to a host from the VPLEX is for
error recovery, path failover, and path failback purposes. These are also important
during the VPLEX non-disruptive upgrade (NDU) process. The overall time for
handling path loss by a host is significantly reduced when you keep the total number
of host paths to a reasonable number required to provide the aggregate
performance and availability. Additionally, the consumption of resources within the
host is greatly reduced each time you remove a path from path management
software.
During NDU, there are intervals where only half of the VPLEX directors and associated
front-end ports (on first and second upgraders, respectively) are available on the
front-end fabric. NDU front-end high availability checks ensure that the front-end
fabric is resilient against single points of failure during the NDU, even when either the
first or second upgrader front-end ports are offline.
From a host pathing perspective there are two types of configurations:

High availability configurations: VPLEX configurations that include
sufficient redundancy to avoid data unavailability during NDU, even in the
event of front-end fabric or port failures. The NDU high-availability pre-checks
succeed for these configurations.

Minimal configurations: VPLEX configurations that do not include
sufficient redundancy to avoid data unavailability in the event of front-end
fabric or port failures.
For minimal configurations, the NDU high-availability pre-checks fail. Instead, the
pre-checks for these configurations must be performed manually. This can take a
considerable amount of time in large environments, and in general EMC believes that
the benefits of lower port-count requirements are not justified given the increased
operational impact.
High availability configurations
VPLEX Non-Disruptive Upgrade (NDU) automated pre-checks verify that VPLEX is
resilient in the event of failures while the NDU is in progress.
In high availability configurations:
In dual- or quad-engine systems, each view has front-end target ports across
two or more engines in the first upgrader set (A directors), and two or more
engines in the second upgrader set (B directors).
In single-engine systems, each initiator port in a view has a path to at least one
front-end target port in the first upgrader (A director) and second upgrader (B
director). (See Figure 7.)
There are two variants of front-end configurations to consider for high availability that
will pass the high-availability pre-checks:
An optimal configuration for a single-engine cluster is one in which there are
redundant paths (dotted and solid lines in Figure 7) between both front-end fabrics
and both directors. In addition to protecting against failures of an initiator port,
HBA, front-end switch, VPLEX front-end, or director, these redundant paths also
protect against front-end port failures during NDU.
A high-availability configuration for a single-engine cluster is one in which
there is a single path between the front-end fabrics and the directors (solid
lines in Figure 7). Like the optimal configuration described above, a high-availability
configuration protects against failures of initiator ports, HBAs, front-end
switches, and directors during NDU.
Update these views to satisfy the high availability requirement. Ensure the storage
view in question has front-end target ports across two or more engines in the first
upgrader set (A directors) and second upgrader set (B directors).
Figure 8 illustrates a single-engine cluster with a minimal front-end configuration:
Figure 9 Current VPLEX NDU Enforced Single and Dual Engine Connectivity
Note: For code releases through VPLEX GeoSynchrony code version 5.1 Patch 3, the
non-disruptive upgrade pre-check strictly enforces connecting hosts across 4
directors with 2 and 4 engine VPLEX systems. This restriction will likely be relaxed in
future releases to better align with the reasoning presented above.
When considering attaching a host to more than two directors in a dual-engine or
quad-engine VPLEX configuration, both the performance and the scalability of the
VPLEX complex should be considered. Though this may contradict what the
automated NDU will accept, this guidance is given for the following reasons:
Utilizing more than two directors per host increases cache update traffic
among the directors
Utilizing more than two directors per host decreases the probability of read cache
hits on the ingress director.
Based on the reliability and availability characteristics of VPLEX hardware,
attaching a host to just two directors provides a high availability configuration
without unnecessarily impacting performance and scalability of the solution
Also, latency by path is available with the powermt display latency command:

powermt display latency
Invista logical device count=86
==============================================================================
----- Host Bus Adapters ---------  ------ Storage System ------  - Latency (us) -
###  HW Path                       ID              Interface      Current    Max
==============================================================================
  3  port3\path0                   FNM0010360####  01                   0      0
  4  port4\path0                   FNM0010360####  04                   0      0
It is also possible to set an autorestore policy with PowerPath so that any paths that
drop offline are brought back online if they are healthy.

Example 3 Auto restore paths

powermt set periodic_autorestore=on|off

Each of these commands can provide hosts with self-monitoring and self-recovery to
provide the greatest resiliency and availability possible for each host. This command
can be combined with a scheduler, such as cron, and a notification system, such as
email, to notify SAN administrators and system administrators if the number of
paths to the system changes.
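A minimal sketch of that idea follows. The threshold, the helper function, and the powermt parsing shown in the comment are assumptions to adapt per host, not part of PowerPath itself:

```shell
#!/bin/sh
# check_paths ALIVE EXPECTED: print an alert when live paths fall short.
check_paths() {
    if [ "$1" -lt "$2" ]; then
        echo "ALERT: only $1 of $2 paths alive"
    fi
}

# On a real PowerPath host the current count would be gathered with
# something like:
#   ALIVE=$(powermt display dev=all | grep -c alive)
check_paths 3 4   # prints the alert
check_paths 4 4   # prints nothing
```

Run from cron, the alert text can be piped to a mail command to notify the SAN and system administrators.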
For Veritas DMP there are recovery settings that control how often a path will be
retried after failure. If these are not the default settings on your hosts, you should set
the following on any hosts using DMP:
The values shown in Example 4 specify a 30-second retry period for handling transient errors.
When all paths to a disk fail (such as during a VPLEX NDU), there may be certain
paths that have a temporary failure and are likely to be restored soon. If IOs are not
retried for a non-zero period of time, the IO may be failed by the application layer.
The DMP tunable dmp_lun_retry_timeout can be used for more robust handling of
such transient errors. If the tunable is set to a non-zero value, I/Os to a disk with all
failed paths will be retried until the specified dmp_lun_retry_timeout interval or
until the I/O succeeds on one of the paths, whichever happens first. The default
value of the tunable is 0, which means that the paths are probed only once.
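For instance, the tunable can be inspected and changed with vxdmpadm; the 60-second value below is purely illustrative, not a recommendation from this paper:

```
vxdmpadm gettune dmp_lun_retry_timeout
vxdmpadm settune dmp_lun_retry_timeout=60
```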
VPLEX Real-time GUI Performance Monitoring Stats
The Unisphere for VPLEX UI contains several key performance statistics for host
performance and overall health.
They can be found on the Performance
Dashboard tab and can be added to the default performance charts that are
displayed. Using the data provided, the VPLEX administrator can quickly determine
the source of performance problems within an environment. Figure 10 below shows
the performance data included in the GeoSynchrony 5.1 version of VPLEX.
1 service users  3374442 2012-11-17 05:41 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log
1 service users 10485855 2012-11-14 18:38 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.1
1 service users 10485864 2012-09-10 01:25 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.10
1 service users 10486060 2012-11-07 03:33 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.2
1 service users 10485825 2012-10-30 12:14 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.3
1 service users 10485922 2012-10-22 21:38 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.4
1 service users 10486009 2012-10-15 12:43 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.5
1 service users 10486000 2012-10-08 05:20 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.6
1 service users 10486298 2012-10-01 03:31 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.7
1 service users 10486207 2012-09-24 02:24 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.8
1 service users 10485969 2012-09-17 00:24 director-1-1-A_PERPETUAL_vplex_sys_perf_mon.log.9
1 service users  2467450 2012-11-17 05:41 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log
1 service users 10485770 2012-11-15 11:39 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.1
1 service users 10486183 2012-09-07 09:02 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.10
1 service users 10485816 2012-11-08 01:10 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.2
1 service users 10485977 2012-10-31 13:49 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.3
1 service users 10486275 2012-10-24 01:25 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.4
1 service users 10485793 2012-10-16 07:27 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.5
1 service users 10486230 2012-10-08 07:56 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.6
1 service users 10485762 2012-09-30 05:37 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.7
1 service users 10485807 2012-09-22 01:53 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.8
1 service users 10486077 2012-09-14 14:31 director-1-1-B_PERPETUAL_vplex_sys_perf_mon.log.9
This section reviews benchmarking tools that are useful (and not so useful) when testing VPLEX
performance in your environment.
Good benchmarks
IOMeter
IOMeter is one of the most popular public domain benchmarking tools among
storage vendors, and is primarily a Windows-based tool.
It is available
from http://www.iometer.org. In the Benchmarking section of this document we
provide some examples of IOMeter settings that are used to simulate specific
workloads for testing.
The popularity of IOMeter holds true at EMC. Many internal teams, including the
VPLEX Performance Engineering team, use IOMeter and are familiar with its behavior,
input parameters, and output. That being said, the IO patterns, queue depths, and
other tunables can be misused and distorted. It's important to maintain healthy
skepticism about any benchmark numbers you see until you know the full details of the
settings and overall testing parameters.
Warning: It's not recommended to run the IO client (dynamo) on Linux. Dynamo does
not appear to function completely as expected. It's best to use Windows clients with
Dynamo.
IOZone
IOZone has broad operating system support, but is primarily file-system based. It is available
for free from http://www.iozone.org.
iorate
Initially implemented by EMC, iorate has been released to the public as open source.
Available for free from http://iorate.org/
fio
fio is an I/O tool meant to be used both for benchmark and stress/hardware
verification. It has support for 13 different types of I/O engines, I/O priorities (for
newer Linux kernels), rate I/O, forked or threaded jobs, and much more. It can work
on block devices as well as files. fio is a tool that will spawn a number of threads or
processes doing a particular type of I/O action as specified by the user. The typical
use of fio is to write a job file matching the I/O load one wants to simulate. Available
for free from http://freecode.com/projects/fio
Additional info: http://linux.die.net/man/1/fio
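As an illustration, a minimal fio job file for a 4KB random-read test might look like the following sketch; the device path is a placeholder, and the job should only ever be pointed at a test volume, never a production LUN:

```ini
[global]
ioengine=libaio    ; asynchronous IO engine on newer Linux kernels
direct=1           ; bypass the host page cache
runtime=60
time_based

[rand-read-4k]
rw=randread
bs=4k
iodepth=32
filename=/dev/sdX  ; placeholder test device
```

The job is then run as: fio rand-read-4k.fio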
Poor benchmarks
In general, any single outstanding I/O or filesystem focused benchmarks are not
good choices.
Unix dd test
dd is completely single-threaded, issuing a single outstanding I/O at a time.
The dreaded dd test:
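For reference, the test typically looks like the sketch below (written to a scratch file here rather than a raw device):

```shell
# Single-threaded sequential write: one IO outstanding at a time.
dd if=/dev/zero of=/tmp/ddtest.bin bs=1M count=64 conv=fdatasync
size=$(stat -c %s /tmp/ddtest.bin)
rm -f /tmp/ddtest.bin
echo "wrote ${size} bytes"
```

Because only one IO is ever in flight, the rate dd reports reflects per-IO latency far more than the aggregate capability of the storage system.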
Bonnie
Bonnie was designed to test UNIX file systems and is over 20 years old.
Bst5 or "Bart's stuff test"
Bst5 is single outstanding I/O. http://www.nu2.nu/bst/
File copy commands
These are single-threaded with a single outstanding I/O. They use the host memory file
cache, so it is not known when, or if, a particular file IO hits storage. It is also not clear
what I/O size the filesystem will happen to choose, so it might be reading and
writing with inefficient IO sizes. In theory, a multiple-file copy benchmark could be
constructed; however, it requires careful parallelism and multiple independent source
and target locations.
It is best to separate reads and writes in performance testing. For example, a slow
performing read source device could penalize a fast write target device. The entire
copy test would show up as slow. Without detailed metrics into the read and write
times (not always gathered in a simple "how long did it take" file copy test), the
wrong conclusions can easily be drawn about the storage solution.
Note: See Section 8: Benchmarking for specific testing recommendations and
example results.
Benchmarking Applications
The list of possible application level benchmarking programs is numerous. Some that
are fairly well known and understood are:
Microsoft Exchange - JetStress
Microsoft SQL Server - SQLIO
Oracle - SwingBench, DataPump, or export/import commands
VMware - VMbench, VMmark - virtual machine benchmarking tools
These particular benchmarking applications are potentially one step closer to a
production application environment; however, like all artificially crafted
benchmarks, they suffer from the fact that at the end of the day they are likely not
representative of your environment.
Engage EMC's application experts when you are interested in a specific application
benchmark. We stress that these benchmarks also exercise more of the application
and host IO stack, so they may not be representative of the underlying storage
devices and can be affected by many factors outside the storage layer.
Application Testing
Testing with the actual application is the best way to measure storage performance.
A production-like environment that can stress storage limits is desirable.
Measure performance of different solutions:
Compare OLTP response times.
Compare batch run times.
Compare sustained streaming rates.
Operating system and application tools can help monitor storage performance.
Production Testing
Ultimately, there must be a level of trust in the solution before deploying the
solution in your production environment. When you are considering moving an
application into production, there are some risks and rewards.
Risk vs. Reward:
Risk: taking an unsupported, well-traveled evaluation unit and putting it in a
production environment could compromise application availability and
expose unexpected system problems.
Reward: sometimes this is the only way to know for certain that storage
performance is acceptable for an application.
In order to minimize the risk side of the equation, consider a staged approach
whereby non-business-critical applications are virtualized with VPLEX first. This is
similar to the approach recommended by VMware in the early stages of host
virtualization. Go for the low-hanging fruit first, and then closely monitor
performance throughout the process.
therefore, perform better if they are isolated onto dedicated storage volumes.
VPLEX's large read cache benefits the read portion of this sort of workload, but the log
volumes will likely not benefit from cache, and will therefore need underlying storage
volumes that can keep pace with the write workload.
High Throughput environments
With high throughput workloads, you have fewer transactions, but much larger IO per
transaction. IO sizes of 128 K or greater are normal, and these IOs are generally
sequential in nature. Applications that typify this type of workload are imaging,
video servers, seismic processing, high performance computing (HPC), and backup
servers.
When running applications that use larger IO sizes, it is important to be aware of the
extra IO impact that VPLEX adds as a result of breaking up write IOs that are larger
than 128KB. For example, a single 1MB host write requires VPLEX to issue 8 x 128KB
writes out to the back-end storage frame. When practical, the maximum host and
application IO size and allocation units for high-throughput systems should be set
to 128KB or less. An increase in the maximum back-end write size to 1MB is expected in
the next major VPLEX code release.
Best practice: Database table spaces, journals, and logs should not be placed on
virtual volumes that reside on extents from the same backend storage volume.
device. It is acceptable to use striped (raid-0) volumes with applications and storage
frames that do not already stripe their data across physical disks.
Note: VPLEX can and does read/write to back-end arrays at I/O sizes as
small as 512 bytes as of GeoSynchrony 5.0.
Active/Active Arrays
With active/active storage platforms such as EMC VMAX and Symmetrix, Hitachi VSP,
IBM XIV, and HP 3PAR, each director in a VPLEX cluster must have a minimum of two
paths to every local back-end storage array and to every storage volume presented
to VPLEX. Each VPLEX director requires physical connections to the back-end
storage across dual fabrics, with redundant paths to every back-end storage array
on both fabrics. Anything less creates a single point of failure at the director
level and can lead to rebuilds that continuously start, restart, and never finish,
a condition referred to as asymmetric back-end visibility. This is especially
detrimental when VPLEX is mirroring across local devices (RAID-1) or
across distributed devices (distributed RAID-1).
Each storage array should have redundant controllers connected to dual fabrics,
with each VPLEX controller having a minimum of two ports connected to the back-end
storage arrays through the dual fabrics (required).
VPLEX allows a maximum of four back-end paths per director to a given LUN. Four is
considered optimal because each director will load balance across the four paths to
the storage volume. It is a maximum because additional paths to any given storage
volume, counted as Initiator-Target-LUN (ITL) nexuses, could exceed the supported
ITL count per storage volume, resulting in the inability to claim or work with the device.
Exceeding four paths per storage volume per director can lead to elongated back-end
path failure resolution, NDU pre-check failures, and decreased scalability.
High quantities of storage volumes (e.g., 1,000+) or entire arrays
provisioned to VPLEX should be divided into appropriately sized groups (e.g.,
masking views or storage groups) and presented from the array to VPLEX via groups
of four array ports per VPLEX director, so as not to exceed the four active paths per
VPLEX director limitation. As an example, following the rule of four active paths per
storage volume per director (also referred to as ITLs), a four engine VPLEX cluster
could have each director connected to four array ports dedicated to that director.
In other words, a quad engine VPLEX cluster would have the ability to connect to 32
ports on a single array for access to a single device presented through all 32 ports
and still meet the connectivity rules of 4 ITLs per director. This can be accomplished
using only two ports per backend I/O module leaving the other two ports for access
to another set of volumes over the same or different array ports.
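The port arithmetic above can be made explicit with a quick calculation; the helper name is invented for illustration and is not a VPLEX tool:

```python
# Illustrative helper for the 4-active-paths-per-storage-volume-
# per-director (ITL) rule described above.
def array_ports_for_full_fanout(engines, paths_per_director=4):
    directors = engines * 2          # each VPLEX engine holds two directors
    return directors * paths_per_director

# A quad-engine cluster (8 directors) can spread a single device
# across 32 array ports while honoring 4 ITLs per director.
print(array_ports_for_full_fanout(4))  # 32
# A single-engine cluster tops out at 8 ports for one device.
print(array_ports_for_full_fanout(1))  # 8
```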
Appropriateness would be judged based on the planned total IO workload for the
group of LUNs and the limitations of the physical storage array. For example, storage
arrays often have limits around the number of LUNs per storage port, storage group,
or masking view they can have.
Maximum performance, environment-wide, is achieved by balancing I/O workload
across the maximum number of ports on an array while staying within the ITL limits.
Performance is not based on a single host but on the overall impact of all resources
being utilized. Proper balancing of all available resources provides the best overall
performance.
Storage Best Practices: Create separate port groups within the storage frame for
each of the logical path groups that have been established. Spread each group
of four ports across storage array engines for redundancy. Mask devices to allow
access to the appropriate VPLEX initiators for both port groups.
Figure 12 shows the physical connectivity from a quad-engine VPLEX cluster to a
hex-engine VMAX array.
would allow for the greatest possible balancing of all resources resulting in the best
possible environment performance.
CPU Utilization - % busy of the VPLEX directors in each engine. 50% or less is
considered ideal.
Front-end Aborts - SCSI aborts received from hosts connected to VPLEX front-end
ports. 0 is ideal.
Front-end Bandwidth - Total I/O as measured in MB per second from hosts to
VPLEX.
Front-end Latency - Time in microseconds for I/O to complete between VPLEX
and hosts. Very dependent on back-end array latency.
Front-end Throughput - Total I/O as measured in I/O per second.
Rebuild Status - Completion status of local and remote device rebuild jobs.
Subpage Writes - Number of writes that are < 4 KB. This statistic has greatly
diminished importance for VPLEX Local and Metro systems running
GeoSynchrony 5.0.1 and later code. For VPLEX Geo, it is still a very relevant
metric.
WAN Link Usage - I/O between VPLEX clusters as measured in MB per second.
This chart can be further subdivided into system, rebuild, and distributed
volume write activity.
WAN Link Performance - I/O between VPLEX clusters as measured in I/O per
second.
The Back-end Latency statistic provides a quick way to narrow down the source of
performance issues: whether they are caused by the array, by the hosts, or by VPLEX. If
back-end latency is high, you should expect to see correspondingly high front-end
latency; this would rule out VPLEX as the source of the latency.
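That triage rule can be sketched as a small decision helper; the 5,000-microsecond threshold is an arbitrary illustration for the example, not a VPLEX guideline:

```python
# Illustrative triage based on the rule above: high back-end latency
# explains high front-end latency and rules out VPLEX as the source.
def latency_source(fe_latency_us, be_latency_us, threshold_us=5000):
    if be_latency_us > threshold_us:
        return "back-end array"
    if fe_latency_us > threshold_us:
        return "VPLEX or host/SAN path"
    return "within expectations"

print(latency_source(fe_latency_us=9000, be_latency_us=8500))
print(latency_source(fe_latency_us=9000, be_latency_us=400))
```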
Back-end Connectivity Summary
For typical workloads, VPLEX will normally perform as well as the underlying
storage array.
Injected VPLEX Local write overhead is usually in the 300-600 microsecond range.
VPLEX's cache can benefit read-intensive applications, leading to reduced
latency (compared to baseline) when there are VPLEX read cache hits.
Baseline and document your storage and application environments pre-VPLEX.
Follow your storage vendor's best practices for performance regarding RAID
layout, disk types (SSD, FC, SAS/SATA), thin/thick provisioning, and automated
storage tiering.
Reference the EMC Support Matrix, Release Notes, and online documentation
available on http://Powerlink.EMC.com for specific array configuration
requirements.
Engage EMC and 3rd-party storage vendors as needed.
FC WAN Sizing
Insufficient WAN bandwidth for the desired workload will guarantee performance
degradation. The application will see high response times because of
queue build-ups within VPLEX when the WAN pipe is saturated.
The minimum required inter-cluster bandwidth for VPLEX Geo is 1 Gbps. VPLEX Metro
IP has a release-notes-stated minimum of 3 Gbps; however, solutions running at 1 Gbps
are considered via RPQ.
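As a rough planning sketch, a quick utilization check can flag a link that will build queues; the figures below are invented for the example and are not VPLEX requirements beyond the stated minimums:

```python
# Back-of-envelope WAN saturation check (illustrative only).
def link_utilization(write_mb_s, link_gbps):
    link_mb_s = link_gbps * 1000 / 8   # Gbps -> MB/s (decimal units)
    return write_mb_s / link_mb_s

# 150 MB/s of distributed-device write traffic over a 3 Gbps link
# runs at 40% utilization; above 1.0 the WAN pipe is saturated and
# VPLEX queues will build.
u = link_utilization(150, 3.0)
print(round(u, 2))  # 0.4
```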
Ensure your WAN devices are properly configured for distance and have proper
licenses, and that QoS or bandwidth rate limits are not artificially capping available
inter-cluster bandwidth.
Ensure compliant WAN round-trip-times
Unsupported inter-cluster WAN round-trip-times can result in unexpected
performance results. VPLEX Metro with FC WAN can benefit from FC Fast Write
technology (available from vendors such as Brocade and Cisco) by minimizing the
total number of round trips incurred for writes between data centers.
Buffer to Buffer Credits
If FC switches are used over dark fibre or DWDM WAN equipment, ensure that the
WAN facing FC ports have sufficient buffer credits allocated to the ports. A lack of
buffer credits will impose an undesired limit on the maximum throughput on the WAN
link.
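A back-of-envelope credit estimate can be made from round-trip time and frame serialization time. The constants below (about 5 microseconds per km of fiber propagation, 2148-byte full-size frames, 8b/10b encoding) are common rules of thumb, not switch-vendor guidance; size actual links per your switch vendor's documentation:

```python
import math

# Rough BB-credit estimate for a long-distance FC ISL: enough credits
# to keep full-size frames in flight for one round trip.
def bb_credits(distance_km, gbaud, frame_bytes=2148):
    rtt_s = 2 * distance_km * 5e-6              # ~5 us/km each way in fiber
    frame_s = frame_bytes * 10 / (gbaud * 1e9)  # 8b/10b: 10 bits per byte
    return math.ceil(rtt_s / frame_s)

# A 50 km link at 8 Gbps (8.5 Gbaud) needs roughly 200 credits:
print(bb_credits(50, 8.5))  # 198
```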
Brocade switches:
For Brocade, an extended-fabric license is required for each edge switch, and the WAN-facing
ports must be set to LS or LD mode. See the command portcfglongdistance.
Monitor the port's counters for non-zero values of tim_txcrd_z (time transmission
credits are zero). A non-zero value means the FC port wanted to transmit a frame but
did not have sufficient buffer credits to do so, which implies
performance issues on the WAN link. If FCIP gateway devices are used between VPLEX
clusters, ensure that the FCIP tunnel is configured properly.
Brocade FCIP switches:
Check for bandwidth rate-limiting settings on the tunnel. See the command portshow
fciptunnel. Verify that the values for Min Comm Rt and Max Comm Rt (Minimum /
Maximum Communication Rate) are not causing a bottleneck. Check for improper
QoS settings on the tunnel: in the portshow fciptunnel command output, check
the values for QoS Percentages. Note that QoS settings affect the fciptunnel
only if QoS has been set on the LAN-facing FC ports.
IP QoS settings
Sufficient bandwidth
MTU sizes - use jumbo frames whenever possible
Dirty/unhealthy FC fabric
Fabric health is CRITICAL to VPLEX performance
Watch for C3 discards, CRC errors, internal link failures, slow-drain devices,
etc.
Brocade: on 8 Gbps fabrics, change the fillword setting per port
Avoid incorrect FC port speed between the fabric and VPLEX. Use the highest
possible bandwidth to match the VPLEX maximum port speed, and use dedicated
port speeds, i.e., do not use oversubscribed ports on SAN switches.
Each VPLEX director has the capability of connecting both FE and BE I/O modules
to both fabrics with multiple ports.
o The ports connected to on the SAN should be on different blades or switches
so a single blade or switch failure won't cause loss of access on that fabric
overall.
o A good design will group VPLEX BE ports with array ports that will be
provisioning groups of devices to those VPLEX BE ports in such a way as to
minimize traffic across blades.
Note: A more detailed treatment of VPLEX best practices can be found in the
VPLEX Implementation and Planning Best Practices Technote available on
http://Powerlink.EMC.com
If you have questions or concerns about the appropriate number of engines for your
VPLEX system, please contact your EMC account team.
Section 8: Benchmarking
Tips when running the benchmarks
There are four important guidelines to running benchmarks properly:
1) Ensure that every benchmark run is well understood. Pay careful attention to the
benchmark parameters chosen and to the underlying test system's configuration
and settings.
2) Each test should be run several times to ensure accuracy, and standard deviation
or confidence levels should be used to determine the appropriate number of
runs.
3) Tests should be run for a long enough period of time, so that the system is in a
steady state for a majority of the run. This means most likely at least tens of
minutes for a single test. A test that only runs for 10 seconds or less is not sufficient.
4) The benchmarking process should be automated using scripts to avoid mistakes
associated with manual repetitive tasks. Proper benchmarking is an iterative
process. Inevitably you will run into unexpected, anomalous, or just interesting
results. To explain these results, you often need to change configuration
parameters or measure additional quantities - necessitating additional iterations
of your benchmark. It pays upfront to automate the process as best as possible
from start to finish.
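Guidelines 2 and 4 can be sketched as a minimal automation loop; `run_benchmark` is a hypothetical helper, and `samples` stands in for a real I/O test run (figures invented):

```python
import statistics

# Run each test several times and report mean and standard deviation,
# per guidelines 2 (repeat runs) and 4 (script the process).
def run_benchmark(run_once, runs=5):
    results = [run_once() for _ in range(runs)]
    return statistics.mean(results), statistics.stdev(results)

# Stub workload standing in for a real I/O test (e.g., an IOMeter run):
samples = iter([101.0, 99.0, 100.0, 98.0, 102.0])
mean, dev = run_benchmark(lambda: next(samples))
print(mean, round(dev, 2))  # 100.0 1.58
```

A large standard deviation relative to the mean signals that more runs, or a longer steady-state period per run, are needed before drawing conclusions.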
Take a scientific approach when testing
Before starting any systems performance testing or benchmarking, here are some
best practices:
First things first: define your benchmark objectives. You need success metrics so
you know when you have succeeded. They can be response times, transaction
rates, user counts, anything, as long as they are measurable.
Document your hardware/software architecture. Include device names and
specifications for systems, network, storage, applications. It is considered good
scientific practice to provide enough information for others to validate your
results.
This is an important requirement if you find the need to engage EMC Support on
benchmarking environments.
When practical, implement just one change variable at a time.
Keep a change log. What tests were run? What changes were made? What
were the results? What were your conclusions for that specific test?
Map your tests to what performance reports you based your conclusions on.
Sometimes using codes or special syntax when you name your reports helps.
Not bringing the various system caches back to a consistent state between runs can
cause timing inconsistencies. Clearing the caches between test runs will help create
identical runs, thus ensuring more stable results. If, however, warm cache results are
desired, this can be achieved by running the experiment n+1 times, and discarding
the first run's result.
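The n+1 pattern can be sketched as follows; the latency readings are invented for illustration:

```python
# Warm-cache measurement: run once to warm the caches, discard that
# result, then keep the remaining n runs.
def warm_cache_runs(run_once, n=3):
    run_once()                      # warm-up run, result discarded
    return [run_once() for _ in range(n)]

# Stub latency readings: the first (cold) run is noticeably slower.
readings = iter([250.0, 110.0, 108.0, 112.0])
print(warm_cache_runs(lambda: next(readings)))  # [110.0, 108.0, 112.0]
```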
Testing storage performance with file copy commands
Simple file copy commands are typically single-threaded and result in a single
outstanding I/O, which is poor for performance and does not reflect normal usage.
Testing peak bandwidth of storage with a bandwidth-limited host peripheral slot
If your server happens to be an older model, you could be limited by the host
motherboard's PCI bus. Ensure you have sufficient host hardware resources (CPU,
memory, bus, HBA or CNA cards, etc.). An older fibre channel network (2 Gbps, for
example) may limit the performance of newer servers.
Forgetting to monitor processor utilization during testing
Similar to peak bandwidth limitations on hosts, ensure that your host server's
processors aren't completely used up. If this happens, your storage performance is
bound to be limited.
The same goes for the storage virtualization appliance and the storage array. If you
are maxing out the available CPU resources in the storage device, you will be
performance limited.
Not catching performance bottlenecks
Performance bottlenecks can occur at each and every layer of the
I/O stack between the application and the data resting on flash or spinning media.
Of course ultimately the performance that the application sees relies upon all of the
sub-components situated between it and the storage, but it's critical to understand in
which layer of this cake the performance limitations may exist. One misbehaving
layer can spoil everything.
Performance testing with artificial setups
Avoid "performance specials". Test with a system configuration that is similar to your
production target. For example, disabling storage-array cache mirroring may speed
up your test, but would you do that in production? Short-stroking the storage array's
RAID configuration may boost performance, but in reality it is a very inefficient use of
disk space that would not normally be tolerated.
VMware vSphere - Performance testing directly on the ESXi hypervisor console
Don't do it. Ever. ESXi explicitly throttles the performance of the console to prevent a
console app from killing VM performance. Also, doing I/O from the console directly
to a file on VMFS results in excessive metadata operations (SCSI reservations) that
otherwise would not be present when running a similar performance test from a VM.
Figure 16
VPLEX Performance Benchmarking Guidelines
There are a few themes to mention with regard to performance benchmarking with
VPLEX.
Test with multiple volumes
VPLEX performance benefits from I/O concurrency, which is most easily achieved by
running I/O to multiple virtual volumes. Testing with only one volume does not fully
exercise VPLEX's or the storage array's full performance capabilities. With regard to
the previously mentioned single-outstanding-I/O issue, with enough volumes active
(such as a few hundred), having a single outstanding I/O per volume is acceptable:
multiple active volumes create a decent level of concurrency. A single volume with a
single outstanding I/O most definitely does not.
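The concurrency arithmetic is simple but worth making explicit; this is an illustrative helper, not a VPLEX metric:

```python
# Total outstanding I/Os offered to VPLEX is the product of active
# volumes and the queue depth driven against each one.
def total_outstanding(volumes, qdepth_per_volume):
    return volumes * qdepth_per_volume

# One volume at queue depth 1 offers no concurrency at all...
print(total_outstanding(1, 1))    # 1
# ...while a few hundred volumes do, even at queue depth 1 each.
print(total_outstanding(256, 1))  # 256
```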
Storage Arrays
For VPLEX Metro configurations, ensure that each cluster's storage-arrays are of equal
class. Check the VPLEX back-end storage-volume read and write latency for
discrepancies. Perform a local-device benchmarking test at each cluster, if possible,
to eliminate the WAN and remote storage-array from the equation.
One Small Step
Walk before you run. It is typically quite exciting to test the full virtualization solution
end to end, soup to nuts. For VPLEX, this may not always be the most scientific
approach when problems arise. Testing end to end immediately may produce
disappointing results (due to unrealistic expectations) and lead to false conclusions
about the overall solution without understanding the individual pieces of the puzzle.
Take a moderated staged approach to system testing:
1) Start with your native performance test:
Host <-> storage-array
1a) If you have a two cluster deployment in mind, it is important to quantify the
performance of the storage-arrays at each cluster.
This will be your baseline to compare to VPLEX. For certain workloads, VPLEX can only
perform as well as the underlying storage-array.
2) Encapsulate the identical or similar-performing volumes into VPLEX, configuring
them as local-devices:
Host <-> VPLEX <-> storage-array (local-device)
2a) Test both clusters' local-device performance. (Note: The second cluster's VPLEX
local-device performance test can be skipped if Step 1 showed satisfactory native
performance on the second cluster.)
3) Create a VPLEX distributed-device spanning both clusters' storage arrays.
Host <-> VPLEX <-> cluster-1 storage-array and cluster-2 storage-array (distributed-device)
This section provides IOMeter settings examples in the form of screen captures from
actual test systems. They illustrate the settings that can be used to simulate various
workloads and create benchmarks.
Disk Targets Tab:
Multi-threaded I/O:
Conclusion
This paper has focused on VPLEX's role in providing a virtual storage layer between servers
and block storage frames. Because VPLEX lives at the very heart of the storage area
network, VPLEX's primary design principles are continuous availability and minimized I/O
latency. VPLEX also provides non-disruptive data mobility within and across data centers
while simplifying the management of heterogeneous storage frames. When VPLEX and the
corresponding storage environment are properly sized and configured, I/O latency can be
reduced in the case of read-skewed workloads and kept nearly neutral for write-biased
workloads. Individual results will, of course, vary based on the application and I/O
workload.
We've learned how inserting an inline virtualization engine like VPLEX has the potential to
increase I/O latency. In particular, we have seen how writes at metro distances behave in a
synchronous caching model. The read/write mix, the I/O pattern, and the I/O stream
characteristics can all affect the overall result. If benchmark or proof-of-concept testing is
being done, it is important to understand the factors that impact VPLEX performance and
to make every effort to ensure the benchmark test workload is as close to the real-world
workload as possible.
The role of SAN, server, and storage capabilities in terms of congestion, reads, and writes
was another important topic of discussion. These external components are extremely
relevant in determining overall VPLEX performance results. We've discussed how VPLEX's
read cache may increase performance compared to the baseline for a native array, and
how each host write must be acknowledged by the back-end storage frames.
Understanding the impact of VPLEX, and how an environment can be prepared for single-,
dual-, or quad-engine VPLEX clusters, will greatly increase your chances of success when
configuring virtualized storage environments for testing, benchmarks, and production.
References
The following reference documents are available at Powerlink.EMC.com:
External References
Appendix A: Terminology
Storage volume - A LUN presented from a back-end storage array and claimed by VPLEX.
Metadata volume - A storage volume that holds a VPLEX cluster's configuration metadata.
Extent - All or a portion of a storage volume, used as a building block for devices.
Device - A RAID construct built from one or more extents or other devices.
Virtual volume - The volume built on a device and presented to hosts through VPLEX front-end ports.
Front-end port - A director port that connects to host initiators.
Back-end port - A director port that connects to back-end storage arrays.
Director - The I/O-processing component of a VPLEX engine; each engine contains two directors.
Engine - The VPLEX hardware enclosure containing two directors with their I/O modules and power supplies.
VPLEX cluster - One, two, or four engines plus supporting hardware operating as a single entity at one site.
VPLEX Metro - Two VPLEX clusters connected over synchronous distances.
VPLEX Metro HA - A VPLEX Metro configuration combined with a cross-cluster host failover solution and VPLEX Witness.
Access Anywhere - The distributed cache-coherency technology that allows the same data to be accessed at both clusters of a distributed volume.
Federation - The cooperation of storage resources across VPLEX clusters and heterogeneous arrays, managed as a single entity.