You are on page 1of 49

MICROSOFT SQL SERVER

HIGH AVAILABILITY
AND DISASTER RECOVERY
Michael Poremba // October 2008
Database HA & DR Experience…
2

 Work with business to determine HA or DR


requirements for applications and data?

 Design HA or DR solutions?

 Administer HA or DR process?

 Still learning MS SQL Server HA & DR capabilities?


Scope of this Presentation
3

Presentation Focus Beyond Scope of Presentation

 Data Availability  In-depth how-to


(available elsewhere)
 Data recovery  Partitioned views (federated)
 High availability  Advanced DBA techniques
 Custom application logic
 Disaster recovery  3rd-party software solutions
 Technology Focus  Alternate DBMS engines
(e.g. Oracle; DB2)
 MS SQL Server  HA on virtual machines
 Physical servers  Complex scenarios & solutions
 SANs  Load balancing
4 Introduction to Data Availability
So, you need to make your
production database bulletproof…
Data Availability Continuum
5

Degrees of protection for information systems:


Business Risk Solution
Data Recovery Data loss Redundant data
High Availability Downtime of Redundant system
database service components
Disaster Recovery Downtime of Redundant systems
business operations and facilities
Business Case for Availability
6

High Availability Disaster Recovery

 Keep business-critical  Protect against loss of


applications available data center
 Secondary:  Secondary:
 Server maintenance  Application upgrades
 Infrastructure
upgrades
Service Level Agreement (SLA)
7

 Permitted downtime (planned vs. unplanned?)


Uptime SLA Downtime Downtime
per Year per Month
99.9% 8.76 hours 43.8 minutes
99.99% 52.6 minutes 4.38 minutes
99.999% 5.26 minutes 0.438 minutes
 Acceptable data/transaction loss
 Application response times
 Mean time to recovery

Note: Database uptime is not equivalent to application availability


 Failures of other application services
 Network outages
Protect What?
8

 Application data stores


 Databases
 Files
 Other data repositories
 Database services
 DBMS availability for applications
 Application services
 Application availability for users and external systems

Databases are the heart of most information systems;


they deserve the highest affordable protection.
Database Failure Scenarios
9

Physical Infrastructure Failures Logical Data Failures

 Storage subsystem  Operator errors


 Disk  DBMS interruption
 Controller  Drops / deletes
 Network  Application defects
 Server  DBMS defects
 Power  Data corruption
Service Recovery Strategies
10

Standby Failover Behavior SQL Server Feature


Mode
Cold • Manual intervention • Backup and restore
standby required to restore offline
data copy
Warm • Data copy online and ready • Transaction log
standby • Manual failover required shipping
• Database mirroring
Hot • Automatic failover • Database mirroring
standby • Failover clustering
Data Recovery—Terminology
11

Terminology varies for source vs. copy


High Availability Data Source Data Copy
Strategy
Backup and Restore Database Backup
Log Shipping Primary Secondary
Standby
Database Mirroring Principal Mirror
Failover Clustering Primary Secondary
Active Passive
Standby
Inactive
12 Data Recovery
[Briefly…]
Database Backups
13

 Traditional backup types


 Full backup
 Differential backup
 Transaction log backup
 Disk is better than tape
 First backup to disk (separate physical disk volume)
 Detect exceptions encountered during backup
 Verify backup files
 Copy backup files to tape or remote disk
 Data retention policy for backup files
Database Backup Strategy
14

Backup of user databases not sufficient for recovery


 System database

 Master database

 MSDB database

 Model database

 External data stores…


Synch with External Data Stores
15

Synchronize recovered database with external data


stores:
Identity column seeds

Full-text indexes
(SQL Server 2000)
LDAP entries
File system objects

Other databases
Backup Retention Policy
16

 Location of backup files


 Duration of retention
 Protection of sensitive data
 Sarbanes/Oxley (SOX)
 HIPAA
 Internal policies for data management and protection
 Access to backups from offsite data storage
Data Recovery Process
17

 Backup file sets  Recovery strategy depends on


 Full baseline, differential, and failure scenario
transaction logs  Create comprehensive failure
 Retrieving backup files matrix
 Offsite storage
 Devise recovery strategy for
each scenario
 Tape  Does worst-case recovery
 Network copy scenario fit within SLA
 Dependency on multiple parameters?
people to get access to  Recovery time; SLA
backup files
 Include future data growth in
recovery plan
 Fully test recovery strategies
—practice is essential
18 High Availability
High Availability
19

 Minimize or avoid service downtime


 Whether planned or unplanned
 When components fail,
service interruption is brief or non-existent
 Automatic failover
 Eliminate single points of failure (as affordable)
 Redundant components
 Fault-tolerant servers
Redundant Components
20

Objective: Avoid single points of failure (where affordable)


Approach: Use redundant components for database service
 Database server nodes

 Server components
 ECC RAM; failure-tolerant HW & OS
 DBMS instance
 User databases
 Storage devices
 Storage unit components
 MPIO: Interfaces; paths; switches; controllers
 RAID: Disks
 Networking
 MPIO: Interfaces; paths; switches
 Data copies
 E.g. Recovering torn page from mirror in SQL Server 2008
Transaction Log Shipping
21

 Warm standby solution


 Duplicate user database
 Copy transaction logs to standby server & restore
 Database available for read-only access
 Users must disconnect for logs to be applied
 Two database licenses required if querying standby
 Manual application failover
 Supported on standard hardware
 Possible data loss (unapplied transactions)
Database Mirroring
22

 Redundancy at user database level


 Duplicate copy of user database
 Independent storage devices
 Multiple copies of instance databases
witness
 Mirrored over private network channel (optional)
 Mirror always redoing transactions from principal
 Negligible impact on transaction throughput
 Multiple mirroring modes: node A node B
 High-availability: commit @ log on mirror; automatic failover
 High-protection: commit @ log on mirror; manual failover
 High-performance: commit when logged on principal
 Very fast automatic failover—seconds Local Storage Local Storage
· local sys DBs · local sys DBs
 Requires witness server · source user DB · mirror user DB
 Mirror-aware application client connection
 Provided by client library
 Database connection string must specify both servers
 Mirror may be available for read-only access (snapshots)
 Works with standard hardware
Mirror Witness
23

 With mirroring, more than one server is required to


decide on failover
 Witness automates failover from primary to mirror
 Watches database availability
 Reports observations back to principal and mirror
 Runs in separate SQL Server instance (Express is OK)
 Prevents “split brain” scenario
 Very low resource consumption
 Can be witness for multiple databases
 Not a single point of failure
SQL Server Failover Clustering
24

 Two clustered nodes


 Active/Passive config
 MS SQL services
 Running on virtual server node A node B

 Shared storage device Shared Storage


 User databases · system DBs
· user DBs
· quorum
 System databases
 Quorum drive
 Redundant internal
components
Active/Passive Failover Clustering
25

 Redundancy at database instance level


 All databases fail over together
 Shared copy of system databases
 Single data copy on shared storage
device
 No I/O overhead reducing throughput
 Storage unit is single point of failure for node A node B
cluster
 All database services are clustered
 SQL Agent; Analysis Services; Full-Text Shared Storage
engine, MS DTC · system DBs
· user DBs
 Automatic failover (up to minutes)
· quorum
 DBMS accessed over virtual IP
 Database not available from inactive
node for DB client connections
 Storage is controlled by one cluster node
at a time
 Requires hardware certified by Microsoft
for Microsoft Cluster Service
HA Comparison
26

Database Mirroring Failover Clustering


 Scope: user DB  Scope: DBMS instance
 Standard hardware  Certified hardware
 One SQL license  One SQL license
(unless querying snapshots on (only one node can access
mirror) database)
 Very fast failover (seconds)  Automatic failover (up to minutes)
 OS flexible (e.g. 32/64)
 Enterprise OS
 Independent storage
 Shared storage
 Independent services
 Clustered services
 Standby not available
 Reporting on mirror
 Servers are usually co-located
 Geographic separation OK
Considerations for HA
27

 HA complements backup and recovery strategy


 Does not replace data recovery plan
 Application service availability is often determined by
a network of interdependent services
 Availability can be difficult to define (e.g. partial failures)
 Failure probability difficult to measure or compute
 Increased system complexity could lead to lower
service availability!
 Operator error a leading cause of availability issues
 Increased number/types of system components
 More complex to configure and administer
Data Recovery Requirements
28
29 Disaster Recovery
Disaster Recovery
30

 Minimize downtime of business operations


 Redundant systems and facilities
 SQL Server features:
 Transaction log shipping
 Database mirroring
 Failover clustering
 Other technologies
 Storage-based mirroring
Disaster Recovery Planning
31

 Data security requirements


 Clarify SLA, data loss allowance
 Evaluate system cost vs. data protection
 Failure analysis
 System redundancy
 Process validation
 Training for personnel
 Prevention practices
 Executing disaster recovery and business continuity
 Practice, practice, practice
Business Continuity Facility
32

 System redundancy
 Systems: Web servers app servers; database, etc.
 Data: Databases; data files on OS; security info, etc.
 Networking: Domain, routing, subnet, VIPs, etc.
 Alternate facilities
 Network bandwidth
 Physical or network access by operations staff
 Failover
 Often a deliberate decision, using manual failover
Data Redundancy
33

 Synchronous redundancy
 Network bandwidth cost
 Network latency and application performance
 Network reliability
 Asynchronous redundancy
 Risk of data loss
 More cost-effective
 Resilient to network latency issues
 Candidate Technologies
 SQL Server database mirroring
 Failover clustering with SAN-based mirroring
DR Using Database Mirroring
34

 Two sites: Primary and DR location


 Separate failover clusters at each site
 SQL Server database mirroring between sites
witness
(optional)

failover cluster at site A failover cluster at site B

node A1 node A2 database node B1 node B2


mirroring

Shared Storage A Shared Storage B


· local sys DBs · local sys DBs
· local quorum · local quorum
· source user DB · mirror user DB
DR Using SAN-Based Mirroring
35

 Two sites: Primary and DR location


 Four-node failover cluster; one virtual IP address
 SAN-based mirroring between sites
 Manual cluster failover
failover cluster nodes at site A failover cluster nodes at site B

node A1 node A2 node B1 node B2

storage-
based
Shared Storage A mirroring Shared Storage B
· system DBs · system DBs
· quorum · quorum
· user DBs · user DBs
36 Complimentary Technologies
[Skip if time is running short.]
SAN-Based Data Mirroring
37

 Data blocks duplicated at storage level


 Similar to transaction log shipping
 Copy performed in sequence and coordinated with
database checkpoint
 Ensures consistency of mirrored data files
 Synchronous or asynchronous mirroring
 Co-located or geographically dispersed—both are OK
 SAN link bandwidth must support database I/O rate
 May require extra feature support from SAN vendor
 Could rely on Failover Clustering for HA
SQL Server Database Snapshots
38

 Read-only point-in-time database snapshot


 No data is copied—instantaneous
 Historical snapshot pages tracked separately from
changing pages
 Snapshots can be maintained indefinitely
 Limited only by available storage
 Snapshot copy can be used for reporting
 Read-only, so no locking issues
SQL Server Replication
39

 Transactional replication  Subscriber databases


 High transaction volume available for reporting
 Low data latency required  Replicate data subsets
 Mixed technologies:  Some data loss is possible
Integrates with other DBMS
 Merge replication
 Periodically validate
replicated data
 Bi-directional data changes
 Typically server-to-client
 Snapshot replication
 Large, infrequent data
changes
 Data change latency OK
 Best for smaller data sets
40 App Development and Admin
Considerations for App Developers
41

 App services tolerant to database service interruptions


 Application transactions must be handled in code—data consistency
 Exception handling for transaction retry, connection recovery
 Requires coding standards, code reviews, and testing
 Bulk data operations
 Transaction volume impacts rollback time during failover
 Batch jobs must be run on alternate nodes
 Don’t bypass transaction logging
 Synchronization with external data sources?
 Be aware of database recovery model
 Mirroring uses FailoverPartner in connection string
 Use TCP/IP as client protocol
Considerations for Admins
42

 Use identical server hardware, when possible


 Design network redundancies, when feasible
 Consider network latency for geographic separation
 Always manage through virtual cluster, not individual cluster nodes
 Retest failover/failback after HA maintenance
 Diagnose after failover
 Repair alternate node
 Resynchronize data, as necessary
 Be aware of primary/secondary locations
 Ensure application services are connected and functioning properly
 Keep server node configurations synchronized:
 Service pack and patch levels
 Duplicate non-redundant resources
 Jobs; logins and permissions; OS & sys objects
HA Risks
43

 System performance degradation


 HA system complexity leads to availability issues
 Some system failures not planned for
 Backup and recovery planning incomplete
 Administrators not fully trained or informed
 User databases not synchronized with other data
sources
Common Admin Use Cases
44

 Maintain HA nodes
 Hardware maintenance
 Rolling upgrades and software patches
 Resynchronize the redundant copy
 Re-synch mirror
 Restart log shipping
 Diagnose and repair
 Diagnose cause of failover
 Repair failed node and restore failover capabilities
 Test failover and failback
Common Admin Actions
45

Train and practice administrators to:


 Initiate a database mirror

 Manually failover mirror database or cluster node

 Add/remove passive node from mirror or cluster

 Upgrade/patch servers nodes

 Restart or redirect application services


46 More Information
References—Books
47

High Availability Related Topics


 Microsoft SQL Server 2008 High  Pro SQL Server 2005 Replication
Availability with Clustering & by Sujoy Paul, 2006.
Database Mirroring  Pro SQL Server 2005 Service
by Michael Otey, 2009. Broker
 Microsoft SQL Server High by Klaus Aschenbrenner, 2007.
Availability  The Rational Guide to SQL Server
by Paul Bertucci, 2004. 2005 Service Broker
 Pro SQL Server 2005 High by Roger Wolter, 2006.
Availability
by Allan Hirt, 2007.
References—Presentations
48

 Microsoft Load Balancing and Clustering


http://ce.sharif.edu/courses/84-85/2/ce317/resources/root/lecture%20slides/14.%20Microsoft%20L
oad%20Balancing%20and%20Clustering.ppt
 SQL Server 2005 High Availability
http://www.atlantamdf.com/Presentations/AtlantaMDF_111207HA.ppt
 High Availability Technologies In SQL Server 2000 And SQL Server 2005
http://202.181.238.2/hk/teched2004/ppt/Day_2_Rm407/DAT431(1330-1445).ppt
 Meeting the Availability Challenge
http://download.microsoft.com/download/E/D/C/EDCF54DB-19CD-4882-9FC4-
4F7D46FCEAA6/HighAvailability.ppt
 Disaster Recovery Mistakes
http://www.sqlsig.org/Oct%2011%20DASSUG%20-%20Jason%20Hall%2010-11-07%20MM.ppt
 SQL Server 2005 High Availability
http://blogs.msdn.com/sql2005event/attachment/564303.ashx
 Effective Usage of SQL Server 2005 Database Mirroring
http://www.sqlserver-qa.net/SSQA-Effective%20Usage%20of%20SQL%20Server
%202005%20Database%20Mirroring_show.ppt
References—Articles
49

 Achieve High Availability for SQL Server


http://technet.microsoft.com/en-us/magazine/cc162477.aspx
 Geographically Dispersed Clusters in Windows
Server 2003
http://www.microsoft.com/windowsserver2003/techinfo/overview/clustergeo.mspx
 Restoring file and filegroup backups
http://support.microsoft.com/kb/281122/en-us
 Restoring specific tables or rows from backups
http://support.microsoft.com/kb/321836/en-us
 Maintaining Availability During Upgrades
http://msdn.microsoft.com/en-us/library/ms191449.aspx

You might also like