
Copyright Intelligent Quotient System Pvt. Ltd.

Business
Continuity and
Disaster
Recovery
MODULE-III

Business Continuity and Disaster Recovery

Preface
The purpose of this book is to give an overview of Business Continuity
Planning and its implementation. It covers topics such as the need for and
importance of BCDR, types of disasters, disaster recovery, BCP and
governance, industry standards supporting BCP and DRP, and the benefits of
BCP and DR.
This book first introduces the basics of the Business Continuity Plan and the
BCP process steps for developing a Business Continuity Plan. It then provides
in-depth coverage of BCP/DR and recovery technology, and of disk system fault
tolerance.

Why a Business Continuity and Disaster Recovery Plan?

With phenomena such as global warming, a disaster might occur at any time.
With a solid Disaster Recovery plan, it is possible to retrieve lost data swiftly
and smoothly, thus protecting the business against unpredictable losses. With
proper planning and implementation, it is possible to face unwanted disasters
such as accidental deletion of critical data or a system failure on a massive
scale. The Disaster Recovery plan serves availability, which is one of the main
things checked in SLAs (service level agreements).
The work environment is becoming increasingly large and complex. It is not
sufficient for companies to merely protect themselves against risks. Instead,
they need an integrated business flexibility process that can help them adapt
and respond to these challenges. Disasters, especially IT disasters, are very
costly. Most people opt for insurance, but what they insure is their assets and
not their data. Take the example of a bank: in case of a disaster, customers do
not care about the bank's assets; they care whether the bank shows the
correct balance of their account. Hence it is important from any business
perspective to understand the Business Continuity and Disaster management
function, especially in the realm of IT.

Who Is This Book For?


This book is intended to serve the needs of students and to provide guidance on
building a robust recovery plan, implementing it, and focusing on the principle
of availability.
In addition, concepts are reinforced by real-world examples of disasters and
their consequences. These real-world examples, along with hands-on practicals
and case studies, make this book a practical learning tool.


Table of Contents

Chapter 1: BCP and Secure Processes
1.1 Introduction
1.2 Management Commitment
1.3 PDCA
1.4 Conclusion

Chapter 2: Business Continuity and Disaster Recovery
2.1 Introduction
2.2 Need of BCDR
2.3 Types of disasters
2.4 BCP and DRP Differences and Similarities
2.5 Components of BCP/ Disaster Recovery
2.6 BCP and Governance
2.7 Industry Standards supporting BCP and DRP
2.8 Benefits of BCP and DR

Chapter 3: BC/DR Planning
3.1 Business Continuity and Disaster Recovery Plan Steps
3.2 Benefits of BCP and DRP Planning
3.3 Basics of Business Continuity Plan
3.4 BCP Process Steps
3.5 Development of Business Continuity Plan

Chapter 4: BCP/DR Plan Development and Implementation
4.1 Purpose of BCP
4.2 BCP Methodology
4.3 BCP/DR Testing Techniques
4.4 BCP/DR Maintenance and Re-assessment of Plans
4.5 Features of good BCP
4.6 Data Recovery Strategies
4.7 Contents of Disaster Recovery Plan

Chapter 5: BCP/DR and Recovery Technology
5.1 Fault Tolerance and Disaster Recovery
5.2 Hot sites
5.3 Clustering Technologies
5.4 Warm Sites
5.5 Cold Sites
5.6 Power Management
5.7 Issues in implementing a DC/DR solution

Chapter 6: Disk System Fault Tolerance
6.1 Server Storage Technologies
6.2 Disk System Fault Tolerance
6.3 Disk Management in Microsoft Windows OS
6.4 Disk Management Tool
6.5 Creating Dynamic Volumes
6.6 Backup Considerations
6.7 Virus Protection

Chapter 1
BCP and Secure processes
Objective
1.1 Introduction
1.2 Management Commitment
1.3 PDCA
1.4 Conclusion

1.1 Introduction

ISO 27001 has 11 domains, which address the key areas of information security
management. It covers the following areas:

Security policy
Organizing information security
Asset Management
Human Resource Security
Physical and Environmental security
Communication and operation management
Access Control
Information System Acquisition, Development and maintenance
Information Security Incident Management
Business Continuity Management
Compliance

It has a total of 133 best practices (the controls listed in Annex A of the 2005
edition), which cover all 11 domains. The best practices are controls used to
achieve the objectives of IT security management. ISO 27001 uses the PDCA
(Plan-Do-Check-Act) model for its implementation. PDCA is a cyclic model that
has to be run over the long term with the solid backing and dedication of
management. It ensures that the correct components are engaged, evaluated,
monitored and improved on a continuous basis.


1.2 Management Commitment

The requirement for BS7799 / ISO 27001 implementation or certification is
mainly driven by external pressure, such as a client requirement. Management
will typically be concerned only with that pressure, and its first step will be
to allocate a budget for the project and ask the IT, QMS or any other
department to complete it. The goal should be to make management understand
the actual requirement for this implementation and also to project the results
and benefits of the project. [1]

1.3 PDCA

[Figure: PDCA cycle. Source: http://www.velaction.com/pdcacycle/]

[1] http://www.infosecwriters.com/text_resources/pdf/ISMS_VKumar.pdf


The first step in implementing ISO 27001 from a BCDR perspective is the PLAN
phase.

a. PLAN: In this phase, the team appointed by senior management shall
find out the existing assets and processes, critical as well as non-critical,
of the company for which BCDR is to be implemented. The word "assets"
includes people, data/information, etc.

i. Identify these assets. This is the first job.
ii. Label these assets as per their sensitivity or criticality. This is also
called classification of assets.
iii. Identify the vulnerabilities relating to the assets, and find out the
existing threats to these assets.
iv. Check whether any controls are implemented to minimize the damage from
the threats, and whether they are sufficient to reduce the risk.
v. Calculate the value of each asset to the organization.
The value is derived from confidentiality, integrity and availability (CIA).
E.g. to calculate the value of a mail server, we may rate each of the three
CIA properties on a scale of 1 to 5. The following is one of the methods.
Asset Value = Confidentiality + Integrity + Availability
Mail Server Value = 4 + 4 + 4 = 12 (for very critical)
Mail Server Value = 2 + 2 + 2 = 6 (for not critical)

vi. Probability of Occurrence
For each and every asset, it is important to find out the probability that a
threat to that asset will occur within the organization. The probability of
occurrence is required to understand the frequency at which such failures
occur. It is based on previous experience and on a review of the current
implementation. Usually, probability is flagged as High, Medium or Low.
Every department head, or a knowledgeable person from the department, has
to set this probability. They also have to find the interdependent processes
and the effect on them in case of disruption.


vii. Risk Value
The risk value is then calculated. Risk is always calculated in terms of
numbers, i.e. in rupees or any other relevant currency. The risk value is
calculated by identifying the possible threats that can impact CIA,
considering both the impact and the frequency of that impact.
E.g. the threats to the mail server:
Power failures
Hardware failure
Fire
Virus attacks / Malicious code injection
Intruders (Hacking), Denial of Service (DoS attack)
Mail accidentally sent to a different recipient
Data corruption / data loss
Unauthorized access
Link failure
Natural calamities
Risk can be calculated with a single formula:
Risk = Vulnerability * Threat
The result of the risk value calculation is the input for the next phase,
i.e. the DO phase, to decide which assets should be treated on priority.
Usually, once the risk value calculation is done, the assets are ranked
from highest risk to lowest risk. The risk treatment methodology is then
selected accordingly.

viii. Business Impact Analysis (BIA)
BIA is performed to analyze the impact on the system due to various
unprecedented events or incidents. The various failure scenarios and their
possible business impacts are analyzed. This includes technical problems,
human resources and other events.

BIA is different from risk assessment. Risk assessment identifies the
possible threats and vulnerabilities and how they will impact the asset and
the business. The asset value shows how critical that asset is to the
organization. BIA, by contrast, is based on time. If there is a server crash,
how long can the organization go without an email server? The RPO (Recovery
Point Objective) and RTO (Recovery Time Objective) are calculated based on
how critical the asset is to running the business.


The Business Impact Analysis is done using the following steps:
1. Identification of Critical Assets
2. Outage Impact
3. Develop Priorities
After the BIA has been considered, the priorities are decided based on
the impact on the business. Sometimes there is an outage of an entire site,
known as a disaster, e.g. the 9/11 attacks.
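The asset valuation and risk ranking steps of the PLAN phase can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the `Asset` class, the "intranet wiki" asset, and the 1 to 5 ratings for `vulnerability` and `threat` are assumptions made for the example. The text only gives the two formulas and the mail server ratings.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """One asset with CIA ratings on the 1-5 scale used in the text."""
    name: str
    confidentiality: int
    integrity: int
    availability: int
    vulnerability: int  # hypothetical 1-5 rating, not defined in the text
    threat: int         # hypothetical 1-5 rating, not defined in the text

    def value(self) -> int:
        # Asset Value = Confidentiality + Integrity + Availability
        return self.confidentiality + self.integrity + self.availability

    def risk(self) -> int:
        # Risk = Vulnerability * Threat
        return self.vulnerability * self.threat


def rank_by_risk(assets):
    """Return the assets ordered from highest risk to lowest risk."""
    return sorted(assets, key=lambda a: a.risk(), reverse=True)


mail_server = Asset("mail server", 4, 4, 4, vulnerability=3, threat=4)
intranet_wiki = Asset("intranet wiki", 2, 2, 2, vulnerability=2, threat=2)

ranked = rank_by_risk([intranet_wiki, mail_server])
print(mail_server.value())       # 12
print(intranet_wiki.value())     # 6
print([a.name for a in ranked])  # ['mail server', 'intranet wiki']
```

The ranked list is what the DO phase would take as input when deciding which assets to treat first.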
b. DO Phase:
The inputs from the PLAN phase are:
Critical and non-critical assets, including critical and non-critical
processes within the organization
Threat exposure
Risk value calculation and prioritization of the assets
The Business Impact Analysis, showing the loss of business under the
existing state of controls
Together these decide the further treatment for protecting the assets and
processes for the continuation of business.
The DO phase is the actual implementation phase. In the earlier PLAN phase,
we did all the preparatory work of identifying, calculating, ranking etc.
In the DO phase, the committee decides the action to be taken to minimize
the risk and to look towards continuation of the business.
With the exposure factor and threats ascertained and the business impact
analysis calculated, risk is further treated with the formula of 3T-1M, i.e.
i. Risk Transfer
ii. Risk Treatment
iii. Risk Tolerate
iv. Risk Mitigate
i. Risk Transfer: Risks that can neither be accepted by management nor be
treated are transferred to a third party to reduce the onus of risk on
management. Take the example of a simple asset such as land and buildings.
The valuation of land and buildings usually runs into crores. Management
cannot afford to leave natural threats such as earthquakes, floods and
hurricanes open. Hence a clever management will immediately transfer this
kind of risk to an insurance company, which will take care of the
appropriate claim in case of any natural disaster.

ii. Risk Treatment: In the PLAN phase, the committee has already analyzed
the existing controls, if any, for the critical as well as non-critical
assets and processes. They also check whether the existing controls are
really reducing the risk to a level acceptable to management.
For example: In a company, Gtalk access is present on the users' machines.
Through Gtalk, a user can accept or send files. The threat is that of a
contaminant, i.e. a virus, worm or Trojan. The company understands this and
has installed a licensed version of a well-known anti-virus product. It also
updates and scans the network at regular intervals, but there are still
constant attacks by such contaminants. Here, one can say that the company
has installed a control for the contaminant threat, but that it is not
sufficient. The treatment may be to block Gtalk from the server end by
blocking the port or installing a firewall. This is called Risk Treatment.

iii. Risk Tolerate: This is also called Risk Acceptance. Risk can never be
zero. This means that after identifying the risk and applying controls, some
risk still remains; it is called Residual Risk. Management has to accept this
residual risk under any circumstances, because it cannot be treated further.
For example: Take the example of a petrol pump. Petrol, being a chemical,
has an inherent property of evaporation. The company takes every precaution
to reduce the evaporation loss, but because of this inherent property it
still has to bear a certain loss that cannot be treated by installing any
control. The company has to tolerate this loss.

iv. Risk Mitigation: This is something like avoiding the loss or risk. For
example: A manufacturing company makes screws as well as nails. Suppose
that for the last 3 years the screw department has been generating constant
super profits and the nail department constant losses. It is then a wise
decision for management to shut down the loss-making unit, i.e. the nail
department, to reduce future risk. This is called Risk Mitigation.

Risk management is nothing but making every attempt to minimize risk to
a level at which it can be accepted, or at which it cannot be treated or
reduced any further.
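One possible reading of the 3T-1M options as a decision procedure is sketched below. The function name `choose_treatment`, its three yes/no inputs and the order in which they are checked are all assumptions made for illustration. The text describes the four options through examples and does not prescribe a selection algorithm.

```python
from enum import Enum


class Treatment(Enum):
    TRANSFER = "Risk Transfer"
    TREAT = "Risk Treatment"
    TOLERATE = "Risk Tolerate"
    MITIGATE = "Risk Mitigate"


def choose_treatment(can_add_control: bool, acceptable: bool,
                     can_stop_activity: bool) -> Treatment:
    """Pick one of the four options; the decision order is an assumption."""
    if can_add_control:
        # A further control (e.g. blocking a port) can still reduce the risk.
        return Treatment.TREAT
    if acceptable:
        # Residual risk that cannot be reduced further is accepted.
        return Treatment.TOLERATE
    if can_stop_activity:
        # Shutting down the loss-making activity avoids the risk.
        return Treatment.MITIGATE
    # Neither acceptable nor treatable: hand it to a third party (insurance).
    return Treatment.TRANSFER


# The four examples from the text, encoded with hypothetical inputs.
print(choose_treatment(True, False, False).value)   # Gtalk: Risk Treatment
print(choose_treatment(False, True, False).value)   # Petrol pump: Risk Tolerate
print(choose_treatment(False, False, True).value)   # Nail department: Risk Mitigate
print(choose_treatment(False, False, False).value)  # Land and building: Risk Transfer
```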


In this phase, training all employees on the policies, guidelines and
procedures to be followed in case of a disaster, or in the effort to continue
the business, is most important.
There are different methods of passing the information on to end users.
Some of them are explained below.
Train the trainer approach
At times it is very difficult to reach every user in an organization (usually
an organization with more than 500 employees), and tracking becomes a tedious
process. In this method, a set of people (generally at middle-management
level) is trained, and they take responsibility for training their teams.
Without train the trainer approach
This method is generally used in smaller organizations. Here the training
program is conducted for each and every employee of the organization by the
same team of trainers.
Training Materials
The preparation of training materials should depend on the target audience.
Split the organization into the following groups:
Senior Management
Middle Management
End Users
For a training session for senior management, make sure to include some
statistics from vulnerability reports and a comparison with previous reports.
The main focus should be to show the improvements that have been achieved
through this implementation.
End-user training can be conducted by shooting a short film with some
in-house members acting in the video. The video can also include pictures
taken in and around the organization's premises that serve as examples of
common security breaches, and those pictures can also be used as screen savers.
The handbook, hand-outs and Information Security bulletin are additional
means of spreading information to all employees.
ISO 27001 provides possible solutions for certain types of risks, which
can be referred to in the following table:


Risk Management Implementation

The risk management implementation is based on many factors, such as risk
appetite and availability of expertise.

Threats | ISO 27001 Controls | Possible Implementation
Power failures | A.9.2.2 | UPS, generator
Hardware failures | A.9.2.4 | AMCs (annual maintenance contracts)
Fire | A.9.1.4 | Fire extinguishers, sprinklers, phone list of concerned persons with names kept at the required location
Virus, malicious code injection | A.10.4.1 | Anti-virus, anti-spam, spyware removal tool
Hacking, DoS attacks | A.6.2.1, A.6.2.3, A.10.6.1 | Perimeter security devices, adequate network controls
Mail accidentally sent to a different recipient | A.10.8.4 | Digital signatures
Data corruption / data loss | A.10.5.1 | Backup
Unauthorized access | A.11.2.2, A.11.2.4, A.11.5.2 | Active Directory, user access rights
Natural calamities | A.9.1.4 | Identification of such areas using GIS, insurance, disaster recovery sites

The table above is an example of how each identified threat can be mapped to
ISO 27001 controls, and of how to find ways to minimize the risk.
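The mapping can also be kept as a simple lookup structure, which makes it easy to spot threats that have no control assigned yet. The sketch below is illustrative only: the names `CONTROL_MAP`, `controls_for` and `unmapped` are assumptions, the control references are as read from the table above, and the implementation column is abbreviated.

```python
# Threat -> (ISO 27001 control references, possible implementation),
# as read from the table above. Keys are lower-cased for lookup.
CONTROL_MAP = {
    "power failures": (["A.9.2.2"], "UPS, generator"),
    "hardware failures": (["A.9.2.4"], "AMCs"),
    "fire": (["A.9.1.4"], "Fire extinguishers, sprinklers"),
    "virus, malicious code injection": (["A.10.4.1"], "Anti-virus, anti-spam"),
    "hacking, dos attacks": (["A.6.2.1", "A.6.2.3", "A.10.6.1"],
                             "Perimeter security devices"),
    "mail accidentally sent to a different recipient": (["A.10.8.4"],
                                                        "Digital signatures"),
    "data corruption / data loss": (["A.10.5.1"], "Backup"),
    "unauthorized access": (["A.11.2.2", "A.11.2.4", "A.11.5.2"],
                            "Active Directory, user access rights"),
    "natural calamities": (["A.9.1.4"], "Insurance, disaster recovery sites"),
}


def controls_for(threat):
    """Return the control references mapped to a threat, or [] if unmapped."""
    controls, _implementation = CONTROL_MAP.get(threat.strip().lower(), ([], ""))
    return controls


def unmapped(threats):
    """Return the threats that have no entry in the mapping."""
    return [t for t in threats if not controls_for(t)]


print(controls_for("Power failures"))      # ['A.9.2.2']
print(unmapped(["Fire", "Link failure"]))  # ['Link failure']
```

In this sketch, "Link failure" from the earlier mail server threat list comes back as unmapped, because the table has no row for it.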
While making BCDR applicable as per the ISO 27001 standard, the first step
is to prepare the Statement of Applicability.
Statement of Applicability (SoA)
The SoA is a document that states all of the ISO 27001 controls that are
applicable for a particular type of organization. A justification also needs to
be given for any control that has not been chosen for implementation. The SoA
document will be provided to clients and external trusted authorities on
demand, so that they can identify the level of implementation of security
practices in the organization.
Control Reference - A.9.2.2
Description - Supporting utilities (power supply)
Implementation - Yes
Justification - Company has implemented UPS systems and also a dedicated
generator for the entire building
Some of these controls require policies to support the implementation, e.g. an
anti-virus policy that defines how anti-virus is to be deployed across the
organization, which tools are used and how it is monitored. The organization
needs to make sure that all the policies are in place, and also needs to
document the operating procedures of all the assets in the organization.
This is very important.
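An SoA entry like the one above can be represented as a small record, with a check that every control not chosen for implementation carries a justification. This is a sketch under assumptions: the `SoAEntry` class and the `missing_justifications` helper are hypothetical names, and the second entry is invented purely to show a failing case.

```python
from dataclasses import dataclass


@dataclass
class SoAEntry:
    """One row of a Statement of Applicability."""
    control_reference: str
    description: str
    implemented: bool
    justification: str


def missing_justifications(entries):
    """Return references of non-implemented controls with no justification."""
    return [e.control_reference for e in entries
            if not e.implemented and not e.justification.strip()]


entries = [
    # Entry taken from the example in the text.
    SoAEntry("A.9.2.2", "Supporting utilities (power supply)", True,
             "Company has implemented UPS systems and also a dedicated "
             "generator for the entire building"),
    # Hypothetical entry: excluded control whose justification is still blank.
    SoAEntry("A.10.9.1", "Electronic commerce", False, ""),
]

print(missing_justifications(entries))  # ['A.10.9.1']
```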
c. CHECK Phase:
In this phase, the output of the DO phase is used as the input. The team
has to check, verify and audit all the controls that are implemented for
BCDR. The team has to check two important points:
1. Whether the controls are implemented appropriately to cover the
weaknesses; and
2. Whether the controls implemented are sufficient to cover the weaknesses
and reduce the risk at all times.

Sometimes the anti-virus control is present but is not sufficient to cover
the contaminant risk, so management may decide to implement a firewall. In
that case, after the new control is installed, the audit team must make
sure, first, that the control is configured and working as intended, i.e.
it is reducing the risk to the organization, and second, that it raises
alerts whenever a threat occurs and generates log reports for continuous
monitoring.
Audit of the Controls
The audit is part of monitoring and reviewing the implemented process. The
first step in the certification audit process is the document review. The
following documents are generally audited:
Policy documents
Policy statement
Risk assessment report
Risk assessment procedure
Mapping of threats to the assets
Statement of applicability
Mapping of risk assessment report to the statement of applicability
BCP, BCP testing procedure and test results
Technical audit reports (Vulnerability Assessment and Penetration
Testing reports)
Metrics, if any
Procedure and guideline documents

The following types of audits are conducted.


On floor Audit
The auditor will look at physical security while walking through the
organization's premises, auditing user awareness as well as the individual
departments within the scope. All departments within the scope should have
their policy, procedure and guideline documents updated.
Internal Audit
An internal audit may be conducted before the start of the project. This will
reveal the gaps, and the project team will understand where the organization
stands. Further, conduct two more internal audits, one in the middle of the
project and one just before the document review. Document your internal audit
schedule for the next one year, as this is one of the documents that will be
asked for during the document review. The following audits are conducted as a
part of the internal audit.
Desktop Audit
A desktop audit is primarily done to check whether users have any illegal
content on their desktops, such as .mp3 files, video files, or .jpeg, .jpg and
.gif files that may contain pornographic material. Mailboxes are also audited
by looking for mails with huge attachments and for jokes being received and
forwarded to other colleagues (all of these must be listed as violations in
your organization's email policy). All the services running on the particular
machine are identified, and the ports that are open or closed are monitored
accordingly. Flagged events are checked wherever necessary. This shows any
attempts made by the user or a process to bypass the security policies,
procedures and guidelines. [3]

[3] http://www.infosecwriters.com/text_resources/pdf/ISMS_VKumar.pdf


User Awareness Audit
User awareness audits are conducted to check the level of awareness among
employees. Whatever technical solutions have been implemented, unless user
awareness is strong, users will remain the biggest threat to the organization.
Under the principle of Due Diligence and Due Care, every employer must
train its employees on the policies developed, procedures derived and
guidelines stated. A mock drill is usually conducted to determine whether
employees have really understood the meaning and purpose of the training.
Technical Audit
Vulnerability assessment and penetration testing are to be conducted by
external vendors. We should not build the network and test it ourselves. Also
audit the method of logging and monitoring internet traffic, keep an eye on it,
and see if there is any access to illegal sites.
Social Engineering
Social engineering is a method of extracting information from people (in this
case the employees) in order to intrude into your premises or network. Social
engineering tests can be conducted by making telephone calls, sending emails
etc.
Physical Security
Apart from walking around and viewing the infrastructure, some locations are
checked where anyone could get hold of confidential information. The printer
location is a common one, where a user has fired the print but has never
collected it. This is the place where we may find a pile of documents near the
printers.
Some organizations have the habit of piling up the documents to be shredded,
and the office boy does the shredding once every day at COB (close of
business). The need is to check whether the office boy actually shreds the
papers or whether some are carried away.
d. ACT Phase:
The results obtained from the CHECK phase are the inputs for the ACT phase.
Any discrepancy found in the audit results from the CHECK phase is reworked
in the ACT phase. The CHECK phase is essentially a gap analysis between the
expected result and the result a particular control is presently giving. In
the ACT phase, it is expected that all non-conformities as well as
non-compliances are resolved.
The ACT phase can also be termed the phase where post-audit checks are
confirmed.

Post Audit Check

The following things need to be acted upon as a part of PDCA.

Asset tags: Make sure all your assets are labeled as per your policy.
Mechanism to assess and improve user awareness among employees: There
should be a mechanism that, at a minimum, maintains records of the user
awareness training conducted.
Mechanism (procedure) to record security incidents and their solutions:
There should be a process to record security incidents found and reported
by users, the action taken for those incidents, and the learning from
those incidents.
Mechanism to store the logs of servers and other monitoring tools for
further reference: Log retention needs to be defined and practiced.
Back-up and restore procedures need to be in place. Tests of restoring data
have to be practiced and documented.
The BCP needs to be documented. Any test done to check the BCP needs to be
documented with its test results.
The DR site should be defined and documented.
All cabling (power and data) should be adequately protected.
License management should be demonstrated: License management records,
kept using a tool or in an Excel file, should be produced. Audits will be
conducted to check whether the installed software matches the license
management document.
Audit reports of VA, PT and other audits conducted in the organization
should be adequately documented and measured, and improvements should
be projected for auditing.
Patch management and anti-virus management are recommended to be
centralized, with a dedicated person assigned to monitor this area. A
random audit should be conducted to check whether any machine has been
missed by the system for anti-virus or patch updates.
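The random audit for missed anti-virus or patch updates in the last item could be sketched as follows. Everything here is an assumption made for illustration: the `random_patch_audit` function, the inventory layout (machine name mapped to an installed definition version number) and the machine names do not come from the text or from any particular patch management tool.

```python
import random


def random_patch_audit(inventory, required_version, sample_size, seed=None):
    """Sample machines at random and return those below the required version."""
    rng = random.Random(seed)
    names = sorted(inventory)
    # Never ask for more machines than exist in the inventory.
    sample = rng.sample(names, min(sample_size, len(names)))
    return sorted(name for name in sample if inventory[name] < required_version)


# Hypothetical inventory: machine name -> installed anti-virus definition version.
inventory = {"pc-01": 120, "pc-02": 118, "pc-03": 120, "pc-04": 95}

# Sampling all four machines makes the result independent of the random draw.
outdated = random_patch_audit(inventory, required_version=120, sample_size=4)
print(outdated)  # ['pc-02', 'pc-04']
```

In practice the sample size would be smaller than the inventory, and the machines found outdated would be recorded as audit findings for the ACT phase.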

Every stage of PDCA requires awareness training, which must be conducted
for better implementation. ISO 27001 provides detailed guidance for aligning
the ISMS with the organization's risk profile. The system is built through
iterations of the PDCA cycle, and each cycle improves the effectiveness of the
system. The ISMS's focus is on the confidentiality, integrity and availability
of information.


1.4 Conclusion

The overall ISO 27001 implementation plan works as follows:

[Figure not reproduced. Source: http://www.infosecwriters.com/text_resources/pdf/ISMS_VKumar.pdf]


Summary

ISO 27001 has 11 domains, which address the key areas of information
security management.
ISO 27001 uses the PDCA model for its implementation.
PDCA covers the Plan-Do-Check-Act phases in implementing ISO 27001
from a BCDR perspective.
Business Impact Analysis (BIA) is performed to analyze the impact on
the system due to various unprecedented events or incidents.
Risk is treated with the formula of 3T-1M, i.e. Risk Transfer, Risk
Treatment, Risk Tolerate, Risk Mitigate.
Risk management is making every attempt to minimize risk to a level at
which it can be accepted, or at which it cannot be treated or reduced any
further.
Training all employees on the policies, guidelines and procedures to be
followed in case of a disaster, or in the effort to continue the business,
is most important.
The audit is a part of monitoring and reviewing the implemented
process.
The ISMS's focus is on the confidentiality, integrity and availability of
information.


Chapter 2
Business Continuity and Disaster Recovery
Objective
2.1 Introduction
2.2 Need of BCDR
2.3 Types of disasters
2.4 BCP and DRP Differences and Similarities
2.5 Components of BCP/ Disaster Recovery
2.6 BCP and Governance
2.7 Industry Standards supporting BCP and DRP
2.8 Benefits of BCP and DR

2.1 Introduction
All of us know that threats are uninvited hurdles and problems that directly
result in monetary losses. Business continuity is the aim of every
organization, as organizations are in the market to make a profit. Disasters,
whether natural or man-made, may or may not be avoidable in certain
circumstances. Prioritizing the IT and technology needs of a business while
ensuring the IT budget is used effectively is arguably the biggest challenge.
Business continuity and disaster recovery solutions designed specifically for
maintaining the ongoing operation of key business systems during a major
event have traditionally been achieved via full physical backups and total
replication. This most often doubled the cost of IT infrastructure, giving
BCDR a reputation for being expensive and often unattainable. [5]

[5] http://www.itbusinessedge.com/cm/community/features/guestopinions/blog/have-you-bcdrd-yourbusiness/?cs=50061


2.2 Need of BCDR

A wise enterprise should ask itself, its directors, its partners or even its
senior management how much risk the business can afford and what the best
BCDR solution for the business is. This is how the need for BCDR should be
ascertained by each and every organization, irrespective of industry.
According to the American Management Association, about 50% of businesses
that suffer a major disaster without a disaster recovery plan in place
never reopen for business. Corporate governance, using IT governance, has
increased a corporate officer's liability for business continuity.
Organizations need to meet these business needs, so more senior executives
and security officers are turning to Business Continuity / Disaster Recovery
(BC/DR) services that help them protect their business in the event of a
disaster.
Expert consultancy should be engaged to build a comprehensive BC/DR
program. The program should effectively and efficiently meet corporate
governance requirements while minimizing spending on BC/DR projects. The
organization must work in partnership with its employees, vendors, partners
and government to ensure the continuity of critical business functions in the
event of a disaster. [6]
Unpredictability is an element of human life as well as of business.
Business continuity is not a project but a continuous process. Business
continuity planning for information systems is an element of the system of
internal control established to manage the availability of critical processes
and valuable computer data in the event of interruption. The ultimate goal of
the business continuity process is to be better able to respond to incidents
that may impact people, operations and the ability to deliver goods and
services to the marketplace.
A Business Continuity Plan (BCP) is a proactive plan to develop advance
arrangements and procedures that enable an organization to respond to an
interruption in such a manner that critical business functions continue with
planned levels of interruption. Especially in difficult economic times, some
organizations have even disbanded their BCP teams because they were unused
(or little used) and redirected them back to core processes.

[6] http://www.iim-edu.org/executivejournal/Whitepaper_BCDR_Best_Practices.pdf


For Examples of proactive solutions:

BCP is process desinged to reduce to the organizations business risk arising


from an unexpected disruption of the critical funtions that are necessary for
the survival of an organisation.
Now the business has become more compitative and complex because of
Information Technology and Globalization. Every organization carries corporate
risks and tries to ensure the survival of the organization by creating culture
which can manage those risks.
There are various threats and vulnerabilities to which business today is exposed. They could be:
1. catastrophic events such as floods, earthquakes, or acts of terrorism
2. accidents or sabotage
3. outages due to an application error, hardware or network failures
Some of them come without warning. Most of them never happen. The key is to be prepared and able to respond to the event when it does happen, so that the organization survives, its losses are minimized, it remains viable, and it can return to business as usual, ideally before customers feel the effects of the downtime. An effective Business Continuity Plan serves to secure businesses against financial disasters.
Not many years ago, when a business wanted to prepare itself against disaster and ensure business continuity should catastrophe strike, the bulk of the organization's time, money, and effort would be spent on ways that disasters could (hopefully) be avoided altogether. Often the outcome of an organization's search for ways to protect its most critical business applications (in order to shore up business continuity if disaster hit) was the finding that it could potentially avoid harm through the use of redundant data lines.

2.3 Types of Disasters

Every organization is at risk from potential disasters, which fall into two broad groups: natural disasters and man-made disasters.

Natural Disasters

Tornadoes
Floods
Blizzards
Earthquakes
Fire


Man-made Disasters

Labor: strikes, walkouts, and slow-downs that disrupt services and supplies
Social-political: war, terrorism, sabotage, vandalism, civil unrest, protests, demonstrations, cyber-attacks, hacker activities and blockades
Materials: fires, hazardous materials spills
Utilities: power failures, communications outages, water supply shortages, fuel shortages, and radioactive fallout from power plant accidents

Disasters can be further classified into four categories: insurable, non-insurable, technical, and non-technical.

Disasters can take several different forms. Some primarily impact individuals (e.g., hard drive meltdowns), while others have a larger, collective impact. Disasters include power outages, floods, fires, storms, equipment failure, sabotage, terrorism, and even epidemic illness. Each of these can at the very least cause short-term disruptions in normal business operation. But recovering from the impact of many of these disasters can take much longer, especially if organizations have not made preparations in advance. If proper preparations have been made, however, the disaster recovery process does not have to be exceedingly stressful; it can be streamlined. Organizations that take the time to implement

disaster recovery plans ahead of time often ride out catastrophes with minimal
or no loss of data, hardware, or business revenue. This in turn allows them to
maintain the faith and confidence of their customers and investors.7
Some disasters can be insured against, so the loss can be minimized. For example, if a fire destroys a building, an insurance claim can recover much of the value of the building and of the assets in it. But not all losses can be insured. For example, a system administrator leaving the job formats the hard drive, and the company loses all of its data from the last 3 years, for which no backup exists. This loss due to human behavior cannot be insured.

Preparedness: Every organization should anticipate all the threats associated with the type of industry in which it operates. For example, a petrol pump owner can anticipate loss during transport (i.e. road accidents), loss due to a rise in temperature, loss due to fire at the petrol pump, and loss due to human error or negligence, and has to implement the necessary precautions.
7

http://itfirstaid.ca/services/disaster-recovery/



Response: In the same example, the petrol pump should take out transit insurance, install fire extinguishers and smoke detectors, train the employees in the emergency procedures, keep sand buckets ready, etc.
Recovery: In case of an actual fire, the sand buckets and fire extinguishers are to be used appropriately. Since all the employees are trained and know how to execute the emergency recovery plan, the recovery can be done with minimum damage.
Mitigation: Drawing either on disasters it has faced itself or on those seen in its industry, the organization can anticipate disasters, and management can make new plans to mitigate such threats. This also reduces the potentially huge cost of damage.
BCP/Disaster Recovery Planning is the factor that makes the critical difference between the organizations that can successfully manage crises with minimal cost and effort and maximum speed, and those that are left picking up the pieces for untold lengths of time and at whatever cost providers decide to charge, forced to make decisions out of desperation.9
Detailed disaster recovery plans can prevent many of the heartaches and
headaches experienced by an organization in times of disaster. By having
practiced plans, not only for equipment and network recovery, but also plans
that precisely outline what steps each person involved in recovery efforts
should undertake, an organization can improve their recovery time and
minimize the time that their normal business functions are disrupted. Thus it
is vitally important that disaster recovery plans be carefully laid out and
regularly updated. Organizations need to put systems in place to regularly
train their network engineers and managers. Special attention should also be
paid to training any new employees who will have a critical role in the disaster
recovery process.
There are several options available for organizations to use once they decide to
begin creating their disaster recovery plan. The first and often most accessible
source a business can draw on would be to have any experienced managers
within the organization draw on the knowledge and experience they have to
help craft a plan that will fit the recovery needs specific to their unique
organization. For organizations that do not have this type of expertise in house,
there are a number of outside options that can be called on, such as trained
consultants and specially designed software.
One of the most common practices used by responsible organizations is a
disaster recovery plan template. While templates might not cover every need
specific to every organization, they are a great place from which to start one's
9

http://www.disasterrecovery.org/disaster_recovery.html


preparation. Templates help make the preparation process simpler and more
straightforward. They provide guidance and can even reveal aspects of disaster
recovery that might otherwise be forgotten.
The primary goal of any BCP/disaster recovery plan is to help the organization maintain its business continuity, minimize damage, and prevent loss. Thus the most important question to ask when evaluating a disaster recovery plan is, "Will the plan work?" The best way to ensure the reliability of a plan is to practice it regularly. Have the appropriate people actually practice what they would do to help recover business function should a disaster occur. Regular reviews and updates of recovery plans should also be scheduled. Some organizations find it helpful to do this on a monthly basis so that the plan stays current and reflects the needs the organization has today, and not just the data, software, etc., it had six months ago.
IT Disaster and WAN Redundancy
One of the most common areas of vulnerability for organizations when a
disaster strikes is the loss of their WAN connectivity. Earthquakes, floods, and
acts of war can certainly disrupt the use of an organization's data lines. But
loss of WAN connectivity can happen even without a major catastrophe. Much
simpler threats such as the accidental cutting of data lines or equipment
failure can have the same devastating net result on connectivity. Whether the
cause is a construction mishap from the new building next door, or the effects
of a far more serious event like a flood, fire, or terrorist attack, if an
organization loses their connectivity their business continuity is often lost as
well, and they are functionally in a state of disaster.
The loss of WAN connectivity can have serious consequences for an
organization's daily business activities. Emails, financial transactions,
ERP/CRM systems, order placement and processing, are just a few of the
critical operations affected by WAN connectivity. If connectivity is lost these
activities can be severely slowed or halted altogether until data lines can be
recovered. Thus, having a functioning WAN system is critical for productive
business operation and should be an integral part of any disaster recovery
plan.
There are several methods available for organizations who want to ensure a
high availability of WAN connectivity as part of their disaster recovery plan. The
earliest techniques used to back up data lines were complex and cumbersome.
They used multiple data lines that were connected to a programmable router.
Complex programming allowed data to be passed over multiple connections
which helped reduce vulnerability to a single line and helped protect against
backbone failure. This technique, though far from streamlined, was better than
no back-up system at all and did help maintain at least some business
continuity.

Since that time the technology has evolved and a more elegant technique is
available. This new technique involves the use of intelligent devices that can
handle multiple data lines of different speeds from multiple providers
simultaneously. These devices, called Router Clustering Devices, intelligently
detect if a line, component or service is failing and then proceed to switch the
flow of data to other available and working lines. These advancements provide
better protection for an organization's data flow. They reduce the potential
mess of disaster recovery and in turn increase business continuity when
disasters do happen without the complexity and awkwardness of the old
system.
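The failover behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the logic of any real Router Clustering Device: the WanLink class, the provider names, the line speeds and the "fastest healthy line wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WanLink:
    name: str          # hypothetical label for a provider's data line
    speed_mbps: int    # nominal line speed
    healthy: bool      # result of the most recent health check


def pick_active_links(links):
    """Return only the working lines, fastest first."""
    usable = [link for link in links if link.healthy]
    return sorted(usable, key=lambda link: link.speed_mbps, reverse=True)


def route_traffic(links):
    """Name of the line that should carry traffic, or None if every line is down."""
    active = pick_active_links(links)
    return active[0].name if active else None


# Three lines of different speeds from different (made-up) providers.
links = [
    WanLink("provider-a", 100, healthy=True),
    WanLink("provider-b", 50, healthy=True),
    WanLink("provider-c", 20, healthy=True),
]
primary = route_traffic(links)          # fastest healthy line is chosen
links[0].healthy = False                # simulate a cut cable on provider-a
after_failure = route_traffic(links)    # traffic moves to the next working line
```

In this sketch traffic shifts from provider-a to provider-b as soon as the health flag changes. A real device would also have to decide how a line is probed and how quickly it is declared failed, which the sketch leaves out.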
2.4 BCP and DRP Differences and Similarities

Disaster recovery is the process by which business resumes after a disruptive event. The event might be something huge, like an earthquake or the terrorist attacks on the World Trade Center, or something small, like malfunctioning software caused by a computer virus.
Business continuity planning suggests a more comprehensive approach to making sure that business can keep making money, not only after a natural calamity but also in the event of smaller disruptions, including illness or departure of key staffers, supply chain partner problems, or other challenges that businesses face from time to time.
BCP

Activities required to ensure the continuation of critical business processes in an organization
Alternate personnel, equipment, and facilities
Often includes non-IT aspects of business

DRP

Assessment, salvage, repair, and eventual restoration of damaged facilities and systems
Often focuses on IT systems

2.5 Components of BCP/Disaster Recovery

1. Destructive measures
2. Response procedures and continuity of operations
3. Determination of backup requirements
4. Development of plans for recovery actions after a disruptive event
5. Development of procedures for off-site processing
6. Guidelines for determining critical and essential workload
7. Team member responsibilities in response to an emergency situation
8. Emergency destructive procedures

2.6 BCP and Governance

The first step is to obtain the commitment of management and all the stakeholders to the plan. They have to set down the objectives of the plan, its scope and its policies. An example of a decision on scope would be whether the target is the entire organization or just some divisions, or whether it covers only data processing or all of the organization's services. Management provides sponsorship in terms of finance and manpower. It needs to weigh potential business losses against the annual cost of creating and maintaining the Business Continuity Plan. For this, it will have to answer questions such as how much the plan would cost and how much coverage would be considered adequate.
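One common way to weigh potential losses against the annual cost of a plan is an annualized loss expectancy (ALE) calculation. The sketch below assumes that approach. The function names and all of the figures are hypothetical and serve only to show the arithmetic.

```python
def annualized_loss_expectancy(single_loss, occurrences_per_year):
    """ALE = loss from one occurrence x expected number of occurrences per year."""
    return single_loss * occurrences_per_year


def bcp_is_justified(ale_without_plan, ale_with_plan, annual_plan_cost):
    """True when the yearly loss avoided by the plan exceeds what the plan costs."""
    return (ale_without_plan - ale_with_plan) > annual_plan_cost


# Hypothetical figures for illustration only.
ale_before = annualized_loss_expectancy(500_000, 0.2)   # one major outage every 5 years
ale_after = annualized_loss_expectancy(100_000, 0.2)    # the plan limits the damage
justified = bcp_is_justified(ale_before, ale_after, annual_plan_cost=50_000)
```

With these made-up numbers the plan avoids about 80,000 in expected loss per year at a cost of 50,000, so the spending would be considered justified. Different estimates can easily reverse the result, which is why management has to agree on the inputs first.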
Broadly, the objective of the Business Continuity Planning (BCP) for a business
can only be to identify and reduce risk exposures and to proactively manage
the contingency.
A BCP contains a governance structure often in the form of a committee that
will ensure senior management commitments and define senior management
roles and responsibilities.
The BCP senior management committee is responsible for the oversight,
initiation, planning, approval, testing and audit of the BCP. It also implements
the BCP, coordinates activities, approves the BIA survey, oversees the creation
of continuity plans and reviews the results of quality assurance activities.


Senior managers or a BCP Committee would normally:

Approve the governance structure;
Clarify their roles, and those of participants in the program;
Oversee the creation of a list of appropriate committees, working groups and teams to develop and execute the plan;
Provide strategic direction and communicate essential messages;
Approve the results of the BIA;
Review the critical services and products that have been identified;
Approve the continuity plans and arrangements;
Monitor quality assurance activities; and
Resolve conflicting interests and priorities.

This BCP committee normally comprises the following members:

Executive sponsor has overall responsibility for the BCP committee; elicits senior management's support and direction; and ensures that adequate funding is available for the BCP program.
BCP Coordinator secures senior management's support; estimates funding requirements; develops BCP policy; coordinates and oversees the BIA process; ensures effective participant input; coordinates and oversees the development of plans and arrangements for business continuity; establishes working groups and teams and defines their responsibilities; coordinates appropriate training; and provides for regular review, testing and audit of the BCP.
Security Officer works with the coordinator to ensure that all aspects of the BCP meet the security requirements of the organization.
Chief Information Officer (CIO) cooperates closely with the BCP coordinator and IT specialists to plan for effective and harmonized continuity.
Business unit representatives provide input, and assist in performing and analyzing the results of the business impact analysis.

The BCP committee is commonly co-chaired by the executive sponsor and the
coordinator.

2.7 Industry Standards Supporting BCP and DRP

ISO 27001: Requirements for Information Security Management Systems. Section 14 addresses business continuity management.
ISO 27002: Code of Practice for Information Security Management, which includes guidance on business continuity management.
NIST 800-34: Contingency Planning Guide for Information Technology Systems, from the U.S. National Institute of Standards and Technology. It describes a seven-step process for BCP and DRP projects.
2.8 Benefits of BCP and DR

Improved business processes
Improved technology
Fewer disruptions
Higher quality services
Competitive advantages

Summary

Threats are uninvited hurdles and problems that result directly in monetary losses.
Business continuity is the aim of every organization, as organizations are in the market to make a profit.
Disasters are natural or man-made, and may or may not be avoidable in certain circumstances.
BCP is a process designed to reduce the organization's business risk arising from an unexpected disruption of the critical functions that are necessary for the survival of an organization.
Business continuity is not a project but a continuous process.
Disaster recovery is the process by which business resumes after a disruptive event.
Benefits of BCP and DR: improved business processes, improved technology, fewer disruptions, higher quality services, and competitive advantages.

Chapter 3
BC/DR Planning
Objective
3.1 Business Continuity and Disaster Recovery Plan Steps
3.2 Benefits of BCP and DRP Planning
3.3 Basics of Business Continuity Plan
3.4 BCP Process Steps
3.5 Development of Business Continuity Plan

3.1 Business Continuity and Disaster Recovery Plan Steps


The attack on the World Trade Center on 9/11 taught a hard lesson to the entire world and to every industry. Business Continuity (BC) and Disaster Recovery (DR) are now watchwords of businesses in the Information Technology (IT) world. The predominant role of Wide Area Networks (WANs) in almost all major fields of business has made it imperative for IT and network managers across the globe to strengthen their network infrastructure and to devise workable BC/DR plans.
The following are the reasons why management should have a concrete, tested plan for BC/DR:

Customers expect supplies and service to continue, or resume rapidly, in all situations.
Shareholders expect management control to remain operational through any crisis.
Employees expect both their lives and livelihoods to be protected.
Suppliers expect their revenue stream to continue.
Regulatory agencies expect their requirements to be met, regardless of circumstances.
Insurance companies expect due care to be exercised.10

The primary objective of a Disaster Recovery plan and Business Continuity plan is to describe how an organization has to deal with potential natural or human-induced disasters. The disaster recovery plan steps that
10

http://www.availability.com/resource/pdfs/dpro-100862.pdf


every enterprise incorporates as part of business management include the guidelines and procedures to be undertaken to effectively respond to and recover from disaster scenarios that adversely impact information systems and business operations. Plan steps that are well-constructed and implemented will enable organizations to minimize the effects of the disaster and resume mission-critical functions quickly.
Business Continuity or DRP steps involve an extensive analysis of an organization's business processes, IT infrastructure, data backup, resources, continuity requirements and disaster prevention methods. They also involve creating a comprehensive document encompassing details that will aid the business in recovering from catastrophic events. Developing a disaster recovery plan differs between enterprises based on business type, processes, the security levels needed, and the organization's size. There are various stages involved in developing an effective Disaster Recovery or Business Continuity plan.
3.2 Benefits of BCP and DRP Planning

Reduced risk
Process improvements
Improved organizational maturity
Improved availability and reliability
Marketplace advantage

3.3 Basics of Business Continuity Plan

A Business Continuity Plan does not depend on what technology a particular organization is using. Rather, it depends on the initiative of senior management, the members of the committee developing the plan, current legislation, and the options available at the time of recovery.
Let's see what will happen if there is no Business Continuity Plan:

Business Process Failure
Asset Loss
Regulatory Liability (as specified by the Governing Statute)
Customer Service Failure (Service Level Agreements)
Damage to reputation or brand (a fall in the share price on the stock exchange, showing negligence of senior management towards their own company or brand)

3.4 BCP Process Steps

The BCP process can be divided into the following steps:

Creation of a business continuity policy
Business Impact Analysis
Classification of processes and criticality analysis
Identification of processes that support critical organizational functions
Development of business continuity plan and IS disaster recovery procedures
Development of resumption procedures
Training and awareness program
Testing and implementation of plan
Monitoring
3.4.1 Types of Business Continuity Plans

Disaster Recovery Plan: to recover mission-critical technology and applications at the alternate site.
Business Resumption Plan: to continue mission functions at the production site through work-arounds until the application is restored.
Business Recovery Plan: to recover mission-critical business processes at the alternate site (may be called workspace recovery).
Contingency Plan: to manage an external event that has a far-reaching impact on the business.

3.4.2 How to create a BCDR Plan


[Figure: BCDR planning process]11

3.4.3 Business Continuity Policy


Creating the BCP policy is an important step. It begins with understanding the organization and identifying its mission-critical processes, technology, data and people. The BCP policy designer should know how the company works. The planner can create a process chart to understand the company. The process chart covers all processes of the organization, from operational processes like stationery supplies to strategic processes like a new product launch. The planner needs to look at the following things:

Data
Process
Network
People
Time required for process
Interdependencies of processes

BCP is often thought of mainly as backing up data and providing system redundancy, but this is only one small part of BCP. Disaster recovery also includes things like transporting people to the proper place, developing ways of carrying out automated tasks manually, documenting needed configurations, and altering business processes to maintain critical functions.
Business continuity is also part of the security policy and program. Every business organization exists to make a profit. This is the rational objective of every business
11

http://www.thelshgroup.com/Pages/ContinuityPlanningProcesses.aspx


organization. So plans are prepared to achieve this objective. The main reason to develop the plans is to reduce the risk of financial loss by improving the company's ability to recover and restore operations. This includes the goal of mitigating the effects of the disaster. Many companies feel that they do not have the time or resources to devote to a disaster recovery plan. BCP is ultimately the responsibility of top management. Disruptions in business need to be managed using wisdom and foresight.
The BCP policy can be designed by considering process management and incident management.
3.4.4 Incident Management
Business activity is dynamic, so incidents and crises are dynamic too. They need dynamic management along with proactive action and documentation. An incident is any unexpected event. It may or may not cause damage. All types of incidents should be categorized according to an estimate of the level of damage to the organization. A classification system could include the following categories: negligible, minor, major and crisis. Any such classification is provisional until the incident is resolved.
These levels can be described as follows:
Negligible incidents: Negligible incidents are those causing no perceptible or
significant damage, such as very brief OS crashes with full information
recovery or momentary power outages with UPS backup or non-catastrophic
failures.
Minor events: Minor events are those that, while not negligible, produce no
negative material or financial impact.
Major incidents: Major incidents cause a negative material impact on
business processes and may affect other systems, departments or even outside
clients.
Crisis: A crisis is a major incident that can have serious material impact on the continued functioning of the business and may also adversely impact other systems or third parties. How serious it is depends on the industry and circumstances, but severity is generally directly proportional to the time elapsed from the inception of the incident to its resolution.
Minor, major and crisis incidents should be documented, classified, and followed up on until corrected or resolved. This is a dynamic process, as a major incident may de-escalate momentarily and then later develop into a major crisis. Negligible incidents can be analyzed statistically to identify any systemic or avoidable causes.
In general, the main criterion for incident severity is service downtime. Service
can be defined as including commitments with clients that can be either
external customers or internal departments.
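The four-level classification can be expressed as a small decision function. The following Python sketch is one possible reading of the definitions above, not a standard scheme. The three yes/no inputs and the 240-minute escalation threshold are assumptions chosen purely for illustration.

```python
from enum import Enum


class Severity(Enum):
    NEGLIGIBLE = 1
    MINOR = 2
    MAJOR = 3
    CRISIS = 4


def classify_incident(perceptible_damage, material_impact, threatens_continuity):
    """Map a damage estimate onto the four categories described in this section."""
    if threatens_continuity:
        return Severity.CRISIS
    if material_impact:
        return Severity.MAJOR
    if perceptible_damage:
        return Severity.MINOR
    return Severity.NEGLIGIBLE


def needs_follow_up(severity):
    """Minor, major and crisis incidents are documented and tracked until resolved."""
    return severity is not Severity.NEGLIGIBLE


def reclassify(current, downtime_minutes, crisis_after_minutes=240):
    """Escalate a major incident to a crisis once service downtime passes a threshold.

    The 240-minute default is a made-up placeholder, not a recommendation.
    """
    if current is Severity.MAJOR and downtime_minutes >= crisis_after_minutes:
        return Severity.CRISIS
    return current
```

Because any classification is provisional until the incident is resolved, a function like reclassify would be called again as the downtime grows, so that an incident first logged as major can later be treated as a crisis.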
Creating and maintaining a BCP helps ensure that an institution has the
resources and information needed to deal with these emergencies.
3.4.5 Risk Assessment
The risk assessment step is critical and has significant bearing on whether
business continuity planning efforts will be successful. If the threat scenarios
developed are unreasonably limited, the resulting BCP may be inadequate.
During the risk assessment step, business processes and the business impact
analysis assumptions are stress tested with various threat scenarios. This will
result in a range of outcomes, some that require no action for business
processes to be successful and others that will require significant BCPs to be
developed and supported with resources (financial and personnel). The organization should develop realistic threat scenarios that may potentially disrupt its business processes and its ability to meet its clients' expectations (internal, business partners, or customers).



Threats can take many forms, including malicious activity as well as natural
and technical disasters. Where possible, institutions should analyze a threat by
focusing on its impact on the institution, not the nature of the threat. For
example, the effects of certain threat scenarios can be reduced to business
disruptions that affect only specific work areas, systems, facilities (i.e.,
buildings), or geographic areas.
Additionally, the magnitude of the business disruption should consider a wide
variety of threat scenarios based upon practical experiences and potential
circumstances and events.
If the threat scenarios are not comprehensive, BCPs may be too basic and omit
reasonable steps that could improve business processes' resiliency to
disruptions.
Threat scenarios need to consider the impact of a disruption and probability of
the threat occurring. Threats range from those with a high probability of
occurrence and low impact to the institution (e.g., brief power interruptions), to
those with a low probability of occurrence and high impact on the institution
(e.g., hurricane, terrorism). High probability threats are often supported by
very specific BCPs. However, the most difficult threats to address are those
that have a high impact on the institution but a low probability of occurrence.
Using a risk assessment, BCPs may be more flexible and adaptable to specific
types of disruptions that may not be initially considered.
Likelihood Level | Likelihood Definition
High | The threat's source is highly motivated and sufficiently capable, and controls that prevent the vulnerability from being exercised are ineffective.
Medium | The threat's source is motivated and capable, but controls are in place that may impede a successful exercise of the vulnerability.
Low | The threat's source lacks motivation or capability, and controls are in place to prevent or significantly impede the vulnerability from being exercised.



Risk Likelihood Levels (Adapted from NIST's Risk Management Guide for Information Technology Systems)13
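Likelihood levels are usually combined with an impact rating to arrive at an overall risk level. The Python sketch below shows one simple likelihood-times-impact scheme. The numeric weights and the thresholds are illustrative assumptions and are not taken from the text above; an organization would substitute its own scale.

```python
# Likelihood weights expressed in tenths (10 = 1.0, 5 = 0.5, 1 = 0.1) so that
# the arithmetic stays exact. All values here are illustrative assumptions.
LIKELIHOOD_TENTHS = {"High": 10, "Medium": 5, "Low": 1}
IMPACT = {"High": 100, "Medium": 50, "Low": 10}


def risk_score(likelihood, impact):
    """Score = likelihood weight x impact weight, on a scale of 1 to 100."""
    return LIKELIHOOD_TENTHS[likelihood] * IMPACT[impact] / 10


def risk_level(score):
    """Translate a score back into a High / Medium / Low rating."""
    if score > 50:
        return "High"
    if score > 10:
        return "Medium"
    return "Low"


# A low-probability, high-impact threat (for example a hurricane).
hurricane = risk_level(risk_score("Low", "High"))
# A high-probability, low-impact threat (for example a brief power interruption).
power_blip = risk_level(risk_score("High", "Low"))
```

Under these assumed weights both examples come out as "Low", which shows a known weakness of simple scoring: a low-probability, high-impact threat can look harmless on paper, so such threats need separate attention in the BCP.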

It is at this point in the business continuity planning process that the organization should perform a "gap analysis." In this context, a gap analysis is a methodical comparison of what types of plans the institution (or business line) needs to maintain, resume, or recover normal business operations in the event of a disruption, versus what the existing BCP provides. The difference between the two highlights additional risk exposure that management and the board need to address in BCP development.
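At its simplest, a gap analysis is a set difference between the plans that are needed and the plans that exist. The following Python sketch illustrates this. The scenario names are invented for the example, and a real analysis would also compare the depth and quality of each plan, not just its presence.

```python
def gap_analysis(required_plans, existing_plans):
    """Return the plans the organization needs but the current BCP does not provide."""
    return sorted(set(required_plans) - set(existing_plans))


# Hypothetical scenarios for illustration only.
required = {"data center outage", "loss of WAN connectivity", "key staff unavailable"}
existing = {"data center outage"}
gaps = gap_analysis(required, existing)
```

Here the result lists "key staff unavailable" and "loss of WAN connectivity" as the uncovered scenarios that management would need to address.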
3.4.6 Business Impact Analysis
The organization's first step in developing a BCP is to perform a BIA. The amount of time and resources necessary to complete the BIA will depend on the size and complexity of the organization. The organization should include all business functions and departments in this process, not just data processing.
The BIA phase identifies the potential impact of uncontrolled, non-specific
events on the organization's business processes. The BIA phase also should
determine what and how much is at risk by identifying critical business
functions and prioritizing them. It should estimate the maximum allowable
downtime for critical business processes, recovery point objectives and
backlogged transactions, and the costs associated with downtime.
Management should establish recovery priorities for business processes that
identify essential personnel, technologies, facilities, communications systems,
vital records, and data. The BIA also considers the impact of legal and
regulatory requirements such as the privacy and availability of customer data
and required notifications to the organization's regulator and customers when
facilities are relocated.
Personnel responsible for this phase should consider developing uniform
interview and inventory questions that can be used on an enterprise-wide
basis. Uniformity can improve the consistency of responses and help
personnel involved in the BIA phase compare and evaluate business process
requirements. This phase may initially prioritize business processes based on
their importance to the organization's achievement of strategic goals and
maintenance of safe and sound practices. However, this prioritization should
be revisited once the business processes are modeled against various threat
scenarios so that a BCP can be developed.

13

http://www.theiia.org/intAuditor/itaudit/archives/2007/may/understanding-the-risk-management-process/


When determining an organization's critical needs, reviews should be
conducted for all functions, processes, and personnel within each department.
Each department should document the mission-critical functions to be
performed. This is possible through process mapping for all departments.
The BIA helps organisations to:

Obtain an understanding of the organisation's most critical objectives,
the priority of each and the timeframes for resumption after an
unscheduled interruption.
Inform a management decision on the Maximum Tolerable Outage for each
function.
Provide the resource information from which an appropriate recovery
strategy can be determined or recommended.
Outline the dependencies that exist both internally and externally to achieve
critical objectives.14
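
The prioritization that a BIA produces can be illustrated with a small sketch. The example below is only an illustration under stated assumptions: the function names, outage limits and cost figures are invented, and the ordering rule (shortest Maximum Tolerable Outage first, higher hourly downtime cost breaking ties) is one reasonable choice, not a rule prescribed by this book.

```python
from dataclasses import dataclass


@dataclass
class BusinessFunction:
    name: str
    max_tolerable_outage_hours: float  # decided by management during the BIA
    downtime_cost_per_hour: float      # estimated cost of downtime


def prioritize(functions):
    """Order functions for recovery: shortest tolerable outage first,
    ties broken by the higher hourly downtime cost."""
    return sorted(
        functions,
        key=lambda f: (f.max_tolerable_outage_hours, -f.downtime_cost_per_hour),
    )


# Hypothetical BIA results, for illustration only.
functions = [
    BusinessFunction("Payroll", 72, 500),
    BusinessFunction("Online payments", 2, 20000),
    BusinessFunction("Customer help desk", 8, 3000),
]

recovery_order = [f.name for f in prioritize(functions)]
print(recovery_order)
```

In practice the inputs would come from the interview and inventory questions described above, and the resulting order would be revisited once the processes are modeled against threat scenarios.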

3.4.7 Classification of processes and criticality Analysis and Identification of processes that support critical organizational functions
Process mapping and process dependency mapping are done for each and
every process of the organization. The following template covers the
information required to decide process criticality and its impact on the
business in case of disruption.
The risk analysis within IT Service Continuity Management collects the
following data in order to assess the risks in the event of disasters:

Critical business processes
Name of the process
Purpose and objectives of the process
Classification of the processes into criticality categories (e.g. Marginal,
Normal, Critical, Highly Critical)

Critical business data
Name
Type of information and usage of the data
Classification of the data into criticality categories (e.g. Marginal,
Normal, Critical, Highly Critical)

14

http://www.analytix.co.za/Consulting/BusinessContinuityManagement.aspx


Critical IT Services
Name
Dependencies of the critical business processes and data upon the IT
Service (relationships between processes/data and IT Services)
Classification of the IT Service into criticality categories (e.g. Marginal,
Normal, Critical, Highly Critical)

Critical IT infrastructure components
Name
Dependencies of the critical IT Services upon the IT infrastructure
components (relationships between IT Services and IT infrastructure
components)
Classification of the IT infrastructure components into criticality
categories (e.g. Marginal, Normal, Critical, Highly Critical)

Threat analysis
For all critical infrastructure components:
Which threats/disaster scenarios are imaginable?
Which consequences does the occurrence of a scenario carry?
Which level of damage would be expected?
How likely is the occurrence? (e.g. Highly Improbable, Improbable,
Relatively Improbable, Rather Improbable, Highly Probable)

Analysis of vulnerabilities
For all critical infrastructure components:
Which vulnerabilities, impairing the critical infrastructure
components in the event of a disaster, are imaginable?
Which consequences would a failure carry?
Which level of damage would be expected?
How great is the probability of occurrence? (e.g. Highly Improbable,
Improbable, Relatively Improbable, Rather Improbable, Highly
Probable)

Prioritized list of the risks (risk = occurrence probability x level of damage)
Type of risk
Based on which threat or vulnerability
Risk classification, e.g. Negligible; Marginal risk, temporarily
tolerable; Increased, still temporarily tolerable risk; High risk, not
tolerable without precautionary measures; Extreme risk, to be ruled
out by all means
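
The formula risk = occurrence probability x level of damage can be worked through with a short sketch. The 1 to 5 scales, the threshold values in classify() and the sample risks are assumptions made for this illustration; the text above gives only the category labels, not numeric cut-offs.

```python
def risk_score(probability: int, damage: int) -> int:
    """risk = occurrence probability x level of damage (both on a 1-5 scale)."""
    if not (1 <= probability <= 5 and 1 <= damage <= 5):
        raise ValueError("probability and damage must be between 1 and 5")
    return probability * damage


def classify(score: int) -> str:
    # Hypothetical cut-offs chosen for this sketch only.
    if score <= 2:
        return "Negligible"
    if score <= 6:
        return "Marginal"
    if score <= 12:
        return "Increased"
    if score <= 19:
        return "High"
    return "Extreme"


# (risk, probability, damage): invented sample values.
risks = [
    ("Data centre fire", 1, 5),
    ("Disk failure on file server", 4, 3),
    ("WAN link outage", 3, 4),
    ("Accidental file deletion", 5, 1),
]

# Highest score first; equal scores keep their original order.
prioritized = sorted(
    ((name, risk_score(p, d)) for name, p, d in risks),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in prioritized:
    print(f"{score:>2}  {classify(score):<10}  {name}")
```

An organization would replace the thresholds with its own risk acceptance levels, since two risks with the same score may still differ in how tolerable they are.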
3.5 Development of Business Continuity Plan

15

An enterprise appoints a Disaster Recovery team within the organization,
which is actively involved in creating, implementing and
maintaining the plan.
As a priority, business organizations create DRP templates as a basis for
developing Disaster Recovery plans for the organization. The following steps are
taken in creating an efficient disaster recovery or business continuity plan:
Objective
The statement of objective includes project details, onsite/offsite data,
resources and business type.
3.5.1 Disaster Recovery Plan
15

http://www.google.co.in/imgres?hl=en&client=firefox-a&hs=tF6&sa=X&rls=org.mozilla:enUS:official&biw=1366&bih=622&tbm=isch&prmd=imvnsb&tbnid=WIiZqDud7tXdM:&imgrefurl=http://www.eci.com/solutions/bsn_resilency_protection/businesscontinuity.html&docid=FTnUr3-8aLnXkM&imgurl=http://www.eci.com/images/Eze-BCP-Life-Cycle-2010SMA.gif&w=275&h=275&ei=mNJSUKW6E8_trQeCqoDgCA&zoom=1&iact=hc&vpx=865&vpy=297&dur=642&hovh=
142&hovw=142&tx=132&ty=104&sig=109287564227310851567&page=2&tbnh=142&tbnw=142&start=21&ndsp=
26&ved=1t:429,r:11,s:21,i:176


Documentation should be created and maintained for the following:

Procedure to declare an emergency
Evacuation of the site, depending on the nature of the disaster
Active backup
Notification of the related officials/DR team/staff
Notification of procedures to be followed when disaster breaks out
Alternate location specifications

It is beneficial to be prepared in advance with sample DRPs and disaster
recovery examples so that every individual in the organization is better
educated on the basics. Most IT-based organizations keep a workable business
continuity planning template or scenario plans to train
employees in the procedures to be carried out in the event of a catastrophe.
3.5.2 Documentation of DR and BCP Teams Roles and Responsibilities
Documentation should include the identification and contact details of key
personnel in the disaster recovery team, and their roles and responsibilities in the
team. A call tree is also prepared for reporting incidents.
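
A call tree records who notifies whom when an incident is reported. The sketch below shows one possible way to hold such a tree and derive the order of calls. The role names are hypothetical, and the breadth-first order is an assumption of this example rather than a documented requirement.

```python
from collections import deque

# Hypothetical call tree: each person phones the people listed under them.
CALL_TREE = {
    "DR Coordinator": ["IT Lead", "Facilities Lead"],
    "IT Lead": ["Network Admin", "Database Admin"],
    "Facilities Lead": ["Security Desk"],
}


def notification_order(tree, root):
    """Walk the tree level by level and return (caller, callee) pairs
    in the order the calls would be made."""
    calls = []
    queue = deque([root])
    seen = {root}
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            if callee not in seen:  # never call the same person twice
                seen.add(callee)
                calls.append((caller, callee))
                queue.append(callee)
    return calls


calls = notification_order(CALL_TREE, "DR Coordinator")
for caller, callee in calls:
    print(f"{caller} calls {callee}")
```

A real call tree would also carry the contact details and alternates mentioned above, so that a call can be escalated when someone cannot be reached.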
3.5.3 Design and Documentation of Contingency Procedures
The routine to be established when operating in contingency mode should be
determined and documented. It should include inventory of systems and
equipment in the location; descriptions of process, equipment, software;
minimum requirements of processing; location of vital records with categories;
descriptions of data and communication networks, and customer/vendor
details. A resource plan should be developed for operating in emergency
mode. The essential procedures to restore normalcy and business continuity
must be listed out, including the steps for recovering lost data and
restoring the normal operating mode.
3.5.4 Documenting Testing and Maintenance procedures
The dates of testing, the disaster recovery scenarios, and the plans for each scenario
should be documented. Maintenance involves keeping a record of scheduled reviews on a
daily, weekly, monthly, quarterly and yearly basis; reviews of plans, teams,
activities and tasks accomplished; and a complete documentation review and update.
The disaster recovery plan thus developed should be tested for efficiency. To
aid in that function, a test strategy and corresponding test plan should be
developed and administered. The results obtained should be recorded and
analyzed, and the plan modified as required. Organizations realize the importance of
business continuity plans that keep their business operations continuing
without any hindrance. Disaster recovery planning is a crucial component of
today's network-based organizations and determines their productivity and business
continuity.
3.5.5 IT Network Disaster Recovery Plan
Information Technology (IT) has redefined the global business lifecycle.
Networking and Communications have accelerated business operations and
made them more flexible. The Wide Area Network (WAN) and related
technologies are the keys for efficient business operations in the competitive
market. Organizations are adopting technology and standards to keep their IT
infrastructure sound and to ensure business continuity. The continued
operation of an enterprise is determined by its ability to deal with potential
natural or man-made disasters by creating an effective IT Disaster
Recovery Plan (DRP) that minimizes disruptions to the networks
and quickly restores normal operations.
An IT Disaster Recovery Plan is a comprehensive documentation of well-planned
actions that are to be adopted before, during, and after a catastrophic event. In
order to ensure business continuity and availability of critical resources during
disasters, the plan should be documented and also tested in advance. This will
help expedite the process when the actual disaster or emergency strikes. The
key to IT or network disaster recovery is preparedness. The DR plan is the
master tool that IT-based as well as other organizations use to protect their IT
infrastructure, ensure organizational stability, and recover systematically
from disasters. The primary objectives of IT/network disaster recovery planning
include:

Minimizing disruption of business operations
Minimizing risk of delays
Ensuring a level of security
Assuring reliable backup systems
Aiding in restoration of operations with speed

Business vulnerabilities are ever increasing and every organization is
compelled to make appropriate disaster recovery plans and use advanced
technology to keep its network secure and stable. Network-reliant companies
find it an absolute necessity to frame disaster recovery policies and procedures
to respond to the varied circumstances and problems.
In any organization that prepares itself for Disaster Recovery, the three
main points to be considered are Prevention, Anticipation, and
Mitigation. Prevention is the act of avoiding those disasters that can be
prevented. Anticipation is to plan and develop adequate measures to counter
unavoidable disasters. Mitigation is to effectively manage the disasters, and
thereby minimize the negative impact.

IT Disaster Recovery planning involves a thorough analysis of the existing network
structure, applications, databases, equipment, organization setup, and related
details. The document should define the key components
involved in the business, the disaster recovery team personnel with contact
details, the recovery time objective, the communication methods at the time of the
disaster, the alternative facility for the organization, and a master list of all
inventory, storage locations, customers/vendors, forms and policies. The
following are the steps that should be taken in IT disaster recovery planning:
1. Constitute a Disaster Recovery Team: The organization should form a
DR team that will assist in the entire disaster recovery operations. The
team should be composed of core members from all departments with
a representative from the top management. The team will also be
responsible for overseeing the development and implementation of the
DR plan.
2. Perform Risk Assessment: A risk analysis and business impact analysis
should be conducted, covering the possible disasters,
both natural and man-made. By analyzing the impact and
aftermath of disaster scenarios, the security of crucial resources can be
determined.
3. Prioritize Processes and Operations: The organization's critical
requirements pertaining to each department must be determined with
respect to data, documentation, services, processes, operations, vital
resources, and policies/procedures. They should all be categorized and
ordered based on priority as Essential, Important, and Non-essential.
4. Data Collection: The complete data about the organization must be
gathered and documented. It should include inventory of forms, policies,
equipment, communications; important telephone numbers, contact
details, and customer details; equipment, systems, applications and
resources description; onsite and offsite location; details of backup
storage facility and retention schedules; and other material and
documentation.
5. Creating the Disaster Recovery Plan: The DR plan should be created in
a standard format that would enable detailing of procedures and
including essential information. All important procedures should be
completely outlined and explained in the plan. The plan should have
step-by-step details of what is to be done when the disaster strikes. It
should also comprise procedures for maintaining and updating of the
plan, with regular review by the Disaster Recovery team and top
personnel of the organization.
6. Testing the Plan: The developed Disaster Recovery Plan should be
tested for efficiency. Testing shows what changes are required
so that appropriate adjustments can be made to the plan. The plan can be
tested using different types of
tests such as Checklist tests, Simulation tests, Parallel tests, Full
interruption tests, etc.

Developing a good IT disaster recovery plan will enable organizations to
minimize potential economic loss and disruption to operations in the face of a
disaster. It will aid in an organized recovery, ensuring that the assets of
the organization are secure, and pave the way for business continuity in the most
resourceful manner.
Conclusion: Finally, to understand the development of a BC/DR plan, refer to the
following diagram:

16

16

http://www.thelshgroup.com/Pages/ContinuityPlanningProcesses.aspx


Summary

The primary objective of a Disaster Recovery plan and Business
Continuity plan is to describe how an organization has to deal
with potential natural or human-induced disasters.
The disaster recovery plan steps that every enterprise incorporates as
part of business management include the guidelines and procedures to
be undertaken to respond effectively to, and recover from, disaster
scenarios that adversely impact information systems and
business operations.
Benefits of BCP and DRP are: reduced risk, process
improvements, improved organizational maturity, improved availability
and reliability, and marketplace advantage.
Types of Business Continuity Plans: Disaster Recovery Plan, Business
Resumption Plan, Business Recovery Plan, and Contingency Plan.
An incident is any unexpected event. It may or may not cause damage.
Incidents can be classified as: Negligible incidents, Minor events, Major
incidents, Crisis.
The BIA phase identifies the potential impact of uncontrolled, non-specific
events on the organization's business processes.
In any organization that prepares itself for Disaster Recovery, the three
main points to be considered are Prevention, Anticipation, and
Mitigation.
The steps that should be taken in IT disaster recovery planning are:
Constitute a Disaster Recovery Team, Perform Risk Assessment,
Prioritize Processes and Operations, Data Collection, Creating the
Disaster Recovery Plan and Testing the Plan.


Chapter 4
BCP/DR Plan Development and Implementation
Objective
4.1 Purpose of BCP
4.2 BCP Methodology
4.3 BCP/DR Testing Techniques
4.4 BCP/DR Maintenance and Re-assessment of Plans
4.5 Features of good BCP
4.6 Data Recovery Strategies
4.7 Contents of Disaster Recovery Plan

Developing a Business Continuity Plan is a must in an organization's overall
efforts to prepare for and respond to known and unknown disasters, and to continue or
restore operations following a disaster. Because of an ever-changing
environment, evolving technologies, unforeseen circumstances, and other
variables, a plan will not always be 100 percent successful as originally
written. However, if it is comprehensive, well-written and based on a sound
planning process, a plan greatly increases the chance of successful response
and recovery.
4.1 Purpose of BCP
The plan provides a general overview of the BCP and becomes the operating
manual when disaster strikes by providing the guidance needed to continue or
restore operations. Information and directions detailed in the plan make it
possible for the appointed business continuity team to keep the business going.
The ultimate goal of every Business Continuity Plan is to ensure the business
can survive and continue in case any disruption occurs. Minimizing the
interruption of service and loss of data will save money and will minimize
the disruption to customers and suppliers. The organization needs to design
a framework for this activity. The activities are carried out in phases as part of
Project Management for BCP.
The BCP project activities are as follows:
4.1.1 Main phases

Pre-project activities
Perform a Business Impact Assessment (BIA)
Risk Assessment
Determining Choices and Business Continuity Strategy
Developing and Implementing BCP
Test resumption and recovery plans

4.1.2 Pre-project Activities

Obtain executive support
Formally define the scope of the project
Choose project team members
Develop a project plan
Get a project manager
Develop a project charter

The BCP project has a number of essential tasks which are common to all
projects. These tasks include assembling the project team and appointing a
Project Manager and Deputy Project Manager. Team formation is an important
task for the success of the project.
The role of the Project Manager is as under:

Developing an enterprise-wide BCP and prioritizing the business
objectives and critical operations that are essential for recovery.
Ensuring that business continuity planning includes the recovery, resumption, and
maintenance of all aspects of the business, not just recovery of the
technology components.
Considering the integration of the institution's role in financial markets.
Regularly updating business continuity plans based on changes in
business processes, audit recommendations, and lessons learned from
testing.
Following a cyclical, process-oriented approach that includes a business
impact analysis (BIA), a risk assessment, risk management, and monitoring
and testing.
Considering all factors and deciding upon declaring a crisis.

Nowadays almost all activities are carried out using Information Technology, and
business functions are spread across more than one department. It is necessary
that each department understands its role in the plan. It is also important that
each gives its support to maintain it. In case of a disaster, each has to be
prepared for a recovery process aimed at protecting critical functions.


A committee consisting of senior officials from departments such as HR, IT, Legal,
Business and Information Security needs to be instituted with the following
broad mandate:

Exercise, maintain and invoke the business continuity plan, as needed.
Communicate, train and promote awareness.
Ensure that the Business Continuity Plan (BCP) fits with other plans
and the requirements of the concerned authorities.
Address budgetary issues.
Ensure training and awareness on BCP for the concerned teams and
employees.
Coordinate the activities of other recovery, continuity, and response
teams and handle key decision-making.
Determine the activation of the BCP.
Handle legal matters arising from the disaster, as well as
public relations and media inquiries.

The teams are formed with assigned responsibilities in the event of an incident.
The following teams are created depending upon the size of the organization.

Incident Response Team
Emergency Action Team
Information Security Team
Damage Assessment Team
Emergency Operation Team
Network Recovery Team
Communication Team
Transportation Team
Data Protection Team
Administrative support Team

Teams are formed as required to cover the various aspects of BCP adequately at the
central office, at individual controlling offices, and at branch level.
Which of the teams listed above are created depends on need. The
BCP project team should be carefully selected, and the selected members should be
formally notified of their selection.
4.1.3 Critical Components of Business Continuity Management Framework

The BCP requirements enunciated in this document should be considered. The
Board and Senior Management are responsible for generating the detailed
components of the BCP in the light of an individual bank's activities, systems and
processes.
4.2 BCP Methodology
The organization should consider looking at BCP methodologies and
standards such as BS 25999 by BSI, which follows the Plan-Do-Check-Act principle.
BCP methodology should include:
Phase 1: Business Impact Analysis

Business Impact Analysis (BIA) is the backbone of the continuity
planning process. It establishes the goals to be achieved to enable an
organization to continue or resume operations following a disaster.
It is a tool that assists in identifying, understanding, and prioritizing the
critical business functions of each business unit and the related time
frame in which each must be restored to avoid an organizational crisis.
Assigning a Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
to each business function.
Defining alternate procedures for the time systems are not available and
estimating resource requirements.
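
Once an RTO and RPO have been assigned to a business function, they can be compared with what the current arrangements actually deliver. The sketch below assumes the usual meanings of the two terms (RTO: the longest acceptable time to restore the function; RPO: the longest acceptable period of data loss). The function name, backup interval and measured recovery time are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    function: str
    rto_hours: float  # Recovery Time Objective
    rpo_hours: float  # Recovery Point Objective


def gaps(objective, backup_interval_hours, measured_recovery_hours):
    """Return a list of shortfalls against the agreed objectives."""
    problems = []
    # Worst case, all data written since the last backup is lost.
    if backup_interval_hours > objective.rpo_hours:
        problems.append(
            f"RPO missed: backups every {backup_interval_hours} h, "
            f"objective {objective.rpo_hours} h"
        )
    if measured_recovery_hours > objective.rto_hours:
        problems.append(
            f"RTO missed: recovery took {measured_recovery_hours} h, "
            f"objective {objective.rto_hours} h"
        )
    return problems


# Hypothetical objective and measurements.
core_banking = RecoveryObjective("Core banking", rto_hours=4, rpo_hours=1)
found = gaps(core_banking, backup_interval_hours=24, measured_recovery_hours=3)
print(found)
```

Here a nightly backup cannot meet a one-hour RPO, which is the kind of gap the risk assessment in Phase 2 is meant to close through an appropriate strategy or architecture.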

Phase 2: Risk Assessment

Structured risk assessment based on a comprehensive business impact
analysis. This assessment considers all business processes and is not
limited to the information processing facilities.
Risk management by implementing an appropriate strategy/architecture to
attain the organization's agreed RTOs and RPOs.
Impact on restoring critical business functions, including customer-facing
systems and payment and settlement systems.
Dependency and risk involved in the use of external resources and support.

Phase 3: Determining Choices and Business Continuity Strategy

BCP should evolve beyond the information technology realm and must
also cover people, processes and infrastructure.
The methodology should provide for the safety and well-being of people in
the organization or at an outside location at the time of the disaster.
Define response actions based on identified classes of disaster.
To arrive at the selected process resumption plan, one must consider the
risk acceptance for the organization, industry and applicable regulations.

Phase 4: Developing and Implementing BCP

Action plans, i.e. defined response actions specific to the organization's
processes, practical manuals (dos and don'ts, specific paragraphs
customized to individual business units) and testing procedures
Establishing management succession and emergency powers
Compatibility and co-ordination of contingency plans
The recovery procedure should not compromise on the control
environment at the recovery location
Having specific contingency plans for each outsourcing arrangement
based on the degree of materiality of the outsourced activity to the
organization's business
Periodic updating to absorb changes in the institution or its service
providers. Examples of situations that might necessitate updating the
plans include acquisition of new equipment, upgrading of the
operational systems and changes in:

Personnel
Addresses or telephone numbers
Business strategy
Location, facilities and resources
Legislation
Processes (new or withdrawn)
Risk
Contractors, key customers

4.2.1 Key Factors to be considered for BCP Design
The following factors should be considered while designing the BCP:

Probability of unplanned events, including natural or man-made
disasters such as earthquakes, fire, hurricanes or bio-chemical disasters
Security threats
Increasing infrastructure and application interdependencies
Regulatory and compliance requirements, which are growing increasingly
complex
Failure of key third party arrangements
Globalization and the challenges of operating in multiple countries.

BCP framework should consider the following:

Conditions for activating plans, which describe a process to be followed
(how to assess the situation, who is to be involved, etc.) before each plan
is activated
Emergency procedures, which describe the actions to be taken following
an incident which jeopardizes business operations and/ or human life.
This should include arrangements for public relations management and
for effective liaison with appropriate public authorities e.g. police, fire
service, health-care services and local government
Identification of the processing resources and locations, available to
replace those supporting critical activities; fall back procedures which
describe the actions to be taken to move essential business activities or
support services to alternative temporary locations and to bring business
processes back into operation in the required time-scales
Identification of information to be backed up and the location for storage,
as well as the requirement for the information to be saved for back-up
purpose on a stated schedule and compliance therewith
Resumption procedures, which describe the actions to be taken to return
to normal business operations
A maintenance schedule which specifies how and when the plan will be
tested and the process for maintaining the plan
Awareness and education activities, which are designed to create
understanding of critical banking operations and functions, business
continuity processes and ensure that the processes continue to be
effective
The responsibilities of the individuals, describing who is responsible for
executing which component of the plan. Alternatives should be
nominated as required.

4.2.2 BCP/DRP Training
In order to provide the greatest benefit to users, gain the greatest return on
investment in a new system, and to be able to operate it effectively without
consulting support, it is critical to provide thorough and effective training.
Project Team members must become experts in the operation of the software,
and end users must become self-sufficient in its use. Executives should have
enough knowledge of the system to understand its capabilities and its
requirements for operations and on-going maintenance.17
4.2.3 BCP/DR Documentation
Documentation is critical to support end-users, to manage change to the
system throughout its lifetime, and to ensure consistent and appropriate use of
17

http://www.ohio.edu/sisrfp/OHIOSISProjectCharter.pdf


the system. Current and accurate documentation facilitates training, and
reduces the cost of system corrections and modifications. The documentation
effort will be an integral part of the project, and must be conducted throughout
the course of the project, not just at the end. The development of
documentation for the system is the responsibility of project team members,
with the help of a Documentation Coordinator. This is the only way to ensure
that the documentation will meet business continuity needs. The
implementation people can also provide guidance and support, and can provide
samples and templates of end-user documentation and customization
specifications.
4.2.4 Testing a BCP
The organization must regularly test its BCP to ensure that it is up to date and
effective. Testing of the BCP should include all aspects, i.e. people, processes and
resources including technology. A BCP may fail when tested, fully or partially,
because of incorrect assumptions, oversights or changes in equipment or
personnel. BCP tests should ensure that all members of the recovery team and
other relevant staff are aware of the plans. The test schedule for BCPs should
indicate how and when each component of a plan is to be tested. It is
recommended to test the individual components of the plans frequently,
typically at a minimum of once a year. A variety of techniques should be used
in order to provide assurance that the plan will operate in real life.
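
A test schedule of this kind can be checked mechanically. The following sketch flags plan components whose last test is more than a year old; the component names and dates are invented, and the 365-day limit simply encodes the once-a-year minimum mentioned above.

```python
from datetime import date, timedelta

MAX_INTERVAL = timedelta(days=365)  # "at a minimum of once a year"


def overdue_components(last_tested, today):
    """Return the components not tested within MAX_INTERVAL,
    with the longest-untested component first."""
    late = [
        (tested_on, name)
        for name, tested_on in last_tested.items()
        if today - tested_on > MAX_INTERVAL
    ]
    return [name for tested_on, name in sorted(late)]


# Hypothetical record of the last test date of each plan component.
last_tested = {
    "Call tree": date(2023, 11, 1),
    "Data restore": date(2022, 6, 15),
    "DR site failover": date(2023, 1, 10),
}

overdue = overdue_components(last_tested, today=date(2024, 3, 1))
print(overdue)
```

The same record can also hold how each component was tested, so that the schedule shows both the "when" and the "how" for every part of the plan.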
The organization should involve its internal auditors (including IS auditors) to
audit the effectiveness of the BCP and its periodic testing as part of their internal
audit work, and their findings and recommendations in this regard should be
incorporated in their report to the Board of Directors.
The organization should plan a BCP drill together with its critical third
parties, so that they provide the services and support needed to continue the
pre-identified minimal required processes.
The organization should also periodically move its operations, i.e. people,
processes and resources (IT and non-IT), to the planned fail-over or DR site in
order to test the effectiveness of the BCP and to gauge the recovery time needed to
bring operations back to normal functioning.
The organization should consider holding unplanned BCP drills, in which only a
restricted set of identified personnel are aware of the
drill, and not the floor or business personnel.


4.3 BCP/DR Testing Techniques
Business continuity planning aims at the continuity of critical business processes
after disasters. Testing the plan is good practice because it verifies the
correctness and viability of the plan. Testing of a BCP can be performed on the
fully integrated plan or at the component and module level. Different methods
are used to carry out BCP testing. The choice of method is based on the
testing plan and the type of testing required of the BCP.
Business continuity management requires a considerable amount of work on the
BCP because nearly every aspect of technology, information and
people in the organization needs to be reviewed, planned and developed. The
preliminary plan can contain flaws, and there is uncertainty about
it. Many questions need to be answered before it is recognized
as a final plan. For example: Is it viable in emergency circumstances? Does
it really reflect the business activities and operations? Is it free of
errors? Is it also effective in a dynamic business environment?
To answer the above questions and to gain confidence in the plan, it
should be tested and updated regularly. Testing the plan is as
important as preparing it, and requires comprehensive knowledge and
resources. Different methods have been developed and used to test the BCP.
This section covers and illustrates various BCP testing methods and
compares them by ease of
implementation and common quality factors.
Below are a few illustrative techniques that can be used for BCP testing
purposes:
1. Paper Tests: A paper test is a review of BCP procedures and other response
documentation, such as contact lists. In a paper test, individual staff
members review these documents on their own. Each member of the testing team
evaluates the BCP procedures and actions in the plan and notes anything that
is missing or needs to be changed. The results from all members are then
combined and handed to the team leader or the project manager.
2. Walkthrough Tests: In a walkthrough test, a group of people reviews the
BCP procedures and actions in the plan. The walkthrough is more formal than
the paper test, but the procedure is almost the same; the difference is that
testing is performed by a group rather than by individuals. Roles and
responsibilities are assigned within the group to perform the testing, and the
end result is produced and provided to the project manager.


3. Simulation Testing: A simulation is essentially an on-location walkthrough
test. It goes further than a walkthrough but shares many of its
characteristics. The difference is that a simulation test is performed by a
specialist recovery team, and the result is then handed to the procedure owner
(manager). The participants choose a specific scenario and simulate an
on-location BCP situation. The test involves all resources, including people,
IT and others, that are required to enable business continuity for the chosen
scenario. The focus is on demonstrating capability, including knowledge, team
interaction and decision-making. It can also include role playing with
simulated responses at alternate locations or facilities to act out critical
steps, recognize difficulties, and resolve problems.
4. Parallel Testing: In parallel testing, disaster response personnel actually
perform the steps in their response procedures. When the procedures say build
a server, the personnel build a server. When the procedures say start the
applications, the personnel start the applications. Parallel testing differs
considerably from the other procedures. First the company needs to create a
new network, systems and database; it then hires additional employees and
tests the image of each transaction side by side with production.
5. Cutover Testing: In a cutover test, the recovery team builds and readies
recovery systems that can support critical business functions. A cutover test
is the real thing: if the recovery systems don't work, the business processes
they support really will be interrupted, which could itself be a real
disaster. The procedure starts by shutting down the production systems and
moving operations to the recovery systems. Team members are then notified to
get ready for the full production workload, and they need to verify whether
the recovery systems are actually performing all the required functions. The
prepared test script or plan is then executed. After the script has been run
on the recovery systems, the production load reverts to the primary systems
and normal duties resume. At that point the primary systems should have full
knowledge of the transactions performed on the recovery systems, and the team
should check that all functions were performed correctly. The test team and
test organizers need to document all issues and results from this test.
BCP testing can be performed at various levels. The deeper the testing goes,
the more assurance is gained, so the plan needs to be tested at each level. A
full test cannot be said to prove the viability of each component unless each
component has also been tested individually. The best approach is to start at
the component level and work up to a test of the full plan. It is also
important to remember that assurance in the plan can be achieved only when
each component is tested by exercising the provided solution in a hypothetical
disaster scenario.


6. Component Testing: This validates the functioning of an individual part or
sub-process of a process in the event of BCP invocation. It concentrates on
in-depth testing of that part or sub-process to identify and prepare for any
risk that may hamper its smooth running.
Each organization must define the frequency, schedule and clusters of business
areas selected for testing after a thorough Risk and Business Impact Analysis
has been done.
4.4 BCP/DR Maintenance and Re-assessment of Plans
a. BCPs should be maintained through annual reviews and updates to ensure
their continued effectiveness. Procedures should be included within the
organization's change management program to ensure that business
continuity matters are appropriately addressed. Responsibility should be
assigned for regular reviews of each business continuity plan. When
changes in business arrangements or processes are identified that are not
yet reflected in the business continuity plans, the plans should be
updated on a periodic basis, say quarterly. This requires a process for
conveying any changes to the institution's business, structure, systems,
software, hardware, personnel, or facilities to the BCP coordinator or
team. If significant changes have occurred in the business environment,
or if audit findings warrant changes to the BCP or test program, the
business continuity policy guidelines and program requirements should be
updated accordingly.
b. Changes should follow the organization's formal change management
process for its policy or procedure documents. This formal change control
process should ensure that the updated plans are distributed and
reinforced by regular reviews of the complete plan.
4.5 Features of a Good BCP
a. An effective BCP should take into account the potential for wide-area
disasters, which impact an entire region, and the resulting loss or
inaccessibility of staff. It should also consider and address
interdependencies, both market-based and geographic, among financial
system participants as well as infrastructure service providers.
b. Organizations should consider the need to put in place the necessary
backup sites for their critical systems that interact with the systems at
the data centers.
c. Some organizations may also consider running some critical processes
and business operations from both the primary and the secondary sites,
wherein each would provide backup to the other.
e. All critical processes should be documented to reduce dependency on
personnel, for scenarios where staff are not able to reach the
designated office premises.
f. Backup/standby personnel should be identified for all critical roles.
g. The relevant portion of the adopted BCP should also be disseminated to
all concerned, including customers, so that this awareness enables them
to react positively and in consonance with the BCP. This would help
maintain customers' faith in the organization.
h. The organization should consider formulating a clear communication
strategy, with the help of media management personnel, to control the
content and form of news reaching its customers in times of panic.
i. The organization should consider having a detailed BCP for natural
calamity or disaster situations.
4.5.1 Infrastructure Aspects of BCP

• The organization should consider paying special attention to the
  availability of basic amenities such as electricity, water and a first-aid
  box in all offices (e.g. evaluate the need for electricity backup not just
  for its systems but also for its people and for running infrastructure
  such as central air-conditioning).
• The organization should consider assigning ownership for each area.
  Emergency procedures, manual fallback plans and resumption plans should be
  the responsibility of the owners of the business resources or processes
  involved.
• In-house telecommunications systems and wireless transmitters on
  buildings should have backup power. Redundant systems, such as analogue
  line phones and satellite phones (where appropriate), and other simple
  measures, such as ensuring the availability of extra batteries for mobile
  phones, may prove essential to maintaining communications in a wide-scale
  infrastructure failure.
• Possible fallback arrangements should be considered, and alternative
  services should be arranged in co-ordination with service providers,
  contractors and suppliers under a written agreement or contract that sets
  out the roles and responsibilities of each party for meeting emergencies.
  The organization may also impose penalties, including legal action, on
  service providers, contractors or suppliers in the event of non-compliance
  or non-co-operation.
• When new requirements are identified, established emergency procedures
  (e.g. evacuation plans) or any existing fallback arrangements should be
  amended as appropriate.
• The plans may also be suitably aligned with those of the local government
  authorities.
• The organization should consider not storing critical papers, files or
  servers on ground floors where there is a possibility of floods or water
  logging. However, it should also consider avoiding the top floors of
  taller buildings to reduce the impact of a probable fire.
• Fire-proof and water-proof storage areas must be considered for critical
  documents.
• The organization should consider having alternative power sources (such
  as additional diesel stock or emergency battery backup) for extended
  power cuts.


4.5.2 Technology Aspects of BCP


There are many applications and services used in every organization that are
highly mission-critical in nature, and therefore high availability and fault
tolerance must be considered while designing and implementing the solution.
This aspect should be taken into account especially when designing the data
center solution and the corporate network solution.
4.6 Data Recovery Strategies
4.6.1 RPO and RTO
The Recovery Point Objective (RPO) is the maximum acceptable level of data
loss following an unplanned event, like a disaster (natural or man-made), act
of crime or terrorism, or any other business or technical disruption that could
cause such data loss. The RPO represents the point in time, prior to such an
event or incident, to which lost data can be recovered (given the most recent
backup copy of the data).
The Recovery Time Objective (RTO) is a period of time within which business
and technology capabilities must be restored following an unplanned event or
disaster. The RTO is a function of the extent to which the interruption disrupts
normal operations and the amount of revenue lost per unit of time as a result
of the disaster. These factors in turn depend on the affected equipment and
application(s).
Both of these numbers represent key targets that are set by businesses during
continuity and disaster recovery planning; these targets in turn drive the
technology and implementation choices for business resumption services,
backup/ recovery/ archival services, and recovery facilities and procedures.
Organizations need to define RPO and RTO parameters clearly and develop
corresponding solutions. By doing so, businesses can avoid substantial loss of
productivity and the negative brand impact of unplanned outages.
The recovery strategy must ensure that the maximum tolerable data loss (the
RPO) for each activity is not exceeded. The metrics specified for the business
processes must then be mapped to the underlying IT systems and infrastructure
that support those processes. Once the RTO and RPO metrics have been mapped to
the IT infrastructure, the DR planner can determine the most suitable recovery
strategy for each system. The strategy is ultimately constrained by the IT
budget. Therefore, the RTO and RPO metrics need to fit both the available
budget and the criticality of the business process or function.
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are among
the most important parameters of a disaster recovery or data protection plan.
These objectives guide enterprises in choosing an optimal data backup or
restore plan.
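
As a minimal sketch of how these two targets can be checked, the Python below
compares a backup interval against an RPO and an estimated restore time against
an RTO. The function name, its parameters and the sample figures are
assumptions made for illustration; they do not come from any standard.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for one system.

    Worst-case data loss is assumed to equal the interval between
    backups, because a failure can occur just before the next backup.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = restore_time <= rto
    return rpo_ok, rto_ok


# Hypothetical system: nightly backups and a six-hour restore,
# checked against a 4-hour RPO and an 8-hour RTO.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_time=timedelta(hours=6),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)  # (False, True)
```

In this made-up case the restore is fast enough, but nightly backups cannot
satisfy a four-hour RPO, so the backup frequency would need to change.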

4.6.2 Data protection


In larger businesses, companies will add a storage area network (SAN), which
is a consolidated place for all storage. SANs are expensive, and even then,
you're out of luck if your data center goes down. So the largest enterprises
will build an entirely new data center somewhere else, with another set of
identical mail servers, another SAN and more people to staff them.
But if disaster strikes both of your data centers (for example, a fire
outbreak), your data is completely lost. So big companies will often build the
second data center far away, in a different threat zone, which creates even
more management headaches. Next they need to ensure that the primary SAN talks
to the backup SAN, so they have to provision enough bandwidth to handle
terabytes of data flying back and forth without crippling their network. There
are other backup options as well, but the story's the same: as redundancy
increases, cost and complexity multiply.
How do you know if your disaster recovery solution is as strong as you need it
to be? It's usually measured in two ways: RPO (Recovery Point Objective) and
RTO (Recovery Time Objective). RPO is how much data you're willing to lose
when things go wrong, and RTO is how long you're willing to go without service
after a disaster.
For a large enterprise running SANs, the RTO and RPO targets are typically an
hour or less: the more you pay, the lower the numbers. That can mean a large
company that spends heavily is still willing to lose all the email sent to it
for up to an hour after the system goes down, and to go without access to
email for an hour as well. Enterprises without SANs may be literally trucking
tapes back and forth between data centers, so as you can imagine their RPOs
and RTOs can stretch into days. As for small businesses, often they just have
to start over.
Prior to selecting a disaster recovery (DR) strategy, a DR planner should
refer to the organization's BCP, which should indicate the key metrics of
recovery point objective and recovery time objective for business processes.
4.6.3 A List of Common Strategies for Data Protection

• Backups made to tape and sent off-site at regular intervals (preferably
  daily)
• Backups made to disk on-site and automatically copied to off-site disk,
  or made directly to off-site disk
• Replication of data to an off-site location, which overcomes the need to
  restore the data (only the systems then need to be restored or synced).
  This generally makes use of storage area network (SAN) technology
• High availability systems that keep both data and systems replicated
  off-site, enabling continuous access to systems and data
In many cases, an organization may elect to use an outsourced disaster
recovery provider to provide a stand-by site and systems rather than
using its own remote facilities.
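
The choice among the strategies above can be sketched as picking the cheapest
option that still meets the RPO and RTO targets. Everything in the Python below
is an assumption for illustration: the `Strategy` class, the hour figures and
the cost ranking are placeholders, not measured or published values.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    rpo_hours: float    # worst-case data loss this option can achieve
    rto_hours: float    # worst-case time to restore service
    relative_cost: int  # 1 = cheapest; placeholder ranking


# Placeholder figures only; real values must come from the
# organization's own measurements and vendor quotes.
STRATEGIES = [
    Strategy("tape backup sent off-site", 24, 72, 1),
    Strategy("disk backup copied off-site", 12, 24, 2),
    Strategy("off-site replication", 1, 8, 3),
    Strategy("high availability system", 0, 0.1, 4),
]


def cheapest_strategy(rpo_hours, rto_hours):
    """Return the cheapest strategy meeting both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rpo_hours <= rpo_hours and s.rto_hours <= rto_hours
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


choice = cheapest_strategy(rpo_hours=4, rto_hours=12)
print(choice.name if choice else "no strategy meets the targets")
```

With these placeholder numbers, a 4-hour RPO and 12-hour RTO rule out both
backup options and select off-site replication, which reflects the point made
earlier that tighter targets push the organization toward costlier solutions.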

4.6.4 Precautionary Measures for data protection


In addition to preparing for the need to recover systems, organizations must
also implement precautionary measures with the objective of preventing a
disaster in the first place. These may include some of the following:

• Local mirrors of systems or data, and use of disk protection technology
  such as RAID
• Surge protectors, to minimize the effect of power surges on delicate
  electronic equipment
• Uninterruptible power supply (UPS) or backup generator to keep systems
  going in the event of a power failure
• Fire prevention measures: alarms, fire extinguishers
• Anti-virus software and security measures

4.7 Contents of Disaster Recovery Plan


A disaster recovery plan is a part of the BCP. It dictates every facet of the
recovery process, including:

• What events denote possible disasters;
• Which people in the organization have the authority to declare a disaster
  and thereby put the plan into effect;
• The sequence of events necessary to prepare the backup site once a
  disaster has been declared;
• The roles and responsibilities of all key personnel with respect to
  carrying out the plan;
• An inventory of the hardware and software required to restore
  production;
• A schedule listing the personnel who will staff the backup site,
  including a rotation schedule to support ongoing operations without
  burning out the disaster team members.

A disaster recovery plan must be a living document; as the data center
changes, the plan must be updated to reflect those changes.
It should be noted that the technology issues are a derivative of the Business
Continuity Plan and Management.
For example, BCP and Management lead to the Business Impact Analysis, which
in turn leads to the Performance Impact Analysis (PIA). The PIA depends on the
technology performance of the total IT solution architecture.
To amplify, business impact analysis identifies the critical operations and
services, the key internal and external dependencies, and the appropriate
resilience levels. It also analyses the risks and quantifies their impact from
the point of view of business disruption.
The technology solution architecture must address the following specific BCP
and Management requirements:

• Performance
• Availability
• Security and Access Control
• Conformance to standards to ensure interoperability

Performance of the technology solution architecture for operations needs to be
quantified, and it should be possible to measure the quantified parameters as
and when required. (For example, if a complex transaction initiated at the
branch has to be completed within four seconds under peak load, there should
be adequate measuring environments to ensure that performance has not degraded
due to increasing loads.) [18]
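
A minimal sketch of such a measurement, in Python, is shown here. It checks a
set of recorded transaction latencies against the four-second target from the
example. The use of the 95th percentile, the function names and the sample
values are assumptions; an organization may choose a different statistic.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def within_target(latencies_s, target_s=4.0, pct=95):
    """True if the chosen percentile of latency is within the target."""
    return percentile(latencies_s, pct) <= target_s


# Hypothetical peak-load samples, in seconds.
samples = [1.2, 2.8, 3.1, 3.9, 2.2, 3.5, 4.6, 2.9, 3.3, 3.0]
print(percentile(samples, 95), within_target(samples))
```

Running the same check at regular intervals, and after every increase in
load, gives an early warning of performance degradation.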
The solution architecture has to be designed for high availability, with no
single point of failure. It is inevitable that a complex solution
architecture, built from point products procured from different sources and
implemented at different points in time, will have some outage once in a
while. The important issue is that, with clearly defined SLAs and mean time to
restore, it should be possible to identify the fault and correct it without
any degradation in performance. [19]
Accordingly, with respect to performance and availability, the architectures
have to be designed and configured to provide high levels of uptime round the
clock to ensure uninterrupted functioning.
These issues have to be borne in mind while finalizing the data center
architecture and the corporate network architecture, both of which are
expected to have redundancy built into the solution with no single point of
failure.
With reference to the network architecture, it is recommended that the
organization build in redundancies as follows:

• Link level redundancy
• Path level redundancy
• Route level redundancy
• Equipment level redundancy
• Service provider level redundancy
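
The benefit of redundancy can be illustrated with the standard formula for
components in parallel, where service survives as long as any one component is
up. The Python sketch below assumes that failures are independent, which may
not hold when two links share a duct or a provider; the 99% figure is a
made-up example.

```python
def parallel_availability(availabilities):
    """Availability of redundant components where any one suffices.

    Assumes failures are independent of one another.
    """
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability


single = 0.99  # hypothetical availability of one link
dual = parallel_availability([single, single])
print(f"single link: {single:.4f}, two links: {dual:.4f}")
# single link: 0.9900, two links: 0.9999
```

The independence assumption is why service provider level redundancy appears
in the list: two links from the same provider can fail together.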

[18] http://www.rbi.org.in/scripts/PublicationReportDetails.aspx?UrlPage=&ID=622
[19] http://www.rbi.org.in/scripts/PublicationReportDetails.aspx?UrlPage=&ID=622


Summary

• The purpose of a BCP is to serve as an operating manual when disaster
  strikes, by providing the guidance needed to continue or restore
  operations.
• The organization should consider looking at BCP methodologies and
  standards, namely BS 25999 by BSI, which follows the Plan-Do-Check-Act
  principle.
• The illustrative techniques that can be used for BCP testing are Paper
  Tests, Walkthrough Tests, Simulation Testing, Parallel Testing, Cutover
  Testing and Component Testing.
• BCPs should be maintained through annual reviews and updates to ensure
  their continued effectiveness.
• An effective BCP should take into account the potential for wide-area
  disasters, which impact an entire region, and the resulting loss or
  inaccessibility of staff. It should also consider and address
  interdependencies, both market-based and geographic, among financial
  system participants as well as infrastructure service providers.
• The organization should consider paying special attention to the
  availability of basic amenities such as electricity, water and a
  first-aid box in all offices.
• Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are
  among the most important parameters of a disaster recovery or data
  protection plan. These objectives guide enterprises in choosing an
  optimal data backup or restore plan.
• RPO is how much data you're willing to lose when things go wrong, and
  RTO is how long you're willing to go without service after a disaster.
• The network architecture of an organization should have built-in
  redundancies: link level, path level, route level, equipment level and
  service provider level redundancy.

****************************************************************************************


Chapter - 5
BCP/DR and Recovery Technology
Objective
5.1 Fault Tolerance and Disaster Recovery
5.2 Assessing Fault Tolerance and Disaster Recovery Needs
5.3 Clustering Technologies
5.4 Power Management
5.5 Issues in implementing a DC/DR solution

Introduction
Technology never offers a perfect solution, and computers are no exception.
They can have problems that affect their users' productivity. These problems
range from small errors to total system failure, known as catastrophic
failure. Errors and failures can result from environmental problems, hardware
and software failure, hacking (malicious, unauthorized use of a computer or a
network), and natural disasters. Every organization needs to take measures to
assess and minimize the impact of computer and network problems. These
measures fall into two major categories: fault tolerance and disaster
recovery.
5.1 Fault Tolerance and Disaster Recovery
Fault tolerance is the capability of a computer or network system to respond
to a fault condition automatically, usually resolving it and thus reducing the
impact on the system. If fault-tolerant measures have been implemented, it is
unlikely that a user would know that a problem existed.
Disaster recovery is the ability to get a system functional again in the least
amount of time after a total system failure or site outage. If enough fault
tolerance methods are in place, disaster recovery should rarely need to be
invoked, although some inherent risk always remains.
The following need to be addressed in order to have fault tolerance that
supports business continuity:
1. Assessing fault tolerance and disaster recovery needs
2. Power management
3. Disk system fault tolerance methods
4. Backup considerations
5. Virus protection

This chapter discusses the first two points. The last three are discussed in
the next chapter.
5.2 Assessing Fault Tolerance and Disaster Recovery Needs
First we need to identify the critical processes of the organization and
determine how long each system can afford to be nonfunctional (down). These
determinations dictate which fault tolerance and disaster recovery methods can
be implemented, and to what extent. The more vital the system, the greater the
lengths (and thus the greater the expense) you should go to in order to
protect it from downtime. For example, banking organizations, insurance
companies, government agencies, and airlines all run highly critical computer
and network systems. Thus, they all have complex and expensive fault tolerance
and disaster recovery systems in place.
Fault tolerance and disaster recovery are implemented by developing sites
described as hot, warm, or cold.

Hot Sites

A hot site is a commercial disaster recovery service that allows a business to
continue computer and network operations in the event of a computer or
equipment disaster. For example, if an enterprise's data center becomes
inoperable, that enterprise can move all data processing operations to a hot
site. A hot site has all the equipment needed for the enterprise to continue
operation, including office space and furniture, telephone jacks and computer
equipment.
A hot site is a duplicate of the organization's original site, and this type
of backup site is the most expensive to operate. Hot sites are popular with
organizations that operate real-time processes, such as financial
institutions, government agencies and e-commerce providers. Here every
computer system and piece of information has a redundant copy (possibly
multiple redundant copies). The technology used to implement hot sites is
clustering, which is the grouping of multiple computers to provide increased
performance and fault tolerance.

5.3 Clustering Technologies



A cluster is a group of two or more computers (called nodes or members) that
work together to perform a task.
There are two major functions of clusters:
1) Load Balancing
2) High Availability
1) Load Balancing
Load balancing divides the work that a computer has to do between two or
more computers so that more work gets done in the same amount of time and,
in general, all users are served faster. Load balancing can be implemented
with hardware, software, or a combination of both. On the Internet,
companies whose websites get a great deal of traffic usually use load
balancing.
Load-balancing clusters dispatch network service requests to multiple
cluster nodes to balance the request load among them. Load balancing
provides cost-effective scalability because the number of nodes can be
matched to the load requirements. If a node in a load-balancing cluster
becomes inoperative, the load-balancing software detects the failure and
redirects requests to other cluster nodes. Node failures in a
load-balancing cluster are not visible from clients outside the cluster.
Red Hat Cluster Suite provides load balancing through LVS (Linux Virtual
Server), whereas Microsoft Windows Server provides it through a service
called Network Load Balancing.
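
The dispatch-and-redirect behaviour described above can be sketched in Python
as follows. This is an illustration only: the `RoundRobinBalancer` class is
hypothetical, and simple round-robin rotation is an assumed policy; it is not
a description of how LVS or Network Load Balancing is implemented.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Dispatch requests to healthy nodes in rotation."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._rotation = cycle(self.nodes)

    def mark_down(self, node):
        """Record that a node has failed so it is skipped."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Return a known node to the rotation."""
        if node in self.nodes:
            self.healthy.add(node)

    def next_node(self):
        """Return the next healthy node, skipping failed ones."""
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        for _ in range(len(self.nodes)):
            node = next(self._rotation)
            if node in self.healthy:
                return node


balancer = RoundRobinBalancer(["node1", "node2", "node3"])
balancer.mark_down("node2")
print([balancer.next_node() for _ in range(4)])
# ['node1', 'node3', 'node1', 'node3']
```

The client only ever calls `next_node()`, so the failure of node2 is not
visible from outside the cluster.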
2) High Availability
In many organizations, application servers provide various resources to
clients on a network. In such organizations it becomes necessary to
implement high availability technologies, which ensure the continued
operation of server applications. High availability refers to a system or
component that provides an application or service continuously for a
desired period of time.
High-availability clusters provide continuous availability of services by
eliminating single points of failure and by failing over services from one
cluster node to another if a node becomes inoperative. Typically, services
in a high-availability cluster read and write data (via read-write mounted
file systems). Therefore, a high-availability cluster must maintain data
integrity as one cluster node takes over control of a service from another
cluster node. Node failures in a high-availability cluster are not visible
from clients outside the cluster. (High-availability clusters are sometimes
referred to as failover clusters.)
Red Hat Cluster Suite provides high-availability clustering through its
High-availability Service Management component, whereas the Microsoft
Windows Server OS provides it through the Failover Clustering service.
Two popular clustering services provided by Microsoft Windows are:
1) Network Load Balancing (NLB)
2) Failover Clustering
5.3.1 Network Load Balancing (NLB) Clusters
A Windows feature-based NLB implementation is one in which multiple servers
(up to 32) run independently of one another and do not share any resources.
Client requests connect to the farm of servers and can be sent to any of
them, since they all provide the same functionality.
An NLB cluster is a group of servers that provides the load balancing service
by sharing multiple client requests, using virtual IP addresses and a shared
name. From the client's perspective, the group of servers appears as a single
server.
A single server can provide only a limited level of reliability and
scalability. However, when the resources of two or more servers are combined
in an NLB cluster, they provide the reliable performance that web servers and
other mission-critical servers require.
• NLB is designed for stateless applications such as web servers, FTP
  servers and VPN servers.
• Stateless applications often have read-only data or data that changes
  infrequently.
• All nodes in an NLB cluster are active.
• NLB distributes incoming client requests across the host servers included
  in the NLB cluster.
• NLB automatically detects servers that are disconnected from the NLB
  cluster and redistributes client requests to the remaining active
  servers. NLB does not direct client requests to failed or inactive
  servers.
• The algorithms behind NLB keep track of which servers are busy, so when a
  request comes in, it is sent to a server that can handle it. In the event
  of an individual server failure, NLB knows about the problem and can be
  configured to automatically redirect the connection to another server in
  the NLB cluster.
• NLB also supports scalability. When the number of client requests
  increases, the number of host servers in the NLB cluster can also be
  increased.


Figure: Five-node NLB cluster [20]


5.3.2 Failover Clusters
A failover cluster is a group of independent computers that work together to
increase the availability of applications and services. The clustered servers
(called nodes) are connected by physical cables and by software. If one of the
cluster nodes fails, another node begins to provide service (a process known as
failover). Users experience a minimum of disruptions in service.
Each node that is a member of the cluster has both its own individual disk
storage and access to a common disk subsystem. When one node in the cluster
fails, the remaining node or nodes assume responsibility for the resources
that the failed node was running. This allows users to continue to access
those resources while the failed node is out of operation.

[20] Source: http://www.Remoteitservices.com

A typical configuration for a cluster would use a shared disk technology such
as RAID (Redundant Array of Inexpensive/Independent Disks) or SAN (Storage
Area Network) to share back-end data stores.
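
The hand-over of resources between nodes can be sketched with a toy Python
model. The `FailoverCluster` class and its method names are hypothetical, and
the model ignores shared storage and heartbeats; it shows only how services
move to a surviving node on failure and return to their original node when it
recovers (the failback defined in the terminology that follows).

```python
class FailoverCluster:
    """Toy model: each service has a preferred node and a current node."""

    def __init__(self, nodes):
        self.up = {node: True for node in nodes}
        self.preferred = {}  # service -> node it originally ran on
        self.current = {}    # service -> node it runs on now

    def add_service(self, service, node):
        self.preferred[service] = node
        self.current[service] = node

    def fail(self, node):
        """Mark a node down and fail its services over to a survivor."""
        self.up[node] = False
        survivors = [n for n, alive in self.up.items() if alive]
        if not survivors:
            raise RuntimeError("no surviving node to fail over to")
        for service, host in list(self.current.items()):
            if host == node:
                self.current[service] = survivors[0]

    def recover(self, node):
        """Mark a node up and fail back the services it originally ran."""
        self.up[node] = True
        for service, home in self.preferred.items():
            if home == node:
                self.current[service] = node


cluster = FailoverCluster(["node1", "node2"])
cluster.add_service("database", "node1")
cluster.fail("node1")
print(cluster.current["database"])  # node2
cluster.recover("node1")
print(cluster.current["database"])  # node1
```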

Figure: Failover cluster [21]

Failover Cluster Terminology

Nodes
The individual servers of the cluster are called Nodes. The Nodes are
connected by physical cables and cluster software.

Failover

When one of the servers in the cluster fails, another server node in the
cluster provides the applications or services. This process is known as
Failover.

Failback
When the server which dropped out of the cluster returns to service and
rejoins the cluster, the services or applications which previously failed
over to another node can now return to the server on which they
originally ran. This is called failback.

21

Source: http://Remoteitservices.com


Quorum
Quorums are used to determine the number of failures that can be
tolerated within a cluster before the cluster itself has to stop functioning.
This is done to protect data integrity and to prevent problems that could
occur because of failed or failing communication between nodes.
Quorums describe the configuration of the cluster and contain
information about the cluster components such as network adapters,
storage, and the servers themselves.
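
As a rough illustration of how a quorum limits the number of tolerated failures, the sketch below assumes a simple node-majority rule (more than half of the nodes must be online). That rule and the function names are assumptions made for this example; actual cluster products offer several quorum models.

```python
def votes_needed(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can fail while a majority is still online."""
    return total_nodes - votes_needed(total_nodes)


def has_quorum(total_nodes: int, nodes_online: int) -> bool:
    """True if enough nodes are online for the cluster to keep running."""
    return nodes_online >= votes_needed(total_nodes)


five_node = tolerated_failures(5)   # 2 failures can be tolerated
four_node = tolerated_failures(4)   # 1 failure can be tolerated
two_node = tolerated_failures(2)    # 0 under a pure majority rule
```

Under this assumed rule, a five-node cluster keeps running with three nodes online and stops when only two remain.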

The purpose of a Windows failover cluster is to help maintain client access to
applications and server resources even in the event of some sort of outage
(natural disaster, software failure, server failure, etc.).
Failover clusters are probably the most common type of cluster. They consist of
servers that can handle and trade workloads for stateful applications.
Stateful applications are those in which data changes frequently.
Examples: email servers, file and print servers, database servers,
virtualization servers, etc.


5.3.3 Geo-Cluster
Clusters can be deployed in a server farm in a single physical facility or in
different, geographically separated facilities for added resiliency. The latter
type of cluster is often referred to as a geo-cluster. Geo-clusters became very
popular as a tool to implement business continuance because they reduce the
time that it takes for an application to be brought online after the servers in
the primary site become unavailable, meaning that ultimately they improve the
recovery time objective (RTO).

5.3.4 Two Node Failover Cluster


This type of failover cluster includes two servers. The first is the active server
node and the second is the passive server node, or failover server. The failover
server is an exact duplicate of the active server, but it is inactive and connected
to the active server by a high-speed link. The failover server monitors the active
server and its condition by using a heartbeat. A heartbeat is a signal that comes
from the active server at a specified interval. If the failover server doesn't
receive a heartbeat from the active server in the specified interval, the failover
server considers the active server inactive, comes online (becomes active), and
is now the active server. When the previously active server comes back online, it
starts sending out the heartbeat again. The failover server, which currently is
responding to requests as the active server, hears the heartbeat and detects that
the original active server is back online. The failover server then goes back into
standby mode and starts listening to the heartbeat of the active server again.
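
The heartbeat logic described above can be sketched as a small state machine. The class name, the timeout value and the use of plain numbers as timestamps are assumptions made for this illustration; a real cluster service would also handle shared storage, network partitions and client redirection.

```python
class FailoverServer:
    """Hypothetical passive node that watches the active node's heartbeat."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_heartbeat = 0.0
        self.role = "standby"

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat; if we had taken over, go back to standby."""
        self.last_heartbeat = now
        if self.role == "active":
            self.role = "standby"

    def check(self, now: float) -> str:
        """Take over if no heartbeat arrived within the timeout."""
        if self.role == "standby" and now - self.last_heartbeat > self.timeout:
            self.role = "active"
        return self.role


node = FailoverServer(timeout_seconds=3.0)
node.on_heartbeat(0.0)
node.on_heartbeat(1.0)
state_at_2 = node.check(2.0)   # heartbeat is recent, stays on standby
state_at_5 = node.check(5.0)   # 4 seconds of silence, takes over
node.on_heartbeat(6.0)         # original active server is back
state_at_7 = node.check(7.0)   # failover server is on standby again
```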

22

Two Node Failover Cluster

22 Source: http://www.fatihacar.com

Warm Site

A warm site is, quite logically, a compromise between hot and cold sites. These
sites will have hardware and connectivity already established, though on a
smaller scale than the original production site or even a hot site. Warm sites
will have backups on hand, but they may not be complete and may be between
several days and a week old. An example would be backup tapes sent to the
warm site by courier. The data and services are less critical than those in a hot
site. With hot site technologies, all fault-tolerance procedures are automatic
and are controlled by the Network Operating System. Warm site technologies
require a little more administrator intervention, but they aren't as expensive.

The most commonly used warm site technology is a duplicate server. A duplicate
server, as its name suggests, is not currently in use and is available to replace
any server that fails. When a server fails, the administrator installs the
duplicate server and restores the data; the network services are available to
users with a minimum of downtime. The administrator sends the failed server out
to be repaired. Once the repaired server comes back, it becomes the spare server
and is available when another server fails. Using a duplicate server is a disaster
recovery method because the entire server is replaced in a shorter time than if
all the components had to be ordered and configured at the time of the system
failure. The major advantage of using duplicate servers rather than clustering
is that it's less expensive. A single duplicate server costs much less than a
comparable cluster solution. Corporate networks don't often use duplicate
servers, because there are some major disadvantages associated with them:

Backups must be kept current. Because the duplicate server relies on a
current backup, you must back up every day and verify every backup,
which is time-consuming. To stay as current as possible, some companies
run continuous backups.
Data can be lost. If a server fails in mid-afternoon and the backup was run
the evening before, you will lose any data that was placed on the server
since the last backup. This may not be a big problem on servers that aren't
updated frequently.
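
To make the data-loss point concrete, the following sketch computes the window of lost work between the last backup and the failure. The dates and times are invented for the example, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Time span of changes that exist only on the failed server."""
    if failure < last_backup:
        raise ValueError("failure time is earlier than the last backup")
    return failure - last_backup


# Assumed example: backup at 10 pm, server failure at 3 pm the next day.
last_backup = datetime(2024, 1, 10, 22, 0)
failure = datetime(2024, 1, 11, 15, 0)
lost = data_loss_window(last_backup, failure)
lost_hours = lost.total_seconds() / 3600   # 17.0 hours of changes are lost
```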
Cold Site

A cold site is the least expensive type of backup site for an organization to
operate. It does not include backed-up copies of data and information from the
original location of the organization, nor does it include hardware already set
up. The lack of hardware contributes to the minimal startup costs of the cold
site, but additional time is required following the disaster to have the
operation running at a capacity close to that prior to the disaster. A cold site
cannot guarantee server uptime. Generally speaking, cold sites have little or no
fault tolerance and rely completely on efficient disaster recovery methods to
ensure data integrity. If a server fails, the IT personnel do their best to
recover and fix the problem. If a major component needs to be replaced, the
server stays down until the component is replaced. Errors and failures are
handled as they occur. Apart from regular system backups, no fault tolerance or
disaster recovery methods are implemented. This type of site has one major
advantage: it is the cheapest way to deal with errors and system failures. No
extra hardware is required (except hardware required for backing up).
The disadvantages of implementing a cold site stem from running an
application that cannot afford the downtime associated with service-affecting
faults and disasters.
The term nearline refers to a storage method that is neither online nor offline
but somewhere in the middle, like tape backup. It involves material that is not
likely to be needed except in cases of disaster recovery. While there is no
one-to-one correspondence between any type of site (hot, warm, or cold) and
nearline storage, which is not actively accessed during normal operation,
nearline storage comes in handy when recovering from disasters in warm and
cold sites.
5.4 Power Management

Power management is a very important strategy for fault tolerance. Electricity
powers the network, switches, hubs, PCs, and computer servers. Variations in
power can cause problems ranging from a reboot after a short loss of service to
damaged equipment and data. Fortunately, a number of products are available
to help protect sensitive systems from the dangers of lightning strikes, dirty
(uneven) power, and accidental power cable disconnection. These include surge
protectors, Standby Power Supplies, uninterruptible power supplies, and line
conditioners. What you use depends on how critical your system is, which is
determined by whether it is a hot, warm, or cold site. At a minimum, connect
individual workstations to surge protectors; network hardware and servers
should use uninterruptible power supplies or line conditioners. Critical
operations, such as ambulance corps and hospitals, typically go one step
further and also have a gas-powered backup generator to provide long-term
supplemental power to all systems.

Surge Protectors

Surge protectors, also referred to as surge suppressors, are typically power
blocks or power strips with electronics that limit the amount of voltage, current
(amps), and noise that can get through to equipment. They are designed to
protect your equipment from long-lasting increases in voltage (surges) and
high, short bursts of voltage (spikes). The unit does not provide any power,
however. Rather, it blocks harmful electricity from reaching your equipment.
The surge protector detects a surge or a spike and clamps down on the
incoming voltage, reducing it to safe levels. If the surge is large enough, it can
trip the built-in safety mechanism. You may then lose power and have to reset
the equipment you are protecting. Common causes of surges and spikes are
fluctuations in power from the electricity company, additions of equipment to
the power grid by customers, and natural storms.

23

Battery Backup Systems

Battery backup systems protect computer systems from power failures. These
systems use a battery to power the computer and its assorted peripherals.
When these devices are activated by a power failure, they permit the user to
save data and initiate a graceful shutdown of the system. They normally aren't
used to run the system for an extended period.
There are two main types of battery backup systems:

Standby Power Supply (SPS)

Uninterruptible Power Supply (UPS)

Standby Power Supply (SPS)

A Standby Power Supply (SPS) contains a battery, a switchover circuit, and an
inverter (a device to convert the DC voltage from the battery into AC voltage
that the computer and peripherals need). The outlets on the SPS are connected
to the switching circuit, which is in turn connected to the incoming AC power
(called line voltage). The switching circuit monitors the line voltage. When it
drops below a factory preset threshold, the switching circuit switches from line
voltage to the battery and inverter. The battery and inverter power the outlets
(and, thus, the computers or devices plugged into them) until the switching
circuit detects that line voltage is present again at the correct levels. The
switching circuit then switches the outlets back to line voltage.

23 Ref: http://en.wikipedia.org/wiki/File:Surge_protector.jpg
24

Uninterruptible Power Supply (UPS)

A UPS is another type of battery backup often found on computers and
network devices today. It is similar to an SPS in that it has outlets, a battery,
and an inverter. The similarities end there. A UPS uses an entirely different
method to provide continuous AC voltage to the equipment it supports. In a
UPS, the equipment is always running off the inverter and battery. A UPS
contains a charging/monitoring circuit that charges the battery constantly. It
also monitors the AC line voltage. When a power failure occurs, the charger
simply stops charging the battery, and the equipment never senses any change
in power. The monitoring part of the circuit senses the change and emits a beep
to tell the user the power has failed.

24

Ref: http://www.p-wholesale.com


Line Conditioners

A power conditioner (also known as a line conditioner or power line conditioner)
is a device intended to improve the quality of the power that is delivered to
electrical load equipment. The term most often refers to a device that acts in
one or more ways to deliver a voltage of the proper level and characteristics to
enable load equipment to function properly.
Line conditioners keep equipment working through voltage fluctuations
without using backup power.
Line conditioners are complex (and thus expensive) devices that incorporate a
number of power-correction technologies to provide electronic devices with
clean power. Some of these technologies are UPS, surge suppression, and
power filtering.


5.5 The following issues arise in choosing a backup site and implementing
a Data Centre/DR solution:
1. Solution architectures of DC and DR are not identical for all the applications
and services. Critical applications and services, namely the retail, corporate,
trade finance and government business solutions as well as the delivery
channels, have the same DR configurations, whereas surround or
interfacing applications do not have DR support. The organization will have to
conduct periodic reviews of this aspect, upgrade the DR solutions from time to
time, and ensure that all the critical applications and services have a perfect
replica in terms of performance and availability.
2. The configurations of servers, network devices and other products at the DC
and the DR have to be identical at all times. This includes the patches that are
applied at the DR periodically and the changes made to the software from time
to time by customization and parameterization to account for regulatory
requirements, system changes, etc.
3. Solutions have to have a defined Recovery Time Objective (RTO) and
Recovery Point Objective (RPO). These two parameters have a very
clear bearing on the technology aspects as well as the process defined for cut
over to the DR and the competency levels required to move over in the specified
time frame.
4. The values chosen for the RTO and RPO often follow industry practice
rather than being derived from first principles. Therefore, the DR drills that are
conducted periodically have to ensure that the above parameters are strictly
complied with.
5. Technology operations processes which support business operations need to
be formally included in the IT Continuity Plan.
6. Organizations may also consider Recovery Time Objective and Recovery
Point Objective (RTO/RPO) for the services being offered and not just for a
specific application. This avoids any inconsistency in business users'
understanding.
7. DR drills currently conducted periodically come under the category of
planned shutdown. Organizations have to evolve a suitable methodology to
conduct drills which are closer to the real disaster scenario, so that the
confidence of the technical team taking up this exercise is built to
address the requirement in the event of a real disaster.


8. It is also recommended that the support infrastructure at the DC and DR,
namely the electrical systems, air-conditioning environment and other support
systems, have no single point of failure and have a building management
and monitoring system to constantly and continuously monitor the resources.
If it is specified that the solution has a high availability of 99.95% measured
on a monthly basis and a mean time to restore of 2 hours in the event of any
failure, this has to include the support systems also.
9. The data replication mechanism followed between DC and DR is
asynchronous replication, implemented across the industry using either
database replication techniques or storage-based replication techniques. Both
have relative merits and demerits. The RTO and RPO discussed earlier, along
with the replication mechanism used and the data transfer required to be
accomplished during the peak load, will decide the bandwidth required between
the DC and the DR. The RPO is directly related to the latency permissible for
the transaction data from the DC to update the database at the DR. Therefore,
the process implemented for data replication has to conform to the above, with
no compromise to data and transaction integrity.
10. Given the need to drastically minimize data loss during exigencies
and to enable quick recovery and continuity of critical business operations,
organizations may need to consider near site DR architecture. Major
organizations with significant customer delivery channel usage and significant
participation in financial markets/payment and settlement systems may need
to have a plan of action for creating a near site DR architecture over the
medium term (say, within three years).
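
Two of the figures in the list above lend themselves to a quick calculation. The sketch below converts a monthly availability target into a downtime budget and checks a replication lag against an RPO. The 30-day month, the sample lag values and the function names are assumptions made for illustration.

```python
def downtime_budget_minutes(availability_percent: float,
                            days_in_period: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


def meets_rpo(replication_lag_seconds: float, rpo_seconds: float) -> bool:
    """True if data at the DR is no further behind the DC than the RPO."""
    return replication_lag_seconds <= rpo_seconds


budget = downtime_budget_minutes(99.95)     # roughly 21.6 minutes per month
within_rpo = meets_rpo(30.0, rpo_seconds=60.0)
outside_rpo = meets_rpo(90.0, rpo_seconds=60.0)
```

Under the 30-day assumption, 99.95% availability allows only about 21.6 minutes of downtime in a month, so a single 2-hour restore would by itself exceed that budget. This suggests the two targets should be reviewed together when they are written into a specification.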
To address these issues, the following controls are required:

Stipulating a periodic DR exercise with clearly defined ground rules for
the same. IT recovery tests are also required to realistically reflect the
worst case scenario where all critical systems must be restored
concurrently.
Sending out detailed questionnaires to the organizations based on the
questionnaire that is issued by organizations like ISACA (Information
Systems Audit and Control Association).
Carrying out system-wide stress testing simulating various scenarios.

5.5.1 Issues/Challenges in BC/DR implementation

a. Despite considerable advances in equipment and telecommunications
design and recovery services, IT disaster recovery is becoming
challenging. Continuity and recovery aspects are impacting IT strategy,
and cost implications are challenging IT budgets.
b. The time window for recovery is shrinking in the face of the demand for
24/365 operations. Some studies claim that around 30 percent of
high-availability applications have to be recovered in less than three
hours and a further 45 percent within 24 hours, before losses become
unsustainable; others claim that 60 percent of Enterprise Resource
Planning (ERP) systems have to be restored in under 24 hours. This means
that traditional off-site backup and restore methods are often no longer
adequate. It simply takes too long to recover incremental and full image
backups of various inter-related applications (backed up at different
times), synchronize them and re-create the position as at the time of the
disaster. Continuous operation (data mirroring to off-site locations, and
standby computing and telecommunications) may be the only solution.
c. A risk assessment and business impact analysis should establish the
justification for continuity for specific IT and telecommunication services
and applications.
d. Achieving robust security (security assurance) is not a one-time activity. It
cannot be obtained just by purchasing and installing suitable software
and hardware. It is a continuous process that requires regular
assessment of the security health of the organization and proactive steps
to detect and fix any vulnerability. Every organization should have in
place quick and reliable access to expertise for tracking suspicious
behavior, monitoring users and performing forensics, along with adequate
reporting to the authorities concerned.
5.5.2 Telecommunications issues In BCP/DR
It is important to ensure that relevant links are in place and that
communications capability is compatible. The adequacy of voice and data
capacity needs to be checked. Telephonic communication needs to be
switched from the disaster site to the standby site. However, diverse
routing may be difficult to achieve since primary telecommunications
carriers may have an agreement with the same sub-carriers to provide
local access service, and these sub-carriers may also have a contract
with the same local access service providers. Financial institutions do
not have any control over the number of circuit segments that will be
needed, and they typically do not have a business relationship with any
of the sub-carriers. Consequently, it is important for financial
institutions to understand the relationship between their primary
telecommunications carrier and these various sub-carriers and how this
complex network connects to their primary and back-up facilities. To
determine whether telecommunications providers use the same sub-carrier or
local access service provider, organizations may consider
performing an end-to-end trace of all critical or sensitive circuits to
search for single points of failure such as a common switch, router, PBX,
or central telephone office.
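
The end-to-end trace suggested above amounts to listing every element each circuit passes through and looking for elements that appear on all of them. The sketch below assumes the trace results are already available as lists; the circuit names and element names are invented for the example.

```python
def shared_elements(circuit_paths: dict) -> set:
    """Return the elements that every traced circuit passes through."""
    element_sets = [set(path) for path in circuit_paths.values()]
    if not element_sets:
        return set()
    return set.intersection(*element_sets)


# Hypothetical trace results for a primary and a back-up circuit.
traces = {
    "primary": ["branch-router-1", "carrier-A-switch", "central-office-1", "dc"],
    "backup": ["branch-router-2", "carrier-B-switch", "central-office-1", "dr"],
}
single_points = shared_elements(traces)   # {"central-office-1"}
```

In this invented case the two circuits use different carriers but still meet at the same central office, which is the kind of hidden single point of failure the trace is meant to expose.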

5.5.3 Organizations may consider the following telecommunications
diversity components to enhance BCP:
i. Alternative media, such as secure wireless systems
ii. Internet protocol networking equipment that provides easily configurable
re-routing and traffic load balancing capabilities
iii. Local services to more than one telecommunications carrier's central
office, or diverse physical paths to independent central offices
iv. Multiple, geographically diverse cables and separate points of entry
v. Frame relay circuits that do not require network interconnections, which
often cause delays due to concentration points between frame relay
providers
vi. Separate power sources for equipment with generator or uninterrupted
power supply back-up
vii. Separate connections to back-up locations
viii. Regular use of multiple facilities in which traffic is continually split
between the connections; and
ix. Separate suppliers for hardware and software infrastructure needs.

5.5.4 Outsourcing Risks
In theory, a commercial hot or warm standby site is available 24/365. It has
staff skilled in assisting recovery. Its equipment is constantly kept up to date,
while older equipment remains supported. It is always available for use and
offers testing periods once or twice a year. The practice may be different. These
days, organizations have a wide range of equipment from different vendors and
different models from the same vendor. Not every commercial standby site is
able to support the entire range of equipment that an organization may have.
Instead, vendors form alliances with others, but this may mean that an
organization's recovery effort is split between more than one standby site. The
standby site may not have identical IT equipment; instead of an identical piece
of equipment, it will offer a partition on a compatible large computer or server.
Operating systems and security packages may not be the same version as the
client usually uses. These aspects may cause setbacks when attempting recovery
of IT systems and applications, and weak change control at the recovery site
could cause a disaster on return to the normal site.


Summary

Fault tolerance is the capability of a computer or a network system to
respond to a condition automatically, usually resolving it and thus
reducing the impact on the system.
Disaster recovery is the ability to get a system functional after a total
system failure or site outage in the least amount of time.
We need to deal with the following when we want to have fault tolerance
to support business continuity:
-Assessing fault tolerance and disaster recovery needs
-Power management
-Disk system fault tolerance methods
-Backup considerations
-Virus protection
A hot site is a commercial disaster recovery service that allows a
business to continue computer and network operations in the event of a
disaster.
A cluster is a group of two or more computers that work together to
perform a task. There are two major functions of clusters: load
balancing and high availability.
A warm site is, quite logically, a compromise between hot and cold sites.
Cold sites have little or no fault tolerance and rely completely on efficient
disaster recovery methods to ensure data integrity.
Power management is a very important strategy for fault tolerance. To
protect sensitive systems, surge protectors, Standby Power Supplies,
uninterruptible power supplies, and line conditioners can be used.
Addressing the different issues in choosing a backup site and implementing
a Data Centre/Disaster Recovery solution is crucial for an
organization's survival after a disaster.


Chapter 6
Disk System Fault Tolerance
Objective
6.1 Server Storage Technologies
6.2 Disk System Fault Tolerance
6.3 Disk Management in Microsoft Windows OS
6.4 Disk Management Tool
6.5 Creating Dynamic Volumes
6.6 Backup Considerations
6.7 Virus Protection

Introduction
The primary requirement in an organization's network is disk storage, which is
used to store organizational data.
Various storage technologies, such as direct-attached storage (DAS),
network-attached storage (NAS), and storage-area networks (SANs), are used to
store organizational data. Stored data requires fault tolerance. Backup is used
to secure data in case a disaster occurs. Data and computers must also be
protected from virus attacks.
6.1 Server Storage Technologies
The demand for server storage has increased manifold. Consequently, server
storage technologies have improved with time. Initially, DAS technology was
used to store data. However, DAS attaches storage to only one server, which
leads to inefficient utilization of storage resources.
Network Attached Storage (NAS) technology was introduced mainly because of
the limitations of DAS. NAS is a data storage technology that allows you to
store data on a network storage location and provides data accessibility to
multiple clients.
The storage area network (SAN) is an architecture that helps you attach
remote storage devices to servers. These storage devices are attached in such a
manner that they appear to be attached locally to the servers. SAN is not
restricted to a single server; instead, SAN storage can be moved from one server
to another.
6.2 Disk Systems Fault Tolerance
The hard disk is the basic storage device of a computer system. Compared to
other hardware devices, the hard disk carries the highest risk of failure (a
hard disk crash). When this happens, all data can be lost. Therefore, to keep
data available and accessible, some method of fault tolerance must be
implemented.

Fault tolerance is essentially a system's ability to allow for failures or
malfunctions, and this ability can be provided by software, hardware or a
combination of both.
The methods that provide fault tolerance for hard disk systems include:
Mirroring
Duplexing
Data striping
Redundant array of independent (or inexpensive) disks (RAID)
Before you read about the various methods of providing fault tolerance for disk
systems, you should know about one important concept: volumes. When
you install a new hard disk into a computer and prepare it for use, the network
operating system (NOS) sets up the disk so that you can store data on it, in a
process known as formatting. Once this has been done, the NOS can
access the disk. Before it can store data on the disk, it must set up what is
known as a volume. A volume, for all practical purposes, is a named chunk of
disk space. This chunk can exist on part of a disk, can exist on all of a disk, or
can span multiple disks. Volumes provide a way of organizing disk storage, as
you can see in this illustration:

25

Disk Mirroring
Mirroring a drive means designating a hard disk drive in the computer as a
mirror or duplicate to another, specified drive. The two drives are attached to a
single disk controller. This disk fault tolerance feature is provided by most
Network Operating Systems (NOS). When the NOS writes data to the specified
drive, the same data is also written to the drive designated as the mirror. If the
first drive fails, the mirror drive is already online, and since it has a duplicate
of the information contained on the specified drive, the users won't know that a
disk drive in the server has failed. The NOS notifies the administrator that the
failure has occurred. The downside is that if the disk controller fails, neither
drive is available.
25

Ref: http://e-university.wisdomjobs.com


Disk Duplexing
As with mirroring, duplexing also saves data to a mirror drive. In fact, the only
major difference between duplexing and mirroring is that duplexing uses two
separate disk controllers (one for each disk). Thus, duplexing provides not only
a redundant disk, but a redundant controller as well. Duplexing provides fault
tolerance even if one of the controllers fails. There is now an extra disk
controller in the system. Mirroring is also known as RAID-1.
Disk Striping
From a performance point of view, writing data to a single drive is slow. When
three drives are configured as a single volume, information must fill the first
drive before it can go to the second, and fill the second before filling the
third. If the user configures that volume to use disk striping, the user will see a
definite performance gain. Disk striping breaks up the data to be saved to disk
into small portions and writes successive portions to the disks in turn, in small
areas called stripes, so that all the disks work at the same time. These stripes
maximize performance because all the read/write heads are working constantly.
Notice that the data is broken into sections and that each section is written in
sequence to a separate disk.
Striping data across multiple disks improves only performance; it does not
improve fault tolerance. To add fault tolerance to disk striping, it is necessary
to use parity. Disk striping is also known as RAID-0.
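
The way striping distributes data can be sketched by cutting a byte string into fixed-size portions and dealing them out to the disks in turn. The two-byte stripe size and the sample data are chosen only to keep the example readable; real stripe sizes are far larger.

```python
def stripe(data: bytes, num_disks: int, stripe_size: int):
    """Split data into stripe-sized portions and assign them round-robin."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), stripe_size):
        portion = data[offset:offset + stripe_size]
        disk_index = (offset // stripe_size) % num_disks
        disks[disk_index].append(portion)
    return disks


layout = stripe(b"ABCDEFGHIJKL", num_disks=3, stripe_size=2)
# layout[0] == [b"AB", b"GH"]
# layout[1] == [b"CD", b"IJ"]
# layout[2] == [b"EF", b"KL"]
```

Because each disk holds only every third portion, losing any one disk makes the whole volume unreadable, which is why striping alone gives no fault tolerance.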
Parity Information
Parity, as it relates to disk fault tolerance, is a general term for the fault
tolerance information computed for each chunk of data written to a disk. This
parity information can be used to reconstruct missing data should a disk fail.
Striping can use parity or not, but if the striping technology doesn't use parity,
the user won't gain any fault tolerance. When using striping with parity, the
parity information is computed for each block and written to the drive. The
advantage to using parity with striping is gaining fault tolerance. If any part of
the data gets lost or destroyed, the information can be rebuilt from the parity
information. The down side to using parity is that computing and writing parity
information reduces the total performance of a disk system that uses striping.
The parity information also reduces the total amount of free disk space.
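
One common way to compute parity is a bytewise exclusive OR (XOR) across the data blocks of a stripe. The text above does not name a specific method, so XOR is an assumption here, as are the sample block values. The sketch shows how a lost block is rebuilt from the surviving blocks and the parity.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


# Three data blocks from one stripe (values invented for the example).
d0, d1, d2 = b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"
parity = xor_blocks([d0, d1, d2])

# Suppose the disk holding d1 fails: XOR the survivors with the parity.
rebuilt_d1 = xor_blocks([d0, d2, parity])
```

The extra parity block is also why parity costs both write performance and usable disk space.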

Redundant Array of Inexpensive (or Independent) Disks (RAID)

RAID is a well-tested method for providing data protection and enabling
multiple physical disks to be combined into one larger logical disk. RAID
has several levels, which represent the resulting configuration. Each level
has its own set of requirements.

RAID 0 (Commonly used): This method is the fastest because all read/
write heads are constantly being used without the burden of parity or
duplicate data being written. A system using this method has multiple
disks, and the information to be stored is striped across the disks in
blocks without parity. This RAID level only improves performance; it
does not provide fault tolerance.
RAID 1 (Commonly used): This level uses two hard disks, one mirrored
to the other (commonly known as mirroring; duplexing is also an
implementation of RAID 1). This is the most basic level of disk fault
tolerance.
If the first hard disk fails, the second automatically takes over. No parity
or error-checking information is stored. Rather, each drive holds a duplicate
of the information on the other. If both drives fail, new drives must be
installed and configured, and the data must be restored from a backup.
RAID 2: Individual bits are striped across multiple disks. Dedicated
drives in this configuration store error-correcting information (a
Hamming code rather than simple parity). If any data drive fails, the
data on that drive can be rebuilt from the error-correcting information.
This is not a commonly used implementation.
RAID 3 (Commonly used): Data is striped across multiple hard drives
using a parity drive (similar to RAID 2). The main difference is that the
data is striped in bytes rather than in bits, as it is in RAID 2. This
configuration is popular because more data is written and read in one
operation, increasing overall disk performance.
RAID 4: This is similar to RAID 2 and 3 (striping with parity drive),
except data is striped in blocks, which facilitates fast reads from one
drive. RAID 4 is the same as RAID 0, with the addition of a parity drive.
This is not a popular implementation.
RAID 5 (Commonly used): The data and parity are striped across
several drives. This allows for fast reads, although writes are slowed
by the parity calculation. The parity
information for data on one disk is stored with the data on another disk,
so if any one disk fails, the drive can be replaced, and its data can be
rebuilt from the parity data stored on the other drives. A minimum of
three disks is required. Five or more disks are most often used.
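
The difference between a dedicated parity drive (RAID 3 and 4) and distributed parity (RAID 5) is easiest to see in a layout table. The following Python sketch prints one possible RAID 5 layout. The rotation order used here is an assumption for illustration, since real controllers and operating systems use several different rotation schemes, and raid5_layout is a hypothetical helper name.

```python
def raid5_layout(num_disks, num_stripes):
    """Return one row per stripe; each cell is a data (D) or parity (P) label."""
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    rows = []
    block = 0
    for stripe in range(num_stripes):
        # Assumed rotation: parity starts on the last disk and moves left.
        parity_disk = (num_disks - 1 - stripe) % num_disks
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append(f"P{stripe}")
            else:
                row.append(f"D{block}")
                block += 1
        rows.append(row)
    return rows


layout = raid5_layout(num_disks=4, num_stripes=4)
for row in layout:
    print("  ".join(f"{cell:>3}" for cell in row))
```

Each row (stripe) contains exactly one parity block, and over four stripes each of the four disks holds parity once. No single drive becomes the parity bottleneck, which is the main advantage over RAID 3 and RAID 4.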


6.3 Disk Management in Windows


Let us now look at the disk fault tolerance methods available on the
Microsoft Windows operating system.
6.3.1 Disk Types:
Microsoft Windows supports basic and dynamic disks for storage. A basic disk
contains primary or extended partitions and logical drives. The number of
partitions you can create on a basic disk depends on the partition style. For
example:
A disk using the master boot record (MBR) partition style allows you to create
four primary partitions, or three primary partitions and one extended partition.
A disk using the Globally Unique Identifier (GUID) partition table (GPT)
partition style allows you to create up to 128 primary partitions.
Apart from basic disks, Microsoft Windows also supports dynamic disks. A
dynamic disk is a physical disk that is divided into volumes rather than into
partitions and logical drives. You can divide a dynamic disk into an unlimited
number of volumes; however, to avoid slow boot time performance, you should
restrict the number of volumes. A dynamic disk supports the following five
types of volumes.

Simple: Comprises a single region or multiple regions on one disk
Spanned: Comprises disk space on more than one disk
Striped: Also known as RAID-0. Stores data in the form of stripes on two
or more physical disks
Mirrored: Also known as RAID-1. Provides data redundancy with the help
of two copies of a volume
RAID-5 (Redundant Array of Inexpensive or Independent Disks): Also
known as a striped volume with parity. Combines the free space areas from
three or more physical disks

Previously, to create any of the preceding volumes, you first had to
convert the basic disks into dynamic disks. In Windows Server 2008, however,
the Disk Management tool automatically converts a basic disk to a dynamic
one while creating such volumes.
A volume refers to a data storage area that is accessible by a file system, which
may or may not be on a single partition of a hard disk. Apart from creating
different types of volumes, dynamic disks also support extending volumes,
shrinking volumes, and creating mount points on a volume.


6.3.2 Simple Volume:


A simple volume is a dynamic volume that is made up of disk space from a
single dynamic disk. A simple volume can consist of a single region on a disk or
multiple regions of the same disk that are linked together. You can create
simple volumes only on dynamic disks. Simple volumes can be extended or
shrunk. Simple volumes are not fault tolerant.

6.3.3 Spanned Volume:


In a spanned volume, multiple hard drives are combined to make one large
volume. Data is not written in parallel to multiple disks; it is written to a
single disk until that disk is full and then to the next disk in the set.

26

Spanned Volume
A spanned volume is a dynamic volume consisting of disk space on more than
one physical disk. If a simple volume is not a system volume or boot volume,
you can extend it across additional disks to create a spanned volume, or you
can create a spanned volume in unallocated space on a dynamic disk.
You can extend a spanned volume onto a maximum of 32 dynamic disks.
Spanned volumes are not fault tolerant. If one of the disks containing a
spanned volume fails, the entire volume fails, and all data on the spanned
volume becomes inaccessible.
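
The "fill one disk, then move to the next" behaviour can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each disk contributes one contiguous extent, sizes are in megabytes, and locate_in_spanned is a hypothetical helper, not part of any Windows tool.

```python
def locate_in_spanned(offset_mb, extents_mb):
    """Map a logical offset in a spanned volume to (disk index, offset on disk)."""
    if offset_mb < 0:
        raise ValueError("offset must not be negative")
    for disk, size in enumerate(extents_mb):
        if offset_mb < size:
            return disk, offset_mb
        offset_mb -= size          # skip past this disk's contribution
    raise ValueError("offset is beyond the end of the volume")


# A spanned volume built from 100 MB, 50 MB and 200 MB of three disks.
extents = [100, 50, 200]
for offset in (0, 99, 100, 149, 150, 349):
    print(offset, "->", locate_in_spanned(offset, extents))
```

Unlike striping, the disks may contribute different amounts of space, and the total capacity is simply the sum of the contributions. A file that straddles the boundary between two disks depends on both of them, which is one reason the loss of any member disk makes the whole volume inaccessible.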
26

Ref: www.iomega.com


6.3.4 Striped Volume:


A dynamic volume that stores data in stripes on more than one physical drive
is known as a striped volume. A striped volume is also known as RAID-0, and it
allocates data alternately and evenly to the disks within the striped
volume. You need at least two physical disks to create a striped volume.
Striped volumes improve disk input/output (I/O) performance by distributing
I/O requests across disks.
Striped volumes cannot be extended or mirrored and do not offer fault
tolerance. If one of the disks containing a striped volume fails, the entire
volume fails, and all data on the striped volume becomes inaccessible.
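
A short Python sketch shows how striping spreads consecutive blocks over the member disks. It assumes a simple round-robin placement with a fixed block size, and locate_in_striped is a hypothetical name used only for illustration.

```python
def locate_in_striped(block_index, num_disks):
    """Return (disk index, stripe row) for a logical block, placed round robin."""
    if num_disks < 2:
        raise ValueError("a striped volume needs at least two disks")
    return block_index % num_disks, block_index // num_disks


# Six consecutive blocks written to a three-disk striped volume.
for block in range(6):
    disk, row = locate_in_striped(block, 3)
    print(f"block {block} -> disk {disk}, stripe row {row}")
```

Because neighbouring blocks land on different disks, a large read or write keeps all disks busy at once, which is where the performance gain comes from. It also means that almost every file has pieces on every disk, so a single disk failure damages the whole volume.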

27

RAID 0 -Striping

6.3.5 Mirrored Volume:


A mirrored volume is a volume that duplicates data across two disks; data is
written to both disks at the same time. This type of volume is fault tolerant
because if one drive fails, the data on the other disk is unaffected. It is also
known as RAID-1.
It costs more, since mirroring requires twice as much disk storage as
would otherwise be required. However, the cost of an extra hard drive is
usually well worth the security of having important data duplicated.
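
As a rough illustration of the mirroring behaviour described above, the following Python sketch models the two disks as in-memory dictionaries. The class MirroredVolume and its methods are hypothetical and only mimic the idea; they do not correspond to any Windows API.

```python
class MirroredVolume:
    """Toy model of a two-disk mirror: every write goes to both disks."""

    def __init__(self):
        self.disks = [{}, {}]            # block number -> data, one dict per disk
        self.failed = [False, False]

    def fail(self, disk_index):
        self.failed[disk_index] = True

    def write(self, block, data):
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                disk[block] = data

    def read(self, block):
        # Serve the read from the first disk that is still healthy.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                return disk[block]
        raise OSError("both disks have failed; restore from backup")


volume = MirroredVolume()
volume.write(0, b"customer balances")
volume.fail(0)                           # first disk dies
print(volume.read(0))                    # data is still served by the second disk
```

The sketch also shows the cost: two disks are used, but only one disk's worth of data can be stored.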
27

Ref: thedatarescuecentre.com


28

RAID 1
Fig showing Mirrored volume

6.3.6 RAID-5
It is also known as disk striping with parity. With disk striping with parity, you
can use three or more disks (with a maximum of 32), and data is striped across
all the disks along with an additional block of error-correction information
called parity, which is used to reconstruct the data in the event of a disk
failure. RAID-5 has slower write performance than the other RAID types because
the OS must calculate the parity information for each stripe that is written,
but the read performance is equivalent to RAID-0, because the parity
information is not read.
Like RAID-1, RAID-5 comes with additional cost considerations. For every RAID-5
set, roughly an entire hard disk is consumed for storing the parity information.
For example, a minimum RAID-5 set requires three hard disks, and if those
hard disks are 300 GB each, approximately 600 GB of disk space is available to
the OS and 300 GB is consumed by parity information, which equates to 33.3
percent of the total space. Similarly, in a five-disk RAID-5 set of 300 GB
disks, approximately 1200 GB of disk space is available to the OS, which means
that 20 percent of the total space is consumed by the parity
information.
The capacity of the volume is limited by the smallest section of unallocated
space on any one of the disks that belong to the RAID-5 set. Suppose we want to
create a RAID-5 volume using three disks: Disk1, Disk2 and Disk3. If Disk2 has
50 GB of unallocated space, but Disk1 and Disk3 each have 100 GB of unallocated
space, the stripe can use only 50 GB of space on each of Disk1, Disk2 and
Disk3. Thus the space used on each disk in the volume is identical. Out of this
150 GB, the equivalent of one disk's contribution (50 GB) is used to store
parity information. So the capacity of the RAID-5 volume created in this
example will be 100 GB.

28

Ref: thewebshop.com

In RAID-5, fault tolerance applies only to a single drive failure. If more than
one drive fails, data is lost and can be recovered only by restoring it from a
backup.
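
The capacity rules in this section can be checked with a small calculation. The Python sketch below is illustrative only: raid5_capacity is a hypothetical helper, sizes are in gigabytes, and file system overhead is ignored.

```python
def raid5_capacity(unallocated_gb):
    """Return (usable GB, parity GB, parity share of total) for a RAID-5 volume."""
    if len(unallocated_gb) < 3:
        raise ValueError("RAID-5 needs at least three disks")
    per_disk = min(unallocated_gb)       # every disk contributes the same amount
    total = per_disk * len(unallocated_gb)
    parity = per_disk                    # one disk's worth of space holds parity
    return total - parity, parity, parity / total


for disks in ([300, 300, 300], [300] * 5, [100, 50, 100]):
    usable, parity, share = raid5_capacity(disks)
    print(disks, "usable:", usable, "parity:", parity, f"overhead: {share:.1%}")
```

The three cases correspond to the examples in the text: three 300 GB disks give 600 GB usable, five 300 GB disks give 1200 GB usable, and the uneven 100/50/100 GB case gives 100 GB usable. The parity overhead is always one divided by the number of disks, so it shrinks as disks are added.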
The following diagram shows the RAID-5 configuration involving 4 disks.

29

RAID 5
Fig showing RAID-5 volume

Let us now see how the Disk Management utility in the Microsoft Windows OS can
be used to perform the tasks mentioned above.

6.4 Using Disk Management Tool


To open the tool, click Start, select Run, type diskmgmt.msc in the Open text
box, and then click OK. The top pane, by default, contains a volume list that
displays the volumes on all of the physical disks in the computer. Strictly
speaking, this list displays volumes only for dynamic disks; for basic disks,
the top pane lists the primary partitions and logical drives.

29

Ref: blog.everycity.co.uk

Disk Management
Each entry in the volume list contains the following information:
Volume: Specifies the drive letter and/or volume name.
Layout: Specifies the volume type, such as simple, spanned, or striped
for volumes on dynamic disks, or partition for basic disks.
Type: Specifies the type of disk on which the volume is located: basic or
dynamic.
File System: Specifies the file system that was used to format the
volume.
Status: Specifies the current status of the volume, using one of the
following values:
a. Failed - Indicates that the volume could not be started
b. Failed Redundancy - Indicates that a mirrored or RAID-5 volume is
no longer fault tolerant because of a disk failure
c. Formatting - Indicates that the volume is in the process of being
formatted
d. Healthy - Indicates that the volume is operating normally
e. Regenerating - Indicates that a RAID-5 volume is in the process of
re-creating data on a newly restored disk
f. Resyncing - Indicates that a mirrored volume is in the process of
re-creating data on a newly restored disk
g. Unknown - Indicates that the boot sector for the volume has been
corrupted
Capacity: Specifies the total capacity of the volume, in megabytes or
gigabytes.
Free Space: Specifies the total amount of free space on the volume, in
megabytes or gigabytes.
% Free: Specifies the percentage of the volume's capacity that is
free.
Fault Tolerance: Specifies whether the volume type provides fault
tolerance.
Overhead: Specifies the percentage of the volume's capacity devoted to
storing redundant data.

The bottom pane of the Disk Management console window contains a graphical
view of the physical disks in the computer. For each disk, the view specifies the
following information:
Disk identifier: Specifies the number assigned to the disk by the system.
Hard disk identifiers begin with Disk 0 and CD-ROM identifiers with CD-ROM 0.
Disk type: Specifies whether the disk is a basic disk, dynamic disk, CD-ROM, or
DVD-ROM.
Disk size: Specifies the total capacity of the disk.
Disk status: Specifies the current status of the disk, using one of the following
values:
a. Audio CD - Indicates that a CD-ROM or DVD-ROM drive contains an
audio CD.
b. Foreign - Indicates a dynamic disk that has been moved from another
computer but has not yet been imported into the current system's
configuration. Run the Import Foreign Disks command to access the
disk.
c. Initializing - Indicates that the disk is in the process of being converted
from a basic disk to a dynamic disk.
d. Missing - Indicates that a dynamic disk has been removed from the
computer, disconnected, or corrupted. Use the Reactivate Disk command
to access a previously disconnected disk.
e. No Media - Indicates that a CD-ROM, DVD-ROM, or removable disk drive
is currently empty.
f. Not Initialized - Indicates that the disk does not contain a valid
signature. Use Initialize Disk to activate the disk.
g. Online - Indicates that the disk is accessible and functioning normally.
h. Online (Errors) - Indicates that I/O errors have been detected on a region
of a dynamic disk.
i. Offline - Indicates that a dynamic disk is not accessible.
j. Unreadable - Indicates that the disk is not accessible, due to hardware
failure, I/O errors, or corruption on certain portions of the disk.
The horizontal bar representing each disk is divided into segments representing
the various volumes or partitions on that disk. Each volume or partition
segment is color coded to indicate whether it is a basic volume, a dynamic
volume of a particular type, or unallocated space. The segments also contain
some of the same information found in the volume list, such as the volume
name, capacity, file system, and current status.
The Disk Management snap-in enables you to specify what should appear in
the top and bottom panes by using the commands on the View menu. You can
reverse the volume list and the graphical view, or you can replace either one
with a disk list. The disk list contains much of the same information for each
disk as the graphical view, plus a Device Type, such as IDE or SCSI, and a
Partition Style, such as MBR or GPT.

Disk Management window


Disk Management can manage disk storage on local or remote systems.

6.4.1 Adding Storage (Adding New Hard Disk to Computer)


The process of adding more storage capacity to a Windows Server 2003 or
Windows Server 2008 computer consists of the following steps:
1. Physically install the disk(s).
2. Initialize the disk.
3. On a basic disk, create partitions, extended partitions and logical drives,
or, on a dynamic disk, create volumes.
4. Format the volumes.
5. Assign drive letters to the volumes, or mount the volumes to an empty
folder on existing NTFS volumes.
You must be a member of the Administrators or Backup Operators group, or
have been otherwise delegated authority, to perform most of these tasks. Only
administrators can format a volume.
If you start the Disk Management snap-in after installing a new disk, the
Initialize-Convert Disk Wizard usually appears automatically. The wizard
enables you to create a signature on the new disk and convert the default basic
disk to a dynamic disk. To initialize a disk manually using Disk Management,
right-click the disk's status box in the graphical view and, from the Action
menu, point to All Tasks and select Initialize Disk.
6.4.2 Creating Basic Disk Partitions
After you have initialized the disk, you can begin to implement a storage
structure of partitions, logical drives, or volumes. As mentioned earlier, all
newly initialized disks in Windows Server 2003 and 2008 are basic disks by
default. If you want to maintain the disk as a basic disk, you can create
partitions by selecting the unallocated space in the graphical view and, on the
Action menu, pointing to All Tasks and selecting New Partition. This launches
the New Partition Wizard, in which you specify whether you want to create a
primary partition or an extended partition and how large the partition should
be.

New Partition Wizard
If you create a primary partition, the wizard takes you through the process of
assigning a drive letter to the partition and formatting it, or you can choose to
perform these tasks later. If you create an extended partition, you must select
the Free Space area you just created and run the New Partition Wizard again,
this time opting to create a logical drive. You can create any number of logical
drives you want, until you have used all of the space in the extended partition.
Here again, the wizard enables you to format the logical drives as you create
them, or you can choose to format them later.
6.4.3 Converting Basic Disk to Dynamic Disk
If you want to use dynamic storage, you must convert a basic disk to a
dynamic disk before creating new volumes. However, when you convert the
system disk to dynamic storage, you must restart the system before you can
perform any further actions on the disk.
You can convert a basic disk to a dynamic disk at any time, even when you
have data stored on the disk. The structure of data on the disk is not modified,
so the existing data is not lost. However, the best practice when performing any
major disk manipulation is to back up your data first.
When you convert a basic disk that already contains partitions and logical
drives to a dynamic disk, those elements are converted to the equivalent
dynamic disk elements. In most cases, basic partitions and logical drives are
converted to simple volumes.

6.4.4 Converting a Dynamic Disk to a Basic Disk


Converting a dynamic disk back to basic storage wipes out all data on the
drive. Therefore, you must first back up all of the data on the disk. Then you
must delete all of the volumes on the dynamic disk. Only then can you select
the disk and select Convert to Basic Disk from the Action/All Tasks menu.
After creating basic partitions and logical drives, you can restore the data back
to the disk.
6.5 Creating Dynamic Disk Volumes
Once you have converted a disk to dynamic storage, you can proceed to create
volumes on it. Select an area of unallocated space on the disk in the graphical
view and, on the Action menu, point to All Tasks and select New Volume. The
New Volume Wizard appears.

New Volume Wizard


The volume types that are available for selection depend on the number of
dynamic disks in the computer with unallocated space available.
Creating Simple Volumes
If you have only one disk in the computer, you can create only simple volumes.
All you have to do to create a simple volume is specify its size. Then the New
Volume Wizard takes you through the process of assigning a drive letter to the
volume and formatting it.

Creating Other Volume Types


To create spanned, striped, or mirrored volumes, you must have at least two
dynamic disks with unallocated space available. To create a RAID-5 volume,
you must have at least three dynamic disks.

Disk Management Utility

By default, the disk you chose when creating the volume appears in the
Selected list. All of the other dynamic disks in the computer appear in the
Available list. To add a disk to the volume, you make a selection in the
Available list and click Add.
You can add up to 32 disks to a spanned, striped, or RAID-5 volume; a mirrored
volume uses only two disks.
Once you have selected the disks you want to use to create the volume, you
must specify the volume's size. The process varies slightly, depending on the
type of volume you are creating:
Spanned volumes can use any amount of space from each of the drives.
For each of the disks in the selected list, you specify the amount of space (in
megabytes) that you want to add to the spanned volume. The Total Volume Size
in Megabytes (MB) field displays the combined space from all the selected
drives.
Striped, mirrored, and RAID-5 volumes must use the same amount of space on
each of the selected disks. After you select the disks you want to use for the
volume, the Select The Amount Of Space In MB option specifies the maximum
amount of space that each disk can contribute, which is determined by the
disk with the least amount of free space. When you change the amount of
space for one disk, the wizard changes the amount of space contributed by the
other disks to match.
The total size of the volume is also calculated differently for the various volume
types:
For a spanned volume, the total size of the volume is the number of megabytes
you specified for the selected disks combined.
For a striped volume, the total size of the volume is the number of megabytes
you specified, multiplied by the number of disks you selected.
For a mirrored volume, the total size of the volume is the number of megabytes
you specified. This is because each of the disks contains an identical copy of
the data on the other disk.
For a RAID-5 volume, the total size of the volume is the number of megabytes
you specified, multiplied by the number of disks you selected minus one. This
is because the RAID-5 volume uses one disk's worth of space to store the parity
for the rest of the disk array.
After you configure these parameters, the wizard enables you to assign a drive
letter to the volume and format it, so that it is ready to use.
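
The four size rules above can be summarised in a short Python sketch. It is only a model of the arithmetic, not of the wizard itself; the function name volume_size_mb and the lowercase type names are assumptions made for this example.

```python
def volume_size_mb(volume_type, per_disk_mb):
    """Return the total volume size in MB for the given per-disk contributions."""
    count = len(per_disk_mb)
    if volume_type == "spanned":
        return sum(per_disk_mb)          # disks may contribute different amounts

    # Striped, mirrored and RAID-5 volumes use the same amount on every disk.
    if len(set(per_disk_mb)) != 1:
        raise ValueError("all disks must contribute the same amount of space")
    size = per_disk_mb[0]

    if volume_type == "striped":
        return size * count
    if volume_type == "mirrored":
        if count != 2:
            raise ValueError("a mirrored volume uses exactly two disks")
        return size                      # the second disk holds the copy
    if volume_type == "raid5":
        if count < 3:
            raise ValueError("a RAID-5 volume needs at least three disks")
        return size * (count - 1)        # one disk's worth is used for parity
    raise ValueError(f"unknown volume type: {volume_type}")


print(volume_size_mb("spanned", [500, 300, 200]))
print(volume_size_mb("striped", [400, 400]))
print(volume_size_mb("mirrored", [400, 400]))
print(volume_size_mb("raid5", [400, 400, 400]))
```

With 400 MB taken from each disk, striping across two disks yields 800 MB, mirroring yields 400 MB, and RAID-5 across three disks yields 800 MB.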
Working with Mirrored Volumes
A mirrored volume is also known as RAID-1 and requires only two disks.
A mirrored volume provides good performance along with excellent fault
tolerance. Two disks participate in a mirrored volume, and all data is written to
both disks simultaneously. For the best possible fault tolerance, you should
use disks connected to separate host adapters. This creates a configuration
called duplexing, which provides better performance and enables the volume to
survive an adapter failure as well as a disk failure.
Converting a simple volume to a mirrored volume:
In addition to creating a new mirrored volume, you can also convert a simple
volume into a mirrored volume by selecting the simple volume and, on the
Action menu, pointing to All Tasks and selecting Add Mirror. You must have
another dynamic disk in the computer with sufficient unallocated space to hold
a copy of the simple volume you selected.
Once you have created the mirror volume, the system begins copying data,
sector by sector, to the newly added disk. During that time, the volume status
is reported as resyncing.
Recovering from Mirrored Disk Failures:
The recovery process for a failed disk within a mirrored volume depends on the
type of failure. If a disk has experienced transient I/O errors, the volume on
both disks will show a status of Failed Redundancy.

Disk Management Utility

After you correct the cause of the I/O error (perhaps a bad cable connection or
power supply), select the volume on the problematic disk and, on the Action
menu, point to All Tasks and select Reactivate Volume. Or you can select the
disk and choose Reactivate Disk. Reactivating brings the disk or volume back
online. The system then resynchronizes the disks.
If you want to stop mirroring, you have three choices, depending on what you
want the outcome to be:
Delete the volume: If you delete the volume, the volume and all the
information it contains is removed. The resulting unallocated space is
then available for new volumes.

Remove the mirror: If you remove the mirror, the mirror is broken and
the space on one of the disks becomes unallocated. The other disk
maintains a copy of the data that had been mirrored, but that data is of
course no longer fault tolerant.
Break the mirror: If you break the mirror, the mirror is broken but both
disks maintain copies of the data. The portion of the mirror that you
select when you select Break Mirror maintains the original mirrored
volume's drive letter, shared folders, paging file, and reparse points. The
secondary drive is given the next available drive letter.
If you have a mirrored volume in which one physical disk has failed completely
and must be replaced, you can't simply remirror the mirrored volume, even
though one of the disks in the mirror set no longer exists. You must first
remove the failed disk from the mirror set to break the mirror. Select the
volume and, on the Action menu, point to All Tasks and select Remove Mirror.
In the Remove Mirror dialog box, it is important to select the disk that is
missing. The disk you select is deleted when you click Remove Mirror, and the
remaining disk becomes a simple volume. Once the operation is complete, you
can select the simple volume and use the Add Mirror command to use the
replacement disk to create a new mirror volume.
Working with RAID
As mentioned earlier in this chapter, RAID is a series of fault tolerance
technologies that enable a computer or operating system to respond to a
catastrophic event, such as a hardware failure, so that no data is lost and work
in progress is not corrupted or interrupted. You can implement RAID fault
tolerance as either hardware or a software solution. In a hardware solution, a
RAID adapter handles the creation and regeneration of redundant information.
Some vendors implement RAID data protection directly in their hardware, as
with disk array adapter cards. Because these methods are vendor specific and
bypass the operating system's fault tolerance software drivers, they offer
performance improvements over software implementations of RAID, like that
included in Windows Server 2003 and Server 2008.
Consider the following points when you decide whether to use a software or
hardware RAID implementation:

Hardware RAID implementations are more expensive than software RAID
and might limit equipment options to a single vendor.
Hardware RAID implementations generally provide faster disk I/O than
software RAID.
Hardware RAID implementations might include features such as hot
swapping of hard disks, to allow for replacement of a failed hard disk
without shutting down the computer, and hot sparing, so that a failed
disk is automatically replaced by an online spare.

Windows Server 2003 and 2008 support three levels of RAID: RAID-0, RAID-1
and RAID-5. Only RAID-1 and RAID-5 are fault tolerant.
The fault tolerance applies only to a single drive failure. If more than one disk
fails, data is lost and can be recovered only by restoring it from a backup.
Because RAID-5 volumes are created as native dynamic volumes from
unallocated space, you cannot convert any other type of volume into a RAID-5
volume without backing up that volume's data and restoring it into a newly
created RAID-5 volume.
If a single disk fails in a RAID-5 volume, the entire data store on the volume
remains accessible. During read operations, any missing data is regenerated on
the fly through a calculation involving remaining data and parity. Performance
is degraded during this time, and if a second drive fails, data is lost
irretrievably.
Once the failed drive is returned to service, you might need to use the Rescan
Disks command in Disk Management and then reactivate the volume on the
newly restored disk. The system then rebuilds missing data from the parity
information and repopulates the disk, leaving the volume fully functional and
fault tolerant again.
6.6 Backup Considerations
An organization can never be completely prepared for every natural disaster or
human foible that can bring down the network, but it can make sure that
it has a solid backup plan in place to minimize the impact of lost data.
Even if the worst happens, users don't have to lose days or weeks of work,
provided that a solid plan is in place. A backup plan is the set of
guidelines and schedules that determine which data should be backed up and
how often. A backup plan includes information such as:

What to back up
Where to back it up
When to back up
How often to back up
Who should be responsible for backups
Where media should be stored
How often to test backups
The procedure to follow in case of data loss

Copying Workstation Data to the Network


Servers must be backed up because they contain all the data for the entire
network. In most networks, workstations are not backed up because they
usually don't contain any data of major importance. This is only the case if the
users are trained properly and store all their data on the network. Users can
mistakenly save their data to their local workstation. Also, user application
configuration data is normally stored on the workstation. If a workstation's
hard disk fails, the configuration is lost. For backups to be successful,
the administrator needs to ensure that all necessary data is located on the
network. This can be done in two ways: user training and folder replication.
Training is time-consuming and costly, but productive in the long run. Users
should understand the general network layout and know how to save their data
in the proper place. This keeps all user data centralized and makes it easy for
the administrator to back up the data.

6.6.1 Backup Types


After you choose your backup medium and backup utility, you must
decide what type of backup to run. The types vary by how much data they back
up each time and by how many tapes it takes to restore data after a complete
system crash.
The three backup types are:
Full
Differential
Incremental

Full Backup

In a full backup, all network data is backed up (without skipping any files).
This type of backup is straightforward because you simply tell the software
which servers (and, if applicable, workstations) to back up and where to back
up the data, and then start the backup. If you have to do a restore after a
crash, you have only one set of tapes to restore from (as many tapes as it took
to back up everything). Simply insert the most recent full backup into the
drive and start the restore.

Differential Backup

In a differential backup strategy, a single full backup is done, typically once a
week. Every night for the next six nights, the backup utility backs up all files
that have changed since the last full backup (the actual differential backup).
After a week's worth of differential backups, another full backup is done,
starting the cycle all over again. With differential backups, you use a
maximum of two backup sessions to restore a file or group of files. Here's how
it works: The backup utility keeps track of which files have been backed up
through the use of the archive bit, which is simply a file attribute that
indicates whether the file has changed since it was last backed up. The archive
bit is cleared for each file backed up during the full backup. After that, any
time a program opens and changes a file, the network operating system (NOS)
sets the archive bit, indicating that the file has changed and needs to be
backed up. Then each night, in a differential backup, the backup program copies
every file that has its archive bit set, indicating the file has changed since
the last full backup. The archive bit is not touched during a differential
backup. When restoring a server after a complete server failure, you must
restore two sets of tapes: the last full backup and the most recent
differential backup. A full restoration may take longer, but each differential
backup takes much less time than a full backup.
This type of backup is used when the amount of time available each day to
perform a system backup (called the backup window) is smaller during the
week and larger on the weekend.
Notice that the amount of data becomes gradually larger every day as the
number of files that needs to get backed up increases. Remember that the
archive bit isnt cleared each day; so by the end of the week, the files that
changed at the beginning of the week may have been backed up several times,
even though they havent changed since the first part of the week.
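The archive-bit mechanism described above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular backup product is implemented; the file names and the order of changes are hypothetical.

```python
# Toy model of a full backup followed by nightly differential backups.
# File names and the change schedule are hypothetical.

def full_backup(archive_bits):
    """Copy every file and clear each archive bit."""
    copied = sorted(archive_bits)
    for name in archive_bits:
        archive_bits[name] = False
    return copied

def differential_backup(archive_bits):
    """Copy files whose archive bit is set; leave the bits untouched."""
    return sorted(name for name, bit in archive_bits.items() if bit)

def modify(archive_bits, name):
    """The operating system sets the archive bit when a file changes."""
    archive_bits[name] = True

archive_bits = {"a.doc": True, "b.xls": True, "c.txt": True}
full_backup(archive_bits)          # weekend: everything copied, bits cleared

modify(archive_bits, "a.doc")      # Monday: one file changes
monday = differential_backup(archive_bits)

modify(archive_bits, "b.xls")      # Tuesday: another file changes
tuesday = differential_backup(archive_bits)

print(monday)   # ['a.doc']
print(tuesday)  # ['a.doc', 'b.xls']
```

Because the differential run never clears the bit, a.doc is copied again on Tuesday even though it last changed on Monday, which is why the nightly backup grows through the week.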

Incremental Backup

In an incremental backup, a full backup is used in conjunction with daily
partial backups to back up the entire server, thus reducing the amount of time
it takes for a daily backup. With an incremental backup, the weekly full
backup takes place as it does during a differential backup, and the archive bit
is cleared during the full backup. The incremental, daily backups back up only
the data that has changed since the last backup of any kind (not the last full
backup). The archive bit is cleared each time a backup occurs. With this
method, only files that have changed since the previous day's backup are
backed up. Each day's backup is a different size because a different number of
files are modified each day.
This method provides the fastest daily backups for networks whose daily
backup window is extremely small. However, the network administrator does
pay a price for shortened backup sessions: restores after a server failure
take the longest of the three methods. The full backup set must be restored,
followed by every incremental tape made since that full backup, up to the day
of the failure.
The data backed up each day is different from day to day, but it is also much
smaller than in a differential or full backup. Furthermore, unlike the other
two backup types, incremental backups never back up the same unchanged
information twice.
Each backup type is used for a different purpose. Full backups are used when
restore time is at a premium. Incremental backups are used when backup time
is at a premium. Differential backups are a compromise between the two
methods.
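The restore-time trade-off between the three types can be made concrete with a small Python sketch. The function name and the labels for the backup sets are hypothetical; the sketch assumes a weekly full backup followed by one nightly backup per day (or a full backup every night for the "full" strategy).

```python
def restore_sets(strategy, nights_since_full):
    """Return the backup sets needed to rebuild a server after a failure.

    nights_since_full counts the nightly backups taken since the last
    full backup. Labels such as 'incremental-2' are illustrative only.
    """
    if strategy not in ("full", "differential", "incremental"):
        raise ValueError("unknown strategy: " + strategy)
    sets = ["full"]
    if strategy == "full" or nights_since_full == 0:
        return sets
    if strategy == "differential":
        # Only the most recent differential is needed.
        sets.append(f"differential-{nights_since_full}")
    else:
        # Every incremental since the full backup, in order.
        sets.extend(f"incremental-{n}" for n in range(1, nights_since_full + 1))
    return sets

for strategy in ("full", "differential", "incremental"):
    print(strategy, restore_sets(strategy, 4))
```

Four nights after the full backup, the differential strategy needs two sets while the incremental strategy needs five, which matches the observation that incremental restores take the longest.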

6.6.2 Windows Server Backup

Ref: blogs.technet.com

Backup and recovery of servers have always been a core component of
the business continuity process.
With more reliable hardware, the amount of time that a system
administrator spends on backup and recovery has decreased, but
management's expectations about server availability have also changed.
Users who accepted that a file server might be out of action for 24
hours in the late 1990s are unwilling to accept several hours of
downtime a decade later.
The Windows Server Backup feature provides a basic backup and
recovery solution for computers running the Windows Server 2008
operating system. Windows Server Backup introduces new backup and
recovery technology and replaces the previous Windows Backup
(Ntbackup.exe) feature that was available with earlier versions of the
Windows operating system.
The Windows Server Backup feature in Windows Server 2008 consists of
a Microsoft Management Console (MMC) snap-in and command-line tools
(wbadmin), that provide a complete solution for your day-to-day backup
and recovery needs. You can use four wizards to guide you through
running backups and recoveries. You can use Windows Server Backup to
back up a full server (all volumes), selected volumes, or the system state.
You can recover volumes, folders, files, certain applications, and the
system state. And, in case of disasters like hard disk failures, you can
perform a system recovery, which will restore your complete system onto
the new hard disk, by using a full server backup and the Windows
Recovery Environment.
You can use Windows Server Backup to create and manage backups for
the local computer or a remote computer. You can also schedule
backups to run automatically and you can perform one-time backups to
augment the scheduled backups.
Windows Server Backup includes the following improvements:

Faster backup technology
Simplified restoration
Simplified recovery of your operating system
Ability to recover applications
Improved scheduling
Offsite removal of backups for disaster protection
Remote administration of backup
Automatic disk usage management
wbadmin command-line support
Support for optical media drives and removable media

To access backup and recovery tools for Windows Server 2008, you must
install the Windows Server Backup, Command-line Tools, and Windows
PowerShell items that are available in the Add Features Wizard in Server
Manager. This installs the following tools:
Windows Server Backup Microsoft Management Console (MMC) snap-in
Wbadmin command-line tool
Windows Server Backup cmdlets (Windows PowerShell commands)
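As an illustration of the wbadmin command-line tool, the following Python sketch assembles a one-time "wbadmin start backup" command and, optionally, runs it. The target drive E: and the volume list are assumptions chosen for the example, and the helper function names are hypothetical. Running the command requires an elevated prompt on a Windows Server machine with the Windows Server Backup feature installed.

```python
import subprocess

def build_backup_command(target, volumes, all_critical=False):
    """Build the argument list for a one-time wbadmin backup."""
    command = [
        "wbadmin", "start", "backup",
        f"-backupTarget:{target}",           # where the backup is stored
        f"-include:{','.join(volumes)}",     # comma-separated volumes to back up
        "-quiet",                            # do not prompt for confirmation
    ]
    if all_critical:
        # Also include every volume that holds operating system components.
        command.append("-allCritical")
    return command

def run_backup(target, volumes, all_critical=False):
    """Run the backup; raises CalledProcessError if wbadmin reports failure."""
    return subprocess.run(build_backup_command(target, volumes, all_critical), check=True)

command = build_backup_command("E:", ["C:", "D:"])
print(" ".join(command))
```

Building the argument list separately from running it makes it easy to review or log the exact command before it touches the server.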
6.6.3 Scheduled Backup
Scheduled backups are data backup processes which proceed
automatically on a scheduled basis without additional computer or user
intervention. The advantage of using scheduled backups instead of
manual backups is that a backup process can be run during off-peak
hours when data is unlikely to be accessed, precluding or reducing the
impact of backup downtime.
Scheduled backups allow you to automate the backup process. After you
set the schedule, Windows Server Backup takes care of everything else. You can
set the schedule according to your organization's requirements.
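The idea of running a backup at a fixed off-peak time each day can be sketched in Python as follows. This is a generic illustration, not the scheduler that Windows Server Backup uses; the function name and the 1:00 A.M. start time are assumptions.

```python
from datetime import datetime, time, timedelta

def next_run(now, run_at):
    """Return the next date and time a daily backup scheduled for run_at should start."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate

# A backup scheduled for 1:00 A.M., checked at 2:30 P.M. on 15 January 2024.
print(next_run(datetime(2024, 1, 15, 14, 30), time(1, 0)))
```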
6.6.4 Remote Backup
A remote, online, or managed backup service, sometimes marketed as
cloud backup, is a service that provides users with a system for the
backup and storage of computer files. Online backup providers are
companies that provide this type of service to end users (or clients). Such
backup services are considered a form of cloud computing.
Online backup systems are typically built around a client software
program that runs on a schedule, typically once a day, and usually at
night while computers aren't in use. This program typically collects,
compresses, encrypts, and transfers the data to the remote backup
service provider's servers or off-site hardware.
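The collect-and-compress part of such a client program can be sketched with the Python standard library. The sketch below works on in-memory data with hypothetical file names, and adds a SHA-256 checksum so that the receiving side could verify the transfer. The encryption and network transfer steps are deliberately left out, because the standard library has no encryption module and the provider's interface is not specified here.

```python
import hashlib
import io
import tarfile

def build_archive(files):
    """Collect files (name -> bytes) into a gzip-compressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()

def checksum(payload):
    """Hex SHA-256 digest used to verify the archive after transfer."""
    return hashlib.sha256(payload).hexdigest()

def read_archive(payload):
    """Unpack an archive produced by build_archive back into name -> bytes."""
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
        return {m.name: archive.extractfile(m).read() for m in archive.getmembers()}

files = {"ledger.csv": b"id,amount\n1,100\n", "notes.txt": b"month-end close\n"}
payload = build_archive(files)
print(len(payload), checksum(payload)[:16])
```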
The Windows Server Backup tool can be used to connect to another
Windows Server 2008 computer and perform backup tasks as though the
backup were being performed on the local computer.
6.6.5 Offsite Backup
Offsite backups ensure that if the building that hosts your servers is
destroyed by flood, fire, or earthquake, your organization can still recover
its data.

In computing, off-site data protection, or vaulting, is the strategy of
sending critical data out of the main location (off the main site) as part of
a disaster recovery plan. Data is usually transported off-site using
removable storage media such as magnetic tape or optical storage. Data
can also be sent electronically via a remote backup service, which is
known as electronic vaulting or e-vaulting. Sending backups off-site
ensures systems and servers can be reloaded with the latest data in the
event of a disaster, accidental error, or system crash. Sending backups
off-site also ensures that there is a copy of pertinent data that isn't
stored on-site. Off-site backup services are convenient for companies
that back up pertinent data on a daily basis.
Although some organizations manage and store their own off-site
backups, many choose to have their backups managed and stored by
third parties who specialize in the commercial protection of off-site data.
6.6.6 Backup of workstations or client machines
Data protection is one of the most important responsibilities of an organization.
Although most of the data resides on servers, users still maintain their data on
client machines running systems such as Windows XP and Windows 7. That is why
backup and restoration of data on such machines is also very important.

Backup on Windows 7

In Windows 7, you are provided with the Backup and Restore window, which
contains various options that allow you to back up and restore the information
present on your system. While creating a backup of important files, the first
thing that you should consider is the backup destination, that is, where the
backup should be stored. You can save the backup of your data on any of
these storage devices:

An alternative internal hard drive
An external hard drive
Recordable DVDs
Universal Serial Bus (USB) flash drive
Network location

After deciding on the backup destination for your data, the next step is to
create the backup. Windows 7 supports the following two kinds of backup:

Creating System Image:

It refers to creating a backup of a complete volume to a .vhd disk
image file. This type of backup is helpful in situations where you
want to restore not only the saved files but also all the installed
applications on the computer.
A System Image is an exact image of the drive on which the
Windows operating system is installed, including system
settings, programs, and files. A system image helps in restoring
the contents of your computer, including the drive on
which Windows is installed. However, if you use a system image
to restore your computer, you must perform a complete
restoration of the system; you cannot restore only selected files
and folders.
At the time of creating the System Image backup, a
WindowsImageBackup folder is automatically created on the backup
media. A folder having the same name as your computer is
created in this backup folder to store the image of your system. Two
folders, namely the Catalog folder and the Backup folder, are
further created in your computer folder.

Backing up Files and Folders:

It refers to storing files and documents in .zip (compressed)
files. However, you should note that this type of backup does
not create a backup of the system files, programs, encrypted
files, temporary files, or the files stored in the Recycle Bin
folder.

To access the Backup utility on Windows 7:

1. Click Start > Control Panel > System and Security.
2. Select the Backup and Restore link in the System and Security window.
3. Click the Set up backup link and select the appropriate options for your backup.
In Windows 7, the Backup and Restore utility can also be used to create a
system repair disc, which can be used to boot your computer if it doesn't
start normally. It also contains Windows system recovery tools that can help
you recover Windows from a serious error or restore your computer from a
system image.

Above image displays Backup and Restore Window

Backup on Windows XP

We use the ntbackup.exe command to access the backup utility on Windows XP. This
backup utility is different from the one used on Windows 7. Windows XP supports
the following types of backup:
Normal
Copy
Incremental
Differential
Daily

Normal: backs up all selected files and marks each as backed up (clears
the archive bit).
Copy: backs up all selected files but does not mark them as backed up.
Incremental: backs up files only if they were created or modified since
the last normal or incremental backup, and marks them as backed up.
Differential: backs up only those files created or modified since the last
normal or incremental backup, but unlike an Incremental backup, a
Differential backup doesn't mark the files as backed up.
Daily: backs up only files created or modified that day (without
changing the files' archive bits).
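How the five backup types differ in selecting files and in treating the archive bit can be summarized in a short Python model. This is a sketch of the rules as described above, not the internals of ntbackup.exe; the table layout, the function name take_backup, and the sample files are hypothetical.

```python
# For each type: does it select only files whose archive bit is set,
# and does it clear the bit afterwards? (Daily selects by date instead.)
BACKUP_TYPES = {
    "normal":       {"only_changed": False, "clears_bit": True},
    "copy":         {"only_changed": False, "clears_bit": False},
    "incremental":  {"only_changed": True,  "clears_bit": True},
    "differential": {"only_changed": True,  "clears_bit": False},
}

def take_backup(kind, files):
    """Return the names backed up and update archive bits according to the type."""
    if kind == "daily":
        # Daily ignores the archive bit and never changes it.
        return sorted(name for name, f in files.items() if f["modified_today"])
    rule = BACKUP_TYPES[kind]
    selected = sorted(
        name for name, f in files.items()
        if f["archive"] or not rule["only_changed"]
    )
    if rule["clears_bit"]:
        for name in selected:
            files[name]["archive"] = False
    return selected

files = {
    "plan.doc":   {"archive": True,  "modified_today": True},
    "budget.xls": {"archive": False, "modified_today": False},
}
print(take_backup("differential", files))  # ['plan.doc'], bit left set
print(take_backup("incremental", files))   # ['plan.doc'], bit now cleared
print(take_backup("incremental", files))   # []
```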

To access the backup utility on Windows XP,
select Start > All Programs > Accessories > System Tools > Backup.

Backup window on Windows XP

To proceed with backup:


1. Select what to backup
a. Selected Files and Folders
b. Everything on a computer
c. Only System files
2. Select destination for your backup
a. Removable media
b. Network location
c. Different drive on same computer
3. Select type of backup
a. Normal
b. Copy
c. Incremental
d. Differential
e. Daily

In Windows XP, the backup utility can be used to restore a backup as
well as to create an Automated System Recovery (ASR) disk, which can be
used to repair the system after serious errors.
6.6.7 Volume Shadow Copy
Volume Shadow Copy (also called Volume Snapshot Service, Volume Shadow
Copy Service, or VSS) is a technology included in Microsoft Windows that
allows manual or automatic backup copies or snapshots of data to be taken,
even if the data is locked, on a specific volume at a specific point in time
or at regular intervals. It is implemented as a Windows service called the
Volume Shadow Copy service. Shadow Copy technology requires the file
system to be NTFS to be able to create and store shadow copies. Shadow
copies can be created on local and external (removable or network)
volumes by any Windows component that uses this technology, such as
when creating a scheduled Windows Backup or an automatic System
Restore point.
Snapshots have two primary purposes: they allow the creation of
consistent backups of a volume, ensuring that the contents cannot
change while the backup is being made; and they avoid problems with
file locking. By creating a read-only copy of the volume, backup programs
are able to access every file without interfering with other programs
writing to those same files.
The Volume Shadow Copy Service provides the backup infrastructure for
the Microsoft Windows XP, Microsoft Windows Server 2003, Windows 7,
and Microsoft Windows Server 2008 operating systems, as well as a
mechanism for creating consistent point-in-time copies of data known as
shadow copies.

Enabling shadow copies

The VSS service automatically takes a snapshot of the files and folders
located on any volume or partition where the service has been enabled.
These snapshots include an image of the contents of the folder at a given
point in time. Depending on the space you make available to it, you
could have up to 512 different snapshots of a disk volume. And because
Microsoft has made a client component of VSS, the Previous Versions
client, available along with VSS, users and administrators can have
access to these snapshots.
On regular file servers, this means that once VSS is implemented, users
can recover a lost file by themselves, from their own desks.
The shadow copy service is designed to assist in the process of recovering
previous versions of files without having to resort to backups. VSS only
works well if you have a lot of free space on your disks, but nevertheless,
it is a good solution and requires very little overhead to run.
Shadow copies can never be a replacement for backup, because the files are
not actually backed up. If a shadow copy is no longer available, the earlier
versions of the files are not available either.
By default, Windows Server 2008 creates shadow copies twice a day: at
7:00 A.M. and noon. This schedule can be changed if you find that it does
not meet your requirements.

Using Previous Version client to recover files

Advantages of Shadow Copy

You can recover files that have been accidentally deleted.
You can recover a previous version of a file that has been overwritten.
You can use file comparison to see the differences between the current
version of a file and a previous version.
On Microsoft Windows operating systems that support the shadow copy
service, it can be managed using Windows Explorer or the vssadmin command.
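The recovery and comparison scenarios listed above can be illustrated with a toy Python model. It does not call the VSS API; the snapshots are plain in-memory dictionaries, and the snapshot times, file name, and contents are hypothetical.

```python
import difflib

# Hypothetical snapshots keyed by the time they were taken.
snapshots = {
    "07:00": {"report.txt": "Q1 revenue: 100\n"},
    "12:00": {"report.txt": "Q1 revenue: 100\nQ2 revenue: 120\n"},
}
# The file as it exists now, after an unwanted overwrite.
current = {"report.txt": "Q1 revenue: 100\nQ2 revenue: 999\n"}

def previous_version(name, snapshot_time):
    """Return the content of a file as it was in the given snapshot."""
    return snapshots[snapshot_time][name]

def compare(name, snapshot_time):
    """Return a unified diff between a previous version and the current file."""
    old = previous_version(name, snapshot_time).splitlines(keepends=True)
    new = current[name].splitlines(keepends=True)
    return list(difflib.unified_diff(old, new, "previous", "current"))

diff = compare("report.txt", "12:00")
print("".join(diff))

# Recover the overwritten file from the noon snapshot.
current["report.txt"] = previous_version("report.txt", "12:00")
```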
6.6.8 System Recovery using System Restore
In Microsoft Windows XP and Windows 7, the System Restore feature is used to
restore the system files and settings to an earlier date and time, that is, a
time when the system was working properly. These points in time are called
restore points. System Restore uses the System Protection feature to create
and save restore points regularly. Restore points are also created
automatically whenever there is a significant change to the system, such as
the installation of a program or device. These restore points store
information about the registry settings of the system.
To create a restore point on Windows 7:
1. Click Control Panel > System and Security > System. The System
window appears.
2. Click the System protection link in the left pane of the window.
3. Click the Create button at the bottom of the dialog box, next to the
Create a restore point right now section.

Image displays the System Properties Dialog box

Sometimes, even the System Restore feature is not able to recover the lost data.
In such situations, you can implement any of the following three methods to
recover the lost data.

System Protection

Creates and saves information about the system files and settings in restore
points, which are created just before you start installing a program or device.
The system protection feature is turned on by default for the drive that holds
the operating system.

Advanced Recovery Methods

Specifies that you can either use a system image to restore your computer or
reinstall Windows. You can access these options from the Advanced Recovery
Methods window.

System Image Backup

Rewrites the complete content of the system volume. To recover your system
using this feature, you first need to boot your system from the Windows 7
installation DVD. Go to the Advanced Boot Options screen by pressing the F8
key and select the Repair your computer option. Next, select the System Image
Recovery option and specify the location of the backup.
6.7 Virus Protection
A virus is a program that makes malicious changes to your computer and
makes copies of itself. Sophisticated viruses encrypt and hide themselves to
thwart detection. There are tens of thousands of viruses that your computer can
catch.
Viruses can shut down an entire corporation. The types vary, but the approach
to handling them does not. You need to install virus protection software on all
computer equipment. Workstations, personal computers, servers, and firewalls
all must have virus protection, even if they never connect to your network.
They can still get viruses from removable storage media or Internet downloads.
As viruses can cause great damage to your organization and data, it is
necessary to detect and eradicate viruses from computer and network.
Types of Viruses
Several types of viruses exist, but the popular ones are file viruses, macro (data
file) viruses, and boot sector viruses. Each type differs slightly in the way it
works and how it infects your system. Many viruses attack popular
applications such as Microsoft Word, Excel, and PowerPoint because these
applications are easy to use and it's easy to create a virus for them. Because
writing a unique virus is considered a challenge by a bored programmer, viruses
are becoming more and more complex and harder to eradicate.
File Viruses
A file virus attacks executable application and system program files, such as
those ending in .COM, .EXE, and .DLL. Most of these types of viruses replace
some or all of the program code with their own. The virus can cause damage
only once the file is executed. This includes loading itself into memory and
waiting to infect other executables, further propagating its potentially
destructive effects throughout a system or network. Examples of file viruses are
Jerusalem and Nimda. Nimda, although usually seen as an Internet worm,
may also infect common Windows files, as well as files with extensions such as
.HTML, .HTM, and .ASP.
Macro Viruses
A macro is a series of commands and actions that are used to automatically
perform operations without a user's intervention. Macro viruses use the Visual
Basic macro scripting language to perform malicious or mischievous functions
in data files created with Microsoft Office products. Macro viruses are among
the least harmful but also the most annoying viruses. Since macros are easy to
write, macro viruses are among the most common viruses and are frequently found
in Microsoft Word and PowerPoint. They affect the file you are working on. For
example, you might be unable to save the file even though the Save function is
working, or you might be unable to open a new document; you can only open
a template. These viruses will not crash your system, but they are annoying.
Cap and Cap A are examples of macro viruses.
Boot Sector Viruses
Boot sector viruses get into the Master Boot Record (MBR). This is the very
first sector on your hard disk, and no applications are supposed to reside there.
At boot-up, the computer checks this sector to find a pointer to the operating
system. If you have a multi-operating-system boot between various versions or
instances of Windows, this is where the pointers are stored. A boot sector virus
will overwrite the boot sector, thereby making it look as if there is no pointer to
your operating system. When you power up the computer, you will see a
"Missing Operating System" or "Hard Disk Not Found" error message. Monkey B,
Michelangelo, Stoned, and Stealth Boot are examples of boot sector viruses.
Nearly any virus that falls under one of these three categories can be
implemented as a Trojan horse. Just as the Greeks in legend attacked Troy by
hiding within a giant horse, a Trojan virus hides within other programs and is
launched when the program in which it is hiding is launched. DMSETUP.EXE
and LOVE-LETTER-FOR-YOU.TXT.VBS are examples of known Trojan Horses.
Displaying extensions for known file types can help you remain vigilant against
such naming tricks. These are only a few of the types of viruses out there.
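The naming trick behind a file such as LOVE-LETTER-FOR-YOU.TXT.VBS can be checked for with a few lines of Python. This is a minimal sketch, not a substitute for antivirus software; the two extension lists are illustrative assumptions and are far from complete.

```python
from pathlib import PurePath

# Illustrative, incomplete lists chosen for this example.
EXECUTABLE_EXTENSIONS = {".vbs", ".exe", ".com", ".bat", ".js", ".scr"}
HARMLESS_LOOKING_EXTENSIONS = {".txt", ".doc", ".jpg", ".pdf"}

def looks_disguised(filename):
    """True if an executable extension hides behind a harmless-looking one."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    return (
        len(suffixes) >= 2
        and suffixes[-1] in EXECUTABLE_EXTENSIONS
        and suffixes[-2] in HARMLESS_LOOKING_EXTENSIONS
    )

for name in ("LOVE-LETTER-FOR-YOU.TXT.VBS", "DMSETUP.EXE", "notes.txt"):
    print(name, looks_disguised(name))
```

When known extensions are hidden, the first file would be displayed as LOVE-LETTER-FOR-YOU.TXT, which is why displaying extensions helps.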
Updating Antivirus Components
A typical antivirus program consists of two components:
The definition files
The engine
The definition files list the various viruses, their type, and their footprints and
specify how to remove them. More than 100 new viruses are found in the wild
each month. An antivirus program would be useless if it did not keep up with
all the new viruses. The engine accesses the definition files (or database), runs
the virus scans, cleans the files, and notifies the appropriate people and
accounts. Eventually viruses become so sophisticated that a new engine and
new technology are needed to combat them effectively.
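The division of labour between definition files and the engine can be sketched in Python. The definitions below are made-up names and byte patterns used only for illustration; real definition files are far richer, and real engines also clean files and send notifications.

```python
# Hypothetical definitions: virus name -> byte pattern (its "footprint").
DEFINITIONS = {
    "Example.FileVirus": b"\xde\xad\xbe\xef",
    "Example.MacroVirus": b"AutoOpen:Payload",
}

def scan(data, definitions=DEFINITIONS):
    """Engine step: report every definition whose footprint occurs in the data."""
    return sorted(name for name, footprint in definitions.items() if footprint in data)

infected = b"header" + b"\xde\xad\xbe\xef" + b"trailer"
print(scan(infected))        # ['Example.FileVirus']
print(scan(b"clean file"))   # []
```

Adding a new entry to DEFINITIONS extends what the unchanged scan function can detect, which is why definition updates are more frequent than engine upgrades.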
Heuristic scanning is a technology that allows an antivirus program to search
for a virus even if there is no definition for it. The engine looks for suspicious
activity that might indicate a virus. Be careful if you have this feature turned
on. A heuristic scan might detect more than viruses; removing harmless code
might cause unpredictable results.
For an antivirus program to be effective, you must upgrade, update, and scan
in a specific order:
Upgrade the antivirus engine.
Update the definition files.
Create an antivirus emergency boot disk.
Configure and run a full on-demand scan.
Schedule monthly full on-demand scans.
Configure and activate on-access scans.
Update the definition files monthly.
Make a new antivirus emergency boot disk monthly.
Get the latest update when fighting a virus outbreak.
Repeat all steps when you get a new engine.
We will look at the first two steps in using antivirus software in the following
sections. The other steps are beyond the scope of this book.
Upgrading an Antivirus Engine
An antivirus engine is the core program that runs the scanning process; virus
definitions are keyed to an engine version number. For example, a 3.x engine
will not work with 4.x definition files. When the manufacturer releases a new
engine, consider both the cost to upgrade and the added benefits. Before
installing new or upgraded software, back up your entire computer system,
including all data.
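The rule that definitions are keyed to an engine version can be expressed as a simple check. The sketch assumes that compatibility is decided by the major version number alone, as in the 3.x and 4.x example above; actual vendors may apply different rules, and the function names are hypothetical.

```python
def major(version):
    """Return the major part of a version string such as '4.7'."""
    return int(version.split(".")[0])

def compatible(engine_version, definition_version):
    """Assumed rule: definitions work only with an engine of the same major version."""
    return major(engine_version) == major(definition_version)

print(compatible("3.2", "4.0"))  # False: a 3.x engine with 4.x definitions
print(compatible("4.1", "4.7"))  # True
```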
Updating Definition Files
Every week you need to update your list of known viruses, called the virus
definition files. You can do this manually or automatically through the
manufacturer's website. You can use a staging server within your company to
download and then distribute the updates, or you can set up each computer to
download updates.
Scanning for Viruses
An antivirus scan is the process in which an antivirus program examines the
computer suspected of having a virus and eradicates any viruses it finds. There
are two types of antivirus scans:
On-demand
On-access
An on-demand scan searches a file, a directory, a drive, or an entire computer.
An on-access scan checks only the files you are currently accessing. To
maximize protection, you should use a combination of both types.
On-Demand Scans
An on-demand scan is a virus scan initiated by either a network administrator
or a user. You can manually or automatically initiate an on-demand scan.
Typically, you'd schedule a monthly on-demand scan, but you'll also want to do
an on-demand scan in the following situations:
After you first install the antivirus software.
When you upgrade the antivirus software engine.
When you suspect a virus outbreak.
Before you initiate an on-demand scan, be sure that you have the latest
virus definitions. When you encounter a virus, scan all potentially affected
hard disks and any floppy disks that could be suspicious. Establish a
cleaning station, and quarantine the infected area. Ask all users in the
infected area to stop using their computers. Perform a scan and clean at the
cleaning station. Run a full scan and clean the entire system on all
computers in the office space.
On-Access Scans
An on-access scan runs in the background when you open a file or use a
program. For example, an on-access scan can run when you do any of the
following:
Insert a floppy disk
Download a file with FTP
Receive e-mail messages and attachments
View a web page
The scan slows the processing speed of other programs, but it is worth the
inconvenience.
A relatively new form of malicious attack makes its way to your computer
through ActiveX and Java programs (applets). These are miniature programs
that run on a web server or that you download to your local machine. Most
ActiveX and Java applets are safe, but some contain viruses or snoop
programs. The snoop programs allow a hacker to look at everything on your
hard drive from a remote location without your knowing. Be sure that you
properly configure the on-access component of your antivirus software to check
and clean for all these types of attacks.
There is a host of great shareware and freeware available on the Internet today.
Titles include Microsoft Antispyware, Spybot Search & Destroy and Ad-Aware,
as well as Windows Update.
Many programs will not install unless you disable the on-access portion of your
antivirus software. This is dangerous if the program has a virus. Your safest
bet is to do an on-demand scan of the software before installation. Disable
on-access scanning during installation, and then reactivate it when the
installation is complete.

Emergency Scans
In an emergency scan, only the operating system and the antivirus program are
running. An emergency scan is called for after a virus has invaded your system
and taken control of a machine. In this situation, insert your antivirus
emergency boot disk and boot the infected computer from it. Then scan and
clean the entire computer.
Another possibility is to use an emergency scan website like
housecall.trendmicro.com. It allows you to scan your computer over a
high-speed Internet connection without using an emergency disk.
Software Revisions
Patches, fixes, service packs, and updates are all the same thing: free software
revisions. These are intermediary solutions until a new version of the product
is released. They may solve a particular problem, as does a security patch, or
change the way your system works, as does an update. You can apply a
so-called hot patch without rebooting your computer; in other cases, applying a
patch requires that the server go down.
Necessity of patches
Because patches are designed to fix problems, it would seem that you would
want to download the most current patches and apply them immediately. That
is not always the best thing to do. Patches can sometimes cause problems with
existing, older software. Different opinions exist regarding the application of the
newest patches. The first opinion is to keep your systems only as up-to-date as
necessary to keep them running. This is the "if it isn't broken, don't fix it"
approach. After all, the point of a patch is to fix your software. Why fix it if it
isn't broken? The other opinion is to keep the software as up-to-date as
possible because of the additional features that a patch will sometimes provide.
You must choose the approach that is best for your situation.
Where to Get Patches
Patches are available from several locations:
The manufacturer's website
The manufacturer's CD or DVD
The manufacturer's support subscriptions on CD or DVD
The manufacturer's bulletin
You'll notice in every case that the source of the patch, regardless of the
medium being used to distribute it, is the manufacturer. You cannot be sure
that patches available through online magazines, other companies, and
shareware websites are safe. Also, patches for the operating system are
sometimes included when you purchase a new computer.
How to Apply Patches

Just as you always need to plan for an upgrade, you need to plan for a patch.
Never blindly install patches (or any other new software) without examining the
potential impact on the network. Although patches are designed to fix known
problems, they may create new ones. It is best to try patches on a test network
or system before installing them on all systems on the network.

Summary

Various storage technologies, such as direct-attached storage (DAS),
network-attached storage (NAS), and storage-area networks (SANs), are
used to store organizational data.
The methods that provide fault tolerance for hard disk systems include:
-Mirroring
-Duplexing
-Data striping
-Redundant array of independent (or inexpensive) disks (RAID)
On Microsoft Windows operating systems such as Windows Server 2003, Windows
Server 2008, Windows XP Professional, and Windows 7, we can create dynamic
volumes.
The dynamic volume types supported by Microsoft are: Simple, Spanned,
Striped, Mirrored, and RAID-5.
The three backup types are: Full, Differential, and Incremental.
Some other types of backup are copy, daily, network, scheduled, remote,
and offsite.
Services that can be used to recover data as well as systems include
Volume Shadow Copy and System Restore (restore points).
Viruses can cause great damage to your organization and data. It is
necessary to detect and eradicate viruses from computer and network.
Updated antivirus software plays a very crucial role to defend against
viruses.

************************************************************************************

Case Study
The World Trade Center Disaster: Who Was Prepared?
A little after 8 a.m. on Tuesday morning, September 11, 2001, four cross-country
passenger jetliners with loaded fuel tanks were hijacked. One was crashed into
a section of the Pentagon; another plunged into the Pennsylvania countryside
when passengers prevented the hijackers from hitting their target. The other
two planes were crashed into New York City's two World Trade Center (WTC)
towers, ultimately causing them to collapse and kill more than 2,700 people.
All WTC offices were destroyed, a total of over 15 million square feet of office
space, an area equal to all Atlanta office space. Some of the nearby buildings,
including the World Financial Center (WFC), the American Express Building,
and 1 Liberty Plaza, were badly damaged and were immediately evacuated.
Some may have to be demolished. With the New York Stock Exchange (NYSE)
located so very close, the WTC area was the center of global finance and many
nearby financial firms were also adversely affected. Also affected were many
other companies, such as Lufthansa Airlines, and New York recruiting firm
Digital Market Research Inc., which lost telephone service and contact with
customers for a number of days because their telecommunications providers
were located in or near the WTC complex.
The financial industry's equipment loss was immense. The Tower Group
technology research company estimated that the securities firms alone will
spend up to $3.2 billion just to replace computer equipment. Much of the WTC
IT and telecommunications equipment was underground and was destroyed by
the collapsed debris. Tower calculates replacements will include 16,000
trading-desk workstations, 34,000 PCs, 8,000 servers, plus large numbers of
information computer terminals, printers, storage devices, and network hubs
and switches. Setting up this equipment will cost an additional $1.5 billion.
The most vital issue for many companies was their loss of staff. Few recovery
plans anticipated such a catastrophe. Organizations that were directly hit did
not even know who in their companies had survived or where they were
because hardly any kept secure, accessible lists of employees or contact
information. The New York Board of Trade (NYBT), which had its trading floor
in the WTC where it dealt in such commodities as coffee, orange juice, cocoa,
sugar and cotton, had to call all employees, one by one. Often survivors
couldn't be reached because area telephone facilities were destroyed while any
working circuits were overloaded. A few companies had considered some staff
problems. The Nasdaq stock exchange, with headquarters at nearby 1 Liberty
Plaza, had required many managers to carry two cell phones in case both the
telephone and one cell phone did not work. It also required every employee
from the chairman on down to carry a crisis line number card.
Disaster recovery companies did provide some work space for their customers.
Comdisco had seven WTC customers, and it made space available for 3,000
customer employees, enabling those companies to continue operations. Some
recovery companies, including SunGard, made available tractor-trailers
equipped with portable data centers. Not all plans worked. Barclays Bank had
planned for evacuating its 1,200-person investment-banking unit to its disaster
recovery site in New Jersey, but the site proved to be too small for so many
employees. Moreover, the bridges and tunnels crossing the Hudson River were
immediately closed so most employees could not get there. Fortunately, Barclays
was able to shift much of its work to its London, Hong Kong, and Tokyo offices,
although the time differences forced those workers to do double shifts.
Protecting against data loss is critical and often requires extensive planning. Many
organizations already relied on disaster recovery companies such as SunGard,
Comdisco and Recall, which offer office space, computers, and
telecommunications equipment when disasters occur. "Cold site" recovery
requires the companies to back up their own data onto tapes, storing them
offsite. If a disaster occurs, the organizations transport their backup tapes to
the recovery sites where they load and boot their applications from scratch
using their backup tapes. Although the cold site approach is relatively
inexpensive, restoring data can be slow, often taking up to 24 hours. If the
tapes are stored at the affected site or relatively close by, all data may be
permanently lost, which could put some companies out of business. Moreover
the data for all activity since the last backup will be lost.
"Hot site" backups can solve some of these problems, but they can cost some
companies as much as $1 million monthly. A hot site is located offsite, where a
reserve computer continually creates a mirror image of the production
computer's data. Should a data disaster occur, the company can quickly switch
over to the backup computer and continue to operate. If the production site
itself is destroyed, the staff will actually go to the hot site to operate.
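The exposure of a tape-based cold site, where all activity since the last backup is lost, can be expressed as a simple time calculation. The Python sketch below is illustrative only: the function name, the nightly 22:00 schedule, and the dates are hypothetical and are not taken from the case. A continuously mirrored hot site aims to shrink this window toward zero.

```python
from datetime import datetime, timedelta

def lost_window(backup_times, disaster_time):
    """Return how much activity is unrecoverable after a disaster.

    This is the time between the most recent backup completed at or before
    the disaster and the disaster itself. Returns None if no backup exists.
    """
    earlier = [t for t in backup_times if t <= disaster_time]
    if not earlier:
        return None
    return disaster_time - max(earlier)

# Hypothetical schedule: one backup every night at 22:00 for three nights.
first_backup = datetime(2024, 1, 1, 22, 0)
backups = [first_backup + timedelta(days=d) for d in range(3)]

# Hypothetical disaster at 09:00 on the third day.
disaster = datetime(2024, 1, 3, 9, 0)
print(lost_window(backups, disaster))
```

More frequent offsite backups reduce the worst-case window in direct proportion, which is the trade-off against the higher cost of a hot site.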
While many companies lost a lot of data in the attack, a recent Morgan Stanley
technology team report said the WTC was "probably one of the best-prepared
office facilities from a systems and data recovery perspective." Lower
Manhattan's extraordinary data security concern erupted in 1993 when a large
bomb exploded in the subterranean parking area of the WTC in a terrorist
attack. Six people were killed and more than 1,000 were injured. Realizing how
vulnerable they were, many companies took steps to protect themselves.
Pressures for emergency planning further increased as companies faced the
feared Y2K problems. As a result, the data for many organizations were
relatively well protected when the 9/11 WTC attack occurred. Let us look at
how some organizations responded to the attack.
Prior to 1993, to protect itself, the NYBT had contracted with SunGard Data
Systems Inc. for "cold site" disaster recovery. After the 1993 bombing it decided
to establish its own hot site. It rented a computer and trading floor space in
Queens for $300,000 annually. It hired Comdisco to help it set up the hot
backup site, which it hoped to never have to use despite the expense. After the
attack the NYBT quickly moved its operations to Queens and began trading on
September 17, along with the NYSE, Nasdaq, and the other exchanges that had
not suffered direct hits.
Sometimes backups are too limited. Most disaster recovery companies and
their clients have been too focused on recovery of mainframes and needed
extensive help to recover midrange systems and servers. Moreover, backups are
often stored in the same office or site and so are useless if the location is
destroyed. For example, the Board of Trade backed up only some servers and
PCs, and those backups were stored in a fireproof safe in the WTC, where they
are now buried beneath many thousands of tons of rubble.
Giant bond trader Cantor Fitzgerald occupied several top floors in one of the
WTC buildings and lost its offices and perhaps 700 of its 1,000 American staff.
No company could have adequately planned for the magnitude of its disaster.
However, Cantor was almost immediately able to shift its functions to its
Connecticut and London offices and its surviving U.S. traders began settling
trades by telephone. Despite its enormous losses, the company amazingly
resumed operations in just two days, partly with the help of backup
companies, software and computer systems. One reason for its rapid recovery
was Recall, Cantor's disaster recovery company. Recall had up-to-date Cantor
data because it had been picking up Cantor backup tapes three to five times
daily. Moreover, in 1999 Cantor had started switching much of its trading to
eSpeed, its fully automated online system. Investors were attracted partly
because users of eSpeed were given a 10% discount. After the WTC disaster
Peter DaPuzzo, a founder and head of Cantor Fitzgerald, decided that the
company would not replace any of the over 100 lost bond traders. Instead the
company switched its entire bond trading to eSpeed.
America's oldest bank, the Bank of New York (BONY), is a critical hub for
securities processing because it is one of the largest custodians and clearing
institutions in the United States. Half the trading in U.S. government bonds
moves through its settlement system. The bank also handles around 140,000
fund transfers totaling $900 billion every day. Since the bank facilitates the
transfer of cash between buyers and sellers, any outage or disruption of its
systems would leave some firms short of anticipated cash already promised to
others. BONY was under extraordinary pressure to keep running at full speed.
BONY operations were heavily concentrated in downtown Manhattan, very
close to the World Trade Center. The bank is headquartered at 1 Wall Street,
almost abutting the WTC, and had two other sites on Barclay and Church
Streets that were even closer. These buildings housed 5,300 employees plus
the bank's main computer center. On September 11, the bank lost the two
closest sites and their equipment. The bank had arranged for its computer
processing to revert to centers outside New York in case of emergency, but it
was not able to follow its plan. The World Trade Center attack had heavily
damaged a major Verizon switching station at 140 West Street serving 3 million
data circuits in lower Manhattan. The loss of this switching station left BONY
without any bandwidth for transmitting voice and data communications to
downtown New York, and the bank struggled to find ways to connect with
customers.
The bank's disaster recovery plan called for paper check processing to be
moved from its financial district computer center to its Cherry Hill, New Jersey
facility. With communication so disrupted, BONY management decided Cherry
Hill was too distant and moved the functions to its closer center in Lodi, New
Jersey. However, that center lacked machines for its lockbox business, in
which it opens envelopes that contain bill payments, deposits checks, and
reads payment stubs to credit the right accounts.
The bank had deliberately planned to have different levels of backup for different
functions. The bank's government bond processing was backed up by a second
computer that could take over on a moment's notice. No such backup existed
for the bank's 350 automated teller machines. The bank rationalized that its
customers could use other banks' machines in case of a problem, and after the
attack its customers were forced to do exactly that. Even the backup system for the government
bond business did not work properly because the communication lines between
its backup sites and clients' backup sites were often of low capacity and had
not been fully tested and debugged. For example, BONY's required connection
to the Government Securities Clearing Corporation, a central component of the
government bond market, failed, so tapes had to be driven to them for several
days. Trades were properly posted but clients could not obtain timely reports
on their positions. The bank had also established redundant
telecommunications facilities in case of problems with one line, but they turned
out to be routed through the same physical phone facilities. John Costas, the
president and COO of UBS Warburg, explained "We've all learned that when we
have backup lines, we should know a lot more about where they run."
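The lesson about backup lines that share physical facilities can be turned into a simple check. The Python sketch below is illustrative only and assumes that each line's route is known as a list of the intermediate facilities it passes through; the function name, line names, and facility names are hypothetical.

```python
def shared_facilities(routes):
    """Return the facilities that every route passes through.

    routes -- dict of line name -> list of intermediate facility names.
    Any facility in the result is a single point of failure for all lines.
    """
    sets = [set(path) for path in routes.values()]
    return set.intersection(*sets) if sets else set()

# Hypothetical routes for a primary line and its supposed backup.
routes = {
    "primary": ["exchange-A", "carrier-hub-1"],
    "backup": ["exchange-A", "carrier-hub-2"],
}

common = shared_facilities(routes)
if common:
    print("Lines are not independent; shared facilities:", sorted(common))
else:
    print("No shared facilities found.")
```

In this example both lines pass through the same exchange, so losing that one facility would take down the primary line and its backup together.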
As a result, customers expecting funds from the Bank of New York didn't
receive them on time and had to borrow emergency cash from
the Federal Reserve. Yet Thomas A. Renyi, the Bank of New York's chairman,
expressed pride in how the bank had responded. He said "Our longstanding
disaster recovery plans worked, and they worked in the extreme." It will be
months before BONY can return to its computer center at 101 Barclay Street
and the bank is working with IBM on where to locate an interim computer
center and ways to improve its backup systems.
The Nasdaq stock exchange seems to have had more success. It has no trading
floor anywhere but instead is a vast distributed network with over 7,000
workstations at about 2,500 sites, all connected to its network through at least
20 points of presence (POPs). The POPs in turn are doubly or triply connected
to its main network and data centers in Connecticut and Maryland. Nasdaq's
headquarters at 1 Liberty Plaza were heavily damaged. Its operational staff and
its press and broadcast functions are housed in its Times Square building. On
September 11 (Tuesday), Nasdaq opened at 8am as usual, but it closed at
9:15am and did not open again until the following Monday, when the NYSE
and other exchanges resumed trading. Nasdaq was well prepared for the
disaster with its highly redundant setup. It even had many cameras and
monitoring systems so that the company would know what actually happened
if a disaster or other crisis should strike. Nasdaq had even purposely
established a very close relationship with WorldCom, its telecommunications
provider, and it had made sure WorldCom had access to different networks for
the purpose of redundancy.
At first Nasdaq established a command center at its Times Square office, but
the implosion of the WTC buildings destroyed Nasdaq's telephone switches
connected to that office, and so the essential staff members were quickly moved
to a nearby hotel. Management immediately addressed the personnel situation,
creating an executive locator system in Maryland with everyone's names and
telephone numbers and a list of the still missing. Next they evaluated the
physical situation (what was destroyed, what ceased to work, where work
could proceed) while finding offices for the 127 employees who worked near
the WTC. Next they started to evaluate the regulatory and trading industry
situations and the conditions of Nasdaq's trading companies. The security staff
was placed on high alert to search for attempted penetration of the building or
the network.
On Wednesday, September 12, Nasdaq management determined that 30 of the
300 firms they called would not be able to open the next day, 10 of which
needed to operate out of backup centers. Management assigned some of its
own staff to work with all 30 of them to help solve their problems. The next day
they learned that the devastated lower Manhattan telecommunications would
not be ready to support a Nasdaq opening the following day. They decided to
postpone Nasdaq's opening until Monday, September 17. On Saturday and
again on Sunday they successfully ran industry-wide testing. On Monday, only
six days after the attack, Nasdaq opened and successfully processed 2.7 billion
shares, by far its largest volume ever.
Nasdaq found its distributed systems worked very well, while its rapid recovery
validated the necessity for two network topologies. Moreover, while Nasdaq lost
no senior staff, the company had three dispersed management sites, and had it
lost one, the company could still operate because of the leadership at its two
remaining sites. Nasdaq also realized its extensive crisis management
rehearsals for both Y2K and the conversion to decimals had proven vital,
verifying the need to schedule more rehearsals regularly. The company even
recognized how critical ongoing communications were, and so it formalized
regular nationwide company telecommunication forums. It even established
automatic triggers for regular communications forums with the Securities and
Exchange Commission (SEC).
Sources: Martin J. Garvey, "Disaster Recovery Isn't Just for Big Business," Information Week,
April 1, 2002; John Pallatto, "Contingency Planning," Internet World, May 2002; Juliana
Gruenwald, "Communications That Won't Quit," Fortune/CNET Tech Review, Winter 2002;
Anthony Guerra, "The Buck Stopped Here: BONY's Disaster Recovery Comes Under Attack,"
Wall Street and Technology, November 2001; Saul Hansell with Riva D. Atlas, "Disruptions Put
Bank of New York to the Test," The New York Times, October 6, 2001; Tom Field, "How
Nasdaq Bounced Back," CIO Magazine, November 1, 2001; Dennis K. Berman and Calmetta
Coleman, "Companies Test System-Backup Plans as They Struggle to Recover Lost Data," The
Wall Street Journal, September 13, 2001; Jayson Blair, "A Nation Challenged: The Computers,"
The New York Times, September 20, 2001; Debra Donston, "Disaster Recovery's Core
Component: People," eWeek, September 13, 2001; Tom Field, "Disaster Recovery: Nasdaq,"
CIO, October 12, 2001; John Foley, "Ready for Anything?" Information Week, September 24,
2001; Sharon Gaudin, "Protecting a Net in a Time of Terrorism," Network World Fusion,
September 24, 2001; Stan Gibson, "Mobilizing IT," eWeek, September 17, 2001; Eugene Grygo
and Jennifer Jones, "U.S. Recovery: Cost of Rebuilding N.Y. IT Infrastructures Estimated at $3.2
Billion," InfoWorld, September 19, 2001; Edward Iwata and Jon Schwartz, "Tech Firms Jump In
to Help Companies Mobilize to Rebuild Systems, Reclaim Lost Data," USA Today, September
19, 2001; April Jacobs, "Good Planning Kept NASDAQ Running During Attacks," Network
World Fusion, September 24, 2001; Suzanne Kapner, "Wall Street Runs Through London," The
New York Times, September 27, 2001; Richard Karpinski, "E-Business Aftermath,"
InternetWeek, September 24, 2001; Diane Rezendes Khirallah, "Disaster Takes Toll on Public
Network," Information Week, September 17, 2001; Daniel Machalaba and Carrick Mollenkamp,
"Companies Struggle to Cope with Chaos, Breakdowns and Trauma," The Wall Street Journal,
September 13, 2001; Paul McDougall and Rick Whiting, "Assessing the Impact (Part One),"
Information Week, September 17, 2001; Patrick McGeehan, "A Nation Challenged: Wall Street,"
The New York Times, September 21, 2001; Paula Musich, "Rising From the Rubble," eWeek,
September 24, 2001; Kathleen Ohlson, "Businesses Start the Recovery Process," Network World
Fusion, September 12, 2001; Julia Scheeres, "Attack Can't Erase Stored Data," wired.com,
September 21, 2001; Carol Sliwa, "New York Board of Trade Gets Back to Business,"
Computerworld, September 24, 2001; Marc L. Songini, "Supply Chains Face Changes After
Attacks," Computerworld, October 1, 2001; Bob Tedeschi, "More Web Spending with a Focus,"
The New York Times, October 8, 2001; Dan Verton, "IT Operations Damaged in Pentagon
Attack," Computerworld, September 24, 2001; Shawn Tully, "Rebuilding Wall Street," Fortune,
October 1, 2001; "WTC Technology Replacement Costs Billions," excite.com, September 14,
2001.

************************************************************************************

