
ITIL Incident Management

An 'Incident' is any event which is not part of the standard operation of the service and which causes, or may cause, an interruption or a reduction of the quality of the service. The objective of Incident Management is to restore normal operations as quickly as possible, with the least possible impact on either the business or the user, at a cost-effective price. Inputs for Incident Management mostly come from users, but can have other sources as well, such as management information or detection systems. The outputs of the process are RFCs (Requests for Change), resolved and closed Incidents, management information and communication to the customer.

Activities of the Incident Management process:

- Incident detection and recording
- Classification and initial support
- Investigation and diagnosis
- Resolution and recovery
- Incident closure
- Incident ownership, monitoring, tracking and communication

These elements provide a baseline for management review.

Incident Management Quick Overview

Mission Statement


Restore normal state IT service operations as quickly as possible to minimize the adverse impact on business operations.

Process Goal

Achieve the process mission by implementing:

- ITIL-aligned Incident Management Policies, Processes and Procedures
- Incident escalation standards
- Dedicated Incident Management Process Owner
- Incident classification categories
- Incident reports
- Incident communications and education for IT staff

Critical Success Factors (CSFs)


The Critical Success Factors are:

- Maintaining IT Service Quality
- Maintaining Customer Satisfaction
- Resolving Incidents Within Established Service Times

Key Activities

The key activities for this process are:

- Detect and record incidents
- Classify incidents
- Provide initial incident support
- Prioritize incidents based on impact and urgency
- Investigate and diagnose incidents
- Resolve incidents and recover service per agreed service levels
- Close incidents
- Maintain ownership, monitoring, tracking and communications about incidents
- Provide management information about Incident Management quality and operations

Key Performance Indicators (KPIs)

Examples of Key Process Performance Indicators (KPIs) are shown in the lists below. Each one is mapped to a Critical Success Factor (CSF).

Maintaining IT Service Quality


- Number of Severity 1 incidents (total and by category)
- Number of Severity 2 incidents (total and by category)
- Number of other incidents (total and by category)
- Number of incidents incorrectly categorized
- Number of incidents incorrectly escalated
- Number of incidents bypassing Service Desk
- Number of incidents not closed/resolved with workarounds
- Number of incidents resolved before customers notice
- Number of incidents reopened

Maintaining Customer Satisfaction


- Number of User/Customer surveys sent
- Number of User/Customer surveys responded to
- Average User/Customer survey score (total and by question category)
- Average queue time waiting for Incident response

Resolving Incidents Within Established Service Times


- Number of incidents logged
- Number of incidents resolved by Service Desk
- Number of incidents escalated by Service Desk
- Average time to restore service from point of first call
- Average time to restore Severity 1 incidents
- Average time to restore Severity 2 incidents
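As an illustration only, the service-time KPIs above could be computed from a ticket export along the following lines. This is a minimal sketch: the `Incident` fields, the `service_time_kpis` function and the sample data are all hypothetical names invented for the example, not part of ITIL or of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Incident:
    # All field names are hypothetical; map them to your own ticket export.
    severity: int                     # 1 = most severe
    logged_at: datetime               # point of first call
    restored_at: Optional[datetime]   # None while service is not yet restored
    resolved_by_service_desk: bool
    escalated: bool


def service_time_kpis(incidents):
    """Return the service-time KPIs as a plain dictionary."""
    restored = [i for i in incidents if i.restored_at is not None]

    def avg_minutes(items):
        # Average restore time in minutes, or None when there is no data.
        if not items:
            return None
        return mean((i.restored_at - i.logged_at).total_seconds() / 60 for i in items)

    return {
        "logged": len(incidents),
        "resolved_by_service_desk": sum(i.resolved_by_service_desk for i in incidents),
        "escalated_by_service_desk": sum(i.escalated for i in incidents),
        "avg_restore_minutes": avg_minutes(restored),
        "avg_restore_minutes_sev1": avg_minutes([i for i in restored if i.severity == 1]),
        "avg_restore_minutes_sev2": avg_minutes([i for i in restored if i.severity == 2]),
    }


# Made-up sample data for illustration.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Incident(1, t0, t0 + timedelta(minutes=30), False, True),
    Incident(1, t0, t0 + timedelta(minutes=90), False, True),
    Incident(2, t0, t0 + timedelta(minutes=20), True, False),
    Incident(3, t0, None, False, False),  # still open, excluded from averages
]
kpis = service_time_kpis(sample)
```

Open incidents are counted as logged but left out of the averages, which is one possible convention; a real report would need to state which convention it uses.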

The Difference Between Incident Management And Problem Management


Incidents and Service Requests are formally managed through a staged process to conclusion. This process is referred to as the "Incident Management Lifecycle". The objective of the Incident Management Lifecycle is to restore the service as quickly as possible to meet Service Level Agreements. The process is primarily aimed at the user level.

Problem Management deals with resolving the underlying cause of one or more Incidents. The focus of Problem Management is to resolve the root cause of errors and to find permanent solutions. Although every effort will be made to resolve the problem as quickly as possible, this process is focused on the resolution of the problem rather than the speed of the resolution. This process operates at the enterprise level.
ITIL Incident Management
You know the call. We have all received it: the break-fix call. It can come at any time of day, and it tells you three main things: someone is unhappy, someone has to handle this, and there must be a better way to manage this aspect of support. There is some good news here: ITIL has best-practice guidelines for dealing with incident management. Looked at from a high level, there is a pattern of actions that can be taken to resolve an incident. Like all other processes, incident management has inputs, outputs, and management activities. The parts of an incident management process are:

Inputs - Inputs are key to the process. Incident details are received from the service desk, network, or computer operations. Inputs take many forms: break-fix issues, service requests, and automatic monitoring alerts.

Outputs - The obvious outputs of incident management are the closed incident and restored application availability. At a higher level, the outputs also include user satisfaction, improved productivity for all, customer follow-up and communication, incident report documentation, and management information.

Incident Management Activities - The activities are detection and reporting; classification and initial support; investigation and diagnosis; resolution and recovery; incident closure; and incident ownership, monitoring, tracking, and communication.

To review the flow of events: most customer and user incidents are initially reported to the service desk. This gives the service desk ownership of handling and tracking the incident from beginning to end, even though the work may be completed in coordination with other departments. The activities of an incident management process are:

Incident detection and reporting - The act of learning that an incident has occurred and recording the basic details related to it.

Classification and initial support - Categorizing the incident by matching it against the knowledge base of issues, assigning a priority, assessing whether it is related to configuration details, providing initial support, and then either closing the incident or routing it to a specialist group.

Investigation and diagnosis - Assessing the incident details, collecting and analyzing the related information, and either resolving the incident or routing it to line support.

Resolution and recovery - Completing the incident by using a solution or workaround, or by raising a request for change.

Incident closure - Confirming the resolution with the reporter of the incident and closing the incident.

Incident ownership, monitoring, tracking, and communications - All the activities that surround monitoring the incident, escalating it, and informing the user of the latest status, key accomplishments, and next steps.
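One way to picture the activities above is as a small state machine in which an incident record may only move between permitted stages. The sketch below is illustrative only: the state names, the `ALLOWED` transition table and the `IncidentRecord` class are hypothetical, and the reopen transition from resolved back to investigating is an assumption rather than something the text prescribes.

```python
from enum import Enum


class State(Enum):
    RECORDED = "recorded"
    CLASSIFIED = "classified"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed moves between stages. CLASSIFIED -> RESOLVED covers incidents that
# the first line fixes during initial support without routing them onward.
ALLOWED = {
    State.RECORDED: {State.CLASSIFIED},
    State.CLASSIFIED: {State.INVESTIGATING, State.RESOLVED},
    State.INVESTIGATING: {State.RESOLVED},
    # Assumption: a fix rejected by the reporter sends the incident back.
    State.RESOLVED: {State.CLOSED, State.INVESTIGATING},
    State.CLOSED: set(),
}


class IncidentRecord:
    def __init__(self, summary):
        self.summary = summary
        self.state = State.RECORDED
        # The history supports tracking and status communication.
        self.history = [State.RECORDED]

    def advance(self, new_state):
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot move from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)


# Example: an incident fixed by the first line and confirmed by the reporter.
record = IncidentRecord("Printer on floor 2 is offline")
record.advance(State.CLASSIFIED)
record.advance(State.RESOLVED)
record.advance(State.CLOSED)
```

Keeping the transitions in a table rather than in scattered conditionals makes it easy to review the permitted flow with the process owner.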

Imagine how helpful this would be if it were in place. Wouldn't anyone work towards this ideal? To elaborate a little further: as important as roles and responsibilities are to any effective plan, the tools to get the job done are just as important. You simply need the right tools to work effectively. Tools commonly used in incident management are:

Automatic incident logging and alerting - This tool can automatically log incidents and alert support personnel when a fault is detected on mainframes, networks, or servers, possibly through an interface to system management tools.

Automatic escalation facilities - These help with the timely handling of incidents and service requests. Imagine automatic notification instead of constantly checking a group's queue in a worklist.

Highly flexible routing of incidents - This is a requirement when control staff are located at multiple sites or collocated in an operations bridge, so that incident calls can be routed efficiently and effectively.

Automatic extraction of data records - It is helpful to extract the records of a failed item and the items it affects automatically from the configuration management database (CMDB).

Specialized software - This software improves the speed and effectiveness of handling incidents. BMC is one example of such a system. It can help with accurate classification of incidents and successful matching at the point of alert.

Telephone systems integration - This can be used to register the names and phone numbers of users automatically.

Diagnostic tools - These tools assist with the diagnostic process so that support staff can more quickly diagnose the source of incidents.
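To make the idea of an automatic escalation facility concrete, here is a minimal sketch of the check such a tool might run on a schedule. The thresholds in `ESCALATE_AFTER`, the tuple layout of the queue and the function name are all assumptions made for the example; real values would come from the agreed service levels.

```python
from datetime import datetime, timedelta

# Hypothetical escalation thresholds per priority; real values come from the SLA.
ESCALATE_AFTER = {
    1: timedelta(minutes=30),
    2: timedelta(hours=2),
    3: timedelta(hours=8),
}


def due_for_escalation(open_incidents, now):
    """Return the ids of open incidents whose age exceeds their threshold.

    open_incidents is an iterable of (incident_id, priority, opened_at) tuples.
    """
    due = []
    for incident_id, priority, opened_at in open_incidents:
        if now - opened_at > ESCALATE_AFTER[priority]:
            due.append(incident_id)
    return due


# Made-up queue for illustration.
now = datetime(2024, 1, 1, 12, 0)
queue = [
    ("INC-1", 1, now - timedelta(minutes=45)),  # past the 30-minute threshold
    ("INC-2", 2, now - timedelta(minutes=45)),  # still within 2 hours
    ("INC-3", 3, now - timedelta(hours=9)),     # past the 8-hour threshold
]
overdue = due_for_escalation(queue, now)
```

In practice the returned ids would feed a notification step, so that staff are told about overdue incidents instead of polling the queue themselves.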

A constant refrain is that you can't manage what you cannot measure. Normally the incident manager is accountable and responsible for reporting the performance of the incident management process. Reporting requires clearly defined objectives with measurable targets that can provide performance information. Common metrics used to report the effectiveness and efficiency of the incident management process are:

- Incident volume refers to the total number of incidents that are handled by the incident management process.
- Mean elapsed time shows how much time was taken to achieve incident resolution or circumvention. The time is broken down by impact code.
- Incident response time refers to the percentage of incidents handled within the agreed-upon response times, which may have been specified in service level agreements by impact code, for example.
- Average incident cost refers to the average cost of each incident.
- The percentage of incidents closed refers to the percentage of incidents closed by the service desk without reference to other levels of support.
- The number and percentage of incidents resolved remotely refers to those incidents that were taken care of off-site, with no physical visit.
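Several of these metrics are simple ratios over the incident records. The following sketch shows one possible calculation; the dictionary keys, the `TARGET_MINUTES` response targets and the sample records are hypothetical and chosen only to illustrate the arithmetic.

```python
from collections import defaultdict

# Hypothetical agreed response times in minutes, keyed by impact code.
TARGET_MINUTES = {"high": 15, "medium": 60, "low": 240}


def percentage(part, whole):
    """Percentage of part in whole, or 0.0 when there is nothing to count."""
    return 100.0 * part / whole if whole else 0.0


def process_metrics(incidents):
    """Compute volume and percentage metrics.

    incidents is a list of dicts with the keys impact, response_minutes,
    closed_by_service_desk and resolved_remotely (all names are assumptions).
    """
    within = defaultdict(int)
    total = defaultdict(int)
    for inc in incidents:
        total[inc["impact"]] += 1
        if inc["response_minutes"] <= TARGET_MINUTES[inc["impact"]]:
            within[inc["impact"]] += 1

    n = len(incidents)
    return {
        "volume": n,
        "pct_within_response_time": {
            impact: percentage(within[impact], total[impact]) for impact in total
        },
        "pct_closed_by_service_desk": percentage(
            sum(inc["closed_by_service_desk"] for inc in incidents), n
        ),
        "pct_resolved_remotely": percentage(
            sum(inc["resolved_remotely"] for inc in incidents), n
        ),
    }


# Made-up records for illustration.
records = [
    {"impact": "high", "response_minutes": 10, "closed_by_service_desk": True, "resolved_remotely": True},
    {"impact": "high", "response_minutes": 20, "closed_by_service_desk": False, "resolved_remotely": True},
    {"impact": "low", "response_minutes": 120, "closed_by_service_desk": True, "resolved_remotely": False},
    {"impact": "low", "response_minutes": 30, "closed_by_service_desk": True, "resolved_remotely": True},
]
metrics = process_metrics(records)
```

Breaking the response-time percentage down by impact code mirrors the way the targets themselves are typically specified in the service level agreement.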

The relationships between the incident management process and other IT Service Management processes are:

- The configuration management database defines the relationships among resources, services, users, and service levels. For example, let's say a server fails. With the configuration management database, all existing processes, applications, and interfaces would be documented, so downstream effects would be noted immediately.
- Problem management provides information about problems, known errors, workarounds, and quick fixes.
- Change management yields information about scheduled changes and their status.
- Service level management monitors the service level agreements with the customer about the support to be provided.
- Availability management measures aspects of the availability of services and uses the incident records and the status monitoring provided by configuration management.
- Capacity management assures that storage capacity matches the evolving demands of the business. It is concerned with incidents that relate to this objective, such as incidents caused by a shortage of disk space or slow response time.
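The server-failure example can be sketched as a walk over the dependency relationships held in the CMDB. The code below is a simplified illustration, not a real CMDB interface: the `DEPENDENTS` mapping and all item names are invented, and it assumes the relationships have already been extracted into a plain dictionary.

```python
from collections import deque

# Hypothetical CMDB extract: each item maps to the items that depend on it.
DEPENDENTS = {
    "server-01": ["database-a", "file-share"],
    "database-a": ["billing-app", "reporting-app"],
    "billing-app": ["customer-portal"],
    "file-share": [],
    "reporting-app": [],
    "customer-portal": [],
}


def downstream_impact(failed_item, dependents):
    """Breadth-first walk returning every item affected by failed_item."""
    affected = []
    seen = {failed_item}
    queue = deque([failed_item])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:  # guard against shared or circular dependencies
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


impacted = downstream_impact("server-01", DEPENDENTS)
```

A breadth-first order lists the directly dependent items first, which is often the order in which support staff want to check them.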

A point to remember, and maybe even reinforce, is that the incident management process is interwoven with the other IT service management processes. The processes work as long as they all support each other.

Finally, there are some common barriers to implementing an incident management process, in the form of costs and problems. The common costs are implementation and operating costs, as is standard with almost any implementation. Implementation costs cover training, the tools needed, process and workflow definition, and the resources expended in the implementation. Operating costs cover continuing maintenance and license fees and the operating resources expended. Some of the common recurring problems that affect all organizations are:

Users and IT staff bypassing incident management procedures - This circumvention means that the IT organization does not obtain important information about the service level and the number of errors.

Incident overload and backlog - A heavy incident load makes it difficult to record incidents effectively. Escalations may occur if incidents are not resolved quickly enough.

Incomplete service catalogs and service level agreements - These documents define the time in which an incident or request for service needs to be solved or escalated. If they don't exist or are incomplete, the caller may not be able to get the issue resolved and get back online as quickly as possible.

Lack of commitment - This is a problem because effective incident management requires real staff commitment, not just involvement.

Just remember to treat the costs and problems like any other hurdle: find your way around them or over them.
Incident Management Goals

Here is the goal for ITIL Incident Management as quoted in the ITIL publication Service Support:

The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimise the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within Service Level Agreement (SLA) limits.

Let us break this down into its fundamental components and see what we can identify to help justify ITIL:

Restore normal service operation - ITIL defines normal service operation as operation within Service Level Agreement limits. So do you have formal SLAs? If you do, then you can determine how often you fail to resume service within the time limits specified in the SLA. The trick now is to identify, or estimate, how much you could improve this figure with ITIL. This is a deliverable. Select some recent incidents and use them as case studies. Try to identify whether any of the incidents could have been resolved more quickly, e.g. second-level support was slow to respond, the Service Desk allocated the wrong priority, or the Service Desk wrongly diagnosed the symptoms of an Incident. Use data from your case studies to express how ITIL would improve Incident Management. Remember to compare actual service against your SLAs.

However, if you do not have SLAs then you cannot determine your level of success, because you do not have a benchmark. The result will be constant confrontation with your customers over the resolving and fixing of incidents. You will need to determine some desired SLA levels and measure against them to estimate the potential ITIL benefits; however, the fact that you do not have an SLA is a big argument in itself. Use the examples in the previous paragraph, but measure against desired levels of service rather than agreed levels.

Quickly and efficiently as possible - Again, this is measured against agreed Service Levels, but the trick is to beat, not just meet, Service Levels. So look for data that shows the speed with which you are solving incidents and how ITIL will help you to beat your current levels. For example, integration with change management can mean that you react more quickly to failed changes and therefore restore services faster.

Minimising the adverse impact on the business and operations - This is a key deliverable, because if you do not have Configuration Management you cannot readily identify the impact of failed IT components on the business. As a result you will set targets that are IT-oriented rather than business-driven, which may result in IT working hard to solve an incident that seems important to IT but in reality is not very important to the business community. With staffing strictly limited in IT nowadays, it is crucial that the IT workforce focuses on supporting the business. So look for occasions when the customer has complained about delays that could have been avoided with better staff scheduling. Our friends the SLAs appear here again, because an SLA should state the business impact of IT systems and services. Without these you are guessing the business impact, which is dangerous and unprofessional. If you do not have SLAs then you should question how you can arrive at business-driven priorities and incident scheduling.

Ensuring best levels of service quality and availability are maintained - The keyword here is "best" or, to use another ITIL expression, "fit for purpose". We often hear about "world class" service but have rarely seen a clear definition of world class. Why? Because there is no such thing! This is why fit for purpose, or best, is so important.
It means providing the correct level of service at a sensible cost. If you can do this, then you can say that you are delivering exactly what your business needs and can afford. This is a good definition of world class, if you really need one. Again, this means having Service Levels determined and getting regular feedback from your customers, e.g. through surveys. So communicate regularly with your customers to determine whether you are providing the correct levels of service. Do not confuse this with attitude questions, e.g. "are we polite?". Work with your customers, get their views on your ability to manage and solve their incidents, and use this data to make your case.

Incident Management is a key component in both ITIL and customer service; after all, it is the service that customers use to communicate with IT on a daily basis. The better you manage Incidents, the happier your customers will be. So look carefully at the ITIL Incident Management goal and ask: are we delivering that goal right now? If not, identify the failure points, and here lies your justification. If you have a good Service Desk, then this is the process where you are likely to be closest to achieving the ITIL goals.

Business alignment indicator - The key alignment point here is agreeing with the customers a definition and a value for normal service operation, and getting it formalized in the SLA. First you must analyze what levels of service operation Service Management currently provides, and then discuss with the customers their requirements for normal service operation. Do not be too ambitious; keep in mind the ITIL mantra "fit for purpose". The other key area is minimising the adverse effect on business operations. This is where priorities need to be defined with the customers, so that the correct priority is allocated to every incident, reducing the chance of delaying business-critical operations any more than necessary.
Regularly review priorities with your customers to ensure that the priorities continue to meet their requirements. Ideally your customers should be able to review the status and history of their incidents online; otherwise you will need to provide regular reports to them. You should also build your priorities and normal service operation level into your Incident Management technology so that both you and your customers can be alerted immediately to any service failures caused by Incident Management.
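As a rough illustration of building priorities and service levels into the tooling, the sketch below derives a priority from impact and urgency and checks an open incident against a target time. Every value in `PRIORITY_MATRIX` and `TARGET_HOURS` is a placeholder assumption; the real matrix and targets are exactly what should be agreed with the customers and recorded in the SLA.

```python
# Hypothetical impact/urgency matrix; agree the real values with customers.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("medium", "high"): 2,
    ("high", "low"): 3,
    ("medium", "medium"): 3,
    ("low", "high"): 3,
    ("medium", "low"): 4,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}

# Hypothetical target resolution times in hours for each priority.
TARGET_HOURS = {1: 1, 2: 8, 3: 24, 4: 48, 5: 120}


def prioritize(impact, urgency):
    """Return (priority, target resolution hours) for an incident."""
    priority = PRIORITY_MATRIX[(impact, urgency)]
    return priority, TARGET_HOURS[priority]


def breaches_target(impact, urgency, hours_open):
    """True when an incident has been open longer than its target."""
    _, target = prioritize(impact, urgency)
    return hours_open > target


# Example: a medium-impact, high-urgency incident open for nine hours.
alert_needed = breaches_target("medium", "high", 9)
```

Holding the matrix as data rather than code means a priority review with customers only changes the table, not the logic that raises alerts.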
