ITIL OSA Incident Management

What is the purpose of Incident Management?
1. restore normal service operation as quickly as possible
2. minimize the adverse impact on business operations
Name the objectives of Incident Management
-ensure standardized methods and procedures are used
-increase visibility and communication of incidents
-enhance business perception of IT
-align activities to the priorities of the business
-maintain user satisfaction with IT service quality
incident
-unplanned interruption of an IT service
-reduction in the quality of an IT service
-failure of a CI that has not yet impacted a service
normal service operation
the operational state where services and CIs are performing within their agreed service and operational levels
What is considered “in scope” of the incident management process?
-any events that indicate disruption to an IT service
-any events that could disrupt an IT service
What is considered “out of scope” of the incident management process?
-informational events that indicate normal service operation
-service requests
Business value of incident management
-reduction of IT and business labor costs related to incidents
-improved incident resolution, leading to higher levels of service availability
-better alignment of IT and business priorities
-increased ability to identify potential service improvements
-identification of additional service or training requirements
Policies of Incident Management
1. incidents and their status must be timely and effectively communicated
2. incidents must be resolved within agreed and acceptable timeframes
3. Customer satisfaction must be maintained at all times
4. Incident handling should be aligned to the priorities of the business
5. Incidents should be stored and managed in a single system
6. a standard classification scheme should be used for all incidents
7. Incident records should be audited on a regular basis
8. All incident records should follow a standard format
9. Prioritization and categorization should be done according to a common agreed set of criteria
timescales of incident management
-must be agreed for all incident handling stages
-will differ based on incident priority
-must be based on the agreed response and resolution targets within SLAs
-captured as targets within OLAs and UCs as appropriate
-communicated to all support groups
-service management tools should be used to automate timescales based on predefined rules
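As a rough illustration of automating timescales from predefined rules, the sketch below derives response and resolution deadlines from an incident's priority. The priority codes, the target durations, and the function names are hypothetical placeholders; real targets come from the agreed SLAs, OLAs, and UCs.

```python
from datetime import datetime, timedelta

# Hypothetical targets per priority code: (response, resolution).
# Real values must come from the agreed SLAs, OLAs, and UCs.
TARGETS = {
    1: (timedelta(minutes=15), timedelta(hours=1)),
    2: (timedelta(minutes=30), timedelta(hours=8)),
    3: (timedelta(hours=4), timedelta(hours=24)),
}


def deadlines(logged_at: datetime, priority: int) -> dict:
    """Return the response and resolution deadlines for an incident."""
    response, resolution = TARGETS[priority]
    return {
        "respond_by": logged_at + response,
        "resolve_by": logged_at + resolution,
    }


def is_breached(deadline: datetime, now: datetime) -> bool:
    """True once the current time has passed the deadline."""
    return now > deadline
```

A tool built this way could call is_breached on a schedule and notify the relevant support group as a deadline approaches or passes.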
incident models
-predefined steps to handle known types of incidents in an agreed way
-input into incident support tools to support automation
-stored in the SKMS
service knowledge management system
a set of tools and databases that is used to manage knowledge, information, and data; includes the CMS (configuration management system) as well as other databases and information systems; includes tools for collecting, storing, managing, updating, analyzing, and presenting all the knowledge, information, and data that an IT service provider will need to manage the full lifecycle of IT services
incident models should include
-chronological steps to handle the incident
-responsibilities; who should do what
-any precautions that need to be taken
-timescales and thresholds for completion
-escalation procedures
-any necessary evidence preservation activities
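One possible way to capture the elements above in a support tool is a small data structure. This is only a sketch: the class and field names are invented for illustration and do not come from ITIL or any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    order: int         # chronological position of the step
    action: str        # what has to be done
    responsible: str   # who should do it
    max_minutes: int   # timescale/threshold for completion


@dataclass
class IncidentModel:
    name: str
    steps: list = field(default_factory=list)
    precautions: list = field(default_factory=list)
    escalation_contacts: list = field(default_factory=list)
    evidence_to_preserve: list = field(default_factory=list)

    def ordered_steps(self) -> list:
        """Steps in the chronological order they should be carried out."""
        return sorted(self.steps, key=lambda s: s.order)

    def total_minutes(self) -> int:
        """Sum of the per-step completion thresholds."""
        return sum(s.max_minutes for s in self.steps)
```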
major incident
-highest priority category of impact for an incident; results in significant disruption to the business
-must be clearly defined and mapped into incident prioritization scheme
major incident procedure
-a separate procedure, with shorter timescales and greater urgency, which must be used for major incidents
-establishment of a separate major incident team
-led and managed by the incident manager; there is a risk of conflicting priorities when the incident manager is also the service desk manager
-involve problem manager as necessary; incident manager ensures focus remains on restoration
-service desk is accountable for recording all activities; responsibility for recording may be delegated; users kept up-to-date
incident manager (major incident role)
this role is responsible for leading and managing the organizational response. Care should be taken when the incident manager is also the service desk manager as there may be conflicting priorities between those two roles during major incidents. As necessary, a separate person may be designated to lead the major incident response team
problem manager (major incident role)
this role participates as needed if the cause needs to be investigated at the same time as incident resolution. The incident manager should ensure that priority is placed on incident resolution and that the investigation of cause is kept separate.
service desk (major incident role)
this role is responsible for keeping users up to date as the incident progresses through resolution activities. While it is ultimately accountable for keeping the incident record up to date, responsibility for this activity may be delegated as necessary to other support teams
incidents should be tracked throughout their lifecycle:
-support proper handling and escalation
-facilitate accurate reporting of incident status
-capture within the incident management system
incident status examples
-open
-in progress
-resolved
-closed
Open
an incident has been recognized but not yet assigned to a support resource for resolution
in progress
-the incident is in the process of being investigated and resolved
resolved
-a resolution has been put in place for the incident but normal state service operation has not yet been validated by the business or end user
closed
the user or business has agreed that the incident has been resolved and that normal state operations have been restored
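The four example statuses can be treated as a small state machine. The sketch below is an assumption about which moves are allowed, not a rule from the source. In particular, the resolved to in progress transition is added here to cover a user rejecting the resolution.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    IN_PROGRESS = "in progress"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Assumed transitions (hypothetical); adjust to the organization's own process.
ALLOWED = {
    Status.OPEN: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: set(),
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new
```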
Incident management process activities
1. identification
2. logging
3. categorization
4. prioritization
5. initial diagnosis
6. escalation
7. investigation and diagnosis
8. resolution and recovery
9. closure
incident identification
1. event management
2. web interface
3. phone call
4. email
incident logging
-all incidents must be fully logged and date and time stamped
+a unique incident record for each unique incident must be logged
=service desk
=automated incident creation
=incident submitted over the web or e-mail
=any other responsible groups
-incident records must capture all relevant information
+updated as it progresses through the lifecycle
+full historical record
incident categorization
-supports trending and analysis
-can change or evolve through the incident lifecycle
-multilevel categorization
-confirmed at incident close
Defining Categories Approach
1. brainstorm with relevant support groups (service desk supervisor, incident and problem managers)
2. best-guess the top-level categories from a user perspective (include an “other” category)
3. set up relevant tools and trial the categories
4. analyze incidents captured
5. perform a breakdown analysis of each high-level category; define the lower-level categories for each
6. implement the new categories and review after 1 to 3 months; ongoing review; changes could affect incident trending and should be done only when genuinely required
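The breakdown analysis in steps 4 and 5 can be approximated by counting incidents per category at a chosen depth. The category names below are made up for illustration, and representing a multilevel category as a tuple is just one convenient assumption.

```python
from collections import Counter

# Hypothetical multilevel categories, most general level first.
sample_incidents = [
    ("hardware", "server", "memory"),
    ("hardware", "laptop", "battery"),
    ("software", "email", "sync"),
    ("hardware", "server", "disk"),
    ("other",),
]


def breakdown(category_paths, depth):
    """Count incidents by category, truncated to the given number of levels."""
    return Counter(path[:depth] for path in category_paths)
```

A large count under "other" at depth 1 would suggest that a top-level category is missing.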
incident prioritization
-priority should be based on impact and urgency
-some factors contributing to impact: risk to life or limb; number of users or services impacted; level of financial loss; effect on business reputation; regulatory or legislative breaches
-clear guidance with practical examples should be provided to all staff
-priority can be dynamic
-occasionally, priority may be overridden
impact
relates to the overall impact the incident is having on the business
urgency
relates to how quickly the business needs a resolution to the incident
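A common way to combine impact and urgency is a simple lookup matrix. The sketch below assumes three levels for each and a five-point priority scale. The level names, the numeric scale, and the override parameter are illustrative assumptions, and each organization agrees its own scheme.

```python
# Assumed ranking of levels; 1 is the most severe.
RANK = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str, override=None) -> int:
    """Derive a priority code from 1 (highest) to 5 (lowest).

    An explicit override, when given, replaces the calculated value.
    """
    if override is not None:
        return override
    return RANK[impact] + RANK[urgency] - 1
```

With this mapping, high impact with high urgency gives priority 1 and low impact with low urgency gives priority 5. Because priority can be dynamic, the function can be called again whenever impact or urgency changes.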
initial diagnosis
-typically performed by the service desk
-uses diagnostic scripts and known error records
-can be potentially closed over the phone with the user
-if service desk cannot resolve incident over the phone, but can within an agreed timeframe: give the reference number to the user; inform the user of service desk intentions; escalate as necessary
functional escalation
-transferring an incident, problem, or change to a technical team with a higher level of expertise to assist in an escalation
-technical escalation
-service desk: escalate when it is clear it cannot restore service within agreed timeframes; always owns the incident and responsibility for user communication
-can involve internal and external teams
-may be done multiple times for an individual incident
-rules for escalations must be part of OLAs and UCs
hierarchic escalation
-informing or involving more senior levels of management to assist with an escalation
-management escalation
-relay information, for example, major incidents
-make required decisions, for example, resource allocation or incident assignment
-settle disagreements
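The two escalation types can be triggered independently of each other. The sketch below is a simplified decision helper. Its inputs and thresholds are hypothetical, and real escalation rules belong in the OLAs and UCs.

```python
def escalation_actions(minutes_left: int, estimated_fix_minutes: int,
                       is_major: bool = False, disputed: bool = False) -> list:
    """Return the escalation types that apply to an incident right now."""
    actions = []
    # Functional: the current team cannot restore service in the time remaining.
    if estimated_fix_minutes > minutes_left:
        actions.append("functional")
    # Hierarchic: management must be informed, decide, or settle a disagreement.
    if is_major or disputed:
        actions.append("hierarchic")
    return actions
```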
investigation and diagnosis
might include:
-establishing what has gone wrong
-understanding chronological order of events
-confirming full impact of the incident
-identifying any events that may have triggered the incident
-performing detailed knowledge searches

coordination is critical when simultaneous activities are occurring

all activities should be fully documented in the incident record

resolution and recovery
solutions applied and tested as they are identified:
-asking the user to undertake directed activities on their own desktop or remote equipment
-the service desk implementing the resolution either centrally or remotely using software to take control of the user’s desktop to diagnose and implement a resolution
-specialist support groups being asked to implement specific recovery actions
-a third-party supplier or maintainer being asked to resolve the fault

actions must be coordinated by incident management

sufficient testing should be performed to validate resolution

incident is passed back to service desk for closure

Closure
service desk responsible for incident closure:
-user confirmation/acceptance
-closure categorization
-user satisfaction survey
-incident documentation
-ongoing or recurring problem?
-formal closure

automated incident closure:
-may not be appropriate for VIPs and major incidents
-must be discussed, agreed on, and communicated

closure categorization
check and confirm that the initial incident categorization was correct or, where the categorization subsequently turned out to be incorrect, update the record so that a correct closure categorization is recorded for the incident – seeking advice or guidance from the resolving group(s) as necessary
user satisfaction survey
carry out a user satisfaction callback or email survey for the agreed percentage of incidents
incident documentation
chase any outstanding details and ensure that the incident record is fully documented so that a full historic record at a sufficient level of detail is complete
ongoing or recurring problem?
determine (in conjunction with resolver groups) whether the incident was resolved without the root cause being identified. In this situation, it is likely that the incident could recur and require further preventive action to avoid this. In all such cases, determine if a problem record related to the incident has already been raised. If not, raise a new problem record in conjunction with the problem management process so that preventive action is initiated
formal closure
formally close the incident record
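The closure checks above can be expressed as two small predicates. This is a sketch under assumptions: the dictionary keys are invented for illustration, and the satisfaction survey is left out of the gate because it only applies to an agreed percentage of incidents.

```python
def can_close(incident: dict) -> bool:
    """True when every mandatory closure check has been completed."""
    return all(
        incident.get(key)
        for key in ("user_confirmed", "closure_category", "documentation_complete")
    )


def needs_problem_record(incident: dict) -> bool:
    """True when the root cause is unknown and no problem record is linked yet."""
    return (not incident.get("root_cause_identified")
            and not incident.get("problem_ids"))
```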
triggers of incident management process
-user calls service desk phone
-user submits Web-based incident
-event management sends automated alerts
-technical staff
-suppliers
inputs of incident management process
-information about CIs and their status
-known errors and workarounds
-communication about incident symptoms
-communication about RFCs and releases
-communication of events
-operational and service level objectives
-customer feedback on incident resolution
-agreed criteria for escalating incidents
outputs of incident management process
-resolved/updated incidents
-updated classifications
-raised problem records
-validation that incidents have not recurred
-feedback on incidents related to changes and releases
-identification of related CIs
-customer feedback
-feedback to event management on monitoring levels
-communication of incident and resolution history
service design
service level management
information security management
capacity management
availability management
service transition
service asset and configuration management
change management
service operation
problem management
access management
service level management interaction with incident management
-ability to resolve incidents in a specified time is a key part of delivering an agreed level of service
-enables SLM to define measurable responses to service disruptions
-provides reports that enable SLM to review SLAs objectively and regularly
-incident management is able to assist in defining where services are at their weakest so that SLM can define actions as part of the SIP (service improvement plan)
SLM defines….
the acceptable levels of service within which incident management works, including: incident response times, impact definitions, target fix times, service definitions, which are mapped to users, rules for requesting services, and expectations for providing feedback to users
Information security management
providing security-related incident information as needed to support service design activities and gain a full picture of the effectiveness of the security measures as a whole based on an insight into all security incidents. This is facilitated by maintaining log and audit files and incident records
capacity management
incident management provides a trigger for performance monitoring where there appears to be a performance problem; may develop workarounds for incidents
availability management
will use incident management data to determine the availability of IT services and look at where the incident lifecycle can be improved
service asset and configuration management
This process provides the data used to identify and progress incidents. One of the uses of the CMS is to identify faulty equipment and to assess the impact of an incident. The CMS also contains information about which categories of incident should be assigned to which support group. In turn, incident management can maintain the status of the faulty CIs. It can also assist service asset and configuration management to audit the infrastructure when working to resolve an incident
change management
where a change is required to implement a workaround or resolution, this will need to be logged as an RFC and progressed through change management. In turn, incident management is able to detect and resolve incidents that arise from failed changes
problem management
for some incidents, it will be appropriate to involve problem management to investigate and resolve the underlying cause to prevent or reduce the impact of recurrence. Incident management provides a point where these are reported. Problem management, in return, can provide known errors for faster incident resolution through workarounds that can be used to restore service
access management
incidents should be raised when unauthorized access attempts and security breaches have been detected. A history of incidents should also be maintained to support forensic investigation activities and resolution of access breaches
Incident management tools provide the following types of information
-incident and problem history
-incident categories
-action taken to resolve incidents
-diagnostic scripts that can help first-line analysts to resolve the incident, or at least gather information that will help second- or third-line analysts resolve it faster
the following types of data are contained in the service catalog
-key service delivery objectives, levels and targets
-information about the service in terms that the customer and users understand
-information that can be used for communication with customers and users
incident management should have access to the CMS to be able to identify relationship information such as the following about CIs:
-identification of affected CIs
-ability to estimate the scope and impact of the incident
resolution information, such as the following, can be found in the known error database (KEDB)
information about workarounds that may be used to potentially restore service for the incident
incident records should contain the following types of data
-unique reference number
-incident categorization
-incident urgency
-incident impact
-incident prioritization
-date and time recorded
-name or ID of the person recording the incident
-method of notification
-name, dept., phone, and location of user
-call-back method
-description of symptoms
-incident status
-related CI
-support group or person incident is assigned to
-related problem or known error
-activities undertaken to resolve the incident
-resolution date and time
-closure category
-closure date and time
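The fields above map naturally onto a record type. The following is a minimal sketch. The attribute names and the choice of which fields are optional at logging time are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class IncidentRecord:
    reference: str                  # unique reference number
    category: str
    urgency: str
    impact: str
    priority: int
    recorded_at: datetime
    recorded_by: str                # name or ID of the person recording
    notification_method: str
    user_name: str
    user_department: str
    user_phone: str
    user_location: str
    callback_method: str
    symptoms: str
    status: str = "open"
    related_ci: Optional[str] = None
    assigned_to: Optional[str] = None
    related_problem: Optional[str] = None
    activities: list = field(default_factory=list)
    resolved_at: Optional[datetime] = None
    closure_category: Optional[str] = None
    closed_at: Optional[datetime] = None

    def log_activity(self, when: datetime, note: str) -> None:
        """Append a time-stamped entry so a full historical record is kept."""
        self.activities.append((when, note))
```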
IM CSF: Resolve incidents as quickly as possible, minimizing impacts to the business
KPI: Mean elapsed time to achieve incident resolution or circumvention, broken down by impact code
KPI: Breakdown of incidents at each stage
KPI: percentage of incidents closed by the service desk without reference to other levels of support
KPI: number and percentage of incidents resolved remotely, without the need for a visit
KPI: number of incidents resolved without impact to the business
IM CSF: Maintain quality of IT services
KPI: total number of incidents
KPI: size of current incident backlog for each IT service
KPI: number and percentage of major incidents for each IT service
IM CSF: Maintain user satisfaction with IT services
KPI: average user/customer survey score
KPI: percentage of satisfaction surveys answered versus total number of surveys sent
IM CSF: Increase visibility and communication of incidents to business and IT support staff
KPI: average number of service desk calls or other contacts from business users for incidents already reported
KPI: number of business user complaints or issues about the content and quality of incident communications
IM CSF: Align incident management activities and priorities with those of the business
KPI: percentage of incidents handled within agreed response time
KPI: average cost per incident
IM CSF: Ensure that standardized methods and procedures are used for efficient and prompt response, analysis, documentation, ongoing management and reporting of incidents to maintain business confidence in IT capabilities
KPI: number and percentage of incidents incorrectly assigned
KPI: number and percentage of incidents incorrectly categorized
KPI: number and percentage of incidents processed per service desk agent
KPI: number and percentage of incidents related to changes and releases
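Two of the KPIs above, mean elapsed time to resolution by impact code and the percentage of incidents closed by the service desk without reference to other support levels, can be computed from exported incident data. The sketch below assumes each incident is a dictionary with hypothetical keys "impact", "minutes_to_resolve", and "escalated".

```python
from collections import defaultdict
from statistics import mean


def mean_resolution_minutes_by_impact(incidents: list) -> dict:
    """Mean elapsed minutes to resolution, grouped by impact code."""
    buckets = defaultdict(list)
    for inc in incidents:
        buckets[inc["impact"]].append(inc["minutes_to_resolve"])
    return {impact: mean(values) for impact, values in buckets.items()}


def first_line_closure_pct(incidents: list) -> float:
    """Percentage of incidents closed without escalation to other support levels."""
    if not incidents:
        return 0.0
    closed = sum(1 for inc in incidents if not inc["escalated"])
    return 100.0 * closed / len(incidents)
```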
challenges of incident management
-early detection of incidents
-convincing staff to log all incidents
-information about problems and known errors
-integration into the CMS
-integration into SLM
risks of incident management
-being inundated with incidents
-unmanaged backlog of incidents
-inadequate information
-poorly aligned OLA and UC causing mismatched objectives
incident management process owner
-carrying out the generic process owner role for incident management
-designing incident models and workflows
-working with other process owners to ensure proper integration of ITSM processes
incident management process manager
-carrying out the generic process manager role for incident management
-planning and managing support for incident management tools and processes
-coordinating interfaces between incident management and other service management processes
-ensuring and monitoring incident management efficiency and effectiveness
-producing management information
-managing the work of incident management staff
-suggesting improvements for incident management
-developing and maintaining incident management process, procedures, and systems
-managing major incidents
first-line analyst
-recording incidents
-routing incidents to support specialists as needed
-prioritizing, categorizing, and providing initial support for incidents
-providing resolution and recovery of incidents not escalated
-closing incidents
-monitoring the status of incidents
-engaging in ongoing communication with users about incident progress
-escalating incidents as necessary
second-line analysts
more specialized than first-line:
-have additional time for diagnosis and resolution
-resolve less complicated incidents to allow third-line to focus on the most difficult
-advantages to locating them close to first-line support
third-line analysts
-specialized level support includes a number of teams, such as: network support; application support; suppliers