ITIL OSA Event Management

Effective service operation depends on
1. knowing the status of infrastructure and services
2. the ability to detect deviations from normal or expected operation
event
a change of state that has significance for the management of an IT service or other CI; events can require IT operations personnel to take action and often result in incidents being logged
event management
the process responsible for managing events through their lifecycle; one of the main activities of IT operations
configuration item (CI)
any component or other service asset that needs to be managed in order to deliver an IT service; information about each is recorded in a CMS (configuration management system) and maintained throughout its lifecycle; they are under the control of change management and can include services, hardware, software, buildings, people, and formal documentation such as processes, procedures, and SLAs (service level agreements)
active monitoring
monitoring of a CI or an IT service that uses automated regular checks to discover its current status; tools that poll CIs at a predetermined frequency to determine their current status; exceptions generate an alert that is communicated to the appropriate tool or team for follow-up actions
passive monitoring tools
monitoring of a CI or an IT service or process that relies on an alert or notification to discover its current status; tools that detect and correlate operational alerts that are generated by the CIs themselves
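The difference between the two monitoring styles can be sketched in Python. This is a minimal illustration only: the ConfigurationItem class and the poll_ci and on_notification functions are hypothetical names, not part of ITIL or of any monitoring tool.

```python
from typing import Callable, Optional


class ConfigurationItem:
    """Hypothetical CI: a name plus a callable that reports its status."""

    def __init__(self, name: str, read_status: Callable[[], str]):
        self.name = name
        self.read_status = read_status


def poll_ci(ci: ConfigurationItem, expected: str = "up") -> Optional[dict]:
    """Active monitoring: the tool asks the CI for its current status.

    Returns an alert dict when the status deviates from `expected`,
    otherwise None (no exception, so no alert).
    """
    status = ci.read_status()
    if status != expected:
        return {"ci": ci.name, "status": status, "source": "active"}
    return None


def on_notification(ci_name: str, message: str) -> dict:
    """Passive monitoring: the CI itself generated the notification."""
    return {"ci": ci_name, "status": message, "source": "passive"}
```

In practice poll_ci would run on a schedule at the predetermined frequency, while on_notification would be called whenever a CI pushes an alert.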
purpose of event management
1. manages events through their lifecycle
2. includes activities that detect events, make sense of them, and determine the appropriate response
3. provides the basis for operational monitoring and control
4. automates normal operations and detects early warnings and failures of CIs
objectives of event management
1. detect all significant changes of state to CIs
2. determine the appropriate control action (response)
3. provide a trigger to initiate other operational processes
4. provide a means to compare actual performance against designs
5. provide a basis for service assurance and reporting
scope of event management
1. supports any service management aspect that needs to be controlled and can be automated including:
-CIs (monitoring and updating of status)
-environmental conditions
-software license monitoring
-security monitoring
-normal activities
2. monitoring and event management are related but different
business value of event management
1. early detection of incidents
2. monitoring automated activities by exception
3. information for other service management processes
4. basis for automation
5. early notifications, which can prevent service disruptions
policies of event management
1. events should only be sent to those responsible for action
2. event management should be centralized as much as possible
3. events should utilize common messaging and logging standards
4. event handling should be automated where possible
5. events should have standard classification schemes and escalation procedures
6. all recognized events should be captured and logged
informational events
1. indicates normal operation
2. conveys data for decision making

indicates information that can be used for trending and analysis to inform the service provider in its decision-making process

informational events examples
-data for decision making
-scheduled work completed
-user accessed an application
-e-mail was received
warning events
1. indicates unusual, but not exceptional, operation
2. conveys predictive information or early warning
3. additional monitoring or response may be required

indicates early warning information that can often be leveraged to minimize or prevent any user or business impact

warning events examples
-transaction completion time 10% higher than normal
-CPU utilization within 5% of highest tolerance
exception events
-indicates operation outside of acceptable range
-conveys abnormal situations that require follow-up actions

indicates abnormal situations or failures that require additional follow-up actions

exception events examples
-services or functionality unavailable
-CPU utilization above acceptable levels
-incorrect password attempts
-unauthorized software
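The three event types can be illustrated with the CPU utilization examples above. The sketch below assumes a highest tolerance of 90% and reads "within 5%" as five percentage points; both numbers and the function name classify_cpu_event are hypothetical.

```python
def classify_cpu_event(utilization: float,
                       max_tolerance: float = 90.0,
                       warning_margin: float = 5.0) -> str:
    """Map a CPU utilization reading (in percent) to an event type.

    The default threshold values are illustrative assumptions only.
    """
    if utilization > max_tolerance:
        # Operation outside the acceptable range.
        return "exception"
    if utilization >= max_tolerance - warning_margin:
        # Close to the tolerance: early warning.
        return "warning"
    # Normal operation, kept for trending and analysis.
    return "informational"
```

For example, with the default values a reading of 87 is classified as a warning and a reading of 95 as an exception.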
Filtering (events) Strategy: integration
event management integrated into all service management processes
Filtering (events) Strategy: design
services designed with event management in mind
Filtering (events) Strategy: trial and error
perfection is elusive; filtering is refined through formal reviews and evaluation
Filtering (events) Strategy: planning
-approach from an enterprise perspective
-manage as a project
-ensure realistic timelines and resources
Name the event types.
1. informational
2. warning
3. exception
Name event filtering types.
1. integration
2. design
3. trial and error
4. planning
Event management should:
-be designed within the service design stage with availability and capacity management involvement
-be tested and validated as part of service transition
-be supported, managed, and refined by service operation
-support and be supported by continual service improvement
key questions for event management design
1. What needs to be monitored?
2. What type of monitoring is required?
3. When should an event be generated?
4. What information needs to be communicated with the event?
5. Who will the messages be delivered to?
6. Who will be responsible for communicating and taking necessary follow-up actions?
Instrumentation
defining and designing how IT components and services will be monitored and controlled
consideration for effective instrumentation
1. event generation, classification, communication, and escalation
2. availability and adequacy of a CI’s event generation capabilities
3. What data will be captured in the record?
4. Will active or passive monitoring be used?
5. Where will events be logged and stored?
6. How will supplementary data be gathered?
Error messaging
1. services should be designed and tested to support event management
-meaningful error messaging
-adequate supporting detail to facilitate analysis
2. service management tools can provide enterprise wide monitoring
-centralized monitoring across complex distributed environments
-standardized messaging across multiple platforms
event detection and alerting mechanism
-event management design
+configuration and population of tools for event detection
+establishing rule sets and criteria for correlation
-thorough design requires the following knowledge
+relationship of services to business processes
+service level requirements
+resources supporting each CI
+normal and abnormal operations of each CI
+information that needs to be captured for each event
+incident categorization and prioritization codes
+CI dependencies and significance of multiple events
event records should include:
1. device
2. component
3. type of failure
4. date and time
5. parameters
6. unique identifier
7. value
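The fields listed above could be captured in a simple record type. This is a sketch under assumptions: the class name EventRecord, the field names, and the choice of a UUID for the unique identifier are illustrative, not prescribed by ITIL.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class EventRecord:
    device: str                      # 1. device
    component: str                   # 2. component
    failure_type: str                # 3. type of failure
    value: Any                       # 7. value that was observed
    # 5. parameters, e.g. the threshold that was exceeded
    parameters: Dict[str, Any] = field(default_factory=dict)
    # 4. date and time, recorded in UTC at creation
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    # 6. unique identifier (a random UUID is an assumption)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
```

A record might then be created as EventRecord(device="db01", component="disk0", failure_type="capacity", value=97.5, parameters={"threshold": 95}).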
Activities of event management
1. event occurs
2. event notification
3. event detection
4. event logging
5. first-level correlation and filtering
6. significance of events
7. second-level correlation
8. further action required
9. response selection
10. review actions
11. close event
Activities of event management: event occurs
-everyone involved in designing and supporting services should be involved
+ defining the types of events that need to be detected
+ fine-tuning event filtering levels and correlation rules
Activities of event management: event notification
-CIs can communicate events in two ways:
+polled by a service management tool (active monitoring)
+CI generates a notification when certain thresholds are met (passive monitoring)
-Event notifications:
+proprietary or standards based
+meaningful data to targeted audience
+clearly defined roles and responsibilities
Activities of event management: event detection
-notifications will result in event detection
-detection can happen with agent software or centralized management tool
Activities of event management: event logging
-events should be logged
-logging can be centralized or left on the device
-clear instructions should be defined on how and when to check the logs if left on device
-standards should be defined for how long events are kept before deletion
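A retention standard for logged events can be expressed as a small purge routine. The sketch assumes each log entry is a dict with a hypothetical logged_at datetime field; the retention period is whatever the organization defines.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def purge_expired(log: List[Dict], now: datetime,
                  retention: timedelta) -> List[Dict]:
    """Return only the entries that are still inside the retention period."""
    cutoff = now - retention
    return [entry for entry in log if entry["logged_at"] >= cutoff]
```

Passing now explicitly, rather than reading the clock inside the function, keeps the routine easy to test.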
Activities of event management: first-level correlation and filtering/significance of events
-determines the event type and whether to communicate it
+informational: does not require action; logged for predetermined period of time; used to generate statistics
+warning: service or device has reached a threshold that requires action; actions can prevent an exception from occurring; failures should be treated as exceptions, even when the service is not impacted
+exception: abnormal operation, often SLA or OLA breach; can be total failures or degraded performance; includes events such as unauthorized devices detected
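First-level filtering can be sketched as a lookup from event type to handling rule. The rule table below is one plausible reading of these notes, not an official ITIL mapping; FIRST_LEVEL_RULES and first_level_filter are hypothetical names.

```python
# Assumed handling per event type: every event is logged, informational
# events are not communicated, warnings go on to second-level correlation,
# and exceptions go straight to response selection.
FIRST_LEVEL_RULES = {
    "informational": {"log": True, "communicate": False,
                      "next_step": "close"},
    "warning": {"log": True, "communicate": True,
                "next_step": "second_level_correlation"},
    "exception": {"log": True, "communicate": True,
                  "next_step": "response_selection"},
}


def first_level_filter(event_type: str) -> dict:
    """Return the handling rule for an event type, rejecting unknown types."""
    try:
        return FIRST_LEVEL_RULES[event_type]
    except KeyError:
        raise ValueError(f"unknown event type: {event_type!r}") from None
```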
Activities of event management: second-level correlation
-warning events require additional correlation:
+what is the significance and what action needs to be taken?
+management tools can compare performance against standards
+correlation engines can apply additional business rules
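A correlation engine applies business rules to groups of warnings. As a minimal sketch, assume one hypothetical rule: a CI that raises at least a given number of warnings in the batch under review needs follow-up. The function name and the default threshold of 3 are illustrative.

```python
from collections import Counter
from typing import Dict, List


def correlate_warnings(warnings: List[Dict], threshold: int = 3) -> List[str]:
    """Return the CIs (sorted by name) with at least `threshold` warnings."""
    counts = Counter(w["ci"] for w in warnings)
    return sorted(ci for ci, n in counts.items() if n >= threshold)
```

A real engine would also consider time windows and CI dependencies; this sketch only counts occurrences.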
Activities of event management: further action required
-generating an incident record
-generating an RFC
-escalation to change management related to an authorized RFC
-automated scripts
-automated paging or notification systems
-database actions
Activities of event management: response selection
-auto response
-alerts and human intervention
-incident, problem or change?
-open an RFC: can be initiated when the event occurs or through correlation
-open an incident record
-open or link to a problem record
-special types of incident – incidents with no business impact:
+generate and escalate incident to appropriate team
+record that no business impact occurred and ensure not calculated as downtime or reported as business impact incident
+can be used to demonstrate proactive service provider capabilities
Activities of event management: review actions
-warning and exception events should be reviewed:
+ensure events were handled appropriately
+can be automated (open/close or down/up events)
+ should not duplicate incident, problem, or change closure steps
+provides input into evaluation and improvement of event management
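An automated review of down/up event pairs might look like the sketch below. It assumes events arrive in chronological order as dicts with hypothetical ci and state fields, and it reports the "down" events that never received a matching "up".

```python
from typing import Dict, List


def unresolved_down_events(events: List[Dict]) -> List[Dict]:
    """Return 'down' events with no later 'up' event for the same CI."""
    still_down: Dict[str, Dict] = {}
    for event in events:  # assumed to be in chronological order
        if event["state"] == "down":
            still_down[event["ci"]] = event
        elif event["state"] == "up":
            # The matching 'up' closes the earlier 'down' event.
            still_down.pop(event["ci"], None)
    return list(still_down.values())
```

Anything returned here still needs a human review; paired events can be closed automatically.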
Activities of event management: close event
-events that generate an incident, problem, or change should be formally closed and linked to associated records
Triggers of event management
-exceptions to acceptable CI performance or state
-exceptions to automated process or procedures
-exceptions to business processes being monitored
-completion of tasks or jobs
-CI status changes
-application or data access
inputs of event management
-operational and service level requirements
-alarms, alerts, and defined thresholds
-event correlation rules
-automated responses
-defined roles and responsibilities
-operational procedures for event response
outputs of event management
-communicated and escalated events
-event logs
-events initiating incident management
-events indicating SLA or OLA breaches
-events indicating completion of operational activities
-SKMS event information and history
Service design with event management
-service level management
-information security management
-availability management
-capacity management
Service transition with event management
-service asset and configuration management
-knowledge management
-change management
service operation with event management
-incident management
-problem management
-access management
key information involved in event management
-event messages
-database holding CI state and performance information
-monitoring tools and agent software
-correlation engines and rule sets
challenges of event management
-obtaining initial funding for tools and effort required
-establishing the correct level of filtering
-deploying monitoring tools and agents across the enterprise
-automated monitoring activities impacting capacity utilization
-acquiring or developing the necessary skills
-deploying tools without processes to define and operate them
risks of event management
-failure to obtain adequate funding or resources
-incorrect levels of filtering
-failure to maintain deployment and monitoring across the enterprise
event management process owner
1. carrying out the generic process owner role for event management
2. planning and managing support for event management tools and processes
3. coordinating interfaces between event management and other ITSM processes
Other event management roles: service desk
-typically not involved unless event requires a response within the scope of the service desk
-initial investigation of events identified as incidents
-escalating incidents and related events to appropriate resources as needed
-communication about the status of events as appropriate to all stakeholder groups
other event management roles: technical and application management
-support event management across the service lifecycle
-service design: defining events, detection and correlation mechanisms and responses
-service transition: testing to ensure events are detected and responses are appropriate
-service operation: performing event management for systems and applications under their control; responding to incidents and problems related to events in their areas; ensuring IT operations and service desk staff are trained appropriately for their level of involvement in event management
other event management roles: IT operations management
-event monitoring commonly delegated to IT operations where it exists: event monitoring and first-line response; following SOPs related to each event; ensuring incidents are created and escalated as appropriate
CSF: Detecting all changes of state that have significance for the management of CIs and IT services
KPI: Number and ratio of events compared with the number of incidents
KPI: Number and percentage of each type of event per platform or application versus total number of platforms and applications underpinning live IT services (looking to identify IT services that may be at risk for lack of capability to detect their events)
CSF: Ensuring all events are communicated to the appropriate functions that need to be informed or take further control actions
KPI: Number and percentage of events that required human intervention and whether this was performed
KPI: Number of incidents that occurred and percentage of these that were triggered without a corresponding event
CSF: Providing the trigger, or entry point, for the execution of many service operation processes and operations management activities
KPI: Number and percentage of events that required human intervention and whether this was performed
CSF: Providing the means to compare actual operating performance and behavior against design standards and SLAs
KPI: Number and percentage of incidents that were resolved without impact to the business (indicates the overall effectiveness of the event management process and underpinning solutions)
KPI: Number and percentage of events that resulted in incidents or changes
KPI: Number and percentage of events caused by existing problems or known errors (this may result in a change to the priority of work on that problem or known error)
KPI: Number and percentage of events indicating performance issues (for example, growth in the number of times an application exceeded its transaction thresholds over the past six months)
KPI: Number and percentage of events indicating potential availability issues (for example, failovers to alternative devices, or excessive workload swapping)
CSF: Providing a basis for service assurance, reporting, and service improvement
KPI: Number and percentage of repeated or duplicated events (this will help in the tuning of the correlation engine to eliminate unnecessary event generation and can also be used to assist in the design of better event generation functionality in new services)
KPI: Number of events/alerts generated without actual degradation of service/functionality (false positives: an indication of the accuracy of the instrumentation parameters, important for CSI)
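Several of the KPIs above are simple counts and percentages. The sketch below computes four of them, assuming hypothetical data shapes: each event is a dict with boolean raised_incident and false_positive fields, and each incident is a dict whose event_id is None when no corresponding event existed.

```python
from typing import Dict, List


def pct(part: int, total: int) -> float:
    """Percentage rounded to one decimal; 0.0 when the total is zero."""
    return round(100.0 * part / total, 1) if total else 0.0


def event_kpis(events: List[Dict], incidents: List[Dict]) -> Dict[str, float]:
    """Compute a few illustrative event management KPIs."""
    n_events = len(events)
    n_incidents = len(incidents)
    return {
        # Ratio of events to incidents
        "events_per_incident":
            round(n_events / n_incidents, 2) if n_incidents else 0.0,
        # Percentage of events that resulted in incidents
        "pct_events_raising_incident":
            pct(sum(1 for e in events if e["raised_incident"]), n_events),
        # Events generated without actual degradation of service
        "pct_false_positives":
            pct(sum(1 for e in events if e["false_positive"]), n_events),
        # Incidents triggered without a corresponding event
        "pct_incidents_without_event":
            pct(sum(1 for i in incidents if i["event_id"] is None),
                n_incidents),
    }
```

The field names are placeholders; in practice the figures would come from the event log and the incident management tool.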