Overprepared Craig's Courseware Notes: ITIL 4 Specialist: Monitor, Support and Fulfil (MSF)

Page numbers refer to the pdf page numbers of the Learner Workbook, not the content page numbers.
Chapter numbers (prefixed with §) refer to the Official Book or Practice Guide.

Module 1: General intro, revision of terms

Page 11: Incident management

Note that incidents might not be visible to the consumer.

Incident resolution might be “workaround now, systemic solution after hours”.

Page 13: The MSF practices are applied together

Remember that a practice has no value it itself. It needs to be in the context of a value chain activity. See § 4.5 in Foundation.

Pages 14-15: Some key value streams

What were they thinking with all the angles instead of actual arrows?

For better examples, see § 3.1 in any of the practice guides.

Page 15: Analysing a service value stream

See § 3.2.3 in any of the practice guides.

Note that the opposite of the “to be” state is the “as is” state. Discovering the “as is” state is the goal of step 3.

Some Key Terms

A few terms I was weak on.

Technical debt. The total rework backlog accumulated by choosing workarounds instead of systemic solutions that would take longer.

Some notes on technical debt from the man who coined the phrase.
A discussion of how some technical debt is good.

Service request model. A repeatable predefined approach to the fulfilment of a particular type of service request.

Incident model. A repeatable approach to the management of a particular type of incident.

Problem model. A repeatable approach to the management of a particular type of problem.

Module 2: Incident Management (INM)

Page 20: Is it an incident?

See § 3.1.1, particularly the first activity in the process.

Page 21: Benefits of incident management

These items are the benefits of having a fit-for-purpose management practice.

Note that it is possible to meet SLAs and still have unhappy customers and/or users, and vice versa.

Page 23: Workarounds

What is the term for the cost associated with using workarounds instead of systemic solutions? Technical debt. See § 2.2.3 (and page 113 in the LWB).

Page 24: Practice Success Factor (PSF) Definition

A practice success factor is more than a task or activity; it includes components from all four dimensions of service management. The nature of the activities and resources of PSFs within a practice may differ, but together they ensure that the practice is effective.

Page 26: Complexity-based approach

Snowden and Boone originally used “simple” for the bottom-right corner, changed it to “obvious” in 2014, and to “clear” in the last four or five years.

A key concept of the Cynefin framework is how easy it is for a situation to “fall off the cliff” from clear to chaotic, and how easy it isto miss it when this happens.

For another model, see VUCA (volatility, uncertainty, complexity, ambiguity) by Warren Bennis and Bert Nanus, 1987.

Page 27: Prioritisation

Finances and contracts might affect our behaviour. For example, if we’ve already breached SLA on one incident and are going to suffer consequences regardless of how fast we resolve it, we might shift resources to other incidents that haven’t yet breached SLA.

We might also prioritise based on task complexity, our reputation, or our morale (quick wins are good for the team).

Pages 29–30: Key PSF metrics

I dislike the way the slides teach all the PSFs together and then all the Key Metrics together, many slides later. On my first read of the material I completely missed the concept that key metrics belong to practice success factors.

So instead of using the slides for the key metrics, see § 2.5 in the appropriate practice guide.

*cough*exam hint*cough* The exam will ask questions like “Which of the following is NOT a key metric for the XYZ practice success factor of the ABC practice?”

§ 3.1.1 Incident handling and resolution

Additional inputs?
• Change register.

Page 33: Periodic incident review

Incident review (particularly for major incidents) includes determining who is financially responsible (or more bluntly, who’s getting fired).

Page 37: Major incident manager (MIM)

The organisation might also have a dedicated media team for major incidents, reporting to the Major Incident Manager.

Page 38: Tiered vs flat structures

With tiered teams, each team has its own backlog and prioritisation process, so incidents might not keep their priority when passed to another team.

Page 40: Information needed

The effectiveness of all the practices (and of almost everything) depends on the quality of information, so you might think this slide is obvious and can be skipped. Well, you are right, it is obvious, but people often miss obvious things so we shouldn’t skip them. :-)

Also, this slide lets us know what information is particularly important to this practice.

Page 43: Partners and Suppliers section

Suppliers often only look at their little slice of the overall service. Often this is all they can do.

For example: One supplier is responsible for the IP phones, another for the physical cables, and a third for the network infrastructure. The customer is refusing to take responsibility for their staff damaging cables, just yelling “The phones don’t work. Fix it!” :-(

Page 44: Provision of software tools and consulting

Generally it’s better if everyone (SD, INM, PRM, KM, etc) uses the same tool, even if it’s not the best tool. Why? Because data exchange always sucks.

Page 45: Recommendations for the success of…

The slide deck presents these topics as if they were just a topic in the Partners and Suppliers section. They are not. “Recommendations” is a separate section in the practice guides (§ 8) and should get its own section in the slide deck, complete with a purple-background title slide.

Module 3: Service Desk (SD)

Purpose

The service desk also captures demands for things we don’t provide. These should be captured because they are opportunities.

It is the single point of contact for communication from the service provder to its users (a point that is often missed).

The service desk agents probably know more about your users than anyone in the organisation. You should consult them when crafting and sending messages to users.

Page 52: Service empathy

This is based on cognitive empathy.

For the mental and emotional health of the SDAs, emotional/affective empathy is not desired.

§ 3.1.1 User query handling

Additional inputs?
• Knowledge management (it is listed in § 5.1).

Page 59: User query handling

Note that a user query record is not the same as an incident record. For example, 50 people might contact the SD about one incident.

A good automation tool will link all these records together.

Page 60: Communicating to users

Note that “radio” is not a channel. “An ad during the morning show on The Sound” is a channel.

Page 70: Information needed for effective service desk practice

In my experience, the service desk is often the last to know about changes, so I’d emphasise bullet 4.

Page 73: Third parties performing service desk activities

To discuss: Why do service desks often have high turnover?

Page 76: Review Question 3

I struggled with this question, initially putting A as the answer. I felt that B was wrong because B says “do a review” but we’ve already done a review – it’s what identified the touchpoints.

The reason why A is wrong is that those touchpoints might apply to other value streams so we shouldn’t eliminate them until we’ve reviewed those streams.
C is definitely wrong.
D is funny, but still wrong.
That leaves B as the best answer, and the point of multi-choice questions is not to answer the question, but to pick the best answer from those given.

Module 4: Montoring and Event Management (MEM)

Purpose

Monitoring is recording metrics. Event management is looking for changes of state.

Big picture: Events happen, whether or not we monitor them. Events and metrics are inputs into MEM. Alerts are one of the outputs (the monitored data is of course also an output). MEM is the tracking and recording of events and metrics, and then providing this information to relevant parties.

Key points: observing, analysing, and responding.

Page 82: Establishing and maintaining approaches/models

“Common classification scheme” just means something like “info, warning, critical”. The advice is to keep it general but consistent. That way, you can say something like “critical events must be responded to inside two days” in multiple processes.

Page 82: Ensuring that monitoring data is available

“Ground truth” is a term that is used very infrequently (see § 2.4.2 in the incident management practice guide). It means “what is actually happening on the ground” (as opposed to what we infer is happing based on the information we have). Judging from the discussion during Adrian’s class, it’s not a common term in Australian or New Zealand business.

Note “stakeholders”; not “customers” or “consumers”.

Page 86: Using monitoring to manage events

A better phrasing might be: Reactive monitoring is responding to an event. Proactive monitoring is trying to predict when an event is going to happen.

Note that the terms “proactive” and “reactive” are also used for INM and PRM (see the Recap note below).

Page 86: Types of monitoring and event management

I’m not sure I agree with the text in the boxes. I’m not sure there is even a correlation between reactive/proactive and active/passive. Judging by how much discussion this generated during Adrian’s class, the attendees all felt the same.

Page 88: Event groups

Poor phrasing. “Group” and “type” are synonyms. They just describe how the Event Classification activity can put events into buckets.

Page 89: Recap

Typo: 2nd bullet point, replace “active” with “proactive”.

A class discussion from Adrian – is there a difference between a “major event” and a “major incident”?

Consider if your cloud compute has no more capacity. This is a major event, but not a major incident. At the time when someone tries to spin up a computer resource and it fails for lack of capacity, that’s an incident.

Note that incident management and problem management also include the concepts of “proactive” and “reactive”. Proactive incident/problem management probably(?) requires proactive monitoring and event management. Incident review requires MEM, because one of the questions that should be asked during review is “are there events/metrics we could have been monitoring that would have predicted this incident?”

Page 91: Monitoring planning process workflow

A health model reflects the key events in the services and connections between them. One service can have many models.

A rule set consists of several rules that define how the event messages for a particular event will be processed and evaluated.

A monitoring action plan is defined for each event or group of events. Its purpose is to minimise the impact of the event(s). Action plans become a basis for procedures and automation.

Event correlation (along with aggregation filtering) provides the remedy for over-alerting.

Page 91: Monitoring planning inputs and outputs

“Knowledge articles” refers to articles generated by the creators of the components as well as to our own knowledge base. Note that this is a key output for the Event Handling value stream.

What is a “responsibility matrix”? A project tracking tool that maps people against specific profiles and tasks in a project. One common tool for this is RACI (responsible, accountable, consulted, informed).

Page 101: Providing monitoring and event management capabilities in technology products

The first bullet point is very important. If you can’t measure something, how can you ever determine if it meets SLA?

As an aside, I would hope that a supplier of a product or service is an expert in their products and services.

Page 102: Documentation

(Big) Typo: Replace the last grey box with
“How to integrate their product with other monitoring and event management tools to ensure that rules and responses do not conflict.”

Module 5: Problem Management (PRM)

Page 109: Problems

To discuss: Is “live environment” an important qualifier?
See § 2.2.1 and § 2.3.

Page 110: Benefits of problem management

INM might have found a workaround but not determined why the workaround works. PRM figures out why it works and if there is a better workaround or a systemic fix.

Page 111: Optimizing problem resolution and mitigation

Delivery Hint: Ignore the slides and read § 2.4.2. The slides miss out words that are important.

Note the phrase “business impact”. We should always be using outcome-based thinking.

Why use a balanced approach? Because it is not practically possible to solve every problem. See the second paragraph of § 2.4.2.

Page 112: Key phases of problem management

Adrian commented that Foundation doesn’t mention that a problem could be dismissed at any of these phrases.

Problem-investigation techniques

Fishbone diagram (https://www.cms.gov/medicare/provider-enrollment-and-certification/qapi/downloads/fishbonerevised.pdf).

Five whys technique (https://www.mindtools.com/a3mi00v/5-whys).
Hey, look, another management technique from Toyota. :-)

Kepner and Fourie Root cause analysis (note the downloadable whitepaper) https://www.kepner-fourie.com/root-cause-analysis/.

Page 113: Problem identification

To discuss: Does failing an audit or assessment come under problem management?

Page 114: Reactive and proactive problem management

The term “proactive” in PRM is relative to incidents. Proactive problem management is identifying and (if possible) resolving a problem before it causes an incident.

As we’ll see a bit later, PRM has separate value streams for proactive and reactive problem identification.

For example: A programmer finds an error in some code that could cause an incident. The team develops, deploys, and releases a fix. No incident happened, so this is an example of proactive problem management. Note that the problem existed from the time the original code was written – not from the time the programmer found the error.

For example: A staff member notices the parts cupboard is low on a part and does a one-off order now instead of waiting until the next order cycle. You could say that this was an incident without an underlying problem and call it proactive incident management, but you could also say that the underlying problem is poor inventory control and the systemic solution is changing our restocking procedures. TODO explain/phrase this better

Risk management

Adrian asked, “Is ‘we just identified a risk’ a problem?” The answer is, yes it is. A risk is a known error, with a workaround of “we are accepting the risk”. The systemic solution would be preventing and/or mitigating the risk.

Is PRM is closely entwined with the risk management practice? Yes. The practice guide for the risk management practice actually says the following.

2.3.4 Problem management
Problem management is mostly about risk management. The purpose of the problem management practice is to reduce the likelihood and impact of incidents, by identifying the actual and potential causes of incidents and managing the workarounds and known errors. A potential cause of incidents is a risk and reducing the likelihood and impact of this is a risk management activity. ITIL 4 Foundation notes that ‘problem management activities can be organized as a specific case of risk management: they aim to identify, assess, and control risks in any of the four dimensions of service management. It is useful to adopt risk management tools and techniques for problem management’.
ITIL treats problem management differently to other aspects of risk management. This is due the nature and frequency of problems, and the resources needed to manage them. However, it would be acceptable for an organization to treat all problems as risks and to manage these problems in exactly the same way as other risks.

Page 118: The problem management processes

Typo: In the slide, replace “Reactive problem management” with “Reactive problem identification”.

Page 121: Problem control

We might do problem control on a problem record given to us by someone else (an external supplier, for example). In this case, part of problem control is determining if this problem applies to us or not.

Page 121: Problem control inputs and outputs

Should “problem solutions” be an output here? No. See Foundation § 5.2.8.

Page 122: Error control

Note that “problem is resolved” and “problem is no longer relevant” are different but in both cases we close the problem record.

Page 122: Error control inputs and outputs

Inconsistency: INM uses the phrase “knowledge base”, PRM uses “knowledge management data”.

Page 125: Problem manager

This role is not just “management” – it needs analytical skills as well.

Page 132: Recommendations for the success of problem management (2/2)

Typo in Learner Workbook and in Instructor slide deck: Replace “Publish a list of business problems” with “Publish a list of top business problems”.

Module 6: Service Request Management (SRM)

Purpose

Anything that the service desk gets that isn’t an incident is, by definition, a service request (or at least should be treated as one). When a user asks the provider for something the provider doesn’t actually do, the answer “sorry, we don’t do that” is part of service request management.

Key terms in the definition: agreed, pre-defined, normal part of service delivery.

The “predefined” nature of requests mean that we should be able to predict fulfilment times and therefore set expectations accurately (page 274).

Page 138: Service request guidelines

Shift-Left is particularly effective for the service request practice.

Pages 139-141: Activity: What is a service request?

Note that all 3 of these scenarios ended up at the same place – a technician replacing a toner cartridge (a standard change).

To discuss: Is Scenario 2 is an example of active monitoring or passive monitoring?

Page 142: Practice synergies

Given that the list of service requests available is published in a request catalogue (see a few pages later), I’d add service catalogue management to this list. See § 2.2 in the service catalogue management practice guide.

Page 143: Request catalogue

The term "view" is defined in the service catalogue management practice. It is a subset of a service catalogue, intended for a certain audience. Service catalogue management calls them a "tailored view". See § 2.4.2 in the service catalogue management practice guide.

Page 145: Ensuring fulfilment procedures are optimized -and- Documenting request fulfilment procedures

“The development of request procedures should be integrated early into the product and service lifecycle.”
In my experience, this is often not done, and organisations often suffer from it, yet often refuse to learn the lesson.

On the other hand, as an outsourced IT engineer (which I have been for most of my career), I am usually dealing with organisations where processes have already failed. This means there is a selection bias in my experience, which sometimes makes me very cynical.

Page 149: Service request fulfilment control

The “exception” path could lead to something simple or to some massive project. Both cases are still exceptions and should be dealt with individually.

Page 150: Service request review and optimization inputs and outputs

From § 2.2.4 in the service configuration management practice guide:

Defintion: Configuration management system (CMS)
A set of tools, data, and information that is used to support the service configuration management practice.

Definition: Configuration management database
A database used to store configuration records throughout their lifecycle. It also maintains the relationships between configuration records.

Page 157: Automation tools

Measurement aids improvement, not control — control requires knowing what is happening now; analysis and reporting tools can only tell us what happened in the past. See table 5.1, note that only two of the items include the word “control”.

§ 5.2 Automation and tooling

In table 5.2, what are "content management tools"? Is this a typo? Is it supposed to be "publishing tools"?

Page 160: Service integration and management (SIAM)

Error: In the SIAM model diagram (figure 6.1) in the SRM practice guide, in model 1, the “service integration” box should be crimson, not blue.

Also, the first bullet point is the only place the practice guide uses the word “retained”. It means that the organisation retains control of suppliers rather than outsourcing that control to another organisation.

Links:
https://itsm.tools/siam-success-10-key-steps/
https://www.bmc.com/blogs/service-integration-and-management-siam-for-beginners/