Archive | post-mortems

Postmortem reviews – “root cause seduction”

I promise that this will be my last reference to this article, “The infinite hows”, on the O’Reilly website. My last quote refers again to the investigations carried out after things go wrong – sometimes called postmortems in the IT world. I’ve written here before about how people only really get the answers they want to hear when investigating why shoddy IT happens.

This last quote is attributed to John Carroll (Carroll, 1995), and describes what is called the “root cause seduction”:

The identification of a root cause means that the analysis has found the source of the event and so everyone can focus on fixing the problem. This satisfies people’s need to avoid ambiguous situations in which one lacks essential information to make a decision (Frisch & Baron, 1988) or experiences a salient knowledge gap (Loewenstein, 1993). The seductiveness of singular root causes may also feed into, and be supported by, the general tendency to be overconfident about how much we know (Fischhoff, Slovic, & Lichtenstein, 1977).

I would change that a little. Rather than getting everyone to “focus on fixing the problem”, it’s my experience that people focus on fixing a problem – something that can be loosely attributed to the IT issue that occurred, but that may not be the actual reason the issue occurred – merely the most palatable problem that can be acknowledged and easily resolved.


Asking “how” and getting the right answer

This article, “The infinite hows”, is on the O’Reilly website and is intended as a critique of a standard Business Analysis method – the 5 whys.

It’s quite an interesting article, and it has led me to adjust my own analysis approach somewhat by looking more at the “how” than the “why”. The paragraph quoted below struck me as extremely relevant to the topic of “postmortems”.

I’ve written before on this site about the postmortems carried out after a computer glitch occurs (here), and this quote from Nancy Leveson in her book Engineering a Safer World: Systems Thinking Applied to Safety (Engineering Systems) pretty much sums up my thoughts on how most computer glitch postmortems are carried out:

A final reason why a ‘root cause’ may be selected is that it is politically acceptable as the identified cause. Other events or explanations may be excluded or not examined in depth because they raise issues that are embarrassing to the organization or its contractors or are politically unacceptable.

A postmortem into an IT glitch is normally intended to discover what the reasons for the problem were. This could be an internal investigation or, as in the example of the Ulster Bank IT fiasco, something that has to be made public.

The follow-up to a postmortem investigation is intended to be a set of remediation tasks that make things better. However, the real underlying causes of an issue cannot or will not always be addressed – for example, poor management decisions, cost management and cutbacks, or anything else that could be embarrassing to management. In that situation, the direction of the postmortem can be easily manipulated to ensure that the root cause is something mainly benign that can be “addressed” through simple but ultimately pointless recommendations – improve training, ensure documentation is in place, implement a checklist process.

Earlier in the article, two items are discussed that describe how postmortem investigations can be manipulated in this way – either intentionally or accidentally. These quotes relate more to accident investigations, but they are very relevant to investigations into IT or computer glitches:

“In accident investigation, as in most other human endeavors, we fall prey to the What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. This is a simple recognition of the fact that assumptions about what we are going to see (What-You-Look-For), to a large extent will determine what we actually find (What-You-Find).” Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness Trade-Off

And this, from Sidney Dekker in The Field Guide to Understanding Human Error, 2nd edition (2006):

“We think there is something like the cause of a mishap (sometimes we call it the root cause, or primary cause), and if we look in the rubble hard enough, we will find it there. The reality is that there is no such thing as the cause, or primary cause or root cause. Cause is something we construct, not find. And how we construct causes depends on the accident model that we believe in.”



When the process becomes more important than the required outcome

When something publicly goes wrong in the IT world, you can be sure there’ll be a communication telling us that something is being done – systems being reviewed and processes being updated – to make sure it doesn’t happen again.

This shouldn’t instill as much confidence as it does.

A number of years ago, I worked for a company that had a series of particularly important deadlines staggered throughout the day. The teams responsible had pretty punishing schedules to follow to ensure the work was done in time to meet these deadlines; missing one risked customer discontent and potentially charges and fines.

Deadlines were missed, however. Sometimes it was known that a deadline was going to be missed, and the customer was told in advance. Sometimes deadlines were missed without anybody noticing except the customer – a hugely embarrassing event when it happened.



Why “computer glitch” and not “human glitch”?

I wrote previously that when it comes to analysing what caused what’s known in the media as a “computer glitch”, we’re really looking for the human that caused the glitch. Yet beyond the “computer glitch” reason, it’s rarely acknowledged what actually caused an IT systems issue to occur.

I appreciate that acknowledging this publicly might be akin to a company coming forward with a statement saying:

We confirm that we have people in our company who don’t really understand what they’re doing, and we don’t have sufficient controls in place to prevent them from screwing things up with their incompetence.

Ain’t going to happen!

So a CEO, CIO, or PR department that relies on the “computer glitch” excuse is either trying to pull the wool over the eyes of their customers and anyone else impacted by the problem, or doesn’t really understand what’s going on either, having accepted that excuse from their IT department.


