How to run a root cause analysis.

Most teams fix the symptom and the problem comes back next month. A root cause analysis finds the cause underneath, so the fix actually holds. Here is the method, with the 5 Whys and the fishbone.

11 min read  ·  By Everton Paula

A delivery rate drops. A defect spikes. A queue blows out. The instinct is to do something fast: add people, send a warning, push the team harder. The number recovers for a week, and then the same problem returns, because the thing that caused it was never touched. The team fixed the symptom.

A root cause analysis is the discipline of not doing that. It is a structured way to trace a problem down to the cause that, once removed, stops the problem from coming back. It is one of the oldest tools in operations and one of the most misused, because most people run it as a meeting where everyone shares an opinion, rather than as a method with a defined end state. This is how to run it properly.

What a root cause analysis actually is

A root cause analysis, or RCA, is a method for finding the underlying cause of a problem instead of its visible symptom. The visible symptom is what shows up on the dashboard. The root cause is the process or system failure that produced it. The test for a root cause is simple: if you removed it, would the problem stop recurring? If the answer is no, you have not reached the root cause yet. You have found another symptom on the way down.

RCA is not one technique. It is a goal, and there are a few methods to reach it. The two that carry almost all of operations are the 5 Whys and the fishbone diagram. The skill is knowing which one a problem calls for.

The 5 Whys

The 5 Whys is the method for a problem that runs down a single chain. You state the problem, ask why it happened, then ask why of that answer, and again of the next, until you reach a cause that is a process or system failure, not something a person could fix by trying harder. Five is a guideline, not a law. Sometimes it is three whys, sometimes seven. You stop when removing the cause would prevent recurrence.

Here is a worked example of root cause analysis from a last-mile operation. On-time delivery dropped from 95 percent to 88 percent in a month.

  • Why is on-time delivery down? Couriers are arriving late to pickup.
  • Why are couriers arriving late? They are waiting at the warehouse because orders are not ready.
  • Why are orders not ready? The pick-pack team is starting its batches late.
  • Why is pick-pack starting late? The morning shift now begins 30 minutes after the couriers are scheduled.
  • Why is the shift misaligned? The shift schedule changed last month, and nobody re-baselined the batch start time against the courier schedule.

The root cause is not lazy couriers or a slow pick-pack team, which is where the first instinct pointed. It is a schedule change made without re-baselining the dependent process. The fix is to re-align the batch start time and add a step to the shift-change procedure that re-checks downstream dependencies. Telling couriers to hurry would have done nothing, which is exactly why the problem kept coming back.
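
For teams that keep their analyses in a shared repo or notebook, the chain above can be written down as data. This is a minimal sketch in Python, not a prescribed tool: the `Why` class, the `prevents_recurrence` flag, and `find_root_cause` are names invented here for illustration. The flag records the analyst's judgment on the recurrence test; the code does not make that judgment for you.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str
    # Hypothetical flag: the analyst's call that removing this cause
    # would stop the problem from recurring.
    prevents_recurrence: bool = False


def find_root_cause(chain):
    """Return the first answer that passes the recurrence test, or None."""
    for step in chain:
        if step.prevents_recurrence:
            return step.answer
    return None


# The worked example from the last-mile operation, one entry per why.
chain = [
    Why("Why is on-time delivery down?",
        "Couriers are arriving late to pickup."),
    Why("Why are couriers arriving late?",
        "They are waiting at the warehouse because orders are not ready."),
    Why("Why are orders not ready?",
        "The pick-pack team is starting its batches late."),
    Why("Why is pick-pack starting late?",
        "The morning shift now begins 30 minutes after the couriers are scheduled."),
    Why("Why is the shift misaligned?",
        "The shift schedule changed and nobody re-baselined the batch start time.",
        prevents_recurrence=True),
]

root_cause = find_root_cause(chain)
print(root_cause)
```

Cutting the chain at the fourth why, with `find_root_cause(chain[:4])`, returns `None`. That is the point: a misaligned shift is still a symptom of a schedule change nobody reviewed.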

The fishbone (Ishikawa) diagram

The 5 Whys breaks down when a problem has more than one cause feeding it at once, because a single chain cannot hold several contributing causes. That is what the fishbone diagram, also called the Ishikawa diagram, is for. You write the problem at the head of the fish, and draw bones for the categories of cause, then fill each with the specific causes the team can find.

For operations and service problems, six categories cover most cases:

  • People. Skill, training, staffing, handovers.
  • Process. The steps, the sequence, the dependencies between them.
  • Policy. The rules and decisions that constrain how work is done.
  • Technology. Tools, systems, integrations, and what they do under load.
  • Measurement. What is tracked, how, and whether the number is even right.
  • Environment. Demand spikes, seasonality, external conditions.

Use the fishbone when a defect rate or a complaint volume clearly has more than one driver, and you need to see them laid out before deciding which to attack first. Use the 5 Whys when a single number moved and you suspect there is one chain behind it. A lot of strong analyses use both: a fishbone to surface the candidate causes, then a 5 Whys down the one that looks largest.
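
A fishbone is a problem plus causes grouped under the six categories, so it also fits in a few lines of Python. The sketch below is an illustration under assumptions: `build_fishbone` and `largest_cause` are hypothetical helper names, the three candidate causes are made up for the example, and the "share" numbers are the team's rough estimate of how much each cause contributes, not measured data.

```python
CATEGORIES = ("People", "Process", "Policy",
              "Technology", "Measurement", "Environment")


def build_fishbone(problem, causes):
    """Group (category, cause, estimated_share) tuples under the six bones."""
    bones = {category: [] for category in CATEGORIES}
    for category, cause, share in causes:
        if category not in bones:
            raise ValueError(f"Unknown category: {category}")
        bones[category].append((cause, share))
    return {"problem": problem, "bones": bones}


def largest_cause(fishbone):
    """Return (category, cause, share) for the biggest candidate, or None."""
    candidates = [
        (category, cause, share)
        for category, entries in fishbone["bones"].items()
        for cause, share in entries
    ]
    return max(candidates, key=lambda c: c[2]) if candidates else None


# Illustrative causes and shares only; a real session fills these in.
fishbone = build_fishbone(
    "On-time delivery dropped from 95 percent to 88 percent",
    [
        ("Process", "Batch start not re-baselined after shift change", 0.5),
        ("Technology", "Label printer queue stalls under load", 0.3),
        ("People", "New pickers not trained on batch cutoffs", 0.2),
    ],
)

category, cause, share = largest_cause(fishbone)
print(category, "-", cause)
```

The candidate that comes out on top is the one to take down a 5 Whys chain. Empty bones are worth a second look too: a category with nothing in it may mean nobody asked.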

The format: how to write it down

An RCA that lives only in someone's head is not an RCA; it is a hunch. The format is short and fixed, one page, and it is what makes the analysis reviewable and the fix accountable. A usable root cause analysis format has six parts:

  • Problem statement. The measured gap: what, how much, since when.
  • Impact. What the problem costs while it continues.
  • Evidence. The data and observations the analysis rests on.
  • Root cause. The cause that passes the recurrence test, with the method used to find it.
  • Corrective action. The specific fix, with a named owner and a date.
  • Verification. How and when you will confirm the fix held.

If you want a tool, the 5 Whys and a fishbone both fit on a single sheet of paper or a shared doc. The method matters far more than the software. A root cause analysis tool is whatever lets the team see the chain and agree on it.
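
If the shared doc happens to be code, the six parts map onto a small record with a completeness check. This is a sketch, assuming you want the analysis to refuse to count as done while any part is blank. The class and field names are invented here, and the root cause and corrective action parts are split into their sub-items (method, owner, date) so each can be checked on its own.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class RootCauseAnalysis:
    problem_statement: str = ""   # what, how much, since when
    impact: str = ""              # what the problem costs while it continues
    evidence: str = ""            # data and observations
    root_cause: str = ""          # the cause that passes the recurrence test
    method: str = ""              # how it was found, e.g. "5 Whys"
    corrective_action: str = ""   # the specific fix
    owner: str = ""               # one name
    due: Optional[date] = None    # one date
    verification: str = ""        # how and when the fix is confirmed


def missing_parts(rca):
    """Return the names of the fields that are still empty."""
    return [f.name for f in fields(rca) if not getattr(rca, f.name)]


# A half-finished draft built from the delivery example.
draft = RootCauseAnalysis(
    problem_statement="On-time delivery dropped from 95 to 88 percent in a month.",
    root_cause="Shift schedule changed without re-baselining the batch start time.",
    method="5 Whys",
    corrective_action="Re-align batch start; add a dependency check to the shift-change procedure.",
)

print(missing_parts(draft))
# ['impact', 'evidence', 'owner', 'due', 'verification']
```

The draft has a root cause and a fix, which is where most analyses stop. The list it prints is the work still owed.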

The corrective action is the point

An RCA with no corrective action is an essay. The whole exercise exists to produce a specific fix with an owner and a date, and to verify it held. That last part is where most analyses quietly fail: the fix gets agreed in the room, everyone moves on, and nobody checks three weeks later whether the number actually recovered and stayed recovered. The corrective action has to live somewhere it gets reviewed, which is the weekly and monthly operating cadence. Without a cadence to carry it, even a correct root cause analysis evaporates.
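
"Recovered and stayed recovered" can be made concrete as a check in the weekly review. The sketch below is one way to do it, under assumptions: `fix_held` is a hypothetical name, the three-week window is a default you should set for your own cadence, and the weekly readings are invented for the example.

```python
def fix_held(weekly_values, target, weeks_required=3, higher_is_better=True):
    """True only if the most recent `weeks_required` readings all meet the target."""
    recent = weekly_values[-weeks_required:]
    if len(recent) < weeks_required:
        return False  # not enough weeks of data to call it yet
    if higher_is_better:
        return all(value >= target for value in recent)
    return all(value <= target for value in recent)


# Invented weekly on-time delivery rates after a fix, target 95 percent.
recovered = fix_held([88, 91, 95, 96, 95], target=95)
relapsed = fix_held([88, 95, 96, 90], target=95)

# For a metric where lower is better, such as a defect rate, flip the flag.
defects_ok = fix_held([6, 4, 3, 3, 3], target=3, higher_is_better=False)

print(recovered, relapsed, defects_ok)
```

A single good week does not pass the check. That is deliberate: the number recovering for a week is exactly what a symptom fix looks like.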

In one operation I ran, we built a taxonomy that mapped customer complaints to specific root causes, then attacked the causes rather than the complaint volume. That was a meaningful part of cutting a defect rate from 6 percent to 3 percent in six months. The analyses were not complicated. The discipline of running them every week, and verifying the fixes held, was the hard part.

The three ways it goes wrong

Stopping too early. The team reaches a cause that is satisfying to blame and stops there, usually one level above the real cause. "The agent made an error" is almost never a root cause. Why was the error possible? Why did the process allow it? Keep going until you reach the system.

Blaming people instead of process. If the root cause is a name, you have probably stopped too early. People operate inside a process and a set of policies. A good RCA assumes a competent person and asks what in the system let the failure happen. It is more useful and it is more true.

No owner on the corrective action. A fix that belongs to everyone belongs to no one. Every corrective action needs one name and one date, reviewed until it is closed. Without that, the analysis was a conversation.

Why this is operating work

Root cause analysis is a Lean Six Sigma staple, and it is one of the first habits I install in an operation, because a team that fixes causes instead of symptoms stops fighting the same fire every month. It is also a small example of the larger job: building the operating disciplines that let a company improve on purpose rather than by luck, and that keep running after the person who installed them moves on.

That is the work Plenor does. If your team keeps fixing the same problems and you want an outside operator to read the operation and rank what to fix at the root, the one-week, fixed-fee Operating Teardown is where to start. If you need someone to own the operation while you build it, that is fractional COO work.
