How to Run a Successful RCA Process with Phoenix Incidents
The Root Cause Analysis (RCA) process is a critical practice for learning from incidents and preventing future failures. By focusing on understanding why a failure occurred, an effective RCA helps teams move beyond a simple fix to a solution that addresses the underlying issues.
This guide will walk you through the core principles and steps to conduct a successful RCA process, highlighting how the Phoenix Incidents platform streamlines this crucial work.
We use best practices and decades of experience to guide you through the most effective way to implement an RCA process.
Key Principles for an Effective RCA
Before you begin, it's essential to establish the right mindset and environment. A successful RCA is built on a foundation of trust and intellectual honesty.
Foster a Blame-Free Culture: Incidents are rarely caused by a single person's error. If a mistake happened, the cause is almost always a process or technology issue. The question you want to ask is: What process or technology allowed or caused the mistakes to occur? All team members should cultivate a culture of psychological safety by focusing on data and process, not individual mistakes. The goal is to learn.
Focus on Facts, Not Assumptions: Psychological biases, such as hindsight bias (the belief that an outcome was predictable) and confirmation bias, can taint the analysis. The Phoenix Incidents Forge app helps you systematically gather all relevant data and provides dedicated sections to document the timeline of events without making premature judgments.
Understand Causal Relationships: An RCA is not just a list of events. It is about understanding the complex interplay of factors that led to the incident. The Five Whys section within the Phoenix Incidents RCA module guides you in drilling down from symptoms to the root causes.
Embrace a "Right Reasons" Mindset: People can be right for the wrong reasons, and wrong for the right reasons. Your analysis should evaluate the thinking and procedures that were in place, not just the correctness of the final decision. The structured format of the Phoenix Incidents RCA module helps you focus on systemic flaws.
The Three Phases of an RCA
Phoenix Incidents guides you through the three main phases of the RCA process. When an incident is resolved, an RCA is automatically created in Jira and linked to the incident as a subtask, with a comprehensive guided experience through the phases below.
There are two things your team needs to do immediately after resolving an incident:
Schedule an RCA meeting. This should include all participants of the Incident, relevant product owners and appropriate leadership.
Begin collecting data about what happened prior to and during the incident.
1. Data Gathering
The first thing your team needs to do after resolving an incident is to start gathering data about what happened before and during the incident. This includes a timeline of events, any related technical errors, a summary of how your team resolved the incident and an explanation of how the incident impacted customers.
This phase is generally best completed asynchronously by the people involved in the incident.
2. Analysis Meeting
This is a synchronous meeting where everyone involved reviews and, more importantly, discusses the root causes and builds mitigating action items to prevent similar future incidents.
This phase is generally best completed in real time with all people involved in the incident, together with other responsible product and engineering leaders. In our experience, doing this in real time tends to produce a better understanding of the incident and its root causes, and more complete action items.
3. Finalize
Either at the completion of the Analysis Meeting, or after review by senior leaders, you should mark the RCA as Finalized. Once Finalized, no changes can be made to the RCA document. At this stage, the RCA document serves both as a reference for sharing learnings and as an input to help resolve future incidents faster.
Root Cause Analysis Meeting: A Step-by-Step Guide
While your company may wish to personalize your RCA meeting process, here is a step-by-step guide to running a best-practice RCA meeting.
1. Meeting Schedule
Ensure that the meeting is scheduled on the calendar and that the majority of key participants have confirmed they will attend before the meeting takes place. It is critical that your leaders reinforce that RCA meetings are central to your company's culture of learning. If you don't have a quorum, we recommend you reschedule the meeting and stress the importance of attendance.
2. Setting the Right Mindset
An effective RCA process always begins with setting the right frame of mind for the group. You should begin the meeting with a standard introduction that hits on the following points:
Our RCA process is blameless. It is counterproductive to place blame on any person's actions: if there were mistakes, the fault lies with our processes and technologies. We're here to figure out what we need to change to prevent similar future outages.
We will review the details of the incident, including the timeline of events that led up to it. Remember that related events may have occurred hours or even many days before the incident: think deployments, configuration changes, infrastructure changes, capacities being breached, etc.
Whilst we will review details, we want to spend the bulk of our time going through the "Five Whys" and creating Action Items.
3. Review the Incident Details
Typically, all the incident details such as Resolution Summary, Impact to Customers and Error messages will have been filled out prior to the meeting. However, if anything is missing, now is the time to update the RCA details. As your team gets used to the process, this part of the RCA will be speedy.
This is also a good time to review incident metrics such as Time to Detect, Acknowledge, Verify and Resolve the incident.
Lastly, have the team confirm the actual Incident Start Date. 99% of the time, the Incident actually started before the Jira Ticket was raised.
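To see why confirming the start date matters, here is a minimal Python sketch of how the incident metrics shift once the real start time is used. It is not part of Phoenix Incidents: the function name, the timestamps, and the choice to measure every metric from the incident start are all assumptions made for illustration, and your organization may define these metrics differently.

```python
from datetime import datetime


def incident_metrics(start, detected, acknowledged, verified, resolved):
    """Return each metric as a timedelta measured from `start`.

    Assumption: all four metrics use the incident start as their origin.
    """
    return {
        "time_to_detect": detected - start,
        "time_to_acknowledge": acknowledged - start,
        "time_to_verify": verified - start,
        "time_to_resolve": resolved - start,
    }


# Hypothetical timestamps, for illustration only.
ticket_raised = datetime(2024, 3, 1, 10, 0)
actual_start = datetime(2024, 3, 1, 9, 20)  # confirmed by the team in the meeting
milestones = dict(
    detected=ticket_raised,
    acknowledged=datetime(2024, 3, 1, 10, 5),
    verified=datetime(2024, 3, 1, 10, 15),
    resolved=datetime(2024, 3, 1, 11, 30),
)

from_ticket = incident_metrics(ticket_raised, **milestones)
from_confirmed = incident_metrics(actual_start, **milestones)

print(from_ticket["time_to_detect"])     # 0:00:00
print(from_confirmed["time_to_detect"])  # 0:40:00
```

In this made-up example, using the ticket creation time as the start hides 40 minutes of undetected impact, which is exactly the kind of gap the team should discuss.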
4. Review the Timeline
Phoenix Incidents automatically fills out some parts of the timeline for you, but it can't capture everything. Review the timeline as a team. As with the incident details, once your team has a solid process in place, the timeline will be mostly complete by this point. It is not uncommon for someone to bring up, during the meeting, a couple of events that happened days before the incident occurred.
5. Five Whys
The "Five Whys" is a simple, iterative technique for finding the root cause of a problem by repeatedly asking the question "Why?" (much like a toddler might ask repeatedly).
You start by asking the group "Why did this happen?" and then continue to ask "Why" until you reach a cause that can be addressed or prevented. Don't be afraid to ask "Why" one more time if you feel there is an underlying process or cause that could be addressed.
It is common for there to be multiple paths of Why questions too, and multiple root causes for the incident. For example, if an answer to a question is "The disk space filled up on a virtual machine", you may want to explore both:
Why we didn't get alerted earlier before the disk space became critical; and
What changed recently to fill up the disk.
Lastly, if you feel the group hasn't uncovered all the preventative causes, you can always ask the group "How could we have caught this earlier?".
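Because the Why questions can branch, it can help to picture them as a tree rather than a single chain. The sketch below is only an illustration in Python, not how the Phoenix Incidents RCA module stores its data: the `Why` class and `candidate_root_causes` function are hypothetical names, and every answer other than the disk space one is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Why:
    """One answer to a "Why?" question, with any follow-up answers beneath it."""
    answer: str
    children: List["Why"] = field(default_factory=list)


def candidate_root_causes(node):
    """Return the answer at the end of every Why path (the leaves of the tree)."""
    if not node.children:
        return [node.answer]
    causes = []
    for child in node.children:
        causes.extend(candidate_root_causes(child))
    return causes


# Invented example answers; only the disk space answer comes from this guide.
tree = Why("The service stopped responding", [
    Why("The disk space filled up on a virtual machine", [
        Why("We were not alerted before disk space became critical", [
            Why("No disk usage alert was configured for this machine"),
        ]),
        Why("A recent change filled up the disk", [
            Why("Debug logging was left enabled after a deployment"),
        ]),
    ]),
])

for cause in candidate_root_causes(tree):
    print(cause)
```

Each leaf is a cause the group judged addressable, so each one is a natural starting point for an action item.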
6. Root Causes
The root causes section in the RCA module captures Thematic Root Causes. These are general, non-specific root causes that this incident can be attributed to. Spending a minute or so choosing the correct root causes will help your team and leadership uncover systemic issues that cut across specific incidents, products or teams. The Phoenix Incidents reporting module helps your organization uncover cross-cutting concerns, giving the engineering team data to focus on deficiencies above and beyond mitigating specific incidents (e.g. allocating time towards testing, documentation, re-evaluating external vendors, etc.).
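The value of thematic root causes comes from counting them across many incidents. As a rough illustration (this is not the Phoenix Incidents reporting module, and the incident keys, theme names and data shape are all hypothetical), a tally like the following is the kind of view that surfaces a systemic issue:

```python
from collections import Counter

# Hypothetical finalized RCAs, each tagged with one or more thematic root causes.
rcas = [
    {"incident": "INC-1", "themes": ["testing", "documentation"]},
    {"incident": "INC-2", "themes": ["testing"]},
    {"incident": "INC-3", "themes": ["external vendor", "testing"]},
]


def theme_counts(rcas):
    """Count how many RCAs were tagged with each thematic root cause."""
    return Counter(theme for rca in rcas for theme in rca["themes"])


counts = theme_counts(rcas)
for theme, count in counts.most_common():
    print(f"{theme}: {count}")
```

In this invented data set, "testing" appears in all three RCAs, which would suggest allocating time to testing rather than only fixing each incident in isolation.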
7. Action Items
In this phase you want to create action items for teams to mitigate the cause(s) of this incident. Ideally, you will prevent not just a repeat incident but similar and adjacent types of incidents by completing the action items. Here you can create action items in any project in Jira and that team is responsible for ensuring they complete the action item within your company's SLA.
Note that the SLA your company sets for completing these action items depends on the Severity of the incident. For SEV1 incidents, mitigation might be due within 30 days of the incident, whereas for a lower severity it may be due within 90 to 180 days.
Best-practice action items are ones that:
Would have mitigated or prevented this and/or similar incidents from occurring (or at least given the team more warning before an issue).
Can, with reasonable certainty, be completed within the SLA.
Are assigned to a specific person.
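Two of the criteria above, completion within the SLA and a named owner, are easy to check mechanically. The following Python sketch shows one way to do so under stated assumptions: the `SLA_DAYS` values mirror the example figures in this guide (30 days for SEV1, 90 to 180 days for lower severities), the SEV2 and SEV3 labels are assumed, and `ActionItem`, `sla_due_date` and `review_action_item` are hypothetical names rather than anything provided by Phoenix Incidents or Jira. Substitute your company's own SLA.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed SLA in days per severity; replace with your company's policy.
SLA_DAYS = {"SEV1": 30, "SEV2": 90, "SEV3": 180}


@dataclass
class ActionItem:
    summary: str
    assignee: Optional[str]
    target_date: date


def sla_due_date(incident_date, severity):
    """Latest completion date allowed by the SLA for this severity."""
    return incident_date + timedelta(days=SLA_DAYS[severity])


def review_action_item(item, incident_date, severity):
    """Return a list of problems; an empty list means the item passes both checks."""
    problems = []
    if not item.assignee:
        problems.append("not assigned to a specific person")
    if item.target_date > sla_due_date(incident_date, severity):
        problems.append("target date falls outside the SLA")
    return problems


incident_date = date(2024, 3, 1)  # hypothetical incident
good = ActionItem("Add disk usage alert", "alex", date(2024, 3, 20))
weak = ActionItem("Review logging defaults", None, date(2024, 4, 15))

print(sla_due_date(incident_date, "SEV1"))             # 2024-03-31
print(review_action_item(good, incident_date, "SEV1"))  # []
print(review_action_item(weak, incident_date, "SEV1"))
```

The first criterion, whether the item would actually have prevented or mitigated the incident, still needs the judgment of the people in the meeting.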
8. Review and Finalize
This last phase is a simple review of the RCA. Some teams may prefer to review and finalize the RCA immediately after the meeting, whereas others may prefer senior leadership review the RCA prior to Finalizing (locking) it.
Conclusion
A successful RCA process is not about assigning blame; it's about fostering a culture of continuous improvement. With Phoenix Incidents, you transform a negative incident into a powerful learning opportunity, making your team more resilient and your product more reliable. A well-documented RCA in Phoenix Incidents can significantly reduce the length of future outages if similar incidents occur.
What's Next
Complete and Track Action Items: Teams that own action items created during the RCA process should prioritize completing them within your company's SLA. The SLA for action items may differ from incident to incident depending on the Incident's severity.
Leverage the RCA History: By reviewing prior RCAs, either directly or via the reporting dashboard, your team and leaders can learn from past incidents and keep track of post-incident progress.
