RCA Workflow
The RCA process is crucial for learning from incidents and preventing recurrence. The most critical aspect of the RCA is understanding the Root Cause and, from there, creating action items to mitigate future incidents of the same cause.
Phoenix Incidents automatically creates an RCA when an incident moves to "Fixing" or "Resolved."
The RCA follows its own status flow:
Data Gathering: Collect all relevant incident information. Complete this step before the analysis meeting, and have participants review the data in advance to keep the meeting focused and efficient.
Analysis Meeting: Analyze the data as a team to identify root causes and create action items.
Finalized: The RCA process is complete.
Canceled: The RCA is canceled because its parent incident was canceled.
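The status flow above can be sketched as a small state model. This is only an illustration, not Phoenix Incidents code: the names RcaStatus, should_create_rca, and advance are hypothetical, and the sketch assumes that statuses advance strictly in the order listed and that a canceled parent incident cancels any RCA that is not yet finalized.

```python
from enum import Enum


class RcaStatus(Enum):
    """RCA statuses as named in this document."""
    DATA_GATHERING = "Data Gathering"
    ANALYSIS_MEETING = "Analysis Meeting"
    FINALIZED = "Finalized"
    CANCELED = "Canceled"


# Incident statuses that trigger automatic RCA creation, per the text above.
RCA_TRIGGER_STATUSES = {"Fixing", "Resolved"}

# Assumption: the RCA moves forward one step at a time in the listed order.
NEXT_STATUS = {
    RcaStatus.DATA_GATHERING: RcaStatus.ANALYSIS_MEETING,
    RcaStatus.ANALYSIS_MEETING: RcaStatus.FINALIZED,
}


def should_create_rca(incident_status: str) -> bool:
    """Return True if moving to this incident status creates an RCA."""
    return incident_status in RCA_TRIGGER_STATUSES


def advance(status: RcaStatus, parent_canceled: bool = False) -> RcaStatus:
    """Return the next RCA status under the assumptions stated above."""
    # Assumption: a finalized RCA stays finalized even if the parent is canceled.
    if parent_canceled and status is not RcaStatus.FINALIZED:
        return RcaStatus.CANCELED
    # Terminal statuses (Finalized, Canceled) have no successor and stay as-is.
    return NEXT_STATUS.get(status, status)
```

For example, advance(RcaStatus.DATA_GATHERING) returns RcaStatus.ANALYSIS_MEETING, while advance(RcaStatus.ANALYSIS_MEETING, parent_canceled=True) returns RcaStatus.CANCELED.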
Components of the RCA Process
Every RCA document comprises the following key sections, each of which supports our continuous improvement efforts:
Incident Metrics
View and confirm the key metrics for the incident here: the incident type, the affected product, and the following timing metrics:
Time to Detect: This is the time it took your organization to detect that there was an outage. This is defined as the delta between Incident Start and the Creation time of the Incident ticket.
Time to Acknowledge: The time it took a team member to acknowledge the incident (either from a paging system, Slack/Teams or Jira). This is the delta between Creation time and when the Incident moved to Assessing status.
Time to Verify: How long it took the team to verify that this is an actual incident. This is the delta between Creation time and when the Incident moved to Fixing status.
Time to Resolve: The time it took the team to resolve the incident and for the outage to end. It does not include Action Item completion time. This is the delta between Creation time and when the Incident moved to Resolved status.
You can also modify the Incident Start and End times directly in this form.
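As a worked illustration of the four definitions above, the following minimal sketch computes each metric from a set of timestamps. The IncidentTimestamps class, its field names, and the sample times are hypothetical and chosen only for this example; Phoenix Incidents calculates these values for you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimestamps:
    """Hypothetical record of the timestamps the metrics are derived from."""
    incident_start: datetime  # Incident Start (when the outage began)
    created: datetime         # Creation time of the Incident ticket
    assessing: datetime       # when the Incident moved to Assessing
    fixing: datetime          # when the Incident moved to Fixing
    resolved: datetime        # when the Incident moved to Resolved


def timing_metrics(ts: IncidentTimestamps) -> dict[str, timedelta]:
    """Compute the four timing metrics exactly as defined in the list above."""
    return {
        # Incident Start -> ticket creation
        "time_to_detect": ts.created - ts.incident_start,
        # The remaining three are all measured from ticket creation.
        "time_to_acknowledge": ts.assessing - ts.created,
        "time_to_verify": ts.fixing - ts.created,
        "time_to_resolve": ts.resolved - ts.created,
    }


# Illustrative timestamps only, not data from a real incident.
example = IncidentTimestamps(
    incident_start=datetime(2024, 1, 15, 9, 0),
    created=datetime(2024, 1, 15, 9, 12),
    assessing=datetime(2024, 1, 15, 9, 17),
    fixing=datetime(2024, 1, 15, 9, 40),
    resolved=datetime(2024, 1, 15, 11, 2),
)
metrics = timing_metrics(example)
```

With these sample times, Time to Detect is 12 minutes, Time to Acknowledge is 5 minutes, Time to Verify is 28 minutes, and Time to Resolve is 1 hour 50 minutes. Note that only Time to Detect depends on Incident Start, so editing the Incident Start time in the form would change that metric alone under these definitions.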
Incident Details
Impact to Customers: Clearly articulate what your customers experienced during the incident. Framing incidents by customer impact is an effective way to keep your teams focused on customer experience and the true severity of an outage. This section should detail the specific services affected, the scope of the impact (e.g., specific regions, all users), and any direct customer quotes or feedback, if available.
Resolution Summary: Explicitly describe the fix. What specific actions did your team take to resolve the issue? Being precise and detailed in this summary is important because a well-documented RCA can dramatically reduce outage length if a similar incident occurs in the future. Include technical steps, configuration changes, or code deployments.
Error Message: Copy any relevant error messages here. This is incredibly helpful for future investigations, allowing you to quickly find similar RCAs and their resolution summaries by searching for specific error patterns. Include full stack traces or logs if they are concise and directly relevant to the core issue.
Timeline
Phoenix Incidents automatically generates a partial timeline including incident start, incident transitions, and incident end times. It's good practice to meticulously fill out the full timeline with all significant events, actions taken, and key decision points. This exercise helps people recall the incident more accurately, which, in turn, helps uncover root causes and identify effective action items.
Five Whys
Drilling down into the core issue is critical for creating proper action items. Our Phoenix Incidents AI-guided Five Whys module will help you systematically uncover the fundamental root cause by iteratively asking "why" until you reach the underlying problem rather than just a symptom.
Root Cause
Discuss with your team why the root cause happened. You can select multiple root causes if the incident resulted from a combination of factors. This data powers comprehensive reporting that helps you uncover macro trends across your teams, allowing you to focus on bigger-picture systemic issues and prevent recurrence.
Action Items
These are the essential follow-up tasks that must be completed to ensure the incident is not repeated or that its impact is minimized in the future. You can assign these to any team, and all RCA stakeholders should agree on them. Action items should be specific, urgent, and completed within your defined SLA to drive effective change.
