
📊 Metrics & Definitions

How Phoenix Incidents measures detection, response, recovery, and uptime, and the timestamps behind each metric.

Written by Dave Rochwerger

Phoenix Incidents tracks a handful of timing metrics for every incident, and rolls them up across incidents in reporting. This page defines each one and the timestamps behind it.

🕒 The timestamps behind the metrics

Every metric is built from a few key moments in an incident:

  • Incident Start – when the disruption actually began. You can set this, or it defaults to the creation time, and you can correct it later.

  • Creation – when the incident was raised in Phoenix.

  • Incident End – when service was restored. If you use the Monitoring phase, it is set automatically when the incident moves to Monitoring. Otherwise you set it when you resolve the incident, and in Slack it defaults to the current time. Either way, you can adjust it later, just like Incident Start.

  • Status transitions – the moments the incident first moves to Assessing (acknowledged) and Fixing (verified).

📈 Per-incident metrics

Each incident shows:

  • Time to Detect – from Incident Start to Creation. How long the disruption ran before it was raised in Phoenix.

  • Time to Ack – from Creation to the first time the incident moved to Assessing. How quickly someone took ownership.

  • Time to Verify – from Creation to the first time the incident moved to Fixing. How quickly the team confirmed it was a real incident.

  • Time to Recover – from Incident Start to Incident End. The headline measure of how long the disruption lasted.
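As a rough illustration, the four per-incident metrics can be sketched as simple timestamp subtractions. The `Incident` class, its field names, and `per_incident_metrics` are hypothetical and only mirror the definitions above; they are not Phoenix's actual data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical field names; Phoenix's real data model may differ.
    created: datetime                           # Creation
    start: Optional[datetime] = None            # Incident Start
    end: Optional[datetime] = None              # Incident End
    first_assessing: Optional[datetime] = None  # first move to Assessing
    first_fixing: Optional[datetime] = None     # first move to Fixing

    def __post_init__(self):
        # Incident Start defaults to the creation time when it is not set.
        if self.start is None:
            self.start = self.created


def _span(begin: Optional[datetime], finish: Optional[datetime]) -> Optional[timedelta]:
    # A metric is undefined until both of its timestamps exist.
    if begin is None or finish is None:
        return None
    return finish - begin


def per_incident_metrics(incident: Incident) -> dict:
    return {
        "time_to_detect": _span(incident.start, incident.created),
        "time_to_ack": _span(incident.created, incident.first_assessing),
        "time_to_verify": _span(incident.created, incident.first_fixing),
        "time_to_recover": _span(incident.start, incident.end),
    }


# Example: the disruption began 20 minutes before the incident was raised.
example = Incident(
    created=datetime(2024, 3, 1, 10, 0),
    start=datetime(2024, 3, 1, 9, 40),
    first_assessing=datetime(2024, 3, 1, 10, 5),
    first_fixing=datetime(2024, 3, 1, 10, 20),
    end=datetime(2024, 3, 1, 11, 10),
)
example_metrics = per_incident_metrics(example)
```

In this example, Time to Detect is 20 minutes, Time to Ack is 5 minutes, Time to Verify is 20 minutes, and Time to Recover is 90 minutes. If Incident Start is left unset, it falls back to Creation, so Time to Detect comes out as zero.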

📊 Reporting roll-ups

In reporting, these are averaged across the incidents in the period you select:

  • Mean Time to Ack (MTTA) – the average Time to Ack. Canceled and brand-new incidents are left out.

  • Mean Time to Recovery (MTTR) – the average Time to Recover, shown in hours. Only incidents that have both an Incident Start and an Incident End are counted, and canceled incidents are excluded.

  • Uptime – for each month, the share of time your products were up. Phoenix treats each incident's Incident Start to Incident End as downtime, merges any overlaps so the same period is not counted twice, subtracts that downtime from the total time in the month, and divides what remains by the total. Canceled incidents do not count.
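The roll-ups above can be sketched in a few lines. This is a minimal illustration, not Phoenix's implementation: `mean_hours` and `monthly_uptime` are hypothetical helpers, the caller is assumed to have already filtered out canceled incidents, and clipping downtime to the month boundaries is an assumption this page does not spell out.

```python
from datetime import datetime, timedelta


def mean_hours(durations):
    """Average a list of timedeltas, in hours. Returns None for an empty list."""
    if not durations:
        return None
    total = sum(durations, timedelta())
    return total / len(durations) / timedelta(hours=1)


def monthly_uptime(downtimes, year, month):
    """Share of the month not covered by any (start, end) downtime interval."""
    month_start = datetime(year, month, 1)
    month_end = datetime(year + month // 12, month % 12 + 1, 1)

    # Assumption: downtime is clipped to the month being reported on.
    clipped = sorted(
        (max(start, month_start), min(end, month_end))
        for start, end in downtimes
        if start < month_end and end > month_start
    )

    # Merge overlapping intervals so shared downtime is counted once.
    down = timedelta()
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                down += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        down += cur_end - cur_start

    total = month_end - month_start
    return (total - down) / total


# MTTR over two incidents that took 1 hour and 3 hours to recover.
mttr_hours = mean_hours([timedelta(hours=1), timedelta(hours=3)])

# Two overlapping incidents (3 hours merged) plus a separate 1-hour incident.
march_downtime = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 11, 0)),
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 12, 0)),
    (datetime(2024, 3, 15, 0, 0), datetime(2024, 3, 15, 1, 0)),
]
march_uptime = monthly_uptime(march_downtime, 2024, 3)
```

Here the two overlapping incidents contribute 3 hours of downtime rather than 4, so March has 4 hours of downtime out of 744, for an uptime of roughly 99.46%. The MTTR for the two sample durations is 2.0 hours.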

💡 Why accurate timestamps matter

These metrics are only as good as the timestamps behind them. Most incidents actually began before the ticket was raised, so it is worth confirming the Incident Start and Incident End during the RCA. An accurate Incident Start gives you a truer Time to Detect and Time to Recover, and a more reliable uptime number.
