Mean Recovery Time (MTTR)
VC-MET-302 — the mean elapsed time from incident detection to verified resolution across closed dispatch tickets, the headline measure of facility response effectiveness.
Definition
Mean Recovery Time (MTTR) is the arithmetic mean, in minutes, of the elapsed
time from incident detection to verified resolution, computed over all dispatch tickets that
reached a RESOLVED state within the trailing window. The clock starts at the
detection timestamp (the metric breach or fault event, not ticket creation) and stops at
operator-verified resolution, not at the provisional auto-clear. Cancelled and duplicate tickets
are excluded.
Why it matters
MTTR is the most direct measure of how quickly the facility returns from an off-nominal state to health. Queue depth tells you how much is open; MTTR tells you how fast it gets fixed. A rising MTTR with stable queue depth means individual incidents are taking longer to fix (typical causes are parts shortages, skill gaps, or recurrent root causes), while a low, stable MTTR underpins every recovery-time service commitment the Directorate holds. CHLORA publishes it on the Operations Dashboard and uses it to tune crew dispatch priority.
Formula
MTTR is the mean of per-ticket recovery durations over the trailing window:
MTTR = ( 1 / N ) · Σ ( t_resolved,i − t_detected,i )
over tickets i resolved in trailing 60 min, in minutes
where:
t_detected = timestamp of the originating breach/fault event
t_resolved = timestamp of operator-verified resolution
N = count of qualifying resolved tickets in window
Exclusions:
- CANCELLED and de-duplicated child tickets
- tickets resolved < 30 s (auto-clear noise)
Segmentation (published alongside, not in headline):
MTTR_P1, MTTR_P2, MTTR_P3, MTTR_P4 by severity band
P95 recovery time reported to expose long-tail incidents.
If N < 5 in window, MTTR carries a LOW_SAMPLE confidence flag.
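The formula, exclusions, severity segmentation and LOW_SAMPLE rule above can be sketched in Python as follows. This is a minimal illustration, not CHLORA's implementation: the `Ticket` fields, the function names, the half-open window boundary, the reading of "resolved < 30 s" as a detection-to-resolution duration, and the nearest-rank P95 method are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import ceil
from statistics import mean


@dataclass
class Ticket:  # hypothetical record shape, not the CHLORA ledger schema
    detected: datetime             # breach/fault event time, not ticket creation
    resolved: datetime             # operator-verified resolution, not auto-clear
    state: str                     # e.g. "RESOLVED", "CANCELLED"
    severity: str                  # "P1" .. "P4"
    duplicate_child: bool = False  # de-duplicated child ticket


WINDOW = timedelta(minutes=60)       # trailing window
NOISE_FLOOR = timedelta(seconds=30)  # shorter recoveries count as auto-clear noise
LOW_SAMPLE_N = 5                     # below this, flag LOW_SAMPLE


def recovery_minutes(tickets, window_end):
    """Return (severity, minutes) for every ticket that qualifies for MTTR."""
    window_start = window_end - WINDOW
    rows = []
    for t in tickets:
        if t.state != "RESOLVED" or t.duplicate_child:
            continue  # cancelled, open, or duplicate tickets are excluded
        if not (window_start < t.resolved <= window_end):
            continue  # resolved outside the trailing window (boundary is assumed)
        elapsed = t.resolved - t.detected
        if elapsed < NOISE_FLOOR:
            continue  # auto-clear noise
        rows.append((t.severity, elapsed.total_seconds() / 60))
    return rows


def p95_nearest_rank(values):
    """Nearest-rank 95th percentile (assumed method; the source names none)."""
    ordered = sorted(values)
    return ordered[ceil(0.95 * len(ordered)) - 1]


def compute_mttr(tickets, window_end):
    rows = recovery_minutes(tickets, window_end)
    if not rows:
        return None  # no resolutions in window; caller decides how to publish
    minutes = [m for _, m in rows]
    by_severity = {}
    for severity, m in rows:
        by_severity.setdefault(severity, []).append(m)
    return {
        "mttr_min": round(mean(minutes)),
        "p95_min": round(p95_nearest_rank(minutes)),
        "n": len(minutes),
        "low_sample": len(minutes) < LOW_SAMPLE_N,
        "by_severity": {
            f"MTTR_{s}": round(mean(v)) for s, v in sorted(by_severity.items())
        },
    }
```

Keeping the filter (`recovery_minutes`) separate from the aggregation means the same qualifying set feeds the headline mean, the per-severity means and the P95, so the three figures cannot disagree about which tickets were counted.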
Inputs
| Channel | Role | Cadence | Reference | Source |
|---|---|---|---|---|
| Detection timestamps | Start of recovery clock | Event-driven | Breach / fault epoch | CHLORA |
| Resolution timestamps | Stop of recovery clock | Event-driven | Operator-verified | CHLORA |
| Severity tags | Per-ticket P1–P4 segmentation | Event-driven | Triage label | CHLORA |
| Dispatch queue (VC-MET-301) | Population of resolved tickets | 1 min | Closed-ticket set | CHLORA |
Units & Scale
MTTR is reported in minutes to whole-minute precision with a P95 long-tail figure alongside. It is a window mean, not additive across zones; a facility MTTR weighted by ticket count is published rather than a simple zone average. The trailing window is 60 minutes, re-evaluated hourly. A LOW_SAMPLE confidence flag is attached when fewer than five qualifying tickets fall in the window, since small N makes the mean volatile.
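To show why the facility figure is weighted by ticket count rather than averaged across zones, here is a small Python sketch. The zone names, values and the `facility_mttr` helper are hypothetical and chosen only to make the arithmetic visible.

```python
def facility_mttr(zones):
    """Ticket-count-weighted MTTR.

    zones maps a zone name to (zone MTTR in minutes, qualifying ticket count).
    Returns None when no zone has qualifying tickets.
    """
    total_tickets = sum(n for _, n in zones.values())
    if total_tickets == 0:
        return None
    return sum(m * n for m, n in zones.values()) / total_tickets


# Hypothetical window: one busy zone recovering quickly, one quiet zone
# with a single slow incident.
zones = {"zone_a": (20.0, 9), "zone_b": (80.0, 1)}

weighted = facility_mttr(zones)                          # (20*9 + 80*1) / 10 = 26.0
simple = sum(m for m, _ in zones.values()) / len(zones)  # (20 + 80) / 2 = 50.0
```

The weighted value equals the mean over all ten tickets, which is what the formula defines. The simple zone average lets the single slow ticket in the quiet zone count as much as the nine tickets in the busy one.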
Sampling & Source
- Recomputed hourly by CHLORA over a trailing 60-minute window of resolved tickets.
- Detection and resolution timestamps both sourced from the CHLORA dispatch ledger.
- Recovery clock runs from breach detection to operator-verified resolution, not ticket creation to auto-clear.
- Stale / low-sample handling: window with N < 5 tickets → value held with LOW_SAMPLE; no resolutions → last value carried forward, marked STALE.
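The stale and low-sample handling in the last bullet can be sketched as a small publishing step. This is an illustration under stated assumptions: the `Published` record and `publish` function are hypothetical names, and "value held with LOW_SAMPLE" is read here as "the computed value is still published, with the flag attached".

```python
from dataclasses import dataclass
from typing import Optional

LOW_SAMPLE_N = 5  # fewer qualifying tickets than this attaches LOW_SAMPLE


@dataclass
class Published:  # hypothetical shape of one published window value
    value_min: Optional[int]  # MTTR in whole minutes, None if never computed
    flag: Optional[str]       # None, "LOW_SAMPLE", or "STALE"


def publish(durations_min, previous):
    """Turn one window's qualifying durations into a published value.

    durations_min: recovery durations (minutes) for the current window.
    previous: the last Published value, or None on the first window.
    """
    n = len(durations_min)
    if n == 0:
        # No resolutions in the window: carry the last value forward as STALE.
        last = previous.value_min if previous is not None else None
        return Published(last, "STALE")
    value = round(sum(durations_min) / n)
    flag = "LOW_SAMPLE" if n < LOW_SAMPLE_N else None
    return Published(value, flag)
```

For example, `publish([], Published(30, None))` returns the carried-forward value 30 with the `STALE` flag, so consumers see continuity on the dashboard but can tell the figure was not recomputed.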
Thresholds
- OK: Incidents cleared within service commitment.
- WARN: Recovery slowing; root-cause review for the window.
- CRIT: Sustained slow recovery; process-failure escalation.
Recent Trend
[Chart: facility MTTR in minutes, last 14 hourly windows.]
Interpretation Guidance
| MTTR Band | Reading | Likely Driver | Action |
|---|---|---|---|
| ≤ 20 min | Fast recovery | Simple incidents, crews available | None; log window as reference. |
| 21–35 min | Nominal | Mixed incident complexity | None; normal operating band. |
| 36–60 min | Slowing | Harder incidents or crew contention | Check parts availability and crew load. |
| 61–120 min | WARN slow | Recurrent root cause or backlog | Window root-cause review; rebalance crews. |
| > 120 min | CRIT stalled | Process failure or long-tail incident | Process-failure escalation; invoke recovery SOP. |
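The band table above maps directly onto a lookup over upper bounds. The sketch below assumes the whole-minute MTTR value as input; the `classify` function name is hypothetical, while the band edges and readings are taken from the table.

```python
# (inclusive upper bound in minutes, reading) in ascending order, from the table
BANDS = [
    (20, "Fast recovery"),
    (35, "Nominal"),
    (60, "Slowing"),
    (120, "WARN slow"),
]


def classify(mttr_min: int) -> str:
    """Return the table's Reading for a whole-minute MTTR value."""
    for upper, reading in BANDS:
        if mttr_min <= upper:
            return reading
    return "CRIT stalled"  # anything above 120 minutes
```

Because MTTR is reported to whole-minute precision, the inclusive upper bounds leave no gaps between bands: 60 reads as "Slowing" and 61 as "WARN slow".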
Related Metrics
- VC-MET-301 Dispatch Queue Depth: Backlog that MTTR drains.
- VC-MET-303 Alarm Acknowledgement Latency: First-touch component of recovery.
- VC-MET-304 Pollinator Drone Charge: Fleet readiness affects response speed.
- VC-MET-310 Substrate Stock Coverage: Parts/consumable stock gates fixes.
- VC-MET-001 Canopy Vitality Index: Vitality CRITs are high-priority tickets.
- VC-MET-401 Containment Integrity Index: P1 recovery held to tighter MTTR.
Related SOPs
- SOP Library — incident response and root-cause-review procedures.
- Window root-cause review on WARN/CRIT MTTR — see the SOP Library.
- Recovery-time monitoring & long-tail escalation — Monitoring Systems.