Incidents Are Inevitable

Every production system will experience incidents — server failures, deployment bugs, database issues, third-party outages. The difference between chaos and professionalism is how you handle them. At Nexis Limited, our incident management process is inspired by SRE practices, focused on rapid detection, structured response, and thorough learning.

Service Level Objectives (SLOs)

SLOs define the reliability targets for your service:

  • Availability SLO: The percentage of time the service is operational (e.g., 99.9% allows about 8.76 hours of downtime per year).
  • Latency SLO: Response time targets (e.g., 95% of requests complete within 200ms).
  • Error rate SLO: The maximum percentage of requests that may fail (e.g., a 0.05% error rate, equivalent to a 99.95% success rate).

SLOs are measured by Service Level Indicators (SLIs) — actual metrics collected from your system. When SLIs breach SLO thresholds, an incident may be declared.
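To make the SLI/SLO relationship concrete, here is a minimal sketch that computes a latency SLI and a success-rate SLI from a batch of request records and reports which SLO targets are breached. The `Request` type, the function names, and the sample data are hypothetical; the targets reuse the example figures from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical request record; real systems would read this from metrics."""
    latency_ms: float
    ok: bool  # True if the request succeeded


def latency_sli(requests: List[Request], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that completed within threshold_ms."""
    if not requests:
        return 1.0  # Assumption: no traffic counts as meeting the target
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


def success_sli(requests: List[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        return 1.0
    return sum(1 for r in requests if r.ok) / len(requests)


def breached_slos(requests: List[Request],
                  latency_target: float = 0.95,
                  success_target: float = 0.9995) -> List[str]:
    """Names of the SLOs whose measured SLI is below target."""
    breaches = []
    if latency_sli(requests) < latency_target:
        breaches.append("latency")
    if success_sli(requests) < success_target:
        breaches.append("error_rate")
    return breaches


# 90 fast successes, 9 slow successes, 1 fast failure
sample = [Request(120, True)] * 90 + [Request(450, True)] * 9 + [Request(80, False)]
print(breached_slos(sample))  # ['latency', 'error_rate']
```

In this sample, 91 of 100 requests finish within 200ms and 99 of 100 succeed, so both targets are missed.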

Error Budgets

The error budget is the difference between 100% and your SLO. A 99.9% availability SLO gives you a 0.1% error budget, or about 43 minutes of downtime in a 30-day month. When the error budget is consumed, the team should slow down feature development and focus on reliability improvements.
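The arithmetic behind an error budget is simple enough to check directly. The sketch below assumes a time-based availability SLO; the function names are illustrative, not part of any particular tool.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability SLO (e.g. 0.999) over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after the downtime already incurred; negative means overspent."""
    return error_budget(slo, window) - downtime


monthly = error_budget(0.999, timedelta(days=30))
yearly = error_budget(0.999, timedelta(days=365))
print(round(monthly.total_seconds() / 60, 1))   # 43.2 (minutes)
print(round(yearly.total_seconds() / 3600, 2))  # 8.76 (hours)

# After a 30-minute outage, roughly 13 minutes of the monthly budget remain.
left = budget_remaining(0.999, timedelta(days=30), timedelta(minutes=30))
print(round(left.total_seconds() / 60, 1))      # 13.2 (minutes)
```

A negative result from `budget_remaining` is the signal, in the sense described above, to shift effort from features to reliability.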

Incident Detection

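  • Monitoring and alerting: Fire automated alerts when SLIs breach SLO thresholds. Tools such as Datadog or Grafana can collect metrics and evaluate alert rules, and PagerDuty can route the resulting pages to the on-call engineer.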
  • Synthetic monitoring: Regularly test critical user journeys from external locations to detect outages that internal monitoring might miss.
  • Customer reports: Sometimes customers detect issues before monitoring does. Provide clear channels for reporting issues and respond quickly.
  • Alert quality: Every alert should be actionable. If an alert fires and the response is "ignore it," the alert needs to be fixed or removed. Alert fatigue is a serious reliability risk.
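As an illustration of synthetic monitoring, the following sketch probes a single URL and judges it healthy only if it answers with a non-error status within a latency limit. It uses only the Python standard library; the `probe` function, its thresholds, and the example URL are assumptions, and a real setup would run such probes on a schedule from several external locations.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # Server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, etc.


def probe(url: str,
          fetch: Callable[[str], Optional[int]] = http_status,
          max_latency_s: float = 2.0) -> dict:
    """Run one synthetic check and report status, latency, and health."""
    start = time.monotonic()
    status = fetch(url)
    elapsed = time.monotonic() - start
    healthy = (status is not None
               and 200 <= status < 400
               and elapsed <= max_latency_s)
    return {"url": url, "status": status, "latency_s": elapsed, "healthy": healthy}
```

A call such as `probe("https://status.example.com/health")` (a placeholder URL) returns a dictionary that an alerting rule could consume. The `fetch` parameter exists so the check can be exercised without network access.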

Incident Response

Roles

  • Incident Commander (IC): Coordinates the response, makes decisions, delegates tasks, and manages communication. Does not debug — manages the process.
  • Technical Lead: Diagnoses the issue and implements the fix. Multiple technical leads may work on different aspects of the incident.
  • Communication Lead: Updates the status page, notifies stakeholders, and communicates with customers.

Response Process

  1. Detect: Alert fires or report received.
  2. Triage: Assess severity and impact. Declare the incident and assign roles.
  3. Mitigate: Restore service as quickly as possible. Roll back, scale up, fail over, or apply a hotfix. Focus on mitigation, not root cause analysis; investigation comes later.
  4. Communicate: Update the status page, notify affected users, and keep stakeholders informed.
  5. Resolve: Confirm the service is restored and stable. Monitor for recurrence.
  6. Follow up: Conduct a post-mortem and implement action items.
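The six steps above can be sketched as a small incident record that timestamps each phase and derives time-to-mitigate. This is a simplified model under stated assumptions: the `Incident` and `Phase` names are hypothetical, and the rule that phases are never recorded out of list order is an illustrative choice, since in a real incident communication usually overlaps with mitigation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import IntEnum
from typing import Dict, Optional


class Phase(IntEnum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    COMMUNICATE = 4
    RESOLVE = 5
    FOLLOW_UP = 6


@dataclass
class Incident:
    title: str
    # Maps each phase to the time it was completed
    completed: Dict[Phase, datetime] = field(default_factory=dict)

    def complete(self, phase: Phase, at: datetime) -> None:
        """Record a phase; assumption: phases may be skipped but not revisited."""
        if self.completed and phase <= max(self.completed):
            latest = max(self.completed).name
            raise ValueError(f"{phase.name} cannot be recorded after {latest}")
        self.completed[phase] = at

    def time_to_mitigate(self) -> Optional[timedelta]:
        """Elapsed time from detection to mitigation, if both are recorded."""
        if Phase.DETECT not in self.completed or Phase.MITIGATE not in self.completed:
            return None
        return self.completed[Phase.MITIGATE] - self.completed[Phase.DETECT]


incident = Incident("checkout latency spike")
t0 = datetime(2024, 1, 1, 12, 0)
incident.complete(Phase.DETECT, t0)
incident.complete(Phase.TRIAGE, t0 + timedelta(minutes=4))
incident.complete(Phase.MITIGATE, t0 + timedelta(minutes=22))
print(incident.time_to_mitigate())  # 0:22:00
```

Keeping timestamps per phase also gives the post-mortem timeline a factual starting point.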

Communication During Incidents

  • Update the status page within 5 minutes of declaring an incident.
  • Provide updates every 20-30 minutes, even if there is no new information.
  • Be transparent about impact — "some users are experiencing slow response times" is better than silence.
  • Communicate the resolution and any user action required when the incident is resolved.
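The update cadence above is easy to encode as a reminder check. The sketch below assumes the 5-minute first-update target and uses the upper end of the 20-30 minute range as the interval; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

FIRST_UPDATE_WITHIN = timedelta(minutes=5)   # First status page update
UPDATE_INTERVAL = timedelta(minutes=30)      # Assumption: upper end of 20-30 minutes


def next_update_due(declared_at: datetime, updates: List[datetime]) -> datetime:
    """When the next status update is due, given the updates posted so far."""
    if not updates:
        return declared_at + FIRST_UPDATE_WITHIN
    return max(updates) + UPDATE_INTERVAL


def is_overdue(declared_at: datetime, updates: List[datetime], now: datetime) -> bool:
    """True if the communication lead has missed the next update deadline."""
    return now > next_update_due(declared_at, updates)


declared = datetime(2024, 1, 1, 12, 0)
posted = [declared + timedelta(minutes=3)]
print(next_update_due(declared, posted))  # 2024-01-01 12:33:00
print(is_overdue(declared, posted, declared + timedelta(minutes=40)))  # True
```

A chat bot or paging rule could call `is_overdue` periodically and nudge the Communication Lead when it returns true.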

Post-Mortems

Conduct a blameless post-mortem within 48 hours of every significant incident:

  • Timeline: Detailed chronological account of what happened.
  • Root cause analysis: Identify the underlying cause (not "human error" — what system allowed the human error to cause an outage?).
  • Impact assessment: Duration, affected users, revenue impact.
  • Action items: Specific, assigned, and time-bound improvements to prevent recurrence.
  • Lessons learned: What went well? What could be improved in the response process?
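One way to keep post-mortems consistent is to treat the sections above as a structured record and check it before publishing. The sketch below is an assumed template, not a standard: the `PostMortem` and `ActionItem` types and the specific checks are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: str = ""              # Who is responsible
    due: Optional[date] = None   # When it should be done


@dataclass
class PostMortem:
    timeline: List[str]
    root_cause: str
    impact: str
    action_items: List[ActionItem] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List the gaps that should be fixed before the document is shared."""
        issues = []
        if not self.timeline:
            issues.append("timeline is empty")
        if self.root_cause.strip().lower() in {"", "human error"}:
            issues.append("root cause should name the system gap, not 'human error'")
        for item in self.action_items:
            if not item.owner:
                issues.append(f"action item '{item.description}' has no owner")
            if item.due is None:
                issues.append(f"action item '{item.description}' has no due date")
        return issues


draft = PostMortem(
    timeline=["12:00 alert fired", "12:22 rollback completed"],
    root_cause="Human error",
    impact="22 minutes of degraded checkout",
    action_items=[ActionItem("Add canary stage to deploys")],
)
for issue in draft.problems():
    print(issue)
```

The draft above is flagged three times: for its root cause, and for an action item with neither an owner nor a due date.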

Blameless means focusing on system improvements, not individual blame. People make mistakes — the question is how to design systems that are resilient to human error.

Building a Reliability Culture

  • Make SLOs visible to the entire team — display dashboards prominently.
  • Practice incident response with game days and chaos engineering.
  • Share post-mortems openly — they are learning opportunities for the entire organization.
  • Celebrate improvements to reliability engineering, not just feature delivery.

Conclusion

Incident management is a skill that improves with practice and process. Define SLOs, build alerting around SLIs, respond with structured roles, communicate transparently, and learn from every incident through blameless post-mortems. Reliability is not about preventing all failures — it is about detecting, responding, and learning faster.

Building your reliability practice? Our team implements SRE practices and incident management processes.