On-call playbook · v1.0
If it’s on fire, start here.
Incidents are ranked by blast radius. P0 wakes the on-call engineer. P1 waits until morning. Everything else becomes a follow-up ticket.
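The ranking above can be sketched as a small routing function. This is only an illustration: the `Action` enum and `route` function are hypothetical names, not part of any existing tooling described here.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the three outcomes described above.
    PAGE_NOW = "wake the on-call engineer"
    NEXT_MORNING = "handle in the morning"
    FOLLOW_UP_TICKET = "file a follow-up ticket"


def route(severity: str) -> Action:
    """Map a severity label to the response the playbook describes."""
    label = severity.strip().upper()
    if label == "P0":
        return Action.PAGE_NOW
    if label == "P1":
        return Action.NEXT_MORNING
    # Anything that is not P0 or P1 becomes a follow-up ticket.
    return Action.FOLLOW_UP_TICKET
```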
Everything is down
Check /status · confirm Vercel status · check Postgres · page Awais
Database unreachable
Drain traffic · promote read replica · restore from nightly dump if needed
AI provider degraded
Flip AI_PROVIDER env to fallback · announce in #cohort-01
Swap rate-limit driver
Upstash Redis is the production driver. The in-memory fallback handles dev and Redis outages. Both share the same API, so swapping drivers needs no code change.
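A minimal sketch of what "same API, different driver" could look like. Everything named here is an assumption for illustration: the `RateLimiter` interface, the `allow` method, the `RATE_LIMIT_DRIVER` variable, and the 60-per-minute limit do not come from the real codebase. The Redis-backed driver is deliberately left out, and the in-memory one uses a simple fixed-window counter.

```python
import os
import time
from typing import Callable, Dict, Mapping, Optional, Protocol, Tuple


class RateLimiter(Protocol):
    """Hypothetical shared interface that every driver would satisfy."""

    def allow(self, key: str) -> bool: ...


class InMemoryRateLimiter:
    """Fixed-window counter held in process memory.

    State is lost on restart and is not shared between instances,
    which is why it suits dev and short outages only.
    """

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        # key -> (window start time, requests counted in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        now = self._clock()
        started, count = self._windows.get(key, (now, 0))
        if now - started >= self.window_seconds:
            started, count = now, 0  # window expired, start a fresh one
        if count >= self.limit:
            self._windows[key] = (started, count)
            return False
        self._windows[key] = (started, count + 1)
        return True


def make_rate_limiter(env: Optional[Mapping[str, str]] = None) -> RateLimiter:
    """Pick a driver from an assumed RATE_LIMIT_DRIVER variable."""
    env = os.environ if env is None else env
    driver = env.get("RATE_LIMIT_DRIVER", "memory")
    if driver == "memory":
        return InMemoryRateLimiter(limit=60, window_seconds=60.0)
    if driver == "upstash":
        raise NotImplementedError("Redis-backed driver is omitted from this sketch")
    raise ValueError(f"unknown rate-limit driver: {driver!r}")
```

Because callers depend only on `allow`, switching drivers under this design is a configuration change rather than a code change.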
Restore from backup
Nightly logical dumps land in s3://ilmai-backups/pg/YYYY-MM-DD.dump. Retention: 30 daily, 12 weekly, 12 monthly.
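When deciding which dump to restore, it can help to compute the expected object path for a date and check whether that date should still be within retention. The sketch below takes the path pattern from the line above. The retention logic is an assumption: the playbook does not say which day's dump is kept as the weekly or monthly copy, so this guesses Sunday dumps and first-of-month dumps. Confirm against the actual lifecycle rules before relying on it.

```python
from datetime import date

BACKUP_PREFIX = "s3://ilmai-backups/pg/"  # path pattern from the playbook


def dump_uri(day: date) -> str:
    """Expected location of the nightly dump for a given date."""
    return f"{BACKUP_PREFIX}{day.isoformat()}.dump"


def is_retained(dump_day: date, today: date) -> bool:
    """Guess whether a dump should still exist under 30 daily / 12 weekly / 12 monthly.

    ASSUMPTIONS (not stated in the playbook): weekly copies are the
    Sunday dumps, monthly copies are the first-of-month dumps.
    """
    age_days = (today - dump_day).days
    if age_days < 0:
        return False  # a dump dated in the future cannot exist yet
    if age_days < 30:
        return True  # inside the daily window
    if dump_day.weekday() == 6 and age_days < 12 * 7:
        return True  # Sunday dump inside the 12-week window
    months_old = (today.year - dump_day.year) * 12 + (today.month - dump_day.month)
    return dump_day.day == 1 and months_old < 12  # monthly window
```

For example, `dump_uri(date(2024, 3, 5))` returns `s3://ilmai-backups/pg/2024-03-05.dump`.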
Force a feature flag
Kill-switches for AI TA and Community live in env vars so we can flip them without a deploy if a moderation incident escalates.
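An env-var kill-switch can be read with a small helper like the one below. This is a sketch under assumptions: the variable names `FEATURE_AI_TA` and `FEATURE_COMMUNITY`, the accepted "off" spellings, and the default of enabled-when-unset are all hypothetical. Use the real variable names from the deployment config.

```python
import os
from typing import Mapping, Optional

# Values treated as "off"; anything else counts as "on". (Assumed convention.)
_OFF_VALUES = {"", "0", "false", "off", "no"}


def flag_enabled(name: str, env: Optional[Mapping[str, str]] = None,
                 default: bool = True) -> bool:
    """Return whether a feature is enabled according to an env var."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default  # variable not set: fall back to the default
    return raw.strip().lower() not in _OFF_VALUES


# Hypothetical usage at a request boundary:
# if not flag_enabled("FEATURE_AI_TA"):
#     return "AI TA is temporarily unavailable"
```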
Post-incident
Every P0 gets a write-up within 72 hours. Template lives at /ops/post-mortem-template.md.
- Blameless: focus on the system, not the engineer.
- Capture timeline from audit log, not memory.
- Every post-mortem ends with three action items in Linear.