IlmAI Ops · Runbook

On-call playbook · v1.0

If it’s on fire, start here.

Incidents are ranked by blast radius. P0 wakes the on-call engineer. P1 waits until morning. Everything else becomes a follow-up ticket.

P0

Everything is down

Check /status · confirm Vercel status · check Postgres · page Awais

P0

Database unreachable

Drain traffic · promote read replica · restore from nightly dump if needed

P1

AI provider degraded

Flip AI_PROVIDER env to fallback · announce in #cohort-01

Swap rate-limit driver

Upstash Redis is the production driver. The in-memory fallback covers local development and Redis outages. Both drivers expose the same API, so swapping between them requires no code change.

ilmai ~ rate-limit
  ok · driver=redis on next boot
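
As a rough illustration of how two drivers can sit behind one API, here is a minimal sketch. The `RateLimiter` interface, the `MemoryRateLimiter` class, the `createRateLimiter` helper, and the limit values are all hypothetical; the runbook does not show the real driver code or the name of the setting that selects it.

```typescript
// Hypothetical shared interface; the real driver API is not shown in this runbook.
interface RateLimiter {
  // Resolves to true when the request identified by `key` is allowed.
  allow(key: string): Promise<boolean>;
}

// In-memory fixed-window limiter, a stand-in for the dev / outage fallback.
class MemoryRateLimiter implements RateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  async allow(key: string): Promise<boolean> {
    const t = this.now();
    const w = this.windows.get(key);
    // Start a fresh window on first sight of the key or once the old one expires.
    if (w === undefined || t - w.start >= this.windowMs) {
      this.windows.set(key, { start: t, count: 1 });
      return true;
    }
    w.count += 1;
    return w.count <= this.limit;
  }
}

// Chooses a driver once at boot. `driver` would come from an env var whose
// name is an assumption here; `makeRedis` builds the Redis-backed limiter.
function createRateLimiter(
  driver: string | undefined,
  makeRedis: () => RateLimiter,
): RateLimiter {
  // 60 requests per 60 seconds is a placeholder, not a documented limit.
  return driver === "redis" ? makeRedis() : new MemoryRateLimiter(60, 60000);
}
```

Because callers only depend on `allow`, the choice of driver stays a boot-time decision and request handlers never need to know which one is active.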

Restore from backup

Nightly logical dumps land in s3://ilmai-backups/pg/YYYY-MM-DD.dump. Retention: 30 daily, 12 weekly, 12 monthly.

ilmai ~ restore
  ok · verify row counts match audit log before repointing
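
The two mechanical parts of a restore, locating the dump and checking row counts, might be sketched as follows. This assumes the date in the file name is UTC and that expected per-table counts can be derived from the audit log; `dumpUri` and `rowCountMismatches` are hypothetical helper names, not existing tooling.

```typescript
// Builds the dump URI for a given day, following the path pattern above.
// Assumption: the YYYY-MM-DD in the file name is a UTC date.
function dumpUri(day: Date): string {
  const ymd = day.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return `s3://ilmai-backups/pg/${ymd}.dump`;
}

// Table name -> row count.
type RowCounts = Record<string, number>;

// Returns the tables whose restored count differs from the expected count,
// including tables present on only one side. An empty result means the
// counts agree and the restore is a candidate for repointing.
function rowCountMismatches(expected: RowCounts, restored: RowCounts): string[] {
  const tables = Object.keys({ ...expected, ...restored });
  return tables.filter((t) => expected[t] !== restored[t]).sort();
}
```

For example, `rowCountMismatches({ users: 10, posts: 5 }, { users: 10, posts: 4 })` returns `["posts"]`, which should block repointing until the gap is explained.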

Force a feature flag

Kill-switches for AI TA and Community live in env vars so we can flip them without a deploy if a moderation incident escalates.

ilmai ~ flags
  reads from process.env at request time; no cold start needed
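
A request-time flag check could look roughly like the sketch below. The flag names (`FEATURE_AI_TA`, `FEATURE_COMMUNITY`), the accepted "off" values, and the `isFeatureEnabled` helper are assumptions for illustration; the runbook does not list the actual variable names.

```typescript
type Env = Record<string, string | undefined>;

// Values treated as "switched off"; this list is an assumption.
const OFF_VALUES = ["0", "false", "off"];

// Reads the flag from `env` on every call instead of caching it at module
// load, so a changed value is picked up by the next request.
function isFeatureEnabled(env: Env, name: string, defaultOn = true): boolean {
  const raw = env[name];
  if (raw === undefined) return defaultOn;
  return OFF_VALUES.indexOf(raw.trim().toLowerCase()) === -1;
}

// In a request handler this would be called with the live environment, e.g.
//   if (!isFeatureEnabled(process.env, "FEATURE_AI_TA")) { /* return 503 */ }
```

Defaulting to enabled when the variable is unset keeps normal operation free of configuration, while the kill-switch only has to be set during an incident.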

Post-incident

Every P0 gets a write-up within 72 hours. Template lives at /ops/post-mortem-template.md.

  • Blameless: focus on the system, not the engineer.
  • Capture timeline from audit log, not memory.
  • Every post-mortem ends with three action items in Linear.