Production Incident
3am. Payments down. 6 engineers on a call.
It is 3:17am. PagerDuty fires for the payments service. You are on call. Monitoring shows 40% of POST /payments/process requests returning 500 errors; the other 60% succeed. In Slack, a client-facing team is asking for updates. Three engineers will join the bridge call in 2 minutes.
ERROR
WARN
INFO
ERROR
WARN