Incident Response Runbook
Last Updated: 2025-12-04
Status: Active
On-Call: [Configure rotation as needed]
Overview
This runbook provides step-by-step procedures for responding to common production incidents on the Hello World DAO platform.
General Incident Response Process
1. Detect & Alert
- Monitor alerts from the IC Dashboard and PostHog, and watch for manually reported issues
- Assess severity: P1 (Critical), P2 (High), P3 (Medium), P4 (Low)
2. Acknowledge & Communicate
- Acknowledge the alert within 15 minutes (immediately for P1)
- Post status update to team channel
- Estimate time to resolution
3. Investigate & Diagnose
- Gather logs and metrics
- Identify root cause
- Check this runbook for a matching known issue
4. Resolve & Verify
- Apply fix following documented procedure
- Verify fix resolves issue
- Monitor for regression
5. Document & Review
- Document incident details
- Update this runbook if the issue is new
- Schedule a post-mortem for P1/P2 incidents
Severity Levels
P1 - Critical (Response: Immediate)
- Site completely down
- Data loss or corruption
- Security breach
- Revenue-impacting issue
P2 - High (Response: < 1 hour)
- Major feature broken
- Performance degradation > 50%
- High error rate (> 10%)
- Canister cycles critically low
P3 - Medium (Response: < 4 hours)
- Minor feature broken
- Moderate performance degradation
- Low error rate (1-10%)
- Non-critical monitoring alert
P4 - Low (Response: Next business day)
- Cosmetic issues
- Minor bugs
- Documentation updates
- Optimization opportunities
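The error-rate criteria above can be expressed as a small helper. This is a minimal sketch: the function name `severity_for_error_rate` is hypothetical, only the error-rate thresholds from the P2 and P3 lists are encoded, and treating rates below 1% as not needing a page is an assumption. The other criteria (outages, data loss, security breaches) still need human judgment.

```shell
# Hypothetical helper: map an observed error rate (percent) to a severity level.
severity_for_error_rate() {
  local rate="$1"
  awk -v r="$rate" 'BEGIN {
    if (r > 10)      print "P2"    # High error rate (> 10%)
    else if (r >= 1) print "P3"    # Low error rate (1-10%)
    else             print "none"  # Assumption: below 1% is not paged
  }'
}
```

For example, `severity_for_error_rate 12.5` prints `P2`.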
Common Incidents
Incident #1: Canister Out of Cycles
Symptoms:
- IC Dashboard shows cycles balance < 500B
- Canister calls failing with "out of cycles" error
- 502/503 errors on frontend
Diagnosis:
# Check canister cycles balance
dfx canister --network mainnet status user-service
dfx canister --network mainnet status frontend
Resolution:
# Top up canister with cycles (requires controller principal)
dfx canister --network mainnet deposit-cycles 5000000000000 user-service
dfx canister --network mainnet deposit-cycles 5000000000000 frontend
# Verify balance increased
dfx canister --network mainnet status user-service
Prevention:
- Set up automated cycles monitoring
- Configure email alerts for low cycles (< 1T)
- Maintain cycles wallet with sufficient balance
Escalation:
- If cycles wallet empty, contact team lead for funding
- If repeated occurrences, review canister resource usage
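The automated cycles monitoring suggested under Prevention could start from a check like the one below. It is a sketch under stated assumptions: the helper names are hypothetical, and it assumes the `dfx canister status` output contains a line of the form `Balance: 3_092_211_540_987 Cycles`. The thresholds are the 500B and 1T values used in this runbook.

```shell
# Thresholds from this runbook: < 500B is an incident symptom, < 1T is the alert level.
CRITICAL_CYCLES=500000000000
LOW_CYCLES=1000000000000

# Extract the numeric balance from `dfx canister status` output read on stdin.
# Assumption: the output has a line like "Balance: 3_092_211_540_987 Cycles".
parse_cycles_balance() {
  awk '$1 == "Balance:" { gsub(/_/, "", $2); print $2; exit }'
}

# Classify a balance (in cycles) as critical, low, or ok.
classify_cycles() {
  local balance="$1"
  if [ "$balance" -lt "$CRITICAL_CYCLES" ]; then
    echo "critical"
  elif [ "$balance" -lt "$LOW_CYCLES" ]; then
    echo "low"
  else
    echo "ok"
  fi
}

# Usage (requires dfx and a controller identity; 2>&1 in case status is written to stderr):
#   balance=$(dfx canister --network mainnet status user-service 2>&1 | parse_cycles_balance)
#   classify_cycles "$balance"
```

Running this from a scheduled job and alerting on `low` or `critical` would cover both canisters named above. The alert delivery itself is left out.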
Incident #2: High Error Rate (> 10%)
Symptoms:
- IC Dashboard shows error spike
- PostHog events missing or delayed
- User reports of form submission failures
Diagnosis:
# Check canister logs
dfx canister --network mainnet logs user-service
# Check frontend browser console for errors
# Open https://www.helloworlddao.com in browser
# F12 -> Console tab -> Look for red errors
Common Causes:
- Canister panic/trap
- Network connectivity issue
- Frontend JavaScript error
- Third-party service down (email provider)
Resolution:
If canister panic:
# Check canister status
dfx canister --network mainnet status user-service
# If stopped, restart
dfx canister --network mainnet start user-service
If frontend error:
- Review error message in browser console
- Check if recent deployment introduced bug
- Rollback to previous version if needed
If third-party service down:
- Check email provider status page
- Implement retry logic or queue
- Communicate to users via status page
Prevention:
- Comprehensive testing before deployment
- Canary deployments for gradual rollout
- Circuit breakers for third-party dependencies
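The retry logic mentioned for third-party outages can be sketched as a generic wrapper with exponential backoff. The function name and argument order are hypothetical, and this is not the platform's actual implementation. The wrapper retries any command and doubles the delay after each failure.

```shell
# Retry a command, doubling the delay after each failed attempt.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 4 2 dfx canister --network mainnet status user-service` would try up to four times, waiting 2, 4, and 8 seconds between attempts. Capping the number of attempts keeps a persistent outage from turning into an endless retry loop.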
Incident #3: PostHog Events Not Received
Symptoms:
- PostHog dashboard shows no events for 15+ minutes
- Analytics tracking not working
Diagnosis:
- Check PostHog status: https://status.posthog.com
- Check browser console for PostHog errors
- Verify PostHog API key is correct
- Check network requests to posthog.com (DevTools -> Network tab)
Resolution:
If PostHog service issue:
- Wait for PostHog to resolve
- Events may be buffered and sent later
- Monitor PostHog status page
If configuration issue:
# Verify PostHog API key in environment
cd /home/coby/git/frontend/app/www
grep VITE_PUBLIC_POSTHOG_KEY .env
# Should match key from PostHog dashboard
If client-side tracking blocked:
- User may have an ad blocker enabled
- This is expected for some users
- Not a production issue if it affects only a small percentage of users
Prevention:
- Monitor PostHog event volume
- Set up alerts for event drops
- Regular testing of analytics tracking
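The configuration check above can be scripted so the comparison is not done by eye. This is a sketch: the helper names are hypothetical, and the expected key is passed in by the operator after copying it from the PostHog dashboard. `phc_example` in the usage note is a placeholder, not a real key.

```shell
# Print the value of a KEY=value entry from an env file (last match wins, quotes stripped).
read_env_value() {
  local key="$1" file="$2"
  grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2- | tr -d "\"'"
}

# Succeed only if VITE_PUBLIC_POSTHOG_KEY in the env file equals the expected value.
posthog_key_matches() {
  local expected="$1" file="${2:-.env}"
  [ "$(read_env_value VITE_PUBLIC_POSTHOG_KEY "$file")" = "$expected" ]
}
```

Run from the frontend directory, `posthog_key_matches phc_example .env || echo "PostHog key mismatch"` reports a mismatch without printing the stored key.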
Incident #4: Email Verification Not Working
Symptoms:
- Users report not receiving verification emails
- PostHog shows `email_verification_sent` but no `email_verification_success`
Diagnosis:
- Test form submission yourself
- Check spam folder
- Review oracle-bridge logs
- Check email provider status
Resolution:
If emails going to spam:
- Review email content for spam triggers
- Configure SPF/DKIM/DMARC records
- Contact email provider support
If oracle-bridge not sending:
# Check oracle-bridge service status
cd /home/coby/git/oracle-bridge
npm run logs
# Restart service if needed
npm run restart
If email provider issue:
- Check provider status page
- Switch to backup provider if available
- Communicate delay to users
Prevention:
- Email deliverability monitoring
- Backup email provider configured
- Regular deliverability testing
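For the SPF/DKIM/DMARC step, a quick DNS check shows whether the records exist at all. This is a sketch with hypothetical helper names. It assumes mail is sent from `helloworlddao.com`, which this runbook does not confirm. DKIM is omitted because its record name depends on a provider-specific selector.

```shell
# Succeed if the TXT records on stdin contain an SPF policy.
has_spf_record() {
  grep -q 'v=spf1'
}

# Succeed if the TXT records on stdin contain a DMARC policy.
has_dmarc_record() {
  grep -q 'v=DMARC1'
}

# Usage (needs network access and dig; the sending domain is an assumption):
#   dig +short TXT helloworlddao.com | has_spf_record || echo "SPF record missing"
#   dig +short TXT _dmarc.helloworlddao.com | has_dmarc_record || echo "DMARC record missing"
```

A present record is not necessarily a correct one, so a missing-record result is the more useful signal here.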
Incident #5: Performance Degradation
Symptoms:
- Lighthouse performance score < 70
- Slow page load times (> 5s)
- User complaints about slowness
Diagnosis:
# Run Lighthouse audit
# Chrome DevTools -> Lighthouse -> Run Performance Audit
# Check Core Web Vitals
# Look for:
# - Large images
# - Slow JavaScript execution
# - Render-blocking resources
Common Causes:
- Large unoptimized assets
- Slow canister responses
- Network latency
- JavaScript bundle too large
Resolution:
If large assets:
- Optimize images (compress, resize, WebP format)
- Lazy load below-the-fold content
- Implement CDN caching
If slow canister:
- Optimize canister query methods
- Add caching layer
- Review database queries
If JavaScript bundle large:
- Code splitting and lazy loading
- Remove unused dependencies
- Tree shaking optimization
Prevention:
- Performance budgets
- Automated Lighthouse CI checks
- Regular performance testing
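Between Lighthouse runs, a rough response-time probe can flag the "> 5s" symptom from the command line. This is a sketch with a hypothetical helper name. Note that curl's `time_total` measures a single HTTP response, not the full page load that Lighthouse reports, so it is only a coarse signal.

```shell
# Succeed if a load time in seconds exceeds the threshold (default 5s, from the symptoms above).
exceeds_load_threshold() {
  local seconds="$1" threshold="${2:-5}"
  awk -v s="$seconds" -v t="$threshold" 'BEGIN { exit !(s > t) }'
}

# Usage (needs network access):
#   t=$(curl -o /dev/null -s -w '%{time_total}' https://www.helloworlddao.com)
#   exceeds_load_threshold "$t" && echo "slow response: ${t}s"
```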
Incident #6: Form Submission Failures
Symptoms:
- PostHog shows form submit events but high failure rate
- User reports form not working
Diagnosis:
- Test form yourself on production
- Check browser console for errors
- Review user-service canister logs
- Check network requests (DevTools -> Network tab)
Common Issues:
- Validation errors
- Canister method panics
- Network timeout
- CORS issues
Resolution:
If validation errors:
- Review validation rules
- Ensure frontend matches backend validation
- Improve error messages to users
If canister panics:
# Check canister logs
dfx canister --network mainnet logs user-service
# Look for trap/panic messages
# Fix code bug and redeploy
If network timeout:
- Check canister response times
- Increase timeout threshold
- Optimize canister performance
Prevention:
- End-to-end testing before deployment
- Input validation testing
- Performance monitoring
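When scanning canister logs for trap or panic messages, a count is often enough to tell whether panics are the cause. This is a sketch with a hypothetical helper name. The exact wording of trap messages in the logs is an assumption, so the pattern simply matches either word, case-insensitively.

```shell
# Count log lines on stdin that mention a trap or panic (case-insensitive).
count_trap_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
  grep -ciE 'trap|panic' || true
}

# Usage:
#   dfx canister --network mainnet logs user-service | count_trap_lines
```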
Escalation Procedures
Level 1: On-Call Engineer
- First responder
- Handles P3/P4 incidents
- Escalates P1/P2 incidents
Level 2: Team Lead
- Handles P1/P2 incidents
- Provides technical guidance
- Coordinates cross-team efforts
Level 3: CTO / Engineering Manager
- Critical business-impacting incidents
- Decision authority for major changes
- External communication
Escalation Triggers
- P1 incident not resolved in 1 hour
- P2 incident not resolved in 4 hours
- Incident requires system-wide changes
- Incident requires vendor coordination
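The two time-based triggers above can be checked mechanically, for instance from an incident bot or a reminder script. This is a sketch with a hypothetical function name. The other two triggers (system-wide changes, vendor coordination) are judgment calls and are not encoded.

```shell
# Succeed if an unresolved incident has reached its time-based escalation trigger.
# Usage: should_escalate <severity> <minutes_open>
should_escalate() {
  local severity="$1" minutes="$2"
  case "$severity" in
    P1) [ "$minutes" -ge 60 ] ;;   # P1 not resolved in 1 hour
    P2) [ "$minutes" -ge 240 ] ;;  # P2 not resolved in 4 hours
    *)  return 1 ;;                # P3/P4 have no time-based trigger here
  esac
}
```

For example, `should_escalate P1 75 && echo "escalate"` prints `escalate`.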
Communication Templates
Incident Notification
[INCIDENT] [P1/P2/P3/P4] Brief description
Status: Investigating / Identified / Resolved
Affected: [Feature/Service]
Started: [Timestamp]
ETA: [Estimated resolution time]
Details: [Brief description of issue]
Impact: [User impact description]
Updates will be provided every [frequency]
Resolution Notification
[RESOLVED] Brief description
Incident resolved at [timestamp]
Duration: [Total time]
Root cause: [Brief explanation]
Resolution: [What was done]
Prevention: [Steps to prevent recurrence]
Post-mortem: [Link or scheduled date]
Post-Incident Review
For P1/P2 incidents, conduct a post-mortem within 48 hours:
- Timeline: Document incident from detection to resolution
- Root Cause: Identify underlying cause, not just symptoms
- Impact: Quantify user impact and business cost
- Response: Evaluate response effectiveness
- Prevention: Define action items to prevent recurrence
- Follow-up: Assign owners and deadlines for action items
Contact Information
On-Call Rotation
- Week of [Date]: [Name] - [Contact]
- Week of [Date]: [Name] - [Contact]
Key Contacts
- Team Lead: [Name] - [Email] - [Phone]
- CTO: [Name] - [Email] - [Phone]
- DevOps: [Name] - [Email] - [Phone]
External Vendors
- IC Support: https://support.dfinity.org
- PostHog Support: support@posthog.com
- Email Provider: [Support contact]
Runbook Maintenance
- Review Frequency: Monthly
- Owner: DevOps team
- Last Review: 2025-11-16
- Next Review: 2025-12-16
Update this runbook after each incident with new procedures or lessons learned.
Specialized Runbooks
For specific incident types, see detailed runbooks:
| Alert | Runbook |
|---|---|
| LowCyclesBalance, CriticalCyclesBalance | Cycles Top-Up Procedure |
| HighErrorRate, CriticalErrorRate | High Error Rate Triage |
| CanisterUnresponsive | Canister Unresponsive Recovery |
| Deployment Failure | Deployment Failure Recovery |
| Database/External Service | Database Connectivity Issues |
Tabletop Exercise
Conduct quarterly tabletop exercises to validate incident response procedures.
Exercise Schedule
| Quarter | Focus Area | Exercise Type |
|---|---|---|
| Q1 | Canister crash recovery | Simulated canister stop |
| Q2 | Cycles depletion | Monitored low cycles scenario |
| Q3 | Deployment rollback | Practice rollback workflow |
| Q4 | Full incident simulation | Multi-system failure |
Exercise Procedure
Preparation (1 day before)
- Notify team of exercise
- Prepare test scenario
- Ensure staging environment ready
Execution (1-2 hours)
- Inject simulated failure
- Team responds per runbook
- Document response times and actions
Review (30 minutes after)
- Debrief with team
- Identify gaps in runbooks
- Document improvements
Exercise Checklist
Cycles Top-Up Drill:
- [ ] Identify canister with low cycles (staging)
- [ ] Execute top-up procedure
- [ ] Verify cycles balance increased
- [ ] Time: < 15 minutes total
Canister Restart Drill:
- [ ] Stop canister (staging)
- [ ] Detect via monitoring (or manual)
- [ ] Execute restart procedure
- [ ] Verify functionality restored
- [ ] Time: < 10 minutes total
Rollback Drill:
- [ ] Deploy test version to staging
- [ ] Identify rollback target (previous run ID)
- [ ] Execute rollback workflow
- [ ] Verify previous version restored
- [ ] Time: < 5 minutes total
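The time targets in the drills above can be checked from recorded start and end timestamps. This is a sketch: the drill identifiers and function names are hypothetical, and the budgets are the 15, 10, and 5 minute targets from the checklist.

```shell
# Print the time budget in minutes for a named drill (values from the checklist above).
drill_budget_minutes() {
  case "$1" in
    cycles-top-up)    echo 15 ;;
    canister-restart) echo 10 ;;
    rollback)         echo 5 ;;
    *) echo "unknown drill: $1" >&2; return 1 ;;
  esac
}

# Succeed if a drill that ran from start to end (epoch seconds) stayed under its budget.
# Usage: drill_within_budget <drill> <start_epoch> <end_epoch>
drill_within_budget() {
  local drill="$1" start="$2" end="$3" budget
  budget=$(drill_budget_minutes "$drill") || return 1
  [ $((end - start)) -lt $((budget * 60)) ]
}
```

Capturing `start=$(date +%s)` when the failure is injected and `end=$(date +%s)` once functionality is verified gives the two timestamps to pass in.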
Exercise Documentation
After each exercise, document:
- Date and participants
- Scenario description
- Response timeline
- Issues identified
- Runbook updates needed
- Action items with owners