
Incident Response Runbook

Last Updated: 2025-12-04
Status: Active
On-Call: [Configure rotation as needed]

Overview

This runbook provides step-by-step procedures for responding to common production incidents on the Hello World DAO platform.

General Incident Response Process

1. Detect & Alert

  • Monitor alerts from IC Dashboard, PostHog, or manual detection
  • Severity assessment: P1 (Critical), P2 (High), P3 (Medium), P4 (Low)

2. Acknowledge & Communicate

  • Acknowledge alert within 15 minutes
  • Post status update to team channel
  • Estimate time to resolution

3. Investigate & Diagnose

  • Gather logs and metrics
  • Identify root cause
  • Check runbook for known issue

4. Resolve & Verify

  • Apply fix following documented procedure
  • Verify fix resolves issue
  • Monitor for regression

5. Document & Review

  • Document incident details
  • Update runbook if new issue
  • Schedule post-mortem if P1/P2

Severity Levels

P1 - Critical (Response: Immediate)

  • Site completely down
  • Data loss or corruption
  • Security breach
  • Revenue-impacting issue

P2 - High (Response: < 1 hour)

  • Major feature broken
  • Performance degradation > 50%
  • High error rate (> 10%)
  • Canister cycles critically low

P3 - Medium (Response: < 4 hours)

  • Minor feature broken
  • Moderate performance degradation
  • Low error rate (1-10%)
  • Non-critical monitoring alert

P4 - Low (Response: Next business day)

  • Cosmetic issues
  • Minor bugs
  • Documentation updates
  • Optimization opportunities

Common Incidents

Incident #1: Canister Out of Cycles

Symptoms:

  • IC Dashboard shows cycles balance < 500B
  • Canister calls failing with "out of cycles" error
  • 502/503 errors on frontend

Diagnosis:

```bash
# Check canister cycles balance
dfx canister --network mainnet status user-service
dfx canister --network mainnet status frontend
```

Resolution:

```bash
# Top up canister with cycles (requires controller principal)
dfx canister --network mainnet deposit-cycles 5000000000000 user-service
dfx canister --network mainnet deposit-cycles 5000000000000 frontend

# Verify balance increased
dfx canister --network mainnet status user-service
```

Prevention:

  • Set up automated cycles monitoring
  • Configure email alerts for low cycles (< 1T)
  • Maintain cycles wallet with sufficient balance
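The automated monitoring suggested above can be sketched as a small script. This is a sketch, not the deployed monitor: it assumes `dfx canister status` prints a line like `Balance: 4_989_000_000_000 Cycles`, and reuses the 1T alert threshold and canister names from this runbook.

```shell
#!/usr/bin/env bash
# Hedged sketch of a cycles monitor. Assumes dfx status output contains a
# "Balance: N Cycles" line; adjust parsing if your dfx version differs.
THRESHOLD=1000000000000   # alert below 1T cycles, per the guidance above

# Strip digit separators and pull the numeric balance out of a status dump.
parse_balance() {
  grep -oE 'Balance: [0-9_,]+' | tr -d '_,' | awk '{print $2}'
}

check_canister() {
  local name="$1" balance
  balance=$(dfx canister --network mainnet status "$name" 2>/dev/null | parse_balance)
  if [ "${balance:-0}" -lt "$THRESHOLD" ]; then
    echo "ALERT: $name cycles low: ${balance:-unknown}"
  fi
}

# Only query mainnet when dfx is actually installed.
if command -v dfx >/dev/null 2>&1; then
  check_canister user-service
  check_canister frontend
fi
```

Run it from cron or CI on a schedule and wire the ALERT line into your notification channel.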

Escalation:

  • If cycles wallet empty, contact team lead for funding
  • If repeated occurrences, review canister resource usage

Incident #2: High Error Rate (> 10%)

Symptoms:

  • IC Dashboard shows error spike
  • PostHog events missing or delayed
  • User reports of form submission failures

Diagnosis:

```bash
# Check canister logs
dfx canister --network mainnet logs user-service

# Check frontend browser console for errors:
# open https://www.helloworlddao.com in a browser,
# then F12 -> Console tab -> look for red errors
```
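To turn the raw log dump into a quick error count, a small helper can be piped onto the logs command. This is a sketch; the error keywords are assumptions about the log format and may need extending.

```shell
# Hedged sketch: count error-like lines in canister logs. The keyword
# pattern is an assumption; extend it to match your actual log format.
count_errors() {
  grep -ciE 'error|trap|panic' || true   # grep -c exits 1 on zero matches
}

if command -v dfx >/dev/null 2>&1; then
  dfx canister --network mainnet logs user-service | count_errors
fi
```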

Common Causes:

  1. Canister panic/trap
  2. Network connectivity issue
  3. Frontend JavaScript error
  4. Third-party service down (email provider)

Resolution:

If canister panic:

```bash
# Check canister status
dfx canister --network mainnet status user-service

# If stopped, restart
dfx canister --network mainnet start user-service
```

If frontend error:

  • Review error message in browser console
  • Check if recent deployment introduced bug
  • Rollback to previous version if needed

If third-party service down:

  • Check email provider status page
  • Implement retry logic or queue
  • Communicate to users via status page
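The retry suggestion above can be sketched as a generic backoff wrapper. The health-check URL in the usage comment is a placeholder, not the real provider endpoint.

```shell
# Hedged sketch: retry a flaky third-party call with exponential backoff.
retry() {
  local attempts="$1"; shift
  local delay=1 n=1
  while true; do
    "$@" && return 0          # success: stop retrying
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $attempts attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))      # 1s, 2s, 4s, ...
    n=$((n + 1))
  done
}

# Placeholder usage; substitute the real provider's health endpoint:
# retry 3 curl -fsS https://email-provider.example/health
```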

Prevention:

  • Comprehensive testing before deployment
  • Canary deployments for gradual rollout
  • Circuit breakers for third-party dependencies

Incident #3: PostHog Events Not Received

Symptoms:

  • PostHog dashboard shows no events for 15+ minutes
  • Analytics tracking not working

Diagnosis:

  1. Check PostHog status: https://status.posthog.com
  2. Check browser console for PostHog errors
  3. Verify PostHog API key is correct
  4. Check network requests to posthog.com (DevTools -> Network tab)
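One way to verify the key end-to-end is to send a manual event to PostHog's capture API. A sketch only: the `app.posthog.com` host and the `VITE_PUBLIC_POSTHOG_KEY` variable are assumptions; projects on other PostHog regions should use their project's ingestion host.

```shell
# Hedged sketch: send a manual test event to PostHog's capture endpoint.
# POSTHOG_KEY falls back to a placeholder so nothing is sent by accident.
POSTHOG_KEY="${VITE_PUBLIC_POSTHOG_KEY:-phc_placeholder}"

payload=$(cat <<JSON
{"api_key": "$POSTHOG_KEY", "event": "runbook_test_event", "distinct_id": "oncall-test"}
JSON
)

if [ "$POSTHOG_KEY" != "phc_placeholder" ] && command -v curl >/dev/null 2>&1; then
  curl -fsS -X POST https://app.posthog.com/capture/ \
    -H 'Content-Type: application/json' \
    -d "$payload"
fi
```

If the test event appears in the PostHog activity view, the key and network path are fine and the problem is client-side.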

Resolution:

If PostHog service issue:

  • Wait for PostHog to resolve
  • Events may be buffered and sent later
  • Monitor PostHog status page

If configuration issue:

```bash
# Verify PostHog API key in environment
cd /home/coby/git/frontend/app/www
grep VITE_PUBLIC_POSTHOG_KEY .env

# Should match key from PostHog dashboard
```

If client-side tracking blocked:

  • User may have an ad blocker enabled
  • This is expected for some users
  • Not a production issue if it affects only a small percentage of users

Prevention:

  • Monitor PostHog event volume
  • Set up alerts for event drops
  • Regular testing of analytics tracking

Incident #4: Email Verification Not Working

Symptoms:

  • Users report not receiving verification emails
  • PostHog shows email_verification_sent but no email_verification_success

Diagnosis:

  1. Test form submission yourself
  2. Check spam folder
  3. Review oracle-bridge logs
  4. Check email provider status

Resolution:

If emails going to spam:

  • Review email content for spam triggers
  • Configure SPF/DKIM/DMARC records
  • Contact email provider support
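Those DNS records can be spot-checked from the terminal. A sketch: the DKIM selector varies by email provider, so `default` here is only an assumption.

```shell
# Hedged sketch: check SPF/DMARC/DKIM TXT records for the sending domain.
is_spf()   { grep -qi 'v=spf1'; }     # SPF records start with v=spf1
is_dmarc() { grep -qi 'v=DMARC1'; }   # DMARC records start with v=DMARC1

DOMAIN=helloworlddao.com
SELECTOR=default   # assumption: the real DKIM selector depends on the provider

if command -v dig >/dev/null 2>&1; then
  dig +short +time=2 +tries=1 TXT "$DOMAIN" | is_spf \
    && echo "SPF present" || echo "no SPF record found"
  dig +short +time=2 +tries=1 TXT "_dmarc.$DOMAIN" | is_dmarc \
    && echo "DMARC present" || echo "no DMARC record found"
  dig +short +time=2 +tries=1 TXT "$SELECTOR._domainkey.$DOMAIN" | grep -qi 'v=DKIM1' \
    && echo "DKIM present" || echo "no DKIM record for selector $SELECTOR"
fi
```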

If oracle-bridge not sending:

```bash
# Check oracle-bridge service status
cd /home/coby/git/oracle-bridge
npm run logs

# Restart service if needed
npm run restart
```

If email provider issue:

  • Check provider status page
  • Switch to backup provider if available
  • Communicate delay to users

Prevention:

  • Email deliverability monitoring
  • Backup email provider configured
  • Regular deliverability testing

Incident #5: Performance Degradation

Symptoms:

  • Lighthouse performance score < 70
  • Slow page load times (> 5s)
  • User complaints about slowness

Diagnosis:

```bash
# Run Lighthouse audit:
# Chrome DevTools -> Lighthouse -> Run Performance Audit

# Check Core Web Vitals; look for:
# - Large images
# - Slow JavaScript execution
# - Render-blocking resources
```
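The same audit can run headlessly from the CLI, which is handy mid-incident. A sketch: it requires Node and Chrome, and the `RUN_LIGHTHOUSE` guard is just a convention here to avoid accidental runs.

```shell
# Hedged sketch: run Lighthouse from the CLI instead of DevTools.
check_score() {
  # Exit nonzero when the 0-1 performance score is below 0.7 (i.e. < 70)
  awk -v s="$1" 'BEGIN { exit (s < 0.7) ? 1 : 0 }'
}

if command -v npx >/dev/null 2>&1 && [ -n "${RUN_LIGHTHOUSE:-}" ]; then
  npx --yes lighthouse https://www.helloworlddao.com \
    --only-categories=performance \
    --output=json --output-path=./lighthouse-report.json \
    --chrome-flags="--headless"
fi
```

Pull the performance score out of the JSON report and feed it to `check_score` to gate a CI job or page the on-call.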

Common Causes:

  1. Large unoptimized assets
  2. Slow canister responses
  3. Network latency
  4. JavaScript bundle too large

Resolution:

If large assets:

  • Optimize images (compress, resize, WebP format)
  • Lazy load below-the-fold content
  • Implement CDN caching
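For the WebP conversion, the `cwebp` tool from libwebp is a common choice. A sketch: the asset directory and quality setting are assumptions, not this project's actual paths.

```shell
# Hedged sketch: batch-convert PNG assets to WebP with cwebp (libwebp).
ASSETS_DIR="public/images"   # assumption: adjust to the real asset path

if command -v cwebp >/dev/null 2>&1 && [ -d "$ASSETS_DIR" ]; then
  for img in "$ASSETS_DIR"/*.png; do
    [ -e "$img" ] || continue                  # skip if glob matched nothing
    cwebp -q 80 "$img" -o "${img%.png}.webp"   # -q 80: quality/size trade-off
  done
fi
```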

If slow canister:

  • Optimize canister query methods
  • Add caching layer
  • Review database queries

If JavaScript bundle large:

  • Code splitting and lazy loading
  • Remove unused dependencies
  • Tree shaking optimization

Prevention:

  • Performance budgets
  • Automated Lighthouse CI checks
  • Regular performance testing

Incident #6: Form Submission Failures

Symptoms:

  • PostHog shows form submit events but high failure rate
  • User reports form not working

Diagnosis:

  1. Test form yourself on production
  2. Check browser console for errors
  3. Review user-service canister logs
  4. Check network requests (DevTools -> Network tab)

Common Issues:

  • Validation errors
  • Canister method panics
  • Network timeout
  • CORS issues

Resolution:

If validation errors:

  • Review validation rules
  • Ensure frontend matches backend validation
  • Improve error messages to users

If canister panics:

```bash
# Check canister logs
dfx canister --network mainnet logs user-service

# Look for trap/panic messages
# Fix the code bug and redeploy
```

If network timeout:

  • Check canister response times
  • Increase timeout threshold
  • Optimize canister performance

Prevention:

  • End-to-end testing before deployment
  • Input validation testing
  • Performance monitoring

Escalation Procedures

Level 1: On-Call Engineer

  • First responder
  • Handles P3/P4 incidents
  • Escalates P1/P2 incidents

Level 2: Team Lead

  • Handles P1/P2 incidents
  • Provides technical guidance
  • Coordinates cross-team efforts

Level 3: CTO / Engineering Manager

  • Critical business-impacting incidents
  • Decision authority for major changes
  • External communication

Escalation Triggers

  • P1 incident not resolved in 1 hour
  • P2 incident not resolved in 4 hours
  • Incident requires system-wide changes
  • Incident requires vendor coordination

Communication Templates

Incident Notification

[INCIDENT] [P1/P2/P3/P4] Brief description

Status: Investigating / Identified / Resolved
Affected: [Feature/Service]
Started: [Timestamp]
ETA: [Estimated resolution time]

Details: [Brief description of issue]
Impact: [User impact description]

Updates will be provided every [frequency]

Resolution Notification

[RESOLVED] Brief description

Incident resolved at [timestamp]
Duration: [Total time]
Root cause: [Brief explanation]

Resolution: [What was done]
Prevention: [Steps to prevent recurrence]

Post-mortem: [Link or scheduled date]

Post-Incident Review

For P1/P2 incidents, conduct post-mortem within 48 hours:

  1. Timeline: Document incident from detection to resolution
  2. Root Cause: Identify underlying cause, not just symptoms
  3. Impact: Quantify user impact and business cost
  4. Response: Evaluate response effectiveness
  5. Prevention: Define action items to prevent recurrence
  6. Follow-up: Assign owners and deadlines for action items

Contact Information

On-Call Rotation

  • Week of [Date]: [Name] - [Contact]
  • Week of [Date]: [Name] - [Contact]

Key Contacts

  • Team Lead: [Name] - [Email] - [Phone]
  • CTO: [Name] - [Email] - [Phone]
  • DevOps: [Name] - [Email] - [Phone]

External Vendors

Runbook Maintenance

  • Review Frequency: Monthly
  • Owner: DevOps team
  • Last Review: 2025-11-16
  • Next Review: 2025-12-16

Update this runbook after each incident with new procedures or lessons learned.

Specialized Runbooks

For specific incident types, see detailed runbooks:

  • LowCyclesBalance, CriticalCyclesBalance: Cycles Top-Up Procedure
  • HighErrorRate, CriticalErrorRate: High Error Rate Triage
  • CanisterUnresponsive: Canister Unresponsive Recovery
  • Deployment Failure: Deployment Failure Recovery
  • Database/External Service: Database Connectivity Issues

Tabletop Exercise

Conduct quarterly tabletop exercises to validate incident response procedures.

Exercise Schedule

  • Q1: Canister crash recovery (simulated canister stop)
  • Q2: Cycles depletion (monitored low cycles scenario)
  • Q3: Deployment rollback (practice rollback workflow)
  • Q4: Full incident simulation (multi-system failure)

Exercise Procedure

  1. Preparation (1 day before)

    • Notify team of exercise
    • Prepare test scenario
    • Ensure staging environment ready
  2. Execution (1-2 hours)

    • Inject simulated failure
    • Team responds per runbook
    • Document response times and actions
  3. Review (30 minutes after)

    • Debrief with team
    • Identify gaps in runbooks
    • Document improvements

Exercise Checklist

Cycles Top-Up Drill:

  • [ ] Identify canister with low cycles (staging)
  • [ ] Execute top-up procedure
  • [ ] Verify cycles balance increased
  • [ ] Time: < 15 minutes total

Canister Restart Drill:

  • [ ] Stop canister (staging)
  • [ ] Detect via monitoring (or manual)
  • [ ] Execute restart procedure
  • [ ] Verify functionality restored
  • [ ] Time: < 10 minutes total

Rollback Drill:

  • [ ] Deploy test version to staging
  • [ ] Identify rollback target (previous run ID)
  • [ ] Execute rollback workflow
  • [ ] Verify previous version restored
  • [ ] Time: < 5 minutes total

Exercise Documentation

After each exercise, document:

  • Date and participants
  • Scenario description
  • Response timeline
  • Issues identified
  • Runbook updates needed
  • Action items with owners
