
Incident Response Runbook

Last Updated: 2025-12-04
Status: Active
On-Call: [Configure rotation as needed]

Overview

This runbook provides step-by-step procedures for responding to common production incidents on the Hello World DAO platform.

General Incident Response Process

1. Detect & Alert

  • Monitor alerts from IC Dashboard, PostHog, or manual detection
  • Severity assessment: P1 (Critical), P2 (High), P3 (Medium), P4 (Low)

2. Acknowledge & Communicate

  • Acknowledge alert within 15 minutes
  • Post status update to team channel
  • Estimate time to resolution

3. Investigate & Diagnose

  • Gather logs and metrics
  • Identify root cause
  • Check runbook for known issue

4. Resolve & Verify

  • Apply fix following documented procedure
  • Verify fix resolves issue
  • Monitor for regression

5. Document & Review

  • Document incident details
  • Update runbook if new issue
  • Schedule post-mortem if P1/P2

Severity Levels

P1 - Critical (Response: Immediate)

  • Site completely down
  • Data loss or corruption
  • Security breach
  • Revenue-impacting issue

P2 - High (Response: < 1 hour)

  • Major feature broken
  • Performance degradation > 50%
  • High error rate (> 10%)
  • Canister cycles critically low

P3 - Medium (Response: < 4 hours)

  • Minor feature broken
  • Moderate performance degradation
  • Low error rate (1-10%)
  • Non-critical monitoring alert

P4 - Low (Response: Next business day)

  • Cosmetic issues
  • Minor bugs
  • Documentation updates
  • Optimization opportunities

Common Incidents

Incident #1: Canister Out of Cycles

Symptoms:

  • IC Dashboard shows cycles balance < 500B
  • Canister calls failing with "out of cycles" error
  • 502/503 errors on frontend

Diagnosis:

```bash
# Check canister cycles balance
dfx canister --network mainnet status user-service
dfx canister --network mainnet status frontend
```

Resolution:

```bash
# Top up canister with cycles (requires controller principal)
dfx canister --network mainnet deposit-cycles 5000000000000 user-service
dfx canister --network mainnet deposit-cycles 5000000000000 frontend

# Verify balance increased
dfx canister --network mainnet status user-service
```

Prevention:

  • Set up automated cycles monitoring
  • Configure email alerts for low cycles (< 1T)
  • Maintain cycles wallet with sufficient balance
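The automated monitoring suggested above can be sketched as a small script. This is a sketch, not the deployed monitor: it assumes `dfx canister status` prints a line like `Balance: 4_989_000_000_000 Cycles`, and reuses the 1T alert threshold and canister names from this runbook.

```shell
#!/usr/bin/env bash
# Hedged sketch of a cycles monitor. Assumes dfx status output contains a
# "Balance: N Cycles" line; adjust parsing if your dfx version differs.
THRESHOLD=1000000000000   # alert below 1T cycles, per the guidance above

# Strip digit separators and pull the numeric balance out of a status dump.
parse_balance() {
  grep -oE 'Balance: [0-9_,]+' | tr -d '_,' | awk '{print $2}'
}

check_canister() {
  local name="$1" balance
  balance=$(dfx canister --network mainnet status "$name" 2>/dev/null | parse_balance)
  if [ "${balance:-0}" -lt "$THRESHOLD" ]; then
    echo "ALERT: $name cycles low: ${balance:-unknown}"
  fi
}

# Only query mainnet when dfx is actually installed.
if command -v dfx >/dev/null 2>&1; then
  check_canister user-service
  check_canister frontend
fi
```

Run it from cron or CI on a schedule and wire the ALERT line into your notification channel.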

Escalation:

  • If cycles wallet empty, contact team lead for funding
  • If repeated occurrences, review canister resource usage

Incident #2: High Error Rate (> 10%)

Symptoms:

  • IC Dashboard shows error spike
  • PostHog events missing or delayed
  • User reports of form submission failures

Diagnosis:

```bash
# Check canister logs
dfx canister --network mainnet logs user-service

# Check frontend browser console for errors:
# open https://www.helloworlddao.com in a browser,
# then F12 -> Console tab -> look for red errors
```
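To turn the raw log dump into a quick error count, a small helper can be piped onto the logs command. This is a sketch; the error keywords are assumptions about the log format and may need extending.

```shell
# Hedged sketch: count error-like lines in canister logs. The keyword
# pattern is an assumption; extend it to match your actual log format.
count_errors() {
  grep -ciE 'error|trap|panic' || true   # grep -c exits 1 on zero matches
}

if command -v dfx >/dev/null 2>&1; then
  dfx canister --network mainnet logs user-service | count_errors
fi
```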

Common Causes:

  1. Canister panic/trap
  2. Network connectivity issue
  3. Frontend JavaScript error
  4. Third-party service down (email provider)

Resolution:

If canister panic:

```bash
# Check canister status
dfx canister --network mainnet status user-service

# If stopped, restart
dfx canister --network mainnet start user-service
```

If frontend error:

  • Review error message in browser console
  • Check if recent deployment introduced bug
  • Rollback to previous version if needed

If third-party service down:

  • Check email provider status page
  • Implement retry logic or queue
  • Communicate to users via status page
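The retry suggestion above can be sketched as a generic backoff wrapper. The health-check URL in the usage comment is a placeholder, not the real provider endpoint.

```shell
# Hedged sketch: retry a flaky third-party call with exponential backoff.
retry() {
  local attempts="$1"; shift
  local delay=1 n=1
  while true; do
    "$@" && return 0          # success: stop retrying
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $attempts attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))      # 1s, 2s, 4s, ...
    n=$((n + 1))
  done
}

# Placeholder usage; substitute the real provider's health endpoint:
# retry 3 curl -fsS https://email-provider.example/health
```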

Prevention:

  • Comprehensive testing before deployment
  • Canary deployments for gradual rollout
  • Circuit breakers for third-party dependencies

Incident #3: PostHog Events Not Received

Symptoms:

  • PostHog dashboard shows no events for 15+ minutes
  • Analytics tracking not working

Diagnosis:

  1. Check PostHog status: https://status.posthog.com
  2. Check browser console for PostHog errors
  3. Verify PostHog API key is correct
  4. Check network requests to posthog.com (DevTools -> Network tab)
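One way to verify the key end-to-end is to send a manual event to PostHog's capture API. A sketch only: the `app.posthog.com` host and the `VITE_PUBLIC_POSTHOG_KEY` variable are assumptions; projects on other PostHog regions should use their project's ingestion host.

```shell
# Hedged sketch: send a manual test event to PostHog's capture endpoint.
# POSTHOG_KEY falls back to a placeholder so nothing is sent by accident.
POSTHOG_KEY="${VITE_PUBLIC_POSTHOG_KEY:-phc_placeholder}"

payload=$(cat <<JSON
{"api_key": "$POSTHOG_KEY", "event": "runbook_test_event", "distinct_id": "oncall-test"}
JSON
)

if [ "$POSTHOG_KEY" != "phc_placeholder" ] && command -v curl >/dev/null 2>&1; then
  curl -fsS -X POST https://app.posthog.com/capture/ \
    -H 'Content-Type: application/json' \
    -d "$payload"
fi
```

If the test event appears in the PostHog activity view, the key and network path are fine and the problem is client-side.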

Resolution:

If PostHog service issue:

  • Wait for PostHog to resolve
  • Events may be buffered and sent later
  • Monitor PostHog status page

If configuration issue:

```bash
# Verify PostHog API key in environment
cd /home/coby/git/frontend/app/www
grep VITE_PUBLIC_POSTHOG_KEY .env

# Should match key from PostHog dashboard
```

If client-side tracking blocked:

  • User may have an ad blocker enabled
  • This is expected for some users
  • Not a production issue if it affects only a small percentage of users

Prevention:

  • Monitor PostHog event volume
  • Set up alerts for event drops
  • Regular testing of analytics tracking

Incident #4: Email Verification Not Working

Symptoms:

  • Users report not receiving verification emails
  • PostHog shows email_verification_sent but no email_verification_success

Diagnosis:

  1. Test form submission yourself
  2. Check spam folder
  3. Review oracle-bridge logs
  4. Check email provider status

Resolution:

If emails going to spam:

  • Review email content for spam triggers
  • Configure SPF/DKIM/DMARC records
  • Contact email provider support
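Those DNS records can be spot-checked from the terminal. A sketch: the DKIM selector varies by email provider, so `default` here is only an assumption.

```shell
# Hedged sketch: check SPF/DMARC/DKIM TXT records for the sending domain.
is_spf()   { grep -qi 'v=spf1'; }     # SPF records start with v=spf1
is_dmarc() { grep -qi 'v=DMARC1'; }   # DMARC records start with v=DMARC1

DOMAIN=helloworlddao.com
SELECTOR=default   # assumption: the real DKIM selector depends on the provider

if command -v dig >/dev/null 2>&1; then
  dig +short +time=2 +tries=1 TXT "$DOMAIN" | is_spf \
    && echo "SPF present" || echo "no SPF record found"
  dig +short +time=2 +tries=1 TXT "_dmarc.$DOMAIN" | is_dmarc \
    && echo "DMARC present" || echo "no DMARC record found"
  dig +short +time=2 +tries=1 TXT "$SELECTOR._domainkey.$DOMAIN" | grep -qi 'v=DKIM1' \
    && echo "DKIM present" || echo "no DKIM record for selector $SELECTOR"
fi
```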

If oracle-bridge not sending:

```bash
# Check oracle-bridge service status
cd /home/coby/git/oracle-bridge
npm run logs

# Restart service if needed
npm run restart
```

If email provider issue:

  • Check provider status page
  • Switch to backup provider if available
  • Communicate delay to users

Prevention:

  • Email deliverability monitoring
  • Backup email provider configured
  • Regular deliverability testing

Incident #5: Performance Degradation

Symptoms:

  • Lighthouse performance score < 70
  • Slow page load times (> 5s)
  • User complaints about slowness

Diagnosis:

```bash
# Run Lighthouse audit:
# Chrome DevTools -> Lighthouse -> Run Performance Audit

# Check Core Web Vitals; look for:
# - Large images
# - Slow JavaScript execution
# - Render-blocking resources
```
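The same audit can run headlessly from the CLI, which is handy mid-incident. A sketch: it requires Node and Chrome, and the `RUN_LIGHTHOUSE` guard is just a convention here to avoid accidental runs.

```shell
# Hedged sketch: run Lighthouse from the CLI instead of DevTools.
check_score() {
  # Exit nonzero when the 0-1 performance score is below 0.7 (i.e. < 70)
  awk -v s="$1" 'BEGIN { exit (s < 0.7) ? 1 : 0 }'
}

if command -v npx >/dev/null 2>&1 && [ -n "${RUN_LIGHTHOUSE:-}" ]; then
  npx --yes lighthouse https://www.helloworlddao.com \
    --only-categories=performance \
    --output=json --output-path=./lighthouse-report.json \
    --chrome-flags="--headless"
fi
```

Pull the performance score out of the JSON report and feed it to `check_score` to gate a CI job or page the on-call.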

Common Causes:

  1. Large unoptimized assets
  2. Slow canister responses
  3. Network latency
  4. JavaScript bundle too large

Resolution:

If large assets:

  • Optimize images (compress, resize, WebP format)
  • Lazy load below-the-fold content
  • Implement CDN caching
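For the WebP conversion, the `cwebp` tool from libwebp is a common choice. A sketch: the asset directory and quality setting are assumptions, not this project's actual paths.

```shell
# Hedged sketch: batch-convert PNG assets to WebP with cwebp (libwebp).
ASSETS_DIR="public/images"   # assumption: adjust to the real asset path

if command -v cwebp >/dev/null 2>&1 && [ -d "$ASSETS_DIR" ]; then
  for img in "$ASSETS_DIR"/*.png; do
    [ -e "$img" ] || continue                  # skip if glob matched nothing
    cwebp -q 80 "$img" -o "${img%.png}.webp"   # -q 80: quality/size trade-off
  done
fi
```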

If slow canister:

  • Optimize canister query methods
  • Add caching layer
  • Review database queries

If JavaScript bundle large:

  • Code splitting and lazy loading
  • Remove unused dependencies
  • Tree shaking optimization

Prevention:

  • Performance budgets
  • Automated Lighthouse CI checks
  • Regular performance testing

Incident #6: Form Submission Failures

Symptoms:

  • PostHog shows form submit events but high failure rate
  • User reports form not working

Diagnosis:

  1. Test form yourself on production
  2. Check browser console for errors
  3. Review user-service canister logs
  4. Check network requests (DevTools -> Network tab)

Common Issues:

  • Validation errors
  • Canister method panics
  • Network timeout
  • CORS issues

Resolution:

If validation errors:

  • Review validation rules
  • Ensure frontend matches backend validation
  • Improve error messages to users

If canister panics:

```bash
# Check canister logs
dfx canister --network mainnet logs user-service

# Look for trap/panic messages
# Fix the code bug and redeploy
```

If network timeout:

  • Check canister response times
  • Increase timeout threshold
  • Optimize canister performance

Prevention:

  • End-to-end testing before deployment
  • Input validation testing
  • Performance monitoring

Escalation Procedures

Level 1: On-Call Engineer

  • First responder
  • Handles P3/P4 incidents
  • Escalates P1/P2 incidents

Level 2: Team Lead

  • Handles P1/P2 incidents
  • Provides technical guidance
  • Coordinates cross-team efforts

Level 3: CTO / Engineering Manager

  • Critical business-impacting incidents
  • Decision authority for major changes
  • External communication

Escalation Triggers

  • P1 incident not resolved in 1 hour
  • P2 incident not resolved in 4 hours
  • Incident requires system-wide changes
  • Incident requires vendor coordination

Communication Templates

Incident Notification

[INCIDENT] [P1/P2/P3/P4] Brief description

Status: Investigating / Identified / Resolved
Affected: [Feature/Service]
Started: [Timestamp]
ETA: [Estimated resolution time]

Details: [Brief description of issue]
Impact: [User impact description]

Updates will be provided every [frequency]

Resolution Notification

[RESOLVED] Brief description

Incident resolved at [timestamp]
Duration: [Total time]
Root cause: [Brief explanation]

Resolution: [What was done]
Prevention: [Steps to prevent recurrence]

Post-mortem: [Link or scheduled date]

Post-Incident Review

For P1/P2 incidents, conduct post-mortem within 48 hours:

  1. Timeline: Document incident from detection to resolution
  2. Root Cause: Identify underlying cause, not just symptoms
  3. Impact: Quantify user impact and business cost
  4. Response: Evaluate response effectiveness
  5. Prevention: Define action items to prevent recurrence
  6. Follow-up: Assign owners and deadlines for action items

Contact Information

On-Call Rotation

  • Week of [Date]: [Name] - [Contact]
  • Week of [Date]: [Name] - [Contact]

Key Contacts

  • Team Lead: [Name] - [Email] - [Phone]
  • CTO: [Name] - [Email] - [Phone]
  • DevOps: [Name] - [Email] - [Phone]

External Vendors

Runbook Maintenance

  • Review Frequency: Monthly
  • Owner: DevOps team
  • Last Review: 2025-11-16
  • Next Review: 2025-12-16

Update this runbook after each incident with new procedures or lessons learned.

Specialized Runbooks

For specific incident types, see detailed runbooks:

  • LowCyclesBalance, CriticalCyclesBalance: Cycles Top-Up Procedure
  • HighErrorRate, CriticalErrorRate: High Error Rate Triage
  • CanisterUnresponsive: Canister Unresponsive Recovery
  • Deployment Failure: Deployment Failure Recovery
  • Database/External Service: Database Connectivity Issues

Tabletop Exercise

Conduct quarterly tabletop exercises to validate incident response procedures.

Exercise Schedule

  • Q1: Canister crash recovery (simulated canister stop)
  • Q2: Cycles depletion (monitored low cycles scenario)
  • Q3: Deployment rollback (practice rollback workflow)
  • Q4: Full incident simulation (multi-system failure)

Exercise Procedure

  1. Preparation (1 day before)

    • Notify team of exercise
    • Prepare test scenario
    • Ensure staging environment ready
  2. Execution (1-2 hours)

    • Inject simulated failure
    • Team responds per runbook
    • Document response times and actions
  3. Review (30 minutes after)

    • Debrief with team
    • Identify gaps in runbooks
    • Document improvements

Exercise Checklist

Cycles Top-Up Drill:

  • [ ] Identify canister with low cycles (staging)
  • [ ] Execute top-up procedure
  • [ ] Verify cycles balance increased
  • [ ] Time: < 15 minutes total

Canister Restart Drill:

  • [ ] Stop canister (staging)
  • [ ] Detect via monitoring (or manual)
  • [ ] Execute restart procedure
  • [ ] Verify functionality restored
  • [ ] Time: < 10 minutes total

Rollback Drill:

  • [ ] Deploy test version to staging
  • [ ] Identify rollback target (previous run ID)
  • [ ] Execute rollback workflow
  • [ ] Verify previous version restored
  • [ ] Time: < 5 minutes total

Exercise Documentation

After each exercise, document:

  • Date and participants
  • Scenario description
  • Response timeline
  • Issues identified
  • Runbook updates needed
  • Action items with owners
