
Incident Response Runbook ​

Last Updated: 2025-12-04 | Status: Active | On-Call: [Configure rotation as needed]

Overview ​

This runbook provides step-by-step procedures for responding to common production incidents on the Hello World DAO platform.

General Incident Response Process ​

1. Detect & Alert ​

  • Monitor alerts from the IC Dashboard and PostHog, or detect issues manually
  • Severity assessment: P1 (Critical), P2 (High), P3 (Medium), P4 (Low)

2. Acknowledge & Communicate ​

  • Acknowledge alert within 15 minutes
  • Post status update to team channel
  • Estimate time to resolution

3. Investigate & Diagnose ​

  • Gather logs and metrics
  • Identify root cause
  • Check this runbook for known issues

4. Resolve & Verify ​

  • Apply fix following documented procedure
  • Verify fix resolves issue
  • Monitor for regression

5. Document & Review ​

  • Document incident details
  • Update the runbook if the incident revealed a new issue
  • Schedule post-mortem if P1/P2

Severity Levels ​

P1 - Critical (Response: Immediate) ​

  • Site completely down
  • Data loss or corruption
  • Security breach
  • Revenue-impacting issue

P2 - High (Response: < 1 hour) ​

  • Major feature broken
  • Performance degradation > 50%
  • High error rate (> 10%)
  • Canister cycles critically low

P3 - Medium (Response: < 4 hours) ​

  • Minor feature broken
  • Moderate performance degradation
  • Low error rate (1-10%)
  • Non-critical monitoring alert

P4 - Low (Response: Next business day) ​

  • Cosmetic issues
  • Minor bugs
  • Documentation updates
  • Optimization opportunities

Common Incidents ​

Incident #1: Canister Out of Cycles ​

Symptoms:

  • IC Dashboard shows cycles balance < 500B
  • Canister calls failing with "out of cycles" error
  • 502/503 errors on frontend

Diagnosis:

bash
# Check canister cycles balance
dfx canister --network mainnet status user-service
dfx canister --network mainnet status frontend

Resolution:

bash
# Top up canister with cycles (requires controller principal)
dfx canister --network mainnet deposit-cycles 5000000000000 user-service
dfx canister --network mainnet deposit-cycles 5000000000000 frontend

# Verify balance increased
dfx canister --network mainnet status user-service

Prevention:

  • Set up automated cycles monitoring (see the sketch after this list)
  • Configure email alerts for low cycles (< 1T)
  • Maintain cycles wallet with sufficient balance
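
The sketch below is one way to automate the monitoring described above. It assumes a cron-capable host with dfx configured for the same "mainnet" network alias and canister names used in this runbook; the alert line at the end is a placeholder to wire into your email or paging tool.

bash
#!/usr/bin/env bash
# Hypothetical cycles check intended to run from cron (not part of the current tooling)
THRESHOLD=1000000000000   # 1T cycles, matching the alert threshold above

for canister in user-service frontend; do
  # dfx prints a line like "Balance: 2_993_000_000_000 Cycles"; strip underscores and keep the digits
  balance=$(dfx canister --network mainnet status "$canister" 2>&1 \
    | grep -i 'balance:' | tr -d '_' | grep -Eo '[0-9]+' | head -n 1)
  if [ "${balance:-0}" -lt "$THRESHOLD" ]; then
    # Placeholder: replace echo with your email or paging integration
    echo "ALERT: $canister cycles balance (${balance:-unknown}) is below $THRESHOLD"
  fi
done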

Escalation:

  • If cycles wallet empty, contact team lead for funding
  • If repeated occurrences, review canister resource usage

Incident #2: High Error Rate (> 10%) ​

Symptoms:

  • IC Dashboard shows error spike
  • PostHog events missing or delayed
  • User reports of form submission failures

Diagnosis:

bash
# Check canister logs
dfx canister --network mainnet logs user-service

# Check frontend browser console for errors
# Open https://www.helloworlddao.com in browser
# F12 -> Console tab -> Look for red errors

Common Causes:

  1. Canister panic/trap
  2. Network connectivity issue
  3. Frontend JavaScript error
  4. Third-party service down (email provider)

Resolution:

If canister panic:

bash
# Check canister status
dfx canister --network mainnet status user-service

# If stopped, restart
dfx canister --network mainnet start user-service

If frontend error:

  • Review error message in browser console
  • Check if recent deployment introduced bug
  • Rollback to previous version if needed

If third-party service down:

  • Check email provider status page
  • Implement retry logic or a queue (a minimal retry sketch follows below)
  • Communicate to users via status page
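
As a rough illustration of the retry option above, the sketch below wraps a provider API call with simple backoff. EMAIL_API_URL and PAYLOAD are placeholders, not the project's actual integration.

bash
# Hypothetical retry wrapper for an email-provider API call (EMAIL_API_URL and PAYLOAD are placeholders)
for attempt in 1 2 3; do
  if curl -fsS -X POST "$EMAIL_API_URL" -H 'Content-Type: application/json' -d "$PAYLOAD"; then
    break   # request accepted
  fi
  sleep $((attempt * 5))   # simple linear backoff: 5s, 10s, 15s
done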

Prevention:

  • Comprehensive testing before deployment
  • Canary deployments for gradual rollout
  • Circuit breakers for third-party dependencies

Incident #3: PostHog Events Not Received ​

Symptoms:

  • PostHog dashboard shows no events for 15+ minutes
  • Analytics tracking not working

Diagnosis:

  1. Check PostHog status: https://status.posthog.com
  2. Check browser console for PostHog errors
  3. Verify PostHog API key is correct
  4. Check network requests to posthog.com (DevTools -> Network tab)

Resolution:

If PostHog service issue:

  • Wait for PostHog to resolve
  • Events may be buffered and sent later
  • Monitor PostHog status page

If configuration issue:

bash
# Verify PostHog API key in environment
cd /home/coby/git/frontend/app/www
grep VITE_PUBLIC_POSTHOG_KEY .env

# Should match key from PostHog dashboard
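
If the key looks correct, a direct test event can confirm whether ingestion itself works. This assumes PostHog's US cloud capture endpoint; substitute your project's API host and key.

bash
# Send a test event straight to the capture endpoint (host and key are assumptions; adjust to your project)
curl -sS -X POST https://app.posthog.com/capture/ \
  -H 'Content-Type: application/json' \
  -d '{"api_key": "<VITE_PUBLIC_POSTHOG_KEY value>", "event": "runbook_test_event", "distinct_id": "runbook-test"}'

# A 200 response means the event was accepted; it should then appear in the PostHog activity view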

If client-side tracking blocked:

  • User may have ad blocker enabled
  • This is expected for some users
  • Not a production issue if it affects only a small percentage of users

Prevention:

  • Monitor PostHog event volume
  • Set up alerts for event drops
  • Regular testing of analytics tracking

Incident #4: Email Verification Not Working ​

Symptoms:

  • Users report not receiving verification emails
  • PostHog shows email_verification_sent but no email_verification_success

Diagnosis:

  1. Test form submission yourself
  2. Check spam folder
  3. Review oracle-bridge logs
  4. Check email provider status

Resolution:

If emails going to spam:

  • Review email content for spam triggers
  • Configure SPF/DKIM/DMARC records (see the DNS checks below)
  • Contact email provider support
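
The checks below query the relevant DNS records with dig, assuming mail is sent from helloworlddao.com. The DKIM selector varies by provider, so substitute the one your email provider issues.

bash
# Verify SPF/DKIM/DMARC records for the sending domain
dig +short TXT helloworlddao.com                       # SPF: expect a "v=spf1 ..." record
dig +short TXT _dmarc.helloworlddao.com                # DMARC: expect "v=DMARC1; ..."
dig +short TXT default._domainkey.helloworlddao.com    # DKIM: replace "default" with your provider's selector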

If oracle-bridge not sending:

bash
# Check oracle-bridge service status
cd /home/coby/git/oracle-bridge
npm run logs

# Restart service if needed
npm run restart

If email provider issue:

  • Check provider status page
  • Switch to backup provider if available
  • Communicate delay to users

Prevention:

  • Email deliverability monitoring
  • Backup email provider configured
  • Regular deliverability testing

Incident #5: Performance Degradation ​

Symptoms:

  • Lighthouse performance score < 70
  • Slow page load times (> 5s)
  • User complaints about slowness

Diagnosis:

bash
# Run a Lighthouse audit from the command line (requires Node; uses the lighthouse npm package)
npx lighthouse https://www.helloworlddao.com --only-categories=performance --output=html --output-path=./lighthouse-report.html

# Or in the browser: Chrome DevTools -> Lighthouse -> run a performance audit

# Check Core Web Vitals in the report and look for:
# - Large images
# - Slow JavaScript execution
# - Render-blocking resources

Common Causes:

  1. Large unoptimized assets
  2. Slow canister responses
  3. Network latency
  4. JavaScript bundle too large

Resolution:

If large assets:

  • Optimize images (compress, resize, convert to WebP; see the conversion example below)
  • Lazy load below-the-fold content
  • Implement CDN caching
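
As one concrete option for the WebP conversion above, cwebp (from the libwebp tools) converts individual images; the file paths and quality setting below are illustrative.

bash
# Convert a PNG/JPEG to WebP at roughly 80% quality (paths are illustrative)
cwebp -q 80 src/assets/hero.png -o src/assets/hero.webp

# Compare sizes before and after
ls -lh src/assets/hero.png src/assets/hero.webp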

If slow canister:

  • Optimize canister query methods
  • Add caching layer
  • Review database queries

If JavaScript bundle large:

  • Code splitting and lazy loading (the size check below helps identify the largest chunks)
  • Remove unused dependencies
  • Tree shaking optimization
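
A quick way to spot oversized chunks, assuming a Vite-style build that emits to dist/:

bash
# Build for production and list emitted assets by size
npm run build
du -sh dist/assets/* | sort -h    # largest files last; prime candidates for code splitting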

Prevention:

  • Performance budgets
  • Automated Lighthouse CI checks
  • Regular performance testing

Incident #6: Form Submission Failures ​

Symptoms:

  • PostHog shows form submit events but high failure rate
  • User reports form not working

Diagnosis:

  1. Test form yourself on production
  2. Check browser console for errors
  3. Review user-service canister logs
  4. Check network requests (DevTools -> Network tab)

Common Issues:

  • Validation errors
  • Canister method panics
  • Network timeout
  • CORS issues

Resolution:

If validation errors:

  • Review validation rules
  • Ensure frontend matches backend validation
  • Improve error messages to users

If canister panics:

bash
# Check canister logs
dfx canister --network mainnet logs user-service

# Look for trap/panic messages
# Fix code bug and redeploy

If network timeout:

  • Check canister response times (see the timing checks below)
  • Increase timeout threshold
  • Optimize canister performance
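
For quick measurements, the commands below time the frontend from the outside and a single canister call; <query_method> is a placeholder for one of user-service's query methods.

bash
# Rough external timing of the frontend (standard curl timing variables)
curl -s -o /dev/null -w 'total: %{time_total}s  ttfb: %{time_starttransfer}s\n' https://www.helloworlddao.com

# Rough timing of a canister call; <query_method> is a placeholder
time dfx canister --network mainnet call user-service <query_method>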

Prevention:

  • End-to-end testing before deployment
  • Input validation testing
  • Performance monitoring

Escalation Procedures ​

Level 1: On-Call Engineer ​

  • First responder
  • Handles P3/P4 incidents
  • Escalates P1/P2 incidents

Level 2: Team Lead ​

  • Handles P1/P2 incidents
  • Provides technical guidance
  • Coordinates cross-team efforts

Level 3: CTO / Engineering Manager ​

  • Critical business-impacting incidents
  • Decision authority for major changes
  • External communication

Escalation Triggers ​

  • P1 incident not resolved in 1 hour
  • P2 incident not resolved in 4 hours
  • Incident requires system-wide changes
  • Incident requires vendor coordination

Communication Templates ​

Incident Notification ​

[INCIDENT] [P1/P2/P3/P4] Brief description

Status: Investigating / Identified / Resolved
Affected: [Feature/Service]
Started: [Timestamp]
ETA: [Estimated resolution time]

Details: [Brief description of issue]
Impact: [User impact description]

Updates will be provided every [frequency]

Resolution Notification ​

[RESOLVED] Brief description

Incident resolved at [timestamp]
Duration: [Total time]
Root cause: [Brief explanation]

Resolution: [What was done]
Prevention: [Steps to prevent recurrence]

Post-mortem: [Link or scheduled date]

Post-Incident Review ​

For P1/P2 incidents, conduct post-mortem within 48 hours:

  1. Timeline: Document incident from detection to resolution
  2. Root Cause: Identify underlying cause, not just symptoms
  3. Impact: Quantify user impact and business cost
  4. Response: Evaluate response effectiveness
  5. Prevention: Define action items to prevent recurrence
  6. Follow-up: Assign owners and deadlines for action items

Contact Information ​

On-Call Rotation ​

  • Week of [Date]: [Name] - [Contact]
  • Week of [Date]: [Name] - [Contact]

Key Contacts ​

  • Team Lead: [Name] - [Email] - [Phone]
  • CTO: [Name] - [Email] - [Phone]
  • DevOps: [Name] - [Email] - [Phone]

External Vendors ​

Runbook Maintenance ​

  • Review Frequency: Monthly
  • Owner: DevOps team
  • Last Review: 2025-11-16
  • Next Review: 2025-12-16

Update this runbook after each incident with new procedures or lessons learned.

Specialized Runbooks ​

For specific incident types, see detailed runbooks:

Alert | Runbook
LowCyclesBalance, CriticalCyclesBalance | Cycles Top-Up Procedure
HighErrorRate, CriticalErrorRate | High Error Rate Triage
CanisterUnresponsive | Canister Unresponsive Recovery
Deployment Failure | Deployment Failure Recovery
Database/External Service | Database Connectivity Issues

Tabletop Exercise ​

Conduct quarterly tabletop exercises to validate incident response procedures.

Exercise Schedule ​

Quarter | Focus Area | Exercise Type
Q1 | Canister crash recovery | Simulated canister stop
Q2 | Cycles depletion | Monitored low cycles scenario
Q3 | Deployment rollback | Practice rollback workflow
Q4 | Full incident simulation | Multi-system failure

Exercise Procedure ​

  1. Preparation (1 day before)

    • Notify team of exercise
    • Prepare test scenario
    • Ensure staging environment ready
  2. Execution (1-2 hours)

    • Inject simulated failure
    • Team responds per runbook
    • Document response times and actions
  3. Review (30 minutes after)

    • Debrief with team
    • Identify gaps in runbooks
    • Document improvements

Exercise Checklist ​

Cycles Top-Up Drill:

  • [ ] Identify canister with low cycles (staging)
  • [ ] Execute top-up procedure
  • [ ] Verify cycles balance increased
  • [ ] Time: < 15 minutes total

Canister Restart Drill (see the command sketch below):

  • [ ] Stop canister (staging)
  • [ ] Detect via monitoring (or manual)
  • [ ] Execute restart procedure
  • [ ] Verify functionality restored
  • [ ] Time: < 10 minutes total
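
A command-level sketch of this drill, assuming a "staging" network alias is defined in dfx.json (the rest of this runbook only references mainnet):

bash
# Simulate the failure, then recover (network alias "staging" is an assumption)
dfx canister --network staging stop user-service
dfx canister --network staging status user-service    # expect "Status: Stopped"

dfx canister --network staging start user-service
dfx canister --network staging status user-service    # expect "Status: Running"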

Rollback Drill (see the gh CLI sketch below):

  • [ ] Deploy test version to staging
  • [ ] Identify rollback target (previous run ID)
  • [ ] Execute rollback workflow
  • [ ] Verify previous version restored
  • [ ] Time: < 5 minutes total
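
If deployments run through GitHub Actions, the gh CLI can drive this drill; the workflow file names and run_id input below are placeholders for whatever your rollback workflow actually expects.

bash
# Hypothetical rollback via GitHub Actions (workflow names and inputs are placeholders)
gh run list --workflow deploy.yml --limit 5        # note the run ID of the previous good deploy
gh workflow run rollback.yml -f run_id=<previous-run-id>
gh run watch                                       # follow the rollback run to completion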

Exercise Documentation ​

After each exercise, document:

  • Date and participants
  • Scenario description
  • Response timeline
  • Issues identified
  • Runbook updates needed
  • Action items with owners
