High Error Rate Triage

Last Updated: 2025-12-04
Alert: HighErrorRate, CriticalErrorRate
Severity: Warning / Critical
Response Time: < 30 minutes for critical

Overview

This runbook covers the triage and resolution process when canister or oracle-bridge error rates exceed acceptable thresholds.

Alert Thresholds

Level      Threshold   Duration    Response Time
Warning    > 5%        5 minutes   < 1 hour
Critical   > 10%       5 minutes   < 30 minutes

Symptoms

  • Alert: HighErrorRate or CriticalErrorRate
  • Users reporting form submission failures
  • PostHog showing increased error events
  • Grafana error rate panel above threshold

Diagnosis

Step 1: Identify Affected Component

Check which component is generating errors:

bash
# For canister errors
dfx canister --network ic logs <canister-id>

# For oracle-bridge
# Check application logs or cloud logging service
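# Hypothetical examples; the service/container name "oracle-bridge" and the
# log locations depend on the hosting setup:
journalctl -u oracle-bridge --since "30 min ago" | grep -iE "error|panic"   # systemd-managed host
docker logs --since 30m oracle-bridge 2>&1 | grep -iE "error|panic"         # Docker-based host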

Step 2: Check Recent Deployments

  1. Review GitHub Actions for recent deployments
  2. Check if error spike correlates with deployment time
  3. Note the commit SHA of current deployment
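
A quick way to do this from the command line; the workflow file name deploy.yml is an assumption and should match the actual deployment workflow:

bash
# List recent deployment runs
gh run list --workflow deploy.yml --limit 5

# Show the currently deployed module hash to compare against the build artifacts
dfx canister --network ic info <canister-id>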

Step 3: Analyze Error Types

In Grafana, check error breakdown by type:

  • Validation errors (bad input)
  • Network errors (connectivity issues)
  • Internal errors (bugs, panics)
  • Rate limiting (too many requests)
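
If the same breakdown is exposed through Prometheus, it can also be pulled from the command line; the Prometheus URL and the metric/label names below are assumptions and should be adapted to whatever feeds the Grafana panel:

bash
# Error rate by type over the last 5 minutes
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum by (error_type) (rate(canister_errors_total[5m]))' \
  | jq '.data.result[] | {type: .metric.error_type, rate: .value[1]}'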

Step 4: Check Dependencies

bash
# Check if external services are available
# Email provider status
curl -s https://status.sendgrid.com/api/v2/status.json | jq '.status'

# Stripe status (if applicable)
curl -s https://status.stripe.com/api/v2/status.json | jq '.status'

Resolution

Scenario A: Recent Deployment

If errors started after a recent deployment:

  1. Assess severity: Is the feature critical?
  2. Consider rollback:
    bash
    # Via GitHub Actions
    # Go to Actions > Emergency Rollback > Run workflow
    # Enter canister name, network, and previous run ID
  3. Verify rollback resolved the issue
  4. Investigate the problematic commit
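
To scope the investigation, list the commits between the last known-good deployment and the one that introduced the errors; the SHAs come from the deployment records noted in Step 2 of Diagnosis:

bash
# Show what changed between the previous and current deployments
git log --oneline <previous-sha>..<current-sha>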

Scenario B: Input Validation Errors

If errors are primarily validation failures:

  1. Check recent changes to validation rules
  2. Review PostHog for patterns in failing inputs
  3. Update validation or error messages if needed
  4. No immediate action if users are providing bad input

Scenario C: External Service Errors

If errors are from external dependencies:

  1. Check service status pages
  2. Implement circuit breaker if not present
  3. Queue requests for retry if possible (see the backoff sketch after this list)
  4. Communicate to users via status page
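
Where no circuit breaker exists yet, a minimal retry-with-exponential-backoff wrapper can keep transient upstream failures from surfacing as hard errors. This is only a sketch; the URL, attempt limit, and delays are placeholders:

bash
send_with_retry() {
  local url=$1 attempt=1 max_attempts=5
  # Retry the call until it succeeds or the attempt budget is exhausted
  until curl -sf "$url" > /dev/null; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up on $url after $max_attempts attempts" >&2
      return 1
    fi
    sleep $((2 ** attempt))   # exponential backoff: 2s, 4s, 8s, ...
    attempt=$((attempt + 1))
  done
}

send_with_retry "https://api.example.com/send"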

Scenario D: Canister Crash/Panic

If canister is crashing:

bash
# Check canister status
dfx canister --network ic status <canister-id>

# If stopped, restart it
dfx canister --network ic start <canister-id>

# Check logs for panic reason
dfx canister --network ic logs <canister-id>

Scenario E: Rate Limiting

If errors are from rate limiting:

  1. Verify rate limiting is working as designed
  2. Check for abuse patterns (same IP, unusual request patterns); see the log check after this list
  3. Adjust thresholds if legitimate traffic is being blocked
  4. No action if rate limiting is protecting the system
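
A quick way to spot abuse is to count requests per source IP in the access log; the log path and field position are assumptions about the proxy or CDN in front of the service:

bash
# Top 20 source IPs in the current access log
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20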

Error Code Reference

Error Pattern          Likely Cause      Resolution
"out of cycles"        Cycles depleted   See cycles-topup.md
"trap" or "panic"      Canister crash    Restart and investigate logs
"timeout"              Slow response     Check canister load, optimize
"validation failed"    Bad input         Review validation rules
"unauthorized"         Auth issue        Check session validation
"network error"        Connectivity      Check IC network status

Post-Resolution

Step 1: Verify Resolution

  • Monitor Grafana for 15 minutes
  • Confirm the error rate is back below the threshold (see the polling sketch below)
  • Test affected functionality manually
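
If you prefer to watch from a terminal, the error rate can be polled once a minute for 15 minutes through the Prometheus HTTP API; the URL and metric names below are assumptions and should match whatever feeds the Grafana panel:

bash
# Print the overall error ratio once a minute for 15 minutes
for i in $(seq 1 15); do
  curl -s 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(request_errors_total[5m])) / sum(rate(requests_total[5m]))' \
    | jq -r '.data.result[0].value[1]'
  sleep 60
done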

Step 2: Root Cause Analysis

For critical incidents:

  1. Document timeline of events
  2. Identify root cause
  3. Create tickets for:
    • Bug fixes
    • Monitoring improvements
    • Documentation updates

Step 3: Update Alert Thresholds

If thresholds are too sensitive or not sensitive enough:

  1. Review historical error rates
  2. Propose new thresholds in PR
  3. Update alert-rules.yml
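
Before merging, the edited rules file can be validated locally with promtool, assuming alert-rules.yml is a Prometheus-format rules file:

bash
promtool check rules alert-rules.yml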

Prevention

Code Quality

  • Run PocketIC tests before deployment (see the example run after this list)
  • Require code review for all changes
  • Use canary deployments for risky changes
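
A typical pre-deployment test run might look like the following; the POCKET_IC_BIN path and the test package name are assumptions about the repository layout:

bash
# Point the pocket-ic test library at a local PocketIC server binary
export POCKET_IC_BIN="$(pwd)/pocket-ic"
cargo test -p integration-tests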

Monitoring

  • Set up alerts for gradual increases, not just thresholds
  • Track error rates by endpoint for granular detection
  • Monitor deployment correlation with errors

Graceful Degradation

  • Implement circuit breakers for external services
  • Queue non-critical operations
  • Return friendly error messages to users

Escalation

Condition                  Action
Error rate > 50%           Escalate to team lead immediately
Rollback doesn't resolve   Contact senior engineer
External service outage    Contact vendor support
Security-related errors    Contact security team
