
High Error Rate Triage

Last Updated: 2025-12-04
Alert: HighErrorRate, CriticalErrorRate
Severity: Warning / Critical
Response Time: < 30 minutes for critical

Overview

This runbook covers the triage and resolution process when canister or oracle-bridge error rates exceed acceptable thresholds.

Alert Thresholds

Level      Threshold   Duration    Response Time
Warning    > 5%        5 minutes   < 1 hour
Critical   > 10%       5 minutes   < 30 minutes

Symptoms

  • Alert: HighErrorRate or CriticalErrorRate
  • Users reporting form submission failures
  • PostHog showing increased error events
  • Grafana error rate panel above threshold

Diagnosis

Step 1: Identify Affected Component

Check which component is generating errors:

```bash
# For canister errors
dfx canister --network ic logs <canister-id>

# For oracle-bridge
# Check application logs or cloud logging service
```

Step 2: Check Recent Deployments

  1. Review GitHub Actions for recent deployments
  2. Check if error spike correlates with deployment time
  3. Note the commit SHA of current deployment
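The deployment-correlation check can be partly scripted. A minimal sketch, assuming GNU `date` and illustrative timestamps (substitute the real deployment time from GitHub Actions and the spike start time from Grafana):

```bash
# Did the error spike begin after the latest deployment?
# Timestamps below are placeholders -- substitute real values.
DEPLOY_TIME="2025-12-04T10:15:00Z"
SPIKE_TIME="2025-12-04T10:20:00Z"

# GNU date assumed; on macOS use `gdate` from coreutils
deploy_epoch=$(date -u -d "$DEPLOY_TIME" +%s)
spike_epoch=$(date -u -d "$SPIKE_TIME" +%s)

if [ "$spike_epoch" -ge "$deploy_epoch" ]; then
  echo "spike began after deployment: suspect the new release"
else
  echo "spike predates deployment: look elsewhere"
fi
```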

Step 3: Analyze Error Types

In Grafana, check error breakdown by type:

  • Validation errors (bad input)
  • Network errors (connectivity issues)
  • Internal errors (bugs, panics)
  • Rate limiting (too many requests)
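When Grafana is unavailable, a rough version of this breakdown can be produced locally from raw logs. A sketch over sample log lines; the match patterns are assumptions to adapt to your actual error messages:

```bash
# Classify error log lines into the four buckets above.
# Sample lines and match patterns are illustrative.
cat > /tmp/errors.log <<'EOF'
validation failed: email missing @
network error: connection reset by peer
panic: index out of bounds
validation failed: name too long
EOF

summary=$(awk '
  /validation/ { count["validation"]++; next }
  /network/    { count["network"]++;    next }
  /panic|trap/ { count["internal"]++;   next }
               { count["other"]++ }
  END { for (t in count) print t, count[t] }
' /tmp/errors.log)
echo "$summary"
```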

Step 4: Check Dependencies

```bash
# Check if external services are available
# Email provider status
curl -s https://status.sendgrid.com/api/v2/status.json | jq '.status'

# Stripe status (if applicable)
curl -s https://status.stripe.com/api/v2/status.json | jq '.status'
```

Resolution

Scenario A: Recent Deployment

If errors started after a recent deployment:

  1. Assess severity: Is the feature critical?
  2. Consider rollback:
    ```bash
    # Via GitHub Actions
    # Go to Actions > Emergency Rollback > Run workflow
    # Enter canister name, network, and previous run ID
    ```
  3. Verify rollback resolved the issue
  4. Investigate the problematic commit
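If the GitHub UI is unreachable, the same workflow can usually be dispatched from the `gh` CLI. A sketch only: the workflow file name (`emergency-rollback.yml`) and input names are assumptions to verify against the workflow's `workflow_dispatch` inputs before running.

```bash
# Build the dispatch command first so it can be reviewed before running.
# All values below are examples only.
CANISTER="backend"
NETWORK="ic"
PREV_RUN_ID="1234567890"   # run ID of the last known-good deployment

cmd="gh workflow run emergency-rollback.yml \
  -f canister=$CANISTER -f network=$NETWORK -f run_id=$PREV_RUN_ID"

echo "$cmd"   # review, then execute with: eval "$cmd"
```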

Scenario B: Input Validation Errors

If errors are primarily validation failures:

  1. Check recent changes to validation rules
  2. Review PostHog for patterns in failing inputs
  3. Update validation or error messages if needed
  4. No immediate action if users are providing bad input

Scenario C: External Service Errors

If errors are from external dependencies:

  1. Check service status pages
  2. Implement circuit breaker if not present
  3. Queue requests for retry if possible
  4. Communicate to users via status page

Scenario D: Canister Crash/Panic

If canister is crashing:

```bash
# Check canister status
dfx canister --network ic status <canister-id>

# If stopped, restart it
dfx canister --network ic start <canister-id>

# Check logs for panic reason
dfx canister --network ic logs <canister-id>
```

Scenario E: Rate Limiting

If errors are from rate limiting:

  1. Verify rate limiting is working as designed
  2. Check for abuse patterns (same IP, unusual patterns)
  3. Adjust thresholds if legitimate traffic is being blocked
  4. No action if rate limiting is protecting the system
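Abuse patterns can often be spotted by counting requests per client IP. A sketch over sample access-log lines; the log path and format (client IP in the first field) are assumptions to adapt to your logging setup:

```bash
# Sample access log; substitute your real log file and field layout.
cat > /tmp/access.log <<'EOF'
203.0.113.7 POST /submit 429
203.0.113.7 POST /submit 429
203.0.113.7 POST /submit 429
198.51.100.2 POST /submit 200
EOF

# Count requests per IP (field 1), highest first
top_ips=$(awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' \
  /tmp/access.log | sort -rn)
echo "$top_ips"
```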

Error Code Reference

Error Pattern          Likely Cause       Resolution
"out of cycles"        Cycles depleted    See cycles-topup.md
"trap" or "panic"      Canister crash     Restart and investigate logs
"timeout"              Slow response      Check canister load, optimize
"validation failed"    Bad input          Review validation rules
"unauthorized"         Auth issue         Check session validation
"network error"        Connectivity       Check IC network status

Post-Resolution

Step 1: Verify Resolution

  • Monitor Grafana for 15 minutes
  • Confirm error rate below threshold
  • Test affected functionality manually
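The threshold check itself is simple arithmetic: error rate = errors / total requests over the monitoring window. A sketch with illustrative counts (pull the real numbers from Grafana):

```bash
# Illustrative counts over the 15-minute verification window
ERRORS=12
TOTAL=800
THRESHOLD_PCT=5   # warning threshold from the Alert Thresholds table

rate_pct=$(( ERRORS * 100 / TOTAL ))   # integer percent is enough for triage
if [ "$rate_pct" -lt "$THRESHOLD_PCT" ]; then
  echo "error rate ${rate_pct}% is below the ${THRESHOLD_PCT}% threshold"
else
  echo "error rate ${rate_pct}% is still above threshold: keep monitoring"
fi
```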

Step 2: Root Cause Analysis

For critical incidents:

  1. Document timeline of events
  2. Identify root cause
  3. Create tickets for:
    • Bug fixes
    • Monitoring improvements
    • Documentation updates

Step 3: Update Alert Thresholds

If thresholds are too sensitive or not sensitive enough:

  1. Review historical error rates
  2. Propose new thresholds in PR
  3. Update alert-rules.yml
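If alert-rules.yml follows the standard Prometheus rule-group format, the warning rule might look roughly like this. The metric names (`canister_errors_total`, `canister_requests_total`) are assumptions; match them to the metrics actually exported.

```yaml
groups:
  - name: error-rate
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(canister_errors_total[5m]))
            / sum(rate(canister_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% for 5 minutes"
```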

Prevention

Code Quality

  • Run PocketIC tests before deployment
  • Require code review for all changes
  • Use canary deployments for risky changes

Monitoring

  • Set up alerts for gradual increases, not just thresholds
  • Track error rates by endpoint for granular detection
  • Monitor deployment correlation with errors

Graceful Degradation

  • Implement circuit breakers for external services
  • Queue non-critical operations
  • Return friendly error messages to users

Escalation

Condition                  Action
Error rate > 50%           Escalate to team lead immediately
Rollback doesn't resolve   Contact senior engineer
External service outage    Contact vendor support
Security-related errors    Contact security team

Hello World Co-Op DAO