
High Error Rate Triage

Last Updated: 2025-12-04
Alert: HighErrorRate, CriticalErrorRate
Severity: Warning / Critical
Response Time: < 30 minutes for critical

Overview

This runbook covers the triage and resolution process when canister or oracle-bridge error rates exceed acceptable thresholds.

Alert Thresholds

Level      Threshold   Duration    Response Time
Warning    > 5%        5 minutes   < 1 hour
Critical   > 10%       5 minutes   < 30 minutes

Symptoms

  • Alert: HighErrorRate or CriticalErrorRate
  • Users reporting form submission failures
  • PostHog showing increased error events
  • Grafana error rate panel above threshold

Diagnosis

Step 1: Identify Affected Component

Check which component is generating errors:

```bash
# For canister errors
dfx canister --network ic logs <canister-id>

# For oracle-bridge
# Check application logs or cloud logging service
```

Step 2: Check Recent Deployments

  1. Review GitHub Actions for recent deployments
  2. Check if error spike correlates with deployment time
  3. Note the commit SHA of current deployment
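The deployment-correlation check can be partly scripted. A minimal sketch, assuming GNU `date` and illustrative timestamps (substitute the real deployment time from GitHub Actions and the spike start time from Grafana):

```bash
# Did the error spike begin after the latest deployment?
# Timestamps below are placeholders -- substitute real values.
DEPLOY_TIME="2025-12-04T10:15:00Z"
SPIKE_TIME="2025-12-04T10:20:00Z"

# GNU date assumed; on macOS use `gdate` from coreutils
deploy_epoch=$(date -u -d "$DEPLOY_TIME" +%s)
spike_epoch=$(date -u -d "$SPIKE_TIME" +%s)

if [ "$spike_epoch" -ge "$deploy_epoch" ]; then
  echo "spike began after deployment: suspect the new release"
else
  echo "spike predates deployment: look elsewhere"
fi
```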

Step 3: Analyze Error Types

In Grafana, check error breakdown by type:

  • Validation errors (bad input)
  • Network errors (connectivity issues)
  • Internal errors (bugs, panics)
  • Rate limiting (too many requests)
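When Grafana is unavailable, a rough version of this breakdown can be produced locally from raw logs. A sketch over sample log lines; the match patterns are assumptions to adapt to your actual error messages:

```bash
# Classify error log lines into the four buckets above.
# Sample lines and match patterns are illustrative.
cat > /tmp/errors.log <<'EOF'
validation failed: email missing @
network error: connection reset by peer
panic: index out of bounds
validation failed: name too long
EOF

summary=$(awk '
  /validation/ { count["validation"]++; next }
  /network/    { count["network"]++;    next }
  /panic|trap/ { count["internal"]++;   next }
               { count["other"]++ }
  END { for (t in count) print t, count[t] }
' /tmp/errors.log)
echo "$summary"
```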

Step 4: Check Dependencies

```bash
# Check if external services are available
# Email provider status
curl -s https://status.sendgrid.com/api/v2/status.json | jq '.status'

# Stripe status (if applicable)
curl -s https://status.stripe.com/api/v2/status.json | jq '.status'
```

Resolution

Scenario A: Recent Deployment

If errors started after a recent deployment:

  1. Assess severity: Is the feature critical?
  2. Consider rollback:
    ```bash
    # Via GitHub Actions
    # Go to Actions > Emergency Rollback > Run workflow
    # Enter canister name, network, and previous run ID
    ```
  3. Verify rollback resolved the issue
  4. Investigate the problematic commit
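If the GitHub UI is unreachable, the same workflow can usually be dispatched from the `gh` CLI. A sketch only: the workflow file name (`emergency-rollback.yml`) and input names are assumptions to verify against the workflow's `workflow_dispatch` inputs before running.

```bash
# Build the dispatch command first so it can be reviewed before running.
# All values below are examples only.
CANISTER="backend"
NETWORK="ic"
PREV_RUN_ID="1234567890"   # run ID of the last known-good deployment

cmd="gh workflow run emergency-rollback.yml \
  -f canister=$CANISTER -f network=$NETWORK -f run_id=$PREV_RUN_ID"

echo "$cmd"   # review, then execute with: eval "$cmd"
```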

Scenario B: Input Validation Errors

If errors are primarily validation failures:

  1. Check recent changes to validation rules
  2. Review PostHog for patterns in failing inputs
  3. Update validation or error messages if needed
  4. No immediate action if users are providing bad input

Scenario C: External Service Errors

If errors are from external dependencies:

  1. Check service status pages
  2. Implement circuit breaker if not present
  3. Queue requests for retry if possible
  4. Communicate to users via status page

Scenario D: Canister Crash/Panic

If canister is crashing:

```bash
# Check canister status
dfx canister --network ic status <canister-id>

# If stopped, restart it
dfx canister --network ic start <canister-id>

# Check logs for panic reason
dfx canister --network ic logs <canister-id>
```

Scenario E: Rate Limiting

If errors are from rate limiting:

  1. Verify rate limiting is working as designed
  2. Check for abuse patterns (same IP, unusual patterns)
  3. Adjust thresholds if legitimate traffic is being blocked
  4. No action if rate limiting is protecting the system
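Abuse patterns can often be spotted by counting requests per client IP. A sketch over sample access-log lines; the log path and format (client IP in the first field) are assumptions to adapt to your logging setup:

```bash
# Sample access log; substitute your real log file and field layout.
cat > /tmp/access.log <<'EOF'
203.0.113.7 POST /submit 429
203.0.113.7 POST /submit 429
203.0.113.7 POST /submit 429
198.51.100.2 POST /submit 200
EOF

# Count requests per IP (field 1), highest first
top_ips=$(awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' \
  /tmp/access.log | sort -rn)
echo "$top_ips"
```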

Error Code Reference

Error Pattern          Likely Cause       Resolution
"out of cycles"        Cycles depleted    See cycles-topup.md
"trap" or "panic"      Canister crash     Restart and investigate logs
"timeout"              Slow response      Check canister load, optimize
"validation failed"    Bad input          Review validation rules
"unauthorized"         Auth issue         Check session validation
"network error"        Connectivity       Check IC network status

Post-Resolution

Step 1: Verify Resolution

  • Monitor Grafana for 15 minutes
  • Confirm error rate below threshold
  • Test affected functionality manually
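The threshold check itself is simple arithmetic: error rate = errors / total requests over the monitoring window. A sketch with illustrative counts (pull the real numbers from Grafana):

```bash
# Illustrative counts over the 15-minute verification window
ERRORS=12
TOTAL=800
THRESHOLD_PCT=5   # warning threshold from the Alert Thresholds table

rate_pct=$(( ERRORS * 100 / TOTAL ))   # integer percent is enough for triage
if [ "$rate_pct" -lt "$THRESHOLD_PCT" ]; then
  echo "error rate ${rate_pct}% is below the ${THRESHOLD_PCT}% threshold"
else
  echo "error rate ${rate_pct}% is still above threshold: keep monitoring"
fi
```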

Step 2: Root Cause Analysis

For critical incidents:

  1. Document timeline of events
  2. Identify root cause
  3. Create tickets for:
    • Bug fixes
    • Monitoring improvements
    • Documentation updates

Step 3: Update Alert Thresholds

If thresholds are too sensitive or not sensitive enough:

  1. Review historical error rates
  2. Propose new thresholds in PR
  3. Update alert-rules.yml
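If alert-rules.yml follows the standard Prometheus rule-group format, the warning rule might look roughly like this. The metric names (`canister_errors_total`, `canister_requests_total`) are assumptions; match them to the metrics actually exported.

```yaml
groups:
  - name: error-rate
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(canister_errors_total[5m]))
            / sum(rate(canister_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% for 5 minutes"
```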

Prevention

Code Quality

  • Run PocketIC tests before deployment
  • Require code review for all changes
  • Use canary deployments for risky changes

Monitoring

  • Set up alerts for gradual increases, not just thresholds
  • Track error rates by endpoint for granular detection
  • Monitor deployment correlation with errors

Graceful Degradation

  • Implement circuit breakers for external services
  • Queue non-critical operations
  • Return friendly error messages to users

Escalation

Condition                  Action
Error rate > 50%           Escalate to team lead immediately
Rollback doesn't resolve   Contact senior engineer
External service outage    Contact vendor support
Security-related errors    Contact security team

Hello World Co-Op DAO