# High Error Rate Triage

- **Last Updated:** 2025-12-04
- **Alert:** `HighErrorRate`, `CriticalErrorRate`
- **Severity:** Warning / Critical
- **Response Time:** < 30 minutes for critical

## Overview
This runbook covers the triage and resolution process when canister or oracle-bridge error rates exceed acceptable thresholds.
## Alert Thresholds
| Level | Error Rate | Sustained For | Response Time |
|---|---|---|---|
| Warning | > 5% | 5 minutes | < 1 hour |
| Critical | > 10% | 5 minutes | < 30 minutes |
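
To check the current error rate against these thresholds from the command line, you can query the Prometheus API behind the Grafana panel. A minimal sketch; the endpoint URL and the `http_requests_total` metric with a `status` label are assumptions, so substitute whatever expression the panel actually uses:

```bash
# Sketch: 5-minute error rate as a percentage, straight from Prometheus.
# PROM_URL and the metric/label names are assumptions; mirror the Grafana
# error-rate panel's query.
PROM_URL="http://prometheus.internal:9090"   # hypothetical endpoint
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100'

curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[0].value[1]'
```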
## Symptoms
- Alert: `HighErrorRate` or `CriticalErrorRate`
- Users reporting form submission failures
- PostHog showing increased error events
- Grafana error rate panel above threshold
## Diagnosis

### Step 1: Identify Affected Component
Check which component is generating errors:
```bash
# For canister errors
dfx canister --network ic logs <canister-id>

# For oracle-bridge
# Check application logs or cloud logging service
```

### Step 2: Check Recent Deployments
- Review GitHub Actions for recent deployments (see the sketch below)
- Check if error spike correlates with deployment time
- Note the commit SHA of current deployment
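
These checks can also be scripted. A sketch, assuming the GitHub CLI (`gh`) is authenticated against the repository and that the deploy workflow file is named `deploy.yml` (adjust to the real workflow name):

```bash
# List recent deployment runs with commit SHAs, timestamps, and outcomes.
# The workflow file name (deploy.yml) is an assumption; adjust as needed.
gh run list --workflow deploy.yml --limit 10 \
  --json displayTitle,headSha,createdAt,conclusion

# Cross-check what is actually deployed: the module hash reported here
# changes with every canister upgrade.
dfx canister --network ic info <canister-id>
```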
### Step 3: Analyze Error Types
In Grafana, check the error breakdown by type (or query it from the command line, as sketched after this list):
- Validation errors (bad input)
- Network errors (connectivity issues)
- Internal errors (bugs, panics)
- Rate limiting (too many requests)
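
A command-line version of the breakdown, for when Grafana is unavailable. The `canister_errors_total` metric and its `type` label are assumptions; use whichever series backs the Grafana panel:

```bash
# Sketch: errors per type over the last 5 minutes, highest count first.
# Metric and label names are assumptions; mirror the Grafana panel's query.
PROM_URL="http://prometheus.internal:9090"   # hypothetical endpoint
QUERY='sort_desc(sum by (type) (increase(canister_errors_total[5m])))'

curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[] | "\(.metric.type)\t\(.value[1])"'
```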
### Step 4: Check Dependencies
```bash
# Check if external services are available
# Email provider status
curl -s https://status.sendgrid.com/api/v2/status.json | jq '.status'

# Stripe status (if applicable)
curl -s https://status.stripe.com/api/v2/status.json | jq '.status'
```

## Resolution
### Scenario A: Deployment-Related Errors
If errors started after a recent deployment:
- Assess severity: Is the feature critical?
- Consider rollback (see the CLI sketch below):

  ```bash
  # Via GitHub Actions
  # Go to Actions > Emergency Rollback > Run workflow
  # Enter canister name, network, and previous run ID
  ```

- Verify rollback resolved the issue
- Investigate the problematic commit
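
The rollback can also be triggered with the GitHub CLI. The workflow file name and input names below are assumptions; confirm them against the Emergency Rollback workflow definition before running:

```bash
# Sketch: dispatch the Emergency Rollback workflow from the terminal.
# The workflow file name and input names (canister, network, run_id) are
# assumptions; verify them in the workflow definition first.
gh workflow run emergency-rollback.yml \
  -f canister=<canister-name> \
  -f network=ic \
  -f run_id=<previous-run-id>

# Follow the dispatched run (interactively pick it from the list).
gh run watch
```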
### Scenario B: Input Validation Errors
If errors are primarily validation failures:
- Check recent changes to validation rules
- Review PostHog for patterns in failing inputs (or grep the canister logs, as sketched below)
- Update validation or error messages if needed
- No immediate action if users are providing bad input
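
Canister logs can surface the same patterns. A minimal sketch, assuming validation failures are logged with a recognizable message:

```bash
# Sketch: count the most common validation-failure messages in recent logs.
# Assumes the log lines contain the word "validation"; adjust the pattern.
dfx canister --network ic logs <canister-id> \
  | grep -i "validation" \
  | sort | uniq -c | sort -rn | head -20
```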
### Scenario C: External Service Errors
If errors are from external dependencies:
- Check service status pages
- Implement circuit breaker if not present
- Queue requests for retry if possible (see the backoff sketch below)
- Communicate to users via status page
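
For the retry path, a simple exponential backoff keeps queued requests from hammering a degraded dependency. A generic sketch; the endpoint is a placeholder and the retry count and delays are illustrative:

```bash
# Sketch: retry a request with exponential backoff instead of failing fast.
# The URL is a placeholder; attempts and delays are illustrative only.
retry_with_backoff() {
  local url="$1" attempt=1 max_attempts=5
  until curl -sf "$url" > /dev/null; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${max_attempts} attempts" >&2
      return 1
    fi
    sleep $(( 2 ** attempt ))   # 2, 4, 8, 16 seconds between attempts
    (( attempt++ ))
  done
}

retry_with_backoff "https://api.example.com/health"   # hypothetical endpoint
```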
### Scenario D: Canister Crash/Panic
If canister is crashing:
```bash
# Check canister status
dfx canister --network ic status <canister-id>

# If stopped, restart it
dfx canister --network ic start <canister-id>

# Check logs for panic reason
dfx canister --network ic logs <canister-id>
```

### Scenario E: Rate Limiting
If errors are from rate limiting:
- Verify rate limiting is working as designed
- Check for abuse patterns (same IP, unusual request patterns); see the sketch below
- Adjust thresholds if legitimate traffic is being blocked
- No action if rate limiting is protecting the system
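
To spot abuse patterns, aggregate request logs by client IP. A minimal sketch; the log path and the position of the client IP (first field) are assumptions about the oracle-bridge deployment:

```bash
# Sketch: top client IPs in the access log. The log path and field position
# are assumptions; adapt to the actual deployment.
awk '{print $1}' /var/log/oracle-bridge/access.log \
  | sort | uniq -c | sort -rn | head -20
```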
## Error Code Reference
| Error Pattern | Likely Cause | Resolution |
|---|---|---|
| "out of cycles" | Cycles depleted | See cycles-topup.md |
| "trap" or "panic" | Canister crash | Restart and investigate logs |
| "timeout" | Slow response | Check canister load, optimize |
| "validation failed" | Bad input | Review validation rules |
| "unauthorized" | Auth issue | Check session validation |
| "network error" | Connectivity | Check IC network status |
## Post-Resolution

### Step 1: Verify Resolution
- Monitor Grafana for 15 minutes (or run the watch sketch below)
- Confirm the error rate stays below threshold
- Test affected functionality manually
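
The 15-minute watch can also run in a terminal. The Prometheus endpoint and metric names are the same assumptions as in the diagnosis section; substitute the panel's real query:

```bash
# Sketch: re-check the 5-minute error rate once a minute for 15 minutes.
# PROM_URL and the metric/label names are assumptions.
PROM_URL="http://prometheus.internal:9090"
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100'

for i in $(seq 1 15); do
  rate=$(curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
    | jq -r '.data.result[0].value[1]')
  echo "$(date -u +%H:%M:%S)  error rate: ${rate}%"
  sleep 60
done
```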
### Step 2: Root Cause Analysis
For critical incidents:
- Document timeline of events
- Identify root cause
- Create tickets for:
  - Bug fixes
  - Monitoring improvements
  - Documentation updates
### Step 3: Update Alert Thresholds
If thresholds are too sensitive or not sensitive enough:
- Review historical error rates
- Propose new thresholds in PR
- Update `alert-rules.yml` (see the validation sketch below)
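
If the alert rules are Prometheus-style, the edited file can be validated before the PR is opened. A sketch, assuming `promtool` is installed and the file sits at the repository root:

```bash
# Sketch: validate the edited rules file before opening the PR.
# Assumes Prometheus-style alerting rules and promtool on the PATH.
promtool check rules alert-rules.yml
```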
## Prevention

### Code Quality
- Run PocketIC tests before deployment
- Require code review for all changes
- Use canary deployments for risky changes
### Monitoring
- Set up alerts for gradual increases, not just static threshold breaches
- Track error rates by endpoint for granular detection
- Monitor deployment correlation with errors
### Graceful Degradation
- Implement circuit breakers for external services
- Queue non-critical operations
- Return friendly error messages to users
## Escalation
| Condition | Action |
|---|---|
| Error rate > 50% | Escalate to team lead immediately |
| Rollback doesn't resolve | Contact senior engineer |
| External service outage | Contact vendor support |
| Security-related errors | Contact security team |