Deployment Failure Recovery ​
Last Updated: 2025-12-04
Alert: Manual detection or CI/CD failure notification
Severity: High / Critical
Response Time: < 30 minutes
Overview ​
This runbook covers recovery procedures when a canister deployment fails or causes production issues.
Symptoms ​
- GitHub Actions deployment workflow failed
- New deployment causes errors/crashes
- Canister stopped after upgrade
- Users reporting issues after deployment
- Rollback workflow triggered
Diagnosis ​
Step 1: Identify Failure Point ​
Check GitHub Actions workflow logs:
- Go to repository > Actions
- Find failed workflow run
- Review error messages in logs
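If the GitHub CLI (`gh`) is installed and authenticated, the same check can be done from a terminal. The following is a minimal sketch under that assumption; the function name `show_last_failed_run` is illustrative and not part of any existing tooling.

```shell
# Sketch: print the logs of the failed steps from the most recent failed run
# of a workflow. Assumes the GitHub CLI (gh) is installed and authenticated.
show_last_failed_run() {
  workflow="$1"
  # Ask for the newest failed run and keep only its numeric run ID.
  run_id="$(gh run list --workflow "$workflow" --status failure --limit 1 \
    --json databaseId --jq '.[0].databaseId')"
  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "no failed runs found for workflow: $workflow" >&2
    return 1
  fi
  echo "failed run: $run_id" >&2
  gh run view "$run_id" --log-failed
}
```

For example, `show_last_failed_run "Deploy Staging"` would print only the log lines from steps that failed, which is usually enough to tell a build failure from a deploy failure.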
Common failure points:
- Build failed (compilation error)
- Test failed (regression detected)
- Deploy failed (network/auth issue)
- Post-deploy health check failed
Step 2: Assess Impact ​
| Symptom | Impact Level |
|---|---|
| Deployment workflow failed, nothing deployed | Low |
| Deployed but canister stopped | High |
| Deployed and errors increasing | High |
| Deployed and canister unresponsive | Critical |
Step 3: Check Canister Status ​
bash
export DFX_WARNING=-mainnet_plaintext_identity
dfx canister --network ic status <canister-id>
# If canister is stopped or erroring, note the status
Resolution
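Which option below applies depends on the state reported in Step 3. As a minimal sketch, the status output can be reduced to a single word for use in scripts. This assumes the output contains a line of the form `Status: <state>`; the function name `canister_state` is illustrative.

```shell
# Sketch: read `dfx canister status` output on stdin and print only the run
# state (for example Running, Stopping, or Stopped).
# Assumption: the output contains a line such as "Status: Running".
canister_state() {
  sed -n 's/^.*[Ss]tatus: *\([A-Za-z]*\).*$/\1/p' | head -n 1
}
```

Possible usage: `dfx canister --network ic status <canister-id> 2>&1 | canister_state`. The `2>&1` is there in case your dfx version writes the status report to stderr.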
Option A: Rollback via GitHub Actions (Recommended) ​
For quick recovery when the previous version was working:
- Go to Actions > "Emergency Rollback" workflow
- Click "Run workflow"
- Enter:
  - canister_name: Name of the canister to roll back
  - network: staging or mainnet
  - rollback_run_id: Run ID of the last successful deployment
- Wait for workflow to complete (< 5 minutes)
- Verify canister status
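The same workflow can also be started from a terminal. This is a sketch only: it assumes the GitHub CLI (`gh`) is installed and authenticated, and that the workflow's inputs are named exactly as listed above. The function name `trigger_rollback` is illustrative.

```shell
# Sketch: start the "Emergency Rollback" workflow with validated inputs.
# Assumption: the workflow accepts canister_name, network and rollback_run_id.
trigger_rollback() {
  canister_name="$1"; network="$2"; rollback_run_id="$3"
  # Reject anything other than the two networks the runbook mentions.
  case "$network" in
    staging|mainnet) ;;
    *) echo "network must be staging or mainnet, got: $network" >&2; return 2 ;;
  esac
  # Run IDs are numeric; catch a pasted URL or an empty value early.
  case "$rollback_run_id" in
    ''|*[!0-9]*) echo "rollback_run_id must be numeric" >&2; return 2 ;;
  esac
  gh workflow run "Emergency Rollback" \
    -f canister_name="$canister_name" \
    -f network="$network" \
    -f rollback_run_id="$rollback_run_id"
}
```

For example: `trigger_rollback user_service staging 12345678`. Validating the inputs locally avoids spending part of the response window on a run that fails at its first step.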
Option B: Rollback via dfx (Manual) ​
If GitHub Actions is unavailable:
bash
# Set up identity
dfx identity use <controller-identity>
export DFX_WARNING=-mainnet_plaintext_identity
# Option B1: Rebuild from previous commit
git checkout <previous-commit>
cargo build --release --target wasm32-unknown-unknown
# WARNING: --mode reinstall wipes all canister state, including stable memory.
# Use --mode upgrade instead if the canister's data must be preserved.
dfx canister install <canister-name> --network ic --mode reinstall \
  --wasm target/wasm32-unknown-unknown/release/<canister>.wasm --yes
# Option B2: Use cached WASM artifact
# Download WASM from previous GitHub Actions run artifacts
dfx canister install <canister-name> --network ic --mode reinstall \
  --wasm /path/to/previous.wasm --yes
Option C: Fix Forward
If the issue is a simple fix and rollback would lose important changes:
- Identify the bug in new code
- Create hotfix branch
- Apply minimal fix
- Deploy hotfix through normal CI/CD
- Monitor for resolution
Option D: Restart Stopped Canister ​
If canister stopped but code is correct:
bash
# Restart the canister
dfx canister --network ic start <canister-id>
# Verify it's running
dfx canister --network ic status <canister-id>
Finding Previous Run ID
To find the run ID for rollback:
- Go to repository > Actions
- Filter by workflow name (e.g., "Deploy Staging")
- Find last successful run (green checkmark)
- Click on the run
- Note the run ID from the URL:
/runs/12345678
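The lookup above can be scripted. The sketch below shows two hypothetical helpers: one pulls the run ID out of a copied URL, the other asks the GitHub CLI for the most recent successful run of a workflow. It assumes `gh` is installed and authenticated; both function names are illustrative.

```shell
# Sketch: extract the numeric run ID from a GitHub Actions run URL or path.
run_id_from_url() {
  printf '%s\n' "$1" | sed -n 's#^.*/runs/\([0-9][0-9]*\).*$#\1#p'
}

# Sketch: print the ID of the newest successful run of the given workflow.
last_successful_run_id() {
  gh run list --workflow "$1" --status success --limit 1 \
    --json databaseId --jq '.[0].databaseId'
}
```

For example, `last_successful_run_id "Deploy Staging"` prints the value to pass as `rollback_run_id`.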
Post-Deployment Verification ​
After rollback or fix:
Step 1: Verify Canister Status ​
bash
dfx canister --network ic status <canister-id>
# Expected output includes: Status: Running
Step 2: Test Functionality
bash
# Test a simple query method
dfx canister --network ic call <canister-id> <test-method>
# Example: Check stats
dfx canister --network ic call user_service get_stats '()'
Step 3: Monitor Error Rates
- Check Grafana dashboard for 15 minutes
- Verify error rate returns to normal
- Confirm no new alerts fire
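Alongside the dashboard, a terminal probe can run for the same window. The sketch below is not a replacement for Grafana: it repeats an arbitrary command and stops at the first failure. The function name, window, and interval are illustrative assumptions.

```shell
# Sketch: run a probe command repeatedly for a fixed window and stop at the
# first failure. Arguments: window in seconds, interval in seconds, then the
# probe command and its arguments.
watch_health() {
  window_s="$1"; interval_s="$2"; shift 2
  deadline=$(( $(date +%s) + window_s ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if ! "$@" >/dev/null 2>&1; then
      echo "probe failed: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
  echo "probe stayed healthy for ${window_s}s"
}
```

For a 15-minute window with a probe every 30 seconds: `watch_health 900 30 dfx canister --network ic call user_service get_stats '()'`.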
Step 4: Test End-to-End ​
- Perform user-facing actions on staging/production
- Verify form submissions work
- Check authentication flows
Common Failure Causes ​
Build Failures ​
| Error | Cause | Solution |
|---|---|---|
| Compilation error | Code bug | Fix code, run tests locally |
| Missing dependency | Cargo.toml issue | Update dependencies |
| WASM too large | Binary size | Optimize code, enable LTO |
Deployment Failures ​
| Error | Cause | Solution |
|---|---|---|
| Identity not found | Secret missing | Check DFX_IDENTITY_PEM secret |
| Insufficient cycles | Low balance | Top up cycles first |
| Network timeout | IC congestion | Retry deployment |
| Permission denied | Wrong controller | Verify identity is controller |
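For the "Permission denied" row, the controller check can be scripted. This is a sketch that assumes `dfx canister info` prints a `Controllers:` line listing principals separated by spaces; the function name `is_controller` is illustrative.

```shell
# Sketch: read `dfx canister info` output on stdin and succeed only if the
# given principal appears in the "Controllers:" line as an exact match.
is_controller() {
  principal="$1"
  sed -n 's/^[Cc]ontrollers: *//p' \
    | tr -s ' ,' '\n\n' \
    | grep -Fxq -- "$principal"
}
```

Possible usage: `dfx canister --network ic info <canister-id> 2>&1 | is_controller "$(dfx identity get-principal)" || echo "active identity is not a controller"`. Comparing whole principals avoids a false match when one principal happens to be a substring of another.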
Post-Deploy Failures ​
| Error | Cause | Solution |
|---|---|---|
| Canister traps | Runtime panic | Check logs, rollback, fix bug |
| Memory overflow | Too much data | Optimize state, increase memory |
| Query timeout | Slow computation | Optimize algorithms |
Prevention ​
Pre-Deployment Checklist ​
- [ ] All tests pass locally
- [ ] Code reviewed and approved
- [ ] WASM built successfully
- [ ] Staging deployment tested
- [ ] Rollback plan documented
Deployment Best Practices ​
- Deploy to staging first: always test on staging
- Gradual rollout: consider canary deployments
- Monitor actively: watch metrics during deployment
- Have rollback ready: know the previous run ID
Automated Safeguards ​
- Required status checks on PRs
- Mainnet deployment requires approval
- Auto-rollback on health check failure
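The runbook does not show how the auto-rollback safeguard is implemented. As a rough illustration of the idea only, a post-deploy step could retry a health probe and run a rollback command if the probe never succeeds. All names and parameters below are hypothetical.

```shell
# Sketch: retry a health probe; if every attempt fails, run a rollback
# command and return non-zero. Arguments: attempts, delay in seconds,
# probe command string, rollback command string.
health_check_or_rollback() {
  attempts="$1"; delay_s="$2"; probe="$3"; rollback="$4"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if sh -c "$probe" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Wait before the next attempt, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay_s"
    fi
    i=$(( i + 1 ))
  done
  echo "health check failed $attempts time(s); rolling back" >&2
  sh -c "$rollback"
  return 1
}
```

The function returns non-zero after a rollback so that the surrounding CI job is still marked as failed and the incident stays visible.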
Escalation ​
| Condition | Action |
|---|---|
| Rollback fails | Contact senior engineer |
| Data corruption suspected | Contact team lead immediately |
| Security vulnerability | Contact security team |
| DFINITY issue suspected | Contact DFINITY support |