
Deployment Failure Recovery ​

Last Updated: 2025-12-04
Alert: Manual detection or CI/CD failure notification
Severity: High / Critical
Response Time: < 30 minutes

Overview ​

This runbook covers recovery procedures when a canister deployment fails or causes production issues.

Symptoms ​

  • GitHub Actions deployment workflow failed
  • New deployment causes errors/crashes
  • Canister stopped after upgrade
  • Users reporting issues after deployment
  • Rollback workflow triggered

Diagnosis ​

Step 1: Identify Failure Point ​

Check GitHub Actions workflow logs:

  1. Go to repository > Actions
  2. Find failed workflow run
  3. Review error messages in logs

Common failure points:

  • Build failed (compilation error)
  • Test failed (regression detected)
  • Deploy failed (network/auth issue)
  • Post-deploy health check failed

Step 2: Assess Impact ​

| Symptom | Impact Level |
| --- | --- |
| Deployment workflow failed, nothing deployed | Low |
| Deployed but canister stopped | High |
| Deployed and errors increasing | High |
| Deployed and canister unresponsive | Critical |

Step 3: Check Canister Status ​

bash
export DFX_WARNING=-mainnet_plaintext_identity
dfx canister --network ic status <canister-id>

# If canister is stopped or erroring, note the status
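When the status check needs to be scripted (for example in a health check), the status field can be pulled out of the command output. The following is a minimal sketch, assuming the output contains a line of the form `Status: Running`; the exact format may vary between dfx versions. The helper names `parse_canister_status` and `canister_is_running` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: extract the "Status:" field from `dfx canister status` output.
# Assumption: the output has a line like "Status: Running".

# Reads status output on stdin and prints the status value (e.g. Running).
parse_canister_status() {
  awk -F': *' '/^[[:space:]]*Status:/ { print $2; exit }'
}

# Returns 0 only when the canister reports Running.
# $1 is the canister ID or name.
canister_is_running() {
  local status
  # Some dfx versions print the report on stderr, so capture both streams.
  status="$(dfx canister --network ic status "$1" 2>&1 | parse_canister_status)"
  [ "$status" = "Running" ]
}
```

Usage: `canister_is_running <canister-id> || echo "canister is not running"`.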

Resolution ​

Option A: Rollback via GitHub Actions

For quick recovery when the previous version was working:

  1. Go to Actions > "Emergency Rollback" workflow
  2. Click "Run workflow"
  3. Enter:
    • canister_name: Name of canister to roll back
    • network: staging or mainnet
    • rollback_run_id: Run ID of last successful deployment
  4. Wait for workflow to complete (< 5 minutes)
  5. Verify canister status
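The same workflow can also be triggered from a terminal with the GitHub CLI, which may be faster during an incident. This is a sketch only: the workflow file name `emergency-rollback.yml` is an assumption (use whatever file defines the "Emergency Rollback" workflow), and `trigger_rollback` is a hypothetical helper. It prints the command by default and only executes it when `RUN=1` is set.

```bash
#!/usr/bin/env bash
# Hypothetical workflow file name; replace with the real one in the repository.
ROLLBACK_WORKFLOW="${ROLLBACK_WORKFLOW:-emergency-rollback.yml}"

# trigger_rollback <canister_name> <network> <rollback_run_id>
# Validates the inputs, then prints the gh command (or runs it when RUN=1).
trigger_rollback() {
  local canister="$1" network="$2" run_id="$3"
  case "$network" in
    staging|mainnet) ;;
    *) echo "network must be staging or mainnet" >&2; return 2 ;;
  esac
  case "$run_id" in
    ''|*[!0-9]*) echo "rollback_run_id must be numeric" >&2; return 2 ;;
  esac
  local cmd=(gh workflow run "$ROLLBACK_WORKFLOW"
    -f "canister_name=$canister"
    -f "network=$network"
    -f "rollback_run_id=$run_id")
  if [ "${RUN:-0}" = "1" ]; then
    "${cmd[@]}"
  else
    printf '%s\n' "${cmd[*]}"
  fi
}
```

Usage: `trigger_rollback user_service staging 12345678` prints the command for review; prefix it with `RUN=1` to dispatch the workflow.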

Option B: Rollback via dfx (Manual) ​

If GitHub Actions is unavailable:

bash
# Set up identity
dfx identity use <controller-identity>
export DFX_WARNING=-mainnet_plaintext_identity

# WARNING: --mode reinstall erases all canister state (heap and stable memory).
# Use --mode upgrade instead if the canister holds data that must survive.

# Option B1: Rebuild from previous commit
git checkout <previous-commit>
cargo build --release --target wasm32-unknown-unknown
dfx canister install <canister-name> --network ic --mode reinstall \
  --wasm target/wasm32-unknown-unknown/release/<canister>.wasm --yes

# Option B2: Use cached WASM artifact
# Download WASM from previous GitHub Actions run artifacts
dfx canister install <canister-name> --network ic --mode reinstall \
  --wasm /path/to/previous.wasm --yes
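For Option B2, the artifact can be fetched with the GitHub CLI and sanity-checked before it is installed. The sketch below assumes the deployment workflow uploads the WASM as a named artifact; the artifact name and both helper names are hypothetical. The check only confirms the file begins with the WebAssembly magic bytes (or gzip magic bytes, for a compressed module); it does not prove the module is the right version.

```bash
#!/usr/bin/env bash
# download_previous_wasm <run-id> <artifact-name> <dest-dir>
# Downloads one named artifact from a previous workflow run.
download_previous_wasm() {
  local run_id="$1" artifact="$2" dest="$3"
  gh run download "$run_id" --name "$artifact" --dir "$dest"
}

# looks_like_wasm <file>
# Returns 0 if the file starts with the WebAssembly magic bytes (00 61 73 6d)
# or the gzip magic bytes (1f 8b).
looks_like_wasm() {
  local magic
  magic="$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')"
  case "$magic" in
    0061736d|1f8b*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: run `looks_like_wasm /path/to/previous.wasm || echo "not a WASM module"` before passing the file to `dfx canister install`.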

Option C: Fix Forward ​

If the issue is a simple fix and rollback would lose important changes:

  1. Identify the bug in new code
  2. Create hotfix branch
  3. Apply minimal fix
  4. Deploy hotfix through normal CI/CD
  5. Monitor for resolution

Option D: Restart Stopped Canister ​

If canister stopped but code is correct:

bash
# Restart the canister
dfx canister --network ic start <canister-id>

# Verify it's running
dfx canister --network ic status <canister-id>

Finding Previous Run ID ​

To find the run ID for rollback:

  1. Go to repository > Actions
  2. Filter by workflow name (e.g., "Deploy Staging")
  3. Find last successful run (green checkmark)
  4. Click on the run
  5. Note the run ID from the URL: /runs/12345678
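The lookup above can be scripted as well. This is a sketch under assumptions: `run_id_from_url` and `last_successful_run_id` are hypothetical helper names, and the second one relies on the GitHub CLI being installed and authenticated for the repository.

```bash
#!/usr/bin/env bash
# run_id_from_url <url>
# Prints the numeric run ID from a GitHub Actions run URL (.../runs/<id>...).
run_id_from_url() {
  local url="$1" rest
  case "$url" in
    */runs/*) ;;
    *) return 1 ;;
  esac
  rest="${url##*/runs/}"   # drop everything through "/runs/"
  rest="${rest%%[!0-9]*}"  # keep only the leading digits
  [ -n "$rest" ] || return 1
  printf '%s\n' "$rest"
}

# last_successful_run_id <workflow-name-or-file>
# Prints the ID of the most recent successful run of the given workflow.
last_successful_run_id() {
  gh run list --workflow "$1" --status success --limit 1 \
    --json databaseId --jq '.[0].databaseId'
}
```

Usage: `last_successful_run_id "Deploy Staging"` returns the value to enter as `rollback_run_id`.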

Post-Deployment Verification ​

After rollback or fix:

Step 1: Verify Canister Status ​

bash
dfx canister --network ic status <canister-id>
# Expected output includes: Status: Running

Step 2: Test Functionality ​

bash
# Test a simple query method
dfx canister --network ic call <canister-id> <test-method>

# Example: Check stats
dfx canister --network ic call user_service get_stats '()'

Step 3: Monitor Error Rates ​

  • Check Grafana dashboard for 15 minutes
  • Verify error rate returns to normal
  • Confirm no new alerts fire
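Alongside the dashboard, the canister status can be polled for the same window so that a stop or crash is noticed promptly. The sketch below is a complement to the Grafana checks, not a replacement: it only looks at canister status, not error rates. The function names are hypothetical, and the defaults (30 checks, 30 seconds apart, roughly 15 minutes) are assumptions.

```bash
#!/usr/bin/env bash
# Prints the status report for a canister; kept separate so it can be swapped.
canister_status() {
  dfx canister --network ic status "$1" 2>&1
}

# watch_canister <canister-id> [checks] [interval-seconds]
# Returns non-zero on the first check that does not report Running.
watch_canister() {
  local id="$1" checks="${2:-30}" interval="${3:-30}" i
  for ((i = 1; i <= checks; i++)); do
    if ! canister_status "$id" | grep -q 'Status: Running'; then
      echo "check $i/$checks: $id is not Running" >&2
      return 1
    fi
    # Sleep between checks, but not after the final one.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

Usage: `watch_canister <canister-id>` in a spare terminal while watching the dashboard.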

Step 4: Test End-to-End ​

  • Perform user-facing actions on staging/production
  • Verify form submissions work
  • Check authentication flows

Common Failure Causes ​

Build Failures ​

| Error | Cause | Solution |
| --- | --- | --- |
| Compilation error | Code bug | Fix code, run tests locally |
| Missing dependency | Cargo.toml issue | Update dependencies |
| WASM too large | Binary size | Optimize code, enable LTO |

Deployment Failures ​

| Error | Cause | Solution |
| --- | --- | --- |
| Identity not found | Secret missing | Check DFX_IDENTITY_PEM secret |
| Insufficient cycles | Low balance | Top up cycles first |
| Network timeout | IC congestion | Retry deployment |
| Permission denied | Wrong controller | Verify identity is controller |
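For the "Permission denied" case, one way to verify that the active identity is a controller is to compare its principal against the controllers listed in the status output. This is a sketch, assuming the output contains a `Controllers:` line with the principals separated by spaces or commas; `is_controller` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# is_controller <principal>
# Reads `dfx canister status` output on stdin and returns 0 if the principal
# appears on the "Controllers:" line.
is_controller() {
  local principal="$1"
  awk -v p="$principal" '
    /^[[:space:]]*Controllers:/ {
      gsub(/,/, " ")                      # tolerate comma-separated lists
      for (i = 2; i <= NF; i++) if ($i == p) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}
```

Usage: `dfx canister --network ic status <canister-id> 2>&1 | is_controller "$(dfx identity get-principal)"`.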

Post-Deploy Failures ​

| Error | Cause | Solution |
| --- | --- | --- |
| Canister traps | Runtime panic | Check logs, rollback, fix bug |
| Memory overflow | Too much data | Optimize state, increase memory |
| Query timeout | Slow computation | Optimize algorithms |

Prevention ​

Pre-Deployment Checklist ​

  • [ ] All tests pass locally
  • [ ] Code reviewed and approved
  • [ ] WASM built successfully
  • [ ] Staging deployment tested
  • [ ] Rollback plan documented

Deployment Best Practices ​

  1. Deploy to staging first - Always test on staging
  2. Gradual rollout - Consider canary deployments
  3. Monitor actively - Watch metrics during deployment
  4. Have rollback ready - Know the previous run ID

Automated Safeguards ​

  • Required status checks on PRs
  • Mainnet deployment requires approval
  • Auto-rollback on health check failure

Escalation ​

| Condition | Action |
| --- | --- |
| Rollback fails | Contact senior engineer |
| Data corruption suspected | Contact team lead immediately |
| Security vulnerability | Contact security team |
| DFINITY issue suspected | Contact DFINITY support |

Hello World Co-Op DAO