Deployment Failure Recovery

Last Updated: 2025-12-04
Alert: Manual detection or CI/CD failure notification
Severity: High / Critical
Response Time: < 30 minutes

Overview

This runbook covers recovery procedures when a canister deployment fails or causes production issues.

Symptoms

  • GitHub Actions deployment workflow failed
  • New deployment causes errors/crashes
  • Canister stopped after upgrade
  • Users reporting issues after deployment
  • Rollback workflow triggered

Diagnosis

Step 1: Identify Failure Point

Check GitHub Actions workflow logs:

  1. Go to repository > Actions
  2. Find failed workflow run
  3. Review error messages in logs

Common failure points:

  • Build failed (compilation error)
  • Test failed (regression detected)
  • Deploy failed (network/auth issue)
  • Post-deploy health check failed
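
The same log review can be done from a terminal. The following is a minimal sketch, assuming the GitHub CLI (`gh`) is installed and authenticated against this repository. The helper names and the default limit of 5 are illustrative and not part of the existing tooling.

```bash
# Sketch only: assumes the GitHub CLI (gh) is installed and authenticated.
# Function names are hypothetical helpers, not existing project tooling.

# List the most recent failed runs of a workflow (name as shown in the Actions tab).
list_failed_runs() {
  local workflow="$1" limit="${2:-5}"
  gh run list --workflow "$workflow" --status failure --limit "$limit"
}

# Print only the log output of the failed steps for one run.
show_failed_logs() {
  local run_id="$1"
  gh run view "$run_id" --log-failed
}
```

For example, `list_failed_runs "Deploy Staging"` followed by `show_failed_logs <run-id>` narrows the output to the step that failed, which usually indicates whether the build, test, deploy, or health-check stage is at fault.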

Step 2: Assess Impact

Symptom | Impact Level
Deployment workflow failed, nothing deployed | Low
Deployed but canister stopped | High
Deployed and errors increasing | High
Deployed and canister unresponsive | Critical

Step 3: Check Canister Status

bash
export DFX_WARNING=-mainnet_plaintext_identity
dfx canister --network ic status <canister-id>

# If canister is stopped or erroring, note the status
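
When the status has to be recorded or fed into a script, it helps to reduce the report to a single word. This is a rough sketch under stated assumptions: the helper names are hypothetical, and the parser accepts either `Status: Running` or `Canister status: Running` because the exact label may differ between dfx versions.

```bash
# Sketch only: helper names are hypothetical, not existing project tooling.

# Read `dfx canister status` output on stdin and print one of:
# running, stopping, stopped, unknown.
parse_canister_state() {
  local state
  state=$(grep -ioE 'status:[[:space:]]*(running|stopping|stopped)' \
    | head -n 1 \
    | sed -E 's/.*:[[:space:]]*//' \
    | tr '[:upper:]' '[:lower:]')
  printf '%s\n' "${state:-unknown}"
}

# Query the canister and print its state. stderr is merged into stdout
# in case this dfx version writes the status report to stderr.
canister_state() {
  local canister_id="$1"
  dfx canister --network ic status "$canister_id" 2>&1 | parse_canister_state
}
```

A result of `unknown` means the status call returned no recognizable state (for example, an error message), which is worth treating as the unresponsive case until proven otherwise.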

Resolution

Option A: Rollback via GitHub Actions

For quick recovery when the previous version was working:

  1. Go to Actions > "Emergency Rollback" workflow
  2. Click "Run workflow"
  3. Enter:
    • canister_name: Name of canister to roll back
    • network: staging or mainnet
    • rollback_run_id: Run ID of last successful deployment
  4. Wait for workflow to complete (< 5 minutes)
  5. Verify canister status
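
The workflow can also be dispatched from a terminal. A minimal sketch, assuming the GitHub CLI (`gh`) is installed and authenticated, and that the workflow is named "Emergency Rollback" and takes the three inputs listed in the steps; `trigger_rollback` is a hypothetical helper name.

```bash
# Sketch only: assumes gh is installed and authenticated, and that the
# workflow name and input names match the ones documented in this runbook.
trigger_rollback() {
  local canister_name="$1" network="$2" rollback_run_id="$3"
  gh workflow run "Emergency Rollback" \
    -f canister_name="$canister_name" \
    -f network="$network" \
    -f rollback_run_id="$rollback_run_id"
}
```

For example, `trigger_rollback user_service staging <run-id>` dispatches the workflow, and `gh run watch` can then be used to follow it to completion.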

Option B: Rollback via dfx (Manual)

If GitHub Actions is unavailable:

bash
# Set up identity
dfx identity use <controller-identity>
export DFX_WARNING=-mainnet_plaintext_identity

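# WARNING: --mode reinstall erases all canister state (heap and stable memory).
# To keep stable memory, use --mode upgrade instead.

# Option B1: Rebuild from previous commit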
git checkout <previous-commit>
cargo build --release --target wasm32-unknown-unknown
dfx canister install <canister-name> --network ic --mode reinstall \
  --wasm target/wasm32-unknown-unknown/release/<canister>.wasm --yes

# Option B2: Use cached WASM artifact
# Download WASM from previous GitHub Actions run artifacts
dfx canister install <canister-name> --network ic --mode reinstall \
  --wasm /path/to/previous.wasm --yes
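
For Option B2, the artifact can be fetched without the web UI as long as the GitHub API is still reachable. This is only a sketch: it assumes the GitHub CLI (`gh`) is installed, and the artifact name is an assumption, since it depends on what the deploy workflow uploads. The helper name and default directory are hypothetical.

```bash
# Sketch only: the artifact name is an assumption; check the run's
# Artifacts list for the name your deploy workflow actually uploads.
download_previous_wasm() {
  local run_id="$1" artifact_name="$2" dest_dir="${3:-./rollback-artifacts}"
  mkdir -p "$dest_dir"
  gh run download "$run_id" --name "$artifact_name" --dir "$dest_dir"
}
```

The downloaded file path is then what goes into the `--wasm` argument of the install command.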

Option C: Fix Forward

If the issue is a simple fix and rollback would lose important changes:

  1. Identify the bug in new code
  2. Create hotfix branch
  3. Apply minimal fix
  4. Deploy hotfix through normal CI/CD
  5. Monitor for resolution

Option D: Restart Stopped Canister

If canister stopped but code is correct:

bash
# Restart the canister
dfx canister --network ic start <canister-id>

# Verify it's running
dfx canister --network ic status <canister-id>

Finding Previous Run ID

To find the run ID for rollback:

  1. Go to repository > Actions
  2. Filter by workflow name (e.g., "Deploy Staging")
  3. Find last successful run (green checkmark)
  4. Click on the run
  5. Note the run ID from the URL: /runs/12345678
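
Both the URL parsing and the lookup can be scripted. A minimal sketch, assuming the GitHub CLI (`gh`) is installed and authenticated; the helper names are hypothetical.

```bash
# Sketch only: helper names are hypothetical; assumes gh is installed.

# Extract the numeric run ID from an Actions run URL.
run_id_from_url() {
  local url="$1" id
  id="${url#*/runs/}"   # drop everything up to and including "/runs/"
  id="${id%%[!0-9]*}"   # keep only the leading digits
  printf '%s\n' "$id"
}

# Print the ID of the most recent successful run of a workflow.
last_successful_run_id() {
  local workflow="$1"
  gh run list --workflow "$workflow" --status success --limit 1 \
    --json databaseId --jq '.[0].databaseId'
}
```

For example, `run_id_from_url ".../actions/runs/12345678"` prints `12345678`, and `last_successful_run_id "Deploy Staging"` asks GitHub for the newest green run directly.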

Post-Deployment Verification

After rollback or fix:

Step 1: Verify Canister Status

bash
dfx canister --network ic status <canister-id>
# Should show: Canister status: Running
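
A canister may not report Running on the first check after a rollback or restart. The following polling sketch repeats the status call a few times before giving up. The helper name, attempt count, and delay are assumptions, and stderr is merged into stdout in case this dfx version writes the report there.

```bash
# Sketch only: helper name, attempt count, and delay are assumptions.
# Poll the canister until it reports Running, or give up.
wait_until_running() {
  local canister_id="$1" attempts="${2:-10}" delay="${3:-6}" i
  for ((i = 1; i <= attempts; i++)); do
    if dfx canister --network ic status "$canister_id" 2>&1 \
        | grep -qiE 'status:[[:space:]]*running'; then
      echo "running after $i check(s)"
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if ((i < attempts)); then sleep "$delay"; fi
  done
  echo "not running after $attempts check(s)" >&2
  return 1
}
```

A non-zero exit from `wait_until_running <canister-id>` is a signal to return to the Resolution options rather than continue with functional tests.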

Step 2: Test Functionality

bash
# Test a simple query method
dfx canister --network ic call <canister-id> <test-method>

# Example: Check stats
dfx canister --network ic call user_service get_stats '()'

Step 3: Monitor Error Rates

  • Check Grafana dashboard for 15 minutes
  • Verify error rate returns to normal
  • Confirm no new alerts fire

Step 4: Test End-to-End

  • Perform user-facing actions on staging/production
  • Verify form submissions work
  • Check authentication flows

Common Failure Causes

Build Failures

Error | Cause | Solution
Compilation error | Code bug | Fix code, run tests locally
Missing dependency | Cargo.toml issue | Update dependencies
WASM too large | Binary size | Optimize code, enable LTO
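
The "WASM too large" case can be caught before deployment with a size check on the build output. This is a sketch only: `wasm_size_ok` is a hypothetical helper, and the limit is passed in by the caller because the applicable limit depends on how the module is installed; check the current IC documentation for the value that applies.

```bash
# Sketch only: the size limit is an assumption supplied by the caller.
# Print the module size and succeed only if it is within the limit.
wasm_size_ok() {
  local wasm_path="$1" max_bytes="$2" size
  size=$(wc -c < "$wasm_path") || return 2
  size=$((size))   # normalize: some wc implementations pad with spaces
  echo "$wasm_path: $size bytes (limit $max_bytes)"
  [ "$size" -le "$max_bytes" ]
}
```

Running this as a CI step after `cargo build` turns an oversized binary into a build failure rather than a deploy failure.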

Deployment Failures

Error | Cause | Solution
Identity not found | Secret missing | Check DFX_IDENTITY_PEM secret
Insufficient cycles | Low balance | Top up cycles first
Network timeout | IC congestion | Retry deployment
Permission denied | Wrong controller | Verify identity is controller
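
The "Insufficient cycles" case can be checked ahead of a deployment by reading the balance from the status report. A rough sketch, assuming the report contains a line of the form `Balance: 3_000_000_000_000 Cycles`; the helper names and the threshold are assumptions, not project policy.

```bash
# Sketch only: helper names and the threshold are assumptions. Assumes the
# status report contains a line like "Balance: 3_000_000_000_000 Cycles".

# Read `dfx canister status` output on stdin; print the balance as a
# plain integer (empty if no balance line is found).
parse_cycles_balance() {
  grep -iE '^[[:space:]]*balance:' | head -n 1 | tr -cd '0-9'
  echo
}

# Succeed if the canister's balance is at or above the given threshold.
has_enough_cycles() {
  local canister_id="$1" min_cycles="$2" balance
  balance=$(dfx canister --network ic status "$canister_id" 2>&1 | parse_cycles_balance)
  [ -n "$balance" ] && [ "$balance" -ge "$min_cycles" ]
}
```

For example, `has_enough_cycles <canister-id> 1000000000000` fails when the balance is below one trillion cycles or when the status call returns no balance at all.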

Post-Deploy Failures

Error | Cause | Solution
Canister traps | Runtime panic | Check logs, rollback, fix bug
Memory overflow | Too much data | Optimize state, increase memory
Query timeout | Slow computation | Optimize algorithms

Prevention

Pre-Deployment Checklist

  • [ ] All tests pass locally
  • [ ] Code reviewed and approved
  • [ ] WASM built successfully
  • [ ] Staging deployment tested
  • [ ] Rollback plan documented

Deployment Best Practices

  1. Deploy to staging first - Always test on staging
  2. Gradual rollout - Consider canary deployments
  3. Monitor actively - Watch metrics during deployment
  4. Have rollback ready - Know the previous run ID

Automated Safeguards

  • Required status checks on PRs
  • Mainnet deployment requires approval
  • Auto-rollback on health check failure
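
The auto-rollback safeguard amounts to retrying a health check and falling back to a rollback when it never passes. The following is an illustrative sketch of that pattern, not the project's actual workflow: the health check and rollback commands are placeholders supplied by the caller, and the retry count and delay are assumptions.

```bash
# Sketch only: not the project's actual workflow. The health check and
# rollback commands are placeholders supplied by the caller.

# Run a command up to N times; succeed as soon as one attempt succeeds.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    if ((i < attempts)); then sleep "$delay"; fi
  done
  return 1
}

# Run the health check with retries; if it never passes, run the rollback.
check_or_rollback() {
  local health_cmd="$1" rollback_cmd="$2" attempts="${3:-3}" delay="${4:-10}"
  if retry "$attempts" "$delay" "$health_cmd"; then
    echo "healthy"
  else
    echo "unhealthy: rolling back" >&2
    "$rollback_cmd"
  fi
}
```

Retrying before rolling back avoids reverting a good deployment because of a single transient failure, at the cost of a slightly longer window before the rollback starts.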

Escalation

Condition | Action
Rollback fails | Contact senior engineer
Data corruption suspected | Contact team lead immediately
Security vulnerability | Contact security team
DFINITY issue suspected | Contact DFINITY support

Hello World Co-Op DAO