Deployment Failure Recovery

Last Updated: 2025-12-04
Alert: Manual detection or CI/CD failure notification
Severity: High / Critical
Response Time: < 30 minutes

Overview

This runbook covers recovery procedures when a canister deployment fails or causes production issues.

Symptoms

GitHub Actions deployment workflow failed
New deployment causes errors/crashes
Canister stopped after upgrade
Users reporting issues after deployment
Rollback workflow triggered

Diagnosis

Step 1: Identify Failure Point

Check GitHub Actions workflow logs:

Go to repository > Actions
Find failed workflow run
Review error messages in logs

Common failure points:

Build failed (compilation error)
Test failed (regression detected)
Deploy failed (network/auth issue)
Post-deploy health check failed

Step 2: Assess Impact

Symptom	Impact Level
Deployment workflow failed, nothing deployed	Low
Deployed but canister stopped	High
Deployed and errors increasing	High
Deployed and canister unresponsive	Critical

Step 3: Check Canister Status

bash

export DFX_WARNING=-mainnet_plaintext_identity
dfx canister --network ic status <canister-id>

# If canister is stopped or erroring, note the status
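When the status needs to be recorded or used in a script, the state can be pulled out of the report. This is a minimal sketch that assumes the report contains a line of the form "Status: Running"; the helper name canister_state is hypothetical, and the sample text is illustrative rather than captured output.

```bash
#!/usr/bin/env bash
# Extract the value of the "Status:" line from a canister status report.
# Assumption: the report contains a line such as "Status: Running".
canister_state() {
  awk -F': *' '/^Status:/ { print $2; exit }'
}

# Illustrative sample text (not real output)
sample='Canister status call result for <canister-id>.
Status: Stopped
Controllers: <principal>'
printf '%s\n' "$sample" | canister_state

# Against a live canister. Some dfx versions may write the report to
# stderr, hence the 2>&1:
# dfx canister --network ic status <canister-id> 2>&1 | canister_state
```

If the helper prints nothing, the report format probably differs from the assumption above and should be read manually.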

Resolution

Option A: Rollback via GitHub Actions (Recommended)

For quick recovery when the previous version was working:

Go to Actions > "Emergency Rollback" workflow
Click "Run workflow"
Enter:
- canister_name: Name of canister to roll back
- network: staging or mainnet
- rollback_run_id: Run ID of last successful deployment
Wait for workflow to complete (< 5 minutes)
Verify canister status
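The same workflow can probably be started from a terminal with the GitHub CLI. The sketch below assumes gh is installed and authenticated, and that the workflow is named "Emergency Rollback" and accepts the three inputs listed above. The function name trigger_rollback and the EXECUTE switch are hypothetical. By default it only prints the command so it can be reviewed first.

```bash
#!/usr/bin/env bash
# Build the `gh workflow run` invocation for the rollback workflow.
# Prints the command by default; set EXECUTE=1 to actually run it.
trigger_rollback() {
  local canister="$1" network="$2" run_id="$3"
  local cmd=(gh workflow run "Emergency Rollback"
    -f "canister_name=${canister}"
    -f "network=${network}"
    -f "rollback_run_id=${run_id}")
  if [ "${EXECUTE:-0}" = "1" ]; then
    "${cmd[@]}"
  else
    printf '%q ' "${cmd[@]}"
    printf '\n'
  fi
}

# Dry run with example values
trigger_rollback user_service staging 12345678
```

Running it with EXECUTE=1 set in the environment dispatches the workflow; progress can then be followed in the Actions tab as described above.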

Option B: Rollback via dfx (Manual)

If GitHub Actions is unavailable:

bash

# Set up identity
dfx identity use <controller-identity>
export DFX_WARNING=-mainnet_plaintext_identity

# Option B1: Rebuild from previous commit
git checkout <previous-commit>
cargo build --release --target wasm32-unknown-unknown
# --mode upgrade preserves stable memory. --mode reinstall erases ALL canister
# state (including stable memory); use it only if data loss is acceptable.
dfx canister install <canister-name> --network ic --mode upgrade \
  --wasm target/wasm32-unknown-unknown/release/<canister>.wasm --yes

# Option B2: Use cached WASM artifact
# Download WASM from previous GitHub Actions run artifacts
dfx canister install <canister-name> --network ic --mode upgrade \
  --wasm /path/to/previous.wasm --yes
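Before installing a cached artifact, it may be worth confirming that the file is the build you expect. The sketch below computes a SHA-256 in the 0x-prefixed form that `dfx canister info` prints as "Module hash". It assumes sha256sum is available and that the WASM is installed byte-for-byte as downloaded. The helper names wasm_hash and wasm_matches are hypothetical.

```bash
#!/usr/bin/env bash
# Print the SHA-256 of a WASM file with a 0x prefix.
wasm_hash() {
  printf '0x%s\n' "$(sha256sum "$1" | cut -d' ' -f1)"
}

# Succeed only if the file's hash equals the expected hash.
wasm_matches() {
  [ "$(wasm_hash "$1")" = "$2" ]
}

# Possible usage (artifact name and paths are placeholders):
# gh run download <run-id> -n <artifact-name> -D ./rollback
# wasm_hash ./rollback/previous.wasm
# dfx canister info <canister-id> --network ic   # compare with "Module hash"
```

After the install, the hash reported for the canister should equal the hash of the file that was installed; a mismatch suggests that a different module is running.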

Option C: Fix Forward

If the issue is a simple fix and rollback would lose important changes:

Identify the bug in new code
Create hotfix branch
Apply minimal fix
Deploy hotfix through normal CI/CD
Monitor for resolution

Option D: Restart Stopped Canister

If the canister stopped but the deployed code is correct:

bash

# Restart the canister
dfx canister --network ic start <canister-id>

# Verify it's running
dfx canister --network ic status <canister-id>

Finding Previous Run ID

To find the run ID for rollback:

Go to repository > Actions
Filter by workflow name (e.g., "Deploy Staging")
Find last successful run (green checkmark)
Click on the run
Note the run ID from the URL: /runs/12345678
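The last step can be scripted when a run URL has been copied from the browser. This is a small sketch; the function name run_id_from_url is hypothetical and the URL is a placeholder. The commented gh command is an alternative that assumes the GitHub CLI is installed and that the workflow is named "Deploy Staging" as in the example above.

```bash
#!/usr/bin/env bash
# Extract the numeric run ID from a GitHub Actions run URL.
run_id_from_url() {
  local url="$1"
  if [[ "$url" =~ /runs/([0-9]+) ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    echo "no run ID found in: $url" >&2
    return 1
  fi
}

run_id_from_url "https://github.com/<org>/<repo>/actions/runs/12345678"

# Alternative: ask the GitHub CLI for the latest successful run
# gh run list --workflow "Deploy Staging" --status success --limit 1 \
#   --json databaseId --jq '.[0].databaseId'
```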

Post-Deployment Verification

After rollback or fix:

Step 1: Verify Canister Status

bash

dfx canister --network ic status <canister-id>
# Should show: Status: Running

Step 2: Test Functionality

bash

# Test a simple query method
dfx canister --network ic call <canister-id> <test-method>

# Example: Check stats
dfx canister --network ic call user_service get_stats '()'

Step 3: Monitor Error Rates

Check Grafana dashboard for 15 minutes
Verify error rate returns to normal
Confirm no new alerts fire

Step 4: Test End-to-End

Perform user-facing actions on staging/production
Verify form submissions work
Check authentication flows

Common Failure Causes

Build Failures

Error	Cause	Solution
Compilation error	Code bug	Fix code, run tests locally
Missing dependency	Cargo.toml issue	Update dependencies
WASM too large	Binary size	Optimize code, enable LTO
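A size check before deployment can catch the "WASM too large" case early. The sketch below is one possible guard: the function name check_wasm_size is hypothetical, and the limit is passed in by the caller because this runbook does not state one.

```bash
#!/usr/bin/env bash
# Fail if a WASM file is larger than a given limit in bytes.
# Returns 0 if within the limit, 1 if too large, 2 if the file is unreadable.
check_wasm_size() {
  local file="$1" limit="$2" size
  size="$(wc -c < "$file")" || return 2
  size="${size//[[:space:]]/}"   # some wc builds pad the count with spaces
  if [ "$size" -gt "$limit" ]; then
    echo "too large: ${size} bytes (limit ${limit})" >&2
    return 1
  fi
  echo "ok: ${size} bytes"
}

# Possible usage (limit value is a placeholder chosen by the team):
# check_wasm_size target/wasm32-unknown-unknown/release/<canister>.wasm "$LIMIT_BYTES"
```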

Deployment Failures

Error	Cause	Solution
Identity not found	Secret missing	Check DFX_IDENTITY_PEM secret
Insufficient cycles	Low balance	Top up cycles first
Network timeout	IC congestion	Retry deployment
Permission denied	Wrong controller	Verify identity is controller
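For transient errors such as network timeouts, "retry deployment" can be wrapped in a small helper that retries with a growing delay. This is a generic sketch, not part of the existing pipeline; the function name retry and its argument order are hypothetical.

```bash
#!/usr/bin/env bash
# Run a command up to MAX times, doubling the delay between attempts.
# Usage: retry MAX INITIAL_DELAY_SECONDS command [args...]
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Possible usage:
# retry 3 10 dfx canister --network ic status <canister-id>
```

Retrying only helps with transient failures. Errors such as a missing identity or a wrong controller will fail the same way on every attempt and should be fixed instead.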

Post-Deploy Failures

Error	Cause	Solution
Canister traps	Runtime panic	Check logs, rollback, fix bug
Memory overflow	Too much data	Optimize state, increase memory
Query timeout	Slow computation	Optimize algorithms

Prevention

Pre-Deployment Checklist

[ ] All tests pass locally
[ ] Code reviewed and approved
[ ] WASM built successfully
[ ] Staging deployment tested
[ ] Rollback plan documented

Deployment Best Practices

Deploy to staging first - Always test on staging
Gradual rollout - Consider canary deployments
Monitor actively - Watch metrics during deployment
Have rollback ready - Know the previous run ID

Automated Safeguards

Required status checks on PRs
Mainnet deployment requires approval
Auto-rollback on health check failure
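The auto-rollback safeguard can be pictured as a gate that runs a health check and calls a rollback routine when it fails. The sketch below illustrates that idea only and is not the actual workflow logic. The name guard_deploy and the health and rollback commands passed to it are hypothetical.

```bash
#!/usr/bin/env bash
# Run a health check; if it fails, run the rollback command.
# Returns 0 if healthy, 1 if rolled back, 2 if the rollback itself failed.
guard_deploy() {
  local health_cmd="$1" rollback_cmd="$2"
  if "$health_cmd"; then
    echo "health check passed"
    return 0
  fi
  echo "health check failed; rolling back" >&2
  if "$rollback_cmd"; then
    return 1
  fi
  echo "rollback failed; escalate" >&2
  return 2
}

# Possible wiring (both functions are placeholders to be defined by the team):
# health()   { dfx canister --network ic call user_service get_stats '()' >/dev/null; }
# rollback() { echo "trigger the Emergency Rollback workflow here"; }
# guard_deploy health rollback
```

The distinct return code for a failed rollback makes it easy to route that case to the escalation path.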

Escalation

Condition	Action
Rollback fails	Contact senior engineer
Data corruption suspected	Contact team lead immediately
Security vulnerability	Contact security team
DFINITY issue suspected	Contact DFINITY support
