Operations Documentation
This section contains operational documentation for the Hello World DAO platform, including monitoring, runbooks, and incident response procedures.
Quick Links
| Resource | Description |
|---|---|
| Canister Monitoring | Cycle balance monitoring and top-up procedures |
| Monitoring & Alerting | Grafana dashboards and alerting setup |
| Incident Response | General incident handling |
Runbooks
Detailed procedures for common operational tasks:
| Runbook | Purpose |
|---|---|
| Cycles Top-Up | Top up canister cycles when low |
| High Error Rate | Triage elevated error rates |
| Canister Unresponsive | Recover unresponsive canisters |
| Deployment Failure | Handle failed deployments and rollback |
| Database Connectivity | Troubleshoot database and service connectivity |
Monitoring Stack
┌──────────────────────────────────────────────────┐
│ Monitoring Infrastructure                        │
├──────────────────────────────────────────────────┤
│ Metrics Collection                               │
│   └── GitHub Actions workflow (every 6 hours)    │
│   └── Canister status via dfx                    │
├──────────────────────────────────────────────────┤
│ Visualization                                    │
│   └── Grafana dashboards                         │
│   └── IC Dashboard                               │
├──────────────────────────────────────────────────┤
│ Alerting                                         │
│   └── Prometheus alert rules                     │
│   └── Slack notifications                        │
│   └── PagerDuty escalation (critical)            │
└──────────────────────────────────────────────────┘
Alert Severity Levels
| Severity | Response Time | Examples |
|---|---|---|
| Critical | < 15 minutes | Canister unresponsive, cycles depleted |
| Warning | < 1 hour | Low cycles, elevated error rate |
| Info | Next business day | No new members, high proposal volume |
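The response targets in the table above can be turned into concrete deadlines. The following is a minimal sketch, not the platform's actual tooling: the function name `response_deadline` is hypothetical, and it assumes that business days run Monday to Friday, that "next business day" means by the end of that day, and that public holidays are ignored.

```python
from datetime import datetime, time, timedelta

# Fixed response targets taken from the severity table.
# "info" is handled separately because it is defined in business days.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "warning": timedelta(hours=1),
}


def response_deadline(severity: str, raised_at: datetime) -> datetime:
    """Return the latest time by which an alert should receive a response."""
    level = severity.lower()
    if level in RESPONSE_TARGETS:
        return raised_at + RESPONSE_TARGETS[level]
    if level == "info":
        # Assumption: Monday-Friday business days, deadline at end of day,
        # no holiday calendar.
        day = raised_at.date() + timedelta(days=1)
        while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            day += timedelta(days=1)
        return datetime.combine(day, time(23, 59, 59), tzinfo=raised_at.tzinfo)
    raise ValueError(f"unknown severity: {severity!r}")
```

For example, an Info alert raised on a Friday morning would be due by the end of the following Monday, while a Critical alert raised at the same moment would be due 15 minutes later.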
On-Call Rotation
On-call engineers are the first responders for production incidents.
Responsibilities:
- Acknowledge alerts within 15 minutes
- Follow runbook procedures
- Escalate P1/P2 incidents to the team lead
- Document incident resolution
Contact: See the Incident Response runbook for the rotation schedule.
Maintenance Windows
Regular maintenance activities:
| Activity | Schedule | Duration |
|---|---|---|
| Cycles monitoring | Every 6 hours | Automated |
| Dashboard review | Weekly | 30 minutes |
| Runbook review | Monthly | 1 hour |
| DR drill | Quarterly | 2 hours |
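The automated cycles monitoring entry above relies on canister status reported by dfx. As an illustration only, the sketch below shows one way such a check might read a cycle balance. It assumes the `dfx canister status` output contains a line of the form `Balance: 3_092_211_597_987 Cycles`; the exact format can vary between dfx versions, and the helper names are hypothetical rather than taken from the actual workflow.

```python
import re
import subprocess

# Assumed output line: "Balance: 3_092_211_597_987 Cycles"
BALANCE_RE = re.compile(r"^Balance:\s*([\d_]+)\s*Cycles", re.MULTILINE)


def parse_cycle_balance(status_output: str) -> int:
    """Extract the cycle balance from `dfx canister status` text output."""
    match = BALANCE_RE.search(status_output)
    if match is None:
        raise ValueError("no 'Balance:' line found in canister status output")
    return int(match.group(1).replace("_", ""))


def fetch_cycle_balance(canister: str, network: str = "ic") -> int:
    """Run dfx for one canister and return its balance (requires dfx on PATH)."""
    result = subprocess.run(
        ["dfx", "canister", "status", canister, "--network", network],
        capture_output=True,
        text=True,
        check=True,
    )
    # Search both streams, since the status text may appear on either one
    # depending on the dfx version.
    return parse_cycle_balance(result.stdout + result.stderr)
```

Keeping the parsing separate from the subprocess call means the parser can be tested against captured output without network access.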
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Cycles balance | > 1T | < 1T warning, < 500B critical |
| Error rate | < 1% | > 5% warning, > 10% critical |
| Response time | < 2s | > 5s warning |
| Uptime | 99.9% | < 99% critical |
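The thresholds in the table above can be expressed as simple classification functions. This is a sketch under stated assumptions, not the Prometheus alert rules themselves: the function names are hypothetical, "T" and "B" are read as trillion and billion cycles, and values that miss the target without crossing an alert threshold are reported as `"ok"`.

```python
TRILLION = 10**12
BILLION = 10**9


def classify_cycles(balance: int) -> str:
    """< 500B cycles is critical, < 1T is a warning."""
    if balance < 500 * BILLION:
        return "critical"
    if balance < TRILLION:
        return "warning"
    return "ok"


def classify_error_rate(percent: float) -> str:
    """> 10% is critical, > 5% is a warning."""
    if percent > 10:
        return "critical"
    if percent > 5:
        return "warning"
    return "ok"


def classify_response_time(seconds: float) -> str:
    """> 5s is a warning; the table defines no critical level."""
    return "warning" if seconds > 5 else "ok"


def classify_uptime(percent: float) -> str:
    """< 99% is critical; the table defines no warning level."""
    return "critical" if percent < 99 else "ok"
```

Note that an error rate of 3% misses the < 1% target but is below the 5% warning threshold, so `classify_error_rate(3)` returns `"ok"`. The same gap exists for response times between 2s and 5s and for uptime between 99% and 99.9%.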
External Services
| Service | Status Page | Purpose |
|---|---|---|
| Internet Computer | status.internetcomputer.org | Blockchain platform |
| PostHog | status.posthog.com | Analytics |
| SendGrid | status.sendgrid.com | Email delivery |
Related Documentation
- CI/CD Pipeline - Deployment and testing workflows
- Architecture - System design overview