Operations Documentation

This section contains operational documentation for the Hello World DAO platform, including monitoring, runbooks, and incident response procedures.

Quick Links

Resource	Description
Canister Monitoring	Cycle balance monitoring and top-up procedures
Monitoring & Alerting	Grafana dashboards and alerting setup
Incident Response	General incident handling

Runbooks

Detailed procedures for common operational tasks:

Runbook	Purpose
Cycles Top-Up	Top up canister cycles when low
High Error Rate	Triage elevated error rates
Canister Unresponsive	Recover unresponsive canisters
Deployment Failure	Handle failed deployments and rollback
Database Connectivity	Troubleshoot database and service connectivity

Monitoring Stack

┌──────────────────────────────────────────────────┐
│              Monitoring Infrastructure            │
├──────────────────────────────────────────────────┤
│  Metrics Collection                              │
│  ├── GitHub Actions workflow (every 6 hours)    │
│  └── Canister status via dfx                    │
├──────────────────────────────────────────────────┤
│  Visualization                                   │
│  ├── Grafana dashboards                         │
│  └── IC Dashboard                               │
├──────────────────────────────────────────────────┤
│  Alerting                                        │
│  ├── Prometheus alert rules                     │
│  ├── Slack notifications                        │
│  └── PagerDuty escalation (critical)            │
└──────────────────────────────────────────────────┘
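The metrics collection step reads canister status via dfx. The following is a minimal sketch of how a cycles balance might be extracted from that status text. It assumes the output contains a line of the form `Balance: 3_092_211_597_987 Cycles`; the exact format can differ between dfx versions, so verify it against your own output. The function name and the sample text are hypothetical and are not taken from the platform's workflow.

```python
import re
from typing import Optional

# Assumed line format in `dfx canister status` output; verify for your dfx version.
BALANCE_RE = re.compile(
    r"^\s*Balance:\s*([\d_,]+)\s*Cycles", re.IGNORECASE | re.MULTILINE
)


def parse_cycles_balance(status_output: str) -> Optional[int]:
    """Return the cycles balance found in the status text, or None if absent."""
    match = BALANCE_RE.search(status_output)
    if match is None:
        return None
    # Strip digit separators before converting to an integer.
    return int(match.group(1).replace("_", "").replace(",", ""))


# Hypothetical sample text, not captured from a real canister.
SAMPLE_STATUS = """\
Status: Running
Balance: 3_092_211_597_987 Cycles
"""

balance = parse_cycles_balance(SAMPLE_STATUS)
```

Returning `None` when no balance line is found lets the caller treat a missing value as a collection failure rather than as a zero balance.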

Alert Severity Levels

Severity	Response Time	Examples
Critical	< 15 minutes	Canister unresponsive, cycles depleted
Warning	< 1 hour	Low cycles, elevated error rate
Info	Next business day	No new members, high proposal volume
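The response times in the table above can be turned into a concrete deadline for a given alert. The sketch below is illustrative only: the names `RESPONSE_TARGETS` and `response_deadline` are hypothetical, and Info is modeled as having no fixed deadline because "next business day" depends on a business calendar that this page does not define.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response targets taken from the severity table.
# Info has no fixed window ("next business day"), so it maps to None.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "warning": timedelta(hours=1),
    "info": None,
}


def response_deadline(severity: str, fired_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None if no fixed window applies."""
    target = RESPONSE_TARGETS[severity.lower()]
    return None if target is None else fired_at + target


fired = datetime(2024, 1, 1, 12, 0)
critical_due = response_deadline("Critical", fired)  # 12:15 on the same day
warning_due = response_deadline("Warning", fired)    # 13:00 on the same day
```

An unknown severity raises `KeyError`, which makes a mislabeled alert fail visibly instead of being silently ignored.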

On-Call Rotation

On-call engineers are the first responders for production incidents.

Responsibilities:

Acknowledge alerts within 15 minutes
Follow runbook procedures
Escalate P1/P2 incidents to team lead
Document incident resolution

Contact: See the Incident Response runbook for the rotation schedule.

Maintenance Windows

Regular maintenance activities:

Activity	Schedule	Duration
Cycles monitoring	Every 6 hours	Automated
Dashboard review	Weekly	30 minutes
Runbook review	Monthly	1 hour
DR drill	Quarterly	2 hours

Key Metrics

Metric	Target	Alert Threshold
Cycles balance	> 1T	< 1T warning, < 500B critical
Error rate	< 1%	> 5% warning, > 10% critical
Response time	< 2s	> 5s warning
Uptime	99.9%	< 99% critical
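The thresholds in the table above can be read as simple classification rules. The sketch below shows one way to express them. It assumes that "1T" means 10^12 cycles and "500B" means 5 × 10^11 cycles, and that rates are given as percentages. All function names are hypothetical; the actual alert rules live in the Prometheus configuration, which this page does not show.

```python
T = 10**12  # assumed meaning of "T": one trillion cycles
B = 10**9   # assumed meaning of "B": one billion cycles


def classify_cycles(balance: int) -> str:
    """< 500B is critical, < 1T is warning, otherwise ok."""
    if balance < 500 * B:
        return "critical"
    if balance < 1 * T:
        return "warning"
    return "ok"


def classify_error_rate(percent: float) -> str:
    """> 10% is critical, > 5% is warning, otherwise ok."""
    if percent > 10:
        return "critical"
    if percent > 5:
        return "warning"
    return "ok"


def classify_response_time(seconds: float) -> str:
    """> 5s is warning; the table defines no critical level."""
    return "warning" if seconds > 5 else "ok"


def classify_uptime(percent: float) -> str:
    """< 99% is critical; the table defines no warning level."""
    return "critical" if percent < 99 else "ok"
```

A value can miss its target without triggering an alert. For example, an error rate of 3% is above the 1% target but below the 5% warning threshold, so this sketch reports it as "ok".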

External Services

Service	Status Page	Purpose
Internet Computer	status.internetcomputer.org	Blockchain platform
PostHog	status.posthog.com	Analytics
SendGrid	status.sendgrid.com	Email delivery

Related Documentation

CI/CD Pipeline - Deployment and testing workflows
Architecture - System design overview
