Operations Documentation

This section contains operational documentation for the Hello World DAO platform, including monitoring, runbooks, and incident response procedures.

| Resource | Description |
|---|---|
| Canister Monitoring | Cycle balance monitoring and top-up procedures |
| Monitoring & Alerting | Grafana dashboards and alerting setup |
| Incident Response | General incident handling |

Runbooks

Detailed procedures for common operational tasks:

| Runbook | Purpose |
|---|---|
| Cycles Top-Up | Top up canister cycles when low |
| High Error Rate | Triage elevated error rates |
| Canister Unresponsive | Recover unresponsive canisters |
| Deployment Failure | Handle failed deployments and rollback |
| Database Connectivity | Troubleshoot database and service connectivity |

Monitoring Stack

┌──────────────────────────────────────────────────┐
│              Monitoring Infrastructure            │
├──────────────────────────────────────────────────┤
│  Metrics Collection                              │
│  ├── GitHub Actions workflow (every 6 hours)     │
│  └── Canister status via dfx                     │
├──────────────────────────────────────────────────┤
│  Visualization                                   │
│  ├── Grafana dashboards                          │
│  └── IC Dashboard                                │
├──────────────────────────────────────────────────┤
│  Alerting                                        │
│  ├── Prometheus alert rules                      │
│  ├── Slack notifications                         │
│  └── PagerDuty escalation (critical)             │
└──────────────────────────────────────────────────┘
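
The dfx-based collection step might look roughly like the sketch below. It assumes that `dfx canister status` reports the balance on a line of the form `Balance: 3_092_278_099_590 Cycles`; the exact format can differ between dfx versions, so treat the regular expression as an assumption. The function names and the canister name `backend` are hypothetical and are not taken from the actual workflow.

```python
import re
import subprocess

# Assumed output format: a line such as "Balance: 3_092_278_099_590 Cycles".
BALANCE_RE = re.compile(
    r"^\s*Balance:\s*([\d_,]+)\s*Cycles", re.MULTILINE | re.IGNORECASE
)


def parse_cycle_balance(status_output: str) -> int:
    """Extract the cycle balance from `dfx canister status` output."""
    match = BALANCE_RE.search(status_output)
    if match is None:
        raise ValueError("no 'Balance: ... Cycles' line found in status output")
    # Drop digit separators before converting to an integer.
    return int(match.group(1).replace("_", "").replace(",", ""))


def fetch_cycle_balance(canister: str, network: str = "ic") -> int:
    """Run dfx for one canister and return its cycle balance."""
    result = subprocess.run(
        ["dfx", "canister", "status", canister, "--network", network],
        capture_output=True,
        text=True,
        check=True,
    )
    # dfx may write the status report to stderr, so scan both streams.
    return parse_cycle_balance(result.stdout + "\n" + result.stderr)
```

A scheduled job could call `fetch_cycle_balance("backend")` for each canister and pass the result on to whatever raises the alerts.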

Alert Severity Levels

| Severity | Response Time | Examples |
|---|---|---|
| Critical | < 15 minutes | Canister unresponsive, cycles depleted |
| Warning | < 1 hour | Low cycles, elevated error rate |
| Info | Next business day | No new members, high proposal volume |
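
As an illustration of how these response times translate into deadlines, here is a minimal sketch. It assumes that "next business day" means the end of the next Monday-to-Friday day and ignores public holidays; the function name and that interpretation are assumptions, not part of the documented process.

```python
from datetime import datetime, timedelta

# Response-time targets taken from the severity table.
RESPONSE_WINDOWS = {
    "critical": timedelta(minutes=15),
    "warning": timedelta(hours=1),
}


def response_deadline(severity: str, fired_at: datetime) -> datetime:
    """Return the latest time by which an alert should get a response."""
    level = severity.lower()
    if level in RESPONSE_WINDOWS:
        return fired_at + RESPONSE_WINDOWS[level]
    if level == "info":
        # Assumption: "next business day" = end of the next Mon-Fri day.
        day = fired_at + timedelta(days=1)
        while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            day += timedelta(days=1)
        return day.replace(hour=23, minute=59, second=59, microsecond=0)
    raise ValueError(f"unknown severity: {severity!r}")
```

For example, a Critical alert that fires at 10:00 has a deadline of 10:15 the same day, while an Info alert that fires on a Friday is due by the end of the following Monday.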

On-Call Rotation

On-call engineers are the first responders for production incidents.

Responsibilities:

  • Acknowledge alerts within 15 minutes
  • Follow runbook procedures
  • Escalate P1/P2 incidents to team lead
  • Document incident resolution

Contact: See the Incident Response runbook for the rotation schedule.

Maintenance Windows

Regular maintenance activities:

| Activity | Schedule | Duration |
|---|---|---|
| Cycles monitoring | Every 6 hours | Automated |
| Dashboard review | Weekly | 30 minutes |
| Runbook review | Monthly | 1 hour |
| DR drill | Quarterly | 2 hours |

Key Metrics

| Metric | Target | Alert Threshold |
|---|---|---|
| Cycles balance | > 1T | < 1T warning, < 500B critical |
| Error rate | < 1% | > 5% warning, > 10% critical |
| Response time | < 2s | > 5s warning |
| Uptime | 99.9% | < 99% critical |
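
The thresholds in this table can be read as simple classification rules. The sketch below is one way to encode them, assuming that 1T means 10^12 cycles and 500B means 5 x 10^11 cycles; the function names are hypothetical and this is not the platform's actual alert rule set. A return value of `None` means no alert fires, even if the metric misses its target.

```python
from typing import Optional

T = 10**12  # one trillion cycles (assumed meaning of "1T")
B = 10**9   # one billion cycles (assumed meaning of "500B")


def classify_cycles(balance: int) -> Optional[str]:
    """Cycles balance: < 500B critical, < 1T warning."""
    if balance < 500 * B:
        return "critical"
    if balance < 1 * T:
        return "warning"
    return None


def classify_error_rate(percent: float) -> Optional[str]:
    """Error rate: > 10% critical, > 5% warning."""
    if percent > 10:
        return "critical"
    if percent > 5:
        return "warning"
    return None


def classify_response_time(seconds: float) -> Optional[str]:
    """Response time: > 5s warning (the table defines no critical level)."""
    return "warning" if seconds > 5 else None


def classify_uptime(percent: float) -> Optional[str]:
    """Uptime: < 99% critical (the table defines no warning level)."""
    return "critical" if percent < 99 else None
```

Note the gap between target and threshold: an error rate of 3% misses the < 1% target but stays below the 5% warning threshold, so none of these rules would fire.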

External Services

| Service | Status Page | Purpose |
|---|---|---|
| Internet Computer | status.internetcomputer.org | Blockchain platform |
| PostHog | status.posthog.com | Analytics |
| SendGrid | status.sendgrid.com | Email delivery |

Hello World Co-Op DAO