Monitoring & Alerting
This section covers the monitoring infrastructure for the Hello World DAO platform, including dashboards, metrics collection, alerting, and operational runbooks.
Overview
The monitoring stack consists of:
- Metrics Collection: GitHub Actions workflow collecting canister metrics every 6 hours
- Dashboards: Grafana dashboards for system health and per-canister detail views
- Alerting: Prometheus-style alert rules with Slack and PagerDuty routing
- Runbooks: Documented procedures for common operational tasks
Quick Links
| Resource | Description |
|---|---|
| Grafana Dashboard | System health overview |
| IC Dashboard | Internet Computer status |
| GitHub Actions | CI/CD and monitoring workflows |
| Alert Rules | Prometheus alerting configuration |
Monitoring Architecture
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Canisters   │─────▶│   GitHub     │─────▶│   Grafana    │
│     (IC)     │      │   Actions    │      │ (dashboards) │
└──────────────┘      └──────────────┘      └──────────────┘
                             │
                             ▼
                      ┌──────────────┐
                      │ Alert Manager│─────▶ Slack / PagerDuty
                      └──────────────┘

Dashboards
System Health Overview
Shows all canisters at a glance:
- Cycles balance per canister (color-coded by threshold)
- Error rate trends
- Request latency percentiles
- Business metrics (members, proposals, transactions)
Import: system-health.json
Per-Canister Detail
Detailed view for individual canisters:
- Cycles balance history
- Memory usage trends
- Error rate over time
- Call rate metrics
Import: per-canister.json
Alert Thresholds
| Alert | Threshold | Severity | Response Time |
|---|---|---|---|
| Low Cycles Balance | < 1T cycles | Warning | < 1 hour |
| Critical Cycles Balance | < 500B cycles | Critical | < 15 minutes |
| High Error Rate | > 5% for 5m | Warning | < 1 hour |
| Critical Error Rate | > 10% for 5m | Critical | < 30 minutes |
| Canister Unresponsive | No response for 2m | Critical | < 15 minutes |
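For reference, the thresholds above roughly translate into Prometheus-style rules along the lines of the sketch below. This is illustrative only, not the repository's actual rule file: it assumes the `canister_cycles_balance` metric from the Metrics Collected table further down, and the alert names are made up for the example.

```yaml
# Illustrative sketch -- alert names and label values are assumptions,
# not the rules shipped in ops-infra.
groups:
  - name: canister-cycles
    rules:
      - alert: LowCyclesBalance
        expr: canister_cycles_balance < 1e12        # below 1T cycles
        labels:
          severity: warning
        annotations:
          summary: "Cycles balance below 1T for {{ $labels.canister }}"
      - alert: CriticalCyclesBalance
        expr: canister_cycles_balance < 5e11        # below 500B cycles
        labels:
          severity: critical
        annotations:
          summary: "Cycles balance below 500B for {{ $labels.canister }}"
```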
Alert Routing
Critical Alerts
├── Slack #production-alerts
└── PagerDuty (on-call)

Warning Alerts
└── Slack #production-alerts

Info Alerts
└── Slack #ops-info

Metrics Collected
Canister Metrics
| Metric | Unit | Description |
|---|---|---|
| `canister_cycles_balance` | cycles | Current cycles balance |
| `canister_memory_size` | bytes | Current memory usage |
| `canister_status` | enum | Running/Stopped status |
| `canister_error_rate` | percent | Error rate percentage |
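These values can be read for a single canister with `dfx canister status`; the snippet below is a hedged example of a manual spot-check and may not match how the `monitor-metrics` workflow actually collects or parses them.

```bash
# Hedged example of a manual metrics check; the canister ID is the
# frontend canister listed in the setup guide below.
CANISTER_ID="vlmti-wqaaa-aaaad-acoiq-cai"

# Reports status (Running/Stopped), memory size, and cycles balance
dfx canister status "$CANISTER_ID" --network ic
```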
Collection Schedule
- Automated: Every 6 hours via GitHub Actions (see the trigger sketch below)
- Manual: Trigger the `monitor-metrics` workflow on demand
- Retention: 30 days in GitHub Actions artifacts
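The schedule and the manual trigger correspond to a GitHub Actions `on:` block roughly like the following; this is a sketch, and the actual `monitor-metrics.yml` may be structured differently.

```yaml
# Sketch of the workflow triggers; the field names are standard GitHub
# Actions syntax, but the surrounding workflow content is assumed.
name: monitor-metrics
on:
  schedule:
    - cron: "0 */6 * * *"   # every 6 hours (UTC)
  workflow_dispatch: {}      # allows on-demand manual runs
```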
Runbooks
For operational procedures, see:
| Topic | Runbook |
|---|---|
| Low cycles | Cycles Top-Up Procedure |
| High errors | High Error Rate Triage |
| Canister down | Canister Unresponsive Recovery |
| Failed deploy | Deployment Failure Recovery |
| Database issues | Database Connectivity |
Setup Guide
1. Configure GitHub Secrets
Add these secrets to the ops-infra repository:
| Secret | Purpose |
|---|---|
| `DFX_IDENTITY_PEM` | dfx identity for canister status checks |
| `SLACK_WEBHOOK_URL` | Slack incoming webhook for alerts |
| `PAGERDUTY_ROUTING_KEY` | PagerDuty Events API routing key |
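If you prefer the GitHub CLI over the web UI, the secrets can be set as sketched below; the `<org>` path, file name, and values are placeholders, not real credentials.

```bash
# Placeholders throughout -- substitute your org, identity file, and keys.
gh secret set DFX_IDENTITY_PEM --repo <org>/ops-infra < identity.pem
gh secret set SLACK_WEBHOOK_URL --repo <org>/ops-infra --body "https://hooks.slack.com/services/..."
gh secret set PAGERDUTY_ROUTING_KEY --repo <org>/ops-infra --body "<routing-key>"
```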
2. Import Grafana Dashboards
- Open Grafana
- Go to Dashboards > Import
- Upload JSON files from `ops-infra/monitoring/dashboards/` (or push them via the API as sketched below)
- Configure the Prometheus data source if prompted
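As an alternative to the UI steps above, dashboards can also be pushed through Grafana's HTTP API. This is a hedged sketch: `GRAFANA_URL` and `GRAFANA_API_TOKEN` are placeholders, and it assumes `jq` is available locally.

```bash
# Wrap each dashboard JSON in the payload shape expected by POST /api/dashboards/db
for f in ops-infra/monitoring/dashboards/*.json; do
  jq '{dashboard: ., overwrite: true}' "$f" | \
    curl -s -X POST "$GRAFANA_URL/api/dashboards/db" \
      -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data @-
done
```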
3. Configure Alertmanager
- Update `alertmanager.yml` with your Slack webhook URL (see the routing sketch below)
- Update the PagerDuty routing key
- Deploy the Alertmanager configuration
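A minimal sketch of what the severity-based routing from the Alert Routing section can look like in `alertmanager.yml`; the receiver names are assumptions, not the file's actual contents.

```yaml
# Sketch only -- receiver names and webhook/key wiring are placeholders.
route:
  receiver: slack-production-alerts        # default: warnings go to #production-alerts
  routes:
    - match:
        severity: critical
      receiver: slack-production-alerts    # critical goes to Slack...
      continue: true
    - match:
        severity: critical
      receiver: pagerduty-oncall           # ...and pages the on-call via PagerDuty
    - match:
        severity: info
      receiver: slack-ops-info             # info-only alerts go to #ops-info
```

Before deploying, the edited file can be checked with `amtool` (the path here is an assumption):

```bash
amtool check-config alertmanager.yml
```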
4. Add Canisters to Monitor
Edit the `monitor-metrics.yml` workflow to add canister IDs:
```bash
CANISTERS=(
  "frontend:vlmti-wqaaa-aaaad-acoiq-cai"
  "user-service:<canister-id>"
  # Add more canisters as deployed
)
```

Troubleshooting
No Metrics in Grafana
- Verify monitoring workflow is running successfully
- Check GitHub Actions logs for errors
- Confirm Grafana is configured with correct data source
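To check the first two items from the command line, the GitHub CLI can help; this is a hedged example where `<org>` and the run ID are placeholders.

```bash
# List recent runs of the monitoring workflow and inspect a failing one
gh run list --repo <org>/ops-infra --workflow monitor-metrics.yml --limit 10
gh run view <run-id> --repo <org>/ops-infra --log-failed
```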
Alerts Not Firing
- Check Alertmanager status
- Verify alert rules are loaded
- Test alert by manually triggering threshold breach
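A hedged sketch of checking these with `amtool`; the config path and the Alertmanager URL are placeholders for your deployment.

```bash
# Show the routing tree parsed from the config file
amtool config routes show --config.file alertmanager.yml
# Check which receiver a critical alert would be routed to
amtool config routes test --config.file alertmanager.yml severity=critical
# List alerts currently held by a running Alertmanager
amtool alert query --alertmanager.url http://localhost:9093
```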
Slack Notifications Not Working
- Verify webhook URL is correct
- Test webhook with curl:

```bash
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test alert"}' \
  "$SLACK_WEBHOOK_URL"
```

- Check Alertmanager logs for errors