Monitoring & Alerting

This section covers the monitoring infrastructure for the Hello World DAO platform, including dashboards, metrics collection, alerting, and operational runbooks.

Overview

The monitoring stack consists of:

  • Metrics Collection: GitHub Actions workflow collecting canister metrics every 6 hours
  • Dashboards: Grafana dashboards for system health and per-canister detail views
  • Alerting: Prometheus-style alert rules with Slack and PagerDuty routing
  • Runbooks: Documented procedures for common operational tasks

| Resource | Description |
|---|---|
| Grafana Dashboard | System health overview |
| IC Dashboard | Internet Computer status |
| GitHub Actions | CI/CD and monitoring workflows |
| Alert Rules | Prometheus alerting configuration |

Monitoring Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Canisters   │───▶│  GitHub      │───▶│   Grafana    │
│  (IC)        │    │  Actions     │    │ (dashboards) │
└──────────────┘    └──────────────┘    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Alert Manager│───▶ Slack / PagerDuty
                    └──────────────┘

Dashboards

System Health Overview

Shows all canisters at a glance:

  • Cycles balance per canister (color-coded by threshold)
  • Error rate trends
  • Request latency percentiles
  • Business metrics (members, proposals, transactions)

Import: system-health.json

Per-Canister Detail

Detailed view for individual canisters:

  • Cycles balance history
  • Memory usage trends
  • Error rate over time
  • Call rate metrics

Import: per-canister.json

Alert Thresholds

| Alert | Threshold | Severity | Response Time |
|---|---|---|---|
| Low Cycles Balance | < 1T cycles | Warning | < 1 hour |
| Critical Cycles Balance | < 500B cycles | Critical | < 15 minutes |
| High Error Rate | > 5% for 5m | Warning | < 1 hour |
| Critical Error Rate | > 10% for 5m | Critical | < 30 minutes |
| Canister Unresponsive | No response 2m | Critical | < 15 minutes |
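
The cycles thresholds above can be expressed as a small shell check. This is a minimal sketch, not the project's actual alert rules: the function name `classify_cycles` and the variable names are hypothetical, and it assumes 1T means 10^12 cycles and 500B means 5 × 10^11 cycles.

```bash
# Thresholds from the table above, written out as raw cycle counts (assumed units).
WARN_CYCLES=1000000000000   # 1T cycles
CRIT_CYCLES=500000000000    # 500B cycles

# classify_cycles BALANCE -> prints "critical", "warning", or "ok".
# Hypothetical helper for illustration only.
classify_cycles() {
  local balance=$1
  if [ "$balance" -lt "$CRIT_CYCLES" ]; then
    echo "critical"
  elif [ "$balance" -lt "$WARN_CYCLES" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

classify_cycles 3000000000000   # prints "ok"
classify_cycles 750000000000    # prints "warning"
classify_cycles 120000000000    # prints "critical"
```

Because both thresholds use a strict "less than", a balance of exactly 1T cycles is still reported as "ok" in this sketch.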

Alert Routing

Critical Alerts
├── Slack #production-alerts
└── PagerDuty (on-call)

Warning Alerts
└── Slack #production-alerts

Info Alerts
└── Slack #ops-info
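
The routing tree above amounts to a lookup from severity to destinations. The sketch below illustrates that mapping in shell; the function name `route_alert`, the lowercase severity values, and the `slack:` / `pagerduty:` prefixes are assumptions made for illustration, while the channel names come from the tree.

```bash
# route_alert SEVERITY -> prints one destination per line.
# Hypothetical helper; real routing lives in the Alertmanager configuration.
route_alert() {
  case "$1" in
    critical) printf '%s\n' "slack:#production-alerts" "pagerduty:on-call" ;;
    warning)  printf '%s\n' "slack:#production-alerts" ;;
    info)     printf '%s\n' "slack:#ops-info" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

route_alert critical
```

Only critical alerts fan out to two destinations; an unrecognized severity returns a non-zero status rather than being dropped silently.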

Metrics Collected

Canister Metrics

| Metric | Unit | Description |
|---|---|---|
| canister_cycles_balance | cycles | Current cycles balance |
| canister_memory_size | bytes | Current memory usage |
| canister_status | enum | Running/Stopped status |
| canister_error_rate | percent | Error rate percentage |
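
Since the dashboards read from a Prometheus data source, the metrics above could be written in the Prometheus text exposition format. The following is a sketch under stated assumptions, not the workflow's actual output: the `canister` label name, the helper names, the sample values, and the encoding of `canister_status` as 1 (Running) or 0 (anything else) are all hypothetical.

```bash
# format_metric NAME CANISTER VALUE -> one line of Prometheus text format.
# The "canister" label name is an assumption.
format_metric() {
  printf '%s{canister="%s"} %s\n' "$1" "$2" "$3"
}

# status_to_number STATUS -> 1 for Running, 0 otherwise (assumed encoding).
status_to_number() {
  case "$1" in
    Running) echo 1 ;;
    *)       echo 0 ;;
  esac
}

# Sample values are made up for illustration.
format_metric canister_cycles_balance frontend 3000000000000
format_metric canister_memory_size frontend 2471918
format_metric canister_status frontend "$(status_to_number Running)"
format_metric canister_error_rate frontend 1.5
```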

Collection Schedule

  • Automated: Every 6 hours via GitHub Actions
  • Manual: Trigger the monitor-metrics workflow on demand
  • Retention: 30 days in GitHub Actions artifacts

Runbooks

For operational procedures, see:

| Topic | Runbook |
|---|---|
| Low cycles | Cycles Top-Up Procedure |
| High errors | High Error Rate Triage |
| Canister down | Canister Unresponsive Recovery |
| Failed deploy | Deployment Failure Recovery |
| Database issues | Database Connectivity |

Setup Guide

1. Configure GitHub Secrets

Add these secrets to the ops-infra repository:

| Secret | Purpose |
|---|---|
| DFX_IDENTITY_PEM | dfx identity for canister status checks |
| SLACK_WEBHOOK_URL | Slack incoming webhook for alerts |
| PAGERDUTY_ROUTING_KEY | PagerDuty Events API routing key |
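
A workflow step can fail fast when one of these secrets is missing. The sketch below assumes the secrets are exposed to the step as environment variables with the same names; the function name `check_required_secrets` is hypothetical. It reports only the names of missing secrets and never prints their values.

```bash
# check_required_secrets -> returns 0 if all three secrets are non-empty,
# otherwise lists the missing names on stderr and returns 1.
# Hypothetical helper; assumes secrets are passed in as environment variables.
check_required_secrets() {
  local missing=0 name value
  for name in DFX_IDENTITY_PEM SLACK_WEBHOOK_URL PAGERDUTY_ROUTING_KEY; do
    # Look up the variable named by $name (names are fixed, so eval is safe here).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing secret: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_required_secrets || echo "configure the missing secrets before running the workflow"
```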

2. Import Grafana Dashboards

  1. Open Grafana
  2. Go to Dashboards > Import
  3. Upload the JSON files from ops-infra/monitoring/dashboards/
  4. Configure the Prometheus data source if prompted

3. Configure Alertmanager

  1. Update alertmanager.yml with your Slack webhook URL
  2. Update the PagerDuty routing key
  3. Deploy the Alertmanager configuration

4. Add Canisters to Monitor

Edit the monitor-metrics.yml workflow to add canister IDs. Each entry in the CANISTERS array is a "name:canister-id" pair:

bash
CANISTERS=(
  "frontend:vlmti-wqaaa-aaaad-acoiq-cai"
  "user-service:<canister-id>"
  # Add more canisters as deployed
)
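
Each "name:canister-id" entry can be split with plain shell parameter expansion. This is a sketch of how a workflow step might consume the entries, not the actual workflow code; the function name `list_canisters` is hypothetical.

```bash
# list_canisters ENTRY... -> prints "name id" for each "name:id" entry.
# Hypothetical helper for illustration only.
list_canisters() {
  local entry name id
  for entry in "$@"; do
    name=${entry%%:*}   # text before the first colon
    id=${entry#*:}      # text after the first colon
    printf '%s %s\n' "$name" "$id"
  done
}

# In the workflow this would be called as: list_canisters "${CANISTERS[@]}"
list_canisters "frontend:vlmti-wqaaa-aaaad-acoiq-cai"
```

Splitting on the first colon keeps the hyphenated canister ID intact, since the ID itself contains no colons.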

Troubleshooting

No Metrics in Grafana

  1. Verify that the monitor-metrics workflow is running successfully
  2. Check the GitHub Actions logs for errors
  3. Confirm Grafana is configured with the correct data source

Alerts Not Firing

  1. Check Alertmanager status
  2. Verify that the alert rules are loaded
  3. Test an alert by manually triggering a threshold breach
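
If triggering a real threshold breach is impractical, a synthetic alert can be posted straight to Alertmanager's v2 API to exercise the notification path. Note that this tests routing only, not whether the alert rules themselves evaluate correctly. The sketch below is hedged accordingly: the URL (Alertmanager's default port on localhost), the helper names, the alert name `ManualTest`, and the `severity` label values are all assumptions.

```bash
# Assumed address; 9093 is Alertmanager's default port.
ALERTMANAGER_URL=${ALERTMANAGER_URL:-http://localhost:9093}

# build_test_alert NAME SEVERITY -> prints a JSON array holding one alert.
# Hypothetical helper; label values are assumptions about the routing config.
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"}}]\n' "$1" "$2"
}

# send_test_alert NAME SEVERITY -> posts the alert to Alertmanager.
send_test_alert() {
  build_test_alert "$1" "$2" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data @- "$ALERTMANAGER_URL/api/v2/alerts"
}

# Print the payload without sending anything.
build_test_alert ManualTest warning
# To actually send it: send_test_alert ManualTest warning
```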

Slack Notifications Not Working

  1. Verify that the webhook URL is correct
  2. Test webhook with curl:
    bash
    curl -X POST -H 'Content-type: application/json' \
      --data '{"text":"Test alert"}' \
      "$SLACK_WEBHOOK_URL"
  3. Check Alertmanager logs for errors

Hello World Co-Op DAO