Monitoring & Alerting

This section covers the monitoring infrastructure for the Hello World DAO platform, including dashboards, metrics collection, alerting, and operational runbooks.

Overview

The monitoring stack consists of:

  • Metrics Collection: GitHub Actions workflow collecting canister metrics every 6 hours
  • Dashboards: Grafana dashboards for system health and per-canister detail views
  • Alerting: Prometheus-style alert rules with Slack and PagerDuty routing
  • Runbooks: Documented procedures for common operational tasks

| Resource | Description |
| --- | --- |
| Grafana Dashboard | System health overview |
| IC Dashboard | Internet Computer status |
| GitHub Actions | CI/CD and monitoring workflows |
| Alert Rules | Prometheus alerting configuration |

Monitoring Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Canisters   │───▶│  GitHub      │───▶│   Grafana    │
│  (IC)        │    │  Actions     │    │ (dashboards) │
└──────────────┘    └──────┬───────┘    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Alertmanager │───▶ Slack / PagerDuty
                    └──────────────┘

Dashboards

System Health Overview

Shows all canisters at a glance:

  • Cycles balance per canister (color-coded by threshold)
  • Error rate trends
  • Request latency percentiles
  • Business metrics (members, proposals, transactions)

Import: system-health.json

Per-Canister Detail

Detailed view for individual canisters:

  • Cycles balance history
  • Memory usage trends
  • Error rate over time
  • Call rate metrics

Import: per-canister.json

Alert Thresholds

| Alert | Threshold | Severity | Response Time |
| --- | --- | --- | --- |
| Low Cycles Balance | < 1T cycles | Warning | < 1 hour |
| Critical Cycles Balance | < 500B cycles | Critical | < 15 minutes |
| High Error Rate | > 5% for 5m | Warning | < 1 hour |
| Critical Error Rate | > 10% for 5m | Critical | < 30 minutes |
| Canister Unresponsive | No response for 2m | Critical | < 15 minutes |
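
A minimal sketch of how the first rows of this table could be expressed as Prometheus alert rules, assuming the metric names listed under Metrics Collected below; the actual rule file in ops-infra may differ:

```yaml
groups:
  - name: canister-alerts
    rules:
      # Warning when a canister's balance drops below 1T cycles
      - alert: LowCyclesBalance
        expr: canister_cycles_balance < 1e12
        labels:
          severity: warning
      # Critical when the balance drops below 500B cycles
      - alert: CriticalCyclesBalance
        expr: canister_cycles_balance < 5e11
        labels:
          severity: critical
      # Warning when the error rate exceeds 5% sustained for 5 minutes
      - alert: HighErrorRate
        expr: canister_error_rate > 5
        for: 5m
        labels:
          severity: warning
```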

Alert Routing

Critical Alerts
├── Slack #production-alerts
└── PagerDuty (on-call)

Warning Alerts
└── Slack #production-alerts

Info Alerts
└── Slack #ops-info
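
In Alertmanager configuration, this tree maps onto a route block along these lines; the receiver names are placeholders, defined in the setup section below:

```yaml
route:
  # Default: warnings go to the production Slack channel
  receiver: slack-production-alerts
  routes:
    # Critical alerts also page the on-call via PagerDuty
    - match:
        severity: critical
      receiver: critical-alerts
    # Info alerts go to the ops channel only
    - match:
        severity: info
      receiver: slack-ops-info
```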

Metrics Collected

Canister Metrics

| Metric | Unit | Description |
| --- | --- | --- |
| canister_cycles_balance | cycles | Current cycles balance |
| canister_memory_size | bytes | Current memory usage |
| canister_status | enum | Running/Stopped status |
| canister_error_rate | percent | Error rate percentage |
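
On the collection side, these values can be derived from dfx canister status output. A rough bash sketch; the parsing assumes dfx's human-readable output format (e.g. "Balance: 3_100_000_000_000 Cycles") and may need adjusting for your dfx version:

```bash
# Query one canister on mainnet and emit a Prometheus-style sample.
CANISTER_ID="vlmti-wqaaa-aaaad-acoiq-cai"
STATUS=$(dfx canister status "$CANISTER_ID" --network ic 2>&1)

# Extract the digits from the "Balance: ... Cycles" line, dropping underscores.
CYCLES=$(echo "$STATUS" | grep -i 'balance' | grep -o '[0-9_]\+' | tr -d '_')

echo "canister_cycles_balance{canister=\"$CANISTER_ID\"} $CYCLES"
```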

Collection Schedule

  • Automated: Every 6 hours via GitHub Actions
  • Manual: Trigger the monitor-metrics workflow on demand
  • Retention: 30 days in GitHub Actions artifacts
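
In monitor-metrics.yml this schedule corresponds to triggers roughly like the following; the structure is illustrative rather than the exact workflow:

```yaml
name: monitor-metrics
on:
  schedule:
    - cron: "0 */6 * * *"   # automated run every 6 hours
  workflow_dispatch:         # allows the manual on-demand trigger
jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      # ... steps that query each canister and write metrics/ ...
      - uses: actions/upload-artifact@v4
        with:
          name: canister-metrics
          path: metrics/
          retention-days: 30   # matches the 30-day retention policy
```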

Runbooks

For operational procedures, see:

| Topic | Runbook |
| --- | --- |
| Low cycles | Cycles Top-Up Procedure |
| High errors | High Error Rate Triage |
| Canister down | Canister Unresponsive Recovery |
| Failed deploy | Deployment Failure Recovery |
| Database issues | Database Connectivity |

Setup Guide

1. Configure GitHub Secrets

Add these secrets to the ops-infra repository:

| Secret | Purpose |
| --- | --- |
| DFX_IDENTITY_PEM | dfx identity for canister status checks |
| SLACK_WEBHOOK_URL | Slack incoming webhook for alerts |
| PAGERDUTY_ROUTING_KEY | PagerDuty Events API routing key |
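
These can be set with the GitHub CLI, assuming admin access to the repository; <owner> and identity.pem are placeholders:

```bash
# Read the identity PEM from a local file (placeholder filename)
gh secret set DFX_IDENTITY_PEM --repo <owner>/ops-infra < identity.pem

# gh prompts for the value when nothing is piped in
gh secret set SLACK_WEBHOOK_URL --repo <owner>/ops-infra
gh secret set PAGERDUTY_ROUTING_KEY --repo <owner>/ops-infra
```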

2. Import Grafana Dashboards

  1. Open Grafana
  2. Go to Dashboards > Import
  3. Upload JSON files from ops-infra/monitoring/dashboards/
  4. Configure Prometheus data source if prompted

3. Configure Alertmanager

  1. Update alertmanager.yml with your Slack webhook URL
  2. Update PagerDuty routing key
  3. Deploy Alertmanager configuration
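
The receivers referenced by the routing sketch earlier on this page would look roughly like this. Note that Alertmanager does not expand environment variables itself, so the webhook URL and routing key must be substituted into the file before deployment:

```yaml
receivers:
  - name: slack-production-alerts
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"   # substituted at deploy time
        channel: "#production-alerts"
  - name: critical-alerts
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"
        channel: "#production-alerts"
    pagerduty_configs:
      - routing_key: "<PAGERDUTY_ROUTING_KEY>"
  - name: slack-ops-info
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"
        channel: "#ops-info"
```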

4. Add Canisters to Monitor

Edit the monitor-metrics.yml workflow to add canister IDs:

```yaml
CANISTERS=(
  "frontend:vlmti-wqaaa-aaaad-acoiq-cai"
  "user-service:<canister-id>"
  # Add more canisters as deployed
)
```

Troubleshooting

No Metrics in Grafana

  1. Verify monitoring workflow is running successfully
  2. Check GitHub Actions logs for errors
  3. Confirm Grafana is configured with correct data source
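
With the GitHub CLI, the first two checks can be done from a terminal:

```bash
# List recent runs of the monitoring workflow
gh run list --workflow=monitor-metrics.yml --limit 5

# Inspect the logs of a specific (e.g. failing) run
gh run view <run-id> --log
```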

Alerts Not Firing

  1. Check Alertmanager status
  2. Verify alert rules are loaded
  3. Test alert by manually triggering threshold breach
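
If amtool (bundled with Alertmanager releases) is available, these checks can be scripted; the localhost URL is a placeholder for your Alertmanager instance:

```bash
# Confirm Alertmanager is reachable and list currently firing alerts
amtool --alertmanager.url=http://localhost:9093 alert query

# Verify the routing tree parsed from the deployed configuration
amtool config routes show --config.file=alertmanager.yml

# Fire a synthetic alert to exercise the pipeline end to end
amtool --alertmanager.url=http://localhost:9093 alert add \
  alertname=TestAlert severity=warning
```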

Slack Notifications Not Working

  1. Verify webhook URL is correct
  2. Test webhook with curl:
    ```bash
    curl -X POST -H 'Content-type: application/json' \
      --data '{"text":"Test alert"}' \
      "$SLACK_WEBHOOK_URL"
    ```
  3. Check Alertmanager logs for errors
