Monitoring & Alerting

This section covers the monitoring infrastructure for the Hello World DAO platform, including dashboards, metrics collection, alerting, and operational runbooks.

Overview

The monitoring stack consists of:

  • Metrics Collection: GitHub Actions workflow collecting canister metrics every 6 hours
  • Dashboards: Grafana dashboards for system health and per-canister detail views
  • Alerting: Prometheus-style alert rules with Slack and PagerDuty routing
  • Runbooks: Documented procedures for common operational tasks

| Resource | Description |
| --- | --- |
| Grafana Dashboard | System health overview |
| IC Dashboard | Internet Computer status |
| GitHub Actions | CI/CD and monitoring workflows |
| Alert Rules | Prometheus alerting configuration |

Monitoring Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Canisters   │───▶│  GitHub      │───▶│   Grafana    │
│  (IC)        │    │  Actions     │    │ (dashboards) │
└──────────────┘    └──────┬───────┘    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Alertmanager │───▶ Slack / PagerDuty
                    └──────────────┘

Dashboards

System Health Overview

Shows all canisters at a glance:

  • Cycles balance per canister (color-coded by threshold)
  • Error rate trends
  • Request latency percentiles
  • Business metrics (members, proposals, transactions)

Import: system-health.json

Per-Canister Detail

Detailed view for individual canisters:

  • Cycles balance history
  • Memory usage trends
  • Error rate over time
  • Call rate metrics

Import: per-canister.json

Alert Thresholds

| Alert | Threshold | Severity | Response Time |
| --- | --- | --- | --- |
| Low Cycles Balance | < 1T cycles | Warning | < 1 hour |
| Critical Cycles Balance | < 500B cycles | Critical | < 15 minutes |
| High Error Rate | > 5% for 5m | Warning | < 1 hour |
| Critical Error Rate | > 10% for 5m | Critical | < 30 minutes |
| Canister Unresponsive | No response for 2m | Critical | < 15 minutes |
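
A minimal sketch of how the first rows of this table could be expressed as Prometheus alert rules, assuming the metric names listed under Metrics Collected below; the actual rule file in ops-infra may differ:

```yaml
groups:
  - name: canister-alerts
    rules:
      # Warning when a canister's balance drops below 1T cycles
      - alert: LowCyclesBalance
        expr: canister_cycles_balance < 1e12
        labels:
          severity: warning
      # Critical when the balance drops below 500B cycles
      - alert: CriticalCyclesBalance
        expr: canister_cycles_balance < 5e11
        labels:
          severity: critical
      # Warning when the error rate exceeds 5% sustained for 5 minutes
      - alert: HighErrorRate
        expr: canister_error_rate > 5
        for: 5m
        labels:
          severity: warning
```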

Alert Routing

Critical Alerts
├── Slack #production-alerts
└── PagerDuty (on-call)

Warning Alerts
└── Slack #production-alerts

Info Alerts
└── Slack #ops-info
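
In Alertmanager configuration, this tree maps onto a route block along these lines; the receiver names are placeholders, defined in the setup section below:

```yaml
route:
  # Default: warnings go to the production Slack channel
  receiver: slack-production-alerts
  routes:
    # Critical alerts also page the on-call via PagerDuty
    - match:
        severity: critical
      receiver: critical-alerts
    # Info alerts go to the ops channel only
    - match:
        severity: info
      receiver: slack-ops-info
```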

Metrics Collected

Canister Metrics

| Metric | Unit | Description |
| --- | --- | --- |
| canister_cycles_balance | cycles | Current cycles balance |
| canister_memory_size | bytes | Current memory usage |
| canister_status | enum | Running/Stopped status |
| canister_error_rate | percent | Error rate percentage |
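
On the collection side, these values can be derived from dfx canister status output. A rough bash sketch; the parsing assumes dfx's human-readable output format (e.g. "Balance: 3_100_000_000_000 Cycles") and may need adjusting for your dfx version:

```bash
# Query one canister on mainnet and emit a Prometheus-style sample.
CANISTER_ID="vlmti-wqaaa-aaaad-acoiq-cai"
STATUS=$(dfx canister status "$CANISTER_ID" --network ic 2>&1)

# Extract the digits from the "Balance: ... Cycles" line, dropping underscores.
CYCLES=$(echo "$STATUS" | grep -i 'balance' | grep -o '[0-9_]\+' | tr -d '_')

echo "canister_cycles_balance{canister=\"$CANISTER_ID\"} $CYCLES"
```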

Collection Schedule

  • Automated: Every 6 hours via GitHub Actions
  • Manual: Trigger the monitor-metrics workflow on demand
  • Retention: 30 days in GitHub Actions artifacts
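
In monitor-metrics.yml this schedule corresponds to triggers roughly like the following; the structure is illustrative rather than the exact workflow:

```yaml
name: monitor-metrics
on:
  schedule:
    - cron: "0 */6 * * *"   # automated run every 6 hours
  workflow_dispatch:         # allows the manual on-demand trigger
jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      # ... steps that query each canister and write metrics/ ...
      - uses: actions/upload-artifact@v4
        with:
          name: canister-metrics
          path: metrics/
          retention-days: 30   # matches the 30-day retention policy
```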

Runbooks

For operational procedures, see:

| Topic | Runbook |
| --- | --- |
| Low cycles | Cycles Top-Up Procedure |
| High errors | High Error Rate Triage |
| Canister down | Canister Unresponsive Recovery |
| Failed deploy | Deployment Failure Recovery |
| Database issues | Database Connectivity |

Setup Guide

1. Configure GitHub Secrets

Add these secrets to the ops-infra repository:

| Secret | Purpose |
| --- | --- |
| DFX_IDENTITY_PEM | dfx identity for canister status checks |
| SLACK_WEBHOOK_URL | Slack incoming webhook for alerts |
| PAGERDUTY_ROUTING_KEY | PagerDuty Events API routing key |
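
These can be set with the GitHub CLI, assuming admin access to the repository; <owner> and identity.pem are placeholders:

```bash
# Read the identity PEM from a local file (placeholder filename)
gh secret set DFX_IDENTITY_PEM --repo <owner>/ops-infra < identity.pem

# gh prompts for the value when nothing is piped in
gh secret set SLACK_WEBHOOK_URL --repo <owner>/ops-infra
gh secret set PAGERDUTY_ROUTING_KEY --repo <owner>/ops-infra
```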

2. Import Grafana Dashboards

  1. Open Grafana
  2. Go to Dashboards > Import
  3. Upload JSON files from ops-infra/monitoring/dashboards/
  4. Configure Prometheus data source if prompted

3. Configure Alertmanager

  1. Update alertmanager.yml with your Slack webhook URL
  2. Update PagerDuty routing key
  3. Deploy Alertmanager configuration
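
The receivers referenced by the routing sketch earlier on this page would look roughly like this. Note that Alertmanager does not expand environment variables itself, so the webhook URL and routing key must be substituted into the file before deployment:

```yaml
receivers:
  - name: slack-production-alerts
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"   # substituted at deploy time
        channel: "#production-alerts"
  - name: critical-alerts
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"
        channel: "#production-alerts"
    pagerduty_configs:
      - routing_key: "<PAGERDUTY_ROUTING_KEY>"
  - name: slack-ops-info
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"
        channel: "#ops-info"
```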

4. Add Canisters to Monitor

Edit the monitor-metrics.yml workflow to add canister IDs:

```yaml
CANISTERS=(
  "frontend:vlmti-wqaaa-aaaad-acoiq-cai"
  "user-service:<canister-id>"
  # Add more canisters as deployed
)
```

Troubleshooting

No Metrics in Grafana

  1. Verify monitoring workflow is running successfully
  2. Check GitHub Actions logs for errors
  3. Confirm Grafana is configured with correct data source
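
With the GitHub CLI, the first two checks can be done from a terminal:

```bash
# List recent runs of the monitoring workflow
gh run list --workflow=monitor-metrics.yml --limit 5

# Inspect the logs of a specific (e.g. failing) run
gh run view <run-id> --log
```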

Alerts Not Firing

  1. Check Alertmanager status
  2. Verify alert rules are loaded
  3. Test alert by manually triggering threshold breach
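
If amtool (bundled with Alertmanager releases) is available, these checks can be scripted; the localhost URL is a placeholder for your Alertmanager instance:

```bash
# Confirm Alertmanager is reachable and list currently firing alerts
amtool --alertmanager.url=http://localhost:9093 alert query

# Verify the routing tree parsed from the deployed configuration
amtool config routes show --config.file=alertmanager.yml

# Fire a synthetic alert to exercise the pipeline end to end
amtool --alertmanager.url=http://localhost:9093 alert add \
  alertname=TestAlert severity=warning
```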

Slack Notifications Not Working

  1. Verify webhook URL is correct
  2. Test webhook with curl:
    ```bash
    curl -X POST -H 'Content-type: application/json' \
      --data '{"text":"Test alert"}' \
      "$SLACK_WEBHOOK_URL"
    ```
  3. Check Alertmanager logs for errors
