Alerting
The Anton cluster implements a multi-layered alerting strategy using Prometheus, AlertManager, and Grafana to ensure rapid response to critical issues and proactive monitoring of system health.
Architecture
Prometheus evaluates the alert rules defined below and forwards firing alerts to AlertManager, which deduplicates, groups, silences, and routes them to the configured receivers (webhooks, email, Slack, PagerDuty). Grafana provides a complementary, dashboard-driven alerting path on top of the same metrics.
Alert Classification
Severity Levels
Critical Alerts
- Node Down: Kubernetes node becomes unavailable
- Storage Failure: Ceph cluster degraded or failing
- Control Plane Issues: API server or etcd problems
- Security Events: Unauthorized access attempts
Warning Alerts
- High Resource Usage: CPU/Memory above 80% for extended periods
- Disk Space Low: Storage usage above 85%
- Application Errors: Increased error rates or response times
- Certificate Expiration: TLS certificates expiring within 7 days (see the example rule after this list)
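As a concrete instance of the certificate-expiration warning, the rule below is a minimal sketch that assumes cert-manager is exporting its certmanager_certificate_expiration_timestamp_seconds gauge; adapt the metric if certificates are watched another way (e.g. the blackbox exporter's probe_ssl_earliest_cert_expiry):
- alert: CertificateExpiringSoon
  expr: (certmanager_certificate_expiration_timestamp_seconds - time()) < 7 * 24 * 3600
  for: 1h
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Certificate {{ $labels.name }} expires within 7 days"
    description: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in {{ $value | humanizeDuration }}"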
Info Alerts
- Deployment Events: Successful/failed deployments
- Scale Events: Pod scaling up/down
- Configuration Changes: ConfigMap or Secret updates
Alert States
A Prometheus alert moves through three states: inactive (the expression is false), pending (the expression is true but the configured for: duration has not yet elapsed), and firing (the expression has held true for the full for: duration). Once firing, AlertManager tracks the alert until it resolves, is silenced, or is inhibited.
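These states can be inspected directly in PromQL via the built-in ALERTS series that Prometheus maintains for every active alert:
# Alerts currently pending (expression true, for: duration not yet met)
ALERTS{alertstate="pending"}
# Firing alerts of a given severity
ALERTS{alertstate="firing", severity="critical"}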
Prometheus Alert Rules
Node-Level Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 1m
          labels:
            severity: critical
            team: infrastructure
          annotations:
            summary: "Node {{ $labels.instance }} is down"
            description: "Node has been down for more than 1 minute"
        - alert: NodeHighCPU
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
            team: infrastructure
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
            description: "CPU usage is {{ $value }}% for more than 5 minutes"
Kubernetes Alerts
- name: kubernetes.rules
  rules:
    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod has restarted {{ $value }} times in the last 15 minutes"
    - alert: DeploymentReplicasMismatch
      expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
      for: 10m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "Deployment {{ $labels.deployment }} has mismatched replicas"
        description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has had fewer available replicas than desired for more than 10 minutes"
Storage Alerts
- name: storage.rules
  rules:
    - alert: CephClusterWarning
      expr: ceph_health_status != 0
      for: 5m
      labels:
        severity: warning
        team: storage
      annotations:
        summary: "Ceph cluster health warning"
        description: "Ceph health status is {{ $value }} (1 = HEALTH_WARN, 2 = HEALTH_ERR)"
    - alert: PersistentVolumeUsageHigh
      expr: 100 * (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 85
      for: 5m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "PersistentVolume {{ $labels.persistentvolumeclaim }} usage high"
        description: "Volume usage is {{ $value }}%"
AlertManager Configuration
Routing Configuration
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 10s
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'warning-alerts'
      repeat_interval: 2h
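To confirm that a given label set lands on the intended receiver, amtool can evaluate this routing tree offline. A sketch, assuming the rendered configuration has been saved locally as alertmanager.yml:
# Print the routing tree
amtool config routes --config.file=alertmanager.yml
# Resolve the receiver for a label set (expected: critical-alerts)
amtool config routes test --config.file=alertmanager.yml severity=critical alertname=NodeDown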
Receivers Configuration
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-service/alerts'
        send_resolved: true
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
        headers:
          subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          Alert: {{ .GroupLabels.alertname }}
          Summary: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
    webhook_configs:
      - url: 'http://pagerduty-webhook/critical'
  - name: 'warning-alerts'
    webhook_configs:
      - url: 'http://slack-webhook/warnings'
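The 'warning-alerts' receiver above relays to Slack through a generic webhook; Alertmanager also ships a native Slack integration that could replace it. A minimal sketch, with a placeholder incoming-webhook URL:
- name: 'warning-alerts'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder
      channel: '#alerts'
      send_resolved: true
      title: '{{ .GroupLabels.alertname }}'
      text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'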
Silencing and Inhibition
# Inhibition rules - suppress alerts when related alerts fire
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'Node.*'
    equal: ['instance']
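Recurring quiet periods can also be encoded in the configuration itself rather than silenced by hand (Alertmanager 0.24+). A sketch of the relevant fragments, muting warning notifications during a hypothetical nightly maintenance window:
time_intervals:
  - name: nightly-maintenance
    time_intervals:
      - times:
          - start_time: '02:00'
            end_time: '04:00'
route:
  routes:
    - match:
        severity: warning
      receiver: 'warning-alerts'
      mute_time_intervals:
        - nightly-maintenance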
Management Commands
AlertManager Operations
# Port forward to AlertManager UI
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Check alert status (recent AlertManager releases only serve the v2 API)
curl http://localhost:9093/api/v2/alerts
# View current silences
curl http://localhost:9093/api/v2/silences
# Create a silence (createdBy and comment are required)
curl -XPOST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{"matchers":[{"name":"alertname","value":"NodeHighCPU","isRegex":false}],"startsAt":"'$(date -Iseconds)'","endsAt":"'$(date -d '+1 hour' -Iseconds)'","createdBy":"ops","comment":"Maintenance window"}'
Prometheus Alert Rules
# Check alert rules status
kubectl get prometheusrule -n monitoring
# Validate alert rules
promtool check rules alert-rules.yaml
# View firing alerts (port-forward Prometheus first:
#   kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090)
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
# Test alert expression
curl "http://localhost:9090/api/v1/query?query=up==0"
Grafana Alerting
# List alert rules via the provisioning API (Grafana 9+ unified alerting)
curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
  http://localhost:3000/api/v1/provisioning/alert-rules
# List configured contact points (notification channels)
curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
  http://localhost:3000/api/v1/provisioning/contact-points
# Test delivery from the UI: Alerting -> Contact points -> Test
Troubleshooting Alerts
Common Issues
# Check AlertManager configuration
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager
# Verify webhook endpoints
kubectl run test-webhook --image=curlimages/curl --rm -it -- \
curl -X POST http://webhook-service/test
# Test SMTP reachability from inside the cluster
kubectl run smtp-test --image=busybox --rm -it --restart=Never -- \
  nc -zv smtp.example.com 587
# Check alert rule evaluation
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus | grep -i "rule"
Alert Debugging
# Check why an alert isn't firing (paste the rule's expr in place of ALERT_EXPRESSION)
curl "http://localhost:9090/api/v1/query?query=ALERT_EXPRESSION"
# View alert rule groups
curl http://localhost:9090/api/v1/rules
# Check AlertManager logs for routing issues
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager | grep -i routing
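To exercise routing and delivery end to end without waiting for a real failure, a synthetic alert can be injected directly into AlertManager (using the port-forward from above); it should reach the warning receiver and auto-resolve once endsAt passes:
# Fire a short-lived synthetic warning alert
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"SyntheticTest","severity":"warning"},"annotations":{"summary":"Routing test"},"endsAt":"'$(date -d '+5 minutes' -Iseconds)'"}]'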
Best Practices
Alert Design
- Actionable: Every alert should have a clear remediation path
- Contextual: Include relevant labels and annotations
- Appropriate Urgency: Match severity to business impact
Alert Fatigue Prevention
- Proper Thresholds: Avoid overly sensitive thresholds
- Meaningful Grouping: Group related alerts together
- Regular Review: Periodically review and tune alert rules
On-Call Efficiency
- Clear Escalation: Define escalation paths for different severities
- Documentation: Link to runbooks in alert annotations (see the sketch after this list)
- Testing: Regularly test alert delivery channels
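For example, a runbook link added as an annotation travels with the alert into every notification (the URL below is a placeholder):
annotations:
  summary: "Node {{ $labels.instance }} is down"
  description: "Node has been down for more than 1 minute"
  runbook_url: "https://wiki.example.com/runbooks/node-down"  # placeholder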
The alerting system ensures rapid detection and response to issues across the Anton cluster, providing the necessary visibility and automation to maintain high availability and performance.