Skip to main content

Monitoring

The Anton cluster uses a comprehensive observability stack based on Prometheus, Grafana, and AlertManager to provide deep insights into cluster health, application performance, and resource utilization.

Architecture Overview

Core Components

Prometheus Stack

  • Prometheus: Time-series metrics collection and storage
  • Grafana: Dashboard creation and visualization
  • AlertManager: Alert routing and notification management
  • Node Exporter: Hardware and OS metrics
  • kube-state-metrics: Kubernetes API object metrics

Key Features

  • Service Discovery: Automatic discovery of monitoring targets
  • Long-term Storage: Persistent metrics with configurable retention
  • High Availability: Cluster-aware alerting and deduplication
  • Custom Dashboards: Pre-configured and custom Grafana dashboards

Deployment

The monitoring stack is deployed using the kube-prometheus-stack Helm chart in the monitoring namespace:

# Deployed via Flux GitOps
namespace: monitoring
chart: kube-prometheus-stack
version: 76.3.1

Key Metrics

Cluster Health

  • Node Status: CPU, memory, disk usage per node
  • Pod Health: Running, pending, failed pod counts
  • Resource Utilization: CPU and memory requests/limits vs usage

Application Performance

  • HTTP Metrics: Request rates, response times, error rates
  • Custom Metrics: Application-specific performance indicators
  • Service Dependencies: Inter-service communication metrics

Infrastructure Metrics

  • Storage: Ceph cluster health, OSD performance, disk usage
  • Network: Bandwidth utilization, packet loss, connection counts
  • Control Plane: API server performance, etcd health

Access and Commands

Status Commands

# Check monitoring stack pods
kubectl get pods -n monitoring

# View Prometheus targets
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Check AlertManager status
kubectl get alertmanager -n monitoring

# View Grafana service
kubectl get svc -n monitoring | grep grafana

Configuration

# View Prometheus configuration
kubectl get prometheus -n monitoring -o yaml

# Check alert rules
kubectl get prometheusrule -n monitoring

# View Grafana configuration
kubectl get configmap -n monitoring | grep grafana

Troubleshooting

# Check monitoring namespace events
kubectl get events -n monitoring --sort-by='.lastTimestamp'

# View Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=kube-prometheus-stack-prometheus

# Check Grafana logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

# Verify service monitors
kubectl get servicemonitor -n monitoring

Dashboards

Pre-configured Dashboards

  • Cluster Overview: High-level cluster health and resource usage
  • Node Metrics: Per-node hardware and OS metrics
  • Pod Metrics: Container resource usage and health
  • Storage Health: Ceph cluster status and performance
  • Network Overview: Cilium and ingress metrics

Custom Dashboards

Located in kubernetes/apps/monitoring/kube-prometheus-stack/app/dashboards/:

  • Storage Health: Rook-Ceph specific metrics
  • Application Performance: Custom application metrics
  • Cost Analysis: Resource usage and optimization insights

Alerting

Alert Categories

  • Critical: Immediate attention required (node down, storage failure)
  • Warning: Potential issues (high resource usage, slow responses)
  • Info: Informational alerts (deployments, scaling events)

Alert Destinations

  • Webhooks: Integration with external systems
  • Email: Critical alert notifications
  • Slack: Team notifications (when configured)

Performance Optimization

Retention Policies

# Prometheus data retention
retention: 15d
retentionSize: 10GB

Query Optimization

  • Use recording rules for frequently queried metrics
  • Implement appropriate scrape intervals
  • Configure resource limits for monitoring components

The monitoring stack provides comprehensive observability across all layers of the Anton cluster, enabling proactive issue detection and performance optimization.