Monitoring

The Anton cluster uses a comprehensive observability stack based on Prometheus, Grafana, and AlertManager to provide deep insights into cluster health, application performance, and resource utilization.

Architecture Overview

Core Components

Prometheus Stack

Prometheus: Time-series metrics collection and storage
Grafana: Dashboard creation and visualization
AlertManager: Alert routing and notification management
Node Exporter: Hardware and OS metrics
kube-state-metrics: Kubernetes API object metrics

Key Features

Service Discovery: Automatic discovery of monitoring targets
Long-term Storage: Persistent metrics with configurable retention
High Availability: Cluster-aware alerting and deduplication
Custom Dashboards: Pre-configured and custom Grafana dashboards

Deployment

The monitoring stack is deployed using the kube-prometheus-stack Helm chart in the monitoring namespace:

# Deployed via Flux GitOps
namespace: monitoring
chart: kube-prometheus-stack
version: 76.3.1

Key Metrics

Cluster Health

Node Status: CPU, memory, disk usage per node
Pod Health: Running, pending, failed pod counts
Resource Utilization: CPU and memory requests/limits vs usage

Application Performance

HTTP Metrics: Request rates, response times, error rates
Custom Metrics: Application-specific performance indicators
Service Dependencies: Inter-service communication metrics

Infrastructure Metrics

Storage: Ceph cluster health, OSD performance, disk usage
Network: Bandwidth utilization, packet loss, connection counts
Control Plane: API server performance, etcd health

Access and Commands

Status Commands

# Check monitoring stack pods
kubectl get pods -n monitoring

# View Prometheus targets
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Check AlertManager status
kubectl get alertmanager -n monitoring

# View Grafana service
kubectl get svc -n monitoring | grep grafana

Configuration

# View Prometheus configuration
kubectl get prometheus -n monitoring -o yaml

# Check alert rules
kubectl get prometheusrule -n monitoring

# View Grafana configuration
kubectl get configmap -n monitoring | grep grafana

Troubleshooting

# Check monitoring namespace events
kubectl get events -n monitoring --sort-by='.lastTimestamp'

# View Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=kube-prometheus-stack-prometheus

# Check Grafana logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

# Verify service monitors
kubectl get servicemonitor -n monitoring

Dashboards

Pre-configured Dashboards

Cluster Overview: High-level cluster health and resource usage
Node Metrics: Per-node hardware and OS metrics
Pod Metrics: Container resource usage and health
Storage Health: Ceph cluster status and performance
Network Overview: Cilium and ingress metrics

Custom Dashboards

Located in kubernetes/apps/monitoring/kube-prometheus-stack/app/dashboards/:

Storage Health: Rook-Ceph specific metrics
Application Performance: Custom application metrics
Cost Analysis: Resource usage and optimization insights

Alerting

Alert Categories

Critical: Immediate attention required (node down, storage failure)
Warning: Potential issues (high resource usage, slow responses)
Info: Informational alerts (deployments, scaling events)

Alert Destinations

Webhooks: Integration with external systems
Email: Critical alert notifications
Slack: Team notifications (when configured)

Performance Optimization

Retention Policies

# Prometheus data retention
retention: 15d
retentionSize: 10GB

Query Optimization

Use recording rules for frequently queried metrics
Implement appropriate scrape intervals
Configure resource limits for monitoring components

The monitoring stack provides comprehensive observability across all layers of the Anton cluster, enabling proactive issue detection and performance optimization.

Architecture Overview​

Core Components​

Prometheus Stack​

Key Features​

Deployment​

Key Metrics​

Cluster Health​

Application Performance​

Infrastructure Metrics​

Access and Commands​

Status Commands​

Configuration​

Troubleshooting​

Dashboards​

Pre-configured Dashboards​

Custom Dashboards​

Alerting​

Alert Categories​

Alert Destinations​

Performance Optimization​

Retention Policies​

Query Optimization​