Grafana
Grafana is the primary visualization platform for the Anton cluster. It provides dashboards and alerting on top of Prometheus metrics, with Loki as an additional data source for logs.
Architecture
Data Sources
Prometheus Integration
# Configured as primary data source
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://kube-prometheus-stack-prometheus:9090
    isDefault: true
    access: proxy
Loki Integration
# Log data source
  - name: Loki
    type: loki
    url: http://loki-gateway:80
    access: proxy
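A provisioning list like the one above can be sanity-checked before it is committed. The following is a minimal sketch, not part of the cluster's tooling: `check_datasources` is a hypothetical helper, and the rules it applies (unique names, at most one default, http(s) URLs) are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# The two data sources described above, expressed as plain Python data.
DATASOURCES = [
    {"name": "Prometheus", "type": "prometheus",
     "url": "http://kube-prometheus-stack-prometheus:9090",
     "isDefault": True, "access": "proxy"},
    {"name": "Loki", "type": "loki",
     "url": "http://loki-gateway:80", "access": "proxy"},
]


def check_datasources(datasources):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    names = [ds.get("name") for ds in datasources]
    if len(names) != len(set(names)):
        problems.append("data source names must be unique")
    # More than one default is ambiguous, so this sketch flags it.
    defaults = [ds.get("name") for ds in datasources if ds.get("isDefault")]
    if len(defaults) > 1:
        problems.append(f"more than one default data source: {defaults}")
    for ds in datasources:
        parsed = urlparse(ds.get("url", ""))
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            problems.append(f"{ds.get('name')}: url is not a valid http(s) URL")
    return problems


print(check_datasources(DATASOURCES))  # []
```

A check of this kind could run in CI next to the Flux manifests, so a malformed entry is caught before Grafana tries to load it.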
Dashboard Categories
Infrastructure Dashboards
Cluster Overview
- Node Status: CPU, memory, disk usage across all nodes
- Pod Health: Running vs desired pod counts
- Network Traffic: Ingress/egress patterns
- Storage Usage: Ceph cluster health and capacity
Node Details
- Hardware Metrics: Temperature, power consumption (if available)
- OS Metrics: Load average, disk I/O, network interfaces
- Container Runtime: containerd performance metrics
Application Dashboards
Service Performance
- HTTP Metrics: Request rates, response times, error rates
- Database Performance: Query times, connection pools
- Custom Metrics: Application-specific KPIs
Resource Utilization
- CPU/Memory: Usage patterns and limits
- Storage I/O: Read/write operations
- Network: Service-to-service communication
Custom Dashboard Development
Dashboard as Code
Dashboards are stored in Git and deployed via Flux:
kubernetes/apps/monitoring/kube-prometheus-stack/app/dashboards/
├── cluster-overview.json
├── storage-health.json
├── application-performance.json
└── cost-analysis.json
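Because the dashboards live in Git, they can be linted before Flux deploys them. The sketch below assumes only that each file in the directory is a JSON document with a `title`; `lint_dashboards` is a hypothetical helper, not an existing tool.

```python
import json
from pathlib import Path


def lint_dashboards(directory):
    """Return {filename: problem} for dashboard files that fail basic checks."""
    problems = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems[path.name] = f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(doc, dict):
            problems[path.name] = "top level is not a JSON object"
            continue
        # Accept both a bare dashboard model and a {"dashboard": {...}} wrapper.
        model = doc.get("dashboard", doc)
        if not isinstance(model, dict) or not model.get("title"):
            problems[path.name] = "missing dashboard title"
    return problems
```

Calling `lint_dashboards("kubernetes/apps/monitoring/kube-prometheus-stack/app/dashboards")` from the repository root would return an empty dict when every file passes.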
Dashboard Structure
{
  "dashboard": {
    "title": "Cluster Overview",
    "tags": ["kubernetes", "cluster"],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "title": "CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(cpu_usage_percent)",
            "legendFormat": "CPU %"
          }
        ]
      }
    ]
  }
}
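Hand-editing dashboard JSON gets error-prone as panels multiply, so the same structure can be generated from code. This is a minimal sketch that reproduces the example above; `stat_panel` and `build_dashboard` are hypothetical helper names. Whether the outer `dashboard` key is needed depends on how the file is loaded: the HTTP API uses that wrapper, while file-based provisioning generally expects the bare model.

```python
import json


def stat_panel(title, expr, legend):
    """Build a stat panel with a single query target."""
    return {
        "title": title,
        "type": "stat",
        "targets": [{"expr": expr, "legendFormat": legend}],
    }


def build_dashboard(title, tags, panels, time_from="now-1h", time_to="now"):
    """Build a dashboard document in the same shape as the example above."""
    return {
        "dashboard": {
            "title": title,
            "tags": list(tags),
            "time": {"from": time_from, "to": time_to},
            "panels": list(panels),
        }
    }


dashboard = build_dashboard(
    "Cluster Overview",
    ["kubernetes", "cluster"],
    [stat_panel("CPU Usage", "avg(cpu_usage_percent)", "CPU %")],
)
print(json.dumps(dashboard, indent=2))
```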
Key Panels and Visualizations
Time Series Panels
# CPU usage over time
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage trend
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Network receive traffic in bits per second (bytes/s * 8)
rate(node_network_receive_bytes_total[5m]) * 8
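The arithmetic behind these three expressions is easier to check with concrete numbers. The sketch below mirrors each formula in plain Python; the sample values are made up for illustration.

```python
def cpu_usage_percent(idle_rates):
    """100 minus the average idle fraction, expressed as a percentage."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


def memory_usage_percent(available_bytes, total_bytes):
    """Share of memory that is not available, as a percentage."""
    return (1 - available_bytes / total_bytes) * 100


def bits_per_second(bytes_per_second):
    """Convert a byte rate to a bit rate."""
    return bytes_per_second * 8


# Four CPU cores with idle fractions averaging 0.75
print(cpu_usage_percent([0.75, 0.75, 0.5, 1.0]))    # 25.0
# 4 GiB available out of 16 GiB
print(memory_usage_percent(4 * 2**30, 16 * 2**30))  # 75.0
# 125 kB/s received
print(bits_per_second(125_000))                     # 1000000
```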
Stat Panels
# Current pod count (kube_pod_info is exported by kube-state-metrics)
count(kube_pod_info)
# Storage usage percentage
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
# Fraction of service endpoint targets currently up (1 = all up)
avg(up{job="kubernetes-service-endpoints"})
Table Panels
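# Top CPU consuming pods (container!="" skips the pod-level cgroup series to avoid double counting)
topk(10, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))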
# Persistent volume usage
sort_desc(100 - ((kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100))
Alerting
Grafana Alert Rules
# Define alerts within Grafana
alert:
  name: High Memory Usage
  frequency: 1m
  conditions:
    - query: avg(memory_usage_percent)
      reducer: last
      type: query
    - evaluator:
        params: [80]
        type: gt
      type: threshold
  executionErrorState: alerting
  noDataState: no_data
  for: 2m
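The interaction between `frequency: 1m`, the `gt` threshold of 80, and `for: 2m` can be illustrated with a small simulation. This is a simplified model under stated assumptions, not Grafana's actual evaluation engine: the rule is evaluated at a fixed interval, goes to pending when the threshold is first exceeded, and only becomes alerting once the condition has held for the full `for` duration.

```python
def alert_states(values, threshold=80.0, interval_s=60, for_s=120):
    """Return the state after each evaluation: 'normal', 'pending', or 'alerting'."""
    states = []
    breach_started = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_s
        if value > threshold:  # evaluator type "gt"
            if breach_started is None:
                breach_started = now
            held_for = now - breach_started
            states.append("alerting" if held_for >= for_s else "pending")
        else:
            breach_started = None  # any non-breaching sample resets the timer
            states.append("normal")
    return states


print(alert_states([70, 85, 90, 88, 75, 95]))
# ['normal', 'pending', 'pending', 'alerting', 'normal', 'pending']
```

In this model a short spike above 80% never fires; only a breach that persists across three consecutive one-minute evaluations does.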
Notification Channels
# Webhook notification
notifiers:
  - name: webhook
    type: webhook
    settings:
      url: https://hooks.slack.com/webhook/...
      httpMethod: POST
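A webhook notifier amounts to an HTTP POST with a JSON body. The sketch below builds such a request with the standard library so a receiving endpoint can be tested by hand. The endpoint URL and the payload fields are hypothetical and do not reproduce Grafana's actual webhook payload.

```python
import json
import urllib.request


def build_webhook_request(url, payload):
    """Build (but do not send) a JSON POST request for a webhook endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and payload, for illustration only
request = build_webhook_request(
    "http://localhost:8080/hook",
    {"title": "High Memory Usage", "state": "alerting"},
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```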
Access Management
Authentication
# Auth settings as Grafana Helm chart values (grafana.ini sections [auth.anonymous] and [auth.basic])
grafana.ini:
  auth.anonymous:
    enabled: false
  auth.basic:
    enabled: true
User Roles
- Admin: Full access to dashboards, data sources, and configuration
- Editor: Create and modify dashboards
- Viewer: Read-only access to dashboards
Management Commands
Access Grafana
# Port forward to Grafana UI
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Get admin password
kubectl get secret -n monitoring kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d
# Find the ingress hostname for direct access (if an ingress is configured)
kubectl get ingress -n monitoring
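The `base64 -d` step is needed because Kubernetes stores Secret values base64-encoded under `.data`. The same decoding can be sketched in Python against the JSON form of a Secret. The Secret below is made up for illustration; real input would come from `kubectl get secret ... -o json`.

```python
import base64
import json


def admin_password_from_secret(secret_json):
    """Extract and decode .data["admin-password"] from a Secret's JSON."""
    secret = json.loads(secret_json)
    encoded = secret["data"]["admin-password"]
    return base64.b64decode(encoded).decode("utf-8")


# Made-up Secret, shaped like the output of `kubectl get secret -o json`
example = json.dumps({
    "kind": "Secret",
    "data": {"admin-password": base64.b64encode(b"example-password").decode("ascii")},
})
print(admin_password_from_secret(example))  # example-password
```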
Dashboard Management
# List available dashboards
kubectl get configmap -n monitoring | grep dashboard
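# Update dashboard from file
kubectl create configmap custom-dashboard \
  --from-file=dashboard.json \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -
# Label it so the dashboard sidecar loads it (assumes the chart's default grafana_dashboard label)
kubectl label configmap custom-dashboard -n monitoring grafana_dashboard=1 --overwrite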
# Check Grafana configuration
kubectl get configmap -n monitoring kube-prometheus-stack-grafana -o yaml
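The ConfigMap that carries a dashboard can also be generated programmatically instead of through `kubectl create`. This is a minimal sketch: `dashboard_configmap` is a hypothetical helper, and the `grafana_dashboard: "1"` label is an assumption based on the chart's default sidecar settings, so it should be checked against the cluster's Helm values.

```python
import json


def dashboard_configmap(name, filename, dashboard_json, namespace="monitoring",
                        label_key="grafana_dashboard", label_value="1"):
    """Return a ConfigMap manifest (as a dict) that embeds one dashboard file."""
    json.loads(dashboard_json)  # fail early if the dashboard is not valid JSON
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {label_key: label_value},  # assumed sidecar label
        },
        "data": {filename: dashboard_json},
    }


manifest = dashboard_configmap(
    "custom-dashboard",
    "dashboard.json",
    '{"title": "Custom Dashboard", "panels": []}',
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`.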
Troubleshooting
# Check Grafana logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
# Verify data source connectivity
kubectl exec -n monitoring -c grafana deployment/kube-prometheus-stack-grafana -- \
  curl -s http://kube-prometheus-stack-prometheus:9090/api/v1/status/config
# Check plugin status
kubectl exec -n monitoring -c grafana deployment/kube-prometheus-stack-grafana -- \
  grafana-cli plugins ls
Best Practices
Dashboard Design
- Consistent Time Ranges: Use interval template variables (or Grafana's built-in $__interval) instead of hard-coded ranges, so panels follow the dashboard time picker
- Appropriate Visualizations: Choose panel types that match data characteristics
- Performance: Limit data points and use recording rules for complex queries
Organization
- Folders: Group related dashboards logically
- Tags: Use consistent tagging for searchability
- Templates: Create reusable dashboard templates
Maintenance
# Backup dashboard ConfigMaps only (assumes the sidecar label grafana_dashboard=1)
kubectl get configmap -n monitoring -l grafana_dashboard=1 -o yaml > grafana-dashboards-backup.yaml
# Monitor Grafana performance
kubectl top pod -n monitoring -l app.kubernetes.io/name=grafana
Grafana turns the cluster's raw Prometheus metrics and Loki logs into dashboards and alerts, giving a single view of the health of the Anton cluster infrastructure.