Operations

Daily operations, maintenance procedures, and troubleshooting workflows for the Rook-Ceph storage system in the Anton cluster.

Daily Health Checks

Automated Health Monitoring

# Comprehensive cluster health check
./scripts/storage-health-check.ts --json

# Quick status overview
kubectl -n storage exec deploy/rook-ceph-tools -- ceph status

# Check for any warnings or errors
kubectl -n storage exec deploy/rook-ceph-tools -- ceph health detail

Key Health Indicators
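
At minimum, expect: overall status HEALTH_OK, all OSDs up and in, all placement groups active+clean, and raw usage below the nearfull threshold (85% by default). Anything else warrants a closer look with `ceph health detail`.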

Routine Maintenance

Weekly Operations

# Dump per-OSD performance counters for trend analysis
kubectl -n storage exec deploy/rook-ceph-tools -- ceph tell osd.* perf dump

# Review slow operations on a specific OSD
kubectl -n storage exec deploy/rook-ceph-tools -- ceph tell osd.0 dump_historic_slow_ops

# Validate data consistency (automatically scheduled but can be manual)
kubectl -n storage exec deploy/rook-ceph-tools -- ceph pg scrub <pg-id>

# Clean up old snapshots if any (snapshots are listed per RBD image)
kubectl -n storage exec deploy/rook-ceph-tools -- rbd ls -p ceph-blockpool
kubectl -n storage exec deploy/rook-ceph-tools -- rbd snap ls ceph-blockpool/<image-name>

Monthly Operations

# Review capacity trends and plan expansion
kubectl -n storage exec deploy/rook-ceph-tools -- ceph df detail

# Check for device wear indicators
kubectl -n storage exec deploy/rook-ceph-tools -- ceph device ls

# Perform deep scrub on critical pools
kubectl -n storage exec deploy/rook-ceph-tools -- ceph pg deep-scrub <pg-id>

# Review and update retention policies
kubectl get pvc -A --sort-by=.metadata.creationTimestamp

Capacity Management

Storage Monitoring

# Current usage breakdown
kubectl -n storage exec deploy/rook-ceph-tools -- ceph df

# Per-pool utilization
kubectl -n storage exec deploy/rook-ceph-tools -- ceph osd pool stats

# OSD utilization distribution
kubectl -n storage exec deploy/rook-ceph-tools -- ceph osd df tree

# Identify largest consumers
kubectl -n storage exec deploy/rook-ceph-tools -- rbd du -p ceph-blockpool --format json
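
The JSON output lends itself to a small script. A minimal sketch, assuming the repo's TypeScript tooling runs under Deno and that the `rbd du` JSON uses the `images`/`used_size` fields of recent Ceph releases:

// top-rbd-consumers.ts — hypothetical helper, not part of the repo's scripts.
// Shells out to the toolbox pod, parses `rbd du --format json`, and prints
// the ten largest images by actual usage.
const { stdout } = await new Deno.Command("kubectl", {
  args: [
    "-n", "storage", "exec", "deploy/rook-ceph-tools", "--",
    "rbd", "du", "-p", "ceph-blockpool", "--format", "json",
  ],
}).output();

// Field names assume a recent Ceph release; adjust if your version differs.
const report = JSON.parse(new TextDecoder().decode(stdout));
const images: { name: string; used_size: number }[] = report.images ?? [];

for (const img of images.sort((a, b) => b.used_size - a.used_size).slice(0, 10)) {
  console.log(`${img.name}\t${(img.used_size / 1024 ** 3).toFixed(1)} GiB`);
}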

Capacity Planning

# Prometheus queries for capacity planning

# Current usage percentage
100 * (ceph_cluster_total_used_bytes / ceph_cluster_total_bytes)

# Growth rate (bytes per hour)
rate(ceph_cluster_total_used_bytes[24h]) * 3600

# Projected days until 85% capacity, based on the 7-day growth rate
((ceph_cluster_total_bytes * 0.85) - ceph_cluster_total_used_bytes) /
rate(ceph_cluster_total_used_bytes[7d]) / 86400
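
To turn the last expression into a number without opening the Prometheus UI, a short script can query the HTTP API directly. A minimal sketch, assuming Deno; the PROM_URL default is an assumption and should be adjusted to the cluster's actual Prometheus service:

// capacity-projection.ts — hypothetical helper, not part of the repo's scripts.
// Evaluates the "days until 85% capacity" expression against the Prometheus HTTP API.
const PROM_URL = Deno.env.get("PROM_URL") ?? "http://prometheus.monitoring:9090";

const query =
  "((ceph_cluster_total_bytes * 0.85) - ceph_cluster_total_used_bytes)" +
  " / rate(ceph_cluster_total_used_bytes[7d]) / 86400";

const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
const body = await res.json();
const days = Number(body.data?.result?.[0]?.value?.[1]);

console.log(
  Number.isFinite(days)
    ? `~${days.toFixed(0)} days until 85% capacity at the current growth rate`
    : "no result — check that the Ceph exporter metrics exist in Prometheus",
);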

Expansion Procedures

# Adding new storage to existing nodes
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
spec:
  storage:
    nodes:
      - name: k8s-1
        devices:
          - name: /dev/sdb # Existing
          - name: /dev/sdc # New device
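
Once the new device is listed, the operator picks it up on its next reconcile and provisions an additional OSD automatically, provided the device is raw (no existing partitions or filesystem).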

Performance Monitoring

Baseline Performance Metrics

# Establish performance baselines (keep the write data so the read tests have objects to read)
kubectl -n storage exec deploy/rook-ceph-tools -- rados bench -p ceph-blockpool 60 write --no-cleanup
kubectl -n storage exec deploy/rook-ceph-tools -- rados bench -p ceph-blockpool 60 seq
kubectl -n storage exec deploy/rook-ceph-tools -- rados bench -p ceph-blockpool 60 rand
kubectl -n storage exec deploy/rook-ceph-tools -- rados -p ceph-blockpool cleanup

# Monitor real-time performance
watch "kubectl -n storage exec deploy/rook-ceph-tools -- ceph osd perf"

Performance Tuning

# Adjust OSD memory targets (if needed)
kubectl -n storage exec deploy/rook-ceph-tools -- ceph config set osd osd_memory_target 2147483648

# Tune RBD cache settings
kubectl -n storage exec deploy/rook-ceph-tools -- ceph config set client rbd_cache true
kubectl -n storage exec deploy/rook-ceph-tools -- ceph config set client rbd_cache_size 67108864

# Optimize placement group count
kubectl -n storage exec deploy/rook-ceph-tools -- ceph osd pool autoscale-status

Backup and Recovery

Volume Snapshots

# Create volume snapshots for backup
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-backup-$(date +%Y%m%d)
  namespace: database
spec:
  volumeSnapshotClassName: csi-ceph-blockpool-snapclass
  source:
    persistentVolumeClaimName: postgres-data
EOF

# List available snapshots
kubectl get volumesnapshot -A

# Restore from snapshot
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restore
  namespace: database
spec:
  dataSource:
    name: postgres-backup-20241201
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-block
EOF
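
The date-stamped pattern above is easy to automate for a fixed list of PVCs. A minimal sketch, assuming Deno; the target list below is a placeholder:

// snapshot-pvcs.ts — hypothetical helper, not part of the repo's scripts.
// Creates a date-stamped VolumeSnapshot for each listed PVC by piping a
// manifest into `kubectl apply -f -`.
const date = new Date().toISOString().slice(0, 10).replaceAll("-", ""); // e.g. 20241201
const targets = [{ namespace: "database", pvc: "postgres-data" }]; // placeholder list

for (const { namespace, pvc } of targets) {
  const manifest = `
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ${pvc}-backup-${date}
  namespace: ${namespace}
spec:
  volumeSnapshotClassName: csi-ceph-blockpool-snapclass
  source:
    persistentVolumeClaimName: ${pvc}
`;
  const child = new Deno.Command("kubectl", { args: ["apply", "-f", "-"], stdin: "piped" }).spawn();
  const writer = child.stdin.getWriter();
  await writer.write(new TextEncoder().encode(manifest));
  await writer.close();
  const { success } = await child.status;
  if (!success) console.error(`snapshot for ${namespace}/${pvc} failed`);
}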

Cluster-Level Backup

# Export cluster configuration
kubectl get cephcluster -n storage rook-ceph -o yaml > cluster-backup.yaml
kubectl get cephblockpool -n storage -o yaml >> cluster-backup.yaml
kubectl get cephobjectstore -n storage -o yaml >> cluster-backup.yaml

# Backup critical data using Velero (if installed)
velero backup create storage-backup --include-namespaces storage

Troubleshooting Workflows

OSD Issues

# Identify problematic OSD
kubectl -n storage exec deploy/rook-ceph-tools -- ceph osd tree | grep down

# Check OSD logs
kubectl logs -n storage -l ceph_daemon_id=0 | tail -50

# Restart failed OSD
kubectl -n storage delete pod -l ceph_daemon_id=0

# If restart doesn't help, mark the OSD out, wait for recovery to finish, then mark it back in
kubectl -n storage exec deploy/rook-ceph-tools -- ceph osd out 0
kubectl -n storage exec deploy/rook-ceph-tools -- ceph osd in 0

Placement Group Issues

# Identify stuck PGs
kubectl -n storage exec deploy/rook-ceph-tools -- ceph pg dump_stuck

# Check specific PG details
kubectl -n storage exec deploy/rook-ceph-tools -- ceph pg 1.0 query

# Force recovery if needed
kubectl -n storage exec deploy/rook-ceph-tools -- ceph pg force-recovery 1.0

Monitor Issues

# Check monitor quorum
kubectl -n storage exec deploy/rook-ceph-tools -- ceph mon stat

# View monitor details and identify issues
kubectl -n storage exec deploy/rook-ceph-tools -- ceph mon dump

# Restart problematic monitor
kubectl -n storage rollout restart deployment rook-ceph-mon-a

# Check monitor logs
kubectl logs -n storage deploy/rook-ceph-mon-a

CSI Driver Issues

# Check CSI driver pods
kubectl get pods -n storage -l app=csi-rbdplugin

# View CSI driver logs
kubectl logs -n storage -l app=csi-rbdplugin -c csi-rbdplugin

# Check volume attachment issues
kubectl describe volumeattachment

# Restart CSI components if needed
kubectl -n storage rollout restart daemonset csi-rbdplugin
kubectl -n storage rollout restart deployment csi-rbdplugin-provisioner

Emergency Procedures

Cluster Recovery

# If the cluster is completely down, make sure the operator is running; it will bring the monitors back up
kubectl -n storage scale deployment rook-ceph-operator --replicas=1

# Wait for operator to stabilize, then check cluster
kubectl -n storage exec deploy/rook-ceph-tools -- ceph status

# If toolbox is unavailable, access from operator
kubectl -n storage exec deploy/rook-ceph-operator -- ceph status --connect-timeout 10

Data Recovery

# If data corruption is suspected
kubectl -n storage exec deploy/rook-ceph-tools -- ceph pg repair <pg-id>

# For severe corruption, export/import pool
kubectl -n storage exec deploy/rook-ceph-tools -- rados export -p ceph-blockpool pool-backup.dump
kubectl -n storage exec deploy/rook-ceph-tools -- rados import -p ceph-blockpool-new pool-backup.dump

Disaster Recovery

# Complete cluster rebuild (last resort)
# ⚠️ WARNING: This will DELETE ALL DATA. Ensure backups are verified and accessible.
# ⚠️ CRITICAL: Verify no other clusters share the same storage devices.
# 1. Backup critical data and configurations
# 2. Delete CephCluster resource
kubectl -n storage delete cephcluster rook-ceph --wait

# 3. Clean up node storage (wipe the dataDirHostPath, e.g. /var/lib/rook, and the OSD devices on each node,
#    or set cleanupPolicy on the CephCluster before deletion so the operator runs cleanup jobs)
kubectl -n storage get jobs | grep cleanup

# 4. Redeploy with backed up configuration
kubectl apply -f cluster-backup.yaml

Automation Scripts

Health Check Script

// Example monitoring script structure
export async function checkStorageHealth() {
  const checks = [
    checkClusterHealth(),
    checkOSDStatus(),
    checkPGHealth(),
    checkCapacity(),
    checkPerformance(),
  ];

  const results = await Promise.all(checks);
  return generateReport(results);
}
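
As a concrete example, one of those checks could shell out to the toolbox and parse Ceph's JSON output. A minimal sketch, assuming Deno; the helper name and result shape are illustrative, not the actual implementation in scripts/storage-health-check.ts:

// A sketch of one check, not the actual implementation.
interface CheckResult {
  name: string;
  ok: boolean;
  detail: string;
}

// Run an arbitrary `ceph` command through the toolbox and return its stdout.
async function ceph(...args: string[]): Promise<string> {
  const { stdout } = await new Deno.Command("kubectl", {
    args: ["-n", "storage", "exec", "deploy/rook-ceph-tools", "--", "ceph", ...args],
  }).output();
  return new TextDecoder().decode(stdout);
}

export async function checkClusterHealth(): Promise<CheckResult> {
  // `ceph health detail --format json` reports HEALTH_OK, HEALTH_WARN, or HEALTH_ERR
  const health = JSON.parse(await ceph("health", "detail", "--format", "json"));
  return { name: "cluster-health", ok: health.status === "HEALTH_OK", detail: health.status };
}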

Alert Integration

# Example PrometheusRule for storage alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-storage-alerts
spec:
  groups:
    - name: ceph.rules
      rules:
        - alert: CephClusterDown
          expr: ceph_health_status != 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Ceph cluster is not healthy"

        - alert: CephOSDDown
          expr: sum(ceph_osd_up) < count(ceph_osd_up)
          for: 2m
          labels:
            severity: warning
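
Note that the Prometheus Operator only loads PrometheusRule objects matched by its ruleSelector, so depending on how monitoring is deployed the resource may need additional labels and an explicit namespace before the alerts appear.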

These operational procedures ensure the storage system remains healthy, performant, and available while providing clear workflows for both routine maintenance and emergency situations.