Observability Stack Installation Guide
Complete guide for deploying a production-grade observability stack (Prometheus, Loki, Alloy, Grafana) on Talos Kubernetes using GitOps.
Overview
This guide installs a comprehensive observability stack providing metrics collection, log aggregation, and unified visualization through Kubernetes operators.
Stack Components:
| Component | Version | Purpose | Operator |
|---|---|---|---|
| Prometheus | v0.86.1 | Metrics collection, storage, alerting | prometheus-operator/prometheus-operator |
| Loki | v3.4 | Log aggregation and storage | grafana/loki (Helm chart) |
| Alloy | operator chart 0.13.0 | Telemetry collector (replaces Grafana Agent) | grafana/alloy-operator |
| Grafana | v5.20.0 | Unified visualization and dashboards | grafana/grafana-operator |
Why This Stack:
- Unified observability: Metrics (Prometheus) + Logs (Loki) in single pane (Grafana)
- Kubernetes-native: Declarative CRDs for all components
- GitOps-ready: Flux-managed HelmReleases for version control
- Scalable: Microservices mode for production workloads
- Cost-effective: Self-hosted alternative to cloud observability platforms
Architecture: Alloy runs as a DaemonSet on every node, scraping metrics and collecting pod logs; metrics are remote-written to Prometheus, logs are pushed to Loki, and Grafana queries both as datasources.
Prerequisites
Cluster Requirements
| Requirement | Value |
|---|---|
| Kubernetes | v1.29+ (Talos Linux) |
| Nodes | 3+ control plane nodes |
| Storage | Rook Ceph or equivalent (dynamic PVC provisioning) |
| CPU (total) | 4+ cores recommended |
| Memory (total) | 8Gi+ recommended |
| Flux CD | Installed and syncing |
Resource Allocation Guidelines
| Component | CPU Request | Memory Request | Storage |
|---|---|---|---|
| Prometheus | 500m | 2Gi | 50Gi (time-series data) |
| Loki | 500m | 1Gi | 50Gi (log data) |
| Alloy | 200m/node | 512Mi/node | Ephemeral |
| Grafana | 250m | 512Mi | 10Gi (dashboards/plugins) |
Total estimated: 2-3 cores, 8-10Gi memory, 110Gi storage
Software Requirements
- ✓ Flux CD installed and syncing from Git
- ✓ SOPS configured for secret encryption
- ✓ Rook Ceph storage class configured (or equivalent)
- ✓ Talos cluster fully operational
Deployment Plan
Phase 1: Metrics Backend (Prometheus)
Why first: Foundation for cluster monitoring; other components depend on metrics.
Steps:
- Deploy Prometheus Operator via HelmRelease
- Create Prometheus CRD instance
- Configure ServiceMonitor for auto-discovery
- Verify metrics scraping
Expected outcome: Prometheus collecting metrics from kube-state-metrics, node-exporter, and API server.
Phase 2: Log Backend (Loki)
Why second: Independent of Prometheus; provides log storage backend.
Deployment mode: Monolithic (simpler) or Microservices (production HA)
Steps:
- Deploy Loki via HelmRelease (grafana/loki chart)
- Choose a deployment mode (SingleBinary to start)
- Configure the storage backend (filesystem initially; S3/Ceph for production)
- Verify Loki readiness
Expected outcome: Loki ready to receive logs from Alloy.
Phase 3: Telemetry Collector (Alloy)
Why third: Requires both Prometheus and Loki backends to send data to.
Important: Alloy replaces deprecated Grafana Agent (EOL Nov 1, 2025) and Promtail (deprecated in Loki 3.0).
Steps:
- Deploy Alloy Operator
- Deploy Alloy DaemonSet with configuration
- Configure pipelines:
- Kubernetes pod logs → Loki
- Scraped pod metrics → Prometheus
- Verify data flow
Expected outcome: Logs flowing to Loki, metrics to Prometheus.
Phase 4: Visualization (Grafana)
Why last: Consumes data from Prometheus and Loki.
Steps:
- Deploy Grafana Operator v5
- Create Grafana instance
- Configure data sources:
- Prometheus (metrics)
- Loki (logs)
- Import dashboards:
- Kubernetes cluster overview
- Node exporter metrics
- Loki log explorer
- Configure ingress/gateway for access
Expected outcome: Unified dashboards showing metrics + logs.
Installation Steps
Step 1: Prometheus Operator
Documentation: https://prometheus-operator.dev/
Create namespace structure:
mkdir -p kubernetes/apps/observability/prometheus-operator
cd kubernetes/apps/observability/prometheus-operator
Standard 3-file pattern:
prometheus-operator/
  app/
    helmrelease.yaml    # Prometheus Operator v0.86.1
    kustomization.yaml
  ks.yaml               # Flux Kustomization (sketch below)
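The ks.yaml above is the Flux Kustomization that points Flux at the app/ directory. A minimal sketch, assuming a typical Flux layout (the path, interval, and GitRepository name are assumptions to adapt to your repository):
# ks.yaml -- sketch; adjust path, interval, and sourceRef to your layout
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prometheus-operator
  namespace: flux-system
spec:
  interval: 30m
  path: ./kubernetes/apps/observability/prometheus-operator/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  targetNamespace: observability # namespace must exist (or be created in app/)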
Key configuration:
# helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: prometheus-operator
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: 67.4.0 # Includes Prometheus Operator v0.86.1
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  values:
    prometheus:
      prometheusSpec:
        retention: 30d
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-block
              resources:
                requests:
                  storage: 50Gi
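The sourceRef assumes a HelmRepository named prometheus-community already exists in flux-system. If it does not, a minimal definition looks like the following; the grafana charts used in later steps need an analogous HelmRepository pointing at https://grafana.github.io/helm-charts:
# helmrepository.yaml (flux-system)
apiVersion: source.toolkit.fluxcd.io/v1 # use v1beta2 on older Flux releases
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: flux-system
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts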
Deploy:
task configure
git add kubernetes/apps/observability/prometheus-operator
git commit -m "feat: add prometheus-operator v0.86.1"
git push
task reconcile
Verify:
# Watch deployment
watch kubectl get pods -n observability
# Expected pods:
# - prometheus-operator-kube-prometheus-operator-*
# - prometheus-prometheus-operator-kube-prometheus-prometheus-0
# - alertmanager-prometheus-operator-kube-prometheus-alertmanager-0
# - prometheus-operator-kube-state-metrics-*
# - prometheus-operator-prometheus-node-exporter-* (DaemonSet)
# Check Prometheus targets
kubectl port-forward -n observability svc/prometheus-operator-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets
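The kube-prometheus-stack chart ships ServiceMonitors for its own components (node-exporter, kube-state-metrics, API server). To scrape an additional application, a ServiceMonitor along these lines can be added; the app name, namespace, port, and labels here are placeholders:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: observability
  labels:
    release: prometheus-operator # chart defaults may only select monitors labeled with the Helm release name
spec:
  selector:
    matchLabels:
      app: my-app # must match the target Service's labels
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http # named port on the Service
      path: /metrics
      interval: 30s
If a target never appears under /targets, the selector labels are the first thing to check (see Troubleshooting).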
Step 2: Loki Operator
Documentation: https://loki-operator.dev/
Create namespace structure:
mkdir -p kubernetes/apps/observability/loki-operator
cd kubernetes/apps/observability/loki-operator
Deployment considerations:
| Mode | Use Case | Complexity | HA | Storage |
|---|---|---|---|---|
| Monolithic | Small clusters, dev | Low | No | Single instance |
| Microservices | Production, scale | High | Yes | Distributed |
Recommended: Start with monolithic, migrate to microservices if needed.
Key configuration:
# helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: loki
spec:
  interval: 30m
  chart:
    spec:
      chart: loki
      version: 6.26.0 # Loki v3.4
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    deploymentMode: SingleBinary # Monolithic mode
    loki:
      commonConfig:
        replication_factor: 1
      storage:
        type: filesystem
      schemaConfig:
        configs:
          - from: 2024-04-01
            store: tsdb
            object_store: filesystem
            schema: v13
    singleBinary:
      replicas: 1
      persistence:
        enabled: true
        storageClass: ceph-block
        size: 50Gi
For production, the simple scalable mode splits the read and write paths (full microservices mode uses deploymentMode: Distributed with per-component values):
deploymentMode: SimpleScalable
write:
  replicas: 3
  persistence:
    storageClass: ceph-block
    size: 50Gi
read:
  replicas: 3
backend:
  replicas: 3
# Note: scale-out modes require object storage rather than filesystem (see "Enable Long-Term Storage" below)
Deploy and verify:
task configure
git add kubernetes/apps/observability/loki-operator
git commit -m "feat: add loki v3.4 (monolithic mode)"
git push
task reconcile
# Verify
kubectl get pods -n observability -l app.kubernetes.io/name=loki
kubectl logs -n observability -l app.kubernetes.io/name=loki --tail=50
Step 3: Alloy Operator
Documentation: https://grafana.com/docs/alloy/latest/
Important migration notes:
- Alloy replaces Grafana Agent (EOL Nov 1, 2025)
- Alloy replaces Promtail (deprecated in Loki 3.0)
- Use Alloy for all new deployments
Create namespace structure:
mkdir -p kubernetes/apps/observability/alloy-operator
cd kubernetes/apps/observability/alloy-operator
Key configuration:
# helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: alloy-operator
spec:
  interval: 30m
  chart:
    spec:
      chart: alloy-operator
      version: 0.13.0
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
---
# alloy-instance.yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: Alloy
metadata:
  name: alloy-metrics-logs
  namespace: observability
spec:
  config: |
    // Discover pods via the Kubernetes API
    discovery.kubernetes "pods" {
      role = "pod"
    }
    // Scrape Kubernetes pod metrics
    prometheus.scrape "pods" {
      targets    = discovery.kubernetes.pods.targets
      forward_to = [prometheus.remote_write.prometheus.receiver]
    }
    // Collect Kubernetes pod logs
    loki.source.kubernetes "pods" {
      targets    = discovery.kubernetes.pods.targets
      forward_to = [loki.write.loki.receiver]
    }
    // Send metrics to Prometheus (requires the remote-write receiver; see note below)
    prometheus.remote_write "prometheus" {
      endpoint {
        url = "http://prometheus-operator-kube-prometheus-prometheus.observability.svc:9090/api/v1/write"
      }
    }
    // Send logs to Loki
    loki.write "loki" {
      endpoint {
        url = "http://loki-gateway.observability.svc:80/loki/api/v1/push"
      }
    }
  clustering:
    enabled: true
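One assumption baked into the prometheus.remote_write URL above: Prometheus only accepts pushes on /api/v1/write when its remote-write receiver is enabled, which kube-prometheus-stack does not do by default. A small addition to the Step 1 HelmRelease values covers this:
# Add to the kube-prometheus-stack values from Step 1
prometheus:
  prometheusSpec:
    enableRemoteWriteReceiver: true # exposes /api/v1/write for Alloy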
Deploy and verify:
task configure
git add kubernetes/apps/observability/alloy-operator
git commit -m "feat: add alloy-operator v3.0 for telemetry collection"
git push
task reconcile
# Verify collectors
kubectl get pods -n observability -l app.kubernetes.io/name=alloy
kubectl logs -n observability -l app.kubernetes.io/name=alloy --tail=50
Step 4: Grafana Operator
Documentation: https://grafana.com/docs/grafana-cloud/developer-resources/infrastructure-as-code/grafana-operator/
Create namespace structure:
mkdir -p kubernetes/apps/observability/grafana-operator
cd kubernetes/apps/observability/grafana-operator
Key configuration:
# helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: grafana-operator
spec:
  interval: 30m
  chart:
    spec:
      chart: grafana-operator
      version: v5.20.0
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
---
# grafana-instance.yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: observability
  labels:
    dashboards: grafana # matched by the instanceSelector on the datasources/dashboards below
spec:
  config:
    server:
      root_url: "https://grafana.${CLOUDFLARE_DOMAIN}"
    security:
      admin_user: admin
      admin_password: ${GRAFANA_ADMIN_PASSWORD} # From SOPS secret
  deployment:
    spec:
      replicas: 1
  persistentVolumeClaim:
    spec:
      storageClassName: ceph-block
      resources:
        requests:
          storage: 10Gi
---
# datasource-prometheus.yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Prometheus
    type: prometheus
    url: http://prometheus-operator-kube-prometheus-prometheus.observability.svc:9090
    isDefault: true
    access: proxy
---
# datasource-loki.yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: loki
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Loki
    type: loki
    url: http://loki-gateway.observability.svc:80
    access: proxy
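The ${GRAFANA_ADMIN_PASSWORD} and ${CLOUDFLARE_DOMAIN} placeholders assume Flux post-build variable substitution from a SOPS-encrypted Secret. A rough sketch of that wiring (the secret name and keys are assumptions; encrypt the file with sops before committing):
# cluster-secrets.sops.yaml -- encrypt with sops before committing
apiVersion: v1
kind: Secret
metadata:
  name: cluster-secrets
  namespace: flux-system
stringData:
  GRAFANA_ADMIN_PASSWORD: change-me
  CLOUDFLARE_DOMAIN: example.com

# referenced from the Flux Kustomization (ks.yaml) spec:
postBuild:
  substituteFrom:
    - kind: Secret
      name: cluster-secrets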
Deploy and verify:
task configure
git add kubernetes/apps/observability/grafana-operator
git commit -m "feat: add grafana-operator v5.20.0 with datasources"
git push
task reconcile
# Verify
kubectl get pods -n observability -l app=grafana
kubectl get grafanadatasources -n observability
# Access Grafana
kubectl port-forward -n observability svc/grafana-service 3000:3000
# Open: http://localhost:3000
Post-Deployment Validation
Health Checks
Prometheus:
# Check targets
kubectl port-forward -n observability svc/prometheus-operator-kube-prometheus-prometheus 9090:9090
# Visit: http://localhost:9090/targets (all should be "UP")
# Query metrics
curl -s http://localhost:9090/api/v1/query?query=up | jq
Loki:
# Check health
kubectl exec -n observability -it loki-0 -- wget -qO- http://localhost:3100/ready
# Expected: "ready"
# Query logs
kubectl port-forward -n observability svc/loki-gateway 3100:80
curl -s "http://localhost:3100/loki/api/v1/query?query={namespace=\"observability\"}" | jq
Alloy:
# Check pipeline status
kubectl logs -n observability -l app.kubernetes.io/name=alloy --tail=100 | grep -E "(scrape|push)"
# Look for successful remote_write pushes and active scrape targets; repeated errors indicate a broken pipeline
Grafana:
# Check datasource health
kubectl exec -n observability -it deploy/grafana -- \
curl -s http://admin:${ADMIN_PASSWORD}@localhost:3000/api/datasources
# Both Prometheus and Loki should be listed; verify health in the Grafana UI
# (Connections > Data sources) or via /api/datasources/uid/<uid>/health
Data Flow Verification
Test end-to-end:
# 1. Deploy a test app (generates logs; note plain nginx exposes no /metrics endpoint)
kubectl create deployment nginx --image=nginx:latest -n default
kubectl expose deployment nginx --port=80 -n default
# 2. Check metrics in Prometheus
# Query: up{job="kubernetes-pods", namespace="default"}
# 3. Check logs in Loki
# Query: {namespace="default", app="nginx"}
# 4. View in Grafana
# Create dashboard with both queries
Next Steps
1. Import Essential Dashboards
Kubernetes cluster monitoring:
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-cluster
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  grafanaCom:
    id: 7249 # Kubernetes Cluster Monitoring (grafana.com dashboard ID)
Recommended dashboards:
- 15759: Kubernetes / Views / Global
- 15760: Kubernetes / Views / Namespaces
- 15761: Kubernetes / Views / Pods
- 12019: Loki / Logs Explorer
- 19268: Node Exporter Full
2. Configure Alerting
Create PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-alerts
  namespace: observability
spec:
  groups:
    - name: cluster
      interval: 30s
      rules:
        - alert: HighMemoryUsage
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has high memory usage"
Configure Alertmanager:
# Add to prometheus-operator HelmRelease values
alertmanager:
  config:
    receivers:
      - name: 'slack'
        slack_configs:
          - api_url: ${SLACK_WEBHOOK_URL}
            channel: '#alerts'
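A receiver alone does not deliver anything; Alertmanager also needs a route pointing at it. A minimal sketch to sit alongside the receivers above (grouping and timing values are illustrative):
alertmanager:
  config:
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      repeat_interval: 12h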
3. Enable Long-Term Storage
Prometheus (Thanos):
# Add to prometheus-operator values
thanos:
  enabled: true
  objectStorageConfig:
    bucket: prometheus-metrics
    endpoint: rook-ceph-rgw.storage.svc:80
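The bucket and endpoint above are illustrative; the operator typically consumes Thanos object-storage settings from a Secret holding an objstore.yml, referenced via objectStorageConfig. A sketch, assuming a Rook Ceph RGW user supplies the keys:
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore
  namespace: observability
stringData:
  objstore.yml: |
    type: S3
    config:
      bucket: prometheus-metrics
      endpoint: rook-ceph-rgw.storage.svc:80
      access_key: <ceph-object-user-access-key>
      secret_key: <ceph-object-user-secret-key>
      insecure: true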
Loki (Ceph S3):
# Migrate from filesystem to object storage
loki:
  storage:
    type: s3
    s3:
      endpoint: rook-ceph-rgw.storage.svc:80
      bucketNames: loki-chunks
      accessKeyId: ${S3_ACCESS_KEY}
      secretAccessKey: ${S3_SECRET_KEY}
4. Expose Grafana via Gateway
Create HTTPRoute:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: observability
spec:
  parentRefs:
    - name: envoy-external
      namespace: network
  hostnames:
    - grafana.${CLOUDFLARE_DOMAIN}
  rules:
    - backendRefs:
        - name: grafana-service
          port: 3000
Update Grafana config:
config:
  server:
    root_url: "https://grafana.${CLOUDFLARE_DOMAIN}"
    serve_from_sub_path: false
5. Optimize Performance
Prometheus query performance:
- Enable query caching
- Configure recording rules for expensive queries (see the sketch after this list)
- Tune retention: 30d for recent, Thanos for long-term
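For the recording-rules item above, a small PrometheusRule sketch (the rule name and expression are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: observability
spec:
  groups:
    - name: node.rules
      interval: 1m
      rules:
        - record: node:memory_utilisation:ratio
          expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)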
Loki query performance:
- Use indexed labels sparingly (namespace, app, pod)
- Avoid high-cardinality labels (user_id, request_id)
- Configure compaction for storage efficiency
Alloy resource tuning:
resources:
  limits:
    memory: 1Gi # Increase if high log volume
  requests:
    cpu: 200m
    memory: 512Mi
6. Security Hardening
Enable authentication:
# Grafana OAuth via Cloudflare Access
config:
  auth.generic_oauth:
    enabled: true
    name: Cloudflare Access
    client_id: ${OAUTH_CLIENT_ID}
    client_secret: ${OAUTH_CLIENT_SECRET}
Network policies:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: prometheus
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: observability
      ports:
        - port: 9090
7. Capacity Planning
Monitor resource usage:
# Prometheus disk usage
kubectl exec -n observability prometheus-prometheus-operator-kube-prometheus-prometheus-0 -- \
df -h /prometheus
# Loki disk usage
kubectl exec -n observability loki-0 -- df -h /var/loki
Set up capacity alerts:
- alert: PrometheusStorageFull
  # Requires a size-based retention limit (retentionSize) to be configured on Prometheus
  expr: prometheus_tsdb_storage_blocks_bytes / prometheus_tsdb_retention_limit_bytes > 0.8
  for: 10m
  annotations:
    summary: "Prometheus storage is {{ $value | humanizePercentage }} full"
Troubleshooting
Prometheus Not Scraping Targets
Symptoms: Targets show "down" in Prometheus UI
Diagnosis:
kubectl get servicemonitors -n observability
kubectl describe servicemonitor <name> -n observability
kubectl logs -n observability -l app.kubernetes.io/name=prometheus
Common causes:
- ServiceMonitor label selector mismatch
- Network policy blocking scraping
- Target pod not exposing /metrics
Fix:
# Ensure ServiceMonitor matches service labels
selector:
  matchLabels:
    app: my-app # Must match service labels
Loki Not Receiving Logs
Symptoms: Empty results in Grafana log explorer
Diagnosis:
# Check Alloy logs
kubectl logs -n observability -l app.kubernetes.io/name=alloy | grep -i loki
# Check Loki ingester
kubectl logs -n observability loki-0 | grep -i ingester
# Test Loki API
kubectl exec -n observability loki-0 -- \
  wget -qO- 'http://localhost:3100/loki/api/v1/query?query={namespace="observability"}'
Common causes:
- Alloy config error (check the loki.write block)
- Loki ingester not ready
- Network connectivity issue
Fix:
# Restart Alloy
kubectl rollout restart daemonset -n observability alloy
# Check Loki health
kubectl get pods -n observability -l app.kubernetes.io/name=loki
Grafana Datasource Unhealthy
Symptoms: "Bad Gateway" or "Connection refused" in Grafana
Diagnosis:
# Check datasource config
kubectl get grafanadatasources -n observability -o yaml
# Test connectivity from Grafana pod
kubectl exec -n observability deploy/grafana -- \
curl -v http://prometheus-operator-kube-prometheus-prometheus.observability.svc:9090/api/v1/query?query=up
Common causes:
- Incorrect service URL
- Service not ready
- DNS resolution failure
Fix:
# Use full FQDN
url: http://prometheus-operator-kube-prometheus-prometheus.observability.svc.cluster.local:9090
High Memory Usage
Symptoms: OOMKilled pods, slow queries
Diagnosis:
# Check current usage
kubectl top pods -n observability
# Check metrics
kubectl exec -n observability prometheus-prometheus-operator-kube-prometheus-prometheus-0 -- \
wget -qO- http://localhost:9090/api/v1/status/tsdb
Tuning:
# Increase Prometheus memory
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 4Gi # Increase from 2Gi
    # Reduce retention
    retention: 15d # Reduce from 30d

# Increase Loki memory
singleBinary:
  resources:
    limits:
      memory: 2Gi # Increase from 1Gi
Reference Links
Official Documentation
- Prometheus Operator - v0.86.1
- Grafana Loki - v3.4
- Grafana Alloy - Replacing Agent/Promtail
- Grafana Operator - v5.20.0
Helm Charts
- prometheus-community/kube-prometheus-stack
- grafana/loki
- grafana/alloy-operator
- grafana/grafana-operator
Summary
This guide provides a complete observability stack with:
- ✅ Metrics: Prometheus Operator v0.86.1
- ✅ Logs: Loki v3.4 (monolithic or microservices)
- ✅ Collection: Alloy via alloy-operator chart 0.13.0 (replaces Grafana Agent and Promtail)
- ✅ Visualization: Grafana Operator v5.20.0
Deployment sequence: Prometheus → Loki → Alloy → Grafana
Next actions:
- Deploy Prometheus Operator (Phase 1)
- Verify metrics collection
- Deploy Loki (Phase 2)
- Deploy Alloy collectors (Phase 3)
- Deploy Grafana with datasources (Phase 4)
- Import dashboards and configure alerting
Timeline estimate: 2-4 hours for complete stack deployment with validation.