
Prometheus Operator & Monitoring Guide

Complete guide for deploying and managing monitoring infrastructure with Prometheus Operator, Grafana, and AlertManager in your k3s-ansible cluster.

Overview

The Prometheus Operator installation includes:

  • Prometheus: Time-series database and scraping engine
  • Grafana: Visualization and dashboarding platform
  • AlertManager: Alert routing and management
  • Node Exporter: Hardware and OS metrics
  • kube-state-metrics: Kubernetes cluster metrics
  • Prometheus Operator: CRD controller for managing Prometheus resources

Architecture

┌─────────────────────────────────────────┐
│         Prometheus Operator             │
│  (CRD Controller - monitoring namespace)│
└─────────────────────────────────────────┘
                    ↓
    ┌───────────────┼───────────────┐
    ↓               ↓               ↓
Prometheus      Grafana        AlertManager
  (9090)         (3000)           (9093)
    ↓               ↓               ↓
    └───────────────┼───────────────┘
                    ↓
        ┌───────────────────────┐
        │  ServiceMonitors      │
        │  PrometheusRules      │
        │  AlertingRules        │
        └───────────────────────┘
                    ↓
    ┌───────────────┼───────────────┐
    ↓               ↓               ↓
 Scrape         Scrape          Scrape
Targets      Targets         Targets

Quick Start

Deploy Everything

# 1. Enable Prometheus Operator in inventory
# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=true
enable_compute_blade_agent=true

# 2. Run Ansible playbook
ansible-playbook site.yml --tags prometheus-operator

# 3. Wait for components to be ready
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n monitoring --timeout=300s

# 4. Access Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# 5. Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Then open http://localhost:9090 (Prometheus) and http://localhost:3000 (Grafana) in your browser.

Disable Prometheus Operator

# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=false

# Re-run playbook (Prometheus stack won't be installed)
ansible-playbook site.yml --tags prometheus-operator

Installation

Prerequisites

  • K3s cluster already deployed with k3s-ansible
  • kubectl access to the cluster
  • Helm 3.x installed on the control machine

Step-by-Step Installation

1. Configure Inventory

Edit inventory/hosts.ini:

[k3s_cluster:vars]
# Enable Prometheus Operator
enable_prometheus_operator=true

# (Optional) Set Grafana admin password
grafana_admin_password=SecurePassword123!

# Enable compute-blade-agent monitoring
enable_compute_blade_agent=true

2. Run the Playbook

# Install only Prometheus Operator
ansible-playbook site.yml --tags prometheus-operator

# Or deploy everything including K3s
ansible-playbook site.yml

3. Verify Installation

# Check if monitoring namespace exists
kubectl get namespace monitoring

# Check Prometheus Operator deployment
kubectl get deployment -n monitoring

# Check all monitoring components
kubectl get all -n monitoring

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod --all -n monitoring --timeout=300s

Expected output:

NAME                                                    READY   STATUS    RESTARTS   AGE
pod/prometheus-operator-5f8d4b5c7d-x9k2l               1/1     Running   0          2m
pod/prometheus-kube-prometheus-prometheus-0            2/2     Running   0          1m
pod/prometheus-kube-state-metrics-7c9d5f8c4-m2k9n     1/1     Running   0          2m
pod/prometheus-node-exporter-pz8kl                     1/1     Running   0          2m
pod/prometheus-grafana-5f8d7b5c9e-z1q3x                3/3     Running   0          1m
pod/prometheus-kube-alertmanager-0                     2/2     Running   0          1m

Configuration

Environment Variables

Configure via inventory/hosts.ini:

[k3s_cluster:vars]
# Enable/disable monitoring stack
enable_prometheus_operator=true

# Grafana configuration
grafana_admin_password=SecurePassword123!
grafana_admin_user=admin
grafana_storage_size=5Gi

# Prometheus configuration
prometheus_retention_days=7
prometheus_storage_size=10Gi
prometheus_scrape_interval=30s
prometheus_scrape_timeout=10s

# AlertManager configuration
alertmanager_storage_size=5Gi

# Component flags
enable_grafana=true
enable_alertmanager=true
enable_prometheus_node_exporter=true
enable_kube_state_metrics=true

Per-Node Configuration

To restrict Prometheus to specific nodes:

[k3s_cluster:vars]
prometheus_node_selector={"node-type": "monitoring"}
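Assuming the role passes this value straight through to the Prometheus custom resource (an assumption about this role's templates), the rendered spec would contain a standard nodeSelector block:

```yaml
# Hypothetical rendering of prometheus_node_selector in the Prometheus CR
spec:
  nodeSelector:
    node-type: monitoring
```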

Or via inventory host vars:

[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi enable_prometheus_operator=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi enable_prometheus_operator=false

Resource Limits

Control resource usage in inventory/hosts.ini:

prometheus_cpu_request=250m
prometheus_cpu_limit=500m
prometheus_memory_request=512Mi
prometheus_memory_limit=1Gi

grafana_cpu_request=100m
grafana_cpu_limit=200m
grafana_memory_request=256Mi
grafana_memory_limit=512Mi
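These inventory values presumably render into ordinary Kubernetes resources blocks on the Prometheus and Grafana containers (an assumption about the role's templating), e.g. for Prometheus:

```yaml
# Hypothetical rendered resources block for the Prometheus container
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi
```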

Accessing Components

Prometheus Web UI

# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Access at: http://localhost:9090

Available from within cluster:

http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090

Features:

  • Query builder
  • Target health status
  • Alert rules
  • Service discovery
  • Graph visualization
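A couple of queries worth trying in the expression browser, using standard metric names from node-exporter and kube-state-metrics:

```promql
# Per-node CPU utilisation (percent busy), from node-exporter
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Pods per namespace that are not Running, from kube-state-metrics
sum by (namespace) (kube_pod_status_phase{phase!="Running"})
```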

Grafana Dashboards

# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Access at: http://localhost:3000
# Default credentials: admin / admin (or your configured password)

Available from within cluster:

http://prometheus-grafana.monitoring.svc.cluster.local:80

Pre-installed Dashboards:

  1. Kubernetes / Cluster Monitoring
  2. Kubernetes / Nodes
  3. Kubernetes / Pods
  4. Kubernetes / Deployments Statefulsets Daemonsets
  5. Node Exporter for Prometheus Dashboard

Custom Dashboards:

  • Import from grafana.com
  • Create custom dashboards
  • Connect to Prometheus data source

AlertManager

# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093

# Access at: http://localhost:9093

Available from within cluster:

http://prometheus-kube-alertmanager.monitoring.svc.cluster.local:9093

Features:

  • Alert grouping and deduplication
  • Alert routing rules
  • Notification management
  • Silence alerts

Verify Network Connectivity

# Test from within the cluster
kubectl run debug --image=busybox -it --rm -- sh

# Inside the pod:
wget -O- http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090/-/healthy
wget -O- http://prometheus-grafana.monitoring.svc:80/api/health
wget -O- http://prometheus-kube-alertmanager.monitoring.svc:9093/-/healthy

Monitoring compute-blade-agent

Automatic Integration

When both enable_prometheus_operator and enable_compute_blade_agent are set, the Ansible role automatically:

  1. Creates the compute-blade-agent namespace
  2. Deploys ServiceMonitor for metrics scraping
  3. Deploys PrometheusRule for alerting
  4. Configures Prometheus scrape targets

Verify compute-blade-agent Monitoring

# Check if ServiceMonitor is created
kubectl get servicemonitor -n compute-blade-agent

# Check if metrics are being scraped
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Then in Prometheus UI:
# 1. Go to Status → Targets
# 2. Look for "compute-blade-agent" targets
# 3. Should show "UP" status

Available Metrics

# Temperature monitoring
compute_blade_temperature_celsius

# Fan monitoring
compute_blade_fan_rpm
compute_blade_fan_speed_percent

# Power monitoring
compute_blade_power_watts

# Status indicators
compute_blade_status
compute_blade_led_state
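Using the metric names above, some example PromQL to paste into the Prometheus UI or a Grafana panel:

```promql
# Hottest blade right now
max by (instance) (compute_blade_temperature_celsius)

# Blades whose fan has stopped
compute_blade_fan_rpm == 0

# 10-minute average power draw per blade
avg_over_time(compute_blade_power_watts[10m])
```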

Create Custom Dashboard for compute-blade-agent

In Grafana:

  1. Create new dashboard
  2. Add panel with query:
    compute_blade_temperature_celsius{job="compute-blade-agent"}
    
  3. Set visualization type to "Gauge" or "Graph"
  4. Save dashboard
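The panel built in the steps above corresponds roughly to this fragment of Grafana's dashboard JSON model (simplified sketch; only the core fields are shown):

```json
{
  "title": "Blade temperature",
  "type": "gauge",
  "datasource": "Prometheus",
  "targets": [
    { "expr": "compute_blade_temperature_celsius{job=\"compute-blade-agent\"}" }
  ]
}
```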

Custom ServiceMonitors

Create a ServiceMonitor

Create custom-servicemonitor.yml:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitoring
  namespace: my-app
  labels:
    app: my-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s

Deploy it:

kubectl apply -f custom-servicemonitor.yml
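Note that a ServiceMonitor selects Services, not pods directly: for the example above to scrape anything, a Service with matching labels and a port named metrics must exist. A hypothetical companion Service (the names and port number are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app
  labels:
    app: my-app        # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: my-app        # selects the application pods
  ports:
    - name: metrics    # must match the endpoint port name in the ServiceMonitor
      port: 8080
      targetPort: 8080
```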

Create a PrometheusRule

Create custom-alerts.yml:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: my-app.rules
      interval: 30s
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            rate(my_app_errors_total[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
            app: my-app
          annotations:
            summary: "High error rate in my-app"
            description: "Error rate is {{ $value }} on {{ $labels.instance }}"

        - alert: MyAppDown
          expr: |
            up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
            app: my-app
          annotations:
            summary: "my-app is down"
            description: "my-app on {{ $labels.instance }} is unreachable"

Deploy it:

kubectl apply -f custom-alerts.yml

Alerting

Pre-configured Alerts for compute-blade-agent

Automatically deployed when compute-blade-agent monitoring is enabled:

  1. ComputeBladeAgentHighTemperature (Warning)

    • Triggers when temp > 80°C for 5 minutes
  2. ComputeBladeAgentCriticalTemperature (Critical)

    • Triggers when temp > 95°C for 2 minutes
  3. ComputeBladeAgentDown (Critical)

    • Triggers when agent unreachable for 5 minutes
  4. ComputeBladeAgentFanFailure (Warning)

    • Triggers when fan RPM = 0 for 5 minutes
  5. ComputeBladeAgentHighFanSpeed (Warning)

    • Triggers when fan speed > 90% for 10 minutes

View Active Alerts

# In Prometheus UI:
# 1. Go to Alerts
# 2. See all active and pending alerts

# Or query:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then visit: http://localhost:9090/alerts

Configure AlertManager Routing

Edit AlertManager configuration:

# Get current config
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Edit configuration
kubectl edit secret -n monitoring alertmanager-kube-prometheus-alertmanager

Example routing configuration:

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 30m
      
    - match:
        severity: warning
      receiver: 'default'
      group_wait: 1m

receivers:
  - name: 'default'
    # Add email, slack, webhook, etc.
    
  - name: 'critical'
    # Add urgent notifications
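As a concrete (hypothetical) example, a Slack receiver for the critical route could look like this; the webhook URL and channel are placeholders:

```yaml
receivers:
  - name: 'critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook URL
        channel: '#oncall-alerts'
        send_resolved: true   # also notify when the alert clears
```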

Grafana Dashboards

Import Pre-built Dashboards

  1. Open Grafana: http://localhost:3000
  2. Click "+" → Import
  3. Enter dashboard ID from grafana.com
  4. Select Prometheus data source
  5. Click Import

Recommended Dashboards:

ID      Name
1860    Node Exporter for Prometheus Dashboard
6417    Kubernetes Cluster Monitoring
8588    Kubernetes Deployment Statefulset Daemonset
11074   Node Exporter - Nodes
12114   Kubernetes cluster monitoring

Create Custom Dashboard

  1. Click "+" → Dashboard
  2. Click "Add new panel"
  3. Configure query:
    • Data source: Prometheus
    • Query: up{job="compute-blade-agent"}
  4. Set visualization (Graph, Gauge, Table, etc.)
  5. Click Save

Export Dashboard

# Get dashboard JSON by uid (the uid appears in the dashboard's URL)
curl -s http://admin:password@localhost:3000/api/dashboards/uid/<uid> > my-dashboard.json

# Import elsewhere (wrap in the expected payload and reset the id so Grafana creates a new dashboard)
jq '{dashboard: .dashboard, overwrite: true} | .dashboard.id = null' my-dashboard.json | \
  curl -X POST -H "Content-Type: application/json" -d @- \
  http://admin:password@localhost:3000/api/dashboards/db

Troubleshooting

Prometheus Not Scraping Targets

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Visit: http://localhost:9090/targets
# Look for failed targets

# Check ServiceMonitor
kubectl get servicemonitor --all-namespaces

# Check Prometheus config
kubectl get prometheus -n monitoring -o yaml

# View Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=50 -f

Grafana Data Source Not Working

# Check Prometheus connectivity from Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# In Grafana:
# 1. Configuration → Data Sources
# 2. Click Prometheus
# 3. Check Status (green = working)
# 4. If red, check URL and credentials

# Or check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana --tail=50 -f

AlertManager Not Sending Notifications

# Check AlertManager configuration
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Restart AlertManager to apply changes
kubectl rollout restart statefulset -n monitoring prometheus-kube-alertmanager

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 -f

# Test webhook
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093
# Visit: http://localhost:9093 to see alerts

Disk Space Issues

# Check Prometheus PVC usage
kubectl get pvc -n monitoring

# View disk usage
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- df -h /prometheus

# Increase storage (requires a StorageClass with allowVolumeExpansion: true)
kubectl patch pvc -n monitoring prometheus-kube-prometheus-prometheus-db-prometheus-kube-prometheus-prometheus-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

High Memory/CPU Usage

# Check resource usage
kubectl top pod -n monitoring

# Reduce retention period
kubectl edit prometheus -n monitoring

# Update in spec:
# retention: 3d  # Reduce from 7d to 3d

# Or scrape less frequently (also under spec:)
# scrapeInterval: 60s  # Increase from 30s to 60s

ServiceMonitor Not Being Picked Up

# Check if labels match
kubectl get servicemonitor --all-namespaces -o yaml | grep -A5 "release: prometheus"

# Prometheus selector config
kubectl get prometheus -n monitoring -o yaml | grep -A5 "serviceMonitorSelector"

# Restart Prometheus if config changed
kubectl rollout restart statefulset -n monitoring prometheus-kube-prometheus-prometheus

# Check Prometheus logs for errors
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0 -c prometheus --tail=100 | grep -i "error\|failed"

Maintenance

Backup Grafana Dashboards

# Export all dashboards by uid via the HTTP API
for uid in $(curl -s "http://admin:password@localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s http://admin:password@localhost:3000/api/dashboards/uid/$uid > ${uid}.json
done

Update Prometheus Retention

# Edit Prometheus resource
kubectl edit prometheus -n monitoring

# Update retention field
spec:
  retention: "30d"  # Change from 7d to 30d

# Changes apply automatically

Scale Prometheus Resources

# For high-load environments, increase resources
# (custom resources require --type merge; verify the CR name first with: kubectl get prometheus -n monitoring)
kubectl patch prometheus -n monitoring kube-prometheus --type merge -p '{"spec":{"resources":{"requests":{"cpu":"500m","memory":"1Gi"},"limits":{"cpu":"1000m","memory":"2Gi"}}}}'

Best Practices

  1. Security

    • Change default Grafana password immediately
    • Restrict AlertManager webhook URLs to known services
    • Use network policies to limit access
  2. Performance

    • Monitor Prometheus disk usage regularly
    • Adjust scrape intervals based on needs
    • Use recording rules for complex queries
  3. Reliability

    • Enable persistent storage (PVC)
    • Configure alert routing and escalation
    • Regular backup of Grafana dashboards and configs
  4. Organization

    • Label all ServiceMonitors with release: prometheus
    • Use consistent naming conventions
    • Document custom dashboards and alerts
  5. Cost Optimization

    • Remove unused scrape targets
    • Tune scrape intervals (don't scrape more than needed)
    • Set appropriate retention periods
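As a sketch of the recording-rule practice above (rule and metric names are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-recording-rules
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: node.rules
      rules:
        # Precompute per-node CPU utilisation so dashboards query a cheap series
        - record: instance:node_cpu_utilisation:rate5m
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```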
