Add Prometheus Operator role and templates
`PROMETHEUS_MONITORING.md` — new file (+732 lines)
# Prometheus Operator & Monitoring Guide

Complete guide for deploying and managing monitoring infrastructure with Prometheus Operator, Grafana, and AlertManager in your k3s-ansible cluster.

## Table of Contents

- [Overview](#overview)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Configuration](#configuration)
- [Accessing Components](#accessing-components)
- [Monitoring compute-blade-agent](#monitoring-compute-blade-agent)
- [Custom ServiceMonitors](#custom-servicemonitors)
- [Alerting](#alerting)
- [Grafana Dashboards](#grafana-dashboards)
- [Troubleshooting](#troubleshooting)

## Overview

The Prometheus Operator installation includes:

- **Prometheus**: Time-series database and scraping engine
- **Grafana**: Visualization and dashboarding platform
- **AlertManager**: Alert routing and management
- **Node Exporter**: Hardware and OS metrics
- **kube-state-metrics**: Kubernetes cluster metrics
- **Prometheus Operator**: CRD controller for managing Prometheus resources

### Architecture

```
      ┌─────────────────────────────────────────┐
      │           Prometheus Operator           │
      │ (CRD controller, monitoring namespace)  │
      └─────────────────────────────────────────┘
                          ↓
          ┌───────────────┼───────────────┐
          ↓               ↓               ↓
      Prometheus       Grafana      AlertManager
        (9090)          (3000)         (9093)
          ↓               ↓               ↓
          └───────────────┼───────────────┘
                          ↓
              ┌───────────────────────┐
              │    ServiceMonitors    │
              │    PrometheusRules    │
              │     AlertingRules     │
              └───────────────────────┘
                          ↓
          ┌───────────────┼───────────────┐
          ↓               ↓               ↓
        Scrape          Scrape          Scrape
       Targets         Targets         Targets
```
## Quick Start

### Deploy Everything

```bash
# 1. Enable Prometheus Operator in inventory
# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=true
enable_compute_blade_agent=true

# 2. Run Ansible playbook
ansible-playbook site.yml --tags prometheus-operator

# 3. Wait for components to be ready
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n monitoring --timeout=300s

# 4. Access Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# 5. Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# 6. Access AlertManager
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093
```

Then open:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (default: admin/admin)
- AlertManager: http://localhost:9093
### Disable Prometheus Operator

```bash
# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=false

# Re-run playbook (the Prometheus stack won't be installed)
ansible-playbook site.yml --tags prometheus-operator
```

## Installation

### Prerequisites

- K3s cluster already deployed with k3s-ansible
- kubectl access to the cluster
- Helm 3.x installed on the control machine

### Step-by-Step Installation

#### 1. Configure Inventory

Edit `inventory/hosts.ini`:

```ini
[k3s_cluster:vars]
# Enable Prometheus Operator
enable_prometheus_operator=true

# (Optional) Set Grafana admin password
grafana_admin_password=MySecurePassword123!

# Enable compute-blade-agent monitoring
enable_compute_blade_agent=true
```
#### 2. Run the Playbook

```bash
# Install only Prometheus Operator
ansible-playbook site.yml --tags prometheus-operator

# Or deploy everything including K3s
ansible-playbook site.yml
```

#### 3. Verify Installation

```bash
# Check that the monitoring namespace exists
kubectl get namespace monitoring

# Check the Prometheus Operator deployment
kubectl get deployment -n monitoring

# Check all monitoring components
kubectl get all -n monitoring

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod --all -n monitoring --timeout=300s
```

Expected output:

```
NAME                                                READY   STATUS    RESTARTS   AGE
pod/prometheus-operator-5f8d4b5c7d-x9k2l            1/1     Running   0          2m
pod/prometheus-kube-prometheus-prometheus-0         2/2     Running   0          1m
pod/prometheus-kube-state-metrics-7c9d5f8c4-m2k9n   1/1     Running   0          2m
pod/prometheus-node-exporter-pz8kl                  1/1     Running   0          2m
pod/prometheus-grafana-5f8d7b5c9e-z1q3x             3/3     Running   0          1m
pod/prometheus-kube-alertmanager-0                  2/2     Running   0          1m
```
## Configuration

### Environment Variables

Configure via `inventory/hosts.ini`:

```ini
[k3s_cluster:vars]
# Enable/disable monitoring stack
enable_prometheus_operator=true

# Grafana configuration
grafana_admin_password=SecurePassword123!
grafana_admin_user=admin
grafana_storage_size=5Gi

# Prometheus configuration
prometheus_retention_days=7
prometheus_storage_size=10Gi
prometheus_scrape_interval=30s
prometheus_scrape_timeout=10s

# AlertManager configuration
alertmanager_storage_size=5Gi

# Component flags
enable_grafana=true
enable_alertmanager=true
enable_prometheus_node_exporter=true
enable_kube_state_metrics=true
```
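To pick sensible values for `prometheus_retention_days` and `prometheus_storage_size`, a rough back-of-the-envelope estimate helps. The sketch below assumes roughly 2 bytes per compressed sample, which is a common rule of thumb, not a guarantee; the active-series count is whatever your cluster actually exposes:

```python
def estimate_prometheus_disk_bytes(active_series, scrape_interval_s, retention_days,
                                   bytes_per_sample=2.0):
    """Rough TSDB disk estimate: samples ingested over the retention window
    times an assumed ~2 bytes per compressed sample (assumption, not exact)."""
    samples_per_series_per_day = 86400 / scrape_interval_s
    return int(active_series * samples_per_series_per_day * retention_days * bytes_per_sample)

# ~10k active series at a 30s interval for 7 days -> roughly 0.4 GB,
# comfortably inside the 10Gi default above.
print(estimate_prometheus_disk_bytes(10_000, 30, 7))  # 403200000 bytes
```

Doubling the scrape interval halves the estimate, which is why the troubleshooting section below suggests longer intervals when disk or memory gets tight.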
### Per-Node Configuration

To restrict Prometheus to specific nodes:

```ini
[k3s_cluster:vars]
prometheus_node_selector={"node-type": "monitoring"}
```

Or via inventory host vars:

```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi enable_prometheus_operator=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi enable_prometheus_operator=false
```

### Resource Limits

Control resource usage in `inventory/hosts.ini`:

```ini
prometheus_cpu_request=250m
prometheus_cpu_limit=500m
prometheus_memory_request=512Mi
prometheus_memory_limit=1Gi

grafana_cpu_request=100m
grafana_cpu_limit=200m
grafana_memory_request=256Mi
grafana_memory_limit=512Mi
```
## Accessing Components

### Prometheus Web UI

```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Access at: http://localhost:9090
```

**Available from within the cluster:**
```
http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```

**Features:**
- Query builder
- Target health status
- Alert rules
- Service discovery
- Graph visualization

### Grafana Dashboards

```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Access at: http://localhost:3000
# Default credentials: admin / admin (or your configured password)
```

**Available from within the cluster:**
```
http://prometheus-grafana.monitoring.svc.cluster.local:80
```

**Pre-installed Dashboards:**
1. Kubernetes / Cluster Monitoring
2. Kubernetes / Nodes
3. Kubernetes / Pods
4. Kubernetes / Deployments, StatefulSets, DaemonSets
5. Node Exporter for Prometheus Dashboard

**Custom Dashboards:**
- Import from grafana.com
- Create custom dashboards
- Connect to the Prometheus data source

### AlertManager

```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093

# Access at: http://localhost:9093
```

**Available from within the cluster:**
```
http://prometheus-kube-alertmanager.monitoring.svc.cluster.local:9093
```

**Features:**
- Alert grouping and deduplication
- Alert routing rules
- Notification management
- Alert silencing

### Verify Network Connectivity

```bash
# Test from within the cluster
kubectl run debug --image=busybox -it --rm -- sh

# Inside the pod:
wget -O- http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090/-/healthy
wget -O- http://prometheus-grafana.monitoring.svc:80/api/health
wget -O- http://prometheus-kube-alertmanager.monitoring.svc:9093/-/healthy
```
## Monitoring compute-blade-agent

### Automatic Integration

When both `enable_prometheus_operator` and `enable_compute_blade_agent` are enabled, the Ansible role automatically:

1. Creates the `compute-blade-agent` namespace
2. Deploys a ServiceMonitor for metrics scraping
3. Deploys a PrometheusRule for alerting
4. Configures Prometheus scrape targets

### Verify compute-blade-agent Monitoring

```bash
# Check that the ServiceMonitor was created
kubectl get servicemonitor -n compute-blade-agent

# Check that metrics are being scraped
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Then in the Prometheus UI:
# 1. Go to Status → Targets
# 2. Look for "compute-blade-agent" targets
# 3. They should show an "UP" status
```
### Available Metrics

```
# Temperature monitoring
compute_blade_temperature_celsius

# Fan monitoring
compute_blade_fan_rpm
compute_blade_fan_speed_percent

# Power monitoring
compute_blade_power_watts

# Status indicators
compute_blade_status
compute_blade_led_state
```
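As a starting point, a few PromQL queries over these metrics (the metric and job names are taken from this guide; adjust labels to your deployment):

```
# Hottest blade right now
max by (instance) (compute_blade_temperature_celsius)

# 5-minute average fan speed per blade
avg_over_time(compute_blade_fan_speed_percent[5m])

# Blades whose agent target is currently down
up{job="compute-blade-agent"} == 0
```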
### Create Custom Dashboard for compute-blade-agent

In Grafana:

1. Create a new dashboard
2. Add a panel with the query:

   ```
   compute_blade_temperature_celsius{job="compute-blade-agent"}
   ```

3. Set the visualization type to "Gauge" or "Graph"
4. Save the dashboard
## Custom ServiceMonitors

### Create a ServiceMonitor

Create `custom-servicemonitor.yml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitoring
  namespace: my-app
  labels:
    app: my-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
```

Deploy it:

```bash
kubectl apply -f custom-servicemonitor.yml
```
### Create a PrometheusRule

Create `custom-alerts.yml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: my-app.rules
      interval: 30s
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            rate(my_app_errors_total[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
            app: my-app
          annotations:
            summary: "High error rate in my-app"
            description: "Error rate is {{ $value }} on {{ $labels.instance }}"

        - alert: MyAppDown
          expr: |
            up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
            app: my-app
          annotations:
            summary: "my-app is down"
            description: "my-app on {{ $labels.instance }} is unreachable"
```

Deploy it:

```bash
kubectl apply -f custom-alerts.yml
```
## Alerting

### Pre-configured Alerts for compute-blade-agent

The following alerts are deployed automatically when compute-blade-agent monitoring is enabled:

1. **ComputeBladeAgentHighTemperature** (Warning)
   - Triggers when temperature > 80°C for 5 minutes

2. **ComputeBladeAgentCriticalTemperature** (Critical)
   - Triggers when temperature > 95°C for 2 minutes

3. **ComputeBladeAgentDown** (Critical)
   - Triggers when the agent is unreachable for 5 minutes

4. **ComputeBladeAgentFanFailure** (Warning)
   - Triggers when fan RPM = 0 for 5 minutes

5. **ComputeBladeAgentHighFanSpeed** (Warning)
   - Triggers when fan speed > 90% for 10 minutes
### View Active Alerts

```bash
# In the Prometheus UI:
# 1. Go to Alerts
# 2. See all active and pending alerts

# Or port-forward and open the alerts page directly:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then visit: http://localhost:9090/alerts
```

### Configure AlertManager Routing

Edit the AlertManager configuration:

```bash
# Get the current config
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Edit the configuration
kubectl edit secret -n monitoring alertmanager-kube-prometheus-alertmanager
```
Example routing configuration:

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 30m

    - match:
        severity: warning
      receiver: 'default'
      group_wait: 1m

receivers:
  - name: 'default'
    # Add email, slack, webhook, etc.

  - name: 'critical'
    # Add urgent notifications
```
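The first-match behavior of the child routes above can be sketched in a few lines. This is a simplification: real AlertManager also supports regex matchers, nested routes, and the `continue` flag, all of which this ignores:

```python
def route_alert(labels, routes, default_receiver="default"):
    """First-match walk over AlertManager-style child routes: the alert goes
    to the first route whose 'match' labels are all present on the alert,
    otherwise to the top-level default receiver."""
    for route in routes:
        if all(labels.get(k) == v for k, v in route.get("match", {}).items()):
            return route["receiver"]
    return default_receiver

# Mirrors the two child routes in the YAML above
routes = [
    {"match": {"severity": "critical"}, "receiver": "critical"},
    {"match": {"severity": "warning"}, "receiver": "default"},
]

print(route_alert({"alertname": "MyAppDown", "severity": "critical"}, routes))  # critical
print(route_alert({"alertname": "Heartbeat"}, routes))                          # default
```

This is why a critical alert gets the aggressive `group_wait: 0s` timing: it stops at the first matching child route and never falls through to the slower defaults.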
## Grafana Dashboards

### Import Pre-built Dashboards

1. Open Grafana: http://localhost:3000
2. Click "+" → Import
3. Enter a dashboard ID from grafana.com
4. Select the Prometheus data source
5. Click Import

**Recommended Dashboards:**

| ID | Name |
|----|------|
| 1860 | Node Exporter for Prometheus Dashboard |
| 6417 | Kubernetes Cluster Monitoring |
| 8588 | Kubernetes Deployment Statefulset Daemonset |
| 11074 | Node Exporter - Nodes |
| 12114 | Kubernetes cluster monitoring |

### Create Custom Dashboard

1. Click "+" → Dashboard
2. Click "Add new panel"
3. Configure the query:
   - Data source: Prometheus
   - Query: `up{job="compute-blade-agent"}`
4. Set the visualization (Graph, Gauge, Table, etc.)
5. Click Save

### Export Dashboard

```bash
# Get dashboard JSON (assumes Grafana is port-forwarded to localhost:3000)
curl http://admin:password@localhost:3000/api/dashboards/db/my-dashboard > my-dashboard.json

# Import elsewhere; note the import endpoint expects the JSON wrapped as
# {"dashboard": {...}, "overwrite": true}, not the raw export
curl -X POST -H "Content-Type: application/json" \
  -d @my-dashboard.json \
  http://admin:password@localhost:3000/api/dashboards/db
```
## Troubleshooting

### Prometheus Not Scraping Targets

```bash
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Visit: http://localhost:9090/targets
# Look for failed targets

# Check ServiceMonitors
kubectl get servicemonitor --all-namespaces

# Check the Prometheus config
kubectl get prometheus -n monitoring -o yaml

# View Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=50 -f
```
### Grafana Data Source Not Working

```bash
# Check Prometheus connectivity from Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# In Grafana:
# 1. Configuration → Data Sources
# 2. Click Prometheus
# 3. Check Status (green = working)
# 4. If red, check the URL and credentials

# Or check the logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana --tail=50 -f
```
### AlertManager Not Sending Notifications

```bash
# Check the AlertManager configuration
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Restart AlertManager to apply changes
kubectl rollout restart statefulset -n monitoring prometheus-kube-alertmanager

# Check the logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 -f

# Inspect alerts in the UI
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093
# Visit: http://localhost:9093 to see alerts
```
### Disk Space Issues

```bash
# Check Prometheus PVC usage
kubectl get pvc -n monitoring

# View disk usage
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- df -h /prometheus

# Increase storage (requires a StorageClass with allowVolumeExpansion: true)
kubectl patch pvc -n monitoring prometheus-kube-prometheus-prometheus-db-prometheus-kube-prometheus-prometheus-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```
### High Memory/CPU Usage

```bash
# Check resource usage
kubectl top pod -n monitoring

# Reduce the retention period
kubectl edit prometheus -n monitoring

# Update in spec:
# retention: 3d  # Reduce from 7d to 3d

# Or scrape less frequently (a longer interval means fewer samples)
# scrapeInterval: 60s  # Raise from 30s to 60s
```
### ServiceMonitor Not Being Picked Up

```bash
# Check if the labels match
kubectl get servicemonitor --all-namespaces -o yaml | grep -A5 "release: prometheus"

# Check the Prometheus selector config
kubectl get prometheus -n monitoring -o yaml | grep -A5 "serviceMonitorSelector"

# Restart Prometheus if the config changed
kubectl rollout restart statefulset -n monitoring prometheus-kube-prometheus-prometheus

# Check the Prometheus logs for errors
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0 -c prometheus --tail=100 | grep -i "error\|failed"
```
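The label check in the first step boils down to Kubernetes `matchLabels` semantics: every key/value in the ServiceMonitor's selector must be present on the target Service, while extra Service labels are ignored. A small sketch (labels here are illustrative):

```python
def selector_matches(match_labels, service_labels):
    """ServiceMonitor spec.selector.matchLabels semantics: every selector
    key/value must appear on the target Service's labels; extra labels
    on the Service are fine."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())

svc_labels = {"app": "my-app", "team": "platform"}
print(selector_matches({"app": "my-app"}, svc_labels))     # True
print(selector_matches({"app": "other-app"}, svc_labels))  # False
```

Note this matches the Service's own labels, not the Pod's, and it is separate from the `release: prometheus` label that Prometheus uses to select which ServiceMonitors to load at all.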
## Maintenance

### Backup Grafana Dashboards

```bash
# Export all dashboards via the HTTP API
# (assumes Grafana is port-forwarded to localhost:3000)
for dashboard in $(curl -s http://admin:password@localhost:3000/api/search | jq -r '.[] | .uri'); do
  name=$(echo $dashboard | cut -d'/' -f2)
  curl -s http://admin:password@localhost:3000/api/dashboards/$dashboard > ${name}.json
done
```
### Update Prometheus Retention

```bash
# Edit the Prometheus resource
kubectl edit prometheus -n monitoring

# Update the retention field under spec:
#   retention: "30d"  # Change from 7d to 30d

# The operator applies the change automatically
```
### Scale Prometheus Resources

```bash
# For high-load environments, increase resources
kubectl patch prometheus -n monitoring prometheus-kube-prometheus-prometheus --type merge \
  -p '{"spec":{"resources":{"requests":{"cpu":"500m","memory":"1Gi"},"limits":{"cpu":"1000m","memory":"2Gi"}}}}'
```
## Best Practices

1. **Security**
   - Change the default Grafana password immediately
   - Restrict AlertManager webhook URLs to known services
   - Use network policies to limit access

2. **Performance**
   - Monitor Prometheus disk usage regularly
   - Adjust scrape intervals based on needs
   - Use recording rules for complex queries

3. **Reliability**
   - Enable persistent storage (PVC)
   - Configure alert routing and escalation
   - Back up Grafana dashboards and configs regularly

4. **Organization**
   - Label all ServiceMonitors with `release: prometheus`
   - Use consistent naming conventions
   - Document custom dashboards and alerts

5. **Cost Optimization**
   - Remove unused scrape targets
   - Tune scrape intervals (don't scrape more often than needed)
   - Set appropriate retention periods
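The recording-rule suggestion under Performance can look like the following. The rule name and expression are illustrative, not part of this role; deploy it like any other PrometheusRule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: blade.recording
      interval: 1m
      rules:
        # Precompute the per-node maximum temperature so dashboards query
        # one cheap series instead of re-aggregating on every refresh
        - record: instance:compute_blade_temperature_celsius:max
          expr: max by (instance) (compute_blade_temperature_celsius)
```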
## Support

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Prometheus Operator GitHub](https://github.com/prometheus-operator/prometheus-operator)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
@@ -35,3 +35,9 @@ extra_packages=btop,vim,tmux,net-tools,dnsutils,iotop,ncdu,tree,jq
# Compute Blade Agent configuration
# Set to false to skip compute-blade-agent deployment on specific nodes
enable_compute_blade_agent=true

# Enable the Prometheus Operator monitoring stack
enable_prometheus_operator=true
grafana_admin_password=SecurePassword123!
prometheus_storage_size=10Gi
prometheus_retention_days=7
@@ -100,19 +100,19 @@ spec:
---
# Optional ServiceMonitor for Prometheus (requires prometheus-operator)
# Uncomment this section if you have Prometheus installed with the operator
#
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
#   name: compute-blade-agent
#   namespace: compute-blade-agent
#   labels:
#     app: compute-blade-agent
# spec:
#   selector:
#     matchLabels:
#       app: compute-blade-agent
#   endpoints:
#     - port: metrics
#       interval: 30s
#       path: /metrics

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent
spec:
  selector:
    matchLabels:
      app: compute-blade-agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
`roles/prometheus-operator/defaults/main.yml` — new file (+68 lines)
---
# Prometheus Operator configuration defaults

# Enable/disable Prometheus Operator installation
enable_prometheus_operator: true

# Grafana admin password (change in production!)
grafana_admin_password: "admin"

# Kubeconfig path for kubectl access
kubeconfig_path: "/etc/rancher/k3s/k3s.yaml"

# Prometheus configuration
prometheus_retention_days: 7
prometheus_storage_size: "10Gi"

# Grafana configuration
grafana_storage_size: "5Gi"
grafana_admin_user: "admin"

# AlertManager configuration
alertmanager_storage_size: "5Gi"

# Node selector for Prometheus components (optional)
# Set to restrict Prometheus to specific nodes
prometheus_node_selector: {}
# Example:
# prometheus_node_selector:
#   node-type: monitoring

# Resource requests and limits
prometheus_cpu_request: "250m"
prometheus_cpu_limit: "500m"
prometheus_memory_request: "512Mi"
prometheus_memory_limit: "1Gi"

grafana_cpu_request: "100m"
grafana_cpu_limit: "200m"
grafana_memory_request: "256Mi"
grafana_memory_limit: "512Mi"

alertmanager_cpu_request: "100m"
alertmanager_cpu_limit: "200m"
alertmanager_memory_request: "256Mi"
alertmanager_memory_limit: "512Mi"

# Scrape interval configuration
prometheus_scrape_interval: "30s"
prometheus_scrape_timeout: "10s"
prometheus_evaluation_interval: "30s"

# ServiceMonitor / PodMonitor label selectors
prometheus_service_monitor_selector: {}
prometheus_pod_monitor_selector: {}

# Enable/disable components
enable_grafana: true
enable_alertmanager: true
enable_prometheus_node_exporter: true
enable_kube_state_metrics: true

# Helm values for fine-tuning
prometheus_helm_values: {}
# Example:
# prometheus_helm_values:
#   prometheus:
#     prometheusSpec:
#       retention: "15d"
`roles/prometheus-operator/tasks/main.yml` — new file (+140 lines)
---
|
||||
- name: Skip Prometheus Operator installation if disabled
|
||||
debug:
|
||||
msg: 'Prometheus Operator installation is disabled for this cluster'
|
||||
when: not enable_prometheus_operator | bool
|
||||
|
||||
- name: Block for Prometheus Operator installation
|
||||
block:
|
||||
- name: Check if Helm is installed locally
|
||||
shell: which helm
|
||||
register: helm_check
|
||||
changed_when: false
|
||||
failed_when: false
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Install Helm if not found
|
||||
shell: |
|
||||
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
|
||||
when: helm_check.rc != 0
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: true
|
||||
|
||||
- name: Add Prometheus Helm repository
|
||||
kubernetes.core.helm_repository:
|
||||
name: prometheus-community
|
||||
repo_url: https://prometheus-community.github.io/helm-charts
|
||||
state: present
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Update Helm repositories
|
||||
shell: helm repo update
|
||||
changed_when: true
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Create monitoring namespace
|
||||
shell: kubectl create namespace monitoring --kubeconfig={{ playbook_dir }}/kubeconfig 2>/dev/null || true
|
||||
changed_when: false
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Install Prometheus Operator via Helm
|
||||
shell: |
|
||||
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
|
||||
--namespace monitoring \
|
||||
--create-namespace \
|
||||
--set prometheus.prometheusSpec.retention={{ prometheus_retention_days }}d \
|
||||
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
|
||||
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage={{ prometheus_storage_size }} \
|
||||
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
|
||||
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
|
||||
--set grafana.enabled=true \
|
||||
--set grafana.adminPassword="{{ grafana_admin_password | default('admin') }}" \
|
||||
--set grafana.persistence.enabled=true \
|
||||
--set grafana.persistence.size={{ grafana_storage_size }} \
|
||||
--set alertmanager.enabled=true \
|
||||
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
|
||||
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage={{ alertmanager_storage_size }} \
|
||||
--kubeconfig={{ playbook_dir }}/kubeconfig
|
||||
environment:
|
||||
KUBECONFIG: '{{ playbook_dir }}/kubeconfig'
|
||||
register: helm_install_result
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: "'has been upgraded' in helm_install_result.stdout or 'has been installed' in helm_install_result.stdout"
|
||||
|
||||
- name: Wait for Prometheus Operator to be ready
|
||||
shell: kubectl rollout status deployment/prometheus-kube-prometheus-operator -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
|
||||
retries: 5
|
||||
delay: 10
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: false
|
||||
|
||||
- name: Wait for Prometheus to be ready
|
||||
shell: kubectl rollout status statefulset/prometheus-prometheus-kube-prometheus-prometheus -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
|
||||
retries: 5
|
||||
delay: 10
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: false
|
||||
|
||||
- name: Wait for Grafana to be ready
|
||||
shell: kubectl rollout status deployment/prometheus-grafana -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
|
||||
retries: 5
|
||||
delay: 10
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: false
|
||||
|
||||
- name: Generate compute-blade-agent monitoring resources
|
      template:
        src: compute-blade-agent-monitoring.j2
        dest: /tmp/compute-blade-agent-monitoring.yaml
      when: enable_compute_blade_agent | bool
      delegate_to: localhost
      become: false

    - name: Deploy compute-blade-agent monitoring resources
      shell: kubectl apply -f /tmp/compute-blade-agent-monitoring.yaml --kubeconfig={{ playbook_dir }}/kubeconfig
      when: enable_compute_blade_agent | bool
      delegate_to: localhost
      become: false
      register: result
      changed_when: "'created' in result.stdout or 'configured' in result.stdout"

    - name: Wait for compute-blade-agent ServiceMonitor to be picked up
      pause:
        seconds: 30

    - name: Verify Prometheus targets
      shell: kubectl get service prometheus-operated -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig -o jsonpath='{.metadata.name}'
      register: prometheus_service
      delegate_to: localhost
      become: false
      changed_when: false

    - name: Display Prometheus Operator installation details
      debug:
        msg:
          - 'Prometheus Operator has been successfully installed'
          - 'Namespace: monitoring'
          - 'Prometheus: Available at prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090'
          - 'Grafana: Available at prometheus-grafana.monitoring.svc.cluster.local:80'
          - 'AlertManager: Available at prometheus-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093'
          - "Default Grafana admin password: {{ grafana_admin_password | default('admin') }}"
          - ''
          - 'To access Prometheus UI:'
          - '  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090'
          - ''
          - 'To access Grafana:'
          - '  kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80'
          - ''
          - 'To access AlertManager:'
          - '  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093'

  when: enable_prometheus_operator | bool
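The tasks above are gated on two booleans and read an optional Grafana password. A minimal inventory override might look like this (the file location is a common Ansible convention, not something this commit specifies):

```yaml
# group_vars/all.yml (hypothetical path) -- variables referenced by the tasks above
enable_prometheus_operator: true     # deploys the whole monitoring stack
enable_compute_blade_agent: true     # also templates and applies the blade monitoring manifests
grafana_admin_password: "change-me"  # optional; the debug task falls back to 'admin' when unset
```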
compute-blade-agent-monitoring.j2
@@ -0,0 +1,91 @@
---
# Namespace for compute-blade-agent (created first so the resources below can land in it)
apiVersion: v1
kind: Namespace
metadata:
  name: compute-blade-agent
  labels:
    name: compute-blade-agent
---
# ServiceMonitor for compute-blade-agent metrics collection
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent
    release: prometheus
spec:
  selector:
    matchLabels:
      app: compute-blade-agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
---
# PrometheusRule for compute-blade-agent alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: compute-blade-agent.rules
      interval: 30s
      rules:
        - alert: ComputeBladeAgentHighTemperature
          expr: compute_blade_temperature_celsius > 80
          for: 5m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade high temperature detected on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade temperature is {% raw %}{{ $value }}{% endraw %}°C (threshold: 80°C) on node {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentCriticalTemperature
          expr: compute_blade_temperature_celsius > 95
          for: 2m
          labels:
            severity: critical
            component: compute-blade-agent
          annotations:
            summary: "Compute blade CRITICAL temperature on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade temperature is {% raw %}{{ $value }}{% endraw %}°C (CRITICAL threshold: 95°C) on node {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentDown
          expr: up{job="compute-blade-agent"} == 0
          for: 5m
          labels:
            severity: critical
            component: compute-blade-agent
          annotations:
            summary: "Compute blade agent is down on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade agent has been unreachable for more than 5 minutes on {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentFanFailure
          expr: compute_blade_fan_rpm == 0
          for: 5m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade fan failure detected on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade fan is not running on {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentHighFanSpeed
          expr: compute_blade_fan_speed_percent > 90
          for: 10m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade fan running at high speed on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade fan speed is {% raw %}{{ $value }}{% endraw %}% (threshold: 90%) on {% raw %}{{ $labels.instance }}{% endraw %}"
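The ServiceMonitor selects Services by label and scrapes a port literally named `metrics`; no such Service is part of this commit. A hedged sketch of what it assumes (the port number is a placeholder, not taken from the agent's actual configuration):

```yaml
# Hypothetical Service the ServiceMonitor's selector would match -- not part of
# this commit. Replace the port number with whatever compute-blade-agent listens on.
apiVersion: v1
kind: Service
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent   # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: compute-blade-agent
  ports:
    - name: metrics            # the ServiceMonitor endpoint refers to this port by name
      port: 8080               # placeholder value
      targetPort: 8080
```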
12 site.yml
@@ -40,7 +40,7 @@
         - agent
         - worker
 
-- name: Install compute-blade-agent on workers
+- name: Install compute-blade-agent on all nodes
   hosts: all
   become: true
   roles:
@@ -49,6 +49,16 @@
         - compute-blade-agent
         - blade-agent
 
+- name: Install Prometheus Operator
+  hosts: "{{ groups['master'][0] }}"
+  gather_facts: false
+  become: false
+  roles:
+    - role: prometheus-operator
+      tags:
+        - prometheus-operator
+        - monitoring
+
 - name: Deploy test applications
   hosts: "{{ groups['master'][0] }}"
   gather_facts: true
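With the new play in `site.yml`, the monitoring stack can be applied (or re-applied) on its own via the tags it declares. The inventory path below follows the usual k3s-ansible layout; adjust it to your own setup:

```shell
# Full run, including the new Prometheus Operator play
ansible-playbook site.yml -i inventory/my-cluster/hosts.ini

# Re-run only the monitoring pieces, using the tags declared on the new play
ansible-playbook site.yml -i inventory/my-cluster/hosts.ini --tags monitoring
```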