Add Prometheus Operator role and templates

2026-01-12 09:27:57 +01:00
parent fd7c9239b5
commit f3754c01d7
7 changed files with 1064 additions and 17 deletions

PROMETHEUS_MONITORING.md Normal file

@@ -0,0 +1,732 @@
# Prometheus Operator & Monitoring Guide
Complete guide for deploying and managing monitoring infrastructure with Prometheus Operator, Grafana, and AlertManager in your k3s-ansible cluster.
## Table of Contents
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Configuration](#configuration)
- [Accessing Components](#accessing-components)
- [Monitoring compute-blade-agent](#monitoring-compute-blade-agent)
- [Custom ServiceMonitors](#custom-servicemonitors)
- [Alerting](#alerting)
- [Grafana Dashboards](#grafana-dashboards)
- [Troubleshooting](#troubleshooting)
## Overview
The Prometheus Operator installation includes:
- **Prometheus**: Time-series database and scraping engine
- **Grafana**: Visualization and dashboarding platform
- **AlertManager**: Alert routing and management
- **Node Exporter**: Hardware and OS metrics
- **kube-state-metrics**: Kubernetes cluster metrics
- **Prometheus Operator**: CRD controller for managing Prometheus resources
### Architecture
```
        ┌─────────────────────────────────────────┐
        │           Prometheus Operator           │
        │ (CRD Controller - monitoring namespace) │
        └────────────────────┬────────────────────┘
             ┌───────────────┼───────────────┐
             ↓               ↓               ↓
        Prometheus        Grafana      AlertManager
          (9090)          (3000)          (9093)
             ↓               ↓               ↓
             └───────────────┼───────────────┘
                 ┌───────────┴───────────┐
                 │    ServiceMonitors    │
                 │    PrometheusRules    │
                 │     AlertingRules     │
                 └───────────┬───────────┘
             ┌───────────────┼───────────────┐
             ↓               ↓               ↓
          Scrape          Scrape          Scrape
          Targets         Targets         Targets
```
## Quick Start
### Deploy Everything
```bash
# 1. Enable Prometheus Operator in inventory
# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=true
enable_compute_blade_agent=true
# 2. Run Ansible playbook
ansible-playbook site.yml --tags prometheus-operator
# 3. Wait for components to be ready
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n monitoring --timeout=300s
# 4. Access Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# 5. Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
```
Then open:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (default: admin/admin)
- AlertManager: http://localhost:9093
### Disable Prometheus Operator
```bash
# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=false
# Re-run playbook (the role is skipped; note this does not uninstall an already-deployed stack)
ansible-playbook site.yml --tags prometheus-operator
```
## Installation
### Prerequisites
- K3s cluster already deployed with k3s-ansible
- kubectl access to the cluster
- Helm 3.x installed on the control machine
### Step-by-Step Installation
#### 1. Configure Inventory
Edit `inventory/hosts.ini`:
```ini
[k3s_cluster:vars]
# Enable Prometheus Operator
enable_prometheus_operator=true
# (Optional) Set Grafana admin password
grafana_admin_password=MySecurePassword123!
# Enable compute-blade-agent monitoring
enable_compute_blade_agent=true
```
#### 2. Run the Playbook
```bash
# Install only Prometheus Operator
ansible-playbook site.yml --tags prometheus-operator
# Or deploy everything including K3s
ansible-playbook site.yml
```
#### 3. Verify Installation
```bash
# Check if monitoring namespace exists
kubectl get namespace monitoring
# Check Prometheus Operator deployment
kubectl get deployment -n monitoring
# Check all monitoring components
kubectl get all -n monitoring
# Wait for all pods to be ready
kubectl wait --for=condition=ready pod --all -n monitoring --timeout=300s
```
Expected output:
```
NAME                                                        READY   STATUS    RESTARTS   AGE
pod/prometheus-kube-prometheus-operator-5f8d4b5c7d-x9k2l    1/1     Running   0          2m
pod/prometheus-prometheus-kube-prometheus-prometheus-0      2/2     Running   0          1m
pod/prometheus-kube-state-metrics-7c9d5f8c4-m2k9n           1/1     Running   0          2m
pod/prometheus-prometheus-node-exporter-pz8kl               1/1     Running   0          2m
pod/prometheus-grafana-5f8d7b5c9e-z1q3x                     3/3     Running   0          1m
pod/alertmanager-prometheus-kube-prometheus-alertmanager-0  2/2     Running   0          1m
```
## Configuration
### Environment Variables
Configure via `inventory/hosts.ini`:
```ini
[k3s_cluster:vars]
# Enable/disable monitoring stack
enable_prometheus_operator=true
# Grafana configuration
grafana_admin_password=SecurePassword123!
grafana_admin_user=admin
grafana_storage_size=5Gi
# Prometheus configuration
prometheus_retention_days=7
prometheus_storage_size=10Gi
prometheus_scrape_interval=30s
prometheus_scrape_timeout=10s
# AlertManager configuration
alertmanager_storage_size=5Gi
# Component flags
enable_grafana=true
enable_alertmanager=true
enable_prometheus_node_exporter=true
enable_kube_state_metrics=true
```
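To sanity-check `prometheus_storage_size` against `prometheus_retention_days` and `prometheus_scrape_interval`, a rough capacity estimate helps. A minimal sketch: the ~2 bytes/sample figure is Prometheus's commonly cited rule of thumb for its compressed TSDB, and the target/series counts below are illustrative assumptions, not values from this cluster:

```python
# Rough Prometheus disk-usage estimate:
#   retention_seconds * samples_per_second * bytes_per_sample

def estimate_disk_bytes(retention_days: int, scrape_interval_s: int,
                        targets: int, series_per_target: int,
                        bytes_per_sample: float = 2.0) -> float:
    # Each target contributes series_per_target samples every scrape.
    samples_per_second = targets * series_per_target / scrape_interval_s
    return retention_days * 86_400 * samples_per_second * bytes_per_sample

# 7-day retention, 30s scrapes, 6 nodes with ~1000 series each (node-exporter scale)
gib = estimate_disk_bytes(7, 30, 6, 1000) / 2**30
print(f"~{gib:.1f} GiB")
```

At that scale the default `prometheus_storage_size=10Gi` leaves ample headroom; re-run the estimate before raising retention or adding many scrape targets.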
### Per-Node Configuration
To restrict Prometheus to specific nodes:
```ini
[k3s_cluster:vars]
prometheus_node_selector={"node-type": "monitoring"}
```
Or via inventory host vars:
```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi enable_prometheus_operator=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi enable_prometheus_operator=false
```
### Resource Limits
Control resource usage in `inventory/hosts.ini`:
```ini
prometheus_cpu_request=250m
prometheus_cpu_limit=500m
prometheus_memory_request=512Mi
prometheus_memory_limit=1Gi
grafana_cpu_request=100m
grafana_cpu_limit=200m
grafana_memory_request=256Mi
grafana_memory_limit=512Mi
```
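Conceptually, the role turns inventory variables like these into chart values for `kube-prometheus-stack`. A hypothetical sketch of that translation (the `build_set_flags` helper and the pass-through of resource vars are assumptions for illustration — check the role's Helm task to see which variables are actually forwarded; the value paths follow the chart's `prometheus.prometheusSpec.resources` / `grafana.resources` layout):

```python
# Hypothetical helper: turn inventory resource vars into helm --set flags.

def build_set_flags(values: dict) -> list[str]:
    # One --set flag per dotted value path, in stable (sorted) order.
    return [f"--set {path}={val}" for path, val in sorted(values.items())]

resource_values = {
    "prometheus.prometheusSpec.resources.requests.cpu": "250m",
    "prometheus.prometheusSpec.resources.limits.memory": "1Gi",
    "grafana.resources.requests.cpu": "100m",
    "grafana.resources.limits.memory": "512Mi",
}
print(" \\\n  ".join(build_set_flags(resource_values)))
```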
## Accessing Components
### Prometheus Web UI
```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Access at: http://localhost:9090
```
**Available from within cluster:**
```
http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```
**Features:**
- Query builder
- Target health status
- Alert rules
- Service discovery
- Graph visualization
### Grafana Dashboards
```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Access at: http://localhost:3000
# Default credentials: admin / admin (or your configured password)
```
**Available from within cluster:**
```
http://prometheus-grafana.monitoring.svc.cluster.local:80
```
**Pre-installed Dashboards:**
1. Kubernetes / Cluster Monitoring
2. Kubernetes / Nodes
3. Kubernetes / Pods
4. Kubernetes / Deployments Statefulsets Daemonsets
5. Node Exporter for Prometheus Dashboard
**Custom Dashboards:**
- Import from grafana.com
- Create custom dashboards
- Connect to Prometheus data source
### AlertManager
```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
# Access at: http://localhost:9093
```
**Available from within cluster:**
```
http://prometheus-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093
```
**Features:**
- Alert grouping and deduplication
- Alert routing rules
- Notification management
- Silence alerts
### Verify Network Connectivity
```bash
# Test from within the cluster
kubectl run debug --image=busybox -it --rm -- sh
# Inside the pod:
wget -O- http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090/-/healthy
wget -O- http://prometheus-grafana.monitoring.svc:80/api/health
wget -O- http://prometheus-kube-prometheus-alertmanager.monitoring.svc:9093/-/healthy
```
## Monitoring compute-blade-agent
### Automatic Integration
When both `enable_prometheus_operator` and `enable_compute_blade_agent` are enabled, the Ansible role automatically:
1. Creates the `compute-blade-agent` namespace
2. Deploys ServiceMonitor for metrics scraping
3. Deploys PrometheusRule for alerting
4. Configures Prometheus scrape targets
### Verify compute-blade-agent Monitoring
```bash
# Check if ServiceMonitor is created
kubectl get servicemonitor -n compute-blade-agent
# Check if metrics are being scraped
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then in Prometheus UI:
# 1. Go to Status → Targets
# 2. Look for "compute-blade-agent" targets
# 3. Should show "UP" status
```
### Available Metrics
```
# Temperature monitoring
compute_blade_temperature_celsius
# Fan monitoring
compute_blade_fan_rpm
compute_blade_fan_speed_percent
# Power monitoring
compute_blade_power_watts
# Status indicators
compute_blade_status
compute_blade_led_state
```
### Create Custom Dashboard for compute-blade-agent
In Grafana:
1. Create new dashboard
2. Add panel with query:
```
compute_blade_temperature_celsius{job="compute-blade-agent"}
```
3. Set visualization type to "Gauge" or "Graph"
4. Save dashboard
## Custom ServiceMonitors
### Create a ServiceMonitor
Create `custom-servicemonitor.yml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitoring
  namespace: my-app
  labels:
    app: my-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
```
Deploy it:
```bash
kubectl apply -f custom-servicemonitor.yml
```
### Create a PrometheusRule
Create `custom-alerts.yml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: my-app.rules
      interval: 30s
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            rate(my_app_errors_total[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
            app: my-app
          annotations:
            summary: "High error rate in my-app"
            description: "Error rate is {{ $value }} on {{ $labels.instance }}"
        - alert: MyAppDown
          expr: |
            up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
            app: my-app
          annotations:
            summary: "my-app is down"
            description: "my-app on {{ $labels.instance }} is unreachable"
```
Deploy it:
```bash
kubectl apply -f custom-alerts.yml
```
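The `rate(my_app_errors_total[5m]) > 0.05` expression works because `my_app_errors_total` is a counter: `rate()` computes the per-second increase over the window and tolerates counter resets. A minimal two-sample sketch of that calculation (the real `rate()` also extrapolates across the full window, which this omits):

```python
# Per-second rate between two counter samples, with Prometheus-style
# counter-reset handling: if the counter went down, it was reset, so the
# later value alone is the increase since the reset.

def counter_rate(v_prev: float, v_now: float, dt_seconds: float) -> float:
    increase = v_now - v_prev if v_now >= v_prev else v_now
    return increase / dt_seconds

print(counter_rate(1000, 1090, 300))  # 90 errors over 5m -> 0.3/s, fires the > 0.05 alert
print(counter_rate(1000, 40, 300))    # counter reset: the increase is 40, not -960
```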
## Alerting
### Pre-configured Alerts for compute-blade-agent
Automatically deployed when compute-blade-agent monitoring is enabled:
1. **ComputeBladeAgentHighTemperature** (Warning)
- Triggers when temp > 80°C for 5 minutes
2. **ComputeBladeAgentCriticalTemperature** (Critical)
- Triggers when temp > 95°C for 2 minutes
3. **ComputeBladeAgentDown** (Critical)
- Triggers when agent unreachable for 5 minutes
4. **ComputeBladeAgentFanFailure** (Warning)
- Triggers when fan RPM = 0 for 5 minutes
5. **ComputeBladeAgentHighFanSpeed** (Warning)
- Triggers when fan speed > 90% for 10 minutes
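The `for:` duration on each rule means the condition must hold continuously before the alert fires, so a brief temperature spike above 80°C does not page anyone. A simplified simulation of that pending→firing behavior (the real evaluator also tracks resolved/inactive transitions and timestamps):

```python
# Simulate a PrometheusRule `for:` clause: the alert fires only once the
# condition has held continuously for `for_seconds`.

def alert_states(samples, threshold, step_s, for_seconds):
    states, held = [], 0
    for value in samples:
        held = held + step_s if value > threshold else 0  # reset on any dip
        if held == 0:
            states.append("inactive")
        elif held >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states

# 30s evaluation interval, threshold 80°C, for: 5m (300s):
# a short spike stays "pending"; only a sustained breach reaches "firing".
temps = [75, 82, 84, 79] + [85] * 10
print(alert_states(temps, 80, 30, 300))
```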
### View Active Alerts
```bash
# In Prometheus UI:
# 1. Go to Alerts
# 2. See all active and pending alerts
# Or query:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then visit: http://localhost:9090/alerts
```
### Configure AlertManager Routing
Edit AlertManager configuration:
```bash
# Get current config
kubectl get secret -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
# Edit configuration
kubectl edit secret -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager
```
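The `base64 -d` step is needed because Kubernetes stores every `data` field of a Secret base64-encoded. The same decode in Python (the secret payload here is a made-up minimal AlertManager config, not the real one):

```python
import base64

# Kubernetes Secret `data` values are base64-encoded strings.
encoded = base64.b64encode(b"route:\n  receiver: 'default'\n").decode()
print(encoded)

# Equivalent of `... | base64 -d`:
decoded = base64.b64decode(encoded).decode()
print(decoded)
```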
Example routing configuration:
```yaml
global:
  resolve_timeout: 5m
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 30m
    - match:
        severity: warning
      receiver: 'default'
      group_wait: 1m
receivers:
  - name: 'default'
    # Add email, slack, webhook, etc.
  - name: 'critical'
    # Add urgent notifications
```
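AlertManager walks the `routes` list top-down and hands the alert to the first matching child (none of the routes above set `continue: true`), falling back to the root receiver when nothing matches. The matching logic of this particular example in miniature:

```python
# First-match routing over the example tree above:
# critical -> 'critical', warning -> 'default', anything else -> root receiver.

ROUTES = [
    ({"severity": "critical"}, "critical"),
    ({"severity": "warning"}, "default"),
]

def pick_receiver(alert_labels: dict, default_receiver: str = "default") -> str:
    for match, receiver in ROUTES:
        # `match:` requires every listed label to equal the alert's label.
        if all(alert_labels.get(k) == v for k, v in match.items()):
            return receiver
    return default_receiver

print(pick_receiver({"alertname": "ComputeBladeAgentDown", "severity": "critical"}))  # critical
print(pick_receiver({"alertname": "SomeInfoAlert"}))  # default
```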
## Grafana Dashboards
### Import Pre-built Dashboards
1. Open Grafana: http://localhost:3000
2. Click "+" → Import
3. Enter dashboard ID from grafana.com
4. Select Prometheus data source
5. Click Import
**Recommended Dashboards:**
| ID | Name |
|----|------|
| 1860 | Node Exporter for Prometheus Dashboard |
| 6417 | Kubernetes Cluster Monitoring |
| 8588 | Kubernetes Deployment Statefulset Daemonset |
| 11074 | Node Exporter - Nodes |
| 12114 | Kubernetes cluster monitoring |
### Create Custom Dashboard
1. Click "+" → Dashboard
2. Click "Add new panel"
3. Configure query:
- Data source: Prometheus
- Query: `up{job="compute-blade-agent"}`
4. Set visualization (Graph, Gauge, Table, etc.)
5. Click Save
### Export Dashboard
```bash
# Get dashboard JSON by UID (the UID appears in the dashboard's URL)
curl -s "http://admin:password@localhost:3000/api/dashboards/uid/<dashboard-uid>" > my-dashboard.json
# Import elsewhere (the POST body must be wrapped as {"dashboard": {...}, "overwrite": true})
curl -X POST -H "Content-Type: application/json" \
  -d @my-dashboard.json \
  http://admin:password@localhost:3000/api/dashboards/db
```
## Troubleshooting
### Prometheus Not Scraping Targets
```bash
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Visit: http://localhost:9090/targets
# Look for failed targets
# Check ServiceMonitor
kubectl get servicemonitor --all-namespaces
# Check Prometheus config
kubectl get prometheus -n monitoring -o yaml
# View Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=50 -f
```
### Grafana Data Source Not Working
```bash
# Check Prometheus connectivity from Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# In Grafana:
# 1. Configuration → Data Sources
# 2. Click Prometheus
# 3. Check Status (green = working)
# 4. If red, check URL and credentials
# Or check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana --tail=50 -f
```
### AlertManager Not Sending Notifications
```bash
# Check AlertManager configuration
kubectl get secret -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
# Restart AlertManager to apply changes
kubectl rollout restart statefulset -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager
# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 -f
# Test webhook
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
# Visit: http://localhost:9093 to see alerts
```
### Disk Space Issues
```bash
# Check Prometheus PVC usage
kubectl get pvc -n monitoring
# View disk usage
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- df -h /prometheus
# Increase storage (only works if the StorageClass supports volume expansion;
# k3s's default local-path provisioner does not)
kubectl patch pvc -n monitoring prometheus-prometheus-kube-prometheus-prometheus-db-prometheus-prometheus-kube-prometheus-prometheus-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```
### High Memory/CPU Usage
```bash
# Check resource usage
kubectl top pod -n monitoring
# Reduce retention period
kubectl edit prometheus -n monitoring
# Reduce retention in spec:
#   retention: 3d        # down from 7d
# Or scrape less often in spec:
#   scrapeInterval: 60s  # up from the 30s default
```
### ServiceMonitor Not Being Picked Up
```bash
# Check if labels match
kubectl get servicemonitor --all-namespaces -o yaml | grep -A5 "release: prometheus"
# Prometheus selector config
kubectl get prometheus -n monitoring -o yaml | grep -A5 "serviceMonitorSelector"
# Restart Prometheus if config changed
kubectl rollout restart statefulset -n monitoring prometheus-prometheus-kube-prometheus-prometheus
# Check Prometheus logs for errors
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus --tail=100 | grep -iE "error|failed"
```
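Both halves of this check come down to the same rule: a ServiceMonitor's `selector.matchLabels` is a subset match against the Service's labels, and Prometheus's `serviceMonitorSelector` applies the same rule to the ServiceMonitor's own labels (which is why the `release: prometheus` label matters on installs that don't set `serviceMonitorSelectorNilUsesHelmValues=false`). The matching rule in miniature:

```python
# matchLabels semantics: every selector key/value must appear in the object's
# labels; extra labels on the object are fine.

def match_labels(selector: dict, labels: dict) -> bool:
    return all(labels.get(k) == v for k, v in selector.items())

svc_labels = {"app": "compute-blade-agent", "release": "prometheus", "team": "infra"}
print(match_labels({"app": "compute-blade-agent"}, svc_labels))  # True
print(match_labels({"app": "other-app"}, svc_labels))            # False
```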
## Maintenance
### Backup Grafana Dashboards
```bash
# Export all dashboards via the HTTP API (needs jq; dashboards are addressed by UID)
for uid in $(curl -s "http://admin:password@localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s "http://admin:password@localhost:3000/api/dashboards/uid/${uid}" > "${uid}.json"
done
```
### Update Prometheus Retention
```bash
# Edit Prometheus resource
kubectl edit prometheus -n monitoring
# Update retention field
spec:
  retention: "30d"  # Change from 7d to 30d
# The operator picks up the change and restarts Prometheus automatically
```
### Scale Prometheus Resources
```bash
# For high-load environments, increase resources
kubectl patch prometheus -n monitoring prometheus-kube-prometheus-prometheus --type merge \
  -p '{"spec":{"resources":{"requests":{"cpu":"500m","memory":"1Gi"},"limits":{"cpu":"1000m","memory":"2Gi"}}}}'
```
## Best Practices
1. **Security**
- Change default Grafana password immediately
- Restrict AlertManager webhook URLs to known services
- Use network policies to limit access
2. **Performance**
- Monitor Prometheus disk usage regularly
- Adjust scrape intervals based on needs
- Use recording rules for complex queries
3. **Reliability**
- Enable persistent storage (PVC)
- Configure alert routing and escalation
- Regular backup of Grafana dashboards and configs
4. **Organization**
- Label all ServiceMonitors with `release: prometheus`
- Use consistent naming conventions
- Document custom dashboards and alerts
5. **Cost Optimization**
- Remove unused scrape targets
- Tune scrape intervals (don't scrape more than needed)
- Set appropriate retention periods
## Support
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Prometheus Operator GitHub](https://github.com/prometheus-operator/prometheus-operator)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)


@@ -35,3 +35,9 @@ extra_packages=btop,vim,tmux,net-tools,dnsutils,iotop,ncdu,tree,jq
# Compute Blade Agent configuration
# Set to false to skip compute-blade-agent deployment on specific nodes
enable_compute_blade_agent=true
# Enable Prometheus Operator monitoring stack
enable_prometheus_operator=true
grafana_admin_password=SecurePassword123!
prometheus_storage_size=10Gi
prometheus_retention_days=7


@@ -100,19 +100,19 @@ spec:
---
# Optional ServiceMonitor for Prometheus (requires prometheus-operator)
# Uncomment this section if you have Prometheus installed with the operator
#
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
#   name: compute-blade-agent
#   namespace: compute-blade-agent
#   labels:
#     app: compute-blade-agent
# spec:
#   selector:
#     matchLabels:
#       app: compute-blade-agent
#   endpoints:
#     - port: metrics
#       interval: 30s
#       path: /metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent
spec:
  selector:
    matchLabels:
      app: compute-blade-agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics


@@ -0,0 +1,68 @@
---
# Prometheus Operator configuration defaults
# Enable/disable Prometheus Operator installation
enable_prometheus_operator: true
# Grafana admin password (change in production!)
grafana_admin_password: "admin"
# Kubeconfig path for kubectl access
kubeconfig_path: "/etc/rancher/k3s/k3s.yaml"
# Prometheus configuration
prometheus_retention_days: 7
prometheus_storage_size: "10Gi"
# Grafana configuration
grafana_storage_size: "5Gi"
grafana_admin_user: "admin"
# AlertManager configuration
alertmanager_storage_size: "5Gi"
# Node selector for Prometheus components (optional)
# Set to restrict Prometheus to specific nodes
prometheus_node_selector: {}
# Example:
# prometheus_node_selector:
#   node-type: monitoring
# Resource requests and limits
prometheus_cpu_request: "250m"
prometheus_cpu_limit: "500m"
prometheus_memory_request: "512Mi"
prometheus_memory_limit: "1Gi"
grafana_cpu_request: "100m"
grafana_cpu_limit: "200m"
grafana_memory_request: "256Mi"
grafana_memory_limit: "512Mi"
alertmanager_cpu_request: "100m"
alertmanager_cpu_limit: "200m"
alertmanager_memory_request: "256Mi"
alertmanager_memory_limit: "512Mi"
# Scrape interval configuration
prometheus_scrape_interval: "30s"
prometheus_scrape_timeout: "10s"
prometheus_evaluation_interval: "30s"
# Service Monitor label selector
prometheus_service_monitor_selector: {}
prometheus_pod_monitor_selector: {}
# Enable/disable components
enable_grafana: true
enable_alertmanager: true
enable_prometheus_node_exporter: true
enable_kube_state_metrics: true
# Helm values for fine-tuning
prometheus_helm_values: {}
# Example:
# prometheus_helm_values:
#   prometheus:
#     prometheusSpec:
#       retention: "15d"


@@ -0,0 +1,140 @@
---
- name: Skip Prometheus Operator installation if disabled
  debug:
    msg: 'Prometheus Operator installation is disabled for this cluster'
  when: not enable_prometheus_operator | bool

- name: Block for Prometheus Operator installation
  block:
    - name: Check if Helm is installed locally
      shell: which helm
      register: helm_check
      changed_when: false
      failed_when: false
      delegate_to: localhost
      become: false

    - name: Install Helm if not found
      shell: |
        curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
      when: helm_check.rc != 0
      delegate_to: localhost
      become: false
      changed_when: true

    - name: Add Prometheus Helm repository
      kubernetes.core.helm_repository:
        name: prometheus-community
        repo_url: https://prometheus-community.github.io/helm-charts
        state: present
      delegate_to: localhost
      become: false

    - name: Update Helm repositories
      shell: helm repo update
      changed_when: true
      delegate_to: localhost
      become: false

    - name: Create monitoring namespace
      shell: kubectl create namespace monitoring --kubeconfig={{ playbook_dir }}/kubeconfig 2>/dev/null || true
      changed_when: false
      delegate_to: localhost
      become: false

    - name: Install Prometheus Operator via Helm
      shell: |
        helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
          --namespace monitoring \
          --create-namespace \
          --set prometheus.prometheusSpec.retention={{ prometheus_retention_days }}d \
          --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
          --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage={{ prometheus_storage_size }} \
          --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
          --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
          --set grafana.enabled=true \
          --set grafana.adminPassword="{{ grafana_admin_password | default('admin') }}" \
          --set grafana.persistence.enabled=true \
          --set grafana.persistence.size={{ grafana_storage_size }} \
          --set alertmanager.enabled=true \
          --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
          --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage={{ alertmanager_storage_size }} \
          --kubeconfig={{ playbook_dir }}/kubeconfig
      environment:
        KUBECONFIG: '{{ playbook_dir }}/kubeconfig'
      register: helm_install_result
      delegate_to: localhost
      become: false
      changed_when: "'has been upgraded' in helm_install_result.stdout or 'has been installed' in helm_install_result.stdout"

    - name: Wait for Prometheus Operator to be ready
      shell: kubectl rollout status deployment/prometheus-kube-prometheus-operator -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
      register: operator_rollout
      until: operator_rollout.rc == 0
      retries: 5
      delay: 10
      delegate_to: localhost
      become: false
      changed_when: false

    - name: Wait for Prometheus to be ready
      shell: kubectl rollout status statefulset/prometheus-prometheus-kube-prometheus-prometheus -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
      register: prometheus_rollout
      until: prometheus_rollout.rc == 0
      retries: 5
      delay: 10
      delegate_to: localhost
      become: false
      changed_when: false

    - name: Wait for Grafana to be ready
      shell: kubectl rollout status deployment/prometheus-grafana -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
      register: grafana_rollout
      until: grafana_rollout.rc == 0
      retries: 5
      delay: 10
      delegate_to: localhost
      become: false
      changed_when: false

    - name: Generate compute-blade-agent monitoring resources
      template:
        src: compute-blade-agent-monitoring.j2
        dest: /tmp/compute-blade-agent-monitoring.yaml
      when: enable_compute_blade_agent | bool
      delegate_to: localhost
      become: false

    - name: Deploy compute-blade-agent monitoring resources
      shell: kubectl apply -f /tmp/compute-blade-agent-monitoring.yaml --kubeconfig={{ playbook_dir }}/kubeconfig
      when: enable_compute_blade_agent | bool
      delegate_to: localhost
      become: false
      register: result
      changed_when: "'created' in result.stdout or 'configured' in result.stdout"

    - name: Wait for compute-blade-agent ServiceMonitor to be picked up
      pause:
        seconds: 30
      when: enable_compute_blade_agent | bool

    - name: Verify Prometheus targets
      shell: kubectl get service prometheus-operated -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig -o jsonpath='{.metadata.name}'
      register: prometheus_service
      delegate_to: localhost
      become: false
      changed_when: false

    - name: Display Prometheus Operator installation details
      debug:
        msg:
          - 'Prometheus Operator has been successfully installed'
          - 'Namespace: monitoring'
          - 'Prometheus: Available at prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090'
          - 'Grafana: Available at prometheus-grafana.monitoring.svc.cluster.local:80'
          - 'AlertManager: Available at prometheus-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093'
          - "Default Grafana admin password: {{ grafana_admin_password | default('admin') }}"
          - ''
          - 'To access Prometheus UI:'
          - '  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090'
          - ''
          - 'To access Grafana:'
          - '  kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80'
          - ''
          - 'To access AlertManager:'
          - '  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093'
  when: enable_prometheus_operator | bool


@@ -0,0 +1,91 @@
---
# Namespace for compute-blade-agent (created first so the resources below can land in it)
apiVersion: v1
kind: Namespace
metadata:
  name: compute-blade-agent
  labels:
    name: compute-blade-agent
---
# ServiceMonitor for compute-blade-agent metrics collection
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent
    release: prometheus
spec:
  selector:
    matchLabels:
      app: compute-blade-agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
---
# PrometheusRule for compute-blade-agent alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: compute-blade-agent.rules
      interval: 30s
      rules:
        - alert: ComputeBladeAgentHighTemperature
          expr: compute_blade_temperature_celsius > 80
          for: 5m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade high temperature detected on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade temperature is {% raw %}{{ $value }}{% endraw %}°C (threshold: 80°C) on node {% raw %}{{ $labels.instance }}{% endraw %}"
        - alert: ComputeBladeAgentCriticalTemperature
          expr: compute_blade_temperature_celsius > 95
          for: 2m
          labels:
            severity: critical
            component: compute-blade-agent
          annotations:
            summary: "Compute blade CRITICAL temperature on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade temperature is {% raw %}{{ $value }}{% endraw %}°C (CRITICAL threshold: 95°C) on node {% raw %}{{ $labels.instance }}{% endraw %}"
        - alert: ComputeBladeAgentDown
          expr: up{job="compute-blade-agent"} == 0
          for: 5m
          labels:
            severity: critical
            component: compute-blade-agent
          annotations:
            summary: "Compute blade agent is down on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade agent has been unreachable for more than 5 minutes on {% raw %}{{ $labels.instance }}{% endraw %}"
        - alert: ComputeBladeAgentFanFailure
          expr: compute_blade_fan_rpm == 0
          for: 5m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade fan failure detected on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade fan is not running on {% raw %}{{ $labels.instance }}{% endraw %}"
        - alert: ComputeBladeAgentHighFanSpeed
          expr: compute_blade_fan_speed_percent > 90
          for: 10m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade fan running at high speed on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade fan speed is {% raw %}{{ $value }}{% endraw %}% (threshold: 90%) on {% raw %}{{ $labels.instance }}{% endraw %}"


@@ -40,7 +40,7 @@
      - agent
      - worker

- name: Install compute-blade-agent on workers
- name: Install compute-blade-agent on all nodes
  hosts: all
  become: true
  roles:
@@ -49,6 +49,16 @@
      - compute-blade-agent
      - blade-agent

- name: Install Prometheus Operator
  hosts: "{{ groups['master'][0] }}"
  gather_facts: false
  become: false
  roles:
    - role: prometheus-operator
      tags:
        - prometheus-operator
        - monitoring

- name: Deploy test applications
  hosts: "{{ groups['master'][0] }}"
  gather_facts: true