Add Prometheus Operator role and templates
`PROMETHEUS_MONITORING.md` — new file (+732 lines)
# Prometheus Operator & Monitoring Guide

Complete guide for deploying and managing monitoring infrastructure with Prometheus Operator, Grafana, and AlertManager in your k3s-ansible cluster.

## Table of Contents

- [Overview](#overview)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Configuration](#configuration)
- [Accessing Components](#accessing-components)
- [Monitoring compute-blade-agent](#monitoring-compute-blade-agent)
- [Custom ServiceMonitors](#custom-servicemonitors)
- [Alerting](#alerting)
- [Grafana Dashboards](#grafana-dashboards)
- [Troubleshooting](#troubleshooting)

## Overview

The Prometheus Operator installation includes:

- **Prometheus**: Time-series database and scraping engine
- **Grafana**: Visualization and dashboarding platform
- **AlertManager**: Alert routing and management
- **Node Exporter**: Hardware and OS metrics
- **kube-state-metrics**: Kubernetes cluster metrics
- **Prometheus Operator**: CRD controller for managing Prometheus resources

### Architecture

```
      ┌─────────────────────────────────────────┐
      │           Prometheus Operator           │
      │ (CRD controller, monitoring namespace)  │
      └─────────────────────────────────────────┘
                          ↓
          ┌───────────────┼───────────────┐
          ↓               ↓               ↓
      Prometheus       Grafana      AlertManager
        (9090)          (3000)         (9093)
          ↓               ↓               ↓
          └───────────────┼───────────────┘
                          ↓
              ┌───────────────────────┐
              │    ServiceMonitors    │
              │    PrometheusRules    │
              │     AlertingRules     │
              └───────────────────────┘
                          ↓
          ┌───────────────┼───────────────┐
          ↓               ↓               ↓
        Scrape          Scrape          Scrape
       Targets         Targets         Targets
```
## Quick Start

### Deploy Everything

```bash
# 1. Enable Prometheus Operator in inventory
# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=true
enable_compute_blade_agent=true

# 2. Run Ansible playbook
ansible-playbook site.yml --tags prometheus-operator

# 3. Wait for components to be ready
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n monitoring --timeout=300s

# 4. Access Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# 5. Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# 6. Access AlertManager
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093
```

Then open:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (default: admin/admin)
- AlertManager: http://localhost:9093
### Disable Prometheus Operator

```bash
# Edit inventory/hosts.ini
[k3s_cluster:vars]
enable_prometheus_operator=false

# Re-run playbook (the Prometheus stack won't be installed)
ansible-playbook site.yml --tags prometheus-operator
```

## Installation

### Prerequisites

- K3s cluster already deployed with k3s-ansible
- kubectl access to the cluster
- Helm 3.x installed on the control machine

### Step-by-Step Installation

#### 1. Configure Inventory

Edit `inventory/hosts.ini`:

```ini
[k3s_cluster:vars]
# Enable Prometheus Operator
enable_prometheus_operator=true

# (Optional) Set Grafana admin password
grafana_admin_password=MySecurePassword123!

# Enable compute-blade-agent monitoring
enable_compute_blade_agent=true
```
#### 2. Run the Playbook

```bash
# Install only Prometheus Operator
ansible-playbook site.yml --tags prometheus-operator

# Or deploy everything including K3s
ansible-playbook site.yml
```

#### 3. Verify Installation

```bash
# Check that the monitoring namespace exists
kubectl get namespace monitoring

# Check the Prometheus Operator deployment
kubectl get deployment -n monitoring

# Check all monitoring components
kubectl get all -n monitoring

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod --all -n monitoring --timeout=300s
```

Expected output:

```
NAME                                                READY   STATUS    RESTARTS   AGE
pod/prometheus-operator-5f8d4b5c7d-x9k2l            1/1     Running   0          2m
pod/prometheus-kube-prometheus-prometheus-0         2/2     Running   0          1m
pod/prometheus-kube-state-metrics-7c9d5f8c4-m2k9n   1/1     Running   0          2m
pod/prometheus-node-exporter-pz8kl                  1/1     Running   0          2m
pod/prometheus-grafana-5f8d7b5c9e-z1q3x             3/3     Running   0          1m
pod/prometheus-kube-alertmanager-0                  2/2     Running   0          1m
```
## Configuration

### Environment Variables

Configure via `inventory/hosts.ini`:

```ini
[k3s_cluster:vars]
# Enable/disable monitoring stack
enable_prometheus_operator=true

# Grafana configuration
grafana_admin_password=SecurePassword123!
grafana_admin_user=admin
grafana_storage_size=5Gi

# Prometheus configuration
prometheus_retention_days=7
prometheus_storage_size=10Gi
prometheus_scrape_interval=30s
prometheus_scrape_timeout=10s

# AlertManager configuration
alertmanager_storage_size=5Gi

# Component flags
enable_grafana=true
enable_alertmanager=true
enable_prometheus_node_exporter=true
enable_kube_state_metrics=true
```
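To pick sensible values for `prometheus_retention_days` and `prometheus_storage_size`, a rough back-of-the-envelope estimate helps. The sketch below assumes roughly 2 bytes per compressed sample, which is a common rule of thumb, not a guarantee; the active-series count is whatever your cluster actually exposes:

```python
def estimate_prometheus_disk_bytes(active_series, scrape_interval_s, retention_days,
                                   bytes_per_sample=2.0):
    """Rough TSDB disk estimate: samples ingested over the retention window
    times an assumed ~2 bytes per compressed sample (assumption, not exact)."""
    samples_per_series_per_day = 86400 / scrape_interval_s
    return int(active_series * samples_per_series_per_day * retention_days * bytes_per_sample)

# ~10k active series at a 30s interval for 7 days -> roughly 0.4 GB,
# comfortably inside the 10Gi default above.
print(estimate_prometheus_disk_bytes(10_000, 30, 7))  # 403200000 bytes
```

Doubling the scrape interval halves the estimate, which is why the troubleshooting section below suggests longer intervals when disk or memory gets tight.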
### Per-Node Configuration

To restrict Prometheus to specific nodes:

```ini
[k3s_cluster:vars]
prometheus_node_selector={"node-type": "monitoring"}
```

Or via inventory host vars:

```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi enable_prometheus_operator=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi enable_prometheus_operator=false
```

### Resource Limits

Control resource usage in `inventory/hosts.ini`:

```ini
prometheus_cpu_request=250m
prometheus_cpu_limit=500m
prometheus_memory_request=512Mi
prometheus_memory_limit=1Gi

grafana_cpu_request=100m
grafana_cpu_limit=200m
grafana_memory_request=256Mi
grafana_memory_limit=512Mi
```
## Accessing Components

### Prometheus Web UI

```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Access at: http://localhost:9090
```

**Available from within the cluster:**
```
http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```

**Features:**
- Query builder
- Target health status
- Alert rules
- Service discovery
- Graph visualization

### Grafana Dashboards

```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Access at: http://localhost:3000
# Default credentials: admin / admin (or your configured password)
```

**Available from within the cluster:**
```
http://prometheus-grafana.monitoring.svc.cluster.local:80
```

**Pre-installed Dashboards:**
1. Kubernetes / Cluster Monitoring
2. Kubernetes / Nodes
3. Kubernetes / Pods
4. Kubernetes / Deployments, StatefulSets, DaemonSets
5. Node Exporter for Prometheus Dashboard

**Custom Dashboards:**
- Import from grafana.com
- Create custom dashboards
- Connect to the Prometheus data source

### AlertManager

```bash
# Port-forward to localhost
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093

# Access at: http://localhost:9093
```

**Available from within the cluster:**
```
http://prometheus-kube-alertmanager.monitoring.svc.cluster.local:9093
```

**Features:**
- Alert grouping and deduplication
- Alert routing rules
- Notification management
- Alert silencing

### Verify Network Connectivity

```bash
# Test from within the cluster
kubectl run debug --image=busybox -it --rm -- sh

# Inside the pod:
wget -O- http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090/-/healthy
wget -O- http://prometheus-grafana.monitoring.svc:80/api/health
wget -O- http://prometheus-kube-alertmanager.monitoring.svc:9093/-/healthy
```
## Monitoring compute-blade-agent

### Automatic Integration

When both `enable_prometheus_operator` and `enable_compute_blade_agent` are enabled, the Ansible role automatically:

1. Creates the `compute-blade-agent` namespace
2. Deploys a ServiceMonitor for metrics scraping
3. Deploys a PrometheusRule for alerting
4. Configures Prometheus scrape targets

### Verify compute-blade-agent Monitoring

```bash
# Check that the ServiceMonitor was created
kubectl get servicemonitor -n compute-blade-agent

# Check that metrics are being scraped
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Then in the Prometheus UI:
# 1. Go to Status → Targets
# 2. Look for "compute-blade-agent" targets
# 3. They should show an "UP" status
```
### Available Metrics

```
# Temperature monitoring
compute_blade_temperature_celsius

# Fan monitoring
compute_blade_fan_rpm
compute_blade_fan_speed_percent

# Power monitoring
compute_blade_power_watts

# Status indicators
compute_blade_status
compute_blade_led_state
```
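As a starting point, a few PromQL queries over these metrics (the metric and job names are taken from this guide; adjust labels to your deployment):

```
# Hottest blade right now
max by (instance) (compute_blade_temperature_celsius)

# 5-minute average fan speed per blade
avg_over_time(compute_blade_fan_speed_percent[5m])

# Blades whose agent target is currently down
up{job="compute-blade-agent"} == 0
```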
### Create Custom Dashboard for compute-blade-agent

In Grafana:

1. Create a new dashboard
2. Add a panel with the query:

   ```
   compute_blade_temperature_celsius{job="compute-blade-agent"}
   ```

3. Set the visualization type to "Gauge" or "Graph"
4. Save the dashboard
## Custom ServiceMonitors

### Create a ServiceMonitor

Create `custom-servicemonitor.yml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitoring
  namespace: my-app
  labels:
    app: my-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
```

Deploy it:

```bash
kubectl apply -f custom-servicemonitor.yml
```
### Create a PrometheusRule

Create `custom-alerts.yml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: my-app.rules
      interval: 30s
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            rate(my_app_errors_total[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
            app: my-app
          annotations:
            summary: "High error rate in my-app"
            description: "Error rate is {{ $value }} on {{ $labels.instance }}"

        - alert: MyAppDown
          expr: |
            up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
            app: my-app
          annotations:
            summary: "my-app is down"
            description: "my-app on {{ $labels.instance }} is unreachable"
```

Deploy it:

```bash
kubectl apply -f custom-alerts.yml
```
## Alerting

### Pre-configured Alerts for compute-blade-agent

The following alerts are deployed automatically when compute-blade-agent monitoring is enabled:

1. **ComputeBladeAgentHighTemperature** (Warning)
   - Triggers when temperature > 80°C for 5 minutes

2. **ComputeBladeAgentCriticalTemperature** (Critical)
   - Triggers when temperature > 95°C for 2 minutes

3. **ComputeBladeAgentDown** (Critical)
   - Triggers when the agent is unreachable for 5 minutes

4. **ComputeBladeAgentFanFailure** (Warning)
   - Triggers when fan RPM = 0 for 5 minutes

5. **ComputeBladeAgentHighFanSpeed** (Warning)
   - Triggers when fan speed > 90% for 10 minutes
### View Active Alerts

```bash
# In the Prometheus UI:
# 1. Go to Alerts
# 2. See all active and pending alerts

# Or port-forward and open the alerts page directly:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then visit: http://localhost:9090/alerts
```

### Configure AlertManager Routing

Edit the AlertManager configuration:

```bash
# Get the current config
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Edit the configuration
kubectl edit secret -n monitoring alertmanager-kube-prometheus-alertmanager
```
Example routing configuration:

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 30m

    - match:
        severity: warning
      receiver: 'default'
      group_wait: 1m

receivers:
  - name: 'default'
    # Add email, slack, webhook, etc.

  - name: 'critical'
    # Add urgent notifications
```
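The first-match behavior of the child routes above can be sketched in a few lines. This is a simplification: real AlertManager also supports regex matchers, nested routes, and the `continue` flag, all of which this ignores:

```python
def route_alert(labels, routes, default_receiver="default"):
    """First-match walk over AlertManager-style child routes: the alert goes
    to the first route whose 'match' labels are all present on the alert,
    otherwise to the top-level default receiver."""
    for route in routes:
        if all(labels.get(k) == v for k, v in route.get("match", {}).items()):
            return route["receiver"]
    return default_receiver

# Mirrors the two child routes in the YAML above
routes = [
    {"match": {"severity": "critical"}, "receiver": "critical"},
    {"match": {"severity": "warning"}, "receiver": "default"},
]

print(route_alert({"alertname": "MyAppDown", "severity": "critical"}, routes))  # critical
print(route_alert({"alertname": "Heartbeat"}, routes))                          # default
```

This is why a critical alert gets the aggressive `group_wait: 0s` timing: it stops at the first matching child route and never falls through to the slower defaults.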
## Grafana Dashboards

### Import Pre-built Dashboards

1. Open Grafana: http://localhost:3000
2. Click "+" → Import
3. Enter a dashboard ID from grafana.com
4. Select the Prometheus data source
5. Click Import

**Recommended Dashboards:**

| ID | Name |
|----|------|
| 1860 | Node Exporter for Prometheus Dashboard |
| 6417 | Kubernetes Cluster Monitoring |
| 8588 | Kubernetes Deployment Statefulset Daemonset |
| 11074 | Node Exporter - Nodes |
| 12114 | Kubernetes cluster monitoring |

### Create Custom Dashboard

1. Click "+" → Dashboard
2. Click "Add new panel"
3. Configure the query:
   - Data source: Prometheus
   - Query: `up{job="compute-blade-agent"}`
4. Set the visualization (Graph, Gauge, Table, etc.)
5. Click Save

### Export Dashboard

```bash
# Get dashboard JSON (assumes Grafana is port-forwarded to localhost:3000)
curl http://admin:password@localhost:3000/api/dashboards/db/my-dashboard > my-dashboard.json

# Import elsewhere; note the import endpoint expects the JSON wrapped as
# {"dashboard": {...}, "overwrite": true}, not the raw export
curl -X POST -H "Content-Type: application/json" \
  -d @my-dashboard.json \
  http://admin:password@localhost:3000/api/dashboards/db
```
## Troubleshooting

### Prometheus Not Scraping Targets

```bash
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Visit: http://localhost:9090/targets
# Look for failed targets

# Check ServiceMonitors
kubectl get servicemonitor --all-namespaces

# Check the Prometheus config
kubectl get prometheus -n monitoring -o yaml

# View Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=50 -f
```
### Grafana Data Source Not Working

```bash
# Check Prometheus connectivity from Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# In Grafana:
# 1. Configuration → Data Sources
# 2. Click Prometheus
# 3. Check Status (green = working)
# 4. If red, check the URL and credentials

# Or check the logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana --tail=50 -f
```
### AlertManager Not Sending Notifications

```bash
# Check the AlertManager configuration
kubectl get secret -n monitoring alertmanager-kube-prometheus-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Restart AlertManager to apply changes
kubectl rollout restart statefulset -n monitoring prometheus-kube-alertmanager

# Check the logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 -f

# Inspect alerts in the UI
kubectl port-forward -n monitoring svc/prometheus-kube-alertmanager 9093:9093
# Visit: http://localhost:9093 to see alerts
```
### Disk Space Issues

```bash
# Check Prometheus PVC usage
kubectl get pvc -n monitoring

# View disk usage
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- df -h /prometheus

# Increase storage (requires a StorageClass with allowVolumeExpansion: true)
kubectl patch pvc -n monitoring prometheus-kube-prometheus-prometheus-db-prometheus-kube-prometheus-prometheus-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```
### High Memory/CPU Usage

```bash
# Check resource usage
kubectl top pod -n monitoring

# Reduce the retention period
kubectl edit prometheus -n monitoring

# Update in spec:
# retention: 3d  # Reduce from 7d to 3d

# Or scrape less frequently (a longer interval means fewer samples)
# scrapeInterval: 60s  # Raise from 30s to 60s
```
### ServiceMonitor Not Being Picked Up

```bash
# Check if the labels match
kubectl get servicemonitor --all-namespaces -o yaml | grep -A5 "release: prometheus"

# Check the Prometheus selector config
kubectl get prometheus -n monitoring -o yaml | grep -A5 "serviceMonitorSelector"

# Restart Prometheus if the config changed
kubectl rollout restart statefulset -n monitoring prometheus-kube-prometheus-prometheus

# Check the Prometheus logs for errors
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0 -c prometheus --tail=100 | grep -i "error\|failed"
```
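The label check in the first step boils down to Kubernetes `matchLabels` semantics: every key/value in the ServiceMonitor's selector must be present on the target Service, while extra Service labels are ignored. A small sketch (labels here are illustrative):

```python
def selector_matches(match_labels, service_labels):
    """ServiceMonitor spec.selector.matchLabels semantics: every selector
    key/value must appear on the target Service's labels; extra labels
    on the Service are fine."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())

svc_labels = {"app": "my-app", "team": "platform"}
print(selector_matches({"app": "my-app"}, svc_labels))     # True
print(selector_matches({"app": "other-app"}, svc_labels))  # False
```

Note this matches the Service's own labels, not the Pod's, and it is separate from the `release: prometheus` label that Prometheus uses to select which ServiceMonitors to load at all.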
## Maintenance

### Backup Grafana Dashboards

```bash
# Export all dashboards via the HTTP API
# (assumes Grafana is port-forwarded to localhost:3000)
for dashboard in $(curl -s http://admin:password@localhost:3000/api/search | jq -r '.[] | .uri'); do
  name=$(echo $dashboard | cut -d'/' -f2)
  curl -s http://admin:password@localhost:3000/api/dashboards/$dashboard > ${name}.json
done
```
### Update Prometheus Retention

```bash
# Edit the Prometheus resource
kubectl edit prometheus -n monitoring

# Update the retention field under spec:
#   retention: "30d"  # Change from 7d to 30d

# The operator applies the change automatically
```
### Scale Prometheus Resources

```bash
# For high-load environments, increase resources
kubectl patch prometheus -n monitoring prometheus-kube-prometheus-prometheus --type merge \
  -p '{"spec":{"resources":{"requests":{"cpu":"500m","memory":"1Gi"},"limits":{"cpu":"1000m","memory":"2Gi"}}}}'
```
## Best Practices

1. **Security**
   - Change the default Grafana password immediately
   - Restrict AlertManager webhook URLs to known services
   - Use network policies to limit access

2. **Performance**
   - Monitor Prometheus disk usage regularly
   - Adjust scrape intervals based on needs
   - Use recording rules for complex queries

3. **Reliability**
   - Enable persistent storage (PVC)
   - Configure alert routing and escalation
   - Back up Grafana dashboards and configs regularly

4. **Organization**
   - Label all ServiceMonitors with `release: prometheus`
   - Use consistent naming conventions
   - Document custom dashboards and alerts

5. **Cost Optimization**
   - Remove unused scrape targets
   - Tune scrape intervals (don't scrape more often than needed)
   - Set appropriate retention periods
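The recording-rule suggestion under Performance can look like the following. The rule name and expression are illustrative, not part of this role; deploy it like any other PrometheusRule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: blade.recording
      interval: 1m
      rules:
        # Precompute the per-node maximum temperature so dashboards query
        # one cheap series instead of re-aggregating on every refresh
        - record: instance:compute_blade_temperature_celsius:max
          expr: max by (instance) (compute_blade_temperature_celsius)
```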
## Support

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Prometheus Operator GitHub](https://github.com/prometheus-operator/prometheus-operator)
- [Grafana Documentation](https://grafana.com/docs/)
- [AlertManager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
@@ -35,3 +35,9 @@ extra_packages=btop,vim,tmux,net-tools,dnsutils,iotop,ncdu,tree,jq
# Compute Blade Agent configuration
# Set to false to skip compute-blade-agent deployment on specific nodes
enable_compute_blade_agent=true

# Enable the Prometheus Operator monitoring stack
enable_prometheus_operator=true
grafana_admin_password=SecurePassword123!
prometheus_storage_size=10Gi
prometheus_retention_days=7
@@ -100,19 +100,19 @@ spec:
---
# Optional ServiceMonitor for Prometheus (requires prometheus-operator)
# Uncomment this section if you have Prometheus installed with the operator
#
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
#   name: compute-blade-agent
#   namespace: compute-blade-agent
#   labels:
#     app: compute-blade-agent
# spec:
#   selector:
#     matchLabels:
#       app: compute-blade-agent
#   endpoints:
#     - port: metrics
#       interval: 30s
#       path: /metrics

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent
spec:
  selector:
    matchLabels:
      app: compute-blade-agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
`roles/prometheus-operator/defaults/main.yml` — new file (+68 lines)
---
# Prometheus Operator configuration defaults

# Enable/disable Prometheus Operator installation
enable_prometheus_operator: true

# Grafana admin password (change in production!)
grafana_admin_password: "admin"

# Kubeconfig path for kubectl access
kubeconfig_path: "/etc/rancher/k3s/k3s.yaml"

# Prometheus configuration
prometheus_retention_days: 7
prometheus_storage_size: "10Gi"

# Grafana configuration
grafana_storage_size: "5Gi"
grafana_admin_user: "admin"

# AlertManager configuration
alertmanager_storage_size: "5Gi"

# Node selector for Prometheus components (optional)
# Set to restrict Prometheus to specific nodes
prometheus_node_selector: {}
# Example:
# prometheus_node_selector:
#   node-type: monitoring

# Resource requests and limits
prometheus_cpu_request: "250m"
prometheus_cpu_limit: "500m"
prometheus_memory_request: "512Mi"
prometheus_memory_limit: "1Gi"

grafana_cpu_request: "100m"
grafana_cpu_limit: "200m"
grafana_memory_request: "256Mi"
grafana_memory_limit: "512Mi"

alertmanager_cpu_request: "100m"
alertmanager_cpu_limit: "200m"
alertmanager_memory_request: "256Mi"
alertmanager_memory_limit: "512Mi"

# Scrape interval configuration
prometheus_scrape_interval: "30s"
prometheus_scrape_timeout: "10s"
prometheus_evaluation_interval: "30s"

# ServiceMonitor / PodMonitor label selectors
prometheus_service_monitor_selector: {}
prometheus_pod_monitor_selector: {}

# Enable/disable components
enable_grafana: true
enable_alertmanager: true
enable_prometheus_node_exporter: true
enable_kube_state_metrics: true

# Helm values for fine-tuning
prometheus_helm_values: {}
# Example:
# prometheus_helm_values:
#   prometheus:
#     prometheusSpec:
#       retention: "15d"
`roles/prometheus-operator/tasks/main.yml` — new file (+140 lines)
---
|
||||
- name: Skip Prometheus Operator installation if disabled
|
||||
debug:
|
||||
msg: 'Prometheus Operator installation is disabled for this cluster'
|
||||
when: not enable_prometheus_operator | bool
|
||||
|
||||
- name: Block for Prometheus Operator installation
|
||||
block:
|
||||
- name: Check if Helm is installed locally
|
||||
shell: which helm
|
||||
register: helm_check
|
||||
changed_when: false
|
||||
failed_when: false
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Install Helm if not found
|
||||
shell: |
|
||||
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
|
||||
when: helm_check.rc != 0
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: true
|
||||
|
||||
- name: Add Prometheus Helm repository
|
||||
kubernetes.core.helm_repository:
|
||||
name: prometheus-community
|
||||
repo_url: https://prometheus-community.github.io/helm-charts
|
||||
state: present
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Update Helm repositories
|
||||
shell: helm repo update
|
||||
changed_when: true
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Create monitoring namespace
|
||||
shell: kubectl create namespace monitoring --kubeconfig={{ playbook_dir }}/kubeconfig 2>/dev/null || true
|
||||
changed_when: false
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
|
||||
- name: Install Prometheus Operator via Helm
|
||||
shell: |
|
||||
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
|
||||
--namespace monitoring \
|
||||
--create-namespace \
|
||||
--set prometheus.prometheusSpec.retention={{ prometheus_retention_days }}d \
|
||||
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
|
||||
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage={{ prometheus_storage_size }} \
|
||||
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
|
||||
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
|
||||
--set grafana.enabled=true \
|
||||
--set grafana.adminPassword="{{ grafana_admin_password | default('admin') }}" \
|
||||
--set grafana.persistence.enabled=true \
|
||||
--set grafana.persistence.size={{ grafana_storage_size }} \
|
||||
--set alertmanager.enabled=true \
|
||||
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
|
||||
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage={{ alertmanager_storage_size }} \
|
||||
--kubeconfig={{ playbook_dir }}/kubeconfig
|
||||
environment:
|
||||
KUBECONFIG: '{{ playbook_dir }}/kubeconfig'
|
||||
register: helm_install_result
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: "'has been upgraded' in helm_install_result.stdout or 'has been installed' in helm_install_result.stdout"
|
||||
|
||||
- name: Wait for Prometheus Operator to be ready
|
||||
shell: kubectl rollout status deployment/prometheus-kube-prometheus-operator -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
|
||||
retries: 5
|
||||
delay: 10
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: false
|
||||
|
||||
- name: Wait for Prometheus to be ready
|
||||
shell: kubectl rollout status statefulset/prometheus-prometheus-kube-prometheus-prometheus -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
|
||||
retries: 5
|
||||
delay: 10
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: false
|
||||
|
||||
- name: Wait for Grafana to be ready
|
||||
shell: kubectl rollout status deployment/prometheus-grafana -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig --timeout=300s
|
||||
retries: 5
|
||||
delay: 10
|
||||
delegate_to: localhost
|
||||
become: false
|
||||
changed_when: false
|
||||
|
||||
- name: Generate compute-blade-agent monitoring resources
|
      template:
        src: compute-blade-agent-monitoring.j2
        dest: /tmp/compute-blade-agent-monitoring.yaml
      when: enable_compute_blade_agent | bool
      delegate_to: localhost
      become: false

    - name: Deploy compute-blade-agent monitoring resources
      shell: kubectl apply -f /tmp/compute-blade-agent-monitoring.yaml --kubeconfig={{ playbook_dir }}/kubeconfig
      when: enable_compute_blade_agent | bool
      delegate_to: localhost
      become: false
      register: result
      changed_when: "'created' in result.stdout or 'configured' in result.stdout"

    - name: Wait for compute-blade-agent ServiceMonitor to be picked up
      pause:
        seconds: 30

    - name: Verify Prometheus targets
      shell: kubectl get service prometheus-operated -n monitoring --kubeconfig={{ playbook_dir }}/kubeconfig -o jsonpath='{.metadata.name}'
      register: prometheus_service
      delegate_to: localhost
      become: false
      changed_when: false

    - name: Display Prometheus Operator installation details
      debug:
        msg:
          - 'Prometheus Operator has been successfully installed'
          - 'Namespace: monitoring'
          - 'Prometheus: Available at prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090'
          - 'Grafana: Available at prometheus-grafana.monitoring.svc.cluster.local:80'
          - 'AlertManager: Available at prometheus-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093'
          - "Default Grafana admin password: {{ grafana_admin_password | default('admin') }}"
          - ''
          - 'To access Prometheus UI:'
          - '  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090'
          - ''
          - 'To access Grafana:'
          - '  kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80'
          - ''
          - 'To access AlertManager:'
          - '  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093'

  when: enable_prometheus_operator | bool
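The tasks above are gated on two booleans and read an optional Grafana password. A minimal inventory override might look like this (the file location is a common Ansible convention, not something this commit specifies):

```yaml
# group_vars/all.yml (hypothetical path) -- variables referenced by the tasks above
enable_prometheus_operator: true     # deploys the whole monitoring stack
enable_compute_blade_agent: true     # also templates and applies the blade monitoring manifests
grafana_admin_password: "change-me"  # optional; the debug task falls back to 'admin' when unset
```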
compute-blade-agent-monitoring.j2
@@ -0,0 +1,91 @@
---
# Namespace for compute-blade-agent (created first so the resources below can land in it)
apiVersion: v1
kind: Namespace
metadata:
  name: compute-blade-agent
  labels:
    name: compute-blade-agent
---
# ServiceMonitor for compute-blade-agent metrics collection
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent
    release: prometheus
spec:
  selector:
    matchLabels:
      app: compute-blade-agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
---
# PrometheusRule for compute-blade-agent alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: compute-blade-agent.rules
      interval: 30s
      rules:
        - alert: ComputeBladeAgentHighTemperature
          expr: compute_blade_temperature_celsius > 80
          for: 5m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade high temperature detected on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade temperature is {% raw %}{{ $value }}{% endraw %}°C (threshold: 80°C) on node {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentCriticalTemperature
          expr: compute_blade_temperature_celsius > 95
          for: 2m
          labels:
            severity: critical
            component: compute-blade-agent
          annotations:
            summary: "Compute blade CRITICAL temperature on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade temperature is {% raw %}{{ $value }}{% endraw %}°C (CRITICAL threshold: 95°C) on node {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentDown
          expr: up{job="compute-blade-agent"} == 0
          for: 5m
          labels:
            severity: critical
            component: compute-blade-agent
          annotations:
            summary: "Compute blade agent is down on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade agent has been unreachable for more than 5 minutes on {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentFanFailure
          expr: compute_blade_fan_rpm == 0
          for: 5m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade fan failure detected on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade fan is not running on {% raw %}{{ $labels.instance }}{% endraw %}"

        - alert: ComputeBladeAgentHighFanSpeed
          expr: compute_blade_fan_speed_percent > 90
          for: 10m
          labels:
            severity: warning
            component: compute-blade-agent
          annotations:
            summary: "Compute blade fan running at high speed on {% raw %}{{ $labels.instance }}{% endraw %}"
            description: "Compute blade fan speed is {% raw %}{{ $value }}{% endraw %}% (threshold: 90%) on {% raw %}{{ $labels.instance }}{% endraw %}"
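The ServiceMonitor selects Services by label and scrapes a port literally named `metrics`; no such Service is part of this commit. A hedged sketch of what it assumes (the port number is a placeholder, not taken from the agent's actual configuration):

```yaml
# Hypothetical Service the ServiceMonitor's selector would match -- not part of
# this commit. Replace the port number with whatever compute-blade-agent listens on.
apiVersion: v1
kind: Service
metadata:
  name: compute-blade-agent
  namespace: compute-blade-agent
  labels:
    app: compute-blade-agent   # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: compute-blade-agent
  ports:
    - name: metrics            # the ServiceMonitor endpoint refers to this port by name
      port: 8080               # placeholder value
      targetPort: 8080
```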
12 site.yml
@@ -40,7 +40,7 @@
         - agent
         - worker
 
-- name: Install compute-blade-agent on workers
+- name: Install compute-blade-agent on all nodes
   hosts: all
   become: true
   roles:
@@ -49,6 +49,16 @@
         - compute-blade-agent
         - blade-agent
 
+- name: Install Prometheus Operator
+  hosts: "{{ groups['master'][0] }}"
+  gather_facts: false
+  become: false
+  roles:
+    - role: prometheus-operator
+      tags:
+        - prometheus-operator
+        - monitoring
+
 - name: Deploy test applications
   hosts: "{{ groups['master'][0] }}"
   gather_facts: true
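With the new play in `site.yml`, the monitoring stack can be applied (or re-applied) on its own via the tags it declares. The inventory path below follows the usual k3s-ansible layout; adjust it to your own setup:

```shell
# Full run, including the new Prometheus Operator play
ansible-playbook site.yml -i inventory/my-cluster/hosts.ini

# Re-run only the monitoring pieces, using the tags declared on the new play
ansible-playbook site.yml -i inventory/my-cluster/hosts.ini --tags monitoring
```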