Fix K3s upgrade support and add monitoring dashboards

- Remove 'when: not k3s_binary.stat.exists' condition from k3s-server and
  k3s-agent installation tasks to allow in-place upgrades of K3s versions
- Update task names to reflect both install and upgrade functionality
- Add change detection using stdout inspection for better Ansible reporting

Add InfluxDB v2 native dashboard alongside Grafana dashboard:
- Create influxdb/rpi-cluster-dashboard-v2.json for InfluxDB 2.8 compatibility
- Update Grafana dashboard datasource UID from 'influx' to 'influxdb'
- Remove unused disk usage and network traffic panels per user request

Update worker node discovery in compute-blade-agent verification script:
- Fix pattern matching to work with cm4-* node naming convention
- Add support for pi-worker and cb-0* patterns as fallbacks
- Now correctly parses [worker] section from inventory

Update inventory version documentation:
- Add comment explaining how to use 'latest' for auto-updates
- Set version to v1.35.0+k3s1 (updated from v1.34.2+k3s1)
- Add guidance on version format for users

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-01-08 16:28:26 +01:00
parent ddf7dd93b5
commit eb800cd4e3
7 changed files with 793 additions and 24 deletions

README.md

@@ -159,40 +159,73 @@ sudo journalctl -u telegraf -f
Once configured, metrics will appear in your InfluxDB instance under the `rpi-cluster` bucket with tags for each node hostname and node type (master/worker).
### Grafana Dashboard for Telegraf Metrics
### Monitoring Dashboards
A pre-built Grafana dashboard is included to visualize all collected metrics. The dashboard displays:
Two pre-built dashboards are available for visualizing your cluster metrics:
#### Grafana Dashboard
A comprehensive Grafana dashboard with interactive visualizations:
- CPU usage across all nodes
- Memory usage (percentage)
- CPU temperature (Raspberry Pi specific)
- System load averages
- Disk usage
- Network traffic
**Import the Dashboard:**
**Import to Grafana:**
1. Open Grafana and go to **Dashboards** → **New** → **Import**
2. Upload the dashboard file: `grafana/rpi-cluster-dashboard.json`
3. Select your InfluxDB datasource (must be named `influx`)
3. Your InfluxDB datasource (named `influxdb`) will be automatically selected
4. Click **Import**
**Datasource Requirements:**
The dashboard expects your InfluxDB datasource in Grafana to be named exactly `influx`. If your datasource has a different name, either:
- Rename your datasource in Grafana settings, or
- Edit the dashboard JSON and replace all `"uid": "influx"` references with your datasource name
**Customize the Dashboard:**
**Customize the Grafana Dashboard:**
You can modify the dashboard after import to:
- Adjust time ranges (default: last 6 hours)
- Add alerts for high CPU/temperature/memory
- Add more panels for network metrics
- Add more panels for additional metrics
- Create node-specific views using Grafana variables
#### InfluxDB Dashboard
A native InfluxDB 2.x dashboard with built-in gauges and time series:
- CPU usage gauge (average)
- Memory usage gauge (average)
- CPU usage time series (6-hour view)
- Memory usage time series (6-hour view)
- CPU temperature trend
- System load trend
**Import to InfluxDB 2.8:**
**Via UI (Recommended):**
1. Open InfluxDB UI at `http://your-influxdb-host:8086`
2. Go to **Dashboards** (left sidebar)
3. Click **Create Dashboard** → **From a Template**
4. Click **Paste JSON**
5. Copy and paste the contents of `influxdb/rpi-cluster-dashboard-v2.json`
6. Click **Create Dashboard**
**Via CLI:**
```bash
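# The influx CLI imports resources with `influx apply`;
# this assumes the file is in InfluxDB template format
influx apply \
--org family \
--file influxdb/rpi-cluster-dashboard-v2.json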
```
**Benefits of InfluxDB Dashboard:**
- Native integration - no external datasource configuration needed
- Built-in alert support
- Real-time data without polling delays
- Direct access to raw data and queries
- InfluxDB 2.8 compatible
### Deploy K3s Cluster
```bash
@@ -469,6 +502,128 @@ kubectl delete -f manifests/nginx-test-deployment.yaml
## Maintenance
### Updating the Cluster
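K3s is installed by the upstream install script (`https://get.k3s.io`) rather than the system package manager, so updates are applied by re-running that script, either through the playbook or by hand. There are several ways to update your cluster: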
#### Option 1: Automatic Updates (Recommended)
To track the newest release on all nodes instead of pinning a version:
1. Add the following to your inventory `hosts.ini`:
```ini
[k3s_cluster:vars]
k3s_version=latest
```
2. Re-run the k3s installation playbook:
```bash
ansible-playbook site.yml --tags k3s-server,k3s-agent
```
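With `latest`, a newer version (typically a patch release) is picked up only when the playbook runs. K3s does not update itself between runs, so re-run the playbook whenever you want to apply a new release.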
#### Option 2: Manual Update to Specific Version
To update to a specific k3s version:
1. Update the `k3s_version` variable in `inventory/hosts.ini`:
```ini
[k3s_cluster:vars]
k3s_version=v1.35.0+k3s1
```
2. Run the k3s playbook to update all nodes:
```bash
# Update master first (required to generate token for agents)
ansible-playbook site.yml --tags k3s-server,k3s-agent
```
**Important:** Always update master nodes before workers. Workers need the token from the master to rejoin the cluster.
#### Option 3: Update via K3s Release Script
For more control, you can manually update k3s on individual nodes:
```bash
# SSH into a node
ssh pi@<node-ip>
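# Download and install a specific version (master/server nodes)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.0+k3s1 sh -
# On workers, also pass K3S_URL and K3S_TOKEN to the installer;
# without K3S_URL it sets the node up as a server instead of an agent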
# Restart k3s
sudo systemctl restart k3s # On master
sudo systemctl restart k3s-agent # On workers
```
#### Checking Current K3s Version
To see the current k3s version running on your cluster:
```bash
kubectl version   # the --short flag was removed in kubectl 1.28
# or
kubectl get nodes -o wide
```
To check versions on specific nodes:
```bash
ssh pi@<node-ip>
k3s --version
# Or via Ansible
ansible all -m shell -a "k3s --version" --become
```
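The version check above can also be scripted. The following is a minimal sketch, assuming `k3s --version` prints a line containing a tag in the same `vX.Y.Z+k3sN` format used for `k3s_version` in the inventory; the helper names and the sample output (including the commit hash) are illustrative, not part of this repository.

```python
import re
from typing import Optional

# Version tag format used in the inventory, e.g. v1.35.0+k3s1
K3S_VERSION_RE = re.compile(r"v\d+\.\d+\.\d+\+k3s\d+")


def running_version(version_output: str) -> Optional[str]:
    """Extract the K3s version tag from `k3s --version` output, if present."""
    match = K3S_VERSION_RE.search(version_output)
    return match.group(0) if match else None


def needs_update(version_output: str, pinned: str) -> bool:
    """Return True when the node is not running the pinned version."""
    return running_version(version_output) != pinned


# Hypothetical sample output; the hash and Go version are made up
sample = "k3s version v1.34.2+k3s1 (abcdef12)\ngo version go1.24.0"
print(running_version(sample))
print(needs_update(sample, "v1.35.0+k3s1"))
```

Feeding it the output collected by the Ansible ad-hoc command gives a quick list of nodes that still need the playbook run.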
#### Update Telegraf
To update Telegraf metrics collection to the latest version:
```bash
# Update Telegraf on all nodes
ansible-playbook telegraf.yml
# Update only specific nodes
ansible-playbook telegraf.yml --limit worker
```
#### Post-Update Verification
After updating, verify your cluster is healthy:
```bash
# Check all nodes are ready
kubectl get nodes
# Check pod status
kubectl get pods --all-namespaces
# Check cluster info
kubectl cluster-info
# View recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
```
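For unattended runs, the readiness check can be automated. This is a rough sketch that inspects the JSON produced by `kubectl get nodes -o json` and reports nodes whose `Ready` condition is not `True`; the function name and the node names in the sample are assumptions for illustration.

```python
import json


def not_ready_nodes(nodes_json: str) -> list:
    """Return the names of nodes whose Ready condition is not "True"."""
    failing = []
    for node in json.loads(nodes_json)["items"]:
        conditions = node["status"].get("conditions", [])
        ready = any(
            c["type"] == "Ready" and c["status"] == "True" for c in conditions
        )
        if not ready:
            failing.append(node["metadata"]["name"])
    return failing


# Trimmed, hypothetical sample of `kubectl get nodes -o json` output
sample = json.dumps({
    "items": [
        {"metadata": {"name": "cm4-01"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "cm4-02"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
})
print(not_ready_nodes(sample))
```

An empty list means every node reported `Ready`; anything else is a candidate for the rollback procedure.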
#### Rollback (if needed)
If an update causes issues, you can roll back to a previous version:
```bash
# Update inventory with previous version
# [k3s_cluster:vars]
# k3s_version=v1.34.2+k3s1
# Re-run the playbook
ansible-playbook site.yml --tags k3s-server,k3s-agent
```
### Rebooting Cluster Nodes
A dedicated playbook is provided to safely reboot all cluster nodes:


@@ -0,0 +1,238 @@
{
"name": "Raspberry Pi K3s Cluster Metrics",
"description": "System monitoring dashboard for Raspberry Pi K3s cluster with Telegraf metrics",
"cells": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 4,
"kind": "Gauge",
"name": "CPU Usage - Average",
"properties": {
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -15m)\n |> filter(fn: (r) => r[\"_measurement\"] == \"cpu\")\n |> filter(fn: (r) => r[\"_field\"] == \"usage_user\")\n |> mean()",
"editMode": "advanced"
}
],
"colors": [
{
"id": "0",
"type": "background",
"hex": "#00C9FF",
"value": 0
},
{
"id": "1",
"type": "background",
"hex": "#FFB94E",
"value": 50
},
{
"id": "2",
"type": "background",
"hex": "#FF3D3D",
"value": 80
}
],
"prefix": "",
"suffix": "%",
"decimalPlaces": 1,
"note": ""
}
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 4,
"kind": "Gauge",
"name": "Memory Usage - Average",
"properties": {
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -15m)\n |> filter(fn: (r) => r[\"_measurement\"] == \"mem\")\n |> filter(fn: (r) => r[\"_field\"] == \"used_percent\")\n |> mean()",
"editMode": "advanced"
}
],
"colors": [
{
"id": "0",
"type": "background",
"hex": "#00C9FF",
"value": 0
},
{
"id": "1",
"type": "background",
"hex": "#FFB94E",
"value": 60
},
{
"id": "2",
"type": "background",
"hex": "#FF3D3D",
"value": 85
}
],
"prefix": "",
"suffix": "%",
"decimalPlaces": 1,
"note": ""
}
},
{
"x": 0,
"y": 4,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "CPU Usage - All Nodes",
"properties": {
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"cpu\")\n |> filter(fn: (r) => r[\"_field\"] == \"usage_user\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced"
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "CPU Usage (%)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"note": ""
}
},
{
"x": 0,
"y": 8,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "Memory Usage - All Nodes",
"properties": {
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"mem\")\n |> filter(fn: (r) => r[\"_field\"] == \"used_percent\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced"
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "Memory (%)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"note": ""
}
},
{
"x": 0,
"y": 12,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "CPU Temperature - All Nodes",
"properties": {
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"cpu_temp_thermal\")\n |> filter(fn: (r) => r[\"_field\"] == \"value\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced"
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "Temperature (°C)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"note": ""
}
},
{
"x": 0,
"y": 16,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "System Load - All Nodes",
"properties": {
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"system\")\n |> filter(fn: (r) => r[\"_field\"] == \"load1\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced"
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "Load Average (1m)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"note": ""
}
}
]
}


@@ -0,0 +1,375 @@
{
"name": "Raspberry Pi K3s Cluster Metrics",
"description": "System monitoring dashboard for Raspberry Pi K3s cluster with Telegraf metrics",
"org": "family",
"cells": [
{
"x": 0,
"y": 0,
"w": 6,
"h": 4,
"kind": "Gauge",
"name": "CPU Usage - Average",
"properties": {
"shape": "chronograf-v2",
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -1h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"cpu\")\n |> filter(fn: (r) => r[\"_field\"] == \"usage_user\")\n |> mean()",
"editMode": "advanced",
"name": "",
"builderConfig": {
"buckets": [],
"tags": [],
"functions": [],
"filters": []
}
}
],
"colors": [
{
"id": "base",
"type": "text",
"hex": "#ffffff",
"name": "Crayola",
"value": 0
},
{
"id": "0",
"type": "background",
"hex": "#31C0F6",
"name": "Crayola",
"value": 0
},
{
"id": "1",
"type": "background",
"hex": "#A500A5",
"name": "Crayola",
"value": 50
},
{
"id": "2",
"type": "background",
"hex": "#FF0000",
"name": "Crayola",
"value": 80
}
],
"prefix": "",
"suffix": "%",
"decimalPlaces": 2,
"gaugeColors": [
{
"name": "green",
"type": "min",
"value": 0
},
{
"name": "yellow",
"type": "max",
"value": 50
},
{
"name": "red",
"type": "max",
"value": 100
}
]
}
},
{
"x": 6,
"y": 0,
"w": 6,
"h": 4,
"kind": "Gauge",
"name": "Memory Usage - Average",
"properties": {
"shape": "chronograf-v2",
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -1h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"mem\")\n |> filter(fn: (r) => r[\"_field\"] == \"used_percent\")\n |> mean()",
"editMode": "advanced",
"name": "",
"builderConfig": {
"buckets": [],
"tags": [],
"functions": [],
"filters": []
}
}
],
"colors": [
{
"id": "base",
"type": "text",
"hex": "#ffffff",
"name": "Crayola",
"value": 0
},
{
"id": "0",
"type": "background",
"hex": "#31C0F6",
"name": "Crayola",
"value": 0
},
{
"id": "1",
"type": "background",
"hex": "#A500A5",
"name": "Crayola",
"value": 50
},
{
"id": "2",
"type": "background",
"hex": "#FF0000",
"name": "Crayola",
"value": 80
}
],
"prefix": "",
"suffix": "%",
"decimalPlaces": 1,
"gaugeColors": [
{
"name": "green",
"type": "min",
"value": 0
},
{
"name": "yellow",
"type": "max",
"value": 60
},
{
"name": "red",
"type": "max",
"value": 100
}
]
}
},
{
"x": 0,
"y": 4,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "CPU Usage - All Nodes",
"properties": {
"shape": "chronograf-v2",
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"cpu\")\n |> filter(fn: (r) => r[\"_field\"] == \"usage_user\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced",
"name": "",
"builderConfig": {
"buckets": [],
"tags": [],
"functions": [],
"filters": []
}
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "CPU Usage (%)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y2": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"colorizeRows": false,
"legend": {}
}
},
{
"x": 0,
"y": 8,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "Memory Usage - All Nodes",
"properties": {
"shape": "chronograf-v2",
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"mem\")\n |> filter(fn: (r) => r[\"_field\"] == \"used_percent\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced",
"name": "",
"builderConfig": {
"buckets": [],
"tags": [],
"functions": [],
"filters": []
}
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "Memory (%)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y2": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"colorizeRows": false,
"legend": {}
}
},
{
"x": 0,
"y": 12,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "CPU Temperature - All Nodes",
"properties": {
"shape": "chronograf-v2",
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"cpu_temp_thermal\")\n |> filter(fn: (r) => r[\"_field\"] == \"value\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced",
"name": "",
"builderConfig": {
"buckets": [],
"tags": [],
"functions": [],
"filters": []
}
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "Temperature (°C)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y2": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"colorizeRows": false,
"legend": {}
}
},
{
"x": 0,
"y": 16,
"w": 12,
"h": 4,
"kind": "TimeSeries",
"name": "System Load - All Nodes",
"properties": {
"shape": "chronograf-v2",
"queries": [
{
"text": "from(bucket: \"rpi-cluster\")\n |> range(start: -6h)\n |> filter(fn: (r) => r[\"_measurement\"] == \"system\")\n |> filter(fn: (r) => r[\"_field\"] == \"load1\")\n |> aggregateWindow(every: 1m, fn: mean)",
"editMode": "advanced",
"name": "",
"builderConfig": {
"buckets": [],
"tags": [],
"functions": [],
"filters": []
}
}
],
"colors": [],
"axes": {
"x": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y": {
"bounds": [],
"label": "Load Average (1m)",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
},
"y2": {
"bounds": [],
"label": "",
"prefix": "",
"suffix": "",
"base": "10",
"scale": "linear"
}
},
"type": "xy",
"geom": "line",
"colorizeRows": false,
"legend": {}
}
}
]
}
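Before importing a dashboard file like the ones above, a quick structural check can catch layout mistakes. The sketch below assumes the cell fields shown in these files (`x`, `y`, `w`, `h`, `name`, `properties.queries[].text`) and a 12-column grid, which is inferred from the widths used here rather than taken from InfluxDB documentation; `check_cells` is a hypothetical helper.

```python
def check_cells(dashboard: dict, grid_width: int = 12) -> list:
    """Return a list of human-readable layout or query problems."""
    problems = []
    cells = dashboard.get("cells", [])
    for cell in cells:
        # A cell must fit inside the assumed grid width
        if cell["x"] + cell["w"] > grid_width:
            problems.append(f"{cell['name']}: exceeds {grid_width}-column grid")
        # Every cell should carry at least one non-empty query
        queries = cell.get("properties", {}).get("queries", [])
        if not any(q.get("text", "").strip() for q in queries):
            problems.append(f"{cell['name']}: no query text")
    # Pairwise rectangle overlap test
    for i, a in enumerate(cells):
        for b in cells[i + 1:]:
            overlap_x = a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            overlap_y = a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"]
            if overlap_x and overlap_y:
                problems.append(f"{a['name']} overlaps {b['name']}")
    return problems


# Trimmed sample mirroring the first three cells of the dashboard
query = {"text": 'from(bucket: "rpi-cluster") |> range(start: -1h)'}
sample = {"cells": [
    {"x": 0, "y": 0, "w": 6, "h": 4, "name": "CPU Usage - Average",
     "properties": {"queries": [query]}},
    {"x": 6, "y": 0, "w": 6, "h": 4, "name": "Memory Usage - Average",
     "properties": {"queries": [query]}},
    {"x": 0, "y": 4, "w": 12, "h": 4, "name": "CPU Usage - All Nodes",
     "properties": {"queries": [query]}},
]}
print(check_cells(sample))
```

To check a real file, load it with `json.load` and pass the resulting dictionary to `check_cells`.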


@@ -17,7 +17,8 @@ worker
[k3s_cluster:vars]
# K3s version to install
k3s_version=v1.34.2+k3s1
# Use 'latest' for auto-updates, or specify a version like 'v1.29.0+k3s1'
k3s_version=v1.35.0+k3s1
# Network settings
ansible_user=pi


@@ -14,16 +14,16 @@
url: https://get.k3s.io
dest: /tmp/k3s-install.sh
mode: '0755'
when: not k3s_binary.stat.exists
- name: Install k3s agent
- name: Install or upgrade k3s agent
shell: |
INSTALL_K3S_VERSION="{{ k3s_version }}" \
K3S_URL="{{ k3s_url }}" \
K3S_TOKEN="{{ k3s_token }}" \
INSTALL_K3S_EXEC="agent {{ extra_agent_args }}" \
sh /tmp/k3s-install.sh
when: not k3s_binary.stat.exists
register: k3s_install_result
changed_when: "'installed' in k3s_install_result.stdout or 'upgraded' in k3s_install_result.stdout"
- name: Wait for k3s agent to be ready
wait_for:


@@ -9,14 +9,14 @@
url: https://get.k3s.io
dest: /tmp/k3s-install.sh
mode: '0755'
when: not k3s_binary.stat.exists
- name: Install k3s server
- name: Install or upgrade k3s server
shell: |
INSTALL_K3S_VERSION="{{ k3s_version }}" \
INSTALL_K3S_EXEC="server {{ extra_server_args }}" \
sh /tmp/k3s-install.sh
when: not k3s_binary.stat.exists
register: k3s_install_result
changed_when: "'installed' in k3s_install_result.stdout or 'upgraded' in k3s_install_result.stdout"
- name: Wait for k3s to be ready
wait_for:
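The `changed_when` expression in both tasks is a plain, case-sensitive substring test on the installer's stdout. The sketch below mirrors that logic in Python to make its behavior easy to probe; the sample output lines are hypothetical, so it is worth confirming against real installer output that one of the two markers actually appears when something changes.

```python
def is_changed(stdout: str, markers=("installed", "upgraded")) -> bool:
    """Mirror of the changed_when expression: True if any marker occurs in stdout."""
    return any(marker in stdout for marker in markers)


# Hypothetical output lines, for illustration only
print(is_changed("k3s upgraded to v1.35.0+k3s1"))
print(is_changed("[INFO] Installing k3s to /usr/local/bin/k3s"))
print(is_changed("[INFO] No change detected so skipping service start"))
```

The second line reports `False` because "Installing" does not contain the lowercase substring "installed", which shows how sensitive this kind of detection is to the exact wording of the output.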


@@ -16,12 +16,12 @@ BLUE='\033[0;34m'
NC='\033[0m' # No Color
echo -e "${BLUE}╔════════════════════════════════════════════════════════════════╗${NC}"
echo -e "${BLUE}║ Compute Blade Agent Verification Script ║${NC}"
echo -e "${BLUE}║ Compute Blade Agent Verification Script ${NC}"
echo -e "${BLUE}╚════════════════════════════════════════════════════════════════╝${NC}\n"
# Parse worker nodes from inventory
echo -e "${YELLOW}Parsing worker nodes from inventory...${NC}"
WORKERS=$(grep -E "^cb-0[2-9]|^pi-worker" "$INVENTORY" | awk '{print $1}')
WORKERS=$(grep -E "^\[worker\]" -A 100 "$INVENTORY" | grep -E "^cm4-|^pi-worker|^cb-0" | grep -v "^\[" | awk '{print $1}')
if [ -z "$WORKERS" ]; then
echo -e "${RED}No worker nodes found in inventory${NC}"
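The new `grep -A 100` pipeline reads up to 100 lines after `[worker]`, so it can also pick up matching hosts from any section that follows. As a point of comparison, here is a minimal Python sketch that stops at the next section header; the function name, the sample inventory, and its host names and addresses are assumptions, and only the three name patterns come from the script.

```python
import re

# Same name patterns as the grep in the verification script
WORKER_NAME_RE = re.compile(r"^(cm4-|pi-worker|cb-0)")


def worker_hosts(inventory_text: str, section: str = "worker") -> list:
    """Return host names listed under [section], ignoring other sections."""
    hosts = []
    in_section = False
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Entering a new section: track whether it is the one we want
            in_section = line == f"[{section}]"
            continue
        if in_section and line and not line.startswith("#") and WORKER_NAME_RE.match(line):
            hosts.append(line.split()[0])
    return hosts


# Hypothetical inventory; [master] deliberately follows [worker]
sample_inventory = """\
[worker]
cm4-02 ansible_host=192.168.1.12
cm4-03 ansible_host=192.168.1.13

[master]
cm4-01 ansible_host=192.168.1.11

[k3s_cluster:children]
master
worker
"""
print(worker_hosts(sample_inventory))
```

With this sample, the section-aware parser returns only the two worker hosts, whereas a fixed look-ahead would also match `cm4-01` from the `[master]` section.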