K3s Ansible Deployment for Raspberry Pi CM4/CM5
Ansible playbook to deploy a k3s Kubernetes cluster on Raspberry Pi Compute Module 4 and 5 devices.
Prerequisites
- Raspberry Pi CM4/CM5 modules running Raspberry Pi OS (64-bit recommended)
- SSH access to all nodes
- Ansible installed on your control machine
- SSH key-based authentication configured
Project Structure
k3s-ansible/
├── ansible.cfg                      # Ansible configuration
├── site.yml                         # Main playbook
├── inventory/
│   └── hosts.ini                    # Inventory file
├── manifests/
│   └── nginx-test-deployment.yaml   # Test nginx deployment
└── roles/
    ├── prereq/                      # Prerequisites role
    │   └── tasks/
    │       └── main.yml
    ├── k3s-server/                  # K3s master/server role
    │   └── tasks/
    │       └── main.yml
    ├── k3s-agent/                   # K3s worker/agent role
    │   └── tasks/
    │       └── main.yml
    └── k3s-deploy-test/             # Test deployment role
        └── tasks/
            └── main.yml
Configuration
1. Update Inventory
Edit inventory/hosts.ini and add your Raspberry Pi nodes:
[master]
pi-master ansible_host=192.168.30.101 ansible_user=pi
[worker]
pi-worker-1 ansible_host=192.168.30.102 ansible_user=pi
pi-worker-2 ansible_host=192.168.30.103 ansible_user=pi
pi-worker-3 ansible_host=192.168.30.104 ansible_user=pi
2. Configure Variables
In inventory/hosts.ini, you can customize:
- k3s_version: K3s version to install (default: v1.34.2+k3s1)
- extra_server_args: Additional arguments for k3s server
- extra_agent_args: Additional arguments for k3s agent
- extra_packages: List of additional packages to install on all nodes
3. Customize Extra Packages (Optional)
The playbook can install additional system utilities on all nodes. Edit the extra_packages variable in inventory/hosts.ini:
# Comma-separated list of packages
extra_packages=btop,vim,tmux,net-tools,dnsutils,iotop,ncdu,tree,jq
Included packages:
- btop - A modern system monitor ("better top")
- vim - Text editor
- tmux - Terminal multiplexer
- net-tools - Network tools (ifconfig, netstat, etc.)
- dnsutils - DNS utilities (dig, nslookup)
- iotop - I/O monitor
- ncdu - Disk usage analyzer
- tree - Directory tree viewer
- jq - JSON processor
To add packages, append them to the comma-separated list. To disable extra packages entirely, comment out or remove the extra_packages line.
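For illustration, the comma-separated value expands into a plain package list roughly like this sketch (the actual role may use Ansible's apt module rather than a raw apt-get call):

```shell
# Hypothetical expansion of the extra_packages variable into an apt
# invocation; values here are a shortened example.
extra_packages="btop,vim,tmux"
pkg_list=$(printf '%s' "$extra_packages" | tr ',' ' ')   # commas -> spaces
echo "would run: sudo apt-get install -y $pkg_list"
```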
Usage
Test Connectivity
Basic connectivity test:
ansible all -m ping
Gather Node Information
Display critical information from all nodes (uptime, temperature, memory, disk usage, load average):
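Per node, these values come from standard Linux interfaces; the following is an illustrative sketch of what is collected, not the playbook itself (the thermal path is Raspberry Pi specific and may be absent on other hardware):

```shell
# Gather the same facts locally; field names are illustrative.
load=$(cut -d ' ' -f1 /proc/loadavg)                     # 1-minute load average
mem_total_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
disk_used=$(df -P / | awk 'NR==2 {print $5}')            # root FS usage percent
temp_file=/sys/class/thermal/thermal_zone0/temp
if [ -r "$temp_file" ]; then
  temp_c=$(( $(cat "$temp_file") / 1000 ))               # millidegrees -> degrees C
else
  temp_c="n/a"
fi
echo "load=$load mem=${mem_avail_kb}/${mem_total_kb}kB disk=$disk_used temp=${temp_c}"
```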
Deploy Telegraf for Metrics Collection
Stream system metrics from all nodes to InfluxDB using Telegraf client.
Prerequisites:
- InfluxDB instance running and accessible
- API token with write permissions to your bucket
Setup:
- Configure your InfluxDB credentials in the .env file (already created):
# .env file (keep this secret, never commit!)
INFLUXDB_HOST=192.168.10.10
INFLUXDB_PORT=8086
INFLUXDB_ORG=family
INFLUXDB_BUCKET=rpi-cluster
INFLUXDB_TOKEN=your-api-token-here
- Deploy Telegraf to all nodes:
ansible-playbook telegraf.yml
Or deploy to specific nodes:
# Only worker nodes
ansible-playbook telegraf.yml --limit worker
# Only master nodes
ansible-playbook telegraf.yml --limit master
# Specific node
ansible-playbook telegraf.yml --limit cm4-02
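If you also want those credentials available in your local shell (for example for ad-hoc influx CLI calls), the .env file can be sourced. A sketch, using a throwaway copy of the format shown above:

```shell
# Create a throwaway .env just for this demo, then auto-export its
# KEY=VALUE pairs into the environment.
cat > /tmp/demo.env <<'EOF'
INFLUXDB_HOST=192.168.10.10
INFLUXDB_PORT=8086
EOF
set -a           # export every assignment made while sourcing
. /tmp/demo.env
set +a
echo "InfluxDB at $INFLUXDB_HOST:$INFLUXDB_PORT"
```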
Metrics Collected:
- System: CPU (per-core and total), memory, swap, processes, system load
- Disk: Disk I/O, disk usage, inodes
- Network: Network interfaces, packets, errors
- Thermal: CPU temperature (Raspberry Pi specific)
- K3s: Process metrics for k3s components
Verify Installation:
Check Telegraf status on a node:
ssh pi@<node-ip>
sudo systemctl status telegraf
sudo journalctl -u telegraf -f
View Metrics in InfluxDB:
Once configured, metrics will appear in your InfluxDB instance under the rpi-cluster bucket with tags for each node hostname and node type (master/worker).
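For example, a Flux query for total CPU usage per host might look like this sketch (bucket, measurement, and field names follow Telegraf's defaults; adjust to your setup):

```flux
from(bucket: "rpi-cluster")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total")
  |> filter(fn: (r) => r._field == "usage_idle")
  |> map(fn: (r) => ({r with _value: 100.0 - r._value}))  // busy % from idle
  |> aggregateWindow(every: 1m, fn: mean)
```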
Monitoring Dashboards
Two pre-built dashboards are available for visualizing your cluster metrics:
Grafana Dashboard
A comprehensive Grafana dashboard with interactive visualizations:
- CPU usage across all nodes
- Memory usage (percentage)
- CPU temperature (Raspberry Pi specific)
- System load averages
Import to Grafana:
- Open Grafana and go to Dashboards → New → Import
- Upload the dashboard file: grafana/rpi-cluster-dashboard.json
- Your InfluxDB datasource (named influxdb) will be automatically selected
- Click Import
Customize the Grafana Dashboard:
You can modify the dashboard after import to:
- Adjust time ranges (default: last 6 hours)
- Add alerts for high CPU/temperature/memory
- Add more panels for additional metrics
- Create node-specific views using Grafana variables
InfluxDB Dashboard
A native InfluxDB 2.x dashboard with built-in gauges and time series:
- CPU usage gauge (average)
- Memory usage gauge (average)
- CPU usage time series (6-hour view)
- Memory usage time series (6-hour view)
- CPU temperature trend
- System load trend
Import to InfluxDB 2.8:
Via UI (Recommended):
- Open the InfluxDB UI at http://your-influxdb-host:8086
- Go to Dashboards (left sidebar)
- Click Create Dashboard → From a Template
- Click Paste JSON
- Copy and paste the contents of influxdb/rpi-cluster-dashboard-v2.json
- Click Create Dashboard
Via CLI:
influx dashboard import \
--org family \
--file influxdb/rpi-cluster-dashboard-v2.json
Benefits of InfluxDB Dashboard:
- Native integration - no external datasource configuration needed
- Built-in alert support
- Real-time data without polling delays
- Direct access to raw data and queries
- InfluxDB 2.8 compatible
Deploy K3s Cluster
ansible-playbook site.yml
This will deploy the full k3s cluster with the test nginx application.
Deploy Without Test Application
To skip the test deployment:
ansible-playbook site.yml --skip-tags test
Deploy Only the Test Application
If the cluster is already running and you just want to deploy the test app:
ansible-playbook site.yml --tags deploy-test
Deploy Only Prerequisites
ansible-playbook site.yml --tags prereq
What the Playbook Does
Prerequisites Role (prereq)
- Sets hostname on each node
- Updates and upgrades system packages
- Installs required packages (curl, wget, git, iptables, etc.)
- Enables cgroup memory and swap in boot config
- Configures legacy iptables (required for k3s on ARM)
- Disables swap
- Reboots if necessary
K3s Server Role (k3s-server)
- Installs k3s in server mode on master node(s)
- Configures k3s with Flannel VXLAN backend (optimized for ARM)
- Retrieves and stores the node token for workers
- Copies kubeconfig to master node user
- Fetches kubeconfig to local machine for kubectl access
K3s Agent Role (k3s-agent)
- Installs k3s in agent mode on worker nodes
- Joins workers to the cluster using the master's token
- Configures agents to connect to the master
K3s Deploy Test Role (k3s-deploy-test)
- Waits for all cluster nodes to be ready
- Deploys the nginx test application with 5 replicas
- Verifies deployment is successful
- Displays pod distribution across nodes
Post-Installation
After successful deployment:
- The kubeconfig file will be saved to ./kubeconfig
- Use it with kubectl:
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
You should see all your nodes in Ready state:
NAME          STATUS   ROLES                  AGE   VERSION
pi-master     Ready    control-plane,master   5m    v1.34.2+k3s1
pi-worker-1   Ready    <none>                 3m    v1.34.2+k3s1
pi-worker-2   Ready    <none>                 3m    v1.34.2+k3s1
pi-worker-3   Ready    <none>                 3m    v1.34.2+k3s1
Accessing the Cluster
From Master Node
SSH into the master node and use kubectl:
ssh pi@pi-master
kubectl get nodes
From Your Local Machine
The playbook automatically fetches the kubeconfig to ./kubeconfig. You have several options to use it:
Option 1: Temporary Access (Environment Variable)
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
kubectl get pods --all-namespaces
Option 2: Merge into ~/.kube/config (Recommended)
This allows you to manage multiple clusters and switch between them:
# Backup your existing config
cp ~/.kube/config ~/.kube/config.backup
# Merge the k3s config into your existing config
KUBECONFIG=~/.kube/config:$(pwd)/kubeconfig kubectl config view --flatten > ~/.kube/config.tmp
mv ~/.kube/config.tmp ~/.kube/config
# Rename the context to something meaningful
kubectl config rename-context default k3s-pi-cluster
# View all contexts
kubectl config get-contexts
# Switch to k3s context
kubectl config use-context k3s-pi-cluster
# Switch back to other clusters
kubectl config use-context <other-context-name>
Option 3: Direct Usage
Use the kubeconfig file directly without setting environment variables:
kubectl --kubeconfig=./kubeconfig get nodes
kubectl --kubeconfig=./kubeconfig get pods --all-namespaces
Ingress Setup
K3s ships with the Traefik ingress controller enabled by default, letting you expose applications over HTTP/HTTPS using domain names.
How It Works
- Traefik listens on ports 80 (HTTP) and 443 (HTTPS) on all nodes
- Ingress rules route traffic based on hostname to different services
- Multiple applications can share the same IP using different hostnames
- No additional setup required - Traefik is ready to use after cluster deployment
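Hostname-based routing from the bullets above comes down to an Ingress resource shaped like this minimal sketch (host and service names are placeholders, not taken from this repo):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
spec:
  rules:
    - host: app.example.local        # Traefik matches the Host header
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-svc    # ClusterIP service to route to
                port:
                  number: 80
```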
Verify Traefik is Running
kubectl --kubeconfig=./kubeconfig get pods -n kube-system -l app.kubernetes.io/name=traefik
kubectl --kubeconfig=./kubeconfig get svc -n kube-system traefik
View Ingress Resources
kubectl --kubeconfig=./kubeconfig get ingress
kubectl --kubeconfig=./kubeconfig describe ingress nginx-test
Testing the Cluster
A sample nginx deployment with 5 replicas and ingress is provided to test your cluster.
Automated Deployment (via Ansible)
The test application is automatically deployed with ingress when you run the full playbook:
ansible-playbook site.yml
Or deploy it separately after the cluster is up:
ansible-playbook site.yml --tags deploy-test
The Ansible role will:
- Wait for all nodes to be ready
- Deploy the nginx application with ingress
- Wait for all pods to be running
- Show deployment status, pod distribution, ingress details, and access instructions
Manual Deployment (via kubectl)
Deploy using kubectl:
export KUBECONFIG=$(pwd)/kubeconfig
kubectl apply -f manifests/nginx-test-deployment.yaml
This deploys:
- Nginx deployment with 5 replicas
- ClusterIP service
- Ingress resource for domain-based access
Verify the Deployment
Check that all 5 replicas are running:
kubectl --kubeconfig=./kubeconfig get deployments
kubectl --kubeconfig=./kubeconfig get pods -o wide
kubectl --kubeconfig=./kubeconfig get ingress
You should see output similar to:
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test   5/5     5            5           1m

NAME                          READY   STATUS    RESTARTS   AGE   NODE
nginx-test-7d8f4c9b6d-2xk4p   1/1     Running   0          1m    pi-worker-1
nginx-test-7d8f4c9b6d-4mz9r   1/1     Running   0          1m    pi-worker-2
nginx-test-7d8f4c9b6d-7w3qs   1/1     Running   0          1m    pi-worker-3
nginx-test-7d8f4c9b6d-9k2ln   1/1     Running   0          1m    pi-worker-1
nginx-test-7d8f4c9b6d-xr5wp   1/1     Running   0          1m    pi-worker-2
Access via Ingress
Add your master node IP to /etc/hosts:
# Replace 192.168.30.101 with your master node IP
192.168.30.101 nginx-test.local nginx.pi.local
Then open http://nginx-test.local (or http://nginx.pi.local) in your browser, or test with curl:
# Replace with your master node IP
curl -H "Host: nginx-test.local" http://192.168.30.101
Scale the Deployment
Test scaling:
# Scale up to 10 replicas
kubectl scale deployment nginx-test --replicas=10
# Scale down to 3 replicas
kubectl scale deployment nginx-test --replicas=3
# Watch the pods being created/terminated
kubectl get pods -w
Clean Up Test Deployment
When you're done testing:
kubectl delete -f manifests/nginx-test-deployment.yaml
Maintenance
Updating the Cluster
K3s updates are applied by re-running the installer with a new version, not through the system package manager. There are several ways to update your cluster:
Option 1: Automatic Updates (Recommended)
The installer can track the latest release channel, so each playbook run picks up the newest version. To enable this on all nodes:
- Add the following to your inventory hosts.ini:
[k3s_cluster:vars]
k3s_version=latest
- Re-run the k3s installation playbook:
ansible-playbook site.yml --tags k3s-server,k3s-agent
Re-running the playbook then picks up new releases (typically patch releases) as they become available; k3s does not update itself between runs.
Option 2: Manual Update to Specific Version
To update to a specific k3s version:
- Update the k3s_version variable in inventory/hosts.ini:
[k3s_cluster:vars]
k3s_version=v1.35.0+k3s1
- Run the k3s playbook to update all nodes:
# Update master first (required to generate token for agents)
ansible-playbook site.yml --tags k3s-server,k3s-agent
Important: Always update master nodes before workers. Workers need the token from the master to rejoin the cluster.
Option 3: Update via K3s Release Script
For more control, you can manually update k3s on individual nodes:
# SSH into a node
ssh pi@<node-ip>
# Download and install specific version
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.0+k3s1 sh -
# Restart k3s
sudo systemctl restart k3s # On master
sudo systemctl restart k3s-agent # On workers
Checking Current K3s Version
To see the current k3s version running on your cluster:
kubectl version
# or
kubectl get nodes -o wide
To check versions on specific nodes:
ssh pi@<node-ip>
k3s --version
# Or via Ansible
ansible all -m shell -a "k3s --version" --become
Update Telegraf
To update Telegraf metrics collection to the latest version:
# Update Telegraf on all nodes
ansible-playbook telegraf.yml
# Update only specific nodes
ansible-playbook telegraf.yml --limit worker
Post-Update Verification
After updating, verify your cluster is healthy:
# Check all nodes are ready
kubectl get nodes
# Check pod status
kubectl get pods --all-namespaces
# Check cluster info
kubectl cluster-info
# View recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
Rollback (if needed)
If an update causes issues, you can rollback to a previous version:
# Update inventory with previous version
# [k3s_cluster:vars]
# k3s_version=v1.34.2+k3s1
# Re-run the playbook
ansible-playbook site.yml --tags k3s-server,k3s-agent
Rebooting Cluster Nodes
A dedicated playbook is provided to safely reboot all cluster nodes:
ansible-playbook reboot.yml
This playbook will:
- Reboot worker nodes first (one at a time, serially)
- Wait for each worker to come back online and k3s-agent to be running
- Reboot master nodes (one at a time, serially)
- Wait for each master to come back online and k3s to be running
- Verify the cluster status and show all nodes are ready
The serial approach ensures that only one node reboots at a time, maintaining cluster availability.
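Under stated assumptions (SSH key auth as user pi, the unit names used in this repo), the per-node wait can be sketched in plain shell; the playbook itself relies on Ansible's built-in reboot and wait mechanisms, so this is purely illustrative:

```shell
# Poll until SSH answers again, then confirm the k3s unit is active.
# Illustrative sketch, not taken from the playbook.
wait_for_node() {
  host=$1; unit=$2      # e.g. wait_for_node pi-worker-1 k3s-agent
  for _ in $(seq 1 60); do
    ssh -o BatchMode=yes -o ConnectTimeout=2 "pi@$host" true 2>/dev/null && break
    sleep 5
  done
  ssh -o BatchMode=yes "pi@$host" "systemctl is-active --quiet $unit"
}
```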
Reboot Only Workers
ansible-playbook reboot.yml --limit worker
Reboot Only Masters
ansible-playbook reboot.yml --limit master
Reboot a Specific Node
ansible-playbook reboot.yml --limit pi-worker-1
Troubleshooting
Check k3s service status
On master:
sudo systemctl status k3s
sudo journalctl -u k3s -f
On workers:
sudo systemctl status k3s-agent
sudo journalctl -u k3s-agent -f
Reset a node
If you need to reset a node and start over:
# On the node
/usr/local/bin/k3s-uninstall.sh # For server
/usr/local/bin/k3s-agent-uninstall.sh # For agent
Common Issues
- Nodes not joining: Check firewall rules. K3s requires port 6443 open on the master.
- Memory issues: Ensure cgroup memory is enabled (the playbook handles this).
- Network issues: The playbook uses VXLAN backend which works better on ARM devices.
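A quick way to check the firewall point above from a worker is a raw TCP probe; a sketch using bash's /dev/tcp (the master IP in the comment is an example):

```shell
# probe_port HOST PORT -> exit 0 if a TCP connection succeeds within 2s.
probe_port() {
  timeout 2 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null
}
# e.g. from a worker: probe_port 192.168.30.101 6443
```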
Customization
Add More Master Nodes (HA Setup)
For a high-availability setup, you can add more master nodes:
[master]
pi-master-1 ansible_host=192.168.30.101 ansible_user=pi
pi-master-2 ansible_host=192.168.30.105 ansible_user=pi
pi-master-3 ansible_host=192.168.30.106 ansible_user=pi
For HA you'll need either k3s's embedded etcd (initialize the first server with --cluster-init) or an external datastore such as PostgreSQL.
Custom K3s Arguments
Modify extra_server_args or extra_agent_args in the inventory:
[k3s_cluster:vars]
extra_server_args="--flannel-backend=vxlan --disable traefik --disable servicelb"
extra_agent_args="--node-label foo=bar"
Compute Blade Agent Deployment
The playbook includes automatic deployment of the Compute Blade Agent, a system service for managing Compute Blade hardware (Raspberry Pi CM4/CM5 modules). The agent monitors hardware states, reacts to temperature changes and button presses, and exposes metrics via Prometheus.
Components
- compute-blade-agent: Daemon that monitors hardware and manages blade operations
- bladectl: Command-line tool for local/remote interaction with the agent
- fanunit.uf2: Firmware for the fan unit microcontroller
Configuration
The compute-blade-agent deployment is controlled by the enable_compute_blade_agent variable in inventory/hosts.ini:
# Enable/disable compute-blade-agent on all worker nodes
enable_compute_blade_agent=true
To disable on specific nodes, add an override:
[worker]
cm4-02 ansible_host=192.168.30.102 ansible_user=pi enable_compute_blade_agent=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi
cm4-04 ansible_host=192.168.30.104 ansible_user=pi
Deployment
The compute-blade-agent is automatically deployed as part of the main playbook:
ansible-playbook site.yml
Or deploy only the compute-blade-agent on worker nodes:
ansible-playbook site.yml --tags compute-blade-agent
Verification
Check the agent status on a worker node:
# SSH into a worker node
ssh pi@192.168.30.102
# Check service status
sudo systemctl status compute-blade-agent
# View logs
sudo journalctl -u compute-blade-agent -f
# Check binary installation
/usr/local/bin/compute-blade-agent --version
Configuration Files
The compute-blade-agent creates its configuration at:
/etc/compute-blade-agent/config.yaml
Configuration can also be controlled via environment variables prefixed with BLADE_.
Metrics and Monitoring
The compute-blade-agent exposes Prometheus metrics. To monitor the agents:
- Optional Kubernetes resources are available in manifests/compute-blade-agent-daemonset.yaml
- Deploy the optional monitoring resources (requires Prometheus):
kubectl apply -f manifests/compute-blade-agent-daemonset.yaml
Features
- Hardware Monitoring: Tracks temperature, fan speed, and button events
- Critical Mode: Automatically enters maximum fan speed + red LED during overheating
- Identification: Locate specific blades via LED blinking
- Metrics Export: Prometheus-compatible metrics endpoint
Troubleshooting compute-blade-agent
Service fails to start
Check the installer output:
sudo journalctl -u compute-blade-agent -n 50
Agent not detecting hardware
Verify the Compute Blade hardware is properly connected. The agent logs detailed information:
sudo journalctl -u compute-blade-agent -f
Re-run installation
To reinstall compute-blade-agent:
# SSH into the node
ssh pi@<node-ip>
# Uninstall
sudo /usr/local/bin/k3s-uninstall-compute-blade-agent.sh 2>/dev/null || echo "Not found, continuing"
# Remove from Ansible to reinstall
# Then re-run the playbook
ansible-playbook site.yml --tags compute-blade-agent
External DNS Configuration
To use external domains (like test.zlor.fi) with your k3s cluster ingress, you need to configure DNS and update your nodes.
Step 1: Configure DNS Server Records
On your DNS server, add A records pointing to your k3s cluster nodes:
Option A: Single Record (Master Node Only) - Simplest
If your DNS only allows one A record:
test.zlor.fi A 192.168.30.101
Pros: Simple, works with any DNS server
Cons: No failover if the master node is down
Option B: Multiple Records (Load Balanced) - Best Redundancy
If your DNS supports multiple A records:
test.zlor.fi A 192.168.30.101
test.zlor.fi A 192.168.30.102
test.zlor.fi A 192.168.30.103
test.zlor.fi A 192.168.30.104
DNS clients will distribute requests across all nodes (round-robin).
Pros: Load balanced, automatic failover
Cons: Requires DNS server support for multiple A records
Option C: Virtual IP (VIP) - Best of Both Worlds
If your DNS only allows one A record but you want redundancy:
test.zlor.fi A 192.168.30.100
Set up a virtual IP that automatically handles failover. You have two sub-options:
Option C1: MikroTik VIP (Recommended if you have a MikroTik router)
Configure VIP directly on your MikroTik router. See MIKROTIK-VIP-SETUP.md for detailed instructions.
Pros:
- Simple setup (5 minutes)
- No additional software on cluster nodes
- Hardware-based failover (more reliable)
- Better performance
Option C2: Keepalived (Software-based VIP)
Configure floating IP using Keepalived on cluster nodes. See "Virtual IP Setup (Keepalived)" below for detailed instructions.
Pros:
- No router configuration needed
- Portable across different networks
- Works in cloud environments
Cons:
- Additional daemon on all nodes
- More configuration needed
Recommendation: If you have MikroTik, use Option C1 (MikroTik VIP). Otherwise, use Option C2 (Keepalived).
Step 2: Configure Cluster Nodes for External DNS
K3s nodes need to be able to resolve external DNS queries. Update the DNS resolver on all nodes:
Option A: Ansible Playbook (Recommended)
Create a new playbook dns-config.yml:
---
- name: Configure external DNS resolver
hosts: all
become: yes
tasks:
- name: Update /etc/resolv.conf with custom DNS
copy:
content: |
nameserver 8.8.8.8
nameserver 8.8.4.4
nameserver 192.168.1.1
dest: /etc/resolv.conf
owner: root
group: root
mode: '0644'
notify: Update systemd-resolved
- name: Make resolv.conf immutable
file:
path: /etc/resolv.conf
attributes: '+i'
state: file
- name: Configure systemd-resolved for external DNS
copy:
content: |
[Resolve]
DNS=8.8.8.8 8.8.4.4 192.168.1.1
FallbackDNS=8.8.8.8
DNSSECNegativeTrustAnchors=zlor.fi
dest: /etc/systemd/resolved.conf
owner: root
group: root
mode: '0644'
notify: Restart systemd-resolved
handlers:
- name: Update systemd-resolved
systemd:
name: systemd-resolved
state: restarted
daemon_reload: yes
Apply the playbook:
ansible-playbook dns-config.yml
Option B: Manual Configuration on Each Node
SSH into each node and update DNS:
ssh pi@192.168.30.101
sudo nano /etc/systemd/resolved.conf
Add or modify:
[Resolve]
DNS=8.8.8.8 8.8.4.4 192.168.1.1
FallbackDNS=8.8.8.8
DNSSECNegativeTrustAnchors=zlor.fi
Save and restart:
sudo systemctl restart systemd-resolved
Verify DNS is working:
nslookup test.zlor.fi
dig test.zlor.fi
Step 3: Update Ingress Configuration
Your nginx-test deployment has already been updated to include test.zlor.fi. Verify the ingress:
kubectl get ingress nginx-test -o yaml
You should see:
spec:
rules:
- host: test.zlor.fi
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: nginx-test
port:
number: 80
Step 4: Test External Domain Access
Once DNS is configured, test access from your local machine:
# Test DNS resolution
nslookup test.zlor.fi
# Test HTTP access
curl http://test.zlor.fi
# With verbose output
curl -v http://test.zlor.fi
# Test from all cluster IPs
for ip in 192.168.30.{101..104}; do
echo "Testing $ip:"
curl -H "Host: test.zlor.fi" http://$ip
done
Troubleshooting DNS
DNS Resolution Failing
Check if systemd-resolved is running:
systemctl status systemd-resolved
Test DNS from a node:
ssh pi@192.168.30.101
nslookup test.zlor.fi
dig test.zlor.fi @8.8.8.8
Ingress Not Responding
Check if Traefik is running:
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
Check ingress status:
kubectl get ingress
kubectl describe ingress nginx-test
Request Timing Out
Verify network connectivity:
# From your machine
ping 192.168.30.101
ping 192.168.30.102
# From a cluster node
ssh pi@192.168.30.101
ping test.zlor.fi
curl -v http://test.zlor.fi
Adding More Domains
To add additional domains (e.g., api.zlor.fi, admin.zlor.fi):
- Add DNS A records for each domain pointing to your cluster nodes
- Update the ingress YAML with new rules:
spec:
rules:
- host: test.zlor.fi
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: nginx-test
port:
number: 80
- host: api.zlor.fi
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-service
port:
number: 8080
- Apply the updated manifest:
kubectl apply -f manifests/nginx-test-deployment.yaml
Virtual IP Setup - Keepalived (Option C2)
If your DNS server only allows a single A record but you want high availability across all nodes, and you're not using MikroTik VIP, use a Virtual IP (VIP) with Keepalived.
How It Works
- A virtual IP (192.168.30.100) floats between cluster nodes using VRRP protocol
- The master node holds the VIP by default
- If the master fails, a worker node automatically takes over
- All traffic reaches the cluster through a single IP address
- Clients experience automatic failover with minimal downtime
Prerequisites
- All nodes must be on the same network segment
- Network must support ARP protocol (standard on most networks)
- No other services should use 192.168.30.100
Installation
Step 1: Update Your VIP Address
Edit vip-setup.yml and change the VIP to an unused IP on your network:
vars:
vip_address: "192.168.30.100" # Change this to your desired VIP
vip_interface: "eth0" # Change if your interface is different
Step 2: Run the VIP Setup Playbook
ansible-playbook vip-setup.yml
This will:
- Install Keepalived on all nodes
- Configure VRRP with master on cm4-01 and backup on workers
- Set up health checks for automatic failover
- Enable the virtual IP
Step 3: Verify VIP is Active
Check that the VIP is assigned to the master node:
# From your local machine
ping 192.168.30.100
# From any cluster node
ssh pi@192.168.30.101
ip addr show
# Look for your VIP address in the output
Step 4: Update DNS Records
Now you can use just one A record pointing to the VIP:
test.zlor.fi A 192.168.30.100
Step 5: Update Ingress (Optional)
If you want to reference the VIP in your ingress, update the manifest:
spec:
rules:
- host: test.zlor.fi
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: nginx-test
port:
number: 80
The existing ingress needs no change: traffic reaches the cluster through any node IP, including the VIP.
Monitoring the VIP
Check VIP status and failover behavior:
# View Keepalived status
ssh pi@192.168.30.101
systemctl status keepalived
# Watch VIP transitions (open in separate terminal)
watch 'ip addr show | grep 192.168.30.100'
# View Keepalived logs
sudo journalctl -u keepalived -f
# Check health check script
sudo cat /usr/local/bin/check_apiserver.sh
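For reference, such a health check amounts to a probe of the k3s API endpoint; a minimal sketch in the same spirit (the script installed by the playbook may differ):

```shell
# Return 0 when the k3s API server answers its health endpoint, so
# Keepalived keeps the VIP on this node; non-zero lowers its priority.
check_apiserver() {
  url="${1:-https://127.0.0.1:6443/healthz}"
  curl -sk --max-time 3 "$url" >/dev/null 2>&1
}
```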
Testing Failover
To test automatic failover:
- Note which node has the VIP:
for ip in 192.168.30.{101..104}; do
echo "=== $ip ==="
ssh pi@$ip "ip addr show | grep 192.168.30.100" 2>/dev/null || echo "Not on this node"
done
- SSH into the node holding the VIP and stop keepalived:
ssh pi@192.168.30.101 # or whichever node has the VIP
sudo systemctl stop keepalived
- Watch the VIP migrate to another node:
# From another terminal, watch the migration
ping 192.168.30.100 -c 5
# Connection may drop briefly, then resume on new node
- Restart keepalived on the original node:
sudo systemctl start keepalived
Troubleshooting VIP
VIP is not appearing on any node
Check if Keepalived is running:
ssh pi@192.168.30.101
sudo systemctl status keepalived
sudo journalctl -u keepalived -n 20
Verify the interface name:
ip route | grep default # Should show your interface name
Update vip_interface in vip-setup.yml if needed and re-run.
VIP keeps switching between nodes
This indicates the health check is failing. Verify:
# Check if API server is responding
curl -k https://127.0.0.1:6443/healthz
# Check the health check script
cat /usr/local/bin/check_apiserver.sh
sudo bash /usr/local/bin/check_apiserver.sh
DNS resolves but connections time out
Verify all nodes have the VIP configured:
for ip in 192.168.30.{101..104}; do
echo "=== $ip ==="
ssh pi@$ip "ip addr show | grep 192.168.30.100"
done
Test direct connectivity to the VIP from each node:
ssh pi@192.168.30.101
curl -H "Host: test.zlor.fi" http://192.168.30.100
Disabling VIP
If you no longer need the VIP:
# Stop Keepalived on all nodes
ansible all -m systemd -a "name=keepalived state=stopped enabled=no" --become
# Remove configuration
ansible all -m file -a "path=/etc/keepalived/keepalived.conf state=absent" --become
Uninstall
To completely remove k3s from all nodes:
# Create an uninstall playbook or run manually on each node
ansible all -m shell -a "/usr/local/bin/k3s-uninstall.sh" --become
ansible workers -m shell -a "/usr/local/bin/k3s-agent-uninstall.sh" --become
To uninstall compute-blade-agent:
# Uninstall from all worker nodes
ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh" --become
License
MIT