
K3s Ansible Deployment for Raspberry Pi CM4/CM5

🚀 Production-ready Kubernetes cluster automation for Raspberry Pi Compute Module 4/5 hardware with built-in monitoring, high availability, and hardware management.

Features

  • 🔄 3-node HA control plane with automatic failover
  • 📊 Comprehensive monitoring (Telegraf → InfluxDB → Grafana)
  • 🌐 Traefik ingress with SSL support
  • 🖥️ Compute Blade Agent for hardware monitoring
  • 📈 Prometheus metrics with custom dashboards
  • 🔧 One-command deployment and maintenance

📋 Prerequisites

  • Hardware: Raspberry Pi CM4/CM5 modules
  • OS: Raspberry Pi OS (64-bit recommended)
  • Network: SSH access to all nodes
  • Control machine: Ansible installed
  • Authentication: SSH key-based configured

🏗️ Project Structure

k3s-ansible/
├── 📄 ansible.cfg                    # Ansible configuration
├── 📄 site.yml                       # Main deployment playbook
├── 📁 inventory/
│   └── 📄 hosts.ini                  # Cluster inventory
├── 📁 manifests/                     # Kubernetes manifests
│   └── 📄 nginx-test-deployment.yaml # Test application
├── 📁 roles/                         # Ansible roles
│   ├── 📁 prereq/                    # System preparation
│   ├── 📁 k3s-server/               # Control-plane setup
│   ├── 📁 k3s-agent/                # Worker node setup
│   ├── 📁 k3s-deploy-test/          # Test deployment
│   ├── 📁 compute-blade-agent/      # Hardware monitoring
│   ├── 📁 prometheus-operator/      # Monitoring stack
│   └── 📁 telegraf/                 # Metrics collection
├── 📁 grafana/                       # Grafana dashboards
├── 📁 influxdb/                      # InfluxDB dashboards
└── 📄 telegraf.yml                  # Metrics deployment

⚙️ Quick Setup

1. Configure Inventory

Edit inventory/hosts.ini with your node details:

[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false

[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi
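
As a quick sanity check before running Ansible, you can parse the host/IP pairs back out of the inventory. A minimal sketch, run here against a temporary copy of the example above (point the awk at inventory/hosts.ini in a real checkout):

```shell
# Sketch: list "<host> <ip>" pairs from an INI-style inventory.
# A temporary copy of the example stands in for inventory/hosts.ini.
cat > /tmp/hosts.ini <<'EOF'
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false

[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi
EOF

awk 'match($0, /ansible_host=[0-9.]+/) {
       split(substr($0, RSTART, RLENGTH), kv, "=")
       print $1, kv[2]        # node name, then its IP
     }' /tmp/hosts.ini
```

Once Ansible is installed, `ansible-inventory --list` gives the authoritative parsed view of the same file.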

2. Key Configuration Options

[k3s_cluster:vars]
k3s_version=v1.35.0+k3s1              # K3s version
extra_packages=btop,vim,tmux,net-tools # System utilities
enable_compute_blade_agent=true        # Hardware monitoring
enable_prometheus_operator=true        # Monitoring stack

3. Setup Environment Variables

Create a .env file in the repository root with your credentials:

cat > .env << EOF
INFLUXDB_HOST=192.168.10.10
INFLUXDB_PORT=8086
INFLUXDB_ORG=family
INFLUXDB_BUCKET=rpi-cluster
INFLUXDB_TOKEN=your-api-token-here
EOF
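
Whether the playbooks read these values from your shell environment or load the file themselves depends on how telegraf.yml consumes them; if you need them exported, the usual pattern is `set -a` sourcing. A sketch against a throwaway copy so nothing touches your real .env:

```shell
# Sketch: export every KEY=VALUE pair in .env into the current shell.
# Run from the repository root; a throwaway directory stands in here.
cd "$(mktemp -d)"
cat > .env <<'EOF'
INFLUXDB_HOST=192.168.10.10
INFLUXDB_PORT=8086
INFLUXDB_ORG=family
EOF

set -a        # auto-export everything defined while this is active
. ./.env
set +a

echo "Telegraf target: http://${INFLUXDB_HOST}:${INFLUXDB_PORT} (org: ${INFLUXDB_ORG})"
```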

⚠️ Security Note: This file is ignored by Git (.gitignore) and should never be committed. Keep actual tokens secure and only on your local machine.

4. Test Connectivity

ansible all -m ping

🚀 Deployment Commands

Prerequisites: Make sure your inventory/hosts.ini is configured and .env file is created (see Setup steps above).

Full Cluster Deployment

ansible-playbook site.yml

Component-Specific Deployment

# Prepare nodes only
ansible-playbook site.yml --tags prereq

# Deploy monitoring
ansible-playbook telegraf.yml

# Deploy test application only
ansible-playbook site.yml --tags deploy-test

# Skip test deployment
ansible-playbook site.yml --skip-tags test

📊 Monitoring Setup

Telegraf Metrics Collection

1. Configure InfluxDB credentials in .env:

INFLUXDB_HOST=192.168.10.10
INFLUXDB_PORT=8086
INFLUXDB_ORG=family
INFLUXDB_BUCKET=rpi-cluster
INFLUXDB_TOKEN=your-api-token-here

2. Deploy Telegraf:

ansible-playbook telegraf.yml

Metrics Collected:

  • 🖥️ System: CPU, memory, processes, load
  • 💾 Disk: I/O, usage, inodes
  • 🌐 Network: Interfaces, packets, errors
  • 🌡️ Thermal: CPU temperature (Pi-specific)
  • ⚙️ K3s: Process metrics

Dashboard Options

📈 Grafana Dashboard

# Import: grafana/rpi-cluster-dashboard.json
# Features: Interactive visualizations, alerts, node-specific views

📊 InfluxDB Dashboard

# Import: influxdb/rpi-cluster-dashboard-v2.json
# Features: Native integration, real-time data, built-in alerts

🎯 What Gets Deployed

📋 System Preparation (prereq)

  • Hostname configuration
  • System updates & package installation
  • cgroup memory & swap configuration
  • Legacy iptables setup (ARM requirement)
  • Swap disabling

🎯 Control Plane (k3s-server)

  • K3s server installation
  • Flannel VXLAN networking (ARM optimized)
  • Cluster token management
  • Kubeconfig generation & retrieval

👥 Worker Nodes (k3s-agent)

  • K3s agent installation
  • Cluster joining via master token
  • Network configuration

🧪 Test Application (k3s-deploy-test)

  • Nginx deployment (5 replicas)
  • Ingress configuration
  • Health verification
  • Pod distribution analysis

🎉 Post-Installation

Access Your Cluster

📁 Kubeconfig Location: ./kubeconfig

🔧 Quick Setup:

export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes

Expected Output:

NAME     STATUS   ROLES                       AGE   VERSION
cm4-01   Ready    control-plane,etcd,master   5m    v1.35.0+k3s1
cm4-02   Ready    control-plane,etcd          3m    v1.35.0+k3s1
cm4-03   Ready    control-plane,etcd          3m    v1.35.0+k3s1
cm4-04   Ready    <none>                      3m    v1.35.0+k3s1

Access Options

🌐 Local Machine Access

# Option 1: Environment variable
export KUBECONFIG=$(pwd)/kubeconfig

# Option 2: Merge with existing config
KUBECONFIG=~/.kube/config:$(pwd)/kubeconfig kubectl config view --flatten > ~/.kube/config.tmp
mv ~/.kube/config.tmp ~/.kube/config
kubectl config rename-context default k3s-pi-cluster

# Option 3: Direct usage
kubectl --kubeconfig=./kubeconfig get nodes

🖥️ Direct SSH Access

ssh pi@192.168.30.101
kubectl get nodes

🌐 Ingress & Networking

Traefik Ingress Controller

Pre-installed and ready to use!

How it works:

  • 🎯 Listens on ports 80 (HTTP) & 443 (HTTPS)
  • 🔄 Routes traffic by hostname
  • 📦 Multiple apps share same IP via different domains
  • Zero additional configuration needed

Verify Traefik:

kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
kubectl get svc -n kube-system traefik
kubectl get ingress

🧪 Test Your Cluster

Automated Test Deployment

# Deploy with full cluster
ansible-playbook site.yml

# Deploy test app only
ansible-playbook site.yml --tags deploy-test

Manual Test Deployment

kubectl apply -f manifests/nginx-test-deployment.yaml

Verify Test Deployment

kubectl get deployments
kubectl get pods -o wide
kubectl get ingress

Expected Output:

NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test   5/5     5            5           1m

NAME                          READY   STATUS    NODE
nginx-test-7d8f4c9b6d-2xk4p   1/1     Running   cm4-04
nginx-test-7d8f4c9b6d-4mz9r   1/1     Running   cm4-04
nginx-test-7d8f4c9b6d-7w3qs   1/1     Running   cm4-03
nginx-test-7d8f4c9b6d-9k2ln   1/1     Running   cm4-03
nginx-test-7d8f4c9b6d-xr5wp   1/1     Running   cm4-02
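
To summarize how replicas spread across nodes (the "pod distribution analysis" the test role performs), count pods per node. A sketch fed with sample text like the output above — in practice, pipe `kubectl get pods -o wide` straight into the awk:

```shell
# Count pods per node from `kubectl get pods -o wide`-style output.
# Sample text stands in for a live cluster here.
cat <<'EOF' | awk 'NR > 1 { pods[$NF]++ } END { for (n in pods) print n, pods[n] }' | sort
NAME                          READY   STATUS    NODE
nginx-test-7d8f4c9b6d-2xk4p   1/1     Running   cm4-04
nginx-test-7d8f4c9b6d-4mz9r   1/1     Running   cm4-04
nginx-test-7d8f4c9b6d-7w3qs   1/1     Running   cm4-03
nginx-test-7d8f4c9b6d-9k2ln   1/1     Running   cm4-03
nginx-test-7d8f4c9b6d-xr5wp   1/1     Running   cm4-02
EOF
```

This prints one `node count` line per node (here: cm4-02 1, cm4-03 2, cm4-04 2).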

Access Test Application

1. Add to /etc/hosts (one entry is enough — the resolver uses only the first match for a name; pick any node and Traefik routes from there):

192.168.30.101  nginx-test.local

2. Access via browser: http://nginx-test.local

3. Test with curl:

curl -H "Host: nginx-test.local" http://192.168.30.101

Scale Test

# Scale up/down
kubectl scale deployment nginx-test --replicas=10
kubectl scale deployment nginx-test --replicas=3

# Watch scaling
kubectl get pods -w

Cleanup

kubectl delete -f manifests/nginx-test-deployment.yaml

🛡️ High Availability

3-Node Control Plane

Production-ready HA setup

Architecture:

  • 🎯 Control Plane: cm4-01, cm4-02, cm4-03
  • 👥 Workers: cm4-04
  • 🌐 Virtual IP: 192.168.30.100 (MikroTik)

Benefits:

  • 🚫 No SPOF - Cluster survives master failures
  • 🔄 Auto failover - Seamless master switching
  • Load distribution - API server & etcd spread across nodes
  • 🔧 Zero downtime maintenance - Update masters one-by-one
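
The three-master count follows from etcd quorum arithmetic: a cluster of N voting members needs floor(N/2)+1 of them up, so 3 members tolerate exactly one failure, while 2 (or 4) members tolerate no more than 3 do — which is why odd counts are used:

```shell
# etcd quorum: N members need floor(N/2)+1 up; failures tolerated = N - quorum
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```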

Master Management

🔍 Monitor Master Health:

kubectl get nodes -L node-role.kubernetes.io/control-plane
kubectl get nodes --show-labels | grep control-plane

⬆️ Promote Worker to Master:

# Edit inventory/hosts.ini: move the node from [worker] to [master]
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false
cm4-04 ansible_host=192.168.30.104 ansible_user=pi k3s_server_init=false  # Promoted

[worker]
# (cm4-04 removed from here)

# Re-run the server role
ansible-playbook site.yml --tags k3s-server

🔄 Master Recovery:

# Reset failed master
ssh pi@<failed-master-ip>
sudo /usr/local/bin/k3s-uninstall.sh

# Rejoin cluster
ansible-playbook site.yml --tags k3s-server --limit <failed-master>

🔧 Maintenance

Cluster Updates

🚀 Auto Updates (Recommended):

# inventory/hosts.ini
[k3s_cluster:vars]
k3s_version=latest

# Re-run the install roles
ansible-playbook site.yml --tags k3s-server,k3s-agent

🎯 Manual Version Update:

# inventory/hosts.ini
k3s_version=v1.36.0+k3s1
# ⚠️ Update masters first!
ansible-playbook site.yml --tags k3s-server,k3s-agent

📊 Check Versions:

kubectl version
kubectl get nodes -o wide
ansible all -m shell -a "k3s --version" --become
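
If the ansible command above returns a mix of versions mid-rollout, version-aware sort makes the spread easy to read. A small sketch, with sample strings standing in for real node output:

```shell
# Order k3s version strings oldest -> newest with GNU version sort
printf '%s\n' v1.36.0+k3s1 v1.35.0+k3s1 v1.35.2+k3s1 | sort -V
```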

Post-Update Verification:

kubectl get nodes
kubectl get pods --all-namespaces
kubectl cluster-info

🔄 Rollback if Needed:

# Set previous version in inventory
k3s_version=v1.35.0+k3s1
ansible-playbook site.yml --tags k3s-server,k3s-agent

Safe Reboots

🔄 Full Cluster Reboot:

ansible-playbook reboot.yml

Reboots workers first, then masters (serially)

🎯 Selective Reboots:

ansible-playbook reboot.yml --limit worker    # Workers only
ansible-playbook reboot.yml --limit master    # Masters only  
ansible-playbook reboot.yml --limit cm4-04     # Specific node

🐛 Troubleshooting

Service Status

# Master nodes
sudo systemctl status k3s
sudo journalctl -u k3s -f

# Worker nodes  
sudo systemctl status k3s-agent
sudo journalctl -u k3s-agent -f

Node Reset

# Reset server
/usr/local/bin/k3s-uninstall.sh

# Reset agent
/usr/local/bin/k3s-agent-uninstall.sh

Common Issues

  • 🔥 Nodes not joining: Check firewall (TCP 6443 for the K3s API; UDP 8472 for flannel VXLAN)
  • 💾 Memory issues: Verify cgroup memory enabled
  • 🌐 Network issues: VXLAN backend optimized for ARM

🎛️ Customization

Add More Masters

[master]
pi-master-1 ansible_host=192.168.30.100 ansible_user=pi
pi-master-2 ansible_host=192.168.30.101 ansible_user=pi
pi-master-3 ansible_host=192.168.30.102 ansible_user=pi

Custom K3s Args

[k3s_cluster:vars]
extra_server_args="--flannel-backend=vxlan --disable traefik --disable servicelb"
extra_agent_args="--node-label foo=bar"

🖥️ Compute Blade Agent

🔧 Hardware monitoring for Compute Blade systems

Components

  • 🖥️ compute-blade-agent: Hardware monitoring daemon
  • 🛠️ bladectl: CLI tool for agent interaction
  • fanunit.uf2: Fan controller firmware

Configuration

# Enable/disable in inventory/hosts.ini
enable_compute_blade_agent=true

# Per-node override
cm4-01 ansible_host=192.168.30.101 enable_compute_blade_agent=true
cm4-02 ansible_host=192.168.30.102 enable_compute_blade_agent=false

Deployment

# Auto-deployed with main playbook
ansible-playbook site.yml

# Deploy only blade agent
ansible-playbook site.yml --tags compute-blade-agent

Verification

# Check service status
sudo systemctl status compute-blade-agent
sudo journalctl -u compute-blade-agent -f

# Check binary
/usr/bin/compute-blade-agent --version

Features

  • 🌡️ Hardware monitoring: Temperature, fans, buttons
  • 🚨 Critical mode: Auto max fan + red LED on overheating
  • 🔍 Identification: LED blade locator
  • 📊 Metrics: Prometheus endpoint

Monitoring Setup

# Deploy Prometheus monitoring
kubectl apply -f manifests/compute-blade-agent-daemonset.yaml

🌍 External DNS Setup

Point your domain at the cluster so Traefik can route external requests. Three common options:

Step 1: Choose a DNS Configuration

Option A: MikroTik VIP (Recommended)

Point a single A record at the virtual IP fronting the control plane:

test.zlor.fi  A  192.168.30.100  # MikroTik VIP

Pros:

  • Single IP for entire cluster
  • Hardware-based failover (more reliable)
  • Better performance
  • No additional software needed
  • Automatically routes to available masters

See MIKROTIK-VIP-SETUP-CUSTOM.md for detailed setup instructions.

Option B: Multiple Records (Load Balanced)

If your DNS supports multiple A records, point to all cluster nodes:

test.zlor.fi  A  192.168.30.101
test.zlor.fi  A  192.168.30.102
test.zlor.fi  A  192.168.30.103
test.zlor.fi  A  192.168.30.104

Pros: Load balanced, automatic failover Cons: Requires DNS server support for multiple A records

Option C: Single Master Node (No Failover)

For simple setups without redundancy:

test.zlor.fi  A  192.168.30.101

Pros: Simple, works with any DNS server Cons: No failover if that node is down (not recommended for HA clusters)

Step 2: Configure Cluster Nodes for External DNS

K3s nodes need to be able to resolve external DNS queries. Update the DNS resolver on all nodes:

Option A: Ansible Playbook

Create a new playbook dns-config.yml:

---
- name: Configure external DNS resolver
  hosts: all
  become: true
  tasks:
    - name: Update /etc/resolv.conf with custom DNS
      copy:
        content: |
          nameserver 8.8.8.8
          nameserver 8.8.4.4
          nameserver 192.168.1.1
        dest: /etc/resolv.conf
        owner: root
        group: root
        mode: '0644'

    - name: Make resolv.conf immutable
      file:
        path: /etc/resolv.conf
        attributes: '+i'
        state: file

    - name: Configure systemd-resolved for external DNS
      copy:
        content: |
          [Resolve]
          DNS=8.8.8.8 8.8.4.4 192.168.1.1
          FallbackDNS=8.8.8.8
          DNSSEC=allow-downgrade
        dest: /etc/systemd/resolved.conf
        owner: root
        group: root
        mode: '0644'
      notify: Restart systemd-resolved

  handlers:
    - name: Restart systemd-resolved
      systemd:
        name: systemd-resolved
        state: restarted
        daemon_reload: yes

Apply the playbook:

ansible-playbook dns-config.yml

Option B: Manual Configuration on Each Node

SSH into each node and update DNS:

ssh pi@192.168.30.101
sudo nano /etc/systemd/resolved.conf

Add or modify:

[Resolve]
DNS=8.8.8.8 8.8.4.4 192.168.1.1
FallbackDNS=8.8.8.8
DNSSEC=allow-downgrade

Save and restart:

sudo systemctl restart systemd-resolved

Verify DNS is working:

nslookup test.zlor.fi
dig test.zlor.fi

Step 3: Update Ingress Configuration

Your nginx-test deployment has already been updated to include test.zlor.fi. Verify the ingress:

kubectl get ingress nginx-test -o yaml

You should see:

spec:
  rules:
  - host: test.zlor.fi
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-test
            port:
              number: 80

Step 4: Test External Domain Access

Once DNS is configured, test access from your local machine:

# Test DNS resolution
nslookup test.zlor.fi

# Test HTTP access
curl http://test.zlor.fi

# With verbose output
curl -v http://test.zlor.fi

# Test from all cluster IPs
for ip in 192.168.30.{101..104}; do
  echo "Testing $ip:"
  curl -H "Host: test.zlor.fi" http://$ip
done
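
The `{101..104}` range in the loop above is bash brace expansion: the shell expands it to the four node IPs before the loop body ever runs, so this loop requires bash rather than POSIX sh:

```shell
# Brace expansion generates the IP list before the loop body runs
printf '%s\n' 192.168.30.{101..104}
# → 192.168.30.101 through 192.168.30.104, one per line
```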

Troubleshooting DNS

DNS Resolution Failing

Check if systemd-resolved is running:

systemctl status systemd-resolved

Test DNS from a node:

ssh pi@192.168.30.101
nslookup test.zlor.fi
dig test.zlor.fi @8.8.8.8

Ingress Not Responding

Check if Traefik is running:

kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

Check ingress status:

kubectl get ingress
kubectl describe ingress nginx-test

Request Timing Out

Verify network connectivity:

# From your machine
ping 192.168.30.101
ping 192.168.30.102

# From a cluster node
ssh pi@192.168.30.101
ping test.zlor.fi
curl -v http://test.zlor.fi

Adding More Domains

To add additional domains (e.g., api.zlor.fi, admin.zlor.fi):

  1. Add DNS A records for each domain pointing to your cluster nodes
  2. Update the ingress YAML with new rules:
spec:
  rules:
  - host: test.zlor.fi
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-test
            port:
              number: 80
  - host: api.zlor.fi
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
  3. Apply the updated manifest:
kubectl apply -f manifests/nginx-test-deployment.yaml

🗑️ Uninstall

Complete Cluster Removal

# Remove K3s from control-plane nodes
ansible master -m shell -a "/usr/local/bin/k3s-uninstall.sh" --become

# Remove K3s from worker nodes
ansible worker -m shell -a "/usr/local/bin/k3s-agent-uninstall.sh" --become

# Remove compute-blade-agent
ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh" --become

📄 License

MIT License

🎉 Happy clustering!

For issues or questions, check the troubleshooting section above or the upstream K3s and Ansible documentation.
