updating documentation

2026-01-08 17:48:49 +01:00
parent a2cf2a86d2
commit 4e0a3cf0cb
4 changed files with 177 additions and 113 deletions


@@ -68,36 +68,36 @@ bash scripts/verify-compute-blade-agent.sh
- [ ] Service status shows "Running"
- [ ] Config file exists at `/etc/compute-blade-agent/config.yaml`
### 3. Manual Verification on a Master Node
```bash
# Connect to any master (cm4-01, cm4-02, or cm4-03)
ssh pi@192.168.30.101
kubectl get nodes
```
- [ ] All 3 masters show as "Ready"
- [ ] Worker node (cm4-04) shows as "Ready"
### 4. Check Etcd Quorum
```bash
ssh pi@192.168.30.101
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
```
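If the bare `member list` call is rejected, k3s-managed etcd typically requires TLS flags; a sketch, assuming the default k3s certificate paths:
```bash
# Point etcdctl at the embedded etcd using the k3s-generated client certs
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list
```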
- [ ] All 3 etcd members show as active
- [ ] Cluster has quorum (2 of 3 members must be up)
### 5. Verify Kubeconfig
```bash
export KUBECONFIG=$(pwd)/kubeconfig
kubectl config get-contexts
```
- [ ] Shows contexts: cm4-01, cm4-02, cm4-03, and default
- [ ] All contexts point to correct control-plane nodes
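To spot-check a context, switch to it and confirm the API responds (standard kubectl commands):
```bash
kubectl config use-context cm4-02
kubectl get nodes
```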
## Optional: Kubernetes Monitoring Setup
@@ -159,15 +159,20 @@ enable_compute_blade_agent=true # or false
### Per-Node Configuration
Note: cm4-02 and cm4-03 are now **master nodes**, not workers. To enable/disable compute-blade-agent on specific nodes:
```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
```
- [ ] Per-node settings configured as needed
- [ ] Master nodes typically don't need compute-blade-agent
- [ ] Saved inventory file
- [ ] Re-run playbook if changes made
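Re-running uses the same tagged invocation as the initial deployment:
```bash
ansible-playbook site.yml --tags compute-blade-agent
```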
@@ -214,26 +219,36 @@ ansible worker -m shell -a "systemctl status compute-blade-agent" --become
- [ ] All workers show active status
## HA Cluster Maintenance
### Testing Failover
Your 3-node HA cluster can handle one master going down (etcd keeps quorum with 2 of 3 members):
```bash
# Reboot one master while monitoring cluster
ssh pi@192.168.30.101
sudo reboot
# From another terminal, watch cluster status
watch kubectl get nodes
```
- [ ] Cluster remains operational with 2/3 masters
- [ ] Pods continue running
- [ ] Can still kubectl from cm4-02 or cm4-03 context
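Once the rebooted master is back, confirm etcd membership recovered (same etcdctl invocation as in step 4, "Check Etcd Quorum"):
```bash
ssh pi@192.168.30.102
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
```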
## Uninstall (if needed)
### Uninstall K3s from All Nodes
```bash
ansible master -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become
```
- [ ] All K3s services stopped
- [ ] Cluster data cleaned up
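A quick post-uninstall check, assuming the default systemd unit names `k3s` (servers) and `k3s-agent` (workers):
```bash
# Both should report inactive/unknown on every node
ansible master -m shell -a "systemctl is-active k3s || true" --become
ansible worker -m shell -a "systemctl is-active k3s-agent || true" --become
```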
### Disable in Future Deployments


@@ -18,9 +18,9 @@ cat inventory/hosts.ini
Verify:
- Master nodes are correct (cm4-01, cm4-02, cm4-03)
- Worker node IP is correct (cm4-04)
- `enable_compute_blade_agent=true` is set (optional for masters)
### Step 2: Test Connectivity
@@ -46,17 +46,22 @@ This will:
**Total time**: ~30-45 minutes
### Step 4: Verify Cluster
```bash
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
```
You should see all 4 nodes ready (3 masters + 1 worker):
```bash
NAME STATUS ROLES AGE VERSION
cm4-01 Ready control-plane,etcd,master 5m v1.35.0+k3s1
cm4-02 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-03 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-04 Ready <none> 3m v1.35.0+k3s1
```
## Configuration
@@ -215,22 +220,31 @@ sudo systemctl status compute-blade-agent
## Common Tasks
### Check Cluster Status
```bash
ansible worker -m shell -a "sudo systemctl restart compute-blade-agent" --become
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
kubectl get pods --all-namespaces
```
### Access Any Master Node
```bash
ansible worker -m shell -a "sudo journalctl -u compute-blade-agent -n 20" --become
# Access cm4-01
ssh pi@192.168.30.101
# Or access cm4-02 (backup master)
ssh pi@192.168.30.102
# Or access cm4-03 (backup master)
ssh pi@192.168.30.103
```
### Deploy Only to Specific Nodes
```bash
ansible-playbook site.yml --tags compute-blade-agent --limit cm4-04
```
### Disable Agent for Next Deployment
@@ -257,12 +271,12 @@ ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agen
ansible master -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
```
## Documentation
- **README.md** - Full guide with all configuration options
- **DEPLOYMENT_CHECKLIST.md** - Step-by-step checklist
- **COMPUTE_BLADE_AGENT.md** - Quick reference for agent deployment
- **MIKROTIK-VIP-SETUP-CUSTOM.md** - Virtual IP failover configuration
## File Locations


@@ -8,16 +8,18 @@ Customized setup guide for your MikroTik RouterOS configuration.
Uplink Network: 192.168.1.0/24 (br-uplink - WAN/External)
LAB Network: 192.168.30.0/24 (br-lab - K3s Cluster)
K3s Nodes (3-node HA Cluster):
cm4-01: 192.168.30.101 (Master/Control-Plane)
cm4-02: 192.168.30.102 (Master/Control-Plane)
cm4-03: 192.168.30.103 (Master/Control-Plane)
cm4-04: 192.168.30.104 (Worker)
Virtual IP to Create:
192.168.30.100/24 (on br-lab bridge - HAProxy or MikroTik failover)
```
**⚠️ Important Note**: The basic NAT rules below will route to cm4-01 only. To achieve true failover in your 3-node HA cluster, activate the health check script (Step 8) so traffic automatically routes to another master if cm4-01 goes down.
## Step 1: Add Virtual IP Address on MikroTik
Since your K3s nodes are on the `br-lab` bridge, add the VIP there:
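A minimal sketch of the RouterOS command (interface name taken from the topology above):
```mikrotik
/ip address add address=192.168.30.100/24 interface=br-lab comment="K3s VIP"
```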
@@ -183,9 +185,9 @@ curl http://test.zlor.fi
curl -k https://test.zlor.fi
```
## Step 8: Add Health Check Script (Recommended for HA)
**For automatic failover with your 3-node HA cluster**, create a health check script that monitors the master node and updates NAT rules if it goes down. This ensures traffic automatically routes to cm4-02 or cm4-03 if cm4-01 fails.
### Create Health Check Script
@@ -237,6 +239,8 @@ For automatic failover, create a health check script that monitors the master no
comment="Monitor K3s cluster and update VIP routes"
```
**Status**: This scheduler will run every 30 seconds and automatically switch the VIP NAT rules to an available master if cm4-01 becomes unreachable.
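You can confirm the scheduler is registered with a standard RouterOS print:
```mikrotik
/system scheduler print
```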
### View Health Check Logs
```mikrotik
@@ -247,14 +251,33 @@ For automatic failover, create a health check script that monitors the master no
## Verification Checklist
- [ ] VIP address (192.168.30.100) added to br-lab
- [ ] NAT rules for port 80 and 443 created (routed to cm4-01)
- [ ] Firewall rules allow traffic to VIP
- [ ] Ping 192.168.30.100 succeeds
- [ ] curl http://192.168.30.100 returns nginx page
- [ ] DNS A record added: test.zlor.fi → 192.168.30.100
- [ ] curl http://test.zlor.fi works
- [ ] **Health check script created** (recommended for HA failover)
- [ ] **Health check scheduled** (recommended for HA failover)
- [ ] Failover tested (see "Testing Failover" below)
## Testing Failover (HA Cluster)
If you've enabled the health check script, you can test automatic failover:
```bash
# From your machine, start monitoring
watch -n 5 'curl -v http://192.168.30.100 2>&1 | grep "200 OK\|Connected"'
# In another terminal, SSH to cm4-01 and reboot it
ssh pi@192.168.30.101
sudo reboot
# Watch the curl output - after ~30 seconds, it should reconnect
# This means the health check script switched traffic to cm4-02 or cm4-03
```
**Expected result**: Traffic stays online through the reboot, apart from a ~30-second switchover window.
## Troubleshooting
@@ -368,16 +391,27 @@ Your VIP is now configured on MikroTik:
```
External Traffic
192.168.30.100:80/443 (VIP on br-lab)
NAT Rule Routes to 192.168.30.101:80/443 (cm4-01 Master)
K3s Master Node (cm4-01)
If Health Check Enabled:
- Routes to cm4-02 if cm4-01 is down (checked every 30 seconds)
- Routes to cm4-03 if both cm4-01 and cm4-02 are down
Ingress → K3s Service → Pods
```
**DNS**: `test.zlor.fi → 192.168.30.100`
**Status**:
- ✅ Single IP for entire cluster
- ✅ Automatic failover (with health check script)
- ✅ 3-node HA masters provide etcd quorum
**Next Steps**:
1. Enable health check script (Step 8) for automatic failover
2. Test failover by rebooting cm4-01 and monitoring connectivity
3. Your cluster now has true high availability!


@@ -42,19 +42,19 @@ Edit `inventory/hosts.ini` and add your Raspberry Pi nodes:
```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false
[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi
```
### 2. Configure Variables
In `inventory/hosts.ini`, you can customize:
- `k3s_version`: K3s version to install (default: v1.35.0+k3s1)
- `extra_server_args`: Additional arguments for k3s server
- `extra_agent_args`: Additional arguments for k3s agent
- `extra_packages`: List of additional packages to install on all nodes
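A hypothetical override block showing these variables together (the values are illustrative; `--disable traefik` is a standard k3s server flag):
```ini
[k3s_cluster:vars]
k3s_version=v1.35.0+k3s1
extra_server_args=--disable traefik
extra_packages=["htop", "iotop"]
```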
@@ -304,20 +304,21 @@ kubectl get nodes
You should see all your nodes in Ready state:
```bash
NAME STATUS ROLES AGE VERSION
cm4-01 Ready control-plane,etcd,master 5m v1.35.0+k3s1
cm4-02 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-03 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-04 Ready <none> 3m v1.35.0+k3s1
```
## Accessing the Cluster
### From Master Node
SSH into a master node and use kubectl:
```bash
ssh pi@192.168.30.101
kubectl get nodes
```
@@ -461,8 +462,11 @@ nginx-test-7d8f4c9b6d-xr5wp 1/1 Running 0 1m pi-worker-2
Add a cluster node IP (any master or worker) to /etc/hosts:
```bash
# Any cluster node IP (master or worker) works; /etc/hosts uses the
# first matching entry, so leave only one line active
192.168.30.101 nginx-test.local nginx.pi.local
#192.168.30.102 nginx-test.local nginx.pi.local
#192.168.30.103 nginx-test.local nginx.pi.local
#192.168.30.104 nginx-test.local nginx.pi.local
```
Then access via browser:
@@ -473,8 +477,9 @@ Then access via browser:
Or test with curl:
```bash
# Test with any cluster node IP (master or worker)
curl -H "Host: nginx-test.local" http://192.168.30.101
curl -H "Host: nginx-test.local" http://192.168.30.102
```
### Scale the Deployment
@@ -624,7 +629,7 @@ ansible-playbook site.yml --tags k3s-server --limit <failed-master>
### Demoting a Master to Worker
To remove a master from control-plane and make it a worker (note: this reduces HA from 3-node to 2-node):
1. Edit `inventory/hosts.ini`:
@@ -638,6 +643,8 @@ To remove a master from control-plane and make it a worker:
cm4-04 ansible_host=192.168.30.104 ansible_user=pi
```
**Warning**: This leaves only 2 master nodes. A 2-member etcd cluster needs both members for quorum, so losing either master takes the control plane down; keep an odd number of masters (1 or 3).
2. Drain the node:
```bash
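# A typical drain invocation (example node name; adjust flags to your workloads)
kubectl drain cm4-03 --ignore-daemonsets --delete-emptydir-data
```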
@@ -690,7 +697,7 @@ To update to a specific k3s version:
```ini
[k3s_cluster:vars]
k3s_version=v1.36.0+k3s1
```
1. Run the k3s playbook to update all nodes:
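```bash
# Same tags as the rollback example later in this guide
ansible-playbook site.yml --tags k3s-server,k3s-agent
```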
@@ -711,7 +718,7 @@ For more control, you can manually update k3s on individual nodes:
ssh pi@<node-ip>
# Download and install specific version
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.36.0+k3s1 sh -
# Restart k3s
sudo systemctl restart k3s # On master
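sudo systemctl restart k3s-agent  # On workers (assumes the default k3s agent unit name)
```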
@@ -775,7 +782,7 @@ If an update causes issues, you can rollback to a previous version:
```bash
# Update inventory with previous version
# [k3s_cluster:vars]
# k3s_version=v1.35.0+k3s1
# Re-run the playbook
ansible-playbook site.yml --tags k3s-server,k3s-agent
@@ -814,7 +821,7 @@ ansible-playbook reboot.yml --limit master
### Reboot a Specific Node
```bash
ansible-playbook reboot.yml --limit cm4-04
```
## Troubleshooting
@@ -1001,26 +1008,33 @@ ansible-playbook site.yml --tags compute-blade-agent
## External DNS Configuration
To use external domains (like `test.zlor.fi`) with your k3s cluster ingress, you need to configure DNS. Your cluster uses a Virtual IP (192.168.30.100) via MikroTik for high availability.
### Step 1: Configure DNS Server Records
On your DNS server, add **A records** pointing to your k3s cluster nodes:
#### Option A: Virtual IP (VIP) via MikroTik - Recommended for HA
Use your MikroTik router's Virtual IP (192.168.30.100) for high availability:
```dns
test.zlor.fi A 192.168.30.100
```
**Pros:**
- Single IP for entire cluster
- Hardware-based failover (more reliable)
- Better performance
- No additional software needed
- Automatically routes to available masters
See [MIKROTIK-VIP-SETUP-CUSTOM.md](MIKROTIK-VIP-SETUP-CUSTOM.md) for detailed setup instructions.
#### Option B: Multiple Records (Load Balanced)
If your DNS supports multiple A records, point to all cluster nodes:
```dns
test.zlor.fi A 192.168.30.101
@@ -1029,32 +1043,19 @@ test.zlor.fi A 192.168.30.103
test.zlor.fi A 192.168.30.104
```
DNS clients will distribute requests across all nodes (round-robin).
**Pros:** Load balanced; rough failover (clients must retry another record if one node is down)
**Cons:** Requires DNS server support for multiple A records
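To confirm round-robin resolution, query the record a few times (a sketch assuming `dig` is available):
```bash
# Should list all four A records; most resolvers rotate the order
dig +short test.zlor.fi
```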
#### Option C: Single Master Node (No Failover)
For simple setups without redundancy:
```dns
test.zlor.fi A 192.168.30.101
```
**Pros:** Simple, works with any DNS server
**Cons:** No failover if that node is down (not recommended for HA clusters)
### Step 2: Configure Cluster Nodes for External DNS