updating documentation

2026-01-08 17:48:49 +01:00
parent a2cf2a86d2
commit 4e0a3cf0cb
4 changed files with 177 additions and 113 deletions
@@ -68,36 +68,36 @@ bash scripts/verify-compute-blade-agent.sh
 - [ ] Service status shows "Running"
 - [ ] Config file exists at `/etc/compute-blade-agent/config.yaml`
-### 3. Manual Verification on a Worker
+### 3. Manual Verification on a Master Node
 ```bash
-ssh pi@192.168.30.102
+# Connect to any master (cm4-01, cm4-02, or cm4-03)
-sudo systemctl status compute-blade-agent
+ssh pi@192.168.30.101
 kubectl get nodes
 ```
- [ ] Service is active (running)
+- [ ] All 3 masters show as "Ready"
- [ ] Service is enabled (will start on boot)
+- [ ] Worker node (cm4-04) shows as "Ready"
-### 4. Check Logs
+### 4. Check Etcd Quorum
 ```bash
-ssh pi@192.168.30.102
+ssh pi@192.168.30.101
-sudo journalctl -u compute-blade-agent -n 50
+sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
 ```
- [ ] No error messages
+- [ ] All 3 etcd members show as active
- [ ] Service started successfully
+- [ ] Cluster has quorum (2/3 minimum for failover)
 - [ ] Hardware detection messages present (if applicable)
-### 5. Verify Installation
+### 5. Verify Kubeconfig
 ```bash
-ssh pi@192.168.30.102
+export KUBECONFIG=$(pwd)/kubeconfig
-/usr/local/bin/compute-blade-agent --version
+kubectl config get-contexts
 ```
- [ ] Binary responds with version information
+- [ ] Shows contexts: cm4-01, cm4-02, cm4-03, and default
- [ ] bladectl CLI tool is available
+- [ ] All contexts point to correct control-plane nodes
 ## Optional: Kubernetes Monitoring Setup
@@ -159,15 +159,20 @@ enable_compute_blade_agent=true  # or false
 ### Per-Node Configuration
-To enable/disable specific nodes, edit `inventory/hosts.ini`:
+Note: cm4-02 and cm4-03 are now **master nodes**, not workers. To enable/disable compute-blade-agent on specific nodes, edit `inventory/hosts.ini`:
 ```ini
 [master]
 cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
 cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
 cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
 [worker]
-cm4-02 ansible_host=... enable_compute_blade_agent=false
+cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
 cm4-03 ansible_host=... enable_compute_blade_agent=true
 ```
 - [ ] Per-node settings configured as needed
 - [ ] Master nodes typically don't need compute-blade-agent
 - [ ] Saved inventory file
 - [ ] Re-run playbook if changes made
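Before re-running the playbook, it can help to confirm which hosts actually have the agent enabled. The following is a minimal sketch, not part of the repository: `agent_enabled_hosts` is a hypothetical helper name, and it only inspects host-level `enable_compute_blade_agent=true` settings in an INI inventory (group-level variables are ignored).

```bash
# Hypothetical helper: print the inventory hostname of every host line that
# sets enable_compute_blade_agent=true. Reads an INI inventory on stdin.
agent_enabled_hosts() {
  awk '
    NF == 0        { next }   # skip blank lines
    $1 ~ /^[#;]/   { next }   # skip comments
    $1 ~ /^\[/     { next }   # skip [section] headers
    {
      # $1 is the hostname; the remaining fields are key=value variables
      for (i = 2; i <= NF; i++)
        if ($i == "enable_compute_blade_agent=true") { print $1; break }
    }
  '
}

# Usage against the real file:
#   agent_enabled_hosts < inventory/hosts.ini

# Offline demonstration with the inventory layout from this section.
agent_enabled_hosts <<'EOF'
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
EOF
```

With the layout above, only the worker is listed, which matches the checklist note that master nodes typically do not need the agent.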
@@ -214,26 +219,36 @@ ansible worker -m shell -a "systemctl status compute-blade-agent" --become
 - [ ] All workers show active status
 ## HA Cluster Maintenance
 ### Testing Failover
 Your 3-node HA cluster can handle one master going down (maintains 2/3 quorum):
 ```bash
 # Reboot one master while monitoring cluster
 ssh pi@192.168.30.101
 sudo reboot
 # From another terminal, watch cluster status
 watch kubectl get nodes
 ```
 - [ ] Cluster remains operational with 2/3 masters
 - [ ] Pods continue running
 - [ ] Can still kubectl from cm4-02 or cm4-03 context
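The "2/3 quorum" figure follows from the usual majority rule for etcd: a cluster of N voting members needs floor(N/2) + 1 of them available. The sketch below works through that arithmetic; the function names `etcd_quorum` and `etcd_fault_tolerance` are illustrative only and are not part of k3s or etcd.

```bash
# Quorum size for a cluster of N voting members: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare a few cluster sizes, including the 3-master layout used here.
for members in 1 2 3 5; do
  echo "members=$members quorum=$(etcd_quorum "$members") tolerates=$(etcd_fault_tolerance "$members")"
done
```

For 3 members the quorum is 2 and one failure is tolerated. For 2 members the quorum is also 2, so no failure is tolerated, which is why an even-sized control plane adds no resilience over the next smaller odd size.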
 ## Uninstall (if needed)
-### Uninstall from Single Node
+### Uninstall K3s from All Nodes
 ```bash
-ssh pi@<worker-ip>
+ansible all -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
-sudo bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh
+ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become
 ```
- [ ] Uninstall script executed
+- [ ] All K3s services stopped
- [ ] Service removed
+- [ ] Cluster data cleaned up
 - [ ] Configuration cleaned up
 ### Uninstall from All Workers
 ```bash
 ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh" --become
 ```
 - [ ] All workers uninstalled
 ### Disable in Future Deployments
@@ -18,9 +18,9 @@ cat inventory/hosts.ini
 Verify:
- Master node IP is correct (cm4-01)
+- Master nodes are correct (cm4-01, cm4-02, cm4-03)
- Worker node IPs are correct (cm4-02, cm4-03, cm4-04)
+- Worker node IP is correct (cm4-04)
- `enable_compute_blade_agent=true` is set
+- `enable_compute_blade_agent=true` is set (optional for masters)
 ### Step 2: Test Connectivity
@@ -46,17 +46,22 @@ This will:
 **Total time**: ~30-45 minutes
-### Step 4: Verify
+### Step 4: Verify Cluster
 ```bash
-bash scripts/verify-compute-blade-agent.sh
+export KUBECONFIG=$(pwd)/kubeconfig
 kubectl get nodes
 ```
-All workers should show:
+You should see all 4 nodes ready (3 masters + 1 worker):
- ✓ Network: Reachable
+```bash
- ✓ Service Status: Running
+NAME     STATUS   ROLES                       AGE   VERSION
- ✓ Binary: Installed
+cm4-01   Ready    control-plane,etcd,master   5m    v1.35.0+k3s1
 cm4-02   Ready    control-plane,etcd          3m    v1.35.0+k3s1
 cm4-03   Ready    control-plane,etcd          3m    v1.35.0+k3s1
 cm4-04   Ready    <none>                      3m    v1.35.0+k3s1
 ```
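The "all 4 nodes ready" check can also be done mechanically. This is a small sketch under the assumption that the second column of `kubectl get nodes --no-headers` is the node status, as in the sample output above; `count_ready_nodes` is a hypothetical helper, not something the repository provides.

```bash
# Hypothetical helper: count nodes whose STATUS column is exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
# Nodes reported as e.g. "Ready,SchedulingDisabled" are deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage against a live cluster (assumes KUBECONFIG is already exported):
#   kubectl get nodes --no-headers | count_ready_nodes

# Offline demonstration using the sample node list from this step.
count_ready_nodes <<'EOF'
cm4-01   Ready    control-plane,etcd,master   5m    v1.35.0+k3s1
cm4-02   Ready    control-plane,etcd          3m    v1.35.0+k3s1
cm4-03   Ready    control-plane,etcd          3m    v1.35.0+k3s1
cm4-04   Ready    <none>                      3m    v1.35.0+k3s1
EOF
```

A result lower than the number of hosts in `inventory/hosts.ini` suggests that at least one node has not joined or is not yet ready.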
 ## Configuration
@@ -215,22 +220,31 @@ sudo systemctl status compute-blade-agent
 ## Common Tasks
-### Restart Agent on All Workers
+### Check Cluster Status
 ```bash
-ansible worker -m shell -a "sudo systemctl restart compute-blade-agent" --become
+export KUBECONFIG=$(pwd)/kubeconfig
 kubectl get nodes
 kubectl get pods --all-namespaces
 ```
-### View Agent Logs on All Workers
+### Access Any Master Node
 ```bash
-ansible worker -m shell -a "sudo journalctl -u compute-blade-agent -n 20" --become
+# Access cm4-01
 ssh pi@192.168.30.101
 # Or access cm4-02 (backup master)
 ssh pi@192.168.30.102
 # Or access cm4-03 (backup master)
 ssh pi@192.168.30.103
 ```
 ### Deploy Only to Specific Nodes
 ```bash
-ansible-playbook site.yml --tags compute-blade-agent --limit cm4-02,cm4-03
+ansible-playbook site.yml --tags compute-blade-agent --limit cm4-04
 ```
 ### Disable Agent for Next Deployment
@@ -257,12 +271,12 @@ ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agen
 ansible all -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
 ```
-## Support
+## Documentation
- **Quick Reference**: `cat COMPUTE_BLADE_AGENT.md`
+- **README.md** - Full guide with all configuration options
- **Checklist**: `cat DEPLOYMENT_CHECKLIST.md`
+- **DEPLOYMENT_CHECKLIST.md** - Step-by-step checklist
- **Full Guide**: `cat README.md`
+- **COMPUTE_BLADE_AGENT.md** - Quick reference for agent deployment
- **GitHub**: [compute-blade-agent](https://github.com/compute-blade-community/compute-blade-agent)
+- **MIKROTIK-VIP-SETUP-CUSTOM.md** - Virtual IP failover configuration
 ## File Locations
@@ -8,16 +8,18 @@ Customized setup guide for your MikroTik RouterOS configuration.
 Uplink Network:     192.168.1.0/24  (br-uplink - WAN/External)
 LAB Network:        192.168.30.0/24 (br-lab - K3s Cluster)
-K3s Nodes:
+K3s Nodes (3-node HA Cluster):
-  cm4-01: 192.168.30.101 (Master)
+  cm4-01: 192.168.30.101 (Master/Control-Plane)
-  cm4-02: 192.168.30.102 (Worker)
+  cm4-02: 192.168.30.102 (Master/Control-Plane)
-  cm4-03: 192.168.30.103 (Worker)
+  cm4-03: 192.168.30.103 (Master/Control-Plane)
  cm4-04: 192.168.30.104 (Worker)
 Virtual IP to Create:
-  192.168.30.100/24 (on br-lab bridge)
+  192.168.30.100/24 (on br-lab bridge - HAProxy or MikroTik failover)
 ```
 **⚠️ Important Note**: The basic NAT rules below will route to cm4-01 only. To achieve true failover in your 3-node HA cluster, activate the health check script (Step 8) so traffic automatically routes to another master if cm4-01 goes down.
 ## Step 1: Add Virtual IP Address on MikroTik
 Since your K3s nodes are on the `br-lab` bridge, add the VIP there:
@@ -183,9 +185,9 @@ curl http://test.zlor.fi
 curl -k https://test.zlor.fi
 ```
-## Step 8: Optional - Add Health Check Script
+## Step 8: Add Health Check Script (Recommended for HA)
-For automatic failover, create a health check script that monitors the master node and updates NAT rules if it goes down.
+**For automatic failover with your 3-node HA cluster**, create a health check script that monitors the master node and updates NAT rules if it goes down. This ensures traffic automatically routes to cm4-02 or cm4-03 if cm4-01 fails.
 ### Create Health Check Script
@@ -237,6 +239,8 @@ For automatic failover, create a health check script that monitors the master no
  comment="Monitor K3s cluster and update VIP routes"
 ```
 **Status**: This scheduler will run every 30 seconds and automatically switch the VIP NAT rules to an available master if cm4-01 becomes unreachable.
 ### View Health Check Logs
 ```mikrotik
@@ -247,14 +251,33 @@ For automatic failover, create a health check script that monitors the master no
 ## Verification Checklist
 - [ ] VIP address (192.168.30.100) added to br-lab
- [ ] NAT rules for port 80 and 443 created
+- [ ] NAT rules for port 80 and 443 created (routed to cm4-01)
 - [ ] Firewall rules allow traffic to VIP
 - [ ] Ping 192.168.30.100 succeeds
 - [ ] curl http://192.168.30.100 returns nginx page
 - [ ] DNS A record added: test.zlor.fi → 192.168.30.100
 - [ ] curl http://test.zlor.fi works
- [ ] Health check script created (optional)
+- [ ] **Health check script created** (recommended for HA failover)
- [ ] Health check scheduled (optional)
+- [ ] **Health check scheduled** (recommended for HA failover)
 - [ ] Test failover and check the health check scheduler status
 ## Testing Failover (HA Cluster)
 If you've enabled the health check script, you can test automatic failover:
 ```bash
 # From your machine, start monitoring
 watch -n 5 'curl -v http://192.168.30.100 2>&1 | grep "200 OK\|Connected"'
 # In another terminal, SSH to cm4-01 and reboot it
 ssh pi@192.168.30.101
 sudo reboot
 # Watch the curl output - after ~30 seconds, it should reconnect
 # This means the health check script switched traffic to cm4-02 or cm4-03
 ```
 **Expected result**: Traffic stays online during the reboot, except for a switchover window of about 30 seconds.
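To put a number on the switchover window rather than eyeballing the `watch` output, the samples can be logged and summarized. The sketch below is an assumption-laden illustration: `longest_outage` is a hypothetical helper, the `epoch_seconds up|down` log format is invented for this example, and the 5-second polling interval limits the precision of the result.

```bash
# Hypothetical helper: given "epoch_seconds up|down" samples on stdin, print
# the longest outage in seconds, measured from the first "down" sample to the
# next "up" sample. An outage still in progress at end of input is ignored.
longest_outage() {
  awk '
    $2 == "down" && start == "" { start = $1 }
    $2 == "up" && start != "" {
      if ($1 - start > max) max = $1 - start
      start = ""
    }
    END { print max + 0 }
  '
}

# Collecting samples during a failover test (stop with Ctrl-C):
#   while sleep 5; do
#     if curl -fsS -o /dev/null -m 3 http://192.168.30.100; then s=up; else s=down; fi
#     echo "$(date +%s) $s"
#   done | tee vip-samples.log
#   longest_outage < vip-samples.log

# Offline demonstration with made-up timestamps.
longest_outage <<'EOF'
100 up
105 down
110 down
135 up
140 up
EOF
```

In the demonstration the VIP is unreachable from t=105 until t=135, so the helper reports 30 seconds.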
 ## Troubleshooting
@@ -368,16 +391,27 @@ Your VIP is now configured on MikroTik:
 ```
 External Traffic
    ↓
-192.168.30.100:80 (VIP on br-lab)
+192.168.30.100:80/443 (VIP on br-lab)
    ↓
-NAT Rule Routes to 192.168.30.101:80
+NAT Rule Routes to 192.168.30.101:80/443 (cm4-01 Master)
    ↓
-K3s Master Node (cm4-01)
+If Health Check Enabled:
  - Routes to cm4-02 if cm4-01 is down (checked every 30 seconds)
  - Routes to cm4-03 if both cm4-01 and cm4-02 are down
    ↓
-If Master Down → Failover to Worker
+Ingress → K3s Service → Pods
    (Optional with health check script)
 ```
-DNS: `test.zlor.fi → 192.168.30.100`
+**DNS**: `test.zlor.fi → 192.168.30.100`
-Single IP for your entire cluster with automatic failover! ✅
+**Status**:
 - ✅ Single IP for entire cluster
 - ✅ Automatic failover (with health check script)
 - ✅ 3-node HA masters provide etcd quorum
 **Next Steps**:
 1. Enable health check script (Step 8) for automatic failover
 2. Test failover by rebooting cm4-01 and monitoring connectivity
 3. Your cluster now has true high availability!
@@ -42,19 +42,19 @@ Edit `inventory/hosts.ini` and add your Raspberry Pi nodes:
 ```ini
 [master]
-pi-master ansible_host=192.168.30.100 ansible_user=pi
+cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true
 cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false
 cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false
 [worker]
-pi-worker-1 ansible_host=192.168.30.102 ansible_user=pi
+cm4-04 ansible_host=192.168.30.104 ansible_user=pi
 pi-worker-2 ansible_host=192.168.30.103 ansible_user=pi
 pi-worker-3 ansible_host=192.168.30.104 ansible_user=pi
 ```
 ### 2. Configure Variables
 In `inventory/hosts.ini`, you can customize:
- `k3s_version`: K3s version to install (default: v1.34.2+k3s1)
+- `k3s_version`: K3s version to install (default: v1.35.0+k3s1)
 - `extra_server_args`: Additional arguments for k3s server
 - `extra_agent_args`: Additional arguments for k3s agent
 - `extra_packages`: List of additional packages to install on all nodes
@@ -304,20 +304,21 @@ kubectl get nodes
 You should see all your nodes in Ready state:
 ```bash
-NAME          STATUS   ROLES                  AGE   VERSION
+NAME     STATUS   ROLES                       AGE   VERSION
-pi-master     Ready    control-plane,master   5m    v1.34.2+k3s1
+cm4-01   Ready    control-plane,etcd,master   5m    v1.35.0+k3s1
-pi-worker-1   Ready    <none>                 3m    v1.34.2+k3s1
+cm4-02   Ready    control-plane,etcd          3m    v1.35.0+k3s1
-pi-worker-2   Ready    <none>                 3m    v1.34.2+k3s1
+cm4-03   Ready    control-plane,etcd          3m    v1.35.0+k3s1
 cm4-04   Ready    <none>                      3m    v1.35.0+k3s1
 ```
 ## Accessing the Cluster
 ### From Master Node
-SSH into the master node and use kubectl:
+SSH into a master node and use kubectl:
 ```bash
-ssh pi@pi-master
+ssh pi@192.168.30.101
 kubectl get nodes
 ```
@@ -461,8 +462,11 @@ nginx-test-7d8f4c9b6d-xr5wp   1/1     Running   0          1m    pi-worker-2
 Add your master node IP to /etc/hosts:
 ```bash
-# Replace 192.168.30.101 with your master node IP
+# Replace with any master or worker node IP
 192.168.30.101  nginx-test.local nginx.pi.local
 192.168.30.102  nginx-test.local nginx.pi.local
 192.168.30.103  nginx-test.local nginx.pi.local
 192.168.30.104  nginx-test.local nginx.pi.local
 ```
 Then access via browser:
@@ -473,8 +477,9 @@ Then access via browser:
 Or test with curl:
 ```bash
-# Replace with your master node IP
+# Test with any cluster node IP (master or worker)
 curl -H "Host: nginx-test.local" http://192.168.30.101
 curl -H "Host: nginx-test.local" http://192.168.30.102
 ```
 ### Scale the Deployment
@@ -624,7 +629,7 @@ ansible-playbook site.yml --tags k3s-server --limit <failed-master>
 ### Demoting a Master to Worker
-To remove a master from control-plane and make it a worker:
+To remove a master from control-plane and make it a worker (note: this reduces HA from 3-node to 2-node):
 1. Edit `inventory/hosts.ini`:
@@ -638,6 +643,8 @@ To remove a master from control-plane and make it a worker:
   cm4-04 ansible_host=192.168.30.104 ansible_user=pi
   ```
   **Warning**: This reduces your cluster to 2 master nodes. A 2-member etcd cluster needs both members for quorum, so it cannot tolerate the loss of either master.
 2. Drain the node:
   ```bash
@@ -690,7 +697,7 @@ To update to a specific k3s version:
 ```ini
 [k3s_cluster:vars]
-k3s_version=v1.35.0+k3s1
+k3s_version=v1.36.0+k3s1
 ```
 1. Run the k3s playbook to update all nodes:
@@ -711,7 +718,7 @@ For more control, you can manually update k3s on individual nodes:
 ssh pi@<node-ip>
 # Download and install specific version
-curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.0+k3s1 sh -
+curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.36.0+k3s1 sh -
 # Restart k3s
 sudo systemctl restart k3s        # On master
@@ -775,7 +782,7 @@ If an update causes issues, you can rollback to a previous version:
 ```bash
 # Update inventory with previous version
 # [k3s_cluster:vars]
-# k3s_version=v1.34.2+k3s1
+# k3s_version=v1.35.0+k3s1
 # Re-run the playbook
 ansible-playbook site.yml --tags k3s-server,k3s-agent
@@ -814,7 +821,7 @@ ansible-playbook reboot.yml --limit master
 ### Reboot a Specific Node
 ```bash
-ansible-playbook reboot.yml --limit pi-worker-1
+ansible-playbook reboot.yml --limit cm4-04
 ```
 ## Troubleshooting
@@ -1001,26 +1008,33 @@ ansible-playbook site.yml --tags compute-blade-agent
 ## External DNS Configuration
-To use external domains (like `test.zlor.fi`) with your k3s cluster ingress, you need to configure DNS and update your nodes.
+To use external domains (like `test.zlor.fi`) with your k3s cluster ingress, you need to configure DNS. Your cluster uses a Virtual IP (192.168.30.100) via MikroTik for high availability.
 ### Step 1: Configure DNS Server Records
 On your DNS server, add **A records** pointing to your k3s cluster nodes:
-#### Option A: Single Record (Master Node Only) - Simplest
+#### Option A: Virtual IP (VIP) via MikroTik - Recommended for HA
-If your DNS only allows one A record:
+Use your MikroTik router's Virtual IP (192.168.30.100) for high availability:
 ```dns
-test.zlor.fi  A  192.168.30.101
+test.zlor.fi  A  192.168.30.100
 ```
-**Pros:** Simple, works with any DNS server
+**Pros:**
 **Cons:** No failover if master node is down
-#### Option B: Multiple Records (Load Balanced) - Best Redundancy
+- Single IP for entire cluster
 - Hardware-based failover (more reliable)
 - Better performance
 - No additional software needed
 - Automatically routes to available masters
-If your DNS supports multiple A records:
+See [MIKROTIK-VIP-SETUP-CUSTOM.md](MIKROTIK-VIP-SETUP-CUSTOM.md) for detailed setup instructions.
 #### Option B: Multiple Records (Load Balanced)
 If your DNS supports multiple A records, point to all cluster nodes:
 ```dns
 test.zlor.fi  A  192.168.30.101
@@ -1029,32 +1043,19 @@ test.zlor.fi  A  192.168.30.103
 test.zlor.fi  A  192.168.30.104
 ```
 DNS clients will distribute requests across all nodes (round-robin).
 **Pros:** Load balanced, automatic failover
 **Cons:** Requires DNS server support for multiple A records
-#### Option C: Virtual IP (VIP) - Best of Both Worlds
+#### Option C: Single Master Node (No Failover)
-If your DNS only allows one A record but you want redundancy:
+For simple setups without redundancy:
 ```dns
-test.zlor.fi  A  192.168.30.100
+test.zlor.fi  A  192.168.30.101
 ```
-Set up a virtual IP that automatically handles failover. You have two sub-options:
+**Pros:** Simple, works with any DNS server
-
+**Cons:** No failover if that node is down (not recommended for HA clusters)
 ##### Option C: MikroTik VIP (Recommended)
 Configure VIP directly on your MikroTik router. See [MIKROTIK-VIP-SETUP.md](MIKROTIK-VIP-SETUP.md) for customized setup instructions for your network topology.
 Pros:
 - Simple setup (5 minutes)
 - No additional software on cluster nodes
 - Hardware-based failover (more reliable)
 - Better performance
 - Reduced CPU overhead on nodes
 ### Step 2: Configure Cluster Nodes for External DNS