diff --git a/DEPLOYMENT_CHECKLIST.md b/DEPLOYMENT_CHECKLIST.md
index 2885b81..cef2841 100644
--- a/DEPLOYMENT_CHECKLIST.md
+++ b/DEPLOYMENT_CHECKLIST.md
@@ -68,36 +68,36 @@ bash scripts/verify-compute-blade-agent.sh
 - [ ] Service status shows "Running"
 - [ ] Config file exists at `/etc/compute-blade-agent/config.yaml`
 
-### 3. Manual Verification on a Worker
+### 3. Manual Verification on a Master Node
 
 ```bash
-ssh pi@192.168.30.102
-sudo systemctl status compute-blade-agent
+# Connect to any master (cm4-01, cm4-02, or cm4-03)
+ssh pi@192.168.30.101
+kubectl get nodes
 ```
 
-- [ ] Service is active (running)
-- [ ] Service is enabled (will start on boot)
+- [ ] All 3 masters show as "Ready"
+- [ ] Worker node (cm4-04) shows as "Ready"
 
-### 4. Check Logs
+### 4. Check Etcd Quorum
 
 ```bash
-ssh pi@192.168.30.102
-sudo journalctl -u compute-blade-agent -n 50
+ssh pi@192.168.30.101
+sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
 ```
 
-- [ ] No error messages
-- [ ] Service started successfully
-- [ ] Hardware detection messages present (if applicable)
+- [ ] All 3 etcd members show as active
+- [ ] Cluster has quorum (2 of 3 members is the minimum for failover)
 
-### 5. Verify Installation
+### 5. Verify Kubeconfig
 
 ```bash
-ssh pi@192.168.30.102
-/usr/local/bin/compute-blade-agent --version
+export KUBECONFIG=$(pwd)/kubeconfig
+kubectl config get-contexts
 ```
 
-- [ ] Binary responds with version information
-- [ ] bladectl CLI tool is available
+- [ ] Shows contexts: cm4-01, cm4-02, cm4-03, and default
+- [ ] All contexts point to correct control-plane nodes
 
 ## Optional: Kubernetes Monitoring Setup
@@ -159,15 +159,20 @@ enable_compute_blade_agent=true # or false
 
 ### Per-Node Configuration
 
-To enable/disable specific nodes, edit `inventory/hosts.ini`:
+Note: cm4-02 and cm4-03 are now **master nodes**, not workers. To enable/disable compute-blade-agent on specific nodes:
 
 ```ini
+[master]
+cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
+cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
+cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
+
 [worker]
-cm4-02 ansible_host=... enable_compute_blade_agent=false
-cm4-03 ansible_host=... enable_compute_blade_agent=true
+cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
 ```
 
 - [ ] Per-node settings configured as needed
+- [ ] Master nodes typically don't need compute-blade-agent
 - [ ] Saved inventory file
 - [ ] Re-run playbook if changes made
@@ -214,26 +219,36 @@ ansible worker -m shell -a "systemctl status compute-blade-agent" --become
 
 - [ ] All workers show active status
 
+## HA Cluster Maintenance
+
+### Testing Failover
+
+Your 3-node HA cluster can handle one master going down (maintains 2/3 quorum):
+
+```bash
+# Reboot one master while monitoring cluster
+ssh pi@192.168.30.101
+sudo reboot
+
+# From another terminal, watch cluster status
+watch kubectl get nodes
+```
+
+- [ ] Cluster remains operational with 2/3 masters
+- [ ] Pods continue running
+- [ ] Can still kubectl from cm4-02 or cm4-03 context
+
 ## Uninstall (if needed)
 
-### Uninstall from Single Node
+### Uninstall K3s from All Nodes
 
 ```bash
-ssh pi@
-sudo bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh
+ansible all -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
+ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become
 ```
 
-- [ ] Uninstall script executed
-- [ ] Service removed
-- [ ] Configuration cleaned up
-
-### Uninstall from All Workers
-
-```bash
-ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh" --become
-```
-
-- [ ] All workers uninstalled
+- [ ] All K3s services stopped
+- [ ] Cluster data cleaned up
 
 ### Disable in Future Deployments
diff --git a/GETTING_STARTED.md b/GETTING_STARTED.md
index 0d5748e..844241b 100644
--- a/GETTING_STARTED.md
+++ b/GETTING_STARTED.md
@@ -18,9 +18,9 @@ cat inventory/hosts.ini
 
 Verify:
 
-- Master node IP is correct (cm4-01)
-- Worker node IPs are correct (cm4-02, cm4-03, cm4-04)
-- `enable_compute_blade_agent=true` is set
+- Master nodes are correct (cm4-01, cm4-02, cm4-03)
+- Worker node IP is correct (cm4-04)
+- `enable_compute_blade_agent=true` is set (optional for masters)
 
 ### Step 2: Test Connectivity
@@ -46,17 +46,22 @@ This will:
 
 **Total time**: ~30-45 minutes
 
-### Step 4: Verify
+### Step 4: Verify Cluster
 
 ```bash
-bash scripts/verify-compute-blade-agent.sh
+export KUBECONFIG=$(pwd)/kubeconfig
+kubectl get nodes
 ```
 
-All workers should show:
+You should see all 4 nodes ready (3 masters + 1 worker):
 
-- ✓ Network: Reachable
-- ✓ Service Status: Running
-- ✓ Binary: Installed
+```bash
+NAME     STATUS   ROLES                       AGE   VERSION
+cm4-01   Ready    control-plane,etcd,master   5m    v1.35.0+k3s1
+cm4-02   Ready    control-plane,etcd          3m    v1.35.0+k3s1
+cm4-03   Ready    control-plane,etcd          3m    v1.35.0+k3s1
+cm4-04   Ready    <none>                      3m    v1.35.0+k3s1
+```
 
 ## Configuration
@@ -215,22 +220,31 @@ sudo systemctl status compute-blade-agent
 
 ## Common Tasks
 
-### Restart Agent on All Workers
+### Check Cluster Status
 
 ```bash
-ansible worker -m shell -a "sudo systemctl restart compute-blade-agent" --become
+export KUBECONFIG=$(pwd)/kubeconfig
+kubectl get nodes
+kubectl get pods --all-namespaces
 ```
 
-### View Agent Logs on All Workers
+### Access Any Master Node
 
 ```bash
-ansible worker -m shell -a "sudo journalctl -u compute-blade-agent -n 20" --become
+# Access cm4-01
+ssh pi@192.168.30.101
+
+# Or access cm4-02 (backup master)
+ssh pi@192.168.30.102
+
+# Or access cm4-03 (backup master)
+ssh pi@192.168.30.103
 ```
 
 ### Deploy Only to Specific Nodes
 
 ```bash
-ansible-playbook site.yml --tags compute-blade-agent --limit cm4-02,cm4-03
+ansible-playbook site.yml --tags compute-blade-agent --limit cm4-04
 ```
 
 ### Disable Agent for Next Deployment
@@ -257,12 +271,12 @@ ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agen
 ansible all -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
 ```
 
-## Support
+## Documentation
 
-- **Quick Reference**: `cat COMPUTE_BLADE_AGENT.md`
-- **Checklist**: `cat DEPLOYMENT_CHECKLIST.md`
-- **Full Guide**: `cat README.md`
-- **GitHub**: [compute-blade-agent](https://github.com/compute-blade-community/compute-blade-agent)
+- **README.md** - Full guide with all configuration options
+- **DEPLOYMENT_CHECKLIST.md** - Step-by-step checklist
+- **COMPUTE_BLADE_AGENT.md** - Quick reference for agent deployment
+- **MIKROTIK-VIP-SETUP-CUSTOM.md** - Virtual IP failover configuration
 
 ## File Locations
diff --git a/MIKROTIK-VIP-SETUP-CUSTOM.md b/MIKROTIK-VIP-SETUP-CUSTOM.md
index ade14d0..6abc47b 100644
--- a/MIKROTIK-VIP-SETUP-CUSTOM.md
+++ b/MIKROTIK-VIP-SETUP-CUSTOM.md
@@ -8,16 +8,18 @@ Customized setup guide for your MikroTik RouterOS configuration.
 
 Uplink Network: 192.168.1.0/24 (br-uplink - WAN/External)
 LAB Network:    192.168.30.0/24 (br-lab - K3s Cluster)
 
-K3s Nodes:
-  cm4-01: 192.168.30.101 (Master)
-  cm4-02: 192.168.30.102 (Worker)
-  cm4-03: 192.168.30.103 (Worker)
+K3s Nodes (3-node HA Cluster):
+  cm4-01: 192.168.30.101 (Master/Control-Plane)
+  cm4-02: 192.168.30.102 (Master/Control-Plane)
+  cm4-03: 192.168.30.103 (Master/Control-Plane)
   cm4-04: 192.168.30.104 (Worker)
 
 Virtual IP to Create:
-  192.168.30.100/24 (on br-lab bridge)
+  192.168.30.100/24 (on br-lab bridge - HAProxy or MikroTik failover)
 ```
 
+**⚠️ Important Note**: The basic NAT rules below will route to cm4-01 only. To achieve true failover in your 3-node HA cluster, activate the health check script (Step 8) so traffic automatically routes to another master if cm4-01 goes down.
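+
+As a quick sanity check of that routing, you can probe the VIP and each node for the ingress vhost from any client on the LAB network (a sketch; it assumes the `test.zlor.fi` ingress used later in this guide is already deployed):
+
+```bash
+# Print the HTTP status from the VIP and every node; 000 means unreachable
+for ip in 192.168.30.100 192.168.30.101 192.168.30.102 192.168.30.103 192.168.30.104; do
+  echo -n "$ip: "
+  curl -s --max-time 3 -o /dev/null -w '%{http_code}\n' -H 'Host: test.zlor.fi' "http://$ip"
+done
+```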
+
 ## Step 1: Add Virtual IP Address on MikroTik
 
 Since your K3s nodes are on the `br-lab` bridge, add the VIP there:
@@ -183,9 +185,9 @@ curl http://test.zlor.fi
 curl -k https://test.zlor.fi
 ```
 
-## Step 8: Optional - Add Health Check Script
+## Step 8: Add Health Check Script (Recommended for HA)
 
-For automatic failover, create a health check script that monitors the master node and updates NAT rules if it goes down.
+**For automatic failover with your 3-node HA cluster**, create a health check script that monitors the master node and updates NAT rules if it goes down. This ensures traffic automatically routes to cm4-02 or cm4-03 if cm4-01 fails.
 
 ### Create Health Check Script
@@ -237,6 +239,8 @@ For automatic failover, create a health check script that monitors the master no
     comment="Monitor K3s cluster and update VIP routes"
 ```
 
+**Status**: This scheduler will run every 30 seconds and automatically switch the VIP NAT rules to an available master if cm4-01 becomes unreachable.
+
 ### View Health Check Logs
 
 ```mikrotik
@@ -247,14 +251,33 @@ For automatic failover, create a health check script that monitors the master no
 
 ## Verification Checklist
 
 - [ ] VIP address (192.168.30.100) added to br-lab
-- [ ] NAT rules for port 80 and 443 created
+- [ ] NAT rules for port 80 and 443 created (routed to cm4-01)
 - [ ] Firewall rules allow traffic to VIP
 - [ ] Ping 192.168.30.100 succeeds
 - [ ] curl http://192.168.30.100 returns nginx page
 - [ ] DNS A record added: test.zlor.fi → 192.168.30.100
 - [ ] curl http://test.zlor.fi works
-- [ ] Health check script created (optional)
-- [ ] Health check scheduled (optional)
+- [ ] **Health check script created** (recommended for HA failover)
+- [ ] **Health check scheduled** (recommended for HA failover)
+- [ ] Failover tested: reboot cm4-01 and confirm the scheduler reroutes traffic
+
+## Testing Failover (HA Cluster)
+
+If you've enabled the health check script, you can test automatic failover:
+
+```bash
+# From your machine, start monitoring
+watch -n 5 'curl -v http://192.168.30.100 2>&1 | grep "200 OK\|Connected"'
+
+# In another terminal, SSH to cm4-01 and reboot it
+ssh pi@192.168.30.101
+sudo reboot
+
+# Watch the curl output - after ~30 seconds, it should reconnect
+# This means the health check script switched traffic to cm4-02 or cm4-03
+```
+
+**Expected result**: Traffic stays online during the reboot (apart from a ~30-second switchover window)
 
 ## Troubleshooting
@@ -368,16 +391,27 @@ Your VIP is now configured on MikroTik:
 
 ```
 External Traffic
     ↓
-192.168.30.100:80 (VIP on br-lab)
+192.168.30.100:80/443 (VIP on br-lab)
     ↓
-NAT Rule Routes to 192.168.30.101:80
+NAT Rule Routes to 192.168.30.101:80/443 (cm4-01 Master)
     ↓
-K3s Master Node (cm4-01)
+If Health Check Enabled:
+  - Routes to cm4-02 if cm4-01 down (checked every 30 seconds)
+  - Routes to cm4-03 if both cm4-01 and cm4-02 down
     ↓
-If Master Down → Failover to Worker
-  (Optional with health check script)
+Ingress → K3s Service → Pods
 ```
 
-DNS: `test.zlor.fi → 192.168.30.100`
+**DNS**: `test.zlor.fi → 192.168.30.100`
 
-Single IP for your entire cluster with automatic failover! ✅
+**Status**:
+
+- ✅ Single IP for entire cluster
+- ✅ Automatic failover (with health check script)
+- ✅ 3-node HA masters provide etcd quorum
+
+**Next Steps**:
+
+1. Enable health check script (Step 8) for automatic failover
+2. Test failover by rebooting cm4-01 and monitoring connectivity
+3. Your cluster now has true high availability!
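+
+If you want a finer-grained view of the switchover window than the `watch` loop above gives, one alternative is a one-second probe loop run from any client machine (a sketch; adjust the VIP if yours differs):
+
+```bash
+# Print a timestamped up/DOWN line every second; the run of DOWN lines
+# during a cm4-01 reboot is your actual switchover window
+while true; do
+  if curl -s --max-time 2 -o /dev/null http://192.168.30.100; then
+    echo "$(date +%T) up"
+  else
+    echo "$(date +%T) DOWN"
+  fi
+  sleep 1
+done
+```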
diff --git a/README.md b/README.md
index 08b1bd5..c4a0176 100644
--- a/README.md
+++ b/README.md
@@ -42,19 +42,19 @@ Edit `inventory/hosts.ini` and add your Raspberry Pi nodes:
 
 ```ini
 [master]
-pi-master ansible_host=192.168.30.100 ansible_user=pi
+cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true
+cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false
+cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false
 
 [worker]
-pi-worker-1 ansible_host=192.168.30.102 ansible_user=pi
-pi-worker-2 ansible_host=192.168.30.103 ansible_user=pi
-pi-worker-3 ansible_host=192.168.30.104 ansible_user=pi
+cm4-04 ansible_host=192.168.30.104 ansible_user=pi
 ```
 
 ### 2. Configure Variables
 
 In `inventory/hosts.ini`, you can customize:
 
-- `k3s_version`: K3s version to install (default: v1.34.2+k3s1)
+- `k3s_version`: K3s version to install (default: v1.35.0+k3s1)
 - `extra_server_args`: Additional arguments for k3s server
 - `extra_agent_args`: Additional arguments for k3s agent
 - `extra_packages`: List of additional packages to install on all nodes
@@ -304,20 +304,21 @@ kubectl get nodes
 
 You should see all your nodes in Ready state:
 
 ```bash
-NAME          STATUS   ROLES                  AGE   VERSION
-pi-master     Ready    control-plane,master   5m    v1.34.2+k3s1
-pi-worker-1   Ready    <none>                 3m    v1.34.2+k3s1
-pi-worker-2   Ready    <none>                 3m    v1.34.2+k3s1
+NAME     STATUS   ROLES                       AGE   VERSION
+cm4-01   Ready    control-plane,etcd,master   5m    v1.35.0+k3s1
+cm4-02   Ready    control-plane,etcd          3m    v1.35.0+k3s1
+cm4-03   Ready    control-plane,etcd          3m    v1.35.0+k3s1
+cm4-04   Ready    <none>                      3m    v1.35.0+k3s1
 ```
 
 ## Accessing the Cluster
 
 ### From Master Node
 
-SSH into the master node and use kubectl:
+SSH into a master node and use kubectl:
 
 ```bash
-ssh pi@pi-master
+ssh pi@192.168.30.101
 kubectl get nodes
 ```
@@ -461,8 +462,11 @@ nginx-test-7d8f4c9b6d-xr5wp 1/1 Running 0 1m pi-worker-2
 
 Add your master node IP to /etc/hosts:
 
 ```bash
-# Replace 192.168.30.101 with your master node IP
+# Replace with any master or worker node IP
 192.168.30.101 nginx-test.local nginx.pi.local
+192.168.30.102 nginx-test.local nginx.pi.local
+192.168.30.103 nginx-test.local nginx.pi.local
+192.168.30.104 nginx-test.local nginx.pi.local
 ```
 
 Then access via browser:
@@ -473,8 +477,9 @@ Then access via browser:
 
 Or test with curl:
 
 ```bash
-# Replace with your master node IP
+# Test with any cluster node IP (master or worker)
 curl -H "Host: nginx-test.local" http://192.168.30.101
+curl -H "Host: nginx-test.local" http://192.168.30.102
 ```
 
 ### Scale the Deployment
@@ -624,7 +629,7 @@ ansible-playbook site.yml --tags k3s-server --limit
 
 ### Demoting a Master to Worker
 
-To remove a master from control-plane and make it a worker:
+To remove a master from the control plane and make it a worker (note: this reduces the control plane from 3 nodes to 2):
 
 1. Edit `inventory/hosts.ini`:
 
@@ -638,6 +643,8 @@ To remove a master from control-plane and make it a worker:
    cm4-04 ansible_host=192.168.30.104 ansible_user=pi
    ```
 
+   **Warning**: This leaves your cluster with only 2 master nodes. A 2-member etcd cluster needs both members for quorum, so a single master failure takes the control plane down.
+
 2. Drain the node:
 
 ```bash
@@ -690,7 +697,7 @@ To update to a specific k3s version:
 
 ```ini
 [k3s_cluster:vars]
-k3s_version=v1.35.0+k3s1
+k3s_version=v1.36.0+k3s1
 ```
 
 1. Run the k3s playbook to update all nodes:
@@ -711,7 +718,7 @@ For more control, you can manually update k3s on individual nodes:
 
 ssh pi@
 
 # Download and install specific version
-curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.0+k3s1 sh -
+curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.36.0+k3s1 sh -
 
 # Restart k3s
 sudo systemctl restart k3s # On master
@@ -775,7 +782,7 @@ If an update causes issues, you can rollback to a previous version:
 
 ```bash
 # Update inventory with previous version
 # [k3s_cluster:vars]
-# k3s_version=v1.34.2+k3s1
+# k3s_version=v1.35.0+k3s1
 
 # Re-run the playbook
 ansible-playbook site.yml --tags k3s-server,k3s-agent
@@ -814,7 +821,7 @@ ansible-playbook reboot.yml --limit master
 
 ### Reboot a Specific Node
 
 ```bash
-ansible-playbook reboot.yml --limit pi-worker-1
+ansible-playbook reboot.yml --limit cm4-04
 ```
 
 ## Troubleshooting
@@ -1001,26 +1008,33 @@ ansible-playbook site.yml --tags compute-blade-agent
 
 ## External DNS Configuration
 
-To use external domains (like `test.zlor.fi`) with your k3s cluster ingress, you need to configure DNS and update your nodes.
+To use external domains (like `test.zlor.fi`) with your k3s cluster ingress, you need to configure DNS. Your cluster uses a Virtual IP (192.168.30.100) via MikroTik for high availability.
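+
+Before wiring anything else up, it is worth confirming the record actually resolves to the VIP (a sketch; assumes `dig` from the dnsutils/bind-utils package is installed on your client):
+
+```bash
+# Should print 192.168.30.100 once the A record is in place
+dig +short test.zlor.fi
+
+# And the ingress should answer on the VIP
+curl -H 'Host: test.zlor.fi' http://192.168.30.100
+```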
 ### Step 1: Configure DNS Server Records
 
 On your DNS server, add **A records** pointing to your k3s cluster nodes:
 
-#### Option A: Single Record (Master Node Only) - Simplest
+#### Option A: Virtual IP (VIP) via MikroTik - Recommended for HA
 
-If your DNS only allows one A record:
+Use your MikroTik router's Virtual IP (192.168.30.100) for high availability:
 
 ```dns
-test.zlor.fi    A    192.168.30.101
+test.zlor.fi    A    192.168.30.100
 ```
 
-**Pros:** Simple, works with any DNS server
-**Cons:** No failover if master node is down
+**Pros:**
 
-#### Option B: Multiple Records (Load Balanced) - Best Redundancy
+- Single IP for entire cluster
+- Hardware-based failover (more reliable)
+- Better performance
+- No additional software needed
+- Automatically routes to available masters
 
-If your DNS supports multiple A records:
+See [MIKROTIK-VIP-SETUP-CUSTOM.md](MIKROTIK-VIP-SETUP-CUSTOM.md) for detailed setup instructions.
+
+#### Option B: Multiple Records (Load Balanced)
+
+If your DNS supports multiple A records, point to all cluster nodes:
 
 ```dns
 test.zlor.fi    A    192.168.30.101
@@ -1029,32 +1043,19 @@ test.zlor.fi    A    192.168.30.103
 test.zlor.fi    A    192.168.30.104
 ```
 
-DNS clients will distribute requests across all nodes (round-robin).
-
 **Pros:** Load balanced, automatic failover
 **Cons:** Requires DNS server support for multiple A records
 
-#### Option C: Virtual IP (VIP) - Best of Both Worlds
+#### Option C: Single Master Node (No Failover)
 
-If your DNS only allows one A record but you want redundancy:
+For simple setups without redundancy:
 
 ```dns
-test.zlor.fi    A    192.168.30.100
+test.zlor.fi    A    192.168.30.101
 ```
 
-Set up a virtual IP that automatically handles failover. You have two sub-options:
-
-##### Option C: MikroTik VIP (Recommended)
-
-Configure VIP directly on your MikroTik router. See [MIKROTIK-VIP-SETUP.md](MIKROTIK-VIP-SETUP.md) for customized setup instructions for your network topology.
-
-Pros:
-
-- Simple setup (5 minutes)
-- No additional software on cluster nodes
-- Hardware-based failover (more reliable)
-- Better performance
-- Reduced CPU overhead on nodes
+**Pros:** Simple, works with any DNS server
+**Cons:** No failover if that node is down (not recommended for HA clusters)
 
 ### Step 2: Configure Cluster Nodes for External DNS
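+
+The quorum figures quoted throughout this change (2 of 3 masters for HA, the two-master warning) all follow from majority = floor(n/2) + 1; a quick sketch:
+
+```bash
+# Quorum for an etcd cluster of n members, and how many failures it survives
+for n in 1 2 3 4 5; do
+  echo "$n members: quorum $((n / 2 + 1)), tolerates $((n - n / 2 - 1)) failure(s)"
+done
+```
+
+For n=3 this prints "3 members: quorum 2, tolerates 1 failure(s)", which is why the 3-node control plane stays up through a single master reboot, while a 2-node control plane (quorum 2, tolerates 0) does not.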