updating documentation

2026-01-08 17:48:49 +01:00
parent a2cf2a86d2
commit 4e0a3cf0cb
4 changed files with 177 additions and 113 deletions


@@ -68,36 +68,36 @@ bash scripts/verify-compute-blade-agent.sh
- [ ] Service status shows "Running"
- [ ] Config file exists at `/etc/compute-blade-agent/config.yaml`
### 3. Manual Verification on a Master Node
```bash
# Connect to any master (cm4-01, cm4-02, or cm4-03)
ssh pi@192.168.30.101
kubectl get nodes
```
- [ ] All 3 masters show as "Ready"
- [ ] Worker node (cm4-04) shows as "Ready"
### 4. Check Etcd Quorum
```bash
ssh pi@192.168.30.101
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
```
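If the bare `member list` call is rejected, k3s-managed etcd typically requires TLS flags; a sketch, assuming the default k3s certificate paths:
```bash
# Point etcdctl at the embedded etcd using the k3s-generated client certs
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list
```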
- [ ] All 3 etcd members show as active
- [ ] Cluster has quorum (2 of 3 members must be up)
### 5. Verify Kubeconfig
```bash
export KUBECONFIG=$(pwd)/kubeconfig
kubectl config get-contexts
```
- [ ] Shows contexts: cm4-01, cm4-02, cm4-03, and default
- [ ] All contexts point to correct control-plane nodes
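To spot-check a context, switch to it and confirm the API responds (standard kubectl commands):
```bash
kubectl config use-context cm4-02
kubectl get nodes
```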
## Optional: Kubernetes Monitoring Setup
@@ -159,15 +159,20 @@ enable_compute_blade_agent=true # or false
### Per-Node Configuration
Note: cm4-02 and cm4-03 are now **master nodes**, not workers. To enable/disable compute-blade-agent on specific nodes:
```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
```
- [ ] Per-node settings configured as needed
- [ ] Master nodes typically don't need compute-blade-agent
- [ ] Saved inventory file
- [ ] Re-run playbook if changes made
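Re-running uses the same tagged invocation as the initial deployment:
```bash
ansible-playbook site.yml --tags compute-blade-agent
```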
@@ -214,26 +219,36 @@ ansible worker -m shell -a "systemctl status compute-blade-agent" --become
- [ ] All workers show active status
## HA Cluster Maintenance
### Testing Failover
Your 3-node HA cluster can handle one master going down (etcd keeps quorum with 2 of 3 members):
```bash
# Reboot one master while monitoring cluster
ssh pi@192.168.30.101
sudo reboot
# From another terminal, watch cluster status
watch kubectl get nodes
```
- [ ] Cluster remains operational with 2/3 masters
- [ ] Pods continue running
- [ ] Can still kubectl from cm4-02 or cm4-03 context
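Once the rebooted master is back, confirm etcd membership recovered (same etcdctl invocation as in step 4, "Check Etcd Quorum"):
```bash
ssh pi@192.168.30.102
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
```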
## Uninstall (if needed)
### Uninstall K3s from All Nodes
```bash
ansible master -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become
```
- [ ] All K3s services stopped
- [ ] Cluster data cleaned up
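A quick post-uninstall check, assuming the default systemd unit names `k3s` (servers) and `k3s-agent` (workers):
```bash
# Both should report inactive/unknown on every node
ansible master -m shell -a "systemctl is-active k3s || true" --become
ansible worker -m shell -a "systemctl is-active k3s-agent || true" --become
```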
### Disable in Future Deployments


@@ -18,9 +18,9 @@ cat inventory/hosts.ini
Verify:
- Master nodes are correct (cm4-01, cm4-02, cm4-03)
- Worker node IP is correct (cm4-04)
- `enable_compute_blade_agent=true` is set (optional for masters)
### Step 2: Test Connectivity
@@ -46,17 +46,22 @@ This will:
**Total time**: ~30-45 minutes
### Step 4: Verify Cluster
```bash
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
```
You should see all 4 nodes ready (3 masters + 1 worker):
```bash
NAME STATUS ROLES AGE VERSION
cm4-01 Ready control-plane,etcd,master 5m v1.35.0+k3s1
cm4-02 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-03 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-04 Ready <none> 3m v1.35.0+k3s1
```
## Configuration
@@ -215,22 +220,31 @@ sudo systemctl status compute-blade-agent
## Common Tasks
### Check Cluster Status
```bash
ansible worker -m shell -a "sudo systemctl restart compute-blade-agent" --become
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
kubectl get pods --all-namespaces
```
### Access Any Master Node
```bash
ansible worker -m shell -a "sudo journalctl -u compute-blade-agent -n 20" --become
# Access cm4-01
ssh pi@192.168.30.101
# Or access cm4-02 (backup master)
ssh pi@192.168.30.102
# Or access cm4-03 (backup master)
ssh pi@192.168.30.103
```
### Deploy Only to Specific Nodes
```bash
ansible-playbook site.yml --tags compute-blade-agent --limit cm4-04
```
### Disable Agent for Next Deployment
@@ -257,12 +271,12 @@ ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agen
ansible master -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
```
## Documentation
- **README.md** - Full guide with all configuration options
- **DEPLOYMENT_CHECKLIST.md** - Step-by-step checklist
- **COMPUTE_BLADE_AGENT.md** - Quick reference for agent deployment
- **MIKROTIK-VIP-SETUP-CUSTOM.md** - Virtual IP failover configuration
## File Locations


@@ -8,16 +8,18 @@ Customized setup guide for your MikroTik RouterOS configuration.
Uplink Network: 192.168.1.0/24 (br-uplink - WAN/External)
LAB Network: 192.168.30.0/24 (br-lab - K3s Cluster)
K3s Nodes (3-node HA Cluster):
cm4-01: 192.168.30.101 (Master/Control-Plane)
cm4-02: 192.168.30.102 (Master/Control-Plane)
cm4-03: 192.168.30.103 (Master/Control-Plane)
cm4-04: 192.168.30.104 (Worker)
Virtual IP to Create:
192.168.30.100/24 (on br-lab bridge - HAProxy or MikroTik failover)
```
**⚠️ Important Note**: The basic NAT rules below will route to cm4-01 only. To achieve true failover in your 3-node HA cluster, activate the health check script (Step 8) so traffic automatically routes to another master if cm4-01 goes down.
## Step 1: Add Virtual IP Address on MikroTik
Since your K3s nodes are on the `br-lab` bridge, add the VIP there:
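A minimal sketch of the RouterOS command (interface name taken from the topology above):
```mikrotik
/ip address add address=192.168.30.100/24 interface=br-lab comment="K3s VIP"
```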
@@ -183,9 +185,9 @@ curl http://test.zlor.fi
curl -k https://test.zlor.fi
```
## Step 8: Add Health Check Script (Recommended for HA)
**For automatic failover with your 3-node HA cluster**, create a health check script that monitors the master node and updates NAT rules if it goes down. This ensures traffic automatically routes to cm4-02 or cm4-03 if cm4-01 fails.
### Create Health Check Script
@@ -237,6 +239,8 @@ For automatic failover, create a health check script that monitors the master no
comment="Monitor K3s cluster and update VIP routes"
```
**Status**: This scheduler will run every 30 seconds and automatically switch the VIP NAT rules to an available master if cm4-01 becomes unreachable.
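You can confirm the scheduler is registered with a standard RouterOS print:
```mikrotik
/system scheduler print
```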
### View Health Check Logs
```mikrotik
@@ -247,14 +251,33 @@ For automatic failover, create a health check script that monitors the master no
## Verification Checklist
- [ ] VIP address (192.168.30.100) added to br-lab
- [ ] NAT rules for port 80 and 443 created (routed to cm4-01)
- [ ] Firewall rules allow traffic to VIP
- [ ] Ping 192.168.30.100 succeeds
- [ ] curl http://192.168.30.100 returns nginx page
- [ ] DNS A record added: test.zlor.fi → 192.168.30.100
- [ ] curl http://test.zlor.fi works
- [ ] **Health check script created** (recommended for HA failover)
- [ ] **Health check scheduled** (recommended for HA failover)
- [ ] Failover tested (see "Testing Failover" below)
## Testing Failover (HA Cluster)
If you've enabled the health check script, you can test automatic failover:
```bash
# From your machine, start monitoring
watch -n 5 'curl -v http://192.168.30.100 2>&1 | grep "200 OK\|Connected"'
# In another terminal, SSH to cm4-01 and reboot it
ssh pi@192.168.30.101
sudo reboot
# Watch the curl output - after ~30 seconds, it should reconnect
# This means the health check script switched traffic to cm4-02 or cm4-03
```
**Expected result**: Traffic stays online through the reboot, apart from a ~30-second switchover window.
## Troubleshooting
@@ -368,16 +391,27 @@ Your VIP is now configured on MikroTik:
```
External Traffic
192.168.30.100:80/443 (VIP on br-lab)
NAT Rule Routes to 192.168.30.101:80/443 (cm4-01 Master)
K3s Master Node (cm4-01)
If Health Check Enabled:
- Routes to cm4-02 if cm4-01 is down (checked every 30 seconds)
- Routes to cm4-03 if both cm4-01 and cm4-02 are down
Ingress → K3s Service → Pods
```
**DNS**: `test.zlor.fi → 192.168.30.100`
**Status**:
- ✅ Single IP for entire cluster
- ✅ Automatic failover (with health check script)
- ✅ 3-node HA masters provide etcd quorum
**Next Steps**:
1. Enable health check script (Step 8) for automatic failover
2. Test failover by rebooting cm4-01 and monitoring connectivity
3. Your cluster now has true high availability!


@@ -42,19 +42,19 @@ Edit `inventory/hosts.ini` and add your Raspberry Pi nodes:
```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false
[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi
```
### 2. Configure Variables
In `inventory/hosts.ini`, you can customize:
- `k3s_version`: K3s version to install (default: v1.35.0+k3s1)
- `extra_server_args`: Additional arguments for k3s server
- `extra_agent_args`: Additional arguments for k3s agent
- `extra_packages`: List of additional packages to install on all nodes
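A hypothetical override block showing these variables together (the values are illustrative; `--disable traefik` is a standard k3s server flag):
```ini
[k3s_cluster:vars]
k3s_version=v1.35.0+k3s1
extra_server_args=--disable traefik
extra_packages=["htop", "iotop"]
```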
@@ -304,20 +304,21 @@ kubectl get nodes
You should see all your nodes in Ready state:
```bash
NAME STATUS ROLES AGE VERSION
cm4-01 Ready control-plane,etcd,master 5m v1.35.0+k3s1
cm4-02 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-03 Ready control-plane,etcd 3m v1.35.0+k3s1
cm4-04 Ready <none> 3m v1.35.0+k3s1
```
## Accessing the Cluster
### From Master Node
SSH into a master node and use kubectl:
```bash
ssh pi@192.168.30.101
kubectl get nodes
```
@@ -461,8 +462,11 @@ nginx-test-7d8f4c9b6d-xr5wp 1/1 Running 0 1m pi-worker-2
Add a cluster node IP (any master or worker) to /etc/hosts:
```bash
# Any cluster node IP (master or worker) works; /etc/hosts uses the
# first matching entry, so leave only one line active
192.168.30.101 nginx-test.local nginx.pi.local
#192.168.30.102 nginx-test.local nginx.pi.local
#192.168.30.103 nginx-test.local nginx.pi.local
#192.168.30.104 nginx-test.local nginx.pi.local
```
Then access via browser:
@@ -473,8 +477,9 @@ Then access via browser:
Or test with curl:
```bash
# Test with any cluster node IP (master or worker)
curl -H "Host: nginx-test.local" http://192.168.30.101
curl -H "Host: nginx-test.local" http://192.168.30.102
```
### Scale the Deployment
@@ -624,7 +629,7 @@ ansible-playbook site.yml --tags k3s-server --limit <failed-master>
### Demoting a Master to Worker
To remove a master from control-plane and make it a worker (note: this reduces HA from 3-node to 2-node):
1. Edit `inventory/hosts.ini`:
@@ -638,6 +643,8 @@ To remove a master from control-plane and make it a worker:
cm4-04 ansible_host=192.168.30.104 ansible_user=pi
```
**Warning**: This leaves only 2 master nodes. A 2-member etcd cluster needs both members for quorum, so losing either master takes the control plane down; keep an odd number of masters (1 or 3).
2. Drain the node:
```bash
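# A typical drain invocation (example node name; adjust flags to your workloads)
kubectl drain cm4-03 --ignore-daemonsets --delete-emptydir-data
```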
@@ -690,7 +697,7 @@ To update to a specific k3s version:
```ini
[k3s_cluster:vars]
k3s_version=v1.36.0+k3s1
```
1. Run the k3s playbook to update all nodes:
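```bash
# Same tags as the rollback example later in this guide
ansible-playbook site.yml --tags k3s-server,k3s-agent
```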
@@ -711,7 +718,7 @@ For more control, you can manually update k3s on individual nodes:
ssh pi@<node-ip>
# Download and install specific version
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.36.0+k3s1 sh -
# Restart k3s
sudo systemctl restart k3s # On master
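sudo systemctl restart k3s-agent  # On workers (assumes the default k3s agent unit name)
```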
@@ -775,7 +782,7 @@ If an update causes issues, you can rollback to a previous version:
```bash
# Update inventory with previous version
# [k3s_cluster:vars]
# k3s_version=v1.35.0+k3s1
# Re-run the playbook
ansible-playbook site.yml --tags k3s-server,k3s-agent
@@ -814,7 +821,7 @@ ansible-playbook reboot.yml --limit master
### Reboot a Specific Node
```bash
ansible-playbook reboot.yml --limit cm4-04
```
## Troubleshooting
@@ -1001,26 +1008,33 @@ ansible-playbook site.yml --tags compute-blade-agent
## External DNS Configuration
To use external domains (like `test.zlor.fi`) with your k3s cluster ingress, you need to configure DNS. Your cluster uses a Virtual IP (192.168.30.100) via MikroTik for high availability.
### Step 1: Configure DNS Server Records
On your DNS server, add **A records** pointing to your k3s cluster nodes:
#### Option A: Virtual IP (VIP) via MikroTik - Recommended for HA
Use your MikroTik router's Virtual IP (192.168.30.100) for high availability:
```dns
test.zlor.fi A 192.168.30.100
```
**Pros:**
- Single IP for entire cluster
- Hardware-based failover (more reliable)
- Better performance
- No additional software needed
- Automatically routes to available masters
See [MIKROTIK-VIP-SETUP-CUSTOM.md](MIKROTIK-VIP-SETUP-CUSTOM.md) for detailed setup instructions.
#### Option B: Multiple Records (Load Balanced)
If your DNS supports multiple A records, point to all cluster nodes:
```dns
test.zlor.fi A 192.168.30.101
@@ -1029,32 +1043,19 @@ test.zlor.fi A 192.168.30.103
test.zlor.fi A 192.168.30.104
```
DNS clients will distribute requests across all nodes (round-robin).
**Pros:** Load balanced; rough failover (clients must retry another record if one node is down)
**Cons:** Requires DNS server support for multiple A records
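To confirm round-robin resolution, query the record a few times (a sketch assuming `dig` is available):
```bash
# Should list all four A records; most resolvers rotate the order
dig +short test.zlor.fi
```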
#### Option C: Single Master Node (No Failover)
For simple setups without redundancy:
```dns
test.zlor.fi A 192.168.30.101
```
**Pros:** Simple, works with any DNS server
**Cons:** No failover if that node is down (not recommended for HA clusters)
### Step 2: Configure Cluster Nodes for External DNS