updating documentation

2026-01-08 17:48:49 +01:00
parent a2cf2a86d2
commit 4e0a3cf0cb
4 changed files with 177 additions and 113 deletions
@@ -68,36 +68,36 @@ bash scripts/verify-compute-blade-agent.sh
 - [ ] Service status shows "Running"
 - [ ] Config file exists at `/etc/compute-blade-agent/config.yaml`
-### 3. Manual Verification on a Worker
+### 3. Manual Verification on a Master Node
 ```bash
-ssh pi@192.168.30.102
-sudo systemctl status compute-blade-agent
+# Connect to any master (cm4-01, cm4-02, or cm4-03)
+ssh pi@192.168.30.101
+kubectl get nodes
 ```
-- [ ] Service is active (running)
-- [ ] Service is enabled (will start on boot)
+- [ ] All 3 masters show as "Ready"
+- [ ] Worker node (cm4-04) shows as "Ready"
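Those two "Ready" checks can also be scripted. The helper below is an editor's sketch, not part of the playbook: `count_ready` counts Ready rows in `kubectl get nodes --no-headers` output, and the sample output (including the version strings) is illustrative, not captured from a live cluster.

```bash
# Count nodes whose STATUS column reads "Ready" in
# `kubectl get nodes --no-headers` output.
count_ready() {
  echo "$1" | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Illustrative sample output for the 4-node cluster (not real capture).
sample='cm4-01   Ready   control-plane,etcd,master   10d   v1.28.4+k3s1
cm4-02   Ready   control-plane,etcd,master   10d   v1.28.4+k3s1
cm4-03   Ready   control-plane,etcd,master   10d   v1.28.4+k3s1
cm4-04   Ready   <none>                      10d   v1.28.4+k3s1'

count_ready "$sample"   # prints 4 when all three masters and the worker are Ready
```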
### 4. Check Logs
### 4. Check Etcd Quorum
```bash
ssh pi@192.168.30.102
sudo journalctl -u compute-blade-agent -n 50
ssh pi@192.168.30.101
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
```
- [ ] No error messages
- [ ] Service started successfully
- [ ] Hardware detection messages present (if applicable)
- [ ] All 3 etcd members show as active
- [ ] Cluster has quorum (2/3 minimum for failover)
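The quorum rule behind that checklist item is simple majority: an n-member etcd cluster stays writable while floor(n/2) + 1 members are healthy. A minimal arithmetic sketch:

```bash
# Majority quorum for an n-member etcd cluster: floor(n/2) + 1.
# With 3 members, 2 must stay healthy, so the cluster survives one failure.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # prints 2
quorum 5   # prints 3
```

This is also why a 4th master would not improve fault tolerance: quorum of 4 is 3, so the cluster still only survives one failure.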
-### 5. Verify Installation
+### 5. Verify Kubeconfig
 ```bash
-ssh pi@192.168.30.102
-/usr/local/bin/compute-blade-agent --version
+export KUBECONFIG=$(pwd)/kubeconfig
+kubectl config get-contexts
 ```
-- [ ] Binary responds with version information
-- [ ] bladectl CLI tool is available
+- [ ] Shows contexts: cm4-01, cm4-02, cm4-03, and default
+- [ ] All contexts point to correct control-plane nodes
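The context check can be automated too. `check_contexts` below is a hypothetical helper (not in the repo) that compares a list of context names against the expected set; in practice you would feed it the output of `kubectl config get-contexts -o name`.

```bash
# Fail if any expected context name is missing from the supplied list.
# In practice: check_contexts "$(kubectl config get-contexts -o name)" cm4-01 ...
check_contexts() {
  local contexts="$1"; shift
  for c in "$@"; do
    echo "$contexts" | grep -qx "$c" || { echo "missing context: $c"; return 1; }
  done
  echo "all contexts present"
}

# Illustrative stand-in for `kubectl config get-contexts -o name` output.
sample='cm4-01
cm4-02
cm4-03
default'

check_contexts "$sample" cm4-01 cm4-02 cm4-03 default   # prints "all contexts present"
```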
## Optional: Kubernetes Monitoring Setup
@@ -159,15 +159,20 @@ enable_compute_blade_agent=true # or false
 ### Per-Node Configuration
-To enable/disable specific nodes, edit `inventory/hosts.ini`:
+Note: cm4-02 and cm4-03 are now **master nodes**, not workers. To enable/disable compute-blade-agent on specific nodes, edit `inventory/hosts.ini`:
 ```ini
+[master]
+cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
+cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
+cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
 [worker]
-cm4-02 ansible_host=... enable_compute_blade_agent=false
-cm4-03 ansible_host=... enable_compute_blade_agent=true
 cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
 ```
 - [ ] Per-node settings configured as needed
+- [ ] Master nodes typically don't need compute-blade-agent
 - [ ] Saved inventory file
 - [ ] Re-run playbook if changes made
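Before re-running the playbook, it can help to see at a glance which hosts actually have the agent enabled. The one-liner below is a sketch run against an inline copy of the inventory format above, not the real `inventory/hosts.ini`:

```bash
# Inline stand-in for inventory/hosts.ini (abbreviated, illustrative).
inventory='[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi enable_compute_blade_agent=false
[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true'

# Print the host name of every line where the agent is enabled.
echo "$inventory" | awk '/enable_compute_blade_agent=true/ { print $1 }'   # prints cm4-04
```

Against the real file, replace the `echo` with `awk '...' inventory/hosts.ini`.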
@@ -214,26 +219,36 @@ ansible worker -m shell -a "systemctl status compute-blade-agent" --become
 - [ ] All workers show active status
+## HA Cluster Maintenance
+### Testing Failover
+Your 3-node HA cluster can handle one master going down (maintains 2/3 quorum):
+```bash
+# Reboot one master while monitoring cluster
+ssh pi@192.168.30.101
+sudo reboot
+# From another terminal, watch cluster status
+watch kubectl get nodes
+```
+- [ ] Cluster remains operational with 2/3 masters
+- [ ] Pods continue running
+- [ ] Can still kubectl from cm4-02 or cm4-03 context
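During the failover test, the question "is the cluster still safe?" reduces to: at least 2 of the 3 masters report Ready. A sketch of that decision, with illustrative `kubectl get nodes` output taken mid-reboot of cm4-01 (not a real capture):

```bash
# Report whether at least 2 master nodes are Ready (3-node etcd quorum).
masters_ready() {
  echo "$1" | awk '$2 == "Ready" && $3 ~ /master/ { n++ }
                   END { if (n >= 2) print "quorum ok"; else print "quorum lost" }'
}

# Illustrative node table while cm4-01 is rebooting.
during_reboot='cm4-01   NotReady   control-plane,etcd,master   10d   v1.28.4+k3s1
cm4-02   Ready      control-plane,etcd,master   10d   v1.28.4+k3s1
cm4-03   Ready      control-plane,etcd,master   10d   v1.28.4+k3s1
cm4-04   Ready      <none>                      10d   v1.28.4+k3s1'

masters_ready "$during_reboot"   # prints "quorum ok"
```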
 ## Uninstall (if needed)
-### Uninstall from Single Node
+### Uninstall K3s from All Nodes
 ```bash
-ssh pi@<worker-ip>
-sudo bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh
+ansible master -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
+ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become
 ```
-- [ ] Uninstall script executed
-- [ ] Service removed
-- [ ] Configuration cleaned up
-### Uninstall from All Workers
-```bash
-ansible worker -m shell -a "bash /usr/local/bin/k3s-uninstall-compute-blade-agent.sh" --become
-```
-- [ ] All workers uninstalled
+- [ ] All K3s services stopped
+- [ ] Cluster data cleaned up
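A quick way to tick those last two boxes is to confirm the standard K3s install locations are gone on each node. `check_removed` is a hypothetical helper, and the paths are the defaults used by the k3s install script; adjust if your deployment relocated them.

```bash
# Spot-check that a list of paths no longer exists after uninstall.
check_removed() {
  for p in "$@"; do
    if [ -e "$p" ]; then echo "still present: $p"; return 1; fi
  done
  echo "clean"
}

# Default K3s locations; prints "clean" on a node where uninstall succeeded.
check_removed /usr/local/bin/k3s /etc/rancher/k3s /var/lib/rancher/k3s
```

Run it per node, e.g. via `ansible all -m script` or over ssh.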
### Disable in Future Deployments