
K3s Cluster Deployment Checklist (with Compute Blade Agent on All Nodes)

Pre-Deployment

  • Review inventory configuration: cat inventory/hosts.ini
  • Verify SSH access to all nodes: ansible all -m ping
  • Review Compute Blade Agent documentation: cat COMPUTE_BLADE_AGENT.md
  • Check that compute-blade-agent is enabled: grep enable_compute_blade_agent inventory/hosts.ini
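
The checks above can be combined into a single pre-flight gate; a minimal sketch run from the repository root (the exact `enable_compute_blade_agent=true` value is an assumption, adjust to match your inventory):

```shell
# Fail fast if any pre-deployment check does not pass
set -e
test -f inventory/hosts.ini
ansible all -m ping
grep -q 'enable_compute_blade_agent=true' inventory/hosts.ini
echo "Pre-flight checks passed"
```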

Deployment

ansible-playbook site.yml

This will:

  1. Prepare all nodes (prerequisites)
  2. Install K3s server on master
  3. Install K3s agents on workers
  4. Install compute-blade-agent on all nodes (control-plane and workers)
  5. Deploy test nginx application

Option 1: Full Deployment

  • Start full deployment
  • Wait for completion (typically 10-20 minutes)
  • Check for any errors in output

Option 2: Skip Test Application

ansible-playbook site.yml --skip-tags test
  • Start deployment without test app
  • Faster deployment, suitable if cluster already has applications

Option 3: Deploy Only Compute Blade Agent

ansible-playbook site.yml --tags compute-blade-agent
  • Use on existing K3s cluster
  • Deploy agent to all configured nodes (masters and workers)
  • Verify with verification script
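
If the agent should only reach a subset of nodes, the same tag can be combined with Ansible's `--limit` option (host name taken from the example inventory in this checklist):

```shell
# Deploy the agent to a single node only
ansible-playbook site.yml --tags compute-blade-agent --limit cm4-04
```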

Post-Deployment Verification

1. Check Cluster Status

export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
  • All master and worker nodes should show "Ready"

2. Run Verification Script

bash scripts/verify-compute-blade-agent.sh
  • All nodes pass the connectivity check
  • Binary is installed at /usr/bin/compute-blade-agent
  • Service status shows "Running"
  • Config file exists at /etc/compute-blade-agent/config.yaml
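
The same checks can also be run ad hoc with Ansible; a minimal sketch, assuming the `k3s_cluster` inventory group used elsewhere in this checklist:

```shell
# Verify the binary is executable and the service is active on every node
ansible k3s_cluster -m shell --become \
  -a "test -x /usr/bin/compute-blade-agent && systemctl is-active compute-blade-agent"
```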

3. Manual Verification on a Master Node

# Connect to any master (cm4-01, cm4-02, or cm4-03)
ssh pi@192.168.30.101
kubectl get nodes
  • All 3 masters show as "Ready"
  • Worker node (cm4-04) shows as "Ready"

4. Check Etcd Quorum

ssh pi@192.168.30.101
# K3s's embedded etcd requires client certificates (default K3s paths shown)
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list
  • All 3 etcd members show as active
  • Cluster has quorum (2/3 minimum for failover)
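
Beyond membership, endpoint health can be queried directly. A sketch, assuming the default K3s TLS certificate paths (they may differ across K3s versions):

```shell
# Check health of all etcd members from one master
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  endpoint health --cluster
```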

5. Verify Kubeconfig

export KUBECONFIG=$(pwd)/kubeconfig
kubectl config get-contexts
  • Shows contexts: cm4-01, cm4-02, cm4-03, and default
  • All contexts point to correct control-plane nodes

Optional: Kubernetes Monitoring Setup

Deploy Monitoring Resources

kubectl apply -f manifests/compute-blade-agent-daemonset.yaml
  • Check namespace creation: kubectl get namespace compute-blade-agent
  • Check DaemonSet: kubectl get daemonset -n compute-blade-agent
  • Check service: kubectl get service -n compute-blade-agent
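
To block until the agent pods are running on every node, a rollout wait can follow the apply (the DaemonSet name is assumed to match the manifest filename):

```shell
# Wait up to two minutes for the DaemonSet to become ready on all nodes
kubectl rollout status daemonset/compute-blade-agent \
  -n compute-blade-agent --timeout=120s
```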

Enable Prometheus Monitoring

  1. Edit manifests/compute-blade-agent-daemonset.yaml
  2. Uncomment the ServiceMonitor section
  3. Apply: kubectl apply -f manifests/compute-blade-agent-daemonset.yaml
  • ServiceMonitor created (if Prometheus operator installed)
  • Prometheus scrape targets added (visible in Prometheus UI)

Troubleshooting

Service Not Running

  • Check status: sudo systemctl status compute-blade-agent
  • Check logs: sudo journalctl -u compute-blade-agent -f
  • Check if binary exists: ls -la /usr/bin/compute-blade-agent
  • Check systemd unit: cat /etc/systemd/system/compute-blade-agent.service

Installation Failed

  • Re-run Ansible playbook: ansible-playbook site.yml --tags compute-blade-agent
  • Check for network connectivity during installation
  • Verify sufficient disk space on nodes
  • Check /tmp directory permissions
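
The disk-space and /tmp checks can be run across all nodes at once; a sketch using the `k3s_cluster` inventory group:

```shell
# Show free space on / and /tmp, plus /tmp permissions, on every node
ansible k3s_cluster -m shell -a "df -h / /tmp && ls -ld /tmp" --become
```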

Hardware Not Detected

  • Verify physical hardware connection
  • Check dmesg: sudo dmesg | grep -i compute
  • Check hardware info: lspci or lsusb
  • Review compute-blade-agent logs for detection messages

Configuration

Global Configuration

To enable/disable on all nodes, edit inventory/hosts.ini:

[k3s_cluster:vars]
enable_compute_blade_agent=true  # or false
  • Configuration reviewed and correct
  • Saved inventory file

Per-Node Configuration

Compute-blade-agent is now installed on all nodes (control-plane and workers). To enable or disable it on specific nodes, set the variable per host:

[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false

[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
  • Per-node settings configured as needed
  • Saved inventory file
  • Re-run playbook if changes made: ansible-playbook site.yml --tags compute-blade-agent

Agent Configuration

Edit configuration on the node:

ssh pi@<node-ip>
sudo vi /etc/compute-blade-agent/config.yaml
sudo systemctl restart compute-blade-agent
  • Configuration customized (if needed)
  • Service restarted successfully

Maintenance

Restart Service

ssh pi@<node-ip>
sudo systemctl restart compute-blade-agent
  • Service restarted
  • Service is still running

View Real-time Logs

ssh pi@<node-ip>
sudo journalctl -u compute-blade-agent -f
  • Monitor for any issues
  • Press Ctrl+C to exit

Check Service on All Nodes

ansible k3s_cluster -m shell -a "systemctl status compute-blade-agent" --become
  • All nodes show active status
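
For a terser report, Ansible's one-line output mode condenses each node to a single row (nodes where the service is inactive will show as failed, which is the signal you want):

```shell
# One row per node: hostname, return code, and active/inactive state
ansible k3s_cluster -m shell -a "systemctl is-active compute-blade-agent" --become -o
```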

HA Cluster Maintenance

Testing Failover

Your 3-node HA cluster can handle one master going down (maintains 2/3 quorum):

# Reboot one master while monitoring cluster
ssh pi@192.168.30.101
sudo reboot

# From another terminal, watch cluster status
watch kubectl get nodes
  • Cluster remains operational with 2/3 masters
  • Pods continue running
  • Can still kubectl from cm4-02 or cm4-03 context
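
A quick way to watch the Ready count during the reboot; a sketch that assumes the kubeconfig from step 5 is exported:

```shell
# Count nodes currently reporting Ready; expect 3 of 4 while one master reboots
kubectl get nodes --no-headers | awk '$2 == "Ready"' | wc -l
```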

Uninstall (if needed)

Uninstall K3s from All Nodes

# Server uninstall script exists on masters, agent uninstall script on workers
ansible master -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become
  • All K3s services stopped
  • Cluster data cleaned up

Disable in Future Deployments

Edit inventory/hosts.ini:

enable_compute_blade_agent=false
  • Setting disabled
  • Won't be deployed on next playbook run

Documentation References

  • Read README.md compute-blade-agent section
  • Read COMPUTE_BLADE_AGENT.md quick reference
  • Check GitHub repo: compute-blade-agent
  • Review Ansible role: cat roles/compute-blade-agent/tasks/main.yml

Completion

  • All deployment steps completed
  • All verification checks passed
  • Documentation reviewed
  • Team notified of deployment
  • Monitoring configured (optional)
  • Backup of configuration taken

Notes

Document any issues, customizations, or special configurations here:

[Add notes here]

Last Updated: 2025-11-24
Status: Ready for Deployment