# Compute Blade Agent Deployment Checklist ## Pre-Deployment - [ ] Review inventory configuration: `cat inventory/hosts.ini` - [ ] Verify SSH access to all worker nodes: `ansible all -m ping` - [ ] Review Compute Blade Agent documentation: `cat COMPUTE_BLADE_AGENT.md` - [ ] Check that compute-blade-agent is enabled: `grep enable_compute_blade_agent inventory/hosts.ini` ## Deployment ### Option 1: Full Stack Deployment (Recommended for new clusters) ```bash ansible-playbook site.yml ``` This will: 1. Prepare all nodes (prerequisites) 2. Install K3s server on master 3. Install K3s agents on workers 4. Install compute-blade-agent on workers 5. Deploy test nginx application - [ ] Start full deployment - [ ] Wait for completion (typically 10-20 minutes) - [ ] Check for any errors in output ### Option 2: Skip Test Application ```bash ansible-playbook site.yml --skip-tags test ``` - [ ] Start deployment without test app - [ ] Faster deployment, suitable if cluster already has applications ### Option 3: Deploy Only Compute Blade Agent ```bash ansible-playbook site.yml --tags compute-blade-agent ``` - [ ] Use on existing K3s cluster - [ ] Deploy agent to all configured workers - [ ] Verify with verification script ## Post-Deployment Verification ### 1. Check Cluster Status ```bash export KUBECONFIG=$(pwd)/kubeconfig kubectl get nodes ``` - [ ] All master and worker nodes should show "Ready" ### 2. Run Verification Script ```bash bash scripts/verify-compute-blade-agent.sh ``` - [ ] All worker nodes pass connectivity check - [ ] Binary is installed at `/usr/local/bin/compute-blade-agent` - [ ] Service status shows "Running" - [ ] Config file exists at `/etc/compute-blade-agent/config.yaml` ### 3. Manual Verification on a Master Node ```bash # Connect to any master (cm4-01, cm4-02, or cm4-03) ssh pi@192.168.30.101 kubectl get nodes ``` - [ ] All 3 masters show as "Ready" - [ ] Worker node (cm4-04) shows as "Ready" ### 4. Check Etcd Quorum ```bash ssh pi@192.168.30.101 sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list ``` - [ ] All 3 etcd members show as active - [ ] Cluster has quorum (2/3 minimum for failover) ### 5. Verify Kubeconfig ```bash export KUBECONFIG=$(pwd)/kubeconfig kubectl config get-contexts ``` - [ ] Shows contexts: cm4-01, cm4-02, cm4-03, and default - [ ] All contexts point to correct control-plane nodes ## Optional: Kubernetes Monitoring Setup ### Deploy Monitoring Resources ```bash kubectl apply -f manifests/compute-blade-agent-daemonset.yaml ``` - [ ] Check namespace creation: `kubectl get namespace compute-blade-agent` - [ ] Check DaemonSet: `kubectl get daemonset -n compute-blade-agent` - [ ] Check service: `kubectl get service -n compute-blade-agent` ### Enable Prometheus Monitoring 1. Edit `manifests/compute-blade-agent-daemonset.yaml` 2. Uncomment the ServiceMonitor section 3. Apply: `kubectl apply -f manifests/compute-blade-agent-daemonset.yaml` - [ ] ServiceMonitor created (if Prometheus operator installed) - [ ] Prometheus scrape targets added (visible in Prometheus UI) ## Troubleshooting ### Service Not Running - [ ] Check status: `sudo systemctl status compute-blade-agent` - [ ] Check logs: `sudo journalctl -u compute-blade-agent -f` - [ ] Check if binary exists: `ls -la /usr/local/bin/compute-blade-agent` - [ ] Check systemd unit: `cat /etc/systemd/system/compute-blade-agent.service` ### Installation Failed - [ ] Re-run Ansible playbook: `ansible-playbook site.yml --tags compute-blade-agent` - [ ] Check for network connectivity during installation - [ ] Verify sufficient disk space on nodes - [ ] Check /tmp directory permissions ### Hardware Not Detected - [ ] Verify physical hardware connection - [ ] Check dmesg: `sudo dmesg | grep -i compute` - [ ] Check hardware info: `lspci` or `lsusb` - [ ] Review compute-blade-agent logs for detection messages ## Configuration ### Global Configuration To enable/disable on all workers, edit `inventory/hosts.ini`: ```ini [k3s_cluster:vars] enable_compute_blade_agent=true # or false ``` - [ ] Configuration reviewed and correct - [ ] Saved inventory file ### Per-Node Configuration Note: cm4-02 and cm4-03 are now **master nodes**, not workers. To enable/disable compute-blade-agent on specific nodes: ```ini [master] cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false [worker] cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true ``` - [ ] Per-node settings configured as needed - [ ] Master nodes typically don't need compute-blade-agent - [ ] Saved inventory file - [ ] Re-run playbook if changes made ### Agent Configuration Edit configuration on the node: ```bash ssh pi@ sudo vi /etc/compute-blade-agent/config.yaml sudo systemctl restart compute-blade-agent ``` - [ ] Configuration customized (if needed) - [ ] Service restarted successfully ## Maintenance ### Restart Service ```bash ssh pi@ sudo systemctl restart compute-blade-agent ``` - [ ] Service restarted - [ ] Service is still running ### View Real-time Logs ```bash ssh pi@ sudo journalctl -u compute-blade-agent -f ``` - [ ] Monitor for any issues - [ ] Press Ctrl+C to exit ### Check Service on All Workers ```bash ansible worker -m shell -a "systemctl status compute-blade-agent" --become ``` - [ ] All workers show active status ## HA Cluster Maintenance ### Testing Failover Your 3-node HA cluster can handle one master going down (maintains 2/3 quorum): ```bash # Reboot one master while monitoring cluster ssh pi@192.168.30.101 sudo reboot # From another terminal, watch cluster status watch kubectl get nodes ``` - [ ] Cluster remains operational with 2/3 masters - [ ] Pods continue running - [ ] Can still kubectl from cm4-02 or cm4-03 context ## Uninstall (if needed) ### Uninstall K3s from All Nodes ```bash ansible all -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become ``` - [ ] All K3s services stopped - [ ] Cluster data cleaned up ### Disable in Future Deployments Edit `inventory/hosts.ini`: ```ini enable_compute_blade_agent=false ``` - [ ] Setting disabled - [ ] Won't be deployed on next playbook run ## Documentation References - [ ] Read README.md compute-blade-agent section - [ ] Read COMPUTE_BLADE_AGENT.md quick reference - [ ] Check GitHub repo: [compute-blade-agent](https://github.com/compute-blade-community/compute-blade-agent) - [ ] Review Ansible role: `cat roles/compute-blade-agent/tasks/main.yml` ## Completion - [ ] All deployment steps completed - [ ] All verification checks passed - [ ] Documentation reviewed - [ ] Team notified of deployment - [ ] Monitoring configured (optional) - [ ] Backup of configuration taken ## Notes Document any issues, customizations, or special configurations here: ```text [Add notes here] ``` --- **Last Updated**: 2025-11-24 **Status**: Ready for Deployment