Files
k3s-ansible/DEPLOYMENT_CHECKLIST.md

292 lines
7.1 KiB
Markdown

# Compute Blade Agent Deployment Checklist
## Pre-Deployment
- [ ] Review inventory configuration: `cat inventory/hosts.ini`
- [ ] Verify SSH access to all worker nodes: `ansible all -m ping`
- [ ] Review Compute Blade Agent documentation: `cat COMPUTE_BLADE_AGENT.md`
- [ ] Check that compute-blade-agent is enabled: `grep enable_compute_blade_agent inventory/hosts.ini`
## Deployment
### Option 1: Full Stack Deployment (Recommended for new clusters)
```bash
ansible-playbook site.yml
```
This will:
1. Prepare all nodes (prerequisites)
2. Install K3s server on master
3. Install K3s agents on workers
4. Install compute-blade-agent on workers
5. Deploy test nginx application
- [ ] Start full deployment
- [ ] Wait for completion (typically 10-20 minutes)
- [ ] Check for any errors in output
### Option 2: Skip Test Application
```bash
ansible-playbook site.yml --skip-tags test
```
- [ ] Start deployment without test app
- [ ] Faster deployment, suitable if cluster already has applications
### Option 3: Deploy Only Compute Blade Agent
```bash
ansible-playbook site.yml --tags compute-blade-agent
```
- [ ] Use on existing K3s cluster
- [ ] Deploy agent to all configured workers
- [ ] Verify with verification script
## Post-Deployment Verification
### 1. Check Cluster Status
```bash
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get nodes
```
- [ ] All master and worker nodes should show "Ready"
### 2. Run Verification Script
```bash
bash scripts/verify-compute-blade-agent.sh
```
- [ ] All worker nodes pass connectivity check
- [ ] Binary is installed at `/usr/local/bin/compute-blade-agent`
- [ ] Service status shows "Running"
- [ ] Config file exists at `/etc/compute-blade-agent/config.yaml`
### 3. Manual Verification on a Master Node
```bash
# Connect to any master (cm4-01, cm4-02, or cm4-03)
ssh pi@192.168.30.101
kubectl get nodes
```
- [ ] All 3 masters show as "Ready"
- [ ] Worker node (cm4-04) shows as "Ready"
### 4. Check Etcd Quorum
```bash
ssh pi@192.168.30.101
sudo /var/lib/rancher/k3s/data/*/bin/etcdctl member list
```
- [ ] All 3 etcd members show as active
- [ ] Cluster has quorum (2/3 minimum for failover)
### 5. Verify Kubeconfig
```bash
export KUBECONFIG=$(pwd)/kubeconfig
kubectl config get-contexts
```
- [ ] Shows contexts: cm4-01, cm4-02, cm4-03, and default
- [ ] All contexts point to correct control-plane nodes
## Optional: Kubernetes Monitoring Setup
### Deploy Monitoring Resources
```bash
kubectl apply -f manifests/compute-blade-agent-daemonset.yaml
```
- [ ] Check namespace creation: `kubectl get namespace compute-blade-agent`
- [ ] Check DaemonSet: `kubectl get daemonset -n compute-blade-agent`
- [ ] Check service: `kubectl get service -n compute-blade-agent`
### Enable Prometheus Monitoring
1. Edit `manifests/compute-blade-agent-daemonset.yaml`
2. Uncomment the ServiceMonitor section
3. Apply: `kubectl apply -f manifests/compute-blade-agent-daemonset.yaml`
- [ ] ServiceMonitor created (if Prometheus operator installed)
- [ ] Prometheus scrape targets added (visible in Prometheus UI)
## Troubleshooting
### Service Not Running
- [ ] Check status: `sudo systemctl status compute-blade-agent`
- [ ] Check logs: `sudo journalctl -u compute-blade-agent -f`
- [ ] Check if binary exists: `ls -la /usr/local/bin/compute-blade-agent`
- [ ] Check systemd unit: `cat /etc/systemd/system/compute-blade-agent.service`
### Installation Failed
- [ ] Re-run Ansible playbook: `ansible-playbook site.yml --tags compute-blade-agent`
- [ ] Check for network connectivity during installation
- [ ] Verify sufficient disk space on nodes
- [ ] Check /tmp directory permissions
### Hardware Not Detected
- [ ] Verify physical hardware connection
- [ ] Check dmesg: `sudo dmesg | grep -i compute`
- [ ] Check hardware info: `lspci` or `lsusb`
- [ ] Review compute-blade-agent logs for detection messages
## Configuration
### Global Configuration
To enable/disable on all workers, edit `inventory/hosts.ini`:
```ini
[k3s_cluster:vars]
enable_compute_blade_agent=true # or false
```
- [ ] Configuration reviewed and correct
- [ ] Saved inventory file
### Per-Node Configuration
Note: cm4-02 and cm4-03 are now **master nodes**, not workers. To enable/disable compute-blade-agent on specific nodes:
```ini
[master]
cm4-01 ansible_host=192.168.30.101 ansible_user=pi k3s_server_init=true enable_compute_blade_agent=false
cm4-02 ansible_host=192.168.30.102 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
cm4-03 ansible_host=192.168.30.103 ansible_user=pi k3s_server_init=false enable_compute_blade_agent=false
[worker]
cm4-04 ansible_host=192.168.30.104 ansible_user=pi enable_compute_blade_agent=true
```
- [ ] Per-node settings configured as needed
- [ ] Master nodes typically don't need compute-blade-agent
- [ ] Saved inventory file
- [ ] Re-run playbook if changes made
### Agent Configuration
Edit configuration on the node:
```bash
ssh pi@<worker-ip>
sudo vi /etc/compute-blade-agent/config.yaml
sudo systemctl restart compute-blade-agent
```
- [ ] Configuration customized (if needed)
- [ ] Service restarted successfully
## Maintenance
### Restart Service
```bash
ssh pi@<worker-ip>
sudo systemctl restart compute-blade-agent
```
- [ ] Service restarted
- [ ] Service is still running
### View Real-time Logs
```bash
ssh pi@<worker-ip>
sudo journalctl -u compute-blade-agent -f
```
- [ ] Monitor for any issues
- [ ] Press Ctrl+C to exit
### Check Service on All Workers
```bash
ansible worker -m shell -a "systemctl status compute-blade-agent" --become
```
- [ ] All workers show active status
## HA Cluster Maintenance
### Testing Failover
Your 3-node HA cluster can handle one master going down (maintains 2/3 quorum):
```bash
# Reboot one master while monitoring cluster
ssh pi@192.168.30.101
sudo reboot
# From another terminal, watch cluster status
watch kubectl get nodes
```
- [ ] Cluster remains operational with 2/3 masters
- [ ] Pods continue running
- [ ] Can still kubectl from cm4-02 or cm4-03 context
## Uninstall (if needed)
### Uninstall K3s from All Nodes
```bash
ansible all -m shell -a "bash /usr/local/bin/k3s-uninstall.sh" --become
ansible worker -m shell -a "bash /usr/local/bin/k3s-agent-uninstall.sh" --become
```
- [ ] All K3s services stopped
- [ ] Cluster data cleaned up
### Disable in Future Deployments
Edit `inventory/hosts.ini`:
```ini
enable_compute_blade_agent=false
```
- [ ] Setting disabled
- [ ] Won't be deployed on next playbook run
## Documentation References
- [ ] Read README.md compute-blade-agent section
- [ ] Read COMPUTE_BLADE_AGENT.md quick reference
- [ ] Check GitHub repo: [compute-blade-agent](https://github.com/compute-blade-community/compute-blade-agent)
- [ ] Review Ansible role: `cat roles/compute-blade-agent/tasks/main.yml`
## Completion
- [ ] All deployment steps completed
- [ ] All verification checks passed
- [ ] Documentation reviewed
- [ ] Team notified of deployment
- [ ] Monitoring configured (optional)
- [ ] Backup of configuration taken
## Notes
Document any issues, customizations, or special configurations here:
```text
[Add notes here]
```
---
**Last Updated**: 2025-11-24
**Status**: Ready for Deployment