Troubleshooting

Common issues and diagnostic steps.

PXE Boot Issues

Node doesn't get an IP

Check DHCP is running on nuc-00-01:

ssh ${IP_PREFIX}.8 systemctl status dhcpd

Verify the MAC address in dhcpd.conf matches the actual NIC MAC
Check that no other DHCP server is active on the network segment

Verify TFTP is serving ipxe.efi:

tftp ${IP_PREFIX}.8 -c get ipxe.efi /tmp/test.efi && echo OK

Check the DHCP next-server and filename options in dhcpd.conf

Verify Apache is serving the iPXE menu:

curl http://${IP_PREFIX}.10/harvester/harvester/ipxe-menu

Harvester install stalls

Check that all config YAML files are rendered (no ${VAR} placeholders remain):
```
grep '\${' /srv/www/htdocs/harvester/harvester/config-create-nuc-01.yaml
```
Verify DNS resolves the Harvester VIP before second/third node joins
Check Harvester installer logs on the console

DNS Issues

Cluster nodes can't resolve hostnames

Verify BIND is running:

ssh ${IP_PREFIX}.8 systemctl status named

Test resolution from a cluster node:

dig @${IP_PREFIX}.8 rancher.${BASE_DOMAIN}

Check that nuc-00-01 is the first DNS server in /etc/resolv.conf on each node

Rancher Manager Issues

Rancher UI unreachable

Check the Keepalived VIP is up:
```
ping -c 1 ${IP_PREFIX}.193
```

Check HAProxy is forwarding to Rancher VMs:

ssh ${IP_PREFIX}.93 systemctl status haproxy

Verify Rancher RKE2 VMs are running inside Harvester

`cert-manager` webhook failures

Wait 2–3 minutes after cert-manager install before running the Rancher Helm chart. The webhook needs time to become ready.

kubectl --kubeconfig ~/.kube/rancher.yaml -n cert-manager wait \
  --for=condition=ready pod --selector=app.kubernetes.io/component=webhook \
  --timeout=120s

Environment / Registry Issues

Image pull failures (Carbide/Enclave)

Verify registry credentials are correct:

bash Scripts/modules/carbide/registry_auth.sh

Check Harvester registry mirror configuration:

kubectl --kubeconfig ~/.kube/harvester.yaml get configmap -n harvester-system

For Enclave: verify the Hauler store is populated and Harbor is running

`envsubst` leaves `${VAR}` placeholders

Ensure the environment is sourced before running scripts:

source Scripts/env.sh
echo $BASE_DOMAIN   # should print the domain, not "${BASE_DOMAIN}"

General Diagnostic Commands

# Check all pods across all namespaces
kubectl --kubeconfig ~/.kube/harvester.yaml get pods -A | grep -v Running

# Recent events (errors first)
kubectl --kubeconfig ~/.kube/harvester.yaml get events -A --sort-by='.lastTimestamp' | tail -20

# Harvester node logs
kubectl --kubeconfig ~/.kube/harvester.yaml logs -n harvester-system -l app=harvester --tail=50

# Check cluster etcd health
kubectl --kubeconfig ~/.kube/harvester.yaml -n kube-system exec -it \
  $(kubectl --kubeconfig ~/.kube/harvester.yaml get pods -n kube-system -l component=etcd -o name | head -1) \
  -- etcdctl endpoint health

PXE Boot Issues​

Node doesn't get an IP​

iPXE menu doesn't appear (node boots from disk)​

Harvester install stalls​

DNS Issues​

Cluster nodes can't resolve hostnames​

Rancher Manager Issues​

Rancher UI unreachable​

cert-manager webhook failures​

Environment / Registry Issues​

Image pull failures (Carbide/Enclave)​

envsubst leaves ${VAR} placeholders​

General Diagnostic Commands​