# Troubleshooting
## Cluster Creation Issues

### Docker Connection Failed

Verify Docker is running with `docker ps`. If it is not running, start Docker Desktop (macOS) or run `sudo systemctl start docker` (Linux).
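This preflight can be scripted so cluster creation fails fast with a readable message. A minimal sketch; the `check_docker` helper is ours, not a KSail command:

```shell
# Preflight sketch: confirm the Docker daemon is reachable before creating a
# cluster. The function name `check_docker` is an illustrative assumption.
check_docker() {
  if docker ps >/dev/null 2>&1; then
    echo "Docker daemon reachable"
  else
    echo "Docker daemon not reachable; start Docker Desktop (macOS) or run 'sudo systemctl start docker' (Linux)" >&2
    return 1
  fi
}
```

Running `check_docker && ksail cluster create` then surfaces a daemon problem before KSail starts work.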
### Cluster Creation Hangs

Common causes: insufficient resources, a firewall blocking Docker network access, or leftover cluster state. Check for existing clusters and clean up stale state:
```shell
ksail cluster list
ksail cluster delete --name <cluster-name>
docker system prune -f
```

### Port Already in Use

If you see `Error: Port 5000 is already allocated`, use a different port (e.g., `--local-registry localhost:5050`) or kill the conflicting process.

macOS/Linux:

```shell
lsof -ti:5000 | xargs kill -9
```

Windows (PowerShell):

```shell
netstat -ano | findstr :5000
taskkill /PID <id> /F
```

## GitOps Workflow Issues
### Registry Access and Image Push Failures

KSail automatically retries transient registry errors (HTTP 429, 5xx, timeouts) during cluster create/update and `ksail workload push`. For authentication errors, verify connectivity and credentials:

```shell
curl -I https://registry.example.com/v2/
docker ps | grep registry
ksail cluster init --local-registry '${REG_USER}:${REG_TOKEN}@registry.example.com/my-org/my-repo'
```

Common error messages:

- `registry requires authentication`: missing or incorrect `--local-registry` credentials
- `registry access denied`: credentials lack write permission
- `registry is unreachable`: DNS failure, firewall, or registry down
Registry containers have a built-in health check (polls `/v2/` every 10 s, marks the container unhealthy after 3 consecutive failures). To diagnose mirror errors:

```shell
docker ps --filter label=io.ksail.registry --format 'table {{.Names}}\t{{.Status}}'
docker inspect --format '{{json .State.Health}}' <container-name>
```

### Flux Operator Installation Timeout
Flux CRDs can take 7–10 minutes to install on resource-constrained systems; KSail allows up to 12 minutes. If timeouts persist, check resources (`docker stats`) and ensure at least 4 GB of RAM is available to Docker.

```shell
ksail workload get pods -n flux-system
kubectl get crd <crd-name> -o jsonpath='{.status.conditions[?(@.type=="Established")].status}'
```

### Flux/ArgoCD CrashLoopBackOff After Component Installation
Infrastructure components (MetalLB, Kyverno, cert-manager) can temporarily disrupt API server connectivity while registering webhooks/CRDs, causing CrashLoopBackOff with `dial tcp 10.96.0.1:443: i/o timeout` errors. CNI components (e.g. Cilium) can also cause this if their eBPF dataplane hasn't finished programming pod-to-service routing when GitOps engines start. KSail performs a three-step cluster stability check before installing GitOps engines: (1) 5 consecutive successful API server health checks, (2) all kube-system DaemonSets ready, and (3) a short-lived busybox pod confirms TCP connectivity to the API server ClusterIP. If you see `cluster not stable after infrastructure installation` or `in-cluster API connectivity check failed`, check resources and optionally recreate with fewer components:

```shell
ksail workload get nodes
ksail workload get pods -A | grep -v Running
ksail cluster delete && ksail cluster create
```

### Flux/ArgoCD Not Reconciling
If changes don't appear after `ksail workload reconcile`, check status and logs:

```shell
ksail workload get pods -n flux-system   # Flux
ksail workload get pods -n argocd        # ArgoCD
ksail workload logs -n flux-system deployment/source-controller
ksail workload reconcile --timeout=5m
```

## Component Installation Issues
### Installation Failures and Timeouts

KSail retries transient Helm registry errors automatically (5 attempts, exponential backoff). For persistent failures, check resources with `docker stats`, verify connectivity with `curl -I https://ghcr.io`, then recreate the cluster: `ksail cluster delete && ksail cluster create`. On resource-constrained systems, increase Docker resource limits, skip optional components, or use K3s.
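The retry behavior described above can be sketched as a small shell helper. The 5 attempts and exponential backoff mirror the text; the 1-second starting delay and the helper name `retry` are our own assumptions:

```shell
# Sketch of retry with exponential backoff: up to 5 attempts, doubling the
# delay between failures. Attempt count mirrors the docs; the starting delay
# and the function name `retry` are illustrative assumptions.
retry() {
  attempts=5
  delay=1
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0            # success: stop retrying
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
      delay=$((delay * 2))      # exponential backoff
    fi
    i=$((i + 1))
  done
  echo "failed after $attempts attempts: $*" >&2
  return 1
}
```

For example, `retry curl -fsI https://ghcr.io` helps distinguish a transient blip (eventually succeeds) from a persistently unreachable registry (fails all 5 attempts).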
## Configuration Issues

### Invalid ksail.yaml

Validate `ksail.yaml` against the schema, or re-initialize with defaults: `ksail cluster init --name my-cluster --distribution Vanilla`
### Environment Variables Not Expanding

Ensure environment variables are set before running KSail. Verify with `echo $MY_TOKEN` before referencing `${MY_TOKEN}` in configuration.
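A defensive sketch of that check; the variable name `MY_TOKEN` and the helper name `require_var` are illustrative, not part of KSail:

```shell
# Sketch: fail fast when a variable referenced in ksail.yaml (e.g. ${MY_TOKEN})
# is unset or empty. Both names here are illustrative assumptions.
require_var() {
  eval "val=\${$1:-}"
  if [ -z "$val" ]; then
    echo "$1 is not set; export it before running ksail" >&2
    return 1
  fi
  echo "$1 is set"
}
```

Running `require_var MY_TOKEN && ksail cluster create` catches a missing variable before an empty value is silently expanded into the configuration.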
## LoadBalancer Issues

### LoadBalancer Service Stuck in Pending

If `kubectl get svc` shows `<pending>` under `EXTERNAL-IP`, verify LoadBalancer is enabled in `ksail.yaml` (reinitialize with `--load-balancer Enabled` if not) and check the controller for your distribution:

- Vanilla: `docker ps | grep ksail-cloud-provider-kind`
- Talos: `kubectl get pods -n metallb-system`
- Hetzner: `kubectl get pods -n kube-system | grep hcloud`
### Cannot Access LoadBalancer IP

If the connection fails despite an assigned external IP, ensure the application listens on `0.0.0.0` (not `127.0.0.1`). Debug with `kubectl logs -l app=my-app`, `kubectl describe svc my-app`, and `kubectl exec -it <pod-name> -- netstat -tlnp` to check listening ports.
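The three checks above can be bundled into one helper for repeated use. The function name `debug_lb` and its arguments are ours, and `my-app` remains an example label:

```shell
# Sketch: run the three LoadBalancer debugging checks in one go.
# The helper name `debug_lb` and its two arguments are illustrative.
debug_lb() {
  app=$1
  pod=$2
  kubectl logs -l "app=$app"
  kubectl describe svc "$app"
  kubectl exec "$pod" -- netstat -tlnp   # verify the app binds 0.0.0.0, not 127.0.0.1
}
```

Usage: `debug_lb my-app <pod-name>` (substitute a real pod name from `kubectl get pods`).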
### MetalLB IP Pool Exhausted

If new LoadBalancer services remain pending after several successful allocations, the MetalLB IP pool is exhausted. See the LoadBalancer Configuration Guide to expand the address range.
## Network Issues

### CNI Installation Failed

If pods are stuck in `ContainerCreating` with CNI errors, check the CNI pods with `ksail workload get pods -n kube-system -l k8s-app=cilium` (or `-l k8s-app=calico-node` for Calico). If they have failed, recreate the cluster: `ksail cluster init --cni Cilium && ksail cluster create`
## VCluster Issues

### Transient Startup Failures

KSail automatically retries transient VCluster startup failures (up to 5 attempts with a 5-second delay), including exit status 22/EINVAL, D-Bus errors, network transients, and GHCR pull failures. `Retrying vCluster create (attempt 2/5)...` messages are expected; no action is required.
If all retries fail, check Docker resource limits and D-Bus availability. See the VCluster Getting Started guide for details.
### kubectl Commands Fail After VCluster Creation

If `kubectl get nodes` returns connection errors immediately after creation, wait a few seconds; VCluster control planes need time to start. Verify the active context with `kubectl config current-context` and `ksail workload get nodes`.
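Rather than retrying by hand, the wait can be scripted. The helper name `wait_for_api` and the 30 × 2-second budget are our assumptions, not KSail behavior:

```shell
# Sketch: poll until the VCluster API server answers, up to ~60 s.
# The function name and the 30 x 2-second budget are illustrative.
wait_for_api() {
  i=1
  while [ "$i" -le 30 ]; do
    kubectl get nodes >/dev/null 2>&1 && { echo "API server ready"; return 0; }
    sleep 2
    i=$((i + 1))
  done
  echo "API server still unreachable after 60 s" >&2
  return 1
}
```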
## Hetzner Cloud Issues

- **HCLOUD_TOKEN not working**: Verify the token has read/write permissions (Hetzner Cloud Console → Security → API Tokens). Test with `hcloud server list` if the CLI is installed.
- **Talos ISO not found**: The default ISO ID may be outdated. Find the correct ID in the Hetzner Cloud Console under Images → ISOs.
## Getting More Help

Check GitHub Issues and Discussions. When reporting an issue, include your KSail version, OS, Docker version, `ksail.yaml`, error messages, and reproduction steps.