Fixing NAT Hairpinning in k3s with Tailscale
Services in my k3s cluster were painfully slow when accessed over Tailscale. Curling a service took 17s from a tailnet client and 60s from the controlplane.
problem
| Source | Time | Notes |
|---|---|---|
| tailnet client | 17s | Via Tailscale tunnel |
| controlplane | 60s | NAT hairpin: exits and re-enters cluster |
My setup: a k3s cluster with ingress-nginx exposed to the tailnet via a Tailscale LoadBalancer IP. External-DNS watches Ingress objects and creates A records in PiHole pointing all *.k3s.mydomain.com hostnames to this IP.
root cause
NAT hairpinning caused by DNS.
In-cluster pods use CoreDNS, which forwards external queries to PiHole (the upstream DNS). PiHole returns 100.x.x.x - the Tailscale LoadBalancer IP. So even traffic originating inside the cluster exits via Tailscale, traverses the tunnel, and re-enters the cluster to hit ingress-nginx.
broken path (NAT hairpin):
pod → 100.x.x.x → tailscale tunnel → back into cluster → ingress-nginx → backend
correct path:
pod → ingress-nginx ClusterIP → backend
The controlplane was worst at 60s because it had to NAT hairpin through its own Tailscale interface.
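A quick way to spot the hairpin from inside a pod is to check whether the resolved address is a tailnet address. Tailscale hands out node IPs from the CGNAT range 100.64.0.0/10, so a minimal sketch of that check could look like this. The helper names is_tailscale_ip and classify_answer are made up for illustration, and 10.43.12.34 is just a sample ClusterIP-style address, not one from my cluster.

```shell
#!/bin/sh
# Return 0 if the IPv4 address is inside 100.64.0.0/10 (the range Tailscale
# assigns node addresses from), 1 otherwise.
# is_tailscale_ip is a hypothetical helper name, not part of any tool.
is_tailscale_ip() {
  case "$1" in *.*.*.*) ;; *) return 1 ;; esac   # must look dotted-decimal
  first=${1%%.*}            # first octet
  rest=${1#*.}
  second=${rest%%.*}        # second octet
  case "$first" in ''|*[!0-9]*) return 1 ;; esac
  case "$second" in ''|*[!0-9]*) return 1 ;; esac
  # 100.64.0.0/10 covers 100.64.x.x through 100.127.x.x
  [ "$first" -eq 100 ] && [ "$second" -ge 64 ] && [ "$second" -le 127 ]
}

# Label a DNS answer as seen from inside a pod.
classify_answer() {
  if is_tailscale_ip "$1"; then
    echo "hairpin: $1 is a tailnet address"
  else
    echo "direct: $1 is not a tailnet address"
  fi
}

classify_answer 100.101.102.103   # sample tailnet-style address
classify_answer 10.43.12.34       # sample ClusterIP-style address
```

If an in-cluster lookup of one of the *.k3s.mydomain.com names lands in the first category, traffic is taking the broken path above.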
solution
Split-horizon DNS using CoreDNS’s rewrite plugin.
k3s supports a coredns-custom ConfigMap in kube-system. Files with an .override extension are injected into the main .:53 server block. I added a rewrite rule that intercepts *.k3s.mydomain.com queries and resolves them to the ingress-nginx ClusterIP directly, before they ever reach PiHole.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  custom-zone.override: |
    rewrite stop {
      name regex (.+)\.k3s\.mydomain\.com ingress-nginx-controller.ingress-nginx.svc.cluster.local
      answer name ingress-nginx-controller\.ingress-nginx\.svc\.cluster\.local (.+)\.k3s\.mydomain\.com
    }
how it works
- Pod queries app.k3s.mydomain.com
- The rewrite plugin rewrites the query to ingress-nginx-controller.ingress-nginx.svc.cluster.local
- The kubernetes plugin (already in the .:53 block) resolves it to the current ClusterIP
- The answer name directive rewrites the response back so the client sees app.k3s.mydomain.com
- The HTTP Host header is still app.k3s.mydomain.com, so ingress-nginx routes correctly
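The name-matching step in the list above can be approximated outside CoreDNS with sed. This is only a sketch: rewrite_name is a hypothetical helper, the pattern is anchored here for clarity, and CoreDNS uses Go regular expressions, so edge cases (trailing dot, case) may behave differently.

```shell
#!/bin/sh
# Approximate the `name regex` match of the rewrite rule with sed -E.
# rewrite_name is a hypothetical helper; this mimics only the matching,
# not CoreDNS itself.
TARGET=ingress-nginx-controller.ingress-nginx.svc.cluster.local

rewrite_name() {
  # Names under k3s.mydomain.com map to the ingress Service name;
  # anything else passes through unchanged.
  printf '%s\n' "$1" | sed -E "s/^.+\.k3s\.mydomain\.com\$/$TARGET/"
}

rewrite_name app.k3s.mydomain.com       # rewritten
rewrite_name grafana.k3s.mydomain.com   # rewritten to the same target
rewrite_name example.org                # untouched, goes upstream as before
rewrite_name k3s.mydomain.com           # untouched: (.+) needs a subdomain
```

Every matching hostname collapses to the same Service name, which is why the Host header has to carry the original name for ingress-nginx to pick the right backend.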
No hardcoded IPs. If the ingress-nginx ClusterIP changes, the kubernetes plugin picks it up automatically.
why .override and not .server
Files with a .server extension create a separate CoreDNS server block. That block doesn’t have the kubernetes plugin, so you’d need to either hardcode the ClusterIP (fragile) or forward queries back to kube-dns (extra hop). An .override file is injected into the existing block, which already has the kubernetes plugin loaded.
why not the template plugin
The template plugin can synthesize DNS responses, but it requires a literal IP address in the config. If the ingress-nginx Service is ever recreated, the ClusterIP changes and the config breaks. The rewrite approach resolves dynamically.
verification
# in-cluster: should return ingress-nginx ClusterIP, not 100.x.x.x
kubectl run dnstest --rm -it --image=busybox --restart=Never -- nslookup app.k3s.mydomain.com
# tailnet client: should still return 100.x.x.x
dig app.k3s.mydomain.com
# latency from controlplane
curl -o /dev/null -w "%{time_total}\n" -sk https://app.k3s.mydomain.com/
Latency from the controlplane dropped from 60s to 0.003s. The worker node, which doesn’t have Tailscale, was already at 0.04s since it couldn’t resolve the external hostname and was never affected.
The tailnet client is still at 17s. The CoreDNS rewrite only fixes in-cluster resolution; external clients still go through the Tailscale tunnel. That turned out to be a second, unrelated problem (see below).
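When a request is slow, it helps to know which phase is eating the time, since a DNS or connect delay points somewhere different than a stalled transfer. curl exposes cumulative timers through its write-out variables (time_namelookup, time_connect, time_starttransfer, time_total), and a small helper can turn them into per-phase numbers. A sketch, with phase_breakdown as a hypothetical name and made-up sample values:

```shell
#!/bin/sh
# Split curl timings into phases. Input: one line with four numbers, the
# values of curl's time_namelookup, time_connect, time_starttransfer and
# time_total write-out variables (all cumulative, in seconds).
# phase_breakdown is a hypothetical helper name.
phase_breakdown() {
  awk '{
    # "wait" covers TLS setup plus the time until the first response byte
    printf "dns=%.3fs connect=%.3fs wait=%.3fs transfer=%.3fs\n",
      $1, $2 - $1, $3 - $2, $4 - $3
  }'
}

# Made-up sample numbers, not measurements from this cluster.
echo "0.004 0.020 0.150 17.000" | phase_breakdown

# Against a real endpoint it would be fed like this:
# curl -o /dev/null -sk \
#   -w '%{time_namelookup} %{time_connect} %{time_starttransfer} %{time_total}\n' \
#   https://app.k3s.mydomain.com/ | phase_breakdown
```

A large transfer share with small dns and connect values suggests the path itself is the problem rather than name resolution.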
the tailnet client: cilium eating tailscale0
Fixing the hairpin left the tailnet client untouched at 17s. I initially guessed DERP relay overhead, but that was wrong. Turns out it was my CNI.
Cilium, when its devices option is unset, auto-detects the host’s network interfaces and attaches its eBPF datapath to each one it selects. The agent logs at startup told the whole story:
kubectl -n kube-system logs ds/cilium -c cilium-agent | grep "Devices changed"
msg="Devices changed" devices="[ens18 tailscale0]"
msg="Setting IPv4" device=tailscale0 gso_max_size=65536 gro_max_size=65536
Two things jumped out. Cilium had grabbed tailscale0 alongside the physical NIC ens18, and was applying BIG TCP to it: gso/gro=65536, i.e. 64 KB segmentation/receive offload. The Tailscale tunnel has a 1280-byte MTU. Pushing 64 KB offload onto a 1280-MTU interface causes segmentation and PMTU stalls, which explains the multi-second drag.
One thing worth knowing: kubectl logs ds/cilium only returns logs from a single pod of the DaemonSet. To check every node I looped over them:
for p in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  echo "$p"
  kubectl -n kube-system logs "$p" -c cilium-agent | grep "Devices changed"
done
Only the node running Tailscale had [ens18 tailscale0]; the others just showed [ens18]. The symptom tracked the Tailscale-hosting node exactly.
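Eyeballing the grep output works for three nodes, but the same check can be scripted. The sketch below reads agent log lines on stdin and flags tailscale0 in the most recent device list. check_devices is a hypothetical helper, and the log format is assumed to match the lines quoted above; it may differ between Cilium versions.

```shell
#!/bin/sh
# Read cilium-agent log lines on stdin and report whether the most recent
# "Devices changed" entry lists tailscale0.
# check_devices is a hypothetical helper; the log format is assumed from
# the lines quoted in this post.
check_devices() {
  awk '
    /Devices changed/ && match($0, /devices="\[[^]]*\]"/) {
      # keep only the text between the square brackets
      last = substr($0, RSTART + 10, RLENGTH - 12)
    }
    END {
      if (last == "") { print "no device list found"; exit 2 }
      n = split(last, dev, " ")
      for (i = 1; i <= n; i++)
        if (dev[i] == "tailscale0") { print "BAD: " last; exit 1 }
      print "OK: " last
    }
  '
}

# Sample lines in the format shown above; "|| true" keeps the demo going
# because check_devices exits non-zero when it finds tailscale0.
printf '%s\n' 'msg="Devices changed" devices="[ens18 tailscale0]"' | check_devices || true
printf '%s\n' 'msg="Devices changed" devices="[ens18]"' | check_devices || true

# Per node: kubectl -n kube-system logs "$p" -c cilium-agent | check_devices
```

The non-zero exit status makes it easy to drop into the per-pod loop and fail fast on the first node that still has the tunnel interface attached.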
solution
Pin Cilium’s datapath to the physical NIC so it never touches tailscale0:
# cilium helm values
devices: ens18
Got the interface name straight from the same logs (device=ens18 in the node-address and direct-routing lines), so no shell access needed. Pod traffic still goes through eBPF; only tailscale0 host traffic is left on the native kernel stack. This matches Tailscale’s documented Cilium guidance and tailscale/tailscale#12393.
the gotcha that cost me a round-trip
Setting devices updates the cilium-config ConfigMap, but the running agents don’t auto-reload it; they read it once at startup. My first attempt looked like nothing had changed because the agents had last restarted a few minutes before the new config synced. The fix was correct; the agents just needed a restart after it landed. Comparing pod start time against ConfigMap update time made it obvious:
kubectl -n kube-system get cm cilium-config -o jsonpath='{.metadata.managedFields[*].time}' # config landed at 18:05
kubectl -n kube-system get pods -l k8s-app=cilium -o custom-columns='POD:.metadata.name,START:.status.startTime' # pods started 18:02
A kubectl -n kube-system rollout restart ds/cilium (briefly disruptive since it’s the CNI) made it stick. After that, every node reported devices=[ens18] and BIG TCP was only applied to ens18.
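The start-time versus update-time comparison is easy to automate. Kubernetes prints both values as RFC 3339 UTC timestamps with the same fixed width, so plain string order is chronological order. A sketch under that assumption, with agent_is_stale as a hypothetical helper and invented dates that mirror the 18:02 / 18:05 case:

```shell
#!/bin/sh
# Return 0 when a pod started before the ConfigMap was last written, i.e.
# the agent is still running with the old settings.
# Both arguments are RFC 3339 UTC timestamps as printed by kubectl
# (e.g. 2024-05-01T18:02:00Z); in that fixed-width form, string order is
# chronological order. agent_is_stale is a hypothetical helper name.
agent_is_stale() {
  pod_start=$1
  config_time=$2
  # Appending "" forces awk to compare the values as strings.
  awk -v a="$pod_start" -v b="$config_time" \
    'BEGIN { exit !((a "") < (b "")) }'
}

# Invented timestamps mirroring the 18:02 / 18:05 case.
if agent_is_stale 2024-05-01T18:02:00Z 2024-05-01T18:05:00Z; then
  echo "agent predates config: restart needed"
fi
```

The managedFields query can return several timestamps when more than one field manager has written the ConfigMap, so the latest one is the value to compare against.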
result
| Source | Before | After |
|---|---|---|
| tailnet client (3.2 MB asset) | 17s | 0.31s |
| cilium devices | [ens18 tailscale0] | [ens18] |
~55x improvement, finally in line with in-cluster speeds. Two separate bugs (DNS hairpinning and Cilium grabbing the tunnel interface) wearing the same “slow over Tailscale” costume.
rollback
CoreDNS auto-reloads when the ConfigMap changes, so rolling back the DNS rewrite needs no restart, just a delete:
kubectl delete cm coredns-custom -n kube-system