Fixing NAT Hairpinning in k3s with Tailscale

Services in my k3s cluster were painfully slow when accessed over Tailscale. Curling a service took 17s from a tailnet client and 60s from the controlplane.

problem

Source          Time  Notes
tailnet client  17s   Via Tailscale tunnel
controlplane    60s   NAT hairpin: exits and re-enters cluster

My setup: k3s cluster with ingress-nginx exposed to the Tailnet via a Tailscale LoadBalancer IP. External-DNS watches Ingress objects and creates A records in PiHole pointing all *.k3s.mydomain.com hostnames to this IP.

root cause

NAT hairpinning caused by DNS.

In-cluster pods use CoreDNS, which forwards external queries to PiHole (the upstream DNS). PiHole returns 100.x.x.x - the Tailscale LoadBalancer IP. So even traffic originating inside the cluster exits via Tailscale, traverses the tunnel, and re-enters the cluster to hit ingress-nginx.

broken path (NAT hairpin):
  pod → 100.x.x.x → tailscale tunnel → back into cluster → ingress-nginx → backend

correct path:
  pod → ingress-nginx ClusterIP → backend

The controlplane was worst at 60s because it had to NAT hairpin through its own Tailscale interface.

solution

Split-horizon DNS using CoreDNS’s rewrite plugin.

k3s supports a coredns-custom ConfigMap in kube-system. Keys ending in .override are imported into the main .:53 server block. I added a rewrite rule that intercepts *.k3s.mydomain.com queries and resolves them to the ingress-nginx ClusterIP directly, before they ever reach PiHole.

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  custom-zone.override: |
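    rewrite stop {
      name regex (.+)\.k3s\.mydomain\.com ingress-nginx-controller.ingress-nginx.svc.cluster.local
      # answer auto maps the response name back to whatever the client asked for.
      # An explicit answer name rule takes a literal replacement (with {1}-style
      # groups), so a regex on the right-hand side would not restore the subdomain.
      answer auto
    }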

how it works

  1. Pod queries app.k3s.mydomain.com
  2. The rewrite plugin rewrites the query to ingress-nginx-controller.ingress-nginx.svc.cluster.local
  3. The kubernetes plugin (already in the .:53 block) resolves it to the current ClusterIP
  4. The answer rule rewrites the name in the response back, so the client sees app.k3s.mydomain.com
  5. The HTTP Host header is still app.k3s.mydomain.com, so ingress-nginx routes correctly

No hardcoded IPs. If the ingress-nginx ClusterIP changes, the kubernetes plugin picks it up automatically.
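
To make the name mapping in steps 1 and 2 concrete, here is a small shell sketch of what the rewrite rule decides for a given query name. It only mimics the regex match; rewrite_name is a made-up helper for illustration, not something CoreDNS or k3s provides, and the real plugin also handles the response side.

```shell
#!/bin/sh
# Illustration only: mimics the query-name mapping of the CoreDNS rewrite rule.
# rewrite_name is a hypothetical helper, not part of CoreDNS or k3s.
TARGET="ingress-nginx-controller.ingress-nginx.svc.cluster.local"

rewrite_name() {
  # Names matching (.+)\.k3s\.mydomain\.com map to the Service name;
  # anything else passes through unchanged (and would go upstream to PiHole).
  if printf '%s\n' "$1" | grep -Eq '.+\.k3s\.mydomain\.com'; then
    printf '%s\n' "$TARGET"
  else
    printf '%s\n' "$1"
  fi
}

rewrite_name app.k3s.mydomain.com   # rewritten to the Service name
rewrite_name example.org            # left untouched
```

Every matching hostname collapses onto the same Service name, which is why the Host header in step 5 is what tells ingress-nginx which backend to use.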

why .override and not .server

Files with .server extension create a separate CoreDNS server block. That block doesn’t have the kubernetes plugin, so you’d need to either hardcode the ClusterIP (fragile) or forward queries back to kube-dns (extra hop). .override injects into the existing block that already has the kubernetes plugin loaded.

why not the template plugin

The template plugin can synthesize DNS responses, but it requires a literal IP address in the config. If the ingress-nginx Service is ever recreated, the ClusterIP changes and the config breaks. The rewrite approach resolves dynamically.

verification

# in-cluster: should return ingress-nginx ClusterIP, not 100.x.x.x
kubectl run dnstest --rm -it --image=busybox --restart=Never -- nslookup app.k3s.mydomain.com

# tailnet client: should still return 100.x.x.x
dig app.k3s.mydomain.com

# latency from controlplane
curl -o /dev/null -w "%{time_total}\n" -sk https://app.k3s.mydomain.com/
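
The first two checks boil down to "is the answer a tailnet address or not". Tailscale hands out addresses from 100.64.0.0/10, so a small helper can classify whatever nslookup or dig returns. This is a sketch: is_tailscale_ip is a hypothetical name, and the sample IPs are invented for the demo (10.43.x.x assumes the default k3s service CIDR).

```shell
#!/bin/sh
# is_tailscale_ip is a hypothetical helper: succeeds if the IPv4 address
# falls inside 100.64.0.0/10 (100.64.0.0 through 100.127.255.255).
is_tailscale_ip() {
  o1=${1%%.*}          # first octet
  rest=${1#*.}
  o2=${rest%%.*}       # second octet
  # Reject anything whose second octet is not a plain number.
  case "$o2" in ''|*[!0-9]*) return 1 ;; esac
  [ "$o1" = "100" ] && [ "$o2" -ge 64 ] && [ "$o2" -le 127 ]
}

# Sample addresses are made up for illustration.
for ip in 100.101.102.103 10.43.12.34; do
  if is_tailscale_ip "$ip"; then
    echo "$ip: tailnet address (would take the hairpin path in-cluster)"
  else
    echo "$ip: not a tailnet address"
  fi
done
```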

Latency from the controlplane dropped from 60s to 0.003s. The worker node, which doesn’t have Tailscale, was already at 0.04s since it couldn’t resolve the external hostname and was never affected.

The tailnet client is still at 17s. The CoreDNS rewrite only fixes in-cluster resolution; external clients still go through the Tailscale tunnel. That turned out to be a second, unrelated problem (see below).

the tailnet client: cilium eating tailscale0

Fixing the hairpin left the tailnet client untouched at 17s. I initially guessed DERP relay overhead, but that was wrong. Turns out it was my CNI.

Cilium, when its devices option is unset, auto-detects network interfaces and attaches its eBPF datapath to each one it picks up. The agent logs at startup told the whole story:

kubectl -n kube-system logs ds/cilium -c cilium-agent | grep "Devices changed"
msg="Devices changed"  devices="[ens18 tailscale0]"
msg="Setting IPv4"  device=tailscale0  gso_max_size=65536  gro_max_size=65536

Two things jumped out. Cilium had grabbed tailscale0 alongside the physical NIC ens18, and was applying BIG TCP to it: gso/gro=65536, i.e. 64 KB segmentation/receive offload. The Tailscale tunnel has a 1280-byte MTU. Pushing 64 KB offload onto a 1280-MTU interface causes segmentation and PMTU stalls, which explains the multi-second drag.
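
For a rough sense of scale, here is the arithmetic behind that mismatch. It is a back-of-the-envelope sketch: the 40-byte header figure is an assumption (plain IPv4 + TCP, no options), and it ignores how the kernel actually accounts for headers inside gso_max_size.

```shell
#!/bin/sh
MTU=1280        # Tailscale tunnel MTU, as stated above
GSO_MAX=65536   # gso_max_size from the Cilium log line
HDR=40          # assumption: 20-byte IPv4 + 20-byte TCP headers, no options

MSS=$((MTU - HDR))                          # payload bytes per wire packet
SEGMENTS=$(( (GSO_MAX + MSS - 1) / MSS ))   # ceiling division

echo "MSS=$MSS bytes, one $GSO_MAX-byte buffer -> $SEGMENTS packets"
# prints: MSS=1240 bytes, one 65536-byte buffer -> 53 packets
```

So a single offloaded buffer has to be cut into roughly 53 tunnel-sized packets somewhere along the way.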

One thing worth knowing: kubectl logs ds/cilium only returns logs from a single pod of the DaemonSet. To check every node I looped over them:

for p in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  echo "$p"; kubectl -n kube-system logs "$p" -c cilium-agent | grep "Devices changed"
done

Only the node running Tailscale had [ens18 tailscale0]; the others just showed [ens18]. The symptom tracked the Tailscale-hosting node exactly.

solution

Pin Cilium’s datapath to the physical NIC so it never touches tailscale0:

# cilium helm values
devices: ens18

Got the interface name straight from the same logs (device=ens18 in the node-address and direct-routing lines), so no shell access needed. Pod traffic still goes through eBPF; only tailscale0 host traffic is left on the native kernel stack. This matches Tailscale’s documented Cilium guidance and tailscale/tailscale#12393.

the gotcha that cost me a round-trip

Setting devices updates the cilium-config ConfigMap, but the running agents don’t reload it; they read it once at startup. My first attempt looked like nothing had changed because the agents had last restarted a few minutes before the new config synced. The fix was correct; the agents just needed a restart after it landed. Comparing pod start time against ConfigMap update time made it obvious:

kubectl -n kube-system get cm cilium-config -o jsonpath='{.metadata.managedFields[*].time}'      # config landed at 18:05
kubectl -n kube-system get pods -l k8s-app=cilium -o custom-columns='POD:.metadata.name,START:.status.startTime'  # pods started 18:02
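
The comparison itself is easy to script. A minimal sketch, assuming both values are the UTC timestamps kubectl prints (same fixed format, so string order equals time order): agents_are_stale is a hypothetical helper, and the date in the example is invented since only the times of day matter here. The managedFields query can print several timestamps; pass the newest one.

```shell
#!/bin/sh
# agents_are_stale is a hypothetical helper.
# $1 = pod start time, $2 = ConfigMap update time, both like 2024-01-01T18:02:00Z.
# Succeeds when the pod started before the config landed, i.e. it is still
# running with the old settings and needs a restart.
agents_are_stale() {
  # The values are not numeric, so awk compares them as strings.
  awk -v pod="$1" -v cm="$2" 'BEGIN { exit !(pod < cm) }'
}

# Example date is made up; the times mirror the ones above.
if agents_are_stale "2024-01-01T18:02:00Z" "2024-01-01T18:05:00Z"; then
  echo "agents predate the config: restart needed"
else
  echo "agents started after the config landed"
fi
```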

A kubectl -n kube-system rollout restart ds/cilium (briefly disruptive since it’s the CNI) made it stick. After that, every node reported devices=[ens18] and BIG TCP was only applied to ens18.

result

Source                         Before              After
tailnet client (3.2 MB asset)  17s                 0.31s
cilium devices                 [ens18 tailscale0]  [ens18]

~55x improvement, finally in line with in-cluster speeds. Two separate bugs (DNS hairpinning and Cilium grabbing the tunnel interface) wearing the same “slow over Tailscale” costume.
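
For anyone who wants to check the numbers, the speedup and the implied throughput fall straight out of the table. The helper names speedup and throughput are made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helpers for the arithmetic behind the results table.
speedup()    { awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'; }
throughput() { awk -v mb="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", mb / secs }'; }

echo "speedup: $(speedup 17 0.31)x"              # prints: speedup: 54.8x
echo "before:  $(throughput 3.2 17) MB/s"        # prints: before:  0.19 MB/s
echo "after:   $(throughput 3.2 0.31) MB/s"      # prints: after:   10.32 MB/s
```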

rollback

CoreDNS reloads its config when the ConfigMap changes, so rolling back the DNS rewrite needs no restart (allow a minute or so for the change to propagate):

kubectl delete cm coredns-custom -n kube-system