Improving Tailscale via Apple’s open source

By Mihai Parparita

The ability to peek under the hood of the operating system can be a powerful tool for debugging. Spelunking Apple’s Open Source, a recent blog post by Daniel Jalkut, struck a very familiar chord with me, reminding me of the time I spent poring over WebKit internals at my previous employer. I’ve been able to continue that pattern at Tailscale, now more focused on Darwin (the operating system at the core of macOS and iOS), its kernel and its userland tools.

Just recently, Apple's open source came in handy when debugging an issue for the Tailscale 1.36 release. We were investigating two network-interface related bugs – one where Tailscale traffic was looping on itself (instead of being passed through to the actual physical network connection) and another where Tailscale was sending traffic over the phone’s cellular interface (even though Wi-Fi was available). The fix for both of these involved more consistently “binding” Tailscale’s network requests to the active network interface (and re-binding them when it changes).
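
As a rough illustration of what “binding” means on Darwin, a socket can be pinned to an interface index with the IP_BOUND_IF (IPv4) or IPV6_BOUND_IF (IPv6) socket options. The Go sketch below is not Tailscale’s actual code: the helper names boundIfOption and dialerBoundTo are hypothetical, and the option numbers are hard-coded from Darwin’s headers so that the file compiles on other platforms too.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"runtime"
	"strings"
	"syscall"
)

// Socket option numbers from Darwin's <netinet/in.h> and <netinet6/in6.h>.
// They are hard-coded (rather than taken from the syscall package) so that
// this sketch also compiles on non-Darwin platforms.
const (
	ipBoundIF   = 25  // IP_BOUND_IF
	ipv6BoundIF = 125 // IPV6_BOUND_IF
)

// boundIfOption (hypothetical helper) picks the socket level and option for
// the network name that net.Dialer passes to its Control hook, such as
// "tcp4" or "udp6".
func boundIfOption(network string) (level, opt int, err error) {
	switch {
	case strings.HasSuffix(network, "4"):
		return syscall.IPPROTO_IP, ipBoundIF, nil
	case strings.HasSuffix(network, "6"):
		return syscall.IPPROTO_IPV6, ipv6BoundIF, nil
	}
	return 0, 0, fmt.Errorf("unsupported network %q", network)
}

// dialerBoundTo (hypothetical helper) returns a dialer whose sockets are
// bound to the interface with the given index before they connect.
func dialerBoundTo(ifIndex int) *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			// The option numbers above mean something else on other systems.
			if runtime.GOOS != "darwin" && runtime.GOOS != "ios" {
				return errors.New("IP_BOUND_IF is only available on Darwin")
			}
			level, opt, err := boundIfOption(network)
			if err != nil {
				return err
			}
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = syscall.SetsockoptInt(int(fd), level, opt, ifIndex)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
}

func main() {
	level, opt, err := boundIfOption("tcp4")
	fmt.Println(level, opt, err)
	_ = dialerBoundTo(1) // the index 1 is a placeholder, not a real choice
}
```

Re-binding when the active interface changes would then amount to building new connections with a dialer for the new index; how that change is detected is outside the scope of this sketch.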

Normally, figuring out the active network interface is pretty straightforward: it is the interface associated with the default route. However, when Tailscale is configured to use an exit node, it becomes the default route, thus putting us back in danger of looping traffic. We want to know what the interface would be if Tailscale were not present (but without actually disabling it, since that is disruptive). I initially had an approach that lived in our closed-source repo and relied on private APIs (about which I should probably plead the Fifth instead of discussing further). Worse yet, we started to get reports that users were having issues because we were sometimes detecting the wrong interface.

Brad mentioned we should see how the ifconfig output changes in various states: when Tailscale is the default route, when the default physical interface changes (e.g. from Ethernet to Wi-Fi), and combinations of the two. While the default output didn’t have anything interesting, the more verbose output (via the -v flag) had a clue: an “effective interface” field appeared for the virtual Tailscale utunN interface, with the name of the actual physical interface (en0, en10, etc.) that should be handling the traffic.


utun5: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1280 index 34
	eflags=5002080<TXSTART,NOAUTOIPV6LL,ECN_ENABLE,CHANNEL_DRV>
	inet 100.118.111.67 --> 100.118.111.67 netmask 0xffffffff
	...
	effective interface: en4
	...
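
To compare states like these without eyeballing the output, the lines of interest can be pulled out programmatically. The following is a minimal Go sketch (the function name effectiveInterfaces is hypothetical) that assumes `ifconfig -v` output shaped like the excerpt above: each interface block starts with a `name: flags=...` header, and the delegate appears on a line beginning with `effective interface:`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// effectiveInterfaces (hypothetical helper) scans `ifconfig -v` output and
// returns a map from interface name to its reported effective interface.
// Interfaces without an "effective interface" line are omitted.
func effectiveInterfaces(output string) map[string]string {
	const prefix = "effective interface: "
	result := make(map[string]string)
	current := ""
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// A header such as "utun5: flags=8051<...> mtu 1280" starts a new block.
		if i := strings.Index(line, ": flags="); i > 0 {
			current = line[:i]
			continue
		}
		if current != "" && strings.HasPrefix(line, prefix) {
			result[current] = strings.TrimSpace(strings.TrimPrefix(line, prefix))
		}
	}
	return result
}

func main() {
	sample := "utun5: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1280 index 34\n" +
		"\tinet 100.118.111.67 --> 100.118.111.67 netmask 0xffffffff\n" +
		"\teffective interface: en4\n"
	for name, eff := range effectiveInterfaces(sample) {
		fmt.Printf("%s -> %s\n", name, eff)
	}
}
```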

ifconfig is part of the code that Apple open-sources, and a quick search through it turned up the implementation of that line:


if (ioctl(s, SIOCGIFDELEGATE, &ifr) != -1 && ifr.ifr_delegated) {
	char delegatedif[IFNAMSIZ+1];
	if (if_indextoname(ifr.ifr_delegated, delegatedif) != NULL)
		printf("\teffective interface: %s\n", delegatedif);
}

ioctl is a kind of Swiss Army knife system call; in this particular case, the SIOCGIFDELEGATE request returns the underlying physical interface. The Darwin Networking chapter of the *OS Internals book has more details.
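
To make the request less of a magic number, here is a small Go sketch of how a BSD-style ioctl request code is put together. It mirrors the `_IOWR` encoding from `<sys/ioccom.h>` and assumes that the xnu sources define SIOCGIFDELEGATE as `_IOWR('i', 157, struct ifreq)` with a 32-byte `struct ifreq`; both assumptions are worth checking against the headers for the release you care about. The function name iowr is just for illustration.

```go
package main

import "fmt"

// Constants from the BSD ioctl encoding in <sys/ioccom.h>.
const (
	iocInOut    = 0xc0000000 // IOC_INOUT: parameters are copied in and out
	iocParmMask = 0x1fff     // IOCPARM_MASK: 13 bits for the parameter length
)

// iowr mirrors the _IOWR(group, num, type) macro, taking the size of the
// parameter type in bytes instead of the type itself.
func iowr(group byte, num, size uint32) uint32 {
	return iocInOut | (size&iocParmMask)<<16 | uint32(group)<<8 | num
}

func main() {
	// Assumption: SIOCGIFDELEGATE is _IOWR('i', 157, struct ifreq) and
	// sizeof(struct ifreq) is 32 on Darwin.
	fmt.Printf("SIOCGIFDELEGATE = %#x\n", iowr('i', 157, 32))
}
```

The request code packs the direction, the parameter size, a group character and a command number into one 32-bit value, which is why one system call can serve so many unrelated purposes.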

Now that we knew what ifconfig did, it was a matter of trying to replicate it in Tailscale’s client. Go ends up using various ioctls to implement standard library functionality on Darwin (e.g. to get the MTU associated with an interface), but it didn’t happen to have a wrapper for SIOCGIFDELEGATE. Luckily it was not too much boilerplate to do the equivalent within Tailscale’s code.
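
A minimal sketch of that boilerplate might look like the following. It is not necessarily the code that shipped: ifreqDelegated and delegatedInterfaceIndex are hypothetical names, the struct layout is assumed to match the 32-byte `struct ifreq` in xnu’s bsd/net/if.h, and the request number is hard-coded on the assumption that SIOCGIFDELEGATE is `_IOWR('i', 157, struct ifreq)`. The call is only meaningful on Darwin, so the sketch refuses to issue it elsewhere.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"runtime"
	"syscall"
	"unsafe"
)

// Assumed value of SIOCGIFDELEGATE: _IOWR('i', 157, struct ifreq).
const siocgifdelegate = 0xc020699d

const ifNameSize = 16 // IFNAMSIZ

// ifreqDelegated mirrors the part of struct ifreq that this request uses:
// the interface name, then a union that is read here as the delegated
// interface index. The padding keeps the total size at 32 bytes.
type ifreqDelegated struct {
	name      [ifNameSize]byte
	delegated uint32
	_         [12]byte
}

// ifreqSize lets callers check the layout against the size in the request code.
const ifreqSize = unsafe.Sizeof(ifreqDelegated{})

// delegatedInterfaceIndex (hypothetical helper) returns the index of the
// interface that the named interface delegates to, or 0 if there is none.
func delegatedInterfaceIndex(name string) (int, error) {
	if len(name) >= ifNameSize {
		return 0, fmt.Errorf("interface name %q is too long", name)
	}
	if runtime.GOOS != "darwin" && runtime.GOOS != "ios" {
		return 0, errors.New("SIOCGIFDELEGATE is only available on Darwin")
	}
	// Any socket works as a handle for interface ioctls.
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, 0)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(fd)

	var ifr ifreqDelegated
	copy(ifr.name[:], name)
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, uintptr(fd),
		uintptr(siocgifdelegate), uintptr(unsafe.Pointer(&ifr)))
	if errno != 0 {
		return 0, errno
	}
	return int(ifr.delegated), nil
}

func main() {
	name := "utun5" // interface from the ifconfig excerpt; pass another as an argument
	if len(os.Args) > 1 {
		name = os.Args[1]
	}
	idx, err := delegatedInterfaceIndex(name)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if idx == 0 {
		fmt.Println(name, "has no delegated interface")
		return
	}
	if ifc, err := net.InterfaceByIndex(idx); err == nil {
		fmt.Println("effective interface:", ifc.Name)
	}
}
```

Returning the index rather than the name keeps the helper close to what the kernel reports, and leaves the lookup (here via net.InterfaceByIndex) to the caller.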

I ran the code on my MacBook and it worked! But what about iOS? ifconfig (and the Darwin userland tools in general) is not available on iOS (except for creative hacks), and the iOS sandbox may be more restrictive when it comes to which ioctls it allows. I was betting on this being a shared implementation on Apple’s side, and thus that it worked on all of their platforms. The moment of truth came a few minutes later when I tried it on my iPhone, and luckily it worked there too.

The final implementation ended up being pretty concise, and it was satisfying to have it in an unstable build the next day and have users confirm that the problem was fixed.