Hidden latency in the Linux network stack

nkurz · on April 7, 2016

Stray thoughts:

1) Any chance this bug would have manifested as "connection reset" errors when accessing HN? I exchanged email with Dan a couple months ago trying to figure out why about 10% of my requests were failing, but we never figured out the root cause before (after some weeks) the problem went away.

2) As others have pointed out, doubling the number of hash buckets seems like a bandaid. But other than the scolding comment, is there any reason not to go to an appropriately sized hash? If you know in advance that you are going to have 16K addresses (ie, not the use case the original code anticipated), it would seem beneficial to choose a data structure that has a fighting chance of providing good performance.

3) This seems like a wonderful argument for _against_ running your high performance DNS server on the same machine as your other services. Would containerization have helped here, possibly with each OS pinned to a set of cores? Is the cost of splitting it off onto a separate physical machine prohibitive? At the optimization level you are aiming for, "process isolation" seems like a pretty leaky abstraction.

4) Going farther down the path of reducing jitter, have you considered running a tickless (NOHZ_FULL) kernel? Perhaps you are already doing it, but quieting down the cores running your important services can make a significant difference. I've been exploring this direction, and have found it rewarding. Info on one way to measure is here: https://kernel.googlesource.com/pub/scm/linux/kernel/git/fre...

majke · on April 7, 2016

1) Not really. Connection resets have nothing to do here. Are you by any chance running on a slow network? We did some debugging of problems slow clients recently.

2) Correct. Increasing number of buckets is a bandaid. Improving the linux code for inet_lookup_listener is not a trivial task. But for some reason Linux did optimize similar function for UDP. Wonders of the network stack.

3) Not really, dns can coexist with other services. The bug was having many TCP listening sockets on the same port without a good reason really. There is no DDoS benefit in splitting receive queues for TCP.

Containers won't help, the packets go to the same instance of LHTABLE.

4) Scheduling jitter is a subject for separate story.

Florin_Andrei · on April 7, 2016

If the issue is in the kernel, containers won't solve it.

nkurz · on April 7, 2016

You are correct. As evidenced by the mention of "each OS", I mangled that sentence and wrote containerization where I should have said virtualization. I meant something like "While containerization wouldn't have helped here, would virtualization (perhaps with each OS pinned to a different set of cores) have helped?"

Florin_Andrei · on April 7, 2016

Yeah, that sounds better.

Either way, if it's a critical service, I'd rather have it running on hardware where there isn't much competition for resources, so not a whole lot of virtualization.

ryao · on April 7, 2016

The hash function includes the address of the struct net when mount namespaces are in use, so if the same service were spread among different containers (or more specifically, different mount namespaces), the hash collisions per bucket might be somewhat more uniform. It would likely be far better to patch the kernel than it would be to abuse mount namespaces to workaround this. That workaround would add complexity for a fairly limited gain that is more by accident than anything else.

ryao · on April 7, 2016

This should have been net means bases not mount names basis. In which case I'll be doesn't make much sense even as a horrible hack.

ryao · on April 8, 2016

That will be the last time that I try using dictation to compose a hacker news comment without reading it. That should have read: "This should have been net namespaces, not mount name spaces. In that case, it does not make sense even as a horrible hack".

ryao · on April 7, 2016

> 2) As others have pointed out, doubling the number of hash buckets seems like a bandaid. But other than the scolding comment, is there any reason not to go to an appropriately sized hash? If you know in advance that you are going to have 16K addresses (ie, not the use case the original code anticipated), it would seem beneficial to choose a data structure that has a fighting chance of providing good performance.

I am not familiar with the code, but it looks to me like having a separate hash table for address-bound sockets like you suggest would work.

I suggest using some sort of extensible hash table (possibly with a bounded maximum size so things do not become crazy) to avoid statically sizing it. That way it is not too big on systems where it is not often used and not too small on systems where it is heavily used. The type of extensible hash table should be picked for compatibility with RCU to avoid causing a regression in multicore scaling. A quick google search suggests that a split-ordered list might work here:

https://lwn.net/Articles/573431/

As for the hash function, I would suggest adding the IP address and a random value calculated at boot time to the sum calculated by the current hash function before it currently truncates, run that sum through a next() function from a fast PRNG and then truncate the result to whatever the extensible hash table wants at the given moment. As for potential PRNG next() function donors, the xorshift128+ at the following site is fairly nice:

http://xorshift.di.unimi.it/

The idea behind using a PRNG's next() function would be to avoid artifacts in how IP addresses are assigned that would cause abnormally many hash table collisions. The idea behind adding a random number to the sum before the hash function is to ensure that people could not pick IP address + port combinations that hash to the same location as part of an algorithmic complexity attack. The potential for an algorithmic complexity attack here is relatively minor given that the problem occurs in normal use, but fixing it is rather inexpensive, so someone might as well do it when changing this.

Anyway, this does not do anything for the wild card case, but barring gratuitous use of SO_REUSEPORT, the number of sockets should be be small. Cloudflare is using address-bound sockets, so it probably would not matter for them.

ryao · on April 7, 2016

By the way, the bandaid solution lowers the latency not just for each request, but also each queued request waiting for CPU time making the reduction in time culminative. As long as the bandaid solution keeps the maximum queue depth and average time spent processing low enough, it would probably suffice. If the spikes remain low under load tests, implementing it would be more of an exercise in seeing how much you can reduce CPU time in this code path for the sake of reducing it than it would be a useful improvement.

That is not meant to discount the fun that one can have when using overkill to eliminate the potential for historical issue to resurface in some new situation. It can be very fun, as long as one does not spend too much time on it. I have been known to do it on occasion:

https://github.com/zfsonlinux/spl/commit/0b43696e6676391e5be...

charliedevolve · on April 7, 2016

Why did increasing the size of the hash help? Wouldn't the %64 (or whatever the new value was) just send all the port 53 sockets into the same bucket again? It seems you'd need a different hash function that provides more uniformity.

kbenson · on April 7, 2016

For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, :53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.

I suspect that's the real fix. Now all those (16k) bound addresses aren't creating hash table entries, so other connections that happen to use a port that hashes to 21 (or 53 after enlarging the table) aren't being shoved into a hash bucket that starts with 16k entries already in it.

The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.

creshal · on April 7, 2016

> The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.

Yes. It just reduces the risk that they run into this problem again with a different port constellation.

stingraycharles · on April 7, 2016

A bit pedantic, but it doesn't necessarily reduce the risk, but rather the impact: using 64 buckets would only have half as many connections go into a bad bucket. This, however, does not in any way decrease the chance of the problem occuring again.

sinxoveretothex · on April 7, 2016

Using twice as many buckets, there will be half as many destination ports in the same bucket (65535 / 32 ≈ 2048, 65535 / 64 ≈ 1024), but since the "bad" connections described in the blogpost all use the same destination port, it won't change anything wrt that.

It does, however, reduce the overall impact when all connections are considered.

jcoffland · on April 7, 2016

There is only a reduction in hash collisions if the destination ports are fairly evenly distributed. I just don't see how this helps at all in the case described.

dhruvbird · on April 7, 2016

@kbenson Your explanation makes a lot more sense. Increasing the hash table size probably didn't affect the perf. significantly, but binding to the ANY ip did.

MBCook · on April 7, 2016

So are things 'bound to star' checked somewhere else in the path? I don't understand why the hash table wouldn't just have entries for each of the receiving connections still.

seabee · on April 7, 2016

There is only one listener bound to star, instead of 16k listeners for every IP. Thus, the hash bucket mapping to port 53 has only one entry instead of 16k.

It works out the same for the application: 1 fd or 16k fds doesn't really matter if you're using epoll, and that single fd can accept connections to any of those 16k IP addresses.

majke · on April 7, 2016

It would reduce the probability of collision. Sure, port 53 will always go to unlucky bucket. But port 16275 may or may not collide depending on thehash size.

charliedevolve · on April 7, 2016

Reduce the collisions how much, though? Are you using a port number == N*hashSize+53 for something else a lot? A collision is guaranteed for every connection to port 53, so isn't that a much bigger source of collisions? I think the hash size of 32 isn't the problem. It's the function.

majke · on April 7, 2016

I think I understand the confusion. The packets are hashed based on DESTINATION PORT.

Now, from a server point of view there are two types of connections: inbound and outbound. Our servers accept connections but they also establish connections, for example to your http origin hosts.

So from the point of view of our server the "colliding" packets will fit two categories: A) incoming packets to port 53 B) incoming packets to outbound connections which source port % 32 == 21.

For A) this is not that a big deal. DNS usually works over UDP, there are not _that_ many DNS queries done using TCP.

For B), since Linux choses source port incrementally, that means every 32'nd connection will possibly have some packets hitting the unhappy bucket.

Therefore increasing the hash size twice, reduces the chance of collision twice: now every 64th outbound connection will have some packets hitting the unhappy bucket.

charliedevolve · on April 7, 2016

Oh yes, I forgot that tcp/53 is mostly (if not entirely) for zone transfers.

majke · on April 7, 2016

:)

The full answer is: depending on which RFC you read :)

Initially the RFC's specified that you could only use TCP if you got UDP truncation _first_.

Nowadays that's relaxed but it's very vague when you should use TCP except for after UDP TR. For example Bind will try to connect over TCP if UDP fails.

Generally speaking most of the traffic goes over UDP, and sometimes, in undefined circumstances, some stuff may be requested over TCP. No hard rule.

bboreham · on April 8, 2016

Recalling that CloudFlare's business is dealing with malicious actors, you should assume that the caller has read all the RFCs and then deliberately disobeyed them.

xigency · on April 7, 2016

As others have answered, it doesn't help fix this problem in any way. I'm assuming the team doing this investigation wasn't willing to rewrite `__inet_lookup_listener`. I guess the quick fix became the concern at that point. (Compared to rewriting kernel modules.)

But yes, it seems like an unnecessary change.

jandrese · on April 7, 2016

I'm not sure inet_lookup_listener requires a rewrite for a general purpose kernel. They were hitting this problem because they were doing something unusual (starting thousands of listeners on the same port) and they found a pretty reasonable workaround.

In particular, a rewrite would have to make sure not to make the general case worse in an attempt to avoid this pathological situation.

majke · on April 7, 2016

Please read my other comment in this thread. https://news.ycombinator.com/item?id=11448987

Increasing the hash size doesn't fix the unhappy bucket, but it does reduce chance that packets will ever hit it. So yes, traversal of this bucket will be slow, but it will be hit less often.

jcoffland · on April 7, 2016

__inet_lookup_listener could be modified to use a two level hash that only kicked in for buckets which exceeded a certain threshold. The second level hash tables would xor the parts of the destination IP address modulo the hash table size. Special care would have to be taken for listeners on 0.0.0.0. This would solve the problem with negligible cost to the general case but at the expense of added complexity.

edwintorok · on April 7, 2016

In fact wouldn't it be simpler to just use an array with all 64k destination ports for the 1st level, and a hash based on destination IP for the 2nd level? This would only need 64k*sizeof(pointer) memory globally for the 1st level, which sounds reasonable for everything except low-memory embedded devices. Or does cache-locality matter here so much that its worth taking the hash-collision penalty on the 1st level?

For the 2nd level you could size the hash table appropriately since you always know the maximum number of IP addresses a host has.

Would be nice if this 2level array/hash was tunable from /proc or /sys.

jcoffland · on April 7, 2016

Increasing the hash table size would be bad for embedded systems and would not reduce the complexity for problems like the one described here. In general, hash table tuning is challenging but you shouldn't need to make the table significantly larger than some fraction of the total number of entries as long, as your hash function is effective. The problem here is that listeners are differentiated by both destination IP and port but the hash table only uses the port and then does a linear lookup by IP. As was demonstrated the hash function is not always effective and the list of listeners in a bucket can get quite long. Presumably the existing scheme was chosen because it is simple and makes it really easy to handle 0.0.0.0 listeners.

Using a binary tree at each bucket could also work well but you would have to rebalance the tree periodically if listeners were inserted in sorted order. A self balancing tree could be used instead but then again this adds complexity.

readams · on April 7, 2016

They might hash the source port also. Maybe they didn't mention this.

charliedevolve · on April 7, 2016

Mixing in the source port might be a good way to fix that hash function.

seabee · on April 7, 2016

The hash is used to find listening sockets. Source port doesn't help you there.

Destination IP is a reasonable addition to the hash function, however.

kazinator · on April 7, 2016

In the internet protocol suite, there is something called ARP. That will cause a small latency spike whenever an ARP cache entry expires.

The host wants to send a datagram to some IP address Y somewhere. It knows that the route for that IP goes through some local gateway with IP address X, so it must actually send the datagram to X, for which it needs the MAC. Normally, it knows the MAC of that gateway, because X is associated with the MAC in the ARP cache.

From time to time, the ARP cache entry expires. In that case, before the packet can be sent to the remote host Y, an ARP query must be performed: an ARP "who has X?" broadcast request is generated, to which the gateway will reply. Only then can that host send the packet to Y, through X.

This extra ARP exchange shouldn't take anywhere near 100 milliseconds, of course. But that is beyond the control of the host that is querying.

deegles · on April 7, 2016

I love these mysterious bug deep dives. I should start aggregating them somewhere...

ecma · on April 7, 2016

I'm not entirely convinced that increasing the size of LHTABLE solves anything. True, it may remove some collisions in the hash table but note that 63925 % 64 = 53. Given that the two slow ports listed seem to be arbitrary and assigned to customers, they're probably just a symptom of the overload on 53/UDP. I'm not suggesting they chose 64, that's unclear, but whatever they chose probably just shifts problem the elsewhere. Increasing it /would/ inherently reduce the frequency of the events though so you can call that a win.

A naïve solution would be to choose a bucket based on the destination port as well as the source port if one is available (e.g. TCP). This might help balance load affecting particular local ports since we can assume the source port for TCP will be random enough. However, it doesn't solve the problem - it'll just hide it. Random spikes in latency for connections to random customers? Sounds undesirable.

A reasonable solution might be to work out a way to map gateway 53/UDP to a diverse set of ports which are bound to rrdns processes on the boxes which currently have 16K IP addresses. For UDP packets, this would be possible by doing on-wire modifications to the transport header and recalculating any checksums. Perhaps that just shifts the burden though.

lilyball · on April 7, 2016

You can't include the source port in the hash, because this is a table of listening sockets, i.e. no connection has been established yet and the socket needs to see packets from ANY source as long as they go to the right destination.

You could suggest including the bound destination IP in the hash, but then you'd also need a separate hashtable for sockets that are bound to any IP (instead of being bound to a specific IP).

ecma · on April 8, 2016

Good catch. I did mean that mixing in the source would apply to the established table if it's constructed in the same way but doesn't do so already (which would surprise me now that I think about it properly).

I don't think that you'd need a separate table for star bound listeners if the IP is mixed in since you could just hash in 0.0.0.0 but you'd need to check both the real IP and the special value too which is a potentially damaging performance hit. It's probably done with just the destination port for a good reason.

caf · on April 8, 2016

You wouldn't need a separate hash table, you could first look up Hash(destip, destport) then if that fails to find a listener, look up Hash(0, destport).

Gratsby · on April 7, 2016

Why not look into changing the hash keys such that they are not bound to listening port or alternatively put a new lookup and hash in place for this specific purpose?

While bind to star works, it feels like you answered an operational concern but left the design consideration on the table.

forgotmypassw · on April 7, 2016

Even though I'm not using CloudFlare myself, I always enjoy reading their blog posts, these guys are amazing hackers.

alberth · on April 7, 2016

What about DragonflyBSD?

I'm curious is if Cloudflare has investigated using DrafonflyBSD, given that it has a lockless network stack.

caf · on April 8, 2016

This wasn't a locking issue.

shiftoutbox · on April 7, 2016

I think Matt Dillon would love to see that . Also I would love to see the analysis of their situation on a UNIX , BSD's and even Windows .

ishtu · on April 7, 2016

Can anyone suggest a good reading on differences in linux and freebsd network stacks?

lazyant · on April 7, 2016

here's a diagram for Linux http://www.linuxfoundation.org/images/1/1c/Network_data_flow...

pyvpx · on April 7, 2016

just the code, for anything recent and useful. I'm not aware of anything else, sadly.

Zenst · on April 8, 2016

I wonder how if this was IPv6 if this problem would still entail and if IPv6 has design aspects that would negate this issue?

Would the lookup be different or the design aspect that produces what many would class a edge case instance, one that as companies grow, is only going to become more common.

triplenineteen · on April 7, 2016

It seems like they are using a non-standard netstat switch:

  $ netstat -ep4ln --udp

Is that '4' a typo or something?

dice · on April 7, 2016

It's present in my netstat, from `net-tools 2.10-alpha`. Man page says it's shorthand for `--protocol=inet` (as opposed to `inet6`) which shows only IPv4 sockets.

Negitivefrags · on April 7, 2016

It means only list ipv4 sockets. It's pretty standard across many command line network tools.

lttlrck · on April 7, 2016

It would be helpful to show the improvement for the two changes made individually instead of lumping them together.

js2 · on April 7, 2016

What did you increase the size of LHTABLE to? I don't see that mentioned in the post.

majke · on April 7, 2016

In order to reduce the number of packets hitting the bad bucket. With LHTABLE of size 32 port 16257 will hit bad bucket. With other sizes port 16257 may not hit the naughty bucket.

nkurz · on April 7, 2016

I think the question was what number you used for the new LHTABLE, rather than the purpose of increasing it. That is, did you double it from 32 to 64? Choose a prime? Something else?

js2 · on April 7, 2016

Correct, that was my question. The post doesn't say what they chose for the LHTABLE size when they recompiled the kernel, only that it was increased from 32. BTW, according to the patch the article linked to, the size has to be a power of 2. http://patchwork.ozlabs.org/patch/79014/

wrigby · on April 8, 2016

That makes sense - since port numbers are unsigned integers, you can just use a bitwise AND rather than a more costly MOD. I haven't actually looked at the internals of the hash, so I'm just assuming it's simply BUCKET = DPORT & MASK.

known · on April 8, 2016

Can't we fix/configure it in /etc/sysctl.conf

wrigby · on April 8, 2016

That would actually be pretty cool, but the data would have to be redistributed across the new set of buckets when the parameter was changed. That could get really messy...