A random thing I ran into with the defaults (Ubuntu Linux):
- net.ipv4.tcp_rmem ~ 6MB
- net.core.rmem_max ~ 1MB
So the tcp_rmem value wins by default, meaning that the TCP receive window for a vanilla TCP socket actually goes up to 6MB if needed (in reality ~3MB because of the halving, but let's ignore that for now since it's a constant).
But if I "setsockopt SO_RCVBUF" in a user-space application, I'm actually capped at a maximum of 1MB, even though I already have 6MB. If I try to reduce it from 6MB to e.g. 4MB, it will result in 1MB. This seems very strange. (Perhaps I'm holding it wrong?)
(Same applies to SO_SNDBUF/wmem...)
To me, it seems like Linux is confused about the precedence order of these options. Why not have core.rmem_max be larger and the authoritative directive? Is there some historical reason for this?
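For anyone who wants to see the clamping for themselves, here's a minimal sketch (untested, values illustrative): the kernel clamps the SO_RCVBUF request to net.core.rmem_max and then doubles it for bookkeeping overhead, so asking for 4MB with rmem_max at ~1MB hands back ~2MB. Calling setsockopt(SO_RCVBUF) at all also switches off receive-buffer auto-tuning for that socket.

    /* Sketch: request a 4MB receive buffer, then read back what the
     * kernel actually granted. With net.core.rmem_max ~1MB the request
     * is clamped to ~1MB and then doubled, so expect roughly 2MB here. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int requested = 4 * 1024 * 1024;            /* ask for 4MB */
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

        int granted = 0;
        socklen_t len = sizeof(granted);
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
        printf("requested %d, kernel granted %d\n", requested, granted);

        close(fd);
        return 0;
    }

(SO_RCVBUFFORCE is the escape hatch that ignores rmem_max, but it needs CAP_NET_ADMIN.)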
If you want to limit the amount of excess buffered data, you can lower TCP_NOTSENT_LOWAT instead, which caps the amount that is buffered beyond what's needed for the BDP.
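(For reference, that's a per-socket setsockopt - rough sketch below, threshold illustrative; there's also a system-wide net.ipv4.tcp_notsent_lowat sysctl.)

    /* Sketch: cap how much not-yet-sent data may sit in the socket
     * buffer beyond what the congestion window needs. 128KB is just
     * an example threshold. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int limit_unsent(int fd) {
        int lowat = 128 * 1024;
        return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
    }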
1. While your context about auto-tuning is accurate and valuable, it doesn't really address the fundamental strangeness that the parent post is commenting on: it's still strange that it can auto-tune to a higher value than you can manually tune it to.
2. It's always valuable to provide further references, but I'd guess that down-voters found the "It's pretty clearly documented" phrasing a little condescending? Perhaps "See the docs at [] for more information."?
3. "Please don't comment about the voting on comments. It never does any good, and it makes boring reading."
Their criticism was accurate and well-intentioned. Getting downvoted not for the content but perhaps for poor phrasing is perfectly normal. Complaining at all about the votes your internet comment gets is asinine.
It's not asinine to complain that for no good reason a perfectly good technical reference written for the benefit of all readers was being grayed out (at the time) via downvotes. It took a non-zero amount of work to dig up where the setting is documented, and I didn't do it for my own benefit.
This isn't taking it personally like I value HN karma. This is complaining purely because downvotes can make content invisible.
Your original comment amounted to "It's working as documented, see here and here". But arguably the question was "Why does it work in this baffling way?"
Certainly that's how I interpreted it -- and while I didn't downvote your answer explaining that this weird footgun is actually documented behaviour, I got no value from either that information or your tone, which read to me as a little dismissive ("It's pretty clearly documented [, you lazy/incompetent person who didn't bother to look this up yourself]").
> Getting downvoted not for the content but perhaps poor phrasing is perfectly normal
IMHO, good and relevant content beats poor phrasing (which again IMHO I didn't witness in the original post), especially since English is not the first language for many people on this board. Downvoting only disincentivizes posting and unfortunately the HN voting system doesn't indicate why, leaving one just to guess.
> once you do SO_RCVBUF the auto-tuning is out of the picture for that socket
Oh I didn’t realize this. That explains the switch in limits. However:
I would have liked to keep auto-tuning, but only change the max buffer size. It's still weird to me that these are different modes with different limits and whatnot. In my case I was parallelizing TCP, and it would have been better to cap the max buffer size and instead vary the number of connections.
I gave up on it. Especially since I need cross-platform, user-space-only code, I don't want to fiddle with these APIs that are all different and unpredictable. I guess it's for the best anyway, to avoid as many per-platform hacks as possible.
This is great, not just the parameters themselves but all the steps that a packet follows from the point it enters the NIC until it gets to userspace.
Just one thing to add regarding network performance: if you're working on a system with multiple CPUs (which is usually the case in big servers), check NUMA allocation. Sometimes the network card will be attached to one CPU while the application is executing on a different one, and that can affect performance too.
Packagecloud has a great article series that goes into much more detail, with code study. If you really want to learn the network send and receive paths, these are the articles to read:
Just changing Linux's default congestion control (net.ipv4.tcp_congestion_control) to 'bbr' can make a _huge_ difference in some scenarios - I'd guess over distances with sporadic packet loss, jitter, and encapsulation.
Over the last year, I was troubleshooting issues with the following connection flow:
client host <-- HTTP --> reverse proxy host <-- HTTP over Wireguard --> service host
On average, I could not get better than 20% theoretical max throughput. Also, connections tended to slow to a crawl over time. I had hacky solutions like forcing connections to close frequently. Finally switching congestion control to 'bbr' gives close to theoretical max throughput and reliable connections.
I don't really understand enough about TCP to understand why it works. The change needed to be made on both sides of Wireguard.
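(If you don't want to change the system-wide default, you can also pick the algorithm per socket with TCP_CONGESTION - a sketch, assuming the tcp_bbr module is loaded; unprivileged processes can only choose algorithms listed in net.ipv4.tcp_allowed_congestion_control.)

    /* Sketch: select BBR for one socket and read back what's active. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int use_bbr(int fd) {
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3) != 0) {
            perror("TCP_CONGESTION");
            return -1;
        }
        char actual[16] = {0};
        socklen_t len = sizeof(actual);
        getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, actual, &len);
        printf("congestion control: %s\n", actual);
        return 0;
    }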
The difference is that BBR does not use loss as a signal of congestion. Most TCP stacks will cut their send windows in half (or otherwise greatly reduce them) at the first sign of loss. So if you're on a lossy VPN, or sending a huge burst at 1Gb/s on a 10Mb/s VPN uplink, TCP will normally see loss, and back way off.
BBR tries to find the Bottleneck Bandwidth rate, e.g. the bandwidth of the narrowest or most congested link. It does this by measuring the round trip time and increasing the transmit rate until the RTT increases. When the RTT increases, the assumption is that a queue is building at the narrowest portion of the path and the increase in RTT is proportional to the queue depth. It then drops the rate until the RTT normalizes due to the queue draining. It sends at that rate for a period of time, and then slightly increases the rate to see if the RTT increases again (if not, it means that the queuing it saw before was due to competing traffic which has cleared).
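(A toy illustration of that probe/drain idea - emphatically not BBR's real state machine, just the shape of the reasoning; thresholds are made up.)

    /* Toy sketch: raise the pacing rate until the RTT inflates relative
     * to the minimum observed RTT (a queue is building at the bottleneck),
     * then back off until it drains. */
    double adjust_rate(double rate, double rtt_sample, double min_rtt) {
        if (rtt_sample > 1.25 * min_rtt)
            return rate * 0.75;   /* queue building: drain it */
        return rate * 1.05;       /* RTT flat: probe for more bandwidth */
    }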
I upgraded from a 10Mb/s cable uplink to 1Gb/s symmetrical fiber a few years ago. When I did so, I was ticked that my upload speed on my corp. VPN remained at 5Mb/s or so. When I switched to RACK TCP (or BBR) on FreeBSD, my upload went up by a factor of 8 or so, to about 40Mb/s, which is the limit of the VPN.
You seem quite knowledgeable in this domain. Have you authored any blog posts to expand on this topic? I would welcome the chance to learn more from you.
No, fast retransmit basically does what it says -- retransmits things quicker. However, it is orthogonal to what the congestion control (CC) algorithm decides to do with the send window in the face of loss. Older CC like Reno halves the send window. Newer ones like CUBIC are more aggressive, and cut the window less (and grow it faster). However, RACK and BBR are still superior in the face of a lossy link.
Depending on the particular situation maybe vegas would work as well?
In particular, since Wireguard is UDP, using vegas over Wireguard seems to me like it should be good (based on a very limited understanding, though :/ ), it is just a question of how well it would work on the other side of the reverse proxy since I don't think it can be set per link?
Er, I was confused; of course being over UDP won't make the kind of difference I was thinking since the congestion control is just about when packets are sent. Although I heard a while back that UDP packets can be dropped more quickly during congestion. If that is the case and the congestion isn't too severe (but leading to dropped packets because it is over UDP) then possibly vegas would help.
Yes, BBRv1 has fairness issues when used at scale vs certain other algorithms. No, that doesn't mean people finding it tunes their performance 5x in a particular use case with high latency and some loss should stop talking about how it helps in that scenario. YouTube ran it without the internet burning to the ground, so using it in these niche kinds of cases with personal tuning almost certainly results in a net good even though the algorithm wasn't perfect. BBRv3 will make it scale to everyone with better fairness for sure, but BBR is still much better behaved for fairness than almost any UDP stream anyway.
The original cargo cultists built runways on islands to cause supplies to be dropped off. It didn't work. If someone copy and pasted something they don't fully understand off the Internet, but it works, can you really blame them for it, or call them cargo cultists?
It's easy to coax BBR into converging on using 20% of a shared link instead of 50% (cohabiting with one other stream).
The inverse is true and it's easy to get BBR to hog 80% of a link instead of 50% (cohabiting with one other stream). If you're happy for other people to steal bandwidth from you with greedy CCAs then go ahead and ratelimit yourself. I'm not.
It's still useful when dealing with high-latency links with non-zero loss. Fine, something might outcompete it, but without it the throughput would suck anyway.
E.g. if a service runs on a single server (no CDN) and you occasionally get users from Australia, then the site will be a bit laggy for them but at least it won't be a trickle.
> The change needed to be made on both sides of Wireguard.
Congestion control works from the sender of data to the receiver. You don't need to switch both sides if you are just interested in improving performance in one direction.
Besides that, I agree with what others said about BBRv1. The CUBIC implementation in the Linux kernel works really nicely for most applications.
Does performance tuning for Wi-Fi adapters matter?
On desktops, other than disabling features, can anything fix the problems with i210 and i225 ethernet? Those seem to be the two most common NICs nowadays.
I don't really understand why common networking hardware and drivers are so flawed. There is a lot of attention paid to RISC-V. How about starting with a fully open and correct NIC? They'll shove it in there if it's cheaper than an i210. Or maybe that's impossible.
> Does performance tuning for Wi-Fi adapters matter?
If you're willing to potentially sacrifice 10-20% of (max local network) throughput you can drastically improve wifi fairness and improve ping times/reduce bufferbloat (random ping spikes will still happen on wifi though).
i225 is just broken, but I get excellent performance from i210. 1Gb/s is hardly challenging on a contemporaneous CPU, and the i210 offers 4 queues. What's your beef with i210?
Most people don’t really use their NICs “all the time” “with many hosts.” The i210 in particular will hang after a few months of e.g. etcd cluster traffic on 9th and 10th gen Intel, which is common for SFFPCs.
On Windows, the NDIS driver works a lot better, but there are still many disconnects under a similar traffic load as on Linux, and features like receive side coalescing are broken. They also don't provide proper INFs for Windows server editions, just because.
I assume Intel does all of this on purpose. I don’t think their functionally equivalent server SKUs are this broken.
Apparently the 10Gig patents are expiring very soon. That will make Realtek, Broadcom and Aquantia's chips a lot cheaper. IMO, motherboards should be much smaller, shipping with BMC and way more rational IO: SFP+, 22110, Oculink, U.2, and PCIe spaced for Infinity Fabric & NVLink. Everyone should be using LVFS for firmware - NVMe firmware, despite having a standardized update mechanism, is a complete mess with bugs on every major controller.
I share all of this as someone with experience in operating commodity hardware at scale. People are so wasteful with their hardware.
Many systems which cost more than a good car are still coming with the Broadcom 5719 (tg3) from 1999. They have a single transmit queue and the driver is full of workarounds. It's a complete joke these are still supplied today.
SFP would be great but I'd settle for an onboard NIC chipset which was made in the last 10 years.
There are 3 revisions of the i225, and Intel essentially got rid of it and launched the i226. That one also seems to be problematic [1]. Why is it exponentially harder to make a 2.5Gb/s NIC when the 1Gb/s NICs (i210 and i211) have worked well for them? Shouldn't it be trivial to make it 2.5x? They seem to make good 10Gb/s NICs, so I would assume 2.5Gb/s shouldn't need a fifth try from Intel?
The bugs I am aware of are on the PCIe side. i225 will lock up the bus if it attempts to do PTM to support PTP. That's a pretty serious bug. You would think Intel has this nailed since they invented PCIe and PCI for that matter. Apparently not. Maybe they outsourced it.
This is really interesting for me to read. I encountered a DMA lockup caused by an Ethernet MAC implementation on an ARM chip. It was a Synopsys DesignWare MAC implementation. It would specifically lock up when PTP was enabled. From my testing, it seemed like it would specifically lock up if some internal queue was overrun. This was speculation on my part, because it would only lock up if I tried to enable timestamping on all packets. It seemed to work alright if the hardware filter was used to only timestamp PTP packets. This can be a significant limitation though, as it can prevent PTP from working with VLANs or DSA switch tags, since the hardware can't identify PTP packets with those extra prefixes.
The PTP timestamps would arrive as a separate DMA transaction after the packet DMA transaction. It very possibly could have been poor integration into the ARM SOC, but your PTP-specific issue on x86 makes me wonder.
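For reference, the "timestamp only PTP packets" vs "timestamp everything" choice is normally made through the SIOCSHWTSTAMP ioctl - rough sketch below; the interface name is a placeholder and the driver has to actually support the requested filter:

    /* Sketch: enable hardware timestamping for PTPv2 event packets only,
     * rather than HWTSTAMP_FILTER_ALL (which is what seemed to trigger
     * the lockup described above). "eth0" is a placeholder. */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/net_tstamp.h>
    #include <linux/sockios.h>

    int enable_ptp_timestamps(int fd /* any socket, e.g. AF_INET/SOCK_DGRAM */) {
        struct hwtstamp_config cfg = {0};
        struct ifreq ifr = {0};

        cfg.tx_type   = HWTSTAMP_TX_ON;
        cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;

        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&cfg;

        return ioctl(fd, SIOCSHWTSTAMP, &ifr);
    }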
I'm interested in clicking your link just not through a shortener. Perhaps just my own bias but figured I'd surface that reaction here. Much more useful to see the domain I'll end up on.
Great overview of the Linux network queues in the figure; I should paste it on the wall somewhere.
Brendan's Systems Performance books provide nice coverage of Linux network performance and more [1]. It's already in its second edition; both are excellent books, but the 2nd edition focuses mainly on Linux whereas the 1st edition also includes Solaris.
There's also a more recent book on BPF Performance Tools by him [2].
[1] Systems Performance: Enterprise and the Cloud, 2nd Edition (2020)
This doc kinda needs to say "TCP" somewhere, as it's very focused on TCP concerns - which is useful, people are mostly using TCP. The default UDP tunings are awfully low, and they're notably missing here.
I'm also seconding this, but from a microcontroller perspective.
I want to try developing a simple TCP echo server for a microcontroller, but most examples just use the vendor's own TCP library and put no effort into explaining how to manually set up and establish a connection to the router.
Because implementing TCP as anything more than a toy is incredibly difficult, with tripwires that even subject-matter experts struggle to get right without decades of in-field testing.
For TCP sockets I'd rather just MSS clamp on the internet gateway. On top of too many things just dropping PMTUD, enabling it results in a slower process, while MSS clamping hijacks the initial TCP open messages directly.
For TCP in Linux the only thing I know of is net.ipv4.tcp_mtu_probing=2 which is still slower than clamping at the edge. You can also run into weird slowdowns in cases with packet loss even after the initial discovery. If you don't have a way to clamp but absolutely need the interface to have jumbo enabled for local traffic performance it's probably the best fallback but even then I'm not sure it's worth the extra headache it causes.
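If you control the application but not the gateway, one per-socket half-measure is capping the MSS with TCP_MAXSEG before connect() - a sketch; the 1360 value is illustrative (picked to leave headroom for typical tunnel overhead), and it only helps that one socket, so it's no substitute for clamping at the edge:

    /* Sketch: cap this connection's MSS so segments fit through a tunnel
     * without relying on PMTUD. Must be set before connect() to affect
     * the advertised MSS; 1360 is an illustrative value. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int clamp_mss(int fd) {
        int mss = 1360;
        return setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
    }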
That's funny ... the "big guys" are some of the biggest contributors to the Linux network stack, almost as if they were actually using it and cared about how well it works.
History has shown that tons of Linux networking scalability and performance contributions have been rejected by the gatekeepers/maintainers. The upstream kernel remains unsuitable for datacenter use, and all the major operators bypass or patch it.
All the major operators sometimes bypass or patch it for some use cases. For others they use it as is. For others still, they laugh at you for taking the type of drugs that makes one think any CPU is sufficient to handle networking in code.
Networking isn't a one size fits all thing - different networks have different needs, and different systems in any network will have different needs.
Userland networking is great until you start needing to deal with weird flows or unexpected traffic - then you end up needing something a bit more robust, and your performance starts dropping because you added a bunch of branches to your code or switched over to a kernel implementation that handles those cases. I've seen a few cases of userland networking being slower than just using the kernel - and being kept anyway, because sometimes what you care about is control over the packet lifecycle more than raw throughput.
Kernels prioritize robust network stacks that can handle a lot of cases well enough. Different implementations handle different scenarios better - there's plenty of very high performance networking done with vanilla Linux and vanilla FreeBSD.
Over the course of several years, the architecture underpinning Snap has been used in production for multiple networking applications, including network virtualization for cloud VMs [19], packet-processing for Internet peering [62], scalable load balancing [22], and Pony Express, a reliable transport and communications stack that is our focus for the remainder of this paper.
This paper suggests, as I would have expected, that Google uses userland networking in strategic spots where low-level network development is important (SDNs and routing), and not for normal applications.
"and Pony Express" is the operative phrase. As the paper states on page 1, "Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams." According to the paper it is not niche.
Makes sense, they're probably using QUIC in lots of products and the kernel can't accelerate that anyways, it would only pass opaque UDP packets to and from the application.
Last I remember, as of at least 7 years ago, Google et al were using custom NIC firmware to avoid having the kernel get involved in general (I think they managed to do a lot of Maglev directly on the NICs), because latency is so dang important at high networking speeds that letting anything context switch and wait on the kernel is a big performance hit. Not a lot of room for latency when you're working at 100 Gbps.
Correct. That is my point. The sockets interface, and design choices within the Linux kernel, make ordinary TCP sockets too difficult to exploit in a datacenter environment. The general trend is away from TCP sockets. QUIC (HTTP/3) is a less extreme retreat from TCP, moving all the flow control, congestion, and retry logic out of the kernel and into the application.
An example of how Linux TCP is unsuitable for datacenters is that the minimum RTO is hard-coded to 200ms, which is essentially forever. People have been trying to land better or at least more configurable parameters upstream for decades. I am hardly the first person to point out the deficiencies. Google presented tuning Linux for datacenter applications at LPC 2022, and their deck has barely changed in 15 years.
At the point where we're talking about applications that don't even use standard protocols, we've stopped supplying data points about whether FreeBSD's stack is faster than Linux's, which is the point of the thread.
Later
Also, the idea that QUIC is a concession made to intractable Linux stack problems (the subtext I got from that comment) seems pretty off, since the problems QUIC addresses (HOLB, &c) are old, well known, and were the subject of previous attempts at new transports (SCTP, notably).
Google/Amazon etc. are likely happy to pay the cost because it really is "good enough" and the benefits of Linux over FreeBSD are otherwise quite considerable.
Google in particular seems blissfully happy to literally throw hardware at problems; since hardware is (for them especially) fundamentally extremely cheap.
Even multiple percentage gains in throughput are not necessary for most applications, and Linux is decent enough with latency if you avoid having complex IP/NFTables rules and avoid CONNTRACK like the plague.
as u/jeffbee says anyway, most of the larger tech companies these days are using userland networking and bypass the kernel almost completely for networking.
I know they bypass the kernel, but my point still stands: most of the servers on the internet run on Linux, that's a fact, so there has been more money, time, and manpower invested in that OS than in any other.
Your point is that popularity means that it will improve.
This is true, to a point.
Counterpoint: Windows Desktop Experience.
EDIT: that comment was glib, let me do a proper counterpoint.
Common areas are some of the least maintained in reality; think of meet-me-rooms or central fibre hubs in major cities; they are expensive and subject to a lot of the whims of the major provider.
Crucially, despite large amounts of investment, the underlying architecture or infrastructure remains, even if the entire fabric of the area changes around it. Most providers using these kind of common areas do everything they can to avoid touching the area itself, especially as after a while it becomes very difficult to navigate and politically charged.
Fundamentally the architecture of Linux's network stack, really is, "good enough", which is almost worse than you would originally think since "good enough" means there's no reason to look there. There is an old parable about "worse is better" because if something is truly broken people will put effort into fixing it.
Linux's networking stack is fine, it's just not quite as good an architecture as the FreeBSD one. The FreeBSD one has a lot less attention on it, but fundamentally it's a cleaner implementation and easier to get much more out of.
You will find the same argument ad infinitum regarding other subjects such as epoll vs IOCP vs kqueue (epoll was abysmally terrible though and ended up being replaced by io_uring, but even that took over a decade).
Especially since you don't even know what you're attempting to optimise for.
Latency? The p99 on Linux is fine; nobody is going to care that the request took 300μs longer. Even in aggregate across a huge fleet of machines, waiting an extra 3ms is totally, totally fine.
Throughput? You'll most likely bottleneck on something else anyway; getting a storage array to hydrate at a 100Gb/s line rate is difficult, and you want to do authentication, distribution of chunks, and metadata operations anyway, right?
You're forgetting that it's likely an additional cost of a couple million dollars per year in absolute hardware to solve that issue with throughput, which is, in TCO terms, a couple of developers.
Engineering effort to replace the foundation of an OS? Probably an order of magnitude more. Definitely contains a significant amount more risk, and the potential risk of political backlash for upheaving some other companies workflow that is weird.
Hardware isn't so expensive really.
Of course, you could just bypass the kernel with much less effort and avoid all of this shit entirely.
Google makes heavy use of userspace networking. I was there roughly a decade ago. At least at that time, a major factor in the choice of userspace over kernel networking was time to deployment. Services like the ones described above were built on the monorepo, and could be deployed in seconds at the touch of a button.
Meanwhile, Google had a building full of people maintaining the Google kernel (eg, maintaining rejected or unsubmitted patches that were critical for business reasons), and it took many months to do a kernel release.
Yes. I don't think anyone is disputing that Google does significant userspace networking things. But the premise of this thread is that "ordinary" (ie: non-network-infrastructure --- SDN, load balancer, routing) applications, things that would normally just get BSD sockets, are based on userspace networking. That seems not to be the case.
For one, by assuming that the work done primarily for microkernels/appliances is the absolute limit of userspace networking at Google, and that similar work would not go into a hypervisor (hypervisors which are universally treated as a vSwitch in almost all virtual environments the world over).
And making that assumption when there are many public examples of Google doing this in other areas such as gVisor and Netstack?
If you have information about other userspace networking projects at Google, I'd love to read it, but the Snap paper repeatedly suggests that the userspace networking characteristics of the design are distinctive. Certainly, most networking at Google isn't netstack. Have you done much with netstack? It is many things, but ultra-high-performance isn't one of them.
OK. I did. They said "no, it's not the case that networking at Google is predominately user-mode". (They also said "it depends on what you mean by most"). Do you have more you want me to relay to them? Did you work on this stuff at Google?
Per the Snap thread above: if you're building a router or a load balancer or some other bit of network infrastructure, it's not unlikely that there's userland IP involved. But if you're shipping a normal program on, like, Borg or whatever, it's kernel networking.
Oh. Then, unless a Googler jumps in here and says I'm wrong: no, ordinary applications at Google are not as a rule built on userspace networking. That's not my opinion (though: it was my prior, having done a bunch of userspace networking stuff), it's the result of asking Google people about it.
Maybe it's all changed in the last year! But then: that makes all of this irrelevant to the thread, about FreeBSD vs. Linux network stack performance.
Based on this I understand why you're talking like this: I think you have made an assumption/interpretation here and argued against the assumption, because nobody here (I believe) has claimed that Google only uses user-space networking; merely that Google makes use of user-space networking where it's "appropriate" (i.e. where FreeBSD would have had an advantage). Which is backed up by basically everything in this thread.
Which is why I said you "probably read it wrong".
Google is much happier to throw hardware at the problem in most cases, only when it really matters and they would have had to rearchitect the kernel to improve a situation any further do they break out the user-space networking.
The point I was driving at was that it's more common than you think.
Your base assertion that it's ubiquitous is very obviously false, because Chromebooks are pretty common inside Google offices and those are running stock ChromeOS (except in the offices that are developing ChromeOS).
Let me make my point clearly: Google depends on the Linux kernel stack as much as the top-of-the-thread comment suggests they do, and the things they're doing in user-mode, they would also be doing in user-mode on FreeBSD.
That's all I'm here to say.
As these things go, in the course of making your argument, you made a falsifiable and, I believe, flatly incorrect claim:
Most of the larger tech companies these days are using userland networking and bypass the kernel almost completely for networking
At least in Google's case, this isn't true. People doing custom network stack stuff totally do bypass the kernel stack (sometimes with usermode stacks, and sometimes with eBPF, and sometimes with offload). But the way you phrased this, you implied pretty directly that networking writ large was usermode at Google, and while I entertained the possibility that this might be true, when I investigated, it wasn't (unsurprisingly, given how annoying user mode networking is to interact directly with, as opposed to in middlebox applications).
Ok, this is a very hostile reading and could not possibly be considered a charitable interpretation of what I said; in fact I'd say it borders on trying to pick an argument where there isn't one. I did not "flat out lie" and I detest the insinuation. I expect better of you, honestly.
To answer: "Most of the larger tech companies these days are using userland networking and bypass the kernel almost completely for networking"
You could read "almost completely" as in "almost across the whole company" which would be a weird way to read it, but you seem to have read it this way.
I intended it to mean: when they bypass the kernel; it is a near complete bypass.
Since there actually is still a network connection going through the kernel (the host itself will still be connected of course), which is of course the inverse of what you seem to have taken away; in that user-mode networking is used even less than entirely even on a single node.
edit: In fact, I stated multiple times in my post: "Google is mostly happy to just throw hardware at this" which you seemed to just.. ignore? Google are absolutely happy to throw hardware at issues until they can't anymore or the gains are too enormous to avoid. I thought I was extremely clear about that.
Your point about user-land networking in FreeBSD is just a nonsense one to make -- and not the point we were discussing anyway; like asking "if it did rain beer, would we all get drunk?", it's completely hypothetical and not based in any subjective reality or objective truth. You have absolutely no way of knowing if FreeBSD could do those things. That the architecture permits it is shown somewhat by FreeBSD's use at Netflix, which has been noted elsewhere in this thread to achieve close to 800Gb/s of data transfer, but knowing what Google would have done with FreeBSD would require seeing into alternative realities.
I don't know where you work, but as far as I know: nobody has managed to perfect that technology yet.
This is a weird cursed thread that started out with a pretty silly† claim about FreeBSD vs. Linux network stack performance (neither of us started it; it's a standard platform war argument). Someone made a comment that the hyperscalers all depend on the Linux kernel stack, to a far greater degree than they do on FreeBSD. That's a true statement; at this point, you and I have both agreed on it.
When that point was pressed earlier, you and another comment brought up kernel bypass (usermode networking, specifically) as a way of dismissing hyperscaler Linux dependencies. But that's not really a valid argument. Hyperscalers do kernel bypass stuff! Of course they do! But they're doing it for the things that you'd formerly have bought dedicated network equipment for, and in no case are they doing it in a situation where deploying FreeBSD and using the FreeBSD stack would be a valid alternative.
The disconnect between us is that I'm still talking about the subject that the thread was originally about --- whether Google using the Linux stack is a valid point backing up its fitness for purpose. I think it pretty clearly is a valid point. I think the usermode networking stuff is interesting, but is a sideshow.
† "Silly" because these kinds of claims never, ever get resolved, and just bring out each side's cheering section, not because I have low opinions of FreeBSD's kernel stack --- I came up on that stack and find it easier to follow than Linux's, though I rather doubt the claim that it has a decisive performance advantage.
Thank you for clarifying, and I apologise as I also became quite hostile.
I agree that we agree on many points; but I think where we diverge (and perhaps fundamentally) is in the base assumption that "because google uses it, it must be the best, because even if it wasn't google would make it so" (at least, this is my interpretation of GPs comment).
I have little doubt that the low-hanging... mid-hanging and perhaps even most of the high-hanging fruit has been well and truly plucked when it comes to Linux throughput at the behest of the large tech companies, because a few percent improvement translates to a lot at their scale. -- However I am reminded of an allegory given in a talk (that I can't find) regarding bowling.
In the talk the speaker mentions how they got "really good" at bowling with completely the wrong technique; but it worked for them, up to a point in which they could not improve no matter what they did. They had to go back to the basics and learn proper technique and become much worse before they were able to overtake their previous scores with the bad technique.... but after that point there were further improvements to be had.
My argument that this is the case is merely: doing an architectural re-write of the linux kernel to be more scalable in the way FreeBSD's is would be very punishing for too many people, and additionally that the economics are not favourable when, if you do get to a point where you cannot scale due to the kernel, you could just break out into userland. -- then simply suffer the adequate but not insane performance everywhere else where it's not needed anyway.
So, to summarise my points:
* Because a big company uses something does not mean it is perfect in all areas
* That Linux has a lot of attention on it does not necessarily mean that it has the most potential: though I don't doubt that the majority of its potential has been reached.
* Diminishing returns means once it's "good enough" people will try to get performance elsewhere if they need it.
* Rewriting the network stack in Linux completely would likely be harmful to many and subtly so, I haven't seen people moving towards this idea either, this feels like it could be political as well as technical.
* Hyperscalers will often trade convenience over performance: regardless, CPU time is much cheaper for them than it is for us.
I think it's totally legit to say that hyperscaler Linux adoption isn't dispositive of the Linux's stack's performance advantage over FreeBSD. I basically think of this FreeBSD vs. Linux stack debate as unknowable (it probably isn't, but it's liberating to decide I'm not going to resolve it to anyone's satisfaction). So I'm not here to say "Google uses Linux ergo it's faster than FreeBSD"; that adoption is a useful observation, but that's all it is.
I've been at two large FreeBSD users that switched to Linux, and the reason was never performance.
Yahoo switched to Linux because having a single server OS is nicer than having two, acquisitions were nearly all running Linux, and Linux was at least good enough (although having all of Overture crash at the same time because of a leap second bug in the Linux kernel, twice, wasn't great), and maybe something about it being easier to hire kernel engineers.
WhatsApp switched to Linux because the server team was mostly ex-Yahoo and had seen how much strife running a different OS than your acquirer in their datacenters causes. There was enough non-negotiable strife from all of the other tech and philosophy differences, that accepting the acquirer kernel that is at least good enough was worthwhile.
I'm not saying FreeBSD or Linux has better performance. I like the reputation FreeBSD has for network performance, and it certainly has good performance, but is it better than Linux? I don't know, I never ran apples to apples comparisons, because whenever I was involved in a switch, the hardware was going to be very different, and the changeover was a policy decision rather than a technical one.
Personally, I run FreeBSD when I can, and Linux when I have to. FreeBSD's development model and smaller team lead to less churn, and I value stability and consistency. Other people have different values, and that's fine too.
Do you think that (outside of a few special cases) they're using anything near the network bandwidth available to them?
I would expect in the 1% to 10% bandwidth utilization, on average. From my vague recollection, that's what it was at FB when I was there. They put stupid amounts of network capacity in so that the engineers rarely have to think about the capacity of the links they're using, and that if their needs grow, they're not bottlenecked on a build out.
To answer the original question, it's complicated. I have a weird client where freebsd gets 450 MiB/s, and Linux gets 85 with the default congestion control algorithm. Changing the congestion control algorithm can get me between 1.7 MiB/s and 470 MiB/s. So, better performance... Under what circumstances?
Performance parity on which axis? For which use case?
Talking generally about "network performance" is approximately as useful as talking generally about "engine performance". Just like it makes no sense to compare a weed-eater engine to a locomotive diesel without talking about use case and desired outcomes, it makes no sense to compare "performance of FreeBSD network stack" and "Linux network stack" without understanding the role those systems will be playing in the network.
Depending on context, FreeBSD, Linux or various userland stacks can be a great, average, or terrible choices.
Linux is a networking swiss army knife (or maybe a Dremel). It can do a lot of stuff reasonably well. It has all sorts of knobs and levers, so you can often configure it to do really weird stuff. I tend to reach for it first to understand the shape of a problem/solution.
BSD is fantastic for a lot of server applications, particularly single tenant high throughput ones like mail servers, dedicated app servers, etc. A great series of case studies have come out of Netflix on this (google for "800Gbps on freebsd netflix" for example - every iteration of that presentation is fantastic and usually discussed here at least once, and Drew G. shows up in comments and answers questions).
It's also pretty nice for firewalling/routing small and medium networks - (opn|pf)sense are both great systems for this built on FreeBSD (apologies for the drama injection this may cause below).
One of the reasons I reach for Linux first unless I already know the scope and shape of the problem is that the entire "userland vs kernel" distinction is much blurrier there. Linux allows you to pass some or all traffic to userland at various points in the stack and in various ways, and inject code at the kernel level via eBPF, leading to a lot of hybrid solutions - this is nice in middleboxes where you want some dynamism and control, particularly in multi-tenant networks (and that's the space my work is in, so it's what I know best).
Please bear in mind that these are my opinions and uses/takes on the tools. Just like with programming there's a certain amount of "art" (or maybe "craft") to this, and other folks will have different (but likely just as valid) views - there's a lot of ways to do anything in networking.
This largely depends on how the application is written and in particular what if any non-POSIX interfaces it uses.
If you are looking to hit line rates with UDP, or looking to head well above ~1-10gbps with TCP, you're fast headed into territory where you likely need to move away from POSIX. (For a super dumb benchmark, oversized buffers amortizing syscall overhead might get you to 10gbps on TCP, but in a real application everything changes)
Once you're headed over 10gbps you'll quickly run into a hard need to retune things even for TCP, earlier if you're talking to non-local hosts. Once you're over 25gbps you're headed into the territory where you'll need to fix drivers, fix cpu tuning and so on. For a recent real world example: when we were doing performance analysis of our offloading patches for Tailscale we identified problems with the current default CPU frequency scaler for Intel CPUs on current kernels, and reached out to the maintainers with data.
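(To make the "oversized buffers" point concrete, the receive side of that dumb benchmark is basically just a big-buffer loop so each syscall moves a lot of data - sketch, 1MB size illustrative.)

    /* Sketch: drain a connected TCP socket with large recv() calls to
     * amortize syscall overhead; returns bytes received until EOF/error. */
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    long long drain(int fd) {
        enum { BUFSZ = 1 << 20 };      /* 1MB per recv() call */
        char *buf = malloc(BUFSZ);
        long long total = 0;
        ssize_t n;
        while ((n = recv(fd, buf, BUFSZ, 0)) > 0)
            total += n;
        free(buf);
        return total;
    }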