I think Whatsapp served "at peak" like 450 million users with only 32 engineers.

dmitriid · on Sept 30, 2019

Whatsapp also had multiple patches to Erlang VM, backported FreeBSD drivers and code etc. etc.

Do not assume you'll be able to build your own Whatsapp just because of Erlang.

fouc · on Sept 30, 2019

The story of freebsd & erlang "needing" to be patched seems to be greatly exaggerated. Especially when it turns out that elixir/phoenix also achieved the same "2 million connections on single server" without needing those optimizations.

dmitriid · on Sept 30, 2019

> The story of freebsd & erlang "needing" to be patched seems to be greatly exaggerated.

The story is greatly underreported and all focus is only on "they run Whatsapp on Erlang with just ~50 engineers".

Highscalability lists just some of the patches and optimisations they made here [1] and here [2].

Here's an incomplete list of patches only. There's also tuning and optimisation:

Erlang: Fixed head-of-line blocking in async file IO by patching BEAM, added round-robin scheduling for async file IO, added multiple instrumentation patches. Instrumented scheduler to get utilization information, statistics for message queues, number of sleeps, send rates, message counts, etc. Made lock counting work for larger async thread counts. Patched to dial down spin counts so the scheduler wouldn’t spin.

BSD: Backported a TSC time counter. Backported igb network driver.

More Mnesia (Erlang) patches discussed here: [3]

Are you ready to do this for your Whatsapp?

> elixir/phoenix also achieved the same "2 million connections on single server" without needing those optimizations.

There's more needed to run a chat server than just "2 million empty connections".

---

[1] http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...

[2] http://highscalability.com/blog/2014/3/31/how-whatsapp-grew-...

[3] https://www.infoq.com/presentations/whatsapp-scalability/

fouc · on Sept 30, 2019

>Are you ready to do this for your Whatsapp?

I don't need to, thanks to the fact that a bunch of those patches are now part of Erlang.

dmitriid · on Sept 30, 2019

Note how I said it was an incomplete list of patches only.

There's also significant tuning and optimisation, both for the Erlang VM and FreeBSD.

There are also things like (quotes from Highscalability):

"Mnesia: Using no transactions, but with remote replication ran into a backlog. Parallelized replication for each table to increase throughput."

"When Rick is going through all the changes that he made to get to 2 million connections a server it was mind numbing. Notice the immense amount of work that went into writing tools, running tests, backporting code, adding gobs of instrumentation to nearly every level of the stack, tuning the system, looking at traces, mucking with very low level details and just trying to understand everything. That’s what it takes to remove the bottlenecks in order to increase performance and scalability to extreme levels."

Or even things like "What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM? The Erlang/FreeBSD-based server infrastructure at WhatsApp". Oh, wait. Erlang's default distribution mechanism grinds to a halt when there are more than ~60-80 nodes. And Mnesia has a 2GB limit on table sizes. So you have to work around those limitations yourself.

There are no magic bullets. Erlang will only take you so far. The rest (80-90% of the way) you have to take on your own, and you have to know what you're doing, and what needs to be done: patches, tuning, workarounds, limits of the systems you work with etc.

toast0 · on Sept 30, 2019

> Erlang's default distribution mechanism grinds to a halt when there are more than ~60-80 nodes. And Mnesia has a 2GB limit on table sizes. So you have to work around those limitations yourself.

I've seen people say these, and I have no idea where they come from. If you have a decent network, dist works fine at well over 80 nodes, but everyone says it doesn't work. pg2/global has some sharp edges if you're trying to have many nodes acquire the same global lock when you have a lot of nodes (a few hundred) or a smaller number if you have a lot of latency between them. There's options though -- maybe you don't need to acquire the same lock on all nodes, or maybe you can look in pg2.erl and global.erl and wiggle the locking code until it no longer live locks.

The supposed Mnesia 2GB limit is a bunch of hooey. Yes, disc_only_copies has (or had) that limit, because dets has that limit. Yes, it's a sharp edge, because there's no warning about it. However, a 2GB dets table is awful to work with anyway. You want to use disc_copies or ram_copies for big tables. Also, mnesia_frag is well supported, so if you really wanted to, you could make your disc_only_copies table 1024 fragments, and have 2 TB of dets, if that's how you wanted to roll.
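
For anyone who hasn't seen a fragmented table, here is a minimal sketch of what the 1024-fragment setup could look like. The table name `msg`, its fields, the module name, and the single-node pool are all made up for illustration; none of it is WhatsApp's actual schema. It also assumes a disc-resident schema already exists (`mnesia:create_schema/1` was run before `mnesia:start/0`), since disc_only_copies needs one.

```erlang
-module(frag_sketch).
-export([create/0, store/2]).

%% Hypothetical record; the fields are placeholders.
-record(msg, {id, body}).

%% Create a table split into 1024 fragments, each backed by its own
%% dets file (disc_only_copies). Each fragment is subject to the dets
%% size limit on its own, so the total is roughly 1024 x 2 GB = 2 TB.
create() ->
    Nodes = [node()],  %% assumption: a single-node pool
    mnesia:create_table(msg, [
        {attributes, record_info(fields, msg)},
        {frag_properties, [
            {n_fragments, 1024},
            {node_pool, Nodes},
            {n_disc_only_copies, 1}
        ]}
    ]).

%% Reads and writes have to go through the mnesia_frag access module,
%% otherwise every record lands in the first fragment only.
store(Id, Body) ->
    mnesia:activity(transaction,
                    fun() -> mnesia:write(#msg{id = Id, body = Body}) end,
                    [],
                    mnesia_frag).
```

Whether 1024 dets files on one node is a good idea is a separate question; the point is only that the per-table limit applies per fragment.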

And yes, if you're going to hyperscale, you're going to need a couple people who know how to figure out what your system is doing. Is there a language/environment where that's not true?

I claim, without real proof, that Erlang's BEAM VM and OTP standard library are easier to understand and tweak when you do hit problems. You'll note however, that Rick Reed's first presentation was when he had been at WhatsApp for about a year, and he had zero experience with Erlang before that.

fouc · on Sept 30, 2019

Honestly, I'll worry about it when I get there.

EdwardDiego · on Sept 30, 2019

It's like open source works or something.

strmpnk · on Sept 30, 2019

That’s a bit misleading. Many of their patches were needed and had been contributed upstream by the time Phoenix was being tested. Still, this is a great benefit of sharing the ecosystem: everyone gets to benefit from this work.