The MySQL “swap insanity” problem and the effects of the NUMA architecture

guard-of-terra · on April 8, 2012

One behavior I've noticed with linux that if you read files sequentially from disk (for example, doing scp), then linux would fill all the memory with those files' contents and then it would swap out everything but the (obviously useless) disk caches. So you'll have all the memory filled with data you would never need again and trying to do anything would cause a large and painful unswapping (had side effect of halting my qemu).

This is true insanity. Surely you can disable swap or tune swappiness, but what's the reason for crazy default behavior?

justincormack · on April 8, 2012

Use rsync. It now preserves the buffer cache status that files had before so it does not stomp on your allocations.

http://insights.oetiker.ch/linux/fadvise/

mceachen · on April 9, 2012

Also, consider using --bwlimit to throttle the copy speed, so the spindle can still respond to other IO requests. (25-50% of unthrottled speed seemed to be a reasonable tradeoff).

pmjordan · on April 8, 2012

I suspect the reason to be that the system has noticed that your apps haven't touched their memory for a long time and thus "don't need it". This is a reasonably valid assumption on servers: if you've got some daemons that are backgrounded for minutes or hours at a time, keeping their memory resident is a waste. However, on the desktop responsiveness (latency) is more important than throughput. Just because you only switch between apps on a timescale of minutes or hours, doesn't mean the kernel should swap them out. So the algorithm needs different weighting.

I used to experience this problem, but I haven't lately. I suspect what's going on is that the "-desktop" kernel variant of OpenSUSE uses a differently weighted swappiness algorithm. If your distro offers a choice of different kernel variants, you could try them; otherwise (or if that doesn't help), you could track down the knobs you need to tweak to make the problem go away.

steerb · on April 8, 2012

I guess that the kernel cannot know that the application (in your case scp) will not try to touch the data ever again.

The current default behavior, which you call crazy, does however favor programs which do actually need the data that was just read (e.g., databases).

masklinn · on April 8, 2012

Would mmapping those files instead of reading them provide saner behavior, or does the kernel still do that?

FooBarWidget · on April 8, 2012

Reading a file with mmap() results in the same behavior.

nknight · on April 8, 2012

I'm guessing you're either running a rather old distribution/kernel, or your swappiness is set far too aggressively. Try setting it to 0.

andreasvc · on April 8, 2012

I still don't comprehend why one needs swap at all. All the explanations I have come across talk about not having enough memory. Given that one has at least 8 GB of memory, or maybe even >100GB, why on earth would you need swap? Sure some process might allocate even more than that, but maybe it's better to refuse such a request than to slow down the whole system due to thrashing.

I get the idea that the reason might be that a lot of programs allocate memory which they don't actually need regularly, which is then very convenient to swap out. Rather than enabling this bad habit using slow disk storage it would be much better to expect programs to be more frugal, or at least signify whether something should be kept in memory or not.

atombender · on April 9, 2012

The problem is that if you disable swap completely and let the system refuse allocation, you will face failures not just from the process that was responsible for taking most of the memory, but from existing ones that attempt to continue operating normally -- including innocuous processes such as bash and sshd.

This would affect C programs in particular, since they usually manage their memory manually. If bash can't malloc() a buffer for its input, for example, it will simply fail, and you might be able to do anything to fix the system; the same goes for sshd, which might end up refusing new connections as a result. Programs that preallocate important data structures, and programs using garbage collection, would fare somewhat better.

In other words, if swap is disabled you will still need a sort of soft limit or reserved space to ensure that programs can survive memory starvation. I don't know if the Linux kernel (or the GNU C library) has anything of the sort.

jeremycole · on April 8, 2012

Well, for the systems in question, they don't necessarily need swap, and they may never actually swap in practice aside from problems such as the NUMA issues described. However, even with swap disabled, the NUMA issues cause problems with performance due to running a single node out of memory.

makmanalp · on April 8, 2012

Because most data is critical and you can't afford to just drop it on the ground whenever you please. A better option would be to have the application/db to have its own swap routines optimized for its own purposes rather than letting the OS doing a catch-all swap method.

jcrites · on April 8, 2012

Even beyond that: it's desirable for a machine to be able to compute arbitrarily large data sets. If the data set can't efficiently fit in memory, the machine should still make progress, just more slowly, using disk.

It is not desirable for a machine to have a "wall" which, upon being hit, becomes a harsh restriction on its capabilities. This is because we often encounter the "wall" unexpectedly, at a time that might be critical.

vidarh · on April 8, 2012

But they still have a wall - it's just takes a bit more to hit it. In Linux systems swap is usually 2x memory. With swap set like that, all swap does is raise the wall to 3 times what it previously was.

But for a lot of systems your service will fail shortly after you start swapping anyway, because the performance cost of swapping is so high that it often starts a death spiral (can't handle enough requests, so they start piling up, eating even more memory, until your system dies or you hit connection limits etc.).

So "best case" in a typical configuration is that the wall is a bit higher. Worst case you gain nothing at all from the swap.

Personally I treat it as a failure if we ever hit swap - it means connection limits etc. has been set too high.

GoodIntentions · on April 8, 2012

"Personally I treat it as a failure if we ever hit swap"

/agree. but still a useful feature.

The degraded performance a system will show when it starts hitting disk instead of memory is a great 'soft' failure.

I think it is good to have graduations. Going from 'OK' to 'Damn-this-is-slow' before 'Fail' is handy.

ars · on April 9, 2012

That's not why you need swap. Swap is because many applications will use memory when starting and then never touch it again.

You can therefor swap it out and use the extra memory for cache.

Most long term applications only need a small fraction of their startup memory.

justincormack · on April 9, 2012

Then surely they could free it?

ars · on April 9, 2012

If you allocate memory "after" that memory, it's not possible to return the earlier memory to the OS.

Also, suppose you need the memory only for startup and shutdown (things like logfiles, network connections, command line parsing, etc).

justincormack · on April 9, 2012

Yes it is. Memory allocators are heaps not queues.

Things like network connections, logfiles are used all the time, so they won't be swapped out (actually file handles are kernel side so never swapped anyway). You can free the command line parse after setting the options.

And clean shutdown is overrated: long running programs can just terminate fairly gracelessly if necessary, the OS cleans everything up.

ars · on April 11, 2012

Memory allocators might be heaps, but behind the scenes it's just an area of memory, and that area is contiguous.

If you increase the size of the memory available to you (sbrk) you can only decrease it if no memory is allocated between the new area and the end of it.

In practice the memory is never returned, and applications rely to swap to deal with that.

It's not the logfile (and network) handle that is swapped out - it's the code for deciding where it is, and opening it. Also initialization code.

Some programs can abort, but others will require a (slow) consistency check of their data if that happens to them.

And finally theory is all well and good, but in actual practice about 3/4 of the memory used by running programs can be swapped out.

__david__ · on April 8, 2012

I like the erlang technique of just dying when something bad happens. That is, if you write your app right then you can afford to just die at any point (without dropping any data on the floor at all). I agree with the grandparent--I've started to disable swap on my production servers because I reason that if I run out of ram then I haven't I configured something correctly. A real server shouldn't ever swap--heavy swapping grinds the whole world to a halt which means requests are being serviced way too slowly, if at all...

jws · on April 8, 2012

An application that is aware of the half dozen of so caching layers from register to platter can perform dramatically better than a naive program. Two wrinkles:

1) it needs to either be told the various sizes, speeds, and quirks on each server to make best use. (just some work)

2) it needs to coordinate with the other processes running on the system to divide up the resources. This is hard. Generally people bail and just assign some share of RAM and hope for the best with the other layers.

andreasvc · on April 8, 2012

I agree of course, but I think the current situation is worse. Applications can allocate more than the amount of physical memory, at which point the system can become unusable. If on the other hand an allocation would have been refused at an earlier stage, there would have been no critical data to drop.

I guess I want to argue that with the currently typical amounts of RAM, all critical data should fit in RAM and stay there. The idea of virtual memory was to abstract over the difference between RAM and disk, but perhaps this has become a harmful abstraction now that RAM is big enough while the disadvantage of slow disks remains. RAM and disks are fundamentally different parts of the memory hierarchy, and should be treated completely differently by applications.

jeremycole · on April 8, 2012

There's also copy-on-write to consider. If you actually had to allocate real memory-backed pages for every allocation a process made, had the right to, but never modified, the size of each process would be a lot larger. For example, if a process consumes 1GB of memory and forks a child, the child has access to the same 1GB of memory, but it doesn't really consume 1GB of memory. It has mapped a bunch of copy-on-write pages from its parent, and when either of them next modify those pages, the real memory is allocated.

If they never modify those CoW pages, they can both happily keep using the same copy of the page in memory, and you can have two 1GB processes using a total of e.g. 1.01GB of real memory.

This is very useful in practice, but it means that the system needs the ability to over-commit memory allocations (allow CoW allocations etc., when there is no actual memory available to back it), and over-commit currently, and probably should, requires swap (some place to dump pages in case an over-committed allocation comes calling).

lawnchair_larry · on April 8, 2012

Overcommit does not use or require swap. You can have overcommit enabled on linux and not be using swap at all.

wmf · on April 8, 2012

But then you may meet the OOM killer.

lawnchair_larry · on April 8, 2012

Again, that has nothing to do with swap being enabled.

wmf · on April 9, 2012

My understanding is that swap would delay or prevent the OOM killer from kicking in. Is that wrong?

lawnchair_larry · on April 9, 2012

If you have 8GB of ram and 8GB of swap, the OOM killer will kick in at 16GB.

If you have 16GB of ram, and 0GB of swap, the OOM killer will kick in at 16GB.

bifrost · on April 11, 2012

You don't need it on some systems, however Linux will typically kernel panic if it has no swap rather than just killing processes. FreeBSD/NetBSD/Solaris all do this properly, however its still good to have swap because having your processes die when you make a tiny mistake is rather irritating. This is a great example of why you need experienced ops people to reign in the ideas of green engineers.

toddh · on April 8, 2012

Swapping made sense when we had computers with limited resources running multi-user workloads. Neither is now true. These servers are special purposed so they don't need the overhead of a general OS at all. But aren't there two issues here? Swap and node allocation? Even without swapping wouldn't you have NUMA node allocation issues?

andreasvc · on April 8, 2012

This is what I suspect, but in practice it seems not a good idea to just turn off swap:

> DO NOT TURN OFF SWAP to prevent this. Your box will crawl, kswapd will chew up a lot of the processor, Linux needs swap enabled, lets just hope its not used.

(from one of the blogs linked in the article).

However, I can't find a clear explanation of why this is so.

lawnchair_larry · on April 8, 2012

That was an old bug. It is perfectly fine to disable swap completely. It's the first thing we do on our servers, partly due to the issues detailed in this blog post, and partly due to it being archaic and useless when you have 128GB-512GB of ram.

jeremycole · on April 8, 2012

Yes, they are two distinct issues which are somewhat conflated: the actual "swap insanity" and swap in general, and NUMA performance effects. I think for my purposes, the negative effects of NUMA are relatively hidden, so we only see the side effects. That is, it's impossible to know at this point whether optimization for NUMA would help MySQL, since no one has made a serious effort to do it.

I suspect it would help quite a bit, if done right, and for the right query workload.

ataggart · on April 8, 2012

The two may be less distinct in the face of some approaches to minimizing remote reads/writes, e.g., replication of pages across domains. Apart from ensuring the cost of such replication is warranted, the dominant pressure against replication is the effect it has in increasing the memory footprint of the application, and thus possibly triggering more page faults.

xxjaba · on April 8, 2012

I am very impressed with how well written this article is. A brief description of the problem, links to relevant discusions for less informed readers to come up to speed, and clear examples of how key pieces of information were gathered. I learned more from this article about the topic at hand than I have from a Blog post in recent memory.

jeremycole · on April 8, 2012

Thanks! I am glad you learned something, and happy to get great feedback!

finnh · on April 8, 2012

I'll second this.

The intro was so well written that by the time I got to the first numa_maps output ("2aaaaad3e000 default anon=13240527 dirty=13223315 swapcache=3440324 active=13202235 N0=7865429 N1=5375098") I immediately thought "well geez look at that N0/N1 imbalance, there's your problem right there".

Point being, I haven't dealt with low-level hardware details since college, and yet your article's delightfully clear intro got me sufficiently educated to feel like I was right there with you.

A question well phrased is half answered...

sciurus · on April 9, 2012

Anther good article is https://kevinclosson.wordpress.com/2009/05/14/you-buy-a-numa...

On commodity servers, unless you have specific reasons to do otherwise just switch from NUMA to SUMA. There are two things yo should do

* Change a BIOS setting. The term for this will vary by manufacturer. For Dell, it means enabling node interleaving.

* Pass numa=off to the linux kernel (e.g. edit grub.cfg)

larsberg · on April 8, 2012

Yes, NUMA effects will really kill you, though how much depends on the particular quad-proc topology. I have some measurments for the interested in a small workshop paper I put together (I gathered the numbers in the context of tuning our garbage collector anyway):

http://dl.dropbox.com/u/1620890/website/writings/mspc12-stre...

WALoeIII · on April 9, 2012

Will this optimization help on virtualized machines like Xen? Or does all memory appear to be the same?

On an EC2 m1.xlarge:

$ numactl --hardware available: 1 nodes (0) libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory node 0 cpus: node 0 size: <not available> node 0 free: <not available> libnuma: Warning: Cannot parse distance information in sysfs: No such file or directory

jakejake · on April 8, 2012

For those of us mere mortals, would it be safe to assume that adding the suggested line to mysql_safe would be ok to do?

cmd="/usr/bin/numactl --interleave all $cmd"

corford · on April 8, 2012

Interesting read. Does anyone know if things have improved/changed significantly since the article was posted (Sep 2010)?

jeremycole · on April 8, 2012

They have not changed in any way, however there is a patchset proposed currently to change how NUMA works a bit. Unclear if it will change this situation.

corford · on April 8, 2012

Would that be a mysql or linux kernel patch (assume the latter)? Also want to echo xxjaba's comment further down - thanks for doing the work on that post, it was really enlightening!

jeremycole · on April 8, 2012

Yes, a Linux kernel patch. I've got a lot of ideas on NUMA optimization for MySQL directly though, so keep an eye out for that, perhaps some time this year.

corford · on April 8, 2012

>I've got a lot of ideas on NUMA optimization for MySQL directly though, so keep an eye out for that

Thanks Jeremy, I will do.

lawnchair_larry · on April 8, 2012

The title is inaccurate. It should say, "the linux swap insanity problem" because this is entirely related to the linux kernel. It just happens to affect MySQL and similar workloads, but it is not MySQL's fault. It doesn't behave that way on other platforms either.

jeremycole · on April 9, 2012

Pretty hard to make you happy, eh?

I agree with you on a purely technical basis, however, this article was written for the MySQL community, was tested (only) on MySQL, and has primarily affected my only on MySQL systems, which I (and the others referenced in the article) primarily run on Linux.

bifrost · on April 11, 2012

Thats a very good point, these types of problems exist to some extent on other OSes but this case seems pretty specific to Linux. I suspect testing on Solaris/FreeBSD would show better results in this area.

defen · on April 8, 2012

Are the lessons here applicable to other commonly used databases (mongo, postgres, redis, etc)?

wmf · on April 8, 2012

This applies to any case where you want a single process to use more than half the server's RAM.

jeremycole · on April 9, 2012

Yes, absolutely. In fact one of the most common longer term referrers for that post is about MongoDB, not MySQL:

http://www.mongodb.org/display/DOCS/NUMA

j2labs · on April 8, 2012

tl;dr - If you're running a database, or generally memory intensive system, while also using multiple CPUs you should run this command: echo 0 > /proc/sys/vm/zone_reclaim_mode

But the article is great. You should definitely read it.

finnh · on April 8, 2012

Except that's not the TL;DR at all.

From the article:

"An aside on zone_reclaim_mode

The zone_reclaim_mode tunable in /proc/sys/vm can be used to fine-tune memory reclamation policies in a NUMA system. Subject to some clarifications from the linux-mm mailing list, it doesn’t seem to help in this case."

The real TL;DR is "run your mysql command under the auspices of '/usr/bin/numactl --interleave all' so that your big pool allocation is split evenly across nodes"

And an even better solution would be if _only_ the big pool allocation use interleaved allocation, and all the rest used normal node-bound allocation. This would require some sort of change to the malloc calls though, yes? All of the solutions listed in the article operate at the granularity of a process (or higher), not down to the individual allocation.

j2labs · on April 9, 2012

My mistake, you are correct that I forgot the second command.

The new tldr is: numactl --interleave=all /path/to/daemon; echo 0 > /proc/sys/vm/zone_reclaim_mode

This file helps explain the different ways one can tweak memory with Linux: http://www.kernel.org/doc/Documentation/sysctl/vm.txt