Mysql "Swap Insanity" (jcole.us)
141 points by xal on Sept 29, 2010 | 26 comments



This is all very interesting, and the author has clearly done a lot of research. But where are the benchmarks? I'd love to see some sort of replicable evidence that using this command helps things that much.

Linux's NUMA policy does seem broken for this use case. If all the memory on node 0 is used up (but plenty is free in node 1) and a thread in node 0 attempts to allocate memory, why not, instead of swapping out pages from node 0, simply move them to node 1? Alternatively just allocate memory in node 1, as the author suggests. I'm not a kernel programmer. Anyone who's more familiar with this care to answer?
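For anyone who wants to poke at this themselves, a minimal sketch, assuming the numactl package is installed (the interleave line is my reading of the article's suggested workaround, and the mysqld path is illustrative):

    # Show per-node totals; on an affected box one node is nearly
    # exhausted while the other has plenty free
    numactl --hardware
    numastat

    # The article's workaround (as I read it): interleave mysqld's
    # allocations across all nodes instead of preferring the local one
    numactl --interleave=all /usr/sbin/mysqld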


I'm only semi-familiar with the issues, but on the surface swapping across nodes rather than to disk seems like it has to be a win. I think the problem might be that on a long running system there is rarely any truly "free" memory. Rather, one chooses to dump cached pages. It's possible that the cost of copying across nodes plus the eventual reread of the cache makes it a negative? Although I'd have to think that it's better than a certain write.

I can see the logic where allocating on a non-local node is potentially a mistake. Depending on how many times the memory will be accessed, it may well be worth the immediate hit to swap a page to disk and keep all your accesses local. For the swap, you at least have evidence that it hasn't been used that recently, thus may never be used again. It would be sad to work yourself into a corner where lots of long lived processes are constantly cross-allocating.

Edit to add:

Looks like a good reference paper here: http://www.kernel.org/pub/linux/kernel/people/christoph/.../numamemory.pdf I've only skimmed it, but it sounds like 'page migration' is already in place.

I'm particularly interested in the idea of migration partly because it might help provide an answer to my recent StackOverflow question: http://stackoverflow.com/questions/3784434/inserting-pages-i...
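Page migration is also exposed to userspace, which might be relevant here. A hypothetical example, assuming the migratepages tool that ships with the numactl package (it wraps the migrate_pages(2) syscall; $PID is whatever process you want to move):

    # Move a process's pages from node 0 to node 1 without touching disk
    migratepages $PID 0 1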


That was odd. I must have submitted at the same time that the edit window ended, and it took me to a broken page. Anyway, the proper link is: http://www.kernel.org/pub/linux/kernel/people/christoph/pmig...


Hi,

I didn't feel that benchmarks were necessary in this case, since the result is clearly visible: either it does or does not swap under a given workload. We did run benchmarks, but only to verify that performance was nominal with and without the setting in place (swap behavior aside) and that it doesn't introduce some regression.

Regards,

Jeremy


The problem is that there is a latency hit required for a thread running on node 0 to access memory on node 1. Furthermore, this uses HyperTransport on AMD or QPI on Intel, which has limited bandwidth, so if you get too many off-node memory accesses, performance begins to suffer.

The real solution to this issue is for MySQL to become NUMA-aware and place threads, and the cached data blocks those threads access, more intelligently on nodes that have enough space. Other, more robust databases like Oracle already do this, having run on NUMA architectures for decades now.
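For what it's worth, you can already see how (un)evenly a process is placed from /proc. A rough sketch, assuming 4 KiB pages and mysqld's PID in $PID:

    # Sum the per-node page counts (the N<node>=<pages> tokens) in numa_maps
    grep -o 'N[0-9]*=[0-9]*' /proc/$PID/numa_maps \
      | awk -F= '{pages[$1] += $2}
                 END {for (n in pages) printf "%s: %.0f MiB\n", n, pages[n] * 4096 / 2^20}'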


> The problem is that there is a latency hit required for a thread running on node 0 to access memory on node 1. Furthermore, this uses HyperTransport on AMD or QPI on Intel, which has limited bandwidth, so if you get too many off-node memory accesses, performance begins to suffer.

Does it suffer more than hitting the disk?


No need to be snarky. You take a one-time large hit in performance to page out some data, as opposed to many small continual hits in performance going across the interconnect between nodes.

Which would you rather suffer? A 50 ms one-time hit to page some data out, or many thousands of 500 microsecond hits and interconnect saturation over time? The kernel engineers looked at these trade-offs and determined it was better to page out data. After all, the kernel does not know how long you'll need your data, and if it allowed memory to be allocated haphazardly all over a NUMA system, after many hours or days you could end up with a very slow running system where every other thread had to access memory on a different node.
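Taking those figures at face value, the break-even is simple arithmetic:

    # one 50 ms (50,000 us) page-out vs 500 us per remote access
    echo $((50000 / 500))   # => 100: past ~100 touches of that page, paging out was cheaper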

I find it rather puzzling that DBAs think they know more about how a kernel should page memory than a kernel developer like Linus Torvalds.

The answer seems clear - if your software relies on huge amounts of memory, make it NUMA aware. Oracle did this a long time ago and I don't see any strange swap activity on our 8-socket 48-core 128GB NUMA systems (AMD Opteron).


> You take a one time large hit in performance to page out some data as opposed to many small continual hits in performance going across the interconnect between nodes.

You misunderstand the OP. He's not saying "just put it on node 1 and access it from there." He's saying to swap it out to node 1, and then when it is needed on node 0 again, swap it back to node 0. That's certainly cheaper than swapping to disk and back.


I see, so he's essentially proposing memory-to-memory swap functionality as opposed to just memory-to-disk. It sounds like a workable solution, although it would require some engineering in the kernel paging algorithms. You'd also need to change the scheduler so that threads could be intelligently scheduled on the node where their memory is. It sounds doable, but it's a lot of work for kernel engineers that could instead be done by software that is NUMA-aware.


The author doesn't mention anything about load, e.g. queries per second. I wonder if the situation is: huge data set, low usage, and very fast response times required.

There are a host of problems with low-traffic stacks, including:

- Persistent database connections going away, so the first user who hits the system has to wait for the web server to reconnect to the DB.

- Cache warmup taking a long time, because it takes a long time to accumulate enough queries for the cache to figure out what needs to be cached.

- App server warmup taking a long time. Low traffic means it takes a while for all Apache children (or Ruby app servers, or whatever) to get a hit and compile code into memory so the next request is fast.

- And the author's [I'm assuming] problem: systems deciding that low usage means the data can be expired from cache.

I find the InnoDB buffer pool to be the best cache in the business on high-traffic sites using the author's configuration. [Random aside: I also use Redis and Memcached extensively in production.]


Hi,

In this case, the traffic level is somewhat irrelevant. However, swapping can be demonstrated at levels of a few thousand queries per second. Read the article: this has nothing to do with swapping due to non-use (which would logically be OK).

Regards,

Jeremy


What about using the large-pages (huge pages) mysql option: http://dev.mysql.com/doc/refman/5.0/en/large-page-support.ht...

From the kernel documentation @ http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage.... : "Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure."
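A minimal sketch of wiring that up (the page count is illustrative, size it to the buffer pool; the mysql group name and 2 MiB page size on x86-64 are assumptions):

    # Reserve 2 MiB hugepages; 2048 of them is about 4 GiB
    echo 2048 > /proc/sys/vm/nr_hugepages
    # Let mysqld's group use hugetlb shared memory
    echo $(getent group mysql | cut -d: -f3) > /proc/sys/vm/hugetlb_shm_group

    # then in my.cnf, under [mysqld]:
    #   large-pages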


Hi,

I think hugepages, etc., serve to treat the symptom of the problem (making mysqld itself unswappable) rather than the actual problem (the system needs memory on a certain node and there is none to be had). Making mysqld's memory unswappable just means that something else gets swapped, or in the worst case something gets OOM-killed or the allocation simply fails. Those situations could be worse.

Regards,

Jeremy


You would think MySQL types would simply "shard" their servers by starting a MySQL process on each board.

(This problem was fixed in Oracle in the '90s, originally for deployment on Sequent hardware.)
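A hypothetical two-node layout (ports and datadirs are made up), pinning each instance with numactl:

    # One mysqld per NUMA node, bound to that node's CPUs and memory
    numactl --cpunodebind=0 --membind=0 mysqld_safe --port=3306 --datadir=/data/node0 &
    numactl --cpunodebind=1 --membind=1 mysqld_safe --port=3307 --datadir=/data/node1 &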


That would probably not help much here; you'd then end up caching pretty much the same stuff on node 0 and 1, so it would be like having two 32 GB nodes instead of one big 64 GB node. They would both be maximally fast, within the 32 GB memory constraint, but they would both have to swap if you wanted to exceed 32 GB of cached information.

Real sharding - storing totally different stuff on the two shards - would probably give a really good performance improvement. But real sharding is much more of a pain than just starting two copies of mysqld.


I've worked on several servers of similar specs to this (64GB RAM, MySQL 5.0 or 5.1) and have never seen this issue.

All systems were running RHEL or CentOS, so perhaps Red Hat has fixed the problem.


What CPUs? As the article says, the NUMA characteristics show up with Opterons and Nehalems; older Intel chips didn't have them.


Does MariaDB (and other MySQL variants) suffer from this issue?


It's an operating system issue - specifically, Linux's handling of NUMA is generally broken.


I'm not sure I agree; the application can be written with an understanding of how best to utilise the memory architecture, but the OS must manage memory for the common case unless instructed otherwise. I'm not commenting on Linux's handling of NUMA - rather that applications shouldn't assume that a generic OS can provide an optimal hardware abstraction. All memory is equal - but some memory is more equal than others.


Yes, the OS must assume the common case. However, by the time one node has 90% of its memory in-use and the other node has 1% of its memory in use, it's clear that this is not the common case, and this should be handled more gracefully.


The article states that the default NUMA policy is to allocate memory on the same node the thread is going to run on. That's logical, considering non-local memory access is more costly than local memory access. The problem appears when one process requests more memory than a single node can provide. Can the Linux kernel detect this usage and adjust accordingly?
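There is at least one knob aimed at exactly this trade-off: vm.zone_reclaim_mode, which controls whether the kernel reclaims local pages before falling back to another node. A sketch (the default depends on the hardware's reported node distances):

    cat /proc/sys/vm/zone_reclaim_mode        # non-zero: reclaim locally before going off-node
    echo 0 > /proc/sys/vm/zone_reclaim_mode   # prefer off-node allocation over local reclaim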


There's a (superficially) similar problem in MS SQL Server/Windows Server. You can fix that by setting "lock pages in memory".
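The rough Linux analogue would be locking mysqld into memory with its memlock option, which needs the memlock rlimit raised first. A sketch (the user name and 'unlimited' are assumptions):

    # /etc/security/limits.conf
    #   mysql  soft  memlock  unlimited
    #   mysql  hard  memlock  unlimited

    # then in my.cnf, under [mysqld]:
    #   memlock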


sync; echo 3 > /proc/sys/vm/drop_caches  # 3 = drop the page cache plus dentries and inodes

drop_caches was originally added for benchmarking purposes, but I've found that running it every N minutes seems to help system responsiveness. (I've been unable to quantify this, unfortunately.)


This is a tremendously bad idea, but rather than clarify why, just read:

http://www.listware.net/201009/linux-kernel/48874-rfc-patch-...


The thread doesn't actually clarify why it's a bad idea, and in fact gives a counterexample.

For those who compare the before and after output of 'free', stop it. Yes, the numbers are (sometimes drastically) different. It doesn't matter. The kernel drops pages when it needs a page, and for the general case, this does work. But, as the linked LKML thread states, there may be a pathological case that you are not expected to hit (10 million+ files and 40 GB of RAM). For that specific use case, it did make sense.

The reason this is a bad idea is that dirty pages cannot be freed, which is why it is recommended to run 'sync' first. Unfortunately, on a busy system, pages will get dirty between the sync and the drop_caches, more so if you're doing it in a shell. Those dirty pages can then never be reclaimed (due to how drop_caches works, and because drop_caches is only intended for benchmark testing).

(Link to the same thread, but threaded: http://lkml.indiana.edu/hypermail//linux/kernel/1009.1/02943...)



