The author never made clear if they are measuring virtual memory usage, or physical memory usage. Having a lot of virtual memory does not "cost" much: it's just having permission to use a lot of memory if you so wish. It's possible to have huge chunks of your virtual address space with no physical memory backing them. Physical memory is the amount of physical RAM your process is using.
Memory managers tend to greedily request a lot of virtual memory because there's usually little harm in doing so, and the act of asking the kernel for more permission (read: more virtual memory) is slow.
Minor quibble: it's not accurate to call `malloc()` and family in glibc the "operating system's memory allocator." That is the memory allocator for C, and Ruby just so happens to be implemented in C. The glibc allocator will use the system calls `brk()`, `sbrk()` and/or `mmap()` to request memory from the kernel (http://man7.org/linux/man-pages/man2/brk.2.html; http://man7.org/linux/man-pages/man2/mmap.2.html). Nothing really changes per the punchline.
I wrote a bit about this, with some small GC.stat hacks to manage it a tad, in the part of this post where I talk about how Ruby has its own heap and manages garbage collection: [redacted]
Even fluentd had this issue: its default heap size was lower than what it needed to initialize itself, causing hiccups, until RUBY_GC_HEAP_GROWTH_FACTOR was tuned in the source code.
Ruby memory bloat is everywhere. Being familiar with GC.stat and being able to tune Ruby applications as you test them is a good habit to have if you develop or work with Ruby-based tools.
>it's not accurate to call `malloc()` and family in glibc the "operating system's memory allocator."
I watched an interview with Bryan Cantrill some time ago and one of the things he mentioned is that libc is considered part of the operating system for other unixes, and Linux is the only one that decided to redefine operating system to exclude libc.
I wonder why the ISO C standard is specifying who is responsible for implementing libc. That seems screwy at best.
Regardless, afaik OpenBSD and FreeBSD both ship libc as part of the operating system. It makes a lot of sense, really, because libc is bound to need to make syscalls, which differ strongly depending on the kernel, whereas the user interface of libc is pretty much the same across any operating system.
You mean like devices without MMUs? I think that might be the case but I'm not sure. Under Linux you need a special kernel and libc to handle such devices IIRC.
- afaik freestanding C specifies no standard library at all (except for a few pure headers), so bare metal targets providing libc are going beyond the standard.
- I'm just saying it doesn't need to specify who implements the standard library. I've read the standard before and I don't really recall this coming up, I'm just assuming this assertion is accurate.
(And of course, the overwhelming reality is that libc is provided by none of the compilers except Microsoft C, since Clang and GCC tend to be used with glibc.)
Even then, the comments regarding "freestanding" C runtimes are the same. Targeting bare metal puts the standard library squarely out of scope of the standard.
> Having a lot of virtual memory does not "cost" much
To some extent, I agree, at least on 64 bit systems (on 32 bit systems, there was the risk of running out of address space).
However, the memory in question here is, for the most part, not freshly allocated, but was in use once. This means that it used to be backed by physical memory, and before that backing can be withdrawn, the page has to be written to disk.
It seems to me that the best solution would be to call madvise(MADV_FREE) on these regions, in which case they can be unbacked without further ado. I'm somewhat surprised that the memory allocator does not do this itself already.
> However, the memory in question here is, for the most part, not freshly allocated, but was in use once.
Because the author does not differentiate between virtual and physical memory, I can’t agree with that. It’s quite possible most of the memory “freed” was never in use. In which case, there’s not much benefit. And there is the probable downside that allocation heavy applications will pay a lot more.
> It’s quite possible most of the memory “freed” was never in use.
Given the allocation patterns shown, it would seem to me that if two blocks are still in use, it's fairly likely that the region in between was also in use once (there may be exceptions due to pools etc., but generally memory is parcelled out in a linear fashion).
> And there is the probable downside that allocation heavy applications will pay a lot more.
What would the cost be? All that would happen is that the free pages are marked as clean. I'm sure that's not entirely free, but bound to be considerably cheaper than paging the page out and in again.
> What would the cost be? All that would happen is that the free pages are marked as clean.
That's a minor page fault. You get an OS-level exception, switch to kernel mode, process the page fault by marking the page as loaded, then switch back to user mode. That is expensive if you do it a lot.
> but bound to be considerably cheaper than paging the page out and in again.
Yes, of course, a minor page fault is cheaper than a major page fault. But both are more expensive than no page fault, which is what happens if you just never free the page to the OS and there's plenty of available physical memory.
> I measured RSS with swap disabled, so physical memory usage.
That sounds correct to me. You could get more details from /proc/<pid>/smaps.
There are a couple of other issues I found confusing in your text. If you intentionally make simplifications, I would recommend adding at least a footnote to indicate so.
Linux has only one heap. But while the heap is declared to be of a certain maximum size using brk(), that does not mean that all of its pages are really in RAM (as before, swapping aside). A page only really gets allocated when it is accessed for the first time. And it can get deallocated again using madvise() if the program knows that it is no longer needed. malloc_trim() calls madvise(). So you are correct, it does not only move the top of the heap; probably it was that way many years ago. So in general the Linux heap can be sparsely mapped to RAM. But if you use RSS, your measurement takes that into account already.
The arenas are a feature of glibc. They are documented (to some degree) in the man pages, so I would not call them magic. Also malloc_info() prints how they are used. It appears to me that arena 0 is on the Linux heap. If a program is multithreaded, glibc will call mmap() to create an additional memory area to be used for the additional arenas.
While avoiding mutex contention might be a good thing if the threads are allocating many small blocks, the overhead for arenas might not be justified for rarely allocating bigger areas, as the Ruby heap implementation probably does. So limiting the number of arenas might indeed be beneficial; it's just a typical trade-off one needs to understand. Not that that is easy, but magic is the wrong description IMHO.
Let's call the glibc functionality above the system allocator (I'm not sure whether this is the correct name or whether it even has an official name). However, glibc has yet another, completely different allocator: the mmap allocator. For allocations bigger than MMAP_THRESHOLD, glibc will not use the heap or any of the additional arenas, but will directly make a completely new memory mapping from the Linux kernel. Again, these will be lazily/sparsely allocated to RAM. The mmap allocator does not use arenas at all.
I am not a regular Ruby user, so I don't know whether the Ruby heap uses glibc's system allocator or the mmap allocator or what mix of both. (Mixing them is transparent to a program.) However, from your description I would guess that it mostly uses the system allocator, otherwise arenas would not make any difference, and AFAIK malloc_trim() has no effect on the mmap allocator at all. Have you considered either changing the setting of MMAP_THRESHOLD or the Ruby interpreter code so that the Ruby heap uses only the mmap allocator?
It appears to me that having the Ruby heap management on top of the glibc system allocator is just not a good idea. The system allocator tries to be a good compromise for a widely varying spectrum of applications. But the Ruby heap management is one very specific case, partially duplicating the work that the glibc allocator does. I'd guess having the Ruby heap running closer to the operating system should improve things. The mmap allocator of glibc might be an easy way to achieve that. Otherwise the Ruby heap should use mmap() directly, because it should know best how it wants to use the memory. But that would be much more difficult to implement, of course.
P.S. Your visualizer looks impressive. Unfortunately I did not have time to study it in detail. Did you make sure that the code does never access a page that has been mapped but never accessed by Ruby? Because that would dirty the page and increase the RSS.
> Did you make sure that the code does never access a page that has been mapped but never accessed by Ruby? Because that would dirty the page and increase the RSS.
Dirtying only happens upon write. My visualizer never writes, only reads.
> The author never made clear if they are measuring virtual memory usage, or physical memory usage. Having a lot of virtual memory does not "cost" much: it's just having permission to use a lot of memory if you so wish. It's possible to have huge chunks of your virtual address space with no physical memory backing them. Physical memory is the amount of physical RAM your process is using.
Yes, that remains very unclear. Only physical memory is interesting in the end; address space exhaustion is not an issue on most systems today, and large pages are typically not used. The graphs have "virtual", "dirty" and "clean" in them.
"clean" normally means that the page in RAM is just a copy of a mass storage. If the kernel runs short of memory, it can use this page for other purposes. No harm done, except that the system gets slower when it needs to page in the same page later.
"dirty" normally means that the page in RAM is not backed by mass storage. The kernel must keep it reserved for the current purpose in order not to lose data.
If we assume that author's application does not do swapping (it really shouldn't for 230 MB, otherwise the machine is seriously overloaded and really slow) all heap in use is always dirty. The amount of dirty equals usage of physical RAM.
I am not sure which Linux tool that can easily report the use of physical RAM for the heap. Would RssAnon from /proc/self/status be a good approximation? It certainly contains the stack, too. But that should not grow a lot unless you have infinite recursion.
Here are some benchmarks on an 8-year-old, 4-core i7 running OS X Sierra. Parsing a 115 MB log file for lines containing a 15-character word (regex: \b\w{15}\b) we have:
*These figures are for runtime, i.e. with startup time deducted. Ruby's startup time (0.55s) is much longer than the other languages' (Python: 0.06s, Perl: 0.02s, PHP: 0.14s).
Ruby's memory usage is 3.5 times that of Python for only a 30% speed gain. Perl uses 1/15 of the RAM used by Ruby and is 35% faster. It could be argued that Perl 5's lack of built-in OO accounts for some of this... until you look at PHP, which has built-in OOP and uses 2/5 of the RAM used by Ruby whilst performing 3.2 times as fast.
I love that Ruby is designed for programmer happiness, but the shine starts to wear off when you look at its memory usage. Slow is bearable, as it's only marginal, but Ruby's memory usage seems orders of magnitude higher. Matz's goal of making Ruby 3 times faster is only half the battle, maybe even only a third. If an increase in speed comes at the expense of even greater memory use, then Ruby will not survive.
If Ruby was performant enough to disrupt the web app industry ten years ago, it is performant enough today. Ruby's survival is not dependent on trivial fluctuations in benchmark numbers.
Ruby is just as shiny as it was in 2006, if not more so. If you feel happy writing log parsers in JavaScript or PHP well then all power to you, but I'd rather chew on a shoe.
I do my data processing in Ruby, if there's a performance issue I'll fork out to a little Go widget. But that's when I need a factor ten improvement or more.
Could very well be, I was a young developer when the Rails hype happened, to me it was disrupting clumsy Java and .Net web architectures, and the ugly ad hoc PHP methodology. I have never heard of AOLServer or Zope, no one ever told me about them.
The problem for Ruby on Rails is that a lot of its good ideas were eventually borrowed while it was encumbered by Ruby's performance. No one develops web apps the way they did in Java when Rails came out. A lot of good ideas from Ruby and RoR became mainstream and popular, but done on faster platforms.
I was trying to create a level playing field as PHP and Python don't lend themselves to one-liners as well as Ruby and Perl.
If blessing a hashref is real OO why has so much energy been expended on Moose and its offspring? Before I left, back in 2013, there was also much gnashing of teeth over whether Perl needed a MOP so I don't think everyone agreed that bless() is enough.
"If blessing a hashref is real OO why has so much energy been expended on Moose and its offspring?"
That's a good question. It's always mystified me. A little boilerplate doesn't bother me. As far as I can tell, it's because people didn't like writing the constructor boilerplate and getters/setters. Maybe for the "isa" pretend types?
In classic glibc form, malloc_trim has been freeing OS pages in the middle of the heap since 2007, but this is documented nowhere. Even the function comment in the source code itself is inaccurate.
I'm used to a random comment on stack overflow being the source of truth for angular, django and the like. But this is the first time I've seen it for glibc!
The reason glibc malloc doesn't like to free random pages in the middle of a mapping is probably that this inflates the number of PTEs needed to describe it. Say you have one mapping and free a page in the middle of it: now you need two PTEs to describe that. Similarly, the mapping shown in the last image probably requires a few dozen PTEs to accommodate the holes.
That isn't free (it requires cache & TLB space), but it's entirely possible that Ruby is slow enough on the interpreter and data model level for this to not matter much.
Edit: Turns out malloc_trim doesn't actually modify the mapping but rather uses madvise(MADV_DONTNEED), so a higher address resolution cost probably only materializes under memory pressure.
It really makes sense that something like this would be the case.
All the experts say "oh, Ruby uses lots of memory for [reason] and it can't really be fixed", so no one even tries.
Until someone comes along who is either motivated, smart, or ignorant(!) enough to try to fix it anyway, and finds that the commonly accepted answer was wrong.
This happens all the time, especially in science. Trust, but verify, I suppose.
In the context of Rails applications hosted on EC2, then I've not found Ruby's memory usage to really be an issue. In my experience most Rails apps range between 150-500MB per instance.
My current employer typically uses M5 instances which have a ratio of 1 vCPU : 4 GiB Ram.
Running Unicorn you'll probably only want 1.5 instances per vCPU. Even a memory heavy Rails app is probably only going to utilise ~20% of the available memory.
Running threaded Puma, you probably want only a single process per vCPU and maybe 5-6 threads. In my apps running 5 threads per process typically increases memory of the process by 20%. So in that instance you'd only utilise 15% of the available memory on a M5 instance.
If you are having memory issues on Rails, then a quick win is upgrading your Ruby version. I saw a 5-10% drop in memory usage with each major version: 2.3.x -> 2.4.x -> 2.5.x.
Also if it is an old app, check you've not built up cruft in your Gemfile. Removing unused gems can be another quick win for reducing memory usage.
I know that a Ruby shop naturally wants to use Ruby for everything, but when the job described is:
> a simple multithreaded HTTP proxy server written in Ruby (which serves our DEB and RPM packages)
then I would reach for Linux ipvs or haproxy, and apache or nginx to do the serving. Good tools already exist for these things, it's a shame not to use them.
(And we have a Ruby dev group, so please don't accuse us of having a phobia or hatred of Ruby.)
This is a pretty old piece of software; it might predate nginx. Or if it doesn't, it probably does some non-trivial URL rewriting or other logic that would be a pain to do in nginx.
In any case it's just something thrown together to solve a need quickly and effectively. The performance characteristics might not even have been an issue, they just caught his eye.
Writing an http proxy is 5 lines in node.js, and not much more in Ruby. I could do it in either without even consulting a reference. I spent hours learning nginx configuration, and could spend hours more. If all I need is a simple proxy with some logic why do it? It's not reinventing the wheel, it's building a wheel that's good enough, using the materials you got.
I recently spent weeks investigating a very similar issue in a Haskell program, diving deep into its memory manager and glibc's malloc.c (where I found multiple bugs, showing that much code in there appears never to have gotten proper review in decades).
Writing a memory visualiser is exactly what I needed and planned to do next, so this is a great contribution for anybody working on problems like this.
This was a really interesting read. It makes me wonder though - are other languages affected by this? I haven't heard any similar reports from, say, Java or Python.
It's a long known problem with anything using 'real' threads and glibc malloc (to some extent, any malloc really). As an example and contrary to wozer's comment, Java is affected by this: https://github.com/prestodb/presto/issues/8993
Excellent post! I didn’t quite understand why MALLOC_ARENA_MAX=2 outperforms this solution (slightly) and what exactly the trade-offs are ... can anyone shed more light on this?
I'd have to look more deeply myself, but it likely means larger allocations up front, to avoid having to call out to the OS for smaller allocations later. So if you know you're going to use the larger allocations anyway, it's probably almost no trade-off; but if you aren't sure, you'll probably use more than you needed.
It doesn't mean larger allocations, it means more allocations. glibc will try to use different arenas for different OS threads to avoid lock contention; if you allow a lot of arenas, then malloc will allocate a lot of chunks of memory, assuming you have lots of threads requesting memory.
When you call GC.start you are forcing a round of garbage collection. GC.start is slow to run, but afterwards gives you some runtime advantages, though only for a while, until the heap fills up again. Instead of manually forcing garbage collection, you can tune RUBY_GC_HEAP_GROWTH_FACTOR (checking GC.stat after initializing and load testing a few times) so that enough space is allocated for the system to run its own processes. The goal is to neither initialize unutilized memory nor end up with too small a heap, which would trigger too many garbage collection runs; both extremes require expensively slow kernel system calls.