> When we free memory, we should make sure that if the block we return to the free list is next to any other free blocks, we combine them together. This is called "coalescing."
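For reference, a minimal sketch of what coalescing adjacent free blocks can look like, assuming an illustrative free-list node layout rather than any particular allocator's:

    #include <stddef.h>

    /* Illustrative free-list node: each free block records its size and the
     * next free block, kept sorted by address so neighbours are easy to spot. */
    typedef struct free_block {
        size_t size;                  /* size of this free block, header included */
        struct free_block *next;      /* next free block, in address order */
    } free_block;

    /* Coalesce: if a free block ends exactly where the next free block
     * starts, merge the two into one bigger block. */
    void coalesce(free_block *head) {
        for (free_block *cur = head; cur && cur->next; ) {
            char *end = (char *)cur + cur->size;
            if (end == (char *)cur->next) {
                cur->size += cur->next->size;   /* absorb the neighbour */
                cur->next = cur->next->next;
            } else {
                cur = cur->next;                /* not adjacent, move on */
            }
        }
    }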
A little offtopic but the default Delphi 7 memory allocator did this, except that it also merged blocks that it obtained from different OS allocation calls.
This worked fine for regular usage, but if that memory was ever used for Bitmaps for UI stuff, it wouldn't work: Since Windows does some of its UI stuff in kernel mode, before doing anything Windows would attempt to lock the entire allocation's VAD entry to prevent you from messing with it in another thread while it was using it. If the Bitmap you were working on happened to belong to two different OS-level allocations, this lock function would fail since it wasn't meant to handle that case.
This would lead to random, extremely cryptic errors such as ERROR_NO_SCROLLBARS "The window does not have scroll bars." since the lock call was deep in the stack and the callers replaced the error with another one as it bubbled up.
In the end we had to replace the allocator to avoid that issue. That was a fun day of debugging.
Yes and no. It took me from 8am to 3am once we decided it needed to get fixed, but really it sat in the app for years. It only happened in a background process that sent print jobs on a timer; since that process used Windows GDI to compose the image we sent to the printer, it was affected (our "frontend" should've been affected too but never was, I guess because it had a different memory usage pattern).
We just had it restart itself and try again whenever it got one of those errors when printing but eventually we wanted to add a feature that required the process to not die, and by that time I was already 99% sure that it wasn't something in our code and I had already ruled out threading issues.
I ended up putting it in a VM with a kernel debugger attached and having a script make a snapshot and make it print over and over until it errored, then following along in IDA until I saw what was going on.
Having a way to trigger it (by restoring the snapshot) on demand helped a lot, otherwise it would have taken forever to make sense of it as it could sit without crashing for nearly an hour.
In Hyper-V it's fairly easy. You make a virtual serial port ("COMPort"), set the bootloader to enable kernel debugging over serial, then connect to the virtual serial port from the host via a named pipe.
Kernel debugging over serial should be possible in vSphere, but Ethernet is easier to set up:
1. Make sure at least one virtual NIC on the target VM (with IP connectivity to the debug host machine/VM) is on Microsoft's NIC whitelist[1]. I use e1000e; note that vmxnet3 is not on the list.
2. Follow Microsoft's directions[2] to connect.
I can confirm this works on vSphere[3] and there's no reason it shouldn't also work on VMware Workstation, Player, and Fusion.
[3] Tested last week with a Windows Server 2022 target (e1000e virtual NIC) and Windows 10 debug host (vmxnet3 virtual NIC), both running on ESXi 8.0 Update 1 VM hosts.
Hyper-V wasn't available until Windows Server 2008. I know you could do virtual serial ports w/ VMware GSX and ESX (and later ESXi) forwarded to real hardware serial ports on the host.
If I recall correctly the kernel part of things would return an out of memory error which the user mode DLL translated to that weird error (sometimes, other times it would just say "out of system resources", it depended on what widget the bitmap that overlapped the two memory regions belonged to).
Until Windows 95, Windows was essentially just a DOS application that grabbed the framebuffer and ran an event loop where it drew "controls" (which included windows, buttons, text views, and yes, scrollbars.) That was the whole point of it. It wasn't an "OS" per se; DOS was the OS. Windows was what a Linux-head would think of as a combination of an X server and window manager. And Windows loaded your "application" as essentially a DLL, with the Windows global event loop calling into your application's event-loop delegate handler (WndProc) whenever it had an interesting event that your application might like to react to.
(Your application wasn't even a "process" per se. Until Windows 95, everything was just happening in one shared address space, in real mode. In fact, it was only in Windows 3.1 where user applications stopped running in ring 0!)
If you think about it, this "the kernel is a game engine and your application is the game" approach isn't necessarily a bad design... for a single-tasking OS's library exokernel, like the Wii's https://wiibrew.org/wiki/IOS.
But, of course, Windows claimed to be a multitasking OS. It actually wasn't! And I don't mean the obvious thing about it not having pre-emption. Lots of multitasking OSes didn't have pre-emption.
No, what I mean is that the concurrency primitive for the cooperative scheduling wasn't the "task" (i.e. the process or thread. Which, again, there weren't any of.) Instead, the concurrency primitive was the window. Until Windows 95, Windows was a multi-windowing OS.
Each control was owned by a window. Each window had a WndProc. If your Windows executable (i.e. application delegate module) set up two windows, then each window participated separately in the global Windows event loop, up-to-and-including things like having its own set of loaded fonts, its own clipboard state, and its own interned strings table. In OOP terms†, your application was just a dead "class object", running no logic of its own save for one-time load and unload hooks. It was the windows themselves that were the "instances" of your class.
This might make you realize why MDI (or Multiple Document Interface, where there are multiple small per-document "windows" inside one big window) was so popular back then. The MDI "windows" weren't actually windows — they didn't have their own WndProc. They were just controls, like a tab view is a control. Only the big container window was a real window, and so all the resources within that big window were shared between all the virtual windows. MDI was a memory-saving trick!
---
† The actual more interesting analogy is that Windows was essentially a (single-threaded, cooperatively-scheduled) actor system, where windows were the actors. There is a very close parallel between (Windows 3.1 executables, Windows 3.1 windows) and (Erlang modules, Erlang processes).
> This might make you realize why MDI (or Multiple Document Interface, where there are multiple small per-document "windows" inside one big window) was so popular back then. The MDI "windows" weren't actually windows — they didn't have their own WndProc. They were just controls, like a tab view is a control. Only the big container window was a real window, and so all the resources within that big window were shared between all the virtual windows. MDI was a memory-saving trick!
MDI may have saved some memory - I can't say one way or the other on that - but the mechanism you describe is incorrect.
Every MDI child window was a window of its own with its own WndProc. Every control inside those windows was also a window with its own WndProc. Every dialog box was also - yes - a window with its own WndProc.
You wouldn't always be aware of the WndProc in your code, but it was there.
If you ran WinSight or Spy++, you could see the entire window hierarchy including all the child windows, child control windows, and so on.
Later on, a few applications implemented "windowless controls" to save memory, but this was uncommon, especially in the early days. For example, there was an optional windowless version of the Rich Edit control:
Fun fact: an early informal name for MDI was "Mac in a box", because if you maximized the top level window, you had a somewhat Mac-like environment, with one menu bar at the top that was shared by the child windows.
Source: I was the author of WinSight and the VBX (Visual Basic eXtension) API.
Interesting; through the fog of time, I may have misremembered some "tip" I was given for Windows 3.1 programming efficiency, as being a more definitive statement about the internal structure of Windows than it really was.
That tip, as I recall it, was that the developer should minimize the number of top-level windows they create, because each top-level window in the system gets opted automatically into various things that regular controls don't (including having a bunch of default child controls that probably keep a lot of state and pump a lot of messages.) But MDI child windows don't have the same window-class as top-level windows, and are instead pruned down to be much "lighter", both in doing things like:
- not having some child controls (e.g. a menu bar) by default (you're supposed to attach the menu bar to the MDI frame instead, and change it to reflect the menu of the active child — as you say, it's "a Mac in a box" behavior);
- implementing behaviors that seem like their own child controls, but which are actually parts of the MDI child's own geometry, to reduce the number of event loops that need to be pumped. (I believe an MDI child's default-styled title bar might be this way, only becoming a full-on child control tree if the MDI child is given non-default styles.)
- "Sharing" child controls, where the control (I think an MDI child window's caption buttons might be this way?)
- routing messages to only the active child, with the others essentially frozen except when they need to redraw
Digging around, evidence to corroborate this is pretty scant, but there is some:
- https://jeffpar.github.io/kbarchive/kb/095/Q95578/ is a bug that is evidence of the MDI child window class having most of its behaviors programmed separately, rather than there being any code reuse between top-level windows and MDI child windows
Thanks for the interesting discussion. Yes, the fog of time affects us all.
Funny you mention caption buttons (min/max/close/etc.) and menus and such. I meant to add a "Fun fact #2" in my first comment. So here we go...
In traditional Windows applications, none of those are child windows (or child controls, same thing). They are all part of the "non-client area". You may recall there being a whole series of WM_NC... messages like WM_NCMOUSEMOVE and WM_NCPAINT, with the same names as your usual messages except for the NC prefix. Your WndProc would receive all of these messages, but generally would just pass them along to DefWindowProc which handled this "non-client" activity.
Now here is the fun fact. OS/2 Presentation Manager took a different and cleaner approach. It removed the "non-client area" concept entirely, along with all those messages. Instead, all of those window doo-dads were child windows. The minimize button was a child window. Your "client area" was a child window. And so on. It was all child windows.
On this point:
> implementing behaviors that seem like their own child controls, but which are actually parts of the MDI child's own geometry, to reduce the number of event loops that need to be pumped.
To be clear, a typical application had only one event loop. It was your classic GetMessage/TranslateMessage/DispatchMessage loop. DispatchMessage would pass the message off to whatever WndProc it should be directed to.
> Until Windows 95, everything was just happening in one shared address space, in real mode. In fact, it was only in Windows 3.1 where user applications stopped running in ring 0!
Windows 3.0 and its predecessors ran on processors which had no concept of "ring 0", so that should not be surprising at all...
> Your application wasn't even a "process" per se
I think this is a bit of a "modernistic", "win32y" view of the definition of a process. Surely there are processes/tasks in 3.x -- you can launch multiple instances of a module, each of them can allocate memory separately, each of them has different resources/handles, and the OS will clean up for you once each instance terminates (cleanly). Obviously, without any type of virtual address space or memory protection, any such process can write to and destroy the memory of any other process, but they are still processes. The existence of Yield/DirectedYield, which do not take/need a window, is also a hint of that. (Note there are no threads in 3.x.)
Many platforms (that either predate VM or decide not to use VM for e.g. power usage concerns) worked like this. MacOS, Windows CE, PalmOS, etc.
> you can launch multiple instances of a module, each of them can allocate memory separately
I don't think this is true, though? You're not getting separate processes; you're just getting separate hInstances. Which don't map cleanly to a process-like abstraction.
Consider: while you can (in theory) spawn multiple copies of a Win16 executable, with each spawn getting its own hInstance and therefore its own local heap, this isn't an inherent part of Win16's architecture, but rather is something specific to its handling of spawning executables. DLLs weren't included in this handling, and so only get one hInstance+heap each. This means that if you load and call into the same DLL from two separate actively-running Win16 programs, then that DLL must juggle any state it wants to keep for its N active callers on its own single heap. A function call crossing semantically-distinct memory protection domains, without any kind of IPC serialization, is not very "process"-y. It's more of an object system, like the JVM.
(In both Windows and the JVM: each "object" has its own private heap, and a module wrapping access to that heap; and some "objects" additionally have their own concurrent execution thread. But a given execution thread isn't "bottled up in" a particular heap; it just has a "home" heap. If an execution thread calls another object's API, then that execution thread, through that API, manipulates that other object's heap. There's no concept of IPC — of manipulations of other objects' heaps requiring you to ask the other object whose execution thread owns that heap to do the manipulation for you.)
> The existence of Yield/DirectedYield , which do not take/need a window, is also a hint of that.
DirectedYield is an attempt to make Windows tasks seem to act like "processes"... in the Communicating Sequential Processes sense, at least.
But it doesn't really accomplish this. From the Windows 3.1 API Reference Guide:
> If an application uses DirectedYield for a task with no events scheduled, the task will not be executed. Instead, Windows searches the task queue. In some cases, however, you may want the application to force a specific task to be scheduled. The application can do this by calling the PostAppMessage function, specifying a WM_NULL message identifier. Then, when the application calls DirectedYield, the scheduler will run the task regardless of the task's event status.
In other words, what's really acting as CSPs here, are the windows. :)
> That was the whole point of it. It wasn't an "OS" per se; DOS was the OS. Windows was what a Linux-head would think of as a combination of an X server and window manager.
This seems like a very post-3.0 (i.e. 386-only) view of things—the 8086 and 286 versions of Windows were also fairly advanced memory allocator/overlay manager hybrids.
They parcelled out memory, compacted it to avoid external fragmentation (take that, dlmalloc!), expelled pieces that could be read back from executables, and discarded segments that the application programmer said could be recovered, as necessary. (Yet they couldn’t actually swap mutable memory, as far as I can tell. Why?) For data, they required your cooperation by only revealing addresses between *Lock and *Unlock calls and requiring you to store handle+offset pairs otherwise; for code, the 8086 kernel would reach into your stack and walk the frame pointer chains in order to redirect return addresses to swap-in thunks. (Maintaining LRU lists along the way in either case.) Things became better on the 286 when it could just hand out segment selectors and arrange for accesses to fault as required, but this is the problem statement that Windows 1.0, running on the 8086, set out for itself.
Now, none of this is impossibly difficult (although I shudder at the thought of doing it without good debugging tools), but it feels pretty damn OS-like to me. You might argue there isn’t much virtualization of hardware going on—except for RAM, CPU, display, keyboard, and mouse—but I’d say there is at least as much of it in Win16 as there is in DOS.
One would hope things would get easier on the 386. And then one gets to DPMI and VDMs and still wants to support 16-bit drivers that hooked BIOS calls as though that would help and now the system cannot interact with the user while it’s formatting a floppy[1].
> [T]he concurrency primitive wasn’t the “task” [but instead] the window. Until Windows 95, Windows was a multi-windowing OS. [... E]ach window participated separately in the global Windows event loop, up-to-and-including things like having its own set of loaded fonts, its own clipboard state, and its own interned strings table. In OOP terms†, your application was just a dead "class object", running no logic of its own save for one-time load and unload hooks.
That... doesn’t sound correct on the implementation level. If you’re an instance of a Win16 executable module, you own a copy of your executable image’s mutable data, you own your memory allocations (that are not marked GMEM_SHARE), you own a stack, you get all (“posted”, i.e. asynchronous) messages for all windows you (or your library dependencies) created and you ask the system to munge and route them to the message dispatch routines of the windows you created—or not, if you don’t want to.
Now, the overall effect is very much like what you described, and often it may feel that you could as well throw out all those tedious GetMessage-TranslateMessage-DispatchMessage loops and replace them with a standard implementation of a vat[2]. Then a puny message handler decides to hijack the whole thing and go into a modal dialog loop. And damned if I could describe what that means in terms of an object model.
(What would you say about Symbian? Now there’s a system that throws out processes and only runs objects. Except it doesn’t run inside of a DOS or anything; there’s a kernel and on top of that there are objects. Boom. ... I think?)
> The actual more interesting analogy is that Windows was essentially a (single-threaded, cooperatively-scheduled) actor system, where windows were the actors.
Yeah, that I’ll wholeheartedly agree with. Nevermind the drawing part, it even has separate synchronous sends (SendMessage) and asynchronous ones (PostMessage)! Of course, unlike E’s[3], these have completely insane interactions with the concurrency parts, especially once you get to Win32.
The article claims that an allocator that splits memory based on allocation size is called a "buddy allocator". That's misleading: an allocator that allocates an area for each size class is usually called a "slab allocator", while a "buddy allocator" is one that when needed subdivides a memory area with a power of two size into two half-sized areas that are "buddies", does so recursively to satisfy allocations, and coalesces them again when they are free.
E.g. the Linux kernel used (not sure if it's still like this) a buddy allocator to allocate pages and power-of-two blocks of pages and slab allocators to subdivide those pages and allocate data structures.
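To make the distinction concrete, here's a toy buddy allocator over a fixed 4 KiB pool (an illustrative sketch, not how Linux or any production allocator implements it): block sizes are powers of two, a block's buddy is its offset XOR its size, splitting happens on allocation and merging on free.

    /* Minimal buddy-allocator sketch over a 4 KiB pool. The caller passes the
     * allocation size back to buddy_free; real buddy systems usually record it. */
    #include <stdio.h>
    #include <stddef.h>

    #define MIN_ORDER 4                        /* smallest block = 16 bytes  */
    #define MAX_ORDER 12                       /* whole pool    = 4096 bytes */
    #define POOL_SIZE ((size_t)1 << MAX_ORDER)

    static unsigned char pool[POOL_SIZE];
    /* free_list[o] holds offsets of free blocks of size 1<<o */
    static size_t free_list[MAX_ORDER + 1][POOL_SIZE >> MIN_ORDER];
    static size_t free_count[MAX_ORDER + 1];

    static void push_free(int order, size_t off) {
        free_list[order][free_count[order]++] = off;
    }

    static int pop_buddy(int order, size_t buddy) {   /* remove buddy if it's free */
        for (size_t i = 0; i < free_count[order]; i++)
            if (free_list[order][i] == buddy) {
                free_list[order][i] = free_list[order][--free_count[order]];
                return 1;
            }
        return 0;
    }

    static int order_for(size_t size) {
        int order = MIN_ORDER;
        while (((size_t)1 << order) < size) order++;
        return order;
    }

    void *buddy_alloc(size_t size) {
        int want = order_for(size);
        int o = want;
        while (o <= MAX_ORDER && free_count[o] == 0) o++;   /* find a big-enough block */
        if (o > MAX_ORDER) return NULL;
        size_t off = free_list[o][--free_count[o]];
        while (o > want) {                                  /* split down, freeing the upper halves */
            o--;
            push_free(o, off + ((size_t)1 << o));
        }
        return pool + off;
    }

    void buddy_free(void *p, size_t size) {
        size_t off = (size_t)((unsigned char *)p - pool);
        int o = order_for(size);
        while (o < MAX_ORDER && pop_buddy(o, off ^ ((size_t)1 << o))) {
            off &= ~((size_t)1 << o);                       /* merged block starts at the lower buddy */
            o++;
        }
        push_free(o, off);
    }

    int main(void) {
        push_free(MAX_ORDER, 0);                            /* the whole pool starts free */
        void *a = buddy_alloc(100);                         /* gets a 128-byte block */
        void *b = buddy_alloc(30);                          /* gets a 32-byte block  */
        buddy_free(a, 100);
        buddy_free(b, 30);
        printf("pool fully coalesced: %s\n",
               free_count[MAX_ORDER] == 1 ? "yes" : "no");
        return 0;
    }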
Another thing that the article doesn't mention that is important is that most production allocators make use of thread-local storage and either have per-thread caches of free blocks or sometimes whole per-thread memory regions. This is to reduce lock contention and provide memory that is more likely to be in the current core's cache.
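A rough illustration of the per-thread caching idea (hypothetical names and layout, not any specific allocator): frees land on a small thread-local stack for one size class, allocations try that stack first, and only misses or overflows touch the shared, locked backend.

    /* Sketch of a per-thread free-block cache for one size class (64-byte blocks).
     * The locked malloc/free backend stands in for a central free list. */
    #include <pthread.h>
    #include <stdlib.h>

    #define BLOCK_SIZE  64
    #define CACHE_SLOTS 32

    static _Thread_local void *cache[CACHE_SLOTS];
    static _Thread_local int   cached = 0;

    static pthread_mutex_t backend_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *backend_alloc(void) {
        pthread_mutex_lock(&backend_lock);
        void *p = malloc(BLOCK_SIZE);
        pthread_mutex_unlock(&backend_lock);
        return p;
    }

    static void backend_free(void *p) {
        pthread_mutex_lock(&backend_lock);
        free(p);
        pthread_mutex_unlock(&backend_lock);
    }

    void *cached_alloc(void) {
        if (cached > 0)
            return cache[--cached];   /* fast path: no lock, likely warm in this core's cache */
        return backend_alloc();
    }

    void cached_free(void *p) {
        if (cached < CACHE_SLOTS) {
            cache[cached++] = p;      /* fast path: keep the block for this thread */
            return;
        }
        backend_free(p);              /* cache full: hand it back to the shared pool */
    }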
I had originally written about threading and locality but it made the post too long and complicated, so I cut it out for the final draft. You can see remnants of it if you check the HTML comments in the post :D
Linux had a bitmap-based "buddy allocator" (power of two); now it is not bitmap based anymore (the complexity wasn't worth it performance-wise, so simplicity was restored).
Then Linux has the various slab allocators (slub/slob/slab), built on top of the "buddy allocator".
User-level code should use mmap-ed regions that are not virtual-address stable (slabs + offsets). Legacy "libc" services were built as virtual-address-stable services... which are kind of expensive to manage on a modern paginated system. Virtual-address-stable regions should be kept to a minimum (that horrible ELF static TLS). There is a workaround though (but Linux's default overcommit policy could kill your process): a user process would query the amount of RAM on the system and mmap a region of roughly (care of the overcommit policy) the same size, only once per process lifetime. Then you could have a virtual-address-stable region which could use most if not all the available RAM (excluding hot-memory addition...). It should be very easy to manage with lists.
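A rough sketch of the reservation trick described above (Linux-specific, and subject to the same overcommit caveat): query physical RAM once, reserve roughly that much address space with one mmap, and let pages fault in only as they are touched.

    /* Reserve one large, address-stable region up front (Linux). Relies on
     * overcommit: the mapping reserves address space, and physical pages are
     * only faulted in as they are touched. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long pages     = sysconf(_SC_PHYS_PAGES);
        long page_size = sysconf(_SC_PAGE_SIZE);
        size_t reserve = (size_t)pages * (size_t)page_size;

        void *region = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("reserved %zu bytes at %p\n", reserve, region);
        /* sub-allocate into `region` with stable addresses from here on */
        munmap(region, reserve);
        return 0;
    }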
This is absolute gold. When I use things like this, I am reminded how powerful digital learning can be. So much more capable than just text or video. But very little content is this well put together. Probably because it is so time intensive.
I'm inspired by Bartosz Ciechanowski and Julia Evans. The web is such a powerful toolbox. So many concepts are more complex than text alone can hope to explain. Those two are so creative and full of energy.
And you're right, it's incredibly time intensive to put these together. Thousands of lines of code, plus the text content, plus reaching out to domain experts for reviews (shout out Chris Down, kernel wizard extraordinaire).
Such great work. I learned things and have a clearer understanding that I think I will come back to in the future. And I say that as someone who has programmed for 15 years! I think your effort paid off, and the inspiration shows through.
One more name I'd add to this list is Mike Bostock. The care and attention he gives to data visualization examples comes through in the finished product.
Communicating complex subjects through interactive visual displays is very effective -- it's worth the effort. Thank you!
I've been admiring this trend since I started seeing these kinds of essays posted here, and felt motivated to start collecting them in a curated list[0]. Happy to accept new entries or volunteers to help maintain it!
Would you be open to suggestions? Off the top of my head, there's Explorable Explanations from https://ncase.me and, while not as educational, The Space Elevator on https://neal.fun
Definitely, Issues or PRs would be splendid. Though, I've had a hard time finding similar essay-like deep dives that fit this pattern on https://explorabl.es (I see a lot of what seem to be just an individual specialized simulation, games, or traditional e-learning courses), and the Space Elevator page feels like more of an interactive visualization in my opinion (though still very cool)
Oh man, this brings back the days when I wrote special debug-version malloc and free code to try to track down heap corruption due to malloc / free misuse (in code I had contributed to). Stuff like kbyte-long boundary buffers with bit patterns in them, and all sorts of lookaside lists run in parallel with libc's default code. Those bug-detectors worked OK. Hard-nosed code inspection worked far better.
As an old-timer in writing code, I think my generation's most-challenging legacies (=== the things we f**ked up) are the non-robust malloc/free concept and null-terminated text strings. Bugs in code using those conventions have been exploitable to a really damaging extent. I learned to program in C from K&R. And getting data-structure code right, and safe to deploy, in that language and its derivatives is HARD.
The C inventors are (were in Dennis Ritchie's case) brilliant and Bell Labs was great. Their ideas shaped the stuff the global internet runs on. But these parts of what they invented ..... ouch. (All OSs had the same problems.)
I wish the wonderful article presented here carried a more prominent warning about this. Many will read it as they learn to code. The history of our profession can teach about what NOT to do as well as what to do.
This is wonderful! I'm definitely going to be sharing this with my students (college sophomores studying CS).
If I were to make some suggestions, based on how I know they would receive this:
- I would make explicit reference to heap and stack. Students who are learning this material are learning about the heap/stack dichotomy, and I think it would really improve the exposition to make clear that not all memory is allocated this way in a program.
- I would remove this line: "At the end of this post, you should know everything you need to know to write your own allocator." I can confidently say that my students will not be able to write an allocator after reading this. It's nothing to do with your piece, it's just that the intersection of people who don't understand hex and people who could build an allocator after a short blog post is very, very small. Students who read this post and at the end still feel like they can't build an allocator will end up discouraged, with a feeling that maybe they are missing something.
- Consider rethinking how you handle hex numbers throughout. You introduce them and say they are distinguished by a preceding "0x", but then you immediately omit the 0x to save space in the figure. In my experience, students have a lot of trouble with understanding that hex and dec have the same underlying representation. They will not be able to distinguish between hex and dec bases implicitly, so from a pedagogical standpoint, it's better to be consistent throughout and include the prefix.
- Finally, at the end I would mention that there are other strategies for memory management, to encourage further exploration. Other languages do it differently, and for students it's important to know which other avenues are out there. Otherwise they might think heap-based malloc/free are the only way to do things, the same way they often think imperative programming is the only way to program.
Anyway, thank you for creating this, and I'm sure it will really help my students. In a just world, "seeing" the memory like this would ideally be first-class tooling for languages like C.
Really appreciate you taking the time to write this, thank you.
I tried a couple of different ways to introduce the stack and the heap but it always felt like it made the post too long and complicated. In the end I decided to take a pure, idealistic view of memory in order to focus on the algorithms used to pack allocations effectively. You can see some of my abandoned efforts as HTML comments in the post :D
Introducing the 0x prefix and immediately dropping it hurt me as well, but I didn't have a better way to make the visualisation work on mobile. I completely agree with you that it's not ideal.
I'd like to do a post of this style about garbage collection at some point.
Maybe make this a series? I think another post just like this on the stack is in order. You could show allocating stack frames with the slider! Then you can put them both together in a third post and show the entire memory layout of a C program.
Would definitely like to see more thoughts from those cute corgis.
A few people suggested a series, but there's a human problem: my attention jumps between topics quite a lot! At the end of this post, and my load balancing post, my main feeling was relief in that I could start looking into a new topic.
Maybe after more time has gone by I can revisit earlier posts and do follow-ups :)
I went back and forth on this and decided in the end that it makes sense to use hex for 2 reasons:
1. It's what people are going to be exposed to in any material outside of this post.
2. Within this post, it helps distinguish between pointers and sizes/values.
Now, who owns inner_struct? Who is responsible for freeing it? Do I free it when I assign to it?
I feel this ownership complicates cross-language FFI API calls, because who is responsible for freeing structures depends on the application and the platform you're running under. For example, Rust code being called from Erlang. You have to be extra careful when memory is managed by a different language runtime.
I feel I am at the beginning of intuitively understanding what memory really is: memory is just a huge contiguous region of numbered locations. Bump allocators allocate in one direction and free all at once. Arena allocators allocate to a preallocated region, I think.
Memory is a logistical problem of how you arrange and allocate finite resources.
I am thinking of alternative visualizations of understanding memory, for example, I started writing an animation of binary search:
The idea is that you see values and memory locations move around with the final goal being able to command memory to move around and be computed, such as real time strategy game.
I think if we could visualise memory as cars on a road, we would see obvious traffic jams.
A struct can't own something - it isn't a class with a destructor or anything. So it isn't quite so obvious. There are only two lines of code here, but the implication is a function is running that code and then returning `return_struct`, which might get passed around to more functions, and even returned further up the call stack. Somewhere there needs to be code that knows "hey - nobody else is using return_struct, and by the way you need to free return_struct->inner_struct before freeing return_struct."
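One common way to write that "code that knows" is a matched create/destroy pair that owns the whole nested allocation; a sketch, with the struct layout assumed since the original snippet isn't shown:

    #include <stdlib.h>

    /* Assumed layout, since the original snippet isn't shown. */
    struct inner_struct { int value; };

    struct return_struct {
        struct inner_struct *inner_struct;
    };

    /* The create/destroy pair is the "code that knows": whoever calls
     * return_struct_create() owns the result and must call return_struct_destroy(). */
    struct return_struct *return_struct_create(void) {
        struct return_struct *rs = malloc(sizeof *rs);
        if (!rs) return NULL;
        rs->inner_struct = malloc(sizeof *rs->inner_struct);
        if (!rs->inner_struct) {
            free(rs);
            return NULL;
        }
        return rs;
    }

    void return_struct_destroy(struct return_struct *rs) {
        if (!rs) return;
        free(rs->inner_struct);   /* free the inner allocation first */
        free(rs);
    }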
> I feel I am at the beginning of intuitively understanding what memory really is: memory is just a huge contiguous region of numbered locations.
There might be an analogy here that could help you reason about your nested structure allocations…
Memory is an array of bytes owned by the OS. While there are all kinds of implementation details about addressing and storage and performance and paging and virtual memory, it’s really just an array. The OS gives you a way to reserve pieces of the array for your own use, and you’re responsible for giving them back if you want to play nice and/or run for a long time, otherwise (as a safety net) the OS will take them back as soon as you exit.
This is, in a sense, very similar to the question you posed. An outer routine owns the outer structure, and an inner routine allocates some inner structure. The simplest, most intuitive, and generally best advice is that whoever allocates is also responsible for freeing memory. In other words, one way to define ownership of memory is by who allocates it. Implicitly and automatically, the responsibility to free that memory belongs to the owner that allocated it. It's okay to explicitly transfer ownership, but it can easily get complicated and unintuitive.

You can also consider letting the parent free your struct to be similar to not calling free() in your JIT compiler - it's a 'lazy' optimization to have the parent clean up - and I don't mean that in a judgemental sense, I mean it's valid to let the parent handle it, if you know that it will, and this can be done without getting confused about who owns the memory and who was actually responsible for freeing it. Note that when you leave the parent to clean it up, you are foregoing the ability to re-use the memory - this is true in your JIT compiler and it's true for malloc() and free() as well. If you let the OS handle it, you're in effect declaring that you believe you do not need to recycle the memory allocated in your program during its execution. (This might be true, and it might stay that way, but it's always worth asking if it will remain true, since lots of people have been eventually bitten and had to retroactively refactor for memory management when their requirements change.)
Yeah, I hear you. I've not done a lot of FFI stuff directly, it scares me.
Arena allocators are cool, the idea is you allocate a large-ish region of memory and sub-allocate into it (often with a fast, simple allocator like a bump allocator) and then free the large-ish block when you're done. It's a way to take knowing how much memory you need as a whole and optimise that to a single call to malloc/free.
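A minimal sketch of that idea, a bump allocator over one malloc'd block that is released with a single free:

    #include <stdlib.h>

    /* Minimal arena: one big malloc up front, a bump pointer for sub-allocations,
     * and a single free for everything at the end. */
    typedef struct {
        unsigned char *base;
        size_t capacity;
        size_t used;
    } arena;

    int arena_init(arena *a, size_t capacity) {
        a->base = malloc(capacity);
        a->capacity = capacity;
        a->used = 0;
        return a->base != NULL;
    }

    void *arena_alloc(arena *a, size_t size) {
        size = (size + 7) & ~(size_t)7;        /* keep allocations 8-byte aligned */
        if (a->used + size > a->capacity) return NULL;
        void *p = a->base + a->used;
        a->used += size;
        return p;
    }

    void arena_destroy(arena *a) {
        free(a->base);                         /* one free releases every sub-allocation */
        a->base = NULL;
        a->capacity = a->used = 0;
    }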
I want an extremely performant deep copy solution, I've been thinking of using an allocator to implement it.
If we have a tree data structure or a nested hashmap, then we want to copy it cheaply, there is copy on write. But most copies of hashmaps are slow because they instantiate every child object in a recursive loop.
So I want to be able to memcpy a complicated data structure for cheap copies.
> You have to be extra careful when memory is managed by a different language runtime.
While it would be nice to have next to no overhead for FFI, it's not always tractable. That's why you have to serialize across boundaries, the same as if you're serializing across processes or the network. At least in a single virtual memory space you can have a caller allocate a buffer and the callee fill it, with the caller being responsible for deserializing and freeing later. That gets you pretty far, and is relatively safe.
The alternative is to be callee managed, and for the callee to return things by handle and not necessarily by pointer, but that is also fraught.
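A sketch of the caller-allocates/callee-fills shape in plain C (hypothetical function names; imagine the callee living on the other side of an FFI boundary):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical callee: it only writes into the buffer the caller hands it,
     * and reports how much space it actually needed. */
    int get_greeting(char *buf, size_t buf_len, size_t *needed) {
        const char *msg = "hello from the other side";
        *needed = strlen(msg) + 1;
        if (buf_len < *needed) return -1;      /* caller's buffer is too small */
        memcpy(buf, msg, *needed);
        return 0;
    }

    int main(void) {
        char buf[64];                           /* caller owns (and later reclaims) the memory */
        size_t needed = 0;
        if (get_greeting(buf, sizeof buf, &needed) == 0)
            printf("%s (%zu bytes)\n", buf, needed);
        return 0;
    }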
Not really related, but the feeling of elation when I alloc'd 1M of RAM in OS/2 and it _didn't_ swap changed me.
This was on a 386sx with 8M RAM and it was pretty much all the available memory after the OS was loaded and settled down.
A MILLION BYTES!!
Didn't do anything with it, but still, after DOS and EMS/XMS memory and all the other falderol of managing memory. (At the time, it was also the only x86 OS that would format a floppy drive without bringing all the other threads to a crawl. UI was still available-ish, BiModem was still BiModeming...)
Sam, this is such a wonderful resource that you've put out into the world. Thank you for the time and care you've put into each paragraph and interactive component. You've not only given me a lot to think about in terms of my basic assumptions about memory, but also about how to write and teach better online. I'm really excited to start following you!
Nice article on general-purpose memory allocation approaches. While the article doesn't explicitly discuss it, these all seem to be list-allocation mechanisms?
> "List allocators used by malloc() and free() are, by necessity, general purpose allocators that aren’t optimized for any particular memory allocation pattern... To understand, let’s examine commonly used list allocation algorithms: first-fit, next-fit, best-fit and quick-fit."
That's from an article on custom memory management (aimed at embedded programming issues) I've found pretty useful, and it picks up where this leaves off:
You can practice writing custom memory managers in an application that runs on an operating system by only using the stack (just create a big array of int etc. and call that your memory space):
> "For the safety-critical tasks, the developer can replace the standard allocator with a custom allocator that sets aside a buffer for the exclusive use of that task, and satisfies all memory allocation requests out of that buffer... The remainder of this paper presents four examples of custom allocator algorithms: the block allocator, stack allocator, bitmap allocator and thread-local allocator. Custom memory managers can use any one of these algorithms or a combination of different algorithms."
> Others will return what's called a "null pointer", a special pointer that will crash your program if you try to read or write the memory it points to.
This is not strictly true, it depends on the environment you're using. Some older operating systems and some modern embedded systems have memory mapped at the zero address.
If we’re gonna be pedantic, isn’t memory mapping itself an optional property? Ie you can have a pointer to 0x00..00 refer to the actual physical bytes at that address, without any mapping, no?
Sure, there can be physical memory at address zero, there can be virtual memory at address zero. (Most modern operating systems and programming tools place a deathtrap for misbehaving code there instead.) There can be no physical memory at address 0 at all, though it is uncommon, and other gaps in addresses. RAM can technically start elsewhere, and lower part of memory can be taken by ROM or other hardware. (For example, see how authors of breadboard computers reason about the choice of system memory layouts and address line operation.) NULL value itself can be different from zero. Therefore, C standard defines nothing about null pointer dereferencing.
On the other hand, most people expect the usual layout that grows from zero upwards. CPU designers are people, too, and make similar assumptions about layout and operation of memory (if they have to), which then affect systems that use their processors. Also, with all its compatibility, C still presumed certain class of hardware. It was not invented as a language for every microprocessor and microcontroller and each odd memory model, the microcontroller evolution was instead affected by alignment with coding in C.
This ugly hack (using the same object to hold the address value that can be operated on, and its own validity information) might be the most well known.
"As a general-purpose memory allocator, though, we can't get away with having no free implementation."
I have a belief that the future of software is short-lived programs that never free memory. Programs allocate and terminate. Short-lived programs communicate with each other via blocking CSP-style channels (see Reppy's Concurrent Programming in ML).
If you could also educate me on why this is a bad idea I would appreciate.
My first large scale web application was a webmail service built in C++ (!) where I early on decided we'd ditch nearly all freeing of memory, as it was running as a CGI, and it was much faster to just let the OS free memory on termination. The exception was for any particularly large buffers. Coupled with statically linking it, it reduced the overhead sufficiently that running it as a CGI performed well enough to save us the massive pain of guaranteeing sufficient isolation and ensuring we were free of memory leaks.
Especially in a request/reply style environment, long running application servers is largely a workaround for high startup costs, and it's only a "bad idea" in the instances where removing that high startup cost is too difficult to be practical. Overall I love avoiding long running programs.
I agree with your point but disagree with your reasoning. I think programs should always free memory at some point because then it's easier to reason about debugging memory leaks.
Practically speaking though, there are arena allocators that do exactly this - you allocate a bunch of memory at once, assign like-typed instances to "slots" in that memory region, and then deallocate everything all at once. Thus, the individual instance `free()` is a no-op.
It's not general purpose, and lots of programs that were designed to be short-lived often end up not being so in the future. People used to point at compilers as a typical example of this kind of thing, well, now we have compilers as libraries sitting resident in every popular developer tool.
With a simple enough allocator, freeing things could maybe even be beneficial even for short-lived programs, purely from the memory being in cache already, instead of needing to ask the OS for more (incl. page faulting & zeroing, besides being limited by RAM throughput). For a buddy allocator without coalescing, a free() that just adds the argument to the corresponding freelist can be as simple as 5 or so x86-64 instructions (the fast path of the allocator being ~8 or so instructions; certainly more than a bump allocator, but not by much, and the reuse benefits can easily be pretty big).
This would probably be something closer to actors (https://en.wikipedia.org/wiki/Actor_model) rather than programs since programs are traditionally implemented as OS processes which are relatively expensive to spin up and terminate. At some level, though, somebody has to deal with freeing the memory, and they may do it less efficiently than you can.
My take on this is that code should always match up malloc and free, but your application may use an allocator where free is a no-op, if that's appropriate for the application you write. This way your code is more generic and can be reused in another application with different constraints.
And as soon as you are replacing free, you can replace malloc as well to be optimized for your use case. No need to build difficult bookkeeping hierarchies when they will never get used.
I suspect this would cause the same performance issues as a long running application that constantly malloc'd and free'd memory? Many application runtimes allocate but don't free memory - they just reuse it internally - for this reason. For example in a ruby application you'll see memory usage climb after boot, and eventually level off when it has all it will need for its lifetime, but never go down.
Yes, this is what Ur/Web does, albeit limited to web server requests. I'd argue all programs could be short lived, and memory management becomes a matter of sizing a program's scope/role to the amount of memory you can greedily consume. Certainly many sorting programs (for example) can leak until they terminate. Then, cheap instantiation and communication between programs.
That's no different than writing a traditional program and using big arena allocators for everything instead of individual allocations, except it's more complicated for no apparent reason.
> "There's no shortage of information about memory allocators on the Internet, and if you've read this far you should be well-placed to dive in to it.
Join the discussion on Hacker News! https://news.ycombinator.com/item?id=36029087 "
Interesting to use hacker news as the blog's own comment section in this way.
Seems to be a bug on the first interactive graph, at least for me. Unless I'm misunderstanding the point of the graph, `malloc(7)` only allocates 2 bytes.
I came here to see if anyone else noticed this and am confirming that there is a bug in the first slider on malloc(7). Indeed it only allocates two bytes instead of seven.
The playground was the inspiration for the post. I always wanted to be able to fiddle with malloc implementations and see how they work in this way.
Admittedly the playground is hard to understand at first, and a touch janky in places. But the workloads are taken from for-real malloc/free traces on programs!
When I think that I (and probably the vast majority of developers) used pen and paper for the first few years every time I tried to visualize more complex memory, this is a big upgrade.
Especially because you can scroll through all the steps.
Ahh yes. Back at university we had a class called efficient programming where we had to implement a Unix utility and optimize it for speed, meaning cpu cycles.
While aggressively optimizing we replaced malloc in the end with a function we called "cooloc", that traded portability and security for speed. The debug tool here would have been useful.
I then apply this same principle of "opening" and "closing" structures throughout the application. Generally, I can quickly verify that the calls are balanced:
What's nice about this pattern is that the underlying implementation of exactly how the memory is maintained for a particular data structure (e.g., whether malloc or gdImageCreateTrueColor is called) becomes an implementation detail:
This means there's only one call to malloc and one call to free throughout the entire application (third-party libraries notwithstanding), allowing them to be swapped out with ease.
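A sketch of what that single choke point might look like; memory_close() and its NULL guard follow the discussion below, while memory_open() and the exact signatures are my guess at the shape being described:

    #include <stdlib.h>

    /* Single choke point for heap management: the only malloc/free calls in
     * the application live in these two functions, so they can be swapped
     * out (or instrumented) in one place. */
    void *memory_open(size_t size) {
        return malloc(size);
    }

    void memory_close(void *memory) {
        if (memory != NULL) {   /* guard discussed in the replies below */
            free(memory);
        }
    }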
Aside, logging can follow this same idea by restricting where any text is written to the console to a single location in the code base:
As a very minor point, calling free(NULL) is well-defined and safe so there is no need for the if-statement in memory_close(). This is very clearly stated in the manual page [1] for instance:
Probably showing my age. SunOS 4, PalmOS, and 3BSD reputedly crashed. (There were also double free exploits back in 1996.)
This further illustrates my point, though: Removing the NULL check is a single conditional to remove, as opposed to littering the free + guard everywhere. In effect, by isolating duplicated pieces of logic, it keeps possibilities open.
I love this so much, thank you for putting this together!
My only piece of feedback would be for the "Inline Bookkeeping" section (https://samwho.dev/memory-allocation/#inline-bookkeeping), it took a while for me to grok the numbered list to figure out which block corresponded to address + X. I wonder if there is a better way to visualize the 4 numbered bullet points? Maybe just arrows and text pointing to the visualization?
Yes, this is one of the things I look back on as a weak point in the writing. I wanted to make this a lot more clear but by the time I'd gotten to this point in the article, I'd sort of coded myself into a corner with the visualisation code.
In another universe I'd hook in to the page scroll and highlight each block being referred to as I talk about it in the text. I'm probably not explaining that well here, but imagine something like the way this article works: https://pudding.cool/2017/05/song-repetition/.
Couldn't you highlight each block being referred to as you mouse over / click on relevant text? (The latter might not be terribly hard to do with <a> and CSS :target, if you can give the blocks ids.)
Love this. It is so important to know these fundamentals concepts. I would like to see a similar version for SQL database indexes as 99% of the engineers I work with have no idea how to use them.
If more took the time to EXPLAIN ANALYZE they'd see indexes in action and hopefully learn from a few variations, especially pre/post variations of indexes. Or so I'd hope :).
Great webpage, but it somehow left me more confused.
It seems to imply that malloc/free works by boundary tag? Which I don't think is the case? (and if not, it leaves the reader confused as to how it then actually works).
I know "some" languages use the tag technique (e.g. julia), but the article seems to imply that this also applies to the c code above? Or am I wrong and c also makes use of such a scheme when you use pointer arithmetic?? (I don't see how that would work with arrays if that's the case though)
I'm sorry you're more confused than you were when you started!
The post is intending to talk about the topic of memory allocation in a general sense. The way that malloc works on any given system is implementation dependent, it's not possible to say "malloc works in this way" because Debian can differ from Arch can differ from BSD, etc.
It's not my goal to tell you exactly how modern malloc/free implementations work. I could probably be more clear about this at the start of the post.
Glad there are always talented people and great guides like him/this! Great people and guides will help tremendous numbers of learners along the way; timeless, priceless.
I love this! I wish it existed back when I was writing my first memory allocators in university or when building a custom EntityComponentSystem implementation.
I'd love to also see applications of custom memory allocations. I know about usecases in building game engines and the importance of hitting cache there, but I'm not sure where else in the world this would be as useful.
I want to learn more about next steps, such as virtual memory. Should Ch 8 of K&R be the next step? I started reading the chapter and I think the whole chapter is about UNIX system calls.
Excellent, excellent article! I have a question though.
> Couldn't we rearrange the memory to get a block of 6 contiguous bytes? Some sort of defragmentation process?
> Sadly not. Remember earlier we talked about how the return value of malloc is the address of a byte in memory? Moving allocations won't change the pointers we have already returned from malloc. We would change the value those pointers are pointed at, effectively breaking them. This is one of the downsides of the malloc/free API.
But why not? Couldn't we store information about old pointers somewhere and match them with new addresses when defragmenting? Some kind of virtual memory driver that would map old pointers to new addresses transparently for the programs? Or would it be too much overhead for too little benefit?
Most OSes today do that "transparently" with virtual memory, but usually with a coarse granularity (e.g. 4k). A page table is just a translation of "pointers" to "memory addresses"; the OS can rearrange physical memory as it sees fit without the program having to update its pointers.
In OSes without virtual memory, one option is to do the same non-transparently: instead of returning pointers, malloc and friends work with "pointers to pointers" (called handles), so there is one extra level of indirection, and now the OS is free to rearrange this 2nd level as it sees fit. Whenever a program wants to read/write the data behind a handle, it must dereference the handle to get to the real pointer, but it also must let the OS know that it is currently using the real pointer -- this is to avoid the OS moving it around.
This is usually called "locking/unlocking" the handle.
Some classic examples are Windows 3.x, Mac OS (toolbox), PalmOS, etc.
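A toy sketch of the handle idea (not any real OS's API): the allocator hands out a pointer to a slot it controls, the program only dereferences the real pointer between lock and unlock, and the system may relocate any unlocked block.

    #include <stdlib.h>

    /* Toy handle-based allocation: the table slot a handle points at is
     * stable, the block behind it is not. */
    #define MAX_HANDLES 64

    typedef struct {
        void  *ptr;     /* current location of the block; may change while unlocked */
        size_t size;
        int    locked;
    } slot;

    static slot table[MAX_HANDLES];
    typedef slot *handle;

    handle handle_alloc(size_t size) {
        for (int i = 0; i < MAX_HANDLES; i++) {
            if (table[i].ptr == NULL) {
                table[i].ptr = malloc(size);
                table[i].size = size;
                table[i].locked = 0;
                return table[i].ptr ? &table[i] : NULL;
            }
        }
        return NULL;
    }

    void *handle_lock(handle h)   { h->locked = 1; return h->ptr; }  /* pointer valid until unlock */
    void  handle_unlock(handle h) { h->locked = 0; }

    /* What "the system is free to move it" might look like: an unlocked block
     * can be relocated, and only the slot behind the handle is updated. */
    void compact_step(handle h) {
        if (!h->locked) {
            void *moved = realloc(h->ptr, h->size);   /* realloc may relocate the block */
            if (moved) h->ptr = moved;
        }
    }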
> Some kind of virtual memory driver that would map old pointers to new addresses transparently for the programs
You would need hardware support for this, since the hardware is what decides what gets returned when a program attempts to read from a memory location.
Hardware already does support virtual memory but the granularity is the page (which are a minimum of 4KiB in most OSs).
To do that you either need a structure that you update every time a pointer is created, copied, moved or deleted (too much overhead), or you need a way to scan the entire memory and get all the pointers. And at the point where you have a piece of code that knows where every pointer is, you already know which pointers aren't used anywhere anymore so it's a waste to not have it also call free() for you.
Once you have it call free() for you, your piece of code is now a compacting GC, like Java's for example.
> But why not? Couldn't we store information about old pointers somewhere and match them with new addresses when defragmenting? Some kind of virtual memory driver that would map old pointers to new addresses transparently for the programs? Or would it be too much overhead for too little benefit?
In languages where memory is managed for you, you can absolutely do this, since the runtime can find every single pointer to an object and rewrite it.
Virtual memory can let you do this, but would require a separate page for each allocation (since virtual memory operates on a page-level). Given that the smallest page on modern architectures is 4k, this would mean using 4k of ram for each allocation (and rounding up each allocation to a 4k page boundary).
On top of that, it's an OS system call to map and unmap pages, which means you incur a system-call on every allocation and deallocation, which is much slower than using a user-space allocator.
You’d need cooperation between your malloc implementation and the application code. It’s possible, but tricky. If your malloc returned a pointer to a pointer, and promised to keep the first level pointer up to date, and was able to lock your application whenever it moved things around, it could work.
Someone else already mentioned, but garbage collected languages do actually do this. Because they’re fully in control of memory (the language exposes no raw pointers), they’re able to create the layer of indirection you’ve suggested and move things around as they please. I know at minimum the JVM does this. The term to search for is “heap compaction.”
There’s also the weird and wonderful work of Emery Berger et al with their “Mesh” malloc implementation, which blows my mind: https://youtu.be/XRAP3lBivYM.
In a language like C that isn't really possible because the language can't keep track of all of the places that memory address is stored.
If malloc were to return something like an address that holds the address of memory allocated there is nothing preventing the program from reading that address, doing math on it, and storing it somewhere else.
Early MacOS did this with handles (basically pointers to pointers.) You'd lock them to read/write and when they were unlocked, the OS was free to move the underlying memory and change the pointer in the handle.
I'm glad you like the playground. If you don't mind me asking, what/where/how do you teach? I was actually hoping to get the attention of educators with this tool to see if it would make sense in, e.g., undergrad CS courses.
The commentary, along with the first milestones achieved by Elon Musk's SpaceX and Starlink, speaks of a global sci-tech education initiative, perhaps one that even includes Khan Academy; you could see whether those communities have a slot for your talent. Look for contact points on their webpage closest to that initiative?
I think it will be helpful if it discusses internal fragmentation, where the payload is smaller than the allocated block. I observed that this was important in understanding the purpose of various memory allocation algorithms when undertaking the malloc lab in college.
Can you highlight the specific malloc() calls in the interactive part? It confused me when it said malloc(5) because it looks almost exactly like malloc(4). Specifically highlighting the malloc(5) allocation would make that a bit clearer I think.
I completely agree with you on this, though I couldn't find a great way to do it. It was suggested to me to put an outline around the allocation or free that just happened, but I struggled to get the box to look good when the allocation was split across 2 lines.
I've started writing my next post and have learnt a bit more about PixiJS/GSAP3, and think I know a way to do it that would work nicely but would require changing some of the underlying code. I can't promise I'll revisit this soon, but I would like to in future.
This is the future of learning. You have a skill in it; learn to monetize it. It will benefit both of us, as you are more incentivized to do it if you can make money.
I've played around quite a bit with slab allocation in C to get my interpreters to run faster, and this post inspired me to do a quick benchmark of a design I've been iterating.
It talks about how a simple malloc implementation would dole out entries in a buffer of memory and manage free lists, but not how the implementation actually gets that buffer from the operating system, or returns it to the system when done (mmap, munmap, for example).
I made a deliberate choice to not include that, mostly due to post size/complexity reasons. I decided that the focus of the piece was going to be on the task of effectively managing the memory you have in the face of calls to malloc/free. I decided against talking about any environment-specific information (brk, mmap, cache locality, multithreading, etc.)
May not have been the right call, but it kept the length fairly reasonable and I think it gives people enough of a nudge to dip their toes into the playground if they have time.
If you check the HTML comments on the page you'll find quite a lot of cut content. I wanted to do deep dives on real-world malloc implementations, but it ended up being very difficult to cover without making the post take over an hour to read.
I was having some fun recently just seeing how fast we can allocate large chunks of memory.
There's something refreshing about firing up a C (or maybe now Zig, in my case) compiler and allocating a gigabyte of memory, and seeing that your process is using exactly 1GB.
> I was having some fun recently just seeing how fast we can allocate large chunks of memory.
How did you measure that? What was fastest? I'd imagine sbrk is faster than mmap, but I haven't checked. If you're on Linux, then doing a virtual memory allocation will not cause the memory to have physical representation. Did you populate the pages through looping? Or did you use MAP_POPULATE?
I saw that madvise was going to get MADV_POPULATE_(READ|WRITE) in the Linux kernel mailing lists back in 2021, but I haven't seen it on the kernels that I use (5.4, 5.15)
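For anyone wanting to try the comparison being asked about, a rough Linux-only sketch: map a large anonymous region and either touch every page by hand or pass MAP_POPULATE, timing each.

    /* Compare faulting 1 GiB in by touching every page vs. asking the kernel
     * to prefault it with MAP_POPULATE (Linux). */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    #define SIZE ((size_t)1 << 30)   /* 1 GiB */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        double t0 = now();
        unsigned char *a = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (a == MAP_FAILED) { perror("mmap"); return 1; }
        for (size_t i = 0; i < SIZE; i += 4096)
            a[i] = 1;                              /* fault each page in by hand */
        double t1 = now();

        unsigned char *b = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (b == MAP_FAILED) { perror("mmap"); return 1; }
        double t2 = now();

        printf("touch every page: %.3fs, MAP_POPULATE: %.3fs\n", t1 - t0, t2 - t1);
        munmap(a, SIZE);
        munmap(b, SIZE);
        return 0;
    }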
The accessibility of this technique is something that greatly worries me. I try and be quite thoughtful about things like colour palette. I'm not sure how to achieve the level of interactivity I want while still being accessible to screen readers, without writing two completely separate articles with and without JavaScript.
It would have been nice if I had seen this article before refactoring a Java project into a C project, where learning memory allocation rules in order to use C was a painful process for me :(
Love the visualizations. It would be great to have `realloc` covered too but I'm not sure how much that varies by system, potentially making it too complicated for this sort of post.
I did work focused on optimizing the latency impacts of mem allocation for a recent client of mine. Loved doing that kind of low-level systems tuning. (Linux, C, mmap and friends.)
Good question! I’m not aware of one, but I also haven’t looked.
Most of the research for this post was done by reading papers for various malloc implementations (phkmalloc, dlmalloc, tcmalloc, mimalloc) and reading their source code.
The asides with the dog are great. I've seen a few blogs use this technique -- I'm not big on Greek, but I think Plato used this technique too.
It instills the idea that you should be asking a question at this point. Some information has been provided that should generate more questions in your head, if you're keeping pace.