Rather than a direct syscall, you could imagine something like rseq, where you have a shared userspace/kernel data structure and the userspace code gets aborted and restarted if the page was evicted while being processed. But making this work correctly, without a perf overhead, and with an ergonomic API is super hard. In practice, people who care are probably satisfied by direct I/O via io_uring with a custom page cache, and a truly optimal implementation, where the OS can still manage file pages and evict them but the application still knows when that happened, isn't worth it.
Unfortunately, a lot of the shared state with userland became much more difficult to implement securely once the Meltdown and Spectre (and other) exploits became concerns that had to be mitigated. They make the OS's job a heck of a lot harder.
Sometimes I feel modern technology is basically a delicately balanced house of cards that falls over when breathed upon or looked at incorrectly.
> You would need an hardware instruction that tells you if a load or store would fault.
You have MADV_FREE pages/ranges. They get cleared when purged, so reading zeros tells you that the load would have faulted and needs to be populated from storage.
MADV_FREE is insufficient - userspace doesn't get a signal from the OS to know when there's system-wide memory pressure, and having userspace try to respond to such a signal would be counterproductive and slow in a kernel operation that needs to be a fast path. It's more that you want to madvise (page-cache) a memory range and then have some way to have a shared data structure where you're told whether it's still resident and can lock it from being paged out.
MADV_FREE is also extremely expensive. CPU vendors have finally simplified TLB shootdown in recent CPUs with both AMD and Intel now having instructions to broadcast TLB flushes in hardware, which gets rid of one of the worst sources of performance degradation in threaded multicore applications (oh the pain of IPIs mixed with TLB flushing!). However, it's still very expensive to walk page tables and free pages.
Hardware reference counting of memory allocations would be very interesting. It would be shockingly simple to implement compared to many other features hardware already has to tackle.
It's quite expensive to free pages under memory pressure (though it's not clear that there's any other choice to be made), but if the pages are never freed it should be cheap, AIUI.
> have some way to have a shared data structure where you are told if it’s still resident and can lock it from being paged out.
What's more efficient than fetching data and comparing it with zero? Any write within the range will then cancel the MADV_FREE property on the written-to page thus "locking" it again, and this is also very efficient.
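To make the mechanism described above concrete, here is a minimal C sketch of it; the sentinel convention (a nonzero byte at offset 0 while populated) and the function names are assumptions made up for illustration, not part of any API:

```c
/* Sketch: treat an anonymous cache buffer as reclaimable with MADV_FREE,
 * then detect reclaim by checking whether it reads back as zeros.
 * The sentinel convention is an assumption for illustration only. */
#include <sys/mman.h>
#include <string.h>

static void *cache_alloc(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Fill the buffer from storage, then mark it reclaimable. */
static void cache_populate(unsigned char *p, size_t len) {
    memset(p, 0xAB, len);          /* stand-in for "read from storage" */
    madvise(p, len, MADV_FREE);    /* kernel may reclaim under pressure */
}

/* Returns 1 if the page was purged (reads back as zeros) and must be
 * repopulated; any write to the range cancels MADV_FREE again. */
static int cache_was_purged(const unsigned char *p) {
    return p[0] == 0;
}
```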
You can longjmp, swapcontext, or whatever from a signal handler into another lightweight fiber. The problem is that there is no "until the page arrives" notification. You would have to poll mincore, which is awful.
You could of course imagine an asynchronous "mmap complete" notification syscall, but at that point why not just use io_uring? It will be simpler, and it has the benefit of actually existing.
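For reference, this is roughly what polling mincore() for residency looks like (a sketch; the page-aligned mapping and error handling are assumed to exist elsewhere):

```c
/* Sketch: check whether every page of a mapping is currently resident.
 * Assumes `addr` is page-aligned and `len` covers an existing mapping. */
#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>

static int all_resident(void *addr, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    if (!vec)
        return -1;
    if (mincore(addr, len, vec) != 0) {   /* fills one byte per page */
        free(vec);
        return -1;
    }
    int resident = 1;
    for (size_t i = 0; i < npages; i++) {
        if (!(vec[i] & 1)) {              /* bit 0: page is in core */
            resident = 0;
            break;
        }
    }
    free(vec);
    return resident;
}
```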
I think the model I described is more precise than madvise. I think madvise would usually be called on large sequences of pages, which is why it has `MADV_RANDOM`, `MADV_SEQUENTIAL` etc. You're not specifying which memory/pages are about to be accessed, but the likely access pattern.
If you're just using mmap to read a file from start to finish, then the `hint_read` mechanism is indeed pointless, since multiple `hint_read` calls would do the same thing as a single `madvise(..., MADV_SEQUENTIAL)` call.
The point of `hint_read`, and indeed of io_uring or `readv`, is that the program knows exactly which parts of the file it wants to read first, so it would be best if those are read concurrently, and preferably with a single system call or page fault (i.e., one switch to kernel space).
I would expect the `hint_read` function to push to a queue in thread-local storage, so it shouldn't need a switch to kernel space. User/kernel space switches are slow, on the order of a few tens of millions per second. This is why the vDSO exists, and why libc buffers writes through `fwrite`/`println`/etc.: function calls within userspace can happen at rates of billions per second.
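A hypothetical sketch of that queueing idea, since `hint_read` is not a real API: the names, batch size, and flush policy below are invented for illustration, and a real implementation might submit the whole batch through io_uring rather than issuing one fadvise per entry.

```c
/* Hints accumulate in thread-local storage (no kernel switch per call)
 * and are flushed in a batch. All names here are hypothetical. */
#include <sys/types.h>
#include <fcntl.h>
#include <stddef.h>

#define HINT_BATCH 64

struct read_hint { int fd; off_t off; size_t len; };

static _Thread_local struct read_hint hint_q[HINT_BATCH];
static _Thread_local size_t hint_n;

static void hint_flush(void) {
    for (size_t i = 0; i < hint_n; i++)
        posix_fadvise(hint_q[i].fd, hint_q[i].off, hint_q[i].len,
                      POSIX_FADV_WILLNEED);   /* start readahead */
    hint_n = 0;
}

/* Cheap: just a store into TLS, no syscall until the batch fills. */
static void hint_read(int fd, off_t off, size_t len) {
    hint_q[hint_n++] = (struct read_hint){ fd, off, len };
    if (hint_n == HINT_BATCH)
        hint_flush();
}
```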
The entire point I was trying to make at the beginning of the thread is that mmap gives you memory pages in the page cache that the OS can drop under memory pressure. io_uring is close on the performance and fine-grained-access-pattern front, but it's not so good on system-wide cooperative behavior with memory, and it has a higher cost: either you're still copying from the page cache into a user buffer (a non-trivial performance impact versus the read itself) and trashing your CPU caches, or you're doing direct I/O and having to implement a page cache manually (which risks duplicating page data inefficiently in userspace if the same file is accessed by multiple processes).
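For comparison, a bare-bones sketch of the direct-I/O route via io_uring (liburing), with the caching layer that would have to sit on top left out; error handling is omitted and the 4 KiB block size/alignment is an assumption about the device:

```c
/* Read one block with O_DIRECT through io_uring, bypassing the page cache.
 * Link with -luring. Minimal sketch, not production code. */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

enum { BLK = 4096 };

int read_block_direct(const char *path, off_t off, void **out) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int fd = open(path, O_RDONLY | O_DIRECT);   /* bypass the page cache */
    void *buf;
    posix_memalign(&buf, BLK, BLK);             /* O_DIRECT needs alignment */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, BLK, off);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int res = cqe->res;                          /* bytes read or -errno */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    *out = buf;                                  /* caller owns the buffer */
    return res;
}
```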
Right, so zero-copy I/O, but still having the ability to share the page cache across processes and allow the kernel to drop caches under high memory pressure. One issue is that, under pressure, a process might not really be able to successfully read a page and could keep retrying and failing (with an LRU replacement policy that is unlikely and probably self-limiting, but still...).
To take advantage of zero-copy I/O, which I believe has become much more important since the shift from spinning rust to Flash, I think applications often need to adopt a file format that's amenable to zero-copy access. Examples include Arrow (but not compressed Feather), HDF5, FlatBuffers, Avro, and SBE. A lot of file formats developed during the spinning-rust eon require full parsing before the data in them can be used, which is fine for a 1KB file but suboptimal for a 1GB file.
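As an illustration of what "amenable to zero-copy access" means in practice: a fixed-layout record file that can be used directly out of an mmap()ed region, with no parse or copy step. The format below is invented for this example; real ones (Arrow, FlatBuffers, SBE) are more involved but follow the same principle.

```c
/* Hypothetical fixed-layout record file accessed in place via mmap. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

struct record { uint64_t key; uint64_t value; };   /* on-disk layout */

struct record_file {
    const struct record *recs;   /* points straight into the mapping */
    size_t count;
};

static int record_file_open(const char *path, struct record_file *rf) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* mapping stays valid after close */
    if (base == MAP_FAILED)
        return -1;
    rf->recs = base;
    rf->count = st.st_size / sizeof(struct record);
    return 0;
}

/* Lookup touches only the pages it needs; nothing is deserialized. */
static uint64_t lookup(const struct record_file *rf, size_t i) {
    return rf->recs[i].value;
}
```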
> inflation doesn't increase the dollar amount owned
In theory no, but when old debt matures, it is often paid by issuing new debt that is financed at the higher rate. So in practice inflation does increase the dollar amount owed unless the government actively reduces the debt.
There are hundreds of billions of lines of code of critical software[1] written in unsafe languages that is not going to be rewritten any time soon. Adding memory safety "for free" to such software is a net positive.
Current CPUs are limited by power; transistors are essentially free.
[1] often the VMs of safer higher level languages, in fact.
If I had to guess, the actual GCC maintainers[1] had no interest in integrating a very large codebase into GCC that would duplicate a lot of its functionality.
LLVM could have been integrated under the GNU/FSF umbrella as a separate project of course.
[1] since the egcs debacle was resolved, RMS has had very little control of GCC
I got so used to electric-indent and the immediate feedback it gives, that for a very long time it prevented me from even considering any other editors.
These days I rely on clangd-driven auto-indent (which is fast enough to run on every line), but I still use emacs because it is so easy to tweak the interaction with clangd to work exactly as I prefer.