Very cool. Having done a bit of NES dev I can imagine this wasn't super straightforward to make performant for the graphics, given you can typically only have eight sprites on a scanline before the NES starts dropping them (the classic sprite-flicker limit).
I wonder if it's using the background tile map for this instead of sprites, though that's also an impressive amount of graphics bandwidth.
> with full audio playback rate (44.2kHz)
The audio being so clear is also impressive; is that something the card extends? IIRC the PCM channel on the NES isn't anywhere near that sample rate, and its samples are only 7 bits wide.
What a delightfully arrogant article, to the point that I suspect it's satire (I stopped reading at the section headers, so perhaps I missed the punchline).
TOML is by far the most stripped-down and easy-to-understand configuration format I've ever used, allowing just enough syntactic sugar to be useful without changing semantics. The fact that it's made by Tom is beside the point, so dismissing it out of hand on those grounds is silly to me.
Meanwhile, the proposed configuration format sounds like a nightmare to read. It still has clashing syntax, it offloads all of the value parsing to the software (so you end up with the same config format interpreted in multiple different ways), it restricts the use of certain characters with no way of escaping them (someone else mentioned base64), and it requires recursively parsing nested KVs rather than constructing the final in-memory structures in a single linear pass, adding a layer of indirection before parsing proper even begins.
Not to mention, I really get turned off by this sort of pious writing style.
No thanks. There are lots of reasons to dislike the config formats we've seen before, but this doesn't solve anything in my eyes.
Technically yes, you can do copy-on-write semantics. Both the original and the "copied" page are marked read-only, and the first write to either virtual page causes a page fault (access violation) in the kernel. The fault handler allocates a physical page, copies the contents over, updates the virtual page table entries' permissions to RW, and flushes the stale entries from the TLB (the equivalent of `invlpg`), before returning to the program to retry the write instruction.
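A minimal sketch of what that fault handler might look like, in C (hypothetical kernel; all the helper names here are invented, and a real implementation also has to worry about racing faults):

    // Hypothetical CoW fault handler -- helper names are invented.
    void handle_cow_fault(vaddr_t fault_addr, pte_t *pte) {
        page_t *orig = pte_page(pte);         // the shared, read-only page
        if (page_refcount(orig) == 1) {
            // We're the last sharer: no copy needed, just re-enable writes.
            pte_set_writable(pte);
        } else {
            page_t *copy = alloc_page();      // new physical page
            memcpy(page_vaddr(copy), page_vaddr(orig), PAGE_SIZE);
            pte_set_page(pte, copy);          // point our PTE at the copy
            pte_set_writable(pte);
            page_decref(orig);                // other sharers keep the original
        }
        invlpg(fault_addr);                   // flush the stale TLB entry
        // returning from the fault re-executes the write instruction
    }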
However, this gets tricky with heap allocators, since they often pack `malloc`-like allocations into the same page using some allocation scheme (e.g. buddy allocation). You can typically bypass in-process, malloc-like allocators and handle the page mappings yourself; it's just a little more cumbersome.
This is actually how "demand paging" (I think is the term; also called lazy allocation or overcommit) happens: if you `malloc` a very large space on many modern OSes, you get a pointer back and an allotment is registered in the process's virtual address space, but no page tables are actually updated (the kernel leaves them "not present", meaning no read or write access; any memory operation in that range causes a page fault). When any page in that range is accessed for the first time, the page fault handler sees it's part of that otherwise very large segment, allocates a physical page for that virtual page, and returns.
It allows for e.g. allocating a VERY large sparse array of items (even terabytes!) that might otherwise exceed the physical memory actually available to the system, with the assumption that it'll never be fully populated (or at least never populated across enough pages to exhaust 'real' physical memory).
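A quick way to see this on Linux (a sketch; whether the huge reservation succeeds depends on the kernel's overcommit settings):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        // Reserve 1 TiB of address space. No physical pages are
        // allocated until a page is first touched.
        size_t len = 1ULL << 40;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 1;              // faults in exactly one physical page
        p[1ULL << 39] = 1;     // ...and one more, half a TiB away
        printf("touched 2 pages of a 1 TiB mapping\n");
        // RSS in /proc/self/status stays tiny despite the huge mapping.
        return 0;
    }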
This is also how file mapping works. You map a file from the filesystem, and the kernel keeps a cache of pages read from that file; memory reads and writes to pages that haven't been read in (and cached) yet cause page faults, prompting a read from disk storage into a physical page, a page table update, and the resumption of the faulting thread. It also allows the kernel to reclaim less-frequently-used file-mapped pages for higher-priority allocations at essentially any time: the next time the program touches that page, the kernel just faults again, reads it into a new page, and updates the tables. It's entirely transparent to the process.
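The userspace half of that, sketched with POSIX `mmap` (error handling trimmed):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a file and read it through memory. The first touch of each
    // page may fault and pull it (plus readahead) in from disk.
    char *map_file(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);             // the mapping keeps the file pages alive
        if (p == MAP_FAILED) return NULL;
        *len_out = st.st_size;
        return p;              // p[i] may page-fault; the kernel handles it
    }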
I'm designing a novel kernel at the moment, and possibilities like these are among the more interesting parts of the design, actually. I find them somewhat underutilized even in modern kernels :)
Same cache-line. CPU caches come after virtual memory translations / TLB lookups. Memory caches work on physical addresses, not linear (virtual) addresses.
Memory access -> TLB cache lookup -> PT lookup (if TLB miss) -> L1 cache check (depending on PT flags) -> L2 cache check (depending on PT flags, if L1 misses) -> ... -> main memory fetch, to boil it down simply.
CPUs would be ridiculously slow if that wasn't the case. Also upon thinking about it a bit more, I have no idea how it'd even work if it was the other way around. (EDIT: To be clear, I meant if main memory cache was hit first followed by the MMU - someone correctly mentioned VIVT caches which aren't what I meant :D)
That's very true, though AFAIK VIVT caches aren't used much in modern processors. It's usually PIPT or VIPT (I think I've seen more references to the latter), VIPT being prevalent because the address translation and the cache set lookup can be resolved in parallel when designing the circuitry.
But I've not designed CPUs nor do I work for $chip_manu so I'm speculating. Would love more info if anyone has it.
EDIT: Looks like some of the x86 manus have figured out a VIPT that has fewer downsides and behaves more like a PIPT cache. I'd imagine the "how" is more of a trade secret, though. I wonder what the ARM manus do, actually. Going to have to look it up :D
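For context, the reason VIPT can behave like PIPT at all is just arithmetic (my understanding, not a chip designer's): if the cache index bits fall entirely within the page offset, the index is identical for the virtual and physical address, so no aliasing is possible. With 4 KiB pages and 64-byte lines:

    page offset = 12 bits (4 KiB pages)
    line offset =  6 bits (64-byte lines)
    index bits usable without translation = 12 - 6 = 6  ->  64 sets
    max alias-free cache size = 64 sets x 64 B x ways
                              = 4 KiB x ways  (e.g. 8-way -> 32 KiB L1)

Growing L1 beyond that means adding ways or aliasing tricks, which is the index-size limit mentioned downthread.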
Original ARM (as in Acorn RISC Machine) did VIVT. Interestingly, to allow the OS to access physical memory without aliasing, the ARM1 only translated part of the (26-bit) address space; the rest of it was always physical.
Nowadays you don't see it, precisely because of the problems with aliasing. Hardware people would love to have these back, because having to keep the cache index within the page offset is what limits L1 cache size today. I hope nobody actually does it, because it's a thing you can't really abstract away, and it interacts badly with applications that aren't aware of it.
Somewhat tangentially, this is also true for other virtual memory design choices, like page size (Apple silicon had problems with software that assumed 4096-byte pages). I seriously wish CPU designers wouldn't be all too creative with such hard-to-abstract things. Shaving a few hundred transistors isn't really worth the eternal suffering of everyone who has to provide compatibility for it. Nowadays this is generally recognised (RISC-V was quite conscious about it); pre-AMD64 systems like Itanium and MIPS were the total wild west about it.
Another example of a hard-to-abstract thing that is still ubiquitous is incoherent TLBs. It might have been the right choice back when SMP and multithreading were uncommon (a TLB flush on a single core is cheap), but it certainly isn't anymore, with IPIs being super expensive. The problem is that it directly affects how we write applications. Memory reclamation is so expensive it's not worth it, so nobody bothers. Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere. Which means it's hard to quantify the real-life benefit of TLB coherence, which makes it even less likely we'll ever get it.
The original ARMs (ARM1 and ARM2) were cacheless; the ARM3 was the first with a cache.
The CPU’s 26 bit address space was split into virtually mapped RAM in the bottom half, and the machine’s physical addresses in the top half. The physical address space had the RAM in the lowest addresses, with other stuff such as ROMs, memory-mapped IO, etc. higher up. The virtual memory hardware was pretty limited: it could not map a page more than once. But you could see both the virtually mapped and physically addressed versions of the same page.
RISC OS used this by placing the video memory in the lowest physical addresses in RAM, and also mapping it into the highest virtual addresses, so there were two copies of video memory next to each other in the middle of the 26 bit address space. The video hardware accessed memory using the same address space as the CPU, so it could do fast full-screen scrolling by adjusting the start address, using exactly the same trick as in the article.
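For the curious, the same double-mapping trick can be sketched in userspace on modern Linux (a hypothetical analogue, not how RISC OS did it; `memfd_create` provides the backing pages here):

    // Map the same pages twice, back to back, so a "scroll" is just
    // moving the start pointer instead of copying pixels.
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    void *double_map(size_t len) {            // len must be page-aligned
        int fd = memfd_create("vram", 0);     // anonymous file as backing
        if (fd < 0 || ftruncate(fd, len) < 0) return MAP_FAILED;

        // Reserve 2*len of contiguous address space, then map the same
        // file into both halves so addr[i] aliases addr[i + len].
        char *addr = mmap(NULL, 2 * len, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) { close(fd); return MAP_FAILED; }

        mmap(addr,       len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(addr + len, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);                            // mappings keep the pages alive
        return addr;
    }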
> Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere.
Curious if you could expand on this a bit; memcpy still requires that both buffers be mapped in anyway. Do you mean that avoiding mappings is more important than avoiding copies? Or is there something inherent about multiple linear addresses pointing at the same physical address that is somehow slower on modern processors?
Assume an (untrusted) application A wants to send a stream of somewhat long messages (several tens of KB, multiple pages each) to application B. A and B could establish a shared memory region for this, but that would potentially allow A to trigger a TOCTOU vulnerability in B by modifying the buffer after B has started reading the message. If page reclamation were cheap, the OS could unmap the shared buffer from A before notifying B of the incoming message. But nowadays unmapping requires synchronizing with every CPU that might hold a TLB entry for A's mapping, so memcpy is cheaper.
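Roughly, the two designs look like this (a pseudocode sketch; `task_t`, `unmap_from`, and friends are invented names, not any real OS API):

    // Revoke-before-notify: safe against TOCTOU, but the unmap needs
    // TLB shootdown IPIs to every CPU that may cache A's mapping.
    void send_by_remap(region_t *buf, task_t *b) {
        unmap_from(current_task, buf);   // the expensive part
        map_into(b, buf);
        notify(b, buf);                  // B now reads without TOCTOU risk
    }

    // What we do instead: copy into B's private buffer. For messages of
    // tens of KB, the copy beats the cross-CPU synchronization.
    void send_by_copy(const void *msg, size_t len, task_t *b) {
        memcpy(b->inbox, msg, len);
        notify(b, b->inbox);
    }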
That implies AArch64 support, which many hobby OSes don't have, usually because the introductory osdev material is written largely for x86.
But yes, raspi is a good platform if you are targeting arm.
As I'm also designing an OS, my biggest piece of advice for anyone seriously considering it is to target two archs at once, in parallel. Then adding a third becomes much easier.
Raspberry Pi has a bizarre boot sequence and bring-up process, much of which is not open and not implemented in open source code. I think it's probably not a great platform for this sort of thing, despite being decently well documented.
(And even then, its USB controller, for example, has no publicly-available datasheet. If you want to write your own driver for it, you have to read the Linux driver source and adapt it for your needs.)
For anyone who hasn't fallen into this rabbit hole yet, it's a good one: Raspberry Pi started out as a kind of digital billboard appliance, so they chose a GPU with efficient 1080p decoding and strapped a CPU to the die. On power-up, the (proprietary) GPU boots first and then brings up the CPU.
That's as far as I got before discovering that the Armbian project could handle all of that for me. Coincidentally, that's also when I discovered QEMU, because 512MB was no longer enough to pip install pycrypto once they switched to Rust and cargo. My pip install that worked fine with earlier versions suddenly started crashing due to running out of memory, so I got to use Armbian's facilities for creating a disk image by building everything on the target architecture via QEMU. Pretty slick. This was for an Orange Pi.
The "color gamut" display, as you call it, is a GPU test pattern, created by start.elf (or start4.elf, or one of the other start*.elf files, depending on what is booting). That 3rd stage bootloader is run by the GPU which configures other hardware (like the ARM cores and RAM split).
You could probably skip some of the difficult parts by bringing in an existing bootloader that can provide a UEFI environment (it's how Linux & the BSDs boot on ARM Macs). But Serenity is all about DIY/NIH.
RISC-V is the new hotness, but it has limited usefulness in general-purpose osdev at the moment due to slower chips (for now) and the fact that not many ready-to-go boards use them. I definitely think that's changing, and I plan to target RISC-V; I have just always had an x86 machine, and I have built some electronics that use aarch64, so I went with those to start.
Kernel is still in early stages but progress is steady - it's "quietly public". https://github.com/oro-os