I have a theory that CPU architectures implicitly have something I call the "architectural weirdness budget", which is the extent to which you can get away with design decisions in the architecture or ABI that differ from the existing established consensus. Any time you do something a bit odd that means that existing software has to do awkward things or might not be portable to your new architecture without changes, you're spending a bit of your weirdness budget. Some examples:
+ HPPA's upwards-growing stack
+ SPARC register windows
+ PowerPC's slightly oddball varargs ABI
+ CHERI capabilities are massively spending weirdness budget
+ these days, being big-endian is spending weirdness budget
Doing a few odd things is fine, but if you blow your weirdness budget entirely then your new architecture is in danger of failing because you've cumulatively made adoption of it too expensive for the existing software ecosystem. So you need to be careful that you're spending it on things that really matter; if something's not a big deal, do it the same way everybody else does. (Certainly some of the things in the list above are deliberate choices for things that are, or at least seemed at the time, important. But nobody tried to do all of them at once!)
Of course if you're a pre-existing dominant architecture like x86 then you get a lot more freedom to do weird things :-)
> CPU architectures implicitly have something I call the "architectural weirdness budget"
The point of designing a new architecture is that you have a point to make (generally, "doing X will lead to faster execution"). So by definition you are adding unfamiliar architectural weirdness, else why get involved?
The big problem is that the pervasiveness of the C model has fossilized design decisions of the PDP-11 that still haunt us to this day. For example a lot of transistors are devoted to hiding the effects of the micromachine (superscalar implementation, OOO execution etc). Newer paradigms (even trivial older ones like vector instructions) don't get used enough because of higher level language constraints (e.g. C's memory aliasing model). There's a lot of innovation going on but it's all stuck in one small bay of a large lake.
> the pervasiveness of the C model has fossilized design decisions of the PDP-11
This is tangential and somewhat pedantic, but I want to point out an oft-repeated mischaracterization.
C is essentially a refinement of B with some additions [1], and B was written for the PDP-7, which was a very different machine.
For example, the increment and decrement operators in C were inherited from B; they could not have been modelled after the PDP-11. From Dennis Ritchie himself [2]:
> Thompson went a step further by inventing the ++ and -- operators, which increment or decrement; their prefix or postfix position determines whether the alteration occurs before or after noting the value of the operand. They were not in the earliest versions of B, but appeared along the way. People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed. The PDP-7, however, did have a few "auto-increment" memory cells, with the property that an indirect memory reference through them incremented the cell. This feature probably suggested such operators to Thompson; the generalization to make them both prefix and postfix was his own.
Another person puts it this way [3]:
> It's a myth to suggest C's design is based on the PDP-11. People often quote, for example, the increment and decrement operators because they have an analogue in the PDP-11 instruction set. This is, however, a coincidence. Those operators were invented before the language [i.e. B] was ported to the PDP-11.
[1] Notably, the char type, since the PDP-11 was byte addressable but the PDP-7 was not.
>[1] Notably, the char type, since the PDP-11 was byte addressable but the PDP-7 was not.
"char" is a useful type for the program's problem domain regardless of whether the machine is byte-addressed or not. Pascal was developed on machines that were not byte-addressable but always had a char type. This is easily visible in the PACK and UNPACK standard functions and the PACKED keyword for arrays and records. It is slow to do access to a random char from an array packed several to a word, so there are standard functions to copy everything from a packed array to an array that uses one word per item for easier processing, and back again.
If anything, you need the compiler to support the char type more if you don't have a byte-addressed machine because it's such a pain for the programmer to deal with themselves.
Standard Algol 60 didn't have "char" but some manufacturers added it non-standard. Algol 68 had char. These ran on primarily word-addressed machines too, as was the fashion in those days.
Yet all these languages were limited to the 8-bit byte. C on the PDP-10 and -20 only had an 8-bit byte, while all other languages on those machines had native access to bytes of width 1-36 bits — yes, std::vector<bool> in hardware, but more powerful.
Though a general-purpose machine, it had instructions specifically designed for low-cost Lisp implementation, arbitrary numbers of stacks, etc., all inaccessible to C.
There is no requirement in C for char to be 8 bits. Even in modern C it's allowed to be bigger than that. For sure it is a bit limiting that in C everything else has to be a multiple (possibly x1) of the size of char. Other languages don't have that restriction.
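A minimal sketch of that point (my illustration, not part of the original comment): CHAR_BIT from <limits.h> is only required to be at least 8, and sizeof measures every other type in units of char.

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* CHAR_BIT is only guaranteed to be >= 8; some DSP toolchains use 16 or 32. */
    printf("bits per char: %d\n", CHAR_BIT);

    /* sizeof is measured in chars, so every object size is a multiple
       (possibly x1) of sizeof(char), which is 1 by definition. */
    printf("sizeof(int)  = %zu chars\n", sizeof(int));
    printf("sizeof(long) = %zu chars\n", sizeof(long));
    return 0;
}
```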
Algol68 allowed char to be as small as 6 bits, the size on the ICL 1900 it was first developed on (with a 24 bit word size).
Pascal was first implemented on the CDC 6600, with 60 bit words and characters again 6 bits. The first other machine Pascal was ported to was, again, the ICL 1900.
Yes, absolutely you should have a point to make with your new architecture, and that's likely to involve at least some weirdness. What I'm trying to get at is that the weirdness budget isn't infinite, and therefore you want to be careful to spend it on making your point, not frittering it away on incidentals.
> It also constrains some optimizations like reordering.
Ha! As if C compilers really cared about correctness. This is why type-punning and aliasing memory through different typed pointers is fraught with peril, often UB. It's so C compilers can cheat and use type-based alias analysis, breaking programs that have "UB" but would work absolutely fine if the compiler wasn't so aggressive.
In general, almost all arguments between language design and optimization for C/C++ are settled with "let's not think too hard about how to help users; go ahead and optimize this and make it their fault (UB) when they observe the hard cases". It's so rude.
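To make the type-punning complaint concrete, here is a hedged sketch (my own example, not code from the thread): with type-based alias analysis the compiler may assume an int* and a float* never point at the same storage, so the first function below can legally be "miscompiled", while the memcpy version stays well-defined.

```c
#include <stdio.h>
#include <string.h>

/* Under strict aliasing, the compiler may assume *ip and *fp never overlap,
   so it can cache *ip across the float store. Calling this with pointers to
   the same storage is UB, and TBAA at -O2 is what makes it visibly break. */
static int punned_read(int *ip, float *fp) {
    *ip = 1;
    *fp = 2.0f;   /* assumed not to modify *ip */
    return *ip;   /* may be folded to 1 without reloading memory */
}

/* The well-defined way to reinterpret bits: copy the representation. */
static float bits_to_float(unsigned bits) {
    float f;
    memcpy(&f, &bits, sizeof f);   /* defined; usually compiles to a plain move */
    return f;
}

int main(void) {
    union { int i; float f; } u;
    printf("%d\n", punned_read(&u.i, &u.f));     /* may not reflect the float store */
    printf("%g\n", bits_to_float(0x40000000u));  /* prints 2 (IEEE-754 bits of 2.0f) */
    return 0;
}
```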
This is roughly what happens, but you shouldn't attribute this to rudeness. Compiler users also demand performance. They compare implementations (and languages) and point out when such-and-such is X% slower. They file bugs about "missed" optimizations and "unnecessary" instructions.
C is especially awkward here, because compilers are expected to conform precisely to the C specification, but at the same time C users don't actually want to program for the abstract machine from the C specification (the one that can't overflow signed integers or shift by a full int width). C users expect their programs to actually behave like programs for the particular hardware they target, without any overhead or discrepancy from "emulating" the imaginary C-spec machine. In the face of such conflicting requirements, C compilers will always be forced to balance performance, predictability, conformance, and compatibility.
You make some good points. C was often used in the situation where "Hey, we just built this enormous supercomputer for LLNL, let's make this C program run as fast as possible on this one machine. Compiler team, please do everything possible to get the maximum performance out of this machine." I feel like a lot has followed from this repeating dozens of times over the past few decades. Performance has been the #1 priority, above all else, and it shows.
Not sure if the S-1 is a good example — IIRC it still had a 36-bit word and was mainly intended for pl1 and lisp (I only really saw it from the S-1 Lisp side). Was there even a C for it? It would have been as bad as running C on a Cray YMP (which I mention above).
What exactly do you mean by the "C model"? Computers have followed the Von Neumann architecture for quite a long time. That PDP-11 is fundamentally the same as a modern computer, albeit millions of times slower.
And I'm curious as well what "new paradigms" you have in mind.
The C language and portable C programs do not in any way require a Von Neumann architecture. In fact with the modern trend for phone/tablet/desktop/server OSes to use W^X memory permissions they effectively mandate programs written in Harvard-compatible style even if they technically have code and data in the same address space.
Once instruction sets got features such as index (or base) registers and register indirect addressing (including stack addressing) instead of absolute addressing (including for indirect jumps/subroutine calls) there was no longer any need to support self-modifying code during execution of the program itself, and Von Neumann architecture is not required except for initial loading of the program code into memory by the OS.
I'd think most modern instruction sets are easily capable of running with code and data in truly different address spaces.
The only real difficulty is in loading constant data and especially constant tables from program space. Microcontroller ISAs such as AVR have special instructions for program space loads. Any ISA where the PC is a GPR (PDP-11, VAX, arm32), or that has an explicit PC-relative addressing mode (preferably indexed) such as the 68000, makes it easy to detect that a memory reference is PC-relative and do it in the program space instead of the data space.
Arm64 with ADR and ADRP and RISC-V with AUIPC make it tricky because hardware would have to track that a GPR contains a pointer derived from the PC. x86 and PowerPC make it even more difficult, because the only way to get the PC value is to do a fake function call to the next instruction (or to keep return stack prediction happy, to a real function that just saves the return address then returns). On these ISAs you're probably better off assembling all constants using load immediate and shifts, and arrays or tables of constants using computed jumps to load immediate instructions.
In short, using a conventional modern ISA with Harvard architecture is doable if you want to.
So which language has handled ILP, OOO, delay slots, cache hints, etc. particularly well? Anything that's ergonomic and actually used (i.e. not awkward and academic)?
But that's my point: the dominance of a C monoculture has relegated alternatives to being "awkward and academic".
When you're in a C straitjacket it's hard to take advantage of other architectures like the Transputer, Connection Machine, or Cell. Ever run BSD Unix on a Cray? It was dog slow because the CPU assumed very deep pipelining and always had to continually resynchronize because of the frequent branches in the C code.
GPGPU is the sole recent exception, and even then it's still at the margins. And we have CUDA which is an attempt to, yes, get back to the C model.
These modern CPUs benchmark pretty well, but in practice little code can really take advantage. For example the M1 is a multicore screamer, yet most apps are still restricted to a single one.
It's far less popular than the C/C++ bindings (at least at an end-user level), but NVIDIA actually has official support for FORTRAN on CUDA as well. iirc you can mix and match at the linker level, and there's a fair amount of library code that uses FORTRAN for performance reasons.
FORTRAN has tighter restrictions around what you can do in terms of aliasing/etc, which allows compilers to be more aggressive about optimizations without as much undefined-behavior "this looks correct and will run fine until we add another optimization and it doesn't" shenanigans. So in practice it can be faster than C, and it's actually quite popular among the HPC community and other performance-sensitive segments.
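For a concrete comparison (my sketch, using only standard C99): Fortran's dummy arguments are non-aliasing by default, while C only gets the same licence when the programmer opts in with restrict.

```c
/* Without restrict, the compiler must assume dst and src may overlap,
   so it has to reload src[i] after every store and can't vectorise freely. */
void scale(float *dst, const float *src, float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* With restrict, the no-aliasing promise (roughly what Fortran arguments get
   by default) lets the compiler keep values in registers and vectorise. */
void scale_restrict(float * restrict dst, const float * restrict src,
                    float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```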
It's not like C has to be this way, with all the "undefined behavior" nonsense, it just is an artifact of a programming language written in the early 70s. C makes a lot of sense when a compiler looks a lot more like an assembler with a bit of optimization sprinkled in, and isn't performing super deep introspection and deciding that this entire function can be optimized away because of some arcane "undefined behavior" rule.
I'd say that the most popular option at the end-user level today is a whole level of abstraction above: using Python w/ PyTorch (or less frequently, TensorFlow).
These are all microarchitectural things that should not be surfaced in language, IMHO. This is why compilers exist, to raise the level of abstraction. For example, other than sidechannels (i.e. Spectre), OOO is not observable, nor is ILP or cache hints. Delay slots are observable at the ISA level but typically compilers do the work of dealing with them. Thank god delay slots didn't make it into programming languages; they were a wart of ISA design surfaced because of a lack of decent branch prediction. Can you imagine if you had a bizarre feature where the next line of code after a branch still executed even if the branch was taken? That'd be bonkers.
OOO is definitely observable when dealing with multithreading.
That's why you have things like memory order constraints on atomic operations in C, and why you need to be extremely careful with how you place things like memory fences when doing anything fine-grained between several threads.
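A minimal sketch of the C11 machinery being referred to (my example, not from the thread): publishing a value with release/acquire ordering so the consumer can never observe the stores reordered.

```c
#include <stdatomic.h>

int payload;                       /* plain data being published */
atomic_int ready;                  /* flag guarding it */

void producer(void) {
    payload = 42;
    /* release: the payload store may not be reordered after this flag store */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void) {
    /* acquire: once we see ready == 1, the payload store is visible too */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                          /* spin */
    return payload;                /* guaranteed to read 42 */
}
```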
Fair point, but I'd qualify that OOO is only observable for racy programs. In languages with shared memory concurrency, but safety, races are either statically disallowed (Rust), or their results are constrained by a (weak) memory model. Java, e.g. only makes rough guarantees that you don't see out-of-thin air values and that you can't violate the type system. Java also is a little stricter around volatile memory access (they are sequentially consistent, iirc). Also worth noting, some reorderings that you see are not due to hardware, but due to compiler optimizations. Compilers have to obey the memory model, too.
Generally, most programmers shouldn't need to be reasoning about OOO or memory order and be using fences. Rather, they should use libraries that are written by experts that use extra-lingual mechanisms (like compare-swap or fences) to guarantee a higher contract.
Rust has atomic types that can observe races. Such "racy" code is useful because some concurrent code doesn't need the strictest guarantees (sequential consistency). Though, compared to C, Rust does not use the "Consume" memory ordering because so far it has turned out to be ill-defined.
AFAICT those can't observe OOO of CPUs, just nondeterminism from interleavings of different threads. I.e. they will give you sequential consistency unless you have other, worse races.
Sequential consistency is absolutely optional, there are several other possible memory orderings that are far weaker. When programs involve multiple atomic locations concurrently, atomicity is only preserved at each separate location.
Think about it the other way around: if the language were sufficiently high level, the compiler could give the CPU more opportunities for OOO. C incorporates too many assumptions about the PDP-11. Sometimes they are harmless (the ++ operator) and sometimes disastrous (memory model).
I once worked on the MC88K which had this behavior at the instruction level, and I think other contemporary RISC processors did as well. Never saw it in a higher-level language, though. Maybe some variant of SNOBOL (which already had success/failure branches for every line)?
MIPS delay slots were another example of this. Arguably Itanium's failure was due to exceeding the weirdness budget with VLIW nonsense, leading to the default PC 64-bit architecture being AMD's "x86 with longer registers" instead.
What happened to the Mill "unlimited weirdness budget" CPU guys anyway?
> Arguably Itanium's failure was due to exceeding the weirdness budget with VLIW nonsense
Being VLIW was probably the least weird part of Itanium. The NaT ("not a thing") bit on every register was the weirdest part and probably the one which tripped people the most. And like SPARC, it had register windows. It also had a pair of stacks instead of a single one, a set of predicate registers (which also had their own register windows), and plenty of other assorted weirdness.
The Mill guys post one or two updates per year on their forum site. Apparently COVID messed up their timeline a lot, but they aren't bankrupt or anything.
I think they don't have a rich enough VC friend, and they need a kick up the arse before (to be blunt) they get too old.
Ivan was saying that they weren't sure about how to get money, and then someone quite rightfully pointed out that this is probably the most liquid time in years for funding startups.
Mill team if you're reading this: Get your IP sorted out then open source everything, then start talking to VCs aggressively. Not the other way round because everyone thinks you're either a joke or a group of fantasists.
Cool ISA, hopeless execution. Where's the FPGA model?
I think they just have a small team. Most recent work has been on their software stack (emulator, compiler etc), which I think is fairly close to release, but I could be wrong about that.
I might be missing something, but I never figured out where Mill was saving all those gates they claimed. It could be my lack of imagination, but the straightforward way to implement the belt just looks like a ROB and bypass network, just like what a normal OoO core spends its gates on.
And on the frontend, sure they mark instruction boundaries within an instruction bundle, but you still need N barrel shifters (where N is the maximum number of instructions per cycle) to get at them, and that's the expensive part of wide decode. Even 16-bit alignment is so much nicer for the critical path than 8-bit; bit-aligned is absolutely crazy town.
Bit aligned instructions? Yeah, that definitely sounds mad, like my joke proposal that instructions should be Huffman-coded so the frequent ones take fewer bits. Might make sense if you're doing a microcontroller with a few K of ROM and need to save bits, but otherwise no.
I'm not sure that you'd do that even on a microcontroller. If you don't have the ROM space, you probably also don't have the transistor budget for the decoder you'd need to handle that.
I think they are claiming a savings on a register file, which is not nothing, but not exactly a ton. Also they save on register renaming and scoreboards, which have some gate count.
The biggest thing that they seem to have is that they can compute an enormous number of instructions per cycle without blowing up the size of the bypass network.
If their goal is to save on register file space, they're barking up the wrong tree, IMO. The Skylake-X die[0], for example, dedicates only about 1% to the integer register file, and only about 2.5% for the floating point register file. The decoder and scheduler also take up only single digit percents.
Granted, this is a top of the line superscalar processor with AVX-512 and megabytes of cache, but my point still stands. The vast majority of the gates are in everything else: cache, execution units, branch prediction, load/store, die interfaces, etc.
It's supposed to be a valid way to use the chip. The Russian military has a lot of Windows systems and wants to run them on domestically developed chips that don't have a management engine.
The Mill weirdness is weird in the sense that it's almost perfectly compatible with C-like code. It's not this that is impacting the project.
Instead, launching a new architecture is a herculean task. Trying that inside anything that isn't a huge company, or without consensus in the academic world, has a history of nearly 100% failure since huge companies started building computers. (And even consensus in the academic world is iffy.)
As a former microarchitect, I'd say the contortions designs take on are due to a combination of factors, usually stemming from routing/timing problems in the design or backwards compatibility, and every now and then, compiler team requests.
For example, endian-ness looks like weirdness to a programmer, but not to a chip designer that needs to read in lower-order bytes first to decode an instruction quickly. Why not just change the instruction decoder? Because it needs to be backwards compatible to what was once a byte-sized ISA.
But for programmers who don't understand the history, it seems insane.
First off, most processors are not BE. Arm and X86 are LE and cover "most processors".
Second, processors like the Motorola 68000 were 32-bit, so there was no need to read just the first byte. Refer to my point about instruction fetch. Like I said, if you have an ISA that came from byte-sized opcodes and then expanded later, you need to fetch bytes first. However, if your opcodes are 32-bit, then the code is bigger. (In the old days, CISC code size was a selling point compared to RISC; this opcode compression was one of the reasons.)
This is what you learn in first-year computer architecture. However the world has changed a lot since the 80's. Single bytes aren't fetched in x86, 256-byte cachelines are, and the decode happens on that.
ARM and x86 are only two out of many processors: you are purposely choosing to confuse market dominance with what exists, and I fault you for it, because by behaving in this way, you are helping perpetuate the monoculture.
The moment you help perpetuate a monoculture rather than subvert it and fight it at every turn and opportunity, you become part of the problem.
ARM spends it on pc being wrong: when you mov from it, it's off by 8 in ARM mode, and off by 4 in Thumb mode, and if you're using it for pc-relative loads (and stores, if you're crazy), it uses a word aligned value.
edit: oh, and softfloat vs softfp vs hardfp vs vfp
edit: oh, and how they have two incompatible assembly language dialects that are mostly the same, but in non-trivial code, incompatible
Heh, having dealt with x86 for years, this is comparatively such a nothing burger. It's always a simple known fixed offset.
> softfloat vs softfp vs hardfp vs vfp
That's something not really unique to ARM per se. Any architecture with options for a hardware FPU is going to practically need ABI specs for the soft and hard cases (you don't absolutely need anything more than hardfp since you can always emulate, but it will be slow as shit) - and ARM is certainly not unique in having multiple hardware floating point implementations either.
> two incompatible assembly language dialects
Curious what you're referring to here - but I personally wouldn't consider assembly language dialects to be part of a CPU architecture.
For traditional 32-bit ARM, off the top of my head:
- Every instruction being conditional (all instructions have a four-bit condition field, with one of the 16 possible modes being "always");
- The barrel shifter, which can be used on nearly every data processing instruction;
- The program counter being one of the general-purpose registers (and on the original ARM, the same register also containing the flags), so that any register move can alter the program flow;
- The load-multiple/store-multiple instructions, which can load or store up to 16 registers, plus incrementing or decrementing the base register; and since the program counter is one of these registers, it can restore several registers from the stack, update the stack pointer, change the program counter, and switch to Thumb mode (stored on the least significant bit of the program counter), all in a single instruction.
"always" is not the weird one -- that's just the same as everyone else. "Never" is weird, immediately spending 1/16th of the opcode space (256 million instructions) on NO-OPs.
PC being a general-purpose register was historically not uncommon. PDP-11 and VAX both did it and they were kinda popular at one time.
Load/store multiple was also fairly common with, for example, both 68000 and VAX having it. IBM 360 also, though using a register range rather than a bitmap -- a less general solution, but good enough, and much easier to make go fast.
> "always" is not the weird one -- that's just the same as everyone else. "Never" is weird, immediately spending 1/16th of the opcode space (256 million instructions) on NO-OPs.
Worse, iirc what looks like it should be the slot for the "never" condition actually does exactly the same thing as "always".
The A64 manual says 1111 on a Bcc etc disassembles as NV but does the same as AL.
The ARM7TDMI manual says 1111 is reserved and don't use it. I don't know the actual behaviour.
Aha. The Welsh&Knaggs "ARM Book" says before ARMv3 NV meant NV. In ARMv3 and ARMv4 NV is unpredictable. And in ARMv5 NV is used to encode "various additional instructions that can only be executed unconditionally".
Loading and storing multiple registers is by no means weird: the MC68000 family has the movem.(b|w|l) instructions which do exactly that, and it's one of the best things since sliced bread because performance can be gained when used cleverly.
Being able to manipulate the program counter in the ARM processors directly is just being honest, simple and straightforward, rather than having it always done implicitly. Seems very intuitive to me now that I think about it.
Being able to manipulate the program counter directly plays hell with a superscalar and especially OoO processor where you want to be able to predict what the program counter does very accurately so the instruction fetch and decode can run far ahead of the execution.
There are four kinds of instructions that play hell with pipeline and OoO design:
- instructions that might cause traps, dependent on the values processed
- instructions that you don't know whether they will change the control flow
- instructions that you don't know where the control flow is going to go to
- instructions where you don't know how long they will take to execute
RISC-V, for example, bans the first category entirely other than load/store, and carefully separates the other three so any one instruction has at most one of those problems.
ARM load multiple has all of those problems. At least you can examine the register mask at instruction decode time and know whether it will change the PC or not and tag the instruction in the pipeline as being a Jump or not. Imagine if there was a version that took the bitmap from a register instead of being hard-coded...
Load/store multiple don't increase performance much if at all on a CPU with an instruction cache and/or an instruction prefetch buffer. On an original 68000 or ARM without any cache, sure, a series of load or store instructions requires interleaving reading the opcodes with reading or writing the data, while load/store multiple eliminates the opcode reads. An instruction cache also eliminates them, leaving only the code size benefits. But load/store multiple is a perfect candidate for using a simple runtime function instead, at least if you have lightweight function call/return as RISC designs usually do.
"An instruction cache also eliminates them, leaving only the code size benefits."
The performance of movem.l in the MC68000 comes not from multiple load, but from multiple store, because the main memory access incurred a tremendous, extremely punitive penalty. This has not changed, even decades later, in systems with the fastest memory chips available: any writes to random access memory incur tremendous penalties.
RISC-V spends it on the "R": either not having single instructions to do certain stuff, or that those instructions are in one of the extension blocks so you have to customize your binaries to a particular RISC-V subset. Most architectures only do this for the high-performance SIMD numeric instructions.
You keep saying this about RISC-V and it keeps not being true. Instruction sequences that fuse are standardized. RISC-V defines various profiles (like "Unix server") which mandate a minimum set of extensions. Extensions beyond the mandated ones will be detected at runtime, just like on x86.
Modern x86 and ARM both rely on fusing a compare instruction with a following conditional branch -- something RISC-V doesn't have to do as conditional branches already incorporate the compare and there are no condition codes.
So if fusion is a weirdness it's a nearly universal one.
The good thing about fusion is the program works fine if you don't do it, so low end minimal area CPUs such as microcontrollers can just not bother.
So searching by the terms you give, the actual spec lists two possibilities: a lui/jalr pair and an auipc/jalr pair. Plus, they're not given as 'these are the blessed instruction pairs high performance implementations should seek to fuse' but more as 'you can do this if you want'. You could maybe count the return address stack hints, but they're not about instruction fusion, just a common branch predictor optimisation.
An unsourced table on Wikichip (which was added in 2019 and not updated since) barely counts as a proposal (in terms of it standing a good chance of becoming part of a ratified RISC-V standard).
No doubt many high performance implementations will choose to use fusion and no doubt they'll all go for different combinations, with different edge cases. Yes, there will likely be significant overlap, but it could become a bit of a nightmare for compiler writers. A thorough standardized list of instruction pairs to fuse would definitely help here, but we don't have one.
just one example of weird insanity from that table:

    slli rd, rs1, {1,2,3}
    add  rd, rd, rs2

...fused into a load effective address.
...this is so insane. Whoever thought that this is okay and good has, in my opinion, severe psychological and psychiatric problems and would do well to seek professional help. If this gets "fused" into a lea, why not just implement an opcode for lea? I'm just completely at a loss as to how messed up that is.
You know what, I'd like to know what a person who thinks that this is okay looks and behaves like.
The Bitmanip "LEA" instructions were added primarily for sh1add.uw, sh1add.uw, and sh3add.uw which not only shift and add but also zero-extend the lower 32 bits of the rs1 register to 64 bits before shifting and adding them.
Thus they are replacing not two instructions but three. This addition was indicated because of the number of critical loops in legacy software that, against C recommendations, tries to "optimise" code by using "unsigned" for loop counters and array indexes instead of the natural types int, long, size_t, ptrdiff_t. This can indeed be an optimisation on amd64 and arm64, but it is a pessimisation on RISC-V, MIPS, Alpha, PowerPC.
One codebase that uses "unsigned" in this way is CoreMark and they explicitly prohibit fixing the variable type. But it's also common in SPEC and in much code optimised for x86 and ARM in general, where using "int" pessimises the code. If they used long, unsigned long, or the size_t or ptrdiff_t typedefs the code would run well everywhere.
While the .uw instructions were being added, it was very low cost to add the versions using all the bits of rs1 at the same time.
So, in the context of this discussion, having 32-bit operations sign-extend their results to 64 bits is a weirdness. More ISAs do this than zero-extension, but the most common ones in the market zero-extend. Note that at the time RISC-V was designed arm64 was not yet announced, so only amd64 did zero-extension.
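A hedged C sketch of the pattern described above (my own example): with a 32-bit unsigned index, wraparound at 2^32 is defined, so a 64-bit target whose 32-bit operations sign-extend (RISC-V, MIPS64, etc.) has to insert zero-extensions into the address arithmetic; sh1add.uw/sh2add.uw/sh3add.uw fold that extension plus the shift-and-add into one instruction. A size_t index sidesteps the issue entirely.

```c
#include <stddef.h>

/* 32-bit unsigned index: defined wraparound forces the compiler to keep i
   to exactly 32 bits, so sign-extending ISAs need a zero-extension before
   the shift-and-add that forms &a[i]. */
long sum_u32(const long *a, unsigned n) {
    long s = 0;
    for (unsigned i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Full-width index: no extension needed in the address arithmetic. */
long sum_sz(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```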
The hoops they had to go through to get PIC address calculations to work make it quite weird. Because `auipc` adds an offset from its `pc`, the corresponding `add` or `lw` relocation needs to refer back to that instruction rather than the symbol it's actually looking for.
The poor ELF specification ends up quite tortured by this, IMO.
Arm64 is even worse! There is almost exactly the same instruction, but it also zeroes the low bits of the target address, so as you relocate code you also have to change the offset in the 2nd instruction even if the distance between the reference and the target stays the same.
Probably on copying Arm too much :-) I hit "char is unsigned" (not really architectural, more of a toolchain issue) only this morning.
RISC-V was originally going to implement a hypervisor mode which would only have worked with Xen-like hypervisors. Luckily we were able to head that off early and the actual hypervisor extension we got can run KVM efficiently.
Not sure if you'll see this - going through old tabs - but I was curious, and just in case: what are the practical differences between Xen and KVM here?
Not having condition codes actually cleans up the semantics a lot - no need to define the CC effects for each and every insn. It's also helpful to high-performance implementations, since the condition code register would otherwise enter as a dependency in every insn.
Sure, there are reasons for it being that way, but it makes it stick out from the mainstream. For example, if you're writing a codegen, before RISC-V you could assume that you had CMOVE and now you can't. The risk is then that the RISC-V backend will emit some clumsy sequence that emulates CMOVE, to make it fit in with the others.
The weirdest things about RISC-V are having very weak addressing modes in comparison with almost all other CPU architectures and also having extremely weak support for detecting integer overflow.
Going back to the original H&P texts and RISC philosophy, there is a claim that auto-increment addressing modes are too expensive due to the extra port in the RF + scheduling complexity. For an instruction set that should support embedded in-order dual-issue processors with math-intensive loops, this is arguably a "weird" omission in the baseline.
Lack of condition codes when the dominant paradigm (x86, ARM) is condition codes is weird. However, having 32 registers unlike x86_64's 16 makes up for some of it.
The base ISA is too bare-bones and their bitfield extension is weird.
In RISC-V with the C extension, 32-bit instructions are 16-bit aligned.
Far from x86 levels of weirdness, but it must be quite annoying for those who want to make 'high performance' CPUs...
Not really. It's no different to ARM Thumb2 which also has two lengths. NanoMips and IBM 360 have three lengths (16, 32, 48 bits).
I've looked at wide RISC-V decoder design and the variable length is no problem at all out to at least decoding 32 bytes of code per cycle i.e. eight 32 bit opcodes or sixteen 16 bit opcodes, or somewhere between for a mix (average would usually be about 11-12 or so).
You just need 8 decoders that can decode any instruction, plus 8 decoders that only have to understand C instructions. The 16/32 decoders each need a 2:1 mux in front of them selecting either bytes 0..3 or 2..5 from a six byte window. They always output a real instruction. The C-only decoders will sometimes be told just to output a NOP instead [1]. Each decoder type needs a 1 bit input to tell it which option to take. Those inputs can be chained like carries in a simple adder, or they can be calculated in parallel like in a carry-lookahead adder. For an 8-16 wide decode you need a 2-deep network of LUT6 to do this (in FPGA terms .. also not very deep in SoC terms).
Note that this is a VERY wide machine. Possibly well beyond the point of usefulness given typical basic block lengths and what you can sensibly do in the OoO back end. x86 is currently doing 3-4 wide decode, and Apple M1 is doing 8 wide.
In short: no, it's not a problem.
[1] or not output an instruction at all. Outputting a NOP makes it easier to insert the decoded instructions into an output buffer. Then you need to filter out NOPs later -- which is needed anyway, as programs contain explicit NOPs, OoO machinery turns register MOVE instructions into NOPs by just updating the rename tables, etc.
RISC-V instructions are variable length, from 16 bits up to 192 bits[0]. The instruction stream is self-synchronising though so from a hardware decode point of view it's not a problem, unlike x86.
[0] See "Expanded Instruction-Length Encoding" in the user spec.
RISC-V instructions longer than 32 bits are just a theoretical possibility at this point. An extension escape hatch for the future.
No one has done it, no one seems to be keen to be the first to do it, and even how the instruction length encodings work is not a ratified part of the spec -- it's just a proposal at the moment, even for the next step of 48 bit instructions.
There has been discussion of encodings better than the one proposed in the current spec, especially around instruction length encoding schemes that would make more opcode bits available in 80 bit instructions than in the scheme in the spec, so as to have a possibility of encoding 64 bit literals in an 80 bit instruction.
I don't think the instruction stream is self synchronizing; if you jump to the middle of an instruction there's no guarantee you'll ever get back to not parsing garbage.
I didn't put that very well. I didn't mean it was self-synchronising when executing, but that you can (I think?!) always find the next instruction boundary by looking at the bottom bits. At least, that's my understanding from reading that part of the user spec.
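As a sketch of what "looking at the bottom bits" means (my code, following the "Expanded Instruction-Length Encoding" table; as noted above, the encodings beyond 32 bits are still only a proposal): the length of an instruction can be read off the low bits of its first 16-bit parcel, without decoding anything else.

```c
#include <stdint.h>

/* Instruction length in bytes, determined from the first 16-bit parcel.
   Per the "Expanded Instruction-Length Encoding" table:
     ..............aa   aa  != 11  -> 16-bit (C extension)
     ...........bbb11   bbb != 111 -> 32-bit
     ..........011111              -> 48-bit
     .........0111111              -> 64-bit  */
static int rv_insn_len(uint16_t parcel) {
    if ((parcel & 0x03) != 0x03)
        return 2;                 /* compressed */
    if ((parcel & 0x1f) != 0x1f)
        return 4;                 /* standard 32-bit encoding */
    if ((parcel & 0x3f) == 0x1f)
        return 6;                 /* 48-bit (proposed, unratified) */
    if ((parcel & 0x7f) == 0x3f)
        return 8;                 /* 64-bit (proposed, unratified) */
    return 0;                     /* longer/reserved encodings: not handled here */
}
```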
You could argue that Arm mostly doesn't get too weird and that that's part of why it succeeded, but some things include:
+ 'char' being unsigned
+ handling of unaligned accesses (in early architecture versions a value is read from the aligned address and rotated, which is useless behaviour that falls out of the original implementation because of how it dealt with byte loads; subsequently it was at least made to fault, but it wasn't until I think v6 that unaligned accesses were made to Just Work)
Wasn't there also the long jump limitation weirdness that forced it to create little hopping points, where the instructions were stored to hop to the next point, all of that to get to some memory whose pointer it could not fit into the load instruction's register?
Sorry, it's been a while, but I found the idea of generating those instruction islands into the code quite weird.
+ MSP430 holes in the addressing mode encoding space
I don't know of any new architecture that is choosing big-endian. Almost all of the ones that I can think of that are/were big-endian have added a little-endian mode.
Thank god for that! In Wasm we just decreed little-endian, and I think this was one of our better calls.
You also have to distinguish between the architecture, and an ABI targeting a specific architecture. I believe x86 exception handling is nothing like this on gcc/clang.
This article isn’t really about x86’s weirdness as an architecture so much as it’s about Windows’ unusual implementation choices particular to that architecture.
HPPA upwards-growing stack: far more logical when one thinks about it, because with an upwards-growing stack, the limit of the stack is the available memory, rather than some size decided arbitrarily by someone.
SPARC register windows: once one groks them, they are a feature of wonder! The register windows in an UltraSPARC processor effectively provide 256 virtual registers in a processor with only 32 physical ones, giving one increased performance.
To be honest, I never really understood why more architectures don't have upwards-growing stacks. Visually, one adds things to the top of a stack, after all, so it makes sense to increment the stack pointer. I suppose it's a relic of the old days when folks just set the start of the stack to some remote part of memory and hoped it never grew down into the code or the heap?
> + these days, being big-endian is spending weirdness budget
Hard disagree, big-endian is right and little-endian is wrong. At least, unless you think this year is 2202. I don't know how Intel got this so badly incorrect.
And remember that network byte order is big-endian. Any protocol which emits little-endian bytes on the wire is broken by design.
> To be honest, I never really understood why more architectures don't have upwards-growing stacks.
A downward-growing stack is more intuitive. To access items on a downward-growing stack, you use positive offsets; for instance, sp+0x10 could be the second stack-passed argument to a function. For an upward-growing stack, you have to use negative offsets.
> Hard disagree, big-endian is right and little-endian is wrong.
Little-endian is more natural. With little-endian numbers, the byte b at offset n has value b*256**n; with big-endian numbers, the byte b at offset n has value b*256**(l-n-1), where l is the total length in bytes.
> And remember that network byte order is big-endian.
That's mostly by historical accident; the architectures originally used to implement the network protocols we use nowadays were big-endian. It makes sense in some situations to have on-wire numbers be big-endian, especially network addresses, since it allows the hardware to decide before the whole number arrives; that is less relevant nowadays because not only networks are much faster, but also it's more common to read the whole header to a buffer before making these decisions.
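A small illustration of the offset/value relation above and of network byte order (my example, assuming a POSIX system for htonl): dumping the bytes of a 32-bit value shows a little-endian host storing the least significant byte at offset 0, while htonl produces the big-endian wire order regardless of the host.

```c
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl: host order -> big-endian network order (POSIX) */

int main(void) {
    unsigned int host = 0x01020304u;
    unsigned char b[4];

    memcpy(b, &host, 4);
    /* Little-endian host prints 04 03 02 01: byte at offset n is worth 256^n.
       Big-endian host prints 01 02 03 04: most significant byte first. */
    printf("in memory:   %02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);

    unsigned int wire = htonl(host);
    memcpy(b, &wire, 4);
    /* Network byte order is big-endian no matter what the host does. */
    printf("on the wire: %02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
    return 0;
}
```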
> A downward-growing stack is more intuitive. To access items on a downward-growing stack, you use positive offsets; for instance, sp+0x10 could be the second stack-passed argument to a function. For an upward-growing stack, you have to use negative offsets.
Why would one use negative offsets with the stack pointer instead of positive offsets with the base pointer?
Using the stack pointer, I can never dynamically push something onto the stack without changing the offsets.
> A downward-growing stack is more intuitive. To access items on a downward-growing stack, you use positive offsets; for instance, sp+0x10 could be the second stack-passed argument to a function. For an upward-growing stack, you have to use negative offsets.
I would expect you to use negative offsets to access the previous elements.
On endianness, I'm not making a claim that one or the other is "better" or "right" -- merely that little-endian architectures are pervasive in the current world, and so picking big-endian for a hypothetical new architecture would be unnecessarily forcing it to deal with "this software is buggy on big-endian" issues forever. Which is fine if there is some really strong reason your new architecture should be big-endian; but if it's mere aesthetic preference then the cost probably outweighs the benefit right now.
Network does have a natural order. You can make a routing decision faster when you get the most significant units first. If you get the least first, you have to wait for the entire address before you can route to it.
This is mostly irrelevant today though; the hardware is so much faster and memory so much more plentiful that handling is done by reading the full header (and often full packet) then making the routing or forwarding decision. You don't need to squeeze out every single cycle or save every single byte possible. You have so much headroom saving cycles or bytes makes no difference and you can't optimize enough to make significant BOM reductions by choosing cheaper chips while still offering features considered standard. Even if you could you'd be selling into an ultra-ultra-cheap market segment no one cares about (congrats, you made a $2 10Mbps router that can run on a coin cell battery).
That something doesn't make a big improvement doesn't mean that the improvement does not exist.
If I wanted to show where ordering truly does not matter, I would rather look at encoding schemes. Some versions of Ethernet use https://en.wikipedia.org/wiki/64b/66b_encoding , meaning that the smallest unit that can be read at once is 8 bytes. That removes the ordering from IPv4 addresses, but IPv6 can still get handled earlier by taking ordering into account.
Huh - I would have thought for higher-speed stuff (say 40G+?) performance would still be an issue if you wanted to maintain wire speed - especially driving multiple interfaces?
I agree home/consumer stuff is just basically coasting on 3-4 generation old tech though. I ran Cat 6 cable around the house hoping to get 10G sometime in my lifetime for less than the cost of a kidney...
"When the bytes which make up such multi-byte numbers are ordered from most
significant byte to least significant byte, that is called "network byte order" or "big endian." (1)
Seems pretty straightforward, MSB to LSB, which is what I learned way back in 'C' times... Did something change?
The oldest source [1] seems to be in Sanskrit, a left-to-right language, so they were big-endian numbers. This still does not rule out an even older origin in a right-to-left language, but as it stands I'm afraid I was just wrong with my 'little endian origin' story. Too bad, I liked it.
> Hard disagree, big-endian is right and little-endian is wrong. At least, unless you think this year is 2202.
If you think a bit about it, it would have been much more natural to write numbers in the opposite order compared to what we have now, since we read and manipulate numbers from least-significant digit to most-significant digit and read text left-to-right.
When you read a number like 346624596, you need to parse it twice to know its value, and also to know its magnitude or do anything significant with it. In the opposite order, everything becomes simpler: adding two numbers, knowing the magnitude, or ignoring the least-significant digits ("it's about 350 million"), ...
MS was stuck with a poor exception implementation, but what I never see explained is how they got there in the first place. It seems crazy to incur a runtime overhead to support something that is hardly ever used. The happy path is where you need the performance most, and exceptions are by definition the "unhappy path".
Michael and I talked about exception design a lot in g++ (I certainly had experience of what to do and not do from CLOS) and there was never a consideration of doing anything that might impact runtime performance.
At least now I finally understand why some people, especially in the game industry, want to disable exceptions.
As a side point, I'm disappointed by Raymond's (18-year-old) claim that "zero cost exceptions" aren't really zero cost. Everything called "zero cost" in C++ means "doesn't add any runtime cost if you don't use it". He argued against a straw man.
Table-based exception-handling metadata requires function prologue/epilogue sequences to have constrained forms that can be described by that metadata. When NT was first designed for x86-32, there was a lot of existing asm code that people wanted to easily port over to NT, and that ported code included pretty much every weird function entry/exit sequence you could think of (and lots more that you wouldn’t imagine anyone would ever think of). Switching to table-based metadata would have required modifying all that code, making porting more difficult. At least, I assume that’s the reasoning – I was deeply involved in the SEH runtime code in the ’90s and ’00s when I was on the VC++ compiler team.
The latter does exception unwinding in the kernel on runtime faults (segv, division by zero, etc). It does _not_ unwind destructors or deallocate storage. And yes, I know about this because I had to debug it, on WinCE ARM, where we eventually discovered that there were two different MS compilers, one of which generated code that could be unwound and one of which didn't.
I got bitten porting code to Windows that use setjmp/longjmp not only for exception handling (more correctly: non-local control transfer -- the exception was actually handled by then), but also for implementing a thread library.
The code worked literally everywhere else, but it turned out that on Windows longjmp doesn't just load the registers from the buffer but also does some SEH bollocks. If I recall (and this was in 2006), it didn't matter for the exception handling, but it was a hard crash for switching threads.
So I had to write my own setjmp/longjmp for Windows in assembly language (i.e copy theirs and cut out the SEH bollocks).
Thanks. A legacy of lots of assembly code, though surprising that stuff like that survived in libraries (in applications it probably wouldn't have mattered)
The C++ exception model is complex and the implementations are intricate. I do not think one can easily dismiss the possibility of runtime overhead just because it seems like no extra work is done. It's much more subtle than that. The implementation potentially affects inlining opportunities, compiler complexity, code caching, etc.
On modern CPUs, probably the most efficient exception model is one that simply uses error codes. Especially when paired with an optimised calling convention like what Swift does. Checking for an error flag is essentially free on a superscalar CPU anyway and all this stuff is transparent to the compiler, simplifying the translation process and enabling more optimisation opportunities.
P.S. I ran a bunch of tests with C++ a while ago and a Result-like type error handling (implemented sanely) was always as fast as C++ exceptions for the good path and faster for the bad path — unless you are going a hundred or so nested functions deep (but then you have a massive code smell problem anyway). With an optimised calling convention it is likely to be even better.
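A sketch of the error-code shape being described (mine; the parent's actual benchmark code isn't shown): the status travels in the return value, the result in an out-parameter, and the happy path costs one well-predicted branch at each call site.

```c
#include <stdio.h>

typedef enum { OK = 0, ERR_DOMAIN = 1 } status_t;

/* Result-like shape in plain C: status in the return value,
   the actual result through an out-parameter. */
static status_t checked_div(int a, int b, int *out) {
    if (b == 0)
        return ERR_DOMAIN;   /* "unhappy" path: just a different return value */
    *out = a / b;
    return OK;               /* happy path: one well-predicted branch at the caller */
}

int main(void) {
    int q;
    if (checked_div(10, 2, &q) == OK)
        printf("10/2 = %d\n", q);
    if (checked_div(1, 0, &q) != OK)
        printf("division by zero reported as an error code\n");
    return 0;
}
```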
Re the zero cost claim, he specifically clarifies that zero cost exceptions have a cost even if you do not use them, which is paid by inhibiting some optimizations.
IANACW, but I think that in principle it is almost always possible to implement a truly zero cost exception path in the non-taken path, by moving all necessary compensation code to undo optimizations into the exceptional path, but in practice it might be too hard for compilers to do it.
If you spend a lot of time scrutinizing g++ output (which thankfully I haven't had to do in over 20 years) you'll see that the impact is negligible and that was true when Raymond made that statement. As you point out there's a lot the compiler can do to make the common path really fast.
But OK, it could potentially be nonzero. Let's say epsilon cost instead.
I concur with this. Java JITs do a ton of optimizations and all of the effects of exceptions are minimal, second-order things, like keeping something alive a little longer (for deoptimization/debugging). If the compiler has a good notion of hot/cold paths, then a lot of optimizations that you'd be tempted to think up for exceptions fall out naturally from the more general notion, including spilling the likely-never-used-but-alive-for-deopt values onto the stack in the important paths.
This is one of the many reasons that having exceptional control flow explicit in the compiler IR is a Good Thing(tm). The HotSpot client compiler lacked this information (in the beginning) because it was deemed "too expensive to keep" and the result was that many optimizations had ugly special cases (or were disabled altogether). For the most part, if the compiler has some notion of hot/cold code, then these exceptional edges can just be marked as cold and the same heuristics that apply to, e.g. unlikely branches, will do a good job.
Pretty much exactly. They had a situation with external visibility into information in the stack frame and don't seem to have made an effort to tell the compiler. I mean... it's their compiler! It's their ABI, even! They have all the levers and dials, and somehow the fact that their exception counter is being optimized out means that the "architecture" is the weirdo?
Yeah. At its root this is a Conway's Law bug. The people responsible for "exceptions" had a senseless wall between them and the people responsible for "stack frames".
Indeed. In Java, on every platform that I know of (including x86 and x86-64), exceptions are zero cost, with no dynamic work needed to enter a try. On Windows they are probably stuck with ABI compatibility. It sounds like the design was flawed from the start and definitely not the architecture's fault.
This isn't about x86 architecture but historical design choices that then got frozen into an ABI. When Microsoft implemented SEH on ia32, the current approach of lookup table based exceptions (trading memory for runtime code execution, particularly reducing the latter in a success case) was not yet in vogue. When they ported to other architectures and had a choice to introduce a different ABI, they did.
Am I the only one who prefers the way x86 did exceptions? I really don't like that with table-based exceptions, every function must include bloated unwind tables and is also forced to conform to a very restrictive set of code generation rules. It also makes writing assembly and dynamic code generation difficult.
I don't think it is. I don't know if the peculiarities of x86 were considered when Microsoft's engineers decided to do SEH the way it is, but there is no hard reason why they couldn't decide to make it table based.
It is interesting that the stores to node are seen as dead. Surely the assignment of the node address to fs:[0] should be equivalent to assigning it to a global variable, so any state change in node can be observed by any function.
I believe this is because the intermediate state can't be safely observed by another thread, since there's no locking, so the C++ memory model allows removing them.
I don’t think so. It is global state, but it lives outside the abstract C machine, in its implementation.
The compiler knows f1 and f2 live in the abstract C machine and cannot access it.
I think it's like the difference between errno, a global inside the abstract C machine, and __foo, which a compiler might create and, if it did so, could assume to be completely under its control.
Well, sure we are deep into implementation details, so it is reasonable for the implementation to treat stores to it specially. But the rest of the article goes into details on how treating it specially causes "real pain in the neck" because the optimizer wants to optimize them away. It seems to me that if the store was not treated specially, the issue wouldn't exist in the first place.
So either exception handling in 32-bit mode is sort of a hack and the exception handling code generation doesn't actually expose the store through fs to the rest of the optimizer, or, more likely, things are much much more complicated than described in the article.
sure, but that's just a global memory location which any function can read from (I think it is part of Windows thread control block, but do not quote me on that).
Probably stores through fs are pragmatically not considered globals stores as they would hinder other optimizations.
I did not understand why on x86 Windows/MSVC cannot implement exceptions the same way as it was done for other architectures. I read that on x64 a different approach is used ("table based" - sounds to me like the "unwind code" approach mentioned in the article). Is this not possible on x86 for some reason?
Table-based exceptions only work if everyone plays ball. It's not practical to introduce them after the fact because they will not work with any existing code, including OS callbacks. AMD64 can use them because they were there from the beginning.
Isn't it possible to annotate a variable as not being eligible for dead store elimination with something like "volatile"? I wonder why that particular state was causing trouble for Visual C++, assuming that it's able to handle code that deal with other kinds of asynchronous events (e.g. signal handlers).
I'm bummed that The Old New Thing is now on Microsoft's domain. Old blogs seems to be dying out or being moved behind the walls of Medium/Substack/etc.
It's always been on one Microsoft domain or another, for almost as long as there've been blogs at all. It's almost 20 years ago now, but when blogs started their initial ascendancy as a useful format, Microsoft had a number of developers start writing publicly facing personal blogs. Raymond Chen was one of that group, also including Michael Kaplan, Eric Lippert and Larry Osterman, among others.
As far as I know, The Old New Thing is the only blog from that wave that's still being written - and it's been written every weekday for essentially that entire time. To give an idea of the scope of the work, he did a compilation of posts in book form sixteen years ago, and it was over 500 pages long.
I don't read it daily anymore (hasn't been relevant to my work or professional interests for ten years or more), but I do read it occasionally and continue to be both impressed and grateful for the effort he's put into the blog.
> I'm bummed that The Old New Thing is now on Microsoft's domain.
Um, it's always been on MS's domain. It has moved from e.g. msdn.com (I think) to microsoft.com, but that's more about MS periodically moving around where its documentation is hosted because shrug, but it was always in same way officially hosted by MS.