What does ARM and RISC-V spend their weirdness budget on?

jmgao · on April 22, 2022

ARM spends it on pc being wrong: when you mov from it, it's off by 8 in ARM mode, and off by 4 in Thumb mode, and if you're using it for pc-relative loads (and stores, if you're crazy), it uses a word aligned value.

edit: oh, and softfloat vs softfp vs hardfp vs vfp

edit: oh, and how they have two incompatible assembly language dialects that are mostly the same, but in non-trivial code, incompatible

danachow · on April 22, 2022

> ARM spends it on pc being wrong.

Heh, having dealt with x86 for years, this is comparatively such a nothing burger. It's always a simple known fixed offset.

> softfloat vs softfp vs hardfp vs vfp

That's something not really unique to ARM per se. Any architecture with options for hardware FPU are going to practically need ABI specs for the soft and hard cases (you don't absolutely need anything more than hardfp since you can always emulate but it will be slow as shit) - and ARM is certainly not unique in having multiple hardware floating point implementations either.

> two incompatible assembly language dialects

Curious what you're referring to here - but I personally wouldn't consider assembly language dialects to be part of a CPU architecture.

kelnos · on April 22, 2022

> Curious what you're referring to here - but I personally wouldn't consider assembly language dialects to be part of a CPU architecture.

I assume they're talking about ARM vs. Thumb?

jmgao · on April 27, 2022

Belated, but no, I was referring to UAL vs old syntax.

curling_grad · on April 22, 2022

By pc you mean "program counter"?

cesarb · on April 22, 2022

For traditional 32-bit ARM, off the top of my head:

- Every instruction being conditional (all instructions have a four-bit condition field, with one of the 16 possible modes being "always");

- The barrel shifter, which can be used on nearly every data processing instruction;

- The program counter being one of the general-purpose registers (and on the original ARM, the same register also containing the flags), so that any register move can alter the program flow;

- The load-multiple/store-multiple instructions, which can load or store up to 16 registers, plus incrementing or decrementing the base register; and since the program counter is one of these registers, it can restore several registers from the stack, update the stack pointer, change the program counter, and switch to Thumb mode (stored on the least significant bit of the program counter), all in a single instruction.

brucehoult · on April 22, 2022

"always" is not the weird one -- that's just the same as everyone else. "Never" is weird, immediately spending 1/16th of the opcode space (256 million instructions) on NO-OPs.

PC being a general-purpose register was historically not uncommon. PDP-11 and VAX both did it and they were kinda popular at one time.

Load/store multiple was also fairly common with, for example, both 68000 and VAX having it. IBM 360 also, though using a register range rather than a bitmap -- a less general solution, but good enough, and much easier to make go fast.

naniwaduni · on April 24, 2022

> "always" is not the weird one -- that's just the same as everyone else. "Never" is weird, immediately spending 1/16th of the opcode space (256 million instructions) on NO-OPs.

Worse, iirc what looks like it should be the slot for the "never" condition actually does exactly the same thing as "always".

brucehoult · on April 25, 2022

Hmm. I guess I never actually tried it.

The A64 manual says 1111 on a Bcc etc disassembles as NV but does the same as AL.

The ARM7TDMI manual says 1111 is reserved and don't use it. I don't know the actual behaviour.

Aha. The Welsh&Knaggs "ARM Book" says before ARMv3 NV meant NV. In ARMv3 and ARMv4 NV is unpredictable. And in ARMv5 NV is used to encode "various additional instructions that can only be executed unconditionally".

Annatar · on April 22, 2022

Loading and storing multiple registers is by no means weird: the MC68000 family has the movem.(b|w|l) instructions which do exactly that, and it's one of the best things since sliced bread because performance can be gained when used cleverly.

Being able to manipulate the program counter in the ARM processors directly is just being honest, simple and straightforward, rather then having it always done implicitly. Seems very intuitive to me now that I think about it.

brucehoult · on April 22, 2022

Being able to manipulate the program counter directly plays hell with a superscalar and especially OoO processor where you want to be able to predict what the program counter does very accurately so the instruction fetch and decode can run far ahead of the execution.

There are four kinds of instructions that play hell with pipeline and OoO design:

- instructions that might cause traps, dependent on the values processed

- instructions that you don't know whether they will change the control flow

- instructions that you don't know where the control flow is going to go to

- instructions where you don't know how long they will take to execute

RISC-V, for example, bans the first category entirely other than load/store, and carefully separates the other three so any one instruction only had at most one of those problems.

ARM load multiple has all of those problems. At least you can examine the register mask at instruction decode time and know whether it will change the PC or not and tag the instruction in the pipeline as being a Jump or not. Imagine if there was a version that took the bitmap from a register instead of being hard-coded...

Load/store multiple don't increase performance much if at all on a CPU with an instruction cache and/or an instruction prefetch buffer. On an original 68000 or ARM without any cache, sure, a series of load or store instructions requires interleaving reading the opcodes with reading or writing the data, while load/store multiple eliminates the opcode reads. An instruction cache also eliminates them, leaving only the code size benefits. But load/store multiple is a perfect candidate for using a simple runtime function instead, at least if you have lightweight function call/return as RISC designs usually do.

Annatar · on April 24, 2022

"An instruction cache also eliminates them, leaving only the code size benefits."

The performance of movem.l in the MC68000 comes not from multiple load, but from multiple store, because the main memory access incurred a tremendous, extremely punitive penalty. This has not changed, even decades later, in systems with the fastest memory chips available: any writes to random access memory incur tremendous penalties.

pjc50 · on April 22, 2022

RISC-V spends it on the "R": either not having single instructions to do certain stuff, or that those instructions are in one of the extension blocks so you have to customize your binaries to a particular RISC-V subset. Most architectures only do this for the high-performance SIMD numeric instructions.

ARM endianness is switchable at runtime.

rwmj · on April 22, 2022

You keep saying this about RISC-V and it keeps not being true. Instruction sequences that fuse are standardized. RISC-V defines various profiles (like "Unix server") which mandate a minimum set of extensions. Extensions beyond the mandated ones will be detected at runtime, just like on x86.

Jasper_ · on April 22, 2022

Fusion isn't cheap and I'd say it's part of the weirdness bucket to rely on fusion instead of making a combined instruction.

brucehoult · on April 22, 2022

Modern x86 and ARM both rely on fusing a compare instruction with a following conditional branch -- something RISC-V doesn't have to do as conditional branches already incorporate the compare and there are no condition codes.

So if fusion is a weirdness it's a nearly universal one.

The good thing about fusion is the program works fine if you don't do it, so low end minimal area CPUs such as microcontrollers can just not bother.

gchadwick · on April 22, 2022

> Instruction sequences that fuse are standardized

They are? Care to point to a ratified RISC-V standard that lists said instruction sequences?

rwmj · on April 22, 2022

Go to the user spec here: https://riscv.org/technical/specifications/ and search for "fusion", "macro-op", "hint", "two-instruction sequences". A lot of it is through register choice. For other proposals (not standards yet) see https://en.wikichip.org/wiki/macro-operation_fusion#Proposed...

gchadwick · on April 22, 2022

So searching by the terms you give the actual spec lists two possibilities a lui/jalr pair, a auipc/jalr pair. Plus they're not given as 'these are the blessed instruction pairs high performance implementations should seek to fuse' but more 'you can do this if you want'. You could maybe count the return address stack hints but they're not about instruction fusion, just a common branch predictor optimisation.

An unsourced table on Wikichip (which was added in 2019 and not updated since) barely counts as a proposal (in terms of it standing a good chance of becoming part of a ratified RISC-V standard).

No doubt many high performance implementations will choose to use fusion and no doubt they'll all go for different combinations, with different edges cases. Yes there likely will be significant overlap but it could become a bit of a nightmare for a compiler writers. A thorough standardized list of instruction pairs to fuse would definitely help here, but we don't have one.

Annatar · on April 22, 2022

just one example of weird insanity from that table:

slli rd, rs1, {1,2,3}

add rd, rd, rs2 Fused into a load effective address

...this is so insane. Whoever thought that this is okay and good, has, in my opinion, severe psychological and psychiatric problems and would do well to seek professional help. If this gets "fused" into a lea, why just not implement a hex code for lea? I'm just completely at a loss as to how messed up that is.

You know what, I'd like to know what a person who thinks that this is okay looks and behaves like.

pclmulqdq · on April 22, 2022

The bitmanip extension has an LEA equivalent, so it looks like they have backtracked on this one.

brucehoult · on April 23, 2022

The Bitmanip "LEA" instructions were added primarily for sh1add.uw, sh1add.uw, and sh3add.uw which not only shift and add but also zero-extend the lower 32 bits of the rs1 register to 64 bits before shifting and adding them.

Thus they are replacing not two instructions but three. This addition was indicated because of the number of critical loops in legacy software that, against C recommendations, tries to "optimise" code by using "unsigned" for loop counters and array indexes instead of the natural types int, long, size_t, ptrdiff_t. This can indeed be an optimisation on amd64 and arm64, but it is a pessimisation on RISC-V, MIPS, Alpha, PowerPC.

One codebase that uses "unsigned" in this way is CoreMark and they explicitly prohibit fixing the variable type. But it's also common in SPEC and in much code optimised for x86 and ARM in general, where using "int" pessimises the code. If they used long, unsigned long, or the size_t or ptrdiff_t typedefs the code would run well everywhere.

While the .uw instructions were being added, it was very low cost to add the versions using all the bits of rs1 at the same time.

So, in the context of this discussion, having 32 bit operations sign extend the results to 64 bits is a weirdness. More ISAs do is than zero-extension, but the most common in the market zero-extend. Note that at the time RISC-V was designed arm64 was not yet announced, so only amd64 did zero-extension.

brucehoult · on April 22, 2022

Modern x86 and ARM both rely on fusing a compare instruction with a following conditional branch.

That's not in the spec. It's just something high end implementations do, and low end ones don't.

cestith · on April 22, 2022

POWER architecture endianness is also switchable at runtime. https://developer.ibm.com/articles/l-power-little-endian-faq...

snvzz · on April 22, 2022

As far as I am aware, RISC-V spends it on not being weird (which is itself weird).

TNorthover · on April 22, 2022

The hoops they had to go through to get PIC address calculations to work make it quite weird. Because `auipc` adds an offset from its `pc`, the corresponding `add` or `lw` relocation needs refer back to that instruction rather than the symbol it's actually looking for.

The poor ELF specification ends up quite tortured by this, IMO.

brucehoult · on April 23, 2022

That affects ELF relocations but not the code.

Arm64 is even worse! There is almost exactly the same instruction, but it also zeroes the low bits of the target address, so as you relocate code you also have to change the offset in the 2nd instruction even if the distance between the reference and the target stays the same.

rwmj · on April 22, 2022

Probably on copying Arm too much :-) I hit "char is unsigned" (not really architectural, more of a toolchain issue) only this morning.

RISC-V was originally going to implement a hypervisor mode which would only have worked with Xen-like hypervisors. Luckily we were able to head that off early and the actual hypervisor extension we got can run KVM efficiently.

exikyut · on April 27, 2022

Not sure if you'll see this - going through old tabs - but I was curious, and just in case: what are the practical differences between Xen and KVM here?

kryptiskt · on April 22, 2022

It doesn't have a condition code register, which scores it plenty of weirdness points.

zozbot234 · on April 22, 2022

Not having condition codes actually cleans up the semantics a lot - no need to define the CC effects for each and every insn. It's also helpful to high-performance implementations, since the condition code register would otherwise enter as a dependency in every insn.

kryptiskt · on April 22, 2022

Sure, there are reasons for it being that way, but it makes it stick out from the mainstream. For example, if you're writing a codegen, before RISC-V you could assume that you had CMOVE and now you can't. The risk is then that the RISC-V backend will emit some clumsy sequence that emulates CMOVE, to make it fit in with the others.

adrian_b · on April 22, 2022

The weirdest things about RISC-V are having very weak addressing modes in comparison with almost all other CPU architectures and also having extremely weak support for detecting integer overflow.

adapteva · on April 22, 2022

Going back to the original H&P texts and RISC philosophy, there is a claim that auto-increment addressing modes is too expensive due to the extra port in the RF + scheduling complexity. For an instruction set that should support embedded in order dual issue processors with math intensive loops, this is arguably a "weird" omission in the base line.

CalChris · on April 22, 2022

Lack of condition codes when the dominant paradigm (x86, ARM) is condition codes is weird. However, having 32 registers unlike x86_64's 16 makes up for some of it.

The base is too base and their bitfield extension is weird.

snvzz · on April 24, 2022

>The base is too base

I'm sure those that only need the base disagree. For the rest of us, there's G (IMAFD).

renox · on April 22, 2022

In RISC-V with the C-extension 32bit instructions are 16bit aligned. Far from x86-level of weirdness but my must be quite annoying for those who want to make 'high performance' CPUs..

brucehoult · on April 23, 2022

Not really. It's no different to ARM Thumb2 which also has two lengths. NanoMips and IBM 360 have three lengths (16, 32, 48 bits).

I've looked at wide RISC-V decoder design and the variable length is no problem at all out to at least decoding 32 bytes of code per cycle i.e. eight 32 bit opcodes or sixteen 16 bit opcodes, or somewhere between for a mix (average would usually be about 11-12 or so).

You just need 8 decoders that can decode any instruction, plus 8 decoders that only have to understand C instructions. The 16/32 decoders each need a 2:1 mux in front of them selecting either bytes 0..3 or 2..5 from a six byte window. They always output a real instruction. The C-only decoders will sometimes be told just to output a NOP instead [1]. Each decoder type needs a 1 bit input to tell it which option to take. Those inputs can be chained like carries in a simple adder, or they can be calculated in parallel like in a carry-lookahead adder. For an 8-16 wide decode you need a 2-deep network of LUT6 to do this (in FPGA terms .. also not very deep in SoC terms).

Note that this is a VERY wide machine. Possibly well beyond the point of usefulness given typical basic block lengths and what you can sensibly do in the OoO back end. x86 is currently doing 3-4 wide decode, and Apple M1 is doing 8 wide.

In short: no, it's not a problem.

[1] or not output an instruction at all. Outputting a NOP makes it easier to insert the decoded instructions into an output buffer. Then you need to filter out NOPs later -- which is needed anyway, as programs contain explicit NOPs, OoO machinery turns register MOVE instructions into NOPs by just updating the rename tables, etc.

rwmj · on April 22, 2022

RISC-V instructions are variable length, from 16 bits up to 192 bits[0]. The instruction stream is self-synchronising though so from a hardware decode point of view it's not a problem, unlike x86.

[0] See "Expanded Instruction-Length Encoding" in the user spec.

brucehoult · on April 23, 2022

RISC-V instructions longer than 32 bits are just a theoretical possibility at this point. An extension escape hatch for the future.

No one has done it, no one seems to be keen to be the first to do it, and even how the instruction length encodings work is not a ratified part of the spec -- it's just a proposal at the moment, even for the next step of 48 bit instructions.

There has been discussion of encodings better than the one proposed in the current spec, especially around instruction length encoding schemes that would make more opcode bits available in 80 bit instructions than in the scheme in the spec, so as to have a possibility of encoding 64 bit literals in an 80 bit instruction.

https://github.com/riscv/riscv-isa-manual/issues/280

pm215 · on April 22, 2022

It can still get you into some extra corner cases for the insn fetch hardware to have to handle when an insn crosses a cache line boundary, though...

monocasa · on April 22, 2022

I don't think the instruction stream is self synchronizing; if you jump to the middle of an instruction there's no guarantee you'll ever get back to not parsing garbage.

rwmj · on April 22, 2022

I didn't put that very well. I didn't mean it was self-synchronising when executing, but that you can (I think?!) always find the next instruction boundary by looking at the bottom bits. At least, that's my understanding from reading that part of the user spec.

pm215 · on April 22, 2022

You could argue that Arm mostly doesn't get too weird and that that's part of why it succeeded, but some things include:

+ 'char' being unsigned

+ handling of unaligned accesses (in early architecture versions a value is read from the aligned address and rotated, which is useless behaviour that falls out of the original implementation because of how it dealt with byte loads; subsequently it was at least made to fault, but it wasn't until I think v6 that unaligned accesses were made to Just Work)

+ the weak memory model

bpye · on April 22, 2022

I’m pretty sure it’s x86 that is weird in having a strong memory model, most other architectures (including ARM) are weak.

qikInNdOutReply · on April 25, 2022

wasnt there also the long jump limitation weirdness that forced it to create little hopping points, were the instructions were stored to hop to the next point, all of that to get to some memory whos pointer it could not fit into loadInstructions Register?

Sorry, its been a while, but i found the idea, to generate those instruction isles into the code quite weird.