Hacker News
Low-level details of the Zen 2 microarchitecture [pdf] (agner.org)
163 points by ekoutanov on Aug 30, 2020 | 13 comments



Of particular interest is section 20.18 -- "Mirroring memory operands"; extending (something like) register renaming to memory.


I wonder what the limits are on that, and whether it chains very deeply, e.g. mreg ("memory register") to op to mreg to op to mreg... What's the limit? Can this be used to extend the optimization of some chained computations via a kind of unrolling?


In Agner's forum post linked a few days ago[1], it sounded like it was quite limited. I mean, super impressive, but as far as I understood, it didn't handle nesting (just the single pointer/memory access + offset)[2]. I'm not very well versed on this stuff though, so maybe I misunderstood.

[1] https://news.ycombinator.com/item?id=24302057

[2] > The mechanism works only under certain conditions. It must use general purpose registers, and the operand size must be 32 or 64 bits. The memory operand must use a pointer and optionally an index. It does not work with absolute or rip-relative addresses.

> It seems that the CPU makes assumptions about whether memory operands have the same address before the addresses have been calculated. This may cause problems in case of pointer aliasing.

Or from the PDF:

•The instructions must use general purpose registers.

•The memory operands must have the same address.

•The operand size must be 32 or 64 bits.

•You may have a 32-bit read after a 64-bit write to the same address, but not vice versa.

•The memory address must have a base pointer, no absolute address, and no rip-relative address. The memory address may have an index register, a scale factor, and an offset no bigger than 8 bits.

•The memory operand must be specified in exactly the same way with the same unmodified pointer and index registers in all the instructions involved.

•The memory address cannot cross a cache line boundary.

•The instructions can be simple MOV instructions, read-modify instructions, or read-modify-write instructions. It also works with PUSH and POP instructions.

•Complex instructions with multiple μops cannot be used.


My educated guess is that it is handled the same way as register renaming. Zen 2 has 180 physical GP registers.

On a 64-bit platform, you would need an algorithm that uses more than 15 registers (so it has to spill to the stack) and needs to execute faster than the cache access latency. There might be some, but I doubt many fall into this category.


A better name might be "Memory Operand Forwarding" since the data is probably coming from a write buffer.


> since the data is probably coming from a write buffer

Taking pending writes from the store buffer before they have retired is something else and has (obviously) been done since we first had OOO execution.

This is referring back to the operand's original value in a register file because that can be done with less latency than searching around in the store buffer, which I think isn't much different to L1.


>> This is referring back to the operand's original value in a register file because that can be done with less latency...

Thanks, I had missed that distinction.


That name doesn't really distinguish this feature from the normal store-to-load forwarding from the store buffer that processors have been doing for decades, though. I think memory renaming is probably the best name for it.


Can anyone comment on how secure this architecture is against speculative execution attacks vs. Intel?


I have a (likely silly) question for anyone well versed in low-level CPU stuff. §20.7 says there's a stack engine that optimizes manipulation of the stack pointer. Does this only apply to the dedicated hardware register (i.e. %rsp) or to other registers as well?

(Potentially related, assuming it's of benefit: are modern compilers smart enough to repurpose %rsp (is this even allowed?) if I use a block of memory as a stack inside a hot loop?)


Take this with a grain of salt since I'm not explicitly versed in the Zen architecture, but it's likely only the SP. Usage patterns are fairly easy to deduce at compile time and are optimized, e.g. via loop tiling, so I think it's fair to assume the optimizations are leveraged against this. For example, if you can predict the loop pattern, you can repurpose the SP.


Funny enough, I Ctrl-F'd for an optimization problem I've been wondering about and found a couple of mentions of it (branch vs. conditional move).


Could all these insights be added to an Artificial Intelligence (AI)? Then the AI could find the best way to compile, re-arrange instructions, etc.

Just thinking...



