Hacker News
Low-level details of the Zen 2 microarchitecture [pdf] (agner.org)
163 points by ekoutanov on Aug 30, 2020 | 13 comments



Of particular interest is section 20.18 -- "Mirroring memory operands"; extending (something like) register renaming to memory.


I wonder what the limits are on that, and whether it chains very deeply, e.g. mreg ("memory register") to op to mreg to op to mreg... What's the limit? Can this be used to extend the optimization of some chained computations via a kind of unrolling?


In Agner's forum post linked a few days ago[1], it sounded like it was quite limited. I mean, super impressive, but as far as I understood, it didn't handle nesting (just the single pointer/memory access + offset)[2]. I'm not very well versed on this stuff though, so maybe I misunderstood.

[1] https://news.ycombinator.com/item?id=24302057

[2] > The mechanism works only under certain conditions. It must use general purpose registers, and the operand size must be 32 or 64 bits. The memory operand must use a pointer and optionally an index. It does not work with absolute or rip-relative addresses.

> It seems that the CPU makes assumptions about whether memory operands have the same address before the addresses have been calculated. This may cause problems in case of pointer aliasing.

Or from the PDF:

•The instructions must use general purpose registers.

•The memory operands must have the same address.

•The operand size must be 32 or 64 bits.

•You may have a 32-bit read after a 64-bit write to the same address, but not vice versa.

•The memory address must have a base pointer, no absolute address, and no rip-relative address. The memory address may have an index register, a scale factor, and an offset no bigger than 8 bits.

•The memory operand must be specified in exactly the same way with the same unmodified pointer and index registers in all the instructions involved.

•The memory address cannot cross a cache line boundary.

•The instructions can be simple MOV instructions, read-modify instructions, or read-modify-write instructions. It also works with PUSH and POP instructions.

•Complex instructions with multiple μops cannot be used.


My educated guess is that it is handled the same way as register renaming. Zen 2 has 180 physical GP registers.

On a 64-bit platform, you would need an algorithm that uses more than 15 registers (so it has to spill to the stack) and needs to execute faster than the cache access latency. There might be some, but I doubt many fall into this category.


A better name might be "Memory Operand Forwarding" since the data is probably coming from a write buffer.


> since the data is probably coming from a write buffer

Taking pending writes from the store buffer before they have retired is something else and has (obviously) been done since we first had OOO execution.

This is referring back to the operand's original value in a register file because that can be done with less latency than searching around in the store buffer, which I think isn't much different to L1.


>> This is referring back to the operand's original value in a register file because that can be done with less latency...

Thanks, I had missed that distinction.


That name doesn't really distinguish this feature from the normal store-to-load forwarding from the store buffer that processors have been doing for decades, though. I think memory renaming is probably the best name for it.


Can anyone comment on how secure this architecture is against speculative execution attacks vs. Intel?


I have a (likely silly) question for anyone well versed in low-level CPU stuff. §20.7 says there's a stack engine that optimizes manipulation of the stack pointer. Does this only apply to the dedicated hardware register (i.e. %rsp) or to other registers as well?

(Potentially related, assuming it's of benefit: are modern compilers smart enough to repurpose %rsp (is this even allowed?) if I use a block of memory as a stack inside a hot loop?)


Take this with a grain of salt since I'm not explicitly versed in the Zen architecture, but it's likely only the SP. Usage patterns are fairly easy to deduce at compile time and are optimized, e.g. via loop tiling, so I think it's fair to assume the optimizations are leveraged against this. For example, if you can predict the loop pattern, you can repurpose the SP.


Funny enough, I Ctrl-F'd for an optimization problem I've been wondering about and found a couple of mentions of it (branch vs. conditional move).


Could all these insights be added to an Artificial Intelligence (AI)? Then the AI could find the best way to compile, re-arrange instructions, etc.

Just thinking...



