> Tom Petersen made a big deal about 16-lane SIMD in Battlemage [...]

Where? The only mention I see in that interview is him briefly saying they have native 16 with "simple emulation" for 32 because some games want 32. I see no mention of or comparison to 8.

And it doesn't make sense to me that switching to actual 32 would be an improvement. Wider means less flexible here. I'd say a more accurate framing is whether the control circuitry is 1/8 or 1/16 or 1/32. Faking extra width is the part that is useful and also pretty easy.




For context, Alchemist was SIMD8. They made a big deal out of this at the Alchemist launch, if I recall correctly, since they thought it would be more efficient. Unfortunately, it turned out to be less efficient.

Tom Petersen did a bunch of interviews right before the Intel B580 launch. In the Hardware Unboxed interview he mentioned it, but he accidentally misspoke. I must have interpreted his misspeak as meaning that games want SIMD16 and noted it that way in my mind, since what he says elsewhere seems to suggest that games want SIMD16; it was only after thinking about what I heard that I realized otherwise. Here is an interview where he talks about native SIMD16 being better:

https://www.youtube.com/live/z7mjKeck7k0?t=35m38s

Specifically, he says:

> But we also have native SIMD support—SIMD16 native support, which is going to say that you don’t have to like recode your computer shader to match a particular topology. You can use the one that you use for everyone else, and it’ll just run well on ARC. So I’m pretty excited about that.

In an interview with Gamers Nexus, he has a nice slide where he attributes a performance gain directly to SIMD16:

https://youtu.be/ACOlBthEFUw?t=16m35s

At the start of the Gamers Nexus video, Steve mentions that Tom’s slides are from a presentation. I vaguely remember seeing a video of it where he talked more about SIMD16 being an improvement, but I am having trouble finding it.

Having to schedule fewer things is a definite benefit of 32 lanes over a smaller lane count. Interestingly, AMD switched from 16 lanes to 32 lanes with RDNA, and RDNA turned out to be a huge improvement in efficiency. The switch is actually somewhat weird, since they had been emulating SIMD64 using their SIMD16 hardware, so the hardware became wider and narrower at the same time. Their emulation of SIMD64 on SIMD16 is mentioned in this old GCN documentation describing cross-lane operations:

https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operat...

That documentation talks about writing to a temporary location and reading from a temporary location in order to do cross-lane operations. Contrast this with section 12.5.1 of the RDNA 3 ISA documentation, where the native SIMD32 units just fetch the values from each other’s registers with no mention of a temporary location:

https://www.amd.com/content/dam/amd/en/documents/radeon-tech...

That strikes me as much more efficient. While I do not write shaders, I have written CUDA kernels, and in CUDA kernels you sometimes need to do what Nvidia calls a parallel reduction across lanes, which is a cross-lane operation (Intel’s CPU division calls these horizontal operations). For example, you might need to sum across all lanes (e.g., for an average, a matrix-vector multiplication, or a dot product). When your thread count matches the SIMD lane count, you can do this without going through shared memory, which is fast. If you need to emulate a higher lane width, you need a temporary storage location (like what AMD described), which is not as fast.
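To make that concrete, here is a minimal sketch of the register-only case, assuming CUDA and a 32-lane warp (the function name is mine, not from the linked kernels):

    // Sum across one 32-lane warp using only register-to-register shuffles.
    // No shared memory is needed because the logical width matches the
    // hardware SIMD width.
    __device__ float warp_reduce_sum(float v) {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;  // lane 0 ends up with the sum of all 32 lanes
    }

If the logical width were wider than the hardware warp, each partial sum would have to take a round trip through shared memory (or some other temporary location) before the reduction could finish.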

If games’ shaders are written with the assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross-lane operations. Intel’s slide attributes a 0.3 ms reduction in render time to the switch from SIMD8 to SIMD16. I suspect that they would see a further reduction with SIMD32, since that would eliminate the need to emulate SIMD32 for games that expect it, given that both Nvidia (at least since Turing) and AMD (since RDNA 1) use SIMD32.

To illustrate this, here are some CUDA kernels that I wrote:

https://github.com/ryao/llama3.c/blob/master/rung.cu#L15

The softmax kernel, for example, has the hardware emulate SIMD1024, although you would need to look at the kernel invocations in the corresponding rung.c file to know that. The purpose of using 1024 threads is to ensure that the kernel is memory-bandwidth bound, since the hardware bottleneck for this operation should be memory bandwidth. To efficiently do the parallel reductions that calculate the max and sum values in different parts of softmax, I use the fast SIMD32 reduction in every SIMD32 unit. I then write the results to shared memory from each of the 32 SIMD32 units that performed it (since 32 * 32 = 1024). Then all 32 SIMD32 units read from shared memory and simultaneously do the same reduction to calculate the final value. Afterward, the leader in each unit broadcasts the value to the others and everything continues. Now imagine having a compiler compile this for native SIMD16 hardware.
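For reference, here is a minimal sketch of the two-stage pattern just described. It is not the actual rung.cu code; the names are illustrative, it assumes a 1024-thread block, and it uses max as the example operation:

    __device__ float block_reduce_max_1024(float v) {
        __shared__ float partials[32];     // one slot per SIMD32 unit (1024 / 32)
        int lane = threadIdx.x & 31;
        int warp = threadIdx.x >> 5;

        // Stage 1: each of the 32 warps reduces its own 32 lanes in registers.
        for (int offset = 16; offset > 0; offset >>= 1)
            v = fmaxf(v, __shfl_down_sync(0xffffffffu, v, offset));
        if (lane == 0) partials[warp] = v; // the single trip to shared memory
        __syncthreads();

        // Stage 2: every warp reads all 32 partials and repeats the same
        // reduction, so each warp arrives at the block-wide result without a
        // second shared-memory round trip.
        v = partials[lane];
        for (int offset = 16; offset > 0; offset >>= 1)
            v = fmaxf(v, __shfl_down_sync(0xffffffffu, v, offset));

        // The leader (lane 0) of each warp broadcasts the result to the rest
        // of its lanes, and everything continues.
        return __shfl_sync(0xffffffffu, v, 0);
    }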

A naive approach would introduce a trip to shared memory for each of the two reductions, giving us 3 trips to shared memory and 4 reductions. A cleverer approach would do 2 trips to shared memory and 3 reductions. Either way, SIMD16 is less efficient. The smart thing to do would be to recognize that 256 threads (16 * 16) is likely okay too and just do the same exact thing with a smaller number of threads, but a compiler is not expected to be able to make such a high-level optimization, especially since the high-level API says “use 1024 threads”. Thus you need the developer to rewrite this for SIMD16 hardware to get it to run at full speed, and with Intel’s low market share, that is not very likely to happen. Of course, this is CUDA code and not a shader, but a shader is likely in a similar situation.


> Having to schedule fewer things is a definite benefit of 32 lanes over a smaller lane count.

From a hardware design perspective, it saves you some die size in the scheduler.

From a performance perspective, as long as the hardware designer kept 32 in mind, it can schedule 32 lanes and duplicate the signals to the 16- or 8-wide lanes with no loss of performance.

> That documentation talks about writing to a temporary location and reading from a temporary location in order to do cross-lane operations.

> If games’ shaders are written with the assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross-lane operations.

So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.


> From a performance perspective, as long as the hardware designer kept 32 in mind, it can schedule 32 lanes and duplicate the signals to the 16- or 8-wide lanes with no loss of performance.

I was looking at what was said about Xe2 in Lunar Lake, and the slides appear to suggest that they had special handling to emulate SIMD32 using SIMD16 in hardware, so you might be right.

> So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.

To go from SIMD8 to SIMD16, Intel halved the number of units while doubling their width. They could have done that again to avoid the need for additional hardware.

I have not seen any hints in the Xe2 instruction set about how they are doing these operations in hardware. I am going to leave it at that, since I have spent far too much time analyzing the technical marketing for a GPU architecture that I am not likely to use. No matter how well they made it, it just was not scaled up enough to be interesting to me as a developer who owns an RTX 3090 Ti. I only looked into it as much as I did because I am excited to see Intel moving forward here. That said, if they launched a 48GB variant, I would buy it in a heartbeat and start writing code to run on it.


There is a typo in the Tom Petersen quote. He said “compute shader”, not “computer shader”. Autocorrect changed it when I transcribed it, and I did not catch this during the edit window.



