That's just compute shaders in DirectX. They also support compute shaders in Vulkan and OpenGL.
It's not exactly the same as Cuda and OpenCL. In particular, the numerical precision requirements are way off on graphics APIs. And by way off I mean that in some cases they aren't even specified.
Basic fp ops like +-*/ are generally fine. It's more about the ability to prevent reordering of instructions (so that Kahan summation etc. is not screwed up), and having a well-defined spec for the precision of things like transcendental functions. HLSL is a bit better at controlling this than GLSL is.
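For example, Kahan summation only survives if the compiler is forbidden from reassociating the arithmetic; a minimal CPU-side sketch of the pattern (the same structure applies inside a shader):

    #include <cstdio>
    #include <vector>

    // Compensated (Kahan) summation: the correction term c recovers the
    // low-order bits lost when adding small values to a large running sum.
    // A compiler that is free to reassociate FP math may simplify
    // (t - sum) - y to 0 and silently turn this back into naive summation.
    float kahan_sum(const std::vector<float>& v) {
        float sum = 0.0f, c = 0.0f;
        for (float x : v) {
            float y = x - c;    // apply the running correction
            float t = sum + y;  // low bits of y may be lost here...
            c = (t - sum) - y;  // ...and are recovered here
            sum = t;
        }
        return sum;
    }

    int main() {
        // 1.0f followed by ten million values that are individually too
        // small to change the sum at float precision.
        std::vector<float> v(10000000, 1e-8f);
        v.insert(v.begin(), 1.0f);

        float naive = 0.0f;
        for (float x : v) naive += x;

        // naive stays at 1.0, the compensated sum lands near 1.1
        std::printf("naive: %.7f  kahan: %.7f\n", naive, kahan_sum(v));
        return 0;
    }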
It has also improved greatly. Using workgroup shared memory is now a thing, and one can use the subgroup ops that have been in Cuda for years.
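For concreteness, the Cuda-side equivalents are __shared__ memory and warp shuffle intrinsics; a minimal block-sum reduction sketch (assuming 256-thread blocks), which is roughly what groupshared memory plus wave/subgroup ops now let you write in compute shaders as well:

    #include <cuda_runtime.h>

    // Block-wide sum using shared memory (the analogue of HLSL groupshared /
    // Vulkan workgroup shared memory) plus warp shuffles for the last 32
    // lanes (the analogue of subgroup / wave ops). Assumes blockDim.x == 256.
    __global__ void blockSum(const float* in, float* out, int n) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction in shared memory down to one warp's worth of values.
        for (int s = blockDim.x / 2; s >= 32; s >>= 1) {
            if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }

        // Finish inside a single warp with subgroup-style shuffles.
        if (threadIdx.x < 32) {
            float v = tile[threadIdx.x];
            for (int offset = 16; offset > 0; offset >>= 1)
                v += __shfl_down_sync(0xffffffffu, v, offset);
            if (threadIdx.x == 0) out[blockIdx.x] = v;
        }
    }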
Some of the other things, like command queues, are in Vulkan and DirectX 12. But oh boy, those are a pain to program in compared to OpenCL or Cuda. Usability matters too.
GPUs don’t reorder, their EUs are way too simple for that. Are you certain GPU drivers reorder instructions while recompiling DXBC into their microcode?
> HLSL is a bit better at controlling this than GLSL is.
BTW, if you compile acos() in HLSL and disassemble the output DXBC, you'll see a really strange sequence of 10 instructions (mad, mad, mad, add, lt, sqrt, mul, mad, and, mad). The precision is indeed lost that way. Still, if you really need that, you can implement full-precision stuff on top of what's available.
> Using workgroup shared memory is now a thing
Was always there. CUDA 1.0 was released in 2007, D3D 11 in 2009.
> But oh boy, those are a pain to program in compared to OpenCL or Cuda.
D3D 12 is a pain to program in general, too low level. But for GPGPU I personally never needed command queues; I'm quite happy with old-school D3D 11. Even though the API is mostly single threaded, with some care you can do stuff in parallel. Things like ID3D11DeviceContext::CopyResource are asynchronous, so you can go quite far with deeply pipelined commands without doing it manually like you have to in D3D12.
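A rough sketch of that pattern, assuming the shaders, views and buffers were created elsewhere (all names below are placeholders, error handling omitted):

    #include <d3d11.h>

    // None of the calls below wait for the GPU; they just queue work on the
    // immediate context. The only stall is the Map() at the end.
    void RunPipelined(ID3D11DeviceContext* ctx,
                      ID3D11ComputeShader* csA, ID3D11ComputeShader* csB,
                      ID3D11ShaderResourceView* srvIn,
                      ID3D11UnorderedAccessView* uavTmp,
                      ID3D11UnorderedAccessView* uavOut,
                      ID3D11Resource* gpuBuf, ID3D11Resource* stagingBuf,
                      UINT groupsX)
    {
        ctx->CSSetShader(csA, nullptr, 0);
        ctx->CSSetShaderResources(0, 1, &srvIn);
        ctx->CSSetUnorderedAccessViews(0, 1, &uavTmp, nullptr);
        ctx->Dispatch(groupsX, 1, 1);

        ctx->CSSetShader(csB, nullptr, 0);
        ctx->CSSetUnorderedAccessViews(0, 1, &uavOut, nullptr);
        ctx->Dispatch(groupsX, 1, 1);

        // Asynchronous GPU-side copy into a CPU-readable staging buffer.
        ctx->CopyResource(stagingBuf, gpuBuf);

        // The wait happens here; defer it (or poll with
        // D3D11_MAP_FLAG_DO_NOT_WAIT) while queuing more work.
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(ctx->Map(stagingBuf, 0, D3D11_MAP_READ, 0, &mapped)))
        {
            // ... consume mapped.pData ...
            ctx->Unmap(stagingBuf, 0);
        }
    }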
The compiler does the reordering, not the GPU itself. Even on OoO CPUs the hardware reordering doesn't affect FP results. As a rule of thumb, graphics shaders are compiled as if one had passed -ffast-math to the compiler. Works perfectly in most cases, but not everywhere.
Nowadays MS has actually published the DX spec that was previously closed. See https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_... for the differences from strict IEEE behaviour. In Cuda and OpenCL one can get way closer. As an example, for performance reasons one might want to flush denorms to zero; but DX mandates it, so no denormals for you. In CL and Cuda they're usable by default.
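To make the denormal point concrete, a small CPU-side sketch of what flush-to-zero means numerically (x86/SSE assumed; on the GPU this behaviour is fixed by the API and compiler rather than toggled at runtime like this):

    #include <cstdio>
    #include <limits>
    #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE control register)

    int main() {
        volatile float tiny = std::numeric_limits<float>::min();  // smallest normal float
        volatile float half = 0.5f;

        // Default: gradual underflow, the product is a denormal, not zero.
        std::printf("denormals kept: %g\n", tiny * half);

        // Flush results that underflow to zero instead, roughly what the DX
        // rules and -ffast-math style options give you.
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        std::printf("flush-to-zero:  %g\n", tiny * half);
        return 0;
    }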
As for the command queues: I've often used them in Cuda, just to get overlap between kernel executions. In DX12 one can do that by omitting barriers. DX11 allows no such feat.
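In Cuda the queues are streams; a toy sketch of that kind of overlap (placeholder kernel, no error handling):

    #include <cuda_runtime.h>

    // Placeholder kernel; the real ones would be whatever work you want to overlap.
    __global__ void scale(float* data, int n, float k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= k;
    }

    int main() {
        const int n = 1 << 20;
        float *a = nullptr, *b = nullptr;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMemset(a, 0, n * sizeof(float));
        cudaMemset(b, 0, n * sizeof(float));

        // Kernels launched into different streams have no implied ordering
        // and may execute concurrently, hardware permitting.
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        scale<<<(n + 255) / 256, 256, 0, s1>>>(a, n, 2.0f);
        scale<<<(n + 255) / 256, 256, 0, s2>>>(b, n, 3.0f);

        cudaDeviceSynchronize();  // wait for both streams to drain

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }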
It does, but you can always disassemble the DXBC and see what happened to your HLSL code.
> DX11 allows no such feat.
ID3D11DeviceContext::Dispatch is asynchronous just like CopyResource. Dispatch multiple shaders, and unless they have data dependencies (i.e. the same buffer written by one as a UAV and read by the next one as an SRV) they'll happily run in parallel. No need for manual shenanigans with command queues.
> It does, but you can always disassemble the DXBC and see what happened to your HLSL code.
And you can always disasm X86 code and see what -ffast-math did. Doesn't mean that everyone would be fine with just mandating it everywhere with no option to disable it.
Even then the DX functional spec gives some leeway. As an example, if you write x*y+z it will be compiled into a mad instruction, and that is only specified as having precision no worse than the worst possible ordering of the separate instructions. So which is it? Depends on the vendor. This is completely fine for graphics, but not for all workloads.
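Concretely, a fused multiply-add rounds once while a separate mul and add round twice, so the two orderings can legitimately give different answers; a small CPU-side sketch:

    #include <cmath>
    #include <cstdio>

    int main() {
        // Chosen so the exact product 1 + 2^-11 + 2^-24 needs one more bit
        // than a float mantissa holds.
        float a = 1.0f + 1.0f / 4096.0f;     //  1 + 2^-12
        float b = 1.0f + 1.0f / 4096.0f;     //  1 + 2^-12
        float c = -(1.0f + 1.0f / 2048.0f);  // -(1 + 2^-11)

        volatile float prod = a * b;         // rounded once here...
        float separate = prod + c;           // ...and again here -> 0
        float fused    = std::fma(a, b, c);  // single rounding   -> 2^-24

        std::printf("separate: %g\nfused: %g\n", separate, fused);
        return 0;
    }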
> No need for manual shenanigans with command queues
Unless you access the same buffer from multiple places in a way that's still spec conformant, just in a way that the DX11 implementation cannot detect.
> Doesn't mean that everyone would be fine with just mandating it everywhere
Practically speaking, I often enable it everywhere even on the CPU (or the equivalent options in Visual C++). When more precision is needed, FP64 is the way to go. Apart from rare edge cases, you won't be getting many useful mantissa bits out of denormals or a better rounding order.
> So which is it? Depends on the vendor.
Yeah, but on the same nVidia GPU, I'm pretty sure mad in DXBC does precisely the same thing as fma in CUDA PTX.
> just in a way that the DX11 implementation cannot detect
It doesn't detect much. If you want to allow shaders to arbitrarily read and write the same buffer, bind that buffer as a UAV and you'll be able to run many of these shaders in parallel, despite having a single queue.
P.S. AFAIK the main use case for these queues is high-end graphics, to send relatively cheap GPU tasks at a huge rate (like 1 MHz of them) from many CPU cores in parallel. In GPU compute, at least in my experience, the tasks tend to be much larger.