NVIDIA's hardware and software (CUDA) badly need competition in this space -- from Intel, from AMD, from anyone, please.
If anyone at Intel is reading this, please consider releasing all Ponte Vecchio drivers under a permissive open-source license; it would facilitate and encourage faster adoption.
The oneAPI and OpenCL implementations, the Intel Graphics Compiler, and the Linux driver are all open source. Ponte Vecchio support just hasn't been publicly released yet.
CUDA’s biggest advantage over OpenCL, other than not being a camel, was its C++ support, which is still the main language used for CUDA in production. I doubt Fortran was the reason CUDA got to where it is; C++, on the other hand, had quite a lot to do with it in the early days, when OpenCL was still stuck in OpenGL-style C-land.
NVIDIA also understood early on the importance of first-party libraries and commercial partnerships, something Intel understands too, which is why oneAPI already has wider adoption than ROCm.
How does CUDA support any of these (.NET, Java, etc.)? It's the first time I've heard this claim. There are third-party wrappers in Java, .NET, etc. that call CUDA's C++ API, and that's all. Equivalent APIs exist for OpenCL too...
There are Java and C# compilers for CUDA, such as JCUDA and http://www.altimesh.com/hybridizer-essentials/, but the CUDA runtime, libraries, and the first-party compiler only support C/C++ and Fortran; for Python you need to use something like Numba.
Most non-C++ frameworks and implementations, though, simply use wrappers and bindings.
I'm also not aware of any high-performance library for CUDA that wasn't written in C++.
Hybridizer simply generates CUDA C++ code from C#, which is then compiled to PTX; it also does this for AVX, which you can then compile with Intel’s compiler or GCC. It’s not particularly good, and you often need to debug the generated CUDA source code yourself; it also doesn’t always play well with the CUDA programming model, especially its more advanced features.
And again, it’s a commercial product developed by a third party; while some people use it, I wouldn’t even count it as a rounding error when accounting for why CUDA has the market share it has.
It is like everyone arguing about C++ for AAA studios, as if everyone were making Crysis and Fortnite clones, while forgetting the legions of people making money selling A games.
Or forgetting the days when games written in C were actually full of inline assembly.
It is still CUDA, regardless of whether it goes through PTX or CUDA C++ as an implementation detail for the high-level code.
The market for these secondary implementations is tiny, and that is coming from someone who worked at a company that had CUDA executed from a spreadsheet.
The C#/Java et al. support isn’t what made CUDA popular, nor what will make oneAPI succeed or fail.
CUDA became popular because of its architecture: using an intermediate assembly allows backward and forward compatibility, and it had excellent support across the entire NVIDIA GPU stack, which means it could run on everything from bargain-bin laptops with the cheapest dGPU to HPC cards.
It came with a large collection of high-performance libraries, and yes, the C++ programming model is why it was adopted so well by the big players.
And, arguably even more importantly, when ML and GPU compute exploded (and that wasn’t that long ago), NVIDIA was, from a business perspective, the top dog in town. CUDA could’ve been dog shit, but when AMD could barely launch a GPU that could compete with NVIDIA’s mid-range for multiple generations, it wouldn’t have mattered.
>CUDA could’ve been dog shit but when AMD could barely launch a GPU that could compete with NVIDIA’s mid range for multiple generations it wouldn’t have mattered.
This is really the only point to be made. Intel could release open source GPU drivers and GPGPU frameworks for every language under the sun, personally hold workshops in every city and even give every developer a back massage and everyone would likely still use CUDA.
Intel has one huge advantage, though: oneAPI already supports their existing CPUs and GPUs (Gen9–12 graphics), and it’s already available cross-platform on Linux, macOS, and Windows. This was AMD’s biggest failure: no support for consumer graphics, no support for APUs (which means laptops are cut out of the equation), and Linux only, which limits your commercial deployment to the datacenter and a handful of “nerds”.
The vast majority of CUDA applications don’t need hundreds of HPC cards to execute. Consumers want their favorite video or photo editor to work, they want to be able to apply filters to their Zoom calls, and students and researchers want to be able to develop and run POCs on their laptops. As long as Adobe and the like adopt oneAPI, and as long as Intel provides a backend for common ML frameworks like PyTorch and TF (which they already do), performance at that point won’t matter as much as you think.
Performance at this scale is a business question: if AMD had a decent ecosystem but lacked performance, they could’ve priced their cards accordingly and still captured some market share. Their problem was that they couldn’t actually release hardware in time, their shipments were tiny, and they didn’t have the software to back it up.
Intel, despite all the doom and gloom, still ships more chips than AMD and NVIDIA combined. If oneAPI is even remotely technically competent (and from my very limited experience with it, it is looking rather good), Intel can offer developers a huge addressable market overnight with a single framework.
I am not denying that C++ is very relevant for CUDA (since version 3.0), it is also why I never bothered to touch OpenCL.
And when Khronos woke up to that fact, alongside SPIR, it was already too late for anyone to care.
Regarding the trees, I guess my point is that regardless of how tiny they are, the developers behind those stacks would rather bet on CUDA and eventually collaborate with NVIDIA than go after the alternatives.
So the alternatives to CUDA aren't even able to significantly attract those devs to their platforms, given the tooling around CUDA that supports their efforts.
Yep, CUDA running on literally anything and everything definitely helped its success. So many data scientists and ML engineers got into CUDA by playing with their gaming GPUs.
Which is exactly the advantage Intel has over AMD: they aren’t locked to Linux only, and they support iGPUs. ROCm is essentially an extension of the Linux display driver stack at this point and barely supports any consumer hardware, most importantly APUs.
I would really like to find the people at AMD who are responsible for the ROCm roadmap and ask them WTF they were thinking...
"Alternatively you can let the virtual machine (VM) make this decision automatically by setting a system property on the command line. The JIT can also offload certain processing tasks based on performance heuristics."
A lot of what ultimately limits GPUs today is that they are connected over a relatively slow bus (PCIe); this will change in the future, allowing smaller and smaller tasks to be offloaded.
Oooh, I didn’t know PTX was an intermediate representation and explicitly documented as such; I really thought it was the actual assembly run by the chips…
PTX is a virtual ISA, and the lack of an equivalent is why ROCm is doomed to fail, well beyond its horrendous bugs.
ROCm produces hardware-specific binaries, which not only means you need to produce binaries for multiple GPUs, but also that you have no guarantee of forward compatibility.
A CUDA binary from 10 years ago will still run today on modern hardware; ROCm sometimes breaks compatibility between minor releases, and it’s often not documented.
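The difference shows up right in the build flags. A rough sketch (file names are made up; assumes the CUDA and ROCm toolchains are installed — exact flag spellings are in each vendor's compiler docs):

```shell
# CUDA: embed PTX for a virtual architecture alongside native code.
# The driver can JIT the PTX for GPUs that didn't exist at build time.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     app.cu -o app

# ROCm/HIP: one native code object per physical GPU target, no virtual ISA.
# A GPU not on this list at build time simply can't run the binary.
hipcc --offload-arch=gfx906 --offload-arch=gfx1030 app.cpp -o app
```

The second `-gencode` line (with `code=compute_70`) is what buys CUDA its forward compatibility: it ships the intermediate representation itself, not just pre-compiled machine code.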
You can get the GPU-targeted assembly (called SASS by NVIDIA) by compiling specifically for a given GPU and then using nvdisasm; the docs also have a very terse definition of the underlying instruction set (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...).
But it's one-way only: NVIDIA ships a disassembler but explicitly doesn't ship an assembler.
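The workflow looks roughly like this (file names are hypothetical; assumes the CUDA toolkit is on your PATH):

```shell
# Compile for one specific GPU architecture into a raw cubin...
nvcc -arch=sm_80 -cubin kernel.cu -o kernel.cubin

# ...and disassemble it into SASS.
nvdisasm kernel.cubin

# Alternatively, extract SASS from an already-linked binary.
cuobjdump -sass ./app
```

There is no tool in the shipped toolkit to go the other way, from edited SASS text back to a runnable cubin.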
In addition, grCUDA is a breakthrough that enables interop with many more languages, such as Ruby, R, JS (soon Python), etc.
https://github.com/NVIDIA/grcuda
I don't have specific knowledge of Ponte Vecchio in particular, so I'll defer to you if you have such info. The support for their mainstream GPU products is open source, though.
CUDA is not as important as TensorFlow, PyTorch, and JAX support at this point. Those frameworks are what people code against, so having high-quality backends for them is more important than the drivers themselves.
Can you give me another GPGPU framework that puts in the effort to have up-to-date benchmark comparisons between different hardware?
Just as an example I just bought an M1 laptop. Even PyTorch CPU cross-compiling hasn't been done properly, and TensorFlow support is full of bugs. I hope that with M1X it will be taken more seriously by Apple.
I understand your point (Julia is a great example), but trying to support a rarely used hardware with a rarely supported framework is just not practical. (Stick with CUDA.jl for Julia :) )
oneAPI is already cross-platform through Codeplay’s implementation, which can also run on NVIDIA GPUs; its whole point is to be an open, cross-platform framework that targets a wide range of hardware.
Whether it will be successful or not is up in the air, but its goals are pretty solid.
So basically, a thing that will provide first-class capabilities only on Intel hardware, and won't be really optimised for maximum performance/expose all the underlying capabilities of the hardware elsewhere.
I really don't get this push for polyglot programming when 99% of the high-performance libraries use C++. Even more, oneAPI has DPC++, SPIR-V has SYCL, and CUDA is even building libcu++, a heterogeneous C++ standard library that supports both CPU and GPU. Seriously now, how many people in the JVM or CLR world actually need this level of performance? How many actually push kernels to the GPU from these runtimes? I have yet to see a programming language that will replace C++ at what it does best. Maybe Zig, because it is streamlined and easier to get into, will be a true contender to C++ for HPC, but only time will tell.
Enough people to keep a couple of companies in business, and NVIDIA doing collaboration projects with Microsoft and Oracle; HPC is not the only market for CUDA.
Indeed it is, but the developers in these ecosystems created complements like Apache Arrow, which unloads the data into a language-independent columnar memory format for efficient analytics in services that run C++ on clusters of CPUs and GPUs. Even Spark has recently rewritten its analytics engine in C++. These were created because of the limitations of the JVM. We have tried to move numerical processing away from C++ for decades, but we have always failed.
You asked who in the JVM world would be interested in this kind of performance: that's big data folks. To the extent that improvements accrue to the JVM they accrue to that world without needing to rewrite into C++.