
NVIDIA's hardware and software (CUDA) badly need competition in this space -- from Intel, from AMD, from anyone, please.

If anyone at Intel is reading this, please consider releasing all Ponte Vecchio drivers under a permissive open-source license; it would facilitate and encourage faster adoption.




The oneAPI and OpenCL implementations, the Intel Graphics Compiler, and the Linux driver are all open source. Ponte Vecchio support just hasn't been publicly released yet.

https://github.com/intel/compute-runtime

https://github.com/intel/intel-graphics-compiler

https://github.com/torvalds/linux/tree/master/drivers/gpu/dr...


oneAPI focuses too much on C++ (SYCL + Intel's own stuff), while OpenCL is all about C.

CUDA is polyglot, with very nice graphical debuggers that can even single step shaders.

Something that the anti-CUDA crowd keeps forgetting.


CUDA's biggest advantage over OpenCL, other than not being a camel, was its C++ support, which is still the main language used for CUDA in production. I doubt Fortran was the reason CUDA got to where it is; C++, on the other hand, had quite a lot to do with it in the early days, when OpenCL was still stuck in OpenGL C-land.

NVIDIA also understood early on the importance of first-party libraries and commercial partnerships, something Intel understands as well, which is why oneAPI already has wider adoption than ROCm.


CUDA supports many more languages than just C++ and Fortran.

.NET, Java, Julia, Python (RAPIDS/cuDF), and Haskell don't have a place on oneAPI so far.

And yes, going back to C++, the hardware is based on the C++11 memory model (which was itself based on the Java/.NET models).

So there's plenty of stuff to catch up on, besides "we can do C++".
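
A small hedged sketch of that mapping (using libcu++'s cuda::atomic, which ships with recent CUDA toolkits; the kernel and the counting example are made up for illustration):

    // Hedged sketch: C++11-style atomics and memory orders on the GPU via libcu++.
    #include <cuda/atomic>
    #include <new>

    __global__ void count_evens(const int *data, int n,
                                cuda::atomic<int, cuda::thread_scope_device> *counter) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] % 2 == 0) {
            // Same semantics as std::atomic::fetch_add with relaxed ordering.
            counter->fetch_add(1, cuda::std::memory_order_relaxed);
        }
    }

    int main() {
        int *data;
        cuda::atomic<int, cuda::thread_scope_device> *counter;
        cudaMallocManaged(&data, 256 * sizeof(int));
        cudaMallocManaged(&counter, sizeof(*counter));
        for (int i = 0; i < 256; ++i) data[i] = i;
        new (counter) cuda::atomic<int, cuda::thread_scope_device>(0);
        count_evens<<<1, 256>>>(data, 256, counter);
        cudaDeviceSynchronize();
        // counter now holds 128, the number of even values in data
        return 0;
    }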


How does CUDA support any of these (.NET, Java, etc.)? It's the first time I've heard this claim. There are 3rd-party wrappers in Java, .NET, etc. that call CUDA's C++ API, and that's all. Equivalent APIs exist for OpenCL too...


There are Java and C# compilers for CUDA, such as JCuda and Hybridizer (http://www.altimesh.com/hybridizer-essentials/), but the CUDA runtime, libraries, and first-party compiler only support C/C++ and Fortran; for Python you need to use something like Numba.

Most non-C++ frameworks and implementations, though, simply use wrappers and bindings.

I'm also not aware of any high-performance lib for CUDA that wasn't written in C++.


"Hybridizer: High-Performance C# on GPUs"

https://developer.nvidia.com/blog/hybridizer-csharp/

"Simplifying GPU Access: A Polyglot Binding for GPUs with GraalVM"

https://developer.nvidia.com/gtc/2020/video/s21269-vid

And then you can browse for products on https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...


Hybridizer simply generates CUDA C++ code from C#, which is then compiled to PTX; it does the same for AVX, which you can then compile with Intel's compiler or gcc. It's not particularly good, and you often need to debug the generated CUDA source code yourself; it also doesn't always play well with the CUDA programming model, especially its more advanced features.

And again, it's a commercial product developed by a 3rd party; while some people use it, I wouldn't count it as even a rounding error when accounting for why CUDA has the market share it has.


It is like everyone arguing about C++ for AAA studios, as if everyone were making Crysis and Fortnite clones, while forgetting the legions of people making money selling A games.

Or forgetting the days when games written in C were actually full of inline Assembly.

It is still CUDA, regardless of whether it goes through PTX or CUDA C++ as an implementation detail for the high-level code.


You aren’t seeing the forest for the trees.

The market for these secondary implementations is tiny, and that is coming from someone who worked at a company that had CUDA executed from a spreadsheet.

C#/Java et al. aren't what made CUDA popular, nor what will make oneAPI succeed or fail.

CUDA became popular because of its architecture. Using an intermediate assembly allowed backward and forward compatibility, and it had excellent support across the entire NVIDIA GPU stack, which meant it could run on everything from bargain-bin laptops with the cheapest dGPU to HPC cards. It came with a large set of high-performance libraries, and yes, the C++ programming model is why it was adopted so well by the big players.

And, arguably even more importantly, when ML and GPU compute exploded (and that wasn't that long ago), NVIDIA was the top dog in town from a business perspective. CUDA could've been dog shit, but when AMD could barely launch a GPU that could compete with NVIDIA's mid-range for multiple generations, it wouldn't have mattered.


>CUDA could've been dog shit, but when AMD could barely launch a GPU that could compete with NVIDIA's mid-range for multiple generations, it wouldn't have mattered.

This is really the only point to be made. Intel could release open-source GPU drivers and GPGPU frameworks for every language under the sun, personally hold workshops in every city, and even give every developer a back massage, and everyone would likely still use CUDA.

The performance gap is still so large.


Intel has one huge advantage, though: oneAPI already supports their existing CPUs and GPUs (Gen9-12 graphics), and it's already available cross-platform on Linux, macOS, and Windows. This was AMD's biggest failure: no support for consumer graphics, no support for APUs (which means laptops are cut out of the equation), and Linux only, which limits your commercial deployment to the datacenter and a handful of "nerds".

The vast majority of CUDA applications don't need hundreds of HPC cards to execute. Consumers want their favorite video or photo editor to work and to be able to apply filters to their Zoom calls; students and researchers want to be able to develop and run POCs on their laptops. As long as Adobe and the like adopt oneAPI, and as long as Intel provides a backend for common ML frameworks like PyTorch and TF (which they already do), performance at that point won't matter as much as you think.

Performance at this scale is a business question: if AMD had a decent ecosystem but lacked performance, they could've priced their cards accordingly and still captured some market share. Their problem was that they couldn't actually release hardware in time, their shipments were tiny, and they didn't have the software to back it up.

Intel, despite all the doom and gloom, still ships more chips than AMD and NVIDIA combined. If oneAPI is even remotely technically competent (and from my very limited experience with it, it is looking rather good), Intel can offer developers a huge addressable market overnight with a single framework.


I am not denying that C++ is very relevant for CUDA (since version 3.0); it is also why I never bothered to touch OpenCL.

And when Khronos woke up to that fact, alongside SPIR, it was already too late for anyone to care.

Regarding the trees, I guess my point is that regardless of how tiny those markets are, the developers behind those stacks would rather bet on CUDA and eventually collaborate with NVIDIA than go after the alternatives.

So the alternatives to CUDA aren't even able to significantly attract those devs to their platforms, given the tooling around CUDA that supports their efforts.


Yep, CUDA running on literally anything and everything definitely helped its success. So many data scientists and ML engineers got into CUDA by playing with their gaming GPUs.


Which is exactly the advantage Intel has over AMD: they aren't locked to Linux only, and they support iGPUs. ROCm is essentially an extension of the Linux display driver stack at this point and barely supports any consumer hardware, most importantly APUs.

I would really want to be able to find the people at AMD who are responsible for the ROCm roadmap and ask them WTF were they thinking...


https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com... goes to a level above:

"Alternatively you can let the virtual machine (VM) make this decision automatically by setting a system property on the command line. The JIT can also offload certain processing tasks based on performance heuristics."

A lot of what ultimately limits GPUs today is that they are connected over a relatively slow bus (PCIe); this will change in the future, allowing smaller and smaller tasks to be offloaded.


The CUDA runtime takes the PTX intermediate language as input.

The toolkit ships with compilers from C++ and Fortran to NVVM, and provides you documentation about the PTX virtual machine at https://docs.nvidia.com/cuda/parallel-thread-execution/index... and about the higher-level NVVM (which compiles down to PTX) at https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html.
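
As a hedged sketch of how that fits together (the driver API calls below are real; the PTX string is a placeholder you would normally get from nvcc --ptx or libNVVM):

    // Minimal sketch: hand a PTX module to the driver at runtime.
    // The driver JIT-compiles the PTX for whatever GPU is installed.
    #include <cuda.h>

    // Placeholder: a real PTX module, e.g. the output of `nvcc --ptx saxpy.cu`.
    static const char *ptx_source = "/* PTX text goes here */";

    int main(void) {
        cuInit(0);

        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        CUmodule mod;
        cuModuleLoadData(&mod, ptx_source);      // JIT: PTX -> native SASS
        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "saxpy");  // kernel name inside the PTX

        // ...allocate buffers with cuMemAlloc, then launch with:
        // cuLaunchKernel(fn, grid,1,1, block,1,1, 0, NULL, args, NULL);

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }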


Oooh, I didn't know PTX was an intermediate representation and explicitly documented as such; I really thought it was the actual assembly run by the chips…


PTX is a virtual ISA, and the lack of one is why ROCm is doomed to fail, well beyond its horrendous bugs. ROCm produces hardware-specific binaries, which not only means you need to produce binaries for multiple GPUs, but also that you have no guarantee of forward compatibility.

A CUDA binary from 10 years ago will still run today on modern hardware; ROCm sometimes breaks compatibility between minor releases, and it's often not documented.
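
That decade-long compatibility comes from fat binaries: alongside machine code for specific GPUs, nvcc can embed the PTX itself, which the driver JIT-compiles for architectures that didn't exist at build time. A hedged sketch (the -gencode flags are standard nvcc options; the kernel is just an illustration):

    // saxpy.cu -- compile with, e.g.:
    //   nvcc -gencode arch=compute_52,code=sm_52 \
    //        -gencode arch=compute_52,code=compute_52 saxpy.cu
    // The first -gencode embeds SASS for sm_52 hardware; the second embeds
    // compute_52 PTX, which the driver can JIT for any newer GPU.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }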


You can get the GPU-targeted assembly (sometimes called SASS by NVIDIA) by compiling specifically for a given GPU and then using nvdisasm, which also has a very terse definition of the underlying instruction set in the docs (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...).

But it's one-way only: NVIDIA ships a disassembler but explicitly doesn't ship an assembler.



In addition, grCUDA is a breakthrough that enables interop with many more languages, such as Ruby, R, JS (soon Python), etc.: https://github.com/NVIDIA/grcuda


oneAPI support in Julia: https://github.com/JuliaGPU/oneAPI.jl


Nice to know, thanks.


oneAPI is not completely open source. Support for Ponte Vecchio will not be released as open source, for many reasons.


I don't have specific knowledge of Ponte Vecchio in particular, so I'll defer to you if you have such info. The support for their mainstream GPU products is open source, though.


Where can I find more details?


CUDA is not as important as TensorFlow, PyTorch, and JAX support at this point. Those frameworks are what people code against, so having high-quality backends for them is more important than the drivers themselves.


Not everyone is using GPGPU for Tensorflow, PyTorch and JAX.


Can you give me another GPGPU framework that puts in the effort to have up-to-date benchmark comparisons between different hardware?

Just as an example, I recently bought an M1 laptop. Even PyTorch CPU cross-compiling hasn't been done properly, and TensorFlow support is full of bugs. I hope that with the M1X it will be taken more seriously by Apple.

I understand your point (Julia is a great example), but trying to support a rarely used hardware with a rarely supported framework is just not practical. (Stick with CUDA.jl for Julia :) )


For example, the guys at OTOY couldn't care less about PyTorch or TensorFlow.

https://home.otoy.com/


What does it have to do with the AI chip in the article? That looks like the opposite of GPGPU (rendering real graphics).


Nothing, CUDA is used for graphics programming as well.


Intel's oneAPI is already miles ahead of AMD's ROCm, which is pretty awesome.


When? Where? How can it be miles ahead if the hardware has not been released yet?


Yes, seconding that.

What's the point of using oneAPI, yet another compute API wrapper, to make software for just a single platform?

You can just use regular compute libs and C or C++.

Serious HPC will stick with its own serious HPC stuff (superoptimised C and Fortran code), no matter how labour-intensive it is.

So, I see very little point in that.


oneAPI is already cross-platform through Codeplay's implementation, which can also run on NVIDIA GPUs; its whole point is to be an open, cross-platform framework that targets a wide range of hardware.

Whether it will be successful or not is up in the air, but its goals are pretty solid.


So basically, a thing that will provide first-class capabilities only on Intel hardware, and won't really be optimised for maximum performance or expose all the underlying capabilities of the hardware elsewhere.


Superoptimised C++ and Fortran code, with Chapel on the horizon.


Sadly, that's not a very high bar to set...


Now they need to catch up with polyglot CUDA eco-system.


I really don't get this push for polyglot programming when 99% of the high-performance libraries use C++. Even more, oneAPI has DPC++, SPIR-V has SYCL, and CUDA is even building a heterogeneous C++ standard library, libcu++, that supports both CPU and GPU. Seriously now, how many people in the JVM or CLR world actually need this level of performance? How many actually push kernels to the GPU from those runtimes? I have yet to see a programming language that can replace C++ at what it does best. Maybe Zig, being streamlined and easier to get into, will be a true contender to C++ in HPC, but only time will tell.
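
As a small hedged illustration of that heterogeneity (libcu++ exposes cuda::std:: types usable from both host and device code; the square function here is made up for the example):

    // Hedged sketch: the same cuda::std:: code compiles for CPU and GPU.
    #include <cuda/std/complex>

    __host__ __device__ cuda::std::complex<float>
    square(cuda::std::complex<float> z) {
        return z * z;  // identical source runs on either side
    }

    __global__ void square_on_gpu(cuda::std::complex<float> *out) {
        *out = square(cuda::std::complex<float>(1.0f, 2.0f));
    }

    int main() {
        // Host-side call to the very same function.
        auto h = square(cuda::std::complex<float>(1.0f, 2.0f));
        (void)h;  // h == (-3, 4)
        return 0;
    }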


Enough people to keep a couple of companies in business, and to keep NVIDIA doing collaboration projects with Microsoft and Oracle; HPC is not the only market for CUDA.


> Seriously now, how many people from JVM or CLR world actually need this level of high performance?

The big data ecosystem is Java-centric.


Indeed it is, but the developers in these ecosystems created complements like Apache Arrow, which puts the data in a language-independent columnar memory format for efficient analytics in services that run C++ on clusters of CPUs and GPUs. Even Spark has recently rewritten its own analytics engine in C++. These were created because of the limitations of the JVM. We have tried to move numerical processing away from C++ over the past decades, but we have always failed.


You asked who in the JVM world would be interested in this kind of performance: that's big data folks. To the extent that improvements accrue to the JVM they accrue to that world without needing to rewrite into C++.


Finance too: large exchanges with microsecond latency have their core systems written in Java; CME Globex and EBS/BrokerTec are written in Java.


Whenever I hit AI limits, it's due to memory. That's why I would argue the future of AI is Rust, not C++. Memory efficiency matters!



