
Top item on the roadmap: "Support Apple silicon M1/M2 deployment"


I tried to figure out how to do GPGPU stuff as a total beginner in Rust on Apple Silicon.

I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends. It also didn't seem to make much of a difference because while there are crates for both that seemed relatively well-maintained/fleshed out, I couldn't figure out how exactly to just pull one down and plug them into a higher level library (or find said higher level library all together).

Have you had any luck? In my experience, it's basically Python or bust in this space, despite lots of efforts to make it otherwise.

I also got confused as to whether a 'shader' was more for the visual GPU output of things, or whether it was also a building block for model training/networks/machine learning/etc.


> I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends.

If you want cross-platform compatibility (kinda), go for OpenCL; if you want the best performance, go for Metal. Both use very similar languages for kernels, but Metal is generally more efficient.

> Have you had any luck?

Not in ML, but I'm doing a lot of GPGPU on Metal, and I recently started doing it in Rust. It's a bit less convenient than with Swift/Objective-C, but still possible. Worst case, you'll have to add an .mm file and bridge it with `extern "C"`. That said, doing GPGPU is not doing ML, and most ML libraries are in Python.
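To make the `extern "C"` bridge concrete, here's a minimal sketch in plain Rust. The function name `run_metal_kernel` and its signature are hypothetical; in a real project the body would live in an .mm file compiled and linked separately, and here it's stubbed in Rust behind the same C ABI so the sketch is self-contained:

```rust
// The real declaration would import the Objective-C++ side, e.g.:
//
// extern "C" {
//     // Hypothetical: runs a Metal compute kernel over `len` floats.
//     fn run_metal_kernel(input: *const f32, output: *mut f32, len: usize) -> i32;
// }
//
// Stub with the same C ABI, so this sketch compiles and runs on its own.
#[no_mangle]
pub extern "C" fn run_metal_kernel(input: *const f32, output: *mut f32, len: usize) -> i32 {
    // SAFETY: the caller guarantees both pointers are valid for `len` elements.
    unsafe {
        for i in 0..len {
            *output.add(i) = *input.add(i) * 2.0; // stand-in for the real GPU work
        }
    }
    0 // success
}

fn main() {
    let input = [1.0_f32, 2.0, 3.0];
    let mut output = [0.0_f32; 3];
    let rc = run_metal_kernel(input.as_ptr(), output.as_mut_ptr(), input.len());
    println!("rc={} output={:?}", rc, output); // rc=0 output=[2.0, 4.0, 6.0]
}
```

The point is only the shape of the boundary: raw pointers plus a length, with the C calling convention on both sides.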

> I also got confused as to whether a 'shader' was more for the visual GPU output of things, or whether it was also a building block for model training/networks/machine learning/etc.

A shader is basically a function that runs for every element of the output buffer. We generally call them kernels for GPGPU, and shaders (geometry, vertex, fragment) for graphics stuff. You have to write them in a language that kinda looks like C (OpenGL's GLSL, DirectX's HLSL, Metal's MSL), but is optimized for the SIMT execution model of GPUs.

Learning shaders will let you run code on the GPU; to do ML you also need to learn what tensors are, how to compute them on the GPU, and how to build ML systems using them.
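The "one function per output element" idea can be sketched in plain Rust (CPU-only and sequential, purely to illustrate the model, not how a GPU actually dispatches work):

```rust
// A compute "kernel": one invocation per output element, identified by
// its global index `gid` -- conceptually what the GPU runs in parallel.
fn saxpy_kernel(gid: usize, a: f32, x: &[f32], y: &[f32], out: &mut [f32]) {
    out[gid] = a * x[gid] + y[gid];
}

fn main() {
    let x = vec![1.0_f32, 2.0, 3.0, 4.0];
    let y = vec![10.0_f32; 4];
    let mut out = vec![0.0_f32; 4];
    // A GPU would launch all of these invocations in parallel; here we loop.
    for gid in 0..out.len() {
        saxpy_kernel(gid, 2.0, &x, &y, &mut out);
    }
    println!("{:?}", out); // [12.0, 14.0, 16.0, 18.0]
}
```

A graphics fragment shader follows the same pattern, except the "output buffer" is pixels and the index is a screen coordinate.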

I recommend ShaderToy [0] if you want a cool way to understand and play with shaders.

[0]: https://www.shadertoy.com/


> GPGPU is not doing ML

> General-purpose computing on graphics processing units

> machine learning

Could you expand on why this is the case please? I thought machine learning was basically brute forcing a bunch of possibilities and keeping track of how different inputs "score", then ranking them accordingly to help make educated predictions later.

> GPGPU (general-purpose computing on graphics processing units) and machine learning are not the same thing, although they can be related in some ways.

> GPGPU refers to using the parallel processing power of graphics processing units (GPUs) to perform computations beyond graphics rendering. This involves using the massive number of cores in modern GPUs to accelerate tasks such as scientific simulations, numerical analysis, and other data-intensive applications. Essentially, GPGPU involves leveraging the processing power of GPUs for general-purpose computing tasks, not just for graphics processing.

> On the other hand, machine learning involves using algorithms and statistical models to enable computer systems to learn from data and improve their performance on a specific task. It involves feeding large amounts of data to a machine learning algorithm so that it can learn to recognize patterns and make predictions or decisions based on that data.

> While GPGPU can be used to accelerate the computation required in machine learning tasks, they are not the same thing. Machine learning is a specific type of computation, whereas GPGPU is a technique for accelerating computation in general. Additionally, GPGPU can be used for a wide variety of computational tasks, not just machine learning.


I'm not familiar with Metal, but on Apple Silicon aren't CPU and GPU memory completely shared?


They are, though not fully shared at the process level: the GPGPU API has to explicitly support mapping a buffer from the process's virtual memory space to the GPU.

I looked it up and turns out OpenCL also supports zero-copy buffers, so I edited my comment accordingly!


so write a kernel in OpenCL, then call it from Rust

is that what machine learning is doing at a high level?


At a very high level, yes. There is also the very important step of efficiently laying out data in GPU memory so the kernels can compute tensor values.


Can you confirm if OpenCL has been deprecated going forward for Apple Silicon please?

Also, should I expect to be able to use OpenCL 3.0 on Apple Silicon, or only v1.2 or 2.0 or something else?


Yes, according to Apple's official documentation, OpenCL was deprecated as of macOS 10.14. It is reported to still work, including on Apple Silicon (M1 and M2), but don't expect any updates.

[1]: https://developer.apple.com/library/archive/documentation/Pe...


Give this a look:

https://github.com/guillaume-be/rust-bert

https://github.com/guillaume-be/rust-bert/blob/master/exampl...

If you have Pytorch configured correctly, this should "just work" for a lot of the smaller models. It won't be a 1:1 ChatGPT replacement, but you can build some pretty cool stuff with it.

> it's basically Python or bust in this space

More or less, but that doesn't have to be a bad thing. If you're on Apple Silicon, you have plenty of performance headroom to deploy Python code for this. I've gotten this library to work on systems with as little as 2GB of memory, so outside of ultra-low-end use cases, you should be fine.


To clarify,

> Port of Hugging Face's Transformers library, using the tch-rs crate and pre-processing from rust-tokenizers.

> tch-rs: Rust bindings for the C++ api of PyTorch.

Which "backend" does this end up using on Apple Silicon, MPS (Metal Performance Shaders) or OpenCL?

https://pytorch.org/docs/stable/notes/mps.html

I'm going to guess MPS?


Whatever your Pytorch install is designed to accelerate. I've got Ampere-accelerated Pytorch running it on my ARM server, I assume MPS is used on compatible systems.


I believe that you can't get enough RAM with M1/M2 for this to be useful


This is meant to run on GPUs with 16GB RAM. Most M1/M2 users have at least 32GB (unified memory), and you can configure a MBP or Mac Studio with up to 96/128GB.

The Mac Pro is still Intel, but it can be configured with up to 1.5TB of RAM; you can imagine the M* replacement will have equally gigantic options when it comes out.


If you look closely, there's 16GB of GPU memory and over 200GB of CPU memory. So none of the currently available M* machines have the same kind of capacity. Let's hope this changes in the future!


Apple silicon has unified memory, the GPU has access to the entire 32/64/96/128GB of RAM. It's part of the appeal.

I would really like to see how stuff performs on a Mac Studio with 128GB memory, 8TB SSD (at 6GB/s), not to mention the extra 32 "neural engine" cores. It seems the performance of these machines has been barely explored so far.


I think the main bottleneck here is data movement. If you're streaming weight data from a 6GB/s SSD, you'll get under 10% of the performance shown for a 3090 (which will be moving data at PCIe 4 speeds of 64GB/s).

Once in unified memory the weights are accessible at about half the rate they are on the 3090 (400GB/sec on M2 Max vs 936GB/sec on 3090).
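A quick back-of-the-envelope check of those ratios, using the bandwidth figures as quoted above (the figures themselves are the thread's, not independently verified):

```rust
fn main() {
    // Bandwidth figures quoted in the thread, in GB/s.
    let ssd_stream = 6.0_f64; // streaming weights from the Studio's SSD
    let pcie4 = 64.0;         // quoted PCIe 4 rate feeding a 3090
    let m2_max_mem = 400.0;   // M2 Max unified memory bandwidth
    let rtx3090_mem = 936.0;  // RTX 3090 GDDR6X bandwidth

    // SSD streaming vs PCIe 4: ~9.4%, i.e. "under 10%".
    println!("SSD vs PCIe4: {:.1}%", 100.0 * ssd_stream / pcie4);
    // Unified memory vs 3090 memory: ~42.7%, i.e. "about half".
    println!("M2 Max vs 3090: {:.1}%", 100.0 * m2_max_mem / rtx3090_mem);
}
```

So both claims in the comment check out arithmetically: streaming from SSD lands just under the 10% mark, and unified memory sits a bit under half the 3090's bandwidth.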



