Any chance these work on CPUs with any acceptable performance?
I have a 10-core, 20-thread monster CPU, but didn't bother with a dedicated GPU because I can't control something as simple as its temperature. See the complicated procedure that only works with the large proprietary driver here: https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...
I don't know about these large models, but earlier I saw a random HN comment in a different thread where someone showed a GPT-J model running on CPU only: https://github.com/ggerganov/ggml
I tested it on my Linux machine and my MacBook Air M1, and it generates tokens at a reasonable speed using the CPU only. I noticed it doesn't quite use all of my available CPU cores, so it may be leaving some performance on the table; not sure though.
GPT-J 6B is nowhere near as large as the OPT-175B in the post, but it gave me the sense that CPU-only inference may not be totally hopeless even for large models, if only we had some high-quality software to do it.
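I haven't dug further than that, but to give a feel for what CPU-only generation looks like in code, here's a minimal sketch using Hugging Face transformers rather than ggml (the model id, thread count, and generation settings are just assumptions on my part; GPT-J 6B in fp32 needs roughly 24 GB of RAM):

```python
# Minimal CPU-only text generation sketch (PyTorch + transformers, not ggml).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(20)  # try matching your logical core count

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")  # fp32 on CPU

inputs = tokenizer("CPU-only inference is", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```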
There's also Fabrice Bellard's inference code: https://textsynth.com/technology.html. He claims up to 41 tokens per second on the GPT-NeoX 20B model.
Your CPU gets maybe 700-800 GFLOPS depending on your all-core frequency (fp32, since you don't have Sapphire Rapids). The T4 benchmarked would be crunching what it can at ~65 TFLOPS (fp16 tensor). Newer GPUs hit ~300 TFLOPS (4090) or even nearly 2 PFLOPS (H100).
That should give you an idea of the order of magnitude of the compute difference. Sapphire Rapids has AMX and fp16 AVX-512 to close the gap a little, but it's still massive.
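Back-of-envelope with the rough numbers above (all approximate peak figures, and comparing fp32 on the CPU against fp16 tensor-core throughput, so take the ratios loosely):

```python
# Rough peak-throughput ratios from the approximate figures above.
cpu_fp32_tflops = 0.75    # ~700-800 GFLOPS
t4_fp16_tflops = 65
h100_fp16_tflops = 2000   # "nearly 2 petaflops"

print(round(t4_fp16_tflops / cpu_fp32_tflops))    # ~87x
print(round(h100_fp16_tflops / cpu_fp32_tflops))  # ~2667x
```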
With what, 50GB/s memory bandwidth? That's no monster. The two consumer GPUs in my machine both do 1TB/s and are still bottlenecked on memory bandwidth.
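To put numbers on that: during decoding you stream essentially all the weights once per generated token, so memory bandwidth puts a hard ceiling on tokens per second. The figures below are illustrative assumptions (fp16 weights, no overlap or caching tricks), not measurements:

```python
# Upper bound on decode speed if every weight is read once per token.
def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_per_s):
    model_gb = params_billion * bytes_per_param  # model size in GB, roughly
    return bandwidth_gb_per_s / model_gb

print(max_tokens_per_sec(6, 2, 50))      # GPT-J 6B, fp16, ~50 GB/s   -> ~4 tok/s
print(max_tokens_per_sec(6, 2, 1000))    # same model at ~1 TB/s      -> ~83 tok/s
print(max_tokens_per_sec(175, 2, 1000))  # OPT-175B, fp16, ~1 TB/s    -> ~2.9 tok/s
```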
> only works with the large proprietary driver here
In practice, nothing works without the proprietary driver, so this isn't specific to temperature. Also, the setting you're looking for is almost certainly `nvidia-smi -pl $watts` to set the power limit, not whatever that wiki gives you. Roughly: GPU temperature = ambient temperature + (power limit) * (thermal resistance of the cooler).
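Plugging made-up but plausible numbers into that relation (the thermal resistance here is an assumed value, not a spec for any particular cooler):

```python
# Steady-state GPU temperature from the relation above (illustrative numbers only).
ambient_c = 25.0                   # room temperature, degrees C
power_limit_w = 250.0              # e.g. after `nvidia-smi -pl 250`
thermal_resistance_c_per_w = 0.15  # assumed cooler + case airflow

print(ambient_c + power_limit_w * thermal_resistance_c_per_w)  # 62.5 C
```

Lowering the power limit lowers the steady-state temperature roughly linearly, which is why `-pl` is usually all you need.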
The other answers give you a few of the current solutions.
In the long term I am hoping that JAX (/XLA) will get better support for the CPU backend of its compiler and, in particular, use SIMD and multiple cores better than it currently does.
It is very doable (just low priority), and it would mean that a lot of models could get close to optimal CPU performance out of the box, which would be a step forward for accessibility.
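As a concrete example of the kind of workload I mean, here's a tiny JAX snippet that JIT-compiles through XLA's CPU backend; how well the generated code uses wide SIMD and all the cores is exactly the part I'm hoping improves (the shapes are arbitrary):

```python
import jax
import jax.numpy as jnp

print(jax.devices())  # on a GPU-less machine this lists only CPU devices

@jax.jit
def layer(x, w, b):
    # One dense layer; XLA lowers the matmul to its CPU backend.
    return jax.nn.relu(x @ w + b)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 4096))
w = jax.random.normal(key, (4096, 4096))
b = jnp.zeros(4096)
print(layer(x, w, b).shape)  # (64, 4096)
```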