I just tried to run the example in the README, using the OPT-30B model. It appeared to download 60GiB of model files, and then it attempted to read all of it into RAM. My laptop has "only" 32GiB of RAM, so it just ran out of memory.
You have to change the --percent flag, which takes some experimentation. The format is three pairs of 0-100 integers: one pair each for the parameters, the attention cache, and the hidden states. In each pair, the first number is the percentage on GPU, the second is the percentage on CPU (system RAM), and whatever remains goes to disk.
For disk offloading to work, you may also have to specify --offload-dir.
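As a concrete sketch (assuming this is FlexGen and its flexgen.flex_opt entry point; the split itself is just an illustration you'd tune for your own hardware):

```sh
# Sketch, not copied verbatim from the README. --percent takes six
# integers, read as three GPU/CPU pairs:
#   weights: 0% GPU, 50% CPU  -> remaining 50% spills to disk
#   cache:   100% GPU, 0% CPU
#   hidden:  100% GPU, 0% CPU
# --offload-dir can be any writable directory, ideally on a fast SSD.
python3 -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --percent 0 50 100 0 100 0 \
  --offload-dir ~/flexgen-offload
```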
I have OPT-30B running on a 3090 with --percent 20 50 100 0 100 0, though I suspect those numbers could be tuned to be faster.
128GB here, but by turning on compression I managed to fit the whole thing on the GPU. I also tried running it off a mix of RAM and SSD, and it was slower but still usable. Presumably disk speed matters a lot.
Well, I just got some more sticks. While I wait for the RAM to arrive, I'll try it with compress_weight and compress_cache. If you're in a Discord or any other space where people are tinkering with this, I'd love to join!
After turning on compression I was able to fit the whole thing in GPU memory, and it became much faster. Not ChatGPT speeds or anything, but under a minute for a response in their chatbot demo, and a few seconds in some cases.
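If it helps, this is roughly the invocation I mean (again assuming FlexGen's flex_opt entry point; whether an all-GPU placement actually fits depends on your card):

```sh
# Sketch: --compress-weight / --compress-cache turn on FlexGen's
# compression of the weights and the attention (KV) cache, which is
# what let the whole model sit at 100% GPU placement here.
python3 -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --percent 100 0 100 0 100 0 \
  --compress-weight \
  --compress-cache
```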
FWIW I was able to load the OPT-6.7B model and play with it in chatbot mode. This would not have been possible without the offloading, so... cool stuff!