I just tried to run the example in the README, using the OPT-30B model. It appeared to download 60GiB of model files, and then it attempted to read all of it into RAM. My laptop has "only" 32GiB of RAM, so it just ran out of memory.
You have to change the --percent flag, which takes some experimentation. The format is three pairs of 0-100 integers: one pair each for the parameters, the attention cache, and the hidden states. In each pair, the first number is the percentage on GPU, the second is the percentage on CPU (system RAM), and whatever remains goes to disk.
For disk offloading to work, you may also have to specify --offload-dir.
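As a concrete sketch (assuming this is FlexGen and its flexgen.flex_opt entry point; the split itself is just an illustration you'd tune for your own hardware):

```sh
# Sketch, not copied verbatim from the README. --percent takes six
# integers, read as three GPU/CPU pairs:
#   weights: 0% GPU, 50% CPU  -> remaining 50% spills to disk
#   cache:   100% GPU, 0% CPU
#   hidden:  100% GPU, 0% CPU
# --offload-dir can be any writable directory, ideally on a fast SSD.
python3 -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --percent 0 50 100 0 100 0 \
  --offload-dir ~/flexgen-offload
```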
I have OPT-30B running on a 3090 with --percent 20 50 100 0 100 0, though I suspect those numbers could be tuned to be faster.
128GB here, but by turning on compression I managed to fit the whole thing on the GPU. I also tried running it off a mix of RAM and SSD, and it was slower but still usable. Presumably disk speed matters a lot.
Well, I just got some more sticks. While I wait for the RAM to arrive, I'll try it with compress_weight and compress_cache. If you're in a Discord or any other space where people are tinkering with this, I'd love to join!
After turning on compression I was able to fit the whole thing in GPU memory, and it became much faster. Not ChatGPT speeds or anything, but under a minute for a response in their chatbot demo, and a few seconds in some cases.
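If it helps, this is roughly the invocation I mean (again assuming FlexGen's flex_opt entry point; whether an all-GPU placement actually fits depends on your card):

```sh
# Sketch: --compress-weight / --compress-cache turn on FlexGen's
# compression of the weights and the attention (KV) cache, which is
# what let the whole model sit at 100% GPU placement here.
python3 -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --percent 100 0 100 0 100 0 \
  --compress-weight \
  --compress-cache
```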
FWIW I was able to load the OPT-6.7B model and play with it in chatbot mode. This would not have been possible without the offloading, so... cool stuff!