This update is pretty exciting, I’m gonna try running a large model (65B) with a...

rain1 · on May 14, 2023

Tell us how it goes! Try different numbers of layers if needed.

A good place to dig for prompt structures may be the 'text-generation-webui' commit log. For example https://github.com/oobabooga/text-generation-webui/commit/33...

int_19h · on May 15, 2023

I tried llama-65b on a system with RTX 4090 + 64Gb of DDR5 system RAM. I can push up to 45 layers (out of 80) to the GPU, and the overall performance is ~800ms / token, which is "good enough" for real-time chat.

guardiangod · on May 14, 2023

I got the alpaca 65B GGML model to run on my 64GB ram laptop. No GPU required if you can tolerate the 1 token per 3 seconds rate.

syntaxing · on May 14, 2023

Supposedly the new update with GPU offloading will bring that up to 10 tokens per second! 1 token per second is painfully slow, that’s about 30s for a sentence.

int_19h · on May 15, 2023

10 tokens / second is what you get running llama-30b entirely on the GPU. A 65b model will be slower than that since there's more compute involved.