
This update is pretty exciting, I'm gonna try running a large model (65B) with a 3090. I have run a ton of local LLMs, but the hardest part is figuring out the prompt structure. I wish there were some sort of centralized database that explains it.
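For anyone stuck on the same thing, here's the Alpaca-style template that many LLaMA instruction fine-tunes expect (just an illustration; other fine-tunes like Vicuna or WizardLM use different markers, so check the model card):

    # Alpaca-style instruction template (illustrative; exact wording varies by fine-tune).
    ALPACA_TEMPLATE = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    )
    print(ALPACA_TEMPLATE.format(instruction="Explain GPU layer offloading in one sentence."))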



Tell us how it goes! Try different numbers of layers if needed.

A good place to dig for prompt structures may be the 'text-generation-webui' commit log. For example https://github.com/oobabooga/text-generation-webui/commit/33...
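If you'd rather script it than use the webui, I believe the llama-cpp-python bindings expose the same layer-offload knob. Rough sketch below (assumed API; the model path and layer count are placeholders, tune them to your VRAM):

    # Minimal sketch using llama-cpp-python; path and n_gpu_layers are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-65b.ggmlv3.q4_0.bin",  # hypothetical local GGML file
        n_gpu_layers=40,  # number of transformer layers to offload to the GPU
        n_ctx=2048,       # context window size
    )

    out = llm("### Instruction:\nSay hi.\n\n### Response:\n", max_tokens=32)
    print(out["choices"][0]["text"])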


I tried llama-65b on a system with an RTX 4090 + 64 GB of DDR5 system RAM. I can push up to 45 layers (out of 80) to the GPU, and the overall performance is ~800 ms/token, which is "good enough" for real-time chat.
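If anyone wants to reproduce that number, a quick way to measure ms/token is to stream the output and time it. Quick sketch (again llama-cpp-python with a placeholder path, not a tested benchmark):

    # Rough timing sketch: stream tokens and report average ms/token for a given n_gpu_layers.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-65b.ggmlv3.q4_0.bin",  # placeholder path
                n_gpu_layers=45, n_ctx=2048)

    start, n_tokens = time.perf_counter(), 0
    for _ in llm("### Instruction:\nWrite a haiku about VRAM.\n\n### Response:\n",
                 max_tokens=64, stream=True):
        n_tokens += 1
    elapsed = time.perf_counter() - start
    print(f"{1000 * elapsed / n_tokens:.0f} ms/token over {n_tokens} tokens")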


I got the Alpaca 65B GGML model to run on my 64 GB RAM laptop. No GPU required if you can tolerate a rate of one token every three seconds.


Supposedly the new update with GPU offloading will bring that up to 10 tokens per second! Even 1 token per second is painfully slow; that's about 30 seconds for a sentence.


10 tokens/second is what you get running llama-30b entirely on the GPU. A 65B model will be slower than that, since there's more compute involved per token.
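Back-of-envelope, if you assume generation throughput scales roughly inversely with parameter count once the whole model fits in VRAM:

    # Rough estimate only; assumes generation is memory-bandwidth bound and scales ~1/params.
    tok_s_30b = 10.0                      # reported speed for llama-30b fully on GPU
    tok_s_65b_est = tok_s_30b * 30 / 65   # ~4.6 tokens/s as an optimistic ceiling
    print(f"estimated llama-65b: ~{tok_s_65b_est:.1f} tokens/s")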



