Because for anything other than CPU inference, it is inferior to TensorRT and vLLM.



Are you sure about that? What are the benchmarks it fails on when set up like-for-like with GPU drivers? Even so, it can do constrained grammar and is really easy to set up with wrappers like Ollama and the Python server. I struggled to find support for that with other inference engines, though things change fast!
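
For anyone unfamiliar with constrained grammar decoding: here's a minimal sketch using the llama-cpp-python bindings, which is one way to get at it from Python. The model path and the grammar itself are placeholders I made up for illustration, not anything from this thread.

    # Minimal sketch of GBNF constrained decoding via llama-cpp-python.
    # The model path and grammar are hypothetical placeholders.
    from llama_cpp import Llama, LlamaGrammar

    # A tiny GBNF grammar that only admits the strings "yes" or "no".
    grammar = LlamaGrammar.from_string(r'''
    root ::= "yes" | "no"
    ''')

    llm = Llama(model_path="model.gguf")  # any local GGUF model

    out = llm(
        "Is the sky blue? Answer yes or no: ",
        grammar=grammar,  # restricts sampling to strings the grammar accepts
        max_tokens=8,
    )
    print(out["choices"][0]["text"])

The sampler masks out any token that would take the output outside the grammar, so the model can only ever produce "yes" or "no" here, which is exactly the structured-output guarantee that's hard to find in some other engines.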



