I'm writing a Flutter AI client app that integrates with llama.cpp. I built a PoC with llama.cpp running in WASM, since I'm desperate to signal that the app is agnostic to the AI provider, but it was horrifically slow, so I ended up backing out to WebMLC.
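For context, this is roughly the provider-agnostic shape I mean, as a minimal Dart sketch (names like `CompletionProvider` are placeholders I made up; it assumes a local llama.cpp `llama-server` exposing its `/completion` endpoint and the `http` package):

```dart
import 'dart:convert';
import 'package:http/http.dart' as http;

/// Any backend (llama.cpp server, WebMLC, a cloud API) implements this,
/// so the Flutter UI never knows which provider it's talking to.
abstract class CompletionProvider {
  Future<String> complete(String prompt, {int maxTokens = 128});
}

/// One concrete provider: POSTs to llama.cpp's llama-server /completion API.
class LlamaCppProvider implements CompletionProvider {
  LlamaCppProvider(this.baseUrl);
  final String baseUrl; // e.g. 'http://127.0.0.1:8080'

  @override
  Future<String> complete(String prompt, {int maxTokens = 128}) async {
    final res = await http.post(
      Uri.parse('$baseUrl/completion'),
      headers: {'Content-Type': 'application/json'},
      body: jsonEncode({'prompt': prompt, 'n_predict': maxTokens}),
    );
    // llama-server returns the generated text in the 'content' field.
    return (jsonDecode(res.body) as Map<String, dynamic>)['content'] as String;
  }
}
```

Swapping providers is then just constructing a different `CompletionProvider`; the WASM experiment was meant to be one more implementation behind that interface.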
What are you doing underneath here? If that's secret sauce, fair enough, but I'm curious what you're seeing in tokens/sec on, e.g., a phone vs. an M-series MacBook.