I developed a stack on Cloudflare workers where latency is super low and it is cheap to run at scale thanks to Cloudflare pricing.
Runs at around 50 cents per hour using AssemblyAI or Deepgram as the STT, Gemini Flash as LLM and InWorld.ai as the TTS (for me it’s on par with ElevenLabs and super fast)
I am not using speech to speech APIs like OpenAI, but it would be easy to swap the STT + LLM + TTS to using Realtime (or Gemini Live API for that matter).
OpenAI realtime voices are really bad though, so you can also configure your session to accept AUDIO and output TEXT, and then use any TTS provider (like ElevenLabs or InWord.ai, my favorite for cost) so generate the audio.
Runs at around 50 cents per hour using AssemblyAI or Deepgram as the STT, Gemini Flash as LLM and InWorld.ai as the TTS (for me it’s on par with ElevenLabs and super fast)