Gemini’s Multimodal Live API and ElevenLabs Conversational AI can both take audio as input and generate audio in response.
Both offerings run over WebSocket, come with out-of-the-box voice activity detection (VAD), and handle interruptions automatically.
They are well suited to turn-based conversation (chat). Models of this type buffer the incoming audio until they detect a pause, which they treat as the end of the user's turn. They then run STT -> LLM -> TTS, handle any user interruption, and the loop continues.
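To make the "buffer until a pause" idea concrete, here is a minimal sketch of pause-based turn detection. It is not how either service implements VAD. It assumes a simple energy threshold, and the names `split_turns`, `threshold` and `max_silent_frames` are made up for illustration.

```python
from typing import Iterable, Iterator, List

Frame = List[int]  # one short frame of 16-bit PCM samples


def rms(frame: Frame) -> float:
    """Root-mean-square energy of one frame."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def split_turns(
    frames: Iterable[Frame],
    threshold: float = 500.0,    # assumed energy cutoff between speech and silence
    max_silent_frames: int = 3,  # assumed pause length that ends a turn (>= 1)
) -> Iterator[List[Frame]]:
    """Buffer frames and yield one list of frames per detected user turn."""
    buffered: List[Frame] = []
    silent_run = 0
    heard_speech = False
    for frame in frames:
        if rms(frame) >= threshold:
            # Speech: keep buffering and reset the pause counter.
            heard_speech = True
            silent_run = 0
            buffered.append(frame)
        elif heard_speech:
            # Silence after speech: count it toward the end-of-turn pause.
            silent_run += 1
            buffered.append(frame)
            if silent_run >= max_silent_frames:
                # Pause is long enough: emit the turn without trailing silence.
                yield buffered[: len(buffered) - silent_run]
                buffered, silent_run, heard_speech = [], 0, False
        # Silence before any speech is ignored.
    if heard_speech:
        # Stream ended mid-turn: emit whatever speech was buffered.
        yield buffered[: len(buffered) - silent_run]
```

Each yielded turn is what would then be handed to the STT -> LLM -> TTS step. A real VAD is usually model-based rather than a fixed energy threshold, so treat the numbers here as placeholders.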
Sending and receiving audio, VAD signals, interruptions, and transcriptions are all just JSON events, sent or received over the WebSocket.
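A rough sketch of what "everything is a JSON event" can look like on the client side. The event names and fields below (`audio_chunk`, `audio_base64`, `interruption`, `transcript`) are hypothetical and are not the actual Gemini or ElevenLabs schemas, so check each provider's API reference for the real message shapes.

```python
import base64
import json
from typing import List, Optional


def encode_audio_event(pcm: bytes) -> str:
    """Wrap raw audio bytes in a JSON text message (hypothetical schema)."""
    return json.dumps(
        {"type": "audio_chunk", "audio_base64": base64.b64encode(pcm).decode("ascii")}
    )


def handle_event(raw: str, playback_queue: List[bytes], transcript: List[str]) -> Optional[str]:
    """Dispatch one incoming JSON message by its (hypothetical) type field."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "audio_chunk":
        # Agent audio: decode it and queue it for playback.
        playback_queue.append(base64.b64decode(event["audio_base64"]))
    elif kind == "interruption":
        # User spoke over the agent: drop audio that has not been played yet.
        playback_queue.clear()
    elif kind == "transcript":
        # Text of what was said, useful for logs or captions.
        transcript.append(event["text"])
    return kind
```

In a real client, `encode_audio_event` would feed the socket's send call and `handle_event` would run on every received message. The WebSocket connection itself is left out to keep the sketch self-contained.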
#elevenlabs #gemini2.0 #audio #llm