On May 13th, OpenAI unveiled their new multimodal model, GPT-4o (“omni”).
The demo app was ChatGPT, and the focus of the demo was the new Voice Mode.
The demos were exceptional and quite futuristic!
OpenAI’s engineering team figured out how to map audio to audio directly, as a first-class modality, which reduces latency and gives the model more “audio intelligence”.
The result is low-latency, natural-sounding conversational AI.
Many startups have been trying to do this for a while, but bringing the latency down has remained a challenge.
It turns out that an end-to-end trained speech foundation model is the solution.
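To see why, consider the classic cascaded voice pipeline: speech-to-text, then an LLM, then text-to-speech. The stages run in sequence, so their delays stack up. Here is a minimal sketch of that arithmetic; the stage names and millisecond figures are illustrative assumptions, not measurements from any real system.

```python
import time

# Hypothetical per-stage latencies (in milliseconds) for a cascaded
# voice pipeline. These numbers are illustrative assumptions only.
CASCADE_STAGES = {
    "speech_to_text": 300,   # streaming ASR transcribes the user's audio
    "llm_inference": 700,    # a text LLM produces the reply text
    "text_to_speech": 400,   # TTS synthesizes the reply audio
}

END_TO_END_MS = 350  # a single audio-in, audio-out model (assumed figure)


def simulate(stages_ms: dict[str, int]) -> int:
    """Run each stage sequentially and return the total latency in ms."""
    total = 0
    for name, ms in stages_ms.items():
        time.sleep(ms / 1000)  # stand-in for actual model inference
        total += ms
        print(f"{name}: {ms} ms")
    return total


if __name__ == "__main__":
    cascade_total = simulate(CASCADE_STAGES)
    print(f"Cascaded pipeline total: {cascade_total} ms")
    print(f"End-to-end speech model: {END_TO_END_MS} ms")
```

A cascade pays every stage’s latency in sequence (and loses tone, emotion, and interruptions at the text boundary); a single speech model pays one inference.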
The beauty of this model is that it can perform many tasks in parallel (see the sketch after this list):
Transcribe (even better than Whisper)
Translate (better than many existing models)
Reason at GPT-4 Turbo level, ahead of many other models
Generate responses quickly
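On the text side, all of this sits behind one model in the standard API. Below is a minimal sketch using the official openai Python SDK (v1.x); the audio-in/audio-out path shown in the demo was not exposed in the API at launch, so this goes through the text endpoint, and the prompt is an illustrative assumption.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request, two of the tasks above: translation plus reasoning.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "Translate into French: 'Where is the train station?' "
                "Then explain, in one sentence, why literal word-for-word "
                "translation often fails for idioms."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```

The same weights handle transcription, translation, and reasoning, which is exactly what makes the model attractive as the brain of a voice product.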
So how will this impact Voice Bots (e.g. in Call Centers)?
Once GPT-4o’s Voice Mode is made generally available, companies will switch to it. Their voice bots will:
sound more natural
have 2-3x lower latency
speak many languages
Adoption of voice bot products will simply accelerate. Exciting times ahead!