On May 13th, OpenAI unveiled their new multimodal model, GPT-4o (“omni”).
The demo app was ChatGPT, and the focus of the demo was the new Voice Mode.
The demos were exceptional and quite futuristic!
OpenAI’s engineering team figured out how to map audio to audio directly, as a first-class modality, which reduces latency and gives the model more “audio intelligence”.
The result is low-latency, natural-sounding conversational AI.
Many startups have been trying to do this for a while, but bringing the latency down has remained a challenge.
It turns out that an end-to-end trained speech foundation model is the solution.
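To see why, consider the classic cascaded voice pipeline: speech-to-text, then an LLM, then text-to-speech. The stages run in sequence, so their delays stack up. Here is a minimal sketch of that arithmetic; the stage names and millisecond figures are illustrative assumptions, not measurements from any real system.

```python
import time

# Hypothetical per-stage latencies (in milliseconds) for a cascaded
# voice pipeline. These numbers are illustrative assumptions only.
CASCADE_STAGES = {
    "speech_to_text": 300,   # streaming ASR transcribes the user's audio
    "llm_inference": 700,    # a text LLM produces the reply text
    "text_to_speech": 400,   # TTS synthesizes the reply audio
}

END_TO_END_MS = 350  # a single audio-in, audio-out model (assumed figure)


def simulate(stages_ms: dict[str, int]) -> int:
    """Run each stage sequentially and return the total latency in ms."""
    total = 0
    for name, ms in stages_ms.items():
        time.sleep(ms / 1000)  # stand-in for actual model inference
        total += ms
        print(f"{name}: {ms} ms")
    return total


if __name__ == "__main__":
    cascade_total = simulate(CASCADE_STAGES)
    print(f"Cascaded pipeline total: {cascade_total} ms")
    print(f"End-to-end speech model: {END_TO_END_MS} ms")
```

A cascade pays every stage’s latency in sequence (and loses tone, emotion, and interruptions at the text boundary); a single speech model pays one inference.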
The beauty of this model is that it can perform many tasks in parallel (see the sketch after this list):
Transcribe (even better than Whisper)
Translate (better than many existing models)
Reason at GPT-4 Turbo level, ahead of many other models
Generate responses quickly
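On the text side, all of this sits behind one model in the standard API. Below is a minimal sketch using the official openai Python SDK (v1.x); the audio-in/audio-out path shown in the demo was not exposed in the API at launch, so this goes through the text endpoint, and the prompt is an illustrative assumption.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request, two of the tasks above: translation plus reasoning.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "Translate into French: 'Where is the train station?' "
                "Then explain, in one sentence, why literal word-for-word "
                "translation often fails for idioms."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```

The same weights handle transcription, translation, and reasoning, which is exactly what makes the model attractive as the brain of a voice product.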
So how will this impact Voice Bots (e.g. in Call Centers)?
Once GPT-4o’s Voice Mode is made generally available, companies will switch to it. Their voice bots will:
sound more natural
have 2-3x lower latency
speak many languages
Adoption of voice bot products will simply accelerate. Exciting times ahead!