AI voice agents are everywhere—handling customer service calls, booking appointments, and assisting in day-to-day business. While their quality has been improving fast, one problem keeps getting in the way: they don’t know when to talk and when to listen.
The technical term for this problem is turn-taking, or interruption handling.
Turn-Taking
Turn-taking is a hot problem right now, with many companies trying to solve it:
- LiveKit has published an article showing how they tackle it.
- Daily recently started an open-source project called smart-turn.
- OpenAI recently launched its own take on it, a feature called “semantic VAD”.
One case where turn-taking fails miserably is noisy environments. Whenever there is background noise or chatter, AI agents get confused: they interrupt at the wrong time, talk over the caller, or miss what’s actually being said. The result is a frustrating, unnatural conversation.
In a normal conversation, humans naturally know when to pause, respond, or wait their turn. AI agents don’t have that instinct. Today they rely on Voice Activity Detection (VAD), a model that classifies each chunk of audio as human speech or non-speech. The agent watches VAD’s output, and once it has seen enough consecutive “non-speech” frames, it decides the person has finished speaking.
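Here is a minimal sketch of that logic, using the open-source webrtcvad package. The frame size and the 700 ms silence threshold are illustrative choices, not values from any particular product:

```python
# Naive VAD-based end-of-turn detection: the agent assumes the user
# is done once it has seen enough consecutive non-speech frames.
import webrtcvad

SAMPLE_RATE = 16000           # webrtcvad supports 8/16/32/48 kHz mono 16-bit PCM
FRAME_MS = 30                 # webrtcvad accepts 10, 20, or 30 ms frames
END_OF_TURN_SILENCE_MS = 700  # tunable: how much silence ends the turn

vad = webrtcvad.Vad(2)        # aggressiveness 0 (lenient) to 3 (strict)

def end_of_turn(frames):
    """Yield (frame, turn_ended) for a stream of raw PCM frames."""
    silence_ms = 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            silence_ms = 0    # any speech resets the silence counter
        else:
            silence_ms += FRAME_MS
        yield frame, silence_ms >= END_OF_TURN_SILENCE_MS
```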
However, VAD-based turn detection is too primitive for real-life scenarios: people often pause without having finished speaking, for example while recalling an account number or thinking through an answer mid-sentence.
Here is a great deep-dive video on turn-taking.
Improving Turn-Taking with Noise and Voice Cancellation
While solving turn-taking outright is hard, we can improve it substantially with noise and voice cancellation technology, placed in the pipeline just before the VAD and speech recognition models (see the sketch after the list below).
By filtering out background noise and competing voices in real time, AI agents receive only the speech that matters. That means:
- Far fewer false interruptions
- Far fewer missed responses
- Smoother, more human-like conversations
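Conceptually, the ordering looks like this. The denoise() and send_to_asr() functions below are placeholders standing in for a real-time cancellation model and a speech recognition stage, not real APIs:

```python
# Pipeline ordering: raw audio -> noise/voice cancellation -> VAD -> ASR.
import webrtcvad

SAMPLE_RATE = 16000
vad = webrtcvad.Vad(2)

def denoise(frame: bytes) -> bytes:
    """Placeholder: run a real-time noise/voice cancellation model here."""
    return frame

def send_to_asr(frame: bytes) -> None:
    """Placeholder: forward clean speech to the speech recognition model."""

def process(frames):
    for frame in frames:
        clean = denoise(frame)                 # cancellation runs first...
        if vad.is_speech(clean, SAMPLE_RATE):  # ...so VAD judges clean audio
            send_to_asr(clean)
```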
Here’s how a voice agent built with Daily, Pipecat, and Gemini performs in a noisy environment, with and without noise cancellation:
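In a Pipecat-based agent like that one, the natural place to attach cancellation is the transport’s audio input, so the VAD analyzer only ever sees filtered audio. The sketch below assumes Pipecat’s Krisp integration; module paths and parameter names such as KrispFilter and audio_in_filter may differ between releases:

```python
# Hedged sketch: noise cancellation ahead of VAD in a Pipecat transport.
# Exact class names and parameters may vary across Pipecat versions.
from pipecat.audio.filters.krisp_filter import KrispFilter
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url="https://example.daily.co/room",  # placeholder room URL
    token=None,
    bot_name="voice-agent",
    params=DailyParams(
        audio_in_enabled=True,
        audio_in_filter=KrispFilter(),     # cancellation runs first...
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # ...so VAD sees clean audio
    ),
)
```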
The Results
Real-world tests show that when background voice and noise cancellation are applied before VAD, AI agents perform much better:
- 3.5x fewer false interruptions → a 71% decrease in the AI cutting users off unnecessarily.
- 2x better speech recognition accuracy → agents hear and respond more accurately.
- 50% fewer call drops → fewer conversations abandoned over frustrating interruptions.
- 30% increase in CSAT → smoother interactions make for happier customers.
Leading conversational AI platforms, including Vodex, Fixie, Daily, LiveKit, and Fluidworks, have already integrated noise and voice cancellation to fix turn-taking and improve response accuracy.
Below is a technical report on how exactly this works:
What This Means for AI Teams
If you’re building or deploying AI voice agents, this is a must-have. Without noise cancellation, AI models are guessing when to talk and when to listen, which leads to broken conversations.
Clean audio means better AI decisions. And better AI decisions mean better user experiences. It’s that simple.