Voice Agents Go Mainstream

Voice AI weekly digest

Davit Baghdasaryan

May 04, 2026

Three important Voice AI events this week:

Twilio Signal - May 6-7 in SF
Cerebral Valley Voice Summit - May 6 in SF
NVIDIA Developer Meetup | Building and Evaluating Real-time Voice Agents - May 7 in SF

Top Updates 💪

xAI launches Custom Voices, a voice cloning API that creates a voice ID from 120 seconds of audio with speaker verification, plus 80+ built-in voices across 28 languages. (VentureBeat)
Microsoft ships real-time voice agents in Copilot Studio, now GA in Dynamics 365 Contact Center with low-latency speech-to-speech, interruptions, and mid-call language switching. (Microsoft Blog)
Amazon adds “Join the Chat” to product pages, letting shoppers ask voice or text questions during AI audio summaries and get real-time conversational answers. (TechCrunch)
Otter.ai pivots from notetaker to Conversational Knowledge Engine, launching MCP connectors, AI Chat, and a desktop app to turn meeting data into agentic workflows. (BusinessWire)
Deepgram launches Flux Multilingual with 10 languages and mid-call language switching, plus model-based turn detection under 400ms. (SiliconANGLE)
Twilio Q1 voice revenue hits a 19-quarter high, up 20% YoY with Conversational Intelligence and Branded Calling both growing over 100%. (The Next Web)
NordVPN adds an AI voice deepfake detector to its Chrome extension, analyzing acoustic patterns in real time without recording or interpreting content. (BetaNews)
Audion raises $15M to bring AI-powered contextual audio ad targeting to the U.S., processing 500K hours of audio weekly for brands like Apple and Nike. (Axios)
3CLogic launches outbound voice AI agents with multimodal voice+digital capabilities and an automated LLM-powered QA engine for scoring every AI interaction. (PR Newswire)
AI-generated podcasts are booming on Spotify, Apple, and YouTube, with AI hosts that sound convincingly human raising questions about disclosure. (Inc)
Tells launches AI voice agents on existing SMS numbers with a single toggle, adding sub-second-latency voice to any business texting line without a new number or integration. (AIthority)
SpeakON ships a MagSafe AI dictation accessory that turns iPhone voice input into formatted, tone-adapted text with translation across 12 languages. (9to5Mac)
Docplanner’s voice AI agent “Noa Booking” doubles doctor appointment bookings vs traditional call centers, built on Twilio ConversationRelay. (Health Tech Digital)
Lumeris adds native audio to its Tom platform using Gemini’s speech-to-speech capabilities for real-time, empathetic patient conversations in primary care. (HIT Consultant)
Ablio launches AI-powered interpretation with a hybrid human+AI model, combining ASR, neural translation, and TTS for live multilingual events on Zoom and Teams. (AIthority)

Engineering Corner 😎

Sakana AI introduces KAME, a tandem speech-to-speech architecture that lets a backend LLM inject knowledge in real time while the front-end keeps talking with near-zero latency. (Sakana AI)

NVIDIA releases Nemotron 3 Nano Omni, an open 30B-A3B multimodal model unifying vision, audio, and language with 9x higher throughput than competing omni models. (NVIDIA Blog)
OpenMOSS releases MOSS-Audio, an open-source foundation model for speech, sound, music understanding, and time-aware audio reasoning in 4B and 8B variants. (MarkTechPost)
Async publishes an open TTS benchmark revealing major accuracy gaps when streaming models handle phone numbers, dates, and prices in production. (Podnews)
Speaker diarization explained: how AI knows who said what, from spectral embeddings to clustering. (dev.to)
Laravel AI SDK tutorial: add TTS and voice to your app in 20 minutes. (dev.to)
Hobbyist builds a C-3PO head with real-time voice interaction using off-the-shelf speech models. (Let’s Data Science)

Voice AI Newsletter
