Top Updates 💪
xAI ships standalone Grok STT and TTS APIs with streaming transcription at $0.20/hr and expressive TTS with inline emotion tags across 20 languages. (xAI)
Google launches Gemini 3.1 Flash TTS with 200+ audio tags for fine-grained voice control, multi-speaker dialogue, and SynthID watermarking across 70+ languages. (Google Blog)
Starlink customer support is now Grok-powered, with a voice AI agent handling sales, troubleshooting, and account setup on the phone line. (PCMag)
Cloudflare adds real-time voice to its Agents SDK, enabling voice-enabled agents over WebSockets in ~30 lines of server code on Durable Objects. (Cloudflare Blog)
DeepL launches voice-to-voice translation for meetings with Zoom and Teams add-ons, plus a developer API for custom use cases like call centers. (TechCrunch)
Phonely raises $16M Series A for AI phone agents that drove $10M+ in insurance policy sales for a single customer this year. (Axios)
Krisp launches British English accent conversion, letting offshore agents in India, the Philippines, and beyond sound local for UK-facing programs in real time. (CXM World)
interface.ai launches Nexus, a fully agentic CCaaS platform for credit unions that eliminates hold queues by keeping AI in the conversation with human backup. (GlobeNewswire)
ConverseNow partners with Deliverect to pipe voice AI phone and drive-thru orders into unified restaurant order management across thousands of locations. (PR Newswire)
ENCO unveils enSpeak at NAB Show, adding real-time voice translation to its captioning workflow so viewers can hear live broadcasts in their preferred language. (Content + Technology)
Engineering Corner 😎
NVIDIA releases Audio Flamingo Next (AF-Next), an open large audio-language model that understands speech, sound, and music with 30-minute context and timestamp-grounded reasoning. (MarkTechPost)
MOSS-TTS-Nano-100M brings multilingual voice cloning to CPUs with a 100M-parameter model that streams 48 kHz audio in 20 languages. (HackerNoon)
Hands-on VibeVoice tutorial covering speaker-aware ASR, real-time TTS, and speech-to-speech pipelines with code. (MarkTechPost)
Build a real-time voice agent with Pipecat: a step-by-step guide to streaming STT/TTS pipelines. (HackerNoon)
Build an AI medical scribe using voice agents for clinical documentation. (HackerNoon)
Diction: self-hosted STT setup guide as an open alternative to Wispr Flow. (dev.to)
Deepgram and Modulate benchmarked under real-world audio conditions. (HackerNoon)

