Grok Powers Vapi, Gemma 4 Brings Audio to Your Laptop

Voice AI weekly digest

Davit Baghdasaryan

Jun 08, 2026

Top Updates 💪

xAI brings Grok TTS and STT to Vapi, letting developers build voice agents with Grok’s speech models. (xAI)
Sesame, the startup from the co-founders of Oculus, launches its iOS app with four conversational voice agents. (TechCrunch)
Google releases Gemma 4 12B, an open-source multimodal model with native audio that runs on a 16GB laptop. (VentureBeat)
AethexAI raises $3M to build voice AI infrastructure for Africa and the Middle East. (TechCrunch)
Aircall acquires Piper AI to add revenue intelligence and sales automation to its voice platform. (WebProNews)
8x8 launches Pulse, a conversational intelligence tool that turns calls and chats into actionable business insights. (BusinessWire)
Google rolls out real-time deepfake voice detection on Android to catch AI scam calls as they happen. (TechBuzz)
Microsoft Edge adds on-device speech recognition and translation APIs powered by local AI models. (Microsoft)
McDonald’s pilots ArchIQ, a voice AI drive-thru that handles 90% of orders without human help. (TheEdAdvocate)
Peak XV eyes a $10M round in Ringg AI as Indian voice agent startups gain momentum. (Economic Times)
Deepgram partners with Fortanix to run voice AI on-premises using NVIDIA confidential computing. (ITNerd)
Americans lost $893M to AI scams last year, with voice cloning attacks leading the surge. (The Independent)
Equity demands Fish Audio remove unauthorized AI clones of performers’ voices from its platform. (Equity)
Sarvam AI opens its multilingual voice agents platform to the public, covering 11 Indian languages. (LetDataScience)
Ubuntu plans to ship AI-powered speech-to-text across all text fields in the OS. (OMG Ubuntu)
ENCO debuts EnSpeak, a real-time voice-to-voice translation system for live venues and classrooms. (RavePubs)
Broadvoice launches GoEngage and AI Analyst, adding speech-to-speech voice AI to its contact center. (SmartCustomerService)
ElevenLabs opens a pop-up store in NYC where every part of the experience is run by a voice agent. (LetDataScience)
RingCentral leads the G2 Summer 2026 AI VoIP category with 137 product badges. (RingCentral)
In2ition AI launches Iris, an always-on AI companion that joins live meetings instead of just transcribing them. (PRNewswire)
Astreya integrates 3CLogic voice AI into its ServiceNow-based IT service desk. (PRNewswire)

Engineering Corner 😎

NVIDIA publishes a fine-tuning guide for Nemotron 3.5 ASR, its 600M-param streaming model covering 40 languages. (Hugging Face)
Higgs Audio v3 is a 4B-param chat-native TTS model supporting 102 languages with zero-shot voice cloning. (LMSYS)
MisoTTS is an 8B emotive TTS model with open weights that claims 110ms latency. (MarkTechPost)
Audio-Interaction is a 3B open-source model that listens nonstop and decides every 0.4 seconds whether to speak. (The Decoder)
pyannote.ai’s Bredin explains how speaker diarization makes voice AI understand conversations, not just transcribe them. (StartupHub)
HackerNoon walks through how to transfer an AI voice agent to a human without losing context. (HackerNoon)
HackerNoon lists the 7 best voice agent testing platforms for 2026. (HackerNoon)
TechStartups breaks down how speech datasets for AI are built, what they contain, and where they fail. (TechStartups)

Voice AI Newsletter
