Voice AI Newsletter

Rime Raises $24M, Meta Patents Voice Emotion Tracking

Davit Baghdasaryan — Mon, 20 Jul 2026 14:02:26 GMT

Events

Deepgram hosts a fireside chat on the future of voice agents, evaluating agents beyond WER and whether cascaded STT-LLM-TTS still holds up. (Jul 23, London, Voice AI Space)
AssemblyAI demos Universal-3.5 Pro Realtime, its new model with context carryover and conversation memory, then opens up for a fireside chat with production voice agent builders. (Jul 23, SF, Luma)

Top Updates 💪

Alibaba launches Qwen Audio 3.0 with real-time voice that can proactively use external tools, covering 113 languages for ASR and 36 for TTS. (KuCoin)
Meta patents an AI wearable that continuously analyzes voice to track the user’s emotional state, raising concerns under the EU AI Act’s emotion-inference ban. (The Next Web)
PwC and OpenAI launch agentic customer service solutions combining PwC’s CX expertise with OpenAI’s multimodal voice and digital agent APIs. (CX Today)

Google Voice adds Gemini AI notes that auto-summarize calls with key points and action items, plus new standalone plans starting at $10/mo. (WebProNews)
Google quietly opted users into AI training on voice queries and uploaded media via a new Search Services History setting, with no opt-in required. (Fox News)
Rime raises $24M Series A to build enterprise speech-to-speech models, powering nearly 100M phone calls monthly for Mayo Clinic, Dialpad, and others. (Rime)

LALAL.AI launches Lynx, a neural network built for speech denoising that is 6x smaller than its flagship model while matching output quality. (Slator)
Sber’s GigaChat adds emotion detection and can process audio up to three hours long with speaker separation, timestamps, and segment summaries. (BusinessNewsThisWeek)
DoorDash, ObserveAI, and AWS scale AI-powered quality evaluation across 19,000 agents, automating nearly 100% of interaction reviews. (PR Newswire)
Samsung adds cloud transcription to its Voice Recorder app, giving users a choice between on-device privacy and cloud-powered accuracy. (WebProNews)
Aina raises $5.5M to build hardware that controls AI agents rather than just recording, with its first product Dune already shipping to early adopters. (TechCrunch)
Chen Institute and Science honor neuroscientist Sergey Stavisky for an AI speech neuroprosthesis that decodes brain activity into spoken words at 97.5% accuracy. (PR Newswire)
New research finds that AI voice phishing works because of persuasive scripting, not vocal realism, as 70% of targets detect the synthetic voice but comply anyway. (Help Net Security)
Instadesk, Huawei, and iFlytek open a joint AI customer experience lab in Uzbekistan, combining multilingual ASR/TTS with Ascend cloud infrastructure. (Manila Times)
VoicePing 3.0 launches with real-time translation, AI dubbing, new ASR and MT models, and MCP/API access for enterprise multilingual workflows. (PR Newswire)
A study flags five risks in clinical AI scribes: inconsistent consent, weak performance on accented speech, background noise, missing human review, and unclear accountability. (ResultSense)
Telcos are sitting on a voice AI opportunity bigger than their own cost savings, argues an analysis that says the real play is selling voice infrastructure to enterprises. (Sebastian Barros)

Engineering Corner 😎

Apple’s SpeechAnalyzer API outperforms Whisper Small in English benchmarks, running fully on-device on iOS 26 and macOS Tahoe. (Gigazine)
Cohere releases Transcribe Arabic, an open-source 2B-param ASR model that beats Meta’s 7B model on Arabic with multidialect and code-switching support. (Cohere)
Llamafile 0.10.4 ships transcribefile, a portable speech-to-text CLI built on transcribe.cpp that supports 16+ model families with GPU acceleration. (Phoronix)
A prototype tongue-reading system uses ultrasound and ML to decode silent speech from tongue movements, enabling voice input without making a sound. (Hackaday)
Adafruit demos VAD, STT, and TTS all running on a single RP2040 microcontroller, bringing a complete voice AI pipeline to a $4 chip. (Adafruit)
ReSpeaker Clip is an open-source wearable AI recorder with dual mics, BLE 5.3, Wi-Fi 6, and full SDK for building custom voice AI applications. (Seeed Studio)
Resemble AI explains how neural audio watermarking embeds inaudible signals during voice generation for traceability, ahead of the EU AI Act’s August 2 deadline. (Resemble AI)
A Python tutorial walks through building a real-time AI phone agent that quotes prices using tool calling and low-latency voice synthesis. (Low Latency Club)
Voice AI benchmarks hide a gap: systems scoring 600-800ms in the lab hit 2-4 seconds on real telephony, and accuracy drops from 51% to 26-38% under realistic audio. (Embedded Computing)

GPT-Live Goes Full-Duplex, Taco Bell expands voice AI

Davit Baghdasaryan — Mon, 13 Jul 2026 14:01:48 GMT

Events

DeepLearning.AI Voice AI Hackathon: In-person hackathon with Sabre + Vocal Bridge. Teams build voice agents that book real trips end-to-end, judged by Andrew Ng. (Jul 18, Mountain View, Luma)
Voice Agent Evalathon: Okareo x Telnyx virtual challenge to red-team a voice agent’s reasoning, execution, and stability. (Jul 15, Virtual, Meetup)

Top Updates 💪

OpenAI launches GPT-Live, a full-duplex voice model that listens and speaks simultaneously, powering a more natural ChatGPT Voice experience. (OpenAI)
xAI releases 21 multilingual flagship voices for Grok, expanding its voice lineup across multiple languages. (xAI)
Gradium, a Paris-based Kyutai spinout, raises a $100M seed backed by NVIDIA, one of the largest seed rounds ever for ultra-low-latency voice AI. (TechCrunch)
Rylo AI, rebranded from Nagish, raises $85M to scale its real-time captioning platform for deaf and hard-of-hearing users. (AlleyWatch)
Five9 launches next-gen Voice AI Agents built on a purpose-built agentic architecture. (Insider Monkey)
Taco Bell expands voice AI to 890+ drive-thrus across 38 states, powered by Omilia. (FER Magazine)
Omilia launches Lexis, a native generative TTS engine for enterprise CX with sub-45ms latency. (AIthority)
Dell AI Factory partners with Deepgram and Penguin Solutions to deliver enterprise-grade real-time voice AI infrastructure. (Business Insider)
SoundHound’s OASYS platform wins “Agentic AI Company of the Year” as Q1 revenue hits $44M, up 52% year-over-year. (Nasdaq)
CallTower partners with Sestek to add conversational AI, voice biometrics, and real-time translation to its CX portfolio. (Telecom Reseller)
Whispp raises $5M to scale its on-device AI that reconstructs speech for people with voice disorders in real time. (Pulse 2.0)
AI voice agents boosted specialty care enrollment 340% in a peer-reviewed clinical study by RadiantGraph. (PR Newswire)
A $25M deepfake scam at Arup used AI-generated executives on a video call, becoming a landmark case for corporate voice fraud. (FinanceFeeds)
Reality Defender warns that autonomous AI callers can now bypass contact center defenses at scale, posing a new voice fraud threat. (Biometric Update)
Hydaway launches RealityChek, a streaming audio deepfake detector for enterprises that flags synthetic speech in real time. (Biometric Update)

Engineering Corner 😎

OpenAI ships GPT-Realtime-2.1-mini with reasoning and tool use support at 6x lower cost and 25% reduced latency. (MarkTechPost)
AssemblyAI details how Universal-3.5 Pro handles noisy audio, sharing techniques for improving transcription accuracy on hard recordings. (AssemblyAI)
Flock Safety explains its audio detection system that identifies gunshots and crashes in real time using acoustic sensors and AI classification. (Flock Safety)

xAI Ships Voice Agent Builder, Krisp named 2026 Disruptive Technology

Davit Baghdasaryan — Mon, 06 Jul 2026 14:02:55 GMT

Events

Real-Time Observability for Production Voice AI: Regal walks through its Observability Dashboard with real catch-and-fix examples from production voice agents. (Jul 9, SF, Voice AI Space)
AI Tinkerers San Francisco, July GTM Engineering Track w/ Attio: builder-only, no-pitch meetup with live GTM engineering demos, a solid room for voice agent developers. (Jul 8, SF, AI Tinkerers)

Top Updates 💪

xAI launches Voice Agent Builder, a no-code platform for production voice agents with telephony and guardrails at $0.05/min. (xAI)

Krisp named 2026 Disruptive Technology of the Year by CMP Research for its voice AI infrastructure. (Krisp Blog)
ElevenLabs explores a $22B tender offer, doubling its valuation from the $11B Series D five months ago. (Tech in Asia)
Pocket raises $11M from Accel and YC for its $129 AI note-taking puck that has shipped 130K units. (TechCrunch)
Lucida AI raises €6.1M seed for its speech-to-speech language coaching platform, now at 3M users. (EU-Startups)
US senators revive the AI Labeling Act, a bipartisan bill requiring AI-generated audio and video to carry disclosure labels. (Music Business Worldwide)
Retell AI launches Conductor, a graph-native review interface with an AI copilot for production voice agents. (AIthority)
Syntiant and Vibe partner to bring voice-enabled AI to smart workspace hardware using edge AI chips. (GlobeNewsWire)
HealthLynked launches an AI healthcare platform with 24/7 scheduling and medical office voice agents. (GlobeNewsWire)
RevComm launches MiiTel for Retail, extending its voice AI analytics to in-store customer conversations. (IBTimes JP)
Patient trust is the biggest barrier to healthcare voice AI, not the technology, argues a Forbes analysis. (Forbes)
Voices similar to our own are more persuasive, finds new research raising concerns about companies weaponizing stored voice data. (Nautilus)
Synthflow deployed voice AI in one day for Nellis Auction, now handling 80% of each customer interaction automatically. (CXM World)

Engineering Corner 😎

Mozilla AI releases transcribe-cpp, an open-source C/C++ STT library with ggml runtime and GPU acceleration. (StartupHub)

ViiTorVoice-NAR goes open source with word-level TTS editing that swaps individual words without regenerating surrounding audio. (TechTimes)
Higgs TTS 2 3B from BosonAI is a 5.8B-param TTS model trained on 10M+ hours with zero-shot voice cloning. (HackerNoon)
Vowen 0.4.8 released, a free offline voice productivity app using Whisper-based local transcription. (Warp2Search)
WhisTam is a Whisper-based framework for Tamil dialect speech recognition, ranking 2nd at DravidianLangTech@ACL 2026. (ACL Anthology)

$200M Pours Into Voice AI, OpenAI Bidi-1 Leaks

Davit Baghdasaryan — Mon, 29 Jun 2026 14:03:12 GMT

Events

AI Engineer World’s Fair is a flagship AI engineering conference with a dedicated Voice & Realtime AI miniconference this year. (Jun 29-Jul 2, SF | AI Engineer)
Low Latency Lounge by Deepgram is an invite-only evening for engineers building the fastest AI in the stack, cohosted by Together AI and Runware. (Jun 30, SF | LUMA)
Real-Time Voice AI × Device Builders Meetup, “Give Voice to Robots!”, runs alongside IVS Kyoto. (Jul 2, Kyoto | Voice AI Space)

Top Updates 💪

AssemblyAI launches Universal-3.5 Pro Realtime, the first streaming STT model that takes the agent’s question as input. (AssemblyAI Blog)
Five9 launches Voice AI Agents and AI Agent Studio at CCW, bringing agentic CX to enterprise contact centers. (CX Today)
Krisp launches Voice Security, bringing deepfake and fraud detection to contact centers. (CX Today)
CallMiner launches real-time AI guidance that lets contact center agents initiate AI assistance on demand with human-in-the-loop controls. (BusinessWire)
Assort Health raises $120M Series C led by Menlo Ventures at a $1.2B valuation to scale its voice AI agent platform across healthcare. (Fierce Healthcare)
Prosper AI raises $30M Series A led by a16z to scale its autonomous patient journey platform, reporting 5x revenue growth in six months. (HackerNoon)
Coval raises $28M Series A led by Norwest to advance its voice AI evaluation and testing platform, founded by an ex-Waymo engineer. (Pulse 2.0)
Kotoba Technologies raises $10M seed led by Kindred Ventures for its real-time East Asian voice translation platform with sub-2s latency. (VentureBeat)
Valence AI raises $5M seed and secures US patents on real-time emotional detection from live speech. (PR Newswire)
TELUS Digital partners with ElevenLabs as a preferred implementation partner to scale voice AI alongside frontline customer care teams. (PR Newswire)
OpenAI’s GPT-Bidi-1 leaks as a full-duplex voice model that can listen and speak simultaneously, enabling true bidirectional conversation. (Crypto Briefing)
Conduent unveils a next-gen CX platform with real-time translation across 90+ languages to accelerate agent performance. (Conduent)
Speechify brings free voice typing to all iPhone and Mac users, adding AI-powered dictation across every app. (9to5Mac)
Modulate launches an AI music detection API with 95% precision across 76 genres to help platforms verify AI-generated music. (Morningstar)
ByteDance releases Seed Audio 1.0, a unified model that generates speech, music, and ambient sound from a single architecture. (CityBuzz)
Amazon launches Alexa Plus Hindi beta in India, targeting 600M+ Hindi speakers with its upgraded AI assistant. (The Next Web)
ElevenLabs adopts Google’s SynthID watermarking to tag all AI-generated speech, making synthetic voices easier to detect. (Digital Trends)
Shure says audio quality is now the critical bottleneck for AI-powered meetings, and microphone clarity drives everything. (InAVate)
Attention Labs launches SAA, a selective auditory attention layer that lets voice AI detect when it is being directly addressed. (Dispatch)
Deepgram and Fortanix partner to run voice AI on-premises with NVIDIA confidential computing, keeping audio data encrypted during processing. (RadioInfo)

Engineering Corner 😎

Gradium releases STT-Translate and S2S-Translate, real-time speech translation models that beat GPT Realtime Translate on accuracy and latency. (MarkTechPost)

AWS publishes a full tutorial on building a healthcare appointment agent with Amazon Nova 2 Sonic and Bedrock AgentCore. (AWS Blog)
AssemblyAI shares four techniques for prompting Claude to build production-ready voice agents in about 30 seconds. (AssemblyAI Blog)
Deepgram discusses voice AI infrastructure and the path to production-grade agents on the Telecom Reseller podcast. (Telecom Reseller)
ACL 2026 publishes 10 voice AI papers covering noise-robust ASR, accented speech recognition, environment-aware TTS, controllable speech synthesis, multi-speaker diarization, and multilingual translation. (ACL Anthology)

Soniox launches v5, Bland raises $50M, Mistral ships Voxtral Transcribe 2 and more

Davit Baghdasaryan — Mon, 22 Jun 2026 14:02:55 GMT

Events

Voice AI Meetup Madrid is a small gathering hosted by Deepgram, AWS, and Pipecat for founders and engineers building with voice AI in Spain. (Jun 23, Madrid | Pipecat)
Boba-thon is a hands-on AI build night by AI Valley × Workato where teams form, prototype AI workflows, and demo by end of night. (Jun 25, San Francisco | Voice AI Space)
UK & Ireland Speech Workshop brings together speech science researchers and industry builders around advances in healthcare speech tech. (Jun 22-24, London | UKIS2026)

Top Updates 💪

Bland raises $50M Series C led by Dell Technologies, bringing its total funding past $100M. (Fortune)
Soniox launches v5 Real-Time and Async, a speech model that turns live conversations into structured, speaker-aware intelligence. (Soniox)

Google launches a $99 Gemini-powered Home Speaker, its first standalone smart speaker since the Nest Audio in 2020. (TechCrunch)
Plaud crosses $100M ARR in two years, making it the fastest hardware-led AI company to hit that milestone. (ITBrief)
Respond.io raises $62.5M Series B to expand its AI-powered customer messaging platform into North America and Europe. (MarTech Series)
Poland invests $11M in ElevenLabs and launches AI Lab Poland to grow its national AI ecosystem. (Mezha)
Mistral ships Voxtral Transcribe 2, an open-source on-device ASR model with batch transcription at $0.003 per minute. (Mistral)
Gnani AI launches Prisma v2.5, ranking first in 8 of 9 Indian language ASR benchmarks against Sarvam and ElevenLabs. (MediaNama)
Tencent Cloud and Inworld AI partner to integrate sub-130ms TTS into Tencent’s real-time communication infrastructure. (PR Newswire Asia)
Tencent Cloud and Soniox partner to bring multilingual speech-to-text across 200+ countries via Tencent RTC. (FutureCIO)
DeepL acquires Mixhalo’s ultra-low-latency audio team and technology to scale its real-time voice translation product. (PR Newswire)
TELUS Digital and Cresta partner to deliver AI agents alongside human agents in enterprise contact centers. (PR Newswire)
Parloa becomes the first agentic AI provider on Alvaria’s outbound platform, targeting regulated industries. (PR Newswire)
LiveKit Inference now defaults to zero data retention, meaning prompts and audio are never stored by any model provider. (LiveKit)
AI fraud cost $442B globally in 2025 as voice clones now fool even experts, per an INTERPOL report. (TechTimes)
UC study finds vocal similarity alone drives persuasion, with listeners complying more when a speaker’s voice matches theirs. (UC News)
AI voice clones are up to 20% more intelligible than real humans in noisy environments, a new JASA study shows. (PsyPost)
India’s telecom layer needs rebuilding for voice AI to scale, with traditional infrastructure adding 300-500ms of latency. (Inc42)
Multilingual voice AI is India’s next big opportunity, with 600M+ vernacular users driving enterprise demand. (Express Computer)

Engineering Corner 😎

Towards AI tutorial on using Gemini streaming TTS to make voice apps feel instant. (Towards AI)

Dev.to walkthrough of building a voice AI platform with 28 modules in Python. (Dev.to)
CTO field report on testing 184 AI text-to-speech models across quality, latency, and cost. (Dev.to)
Dev.to tutorial on simple text-to-speech in Python using PythonAIBrain. (Dev.to)

Krisp Voice Translation v3, New Siri AI and more

Davit Baghdasaryan — Mon, 15 Jun 2026 14:03:02 GMT

Top Updates 💪

Krisp ships Voice Translation v3 with 96% accuracy in 61 languages and opens a self-serve developer API. (Krisp)
Apple launches Siri AI at WWDC with multi-turn conversations and a standalone app powered by Gemini. (Apple)

Google launches Gemini 3.5 Live Translate for real-time speech translation across 70+ languages. (Google)
Mistral is raising ~€3B at a €20B valuation, nearly doubling since its September round. (TechCrunch)
Equal AI raises $30M Series B to scale India’s voice-first AI assistant across a billion smartphones. (LiveMint)
NICE makes agentic AI the native architecture of its CX platform at NICE World 2026. (CMSWire)
Microsoft launches MAI-Voice-2, a TTS model supporting 10 languages and zero-shot voice cloning. (Blockchain Council)
AI voice scams surged 1,210% in 2025, with attackers needing just 3 seconds of audio to clone a voice. (Fox News)
Google will save search images and audio by default for AI model training. (The Verge)
AI ambient scribes cut physician burnout by 21 percentage points in a Mass General Brigham study. (Medical Daily)
MindBio delivers AI voice kiosks that detect intoxication and fatigue from speech patterns. (StreetWise Reports)
Top Gear asks whether AI voice control in cars is the next big thing or a waste of time. (Top Gear)
WSJ reports the job AI was supposed to kill now needs more humans than ever. (WSJ)
Voicegain hires a VP of Sales to push voice AI into healthcare call centers. (PRWeb)
Speechmatics named HackerNoon’s Company of the Week for speech AI innovation. (HackerNoon)
Voice AI adoption crosses an enterprise threshold in contact centers with measurable ROI. (CX Today)
India positions itself as the world’s CX leader as voice AI reshapes its call center industry. (Express Computer)

Engineering Corner 😎

Kyutai shows how RL post-training improves turn-taking and backchanneling in full-duplex voice models. (Kyutai)

Treble and Hugging Face launch FFASR, the first open benchmark for far-field speech recognition. (Newsfilecorp)
Red Hat publishes a guide to building a local voice agent with OpenShift AI. (Red Hat Developer)
DrivenData announces winners of “On Top of Pasketti,” a children’s speech recognition challenge. (DrivenData)
Dev.to tutorial on extracting conversation intelligence from audio beyond simple dictation. (Dev.to)
Dev.to tutorial on building voice agents that send follow-up emails via Nylas. (Dev.to)
Blog tutorial covers building an ElevenLabs + n8n voice AI sales agent end to end. (whoisalfaz.me)
ParseJargon paper introduces real-time jargon translation for online meetings using LLMs. (arXiv)

Grok Powers Vapi, Gemma 4 Brings Audio to Your Laptop

Davit Baghdasaryan — Mon, 08 Jun 2026 14:02:15 GMT

Top Updates 💪

xAI brings Grok TTS and STT to Vapi, letting developers build voice agents with Grok’s speech models. (xAI)
Sesame launches its iOS app with four conversational voice agents, built by the co-founders of Oculus. (TechCrunch)

Google releases Gemma 4 12B, an open-source multimodal model with native audio that runs on a 16GB laptop. (VentureBeat)
AethexAI raises $3M to build voice AI infrastructure for Africa and the Middle East. (TechCrunch)
Aircall acquires Piper AI to add revenue intelligence and sales automation to its voice platform. (WebProNews)
8x8 launches Pulse, a conversational intelligence tool that turns calls and chats into actionable business insights. (BusinessWire)
Google rolls out real-time deepfake voice detection on Android to catch AI scam calls as they happen. (TechBuzz)
Microsoft Edge adds on-device speech recognition and translation APIs powered by local AI models. (Microsoft)
McDonald’s pilots ArchIQ, a voice AI drive-thru that handles 90% of orders without human help. (TheEdAdvocate)
Peak XV eyes a $10M round in Ringg AI as Indian voice agent startups gain momentum. (Economic Times)
Deepgram partners with Fortanix to run voice AI on-premises using NVIDIA confidential computing. (ITNerd)
Americans lost $893M to AI scams last year, with voice cloning attacks leading the surge. (The Independent)
Equity demands Fish Audio remove unauthorized AI clones of performers’ voices from its platform. (Equity)
Sarvam AI opens its multilingual voice agents platform to the public, covering 11 Indian languages. (LetDataScience)
Ubuntu plans to ship AI-powered speech-to-text across all text fields in the OS. (OMG Ubuntu)
ENCO debuts EnSpeak, a real-time voice-to-voice translation system for live venues and classrooms. (RavePubs)
Broadvoice launches GoEngage and AI Analyst, adding speech-to-speech voice AI to its contact center. (SmartCustomerService)
ElevenLabs opens a pop-up store in NYC where every part of the experience is run by a voice agent. (LetDataScience)
RingCentral leads the G2 Summer 2026 AI VoIP category with 137 product badges. (RingCentral)
In2ition AI launches Iris, an always-on AI companion that joins live meetings instead of just transcribing them. (PR Newswire)
Astreya integrates 3CLogic voice AI into its ServiceNow-based IT service desk. (PR Newswire)

Engineering Corner 😎

NVIDIA publishes a fine-tuning guide for Nemotron 3.5 ASR, its 600M-param streaming model covering 40 languages. (Hugging Face)

Higgs Audio v3 is a 4B-param chat-native TTS model supporting 102 languages with zero-shot voice cloning. (LMSYS)
MisoTTS is an 8B emotive TTS model with open weights that claims 110ms latency. (MarkTechPost)
Audio-Interaction is a 3B open-source model that listens nonstop and decides every 0.4 seconds whether to speak. (The Decoder)
pyannote.ai’s Bredin explains how speaker diarization makes voice AI understand conversations, not just transcribe them. (StartupHub)
HackerNoon walks through how to transfer an AI voice agent to a human without losing context. (HackerNoon)
HackerNoon lists the 7 best voice agent testing platforms for 2026. (HackerNoon)
TechStartups breaks down how speech datasets for AI are built, what they contain, and where they fail. (TechStartups)

Anthropic's Trillion-Dollar Moment

Davit Baghdasaryan — Mon, 01 Jun 2026 13:45:58 GMT

Top Updates 💪

Anthropic closes a Series H near a $965B valuation, landing alongside its Claude Opus 4.8 launch. (TechCrunch)
Parloa deploys its $350M war chest into partnerships with SAP, Microsoft, OpenAI, Five9, and Epic. (The Next Web)
Exclusive: Krisp scales its infra deployment paradigm. (AIM Network)
Greenhouse acquires Ezra AI Labs, folding a voice-AI interviewer into its hiring platform. (PR Newswire)
Alibaba updates its speech translation model, tripling language coverage. (Slator)
StepFun ships StepAudio 2.5 Realtime, an end-to-end speech LLM with roleplay RLHF and paralinguistic perception. (MarkTechPost)
COLDI launches a turnkey platform for integrated AI voice agents aimed at lead management. (PR Newswire)
What the Language Solutions and AI Market Should Take Away From Google I/O (Slator)
Palabra.ai crosses $1M ARR, a 17x six-month climb for its real-time speech-to-speech translator. (AiThority)
iFlytek debuts 40g AI glasses with an on-device GlassClaw agent and live translation in 122 languages. (Longbridge)
iFlytek unveils AI Recorder S6 with long-range voice recording and smart summaries. (FinancialContent)
An ElevenLabs-linked deal licenses Stan Lee’s voice and likeness for AI-narrated audiobooks and comics. (Kotaku)
What Apple’s New AI Glasses Mean for the Future of Wearables. (Geeky Gadgets)
A new study shows inaudible audio commands can hijack AI voice models unheard by humans. (Decrypt)
AI Studios launches context-aware expressive TTS with 1,000+ AI voices. (Business Insider)
What healthcare organizations need to get right about AI transcription. (National Law Review)

Engineering Corner 😎

OmniVoice Studio ships as a local, open-source ElevenLabs alternative with cloning, dubbing, diarization, and an MCP server. (MarkTechPost)
A field guide to production voice agents tackles sub-300ms latency with LiveKit and WebRTC. (Dev.to)
A walkthrough adds Gemma 4 speech recognition to a .NET desktop app via a llama-server sidecar. (Dev.to)
Vaani pairs speech recognition with Indian Sign Language on Android using MediaPipe. (Dev.to)
FlowSpeech offers context-aware TTS with controllable emotion, pacing, and pauses across 30+ voices. (flowspeech.io)
Vowen runs fully offline STT on Windows and macOS, free and privacy-first. (MajorGeeks)

Google I/O Goes Voice-First, Corti Beats OpenAI on Medical STT

Davit Baghdasaryan — Mon, 25 May 2026 14:03:05 GMT

Top Updates 💪

Google adds voice to Gmail, Docs, and Keep, letting users search their inbox and dictate by speaking instead of typing. (TechCrunch)

Google unveils audio-powered smart glasses at I/O 2026, taking on Meta in the wearable AI race. (TechCrunch)
Spotify launches an ElevenLabs-powered tool that lets authors create audiobooks from text. (TechCrunch)
Corti’s Symphony model outperforms OpenAI’s Whisper on medical terminology accuracy for speech-to-text. (VentureBeat)
Zoom opens its AI Translator and Summarizer as standalone APIs for third-party developers. (Zoom | Slator)
Twilio shares surged 60% as voice AI adoption accelerates across its communications platform. (Sebastian Barros)
Zendesk expands its AI agents across ChatGPT, Gemini, voice and messaging channels. (TechRadar)
Kardome ships its voice AI in LG OLED TVs, reaching mass-market consumers for the first time. (AudioXpress)
Amazon’s Alexa can now generate full podcast episodes on any topic you ask for. (TechBuzz)
Alibaba releases Qwen3.5 LiveTranslate Flash, a real-time interpreter covering 60 languages at 2.8-second latency. (MarkTechPost)
NTSB shuts down its public docket after people used AI to recreate dead pilots’ voices from spectrograms. (Engadget)
Columbia researchers pass the first human trial of a brain-controlled hearing system that isolates one speaker in noise. (Medscape)
iProov launches a deepfake detection system designed specifically for enterprise video calls. (FinanceFeeds)
Halsa Global launches Voice IQ, a Salesforce-native conversational AI for enterprise sales. (Newswire)
Korean tech firms double down on voice AI with localized models and in-car assistants. (Korea Times)
Tamber launches its AI music creation platform after raising $5M from Adobe Ventures. (Music Business Worldwide)
TalkSign launches Palm 1.0 and Echo 1.0, AI models for sign language recognition and generation. (TechCabal)
CMU research shows adding audio cues like typing sounds makes AI feel more human but also more rude. (TechXplore)
Office workers shift from typing to voice dictation as AI transcription apps go mainstream. (The Week)
Synthflow AI handles over 5 million calls a month as call centres move to voice AI at scale. (Tech.eu)

Engineering Corner 😎

AWS publishes a guide to building real-time voice apps with SageMaker AI and vLLM using bidirectional streaming. (AWS Blog)

VoiceBox is an open-source voice cloning app that runs locally from 3 seconds of audio with no cloud uploads. (TechTimes)
Vowen is a free offline voice dictation tool for Windows and macOS that transcribes speech system-wide. (MajorGeeks)
NoteSnip turns video transcripts into source-grounded AI study notes across YouTube, podcasts and PDFs. (Dev.to)
IEEE Spectrum covers how Maori researchers are building indigenous AI voice models to preserve te reo Maori. (IEEE Spectrum)
Memeburn ranks the best AI voice generators of 2026 by use case, from cloning to e-learning. (Memeburn)

Customer Service Hiring Is Surging. So Is Voice AI

Davit Baghdasaryan — Mon, 18 May 2026 14:03:30 GMT

Customer service job postings are up ~8% YoY. More voice AI doesn’t mean fewer human agents; it means more conversations.

Top Updates 💪

Vapi raises $50M for its voice AI agent platform, now valued at $500M. (TechCrunch)

Thinking Machines previews voice+video models that can listen and talk at the same time. (VentureBeat)
Wispr seeks $260M at a $2B valuation for its voice dictation app. (Bloomberg)
OpenAI acquires Weights.gg, a voice cloning startup, and folds the team internally. (ITVoice)
Medicare will reimburse AI voice agents that manage chronic care patients. (WebProNews)
Better.com’s voice agent handles 35% of mortgage calls without human involvement. (PYMNTS)
Bajaj Finance replaces 1,500 calling agents with 10 AI voice bots. (TechStory)
Rivian rolls out a voice assistant across its R1 and R2 vehicles. (InsideEVs)
Quiq adds voice AI to its platform and rebrands for enterprise scale. (CSM Magazine)
ElevenLabs signs McConaughey, Caine, and Minnelli for AI voice partnerships. (Deadline)
Activate invests in ElevenLabs to help grow its India business. (BusinessToday)
RingCentral named Leader by IDC, Omdia, and Metrigy for customer engagement. (RingCentral Blog)
Smallest AI runs its TTS on Tenstorrent chips at 4x lower cost. (The Wire)
MindBio detects intoxication from voice alone using AI speech analysis. (Business Insider)

Engineering Corner 😎

AWS adds Qwen3 speech models to SageMaker JumpStart for TTS and ASR. (AWS)
Foundry Local v1.1 adds live speech-to-text that runs entirely on-device. (Microsoft DevBlogs)

Supertone open-sources Supertonic v3, an on-device TTS supporting 31 languages. (MarkTechPost)
Coval publishes open TTS benchmarks comparing speed and accuracy across major providers. (Coval)
OpenMOSS gets a C++ port for easy local deployment without Python. (StartupFortune)
ThirdReality ships a $70 open-source voice assistant for Home Assistant. (PRWeb)
Monologue adds CLI and MCP support for piping voice dictation into AI agents. (MacStories)

Updates from Krisp, OpenAI, ServiceNow and much more!

Davit Baghdasaryan — Mon, 11 May 2026 14:00:36 GMT

Top Updates 💪

Krisp launches VIVA 2.0 with Turn Prediction v3 and a first-of-its-kind Interrupt Prediction model, all running on CPU with no transcription required. (Krisp Blog)
OpenAI launches three real-time audio models for its API: GPT-Realtime-2 with GPT-5-class reasoning, GPT-Realtime-Translate for live translation across 70+ languages, and GPT-Realtime-Whisper for streaming speech-to-text. (Reuters)

Twilio unveils a Conversation Layer at SIGNAL 2026 with persistent Memory, Orchestrator, Intelligence, and open-source Agent Connect for plugging in any AI provider. (MarTech)
Inworld ships Realtime TTS-2, a frontier voice model that reads user emotion and tone in real time and adapts pacing, softness, and empathy mid-conversation. (BusinessWire)
ServiceNow unveils Otto, a unified conversational AI layer combining Now Assist, Moveworks, and voice agents across every department and system. (The AI Economy)
SoundHound launches OASYS, a self-learning agentic platform that auto-builds, orchestrates, and improves voice AI agents from documentation and transcripts. (GlobeNewsWire)
ElevenLabs adds BlackRock, NVIDIA, and Jamie Foxx to its $550M+ Series D as annualized revenue crosses $500M, up from $350M at the end of 2025. (TechCrunch)
Greenhouse acquires Ezra AI Labs to bring voice AI interviewing into its ATS as applications per recruiter have spiked over 400% since 2023. (PR Newswire)
Ethos raises $22.75M from a16z for an expert network that onboards 35K people per week through voice AI interviews. (TechCrunch)
8x8 launches AI Studio in early availability, letting teams describe needs in plain language and deploy voice and digital AI agents without adding vendors. (CMSWire)
Wispr Flow bets on India as its fastest-growing market with Hinglish dictation support, 2.5M downloads, and 100% month-over-month growth. (TechCrunch)
ElevenLabs powers SpoonLabs’ audio novels, cutting production time from months to hours and launching PodNovel across Korea, Japan, and Taiwan. (DigitalToday)
eGain launches AI Agent IVA, a knowledge-powered virtual agent that replaces IVR dial trees with natural conversation and 24/7 voice support. (GlobeNewsWire)
Gnani.ai hires eight senior execs after its $10M Series B, processing over 30M voice AI calls daily for 200+ enterprise customers in India. (BusinessToday)
Vobiz.ai raises $1M seed to build AI-native telephony infrastructure in India with DID provisioning, low-latency SIP trunking, and LLM audio streaming. (Tech in Asia)
Twinnin targets $3M seed round for its voice and face cloning marketplace where actors license digital likenesses to studios, backed by Google and NVIDIA. (Deadline)
BCM One partners with TD Synnex to bring Pure IP voice services and SkySwitch UCaaS to the MSP channel through the distributor’s partner network. (CRN)
AI note-taking earbuds go mainstream as Viaim and Mobvoi ship wireless earbuds that record, transcribe, and summarize meetings entirely on-device. (How-To Geek)

Engineering Corner 😎

OpenAI publishes its WebRTC infrastructure playbook, detailing a split relay + transceiver architecture that routes voice AI sessions for 900M+ weekly users at 300-500ms latency. (OpenAI Blog)

TypeWhisper open-sources Mac dictation with 10 ASR engines including WhisperKit, Parakeet, Apple SpeechAnalyzer, Groq, and xAI Grok STT, all running locally. (GitHub)
Dictee ships offline voice dictation for Linux as a KDE Plasma 6 plasmoid with Rust backend, 4 ASR engines, and NVIDIA Parakeet via ONNX Runtime. (GitHub)
TTS models for Indian languages: a dev survey covering Hindi, Tamil, Bengali, and Telugu with architecture comparisons and demo links. (dev.to)
Build a voice agent with LiveKit + AssemblyAI using Universal-3 Pro Streaming STT with function calling and MCP integration. (dev.to)

Voice Agents Go Mainstream

Davit Baghdasaryan — Mon, 04 May 2026 13:17:51 GMT

Three important Voice AI events this week:

Twilio Signal - May 6-7 in SF
Cerebral Valley Voice Summit - May 6 in SF
NVIDIA Developer Meetup | Building and Evaluating Real-time Voice Agents - May 7 in SF

Top Updates 💪

xAI launches Custom Voices, a voice cloning API that creates a voice ID from 120 seconds of audio with speaker verification, plus 80+ built-in voices across 28 languages. (VentureBeat)
Microsoft ships real-time voice agents in Copilot Studio, now GA in Dynamics 365 Contact Center with low-latency speech-to-speech, interruptions, and mid-call language switching. (Microsoft Blog)
Amazon adds “Join the Chat” to product pages, letting shoppers ask voice or text questions during AI audio summaries and get real-time conversational answers. (TechCrunch)
Otter.ai pivots from notetaker to Conversational Knowledge Engine, launching MCP connectors, AI Chat, and desktop app to turn meeting data into agentic workflows. (BusinessWire)
Deepgram launches Flux Multilingual with 10 languages and mid-call language switching, plus model-based turn detection under 400ms. (SiliconANGLE)
Twilio Q1 voice revenue hits a 19-quarter high, up 20% YoY with Conversational Intelligence and Branded Calling both growing over 100%. (The Next Web)
NordVPN adds AI voice deepfake detector to its Chrome extension, analyzing acoustic patterns in real time without recording or interpreting content. (BetaNews)
Audion raises $15M to bring AI-powered contextual audio ad targeting to the U.S., processing 500K hours of audio weekly for brands like Apple and Nike. (Axios)
3CLogic launches outbound voice AI agents with multimodal voice+digital capabilities and an automated LLM-powered QA engine for scoring every AI interaction. (PR Newswire)
AI-generated podcasts are booming on Spotify, Apple, and YouTube, with AI hosts that sound convincingly human raising questions about disclosure. (Inc)
Tells launches AI voice agents on existing SMS numbers with a single toggle, adding sub-second-latency voice to any business texting line without a new number or integration. (AIthority)
SpeakON ships a MagSafe AI dictation accessory that turns iPhone voice input into formatted, tone-adapted text with translation across 12 languages. (9to5Mac)
Docplanner’s voice AI agent “Noa Booking” doubles doctor appointment bookings vs traditional call centers, built on Twilio ConversationRelay. (Health Tech Digital)
Lumeris adds native audio to its Tom platform using Gemini’s speech-to-speech capabilities for real-time, empathetic patient conversations in primary care. (HIT Consultant)
Ablio launches AI-powered interpretation with hybrid human+AI model, combining ASR, neural translation, and TTS for live multilingual events on Zoom and Teams. (AIthority)

Engineering Corner 😎

Sakana AI introduces KAME, a tandem speech-to-speech architecture that lets a backend LLM inject knowledge in real time while the front-end keeps talking with near-zero latency. (Sakana AI)

NVIDIA releases Nemotron 3 Nano Omni, an open 30B-A3B multimodal model unifying vision, audio, and language with 9x higher throughput than competing omni models. (NVIDIA Blog)
OpenMOSS releases MOSS-Audio, an open-source foundation model for speech, sound, music understanding, and time-aware audio reasoning in 4B and 8B variants. (MarkTechPost)
Async publishes open TTS benchmark revealing major accuracy gaps when streaming models handle phone numbers, dates, and prices in production. (Podnews)
Speaker diarization explained: how AI knows who said what, from spectral embeddings to clustering. (dev.to)
Laravel AI SDK tutorial: add TTS and voice to your app in 20 minutes. (dev.to)
Hobbyist builds a C-3PO head with real-time voice interaction using off-the-shelf speech models. (Let’s Data Science)

Voice AI's Consolidation Begins

Davit Baghdasaryan — Mon, 27 Apr 2026 14:01:59 GMT

Two important Voice AI events in the coming weeks:

Twilio Signal - May 6-7 in SF
Cerebral Valley Voice Summit - May 6 in SF

Top Updates 💪

xAI launches Grok Voice Think Fast 1.0, ranking #1 on the tau-voice Bench for full-duplex voice agents and already powering Starlink support with a 20% sales conversion rate. (xAI)
Anker unveils THUS, the first compute-in-memory AI audio chip, claiming 150x more on-device AI power for noise cancellation in its upcoming Soundcore earbuds. (The Verge)
SoundHound acquires LivePerson for $43M, combining voice agentic AI with LivePerson’s digital messaging platform that handles one billion customer messages per month. (GlobeNewswire)
Krisp Voice AI SDK wins double Webby Awards for Technical Achievement. (LinkedIn)
Speechmatics delivers on-device STT for Adobe Premiere, transcribing an hour of video in 55 seconds offline with accuracy within 5% of cloud. (TV Technology)
Nothing launches Essential Voice, an AI dictation tool that cleans filler words and formats speech-to-text system-wide in 100+ languages. (TechCrunch)
Synthflow AI and 8x8 partner to embed no-code voice AI agents directly into the 8x8 Contact Center platform across 30+ languages. (VentureBeat)
Google Meet AI note-taking now works for in-person meetings, generating transcripts, summaries, and action items from face-to-face conversations via mobile. (Lifehacker)
Xiaomi releases MiMo v2.5 TTS and open-sources MiMo v2.5 ASR, a full voice pipeline with voice cloning, voice design, and dialect-aware recognition for the agent era. (Gizmochina)
Volkswagen will ship voice AI in all China-built cars starting H2 2026, using on-device LLMs from Tencent, Alibaba, and Baidu. (CNBC)
Newo appoints new CEO after $25M Series A to scale partner-led voice AI infrastructure for MSPs, VoIP providers, and software platforms serving SMBs. (GlobeNewswire)
Ericsson embeds AI calling and fraud detection into IMS, partnering with Hiya for real-time spam blocking as 86% of unknown calls go unanswered. (Ericsson Blog)

Engineering Corner 😎

Streaming TTS models fail over 60% of sentences containing phone numbers, dates, and prices due to 5-20x less context than batch mode. (Technology.org)

AI neck sensor turns silent speech into voice by reading microscopic throat muscle movements with a CNN+transformer pipeline from POSTECH. (Digital Trends)
AWS guide to cost-effective multilingual transcription at scale using NVIDIA Parakeet TDT and AWS Batch. (AWS Blog)
Ghost Pepper: open-source browser extension for real-time voice transcription and LLM-powered responses. (GitHub)
Mimi Codec deep-dive on its layered audio compression design for neural speech coding. (Let’s Data Science)
AssemblyAI showcases configurable STT with tunable turn-taking, medical mode for streaming, and real-time speaker labeling. (TipRanks)

Everyone Wants a Voice Platform

Davit Baghdasaryan — Mon, 20 Apr 2026 14:04:17 GMT

Top Updates 💪

xAI ships standalone Grok STT and TTS APIs with streaming transcription at $0.20/hr and expressive TTS with inline emotion tags across 20 languages. (xAI)
Google launches Gemini 3.1 Flash TTS with 200+ audio tags for fine-grained voice control, multi-speaker dialogue, and SynthID watermarking across 70+ languages. (Google Blog)

Starlink customer support is now Grok-powered, with a voice AI agent handling sales, troubleshooting, and account setup on the phone line. (PCMag)
Cloudflare adds real-time voice to its Agents SDK, enabling voice-enabled agents over WebSockets in ~30 lines of server code on Durable Objects. (Cloudflare Blog)
DeepL launches voice-to-voice translation for meetings with Zoom and Teams add-ons, plus a developer API for custom use cases like call centers. (TechCrunch)
Phonely raises $16M Series A for AI phone agents that drove $10M+ in insurance policy sales for a single customer this year. (Axios)
Krisp launches British English accent conversion, letting offshore agents in India, Philippines, and beyond sound local for UK-facing programs in real time. (CXM World)
interface.ai launches Nexus, a fully agentic CCaaS platform for credit unions that eliminates hold queues by keeping AI in the conversation with human backup. (GlobeNewswire)
ConverseNow partners with Deliverect to pipe voice AI phone and drive-thru orders into unified restaurant order management across thousands of locations. (PR Newswire)
ENCO unveils enSpeak at NAB Show, adding real-time voice translation to its captioning workflow so viewers can hear live broadcasts in their preferred language. (Content + Technology)

Engineering Corner 😎

NVIDIA releases Audio Flamingo Next (AF-Next), an open large audio-language model that understands speech, sound, and music with 30-minute context and timestamp-grounded reasoning. (MarkTechPost)

MOSS-TTS-Nano-100M brings multilingual voice cloning to CPUs with a 100M-param model that streams 48kHz audio in 20 languages. (HackerNoon)
Hands-on VibeVoice tutorial covering speaker-aware ASR, real-time TTS, and speech-to-speech pipelines with code. (MarkTechPost)
Build a real-time voice agent with Pipecat, step-by-step guide to streaming STT/TTS pipelines. (HackerNoon)
Build an AI medical scribe using voice agents for clinical documentation. (HackerNoon)
Diction: self-hosted STT setup guide as an open alternative to Wispr Flow. (dev.to)
Deepgram and Modulate benchmarked against real-world audio conditions. (HackerNoon)

The Week Voice AI Went Local

Davit Baghdasaryan — Mon, 13 Apr 2026 14:03:24 GMT

Top Updates 💪

Krisp brings Accent Conversion to YouTube with a free Chrome extension for 2.7B users. (LinkedIn)
Google quietly ships AI Edge Eloquent, a free offline-first dictation app for iOS running on-device Gemma models with filler removal and no subscription. (TechCrunch)

Mistral launches Voxtral TTS, a 4B open-weights streaming speech model in 9 languages that beats ElevenLabs Flash v2.5 in voice cloning win rates. (Slator)
ByteDance introduces Seeduplex, a native full-duplex speech LLM that listens while speaking and cuts false interruption rates in half vs half-duplex Doubao. (ByteDance Seed)
Willow launches Atlas-1, a new frontier STT model built on human-powered transcription infrastructure that claims to beat ElevenLabs, Deepgram, and OpenAI. (VP-Land)
Telnyx launches LiveKit on Telnyx, a hosted platform running LiveKit agents on Telnyx infrastructure with 50% lower cost and sub-200ms latency. (Telecom Reseller)
Natter raises $23M Series A led by Renegade Partners to replace enterprise surveys with AI-moderated 1:1 video conversations at scale. (VentureBurn)
Twilio Q4 voice AI revenue grew 60% as the company closed its biggest enterprise deal ever and repositioned as AI infrastructure. (CX Today)
Regal AI launches Copilot, a self-improving voice agent builder that learns from call outcomes and flags underperformance automatically. (SiliconANGLE)
Exotel acqui-hires Dubverse core team to lead conversation quality analytics and AI, deepening its voice AI stack for Indian enterprises. (TechCircle)
Californians sue Sutter and MemorialCare over use of Abridge AI scribe that allegedly recorded doctor-patient visits without clear patient consent. (Ars Technica)
Five9 expands Fusion ecosystem with AI Agent Connect API, letting enterprises wire voice AI agents into third-party systems and Assembled WFM. (Yahoo Finance)
Weya AI open-sources Hush, an 8MB speech enhancement model with 1.8M params that isolates the primary speaker in under 1ms per frame, CPU-only. (IndianWeb2)
Shunya Labs launches voice AI platform for dubbing, translation, lip-sync, and low-shot voice cloning for entertainment localization. (Passionate in Marketing)
Beaver AI launches Magic Whiteboard, a privacy-first meeting assistant that transcribes in real time but never records or stores audio. (PRWeb)

Engineering Corner 😎

AWS on Nova Multimodal Embeddings for semantic audio search across tone, emotion, and events, unified with text/image/video in a single vector space. (AWS Blog)

Voxtral TTS surgery: deep-dive into reconstructing codec audio from intermediate model states. (Towards Data Science)
Kokoro 82M TTS runs fully offline on CPU with 8 languages and 26 voices in a ~350MB footprint. (Geeky Gadgets)
docker-whisper: self-hosted Whisper ASR in a container for easy local deployment. (GitHub)
Browser-based STT with Whisper: tutorial on running Whisper inference entirely in the browser. (dev.to)
Lightweight offline TTS for Node.js using a minimal dependency chain. (dev.to)
Designing a real-time voice agent with RAG, SIP, and compliance guardrails. (HackerNoon)
Open-source Amazon Lex connector for Cisco Webex Contact Center for adding virtual agents without a platform rebuild. (AWS APN Blog)

Microsoft Enters the Voice AI Race

Davit Baghdasaryan — Mon, 06 Apr 2026 14:03:48 GMT

Top Updates 💪

Microsoft launches MAI-Transcribe-1 and MAI-Voice-1 - Two new in-house models: a batch transcription model (top 25 languages, 2.5x faster than Azure Fast) and a voice generation model that produces 60s of audio in 1s. Available now in Foundry. (VentureBeat) (Microsoft AI)

Microsoft open-sources VibeVoice - Family of TTS and ASR models under MIT license. TTS handles up to 90 minutes with 4 speakers. ASR transcribes 60-minute audio in a single pass with speaker diarization. Already at 27K GitHub stars. (GitHub)
Alibaba releases Qwen3.5-Omni - Native multimodal model processing text, audio, video in one pipeline. Speech recognition in 113 languages, generation in 36. Built-in turn-taking recognition that distinguishes backchanneling from real interruptions. Closed source, API only. (MarkTechPost)
Modulate launches Velma Deepfake Detect - Synthetic voice detection API ranked #1 on the HuggingFace Deepfake Speech leaderboard. Claims 578x lower cost than the next-best model, making always-on call monitoring viable. (GamesBeat)
CNTXT AI launches Munsit - Arabic voice AI platform combining ASR and TTS across 25+ dialects. Already processing over a million minutes of audio for 250+ government and enterprise orgs in the UAE. (Zawya)
Retell AI makes Wing VC Enterprise Tech 30 - Voice AI agent platform hit $50M ARR and powers 50M+ real-time AI phone calls per month. One of three voice AI companies on the list. (GlobeNewswire)
Speechify launches Windows app with on-device models - Local Whisper-based transcription and neural TTS on Copilot+ PCs and GPUs. No cloud needed. Competing with Wispr Flow and Superwhisper. (TechCrunch)
The hidden cost of agentic AI callers - Some B2B contact centers seeing 15-20% of inbound volume from AI agents at peak. They wait forever, consume resources, and extract operational data. Detection is key. (SymNex)
AudioShake ships real-time audio separation SDK - Source separation for iOS, Android, Windows, Linux. Ranked #1 in Meta’s SAM audio benchmarks. Used by Warner, Universal, Sony, Disney. Now available for edge deployment. (Slator)
AI voice scams surge with 3-second cloning - Scammers cloning family members’ voices from short social media clips. BBB and FTC warnings. AI-generated voice fraud up 1,200% in 2025. (MoneyControl)
MiraVoice raises $6.3M - AI voice agent for long-form phone surveys (120+ questions, 40+ min). Seed round led by Unusual Ventures. (Crunchbase)
Gnani.ai raises $10M Series B - India’s leading voice AI platform, 30M+ voice interactions daily in 12+ languages. Also launched Inya VoiceOS, a 5B-parameter voice-to-voice model. (BusinessToday)
Insight Health raises $11M Series A - Voice and chat AI agents for clinical admin: patient screening, referral processing, EHR documentation. Integrated with athenahealth. (MobiHealthNews)
Google Gboard adds Bluetooth mic for voice typing - Finally lets you dictate through connected earbuds instead of phone mic. Rolling out via server-side update. (Android Authority)

Engineering Corner 😎

Orpheus-FastAPI - Self-hosted TTS server with OpenAI-compatible API. 8 voices, emotion tags, long-form batching. Connects to llama.cpp, LM Studio, GPUStack. Apache 2.0. (GitHub)
MeloTTS - Multi-lingual TTS library by MyShell.ai. English (4 accents), Spanish, French, Chinese, Japanese, Korean. Runs in real time on CPU. MIT license. (GitHub)
Build a voice-enabled AI agent in n8n - Step-by-step tutorial for wiring up voice input/output in n8n workflows. (dev.to)
How to choose the best STT API for voice agents - Comparison of latency, accuracy, and cost tradeoffs across providers. (HackerNoon)
The hidden audio bias in audio-visual speech recognition - Analysis of how AV-ASR models over-rely on audio, undermining the visual modality. (HackerNoon)
Why speech recognition APIs need a different architecture - Smallest AI on designing ASR for real-time voice agent use cases vs batch transcription. (dev.to)

Krisp Is Nominated for 3 Webby Awards

Davit Baghdasaryan — Thu, 02 Apr 2026 13:54:40 GMT

Krisp has been nominated for three 2026 Webby Awards for Technical Achievement, Developer Tools & APIs, and Best Use of AI Voice & Conversational Interface.

The Webby Awards are one of the most recognized honors in digital technology. Getting nominated in three categories, all tied to voice AI, is a meaningful signal of where this space is headed and the work the team has put in to get us here.

The People’s Voice Award is decided solely by public vote.

If you follow this newsletter, you already believe in what we’re building, and we'd love your support.

Click each link below to vote:

Technical Achievement

Developer Tools & APIs

AI Voice & Conversational Interface

Or type “Krisp” into the category search bar and our nominations will surface for one-click voting.

You can cast one vote per category. Voting closes April 16.

Thank you, and more soon.

— Davit

3 New Open-Source Voice Models Drop in One Week

Davit Baghdasaryan — Mon, 30 Mar 2026 14:03:12 GMT

Top Updates 💪

Mistral launches Voxtral TTS - Open-weight 4B TTS model. 9 languages, 90ms TTFA, 6x RTF. Runs on consumer GPUs. Mistral claims it beats ElevenLabs on quality benchmarks. (TechCrunch) (Mistral blog)

Cohere releases Transcribe - Open-source 2B ASR model built for edge. 14 languages, 5.42 avg WER on HF Open ASR leaderboard, beating Zoom Scribe v1, IBM Granite 4.0, ElevenLabs Scribe v2, and Qwen3-ASR. Free via API and HuggingFace. (TechCrunch) (Cohere blog)
Google ships Gemini 3.1 Flash Live + Search Live goes global - Real-time voice/video model with native function calling. 90.8% on ComplexFuncBench Audio (~20% jump over prev gen). Now powers Search Live in 200+ countries with voice and camera input. (Google blog) (TechCrunch)
Smallest AI launches Lightning V3 - 3.89 MOS in conversational evals, claims to beat OpenAI, Cartesia, and ElevenLabs. 15 languages with auto-detection and mid-sentence switching. Voice cloning from 5-15s of audio. (Smallest.ai blog)
Amazon Polly adds Bidirectional Streaming - Stream text to Polly token-by-token as your LLM generates it, get audio back in real time over HTTP/2. 39% faster than batch approach, collapses 27 API calls to 1 on a 970-word passage. GA now. (AWS blog)
AWS adds WebRTC to Bedrock AgentCore - Pipecat voice agents now run on AgentCore Runtime with bidirectional WebSocket and WebRTC. Supports barge-in. Ready-to-deploy examples with Pipecat, Nova Sonic, LiveKit, and Strands SDK. (AWS blog)
Genesys reports record Q4 - Genesys Cloud at ~$2.6B ARR, 35%+ YoY growth. 70%+ of customers now on AI. AI-powered conversations up 120% YoY. AI is 20% of new ACV, with 10+ deals where AI exceeded half the contract value. (Genesys)
Artificial Analysis updates voice benchmarks - AA-WER v2.0 adds conversational AI, EU Parliament speech, and financial call datasets. ElevenLabs Scribe v2 leads at 2.3% WER. Best value: Mistral Voxtral Small at 3.0% WER / $4 per 1K min. TTS Arena: Inworld TTS-1.5-Max at #1, ELO 1,160. (X post)
AI chatbots handle 60%+ of banking support - BofA Erica: 1.5B+ interactions, 98% resolved without human. Klarna AI: 66% of inquiries, saving $40M/yr. Gartner projects $80B in contact center labor cost cuts in 2026. (TechBullion)
The economics of AI vs human agents - Voice AI now costs ~$0.40/call vs $7-12 for a human agent: 90-95% cost reduction per interaction. Analysis of how this is reshaping contact center staffing. (Medium)
Agentic Voice AI goes mainstream - 1 in 10 customer service interactions projected to be fully automated by agentic voice AI in 2026. 80% of businesses plan to deploy. RingCentral shipped AIR Pro, an agentic voice platform embedded in its comms stack. (Telecom Reseller)
Salesforce Agentforce Contact Center - Native CCaaS unifying voice, digital channels, CRM, and AI agents in one stack. Voice now built into the CRM on Hyperforce. GA since Feb 23. (Cloud Wars)
Otter.ai hits 35M users, $100M ARR - Sam Liang interview. $100M ARR with <200 employees ($500K+ rev/employee). #14 on Forbes 2026 Best Startup Employers. Liang: 2026 is “the year of the voice.” (YouTube)

Engineering Corner 😎

Gladia open-sources WER normalization library - Normalizes transcripts before computing WER to eliminate false penalties from formatting differences (“$50” vs “fifty dollars”). Configurable YAML pipelines for fair cross-engine ASR comparison. (GitHub) (LinkedIn - Gevorg Minasyan)

MacWhisper - Mac-native local transcription using Whisper and Nvidia Parakeet. 300K copies sold. Batch processing, YouTube transcription, auto-recording Zoom/Teams/Webex. All on-device. (Trend Hunter)
Logan Kilpatrick on Gemini 3 Flash - Google DeepMind’s Logan Kilpatrick discusses the latest Gemini model capabilities. (X post)
Google Docs adds Gemini-powered audio proofreading - “Listen to this” reads docs aloud with AI voices. 0.5x-2x playback. Also ships audio summaries: condenses long docs into ~3min podcast-style recaps. Desktop, English only for now. (MakeUseOf)
Rekam AI - All-in-one voice platform: TTS, STT, voice cloning, custom voice creation. 2,000+ voices, 20+ languages. Free unlimited tier for Kokoro models. (Dynamic Business)
Klassifier - AI-powered audio classification tool. (Trend Hunter)
ViciStack on call center AI voice agents - Overview of real-time conversation handling, reduced wait times, and automated workflows in production contact centers. (ViciStack)
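
To make the WER normalization item above concrete, here is a minimal Python sketch of the general idea: normalize both transcripts before scoring so that formatting differences such as “$50” vs “fifty dollars” are not counted as errors. This is not Gladia's library or its API. The `REPLACEMENTS` table and the `normalize` and `wer` helpers are hypothetical names chosen for illustration, and a real pipeline would need far richer rules.

```python
import re

# Hypothetical, tiny rewrite table; a real normalizer would cover
# numbers, dates, currencies, and much more.
REPLACEMENTS = {"$50": "fifty dollars"}


def normalize(text: str) -> list[str]:
    """Lowercase, expand known written forms, strip punctuation, tokenize."""
    text = text.lower()
    for written, spoken in REPLACEMENTS.items():
        text = text.replace(written, spoken)
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, 1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(reference), 1)


reference = "It costs fifty dollars."
hypothesis = "it costs $50"

raw_score = wer(reference.split(), hypothesis.split())
normalized_score = wer(normalize(reference), normalize(hypothesis))
print(raw_score, normalized_score)
```

On this toy pair the raw score penalizes casing, punctuation, and the currency format, while the normalized score treats the two transcripts as identical.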

Scaling STT systems | Maxime Gaudin (CTO at Gladia)

Davit Baghdasaryan — Thu, 26 Mar 2026 13:10:44 GMT

In the Future of Voice AI series of interviews, I ask my guests three questions:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guest is Maxime Gaudin, CTO at Gladia.

Maxime was previously Co-Founder & CTO at Matcha and CTO at MadKudu through its private equity acquisition. He was among the first employees at Malt, where he helped scale the company from 6 to 250 people and over €12M in monthly transaction volume. He earned his Master's degree in Computer Science from INSA Lyon and Polytechnique Montréal, Canada. Throughout his career, he has built and scaled products across B2B SaaS, data intelligence, and speech AI, from early-stage founding to leading engineering organizations through hypergrowth and acquisitions.

Gladia was founded in 2022 by Jean-Louis Queguiner and Jonathan Soto with a mission to help companies leverage cutting-edge AI and retrieve actionable insights from audio data. Its API supports advanced speech recognition features in over 100 languages, with exceptional accuracy and asynchronous and real-time transcription.

Listen on YouTube

Recap Video

Takeaways

Winning isn’t just about model quality; it is about surviving brutal tradeoffs between latency, cost, and scale.
The real challenge is not training one great model; it is running it cheaply enough to meet market pricing without breaking performance.
STT is getting commoditized so fast that providers have to chase better accuracy while selling at margins that keep shrinking.
Big models don’t matter if they are too expensive to run at scale.
Real-time voice AI lives or dies under a hard latency budget, and staying under 300 milliseconds leaves little room for mistakes.
The industry obsession with one model that does everything may be the wrong path if smaller specialist models can outperform it in the moments that matter.
Every model upgrade is risky because improving one language or task can make another one worse.
Testing speech systems is harder than people admit because teams know something broke, but don’t know what.
General transcription errors can be patched by an LLM, but once a name, phone number, email, or address is lost, it is gone.
The next edge in voice AI may come from tiny models trained for high-value details like PII, not from one giant model trying to handle everything.
Email addresses sound simple until real accents, pauses, corrections, and spelling cues expose how messy spoken language really is.
The companies that win enterprise voice AI will be the ones that orchestrate many narrow models well, not the ones chasing a single universal model.
Infrastructure strategy is becoming a product decision because legal rules, traffic spikes, and customer use cases all change what “best” deployment looks like.
Cloud scaling breaks in real-time spikes, like emergency calls.
Using managed infra and large DevOps teams at once wastes money.
Customers want one vendor for everything, even if quality drops.
The market will reward depth over breadth if a vendor can become truly exceptional in one painful, business-critical part of the voice stack.
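
The latency-budget takeaway above can be illustrated with a small Python sketch. Only the 300 ms figure comes from the takeaways. The stage names and per-stage numbers are hypothetical placeholders, not measurements from Gladia or any other vendor, and whether the budget covers the whole pipeline or a single stage is an assumption here.

```python
BUDGET_MS = 300  # budget figure quoted in the takeaways

# Hypothetical per-stage latencies in milliseconds, for illustration only.
stages = {
    "network": 40,
    "stt": 120,
    "llm_first_token": 90,
    "tts_first_audio": 80,
}


def check_budget(stage_latencies: dict[str, int], budget_ms: int) -> tuple[int, int]:
    """Return (total latency, remaining headroom); negative headroom means over budget."""
    total = sum(stage_latencies.values())
    return total, budget_ms - total


total, headroom = check_budget(stages, BUDGET_MS)
status = "within" if headroom >= 0 else "over"
print(f"total={total} ms, headroom={headroom} ms ({status} budget)")
```

With these placeholder numbers the pipeline lands 30 ms over budget, which is the kind of small overrun that leaves little room for mistakes.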

Scale AI launches real-world voice AI benchmark

Davit Baghdasaryan — Mon, 23 Mar 2026 14:02:58 GMT

Top Updates 💪

Scale AI launches the first real-world voice AI benchmark (VentureBeat)
NVIDIA releases Nemotron 3 VoiceChat, a speech-to-speech model (X)
Krisp launches MCP integration with Claude (LinkedIn)
Amazon Connect voice AI agents now support 13 new languages (AWS)
Modulate launches Velma Transcribe: High-performance transcription for real-world conversations at 90% lower cost (Enterprise News)
Google News could soon give you a convenient new way to consume its audio briefings (Android Authority)
AI notetaking devices that record and transcribe your meetings (TechCrunch)
Krisp has been named a Palomarr Leader across Accent Conversion, Noise Cancellation, Voice Translation (LinkedIn)
Amazon Connect adds new generative TTS voices and expands regions (AWS)
Ringover launches enhanced AI assistant, Ask Empower 2.0 (AIthority)
WhatsApp upgrade — calls will sound completely different (Nokia Power User)
8x8 Engage launches globally for frontline teams (CMSWire)
Itel unveils Zeno AI Weaver voice recorder in India (Gadgets360)
AI voice cloning & synthesis are shaping the future of digital voices (TechTimes)
How businesses are replacing IVR with conversational AI (Social Media Explorer)
Bandicam launches AI feature to transcribe video to text on Mac (MarTech Series)
The mounting cost of voice fraud: revenue loss, broken trust (Retail Dive)
Robinhood’s startup fund invests $35M in Stripe and AI audio firm (The Block)
Ezra raises $3.2M in seed funding (FinSMEs)
WellSaid closes venture debt funding (FinSMEs)

Engineering Corner 😎

VoXtream2: Full-stream TTS with dynamic speaking-rate control (LinkedIn)
Adaptive AI voice layer for real-time communication (Dev)
Utterly: Transcribe speech privately on Apple devices, offline (BetaList)
MiniMax 2.7: GLM-5 at 1/3 cost SOTA open model (Smol AI News)
Best STT APIs to build an AI notetaker in 2026 (Hacker Noon)
PersonaOps: A voice-to-data intelligence system powered by Notion MCP (Dev)
Google AI releases WAXAL: Multilingual African speech dataset (MarkTechPost)
WhisperWeb processes STT directly within the browser (Trend Hunter)
Why building voice AI agents is still so hard (Dev)
OpenVoiceUI: AI voice agent app generates live canvas pages (Dev)
Vietnamese automatic speech recognition (TLDR Takara)
VoiceType AI transcribes, edits, and auto-formats your speech (Trend Hunter)
Speech synthesis API for TTS (Dev)