Microsoft Enters the Voice AI Race

Voice AI weekly digest

Davit Baghdasaryan

Apr 06, 2026

Top Updates 💪

Microsoft launches MAI-Transcribe-1 and MAI-Voice-1 - Two new in-house models: a batch transcription model (top 25 languages, 2.5x faster than Azure Fast) and a voice generation model that produces 60s of audio in 1s. Available now in Foundry. (VentureBeat) (Microsoft AI)

Microsoft open-sources VibeVoice - Family of TTS and ASR models under MIT license. TTS handles up to 90 minutes with 4 speakers. ASR transcribes 60-minute audio in a single pass with speaker diarization. Already at 27K GitHub stars. (GitHub)
Alibaba releases Qwen3.5-Omni - Native multimodal model processing text, audio, and video in one pipeline. Speech recognition in 113 languages, generation in 36. Built-in turn-taking recognition that distinguishes backchanneling from real interruptions. Closed source, API only. (MarkTechPost)
Modulate launches Velma Deepfake Detect - Synthetic voice detection API ranked #1 on the HuggingFace Deepfake Speech leaderboard. Claims 578x lower cost than the next-best model, making always-on call monitoring viable. (GamesBeat)
CNTXT AI launches Munsit - Arabic voice AI platform combining ASR and TTS across 25+ dialects. Already processing over a million minutes of audio for 250+ government and enterprise orgs in the UAE. (Zawya)
Retell AI makes Wing VC Enterprise Tech 30 - Voice AI agent platform hit $50M ARR and powers 50M+ real-time AI phone calls per month. One of three voice AI companies on the list. (GlobeNewswire)
Speechify launches Windows app with on-device models - Local Whisper-based transcription and neural TTS on Copilot+ PCs and GPUs. No cloud needed. Competing with Wispr Flow and Superwhisper. (TechCrunch)
The hidden cost of agentic AI callers - Some B2B contact centers are seeing 15-20% of inbound volume from AI agents at peak. These callers hold indefinitely, consume resources, and extract operational data. Detection is key. (SymNex)
AudioShake ships real-time audio separation SDK - Source separation for iOS, Android, Windows, Linux. Ranked #1 in Meta’s SAM audio benchmarks. Used by Warner, Universal, Sony, Disney. Now available for edge deployment. (Slator)
AI voice scams surge with 3-second cloning - Scammers are cloning family members’ voices from short social media clips. BBB and FTC have issued warnings. AI-generated voice fraud up 1,200% in 2025. (MoneyControl)
MiraVoice raises $6.3M - AI voice agent for long-form phone surveys (120+ questions, 40+ min). Seed round led by Unusual Ventures. (Crunchbase)
Gnani.ai raises $10M Series B - India’s leading voice AI platform, 30M+ voice interactions daily in 12+ languages. Also launched Inya VoiceOS, a 5B-parameter voice-to-voice model. (BusinessToday)
Insight Health raises $11M Series A - Voice and chat AI agents for clinical admin: patient screening, referral processing, EHR documentation. Integrated with athenahealth. (MobiHealthNews)
Google Gboard adds Bluetooth mic for voice typing - Finally lets you dictate through connected earbuds instead of the phone mic. Rolling out via server-side update. (Android Authority)

Engineering Corner 😎

Orpheus-FastAPI - Self-hosted TTS server with OpenAI-compatible API. 8 voices, emotion tags, long-form batching. Connects to llama.cpp, LM Studio, GPUStack. Apache 2.0. (GitHub)
MeloTTS - Multi-lingual TTS library by MyShell.ai. English (4 accents), Spanish, French, Chinese, Japanese, Korean. Runs in real time on CPU. MIT license. (GitHub)
Build a voice-enabled AI agent in n8n - Step-by-step tutorial for wiring up voice input/output in n8n workflows. (dev.to)
How to choose the best STT API for voice agents - Comparison of latency, accuracy, and cost tradeoffs across providers. (HackerNoon)
The hidden audio bias in audio-visual speech recognition - Analysis of how AV-ASR models over-rely on audio, undermining the visual modality. (HackerNoon)
Why speech recognition APIs need a different architecture - Smallest AI on designing ASR for real-time voice agent use cases vs batch transcription. (dev.to)
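
For readers curious what "OpenAI-compatible" means in practice for a self-hosted TTS server like Orpheus-FastAPI, here is a minimal sketch. It follows the request shape of OpenAI's speech endpoint (`POST /v1/audio/speech` with `model`, `input`, `voice`, `response_format`). The base URL, port, model name, and voice name below are assumptions for illustration, not values confirmed by the project; check the repo's README before relying on them.

```python
import json
import urllib.request

# Assumption: placeholder address for a locally running server.
BASE_URL = "http://localhost:5005"


def build_speech_request(text, voice="tara", model="orpheus", base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style text-to-speech request.

    The "tara" voice and "orpheus" model names are hypothetical defaults.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and write the returned audio bytes to a file."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

With a server running, `synthesize("Hello from the digest")` would write the audio to `speech.wav`. Because the request shape matches OpenAI's, an existing OpenAI client can usually be pointed at the local server by changing only the base URL.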

Voice AI Newsletter
