Voice AI is destined to revolutionize how we communicate at work. Fast-paced innovation in Speech Recognition, Speech Synthesis, Speech Quality and LLMs is accelerating this disruption. The industry is already moving fast, and the pace will only increase over the next 5 years.
Last week was tremendous in this sense.
Top Updates
OpenAI launched 3 Voice AI features
Whisper large-v3 featuring improved performance across languages. Improvements are shown here.
A new TTS API with 6 preset voices. Its pricing appears to be more than 10x lower than the market.
GPT-4 Turbo with 128K context for higher-quality and cheaper Meeting summaries
There is a new English-only Whisper model claimed to be 6x faster. This is great news for on-device transcription!
Zoom reached 1M meeting summaries generated with its AI Companion
ElevenLabs launched Eleven Turbo v2, their fastest model so far, with audio generation times of ~400ms
Prevail integrated Krisp's noise-cancellation AI
Podcastle launched noise cancellation for podcasts called Magic Dust
Descript launched AI-generated podcast notes
Noteworthy
Text-To-Speech market is expected to grow to $17B by 2029
How to leverage Sentiment Analysis and voice data to obtain CX insights
Best Microphones for Zoom, according to the CNET staff who use them
Yum is testing a voice-enabled AI drive-thru system in restaurants to increase productivity and also provide automated upsell recommendations
6 Customer Service trends for 2024: AI chatbots, Omni-channel, Voice-based AI, Automation, AR and Personalization.
Azure AI Services introduced 7 new Text-to-Speech voices
Audio Hijack v4.3 has a new superpower: speech-to-text, powered by Whisper
Xenova announces a new Text-to-Speech-Client Tool: A Robust and Flexible AI Platform for Producing Natural-Sounding Synthetic Speech
Demos
Quick demo of the 6x faster Distil Whisper model
Talk with a LLaMA AI in your terminal. Whisper Medium + LLaMA v2 13B on M2 Ultra.
Experimenting with the magic of open-source! Whisper for text translation, XTTS for audio, and Video-retalker for seamless mouth sync in a short video.