In Voice AI, latency is #1 priority | Trevor Back (Chief Product Officer at Speechmatics, ex-DeepMind)

In The Future of Voice AI series of interviews, I ask three questions to my guests:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guest is Trevor Back, Chief Product Officer at Speechmatics.

Trevor Back joined Speechmatics in 2023, bringing over a decade of experience in machine learning and AI (ex-DeepMind). During his tenure, he worked across numerous industries including consumer, wearables, automotive, healthcare, deep tech, and science, launching applications and services that are still in use in well-known products such as YouTube. Back was also instrumental in the commercialization of AlphaFold, which led to the formation of the DeepMind spin-off Isomorphic Labs.

Speechmatics is a leading expert in speech technology, leveraging the latest AI and machine learning breakthroughs to transcribe and understand human speech in real-time and recorded media. Their technology supports over 50 languages and can translate 69 language pairs, ensuring high accuracy across diverse demographics, accents, and dialects. Speechmatics uses self-supervised learning to train its models on vast amounts of unlabeled data, enabling exceptional performance and inclusivity. Recognized for innovation, Speechmatics partners with global brands like Ubisoft and AI Media and is dedicated to making technology accessible to everyone, aligning with their mission to "Understand Every Voice."

Recap Video

Takeaways

  • Speechmatics sees itself as an applied research lab. It focuses on different languages, accents, and other characteristics important for STT performance.

  • They leverage transformers and are focused on real-time models.

  • They recently launched Flow, a horizontal API that lets any company build voice interactions into its product without worrying about the whole stack.

  • Speech-to-speech models are not yet ready for large-scale deployment across diverse languages and speech nuances.

  • They are exploring building their own TTS to offer on-premise for security-conscious customers.

  • A downside of using a single model for many languages is that accuracy decreases as more languages are added.

  • Speechmatics achieves high accuracy for over 50 languages while requiring 100 times less labeled data than competitors.

  • This ensures better recognition of accents, dialects, and localized speech, improving representation and accuracy.

  • On-device AI brings huge benefits like more privacy and less data usage.

  • Scaling to more languages ensures that no one is left behind as AI technology advances—speech has got to be a core part of any future AGI stack.

  • Speechmatics built a model similar to OpenAI's voice mode, and Trevor demonstrated it during the podcast. Super impressive.

  • Real-time speech-to-speech translation is on their roadmap, building out more and more language pairs.
