Scaling STT systems | Maxime Gaudin (CTO at Gladia)

In the Future of Voice AI series of interviews, I ask three questions to my guests:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guest is Maxime Gaudin, CTO at Gladia.

Former Co-Founder & CTO at Matcha and CTO at MadKudu through its private equity acquisition. Among the first employees at Malt, where he helped scale the company from 6 to 250 people and over €12M in monthly transaction volume. He earned his Master's degree in Computer Science from INSA Lyon and Polytechnique Montréal, Canada. Throughout his career, he has built and scaled products across B2B SaaS, data intelligence, and speech AI, from early-stage founding to leading engineering organizations through hypergrowth and acquisitions.

Gladia was founded in 2022 by Jean-Louis Queguiner and Jonathan Soto with a mission to help companies leverage cutting-edge AI and retrieve actionable insights from audio data. Its API supports advanced speech recognition in over 100 languages, with high accuracy and both asynchronous and real-time transcription.

Listen on YouTube

Recap Video


Takeaways

  • Winning isn’t just about model quality; it’s about surviving brutal tradeoffs between latency, cost, and scale.

  • The real challenge is not training one great model; it is running it cheaply enough to meet market pricing without breaking performance.

  • STT is getting commoditized so fast that providers have to chase better accuracy while selling at margins that keep shrinking.

  • Big models don’t matter if they are too expensive to run at scale.

  • Real-time voice AI lives or dies under a hard latency budget, and staying under 300 milliseconds leaves little room for mistakes.

  • The industry obsession with one model that does everything may be the wrong path if smaller specialist models can outperform it in the moments that matter.

  • Every model upgrade is risky because improving one language or task can make another one worse.

  • Testing speech systems is harder than people admit because teams know something broke, but don’t know what.

  • General transcription errors can be patched by an LLM, but once a name, phone number, email, or address is lost, it is gone.

  • The next edge in voice AI may come from tiny models trained for high-value details like PII, not from one giant model trying to handle everything.

  • Email addresses sound simple until real accents, pauses, corrections, and spelling cues expose how messy spoken language really is.

  • The companies that win enterprise voice AI will be the ones that orchestrate many narrow models well, not the ones chasing a single universal model.

  • Infrastructure strategy is becoming a product decision because legal rules, traffic spikes, and customer use cases all change what “best” deployment looks like.

  • Cloud autoscaling breaks under sudden real-time traffic spikes, such as surges of emergency calls.

  • Paying for managed infrastructure and a large DevOps team at the same time wastes money.

  • Customers want one vendor for everything, even if quality drops.

  • The market will reward depth over breadth if a vendor can become truly exceptional in one painful, business-critical part of the voice stack.
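Several of these takeaways point at the same pattern: a fast general model handles the bulk of transcription, while tiny specialist models are invoked only for high-value spans like emails and phone numbers that an LLM cannot recover once lost. Below is a minimal, hypothetical sketch of that orchestration idea. It is not Gladia's implementation; the model functions are stand-ins, and real systems would re-decode audio segments rather than post-process text.

```python
# Illustrative sketch (NOT Gladia's implementation): run a fast general STT
# pass, then route only low-confidence, high-value spans (emails, phone
# numbers) through narrow specialist models. All models here are stubs.

from dataclasses import dataclass

@dataclass
class Span:
    text: str
    kind: str          # "general", "email", or "phone"
    confidence: float  # general model's confidence for this span

def general_stt(audio: bytes) -> list[Span]:
    """Stand-in for a fast general model that also tags entity spans."""
    return [
        Span("please reach me at", "general", 0.97),
        Span("j dot doe at example dot com", "email", 0.62),
    ]

SPECIALISTS = {
    # Hypothetical specialists; real ones would re-decode the audio segment.
    "email": lambda t: t.replace(" dot ", ".").replace(" at ", "@"),
    "phone": lambda t: "".join(c for c in t if c.isdigit()),
}

def orchestrate(audio: bytes, threshold: float = 0.8) -> str:
    """Keep confident general output; re-run flagged spans through specialists."""
    parts = []
    for span in general_stt(audio):
        if span.kind in SPECIALISTS and span.confidence < threshold:
            parts.append(SPECIALISTS[span.kind](span.text))
        else:
            parts.append(span.text)
    return " ".join(parts)

print(orchestrate(b""))  # -> "please reach me at j.doe@example.com"
```

The design choice this illustrates is the one the interview argues for: orchestration of many narrow models, where the expensive specialist only pays its cost on the few spans where a mistake is unrecoverable.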
