Real-world problems with STT | Klemen Simonic (Soniox) & Kwindla Kramer (Daily)

In the Future of Voice AI series of interviews, I ask three questions to my guests:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guests are Klemen Simonic, Co-Founder & CEO at Soniox, and Kwindla Hultman Kramer, Co-Founder & CEO at Daily.

Klemen Simonic is the CEO and Co-Founder of Soniox, where he leads the development of advanced voice AI models built for real-world performance. He brings over 16 years of experience across industry and academia, with a deep focus on artificial intelligence. He has worked on cutting-edge AI systems at Facebook, Google, Stanford University, and the University of Ljubljana. Klemen has been developing AI technologies since his undergraduate years, spanning speech, language, and large-scale knowledge systems.

Kwin is CEO and co-founder of Daily, a developer platform for real-time audio, video, and AI. He has been interested in large-scale networked systems and real-time video since his graduate student days at the MIT Media Lab. Before Daily, Kwin helped to found Oblong Industries, which built an operating system for spatial, multi-user, multi-screen, multi-device computing.

Listen on YouTube

Recap Video


Takeaways

  • Voice AI adoption is slow because real-time transcription still breaks on the most basic parts of a customer call.

  • Real growth is happening quietly inside call centers, but teams won’t scale until transcription stops causing cascading errors.

  • Even the top models fail on emails, addresses, and alphanumerics, which are single points of failure in most B2B workflows.

  • Consumer-grade demos hide the reality that long, multi-turn conversations still fall apart without rigorous context control.

  • The jump from POC to production fails not because of LLMs, but because engineering teams underestimate context management.

  • A universal multilingual model can outperform single-language models by transferring entity knowledge across languages.

  • Mixed-language conversations are the norm worldwide, and current systems break the moment a user switches language.

  • Latency, accuracy, and cost must be solved at the same time; optimizing only one kills the use case.

  • Feeding both sides of the conversation into STT gives models more context and improves accuracy.

  • Domain-specific accuracy matters far more than general accuracy, and most models still fail in specialized environments.

  • Industry “context boosting” tricks are hacks that break at scale; native learned context inside STT is the only path forward.

  • Punctuation and intonation directly shape LLM reasoning, and stripping them for speed creates silent failure modes.

  • Voice AI is shifting from speech-to-text to full speech understanding, and models that don’t evolve won’t survive.

  • The future points toward fused audio plus LLM architectures that remove the brittle STT handoff entirely.
