In the Future of Voice AI interview series, I ask each guest three questions:
- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode's guests are Klemen Simonic, Co-Founder & CEO at Soniox, and Kwindla Hultman Kramer, Co-Founder & CEO at Daily.
Klemen Simonic is the CEO and Co-Founder of Soniox, where he leads the development of advanced voice AI models built for real-world performance. He brings over 16 years of experience across industry and academia, with a deep focus on artificial intelligence. He has worked on cutting-edge AI systems at Facebook, Google, Stanford University, and the University of Ljubljana. Klemen has been developing AI technologies since his undergraduate years, spanning speech, language, and large-scale knowledge systems.
Kwin is CEO and co-founder of Daily, a developer platform for real-time audio, video, and AI. He has been interested in large-scale networked systems and real-time video since his graduate student days at the MIT Media Lab. Before Daily, Kwin helped to found Oblong Industries, which built an operating system for spatial, multi-user, multi-screen, multi-device computing.
Recap Video
Takeaways
Voice AI adoption is slow because real-time transcription still breaks on the most basic parts of a customer call.
Real growth is happening quietly inside call centers, but teams won’t scale until transcription stops causing cascading errors.
Even the top models fail on emails, addresses, and alphanumerics, which are single points of failure in most B2B workflows.
Consumer-grade demos hide the reality that long, multi-turn conversations still fall apart without rigorous context control.
The jump from POC to production fails not because of LLMs, but because engineering teams underestimate context management (a minimal sketch of what that can look like follows this list).
A universal multilingual model can outperform single-language models by transferring entity knowledge across languages.
Mixed-language conversations are the norm worldwide, and current systems break the moment a user switches language.
Latency, accuracy, and cost must be solved at the same time; optimizing only one kills the use case.
Feeding both sides of the conversation into STT gives models more context and improves accuracy (illustrated in the second sketch after this list).
Domain-specific accuracy matters far more than general accuracy, and most models still fail in specialized environments.
Industry “context boosting” tricks are hacks that break at scale; native learned context inside STT is the only path forward.
Punctuation and intonation directly shape LLM reasoning, and stripping them for speed creates silent failure modes.
Voice AI is shifting from speech-to-text to full speech understanding, and models that don’t evolve won’t survive.
The future points toward fused audio plus LLM architectures that remove the brittle STT handoff entirely.
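"Rigorous context control" is abstract in a one-line takeaway, so here is a minimal sketch of what it can mean for a long, multi-turn voice call: keep the system prompt pinned and trim the oldest turns against a budget so the LLM never silently overflows its context window. This is a generic illustration, not Soniox's or Daily's implementation; the class name, the character-based token estimate, and the oldest-first trimming policy are assumptions chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationContext:
    """Rolling context for a multi-turn voice agent.

    Keeps the system prompt pinned and drops the oldest turns once a
    rough token budget is exceeded, so long calls do not silently
    overflow the LLM's context window.
    """
    system_prompt: str
    max_tokens: int = 4000  # rough budget, not an exact model limit
    turns: list = field(default_factory=list)

    @staticmethod
    def _approx_tokens(text: str) -> int:
        # Crude heuristic: ~4 characters per token. A real system would
        # use the model's own tokenizer here.
        return max(1, len(text) // 4)

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append({"role": role, "content": text})
        self._trim()

    def _trim(self) -> None:
        def total() -> int:
            return self._approx_tokens(self.system_prompt) + sum(
                self._approx_tokens(t["content"]) for t in self.turns
            )

        # Drop the oldest turns first; never drop the system prompt.
        while self.turns and total() > self.max_tokens:
            self.turns.pop(0)

    def as_messages(self) -> list:
        return [{"role": "system", "content": self.system_prompt}, *self.turns]


# Usage: every transcribed caller utterance and every agent reply goes
# through the same context object before the next LLM call.
ctx = ConversationContext(system_prompt="You are a billing support agent.")
ctx.add_turn("user", "Hi, my invoice number is INV-48213 and it looks wrong.")
ctx.add_turn("assistant", "I can help with that. What email is the account under?")
print(ctx.as_messages())
```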
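The takeaway about feeding both sides of the conversation into STT describes an architectural choice rather than a specific API. The sketch below only illustrates the idea: one shared transcription context receives both the caller's transcripts and the agent's replies, so each new recognition step can condition on everything said so far (names, spellings, order numbers). The `transcribe_with_context` function is a hypothetical stand-in, not a Soniox or Daily API.

```python
from typing import List


def transcribe_with_context(audio_chunk: bytes, context: List[str]) -> str:
    """Hypothetical STT call that accepts prior dialogue as biasing context.

    A real engine would return a recognition result conditioned on the
    context; here we return a placeholder so the control flow is runnable.
    """
    return f"<transcript of {len(audio_chunk)} bytes, {len(context)} context turns>"


def handle_call(caller_chunks: List[bytes], agent_replies: List[str]) -> List[str]:
    """Interleave both sides of the call into one shared STT context."""
    context: List[str] = []
    transcripts: List[str] = []
    for audio, agent_text in zip(caller_chunks, agent_replies):
        # Transcribe the caller with everything said so far as context.
        caller_text = transcribe_with_context(audio, context)
        transcripts.append(caller_text)
        # Both the caller's transcript and the agent's reply become
        # context for the next recognition step.
        context.append(caller_text)
        context.append(agent_text)
    return transcripts


print(handle_call([b"\x00" * 3200, b"\x00" * 6400],
                  ["Sure, can you spell your email?", "Thanks, got it."]))
```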