In the Future of Voice AI interview series, I ask my guests three questions:
- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?
This episode’s guest is Jack Piunti, GTM Lead for Communications at ElevenLabs.
Jack Piunti is the GTM lead for Communications at ElevenLabs, where he oversees go-to-market strategy across CPaaS, CCaaS, UCaaS, and customer experience. With a strong background in consultative technology partnerships and startup growth, Jack brings deep expertise in AI-driven communications. Prior to ElevenLabs, he spent six years at Twilio, helping shape enterprise adoption of real-time voice technologies. He is passionate about the future of connected applications and the role of AI in transforming how we communicate.
ElevenLabs is a voice AI company offering ultra-realistic text-to-speech, speech-to-text, voice cloning, multilingual dubbing, and conversational AI tools. Founded in 2022, it enables creators and developers to build voice apps and generate lifelike, emotionally rich speech in 70+ languages. Its latest models support expressive cues and multi-speaker dialogue.
Recap Video
Takeaways
Most AI failures in conversation don't come from the language model but from inaccurate speech-to-text at the start of the pipeline.
A bad transcription of a critical detail like a name or a code breaks the entire user experience and can't easily be recovered downstream.
Accurate speech-to-text is now a make-or-break factor for building reliable AI agents; the sketch after these takeaways illustrates why.
Voice will soon replace typing as the main way humans interact with machines because it's more natural and efficient.
Enterprises don't want to stitch together multiple AI vendors; they want end-to-end platforms that simplify the stack and reduce latency.
Demos often look impressive, but very few companies can scale real-time voice tech reliably in production environments.
AI voice agents that sound expressive aren't enough — turn-taking and accuracy are still bigger challenges.
Most companies ignore accessibility in AI, but modeling things like stuttering actually improves agent behavior.
Streaming speech and voice models will unlock more lifelike, responsive AI agents, and they're coming fast.
Audio AI demands expertise beyond machine learning, including sound engineering and context-aware modeling of human speech.
There’s a growing trend of AI companies going beyond voice to control the full audio experience, including music and sound effects.
The way voice models are trained is fundamentally different from language models and requires much cleaner training data.
Many agentic AI builders today are forced to cobble together solutions from different vendors, which adds latency and complexity.
True real-time voice AI must handle language switching, emotional cues, and speech disfluencies automatically to feel natural.
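Several of these takeaways describe the same loop: audio goes through speech-to-text, the transcript feeds the language model, and the reply is synthesized back into speech. Below is a minimal sketch of that loop, using hypothetical `transcribe`, `generate_reply`, and `synthesize` helpers (not any specific vendor's SDK) and an assumed confidence threshold, to show why a misheard name or code cannot be repaired later in the turn: the language model only ever sees the transcript.

```python
# Minimal sketch of one voice-agent turn. The helper functions below are
# placeholders for real STT, LLM, and TTS calls, not a specific vendor's API.

from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    confidence: float  # overall STT confidence for this utterance


def transcribe(audio: bytes) -> Transcript:
    """Placeholder for a speech-to-text call."""
    raise NotImplementedError


def generate_reply(prompt: str) -> str:
    """Placeholder for a language-model call."""
    raise NotImplementedError


def synthesize(text: str) -> bytes:
    """Placeholder for a text-to-speech call."""
    raise NotImplementedError


def handle_turn(audio: bytes) -> bytes:
    transcript = transcribe(audio)

    # Guard critical details: if the STT pass is low-confidence, ask the
    # caller to repeat rather than letting the error reach the LLM, where
    # it would silently corrupt the rest of the conversation.
    if transcript.confidence < 0.85:  # threshold is an assumption
        return synthesize("Sorry, could you repeat that?")

    reply = generate_reply(transcript.text)
    return synthesize(reply)
```

The design point is that the only place a transcription error can be caught cheaply is before the transcript reaches the language model, which is why speech-to-text accuracy is treated here as the make-or-break layer of the stack.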