What to expect in 2025 | Jack Piunti (GTM Lead for Communications at ElevenLabs)

In the Future of Voice AI interview series, I ask my guests three questions:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guest is Jack Piunti, GTM Lead for Communications at ElevenLabs.

Listen on YouTube

Jack Piunti is the GTM Lead for Communications at ElevenLabs, where he oversees go-to-market strategy across CPaaS, CCaaS, UCaaS, and customer experience. With a strong background in consultative technology partnerships and startup growth, Jack brings deep expertise in AI-driven communications. Prior to ElevenLabs, he spent six years at Twilio, helping shape enterprise adoption of real-time voice technologies. He is passionate about the future of connected applications and the role of AI in transforming how we communicate.

ElevenLabs is a voice AI company offering ultra-realistic text-to-speech, speech-to-text, voice cloning, multilingual dubbing, and conversational AI tools. Founded in 2022, it enables creators and developers to build voice apps and generate lifelike, emotionally rich speech in 70+ languages. Its latest models support expressive cues and multi-speaker dialogue.

Recap Video

Takeaways

  • Most AI failures in conversation don't come from the language model, but from inaccurate speech-to-text at the start.

  • Bad transcription of critical details like names or codes breaks the entire user experience and can’t easily be recovered.

  • Accurate speech-to-text is now a make-or-break factor for building reliable AI agents.

  • Voice will soon replace typing as the main way humans interact with machines because it's more natural and efficient.

  • Enterprises don’t want to stitch together multiple AI vendors; they want end-to-end platforms that simplify the stack and reduce latency.

  • Demos often look impressive, but very few companies can scale real-time voice tech reliably in production environments.

  • Expressive-sounding AI voice agents aren't enough; turn-taking and accuracy are still bigger challenges.

  • Most companies ignore accessibility in AI, but modeling things like stuttering actually improves agent behavior.

  • Streaming speech and voice models will unlock more lifelike, responsive AI agents, and they’re coming fast.

  • Audio AI needs deep expertise beyond AI, including sound engineering and context-aware modeling of human speech.

  • There’s a growing trend of AI companies going beyond voice to control the full audio experience, including music and sound effects.

  • The way voice models are trained is fundamentally different from language models and requires much cleaner training data.

  • Many agentic AI builders today are forced to cobble together solutions from different vendors, which creates delay and complexity.

  • True real-time voice AI must handle language switching, emotional cues, and speech disfluencies automatically to feel natural.
