On Oct 2nd, I attended the very first VapiCon organized by Vapi.
This was by far the largest Voice AI gathering to date. Not just my words; pretty much everyone there was saying it.
I heard demand turned out to be much bigger than the venue could fit.
The keynote speakers did a great job laying out the state of Voice AI.
Big kudos to Jordan@Vapi, Scott@Deepgram, Justin@OpenAI and Dylan@Assembly.
Also big kudos to the event staff. They were very caring and accommodating.
Unfortunately, I had to miss a lot of sessions but here are my takeaways from the ones I did attend:
The Voice AI community is vibrant and energized. It’s rare to see so many young folks focused on an industry. The future is clearly being built here.
I think most of the attendees were developers from startups. I estimate there are between 500 and 1,000 Voice AI startups now.
Over $2B has been invested in Voice AI startups since 2024.
STT accuracy, turn-taking, speaker separation, latency, TTS accuracy, hallucinations, function calling, and lack of context are the main technical challenges the industry is trying to solve.
Most of these issues have improved significantly in the last 2 years, but they all STILL remain big challenges (except maybe latency?).
There is a big debate about speech-to-speech (S2S) models and the Cascading approach (STT→LLM→TTS). There are clear pros and cons here.
The vast majority of deployed technology uses the Cascading approach.
It is assumed that most of these technical problems will go away once S2S models mature, but that remains to be seen.
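To make the debate concrete, here is a minimal sketch of one conversational turn under the Cascading approach. All the function names are hypothetical stand-ins for vendor SDK calls, not any specific API:

```python
# A minimal sketch of one conversational turn in the Cascading approach
# (STT -> LLM -> TTS). All three stage functions are hypothetical stand-ins
# for vendor SDK calls, not any real API.

def transcribe(audio: bytes) -> str:
    """STT stage: convert caller audio to text (stand-in)."""
    return "what are your opening hours"  # a real STT call would go here

def generate_reply(transcript: str, history: list[str]) -> str:
    """LLM stage: decide the agent's next utterance (stand-in)."""
    return "We are open nine to five, Monday through Friday."

def synthesize(text: str) -> bytes:
    """TTS stage: render the reply as audio (stand-in)."""
    return text.encode()  # a real TTS call would return audio bytes

def handle_turn(audio: bytes, history: list[str]) -> bytes:
    # Each hop adds latency and a chance to drop information (tone, emphasis,
    # overlap), which is the core trade-off against end-to-end S2S models.
    transcript = transcribe(audio)
    reply = generate_reply(transcript, history)
    history.extend([transcript, reply])
    return synthesize(reply)

history: list[str] = []
audio_out = handle_turn(b"\x00" * 320, history)  # fake 20 ms of silence
```

S2S models collapse these three hops into one, which is why they are expected to help with latency and lost context, at the cost of less control over each stage.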
While the event was super energized, I couldn't help but think about a topic that was in the air. Everyone knows about it, yet I think people are NOT talking about it enough.
Most Voice AI Agents in production are still quite unstable today.
And this SLOWS the industry down.
I think there are 2 main reasons why they keep failing:
The real world is complex: there is too much context and too many edge cases out there, and AI agents simply don’t know how to handle these yet. In contrast, people are really good at this because of our experience, context and multi-modality.
During VapiCon, I witnessed several live demos fail because of these exact issues: background noise, hallucinations, and AI agents not taking turns properly. It was a clear reminder that even advanced systems still struggle when faced with messy, unpredictable real-world conditions.
STT is failing us: in most cases STT does a great job of capturing what people said. However, the error rate on numbers, emails, addresses, and proper nouns remains quite high. This might seem like a small issue, but it's not, especially in B2B use cases. It is perhaps the most important technical problem preventing the industry from growing today. I really do hope this gets better in 2026.
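One common guardrail while STT entity accuracy catches up is to validate high-risk fields and read them back to the caller before acting on them. Here is a minimal, hypothetical sketch in Python; the patterns and rules are illustrative, not any vendor's logic:

```python
import re

# Illustrative pattern only; production systems need far more robust checks.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def needs_readback(entity_type: str, value: str) -> bool:
    """Return True if a transcribed entity should be confirmed with the caller."""
    if entity_type == "email":
        # STT often emits spoken forms like "john dot smith at gmail"
        return EMAIL_RE.match(value) is None
    if entity_type == "phone":
        digits = re.sub(r"\D", "", value)
        return len(digits) not in (10, 11)
    # Addresses and proper nouns are hardest to validate: always confirm.
    return True

print(needs_readback("email", "john dot smith at gmail"))  # True: read it back
print(needs_readback("email", "john.smith@gmail.com"))     # False
print(needs_readback("phone", "415 555 01"))               # True: too few digits
```

Even a crude check like this can catch many spoken-form errors before they flow downstream into a CRM or booking system.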
I estimate Voice AI agent traffic at around 3B minutes per month today.
If we solve the above two problems, the traffic will skyrocket 🚀
Here's to reaching 100B minutes per month by the next VapiCon!