In the Future of Voice AI series of interviews, I ask three questions to my guests:
- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guest is Dylan Fox, Founder & CEO at AssemblyAI.
Dylan started AssemblyAI in 2017, inspired by the potential of new voice-powered products like Amazon Alexa, as well as his experience working as a research engineer at Cisco on new AI products and features. He saw an opportunity to use new AI technology to make fundamental improvements in the way that computers can understand and extract value from voice data. AssemblyAI started in Y Combinator and has since grown into a Series C company with over $115 million in funding from notable investors like Accel, Insight Partners, and Smith Point Capital. Dylan lives in Brooklyn, NY.
AssemblyAI builds speech language models that serve as the foundational voice AI infrastructure for next-generation voice applications. Their models deliver industry-leading speech-to-text accuracy with superhuman speech understanding capabilities including speaker detection, summarization, PII redaction, and an LLM gateway — giving developers everything they need to build sophisticated voice AI products.
Universal-3 Pro, the first speech language model optimized specifically for voice AI, goes further with advanced prompting capabilities that let developers customize model behavior for their exact use case. With both async and real-time streaming support, AssemblyAI integrates directly into voice agents, AI assistants, medical scribes, real-time call analysis systems, and more. Tens of thousands of developers rely on AssemblyAI's models to power voice AI applications used by millions of end users every day.
Recap Video
Takeaways
Real-time is the new growth engine - the last ~18–20 months crossed a reliability threshold where voice use cases actually work.
The real barrier in real-time STT is not model quality, it’s running low-latency systems at massive scale without breaking.
Voice AI is quietly expanding beyond agents into robotics, consumer hardware, ambient listening, and medical scribes, which widens the market fast.
Streaming models will always be disadvantaged on “look-ahead,” so the core problem is making good calls with incomplete future context.
The old quality-vs-speed tradeoff is shrinking because hardware and model optimizations are closing the gap between streaming and batch.
The “98% accuracy” claims are meaningless because benchmarks reward clean audio, not real phone chaos and edge cases.
The industry needs hard voice evals where models look bad on purpose (WER ~50%) because that’s closer to real conditions.
Pricing is used as a growth lever: $0.21 per hour, prorated by the second, with automatic volume discounts.
The “no reservations, no concurrency limits” promise is really a bet on infra superiority, not just model quality.
Dylan’s open-source take is blunt: managing your own AI infra is a tax that slows shipping and kills competitiveness.
Specialization beats multimodal generalists for reliability: a model trained 100% on STT tasks is less likely to go off the rails.
Massive training data scale, not a sudden architecture breakthrough, is the main reason accuracy jumped in the last 2–3 years.
Infrastructure is becoming the hidden moat: unlimited rate limits and no concurrency negotiations remove a major bottleneck for teams shipping voice products.
Real-world performance can move business metrics, like a 15–20% lift in voice agent booking conversions from better STT.
Dylan’s adoption forecast is aggressive: we are at the start of a 100x curve, which means today’s usage is the floor, not the peak.
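As a back-of-the-envelope illustration of the per-second proration mentioned in the pricing takeaway, the cost of a transcription is simply the audio duration scaled against the $0.21 hourly rate (volume discounts would apply on top; their thresholds aren't specified here, so they are omitted):

```python
def transcription_cost(audio_seconds: float, rate_per_hour: float = 0.21) -> float:
    """Prorate the hourly STT rate down to the second.

    $0.21/hour is the rate cited in the interview; automatic volume
    discounts would further reduce this at scale (tiers not shown).
    """
    return audio_seconds / 3600 * rate_per_hour

# 30 minutes of audio costs half the hourly rate
print(round(transcription_cost(1800), 4))  # 0.105

# A 10,000-hour monthly workload, before volume discounts
print(round(transcription_cost(10_000 * 3600), 2))  # 2100.0
```

At these rates, even large workloads stay in the low thousands of dollars per month, which is consistent with pricing being used as a growth lever rather than a margin lever.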