In The Future of Voice AI interview series, I ask my guests three questions:
- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?
This episode’s guest is Dylan Fox, Founder & CEO at AssemblyAI.
Dylan started AssemblyAI in 2017 inspired by the potential of new voice-powered products like the Amazon Alexa, as well as his experience working as a research engineer at Cisco on new AI products and features. He saw an opportunity to use new AI technology to make fundamental improvements in the way that computers can understand and extract value from voice data. AssemblyAI started in Y Combinator and has now grown into a Series C company with over $115 million in funding from notable investors like Accel, Insight Partners, and Smith Point Capital. Dylan lives in Brooklyn, NY.
AssemblyAI builds new AI systems that can understand human speech with superhuman abilities. AssemblyAI’s multilingual speech AI models provide speech-to-text with industry-leading accuracy, and advanced capabilities like speaker detection, summarization, PII redaction, and sentiment analysis give organizations the ability to generate powerful, actionable insights from audio. With AssemblyAI, organizations can combine speech-to-text and LLMs to quickly build powerful AI features. Tens of thousands of developers across the globe have built on AssemblyAI's API, powering products that are used by tens of millions of end users every single day.
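To make the "build on the API" point concrete, here is a minimal sketch of a transcription call using AssemblyAI's Python SDK. The package name (`assemblyai`), the config options, and the response fields reflect my understanding of the public SDK, and the audio URL is a placeholder, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch of transcribing a file with AssemblyAI's Python SDK.
# Package, option names, and response fields assumed from the public SDK docs.
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# Ask for speaker detection and sentiment analysis alongside the transcript.
config = aai.TranscriptionConfig(
    speaker_labels=True,
    sentiment_analysis=True,
)

transcript = aai.Transcriber().transcribe("https://example.com/call.mp3", config)

print(transcript.text)                   # full speech-to-text output
for utterance in transcript.utterances:  # per-speaker segments
    print(utterance.speaker, utterance.text)
```

The transcript text and per-speaker segments are what you would then pass to an LLM to build the kind of summarization or insight features described above.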
Recap Video
Takeaways
- AssemblyAI processes terabytes of audio data every single day: podcasts, meetings, phone calls, radio, TV, and other broadcast content.
- Over 100K developers use the API, generating 30M AI inference calls per day.
- Costs have been dropping every 6 months thanks to economies of scale and model optimizations.
- There is a ton of interest in streaming use cases (voice bots, agent assist, closed captions) as well as non-streaming ones.
- Since non-streaming models can process audio bi-directionally, using both past and future context, they will always produce higher quality than streaming models (see the attention-mask sketch after this list). The majority of users submit non-streaming tasks.
- They recently launched their newest model, Universal-1.
- Universal-1 supports both streaming and non-streaming transcription. It was trained on 12.5M hours of voice data and reaches 90-93% accuracy in English and 90-92% in French, Spanish, and German.
- Today's AI benchmarks are the Wild West. The industry must use independent third parties for benchmarking, and the benchmark data must be kept private so that companies cannot game the system.
- Average WER (word error rate) is not a good metric, as it is not representative of real-world user needs.
- WER doesn't capture quality on rare words, alphanumerics, proper nouns, emails, formatting, or context, yet these are super important for Speech AI workflows such as summarization (see the WER sketch after this list).
- What users care about is not WER but the fluency of the output.
- AssemblyAI runs a lot of human evaluations of its models.
- They used Google TPU v5 for training Universal-1.
- They will always work to make STT models better, faster, and cheaper. The STT market will grow faster once the models improve, and new use cases will unlock.
- In 18-24 months, models will be much more accurate.
- Currently, AssemblyAI is highly focused on STT and Speech Understanding.
- TTS and Translation will come over time, but not soon.
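The streaming-versus-non-streaming quality gap in the takeaways comes down to how much context the model is allowed to see. The NumPy sketch below is a generic illustration (not AssemblyAI's actual architecture): a streaming model scores each audio frame using only earlier frames (a causal mask), while an offline model can attend to the whole utterance in both directions.

```python
# Generic illustration (not AssemblyAI's actual model) of streaming vs.
# non-streaming context: a causal mask sees only the past, a full mask
# sees the entire utterance.
import numpy as np

num_frames = 5

# Causal mask: frame t may attend only to frames 0..t (lower triangle).
streaming_mask = np.tril(np.ones((num_frames, num_frames), dtype=bool))

# Bidirectional mask: every frame may attend to every other frame.
offline_mask = np.ones((num_frames, num_frames), dtype=bool)

print(streaming_mask.sum(), "attention links for streaming")       # 15
print(offline_mask.sum(), "attention links for non-streaming")     # 25
```

More usable context per frame is why, all else being equal, offline transcription stays ahead of streaming on accuracy.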
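To see why average WER hides the errors users actually notice, here is a small self-contained WER computation (standard word-level Levenshtein distance, nothing AssemblyAI-specific). The two hypotheses below get identical WER scores, even though one merely drops a filler word while the other corrupts an order number, exactly the kind of rare-token error that breaks downstream summaries.

```python
# Word error rate (WER): word-level edit distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "please ship order AB1234 to john smith today"
hyp_minor = "please ship order AB1234 to john smith"        # drops "today"
hyp_bad   = "please ship order AB1284 to john smith today"  # corrupts the order number

print(wer(reference, hyp_minor))  # 0.125
print(wer(reference, hyp_bad))    # 0.125 -- same WER, far worse for downstream use
```

Both hypotheses score 12.5% WER, which is why fluency and entity-level accuracy matter more to users than the average error rate.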