In The Future of Voice AI series of interviews, I ask my guests three questions:
- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?
This episode’s guest is Kwindla Hultman Kramer, Co-Founder & CEO at Daily.
Kwin is CEO and co-founder of Daily, a developer platform for real-time audio, video, and AI. He has been interested in large-scale networked systems and real-time video since his graduate student days at the MIT Media Lab. Before Daily, Kwin helped to found Oblong Industries, which built an operating system for spatial, multi-user, multi-screen, multi-device computing.
Daily makes developer tools and infrastructure for real-time audio, video, and AI. The company was founded in 2016 with the goal of making it easier to embed real-time communications into websites and applications. Today, Daily powers telehealth, education, workplace collaboration, customer support, social, and gaming applications for thousands of developers and product teams. Daily's core competence is delivering reliable, high-quality, low-latency audio and video streams to any device, on any network, anywhere in the world.
Recap Video
Takeaways
AI is a platform shift that will change how we think about computers and how we use them
Daily’s focus is real-time communications
Over the last 2 years, many of Daily's customers have been asking whether AI participants can be part of their sessions
Daily is built to deliver low-latency voice and video for interactive conversational AI applications
To build an interactive AI app, you need to:
- Send audio from the user's device
- Transcribe the audio
- Run LLM inference
- Likely make an API call
- Convert the text to speech and send it back
This whole pipeline must be cancellable/interruptible at any point
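The steps above, together with the requirement that they be interruptible, can be sketched with Python's asyncio. This is only a minimal illustration under stated assumptions: the stage functions (`transcribe`, `run_llm`, `call_api`, `synthesize`) and the `Turn` class are hypothetical stand-ins invented for this example, not Daily's or Pipecat's API, and each stage simulates latency with a short sleep instead of calling a real service.

```python
import asyncio
from typing import Optional

# Hypothetical stand-ins for real services. Each one only sleeps briefly
# to simulate network or model latency.

async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return audio.decode("utf-8")

async def run_llm(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return f"reply to: {prompt}"

async def call_api(text: str) -> str:
    await asyncio.sleep(0.01)
    return text  # a real app might enrich the reply with external data

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0.01)
    return text.encode("utf-8")

async def respond(audio: bytes) -> bytes:
    """Run one conversational turn: audio in, audio out."""
    text = await transcribe(audio)
    reply = await run_llm(text)
    reply = await call_api(reply)
    return await synthesize(reply)

class Turn:
    """Owns the in-flight response so that new user speech can interrupt it."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    async def start(self, audio: bytes) -> asyncio.Task:
        # New speech always cancels whatever response is still running.
        await self.interrupt()
        self._task = asyncio.create_task(respond(audio))
        return self._task

    async def interrupt(self) -> bool:
        """Cancel the current response. Returns True if one was cancelled."""
        task, self._task = self._task, None
        if task is None or task.done():
            return False
        task.cancel()  # raises CancelledError at the stage's current await
        try:
            await task
        except asyncio.CancelledError:
            pass
        return True

async def demo():
    turn = Turn()
    first = await turn.start(b"what is the weather")
    await asyncio.sleep(0.015)  # the user starts talking mid-response
    second = await turn.start(b"never mind, tell me a joke")
    return first.cancelled(), await second
```

Calling `asyncio.run(demo())` starts one response, interrupts it while it is still in the LLM stage, and lets the second response run to completion. Because every stage is an awaitable, cancellation can land at any point in the pipeline, which is the property the takeaway describes.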
Having an open-source layer for this is very important
It is better to run this pipeline in the cloud than on-device
Daily delivers the transport layer, the bottom layer of the stack (WebRTC)
Daily has built an open-source layer called Pipecat to enable more apps
After the introduction of GPT-4o, the optimal architecture has changed entirely
GPT-4o collapses 3-4 steps that we had to do separately before (transcription, phrase endpointing, LLM inference, TTS)
Before GPT-4o, the best latency was 800ms. Now it's 300ms.
GPT-4o audio feels first-class and opens up a whole bunch of new use cases
Daily is the WebRTC network glue between a user on a device and servers that are generating audio/video
It can support 100K+ people in a single session
Main use cases: Healthcare, education, workforce, social, gaming, customer support
Soon, all games will have real-time conversational AI characters in them