In The Future of Voice AI series of interviews, Davit asks his guests three questions:
- What problems do you currently see in Enterprise Voice AI?
- How do you (your company) solve these problems?
- What solutions do you envision in the next 5 years?
This episode’s guest is Jordan Dearsley, Co-Founder of Vapi AI.
Jordan Dearsley is the co-founder of Vapi (YC W21), a developer platform for building voice AIs that talk like people. He previously founded Superpowered, an AI notetaker for meetings.
Vapi is a developer platform for voice AI. In minutes, developers can create human-level voice bots that can be used in call centers, drive-thrus, websites, and mobile apps.
Summary of the conversation
Integration and Reliability Challenges in Voice AI: Jordan highlights the integration difficulties within voice AI technologies, emphasizing the complexity of creating seamless interactions between components such as speech-to-text, NLP, and noise cancellation. He points out the critical issue of reliability, noting how the dependency on multiple external services can lead to system breakdowns, stressing the need for resilience in voice AI systems.
Latency and Architectural Efficiency: Jordan discusses the current state of latency in voice AI, with the minimum achievable latency being around 500 milliseconds, under ideal conditions. This speed approaches human-level interaction speeds, achieved through optimized on-premises solutions and closely located server clusters. He outlines Vapi's architectural approach, which involves converting speech to text, processing it through a specially tuned LLM for task-specific outputs, and then synthesizing the speech to deliver the response.
Expanding Use Cases for Voice Bots: The conversation delves into the potential for voice AI to revolutionize customer-business interactions by addressing scalability issues inherent in traditional call centers and exploring new applications across websites, apps, and hardware devices. Jordan envisions voice AI enabling more personalized and efficient communication channels, capable of handling diverse tasks from customer support to booking services, all while maintaining patience and adherence to training.
Future Directions and Challenges: Looking ahead, Jordan predicts significant advancements in voice AI, driven by increased intelligence, reduced costs, and lower latency. He envisions a future where voice interfaces are ubiquitous, offering a more natural and efficient alternative to current interaction methods. The discussion also touches on the importance of overcoming challenges related to achieving human-like conversational capabilities, the necessity for a 'master model' to achieve human-level performance, and the role of voice AI in addressing urgent, complex, or emotionally charged customer service scenarios.
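The three-stage architecture described above (speech-to-text, then an LLM, then speech synthesis) can be illustrated with a minimal sketch that times each stage against a latency budget. This is not Vapi's implementation: the stage functions `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins that only pass strings around, and the 500 ms budget is simply the figure mentioned in the conversation.

```python
import time
from dataclasses import dataclass, field


# Hypothetical placeholder stages. A real system would call a
# speech-to-text service, an LLM, and a text-to-speech service here.
def transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def generate_reply(text: str) -> str:
    return f"You said: {text}"


def synthesize(text: str) -> bytes:
    return text.encode("utf-8")


@dataclass
class TurnResult:
    audio: bytes
    timings_ms: dict = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        # End-to-end latency is the sum of the per-stage latencies.
        return sum(self.timings_ms.values())

    def within_budget(self, budget_ms: float = 500.0) -> bool:
        return self.total_ms <= budget_ms


def run_turn(audio_in: bytes) -> TurnResult:
    """Run one conversational turn and record how long each stage took."""
    stages = [("stt", transcribe), ("llm", generate_reply), ("tts", synthesize)]
    timings = {}
    value = audio_in
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return TurnResult(audio=value, timings_ms=timings)


if __name__ == "__main__":
    result = run_turn(b"book a table for two")
    print(result.timings_ms, result.within_budget())
```

Recording per-stage timings rather than a single total makes it easier to see which component consumes most of the budget when a turn runs slow.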
Takeaways
Integration and Reliability Issues: A significant challenge in voice bots is integrating various components (speech-to-text, NLP, etc.) into a seamless system. Additionally, reliability is a concern due to the dependency on external services, where a single failure can disrupt the entire process.
Latency Achievements and Architectural Design: Under ideal conditions, latency in voice AI can be as low as about 500 milliseconds, close to human-level interaction speeds. This is achieved through optimized server configurations and a streamlined process of converting voice to text, processing it through an LLM, and then synthesizing speech.
Broad Use Cases for Voice Bots: Voice bots have the potential to significantly impact consumer-business interactions, in call centers and beyond, by enabling scalable, efficient, and patient customer service. Future applications could extend to every website, app, and hardware device, transforming how people interact with them.
Challenges in Voice Bot Adoption: Key obstacles include maintaining conversation scripts (medium-hanging fruit) and developing systems that can handle open-ended interactions or role-playing scenarios (low-hanging fruit), with a focus on non-mission-critical applications to mitigate the impact of technology limitations.
Future Directions and Predictions: In the next two to five years, significant improvements are expected in voice bot intelligence, cost, and latency. The ultimate vision is for voice bots to be ubiquitous, offering intuitive and efficient interfaces for a wide range of applications.
Importance of Human-like Interaction: Building a voice bot that converses indistinguishably from a human is seen as a critical goal. This may require a master model for voice, a comprehensive, integrated approach that surpasses the capabilities of individually linked APIs.
Economic and Practical Implications: As the technology evolves, the initial focus will likely be on low-hanging use cases with high economic incentives. Over time, advancements will address more complex challenges, potentially leading to widespread adoption and a transformation in how businesses and consumers interact.
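The reliability concern raised above, where a single failing external service can disrupt the whole call, is often mitigated by falling back to an alternate provider. The interview does not describe how Vapi handles this, so the following is only a generic sketch of that idea; `flaky_primary` and `backup` are hypothetical stand-ins for two interchangeable services.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    providers: Sequence[Callable[[str], str]], payload: str
) -> str:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider(payload)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllProvidersFailed(f"{len(errors)} provider(s) failed")


# Hypothetical providers used only for illustration.
def flaky_primary(text: str) -> str:
    raise ConnectionError("primary service unavailable")


def backup(text: str) -> str:
    return f"backup: {text}"


if __name__ == "__main__":
    print(call_with_fallback([flaky_primary, backup], "hello"))
```

Ordering the providers by preference keeps the common path fast, while the fallback only adds latency on the turns where the primary fails.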
Full transcript: https://voice-ai-newsletter.krisp.ai/publish/post/140734453/transcript