Building Voice AI at 11x | Francisco Izaguirre (Engineering Lead at 11x)

In the Future of Voice AI interview series, I ask my guests three questions:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guest is Francisco Izaguirre, Engineering Lead at 11x.

Listen on YouTube

Francisco Izaguirre is the Engineering Lead at 11x, where he’s helping redefine how voice AI powers modern revenue teams. With a background spanning backend systems, ML infrastructure, and conversational AI, Francisco currently leads development on Julian—11x’s real-time voice agent built to autonomously handle sales calls, qualify leads, and book demos. His work sits at the intersection of latency optimization, emotional intelligence in AI, and scalable agent design, making him one of the most hands-on builders in the emerging voice AI space.

Recap Video

Takeaways

  • 11x is turning voice AI into a real revenue driver. Their agent, Julian, isn’t just automating calls; it’s reshaping how businesses scale outbound sales with speed, empathy, and autonomy that rival human reps.

  • There’s nothing more agentic than a phone call—full autonomy, real-time decisions, no undo button.

  • Speed to lead isn’t just a sales metric; it’s a core design principle for AI voice agents like Julian.

  • 300ms end-to-end latency is now the bar for natural-feeling AI phone calls, including STT, LLM, RAG, and TTS (a rough per-stage budget is sketched after this list).

  • Traditional RAG pipelines break down in real-time voice—fetching data across the wire is too slow and too obvious.

  • Masking latency through conversational techniques like “give me a second” is a practical design pattern, not a hack (a minimal sketch follows this list).

  • Twilio and similar telephony providers are becoming the new latency bottleneck as expectations move to sub-300ms performance.

  • ASR still struggles with critical details like names, emails, and numbers—especially across accents—which can derail entire calls.

  • Pronunciation isn’t solved; getting someone’s name or company wrong breaks trust instantly.

  • Using phonetic inputs like IPA to improve TTS accuracy shows measurable emotional and experiential gains in customer interactions (see the SSML example after this list).

  • Good conversation isn’t about perfection; it’s about imperfection that feels human and emotionally attuned.

  • The best agents will sound like your imperfect but empathetic friend.

  • “Turn-taking” is still a massive challenge: detection is easy, inference is hard, and both introduce their own bugs (the detection half is sketched after this list).

  • Backchanneling (knowing when to respond, not just when to talk) is an unsolved frontier in conversational AI.

  • Emotional intelligence isn’t a bonus for voice agents—it’s required for use cases like medical, hospitality, and SMB outreach.

  • Agents must dynamically adapt empathy patterns depending on user stress, urgency, and tone—or risk sounding tone-deaf.

  • Emerging techniques like emotional tagging in TTS are promising but still lack scalable evaluation methods.

  • Breaking a single agent into a mesh of specialized sub-agents may outperform monolithic models in complex conversations (a toy router sketch closes this section).

  • Self-learning pipelines are still aspirational; manual tuning can’t keep up with the pace or volume of voice interactions.

  • Development speed can’t lag behind live conversations; automation loops with human-in-the-loop review are essential.

  • Passing the Turing test in voice won’t just be about tone or latency; it will require recursive emotional and contextual depth.
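
A sub-300ms turn only works if every stage gets an explicit slice of the budget. Here is a minimal sketch of a per-turn budget check in Python; the stage names and millisecond figures are illustrative assumptions, not 11x’s actual numbers.

```python
# Illustrative per-stage budget for a ~300 ms voice turn.
# Stage names and figures are assumptions for this sketch, not 11x's numbers.
BUDGET_MS = {
    "stt_final": 80,          # streaming ASR finalizes the transcript
    "retrieval": 40,          # RAG lookup, ideally from a local cache
    "llm_first_token": 120,   # time to first generated token
    "tts_first_audio": 60,    # time to first synthesized audio chunk
}

def check_turn(timings_ms: dict[str, float]) -> None:
    """Flag any stage that blew its slice of the end-to-end budget."""
    total = 0.0
    for stage, budget in BUDGET_MS.items():
        spent = timings_ms.get(stage, 0.0)
        total += spent
        if spent > budget:
            print(f"{stage}: {spent:.0f} ms (budget {budget} ms) OVER")
    if total > sum(BUDGET_MS.values()):
        print(f"turn total {total:.0f} ms exceeds the ~300 ms target")

check_turn({"stt_final": 95, "retrieval": 20,
            "llm_first_token": 140, "tts_first_audio": 55})
```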
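
One way to implement the “give me a second” pattern: race the slow LLM/RAG call against a short timer and play a pre-rendered filler clip only when the real answer is late. A minimal asyncio sketch; the filler asset, timings, and stand-in functions are hypothetical.

```python
import asyncio

FILLER_AUDIO = "give_me_a_second.wav"  # pre-synthesized clip (hypothetical asset)

async def respond(llm_call, play_audio, filler_after_s: float = 0.7):
    """Start the slow LLM/RAG call; if it hasn't finished within
    filler_after_s seconds, cover the gap with a conversational filler."""
    task = asyncio.create_task(llm_call())
    try:
        # shield() keeps the timeout from cancelling the underlying call
        return await asyncio.wait_for(asyncio.shield(task), filler_after_s)
    except asyncio.TimeoutError:
        await play_audio(FILLER_AUDIO)  # "give me a second..."
        return await task               # deliver the real answer when ready

async def slow_llm():
    await asyncio.sleep(1.5)            # stand-in for LLM + RAG latency
    return "Sure, I can book that demo for Tuesday."

async def speak(path: str):
    print(f"[playing {path}]")

print(asyncio.run(respond(slow_llm, speak)))
```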
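
The IPA idea maps naturally onto SSML’s phoneme tag, which several major TTS vendors (Amazon Polly, Azure, Google Cloud TTS) support, though coverage varies by voice. A small sketch; the IPA transcription shown is illustrative, not authoritative.

```python
from xml.sax.saxutils import escape

def with_ipa(name: str, ipa: str) -> str:
    """Wrap a hard-to-pronounce name in an SSML <phoneme> tag so the
    TTS engine reads the supplied IPA instead of guessing."""
    return f'<phoneme alphabet="ipa" ph="{escape(ipa)}">{escape(name)}</phoneme>'

ssml = (
    "<speak>Hi, is this "
    + with_ipa("Izaguirre", "isaˈɣire")  # illustrative IPA, verify before use
    + "? This is Julian calling from 11x.</speak>"
)
print(ssml)
```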
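
On turn-taking, the “detection is easy” half is often just a silence timer over VAD frames, as in this sketch (thresholds arbitrary). The hard “inference” half, deciding whether a pause is a finished thought or a mid-sentence hesitation, is exactly what a timer cannot answer.

```python
class EndOfTurnDetector:
    """Naive end-of-turn detection: declare the user's turn over after
    silence_ms of continuous non-speech frames from a VAD."""

    def __init__(self, silence_ms: int = 600, frame_ms: int = 20):
        self.needed = silence_ms // frame_ms  # consecutive quiet frames
        self.quiet = 0

    def feed(self, frame_is_speech: bool) -> bool:
        self.quiet = 0 if frame_is_speech else self.quiet + 1
        return self.quiet >= self.needed

det = EndOfTurnDetector()
frames = [True] * 50 + [False] * 35  # 1 s of speech, then 700 ms of silence
print(any(det.feed(f) for f in frames))  # True: turn ends on the long pause
```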
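
And the sub-agent mesh in miniature: a router hands each utterance to a narrow specialist instead of one monolithic prompt. This sketch routes on keywords purely for illustration; in practice the router would be a small classifier or the LLM itself, and every agent name here is hypothetical.

```python
from typing import Callable

# Hypothetical specialists; each would carry its own prompt/tools in practice.
def objection_agent(text: str) -> str:
    return "Totally fair. Can I share how similar teams handled that?"

def scheduling_agent(text: str) -> str:
    return "Great, does Tuesday at 10am work for a quick demo?"

def qualification_agent(text: str) -> str:
    return "Got it. How big is your sales team today?"

ROUTES: list[tuple[tuple[str, ...], Callable[[str], str]]] = [
    (("expensive", "not interested", "already use"), objection_agent),
    (("calendar", "book", "schedule", "demo"), scheduling_agent),
]

def route(text: str) -> str:
    lowered = text.lower()
    for keywords, agent in ROUTES:
        if any(k in lowered for k in keywords):
            return agent(text)
    return qualification_agent(text)  # default: keep qualifying the lead

print(route("We're not interested, it's too expensive."))
```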