Talk, listen, see: the three levels of interactive video agents

Create AI videos with 240+ avatars in 160+ languages
For the last several years, the conversation about generative video has been dominated by quality. For us, that has meant delivering more realistic faces and more convincing voices to our customers. That work is important, and it will continue as weβre currently training our next-generation diffusion transformer model. But we believe the next chapter of applied research in generative video is about something different: interactivity.
Three signals reinforce this shift. The agentic ecosystem is exploding, with new agents being built for almost every workflow you can name, and software increasingly mediated through an agent layer rather than a traditional UI. On the research side, hybrid approaches that combine autoregressive and diffusion methods have become one of the most active areas of work in video generation. And a growing number of startups and research groups are trying to crack interactive video as a substrate for entirely new application classes, from open world experiences to live conversation. Taken together, these signals all point in the same direction: interactivity is becoming the defining dimension of next-generation video systems.
Today, I want to dive deeper into a new category of work we are calling Interactive Avatar Models which are video generation models focused on producing a talking video of an agent reacting to a human in the context of a conversation, at latencies low enough to support real dialogue (typically under one second).
These Interactive Avatar Models can achieve three levels of capability.
A Level 1 Interactive Avatar Model can talk. It is driven by its own audio and has no awareness of the person it is speaking to. This is where every Interactive Avatar Model system on the market sits today, including the early systems we have built.
A Level 2 Interactive Avatar Model can talk and listen. It is driven by its own audio and the user's audio, so it reacts visually with the kind of small signals that real listeners produce when engaged in a conversation (nodding, shifting expression), and it reacts vocally with acknowledgements and prosodic cues to signal engagement.
A Level 3 Interactive Avatar Model can talk, listen, and see. It does so by taking in the user's camera feed, so it can respond to a personβs posture, gesture, and facial signals on top of their audio. As a result, speaking to a Level 3 Interactive Avatar Model feels very similar to interacting with a real-life colleague or friend.Β
Our hypothesis is that Level 1 Interactive Avatar Models will fit a compelling class of applications where companies want to go beyond simple information sharing and build experiences that deliver opportunities for people to put what they've been taught into action. However, these avatars are structurally insufficient at scale, particularly when set against the bar that audio-only conversational models already meet. A talking avatar with no awareness of the person in front of it will often move in ways that surprise or frustrate. It looks alive without being responsive, which over a long enough conversation reads as uncanny rather than helpful.
In our view, the jump from Level 1 to Level 2 is the most important step because it turns an avatar from a face that talks into a conversational counterpart that listens. Our internal experiments show that in sales-oriented roleplay scenarios, users speak most of the time, and if the avatar fails to react naturally while listening, the illusion of a real conversation breaks within seconds.
Building Level 2 well requires modelling two-person conversations, including the subtleties of how humans take turns, signal attention, and respond to one another. It also requires us to treat audio and video as a single problem rather than two stacked problems. Subtle vocal reactions, brief interruptions, and prosodic acknowledgements all contribute to a sense of engagement, and we are convinced that predicting audio and video jointly is the path to a genuine "wow" moment, where the avatar feels alive rather than animated.
Level 3 is harder, and it is one of the most ambitious things we have ever taken on. We are designing it as a strict enhancement to Level 2 that activates when the user is ready for it. When a user turns their camera on, the interaction should become richer, with the avatar responding to visual cues the way people do on a video call rather than a phone call. When the camera stays off, the experience should fall back to a best-in-class Level 2 interaction.Β
This will be a year-long effort. Interactivity is becoming the shared mission of research at Synthesia, and it is the lens through which the rest of our work will increasingly be organised. The avatars that ship from this effort will be the first that genuinely listen to the people they speak with, allowing our customers to create interactive experiences that are useful, engaging and realistic.











.jpg)
.webp)
.webp)


