omnihuman-1-5
Multi-character audio-driven avatar video generation. Takes a portrait image + audio and generates a video where the person speaks/sings in sync. Supports specifying which character to drive.
There's a specific category of AI video that most people haven't thought carefully about yet: the audio-driven avatar. Not text-to-video, where you describe a scene and hope for the best. Not image-to-video, where you animate a still frame into generic motion. This is something more constrained and, paradoxically, more useful. You give it a photo of a person and an audio clip, and it produces a video of that person speaking, gesturing, and moving in sync with the audio.
ByteDance's OmniHuman is the most ambitious attempt at this problem I've seen. Two versions are now available on inference.sh - the original OmniHuman 1.0 and the newer 1.5 - and they represent meaningfully different philosophies about what "avatar generation" should mean. The original focused on getting lip sync and head motion right for a single subject. Version 1.5 wants to simulate cognition - to make the avatar appear to think before it moves.
I think this distinction matters more than the marketing suggests. Let me explain why.
the problem with talking heads
Most audio-driven avatar tools produce what I'd call "news anchor syndrome." The subject's lips move correctly, the head nods at appropriate intervals, maybe there's a slight eyebrow raise. But the body is essentially frozen. The hands don't gesture. The shoulders don't shift weight. The result looks like a person bolted to a chair from the neck down.
This happens because earlier approaches treated avatar generation as primarily a face problem. Map phonemes to mouth shapes, add some procedural head motion, call it a day. The rest of the body was either cropped out or left static. Technically successful, emotionally dead.
OmniHuman starts from the opposite premise: humans communicate with their entire body. The hand that rises to emphasize a point, the lean forward during an important statement, the subtle weight shift when someone is about to change topics. These aren't decorative - they carry meaning. A viewer processes them unconsciously, and their absence triggers uncanny valley responses even when the face is perfect.
how omnihuman 1.0 works
The original OmniHuman (available as bytedance/omnihuman-1-0), first presented in a research paper in February 2025, uses a Diffusion Transformer architecture trained on approximately 19,000 hours of human video data. The key insight in the research paper was what ByteDance calls "omni-conditions training" - rather than training separate models for audio-driven, video-driven, and pose-driven generation, they mixed all conditioning types into a single training phase. This means the model learned the relationships between speech audio, body motion, and visual appearance simultaneously.
In practice, you provide a portrait image and an audio file. The model analyzes the audio's rhythm, prosody, and energy, then generates a video where the subject moves naturally in response. Lip sync is the baseline requirement, but the model also produces head tilts, facial expressions, and upper body gestures that correlate with the speech content.
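If you want to see what that looks like in code, here is a minimal sketch using the inference.sh python client documented in the api reference later on this page. The image and audio field names mirror the files example there; treat the exact input schema as an assumption until you check the app's own spec.

    from inferencesh import inference

    client = inference()

    # sketch: drive a single subject with a speech clip. the "image" and
    # "audio" keys are assumed to match the files example in the api
    # reference below.
    result = client.run({
        "app": "bytedance/omnihuman-1-0",
        "input": {
            "image": "/path/to/portrait.jpg",  # one person, clearly lit
            "audio": "/path/to/speech.wav",    # clean recording, minimal noise
        }
    })

    print(result["output"])  # reference to the generated video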
The training strategy was clever. Previous end-to-end approaches struggled because high-quality paired data (audio plus full-body video of the same person) is scarce. By training with mixed conditioning - sometimes audio only, sometimes video reference, sometimes both - the model could learn from a much larger and more diverse dataset. The motion understanding transfers between modalities.
The limitation is straightforward: OmniHuman 1.0 handles one person at a time. You provide an image with a single subject, and the model drives that subject. The scene is relatively static - backgrounds don't change, camera angles are fixed, and the subject stays roughly in place. For a talking-head video, a product demo, or a voiceover with a human face, this works well. For anything resembling a real scene with spatial complexity, you hit the ceiling quickly.
the leap to omnihuman 1.5
Version 1.5 (bytedance/omnihuman-1-5) isn't just an incremental improvement. ByteDance rebuilt the cognitive architecture around an idea borrowed from psychology: Daniel Kahneman's System 1 and System 2 thinking. System 1 is fast, intuitive, reactive. System 2 is slow, deliberate, planning-oriented.
In OmniHuman 1.5, a Multimodal Large Language Model handles the "thinking" layer (System 2 - slow, deliberate planning) - interpreting the semantic content of the audio, planning appropriate gestures and emotional responses, understanding context. A Diffusion Transformer then handles the "doing" layer (System 1 - fast, reactive execution) - rendering those planned movements into smooth, physically plausible video frames. The MLLM plans the performance; the DiT executes it. This dual-system framework enables generation of videos over one minute long with highly dynamic motion and continuous camera movement.
This sounds like marketing abstraction, but the practical difference is observable. When OmniHuman 1.0 generates a speaking avatar, the gestures tend to correlate with audio energy - louder speech gets bigger movements. When 1.5 generates the same audio, the gestures correlate with meaning. A phrase about something small might produce a pinching gesture. A reference to expansion might trigger arms opening outward. The model isn't just reacting to volume and rhythm; it appears to understand what's being said.
I want to be careful here. "Appears to understand" is doing heavy lifting. The model has learned statistical correlations between speech content and human gesture from its training data. Whether that constitutes understanding is a philosophical question I'll leave aside. What matters practically is that the output looks more intentional, more like a real person who chose to move that way rather than a puppet driven by audio waveforms.
multi-character scenes
The headline feature of 1.5 is multi-character support. You can provide an image containing multiple people and specify which one to drive using a mask index parameter. This opens up conversational scenes, interview formats, panel discussions - anything where more than one person appears in frame.
The implementation is practical rather than magical. The model detects subjects in the input image, and you tell it which one to animate. The other figures remain static or minimally animated. It's not generating independent performances for multiple characters simultaneously from a single audio track. You'd need to run separate generations with different audio inputs to create a true multi-person conversation.
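To make that concrete, here is a hedged sketch of how two separate generations could be cut into a two-person exchange. The "mask_index" field name is an assumption based on the mask index parameter described in the schema section below; each run drives only one subject.

    from inferencesh import inference

    client = inference()

    # same two-person reference image for both runs; only the driven subject
    # and the audio change. "mask_index" is an assumed name for the mask
    # index parameter described in the schema section.
    scene = "/path/to/two_people.jpg"

    host_clip = client.run({
        "app": "bytedance/omnihuman-1-5",
        "input": {"image": scene, "audio": "/path/to/host_line.wav", "mask_index": 0},
    })

    guest_clip = client.run({
        "app": "bytedance/omnihuman-1-5",
        "input": {"image": scene, "audio": "/path/to/guest_line.wav", "mask_index": 1},
    })

    # the two clips would then be cut together in an editor or with ffmpeg
    print(host_clip["output"], guest_clip["output"])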
Still, this is a meaningful step. Being able to place your speaking avatar within a scene that includes other people changes the use case from "floating head on a background" to something approaching a real production setup. Think about it: a customer testimonial video where the interviewer is visible, a training video where a presenter stands among colleagues, a demo where the speaker references someone standing nearby.
what it gets wrong
I should be honest about the failure modes because they're instructive.
First, extreme poses and unusual body positions in the reference image confuse the model. If your input photo shows someone mid-jump or in an unusual posture, the generated motion often drifts into unrealistic territory. The model wants to start from a neutral or near-neutral pose. This isn't unique to OmniHuman - it's a limitation of the training distribution.
Second, longer generations (approaching the 30-second maximum) can accumulate drift. The subject might gradually shift position, clothing textures can subtly change, or the motion quality degrades toward the end. For longer content, you're better off generating multiple shorter clips and stitching them.
Third, the "semantic understanding" in 1.5 is language-dependent. English content produces the most natural gesture mapping. Other languages still get good lip sync and rhythmic motion, but the gesture-to-meaning alignment is weaker. This will presumably improve as training data expands, but it's worth noting if you're working in non-English contexts.
Fourth - and this is the most fundamental limitation - the model generates from a single static image. It doesn't have a 3D understanding of the subject. If the person turns too far from their original pose, the model hallucinates details about parts of them it never saw. Hair from behind, the side of a face not shown in the reference photo. These hallucinations are sometimes convincing and sometimes not.
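Of these, the drift problem in longer generations is the easiest to engineer around. Assuming you have ffmpeg installed, you can generate several short clips from consecutive audio segments and concatenate them afterwards. The sketch below uses ffmpeg's concat demuxer with stream copy, which works cleanly when all clips share the same codec, resolution, and frame rate - which they do when they come from the same model and reference image.

    import subprocess
    import tempfile

    def stitch_clips(clip_paths, output_path):
        """Concatenate short generated clips with ffmpeg's concat demuxer.

        Assumes all clips share the same codec, resolution, and frame rate.
        """
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
            for path in clip_paths:
                f.write(f"file '{path}'\n")
            list_file = f.name
        subprocess.run(
            ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
             "-i", list_file, "-c", "copy", output_path],
            check=True,
        )

    stitch_clips(["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"], "full_take.mp4")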
the competitive landscape
OmniHuman isn't operating in a vacuum. Pruna's P-Video Avatar tackles similar territory with a different approach. The landscape also includes Hedra, Synthesia, D-ID, and HeyGen for commercial avatar generation.
What distinguishes OmniHuman is the full-body emphasis and the research depth. Most competitors focus exclusively on shoulders-up generation. They produce polished results within that frame but don't attempt the harder problem of coordinated body motion. OmniHuman accepts the harder problem and its associated failure modes.
OmniHuman is competitively priced relative to alternatives like Pruna while delivering full-body results. For high-volume production use cases - generating hundreds of short avatar clips for personalized content - the cost adds up, but it's within range of what businesses budget for video production.
where this actually matters
I keep coming back to the same use cases when I think about who benefits most from this technology. Corporate training and internal communications, where you need a consistent presenter across dozens of videos. Product explanations and demos where a human face increases engagement but scheduling a shoot for every update is impractical. Localization, where the same visual performance needs to exist in multiple languages with different audio tracks.
The multi-character capability in 1.5 opens up scenarios that single-character tools simply cannot address. Simulated conversations for training materials. Interview-format content where both parties are visible. Conference or panel presentations where spatial relationships between speakers matter.
None of this replaces professional video production for high-stakes content. But it fills the enormous gap between "no video at all" and "full production crew." Most organizations have hundreds of situations where video would be better than text but nobody has the budget or timeline for traditional production. That gap is where OmniHuman lives.
practical considerations
Both versions accept the same core inputs: a portrait image and an audio file. The image should show the subject clearly, ideally from a neutral or slightly angled position, with the face visible and well-lit. Audio can be speech, singing, or any vocalization - the model adapts its motion generation to match the audio type.
For OmniHuman 1.0, the image should contain exactly one person. For 1.5, multiple people can appear, and you select which one to drive. The audio duration determines the output video length, capped at the model's maximum generation window.
Quality depends heavily on input quality. A sharp, well-composed portrait with good lighting produces dramatically better results than a grainy phone photo. Similarly, clean audio with minimal background noise yields better lip sync than audio recorded in a noisy environment. This isn't surprising, but it's worth emphasizing: garbage in, garbage out still applies when the middle part is a diffusion transformer.
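A small preflight check can catch most of these problems before you spend credits. The limits below come from the schema section of this page (5mb, 4096x4096 pixels, roughly 15 seconds of audio); the helper itself is only an illustrative sketch using Pillow and the standard library.

    import os
    import wave
    from PIL import Image  # pip install pillow

    MAX_IMAGE_BYTES = 5 * 1024 * 1024   # 5mb limit from the schema
    MAX_IMAGE_SIDE = 4096               # 4096x4096 limit from the schema
    MAX_AUDIO_SECONDS = 15              # recommended ceiling for best quality

    def preflight(image_path, audio_path):
        problems = []
        if os.path.getsize(image_path) > MAX_IMAGE_BYTES:
            problems.append("image is larger than 5mb")
        with Image.open(image_path) as img:
            if max(img.size) > MAX_IMAGE_SIDE:
                problems.append("image exceeds 4096x4096 pixels")
        # duration check only covers .wav files; other formats need another library
        with wave.open(audio_path, "rb") as wav:
            seconds = wav.getnframes() / wav.getframerate()
            if seconds > MAX_AUDIO_SECONDS:
                problems.append(f"audio is {seconds:.1f}s, above the recommended 15s")
        return problems

    print(preflight("portrait.jpg", "voiceover.wav") or "inputs look fine")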
what comes next
The trajectory from 1.0 to 1.5 tells you where this is heading. The original model solved the mechanics - making lips move correctly, generating plausible body motion. Version 1.5 adds intent - making the avatar appear to choose its movements based on meaning rather than just matching audio energy.
The logical next step is interaction. Avatars that respond to each other in real-time, that react to viewer inputs, that adapt their delivery based on context. ByteDance's research trajectory - the cognitive architecture, the multi-character support - points directly toward interactive avatar experiences rather than pre-rendered clips.
For now, though, OmniHuman 1.0 and 1.5 represent the state of the art in audio-driven avatar generation that actually considers the whole human body. They're imperfect, they have clear boundaries, and they reward thoughtful input preparation. Within those boundaries, they produce results that were genuinely impossible two years ago.
frequently asked questions
what's the main difference between omnihuman 1.0 and 1.5?
OmniHuman 1.0 generates single-character avatar videos by correlating body motion with audio energy and rhythm. Version 1.5 adds a cognitive layer that interprets the semantic meaning of speech to produce more contextually appropriate gestures, and introduces multi-character support so you can animate one person within a scene containing multiple subjects. The 1.5 architecture also handles longer durations and more dynamic camera-aware motion.
how does omnihuman compare to other avatar generation tools?
Most competing tools - Synthesia, HeyGen, D-ID - focus on shoulders-up generation and produce polished results within that limited frame. OmniHuman distinguishes itself by attempting full-body coordination, generating gestures and weight shifts that extend well below the neckline. The tradeoff is that full-body generation introduces more potential failure modes, particularly with unusual poses or very long clips. For talking-head use cases, specialized tools may produce more consistent results. For anything requiring visible body language, OmniHuman currently leads.
what input quality do I need for good results?
The reference image matters enormously. Use a well-lit portrait with the subject's face clearly visible, shot from a neutral or slightly angled perspective. Avoid extreme poses, heavy shadows across the face, or images where the subject is partially occluded. For audio, clean recordings with minimal background noise produce the best lip sync. The model handles various audio types - speech, singing, narration - but struggles more with heavily compressed or noisy audio where phoneme boundaries are unclear.
api reference
about
multi-character audio-driven avatar video generation. takes a portrait image + audio and generates a video where the person speaks/sings in sync. supports specifying which character to drive.
1. calling the api
install the client
the client provides a convenient way to interact with the api.
    pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
    export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
    from inferencesh import inference

    client = inference()

    result = client.run({
        "app": "bytedance/omnihuman-1-5",
        "input": {}
    })

    print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
    from inferencesh import inference

    client = inference()

    # stream=True yields updates as they arrive
    for update in client.run({
        "app": "bytedance/omnihuman-1-5",
        "input": {}
    }, stream=True):
        if update.get("progress"):
            print(f"progress: {update['progress']}%")
        if update.get("output"):
            print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
    # local file paths are automatically uploaded
    result = client.run({
        "app": "bytedance/omnihuman-1-5",
        "input": {
            "image": "/path/to/local/image.png",       # detected & uploaded
            "audio": "https://example.com/audio.mp3",  # url passed through
        }
    })

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
    result = client.run({
        "app": "bytedance/omnihuman-1-5",
        "input": {},
        "webhook": "https://your-server.com/webhook"
    }, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
1{2 "id": "task_abc123",3 "status": 9,4 "output": { ... },5 "error": "",6 "session_id": null,7 "created_at": "2024-01-15T10:30:00Z",8 "updated_at": "2024-01-15T10:30:05Z"9}5. schema
5. schema

input
image - portrait image containing one or more people. jpg format recommended. max 5mb, max 4096x4096 pixels.
audio - audio file to drive the avatar. duration should be under 15 seconds for best quality.
mask index - which detected subject to drive (0 = largest face/body, 1 = second largest, etc.). set to 0 for single-person images.