
fabric-1-0

Creates videos where an image appears to talk using advanced lip-sync technology.

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get falai/fabric-1-0
# run
$ belt app run falai/fabric-1-0

Video generation gets most of the attention. Text-to-video, image-to-video, the big foundation models competing on motion quality and temporal coherence. But there's a quieter category of tools that matters just as much for anyone producing real content: the post-production utilities. The effects processors. The specialized models that take existing footage or images and transform them into something specific rather than generating from whole cloth.

I think these tools get overlooked because they lack the spectacle of general-purpose video generation. Nobody posts a fabric drape animation to social media with breathless commentary about the future of AI. But in actual production pipelines, a lipsync model that reliably maps audio to facial motion is worth more than the fanciest text-to-video model in the world. Specificity beats generality when you have a concrete job to do.

Four tools on inference.sh occupy this space in complementary ways: VEED's Fabric 1.0, PixVerse Lipsync, Wan 2.5 text-to-video, and Wan 2.5 image-to-video. They range from specialized talking-video generation to general-purpose video creation, and together they cover a set of production needs that the headline models don't address well.

fabric 1.0 - veed's talking video model

Fabric 1.0, built by VEED, is the most specialized tool in this group. Despite the name suggesting textiles, it's actually an image-to-talking-video model. You provide a static image of a person (or illustration, 3D render, mascot - it adapts to the input style) and an audio file, and it produces a video where the subject speaks with synchronized lip movement, head motion, facial micro-expressions, and subtle body gestures driven by the audio. The underlying architecture is a Diffusion Transformer trained on diverse datasets of talking people.

The use cases are clear: localized spokesperson videos where you record one script in multiple languages and pair each audio file with the same brand image, generating a complete set of localized talking-head videos in one batch. Product explainers, training content, social media posts with a human presenter - anywhere you need someone talking to camera but don't have the budget or logistics for a video shoot.
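To make the batch concrete, here's a minimal sketch using the inferencesh client documented in the api reference below. The input fields (image, audio, resolution) come from the app's schema; the file paths and language list are hypothetical placeholders.

python
from inferencesh import inference

client = inference()

# one recorded voice track per target language (hypothetical file names)
audio_tracks = {
    "en": "voiceover_en.mp3",
    "es": "voiceover_es.mp3",
    "de": "voiceover_de.mp3",
}

# same brand image for every language; only the audio changes
for lang, audio_path in audio_tracks.items():
    result = client.run({
        "app": "falai/fabric-1-0",
        "input": {
            "image": "spokesperson.png",
            "audio": audio_path,
            "resolution": "720p",
        }
    })
    print(lang, result["output"])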

The model works at 480p and 720p and supports clips up to five minutes long - enough for most ad, tutorial, and explainer formats. You provide an image and an audio file, and the system produces a video where the audio drives not only lip movement but also head, body, and hand gestures.

The results depend heavily on the input image quality. Well-lit portraits with clear face visibility and neutral poses produce near-seamless talking videos. The model analyzes audio waveforms to extract phoneme timing and intensity, then maps these features to facial keypoints for realistic mouth movement synthesis. It handles various character types - photorealistic portraits, illustrated characters, stylized artwork - adapting its animation approach to the input style.

Fabric 1.0 occupies a similar space to OmniHuman but with a different emphasis. Where OmniHuman aims for full-body coordination with semantic gesture understanding, Fabric focuses on being a reliable, production-ready talking-head generator that accepts a wide range of input image styles.

pixverse lipsync: the practical workhorse

If Fabric 1.0 is the talking-video specialist in this group, PixVerse Lipsync, released in July 2025, is the serious post-production utility. It takes an existing video of a person and an audio track, then generates a new version where the subject's lips move in precise synchronization with the audio. The face is reanimated. The rest of the frame stays intact. The result is a video where someone appears to say something they never actually said.

That description alone should make the implications clear - both the productive and the problematic ones. I'll focus on the productive side because the technology exists regardless of anyone's comfort level with it.

The most straightforward application is localization. You have a video of your CEO delivering a keynote in English. You need versions in Spanish, German, Japanese, and Portuguese. Previously, your options were subtitles (low engagement), dubbing with mismatched lip movement (uncanny and distracting), or reshooting with the CEO attempting each language (impractical and usually terrible). Lipsync models offer a fourth path: keep the original video, replace the audio with a professional translation and voice performance, then reanimate the lips to match. The speaker appears to be fluently delivering the message in each language.

Content creators face a version of this problem constantly. You record a video, realize the audio has issues - background noise, a mispronounced word, an awkward pause - and normally your options are rerecording or living with it. With PixVerse Lipsync, you can record clean audio separately and map it onto the original video. The visual performance stays the same. Only the mouth movements change to match the corrected audio.

One design choice worth noting: beyond accepting prerecorded audio, PixVerse also includes optional text-to-speech. If you don't have audio ready, you can provide text and a voice selection, and the system generates both the speech audio and the synchronized lip animation in one pass. For quick iterations where you're testing how a script sounds before committing to professional voice recording, that integrated TTS path saves a meaningful amount of workflow friction.
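For illustration, here's what the two input paths might look like as API calls. The app slug and field names below are assumptions rather than a documented schema, so treat this as a sketch and check `belt app get` for the real contract.

python
from inferencesh import inference

client = inference()

# path 1: you already have corrected audio
with_audio = client.run({
    "app": "pixverse/lipsync",  # hypothetical slug
    "input": {
        "video": "original_take.mp4",
        "audio": "clean_voiceover.mp3",
    }
})

# path 2: no audio yet - provide text and let the built-in tts
# generate both the speech and the lip animation in one pass
with_tts = client.run({
    "app": "pixverse/lipsync",  # hypothetical slug
    "input": {
        "video": "original_take.mp4",
        "text": "Here's the revised line we're testing.",  # hypothetical field
        "voice": "default",  # hypothetical voice selection
    }
})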

The quality of lipsync generation depends heavily on the input video. A front-facing talking head with clear lighting and minimal head movement produces near-flawless results. As conditions degrade - side angles, harsh shadows across the mouth, rapid head turning, hands near the face - the model struggles to maintain convincing lip animation. This isn't a limitation specific to PixVerse; it's inherent to the problem. Lips are small, fast-moving features that interact with surrounding facial muscles in complex ways. Occlude them or change the viewing angle, and the reconstruction problem becomes substantially harder.

One thing I appreciate about PixVerse's approach is the separation of concerns. It takes a video in and produces a video out. It's not trying to generate the person from scratch, hallucinate a background, or create motion from nothing. It solves one problem - making lips match audio - and solves it with reasonable reliability. In a landscape where every model wants to be a general-purpose creative engine, that focus is refreshing.

wan 2.5: the versatile foundation

Wan 2.5 sits in a different category from Fabric and PixVerse. It's a general-purpose video generation model from Alibaba, available in both text-to-video and image-to-video variants. Calling it a "post-production tool" is a stretch - it's really a production tool, generating video content from prompts and images. But it complements the other tools in this group because it produces the raw material that lipsync and effects tools then transform.

The text-to-video variant accepts a prompt describing your desired scene and generates video at resolutions from 480p to 1080p, with pricing scaling by resolution tier. It supports aspect ratio selection, duration control, negative prompts to steer away from unwanted content, and optional prompt expansion where the model rewrites your prompt to improve generation quality. That last feature is a double-edged capability. Sometimes the expanded prompt produces better results because the model knows what descriptions it responds to best. Other times, the expansion introduces creative choices you didn't ask for - adding cinematic camera movement to a scene you wanted static, or elaborating on details you intentionally left unspecified.

Wan 2.5's image-to-video mode is where it becomes genuinely useful as a pipeline component. You provide an image as the first frame and a prompt describing how it should animate, and the model generates video that begins from your reference image and progresses according to your instructions. The fidelity to the source image is generally strong for the first few seconds, with drift increasing as duration extends. Colors stay accurate. Composition holds. The primary subject maintains its appearance. Background elements are more likely to shift or evolve in unintended ways, particularly if your prompt is vague about what the environment should do.
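Here's a hypothetical sketch of an image-to-video call, with the controls described above mapped to input fields. The slug and exact field names are assumptions; only the capabilities themselves (resolution tiers, duration control, negative prompts, prompt expansion) come from the text.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "wan/wan-2-5-image-to-video",  # hypothetical slug
    "input": {
        "image": "brand_portrait.png",  # used as the first frame
        "prompt": "subtle head movement, steady camera, soft studio light",
        "negative_prompt": "camera pan, zoom, scene change",
        "resolution": "720p",
        "duration": 5,  # seconds
        "prompt_expansion": False,  # keep the prompt exactly as written
    }
})
print(result["output"])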

I should note that Wan 2.5 is not the latest in the Wan family. Wan 2.7 exists and improves on it across several dimensions - longer maximum durations, additional modes including reference-to-video, and general quality improvements. So why cover 2.5 at all? Because it's cheaper for many workflows, because it's proven and stable, and because not every project needs the latest model. If you're generating background plates, creating source material for effects processing, or producing quick visual concepts that will be heavily post-processed anyway, Wan 2.5 at its price points is perfectly adequate and leaves budget for the specialized tools that follow in your pipeline.

The interaction between Wan 2.5 and the other tools in this group is where things get practical. Generate a talking-head video with Fabric 1.0 using a portrait image and your audio recording - you get lip-synced speech with natural gestures in one step. Or take a different approach: use Wan 2.5 image-to-video to create a scene with ambient motion, then run the output through PixVerse Lipsync to add synchronized speech. Either way, you've produced a synthetic spokesperson video without filming anything, assembled entirely from generative tools.
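Sketched end to end, the second path is just two calls where the first output feeds the second input. As before, the slugs and the non-Fabric field names are assumptions, and the sketch assumes each step returns a video URL the next step accepts.

python
from inferencesh import inference

client = inference()

# step 1: ambient motion from a still portrait (hypothetical slug)
scene = client.run({
    "app": "wan/wan-2-5-image-to-video",
    "input": {"image": "portrait.png", "prompt": "gentle ambient motion, static camera"}
})

# step 2: reanimate the lips against the recorded voiceover (hypothetical slug)
spoken = client.run({
    "app": "pixverse/lipsync",
    "input": {"video": scene["output"], "audio": "voiceover.mp3"}
})
print(spoken["output"])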

building a post-production pipeline

These four tools aren't individually revolutionary. Fabric 1.0 is a focused talking-video generator that competes with OmniHuman and similar avatar tools. PixVerse Lipsync solves a real problem but isn't unique in the market. Wan 2.5 is a solid but not exceptional video generation model. Their value increases when you think of them as components rather than standalone products.

The content creation workflow that these tools enable is something like this. Start with static assets - photographs, illustrations, brand imagery. Use Wan 2.5 image-to-video to bring them to life with controlled motion. Apply PixVerse Lipsync to add synchronized speech to any human subjects in existing footage. Or use Fabric 1.0 to generate a complete talking-head video from a single portrait image and audio file. Each tool handles one transformation step, and the combined output is something that would have required a motion graphics team and significant budget to produce manually.

The total cost for this kind of pipeline is modest. A multi-layer production that draws from three specialized models stays affordable enough to scale to a dozen variations for A/B testing or localization without significant budget impact.

The tradeoffs are real, though. None of these tools produce broadcast-quality output. Resolution caps at 720p or 1080p depending on the model. Motion quality, while impressive for generative tools, doesn't approach what a camera captures. Lipsync accuracy degrades outside of ideal conditions. And there's an uncanny quality to any pipeline that chains multiple generative steps - errors compound, and the combined output can feel slightly off even when each individual step looks fine in isolation.

For social media content, internal communications, rapid prototyping, and situations where "good enough quickly" beats "perfect eventually," these tradeoffs are acceptable. For brand advertising on major channels, broadcast production, or anything where visual fidelity is a competitive requirement, you're still better served by traditional production methods augmented by AI rather than fully replaced by it.

the lipsync question

Lipsync technology sits at an interesting ethical boundary. The ability to make any person appear to say anything is powerful and uncomfortable in equal measure. PixVerse Lipsync, and tools like it, don't distinguish between legitimate localization work and malicious deepfake creation.

Anyone integrating lipsync into a production workflow should think about provenance. Labeling AI-modified video isn't just ethical good practice - it's increasingly a legal requirement in many jurisdictions. Using lipsync to help your company communicate in languages your team doesn't speak is straightforwardly positive. Using it without disclosure in contexts where viewers assume they're seeing unaltered footage is dishonest.

Current lipsync tools handle front-facing, well-lit talking heads reliably. Profile views, dramatic lighting, rapid head motion, and facial occlusion remain challenging. Within two years, I'd expect these limitations to shrink significantly. Lipsync will become indistinguishable from real footage for practical purposes. That makes the ethical question more urgent, not less.

choosing the right tool for the job

Fabric 1.0 is for generating talking videos from still images and audio - turning a portrait and a voice recording into a complete talking-head video with lip sync, gestures, and expressions. PixVerse Lipsync is for changing or adding what a person appears to say in existing video - localization, audio correction, synthetic spokesperson creation. Wan 2.5 text-to-video generates visual content from descriptions when you don't have existing footage. Wan 2.5 image-to-video animates still images into video clips with controlled motion, and it's the most versatile entry point for multi-tool workflows.

None of these tools are the best in their respective categories if your only metric is raw quality. But they're affordable enough for experimentation, they compose well together, and they solve real production problems without requiring you to understand diffusion architectures. For teams who need video content faster than traditional production allows, this collection of specialized tools is worth having in the toolkit.

frequently asked questions

how does pixverse lipsync handle non-english audio?

PixVerse Lipsync operates on audio waveforms rather than linguistic understanding, which means it maps mouth shapes to sounds rather than words. This makes it largely language-agnostic in principle. English, Spanish, Mandarin, Arabic - the model processes the phonetic content of the audio regardless of language. In practice, performance is strongest with clearly enunciated speech in any language and weakens when audio includes heavy accents, overlapping speakers, or significant background noise. The built-in TTS option supports multiple voice selections, though the voice catalog is more limited for non-English languages.

can these tools be chained together in automated workflows?

Yes, and that's where much of their value lies. Each tool accepts standard inputs (images, video files, audio files) and produces standard outputs. A typical chain might use Wan 2.5 image-to-video to animate a still portrait, then feed that video into PixVerse Lipsync with a separate audio track. Because the tools run as API calls through inference.sh, integrating them into scripted pipelines or automated content systems is straightforward. The main constraint is that each step adds latency and cost, so design your pipeline with the minimum number of transformations needed for your specific output.

what resolution and duration limits should I plan around?

Fabric 1.0 tops out at 720p but supports clips up to five minutes long, driven by audio duration. Wan 2.5 supports up to 1080p in both text-to-video and image-to-video modes with clips up to 10 seconds. PixVerse Lipsync accepts video input at its native resolution and processes accordingly. Duration is determined by your audio input for Fabric 1.0 and PixVerse Lipsync, while Wan 2.5 offers configurable duration settings. For pipeline work, targeting 720p is a safe common denominator that all tools handle well and keeps per-second costs reasonable. If your final output needs 1080p, generate your Wan 2.5 base material at that resolution and accept that Fabric 1.0 output will be limited to 720p.

api reference

about

creates videos where an image appears to talk using advanced lip-sync technology.

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "falai/fabric-1-0",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "falai/fabric-1-0",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "falai/fabric-1-0",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "falai/fabric-1-0",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
id          string  task id
status      number  terminal status (9=completed, 10=failed, 11=cancelled)
output      object  task output (when completed)
error       string  error message (when failed)
session_id  string  session id (if using sessions)
created_at  string  iso timestamp
updated_at  string  iso timestamp
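On the receiving side, a minimal handler only needs to accept the JSON POST and branch on the terminal status codes above. Here's a sketch using Flask, which is an assumption; any framework that can accept a JSON POST works the same way.

python
from flask import Flask, request

app = Flask(__name__)

# terminal status codes from the payload reference above
STATUS = {9: "completed", 10: "failed", 11: "cancelled"}

@app.route("/webhook", methods=["POST"])
def webhook():
    task = request.get_json()
    print(f"task {task['id']}: {STATUS.get(task['status'], 'unknown')}")
    if task["status"] == 9:
        print("output:", task["output"])
    elif task["status"] == 10:
        print("error:", task["error"])
    return "", 204  # acknowledge receipt

if __name__ == "__main__":
    app.run(port=8000)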

5. schema

input

image: string (file), required

image to turn into a talking video. supported formats: jpeg, png, webp

audio: string (file), required

audio file for the talking video. supported formats: wav, mp3

resolution: string

video resolution. higher resolutions provide better quality but take longer to generate.

default: "480p"
options: "480p", "720p"

output

video: string (file), required

generated talking video with lip-sync animation

output_meta: object

structured metadata about inputs/outputs for pricing calculation
