happyhorse-1-0-t2v
HappyHorse 1.0 Text-to-Video generates physically realistic videos with smooth motion from text prompts via the DashScope API, supporting 720P and 1080P resolution and durations of up to 15 seconds.
HappyHorse 1.0 is not one model but four, and the distinction matters. Built by the Future Life Lab inside Alibaba's Taotian Group - a team led by Zhang Di, formerly VP of Kuaishou and head of Kling's technology - HappyHorse is a 15-billion-parameter video generation model that topped the Artificial Analysis Video Arena leaderboard in April 2026, beating Seedance 2.0 in the text-to-video (without audio) category by nearly 115 Elo points. Text-to-video, image-to-video, reference-to-video, and video editing ship as separate apps on inference.sh, each tuned for a different stage of the creative pipeline. Rather than cramming every capability into a single endpoint with fifty parameters, the team split the family along workflow boundaries. The result is a set of tools that compose well together while remaining individually simple.
I've been running all four variants extensively, and the honest take is this: HappyHorse handles physics-driven motion better than most competitors. Under the hood, it uses a unified 40-layer single-stream transformer architecture that jointly generates video and synchronized audio in a single forward pass - no cross-attention modules, just text, video, and audio tokens processed in one unified sequence. Objects fall with weight. Water behaves like water. Fabric drapes and folds rather than sliding around like a texture projection. Where it falls short - and it does - is in the finer details of human faces at extreme close-ups and in scenes requiring precise text rendering. These are known quantities in the field, not unique failures.
The model achieves clear video outputs in just 8 denoising steps, with a claimed inference speed of roughly 38 seconds for a 1080p clip on a single H100 GPU. The family competes most directly with Wan 2.7, which costs slightly less per second but takes a different approach to temporal coherence. Wan tends to produce smoother transitions at the cost of slightly floatier motion. HappyHorse commits harder to physical realism, which means when it works, it looks more grounded - but when it misses, artifacts can be more jarring because they violate the physics expectations the model itself established. Pick your poison.
starting from text
The text-to-video variant (alibaba/happyhorse-1-0-t2v) is the most straightforward entry point. You provide a prompt, optionally set duration (3-15 seconds), aspect ratio, and resolution, and the model generates a complete clip. Resolution options are 720P and 1080P. Duration defaults to 5 seconds if you don't specify.
Prompt engineering matters significantly here. HappyHorse responds well to cinematic language - describing camera movements, lighting conditions, and motion dynamics explicitly. A prompt like "a person walking" gives you something generic. A prompt specifying "slow tracking shot following a woman walking through morning fog on a cobblestone street, warm side-lighting from shop windows, shallow depth of field" gives you something worth using.
Some prompt tips that I've found consistently improve results: specify the camera movement type (dolly, orbit, push-in, crane up). Mention the time of day and lighting quality rather than just "good lighting." Describe motion dynamics explicitly - "slow-motion water droplets splashing off a leaf" tells the model what temporal behavior you want. If you want cinematic quality, say so. The model does respond to style direction, and leaving it unspecified tends to produce a somewhat flat default aesthetic.
Avoid cramming too many subjects or actions into a single prompt. HappyHorse handles one or two subjects well but starts losing coherence with complex multi-character scenes. If your shot needs five people doing different things, you're going to be disappointed. Simplify the composition, shoot multiple clips, cut in post.
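To make this concrete, here is a minimal sketch of a text-to-video call with the inferencesh Python client documented in the API reference below. The input field names (prompt, resolution, aspect_ratio, duration) are illustrative guesses based on the schema descriptions, not confirmed keys - check the schema before relying on them.

from inferencesh import inference

client = inference()

# input keys below are assumptions - verify against the app's schema
result = client.run({
    "app": "alibaba/happyhorse-1-0-t2v",
    "input": {
        "prompt": (
            "slow tracking shot following a woman walking through morning fog "
            "on a cobblestone street, warm side-lighting from shop windows, "
            "shallow depth of field"
        ),
        "resolution": "720P",   # iterate at 720P, render finals at 1080P
        "aspect_ratio": "16:9",
        "duration": 5,          # seconds, 3-15 supported
    }
})

print(result["output"])  # URL of the generated clip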
anchoring to a reference frame
Image-to-video (alibaba/happyhorse-1-0-i2v) takes a single image as the first frame and animates outward from it. This is invaluable when you need visual continuity with existing assets. You've already got the product shot, the concept art, the screenshot - now you want it to move.
The input image determines composition, color palette, and subject appearance. Your text prompt then describes what happens: the camera pulls back, the subject turns, wind picks up, light shifts. Think of the image as setting the stage and the prompt as calling action.
One thing to know: the model respects the input image's aspect ratio. If you feed it a 4:3 photograph, you'll get a 4:3 video. You don't specify a separate ratio parameter because the first frame already defines it. Resolution still toggles between 720P and 1080P, and duration remains configurable from 3 to 15 seconds.
The image quality of your input frame matters more than you might expect. Compression artifacts, noise, or unusual color spaces in the source image propagate into the video. Start with a clean, well-exposed image and you'll get cleaner motion. Feed it a heavily compressed JPEG and the model spends capacity trying to make sense of block artifacts rather than generating convincing motion.
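The same calling pattern covers image-to-video. The sketch below assumes illustrative input keys (image, prompt) and leans on the SDK's automatic upload of local file paths; no aspect ratio is passed because the first frame already defines it.

from inferencesh import inference

client = inference()

# "image" and "prompt" are assumed key names - confirm in the i2v schema
result = client.run({
    "app": "alibaba/happyhorse-1-0-i2v",
    "input": {
        "image": "/assets/product_shot.png",  # local path, uploaded automatically
        "prompt": "camera slowly pulls back as warm window light drifts across the scene",
        "resolution": "1080P",
        "duration": 8,
    }
})

print(result["output"])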
character and subject consistency with reference-to-video
This is where HappyHorse differentiates itself most clearly. The reference-to-video variant (alibaba/happyhorse-1-0-r2v) accepts up to nine reference images of a character or subject, then generates video featuring that subject in whatever scene you describe via prompt. It's not image-to-video with extra steps. The reference images don't define the first frame. They define what the subject looks like, and the model generates a new scene preserving that identity.
For anyone building content that requires visual consistency across clips - a recurring character in a series, a mascot in different scenarios, a product in various environments - this is the capability that matters. You shoot or generate reference images once, then produce as many variations as you need.
The practical limitation is that identity preservation degrades with highly unusual subjects or when you push the model into situations very different from the reference angles you provided. If all your references show a character from the front, asking for a dramatic overhead shot may produce something that drifts from the established appearance. Provide diverse angles in your reference set when possible.
Aspect ratio and duration controls work the same as text-to-video. Pricing is identical across all four variants.
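A reference-to-video call follows the same shape; the key names below (reference_images, prompt) are placeholders for whatever the r2v schema actually calls them.

from inferencesh import inference

client = inference()

# up to nine references of the same subject; diverse angles help identity hold up
result = client.run({
    "app": "alibaba/happyhorse-1-0-r2v",
    "input": {
        "reference_images": [
            "/refs/mascot_front.png",
            "/refs/mascot_side.png",
            "/refs/mascot_back.png",
        ],
        "prompt": "the mascot rides a bicycle through a rain-soaked neon street at night, cinematic",
        "duration": 6,
    }
})

print(result["output"])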
editing existing video with natural language
The video edit variant (alibaba/happyhorse-1-0-video-edit) takes a different approach entirely. Instead of generating from scratch, you feed it an existing video and describe what you want changed. "Replace the blue car with a red sports car." "Change the background to a snowy landscape." "Make the lighting warmer and add lens flare." The model applies the edit while preserving the original motion dynamics.
This is genuinely useful and genuinely imperfect. Simple edits - color grading changes, environment swaps, style transfers - work reliably. Complex structural edits - adding subjects that weren't there, significantly altering motion paths - are hit or miss. The model tries to maintain temporal coherence with the original clip, which means it's conservative about changes that would require inventing new motion.
You can also provide up to five reference images to guide the edit. Want to replace a character in the video with a specific person? Provide reference images of that person. Want to change the environment to match a particular location? Provide reference images of that location. The model uses them as visual anchors for the edit operation.
One notable feature: the audio handling parameter. Set it to "auto" and the model decides whether to regenerate audio to match the edited video; "retain" keeps the original audio track; "remove" strips audio entirely. This is a small detail that saves a post-production step.
The video edit app bills on combined input and output duration, which means longer edits can add up faster than generating from scratch.
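A video edit call might look like the sketch below, with assumed key names (video, prompt, reference_images, audio); the audio value mirrors the auto/retain/remove behavior described above.

from inferencesh import inference

client = inference()

# key names are assumptions - check the video-edit schema for the real ones
result = client.run({
    "app": "alibaba/happyhorse-1-0-video-edit",
    "input": {
        "video": "/clips/take_03.mp4",  # source clip, uploaded automatically
        "prompt": "replace the blue car with a red sports car, keep everything else unchanged",
        "reference_images": ["/refs/red_sports_car.png"],  # up to five visual anchors
        "audio": "retain",  # keep the original audio track
    }
})

print(result["output"])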
practical cost considerations
HappyHorse is slightly more expensive than Wan 2.7 but competitive with the broader market. The 1080P tier costs meaningfully more than 720P across all variants.
Because the video edit app bills on combined input and output duration, heavy iteration on edits gets meaningfully more expensive than generating fresh clips, and the costs compound quickly. For one-off adjustments to otherwise-finished clips, it's reasonable.
My general advice: use 720P while iterating on prompts and finding what works. Switch to 1080P only for final renders. The visual difference during prompt exploration isn't worth the premium per generation, and you'll iterate faster when you're not waiting for higher-resolution renders.
where the family approach pays off
The real value of having four specialized variants rather than one kitchen-sink model shows up in multi-step workflows. Generate a scene from text. Pull a frame you like from the result. Use that frame as input for image-to-video with a different camera angle. Take your best reference images and generate variations with reference-to-video. Then polish the final clip with video edit to fix color or swap an element.
Each step is its own API call with its own focused set of parameters. There's no mode-switching confusion, no parameter conflicts, no wondering which inputs override which. You pick the tool that matches your current creative intent.
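Sketched as code, a chained workflow is just successive client.run calls whose outputs feed the next stage. The input keys and the shape of the output field are assumptions here; the frame extraction between steps happens outside the API.

from inferencesh import inference

client = inference()

# 1. generate a base scene from text
scene = client.run({
    "app": "alibaba/happyhorse-1-0-t2v",
    "input": {"prompt": "wide shot of a lighthouse on a stormy coast at dusk"},
})

# 2. pull a frame you like from the downloaded clip (e.g. with ffmpeg),
#    then animate it with a new camera move
reframed = client.run({
    "app": "alibaba/happyhorse-1-0-i2v",
    "input": {
        "image": "/frames/lighthouse_best_frame.png",
        "prompt": "slow aerial push-in toward the lighthouse as waves crash below",
    },
})

# 3. polish with a natural-language edit
final = client.run({
    "app": "alibaba/happyhorse-1-0-video-edit",
    "input": {
        "video": reframed["output"],  # or the clip URL inside it - output shape not shown here
        "prompt": "make the lighting warmer and add subtle lens flare",
    },
})

print(final["output"])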
limitations worth knowing
Duration caps at 15 seconds across all variants. For longer content, you're stitching clips in post. The model doesn't offer any built-in clip extension or continuation feature.
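If you need more than 15 seconds, the stitching itself is simple. Here is a minimal sketch using ffmpeg's concat demuxer from Python, assuming ffmpeg is installed and the clips share resolution and codec:

import subprocess

clips = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]

# write the file list that ffmpeg's concat demuxer expects
with open("clips.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# stream-copy to avoid re-encoding; inputs must share resolution and codec
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "clips.txt", "-c", "copy", "combined.mp4"],
    check=True,
)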
HappyHorse generates audio natively as part of its unified transformer architecture - video and audio tokens are produced in the same forward pass, so sound effects, ambient audio, and even lip-synced speech across seven languages align naturally with the visual content. The video edit variant additionally handles audio in the context of preserving, regenerating, or stripping existing audio from the source clip.
Human subjects in close-up still exhibit the usual generative artifacts - slight facial inconsistencies, occasional hand issues. The physics-aware motion that HappyHorse handles well applies more convincingly to objects, environments, and wide shots of people than to tight portraits with complex expressions.
No negative prompts. Unlike Wan 2.7, you can't tell HappyHorse what to avoid. You have to prompt positively for what you want rather than specifying failure modes to exclude. This is a real workflow difference if you're used to negative prompting as a correction mechanism.
frequently asked questions
how does happyhorse compare to wan 2.7 for everyday video generation?
Wan 2.7 costs less per second and offers negative prompts plus LLM-powered prompt extension, which makes the prompt engineering more forgiving. HappyHorse produces more physically grounded motion - objects have weight and inertia that Wan sometimes lacks. If your content is heavy on physical interactions (pouring, splashing, mechanical movement), HappyHorse wins. For general-purpose generation where cost and prompt flexibility matter more, Wan is the practical choice.
can I use reference-to-video for consistent characters across a series of clips?
Yes, and this is one of HappyHorse's strongest differentiators. Provide up to nine reference images of your character from various angles, then generate as many clips as you need with different prompts. Identity preservation is solid for human subjects in medium shots and for objects with distinctive shapes or colors. It weakens on extreme angles not represented in your reference set, and very fine details like text on clothing may drift between generations.
what file formats does the family accept and produce?
All variants output MP4 video with H.264 encoding. For image inputs (i2v and r2v), the model accepts JPEG, JPG, PNG, BMP, and WEBP. For video edit input, MP4 and MOV are supported with H.264 recommended for best compatibility. Generated files are hosted on cloud storage and delivered via URL immediately after completion - no transcoding needed for web or social media use.
api reference
1. calling the api
install the client
the client provides a convenient way to interact with the api.
pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
from inferencesh import inference

client = inference()

result = client.run({
    "app": "alibaba/happyhorse-1-0-t2v",
    "input": {}
})

print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "alibaba/happyhorse-1-0-t2v",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
# local file paths are automatically uploaded
result = client.run({
    "app": "alibaba/happyhorse-1-0-t2v",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
result = client.run({
    "app": "alibaba/happyhorse-1-0-t2v",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
{
    "id": "task_abc123",
    "status": 9,
    "output": { ... },
    "error": "",
    "session_id": null,
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:30:05Z"
}
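A minimal receiver sketch for that payload, assuming a Flask endpoint at /webhook (any framework works; the only requirement is accepting the POST):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    task = request.get_json()
    # terminal states land here: completed, failed, or cancelled
    if task.get("error"):
        print(f"task {task['id']} failed: {task['error']}")
    else:
        print(f"task {task['id']} finished: {task.get('output')}")
    return "", 200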
5. schema

input
text description of the video to generate. supports any language. up to 5000 non-chinese characters or 2500 chinese characters.
video resolution: 720p or 1080p (default).
aspect ratio of the generated video.
video duration in seconds (3-15).
add 'happyhorse' watermark to bottom-right corner.
random seed for reproducibility (0-2147483647).
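Putting the schema together, a complete text-to-video request might look like the sketch below. The field names are inferred from the descriptions above and should be treated as placeholders until checked against the live schema.

from inferencesh import inference

client = inference()

result = client.run({
    "app": "alibaba/happyhorse-1-0-t2v",
    "input": {
        "prompt": "crane shot rising over terraced rice fields at golden hour, cinematic",
        "resolution": "1080P",   # 720p or 1080p
        "aspect_ratio": "16:9",
        "duration": 10,          # 3-15 seconds
        "watermark": False,      # bottom-right 'happyhorse' mark
        "seed": 42,              # 0-2147483647, for reproducibility
    }
})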