grok-extend-video

Extend existing videos using xAI's Grok Imagine Video model. Takes an existing video and generates additional frames to continue it with prompt guidance.

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get xai/grok-extend-video
# run
$ belt app run xai/grok-extend-video

Quietly, while everyone was busy debating which chatbot sounds most human, xAI built a media generation suite. Not just image generation - though that's where it started - but video synthesis, video extension, reference-guided video, and text-to-speech. Five distinct capabilities running on what xAI calls the Aurora engine, all available through inference.sh as standalone apps you can hit individually or chain together.

I find the xAI media story interesting for reasons that have nothing to do with Elon Musk discourse. The content policy is genuinely more permissive than what Google or OpenAI ship. The pricing is aggressive. And the quality, while uneven across modalities, has improved faster than I expected over the past few months. This isn't a toy. It's a real creative toolkit with real tradeoffs worth understanding.

What follows is a walkthrough of every non-Pro Grok media app: what each one does well, where it stumbles, and how they fit together as a unified system for people making things.

image generation with grok imagine

The xai/grok-imagine-image app is the foundation of the whole suite. Text-to-image, image editing, multiple aspect ratios, batch generation up to 10 images per call. The basics. But the basics executed with a particular personality.

Aurora's visual style leans vivid. Colors saturate harder than DALL-E's output, compositions tend toward the dramatic, and photorealistic prompts produce results with a slight cinematic grade baked in. Whether you consider that an asset or a limitation depends entirely on what you're making. For social content, marketing materials, and concept art it's often exactly what you want. For product photography or technical illustration where neutrality matters, you'll find yourself fighting the model's instincts.

The image editing mode accepts an input image and a prompt describing desired changes. It's competent at style transfers and additive edits - placing objects, changing lighting, shifting seasons. Subtractive edits (removing elements cleanly) are less reliable, though improving with each model update.

The pricing is competitive with the market and cheap enough that generating 50 variations to find the right one feels like a reasonable workflow rather than an extravagance. The economics actively encourage iteration rather than agonizing over prompt engineering.

The aspect ratio support covers the standard spread: square, landscape, portrait, and widescreen. Nothing exotic, but the common social media and presentation formats are all there. I've found the model handles widescreen compositions particularly well - panoramic scenes maintain coherence edge to edge without the weird distortion some models introduce at extreme ratios.
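To make the basics concrete, here's a minimal sketch of a batch text-to-image call through the inference.sh Python client used in the api reference below. The aspect_ratio and num_images field names are assumptions on my part, not confirmed schema - run belt app get xai/grok-imagine-image for the real input contract.

python
from inferencesh import inference

client = inference()

# batch-generate candidates in one call; input field names are assumed,
# verify against the app's published schema before relying on them
result = client.run({
    "app": "xai/grok-imagine-image",
    "input": {
        "prompt": "neon-lit alley in the rain, cinematic wide shot",
        "aspect_ratio": "16:9",  # assumed name for the widescreen option
        "num_images": 4,         # assumed name; the app supports up to 10
    }
})
print(result["output"])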

One honest note: Aurora is newer than Midjourney or Flux, and it shows in certain categories. Hands are better than they were six months ago but still not as consistently correct as the best competitors. Text rendering in images is unreliable. Complex multi-subject scenes occasionally lose track of spatial relationships. These are known weaknesses that xAI appears to be actively addressing, but they exist today and pretending otherwise would be dishonest.

video generation from text and image

The xai/grok-imagine-video app handles the full spectrum: pure text-to-video, image-to-video (animating a still), and video editing (modifying existing footage). It generates clips between 2 and 10 seconds at either 480p or 720p resolution.

Text-to-video is the headline feature but honestly not where I find the most value. Short clips from text prompts produce results that look impressive in isolation but rarely match a specific creative vision precisely enough to use without modification. The motion is fluid, physics are generally respected, and the model understands cinematic language - dolly shots, rack focus, aerial perspectives. But "a woman walking through a rainy Tokyo street at night" produces a generic version of that scene, not your specific version.

Image-to-video is where things get genuinely useful. Take a hero image you've already perfected - through Grok Imagine, Flux, Photoshop, whatever - and animate it. The model infers plausible motion from the still, guided by your prompt. A portrait gains subtle breathing and eye movement. A landscape gets wind through trees and drifting clouds. A product shot rotates slowly with studio lighting maintained. This workflow - craft the perfect frame, then bring it to life - produces dramatically better results than pure text-to-video for most professional applications.
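As a sketch of that craft-then-animate loop - assuming an image input field and the output shape shown, neither of which is confirmed schema, so verify with belt app get xai/grok-imagine-video:

python
from inferencesh import inference

client = inference()

# step 1: perfect the still frame
still = client.run({
    "app": "xai/grok-imagine-image",
    "input": {"prompt": "studio product shot of a ceramic mug, soft key light"}
})

# step 2: animate it; "image", "duration", and "resolution" are assumed names
clip = client.run({
    "app": "xai/grok-imagine-video",
    "input": {
        "image": still["output"]["images"][0],  # assumed output shape
        "prompt": "slow rotation, studio lighting held constant",
        "duration": 5,
        "resolution": "720p",
    }
})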

The resolution choice matters more than it might sound. 480p is fine for social media and prototyping, 720p is better for anything that might end up on a larger screen. The quality difference is noticeable but not transformative. Neither is suitable for broadcast-quality final delivery without upscaling.

Video editing mode accepts existing footage and a prompt describing desired changes. Think of it as instruction-based video editing: "make the sky sunset-colored" or "add falling snow to this scene." Results vary. Simple color and atmospheric changes work reliably. Structural edits - adding or removing subjects, changing camera angles - are hit or miss.
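A hypothetical edit call, under the same caveat that the field names are assumptions:

python
from inferencesh import inference

client = inference()

# instruction-based edit of existing footage; "video" and "prompt"
# field names are assumptions, check the app schema before use
edited = client.run({
    "app": "xai/grok-imagine-video",
    "input": {
        "video": "https://example.com/source-clip.mp4",
        "prompt": "make the sky sunset-colored",
    }
})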

extending videos beyond their boundaries

The xai/grok-extend-video app takes an existing video and generates additional frames to continue it. Feed in a 5-second clip and a prompt describing what should happen next, get back a longer video that picks up where the original left off.

This is conceptually simple but practically powerful. The most common video generation frustration is that 5 seconds isn't enough. You get a beautiful opening shot and then it just... ends. Extension lets you build longer sequences iteratively. Generate an initial clip, extend it, extend the extension. Each step guided by a new prompt describing the next beat of the sequence.
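Because the extend app's input is just a prompt, a source video url, and a duration (see the schema later on this page), chaining is a short loop. A minimal sketch, assuming the output's video field is a url you can feed straight back in:

python
from inferencesh import inference

client = inference()

# beats for a three-shot sequence; each extension picks up where
# the previous clip left off
beats = [
    "the camera pushes forward through the doorway",
    "it tilts up to reveal the atrium ceiling",
    "golden light floods in as the shot settles",
]

video_url = "https://example.com/opening-clip.mp4"  # your initial generation
for beat in beats:
    result = client.run({
        "app": "xai/grok-extend-video",
        "input": {
            "prompt": beat,
            "video": video_url,  # must be a publicly accessible url
            "duration": 10,
        }
    })
    video_url = result["output"]["video"]  # feed each extension back in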

The continuity preservation is the hard technical problem here, and Aurora handles it reasonably well. Camera motion continues smoothly. Subject appearance remains consistent. Lighting doesn't suddenly shift. It's not perfect - I've seen subtle color temperature drift across multiple extensions, and complex motion occasionally simplifies in extended segments - but it's good enough for practical creative work.

The duration parameter accepts 1 to 15 seconds per extension call (see the schema below). Chaining multiple extensions to build a 30-second sequence is remarkably cheap compared to the alternatives.

The limitation worth flagging: extension works best when the original clip has clear directional momentum. A camera moving forward through a space extends naturally. A static shot of a face talking is harder to extend convincingly because there's less spatial information for the model to extrapolate from. Choose your initial generations with extension in mind and you'll get much better results.

reference-guided video generation

The xai/grok-reference-video app is the most creatively interesting of the bunch. Instead of generating video purely from text, you provide reference images that guide the visual style, character appearance, or scene composition of the output. The model uses those images as stylistic anchors while generating motion according to your prompt.

I think of this as the "make it look like this, but moving" tool. Have a brand style guide with specific color palettes and visual treatments? Feed those as references. Want a character you've designed in still images to appear in video? Reference images solve the consistency problem that pure text prompts cannot.

You can provide multiple reference images, which lets you establish both subject appearance and environmental style simultaneously. A character reference plus an environment reference plus a motion prompt produces results that respect all three inputs to varying degrees. The model doesn't always weight references equally - dominant visual elements in reference images tend to carry more influence - but with some experimentation you can achieve reliable style transfer into motion.
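A sketch of a multi-reference call - the reference_images field name is an assumption, so confirm it against belt app get xai/grok-reference-video:

python
from inferencesh import inference

client = inference()

# character + environment references plus a motion prompt;
# "reference_images" is an assumed field name
result = client.run({
    "app": "xai/grok-reference-video",
    "input": {
        "prompt": "the character walks through the courtyard at dusk",
        "reference_images": [
            "/refs/character-sheet.png",  # local paths are auto-uploaded
            "/refs/courtyard-style.png",
        ],
        "duration": 6,
        "resolution": "720p",
    }
})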

Resolution options include 480p and 720p, with aspect ratios from square to widescreen. Duration ranges from 2 to 10 seconds, same as the standard video generation.

Where reference video really earns its place is in series production. If you're generating multiple clips that need to feel like they belong together - a social campaign, an explainer series, a brand content library - reference images ensure visual coherence across generations in a way that text prompts alone simply cannot guarantee. You establish the look once, then maintain it across dozens of outputs.

text-to-speech with character

The xai/grok-tts app rounds out the media suite by covering audio. Convert text to natural-sounding speech with a selection of voices - eve, ara, rex, and others - at up to 15,000 characters per call. Output formats include MP3, WAV, and raw PCM at configurable sample rates.
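A sketch of a narration call - the voice names come from the list above, but the text, voice, format, and language field names are assumptions to verify against belt app get xai/grok-tts:

python
from inferencesh import inference

client = inference()

# field names here are assumed, not confirmed schema
result = client.run({
    "app": "xai/grok-tts",
    "input": {
        "text": "Welcome back. In this chapter we cover the basics.",
        "voice": "eve",
        "format": "mp3",
        "language": "en-US",  # explicit BCP-47 code beats auto-detection
    }
})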

The voice quality sits solidly in the upper tier of current TTS systems. Natural prosody, appropriate pausing, emotional range that responds to the content being read. It's not ElevenLabs - the voice library is smaller and there's no voice cloning - but for narration, voiceover, and content production the quality is absolutely usable.

What distinguishes Grok TTS is the expressive speech tag system. You can embed markup in your text to control delivery: emphasis, pacing, emotional tone. This gives you director-level control over the performance without needing to regenerate and hope the model picks up on subtle prompt cues. For anyone building audio content at scale - audiobook chapters, podcast segments, course narration - this kind of fine-grained control matters enormously.

Language support works through BCP-47 codes with automatic detection as a fallback. The auto-detection is reliable for major languages but I'd recommend explicit language specification when working with content that mixes languages or uses loanwords heavily.

The pricing is production-friendly - narrating a full blog post or audiobook chapter costs very little, making large-scale audio production economically viable.

the content policy question

I'd be dishonest if I didn't address this directly, because it's a significant reason people choose xAI's tools over alternatives. Grok's content policy is more permissive than what Google, OpenAI, or Anthropic enforce on their generation tools. More creative scenarios are allowed. Fewer prompts hit refusal walls.

This cuts both ways. For legitimate creative work - fiction, concept art, editorial content, artistic exploration - fewer restrictions means less friction and fewer frustrating dead ends. You spend less time rewording prompts to avoid triggering safety filters on perfectly reasonable creative requests. If you've ever tried to generate a medieval battle scene on DALL-E and been refused, you know the feeling.

The flip side is obvious and doesn't need belaboring. More permissive means more potential for misuse. xAI presumably accepts this tradeoff deliberately as a competitive positioning choice. Whether it's the right tradeoff depends on your values and use case. I'm not here to moralize about it - just to note that it exists and influences the practical experience of using these tools.

the unified workflow

The real value of having all these capabilities under one roof becomes apparent when you chain them. Generate a hero image with Grok Imagine. Animate it with Grok Imagine Video. Extend the result with Grok Extend Video. Generate narration with Grok TTS. You've gone from a text prompt to a complete video asset with voiceover, all within the same ecosystem, all maintaining stylistic consistency through the Aurora engine's shared visual language.
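Sketched end to end - only the extend app's fields documented on this page are confirmed; every other field name and output shape is an assumption:

python
from inferencesh import inference

client = inference()

def run(app, inputs):
    # small helper for the pipeline; returns just the output payload
    return client.run({"app": app, "input": inputs})["output"]

# hero frame -> animation -> extension -> narration
hero = run("xai/grok-imagine-image",
           {"prompt": "sunrise over a glass city, aerial view"})
clip = run("xai/grok-imagine-video",
           {"image": hero["images"][0],  # assumed output shape
            "prompt": "slow aerial push-in"})
longer = run("xai/grok-extend-video",
             {"video": clip["video"],    # assumed output shape
              "prompt": "the camera crests the tallest tower",
              "duration": 10})
narration = run("xai/grok-tts",
                {"text": "Every city wakes the same way.",
                 "voice": "ara"})        # field names assumed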

The reference video app adds another dimension to this workflow. Generate a set of images that define your project's visual identity. Use those as references across all subsequent video generations. Every clip you produce shares the same DNA without you having to describe the style in every single prompt.

Is this ecosystem as mature as Runway or Pika for video, or as deep as Midjourney for images? Not yet. The iteration speed at xAI is high, but they're working from behind in terms of community knowledge, prompt engineering resources, and edge-case handling. What they have going for them is pricing, permissiveness, and the compounding advantage of a single engine that handles multiple modalities with shared aesthetic sensibility.

For creators who value speed of iteration and breadth of capability over best-in-class quality in any single dimension, xAI's media suite is a genuinely compelling option today. Not perfect. Not the best at any individual task. But surprisingly complete, aggressively priced, and improving at a pace that makes the gap with leaders narrower every month.

frequently asked questions

how does grok image generation compare to midjourney or flux?

Aurora produces vivid, cinematically graded images that lean toward the dramatic. Midjourney still edges it out on overall aesthetic polish and community-developed prompt techniques. Flux offers more precise control over composition and better text rendering. Where Grok Imagine wins is on content policy flexibility, competitive pricing, and tight integration with the rest of the xAI media pipeline. If your workflow involves generating images that become video inputs, staying within the Aurora ecosystem gives you consistency advantages that cross-platform workflows lack.

what's the maximum video length I can generate?

Standard generations cap at 10 seconds across the video apps, while the extend app's duration parameter accepts up to 15 seconds per call. The extend video app lets you chain segments together iteratively - generate a clip, extend it, extend the extension. Each extension maintains continuity with the previous segment. Practically, you can build sequences of 30-60 seconds before accumulated drift in style or motion quality becomes noticeable. For longer content, plan your sequence as discrete shots rather than one continuous take, which also gives you more control over composition and pacing at each step.

is grok tts good enough for production audio content?

For narration, voiceover, and informational content - yes. The voice quality is natural, prosody is appropriate, and the expressive speech tags give you meaningful control over delivery. The voice library is smaller than ElevenLabs (which offers dozens of voices plus cloning), so if you need a very specific vocal identity or custom voice, you'll find the selection limiting. For podcast intros, course content, explainer videos, and any scenario where you need clean professional narration from a set of predefined voices, the quality and pricing make large-scale audio production economically viable.

api reference

about

extend existing videos using xai's grok imagine video model. takes an existing video and generates additional frames to continue it with prompt guidance.

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "xai/grok-extend-video",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "xai/grok-extend-video",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "xai/grok-extend-video",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "xai/grok-extend-video",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
id (string): task id
status (number): terminal status (9=completed, 10=failed, 11=cancelled)
output (object): task output (when completed)
error (string): error message (when failed)
session_id (string): session id (if using sessions)
created_at (string): iso timestamp
updated_at (string): iso timestamp

5. schema

input

prompt (string, required)

text prompt describing what should happen in the extended portion of the video.

example: "The camera continues to pan across the landscape"

video (string, file, required)

input video to extend. must be a publicly accessible url.

duration (integer)

duration of the extended video in seconds (1-15). if not specified, defaults to the model's default.

min: 1, max: 15

output

video (string, file, required)

the extended video file.

output_meta (object)

structured metadata about inputs/outputs for pricing calculation.
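
Putting the schema together, a minimal complete call looks like this:

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "xai/grok-extend-video",
    "input": {
        "prompt": "The camera continues to pan across the landscape",
        "video": "https://example.com/clip.mp4",  # publicly accessible url
        "duration": 10,                           # 1-15 seconds
    }
})
print(result["output"]["video"])  # url of the extended video file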
