text-to-dialogue
ElevenLabs Text to Dialogue - Generate immersive multi-voice dialogue
Most people know ElevenLabs for text-to-speech. Fair enough - their voice synthesis is genuinely best-in-class. But if you stop there, you're missing the larger story. Behind the TTS headline sits a full audio production toolkit: transcription, music generation, sound design, voice transformation, noise isolation, multilingual dubbing, and scripted dialogue. Seven distinct capabilities, each available as a standalone app on inference.sh, each solving a different slice of the audio production problem.
I've spent enough time with these tools to know where they shine and where they don't. This guide covers everything except TTS (which has its own dedicated article). Think of it as a tour through a professional audio workshop where every station runs on the same underlying engine but serves a completely different creative purpose.
speech-to-text with scribe
Transcription sounds like a solved problem until you actually need accurate results from messy real-world audio. ElevenLabs Scribe (available as elevenlabs/stt) handles the hard cases better than most: overlapping speakers, background noise, accented English, non-English languages. The diarization feature identifies who's speaking when, which matters enormously for meetings, interviews, and podcasts where you need attribution, not just a wall of text.
Scribe returns word-level timestamps alongside the full transcript, making it straightforward to build subtitle files or align text to video. It also tags audio events like laughter or applause - a small touch that saves real editing time. Language detection works automatically, though you can pin it to a specific language if you know what you're working with.
The practical ceiling I've hit is that extremely noisy field recordings with heavy wind or machinery still produce artifacts. For studio-quality or even decent phone recordings, though, the accuracy is remarkably high. The model comes in two versions. Scribe v1 was the initial release. Scribe v2 launched in January 2026 in both realtime and batch variants - the realtime version achieves under 150ms latency for live conversational AI, while the batch version handles long-form audio with support for up to 48 distinct speakers, keyterm prompting, and entity detection. Scribe v2 covers 90+ languages with a word error rate as low as 3.3% for English on the FLEURS benchmark.
Pricing is competitive with Whisper API while offering substantially better diarization out of the box.
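For orientation, here's roughly what a Scribe call looks like through the inference.sh client. The input field names below (`audio`, `diarize`, `language`) are my illustrative guesses, not the confirmed schema - check the app's schema for the real ones.

```python
from inferencesh import inference

client = inference()

# hypothetical input fields - the app schema defines the actual names
result = client.run({
    "app": "elevenlabs/stt",
    "input": {
        "audio": "/recordings/interview.mp3",  # local file, auto-uploaded
        "diarize": True,    # assumed flag for speaker attribution
        "language": None,   # assumed: None lets the model auto-detect
    }
})

print(result["output"])  # transcript plus word-level timestamps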
music generation
The elevenlabs/music app generates original compositions from text prompts. You describe what you want - genre, mood, tempo, instrumentation - and get back a production-ready audio file. Duration ranges from 3 seconds up to 10 minutes via the API (5 minutes on the web interface), which covers everything from a notification sound to a full background track.
What impresses me here is the compositional coherence. A request for "jazz trio with walking bass, brushed drums, and Rhodes piano, relaxed Sunday morning feel" doesn't just slap instruments together. The generated track has structure, dynamics, and something approaching musical taste. It understands that a jazz trio breathes, that sections should develop, that repetition needs variation.
Where it falls short: if you need very specific harmonic progressions or precise structural control (verse-chorus-verse with a bridge at 2:14), you're fighting the prompt rather than directing it. This is a composition tool, not a notation-to-audio renderer. For background music, content soundtracks, and creative exploration it's excellent. For replacing a session musician who takes direction, not yet.
Creative workflow example: I've used this to prototype sonic branding. Generate 20 variations of a brand's audio identity in different styles, pick the direction that resonates, then hand that reference to a composer for final production. What used to take a week of back-and-forth happens in an afternoon.
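A rough sketch of that variation workflow, assuming hypothetical `prompt` and `duration_seconds` input fields:

```python
from inferencesh import inference

client = inference()

brand_prompt = (
    "warm analog synth motif, optimistic, five seconds, "
    "suitable as an audio logo for a fintech brand"
)

# generate a batch of candidates to review side by side;
# field names here are illustrative, not the confirmed schema
candidates = []
for _ in range(20):
    result = client.run({
        "app": "elevenlabs/music",
        "input": {
            "prompt": brand_prompt,
            "duration_seconds": 5,  # assumed parameter name
        }
    })
    candidates.append(result["output"])

for i, track in enumerate(candidates, start=1):
    print(f"variation {i}: {track}")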
sound effects
The elevenlabs/sound-effects app generates custom audio from text descriptions. "Heavy wooden door creaking open in a stone hallway" or "retro arcade game coin insert sound" or "distant thunder rolling across a valley with light rain." Each generation produces a unique result, so you can run the same prompt multiple times and build a library of variations.
Duration caps at 30 seconds per generation (with Sound Effects V2, up from 22 seconds in the original version), which fits the vast majority of sound effect use cases. There's a prompt influence parameter that controls how literally the model interprets your description - higher values stick close to the text, lower values allow more creative interpretation.
The quality consistently surprises me. Foley effects in particular - footsteps, object interactions, environmental sounds - come out with a physicality that synthetic sound design usually lacks. Abstract or heavily stylized effects (sci-fi UI sounds, magical spells) also work well, though they lean toward familiar genre conventions rather than truly novel design.
The limitation worth knowing: this generates isolated effects, not layered soundscapes. If you need a complex ambient bed with multiple simultaneous elements, you'll want to generate components separately and mix them. Not a fault exactly, just the boundary of what a single generation handles.
Pricing is flat per generation regardless of duration, making it cheap enough to generate dozens of variations and pick the best ones.
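That variation-shopping pattern lends itself to a simple loop. A minimal sketch, sweeping the prompt influence parameter described above - the exact input field names are assumptions:

```python
from inferencesh import inference

client = inference()

# same prompt at different prompt-influence settings, for comparison;
# "prompt_influence" as a field name is an assumption
for influence in (0.3, 0.6, 0.9):
    result = client.run({
        "app": "elevenlabs/sound-effects",
        "input": {
            "prompt": "heavy wooden door creaking open in a stone hallway",
            "prompt_influence": influence,
        }
    })
    print(f"influence {influence}: {result['output']}")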
voice changer
This one is straightforward in concept but surprisingly powerful in practice. The elevenlabs/voice-changer app takes audio containing speech and re-renders it in a different voice while preserving the content, timing, and emotional delivery. The words stay the same. The performance stays the same. Only the vocal identity changes.
The multilingual model handles 7+ languages, and you can choose from ElevenLabs' library of target voices or use custom cloned voices. Output format options range from compressed MP3 to high-quality PCM, depending on your downstream needs.
Where this gets interesting is creative production. Record a rough voiceover yourself - capturing the pacing and emphasis you want - then transform it into a voice that fits the project. Directors can perform temporary tracks for animators to work against, podcasters can create character voices, game developers can prototype dialogue without casting every role.
The tradeoff is fidelity at the edges. Whispered speech, singing, and heavily emotional delivery can occasionally flatten during transformation. Conversational and narrative speech transforms cleanly.
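The call itself is simple. A sketch of the record-rough-then-transform pattern, with the caveat that the `voice` field name and value are illustrative assumptions:

```python
from inferencesh import inference

client = inference()

# re-render a rough scratch track in a target voice
result = client.run({
    "app": "elevenlabs/voice-changer",
    "input": {
        "audio": "/takes/scratch_voiceover.wav",  # your rough performance
        "voice": "narrator_male_01",              # hypothetical voice id
    }
})

print(result["output"])  # same words and timing, new vocal identity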
voice isolator
Noise removal is one of those capabilities that sounds simple until you've tried to rescue a recording from a windy rooftop interview or a cafe with an espresso machine running. The elevenlabs/voice-isolator app strips background noise and returns clean vocal audio. It accepts standard formats - WAV, MP3, FLAC, OGG, and AAC - and outputs the isolated voice track.
I find this most valuable as a preprocessing step. Run a noisy recording through voice isolation before transcription and your accuracy jumps noticeably. Use it to clean up interview audio before publishing. Feed it field recordings that would otherwise be unusable.
The model handles steady-state noise (air conditioning, traffic hum, room tone) extremely well. It also manages transient noise (clattering dishes, phone notifications) better than traditional noise gates. Where it struggles is when the noise and voice occupy the same frequency range simultaneously - a loud conversation at the next table, for instance, is harder to remove than construction noise outside a window.
One creative application: isolate vocals from a mixed recording to use as input for voice cloning or voice changing. Chain the tools together. Clean first, transform second.
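A sketch of that clean-then-transform chain, passing one app's output file straight into the next. As before, the input field names are assumptions rather than the confirmed schema:

```python
from inferencesh import inference

client = inference()

# step 1: strip the background noise from a mixed recording
isolated = client.run({
    "app": "elevenlabs/voice-isolator",
    "input": {"audio": "/field/noisy_interview.mp3"}
})

# step 2: feed the isolated vocal straight into the voice changer
transformed = client.run({
    "app": "elevenlabs/voice-changer",
    "input": {
        "audio": isolated["output"],    # file reference passed along
        "voice": "character_voice_02",  # hypothetical voice id
    }
})

print(transformed["output"])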
dubbing
The elevenlabs/dubbing app handles the full localization pipeline in a single call: it transcribes source audio, translates the content, and re-synthesizes speech in the target language while preserving the original speaker's vocal identity. The result sounds like the same person speaking a different language.
This is genuinely useful for content creators targeting multiple markets. A 10-minute YouTube video in English becomes a Spanish, French, or Japanese version without hiring voice actors, without re-recording, without weeks of production time. The timing alignment isn't perfect - some languages are more verbose than others, and the model has to compress or expand delivery to fit - but for most content it's close enough that viewers don't notice.
Source language detection is automatic. You specify the target and let the system figure out the rest. Input can be audio or video files (it processes the audio track from video).
Creative workflow example: a course creator with 40 hours of English training content can dub the entire library into Spanish for a new market. The cost is a fraction of what traditional dubbing studios charge, and it ships in hours rather than weeks.
The honest limitation: highly technical content with domain-specific terminology sometimes mistranslates. Always have a native speaker spot-check the output for anything audience-facing. The voice preservation also works better for some language pairs than others - similar language families (English to Spanish) tend to sound more natural than distant ones (English to Mandarin).
A watermarked tier and a clean, watermark-free tier are available, letting you choose based on your production needs.
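Here's what dubbing one video into three markets might look like. Source language detection is automatic per the docs, so only the target is specified; `source` and `target_language` are assumed field names:

```python
from inferencesh import inference

client = inference()

# one source video, three target languages
for lang in ("es", "fr", "ja"):
    result = client.run({
        "app": "elevenlabs/dubbing",
        "input": {
            "source": "/videos/lesson_01.mp4",  # audio track is extracted
            "target_language": lang,            # assumed field name
        }
    })
    print(f"{lang}: {result['output']}")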
text-to-dialogue
The elevenlabs/text-to-dialogue app generates multi-voice audio from scripted conversations. You provide an array of dialogue segments, each specifying a voice and the text to speak, and it renders the full conversation as a single audio file with natural turn-taking, appropriate pauses, and consistent character voices throughout.
This solves a specific production problem: creating dialogue content without a studio session. Audiobook previews, podcast pilots, game dialogue prototypes, explainer videos with multiple speakers, training scenarios with role-play conversations. Anywhere you need multiple distinct voices performing scripted material.
The segment-based input gives you precise control over who says what and in which order. You pick voices from ElevenLabs' library for each speaker, and the rendering engine handles the transitions between them. The result sounds like a conversation, not a sequence of isolated clips pasted together.
Where I'd push back on expectations: this generates dialogue, not drama. The voices perform the text competently but don't deliver Oscar-worthy emotional range. For informational content, podcasts, and narrative where natural delivery matters more than theatrical performance, it works well. For an audiobook climax where a character needs to convey heartbreak through a whispered line, you probably still want a human actor.
Pricing varies by model tier, with Flash and Turbo models being more affordable than Multilingual v2. The cost scales with script length, making it economical for most dialogue production needs.
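To make the segment format concrete, here's a minimal two-speaker sketch following the array-of-segments shape described above. The `segments`, `voice`, and `text` field names and the voice ids are illustrative, not the confirmed schema:

```python
from inferencesh import inference

client = inference()

# a short script as an ordered array of dialogue segments
result = client.run({
    "app": "elevenlabs/text-to-dialogue",
    "input": {
        "segments": [
            {"voice": "host_voice",  "text": "Welcome back to the show."},
            {"voice": "guest_voice", "text": "Glad to be here, thanks."},
            {"voice": "host_voice",  "text": "Let's get into it."},
        ]
    }
})

print(result["output"])  # one audio file with natural turn-taking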
building workflows across the suite
The real power here isn't any single capability in isolation. It's how they compose. A practical production workflow might look like this: record a rough interview on location, run it through voice isolation to clean the audio, transcribe with Scribe for an accurate text record, dub the key segments into three target languages, generate background music for the final edit, and create custom transition sounds. Five different tools, one audio project, no specialized software beyond the API calls.
Another scenario: a game developer scripts NPC dialogue, renders it with text-to-dialogue for prototyping, generates ambient sound effects for each scene, composes area-specific background music, and uses voice changer to create variations of the same lines with different character voices. The entire audio prototype exists before a single voice actor steps into a booth.
The pricing across the suite stays predictable. There are no hidden costs for storage or processing queues. You pay per unit of work and the results are yours.
when to use what
The suite breaks down along clear functional lines. Scribe handles audio-to-text. Music and sound effects handle text-to-audio for non-speech content. Voice changer and voice isolator are audio-to-audio transformations. Dubbing combines multiple steps into a single localization pipeline. Text-to-dialogue renders scripted multi-speaker content.
If your project touches audio in any capacity, at least one of these probably saves you time or money compared to the traditional approach. The trick is knowing which one fits your specific bottleneck rather than trying to force a single tool to do everything.
frequently asked questions
can I chain these tools together in automated pipelines?
Yes, and this is where the suite becomes most valuable. Each app accepts and returns standard audio formats, so the output of one becomes the input of another without conversion steps. A common chain is voice isolation followed by transcription for noisy source material, or text-to-dialogue followed by dubbing for multilingual scripted content. On inference.sh, you can orchestrate these as sequential app runs within a single workflow, passing file references between steps. The per-call pricing means you only pay for the processing you actually use at each stage.
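As a concrete sketch of the isolate-then-transcribe chain mentioned above, passing the file reference between steps (input field names are assumptions):

```python
from inferencesh import inference

client = inference()

# step 1: isolate the voice from noisy source material
clean = client.run({
    "app": "elevenlabs/voice-isolator",
    "input": {"audio": "https://example.com/noisy_source.mp3"}
})

# step 2: transcribe the cleaned audio with Scribe
transcript = client.run({
    "app": "elevenlabs/stt",
    "input": {"audio": clean["output"]}  # output of one step feeds the next
})

print(transcript["output"])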
how does the audio quality compare to dedicated professional tools?
For most production contexts, the quality is indistinguishable from professional alternatives. Music generation produces broadcast-ready output. Transcription accuracy matches or exceeds most commercial services. Voice isolation rivals dedicated plugins like iZotope RX for common noise profiles. The gap shows up at the extremes - a mastering engineer will hear differences in generated music, and a professional studio with treated acoustics won't need isolation. But for content creation, marketing, game development, and any context where "good enough for release" is the bar, these tools clear it comfortably.
what are the file format limitations?
The suite supports standard audio formats across the board: MP3, WAV, FLAC, and M4A for input. The dubbing app also accepts video formats like MP4, processing just the audio track. Output defaults to MP3 at various quality levels, though some apps (like voice changer) offer format selection including high-bitrate MP3 and PCM WAV. File size limits are generous enough for most real-world content - the main practical constraint is the duration caps on specific tools, like 30 seconds for sound effects and 10 minutes for music generation via the API.
api reference
about
elevenlabs text to dialogue - generate immersive multi-voice dialogue
1. calling the api
install the client
the client provides a convenient way to interact with the api.
```bash
pip install inferencesh
```

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
```bash
export INFERENCE_API_KEY="inf_your_key"
```

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
```python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "elevenlabs/text-to-dialogue",
    "input": {}
})

print(result["output"])
```

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
```python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "elevenlabs/text-to-dialogue",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")
```

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
```python
# local file paths are automatically uploaded
result = client.run({
    "app": "elevenlabs/text-to-dialogue",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})
```

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
```python
result = client.run({
    "app": "elevenlabs/text-to-dialogue",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)
```

webhook payload
your endpoint receives a JSON POST with the task result:
1{2 "id": "task_abc123",3 "status": 9,4 "output": { ... },5 "error": "",6 "session_id": null,7 "created_at": "2024-01-15T10:30:00Z",8 "updated_at": "2024-01-15T10:30:05Z"9}5. schema