apps/elevenlabs/tts

tts

ElevenLabs Text to Speech - High-quality multilingual voice synthesis

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get elevenlabs/tts
# run
$ belt app run elevenlabs/tts

I remember the first time I heard ElevenLabs output and realized I'd been settling. Every other text-to-speech system I'd used before that moment suddenly sounded like what it was - a machine reading words aloud. ElevenLabs doesn't read. It speaks. The difference is subtle until you hear it, and then you can't unhear it.

Text-to-speech has been "good enough" for years if all you needed was a screen reader or a GPS giving directions. But the bar has moved. People expect podcasts, audiobooks, voiceovers, and conversational AI to sound like actual humans. Not humans trying really hard to sound natural. Just humans. That's the gap ElevenLabs fills, and it fills it better than anything else currently available.

why the quality gap matters more than you think

There's a phenomenon in audio that doesn't have an exact parallel in other media. When synthetic speech is 90% good, listeners notice the remaining 10% constantly. Every misplaced emphasis, every slightly robotic transition between words, every pause that lands wrong - these pull attention away from the content itself. Your audience stops listening to what's being said and starts noticing how it's being said.

ElevenLabs crosses a threshold where that cognitive friction disappears for most listeners. I've run informal tests with colleagues where I play two versions of the same paragraph - one from ElevenLabs, one from a competing service - and ask which is the human recording. People pick ElevenLabs as the human about 60% of the time. That's not marketing fluff. That's the meaningful quality difference you're paying for.

The prosody is where it shows most clearly. Sentences have rhythm. Questions actually rise at the end in the way your brain expects. Commas produce tiny hesitations. Lists get that slight acceleration that real speakers use when they're ticking through items. These aren't features anyone toggles on. They emerge from the model's understanding of language structure.

the model variants and what they actually mean for your workflow

ElevenLabs offers multiple model variants through inference.sh, and the choice between them is more nuanced than "fast vs. good" - though that's roughly the tradeoff.

Flash v2.5 and Turbo v2.5 exist for situations where latency matters more than the last 5% of quality. Flash is the one to reach for - ElevenLabs recommends it over Turbo in all use cases since it achieves lower average latency (under 75ms) while being functionally equivalent. If you're powering a conversational AI agent that needs to respond in under a second, you want Flash. If you're generating audio that someone will listen to passively - a podcast, an audiobook chapter, a narration track for video - Multilingual v2 is the obvious choice. The quality difference between Flash and Multilingual v2 is audible but not dramatic. Think of it as the difference between a good podcast mic and a great one. Both work. One just sounds slightly richer.
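
If you want that decision encoded in code, a minimal sketch looks like this (the model ids come from the schema below; the latency_sensitive flag is purely illustrative):

python
from inferencesh import inference

client = inference()

# illustrative flag: true for conversational agents, false for passive listening
latency_sensitive = False
model = "eleven_flash_v2_5" if latency_sensitive else "eleven_multilingual_v2"

result = client.run({
    "app": "elevenlabs/tts",
    "input": {
        "text": "Welcome back. Here's what changed since your last visit.",
        "model": model,
    }
})
print(result["output"])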

The multilingual capability on the v2 model deserves specific mention. It handles 29 languages without requiring you to specify which language the input text is in. The model figures it out. Flash and Turbo v2.5 extend that to 32 languages, adding Vietnamese, Hungarian, and Norwegian. I've fed it mixed-language paragraphs - English with French quotations, for instance - and it switches accents appropriately at the language boundaries. That's genuinely impressive and saves an enormous amount of engineering work if you're building anything for a global audience.
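
A sketch of that mixed-language case - there is no language parameter to set, because the model detects the switch on its own:

python
from inferencesh import inference

client = inference()

# english prose with an embedded french quotation; multilingual v2
# shifts accent at the language boundary automatically
result = client.run({
    "app": "elevenlabs/tts",
    "input": {
        "text": "She closed with a French proverb: 'Le mieux est l'ennemi du bien' - the best is the enemy of the good.",
        "model": "eleven_multilingual_v2",
    }
})
print(result["output"])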

understanding the voice controls

The parameter space for ElevenLabs is small but the interactions between parameters are non-obvious. I want to walk through what actually happens when you adjust these values, because the documentation tends toward the abstract.

Stability controls variation across the generated audio. At high stability values (0.8 and above), the voice maintains a very consistent emotional register. Same pacing, same energy, same intonation patterns. This is what you want for instructional content, news reading, or anything where consistency signals authority. Drop stability toward 0.3 or 0.4 and the voice starts introducing the kind of micro-variations that real speakers use - slight changes in pace, tiny shifts in energy between sentences. This sounds more natural for storytelling or casual content but can occasionally produce unexpected emphasis patterns.

Similarity boost is less intuitive. It controls how strongly the output adheres to the characteristic qualities of whatever voice preset you've selected. High similarity means the output sounds very specifically like that voice. Low similarity means the voice takes on a more generic quality. In practice, I leave this between 0.7 and 0.85 for most work. Going higher can introduce artifacts if the model is trying too hard to match a voice fingerprint.

Style exaggeration is the parameter that most people either ignore or overuse. At zero, you get neutral delivery. At 0.2 to 0.4, you get the kind of subtle emotional coloring that a professional voice actor brings without being asked. Beyond 0.5, the voice starts performing. That's appropriate for dramatic fiction, advertising, or any content that intentionally leans into emotion. For most professional applications - product documentation, business podcasts, accessibility narration - keep it under 0.3.

Speaker boost is a simple clarity enhancement that adds presence to the voice. Think of it as a gentle compressor and EQ chain that makes the voice sit forward in the mix. Useful if your audio will compete with background music or if it needs to cut through on phone speakers.
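
Putting those four controls together, here's a starting-point profile as a sketch (the parameter names match the schema below; the values reflect the ranges discussed above, not official recommendations):

python
from inferencesh import inference

client = inference()

# a conversational-but-controlled profile per the ranges discussed above
result = client.run({
    "app": "elevenlabs/tts",
    "input": {
        "text": "Let's walk through the quarterly numbers together.",
        "stability": 0.65,          # consistent register, small natural variation
        "similarity_boost": 0.8,    # stay close to the preset's voice fingerprint
        "style": 0.2,               # subtle emotional coloring, no performing
        "use_speaker_boost": True,  # presence for phone speakers / music beds
    }
})
print(result["output"])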

choosing the right model for your budget

ElevenLabs is more expensive than alternatives like Google Cloud TTS and AWS Polly. OpenAI's TTS sits somewhere in the middle. The question you need to answer honestly is whether the quality premium justifies the cost premium for your specific use case.

For a product demo video that represents your company? Yes, obviously. For an internal notification system that reads out alert statuses to an ops team? Probably not. For an audiobook where listeners will spend hours with the voice? Absolutely - listener fatigue over a full book is the most tangible quality-of-life difference between ElevenLabs and cheaper alternatives.

The character limits per request vary by model: Flash and Turbo v2.5 allow up to 40,000 characters, while Multilingual v2 caps at 10,000. For most content these limits are generous enough that you rarely need to think about chunking - the Multilingual v2 cap covers roughly 1,500 to 2,000 words per call, enough for a full article narration or a substantial audiobook section, and Flash v2.5's 40,000-character ceiling handles virtually anything without splitting logic.
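
If a manuscript does exceed the cap, a minimal chunking sketch under simple assumptions (paragraphs as split points, no single paragraph longer than the limit) might look like:

python
from inferencesh import inference

client = inference()

MAX_CHARS = 10_000  # multilingual v2 per-request cap

def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """greedily pack whole paragraphs into chunks under the limit.
    assumes no single paragraph exceeds the limit."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > limit:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

long_manuscript = open("manuscript.txt").read()  # placeholder source text
audio_parts = [
    client.run({"app": "elevenlabs/tts", "input": {"text": chunk}})["output"]
    for chunk in chunk_text(long_manuscript)
]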

where it falls short

No honest assessment skips the limitations. ElevenLabs has limited SSML support - most models accept <break> tags for pauses up to 3 seconds and <phoneme> tags for custom pronunciation using IPA or CMU dictionaries, but the newer v3 model uses its own inline tags like [pause] and [long pause] instead. Full SSML control - complex prosody markup, emphasis tags, say-as types - isn't available. If you need that level of control, Google Cloud TTS with full SSML is still the better tool for that specific job.
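
A sketch of the break tag in practice (the 1.5-second pause is an arbitrary example value, and per the note above this applies to the v2-family models, not v3):

python
from inferencesh import inference

client = inference()

# <break> is the main ssml-style tag the v2 models accept; pauses cap at 3s
result = client.run({
    "app": "elevenlabs/tts",
    "input": {
        "text": 'First, the headline number. <break time="1.5s" /> Now, the caveats.',
    }
})
print(result["output"])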

The voice library through inference.sh uses presets. You can't bring a custom cloned voice through this integration - that requires a direct ElevenLabs account and their cloning pipeline. The presets cover a wide range of personas and styles, but if your brand has a specific voice actor whose likeness you've licensed, you'll need the direct platform.

There's also no streaming output in the serverless model: you submit text and get a complete audio file back. For conversational AI where you need the first audio chunk playing while the rest generates, you'd want ElevenLabs' websocket API directly - the inference.sh integration is optimized for batch generation.

the voice library and choosing well

Voice selection is more consequential than most people treat it. The wrong voice for your content creates a subtle dissonance that listeners feel but can't name. A young, energetic voice reading financial analysis sounds off. A deep, authoritative voice reading casual blog content feels stiff.

I'd suggest generating the same 2-3 paragraph sample across five or six voices before committing to one for a project, as sketched below. The time investment is minimal - you're talking a few cents' worth of generation - and the difference between a good voice match and a perfect one compounds over hours of listening.
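
A quick way to run that audition (the voice names come from the schema's preset list; playing back or saving the returned files is left out):

python
from inferencesh import inference

client = inference()

SAMPLE = (
    "Quarterly revenue rose eight percent, driven mostly by the new "
    "subscription tier. Margins, however, tell a more complicated story."
)

# a handful of presets from the schema's voice list
for voice in ["george", "sarah", "brian", "matilda", "daniel"]:
    result = client.run({
        "app": "elevenlabs/tts",
        "input": {"text": SAMPLE, "voice": voice},
    })
    print(voice, "->", result["output"])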

The default voice works well as a neutral starting point. It's clear, professional, and doesn't draw attention to itself. For most informational content, that's exactly what you want. Save the more distinctive voices for creative projects where personality is part of the value.

the competitive picture in mid-2026

The TTS space is moving fast. ElevenLabs themselves released Eleven v3 in June 2025, supporting 70+ languages with inline emotion tags like [excited], [whispers], and [sighs], plus native multi-speaker dialogue generation. OpenAI's TTS has improved significantly and offers a simpler parameter set. Google and Amazon have both updated their neural voice options. Microsoft's Azure Speech has strong enterprise features. Smaller players like Cartesia are pushing ultra-low-latency use cases with their Sonic model, built on state space model architecture, achieving sub-90ms time-to-first-audio.

Despite this, ElevenLabs maintains a lead on naturalness that I find consistent across languages and voice types. The gap has narrowed from "night and day" to "clearly better in a direct comparison." Whether that gap continues to justify a premium depends on how quickly competitors close it.

What I can say confidently right now: if you play ElevenLabs output alongside any alternative for a non-technical audience and ask them which sounds better, they'll pick ElevenLabs. That consistency of preference is the product's core value proposition.

output and format considerations

The default output format - MP3 at 44.1kHz with 128kbps bitrate - is suitable for virtually all production use cases. It's the standard for podcast distribution, video narration, and web audio playback. The file sizes are reasonable (roughly 1MB per minute of audio) and the quality loss from compression is negligible for speech content.

If you're feeding the output into a post-production pipeline where you'll be applying effects, mixing with music, or doing significant editing, you may want to check available higher-quality format options. For direct-to-listener applications, the default is the right choice.
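
For the post-production case, requesting uncompressed PCM - one of the output_format options in the schema below - looks roughly like this:

python
from inferencesh import inference

client = inference()

# pcm_44100 skips mp3 compression entirely; useful before heavy editing
result = client.run({
    "app": "elevenlabs/tts",
    "input": {
        "text": "This line will be mixed under a music bed later.",
        "output_format": "pcm_44100",
    }
})
print(result["output"])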

FAQ

how does elevenlabs compare to openai's tts for long-form content?

For short snippets - a sentence or two - the difference between ElevenLabs and OpenAI TTS is subtle enough that convenience might win. For long-form content where a listener spends ten minutes or more with the voice, ElevenLabs pulls ahead noticeably. The prosody stays more natural over long passages, transitions between paragraphs feel more deliberate, and listener fatigue is measurably lower. OpenAI's model tends toward a consistent energy level that becomes monotonous over time, while ElevenLabs introduces enough natural variation to keep the ear engaged.

what stability and similarity settings work best for podcasts?

Start with stability at 0.65 to 0.75 and similarity boost at 0.75 to 0.8. This gives you enough consistency that the voice sounds professional and controlled, while preserving enough variation that it doesn't sound robotic over a twenty-minute episode. Add speaker boost for clarity, and keep style exaggeration at or below 0.2 unless your podcast format calls for dramatic delivery. These settings produce the "friendly authority" tone that works for most informational podcast content.
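
As a concrete input dict (a sketch - these values are starting points drawn from the ranges above, not official recommendations):

python
episode_script = open("episode.txt").read()  # placeholder for your script text

podcast_input = {
    "text": episode_script,
    "voice": "george",
    "model": "eleven_multilingual_v2",
    "stability": 0.7,
    "similarity_boost": 0.75,
    "style": 0.2,
    "use_speaker_boost": True,
}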

can I generate audio in multiple languages with the same voice?

Yes, using the Multilingual v2 model. The same voice preset will produce output in whatever language the input text uses - the model detects the language automatically. The voice retains its fundamental character across languages while adapting accent and pronunciation appropriately. Quality is strongest in widely-spoken languages like English, Spanish, French, German, and Portuguese. Less common languages work but may show slightly less natural prosody in edge cases.

api reference

about

elevenlabs text to speech - high-quality multilingual voice synthesis

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "elevenlabs/tts",
    "input": {
        "text": "hello from inference.sh"  # text is the only required input
    }
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "elevenlabs/tts",
    "input": {
        "text": "hello from inference.sh"  # text is the only required input
    }
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "elevenlabs/tts",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "elevenlabs/tts",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}

id          string  task id
status      number  terminal status (9=completed, 10=failed, 11=cancelled)
output      object  task output (when completed)
error       string  error message (when failed)
session_id  string  session id (if using sessions)
created_at  string  iso timestamp
updated_at  string  iso timestamp

5. schema

input

text (string, required)

text to convert to speech (max 40,000 characters).

maxLength: 40000

voice (string)

voice to use for speech generation.

default: "george"
options: "adam", "alice", "aria", "bella", "bill", "brian", "callum", "charlie", "chris", "daniel", "eric", "george", "harry", "jessica", "laura", "liam", "lily", "matilda", "river", "roger", "sarah", "will"

model (string)

model to use. multilingual_v2 is highest quality, turbo/flash are faster with lower latency.

default: "eleven_multilingual_v2"
options: "eleven_multilingual_v2", "eleven_turbo_v2_5", "eleven_flash_v2_5"

output_format (string)

audio output format. mp3_44100_128 is standard quality mp3.

default: "mp3_44100_128"
options: "mp3_44100_128", "mp3_44100_192", "pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100"

stability (number)

voice stability (0-1). higher = more consistent, lower = more expressive.

default: 0.5, min: 0, max: 1

similarity_boost (number)

similarity boost (0-1). higher = closer to original voice.

default: 0.75, min: 0, max: 1

style (number)

style exaggeration (0-1). increases expressiveness but may reduce stability.

default: 0, min: 0, max: 1

use_speaker_boost (boolean)

enable speaker boost for enhanced clarity.

default: true

output

audio (string, file, required)

generated speech audio file

output_meta (object)

structured metadata about inputs/outputs for pricing calculation
