ElevenLabs Text to Speech on inference.sh

ElevenLabs TTS is a leading AI voice-synthesis model, delivering human-quality speech in dozens of languages with fine-grained control over voice characteristics. Available on inference.sh as a serverless app, it serves 44 paying users who integrate it into content pipelines, accessibility tools, and product features. The model handles text up to 40,000 characters per request — enough for entire articles, podcast scripts, or audiobook chapters in a single API call.

What distinguishes ElevenLabs from other TTS options is the voice quality. The output sounds like a real person speaking naturally, with appropriate intonation, pacing, and emotion. The model offers multiple voices, adjustable stability and expressiveness controls, and support for multilingual content. You get broadcast-ready audio without post-processing.

what it does

ElevenLabs TTS converts text into spoken audio. You provide text and voice parameters, and it returns an audio file of that text being spoken aloud. The model handles English and dozens of other languages, supports multiple voice presets, and offers fine-grained control over how the voice sounds through stability, similarity, and style parameters.

The app supports three model variants: Flash and Turbo for fast generation at lower cost, and Multilingual v2 for maximum quality and language coverage. Choose based on your latency and quality requirements.

key features

Natural prosody — The output captures natural speech rhythm, emphasis, and intonation. Sentences flow with appropriate pauses, stress patterns, and tonal variation rather than the robotic cadence of older TTS systems.

Voice selection — Choose from a library of preset voices, each with distinct character — different ages, accents, tones, and speaking styles. The default voice works well for general content; specialized voices suit specific use cases.

Stability and variation control — The stability parameter controls how consistent the voice sounds across the output. Higher stability means more predictable delivery; lower stability introduces natural variation. The similarity_boost parameter controls how closely the output matches the voice preset's defining characteristics.

Style exaggeration — The style parameter pushes the voice's emotional expressiveness. Low values produce neutral, professional delivery. Higher values increase dramatic expression — useful for storytelling, marketing copy, or any content that benefits from emotional emphasis.

Speaker boost — An optional enhancement that increases clarity and presence. Useful for podcasts and voice-overs where the voice needs to cut through background content.

Long-form support — Process up to 40,000 characters in a single request. No need to chunk articles, scripts, or documents into small pieces and stitch audio together.

Multiple output formats — Choose MP3 at various bitrates (44.1kHz/128kbps is the default), or other formats depending on your pipeline requirements.

use cases

Podcast and audiobook production — Convert written scripts and manuscripts into spoken audio. The long-form support means entire episodes or chapters can be generated in single requests.

Accessibility features — Add spoken versions of text content for visually impaired users or situations where reading is not possible. The natural quality means users can listen for extended periods without fatigue.

Video narration — Generate voice-over tracks for video content, tutorials, product demos, and explainer videos. The style control lets you match the tone to your content.

Interactive applications — Power voice interfaces, chatbots, and interactive agents with natural-sounding speech. The fast model variants support near-real-time applications.

Content localization — Generate spoken versions of content in multiple languages using the Multilingual v2 model. One text, many languages, consistent quality.

Notification and alert systems — Generate dynamic spoken notifications, announcements, or status updates for applications that benefit from audio output.

how to run

belt CLI

Basic text-to-speech with default voice:

```bash
belt app run elevenlabs/tts --input '{"text": "Welcome to our platform. This guide will walk you through the key features and help you get started in under five minutes."}'
```

With voice selection and quality tuning:

```bash
belt app run elevenlabs/tts --input '{"text": "Breaking news from the technology sector: AI capabilities continue to advance at an unprecedented pace.", "voice": "adam", "stability": 0.7, "similarity_boost": 0.8}'
```

Expressive narration with style exaggeration:

```bash
belt app run elevenlabs/tts --input '{"text": "The door creaked open slowly. In the darkness beyond, something moved.", "voice": "daniel", "style": 0.6, "stability": 0.4}'
```

High-quality multilingual generation:

```bash
belt app run elevenlabs/tts --input '{"text": "Bienvenue sur notre plateforme. Ce guide vous aidera à démarrer rapidement.", "model": "multilingual_v2", "voice": "rachel"}'
```

Long-form content with speaker boost:

```bash
belt app run elevenlabs/tts --input '{"text": "Chapter One. It was a bright cold day in April, and the clocks were striking thirteen...", "use_speaker_boost": true, "output_format": "mp3_44100_128"}'
```

API

```bash
curl -X POST https://api.inference.sh/v1/apps/elevenlabs/tts/run \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a product demo. Our API processes your request in seconds and returns broadcast-ready audio that sounds indistinguishable from human speech.",
    "voice": "adam",
    "model": "multilingual_v2",
    "stability": 0.75,
    "similarity_boost": 0.8,
    "style": 0.3,
    "use_speaker_boost": true,
    "output_format": "mp3_44100_128"
  }'
```
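The same request can be issued from Python using only the standard library. The sketch below mirrors the curl call above; the helper name is illustrative and this is not an official client:

```python
# Illustrative helper mirroring the curl example; not an official client.
import json
import urllib.request

API_URL = "https://api.inference.sh/v1/apps/elevenlabs/tts/run"

def make_tts_request(api_key, text, **options):
    """Build the HTTP request object; the caller decides when to send it."""
    payload = {"text": text, **options}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To execute (requires a valid key and network access):
# with urllib.request.urlopen(make_tts_request(key, "Hello world")) as resp:
#     result = json.load(resp)  # contains the audio URL and output_meta
```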

input parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | yes | The text to convert to speech. Maximum 40,000 characters per request. |
| voice | string | no | Voice preset to use. Options include various characters with different ages, accents, and tones. Default is a neutral, professional voice. |
| model | string | no | Model variant: "multilingual_v2" for highest quality and broadest language support, or Flash/Turbo for faster, cheaper generation. |
| stability | number | no | Voice stability, 0 to 1. Higher values produce more consistent, predictable delivery. Lower values introduce natural variation. Default is around 0.5. |
| similarity_boost | number | no | Similarity boost, 0 to 1. Higher values make the output more closely match the voice preset's defining characteristics. |
| style | number | no | Style exaggeration, 0 to 1. Higher values increase emotional expressiveness and dramatic delivery. Keep low for neutral, professional content. |
| use_speaker_boost | boolean | no | Enable speaker boost for enhanced clarity and vocal presence. Good for podcasts and voice-overs. |
| output_format | string | no | Audio output format. Default is "mp3_44100_128" (MP3, 44.1kHz, 128kbps). Other bitrate and format options available. |
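The character limit and 0-to-1 ranges above can be enforced client-side before a request ever leaves your machine. A minimal Python sketch (the helper name and behavior are illustrative; the app performs its own validation):

```python
# Illustrative client-side check of the documented input ranges.
MAX_CHARS = 40_000

def build_tts_input(text, voice=None, model=None, stability=None,
                    similarity_boost=None, style=None,
                    use_speaker_boost=None, output_format=None):
    """Return a request payload dict, rejecting out-of-range values."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    payload = {"text": text}
    # stability, similarity_boost, and style must lie in [0, 1]
    for name, value in [("stability", stability),
                        ("similarity_boost", similarity_boost),
                        ("style", style)]:
        if value is not None:
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")
            payload[name] = value
    # remaining parameters are passed through unchanged
    for name, value in [("voice", voice), ("model", model),
                        ("use_speaker_boost", use_speaker_boost),
                        ("output_format", output_format)]:
        if value is not None:
            payload[name] = value
    return payload
```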

output

The app returns:

  • audio — URL to the generated speech audio file hosted on inference.sh cloud storage. Format matches the output_format parameter.
  • output_meta — Metadata including character count processed, model used, and billing details.

pricing

| Model | Price |
| --- | --- |
| Flash / Turbo | $0.06 per 1,000 characters |
| Multilingual v2 | $0.12 per 1,000 characters |

A 500-word article (roughly 3,000 characters) costs about $0.18 with Flash or $0.36 with Multilingual v2. A full audiobook chapter of 10,000 characters costs $0.60 or $1.20 respectively.
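The arithmetic above generalizes to a one-line estimate. The rates are hard-coded from the pricing table, so adjust them if pricing changes:

```python
# USD per 1,000 characters, taken from the pricing table above.
RATES = {"flash": 0.06, "turbo": 0.06, "multilingual_v2": 0.12}

def estimate_cost(char_count, model="flash"):
    """Estimated USD cost of synthesizing char_count characters."""
    return char_count / 1000 * RATES[model]
```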

when to use this vs alternatives

Choose ElevenLabs TTS when you need the most natural-sounding voice synthesis available, fine-grained control over delivery style, or long-form content processing in a single request.

Choose browser-native TTS (Web Speech API) when you need zero-cost TTS for simple applications and can accept lower-quality, robotic output.

Choose OpenAI TTS when you are already in the OpenAI ecosystem and need good-enough quality with simpler parameter tuning.

Choose Google Cloud TTS when you need specific SSML markup control or integration with Google Cloud workflows.

FAQ

How long can the input text be?

Up to 40,000 characters per request. This is roughly 6,000-8,000 words — enough for a full article, podcast script, or audiobook chapter. For longer content, split into multiple requests at natural paragraph or chapter boundaries.
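Splitting at paragraph boundaries, as suggested, is easy to automate. A simple greedy sketch (the helper is illustrative and assumes no single paragraph exceeds the limit):

```python
def chunk_text(text, limit=40_000):
    """Greedily pack paragraphs into chunks no longer than `limit` chars."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # assumes one paragraph never exceeds the limit
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio files concatenated in order.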

What languages are supported?

The Multilingual v2 model supports dozens of languages including English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Japanese, Korean, Chinese, and many more. The Flash/Turbo models primarily support English with limited multilingual capability.

How do the stability and similarity parameters interact?

Stability controls how predictable the delivery is — high stability means consistent pacing and tone, low stability means more natural variation (which can sound more human but less controlled). Similarity boost controls how much the output matches the voice preset's signature characteristics. For most content, starting with stability at 0.7 and similarity at 0.75 gives good results.

Can I clone a custom voice?

The inference.sh integration uses ElevenLabs' preset voice library. Voice cloning requires a direct ElevenLabs account. The preset voices cover a wide range of ages, accents, and styles that suit most use cases.

What audio quality should I expect?

The default MP3 output at 44.1kHz/128kbps is broadcast-ready quality. The audio sounds natural enough for professional podcasts, video narration, and commercial applications, and in blind comparisons many listeners struggle to distinguish it from human-recorded speech.
