dia-tts

Dia TTS - Generate realistic dialogue with emotion control, natural nonverbals, and voice cloning

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get falai/dia-tts
# run
$ belt app run falai/dia-tts

There's a conversation that happens in every project where synthetic speech comes up. Someone demos ElevenLabs, everyone agrees it sounds incredible, and then someone else opens a spreadsheet. If you're generating a few minutes of audio per day, ElevenLabs is a rounding error. If you're generating thousands of audio clips for an e-learning platform, narrating an entire product catalog, or spinning up conversational content at scale, the per-character costs add up to a line item that makes finance teams ask questions.

This is where open-source TTS models become genuinely interesting. Not as curiosities or proof-of-concept toys, but as practical tools that solve real production problems at a fraction of the cost. I want to talk about two models available on inference.sh that represent the best of what open-source speech synthesis can do right now: Kokoro TTS and Dia TTS. They're different tools for different jobs, and being honest about where they shine and where they don't is more useful than pretending they compete with ElevenLabs on every dimension.

kokoro tts: the workhorse for bulk narration

Kokoro is a lightweight, 82-million-parameter text-to-speech model built on the StyleTTS 2 architecture with an ISTFTNet vocoder. Released in December 2024 under the Apache 2.0 license, it was trained at remarkably low compute cost. It does one thing well: it turns text into clean, intelligible speech quickly and cheaply. It costs a fraction of what ElevenLabs charges - not marginal savings, but the difference between a project being economically viable and not.

The model supports several languages - primarily American English and British English, with additional support for Japanese, Mandarin, French, and Korean, though non-English voices are more limited in variety. It offers a selection of voice presets. You pick a voice, set a speed, and feed it text. The output is perfectly serviceable speech that won't make anyone think they're listening to a human, but also won't make them cringe. It sits in that middle zone where the technology is clearly synthetic but not distractingly so. For many applications, that's exactly the right spot.

I think the best way to understand Kokoro is to think about where voice quality lives on your priority list. If you're building an audiobook that someone will listen to for six hours straight, every imperfection in prosody compounds into listener fatigue, and you should be using ElevenLabs. But if you're generating audio previews for a content management system, creating accessibility narration for documentation, producing voice prompts for an IVR system, or building any application where the audio is functional rather than experiential, Kokoro handles it cleanly.

The speed parameter is worth understanding. Adjusting the speaking rate at generation time means the model produces audio that sounds natural at the requested pace, rather than simply compressing or stretching a fixed-rate output. That matters for accessibility use cases where users need slower delivery, or for notification-style audio where faster delivery respects the listener's time.
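To make the shape of a request concrete, here's a minimal sketch through the inferencesh client. The app id falai/kokoro-tts and the voice and speed parameter names are assumptions for illustration only - check the real schema with belt app get before relying on them.

python
from inferencesh import inference

client = inference()

# hypothetical kokoro request; the app id, "voice", and "speed"
# parameter names below are assumptions, not confirmed schema
result = client.run({
    "app": "falai/kokoro-tts",  # assumed app id
    "input": {
        "text": "your order has shipped and will arrive tomorrow.",
        "voice": "af_bella",  # assumed voice preset name
        "speed": 0.85,  # assumed: values below 1.0 slow delivery
    }
})

print(result["output"])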

where kokoro genuinely excels

Latency is Kokoro's quiet advantage. Because the model is lightweight, generation happens fast. When you're integrating TTS into an application where response time matters - think interactive voice systems, real-time notifications, or any pipeline where audio generation is one step in a longer chain - the speed difference between Kokoro and heavier models becomes meaningful. Not every application can afford to wait several seconds for premium-quality audio when a quick, clean output would do.

The cost story is even more compelling at scale. Consider a scenario where you're narrating ten thousand product descriptions for an e-commerce platform. With Kokoro, the total cost is a fraction of what you'd pay with ElevenLabs Multilingual v2. The Kokoro approach lets you regenerate content freely, iterate on scripts without worrying about cost, and treat audio generation as something close to disposable. That changes how you design your workflow. You stop optimizing for fewer generation calls and start optimizing for better content.

The multi-language support is functional rather than exceptional. Kokoro handles its supported languages - English, Japanese, Mandarin, French, and Korean - competently, and the language parameter gives you explicit control over pronunciation. Non-English support can be thinner due to limited training data and G2P coverage, and some languages have only one or two voice options. It won't seamlessly code-switch mid-paragraph the way ElevenLabs Multilingual v2 does, but for content that's cleanly in one language, it delivers solid results.

dia tts: multi-speaker dialogue that actually works

Dia is a fundamentally different proposition from Kokoro, and I think it's the more interesting of the two models from a capability standpoint. Built by Nari Labs - a small South Korean startup founded by two undergraduate students including Toby Kim - Dia is a 1.6-billion-parameter model released under the Apache 2.0 license. Its architecture draws from SoundStorm, Parakeet, and the Descript Audio Codec. Where most TTS systems - including ElevenLabs - treat speech as a single-voice problem, Dia is built from the ground up for dialogue. It handles multiple speakers natively using a simple markup format where you tag text with speaker identifiers.

This matters more than it might seem at first. The traditional approach to multi-speaker audio is to generate each speaker's lines separately with different voice settings, then stitch the clips together in post-production. The results are technically correct but emotionally flat. Real conversations have rhythm. One speaker's energy affects the next speaker's response. Interruptions have a cadence. Dia captures some of those conversational dynamics because it generates both speakers in a single pass, maintaining the temporal relationships between turns.

Dia is significantly cheaper than ElevenLabs Flash. Both the fal.ai-hosted version and the infsh-hosted variant are available, sharing the same underlying model and capabilities, giving you flexibility in how you route your requests.

The emotion control and natural nonverbals are where Dia gets genuinely surprising. The model can produce laughter, hesitations, and the kind of small vocal fillers that make dialogue sound lived-in rather than recited. I don't want to oversell this. It's not indistinguishable from recorded conversation. But it's noticeably ahead of what you get from generating two separate TTS tracks and splicing them together.
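To make that concrete, here's a minimal sketch of a dialogue request. The [S1]/[S2] speaker tags and (laughs)/(sighs) markers come straight from the input schema documented below; the script itself is invented for illustration.

python
from inferencesh import inference

client = inference()

# [S1]/[S2] mark speaker turns; (laughs), (sighs) insert nonverbals
script = (
    "[S1] did you actually read the whole spec? "
    "[S2] (laughs) i skimmed it. (sighs) fine, i'll read it tonight. "
    "[S1] that's what you said last week."
)

result = client.run({
    "app": "falai/dia-tts",
    "input": {"text": script}
})

print(result["output"]["audio"])  # the generated audio file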

the voice cloning angle

Dia supports voice cloning through reference audio, which opens up some interesting production possibilities. You provide a sample of a target voice along with its transcript, and the model adapts its output to match the vocal characteristics. This is useful for maintaining character consistency across long-form content, or for creating dialogue that sounds like specific speakers rather than generic presets.

Dia supports zero-shot voice cloning from just seconds of reference audio, though the quality of voice cloning depends heavily on what you provide. Clean recordings produce noticeably better results than short or noisy samples. This is true of every cloning system I've tested, not a Dia-specific limitation. The practical takeaway is that if voice cloning is central to your use case, invest a few minutes in capturing a good reference sample. It pays for itself immediately in output quality.
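Per the schema's ref_audio and ref_text fields, a cloning request looks roughly like this; the file path and transcript are placeholders.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "falai/dia-tts",
    "input": {
        # a clean reference sample; local paths are uploaded automatically
        "ref_audio": "/path/to/reference.wav",
        # transcript of the reference audio, required when cloning
        "ref_text": "[S1] this is a short, clean sample of the target voice.",
        # new dialogue to render in the cloned voice
        "text": "[S1] welcome back. here's what changed since your last visit.",
    }
})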

who should use what and when

I've spent enough time with both models to have a clear framework for when each one makes sense, and when you should just pay for ElevenLabs.

Kokoro is the right choice when cost is the primary constraint and the audio serves a functional purpose. Documentation narration, content accessibility, notification systems, audio previews, prototype development where you need speech output but don't need it to be perfect. If someone will listen to the audio for under a minute and then move on, Kokoro is almost certainly good enough.

Dia is the right choice when you need multi-speaker dialogue and building a manual pipeline from single-speaker TTS would be painful. Podcast generation is the obvious use case. So are audiobooks with distinct characters, conversational AI demos, educational content with student-teacher dynamics, and any format where the back-and-forth between voices is part of the experience. Dia handles this natively in a way that no single-speaker model can match without significant engineering overhead.

ElevenLabs remains the right choice when voice quality is paramount and the listener experience is the product. Marketing content, premium audiobooks, brand voice applications, anything where a discerning listener will spend significant time with the output. The quality gap is real. In a direct A/B comparison, most people will prefer ElevenLabs. The question is whether that preference matters enough for your specific application to justify the cost difference.

the honest tradeoffs

I want to be direct about what you lose with these open-source models, because pretending the tradeoffs don't exist helps no one.

Prosody is the biggest gap. ElevenLabs has a feel for emphasis, pacing, and emotional arc that Kokoro and Dia don't match. Sentences from ElevenLabs rise and fall in ways that feel considered. Kokoro's delivery is more even, more predictable. It reads competently but doesn't perform. Over short passages this barely registers. Over long passages, the difference becomes cumulative.

Voice variety is another area where ElevenLabs leads. Its voice library offers a range of personas that feel distinct and characterful. Kokoro and Dia have more limited voice selections. This matters less if you've found one or two voices that work for your application, but it limits your options for projects that need diverse vocal identities.

The parameter space is simpler with Kokoro and Dia, which is honestly both a limitation and an advantage. ElevenLabs gives you stability, similarity boost, style exaggeration, and speaker boost - knobs that let experienced users dial in exactly the sound they want. Kokoro gives you speed and voice selection. Dia gives you speaker tags and optional voice cloning. Less control, but also less time spent tuning. For teams without audio production experience, the simpler interface means fewer ways to get a bad result.

the tiered approach

At low volumes - a single article per day - the cost difference between Kokoro and ElevenLabs is negligible. But at scale - hundreds of articles daily across a media company - the gap becomes significant enough to drive strategic decisions. The quality premium on every single clip needs to justify the cost difference, and for many organizations, it doesn't.

The sweet spot I keep coming back to is a tiered approach. Use ElevenLabs for customer-facing, high-value audio where quality directly affects perception. Use Kokoro or Dia for internal, high-volume, or functional audio where "good enough" genuinely is good enough. Most organizations that generate significant amounts of synthetic speech will benefit from having both tiers available rather than committing entirely to one end of the cost-quality spectrum.

what's next for open-source tts

The trajectory here is clear and moving in one direction. Open-source TTS models have improved dramatically in the past two years, and the pace isn't slowing. Kokoro represents the current state of lightweight, fast, cheap speech synthesis. Dia represents something newer - the idea that dialogue and multi-speaker interaction should be first-class capabilities rather than afterthoughts bolted onto single-speaker systems.

I expect the quality gap between open-source and commercial models to continue narrowing. Whether it closes entirely is harder to predict. ElevenLabs is also improving, and they have the advantage of revenue funding dedicated research teams. But the structural economics favor open-source adoption for cost-sensitive, high-volume use cases, and that's a market reality that isn't going away.

For now, having Kokoro and Dia available alongside ElevenLabs means you can make decisions based on what each project actually needs rather than defaulting to the most expensive option for everything. That's the kind of flexibility that compounds into real savings over time.

FAQ

how much worse do kokoro and dia actually sound compared to elevenlabs?

The gap is noticeable but context-dependent. In short clips under thirty seconds, casual listeners might not distinguish Kokoro from a lower-tier ElevenLabs model. Over longer passages, the prosody differences become clearer - ElevenLabs handles emphasis and pacing with more subtlety. Dia is harder to compare directly because its multi-speaker capability doesn't have an exact ElevenLabs equivalent. For dialogue, Dia's native two-speaker generation sounds more natural than manually stitching together two separate ElevenLabs outputs. The honest summary: ElevenLabs wins on single-voice quality, but the margin shrinks when cost efficiency and specific capabilities like dialogue generation enter the equation.

when does dia tts make more sense than running two separate elevenlabs calls?

Dia is the stronger choice whenever your content is conversational. Generating a ten-minute podcast episode with two speakers through ElevenLabs means running each speaker's lines independently, timing the pauses between turns, and assembling the result. With Dia, you submit the full script with speaker tags and get back a single audio file where the conversational timing is handled by the model. The result sounds more natural because the speakers' energy and pacing respond to each other. Beyond the quality benefit, it's also simpler to implement. One generation call instead of a multi-step pipeline with alignment logic.

can i use kokoro for real-time voice applications?

Kokoro's lightweight architecture makes it one of the faster TTS options available, which is an advantage for near-real-time use cases like interactive voice responses or in-app audio feedback. That said, "real-time" in the conversational AI sense - where the first audio chunk needs to start playing within 200 milliseconds - requires streaming output, and the serverless inference model returns complete audio files. For strict real-time conversational applications, you'd want a streaming-capable setup. For everything else where "fast" means a response in under a second or two, Kokoro handles it well and the low per-character cost means you're not paying a premium for that speed.

api reference

about

dia tts - generate realistic dialogue with emotion control, natural nonverbals, and voice cloning

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "falai/dia-tts",
    "input": {"text": "[S1] hello from dia."}  # text is the only required input
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "falai/dia-tts",
    "input": {"text": "[S1] hello from dia."}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "falai/dia-tts",
    "input": {
        "text": "[S1] speak this in the cloned voice.",
        "ref_audio": "/path/to/local/voice.wav",  # local path: detected & uploaded
        "ref_text": "[S1] transcript of the reference audio.",
        # a url like "https://example.com/voice.mp3" is passed through as-is
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "falai/dia-tts",
    "input": {"text": "[S1] hello from dia."},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
id          string  task id
status      number  terminal status (9=completed, 10=failed, 11=cancelled)
output      object  task output (when completed)
error       string  error message (when failed)
session_id  string  session id (if using sessions)
created_at  string  iso timestamp
updated_at  string  iso timestamp
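As an illustration of the receiving side, here's a minimal sketch of an endpoint that handles this payload. flask is an arbitrary choice here; any server that accepts a JSON POST works.

python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_task():
    task = request.get_json()
    if task["status"] == 9:  # completed
        print("output:", task["output"])
    elif task["status"] == 10:  # failed
        print("error:", task["error"])
    return "", 200  # acknowledge receipt

if __name__ == "__main__":
    app.run(port=8000)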

5. schema

input

ref_audio  string (file)

reference audio file for voice cloning. if provided, ref_text is also required.

ref_text  string

transcript of the reference audio. required when using voice cloning.

text  string (required)

the text to convert to speech. use [s1], [s2] for multi-speaker dialogue and (laughs), (sighs) for nonverbals.

output

audio  string (file) (required)

audio

output_meta  object

structured metadata about inputs/outputs for pricing calculation
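Putting the schema together, an end-to-end request and download looks roughly like this. Saving the result assumes the audio field comes back as a fetchable url, which matches the string (file) type above.

python
import urllib.request
from inferencesh import inference

client = inference()

result = client.run({
    "app": "falai/dia-tts",
    "input": {"text": "[S1] ready to try it? [S2] (laughs) let's go."}
})

# save the generated audio locally (assumes the output is a url)
urllib.request.urlretrieve(result["output"]["audio"], "dialogue.wav")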

