I thought the recording was real. The breath catches, the little hesitations before a difficult sentence, the way the energy dropped on a throwaway phrase. Sounded like somebody talking, not like a machine reading. Then I found out it was ElevenLabs v3 output, and something clicked about why everybody's been so excited about this model.
Then I cloned my own voice, ran it through v3, and got back a stranger. Same words, beautiful delivery, wrong person.
If you've tried ElevenLabs v3 with a cloned voice, you probably know this feeling. The model handles emotion and emphasis better than anything else on the market right now. It understands audio tags like [sighs], [whispers], [excited] and actually performs them convincingly. But the voice similarity drops. Sometimes a little, sometimes a lot. Your clone comes back sounding close but not right - like a cousin doing an impression of you.
This is a known issue. ElevenLabs themselves acknowledge that Professional Voice Clones are not fully optimized for v3 yet. Voices from the voice library also produce more variable results on v3 than on the v2 and v2.5 models. Community forums are full of people who love what v3 does with delivery but hate what it does to their voice.
There's a workaround, though. And it's surprisingly simple once you see it.
the problem with v3
To understand the fix, you need to understand what's actually happening. ElevenLabs v3 is a different architecture from v2. It was built for expressiveness first - 70+ languages, emotional control through inline tags, three rendering modes (Creative, Natural, Robust) that trade stability for emotional range. The model excels at making speech sound alive.
The tradeoff is voice fidelity. When v3 processes your cloned voice, it prioritizes the emotional interpretation of the text over strict adherence to your voice's characteristics. Stability settings help somewhat, but there's a ceiling. Push similarity too high and you introduce artifacts - inconsistent speed, mispronunciation, random volume changes. Keep it moderate and the voice drifts from the original.
v2, on the other hand, has the opposite strengths. Multilingual v2 and Flash v2.5 are extremely good at preserving voice identity. They reproduce your clone faithfully. The pacing is consistent. The timbre stays locked. But the delivery can feel flat. If your source text has emotional weight - a sad monologue, an excited announcement, a tense dialogue - v2 often reads it like a newsreader. Accurate, competent, missing the soul.
So you have two models: one that acts but doesn't look like you, and one that looks like you but doesn't act.
the two-step fix
The solution is to use both models in sequence, each doing what it's best at.
Step one: render with v3 for delivery. Take your text, run it through ElevenLabs v3 TTS with a voice that has broad emotional range. This doesn't need to be your cloned voice - any voice that responds well to the emotional tags will work. What matters here is the performance. The pauses, the emphasis, the breath patterns, the emotional arc. v3 will produce an audio file that sounds like a real human delivering your text with genuine feeling.
Step two: pass through v2 voice changer. Take that v3 audio output and feed it into the ElevenLabs Voice Changer, which runs on v2. The voice changer preserves everything about the delivery - timing, emphasis, emotional dynamics, pauses, breaths - but replaces the vocal identity with your clone. The words stay the same. The performance stays the same. Only the voice changes.
What comes out the other end is your voice, delivering the text with v3-quality emotion and v2-quality similarity. The best of both models.
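If you'd rather script the two steps than click through the UI, here's a rough sketch of the two requests. It only builds request descriptions and sends nothing. The endpoint paths, the `xi-api-key` header, and the `eleven_v3` model ID reflect my understanding of the public ElevenLabs REST API and are assumptions here, so check them against the current docs. The voice IDs, API key, and file path are placeholders.

```python
import json

# Assumed base URL for the ElevenLabs REST API; verify before use.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text: str, voice_id: str, api_key: str) -> dict:
    """Step one: v3 text-to-speech, used only for the performance."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/text-to-speech/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        # "eleven_v3" is an assumed model ID, not taken from this article.
        "body": json.dumps({"text": text, "model_id": "eleven_v3"}),
    }


def build_voice_changer_request(audio_path: str, clone_voice_id: str, api_key: str) -> dict:
    """Step two: v2 speech-to-speech, mapping the take onto the clone."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/speech-to-speech/{clone_voice_id}",
        "headers": {"xi-api-key": api_key},
        # Sent as multipart form data: the model ID plus the v3 audio file.
        "form": {"model_id": "eleven_multilingual_sts_v2"},
        "files": {"audio": audio_path},
    }
```

Note that the two requests target different voices: the first uses whichever expressive voice performs well on v3, and only the second uses your clone.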
why this actually works
The voice changer isn't just doing a simple pitch shift or formant adjustment. It's re-synthesizing the speech in the target voice while preserving the temporal and prosodic characteristics of the input. That's why v2's voice changer is so effective here - it was designed to maintain delivery while swapping identity, which is exactly what v3 struggles to do with a clone.
The key insight is that emotional delivery lives in timing, rhythm, and energy contours. Voice identity lives in timbre, formant structure, and pitch range. These are largely separable. v3 supplies the first set. v2's voice changer preserves that set while replacing the second.
adding an LLM step for automatic emotion tagging
Here's where it gets really interesting. v3's emotion tags are powerful but manual. Tagging a full script by hand is tedious work - you need to decide where someone sighs, where they pause, where the tone shifts from calm to frustrated. If you're producing audiobook chapters or podcast episodes, that tagging work adds up fast.
The solution is to add a language model in front of the TTS step. Feed your plain text to an LLM with instructions to analyze the emotional content and insert appropriate v3 audio tags. The model reads context, identifies emotional beats, and returns tagged text ready for v3 rendering.
A good system prompt for this step tells the LLM to act as an ElevenLabs v3 prompt engineer. It should understand the full tag vocabulary - emotional states like [excited], [nervous], [frustrated]; reactions like [sigh], [laughs], [gulps]; pacing controls like [pauses], [hesitates], [slows down]; volume cues like [whispering], [shouting]; and delivery styles like [sarcastically], [matter-of-fact], [dramatic].
The LLM doesn't need to be fancy. It just needs to read the text, understand the emotional subtext, and place tags where a voice director would give notes to an actor. "This line is frustrated." "Pause here." "This part should be quieter." That kind of direction, expressed as inline tags that v3 knows how to interpret.
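To make the tagging step concrete, here's a minimal sketch of a system prompt builder plus two sanity checks on whatever the LLM returns. The tag list is copied from the categories above. The prompt wording, the function names, and the rule that the LLM must not change any words are my own assumptions, not an official ElevenLabs prompt.

```python
import re

# Tag vocabulary from the categories listed above; extend as needed.
TAG_VOCABULARY = {
    "emotion": ["excited", "nervous", "frustrated"],
    "reaction": ["sigh", "laughs", "gulps"],
    "pacing": ["pauses", "hesitates", "slows down"],
    "volume": ["whispering", "shouting"],
    "style": ["sarcastically", "matter-of-fact", "dramatic"],
}

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def build_system_prompt(vocabulary: dict = TAG_VOCABULARY) -> str:
    """Assemble a system prompt (hypothetical wording) for the tagging LLM."""
    lines = [
        "You are an ElevenLabs v3 prompt engineer.",
        "Insert audio tags in square brackets where a voice director would give notes.",
        "Do not change, add, or remove any words.",
        "Return only the tagged text, with no explanations.",
        "Allowed tags:",
    ]
    for category, tags in vocabulary.items():
        lines.append(f"- {category}: " + ", ".join(f"[{t}]" for t in tags))
    return "\n".join(lines)


def unknown_tags(tagged_text: str, vocabulary: dict = TAG_VOCABULARY) -> list:
    """Return tags in the LLM output that are not in the vocabulary."""
    allowed = {t for tags in vocabulary.values() for t in tags}
    found = TAG_PATTERN.findall(tagged_text)
    return [t for t in found if t.strip().lower() not in allowed]


def words_preserved(original: str, tagged: str) -> bool:
    """Check that removing the tags gives back the original words."""
    stripped = TAG_PATTERN.sub("", tagged)
    return stripped.split() == original.split()
```

Running both checks before the TTS call catches the two most likely failures: an invented tag, or an LLM that quietly rewrote the script.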
The complete flow becomes three steps: LLM tags the text, v3 renders the tagged text with full emotional performance, v2 voice changer maps the result onto your cloned voice.
building the flow
You can set this up as an automated pipeline on inference.sh using flows. A flow chains multiple apps together, passing the output of one step as input to the next. We've published a ready-made version of this pipeline that you can duplicate and customize, or build your own from scratch.
The first node is a chat model running a specialized system prompt. We use Gemini 2.5 Flash for this step since it's fast and cheap, but any model works. The system prompt tells it to act as an ElevenLabs v3 prompt engineer - read the text, understand the emotional subtext, and insert audio tags where a voice director would give notes. Temperature should be low (around 0.3) so the tagging stays consistent across runs. The LLM outputs only the tagged text, no explanations or commentary.
The second node is elevenlabs/tts running on the v3 model with audio_tags enabled. The settings that matter: stability at 0 (maximum expressiveness), style at 1 (full emotional range), and similarity at 1. Pick a voice with broad emotional range for this step - it doesn't need to match your final voice since the next step handles identity. The voice just needs to be expressive enough to respond well to the emotion tags.
The third node is elevenlabs/voice-changer using the eleven_multilingual_sts_v2 model. This is where you set your actual target voice - the clone you want the final audio to sound like. It receives the v3 audio, preserves the full performance, and outputs it in your voice.
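As a rough illustration, the three nodes and their settings can be written down as plain data and chained with a small loop. This is not the inference.sh flow format: the field names, the `chat-model` app name, the voice placeholders, and the `run_app` callable are all hypothetical stand-ins for whatever actually executes each app.

```python
from typing import Any, Callable

# Hypothetical node descriptions. Setting values come from the text above;
# the key names are illustrative only.
FLOW = [
    {
        "app": "chat-model",  # placeholder for the LLM node
        "settings": {
            "temperature": 0.3,
            "system_prompt": "Act as an ElevenLabs v3 prompt engineer. "
                             "Output only the tagged text.",
        },
    },
    {
        "app": "elevenlabs/tts",
        "settings": {
            "model": "v3",
            "audio_tags": True,
            "stability": 0.0,   # maximum expressiveness
            "style": 1.0,       # full emotional range
            "similarity": 1.0,
            "voice": "<expressive-voice>",  # placeholder
        },
    },
    {
        "app": "elevenlabs/voice-changer",
        "settings": {
            "model": "eleven_multilingual_sts_v2",
            "voice": "<your-clone>",  # placeholder
        },
    },
]


def run_flow(flow: list, first_input: Any,
             run_app: Callable[[str, Any, dict], Any]) -> Any:
    """Feed each node's output into the next node, in order."""
    value = first_input
    for node in flow:
        value = run_app(node["app"], value, node["settings"])
    return value
```

Keeping the runner injectable means you can dry-run the chain with a fake `run_app` before spending any credits.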
One practical tip: for the voice you use in the v3 TTS node (the second node), pick one trained on long recordings that cover a wide emotional range. A voice trained on flat, monotone samples won't respond as well to the emotion tags, even on v3. The model needs reference material that demonstrates the vocal range you're asking for.
where this matters most
Audiobook production is the obvious use case. A full book means hours of narrated content where consistent voice identity and emotional range both matter. Flat TTS sounds lifeless after twenty minutes. Voice-drifted TTS breaks immersion the moment the narrator stops sounding like the narrator.
Podcast production benefits similarly. Interview-style shows where one voice introduces segments, reacts to clips, and transitions between topics need that natural variation in energy and tone. The LLM tagging step is particularly useful here since podcast scripts tend to shift between informational, conversational, and editorial modes - exactly the kind of emotional context the model can detect and tag.
Dubbing and localization is another strong fit. When you're translating content across languages, preserving the original performance's emotional character matters as much as translating the words. Run the translated script through the three-step flow and the LLM re-tags the emotional beats from the text, so you get the new language delivered with matching emotion, in the target voice.
Character dialogue for games and animation rounds it out. Voice actors are expensive and scheduling sessions is slow. The three-step flow lets you prototype all dialogue with emotional performances that actually sound like performances, not placeholder robot readings.
getting the best results
The quality of the voice you use in the v3 TTS step makes a real difference. ElevenLabs voices trained on longer recordings with natural emotional variety - laughing, whispering, raising their voice, speaking softly - give v3 more to work with. A voice trained on a two-minute clip of someone reading a grocery list will produce flat output no matter how many emotion tags you throw at it.
For the LLM tagging step, less is usually more. Over-tagging produces chaotic audio where every sentence has a different emotional register. Real speech has long stretches of consistent tone with occasional shifts. Tell your LLM to tag conservatively - mark the emotional peaks and valleys, not every word.
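One way to keep the LLM honest about tagging conservatively is a quick density check on its output. The sketch below counts tags per sentence. The sentence splitter is deliberately crude (it will miscount abbreviations like "Dr."), and the 0.5 threshold is an arbitrary starting point of mine, not a recommended value.

```python
import re

TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def tag_density(tagged_text: str) -> float:
    """Average number of audio tags per sentence."""
    tag_count = len(TAG_PATTERN.findall(tagged_text))
    # Remove tags first so their contents never affect sentence splitting.
    plain = TAG_PATTERN.sub("", tagged_text)
    sentences = [s for s in SENTENCE_END.split(plain) if s.strip()]
    return tag_count / max(len(sentences), 1)


def is_over_tagged(tagged_text: str, max_tags_per_sentence: float = 0.5) -> bool:
    """Flag output that has more tags per sentence than the threshold."""
    return tag_density(tagged_text) > max_tags_per_sentence
```

If a chunk comes back over-tagged, the cheapest fix is usually to re-run the tagging step with a stricter instruction rather than pruning tags by hand.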
The voice changer step works best with clean v3 output. If v3 produced artifacts - clicks, weird pauses, volume spikes - the voice changer will faithfully reproduce those artifacts in your voice. Listen to the v3 output before passing it through. If a generation sounds off, regenerate it. The cost of an extra TTS call is tiny compared to the cost of editing bad audio downstream.
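The regenerate-before-converting habit is easy to wire in as a small retry loop. In this sketch both `render` and `is_clean` are hypothetical callables you supply: `render` produces a fresh v3 take, and `is_clean` is whatever review you trust, which can simply be a person listening and answering yes or no.

```python
from typing import Callable, Optional, Tuple


def render_until_clean(
    render: Callable[[], bytes],
    is_clean: Callable[[bytes], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[bytes], int]:
    """Regenerate the v3 take until one passes review or attempts run out.

    Returns (audio, attempts_used). audio is None if no take passed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        audio = render()
        if is_clean(audio):
            return audio, attempt
    return None, max_attempts
```

Only a take that passes review should go on to the voice changer. If every attempt fails, the script or the tags are the more likely culprit than bad luck.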
the bigger picture
This workaround exists because ElevenLabs v3 is genuinely new territory. No other TTS model offers this level of emotional control through inline tags. The voice similarity issues are growing pains, and ElevenLabs will almost certainly optimize PVC support for v3 over time. When they do, this two-step approach might become unnecessary.
Until then, the v3-to-v2 pipeline gives you access to the best emotional TTS on the market without sacrificing the voice identity you've spent time building. It's the kind of trick that sounds complicated in explanation but takes about five minutes to set up as a flow - and once it's running, you just feed in text and get back audio that sounds like you on your best day.
does the voice changer preserve all of v3's emotion tags?
The voice changer preserves the performance, not the tags themselves. By the time the audio reaches the voice changer, the tags have already been rendered into actual speech patterns - pauses, breaths, whispers, emphasis shifts. The voice changer treats these as characteristics of the input audio and maintains them while swapping the voice. Subtle effects like barely-audible sighs or very quiet whispers may lose some definition in the transfer, but the overall emotional arc stays intact. For best results, avoid stacking too many simultaneous effects in a single phrase.
which v3 mode should I use for the TTS step?
Creative mode gives you the most emotional range and responds most strongly to audio tags, but it can occasionally hallucinate sounds or insert unexpected vocalizations. Natural mode is the safe default - it responds well to tags while staying predictable. Robust mode is too stable for this workflow since the whole point is getting emotional delivery. Start with Natural and switch to Creative if you need more dramatic performances. If Creative produces artifacts, regenerate rather than falling back to Robust.
do I need a professional voice clone or does instant cloning work?
Instant Voice Cloning works for the voice changer step since you're using v2, which handles IVCs well. The v3 step doesn't use your clone at all, so PVC vs. IVC only matters for the final voice changer pass. That said, longer and higher-quality recordings always produce better clones. If you're doing production work rather than experimenting, investing in a Professional Voice Clone for the voice changer step will give you noticeably better identity matching.