How to Build Multilingual Text-to-Speech in Your App with the ElevenLabs API (Step-by-Step + Code)

A practical, step-by-step guide to implementing multilingual text-to-speech in an app using the ElevenLabs API. Includes language detection, voice selection strategy, streaming vs. file generation, Node.js and Python code examples, and production tips for latency, caching, and quality.

A production-friendly approach is to accept text (and optionally a language), detect the language when needed, map languages to voice IDs, call the ElevenLabs TTS endpoint, and return audio as a file or stream. Add basics like caching, retries, and fallback voices to avoid common edge cases in real apps.

The simplest and most common pattern is one language per request, where your UI or user settings already know the language. Mixed-language text is harder and usually requires splitting the text by language and stitching the audio back together.

Maintain a predictable mapping like `en → voice A`, `es → voice B`, `fr → voice C` and route requests by language. For a consistent product persona across locales, use voices that match tone, age, and cadence across languages.

Generating a complete audio file is simpler and works well for short snippets, and it’s easier to cache. Streaming is better for long content because it reduces time-to-first-audio, but it requires slightly more client/server integration.

Language detection is optional if your app already knows the language from locale or user preference. If you need detection, you can use a lightweight language detection library, but short inputs like “Hi” are unreliable, so you should fall back to user preference or a default language.

Create an endpoint (e.g., Express) that accepts `{ text, lang }`, chooses a voice ID from a language map, and POSTs to `https://api.elevenlabs.io/v1/text-to-speech/{voiceId}` with your API key. Set `Accept: audio/mpeg` and return the response bytes as `audio/mpeg`.

In Python, pick a voice ID from a `lang → voiceId` mapping, then send a POST request to `https://api.elevenlabs.io/v1/text-to-speech/{voiceId}` with a JSON body containing `text` and optional `voice_settings`. Save the response content as an MP3 file.

Normalize text before synthesis (whitespace cleanup, quote normalization, and expanding tricky abbreviations) to improve quality. Cache outputs using a hash of text + voice + settings, add 1–2 retries with exponential backoff, and define fallback language/voice when detection or routing fails.

If language detection fails or a language has no mapped voice, use a fallback language and voice (often English or your primary market) so requests still succeed. Log these events so you can fix language routing or add the missing voice mapping later.

Multilingual text-to-speech (TTS) sounds simple—until you ship it.

In a real app, you need to handle mixed-language content, select appropriate voices, keep latency low, manage audio formats, and avoid awkward edge cases (like punctuation pauses or mid-sentence language switching). This guide walks through a clean, production-friendly approach to building multilingual TTS using the ElevenLabs API platform, with concrete code in both Node.js and Python.

What you’ll build

By the end, you’ll have a working pipeline that:

1. Accepts text (and optionally a requested language)

2. Detects language when needed

3. Selects a voice per language (or a multilingual voice strategy)

4. Generates speech through the API

5. Returns audio to the client (as a file or stream)

6. Adds basic production hardening (caching, retries, and fallbacks)

---

Step 0: Decide how “multilingual” your TTS must be

Before writing code, pick one of these implementation patterns:

Pattern A: One language per request (most common)

- Your UI already knows the language (e.g., user preference, locale)

- Each request is a single language

- Best quality and simplest routing

Pattern B: Mixed languages within the same text (harder)

- Example: English paragraph with Spanish names or quotes

- Usually requires **splitting the text by language** and stitching audio

- More work, but more natural results if done well

This article focuses on Pattern A (the best default), with notes on how to extend to Pattern B.

---

Step 1: Get an API key and set up your environment

You’ll need an API key and a server-side environment (recommended) to protect it.

- Read the ElevenLabs developer documentation for Text to Speech to confirm endpoints, authentication, and available parameters.

- Store your key in an environment variable:

```bash
export ELEVENLABS_API_KEY="your_key_here"
```

---

Step 2: Choose a voice strategy (per language)

A practical approach is to maintain a mapping:

- `en` → voice A

- `es` → voice B

- `fr` → voice C

This makes behavior predictable and avoids surprising voice changes.

**Tip:** If you’re doing localization at scale, create voices that match the same persona (tone, age, cadence) across languages so your product feels consistent.

Example mapping (pseudo):

```js
// Placeholder IDs: replace each value with a real voice ID from your account.
const VOICE_BY_LANG = {
  en: "VOICE_ID_EN",
  es: "VOICE_ID_ES",
  fr: "VOICE_ID_FR",
  de: "VOICE_ID_DE",
};
```

If you don’t want to maintain a voice per language, you can route several languages to the same voice, provided the model you call supports those languages. Just keep expectations realistic: some voices handle some languages better than others, so test each language before you ship it.

---

Step 3: Decide whether you need streaming audio

Two common delivery modes:

Generate a complete audio file

- Simple server implementation

- Good for short content (e.g., notifications, snippets)

- Easier to cache

Stream audio

- Lower time-to-first-audio (better UX)

- Good for long content (articles, lessons)

- Slightly more complex client/server integration

If your app reads long passages, consider streaming from day one.
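
The streaming mode can be sketched with the standard library alone. This assumes the API exposes a streaming variant of the endpoint at `/v1/text-to-speech/{voice_id}/stream`; verify the path and request fields against the current docs. `stream_tts` and its `opener` parameter are illustrative names, not part of any SDK.

```python
import json
import urllib.request
from typing import Callable, Iterator

API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def stream_tts(
    text: str,
    voice_id: str,
    api_key: str,
    chunk_size: int = 4096,
    opener: Callable = urllib.request.urlopen,  # injectable so this can be tested offline
) -> Iterator[bytes]:
    """Yield MP3 bytes as they arrive instead of buffering the whole file."""
    request = urllib.request.Request(
        f"{API_BASE}/{voice_id}/stream",  # assumed streaming path; confirm in the docs
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )
    with opener(request, timeout=60) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

A web handler can forward each yielded chunk to the client as soon as it arrives, which is where the lower time-to-first-audio comes from.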

---

Step 4: Implement language detection (optional but useful)

If your app doesn’t already know the language, detect it.

Options:

- Frontend locale + user setting (fast and reliable)

- Lightweight language detection library (good enough for many cases)

- LLM-based classification (overkill for most TTS routing)

**Important:** Language detection is unreliable on very short strings (“Hi”, “OK”). In production, treat short inputs as “use user preference” or fall back to a default.
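
One way to encode the short-input rule is a small resolver that trusts detection only above a length threshold. This is a sketch under assumptions: `resolve_language`, `SUPPORTED_LANGS`, and the 20-character threshold are illustrative, and `detect` stands in for whichever detection library you choose.

```python
from typing import Callable, Optional

SUPPORTED_LANGS = {"en", "es", "fr"}  # illustrative; mirror your voice map
MIN_DETECT_CHARS = 20  # assumed threshold; tune it on your own data


def resolve_language(
    text: str,
    user_pref: Optional[str] = None,
    detect: Optional[Callable[[str], str]] = None,
    default: str = "en",
) -> str:
    """Return a language code, trusting detection only on long-enough input."""
    stripped = text.strip()
    if detect is not None and len(stripped) >= MIN_DETECT_CHARS:
        try:
            guess = detect(stripped)
        except Exception:
            guess = None  # detectors can raise on odd input; fall through
        if guess in SUPPORTED_LANGS:
            return guess
    # Short, undetectable, or unsupported: use the user's preference, then the default.
    if user_pref in SUPPORTED_LANGS:
        return user_pref
    return default
```

If your UI already sends an explicit language, skip this function entirely and use that value.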

---

Step 5: Node.js example (generate speech and return audio)

Below is a minimal Express route that:

- Accepts `{ text, lang }`

- Picks a voice based on `lang`

- Calls the ElevenLabs TTS endpoint

- Returns an audio payload

> Note: Endpoint paths and request fields can evolve—confirm the latest request shape in docs.

```js
// Requires Node 18+ (global fetch) and ES modules ("type": "module" in package.json).
import express from "express";

const app = express();
app.use(express.json());

const ELEVENLABS_API_KEY = process.env.ELEVENLABS_API_KEY;

const VOICE_BY_LANG = {
  en: process.env.VOICE_ID_EN,
  es: process.env.VOICE_ID_ES,
  fr: process.env.VOICE_ID_FR,
};

// Fall back to the English voice when the requested language has no mapping.
function pickVoiceId(lang) {
  return VOICE_BY_LANG[lang] || VOICE_BY_LANG.en;
}

app.post("/tts", async (req, res) => {
  const { text, lang = "en" } = req.body ?? {};

  if (!text || typeof text !== "string") {
    return res.status(400).json({ error: "Missing 'text'" });
  }

  const voiceId = pickVoiceId(lang);
  if (!voiceId) {
    // No voice configured at all (e.g., VOICE_ID_EN is unset).
    return res.status(500).json({ error: "No voice configured" });
  }

  try {
    const url = `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`;
    const r = await fetch(url, {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
      },
      body: JSON.stringify({
        text,
        // Many integrations also include voice_settings.
        // Tune stability/similarity to match your product voice.
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
        },
      }),
    });

    if (!r.ok) {
      // Pass the upstream status and message through to the caller.
      const msg = await r.text();
      return res.status(r.status).send(msg);
    }

    const audioBuffer = Buffer.from(await r.arrayBuffer());

    // Return as MP3
    res.setHeader("Content-Type", "audio/mpeg");
    res.setHeader("Cache-Control", "no-store");
    return res.status(200).send(audioBuffer);
  } catch (err) {
    console.error("TTS generation failed:", err);
    return res.status(500).json({ error: "TTS generation failed" });
  }
});

app.listen(3000, () => console.log("Listening on http://localhost:3000"));
```

Client usage (quick test)

```bash
curl -X POST http://localhost:3000/tts \
  -H 'Content-Type: application/json' \
  -d '{"text":"Hello! This is an English sample.","lang":"en"}' \
  --output out.mp3
```

---

Step 6: Python example (same idea)

This Python snippet does the same thing server-side and writes an MP3 file.

```python
import os

import requests

ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]

VOICE_BY_LANG = {
    "en": os.environ.get("VOICE_ID_EN"),
    "es": os.environ.get("VOICE_ID_ES"),
    "fr": os.environ.get("VOICE_ID_FR"),
}


def pick_voice_id(lang: str) -> str:
    # Fall back to the English voice when the language has no mapping.
    return VOICE_BY_LANG.get(lang) or VOICE_BY_LANG["en"]


def text_to_speech(text: str, lang: str = "en", out_path: str = "out.mp3"):
    voice_id = pick_voice_id(lang)
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
    }

    payload = {
        "text": text,
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
        },
    }

    r = requests.post(url, headers=headers, json=payload, timeout=60)
    r.raise_for_status()  # surface HTTP errors instead of writing a bad file

    with open(out_path, "wb") as f:
        f.write(r.content)


if __name__ == "__main__":
    text_to_speech("Bonjour ! Ceci est un exemple en français.", "fr", "fr.mp3")
```

---

Step 7: Production essentials (what top tutorials often skip)

1) Normalize text before sending it to TTS

A small cleanup pass improves quality:

- Collapse repeated whitespace

- Convert smart quotes if your content pipeline mixes encodings

- Expand tricky abbreviations (e.g., “Dr.”, “St.”) if you notice misreads
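
A minimal sketch of such a cleanup pass follows. The function name and the abbreviation table are illustrative assumptions; the right expansion often depends on context ("St." can be "Saint" or "Street"), so treat the table as something you grow from observed misreads.

```python
import re

# Illustrative, English-only expansions; the correct choice can depend on context.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Saint"}

# Map curly quotes to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_for_tts(text: str) -> str:
    text = text.translate(SMART_QUOTES)
    for short, full in ABBREVIATIONS.items():
        # Match the abbreviation only as a standalone token, not inside a word.
        text = re.sub(rf"(?<!\w){re.escape(short)}(?=\s|$)", full, text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Run the same normalization before computing cache keys, so that cosmetically different inputs share one cached audio file.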

2) Cache outputs aggressively

Multilingual apps often repeat the same phrases (UI strings, help prompts).

A simple cache key:

- `hash(text + voiceId + settings + modelVersion)`

Store:

- MP3 in object storage

- metadata in Redis/DB
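
The cache key can be sketched as below. Plain string concatenation can collide (`"ab" + "c"` equals `"a" + "bc"`), so this version hashes a JSON document with sorted keys instead. `tts_cache_key` and its parameter names are illustrative.

```python
import hashlib
import json


def tts_cache_key(text: str, voice_id: str, settings: dict, model_version: str) -> str:
    """Return a stable hex digest for one (text, voice, settings, model) combination."""
    payload = json.dumps(
        {"text": text, "voice_id": voice_id, "settings": settings, "model": model_version},
        sort_keys=True,          # dict ordering must not change the key
        ensure_ascii=False,      # keep non-Latin text readable before encoding
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The digest works as an object-storage filename (for example `<digest>.mp3`) and as the lookup key for the metadata record.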

3) Add retries for transient failures

Network calls fail. Your UX shouldn’t.

- Retry 1–2 times with exponential backoff

- Use idempotent cache keys to avoid duplicate generation
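
A retry wrapper along these lines is one option. `TransientError` is a hypothetical exception type: your HTTP layer would raise it for timeouts, rate limits, and 5xx responses, while permanent failures such as a 401 should not be retried.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429, 5xx)."""


def with_retries(
    call: Callable[[], T],
    retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Run call(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == retries:
                raise  # out of attempts: let the caller handle it
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("retries must be >= 0")
```

Checking the cache before each attempt keeps a retry from generating, and paying for, audio that a previous attempt already stored.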

4) Pick a fallback language and fallback voice

If language detection fails or a voice ID is missing:

- Fall back to `en` (or your primary market)

- Log the event so you can fix routing
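
The fallback rule can be sketched as a lookup that reports which language it actually used. The function name, the logger name, and the log message format are assumptions for illustration.

```python
import logging
from typing import Mapping, Optional, Tuple

logger = logging.getLogger("tts.routing")


def pick_voice_with_fallback(
    lang: Optional[str],
    voice_by_lang: Mapping[str, Optional[str]],
    fallback_lang: str = "en",
) -> Tuple[str, str]:
    """Return (language_used, voice_id), falling back when the mapping is missing."""
    voice_id = voice_by_lang.get(lang) if lang else None
    if voice_id:
        return lang, voice_id
    fallback_voice = voice_by_lang.get(fallback_lang)
    if not fallback_voice:
        # Nothing sensible to fall back to: this is a configuration error.
        raise LookupError(f"No voice configured for fallback language {fallback_lang!r}")
    logger.warning("tts_voice_fallback requested=%r used=%r", lang, fallback_lang)
    return fallback_lang, fallback_voice
```

Returning the language that was used lets the caller record it alongside the cached audio, which makes fallback events easy to count later.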

5) Validate multilingual quality per target language

Some languages are inherently harder for certain voices.

If you’re shipping Chinese, for example, plan extra evaluation—tone, pacing, and pronunciation consistency can vary. Build a small test suite of sentences per language, and run it whenever you change voices or settings.
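
A per-language suite might start as small as the sketch below. It only catches hard failures (errors and suspiciously small outputs); judging tone and pronunciation still requires listening. The sentences, the `synthesize` callable, and the 1024-byte threshold are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative sentences; replace them with phrases your product actually speaks.
LANGUAGE_SUITE: Dict[str, List[str]] = {
    "en": ["Your order has shipped.", "Dr. Lee will see you at 3:30 PM."],
    "es": ["Tu pedido ha sido enviado.", "¿Quieres continuar la lección?"],
    "zh": ["您的订单已发货。", "请在下午三点半到达。"],
}


def run_language_suite(
    synthesize: Callable[[str, str], bytes],  # hypothetical: (text, lang) -> audio bytes
    suite: Dict[str, List[str]] = LANGUAGE_SUITE,
    min_bytes: int = 1024,
) -> List[Tuple[str, str, str]]:
    """Return (lang, sentence, reason) for every sentence that failed to synthesize."""
    failures = []
    for lang, sentences in suite.items():
        for sentence in sentences:
            try:
                audio = synthesize(sentence, lang)
            except Exception as exc:
                failures.append((lang, sentence, f"error: {exc}"))
                continue
            if len(audio) < min_bytes:
                failures.append((lang, sentence, f"only {len(audio)} bytes"))
    return failures
```

Saving each generated file next to the previous run's output makes before-and-after listening comparisons quick when you change a voice or setting.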

6) Watch for fades / abrupt endings

In real-world integrations, you may occasionally notice audio ending slightly early or fading oddly, especially on certain inputs.

Mitigations:

- Add a short trailing punctuation mark if missing (e.g., ensure text ends with `.`)

- Avoid extremely long single-paragraph requests; chunk long content (see next section)
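
The trailing-punctuation mitigation is a few lines of code. This sketch treats closing quotes and brackets as transparent, so text ending in `."` is left alone; the character sets and the function name are illustrative choices.

```python
TERMINAL_PUNCTUATION = ".!?…。！？"  # includes common CJK sentence enders
CLOSERS = "\"')]»”’"  # characters that may legitimately follow the final mark


def ensure_terminal_punctuation(text: str, mark: str = ".") -> str:
    """Append a sentence-ending mark unless the text already ends with one."""
    stripped = text.rstrip()
    if not stripped:
        return stripped
    core = stripped.rstrip(CLOSERS)  # look past trailing quotes and brackets
    if core and core[-1] in TERMINAL_PUNCTUATION:
        return stripped
    return stripped + mark
```

For languages that use a different full stop, pass the appropriate `mark` (for example `"。"` for Chinese).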

---

Step 8: Handling long content (chunking without ruining prosody)

For long-form TTS (articles, lessons), don’t send one huge block.

Instead:

1. Split by paragraphs or sentences

2. Generate audio per chunk

3. Concatenate audio on the server (or play sequentially on the client)

Rules of thumb:

- Keep chunks semantically coherent (don’t split mid-sentence)

- Add short pauses between chunks if needed

- Reuse the same voice/settings across chunks

This also helps with mixed-language scenarios: detect language per chunk and route to the appropriate voice.
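
The splitting step can be sketched as follows. The sentence splitter is deliberately naive: it breaks on `.`, `!`, or `?` followed by whitespace, so it is not abbreviation-aware and will not split languages written without spaces. The 800-character limit is an assumed default, not an API constraint.

```python
import re
from typing import List

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # naive: split after ., !, or ?


def chunk_text(text: str, max_chars: int = 800) -> List[str]:
    """Split text into chunks that respect paragraph and sentence boundaries."""
    chunks: List[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        current = ""
        for sentence in SENTENCE_END.split(paragraph.strip()):
            sentence = sentence.strip()
            if not sentence:
                continue
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)  # close the chunk before it overflows
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)  # never merge across paragraphs
    return chunks
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, which matches the first rule of thumb above.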

---

Step 9: Mixed-language text (advanced but common)

If you truly have bilingual paragraphs, you’ll get better results by:

1. Detecting language at the sentence level

2. Generating audio per sentence with the best voice for that language

3. Stitching together

Tradeoff: voice changes can be noticeable. If your product needs a single consistent persona, you may prefer a voice that performs acceptably across both languages rather than “best per language.”
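
The sentence-level routing can be sketched as a grouping step that merges consecutive sentences in the same language, which reduces both the number of requests and the number of audible voice switches. `detect` is whichever detector you use; the function name and the 20-character threshold are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def group_sentences_by_language(
    sentences: Sequence[str],
    detect: Callable[[str], str],
    default: str = "en",
    min_chars: int = 20,
) -> List[Tuple[str, str]]:
    """Return (lang, text) segments, merging neighbors that share a language."""
    segments: List[Tuple[str, str]] = []
    for sentence in sentences:
        if len(sentence) >= min_chars:
            lang = detect(sentence)
        elif segments:
            lang = segments[-1][0]  # too short to detect: inherit the previous language
        else:
            lang = default
        if segments and segments[-1][0] == lang:
            segments[-1] = (lang, f"{segments[-1][1]} {sentence}")
        else:
            segments.append((lang, sentence))
    return segments
```

Each resulting segment maps to one TTS request with the voice for its language, and the audio is stitched back together in the original order.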

---

When it makes sense to use ElevenLabs specifically

If your app needs realistic voices, fast iteration, and a developer-friendly workflow, a TTS platform with both Studio tooling and an API can save significant time.

- The ElevenLabs text-to-speech API is typically used when you want to generate audio dynamically (on demand) rather than recording voice actors.

- If you’re building an internal tool or prototype, the ElevenLabs getting started resources can help you validate a multilingual experience quickly.

---

Conclusion

Building multilingual TTS isn’t just “send text, get MP3.” The difference between a demo and a production feature is the routing and reliability layer: language strategy, voice mapping, streaming vs. file generation, chunking, caching, and fallbacks.

Start with one-language-per-request, add deterministic voice selection, then evolve toward chunking and mixed-language handling if your content requires it. With a small amount of structure, you can ship multilingual speech that feels intentional—and scales as your app grows.
