Free AI Voice Generator (Text-to-Speech API): Build a Working TTS Demo in 15 Minutes (with Code)
This hands-on guide shows how to build a simple, working text-to-speech demo in about 15 minutes using a TTS API. You’ll learn the minimal architecture, how to call an API, stream audio, and ship a small web UI—plus practical tips for voice quality, latency, and production readiness.
You can build a working “text in → playable audio out” demo in about 15 minutes using a small Node.js + Express backend and a simple browser UI. The backend calls a TTS API and returns an MP3 that the browser can play.
A backend is recommended so your API key stays off the client. The article uses a Node.js endpoint (`/api/tts`) that accepts text and voice ID, then calls the TTS provider securely.
You’ll need Node.js 18+ (or any environment with fetch support), an API key from a TTS provider, and a modern browser. For ElevenLabs specifically, set the `ELEVENLABS_API_KEY` environment variable before running the server.
The server sends a POST request to `https://api.elevenlabs.io/v1/text-to-speech/{voiceId}` with JSON (text, model_id, and voice_settings). It sets `Accept: audio/mpeg`, then returns the audio as an MP3 response to the client.
Streaming can start playback earlier instead of waiting for the entire MP3 to generate, which reduces perceived latency. This is especially important for interactive experiences like assistants, narration tools, and customer support.
The browser UI POSTs text to the backend, converts the response to a Blob, and creates an object URL for an HTML `<audio>` element. It then sets `player.src` to that URL and calls `player.play()`.
Some browsers restrict requests from `file://` pages to `http://localhost` APIs. The article suggests serving the HTML with a small static server (like `npx serve`) or adding a static route in Express.
“Free” often means a free tier with limited monthly characters or minutes. After the free allowance, paid usage typically applies depending on the provider.
Write for speech (use contractions, shorter sentences, and punctuation to control rhythm) and normalize numbers/abbreviations (e.g., "$1.2M" to “1.2 million dollars”). These small input changes often improve realism more than a single setting.
If you’ve searched for a **free AI voice generator** or a **text-to-speech API** to prototype quickly, you’ve probably noticed a pattern in the top results: lots of “try it now” tools, but not enough practical, copy‑pasteable code that gets you from **text → playable audio** in minutes.
This article is a practical walkthrough for developers: you’ll build a working TTS demo fast, using a small Node.js server and a basic browser UI. The approach maps to any modern TTS provider, but the code examples use the [PRODUCT_LINK]{ElevenLabs platform}[/PRODUCT_LINK] because it’s straightforward to integrate and produces realistic speech.
**What you’ll have in ~15 minutes:**
- A tiny API server that turns text into speech
- Audio playback in the browser, plus streaming options that lower perceived latency
- A minimal UI to type text and play the result (the backend also accepts an optional `voiceId`)
> Note: “Free” often means “free tier.” Most TTS APIs provide limited monthly characters/minutes before paid usage kicks in.
---
What you’re building (and why this architecture works)
A basic text-to-speech demo has 4 moving parts:
1. **UI**: textarea + “Generate” button
2. **Backend**: endpoint that calls the TTS API (keeps your API key off the client)
3. **TTS API request**: send text + voice settings
4. **Audio response**: stream or download an MP3/WAV and play it
**Why streaming matters:** Instead of waiting for the entire MP3 to generate, streaming can start playback earlier, which is crucial for interactive apps (assistants, narration tools, customer support, in-product reading).
---
Prerequisites
- Node.js 18+ (or any environment with fetch support)
- An API key from your TTS provider
- A modern browser
If you’re using ElevenLabs, create an API key in your dashboard. The docs and examples in the [PRODUCT_LINK]{ElevenLabs developer docs}[/PRODUCT_LINK] are helpful if you want to go beyond this demo.
---
Step 1 — Create the project
```bash
mkdir tts-demo
cd tts-demo
npm init -y
# server.js uses ES module imports, so mark the package as a module
npm pkg set type=module
npm i express cors
```
Create a file named `server.js`.
---
Step 2 — Build a minimal TTS endpoint (Node.js + Express)
This endpoint accepts text and returns audio (MP3) as the response.
> Keep your API key in an environment variable: `ELEVENLABS_API_KEY`.
**server.js**
```js
import express from "express";
import cors from "cors";

const app = express();
app.use(cors());
app.use(express.json({ limit: "1mb" }));

const API_KEY = process.env.ELEVENLABS_API_KEY;

// Pick a default voice. You can later replace this with a voice list call.
const DEFAULT_VOICE_ID = "21m00Tcm4TlvDq8ikWAM"; // Example voice ID

app.post("/api/tts", async (req, res) => {
  try {
    // Fall back to {} so a request without a JSON body gets a 400, not a crash
    const { text, voiceId = DEFAULT_VOICE_ID } = req.body ?? {};

    if (!API_KEY) {
      return res.status(500).json({ error: "Missing ELEVENLABS_API_KEY" });
    }
    if (!text || typeof text !== "string") {
      return res.status(400).json({ error: "Please provide a text string." });
    }

    // Call the ElevenLabs TTS endpoint (the ID comes from the client, so encode it)
    const url = `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`;
    const ttsResp = await fetch(url, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "xi-api-key": API_KEY,
        // Audio format can be changed; mp3 is easy for browsers.
        "Accept": "audio/mpeg"
      },
      body: JSON.stringify({
        text,
        model_id: "eleven_multilingual_v2",
        voice_settings: {
          stability: 0.4,
          similarity_boost: 0.8
        }
      })
    });

    if (!ttsResp.ok) {
      const msg = await ttsResp.text();
      return res.status(ttsResp.status).send(msg);
    }

    res.setHeader("Content-Type", "audio/mpeg");
    // Buffer the full MP3 in memory, then send it (Step 4 covers streaming options)
    const arrayBuffer = await ttsResp.arrayBuffer();
    res.send(Buffer.from(arrayBuffer));
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: "TTS generation failed." });
  }
});

app.listen(3000, () => {
  console.log("TTS demo server running on http://localhost:3000");
});
```
Run it:
```bash
# macOS/Linux
export ELEVENLABS_API_KEY="your_key_here"
node server.js

# Windows PowerShell
$env:ELEVENLABS_API_KEY="your_key_here"
node server.js
```
At this point, you have a working “text in → audio out” API.
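Before building the UI, it can help to confirm the endpoint responds with audio. The script below is a minimal sketch, assuming the server from Step 2 is running on port 3000. `checkTts` and its `fetchImpl` parameter are hypothetical names introduced here (the second one exists only so the function can be exercised with a stub).

```js
// Smoke test for the /api/tts endpoint from Step 2.
// `checkTts` and `fetchImpl` are illustrative names, not part of any library.
async function checkTts(baseUrl = "http://localhost:3000", fetchImpl = fetch) {
  const resp = await fetchImpl(`${baseUrl}/api/tts`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: "Hello from the smoke test." })
  });

  if (!resp.ok) {
    // The server forwards provider errors as text, so surface them as-is.
    throw new Error(`TTS request failed (${resp.status}): ${await resp.text()}`);
  }

  const audio = await resp.arrayBuffer();
  return {
    contentType: resp.headers.get("content-type"),
    bytes: audio.byteLength
  };
}

// With the server running, uncomment to try it:
// checkTts().then(console.log).catch(console.error);
```

A healthy response should report an `audio/mpeg` content type and a non-zero byte count.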
---
Step 3 — Add a tiny web UI (play audio in the browser)
Create `index.html` in the same folder.
**index.html**
```html
<!doctype html>
<html>
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <title>TTS Demo</title>
  <style>
    body { font-family: system-ui, Arial; max-width: 760px; margin: 40px auto; padding: 0 16px; }
    textarea { width: 100%; height: 140px; }
    .row { display: flex; gap: 8px; margin-top: 12px; }
    button { padding: 10px 14px; }
    audio { width: 100%; margin-top: 16px; }
    .hint { color: #555; font-size: 14px; margin-top: 8px; }
  </style>
</head>
<body>
  <h1>Text-to-Speech Demo</h1>
  <p class="hint">Type something, generate audio, and play it in the browser.</p>
  <textarea id="text">Hey! This is a quick text-to-speech demo you can build in about fifteen minutes.</textarea>
  <div class="row">
    <button id="btn">Generate speech</button>
    <button id="btnStop" disabled>Stop</button>
  </div>
  <audio id="player" controls></audio>
  <script>
    const btn = document.getElementById('btn');
    const btnStop = document.getElementById('btnStop');
    const textEl = document.getElementById('text');
    const player = document.getElementById('player');
    let currentUrl = null;

    btn.addEventListener('click', async () => {
      btn.disabled = true;
      btn.textContent = 'Generating…';
      try {
        const resp = await fetch('http://localhost:3000/api/tts', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ text: textEl.value })
        });
        if (!resp.ok) {
          const errText = await resp.text();
          throw new Error(errText);
        }
        const blob = await resp.blob();
        if (currentUrl) URL.revokeObjectURL(currentUrl);
        currentUrl = URL.createObjectURL(blob);
        player.src = currentUrl;
        await player.play();
        btnStop.disabled = false;
      } catch (e) {
        alert('Failed to generate audio. Check console/server logs.');
        console.error(e);
      } finally {
        btn.disabled = false;
        btn.textContent = 'Generate speech';
      }
    });

    btnStop.addEventListener('click', () => {
      player.pause();
      player.currentTime = 0;
    });

    player.addEventListener('pause', () => {
      btnStop.disabled = true;
    });
  </script>
</body>
</html>
```
Now open `index.html` in your browser (double-click it). Click **Generate speech**.
If your browser blocks requests from `file://` to `http://localhost`, serve the HTML with a tiny static server (for example, `npx serve`) or add a static route in Express.
---
Step 4 — Make it feel “instant”: streaming options (what to do next)
The demo above returns a full MP3 payload. That’s fine for a prototype, but for snappier UX you’ll want **streaming**.
In practice, you have three common approaches:
1. **Backend streaming**: pipe the TTS response stream directly to the client as it arrives.
2. **Chunked playback**: generate smaller segments per sentence and play sequentially.
3. **Pre-generation**: generate and cache common prompts (IVR, UI narration).
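The first approach, backend streaming, can be sketched as follows. This is an illustration under assumptions, not the provider's documented method: `pipeAudio` is a hypothetical helper, `ttsResp` is assumed to be the `fetch` response from Step 2, and `res` the Express response. It relies on Node 18+ exposing the response body as an async-iterable stream. Whether audio actually arrives early depends on the provider sending it incrementally, so check your provider's docs for a streaming endpoint.

```js
// Forward audio chunks to the client as they arrive instead of buffering.
// `pipeAudio` is an illustrative helper name, not a library function.
async function pipeAudio(ttsResp, res) {
  res.setHeader("Content-Type", "audio/mpeg");

  // In Node 18+, a fetch response body can be consumed with for await.
  for await (const chunk of ttsResp.body) {
    // write() returns false when the client reads slower than the provider
    // sends; wait for "drain" so chunks don't pile up in memory.
    if (!res.write(chunk)) {
      await new Promise((resolve) => res.once("drain", resolve));
    }
  }
  res.end();
}
```

In the Step 2 handler, `await pipeAudio(ttsResp, res);` would take the place of the `arrayBuffer()` and `res.send(...)` lines. Note that the Step 3 page calls `resp.blob()`, which waits for the full response, so the client also needs changes before playback can start early.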
If you’re building a real-time product (voice assistants, NPC dialog, accessibility narration), exploring streaming and caching patterns in the [PRODUCT_LINK]{ElevenLabs TTS API}[/PRODUCT_LINK] docs is a solid next step.
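For the chunked playback approach, the text has to be cut into sentence-sized pieces and the pieces played in order. The sketch below assumes naive punctuation-based splitting is good enough for a demo (abbreviations such as "Dr." will be split incorrectly). `splitIntoSentences` and `speakInChunks` are hypothetical helpers, and `synthesize` and `play` are caller-supplied async functions, for example a call to `/api/tts` and a wrapper around the `<audio>` element.

```js
// Split text after ., !, or ? followed by whitespace. Naive by design.
function splitIntoSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Play sentences in order, requesting sentence N+1 while sentence N plays.
// `synthesize(sentence)` resolves to audio; `play(audio)` resolves when done.
async function speakInChunks(text, synthesize, play) {
  const sentences = splitIntoSentences(text);
  if (sentences.length === 0) return;

  let pending = synthesize(sentences[0]);
  for (let i = 0; i < sentences.length; i++) {
    const audio = await pending;
    if (i + 1 < sentences.length) {
      // Start the next request now so it overlaps with playback.
      pending = synthesize(sentences[i + 1]);
      // Mark it handled; a failure still surfaces at the next `await pending`.
      pending.catch(() => {});
    }
    await play(audio);
  }
}
```

Only the first sentence's generation time is felt by the user; later requests are hidden behind playback, at the cost of more API calls per paragraph.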
---
Voice quality tips (the stuff that actually improves output)
Getting “realistic” speech is rarely about one magic setting—it’s usually input text hygiene and a few practical controls.
1) Write for speech, not for reading
- Use contractions (“you’ll” vs “you will”) when appropriate
- Break up long sentences
- Add punctuation to control rhythm
2) Normalize numbers and abbreviations
- “$1.2M” → “1.2 million dollars” (or your preferred style)
- “ETA 5m” → “estimated time of arrival: five minutes”
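A normalization pass like this can run on the server before the text is sent to the TTS API. The following is a minimal sketch rather than a complete normalizer: `normalizeForSpeech`, the abbreviation table, and the magnitude table are all illustrative assumptions, and it only handles dollar amounts and a couple of whole-word abbreviations.

```js
// Rewrite a few symbols and abbreviations into words before sending text to TTS.
// The tables below are illustrative examples; extend them for your own content.
const ABBREVIATIONS = {
  ETA: "estimated time of arrival",
  FYI: "for your information"
};
const MAGNITUDES = { K: "thousand", M: "million", B: "billion" };

function normalizeForSpeech(text) {
  // Match any known abbreviation as a whole word (so "BETA" is left alone).
  const abbreviationPattern = new RegExp(
    `\\b(${Object.keys(ABBREVIATIONS).join("|")})\\b`,
    "g"
  );

  return text
    // "$1.2M" -> "1.2 million dollars", "$40" -> "40 dollars"
    .replace(/\$(\d+(?:\.\d+)?)([KMB])?\b/g, (match, amount, magnitude) => {
      if (magnitude) return `${amount} ${MAGNITUDES[magnitude]} dollars`;
      return amount === "1" ? "1 dollar" : `${amount} dollars`;
    })
    .replace(abbreviationPattern, (match, abbr) => ABBREVIATIONS[abbr]);
}
```

For example, `normalizeForSpeech("Revenue hit $1.2M.")` returns `"Revenue hit 1.2 million dollars."`. Ambiguous tokens such as "5m" (minutes or meters) are deliberately left untouched, since resolving them needs context.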
3) Watch for edge cases
Even strong models occasionally produce artifacts. For example, some systems introduce unexpected **audio fades**, and **Chinese-language quality** can vary by model and voice. To mitigate this:
- keep generated clips short,
- regenerate when you detect issues,
- test multiple voices/models for your target language.
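The "regenerate when you detect issues" step can be wrapped in a small retry helper. This is a sketch under assumptions: `generateWithRetry` is a hypothetical name, and the detection logic is left to a caller-supplied `looksOk` function because what counts as an artifact depends on your application.

```js
// Retry generation when a caller-supplied check rejects the result.
// `generate(attempt)` resolves to audio (e.g. a Buffer);
// `looksOk(audio)` returns true when the result is acceptable.
async function generateWithRetry(generate, looksOk, maxAttempts = 3) {
  let lastAudio;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    lastAudio = await generate(attempt);
    if (looksOk(lastAudio)) {
      return { audio: lastAudio, attempts: attempt, ok: true };
    }
  }
  // Out of attempts: hand back the last result and let the caller decide.
  return { audio: lastAudio, attempts: maxAttempts, ok: false };
}
```

A crude `looksOk` might reject clips whose byte length is implausibly small for the amount of text. Keep `maxAttempts` low, since every regeneration counts against your character quota.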
---
Production checklist (so your demo can ship)
If you plan to move beyond a demo, these are the items that matter:
- **Don’t expose API keys**: keep all TTS calls server-side.
- **Rate limiting**: prevent abuse (especially on a “free tier” demo).
- **Caching**: hash `(voiceId + text + settings)` and store audio to avoid repeated charges.
- **Observability**: log latency, response codes, and character counts.
- **Content safety**: add policy checks if users can input arbitrary text.
- **File storage**: store MP3s in object storage (S3/GCS) when you need persistence.
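The caching item above can be sketched as follows. It assumes SHA-256 as the hash and uses an in-memory `Map` as a stand-in for real storage (disk, Redis, or S3/GCS). `cacheKey`, `getOrGenerate`, and `audioCache` are hypothetical names, and `generate` is whatever function calls your TTS provider.

```js
// Cache generated audio keyed by a hash of voice, text, and settings.
// An in-memory Map stands in for real storage here.
const audioCache = new Map();

async function cacheKey(voiceId, text, settings = {}) {
  // Dynamic import works from both ES module and CommonJS files.
  const { createHash } = await import("node:crypto");
  // Sort settings keys so {a, b} and {b, a} produce the same key.
  const sorted = Object.keys(settings).sort().map((k) => [k, settings[k]]);
  const payload = JSON.stringify([voiceId, text, sorted]);
  return createHash("sha256").update(payload).digest("hex");
}

// Return cached audio when available; otherwise generate and store it.
async function getOrGenerate(voiceId, text, settings, generate) {
  const key = await cacheKey(voiceId, text, settings);
  if (audioCache.has(key)) return audioCache.get(key);

  const audio = await generate(voiceId, text, settings);
  audioCache.set(key, audio);
  return audio;
}
```

The hex digest also works as an object-storage filename. Two identical requests arriving at the same moment would both call `generate` in this sketch; deduplicating in-flight requests is left out for brevity.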
If you need voice assets and management features (multiple voices, reusable presets, project organization), tools like [PRODUCT_LINK]{ElevenLabs Studio}[/PRODUCT_LINK] can complement an API-first workflow.
---
Conclusion
A working **AI voice generator** demo doesn’t need a big framework or hours of setup. With a small backend endpoint and a simple browser UI, you can go from **text to natural-sounding speech** quickly—then iterate on streaming, caching, and voice settings as your use case gets more serious.
Once your prototype works, the highest ROI improvements usually come from:
- streaming or sentence chunking (faster perceived latency),
- caching generated audio,
- better text normalization and prompt formatting.
If you want to extend this into a real application (multi-voice selector, SSML-like controls, localization, or voice cloning workflows), the [PRODUCT_LINK]{ElevenLabs API and voice platform}[/PRODUCT_LINK] is a practical place to explore next.