Best of Product Hunt

The Best Sample Script for Voice Cloning (Step-by-Step): What to Read, How Long, and How to Record

A practical, step-by-step guide to recording the right voice cloning sample: exactly what to read, how long to record, and how to capture clean audio. Includes a ready-to-use script with phonetic coverage, emotion variety, and recording tips to improve voice clone accuracy and naturalness.


A minimum viable sample is 1–2 minutes, but 3–7 minutes is the recommended sweet spot for strong similarity. Go longer (10–20 minutes) only if you need wide emotional range, character acting, or challenging languages/accents.

Use a script that includes everyday sentences, questions, numbers/dates/abbreviations, proper nouns, and a few phonetic coverage lines. This mix helps capture natural cadence and common “TTS trouble spots” like pricing, IDs, and punctuation.

Most “doesn’t sound like me” problems come from the sample: the script, pacing, and recording consistency—not the model. Clean audio, natural delivery, stable mic distance, and good phoneme coverage usually improve similarity fast.

A dynamic mic (SM58-style) or a decent condenser in a quiet room is ideal, but a modern smartphone can work in a small quiet space. Keep your mouth about 6–8 inches (15–20 cm) from the mic and speak slightly off-axis to reduce plosives.

Target peaks around -12 dBFS to -6 dBFS on your recorder's meter. If the waveform looks “brick-like” (flat-topped against the edges of the track), you’re clipping, so lower the gain and re-record.

Prefer WAV or another lossless format, ideally 44.1 kHz or 48 kHz at 16-bit or 24-bit in mono. Avoid heavy noise reduction because it can create metallic artifacts; light cleanup is okay if it doesn’t introduce artifacts.

Speak at your normal pace and avoid “announcer voice,” whispering, or overacting. If you make a mistake, pause and redo the sentence without stopping the recording.

Common issues include whispering or performing too much, aggressive audio processing, recording when tired, and changing rooms or mic position mid-sample. Skipping numbers and abbreviations also hurts because the clone may struggle with real-world formats like dates and IDs.

Turn off fans/AC, record at quieter times, and use soft furnishings (curtains or closet clothes) to reduce echo. Aim for a close-sounding voice with minimal room sound and no audible hum or loud computer noise.


Voice cloning quality often comes down to one thing you can control: the sample. Most “my clone doesn’t sound like me” issues aren’t model problems—they’re script, pacing, and recording consistency problems.

Below is a practical, repeatable way to record a strong voice cloning sample (plus a script you can read verbatim). The goal is to capture **your natural speaking voice**, across enough sounds and speaking styles, in clean audio.

---

What makes a “good” voice cloning sample?

A great sample balances four things:

1. **Coverage**: enough phonemes (speech sounds), including tricky consonants and vowel shifts.

2. **Consistency**: stable mic distance, steady volume, minimal room changes.

3. **Natural delivery**: not “announcer voice,” not whispering—just you.

4. **Clean signal**: low noise, no music, no echo, no clipping.

If you’re using a modern voice cloning workflow (for example, the tools and pipelines commonly used with platforms like [PRODUCT_LINK]ElevenLabs voice cloning[/PRODUCT_LINK]), the model can learn surprisingly fast—but it still needs a well-designed input.

---

How long should your recording be?

Most beginners either record **too little** (thin, unstable voice) or **too much** (fatigue, inconsistent tone). Use this rule of thumb:

- **Minimum viable sample**: **1–2 minutes** (works for demos; may sound “close but not quite”)

- **Recommended for strong similarity**: **3–7 minutes** (sweet spot for most use cases)

- **When to go longer (10–20 minutes)**: if you need **wide emotional range**, lots of **character acting**, or you’re targeting **challenging languages/accents**

Important: **5 clean minutes beats 20 messy minutes**.
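The length tiers above can be sketched as a quick check on a finished file. This is only an illustration: the function names and tier labels are made up for this example, and because the guide leaves gaps between its ranges (2–3 and 7–10 minutes), the cut-offs at 3 and 7 minutes are an assumption rather than a rule.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the playing time of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_sample_length(seconds: float) -> str:
    """Map a duration onto the guide's tiers (boundary handling is assumed)."""
    minutes = seconds / 60
    if minutes < 1:
        return "too short"
    if minutes < 3:
        return "minimum viable"
    if minutes <= 7:
        return "recommended"
    return "extended"
```

For example, `classify_sample_length(wav_duration_seconds("sample.wav"))` (with `sample.wav` standing in for your own file) tells you which tier a take falls into. It says nothing about how clean the take is, which matters more.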

---

What to read (and why it matters)

To build a reliable voice clone sample, your script should include:

- **Everyday sentences** (natural cadence)

- **Numbers, dates, and abbreviations** (common TTS failure points)

- **Proper nouns** (names, places)

- **Question/answer patterns** (intonation shifts)

- **Emotion tags** (calm, upbeat, serious) to capture expressive range

- **A few “phonetic coverage” lines** (dense consonant/vowel variety)

Avoid:

- A script made up only of tongue twisters (they can distort your natural rhythm; one or two coverage lines are enough)

- Singing or shouting (unless you explicitly need it)

- Long monotone paragraphs (you’ll drift in pitch and energy)

---

Step-by-step: how to record your sample

Step 1) Choose your setup (simple is fine)

- **Best**: dynamic mic (e.g., SM58-style) or a decent condenser + quiet room

- **Good enough**: a modern smartphone in a quiet closet-sized space

Keep your mouth **6–8 inches (15–20 cm)** from the mic. Speak slightly *off-axis* (not directly into it) to reduce plosives.

Step 2) Control the room

- Turn off fans/AC if possible

- Use soft furnishings (curtains, closet clothes) to reduce echo

- Record at a time with less traffic and fewer people around

Step 3) Set your levels

- Target peaks around **-12 dBFS to -6 dBFS** (decibels relative to digital full scale, where 0 dBFS is the clipping point)

- If your waveform looks “brick-like,” you’re clipping—lower gain and re-record
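If you would rather measure than eyeball the waveform, the level targets above can be checked with a few lines of standard-library Python. This is a minimal sketch, assuming a 16-bit PCM WAV file; the helper names and report messages are invented for this example, and treating any sample at the 16-bit ceiling as clipping is a deliberately strict heuristic.

```python
import array
import math
import sys
import wave

FULL_SCALE_16BIT = 32768  # magnitude of the most negative 16-bit sample


def read_pcm16(path):
    """Load all samples from a 16-bit PCM WAV file as signed integers."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples


def peak_dbfs(samples):
    """Return the loudest sample's level in dBFS (0 dBFS = full scale)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(peak / FULL_SCALE_16BIT)


def count_clipped(samples, limit=32767):
    """Count samples at or beyond the 16-bit ceiling (assumed clipping)."""
    return sum(1 for s in samples if abs(s) >= limit)


def level_report(samples, low=-12.0, high=-6.0):
    """Compare the peak against the guide's -12 to -6 dBFS target."""
    if count_clipped(samples) > 0:
        return "clipping: lower the gain and re-record"
    peak = peak_dbfs(samples)
    if peak < low:
        return "too quiet: raise the gain"
    if peak > high:
        return "hot: lower the gain slightly"
    return "peaks in the target range"
```

Calling `level_report(read_pcm16("take1.wav"))` on a test recording (the file name is a placeholder) gives a quick pass/fail before you commit to the full script. A stereo file is read as interleaved samples, so the peak covers both channels.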

Step 4) Delivery rules (these matter more than people think)

- Speak **at your normal pace**

- Keep a **consistent distance** from the mic

- If you make a mistake: **pause, then redo the sentence** (don’t stop the recording)

Step 5) File format basics

- Prefer **WAV** (or lossless)

- If possible: **44.1 kHz or 48 kHz**, **16-bit or 24-bit**, **mono**

- Avoid heavy noise reduction; light cleanup is okay if it doesn’t create artifacts
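The format recommendations above are easy to verify before uploading. The sketch below uses Python's built-in `wave` module, which reads uncompressed PCM WAV only (other WAV encodings, such as 32-bit float, may raise `wave.Error`). The function name and messages are illustrative, and the accepted values simply mirror this guide rather than any platform's actual requirements.

```python
import wave

ACCEPTED_RATES = (44100, 48000)  # Hz, per the guide
ACCEPTED_BIT_DEPTHS = (16, 24)   # bits per sample, per the guide


def check_wav_format(path):
    """Return a list of format problems; an empty list means none found."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()

    problems = []
    if rate not in ACCEPTED_RATES:
        problems.append(f"sample rate is {rate} Hz, expected 44100 or 48000")
    if bits not in ACCEPTED_BIT_DEPTHS:
        problems.append(f"bit depth is {bits}-bit, expected 16 or 24")
    if channels != 1:
        problems.append(f"file has {channels} channels, expected mono")
    return problems
```

An empty result means the file matches the settings listed above. It does not tell you whether the audio itself is clean.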

If you’re building voice assets inside a production workflow, tools like [PRODUCT_LINK]the ElevenLabs Studio and API toolkit[/PRODUCT_LINK] generally benefit from cleaner, artifact-free audio (even if it’s short).

---

The best sample script for voice cloning (copy/paste)

**How to use this script**

- Total read time: **~5–7 minutes** at a natural pace (the part estimates below add up to roughly 5.5–7 minutes)

- Record it in one take if you can, but it’s fine to do **two takes** and pick the best.

- Keep each “mode” distinct, but don’t overact.

Part A — Natural baseline (about 60–90 seconds)

> Hi, I’m recording a voice sample for a realistic text-to-speech model. I’m going to speak naturally, at a comfortable pace.

>

> Today is a good day to focus on clarity. I’ll keep my voice steady, and I’ll avoid rushing. If I misread a line, I’ll pause and say it again.

>

> For this sample, I’m aiming for a clean recording with minimal background noise. I’m speaking at a normal volume, and I’m staying the same distance from the microphone.

Part B — Questions, emphasis, and conversational rhythm (about 60 seconds)

> Are we meeting at 9:30, or at 10:00?

> I thought you said Tuesday—did you mean Thursday?

>

> I can do it today, but I need the final details first.

> When you send the file, please include the notes and the version number.

>

> Just to confirm: the address is 18 North Harbor Road, apartment 4B, correct?

Part C — Numbers, punctuation, and “TTS trouble spots” (about 60 seconds)

> My phone number is 202-555-0147.

> The total is $48.75, not $487.50.

>

> The meeting is on March 6th, 2026, at 2:15 p.m.

> Please review pages 12 through 17, especially section 3.2.1.

>

> The code is A-B-7-9-Z. The file name is final_edit_v3 dot mp3.

Part D — Proper nouns and varied sounds (about 60 seconds)

> I’ve traveled through Boston, Austin, and Seattle.

> I also visited Montréal, São Paulo, and Copenhagen.

>

> My favorite cafés have names like Riva, Juniper, and Blue Harbor.

> I enjoy reading about robotics, climate science, and classical literature.

Part E — Emotional range, but still natural (about 60–90 seconds)

**(Calm, reassuring)**

> Everything is under control. We have time, and we’ll handle it step by step.

**(Upbeat, friendly)**

> That’s great news! I’m excited to see how it turns out.

**(Serious, focused)**

> This part is important. Please follow the instructions exactly and double-check the final result.

**(Warm, conversational)**

> Thanks for your help—I really appreciate it. Let me know if you want me to send a quick summary.

Part F — Phonetic coverage (30–45 seconds)

> The quick brown fox jumps over the lazy dog.

> She sells sea shells by the sea shore.

>

> Please pack my box with five dozen quality jugs.

> Bright violet lilies bloom quietly beside the fresh green hedge.

---

Recording checklist (use this before you hit “export”)

- [ ] No audible fan, hum, or loud computer noise

- [ ] Minimal room echo (voice sounds close, not “far away”)

- [ ] No clipping (harsh distortion on loud syllables)

- [ ] Consistent mic distance and volume

- [ ] Natural cadence (not “reading robotically”)

- [ ] Clean pauses between retakes (easy to cut if needed)
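Two of the checklist items (background noise and consistent volume) can be roughly estimated from the audio itself. The sketch below is a heuristic under stated assumptions, not a standard measurement: it takes 16-bit samples as plain integers, measures RMS level per fixed-size window, and treats the quietest windows (your pauses) as an estimate of the room's noise floor. The function names, the 10% "quiet" fraction, and the plain averaging of dB values are all choices made for this example.

```python
import math

FULL_SCALE_16BIT = 32768


def window_rms_dbfs(samples, window):
    """Return the RMS level in dBFS of each full window of `window` samples."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        mean_square = sum(s * s for s in chunk) / window
        # Clamp to 1 LSB so digital silence reads about -90 dBFS, not -inf.
        rms = max(math.sqrt(mean_square), 1.0)
        levels.append(20 * math.log10(rms / FULL_SCALE_16BIT))
    return levels


def summarize_levels(levels, quiet_fraction=0.1):
    """Estimate noise floor from the quietest windows, plus overall range."""
    if not levels:
        raise ValueError("no full windows to analyze")
    ordered = sorted(levels)
    n_quiet = max(1, int(len(ordered) * quiet_fraction))
    noise_floor = sum(ordered[:n_quiet]) / n_quiet
    loudest = ordered[-1]
    return {
        "noise_floor_dbfs": noise_floor,
        "loudest_dbfs": loudest,
        "range_db": loudest - noise_floor,
    }
```

With a window of about half a second (for example `window=24000` at 48 kHz), a large gap between the noise floor and the loudest window suggests a quiet room. Comparing the summaries of two takes can reveal a change in gain or mic distance. Your ears remain the final check.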

Once your recording is ready, you can test it in your preferred cloning workflow. If you’re iterating quickly, a platform like [PRODUCT_LINK]ElevenLabs for generating realistic voice clones[/PRODUCT_LINK] can make it easy to compare versions and hear what changes improve similarity.

---

Common mistakes that quietly ruin voice clones

1) Whispering or “performing” too much

Clones learn what you give them. If you speak in an unnatural “recording voice,” your output will inherit that.

2) Overprocessing your audio

Aggressive noise reduction can introduce metallic artifacts. Light cleanup is fine; heavy cleanup is often worse than mild noise.

3) Recording when you’re tired

Your tone and pitch shift when you’re fatigued. If you need 5 minutes, record 6–7 and select your best 5.

4) Inconsistent environment

Changing rooms, mic positions, or even sitting vs. standing can change resonance enough to confuse the model.

5) Skipping numbers and abbreviations

If you only read story text, the clone can struggle with real-world formats like dates, pricing, and IDs.

---

Conclusion: start with 5 clean minutes, not 50 random lines

If you want a voice clone that sounds like you, prioritize **clean audio**, **a script designed for coverage**, and **consistent delivery**. Record **3–7 minutes** in a stable setup, include numbers and questions, and capture a little emotional range without acting.

Treat your first sample as version 1. Make one improvement at a time (room echo, pacing, script variety), and you’ll hear the difference fast—no guesswork required.
