
For decades, Text-to-Speech (TTS) sounded like a robot trapped in a tin can. Then ElevenLabs arrived. Its voices are widely regarded as some of the most natural and emotionally expressive AI-generated speech available today.

When you attach a fast TTS pipeline (like ElevenLabs) to a reasoning pipeline (like GPT-4), you aren't just building a script anymore. You are building an agentic phone caller. You are building an interactive podcast host. But getting the latency low enough so the voice doesn't pause awkwardly for six seconds before speaking requires a specific architectural approach.

In this tutorial, we will build a Python application that sends text to ElevenLabs, streams the audio back in real time, and plays it through your speakers instantly.

Step 1: Environment and API Key

Create an account at ElevenLabs.io, navigate to your Profile, and copy your API Key. Be aware that downloading custom cloned voices requires a paid tier (Starter or Creator), but the free tier has enough credits to use the pre-made voices for this tutorial.

Install the official Python SDK:

pip install elevenlabs python-dotenv

Add your key to your .env file:

ELEVENLABS_API_KEY=sk_...
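Before making any API calls, it helps to fail fast when the key is missing rather than debugging a cryptic 401 later. This is a small convenience sketch (not part of the SDK), assuming `load_dotenv()` has already run as shown in Step 2:

```python
import os

def require_api_key(env_var: str = "ELEVENLABS_API_KEY") -> str:
    """Raise a clear error up front if the API key is not configured."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; add it to your .env file")
    return key
```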

Step 2: Basic Text to Speech and File Save

Let's start with the simplest use case. We want to convert a string of text into an MP3 file and save it to our hard drive. This is the exact workflow you would use to generate voiceovers for YouTube videos or an audiobook pipeline.

import os
from dotenv import load_dotenv
from elevenlabs.client import ElevenLabs

load_dotenv()

# Initialize the client. It automatically picks up ELEVENLABS_API_KEY
client = ElevenLabs()

def generate_and_save_audio(text, output_filename="output.mp3"):
    print(f"Generating audio for: '{text}'...")
    
    # client.generate calls the API and returns an iterator of audio byte chunks
    audio_stream = client.generate(
        text=text,
        voice="Rachel", # One of ElevenLabs default voices
        model="eleven_multilingual_v2" # The best model for general use
    )
    
    # Open a file and write the raw binary chunks
    with open(output_filename, "wb") as f:
        for chunk in audio_stream:
            if chunk:
                f.write(chunk)
                
    print(f"Audio saved to {output_filename}")

if __name__ == "__main__":
    generate_and_save_audio("Welcome to the PromptForge tutorial on autonomous voice agents.")

Run this script, and an MP3 file will appear in your directory, sounding remarkably human.
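Note that the SDK hands back an iterator of raw byte chunks rather than a single bytes object. If a downstream library needs the whole clip in memory (instead of writing it to disk chunk by chunk as above), you can join the chunks first, a one-line sketch:

```python
def collect_audio(chunks) -> bytes:
    """Join streamed byte chunks into one bytes object, skipping empty chunks."""
    return b"".join(chunk for chunk in chunks if chunk)
```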


Step 3: Finding Voice IDs (The Pro Workflow)

Passing the name "Rachel" works for the built-in voices in the SDK, but in production you should use Voice IDs. If you clone your own voice or browse the Community Library, the API requires the specific alphanumeric ID associated with that voice.

Here is a quick utility script to list all the voices attached to your account so you can grab the ID.

def get_all_voice_ids():
    voices_response = client.voices.get_all()
    for v in voices_response.voices:
        print(f"Name: {v.name}")
        print(f"ID: {v.voice_id}\n")

# Run this once to find your favorite voice ID
# get_all_voice_ids()
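Once you have that listing, a small lookup helper (a hypothetical convenience, not part of the SDK) resolves a display name to its ID so you don't have to hardcode strings you copied from the console:

```python
def find_voice_id(voices, name):
    """Return the voice_id whose name matches (case-insensitive), or None."""
    for v in voices:
        if v.name.lower() == name.lower():
            return v.voice_id
    return None

# Usage against the SDK response from the listing script above:
# voice_id = find_voice_id(client.voices.get_all().voices, "Rachel")
```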

Step 4: Streaming Audio (Instant Playback)

If you generate a 3-minute script using the code in Step 2, you have to wait for the entire clip to finish generating before you hear the first word. If you are building an interactive AI assistant, the user will think your app is broken.

To fix this, we must use Streaming. The API will process the first sentence, send us the audio chunks, and our Python client will begin playing them through the speakers while the API concurrently processes the second sentence.

Note: The SDK's stream() helper shells out to the mpv media player to send audio to your speakers, so you must install the mpv system package on your machine. (Installing ffmpeg is also a good idea, since the related play() helper depends on it.)

from elevenlabs import stream

def stream_chatbot_reply(text):
    print("Initiating Streaming Connection...")
    
    # We set stream=True in the generate function
    audio_stream = client.generate(
        text=text,
        voice="pNInz6obbfDQGcgMyIGD", # Use the Voice ID we found
        model="eleven_multilingual_v2",
        stream=True
    )
    
    # The stream() imported function plays the chunks as they arrive
    stream(audio_stream)

if __name__ == "__main__":
    reply = "I am processing this audio in chunks. You are hearing my voice before the API has finished generating this entire paragraph."
    stream_chatbot_reply(reply)

Step 5: Voice Cloning (The API Call)

If you have the proper subscription tier, you can programmatically clone a voice by uploading an MP3 of someone speaking (ensure you have legal consent).

def clone_voice_from_file(name, filepath, description):
    with open(filepath, "rb") as file_data:
        voice = client.voices.add(
            name=name,
            description=description,
            files=[file_data]
        )
    print(f"Success! Clone created with ID: {voice.voice_id}")
    return voice.voice_id
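Before uploading, a quick local sanity check on the sample file saves a failed round trip to the API. The checks below (file exists, has a common audio extension, is non-empty) are illustrative assumptions for a pre-flight guard, not documented API limits:

```python
import os

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}  # assumed acceptable formats

def check_clone_sample(filepath):
    """Return a list of problems with a candidate voice sample (empty list = OK)."""
    problems = []
    if not os.path.isfile(filepath):
        problems.append(f"file not found: {filepath}")
    elif os.path.splitext(filepath)[1].lower() not in AUDIO_EXTENSIONS:
        problems.append(f"unexpected extension: {filepath}")
    elif os.path.getsize(filepath) == 0:
        problems.append(f"file is empty: {filepath}")
    return problems
```

Call it before clone_voice_from_file and abort (or log) if the returned list is non-empty.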

Conclusion

By mastering the stream=True parameter, you unlock the ability to chain this script directly to a ChatGPT output queue. You can have GPT-4o outputting language token-by-token directly into the ElevenLabs pipeline chunk-by-chunk, resulting in sub-2-second latency for an emotionally intelligent, conversational voice bot.
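The token-to-chunk handoff described above can be sketched as a small buffering generator (a hypothetical helper, not part of either SDK): it accumulates LLM tokens and yields complete sentences, which are natural units to feed into the streaming TTS call:

```python
def sentences_from_tokens(tokens, terminators=".!?"):
    """Buffer incoming LLM tokens and yield complete sentences for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        tail = buffer.rstrip()
        if tail and tail[-1] in terminators:
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer  # flush any trailing partial sentence
```

Each yielded sentence can then be passed to the streaming playback function from Step 4, so speech begins while the model is still writing.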
