Text to Speech

Convert text to natural-sounding speech with our TTS API

Text to Speech API

Convert any text into natural-sounding speech using our advanced neural TTS models.

Endpoint

POST https://api.speechgen.com/v1/text-to-speech

Request

Headers

Header          Value                 Required
Authorization   Bearer YOUR_API_KEY   Yes
Content-Type    application/json      Yes

Body Parameters

Prop     Type
text     string
voice    string
format   string
speed    number
pitch    number
Example Request

curl -X POST https://api.speechgen.com/v1/text-to-speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Welcome to Speechgen. This is a demonstration of our text-to-speech capabilities.",
    "voice": "en-US-Neural2-A",
    "format": "mp3",
    "speed": 1.0,
    "pitch": 0
  }'

Response

Success (200)

Returns binary audio data. The Content-Type header matches the requested format:

  • audio/mpeg for MP3
  • audio/wav for WAV
  • audio/ogg for OGG

Response Headers:

Content-Type: audio/mpeg
Content-Length: 45678
X-Audio-Duration: 4.5
X-Characters-Used: 87
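
The metadata in these headers can be read without decoding the audio body. The following is a minimal Python sketch, assuming the header names and content types shown above. `AudioInfo` and `parse_audio_headers` are hypothetical names used for illustration, not part of the Speechgen API, and treating X-Audio-Duration as seconds is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping

# File extensions for the three content types listed above.
EXTENSION_BY_CONTENT_TYPE = {
    "audio/mpeg": ".mp3",
    "audio/wav": ".wav",
    "audio/ogg": ".ogg",
}


@dataclass
class AudioInfo:
    content_type: str
    extension: str
    duration: float       # value of X-Audio-Duration (assumed to be seconds)
    characters_used: int  # value of X-Characters-Used


def parse_audio_headers(headers: Mapping[str, str]) -> AudioInfo:
    """Extract audio metadata from a successful response's headers."""
    # Drop any parameters after ";" before looking up the extension.
    content_type = headers["Content-Type"].split(";")[0].strip().lower()
    return AudioInfo(
        content_type=content_type,
        extension=EXTENSION_BY_CONTENT_TYPE.get(content_type, ".bin"),
        duration=float(headers["X-Audio-Duration"]),
        characters_used=int(headers["X-Characters-Used"]),
    )
```

With the requests library, `response.headers` can be passed in directly because it is a case-insensitive mapping; a plain dict must use the exact header capitalization shown above.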

Error Response

{
  "error": {
    "code": "invalid_voice",
    "message": "The specified voice 'invalid-voice' does not exist.",
    "param": "voice"
  }
}
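
A client can turn this error object into a typed exception so callers can branch on `code` and `param`. The sketch below assumes error bodies follow the shape shown above. `SpeechgenError` and `error_from_body` are hypothetical names, and the `unknown_error` fallback code is an invention of this example, not a documented value.

```python
import json
from typing import Optional


class SpeechgenError(Exception):
    """Hypothetical exception carrying the fields of the error object."""

    def __init__(self, code: str, message: str, param: Optional[str] = None):
        super().__init__(message)
        self.code = code
        self.message = message
        self.param = param


def error_from_body(body: str) -> SpeechgenError:
    """Build an exception from a JSON error body shaped like the one above."""
    try:
        error = json.loads(body)["error"]
        return SpeechgenError(error["code"], error["message"], error.get("param"))
    except (ValueError, KeyError, TypeError, AttributeError):
        # The body is not in the documented shape (for example, an HTML
        # error page from a proxy), so keep the raw text as the message.
        return SpeechgenError("unknown_error", body)
```

A caller could then write `raise error_from_body(response.text)` whenever the status code is not 200.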

SSML Support

Our API supports Speech Synthesis Markup Language (SSML) for advanced control:

{
  "text": "<speak><prosody rate=\"slow\">Hello</prosody>, <break time=\"500ms\"/> welcome to <emphasis level=\"strong\">Speechgen</emphasis>!</speak>",
  "voice": "en-US-Neural2-A",
  "format": "mp3"
}

Supported SSML Tags

Tag          Description                   Example
<speak>      Root element                  <speak>Hello</speak>
<break>      Insert pause                  <break time="500ms"/>
<prosody>    Control rate, pitch, volume   <prosody rate="slow">...</prosody>
<emphasis>   Add emphasis                  <emphasis level="strong">...</emphasis>
<say-as>     Interpret text type           <say-as interpret-as="date">2024-01-15</say-as>

SSML tags count toward your character limit along with the text they wrap.
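
When SSML is assembled from user-supplied text, characters such as `&` and `<` need escaping so the markup stays well-formed. The sketch below builds a small document from the tags in the table using Python's standard `xml.sax.saxutils` helpers. The function names are hypothetical, and measuring usage with `len()` assumes the limit is counted per character of the submitted string, which is not spelled out here.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_text(text: str) -> str:
    """Escape plain text so it is safe inside SSML."""
    return escape(text)


def ssml_break(time: str) -> str:
    """Insert a pause, e.g. ssml_break("500ms")."""
    return f"<break time={quoteattr(time)}/>"


def ssml_emphasis(text: str, level: str = "strong") -> str:
    """Wrap escaped text in an <emphasis> element."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def ssml_speak(*parts: str) -> str:
    """Join already-built SSML fragments under the <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = ssml_speak(
    ssml_text("Tom & Jerry"),
    ssml_break("500ms"),
    ssml_emphasis("Speechgen"),
)

# Tags and escape sequences add to the length of the submitted string.
print(len("Tom & Jerry Speechgen"), len(ssml))
```

The resulting string can be sent as the `text` field of the request body, as in the JSON example above.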

Code Examples

JavaScript/TypeScript

async function textToSpeech(text: string, voice: string): Promise<Blob> {
  const response = await fetch("https://api.speechgen.com/v1/text-to-speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.SPEECHGEN_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      voice,
      format: "mp3",
    }),
  });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(error.error.message);
  }

  return response.blob();
}

// Usage
const audio = await textToSpeech("Hello world!", "en-US-Neural2-A");

Python

import os

import requests

def text_to_speech(text: str, voice: str) -> bytes:
    response = requests.post(
        'https://api.speechgen.com/v1/text-to-speech',
        headers={
            'Authorization': f'Bearer {os.environ["SPEECHGEN_API_KEY"]}',
            'Content-Type': 'application/json',
        },
        json={
            'text': text,
            'voice': voice,
            'format': 'mp3',
        }
    )

    response.raise_for_status()
    return response.content

# Usage
audio = text_to_speech('Hello world!', 'en-US-Neural2-A')
with open('output.mp3', 'wb') as f:
    f.write(audio)

Best Practices

  1. Batch requests: Combine multiple short texts into a single request when possible
  2. Cache audio: Store generated audio to avoid regenerating the same content
  3. Use streaming: For long texts, consider using our streaming endpoint
  4. Choose appropriate format: Use MP3 for web, WAV for professional editing
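
The caching advice can be sketched as a small on-disk cache keyed by a hash of the request inputs. This is an illustration under stated assumptions, not a prescribed implementation: `cached_speech` is a hypothetical helper, and `synthesize` stands for any callable that performs the actual API request and returns audio bytes.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_speech(
    synthesize: Callable[[str, str, str], bytes],
    cache_dir: Path,
    text: str,
    voice: str,
    audio_format: str = "mp3",
) -> bytes:
    """Return cached audio if present, otherwise synthesize and store it."""
    # Hash every input that affects the audio so distinct requests
    # never share a cache entry.
    key_source = json.dumps([text, voice, audio_format], ensure_ascii=False)
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.{audio_format}"

    if path.exists():
        return path.read_bytes()

    # Cache miss: call the API once and keep the result for next time.
    audio = synthesize(text, voice, audio_format)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

If requests also vary `speed` or `pitch`, those values would need to be part of the hashed key as well; otherwise two different renderings of the same text would collide.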