Text-to-Speech Endpoint
Generate speech from text using a specified voice and, optionally, a specific model.
Request
POST /v1/text-to-speech
Request Body Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The text to convert to speech. Maximum length is 10,000 characters. |
| voice | string | Yes | The voice ID to use (e.g., polly_joanna, google_wavenet_a, azure_guy). |
| model | string | No | The model ID to use (e.g., polly_neural, google_wavenet). If not specified, a default model compatible with the voice is used. |
| extended | object | No | Common parameters supported by many models. |
| model_specific | object | No | Parameters specific to the chosen model. |
| output_format | string | No | Desired output format. Options: mp3, wav, ogg. Default: mp3. |
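As a sketch of how a client might assemble this body and catch obvious mistakes before sending it, the Python helper below applies the constraints from the table. The name `build_tts_payload` and the decision to validate on the client are illustrative assumptions, not part of the API; the server remains the authority on what is accepted.

```python
from typing import Any, Optional

MAX_TEXT_LENGTH = 10_000                # limit stated in the parameter table
OUTPUT_FORMATS = {"mp3", "wav", "ogg"}  # formats listed in the parameter table


def build_tts_payload(
    text: str,
    voice: str,
    model: Optional[str] = None,
    extended: Optional[dict] = None,
    model_specific: Optional[dict] = None,
    output_format: Optional[str] = None,
) -> dict:
    """Build a request body, leaving out optional fields that were not given."""
    if not text:
        raise ValueError("text is required")
    # Assumption: the API counts characters the way Python's len() does.
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if not voice:
        raise ValueError("voice is required")
    if output_format is not None and output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    payload: dict[str, Any] = {"text": text, "voice": voice}
    optional = {
        "model": model,
        "extended": extended,
        "model_specific": model_specific,
        "output_format": output_format,
    }
    # Omitted fields fall back to the server-side defaults.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Leaving `model` and `output_format` out of the body, rather than sending `null`, lets the documented defaults apply.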
Extended Parameters
These parameters are common across many models:
| Parameter | Type | Description |
|---|---|---|
| emotion | string | Emotion to apply to the voice (e.g., happy, sad, neutral). |
| speed | float | Playback speed multiplier. Range: 0.5 to 2.0. Default: 1.0. |
| pitch | float | Voice pitch adjustment. Range: -10.0 to 10.0. Default: 0.0. |
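The ranges in this table can be checked before a request is sent. The following Python sketch does that; `validate_extended` is a hypothetical helper, and treating the range endpoints as inclusive is an assumption, since the table does not say.

```python
# Ranges copied from the Extended Parameters table: name -> (minimum, maximum).
EXTENDED_RANGES = {"speed": (0.5, 2.0), "pitch": (-10.0, 10.0)}


def validate_extended(extended: dict) -> dict:
    """Return a copy of `extended` after checking the documented numeric ranges."""
    for name, (low, high) in EXTENDED_RANGES.items():
        if name not in extended:
            continue
        value = extended[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number")
        # Assumption: both endpoints of the range are allowed.
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if "emotion" in extended and not isinstance(extended["emotion"], str):
        raise TypeError("emotion must be a string")
    return dict(extended)
```

The set of accepted `emotion` values is not enumerated in the table, so the sketch checks only its type.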
Model-Specific Parameters
Parameters vary by model. Some examples:
AWS Polly
| Parameter | Type | Description |
|---|---|---|
| engine | string | The engine to use. Options: standard, neural. Default depends on voice compatibility. |
Google Cloud TTS
| Parameter | Type | Description |
|---|---|---|
| speaking_rate | float | Speaking rate. Range: 0.25 to 4.0. Default: 1.0. |
| pitch | float | Voice pitch. Range: -20.0 to 20.0. Default: 0.0. |
Azure Speech Service
| Parameter | Type | Description |
|---|---|---|
| pitch | string | Voice pitch adjustment. Options: x-low, low, medium, high, x-high. Default: medium. |
| rate | string | Speaking rate. Options: x-slow, slow, medium, fast, x-fast. Default: medium. |
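Note that `pitch` is a float for Google Cloud TTS but a string label for Azure Speech Service, so the same key takes different value types depending on the model. The Python sketch below shows one way a client could encode the rules from the three tables. The provider keys (`polly`, `google`, `azure`) and the helper name are hypothetical, and because the tables are described only as examples, parameters they do not list are passed through unchecked.

```python
# Hypothetical provider keys -> parameter name -> rule.
# A tuple is a numeric (min, max) range; a frozenset is a list of allowed strings.
MODEL_SPECIFIC_RULES = {
    "polly": {"engine": frozenset({"standard", "neural"})},
    "google": {"speaking_rate": (0.25, 4.0), "pitch": (-20.0, 20.0)},
    "azure": {
        "pitch": frozenset({"x-low", "low", "medium", "high", "x-high"}),
        "rate": frozenset({"x-slow", "slow", "medium", "fast", "x-fast"}),
    },
}


def validate_model_specific(provider: str, params: dict) -> dict:
    """Check `params` against the documented rules for `provider`."""
    # Unknown providers have no rules here, so their parameters pass through.
    rules = MODEL_SPECIFIC_RULES.get(provider, {})
    for name, value in params.items():
        rule = rules.get(name)
        if rule is None:
            continue  # not covered by the example tables
        if isinstance(rule, tuple):
            low, high = rule
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not low <= value <= high:
                raise ValueError(f"{name} must be a number between {low} and {high}")
        elif not isinstance(value, str) or value not in rule:
            raise ValueError(f"{name} must be one of {sorted(rule)}")
    return dict(params)
```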
Example Request
curl -X POST https://api.uberduck.ai/v1/text-to-speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "voice": "polly_joanna",
    "model": "polly_neural",
    "extended": {
      "speed": 1.2
    },
    "output_format": "mp3"
  }'
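The same request can be issued from Python using only the standard library. This is a minimal sketch: the function names and the 30-second timeout are illustrative choices, while the URL and headers are taken from the curl example.

```python
import json
import urllib.request

API_URL = "https://api.uberduck.ai/v1/text-to-speech"  # URL from the curl example


def make_tts_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = make_tts_request(api_key, payload)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `synthesize("YOUR_API_KEY", {"text": "Hello, world!", "voice": "polly_joanna"})` would perform the network call; separating request construction from sending keeps the former easy to test offline.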
Response
Response Body
| Field | Type | Description |
|---|---|---|
| audio_url | string | URL to the generated audio file. This URL is valid for 24 hours. |
| duration_seconds | float | Duration of the generated audio in seconds. |
| model | string | The model used for generation. |
| voice | string | The voice used for generation. |
| created_at | string | ISO 8601 timestamp when the audio was created. |
Example Response
{
  "audio_url": "https://static.uberduck.ai/output/fc9a66eb-1a0f-4cab-a139-13e8dc306786.mp3",
  "duration_seconds": 1.5,
  "model": "polly_neural",
  "voice": "polly_joanna",
  "created_at": "2025-05-27T17:30:00Z"
}
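A client will usually want the response as a typed value and may want to know when the audio URL stops working. The Python sketch below parses the example response into a dataclass. `TTSResult` and `parse_tts_response` are hypothetical names, and measuring the 24-hour validity from `created_at` is an assumption; the reference does not say when the window starts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

URL_LIFETIME = timedelta(hours=24)  # validity period stated for audio_url


@dataclass(frozen=True)
class TTSResult:
    audio_url: str
    duration_seconds: float
    model: str
    voice: str
    created_at: datetime

    @property
    def expires_at(self) -> datetime:
        # Assumption: the 24-hour window is counted from created_at.
        return self.created_at + URL_LIFETIME


def parse_tts_response(body: dict) -> TTSResult:
    """Convert a decoded JSON response body into a TTSResult."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    created = datetime.fromisoformat(body["created_at"].replace("Z", "+00:00"))
    return TTSResult(
        audio_url=body["audio_url"],
        duration_seconds=float(body["duration_seconds"]),
        model=body["model"],
        voice=body["voice"],
        created_at=created,
    )
```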
Error Codes
| Status Code | Error Code | Description |
|---|---|---|
| 400 | invalid_request | Invalid request parameters. |
| 400 | text_too_long | Text exceeds maximum length. |
| 400 | invalid_voice | The specified voice does not exist. |
| 401 | unauthorized | Invalid or missing API key. |
| 404 | voice_not_found | The specified voice was not found. |
| 404 | model_not_found | The specified model was not found. |
| 422 | voice_model_incompatible | The specified voice is not compatible with the specified model. |
| 429 | rate_limit_exceeded | Rate limit exceeded. |
| 500 | internal_error | Internal server error. |
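The table lists the codes but not a retry policy or the shape of the error response body. As a sketch under those limits, the Python below records the documented code-to-status mapping and treats only `rate_limit_exceeded` and `internal_error` as retryable, with exponential backoff. That retry policy, the helper names, and the delay values are assumptions based on common practice, not documented behavior.

```python
# Error code -> HTTP status, copied from the Error Codes table.
ERROR_STATUS = {
    "invalid_request": 400,
    "text_too_long": 400,
    "invalid_voice": 400,
    "unauthorized": 401,
    "voice_not_found": 404,
    "model_not_found": 404,
    "voice_model_incompatible": 422,
    "rate_limit_exceeded": 429,
    "internal_error": 500,
}

# Assumption: only these two codes can succeed on an unchanged retry.
RETRYABLE_CODES = frozenset({"rate_limit_exceeded", "internal_error"})


def is_retryable(error_code: str) -> bool:
    """Return True if repeating the same request might succeed."""
    return error_code in RETRYABLE_CODES


def retry_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff delay in seconds for a zero-based attempt number."""
    return min(cap, base * (2 ** attempt))
```

The remaining codes describe problems with the request itself (bad parameters, unknown voice or model, missing key), so resending it unchanged would be expected to fail again.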
Usage Notes
- The maximum text length is 10,000 characters. Requests with longer text are rejected with a 400 text_too_long error.
- Audio URLs are valid for 24 hours. Download and store the audio if you need it for longer.
- When using provider voices (AWS Polly, Google, Azure), the voice parameter should include the provider prefix (e.g., polly_joanna, google_wavenet_a).
- Supported output formats: mp3, wav, ogg.
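Text longer than the 10,000-character limit has to be split on the client and sent as several requests. The Python sketch below is one way to do that. `split_text` is a hypothetical helper, and its sentence detection is a rough heuristic (a break after `.`, `!`, or `?` followed by whitespace), not something the API prescribes.

```python
import re

MAX_TEXT_LENGTH = 10_000  # limit stated in the usage notes


def split_text(text: str, limit: int = MAX_TEXT_LENGTH) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    breaking at sentence boundaries where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would go out as its own request, and the resulting audio files would need to be downloaded and joined by the client before their URLs expire.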