
Text-to-Speech Guide

This guide provides a comprehensive overview of Uberduck's text-to-speech capabilities, including tips, best practices, and advanced usage scenarios.

Basic Text-to-Speech

The simplest text-to-speech request requires just three elements:

  1. Text content to be spoken
  2. Voice selection
  3. Model selection
const response = await fetch('https://api.uberduck.ai/v1/text-to-speech', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    text: 'Welcome to Uberduck text-to-speech API.',
    voice: 'polly_joanna',
    model: 'polly_neural'
  })
});

const data = await response.json();
console.log(`Audio URL: ${data.audio_url}`);
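The response's audio_url points to the generated audio. If you want to persist the file rather than just log the link, here is a minimal follow-up sketch, assuming the URL is directly downloadable and you are running Node.js 18+ with built-in fetch:

import { writeFile } from 'node:fs/promises';

// Download the generated audio and save it locally.
// Assumes data.audio_url (from the request above) is a directly downloadable URL.
const audioResponse = await fetch(data.audio_url);
const audioBuffer = Buffer.from(await audioResponse.arrayBuffer());
await writeFile('speech.mp3', audioBuffer);
console.log('Saved speech.mp3');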

Text Content Best Practices

Text Length

While the API accepts up to 10,000 characters per request, consider these guidelines:

  • Short text (< 500 characters): Ideal for UI feedback, notifications, and alerts
  • Medium text (500-2,000 characters): Good for short paragraphs, product descriptions
  • Long text (2,000-10,000 characters): Suitable for articles, long-form content

For very long content, consider splitting it into multiple requests and concatenating the results.
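A minimal sketch of that approach: split on sentence boundaries, stay under a per-request character budget, and synthesize each chunk separately. The generateSpeech() helper is a hypothetical wrapper around the basic request shown above, and the resulting audio files still need to be concatenated with your own audio tooling:

// Split long text into chunks under a character budget, breaking at sentence boundaries.
// The sentence split is intentionally naive; adjust it for your content.
function splitText(text, maxLength = 2000) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && (current + sentence).length > maxLength) {
      chunks.push(current.trim());
      current = '';
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Synthesize each chunk separately. `generateSpeech` is a hypothetical wrapper
// around the basic fetch request shown earlier; `longArticleText` is your own input.
const chunks = splitText(longArticleText);
const audioUrls = [];
for (const chunk of chunks) {
  const result = await generateSpeech(chunk);
  audioUrls.push(result.audio_url);
}
// audioUrls can now be downloaded and concatenated with your audio tooling of choice.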

Text Formatting

For optimal speech synthesis:

  • Include proper punctuation for natural pauses
  • Use complete sentences when possible
  • Spell out numbers, abbreviations, and acronyms if you want them pronounced in full (a simple pre-processing sketch follows the example below)
  • Consider phonetic spelling for uncommon words or names

Example:

"The CEO of NASA, Bill Nelson (born 1942), announced a $24.5 billion budget."

Special Characters and Symbols

Handling special characters:

  • Currency symbols ($, €, £) are generally spoken correctly
  • Emoji and special symbols may be ignored or mispronounced; consider stripping them before synthesis (see the sketch below)
  • For math equations, write them out in words (e.g., "x squared plus y squared equals z")
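A minimal sketch for stripping emoji and other pictographic symbols before synthesis, using JavaScript's Unicode property escapes (requires a modern runtime such as Node.js or an up-to-date browser):

// Remove emoji and other pictographic symbols that voices may mispronounce or skip.
function stripEmoji(text) {
  return text
    .replace(/\p{Extended_Pictographic}/gu, '')
    .replace(/\s{2,}/g, ' ')
    .trim();
}

console.log(stripEmoji('Great job! 🎉 See you soon 👋'));
// "Great job! See you soon"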

Voice Selection Strategy

Choose voices based on your use case:

By Provider

  • AWS Polly: Good balance of quality and cost
  • Google: Excellent for multilingual content
  • Azure: Strong in natural-sounding voices

By Voice Characteristics

  • Gender: Choose based on your target audience or brand identity
  • Age: Select to match your content's context
  • Accent: Consider your audience's geographical location
  • Style: Professional voices for business content, casual voices for conversational content

Testing Multiple Voices

It's often worth testing multiple voices for your content:

import asyncio
import aiohttp

async def test_voice(session, voice_id, text):
    url = "https://api.uberduck.ai/v1/text-to-speech"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    }
    # Pick the model that matches the voice's provider
    if voice_id.startswith("polly"):
        model = "polly_neural"
    elif voice_id.startswith("google"):
        model = "google_wavenet"
    else:
        model = "azure_neural"
    payload = {
        "text": text,
        "voice": voice_id,
        "model": model
    }

    async with session.post(url, json=payload, headers=headers) as response:
        result = await response.json()
        return {
            "voice_id": voice_id,
            "audio_url": result.get("audio_url")
        }

async def test_multiple_voices(text, voice_ids):
    async with aiohttp.ClientSession() as session:
        tasks = [test_voice(session, voice_id, text) for voice_id in voice_ids]
        return await asyncio.gather(*tasks)

# Example usage
test_text = "Welcome to our product demonstration."
voices_to_test = ["polly_joanna", "polly_matthew", "google_wavenet_a", "azure_guy"]

results = asyncio.run(test_multiple_voices(test_text, voices_to_test))
for result in results:
    print(f"Voice: {result['voice_id']}, Audio: {result['audio_url']}")

Advanced Parameters

Speech Characteristics

Fine-tune speech output with extended parameters:

const response = await fetch('https://api.uberduck.ai/v1/text-to-speech', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    text: 'This is a customized speech sample.',
    voice: 'polly_joanna',
    model: 'polly_neural',
    extended: {
      speed: 1.2,      // 20% faster than normal
      pitch: 0.5,      // Slightly higher pitch
      emotion: 'happy' // Emotional tone (if supported)
    }
  })
});

Provider-Specific Parameters

Different providers support different advanced parameters:

AWS Polly

{
  // ... other params
  model_specific: {
    engine: 'neural',         // 'neural' or 'standard'
    voice_style: 'newscaster' // For specific voices that support styles
  }
}

Google Cloud TTS

{
  // ... other params
  model_specific: {
    speaking_rate: 0.85, // Range: 0.25 to 4.0
    pitch: 2.0,          // Range: -20.0 to 20.0
    volume_gain_db: 3.0  // Volume adjustment in dB
  }
}

Azure Speech Service

{
  // ... other params
  model_specific: {
    style: 'cheerful',       // Styles vary by voice
    style_degree: 1.5,       // Emphasis of the style (0.5-2.0)
    role: 'YoungAdultFemale' // Role playing for the voice
  }
}

Output Formats

The API supports multiple output formats:

{
  // ... other params
  output_format: 'mp3' // Options: 'mp3', 'wav', 'ogg'
}

Considerations for each format:

  • MP3: Smaller file size, good for web and mobile applications
  • WAV: Uncompressed, lossless audio; larger files, well suited to professional audio work
  • OGG: Open format with good compression, popular for web applications

Integration Examples

Web Application

// Frontend JavaScript for a web app
async function generateAndPlaySpeech() {
  const textInput = document.getElementById('text-input').value;
  const voiceSelect = document.getElementById('voice-select').value;

  // Show loading state
  document.getElementById('status').textContent = 'Generating speech...';

  try {
    // This call would typically go through your backend to protect your API key
    const response = await fetch('/api/generate-speech', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        text: textInput,
        voice: voiceSelect
      })
    });

    const data = await response.json();

    // Create audio player
    document.getElementById('status').textContent = 'Speech generated!';
    const audioPlayer = document.getElementById('audio-player');
    audioPlayer.src = data.audio_url;
    audioPlayer.style.display = 'block';
    audioPlayer.play();

  } catch (error) {
    document.getElementById('status').textContent = `Error: ${error.message}`;
  }
}

Mobile Application

// React Native example (TypeScript, using expo-av for playback)
import React, { useState } from 'react';
import { View, Text, TextInput, Button, ActivityIndicator } from 'react-native';
import { Audio } from 'expo-av';

export default function TextToSpeechScreen() {
  const [text, setText] = useState('');
  const [loading, setLoading] = useState(false);
  const [sound, setSound] = useState<Audio.Sound | null>(null);

  async function generateSpeech() {
    setLoading(true);

    try {
      // API call via your backend
      const response = await fetch('https://your-backend.com/generate-speech', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          text: text,
          voice: 'polly_joanna'
        })
      });

      const data = await response.json();

      // Play the audio
      const { sound: newSound } = await Audio.Sound.createAsync({ uri: data.audio_url });
      setSound(newSound);
      await newSound.playAsync();
    } catch (error) {
      console.error('Failed to generate speech:', error);
      alert('Failed to generate speech. Please try again.');
    } finally {
      setLoading(false);
    }
  }

  return (
    <View style={{ padding: 20 }}>
      <Text style={{ fontSize: 20, marginBottom: 20 }}>Text to Speech</Text>
      <TextInput
        style={{ borderWidth: 1, padding: 10, marginBottom: 20 }}
        placeholder="Enter text to convert to speech"
        multiline
        value={text}
        onChangeText={setText}
      />
      {loading ? (
        <ActivityIndicator size="large" color="#0000ff" />
      ) : (
        <Button title="Generate Speech" onPress={generateSpeech} />
      )}
    </View>
  );
}

Caching and Optimization

For production applications, consider implementing caching:

# Python backend example with caching
import hashlib
import os

import requests
from cachelib import SimpleCache  # werkzeug.contrib.cache was removed; cachelib is its replacement
from flask import Flask, request, jsonify

app = Flask(__name__)
cache = SimpleCache()

@app.route('/api/generate-speech', methods=['POST'])
def generate_speech():
    data = request.json
    text = data.get('text')
    voice = data.get('voice', 'polly_joanna')

    # Create a cache key based on text and voice
    cache_key = hashlib.md5(f"{text}:{voice}".encode()).hexdigest()

    # Check if we have a cached result
    cached_result = cache.get(cache_key)
    if cached_result:
        return jsonify(cached_result)

    # Make API request to Uberduck
    response = requests.post(
        'https://api.uberduck.ai/v1/text-to-speech',
        headers={
            'Authorization': f'Bearer {os.environ["UBERDUCK_API_KEY"]}',
            'Content-Type': 'application/json'
        },
        json={
            'text': text,
            'voice': voice,
            'model': 'polly_neural' if voice.startswith('polly') else 'google_wavenet'
        }
    )

    result = response.json()

    # Cache the result for 24 hours (86400 seconds)
    cache.set(cache_key, result, timeout=86400)

    return jsonify(result)

if __name__ == '__main__':
    app.run(debug=True)

Conclusion

This guide covered the fundamentals and advanced techniques for using Uberduck's text-to-speech API. For specific implementation details, refer to the Getting Started guide and API Reference.