Text-to-Speech Guide
This guide provides a comprehensive overview of Uberduck's text-to-speech capabilities, including tips, best practices, and advanced usage scenarios.
Basic Text-to-Speech
The simplest text-to-speech request requires just three elements:
- Text content to be spoken
- Voice selection
- Model selection
const response = await fetch('https://api.uberduck.ai/v1/text-to-speech', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: 'Welcome to Uberduck text-to-speech API.',
voice: 'polly_joanna',
model: 'polly_neural'
})
});
const data = await response.json();
console.log(`Audio URL: ${data.audio_url}`);
Text Content Best Practices
Text Length
While the API accepts up to 10,000 characters per request, consider these guidelines:
- Short text (< 500 characters): Ideal for UI feedback, notifications, and alerts
- Medium text (500-2,000 characters): Good for short paragraphs, product descriptions
- Long text (2,000-10,000 characters): Suitable for articles, long-form content
For very long content, consider splitting it into multiple requests (for example at sentence boundaries) and concatenating the resulting audio files.
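The Python sketch below illustrates one way to do this: it splits text at sentence boundaries so each chunk stays under the documented limit, then submits one request per chunk. The splitting helper and chunk size are illustrative choices, not part of the API:
import re
import requests

API_URL = "https://api.uberduck.ai/v1/text-to-speech"
API_KEY = "YOUR_API_KEY"
MAX_CHARS = 10000  # documented per-request limit; use smaller chunks if you prefer

def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks below max_chars, breaking at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than max_chars is kept as its own chunk.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_long_text(text, voice="polly_joanna", model="polly_neural"):
    """Return one audio URL per chunk; download and join the clips afterwards."""
    audio_urls = []
    for chunk in split_text(text):
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"text": chunk, "voice": voice, "model": model},
        )
        response.raise_for_status()
        audio_urls.append(response.json()["audio_url"])
    return audio_urls
The returned clips can then be downloaded and joined with an audio tool such as ffmpeg or pydub.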
Text Formatting
For optimal speech synthesis:
- Include proper punctuation for natural pauses
- Use complete sentences when possible
- Spell out numbers, abbreviations, and acronyms if you want them pronounced in full
- Consider phonetic spelling for uncommon words or names
Example:
"The CEO of NASA, Bill Nelson (born 1942), announced a $24.5 billion budget."
Special Characters and Symbols
Handling special characters (a small pre-processing sketch follows this list):
- Currency symbols ($, €, £) are generally spoken correctly
- Emoji and special symbols may be ignored or mispronounced
- For math equations, write them out in words (e.g., "x squared plus y squared equals z")
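If you want full control over how symbols, abbreviations, and emoji are read, normalizing the text yourself before sending it is a simple option. The substitution table below is purely illustrative; adapt it to your own content rather than treating it as API behavior:
import re

# Illustrative substitutions: expand symbols and abbreviations you want read in full.
SUBSTITUTIONS = {
    "&": " and ",
    "%": " percent ",
    "Dr.": "Doctor",
    "approx.": "approximately",
}

def normalize_for_tts(text):
    """Expand symbols/abbreviations and strip emoji before synthesis."""
    for symbol, replacement in SUBSTITUTIONS.items():
        text = text.replace(symbol, replacement)
    # Remove emoji (rough filter over the main emoji code-point range).
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)
    # Collapse any doubled whitespace introduced by the replacements.
    return re.sub(r"\s{2,}", " ", text).strip()

print(normalize_for_tts("Dr. Smith saw approx. 85% of patients 🎉 at the clinic & lab."))
# -> "Doctor Smith saw approximately 85 percent of patients at the clinic and lab."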
Voice Selection Strategy
Choose voices based on your use case (a short selection sketch follows these lists):
By Provider
- AWS Polly: Good balance of quality and cost
- Google: Excellent for multilingual content
- Azure: Strong in natural-sounding voices
By Voice Characteristics
- Gender: Choose based on your target audience or brand identity
- Age: Select to match your content's context
- Accent: Consider your audience's geographical location
- Style: Professional voices for business content, casual for conversational
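One lightweight way to encode such a strategy is a lookup from use case to a ranked list of candidate voices. The categories and rankings below are illustrative; the voice IDs are simply the ones used elsewhere in this guide:
# Illustrative use-case presets; replace with voices that fit your brand and audience.
VOICE_PRESETS = {
    "notification": ["polly_joanna", "azure_guy"],
    "narration": ["google_wavenet_a", "polly_matthew"],
    "conversational": ["azure_guy", "polly_joanna"],
}

def pick_voice(use_case, preferred_provider=None):
    """Return the first candidate, optionally filtered by provider prefix."""
    candidates = VOICE_PRESETS.get(use_case, ["polly_joanna"])
    if preferred_provider:
        filtered = [v for v in candidates if v.startswith(preferred_provider)]
        candidates = filtered or candidates
    return candidates[0]

print(pick_voice("narration"))                               # google_wavenet_a
print(pick_voice("narration", preferred_provider="polly"))   # polly_matthew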
Testing Multiple Voices
It's often worth testing multiple voices for your content:
import asyncio
import aiohttp
async def test_voice(session, voice_id, text):
url = "https://api.uberduck.ai/v1/text-to-speech"
headers = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json"
}
payload = {
"text": text,
"voice": voice_id,
"model": "polly_neural" if voice_id.startswith("polly") else "google_wavenet" if voice_id.startswith("google") else "azure_neural"
}
async with session.post(url, json=payload, headers=headers) as response:
result = await response.json()
return {
"voice_id": voice_id,
"audio_url": result.get("audio_url")
}
async def test_multiple_voices(text, voice_ids):
async with aiohttp.ClientSession() as session:
tasks = [test_voice(session, voice_id, text) for voice_id in voice_ids]
results = await asyncio.gather(*tasks)
return results
# Example usage
test_text = "Welcome to our product demonstration."
voices_to_test = ["polly_joanna", "polly_matthew", "google_wavenet_a", "azure_guy"]
results = asyncio.run(test_multiple_voices(test_text, voices_to_test))
for result in results:
print(f"Voice: {result['voice_id']}, Audio: {result['audio_url']}")
Advanced Parameters
Speech Characteristics
Fine-tune speech output with extended parameters:
const response = await fetch('https://api.uberduck.ai/v1/text-to-speech', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: 'This is a customized speech sample.',
voice: 'polly_joanna',
model: 'polly_neural',
extended: {
speed: 1.2, // 20% faster than normal
pitch: 0.5, // Slightly higher pitch
emotion: 'happy' // Emotional tone (if supported)
}
})
});
Provider-Specific Parameters
Different providers support different advanced parameters:
AWS Polly
{
// ... other params
model_specific: {
engine: 'neural', // 'neural' or 'standard'
voice_style: 'newscaster' // For specific voices that support styles
}
}
Google Cloud TTS
{
// ... other params
model_specific: {
speaking_rate: 0.85, // Range: 0.25 to 4.0
pitch: 2.0, // Range: -20.0 to 20.0
volume_gain_db: 3.0 // Volume adjustment in dB
}
}
Azure Speech Service
{
// ... other params
model_specific: {
style: 'cheerful', // Styles vary by voice
style_degree: 1.5, // Emphasis of the style (0.5-2.0)
role: 'YoungAdultFemale' // Role-play persona (supported by some voices)
}
}
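To make the request shape concrete, here is a sketch of a complete call that forwards Google-specific values through the model_specific field shown above (the parameter values themselves are illustrative):
import os
import requests

response = requests.post(
    "https://api.uberduck.ai/v1/text-to-speech",
    headers={"Authorization": f"Bearer {os.environ['UBERDUCK_API_KEY']}"},
    json={
        "text": "Thanks for calling. An agent will be with you shortly.",
        "voice": "google_wavenet_a",
        "model": "google_wavenet",
        "model_specific": {
            "speaking_rate": 0.9,   # slightly slower than default
            "pitch": -2.0,          # a touch lower
            "volume_gain_db": 1.5,  # mild volume boost
        },
    },
)
response.raise_for_status()
print(response.json()["audio_url"])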
Output Formats
The API supports multiple output formats:
{
// ... other params
output_format: 'mp3' // Options: 'mp3', 'wav', 'ogg'
}
Considerations for each format:
- MP3: Smaller file size, good for web and mobile applications
- WAV: Higher quality, lossless format good for professional audio work
- OGG: Open format with good compression, popular for web applications
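As a usage sketch, the snippet below requests WAV output and then downloads the generated file from the returned audio_url; the filename and download step are illustrative, not part of the API:
import os
import requests

response = requests.post(
    "https://api.uberduck.ai/v1/text-to-speech",
    headers={"Authorization": f"Bearer {os.environ['UBERDUCK_API_KEY']}"},
    json={
        "text": "Your export is ready for download.",
        "voice": "polly_joanna",
        "model": "polly_neural",
        "output_format": "wav",  # lossless output for post-production work
    },
)
response.raise_for_status()
audio_url = response.json()["audio_url"]

# Save the generated audio to a local file.
audio = requests.get(audio_url)
audio.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(audio.content)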
Integration Examples
Web Application
// Frontend JavaScript for a web app
async function generateAndPlaySpeech() {
const textInput = document.getElementById('text-input').value;
const voiceSelect = document.getElementById('voice-select').value;
// Show loading state
document.getElementById('status').textContent = 'Generating speech...';
try {
// This call would typically go through your backend to protect your API key
const response = await fetch('/api/generate-speech', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: textInput,
voice: voiceSelect
})
});
if (!response.ok) {
  throw new Error(`Speech generation failed with status ${response.status}`);
}
const data = await response.json();
// Create audio player
document.getElementById('status').textContent = 'Speech generated!';
const audioPlayer = document.getElementById('audio-player');
audioPlayer.src = data.audio_url;
audioPlayer.style.display = 'block';
audioPlayer.play();
} catch (error) {
document.getElementById('status').textContent = `Error: ${error.message}`;
}
}
Mobile Application
// React Native example
import React, { useState } from 'react';
import { View, Text, TextInput, Button, ActivityIndicator } from 'react-native';
import { Audio } from 'expo-av';
export default function TextToSpeechScreen() {
const [text, setText] = useState('');
const [loading, setLoading] = useState(false);
const [sound, setSound] = useState<Audio.Sound | null>(null);
async function generateSpeech() {
setLoading(true);
try {
// API call via your backend
const response = await fetch('https://your-backend.com/generate-speech', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: text,
voice: 'polly_joanna'
})
});
const data = await response.json();
// Play the audio
const { sound: newSound } = await Audio.Sound.createAsync({ uri: data.audio_url });
setSound(newSound);
await newSound.playAsync();
} catch (error) {
console.error('Failed to generate speech:', error);
alert('Failed to generate speech. Please try again.');
} finally {
setLoading(false);
}
}
return (
<View style={{ padding: 20 }}>
<Text style={{ fontSize: 20, marginBottom: 20 }}>Text to Speech</Text>
<TextInput
style={{ borderWidth: 1, padding: 10, marginBottom: 20 }}
placeholder="Enter text to convert to speech"
multiline
value={text}
onChangeText={setText}
/>
{loading ? (
<ActivityIndicator size="large" color="#0000ff" />
) : (
<Button title="Generate Speech" onPress={generateSpeech} />
)}
</View>
);
}
Caching and Optimization
For production applications, consider implementing caching:
# Python backend example with caching
import hashlib
import os
import requests
from flask import Flask, request, jsonify
from cachelib import SimpleCache  # SimpleCache now lives in cachelib (werkzeug.contrib was removed)
app = Flask(__name__)
cache = SimpleCache()
@app.route('/api/generate-speech', methods=['POST'])
def generate_speech():
data = request.json
text = data.get('text')
voice = data.get('voice', 'polly_joanna')
# Create a cache key based on text and voice
cache_key = hashlib.md5(f"{text}:{voice}".encode()).hexdigest()
# Check if we have a cached result
cached_result = cache.get(cache_key)
if cached_result:
return jsonify(cached_result)
# Make API request to Uberduck
response = requests.post(
'https://api.uberduck.ai/v1/text-to-speech',
headers={
'Authorization': f'Bearer {os.environ["UBERDUCK_API_KEY"]}',
'Content-Type': 'application/json'
},
json={
'text': text,
'voice': voice,
'model': 'polly_neural' if voice.startswith('polly') else 'google_wavenet'
}
)
result = response.json()
# Cache the result for 24 hours (86400 seconds)
cache.set(cache_key, result, timeout=86400)
return jsonify(result)
if __name__ == '__main__':
app.run(debug=True)
Conclusion
This guide covered the fundamentals and advanced techniques for using Uberduck's text-to-speech API. For specific implementation details, refer to the Getting Started guide and API Reference.