Text-to-Speech Guide
This guide provides a comprehensive overview of Uberduck's text-to-speech capabilities, including tips, best practices, and advanced usage scenarios.
Basic Text-to-Speech
The simplest text-to-speech request requires just three elements:
- Text content to be spoken
- Voice selection
- Model selection
const response = await fetch('https://api.uberduck.ai/v1/text-to-speech', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: 'Welcome to Uberduck text-to-speech API.',
voice: 'polly_joanna',
model: 'polly_neural'
})
});
const data = await response.json();
console.log(`Audio URL: ${data.audio_url}`);
Text Content Best Practices
Text Length
While the API accepts up to 10,000 characters per request, consider these guidelines:
- Short text (< 500 characters): Ideal for UI feedback, notifications, and alerts
- Medium text (500-2,000 characters): Good for short paragraphs, product descriptions
- Long text (2,000-10,000 characters): Suitable for articles, long-form content
For very long content, consider splitting it into multiple requests (for example at sentence boundaries) and concatenating the resulting audio files.
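The Python sketch below illustrates one way to do this: it splits text at sentence boundaries so each chunk stays under the documented limit, then submits one request per chunk. The splitting helper and chunk size are illustrative choices, not part of the API:
import re
import requests

API_URL = "https://api.uberduck.ai/v1/text-to-speech"
API_KEY = "YOUR_API_KEY"
MAX_CHARS = 10000  # documented per-request limit; use smaller chunks if you prefer

def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks below max_chars, breaking at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than max_chars is kept as its own chunk.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_long_text(text, voice="polly_joanna", model="polly_neural"):
    """Return one audio URL per chunk; download and join the clips afterwards."""
    audio_urls = []
    for chunk in split_text(text):
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"text": chunk, "voice": voice, "model": model},
        )
        response.raise_for_status()
        audio_urls.append(response.json()["audio_url"])
    return audio_urls
The returned clips can then be downloaded and joined with an audio tool such as ffmpeg or pydub.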
Text Formatting
For optimal speech synthesis:
- Include proper punctuation for natural pauses
- Use complete sentences when possible
- Spell out numbers, abbreviations, and acronyms if you want them pronounced in full
- Consider phonetic spelling for uncommon words or names
Example:
"The CEO of NASA, Bill Nelson (born 1942), announced a $24.5 billion budget."
Special Characters and Symbols
Handling special characters (a small pre-processing sketch follows this list):
- Currency symbols ($, €, £) are generally spoken correctly
- Emoji and special symbols may be ignored or mispronounced
- For math equations, write them out in words (e.g., "x squared plus y squared equals z")
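If you want full control over how symbols, abbreviations, and emoji are read, normalizing the text yourself before sending it is a simple option. The substitution table below is purely illustrative; adapt it to your own content rather than treating it as API behavior:
import re

# Illustrative substitutions: expand symbols and abbreviations you want read in full.
SUBSTITUTIONS = {
    "&": " and ",
    "%": " percent ",
    "Dr.": "Doctor",
    "approx.": "approximately",
}

def normalize_for_tts(text):
    """Expand symbols/abbreviations and strip emoji before synthesis."""
    for symbol, replacement in SUBSTITUTIONS.items():
        text = text.replace(symbol, replacement)
    # Remove emoji (rough filter over the main emoji code-point range).
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)
    # Collapse any doubled whitespace introduced by the replacements.
    return re.sub(r"\s{2,}", " ", text).strip()

print(normalize_for_tts("Dr. Smith saw approx. 85% of patients 🎉 at the clinic & lab."))
# -> "Doctor Smith saw approximately 85 percent of patients at the clinic and lab."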
Voice Selection Strategy
Choose voices based on your use case (a short selection sketch follows these lists):
By Provider
- AWS Polly: Good balance of quality and cost
- Google: Excellent for multilingual content
- Azure: Strong in natural-sounding voices
By Voice Characteristics
- Gender: Choose based on your target audience or brand identity
- Age: Select to match your content's context
- Accent: Consider your audience's geographical location
- Style: Professional voices for business content, casual for conversational
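One lightweight way to encode such a strategy is a lookup from use case to a ranked list of candidate voices. The categories and rankings below are illustrative; the voice IDs are simply the ones used elsewhere in this guide:
# Illustrative use-case presets; replace with voices that fit your brand and audience.
VOICE_PRESETS = {
    "notification": ["polly_joanna", "azure_guy"],
    "narration": ["google_wavenet_a", "polly_matthew"],
    "conversational": ["azure_guy", "polly_joanna"],
}

def pick_voice(use_case, preferred_provider=None):
    """Return the first candidate, optionally filtered by provider prefix."""
    candidates = VOICE_PRESETS.get(use_case, ["polly_joanna"])
    if preferred_provider:
        filtered = [v for v in candidates if v.startswith(preferred_provider)]
        candidates = filtered or candidates
    return candidates[0]

print(pick_voice("narration"))                               # google_wavenet_a
print(pick_voice("narration", preferred_provider="polly"))   # polly_matthew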
Testing Multiple Voices
It's often worth testing multiple voices for your content:
import asyncio
import aiohttp
async def test_voice(session, voice_id, text):
url = "https://api.uberduck.ai/v1/text-to-speech"
headers = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json"
}
payload = {
"text": text,
"voice": voice_id,
"model": "polly_neural" if voice_id.startswith("polly") else "google_wavenet" if voice_id.startswith("google") else "azure_neural"
}
async with session.post(url, json=payload, headers=headers) as response:
result = await response.json()
return {
"voice_id": voice_id,
"audio_url": result.get("audio_url")
}
async def test_multiple_voices(text, voice_ids):
async with aiohttp.ClientSession() as session:
tasks = [test_voice(session, voice_id, text) for voice_id in voice_ids]
results = await asyncio.gather(*tasks)
return results
# Example usage
test_text = "Welcome to our product demonstration."
voices_to_test = ["polly_joanna", "polly_matthew", "google_wavenet_a", "azure_guy"]
results = asyncio.run(test_multiple_voices(test_text, voices_to_test))
for result in results:
print(f"Voice: {result['voice_id']}, Audio: {result['audio_url']}")
Advanced Parameters
Speech Characteristics
Fine-tune speech output with extended parameters:
const response = await fetch('https://api.uberduck.ai/v1/text-to-speech', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: 'This is a customized speech sample.',
voice: 'polly_joanna',
model: 'polly_neural',
extended: {
speed: 1.2, // 20% faster than normal
pitch: 0.5, // Slightly higher pitch
emotion: 'happy' // Emotional tone (if supported)
}
})
});
Provider-Specific Parameters
Different providers support different advanced parameters:
AWS Polly
{
// ... other params
model_specific: {
engine: 'neural', // 'neural' or 'standard'
voice_style: 'newscaster' // For specific voices that support styles
}
}
Google Cloud TTS
{
// ... other params
model_specific: {
speaking_rate: 0.85, // Range: 0.25 to 4.0
pitch: 2.0, // Range: -20.0 to 20.0
volume_gain_db: 3.0 // Volume adjustment in dB
}
}
Azure Speech Service
{
// ... other params
model_specific: {
style: 'cheerful', // Styles vary by voice
style_degree: 1.5, // Emphasis of the style (0.5-2.0)
role: 'YoungAdultFemale' // Role-play persona (supported by some voices)
}
}
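To make the request shape concrete, here is a sketch of a complete call that forwards Google-specific values through the model_specific field shown above (the parameter values themselves are illustrative):
import os
import requests

response = requests.post(
    "https://api.uberduck.ai/v1/text-to-speech",
    headers={"Authorization": f"Bearer {os.environ['UBERDUCK_API_KEY']}"},
    json={
        "text": "Thanks for calling. An agent will be with you shortly.",
        "voice": "google_wavenet_a",
        "model": "google_wavenet",
        "model_specific": {
            "speaking_rate": 0.9,   # slightly slower than default
            "pitch": -2.0,          # a touch lower
            "volume_gain_db": 1.5,  # mild volume boost
        },
    },
)
response.raise_for_status()
print(response.json()["audio_url"])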
Output Formats
The API supports multiple output formats:
{
// ... other params
output_format: 'mp3' // Options: 'mp3', 'wav', 'ogg'
}
Considerations for each format:
- MP3: Smaller file size, good for web and mobile applications
- WAV: Higher quality, lossless format good for professional audio work
- OGG: Open format with good compression, popular for web applications
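As a usage sketch, the snippet below requests WAV output and then downloads the generated file from the returned audio_url; the filename and download step are illustrative, not part of the API:
import os
import requests

response = requests.post(
    "https://api.uberduck.ai/v1/text-to-speech",
    headers={"Authorization": f"Bearer {os.environ['UBERDUCK_API_KEY']}"},
    json={
        "text": "Your export is ready for download.",
        "voice": "polly_joanna",
        "model": "polly_neural",
        "output_format": "wav",  # lossless output for post-production work
    },
)
response.raise_for_status()
audio_url = response.json()["audio_url"]

# Save the generated audio to a local file.
audio = requests.get(audio_url)
audio.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(audio.content)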
Integration Examples
Web Application
// Frontend JavaScript for a web app
async function generateAndPlaySpeech() {
const textInput = document.getElementById('text-input').value;
const voiceSelect = document.getElementById('voice-select').value;
// Show loading state
document.getElementById('status').textContent = 'Generating speech...';
try {
// This call would typically go through your backend to protect your API key
const response = await fetch('/api/generate-speech', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: textInput,
voice: voiceSelect
})
});
if (!response.ok) {
  throw new Error(`Speech generation failed with status ${response.status}`);
}
const data = await response.json();
// Create audio player
document.getElementById('status').textContent = 'Speech generated!';
const audioPlayer = document.getElementById('audio-player');
audioPlayer.src = data.audio_url;
audioPlayer.style.display = 'block';
audioPlayer.play();
} catch (error) {
document.getElementById('status').textContent = `Error: ${error.message}`;
}
}
Mobile Application
// React Native example
import React, { useState } from 'react';
import { View, Text, TextInput, Button, ActivityIndicator } from 'react-native';
import { Audio } from 'expo-av';
export default function TextToSpeechScreen() {
const [text, setText] = useState('');
const [loading, setLoading] = useState(false);
const [sound, setSound] = useState<Audio.Sound | null>(null);
async function generateSpeech() {
setLoading(true);
try {
// API call via your backend
const response = await fetch('https://your-backend.com/generate-speech', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: text,
voice: 'polly_joanna'
})
});
const data = await response.json();
// Play the audio
const { sound: newSound } = await Audio.Sound.createAsync({ uri: data.audio_url });
setSound(newSound);
await newSound.playAsync();
} catch (error) {
console.error('Failed to generate speech:', error);
alert('Failed to generate speech. Please try again.');
} finally {
setLoading(false);
}
}
return (
<View style={{ padding: 20 }}>
<Text style={{ fontSize: 20, marginBottom: 20 }}>Text to Speech</Text>
<TextInput
style={{ borderWidth: 1, padding: 10, marginBottom: 20 }}
placeholder="Enter text to convert to speech"
multiline
value={text}
onChangeText={setText}
/>
{loading ? (
<ActivityIndicator size="large" color="#0000ff" />
) : (
<Button title="Generate Speech" onPress={generateSpeech} />
)}
</View>
);
}
Caching and Optimization
For production applications, consider implementing caching:
# Python backend example with caching
import hashlib
import os
import requests
from flask import Flask, request, jsonify
from cachelib import SimpleCache  # SimpleCache now lives in cachelib (werkzeug.contrib was removed)
app = Flask(__name__)
cache = SimpleCache()
@app.route('/api/generate-speech', methods=['POST'])
def generate_speech():
data = request.json
text = data.get('text')
voice = data.get('voice', 'polly_joanna')
# Create a cache key based on text and voice
cache_key = hashlib.md5(f"{text}:{voice}".encode()).hexdigest()
# Check if we have a cached result
cached_result = cache.get(cache_key)
if cached_result:
return jsonify(cached_result)
# Make API request to Uberduck
response = requests.post(
'https://api.uberduck.ai/v1/text-to-speech',
headers={
'Authorization': f'Bearer {os.environ["UBERDUCK_API_KEY"]}',
'Content-Type': 'application/json'
},
json={
'text': text,
'voice': voice,
'model': 'polly_neural' if voice.startswith('polly') else 'google_wavenet'
}
)
result = response.json()
# Cache the result for 24 hours (86400 seconds)
cache.set(cache_key, result, timeout=86400)
return jsonify(result)
if __name__ == '__main__':
app.run(debug=True)
Conclusion
This guide covered the fundamentals and advanced techniques for using Uberduck's text-to-speech API. For specific implementation details, refer to the Getting Started guide and API Reference.