For the endpoint contract see Audio API. This guide covers practical recipes.

Text-to-speech

Pick a voice and model

| Model | Voices | Strong for |
|---|---|---|
| tts-1 (OpenAI) | alloy, echo, onyx, nova, shimmer, sage, ash, coral, fable | Fast, low cost, strong monolingual English |
| tts-1-hd (OpenAI) | same | Higher fidelity, ~2× cost |
| gemini-2.5-tts | Multilingual | Natural prosody, 30+ languages |
| qwen-tts | Multilingual incl. CN/JA/KO | PRC opt-in required |
Voice character is fixed per voice — you can’t fine-tune them. To pick, render the same line in 3–4 voices and listen.
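One way to run that comparison is a small audition loop. This is a sketch: the shortlist, output layout, and `audition_path` helper are our own choices, and `client` is assumed to be an already-configured SDK client as in the snippets below.

```python
from pathlib import Path

# Hypothetical shortlist -- pick any 3-4 voices from the table above.
CANDIDATE_VOICES = ["alloy", "nova", "onyx", "shimmer"]

def audition_path(voice: str, out_dir: str = "auditions") -> Path:
    """Deterministic output file for one candidate voice."""
    return Path(out_dir) / f"{voice}.mp3"

def audition(client, line: str, voices=CANDIDATE_VOICES) -> list[Path]:
    """Render the same line once per voice so you can A/B them by ear."""
    paths = []
    for voice in voices:
        audio = client.audio.speech.create(
            model="tts-1", voice=voice, input=line, response_format="mp3"
        )
        path = audition_path(voice)
        path.parent.mkdir(parents=True, exist_ok=True)
        audio.stream_to_file(str(path))
        paths.append(path)
    return paths
```

Listen to the resulting files back to back; voice character matters more than fidelity for most products, so audition before you upgrade to tts-1-hd.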

Save to file

python
audio = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Welcome to Infery. Your audio is being generated now.",
    response_format="mp3",
    speed=1.0,
)
audio.stream_to_file("welcome.mp3")

Sample output

“Welcome to Infery — one API for every AI model.” Generated with gemini-2.5-flash-preview-tts, voice Kore.

Stream to a player

For long passages, stream the bytes directly to the user’s audio element instead of downloading then playing:
python
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input=long_text,
) as resp:
    for chunk in resp.iter_bytes(8192):
        write_to_pipe(chunk)

Format choice

| Format | Bitrate | Use |
|---|---|---|
| mp3 | ~96 kbps | Default; widely supported |
| opus | ~64 kbps | Best for streaming voice (web/WebRTC) |
| wav | uncompressed | Editing, further processing |
| flac | lossless compressed | Archival |
| pcm | raw 24 kHz mono | Custom pipelines (synth, modems) |

Pacing

speed ranges from 0.25 to 4.0. Most listeners are comfortable at 0.95–1.15. Above 1.5 the speech is still intelligible but tiring; below 0.85 it starts to sound robotic.
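If the speed comes from user input, it is worth clamping it to the valid range before the call. The clamp helper below is our own convenience, not part of the API:

```python
def clamp_speed(requested: float) -> float:
    """Keep a user-supplied playback speed inside the API's 0.25-4.0 range."""
    return min(max(requested, 0.25), 4.0)

# Example: a slightly brisk, still-comfortable pace.
# client.audio.speech.create(model="tts-1", voice="nova",
#                            input=text, speed=clamp_speed(1.1))
```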

Speech-to-text

Quick transcription

python
tr = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("meeting.mp3", "rb"),
    response_format="verbose_json",
    language="en",
)
print(tr.text)
for s in tr.segments:
    print(f"[{s.start:.1f}s] {s.text}")

Long-form audio (>25 MB)

Whisper-1 caps at 25 MB. For longer recordings, split first:
python
import subprocess, tempfile, os

def split(input_path: str, segment_seconds: int = 600) -> list[str]:
    out_dir = tempfile.mkdtemp()
    pattern = os.path.join(out_dir, "part_%03d.mp3")
    subprocess.run([
        "ffmpeg", "-i", input_path,
        "-f", "segment", "-segment_time", str(segment_seconds),
        "-c:a", "libmp3lame", "-b:a", "64k",
        pattern,
    ], check=True)
    return sorted(os.path.join(out_dir, f) for f in os.listdir(out_dir))

def transcribe_long(path: str) -> str:
    parts = split(path)
    return "\n".join(
        client.audio.transcriptions.create(
            model="whisper-1",
            file=open(p, "rb"),
            response_format="text",
        )
        for p in parts
    )
Use 64 kbps mono MP3 — Whisper doesn’t benefit from higher bitrate, and you stay well under the size cap.

Subtitle export

response_format="srt" or "vtt" returns ready-to-use subtitle files:
python
vtt = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("video.mp4", "rb"),
    response_format="vtt",
)
with open("subs.vtt", "w") as f:
    f.write(vtt)  # ready for HTML5 <track>

Word-level timestamps

python
tr = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("call.wav", "rb"),
    response_format="verbose_json",
    timestamp_granularities=["word"],
)
for w in tr.words:
    print(w.word, w.start, w.end)
Useful for click-to-seek transcripts and aligning with diarisation.
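For a click-to-seek UI you typically want phrase-level cues rather than individual words. One simple heuristic, sketched here with plain dicts in place of the SDK's word objects (the `max_gap` threshold is our own choice), is to start a new cue whenever the silence before a word exceeds a gap:

```python
def to_cues(words: list[dict], max_gap: float = 0.6) -> list[dict]:
    """Group word timestamps into phrase-level cues: a new cue starts
    whenever the silence before a word exceeds max_gap seconds."""
    cues: list[dict] = []
    current: dict | None = None
    for w in words:
        # Close the current cue if the pause before this word is long enough.
        if current is not None and w["start"] - current["end"] > max_gap:
            cues.append(current)
            current = None
        if current is None:
            current = {"start": w["start"], "end": w["end"], "text": w["word"]}
        else:
            current["end"] = w["end"]
            current["text"] += " " + w["word"]
    if current is not None:
        cues.append(current)
    return cues
```

Each cue's start is a natural seek target; render the text as a clickable span and set the player's currentTime on click.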

Translation

Use client.audio.translations.create(...) to transcribe and translate to English in one call. Source language is auto-detected.
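A minimal wrapper might look like this (the function name and file path are our own; `client` is assumed configured as elsewhere on this page):

```python
def translate_to_english(client, path: str) -> str:
    """Transcribe non-English speech and translate it to English in one call."""
    with open(path, "rb") as f:
        return client.audio.translations.create(
            model="whisper-1",
            file=f,
        ).text
```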

Round-trip: voice agent

Combining STT → chat → TTS gives you a basic voice agent:
python
def respond(user_audio_path: str) -> bytes:
    user_text = client.audio.transcriptions.create(
        model="whisper-1",
        file=open(user_audio_path, "rb"),
    ).text

    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
    ).choices[0].message.content

    speech = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=reply,
        response_format="opus",
    )
    return speech.read()
End-to-end latency is dominated by the chat call. Realtime (WebSocket bidirectional voice) is on the roadmap.

Costs at a glance

  • TTS: ~$15 per 1M characters on tts-1, ~$30 on tts-1-hd
  • Whisper STT: ~$6 per hour of audio
  • Long meetings (1 h) typically cost less than the chat completion that follows them
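A small estimator makes those numbers concrete. The rates below are copied from the list above; treat them as illustrative and check current pricing before budgeting:

```python
# Rates from the list above -- verify against current pricing before relying on them.
TTS_USD_PER_M_CHARS = {"tts-1": 15.0, "tts-1-hd": 30.0}
WHISPER_USD_PER_HOUR = 6.0

def tts_cost(chars: int, model: str = "tts-1") -> float:
    """Estimated TTS cost in USD for a given character count."""
    return chars / 1_000_000 * TTS_USD_PER_M_CHARS[model]

def stt_cost(seconds: float) -> float:
    """Estimated Whisper transcription cost in USD for an audio duration."""
    return seconds / 3600 * WHISPER_USD_PER_HOUR
```

At these rates a 1,000-character spoken reply costs about 1.5¢ on tts-1, and a one-hour meeting transcribes for about $6.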

Pitfalls

  • Wrong language hint — a correct language= hint beats auto-detect, but a wrong one drops STT accuracy; omit it when unsure.
  • Quiet/clipped recordings — Whisper handles noise well but not clipping. Normalise levels before transcribing.
  • TTS swallowing punctuation — write naturally; “Hi—how are you?” reads better than “Hi how are you”.
  • Long base64 audio over JSON wastes 33% bandwidth vs. multipart. Use multipart unless you have a JSON-only client.
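For the quiet/clipped-recording pitfall, one pre-processing option is ffmpeg's loudnorm filter. This is a sketch; the loudness targets are illustrative defaults, not values the API requires:

```python
import subprocess

def normalize_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg invocation applying EBU R128 loudness normalisation."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # illustrative targets
        dst,
    ]

def normalize(src: str, dst: str) -> None:
    """Normalise levels before sending dst to Whisper."""
    subprocess.run(normalize_cmd(src, dst), check=True)
```

Note that normalisation cannot undo clipping that already happened at recording time; it only evens out levels.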