For the endpoint contract see Audio API. This guide covers practical recipes.

Text-to-speech

Pick a voice and model

| Model | Voices | Strong for |
|---|---|---|
| tts-1 (OpenAI) | alloy, echo, onyx, nova, shimmer, sage, ash, coral, fable | Fast, low cost, strong monolingual English |
| tts-1-hd (OpenAI) | same | Higher fidelity, ~2× cost |
| gemini-2.5-tts | Multilingual | Natural prosody, 30+ languages |
| qwen-tts | Multilingual incl. CN/JA/KO | PRC opt-in required |
Voice character is fixed per voice — you can’t fine-tune them. To pick, render the same line in 3–4 voices and listen.
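One way to run that comparison is a small audition loop. This is a sketch: the shortlist, output layout, and `audition_path` helper are our own choices, and `client` is assumed to be an already-configured SDK client as in the snippets below.

```python
from pathlib import Path

# Hypothetical shortlist -- pick any 3-4 voices from the table above.
CANDIDATE_VOICES = ["alloy", "nova", "onyx", "shimmer"]

def audition_path(voice: str, out_dir: str = "auditions") -> Path:
    """Deterministic output file for one candidate voice."""
    return Path(out_dir) / f"{voice}.mp3"

def audition(client, line: str, voices=CANDIDATE_VOICES) -> list[Path]:
    """Render the same line once per voice so you can A/B them by ear."""
    paths = []
    for voice in voices:
        audio = client.audio.speech.create(
            model="tts-1", voice=voice, input=line, response_format="mp3"
        )
        path = audition_path(voice)
        path.parent.mkdir(parents=True, exist_ok=True)
        audio.stream_to_file(str(path))
        paths.append(path)
    return paths
```

Listen to the resulting files back to back; voice character matters more than fidelity for most products, so audition before you upgrade to tts-1-hd.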

Save to file

python
audio = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Welcome to Infery. Your audio is being generated now.",
    response_format="mp3",
    speed=1.0,
)
audio.stream_to_file("welcome.mp3")

Sample output

“Welcome to Infery — one API for every AI model.” Generated with gemini-2.5-flash-preview-tts, voice Kore.

Stream to a player

For long passages, stream the bytes directly to the user’s audio element instead of downloading then playing:
python
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input=long_text,
) as resp:
    for chunk in resp.iter_bytes(8192):
        write_to_pipe(chunk)

Format choice

| Format | Bitrate | Use |
|---|---|---|
| mp3 | ~96 kbps | Default; widely supported |
| opus | ~64 kbps | Best for streaming voice (web/WebRTC) |
| wav | uncompressed | Editing, further processing |
| flac | lossless compressed | Archival |
| pcm | raw 24 kHz mono | Custom pipelines (synth, modems) |

Pacing

speed ranges from 0.25 to 4.0. Most listeners are comfortable at 0.95–1.15. Above 1.5 the speech is still intelligible but tiring; below 0.85 it starts to sound robotic.
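If the speed comes from user input, it is worth clamping it to the valid range before the call. The clamp helper below is our own convenience, not part of the API:

```python
def clamp_speed(requested: float) -> float:
    """Keep a user-supplied playback speed inside the API's 0.25-4.0 range."""
    return min(max(requested, 0.25), 4.0)

# Example: a slightly brisk, still-comfortable pace.
# client.audio.speech.create(model="tts-1", voice="nova",
#                            input=text, speed=clamp_speed(1.1))
```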

Speech-to-text

Quick transcription

python
tr = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("meeting.mp3", "rb"),
    response_format="verbose_json",
    language="en",
)
print(tr.text)
for s in tr.segments:
    print(f"[{s.start:.1f}s] {s.text}")

Long-form audio (>25 MB)

Whisper-1 caps at 25 MB. For longer recordings, split first:
python
import subprocess, tempfile, os

def split(input_path: str, segment_seconds: int = 600) -> list[str]:
    out_dir = tempfile.mkdtemp()
    pattern = os.path.join(out_dir, "part_%03d.mp3")
    subprocess.run([
        "ffmpeg", "-i", input_path,
        "-f", "segment", "-segment_time", str(segment_seconds),
        "-c:a", "libmp3lame", "-b:a", "64k",
        pattern,
    ], check=True)
    return sorted(os.path.join(out_dir, f) for f in os.listdir(out_dir))

def transcribe_long(path: str) -> str:
    parts = split(path)
    return "\n".join(
        client.audio.transcriptions.create(
            model="whisper-1",
            file=open(p, "rb"),
            response_format="text",
        )
        for p in parts
    )
Use 64 kbps mono MP3 — Whisper doesn’t benefit from higher bitrate, and you stay well under the size cap.

Subtitle export

response_format="srt" or "vtt" returns ready-to-use subtitle files:
python
vtt = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("video.mp4", "rb"),
    response_format="vtt",
)
with open("subs.vtt", "w") as f:
    f.write(vtt)  # ready for HTML5 <track>

Word-level timestamps

python
tr = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("call.wav", "rb"),
    response_format="verbose_json",
    timestamp_granularities=["word"],
)
for w in tr.words:
    print(w.word, w.start, w.end)
Useful for click-to-seek transcripts and aligning with diarisation.
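For a click-to-seek UI you typically want phrase-level cues rather than individual words. One simple heuristic, sketched here with plain dicts in place of the SDK's word objects (the `max_gap` threshold is our own choice), is to start a new cue whenever the silence before a word exceeds a gap:

```python
def to_cues(words: list[dict], max_gap: float = 0.6) -> list[dict]:
    """Group word timestamps into phrase-level cues: a new cue starts
    whenever the silence before a word exceeds max_gap seconds."""
    cues: list[dict] = []
    current: dict | None = None
    for w in words:
        # Close the current cue if the pause before this word is long enough.
        if current is not None and w["start"] - current["end"] > max_gap:
            cues.append(current)
            current = None
        if current is None:
            current = {"start": w["start"], "end": w["end"], "text": w["word"]}
        else:
            current["end"] = w["end"]
            current["text"] += " " + w["word"]
    if current is not None:
        cues.append(current)
    return cues
```

Each cue's start is a natural seek target; render the text as a clickable span and set the player's currentTime on click.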

Translation

Use client.audio.translations.create(...) to transcribe and translate to English in one call. Source language is auto-detected.
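A minimal wrapper might look like this (the function name and file path are our own; `client` is assumed configured as elsewhere on this page):

```python
def translate_to_english(client, path: str) -> str:
    """Transcribe non-English speech and translate it to English in one call."""
    with open(path, "rb") as f:
        return client.audio.translations.create(
            model="whisper-1",
            file=f,
        ).text
```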

Round-trip: voice agent

Combining STT → chat → TTS gives you a basic voice agent:
python
def respond(user_audio_path: str) -> bytes:
    user_text = client.audio.transcriptions.create(
        model="whisper-1",
        file=open(user_audio_path, "rb"),
    ).text

    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
    ).choices[0].message.content

    speech = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=reply,
        response_format="opus",
    )
    return speech.read()
End-to-end latency is dominated by the chat call. Realtime (WebSocket bidirectional voice) is on the roadmap.

Costs at a glance

  • TTS: ~$15 per 1M characters on tts-1, ~$30 on tts-1-hd
  • Whisper STT: ~$6 per hour of audio
  • Long meetings (1 h) typically cost less than the chat completion that follows them
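A small estimator makes those numbers concrete. The rates below are copied from the list above; treat them as illustrative and check current pricing before budgeting:

```python
# Rates from the list above -- verify against current pricing before relying on them.
TTS_USD_PER_M_CHARS = {"tts-1": 15.0, "tts-1-hd": 30.0}
WHISPER_USD_PER_HOUR = 6.0

def tts_cost(chars: int, model: str = "tts-1") -> float:
    """Estimated TTS cost in USD for a given character count."""
    return chars / 1_000_000 * TTS_USD_PER_M_CHARS[model]

def stt_cost(seconds: float) -> float:
    """Estimated Whisper transcription cost in USD for an audio duration."""
    return seconds / 3600 * WHISPER_USD_PER_HOUR
```

At these rates a 1,000-character spoken reply costs about 1.5¢ on tts-1, and a one-hour meeting transcribes for about $6.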

Pitfalls

  • Wrong language hint — a correct language= hint beats auto-detect, but a wrong one drops STT accuracy; omit it when unsure.
  • Quiet/clipped recordings — Whisper handles noise well but not clipping. Normalise levels before transcribing.
  • TTS swallowing punctuation — write naturally; “Hi—how are you?” reads better than “Hi how are you”.
  • Long base64 audio over JSON wastes 33% bandwidth vs. multipart. Use multipart unless you have a JSON-only client.
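For the quiet/clipped-recording pitfall, one pre-processing option is ffmpeg's loudnorm filter. This is a sketch; the loudness targets are illustrative defaults, not values the API requires:

```python
import subprocess

def normalize_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg invocation applying EBU R128 loudness normalisation."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # illustrative targets
        dst,
    ]

def normalize(src: str, dst: str) -> None:
    """Normalise levels before sending dst to Whisper."""
    subprocess.run(normalize_cmd(src, dst), check=True)
```

Note that normalisation cannot undo clipping that already happened at recording time; it only evens out levels.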