Streaming gives the user the first token in 100–400 ms instead of waiting for the full response. Use it for any interactive UI.

Turning it on

Set "stream": true. The response is text/event-stream — a sequence of data: {...}\n\n chunks ending in data: [DONE]\n\n.
from openai import OpenAI

client = OpenAI()  # assumes api_key/base_url are already configured for the gateway

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about TCP"}],
    stream=True,
)

for event in stream:
    # Usage/metadata chunks arrive with an empty choices list; guard for them.
    delta = event.choices[0].delta if event.choices else None
    if delta and delta.content:
        print(delta.content, end="", flush=True)

Chunk anatomy

A typical stream looks like this:
data: {"id":"...","choices":[{"delta":{"role":"assistant"}}]}
data: {"id":"...","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"...","choices":[{"delta":{"content":" world"}}]}
data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}
data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":2}}
data: {"choices":[],"usage":{...},"credits_used":1}
data: [DONE]
The order is content chunks → finish chunk → provider usage → gateway usage → [DONE]. Note that the OpenAI SDKs still yield the chunks whose choices list is empty; that's why the Python loop above guards with if event.choices, and why the Node example below can read credits_used straight off the stream.
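
If you're parsing the SSE feed yourself instead of through an SDK, a minimal reader looks like the sketch below; the URL and key are placeholders, and mid-stream error handling is covered later.

import json
import httpx

with httpx.stream(
    "POST",
    "https://your-gateway.example/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_KEY"},  # placeholder key
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Write a haiku about TCP"}],
        "stream": True,
    },
    timeout=None,
) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue  # skip SSE blanks and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if not chunk.get("choices"):
            continue  # usage / credits_used chunks have an empty choices list
        content = chunk["choices"][0].get("delta", {}).get("content")
        if content:
            print(content, end="", flush=True)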

Reading credits_used

The chunk that contains credits_used is always the second-to-last event. If you want per-call cost on the client without polling Usage:
import OpenAI from 'openai';

const client = new OpenAI(); // assumes apiKey/baseURL point at the gateway

const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a haiku about TCP' }],
  stream: true,
});

let creditsUsed = 0;
for await (const event of stream) {
  // The gateway usage chunk is untyped on the SDK's objects; read it via a cast.
  if ((event as any).credits_used !== undefined) {
    creditsUsed = (event as any).credits_used;
    continue;
  }
  process.stdout.write(event.choices[0]?.delta?.content ?? '');
}
console.log(`\nCost: ${creditsUsed} credits`);
The OpenAI Python and Node SDKs don't expose this field on their typed objects. In Node, cast the chunk as above; in Python, read it from the chunk's untyped extra fields (the SDK keeps undocumented response properties readable as attributes).
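
A Python sketch of the same loop, assuming the SDK surfaces the extra field through attribute access:

credits_used = 0
for event in stream:
    # The gateway metadata chunk has no typed field; read it defensively.
    value = getattr(event, "credits_used", None)
    if value is not None:
        credits_used = value
        continue
    if event.choices and event.choices[0].delta.content:
        print(event.choices[0].delta.content, end="", flush=True)
print(f"\nCost: {credits_used} credits")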

Tool calls in a stream

Tool call deltas arrive in the same delta channel:
{"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"loc"}}]}}]}
{"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"ation\":"}}]}}]}
Concatenate function.arguments strings by tool_calls[i].index until you see finish_reason: "tool_calls", then parse the assembled JSON. See Tool calling.
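
A sketch of that assembly with the Python SDK, assuming stream was created with tools and stream=True:

import json

calls: dict[int, dict] = {}  # tool_calls index -> accumulated name/arguments

for event in stream:
    if not event.choices:
        continue
    choice = event.choices[0]
    for tc in choice.delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments
    if choice.finish_reason == "tool_calls":
        for i, slot in calls.items():
            print(i, slot["name"], json.loads(slot["arguments"]))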

Cancellation

Drop the connection. The gateway notices the client is gone, cancels the upstream call within ~250 ms, and only meters tokens already produced. Useful for “stop” buttons in chat UIs.
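
With the Python SDK, break out of the loop and close the stream; its close() method drops the underlying connection (should_stop here is a hypothetical flag set by your UI):

for event in stream:
    if should_stop():  # hypothetical: flipped by the chat UI's stop button
        stream.close()  # drop the connection; the gateway cancels upstream
        break
    if event.choices and event.choices[0].delta.content:
        print(event.choices[0].delta.content, end="", flush=True)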

Errors mid-stream

If the upstream provider errors after we’ve already streamed some tokens, you receive a data: {"error": {...}} chunk before the connection closes. Always handle this branch — it’s rare, but it does happen.
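
In a raw parser, that means checking for the error key before touching choices (a sketch, continuing the reader from Chunk anatomy):

chunk = json.loads(payload)
if "error" in chunk:
    # Partial output has already been printed; surface the failure.
    raise RuntimeError(f"stream error: {chunk['error']}")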

Buffering pitfalls

  • Don’t put a buffering proxy (Cloudflare cache, nginx with default buffering) between client and Infery. SSE needs to flush per chunk.
  • Browser fetch() with await response.text() buffers the whole body and only returns once the stream ends; use response.body.getReader() or the OpenAI SDK.
  • curl needs -N to disable output buffering.