For the actual limits per plan, see Rate limits. This guide is the playbook for handling them in real applications.

The two failure modes

| Code | Meaning | Retry? |
| --- | --- | --- |
| `429 rate_limit_exceeded` | Your key burst above its RPM, or hit the daily token cap | Yes — honour `Retry-After` |
| `502 / 503 / 504` | Upstream provider transient | Yes — exponential backoff |
| `400 / 401 / 403 / 422` | Caller error | No — fix the request |
| `402 insufficient_credits` | Out of balance | No — top up |
Don’t blanket-retry. Retrying caller errors burns RPM for nothing.
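A small helper makes the dispatch explicit. A minimal sketch of the table above; the policy names are illustrative labels, not API constants:
```python
RETRY_AFTER = "retry-after"   # 429: sleep for the server-provided delay
BACKOFF = "backoff"           # 502/503/504: exponential backoff with jitter
NO_RETRY = "no-retry"         # 4xx incl. 402: deterministic, fix the request

def retry_policy(status: int) -> str | None:
    """Map an HTTP status to a retry policy, per the table above."""
    if status == 429:
        return RETRY_AFTER
    if status in (502, 503, 504):
        return BACKOFF
    if 400 <= status < 500:
        return NO_RETRY
    return None  # success: nothing to do
```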

Always honour Retry-After

Every 429 from us includes:
```
HTTP 429 Too Many Requests
Retry-After: 12
```
The number is seconds until your window frees. Use it directly — don’t override with your own delay.
```python
import time

def call_with_429(req):
    """Call req() until it stops returning 429, sleeping as the server asks."""
    while True:
        resp = req()
        if resp.status_code != 429:
            return resp
        # The server says exactly when the window frees; fall back to 5 s
        # only if the header is somehow missing.
        wait = float(resp.headers.get("Retry-After", "5"))
        time.sleep(wait)
```

Exponential backoff with jitter (for 5xx)

For upstream transients without Retry-After:
```python
import random, time

def with_backoff(req, max_attempts=5):
    for attempt in range(max_attempts):
        resp = req()
        if resp.status_code < 500 and resp.status_code != 429:
            return resp  # success, or a non-retryable 4xx
        if resp.status_code == 429:
            # Rate limited: the server-provided delay beats any formula.
            time.sleep(float(resp.headers.get("Retry-After", "5")))
            continue
        # 5xx: exponential base delay plus jitter to de-synchronise clients
        delay = (2 ** attempt) + random.random() * 0.5
        time.sleep(delay)
    return resp  # last attempt's response, even if it failed
```
Jitter (the `random.random()` term) is critical — without it, every client retries at the same instant and you get a stampede.
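If your retry traffic still clusters, the AWS-style "full jitter" variant randomises the entire delay rather than adding a fixed offset. A drop-in sketch for the delay line above (the 30 s cap is an arbitrary choice):
```python
import random

def full_jitter_delay(attempt: int, cap: float = 30.0) -> float:
    """Full jitter: sample uniformly from [0, min(cap, 2**attempt)]."""
    return random.uniform(0, min(cap, 2 ** attempt))
```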

Don’t retry 4xx

400/401/403/422 are deterministic — retrying just wastes RPM and money. Fix the request, then resubmit. Common offenders:
  • Wrong model slug → check GET /v1/models (see the sketch after this list)
  • Missing required parameter → check the endpoint reference
  • Image too large → resize before resending
  • Malformed JSON → fix the producer
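For the first offender, it's cheap to validate a slug against GET /v1/models up front. A sketch assuming the OpenAI-compatible list shape (`{"data": [{"id": ...}]}`) and the requests library:
```python
import requests

def valid_model(slug: str, api_key: str) -> bool:
    """Check a model slug against the gateway's model list."""
    resp = requests.get(
        "https://api.infery.ai/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    resp.raise_for_status()
    return slug in {m["id"] for m in resp.json()["data"]}
```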

SDK-level retries

The OpenAI SDK retries 429s and 5xx automatically:
```python
from openai import OpenAI

client = OpenAI(
    api_key=API_KEY,
    base_url="https://api.infery.ai/v1",
    max_retries=4,           # default is 2
    timeout=60.0,
)
```
This honours Retry-After. For most apps, this is enough — you don’t need a custom loop. But:
  • Retries happen per call. Bursty workloads still need a queue.
  • It applies to streaming too, but only until the first chunk arrives — once the stream has started, a mid-stream failure is not retried.
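One consequence: if you run your own retry loop (or the token bucket below), turn the SDK's retries off so the two layers don't stack. openai-python supports a per-call override:
```python
# Let a custom loop own retries; the SDK makes exactly one attempt.
resp = client.with_options(max_retries=0).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
```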

Stay under the limit on purpose

Reactive retry is the floor. Proactive limiting is the ceiling. Token-bucket on your side, capped at ~80% of the key’s RPM:
```python
import threading, time

class TokenBucket:
    def __init__(self, rate_per_min: int):
        self.capacity = rate_per_min
        self.tokens = float(rate_per_min)
        self.refill_per_sec = rate_per_min / 60.0
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def take(self, n: int = 1):
        # Block until n tokens are available, refilling continuously.
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.refill_per_sec)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
            time.sleep(0.05)

bucket = TokenBucket(rate_per_min=int(120 * 0.8))  # 80% of a 120 rpm key

def safe_call(req):
    bucket.take(1)  # blocks if we're ahead of budget
    return req()
```
Result: zero 429s under steady load. A burst drains the bucket, callers pause, and traffic resumes as tokens refill. Queue any requests that still need to go through.
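The bucket composes with the earlier with_backoff loop: proactive limiting in front, reactive retry behind it for anything that still slips through. A sketch (assumes API_KEY from the SDK example; the endpoint payload is illustrative):
```python
import requests

def guarded(req):
    # Take a token first, then let with_backoff handle 429/5xx stragglers.
    return with_backoff(lambda: safe_call(req))

resp = guarded(lambda: requests.post(
    "https://api.infery.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]},
    timeout=60,
))
```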

Fallback chains: the better answer for production

Per-call retries help, but fallback chains help more. Configure once:
```
Source: gpt-4o
├─ Fallback 1: gpt-4o-mini
└─ Fallback 2: gemini-2-5-flash
```
Now a 429 on gpt-4o is invisible to your code — the gateway routes to gpt-4o-mini, and you see:
```
x-model-used: gpt-4o-mini
x-fallback-from: gpt-4o
x-fallback-depth: 1
```
This shifts retries from your client to the gateway, with a sub-100 ms hop instead of an exponential wait. Combine it with exponential backoff for the case when all fallbacks are rate-limited too.
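If you want to log which model actually served a call, openai-python exposes the raw response headers via with_raw_response. A sketch using the header names above:
```python
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
completion = raw.parse()  # the usual ChatCompletion object
if raw.headers.get("x-fallback-from"):
    print(f"fell back: {raw.headers['x-fallback-from']} -> "
          f"{raw.headers.get('x-model-used')} "
          f"(depth {raw.headers.get('x-fallback-depth')})")
```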

Per-environment keys

Don’t share one key across dev/staging/prod. Reasons:
  • Dev experiments shouldn’t drain prod’s RPM budget
  • Leaked dev keys have lower blast radius if scoped to a low-RPM preset
  • Per-env analytics are clearer
Settings → API Keys → Create. Pick a quota preset per environment.

Daily token caps

Some plans cap total tokens per day in addition to RPM. Hitting the cap returns 429 too — but Retry-After will be the seconds until the cap resets at midnight UTC, which can be hours. Don’t blindly sleep that long — instead:
  • Reduce request volume
  • Switch to a cheaper model
  • Top up to a higher plan
You’ll see `code: "rate_limit_exceeded"` and `message: "Daily token cap reached"` to distinguish these from RPM 429s.
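A sketch for telling the two apart in client code, assuming an OpenAI-style error body carrying the code and message quoted above:
```python
def is_daily_cap(resp) -> bool:
    """True when a 429 is the daily token cap rather than an RPM burst."""
    if resp.status_code != 429:
        return False
    err = resp.json().get("error", {})
    return (err.get("code") == "rate_limit_exceeded"
            and "Daily token cap" in err.get("message", ""))
```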

When backoff fails

If you’ve backed off twice and still get a 429:
  1. Look at Settings → Usage → By key — is one key dominating?
  2. Check Settings → API Keys → preset — is the preset lower than you remember?
  3. Did you ship a loop without rate limiting? Look at request volume in the last hour.
  4. Open a ticket — sometimes it’s our problem and we want to know.