For the actual limits per plan, see Rate limits. This guide is the playbook for handling them in real applications.

The two failure modes

| Code | Meaning | Retry? |
| --- | --- | --- |
| `429 rate_limit_exceeded` | Your key burst above its RPM, or hit the daily token cap | Yes — honour `Retry-After` |
| `502 / 503 / 504` | Upstream provider transient | Yes — exponential backoff |
| `400 / 401 / 403 / 422` | Caller error | No — fix the request |
| `402 insufficient_credits` | Out of balance | No — top up |
Don’t blanket-retry. Retrying caller errors burns RPM for nothing.
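A small helper makes the dispatch explicit. A minimal sketch of the table above; the policy names are illustrative labels, not API constants:
```python
RETRY_AFTER = "retry-after"   # 429: sleep for the server-provided delay
BACKOFF = "backoff"           # 502/503/504: exponential backoff with jitter
NO_RETRY = "no-retry"         # 4xx incl. 402: deterministic, fix the request

def retry_policy(status: int) -> str | None:
    """Map an HTTP status to a retry policy, per the table above."""
    if status == 429:
        return RETRY_AFTER
    if status in (502, 503, 504):
        return BACKOFF
    if 400 <= status < 500:
        return NO_RETRY
    return None  # success: nothing to do
```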

Always honour Retry-After

Every 429 from us includes:
```
HTTP 429 Too Many Requests
Retry-After: 12
```
The number is seconds until your window frees. Use it directly — don’t override with your own delay.
```python
import time

def call_with_429(req):
    """Call req() until it stops returning 429, sleeping as the server asks."""
    while True:
        resp = req()
        if resp.status_code != 429:
            return resp
        # The server says exactly when the window frees; fall back to 5 s
        # only if the header is somehow missing.
        wait = float(resp.headers.get("Retry-After", "5"))
        time.sleep(wait)
```

Exponential backoff with jitter (for 5xx)

For upstream transients without Retry-After:
```python
import random, time

def with_backoff(req, max_attempts=5):
    for attempt in range(max_attempts):
        resp = req()
        if resp.status_code < 500 and resp.status_code != 429:
            return resp  # success, or a non-retryable 4xx
        if resp.status_code == 429:
            # Rate limited: the server-provided delay beats any formula.
            time.sleep(float(resp.headers.get("Retry-After", "5")))
            continue
        # 5xx: exponential base delay plus jitter to de-synchronise clients
        delay = (2 ** attempt) + random.random() * 0.5
        time.sleep(delay)
    return resp  # last attempt's response, even if it failed
```
Jitter (the `random.random()` term) is critical — without it, every client retries at the same instant and you get a stampede.
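If your retry traffic still clusters, the AWS-style "full jitter" variant randomises the entire delay rather than adding a fixed offset. A drop-in sketch for the delay line above (the 30 s cap is an arbitrary choice):
```python
import random

def full_jitter_delay(attempt: int, cap: float = 30.0) -> float:
    """Full jitter: sample uniformly from [0, min(cap, 2**attempt)]."""
    return random.uniform(0, min(cap, 2 ** attempt))
```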

Don’t retry 4xx

400/401/403/422 are deterministic — retrying just wastes RPM and money. Fix the request, then resubmit. Common offenders:
  • Wrong model slug → check GET /v1/models (see the sketch after this list)
  • Missing required parameter → check the endpoint reference
  • Image too large → resize before resending
  • Malformed JSON → fix the producer
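For the first offender, it's cheap to validate a slug against GET /v1/models up front. A sketch assuming the OpenAI-compatible list shape (`{"data": [{"id": ...}]}`) and the requests library:
```python
import requests

def valid_model(slug: str, api_key: str) -> bool:
    """Check a model slug against the gateway's model list."""
    resp = requests.get(
        "https://api.infery.ai/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    resp.raise_for_status()
    return slug in {m["id"] for m in resp.json()["data"]}
```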

SDK-level retries

The OpenAI SDK retries 429s and 5xx automatically:
```python
from openai import OpenAI

client = OpenAI(
    api_key=API_KEY,
    base_url="https://api.infery.ai/v1",
    max_retries=4,           # default is 2
    timeout=60.0,
)
```
This honours Retry-After. For most apps, this is enough — you don’t need a custom loop. But:
  • Retries happen per call. Bursty workloads still need a queue.
  • It applies to streaming too, but only until the first chunk arrives — once the stream has started, a mid-stream failure is not retried.
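One consequence: if you run your own retry loop (or the token bucket below), turn the SDK's retries off so the two layers don't stack. openai-python supports a per-call override:
```python
# Let a custom loop own retries; the SDK makes exactly one attempt.
resp = client.with_options(max_retries=0).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
```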

Stay under the limit on purpose

Reactive retry is the floor. Proactive limiting is the ceiling. Token-bucket on your side, capped at ~80% of the key’s RPM:
```python
import threading, time

class TokenBucket:
    def __init__(self, rate_per_min: int):
        self.capacity = rate_per_min
        self.tokens = float(rate_per_min)
        self.refill_per_sec = rate_per_min / 60.0
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def take(self, n: int = 1):
        # Block until n tokens are available, refilling continuously.
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.refill_per_sec)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
            time.sleep(0.05)

bucket = TokenBucket(rate_per_min=int(120 * 0.8))  # 80% of a 120 rpm key

def safe_call(req):
    bucket.take(1)  # blocks if we're ahead of budget
    return req()
```
Result: zero 429s under steady load. A burst drains the bucket, callers pause, and traffic resumes as tokens refill. Queue any requests that still need to go through.
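The bucket composes with the earlier with_backoff loop: proactive limiting in front, reactive retry behind it for anything that still slips through. A sketch (assumes API_KEY from the SDK example; the endpoint payload is illustrative):
```python
import requests

def guarded(req):
    # Take a token first, then let with_backoff handle 429/5xx stragglers.
    return with_backoff(lambda: safe_call(req))

resp = guarded(lambda: requests.post(
    "https://api.infery.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]},
    timeout=60,
))
```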

Fallback chains: the better answer for production

Per-call retries help, but fallback chains help more. Configure once:
```
Source: gpt-4o
├─ Fallback 1: gpt-4o-mini
└─ Fallback 2: gemini-2-5-flash
```
Now a 429 on gpt-4o is invisible to your code — the gateway routes to gpt-4o-mini, and you see:
```
x-model-used: gpt-4o-mini
x-fallback-from: gpt-4o
x-fallback-depth: 1
```
This shifts retries from your client to the gateway, with a sub-100 ms hop instead of an exponential wait. Combine it with exponential backoff for the case when all fallbacks are rate-limited too.
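If you want to log which model actually served a call, openai-python exposes the raw response headers via with_raw_response. A sketch using the header names above:
```python
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
completion = raw.parse()  # the usual ChatCompletion object
if raw.headers.get("x-fallback-from"):
    print(f"fell back: {raw.headers['x-fallback-from']} -> "
          f"{raw.headers.get('x-model-used')} "
          f"(depth {raw.headers.get('x-fallback-depth')})")
```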

Per-environment keys

Don’t share one key across dev/staging/prod. Reasons:
  • Dev experiments shouldn’t drain prod’s RPM budget
  • Leaked dev keys have lower blast radius if scoped to a low-RPM preset
  • Per-env analytics are clearer
Settings → API Keys → Create. Pick a quota preset per environment.

Daily token caps

Some plans cap total tokens per day in addition to RPM. Hitting the cap returns 429 too — but Retry-After will be the seconds until the cap resets at midnight UTC, which can be hours. Don’t blindly sleep that long — instead:
  • Reduce request volume
  • Switch to a cheaper model
  • Top up to a higher plan
You’ll see `code: "rate_limit_exceeded"` and `message: "Daily token cap reached"` to distinguish these from RPM 429s.
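A sketch for telling the two apart in client code, assuming an OpenAI-style error body carrying the code and message quoted above:
```python
def is_daily_cap(resp) -> bool:
    """True when a 429 is the daily token cap rather than an RPM burst."""
    if resp.status_code != 429:
        return False
    err = resp.json().get("error", {})
    return (err.get("code") == "rate_limit_exceeded"
            and "Daily token cap" in err.get("message", ""))
```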

When backoff fails

If you’ve backed off twice and still get a 429:
  1. Look at Settings → Usage → By key — is one key dominating?
  2. Check Settings → API Keys → preset — is the preset lower than you remember?
  3. Did you ship a loop without rate limiting? Look at request volume in the last hour.
  4. Open a ticket — sometimes it’s our problem and we want to know.