Performance
What is the gateway actually faster than? Real numbers from a production benchmark, with the methodology so you can verify it yourself.
Last measured: 2026-05-17 · Gateway v0.1.7 · Istanbul vantage point · OpenAI gpt-4o-mini
Head-to-head latency
Same model, same prompt, same vantage point. 7-9 samples per scenario after dropping the TLS warmup. First request through the gateway warms the cache; subsequent identical requests HIT.
| Scenario | min | p50 | p95 | Upstream cost |
|---|---|---|---|---|
| Direct OpenAI (no gateway) | 720ms | 810ms | 874ms | full token $ |
| Gateway cache MISS | 833ms | 1077ms | 1387ms | full token $ |
| Gateway cache HIT | 196ms | 231ms | 274ms | $0 |
Gateway HIT vs Direct OpenAI: −579ms p50 (−71.5%) and 100% upstream cost saved per request.
Gateway MISS vs Direct OpenAI: +267ms p50 (+33%) — the cache-miss path pays a network detour cost (client → CF edge → OpenAI) that direct integration avoids. This is the price of admission for observability and caching.
When the gateway wins
The numbers above are for a single request in isolation. In a real application with repeat queries, the cache-hit rate determines the overall picture.
| Cache hit rate | Avg latency | Upstream $ saved |
|---|---|---|
| 0% (no repeat queries) | ~1077ms | 0% |
| 30% | ~823ms | 30% |
| 50% | ~654ms | 50% |
| 70% (typical RAG / FAQ) | ~485ms | 70% |
| 90% (system-prompt heavy) | ~316ms | 90% |
Break-even point: roughly 25% cache hit rate, at which avg gateway latency matches direct OpenAI. Above that, the gateway wins on both axes (latency + cost).
Methodology
We publish the methodology so you can reproduce the numbers — no hand-waving.
Test client
Python http.client.HTTPSConnection with HTTP/2 keep-alive. All samples in a single scenario share one TLS session, removing handshake noise. Wall-clock timing via time.perf_counter().
Vantage point
Istanbul, Turkey, residential ISP. Worker invocations served by the Istanbul CF colo (IST), verified via cf-ray header.
Request shape
gpt-4o-mini · temperature: 0 · max_tokens: 20. For HIT tests, an identical short prompt — first request stores the response, subsequent identical requests HIT. For MISS tests, a unique nonce per request forces a cache miss.
Sample size
7-9 timed samples per scenario after dropping the TLS warmup. Free tier rate limit caps a single burst at ~10/minute (11 effective with stale-tolerance).
Reproduce it yourself
The benchmark script is in the public repo: scripts/bench-gateway-vs-direct.sh. Every gateway response carries X-Cache, X-Pipeline-Overhead, X-Auth-Cache, and X-Usage-Cache headers — so a single curl -i shows whether your request was warm or cold. No marketing claim on this page is unverifiable from those headers.
Honest caveats
Geography matters. A US-based caller will see different direct-OpenAI baseline numbers and a smaller MISS overhead — the CF edge sits closer to OpenAI’s east-coast presence, shrinking the detour cost.
Cold caches pay more. The first request to a fresh API key (cold auth cache) and the first request for a new prompt (cold response cache) each pay a one-time setup cost. Subsequent requests within the 60-second TTL window are warm.
Workload-dependent. Cache savings track how often your application sends similar prompts. Stateless one-off questions don’t cache; FAQs, system-prompt patterns, and RAG retrieval cache heavily.
Non-streaming only. These numbers cover non-streaming chat/completions. Streaming has a different cost profile we’ll publish separately.
Try it on your workload
Swap your provider base URL for openai.tensor.cx (or any supported provider) and the gateway is in your hot path. No SDK changes.