Performance

What is the gateway actually faster than? Real numbers from a production benchmark, with the methodology so you can verify it yourself.

Last measured: 2026-05-17 · Gateway v0.1.7 · Istanbul vantage point · OpenAI gpt-4o-mini

3.5×

faster on cache HIT vs direct OpenAI

upstream token cost on cache HIT (vs full $$ direct)

~5ms

gateway overhead on warm requests (visible in X-Pipeline-Overhead header)

Head-to-head latency

Same model, same prompt, same vantage point. 7-9 samples per scenario after dropping the TLS warmup. First request through the gateway warms the cache; subsequent identical requests HIT.

Scenario	min	p50	p95	Upstream cost
Direct OpenAI (no gateway)	720ms	810ms	874ms	full token $
Gateway cache MISS	833ms	1077ms	1387ms	full token $
Gateway cache HIT	196ms	231ms	274ms	$0

Gateway HIT vs Direct OpenAI: −579ms p50 (−71.5%) and 100% upstream cost saved per request.

Gateway MISS vs Direct OpenAI: +267ms p50 (+33%) — the cache-miss path pays a network detour cost (client → CF edge → OpenAI) that direct integration avoids. This is the price of admission for observability and caching.

When the gateway wins

The numbers above are for a single request in isolation. In a real application with repeat queries, the cache-hit rate determines the overall picture.

Cache hit rate	Avg latency	Upstream $ saved
0% (no repeat queries)	~1077ms	0%
30%	~823ms	30%
50%	~654ms	50%
70% (typical RAG / FAQ)	~485ms	70%
90% (system-prompt heavy)	~316ms	90%

Break-even point: roughly 25% cache hit rate, at which avg gateway latency matches direct OpenAI. Above that, the gateway wins on both axes (latency + cost).

Methodology

We publish the methodology so you can reproduce the numbers — no hand-waving.

Test client

Python http.client.HTTPSConnection with HTTP/2 keep-alive. All samples in a single scenario share one TLS session, removing handshake noise. Wall-clock timing via time.perf_counter().

Vantage point

Istanbul, Turkey, residential ISP. Worker invocations served by the Istanbul CF colo (IST), verified via cf-ray header.

Request shape

gpt-4o-mini · temperature: 0 · max_tokens: 20. For HIT tests, an identical short prompt — first request stores the response, subsequent identical requests HIT. For MISS tests, a unique nonce per request forces a cache miss.

Sample size

7-9 timed samples per scenario after dropping the TLS warmup. Free tier rate limit caps a single burst at ~10/minute (11 effective with stale-tolerance).

Reproduce it yourself

The benchmark script is in the public repo: scripts/bench-gateway-vs-direct.sh. Every gateway response carries X-Cache, X-Pipeline-Overhead, X-Auth-Cache, and X-Usage-Cache headers — so a single curl -i shows whether your request was warm or cold. No marketing claim on this page is unverifiable from those headers.

Honest caveats

Geography matters. A US-based caller will see different direct-OpenAI baseline numbers and a smaller MISS overhead — the CF edge sits closer to OpenAI’s east-coast presence, shrinking the detour cost.

Cold caches pay more. The first request to a fresh API key (cold auth cache) and the first request for a new prompt (cold response cache) each pay a one-time setup cost. Subsequent requests within the 60-second TTL window are warm.

Workload-dependent. Cache savings track how often your application sends similar prompts. Stateless one-off questions don’t cache; FAQs, system-prompt patterns, and RAG retrieval cache heavily.

Non-streaming only. These numbers cover non-streaming chat/completions. Streaming has a different cost profile we’ll publish separately.

Try it on your workload

Swap your provider base URL for openai.tensor.cx (or any supported provider) and the gateway is in your hot path. No SDK changes.

Read the docs Compare to alternatives