Company surface

Reproducible AI infrastructure for agents and models.

Tensor Cortex builds deterministic evaluation, sandboxed execution, training infrastructure, and private coding-agent release gates for teams that need evidence before they trust an AI system.

Public: infrastructure, EvalOps tooling, run reports
Private: model weights, scaling strategy, data mixture

Private release gate

Agent evaluation run

Deterministic
01 Repo tasks: private suite
02 Sandbox: network off
03 Verifier: hidden tests
04 Report: content hash

45 seed coding tasks
1,100+ CPU tests green
0 cluster runs claimed

Infrastructure

A single-code-path stack for model definition, training, evals, decontamination, conversion, lineage, and reproducible run artifacts.

Private EvalOps

Private coding-agent evaluation on a customer's own repo: hidden tests, sandboxed execution, repeatable scoring, and regression gates.

Run Reports

Provider-friendly evidence packages with config, git SHA, eval hashes, goodput, MFU, checkpoint behavior, and failure notes.
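
As a rough illustration of two of the metrics named above, the sketch below computes MFU (model FLOPs utilization) and goodput. It assumes the common approximation of about 6 FLOPs per parameter per training token, which ignores attention FLOPs, and it treats goodput as the fraction of wall-clock time spent on productive training steps. Both definitions, the function names, and all input values are assumptions for illustration, not Tensor Cortex's reporting code; the peak FLOP/s figure is a placeholder, not a hardware spec.

```python
def mfu(tokens_per_second: float, n_params: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved training FLOP/s over peak FLOP/s.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    a common approximation that ignores attention FLOPs.
    """
    achieved_flops = 6.0 * n_params * tokens_per_second
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


def goodput(productive_seconds: float, wall_clock_seconds: float) -> float:
    """Fraction of wall-clock time spent on steps that advanced training.

    Assumed definition: time lost to restarts, checkpoint reloads, and
    stalls counts as non-productive.
    """
    return productive_seconds / wall_clock_seconds


# Placeholder inputs: 1B parameters, 8 GPUs, 1e15 peak FLOP/s per GPU.
example_mfu = mfu(tokens_per_second=400_000, n_params=1e9,
                  n_gpus=8, peak_flops_per_gpu=1e15)
example_goodput = goodput(productive_seconds=3_240, wall_clock_seconds=3_600)
print(f"MFU ~ {example_mfu:.2f}, goodput ~ {example_goodput:.2f}")
# prints: MFU ~ 0.30, goodput ~ 0.90
```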

Infrastructure program

One measurement system, from laptop smoke test to cluster run.

The public surface is deliberately narrow: reproducible infrastructure, private EvalOps tooling, and evidence reports. The research set, model details, and data-mixture decisions stay private.

Deterministic evals: content-addressed runs and repeatable task specs
Sandboxed execution: bounded code runs for evals and coding agents
Train/serve parity: conversion tests before model-code changes ship
Gate-driven compute: large runs only when they produce evidence

First product

Private SWE-bench for your repo.

Most teams choose coding agents using public benchmarks, demos, or vibes. Private EvalOps answers the sharper question: which agent setup actually works on your codebase?

Build private tasks

Bug fixes, multi-file changes, refactors, and feature tasks derived from real repo work.

Run agents fairly

The same tasks across Codex, Claude Code, Cursor-style agents, open-weight models, or internal loops.

Grade objectively

Hidden tests and verifiers decide pass/fail. The agent's self-report is never the score.
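
The grading rule above can be sketched in a few lines. This is a minimal illustration, not the product's verifier: the `AgentRun` record, its field names, and the task ids are all hypothetical. The point it shows is that the agent's claim is recorded but never read when computing the score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One agent attempt on one task (hypothetical record shape)."""
    task_id: str
    self_reported_success: bool  # what the agent claims; recorded, never scored
    hidden_test_results: dict    # hidden test name -> True if it passed


def grade(run: AgentRun) -> bool:
    """Pass only if at least one hidden test ran and every one passed."""
    results = run.hidden_test_results
    return bool(results) and all(results.values())


runs = [
    AgentRun("fix-null-deref", True, {"test_regression": True, "test_edge": True}),
    AgentRun("refactor-parser", True, {"test_api": True, "test_roundtrip": False}),
    AgentRun("add-flag", False, {"test_flag": True}),
]
scores = {run.task_id: grade(run) for run in runs}
print(scores)
# prints: {'fix-null-deref': True, 'refactor-parser': False, 'add-flag': True}
```

In this sketch the second run claims success but fails a hidden test, so it scores as a fail; the third run claims failure but passes, so it scores as a pass.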

Catch regressions

Re-run the suite whenever prompts, tools, models, or guardrails change.
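
One way to picture a regression gate is as a comparison between two runs of the same suite, before and after a change. The sketch below assumes a simple policy where the gate fails if any task that passed in the baseline fails in the candidate; the function name, the report keys, and the policy itself are assumptions for illustration.

```python
def pass_rate(results: dict) -> float:
    """Share of tasks that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


def regression_gate(baseline: dict, candidate: dict) -> dict:
    """Compare two runs of the same suite (task id -> passed).

    Assumed policy: the gate fails if any previously passing task now fails.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same task ids")
    newly_failing = sorted(t for t in baseline if baseline[t] and not candidate[t])
    newly_passing = sorted(t for t in baseline if not baseline[t] and candidate[t])
    return {
        "ok": not newly_failing,
        "newly_failing": newly_failing,
        "newly_passing": newly_passing,
        "baseline_pass_rate": pass_rate(baseline),
        "candidate_pass_rate": pass_rate(candidate),
    }


# Hypothetical results before and after a prompt change.
baseline = {"a": True, "b": True, "c": False, "d": True}
candidate = {"a": True, "b": False, "c": True, "d": True}
report = regression_gate(baseline, candidate)
print(report["ok"], report["newly_failing"], report["newly_passing"])
# prints: False ['b'] ['c']
```

Note that the overall pass rate is unchanged in this example, yet the gate still fails: comparing per-task outcomes catches a regression that an aggregate score would hide.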

Pilot package

Small enough to start, concrete enough to matter.

10-30 private repo tasks
$1k-$3k: first pilot
$5k-$15k: deeper custom suite
$2k-$10k/mo: monthly regression gate

Compute partners

Bounded compute asks with public evidence in return.

Tensor Cortex is looking for practical compute partnerships: start with a GPU smoke test, then a bounded bootstrap run, and publish the evidence package instead of making unsupported claims.

Ask 1: 50 H100 GPUh, stack smoke and environment report
Ask 2: 500 H100 GPUh, G0 bootstrap evidence package
Ask 3: 3k+ GPUh, scaling-law pilot after proof

FAQ

Questions teams ask before a pilot.

What is private coding-agent EvalOps?

Private EvalOps is a private SWE-bench for your own repository. Tensor Cortex builds evaluation tasks from real work in your codebase, runs coding agents against them in a sandbox, and grades pass/fail with hidden tests and verifiers. You get evidence, not vibes, about which agent setup actually works on your code.

How is this different from public benchmarks like SWE-bench?

Public benchmarks measure performance on open repositories that models may have seen during training. Private EvalOps measures performance on your codebase, with your conventions, your test suite, and tasks that reflect the work your team actually does. That is the question that decides procurement.

Which coding agents and models can you evaluate?

The same private task suite can be run across Codex, Claude Code, Cursor-style agents, open-weight models, and internal agent loops. Because every agent runs the identical tasks under the same sandbox and verifiers, the comparison is fair and repeatable.

How do you keep evaluation objective?

Hidden tests and verifiers decide pass or fail; the agent's own self-report is never the score. Runs are sandboxed with network access off and content-addressed, so results are reproducible rather than one-off demos.
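
To make "content-addressed" concrete: a run can be identified by a hash of its full specification, so two runs with the same inputs share an identifier and any change to the inputs produces a new one. The sketch below is one possible approach using canonical JSON and SHA-256 from the Python standard library; the spec fields and the `<commit>` placeholder are hypothetical, and the actual hashing scheme is not described in this page.

```python
import hashlib
import json


def content_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the run spec.

    Sorting keys and fixing separators means that key order and
    whitespace cannot change the resulting identifier.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical run specs: same content, different key order.
spec_a = {"task_id": "fix-null-deref", "git_sha": "<commit>", "seed": 0, "network": "off"}
spec_b = {"network": "off", "seed": 0, "git_sha": "<commit>", "task_id": "fix-null-deref"}

same_id = content_hash(spec_a) == content_hash(spec_b)
changed_id = content_hash({**spec_a, "seed": 1}) != content_hash(spec_a)
print(same_id, changed_id)
# prints: True True
```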

How much does a pilot cost?

A first pilot is typically 10–30 private repo tasks for $1k–$3k. A deeper custom suite runs $5k–$15k, and an ongoing monthly regression gate is $2k–$10k per month.

Is my source code kept private?

Yes. Customer repositories stay private, as do model weights, scaling strategy, and data mixture. The public surface is limited to reproducible infrastructure, EvalOps tooling, and evidence reports.

What do I get at the end of a pilot?

A repeatable task suite plus an evidence package: pass/fail scores per agent, configuration, git SHA, eval hashes, and failure notes. That is enough to make a confident decision and to catch regressions whenever prompts, tools, models, or guardrails change.