Infrastructure
A single-code-path stack for model definition, training, evals, decontamination, conversion, lineage, and reproducible run artifacts.
Tensor Cortex builds deterministic evaluation, sandboxed execution, training infrastructure, and private coding-agent release gates for teams that need evidence before they trust an AI system.
Private release gate
Private coding-agent evaluation on a customer's own repo: hidden tests, sandboxed execution, repeatable scoring, and regression gates.
Provider-friendly evidence packages with config, git SHA, eval hashes, goodput, MFU, checkpoint behavior, and failure notes.
The public surface is deliberately narrow: reproducible infrastructure, private EvalOps tooling, and evidence reports. The private research set, model details, and data-mixture decisions stay private.
Most teams choose coding agents using public benchmarks, demos, or vibes. Private EvalOps answers the sharper question: which agent setup actually works on your codebase?
Bug fixes, multi-file changes, refactors, and feature tasks derived from real repo work.
The same tasks across Codex, Claude Code, Cursor-style agents, open-weight models, or internal loops.
Hidden tests and verifiers decide pass/fail. The agent's self-report is never the score.
Re-run the suite whenever prompts, tools, models, or guardrails change.
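As a rough illustration of the regression-gate idea, the sketch below compares a candidate run against a baseline run of the same suite. It is not Tensor Cortex's implementation: the function names, the `max_drop` tolerance, and the task-id-to-boolean result format are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of tasks whose hidden tests passed."""
    if not results:
        raise ValueError("no task results")
    return sum(results.values()) / len(results)


def regression_gate(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    max_drop: float = 0.0,  # hypothetical tolerance on pass-rate drop
) -> Tuple[bool, List[str]]:
    """Return (gate_passed, tasks that passed before but fail now)."""
    if baseline.keys() != candidate.keys():
        # Different task sets would make the two scores incomparable.
        raise ValueError("task sets differ; scores are not comparable")
    regressed = sorted(t for t in baseline if baseline[t] and not candidate[t])
    drop = pass_rate(baseline) - pass_rate(candidate)
    return drop <= max_drop, regressed
```

The gate reports the individual regressed tasks alongside the overall verdict, because an unchanged pass rate can still hide a task that used to pass and now fails.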
Tensor Cortex is looking for practical compute partnerships: start with a GPU smoke test, follow with a bounded bootstrap run, and publish the evidence package rather than make unsupported claims.
Private EvalOps is a private SWE-bench for your own repository. Tensor Cortex builds evaluation tasks from real work in your codebase, runs coding agents against them in a sandbox, and grades pass/fail with hidden tests and verifiers. You get evidence, not vibes, about which agent setup actually works on your code.
Public benchmarks measure performance on open repositories that models may have seen during training. Private EvalOps measures performance on your codebase, with your conventions, your test suite, and tasks that reflect the work your team actually does. That is the question that decides procurement.
The same private task suite can be run across Codex, Claude Code, Cursor-style agents, open-weight models, and internal agent loops. Because every agent runs the identical tasks under the same sandbox and verifiers, the comparison is fair and repeatable.
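One way to picture the "identical tasks, same verifier" comparison is the loop below. It is a simplified sketch under stated assumptions: the `Agent` and `Verifier` callable interfaces and the `compare_agents` name are hypothetical, and sandboxing is left out entirely.

```python
from typing import Callable, Dict

# Hypothetical interfaces for illustration only.
Agent = Callable[[str], str]            # task prompt -> proposed patch
Verifier = Callable[[str, str], bool]   # (task id, patch) -> pass/fail


def compare_agents(
    tasks: Dict[str, str],
    agents: Dict[str, Agent],
    verify: Verifier,
) -> Dict[str, Dict[str, bool]]:
    """Run every agent on every task and score each with the same verifier."""
    scores: Dict[str, Dict[str, bool]] = {}
    for name, agent in agents.items():
        # Each agent sees the same prompts and is judged by the same
        # verifier, so the resulting columns are directly comparable.
        scores[name] = {
            task_id: verify(task_id, agent(prompt))
            for task_id, prompt in tasks.items()
        }
    return scores
```

Keeping the verifier outside the agent's control is what makes the table fair: no agent can influence how its own output is judged.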
Hidden tests and verifiers decide pass or fail; the agent's own self-report is never the score. Runs are sandboxed with network access off, and they are content-addressed, so results are reproducible rather than one-off demos.
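To make "content-addressed" and "self-report is never the score" concrete, here is a minimal sketch. It assumes a run can be described by a JSON-serializable record and that the verifier signals success with exit code 0; both the record shape and the function names are illustrative, not the product's actual format.

```python
import hashlib
import json


def content_address(record: dict) -> str:
    """Derive a stable ID from a run record by hashing its canonical JSON."""
    # Sorted keys and fixed separators make the serialization deterministic,
    # so the same inputs always yield the same address.
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def score_task(verifier_exit_code: int, agent_self_report: str) -> bool:
    """Pass/fail comes only from the verifier (assumed: exit code 0 = pass)."""
    # agent_self_report is accepted for logging purposes but deliberately
    # has no influence on the score.
    return verifier_exit_code == 0
```

Two runs with the same configuration, commit, and task definitions hash to the same address, which makes it straightforward to tell whether a result was reproduced or merely repeated under different conditions.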
A first pilot is typically 10–30 private repo tasks for $1k–$3k. A deeper custom suite runs $5k–$15k, and an ongoing regression gate runs $2k–$10k per month.
Yes. Model weights, scaling strategy, data mixture, and customer repositories stay private. The public surface is limited to reproducible infrastructure, EvalOps tooling, and evidence reports.
A repeatable task suite plus an evidence package: pass/fail scores per agent, configuration, git SHA, eval hashes, and failure notes. That is enough to make a confident decision and to catch regressions whenever prompts, tools, models, or guardrails change.
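The evidence package could be modeled roughly as below. The field names mirror the items listed above, but the class itself, its layout, and the JSON output are assumptions for illustration rather than the delivered format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EvidencePackage:
    """Hypothetical container for the artifacts of one evaluation run."""
    git_sha: str
    config: Dict[str, str]
    eval_hashes: Dict[str, str]           # task id -> hash of the task bundle
    scores: Dict[str, Dict[str, bool]]    # agent -> task id -> pass/fail
    failure_notes: List[str] = field(default_factory=list)

    def pass_rates(self) -> Dict[str, float]:
        """Per-agent fraction of tasks passed (agents with no tasks skipped)."""
        return {
            agent: sum(results.values()) / len(results)
            for agent, results in self.scores.items()
            if results
        }

    def to_json(self) -> str:
        """Serialize with sorted keys so diffs between runs stay readable."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

Storing raw per-task results and deriving pass rates from them keeps the package auditable: a reader can recompute every headline number from the data it contains.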