Private repo eval

Private SWE-bench for your repo.

Tensor Cortex helps engineering teams evaluate coding agents on their own codebase — with hidden tests, sandboxed runs, repeatable scoring, and regression gates — so you know which agent to trust, and where not to.

Yours: an evidence package with per-agent pass/fail, diffs, and a failure taxonomy
Private: your source code and hidden tests stay on your machine

Private release gate

Agent evaluation run

deterministic
01 Repo tasks: private suite
02 Agent: sandboxed run
03 Verifier: hidden tests
04 Report: content hash
Hidden tests decide pass / fail
Fair: same tasks across every agent
Repeatable: re-run on every change

Your repo, real tasks

Evaluation tasks built from real work in your codebase — bug fixes, multi-file changes, refactors, and features — not generic public benchmarks.


Private EvalOps

Private coding-agent evaluation on your own repo: hidden tests, sandboxed execution, repeatable scoring, and regression gates.


Evidence you can re-run

Every verdict is backed by the task, base commit, diff, verifier output, and a failure taxonomy — reproducible artifacts, not a one-off demo.

Why the verdict holds

Evidence, not vibes — and never the agent's word for it.

The agent's self-report is never the score. Hidden tests and verifiers decide pass or fail, runs are isolated and content-addressed, and your code never leaves your machine.

Hidden verifiers: the agent can't study for the tests that grade it
Deterministic scoring: content-addressed runs and repeatable task specs
Sandboxed execution: bounded, isolated code runs per task
Local-first: source and hidden tests stay in your environment
First product

Private SWE-bench for your repo.

Most teams choose coding agents using public benchmarks, demos, or vibes. Private EvalOps answers the sharper question: which agent setup actually works on your codebase?

Build private tasks

Bug fixes, multi-file changes, refactors, and feature tasks derived from real repo work.

Run agents fairly

The same tasks across Codex, Claude Code, Cursor-style agents, open-weight models, or internal loops.

Grade objectively

Hidden tests and verifiers decide pass/fail. The agent's self-report is never the score.

Catch regressions

Re-run the suite whenever prompts, tools, models, or guardrails change.

Pilot package

Small enough to start, concrete enough to matter.

10–20 private repo tasks
$3k–$5k first pilot
$7.5k–$15k deeper custom suite
$2k–$10k/mo monthly regression gate
FAQ

Questions teams ask before a pilot.

What is private coding-agent EvalOps?

Private EvalOps is a private SWE-bench for your own repository. Tensor Cortex builds evaluation tasks from real work in your codebase, runs coding agents against them in a sandbox, and grades pass/fail with hidden tests and verifiers so you get evidence instead of vibes about which agent setup actually works on your code.

How is this different from public benchmarks like SWE-bench?

Public benchmarks measure performance on open repositories that models may have seen during training. Private EvalOps measures performance on your codebase, with your conventions, your test suite, and tasks that reflect the work your team actually does. That is the result that should decide procurement.

Which coding agents and models can you evaluate?

The same private task suite can be run across Codex, Claude Code, Cursor-style agents, open-weight models, and internal agent loops. Because every agent runs the identical tasks under the same sandbox and verifiers, the comparison is fair and repeatable.

How do you keep evaluation objective?

Hidden tests and verifiers decide pass or fail — the agent's own self-report is never the score. Runs are sandboxed and content-addressed, and verification can run network-off, so results are reproducible rather than one-off demos.
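
As a rough illustration of what "content-addressed" can mean here, the sketch below hashes a run record so that identical inputs always produce the same identifier. The field names and the function `run_content_hash` are hypothetical and chosen for illustration only; they are not Tensor Cortex's actual schema or API.

```python
import hashlib
import json


def run_content_hash(task_id: str, base_commit: str, diff: str, verifier_output: str) -> str:
    """Return a SHA-256 hex digest over a canonical encoding of one run record.

    All field names are illustrative assumptions, not a real product schema.
    """
    record = {
        "task_id": task_id,
        "base_commit": base_commit,
        "diff": diff,
        "verifier_output": verifier_output,
    }
    # Sorted keys and fixed separators give a stable byte encoding, so the
    # same record always hashes to the same value.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


if __name__ == "__main__":
    # Hypothetical values for demonstration.
    print(run_content_hash("task-001", "abc123", "--- a/x.py\n+++ b/x.py\n", "3 passed"))
```

Because the hash depends only on the record's contents, re-running the same task at the same commit with the same diff and verifier output yields the same identifier, and any change to one of those inputs yields a different one.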

How much does a pilot cost?

A first pilot is a fixed-scope evaluation engagement, not a subscription — typically 10–20 private repo tasks across 2–3 agents for $3k–$5k. A deeper custom suite runs $7.5k–$15k, and an ongoing monthly regression gate is $2k–$10k per month.

Is my source code kept private?

Yes. Your source code and hidden tests stay on your machine — runs are local-first and nothing is uploaded by default. You receive the evidence package: per-agent pass/fail, diffs, verifier output, and a failure taxonomy.

What do I get at the end of a pilot?

A repeatable task suite plus an evidence package: pass/fail scores per agent, configuration, git SHA, eval hashes, and failure notes — enough to make a confident decision and to catch regressions whenever prompts, tools, models, or guardrails change.
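
One way to picture a regression gate over such a suite is a comparison of per-task pass/fail results between a baseline run and a run made after a change. The sketch below is a minimal version under that assumption; the function names and task IDs are hypothetical, not part of any delivered tooling.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return task IDs that passed in the baseline but do not pass in the candidate.

    A task missing from the candidate run is treated as a failure.
    """
    return sorted(
        task_id
        for task_id, passed in baseline.items()
        if passed and not candidate.get(task_id, False)
    )


def gate_passes(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """The gate passes only when no previously passing task regressed."""
    return not find_regressions(baseline, candidate)


if __name__ == "__main__":
    # Hypothetical results before and after a prompt or model change.
    before = {"bugfix-01": True, "refactor-02": True, "feature-03": False}
    after = {"bugfix-01": True, "refactor-02": False, "feature-03": True}
    print(find_regressions(before, after))
    print(gate_passes(before, after))
```

Treating a missing task as a failure is a deliberate choice in this sketch: a run that silently drops a task should not be able to pass the gate.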