Most coding agents are a single model doing everything at once — exploring the repo, deciding on an approach, writing the code, and checking its own work. That works until the task has a hidden correct procedure the model can't see, and it confidently ships something that looks right and fails the grader.
We ran an experiment with a different shape: split the agent into two roles. A strong orchestrator explores, reasons, and writes a precise plan. A cheaper but capable executor does the hands-on implementation. Here's what that division of labor actually buys.
[Figure: the orchestrator/executor loop. A strong model plans and verifies; a cheaper, swappable model does the hands-on work. Control flows down on delegation and back up on verification.
Orchestrator steps: read the env, check versions, disassemble binaries, align data (read · grep · run); lock the plan and reject the naive path (spec: algorithm + constraints); hand off an implementable brief with hard constraints; review the artifact against the grader (L2 compare · od -c · tests).
Executor steps: implement the brief and hand back status + test output (edit · write · run · test); fix and re-test on a failed verification.]
The split: a planner and a worker
The architecture is an orchestrator agent that delegates to a stateful executor subagent. The orchestrator runs four jobs the solo model tends to rush:
Explore. The orchestrator (Opus 4.8) reads the environment, disassembles binaries, checks library versions, and aligns the data before committing to anything.
Lock the plan. It commits to a specific algorithm or procedure — often rejecting the naive approach. On one cryptanalysis task it derives a 2²⁰+2²⁰ meet-in-the-middle attack instead of a 2⁴⁰ brute force.
Delegate a precise brief. It hands the executor an implementable spec with hard constraints: do NOT redesign the algorithm; implement what is specified, then verify.
Review against the grader. On handback, the orchestrator verifies the artifact — an L2 image compare, od -c on the output file — and runs the tests before declaring the task done. If verification fails, the executor rectifies and the loop repeats.
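To make the "lock the plan" example concrete, here is a minimal sketch of a meet-in-the-middle attack on a toy double cipher. Everything in it is an assumption for illustration: the cipher (`enc`/`dec`), the 8-bit key size, and the 16-bit block are invented and are not the cipher from the task. The point is only the shape of the saving: two passes over a single key space plus a lookup table, instead of one pass over the product of both key spaces, which is the same trade as 2²⁰+2²⁰ versus 2⁴⁰.

```python
from collections import defaultdict

# Toy parameters (hypothetical): 16-bit blocks, two independent 8-bit keys.
MASK = 0xFFFF
KEY_BITS = 8
MUL = 0x9E37                      # odd, so it is invertible modulo 2**16
MUL_INV = pow(MUL, -1, 1 << 16)   # modular inverse (Python 3.8+)


def enc(key: int, block: int) -> int:
    """Toy invertible round: xor the key, multiply, add the key."""
    return ((block ^ key) * MUL + key) & MASK


def dec(key: int, block: int) -> int:
    """Inverse of enc for the same key."""
    return (((block - key) * MUL_INV) & MASK) ^ key


def double_enc(k1: int, k2: int, block: int) -> int:
    return enc(k2, enc(k1, block))


def meet_in_the_middle(pairs):
    """Recover (k1, k2) candidates from known (plaintext, ciphertext) pairs.

    Work is about 2**8 encryptions plus 2**8 decryptions,
    instead of 2**16 double encryptions for brute force.
    """
    p0, c0 = pairs[0]

    # Forward half: every possible middle value reachable from the plaintext.
    forward = defaultdict(list)
    for k1 in range(1 << KEY_BITS):
        forward[enc(k1, p0)].append(k1)

    # Backward half: decrypt the ciphertext and look for a matching middle value.
    candidates = []
    for k2 in range(1 << KEY_BITS):
        for k1 in forward.get(dec(k2, c0), []):
            # Filter chance collisions with the remaining known pairs.
            if all(double_enc(k1, k2, p) == c for p, c in pairs[1:]):
                candidates.append((k1, k2))
    return candidates
```

Called with a few known plaintext/ciphertext pairs, `meet_in_the_middle` returns every key pair consistent with all of them; the true key is always among the candidates.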
The executor is a swappable, cheaper model that implements the brief and hands back a status plus test output. The interesting question is what each side is really contributing.
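The delegate-and-verify loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness used in the experiment: `Brief`, `Handback`, `orchestrate`, and the `max_rounds` cap are hypothetical names, and the four callables stand in for whatever model calls and tools a real setup would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Brief:
    """What the orchestrator hands down: the locked plan plus hard constraints."""
    algorithm: str
    constraints: List[str] = field(default_factory=list)


@dataclass
class Handback:
    """What the executor hands back up: a status plus raw test output."""
    status: str
    test_output: str


def orchestrate(
    explore: Callable[[], str],
    plan: Callable[[str], Brief],
    execute: Callable[[Brief, str], Handback],
    verify: Callable[[Handback], Tuple[bool, str]],
    max_rounds: int = 3,  # hypothetical cap so a doomed task cannot loop forever
) -> Tuple[bool, int]:
    """Return (passed, rounds_used)."""
    findings = explore()      # read the env, check versions, align data
    brief = plan(findings)    # lock the algorithm and the constraints
    feedback = ""             # empty on the first delegation
    for round_no in range(1, max_rounds + 1):
        handback = execute(brief, feedback)    # control flows down
        passed, feedback = verify(handback)    # and back up on verification
        if passed:
            return True, round_no
    return False, max_rounds
```

Keeping `verify` on the orchestrator side is the design choice that matters here: the executor's own status report never decides whether the task is done.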
The results
Across the suite, the orchestrated stack with GLM-5.2 as executor cleared 10 more tasks than the same model running solo — at a higher per-task cost, but a measurably higher ceiling.
Same executor model, with and without an Opus orchestrator on top.
| Configuration (orchestrator → executor) | Mean score |
|---|---|
| Opus 4.8 → GLM-5.2 | 69.7 |
| Opus 4.8 → Kimi K2.7 | 58.4 |
| GLM-5.2 solo | 58.4 |
Means on our setup. The orchestrated GLM-5.2 stack adds +11.3 mean and +10 passes over the same model alone.
Two numbers are worth sitting with. The executor choice matters under the same orchestrator: GLM-5.2 beats Kimi K2.7 by +11.3 mean / +10 passes on identical tasks (51 both pass, 11 GLM-only, 1 Kimi-only). Kimi also timed out more — 17 agent timeouts versus 10 — and burned more on doomed long-horizon tasks, including $26.59 on make-doom-for-mips alone.
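The pass gap follows directly from the overlap counts quoted above; a quick check of the arithmetic, using only figures from this post:

```python
# Overlap counts under the same Opus orchestrator (from the text above).
both_pass = 51
glm_only = 11
kimi_only = 1

glm_passes = both_pass + glm_only      # tasks the GLM-5.2 executor stack passed
kimi_passes = both_pass + kimi_only    # tasks the Kimi K2.7 executor stack passed
pass_gap = glm_passes - kimi_passes    # equals glm_only - kimi_only

# Mean-score gap from the results table, rounded to one decimal place.
mean_gap = round(69.7 - 58.4, 1)

print(glm_passes, kimi_passes, pass_gap, mean_gap)
```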
And GLM-5.2 solo ties the Kimi-executor stack at 45% of the cost. A cheaper model alone can match a more expensive orchestrated stack — which means the orchestrator only earns its keep if it lifts the ceiling, not just the average.
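One way to make that comparison explicit is score per unit of cost. The sketch below is a simplification under stated assumptions: the Kimi-executor stack's cost is normalized to 1.0 (the post gives only the 45% ratio, not absolute totals), and `score_per_cost` is a hypothetical helper, not a metric from the benchmark.

```python
def score_per_cost(mean_score: float, relative_cost: float) -> float:
    """Mean score bought per unit of (relative) spend."""
    if relative_cost <= 0:
        raise ValueError("relative_cost must be positive")
    return mean_score / relative_cost


# (mean score, cost relative to the Kimi-executor stack); 1.00 is a normalization.
configs = {
    "Opus 4.8 -> Kimi K2.7": (58.4, 1.00),
    "GLM-5.2 solo": (58.4, 0.45),
}

# Rank configurations from most to least cost-efficient.
ranked = sorted(configs, key=lambda name: score_per_cost(*configs[name]), reverse=True)

# How many times more score per unit cost the solo model delivers.
advantage = score_per_cost(*configs["GLM-5.2 solo"]) / score_per_cost(
    *configs["Opus 4.8 -> Kimi K2.7"]
)
```

With equal means, the advantage reduces to the inverse of the cost ratio, a little over 2x in the solo model's favor. This metric says nothing about the ceiling, which is the case the orchestrator has to make.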
Where orchestration actually helps
The split pays off on tasks with a hidden correct procedure — where a solo model produces plausible-but-grader-wrong work, or never ships a deliverable at all. The orchestrator's job is to discover and compress the correct procedure into a brief, then gate "done" on a grader-shaped check rather than a local one.
Orchestration helps when:
- The task has a hidden correct procedure, not just a plausible-looking answer
- A solo model ships work that looks right and fails the grader
- Verification can be gated on a grader-shaped check: L2 compare, od -c, tests

It does not help when:
- The task simply exceeds the time budget, so orchestration just spends more while failing
- No model can solve it at all, solo or orchestrated
- A cheap solo model already clears the bar at a fraction of the cost
The trajectory evidence is striking because it's the same executor model with opposite outcomes; the only difference is whether an orchestrator scoped the work first.
The unifying lesson: hidden verifiers check the reference procedure, not the result's surface plausibility. A solo model that "optimizes the query" or "computes a similarity" gets the shape right and the grader wrong. The orchestrator discovers the right procedure and verifies against the actual bar.
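A grader-shaped check of the kind described here might look like the sketch below. It is a minimal illustration, assuming images are available as flat sequences of pixel values and that the grader compares output bytes exactly. The function names and the `l2_threshold` parameter are hypothetical, and the byte comparison is a programmatic stand-in for eyeballing `od -c` output.

```python
import math
from typing import Optional, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equally sized pixel sequences."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of values")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_byte_mismatch(actual: bytes, expected: bytes) -> Optional[int]:
    """Offset of the first differing byte, or None if the outputs are identical.

    Catches differences that look fine in a terminal, such as a stray
    trailing newline.
    """
    for offset, (x, y) in enumerate(zip(actual, expected)):
        if x != y:
            return offset
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


def gate_done(pixels, ref_pixels, output: bytes, expected: bytes,
              l2_threshold: float) -> bool:
    """Declare 'done' only if both grader-shaped checks pass."""
    return (
        l2_distance(pixels, ref_pixels) <= l2_threshold
        and first_byte_mismatch(output, expected) is None
    )
```

For example, `first_byte_mismatch(b"ok\n", b"ok")` reports offset 2: the output "looks right" but would fail an exact-match grader.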
The executor is swappable — the orchestrator is not
This is the asymmetry that makes the architecture practical. The executor seat is a commodity: rank candidates by productivity-per-dollar and swap freely. Under a fixed Opus orchestrator, GLM-5.2 > Kimi K2.7 > MiniMax ≈ Nemotron.
The orchestrator seat is where the passes live. Swap Opus for a cheaper planner — Opus → GPT-5.5 — and the score roughly halves while winning nothing new. Planner reasoning is what discovers the hidden procedure; you can't buy it back with a stronger executor.
Why this matters for BLACKBOX
This is exactly the division of labor BLACKBOX's multi-agent execution is built around. You can seat a frontier model as the orchestrator, run a cheaper model as the executor, and swap that executor per task to tune productivity-per-dollar — all without rewiring your workflow.
The takeaways are clean. Split planning from execution and you buy roughly +11 points of mean score and +10 passes on Terminal-Bench 2.0. Treat the executor as swappable and rank it by cost-efficiency. Keep a strong planner in the orchestrator seat, because that's where the wins come from. And cap the budget, because a higher ceiling doesn't mean an unlimited one.