Orchestrator–Executor: A Two-Agent Split That Beats a Solo Model on SWE Tasks

Most coding agents are a single model doing everything at once — exploring the repo, deciding on an approach, writing the code, and checking its own work. That works until the task has a hidden correct procedure the model can't see, and it confidently ships something that looks right and fails the grader.

We ran an experiment with a different shape: split the agent into two roles. A strong orchestrator explores, reasons, and writes a precise plan. A cheaper but capable executor does the hands-on implementation. Here's what that division of labor actually buys.

The Orchestrator–Executor loop

A strong model plans and verifies; a cheaper, swappable model does the hands-on work. Control flows down on delegation and back up on verification.

Figure: the Orchestrator–Executor loop, six numbered steps.

Orchestrator (Opus 4.8, the planner)
1. Explore: read the env, check versions, disassemble binaries, align data (read · grep · run)
2. Plan: lock the plan, reject the naive path (spec: algorithm + constraints)
3. Delegate: hand off an implementable brief with hard constraints
5. Verify: review the artifact against the grader (L2 compare · od -c · tests)

Executor (GLM-5.2, swappable worker)
4. Execute: implement the brief, hand back status + test output (edit · write · run · test)
6. Rectify: fix and re-test on a failed verification

Flow: delegate (3 → 4), handback (4 → 5), fail ✗ → rectify (5 → 6) then re-verify (6 → 5), pass ✓ → task complete.

The split: a planner and a worker

The architecture is an orchestrator agent that delegates to a stateful executor subagent. The orchestrator runs four jobs the solo model tends to rush:

Explore. The orchestrator (Opus 4.8) reads the environment, disassembles binaries, checks library versions, and aligns the data before committing to anything.

Lock the plan. It commits to a specific algorithm or procedure — often rejecting the naive approach. On one cryptanalysis task it derives a 2²⁰+2²⁰ meet-in-the-middle attack instead of a 2⁴⁰ brute force.

Delegate a precise brief. It hands the executor an implementable spec with hard constraints: do NOT redesign the algorithm; implement what is specified, then verify.

Review against the grader. On handback, the orchestrator verifies the artifact — an L2 image compare, od -c on the output file — and runs the tests before declaring the task done. If verification fails, the executor rectifies and the loop repeats.

The executor is a swappable, cheaper model that implements the brief and hands back a status plus test output. The interesting question is what each side is really contributing.
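The control flow described above can be sketched in a few lines. This is a minimal illustration, not the harness used in the experiment: the `Handback` type, the `run_task` function, the three callables, and the `max_rounds` cap are all hypothetical names and values introduced here, and the stub roles stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Handback:
    """What the executor returns: a status plus raw test output."""
    status: str
    test_output: str


def run_task(
    plan: Callable[[], str],
    execute: Callable[[str, Optional[str]], Handback],
    verify: Callable[[Handback], Optional[str]],
    max_rounds: int = 3,  # assumed cap; the article does not give one
) -> bool:
    """Plan once, then loop execute -> verify until pass or budget runs out."""
    brief = plan()                     # orchestrator: explore and lock the plan
    feedback: Optional[str] = None     # None on the first attempt
    for _ in range(max_rounds):
        handback = execute(brief, feedback)   # executor: implement or rectify
        feedback = verify(handback)           # orchestrator: None means pass
        if feedback is None:
            return True
    return False


# Stub roles for illustration only; real roles would call models.
def demo_plan() -> str:
    return "Implement the specified algorithm. Do NOT redesign it. Then verify."


def demo_execute(brief: str, feedback: Optional[str]) -> Handback:
    # The first attempt fails; the rectify attempt (feedback present) passes.
    passed = feedback is not None
    return Handback(status="done", test_output="PASS" if passed else "FAIL")


def demo_verify(handback: Handback) -> Optional[str]:
    return None if handback.test_output == "PASS" else "tests failed: fix and re-test"


result = run_task(demo_plan, demo_execute, demo_verify)
```

Keeping `verify` on the orchestrator side is the design point: the executor reports a status, but only the orchestrator's check decides whether the task is done.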

The results

Across the 89-task suite, the orchestrated stack with GLM-5.2 as executor cleared 10 more tasks than the same model running solo (62 versus 52). It costs more per task ($2.40 versus $0.84), but the ceiling is measurably higher.

Headline numbers for the orchestrated GLM-5.2 stack on Terminal-Bench 2.0: +11.3 mean lift over GLM-5.2 solo, +10 extra tasks passed, $2.40 per-task cost.

Figure: Mean score on Terminal-Bench 2.0 (89 tasks), higher is better. Same executor model, with and without an Opus orchestrator on top.

Model	Mean
Opus 4.8 → GLM-5.2	69.7
Opus 4.8 → Kimi K2.7	58.4
GLM-5.2 solo	58.4

Means on our setup. The orchestrated GLM-5.2 stack adds +11.3 mean and +10 passes over the same model alone.

Config	Orchestrator	Executor	Mean	Passes	$/task
Opus + GLM-5.2 ★	Opus 4.8	GLM-5.2	69.7	62/89	$2.40
Opus + Kimi K2.7	Opus 4.8	Kimi K2.7	58.4	52/89	$1.85
GLM-5.2 solo	GLM-5.2	—	58.4	52/89	$0.84

Two numbers are worth sitting with. The executor choice matters under the same orchestrator: GLM-5.2 beats Kimi K2.7 by +11.3 mean / +10 passes on identical tasks (51 both pass, 11 GLM-only, 1 Kimi-only). Kimi also timed out more — 17 agent timeouts versus 10 — and burned more on doomed long-horizon tasks, including $26.59 on make-doom-for-mips alone.

And GLM-5.2 solo ties the Kimi-executor stack at 45% of the cost. A cheaper model alone can match a more expensive orchestrated stack — which means the orchestrator only earns its keep if it lifts the ceiling, not just the average.
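The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes the mean score is the pass rate in percent over the 89 tasks, which is consistent with the reported numbers but is not stated outright; the variable and function names are illustrative.

```python
TOTAL_TASKS = 89

# Overlap quoted in the text: 51 both pass, 11 GLM-only, 1 Kimi-only.
both, glm_only, kimi_only = 51, 11, 1
glm_passes = both + glm_only     # orchestrated GLM-5.2 stack
kimi_passes = both + kimi_only   # orchestrated Kimi K2.7 stack
solo_passes = 52                 # GLM-5.2 solo, from the table


def mean_score(passes: int, total: int = TOTAL_TASKS) -> float:
    """Pass rate in percent, rounded to one decimal (assumed definition of 'mean')."""
    return round(100 * passes / total, 1)


glm_mean = mean_score(glm_passes)
kimi_mean = mean_score(kimi_passes)
solo_mean = mean_score(solo_passes)

lift = round(glm_mean - solo_mean, 1)   # lift of the GLM stack over GLM solo
extra_passes = glm_passes - solo_passes

# GLM-5.2 solo cost relative to the Kimi-executor stack ($0.84 vs $1.85 per task).
cost_ratio = round(0.84 / 1.85, 2)

print(glm_passes, kimi_passes, glm_mean, kimi_mean, lift, extra_passes, cost_ratio)
```

Under that assumption the overlap counts give 62 and 52 passes, the means come out at 69.7 and 58.4, the lift is 11.3 points, and the solo run costs about 45% of the Kimi stack.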

Where orchestration actually helps

The split pays off on tasks with a hidden correct procedure — where a solo model produces plausible-but-grader-wrong work, or never ships a deliverable at all. The orchestrator's job is to discover and compress the correct procedure into a brief, then gate "done" on a grader-shaped check rather than a local one.
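As a rough illustration of a grader-shaped check, here is a sketch of an L2 compare that gates "done" on similarity to a reference output. The article does not give the grader's exact metric, so the normalized similarity, the 0.99 threshold, and the function names below are all assumptions; pixel data is modeled as flat lists of numbers to keep the example dependency-free.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length pixel buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_similarity(candidate: Sequence[float], reference: Sequence[float]) -> float:
    """1.0 for an exact match, falling toward 0.0 as the error grows.

    Assumed form: 1 - ||candidate - reference|| / ||reference||, clamped at 0.
    """
    ref_norm = math.sqrt(sum(y * y for y in reference))
    dist = l2_distance(candidate, reference)
    if ref_norm == 0:
        return 1.0 if dist == 0 else 0.0
    return max(0.0, 1.0 - dist / ref_norm)


def gate_done(candidate: Sequence[float], reference: Sequence[float],
              threshold: float = 0.99) -> bool:
    """Declare the task done only if the artifact clears the grader-shaped bar."""
    return l2_similarity(candidate, reference) >= threshold


reference_image = [3.0, 4.0, 0.0]
print(gate_done([3.0, 4.0, 0.0], reference_image))  # exact match
print(gate_done([0.0, 0.0, 0.0], reference_image))  # blank output
```

The point of the gate is that a local signal such as "the code compiles" or "an image file exists" never reaches it; only a comparison against the reference does.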

The split pays off when

The task has a hidden correct procedure, not just a plausible-looking answer
A solo model ships work that looks right and fails the grader
Verification can be gated on a grader-shaped check — L2 compare, od -c, tests

It doesn't help when

The task simply exceeds the time budget — orchestration just spends more failing
No model can solve it at all, solo or orchestrated
A cheap solo model already clears the bar at a fraction of the cost

The trajectory evidence is striking because it's the same executor model with opposite outcomes — the only difference is whether an orchestrator scoped the work first:

Task	Orchestrated ✓	Solo ✗
mteb-retrieve	Correct BGE API, right revision	DIY cosine sim, wrong doc
feal-cryptanalysis	2²⁰+2²⁰ MITM, verified spec	Timed out, no output shipped
dna-assembly	Exact 4-bp Golden Gate overhangs	Wrong overhangs, 48-nt primer
path-tracing	L2-compares during dev → 1.0000	1900 lines, never shipped image.c
mcmc-sampling-stan	BDA3 prior → β ≈ 16.40 (pass)	Reparam → β = 14.25 (fail)
query-optimize	CTE rewrite at 622 ms, beats golden	Beat the original, not the bar

The unifying lesson: hidden verifiers check the reference procedure, not the result's surface plausibility. A solo model that "optimizes the query" or "computes a similarity" gets the shape right and the grader wrong. The orchestrator discovers the right procedure and verifies against the actual bar.
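To make one of those procedures concrete: the feal-cryptanalysis plan replaces a 2⁴⁰ brute force with a 2²⁰+2²⁰ meet-in-the-middle search. The sketch below shows the same trade on a deliberately tiny scale, 2⁸+2⁸ work instead of 2¹⁶. The cipher here is a made-up toy round applied twice with two 8-bit keys; it is not FEAL and not the attack the orchestrator wrote, only an illustration of the technique.

```python
from collections import defaultdict

MASK = 0xFFFF      # toy 16-bit block
KEY_BITS = 8       # two 8-bit keys, so brute force would be 2**16 pairs


def rotl16(x: int, r: int) -> int:
    return ((x << r) | (x >> (16 - r))) & MASK


def rotr16(x: int, r: int) -> int:
    return ((x >> r) | (x << (16 - r))) & MASK


def enc(k: int, x: int) -> int:
    """Toy invertible round (hypothetical, NOT FEAL)."""
    x = (x + k * 0x0101) & MASK
    x = rotl16(x, 5)
    return x ^ ((k * 0x0301) & MASK)


def dec(k: int, y: int) -> int:
    """Exact inverse of enc for the same key."""
    y ^= (k * 0x0301) & MASK
    y = rotr16(y, 5)
    return (y - k * 0x0101) & MASK


def double_enc(k1: int, k2: int, x: int) -> int:
    return enc(k2, enc(k1, x))


def meet_in_the_middle(pairs):
    """Recover (k1, k2) candidates from known (plaintext, ciphertext) pairs."""
    (p0, c0), rest = pairs[0], pairs[1:]
    # Forward half: encrypt p0 under every k1 and index by the middle value.
    forward = defaultdict(list)
    for k1 in range(1 << KEY_BITS):
        forward[enc(k1, p0)].append(k1)
    # Backward half: decrypt c0 under every k2 and look for a matching middle.
    candidates = []
    for k2 in range(1 << KEY_BITS):
        for k1 in forward.get(dec(k2, c0), ()):
            # Filter false positives against the remaining known pairs.
            if all(double_enc(k1, k2, p) == c for p, c in rest):
                candidates.append((k1, k2))
    return candidates


secret = (0x3A, 0xC5)
known = [(p, double_enc(*secret, p)) for p in (0x1234, 0xBEEF, 0x0042)]
found = meet_in_the_middle(known)
```

Each half of the search costs 2⁸ cipher calls plus a table lookup, which is the same shape as the 2²⁰+2²⁰ plan: the memory for the forward table buys the square root of the brute-force time.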

The executor is swappable — the orchestrator is not

This is the asymmetry that makes the architecture practical. The executor seat is a commodity: rank candidates by productivity-per-dollar and swap freely. Under a fixed Opus orchestrator, GLM-5.2 > Kimi K2.7 > MiniMax ≈ Nemotron.

The orchestrator seat is where the passes live. Swap Opus for a cheaper planner — Opus → GPT-5.5 — and the score roughly halves while winning nothing new. Planner reasoning is what discovers the hidden procedure; you can't buy it back with a stronger executor.

Why this matters for BLACKBOX

This is exactly the division of labor BLACKBOX's multi-agent execution is built around. You can seat a frontier model as the orchestrator, run a cheaper model as the executor, and swap that executor per task to tune productivity-per-dollar — all without rewiring your workflow.

The takeaways are clean. Split planning from execution and you buy roughly +11.3 points of mean score and +10 passes on Terminal-Bench 2.0. Treat the executor as swappable and rank it by cost-efficiency. Keep a strong planner in the orchestrator seat, because that's where the wins come from. And cap the budget, because a higher ceiling doesn't mean an unlimited one.

Run your own orchestrator–executor stack

Put a frontier planner on top of a cheaper executor — and swap models per task — with BLACKBOX multi-agent execution.
