Orchestrator–Executor: A Two-Agent Split That Beats a Solo Model on SWE Tasks

Most coding agents are a single model doing everything at once — exploring the repo, deciding on an approach, writing the code, and checking its own work. That works until the task has a hidden correct procedure the model can't see, and it confidently ships something that looks right and fails the grader.

We ran an experiment with a different shape: split the agent into two roles. A strong orchestrator explores, reasons, and writes a precise plan. A cheaper but capable executor does the hands-on implementation. Here's what that division of labor actually buys.

The Orchestrator–Executor loop

A strong model plans and verifies; a cheaper, swappable model does the hands-on work. Control flows down on delegation and back up on verification.

Orchestrator (Opus 4.8)
1. Explore: read the env, check versions, disassemble binaries, align data (read · grep · run)
2. Plan: lock the plan, reject the naive path (spec: algorithm + constraints)
3. Delegate: hand off an implementable brief with hard constraints

Executor (GLM-5.2, swappable)
4. Execute: implement the brief, hand back status + test output (edit · write · run · test)

Orchestrator (Opus 4.8)
5. Verify: review the artifact against the grader (L2 compare · od -c · tests)
6. Rectify: on a failed verification, fix and re-test; on a pass, the task is complete

The split: a planner and a worker

The architecture is an orchestrator agent that delegates to a stateful executor subagent. The orchestrator runs four jobs the solo model tends to rush:

Explore. The orchestrator (Opus 4.8) reads the environment, disassembles binaries, checks library versions, and aligns the data before committing to anything.

Lock the plan. It commits to a specific algorithm or procedure — often rejecting the naive approach. On one cryptanalysis task it derives a 2²⁰+2²⁰ meet-in-the-middle attack instead of a 2⁴⁰ brute force.

Delegate a precise brief. It hands the executor an implementable spec with hard constraints: do NOT redesign the algorithm; implement what is specified, then verify.

Review against the grader. On handback, the orchestrator verifies the artifact — an L2 image compare, od -c on the output file — and runs the tests before declaring the task done. If verification fails, the executor rectifies and the loop repeats.
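The saving behind the meet-in-the-middle plan mentioned above (2²⁰+2²⁰ work instead of 2⁴⁰) can be illustrated at toy scale. The sketch below is not the attack from the task: it uses a made-up 16-bit cipher with 8-bit keys (`enc`, `dec`, and all constants are hypothetical), applied twice with two independent keys, so the gap is 2⁸+2⁸ versus 2¹⁶.

```python
from typing import Dict, List, Tuple

MASK = 0xFFFF
MUL = 0x9E35                      # hypothetical odd constant, so invertible mod 2**16
MUL_INV = pow(MUL, -1, 1 << 16)   # modular inverse (Python 3.8+)


def enc(key: int, block: int) -> int:
    """Toy invertible 16-bit cipher with an 8-bit key (illustration only)."""
    mixed = ((block ^ (key * 0x0101)) * MUL) & MASK
    return (mixed + key) & MASK


def dec(key: int, block: int) -> int:
    """Inverse of enc for the same key."""
    mixed = (((block - key) & MASK) * MUL_INV) & MASK
    return mixed ^ (key * 0x0101)


def meet_in_the_middle(pairs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Recover (k1, k2) candidates for c = enc(k2, enc(k1, p))."""
    p0, c0 = pairs[0]
    # Forward half: 2**8 encryptions, indexed by the middle value.
    forward: Dict[int, List[int]] = {}
    for k1 in range(256):
        forward.setdefault(enc(k1, p0), []).append(k1)
    candidates = []
    # Backward half: 2**8 decryptions, each a dictionary lookup.
    for k2 in range(256):
        for k1 in forward.get(dec(k2, c0), []):
            # Filter accidental matches with the remaining known pairs.
            if all(enc(k2, enc(k1, p)) == c for p, c in pairs[1:]):
                candidates.append((k1, k2))
    return candidates


true_k1, true_k2 = 0x3A, 0xC5
known = [(p, enc(true_k2, enc(true_k1, p))) for p in (0x1234, 0xBEEF, 0x0F0F)]
print((true_k1, true_k2) in meet_in_the_middle(known))  # True
```

Brute force would try all 65,536 key pairs; the table-and-lookup version does 256 encryptions plus 256 decryptions before confirming the few survivors. That is the kind of choice the orchestrator locks into the plan before delegating.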

The executor is a swappable, cheaper model that implements the brief and hands back a status plus test output. The interesting question is what each side is really contributing.
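As a rough illustration of the control flow described above, here is a minimal sketch of the loop with stub functions standing in for the two models. The names (`Brief`, `Handback`, `run_loop`, the `stub_*` functions) and the `max_rounds` cap are hypothetical, chosen for this example rather than taken from the experiment's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Brief:
    """What the orchestrator hands down: the locked plan plus hard constraints."""
    algorithm: str
    constraints: List[str] = field(default_factory=list)


@dataclass
class Handback:
    """What the executor hands back: a status, test output, and the artifact."""
    status: str
    test_output: str
    artifact: str


def run_loop(
    plan: Callable[[], Brief],
    execute: Callable[[Brief, str], Handback],
    verify: Callable[[Handback], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[int, Handback]:
    """Plan once, then alternate execute/verify until the check passes."""
    brief = plan()
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        handback = execute(brief, feedback)
        passed, feedback = verify(handback)
        if passed:
            return round_no, handback
    raise RuntimeError(f"verification still failing after {max_rounds} rounds")


# Stub roles; in a real stack each of these would be a model call.
def stub_plan() -> Brief:
    return Brief(
        algorithm="write the answer followed by a newline",
        constraints=["do NOT redesign the algorithm",
                     "implement what is specified, then verify"],
    )


def stub_execute(brief: Brief, feedback: str) -> Handback:
    # First attempt omits the newline; the rectified attempt adds it.
    artifact = "42\n" if feedback else "42"
    return Handback(status="done", test_output="local tests passed", artifact=artifact)


def stub_verify(handback: Handback) -> Tuple[bool, str]:
    # Grader-shaped check: exact bytes, not "looks right".
    if handback.artifact == "42\n":
        return True, ""
    return False, "output must end with a newline"


rounds, final = run_loop(stub_plan, stub_execute, stub_verify)
print(rounds, repr(final.artifact))  # prints: 2 '42\n'
```

The stub executor reports "local tests passed" on both attempts; only the orchestrator's stricter check catches the first one, which is the failure mode the split is meant to guard against.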

The results

Across the suite, the orchestrated stack with GLM-5.2 as executor cleared 10 more tasks than the same model running solo — at a higher per-task cost, but a measurably higher ceiling.

69.7 mean on Terminal-Bench 2.0
+11.3 lift over GLM-5.2 solo
+10 extra tasks passed
$2.40 per-task cost (orchestrated)

Mean score, Terminal-Bench 2.0 (89 tasks), higher is better

Same executor model, with and without an Opus orchestrator on top.

| Model | Mean |
| --- | --- |
| Opus 4.8 → GLM-5.2 | 69.7 |
| Opus 4.8 → Kimi K2.7 | 58.4 |
| GLM-5.2 solo | 58.4 |

Means on our setup. The orchestrated GLM-5.2 stack adds +11.3 mean and +10 passes over the same model alone.

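| Config | Orchestrator | Executor | Mean | Passes | $/task |
| --- | --- | --- | --- | --- | --- |
| Opus + GLM-5.2 ★ | Opus 4.8 | GLM-5.2 | 69.7 | 62/89 | $2.40 |
| Opus + Kimi K2.7 | Opus 4.8 | Kimi K2.7 | 58.4 | 52/89 | $1.85 |
| GLM-5.2 solo | none | GLM-5.2 | 58.4 | 52/89 | $0.84 |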

Two numbers are worth sitting with. The executor choice matters under the same orchestrator: GLM-5.2 beats Kimi K2.7 by +11.3 mean / +10 passes on identical tasks (51 both pass, 11 GLM-only, 1 Kimi-only). Kimi also timed out more — 17 agent timeouts versus 10 — and burned more on doomed long-horizon tasks, including $26.59 on make-doom-for-mips alone.

And GLM-5.2 solo ties the Kimi-executor stack at 45% of the cost. A cheaper model alone can match a more expensive orchestrated stack — which means the orchestrator only earns its keep if it lifts the ceiling, not just the average.
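The headline figures can be re-derived from the pass counts and per-task costs reported above. The short sketch below does that arithmetic. It assumes the mean score is the pass rate over all 89 tasks and that $/task is averaged over all 89 tasks; the second assumption is ours, not stated in the results.

```python
TASKS = 89

# Pass counts and per-task costs as reported in the results above.
# 62 = 51 passed by both stacks + 11 GLM-only; 52 = 51 + 1 Kimi-only.
configs = {
    "Opus + GLM-5.2": {"passes": 62, "usd_per_task": 2.40},
    "Opus + Kimi K2.7": {"passes": 52, "usd_per_task": 1.85},
    "GLM-5.2 solo": {"passes": 52, "usd_per_task": 0.84},
}


def mean_score(passes: int, tasks: int = TASKS) -> float:
    """Pass rate as a percentage, rounded to one decimal."""
    return round(100 * passes / tasks, 1)


glm_stack = configs["Opus + GLM-5.2"]
kimi_stack = configs["Opus + Kimi K2.7"]
solo = configs["GLM-5.2 solo"]

# Lift of the orchestrated GLM-5.2 stack over GLM-5.2 solo, in points.
lift = round(mean_score(glm_stack["passes"]) - mean_score(solo["passes"]), 1)

# Solo cost as a share of the Kimi-executor stack's cost.
solo_cost_share = round(100 * solo["usd_per_task"] / kimi_stack["usd_per_task"])

# Assumption: $/task is a mean over all 89 tasks, so total spend = $/task * 89.
extra_spend = (glm_stack["usd_per_task"] - solo["usd_per_task"]) * TASKS
usd_per_extra_pass = round(extra_spend / (glm_stack["passes"] - solo["passes"]), 2)

print(mean_score(glm_stack["passes"]))  # 69.7
print(lift)                             # 11.3
print(solo_cost_share)                  # 45
print(usd_per_extra_pass)               # 13.88
```

Under those assumptions, each of the 10 additional passes costs roughly $14 more than running GLM-5.2 solo, which is one way to put a price on the higher ceiling.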

Where orchestration actually helps

The split pays off on tasks with a hidden correct procedure — where a solo model produces plausible-but-grader-wrong work, or never ships a deliverable at all. The orchestrator's job is to discover and compress the correct procedure into a brief, then gate "done" on a grader-shaped check rather than a local one.

The split pays off when
  • The task has a hidden correct procedure, not just a plausible-looking answer
  • A solo model ships work that looks right and fails the grader
  • Verification can be gated on a grader-shaped check — L2 compare, od -c, tests
It doesn't help when
  • The task simply exceeds the time budget — orchestration just spends more failing
  • No model can solve it at all, solo or orchestrated
  • A cheap solo model already clears the bar at a fraction of the cost
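A grader-shaped check of the kind listed above might look like the following sketch. It pairs a relative L2 error between a rendered image and a reference with a byte-exact comparison of an output file, the kind of mismatch `od -c` makes visible. The function names, the sample data, and the 0.05 tolerance are hypothetical; a real grader defines its own metric and threshold.

```python
import math
from typing import Optional, Sequence


def relative_l2_error(actual: Sequence[float], reference: Sequence[float]) -> float:
    """L2 norm of the difference, scaled by the L2 norm of the reference."""
    if len(actual) != len(reference):
        raise ValueError("images must have the same number of samples")
    diff = math.sqrt(sum((a - r) ** 2 for a, r in zip(actual, reference)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    return diff / norm if norm else diff


def first_byte_mismatch(actual: bytes, expected: bytes) -> Optional[int]:
    """Index of the first differing byte, or None if the outputs are identical."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        # One output is a strict prefix of the other (e.g. a missing newline).
        return min(len(actual), len(expected))
    return None


reference_pixels = [0.0, 0.5, 1.0, 0.25]
rendered_pixels = [0.0, 0.5, 0.99, 0.25]
print(relative_l2_error(rendered_pixels, reference_pixels) < 0.05)  # True

# Looks identical when printed, but the trailing newline is missing.
print(first_byte_mismatch(b"16.40", b"16.40\n"))  # 5
```

The second check is the "looks right, fails the grader" case in miniature: the value is correct, yet the file differs from the expected one at byte 5.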

The trajectory evidence is striking because it's the same executor model with opposite outcomes — the only difference is whether an orchestrator scoped the work first:

| Task | Orchestrated ✓ | Solo ✗ |
| --- | --- | --- |
| mteb-retrieve | Correct BGE API, right revision | DIY cosine sim, wrong doc |
| feal-cryptanalysis | 2²⁰+2²⁰ MITM, verified spec | Timed out, no output shipped |
| dna-assembly | Exact 4-bp Golden Gate overhangs | Wrong overhangs, 48-nt primer |
| path-tracing | L2-compares during dev → 1.0000 | 1900 lines, never shipped image.c |
| mcmc-sampling-stan | BDA3 prior → β ≈ 16.40 (pass) | Reparam → β = 14.25 (fail) |
| query-optimize | CTE rewrite at 622 ms, beats golden | Beat the original, not the bar |

The unifying lesson: hidden verifiers check the reference procedure, not the result's surface plausibility. A solo model that "optimizes the query" or "computes a similarity" gets the shape right and the grader wrong. The orchestrator discovers the right procedure and verifies against the actual bar.

The executor is swappable — the orchestrator is not

This is the asymmetry that makes the architecture practical. The executor seat is a commodity: rank candidates by productivity-per-dollar and swap freely. Under a fixed Opus orchestrator, GLM-5.2 > Kimi K2.7 > MiniMax ≈ Nemotron.

The orchestrator seat is where the passes live. Swap Opus for a cheaper planner (Opus → GPT-5.5) and the score roughly halves, with no newly passed tasks to show for it. Planner reasoning is what discovers the hidden procedure; you can't buy it back with a stronger executor.

Why this matters for BLACKBOX

This is exactly the division of labor BLACKBOX's multi-agent execution is built around. You can seat a frontier model as the orchestrator, run a cheaper model as the executor, and swap that executor per task to tune productivity-per-dollar — all without rewiring your workflow.

The takeaways are clean. Split planning from execution and you buy +11.3 points of mean score and +10 passes on Terminal-Bench 2.0 over the same executor running solo. Treat the executor as swappable and rank it by cost-efficiency. Keep a strong planner in the orchestrator seat, because that's where the wins come from. And cap the budget, because a higher ceiling doesn't mean an unlimited one.

Run your own orchestrator–executor stack
Put a frontier planner on top of a cheaper executor — and swap models per task — with BLACKBOX multi-agent execution.