Codex Multi-Agent Orchestration
A practical playbook for running codex like a small engineering team, with parallel execution, tight quality gates, and a repeatable loop that stays green.
As AI becomes a bigger part of how we build software, the amount of code we need to review grows with it. The real question is not whether to use these tools. It is how to spend our time most efficiently as we learn to integrate them, so that we maximize the value we get back.
When AI coding tools first arrived, everyone thought that better prompts would be the fix for better code. But the actual fix is the same one that works for orchestrating human teams: distinct roles, parallel independent work, and a gate before anything ships.
Learning to orchestrate
As AI agents progress and become better at writing code, we will write less and less of it ourselves. At OpenAI, Codex now reviews the vast majority of pull requests, catching hundreds of issues every day. At Anthropic, engineers report increasingly using Claude Code to implement new features, with that category rising from 14% to 37% in their internal study. I believe that within a year or so, we will spend far less time on smaller details like code style and formatting, and shift our focus toward higher-level design, architecture, and the decisions that actually shape the product.
Think of it like an orchestra. You would never see a conductor trying to play every instrument, read every sheet, and manage the tempo all at once. Instead, there are sections: strings, brass, woodwinds, percussion, each owning their part. The conductor sets the direction, keeps everyone in sync, and decides when to bring each section in.
This is exactly how you should think about your Codex session. We have an orchestrator that owns the plan, the conductor setting the direction. We have explorer agents that scout the repo like a first chair reading ahead in the score. Multi-agent workflows in Codex CLI are evolving quickly, and recent releases have improved model selection, plugin context handling, resume behavior, and fast execution workflows. That is exactly what this post covers.
Below, we will walk through the full setup: the scouts, the implementers, the quality gates, and how they all fit together. The loop is: plan → implement in parallel → run checks (tests, typecheck, lint) → review/security gate → repeat.
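The shape of that loop can be sketched in a few lines of Python. This is the control flow only, not the Codex CLI API; `plan`, `implement`, and the `gates` are hypothetical stand-ins for the agent roles described below:

```python
# Illustrative shape of the orchestration loop — not a real Codex API.
def orchestrate(plan, implement, gates, max_rounds=10):
    results = []
    for _ in range(max_rounds):
        tasks = plan()                    # orchestrator refreshes plan.md
        if not tasks:
            break                         # nothing left to do
        results = [implement(t) for t in tasks]  # run in parallel in practice
        # gates: tests, typecheck, lint, review — each returns findings
        findings = [f for gate in gates for f in gate(results)]
        if not findings:
            break                         # all gates green
        # otherwise, findings feed back into the next plan() as new tasks
    return results
```

The point of the sketch is the feedback edge: gate findings become new tasks, and the loop only exits when a batch comes back clean.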
Since first writing this, Codex has moved forward quickly. The orchestration pattern in this post still holds, but I would update the model recommendations, CLI workflow details, and the way I frame security review in light of GPT-5.4, newer Codex CLI features, and Codex Security.
One thing I learned early on: parallelise read-heavy work first. Multiple agents editing code at the same time can cause merge conflicts, so keep writes scoped and sequential where possible.
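A minimal sketch of that split, using Python's thread pool for the read-only fan-out. Here `explore` and `implement` are hypothetical stand-ins for agent calls, not Codex functions:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for agent work — illustrative only.
def explore(path):
    return f"summary of {path}"      # read-only, safe to run in parallel

def implement(task):
    return f"patched {task}"         # writes files, so run one at a time

paths = ["src/api.py", "src/db.py", "tests/test_api.py"]

# Fan out the read-heavy scouting across threads...
with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(explore, paths))

# ...then apply write-heavy tasks sequentially to avoid merge conflicts.
patches = [implement(t) for t in ["task-1", "task-2"]]
```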
Why this matters
First, each role owns one job, so the prompts can be short and tailored to that agent specifically, instead of the single bloated prompt we would need if one agent did everything. Second, different tasks require different abilities. Exploring the repository involves little deep reasoning, so we can reduce the reasoning effort and in turn increase the speed and efficiency of execution. Below, we will review how to enable multi-agent mode and how I use it.
That matters even more now that Codex CLI can surface enabled plugin context automatically and lets you pull in targeted plugin context directly with @plugin mentions.
To get started, open your Codex config:
# Open the global codex config (create it if it doesn't exist)
open ~/.codex/config.toml
1) Enable multi-agent in config.toml
Everything starts in ~/.codex/config.toml (or .codex/config.toml for repo-specific overrides). Set multi_agent = true under [features], then register each role in [agents] with a description and a path to its TOML file.
The agents.max_threads setting controls how many agent threads can be open concurrently. The default is 6, but I start at 12 for parallel batches. If you hit rate limits or stability issues, bring it down to 10, and if you are still seeing errors, try 8. You can always scale back up once the loop is stable.
The key point here is to experiment and see what works for you. These settings are a starting point, not a prescription.
# Enable multi-agent and set thread budget
[features]
multi_agent = true
[agents]
max_threads = 12
# Register every role — codex resolves the TOML at runtime
[agents.orchestrator]
description = "Owns plan.md, delegates tasks, merges results."
config_file = "agents/orchestrator.toml"
[agents.explorer]
description = "Finds files, CI configs, and local equivalents."
config_file = "agents/explorer.toml"
[agents.implementer]
description = "Implements one scoped task and iterates until green."
config_file = "agents/implementer.toml"
[agents.ci_runner]
description = "Runs CI commands locally and isolates failures."
config_file = "agents/ci_runner.toml"
[agents.reviewer]
description = "PR-grade review for correctness and regressions."
config_file = "agents/reviewer.toml"
[agents.qa_test_author]
description = "Adds/strengthens tests and repro steps."
config_file = "agents/qa_test_author.toml"
[agents.release_manager]
description = "Release notes, rollout, rollback checklist."
config_file = "agents/release_manager.toml"
# Optional local security reviewer.
# If you want dedicated security scanning, consider Codex Security instead.
#
# [agents.security_reviewer]
# description = "Optional local security review for secrets, injection, auth mistakes."
# config_file = "agents/security_reviewer.toml"
2) The orchestrator
Let's start with the orchestrator. This is the agent that tells everyone else what to do. Its main job is to maintain the plan.md file at the repo root, breaking the work into atomic tasks, delegating them to implementers in parallel, and running all three quality gates after each batch. Atomic tasks work better here because they keep context tight and limit how far an agent can drift off the rails. The smaller and more focused each task is, the easier it is to course-correct when something goes wrong.
Personally, I set this one to the highest reasoning level because the better the plan, the better the outcome. A planning mistake cascades through every batch, so this is where you want the model thinking the hardest.
# Create the orchestrator agent file
mkdir -p ~/.codex/agents && touch ~/.codex/agents/orchestrator.toml
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
You are the Orchestrator.
Maintain plan.md at repo root.
Loop:
1) Update plan.md with atomic tasks and acceptance criteria.
2) Delegate independent tasks to implementer in parallel.
3) After each batch call: ci_runner, reviewer, qa_test_author.
4) If the change touches auth, secrets, external fetches, file access, permissions, or user data, trigger a security review step or use Codex Security.
5) Convert findings into new tasks. Repeat until green.
Rules:
- No scope creep. Put extras in Later.
- Small diffs. End with Done / Next / Risks.
"""
3) The explorer
The explorer maps file paths, finds CI configs, and suggests what to look at next, but it never implements anything. Keeping exploration separate prevents the classic failure mode where an agent starts "fixing" things it was only supposed to understand.
I run this at low reasoning. It is just reading files and listing paths. No complex analysis needed, which makes it fast and cheap.
model = "gpt-5.3-codex-spark" # Spark for speed; use gpt-5.4 on Plus
model_reasoning_effort = "low" # just reading — no complex reasoning
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
Explore only.
Return:
- file paths and key symbols
- where CI workflows live
- local commands that mirror CI
- next probes
Do not implement changes unless explicitly asked.
"""
4) The implementer
Implementers do the opposite of the explorer: one scoped task from plan.md, minimal change set, run lint/typecheck/tests, iterate until green, then stop. The structured report (what changed, commands run, result) gives the orchestrator the feedback it needs to move on.
I give this medium reasoning. The tasks are scoped enough that deep thinking would be wasted, but it still needs to reason about code changes.
model = "gpt-5.4"
model_reasoning_effort = "medium" # scoped tasks, moderate reasoning
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
Implement one scoped task from plan.md.
Do:
- minimal change set
- run relevant commands (lint/typecheck/tests)
- iterate until green
Report:
- what changed
- commands run
- result summary
"""
5) The CI runner
The CI runner mirrors CI locally and outputs pass/fail with root cause analysis. It is running deterministic commands and parsing output, so low reasoning is all it needs. Think of it as the metronome in our orchestra. It just keeps time.
model = "gpt-5.3-codex-spark" # Spark for speed; use gpt-5.4 on Plus
model_reasoning_effort = "low" # deterministic commands, simple parsing
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
Mirror CI locally with the smallest deterministic command set.
Output:
1) Commands run
2) Pass/fail summary
3) Failure analysis (most likely root cause)
4) Smallest next fix
"""
6) The reviewer
The reviewer performs PR-grade review: must-fix issues with file/line refs and a ship-or-block verdict. This is where you want the model to think deeply, not cut corners. Subtle bugs need deep analysis to surface, so I give it xhigh reasoning.
model = "gpt-5.4"
model_reasoning_effort = "xhigh" # deep analysis catches subtle bugs
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
PR-grade review.
Output:
1) Must-fix issues (file/line refs if possible)
2) Regression risks and edge cases
3) Missing/weak tests
4) Risk assessment (ship or block)
Keep it short and actionable.
"""
7) Security review: local role or Codex Security
Security review can now be handled in two ways: a local security-review role inside your multi-agent loop, or Codex Security for a dedicated vulnerability-finding and patching workflow. I still like a local security reviewer when I want everything to stay inside one repo loop, but I would no longer treat it as a default role in every setup. If security is a major bottleneck, Codex Security is now the more direct path.
# Optional — uncomment in config.toml to enable.
# For dedicated security scanning, consider Codex Security instead.
model = "gpt-5.4"
model_reasoning_effort = "high"
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
Security review only.
Look for:
- secrets/credentials leakage
- injection risks
- authn/authz mistakes
- SSRF/unsafe fetch
- sensitive logging
Output:
- Findings (severity + evidence)
- Fix recommendations
- Verification steps
"""
8) QA test author
The QA test author fills gaps: adds tests for changed behavior, writes minimal repro steps for bugs, verifies fixes with commands plus expected output. Edge-case thinking improves coverage, so I give it high reasoning.
model = "gpt-5.4"
model_reasoning_effort = "high" # edge-case reasoning for good coverage
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
Tests and validation.
Do:
- add/strengthen tests for changed behavior
- write minimal repro steps for bugs
- verify fix with commands + expected output
Output:
- tests changed
- commands run
- confidence + remaining risks
"""
9) Release manager
The release manager handles release notes, rollout steps, rollback plan, and a pre-deploy checklist. The output is structured rather than analytical, so medium reasoning does the job.
You do not need all these roles on day one. Start with orchestrator, implementer, ci_runner, and reviewer. Add explorer, QA, release, or a local security reviewer only when they solve a real bottleneck. If you need deeper security coverage, Codex Security is now a strong alternative to maintaining that role yourself.
model = "gpt-5.4"
model_reasoning_effort = "medium" # structured output, not analytical
approval_policy = "never"
sandbox_mode = "danger-full-access"
developer_instructions = """
Release manager.
Output:
1) Release notes (user-facing + internal)
2) Rollout steps
3) Rollback plan
4) Pre-deploy checklist (migrations, env vars, flags, monitoring)
"""
10) Verify and choose your model tier
A quick ls and TOML parse confirm codex can read everything. The first-run prompt should tell the orchestrator to execute one bounded batch, run all three gates, update plan.md, and avoid expanding scope.
# Quick verification — make sure codex can parse everything
ls -la ~/.codex/agents
python -c "import tomllib, pathlib; \
tomllib.loads(pathlib.Path.home() \
.joinpath('.codex','config.toml').read_text())"
# ─── Pro vs Plus model note ───────────────
# GPT-5.4 is now the default for most roles.
#
# Pro subscribers can use Spark for fast roles:
# model = "gpt-5.3-codex-spark"
# → use for explorer, ci_runner
#
# Plus subscribers:
# model = "gpt-5.4"
# → use for all roles
#
# The main decision is GPT-5.4 by default,
# reserving Spark for fast, read-heavy, or
# deterministic roles.
Quick CLI updates worth knowing
- Fast mode is now enabled by default in newer Codex CLI builds.
- The TUI now reflects newer model selection changes more clearly.
- @plugin mentions let you pull plugin context into the turn directly.
- codex resume is more reliable because git context and enabled apps are preserved.
Benefit one: reasoning effort, spend tokens where they matter
So to recap, here is what I have chosen for my agents and the reason behind each choice. This is not a perfect split, so experiment with it yourself. The principle is simple: roles that decide (orchestrator, reviewer) get the highest reasoning. Roles that execute (explorer, ci_runner) get the lowest. Roles that create (implementer, qa, release) sit somewhere in between.
Model choice has shifted since I first wrote this. GPT-5.4 is now the default recommendation for most Codex work, especially for planning, review, and long-horizon tasks, while Spark still makes sense when low latency matters more than maximum depth, particularly for lighter execution roles. Higher reasoning effort still costs more time and tokens, so the principle has not changed: spend depth on planning, review, and security-sensitive work, and keep execution roles cheap.
Benefit two: context management with MCP and skills
Every MCP server you enable adds tool context to your messages, so keep your global set lean. That context adds up fast, especially if you have a dozen MCPs attached.
With multi-agent mode, role-specific config files still matter, but newer Codex CLI releases also do a better job of surfacing enabled plugin context automatically, and @plugin mentions make targeted context routing easier. The explorer might need file search and docs MCPs. The CI runner just needs terminal access. The reviewer needs nothing beyond the code itself. Each agent only carries the tools relevant to its job.
Instead of every agent paying the context cost of every tool, you scope the overhead to where it matters. Less noise, faster routing, cheaper runs. That means the goal is no longer to stuff every tool into every role, but to keep the default environment lean and pull in extra context only where it helps.
The same principle applies to skills. Skills work similarly to MCPs but inject prompt-level instructions rather than tool definitions. They use progressive disclosure, loading metadata first and full instructions only when chosen. I did not include skills in my config above, but I would highly recommend adding relevant ones to your agent roles. For example, the reviewer could benefit from Vercel React Best Practices or Next.js Best Practices. The implementer could use Test-Driven Development. Browse what is available at skills.sh and pick the ones that match your stack.
Grab the starter pack
Rather than copying each TOML file individually, here is a setup script that creates the full config in one go. Pick your membership tier and run it.
The takeaway
Parallel where it helps. Strict gates where it matters. A single plan to prevent drift. max_threads = 12 as your starting ceiling.
Codex is already being built and shipped with an agent harness and parallel orchestration patterns. If you want the deep dive, read OpenAI's posts on the Codex agent loop and the Codex harness. The way we write software is shifting, and understanding multi-agent orchestration now puts you ahead of the curve.
Multi-agent orchestration is also no longer just a terminal pattern. The Codex app is now available on Windows as well, which makes the "agent command center" idea broader than a CLI-only workflow.
Take this implementation as an experiment. Tweak the reasoning levels, swap models, add roles, remove roles. But whatever you do, just experiment.
References: Codex changelog · Codex Security · Introducing the Codex app