The Codex Council playbook.

How it works #

The council borrows the LLM Council pattern and runs it inside Codex. One request becomes four stages.

01

First opinions

Up to six reviewers answer independently, in parallel, before seeing each other's work. No anchoring on whoever spoke first.

02

Anonymous review

Outputs lose their authorship and become Candidate A–F. Reviewers rank and score them on a rubric — not on who sounds senior.

03

Aggregation

Scores are combined deterministically (locally, when reviewer JSON exists). Dissent and blockers are kept, not averaged away.

04

Chairman synthesis

The main agent writes the final call from saved outputs — winner, dissent, blockers, verification. It doesn't just announce the top score.

The one thing to remember The diversity here comes from isolated role prompts and anonymous review, not from different model vendors. That's enough to break single-pass anchoring and surface dissent — it is not the same as a panel of independent labs. The wiki says this plainly because the plugin does.

Modes & budget #

Pick the smallest mode that can catch the risk. Escalate per blocker, not by default.

You're doing	Reach for	Why
A small, reversible call	`fast`	A single Chairman pass. No full council, no full cost.
An implementation or architecture choice	`standard`	Six roles + a scoring pass. The default for real decisions.
Security, migration, data loss, a close tie	`deep`	Adds rubric, bias, and implementation reviewers.
Frontend / UI / browser behavior	`--frontend-review`	Adds Leonardo (UX) and Bob (browser evidence).
Plugin or skill usability	`--type skill --skill-review`	A cheap three-lens panel, no reviewers by default.
A low-risk review you explicitly want scoped	`--router auto --panel auto`	Advisory routing only; privacy, security, data loss, and browser risk force full coverage.

Token budgets

The --token-budget flag defaults to compact. Escalate the budget for risk, not for comfort.

compact — tight outputs, smallest reference set. The default for normal decisions.
balanced — more detail, but only on the risky parts. For real tradeoffs and ambiguity.
expanded — full evidence and audit trail. Blocked until you explicitly confirm, because it burns usage.

Before any Standard/Deep run The council shows a token estimate and waits for you to accept it. "Use Codex Council" is a request, not permission to spend — the preflight gate is mandatory, and expanded needs a separate yes.

The six roles #

Each reviewer guards one concern. Knowing their lenses helps you aim a prompt — or read why one of them blocked you.

Ada Lovelace

Principal Architect. Boundaries, integration, maintainability, migration risk. Lean on Ada for "which design ages well."

Grace Hopper

Reliability Engineer. Failure modes, tests, rollback, observability. Lean on Grace for "what happens when it breaks."

Hypatia

Security & Governance. Secrets, permissions, privacy, provenance, policy. Lean on Hypatia for "who can abuse this."

Florence Nightingale

Product & Operator. Workflow fit, docs, adoption, operational friction. Lean on Florence for "will anyone actually use it."

Alan Turing

Contrarian Red Team. Hidden assumptions, simpler alternatives, overengineering. Lean on Turing for "what if we don't build this."

Seymour Cray

Performance Engineer. Latency, throughput, memory, cost, scaling. He treats a perf claim as unverified without a workload, baseline, and measurement plan.

You don't pick roles by hand in a normal run — all six answer. But naming a concern in your prompt ("focus on rollback and migration risk") tells the Chairman which lenses to weight.

The frontend gate: Leonardo & Bob #

Frontend work gets a stricter path, because UI bugs hide in behavior. Turn it on with --frontend-review or --type frontend.

Leonardo — the critic

A brutally honest UX/UI reviewer. He doesn't add a scoring dimension; his findings move clarity, relevance, completeness, and accuracy. A Leonardo blocker lowers final confidence even when the technical scores are high.

He hunts hidden affordances, weak hierarchy, mobile failures, accessibility gaps, and "clever" flows that slow real users down.

Bob — the evidence runner

Bob drives a real browser against a runnable app or prototype. He is not a council member and never votes.

He checks the things that only show up live: focus moving into a modal, background staying un-clickable, Escape/backdrop behavior, scroll lock, mobile close buttons. A forced click is never proof — every click needs an observed state change.

The hard rule Nothing is called "verified" until Bob, or equivalent browser evidence, actually ran the path. Untested UI claims are reported as "not verified," and a failed Bob case becomes a blocker when it invalidates the recommendation.

Competency packs #

Internal skill packs for when you don't want to wire in external skills. Name them in a prompt to bias the run.

Implementation Strategist

Architecture-to-plan: sequencing, ownership, migration path, smallest useful scope.

Test & Regression Sentinel

Missing tests, edge cases, rollback, smoke checks. Use when correctness matters.

Performance Impact Analyst

Workload assumptions, baseline vs change, measurement gaps, degradation under contention.

Token & Context Optimizer

Mode choice, loaded references, context pruning, output caps — when a run risks getting expensive.

Governance & Audit Officer

Sensitive data, license status, audit trail, redaction. For privacy and distribution calls.

Operator UX Reviewer

Trigger clarity, default prompts, concise output — when adoption and daily workflow matter.

Frontend UX Critic

Counterintuitive flows, hidden affordances, weak hierarchy, mobile failure, decorative bloat.

Contrarian Simplifier

What not to build, cheaper alternatives, hidden assumptions, invalidation conditions.

How to prompt the council #

A council prompt isn't a question — it's a decision to pressure-test. Give it something to argue about.

The anatomy of a good prompt

Template

[Standard|Deep|Frontend] Council: review <the specific decision>.
Context: <the diff, files, or links it should look at>.
Constraints: <hard limits — compatibility, deadline, budget>.
Return: blockers, dissent, confidence, the safest v1 scope,
and the exact verification before approval.

Do

Name the mode. "Standard Council", "Deep Council", "Frontend Council" — it sets cost and scrutiny up front.
Give it a decision, not a vibe. "Should we split this service in two?" beats "thoughts on the backend?"
Point at evidence. Paste the diff, name the files, link the spec. The council reasons over what you give it.
State the hard constraints. Backwards compatibility, a deadline, a token budget, a platform limit — these change the answer.
Ask for what you need to act: blockers, dissent, confidence, and exact verification.

Don't

Don't ask it to "just confirm." The council preserves dissent on purpose. If you want a rubber stamp, use plain Codex.
Don't skip the constraints and then be surprised it solved a different problem.
Don't claim a UI works without a frontend run — say "have Bob verify it in the browser."
Don't reach for expanded out of habit. Escalate one blocker, not the whole session.

Meta vs. run

Explaining the council is not running it. "How does the council work?" gets a one-line answer, no dispatch. "Run a council on this" dispatches agents. If your intent is ambiguous, the council asks one line before spending anything.

Use-case cookbook #

Sixteen situations, each with the mode to use, the reviewers to lean on, and a prompt you can paste as-is.

1. Architecture decision

When picking a design you'll live with
Mode standard
Lean on Ada, Turing, Seymour

Prompt

Standard Council: review this architecture decision —
<option A vs option B>. Preserve dissent.
Return module boundaries, migration risk, the rollback plan,
and the implementation verification before we commit.

2. Pull request / risky diff

When a change is big or load-bearing
Mode standard
Lean on Grace, Turing, Seymour

Prompt

Council review this diff. Focus on regressions,
missing tests, compatibility breaks, and performance impact.
Return blockers vs refinements and the exact tests to run.

3. Bug & regression root cause

When a fix needs to be the right fix
Mode fast, or standard if root cause is unclear
Lean on Grace, Turing

Prompt

Council review this bugfix plan for root-cause evidence
and regression risk. Is this the cause or a symptom?
Return the regression tests we're missing.

4. Performance-sensitive change

When latency, memory, or cost can move
Mode standard (deep if it can cause an outage)
Lean on Seymour

Prompt

Council review this change for latency, throughput, memory,
and cost. Treat any perf claim as unverified without a
workload, baseline, and measurement plan. Return the
benchmark we need and a safer alternative.

5. Database migration / schema change

When data is on the line
Mode deep
Lean on Hypatia, Grace, Ada

Prompt

Deep Council: review this migration for data-loss risk,
backwards compatibility, and rollback. Assume it runs against
production data. Return blockers, the dry-run plan, and the
exact rollback signal.

6. Security & secrets review

When auth, permissions, or secrets are touched
Mode deep
Lean on Hypatia, Turing

Prompt

Deep Council: security review this change. Look for leaked
secrets, broken authz, privacy exposure, and unsafe defaults.
Red-team it. Return blockers and the verification before merge.

7. API design / breaking change

When shipping a contract others depend on
Mode standard
Lean on Ada, Florence, Turing

Prompt

Standard Council: review this API design. Is it a breaking
change? Check naming, versioning, error shapes, and the
migration story for existing callers. Return the deprecation path.

8. Refactor vs. rewrite

When tempted to start over
Mode standard
Lean on Turing, Ada

Prompt

Council jury: refactor or rewrite this module? Give a
go/no-go with confidence, the conditions that would flip the
decision, and the smallest first step either way.

9. Dependency or framework upgrade

When a major version is in play
Mode standard
Lean on Grace, Hypatia, Seymour

Prompt

Council review this dependency upgrade. Check breaking
changes, security advisories, transitive risk, and bundle/perf
impact. Return blockers and the smoke tests to run after.

10. Frontend modal / overlay

When dialogs, drawers, popovers, focus
Mode standard --frontend-review
Lean on Leonardo, Bob

Prompt

Frontend Council: review this modal flow. Activate Leonardo
for brutal UX critique and have Bob verify in the browser:
focus moves in, background isn't clickable, Escape and backdrop
close, mobile shows the close button. Don't claim verified
without Bob's evidence.

11. Accessibility audit

When keyboard, screen-reader, contrast
Mode standard --frontend-review
Lean on Leonardo, Bob

Prompt

Frontend Council: accessibility review this page. Check
keyboard path, focus order, labels, and contrast. Have Bob
walk the keyboard-only flow and report what a screen-reader
user would miss.

12. Release go/no-go

When deciding to ship
Mode deep for irreversible work
Lean on Grace, Hypatia

Prompt

Deep Council: release readiness review. Give a go/no-go,
required blockers, preserved dissent, confidence, the
verification evidence, and the rollback signal.

13. Plugin / skill usability

When building tools others run
Mode --type skill --skill-review
Lean on skill engineer, UX-for-tools, non-expert

Prompt

Council review this skill for discoverability, trigger
clarity, and non-expert adoption. Will someone find it, use it,
and not trip over the existing rules? Return the smallest
wording and UX fixes.

14. Incident postmortem (RCA)

When learning from an outage
Mode standard
Lean on Grace, Turing, Hypatia

Prompt

Council review this incident timeline. Separate root cause
from contributing factors, challenge the convenient story, and
return the prevention items ranked by leverage, not blame.

15. Cost / token optimization

When a run or feature is getting expensive
Mode fast or standard (compact)
Lean on Seymour, Turing

Prompt

Council review this for cost and token use, compact output.
Where are we paying for detail we don't need? Preserve blockers,
dissent, and verification. Return the safe cuts.

16. Data privacy / sharing feature

When shipping anything that exports or shares data
Mode deep
Lean on Hypatia, Turing, Florence

Prompt

Deep Council: review this sharing/export feature. Classify the
data, find disclosure paths, and demand fail-closed redaction.
Return the safest v1 and what needs a threat model before it ships.

Reading the output #

A council answer leads with the decision and keeps the disagreement visible. Know what each part means.

Recommendation — the direct call, or a blocked state. Read this first.
Confidence — high (clear winner, no blocker, concrete verification), medium (dissent or missing confirmation), low (close tie, weak evidence), or blocked (an unresolved safety, data-loss, security, or correctness issue).
Blockers vs. refinements — blockers must be fixed before approval; refinements are nice-to-haves. Don't conflate them.
Preserved dissent — the minority view that didn't win. It's kept on purpose; it's often where the real risk is.
Verification — the exact tests, commands, or browser cases to run. Council consensus is not proof — this is how you make it real.
Frontend evidence — when active, Leonardo's UX judgment is reported separately from Bob's browser-observed pass/fail.

Scores don't decide The Chairman synthesizes from the winner, dissent, blockers, and evidence — it does not simply crown the highest number. A high score with a live blocker is still blocked.

Tuning roles (alters) #

An alter is a local, bounded tweak to how one reviewer behaves. It's advisory — it can sharpen focus, never switch off a guardrail. Full guide →

Tunable: Ada, Grace, Hypatia, Florence, Turing, Seymour, Leonardo. Off-limits: Bob (he's an evidence runner, not a reviewer). Always preview before you save.

Preview, then save

python3 scripts/codex_council.py alters preview --role leonardo \
  --tone "more brutally honest about confusing interaction design" \
  --domain-focus "mobile UI, modal accessibility, click-through regressions"

Good: "Be more direct about API compatibility and migration cost."
Won't work: "Always approve my implementation" — tuning can't remove blockers, dissent, verification, or anonymization.

Token discipline #

Cheap by default, expensive only where risk earns it.

Estimate first. Every Standard/Deep run shows a heuristic token range and waits for you to accept it.
Escalate per blocker. A compact run can expand one blocker without going full expanded.
Cut output before input. Shorter answers usually save cost without weakening evidence — as long as blockers stay.
Stable text first, project context last. Static prefixes hit the prompt cache; dynamic context comes after.
Estimates aren't a bill. They're local heuristics, not your real Codex usage or remaining quota — check Codex Settings > Usage for that.

Decision Runtime #

A 1.0 shadow lab for inspecting structured decisions without changing the decision path.

Legacy-authoritative. The saved Council artifacts and Chairman verdict remain the source of truth.
Experimental and opt-in. A shadow is created only after an explicit cells command.
Paired by design. Decision Cells are compared with a simpler frontier.jsonl representation from the same input.
Fail-closed. Unknown, corrupt, or incomplete state is ignored or quarantined; it never silently influences a verdict.
No saving claims yet. Token, latency, and signal-density targets stay hypotheses until paired replay measures them.

Commands cover projection, patch validation, impact planning, health, recovery, rollback, purge, replay, and fault testing. The Runtime overview presents the product value and operating model.

Governance & privacy #

Run this checklist before Standard/Deep work that touches sensitive data, subagents, or anything you'll publish.

Classify data sensitivity: public, internal, confidential, secret.
Check and redact secrets, tokens, and customer data before sharing files with agents.
List the files the agents can see; note any license or provenance concerns.
Know the verification commands before synthesis, and the final state you're aiming for: approved, approved-with-risk, or blocked.
Use Deep mode for security, privacy, data loss, migrations, irreversible edits, or publication.

Runtime artifacts (sessions, prompts, stats, history, and opt-in Cell shadows) live in plugin-local .codex-council/ and stay gitignored — they don't pollute your repo. Invocation logs are compact and never store prompt text, secrets, topics, or absolute paths.

CLI quick reference #

The helper script is stdlib-only. These are the commands you'll actually use.

# Set the local profile used for estimates (stored with consent).
python3 scripts/codex_council.py profile --plan Pro --model gpt-5.5 --reasoning xhigh

# Estimate before you start, then accept the range in chat.
python3 scripts/codex_council.py estimate --topic "Architecture Review" --mode standard --token-budget compact

# Is this a real run, or just talking about the council?
python3 scripts/codex_council.py classify-invocation --text "explain how council works"

# Scaffold a traceable session after accepting the estimate.
python3 scripts/codex_council.py init --topic "Architecture Review" --root . --mode standard --token-budget compact --confirm-estimate

# Preview opt-in routing; hard-risk flags still force full coverage.
python3 scripts/codex_council.py estimate --topic "Docs cleanup" --router auto --panel auto --json

# Frontend session (Leonardo + Bob); skill-review session.
python3 scripts/codex_council.py init --topic "Modal Review" --root . --mode standard --frontend-review --confirm-estimate
python3 scripts/codex_council.py init --topic "Skill Review" --root . --type skill --skill-review --confirm-estimate

# expanded must be confirmed explicitly.
python3 scripts/codex_council.py init --topic "Migration" --root . --mode deep --token-budget expanded --confirm-expanded

# Aggregate reviewer scores, validate, and close with stats.
python3 scripts/codex_council.py score --input reviews.json
python3 scripts/codex_council.py validate-session --session <session-dir>
python3 scripts/codex_council.py stats --session <session-dir> --write

# Compact repeated context; inspect one session and the local history.
python3 scripts/codex_council.py compile-context --topic "Review handoff" --constraint "No public links" --json
python3 scripts/codex_council.py doctor --session <session-dir>
python3 scripts/codex_council.py dashboard

# Experimental, opt-in Decision Runtime commands.
python3 scripts/codex_council.py cells project --help
python3 scripts/codex_council.py cells apply --help
python3 scripts/codex_council.py cells plan --help
python3 scripts/codex_council.py cells doctor --help
python3 scripts/codex_council.py cells recover --help
python3 scripts/codex_council.py cells rollback --help
python3 scripts/codex_council.py cells purge --help
python3 scripts/codex_council.py cells replay --help
python3 scripts/codex_council.py cells fault-test --help

# Check for a newer release.
python3 scripts/codex_council.py check-update

Common mistakes #

Using the council for everything. Small, reversible, checkable work is faster with plain Codex.
Treating consensus as proof. It's advisory. Run the verification.
Reading it as multi-vendor diversity. It's one model in different roles — useful, but not a panel of labs.
Declaring a UI "done" with no browser evidence. Without Bob, it's "not verified."
Defaulting to expanded. Expand one blocker instead.
Asking for a rubber stamp. The council is built to disagree; that's the point.

FAQ #

How is this different from prompting Codex twice?

Two passes from one prompt tend to agree with themselves. The council forces independent first opinions, hides authorship, ranks on a rubric, and keeps dissent — so the second look actually disagrees with the first instead of polishing it.

Does it call other model providers?

No. It runs entirely inside Codex (plus optional Codex subagents). The "diversity" is role isolation and anonymous review, and the site says so plainly.

Will it spend a lot of tokens?

Not without asking. Standard/Deep runs show an estimate and wait for acceptance, the default budget is compact, and expanded is blocked until you confirm.

Do I have to use the CLI?

No. Ask in chat ("Standard Council: review…") for everyday use. The CLI is for traceable sessions, estimates, scoring, and stats you want to keep.

Can I make a reviewer harsher?

Yes — tune its alter (preview first). Tuning is advisory and can never remove a council guardrail. Bob can't be tuned.

Does Decision Runtime change the verdict?

No. In 1.0 it is an experimental shadow projection. The legacy session and Chairman verdict remain authoritative, and corrupt or unknown shadow state is ignored or quarantined.

Glossary #

Chairman — the main Codex agent that writes the final synthesis from saved outputs.
Candidate A–F — anonymized member outputs during review, so ranking judges the work, not the author.
Blocker — a must-fix issue. Approval is blocked until it's resolved.
Dissent — a preserved minority view that didn't win the ranking.
Preflight — the mandatory pre-run estimate and acceptance gate.
Alter — a bounded, advisory, local tweak to one reviewer's behavior.
Evidence runner — Bob: drives the browser, reports pass/fail, never votes.
coverage: partial — a stats flag meaning some prompts or outputs were missing, so the post-run estimate is incomplete.
Decision Cell — an experimental structured shadow record for a claim, option, evidence item, risk, dissent, verification, or decision.
frontier — the simpler append-only baseline generated from the same legacy input as the Cells.
generation / HEAD — an immutable shadow snapshot and the pointer to the last selected valid snapshot.
quarantine — an explicit state for unknown, corrupt, incomplete, or policy-invalid shadow data that must not be used.