Wiki

The Codex Council playbook.

Everything in one place: how the council actually works, who the reviewers are, how to write a prompt that gets a useful answer, and a cookbook of paste-ready prompts for the situations you'll actually hit. Every prompt has a copy button.

How it works #

The council borrows the LLM Council pattern and runs it inside Codex. One request becomes four stages.

01

First opinions

Up to six reviewers answer independently, in parallel, before seeing each other's work. No anchoring on whoever spoke first.

02

Anonymous review

Outputs lose their authorship and become Candidate A–F. Reviewers rank and score them on a rubric — not on who sounds senior.

03

Aggregation

Scores are combined deterministically (locally, when reviewer JSON exists). Dissent and blockers are kept, not averaged away.

04

Chairman synthesis

The main agent writes the final call from saved outputs — winner, dissent, blockers, verification. It doesn't just announce the top score.

The one thing to remember The diversity here comes from isolated role prompts and anonymous review, not from different model vendors. That's enough to break single-pass anchoring and surface dissent — it is not the same as a panel of independent labs. The wiki says this plainly because the plugin does.

Modes & budget #

Pick the smallest mode that can catch the risk. Escalate per blocker, not by default.

You're doingReach forWhy
A small, reversible callfastA single Chairman pass. No full council, no full cost.
An implementation or architecture choicestandardSix roles + a scoring pass. The default for real decisions.
Security, migration, data loss, a close tiedeepAdds rubric, bias, and implementation reviewers.
Frontend / UI / browser behavior--frontend-reviewAdds Leonardo (UX) and Bob (browser evidence).
Plugin or skill usability--type skill --skill-reviewA cheap three-lens panel, no reviewers by default.

Token budgets

The --token-budget flag defaults to compact. Escalate the budget for risk, not for comfort.

  • compact — tight outputs, smallest reference set. The default for normal decisions.
  • balanced — more detail, but only on the risky parts. For real tradeoffs and ambiguity.
  • expanded — full evidence and audit trail. Blocked until you explicitly confirm, because it burns usage.
Before any Standard/Deep run The council shows a token estimate and waits for you to accept it. "Use Codex Council" is a request, not permission to spend — the preflight gate is mandatory, and expanded needs a separate yes.

The six roles #

Each reviewer guards one concern. Knowing their lenses helps you aim a prompt — or read why one of them blocked you.

Ada Lovelace

Principal Architect. Boundaries, integration, maintainability, migration risk. Lean on Ada for "which design ages well."

Grace Hopper

Reliability Engineer. Failure modes, tests, rollback, observability. Lean on Grace for "what happens when it breaks."

Hypatia

Security & Governance. Secrets, permissions, privacy, provenance, policy. Lean on Hypatia for "who can abuse this."

Florence Nightingale

Product & Operator. Workflow fit, docs, adoption, operational friction. Lean on Florence for "will anyone actually use it."

Alan Turing

Contrarian Red Team. Hidden assumptions, simpler alternatives, overengineering. Lean on Turing for "what if we don't build this."

Seymour Cray

Performance Engineer. Latency, throughput, memory, cost, scaling. He treats a perf claim as unverified without a workload, baseline, and measurement plan.

You don't pick roles by hand in a normal run — all six answer. But naming a concern in your prompt ("focus on rollback and migration risk") tells the Chairman which lenses to weight.

The frontend gate: Leonardo & Bob #

Frontend work gets a stricter path, because UI bugs hide in behavior. Turn it on with --frontend-review or --type frontend.

Leonardo — the critic

A brutally honest UX/UI reviewer. He doesn't add a scoring dimension; his findings move clarity, relevance, completeness, and accuracy. A Leonardo blocker lowers final confidence even when the technical scores are high.

He hunts hidden affordances, weak hierarchy, mobile failures, accessibility gaps, and "clever" flows that slow real users down.

Bob — the evidence runner

Bob drives a real browser against a runnable app or prototype. He is not a council member and never votes.

He checks the things that only show up live: focus moving into a modal, background staying un-clickable, Escape/backdrop behavior, scroll lock, mobile close buttons. A forced click is never proof — every click needs an observed state change.

The hard rule Nothing is called "verified" until Bob, or equivalent browser evidence, actually ran the path. Untested UI claims are reported as "not verified," and a failed Bob case becomes a blocker when it invalidates the recommendation.

Competency packs #

Internal skill packs for when you don't want to wire in external skills. Name them in a prompt to bias the run.

Implementation Strategist

Architecture-to-plan: sequencing, ownership, migration path, smallest useful scope.

Test & Regression Sentinel

Missing tests, edge cases, rollback, smoke checks. Use when correctness matters.

Performance Impact Analyst

Workload assumptions, baseline vs change, measurement gaps, degradation under contention.

Token & Context Optimizer

Mode choice, loaded references, context pruning, output caps — when a run risks getting expensive.

Governance & Audit Officer

Sensitive data, license status, audit trail, redaction. For privacy and distribution calls.

Operator UX Reviewer

Trigger clarity, default prompts, concise output — when adoption and daily workflow matter.

Frontend UX Critic

Counterintuitive flows, hidden affordances, weak hierarchy, mobile failure, decorative bloat.

Contrarian Simplifier

What not to build, cheaper alternatives, hidden assumptions, invalidation conditions.

How to prompt the council #

A council prompt isn't a question — it's a decision to pressure-test. Give it something to argue about.

The anatomy of a good prompt

Template [Standard|Deep|Frontend] Council: review <the specific decision>. Context: <the diff, files, or links it should look at>. Constraints: <hard limits — compatibility, deadline, budget>. Return: blockers, dissent, confidence, the safest v1 scope, and the exact verification before approval.

Do

  • Name the mode. "Standard Council", "Deep Council", "Frontend Council" — it sets cost and scrutiny up front.
  • Give it a decision, not a vibe. "Should we split this service in two?" beats "thoughts on the backend?"
  • Point at evidence. Paste the diff, name the files, link the spec. The council reasons over what you give it.
  • State the hard constraints. Backwards compatibility, a deadline, a token budget, a platform limit — these change the answer.
  • Ask for what you need to act: blockers, dissent, confidence, and exact verification.

Don't

  • Don't ask it to "just confirm." The council preserves dissent on purpose. If you want a rubber stamp, use plain Codex.
  • Don't skip the constraints and then be surprised it solved a different problem.
  • Don't claim a UI works without a frontend run — say "have Bob verify it in the browser."
  • Don't reach for expanded out of habit. Escalate one blocker, not the whole session.

Meta vs. run

Explaining the council is not running it. "How does the council work?" gets a one-line answer, no dispatch. "Run a council on this" dispatches agents. If your intent is ambiguous, the council asks one line before spending anything.

Use-case cookbook #

Sixteen situations, each with the mode to use, the reviewers to lean on, and a prompt you can paste as-is.

1. Architecture decision

  • When picking a design you'll live with
  • Mode standard
  • Lean on Ada, Turing, Seymour
PromptStandard Council: review this architecture decision — <option A vs option B>. Preserve dissent. Return module boundaries, migration risk, the rollback plan, and the implementation verification before we commit.

2. Pull request / risky diff

  • When a change is big or load-bearing
  • Mode standard
  • Lean on Grace, Turing, Seymour
PromptCouncil review this diff. Focus on regressions, missing tests, compatibility breaks, and performance impact. Return blockers vs refinements and the exact tests to run.

3. Bug & regression root cause

  • When a fix needs to be the right fix
  • Mode fast, or standard if root cause is unclear
  • Lean on Grace, Turing
PromptCouncil review this bugfix plan for root-cause evidence and regression risk. Is this the cause or a symptom? Return the regression tests we're missing.

4. Performance-sensitive change

  • When latency, memory, or cost can move
  • Mode standard (deep if it can cause an outage)
  • Lean on Seymour
PromptCouncil review this change for latency, throughput, memory, and cost. Treat any perf claim as unverified without a workload, baseline, and measurement plan. Return the benchmark we need and a safer alternative.

5. Database migration / schema change

  • When data is on the line
  • Mode deep
  • Lean on Hypatia, Grace, Ada
PromptDeep Council: review this migration for data-loss risk, backwards compatibility, and rollback. Assume it runs against production data. Return blockers, the dry-run plan, and the exact rollback signal.

6. Security & secrets review

  • When auth, permissions, or secrets are touched
  • Mode deep
  • Lean on Hypatia, Turing
PromptDeep Council: security review this change. Look for leaked secrets, broken authz, privacy exposure, and unsafe defaults. Red-team it. Return blockers and the verification before merge.

7. API design / breaking change

  • When shipping a contract others depend on
  • Mode standard
  • Lean on Ada, Florence, Turing
PromptStandard Council: review this API design. Is it a breaking change? Check naming, versioning, error shapes, and the migration story for existing callers. Return the deprecation path.

8. Refactor vs. rewrite

  • When tempted to start over
  • Mode standard
  • Lean on Turing, Ada
PromptCouncil jury: refactor or rewrite this module? Give a go/no-go with confidence, the conditions that would flip the decision, and the smallest first step either way.

9. Dependency or framework upgrade

  • When a major version is in play
  • Mode standard
  • Lean on Grace, Hypatia, Seymour
PromptCouncil review this dependency upgrade. Check breaking changes, security advisories, transitive risk, and bundle/perf impact. Return blockers and the smoke tests to run after.

10. Frontend modal / overlay

  • When dialogs, drawers, popovers, focus
  • Mode standard --frontend-review
  • Lean on Leonardo, Bob
PromptFrontend Council: review this modal flow. Activate Leonardo for brutal UX critique and have Bob verify in the browser: focus moves in, background isn't clickable, Escape and backdrop close, mobile shows the close button. Don't claim verified without Bob's evidence.

11. Accessibility audit

  • When keyboard, screen-reader, contrast
  • Mode standard --frontend-review
  • Lean on Leonardo, Bob
PromptFrontend Council: accessibility review this page. Check keyboard path, focus order, labels, and contrast. Have Bob walk the keyboard-only flow and report what a screen-reader user would miss.

12. Release go/no-go

  • When deciding to ship
  • Mode deep for irreversible work
  • Lean on Grace, Hypatia
PromptDeep Council: release readiness review. Give a go/no-go, required blockers, preserved dissent, confidence, the verification evidence, and the rollback signal.

13. Plugin / skill usability

  • When building tools others run
  • Mode --type skill --skill-review
  • Lean on skill engineer, UX-for-tools, non-expert
PromptCouncil review this skill for discoverability, trigger clarity, and non-expert adoption. Will someone find it, use it, and not trip over the existing rules? Return the smallest wording and UX fixes.

14. Incident postmortem (RCA)

  • When learning from an outage
  • Mode standard
  • Lean on Grace, Turing, Hypatia
PromptCouncil review this incident timeline. Separate root cause from contributing factors, challenge the convenient story, and return the prevention items ranked by leverage, not blame.

15. Cost / token optimization

  • When a run or feature is getting expensive
  • Mode fast or standard (compact)
  • Lean on Seymour, Turing
PromptCouncil review this for cost and token use, compact output. Where are we paying for detail we don't need? Preserve blockers, dissent, and verification. Return the safe cuts.

16. Data privacy / sharing feature

  • When shipping anything that exports or shares data
  • Mode deep
  • Lean on Hypatia, Turing, Florence
PromptDeep Council: review this sharing/export feature. Classify the data, find disclosure paths, and demand fail-closed redaction. Return the safest v1 and what needs a threat model before it ships.

Reading the output #

A council answer leads with the decision and keeps the disagreement visible. Know what each part means.

  • Recommendation — the direct call, or a blocked state. Read this first.
  • Confidencehigh (clear winner, no blocker, concrete verification), medium (dissent or missing confirmation), low (close tie, weak evidence), or blocked (an unresolved safety, data-loss, security, or correctness issue).
  • Blockers vs. refinements — blockers must be fixed before approval; refinements are nice-to-haves. Don't conflate them.
  • Preserved dissent — the minority view that didn't win. It's kept on purpose; it's often where the real risk is.
  • Verification — the exact tests, commands, or browser cases to run. Council consensus is not proof — this is how you make it real.
  • Frontend evidence — when active, Leonardo's UX judgment is reported separately from Bob's browser-observed pass/fail.
Scores don't decide The Chairman synthesizes from the winner, dissent, blockers, and evidence — it does not simply crown the highest number. A high score with a live blocker is still blocked.

Tuning roles (alters) #

An alter is a local, bounded tweak to how one reviewer behaves. It's advisory — it can sharpen focus, never switch off a guardrail. Full guide →

Tunable: Ada, Grace, Hypatia, Florence, Turing, Seymour, Leonardo. Off-limits: Bob (he's an evidence runner, not a reviewer). Always preview before you save.

Preview, then savepython3 scripts/codex_council.py alters preview --role leonardo \ --tone "more brutally honest about confusing interaction design" \ --domain-focus "mobile UI, modal accessibility, click-through regressions"
  • Good: "Be more direct about API compatibility and migration cost."
  • Won't work: "Always approve my implementation" — tuning can't remove blockers, dissent, verification, or anonymization.

Token discipline #

Cheap by default, expensive only where risk earns it.

  • Estimate first. Every Standard/Deep run shows a heuristic token range and waits for you to accept it.
  • Escalate per blocker. A compact run can expand one blocker without going full expanded.
  • Cut output before input. Shorter answers usually save cost without weakening evidence — as long as blockers stay.
  • Stable text first, project context last. Static prefixes hit the prompt cache; dynamic context comes after.
  • Estimates aren't a bill. They're local heuristics, not your real Codex usage or remaining quota — check Codex Settings > Usage for that.

Governance & privacy #

Run this checklist before Standard/Deep work that touches sensitive data, subagents, or anything you'll publish.

  • Classify data sensitivity: public, internal, confidential, secret.
  • Check and redact secrets, tokens, and customer data before sharing files with agents.
  • List the files the agents can see; note any license or provenance concerns.
  • Know the verification commands before synthesis, and the final state you're aiming for: approved, approved-with-risk, or blocked.
  • Use Deep mode for security, privacy, data loss, migrations, irreversible edits, or publication.

Runtime artifacts (sessions, prompts, stats, history) live in plugin-local .codex-council/ and stay gitignored — they don't pollute your repo. Invocation logs are compact and never store prompt text, secrets, topics, or absolute paths.

CLI quick reference #

The helper script is stdlib-only. These are the commands you'll actually use.

# Set the local profile used for estimates (stored with consent).
python3 scripts/codex_council.py profile --plan Pro --model gpt-5.5 --reasoning xhigh

# Estimate before you start, then accept the range in chat.
python3 scripts/codex_council.py estimate --topic "Architecture Review" --mode standard --token-budget compact

# Is this a real run, or just talking about the council?
python3 scripts/codex_council.py classify-invocation --text "explain how council works"

# Scaffold a traceable session after accepting the estimate.
python3 scripts/codex_council.py init --topic "Architecture Review" --root . --mode standard --token-budget compact --confirm-estimate

# Frontend session (Leonardo + Bob); skill-review session.
python3 scripts/codex_council.py init --topic "Modal Review" --root . --mode standard --frontend-review --confirm-estimate
python3 scripts/codex_council.py init --topic "Skill Review" --root . --type skill --skill-review --confirm-estimate

# expanded must be confirmed explicitly.
python3 scripts/codex_council.py init --topic "Migration" --root . --mode deep --token-budget expanded --confirm-expanded

# Aggregate reviewer scores, validate, and close with stats.
python3 scripts/codex_council.py score --input reviews.json
python3 scripts/codex_council.py validate-session --session <session-dir>
python3 scripts/codex_council.py stats --session <session-dir> --write

# Check for a newer release.
python3 scripts/codex_council.py check-update

Common mistakes #

  • Using the council for everything. Small, reversible, checkable work is faster with plain Codex.
  • Treating consensus as proof. It's advisory. Run the verification.
  • Reading it as multi-vendor diversity. It's one model in different roles — useful, but not a panel of labs.
  • Declaring a UI "done" with no browser evidence. Without Bob, it's "not verified."
  • Defaulting to expanded. Expand one blocker instead.
  • Asking for a rubber stamp. The council is built to disagree; that's the point.

FAQ #

How is this different from prompting Codex twice?

Two passes from one prompt tend to agree with themselves. The council forces independent first opinions, hides authorship, ranks on a rubric, and keeps dissent — so the second look actually disagrees with the first instead of polishing it.

Does it call other model providers?

No. It runs entirely inside Codex (plus optional Codex subagents). The "diversity" is role isolation and anonymous review, and the site says so plainly.

Will it spend a lot of tokens?

Not without asking. Standard/Deep runs show an estimate and wait for acceptance, the default budget is compact, and expanded is blocked until you confirm.

Do I have to use the CLI?

No. Ask in chat ("Standard Council: review…") for everyday use. The CLI is for traceable sessions, estimates, scoring, and stats you want to keep.

Can I make a reviewer harsher?

Yes — tune its alter (preview first). Tuning is advisory and can never remove a council guardrail. Bob can't be tuned.

Glossary #

  • Chairman — the main Codex agent that writes the final synthesis from saved outputs.
  • Candidate A–F — anonymized member outputs during review, so ranking judges the work, not the author.
  • Blocker — a must-fix issue. Approval is blocked until it's resolved.
  • Dissent — a preserved minority view that didn't win the ranking.
  • Preflight — the mandatory pre-run estimate and acceptance gate.
  • Alter — a bounded, advisory, local tweak to one reviewer's behavior.
  • Evidence runner — Bob: drives the browser, reports pass/fail, never votes.
  • coverage: partial — a stats flag meaning some prompts or outputs were missing, so the post-run estimate is incomplete.