Local consultation pressure for Codex

Panda is not a panda.

It is a multi-agent system: independent model advisors inspect one Codex task, while Codex remains the editor and integrator. It aims review pressure at Codex's weak spots, where it matters.

Panda runs independent model consultations around a Codex task, extracts structured contracts, preserves useful disagreement, and lets Codex integrate only what survives local evidence and tests.

Set it up for Codex

Codex config-aware skill trigger
You

Review this implementation plan with Panda.

Codex

Using Panda to analyze the research and plan.

Default agent: with no saved profile, Panda starts one Codex reviewer, gpt-5.5 at medium reasoning.
Profile file: user-scoped, ~/.config/panda/preferences.json
Update via skill: "set Panda to use Kimi + GLM" -> --save-preferences
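
As a rough illustration of the fallback described above, the sketch below reads a user-scoped profile and returns the default reviewer when no usable profile exists. The file path, the default model, and the reasoning level come from the text; the JSON layout (an "advisors" list with "backend", "model", and "reasoning" keys) and the function name load_advisors are assumptions made for this example, not Panda's documented schema.

```python
import json
from pathlib import Path

# Default stated in the text: one Codex reviewer, gpt-5.5 at medium reasoning.
# The dictionary keys are hypothetical, not Panda's real schema.
DEFAULT_ADVISORS = [{"backend": "codex", "model": "gpt-5.5", "reasoning": "medium"}]

# User-scoped profile path named in the text.
PROFILE_PATH = Path.home() / ".config" / "panda" / "preferences.json"


def load_advisors(path=PROFILE_PATH):
    """Return saved advisors, or the default reviewer when no usable profile exists."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed profile: fall back to the default.
        return list(DEFAULT_ADVISORS)
    advisors = data.get("advisors") if isinstance(data, dict) else None
    return advisors if advisors else list(DEFAULT_ADVISORS)
```

Falling back rather than failing keeps the minimum setup working even when the profile is absent or damaged; a real implementation might prefer to surface a warning in that case.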

Operating thesis

Panda turns reported model weaknesses into focused review pressure.

Independent advisors inspect the brittle parts. Structured artifacts preserve what they find. Codex integrates. Tests decide. Metrics teach the next prompt.
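
The fan-out pattern in this thesis can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Panda's implementation: the Contract dataclass, its fields, and the functions consult and group_claims are hypothetical names, and the advisors are plain Python callables standing in for real model calls. What it shows is that each advisor sees only the task, never another advisor's output, and that disagreement is kept visible rather than merged or voted away.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Contract:
    """Hypothetical structured artifact returned by one advisor."""
    advisor: str
    claims: List[str] = field(default_factory=list)
    proposed_tests: List[str] = field(default_factory=list)


Advisor = Callable[[str], Contract]


def consult(task: str, advisors: Dict[str, Advisor]) -> List[Contract]:
    """Run every advisor on the same task; no advisor sees another's output."""
    with ThreadPoolExecutor(max_workers=max(1, len(advisors))) as pool:
        futures = [pool.submit(fn, task) for fn in advisors.values()]
        return [f.result() for f in futures]


def group_claims(contracts: List[Contract]) -> Dict[str, List[str]]:
    """Map each claim to the advisors who made it, without merging or voting."""
    by_claim: Dict[str, List[str]] = {}
    for contract in contracts:
        for claim in contract.claims:
            by_claim.setdefault(claim, []).append(contract.advisor)
    return by_claim


# Stand-in advisors for illustration only.
def cautious(task: str) -> Contract:
    return Contract("cautious", ["needs migration test"], ["test_migration"])


def terse(task: str) -> Contract:
    return Contract("terse", ["needs migration test", "cache key is stale"])


contracts = consult("review plan", {"cautious": cautious, "terse": terse})
claims = group_claims(contracts)
```

A claim raised by a single advisor stays in the map with one name next to it, so the integrator can check it against local evidence and tests instead of discarding it as a minority view.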

Where it helps

Following harness engineering's leverage rule, Panda focuses on research and plan quality before implementation, then checks the result for drift, missing tests, and inconsistencies.

[Figure: leverage map. The highest leverage is upstream, where research and plan quality shape every later token. Panda's pressure zone covers research (facts, constraints) and plan (approach, tests). Codex then writes the diff in implementation. After coding, a consistency check asks whether the diff still matches the research, plan, and tests. Wrong research leads to the wrong problem; a weak plan leads to the wrong solution.]

Highest leverage: research and plan

Panda spends extra model attention before Codex writes code, where a wrong read or weak plan can cascade into much larger implementation waste.

After coding: check the implementation

Once Codex implements, Panda can review the diff against the research and plan to catch drift, inconsistencies, and verification gaps.

Install and run

Codex install request
You

Please install Panda for Codex. Use the latest release: https://github.com/igcodinap/panda/releases/latest
After installing, make Panda available as a Codex skill and run a dry-run smoke check.

Codex

I’ll fetch the Panda checkout, install the local helpers, sync the skill, and verify the default Codex reviewer path.

Requirements
  • Python 3.9+
  • Codex CLI installed and authenticated, or CODEX_BIN set to an authenticated CLI
  • Claude Code optional, for Claude-backed advisors
  • OpenCode optional, for Kimi, GLM, Qwen, or other OpenCode-backed advisors

The minimum working setup is Python plus Codex. Claude Code and OpenCode are optional sources of extra review pressure; save them into Panda's behavior profile when you want future runs to use them.
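
The requirements above can be checked with a small preflight script. The sketch below is illustrative and not part of Panda: the function names are hypothetical, and the executable names codex, claude, and opencode are assumptions about how those CLIs appear on PATH. It checks only that the tools are present, not that they are authenticated.

```python
import os
import shutil
import sys


def resolve_codex(env=None):
    """Return the Codex CLI path: CODEX_BIN if set, else `codex` found on PATH."""
    env = os.environ if env is None else env
    return env.get("CODEX_BIN") or shutil.which("codex")


def preflight(env=None):
    """Report which of the listed requirements appear to be satisfied."""
    return {
        "python_ok": sys.version_info >= (3, 9),
        "codex": resolve_codex(env),
        # Optional advisors; the executable names are assumptions.
        "claude": shutil.which("claude"),
        "opencode": shutil.which("opencode"),
    }


def minimum_met(report):
    """The minimum working setup is Python plus Codex."""
    return bool(report["python_ok"] and report["codex"])
```

Calling minimum_met(preflight()) returns True when Python is new enough and a Codex binary was found; missing optional tools show up as None in the report without failing the check.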

ROI recommendation: after Codex, the first optional paid add-on is OpenCode Go, currently listed at $5 for the first month and then $10/month. It gives Panda a separate budget for Kimi, GLM, Qwen, and other OpenCode-backed review agents.

Academic sources

Panda uses the literature as constraint, not decoration: more agents are not automatically better, budgets matter, coordination fails, and benchmarks need humility.

Gao et al., Single-agent or Multi-agent Systems? Why Not Both?

Core claim
Single-agent and multi-agent systems have complementary tradeoffs; coordination has real cost.
Panda implication
Panda runs consultation only where independent pressure is likely to pay for itself.
Caveat
It does not prove Panda's exact advisor shape improves coding outcomes.
Source

CooperBench, Why Coding Agents Cannot be Your Teammates Yet

Core claim
Peer coding agents can fail through coordination, commitment, and communication breakdowns.
Panda implication
Panda advisors inspect independently. Codex remains the single editor and integrator.
Caveat
The benchmark studies collaborative coding, not Panda's fan-out advisory pattern.
Source

Rethinking the Value of Multi-Agent Workflow

Core claim
Strong single-agent baselines can match homogeneous multi-agent workflows in many settings.
Panda implication
Panda must earn its use through heterogeneous evidence and useful pressure, not agent count.
Caveat
Codex alone remains the baseline Panda must beat or complement.
Source

Tran and Kiela, Equal Thinking-Token Budgets

Core claim
When reasoning budgets are equalized, single-agent systems can be more information-efficient.
Panda implication
Panda treats cost, latency, and retained context as part of the result.
Caveat
The result is about multi-hop reasoning, not local software consultation.
Source

Kim et al., Towards a Science of Scaling Agent Systems

Core claim
Agent-system scaling depends on task shape and architecture fit.
Panda implication
Panda stays narrow: independent contract pressure, then single-editor integration.
Caveat
Poor task selection can still waste time or amplify noise.
Source

MAST, Why Do Multi-Agent LLM Systems Fail?

Core claim
Multi-agent failures cluster around design, misalignment, and verification categories.
Panda implication
V2 sidecars, parse warnings, and falsifier prompts keep failure modes visible.
Caveat
Taxonomy helps expose failure; it does not eliminate it.
Source

Silo-Bench

Core claim
Agents can exchange information yet still fail to synthesize distributed state.
Panda implication
Panda avoids advisor-to-advisor state merging and leaves synthesis to Codex plus tests.
Caveat
The synthesis step still depends on evidence quality and verification discipline.
Source

SWE-bench Pro

Core claim
Long-horizon software tasks better stress realistic debugging and integration behavior.
Panda implication
Panda evaluation should focus on tasks where consultation can actually matter.
Caveat
Benchmark use must stay contamination-aware and conservative.
Source

Terminal-Bench

Core claim
Command-line tasks expose setup recovery, autonomy, and verification quality.
Panda implication
Panda should be judged on evidence usefulness and recovery planning, not only final patches.
Caveat
Current public Panda results are primarily SWE-bench-style summaries.
Source

OpenAI SWE-bench Verified Methodology Note

Core claim
Saturated and contamination-prone coding benchmarks can stop measuring frontier capability.
Panda implication
Panda reports evaluation status cautiously and separates observed experience from proven lift.
Caveat
This is a methodology note, not an academic paper.
Source

Evaluation status

Panda has useful early evidence as a review and confidence amplifier. Solve-rate lift remains an active evaluation question.

Observed so far

  • Independent advisors expose risks Codex can miss on first pass.
  • V2 contracts turn fluent advice into auditable claims and tests.
  • Curated summaries are public; raw benchmark artifacts are reviewed before publication.
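
To make the idea of auditable contracts concrete, here is a minimal sketch of turning raw advisor output into claims and tests while keeping parse problems visible. The V2 contract format is not specified in this text, so the JSON shape (an object with "claims" and "tests" lists of strings) and the function parse_contract are assumptions for illustration only.

```python
import json


def parse_contract(raw: str):
    """Parse advisor output into (contract, warnings); never drop problems silently."""
    warnings = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    contract = {}
    for key in ("claims", "tests"):
        value = data.get(key)
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            contract[key] = value
        else:
            # Keep going with an empty field, but record why it is empty.
            contract[key] = []
            warnings.append(f"'{key}' missing or not a list of strings")
    return contract, warnings
```

Returning warnings alongside the contract, rather than raising or discarding the output, lets a reviewer see when an advisor's advice was only partly usable.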

What Panda does not claim yet

Panda does not yet claim a statistically proven solve-rate lift over Codex alone. The public position is conservative: it is a contract-pressure system with promising experience and an evaluation path.