Local consultation pressure for Codex

Panda is not a panda.

It is a multi-agent system: independent model advisors inspect one Codex task, while Codex remains the editor and integrator. The aim is to put review pressure on Codex's weak spots, where it matters.

Panda runs independent model consultations around a Codex task, extracts structured contracts, preserves useful disagreement, and lets Codex integrate only what survives local evidence and tests.

Set it up for Codex

Codex config-aware skill trigger

You

Review this implementation plan with Panda.

Codex

Using Panda to analyze the research and plan.

Default agent: with no saved profile, Panda starts one Codex reviewer, gpt-5.5 at medium reasoning.

Profile file (user-scoped): ~/.config/panda/preferences.json

Update via skill: "set Panda to use Kimi + GLM" -> --save-preferences
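The fallback behavior above can be sketched in a few lines of Python. This is an illustration, not Panda's loader: the file path and the default reviewer come from the text, but the JSON layout (an "advisors" list) and the key names "backend", "model", and "reasoning" are assumptions made for the example.

```python
import json
from pathlib import Path

# Path taken from the text above; the file's schema is NOT documented here,
# so everything about its contents below is an assumption.
PREFERENCES_PATH = Path.home() / ".config" / "panda" / "preferences.json"

# Documented default: one Codex reviewer, gpt-5.5 at medium reasoning.
# The key names are hypothetical.
DEFAULT_ADVISORS = [{"backend": "codex", "model": "gpt-5.5", "reasoning": "medium"}]


def load_advisors(path=PREFERENCES_PATH):
    """Return saved advisors, or the default reviewer when no usable profile exists."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed profile: fall back to the default.
        return list(DEFAULT_ADVISORS)
    advisors = data.get("advisors") if isinstance(data, dict) else None
    return advisors if advisors else list(DEFAULT_ADVISORS)
```

Falling back silently keeps a first run working with Python plus Codex only. A real tool might prefer to warn when the profile exists but cannot be parsed.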

Operating thesis

Panda turns reported model weaknesses into focused review pressure.

Independent advisors inspect the brittle parts. Structured artifacts preserve what they find. Codex integrates. Tests decide. Metrics teach the next prompt.
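As a rough illustration of "Codex integrates. Tests decide.", the sketch below keeps only advisor claims whose local check passes. Panda's actual V2 contract schema is not shown in this document, so the `Contract` fields and the `surviving` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Contract:
    """Hypothetical shape of one advisor claim; not Panda's real schema."""
    advisor: str                # which advisor made the claim
    claim: str                  # the assertion about the task
    check: Callable[[], bool]   # a local test that can falsify the claim


def surviving(contracts: List[Contract]) -> List[Contract]:
    """Keep contracts whose local check passes; a crashing check counts as failed."""
    kept = []
    for contract in contracts:
        try:
            passed = bool(contract.check())
        except Exception:
            passed = False
        if passed:
            kept.append(contract)
    return kept
```

Disagreement between advisors needs no merge step in this shape: each claim stands or falls on its own check, and only the integrator sees the full list.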

Where it helps

Following harness engineering's leverage rule, Panda focuses on research and plan quality before implementation, then checks the result for drift, missing tests, and inconsistencies.

Highest leverage: Research and plan

Panda spends extra model attention before Codex writes code, where a wrong read or weak plan can cascade into much larger implementation waste.

After coding: Check the implementation

Once Codex implements, Panda can review the diff against the research and plan to catch drift, inconsistencies, and verification gaps.

Install and run

Codex install request

You

Please install Panda for Codex. Use the latest release: https://github.com/igcodinap/panda/releases/latest
After installing, make Panda available as a Codex skill and run a dry-run smoke check.

Codex

I’ll fetch the Panda checkout, install the local helpers, sync the skill, and verify the default Codex reviewer path.

Requirements

Python 3.9+
Codex CLI, installed and authenticated, or CODEX_BIN set to an authenticated CLI
Claude Code (optional), for Claude-backed advisors
OpenCode (optional), for Kimi, GLM, Qwen, or other OpenCode-backed advisors

The minimum working setup is Python plus Codex. Claude Code and OpenCode are optional sources of extra review pressure; save them into Panda's behavior profile when you want future runs to use them.
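A preflight check for the minimum setup might look like the sketch below. It is not Panda's installer logic. It assumes the CLI is named `codex` on PATH when CODEX_BIN is unset, and it only checks that a binary can be found, not that it is authenticated.

```python
import os
import shutil
import sys


def find_codex_cli(env=None):
    """Resolve the Codex CLI: CODEX_BIN first, then `codex` on PATH (assumed name)."""
    env = os.environ if env is None else env
    candidate = env.get("CODEX_BIN") or "codex"
    return shutil.which(candidate, path=env.get("PATH"))


def preflight(env=None, version=None):
    """Return a list of problems; an empty list means the minimum setup looks present."""
    version = sys.version_info if version is None else version
    problems = []
    if tuple(version[:2]) < (3, 9):
        problems.append("Python 3.9+ is required")
    if find_codex_cli(env) is None:
        problems.append("Codex CLI not found (set CODEX_BIN or add codex to PATH)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```

Claude Code and OpenCode are left out of the check on purpose, since the text treats them as optional.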

ROI recommendation: after Codex, the first optional paid add-on is OpenCode Go, currently listed at $5 for the first month and then $10/month. It gives Panda a separate budget for Kimi, GLM, Qwen, and other OpenCode-backed review agents.

Academic sources

Panda uses the literature as constraint, not decoration: more agents are not automatically better, budgets matter, coordination fails, and benchmarks need humility.

Gao et al., Single-agent or Multi-agent Systems? Why Not Both?

Core claim: Single-agent and multi-agent systems have complementary tradeoffs; coordination has real cost.
Panda implication: Panda runs consultation only where independent pressure is likely to pay for itself.
Caveat: It does not prove Panda's exact advisor shape improves coding outcomes.

CooperBench, Why Coding Agents Cannot be Your Teammates Yet

Core claim: Peer coding agents can fail through coordination, commitment, and communication breakdowns.
Panda implication: Panda advisors inspect independently. Codex remains the single editor and integrator.
Caveat: The benchmark studies collaborative coding, not Panda's fan-out advisory pattern.

Rethinking the Value of Multi-Agent Workflow

Core claim: Strong single-agent baselines can match homogeneous multi-agent workflows in many settings.
Panda implication: Panda must earn its use through heterogeneous evidence and useful pressure, not agent count.
Caveat: Codex alone remains the baseline Panda must beat or complement.

Tran and Kiela, Equal Thinking-Token Budgets

Core claim: When reasoning budgets are equalized, single-agent systems can be more information-efficient.
Panda implication: Panda treats cost, latency, and retained context as part of the result.
Caveat: The result is about multi-hop reasoning, not local software consultation.

Kim et al., Towards a Science of Scaling Agent Systems

Core claim: Agent-system scaling depends on task shape and architecture fit.
Panda implication: Panda stays narrow: independent contract pressure, then single-editor integration.
Caveat: Poor task selection can still waste time or amplify noise.

MAST, Why Do Multi-Agent LLM Systems Fail?

Core claim: Multi-agent failures cluster around design, misalignment, and verification categories.
Panda implication: V2 sidecars, parse warnings, and falsifier prompts keep failure modes visible.
Caveat: Taxonomy helps expose failure; it does not eliminate it.

Silo-Bench

Core claim: Agents can exchange information yet still fail to synthesize distributed state.
Panda implication: Panda avoids advisor-to-advisor state merging and leaves synthesis to Codex plus tests.
Caveat: The synthesis step still depends on evidence quality and verification discipline.

SWE-bench Pro

Core claim: Long-horizon software tasks better stress realistic debugging and integration behavior.
Panda implication: Panda evaluation should focus on tasks where consultation can actually matter.
Caveat: Benchmark use must stay contamination-aware and conservative.

Terminal-Bench

Core claim: Command-line tasks expose setup recovery, autonomy, and verification quality.
Panda implication: Panda should be judged on evidence usefulness and recovery planning, not only final patches.
Caveat: Current public Panda results are primarily SWE-bench-style summaries.

OpenAI SWE-bench Verified Methodology Note

Core claim: Saturated and contamination-prone coding benchmarks can stop measuring frontier capability.
Panda implication: Panda reports evaluation status cautiously and separates observed experience from proven lift.
Caveat: This is a methodology note, not an academic paper.

Evaluation status

Panda has useful early evidence as a review and confidence amplifier. Solve-rate lift remains an active evaluation question.

Observed so far

Independent advisors expose risks Codex can miss on first pass.
V2 contracts turn fluent advice into auditable claims and tests.
Curated summaries are public; raw benchmark artifacts are reviewed before publication.

What Panda does not claim yet

Panda does not yet claim a statistically proven solve-rate lift over Codex alone. The public position is conservative: it is a contract-pressure system with promising experience and an evaluation path.
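One conventional way to test for solve-rate lift on a paired task set is an exact McNemar test over the tasks where the two arms disagree. The sketch below assumes each task is run once with Panda and once with Codex alone. It illustrates the kind of test that could apply, is not Panda's evaluation protocol, and carries no results.

```python
from math import comb


def mcnemar_exact_p(only_panda: int, only_baseline: int) -> float:
    """Two-sided exact McNemar p-value from discordant task counts.

    only_panda:    tasks solved with Panda but not by the baseline
    only_baseline: tasks solved by the baseline but not with Panda
    Tasks solved by both arms or by neither do not enter the test.
    """
    n = only_panda + only_baseline
    if n == 0:
        return 1.0  # no disagreement, so no evidence either way
    k = min(only_panda, only_baseline)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because only discordant tasks count, a large benchmark where both arms solve mostly the same tasks can still leave the question open, which is consistent with the conservative position above.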