Engineering an Iterative Adversarial Review Loop: Mitigating Self-Grading Bias in Claude Code via Codex Integration
In the current landscape of AI-assisted software engineering, "Plan Mode" has become a standard paradigm. Tools like GSD, Superpowers, and Matt Pocock’s "Grill Me" skill have significantly advanced the ability to bridge the gap between vague human intent and executable technical requirements. However, a fundamental architectural flaw remains: the single-model dependency.
When a single LLM (such as Claude Code) both architects a solution and validates its implementation, you encounter the "Self-Modeling Bias" or "Self-Grading Problem." Because models like Claude are tuned with RLHF (Reinforcement Learning from Human Feedback) to be helpful and agreeable, they are unreliable narrators of their own work. Ask an LLM to grade its own plan and it will almost invariably return a confident, positive evaluation, regardless of the actual technical viability or security implications of the proposed logic.
To solve this, we must move beyond simple orchestration and implement an adversarial multi-agent architecture. By introducing a neutral third party—in this case, OpenAI’s Codex—we can create a competitive loop that identifies edge cases, security vulnerabilities, and logical fallacies that the primary builder might overlook.
The Architecture: Two-Phase Implementation
The proposed system expands upon the "Grill Me" pattern by splitting the development lifecycle into two distinct, high-fidelity phases: Deep Discovery and Adversarial Validation.
Phase 1: Deep Discovery (The Augmented Planning Layer)
The first phase uses an expanded version of the "Grill Me" methodology. The goal here is to eliminate ambiguity before a single line of code is written. Instead of a simple prompt-and-execute flow, this layer forces Claude Code into a high-density questioning mode.
During this stage, the model iterates through a series of technical checkpoints:
- Requirement Clarification: Defining whether features are cosmetic (e.g., CSS overlays) or functional (e.g., backend enforcement).
- Infrastructure Mapping: Determining asset storage formats and database integration points (e.g., Supabase/PostgreSQL).
- Constraint Identification: Establishing rate limits, authentication requirements, and data persistence logic.
By the end of Phase 1, we have a high-fidelity blueprint that aligns user intent with technical feasibility. However, even a detailed plan can be confidently wrong, so the risk of "hallucinated correctness" remains.
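One way to make the Phase 1 checkpoints concrete is to track each question against the checkpoint it belongs to and refuse to move on while anything is unanswered. The sketch below is illustrative only: `Checkpoint`, `Question`, and `DiscoverySession` are hypothetical names, not part of any of the tools mentioned above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Checkpoint(Enum):
    REQUIREMENTS = "Requirement Clarification"
    INFRASTRUCTURE = "Infrastructure Mapping"
    CONSTRAINTS = "Constraint Identification"


@dataclass
class Question:
    checkpoint: Checkpoint
    text: str
    answer: Optional[str] = None  # stays None until the user answers


@dataclass
class DiscoverySession:
    """Hypothetical tracker for the discovery interview."""
    questions: list[Question] = field(default_factory=list)

    def ask(self, checkpoint: Checkpoint, text: str) -> Question:
        question = Question(checkpoint, text)
        self.questions.append(question)
        return question

    def open_questions(self) -> list[Question]:
        return [q for q in self.questions if q.answer is None]

    def ready_for_plan(self) -> bool:
        # Every checkpoint must have been probed, and nothing may be left open.
        covered = {q.checkpoint for q in self.questions}
        return covered == set(Checkpoint) and not self.open_questions()
```

The plan is only written once `ready_for_plan()` returns true, which keeps the model from skipping a checkpoint category entirely.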
Phase 2: The Adversarial Review Loop
This is where the architecture shifts from planning to adversarial testing. We introduce Codex as an independent auditor. This process generates two critical artifacts:
- plan.md: The primary source of truth, containing the finalized technical implementation strategy.
- plan_review_log.md: A persistent ledger documenting the iterative debate between Claude Code (the Builder) and Codex (the Auditor).
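The ledger works best when it is append-only, so that earlier rounds of the debate are never rewritten. A minimal sketch of such a writer follows; the entry layout (a heading per round and role, with a UTC timestamp) is an assumption for illustration, not a format prescribed by either tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def format_entry(round_no: int, role: str, body: str, stamp: str) -> str:
    """Render one log entry as a markdown section (assumed layout)."""
    return f"## Round {round_no}: {role} ({stamp})\n\n{body.strip()}\n\n"


def append_entry(log_path: Path, round_no: int, role: str, body: str) -> None:
    """Append an entry to the ledger; existing entries are left untouched."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(format_entry(round_no, role, body, stamp))
```

For example, `append_entry(Path("plan_review_log.md"), 1, "Auditor", findings_text)` would record the auditor's first-round findings, where `findings_text` is whatever the auditing model returned.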
The Iterative Mechanism
The system is configured for a multi-turn execution loop—typically capped at five iterations to manage token costs and latency. In each round:
- Audit: Codex ingests the current plan.md and identifies potential regressions, security holes, or architectural weaknesses.
- Rebuttal/Correction: Claude Code analyzes the findings in plan_review_log.md. It must decide whether to accept the critique or defend its original logic. If a critique is valid, Claude Code updates plan.md with a corrected implementation strategy.
- Convergence: The loop continues until either consensus is reached (a "thumbs up" from Codex) or the maximum turn limit is exhausted.
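The control flow of a round can be sketched in a few lines. This is a minimal sketch, not the actual integration: `audit` and `revise` are hypothetical callables that would each wrap a real model invocation (Codex and Claude Code respectively), and the `Audit` record is an assumed shape for the auditor's reply.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Audit:
    approved: bool        # the auditor's "thumbs up"
    findings: list[str]   # open critiques; empty when approved


# Hypothetical signatures; each callable would wrap a real model invocation.
AuditFn = Callable[[str], Audit]
ReviseFn = Callable[[str, list[str]], tuple[str, str]]  # -> (new plan, response)


def review_loop(plan: str, audit: AuditFn, revise: ReviseFn, max_rounds: int = 5):
    """Return (final plan, log lines, whether the auditor approved)."""
    log: list[str] = []
    for round_no in range(1, max_rounds + 1):
        result = audit(plan)
        if result.approved:
            log.append(f"Round {round_no}: auditor approved")
            return plan, log, True
        log.append(f"Round {round_no}: auditor raised {len(result.findings)} finding(s)")
        # The builder may accept a critique and edit the plan, or defend the
        # original and return it unchanged; either way the response is logged.
        plan, response = revise(plan, result.findings)
        log.append(f"Round {round_no}: builder replied: {response}")
    return plan, log, False  # turn limit exhausted without consensus
```

Returning the convergence flag separately matters: a plan that merely ran out of rounds should not be treated the same as one the auditor approved.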
Case Study: Identifying Critical Vulnerabilities via Adversarial Auditing
The efficacy of this adversarial approach was demonstrated during the implementation of an email-gated content feature. While a single-model approach might have successfully deployed a functional UI, the Codex-driven audit identified several high-severity technical oversights that would have compromised the production environment.
Round 1: Identifying Structural Flaws
During the first iteration, Codex flagged eleven distinct issues. Key findings included:
- Unbounded Client-Side Slugs: The risk of unbounded input leading to potential injection or routing errors.
- Case-Sensitive Dedupe Bypass: A logic flaw where email uniqueness could be bypassed via casing variations (e.g., user@example.com vs. User@example.com).
- Security Vectors: Identification of a "raw list bombing" vector and the lack of a table scanning rate limit, which could lead to Denial of Service (DoS) on the database layer.
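The first two findings have simple server-side remedies, sketched below under stated assumptions: the 64-character limit and the lowercase hyphenated slug pattern are illustrative choices, and `SignupStore` is an in-memory stand-in for the real table. Lowercasing the whole address is a common pragmatic choice for deduplication, not something mandated by the email standards.

```python
import re

MAX_SLUG_LENGTH = 64  # assumed bound; choose one that fits your routes
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(slug: str) -> bool:
    """Accept only short, lowercase, hyphen-separated slugs."""
    return len(slug) <= MAX_SLUG_LENGTH and SLUG_PATTERN.fullmatch(slug) is not None


def normalize_email(raw: str) -> str:
    """Canonical form used for the uniqueness check."""
    return raw.strip().lower()


class SignupStore:
    """In-memory stand-in for the signups table (hypothetical)."""

    def __init__(self) -> None:
        self._emails: set[str] = set()

    def add(self, raw_email: str) -> bool:
        """Return True for a new address, False for a duplicate."""
        email = normalize_email(raw_email)
        if email in self._emails:
            return False
        self._emails.add(email)
        return True
```

With normalization applied before the uniqueness check, `User@example.com` and `user@example.com` collapse to the same key, and any slug that is too long or contains characters outside the pattern is rejected before it reaches routing or the database.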
Round 2: Detecting False Fixes
The true value of the adversarial loop is most evident in its ability to catch "shallow fixes." In the second round, Codex identified that Claude Code’s initial attempts to remediate the Round 1 issues were non-functional:
- Unwired Logic: The implementation claimed a double opt-in mechanism was present, but the underlying logic had not been integrated into the workflow.
- Index Targeting Failures: An attempt to use an expression index that the Supabase JavaScript client could not effectively target for deduplication.
- Asynchronous Blocking: A missing await before a critical response, potentially leading to race conditions in the email capture flow.
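The missing-await class of bug is easy to reproduce in isolation. The original flow used the Supabase JavaScript client and its exact code is not shown here, so the following is only an analogue in Python's asyncio: `save_signup` stands in for the database write, and both handlers are hypothetical.

```python
import asyncio

_background: set = set()  # holds references so pending tasks are not garbage-collected


async def save_signup(store: list, email: str) -> None:
    await asyncio.sleep(0.01)  # stands in for the database round trip
    store.append(email)


async def handler_missing_await(store: list, email: str) -> str:
    # Bug: the write is scheduled but not awaited, so "ok" is returned
    # before the row exists.
    task = asyncio.create_task(save_signup(store, email))
    _background.add(task)
    return "ok"


async def handler_awaited(store: list, email: str) -> str:
    await save_signup(store, email)  # the response waits for the write
    return "ok"


async def demo() -> tuple:
    store: list = []
    await handler_missing_await(store, "a@example.com")
    saved_when_buggy_handler_returned = "a@example.com" in store
    await handler_awaited(store, "b@example.com")
    saved_when_fixed_handler_returned = "b@example.com" in store
    await asyncio.gather(*_background)  # let the stray write finish cleanly
    _background.clear()
    return saved_when_buggy_handler_returned, saved_when_fixed_handler_returned
```

Running `asyncio.run(demo())` shows the difference: the buggy handler reports success while the write is still pending, so any follow-up step that assumes the row exists is racing against it.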
Scalability and Model Agnosticism
While this implementation utilizes OpenAI’s Codex as the auditor due to its specialized coding proficiency, the architecture is model-agnostic. The "bones" of this system—the iterative markdown logging and the two-file state management—can be adapted for any LLM ecosystem.
For developers prioritizing privacy or cost-efficiency, the Codex layer can be swapped with local models such as DeepSeek or specialized fine-tuned Llama variants running via Ollama. The critical component is not the specific model used, but the enforcement of a multi-agent, adversarial workflow that breaks the cycle of single-model self-validation.
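One way to keep the auditor swappable is to depend only on a "prompt in, text out" function and keep the prompt and reply parsing in one place. The sketch below assumes a simple reply convention (the single word APPROVED, or one finding per line prefixed with "- "); that convention, the prompt wording, and the `make_auditor` name are all illustrative, and the `complete` callable would wrap whichever backend you choose.

```python
from typing import Callable

Complete = Callable[[str], str]  # prompt in, raw model text out

AUDIT_PROMPT = (
    "You are an independent auditor. Review the plan below.\n"
    "Reply with the single word APPROVED if you find no issues.\n"
    "Otherwise list each issue on its own line starting with '- '.\n\n"
    "PLAN:\n{plan}\n"
)


def make_auditor(complete: Complete) -> Callable[[str], list[str]]:
    """Wrap any text-completion backend as an auditor returning findings."""

    def audit(plan: str) -> list[str]:
        reply = complete(AUDIT_PROMPT.format(plan=plan)).strip()
        if reply.upper() == "APPROVED":
            return []  # an empty list means the auditor signed off
        findings = [line[2:].strip() for line in reply.splitlines() if line.startswith("- ")]
        # If the model ignored the format, keep the raw reply as one finding
        # rather than silently treating it as approval.
        return findings or [reply]

    return audit
```

Swapping Codex for a local model then means passing a different `complete` function; the loop and the two markdown files stay the same.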
Conclusion
As AI agents move from simple code completion to autonomous software engineering (Agentic Workflows), the primary bottleneck will be trust. We cannot trust an agent to validate its own work. By implementing an iterative, adversarial review loop using a secondary model as a neutral auditor, we can significantly increase the reliability of AI-generated plans and ensure that "Plan Mode" results in production-ready, secure, and architecturally sound code.