Optimizing Agentic Workflows: Implementing Progressive Disclosure and Automated Evaluations via AI Skills
As the paradigm of software development shifts from Developer Experience (DX) to Developer Agent Experience (DAX), the challenge for platform engineers is no longer just providing clean APIs, but providing "agentic-friendly" environments. In a recent technical deep dive, Pedro Rodrigues, an AI Tooling Engineer at Supabase, outlined a framework for enhancing agent performance using a concept known as Skills.
The core objective is to move beyond simple tool-calling and toward a structured method of providing context and workflows that allow LLMs to navigate complex systems like PostgreSQL without overwhelming their context window.
The Architecture of a Skill: Progressive Disclosure
The fundamental limitation of current agentic architectures is context exhaustion. When an agent is provided with every available tool and every piece of documentation upfront, the signal-to-noise ratio degrades. To combat this, Rodrigues proposes the implementation of Skills—a structured approach to Progressive Disclosure.
Unlike the Model Context Protocol (MCP), which often requires loading tool definitions into the context immediately, a Skill acts as an "envelope." A Skill is essentially a directory containing a skill.md file and a reference/ folder.
The skill.md Structure
The skill.md file utilizes YAML front matter to facilitate discovery. The two mandatory fields are:
- name: A unique identifier for the skill.
- description: A high-level summary that tells the agent when to load the skill.
The technical breakthrough here is the "envelope" mechanism. The agent only parses the front matter initially. It does not load the full content of the file or the associated reference files until the description matches the current task. This allows for a much larger, more complex knowledge base to exist "dormant" until the agent explicitly decides to expand the context.
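The envelope mechanism can be sketched in a few lines of Python. This is a minimal illustration, not the loader any particular agent uses: the function names (read_front_matter, discover, load_body) and the one-directory-per-skill layout are assumptions, and the front matter parser only handles flat key: value pairs where a real loader would likely use a full YAML parser.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillHeader:
    name: str
    description: str
    path: Path


def read_front_matter(path: Path) -> dict[str, str]:
    """Read only the front matter block, stopping at the closing '---'."""
    fields: dict[str, str] = {}
    with path.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return fields  # file has no front matter
        for line in fh:
            if line.strip() == "---":
                break  # stop here: the body is never read during discovery
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def discover(skills_root: Path) -> list[SkillHeader]:
    """Collect the cheap 'envelope' (name + description) of every skill."""
    headers = []
    for skill_file in sorted(skills_root.glob("*/skill.md")):
        meta = read_front_matter(skill_file)
        if "name" in meta and "description" in meta:
            headers.append(SkillHeader(meta["name"], meta["description"], skill_file))
    return headers


def load_body(header: SkillHeader) -> str:
    """Expand a skill on demand: return everything after the front matter."""
    text = header.path.read_text(encoding="utf-8")
    parts = text.split("---", 2)  # ['', front matter, body]
    return parts[2].strip() if len(parts) == 3 else text
```

In this sketch only the output of discover would be placed in the agent's context at startup; load_body is called once the agent decides that a description matches the task.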
Reference Files and Graph-based Context
Beyond the main metadata, Skills can reference external files—Markdown documentation, Python scripts, or Bash scripts—located in a reference/ directory. Because these files can reference each other, a Skill can effectively represent a directed graph of information, where the agent traverses nodes of context only as required by the task at hand.
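To make the graph idea concrete, the sketch below walks reference files lazily. It assumes that files point to each other with ordinary Markdown links, which the talk does not specify, and it uses a hypothetical wanted callback as a stand-in for the agent's decision about whether a node is relevant to the current task.

```python
import re
from pathlib import Path
from typing import Callable

# Matches [text](target) and captures the target up to any '#fragment'.
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+)\)")


def outgoing_links(file: Path, root: Path) -> list[Path]:
    """Return the files inside `root` that `file` links to."""
    root = root.resolve()
    targets = []
    for raw in LINK.findall(file.read_text(encoding="utf-8")):
        if "://" in raw:
            continue  # skip external URLs
        target = (file.parent / raw).resolve()
        if target.is_file() and root in target.parents:
            targets.append(target)
    return targets


def traverse(start: Path, root: Path, wanted: Callable[[Path], bool]) -> list[Path]:
    """Depth-first walk that only expands the nodes `wanted` approves."""
    visited: set[Path] = set()
    order: list[Path] = []
    stack = [start.resolve()]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # files may reference each other, so guard against cycles
        visited.add(node)
        order.append(node)
        for nxt in outgoing_links(node, root):
            if nxt not in visited and wanted(nxt):
                stack.append(nxt)
    return order
```

Files that wanted rejects are never opened, so their contents never enter the context.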
Skills vs. MCP: Defining the Boundary
A common misconception is that Skills are a replacement for the Model Context Protocol (MCP). In reality, they are complementary. The distinction lies in the execution environment and the nature of the interaction:
| Feature | MCP (Model Context Protocol) | Skills |
| :--- | :--- | :--- |
| Primary Purpose | Service Integration & Tooling | Contextual Instruction & Workflows |
| Execution Environment | Remote/Server-side (Standardized) | Local/Machine-specific (Environment-dependent) |
| Context Loading | Often loaded into context upfront | Progressive Disclosure (On-demand) |
| Use Case | Connecting to a database or API | Providing domain-specific logic or scripts |
If an agent lacks access to a local Bash environment, MCP is the correct choice for integration. However, if the agent needs to execute a specific local workflow or understand a complex, multi-step deployment process, a Skill is the superior mechanism.
Case Study: Solving PostgreSQL RLS Bypassing
To demonstrate the utility of Skills, Rodrigues presented a real-world failure mode in PostgreSQL: the bypassing of Row-Level Security (RLS) when creating views.
By default, a PostgreSQL view accesses its underlying tables with the privileges of the view's owner (its creator), not those of the user querying it. RLS policies on those tables are therefore evaluated against the owner as well, and table owners and superusers bypass RLS by default, so a view can inadvertently expose rows that the querying user should never see.
The Implementation
Using a Skill named SuperBaseSecurity, the agent was provided with specific instructions to mitigate this. The Skill instructed the agent to always set the security_invoker option when using the CREATE VIEW command (a feature available since PostgreSQL 15).
The Workflow:
- The Failure: An agent, without the Skill, creates a department_stats view. Because it lacks the security_invoker option, the view bypasses RLS, allowing a manager to see salaries for departments they do not oversee.
- The Correction: The agent loads the SuperBaseSecurity Skill. The Skill's description (using the trigger verb "use") prompts the agent to check for RLS compliance.
- The Success: The agent rewrites the migration to:
CREATE OR REPLACE VIEW department_stats WITH (security_invoker = true) AS ...
This ensures that the view respects the existing RLS (Row-Level Security) policies of the underlying profiles and performance_reviews tables.
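A check like this can also be automated outside the agent. The following is a rough, regex-based sketch rather than a real SQL parser: the function name views_missing_security_invoker is made up for illustration, and it does not handle TEMP or RECURSIVE views, comments, or quoted option values beyond single quotes.

```python
import re

# Heuristic: capture the view name and everything between it and the AS keyword.
CREATE_VIEW = re.compile(
    r'create\s+(?:or\s+replace\s+)?view\s+(?P<name>[\w."]+)(?P<rest>.*?)\bas\b',
    re.IGNORECASE | re.DOTALL,
)
WITH_OPTIONS = re.compile(r"with\s*\(([^)]*)\)", re.IGNORECASE)
TRUTHY = {"", "true", "on", "1"}  # a bare boolean option name means true


def views_missing_security_invoker(sql: str) -> list[str]:
    """Return the names of views created without security_invoker enabled."""
    missing = []
    for match in CREATE_VIEW.finditer(sql):
        enabled = False
        options = WITH_OPTIONS.search(match.group("rest"))
        if options:
            for item in options.group(1).split(","):
                key, _, value = item.partition("=")
                if key.strip().lower() == "security_invoker":
                    enabled = value.strip().strip("'").lower() in TRUTHY
        if not enabled:
            missing.append(match.group("name"))
    return missing


migration = """
CREATE VIEW department_stats AS SELECT 1;
CREATE OR REPLACE VIEW team_stats WITH (security_invoker = true) AS SELECT 1;
"""
print(views_missing_security_invoker(migration))  # ['department_stats']
```

A deterministic check of this kind is cheap enough to run on every migration, and it can complement rather than replace the Skill's instructions.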
Automated Evaluations: The "LLM as a Judge" Pattern
Testing non-deterministic agentic behavior requires a shift from traditional unit testing to Evaluations (Evals). Since an LLM's output can vary even with the same input, Rodrigues advocates for an Eval-Driven Development cycle:
- Define Metrics: Determine what "success" looks like (e.g., "Does the SQL contain the security_invoker option?").
- Create the Skill: Implement the skill.md.
- Run Evaluations: Execute a set of scenarios (inputs and expected outputs).
- Grading: Use an LLM-as-a-Judge to grade the agent's performance.
In the demonstration, a Python-based evaluation harness was used to run two conditions: With Skill and Without Skill. By using an LLM to inspect the resulting SQL migrations and compare them against a success criterion, the developer can programmatically verify that the Skill is actually influencing the agent's reasoning and tool-calling behavior.
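The shape of such a harness might look like the sketch below. It is not the harness shown in the demonstration: the Scenario type, run_eval, and both stubs are hypothetical. In particular, fake_agent and keyword_judge are deterministic stand-ins for what would be real model calls, so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    prompt: str      # task given to the agent
    criterion: str   # success criterion handed to the judge


Agent = Callable[[str, bool], str]  # (prompt, with_skill) -> generated SQL
Judge = Callable[[str, str], bool]  # (criterion, output) -> passed?


def run_eval(scenarios: list[Scenario], agent: Agent, judge: Judge,
             trials: int = 3) -> dict[str, float]:
    """Return the pass rate for each condition, repeating every scenario
    `trials` times because agent output can vary between runs."""
    results: dict[str, float] = {}
    for label, with_skill in (("without_skill", False), ("with_skill", True)):
        passed = total = 0
        for scenario in scenarios:
            for _ in range(trials):
                output = agent(scenario.prompt, with_skill)
                passed += judge(scenario.criterion, output)
                total += 1
        results[label] = passed / total if total else 0.0
    return results


def fake_agent(prompt: str, with_skill: bool) -> str:
    """Stub: a real agent would call an LLM, with or without the Skill loaded."""
    options = " WITH (security_invoker = true)" if with_skill else ""
    return f"CREATE VIEW department_stats{options} AS SELECT 1;"


def keyword_judge(criterion: str, output: str) -> bool:
    """Stub: a real LLM-as-a-Judge would grade `output` against `criterion`."""
    return "security_invoker" in output.lower()


scenarios = [
    Scenario("Create a per-department salary view", "View must respect RLS"),
    Scenario("Expose review averages to managers", "View must respect RLS"),
]
print(run_eval(scenarios, fake_agent, keyword_judge))
# {'without_skill': 0.0, 'with_skill': 1.0}
```

Comparing the two pass rates is what indicates whether the Skill changes behavior. With real models the rates will rarely be exactly 0 and 1, which is why each scenario is repeated.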
Productionizing Skills
As we move toward production-grade agentic systems, Skills should be treated as first-class citizens in the CI/CD pipeline.
- Treat Skills as Artifacts: Skills should be versioned and managed similarly to documentation or configuration files.
- Integration with CI: Use tools like the skills npm package to install and symlink skills into the agent's environment (e.g., .claude/skills).
- Continuous Validation: Implement automated Evals in your pipeline to ensure that updates to your database schema or API do not render your existing Skills obsolete or dangerous.
By leveraging progressive disclosure and structured evaluation, we can build agents that are not only capable but also safe and scalable within complex enterprise ecosystems.