Architecting Model-Agnostic Agentic Workflows: Decoupling LLM Inference from Local Knowledge Scaffolding

As Large Language Models (LLMs) evolve, an architectural tension has emerged between ecosystem lock-in and functional autonomy. As users move from simple chat interfaces to complex agentic workflows, reliance on proprietary "all-in-one" ecosystems (such as Perplexity or the standard Claude Desktop app) puts data sovereignty and system flexibility at risk.

To build a truly resilient Personal Knowledge Management (PKM) system, we must move away from monolithic, provider-dependent structures and toward a decoupled, three-layer architecture: The Data (Knowledge Base), The Instructions (Agentic Scaffolding), and The Brain (Inference Engine).

The Three-Layer Architecture

The fundamental flaw in most current AI implementations is the conflation of data and logic. When the knowledge base is inextricably tied to the model's specific implementation (e.g., a proprietary "memory" feature), the user loses the ability to upgrade or replace the underlying model without rebuilding the entire knowledge structure.

A robust, professional-grade system requires a strict separation of concerns:

  1. The Data Layer (The Local Folder): This is the "Single Source of Truth." It consists of raw data, Markdown files, and structured documentation stored locally. This layer is persistent and independent of any LLM.
  2. The Instruction Layer (The Agentic Scaffolding): This layer consists of .md files (e.g., agent.md, skills.md) that define the personas, capabilities, and Standard Operating Procedures (SOPs) for specific tasks. These instructions are model-agnostic; they are simply text files that any capable LLM can ingest.
  3. The Inference Layer (The Brain): This is the LLM itself—Claude (Sonnet or Opus), Gemini, or Codex. The "Brain" is simply pointed toward the Local Folder and the Instruction Layer.

By maintaining this separation, you achieve "Model Agnosticism." If a new model (e.g., a future Grok iteration or a specialized Gemini update) demonstrates superior reasoning or lower latency, you can point the new "Brain" at your existing "Instructions" and "Data" without any reconfiguration of the underlying knowledge.
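
The separation can be illustrated with a minimal Python sketch. It is not a reference implementation: the `Workspace` class, the `Brain` type, and the folder names `data` and `instructions` are all hypothetical, and the two "brains" are stand-in functions rather than real API clients. The only point is that the data and instruction layers never mention a model, so a different model can be passed in without touching a file.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Layer 3 (hypothetical interface): any callable that maps a prompt to a reply.
Brain = Callable[[str], str]


@dataclass(frozen=True)
class Workspace:
    """Layers 1 and 2. Nothing here refers to a specific model."""
    data_dir: Path          # Layer 1: local Markdown knowledge base
    instructions_dir: Path  # Layer 2: agent.md, skills.md, ...

    def build_prompt(self, agent_file: str, task: str) -> str:
        instructions = (self.instructions_dir / agent_file).read_text(encoding="utf-8")
        return f"{instructions}\n\nKnowledge base: {self.data_dir.name}\n\nTask: {task}"


def run(workspace: Workspace, brain: Brain, agent_file: str, task: str) -> str:
    """Swapping the model means passing a different `brain`; no files change."""
    return brain(workspace.build_prompt(agent_file, task))


def demo() -> list[str]:
    """Run the same workspace against two stand-in brains (no real API calls)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data").mkdir()
        (root / "instructions").mkdir()
        (root / "instructions" / "agent.md").write_text(
            "You are a researcher.", encoding="utf-8"
        )
        ws = Workspace(root / "data", root / "instructions")
        seen: list[str] = []

        def brain_a(prompt: str) -> str:
            seen.append(prompt)
            return "reply from A"

        def brain_b(prompt: str) -> str:
            seen.append(prompt)
            return "reply from B"

        replies = [
            run(ws, brain_a, "agent.md", "Summarize notes"),
            run(ws, brain_b, "agent.md", "Summarize notes"),
        ]
        assert seen[0] == seen[1]  # both brains received the identical prompt
        return replies
```

Calling `demo()` runs one workspace against both stand-ins; each receives exactly the same prompt, which is the property the three-layer split is meant to guarantee.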

Agentic Orchestration: Sub-Agents vs. "Head Switching"

A sophisticated implementation of this architecture utilizes an Orchestrator Pattern. In our implementation, the orchestrator (referred to as "Larry") acts as the primary interface. The orchestrator's role is not to execute every task, but to manage delegation.

However, a critical technical distinction must be made between two types of agentic behavior: Contextual Head Switching and True Parallel Sub-Agents.

Contextual Head Switching

In standard web-based chat interfaces (like Claude.ai or Cowork), the model performs "head switching." When you ask the model to act as a "Designer," it simply loads the designer's instructions into the current active context window. The model is essentially "pretending" to be another person by simulating a persona change within a single continuous stream of tokens. This is token-inefficient and prone to context drift, because the previous persona's instructions and conversation remain in the window and are reprocessed on every turn.

True Parallel Sub-Agents

CLI-based tools such as Claude Code support a different paradigm: the instantiation of sub-agents. Here the orchestrator can launch distinct sub-agents that run in parallel. Each sub-agent operates within its own context window, focused solely on its specific task (e.g., a "Researcher" agent or a "Database Architect" agent). This prevents the "noise" of unrelated tasks from polluting the active context window of the primary task.
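
The difference between the two behaviors can be sketched in a few lines of Python. This is a conceptual model only and does not reflect how Claude Code is implemented internally: `Context`, `head_switch`, `run_sub_agent`, and the `JOBS` list are hypothetical names, and a thread pool stands in for whatever parallel mechanism the real tool uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Context:
    """Stand-in for a context window: an ordered list of text chunks."""
    chunks: list[str] = field(default_factory=list)


def head_switch(shared: Context, persona: str, task: str) -> Context:
    """Head switching: each new persona is appended to the one shared window."""
    shared.chunks.extend([persona, task])
    return shared


def run_sub_agent(job: tuple[str, str]) -> Context:
    """Sub-agent: starts from a fresh window holding only its persona and task."""
    persona, task = job
    return Context(chunks=[persona, task])


# Illustrative personas and tasks (assumptions, not from a real configuration).
JOBS = [
    ("You are a Researcher.", "Find sources on vector databases."),
    ("You are a Database Architect.", "Design the notes schema."),
]


def demo() -> tuple[int, list[int]]:
    # One shared window accumulates every persona and task.
    shared = Context()
    for persona, task in JOBS:
        head_switch(shared, persona, task)

    # Each sub-agent gets its own window; threads model the parallelism.
    with ThreadPoolExecutor() as pool:
        isolated = list(pool.map(run_sub_agent, JOBS))

    return len(shared.chunks), [len(ctx.chunks) for ctx in isolated]
```

In this toy model the shared window ends up holding four chunks (both personas and both tasks), while each sub-agent window holds only the two chunks relevant to its own job.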

Optimizing Token Efficiency and Context Management

A common critique of modular, multi-file agentic structures is the perceived increase in token consumption. The argument suggests that by having an orchestrator reference multiple .md files, you are "importing" more overhead into the context window.

This is a technical misconception.

In a well-architected system, the orchestrator does not load the entire directory tree into the context window at once. Instead, the orchestrator uses the instruction layer to understand the map of the folder. When a specific task is requested, the orchestrator identifies the relevant agent.md or skills.md file and only then pulls that specific context into the active session.

This "Just-in-Time" (JIT) context loading is significantly more token-efficient than the alternative: a single, massive CLAUDE.md file containing all instructions. A monolithic instruction file grows with every new skill added, and all of it is loaded for every task, eventually leading to:

  1. Context Window Exhaustion: The model loses the ability to process the actual task because the "instruction overhead" has consumed the available tokens.
  2. Increased Latency: The model must process a massive amount of irrelevant "instruction noise" before reaching the task-specific logic.

By using a modular, file-based approach, we keep the active context window lean, focusing the model's attention only on the parameters necessary for the current execution.
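
A rough Python sketch of the comparison follows. It rests on stated assumptions: the file names are invented, the instruction files are filled with placeholder text, and `estimate_tokens` uses a crude heuristic of about four characters per token rather than a real tokenizer. The index is built from file names alone, so no file body is read until a task actually needs it.

```python
import tempfile
from pathlib import Path


def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return len(text) // 4


def build_index(instructions_dir: Path) -> dict[str, Path]:
    """Map skill name -> file path without reading any file bodies."""
    return {p.stem: p for p in sorted(instructions_dir.glob("*.md"))}


def load_jit(index: dict[str, Path], skill: str) -> str:
    """Just-in-Time: read only the one file the current task needs."""
    return index[skill].read_text(encoding="utf-8")


def load_monolithic(index: dict[str, Path]) -> str:
    """Monolithic: every instruction file is concatenated into the context."""
    return "\n".join(p.read_text(encoding="utf-8") for p in index.values())


def demo() -> tuple[int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Three hypothetical skill files, 400 characters of placeholder text each.
        for name in ("designer", "researcher", "architect"):
            (folder / f"{name}.md").write_text("x" * 400, encoding="utf-8")
        index = build_index(folder)
        jit = estimate_tokens(load_jit(index, "researcher"))
        mono = estimate_tokens(load_monolithic(index))
        return jit, mono
```

With three equally sized files, the JIT path loads about a third of what the monolithic path does, and the gap widens with each skill added, since the monolithic cost is the sum of all files while the JIT cost stays at one file plus the index.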

The Role of the "Adapter Prompt"

To achieve true model agnosticism, we utilize an Adapter Prompt. This is a specialized instruction set that resides within the local folder. When a new model (e.g., Gemini or Codex) is introduced to the folder, the adapter prompt provides the necessary "bootstrapping" instructions. It tells the model how to interpret the folder structure, where to find the agent definitions (the "team members"), and how to initialize its own specific configuration (such as creating a .gemini or .codex folder).
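
As a hedged illustration, the bootstrapping step might look like the Python sketch below. The file name `adapter.md`, the `{model}` placeholder, the `agents/` path in the sample text, and the `bootstrap` function are all assumptions made for the example; the article does not specify how the adapter prompt is stored or filled in. In practice the model itself would carry out these steps after reading the prompt; the function simply makes them explicit.

```python
import tempfile
from pathlib import Path

ADAPTER_FILE = "adapter.md"  # hypothetical file name


def bootstrap(folder: Path, model: str) -> str:
    """Create the model's config folder and return its bootstrapping prompt."""
    template = (folder / ADAPTER_FILE).read_text(encoding="utf-8")
    (folder / f".{model}").mkdir(exist_ok=True)  # e.g. .gemini or .codex
    return template.replace("{model}", model)


def demo() -> tuple[str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Sample adapter text; the wording and paths are illustrative only.
        (folder / ADAPTER_FILE).write_text(
            "You are {model}. Agent definitions live in agents/. "
            "Store your own settings in .{model}/.",
            encoding="utf-8",
        )
        prompt = bootstrap(folder, "gemini")
        return prompt, (folder / ".gemini").is_dir()
```

The same `adapter.md` serves every model: only the name substituted into the template and the resulting dot-folder differ.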

This allows for a seamless transition between models. You can use Claude Sonnet for high-speed, low-latency orchestration and delegation, and then switch to Claude Opus or Gemini for heavy-duty reasoning or complex research tasks, all while using the exact same agent definitions and data structures.
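
One way to picture this switching is a small routing table, sketched here in Python. The task categories, the `ROUTES` mapping, and the model labels are illustrative assumptions; the labels are plain strings, not real API model identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    ORCHESTRATION = "orchestration"
    DEEP_REASONING = "deep_reasoning"
    RESEARCH = "research"


# Labels only (assumption): these are not real API model identifiers.
ROUTES = {
    TaskKind.ORCHESTRATION: "sonnet",
    TaskKind.DEEP_REASONING: "opus",
    TaskKind.RESEARCH: "gemini",
}


def pick_model(kind: TaskKind) -> str:
    """Choose a model label from the kind of task, not from the agent file."""
    return ROUTES[kind]


def plan(agent_file: str, kind: TaskKind) -> dict[str, str]:
    """The agent definition stays the same whichever model is selected."""
    return {"agent": agent_file, "model": pick_model(kind)}
```

For example, `plan("agent.md", TaskKind.ORCHESTRATION)` and `plan("agent.md", TaskKind.DEEP_REASONING)` reference the same agent file and differ only in the model label.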

Conclusion: The ICOR Framework for Scalable Intelligence

The ultimate goal of this architecture is to implement the ICOR Framework: separating the Individual (Personal Knowledge Management) from the Team (Business/Shared Knowledge).

By treating your AI agents as a "pre-hired" team of specialists—each with their own .md based instructions—you create a scalable business infrastructure. Whether you are a solopreneur or managing a large organization, the ability to delegate tasks to specialized, modular agents ensures that your productivity system remains robust, your context windows remain clean, and your intellectual property remains entirely under your control.