Escaping the Context Window Loop: Implementing Smart-Truncation and Subagent Architectures in AI Agents
In the early stages of the generative AI boom, the industry focused almost exclusively on prompt engineering: refining instructions to elicit better reasoning from Large Language Models (LLMs). As we move from simple chat interfaces to autonomous, long-running agents, the engineering frontier has shifted. As Sally-Ann Delucia, Head of Product at Arise, argues: "Agents don't fail because of prompts. They fail because of context."
This post explores the technical challenges of context management encountered during the development of Alex, an AI agent designed for observability, and the architectural shifts required to move from naive truncation to a hierarchical, subagent-driven memory system.
The Vicious Loop of Context Expansion
The primary challenge in building agents for observability—specifically for analyzing traces and spans—is the recursive nature of the data. When an agent like Alex is tasked with analyzing an observability platform (Arise), it must ingest traces, spans, and metadata.
The team encountered what can be described as a vicious loop of context expansion:
- The agent (Alex) is tasked with analyzing a specific trace.
- The trace contains spans, which contain further metadata and logs.
- As the agent performs deeper analysis, it generates more reasoning steps and tool calls.
- These tool calls and intermediate results are appended to the conversation history.
- The context window expands, eventually hitting the provider's token limit.
- The agent fails, triggering a retry or a broader search, which adds even more data to the context and makes the next failure all but certain.
This loop makes it impossible for an agent to succeed unless it can strategically decide what to remember and, more importantly, what to forget.
The Failure of Naive Strategies
Before arriving at a robust solution, the development team experimented with two common, yet ultimately flawed, context management strategies.
1. Naive Truncation
The simplest approach was to keep only the first $N$ characters of the conversation and drop the rest. This prevented token-limit errors, but it destroyed the agent's ability to maintain state. The agent lost the "tail" of the conversation, so a follow-up question (e.g., "Can you tell me more about input B?") failed because the messages describing "input B" had been purged. Over-truncation effectively broke the agent's reasoning and turned multi-turn conversations into disconnected, single-turn interactions.
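The failure mode is easy to reproduce. The sketch below is a minimal illustration, not the team's code: the function name `naive_truncate`, the sample messages, and the 120-character limit are all assumptions made for the example.

```python
def naive_truncate(history: list[str], max_chars: int) -> str:
    """Join the history and keep only the first max_chars characters."""
    return "\n".join(history)[:max_chars]


# Hypothetical conversation; the details about input B arrive late.
history = [
    "system: You analyze traces.",
    "user: Compare the two inputs recorded for this trace.",
    "assistant: Input A succeeded with normal latency.",
    "assistant: Input B failed with a timeout in the retriever span.",
    "user: Can you tell me more about input B?",
]

truncated = naive_truncate(history, max_chars=120)

# The head survives, but everything the follow-up depends on is gone.
print(truncated)
print("timeout" in truncated)  # False: the detail about input B was cut
```

The limit is respected, yet the truncated context no longer contains either the follow-up question or the message it refers to.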
2. LLM-Based Summarization
The next logical step was to use an LLM to condense the preceding conversation history into a much shorter summary. While this preserves the "gist" of the conversation, it introduces two critical issues:
- Inconsistency: The summarization process is stochastic. The LLM decides what is important, often omitting granular technical details (like specific span IDs or error codes) that are vital for downstream tool calls.
- Lack of Control: There is no deterministic way to ensure that the summary contains the specific metadata required for the agent's next planned action.
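One way to make both issues visible is to check a summary against the identifiers the next tool call will need. The sketch below assumes a made-up ID format (`span-` plus six hex digits, and error codes like `E504`) and a hand-written lossy summary; neither comes from the original system, and no LLM is called.

```python
import re

# Hypothetical identifier formats, chosen only for this illustration.
IDENTIFIER_PATTERN = re.compile(r"\bspan-[0-9a-f]{6}\b|\bE\d{3}\b")


def extract_identifiers(text: str) -> set[str]:
    """Collect every span ID or error code that appears in the text."""
    return set(IDENTIFIER_PATTERN.findall(text))


def missing_identifiers(original: str, summary: str) -> set[str]:
    """Return identifiers present in the original but absent from the summary."""
    return extract_identifiers(original) - extract_identifiers(summary)


original = "Tool result: span-3fa9c1 failed with E504; span-77b0de returned normally."
# Stand-in for what a summarizer might plausibly produce.
summary = "One span failed with a timeout-style error while another succeeded."

lost = missing_identifiers(original, summary)
print(sorted(lost))
```

A check like this can flag a bad summary after the fact, but it cannot force the summarizer to keep the right details, which is the control problem described above.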
The Solution: Smart-Truncation Memory
To escape the loop, the team implemented a Smart-Truncation Memory strategy. This approach treats context as a structured entity rather than a simple string of text. The architecture relies on three pillars:
- Head and Tail Preservation: The system retains the beginning of the conversation (the system prompt and initial user intent) and the most recent messages (the "tail").
- Middle Truncation: The middle section of the conversation—the bulk of the intermediate reasoning and older tool outputs—is stripped out to save tokens.
- The Memory Store: Instead of discarding the middle section entirely, the data is moved to a separate memory store. If the agent determines that a specific past tool call or message is relevant to its current task, it can explicitly call a tool to retrieve that specific context.
By keeping the system prompt and the latest tool results intact while offloading the "middle" to a searchable store, the agent maintains its operational instructions and its immediate state without bloating the active context window.
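The three pillars can be sketched as follows. This is a minimal illustration under stated assumptions: it truncates by message rather than by character, and the names `Message`, `SmartTruncationMemory`, `active_context`, and `recall` are hypothetical, as are the head and tail sizes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    role: str
    content: str


@dataclass
class SmartTruncationMemory:
    head_size: int = 2  # assumed: system prompt + initial user intent
    tail_size: int = 3  # assumed: number of most recent messages to keep
    log: list[Message] = field(default_factory=list)  # full history = memory store

    def append(self, message: Message) -> None:
        self.log.append(message)

    def active_context(self) -> list[Message]:
        """Return head + placeholder + tail; the middle stays in the store."""
        n = len(self.log)
        if n <= self.head_size + self.tail_size:
            return list(self.log)
        first_hidden = self.head_size
        last_hidden = n - self.tail_size - 1
        note = Message(
            "system",
            f"[messages {first_hidden}-{last_hidden} are in the memory store; "
            "call recall(index) to retrieve one]",
        )
        return self.log[: self.head_size] + [note] + self.log[n - self.tail_size :]

    def recall(self, index: int) -> Optional[Message]:
        """Tool the agent can call to pull one offloaded message back."""
        if 0 <= index < len(self.log):
            return self.log[index]
        return None


memory = SmartTruncationMemory(head_size=2, tail_size=2)
memory.append(Message("system", "You analyze traces."))
memory.append(Message("user", "Why is trace t-1 slow?"))
for i in range(5):
    memory.append(Message("tool", f"span batch {i}"))
memory.append(Message("user", "Tell me more about batch 1."))

context = memory.active_context()
print([m.content for m in context])
print(memory.recall(3))
```

Because indices refer to positions in the full log, they stay stable as the conversation grows, so a placeholder emitted on one turn still points at the right messages on a later turn.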
Architectural Evolution: Subagent Delegation
Even with smart truncation, some tasks are inherently too data-intensive for a single agent. For example, a search task involving hundreds of spans within a trace stack can overwhelm a single context window regardless of truncation strategies.
The solution was a transition to a Hierarchical Agent Architecture. Instead of a monolithic agent, the system now utilizes a Main Agent and specialized Subagents:
- The Main Agent: This agent maintains the primary conversation thread. Its context remains "light," containing only the chat history, high-level instructions, and the results of recent interactions. It acts as the orchestrator.
- The Subagent: When a heavy-duty task is identified (e.g., a complex search or data augmentation), the Main Agent delegates the task to a Subagent. The Subagent operates within its own context window, which can be loaded with the heavy, data-intensive spans and traces.
Once the Subagent completes its analysis, it returns only the distilled, relevant result to the Main Agent. This separation of concerns ensures that the primary user experience remains responsive and that the main conversation context remains within manageable token limits.
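A rough sketch of this delegation pattern is shown below. The class names `MainAgent` and `Subagent`, the `count_errors` stand-in for a model call, and the span strings are all assumptions for illustration; a real subagent would run its own LLM loop over the heavy data.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Subagent:
    """Runs one heavy task inside its own private context."""
    analyze: Callable[[list[str]], str]

    def run(self, task: str, heavy_data: list[str]) -> str:
        private_context = [task, *heavy_data]  # exists only for this call
        return self.analyze(private_context)


@dataclass
class MainAgent:
    """Orchestrator whose context holds chat turns and distilled results only."""
    subagent: Subagent
    context: list[str] = field(default_factory=list)

    def handle(self, user_message: str, spans: list[str]) -> str:
        self.context.append(f"user: {user_message}")
        result = self.subagent.run(user_message, spans)
        self.context.append(f"subagent result: {result}")
        return result


def count_errors(private_context: list[str]) -> str:
    # Stand-in for a model call: the first item is the task, the rest are spans.
    spans = private_context[1:]
    errors = [s for s in spans if "status=ERROR" in s]
    return f"{len(errors)} of {len(spans)} spans errored"


spans = [f"span {i} status={'ERROR' if i % 100 == 0 else 'OK'}" for i in range(500)]
main = MainAgent(Subagent(analyze=count_errors))
answer = main.handle("How many spans failed in this trace?", spans)

print(answer)
print(len(main.context))  # the 500 spans never entered the main context
```

The main context grows by two short entries per delegated task, regardless of how many spans the subagent had to read.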
Evaluating Context Decay with Long-Session Evals
A significant difficulty in managing context is that failures often appear "late." An agent might function perfectly for ten turns, only to fail on the eleventh turn because a critical piece of context was truncated.
To combat this, the team implemented Long-Session Evaluations. Rather than testing single-turn prompts, the evaluation pipeline loads a conversation of $N$ turns (e.g., $N = 10$) and tests the agent's ability to perform a task on turn $N+1$. This allows engineers to detect context decay, the point at which the truncation or summarization strategy begins to degrade the agent's reasoning, before it reaches the end user.
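A long-session eval of this kind might look like the sketch below. Everything here is an assumption for illustration: the helper names, the substring-based pass check, and the toy agent, which only sees the last four messages so that decay is guaranteed to occur at a known point.

```python
from typing import Callable, Optional

Agent = Callable[[list[str]], str]  # takes the full history, returns a reply


def eval_turn_n_plus_1(agent: Agent, prior_turns: list[str], probe: str, expected: str) -> bool:
    """Load N prior turns, ask the probe on turn N+1, and check the reply."""
    reply = agent(prior_turns + [probe])
    return expected in reply


def first_failing_length(
    agent: Agent, fact: str, filler: str, probe: str, expected: str, max_turns: int
) -> Optional[int]:
    """Return the smallest N at which the probe fails, or None if it never does."""
    for n in range(1, max_turns + 1):
        prior = [fact] + [filler] * (n - 1)
        if not eval_turn_n_plus_1(agent, prior, probe, expected):
            return n
    return None


def windowed_toy_agent(history: list[str], window: int = 4) -> str:
    # Toy stand-in for an agent with an aggressive truncation strategy.
    for message in history[-window:]:
        if "span-" in message and "failed" in message:
            return message
    return "I don't have that information."


fact = "tool: span-3fa9c1 failed with a timeout"
decay_point = first_failing_length(
    windowed_toy_agent,
    fact=fact,
    filler="user: ok, continue",
    probe="user: Which span failed?",
    expected="span-3fa9c1",
    max_turns=10,
)
print(decay_point)
```

Sweeping $N$ rather than fixing it shows where decay begins, which is more useful for tuning a truncation strategy than a single pass or fail at one session length.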
The Future: Long-Term Memory and Context Budgeting
While the current architecture is stable, the frontier of agent development lies in two areas:
- Long-Term Memory: Currently, Alex's memory is session-based. The next step is implementing cross-session memory, allowing the agent to reference issues and patterns discussed in entirely different chat threads.
- Principled Context Budgeting: The team is moving away from heuristic-based truncation (such as a fixed 100-character head and tail) toward a metric-driven approach that determines which tokens are worth the "cost" of inclusion in the active window.
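The post does not say what the metric-driven approach will look like. Purely as an illustration of the idea of weighing value against token cost, the sketch below greedily selects messages by relevance per token under a fixed budget. The `relevance` scores, the whitespace-based token estimate, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float  # hypothetical score in [0, 1]; its source is an assumption


def estimate_tokens(text: str) -> int:
    # Crude whitespace proxy; a real tokenizer would give different counts.
    return max(1, len(text.split()))


def select_within_budget(candidates: list[Candidate], budget_tokens: int) -> list[Candidate]:
    """Greedily keep the candidates with the best relevance per token."""
    ranked = sorted(
        candidates,
        key=lambda c: c.relevance / estimate_tokens(c.text),
        reverse=True,
    )
    chosen: list[Candidate] = []
    used = 0
    for candidate in ranked:
        cost = estimate_tokens(candidate.text)
        if used + cost <= budget_tokens:
            chosen.append(candidate)
            used += cost
    # Restore conversation order so the context still reads chronologically.
    position = {id(c): i for i, c in enumerate(candidates)}
    chosen.sort(key=lambda c: position[id(c)])
    return chosen


candidates = [
    Candidate("system prompt: analyze traces", 1.0),
    Candidate("old tool output " + "filler " * 50, 0.2),
    Candidate("span-3fa9c1 failed with E504", 0.9),
    Candidate("user: tell me more about that span", 0.8),
]
kept = select_within_budget(candidates, budget_tokens=20)
print([c.text for c in kept])
```

Greedy selection by ratio is a simple heuristic rather than an optimal packing, but it makes the trade-off explicit: the long, low-value tool output is dropped while short, high-value messages stay.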
As we move deeper into the era of autonomous agents, the ability to engineer context will become the primary differentiator between a simple chatbot and a truly capable AI engineer.