Architecting Stateful Compute: Moving Beyond the Replay Model with MicroVM Snapshots and Context Logs
The evolution of backend infrastructure has historically been defined by the pursuit of statelessness. From the inception of CGI in 1993 to the ubiquity of the LAMP stack and the modern era of serverless functions, the "shared nothing" architecture has reigned supreme. In this paradigm, the compute layer remains stateless, offloading all meaningful state to a persistent database. However, the rise of autonomous AI agents is forcing a fundamental architectural shift: we are moving from stateless request-response cycles to long-running, stateful, and durable execution loops.
The Legacy of Statelessness and the Replay Model
For three decades, the industry has relied on the principle that a request plus a database equals a response. This approach allowed for massive scalability and simplified the management of distributed systems. As applications grew in complexity, the need for managing asynchronous side effects—such as processing payments or resizing images—led to the adoption of workflow and durable execution engines.
The "Replay Model" emerged as the standard solution for these multi-step side effects. In this model, every side effect is wrapped in a "step" that is cached upon execution. If a process fails mid-workflow, the engine can re-execute the function, skipping the already-completed steps by replaying the execution history (the replay journal). This provides an inherent audit trail and allows for error recovery and even human-in-the-loop interruptions.
However, the Replay Model has built-in architectural rigidities:
- Determinism Constraints: Developers must ensure that everything outside of defined "steps" is deterministic, or the replay will diverge from the original execution.
- Versioning Complexity: Managing replay journal versioning during code deployments is notoriously difficult.
- Log Bloat: As execution history grows, the replay journal expands, eventually hitting hard limits on entry count or payload size.
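To make the mechanism concrete, here is a minimal sketch of a replay journal. It is an illustration under stated assumptions, not the API of any particular durable execution engine: `ReplayContext`, `charge_card`, and `send_receipt` are hypothetical names, and the journal is an in-memory list rather than durable storage.

```python
from typing import Any, Callable


class ReplayContext:
    """Hypothetical step runner backed by a replay journal (a plain list here)."""

    def __init__(self, journal: list[Any]) -> None:
        self.journal = journal  # results of completed steps, in execution order
        self.cursor = 0         # position of the next step within this run

    def step(self, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.journal):
            result = self.journal[self.cursor]  # replay: reuse the cached result
        else:
            result = fn()                       # first execution: run the side effect
            self.journal.append(result)         # journal it only after it succeeds
        self.cursor += 1
        return result


calls: list[str] = []  # records which side effects actually ran


def charge_card() -> str:
    calls.append("charge")
    return "charge-ok"


def send_receipt() -> str:
    calls.append("receipt")
    if calls.count("receipt") == 1:
        raise RuntimeError("simulated crash on the first attempt")
    return "receipt-sent"


def workflow(ctx: ReplayContext) -> list[str]:
    return [ctx.step(charge_card), ctx.step(send_receipt)]


journal: list[Any] = []
try:
    workflow(ReplayContext(journal))
except RuntimeError:
    pass  # the process "dies" after the first step was journaled

# Re-execute from the top: the charge is replayed, only the receipt runs again.
result = workflow(ReplayContext(journal))
```

Note that the journal is matched to steps purely by position. If the code between steps took a different branch on the second run, the cursor would pair cached results with the wrong steps, which is exactly the determinism constraint listed above.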
The Agentic Disruption: From Orchestration to Autonomy
The introduction of Large Language Models (LLMs) and advanced tool-calling capabilities has fundamentally altered the orchestration layer. In traditional workflows, code orchestrates the LLM (the LLM is a step in a predefined sequence). In the agentic paradigm, the LLM orchestrates the code.
As agents move from simple text classification to complex, multi-turn interactions involving tool use, file system manipulation, and sub-process management, the Replay Model begins to fail. Agents are not transient transactions; they are long-running sessions. The duration of "meaningful work" an agent can sustain is doubling every four to seven months, moving from minutes to potentially days of continuous execution. A replay journal cannot scale indefinitely to capture the unbounded, non-deterministic nature of an agentic loop.
The Two-Roads Solution: Context and Execution Durability
To achieve true durability for agents, we must decouple the agent's state into two distinct, manageable layers: Context Durability and Execution Durability.
1. Context Durability: The Append-Only Log
The first half of an agent's state is its context: the system messages, user prompts, tool calls, tool results, and assistant responses. This is essentially an append-only log of the interaction history.
Unlike the Replay Model, which requires re-executing logic to reconstruct state, Context Durability treats the context as a first-class citizen stored in a durable medium (e.g., a distributed file system, object storage, or a specialized database). Because append-only logs scale exceptionally well, this allows for durability across code versions and machine failures without the overhead of replaying complex logic.
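A context log of this kind can be sketched in a few lines. The example below is an assumption-laden illustration rather than a description of any production system: `ContextLog` is a hypothetical class, the record schema (`role`, `content`) is invented for the example, and a local JSONL file stands in for the durable medium.

```python
import json
import os
import tempfile
from pathlib import Path


class ContextLog:
    """Hypothetical append-only context log: one JSON object per line (JSONL)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, role: str, content: str) -> None:
        record = json.dumps({"role": role, "content": content})
        with self.path.open("a", encoding="utf-8") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the entry to disk before returning

    def load(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


log_path = Path(tempfile.mkdtemp()) / "session.jsonl"
log = ContextLog(log_path)
log.append("system", "You are a coding agent.")
log.append("user", "List the files in /tmp.")
log.append("tool_call", "ls /tmp")

# A fresh object, standing in for a new process or a new code version,
# recovers the history by reading the log; no logic is re-executed.
recovered = ContextLog(log_path).load()
```

Recovery here is a read, not a re-execution, so the code that produced the entries can change between writes and reads without invalidating the history.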
2. Execution Durability: Snapshot and Restore
The second half of the state is the execution layer—the "machine" itself. As agents become more capable, they require a computational environment capable of running local processes, managing file systems, and maintaining memory.
We cannot simply keep a physical or virtual machine running indefinitely to wait for the next user turn; the cost would be prohibitive. Instead, we must move toward a Snapshot and Restore model. This allows us to capture the entire state of the execution environment, shut down the compute resource, and restore it precisely where it left off when a new interaction occurs.
Technical Implementation: From CRIU to Firecracker MicroVMs
The concept of checkpointing is not new—IBM mainframes utilized similar techniques in 1966 to protect expensive, long-running jobs. In 2011, CRIU (Checkpoint/Restore In Userspace) introduced a way to suspend and restore processes by injecting a "parasite" into the process to dump its memory state to disk.
While CRIU is powerful, it has significant limitations for agentic workloads:
- Process Isolation: It checkpoints a process tree rather than a whole machine, making it difficult to handle external dependencies running outside that tree, like an FFmpeg instance or a Chrome browser.
- Resource Leakage: It struggles with capturing all open file descriptors or complex network states.
- Container Complexity: Integrating CRIU with modern container registries and layers introduces significant latency.
To solve this, we transitioned to Firecracker MicroVMs. Unlike CRIU, which operates at the process level, Firecracker allows us to snapshot the entire virtual machine. This captures the kernel state, the entire file system, and all running sub-processes, providing a truly transparent and robust execution environment.
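As a rough sketch of what a whole-VM snapshot cycle looks like, the code below builds the sequence of requests one would send to Firecracker's HTTP API (served over a Unix socket). The endpoint paths and field names follow Firecracker's snapshot documentation as best recalled and should be verified against the version in use; `ApiRequest`, `snapshot_requests`, `restore_requests`, and the file paths are hypothetical helpers for illustration. The requests are only constructed here, not sent.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiRequest:
    """Hypothetical container for one Firecracker API call."""
    method: str
    path: str
    body: dict

    def render(self) -> str:
        return f"{self.method} {self.path} {json.dumps(self.body, sort_keys=True)}"


def snapshot_requests(state_path: str, mem_path: str) -> list[ApiRequest]:
    """Pause the guest, then write a full snapshot (VM state plus guest memory)."""
    return [
        ApiRequest("PATCH", "/vm", {"state": "Paused"}),
        ApiRequest("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": state_path,  # serialized VM and device state
            "mem_file_path": mem_path,    # raw guest memory image
        }),
    ]


def restore_requests(state_path: str, mem_path: str) -> list[ApiRequest]:
    """Load the snapshot into a fresh Firecracker process and resume the guest."""
    return [
        ApiRequest("PUT", "/snapshot/load", {
            "snapshot_path": state_path,
            # Assumption: recent versions take a mem_backend object here;
            # older ones accepted mem_file_path directly.
            "mem_backend": {"backend_type": "File", "backend_path": mem_path},
            "resume_vm": True,
        }),
    ]


plan = snapshot_requests("/tmp/vm.state", "/tmp/vm.mem") \
    + restore_requests("/tmp/vm.state", "/tmp/vm.mem")
for request in plan:
    print(request.render())
```

Because the guest is paused as a unit, every process inside it (the agent, its browser, its FFmpeg child) is captured without any per-process cooperation, which is the contrast with CRIU drawn above.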
Optimizing the Snapshot: Seekable Compression
The primary challenge with MicroVM snapshots is the footprint. A standard 512MB RAM allocation results in a 512MB snapshot, leading to massive storage costs and network latency during restoration.
We implemented seekable compression to mitigate this. The memory image is compressed in independently decodable chunks, so restoration does not have to decompress the entire snapshot; the system decompresses only the specific memory pages required for the current execution. With this scheme, a 512MB snapshot shrinks to approximately 14MB of compressed data, roughly a 36x reduction.
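The core idea of seekable compression can be shown with a toy example. This is a sketch under assumptions, not the format used in production: `zlib` stands in for whatever codec the real system uses, the 4096-byte page size and the in-memory index are illustrative choices, and `compress_seekable` and `read_page` are hypothetical names.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this illustration


def compress_seekable(memory: bytes) -> tuple[bytes, list[tuple[int, int]]]:
    """Compress each page on its own; return the blob and an (offset, length) index."""
    blob = bytearray()
    index: list[tuple[int, int]] = []
    for start in range(0, len(memory), PAGE_SIZE):
        frame = zlib.compress(memory[start:start + PAGE_SIZE])
        index.append((len(blob), len(frame)))  # where this page's frame lives
        blob += frame
    return bytes(blob), index


def read_page(blob: bytes, index: list[tuple[int, int]], page_no: int) -> bytes:
    """Decompress a single page without touching any other frame."""
    offset, length = index[page_no]
    return zlib.decompress(blob[offset:offset + length])


# Toy "guest memory": 64 pages, all zero except a few bytes in page 10.
raw = bytearray(64 * PAGE_SIZE)
raw[10 * PAGE_SIZE:10 * PAGE_SIZE + 5] = b"agent"
memory = bytes(raw)

blob, index = compress_seekable(memory)
page = read_page(blob, index, 10)  # only this one frame is decompressed
```

Compressing pages independently costs some ratio compared with compressing the image as one stream, since the codec cannot share context across frames; in exchange, a restore can fetch and decode pages on demand. Real seekable formats typically use larger frames than a single page to recover some of that ratio.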
The resulting performance figures are transformative:
- Snapshot Latency: Slightly under one second.
- Restore Latency: A few hundred milliseconds.
- Throughput: Capable of sustaining 15,000 VM starts per minute (250 per second).
Conclusion: The Era of Stateful Compute
The infrastructure required for the next generation of AI agents will not be built on the stateless principles of the last 30 years. By combining append-only context logs with high-performance, compressed MicroVM snapshots, we can provide the durability, scalability, and error recovery necessary for autonomous agents. We are entering the era of Stateful Compute, where the boundary between application logic and execution environment dissolves and the two become a single, durable, and recoverable entity.
Through tools like fcrun (a Docker-like CLI for Firecracker), the industry is moving toward a future where the "machine" is as ephemeral as a serverless function, yet as persistent as a database.