Engineering a Multi-Tiered Semantic Memory Architecture for Claude Code: Integrating MemSearch and Hermes

Memory management in Claude Code trails what the open-source community has built. Claude Code provides a foundational persistence mechanism, but it lacks the retrieval-augmented generation (RAG) and long-term context management that complex, multi-project agentic workflows require. To move beyond simple file-based persistence, we need a hybrid architecture that combines the exhaustive capture of MemSearch with the curated, efficient injection logic of Hermes.

The Triad of Memory Systems: Storage, Injection, and Recall

Any robust agentic memory system must solve three fundamental engineering challenges:

  1. Storage (The Write Path): How and when is information persisted? This involves determining the trigger for a "write" operation and the structure of the underlying data store.
  2. Injection (The Read Path/Context Loading): How is relevant context pushed into the LLM's active context window? This requires a strategy for maintaining a "lean" context window, avoiding token bloat while ensuring high-signal information is present.
  3. Recall (The Retrieval Path): How does the agent query its own history? This necessitates a multi-tiered search strategy, ranging from local context lookups to deep semantic queries in a vector database.
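To make the three responsibilities concrete, here is a minimal sketch of them as a single interface. The names `MemoryRecord`, `MemoryBackend`, and `InMemoryBackend` are hypothetical and exist only for illustration; none of the systems discussed here exposes this API, and the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class MemoryRecord:
    """One persisted fact or turn summary (hypothetical shape)."""
    text: str
    source: str  # e.g. a file path or a session id


class MemoryBackend(Protocol):
    """The three paths every memory system has to provide."""

    def store(self, record: MemoryRecord) -> None: ...        # write path
    def inject(self, token_budget: int) -> str: ...           # context loading
    def recall(self, query: str, limit: int = 5) -> list[MemoryRecord]: ...


@dataclass
class InMemoryBackend:
    """Toy backend: a list for storage, substring match for recall."""
    records: list[MemoryRecord] = field(default_factory=list)

    def store(self, record: MemoryRecord) -> None:
        self.records.append(record)

    def inject(self, token_budget: int) -> str:
        # Newest first; stop as soon as the rough character budget is spent.
        budget_chars = token_budget * 4  # assumption: ~4 characters per token
        lines: list[str] = []
        used = 0
        for record in reversed(self.records):
            cost = len(record.text) + 1
            if used + cost > budget_chars:
                break
            lines.append(record.text)
            used += cost
        return "\n".join(lines)

    def recall(self, query: str, limit: int = 5) -> list[MemoryRecord]:
        needle = query.lower()
        hits = [r for r in self.records if needle in r.text.lower()]
        return hits[:limit]
```

The systems compared below differ mainly in how much effort they put into each of these three methods.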

Analyzing the Baseline: Claude Code's Native Implementation

Claude Code uses a per-project and global storage model. It monitors conversations and silently writes significant interactions to .md files within .claudprojects/projects/memory. A memory.md index tracks these files. If a specific preference or piece of information is repeated three or more times, the system promotes that data to a global .claude/memory directory.

While functional for simple persistence, the injection layer is limited to the claude.md file, which is injected into the system prompt at the start of every session. Recall is notably weak: there is no way to search past sessions without manual intervention (e.g., using the resume flag), so reconstructing context often means reading history by brute force at a high token cost.

The Open-Source Frontier: MemSearch and Hermes

To build a superior system, we can extract architectural patterns from two prominent open-source implementations.

1. MemSearch: The Exhaustive Vectorized Approach

MemSearch focuses on high-fidelity, long-term recall through a "stop hook" mechanism. After every turn in a conversation, a hook is triggered, calling Claude Haiku to summarize the turn into bullet points. This data is appended to a memory/[date] file, utilizing session anchors for traceability.
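The write path described above can be sketched in a few lines. This is an illustration under stated assumptions, not MemSearch's actual hook: the function names, the HTML-comment anchor format, and the file naming are hypothetical, and the summarizer is passed in as a plain callable so that a stand-in can replace the Haiku call.

```python
from datetime import datetime
from pathlib import Path
from typing import Callable


def on_turn_end(
    memory_dir: Path,
    session_id: str,
    turn_text: str,
    summarize: Callable[[str], list[str]],
    now: datetime,
) -> Path:
    """Summarize one turn and append the bullets to that day's Markdown log."""
    bullets = summarize(turn_text)
    memory_dir.mkdir(parents=True, exist_ok=True)
    log_path = memory_dir / f"{now:%Y-%m-%d}.md"
    # Hypothetical anchor format: a comment tying the bullets to a session.
    lines = [f"<!-- session:{session_id} time:{now:%H:%M} -->"]
    lines += [f"- {bullet}" for bullet in bullets]
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n\n")
    return log_path


def first_sentence_summary(turn_text: str) -> list[str]:
    """Stand-in summarizer; a real hook would call a small model instead."""
    first = turn_text.strip().split(". ")[0].rstrip(".")
    return [first] if first else []
```

Because the log is append-only Markdown, it stays human-readable and can be re-indexed at any time without touching the sessions that produced it.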

The technical core of MemSearch is its indexing pipeline:

  • Chunking & Hashing: Data is partitioned into discrete chunks.
  • Vectorization: These chunks are transformed into dense vectors via embeddings.
  • Storage: The vectors are stored in a Milvus vector database, running locally on the CPU to ensure zero API overhead.
  • Hybrid Retrieval: MemSearch employs a three-tier retrieval system:
    • Tier 1 (Semantic/Keyword): Uses dense vectors for semantic meaning (e.g., matching "revenue" to "monetization") and BM25 for exact keyword matching.
    • Tier 2 (Expansion): If Tier 1 fails, the system retrieves related chunks to provide broader context.
    • Tier 3 (Raw Transcript): As a last resort, the system accesses the raw, unsummarized session dialogue.
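Tier 1 has to merge two ranked lists, one from the dense search and one from BM25. The text above does not say how MemSearch fuses them; the sketch below uses reciprocal rank fusion (RRF), a common choice for this job, purely as an assumed stand-in. The chunk ids and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk, so a chunk that ranks
    well in both the semantic and the keyword list rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by id to keep the output stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c2", "c1", "c3"]  # e.g. "monetization" matched semantically
bm25_hits = ["c1", "c4", "c2"]   # e.g. "revenue" matched literally
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

RRF works on ranks alone, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.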

2. Hermes: The Curated Agentic Approach

Hermes prioritizes "lean" context through agent-led curation. Instead of passive capture, the agent utilizes specific tools (add, replace, remove) to manage memory.md and user.md files.

Key features include:

  • Character Caps: Enforced limits on memory.md ensure that only the most high-signal information is retained, preventing context window saturation.
  • The Curator: A periodic background process (e.g., every seven days) that prunes and consolidates information, removing raw transcripts and leaving only distilled facts.
  • Frozen Snapshot Injection: At the start of a session, Hermes injects a "frozen snapshot" of memory.md, user.md, and soul.md. This typically totals ~1,300 tokens. Because this is loaded at the session start, it can be cached, significantly reducing per-message token costs.
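The add/replace/remove tools and the character cap fit together roughly as follows. This is a sketch of the idea, not Hermes's implementation: the class name, the error type, and the rule that a substring must match exactly one entry are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


class MemoryFullError(Exception):
    """Raised when a write would push the file past its character cap."""


@dataclass
class CuratedMemory:
    char_cap: int
    entries: list[str] = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(self.entries)

    def _commit(self, candidate: list[str]) -> None:
        # Reject the write instead of truncating, so the agent must curate.
        if len("\n".join(candidate)) > self.char_cap:
            raise MemoryFullError(f"cap of {self.char_cap} characters exceeded")
        self.entries = candidate

    def _find(self, substring: str) -> int:
        matches = [i for i, e in enumerate(self.entries) if substring in e]
        if len(matches) != 1:
            raise LookupError(f"expected exactly one match, found {len(matches)}")
        return matches[0]

    def add(self, entry: str) -> None:
        self._commit(self.entries + [entry])

    def replace(self, old_substring: str, new_entry: str) -> None:
        i = self._find(old_substring)
        self._commit(self.entries[:i] + [new_entry] + self.entries[i + 1:])

    def remove(self, substring: str) -> None:
        i = self._find(substring)
        self._commit(self.entries[:i] + self.entries[i + 1:])
```

Failing the write when the cap is hit is the design point: the agent has to replace or remove something first, which is what keeps the file high-signal.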

The Proposed Hybrid Architecture: A 4-Tiered Strategy

The optimal architecture for Claude Code is a synthesis: using MemSearch for completeness of storage and Hermes for efficiency of injection.

The Lifecycle of a Conversation

1. Storage & Maintenance

We leverage Claude Code's native auto-memory for immediate project-level persistence but augment it with a MemSearch-style stop hook. This ensures every turn is captured, summarized by Haiku, appended to the Markdown logs, and indexed in a local Milvus-backed database. A nightly job runs the memsearch index process to re-index and consolidate these logs. The Markdown files remain the source of truth, so the vector database can be rebuilt from them if necessary.
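A nightly re-index stays cheap if chunks are identified by a hash of their content, because unchanged chunks never need to be embedded again. The sketch below shows that planning step only. The paragraph-packing chunker, the 16-character SHA-256 ids, and the function names are assumptions for illustration, and the actual embedding and Milvus writes are deliberately left out.

```python
import hashlib


def chunk_markdown(text: str, max_chars: int = 800) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def chunk_id(chunk: str) -> str:
    """Content-derived id: identical text always maps to the same id."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def plan_reindex(
    text: str, indexed_ids: set[str], max_chars: int = 800
) -> tuple[dict[str, str], set[str]]:
    """Return (new chunks to embed keyed by id, stale ids to delete)."""
    current = {chunk_id(c): c for c in chunk_markdown(text, max_chars)}
    to_embed = {cid: c for cid, c in current.items() if cid not in indexed_ids}
    stale = indexed_ids - set(current)
    return to_embed, stale
```

Calling `plan_reindex` with an empty `indexed_ids` set amounts to a full rebuild from Markdown; calling it with the ids already in the database yields only the difference.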

2. Optimized Injection

We adopt the Hermes "frozen snapshot" model. At the start of a session, we inject a curated set of files: memory.md, user.md, soul.md, and a daily_log. This expands the initial context to approximately 3,000 tokens. By using a cached snapshot, we ensure that the agent starts with high-density, recent information without incurring the latency or cost of re-processing the entire history every time a message is sent.
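One way to assemble such a snapshot is sketched below. The file name `daily_log.md`, the section headers, and the 4-characters-per-token estimate are assumptions; a real implementation would use the model's tokenizer and whatever layout the injected files actually have.

```python
from pathlib import Path

# Priority order: earlier files are kept whole, later ones get truncated first.
SNAPSHOT_FILES = ["memory.md", "user.md", "soul.md", "daily_log.md"]


def estimate_tokens(text: str) -> int:
    """Rough estimate (~4 characters per token), not a real tokenizer."""
    return (len(text) + 3) // 4


def build_snapshot(
    root: Path, token_budget: int = 3000, files: list[str] = SNAPSHOT_FILES
) -> str:
    """Concatenate the memory files into one block that fits the budget."""
    sections: list[str] = []
    remaining = token_budget
    for name in files:
        if remaining <= 0:
            break
        path = root / name
        if not path.is_file():
            continue  # a missing file is skipped, not treated as an error
        section = f"## {name}\n{path.read_text(encoding='utf-8').strip()}"
        if estimate_tokens(section) > remaining:
            section = section[: remaining * 4]  # cut to the remaining budget
        sections.append(section)
        remaining -= estimate_tokens(section)
    return "\n\n".join(sections)
```

The snapshot is "frozen" in the sense that `build_snapshot` runs once at session start and the resulting string is reused for every message. Keeping that prefix byte-identical is what lets prompt caching apply to it.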

3. The 4-Tiered Recall Pipeline

To solve the recall gap, we implement a progressive disclosure retrieval strategy:

  • Tier 0 (Contextual Check): The agent first queries the already-injected context (memory.md and daily_log). This is a zero-cost, near-instantaneous lookup.
  • Tier 1 (Hybrid Search): If the information is not in the active window, the system executes a hybrid search in the Milvus database, utilizing both dense vector embeddings (semantic) and BM25 (keyword).
  • Tier 2 (Contextual Expansion): If Tier 1 yields low-confidence matches, the system retrieves expanded chunks and summaries surrounding the identified vectors.
  • Tier 3 (Deep Transcript Retrieval): As a final fallback, the system retrieves the raw, unsummarized session dialogue from the historical logs.
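The escalation logic of this pipeline is small enough to sketch directly. Each tier is modeled as a function that returns text on success or `None` to pass the query down. The names `RecallResult`, `recall`, and `make_snapshot_tier` are hypothetical, and tiers 1 to 3 would wrap the Milvus search, the chunk expansion, and the transcript lookup, none of which are implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Tier = Callable[[str], Optional[str]]


@dataclass
class RecallResult:
    tier: int  # index of the tier that answered
    text: str


def recall(query: str, tiers: list[Tier]) -> Optional[RecallResult]:
    """Try each tier in order and stop at the first one that answers."""
    for level, tier in enumerate(tiers):
        answer = tier(query)
        if answer:
            return RecallResult(level, answer)
    return None


def make_snapshot_tier(snapshot: str) -> Tier:
    """Tier 0: scan the already-injected snapshot, line by line."""

    def lookup(query: str) -> Optional[str]:
        terms = query.lower().split()
        if not terms:
            return None
        for line in snapshot.splitlines():
            if all(term in line.lower() for term in terms):
                return line
        return None

    return lookup
```

Each tier decides for itself what counts as a miss. A hybrid-search tier, for example, could return `None` whenever its best score falls below a confidence threshold, which is what hands the query to the expansion tier. Cheaper tiers run first, so the expensive transcript lookup only happens when everything else has failed.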

With this architecture, Claude Code moves from a nearly stateless tool toward an agentic OS that maintains deep, searchable, token-efficient long-term memory across projects and sessions.