xiaoO Memory & Context Compression
Overview
xiaoO treats context as a managed runtime asset rather than a raw transcript. The system keeps live task state, compresses low-value history before it becomes a context-window failure, and provides a path from short-session working memory to durable and searchable long-term memory.
In practice, this gives xiaoO three core properties:
- Layered memory: working memory, session memory, durable memory, semantic recall.
- Adaptive compression: snip, collapse, and auto-compact based on token pressure.
- Memory self-evolution: refresh, deduplicate, merge summaries, absorb corrections, rebuild semantic indexes when needed.
What Is Active Today
Enabled in the default runtime
- working-memory snapshots synchronized from loop state;
- adaptive context compression in `daemon` and `cli`;
- stale tool-noise microcompression;
- multi-agent lane memory isolation;
- traceable compression and prompt-build lifecycle.
Implemented and available for integration
- structured session memory summaries;
- durable long-term memory;
- SQLite + FTS5 + embedding hybrid retrieval;
- prompt-level memory snippet injection;
- session-memory-based prefix replacement compaction.
This distinction is important: compression is turnkey today, while deeper memory layers are currently integrated through Rust APIs.
Runtime Flow
flowchart LR
U["User / Channel"] --> G["Gateway"]
G --> L["Core Agent Loop"]
L --> A["Context Analysis"]
A --> C["Compression Pipeline"]
C --> P["Prompt Builder"]
P --> M["LLM"]
M --> T["Tool Execution"]
T --> L
L --> S["Loop State"]
S --> MS["Memory Snapshot"]
MS -. optional .-> SM["Session Memory"]
MS -. optional .-> DM["Durable Memory"]
DM -. optional .-> SR["Semantic Retrieval"]
Runtime sequence:
- A turn enters the agent loop.
- xiaoO estimates history cost against the token budget.
- If needed, it compresses history before prompt construction.
- The latest loop state is synchronized back into `MemorySnapshot`.
- Older context can later be promoted into session memory or durable memory.
Key Building Blocks
Memory model
| Type | Role |
|---|---|
| `MemorySnapshot` | unified live session memory: messages, facts, task state, prompt history, optional session summary |
| `MemoryManager` | updates and recalls live working memory |
| `SessionMemoryManager` | builds and persists structured session summaries |
| `DurableMemoryManager` | manages long-term memory entries |
| `RecallQuery` / `RecallPacket` | structured recall contract |
Compression model
| Type | Role |
|---|---|
| `CompressionPipeline` | runtime compression contract |
| `ContextManager` | production implementation |
| `ContextAnalysis` | token-pressure analysis result |
| `CompactionPolicy` | effective history budget policy |
Storage and retrieval contracts
| Trait | Role |
|---|---|
| `MemoryStore` | persists snapshots |
| `SessionMemoryStore` | persists session summaries |
| `DurableMemoryStore` | persists long-term memories |
| `SemanticMemoryStore` | hybrid vector + keyword search |
| `EmbeddingProvider` | embedding abstraction |
Compression Behavior
xiaoO does not use one generic truncation pass. It applies staged compression:
- Microcompact: removes stale `tool_use`/`tool_result` pairs.
- History Snip: removes old low-value history while keeping protected messages.
- Context Collapse: replaces older ranges with a dense summary message.
- Auto Compact: final safety layer when history is still too large.
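The staging above can be sketched as a function that maps token pressure to the strongest stage required. This is a minimal illustration, not xiaoO's `ContextManager`: the `Stage` enum, the `Thresholds` struct, and the exact ratio-to-stage mapping are assumptions made for this sketch, and Microcompact is left out on the assumption that it runs independently of pressure.

```rust
/// Compression stage chosen for a turn (names mirror the stages listed above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    None,
    HistorySnip,
    ContextCollapse,
    AutoCompact,
}

/// Hypothetical thresholds, each a fraction of the context window.
pub struct Thresholds {
    pub warning_ratio: f64,
    pub auto_compact_ratio: f64,
    pub blocking_ratio: f64,
}

/// Picks the strongest stage whose threshold the current usage has reached.
/// ASSUMPTION: this ratio-to-stage mapping is illustrative only.
pub fn select_stage(used_tokens: u64, context_window: u64, t: &Thresholds) -> Stage {
    if context_window == 0 {
        // No usable budget: fall through to the final safety layer.
        return Stage::AutoCompact;
    }
    let pressure = used_tokens as f64 / context_window as f64;
    if pressure >= t.blocking_ratio {
        Stage::AutoCompact
    } else if pressure >= t.auto_compact_ratio {
        Stage::ContextCollapse
    } else if pressure >= t.warning_ratio {
        Stage::HistorySnip
    } else {
        Stage::None
    }
}
```

Checking the strongest threshold first means a history that is already near overflow skips the lighter stages instead of walking through each one in turn.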
The compressor explicitly protects:
- the first user task message;
- tool-use/tool-result dependencies;
- recent tail messages;
- execution continuity needed for the next turn.
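A snip pass that honors these protections might look like the sketch below. The `Role` and `Message` types and the `snip_history` function are hypothetical stand-ins, not xiaoO types, and the sketch assumes a tool result directly follows the assistant message that issued the call.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    User,
    Assistant,
    Tool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Message {
    pub role: Role,
    pub text: String,
}

/// Drops middle history while keeping the first user message and the
/// last `preserve_tail` messages.
pub fn snip_history(history: &[Message], preserve_tail: usize) -> Vec<Message> {
    let mut tail_start = history.len().saturating_sub(preserve_tail);

    // Do not start the tail on a tool result: step back so the message
    // that issued the tool call is kept together with its result.
    while tail_start > 0
        && tail_start < history.len()
        && history[tail_start].role == Role::Tool
    {
        tail_start -= 1;
    }

    // The first user message carries the original task and is always kept.
    let first_user = history.iter().position(|m| m.role == Role::User);

    history
        .iter()
        .enumerate()
        .filter(|(i, _)| *i >= tail_start || Some(*i) == first_user)
        .map(|(_, m)| m.clone())
        .collect()
}
```

With five messages and `preserve_tail = 2`, a tail that would begin on a tool result is widened by one message so the call and its result survive together.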
If the provider still returns a context-length error, xiaoO forces a compression retry and rebuilds the prompt automatically.
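The recovery path can be pictured as a bounded retry loop. The sketch below is synchronous and generic for the sake of a self-contained example; `ProviderError`, `send_with_recovery`, and the retry limit are assumptions for illustration and do not describe xiaoO's actual (async) agent loop.

```rust
/// Hypothetical provider error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    ContextLengthExceeded,
    Other(String),
}

/// Sends the history; on a context-length error, compresses and retries,
/// up to `max_retries` times. Any other result is returned unchanged.
pub fn send_with_recovery<S, C>(
    mut history: Vec<String>,
    mut send: S,
    mut compress: C,
    max_retries: usize,
) -> Result<String, ProviderError>
where
    S: FnMut(&[String]) -> Result<String, ProviderError>,
    C: FnMut(Vec<String>) -> Vec<String>,
{
    let mut retries = 0;
    loop {
        match send(&history) {
            Err(ProviderError::ContextLengthExceeded) if retries < max_retries => {
                retries += 1;
                // Force a compression pass, then rebuild and resend.
                history = compress(history);
            }
            other => return other,
        }
    }
}
```

Bounding the retries matters: if compression cannot shrink the history any further, the loop surfaces the provider error instead of spinning.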
Memory Self-Evolution
In xiaoO, "memory self-evolution" refers to concrete mechanisms already present in the architecture:
- every loop refreshes live memory from the newest execution state;
- compact summaries can merge old and new context instead of restarting from scratch;
- duplicate recall lines are removed and normalized;
- failures, corrections, file references, and next-step information are intentionally retained;
- semantic memory can invalidate and rebuild stale embeddings after embedding-model changes.
For long-term memory specifically, xiaoO already exposes asynchronous update and restructuring paths through the memory crate. Durable memory can be written incrementally, replaced in batches, searched semantically, and reindexed, so the main runtime loop never depends on one monolithic synchronous rebuild step.
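As an illustration of the deduplication and normalization step listed above, the sketch below collapses whitespace and drops later duplicates while preserving order. The function name and the case-insensitive comparison are assumptions made here; the source does not specify how xiaoO normalizes recall lines.

```rust
use std::collections::HashSet;

/// Removes blank lines and later duplicates, keeping first occurrences in order.
pub fn dedupe_recall_lines(lines: &[&str]) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut kept = Vec::new();
    for line in lines {
        // Normalize: trim and collapse internal runs of whitespace.
        let cleaned = line.split_whitespace().collect::<Vec<_>>().join(" ");
        if cleaned.is_empty() {
            continue;
        }
        // ASSUMPTION: compare case-insensitively, but keep the first spelling seen.
        if seen.insert(cleaned.to_lowercase()) {
            kept.push(cleaned);
        }
    }
    kept
}
```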
Configuration
The default runtime currently exposes compression settings through daemon/CLI config. Session memory and semantic retrieval are currently configured programmatically.
Compression-related options are also described in daemon_config.md.
| Config | Meaning |
|---|---|
| `[llm].context_window` | optional explicit total context budget override for runtime token budgeting and compression |
| `[compact].warning_ratio` | warning threshold |
| `[compact].auto_compact_ratio` | context-collapse threshold |
| `[compact].blocking_ratio` | final pre-overflow threshold |
| `[compact].snip_stale_after_ms` | age threshold for snipping |
| `[compact].snip_preserve_tail` | tail preserved during snip |
| `[compact].collapse_preserve_tail` | tail preserved during collapse |
| `[compact].summary_max_tokens` | summary token budget |
| `[compact].summary_preserve_tail` | tail preserved during summary |
| `[compact].summary_llm_max_tokens` | max tokens for summary generation |
Example:
[llm]
provider = "openrouter"
model = "z-ai/glm-5"
api_key_env = "OPENROUTER_API_KEY"
context_window = 128000
max_tokens = 8192
[compact]
warning_ratio = 0.6
auto_compact_ratio = 0.75
blocking_ratio = 0.9
snip_stale_after_ms = 3600000
snip_preserve_tail = 6
collapse_preserve_tail = 4
summary_max_tokens = 1024
summary_preserve_tail = 4
summary_llm_max_tokens = 4096
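To make the ratios in this example concrete, the sketch below converts them into absolute token counts for a 128000-token window. It assumes each ratio is applied to `context_window` as a whole; `CompactRatios`, `TokenThresholds`, and the ordering check are hypothetical and exist only for this illustration.

```rust
/// Local mirror of the three ratio settings from the `[compact]` example.
/// Hypothetical: this is not xiaoO's own config type.
#[derive(Debug, Clone, Copy)]
pub struct CompactRatios {
    pub warning_ratio: f64,
    pub auto_compact_ratio: f64,
    pub blocking_ratio: f64,
}

/// Absolute token counts at which each threshold is reached.
#[derive(Debug, PartialEq, Eq)]
pub struct TokenThresholds {
    pub warning: u64,
    pub auto_compact: u64,
    pub blocking: u64,
}

/// Converts ratios into token counts for a given context window.
/// ASSUMPTION: the ordering check is a sanity rule for this sketch only.
pub fn token_thresholds(
    context_window: u64,
    r: CompactRatios,
) -> Result<TokenThresholds, String> {
    let ordered = 0.0 < r.warning_ratio
        && r.warning_ratio < r.auto_compact_ratio
        && r.auto_compact_ratio < r.blocking_ratio
        && r.blocking_ratio <= 1.0;
    if !ordered {
        return Err(format!("ratios must increase within (0, 1]: {:?}", r));
    }
    // Round to the nearest whole token to absorb floating-point error.
    let to_tokens = |ratio: f64| (context_window as f64 * ratio).round() as u64;
    Ok(TokenThresholds {
        warning: to_tokens(r.warning_ratio),
        auto_compact: to_tokens(r.auto_compact_ratio),
        blocking: to_tokens(r.blocking_ratio),
    })
}
```

Under that assumption, the example values work out to 76800 tokens for the warning threshold, 96000 for `auto_compact_ratio`, and 115200 for `blocking_ratio`.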
Important boundary:
- there is no `[memory]` TOML section yet for durable memory or semantic retrieval;
- those features are already implemented in the `memory` crate and can be integrated directly in Rust.
Usage Guide
1. Enable adaptive compression
If you only need the built-in context-management path, configure [compact] and optionally [llm].context_window. This is already wired into:
- `apps/xiaoo-app/src/daemon_runtime.rs`
- `apps/xiaoo-app/src/cli/mod.rs`
- `crates/core/src/agent_loop.rs`
Once configured, the compression check runs automatically before each turn and compresses history only when token pressure requires it.
2. Build session memory
Use SessionMemoryManager when you want a structured session summary that can later augment or replace older context.
use std::sync::Arc;
use agent_types::CompletionConfig;
use memory::{
    FilesystemSessionMemoryStore, SessionMemoryManager, SessionMemoryPolicy,
};

// Supplied by your application (not constructed in this snippet):
//   estimator:    Arc<dyn TokenEstimator>
//   llm_provider: Arc<LlmProviderWrapper>
//   snapshot, now_ms: the inputs passed to build_summary() below
let store = Arc::new(FilesystemSessionMemoryStore::new("./data/memory"));

let manager = SessionMemoryManager::new(
    store,
    estimator,
    SessionMemoryPolicy {
        summary_message_limit: 64,
        summary_instruction_limit: 32,
        summary_fact_limit: 64,
        summary_prompt_history_limit: 16,
        max_section_tokens: 256,
        max_total_tokens: 1024,
    },
    llm_provider,
    CompletionConfig { max_tokens: 2048, temperature: 0.2 },
)?;

// Build a structured summary, then persist it through the store.
let summary = manager.build_summary(snapshot, now_ms).await?;
manager.persist_summary(&summary).await?;
3. Use durable memory without semantic search
For stable long-term memory with simple filesystem persistence:
use std::sync::Arc;
use memory::{
    DurableMemory, DurableMemoryKind, DurableMemoryManager, DurableMemoryPolicy,
    FilesystemDurableMemoryStore,
};

let store = Arc::new(FilesystemDurableMemoryStore::new("./data/memory"));
let durable = DurableMemoryManager::new(
    store,
    DurableMemoryPolicy { max_memories: 10000 },
)?;

// `now_ms` is supplied by the caller.
durable.save_memory(&DurableMemory {
    memory_id: "team-style".into(),
    kind: DurableMemoryKind::Preference,
    content: "Prefer concise release notes with migration callouts.".into(),
    source: "user-preference".into(),
    updated_at: now_ms,
}).await?;
4. Use the built-in SQLite semantic store
The current vector-retrieval path in xiaoO is an embedded SQLite semantic store, not a standalone vector DB service. It combines:
- SQLite persistence;
- FTS5 full-text search;
- embedding storage;
- hybrid ranking.
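Hybrid ranking is commonly a weighted blend of a vector-similarity score and a keyword score. The sketch below shows that idea with the 0.7 and 0.3 weights used in this guide's SQLite example. It is an assumption about the general technique, not a description of how `SqliteDurableMemoryStore` scores results, and it presumes both inputs are already normalized to the same range, with higher meaning better.

```rust
/// Cosine similarity between two vectors; 0.0 for mismatched lengths,
/// empty input, or zero-length vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Weighted blend of a vector score and a keyword score.
pub fn hybrid_score(vector: f32, keyword: f32, vector_weight: f32, keyword_weight: f32) -> f32 {
    vector_weight * vector + keyword_weight * keyword
}

/// Ranks `(id, vector_score, keyword_score)` candidates, best first.
pub fn rank<'a>(
    candidates: &[(&'a str, f32, f32)],
    vector_weight: f32,
    keyword_weight: f32,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = candidates
        .iter()
        .map(|(id, v, k)| (*id, hybrid_score(*v, *k, vector_weight, keyword_weight)))
        .collect();
    // Sort descending by blended score.
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored
}
```

A higher vector weight favors paraphrased matches, while the keyword weight keeps exact identifiers and error strings from being outranked by loosely related text.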
Enable the sqlite feature on the memory crate:
[dependencies]
memory = { path = "../crates/memory", features = ["sqlite"] } # adjust to your workspace layout
Example:
use std::sync::Arc;
use memory::{
    DurableMemory, DurableMemoryKind, DurableMemoryManager, DurableMemoryPolicy,
    EmbeddingProvider, OpenAiEmbedding, SemanticMemoryStore, SemanticSearchQuery,
    SqliteDurableMemoryStore,
};

let api_key = std::env::var("OPENAI_API_KEY")?;
let embedder: Arc<dyn EmbeddingProvider> = Arc::new(
    OpenAiEmbedding::new(
        "https://api.openai.com/v1",
        &api_key,
        "text-embedding-3-large",
        3072,
    )
);

let store = Arc::new(SqliteDurableMemoryStore::new(
    "./data/memory/durable.db",
    embedder,
    0.7,  // vector weight
    0.3,  // keyword weight
    2048, // embedding cache size
)?);

let durable = DurableMemoryManager::new(
    store.clone(),
    DurableMemoryPolicy { max_memories: 10000 },
)?;

// `now_ms` is supplied by the caller.
durable.save_memory(&DurableMemory {
    memory_id: "deploy-runbook-001".into(),
    kind: DurableMemoryKind::Procedure,
    content: "Blue-green deploy requires health check verification before cutover.".into(),
    source: "ops-runbook".into(),
    updated_at: now_ms,
}).await?;

// Search goes through the store, which implements SemanticMemoryStore.
let results = store.search(&SemanticSearchQuery {
    query_text: "How do we cut over a blue-green deploy safely?".into(),
    limit: 10,
    session_id: None,
    kind_filter: Some(DurableMemoryKind::Procedure),
}).await?;
5. Build semantic recall packets
If you want semantic memory to flow into later prompt logic, use build_recall_with_semantic():
use memory::{MemoryManager, RecallQuery};

// `memory_manager`, `durable_memories`, and `semantic_store` are supplied by the caller.
let packet = memory_manager
    .build_recall_with_semantic(
        &RecallQuery {
            max_instruction_count: 16,
            max_fact_count: 32,
            max_prompt_history_count: 8,
            include_session_memory: true,
            include_durable_memory: true,
            semantic_query: Some("release workflow".into()),
            semantic_limit: 10,
        },
        &durable_memories,
        semantic_store.as_ref(),
    )
    .await?;
6. Chunk large documents before ingestion
For large Markdown assets such as runbooks or specs, chunk first and store chunk-by-chunk:
use memory::{chunk_markdown, DurableMemory, DurableMemoryKind};

let chunks = chunk_markdown(markdown_text, 512);
for chunk in chunks {
    // One durable memory per chunk, keyed by the chunk's position.
    durable.save_memory(&DurableMemory {
        memory_id: format!("runbook-{:04}", chunk.index),
        kind: DurableMemoryKind::Procedure,
        content: chunk.content,
        source: chunk.heading.as_deref().unwrap_or("runbook").to_string(),
        updated_at: now_ms,
    }).await?;
}
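For intuition about what heading-aware chunking involves, here is a standalone sketch. It is not the implementation of `memory::chunk_markdown`, whose internals and the meaning of its size argument are not described here. The `Chunk` struct, `chunk_by_heading`, and the byte-based size limit are all hypothetical, and the sketch does not special-case `#` lines inside fenced code.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub index: usize,
    pub heading: Option<String>,
    pub content: String,
}

/// Pushes the buffered text as one chunk (if non-empty) and clears the buffer.
fn flush(chunks: &mut Vec<Chunk>, heading: &Option<String>, buf: &mut String) {
    let content = buf.trim().to_string();
    if !content.is_empty() {
        let index = chunks.len();
        chunks.push(Chunk { index, heading: heading.clone(), content });
    }
    buf.clear();
}

/// Splits Markdown at heading lines, and splits again whenever a section's
/// buffered text would exceed `max_bytes`.
pub fn chunk_by_heading(markdown: &str, max_bytes: usize) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut heading: Option<String> = None;
    let mut buf = String::new();

    for line in markdown.lines() {
        if line.starts_with('#') {
            // A new heading closes the previous section.
            flush(&mut chunks, &heading, &mut buf);
            heading = Some(line.trim_start_matches('#').trim().to_string());
            continue;
        }
        if !buf.is_empty() && buf.len() + line.len() + 1 > max_bytes {
            flush(&mut chunks, &heading, &mut buf);
        }
        buf.push_str(line);
        buf.push('\n');
    }
    flush(&mut chunks, &heading, &mut buf);
    chunks
}
```

Carrying the nearest heading on every chunk gives each stored memory a meaningful `source`, which is what the ingestion loop above relies on.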
Connecting an External Vector Database
If you want pgvector, Milvus, Weaviate, Qdrant, Elasticsearch vector search, or another external backend, the intended integration path is to implement SemanticMemoryStore.
Your backend should satisfy:
- `DurableMemoryStore` for CRUD;
- `SemanticMemoryStore` for `search()` and `reindex()`.
Skeleton:
use async_trait::async_trait;
// The traits and data types below come from the memory crate.

#[async_trait]
impl DurableMemoryStore for MySemanticStore {
    async fn save_memory(&self, memory: &DurableMemory) -> std::io::Result<()> { todo!() }
    async fn load_memory(&self, memory_id: &str) -> std::io::Result<DurableMemory> { todo!() }
    async fn list_memories(&self) -> std::io::Result<Vec<DurableMemoryManifestEntry>> { todo!() }
    async fn delete_memory(&self, memory_id: &str) -> std::io::Result<()> { todo!() }
    async fn replace_all(&self, memories: &[DurableMemory]) -> std::io::Result<Vec<DurableMemoryManifestEntry>> { todo!() }
}

#[async_trait]
impl SemanticMemoryStore for MySemanticStore {
    async fn search(&self, query: &SemanticSearchQuery) -> MemoryResult<Vec<ScoredMemory>> { todo!() }
    async fn reindex(&self) -> MemoryResult<usize> { todo!() }
}
Recommended approach:
- Keep `DurableMemory` as the canonical application-level model.
- Generate embeddings via `EmbeddingProvider`, or delegate that to your backend.
- Use your external backend for vector lookup and optional lexical lookup.
- Return `Vec<ScoredMemory>` so the rest of the xiaoO memory pipeline stays unchanged.
Use the built-in SQLite store if you want minimal operations and embedded deployment. Build a custom semantic store if you need larger scale, distributed storage, or a specific enterprise search stack.
Observability
xiaoO traces context-management behavior so you can answer:
- when compression triggered;
- how much history was removed;
- whether the runtime snipped or summarized;
- whether a provider context-limit error triggered recovery.
This is especially useful when tuning compaction thresholds in production.