ContextFit retrieves the right prior conversation without embedding APIs, vector databases, or GPU hardware — with structure-aware ingestion for MD, TXT, TMD ledger, JSON, CSV, email, calendar, and code files.
Every major agent memory system in 2026 converts conversations to dense vectors before retrieval. This works for factual lookups — and breaks for the queries agents most frequently need to answer.
An entire session compressed into one vector loses episode structure. "What should I cook tonight?" shares almost no vector proximity with "I just harvested zucchini from my garden" — even though that's exactly the session the agent needs.
Every memory access requires an embedding call. At real agent throughput — thousands of turns per day — this compounds into a meaningful infrastructure cost line, plus a vector database on top.
Sending memory retrieval queries to an embedding API means your user's personal context crosses a network boundary on every turn. For privacy-sensitive use cases this is a non-starter.
Embedding API + vector database + embedding model versioning + index management = four moving parts to provision, monitor, and keep synchronized.
When retrieval fails, the answer is "cosine distance was 0.62." There's no auditable explanation for why a session was or wasn't surfaced.
Running embedding models locally requires PyTorch and benefits significantly from GPU acceleration — 500MB–2GB of dependencies before a single session is indexed.
The most valuable signals in conversational memory are structural, not semantic: "What kind of memory did the user express?" and "Does this episode's memory type match what this query needs?" Both questions are answerable with token-level pattern matching, with no embedding model required.
Deterministic, domain-agnostic fact extraction from user-authored turns. Eight typed primitives — preference, goal, constraint, decision, temporal update, open loop, interest, entity — extracted with zero API calls.
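As a rough illustration of pattern-based extraction, the sketch below tags sentences with three of the eight primitive types. The regular expressions and the `extract_primitives` name are assumptions made for this example; they are not ContextFit's actual rule set.

```python
import re

# Hypothetical patterns for three of the eight primitive types.
# Illustrative only; the real extraction rules are not shown in this document.
PATTERNS = {
    "preference": re.compile(r"\bI (?:really )?(?:love|like|prefer|hate|dislike)\b", re.I),
    "goal": re.compile(r"\bI(?:'m| am)? (?:want|plan|hope|trying) to\b", re.I),
    "constraint": re.compile(r"\bI(?:'m| am)? (?:can't|cannot|allergic to)\b", re.I),
}

def extract_primitives(turn: str) -> list[tuple[str, str]]:
    """Return (type, sentence) pairs for every sentence that matches a pattern."""
    found = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", turn.strip()):
        for kind, pattern in PATTERNS.items():
            if pattern.search(sentence):
                found.append((kind, sentence))
    return found

facts = extract_primitives(
    "I love spicy food. I want to run a marathon. I can't eat gluten."
)
```

Because every rule is a plain pattern, a surfaced fact can always be traced back to the sentence and rule that produced it.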
Numeric session ranking by structural memory-signal type, not vector proximity. Answers "does this session have what this query type needs?" — the right question for vague advice queries.
Deterministic zero-cost dispatch to the right retrieval mode per query. Vague advice → episode scorer. Specific facts → BM25. Temporal state → fusion. No LLM routing, no latency.
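A deterministic router of this kind can be sketched with a few surface cues. The cue lists and the `route` function are hypothetical; only the `episode_score` route name appears in the quick-start example, while `bm25` and `fusion` are labels chosen here for illustration.

```python
import re

# Hypothetical cue lists; the real routing rules are not given in this document.
TEMPORAL = re.compile(r"\b(?:still|currently|now|latest|anymore|these days)\b", re.I)
SPECIFIC = re.compile(r"\b(?:who|when|where|which|how (?:many|much))\b|\d", re.I)

def route(query: str) -> str:
    """Pick a retrieval mode from surface cues only (no model call)."""
    if TEMPORAL.search(query):
        return "fusion"         # temporal state
    if SPECIFIC.search(query):
        return "bm25"           # specific facts
    return "episode_score"      # vague advice and everything else

mode = route("what should I cook for dinner tonight?")
```

The rules run in a fixed order, so the same query always takes the same path and the decision can be logged alongside the result.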
Ten token-native features re-score BM25 candidates: lexical overlap, behavior-marker alignment, named entity overlap, question-type slot matching. Keeps exact-match paths interpretable and local.
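To make the re-scoring idea concrete, here is a minimal sketch using two of the listed features: lexical overlap and a crude named-entity overlap based on capitalized words. The weights, helper names, and the entity heuristic are assumptions for this example, not the library's feature set.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def entities(text: str) -> set[str]:
    """Crude stand-in for named entities: capitalized words of 2+ letters."""
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def rerank(query, candidates, w_lex=1.0, w_ent=2.0):
    """candidates: list of (session_id, text) in BM25 order. Returns re-sorted ids."""
    q_tok, q_ent = tokens(query), entities(query)
    scored = []
    for rank, (sid, text) in enumerate(candidates):
        lex = len(q_tok & tokens(text)) / max(len(q_tok), 1)
        ent = len(q_ent & entities(text)) / max(len(q_ent), 1)
        # -rank keeps the original BM25 order when scores tie
        scored.append((w_lex * lex + w_ent * ent, -rank, sid))
    return [sid for _, _, sid in sorted(scored, reverse=True)]

order = rerank(
    "Where did Alice move?",
    [("s1", "I moved my desk where the light is better."),
     ("s2", "Alice said she will move to Lisbon.")],
)
```

Each feature is a plain number, so a ranking change can be explained by printing the per-feature scores.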
Router-gated taste retrieval for personalized recommendations. It promotes user-authored preference evidence with marker detection, preference-window overlap, and lightweight token normalization — beating embedding baselines on preference R@1.
For multi-session synthesis, preserves the strongest anchor and promotes companion sessions with complementary evidence — lifting LongMemEval multi-session All@5 from 55.4% to 65.3% without embeddings.
RAG cost is not just search cost. It is every oversized prompt, every rerank, and every retry needed before the model finally sees the right evidence. ContextFit is designed to retrieve less, better — reducing the context loops between question and correct answer.
LongMemEval-S session retrieval asks one clean question: did the system put a ground-truth answer session in the top 5? No answer generation, no LLM judge, no hidden scoring model. Answer-quality benchmarks stay separate because they depend on the downstream model and judge.
| System | R@1 | R@3 | R@5 | R@10 | Embeddings | Vector store |
|---|---|---|---|---|---|---|
| gbrain-hybrid published | — | — | 97.60% | — | yes | local |
| MemPalace raw published | — | — | 96.60% | — | yes | local |
| ContextFit token-native | 81.80% | 90.40% | 96.20% | 97.80% | no | no |
| ContextFit + OpenAI fusion | 84.60% | 95.20% | 99.00% | 99.60% | yes | no |
Recall is scored against each question's answer_session_ids. ContextFit scores are from local full-run artifacts; gbrain and MemPalace are published reference rows. This is retrieval recall, not answer accuracy. MemPalace raw is the 96.6% R@5 zero-API baseline cited by gbrain-evals and MemPalace benchmark docs. Full run notes: ContextFit comparison artifact.
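The R@k columns follow the any-hit definition described above: a question counts as a hit if at least one ground-truth session appears in the top k. A minimal sketch, with function names chosen for this example:

```python
def recall_at_k(ranked_ids, answer_session_ids, k):
    """1.0 if any ground-truth session is in the top k, else 0.0."""
    return float(any(sid in answer_session_ids for sid in ranked_ids[:k]))

def mean_recall_at_k(runs, k):
    """runs: list of (ranked_ids, answer_session_ids) pairs, one per question."""
    return sum(recall_at_k(ranked, answers, k) for ranked, answers in runs) / len(runs)

# Toy data, not benchmark output
runs = [
    (["a", "b", "c"], {"c"}),
    (["x", "y", "z"], {"q"}),
]
score = mean_recall_at_k(runs, k=3)
```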
| Agent-memory eval | R@1 | R@3 | MRR | API Cost | GPU |
|---|---|---|---|---|---|
| Mem0 79-case | 54.4% | 91.1% | 0.716 | LLM + embed | — |
| Cohere embed-english-v3 | 58.7% | 91.4% | 0.751 | embed API | — |
| ContextFit + routed rerankers | 62.7% | 94.0% | 0.784 | $0 | ✓ CPU |
| OpenAI text-embedding-3-small | 63.1% | 96.6% | 0.792 | embed API | — |
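For reference, the MRR column is the mean reciprocal rank of the first relevant session. The sketch below assumes the standard definition, with a score of 0 when no relevant session is returned; the function names are illustrative.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant id, or 0.0 if none is present."""
    for position, sid in enumerate(ranked_ids, start=1):
        if sid in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Toy data, not benchmark output
mrr = mean_reciprocal_rank([
    (["a", "b"], {"a"}),
    (["a", "b"], {"b"}),
    (["a", "b"], {"z"}),
])
```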
Strengths where AI agents most need them. The hardest memory failures are rarely simple keyword lookups — they are open loops, preferences, temporal changes, and cross-session synthesis. These are the moments where better retrieval reduces retries, preserves user trust, and turns memory into useful action.
The token-native architecture isn't just a performance choice — it's a deployment choice. ContextFit runs anywhere a Python process runs.
The index is a directory of flat files — zstd-compressed token arrays, BM25 postings, LSH signatures. No Qdrant, no Pinecone, no pgvector. No service to start, no schema to migrate.
Zero PyTorch, zero CUDA, zero model weights. Every operation runs on CPU: BM25 scoring, roaring bitmap intersection, MinHash LSH, episode feature computation, structural reranking.
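BM25 scoring is plain arithmetic over term counts, which is why it runs comfortably on CPU. Below is a self-contained sketch of the common Okapi BM25 formula. It uses string tokens and default parameters `k1=1.5`, `b=0.75` as assumptions; ContextFit itself works on token IDs and its exact variant is not specified here.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """docs: list of token lists. Returns one BM25 score per document."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of docs containing each term
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["zucchini", "harvest", "garden"],
    ["tax", "form", "deadline"],
    ["garden", "hose"],
]
scores = bm25_scores(["zucchini", "garden"], docs)
```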
Total dependency footprint is ~41MB (tiktoken + numpy + pyroaring + datasketch + zstandard). PyTorch alone is 500MB–2GB. Fits in a Lambda function, a slim container, or a mobile app bundle.
Standard POSIX permissions. The index lives alongside your data — inside an encrypted vault, a git repo, a synced drive. Back it up with rsync. No DB dump procedures, no export formats.
| Property | ContextFit | Embedding + vector DB |
|---|---|---|
| Database | None | Qdrant / Chroma / pgvector |
| GPU | Not required | Recommended for local models |
| Dependency size | ~41MB | 500MB–2GB+ (PyTorch alone) |
| Storage format | Plain files | DB-managed blobs |
| Permissions | POSIX filesystem | DB users / ACLs |
| Offline capable | Yes (default path) | No (API) / Partial (local model) |
| Backup | Any file backup tool | DB dump + vector store export |
| API cost (default) | $0 | Per-embedding call |
| Query latency | 0.5–9ms in-process | 50–500ms+ (embed + vector search) |
ContextFit still stores and searches token IDs, but file ingestion now chooses smarter boundaries first: Markdown headings, prose paragraphs, TMD ledger rows, JSON events, CSV/TSV records, email messages, calendar events, and source code symbols. The result is more auditable evidence with richer per-chunk metadata.
Heading path, section level, and ordinal metadata travel with each chunk. Paragraphs, lists, tables, blockquotes, and code fences stay together where possible.
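A simplified sketch of heading-aware chunking: split on ATX headings and attach the heading path, level, and ordinal to each chunk. The `chunk_markdown` name and the dictionary keys are assumptions for this example, and for brevity it does not special-case `#` lines inside code fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_markdown(text: str) -> list[dict]:
    """Split on ATX headings; each chunk carries its heading path and level."""
    chunks, path, body = [], [], []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "heading_path": [title for _, title in path],
                "level": path[-1][0] if path else 0,
                "ordinal": len(chunks),
                "text": content,
            })
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that just ended
            level = len(match.group(1))
            # Drop headings at the same or deeper level before descending
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

With this shape, a retrieved chunk can be cited as, for example, "Garden > Harvest" rather than as an anonymous token window.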
Text files are grouped by natural paragraphs and separators, with overlap by whole paragraph rather than arbitrary token tails.
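Whole-paragraph overlap can be sketched as follows. The budget is counted in words here for simplicity (a token budget would work the same way), and the function name and defaults are assumptions for this example.

```python
import re

def chunk_paragraphs(text: str, max_words: int = 120, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks; consecutive chunks share `overlap` whole paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        size = sum(len(p.split()) for p in current)
        if current and size + len(para.split()) > max_words:
            chunks.append("\n\n".join(current))
            # Carry over whole paragraphs, never a partial tail
            current = current[-overlap:] if overlap else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

parts = chunk_paragraphs(
    "one two three\n\nfour five six\n\nseven eight nine",
    max_words=6,
    overlap=1,
)
```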
TMD ledger files chunk by rows while preserving schema and front-matter context, making source rows easier to cite and verify.
API exports, event logs, and chat streams chunk by object or line while retaining stable path, line, and index metadata.
Tabular exports chunk by row with header fields preserved in the source text, keeping ledgers, inventories, and reports easy to cite.
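Row-level chunking with header fields kept in the text can be sketched with the standard `csv` module. The `chunk_csv` name, the metadata keys, and the default `source` value are hypothetical.

```python
import csv
import io

def chunk_csv(text: str, source: str = "ledger.csv") -> list[dict]:
    """One chunk per data row, with column names repeated in the chunk text."""
    reader = csv.DictReader(io.StringIO(text))
    chunks = []
    for index, row in enumerate(reader):
        chunks.append({
            "source": source,
            "row": index,
            "line": reader.line_num,  # line in the file, header is line 1
            "text": "; ".join(f"{key}: {value}" for key, value in row.items()),
        })
    return chunks

rows = chunk_csv("item,price\nhose,12\nseeds,3\n")
```

Repeating the header in every chunk means a single retrieved row is readable without the rest of the file.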
Email files keep sender, recipient, subject, and date context attached to message body chunks for source-verifiable inbox memory.
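A minimal sketch of keeping header context attached to a message body, using the standard library's `email` parser. It assumes a single-part plain-text message; the `chunk_email` name and the output keys are illustrative.

```python
from email import message_from_string

def chunk_email(raw: str) -> dict:
    """Parse one plain-text message and keep key headers next to the body."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        # Assumes a non-multipart message, where get_payload() returns a string
        "text": msg.get_payload().strip(),
    }

raw_message = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Zucchini\n"
    "Date: Thu, 1 Jan 2026 10:00:00 +0000\n"
    "\n"
    "Harvest is ready.\n"
)
chunk = chunk_email(raw_message)
```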
Calendar files chunk by event with summary, time, location, recurrence, organizer, and attendee metadata preserved.
Common source files chunk around imports, classes, functions, and selectors with language, symbol, and line-range metadata preserved.
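For Python sources, symbol-level chunking can be sketched with the standard `ast` module. This example covers only top-level functions and classes; the `chunk_python` name and metadata keys are assumptions, and other languages would need their own parsers.

```python
import ast

def chunk_python(source: str) -> list[dict]:
    """One chunk per top-level function or class, with symbol and line range."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "language": "python",
                "symbol": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks

symbols = chunk_python(
    "import os\n\ndef greet(name):\n    return 'hi ' + name\n\nclass Box:\n    size = 1\n"
)
```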
TMD ledger is a new ContextFit-proposed Tabular Markdown file format: a human-readable text file for records that should stay row-addressable, schema-aware, and easy to cite. It bridges Markdown notes and structured data without requiring a database.
Ledgers stay in normal files, work in git, and remain understandable to humans reviewing purchases, assets, tasks, contacts, or memory records.
Each row can be retrieved, cited, and audited as a source record instead of disappearing into an arbitrary token window.
Front-matter and column context stay attached to chunks, so agents know what a row means without needing a separate database schema.
Drop in as a memory layer. Query with a natural-language string. Get back ranked session IDs with source-linked evidence.
```python
# Ingest sessions: no API calls, no GPU, ~4ms per session
from contextfit import RetrievalEngine

engine = RetrievalEngine()
engine.ingest_sessions(sessions)
engine.save("./memory_index")

# Query: auto-routes to the right retrieval mode
result = engine.query_auto(
    "what should I cook for dinner tonight?",
    top_k=5
)

# Returns ranked session IDs + route metadata
print(result["route"])        # → episode_score
print(result["session_ids"])  # → ["s_garden_harvest", ...]

# Or use individual modes directly
engine.rank_sessions_by_episode_score(query, top_k=10)
engine.query(query, method="hybrid", top_k=50)
engine.rerank_sessions_by_structure(query, bm25_order, session_texts)
```
MIT licensed. No cloud dependency. No vendor lock-in. Fork it, embed it, ship it.
pip install contextfit
Python 3.10+ · CPU only · ~41MB deps · No DB
MIT — use freely in commercial and open-source projects
Issues, PRs, and benchmark contributions welcome
The full technical whitepaper: architecture, seven primitives, 499-case benchmark methodology, feature ablation, per-behavior analysis, and deployment architecture.