96.6% Any@5.
98.7% Any@10.
ContextFit with optional OpenAI fusion reaches 96.6% Any@5 and 98.7% Any@10 evidence retrieval on LongMemEval-S, with no vector database required.
The optional fusion signal uses cached OpenAI text-embedding-3-small embeddings. No vector database is required.
Result
| Metric | Score |
|---|---|
| Scored examples | 470 |
| Abstention examples skipped | 30 |
| Any evidence @1 | 84.68% |
| Any evidence @3 | 94.26% |
| Any evidence @5 | 96.60% |
| Any evidence @10 | 98.72% |
| All evidence @5 | 83.62% |
| All evidence @10 | 91.28% |
| MRR | 0.8999 |
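
The metrics in the table can be read as follows: Any@k counts an example as a hit when at least one gold evidence item appears in the top k results, All@k requires every gold evidence item to appear there, and MRR averages the reciprocal rank of the first gold item. The sketch below is one plausible way to compute them, not the benchmark script's actual code. The example layout (`ranked`, `gold`, and an `abstention` flag) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set


def any_hit(ranked: Sequence[str], gold: Set[str], k: int) -> bool:
    """At least one gold evidence id appears in the top k results."""
    return any(item in gold for item in ranked[:k])


def all_hit(ranked: Sequence[str], gold: Set[str], k: int) -> bool:
    """Every gold evidence id appears in the top k results."""
    return gold.issubset(ranked[:k])


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold evidence id, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def summarize(examples: List[dict], ks: Sequence[int] = (1, 3, 5, 10)) -> Dict[str, float]:
    # Assumption: abstention examples carry a boolean "abstention" flag and
    # are skipped, mirroring the "Abstention examples skipped" row.
    scored = [ex for ex in examples if not ex.get("abstention", False)]
    n = len(scored)
    report: Dict[str, float] = {"scored": n, "skipped": len(examples) - n}
    if n == 0:
        return report
    for k in ks:
        report[f"any@{k}"] = sum(any_hit(ex["ranked"], ex["gold"], k) for ex in scored) / n
        report[f"all@{k}"] = sum(all_hit(ex["ranked"], ex["gold"], k) for ex in scored) / n
    report["mrr"] = sum(reciprocal_rank(ex["ranked"], ex["gold"]) for ex in scored) / n
    return report
```

Under this reading, with 470 scored examples, 96.60% Any@5 corresponds to 454 examples with at least one evidence hit in the top 5.
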
Configuration
Dataset: LongMemEval-S (cleaned), 500 total rows, of which 470 were scored after excluding the 30 abstention examples. ContextFit used hybrid retrieval, parent/child conversation chunks, session ranking, coverage reranking, structured temporal filters, and optional OpenAI fusion. Answer markers were disabled.
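
To illustrate how an embedding signal can be fused without a vector database, the sketch below blends a lexical score with cosine similarity computed in memory over a small candidate set, reading vectors from a plain dictionary cache. This is not ContextFit's implementation: the min-max normalization, the linear weighting, and all names (`fuse`, `cached_vecs`, `weight`) are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse(
    lexical_scores: Dict[str, float],
    query_vec: Sequence[float],
    cached_vecs: Dict[str, Sequence[float]],
    weight: float = 0.5,  # hypothetical blend weight, not a ContextFit default
) -> List[Tuple[str, float]]:
    """Rank candidates by a weighted blend of lexical and embedding scores."""
    # Brute-force cosine over the candidates only; chunks missing from the
    # cache get a semantic score of 0.0 and fall back to the lexical signal.
    semantic = {
        cid: cosine(query_vec, cached_vecs[cid]) if cid in cached_vecs else 0.0
        for cid in lexical_scores
    }
    lex, sem = min_max(lexical_scores), min_max(semantic)
    fused = {cid: (1.0 - weight) * lex[cid] + weight * sem[cid] for cid in lex}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```

Because the candidate set is bounded by the first-stage retrieval depth (`--retrieval-k 100` in the command below), a linear scan like this stays cheap, which is one reason a dedicated vector index may be unnecessary in such a setup.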
Reproduction Command
.venv/bin/python benchmarks/longmemeval_contextfit.py \
benchmarks/data/longmemeval_s_cleaned.json \
--limit 0 \
--method hybrid \
--top-k-chunks 10 \
--retrieval-k 100 \
--chunk-size 2048 \
--overlap 128 \
--rank-by-session \
--conversation-chunks \
--conversation-parent \
--coverage-rerank \
--structured-temporal-filters \
--openai-fusion \
--out benchmarks/longmemeval_fusion_claim_966_987_20260519.json
Artifact
Raw JSON artifact: benchmarks/longmemeval_fusion_claim_966_987_20260519.json
SHA-256: 059c778ca389e2a5939505800acffd6349f0be7ada579238023d342784214932
GitHub report: benchmarks/longmemeval_fusion_claim_966_987_20260519.md
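
A downloaded copy of the raw JSON artifact can be checked against the published SHA-256 with Python's standard `hashlib` module. The snippet below is a minimal sketch; the helper names `sha256_of_file` and `matches_expected` are illustrative, and the path assumes the script is run from the repository root.

```python
import hashlib
from pathlib import Path

# Values copied from the Artifact section above.
ARTIFACT = Path("benchmarks/longmemeval_fusion_claim_966_987_20260519.json")
EXPECTED_SHA256 = "059c778ca389e2a5939505800acffd6349f0be7ada579238023d342784214932"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large artifacts are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return sha256_of_file(path) == expected.strip().lower()


# Example usage (requires the artifact to exist locally):
#   print(matches_expected(ARTIFACT))
```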
Recommended Wording
ContextFit with optional OpenAI fusion reaches 96.6% Any@5 and 98.7% Any@10 evidence retrieval on LongMemEval-S, with no vector database required.