96.6% Any@5.
98.7% Any@10.

ContextFit with optional OpenAI fusion reaches 96.6% Any@5 and 98.7% Any@10 evidence retrieval on LongMemEval-S, with no vector database required.

96.60% Any evidence @5
98.72% Any evidence @10
91.28% All evidence @10
0.8999 MRR
This is a retrieval/evidence-ranking result, not an official end-to-end LongMemEval QA score. The run uses OpenAI text-embedding-3-small embeddings as an optional cached fusion signal. No vector database is required.
Companion end-to-end QA progress: 87.2% overall accuracy and 87.6% task-averaged accuracy with ContextFit selective-fusion retrieval, a GPT-5-mini answerer/extractor, and GPT-4o judging.

Result

Metric                          Score
Scored examples                 470
Abstention examples skipped     30
Any evidence @1                 84.68%
Any evidence @3                 94.26%
Any evidence @5                 96.60%
Any evidence @10                98.72%
All evidence @5                 83.62%
All evidence @10                91.28%
MRR                             0.8999
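For readers unfamiliar with these metrics, the sketch below shows one common way to compute Any@k, All@k, and MRR from ranked retrieval output. It is an illustration under stated assumptions, not the benchmark script's code: the function names, the use of string identifiers for evidence items, and the choice to take the reciprocal rank of the first gold item are all assumptions made here.

```python
from typing import Dict, List, Sequence, Set, Tuple


def any_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> bool:
    """True if at least one gold evidence item appears in the top k."""
    return any(item in gold for item in ranked[:k])


def all_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> bool:
    """True if every gold evidence item appears in the top k."""
    return gold.issubset(ranked[:k])


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def summarize(
    runs: List[Tuple[Sequence[str], Set[str]]],
    ks: Sequence[int] = (1, 3, 5, 10),
) -> Dict[str, float]:
    """Average each metric over (ranked_ids, gold_ids) pairs."""
    if not runs:
        raise ValueError("no scored examples")
    n = len(runs)
    report: Dict[str, float] = {}
    for k in ks:
        report[f"any@{k}"] = sum(any_at_k(r, g, k) for r, g in runs) / n
        report[f"all@{k}"] = sum(all_at_k(r, g, k) for r, g in runs) / n
    report["mrr"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return report


# Hypothetical toy data: two questions with made-up evidence identifiers.
toy_runs = [
    (["s1", "s2", "s3"], {"s1", "s3"}),
    (["s4", "s5", "s6"], {"s5"}),
]
print(summarize(toy_runs, ks=(1, 3)))
```

All@k can never exceed Any@k for the same k, since it requires every gold item rather than just one. That is consistent with the gap between the two rows in the table.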

Configuration

The dataset is LongMemEval-S (cleaned): 500 rows in total, of which 470 are scored after the 30 abstention examples are excluded. ContextFit used hybrid retrieval, parent/child conversation chunks, session ranking, coverage reranking, structured temporal filters, and optional OpenAI fusion. Answer markers were disabled.
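The abstention exclusion can be sketched as a simple filter over the dataset rows. This is a minimal illustration and not the benchmark script's code: it assumes each row carries a `question_id` field and that abstention questions are marked with an `_abs` suffix, a convention taken here from the public LongMemEval release. The function name `split_abstention` is hypothetical.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_abstention(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate scored rows from abstention rows.

    Assumption: abstention questions have a question_id ending in "_abs".
    """
    scored: List[Row] = []
    skipped: List[Row] = []
    for row in rows:
        qid = str(row.get("question_id", ""))
        if qid.endswith("_abs"):
            skipped.append(row)
        else:
            scored.append(row)
    return scored, skipped


# Hypothetical rows, for illustration only.
example_rows = [
    {"question_id": "q001"},
    {"question_id": "q002_abs"},
    {"question_id": "q003"},
]
scored, skipped = split_abstention(example_rows)
print(len(scored), len(skipped))
```

Abstention questions have no evidence to retrieve, which is a natural reason to leave them out of an evidence-ranking score.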

Reproduction Command

.venv/bin/python benchmarks/longmemeval_contextfit.py \
  benchmarks/data/longmemeval_s_cleaned.json \
  --limit 0 \
  --method hybrid \
  --top-k-chunks 10 \
  --retrieval-k 100 \
  --chunk-size 2048 \
  --overlap 128 \
  --rank-by-session \
  --conversation-chunks \
  --conversation-parent \
  --coverage-rerank \
  --structured-temporal-filters \
  --openai-fusion \
  --out benchmarks/longmemeval_fusion_claim_966_987_20260519.json

Artifact

Raw JSON artifact: benchmarks/longmemeval_fusion_claim_966_987_20260519.json

SHA-256: 059c778ca389e2a5939505800acffd6349f0be7ada579238023d342784214932
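To confirm that a downloaded copy of the artifact matches the published digest, the file can be hashed locally. The sketch below uses only the Python standard library. The helper names `sha256_of` and `matches` are illustrative, and the path in the usage comment assumes the script is run from the repository root.

```python
import hashlib

# Digest published for the raw JSON artifact.
EXPECTED_SHA256 = (
    "059c778ca389e2a5939505800acffd6349f0be7ada579238023d342784214932"
)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's digest equals the expected digest."""
    return sha256_of(path) == expected.lower()


# Usage, assuming the current directory is the repository root:
#   matches("benchmarks/longmemeval_fusion_claim_966_987_20260519.json")
```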

GitHub report: benchmarks/longmemeval_fusion_claim_966_987_20260519.md

Recommended Wording

ContextFit with optional OpenAI fusion reaches 96.6% Any@5 and 98.7% Any@10 evidence retrieval on LongMemEval-S, with no vector database required.