96.20% token-native R@5.
99.00% fusion R@5.

ContextFit scored against the same public LongMemEval-S retrieval surface used by gbrain-evals: 500 questions, top-5 session retrieval, and a hit when any ground-truth answer_session_ids entry appears in the retrieved sessions.

96.20%Token-native R@5

99.00%Optional fusion R@5

97.60%gbrain-hybrid published R@5

96.60%MemPalace raw published R@5

500LongMemEval-S questions

This is retrieval recall, not answer accuracy. There is no answer-generation model and no LLM judge in this metric.

Result

System	R@1	R@3	R@5	R@10	Hits@5	Embeddings	Vector store
gbrain-hybrid	-	-	97.60%	-	488/500	yes	local
MemPalace raw	-	-	96.60%	-	483/500	yes	local
ContextFit token-native	81.80%	90.40%	96.20%	97.80%	481/500	no	no
ContextFit + OpenAI fusion	84.60%	95.20%	99.00%	99.60%	495/500	yes	no

Per-Type R@5

Question type	Token-native	Fusion
knowledge-update	98.7%	100.0%
multi-session	95.5%	100.0%
single-session-assistant	100.0%	100.0%
single-session-preference	90.0%	86.7%
single-session-user	95.7%	100.0%
temporal-reasoning	95.5%	99.2%

Artifacts

Token-native artifact: benchmarks/longmemeval_token_native_certificate_promotion_v5_typed_rescue_tight_20260524.json, SHA-256 c0e7ebc5d925549e1e3058b6100ab0786654c4d8c1bd4a99fe920c57f3ff2ea6.

Fusion artifact: benchmarks/longmemeval_fusion_selective_chunk_promotion_v5_typed_rescue_20260524.json, SHA-256 ababf7387cb18c9310e82c35a57594aeef3cd40a2b60d9a1f336f69b65d3dcd2.

Scoring Command

python3 - <<'PY'
import json
from pathlib import Path

files = [
    ("ContextFit token-native", "benchmarks/longmemeval_token_native_certificate_promotion_v5_typed_rescue_tight_20260524.json"),
    ("ContextFit fusion", "benchmarks/longmemeval_fusion_selective_chunk_promotion_v5_typed_rescue_20260524.json"),
]

for name, path in files:
    rows = json.loads(Path(path).read_text())["rows"]
    print(name)
    for k in (1, 3, 5, 10):
        hits = sum(r["best_rank"] is not None and r["best_rank"] <= k for r in rows)
        print(f"R@{k}: {hits}/{len(rows)} = {hits / len(rows) * 100:.2f}%")
PY

96.20% token-native R@5.99.00% fusion R@5.

Result

Per-Type R@5

Artifacts

Scoring Command

96.20% token-native R@5.
99.00% fusion R@5.