87.2% QA accuracy.
87.6% task-averaged.
ContextFit reports LongMemEval-S QA progress using selective-fusion retrieval, source-aware answer generation, and LongMemEval-style GPT-4o judging.
Current QA Results
| Run | Overall | Task-avg | Model split |
|---|---|---|---|
| Selective fusion, GPT-5-mini answerer | 87.2% | 87.58% | GPT-5-mini generation/extraction/answerability; GPT-4o judge |
| Selective fusion, GPT-4o-only | 85.2% | 85.67% | GPT-4o generation/extraction/answerability/judge |
| First-class answerer router | 86.8% | 86.65% | GPT-5-mini on temporal/preference/multi-session rows; GPT-4o otherwise; GPT-4o judge |
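The routing rule in the last row can be sketched as a small lookup. This is an illustrative sketch, not the ContextFit router: it assumes routing keys on the LongMemEval question-type label, the function name `choose_answerer` is hypothetical, and `"gpt-4o"` is a placeholder for whichever GPT-4o model string the run actually used.

```python
# Question types the router sends to GPT-5-mini, per the results table
# ("temporal/preference/multi-session rows").
GPT5_MINI_TYPES = {
    "temporal-reasoning",
    "single-session-preference",
    "multi-session",
}


def choose_answerer(question_type: str) -> str:
    """Return the answerer model name for a LongMemEval question type.

    Assumption: the router decides from the dataset's question-type label.
    The real router may use other signals.
    """
    if question_type in GPT5_MINI_TYPES:
        return "gpt-5-mini"
    # Placeholder model string for the GPT-4o answerer (assumption).
    return "gpt-4o"


if __name__ == "__main__":
    for qtype in ("temporal-reasoning", "knowledge-update"):
        print(qtype, "->", choose_answerer(qtype))
```

Under this rule, 296 of the 500 rows (133 temporal-reasoning, 30 single-session-preference, 133 multi-session) would go to GPT-5-mini and the remaining 204 to GPT-4o.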
GPT-5-mini Answerer Breakdown
| Type | N | Accuracy |
|---|---|---|
| knowledge-update | 78 | 91.03% |
| multi-session | 133 | 78.20% |
| single-session-assistant | 56 | 100.00% |
| single-session-preference | 30 | 73.33% |
| single-session-user | 70 | 95.71% |
| temporal-reasoning | 133 | 87.22% |
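The two headline numbers differ only in weighting: overall accuracy pools all 500 rows, while task-averaged accuracy is the unweighted mean of the six per-type accuracies. The sketch below recomputes both from the table above. It assumes each reported accuracy is correct/N rounded to two decimals, so the per-type correct counts are inferred rather than read from the run artifacts.

```python
# Rows from the breakdown table: (question type, N, accuracy in percent).
BREAKDOWN = [
    ("knowledge-update", 78, 91.03),
    ("multi-session", 133, 78.20),
    ("single-session-assistant", 56, 100.00),
    ("single-session-preference", 30, 73.33),
    ("single-session-user", 70, 95.71),
    ("temporal-reasoning", 133, 87.22),
]


def correct_count(n: int, accuracy_pct: float) -> int:
    """Infer the number of correct rows from N and a rounded percentage.

    Assumption: the reported accuracy is correct/N rounded to two decimals.
    """
    return round(n * accuracy_pct / 100)


def overall_accuracy(rows) -> float:
    """Row-weighted accuracy: total correct over total rows, in percent."""
    total = sum(n for _, n, _ in rows)
    correct = sum(correct_count(n, acc) for _, n, acc in rows)
    return 100 * correct / total


def task_averaged_accuracy(rows) -> float:
    """Unweighted mean of per-type accuracies, in percent."""
    per_type = [100 * correct_count(n, acc) / n for _, n, acc in rows]
    return sum(per_type) / len(per_type)


if __name__ == "__main__":
    print(f"overall:       {overall_accuracy(BREAKDOWN):.2f}%")
    print(f"task-averaged: {task_averaged_accuracy(BREAKDOWN):.2f}%")
```

Task-averaging gives the 30-row single-session-preference type the same weight as each 133-row type, which is why the two figures need not match.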
Interpretation
The clean May 23 headline is 87.2% overall (87.58% task-averaged) with a GPT-5-mini answerer/extractor and a GPT-4o judge. The strongest GPT-4o-only result is 85.2% overall. The answerer router reaches 86.8% overall, below the GPT-5-mini answerer run: routing experiments show a useful temporal-reasoning lift but weaker multi-session behavior, so the router remains experimental rather than the main claim.
Configuration
The dataset is the cleaned LongMemEval-S set, 500 rows in total. ContextFit used selective-fusion retrieval, source-aware answer generation, and GPT-4o judging, with full-run artifact checks confirming zero blank hypotheses.
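A blank-hypothesis check of the kind described above might look like the following. This is a minimal sketch, not the project's actual artifact check: it assumes the hypotheses file is JSONL with `question_id` and `hypothesis` fields, and the function name `find_blank_hypotheses` is hypothetical.

```python
import json


def find_blank_hypotheses(jsonl_lines, expected_rows=500):
    """Scan hypothesis records and report row count and blank answers.

    Assumption: each non-empty line is a JSON object with "question_id"
    and "hypothesis" keys. A missing, null, or whitespace-only hypothesis
    counts as blank.
    """
    seen_ids = []
    blank_ids = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate trailing blank lines in the file
        record = json.loads(line)
        question_id = record.get("question_id")
        seen_ids.append(question_id)
        if not str(record.get("hypothesis") or "").strip():
            blank_ids.append(question_id)
    return {
        "rows": len(seen_ids),
        "complete": len(seen_ids) == expected_rows,
        "blank_ids": blank_ids,
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_hypotheses.py <path to hypotheses .jsonl>
    with open(sys.argv[1], encoding="utf-8") as handle:
        report = find_blank_hypotheses(handle)
    print(report)
```

A full run would pass this check when `complete` is true and `blank_ids` is empty.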
Reproduction Command
.venv/bin/python benchmarks/longmemeval_contextfit_qa.py \
--data benchmarks/data/longmemeval_s_cleaned.json \
--retrieval-artifact benchmarks/longmemeval_selective_fusion_userpref_token_base_20260523.json \
--hypotheses-out benchmarks/longmemeval_contextfit_qa_hypotheses_selective_fusion_userpref_agent_blend_full_gpt5mini_judge_gpt4o_20260523.jsonl \
--judged-out benchmarks/longmemeval_contextfit_qa_judged_selective_fusion_userpref_agent_blend_full_gpt5mini_judge_gpt4o_20260523.jsonl \
--summary-out benchmarks/longmemeval_contextfit_qa_summary_selective_fusion_userpref_agent_blend_full_gpt5mini_judge_gpt4o_20260523.json \
--source-aware \
--evidence-packet general \
--fusion-evidence-map temporal \
--count-list-ledger agent_blend \
--multi-session-top-k-context 10 \
--temporal-top-k-context 10 \
--generation-model gpt-5-mini \
--extraction-model gpt-5-mini \
--answerability-model gpt-5-mini \
--judge-model gpt-4o-2024-08-06
Artifacts
GitHub report: benchmarks/longmemeval_contextfit_report.md
Run summaries are committed as benchmark artifacts for the GPT-4o-only, GPT-5-mini answerer, and first-class router runs.
Recommended Wording
ContextFit reports LongMemEval-S QA progress: 85.2% with a GPT-4o-only selective-fusion run, and 87.2% with a GPT-5-mini answerer/extractor plus GPT-4o judging. These are local LongMemEval-style evaluations, not official leaderboard submissions.