87.2% QA accuracy.
87.6% task-averaged.

ContextFit reports LongMemEval-S QA progress using selective-fusion retrieval, source-aware answer generation, and LongMemEval-style GPT-4o judging.

87.2% GPT-5-mini answerer QA
85.2% GPT-4o-only QA
93.3% Abstention accuracy
500 Questions evaluated
This is the end-to-end QA companion to the retrieval artifact. ContextFit retrieves source evidence, the answer model responds from retrieved context, and GPT-4o judges answer correctness using the LongMemEval-style evaluation path.
Keep this claim separate from retrieval recall. The retrieval-only artifact reports 96.6% Any@5 and 98.7% Any@10 evidence recall. This page reports answer correctness after generation and judging.
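To make the distinction concrete, here is a minimal sketch of how an Any@k retrieval metric can be computed. It assumes Any@k counts a question as a hit when at least one gold evidence item appears in the top k retrieved results; the function names and the toy data are hypothetical and are not taken from the ContextFit code. Nothing in this calculation looks at the generated answer, which is why a recall number cannot stand in for QA accuracy.

```python
from typing import Iterable, Sequence, Set, Tuple


def any_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> bool:
    """Return True if any gold evidence id appears in the top-k ranked ids."""
    return any(doc_id in gold_ids for doc_id in ranked_ids[:k])


def any_at_k_rate(rows: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Fraction of questions with at least one gold evidence id in the top k."""
    rows = list(rows)
    hits = sum(any_at_k(ranked, gold, k) for ranked, gold in rows)
    return hits / len(rows)


# Toy data (hypothetical): (ranked retrieval ids, gold evidence ids) per question.
toy_rows = [
    (["s3", "s1", "s9"], {"s1"}),  # gold item is ranked second
    (["s4", "s5", "s6"], {"s7"}),  # gold item was never retrieved
]

print(any_at_k_rate(toy_rows, k=1))  # 0.0
print(any_at_k_rate(toy_rows, k=2))  # 0.5
```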

Current QA Results

| Run | Overall | Task-avg | Model split |
| --- | --- | --- | --- |
| Selective fusion, GPT-5-mini answerer | 87.2% | 87.58% | GPT-5-mini generation/extraction/answerability; GPT-4o judge |
| Selective fusion, GPT-4o-only | 85.2% | 85.67% | GPT-4o generation/extraction/answerability/judge |
| First-class answerer router | 86.8% | 86.65% | GPT-5-mini on temporal/preference/multi-session rows; GPT-4o otherwise; GPT-4o judge |

GPT-5-mini Answerer Breakdown

| Type | N | Accuracy |
| --- | --- | --- |
| knowledge-update | 78 | 91.03% |
| multi-session | 133 | 78.20% |
| single-session-assistant | 56 | 100.00% |
| single-session-preference | 30 | 73.33% |
| single-session-user | 70 | 95.71% |
| temporal-reasoning | 133 | 87.22% |
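The two headline numbers come from the same breakdown but weight it differently: overall accuracy pools all 500 questions, while task-averaged accuracy takes the unweighted mean of the six per-type accuracies. The sketch below recomputes both. The per-type correct counts are not stated in the report; they are inferred here as the only integers that round to the reported percentages, so treat them as an assumption.

```python
# (N, correct) per question type. N is from the breakdown above; the correct
# counts are inferred from the reported percentages, not taken from a run artifact.
breakdown = {
    "knowledge-update": (78, 71),
    "multi-session": (133, 104),
    "single-session-assistant": (56, 56),
    "single-session-preference": (30, 22),
    "single-session-user": (70, 67),
    "temporal-reasoning": (133, 116),
}

total_n = sum(n for n, _ in breakdown.values())
total_correct = sum(correct for _, correct in breakdown.values())

# Overall accuracy: every question counts equally.
overall = 100 * total_correct / total_n

# Task-averaged accuracy: every question type counts equally.
per_type = {name: 100 * correct / n for name, (n, correct) in breakdown.items()}
task_avg = sum(per_type.values()) / len(per_type)

print(f"overall:  {overall:.2f}%")   # overall:  87.20%
print(f"task-avg: {task_avg:.2f}%")  # task-avg: 87.58%
```

Task-averaging gives the 30 single-session-preference questions the same weight as the 133 multi-session questions, so the gap between the two numbers depends on how accuracy varies across types of different sizes.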

Interpretation

The clean May 23 headline is 87.2% overall with a GPT-5-mini answerer/extractor and GPT-4o judge. The strongest GPT-4o-only result is 85.2% overall. Answerer-routing experiments show a useful temporal-reasoning lift but weaker multi-session behavior, so the router remains experimental rather than the main claim.
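The routing rule described in the results table can be sketched as a lookup on question type. This is an illustration of the idea only, not the ContextFit implementation: the function name is hypothetical, the mapping of "preference" to the single-session-preference type is an assumption, and the GPT-4o model ID is borrowed from the judge flag in the reproduction command.

```python
# Question types routed to GPT-5-mini, per the router row in the results table.
# Mapping "preference" to "single-session-preference" is an assumption.
GPT5_MINI_TYPES = frozenset(
    {"temporal-reasoning", "single-session-preference", "multi-session"}
)

# Assumed model IDs; the router's actual GPT-4o ID is not stated in the report.
GPT5_MINI = "gpt-5-mini"
GPT4O = "gpt-4o-2024-08-06"


def pick_answerer(question_type: str) -> str:
    """Return the answer model for a question type (hypothetical helper)."""
    return GPT5_MINI if question_type in GPT5_MINI_TYPES else GPT4O


for qtype in ("temporal-reasoning", "knowledge-update", "multi-session"):
    print(qtype, "->", pick_answerer(qtype))
```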

Configuration

The evaluation set is the cleaned LongMemEval-S file, 500 rows in total. ContextFit used selective-fusion retrieval, source-aware answer generation, and GPT-4o judging. Full-run artifact checks found zero blank hypotheses.
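A blank-hypothesis check of this kind can be approximated with a few lines over the hypotheses JSONL. The sketch below assumes each line is a JSON object with `question_id` and `hypothesis` fields, as in the LongMemEval hypothesis format; the actual ContextFit artifact schema and check are not shown in this report, and the helper name is hypothetical.

```python
import json
from typing import Iterable, List


def blank_hypothesis_ids(lines: Iterable[str], field: str = "hypothesis") -> List[str]:
    """Return question ids whose hypothesis is missing, non-string, or whitespace."""
    blanks = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines in the file
        row = json.loads(line)
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            blanks.append(row.get("question_id", "<unknown>"))
    return blanks


# Toy input (hypothetical rows); for a real run, pass an open file handle instead.
sample_lines = [
    '{"question_id": "q1", "hypothesis": "Paris"}',
    '{"question_id": "q2", "hypothesis": "   "}',
    '{"question_id": "q3"}',
]

print(blank_hypothesis_ids(sample_lines))  # ['q2', 'q3']
```

Running this against the file written by `--hypotheses-out` should return an empty list before the judged results are reported.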

Reproduction Command

.venv/bin/python benchmarks/longmemeval_contextfit_qa.py \
  --data benchmarks/data/longmemeval_s_cleaned.json \
  --retrieval-artifact benchmarks/longmemeval_selective_fusion_userpref_token_base_20260523.json \
  --hypotheses-out benchmarks/longmemeval_contextfit_qa_hypotheses_selective_fusion_userpref_agent_blend_full_gpt5mini_judge_gpt4o_20260523.jsonl \
  --judged-out benchmarks/longmemeval_contextfit_qa_judged_selective_fusion_userpref_agent_blend_full_gpt5mini_judge_gpt4o_20260523.jsonl \
  --summary-out benchmarks/longmemeval_contextfit_qa_summary_selective_fusion_userpref_agent_blend_full_gpt5mini_judge_gpt4o_20260523.json \
  --source-aware \
  --evidence-packet general \
  --fusion-evidence-map temporal \
  --count-list-ledger agent_blend \
  --multi-session-top-k-context 10 \
  --temporal-top-k-context 10 \
  --generation-model gpt-5-mini \
  --extraction-model gpt-5-mini \
  --answerability-model gpt-5-mini \
  --judge-model gpt-4o-2024-08-06

Artifacts

GitHub report: benchmarks/longmemeval_contextfit_report.md

Run summaries are committed as benchmark artifacts for the GPT-4o-only, GPT-5-mini answerer, and first-class router runs.

Recommended Wording

ContextFit reports LongMemEval-S QA progress: 85.2% with a GPT-4o-only selective-fusion run, and 87.2% with a GPT-5-mini answerer/extractor plus GPT-4o judging. These are local LongMemEval-style evaluations, not official leaderboard submissions.