1. Context rot
Every additional token spreads attention thinner. On two-hop reasoning (questions that require chaining two facts), accuracy collapses well before the documented context window runs out.
"The longer the context got, the worse models tended to do on questions like this that required two reasoning hops." (Timothy B. Lee, Context rot)
On a 32K-token retrieval task, GPT-4o slips from 99% to 70%; Claude 3.5 Sonnet from 88% to 30%. Two-hop accuracy drops faster still. See also Liu et al., Lost in the Middle.
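To put the two quoted drops on the same footing, here is a minimal sketch that works out the absolute and relative loss for each model. It uses only the four percentages cited above; the `scores` dictionary and the `drop` helper are illustrative names, not part of any benchmark's tooling.

```python
# Accuracy figures quoted in the text, as (before %, after %).
scores = {
    "GPT-4o": (99, 70),
    "Claude 3.5 Sonnet": (88, 30),
}


def drop(before, after):
    """Return (absolute drop in points, relative drop as a fraction of `before`)."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


for model, (before, after) in scores.items():
    absolute, relative = drop(before, after)
    print(f"{model}: -{absolute} points ({relative:.0%} of its starting accuracy)")
```

By this measure GPT-4o gives up roughly three tenths of its starting accuracy and Claude 3.5 Sonnet roughly two thirds, so the relative loss shows the gap between the two more clearly than the raw point drops.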