AI Agents Are Starting to Speedrun the Scoreboard
Codegraph, GraphBit, BenchJack, and frontier-model CTF drama all point at the same ugly truth: agent progress is real, but our evals and workflows are way too easy to game.
All posts tagged "benchmarks" on Clord.