
ChatGPT Solves Open Math Problems

Cambridge: ChatGPT 5.5 Pro solved a Nathanson number-theory problem.

Fields Medalist Tim Gowers found it produced PhD-level research in under an hour with no mathematical guidance from him.

To qualify as junior research, a problem must now be too hard for a large language model to solve, Gowers wrote.

Sources: Hacker News

Background

WHY MATH IS THE HARDEST TEST

Mathematics is the cleanest benchmark for machine reasoning because answers are verifiable and proofs leave no room for plausible-sounding nonsense. A model that bluffs its way through prose cannot bluff its way through a proof — every step either follows or it doesn't.

THE GOWERS STANDARD

Tim Gowers won the Fields Medal in 1998 for work on functional analysis and combinatorics. His judgment that an output reaches PhD-level research is not casual — he has spent decades evaluating exactly this threshold in graduate students and postdocs.

THE SCALING HYPOTHESIS

The dominant theory at frontier labs holds that capability emerges from scale alone — more parameters, more compute, more data. Critics argued mathematics would be the wall: symbolic reasoning, they said, requires architectural innovation, not just bigger models. The Gowers result is evidence the scaling camp may be right.

THE BENCHMARK TREADMILL

AI capability benchmarks have a short half-life. GSM8K (grade-school math) was introduced in 2021 and largely saturated by 2023. MATH (competition problems) followed. FrontierMath, designed in 2024 to be far beyond the reach of then-current models, is already being chipped away. Each benchmark falls faster than the one before it.

WHAT JUNIOR RESEARCH ACTUALLY MEANS

A PhD thesis in pure mathematics typically takes four to six years and produces one to three publishable results. If a model can produce comparable output in under an hour, the economics of mathematical research — postdoc positions, grant cycles, journal review — face the same disruption that hit translation and copywriting earlier.

THE VERIFICATION PROBLEM

LLM math output still requires expert verification because models occasionally produce confident proofs that contain subtle errors. The bottleneck is shifting from generating proofs to checking them — which is why formal proof assistants like Lean are seeing renewed investment as the auditing layer.
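
To make the auditing idea concrete, here is a minimal sketch of what a proof assistant checks. It assumes Lean 4 with only the core library (no Mathlib). The theorem name is hypothetical, and the statements are toy examples unrelated to the Nathanson problem. The point is only that the checker accepts a proof when every step type-checks and rejects it otherwise, so a confident but wrong argument does not get through.

```lean
-- Lean 4, core library only. The name `add_comm_example` is illustrative.

-- A general statement, justified by an existing core lemma.
-- Lean accepts this only because `Nat.add_comm a b` has exactly
-- the type `a + b = b + a`.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete claim, checked by computation: both sides reduce to 4.
example : 2 + 2 = 4 := rfl

-- A false claim cannot be "bluffed". Uncommenting the next line makes
-- the file fail to compile, because 2 + 2 does not reduce to 5.
-- example : 2 + 2 = 5 := rfl
```

Research-level proofs are much longer and usually need a library such as Mathlib, but the checking works the same way: the file either compiles or it does not.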