HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
Authors: Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath
TL;DR
RSCT Certification: κ=0.778 (certified) | RSN: 0.70/0.75/0.20 | Topics: LLM Agents and Reasoning, Graph-Based LLM Reasoning, Neuro-Symbolic AI and System 3 Reasoning
Overview
One-Sentence Summary
This paper introduces HorizonMath, a benchmark for evaluating large language models' (LLMs) ability to solve challenging, predominantly unsolved mathematical problems, and demonstrates that current state-of-the-art LLMs can make progress on some of these problems.
Key Innovation
The key innovation in this paper is the HorizonMath benchmark itself. Rather than relying on formal proof verification or manual review, both of which are expensive to scale, HorizonMath uses automatically verifiable problems that are predominantly unsolved. Because the answers are not yet known, this allows for scalable, contamination-resistant evaluation of LLMs' mathematical reasoning capabilities.
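To make the "hard to discover, cheap to verify" property concrete, here is a minimal, hypothetical sketch of what an automated check could look like for a construction-style problem. The problem format, field names, and the toy Sidon-set task below are illustrative assumptions, not the paper's actual framework or problem schema.

```python
# Illustrative sketch only: the problem format, field names, and checker below
# are invented to show the idea of cheap automatic verification; they are not
# the actual HorizonMath API.

def verify_candidate(problem: dict, candidate: list[int]) -> bool:
    """Return True if the proposed set has pairwise-distinct sums and is
    larger than the best known size recorded for this (toy) problem."""
    sums = set()
    for i, a in enumerate(candidate):
        for b in candidate[i + 1:]:
            if a + b in sums:      # a repeated pairwise sum invalidates the candidate
                return False
            sums.add(a + b)
    return len(candidate) > problem["best_known_size"]


if __name__ == "__main__":
    problem = {"id": "toy-sidon-042", "best_known_size": 3}  # hypothetical record to beat
    candidate = [1, 2, 5, 11]                                 # e.g. a model-proposed construction
    print(verify_candidate(problem, candidate))               # True: valid and beats the toy record
```

The point of the sketch is that finding a larger valid construction may require real mathematical insight, while checking any proposed construction runs in a fraction of a second, which is what makes fully automated, scalable scoring possible.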
Should You Read This?
- If you work on large language models and AI reasoning: Yes, this paper is highly relevant. It provides a novel benchmark for evaluating LLMs' mathematical reasoning abilities and demonstrates that these models can make progress on challenging, unsolved problems.
- If you work on mathematical reasoning or automated theorem proving: Maybe. The HorizonMath benchmark could be a useful tool for your work, but the paper is primarily focused on evaluating LLMs rather than on mathematical reasoning more broadly.
The Good
- The HorizonMath benchmark is a well-designed and thoughtful approach to evaluating LLMs' mathematical reasoning abilities. The use of automatically verifiable, predominantly unsolved problems is a clever solution to the challenges of existing benchmarks.
- The evaluation of GPT-5.4 Pro's performance on the benchmark is thorough and insightful. The discovery of two problems for which GPT-5.4 Pro proposes solutions that improve on the best-known published results is a promising finding.
- The paper provides a strong theoretical and practical foundation for using HorizonMath as a community resource for evaluating and advancing the state of the art in LLM-based mathematical reasoning.
The Gaps
- The paper does not provide detailed information on the specific problems included in the HorizonMath benchmark. It would be helpful to have more insight into the nature and difficulty of these problems.
- The evaluation of GPT-5.4 Pro's performance is promising, but the paper does not provide a comprehensive comparison to other state-of-the-art LLMs or mathematical reasoning systems. Additional comparisons would help contextualize the results.
- The paper acknowledges that the solutions proposed by GPT-5.4 Pro require expert review to validate their novelty. That review is an essential step before the solutions can be counted as genuinely novel contributions.
How to Read This Paper
- If you're from the field of large language models and AI reasoning: Focus on the sections describing the HorizonMath benchmark and the evaluation of GPT-5.4 Pro's performance. The background material on mathematical reasoning and automated theorem proving may be less relevant for you.
- If you're from the field of mathematical reasoning or automated theorem proving: The background sections on the challenges of existing benchmarks and the introduction of the HorizonMath approach will be particularly valuable. You may also want to closely examine the specific problems included in the benchmark.
- Must read (everyone): The sections describing the HorizonMath benchmark, the evaluation of GPT-5.4 Pro, and the discussion of the potential for LLMs to make progress on unsolved mathematical problems.
- Verify: The claims regarding the novelty of the solutions proposed by GPT-5.4 Pro should be confirmed through expert review before building on these results.
Bottom Line
This paper presents a promising new benchmark, HorizonMath, for evaluating the mathematical reasoning capabilities of large language models. The demonstration that a state-of-the-art LLM can make progress on challenging, unsolved mathematical problems is an exciting result that warrants further investigation. Researchers working on LLMs and mathematical reasoning should closely examine this work and consider using HorizonMath as a tool for advancing the field.
Quality Assessment
Trust Level: MODERATE - Verify key results first
What the scores mean:
- 70% signal - This much of the paper directly supports its claims
- 75% context - Background material for readers from other fields (this is a bridge paper - high context is a feature!)
- 20% noise - Content that may mislead if taken at face value
Reliability score: 78% (certified)
Practical interpretation: Good foundation but some gaps. Read critically and verify key claims before building on this work.
Paper Details
- Authors: Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath et al.
- Published: 2026-03-16
- Source: arXiv
- Primary Topic: LLM Agents and Reasoning
- Difficulty: Intermediate
Abstract
Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.
This analysis was automatically generated and certified by the Swarm-It RSCT pipeline. κ-gate score: 0.778 | Quality tier: certified