frontiermath news - Search News

AI’s math problem: FrontierMath benchmark shows how far technology still has to go

FrontierMath, a new benchmark from Epoch AI, challenges advanced AI systems with complex math problems, revealing how far AI still has to go before achieving true human-level reasoning.

New secret math benchmark stumps AI models and PhDs alike

FrontierMath's performance results, revealed in a preprint research paper, paint a stark picture of current AI model ...

5d on MSN

A new math benchmark just dropped and leading AI models can solve 'less than 2%' of its problems... oh dear

Sometimes I forget there's a whole other world out there where AI models aren't just used for basic tasks such as simple ...

Epoch AI Launches FrontierMath AI Benchmark to Test Capabilities of AI Models

Epoch AI highlighted that to measure AI's aptitude, benchmarks should be created on creative problem-solving where the AI has ...

Hosted on MSN, 6d

Testing AI systems on hard math problems shows they still perform very poorly

They decided a new benchmark was needed, and so they created one they named FrontierMath. To begin, the research team delved ...

Tencent News, 7d

The AI math myth is shattered! FrontierMath leaves LLMs handing in nearly blank answer sheets: accuracy no higher than 2%

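Epoch AI, however, had seen enough. It teamed up with more than 60 top mathematicians to produce FrontierMath, a brand-new mathematical reasoning test built to humble overconfident LLMs. The results were dismal: the LLMs failed across the board, with accuracy below 2%! 🤡 See how Epoch AI ...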

3d on MSN

New AI math benchmark FrontierMath released: existing models struggle with complex math challenges

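[ITBEAR] Research organization Epoch AI recently released a new math benchmark for AI models called FrontierMath. The benchmark aims to comprehensively evaluate the mathematical reasoning of AI models, especially their performance on complex math problems. Unlike existing math test sets such as GSM-8K and ...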

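Terence Tao joins 60+ mathematicians in writing problems; the world's top models pass only 2%. An expert-level math benchmark that will keep AI struggling for years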

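Epoch AI recently joined with more than sixty mathematicians from around the world, including professors, IMO problem writers, and Fields Medalists, to launch the new math benchmark FrontierMath. It comprises hundreds of original, exceptionally challenging math problems designed to evaluate advanced reasoning in AI systems.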

Tencent News, 3d

LLM math benchmark FrontierMath released: said to have defeated every model in the industry

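ITHome, November 15: Research organization Epoch AI has released a new math benchmark for AI models called FrontierMath, intended to evaluate the mathematical reasoning of a range of models.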

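o1 and Claude fail across the board: Terence Tao and 60+ top mathematicians jointly propose a new math benchmark, and every large model scores below 2%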

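Right out of the gate, the o1 model, which had once posted an 83% solve rate on International Mathematical Olympiad problems, was defeated, and Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and others all failed to break the 2% mark. As it turns out, this new math benchmark is called FrontierMath, from Epoch ...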

Terence Tao and dozens of mathematicians launch FrontierMath; AI success rate on the math challenge is only 2%

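According to Epoch AI's research report, the performance of the six frontier models on FrontierMath was especially striking: their success rates were below 2%. OpenAI research scientist Noam Brown welcomed the result, saying the low pass rate shows the limits of current AI in mathematics. The finding echoes a widely held doubt: although many large language models (LLMs) appear to handle math problems well, their abilities are often overstated.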

3d on MSN

LLM math benchmark FrontierMath released: said to consist mostly of problem types AI has never seen, defeating every model in the industry

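The research organization said it used FrontierMath to run preliminary tests on AI models currently on the market and found that they generally performed poorly. Models such as Claude 3.5 and GPT-4, which had previously achieved near-perfect scores on GSM-8K and MATH, also failed on FrontierMath, with success rates below 2%.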
