Rank (UB): 1 + the number of models whose lower CI bound exceeds this model's upper CI bound.
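
The Rank (UB) rule above can be sketched in a few lines of Python. This is only an illustration of the stated definition, not the leaderboard's actual implementation: the function name `rank_upper_bound`, the model names, and the interval values are all hypothetical, and how the confidence intervals themselves are computed is not covered here.

```python
from typing import Dict, Tuple


def rank_upper_bound(intervals: Dict[str, Tuple[float, float]]) -> Dict[str, int]:
    """Compute Rank (UB) for each model from (lower, upper) CI bounds.

    Rank (UB) = 1 + the number of models whose lower CI bound
    exceeds this model's upper CI bound.
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        # Count the other models that are clearly better: their whole
        # interval sits above this model's interval.
        clearly_better = sum(
            1
            for other, (lower, _) in intervals.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


# Hypothetical scores for illustration only (not real leaderboard data).
example = {
    "model_a": (52.0, 58.0),
    "model_b": (50.0, 56.0),
    "model_c": (40.0, 45.0),
}

print(rank_upper_bound(example))
# {'model_a': 1, 'model_b': 1, 'model_c': 3}
```

In this sketch, models with overlapping intervals share a rank: `model_a` and `model_b` both receive rank 1, while `model_c` receives rank 3 because both of the other lower bounds lie above its upper bound.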

Scale’s SEAL Leaderboard evaluates top models’ visual-language understanding, testing perception, logic, calculation, and common sense.

VISTA

Performance Comparison