
Showdown Leaderboard - LLMs


Real people. Real conversations. Real rankings.

Showdown ranks AI models based on how they perform in real-world use, not in synthetic tests or lab settings. Votes are blind, optional, and organic, so rankings reflect authentic preferences.

Methodology & Technical Report
Prompts: real conversation prompts compared across models through pairwise votes.
Users: from 80+ countries and 70+ languages, spanning all backgrounds and professions.
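Leaderboards built from blind pairwise votes are typically fit with an Elo- or Bradley-Terry-style rating model: each vote nudges the winner's score up and the loser's down by an amount that depends on how surprising the result was. The sketch below is a rough illustration only, not Showdown's actual scoring procedure (see the Methodology & Technical Report for that); the K-factor, the 1000-point baseline, and the model names are assumptions made for the example.

```python
# Illustrative Elo-style rating from pairwise votes. NOT Showdown's actual
# methodology; K and BASE are assumed values for demonstration.
from collections import defaultdict

K = 4.0        # assumed update step per vote
BASE = 1000.0  # assumed starting rating

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic (Elo) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rate(votes: list[tuple[str, str]]) -> dict[str, float]:
    """votes is a list of (winner, loser) pairs from blind comparisons."""
    ratings: dict[str, float] = defaultdict(lambda: BASE)
    for winner, loser in votes:
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - e)
        ratings[loser] -= K * (1.0 - e)
    return dict(ratings)

# Hypothetical votes, for illustration only.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
print(rate(votes))
```

Under a scheme like this, an unbeaten model drifts above the baseline and a frequently losing one drifts below it, which matches the roughly 950–1100 spread of scores in the table below.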

SEAL Leaderboard - LLMs

| RANK | MODEL | VOTES | SCORE | CI (-/+) |
|------|-------|-------|-------|----------|
| 1 | gpt-5-chat | 13532 | 1096.41 | -5.89 / +4.29 |
| 1 | claude-sonnet-4-5-20250929 | 7625 | 1091.76 | -4.45 / +4.29 |
| 2 | qwen3-235b-a22b-2507-v1 | 4025 | 1084.08 | -6.72 / +6.08 |
| 3 | claude-opus-4-1-20250805 | 16479 | 1083.07 | -3.98 / +3.11 |
| 5 | claude-sonnet-4-20250514 | 18268 | 1070.55 | -2.48 / +2.99 |
| 5 | claude-sonnet-4-5-20250929 (Thinking) | 7513 | 1068.57 | -4.81 / +5.00 |
| 5 | claude-opus-4-20250514 | 16147 | 1064.92 | -4.20 / +4.52 |
| 5 | claude-haiku-4-5-20251001 | 5063 | 1062.16 | -6.62 / +7.63 |
| 7 | gpt-4.1-2025-04-14 | 17661 | 1058.75 | -3.68 / +2.87 |
| 8 | claude-opus-4-1-20250805 (Thinking) | 15196 | 1053.63 | -4.22 / +4.55 |
| 10 | gemini-2.5-pro-preview-06-05 | 15357 | 1045.74 | -4.31 / +4.15 |
| 11 | claude-opus-4-20250514 (Thinking) | 15639 | 1038.50 | -4.01 / +3.56 |
| 12 | claude-sonnet-4-20250514 (Thinking) | 18017 | 1036.65 | -4.01 / +3.23 |
| 12 | claude-haiku-4-5-20251001 (Thinking) | 4861 | 1029.13 | -6.13 / +8.37 |
| 14 | gemini-2.5-flash-preview-05-20 | 18363 | 1019.72 | -2.73 / +4.05 |
| 15 | o3-2025-04-16-medium* | 20435 | 1019.94 | -3.58 / +2.90 |
| 17 | llama4-maverick-instruct-basic | 18919 | 1000.00 | -3.46 / +4.67 |
| 18 | o4-mini-2025-04-16-medium* | 19835 | 988.83 | -3.28 / +3.42 |
| 19 | deepseek-r1-0528 | 4275 | 967.13 | -8.72 / +7.40 |
| 20 | gpt-5-2025-08-07-medium* | 14014 | 951.83 | -4.66 / +5.18 |
* This model’s API does not consistently return Markdown-formatted responses. Since raw outputs are used in head-to-head comparisons, this may affect its ranking.
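The -x / +y values beside each score are asymmetric confidence bounds. One common way such intervals are produced is by bootstrapping over the vote set: resample the votes with replacement, refit the ratings, and take percentiles of the resulting score distribution. Whether Showdown computes its intervals exactly this way is an assumption on our part; the technical report is authoritative. The sketch below reuses the hypothetical rate() and BASE from the earlier example.

```python
# Percentile-bootstrap interval for one model's rating. An illustrative
# sketch, not Showdown's documented procedure; relies on rate() and BASE
# defined in the earlier Elo example.
import random

def bootstrap_ci(votes, model, n_boot=200, alpha=0.05):
    point = rate(votes).get(model, BASE)
    samples = sorted(
        rate(random.choices(votes, k=len(votes))).get(model, BASE)
        for _ in range(n_boot)
    )
    lo = samples[int(n_boot * alpha / 2)]
    hi = samples[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    # Reported as (-x, +y) offsets around the point estimate.
    return point - lo, hi - point

print(bootstrap_ci(votes, "model-a"))
```

Because the bootstrap distribution need not be symmetric around the point estimate, the lower and upper offsets can differ, which is consistent with the unequal -x / +y values in the table.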

Performance Comparison Across Language Models

[Charts omitted: Win Rate vs. Each Model · Battle Count vs. Each Model · Confidence Intervals · Average Win Rate · Prompt Distribution]