
Showdown Leaderboard - LLMs


Real people. Real conversations. Real rankings.

Showdown ranks AI models based on how they perform in real-world use, not in synthetic tests or lab settings. Votes are blind, optional, and organic, so rankings reflect authentic preferences.

Methodology & Technical Report →
Prompts: real conversation prompts compared across models through pairwise votes.
Users: voters from 80+ countries and 70+ languages, spanning all backgrounds and professions.
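The linked methodology report describes the actual rating procedure. As a rough illustration of how pairwise votes can be turned into scores of this kind, the sketch below applies a simple Elo-style update; this is an assumed, generic approach, not Showdown's documented method, and the model names and votes are placeholders.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under a logistic (Elo-style) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate_from_votes(votes, initial=1000.0, k=4.0):
    """Turn a list of (winner, loser) pairwise votes into per-model ratings.

    Generic Elo-style sketch for illustration only; Showdown's actual
    method is described in its methodology report.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        e_w = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += k * (1.0 - e_w)
        ratings[loser] -= k * (1.0 - e_w)
    return dict(ratings)

if __name__ == "__main__":
    # Hypothetical votes: (winner, loser) pairs from blind head-to-head comparisons.
    votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
    for model, score in sorted(rate_from_votes(votes).items(), key=lambda kv: -kv[1]):
        print(f"{model}: {score:.2f}")
```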

SEAL Leaderboard - LLMs

RANK  MODEL                                    VOTES   SCORE     CI
1     gpt-5-chat                               15216   1094.40   -3.46 / +3.23
1     claude-sonnet-4-5-20250929               9470    1091.64   -6.05 / +3.84
2     claude-opus-4-1-20250805                 16756   1082.60   -4.62 / +3.82
2     qwen3-235b-a22b-2507-v1                  6237    1082.13   -4.92 / +5.58
5     claude-sonnet-4-20250514                 20251   1069.75   -3.89 / +3.01
5     claude-sonnet-4-5-20250929 (Thinking)    9161    1065.67   -4.67 / +5
5     claude-opus-4-20250514                   16442   1064.28   -4.31 / +2.92
5     claude-haiku-4-5-20251001                5350    1060.63   -5.94 / +6.83
6     gpt-4.1-2025-04-14                       17752   1058.77   -3.63 / +3.53
6     gemini-3-pro-preview                     3307    1051.92   -7.17 / +9.35
8     claude-opus-4-1-20250805 (Thinking)      15505   1053.02   -4.5 / +3.91
8     gemini-2.5-pro                           3840    1048.19   -6.67 / +7.97
11    gemini-2.5-pro-preview-06-05             15357   1044.19   -4.6 / +4.29
12    claude-opus-4-20250514 (Thinking)        16806   1040.20   -3.68 / +3.3
13    claude-sonnet-4-20250514 (Thinking)      19782   1036.90   -3.56 / +3.34
15    claude-haiku-4-5-20251001 (Thinking)     5121    1028.89   -5.73 / +6.06
16    o3-2025-04-16-medium*                    22266   1020.59   -2.76 / +3.68
16    gemini-2.5-flash-preview-05-20           18363   1019.35   -3.47 / +3.87
16    gemini-2.5-flash                         3823    1018.48   -8.6 / +10.11
20    llama4-maverick-instruct-basic           19169   1000.00   -3.89 / +4.19
21    o4-mini-2025-04-16-medium*               21753   988.00    -3.32 / +3.39
22    deepseek-r1-0528                         6059    968.17    -4.88 / +6.48
23    gpt-5-2025-08-07-medium*                 15702   950.68    -3.73 / +3.86
* This model’s API does not consistently return Markdown-formatted responses. Since raw outputs are used in head-to-head comparisons, this may affect its ranking.
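The CI column gives the lower and upper offsets of a confidence interval around each score; the methodology report covers the exact construction. As a hedged illustration of how such asymmetric bounds can arise, the sketch below estimates an interval by bootstrap resampling of votes and recomputing Elo-style ratings on each resample. This is an assumed approach, not necessarily the one Showdown uses, and the votes are placeholders.

```python
import random
from collections import defaultdict

def elo_ratings(votes, initial=1000.0, k=4.0):
    """Elo-style ratings from (winner, loser) votes; same sketch as above."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        e_w = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - e_w)
        ratings[loser] -= k * (1.0 - e_w)
    return ratings

def bootstrap_interval(votes, model, n_resamples=200, level=0.95, seed=0):
    """Percentile bootstrap interval for one model's rating.

    Resamples the vote list with replacement, recomputes ratings, and
    reports offsets from the full-data rating (these can be asymmetric).
    """
    rng = random.Random(seed)
    point = elo_ratings(votes)[model]
    samples = sorted(
        elo_ratings([rng.choice(votes) for _ in votes])[model]
        for _ in range(n_resamples)
    )
    lo = samples[int((1 - level) / 2 * (n_resamples - 1))]
    hi = samples[int((1 + level) / 2 * (n_resamples - 1))]
    return point, lo - point, hi - point

if __name__ == "__main__":
    # Hypothetical votes between two placeholder models.
    votes = [("model-a", "model-b")] * 30 + [("model-b", "model-a")] * 20
    point, minus, plus = bootstrap_interval(votes, "model-a")
    print(f"model-a: {point:.2f} ({minus:+.2f} / {plus:+.2f})")
```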

Performance Comparison Across Language Models

Charts: Win Rate vs. Each Model, Battle Count vs. Each Model, Confidence Intervals, Average Win Rate, Prompt Distribution.
