Trading Card Game Bench
A bespoke TCG designed from the ground up for LLM evaluation
Leaderboard
Rollout 1 Results
Performance against the Rollout 1 opponent.
Random baseline: 36.0% win rate. The "vs Random" column is each model's win rate minus this baseline (see the sketch after the table).
Models tested: 10 English, 1 Arabic
| Rank | Model | Org | Win Rate | vs Random | Lang | Games | Avg Time |
|---|---|---|---|---|---|---|---|
| 🥇 | Grok-3 Mini | X.AI | 53.4% | +17.4% | EN | 500 | 2.0s |
| 🥈 | Gemini-2.5-Flash:Thinking | Google | 52.4% | +16.4% | EN | 500 | 2.0s |
| 🥉 | Gemini-2.5-Flash | Google | 46.0% | +10.0% | EN | 500 | 2.0s |
| #4 | Qwen3-32B | Alibaba | 45.0% | +9.0% | EN | 1,000 | 2.0s |
| #5 | LLaMA-3.3-70B | Meta | 45.0% | +9.0% | EN | 600 | 2.0s |
| #6 | DeepSeek-R1-Distil-70B | DeepSeek | 38.3% | +2.3% | EN | 600 | 2.0s |
| #7 | Qwen3-32B | Alibaba | 34.9% | -1.1% | AR | 1,000 | 2.0s |
| #8 | Qwen3-235B | Alibaba | 30.2% | -5.8% | EN | 500 | 2.0s |
| #9 | LLaMA-3.2-11B | Meta | 30.0% | -6.0% | EN | 500 | 2.0s |
| #10 | meta-llama-llama-3.1-8b-instruct | Unknown | 28.8% | -7.2% | EN | 500 | 2.0s |
| #11 | DeepSeek-R1-Distil-8B | DeepSeek | 28.2% | -7.8% | EN | 500 | 2.0s |
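The "vs Random" delta is simple arithmetic: a model's win rate minus the 36.0% random-agent baseline. The sketch below shows how such an entry can be computed from raw game counts; the function, field names, and the win count are illustrative, not part of any official TCG-Bench tooling.

```python
# Minimal sketch: compute win rate and delta vs. the random baseline.
# Names and the example win count are illustrative; TCG-Bench's own tooling may differ.

RANDOM_BASELINE = 0.360  # random agent's win rate against the Rollout 1 opponent


def summarize(model: str, wins: int, games: int) -> dict:
    """Return a leaderboard-style summary for one model."""
    win_rate = wins / games
    return {
        "model": model,
        "games": games,
        "win_rate": round(100 * win_rate, 1),                        # e.g. 53.4 (%)
        "vs_random": round(100 * (win_rate - RANDOM_BASELINE), 1),   # e.g. +17.4
    }


# Example consistent with the top leaderboard entry (267 / 500 = 53.4%).
print(summarize("Grok-3 Mini", wins=267, games=500))
# {'model': 'Grok-3 Mini', 'games': 500, 'win_rate': 53.4, 'vs_random': 17.4}
```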
Submit Your Results
Contribute your model's performance to our benchmark
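The exact submission format is not specified on this page; the hypothetical record below only illustrates the information a submission would need to cover, mirroring the leaderboard columns above. All field names and values are placeholders.

```python
# Hypothetical submission record mirroring the leaderboard columns above.
# Field names and values are illustrative only; the official format may differ.
submission = {
    "model": "YourModel-7B",   # model name as it should appear on the board
    "organization": "YourOrg",
    "language": "EN",          # EN or AR
    "opponent": "Rollout 1",
    "games": 500,              # number of games played
    "win_rate": 0.472,         # fraction of games won
    "avg_time_s": 2.0,         # matches the "Avg Time" column
}
```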
About TCG-Bench
TCG-Bench is a contamination-proof benchmark for evaluating large language models on strategic decision-making tasks. Because it uses a custom-designed trading card game that does not appear in any training data, models cannot rely on memorized knowledge of the game.
The benchmark tests models across multiple difficulty levels and languages, providing insight into strategic reasoning capabilities that traditional benchmarks do not capture.
Contamination-Proof: the custom game ensures no prior exposure in training data.
Multilingual: supports evaluation in English and Arabic.
Scalable Difficulty: multiple opponent strengths allow comprehensive testing.