TCG-BENCH

LLM Benchmark
No Contamination, No Saturation

Trading Card Game Bench

A Bespoke TCG Designed from the Ground Up for LLM Evaluation

10+ Models Tested
40K+ Games Played
2 Languages

Leaderboard

Rollout 1 Results

Performance against the Rollout 1 opponent.
Random baseline: 36.0% win rate
Models tested: 10 English, 1 Arabic

Rank  | Model                                    | Win Rate | vs Random | Lang | Games | Avg Time
đŸĨ‡ #1 |                                          | 53.4%    | +17.4%    | EN   | 500   | 2.0s
đŸĨˆ #2 |                                          | 52.4%    | +16.4%    | EN   | 500   | 2.0s
đŸĨ‰ #3 |                                          | 46.0%    | +10.0%    | EN   | 500   | 2.0s
#4    | Qwen3-32B (Alibaba)                      | 45.0%    | +9.0%     | EN   | 1,000 | 2.0s
#5    |                                          | 45.0%    | +9.0%     | EN   | 600   | 2.0s
#6    |                                          | 38.3%    | +2.3%     | EN   | 600   | 2.0s
#7    | Qwen3-32B (Alibaba)                      | 34.9%    | -1.1%     | AR   | 1,000 | 2.0s
#8    | Qwen3-235B (Alibaba)                     | 30.2%    | -5.8%     | EN   | 500   | 2.0s
#9    |                                          | 30.0%    | -6.0%     | EN   | 500   | 2.0s
#10   | meta-llama-llama-3.1-8b-instruct (Meta)  | 28.8%    | -7.2%     | EN   | 500   | 2.0s
#11   |                                          | 28.2%    | -7.8%     | EN   | 500   | 2.0s

Submit Your Results

Contribute your model's performance to our benchmark

â„šī¸ Fill in the form below and submit via email.
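The form's exact fields are not reproduced here, but a submission would presumably carry the same information as a leaderboard row. As a rough sketch, with every field name below hypothetical:

    # Hypothetical submission record. Field names are illustrative and
    # mirror the leaderboard columns; the actual form may differ.
    submission = {
        "model": "Qwen3-32B",     # model name as it should appear on the board
        "organization": "Alibaba",
        "language": "EN",         # EN or AR
        "opponent": "Rollout 1",  # opponent configuration evaluated against
        "games": 1000,            # number of games played
        "win_rate": 0.45,         # overall win rate against the opponent
        "avg_time_s": 2.0,        # Avg Time column, in seconds
    }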

About TCG-Bench

TCG-Bench is a contamination-proof benchmark for evaluating large language models on strategic decision-making. Because it is built on a custom-designed trading card game that does not appear in any training data, models cannot rely on memorized knowledge of the game, so results reflect reasoning rather than recall.

Our benchmark tests models across multiple difficulty levels and two languages, offering a view of strategic reasoning that static question-answering benchmarks do not capture.

đŸŽ¯ Contamination-Proof: a custom game ensures no prior exposure.

🌍 Multilingual: supports English and Arabic evaluation.

📊 Scalable Difficulty: multiple opponent strengths for comprehensive testing.