🐙 DeepSynth Leaderboard

Results ranked by F1 score (LLM Judge used as tiebreaker). F1 / Precision / Recall measure prediction quality against gold answers; LLM Judge reports average precision under semantic matching. 🔒 = closed model, 🔓 = open-weights.
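As a rough illustration of the metrics and ranking rule described above, here is a minimal sketch. It assumes each answer can be reduced to a set of discrete items (for example, key-value pairs) and compared by exact match; the leaderboard text does not say how answers are decomposed, so that choice, along with the names `precision_recall_f1`, `Entry`, and `rank_entries`, is hypothetical and not the official scorer.

```python
from typing import NamedTuple


def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of answer items.

    Assumption: answers are compared as sets of hashable items. The real
    benchmark may decompose and match answers differently.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


class Entry(NamedTuple):
    """Hypothetical leaderboard row: a system name and its two ranking scores."""
    name: str
    f1: float
    llm_judge: float


def rank_entries(entries: list[Entry]) -> list[Entry]:
    # Sort by F1 first; LLM Judge only matters when F1 values are equal.
    return sorted(entries, key=lambda e: (e.f1, e.llm_judge), reverse=True)
```

Sorting on the tuple `(f1, llm_judge)` keeps the tiebreak rule in one place: two systems with the same F1 are ordered by their LLM Judge score, and the judge score never overrides a difference in F1.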

Self-reported numbers on the public dev set (40 tasks, Pass@1). This split is useful for prototyping and for comparing methods during development, and anyone can score their own system on it locally.