DeepSynth Leaderboard

Results ranked by F1 score (LLM Judge used as tiebreaker). F1 / Precision / Recall measure prediction quality against gold answers; LLM Judge reports average precision under semantic matching. 🔒 = closed model, 🔓 = open-weights.
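As a rough illustration of the ranking rule and the metrics named above, here is a minimal Python sketch. It assumes each answer can be treated as a set of items compared by exact match; the benchmark's actual matching rules are not stated here and may differ. The names `Entry`, `precision_recall_f1`, and `rank` are hypothetical and are not part of any official scoring tool.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    name: str
    f1: float
    llm_judge: float


def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Score a predicted item set against a gold item set.

    Assumption: items are compared by exact equality. The real benchmark
    may normalize or match items differently.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def rank(entries: list[Entry]) -> list[Entry]:
    """Sort by F1 descending, using the LLM Judge score to break ties."""
    return sorted(entries, key=lambda e: (-e.f1, -e.llm_judge))
```

For example, `rank` places two systems with the same F1 in order of their LLM Judge score, which is how the tiebreaker described above would behave under these assumptions.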

Self-reported numbers on the public dev set (40 tasks, Pass@1). Useful for prototyping and comparing methods during development. Anyone can score themselves locally on this split.