DeepSynth Leaderboard
Results are ranked by F1 score, with LLM Judge as the tiebreaker. F1, Precision, and Recall measure prediction quality against gold answers; LLM Judge reports average precision under semantic matching. Each entry is marked as either a closed model or an open-weights model.
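The ranking rule can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual code: the `Entry` type, its field names, and the example values are hypothetical, and `f1_score` uses the standard harmonic-mean definition, which may differ from how the benchmark aggregates F1 across tasks.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical leaderboard row; field names are assumptions."""
    name: str
    f1: float
    llm_judge: float


def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rank(entries: List[Entry]) -> List[Entry]:
    """Sort by F1 descending, breaking ties with LLM Judge descending."""
    return sorted(entries, key=lambda e: (e.f1, e.llm_judge), reverse=True)


# Made-up placeholder values, purely to show the tiebreak.
example = [
    Entry("system-a", 0.5, 0.6),
    Entry("system-b", 0.7, 0.1),
    Entry("system-c", 0.5, 0.8),
]
ranked_names = [e.name for e in rank(example)]
```

Here `system-a` and `system-c` tie on F1, so the higher LLM Judge score places `system-c` ahead.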
Self-reported numbers on the public dev set (40 tasks, Pass@1). Useful for prototyping and comparing methods during development. Anyone can score themselves locally on this split.