GitGoodBench: A Novel Benchmark for Evaluating Agentic Performance on Git

Published in the REALM Workshop at ACL 2025 (Spotlight)

GitGoodBench is a benchmark for evaluating how well AI agents perform version control tasks. It covers three core Git scenarios sourced from real open-source repositories and is released as three datasets of 900, 120, and 17,469 samples. The benchmark provides an evaluation framework for agentic performance on practical software engineering workflows that involve Git.

Recommended citation: Lindenbauer, T., Bogomolov, E., & Zharov, Y. (2025). "GitGoodBench: A Novel Benchmark for Evaluating Agentic Performance on Git." Proceedings of the 1st Workshop for Research on Agent Language Models (REALM) at ACL 2025.


Tobias Lindenbauer
