Today I am building a dashboard to benchmark Coducky's code review performance against the major competitors. I am using Martian's code review benchmark: github.com/withmartian/code-review-benchmark
Unlike the competitors, Coducky lets you choose your own models. For a quick check against 5 PRs from the benchmark, I used Qwen 3.7-plus in single-panel mode (as opposed to multi-panel, where you can choose several models to review). F1 is the metric to watch here: it is the harmonic mean of precision and recall, so a reviewer only scores well if it both avoids false alarms and catches most of the real issues.
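For anyone who wants to check the F1 numbers by hand, here is a minimal Python sketch of the calculation. The function name and the counts in the example are made up for illustration; they are not results from the benchmark, and this is not the benchmark's own scoring code.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 (harmonic mean of precision and recall) from raw counts.

    Hypothetical helper for illustration only. Returns 0.0 when precision
    or recall is undefined or when both are zero.
    """
    predicted = true_positives + false_positives  # comments the reviewer raised
    actual = true_positives + false_negatives     # issues that really exist
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts, NOT benchmark data: 6 correct findings,
# 2 false alarms, 4 missed issues.
example = f1_score(true_positives=6, false_positives=2, false_negatives=4)
print(f"F1 = {example:.3f}")
```

With those illustrative counts, precision is 0.75 and recall is 0.6, and the harmonic mean lands below their simple average, which is why a reviewer cannot get a high F1 by being strong on only one of the two.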
With this cheap model (about 2 cents per PR review), Coducky's harness already scores higher than most of the major players on this 5-PR sample.
I am running the full 50-PR benchmark now (ETA 2.5 hours) to see how we fare across the entire corpus.