Ran it on a subset of 10 of the 50 PRs in this benchmark https://codereview.withmartian.com
- very good recall (~74%, i.e. it found a lot of the golden issues)
- not so good precision (~12%, i.e. lots of false positives)
- the low precision causes the F1 to tank (~20%; if this stays the same on the full 50-PR sample, it would put it almost last, even lower than Kilo+Grok)
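For anyone checking the arithmetic: F1 is the harmonic mean of precision and recall, so the lower of the two dominates. A minimal sketch using the approximate figures above (the function name `f1_score` is just for illustration, not part of the benchmark):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate figures from the 10-PR subset.
f1 = f1_score(precision=0.12, recall=0.74)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.207"
```

With recall already at ~74%, nearly all of the remaining F1 headroom is in precision.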
False positives from the deterministic audits are a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to be the only way.
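A rough sketch of what cross-method deduplication could look like, assuming each audit emits findings with a file path, line number, and category. The `Finding` type, the `corroborated` function, and the bucketing rule are all hypothetical, not how any particular tool does it:

```python
from collections import defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    source: str    # which audit or LLM reported it (hypothetical field)
    path: str
    line: int
    category: str


def corroborated(findings, min_sources=2, line_window=3):
    """Keep only locations reported by at least `min_sources` distinct sources."""
    groups = defaultdict(set)
    for f in findings:
        # Bucket nearby lines together so small line offsets still match.
        key = (f.path, f.line // line_window, f.category)
        groups[key].add(f.source)
    return sorted(k for k, srcs in groups.items() if len(srcs) >= min_sources)


findings = [
    Finding("static-audit", "a.py", 10, "null-deref"),
    Finding("llm-audit", "a.py", 11, "null-deref"),
    Finding("static-audit", "b.py", 40, "unused-var"),
]
print(corroborated(findings))
```

Fixed-size buckets miss neighbours that straddle a bucket boundary, and requiring agreement trades some recall for precision, so this is only a starting point.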
I would say that recall is the most important metric here though. I'd want it to catch all the issues.
False positives are easy to ignore.
Which LLM did you use? I assume that will make a pretty big difference.