Hacker News

eranation, today at 5:16 AM, 4 replies

Ran it on a subset of 10 of the 50 PRs in this benchmark: https://codereview.withmartian.com

- very good recall (~74%, i.e. it found a lot of the golden issues)

- not so good precision (~12%, i.e. lots of false positives)

- the precision causes the F1 to tank (~20%; if this stays the same on the full 50-PR sample, it would put it almost last, even lower than Kilo+Grok)
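
For anyone who wants to check the arithmetic, here is a minimal sketch of how F1 follows from the two rounded figures above. It assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is just illustrative and is not taken from the benchmark's own tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the comment above (10-PR subset).
f1 = f1_score(precision=0.12, recall=0.74)
print(f"F1 = {f1:.1%}")  # prints "F1 = 20.7%"
```

Because the harmonic mean is pulled toward the smaller of the two values, a recall of ~74% cannot compensate for a precision of ~12%.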


Replies

tirpen, today at 11:21 AM

Which LLM did you use? I assume that will make a pretty big difference.

bobkb, today at 11:02 AM

False positives from the deterministic audits are a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to be the only way.
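
A rough sketch of what that cross-checking could look like, under loud assumptions: each audit reports findings as `(file, line, rule)` tuples, and a finding is kept only when at least two independent audits agree on it. The source names, the tuple shape, and the `min_sources` threshold are all hypothetical and not taken from any tool in this thread.

```python
from collections import defaultdict


def cross_check(findings_by_source, min_sources=2):
    """Deduplicate findings and keep those reported by >= min_sources audits."""
    sources_for = defaultdict(set)
    for source, findings in findings_by_source.items():
        for finding in findings:
            # A set per finding collapses duplicates from the same audit.
            sources_for[finding].add(source)
    return sorted(f for f, s in sources_for.items() if len(s) >= min_sources)


# Hypothetical findings from one deterministic audit and two LLM audits.
audits = {
    "static_rules": [("auth.py", 42, "sql-injection"), ("util.py", 7, "unused-var")],
    "llm_a": [("auth.py", 42, "sql-injection"), ("api.py", 13, "missing-check")],
    "llm_b": [("api.py", 13, "missing-check"), ("auth.py", 42, "sql-injection")],
}
confirmed = cross_check(audits)
```

Requiring agreement trades recall for precision: a real issue that only one audit catches gets dropped, so the threshold is a judgment call.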

akie, today at 5:47 AM

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

isabellehue, today at 8:07 AM

[flagged]