Ran it on a subset of 10 of the 50 PRs in this benchmark

eranation • today at 5:16 AM • 4 replies

Ran it on a subset of 10 of the 50 PRs in this benchmark https://codereview.withmartian.com

- very good recall (~74%, i.e. it found a lot of the golden issues)

- not so good precision (~12%, i.e. lots of false positives)

- the low precision causes the F1 to tank (~20%; if this stays the same on the full 50-PR sample, it would put it almost last, even lower than Kilo+Grok)
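For anyone checking the arithmetic, here is a minimal sketch of how the F1 figure follows from the two numbers above, assuming the benchmark uses the standard harmonic-mean definition of F1 (the thread does not confirm this). The function name `f1_score` is just illustrative, and the inputs are the rounded values quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the 10-PR subset (assumed inputs, not exact data)
f1 = f1_score(precision=0.12, recall=0.74)
print(f"F1 = {f1:.1%}")  # F1 = 20.7%
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, which is why a 74% recall cannot make up for a 12% precision.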

Replies

tirpen • today at 11:21 AM

Which LLM did you use? I assume that will make a pretty big difference.

bobkb • today at 11:02 AM

False positives from the deterministic audits are a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to be the only way.
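A rough sketch of what that cross-checking could look like, assuming each audit emits findings as `(file, line, rule)` tuples and that exact tuple equality is enough to call two findings the same. Both are simplifying assumptions; real tools would need fuzzier matching. The name `cross_confirmed` and the sample data are hypothetical.

```python
from collections import defaultdict


def cross_confirmed(findings_by_source, min_sources=2):
    """Return findings reported by at least `min_sources` distinct sources."""
    sources_for = defaultdict(set)
    for source, findings in findings_by_source.items():
        for finding in findings:
            # A set deduplicates repeats from the same source
            sources_for[finding].add(source)
    return sorted(f for f, srcs in sources_for.items() if len(srcs) >= min_sources)


# Hypothetical findings from three audits of the same PR
findings = {
    "deterministic": [("api.py", 10, "null-deref"), ("db.py", 4, "sql-injection")],
    "llm_a": [("api.py", 10, "null-deref"), ("ui.py", 7, "unused-var")],
    "llm_b": [("db.py", 4, "sql-injection"), ("api.py", 10, "null-deref")],
}
print(cross_confirmed(findings))
```

Requiring agreement trades some recall for precision: a real issue that only one method catches is dropped along with the false positives.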

akie • today at 5:47 AM

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

isabellehue • today at 8:07 AM

[flagged]
