My experience with using AI tools for code review is that they do find critical bugs (from my retrospective analysis, maybe 80% of the time), but the signal-to-noise ratio is poor. It's really hard to get them not to tell you 20 highly speculative reasons why the code is problematic alongside the one critical error. And in almost all cases, sufficient human attention would also have identified the critical bug, so human attention is the primary bottleneck here. Thus the poor signal-to-noise ratio isn't a side issue; it's one of the core issues.
As a result, I'm mostly using this selectively so far, and I wouldn't want it turned on by default for every PR.
I suspect the noise is largely an artifact of cost optimization. Most tools restrict the context to just the diff to save on input tokens, rather than traversing the full dependency graph. Without seeing the actual definitions or call sites, the model is forced to speculate on side effects.
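A contrived TypeScript sketch of what that looks like (the names are invented, not from any real tool): the diff-scoped reviewer only gets the changed function, while the behaviour it would need to judge lives in a module outside the diff.

    // Imagine this helper is defined in a module the bot never sees.
    declare function invalidateSessions(userId: string): Promise<void>;

    // The diff-scoped reviewer only sees this changed function.
    export async function deactivateUser(id: string): Promise<void> {
      await markInactive(id);
      await invalidateSessions(id); // <- the new line in the PR
    }

    async function markInactive(id: string): Promise<void> {
      // ...persist the flag somewhere...
    }

Whether invalidateSessions is an idempotent cache flush or something destructive is only visible outside the diff, so the model can only guess and pads the review with speculative warnings.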
That's not even mentioning that a not-insignificant part of the point of code review is to spread understanding of how the code base is evolving among the rest of the team. The reviewer benefits from the act of reviewing as well.
How is that different from today's SA tools, like CodeQL and SonarQube? Most of the feedback is just sh*t and pushes programmers toward senseless "perfections" that double the amount of work that has to be done later to toggle or tune behaviour, because the configuration points are gone thanks to bad static analysis. Practices that clearly express intent and preserve convenience, like making a method virtual, adding a public method, or not making a method static when it is likely to use instance fields in the future, are shunned by all SA tools just because the rules are opportunistic rather than reflecting real intent.
I've only managed to use it as a linter-but-on-steroids: where I'd normally page through the Ruby docs on enumerators to find the exact method that does what someone has implemented in a PR (because there's almost always something in there that can help), I can instead prompt it to look up a more idiomatic version of the implementation for the Ruby version being used. It's easy to cross-check and it saves me some time.
It's not very good with the rest, because there's an intuition that needs to be developed over time, one that takes all the weirdness into account: the dead code, the tech debt, the stuff that looks fundamentally broken but is depended on because of unintended side effects, etc. The code itself is not enough to explain that; it's not holistic documentation of the system.
The AI is no different to a person here: something doesn't 'feel' right, you go and fix it, it breaks, so you have to put it back again because it's actually harder than you think to change it.
It very much depends on the product. In my experience, Copilot has a terrible signal-to-noise ratio. But Bugbot is incredible: very little noise, and it consistently finds things the very experienced humans on my team didn't.
> signal to noise ratio is poor
I think this is the problem with just about every tool that examines code.
I've had the same problem with runtime checkers, with static analysis tools, and now ai code reviews.
Might be the nature of the beast.
probably happens with human code reviews too. Lots of style false positives :)
The signal-to-noise ratio problem is unexpectedly difficult.
We wrote about our approach to it some time ago here - https://www.greptile.com/blog/make-llms-shut-up
Much has changed in our approach since then, so we'll probably write a new blog post.
The tl;dr of what makes it hard is:
- different people have different ideas of what a nitpick is
- it's not a spectrum, the differences are qualitative
- LLMs are reluctant to risk downplaying the severity of an issue and therefore are unable to usefully filter out nits
- theory: they are paid by the token and so they say more stuff
My experience is similar. AI's context is limited to the codebase. It has limited or no understanding of the broader architecture or business constraints, which adds to the noise and makes it harder to surface the issues that actually matter.
I've been using it a bit lately, and at first I was enjoying it, but it quickly devolved into finding different minor issues with each minor iteration, including a lovely loop of "check against null rather than undefined", then "check against undefined rather than null", and so on.
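A contrived TypeScript sketch of that loop (not the actual code):

    // Round 1: the bot flags the null check and asks for an undefined check.
    function label(input: string | null | undefined): string {
      if (input === null) return "unknown"; // bot: "should this be undefined?"
      return input ?? "unknown";
    }

    // Round 2: flip the guard to undefined and it flags the null case.
    // Covering both at once is usually what ends the back-and-forth:
    function labelQuiet(input: string | null | undefined): string {
      return input ?? "unknown"; // ?? handles both null and undefined
    }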
I agree but find it's fairly easy noise to ignore.
I wouldn't replace human review with LLM-review but it is a good complement that can be run less frequently than human review.
Maybe that's why I find it easy to ignore the noise: I have it do a huge review task after a lot of changes have happened. It'll find 10 or so things, and the top 3 or 4 are likely good ones to look deeper into.
For the signal-to-noise reason, I start with Claude Code reviewing a PR, then selectively choose what I want to bubble up to the actual review. Oftentimes there's additional context not available to the model, or it's just nitpicky.
You should try Codex. There's a pretty wide gap between the quality of code review tools out there.
Agreed.
I have to constantly push back against it proposing C++ library code, like std::variant, when C-style basics are working great.
I absolutely hate the verbosity of AI. I know that you can give it context; I have done it, and it helps a little. It will still give me 10 "ideas", many of which are closely related to each other.
One thing I've found to be successful is to:
1) give it a number of things to list in order of severity, and
2) tell it to grade how serious a problem each one may be
The human reviewer can then look at the top-ten list, and at what the LLM thinks about its own list, for a very low overhead of thinking (i.e., if the LLM thinks its own ideas are dumb, a human probably doesn't need to look into them too hard).
It also helps to explicitly call out types of issue (naming, security, performance, correctness, etc)
The human doesn't owe the LLM any amount of consideration; it's just an idea-generating tool. A top-ten list formatted as a table can be scanned in 10 seconds on a first pass.
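As a concrete illustration, a prompt along these lines covers both points plus the category call-out (the exact wording here is mine, not from any particular tool):

    Review this diff. List at most 10 issues, ordered by severity.
    For each issue give a one-line description, a category
    (naming / security / performance / correctness), a severity grade
    from 1 (nit) to 5 (must fix), and your confidence that it's a
    real problem. Output the result as a table.

The table output is what makes the 10-second first-pass scan possible.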
> but the signal to noise ratio is poor
Nail on the head. Every time I've seen it applied, it's awful at this. However, this is the one thing I loathe in human reviews as well, where people leave twenty comments about naming and the actual FUNCTIONAL issue is buried inside all of that mess. A good code reviewer knows how to just drop all the things that irk them and hyperfocus on what matters, if there's a functional issue with the code.
I wonder if AI is ever gonna be able to conquer that one, as it's quite nuanced. If it does, though, then I feel the industry as it is today is kinda toast for a lot of developers, because outside of agency, this is the one thing we were sorta holding out on as not very automatable.