None of these tools perform particularly well and all lack context to actually provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping off point.
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
In some code that I was working on, I had
// stuff
obj.setSomeData(something);
// fifteen lines of other code
obj.setSomeData(something);
// more stuff
The 'something' was a little bit more complex, but it was the same something with slightly different formatting. My linter didn't catch the repeated call. When I asked the AI chat to review the code changes, it correctly flagged the repeated call.
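To make "the same something with slightly different formatting" concrete, here's a hypothetical version of the pattern (the Map payload and the names are invented, not the actual code):

import java.util.Map;

class DuplicateSetterExample {
    static class Client {
        void setSomeData(Map<String, String> data) { /* store it */ }
    }

    static void configure(Client obj) {
        obj.setSomeData(Map.of("host", "db.internal", "port", "5432"));
        // ... fifteen lines of other code ...
        obj.setSomeData(Map.of(
                "host", "db.internal",
                "port", "5432")); // same value, different whitespace
    }
}

A line-oriented duplicate check won't match the two calls once the formatting differs, but a reviewer (human or model) only has to notice that the second call just re-sets the same data.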
It also caught a repeat call in
List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
// ...
var something = someFunc(objs);
Thingy someFunc(List<Objs> param) {
return param.stream().filter(o -> o.field.isPresent()). ...
Where one of the filter calls is unnecessary... and it caught that across a call boundary (a fleshed-out sketch of the pattern is below).
So, I'd say AI code review is better than a linter. There are still things it fusses about because it doesn't know the full context of the application, the tables that make certain guarantees about the data, or the team's code conventions (in particular the use of internal terms within naming conventions).
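For concreteness, a minimal self-contained sketch of that redundant-filter pattern; the class and field names are invented, not the actual code:

import java.util.List;
import java.util.Optional;

class RedundantFilterExample {
    record Obj(Optional<String> field) {}
    record Thingy(List<Obj> objs) {}

    static Thingy someFunc(List<Obj> param) {
        // Redundant whenever the caller has already filtered on field.isPresent(),
        // but nothing at this call boundary enforces or documents that.
        return new Thingy(param.stream().filter(o -> o.field().isPresent()).toList());
    }

    public static void main(String[] args) {
        List<Obj> someList = List.of(new Obj(Optional.of("a")), new Obj(Optional.empty()));
        List<Obj> objs = someList.stream().filter(o -> o.field().isPresent()).toList();
        var something = someFunc(objs); // the inner filter repeats work already done on objs
        System.out.println(something);
    }
}

A linter sees two unrelated filter calls; spotting the redundancy requires knowing what someFunc's callers actually pass in.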
Opus 4.5 catches all sorts of things a linter would not, and with little manual prompting at that. Missing DB indexes, forgotten migration scenarios, inconsistencies with similar services, an overlooked edge case.
Now I'm getting a robot to review the branch at regular intervals and poke holes in my thinking. The trick is not to use an LLM as a confirmation machine.
It doesn't replace a human reviewer.
I don't see the point of paying for yet another CI integration doing LLM code review.
AI code review, to me, is similar to AI code itself. It's good (and constantly getting better) at dealing with mundane things: is the list reversed correctly? Are you dealing with pointers correctly? Do you have off-by-one issues?
Where they suck is high-level problems: is the code actually solving the business problem? Is it using the right dependencies? Does it fit into the broader design?
Which is expected, and a great help for me. As a human, I'm happier spending less time checking whether you're managing a pointer's lifecycle correctly, and more time ensuring the code does what it actually needs to do.
I installed CodeRabbit for our reviews in GitLab and am pretty happy with the results, especially considering the low price ($15/user/mo I think).
It regularly finds problems, including subtle but important problems that human reviewers struggle to find. And it can make pretty good suggestions for fixes.
It also regularly complains about things that are possible in theory but impossible in practice, so we've gotten used to just resolving those comments without any action. Maybe if we used types more effectively it would do that less.
We pay a lot more attention to what CodeRabbit says than we did to what DeepSource said when we used it.
GH Copilot is definitely far better than just a linter. I don't have examples to hand but one thing that's stood out to me is its use of context outside the changes in the diff. It'll pull in context that typically isn't visible in the PR itself, the sort of things that only someone experienced in the code base with good recall would connect the dots on (e.g. this doesn't conform to typical patterns, or a version of this is already encapsulated in reusable code, or there's an existing constant that could be used here instead of the hardcoded value you have).
I don't know that I fully agree with that. I use Copilot for AI code review - just because it's built in to GitHub and it's easy - and I'd say results are variable, but overall decent.
Like anything else with AI, you need to understand what you're doing, so you need to understand your code and the structure of your application or service, because there are times it will say something that's completely wide of the mark, or even the polar opposite of what's actually the case. In those situations you just ignore the crap and close the conversation.
At the same time, it does catch a lot of bugs and problems that fall into classes where more traditional linters really miss the mark. It can help fill holes in automated testing, spot security issues, etc., and it'll raise PRs for fixes that are generally decent. Sometimes they're not, but again, in those cases you just close them and move on.
I'd certainly say that an AI code review is better than no code review at all, so it's good for a startup where you might be the only developer or where there are only one or two of you and you don't cross over that much.
But the point I actually wanted to get to is this: I use Copilot because it's available as part of my GitHub subscription. Is it the best? I don't know. Does it add value with zero integration cost to me? Yes. And that, I suspect, is going to make it the default AI code review option for many GitHub subscribers.
That does leave me wondering how much of a future there is for AI code review as a product or service outside of the hosting platforms like GitHub and GitLab, and I have to imagine that an absolutely savage consolidation is coming.
I suspect this is primarily a unit economics problem. To get context beyond the diff you really need the full repository or a robust AST, but the token costs to load that state for every PR make the margins impossible right now.
> The SOTA isn't capable of using a code diff as a jumping off point.
Not a jumping off point, but I'm having pretty great results on a complicated fork of a big project with a `git diff main..fork > main.diff`, then loading in the specs I keep and telling it to review the diff in chunks while updating a ./review.md.
It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
They 100% catch bugs in code I work on. Is it replacing human review fully? No, not yet. But it is a useful tool. Just like most of us wouldn't do a code review without having tests, linters, etc. run first.
Anecdotally, Claude Bug Bot has actually been super impressive at understanding non-trivial changes. Today, for example, it noted a race condition in a ~1000-line Go change that `go test -race` didn't pick up. There are definitely issues, though. For one, it's non-deterministic, so you end up with half a dozen commits, with each run noting different issues. For another, it tends to be quite in favour of premature optimisation. But overall, well worth it in my experience.
> The SOTA isn't capable of using a code diff as a jumping off point.
The low quality of HN comments has been blowing my mind.
I have quite literally been doing what you describe every working day for the last 6+ months.
I agree that none perform _super_ well.
I would argue they go far beyond linters now, which was perhaps not true even nine months ago.
To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.