I think they're saying that frontier LLMs may be able to spot citations that are correct by shape (a real, well-formed citation) but incorrect by usage (unrelated to the claim they're attached to)
I kind of hate the idea, but you probably could do a lazy LLM check of every paper and every citation, and have it flag citations that are possibly wrong in the second sense for human review
But you'd need a LOT of tokens and a LOT of human-hours
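To make the "lazy check" concrete, here's a rough sketch of what the loop might look like: score each (claim, citation) pair for relevance and queue the low scorers for human review. Everything here is hypothetical; `ask_llm` stands in for a real model call and is faked with crude word overlap just so the example runs.

```python
# Hypothetical sketch of the "lazy LLM check": for each (claim, cited-abstract)
# pair, estimate whether the cited work actually supports the claim, and
# queue low-scoring pairs for human review, worst first.

def ask_llm(claim: str, cited_abstract: str) -> float:
    """Stand-in for a real LLM call; returns a relevance score in [0, 1].
    Faked here with word overlap purely for demonstration."""
    claim_words = set(claim.lower().split())
    abstract_words = set(cited_abstract.lower().split())
    if not claim_words:
        return 0.0
    return len(claim_words & abstract_words) / len(claim_words)

def flag_for_review(pairs, threshold=0.3):
    """Return pairs scoring below threshold, lowest score first,
    so human reviewers can prioritize the most suspicious citations."""
    scored = [(ask_llm(claim, abstract), claim, abstract)
              for claim, abstract in pairs]
    return sorted(item for item in scored if item[0] < threshold)

pairs = [
    ("mitochondria produce ATP via oxidative phosphorylation",
     "we study ATP synthesis and oxidative phosphorylation in mitochondria"),
    ("mitochondria produce ATP via oxidative phosphorylation",
     "a survey of medieval trade routes across the Baltic"),
]
flagged = flag_for_review(pairs)
print(len(flagged))  # only the unrelated citation gets flagged
```

The point isn't the scoring function (a real system would use an actual model and the full cited text, not an abstract); it's that the output is a prioritized queue, not a verdict.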
> have it flag possible wrong (second sense) citations for human review
And then what, we're done? How have we avoided the need for the same exhaustive human review? It only saves human review time if you trust the LLM not to miss things.
Right, that's what I'm saying. The LLM can identify and prioritize possible cases of academic fraud (or serious incompetence) for human review. As the cost of tokens drops, it will become practical to go back and do AI reviews of every scholarly journal article ever written.