I find this study quite suspect. I'd have to dive deeper but there are definitely significant alarm bells that should be going off for anyone reading.
Figure 2 (page 6) screams problems. There are only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol
There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?
I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over.
Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.” In another two years it’s going to be curtains.
Independent of whether it has any meaning (because the entire paper might be a bit iffy), I find it curious that Instructors 3 and 8 have the lowest harmfulness rates, quite a bit lower than even the LLMs, but not the highest preference rates. Harmfulness anticorrelates with preference, but not perfectly. Some amount of charisma appears to be a factor even in selections by professionals?
I think your 3k figure comes from here. The paper explains it:
> As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student
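As a quick sanity check on that quote, the arithmetic can be sketched in Python. The total (2,918) and the median (200) are the figures quoted from the paper; the judge count of 16 is an assumption taken from the Figure 2 comment upthread, and the variable names are made up for illustration.

```python
# Figures quoted from the paper
total_comparisons = 2918
median_per_judge = 200

# Assumption: 16 judges, based on the Figure 2 comment upthread
num_judges = 16

# Average workload per judge under that assumption
mean_per_judge = total_comparisons / num_judges
print(f"mean per judge: {mean_per_judge:.1f}")
print(f"mean below median: {mean_per_judge < median_per_judge}")
```

Under that assumption the average is about 182 comparisons per judge, so 3k is the total, not the per-professor count. Since 16 judges at 200 or more each would add up to at least 3,200, some judges must have done fewer than 200.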
More and more I see papers like this: interview 8 people, draw conclusions based on their expert opinions. AI and cybersecurity are full of this.
I even saw some where they just slapped interviews + protocol into ChatGPT as 'methodology' to extract the results -_-. Peer reviewed and published.
> There's also really clear bias given that the main results only feature Google models.
The main results also don’t seem to know what a “model” is, as the two “models” it refers to are “stock Gemini 2.5 Pro” and “a retrieval-augmented version of NotebookLM”.
One of which is a model, and the other of which is an interface backed by different models depending on exactly when the analysis was performed.
I never get the same answer from any two lawyers. I hate law as a result. With developers you might get disagreements based on experience, but there's usually a strong consensus on specific things; with lawyers and courts it's all over the flipping place. I wouldn't be surprised if LLMs can "pass" on paper (i.e. college exams) but in practice, they might 'struggle' in different courts.
...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...
This is all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.
Viewed the other way around, one should ask with what intent the study was set up like this. And for obvious reasons it sounds like the motive is monetary.
I find it entirely likely that the preference for the AI-generated answers is entirely due to the confidence of its assertions. Given the number of evaluations each prof had to do, there’s no way they researched the answers thoroughly. But if there’s one thing we all know LLMs can do well, it’s to generate text that sounds extremely confident. And that signal is appealing in choosing which of two statements you’d give to students.
The paper says the professors have a median of 200 comparisons each. It also says they only used 2 models because using more models would require more comparisons and they selected Google models because Google was branded/advertised as being education focused. When you see other models show up elsewhere, that's because they extended the main idea to other models but using LLMs to judge instead of human professors.
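To put the 200-comparisons figure in context, here is a minimal sketch of how much sampling noise a single judge's preference rate would carry at that sample size. It uses the standard Wilson score interval; the 100-out-of-200 split is a made-up example (not a number from the paper), and it assumes the comparisons are independent, which may not hold.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical judge: 200 comparisons, prefers the LLM answer 100 times
low, high = wilson_interval(100, 200)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With these made-up inputs the interval is roughly 7 percentage points either side of 50%, so some of the spread between judges could be plain sampling noise. That alone doesn't settle whether the judges genuinely differ.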
Agreed. The study might show something useful, but the headline is doing a lot of work.
But does it really matter? It seems fairly obvious that AI is going to outperform professors. While the studies run, there are three more model releases that change the calculus entirely. I wonder how much we are learning with these studies about what is going on.
The study was conducted by Stanford’s HAI institute, which receives heavy funding from Google (how much I couldn’t find, because they don’t publish their donations anywhere I could locate; but I suspect it is a lot). And the authors did not include a no-conflict-of-interest statement at the end of the paper.
> That's very high variance
Do you doubt that the educational value of a law professor can vary from 0 to somewhat reasonable? You are not studying screws here.
This is the bit I'm suspicious of:
> They calibrated AI responses to match the length and structure of human answers
which I would guess removes AI's hallucinations and errors somewhat.
> confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over
You can confidently say that you are unsure?
More than that, the entire structure of the study is pointless. They set it up as question/response and then had humans rate the response. That's literally what LLMs are trained to do, which ultimately is convincing a human to click the "I like this one better" button on its response.