godelski • today at 2:23 AM • 16 replies

I find this study quite suspect. I'd have to dive deeper, but there are definitely significant alarm bells that should be going off for anyone reading.

Figure 2 (page 6) screams problems. There are only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol

There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?

I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over.

Replies

volkercraig • today at 2:37 PM

More than that, the entire structure of the study is pointless. They set it up as question/response and then had humans rate the responses. That's literally what LLMs are trained to do, which ultimately is convincing a human to click the "I like this one better" button on its response.

gguncth • today at 8:50 AM

Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.” In another two years it’s going to be curtains.

Paracompact • today at 5:54 AM

Independent of whether it has any meaning (because the entire paper might be a bit iffy), I find it curious that Instructors 3 and 8 have the lowest harmfulness rates, quite a bit lower than even the LLMs, but not the highest preference rates. Harmfulness anticorrelates with preference, but not perfectly. Some amount of charisma appears to be a factor even in selections by professionals?

esquivalience • today at 6:44 AM

I think your 3k figure comes from here. It is explained in the paper:

> As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student
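
For a rough sense of what those counts imply for precision, here is a minimal sketch. It assumes each comparison is an independent coin flip with a true preference rate near 50%, which is optimistic because the comparisons are clustered by judge. The function name, the 50% rate, and the 1.96 z-value are illustrative choices, not taken from the paper.

```python
import math


def proportion_ci_half_width(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    n: number of independent comparisons (assumed independent for this sketch)
    p: assumed true preference rate (0.5 is the worst case for variance)
    z: critical value; 1.96 corresponds to a 95% interval
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Counts quoted from the paper: median of 200 comparisons per judge, 2,918 in total.
per_judge = proportion_ci_half_width(200)
pooled = proportion_ci_half_width(2918)

print(f"per judge (n=200): +/- {per_judge:.3f}")
print(f"pooled (n=2918):   +/- {pooled:.3f}")
```

Under those assumptions, a single judge's preference rate at n = 200 is only pinned down to about ±7 percentage points, while the pooled 2,918 comparisons give about ±2 points. Accounting for clustering by judge would widen the pooled interval.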

saidnooneever • today at 3:38 PM

More and more I see papers that interview 8 people and draw conclusions based on their expert opinions. AI and cybersecurity are full of this.

I even saw some where they just slapped interviews + protocol into ChatGPT as 'methodology' to extract the results -_-. Peer reviewed and published.

dragonwriter • today at 2:51 PM

> There's also really clear bias given that the main results only feature Google models.

The main results also don’t seem to know what a “model” is, as the two “models” it refers to are “stock Gemini 2.5 Pro” and “a retrieval-augmented version of NotebookLM”.

One of which is a model, and the other of which is an interface backed by different models depending on exactly when the analysis was performed.

giancarlostoro • today at 3:03 PM

I never get the same answer from any two lawyers. I hate law as a result. With developers you might get disagreements based on experience, but there's usually a strong consensus on specific things; with lawyers and courts it's all over the flipping place. I wouldn't be surprised if LLMs can "pass" on paper (i.e., college exams) but in practice 'struggle' in different courts.

...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...

This is all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.

vlan121 • today at 2:28 PM

Viewed the other way around, one should ask with what intent the study was designed like this. For obvious reasons, the motive sounds monetary.

skywhopper • today at 10:53 AM

I find it entirely likely that the preference for the AI-generated answers is entirely due to the confidence of their assertions. Given the number of evaluations each prof had to do, there’s no way they researched the answers thoroughly. But if there’s one thing we all know LLMs can do well, it’s to generate text that sounds extremely confident. And that signal is appealing when choosing which of two statements you’d give to students.

ALittleLight • today at 3:47 AM

The paper says the professors did a median of 200 comparisons each. It also says they only used 2 models because using more models would require more comparisons, and that they selected Google models because Google was branded/advertised as being education focused. When you see other models show up elsewhere, that's because they extended the main idea to other models, but with LLMs as judges instead of human professors.

RataNova • today at 9:50 AM

Agreed. The study might show something useful, but the headline is doing a lot of work.

jstummbillig • today at 8:04 AM

But does it really matter? It seems fairly obvious that AI is going to outperform professors. While the studies run, there are three more model releases that change the calculus entirely. I wonder how much we are learning with these studies about what is going on.

runarberg • today at 3:36 AM

The study was conducted by Stanford’s HAI institute, which receives heavy funding from Google (how much I couldn’t find, because they don’t publish their donations anywhere I could find; but I suspect it is a lot). And the authors did not declare the absence of any conflict of interest at the end of the paper.

scotty79 • today at 12:50 PM

> That's very high variance

Do you doubt that the educational value of a law professor can vary from 0 to somewhat reasonable? You are not studying screws here.

philipwhiuk • today at 11:18 AM

This is the bit I'm suspicious of:

> They calibrated AI responses to match the length and structure of human answers

which I would guess removes AI's hallucinations and errors somewhat.

NuclearPM • today at 5:03 PM

> confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over

You can confidently say that you are unsure?
