The core problem in the article isn't AI or LLMs; it's scam software that claims to catch cheating. It's crap for the same reasons crime-prediction software is crap: it's selling a panacea, and that kind of product inherently attracts scammers.
If your school uses software to detect AI writing, that's a problem with the quality of your school. The people choosing that software are too stupid to be running a school. The software isn't going to get any better.
20 years ago (so this is very dated), I TAed an introductory CS class for engineers who weren't going to major in CS. We used MOSS [1]. Maybe it was the threshold we picked, but the matches it flagged were pretty blatant. People renamed variables, renamed functions, changed comments, and the clever ones inlined or extracted a function. A lot of them missed something and copied a bug or quirk from the original.
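For intuition, here's a minimal sketch of the fingerprinting idea behind tools like MOSS: winnowing, in the style of the Schleimer/Wilkerson/Aiken paper, not MOSS's actual internals, and the k/window values are made up. Identifiers get normalized away before anything is hashed, which is exactly why renaming variables and functions accomplishes nothing:

    import re
    import zlib

    def normalize(source: str) -> str:
        # Drop comments, rename every identifier to 'V', and strip
        # whitespace, so renames and comment edits change nothing.
        source = re.sub(r'#.*', '', source)
        source = re.sub(r'\b[A-Za-z_]\w*\b', 'V', source)
        return re.sub(r'\s+', '', source)

    def fingerprints(source: str, k: int = 5, window: int = 4) -> set:
        # Hash every k-gram of the normalized text, then keep only the
        # minimum hash in each sliding window ("winnowing"), giving a
        # compact fingerprint of the document.
        text = normalize(source)
        hashes = [zlib.crc32(text[i:i + k].encode())
                  for i in range(len(text) - k + 1)]
        return {min(hashes[i:i + window])
                for i in range(len(hashes) - window + 1)}

    def similarity(a: str, b: str) -> float:
        # Jaccard overlap of the two fingerprint sets.
        fa, fb = fingerprints(a), fingerprints(b)
        return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

    original = ("def total(items):\n"
                "    s = 0\n"
                "    for x in items:\n"
                "        s += x\n"
                "    return s\n")
    renamed = ("def compute_sum(values):  # my own work\n"
               "    acc = 0\n"
               "    for v in values:\n"
               "        acc += v\n"
               "    return acc\n")

    print(similarity(original, renamed))  # 1.0: the renames changed nothing

Every name and comment is different in the second version, yet after normalization the token streams are identical, so the fingerprints match perfectly.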
Does crapping on the average school's deep well of expertise for evaluating how effectively AI software actually addresses its problems somehow fix the underlying problem (that the cost of catching cheaters is significantly higher than the cost of cheating)?
(This is roughly the same problem as evaluating software that only does an approximation of what it claims to do.)
(Aside: AI-based variations on this theme are in the early stages of proliferating across our society. They're being developed by many people using this forum and being sold to our schools, businesses, governments, and other organizations with little regard to whether they actually do what they claim.)
I'm always startled by how HN approaches these topics. When a university press release claims researchers can detect thoughts via fMRI, nobody has an issue with the claim. But when a vendor makes the fairly believable claim that there are repetitive statistical patterns in LLM output, it's all of a sudden treated the same as palm reading.
The problem isn't that AI detection doesn't work; the state of the art in this field is pretty solid. The issue is that it's probabilistic, so it sometimes fails, and when it does we have no fallback in situations where you actually want to know whether someone put in the work.
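To be concrete about what "repetitive statistical patterns" means: one common family of detectors scores text by how predictable it is to a language model. Here's a minimal sketch of that idea; it is not any particular vendor's method, and the threshold is made up. The overlap between human and model score distributions is exactly where the false positives come from:

    # Requires: pip install torch transformers
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        # Mean per-token perplexity under GPT-2. Model-generated text
        # tends to score lower ("less surprising") than human prose,
        # but the two distributions overlap, hence the false positives.
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean cross-entropy
        return torch.exp(loss).item()

    THRESHOLD = 30.0  # made-up cutoff; any real one trades false
                      # positives against false negatives

    def looks_generated(text: str) -> bool:
        return perplexity(text) < THRESHOLD

Whatever cutoff you pick, some human writers (especially formulaic or non-native ones) land on the wrong side of it, which is the probabilistic failure mode described above.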
So what are you proposing, exactly? That we run a large-scale experiment of "let's see what happens if children don't actually need to learn to do thinking and writing on their own"? The reality is that without some form of compulsion, most kids would rather play video games / scroll through TikTok all day. Or that we move to a vastly more resource-intensive model where every kid is given personalized instruction and watched 1:1?