So I guess the key takeaway is basically that the better Claude gets at producing polished output, t...

dmk • today at 5:57 PM • 3 replies • view on HN

So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it. They found that artifact conversations have lower rates of fact-checking and reasoning challenges across the board. That's kind of an uncomfortable loop for a company selling increasingly capable models.

Replies

Terr_ • today at 7:30 PM

> the less users bother questioning it

This makes me think of checklists. We have decades of experience in uncountable areas showing that checklists reminding users to question the universe improve outcomes: Is the chemical mixture at the temperature indicated by the chart? Did you get confirmation from Air Traffic Control? Are you about to amputate the correct limb? Is this really the file you want to permanently erase?

Yet our human brains are usually primed to skip steps, take shortcuts, and see what we expect rather than what's really there. It's surprisingly hard to keep doing the work both consistently and to notice deviations.

> lower rates of fact-checking and reasoning challenges

Now here we are with LLMs, geared to produce a flood of superficially-plausible output which strikes at our weak-point, the ability to do intentional review in a deep and sustained way. We've automated the stuff that wasn't as-hard and putting an even greater amount of pressure on the remaining bottleneck.

Rather than the old definition involving customer interaction and ads, I fear the new "attention economy" is going to be managing the scarce resource of human inspection and validation.

➕ show 1 reply

boplicity • today at 7:35 PM

> So I guess the key takeaway is basically that the better Claude gets at producing polished output, the less users bother questioning it.

This is exactly what I worry about when I use AI tools to generate code. Even if I check it, and it seems to work, it's easy to think, "oh, I'm done." However, I'll (often) later find obvious logical errors that make all of the code suspect. I don't bother, most of the time though.

I'm starting to group code in my head by code I've thoroughly thought about, and "suspect" code that, while it seems to work, is inherently not trustworthy.

Florin_Andrei • today at 6:09 PM

I think we're still at the stage where model performance largely depends on:

- how many data sources it has access to

- the quality of your prompts

So, if prompting quality decreases, so does model performance.

➕ show 3 replies

alt Hacker News

Replies