Hacker News

rainsford today at 12:18 AM

I have generally moved from bearish to bullish on the future of current AI technology, but the persistent inaccuracy with basic facts, even as the models otherwise improve significantly, continues to give me pause.

As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point where it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right, but something is a bit off, and then it turns out they're a zombie and going to try to eat your brain. This note-taking example feels similar. It nearly works in some pretty impressive ways and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.
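For reference, the conversion being fumbled is entirely deterministic: in US customary units, 1 tablespoon = 3 teaspoons. A minimal sketch (hypothetical helper names) showing how trivial the correct logic is:

```python
# US customary units: 1 tablespoon = 3 teaspoons.
TSP_PER_TBSP = 3

def tbsp_to_tsp(tbsp: float) -> float:
    """Convert tablespoons to teaspoons."""
    return tbsp * TSP_PER_TBSP

def tsp_to_tbsp(tsp: float) -> float:
    """Convert teaspoons to tablespoons."""
    return tsp / TSP_PER_TBSP

print(tbsp_to_tsp(1.5))  # 4.5
print(tsp_to_tbsp(6))    # 2.0
```

Three lines of ordinary code get this right every time, which is what makes the model's occasional failure at it so jarring.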

It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.


Replies

cootsnuck today at 3:39 AM

Yup, spot on. There's a capability-reliability gap that the industry does not like to talk about too much.

It often feels like the AI industry is continually glossing over the fact that capability and reliability are fundamentally different qualities. We tend to use "accurate" and "reliable" interchangeably, but they describe different things. A model can ace a benchmark (capability/accuracy) and still be a liability in production (reliability).

Just look at recent reactions to yet another release from METR showing improved capabilities. The less-discussed part is that their headline measure is the task length at a 50% success rate (and the even less-discussed secondary measure, at an 80% success rate, has a drastically shorter time horizon). https://metr.org/

I implement AI systems for enterprises and I don't know any that would ever be okay with 80% reliability (let alone 50%).
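One way to see why even 80% per-task reliability is a non-starter in production: success rates compound across chained steps. A quick illustration (my own arithmetic, not METR's methodology):

```python
# If each step in a pipeline succeeds independently with probability p,
# the whole n-step pipeline succeeds with probability p**n.
def pipeline_success(p: float, n: int) -> float:
    return p ** n

for p in (0.5, 0.8):
    rates = [round(pipeline_success(p, n), 3) for n in (1, 3, 5, 10)]
    print(f"p={p}: {rates}")
```

At 80% per step, a 10-step workflow already succeeds end-to-end less than 11% of the time; at 50%, it's under 0.1%. That compounding is why enterprises balk.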

smusamashah today at 8:32 AM

Your analogy reminds me of the messed-up fingers and hands in image generation models just a year ago. Now that is pretty much solved. These days they are generating videos you can't tell apart from reality. This makes me believe these nuances will keep shrinking and eventually become very hard to notice, maybe in every task.

igleria today at 8:25 AM

Yesterday I was using opus 4.6 through copilot (don't ask...) to rubber-duck-brainstorm a big feature that needs a lot of care.

I got some inspiration from it, but it misinterpreted very basic stuff. It might be a skill issue on my side, I do not know.

Brian_K_White today at 12:51 AM

I hate to help provide possible solutions to an entire process I don't approve of, but maybe the fuzzy tools need old-style deterministic tools the same way and for the same reasons we do.

So instead of an LLM trying to answer a math or reason question by finding a statistical match with other similar groups of words it found on 4chan and the all in podcast and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
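This is essentially the tool-use / function-calling pattern most LLM APIs already support: the model emits a structured request, and deterministic code computes the answer. A minimal sketch of the dispatch loop (the tool-call format here is hypothetical, not any vendor's actual API):

```python
# Sketch of the "give the model a calculator" pattern: the model emits a
# structured tool call; deterministic code evaluates it safely.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    """Safely evaluate +-*/ arithmetic instead of letting the model guess."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

# Hypothetical structured tool call the model might emit mid-generation:
tool_call = {"name": "calculator", "arguments": {"expr": "4 * 3 / 2"}}
if tool_call["name"] == "calculator":
    print(calculator(tool_call["arguments"]["expr"]))  # 6.0
```

The key design point is that the arithmetic never passes through next-token prediction; the model only decides *when* to reach for the tool.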

themafia today at 12:39 AM

> we're not actually on the right track to achieve real intelligence.

Real intelligence means saying "I don't know" when you don't know, or asking for help, or even just refusing to help, with the subtext being that you don't want to appear stupid.

The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult or because it would harm the reputation of the companies charging a good sum to use them.
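Mechanically, a crude version of abstention is cheap to bolt on if you have per-token probabilities: refuse when the model's own confidence falls below a threshold. A toy sketch (the log-probabilities and threshold here are hypothetical, not any vendor's API):

```python
import math

def answer_or_abstain(answer: str, token_logprobs: list[float],
                      threshold: float = 0.8) -> str:
    """Return the answer only if the geometric-mean token probability
    clears a threshold; otherwise abstain. token_logprobs are
    hypothetical per-token log probabilities for the answer."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    confidence = math.exp(mean_logprob)
    return answer if confidence >= threshold else "I don't know"

print(answer_or_abstain("3 teaspoons", [-0.01, -0.02, -0.03]))  # confident
print(answer_or_abstain("7 teaspoons", [-1.2, -0.9, -1.5]))     # abstains
```

The caveat, which may be part of the real answer to the question above, is that token-level confidence is a weak proxy for factual correctness: models can be fluently, confidently wrong, so a threshold like this catches only some failures.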
