
bitpush · yesterday at 7:04 PM

Have you tried running a reasonably sized model locally? You need a minimum of 24GB VRAM to load up a model, 32GB to be safe, and this isn't even frontier, just the bare minimum.

A good analogy would be streaming. To get good quality, sure, you can store the video file, but it's going to take up space. For videos that's 2-4GB (let's say), and streaming will always be easier and better.

For models, we're looking at hundreds of GB worth of model params. There's no way we can squeeze that into, say, 1GB without a loss in quality.
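Back-of-envelope math on those sizes, assuming the weights dominate the memory footprint (the 70B parameter count here is just an illustration, not any particular model):

```python
def model_size_gb(n_params_billion, bits_per_param):
    # bytes = params * (bits / 8); report in GB (1e9 bytes)
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

# A hypothetical 70B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{model_size_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```

Even aggressive 4-bit quantization of a mid-size model lands far above 1GB, which is the commenter's point.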

So nope, beyond minimal classification and such, on-device isn't happening.

--

EDIT:

> Nobody wants to be sending EVERY request to someone else's cloud server.

We do this already with streaming. You watch YouTube, which hosts videos in the "cloud". For the latest MKBHD video, I don't care about having it locally (for the most part). I just wanna watch the video and be done with it.

Same with LLMs. If LLMs are here to stay, most people will wanna use the latest/greatest models.

---

EDIT-EDIT:

If your response is "Apple will figure it out somehow": nope, Apple is sitting out the AI race. So it has no technology. It has nothing. It has access to whatever open source is available, or whatever it can license from the rest. So nope, Apple isn't pushing the limits. They are watching the world move beyond them.


Replies

542458 · yesterday at 7:09 PM

I think this is very pessimistic. Yes, big models are "smarter" and have more inherent knowledge, but I'd bet you a coffee that what 99% of people want to do with Siri isn't "Write me an essay on the history of textiles" or "Vibe code me a SPA"; rather it's "Send Mom the pictures I took of the kids yesterday" and "Hey, play that Deadmau5 album that came out a couple of years back", which is more about tool calls than having Wikipedia-level knowledge built into the model.
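To make the tool-call point concrete: for requests like these, the model's job is only to emit a structured call for the OS to execute, not to know facts. A purely illustrative sketch (the function name and fields are made up for the example, not any vendor's actual API):

```python
# Illustrative only: the kind of structured output a small on-device model
# might emit for "Send Mom the pictures I took of the kids yesterday".
tool_call = {
    "function": "share_photos",
    "arguments": {
        "recipient_contact": "Mom",
        "filter": {"subject": "kids", "taken": "yesterday"},
    },
}
print(tool_call["function"])  # share_photos
```

Producing output like this reliably is a much smaller task than open-ended question answering, which is why small local models can plausibly handle it.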

satvikpendem · yesterday at 7:28 PM

Have you run models locally, especially on a phone? I have, and there are even apps like Google AI Edge Gallery that run Gemma for you. It works perfectly fine for use cases like summarizing emails and such; you don't really need the latest and greatest (i.e. biggest) models for tasks like these, in much the same way that most people do not need the latest and greatest phone or laptop for their use cases.

And anyway, you already see models like Qwen 3.5 9B and 4B beating 30B and 80B parameter models, and those smaller models can already run on phones today, especially with quantization.

Benchmarks: https://huggingface.co/Qwen/Qwen3.5-4B

burningChrome · yesterday at 7:10 PM

>> So nope, beyond minimal classification and such, on-device isn't happening.

This is a paradox, right? Handset makers want less handset storage so they can get users to buy more of their proprietary cloud storage, while at the same time wanting them to use their AI more frequently on their handsets.

It will be interesting to see which direction they decide to go. Finding a phone in the last few years with more than 256GB of storage is not only expensive AF, it's become more of a rarity than commonplace. Backtracking on this model simply to get AI models on board would be a huge paradigm shift.

ben_w · yesterday at 7:39 PM

> You need minimum 24GB VRAM to load up a model. 32GB to be safe, and this isn't even frontier, but bare minimum.

Indeed.

But they said 5 years. That's certainly plausible for high-end mobile devices in Jan 2031.

I have high uncertainty about whether distillation will get Opus 4.6-level performance into that RAM envelope, but something interesting on-device, even if not that specifically, is certainly within the realm of plausibility.

Not convinced Apple gets any bonus points in this scenario, though.

chtitux · yesterday at 7:08 PM

5 years ago, the take on LLMs was "beyond minimal conversation, intelligence isn't happening".

I'm pretty sure that in five years, local LLMs will be a thing.

zozbot234 · yesterday at 7:20 PM

If you have fast storage, you don't need to keep all your params in VRAM. The big datacenter-scale providers do that for peak performance/throughput, but locally you're better off (at least for the largest models) letting them sit on storage and accessing them on demand.
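A minimal sketch of the on-demand idea in Python, using a tiny stand-in file (runtimes like llama.cpp memory-map multi-GB weight files in essentially this way, letting the OS page tensors in from storage only as they're touched):

```python
import mmap

# Create a small stand-in "weight file" so the sketch is runnable;
# in practice this would be a multi-GB file of model parameters.
with open("model.bin", "wb") as f:
    f.write(bytes(8192))

# Map the file into the address space instead of reading it into RAM.
with open("model.bin", "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Slicing only faults in the pages that are actually accessed,
# so the resident memory can stay far below the file size.
first_block = weights[:4096]
print(len(first_block))  # 4096
```

The trade-off is throughput: storage is orders of magnitude slower than VRAM, which is why datacenters keep everything resident and local users accept slower tokens/s.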

dpoloncsak · yesterday at 7:59 PM

>Apple is sitting out the AI race

Then why does my M4 run models at tokens/s that similarly priced GPUs cannot?

baggachipz · yesterday at 7:08 PM

The insistence/assumption that LLMs will consistently get better, smaller, and cheaper is so annoying. These things fundamentally require lots of data and lots of processing power. Moore's Law is dead; devices aren't getting exponentially faster anymore. RAM and SSDs are getting more expensive (thanks to this insane bubble).

4fterd4rk · yesterday at 7:07 PM

For vibe coding? Sure. For "Hey Siri, send Grandma an e-mail summarizing my schedule this afternoon"? No.

dvfjsdhgfv · yesterday at 8:15 PM

> Nope, Apple is sitting out the AI race.

That's why I use an iPhone. I don't need, and I don't want, any "AI" in my phone. The claim that people want it comes from the CEOs, marketers, and influencers of GenAI companies, not from users.

wat10000 · yesterday at 7:17 PM

Streaming video is almost exclusively pull. The only data you're sending up to the server is what you're watching, when you seek, pause, etc.

Useful LLM usage involves pushing a lot of private data into them. There's a pretty big difference between sending up some metadata about your viewing of an MKBHD video and asking an LLM to read a text message about your STD test results to decide whether it merits a priority notification. A lot of people will not be comfortable sending the latter off to The Cloud.