Is it me, or does the article sound like LLM output?
The pattern "It's not mere X — it's Y", occurs like 4 times in the text :v
I made this offline pocket vibe coder using Gemma 4 on an iPhone (it works offline once the model is downloaded). It can technically run the 4B model, but it defaults to 2B because of memory constraints.
https://github.com/blixt/pucky
It writes a single TypeScript file (I tried multiple files, but embedded Gemma 4 just isn't smart enough) and compiles the code with oxc.
You'll need to build it yourself in Xcode, because this probably wouldn't survive App Store review. Once you run it, there are two starting points included (React Native and Three.js). The UX is a bit obscure: edge-swipe left/right to switch between views.
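Not pucky's actual code, just a minimal Swift sketch of the generate-and-compile loop described above; `generateFile` and `compileWithOxc` are invented placeholders standing in for the on-device model call and the bundled compiler:

```swift
// Hypothetical sketch of a single-file generate/compile loop.
// generateFile(_:) and compileWithOxc(_:) are invented stand-ins,
// not pucky's real API.
func vibeCode(prompt: String) async throws -> String {
    var source = try await generateFile(prompt)        // ask Gemma for one TS file
    for _ in 0..<3 {                                   // a few self-repair rounds
        let errors = compileWithOxc(source)            // fast syntax check
        if errors.isEmpty { return source }
        // Feed compiler diagnostics back so the model can fix its own output.
        source = try await generateFile(
            prompt + "\nFix these compile errors:\n" + errors.joined(separator: "\n"))
    }
    return source
}

// Stubs so the sketch compiles; real versions would call the local
// model and the bundled oxc toolchain.
func generateFile(_ prompt: String) async throws -> String { "" }
func compileWithOxc(_ source: String) -> [String] { [] }
```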
Unfortunately, Apple appears to be blocking the use of these LLMs within apps on the App Store. I've been trying to ship an app that bundles local LLMs and have hit a brick wall with guideline 2.5.2.
Related: Gemma 4 on iPhone (254 comments) - https://news.ycombinator.com/item?id=47652561
Strangely, it is super fast on my 16 Plus, but with longer messages it can slow down a LOT, and not because of thermal throttling. I wish I could see some diagnostic data.
Careful with using these small models. The other day I asked it "Can dogs eat avocado?" and the answer was an emphatic yes.
This is not meant as a criticism, but people should be aware of these models' limitations.
I'm pretty excited about the Edge Gallery iOS app with Gemma 4 on it, but it seems like they hobbled it: no access to intents, and you have to write custom plugins for web search, etc. Does anyone have a favorite way to run these usefully? ChatMCP works pretty well but only supports models via API.
I just installed Google AI Edge Gallery on my iPhone 16 Pro. Here are the results of the first benchmark on GPU (prefill tokens = 256, decode tokens = 256, 3 runs): prefill speed 231 t/s, decode speed 16 t/s, time to first token 1.16 s, first init time 20 s. (256 prefill tokens at 231 t/s is ~1.1 s, which lines up with the measured TTFT.)
Offline or not, I'm sure Google uploads every keystroke, phone orientation, photo, WiFi endpoints and your shoe size when you interact with it. To enhance your experience.
Gemma 4 is still power-hungry, since it tends to activate pretty much every weight.
qwen3-coder-next uses a lot less, since it seems to activate only ~3B parameters at a time — far fewer weights read per token, and memory traffic is where much of the power goes.
My guess is that this is still closer to a tech demo, and a lot of performance is being left on the table.
I feel like the UX and API design here are very underexplored.
What are the possibilities of an Android or iOS device where the OS is centered around a locally running LLM with an API for accessing it from apps, along with tools the LLM can call to access data from locally running apps? What’s the equivalent of the original Mac OS?
Do apps disappear and there’s just a running dialog with the LLM generating graphical displays as needed on demand?
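Purely speculative, but here's a rough Swift sketch of what such an OS-level API could look like; every type and name below is invented for illustration, nothing like it exists today:

```swift
import Foundation

// Invented sketch of an OS-centered LLM API.

/// A tool a local app registers so the system model can pull its data.
struct LocalTool {
    let name: String                                      // e.g. "calendar.today"
    let description: String                               // shown to the model
    let run: ([String: String]) async throws -> String    // returns tool output
}

/// The system-provided model, handed to apps the way the OS hands out
/// location or camera sessions today.
protocol SystemLanguageModel {
    /// Stream a reply; the model may call any registered tool,
    /// so apps never ship their own weights.
    func respond(to prompt: String, tools: [LocalTool]) -> AsyncStream<String>
}
```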
It runs on Android too, via AICore or even llama.cpp.
ESET is blocking this site saying:
> Threat found. This web page may contain dangerous content that can provide remote access to an infected device, leak sensitive data from the device, or harm the targeted device. Threat: JS/Agent.RDW trojan
This is HN clickbait. No details or evidence; this is just generated for votes.
I think this should be flagged.
I really hope this is a preview of the replacement for Siri that Google is creating, because these models are fantastic for their size!
The app still doesn't render the Markdown (or LaTeX) the model outputs.
> edge AI deployment
Isn't the "edge" meant to be computing near the user, but not on their devices?
Would love to see a showdown of performance on an iPhone vs. Google's Tensor G5; in my experience the G5 is two full generations behind performance-wise.
Does anyone know of a decent low-memory or low-parameter-count multilingual model (covering as many languages as possible) that can faithfully produce a detailed IPA transcription of a word given its sentence context in some language?
I want to test a hypothesis about "uploading" neural-network knowledge to a user's brain via a reaction-speed game.
For those who would like an example of its output: I'm currently working through creating a small, free (CC0, public domain) encyclopedia (just a couple of thousand entries) of core concepts in Biology and Health Sciences, Physical Sciences, and Technology. Each entry is written entirely by Gemma 4:e4b (the 10 GB model). I believe this may be slightly larger than the model that runs locally on phones, so perhaps it's slightly better, but the output is similar. Here is an example entry:
Seems pretty good to me!
You are referring to the edge models, right? E2B and E4B, not the bigger ones (26B, 31B)...
Do you know of a way of running these models on Android? Also, what does the thermal throttling look like?
Is there a comparison of it running on iPhone vs. Android phones?
Is the output coherent, though? I have yet to see a local model running on consumer-grade hardware that is actually useful.
Can we please ban content that is CLEARLY written by AI?
I don't see the value in this post. Are Hacker News posts being upvoted by bots?
I noticed the inference is routed through the GPU rather than the Apple Neural Engine. IIRC, Google's engineers likely gave up on trying to compile custom attention kernels for Apple's proprietary tensor blocks. While Metal is predictable and easy to port to, it drains the battery far faster than a dedicated NPU would. Until they rewrite the backend for the ANE, this is just a flashy tech demo rather than a production-ready tool.
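For reference, as far as I know the only public path to the ANE is Core ML's compute-unit hints; you can't hand it custom Metal-style kernels directly. A minimal Swift sketch (the model path is a placeholder):

```swift
import CoreML

// Sketch: request the ANE via Core ML's compute-unit hint.
// "model.mlmodelc" is a placeholder path. Ops the ANE can't execute
// fall back silently to the CPU, which is why custom attention
// kernels are hard to guarantee a home on the NPU.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine   // keep work off the GPU

let url = URL(fileURLWithPath: "model.mlmodelc")
let model = try MLModel(contentsOf: url, configuration: config)
```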