I tried my "Generate an SVG of a pelican riding a bicycle" prompt against Gemma 3n, the 7.5GB build from Ollama and the 15GB one for mlx-vlm, and got pleasingly different results from the two quantization sizes: https://simonwillison.net/2025/Jun/26/gemma-3n/
I still don't understand the difference between Gemma and Gemini for on-device use, since neither needs network access. From https://developer.android.com/ai/gemini-nano:
"Gemini Nano allows you to deliver rich generative AI experiences without needing a network connection or sending data to the cloud." -- replace Gemini with Gemma and the sentence is still valid.
I'm not a fan of this anarchic naming convention that OpenAI has apparently made standard across the industry.
Made some GGUFs if anyone wants to run them!
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E2B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
I'm also working on an inference + finetuning Colab demo! I'm very impressed since Gemma 3N has audio, text and vision! https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...
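If you'd rather call these GGUFs from code than from llama-cli, here's a minimal sketch using llama.cpp's OpenAI-compatible server; it assumes your build's llama-server accepts the same -hf syntax as llama-cli and is listening on the default port 8080:

  # Minimal sketch: serve the GGUF with llama-server, then query it over the
  # OpenAI-compatible API. Assumed launch command (mirrors the llama-cli ones above):
  #   ./llama.cpp/llama-server -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --port 8080
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

  resp = client.chat.completions.create(
      model="gemma-3n-E4B-it",  # llama-server largely ignores this field
      messages=[{"role": "user", "content": "Generate an SVG of a pelican riding a bicycle."}],
      temperature=0.0,
  )
  print(resp.choices[0].message.content)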
I'd genuinely like to know how these small models are useful for anyone. I've done a lot of experimenting, and anything smaller than 27B is basically unusable, except as a toy. All I can say for smaller models is that they sometimes produce good answers, which is not enough for anything except monkeying around.
I solved my spam problem with gemma3:27b-it-qat, and my benchmarks show that this is the size at which the current models start becoming useful.
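For anyone wondering what that looks like in practice, here's a minimal sketch of a spam filter built on a local Ollama model; the prompt, parsing, and endpoint defaults are illustrative, not the parent's actual pipeline:

  # Minimal sketch: binary spam classification against a local Ollama model.
  # Assumes `ollama pull gemma3:27b-it-qat` has been run and the server is on
  # its default port; the prompt and parsing are illustrative only.
  import requests

  def is_spam(message: str) -> bool:
      resp = requests.post(
          "http://localhost:11434/api/generate",
          json={
              "model": "gemma3:27b-it-qat",
              "prompt": "Classify the following message as SPAM or HAM. "
                        "Reply with exactly one word.\n\n" + message,
              "stream": False,
              "options": {"temperature": 0},
          },
          timeout=120,
      )
      return resp.json()["response"].strip().upper().startswith("SPAM")

  print(is_spam("You have won a free iPhone, click here to claim!"))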
Kevin Kwok did a great job taking it apart: https://github.com/antimatter15/reverse-engineering-gemma-3n
LM Studio has MLX variants of the model out: http://huggingface.co/lmstudio-community/gemma-3n-E4B-it-MLX...
However, it's still 8B parameters and there are no quantized models just yet.
Anyone know how much it costs to use the deployed version of Gemma 3n? The docs indicate you can use the Gemini API for deployed Gemma 3n, but the pricing page just shows "unavailable".
I read the general parts and skimmed the inner workings, but I can't figure out what the high-level news is. What does this concretely do that Gemma didn't already do, or what benchmarks/tasks did it improve upon?
Before it gets into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions a new thing is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than other models up to the ones that use 10GB+ of RAM, if I understand that right?
We need tabular data somewhere on Google that lists each product's name alongside a description of what it does.
What are some use cases for these small local models for individuals? For programming-related work the proprietary models seem significantly better, and that's all I really use LLMs for personally.
Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.
Suppose I'd like to use models like this one to perform web searches. Is there anything available in the open-source world that would let me do that without much tinkering needed?
I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.
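For what it's worth, the gist of that kind of search grounding is small; here's a minimal sketch, assuming the duckduckgo_search package for retrieval and a local Ollama server running a Gemma 3n tag (both are my choices here, not anything Google or Ollama ship for this):

  # Minimal sketch: ground a local model's answer in web search snippets by
  # stuffing them into the prompt. duckduckgo_search and the gemma3n:e4b tag
  # are assumptions; adjust to whatever search API and model tag you use.
  import requests
  from duckduckgo_search import DDGS

  def grounded_answer(question: str) -> str:
      hits = DDGS().text(question, max_results=5)
      context = "\n".join(f"- {h['title']}: {h['body']}" for h in hits)
      prompt = (
          "Answer the question using only the search snippets below, "
          "and say which snippet you relied on.\n\n"
          f"Snippets:\n{context}\n\nQuestion: {question}"
      )
      resp = requests.post(
          "http://localhost:11434/api/generate",
          json={"model": "gemma3n:e4b", "prompt": prompt, "stream": False},
          timeout=300,
      )
      return resp.json()["response"]

  print(grounded_answer("What did Google announce about Gemma 3n?"))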
If I wanted to run this locally at somewhat decent speeds, is an RK3588S board (like OrangePi 5) the cheapest option?
I've been playing around with E4B in AI Studio and it has been giving me really great results, much better than what you'd expect from an 8B model. In fact, I'm thinking of installing it on a VPS so I can have an alternative to pricey APIs.
Updated Ollama to use this; now neither the old nor the new models work. Much productivity.
It seems way worse than other small models, including responding with complete non sequiturs. I think my favorite small model is still the DeepSeek distill of Llama 8B.
Anyone have any idea on the viability of running this on a Pi5 16GB? I have a few fun ideas if this can handle working with images (or even video?) well.
I just tried gemma3 out and it seems to be prone to getting stuck in loops where it outputs an infinite stream of the same word.
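Repetition loops like that can often be tamed by raising the repetition penalty and capping the output length. A minimal sketch via Ollama's request options; the values are just starting points, not a documented fix for this model:

  # Minimal sketch: sampling options that usually help with word loops.
  # repeat_penalty / num_predict are standard Ollama options; the specific
  # values and the gemma3n:e4b tag are assumptions, tune them for your setup.
  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "gemma3n:e4b",
          "prompt": "Summarise the Gemma 3n announcement in three sentences.",
          "stream": False,
          "options": {
              "repeat_penalty": 1.15,  # penalise recently repeated tokens
              "temperature": 0.7,
              "num_predict": 256,      # hard cap so a loop can't run forever
          },
      },
      timeout=120,
  )
  print(resp.json()["response"])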
Is there a chance that we'll see an uncensored version of this?
Something's really screwy with on-device models from Google. I can't put my finger on what, and I think being ex-Google is screwing with my ability to evaluate it.
Cherry-picking something that's quick to evaluate:
"High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences."
You can download an APK from the official Google project for this, linked from the blogpost: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
If I download it and run it on a Pixel Fold, using the actual 2B model, which is half the size of the ones the 60 fps claim is made for, it takes 6.2-7.5 seconds to begin responding (3 samples, 3 different photos). Generation speed is shown at 4-5 tokens per second, slightly slower than what llama.cpp does on my phone. (I maintain an AI app that, inter alia, wraps llama.cpp on all platforms.)
So roughly one frame every ~6.2 seconds, i.e. *0.16* frames a second, not 60 fps.
The blog post is jammed up with so many claims about how special this is for on-device use and performance that just... seemingly aren't true. At all.
- Are they missing a demo APK?
- Was there some massive TPU leap since the Pixel Fold release?
- Is there a lot of BS in there that they're pretty sure won't be called out in a systematic way, given the amount of effort it takes to get this inferencing?
- I used to work on Pixel, and I remember thinking that it seemed like there weren't actually public APIs for the TPU. Is that what's going on?
In any case, either:
A) I'm missing something big, or
B) they are lying, repeatedly and big time, in a way that would be exposed near-immediately once you actually tried building on the claim that it "enables real-time, on-device video analysis and interactive experiences."
Everything I've seen the last year or two indicates they are lying, big time, regularly.
But if that's the case:
- How are they getting away with it, over this length of time?
- How come I never see anyone else mention these gaps?
This looks amazing given the parameter sizes and capabilities (audio, visual, text). I like the idea of keeping simple tasks local. I’ll be curious to see if this can be run on an M1 machine…
Can popular sci-fi go 30 seconds without some lame wad naming themselves or a product after it?
I made a simple website[0] to quickly check models' MMLU online (it runs a subset), and Gemma 3n consistently loses to LLaMA 3.3 (~61% vs ~66%), and definitely loses to LLaMA 4 Scout (~86%). I suspect that means its rating on the LMArena Leaderboard is just some form of gaming the metric.
What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if that means it's a better talker.
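If anyone wants to spot-check numbers like these themselves, a quick MMLU subset run is only a few lines. A sketch using the cais/mmlu dataset from Hugging Face and a local Ollama endpoint; the sample size, prompt format, and answer parsing are my choices, not necessarily how the site above scores:

  # Minimal sketch: score a random MMLU subset against a local model.
  # Dataset fields (question/choices/answer) come from cais/mmlu; the prompt,
  # sample size, and letter parsing are illustrative.
  import random, requests
  from datasets import load_dataset

  ds = load_dataset("cais/mmlu", "all", split="test")
  letters = "ABCD"
  correct = 0
  sample = random.sample(range(len(ds)), 100)

  for i in sample:
      row = ds[i]
      choices = "\n".join(f"{letters[j]}. {c}" for j, c in enumerate(row["choices"]))
      prompt = f"{row['question']}\n{choices}\nAnswer with a single letter."
      answer = requests.post(
          "http://localhost:11434/api/generate",
          json={"model": "gemma3n:e4b", "prompt": prompt, "stream": False,
                "options": {"temperature": 0}},
          timeout=120,
      ).json()["response"].strip().upper()
      if answer[:1] == letters[row["answer"]]:
          correct += 1

  print(f"Accuracy on {len(sample)} sampled questions: {correct / len(sample):.0%}")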
> for everything from safeguarding
Maybe you could install it on YouTube, where my 78-year-old mother received a spammy advert this morning from a scam app pretending to be an iOS notification.
Kinda sick of companies spending untold billions on this while their core product remains a pile of user-hostile shite. :-)
Imagine if the entire internet were just an on-the-fly UI; that would be pretty cool.
My post politely describing how this blog post does not match Google's own app running inference on a Pixel is downvoted to -1, below dead posts with one-off short jokes.
I am posting again because I've been here 16 years now, what happened is very suspicious, and given the replies to it, we now know this blog post is false.
There is no open model that you can download today and run at even 1% of the claims in the blog post.
You can read a reply from someone indicating they have inside knowledge of this, who notes it won't work as advertised unless you're Google (i.e. internally they have it binding to a privileged system process that can access the Tensor core, which isn't available to third parties; anyone else gets 1/100th of the speeds in the post).
This post promises $150K in prizes for on-device multimodal apps and tells you it runs at up to 60 fps. They know it runs at 0.1 fps, engineering says that's because they haven't prioritized third parties yet, and somehow Google is getting away with this.
This model is fully compatible with anything previously done with Gemma 3. I just passed it to one of my VLM fine-tuning scripts and it started without issues (Hugging Face Transformers code). On a single GPU with LoRA, the E4B model takes 18GB of VRAM at batch size 1, where Gemma 3 4B was 21GB. Nice one from DeepMind; the Gemma 3 family tops the open-weights VLMs.
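For anyone curious what that setup roughly looks like, here's a sketch of a single-GPU LoRA configuration for the E4B checkpoint. The model id comes from the Hugging Face release, but the auto classes, LoRA rank, and target modules are my guesses at a typical setup, not the parent's actual script:

  # Rough shape of a LoRA fine-tuning setup for Gemma 3n E4B on one GPU.
  # Assumes a recent transformers release that loads Gemma 3n through the
  # image-text-to-text auto classes, plus peft; hyperparameters are illustrative.
  import torch
  from transformers import AutoModelForImageTextToText, AutoProcessor
  from peft import LoraConfig, get_peft_model

  model_id = "google/gemma-3n-E4B-it"
  processor = AutoProcessor.from_pretrained(model_id)
  model = AutoModelForImageTextToText.from_pretrained(
      model_id, torch_dtype=torch.bfloat16, device_map="auto"
  )

  lora = LoraConfig(
      r=16,
      lora_alpha=32,
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()  # only the adapters train, hence the modest VRAM at batch size 1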