easygenes today at 2:11 AM (6 replies)

For those who would like to know the total and active parameter count of this model: even though Google doesn't disclose the model technicals, we can infer them within relatively tight margins based on what we do know.

We know they serve the model on TPU 8i, for which we have plenty of hard specs, so we know the key constraints: total memory, bandwidth, and compute FLOPs. We can also set a ceiling on the model's compute complexity and memory demand, since we know they will be at least as efficient as what is disclosed in the DeepSeek V4 Technical Report.

We can also assume that the model was explicitly built to run efficiently in a RadixAttention-style batched serving scenario on a single TPU 8i, so no tensor parallelism or similar, which avoids unnecessary overhead. Google explicitly designed the 8th-generation inference architecture to eliminate the need for tensor sharding on mid-sized models.

We know Google intends to serve this model at a floor speed of around 280 tok/s too.

Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.
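One piece of that reasoning can be sketched as a roofline-style bound: at a 280 tok/s floor, each decode step has about 3.57 ms, and everything streamed from memory in that step (weights plus KV cache) has to fit within bandwidth times step time. The sketch below is only an illustration. The bandwidth value is a hypothetical placeholder, not a TPU 8i spec, and `weight_share` is an assumed knob. In batched MoE serving a step touches the union of experts selected across the batch, so the parameters streamed per step fall somewhere between the active and total counts.

```python
# Roofline-style bound on bytes moved per decode step.
# Only the 280 tok/s floor comes from the thread; the bandwidth used
# below is a HYPOTHETICAL placeholder, not a published TPU 8i spec.

def step_budget_s(floor_tok_per_s: float) -> float:
    """Wall-clock time available for one decode step at the floor speed."""
    return 1.0 / floor_tok_per_s

def max_bytes_per_step(bandwidth_bytes_per_s: float, floor_tok_per_s: float) -> float:
    """Upper bound on bytes (weights plus KV cache) streamed in one step."""
    return bandwidth_bytes_per_s * step_budget_s(floor_tok_per_s)

def max_params_streamed(bandwidth_bytes_per_s: float, floor_tok_per_s: float,
                        bytes_per_param: float, weight_share: float) -> float:
    """Parameters readable per step if `weight_share` of the traffic is weights."""
    budget = max_bytes_per_step(bandwidth_bytes_per_s, floor_tok_per_s)
    return budget * weight_share / bytes_per_param

PLACEHOLDER_BW = 5e12  # bytes/s, hypothetical; substitute the real spec

print(f"step budget: {step_budget_s(280) * 1e3:.2f} ms")
print(f"bytes per step: {max_bytes_per_step(PLACEHOLDER_BW, 280) / 1e9:.1f} GB")
# 0.5 bytes/param corresponds to FP4; the 50% weight share is assumed.
print(f"params per step: {max_params_streamed(PLACEHOLDER_BW, 280, 0.5, 0.5) / 1e9:.1f}B")
```

Swapping in the real bandwidth figure and a measured weight-versus-KV traffic split would tighten the bound; the structure of the calculation stays the same.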

Visual:

  ┌────────────────────────────────────────────────────────┐
  │                   TPU 8i VRAM (288 GB)                 │
  ├───────────────────────────┬────────────────────────────┤
  │   Static Model Weights    │  Dynamic Allocations &     │
  │   (250B - 300B @ Mixed    │  Compressed KV Caches      │
  │   FP4/FP8)                │  (RadixAttention / SRAM)   │
  │   ~110 GB - 150 GB        │  ~138 GB - 178 GB          │
  └───────────────────────────┴────────────────────────────┘

I do model serving optimization work. This is napkin math.
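The split in the diagram can be reproduced with a few lines of arithmetic. This is a minimal sketch: the 288 GB capacity and the 250B-300B range come from the comment above, while `fp8_fraction` is a hypothetical knob for how much of the model is kept in FP8 rather than FP4.

```python
# Napkin memory budget for a single accelerator.
# 288 GB and the 250B-300B range are taken from the comment above;
# fp8_fraction is a HYPOTHETICAL knob, not a disclosed figure.

GB = 1e9  # decimal gigabytes

def weight_gb(total_params: float, fp8_fraction: float) -> float:
    """Weight footprint in GB with FP4 (0.5 byte) and FP8 (1 byte) parameters."""
    bytes_per_param = 0.5 * (1.0 - fp8_fraction) + 1.0 * fp8_fraction
    return total_params * bytes_per_param / GB

def remaining_gb(capacity_gb: float, total_params: float, fp8_fraction: float) -> float:
    """Memory left for KV caches and other dynamic allocations."""
    return capacity_gb - weight_gb(total_params, fp8_fraction)

for params in (250e9, 300e9):
    for fp8 in (0.0, 0.1):
        w = weight_gb(params, fp8)
        left = remaining_gb(288.0, params, fp8)
        print(f"{params / 1e9:.0f}B, {fp8:.0%} FP8: weights {w:.1f} GB, left {left:.1f} GB")
```

Under this simple model, pure FP4 gives 125 GB of weights at 250B and 150 GB at 300B, leaving 163 GB and 138 GB respectively.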

Edit: There's one factor I underrated in my initial estimate: TurboQuant. It trades compute for KV memory use. It's plausible that with TurboQuant at a quality-neutral setting they've gotten the model up to 400B with similar economics. This is a variable affecting concurrency, and they likely decided total model size based on what they see for the average user's average KV cache depth in real-world usage.
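The concurrency tradeoff in the edit can be sketched roughly as follows. Every number passed in is a hypothetical placeholder (per-token KV size, average depth, compression ratio); none is a disclosed figure, and `compression_ratio` stands in for whatever KV compression is applied without modeling TurboQuant itself.

```python
# How many sequences fit in the KV budget at a given average cache depth.
# All inputs are HYPOTHETICAL placeholders for illustration only.

def concurrent_sequences(kv_budget_gb: float, avg_depth_tokens: float,
                         kv_bytes_per_token: float,
                         compression_ratio: float = 1.0) -> int:
    """Sequences that fit when each holds avg_depth_tokens of KV cache."""
    per_sequence_bytes = avg_depth_tokens * kv_bytes_per_token / compression_ratio
    return int(kv_budget_gb * 1e9 // per_sequence_bytes)

# Larger weights shrink the KV budget; stronger compression stretches it.
for kv_budget in (138.0, 88.0):   # GB left after weights (assumed values)
    for ratio in (1.0, 4.0):      # assumed KV compression ratios
        n = concurrent_sequences(kv_budget, 10_000, 10_000, ratio)
        print(f"budget {kv_budget:.0f} GB, compression {ratio:.0f}x: {n} sequences")
```

The point of the sketch is the shape of the tradeoff: for a fixed concurrency target, a higher compression ratio frees memory that could go to a larger model instead.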


Replies

gertlabs today at 3:24 AM

We've been really impressed with the performance of ~30B parameter class models and how close they are to the frontier from ~6-12 months ago, which raises the question: are the frontier labs really serving 10T parameter models? Seems unlikely.

If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

Gemini 3.5 Flash is really smart in one-shot coding reasoning, btw. Near the frontier. But it doesn't do so well in long horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).

Data at https://gertlabs.com/rankings

daemonologist today at 3:06 AM

If this is accurate it raises the question: why is this model so expensive? DeepSeek v4 Flash is 284B total/13B active, FP4/FP8 mixed, and only costs $0.14/$0.28 - even less from OpenRouter. Of course Gemini 3.5 Flash is most likely a better product, and therefore it can command a higher price from an economics perspective, but does this imply Google is taking roughly a 90% profit margin on inference? If so they're either very compute-limited or confident in the model and wanting to recoup training/fixed costs (or both).
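The margin arithmetic behind that question can be written out as a short sketch. It assumes, as the comment implicitly does, that DeepSeek's $0.14/$0.28 prices approximate the cost to serve a model of this shape; that is a proxy, not a known cost, and no Gemini price is taken from the thread.

```python
# Gross margin arithmetic with a competitor's price as a cost proxy.
# The cost-proxy assumption is HYPOTHETICAL; real serving costs are unknown.

def gross_margin(price: float, cost: float) -> float:
    """Fraction of the price that is margin."""
    return 1.0 - cost / price

def implied_price(cost: float, margin: float) -> float:
    """Price that yields the given gross margin at the given cost."""
    return cost / (1.0 - margin)

# Cost proxies from the comment (input, output), same unit as quoted there.
for label, cost in (("input", 0.14), ("output", 0.28)):
    print(f"{label}: 90% margin implies a price of about {implied_price(cost, 0.90):.2f}")
```

In other words, a 90% margin corresponds to pricing at roughly ten times the assumed cost.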

zacksiri today at 2:34 AM

Do you have similar math for the flash-lite variant of the models? I'd be curious. Based on my testing / benchmarks I think it's around the 100-120B mark.

With the Pro variant being around 600B - 800B.

My testing compares its performance / output to other models in the same size range, so not as scientific as yours.

Maven911 today at 2:49 AM

Tell me more about what your day looks like. What do you think of the LLMOps books from Abi, in case you have read them? Any other resources you can recommend?

anthonypasq96 today at 2:57 AM

Given this, is it safe to assume that inference pricing is barely related to cost to serve at this point and that there is considerable margin?

nilstenura today at 4:27 AM

[flagged]