There's a lot of Not Even Wrong, in the Pauli sense, going on here, presumably because back-of-napkin-with-LLM is like rocket fuel, and I love it. :) But the LLM got ahead of understanding the basics. I could probably write 900 words; let's pull out one thread as an example:
> I guess maybe with very heavy quantizing to like 4 bit that could beat sufficiently-artifact-free video coding for then streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15360 token/s at (streaming) prefill stage?
The 6-7 s I am seeing is what it costs to run the image model on a single image, even on the GPU of an M4 Max with 64 GB of GPU RAM. This repros with both my llama.cpp wrapper and the llama.cpp demo of it.
Simply getting the tokens out of the image is what takes that long.
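If it helps anyone repro, this is roughly the shape of the timing harness I'd use (a sketch, not my actual wrapper: the llama-mtmd-cli binary name and flags are from memory of llama.cpp's multimodal tooling and may differ by version, and the model/mmproj/image paths are placeholders):

```python
# Sketch of a per-frame timing check (assumptions: llama.cpp built with its
# multimodal CLI; binary name and flags from memory, may differ by version;
# file paths are placeholders).
import subprocess
import time

CMD = [
    "./llama-mtmd-cli",
    "-m", "model.gguf",            # placeholder: language model weights
    "--mmproj", "mmproj.gguf",     # placeholder: vision projector weights
    "--image", "frame.jpg",        # one video frame
    "-p", "Describe this image.",
    "-n", "1",                     # generate almost nothing; we only care about the image step
]

start = time.perf_counter()
subprocess.run(CMD, check=True, capture_output=True)
elapsed = time.perf_counter() - start

# On the M4 Max setup above this lands around 6-7 s per frame, dominated by
# turning the image into tokens rather than by text generation.
print(f"one frame, end to end: {elapsed:.2f} s")
```

One caveat on the sketch: a one-shot CLI call also counts process startup and model load, so for a steady-state per-frame number you'd time repeated frames within a single session.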
Given that reality, we can of course ignore it. We could assume the image model does run on a Pixel at 60 fps and there's just no demo APK available, or we could say it's all not noteworthy because, as the Google employee points out, they can do it inside Google and external support just hasn't been prioritized.
The problem is that the blog post announces this runs on-device at up to 60 fps today, and offers $150K in prizes for work built on that premise. We have zero evidence of this externally, the most plausible demo of it that Google has released externally runs at 1/500th of that speed, and one likely Google employee is saying "yup, it doesn't, we haven't prioritized external users!" The best steelman we can come up with is "well, if the image model eventually does run at 60 fps, we could stream it to an LLM in the cloud with about 4 seconds of initiate + prefill latency!"
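For concreteness, the back-of-envelope behind those numbers (a sketch: the 6-7 s is my measurement above, and tokens-per-frame is inferred from the 15360 token/s figure in the quoted comment):

```python
# Back-of-envelope for the gap between the 60 fps claim and what I can measure.
TARGET_FPS = 60
STREAM_RATE = 15360                          # tokens/s, from the quoted comment
TOKENS_PER_FRAME = STREAM_RATE / TARGET_FPS  # -> 256 vision tokens per frame

frame_budget = 1 / TARGET_FPS                # ~16.7 ms available per frame at 60 fps
measured = 6.5                               # s per frame observed on the M4 Max (6-7 s)

gap = measured / frame_budget                # how far short of real time this is
print(f"{TOKENS_PER_FRAME:.0f} tokens/frame, budget {frame_budget*1000:.1f} ms, "
      f"measured {measured:.1f} s -> ~{gap:.0f}x short of 60 fps")
# 256 tokens/frame, budget 16.7 ms, measured 6.5 s -> ~390x short of 60 fps
# i.e. the same order of magnitude as the "1/500th of that speed" above.
```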
That's bad.