Hacker News

xigoi today at 6:30 AM

The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?


Replies

petercooper today at 11:21 AM

Because image models, at the basic level, are just text tokens in, image tokens out. You'd need an agentic process on top to come up with a strategy, review the output, try again, and so on.

I believe Nano Banana and gpt-image-2 have a little of this going on, but it's like asking a model to one-shot some code vs having an agentic harness with tools do it. Even the most basic agent can produce better code than ChatGPT can.
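The generate–review–retry loop described above can be sketched roughly as follows. This is a hypothetical illustration, not any vendor's actual harness: `generate_image` and `review_image` are stubs standing in for real model API calls, and the retry logic is a minimal assumption about how such a harness might work.

```python
from dataclasses import dataclass

@dataclass
class Review:
    ok: bool
    feedback: str

def generate_image(prompt: str) -> str:
    # Stub: a real harness would call an image-generation model here.
    return f"image rendered from: {prompt}"

def review_image(image: str, goal: str) -> Review:
    # Stub: a real harness would ask a vision model to critique
    # the output against the original goal.
    if "more detail" in image:
        return Review(ok=True, feedback="")
    return Review(ok=False, feedback="add more detail")

def agentic_generate(goal: str, max_attempts: int = 3) -> str:
    # The loop petercooper describes: generate, review, revise, repeat.
    prompt = goal
    for _ in range(max_attempts):
        image = generate_image(prompt)
        review = review_image(image, goal)
        if review.ok:
            return image
        # Fold the critique back into the prompt and try again.
        prompt = f"{prompt}, {review.feedback}"
    return image

print(agentic_generate("a cat"))
# The second attempt passes review, because the critique
# from the first attempt was folded into the prompt.
```

A bare text-to-image call is just the inner `generate_image` step; the surrounding loop is what the comment means by "an agentic process on top."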

Sharlin today at 10:02 AM

Because the LLM is more or less hardcoded to just pass "create image" style prompts to a separate model, possibly with some embellishment.

nine_k today at 6:32 AM

Nobody asked it to!

pyrolistical today at 7:32 AM

You don’t know what you don’t know

cubefox today at 6:41 AM

Part of the problem is that it isn't the LLM generating the image directly; it's the LLM repeatedly prompting edits to a separate image-editing diffusion model. The Gemini reasoning summary shows part of this. The style of some of the images also makes it clear that it uses an Imagen 4-derived diffusion model underneath.
