The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?
Because the LLM is more or less hardcoded to just pass "create image" style prompts to a separate model, possibly with some embellishment.
You don’t know what you don’t know
Part of the problem is that it isn't the LLM making the image itself; it's the LLM repeatedly prompting edits to a separate edit diffusion model. The Gemini reasoning summary shows part of this. The style of some of the images also makes it clear that it uses an Imagen 4-derived diffusion model underneath.
Because image models, at the most basic level, are just text tokens in, image tokens out. You'd need an agentic process on top to come up with a strategy, review the output, try again, and so on.
I believe Nano Banana and gpt-image-2 have a little of this going on, but it's like asking a model to one-shot some code vs. having an agentic harness with tools do it. Even the most basic agent can produce better code than ChatGPT can.
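The strategy/review/retry loop described above can be sketched in a few lines. This is a hypothetical illustration, not how any named product actually works: `generate_image` and `critique` are stand-ins for a diffusion-model call and a vision-LLM review step.

```python
MAX_ATTEMPTS = 3  # hypothetical retry budget

def generate_image(prompt):
    # Stand-in for a text-to-image call (e.g. a diffusion model endpoint).
    return f"<image for: {prompt}>"

def critique(image, goal):
    # Stand-in for a vision-LLM review: returns (good_enough, revised_prompt).
    # Here we just check the goal appears in the stub output, for illustration.
    return (goal in image, f"{goal}, higher detail")

def agentic_generate(goal):
    # The "agentic process on top": generate, review, revise the prompt, retry.
    prompt = goal
    for _ in range(MAX_ATTEMPTS):
        image = generate_image(prompt)
        ok, prompt = critique(image, goal)
        if ok:
            return image
    return image  # best effort after exhausting retries
```

The point is the outer loop, not the stubs: a bare image model only does the `generate_image` step, while the harness supplies the judgment and iteration.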