Hacker News

0xbadcafebee today at 1:44 PM

Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything".
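
A rough sketch of that delegation, in case it helps. Everything here is hypothetical: the `vision_model` and `main_model` callables, their signatures, and the prompt wording stand in for whatever your harness actually exposes, and the fake models exist only so the snippet runs on its own.

```python
from typing import Callable

# Assumed signature: (image_path, prompt) -> text description of the image.
VisionModel = Callable[[str, str], str]
# Assumed signature: (prompt) -> text answer.
TextModel = Callable[[str], str]


def describe_image(vision_model: VisionModel, image_path: str, task: str) -> str:
    """One-off sub-session: ask the vision model about the image for this task."""
    prompt = f"Describe this image, focusing on what matters for: {task}"
    return vision_model(image_path, prompt)


def answer_with_image(main_model: TextModel, vision_model: VisionModel,
                      image_path: str, task: str) -> str:
    """Feed the subagent's description back to the text-only main model."""
    description = describe_image(vision_model, image_path, task)
    prompt = f"Task: {task}\n\nImage description from vision subagent:\n{description}"
    return main_model(prompt)


# Stand-in models so the sketch is runnable without any real API.
def fake_vision_model(image_path: str, prompt: str) -> str:
    return f"[description of {image_path}]"


def fake_main_model(prompt: str) -> str:
    return f"ANSWER based on: {prompt}"
```

Calling `answer_with_image(fake_main_model, fake_vision_model, "shot.png", "fix the layout bug")` runs one vision call followed by one main-model call; the main model never sees the image itself.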


Replies

ricardobeat today at 8:00 PM

That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, so you might need multiple rounds of back and forth.
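
To make the back and forth concrete, here is a minimal sketch of what those rounds could look like. The `ASK:` prefix convention, the function signatures, and the round limit are all made up for illustration; a real harness would have its own way for the text model to request another look at the image.

```python
from typing import Callable, List

# Assumed signature: (image_path, question) -> text answer about the image.
VisionModel = Callable[[str, str], str]
# Assumed signature: (task, notes so far) -> either "ASK: <question>" or a final answer.
MainModel = Callable[[str, List[str]], str]


def multi_round(main_model: MainModel, vision_model: VisionModel,
                image_path: str, task: str, max_rounds: int = 3) -> str:
    """Let the text model query the vision model repeatedly before answering."""
    # First pass is generic, because the text model has not seen anything yet.
    notes: List[str] = [vision_model(image_path, "Describe this image.")]
    for _ in range(max_rounds):
        reply = main_model(task, notes)
        if not reply.startswith("ASK:"):
            return reply  # the text model has enough information
        question = reply[len("ASK:"):].strip()
        notes.append(vision_model(image_path, question))
    # Round budget exhausted: ask for a best-effort answer.
    return main_model(task + "\nAnswer now with what you have.", notes)


# Stand-in models so the sketch runs without any real API.
calls: List[str] = []


def fake_vision(image_path: str, question: str) -> str:
    calls.append(question)
    return f"answer to: {question}"


def fake_main(task: str, notes: List[str]) -> str:
    if len(notes) < 2:
        return "ASK: What color is the submit button?"
    return f"FINAL: used {len(notes)} notes"
```

Each extra round costs another vision call plus another main-model call, which is the overhead being pointed at here.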

WASDx today at 4:19 PM

Are you suggesting it should summarize the image in text or generate it in HTML or something else?