Hacker News

simonw · yesterday at 1:37 PM

Some of the LLMs that can draw (bad) pelicans on bicycles are text-input-only LLMs.

The ones that have image input do tend to do better though, which I assume is because they have better "spatial awareness" as part of having been trained on images in addition to text.

I use the term vLLMs, or vision LLMs, to describe LLMs that are multimodal for image and text input. I still don't have a great name for the ones that can also accept audio.

The pelican test requires SVG output because asking a multimodal output model like Gemini Flash Image (aka Nano Banana) to create an image is a different test entirely.
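Since the test scores the model's *text* output as drawing instructions, a minimal harness only needs to confirm the reply is parseable SVG before rendering it. A sketch of that check (the helper name and the sample markup are my own illustration, not part of the actual test):

```python
import xml.etree.ElementTree as ET

def is_well_formed_svg(text: str) -> bool:
    """Return True if text parses as XML with a root <svg> element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # SVG roots are usually namespaced, e.g. '{http://www.w3.org/2000/svg}svg',
    # so compare only the local tag name after the namespace brace.
    return root.tag.split('}')[-1] == 'svg'

# Hypothetical stand-in for a model's reply:
sample = ('<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
          '<circle cx="50" cy="50" r="20"/></svg>')
print(is_well_formed_svg(sample))  # True
print(is_well_formed_svg("here is a pelican: ..."))  # False
```

Text-only models can pass this gate because SVG is just markup; whether the parsed shapes actually resemble a pelican on a bicycle is the harder, human-judged part.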