
D-Machine · today at 1:51 AM

I was obviously talking about conscious and unconscious processes in humans; you are attempting to transplant those concepts onto LLMs, which is not, in general, philosophically sound or coherent.

Everything you said about how data flows in these multimodal models is untrue in general (see https://huggingface.co/blog/vlms-2025), and unless you happen to work at OpenAI or another frontier AI company, you don't know for sure how they are corralling data either.

Companies will of course engage in marketing and claim that, e.g., ChatGPT is a single "model", but architecturally and in practice this is known not to be accurate: the modalities and backbones generally remain quite separate, both in architecture and in pre-training approach. You are talking at a level of abstraction that suggests an education from blog posts by non-experts. Read the papers on how these multimodal architectures are actually trained, developed, and connected, and you'll see that the multi-modality is still very limited.
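
To make that concrete, here is a rough, hypothetical sketch (in PyTorch, with made-up class names and dimensions, not any real model's code) of how most current VLMs are wired: a separately pre-trained vision encoder, a thin learned projector, and a separately pre-trained language backbone glued together after the fact, rather than one unified model trained jointly from scratch.

    import torch
    import torch.nn as nn

    class TypicalVLM(nn.Module):
        """Illustrative only: a vision tower and a language backbone that were
        pre-trained separately, joined by a small learned projection layer."""

        def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
            super().__init__()
            # Vision encoder: pre-trained elsewhere, often frozen or lightly tuned.
            self.vision_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
                num_layers=2)
            # The "glue": maps vision features into the LLM's token-embedding space.
            self.projector = nn.Linear(vision_dim, llm_dim)
            # Language backbone: pre-trained separately on text; does the reasoning.
            self.llm_embed = nn.Embedding(vocab_size, llm_dim)
            self.llm_backbone = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=llm_dim, nhead=32, batch_first=True),
                num_layers=2)
            self.lm_head = nn.Linear(llm_dim, vocab_size)

        def forward(self, image_patches, text_tokens):
            # The image becomes a handful of "soft tokens" in language space...
            vis = self.projector(self.vision_encoder(image_patches))
            txt = self.llm_embed(text_tokens)
            # ...and from here on everything is processed by the language backbone.
            h = self.llm_backbone(torch.cat([vis, txt], dim=1))
            return self.lm_head(h)

    # e.g.: TypicalVLM()(torch.randn(1, 16, 1024), torch.randint(0, 32000, (1, 8)))

The point of the sketch: the vision and language parts come from different pre-training runs, and the connection between them is a comparatively small adapter, not deep cross-modal training throughout.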

Also, and most importantly, the integration of modalities is primarily of the form:

    use (single) image annotations to improve image description, processing, and generation, i.e. "linking words to single images"
and not of the form

    use the implied spatial logic and relations from series of images and/or video to inform and improve linguistic outputs
That is, most multimodal work uses linguistic models to represent or describe images linguistically, in the hope that the linguistic parts do the bulk of the thinking and processing; there is not much work that uses the image or video representations themselves to do the thinking. You "convert away" from most modalities into language, do the work with token representations, and then maybe go back to images.
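
Purely as an illustrative sketch of that flow (hypothetical stub functions, not any real API), the typical inference loop looks like the code below: the image is "converted away" into language-space tokens once, up front, and every subsequent reasoning step is an autoregressive step over text tokens. No step operates on a visuospatial representation of the scene.

    # Hypothetical sketch of the "convert away into language" pattern.
    # The stubs stand in for real components; only the data flow matters here.

    def encode_image_to_soft_tokens(image):
        # vision encoder + projector: image -> short sequence of language-space tokens
        return ["<img_tok_%d>" % i for i in range(4)]   # stand-in

    def llm_next_token(context_tokens):
        # autoregressive LM step: all the "thinking" happens here, over tokens
        return "word%d" % len(context_tokens)            # stand-in

    def answer_question_about_image(image, question_tokens, max_new_tokens=8):
        context = encode_image_to_soft_tokens(image) + list(question_tokens)
        output = []
        for _ in range(max_new_tokens):
            nxt = llm_next_token(context)
            context.append(nxt)
            output.append(nxt)
        # Note what never happens: no intermediate step reasons over a spatial or
        # 3D representation of the scene; the image was flattened into tokens once.
        return output

    print(answer_question_about_image(object(), ["what", "is", "left", "of", "the", "cup", "?"]))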

There isn't much work on using visuospatial world models or representations to do the actual work (though there is some very cutting-edge work here, e.g. SAM 3D https://ai.meta.com/blog/sam-3d/ and V-JEPA 2 https://ai.meta.com/research/vjepa/). But precisely because that work is cutting-edge, even at the frontier AI companies, it is likely that most of the LLM behaviour you see is driven largely by what was learned from language, not from images or other modalities. So LLMs are indeed still mostly constrained by their linguistic core.