Comparing Deep Learning with neuroscience may turn out to be erroneous. They may be orthogonal.
The brain likely has more in common with Reservoir Computing (sans the actual learning algorithm) than Deep Learning.
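For anyone unfamiliar with Reservoir Computing, here is a minimal echo-state-network sketch of the idea: the recurrent weights are random and never trained, and only a linear readout is fitted. Everything in it (the function names, the reservoir size of 50, the spectral radius of 0.9, the ridge penalty, and the sine-wave task) is an illustrative assumption, not a claim about how the brain or any particular library works.

```python
import numpy as np

def make_reservoir(n_inputs, n_units, spectral_radius=0.9, seed=0):
    """Build fixed random input and recurrent weights (never trained)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_units, n_inputs))
    w = rng.normal(size=(n_units, n_units))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def run_reservoir(w_in, w, inputs):
    """Drive the reservoir with an input sequence and collect its states."""
    state = np.zeros(w.shape[0])
    states = []
    for u in inputs:
        state = np.tanh(w_in @ u + w @ state)
        states.append(state)
    return np.array(states)

def fit_readout(states, targets, ridge=1e-4):
    """Fit the only trained part: a linear readout, via ridge regression."""
    a = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(a, states.T @ targets)

# Toy task (an assumption for illustration): predict the next sample of a sine wave.
t = np.linspace(0, 20 * np.pi, 1000)
signal = np.sin(t)
inputs = signal[:-1, None]   # shape (999, 1)
targets = signal[1:]         # shape (999,)

w_in, w = make_reservoir(n_inputs=1, n_units=50)
w_before = w.copy()
states = run_reservoir(w_in, w, inputs)
w_out = fit_readout(states, targets)

predictions = states @ w_out
mse = float(np.mean((predictions - targets) ** 2))
baseline_mse = float(np.mean(targets ** 2))  # error of an all-zero readout
```

Note that no gradient ever flows through the recurrent weights: `w` is identical before and after fitting, which is the "sans the actual learning algorithm" part.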
Deep Learning relies on end-to-end loss optimization, something which is much more powerful than anything the brain can be doing. But the end-to-end requirement is also restrictive: credit assignment is a big problem.
Consider how crazy generative diffusion models are: we generate the output in its entirety in a fixed number of steps, and the complexity of the output is irrelevant. If only we could train a model to just use Photoshop directly, but we can't.
Interestingly, there are some attempts at a middle ground where a variable number of continuous variables describe an image: <https://visual-gen.github.io/semanticist/>
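To make the fixed-step point concrete, here is a toy iterative-refinement loop. It is not a real diffusion sampler: `toy_denoiser` is a hypothetical stand-in that is handed the target directly, whereas a real model would predict the update from a learned network and a prompt. The only thing it is meant to show is that the step budget is set up front and does not depend on what is being generated.

```python
import random

NUM_STEPS = 50  # fixed sampling budget, chosen before we know the output

def toy_denoiser(sample, target, step, num_steps):
    """Hypothetical stand-in for a learned denoiser: nudge values toward the target."""
    remaining = num_steps - step
    return [x + (t - x) / remaining for x, t in zip(sample, target)]

def generate(target, num_steps=NUM_STEPS, seed=0):
    """Start from Gaussian noise and refine for exactly num_steps iterations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    calls = 0
    for step in range(num_steps):
        x = toy_denoiser(x, target, step, num_steps)
        calls += 1
    return x, calls

# A flat "image" and a high-frequency one, both 16 values long.
simple_target = [0.5] * 16
detailed_target = [float(i % 2) for i in range(16)]

simple_out, simple_calls = generate(simple_target)
detailed_out, detailed_calls = generate(detailed_target)
```

Both calls run the denoiser the same number of times, however simple or busy the target is. A tool-using approach would instead spend more actions on the more complicated picture.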
Modern systems like Nano Banana 2 and ChatGPT Images 2.0 are very close to "just use Photoshop directly" in concept, if not in execution.
They seem to use an agentic LLM with image inputs and outputs to produce, verify, refine and compose visual artifacts. Those operations appear to be learned functions, however, not an external tool like Photoshop.
This allows for "variable depth" in practice. Composition uses previous images, which may have been generated from scratch, or from previous images.
If you think a 2-year-old is doing deep learning, you're probably wrong. But if you think natural selection was providing end-to-end loss optimization, you might be closer to right. An _awful lot_ of our brain structure and connectivity is innate rather than learned, and that goes for mice and men.
> If only we could train a model to just use Photoshop directly, but we can't.
It is probably coming. I get the impression, just from following the trend of the progress, that internal world models are the hardest part. I was playing with Gemma 4 and it had a remarkable amount of trouble with the idea of going from its house to another house, collecting something and returning, in a scenario that started part-way through with it already at house #2. It figured it out, but it seemed to be working so hard on the concept that it was a bit comical.
It looks like that issue is solving itself as text and image models start to unify and get more video-based data that makes the object-oriented nature of physical reality obvious. Understanding spatial layouts seems like it might be a prerequisite to being able to consistently set up a scene in Photoshop. It is a bit weird that pulling an image fully formed from the aether seems to be statistically easier than putting it together piece by piece.
> If only we could train a model to just use Photoshop directly, but we can't.
What kind of sadist would wish this on an intelligent entity?
> If only we could train a model to just use Photoshop directly, but we can't.
LLMs are obviously more general purpose, but they can also be used to drive external graphics programs. A relatively popular example is Blender MCP [1], which lets an LLM control Blender to build and scaffold out 3D models.
[1] - https://github.com/ahujasid/blender-mcp