The idea of periodically stopping to write blocks of recent context into a fast-weight state is interesting, but I think I liked it better when E2E-TTT[1] did it. It's a more flexible and elegant continuous learning approach.
Essentially it goes "You know how your model can remember its training data? Well, what if you treated its recent context like more training data and updated (some of) the weights using (mostly) the same process used to train it?"
The end result is very good at remembering things but also really good at adapting to new unseen distributions.
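To make the "treat recent context like more training data" idea concrete, here is a toy sketch. It is not the E2E-TTT method, just an assumed minimal setup: a linear model whose slow weights stay frozen while a separate fast-weight vector gets a few SGD steps on a block of recent context. The names `predict`, `write_block`, `w_slow`, and `w_fast` are made up for illustration.

```python
def predict(w_slow, w_fast, x):
    """Linear prediction using frozen slow weights plus adaptable fast weights."""
    return sum((s + f) * xi for s, f, xi in zip(w_slow, w_fast, x))


def write_block(w_slow, w_fast, block, lr=0.1, epochs=200):
    """Run SGD on squared error over one block of (x, y) pairs.

    Only the fast weights are updated; the slow weights are never touched.
    """
    w_fast = list(w_fast)
    for _ in range(epochs):
        for x, y in block:
            err = predict(w_slow, w_fast, x) - y
            for i, xi in enumerate(x):
                w_fast[i] -= lr * err * xi
    return w_fast


# "Pretrained" slow weights encode y = 1*x0 + 0*x1.
w_slow = [1.0, 0.0]
w_fast = [0.0, 0.0]

# Recent context follows a shifted rule, y = 2*x0 + 1*x1.
recent_context = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]

w_fast = write_block(w_slow, w_fast, recent_context)
print(predict(w_slow, w_fast, [2.0, 1.0]))
```

Before the update the model predicts 2.0 for the input `[2.0, 1.0]`; after writing the block into the fast weights it should land close to 5.0, which is what the shifted rule gives. A real system would do this on (some of) a transformer's weights with its actual training loss, which this toy does not attempt.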
This topic recently came up at the FLANN workshop [1], and seems to periodically be rediscovered [2,3,4] in different contexts. While some have speculated about the biological role it plays (e.g., Pearlmutter & Houghton [5]), we still lack a conclusive theory of sleep, but the convergent evolution of this specific phenomenon across the animal kingdom and the fact that deprivation is inevitably fatal seem like important clues.
[1]: https://flann.cs.yale.edu
[2]: https://www.cs.toronto.edu/~hinton/csc2535/readings/ws.pdf
[3]: https://arxiv.org/abs/1711.02282
[4]: https://arxiv.org/abs/2006.08381
[5]: https://mural.maynoothuniversity.ie/id/eprint/1653/1/Hamilto...
The "sleep" thing gives me the creeps so in my head I'm just going to think of it as the difference between "response time retrieval" and "background consolidation".
I do think it points at something bigger than just attention architecture: "memory" isn't just storage, and merely longer context isn't the same thing as having a better understanding of the source data.
I'm looking at this through the "personal AI" lens, where the missing "memory" layer seems to be consolidation & prioritization. It's not enough to pattern match, grab the right emails, notes, etc., stuff them into the context window & hope. It's more useful to process offline and turn events into durable state: clusters of observed data become episodes, assumptions, and contradictions, and they feed the confidence behind suggestions.
That also pushes up the need for provenance & inspectability. It's going to be interesting to see what kind of memory consolidation strategies are required for each domain use case.
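A rough sketch of what that offline consolidation could look like, under heavy assumptions: events are already tagged with a topic and a claim, "confidence" is just the share of sources backing a claim, and every name here (`Event`, `Episode`, `consolidate`) is hypothetical. The point is only that each piece of durable state keeps its source ids, so it stays inspectable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source_id: str  # provenance, e.g. an email or note identifier
    topic: str
    claim: str


@dataclass
class Episode:
    topic: str
    claims: dict  # claim text -> list of source ids supporting it

    @property
    def contradictory(self):
        """True when sources disagree about this topic."""
        return len(self.claims) > 1

    def confidence(self, claim):
        """Fraction of this episode's sources that support the claim."""
        total = sum(len(sources) for sources in self.claims.values())
        return len(self.claims[claim]) / total


def consolidate(events):
    """Group raw events into one episode per topic, keeping provenance."""
    grouped = defaultdict(lambda: defaultdict(list))
    for event in events:
        grouped[event.topic][event.claim].append(event.source_id)
    return {
        topic: Episode(topic, {c: list(s) for c, s in claims.items()})
        for topic, claims in grouped.items()
    }


events = [
    Event("email:1", "meeting", "Tuesday"),
    Event("note:7", "meeting", "Tuesday"),
    Event("email:9", "meeting", "Wednesday"),
    Event("note:2", "flight", "booked"),
]
episodes = consolidate(events)
meeting = episodes["meeting"]
print(meeting.contradictory, meeting.claims["Tuesday"])
```

Real clustering and confidence scoring would be domain specific; the part worth keeping from this sketch is that a suggestion can always be traced back to the sources behind it.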
Related preprint from the Letta team: https://arxiv.org/abs/2504.13171
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
Would be a big deal if you don't have to care about quadratic attention cost. Some workflows become a lot cheaper.
This could be a solution in search of a problem. I would be careful with overfitting.
That's an idea I had a few months ago: after going through a compaction once the KV cache is nearing capacity, accumulate this knowledge into a dataset to fine-tune a LoRA during offline hours.
This would create a three-layer memory system:
- Stable long-term memory (initial base weights)
- Mid-term memory built from the compactions and replay buffers
- Short-term memory (KV cache)
Sleeping would just be a fancy term for consolidating and transferring information from one memory layer to another during offline hours. Maybe that's also what the brain does while sleeping.
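The three layers and the transfers between them can be sketched roughly as below. This is only a bookkeeping model under stated assumptions: `summarize` is a placeholder for whatever compaction the model does, `sleep` stands in for an offline LoRA fine-tune that is not actually run here, and the class and field names are invented for illustration.

```python
from dataclasses import dataclass, field


def summarize(tokens):
    """Placeholder compaction: a real system would summarize with the model."""
    return " ".join(tokens)


@dataclass
class ThreeTierMemory:
    kv_capacity: int = 4
    base_weights: str = "frozen-base"  # long-term memory, never modified here
    kv_cache: list = field(default_factory=list)  # short-term memory
    replay_buffer: list = field(default_factory=list)  # mid-term memory
    adapter_examples: list = field(default_factory=list)  # data the adapter was tuned on

    def observe(self, token):
        """Append to short-term memory and compact once it reaches capacity."""
        self.kv_cache.append(token)
        if len(self.kv_cache) >= self.kv_capacity:
            self.compact()

    def compact(self):
        """Move a summary of the KV cache into the replay buffer."""
        self.replay_buffer.append(summarize(self.kv_cache))
        self.kv_cache = []

    def sleep(self):
        """Offline step: hand buffered compactions to adapter fine-tuning.

        The fine-tune itself is not implemented; this only records which
        examples would be used and clears the buffer.
        """
        self.adapter_examples.extend(self.replay_buffer)
        self.replay_buffer.clear()
        return len(self.adapter_examples)


memory = ThreeTierMemory(kv_capacity=3)
for token in "abcdefg":
    memory.observe(token)

consolidated = memory.sleep()
print(consolidated, memory.kv_cache, memory.adapter_examples)
```

With a capacity of 3 and seven tokens, two compactions happen during the "day", one token is left in the KV cache, and `sleep` moves both compactions into the adapter's training set.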
Kind of related
Context -> Lora would be soooo cool.
To reach more brain-like behavior, LLMs need to integrate your inputs into their model dynamically, essentially retraining in real time based on the most salient input. Human brains do this selectively all the time and it's part of our plasticity.
Biologically humans do similar compression, so introducing a similar concept to an LLM also feels reasonable. Hardware isn't fast/cheap enough to do this on an ongoing basis, similar to how it's too expensive for our brains to do this while we're moving through the world.
Most of the time, all we have now in LLMs is "working memory". We're missing a lot of the functionality that allows for episodic memory and selective plasticity.
The more you read about how human brains work, the more you realize that we may have figured out a piece with LLMs, but it's certainly nothing approaching AGI. People insisting so are blowing smoke for investor hype or don't understand a big piece of the concepts involved.
I can't pretend to understand how LLMs work, but I can be sure that anthropomorphizing their functions is not helpful to an objective debate over their abilities.
Does a motor vehicle get "sleep" when it is serviced? When I reboot a computer, is that equivalent to a nap?