Hacker News

usernametaken29 · yesterday at 11:16 AM · 6 replies · view on HN

> δ-mem compresses past information into a fixed-size state matrix updated by delta-rule learning

This doesn’t solve the capacity problem of memory. You can cram more into one fixed-size state, but you still need to associate what’s stored with input queries, and that’s very hard because slight variations in input create hugely different activations. So it doesn’t really improve caching. This paper might do a thing or two approximating the compression limit for a context window, but there’s a fundamental limit on how much information can go into it. What you really need is contextual search: different events and objects that share the same abstractions and semantics should lead to the same response, so you can cache effectively… on this front the paper does little to improve “memory” in a meaningful way.


Replies

jsemrau · yesterday at 1:08 PM

I am currently working on deep context query, which uses dynamically generated regex to pull only the relevant context blocks. By using lightweight regex pattern matching to detect semantic intent and filter structured context sections accordingly, you avoid the attention degradation that comes from stuffing semantically redundant information into the window.

https://jdsemrau.substack.com/p/tokenmaxxing-and-optimizing-...
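A minimal sketch of the idea as I understand it from the description above (the markdown-style section delimiter and the keyword-to-regex heuristic here are my assumptions, not necessarily how the linked post does it):

```python
import re

def filter_context(query: str, context: str) -> str:
    """Keep only the context sections whose text matches keywords pulled from the query."""
    keywords = re.findall(r"\w{4,}", query.lower())            # crude intent signal
    pattern = re.compile("|".join(map(re.escape, keywords)), re.IGNORECASE)
    sections = re.split(r"(?m)^(?=## )", context)              # assume '## '-delimited blocks
    return "\n".join(s for s in sections if pattern.search(s))
```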

show 1 reply
vdelpuerto · yesterday at 6:08 PM

I wrote something about this, trying to approach the context/memory problem in models from the other way around. The gravitational pull of information is still very hard to manage. I've been using "functional scars" for about 30 days now and getting good results on repetitive mistakes across sessions. https://github.com/VDP89/fscars

in-silico · yesterday at 8:38 PM

While there is a limit to the amount of information you can fit in a fixed-size state, the theoretical ceiling is pretty high.

A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length), and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can encode 100M tokens worth of information.
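The arithmetic, spelled out (same numbers as above, none of them measured by me):

```python
params = 300e6          # state size, per the KV-cache comparison above
bits_per_param = 0.7    # classic Hebbian/Hopfield capacity estimate
bits_per_token = 2.1    # assumed entropy of natural-language text
print(params * bits_per_param / bits_per_token)   # -> 1e8, i.e. ~100M tokens
```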

Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.

show 2 replies
jandrese · yesterday at 1:48 PM

So instead of a FIFO approach to memory management, it continually degrades the existing data the more you put in? Details start getting lost or mangled more and more over time?
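Roughly, yes, at least for the simple associative-memory constructions. A toy numpy experiment (my own illustration, not from the paper) shows recall of an early item getting gradually noisier as more items are written into the same fixed-size state, rather than the item being evicted outright:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 64, 200
S = np.zeros((d, d))                                   # fixed-size state
keys = rng.standard_normal((n_items, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
vals = rng.standard_normal((n_items, d))

for i in range(n_items):
    S += np.outer(vals[i] - S @ keys[i], keys[i])      # delta-rule write
    if i in (0, 9, 49, 199):
        err = np.linalg.norm(S @ keys[0] - vals[0])    # recall error for the oldest item
        print(f"after {i + 1:>3} writes: item-0 error = {err:.2f}")
```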

show 1 reply
kordlessagain · yesterday at 1:12 PM

Like Ferricula: https://deepbluedynamics.com/ferricula (site/docs still in progress).