Hacker News

NitpickLawyer today at 5:39 AM

Really cool paper and easy to follow. Lots of thoughts (in parallel, hah!) after a first read. I can see many benefits of the parallel streams w/ dynamic systems. Start thinking, fire up a tool call, adjust thinking on the fly. Or add a "clock tick" on one stream, and hope that the model learns how to output something under a time constraint. Maybe some "time passing" concept can be had "for free"? Lots and lots of directions this could go.

It also gives a lot of new levers to play with. I'd assume you could tweak (sweep?) the amount of attention given to the same stream vs. cross stream, have different streams prompted / seeded with an objective, score each independently vs. together, etc. A bit reminiscent of the direction oAI took w/ their harmony template, where they define channels and the model learns to output to each channel (but that's sequential).

Would have loved to see even a small attempt at RL on top of this. Could probably get gnarly with so many avenues to explore, but even a few hundred steps could have shown whether there's something to it.

One concern I have is w/ how the data was prepared. They used an 80B model to transform from sequential instruct format to this multi-stream format. There are a lot of ways stuff can "leak" from that process and contaminate the results. That's why I'd have loved to see some further RL on this, but anyway. Cool paper, worth a revisit sometime.


Replies

ultra2d today at 7:49 AM

The potential of tweaking cross-stream attention is a very interesting avenue, like they note in their discussion: "one-way interactions for security, or partial stream isolation for fine-grained privilege control". Splitting system streams from user streams already decreases the likelihood of successful attacks (e.g., prompt injection) in their research, and that is, as they say, while still using dense attention patterns between streams.
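
To make the "one-way interactions" idea concrete, here is a rough sketch of what a stream-level attention mask could look like. This is not the paper's implementation: the function name `build_stream_mask`, the `allowed` visibility table, and the timing rule (same-or-earlier steps within a stream, strictly earlier steps across streams) are all assumptions made for illustration.

```python
# Hypothetical sketch: a boolean attention mask over tokens that live in
# parallel streams. Nothing here is taken from the paper; the visibility
# table and the timing rule are illustrative assumptions.

SYSTEM, USER = 0, 1  # hypothetical stream ids


def build_stream_mask(stream_ids, steps, allowed):
    """Return mask where mask[i][j] is True if token i may attend to token j.

    stream_ids[k] -- stream that token k belongs to
    steps[k]      -- time step at which token k was produced
    allowed[a][b] -- True if stream a is permitted to read stream b
    """
    n = len(stream_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_stream = stream_ids[i] == stream_ids[j]
            # Assumed timing rule: a token sees its own stream up to and
            # including its step, and other streams only at earlier steps.
            in_past = steps[j] <= steps[i] if same_stream else steps[j] < steps[i]
            mask[i][j] = in_past and allowed[stream_ids[i]][stream_ids[j]]
    return mask


# One-way interaction: the user stream can read the system stream,
# but the system stream never reads the user stream.
one_way = [
    [True, False],  # SYSTEM reads: SYSTEM only
    [True, True],   # USER reads: SYSTEM and USER
]

# Two streams, two steps each, interleaved as (sys, user, sys, user).
mask = build_stream_mask(
    stream_ids=[SYSTEM, USER, SYSTEM, USER],
    steps=[0, 0, 1, 1],
    allowed=one_way,
)
```

Setting every entry of `allowed` to `True` gives back the dense cross-stream pattern, so the same table could in principle be swept between full sharing and full isolation.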