Hacker News

doctorpangloss · yesterday at 4:59 PM

Maybe you are observing artifacts of Qwen's training procedure. Perhaps they initialized later layers with the weights of earlier ones as part of the training curriculum. But it's fun to imagine something more exotic.
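
A minimal sketch of the kind of curriculum being speculated about here (not Qwen's confirmed recipe): a shallow stack is "grown" by initializing new layers as copies of already-trained earlier layers, which would leave adjacent layers looking suspiciously similar after training. All names below are illustrative.

~~~python
# Hypothetical depth-growing step: duplicate earlier layers to initialize new ones.
import copy
import torch.nn as nn

def grow_by_duplication(layers: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Return a deeper stack whose extra layers are deep copies of earlier ones."""
    grown = [copy.deepcopy(layer) for layer in layers]
    i = 0
    while len(grown) < target_depth:
        # Each new layer starts as a clone of an earlier, already-trained layer.
        grown.append(copy.deepcopy(layers[i % len(layers)]))
        i += 1
    return nn.ModuleList(grown)

# Usage: grow a 4-layer toy stack to 8 layers before continuing pre-training.
base = nn.ModuleList(nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(4))
deeper = grow_by_duplication(base, target_depth=8)
print(len(deeper))  # 8
~~~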


Replies

dnhkng · yesterday at 7:29 PM

There are similar patterns in the models from all the big labs. I think the transformer layer stack starts out 'undifferentiated', analogous to stem cells. Pre-training pushes the model to develop structure, and this technique helps discover that hidden structure.