Hacker News

doctorpangloss · yesterday at 4:59 PM

Maybe you are observing artifacts of Qwen's training procedure. Perhaps they initialized later layers with the weights of earlier ones as part of the training curriculum. But it's fun to imagine something more exotic.
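
A minimal sketch of the kind of curriculum being speculated about here (not Qwen's confirmed recipe): a shallow stack is "grown" by initializing new layers as copies of already-trained earlier layers, which would leave adjacent layers looking suspiciously similar after training. All names below are illustrative.

~~~python
# Hypothetical depth-growing step: duplicate earlier layers to initialize new ones.
import copy
import torch.nn as nn

def grow_by_duplication(layers: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Return a deeper stack whose extra layers are deep copies of earlier ones."""
    grown = [copy.deepcopy(layer) for layer in layers]
    i = 0
    while len(grown) < target_depth:
        # Each new layer starts as a clone of an earlier, already-trained layer.
        grown.append(copy.deepcopy(layers[i % len(layers)]))
        i += 1
    return nn.ModuleList(grown)

# Usage: grow a 4-layer toy stack to 8 layers before continuing pre-training.
base = nn.ModuleList(nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(4))
deeper = grow_by_duplication(base, target_depth=8)
print(len(deeper))  # 8
~~~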


Replies

dnhkng · yesterday at 7:29 PM

There are similar patterns in the models from all the big labs. I think the transformer layer stack starts out 'undifferentiated', analogous to stem cells. Pre-training pushes the model to develop structure, and this technique helps discover that hidden structure.