> Every weight tensor in Rio is, to thousands of standard deviations, the same 0.6/0.4 blend...

hintymad • today at 5:52 PM • 8 replies

> Every weight tensor in Rio is, to thousands of standard deviations, the same 0.6/0.4 blend of Nex and Qwen — across all 60 layers and every component of the network. Other finetunes cannot be explained as interpolations.

I find it amazing how robust the current deep learning models are. A simple linear combination of every weight did not degrade the performance of the model, but enhanced it.
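For readers who want to see what "a simple linear combination of every weight" means concretely, here is a minimal sketch in plain Python. It assumes the checkpoints are dicts mapping tensor names to flat lists of floats, and that the 0.6 coefficient belongs to Nex and 0.4 to Qwen (the quote gives the ratio but not explicitly which model gets which weight). The names `blend`, `estimate_alpha`, and the toy tensors are hypothetical and are not taken from the Rio analysis.

```python
def blend(weights_a, weights_b, alpha):
    """Return alpha * a + (1 - alpha) * b for every named tensor."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must have identical tensor names")
    merged = {}
    for name, a in weights_a.items():
        b = weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for {name}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


def estimate_alpha(merged, a, b):
    """Least-squares alpha such that merged is close to alpha*a + (1-alpha)*b."""
    num = sum((m - y) * (x - y) for m, x, y in zip(merged, a, b))
    den = sum((x - y) ** 2 for x, y in zip(a, b))
    return num / den


# Toy stand-ins for two checkpoints (made-up values, two tiny "tensors").
nex = {"layer0.w": [1.0, 2.0, 3.0], "layer1.w": [0.5, -0.5]}
qwen = {"layer0.w": [0.0, 1.0, 5.0], "layer1.w": [1.5, 0.5]}

# Assumed direction of the blend: 0.6 * Nex + 0.4 * Qwen.
rio = blend(nex, qwen, 0.6)

# Recover the coefficient tensor by tensor; a pure blend gives the same
# alpha for every tensor.
alphas = {name: estimate_alpha(rio[name], nex[name], qwen[name]) for name in rio}
```

If a third model really is a uniform blend of the other two, the recovered coefficient should come out the same for every tensor, which is the kind of pattern the quoted claim describes. A model that was trained further after merging would presumably show coefficients that drift between tensors.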

Replies

Aurornis • today at 6:50 PM

> A simple linear combination of every weight did not degrade the performance of the model, but enhanced it.

Enhanced it on a couple benchmarks, supposedly.

The game is to turn knobs until you get a benchmark run that shows an improvement, then ship it. There are a lot of fine tunes and chimera models on HuggingFace that are supposedly better at some specific test, but when you use them for anything else they're usually worse.

This happens with a lot of the models that are modified to remove censorship. They succeed in getting the model to emit previously censored outputs, but the overall output quality decreases.

woadwarrior01 • today at 6:13 PM

It's a well-known idea[1], although it's still surprising that something so simple even works.

[1]: https://arxiv.org/abs/2203.05482

x312 • today at 6:49 PM

This works because Nex itself is a finetune of Qwen3.5 (https://huggingface.co/nex-agi/Nex-N2-Pro). It's merging Qwen3.5 with a Qwen3.5 finetune.

I don't believe this would work on two LLMs that have different pretraining. Even if it did, you would need two LLMs that have the exact same internal activation shapes, dimensions, expert counts, and token vocabulary. Realistically, that would never happen outside of finetunes or academic experiments.
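The structural requirements listed above can be sketched as a pre-merge check. This is only an illustration: it assumes each model is summarized as a dict of tensor name to shape tuple plus a token list, and `merge_blockers` and the example shapes are hypothetical. Passing the check is necessary for an element-wise merge, not evidence that the merge would work, which is the point about different pretraining.

```python
def merge_blockers(shapes_a, shapes_b, vocab_a, vocab_b):
    """List the structural reasons two checkpoints cannot be averaged element-wise."""
    problems = []
    # Tensors present in only one model (e.g. a different layer or expert count).
    for name in sorted(set(shapes_a) ^ set(shapes_b)):
        problems.append(f"tensor only in one model: {name}")
    # Tensors present in both models but with different dimensions.
    for name in sorted(set(shapes_a) & set(shapes_b)):
        if shapes_a[name] != shapes_b[name]:
            problems.append(
                f"shape mismatch for {name}: {shapes_a[name]} vs {shapes_b[name]}"
            )
    # Same embedding shape is not enough; the token-to-row mapping must match too.
    if vocab_a != vocab_b:
        problems.append("token vocabularies differ")
    return problems


# A base model and a finetune of it share every shape and the vocabulary.
base = {"embed": (8, 4), "layer0.attn.q": (4, 4)}
finetune = dict(base)
vocab = ["<pad>", "a", "b", "c", "d", "e", "f", "g"]
same_family = merge_blockers(base, finetune, vocab, vocab)

# An unrelated model with a wider hidden size and an extra expert tensor.
other = {"embed": (8, 6), "layer0.attn.q": (6, 6), "layer0.expert1": (6, 6)}
unrelated = merge_blockers(base, other, vocab, vocab[::-1])
```

Here `same_family` is empty, while `unrelated` reports the extra expert tensor, both shape mismatches, and the vocabulary difference.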

kristjansson • today at 7:21 PM

https://thickets.mit.edu

themafia • today at 7:29 PM

> A simple linear combination of every weight did not degrade the performance of the model, but enhanced it.

Which could be a signal that your "performance" was so abysmal in the first place that even randomly applied training methods can't make it _worse_.

randall • today at 6:43 PM

[dead]

meindnoch • today at 6:59 PM

It shows that LLMs are an extremely wasteful approach to intelligence.
