Just to confirm, did you read Cosmo's article (cosmo.tardis.uk, black background), or the girl.surgery article (white background)?
ML isn't my strong suit so I wouldn't be able to explain how, but Cosmo's article is almost entirely a refutation of the points made by the root article. No doubt he is very friendly, as someone would be to anyone interested in their field.
What I can speak to is the general construction of the sentences. They read (in the most charitable of interpretations) like text messages:
"Good model vs bad model is ~200 elo, but search is ~1200 elo, so even a bad model + search is essentially an oracle to a good model without, and you can distill from bad model + search → good model."
I take it that by "is ~X elo" they mean that implementing that strategy results in a gain of X Elo? Which would still be undefined, as 1000 to 1200 is not the same as 2800 to 3000, and improvements are of course not cumulative. I get that this reads more like internal notes, but it was published, so there was some expectation that it would be understood by someone else.
For a lot more reasons, the writing reminds me of notes written by me or by loved ones under the influence of drugs. My estimation is that the article was written by a mind that used to be brilliant but is now just echoing that brilliance, trying to keep their higher-order cognitive functions while struggling to maintain the baseline of basic language use. I hope it is reversible and if per is reading this and my estimation is correct, that they perturb the weights in favour of quitting drugs and see if they win more or not.
> 1000 to 1200 is not the same as 2800 to 3000
Elo is defined such that the expected win-rate of a player should only depend on the difference in Elo rating to their opponent. https://en.wikipedia.org/wiki/Elo_rating_system#Mathematical...
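To make that concrete, here is a minimal sketch of the standard logistic Elo expectation (the formula from the linked Wikipedia section). The function name `expected_score` is just an illustrative choice. It shows that a 200-point gap gives the same expected score at 1000 vs 1200 as at 2800 vs 3000, and, for scale, roughly what the ~200 and ~1200 gaps quoted from the article would mean if taken at face value.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the logistic Elo model.

    Only the difference rating_a - rating_b enters the formula.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Same 200-point gap at very different absolute levels.
low_level = expected_score(1200, 1000)
high_level = expected_score(3000, 2800)

# Rough scale of the two gaps mentioned in the article (ratings are arbitrary).
gap_200 = expected_score(200, 0)
gap_1200 = expected_score(1200, 0)

print(f"1200 vs 1000: {low_level:.3f}")
print(f"3000 vs 2800: {high_level:.3f}")
print(f"+200 gap:     {gap_200:.3f}")
print(f"+1200 gap:    {gap_1200:.3f}")
```

Whether measured rating gains stay comparable across very different pools of opponents is a separate empirical question; the sketch only illustrates the definition.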
I personally don't believe the argument that search with a bad model helps so much. In e.g. an open position with lots of possibilities you would need an insane amount of calculations to beat a positional/strategic player with a bad engine.
I had only read girl.surgery. I have now read Cosmo's article.
> ML isn't my strong suit so I wouldn't be able to explain how, but Cosmo's article is almost entirely a refutation of the points made by the root article. No doubt he is very friendly, as someone would be to anyone interested in their field.
ML is familiar to me but far from my specialty. It was very difficult for me to understand the points in Cosmo's article, even if it seems more technically correct and less notes-y. In fact, it is likely because it aims for high technical correctness that some sentences are impossible for me to digest. (For example: "AlphaZero is a strange inversion of RL, where all of the “learning how to map situations to actions so as to maximize a numerical reward signal” is done online, by a GOFAI algorithm, and absolutely no reinforcement learning makes it into the actual gradient used to train the network!")
I think you may have misunderstood the "Now we get to the scathing criticism" line as being literal rather than ironic (or literal disguised as irony), because most of Cosmo's points are clarifications and distinctions only understandable or valuable to chess engine/ML experts. Many of Cosmo's points are agreements or unrelated asides; many others are self-professed nitpicks; and among the rest, I think Cosmo is being overly harsh. For example, the discussion on "no gradient" is an agreement in disguise, because what girl.surgery means to say (and what I understood on the first read) is simply that SPSA is like gradient descent, but without access to analytically computed derivatives. As another example, the discussion on "self-play was only necessary one time" leads to Cosmo only disagreeing with the language, not the description of the process; "bad model + search → good model" per girl.surgery is mirrored by Cosmo saying "To surpass that ceiling, you must search-amplify the new network, generating better data than the old oracle could, and distill again — and this is precisely the self-play loop," and, if I had to guess, by "self play" girl.surgery means bootstrapping from absolutely nothing rather than from another highly capable model.
> I take it that by "is ~X elo" they mean that implementing that strategy results in a gain of X Elo? Which would still be undefined, as 1000 to 1200 is not the same as 2800 to 3000, and improvements are of course not cumulative.
I understood it as +X elo over the next-best model, in the context of top-shelf models rather than near-amateur human play. This usage of "elo gains" in a generalized context is even used by Tilps and Crem in Cosmo's quote. It's just a ballpark of the magnitude of strength difference we're talking about, one which is actually not as context-sensitive as you might think, because of what yorwba notes about the very definition of elo.
> For a lot more reasons, the writing reminds me of notes written by me or by loved ones under the influence of drugs. My estimation is that the article was written by a mind that used to be brilliant but is now just echoing that brilliance, trying to keep their higher-order cognitive functions while struggling to maintain the baseline of basic language use. I hope it is reversible and if per is reading this and my estimation is correct, that they perturb the weights in favour of quitting drugs and see if they win more or not.
Very possibly. But I might offer an alternative, more charitable explanation: profound neurodivergence and/or mental illness. I personally know at least one troubled genius who writes like this, if not worse, but who is more than capable of very serious intellectual projects and research. The nature of autism tends to make it harder to write for a general audience without coming off as bizarre, and in my experience such writers are better at interactive, 1-on-1 discussions where you can ask questions to course-correct them away from burrowing too deep into their own head.
I think Cosmo's refutations were mostly not very useful and based on misunderstandings of what I was trying to say. This is fine and we discussed it prior to their article being published.
The point I was trying to make with "RL is only necessary once" is that you can embark on a single self-play loop, getting better and better, and this will get you to something close to the frontier. Once you're at the frontier, the frontier doesn't move very much, so you have quite a while (a decade?) where it's totally fine to distill from the RL games.
On correction histories -- imo I correctly described what they do. Cosmo was annoyed by the word "adapt" but what I described was the adaptation.
On SPSA -- you don't have a gradient! you don't do backprop! this is what i was trying to get at.
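For anyone who hasn't seen it, here is a minimal sketch of an SPSA-style update, under loud assumptions: a toy, noiseless two-parameter quadratic stands in for the real objective (in engine tuning that would be noisy game results), and `spsa_minimize`, `toy_loss`, and the gain constants are illustrative names and values, not taken from any engine's tuner. The point is only that the update uses two loss evaluations at randomly perturbed parameters, with no analytic gradient and no backprop.

```python
import random


def spsa_minimize(loss, theta, steps=2000, a=0.1, c=0.1, seed=0):
    """Minimize `loss` using only function evaluations (no derivatives).

    Gain schedules use the commonly cited decay exponents 0.602 and 0.101;
    the stability offset often added to the step-size schedule is omitted.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(1, steps + 1):
        a_k = a / k ** 0.602  # step size
        c_k = c / k ** 0.101  # perturbation size
        # Perturb every parameter at once by a random +/-1 (Rademacher) vector.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        # Two evaluations total, regardless of the number of parameters.
        diff = loss(plus) - loss(minus)
        # Per-parameter slope estimate: diff / (2 * c_k * delta_i).
        theta = [t - a_k * diff / (2.0 * c_k * d) for t, d in zip(theta, delta)]
    return theta


def toy_loss(x):
    """Hypothetical stand-in objective with its minimum at (3, -1)."""
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2


result = spsa_minimize(toy_loss, [0.0, 0.0])
print(result)
```

The estimate behaves like a noisy gradient in expectation, which is why it can look like gradient descent from the outside even though nothing is differentiated.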