> With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).
I beg to differ. Please examine any of my recent codebases on GitHub (same username). I have clean-room reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), and 7zip (z7z), so maybe I am a good test case for this. (I haven't announced this anywhere until now, right here, so here we go...)
https://github.com/pmarreck?tab=repositories&type=source
I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec, which was created by the "dirty team" and which the LLM then worked from with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.
With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
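The shape of a "two-way interoperation test" like the ones described above can be sketched roughly as follows. This is an illustrative sketch only, not code from any of the repos: in the real projects the two sides would be the original tool (e.g. bzip2) and the clean-room reimplementation (e.g. bzip2z); here Python's stdlib bz2 stands in for both sides, and all the `original_*`/`reimpl_*` names are made up for the example.

```python
import bz2

# Stand-ins: in a real interop test these two pairs would shell out to
# two independent implementations (original tool vs. clean-room rewrite).
def original_compress(data: bytes) -> bytes:
    return bz2.compress(data)

def original_decompress(blob: bytes) -> bytes:
    return bz2.decompress(blob)

def reimpl_compress(data: bytes) -> bytes:
    return bz2.compress(data)  # placeholder for the reimplementation

def reimpl_decompress(blob: bytes) -> bytes:
    return bz2.decompress(blob)  # placeholder for the reimplementation

def two_way_interop_ok(sample: bytes) -> bool:
    # Direction 1: the original tool writes, the reimplementation reads.
    if reimpl_decompress(original_compress(sample)) != sample:
        return False
    # Direction 2: the reimplementation writes, the original tool reads.
    if original_decompress(reimpl_compress(sample)) != sample:
        return False
    return True

print(two_way_interop_ok(b"hello interop\n" * 100))
```

The point of running both directions is that a rewrite can round-trip its own output while still emitting archives the original tool rejects; only the crossed pairs exercise spec compliance.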
By what means did you make sure your LLM was not trained on data from the original source code?
I only said the probability is higher, not that the probability is 1!
But that's not really what danlitt said, right? They did not claim that it's impossible for an LLM to generate something different, merely that it's not a clean-room implementation, since the LLM, one must assume, was trained on the code it's re-implementing.