Hacker News

geraneum · yesterday at 7:39 PM · 8 replies

> This was a clean-room implementation

This is really pushing it, considering it’s trained on… the internet, with all available C compilers. The work is already impressive enough; no need for such misleading statements.


Replies

cryptonector · today at 2:19 AM

Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.

raincole · yesterday at 9:38 PM

It's not a clean-room implementation, but not because it's trained on the internet.

It's not a clean-room implementation because of this:

> The fix was to use GCC as an online known-good compiler oracle to compare against

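(For anyone unfamiliar with the "compiler oracle" idea raincole quotes: differential testing means compiling the same program with a known-good compiler and the compiler under test, running both binaries, and comparing observable behaviour. Here's a minimal sketch; the helper names and the `./mycc` candidate compiler path are hypothetical, not the actual setup from TFA.)

```python
import os
import shutil
import subprocess
import tempfile
import textwrap

def run_compiler(cc_cmd, src_path, exe_path):
    """Invoke a compiler command on src_path; return (succeeded, stderr)."""
    proc = subprocess.run(cc_cmd + [src_path, "-o", exe_path],
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def run_binary(exe_path):
    """Run a compiled binary; return (exit_code, stdout)."""
    proc = subprocess.run([exe_path], capture_output=True, text=True)
    return proc.returncode, proc.stdout

def differential_test(test_src, oracle_cmd, candidate_cmd):
    """Compile test_src with both compilers and compare behaviour.

    If either compiler rejects the program, they must agree on rejecting
    it; otherwise the two binaries must produce the same exit code and
    stdout. Returns True when the candidate matches the oracle.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "t.c")
        with open(src, "w") as f:
            f.write(test_src)
        oracle_exe = os.path.join(tmp, "oracle_bin")
        cand_exe = os.path.join(tmp, "cand_bin")
        oracle_ok, _ = run_compiler(oracle_cmd, src, oracle_exe)
        cand_ok, _ = run_compiler(candidate_cmd, src, cand_exe)
        if not (oracle_ok and cand_ok):
            return oracle_ok == cand_ok
        return run_binary(oracle_exe) == run_binary(cand_exe)

program = textwrap.dedent("""
    #include <stdio.h>
    int main(void) { printf("%d\\n", 6 * 7); return 0; }
""")

# GCC as the oracle; "./mycc" stands in for the compiler under test.
if shutil.which("gcc") and os.path.exists("./mycc"):
    print(differential_test(program, ["gcc"], ["./mycc"]))
```

A test harness like this says nothing about where the candidate's source code came from, which is exactly the point of contention: the oracle fixes correctness bugs, not provenance.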
GorbachevyChase · today at 1:23 AM

https://arxiv.org/abs/2505.03335

Check out the paper above on Absolute Zero. Language models don’t just repeat code they’ve seen. They can learn to code given the right training environment.

TacticalCoder · yesterday at 11:16 PM

I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.

It's anything but a clean-room design. Clean-room design is a well-defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."

https://en.wikipedia.org/wiki/Clean-room_design

Note that "without infringing any of the copyrights" contains the word "any".

We know for a fact that models are extremely good at storing information, at some of the highest compression rates ever achieved. The fact that they typically decompress that information in a lossy way doesn't mean the information wasn't used in the first place.

Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there was clearly information being stored.

It's not a clean-room design, plain and simple.

inchargeoncall · yesterday at 9:20 PM

[flagged]

antirez · yesterday at 8:05 PM

The LLM does not contain a verbatim copy of everything it saw during the pre-training stage. It may remember certain over-represented parts verbatim; otherwise, it has knowledge about a huge number of topics, but that knowledge is more like the way you remember things you know very well. And indeed, if you give it access to the internet or to the source code of GCC and other compilers, it will implement such a project N times faster.
