I don't know how far it would get, but I imagine a FAANG would get the farthest here, by virtue of having mountains of corporate data they have complete ownership of.
They'd probably get the farthest, but they won't pursue it, because they don't want to risk leaking the original training data.
It has been shown possible to reconstruct long consecutive spans of training data from ordinary language/text models [1], so it ought to be possible for their internal code, too.
[1] https://arxiv.org/abs/2601.02671
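For anyone curious what this kind of extraction test looks like in practice, here's a minimal sketch (my construction, not necessarily the exact protocol of [1]; the model and the 50/50 prefix/continuation split are just illustrative). The idea: prompt the model with the opening tokens of a document suspected to be in the training set, decode greedily, and check whether it regenerates the true continuation verbatim.

    # Verbatim-memorization probe: if greedy decoding from a document's
    # prefix reproduces the document token-for-token, the model has
    # memorized it and can leak it.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # model choice is illustrative
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def is_memorized(doc: str, n_prefix: int = 50, n_cont: int = 50) -> bool:
        ids = tok(doc, return_tensors="pt").input_ids[0]
        if ids.shape[0] < n_prefix + n_cont:
            return False
        prefix = ids[:n_prefix].unsqueeze(0)
        target = ids[n_prefix:n_prefix + n_cont]
        out = model.generate(prefix, max_new_tokens=n_cont, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        gen = out[0, n_prefix:]
        # Memorized text tends to win the greedy argmax at every step.
        return gen.shape[0] == target.shape[0] and torch.equal(gen, target)

Run that over a pile of candidate documents and every hit is, by construction, training data the model will happily regurgitate.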