Most LLMs are trained on large amounts of source code from open-source projects. This 'project' does the whole song-and-dance about never seeing the source code and separating the system to skirt around legal trouble. Why hasn't anyone done that yet?
Not a lot of code is public domain, so not a lot of training data is available.
Because that's impossible. Any "robot" that can generate code must be trained on massive amounts of code, most of which is open source.