Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even so, a great deal of complexity lies in everything surrounding the architectures, e.g. how you implement them performantly on modern accelerators, how you distribute the model across a set of accelerators, how you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.
Those are all just optimizations.
We still don’t really know why they work; we just know how to build them.