In my experience these models work fine in another language, as long as it’s a widely spoken one. For example, I sometimes prompt in Spanish, just to practice. It doesn’t seem to affect the quality of code generation.
They literally just have to subtract the vector for the source language and add the vector for the target.
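Whether or not that’s what actually happens inside an LLM, the arithmetic being described is the classic embedding-offset trick and is easy to sketch. Everything below is hypothetical: made-up 4-dimensional vectors standing in for sentence embeddings and per-language mean vectors.

```python
import numpy as np

def shift_language(sentence_vec, source_lang_vec, target_lang_vec):
    """Remove the source-language offset and add the target-language offset."""
    return sentence_vec - source_lang_vec + target_lang_vec

# Pretend 4-dimensional embeddings (purely illustrative values).
spanish_sentence = np.array([0.2, 0.7, 0.1, 0.5])
mean_spanish     = np.array([0.1, 0.6, 0.0, 0.4])  # "average" of Spanish embeddings
mean_english     = np.array([0.3, 0.1, 0.2, 0.0])  # "average" of English embeddings

print(shift_language(spanish_sentence, mean_spanish, mean_english))
# roughly [0.4 0.2 0.3 0.1]
```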
It’s the original use case for LLMs.
It’s just a subjective observation.
That just can’t be the case, simply because of how ML works. In short, the more diverse, high-quality, reasoning-rich examples a language has in the training set, the better the model performs in that language.
So unless the Spanish subset had much more quality-dense examples, to make up for the smaller volume (see the toy numbers below), there is no way the quality of reasoning in Spanish is on par with English.
I apologise for the rambling explanation; I’m sure someone with ML expertise here can explain it better.
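To make the volume-vs-quality point concrete, here’s a toy calculation. All numbers are invented; they aren’t real training-set statistics for any model.

```python
# Toy illustration of the volume-vs-quality argument above.
# Every figure here is made up for the sake of the example.

def effective_signal(tokens, quality_density):
    """Crude proxy: useful reasoning signal ~ volume * quality per token."""
    return tokens * quality_density

english = effective_signal(tokens=100, quality_density=1.0)  # pretend 100 units of English text
spanish = effective_signal(tokens=20, quality_density=1.0)   # pretend 20 units of Spanish text

print(english, spanish)  # 100.0 vs 20.0

# For Spanish to match, its quality density would have to be about 5x higher:
print(effective_signal(tokens=20, quality_density=5.0))  # 100.0
```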