It's still predicting the next word. Somewhere in the gigantic dataset that the LLM was trained on, the phrase "gradient border" appears in the vicinity of CSS code that renders one. So when you run it in an inference loop, there's a good chance it outputs that CSS code when you tell it to render a "gradient border".
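As a rough illustration of the "inference loop" idea, here is a toy sketch in Python. It is not how a real LLM works internally: a real model uses a neural network over tokens with a long context, while this one only counts which word follows which. The corpus, the function names `train_bigrams` and `generate`, and the `;` stop word are all made up for the example.

```python
from collections import Counter, defaultdict

# Made-up toy corpus: the phrase "gradient border" sits near some CSS.
CORPUS = (
    "to render a gradient border use border-image : linear-gradient ( red , blue ) 1 ; "
    "to render a gradient border use border-image : linear-gradient ( red , blue ) 1 ; "
    "to render a solid border use border : 1px solid black ;"
)

def train_bigrams(text):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, prompt, max_new_words=20, stop=";"):
    """Greedy inference loop: keep appending the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word never had a successor in the corpus
        next_word = candidates.most_common(1)[0][0]
        words.append(next_word)
        if next_word == stop:
            break  # hypothetical end-of-statement marker
    return " ".join(words)

model = train_bigrams(CORPUS)
print(generate(model, "gradient border"))
```

Prompted with "gradient border", the loop walks into the CSS that sat next to that phrase in the corpus, purely from co-occurrence counts and with no notion of what a border looks like.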
Multi-modal models that can understand visual input do exist, but no such visual reasoning process happened in the example you mentioned. Not unless you have a visual feedback loop in the coding harness.
I'm not dismissing the capability of "predicting the next word", however. The vast amount of training data enables the extremely complex and useful behavior you just described.
What about things it wasn’t trained on?
For instance, I’ve written a few custom languages to learn how to write a VM and the lexer/parser/compiler/etc. It had never seen them before, and I just gave it the syntax, which is different from anything it had ever seen, simply because I made it up and it had never been trained on it.
After I gave it my documentation, it was able to write the language just like a language that it had been trained on. I’ve also seen this behavior at work, where there are weird, definitely non-standard quirks for doing things, and it can handle them.