SOTA typically refers to achieving the best performance, not using the trendiest thing regardless of performance. There is some subtlety here. At some point an LLM might give the best performance in this task, but that day is not today, so an LLM is not SOTA, just trendy. It's kinda like rewriting something in Rust and calling it SOTA because that's the trend right now. Hope that makes sense.
I don’t think trendy is really the right word, and maybe it’s not state of the art yet, but a lot of us in the industry are seeing emerging capabilities that might make it SOTA. Hope that makes sense.
>Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.
>SOTA typically refers to achieving the best performance
Multimodal Transformers are the best way to turn plain text instructions into embodied world behavior. Nothing to do with being 'trendy'. A Vision Language Action model would probably have done much better, but really the only difference between that and the models trialed above is training data. Same technology.
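For what it's worth, the "plain text instruction → embodied behavior" pipeline people are describing is roughly this shape. This is a toy sketch only: `llm_complete`, the prompt format, and the action schema are all made up for illustration, and the stub stands in for a real multimodal/VLA model call.

```python
import json

def llm_complete(prompt: str) -> str:
    # Stub standing in for a real model call (LLM / VLA). A real system
    # would send the prompt (plus camera input, for a VLA) to the model.
    return json.dumps({"action": "pick", "object": "red cup", "place": "shelf"})

def instruction_to_action(instruction: str) -> dict:
    # Ask the model to translate a plain-text instruction into a
    # structured action a downstream controller could execute.
    prompt = (
        "Translate the instruction into a JSON action with keys "
        f"'action', 'object', 'place'.\nInstruction: {instruction}\nJSON:"
    )
    return json.loads(llm_complete(prompt))

action = instruction_to_action("put the red cup on the shelf")
print(action)
```

The point being debated above is exactly where `llm_complete` sits: same transformer technology either way, with the VLA variant differing mainly in its training data and in emitting action tokens directly rather than JSON.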