> it's why the "how many Rs in strawberry" question doesn't trip them up anymore, because they can write a few lines of Python to answer it, run that, and return the answer.
That still requires the LLM to ‘decide’ that consulting Python to answer that question is a good idea, and for it to generate the correct code to answer it.
Questions similar to ”how many Rs in strawberry" nowadays likely are in their training set, so they are unlikely to make mistakes there, but it may be still be problematic for other questions.