danpalmer • yesterday at 5:37 AM • 3 replies

LLMs using code to answer questions is nothing new, it's why the "how many Rs in strawberry" question doesn't trip them up anymore, because they can write a few lines of Python to answer it, run that, and return the answer.
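To make that concrete, here is a minimal sketch of the kind of throwaway snippet a model might emit for this question. The function name `count_letter` is made up for illustration; it is not taken from any particular model's actual tool output.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word.

    Hypothetical helper for illustration only.
    """
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # prints 3
```

The point is that the counting is done by the interpreter, character by character, rather than by the model working over tokens.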

Mathematica / Wolfram Language as the basis for this isn't bad (it's arguably late), because it's a highly integrated system with, in theory, a lot of consistency. It should work well.

That said, has it been designed for sandboxing? Sandboxing is a core requirement of this "CAG". Python isn't great for that, but it's possible thanks to the significant effort many people have put in over the years. Does Wolfram Language have that same level of support? Being proprietary puts it at a disadvantage, as any sandboxing technology would have to be developed by Wolfram Research, not the community.

Replies

Someone • yesterday at 1:17 PM

> it's why the "how many Rs in strawberry" question doesn't trip them up anymore, because they can write a few lines of Python to answer it, run that, and return the answer.

That still requires the LLM to ‘decide’ that consulting Python to answer that question is a good idea, and for it to generate the correct code to answer it.

Questions similar to "how many Rs in strawberry" are likely in their training set nowadays, so they are unlikely to make mistakes there, but it may still be problematic for other questions.

adius • yesterday at 6:49 AM

I also think that sandboxing is crucial. That’s why I’m working on a Wolfram Language interpreter that can be run fully sandboxed via WebAssembly: https://github.com/ad-si/Woxi

simianwords • yesterday at 8:26 AM

> LLMs using code to answer questions is nothing new, it's why the "how many Rs in strawberry" question doesn't trip them up anymore, because they can write a few lines of Python to answer it, run that, and return the answer.

False. It has nothing to do with tool use; it's just reasoning.

