
ipython · yesterday at 5:17 PM

Curious how (or whether) changes to the inference engine can fix the issue of infinitely long reasoning loops.

It's my layman understanding that this would have to be fixed in the model weights themselves?


Replies

tarruda · yesterday at 6:57 PM

There's an AMA happening on Reddit, and they said it will be fixed in the next release: https://www.reddit.com/r/LocalLLaMA/comments/1r8snay/ama_wit...

sosodev · yesterday at 6:30 PM

I think these infinite loops can happen in more than one way. It can be an inference-engine bug: the engine doesn't recognize the specific format of tags/tokens the model emits to delineate the different kinds of output (thinking, tool calling, regular text). So the model generates an "I'm done thinking" marker, but the engine misses it and just keeps treating everything that follows as more "thinking" tokens.
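Roughly, you can picture the engine's parser as a little state machine that flips out of "thinking" mode when it sees the closing tag it expects. A toy sketch (the tag names and function here are made up for illustration, not llama.cpp's actual parser):

```python
# Toy sketch of an engine-side reasoning parser (hypothetical, not real llama.cpp code).
# The engine watches the token stream for a known closing tag; if the model emits a
# tag variant the parser doesn't know about, it never leaves "thinking" mode.

END_THINKING_TAGS = {"</think>"}  # hypothetical set of tags this engine recognizes

def split_stream(tokens):
    """Split generated tokens into (thinking, answer) lists."""
    thinking, answer = [], []
    in_thinking = True  # many reasoning models start inside a thinking block
    for tok in tokens:
        if in_thinking and tok in END_THINKING_TAGS:
            in_thinking = False          # model signalled "I'm done thinking"
            continue
        (thinking if in_thinking else answer).append(tok)
    return thinking, answer

# Recognized tag: reasoning and answer are separated as intended.
print(split_stream(["plan", "</think>", "answer"]))       # (['plan'], ['answer'])

# Unrecognized tag variant: everything stays in the thinking bucket, and with no
# stop condition the engine just keeps sampling more "thinking" tokens.
print(split_stream(["plan", "</reasoning>", "answer"]))    # (['plan', '</reasoning>', 'answer'], [])
```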

It can also be a bug in the model weights: the model simply fails to generate the appropriate "I'm done thinking" marker at all.
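In that case, about the only thing the engine can do is impose a hard budget on reasoning tokens and force an exit. Another made-up sketch (the budget value and helper names are hypothetical, not any real engine's API):

```python
# Hypothetical engine-side safeguard: leave thinking mode after a fixed budget,
# even if the model never emits its "done thinking" tag.
MAX_THINKING_TOKENS = 64  # made-up limit for illustration

def generate_with_budget(sample_next_token, end_tag="</think>"):
    """Yield only answer tokens, cutting reasoning off once the budget runs out."""
    thinking_tokens = 0
    in_thinking = True
    while True:
        tok = sample_next_token()
        if in_thinking:
            thinking_tokens += 1
            # Leave thinking mode if the model signals it is done, or if it has
            # burned through the whole budget without ever doing so.
            if tok == end_tag or thinking_tokens >= MAX_THINKING_TOKENS:
                in_thinking = False
            continue
        if tok == "<eos>":
            break
        yield tok

# A fake sampler that never emits "</think>" still terminates, because the
# budget forces the parser out of thinking mode.
stream = iter(["plan"] * 100 + ["the", "answer", "<eos>"])
out = list(generate_with_budget(lambda: next(stream)))
print(out[-2:])  # ['the', 'answer']
```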

You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635

Apparently Step 3.5 Flash uses an odd format for its tags, so llama.cpp just doesn't handle it correctly.
