
ipython · yesterday at 5:17 PM

Curious how (or whether) changes to the inference engine can fix the issue of infinitely long reasoning loops.

It's my layman understanding that this would have to be fixed in the model weights themselves?


Replies

tarruda · yesterday at 6:57 PM

There's an AMA happening on Reddit, and they said it will be fixed in the next release: https://www.reddit.com/r/LocalLLaMA/comments/1r8snay/ama_wit...

sosodev · yesterday at 6:30 PM

I think these infinite loops can happen in more than one way. It can be an inference-engine bug: the engine doesn't recognize the specific format of tags/tokens the model emits to delineate the different kinds of output (thinking, tool calling, regular text). So the model generates an "I'm done thinking" marker, but the engine misses it and just keeps treating everything that follows as more "thinking" tokens.
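Roughly, you can picture the engine's parser as a little state machine that flips out of "thinking" mode when it sees the closing tag it expects. A toy sketch (the tag names and function here are made up for illustration, not llama.cpp's actual parser):

```python
# Toy sketch of an engine-side reasoning parser (hypothetical, not real llama.cpp code).
# The engine watches the token stream for a known closing tag; if the model emits a
# tag variant the parser doesn't know about, it never leaves "thinking" mode.

END_THINKING_TAGS = {"</think>"}  # hypothetical set of tags this engine recognizes

def split_stream(tokens):
    """Split generated tokens into (thinking, answer) lists."""
    thinking, answer = [], []
    in_thinking = True  # many reasoning models start inside a thinking block
    for tok in tokens:
        if in_thinking and tok in END_THINKING_TAGS:
            in_thinking = False          # model signalled "I'm done thinking"
            continue
        (thinking if in_thinking else answer).append(tok)
    return thinking, answer

# Recognized tag: reasoning and answer are separated as intended.
print(split_stream(["plan", "</think>", "answer"]))       # (['plan'], ['answer'])

# Unrecognized tag variant: everything stays in the thinking bucket, and with no
# stop condition the engine just keeps sampling more "thinking" tokens.
print(split_stream(["plan", "</reasoning>", "answer"]))    # (['plan', '</reasoning>', 'answer'], [])
```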

It can also be a bug in the model weights: the model simply fails to generate the appropriate "I'm done thinking" marker at all.
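In that case, about the only thing the engine can do is impose a hard budget on reasoning tokens and force an exit. Another made-up sketch (the budget value and helper names are hypothetical, not any real engine's API):

```python
# Hypothetical engine-side safeguard: leave thinking mode after a fixed budget,
# even if the model never emits its "done thinking" tag.
MAX_THINKING_TOKENS = 64  # made-up limit for illustration

def generate_with_budget(sample_next_token, end_tag="</think>"):
    """Yield only answer tokens, cutting reasoning off once the budget runs out."""
    thinking_tokens = 0
    in_thinking = True
    while True:
        tok = sample_next_token()
        if in_thinking:
            thinking_tokens += 1
            # Leave thinking mode if the model signals it is done, or if it has
            # burned through the whole budget without ever doing so.
            if tok == end_tag or thinking_tokens >= MAX_THINKING_TOKENS:
                in_thinking = False
            continue
        if tok == "<eos>":
            break
        yield tok

# A fake sampler that never emits "</think>" still terminates, because the
# budget forces the parser out of thinking mode.
stream = iter(["plan"] * 100 + ["the", "answer", "<eos>"])
out = list(generate_with_budget(lambda: next(stream)))
print(out[-2:])  # ['the', 'answer']
```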

You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635

Apparently Step 3.5 Flash uses an odd format for its tags, so llama.cpp just doesn't handle it correctly.
