Hacker News

meatmanek yesterday at 5:24 PM

This is super cool. Do you know if any of the inference backends (llama.cpp, vllm, etc) support this technique?


Replies

iaw yesterday at 10:12 PM

vLLM supports "banning" certain tokens but I don't know if it can dynamically reduce them.

To my knowledge you can also "ban" tokens with llama.cpp, but the ban list is passed in each API call rather than to the server at initialization.
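
For anyone wondering what the difference between "banning" and "dynamically reducing" looks like in practice, here is a rough sketch in plain Python, not tied to either backend. A ban is a bias of negative infinity on a token's logit, while a reduction is a finite negative bias that lowers its probability without zeroing it. The helper names (`apply_token_bias`, `softmax`, `build_llamacpp_payload`) are made up for illustration. The payload shape (a `logit_bias` list of `[token_id, false]` pairs sent with each completion request) is my recollection of the llama.cpp server docs and should be treated as an assumption to verify against the version you run.

```python
import math


def apply_token_bias(logits, bias):
    """Return a copy of `logits` with a per-token bias added.

    A bias of float("-inf") bans the token outright; a finite negative
    bias only reduces its probability.
    """
    out = list(logits)
    for token_id, b in bias.items():
        out[token_id] = out[token_id] + b
    return out


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_llamacpp_payload(prompt, banned_ids):
    """Build a per-request body that bans the given token ids.

    Assumption: the server accepts `logit_bias` as [token_id, false]
    pairs in the request body; check your server's documentation.
    """
    return {
        "prompt": prompt,
        "logit_bias": [[token_id, False] for token_id in banned_ids],
    }


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    # Hard ban: token 2 can never be sampled.
    print(softmax(apply_token_bias(logits, {2: float("-inf")})))
    # Soft reduction: token 2 is still possible, just less likely.
    print(softmax(apply_token_bias(logits, {2: -2.0})))
    print(build_llamacpp_payload("Hello", [2]))
```

Because the bias travels with each request in this sketch, a client could change the bias values from one call to the next, which is one way to approximate a "dynamic" reduction without backend support for it.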