I wonder if stuffing tool call formatting into an engram layer (see DeepSeek's engram paper) that could be swapped at runtime would be a useful solution here.
The idea would be to encode tool-calling semantics once in a single layer and inject it as needed. Harness providers could then ship users their bespoke tool-calling layer, injected at model load time.
Dunno, seems like it might work. I think most open-source models could have an engram layer injected (some testing would be required to see where in the stack the layer fits best).
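To make the load-time-injection idea concrete, here's a rough sketch. Everything here is hypothetical: DeepSeek's paper doesn't prescribe an injection API, and `EngramLayer`/`inject_engram` are names I made up. A real version would operate on hidden-state tensors; this just shows the structural idea of splicing a harness-specific layer into a model's layer list.

```python
# Hypothetical sketch of load-time engram injection. All names are
# illustrative; this is not an API from the engram paper or any framework.

class EngramLayer:
    """Stand-in for a pretrained layer encoding one harness's tool-call format."""

    def __init__(self, harness: str):
        self.harness = harness

    def __call__(self, hidden_state: list) -> list:
        # A real engram would transform hidden states; here we just tag
        # the activation stream so the effect is visible.
        return hidden_state + [f"tool-semantics:{self.harness}"]


def inject_engram(layers: list, engram: EngramLayer, position: int) -> list:
    """Splice the engram in at a given depth at model load time.

    The best depth would need per-model testing, which is the
    coordination cost mentioned above.
    """
    patched = list(layers)  # copy so the base model's layer list is untouched
    patched.insert(position, engram)
    return patched
```

The per-model `position` argument is where the testing burden lives: each open-source model family would need its own validated injection point.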
The engram idea is technically clever, but imo it attacks the problem bottom-up, while Louf's real argument is top-down. His solution (declarative specs) centralizes the spec, making it versioned and composable, independent of any particular model.
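For contrast, the declarative approach might look something like this. Again a hypothetical sketch: the field names and `compose` semantics are illustrative, not from Louf's actual proposal. The point is that the spec is plain data, so versioning and composition happen at the spec level, with no per-model work.

```python
# Hypothetical sketch of a declarative, model-independent tool-call spec.
# Field names and composition rules are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCallSpec:
    name: str
    version: str
    grammar: str  # e.g. a JSON Schema or EBNF fragment constraining output

    def compose(self, other: "ToolCallSpec") -> "ToolCallSpec":
        """Combine two specs; composition is spec-level, not model-level."""
        return ToolCallSpec(
            name=f"{self.name}+{other.name}",
            version=f"{self.version}/{other.version}",
            grammar=f"({self.grammar}) | ({other.grammar})",
        )
```

Because the spec is just a versioned value, any model or provider can consume the same one, which is exactly what the engram variant per model gives up.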
Engram layers just move the coordination problem earlier and lock it in. Coordination problems between models and providers would still exist: each open-source model would need its own layer injection, with a separate variant produced for each. Users would still need to choose between "Qwen-8b" and "Qwen-8b-engram", multiplied across model families and sizes. Is that cleaner?