> Plus, there are no RLHF signals in OpenRouter data. Even if OpenRouter wanted to build a general model-neutral framework for collecting RLHF-type data, it can't force subscriber apps to do the UI-level stuff necessary to collect it (i.e. the things ChatGPT/Claude do, with "thumbs-down" buttons, A/B tested responses, etc.)
The majority of RLHF data doesn't need this. Most of it comes from software development and/or tool calling, where the agent gets a signal back on whether it succeeded (e.g. compilation errors, test failures). It's true that end-of-trajectory signals (e.g. did this task do what you wanted) are even more useful, but even partial signals are great for RL training.
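
To make "partial signals" concrete, here's a rough sketch of how tool outcomes in a logged trajectory could be turned into rewards with no UI feedback at all. Everything in it is hypothetical: the `ToolResult` shape, the tool names, and the 50/50 weighting are my assumptions for illustration, not anything OpenRouter or any lab is known to use.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    tool: str       # hypothetical tool name, e.g. "compile" or "run_tests"
    exit_code: int  # assumption: 0 means the tool reported success


def step_rewards(trajectory):
    """Return one reward per tool call: 1.0 on success, 0.0 on failure."""
    return [1.0 if r.exit_code == 0 else 0.0 for r in trajectory]


def trajectory_reward(trajectory, final_weight=0.5):
    """Blend the mean per-step reward with the outcome of the last call.

    The last call stands in for an end-of-trajectory signal; the
    weighting is an arbitrary choice for this sketch.
    """
    rewards = step_rewards(trajectory)
    if not rewards:
        return 0.0
    mean_step = sum(rewards) / len(rewards)
    return (1 - final_weight) * mean_step + final_weight * rewards[-1]


# A made-up trajectory: a failed compile, a fix, a failed test run, a fix.
trajectory = [
    ToolResult("compile", 1),
    ToolResult("compile", 0),
    ToolResult("run_tests", 1),
    ToolResult("run_tests", 0),
]

print(step_rewards(trajectory))
print(trajectory_reward(trajectory))
```

The point is just that the compiler and test runner already act as the "thumbs-down button"; a real pipeline would need a lot more care around what counts as success.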