Hacker News

hleszek, yesterday at 9:22 PM

Why not let the user provide the seed used for generation? That way we could at least detect whether the model has changed, because the same prompt with the same seed would suddenly give a different answer (assuming they don't cache answers). You could also compare different providers that supposedly serve the same model, and if the model is open-weight you could even run the comparison yourself on your own hardware or on rented GPUs.
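A minimal sketch of the bookkeeping this would need, assuming you can obtain a completion for a given (model, prompt, seed) from somewhere. The names `request_key`, `completion_digest`, and `check` are hypothetical, and no real provider API is called here; the completions are passed in as plain strings.

```python
import hashlib
import json


def request_key(model, prompt, seed):
    """Stable identifier for one (model, prompt, seed) request."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "seed": seed}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def completion_digest(text):
    """Hash of the completion text, so only a digest has to be stored."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check(store, model, prompt, seed, completion):
    """Compare a completion against the first one recorded for this request.

    Returns "new" the first time a request is seen, "same" if the completion
    matches the recorded baseline, and "changed" otherwise.
    """
    key = request_key(model, prompt, seed)
    digest = completion_digest(completion)
    if key not in store:
        store[key] = digest  # the first observation becomes the baseline
        return "new"
    return "same" if store[key] == digest else "changed"


# Hypothetical usage with made-up completions.
store = {}
print(check(store, "some-model", "Hello?", 42, "Hi there."))
print(check(store, "some-model", "Hello?", 42, "Hi there."))
print(check(store, "some-model", "Hello?", 42, "Hello! How can I help?"))
```

A "changed" result only says the output differs for the same inputs. It does not by itself say why.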


Replies

bthornbury, yesterday at 9:27 PM

AFAIK seed determinism can't really be relied upon between two machines, maybe not even between two different GPUs.
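One commonly cited reason, offered here as an assumption rather than something stated in the thread, is that floating-point addition is not associative, so hardware or kernels that sum the same numbers in a different order can produce slightly different results. A minimal illustration in plain Python:

```python
# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)       # the two groupings do not compare equal
print(abs(left - right))   # the gap is tiny, but it is not zero
```

A difference this small is usually harmless, but if two candidate tokens have nearly equal scores it could be enough to change which one is picked, after which the rest of the generation diverges.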

bthornbury, yesterday at 9:30 PM

Something like a perplexity/log-likelihood measurement across a large enough number of prompts/tokens might give you the same thing in a statistical sense, though. I expect the comparison percentages at the top are something like that.
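A rough sketch of what such a measurement could look like, assuming each provider can return per-token natural-log probabilities for the same text. The helper names and the numbers below are made up for illustration, and this is not necessarily how the percentages at the top were computed.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-likelihood per token), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap(ppl_a, ppl_b):
    """Relative difference between two perplexities, as a fraction."""
    return abs(ppl_a - ppl_b) / min(ppl_a, ppl_b)


# Made-up per-token log-probabilities for the same text from two providers.
provider_a = [-0.21, -1.35, -0.02, -0.77, -0.48]
provider_b = [-0.22, -1.33, -0.02, -0.79, -0.47]

ppl_a = perplexity(provider_a)
ppl_b = perplexity(provider_b)
print(f"A: {ppl_a:.4f}  B: {ppl_b:.4f}  gap: {relative_gap(ppl_a, ppl_b):.2%}")
```

With only a handful of tokens the gap is mostly noise; the point of averaging over many prompts and tokens is that small per-token numerical differences wash out while a genuinely different model should not.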