logoalt Hacker News

josephgyesterday at 7:17 AM1 replyview on HN

For this problem, I’d consider a different approach. You have a fuzzer, and based on some seed it’s generating lots of records. You then need to query a specific record (or set of records) based on the leaf.

I’d just store a table of records with the leaf, associated with the seed. A good fuzzer is entirely deterministic. So you should be able to regenerate the entire run from simply knowing the seed. Just store a table of {leaf, seed}. Then gather all the seeds which generated the leaf you’re interested in and rerun the fuzzer for those seeds at query time to figure out what choices were made.


Replies

_dain_yesterday at 11:54 AM

(I work at Antithesis)

Yes, this is (more or less) how we regenerate the system state, when necessary. But keep in mind that the fuzzing target is a network of containers, plus a whole Linux userland, plus the kernel. And these workloads often run for many minutes in each timeline. Regenerating the entire state from t=0 would be far too computationally intensive on the "read path", when all you want are the logs leading up to some event. We only do it on the "write path", when there's a need to interact with the system by creating new branching timelines. And even then, we have some smart snapshotting so that you're not always paying the full time cost from t=0; we trade off more memory usage for lower latency.

Oh one other thing: the "fuzzer" component itself is not fully deterministic. It can't be, because it also has to forward arbitrary user input into the simulation component (which is deterministic). If you decide to rewind to some moment and run a shell command, that's an input which can't be recovered from a fixed random seed. So in practice we explicitly store all the inputs that were fed in.