The artifacts themselves have more structure, but diffing them is hard because of their size: what exactly do you show in the diff? Row-level changes? Summary statistics? And how do you keep it from getting slow on bigger datasets?
Then there are plots saved as images, which expose basically no structure at all.
Row-level diffs and summary stats are both diffs over values: they can tell you that something changed, but not whether the *meaning* has changed. What I'm working on is surfacing how the meaning changes.
The questions I'd like the diff to answer are more like: will the grain go from one-row-per-user to one-row-per-user-per-day, will a key stop being unique, will a join start fanning out and quietly double a measure, will something additive become non-additive?
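To make the fan-out case concrete, here is a minimal sketch in plain Python. The table names (`orders`, `user_days`), the columns, and both helper functions are made up for illustration; this is not how my tool does it, just the effect it is trying to flag. Once the right side of the join goes from one row per user to one row per user per day, the summed measure inflates without any error being raised.

```python
from collections import Counter

def duplicated_keys(rows, key):
    """Return {key_value: count} for key values that appear more than once."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

def left_join(left_rows, right_rows, key):
    """Naive left join, only meant to demonstrate the fan-out effect."""
    out = []
    for left in left_rows:
        matches = [r for r in right_rows if r[key] == left[key]]
        if not matches:
            out.append(dict(left))
        for right in matches:
            out.append({**left, **right})
    return out

# Hypothetical data: one order row per user.
orders = [{"user_id": 1, "revenue": 100}, {"user_id": 2, "revenue": 50}]

# After the change, the right side has one row per user per day.
user_days = [
    {"user_id": 1, "day": "2024-01-01"},
    {"user_id": 1, "day": "2024-01-02"},
    {"user_id": 2, "day": "2024-01-01"},
]

joined = left_join(orders, user_days, "user_id")
revenue_before = sum(o["revenue"] for o in orders)
revenue_after = sum(j["revenue"] for j in joined)   # user 1 is counted twice
fanout_keys = duplicated_keys(user_days, "user_id")  # keys that cause the fan-out
```

A value-level diff would only report that total revenue moved; the duplicated join key is what explains why.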
This is a diff over structure, but that structure is latent in the transformation that produces the artifact. To make things harder, if the transformation is written in a declarative language (e.g. SQL), the code doesn't even describe how things get done, only what the output should be.
What I've ended up doing is recovering the structure from the code by analyzing it, and then using *cheap* profiling rather than a full row compare.
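As a rough idea of what "cheap" can mean here, the sketch below checks key uniqueness with two aggregates per table instead of comparing rows. It uses an in-memory SQLite database so it runs anywhere; the table names, the data, and the `key_profile` helper are assumptions for the example, not my actual implementation.

```python
import sqlite3

def key_profile(conn, table, key):
    """Cheap uniqueness probe: two aggregates instead of a row-by-row compare."""
    # Identifiers are interpolated directly: fine for a sketch, not for untrusted input.
    # Note that COUNT(DISTINCT ...) ignores NULLs.
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()
    return {"rows": total, "distinct_keys": distinct, "unique": total == distinct}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts_old (account_id INTEGER, plan TEXT)")
conn.execute("CREATE TABLE accounts_new (account_id INTEGER, plan TEXT)")
conn.executemany(
    "INSERT INTO accounts_old VALUES (?, ?)",
    [(1, "free"), (2, "pro")],
)
conn.executemany(
    "INSERT INTO accounts_new VALUES (?, ?)",
    [(1, "free"), (2, "pro"), (2, "team")],
)

before = key_profile(conn, "accounts_old", "account_id")
after = key_profile(conn, "accounts_new", "account_id")

# The structural finding: the key held before the change and no longer does.
key_broke = before["unique"] and not after["unique"]
```

The static analysis says which column is supposed to be the key; the probe only has to confirm or refute that one claim, so its cost does not depend on how wide the table is.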
As an example, the output of my equivalent of an impact sub-command would be something like this: "this change makes account_id non-unique three models downstream"
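One simplified way to get from "this key broke" to a message like that is a breadth-first walk over the model dependency graph. Everything below is hypothetical: the model names, the `assumes_unique` mapping, and the assumption that uniqueness loss propagates through every edge (a real analysis would have to check whether each model re-aggregates or deduplicates along the way).

```python
from collections import deque

def downstream_impact(graph, assumes_unique, start, column):
    """Walk downstream from `start`; return (model, hops) pairs for
    models that rely on `column` being unique."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        model, hops = queue.popleft()
        if hops > 0 and column in assumes_unique.get(model, set()):
            hits.append((model, hops))
        for child in graph.get(model, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, hops + 1))
    return hits

# Hypothetical lineage: each model lists the models that read from it.
graph = {
    "stg_accounts": ["int_accounts"],
    "int_accounts": ["dim_accounts"],
    "dim_accounts": ["fct_revenue"],
}
# Hypothetical declarations of which models rely on a unique key.
assumes_unique = {"fct_revenue": {"account_id"}}

impact = downstream_impact(graph, assumes_unique, "stg_accounts", "account_id")
messages = [
    f"this change makes account_id non-unique in {model}, {hops} models downstream"
    for model, hops in impact
]
```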