Hacker News

gwerbin today at 4:09 AM

There is still no good "data diff" tool that I can run on, say, a big pile of CSV or Parquet. Something with DVC integration would be especially welcome.


Replies

appplication today at 4:40 AM

I would imagine that’s because, at the scales where most folks use Parquet files, you’re generally no longer thinking in terms of individual diffs to your data. Parquet also implies some level of batch processing, as opposed to, e.g., a DB.

We have some custom data diff tools at my ultracorp that provide a browsable interface, but the customers tend to be operations folk rather than engineers, DS, etc., who would be more familiar with actual version control concepts. These tools also work against the data store, not on something like CSV or Parquet.