Hacker News

torginus · today at 5:12 PM

What does it mean to clean the data?

Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line?

If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate?
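The "where do you draw the line" problem can be made concrete with a toy sketch (the data and thresholds below are made up for illustration, not from any real pipeline): an IQR-style outlier filter's verdict on a borderline point depends entirely on an arbitrary multiplier you choose, so "clean" is a judgment call, not a property of the data.

```python
def iqr_filter(values, k):
    """Keep points within k interquartile ranges of the quartiles."""
    xs = sorted(values)
    n = len(xs)
    q1 = xs[n // 4]          # crude quartile estimates, fine for a sketch
    q3 = xs[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical sensor readings: is 95 garbage, or a rare real event?
data = [10, 11, 12, 11, 10, 13, 12, 11, 95, 300]

strict = iqr_filter(data, 1.5)   # drops both 95 and 300
loose  = iqr_filter(data, 50.0)  # keeps 95, drops 300
```

Both cutoffs are defensible; neither tells you whether 95 was a collection error or a genuine outlier. That decision has to come from knowledge of the collection process, which is exactly what's in doubt.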

Data like this comes with unknown error bars. I've had weird shit happen where I fixed the tracing pipeline, and then the metrics people complained: they had been correcting for the errors downstream, and once the source was fixed, those corrections threw the whole thing out of shape.


Replies

chaps · today at 5:19 PM

"What does it mean to clean the data?"

This isn't possible to answer generally, but I'm sure you know that.

Look -- I've been in nonstop litigation over FOIA data for the past ten years. During litigation I can definitely push back on messy data, and I have, but if I pushed back on every little "obviously wrong" point, my case would get thrown out for my being a twat of a litigant.

Again, I'd rather have the data and publish it with known gotchas.

Here's an example: https://mchap.io/using-foia-data-and-unix-to-halve-major-sou...

Should I have told the Department of Finance to fuck off with their messy data? No -- even if I wanted to. Instead, we learn to work with its awfulness and advocate for cleaner data. Which is exactly what happened here: once I and others started publishing work based on the ticket data and more journalists got involved, the data got cleaner over time.
