Hacker News

chaps today at 4:37 PM

I have mixed feelings about this. On one hand, yeah, stop publishing garbage data. But as a FOIA nerd... I'll take the data in whatever state it's in. I'm not personally going to be able to clean the data before I receive it. Does that mean I shouldn't release the unsanitized (public) data, knowing that it has garbage in it? Hell no. Instead, we should learn and cultivate techniques to work with shit data. Should I attempt to clean it? Sure. But it becomes a liability problem very, very quickly.


Replies

torginus today at 5:12 PM

What does it mean to clean the data?

Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line?

If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate?

Data like this comes with unknown error bars. I've had weird shit happen where I fixed the tracing pipeline, and the metrics people complained that they had already corrected for the errors downstream, so with the fix in place their corrections made the whole thing look out of shape.
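
One way to sidestep the "where do you draw the line" question is to flag implausible points instead of deleting them, so the raw values stay intact and downstream users can see what was marked. A minimal sketch, assuming a plain list of numeric readings: the `flag_outliers` helper and the sample prices are made up for illustration, and the 3.5 cutoff on the median/MAD modified z-score is a conventional but arbitrary choice, not a statement about any real dataset.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return (value, is_suspect) pairs; nothing is removed or altered.

    Uses the modified z-score based on the median and the median
    absolute deviation (MAD). The threshold is a judgment call.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No measurable spread: flag nothing rather than guess.
        return [(v, False) for v in values]
    return [(v, abs(0.6745 * (v - med) / mad) > threshold) for v in values]

# Hypothetical per-litre prices, including two implausible readings.
prices = [1.45, 1.47, 1.44, 1.46, 14.6, 1.48, 0.0]
flagged = flag_outliers(prices)
suspects = [v for v, is_suspect in flagged if is_suspect]
```

This only marks points that look odd relative to the rest. It says nothing about whether the reasonable-looking points are accurate, which is the harder problem raised above.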

hermitcrab today at 4:43 PM

So you expect the 1000s of people trying to use the fuel price data to each individually clean and validate it, rather than the supplier doing it?
