GMoromisato • today at 4:30 PM • 2 replies

Clean data is expensive--as in, it takes real human labor to obtain clean data.

One problem is that you can't just focus on outliers. Whatever pattern-matching you use to spot outliers will end up introducing a bias in the data. You need to check all the data, not just the data that "looks wrong". And that's expensive.

In clinical drug trials, we have the concept of SDV--Source Data Verification. Someone checks every data point against the official source record, usually a medical chart. We track the % of data points that have been verified. For important data (e.g., Adverse Events), the goal is to get SDV to 100%.

As you can imagine, this is expensive.

Will LLMs help to make this cheaper? I don't know, but if we can give this tedious, detail-oriented work to a machine, I would love it.

Replies

hermitcrab • today at 4:36 PM

>Clean data is expensive--as in, it takes real human labor to obtain clean data.

Yes, data can contain subtle errors that are expensive and difficult to find. But the 2nd error in the article was so obvious that a bright 10-year-old would probably have spotted it.

gdulli • today at 4:39 PM

Why would you give this sort of work to a machine that can't be responsibly used without checking its output anyway?
