> So how were we able to recover the database and the data inside it? Most of the data was probab...

Retr0id • yesterday at 7:50 PM • 1 reply • view on HN

> So how were we able to recover the database and the data inside it? Most of the data was probably still intact, only a few sectors were unreadable. Once those were either restored (rewritten with a strong signal) or remapped by the drive’s firmware, the filesystem and the database engine could read the file end-to-end again. SQL Server pages also have checksums, so if any page came back wrong rather than unreadable, we’d have known. We got lucky: the corruption was at the magnetic-signal level, not at the “platter is scratched” level.

This doesn't quite seem to follow. As described, neither of the "recovery" methods actually restore lost data. So why weren't any of the SQL pages left in a bad state?

Replies

benlivengood • yesterday at 7:56 PM

As best as I can tell it was intermittent read failures on some sectors, not permanent failures.

So if you keep rereading that section of the disk you eventually get all the data, save it somewhere, write a bunch of new patterns over it, then write the original data and verify it reads back correctly many times.

I believe the article's analysis about RAID is wrong though; most controllers will start resilvering or just fail a drive once it experiences too many IO errors.

alt Hacker News

Replies