
I will take your word for it that some systems make this "optimization", but any such system is fundamentally broken at the design level. Disk drives can, and do, return corrupted sectors without any indication of failure; the RAID system you describe would not catch those failures and would thus 'fail' in terms of providing any sort of data reliability.


I will take your word for it that some systems make this "optimization", but any such system is fundamentally broken at the design level.

I'm fairly sure some current systems (MD on Linux for example) have this sort of optimization. And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between.

This assumption used to be more true than it is now. The mean time between read errors has not been increasing as fast as hard drive capacity has. In the old days, a silent read error was highly unlikely. These days it is more common, due to the massive increase in capacity.


"And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between."

Interesting; if you make that assumption you will get burned. I have personally experienced disks that, through firmware errors, returned what was essentially a freed memory block from their cache as the sector data, returned success status on a write that never actually happened, and flipped bits in the data they actually returned (without error). The whole point of RAID for data reliability is to catch these things. (RAID for performance is different, and people tolerate errors in exchange for faster performance.)

We used to start new hire training at NetApp with the question, "How many people here think a disk drive is a storage system?" and then proceed to demolish that idea with real world data that showed just how crappy disks actually were. You don't see these things when you look at one drive, but when you have a few thousand to a couple of million out there spinning and reporting on their situation you see these things happen every day.

The issue of course is that margins in drives are razor thin and manufacturers are always trying to find ways to squeeze another penny out here and there. Generally this expresses itself as dips in drive reliability, on a manufacturing-cohort basis, of a few basis points. You can make a more reliable drive; you just have a hard time justifying it to someone who is essentially going to see only one in a million 'incorrect' operations.

This is the same reason ECC RAM is so hard to find on 'regular' PCs: the chance of you being screwed is low enough that you'll assume it was something else, or not be willing to pay a premium for the extra chip you need on your DIMMs for ECC.


Firstly, I must say that I'm talking about consumer available RAID systems here, since that's what the article discusses.

>> "And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between."

> Interesting, if you make that assumption you will get burned.

Probably why the designers of RAID have moved on to other storage systems. Yet RAID is currently the only viable option for consumers trying to get "reliable" storage, which is why the article was written.

> The whole point of RAID for data reliability is to catch these things. (RAID for performance is difference and people tolerate errors in exchange for faster performance).

No, RAID is for withstanding disk failures where the disk fails in a predictable manner. This is a common misunderstanding. Nothing in the RAID specifications says they should detect silent bit flips. If you find a system that does this, it goes beyond the RAID specification (which is good). You seem to be under the impression that most systems do this, but that is not true. Most RAID systems that consumers can get their hands on won't detect a single bit flip. The author demonstrated it on software RAID, and I am fairly certain the same thing will happen if you try it on a hardware RAID controller from any of LSI, 3ware, Areca, Promise, etc.
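To illustrate the point about plain redundancy not catching a silent flip, here is a toy sketch (hypothetical; real controllers work on whole sectors and their own on-disk formats) of a two-way mirror hit by a single-bit flip. The mirror can notice the copies disagree during a scrub, but without a checksum it has no way to arbitrate which copy is correct:

```python
# Toy model of a two-way mirror (RAID-1) with a silent bit flip.
# Hypothetical sketch, not any real controller's logic.

def scrub_mirror(copy_a: bytes, copy_b: bytes) -> str:
    """Compare the two copies. A mismatch is detectable, but with no
    per-copy checksum there is no way to tell which copy is good."""
    if copy_a == copy_b:
        return "consistent"
    return "mismatch: cannot tell which copy is correct"

good = b"hello world"
flipped = bytes([good[0] ^ 0x01]) + good[1:]  # silent single-bit flip

print(scrub_mirror(good, good))     # consistent
print(scrub_mirror(good, flipped))  # mismatch: cannot tell which copy is correct
```

Note that in the mismatch case a naive scrub can only pick a copy arbitrarily, which is exactly the failure mode being discussed.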

For instance, Google has multiple layers of checksumming to guard against bitrot, at the filesystem and application levels. If RAID protected against it, why wouldn't they trust it?


Handling this particular case (bad data from the disk, without an error indication from the disk) is pretty straightforward programming. If the device doesn't allow for additional check bits in the sectors, then reserving one sector out of 16 (or one out of 9, say, if you're doing 4K blocks) for check data on the other sectors will let you catch misbehaving drives and, if necessary, reconstruct bad data with the RAID parity.

If you use a 15/16ths scheme you only "lose" a bit over 6% of your drive to check data, and you gain the ability to avoid really nasty silent corruption. I'm not sure why even a "consumer" RAID solution wouldn't do that. Even the RAID-1 (mirroring) folks can use this for a bit more protection, although the mean-bit-error spec still bites them when trying to do a re-silver.
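The check-sector scheme described above can be sketched as follows. This is a hypothetical toy (the group size, CRC32 choice, and layout are illustrative assumptions, not any shipping product's format): a CRC of each data sector is packed into the group's reserved sector, so a scrub can identify *which* sector went bad and then rebuild just that one from RAID parity:

```python
# Hypothetical sketch of "one check sector per 15 data sectors":
# store a CRC of each data sector in the reserved 16th sector so a
# bad sector can be identified, then rebuilt from RAID parity.
import zlib

SECTOR = 512
GROUP = 15  # 15 data sectors + 1 check sector = 16 sectors on disk

def build_check_sector(data_sectors):
    # 4-byte CRC32 per data sector; 15 * 4 = 60 bytes easily fits
    # in the reserved 512-byte check sector.
    return b"".join(zlib.crc32(s).to_bytes(4, "little") for s in data_sectors)

def find_bad_sector(data_sectors, check_sector):
    for i, s in enumerate(data_sectors):
        stored = int.from_bytes(check_sector[4 * i:4 * i + 4], "little")
        if zlib.crc32(s) != stored:
            return i  # this sector is wrong; rebuild it from parity
    return None

sectors = [bytes([i]) * SECTOR for i in range(GROUP)]
check = build_check_sector(sectors)
sectors[7] = bytes([0xFF]) * SECTOR  # silent corruption of one sector
print(find_bad_sector(sectors, check))  # 7
```

Once the bad sector's index is known, the usual RAID-5 reconstruction (XOR of the corresponding sectors on the other drives plus parity) can regenerate its contents, which is exactly what plain parity alone cannot do.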

As for Google: if you'd like to understand the choice they made (and may un-make) you have to look at the cost of adding an available disk. The 'magic' thing they figured out was that they were adding lots and lots of machines, and most of those machines had a small kernel, a few apps, some memory, and an unused IDE (later SATA) port or ports. So adding another disk to the pizza box was "free" (if you look at the server in the Computer History Museum you will see they just wrapped a piece of velcro around the disk and stuck it down next to the motherboard on the 'pizza pan'). In that model an R3 system, which takes no computation (it's just copying, no ECC computation), with data "chunks" (in the GFS sense) that had built-in check bits, gives you "free" storage. And that really is very cost effective if you have stuff for the CPUs to do; it breaks down when the amount of storage you need exceeds what you can acquire by either adding a drive to a machine with a spare port or replacing all the drives with their denser next-generation model. Steve Kleiman, the former CTO of NetApp, used to model storage with what he called a 'slot tax', which was the marginal cost of adding a drive to the network (so a fraction of a power supply, chassis, cabling, I/O card, and carrier if there was one). Installing into pre-existing machines at Google, the slot tax was as close to zero as you can reasonably make it.

That said, as Google's storage requirements exceeded their CPU requirements it became clear that cost was going to be an issue. I left right about that time, but since then there has been some interesting work at Amazon and Facebook with so-called "cold" storage: ways to have the drive live in a datacenter but powered off most of the time.

I can't agree with the statement "No, RAID is for withstanding disk failures, where the disk fails in a predicted manner", mostly because disks have never failed in a "predicted manner". That is why Garth Gibson invented RAID in the first place: he noted all the money DEC and IBM were spending trying to make an inherently unreliable system reliable, and observed that if you were willing to give up some of the disk capacity (through redundancy and check bits) you could take inexpensive, unreliable drives and turn them into something as reliable as the expensive drives from those manufacturers. It is a very powerful concept, and it changed how storage was delivered. The paper is a great read, even today.


So let's say you read a full RAID-5 stripe including parity and the parity does not match because of a silent error. How do you know which block contains the error? Classic RAID does not have any checksums. AFAICT classic RAIDs are screwed in the face of silent errors, so they might as well implement as many optimizations as they can within the bounds of their (unrealistic) failure model.
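The localization problem described above is easy to demonstrate. In this hypothetical sketch (tiny 8-byte blocks standing in for real stripe units), the XOR parity check flags that the stripe is inconsistent after a silent flip, but nothing in the stripe says which block to rebuild:

```python
# RAID-5 stripe check, toy version: the XOR of the data blocks should
# equal the parity block. A mismatch says *something* is wrong, but
# with no per-block checksum it cannot say *which* block is bad.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"\x01" * 8, b"\x02" * 8, b"\x04" * 8]
parity = xor_blocks(data)  # written at stripe-creation time

# Silently flip one bit in the second data block.
corrupted = [data[0], bytes([data[1][0] ^ 0x80]) + data[1][1:], data[2]]

print(xor_blocks(data) == parity)       # True: stripe consistent
print(xor_blocks(corrupted) == parity)  # False: but which block is wrong?
```

With n data blocks plus parity, any one of the n+1 blocks could be the corrupted one, so a classic RAID-5 scrub can only report (or silently "repair" by rewriting parity) rather than recover the original data.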


Fibre Channel drives use the DIF part of the sector, reading 520 bytes per sector rather than 512. SATA-based RAID systems will often use a separate sector in the group to hold individual sector checksums: so 16 sectors, where 15 are 'data' and the 16th holds CRC data.


Could you tell us which RAID controllers actually do this? Because I've never seen one do checksumming at all.

Any that are actually available to "normal" people (which is what's relevant for the discussion of the article), i.e. not enterprise SAN?


All Fibre Channel HBAs can generate errors when the DIF doesn't correlate; it's part of the FC spec. I don't have access to the LSI controller firmware source, so I couldn't say one way or the other whether they do this with SATA drives. It should be possible to test, though.


> So let's say you read a full RAID-5 stripe including parity and the parity does not match because of a silent error.

If you have an "iffy" sector that is on the way out, you might get a decent read after a few attempts. You can then make sure that the blocks on the other drives are OK and (assuming all the failures are in the same device) drop the problem device when done.

In reality most drives these days do this for you: a certain number of sectors are reserved for reallocation in the case of small surface failures, so the drive itself will retry a few times to get a reliable read (each sector on a disk has checksums that allow error detection) and then the controller will remap that sector. This happening once or twice is considered normal wear and tear; this happening beyond a certain threshold, or at a certain rate, is a sign of imminent failure and is measured by SMART indicators. So running the scan will not let md RAID do much directly, but it will give the drive a chance to remap data that is in danger (and, if you have mdadm set up right, email you a warning if you should consider dropping the drive immediately and replacing it).


> but any such systems is fundamentally broken

Um, isn't this the EXACT point of the article you are so happily criticizing? That filesystems don't do this, and they should?


fwiw, with linux md, you can (and should) scrub the array periodically.

    for raid in /sys/block/md*/md/sync_action; do
        echo "check" > "${raid}"
    done
which will check and fix errors (or fail). i run this weekly as a cron job.

of course, it doesn't help with any data that are corrupted and read before it runs.



