At SpiderOak, every storage cluster writes new customer data to three RAID6 volumes on 3 different machines, which wait for fdatasync, before we consider it written.
... and before that begins, the data is added to a giant ring buffer that records the last 30 TB of new writes, on machines with UPS. If a rack power failure happens, the ring buffer is kept until storage audits complete.
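To make the "three copies, each waiting on fdatasync" idea concrete, here is a minimal sketch. It is not SpiderOak's code: the function name `write_replicated` and the temp directories standing in for three volumes on three machines are assumptions for illustration only.

```python
import os
import tempfile

# fdatasync is missing on some platforms (e.g. macOS); fall back to fsync there.
_sync = getattr(os, "fdatasync", os.fsync)

def write_replicated(data: bytes, paths: list[str]) -> int:
    """Write data to every path and return only after each copy is synced."""
    acked = 0
    for path in paths:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            view = memoryview(data)
            while view:                      # os.write may write fewer bytes than asked
                written = os.write(fd, view)
                view = view[written:]
            _sync(fd)                        # block until the kernel reports the data flushed
        finally:
            os.close(fd)
        acked += 1
    return acked

# Hypothetical stand-ins for three volumes on three different machines.
replica_dirs = [tempfile.mkdtemp(prefix=f"replica{i}_") for i in range(3)]
replica_paths = [os.path.join(d, "blob.bin") for d in replica_dirs]
acked = write_replicated(b"customer data", replica_paths)
```

The caller treats the data as written only when `acked` equals the number of replicas; anything less is a failed write.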
I think every company that deals with large data they can't lose develops appropriate paranoia.
The behavior of hard drives is like decaying atoms. You can't make accurate predictions about what any one of them will do. Only in aggregate can you say something like "the half life of this pile of hardware is 12 years" or "if we write this data N times we can reasonably expect to read it again."
Interesting, do you do disk scrubbing to make sure the data is still out there after a while? That would be the key method to ensure you can still recover the data upon need.
During my tenure at NetApp I got to see all sorts of really really interesting disk problems and the lengths software had to go to reliably store data on them. Two scars burned into me from that were 1) disks suck, and 2) disks are not 'storage'.
The first part is pretty easy to understand. Storage manufacturers have competed for years in a commodity market where consumers often choose price per gigabyte over URER (the unrecoverable read error rate), and at the scale of the market, savings of a few cents add up to better margins. And while the 'enterprise' fibre channel and SCSI drives could (and did) command a hefty premium, the shift to SATA drives really put a crimp in the overall margin picture. So the disk manufacturers are stuck between the reliability of the drive and the cost of the drive. They surf the curve where it is just reliable enough to keep people buying drives.
This trend bites back, making an error while reading more and more likely. Not picking on Seagate here, they are all similar, but let's look at their Barracuda drive's spec sheet [1]. You will notice a parameter 'Nonrecoverable Read Errors per Bits read', and you'll see that it's 1 per 1x10^14 across the sheet, from 3TB down to 250GB. It is a statistical thing; the whole magnetic field domain to digital bit pipeline is one giant analog loop of error extraction. 1x10^14 bits. So what does that mean? Let's say each byte is encoded in 10 bits on the disk. Three trillion bytes is 30 trillion bits in that case, or 3x10^13 bits. Best case, if you read that disk from one end to the other (like you were resilvering a mirror from a RAID 10 setup) you have a 1 in 3 chance that one of those sectors won't be readable for a perfectly working disk. Amazing isn't it? Trying to reconstruct a RAID5 group with 4 disks remaining out of 5, look out.
So physics is not on your side, but we've got RAID 6 Chuck! Of course you do, and that is a good thing. But what about when you write to the disk, the disk replies "Yeah sure boss! Got that on the platters now" but it didn't actually write it? Now you've got a parity failure waiting to bite you sitting on the drive, or worse, it does write the sector but writes it somewhere else (saw this a number of times at NetApp as well). There are three million lines of code in the disk firmware. One of the manufacturers (my memory says Maxtor) showed a demo where they booted Linux on the controller of the drive.
Bottom line is that the system works mostly, which is a huge win, and a lot of people blame the OS when the drive is at fault, another bonus for manufacturers, but unless your data is in a RAID 6 group or on at least 3 drives, it's not really 'safe' in the "I am absolutely positive I could read this back" sense of the word.
Phantom writes are one reason ZFS stores checksums separately from the blocks being written. If you store checksums with the block itself, then failure modes like phantom writes can result in reading data that you think is valid (because the checksum matches) but it's the wrong data.
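A toy illustration of that point, and emphatically not ZFS's on-disk format: the `Disk` class, the `phantom` flag and the "parent pointer" variable are all made up for the sketch. It compares a checksum stored inside the block with one kept outside it when a write is acknowledged but never lands.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Disk:
    """Toy block device that can silently drop a write (a phantom write)."""
    def __init__(self):
        self.blocks = {}

    def write(self, addr: int, data: bytes, phantom: bool = False) -> None:
        if not phantom:          # a phantom write is acknowledged but never lands
            self.blocks[addr] = data

    def read(self, addr: int) -> bytes:
        return self.blocks[addr]

disk = Disk()
old, new = b"version 1", b"version 2"

# Scheme A: the checksum lives inside the block it covers.
disk.write(0, checksum(old).encode() + old)
disk.write(0, checksum(new).encode() + new, phantom=True)
raw = disk.read(0)
stored_sum, payload = raw[:64].decode(), raw[64:]
inline_detects = checksum(payload) != stored_sum     # stale block is self-consistent

# Scheme B: the checksum lives in the parent pointer, outside the block.
disk.write(1, old)
parent_sum = checksum(old)
disk.write(1, new, phantom=True)
parent_sum = checksum(new)                           # parent is updated as usual
separate_detects = checksum(disk.read(1)) != parent_sum
```

Here `inline_detects` ends up `False` (the stale block still validates against its own checksum) while `separate_detects` is `True`.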
The worst part of this is that often when disks fail, they just become extremely slow (100s of milliseconds) rather than explicitly failing. That can be significantly worse than just failing explicitly and having the OS read from another disk.
Nit: Remember the old fallacy: if there's a 50% chance of rain on each of Saturday and Sunday, that doesn't mean there's 100% chance of rain this weekend. The quoted number of nonrecoverable errors per bits read is presumably an expected value on a uniform distribution (which is ludicrous, since errors are not even remotely independent, but I don't know what else they could be expecting the reader to assume). In that case, the expected probability of an error on each bit read is 1 / 1E14. The probability of zero errors after reading 3E13 bits is ((1E14-1)/1E14)^3E13, which is about 75%.
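For anyone who wants to check the arithmetic, a quick sketch under the same assumptions as above (independent errors, one per 1E14 bits, 3E13 bits read). The variable names are mine.

```python
import math

URE_RATE = 1e-14        # one nonrecoverable error per 1e14 bits read, per the spec sheet
BITS_READ = 3e13        # 3 TB at the 10-bits-per-byte assumption used above

expected_errors = URE_RATE * BITS_READ
# log1p/expm1 keep precision when the per-bit probability is this tiny.
log_p_clean = BITS_READ * math.log1p(-URE_RATE)
p_clean = math.exp(log_p_clean)
p_at_least_one = -math.expm1(log_p_clean)

print(f"expected errors: {expected_errors:.2f}")
print(f"P(no error):     {p_clean:.3f}")
print(f"P(>=1 error):    {p_at_least_one:.3f}")
```

The expected number of errors is 0.3, but the chance of at least one error comes out near 26%, not 1 in 3, because the probability of a clean read is roughly exp(-0.3), or about 0.74.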
> The worst part of this is that often when disks fail, they just become extremely slow (100s of milliseconds) rather than explicitly failing.
This is the main difference between consumer-grade and enterprise-grade drive firmware. Take the HGST Deskstar and Ultrastar, or Seagate Constellation ES and Barracuda 2 TB; they are physically exactly the same, but behave very differently when encountering an error. Basically the enterprise drive will fail quickly BUT recover (because it's probably RAID-backed), while the desktop drive will try to read the requested data at all costs (because it's probably NOT RAID-backed), therefore suffering from extremely long time-outs.
> The probability of zero errors after reading 3E13 bits is ((1E14-1)/1E14)^3E13, which is about 75%.
Still not the sort of statistics you want betting your precious data against...
> This is the main difference between consumer-grade and enterprise-grade drive firmware. Take the HGST Deskstar and Ultrastar, or Seagate Constellation ES and Barracuda 2 TB; they are physically exactly the same, but behave very differently when encountering an error. Basically the enterprise drive will fail quickly BUT recover (because it's probably RAID-backed), while the desktop drive will try to read the requested data at all costs (because it's probably NOT RAID-backed), therefore suffering from extremely long time-outs.
That is, in fact, why I pay the huge premium for 'enterprise' sata.
And yes, usually 'enterprise' drives fail rather than just getting really slow.
But not always. 'Enterprise' drives sometimes get shitty rather than just failing outright, too. I mean, it's usually like 1/3rd to 1/5th expected performance rather than 1/100th, but this is disk. It's already on the edge of unacceptably slow. Just cutting performance in half means I'm getting complaints.
Overall? Reducing this problem from twice a month to twice a year is worth the premium, but all spinning rust is shit. The 'enterprise' stuff is only slightly less shit.
It bothers me that RAID controllers don't handle this more intelligently.
I had a partner at a VC ask me that too :-) The question you have to ask is "What is Flash?" which sounds really dumb when you ask it but let me explain.
So in the storage business everything used to be done on the highest performance, most reliable drives. Typically those were 15K RPM Fibre Channel drives which cost 8x the price of an equivalent SATA drive [1]. In 2003 NetApp created the 'NearStore' line, which were filers that used SATA drives. The idea was that tape was way too slow, so you could put stuff on SATA drives and for low duty cycles get to it a lot faster. SATA drives were 'nearline' storage, which was defined to be "about 99.8% available." This was hugely better than tape but not as good as a "real" disk array. Of course what people discovered was that you could have some pretty huge data sets be reasonably accessible that way. The first folks were the oil and gas folks [2], but lots of people jumped onto the bandwagon; it was cheaper than a storage array, faster than tape.
Now let's look at flash. Most people think of flash as a disk replacement because it connects to the computer through a disk interface (even though it doesn't have disk mechanicals). So it seems like a really fast disk. My assertion though is that it isn't a fast disk, it's slow memory. It is 'near line' memory. Which is to say it takes a lot longer to access something in flash than it does in memory, but it's craploads faster than getting it off of spinning rust.
Four years ago, I told a guy who claimed to be Intel's key flash architect that if Intel would put flash on the same side of the MMU as the processor we would completely change the way servers are built. Why? Can you imagine that you've got 200GB of address space that is already ready to go when you turn on the processor? Once you start loading page tables you read in the ones in non-volatile flash and blam all your data structures are ready to go, right now. You want to make some sort of logical computation based on the state of a 50GB data structure, and all you do is follow the pointers? The 'driver' for flash attached as memory is
var = *((var_type *)(0xsome64bitaddress));
But we cannot do that because flash doesn't live on the front side bus, it lives, at best, in the PCI Express address space on the other side of the Southbridge. [3].
Bottom line: Solid State drives are doomed. You can't make them much denser (the physics doesn't work), and you don't want to waste your time going through the disk driver and I/O subsystem to get to what is essentially more memory. But as flash moves closer to the CPU its impact will become pretty impressive.
[1] This led people like IBM to propose 'bricks' which were 8 SATA drives in a RAID or Mirrored config pretending to be a single reliable drive.
[2] One of my favorite Dave Hitz quotes: "Oil and Gas companies have an awesome compression algorithm, they can take a 600TB data set and compress it to one bit, 'oil' or 'no oil.'"
> Now let's look at flash. Most people think of flash as a disk replacement because it connects to the computer through a disk interface (even though it doesn't have disk mechanicals). So it seems like a really fast disk. My assertion though is that it isn't a fast disk, it's slow memory. It is 'near line' memory.
Not trying to be a party pooper, but flash does not behave like random access memory at all.
With RAM, you can erase/overwrite memory at any address at will.
With both NAND and NOR flash, you can only update 1s to 0s or 'reset' bits to 1. But this is only done in chunks. Flash is actually called 'flash' because in the early prototypes, there was a visible flash of light when a block was reset to all 1s.
This means that to erase part of a block, any data that is to be preserved within this block first has to be moved away. So the controller uses a block mapper to keep the address presented on the interface the same, while actually relocating data to someplace else (a bit like the remapping that common hard drives do when a sector goes bad).
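A toy model of those program/erase rules, just to show why an in-place overwrite goes wrong. The `EraseBlock` class and its 4-byte size are invented for the sketch; real parts have far larger blocks and page-level programming.

```python
class EraseBlock:
    """Toy model of one flash erase block: bits program 1 -> 0, erase resets all to 1."""
    def __init__(self, size: int = 4):
        self.cells = [0xFF] * size      # erased state is all 1s
        self.erase_count = 0

    def program(self, offset: int, value: int) -> None:
        # Programming can only clear bits, so the result is old AND new.
        self.cells[offset] &= value

    def erase(self) -> None:
        # Erase works on the whole block, never a single byte.
        self.cells = [0xFF] * len(self.cells)
        self.erase_count += 1

blk = EraseBlock()
blk.program(0, 0b1010_1010)
blk.program(0, 0b0101_0101)     # attempted overwrite without an erase
garbled = blk.cells[0]          # neither value survived: every bit is now 0

blk.erase()                     # wipes the whole block, including any neighbours
blk.program(0, 0b0101_0101)
clean = blk.cells[0]
```

The second `program` call leaves `0b0000_0000` behind, which is why anything worth keeping in the block has to be copied elsewhere before the erase.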
To further complicate matters, there's only so many times a block can be erased — erasing causes wear. So the block mapper also incorporates a mechanism for wear leveling. This is also why you want to align your partitions to match the erasure blocks. Otherwise you'll get write amplification; you'd write to 1 block of your filesystem, but if that block straddles erasure blocks, it would cause 2 relocations.
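The alignment point is easy to check with a little arithmetic. In this sketch the 128 KiB erase block and 4 KiB filesystem block are assumed sizes, and `erase_blocks_touched` is a helper made up for the example.

```python
def erase_blocks_touched(offset: int, length: int, erase_block: int) -> int:
    """Number of erase blocks overlapped by a write of `length` bytes at `offset`."""
    first = offset // erase_block
    last = (offset + length - 1) // erase_block
    return last - first + 1

ERASE_BLOCK = 128 * 1024    # assumed erase block size
FS_BLOCK = 4 * 1024         # assumed filesystem block size

# Filesystem block that starts exactly on an erase block boundary.
aligned = erase_blocks_touched(ERASE_BLOCK, FS_BLOCK, ERASE_BLOCK)

# Same block on a partition shifted so it starts 512 bytes before the boundary.
misaligned = erase_blocks_touched(ERASE_BLOCK - 512, FS_BLOCK, ERASE_BLOCK)
```

`aligned` is 1 and `misaligned` is 2: the same 4 KiB write now dirties two erase blocks.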
There are filesystems in the Linux kernel that are built to do wear leveling (I think it was yaffs/ubifs), they operate on MTD devices (/dev/mtd?) that do not do their own wear leveling. If you use CF or SD cards, or USB flash drives, they'll have a controller which does the wear leveling (but they don't always do this in a sane way).
So, sure, you could get Flash to behave like RAM, but it would be through emulation, by hiding the relocation and wear leveling logic. You could propose doing this in some super-heavy special MMU and I am with you insofar as it could be of some benefit to eliminate some abstraction layers like ATA and avoid hitting a bus (pci, or usb on pci).
Well, flash doesn't behave like non-volatile memory, but it is like random access memory. You don't have to wait for a spinning disk to come around in order to read a sector. And, SSDs can read any block at the same speed. HDs read sequential blocks faster. The mechanics of writing/erasing doesn't change the fact that it is much closer to slow memory than fast disk. You can leave the implementation details (wear leveling, etc) to the controllers.
It actually reminds me a lot of old core memory (http://en.wikipedia.org/wiki/Magnetic-core_memory). In order to read core memory, you had to actually try to write to it. If the voltage changed, you guessed wrong, so you then knew the correct value.
> The 'driver' for flash attached as memory is var = *((var_type *)(0xsome64bitaddress));
It's a nice idea, but the driver for flash needs to do much more than this. For example, blocks need to be erased before they can be re-written, and since they have a limited lifecycle blocks must also be copied around and remapped.
True, if you look at Intel's flash chips they do this internally, you've got the flash 'controller' which sits astride a bunch of actual flash being written in parallel and it's doing the wear leveling. I've played a bit with Intel's PCIe attached flash and can vouch for it being quite a bit faster than going through the SATA port, now if we can get them to move it to the FSB so that I can talk to it via the L1/L2/L3 cache then I'll be really happy.
That's an interesting concept but there would still be a need for the much faster, volatile DRAM. At which point you need to discern which is which and decide for each data structure where you want to locate it.
The latency of NAND flash will be so much more than DRAM that you'd have to use DRAM as the cache layer before the flash in which case you could just use one of those block caching schemes coupled with a PCIe SSD to get everything that is good with very little trouble at the architecture level.
Absolutely, there will always be a place for DRAM (in bulk 'fast' memory) and for Static RAM (in caches), this is just a new (for this generation) tier beyond DRAM.
I doubt that you could use NAND Flash as a RAM substitute as it is right now. Basically it's a block device with a page size of about 8KB and an erase block of ~2MB. Data can be read and written only in pages; in-chip page load times are on the order of 100us, page write 1ms, block erase 5ms, and it takes about 100us to transfer page data in either direction. So, random read/write latencies are 10^2 / 10^1 times lower than HDD's, but still about 10^3 / 10^4 times larger than DRAM's. I guess the internal architecture could be redesigned to reduce block sizes and allow parallel operations inside a single chip, but it's unclear how much that would affect density and price.
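Plugging those figures in gives a feel for the orders of magnitude. The flash timings are the ones quoted above; the ~10 ms HDD access time and ~100 ns DRAM access time are round-number assumptions of mine, not measurements.

```python
US = 1e-6
page_load, page_program, transfer = 100 * US, 1000 * US, 100 * US   # figures from the comment
hdd_access = 10e-3      # assumed ~10 ms random access for a spinning disk
dram_access = 100e-9    # assumed ~100 ns for a DRAM access

flash_read = page_load + transfer        # load the page in-chip, then move it out
flash_write = transfer + page_program    # move the page in, then program it

for name, t in (("read", flash_read), ("write", flash_write)):
    print(f"{name}: {hdd_access / t:.0f}x faster than HDD, "
          f"{t / dram_access:.0f}x slower than DRAM")
```

With these assumptions a random read is roughly 50x faster than disk and 2000x slower than DRAM, and a write is roughly 9x faster than disk and 11000x slower than DRAM, which lines up with the rough powers of ten above.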
But it would be useful to put NAND controller closer to CPU, remove block device emulation layer and let OS do wear-leveling and block allocation. We could get high-performance swap or put there specialized MTD filesystems like UBIFS.
Not familiar with 'drum' memory I take it? Back in the way back time there were a bunch of read/write heads all sitting along a drum which was spinning underneath them. It had some similarities to flash: no seek time, you just picked the head to read or write, and if you wrote you wrote the entire track because there wasn't an index (usually). Cool stuff.
Your proposed solution will slightly improve latency, not bit rate.
>You want to make some sort of logical computation based on the state of a 50GB data structure, and all you do is follow the pointers? The 'driver' for flash attached as memory is
> var = *((var_type *)(0xsome64bitaddress));
You don't need any changes in hardware for this, you can use memory mapped files. The main reason that software today serialises everything is interoperability. For example, old versions of Microsoft Office applications did this, modern ones don't.
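A minimal sketch of the memory mapped file approach, using Python's standard `mmap` module. The file name and the fixed-size `(key, value)` record layout are hypothetical, chosen only to keep the example short.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<QQ")   # hypothetical fixed-size record: (key, value)

path = os.path.join(tempfile.mkdtemp(), "table.bin")
with open(path, "wb") as f:
    for k in range(1000):
        f.write(RECORD.pack(k, k * k))

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    # No explicit read() or deserialisation pass: index straight into the mapping
    # and let the kernel page in only the bytes that are touched.
    key, value = RECORD.unpack_from(m, 42 * RECORD.size)
```

The lookup is just an offset calculation into the mapping, though the page fault behind it still goes through the normal block I/O path.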
> You don't need any changes in hardware for this, you can use memory mapped files.
That still triggers the OS on the other end, which has to go through the same complicated drivers to read blocks into RAM before mapping them into the process, so it's nowhere near comparable to what he's proposing.
Flash on the main bus would mean the only thing a read would go through would be the MMU page translation, then straight to either the flash chips or the flash controller, without being bottlenecked by passing through other parts of the system.
And that bottleneck is already real - I have SSD setups at work that are pretty much maxing out the PCIe bus on those servers, and they're not even that expensive (as long as you want performance, not large amounts of storage).
Now, there are still challenges, not least that either you're still talking to a flash controller rather than directly to the flash chips, or you have to handle wear levelling etc. yourself if you want to do writes. Both have disadvantages.
Some embedded systems do talk straight to the flash chips themselves and handle the wear levelling directly - I once put Linux on a BIOS-less x86 system with flash directly on the main bus where the "bootloader" did the equivalent of a memcpy to copy the kernel into a suitable location, set up some very basic stuff and jumped to it, and where the "disk" used an FTL (flash translation layer) driver that handled wear levelling in the kernel code.
You get some extra benefits too: You get rid of the split between memory and filesystem address space. In fact, you can kill off the traditional filesystem altogether. I'm excited by that, filesystems are a dirty kludge.
But this would require completely changing how modern operating systems work. No doubt applications developers would love it though.
Do you mean filesystems as the hierarchical way to organize stuff, or the filesystem implementation that ensures that data remains persisted even with unflushed writes in case of a crash/poweroff etc.?
Disks and filesystems as a weird, separate appendage hanging off the side of a computer are conceptually a weak idea.
Once you've unified your address space - stable storage is in the same space as unstable storage, then your entire world changes. The trick then becomes, how do you represent objects (Hint: Others have already tread this ground, see AS/400).
Although many of the problems (i.e. seek time) of rotating disks are overcome by solid state persistent memory, you still need special datastructures to ensure that:
a) data is buffered during write and that a power failure won't leave the storage in an inconsistent state
b) blocks to be overwritten are reallocated and erased in background (you cannot simply overwrite a flash block before erasing it)
c) the used datastructures (trees etc) are optimized for the size of the flash block
d) gracefully handle block corruption
Since it should be possible to remap blocks (either because of (b) or corruption (d)) it's common to have a logical block remapping layer somewhere in the (disk) controller. At this point classical filesystem algorithms and datastructure can easily solve the remaining problems (consistency with journaling and block optimized access like on rotating disks).
Of course, there is another approach for solving (b): see http://en.wikipedia.org/wiki/Log-structured_file_system
(actually, log structured filesystem technique can be used to implement the block remapping within the ssd compatibility layer which lies inside the disk controller itself, but that's a detail)
or copy on write filesystems like zfs/btrfs etc (although AFAIK they rely on a disk interface which is compatible to a rotating disk).
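To illustrate the log-structured remapping idea for (b), here is a toy sketch. It is not how any particular SSD controller or filesystem works; the class and method names are invented, and wear leveling, cleaning and crash consistency are left out entirely.

```python
class LogStructuredStore:
    """Toy remapping layer: writes append to a log, a map tracks the live copy."""
    def __init__(self):
        self.log = []        # physical pages, only ever appended to
        self.mapping = {}    # logical block number -> index into self.log

    def write(self, lbn: int, data: bytes) -> None:
        self.log.append(data)                 # never overwrite in place
        self.mapping[lbn] = len(self.log) - 1

    def read(self, lbn: int) -> bytes:
        return self.log[self.mapping[lbn]]

    def garbage(self) -> int:
        """Pages holding stale data that a background cleaner could erase."""
        return len(self.log) - len(self.mapping)

store = LogStructuredStore()
store.write(7, b"v1")
store.write(7, b"v2")       # same logical block, new physical page
store.write(8, b"other")
```

Reading block 7 returns `b"v2"`, and `garbage()` reports one stale page that can be erased in the background without the writer ever waiting on an erase.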
That's to say that solid state storage is not really just persistent ram where you can apply the same datastructures and procedures you would use in core.
Future technologies might be different and trim down these differences so that you could really have huge amount of super fast word addressable random memory access, where nothing can be lost even in case of power failure. But, even assuming that this might be possible, it's possible that later there will be yet another technology with even better storage density/ cost but which will introduce back some of the complications. So, it's understandable why you want to keep some abstractions, so that you can easily replace the underlying technologies and get most of the benefits.
The ones that had been shipping for a few years. Like many things they all start life the least stable, go through several firmware revisions, maybe a couple of hardware revisions, and then everyone hits about the same level of reliability if they don't change them further.
Yup, the last time I got bitten was with "brand new fresh from the factory new model" 3 TB drives. The crappy firmware failed badly under load and I nearly lost 60 TB. Recent firmwares pose no problem, though.
OTOH 1 and 2 TB from HGST were absolutely rock-solid from the start. Only lost a handful of them among several thousand installed. By contrast in the past years I've had a steady 50% failure rate on Barracuda ES2, the shittiest drive on earth since the legendary 9 GB Micropolis.
Not original poster, but I favor WD over Seagate because lately all the Seagate drives we have got (about 20) failed within a couple of months. I have had 2 WD failures this year (about 100 drives in various RAID configs).
Whatever brand ships with HP netbooks is basically a failure waiting to happen (we bought 100 HP netbooks with 3G from Verizon - they are very sick of us calling).
Everyone has their own horror stories about drive manufacturers. I personally had a bunch of WD drives fail a while ago, so I've been hesitant to buy WD. Now, this was 10-ish years ago, so I know reliability has improved, but that doesn't change my hesitation.
It used to be that Maxtor (when they were around and independent) had the most reliable drives. Now, it's who knows...
Maxtor had horrible reputation (by far the worst) among most of my friends and local forums.
Thing is, all drive manufacturers have bad batches and you might end up with 8 dead out of 10 drives from any manufacturer at any given time.
The only thing you can do is try to get drives from different batches and thus avoid having all your eggs in one basket (what good is raid6 if all drives die?).
Also when a lot of drives die you should look for other culprits as well, is the temperature OK? Is the power stable? (Vibrations?)
There aren't many statistics available (that I know of) but a French store published their return rate for drives (which of course doesn't count drives returned directly to the manufacturer)
But all in all, if you have many drives failing you've hit a bad batch or are treating your drives badly, nothing you can do about it (the former at least).
> Also when a lot of drives die you should look for other culprits as well, is the temperature OK? Is the power stable? (Vibrations?)
Also remember the demo from a Sun guy that showed how screaming at your RAID array had a significant impact on IO in real time (it was actually a dtrace demo): https://www.youtube.com/watch?v=tDacjrSCeq4
Someone (in the storage industry) also told me of some crap RAID enclosure that had normal performance until you added a drive in the center slot; then some bad vibration resonance kicked in and the performance dropped terribly :)
Right... that's my point. I had a bad experience with WD once upon a time (multiple drives, all different batches). And when I had to build a ~1TB RAID in 2001, all of my research pointed to Maxtor. Side note: That actually turned out to be a good choice. The drives lasted 5+ years and didn't die until a fan gave out in the drive cage. Poor little guys got too hot and seized up. After all of that, I was still able to pull 99.5% of the data off of them.
What I was trying to get at is that you can't look at any specific case and make blanket statements about the quality of drives. Ask 5 people what are the best and worst drives and you'll get 5 different answers. We all have horror stories. The only way to know for sure is to have actual population-level data on reliability. Unfortunately, that data is somewhat hard to come by, so we all just rely on anecdotal stories.
It also depends on if we are talking consumer drives, enterprise drives, etc... I doubt that the return rate for drives would actually cover serious data errors which occurred after the return timeframe.
Honestly, the only people who could offer any insight are the large companies with huge datacenters: Amazon, Google, Yahoo, Rackspace, etc... and I doubt you'll hear them talking. If you could, I suspect the answer would be that it really doesn't matter which manufacturer you choose. All of them fail. All of them have bad batches. The best that you can do is try to maximize the MTBF and try to gracefully replace failed drives as soon as possible.
> Honestly, the only people who could offer any insight are the large companies with huge datacenters
Yup, this gives you good insights into the 'current' crop. We've got about 15,000 drives in our Santa Clara data center at Blekko (mostly enterprise SATA, 2TB (WD), but some 1TB (Seagate) too). In large populations like that I prefer to keep one family, which I know folks decry as a monoculture, but that keeps the failure rate more consistent across all drives, which helps manage replacing them.
I have an interest in this and am trying to build up a disk survey system to try and learn about this in the open, it's still not up yet but you can subscribe for an announcement and read some thoughts at http://disksurvey.org/
I wish I could post the video from the talk, but here are some slides at least. Good supporting talk on the pitfalls of making assumptions about how disks work, from OSCON '11:
I had fun once diagnosing a bad bit in a drive's cache memory. Writes would go out, and sometimes come back with a bit set. All the disk-resident CRCs in the world won't help if your data is mangled before it makes it to the media.
File systems with end-to-end checking are good. (These can turn into accidental memory tests for your host, too, and with a large enough population you'll see interesting failures).
At the storage level there is T10 DIF that is supposed to help with such things (among many other things), though it is used in a very limited fashion and is only supported on the higher end disks.
The article glosses over the fact that fsync() itself has major issues. For example, on ext3, if you call fsync() on a single file, all file system cached data is written to disk, leading to an extreme slowdown. This led to a Firefox bug, when the sqlite db used for the awesome bar and for bookmarks called fsync() and slowed everything down.
fsync in Mac OS X: Since in Mac OS X the fsync command does not make the guarantee that bytes are written, SQLite sends a F_FULLFSYNC request to the kernel to ensure that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches. Without this, there is a significantly large window of time within which data will reside in volatile memory, and in the event of system failure you risk data corruption.
So in summary, I believe that the comments in the MySQL news posting are slightly confused. On MacOS X fsync() behaves the same as it does on all Unices. That's not good enough if you really care about data integrity and so we also provide the F_FULLFSYNC fcntl. As far as I know, MacOS X is the only OS to provide this feature for apps that need to truly guarantee their data is on disk.
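For reference, a hedged sketch of what asking for F_FULLFSYNC looks like from an application. Python's `fcntl` module only defines `F_FULLFSYNC` on macOS, so the sketch falls back to plain `fsync` elsewhere and when the request is rejected; the helper name `full_sync` and the file name are made up.

```python
import fcntl
import os
import tempfile

def full_sync(fd: int) -> str:
    """Flush fd as far down as the platform allows; return which call was used."""
    full = getattr(fcntl, "F_FULLFSYNC", None)   # only defined on macOS
    if full is not None:
        try:
            fcntl.fcntl(fd, full)                # ask the drive to flush its own cache too
            return "F_FULLFSYNC"
        except OSError:
            pass                                 # not supported here; fall through to fsync
    os.fsync(fd)
    return "fsync"

path = os.path.join(tempfile.mkdtemp(), "journal.db")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    os.write(fd, b"commit record")
    used = full_sync(fd)
finally:
    os.close(fd)
```

Whether the fallback `fsync` actually reaches the platter depends on the OS, filesystem and drive, which is the whole point of the discussion above.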