Big leap for hard drive capacities: 32 TB HAMR drives due soon, 40 TB on horizon (anandtech.com)
220 points by walterbell on June 10, 2023 | hide | past | favorite | 144 comments


This is cool, but I really need to hear more about how I can effectively use these kinds of drives in storage pools and backup applications without issue.

I didn’t have time to dig into it back when the first SMR disaster hit: drives got snuck into product lines without warning and we had a small outrage event.

I’m happy for new drive tech, but I’ve not seen much about the pros and cons of HAMR in real-world use yet. That could be because there’s no difference from traditional Perpendicular Magnetic Recording (PMR) hard drives, but then again it could have its own subtle set of trade-offs, entirely different from the ones Shingled Magnetic Recording (SMR) drives have.


From what I’ve heard, these are primarily sold to hyperscalers like the public clouds, Dropbox, etc.

At their scale it’s easy to utilise these efficiently despite the limitations.

They’re not very good in a small NAS or disk array. You’d need a big write cache in front of them and an OS that natively understands the sector groupings.


This is why I’m looking for more info. The article talks about how they are already shipping these to hyperscale customers, and they will eventually end up in individual customers’ hands… so it would be good to know what file systems and cache setups are needed in order to benefit from this.

I used to just rely on FreeNAS, but it’s not as straightforward anymore. I’m having to consider Linux and weigh bcachefs, ZFS, and BTRFS, and how all this compares on an all-PCIe (M.2) flash setup. A drive (or two) at this sort of size (~50TB) would make a great second backup copy of the flash array, one that can be started periodically to make the backups and then stopped to save power. But then you have to think about copy efficiency, since I don’t want them wasting read bandwidth from the flash drives, and the most efficient path is usually a like-to-like file system copy (ZFS -> ZFS, BTRFS -> BTRFS)… so it adds another complication into a mix that is already far from simple.

So it’s become something I’m keeping an eye out for… hopefully by the time it becomes something I might purchase, someone will have already done a good writeup.


> … so it would be good to know what file systems and cache setups are needed in order to benefit from this.

I’ve worked on these systems. You might as well be asking F1 drivers for tips to help you commute to work.

The hyperscale stuff is not built on top of ordinary filesystems. It’s all clusters of machines, and error correction is handled at the level of clusters. If you’re evaluating systems like ZFS and BTRFS, then you’re already working with a radically different tech stack.

At scale, your file metadata is stored in a distributed database of some kind, and the file contents are stored with forward error correction across multiple machines in a cluster. Or something similar, but stored across multiple clusters. The pricing models for cloud storage are designed around the usage patterns—like, if you know that some data is going to stick around for 90 days, then SMR is a win. If a file could get deleted at any time, then SMR is a loss.
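
To make that concrete, here is a toy, single-parity version of the idea in Python - XOR parity across hypothetical "machines", the same trick RAID-5 uses, not what any particular cloud actually runs:

```python
from functools import reduce

def make_parity(shards):
    """XOR all equal-length data shards together (RAID-5 style single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*shards))

def recover_missing(surviving, parity):
    """Rebuild the one missing shard from the survivors plus the parity shard."""
    return make_parity(surviving + [parity])

# Spread a blob across three "machines", with a fourth holding parity.
shards = [b"aaaa", b"bbbb", b"cccc"]
parity = make_parity(shards)

# "Machine" 1 dies; its shard is rebuilt from the other two plus parity.
rebuilt = recover_missing([shards[0], shards[2]], parity)
print(rebuilt)  # b'bbbb'
```

Real systems use Reed-Solomon-style codes that survive multiple losses, but the shape of the trade - extra shards bought back as fault tolerance - is the same.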


> You might as well be asking F1 drivers for tips to help you commute to work.

Boston drivers: "hold my beer."


The simple answer is that the software stack necessary for consumers to effectively use these storage devices does not yet exist. Hyperscalers are pretty much defined as the ones who are big enough to do their own software stack. For the rest of us, we have some good components to work with but also some major gaps that won't be filled anytime soon.

ZFS is a great improvement over traditional hardware RAID systems, but is still in many ways clearly a descendant of them. BTRFS has a slightly different mix of features from ZFS that make it a bit more flexible and a better choice for consumers who don't buy drives by the dozen. Neither has a great solution for caching/tiering with SSDs and hard drives. Ceph has a lot of features that would otherwise be almost exclusive to the hyperscalers, but is too complicated for something like a turnkey NAS. bcachefs aspires to eventually have most of the features you would want for a non-clustered storage system.

Zoned storage is something many of the above already have some degree of support for, but as a paradigm it has not even started showing up in the consumer computing ecosystem so many of the issues with adopting zoned storage for consumer systems aren't even being worked on.


Y'all keep saying "these" but SMR and HAMR are unrelated.


In theory, yes, but the bottom of the article explains that Seagate currently plans for their upcoming 24TB drive to be the last new non-SMR drive and all larger drives (28+ TB) are planned to be SMR. So in practice SMR and HAMR will be going hand in hand, at least from this vendor.


That's not my interpretation but we'll see.


> "So, we have a 24TB coming out soon, next few months, you will see it," said Romano. "That is the last PMR product. So I would say [higher] capacity point above 24TB PMR, that is probably 28TB SMR."

Sounds like the only uncertainty is in what the next capacity past 24TB will be, but it'll definitely be SMR rather than PMR.


I agree this part is a bit ambiguous, but it says somewhere before this that the next drives to be released will be 22TB and 28TB SMR drives. So I read this comment as "after this last 24TB PMR drive there will be the 28TB SMR drive, before we see the 32TB+ HAMR drives going forward" - so to me he was talking purely about upcoming sizes rather than technology connections, and SMR seems to be on its way out too!?


ZFS has the concept of a separate intent log device though, I wonder if that would be enough to make these types of disks work with it? I have no problem putting a couple of terabytes of flash in front of a ZFS array if it means I can have 40TB of redundant storage in 3 disks.


The ZFS intent log is an example of a supremely disappointing, underwhelming way to integrate SSD storage into your system. Fundamentally, it's just a workaround for the fact that most of the write caches preexisting in the storage stack are volatile caches and thus not safe. Putting the ZIL on an SSD allows you to get the safety without sacrificing the performance benefits of the write caching (performance that everyone has come to rely on). But the ZIL doesn't help with read performance, and the data you write to it is never even read unless you have a crash or power failure. And then for the read caching, ZFS has L2ARC as an entirely separate feature. So ZFS does technically have ways to take advantage of SSDs to improve a hard drive storage array—but I don't think anyone would consider it to be an ideal solution, more of a pragmatic minimum viable feature.
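
For reference, attaching both device types to a pool is a one-liner each (device names here are hypothetical; this is standard zpool syntax, not a tuning recommendation):

```shell
# Mirrored SSD pair as a dedicated intent log (SLOG); only sync writes land here
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Single SSD as L2ARC read cache; no redundancy needed, it only holds copies
zpool add tank cache /dev/nvme2n1
```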


Big disks present a problem in the cloud provider environment: users typically want to partition them into smaller logical disks... but this means that larger disks must also have lower latency / higher bandwidth to service multiple users at the same time.

Most likely, cloud vendors have a large choice of sizes to try to match requested sizes of logical disks to physical disks, but if most users don't want / need disks that are this big (or bigger), then the provider will have a problem.


Big disks have their place, both as "large bulk storage" sold to cloud users (most cloud systems use SSDs as the "main/OS" drive now anyway) and as near-line backup storage.


I am not entirely sure that HAMR will be too different from CMR/PMR drives. The premise of HAMR is that they heat the area under the head, expanding it, for reads and writes, and otherwise they are essentially CMR devices. This isn't like SMR, which had huge implications for drivers (due to the overlap in the bit lines). There must be a significant durability cost to HAMR, but I don't think the difference in drivers will be nearly as big for HAMR as it was for SMR.


The nanoscale heating manipulates the magnetic coercivity of the platter coating to allow the magnetic polarization to be changed. It seems somewhat analogous to modern-day magneto-optical (MO) recording, but at a much finer scale, and with readout via a magnetic signal rather than the optical readout MO uses.


I'm guessing they will probably use a little more power, but otherwise be similar to CMR/PMR in performance. That heated head element seems like a massive point of failure, though, and I have no idea how durable the technology is.


The heating is done by a solid state laser diode. It should be very reliable other than adding one more thing to an already complex drive head. (Which undoubtedly means the first generation or two will have some kinks to work out, but that's not a fundamental limitation, just engineering.)

And there shouldn't be much of an impact on durability. The heating is very brief - a nanosecond or so. The idea is not to expand the material physically, it's to reduce its magnetic coercivity temporarily so that you can write the bit to it and then it becomes stable again.


I hadn't read all the details behind HAMR before. Fancy tech.

https://en.wikipedia.org/wiki/Heat-assisted_magnetic_recordi...


I remember reading about HAMR a few years ago

Why do memory and other hardware technologies always seem to magically come to fruition? At least compared to software.

Is it because they're more conservative in their announcements? More certainty in a path forward? More $$$?


Might be a selection bias effect? A lot of hardware stuff has crashed and burned. Most famously, fusion has been 20 years away for 60+ years, and there's always new rechargeable battery tech that's going to replace lithium-ion in a few years. For computer hardware, MRAM, FeRAM, and Optane/3D XPoint memory have all been pretty disappointing.


My favorite is "bubble memory". Everybody was convinced we were going to store all our precious data on these itty-bitty little bubbles. It was all over the trade publications, very hype, such bling, much buzz:

https://en.wikipedia.org/wiki/Bubble_memory

And then it just basically turned out to be a flop.


Looks like it actually was kind of successful for a while, though. It just got outcompeted by other methods.


Most software innovations seem to be made to make the developer's life easier instead of creating more performant software. I'm not complaining, but user-focused software innovations like LLMs do seem to be few and far between compared to hardware.


I wouldn't consider LLMs a software innovation.


I feel like if you're gonna drop a take like that you should have at least one more sentence of justification to follow it up with.


I’m not sure I agree, but I’ll do my best to argue the position.

The neural net software algorithms have been around for decades. What made LLMs feasible are the hardware advances to achieve unprecedented scale, just barely providing the ability (at great cost) to train today’s LLMs. The transformer architecture might be called a software innovation, but RWKV Raven gets similar performance to transformers and is built on decades-old RNN technology. So it is the hardware that was far more instrumental than the software in achieving LLMs.

Counter to that argument: had Google not done neural net research for Google Translate and proved that the transformer approach scaled and performed well in their “Attention Is All You Need” paper, people wouldn’t have spent the money to train foundation LLMs and we would not be having this discussion, so the software really mattered more than the hardware.

In reality I think it’s a little bit of both.


That's not necessarily true. The algorithm (transformers) was published by Google in 2017.

Even though Google missed the opportunity, GPT-1 was released in 2018 by OpenAI.

Then they incrementally added parameters in the next versions.


Great take. It is really a mix of several factors, each one leveraging the other, and your arguments are great.


I don’t think I totally agree either but your argument for the position definitely has some merit.


I kind of admire the gumption, personally. Like writing them off as fancy Markov bots.


In storage there are standards and abstractions that commodify the product. There are lots of vendors and lots of projects at said vendors that try to deliver improved parts and they can count on customer demand & competitive advantage if they succeed.

Most software (and hardware) is not like this.


I worked in this area of magnetics, and it’s been in research for at least 15 years. Companies had made devices by about 10 years ago but the storage capacities were low so they weren’t that useful for consumer applications at that point. The hard problem with HAMR is dissipating the heat.


I'm not sure what you mean by magically. For every tech that makes it to production there are one or two others that don't and people just forget about them. For example, storage roadmaps used to show bit-patterned media. Phase-change memory would have been nice to have too.


I recall reading about holographic storage[1] back in the 90s, was supposed to become a big thing "real soon".

[1]: https://en.wikipedia.org/wiki/Holographic_data_storage


It's not dead yet[1] but, yes, it's never quite caught on the way so many of us thought it would back in the 1980s and 90s. The 1987 movie "Innerspace" even referred to "photon echo memory chips." Friends of mine worked for years at one of the companies mentioned in the Wiki page but came up just a little short.

FWIW, gallium arsenide was also supposed to take over from silicon for CPUs once speeds went above 25 MHz (or so). A senior colleague of mine said, "There are a lot of smart people whose kids' college tuition depends on making silicon a little better every year." And so, it came to pass. GaAs is definitely important but we've got multi-GHz silicon CPUs now.

[1] https://www.microsoft.com/en-us/research/project/hsd/ "Project HSD: Holographic Storage Device for the Cloud"


There are whitepapers and articles every few years on holographic storage; it's one of those perpetually just-around-the-corner things.

See also memristors, that HP in particular have been saying is coming in the next couple of years, fundamentally changing the entire computing industry, for decades.


If we extend that to medical, the graveyard is bananas big.


It seems like it was doubtful this would come to fruition, at least in 2013. There are many promising technologies that never become economically viable. But necessity is the mother of invention.


I wouldn't be surprised if they already have all the tech developed for 256TB hard drives, but they're deliberately announcing it slowly over the years to buy time for more research - lest they announce 256TB too fast, shareholders go "why don't we have 1PB already, damnit" a quarter later, and short the stock to oblivion.


This happens all the time in industries close to the "asymptote of perfection" for whatever product or service they're producing. Think of products like ICEs, industrial processes or mature/standardised commodities

At what point does an influential government say "it's unlikely there is substantial innovation to be uncovered for ___ product" and begin the process of deprivatisation for the good of the public?


Lots of things don't work until they do, and then iteration can be rapid. In addition, solutions for the same problem tend to have drastically different limitations and bottlenecks. Any new breakthrough can move the boundaries forward in leaps.


It’s the new Moore’s law. Instead of processing power doubling, it’s storage.


Kryder's law is the storage equivalent of Moore's law. It's also coming to an end.


As my NAS expands, I get more and more cold data. Meanwhile, large SSDs are getting cheap-ish (even TLC variants).

Are there any good options right now for multi-tiered storage for the home lab?

LVM has writeback cache as an option, but I couldn't quite figure out how reliable that is, found some old posts with disturbing issues but not a lot of recent talk. Would also need to run ZFS on top of it for the ZFS features I rely on like snapshots and such, so feels like a Jenga tower solution.

I know Ceph has this option, but Ceph performance is abysmal for small installations from what I could benchmark.

Bcachefs looks like it'll be a winner, but it's still WIP so won't trust it with my precious data quite yet I think.


The idea that you want tiering... at home... just seems so unlikely to save you any serious money or gain you significant performance.

"tiering" in the traditional sense isn't even popular in the DC with things like Vast, Pure, etc all doing well.

Is there any reason you don't know the data set that is generally cold? The vast majority of home users basically have data that is trivially super cold that needs reasonable read performance - this is just HDDs. Then have a second mount point with things that you need greater performance, and have that be just SSDs.


It's not about saving money, more about convenience and noise.

With large SSDs it seems alluring that one could have the majority of hot data on SSDs, essentially making the cold data even colder.

As you mention I could split my data in two, but that's a chore. If I need new data on the hot pool, I might have to first decide what got cold enough to move to the cold pool if there's not enough room left on the hot pool. That means I have to wait for that before I can save the new stuff on the hot pool. Searching for stuff means searching both places etc.

I forgot I stumbled over AutoTier[1] which could work, but it seems abandoned so again, not ideal for precious data. And it's FUSE based, so performance is likely not the best.

[1]: https://github.com/45Drives/autotier


Having frequently used file metadata on SSDs while the cold parts are on HDDs would be noticeable even at home; the moment you start browsing the directory tree.


ZFS on Linux has exactly this, you can specify specific devices to hold metadata. Caveats apply, you need to mirror those devices for redundancy.

https://openzfs.github.io/openzfs-docs/man/7/zpoolconcepts.7...
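
Sketch of the commands involved, with hypothetical device names (the special vdev must be mirrored, since losing it loses the pool):

```shell
# Mirrored SSDs as a 'special' allocation class; pool metadata is stored here
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally route small file blocks (here <= 64K) to the special vdev too
zfs set special_small_blocks=64K tank
```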


I love my LTO-5 tape drive. 1.5TB on a tape, encryption, they have a write protect notch, and I add parity files to each tape to combat bitrot. The price per TB is pretty low. I keep a backup of important stuff off-site. As datacenters upgrade, I'll gladly take their hand-me-downs, when the price is right.


Why? Hard drives are as cheap as tape now: $10/TB. Tapes are sequential access media, wear out, and malfunction, whereas hard drives are random access media. I know all about StorageTek silos and 4U tape robots but still don't bother with them.


Tape is $5 per TB at retail and you can safely store data on it for at least 10 years, after which you may need to make a new copy not due to the risk of aging, but due to the risk of no longer finding new tape drives compatible with the tape format.

Your price of $10/TB might apply when HDDs are purchased in bulk, because at retail I see prices between $15/TB and $20/TB.

No HDD may be trusted to store data for more than 5 years and this duration is valid only for the more expensive models.

Assuming your price of $10/TB, one must buy at least double the HDD capacity compared to tape capacity, due to the short lifetime, so tape is at least 4 times cheaper. At the prices that I see at retail, the difference is even greater.

The sequential transfer speed of tape is greater than that of HDDs, so archiving or retrieving many GB of data takes less time with tapes.

HDDs wear out and malfunction more frequently than tapes.

The only real disadvantage of tapes is the high cost of the tape drives, which makes tapes preferable only when more than 100 or 200 TB of data have to be stored.


> The only real disadvantage of tapes is the high cost of the tape drives

The real disadvantage of tapes is that the drive may die at any time, and if you don't have another drive then you can't recover. And a replacement drive won't be cheap (new) or reliable (used).


The drives are relatively cheap. I picked up an LTO-5 drive for about $150. I don't care that it's used. At that price I can buy 3 of them so that I have spares. And the next gen tape drive will always read the previous gen tapes. So it's an easy upgrade path. I'll go to LTO-6 when those drives come down a little in price on the used market. Then LTO-7 if I'm even alive long enough to need that much storage. Realistically, I have about 30-40 years left in me and after that I don't really care what happens to my backups. They're encrypted, with parity files, so it's not unreasonable to think that these backups will last me the rest of my life.


While this is true, HDDs are much worse, because the probability that a HDD will die at any time and without warning is many times higher than for any tape drive.

When a HDDs dies, that is far worse than when a tape drive dies, because you not only lose some money, but you also lose data, which may be priceless, unless you have been careful to have backup copies.

While there are some companies that offer data recovery services from defective HDDs, for recent HDD models such services can be very expensive, comparable with the cost of a tape drive and much more expensive than the service of copying a good tape on a HDD, which can be done when it is not possible to buy a replacement drive immediately and the data is needed urgently.


> HDD will die at any time and without warnings is many times higher than for any tape drive

It's about the same most of the time, and an LTO drive dies if there's any amount of dust, while an HDD doesn't give a fuck.

> but you also lose data, which may be priceless

No difference if you sat on the tape holding your only copy of your favourite porn.

And if you only have one copy of data then it doesn't really matter on which media it resides.

And oh, you CAN write the same data to two HDDs simultaneously with mirroring, or just have two copy jobs to two separate hard drives, which would give you not only physical separation but logical as well. For you to do the same with tape - shell out another $3000.


Where do you get the idea that HDDs only retain data for 5 years? The physics I learned in college suggests they will retain magnetic domains for hundreds or even thousands of years.

Yes they can fail mechanically (is this what you mean?) but you don’t necessarily lose your data.


They will fail mostly mechanically, but also the very small magnetized bits of modern HDDs will flip spontaneously at normal temperature after a much shorter time than "hundreds or even thousands of years" (because the energy needed to flip a small bit is not large enough in comparison to the thermal fluctuations to make the flipping probability negligible).

Many of these bit flips, but not all, will be corrected when the sectors are read, due to the error-correcting codes that are used in HDDs.

This is not theory; I have stored data for several years on more than 60 HDDs of various capacities from both WD and Seagate, most of them being the more expensive models with extended warranty durations, but even so, only a few of the HDDs did not have any non-correctable errors after several years. (Fortunately I was careful to use redundancy, so there was no data loss.)

Moreover, some of the biggest HDDs that are available now are no longer suitable for long term data storage, because in order to improve the performance they store metadata in a flash memory, which has a more limited data retention time.

After more than 5 years the complete loss of a HDD should be expected at any time, but even after 2 or 3 years a few non-correctable errors are probable.

When a HDD fails mechanically, one might pay a data recovery service, but that might have a price similar to a new HDD, so if you plan to not replace your HDDs often enough with the hope of using data recovery, it is pretty much certain that the cost will be much higher than replacing any HDD preemptively when its warranty expires.


Have you truly observed this type of degradation?

I do archive work and have 20+ discs from the 2010 era. Mostly the first generation of PMR drives. I have never had any data degradation problems.

You can also find lots of YouTube videos of people spinning up drives from the 80s and 90s which still hold their data without problem.

More scientifically, the phenomenon you talk about is modeled by the Arrhenius equation (1), where the activation energy to flip a grain is given by KuV/KbT, where Ku is the anisotropy of the magnetic media, V is the volume of a grain, Kb is the Boltzmann constant, and T is temp in Kelvin.

HDD manufacturers engineer this ratio to be >60 (usually targeting 70-90 to be safe). Media manufacturing is imperfect, so there is a log normal distribution of grains on real-world media, but if we assume that 60 is the energy barrier for all grains, a KuV/KbT of 60 would mean it takes 362 million years for half the grains to flip, assuming an attempt frequency of 10^10.

Where is my math wrong?

(1) https://en.m.wikipedia.org/wiki/Arrhenius_equation
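
That figure reproduces if "time for half the grains to flip" is read as the mean lifetime 1/rate (a strict half-life would add a factor of ln 2, giving ~250 million years). A quick check:

```python
import math

attempt_frequency = 1e10   # Hz, the attempt frequency assumed above
barrier = 60               # the KuV/KbT energy-barrier ratio
seconds_per_year = 3.156e7

# Arrhenius / Neel relaxation: mean grain lifetime = exp(barrier) / f0
lifetime_years = math.exp(barrier) / attempt_frequency / seconds_per_year
print(f"{lifetime_years:.3g} years")  # ~3.62e+08, i.e. about 362 million years
```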


Your math is probably right, but a modern HDD has on the order of 10^14 data bits.

Assuming that your computed time is right, that means there is a 50% probability that a bit of the HDD will flip in far less than a week.

Most such bit errors will be corrected when a sector is read and the controller will rewrite a bad sector with a valid value, so the bit errors will not be cumulative in normal usage.

However, when the data is stored for years without powering up the HDD, the bit flips will accumulate and they may pass the threshold needed to cause a non-correctable error.

While I do not remember to have ever seen non-correctable errors on the HDDs that I have been using daily, on identical HDDs that have been stored for years without being powered up I have frequently seen both cases when the drive reported non-correctable errors and cases when the drive reported no error but the file hashes used for error detection identified corrupted files.

The older HDDs with low data capacities had much longer lifetimes, but also the perception of those claiming that data has been stored OK on them may be wrong if they have not used any means to detect the corrupted files, because even if the HDD reports no errors, that is not good enough.
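
A minimal sketch of the hash-manifest approach mentioned above (layout and names are made up; the point is just hash-before-shelving, re-hash-after):

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 so huge archives don't have to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def make_manifest(root):
    """Record a hash of every file before the drive goes on the shelf."""
    root = Path(root)
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def find_corrupted(root, manifest):
    """Years later: re-hash and report files that no longer match."""
    return [name for name, digest in manifest.items()
            if sha256_of(Path(root) / name) != digest]
```

Even when the drive itself reports no errors, comparing against a manifest like this catches the silently corrupted files described above.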


One point of clarification: one bit on a classic PMR drive contains hundreds of magnetic grains. It is the grains that flip, not the bit. It would take many grain flips to affect the bit. Errors of this sort do not manifest as flipped bits per se—they manifest as a degraded signal, which the drive may or may not be able to translate to the correct bit sequence correctly. Also, the nature of ECC is (usually) that you get the correct sequence or an error. It would be unusual to get an incorrect sequence unless that is happening somewhere off-drive.

If you have a stored drive that is reporting errors, my starting assumption would be that something else is causing problems besides the platter—maybe the heads have gotten a bit of corrosion from humidity.

Still disagree?


Because the HDD manufacturers avoid providing the information that would be necessary to estimate with any degree of certainty the data retention time of HDDs, we cannot know for sure the causes of HDD errors during long-term storage, so we can only speculate about them.

Nevertheless, the experimental facts, both from my experience during many years with many HDDs and from the reports that I have read are:

1. Immediately after the warranty of a HDD expires, the probability of mechanical failure increases a lot. I have seen several cases of HDD failures a few months after the warranty expiration, while I have never seen a failure before that (on drives that had passed the initial acceptance tests after purchase; some drives have failed the initial tests and have been replaced by the vendor).

Therefore one should never plan to store data on HDDs beyond their warranty expiration.

2. When data is stored on HDDs that are powered down for several years, one should expect a few errors (I have seen e.g. about one error per 2 to 8 TB of data), which cause either non-correctable errors or wrong corrections that corrupt the data.

The effect of such errors can be easily mitigated by storing 2 copies of each data file on 2 different HDDs.

An alternative is to introduce a controlled data redundancy, e.g. of 5% or 10%, with a program like "par2create".

That works fine against wrongly corrected sectors, but when a non-correctable error is reported, many file copy programs fail to copy any good sector following a bad sector, so one may need to write a custom script that will seek through the corrupt file and copy the good sectors, in order to get enough data from which the original file can be reconstructed.

Storing everything on 2 HDDs, preferably of different models, is the safest method, as it also guards against the case when one HDD is completely lost due to a mechanical defect.


Where do you buy $5/TB tape at retail?


Amazon, either USA or Germany, 12 TB (real capacity) LTO-8 cartridges for $60 apiece.


Right but from what I can see a LTO-8 drive ain't cheap, even used.


Many years ago, when LTO-7 was new and state-of-the-art, I bought a new tabletop tape drive (Quantum) for $3000, which is less than an Apple HMD, but much more useful for me.

After writing several hundred TB, I have achieved a decent money saving in comparison with using HDDs.

On the other hand, when someone needs to store only 100 TB or less, there is no chance to recover the cost of the tape drive, so tapes are inappropriate for such a case.


LTO-5 worked great for my use case and I find new/old-stock LTO5 tapes for about $5/TB on ebay. I verify every tape after writing, and I include about 25% to 30% parity data on each tape to combat bit rot. I don't really need to write more than 1TB at a time of data, and the rest of the room on the tape is parity data (PAR files).

The drive was super cheap, $150 used. I don't expect it to last forever, I plan to buy another tape drive as a backup and eventually upgrade to an LTO-6 drive.

After making 2 copies of all my important data, about 30TB on LTO-5 tape, I don't have that much to back up, maybe 2TB a month, but it's easy to justify buying a few more tapes every now and then. Buying 2 hard drives for redundancy is just not anywhere near as cheap as buying 2 LTO tapes for the same amount of data, even for people with less than 100TB of data.


    After writing several hundred TB
Sheesh, what are you writing!?


With scanned books and Blu-ray movies, the TBs add up very quickly when you do not want to degrade their quality with additional compression, which is my case: after digitizing or ripping them I do not keep the originals, due to not having enough space for them, so I want the retained copies to be of the highest possible quality, which leads to multi-GB file sizes.


It’s like saying “hard drive platters are really dense storage and super cheap” and ignoring the cost of the actual drive itself


I have more storage than I know what to do with. mdadm + LVM + XFS is reliable and works.


> As my NAS expands

How are you expanding capacity? With traditional software raid I'd need to fail every drive and replace them one by one with identical higher capacity drives which requires several rebuilds from parity which means a long time and massive I/O loads and high risk of unrecoverable read errors just destroying the whole array in the process. It's easier to make a new array and move data over...


Yes this is also a thing.

I used to do what you're doing. Last 6-7 years I've been running mirrored setup so I could expand by just adding a new vdev. Rebuilt it twice when I upgraded to significantly larger disks.


What are you doing now? Surely there are better ways...


Researching my options...

As I mentioned ZFS on top of LVM seems interesting in terms of features and flexibility. But reliability-wise... I'd like to run a more proper test setup to gain some experience before I'd trust it.


I think it's just ZFS that has that "limitation"; LVM and Btrfs just let you mix drives of arbitrary sizes.


Everything raid has that limitation, including LVM, mdadm.

https://raid.wiki.kernel.org/index.php/Growing#Expanding_exi...

Btrfs has a more flexible block allocator but its parity implementation is still unreliable.


ZFS will also let you mix drives of different sizes. And the pool would grow as soon as you remove the smallest one.



Why would you use LVM with ZFS? ZFS should handle all of the features LVM has (as far as I can think of, at least).


The idea would be to use LVM to get writeback caching on the block-device level, then use those cached LVM devices as backing devices for a ZFS pool.

I've tested this on a small scale prototype level (one HDD and one NVME SSD) and there it worked quite well. But like I said it feels a bit fragile, lots of moving parts.
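For reference, a minimal sketch of that layering, assuming an LVM2 version with writecache support (2.03+); the device and volume names are hypothetical, and this is a setup sketch, not a vetted production recipe:

```shell
# Assumed devices (hypothetical names): /dev/sda = HDD, /dev/nvme0n1 = NVMe.
# One VG spanning both, so the cache and data LVs can live together.
vgcreate vg0 /dev/sda /dev/nvme0n1

# Main LV pinned to the HDD, cache LV pinned to the NVMe.
lvcreate -n data -l 100%PVS vg0 /dev/sda
lvcreate -n datacache -L 100G vg0 /dev/nvme0n1

# Attach the NVMe LV as a writeback cache in front of the HDD LV.
lvconvert --type writecache --cachevol datacache vg0/data

# /dev/vg0/data can then be handed to ZFS as a backing device, e.g.:
#   zpool create tank /dev/vg0/data
```

One design caveat with this stack: ZFS loses its direct view of the physical disk (cache flush semantics now depend on the LVM layer), which is part of why it feels like a lot of moving parts.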


Is it going to be hamstrung by SATA III? It would take roughly 13 hours to fill the drive at 6 Gbps…


Only if it's a multi-actuator drive. Single-actuator drives are still barely hitting SATA II speeds, so this density boost on its own probably won't be enough to increase performance to the point where SATA III's 6Gbps is a real problem.


You are hitting on the largest problem with high-density HDDs - the IO doesn't scale with the density, so the data you put on the drive needs to be colder and colder.

The cold data thesis hasn't worked out as well in the market as many storage vendors would have liked.


There is a physical reason why I/O speed doesn't scale with density.

When the density of data on disks increases, it does so on both axes - x and y, or radial and tangential. But read/write heads only access one track at a time, so the speed-up is only along one axis.

Think about CDs, DVDs, and BDs. The discs can only be spun at a finite rate (somewhere around 10k RPM) before they shatter. The R/W head can read the data within any single track at full speed, but only one track at a time. From CDs to BDs, the amount of data per track increased (this doesn't affect the amount of time needed to read all the data from a disc), but also the number of tracks increased (this does affect the total time).
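A back-of-envelope sketch of that one-axis scaling argument; the base capacity and throughput figures are illustrative assumptions, not vendor specs:

```python
# If areal density grows by a factor k, linear density (bits per track)
# grows only by sqrt(k), so sustained throughput grows by sqrt(k) while
# capacity grows by k: the time to read the whole drive grows by sqrt(k).
import math

def full_read_hours(capacity_tb, throughput_mb_s):
    return capacity_tb * 1e12 / (throughput_mb_s * 1e6) / 3600

base_cap, base_tput = 8, 180          # assumed baseline: 8 TB at ~180 MB/s
for cap in (8, 16, 32):
    k = cap / base_cap                # areal density scaling factor
    tput = base_tput * math.sqrt(k)   # throughput scales only with sqrt(k)
    print(f"{cap} TB @ {tput:.0f} MB/s -> {full_read_hours(cap, tput):.1f} h full read")
```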


You get some data throughput increases by increasing areal density and number of platters -- the days of counting on 100-120MB/s from a spinner are now replaced by checking to see whether you get 220-280MB/s -- but you should not expect the rotating disk technology to push the SATA III standard just yet.

It will be more than 13 hours.


Adding an 18TB to a small Synology array takes days as it is. Adding a drive this size would take at least a week I think.


The biggest problem that I have with adding big drives to my NAS is the amount of prep work. badblocks takes fucking forever against bigger drives. It was like, 5 days to run three passes of badblocks against a 12 TB drive or something?


Is it even worth running something like badblocks before adding drives anymore?

I guess if you want to be absolutely sure it's "clean" but most drives do intelligent remapping internally anyway, so "add and scrub afterwards" may be the way to go.


I feel like it's less of a bother to diagnose if the drive needs to be returned immediately and not after you've already slapped it into your RAID.


Ah yes, "mo TB, mo problems", as the Notorious B.I.G would lament.


SSD prices at last seem to have an effect on HDD prices and capacities. We have had almost a decade of stagnant HDD prices and capacities, but that started to change as soon as SSD price per TB got close to HDD prices. I think we are back to a point where the price of HDD per TB drops regularly every 2-3 years as larger drives are added.


Do hard drives have any advantage over SSDs anymore beyond cost per terabyte?


You can write on them 24/7. DWPD isn't great, but TBW is almost limitless.

Also, sequential workloads don't suffer from write amplification, and throughput remains pretty high even if you are writing for hours.
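A rough comparison using assumed figures (200 MB/s sustained HDD writes, a 600 TBW endurance rating for a consumer 1TB SSD; both are illustrative, not specs for any particular model):

```python
# Sustained 24/7 sequential writes: HDD yearly volume vs. an assumed
# consumer SSD endurance rating.
SECONDS_PER_YEAR = 365 * 24 * 3600

hdd_tb_per_year = 200e6 * SECONDS_PER_YEAR / 1e12   # ~6300 TB written/year
ssd_tbw = 600                                       # assumed 1 TB consumer SSD rating

print(f"HDD sequential writes: ~{hdd_tb_per_year:.0f} TB/year")
print(f"SSD rating exhausted in ~{ssd_tbw / hdd_tb_per_year * 365:.0f} days at that rate")
```

This is the sense in which HDD TBW is "almost limitless": the mechanism wears, but the media does not have a comparable write budget.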


Essentially no, but the difference is enough that you would use HDDs in any situations where you can afford to deal with the shortcomings of HDDs.


they don't lose information after a couple of years sitting on a shelf


they also fail somewhat predictably in some situations (they start clicking before they die) but more and more I've noticed they just stop working.


This is really exciting news; there haven't really been any big leaps in HDD capacities in a long time, and +12TB or perhaps even +20TB in one generation is huge. I hope they are going to be reasonably priced, but given the competition has nothing... 1k+? I really hope not.


They will probably be competitively priced per terabyte, which will make each drive an expensive component. There will also be a small premium for the convenience of having a single drive rather than multiple drives, and an “early adopter tax” because it’s new tech.

So yes, a 32TB unit is very likely to debut at >$1k.


A 20TB Seagate is ~$350 right now, so >$1k would not be "competitively priced per terabyte". But I still agree that it's likely going to be >1k, because storage density per disk is also a factor (and greed, obviously).


Somewhat related: 1TB optical discs for $3 from Folio Photonics (possible vaporware alert): https://www.techradar.com/news/exclusive-blu-ray-successor-w...


This isn't exactly new; it goes back as far as [1] 2007 on Ars, and they reported something similar from the same startup [2]. I do wish them success though. Even a single 1TB disc would be 10 to 40 times the capacity of current Blu-ray.

It would be nice if they could do a mini cartridge version (6 Folio mini discs inside a cartridge), like a Zip drive, with 2TB for consumers. Unless you are a data hoarder, 99.9% of consumers could have their own backup done with a few of these cartridges. And in theory they should last a lot longer than HDD or even SSD.

Really wish this isn't a pipe dream.

[1] https://arstechnica.com/gadgets/2007/08/new-dvd-sized-disc-t...

[2] https://www.storagereview.com/news/folio-photonics-working-o...


> in theory they should last a lot longer than HDD or even SSD.

The etched plastic yes, but I think the reflective layer of CDs and DVDs is highly sensitive to humidity, shortening the lifespan to 10-20 years. You could probably recover it professionally, but not cheaply.


>the reflective layer of CDs and DVDs is highly sensitive to humidity, shortening the lifespan to 10-20 years.

this is actually solved for HTL BD-R Blurays as they use inorganic material for storing data. Only the outer plastic is organic there. See https://en.wikipedia.org/wiki/Blu-ray_Disc_recordable


A $3000 drive sounds surprisingly tempting if it gives me access to 1 TB for only $3 afterwards, assuming that the drives have a decent amount of life to them.


HAMR has some serious faults with its I/O heads (could rant about this forever), especially given the "reman" logic they use to qualify the durability of the drives.

I'd wait until these actually ship and we get some data on them. These end up basically being flaky new-age tape systems rather than HDDs, given SMR is required for this density.


Ok, I'll rant now.

Tape drives have multiple heads for IO, which allows some to fail along the way and you can still read/write your data. This sounds good - but it actually just means these tape drive heads are flaky. What do you do when too many fail? You call IBM or whoever and get them to replace your tape drive, or "reman" it by replacing the failed heads. This is the only way they actually achieve their warranties around lifetime read/writes.. they assume you'll fix the hardware along the way.

These HAMR drives have the same problem. The "heat assisted" just means they're using a laser to heat up a piece of gold, and sometimes this means the gold kinda drips around, and the head can be ruined. So their read/write lifetime numbers are pretty loose compared to PMR, and there is an assumption you'll "reman" these drives if they start to have failed heads for IO. However, the gold can even drip onto the platter, giving you permanent data loss anyways.

Lastly, they use SMR to get this density. SMR is not like PMR. PMR is what you think of with an HDD: many small blocks, either 512B or 4KiB, which you can read/write freely. SMR has 256MiB (or 128MiB) "zones" that you can only append to, or reset. This means instead of being able to write randomly across the drive's capacity, you need to plan out your writes across appendable regions. This complicates your GC and compaction, and reduces your total system IO. Your random read performance is still better than tape, but it basically turns your HDD solution into something that looks a lot more like tape. This technology is incredibly unpopular for this reason.
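A toy sketch of what those zone semantics force on the software layer; this is illustrative Python, not any real zoned-block-device API:

```python
# Toy model of host-managed SMR zone semantics: each zone keeps a write
# pointer, and you may only append at that pointer or reset the whole zone.
ZONE_SIZE = 256 * 1024 * 1024  # 256 MiB zones (some drives use 128 MiB)

class Zone:
    def __init__(self):
        self.write_ptr = 0  # next writable offset within the zone

    def append(self, nbytes):
        if self.write_ptr + nbytes > ZONE_SIZE:
            raise IOError("zone full: caller must pick another zone")
        offset = self.write_ptr
        self.write_ptr += nbytes
        return offset  # data always lands at the write pointer

    def reset(self):
        # The only way to reclaim space: discard the entire zone,
        # which is why GC/compaction moves up into the host software.
        self.write_ptr = 0

z = Zone()
z.append(4096)   # ok: sequential append at offset 0
z.append(4096)   # ok: continues at the write pointer
z.reset()        # whole-zone reclaim; no in-place overwrite exists
```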

The market for these things is much smaller than most HDD vendors would like. You have a few hyperscalers that have figured out how to write out backups efficiently, but they want a lot of read IO, and reducing the spindle-to-byte ratio means you have less total system bandwidth to your data.

This means that the price per byte would actually need to be lower than their PMR drives, let alone wildly better warranty agreements, for these to be better TCO than current generation drives.


These don't sound like insurmountable issues.

> The "heat assisted" just means they're using a laser to heat up a piece of gold, and sometimes this means the gold kinda drips around, and the head can be ruined.

It sounds like this tech is related to MO tech, which has been used for decades without issues (see: minidisc).

The laser is said to heat the spot on the platter to a bit over 400C, while the melting point of gold is over 1000C, so that doesn't jibe. Did you mean something else?

> Lastly, they use SMR to get this density.

Why do they have to? SMR is just a way of fitting more data on a disc by overlapping tracks. I don't know why it is necessary to use this method on this technology.


> Why do they have to?

Because they can. I saw a tech talk from a data recovery guy and he said some vendor (sadly I don't remember which one) just has no new CMR drives at all.


If we go by surface area / storage, how many atoms per bit is this?


Seeing as how 3D NAND is already using quantum tunneling for charge trap flash, it's gotta be hard to beat that, right?[^1]

[1]: https://www.youtube.com/watch?v=5f2xOxRGKqk


It's (probably?) perpendicular recording, so it's depth as well.
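A hedged back-of-envelope for the question above, assuming roughly 2 Tb/in² areal density (the ballpark quoted for HAMR), ~0.25 nm atomic spacing in the recording alloy, and a ~10 nm deep magnetic layer; all three numbers are approximations:

```python
# Rough estimate of atoms per bit at HAMR-class areal density.
IN2_TO_NM2 = (25.4e6) ** 2              # 1 inch = 25.4e6 nm

bits_per_in2 = 2e12                     # assumed ~2 Tb/in^2
area_per_bit_nm2 = IN2_TO_NM2 / bits_per_in2     # ~320 nm^2 per bit
atoms_per_layer = area_per_bit_nm2 / 0.25 ** 2   # ~5,000 atoms per atomic layer
layers = 10 / 0.25                               # ~40 layers over 10 nm depth

print(f"~{atoms_per_layer * layers:,.0f} atoms per bit")  # order 10^5
```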



Finally!?

I have a 5 TB external drive I bought many years ago, and despite its big physical footprint, the new drives still don't feel like a justifiable upgrade, given their elevated price for a small increment in storage.


0.04 x 0.59 x 0.43 inches - 1TB MicroSD (with 160/90 MB/s read/write speeds) for $110... now THAT was a shock when I was buying a new phone after several years and I was looking at expanding its storage. It's so surreal to hold such a tiny thing that can store so much data.


It's rather unreliable storage though. Don't use those for anything important and always back stuff up from your cameras or similar frequently. SD cards are easily corrupted and lack the mitigation mechanisms things like SSDs have.


Now the NSA will have enough storage to keep all that surveillance video of me watching The Munsters.


Enough for 18 surrounding cameras in 3D for 360 angles recorded in 120fps for 8k VR videos? Times 7 billion people times 24/7.


I admit I wasn't particularly careful calculating this, but something like a quadrillion petabytes per year. I guess the datacenter would also need to be quite large, depending how vertical it could go. Say, a continent per year?

Your move, NSA.
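For what it's worth, a sketch of that arithmetic with assumed raw (uncompressed 24-bit) bitrates; it lands around 10^13 PB/year, somewhat under the quadrillion figure, and compression or 360/VR stitching overhead could swing it either way:

```python
# Back-of-envelope for the surveillance scenario: 18 cameras of raw
# 8K @ 120fps per person, for 7 billion people, around the clock.
PIXELS_8K = 7680 * 4320
bytes_per_cam_s = PIXELS_8K * 3 * 120       # ~12 GB/s per camera, uncompressed
people, cams = 7e9, 18
seconds_per_year = 365 * 24 * 3600

total_bytes = bytes_per_cam_s * cams * people * seconds_per_year
print(f"~{total_bytes / 1e15:.2e} PB/year")  # on the order of 10^13 PB/year
```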


Reminder: Have backups. Imagine losing 32TB of data in one go.


Restoring backups on a 32TB drive would also take a very long time! Other comments have mentioned that it would probably take more than 13 hours to fill the drive.

On a related note, remirroring or rebuilding a 32TB RAID drive would probably take days.
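A quick sketch with assumed sustained rates: the 13-hour figure corresponds to the SATA III interface ceiling, while realistic spinner rates are far lower (and vary across the platter, outer tracks being faster):

```python
# Time to fill/restore a 32 TB drive at various assumed sustained rates.
def fill_hours(capacity_tb, mb_per_s):
    return capacity_tb * 1e12 / (mb_per_s * 1e6) / 3600

# SATA III ceiling, a fast modern spinner, and a pessimistic average
for rate in (600, 280, 150):
    print(f"32 TB at {rate} MB/s: {fill_hours(32, rate):.0f} h")
```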


Backup to cloud provider, and then have them ship you a replacement drive with the data preloaded when your local drive goes bad (Backblaze, for example, does this).

Local drives are more fast cache for data that should be stored reliably elsewhere (maybe hot, maybe cold, depending on provider and cost concerns).


It also takes forever to restore them. The issue with ever-increasing HDD capacities is that the drives don't get any faster. It takes literal days to get all the data off them, and if you try to resilver your array when a drive fails, it's more likely another will fail too. Some of these big drives have fairly low endurance ratings, so they might fail earlier than you think [1]

1. https://www.servethehome.com/discussing-low-wd-red-pro-nas-h...


Reminder: RAID is not a backup. You're going to have more drives die while you're resilvering 32TB of data.


But "RAID is not a backup" is not related to this problem. "RAID is not a backup" is a reflection of the fact that your RAID array is in one location and instantly mirrors all data copied to it. A disaster to your storage array outside of a disk failure will annihilate all your data (and this used to include "RAID controller dies").

The practical issues of RAID rebuilds also don't quite work that way - the "drives die while rebuilding" problem has a few components:

(1) beyond a certain size, the risk of a bit error while resilvering becomes a certainty due to all the data you're reading - ZFS basically fixes this problem with its checksums.

(2) If all your disks were installed at the same time, then the chance of another disk failure is high because they're likely all in the same part of the mean-time-to-failure distribution.

(3) the risk of an error is higher because you're doing a lot more activity than normal, and the additional stress increases the chance of triggering a failure.
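Point (1) is easy to quantify. Assuming the commonly quoted consumer spec of one unrecoverable read error per 1e14 bits read (an assumption; enterprise drives are typically rated 1e15 or better):

```python
# Expected unrecoverable read errors (UREs) during a full read of N TB,
# given an assumed URE rate per bit read.
def expected_ures(read_tb, ure_rate=1e-14):
    bits = read_tb * 1e12 * 8
    return bits * ure_rate

for tb in (8, 32):
    print(f"reading {tb} TB: ~{expected_ures(tb):.2f} expected UREs")
```

At 32 TB the expectation is well above one error per full read, which is why checksumming filesystems matter at this scale.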


Do you not regularly scrub your arrays to surface these issues promptly?


With 32TB, nothing would be done “promptly”. Can you imagine how long resilvering this drive would take? These aren’t feasible for RAID setups…


If you go from double to triple parity it should far more than make up for the extra resilvering time.

Though if you want specific numbers, what would you estimate as the per-drive odds of 8TB and 40TB drives dying during a resilver? Let's say they take 2 and 7 days respectively.


Maybe, but then these have to be a lot cheaper for a triple redundant setup to be cost effective.


Not really. Let's say you were doing 8+2 before (times five) with 8TB drives, and switch to 8+3 with 40TB drives. You go from 400TB of raw space to 440TB of raw space, but you also go from 50 drive slots down to 11. Even if both drive models are the same cost per terabyte, the latter setup should have a lower total cost.

But I would also expect these drives to be cheaper per TB, at least by the time HAMR is 2-3 generations old.


I think if you're in the market for 32TB drives, you're already using 20TB ones instead of 8TB. With 8+2 at 20TB you pay for 200TB and get 160TB usable. With 7+3 at 32TB you pay for 320TB and have 224TB usable. That's 160% of TB paid for and only 140% of TB usable. HAMR will need a sizable 12.5% discount per terabyte just to break even there.

If you only do 5+3 to keep the same 160T available, it's 256TB or 128% paid for and the same 100% usable, so an even steeper 22% discount to break even.
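The break-even arithmetic above can be sketched as follows (layouts and the equal-reliability premise are taken from the parent comments, not from any pricing data):

```python
# Compare raw vs. usable TB for two parity layouts and derive the
# per-TB discount at which cost per usable TB is equal.
def layout(data, parity, tb):
    return (data + parity) * tb, data * tb   # (raw TB paid for, usable TB)

base_raw, base_use = layout(8, 2, 20)        # 8+2 of 20 TB: 200 raw, 160 usable
new_raw, new_use = layout(7, 3, 32)          # 7+3 of 32 TB: 320 raw, 224 usable

# Equal cost per usable TB when:
#   new_raw * p_new / new_use == base_raw * p_base / base_use
discount = 1 - (base_raw / base_use) * (new_use / new_raw)
print(f"break-even discount per TB: {discount:.1%}")
```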


If you're going from 20TB to 32TB then it barely affects your RAID reliability at all. You don't need to increase parity in that case.

And nobody that's price conscious would make a 5+3 RAID so I'm not very worried about such an extreme scenario.


Hmm I could use a reminder of what those acronyms say. SMR is the one unusable for desktop stuff because of the sloooow writes? What is HAMR?


The one you want at home right now is CMR, or Conventional Magnetic Recording.

S is Shingled, H is Heat-Assisted, while the article mentions P for Perpendicular, which became "conventional" once nobody had room left for L, or Longitudinal.

Explainer: https://www.linuxadictos.com/en/diferencias-entre-smr-cmr-pm...

WD NAS apology tour: https://blog.westerndigital.com/wd-red-nas-drives/

Seagate's CMR/SMR list: https://www.seagate.com/products/cmr-smr-list/


Thinking "quantity has a quality all its own," I've started to wonder more about the black swans on the horizon, should you look far enough: cosmic rays bitflipping your memory, unusual collisions, ridiculous rebuild times, the LTO-machine syndicate ...

Maybe file fixity is in the future, where the concept is more foreground than before.


Just here to complain about the stingy 256gb ssds in MacBooks


The 256s are also slower than the 512s, IIRC.


Do HAMR drives have the same access pattern behaviors as normal spinning disks (eg sequential seek preferred to random)?


I suspect so; they're still an arm that moves across the disk. It just also has to heat the surface for writes. It would be neat to know if there's a warmup time before writes can start.


Finally, something dense enough to fit my entire porn collection in a single 2U server!


Since it’s heat assisted I wonder how much the drives will heat up server racks?


Incredibly tiny areas are being heated for tiny lengths of time, so it has negligible effect.

https://blog.seagate.com/craftsman-ship/hamr-next-leap-forwa...

> Power, heat, and the reliability of related systems is equally nominal. HAMR heads integrated in customer systems consume under 200mW power while writing — a tiny percentage of the total 8W power a drive uses during random write, and easily maintaining a total power consumption equivalent to standard drives.


The heating is of a tiny area, and it's 0.2 watts or so. So it won't heat up server racks any more than normal HDDs, which already use 8-15 watts.



