ZFS for Dummies (ikrima.dev)
414 points by giis on Sept 5, 2023 | 164 comments



I'm getting started with ZFS just now. The learning curve is steeper than I expected. I would love to have a dumbed down wrapper that made the common case dead-simple. For example:

- Use sane defaults for pool creation. ashift=12, lz4 compression, xattr=sa, acltype=posixacl, and atime=off. Don't even ask me.

- Make encryption just on or off instead of offering five or six options

- Generate the encryption key for me, set up the systemd service to decrypt the pool at start up, and prompt me to back up the key somewhere

- `zfs list` should show if a dataset is mounted or not, if it is encrypted or not, and if the encryption key is loaded or not

- No recursive datasets and use {pool}:{dataset} instead of {pool}/{dataset} to maintain a clear distinction between pools and datasets.

- Don't make me name pools or snapshots. Assign pools the name {hostname}-[A-Z]. Name snapshots {pool name}_{datetime created} and give them numerical shortcuts so I never have to type that all out

- Don't make me type disk IDs when creating pools. Store metadata on the disk so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb` references and then shuffle around the drives

- Always use `pv` to show progress

- Automatically set up weekly scrubs

- Automatically set up hourly/daily/weekly/monthly snapshots and snapshot pruning

- If I send to a disk without a pool, ask for confirmation and then create a new single disk pool for me with the same settings as on the sending pool

- collapse `zpool` and `zfs` into a single command

- Automatically use `--raw` when sending encrypted datasets, default to `--replicate` when sending, and use `-I` whenever possible when sending

- Provide an obvious way to mount and navigate a snapshot dataset instead of hiding the snapshot filesystem in a hidden directory


Uh you are mixing sensible suggestions (ashift=12 as a default, {pool}:{dataset} syntax, though it will be hard to change so late) with very, very opinionated ones, which break use cases you may not be aware of.

Naming pools after hostnames: I have pools on a SAN which can be imported by more than one host.

Weekly scrubs, periodic snapshots, periodic pruning: This is really the job of the OS' scheduler (an equally opinionated view, I admit)

collapsing zpool and zfs commands - sure but why? so you can have zfs -pool XXXX and zfs -volume XXXX?

No recursive datasets? I have use cases where it's very useful.

`zfs list` should show if a dataset is mounted or not, if it is encrypted or not, and if the encryption key is loaded or not: Fully agree!

Don't make me type disk IDs when creating pools: You can address them in 3-4 different ways (by id, by WWN, by label, by sdX etc), and you have to specify in _some_ way which disks you want to go there, so not sure what's the point here.

Store metadata on the disk so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb` references and then shuffle around the drives: Already happening. Swap a few drives around and import the pool, it will find them.

Some of your suggestions are genuinely OK, at least as defaults, but some indicate you aren't really considering much outside your own usage pattern and needs. ZFS caters to a lot more people than you.


I'm not suggesting zfs itself change. I'd like a porcelain for people with very simple needs where zfs is almost overkill, for people who just want better data integrity and quicker backups on their main machine, for example.

I think zpool is unnecessary as an additional command. For example, `zfs scrub`, `zfs destroy [pool | dataset]`, `zfs add`, `zfs remove` would all have clear meanings. There may be a couple commands that would need explicit disambiguation with a flag like `zfs create`.


I missed completely that you were discussing just a wrapper and not touching the original ZFS. Completely my bad! Obviously some of my objections become quite moot in that light.


> ZFS caters to a lot more people than you.

And under the OP's proposal, those people would continue to use ZFS entirely unaffected. The OP wasn't proposing changing the behaviour of ZFS, but rather "wrapping" this defined set into a well-defined recipe which could be used by people who aren't so opinionated.

This "dumbed down wrapper" wouldn't even need to be called ZFS, to avoid confusion. Personally I'd like to propose the name ZzzFS: which is ZFS made so simple you can do it in your sleep...


I like that. I was thinking Zoofus, easy enough for a doofus.


Would also accept EZFS.


> so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb`

To be fair, that's not ZFS's problem, that is your problem for not keeping up with the times. PBCAK.

For quite some time now, Linux has had fully-qualified references, e.g. : `/dev/disk/by-id/ata-$manufactuer-$serial-$whatever`

That is what you should be using when building your pools.


My problem is that that's a pain to type out. I've read that it's necessary (and others have said it's not), but it'd be more convenient to just do `mirror /dev/sda /dev/sdb` than `mirror /dev/disk/by-id/ata-WDC_WDBNCE5000PNC_21365M802768 /dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RANX0HA52854X`


Working with drive serial numbers is great. I would ban using /dev/sda etc. Nothing like being able to read the sticker on the disk to make sure you pulled the right one. Or that "sda" is actually what you think it is when you're booted from a recovery disk.


Tab complete would not add that much overhead in typing... I have never understood the complaint that verbosity is bad. We have systems that enable you to be both verbose and quick, win win...

I believe defining a storage array based on an immutable ID is far better than a dynamic OS-assigned ID like /dev/sda, and it costs what, maybe 5 more keystrokes when using tab complete?


Create the pool using `/dev/sda` etc then:

    sudo zpool export tank
    sudo zpool import -d /dev/disk/by-id -aN


> I've read that it's necessary (and other's have said it's not)

It is not, it's outdated advice that stubbornly persists. Try it yourself if you are not convinced: create a pool using /dev/sdx, change the device order, and zfs will still find it, because why wouldn't it just dereference sdx and store UUIDs?
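
For reference, the experiment is quick (pool and device names are examples):

    # create a pool by short device path, then force a different enumeration
    zpool create testpool mirror /dev/sdb /dev/sdc
    zpool export testpool
    # swap the drives between ports (or reboot into a different enumeration), then:
    zpool import testpool   # found via the on-disk labels, not the old paths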


Does that still work if you remove one device while renaming?


Renaming pools is not related.

Specifying devices to `zpool create` as path vs uuid vs partlabel vs partuuid makes absolutely no difference to the end result of what is written to the devices, and never did. ZFS does not rely on device paths to identify vdevs (that would be nuts if you consider ever trying to move pools between machines).

However it used to make a difference to the /etc/zfs/zpool.cache [0] which records the currently imported pools for the OS. AFAIU the problem used to be that this cache would inherit device specifications from the zpool import command (or the implicit import from a zpool create command), and this is what got people into trouble when device path enumeration was not deterministic across boots. Thus the advice to avoid device paths during zpool create arose, but it was only really applicable to zpool import. This was never a permanent issue for the pool; you could always simply re-import it with UUIDs etc. to rebuild the cache in a reliable way.

This was patched way back in 2016, to make sure the cache never uses /dev/sdx alone... Although it's kind of buried in a commit at the end of a semi-related PR [1]:

     Fix 'zpool import' blkid device names

    When importing a pool using the blkid cache only the device
    node path was added to the list of known paths for a device.
    This results in 'zpool import' always using the sdX names
    in preference to the 'path' name stored in the label.

    To fix the issue the blkid import path has been updated to
    add both the 'path', 'devid', and 'devname' names from the
    label to known paths.  Sanity check is done to ensure these
    paths do refer to the same device identified by blkid.

You can cat your zpool.cache and see those three identifiers in the ASCII parts for yourself.

TL;DR you are free to specify devices using /dev/sdx

This is the same problem I mentioned in my other comment, the current quantity of ZFS myths and outdated advice on the web is substantial and takes a lot of navigating or investigating - The best you can do for now is never accept any advice without proof, and test, always test.

[0] https://openzfs.github.io/openzfs-docs/Project%20and%20Commu...

[1] https://github.com/openzfs/zfs/pull/4523/commits/73abf3a8634...


One and done. Once you type it out you can use zfs list|status from there. There is nothing wrong with using sda and sdb, but if you need to replace a drive in the future it's just easier with the by-id label.


True, but in ZFS, you don't usually use these names unless you are shoving disks around. In which case it's nice to have an identity on the disk.

Most of your work is a logical naming scheme on top.


the tab key is my friend here.


OTOH you have BTRFS where you can use whatever you want, and just find all disks using the filesystem ID to join the array. Works like a charm and you never have to think about it.


ZFS does that too, the top-level comment was just uninformed. You can shuffle your disks around, move them to other machines, and even move the contents to different physical drives and ZFS will still automatically find them all using metadata written to the disks called the "zpool label".


Thanks for the clarification! It makes sense that a fs as praised as zfs would be that flexible.


A lot of these suggestions are heavily opinionated. Which is not necessarily bad, but they seem to mess with existing conventions just for the sake of it (why {pool}:{dataset}?).

> Don't make me name [...] snapshots.

You might like this little tool I wrote: https://github.com/rollcat/zfs-autosnap

You put "zfs-autosnap snap" in cron hourly (or however often you want a snapshot), and "zfs-autosnap gc" in cron daily, and it takes care of maintaining a rolling history of snapshots, per the retention policy.
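
A possible crontab for that, assuming zfs-autosnap is on root's PATH:

    # hourly snapshot, daily garbage collection per the retention policy
    0 * * * *  zfs-autosnap snap
    30 3 * * * zfs-autosnap gc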

It's not hard writing simple ZFS command wrappers, feel free to take my code and make your own tools.


Fair point on the {pool}:{dataset} thing. I just don't like that the same {pool} name refers to both a pool of vdevs and the top-level dataset on that pool. It makes it that little bit harder to grok the distinction. Perhaps there's a better way to emphasize that difference.


The distinction is easy. If you're using the 'zpool' binary, you're operating on the pool. If you use the 'zfs' binary, you're operating on the dataset.


Nice, you reminded me of my own incomplete Rust rewrite of the Ruby ZFS snapshot script I wrote about a decade ago, and this bit of yak shaving that ended up derailing me: https://github.com/Freaky/command-limits

I ended up finishing neither, and should pick them back up!

(I snapshot in big chunks with xargs to try to minimise temporal smear - snapshots created in the same `zfs snapshot` command are atomic)


I've read that one of the last tasks of a blacksmith apprentice is to make all the tools a workaday blacksmith would need to use to blacksmith. I.e. your last lesson is to make your own anvil, your own hammers, tongs, etc.

At $DAYJOB I wrote a bunch of scripts to mechanize building ZFS arrays for whatever expected deployment I'd imagined on that day. Among the tasks was to make luks encrypted volumes on which to put the zvols, standardize the naming schemes, sane defaults like ashift=12, lz4 compression, etc. (this was well before encryption was part of ZFS; I haven't updated the scripts to use native ZFS encryption since it's not really been a problem this way)
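
A very rough sketch of that per-disk setup (device and mapper names are examples; the real scripts predate native ZFS encryption and do considerably more than this):

    # put LUKS under each disk, then build the pool on the mapper devices
    cryptsetup luksFormat /dev/disk/by-id/ata-DISK1
    cryptsetup open /dev/disk/by-id/ata-DISK1 crypt-disk1
    zpool create -o ashift=12 -O compression=lz4 tank /dev/mapper/crypt-disk1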

I don't remember many of these flags now, but I have a script as reference for documentation, and others on the team don't need to know much about ZFS besides running make-zfs-big-mirror or make-big-zfs-undundant-raid0 and magic happens.

Eventually maybe even that stuff will be automated away by our provisioning, if we ever are in a position to provision systems more than 20 times per year.


Honestly this is one of the reasons I love ansible. I create scripts while doing it the first time. Then the rest is hitting go and forgetting. The scripts are the documentation, like you said. The hell if I'm remembering all those magic incantations. You only ever will remember what you frequently use, the rest is for off-brain storage.


Last time I tested performance (~2 years ago), zfs on luks performed better than zfs encrypted datasets, on sequential reads, almost twice as good. This was on particularly slow hard drives.

Not sure why, and I should probably make the test reproducible.


As others have noted, these are really opinionated suggestions. And while it's perfectly fine to have an opinion, many of these range from "this isn't the way I'm used to Linux doing it" to the actually objectionable.

The ones I find most personally objectionable:

> - Don't make me name pools or snapshots. Assign pools the name {hostname}-[A-Z]. Name snapshots {pool name}_{datetime created} and give them numerical shortcuts so I never have to type that all out

Not naming pools is just bonkers. You don't create pools often enough to not simply name them.

Re: not naming snapshots, you could use `httm` and `zfs allow` for that[0]:

    $ httm -S .
    httm took a snapshot named: rpool/ROOT/ubuntu_tiebek@snap_2022-12-14-12:31:41_httmSnapFileMount

> - collapse `zpool` and `zfs` into a single command

`zfs` and `zpool` are just immaculate Unix commands, each of which has half a dozen sub commands. One of the smartest decisions the ZFS designers made was not giving you a more complicated single administrative command.

> - Provide an obvious way to mount and navigate a snapshot dataset instead of hiding the snapshot filesystem in a hidden directory

Again -- you can do this very easily via `zfs mount`, but you'll have to trust me that a stable virtual interface also makes it very easy to search for all file versions, something which is much more difficult to achieve with btrfs, et al. See again `httm` [1].
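
For reference, the hidden-directory route and the "give me a real mount" route look like this (pool, dataset, and snapshot names are examples):

    # browse the read-only snapshot through the virtual .zfs directory
    ls /tank/data/.zfs/snapshot/mysnap/
    # or clone it into a normal, writable dataset with its own mountpoint
    zfs clone -o mountpoint=/mnt/restore tank/data@mysnap tank/restore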

[0]: https://kimono-koans.github.io/opinionated-guide/#dynamic-sn... [1]: https://github.com/kimono-koans/httm


> I would love to have a dumbed down wrapper that made the common case dead-simple.

TrueNAS


This. ZFS for Dummies == TrueNAS.


It's interesting; I'm the kind of person who feels uncomfortable using something without understanding the shape of the stack underneath it. A magic black box is anathema; I want to know how to use the thing with mechanical sympathy, aligning my work with the grain of the implementation details. I want to understand error messages when they happen. I want to know how to diagnose something when it breaks, especially when it's something as important as my data.

I like how ZFS is put together. I've been running it for about 13 years. I started with Nexenta, a Solaris fork with Debian userland. I've ported my pool twice, had a bunch of HDD failures, and haven't lost a single byte.

I agree with you on most of the encryption stuff. That is very recent and not fully integrated and the user experience isn't fully baked. I don't agree on unifying zpool and zfs; for a good long time, I served zvols from my zpool, and dividing up storage management and its redundancy configuration from file system management makes sense to me. Similarly, recursive datasets make sense; you want inheritance or something very like it when managing more than a handful of filesystems. I don't agree on pool names (why anyone would want ordinal pool naming and just replicate the problem you just stated re sda, sdb etc. is a bit mysterious), and I don't agree on snapshots (to me this is like preferring commit IDs in git to branch and tag names - manually created snapshots outside periodic pruning should be named).

ZoL on Ubuntu does periodic scrubs by default now. Sometimes I have to stop them because they noticeably impact I/O too much. Periodic snapshots is one of the first cronjobs I created on Nexenta, and while there's plenty of tooling, it also needs configuration - if you are not aware of it, it's an easy way to retain references to huge volumes of data, depending on use case. Not all of my ZFS filesystems are periodically snapshotted the same way.


I see the utility in recursive datasets and I wouldn't want them to go away, but if I were creating a zfs-for-dummies I wouldn't include the functionality. You'd have to drop down to the raw zpool/zfs commands to get that.

Likewise, I appreciate being able to name snapshots, but it's annoying to have to manually name the snapshot I create in order to zfs send. The solution there is probably to not make me take a manual snapshot in the first place. `zfs send` should automatically make the snapshot for me. But in general, I don't see why zfs can't default to a generic name and let me override it with a `--name` flag.

Giving it more thought, I think I would keep pool naming. What I don't like is the possibility of having pool name collisions which isn't something you have to think about with, say, ext4 filesystems. But the upshot, as you point out, is with zfs you aren't stuck using sda, sdb, etc.


What's the lifecycle of an automatically-created snapshot? i.e. is the snapshot garbage? How does it get collected? Things like autosnapshot implement policy to take care of itself but... zfs send one-offs?

zfs send is a strange beast. It's more like differential tar than rsync (i.e. a stream intended for linear backup). zfs is cool because it unifies backup/restore. Have you tried restoring differential tar from unlabeled tapes?


I’m on the other end of the spectrum. I like knowing the flags and settings I use to create the pools.

For snapshots and replication take a look at sanoid (https://github.com/jimsalterjrs/sanoid).


— when destroying a dataset, please ask for confirmation before permanently deleting it

— please provide support for multiple key slots as in LUKS

— please build in the functionality of sanoid and syncoid, so that snapshots and replication don’t need a third party tool

— please build a usable deduplication, so that we don’t have to use external tools such as Restic or Borg


I’m not sure what restic or borg are or do for you, but there is an upcoming feature called “block cloning” that is a sort of on-demand version of deduplication that offers new advantages for certain setups.

https://www.bsdcan.org/events/bsdcan_2023/sessions/session/1...


Borg/Restic are deduplicating backup tools. They actually use the exact same approach as ZFS deduplication, but because e.g. Borg chunks are around 2M instead of 128K (ZFS recordsize default) it uses less memory for the same data compared to ZFS. All of these systems end up using somewhere around 300 bytes of memory per chunk stored. [1] This is how you arrive at the "1 GB of memory per 1 TB of storage" rule of thumb when using deduplication.

Block cloning is different because it's offline. It does not require deduplication tables, therefore does not require those 300 bytes per chunk I mentioned above, neither on-disk nor in-core. Note that the BRT only has to track blocks which are actually referenced at least twice, while the DDT has to track all blocks. This, plus the BRT not having to store the block checksum, is why it has so much less overhead. The BRT should come out to less than 20 bytes per tracked block on average. (Also, the BRT is only used for freeing blocks and when handling ficlonerange, while the DDT has to be consulted for every prospective block write).

The downside is of course that it's offline. So the finding of duplicate blocks, which is what all the DDT overhead supports, has to be done by an external tool, unless you're explicitly creating duplicates (cp --reflink=always), which is of course a very useful thing to have -- especially because block cloning works across datasets: with block cloning you can actually copy data out of snapshots back into a writable dataset without having to physically copy anything. Also, again, as it's offline, you can deduplicate existing data in-place, while traditional ZFS dedup would require turning deduplication on and then re-writing all data you want deduplicated.
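
A sketch of what that looks like once block cloning is available (assumes OpenZFS 2.2+ with the block_cloning pool feature enabled, a cp that issues the clone ioctl, and example paths):

    # "copy" a file out of a snapshot without physically duplicating its blocks
    cp --reflink=always /tank/data/.zfs/snapshot/nightly/huge.img /tank/data/huge-restored.img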

So OpenZFS 2.2 is definitely going to be very exciting.

[1] Some use more memory in-core than on-disk, e.g. Restic tends to use about 2-3x the memory relative to Borg, Rustic is somewhere in-between. Probably GC tax.


Do you use Restic or Borg with ZFS? How do you have that set up? Do you use them in lieu of zfs send/recv?


In lieu of ZFS replication (send and receive). I still use ZFS as a file system.

Also, I have a 1.5TB directory. There is a lot of redundant backup data in it. I archived it with Restic to 400 GB. As a ZFS dataset, it would have taken ~4X the size.

Honestly, backups up to 100 TB are better done with tools such as Restic than with file system stream backups: fewer hardware requirements, lots of support for repository management, integration with clouds, portable repositories, better and more trusted encryption, etc. Tools based on Go are static binaries with no dependencies. You can recover your data in the future on any X86 platform.


Fully agree on using tools such as Restic! The `--exclude-caches` is really helpful in keeping backups small, it makes Restic skip directories with a CACHEDIR.TAG file and that includes Rust compile target directories. Combined with a small exclude list for browser caches and other temporary storage this makes backup deltas way smaller than in the average ZFS setup. (And no, creating a new ZFS filesystem for every cache directory and excluding each of them from snapshots is not really a solution)


I searched my linux system for CACHEDIR.TAG and found only a handful of them. Also looking at the standard’s website, it seems it’s not adopted.


I make remote snapshot backups with Borg using this: https://github.com/Freaky/zfsnapr

zfsnapr mounts recursive snapshots on a target directory so you can just point whatever backup tool you like at a normal directory tree.

I still use send/recv for local backups - I think it's good to have a mix of strategies.


> - Make encryption just on or off instead of offering five or six options

You can set "encryption=on", and it will select the default strongest option, currently AES-256-GCM

> - Generate the encryption key for me, set up the systemd service to decrypt the pool at start up, and prompt me to back up the key somewhere

Technically it does generate encryption keys internally, which is why the ones you provide can be rotated out. If you use a keyfile then automounting with key loading is easy (zfs mount -al); there is already an automount systemd service created for Debian, however they did not add the -l flag for auto-loading keys because they got stuck in a debate about supporting passphrase prompts at boot. For now you can simply edit it to add the -l flag and it works fine for datasets with keyfiles.
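
A minimal sketch of the keyfile route (paths and dataset names are examples):

    # generate a 32-byte raw key and create an encrypted dataset that uses it
    dd if=/dev/urandom of=/root/tank-secure.key bs=32 count=1
    zfs create -o encryption=on -o keyformat=raw \
        -o keylocation=file:///root/tank-secure.key tank/secure
    # at boot (or by hand): mount everything, loading keys as needed
    zfs mount -al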

> - Don't make me type disk IDs when creating pools. Store metadata on the disk so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb` references and then shuffle around the drives

This is no longer the case for ZoL. I know because there is an issue with Linode where storage device identifier assignments consistently get jumbled up with Debian on every boot. ZFS finds the devices all the same, even if they have a different identifier every boot. I believe this is because it stores its own UUID info on the devices. So you can create pools by referring to devices however you like, because they are only a temporary reference, i.e. use /dev/sda etc (and I have, and it's fine). I think there is a lot of outdated advice about this floating around still.

> - Automatically set up weekly scrubs

This might be ZoL specific, but, the Debian package does exactly this, sets up systemd weekly scrub.

> - No recursive datasets

Why? This is too useful... inherited encryption roots, datasets with different properties so you can have databases and other filesystems all under the same root dataset, which can then be recursively replicated in one command. If you have no need for recursive datasets just don't use them, but they have many valid purposes.


this is honestly hard because many of the decisions that matter are not things you type into zfs at all (except incidentally).

how many disks per vdev? how much memory? etc

a lot of the things you've outlined are not universal at all, just situational


Yes, there is a lot of essential complexity that is unavoidable, but there are a lot of people like me who just want a better desktop file system, and we don't need to know about SLOGs and L2ARCs and the half dozen compression algorithms, etc. It's situational, but it's a common enough situation that a targeted solution would be valuable.


I'm one of those who wants a better desktop file system but have stayed away from ZFS (at least for the time being) due to stories about complexity.

Would you say that the defaults are sane enough for that kind of person (no real configuration needed)?


Outside of it being a good idea to manually set `ashift=12` on each vdev creation (because disks lie and zfs’s whitelist isn’t omniscient. I’ve seen too many people burned by this) and `atime=off` for datasets (because burning IO updating the metadata because you’ve accessed something is just fucking stupid), the defaults are sane and you can basically refuse to care about them after that.

Every system I’ve used has `compression=on` set as default, which currently means lz4. People set it manually out of paranoia from earlier days I think.

For linux systems you can set `xattr=sa` and `acltype=posixacl` if you like, which offers a minor optimization that you’ll never notice.
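
Put together, the manual tweaks above look like this at creation time (pool and device names are examples; compression is left at its already-sane default):

    zpool create -o ashift=12 \
        -O atime=off -O xattr=sa -O acltype=posixacl \
        tank mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2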

I suppose if you don’t like how much memory zfs uses for ARC, you can reduce it. For desktop use 2-4gb is plenty. For actual active heavier storage use like working with big files or via a slow HDD filled NAS, 8GB+ is a better amount.
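
On Linux the ARC cap is a module parameter, e.g. for 4 GiB (value in bytes; just an example, not a recommendation):

    echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
    # takes effect after the module is reloaded or at the next boot;
    # it can also be changed live via /sys/module/zfs/parameters/zfs_arc_max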

Dataset recordsize can be set as well, but that’s really something for nerds like me that have huge static archives (set to 1-4MiB) or virtual machines (match with qcow2’s 64KiB cluster size). The default recordsize works well enough for everything that unless you have a particular concern, you don’t need to care.

I should note, beware of rolling release linux distros and ZFS. The linux kernel breaks compatibility nonstop, and sometimes it can take a while for zfs to catch up. This means your distro can update to a new kernel version, and suddenly you can’t load your filesystem. ZFSbootmenu is probably the best way to navigate that, it makes rolling back easy.

You also want to setup automatic snapshots, snapshot pruning, and sending of snapshots to a backup machine. Highly recommend https://github.com/jimsalterjrs/sanoid

If you really find yourself wanting to gracefully deal with rollbacks and differences in a project, HotTubTimeMachine (HTTM) is nice to be aware of: https://github.com/kimono-koans/httm


No, not really. There's a lot of unavoidable complexity like needing to understand what ZFS pools, vdevs, and datasets are and how they relate to each other. And then if you want to encrypt your data that adds another bit of functionality you need to configure. And then there are just a lot of things that aren't that important for most people, but you won't know if it's important for you until you understand what they are.

I would say the learning curve is similar but perhaps a bit steeper than configuring vim for web development for the first time or configuring nginx and let's encrypt on a webserver for the first time.

It didn't help that I started by trying to setup root on encrypted zfs. I would recommend first using zfs unencrypted, then adding encryption once you're comfortable with that, and finally trying /home on zfs or even root on zfs.


I switched away from ZFS on root after using it for about a year. There were no actual problems, but I could never get over the fact that it's so complicated underneath I have to relearn everything every time I do any sort of advanced maintenance.

For example, I had to move between SSDs which is an absolutely trivial operation with any other filesystem if you know your Linux well. It took me maybe three hours of work on ZFS because it brings its own environment and does everything its own way, so I had to remember and re-discover a lot of stuff. The system booted on maybe 15th try.

If you are fine with this, or you're not going to use it on root, or you are a FreeBSD user, it's an excellent choice for pretty much any task.


I have zfs root on ubuntu; moving to another ssd was trivial: add the second drive, recreate the partition scheme (esp, bpool, rpool), add them as mirrors to existing pools, and while waiting while it mirrors up, reinstall grub on the new esp. After mirroring is done, check fstab for UUIDs, reboot, check, whether everything is ok, remove the old drive from the mirror. Done.

Though I consider zfs to be nice, I'm willing to use it only with distributions, where it comes with the distribution, built for the distro kernel (i.e. ubuntu or proxmox). In the past, I've wasted too much time with dkms, failing to build the kernel modules, or kABI modules failing to load, or other problems that left me without root filesystem at reboot. That's not an experience I'd like to repeat. Also, btrfs is nice too ;).


Is the integration of root zfs on linux complicated? I saw some info-graphic and it looked like there were a lot of pieces to stitch it together. Is it fragile?

Regardless, I'm surprised your one year with ZFS didn't convince you that it's worthwhile to invest in its mastery. The filesystem will pay dividends for your entire compute lifetime and not only save your data but hours of maintenance over that time.

Major downgrade moving away from it in my view.

On a side note, FreeBSD is great and worth investing in as well. I still have more to learn but the journey has been wonderful and I will not be using Linux in the next iteration of my home / lab server and hobby projects.


The defaults are absolutely sane. I've been using FreeBSD with root on ZFS on all sorts of workstations and laptops for many years now. It's always run just fine, even on cheap laptops with 2 GB of RAM.


I think the key thing to note there is that you are on FreeBSD. The Linux integration isn't nearly as mature.


If you install on ZFS from the installer with Ubuntu, FreeBSD, Illumos, etc. there's no reason to care about the complexities under the hood if you don't care about them with any other file system.

It's like saying you should know how virtual memory and pre-emptive multitasking work if you want to upgrade from Classic Mac OS to OS X.


`compression=on` exists.

You don't need to know about SLOGs and L2ARCs. They're not super useful generally but people geek out about them.

I know those were just examples and I don't disagree with the general point.


Why would you want to know about SLOGs or L2ARCs if you're using it on a desktop? Even if you are using it for a NAS, you almost certainly don't need to know about those.


I'm also firmly on the "convention over configuration" mindset, sensible defaults should be available and anybody else can still tinker if needed. I also admit that setting up zfs wasn't easy for me (and I'm sure I haven't done it 100% properly)


I get that it's meant to be a wrapper for common use cases, but I use ZFS on root and had to craft a custom initramfs. The other niceties sound great (some BTRFS inspiration?), but some of these are at distro-devel level.


I looked into doing root on ZFS, but I ended up settling for /home on ZFS. Maybe I'll revisit it next time I build a new system when I'm more comfortable with ZFS.


> The learning curve is steeper than I expected

In fairness, learning about zfs is like learning about mdadm, lvm and a filesystem all at once… so it’s kinda justifiable in my opinion


I'd write a wrapper to do much of that to automate whatever my particular use-case is.

One pattern I've found useful when writing wrapper shell scripts: Output the actual command(s) that actually get run, to stderr in yellow, before running them. This also serves as a sanity check.


> - Generate the encryption key for me, set up the systemd service to decrypt the pool at start up, and prompt me to back up the key somewhere

This should have an option to integrate use of a TPM for super-encrypting the ZFS encryption key(s).


> - collapse `zpool` and `zfs` into a single command

Nooo, that should not be done. They are very different tools, for very different things.


I expect that that’s what Apple would have done if they had adopted ZFS, sensible defaults for the common user.


This is a very funny list.

"Sane defaults", are you going to be arbiter of sanity? With the #3 recommendation, "set up systemd for me", I'd rather not have you at that position.

Most of the bullets you wrote induce a "...why?" thought in somebody who has ZFS experience. Why would you unify zpool and zfs? Why would you want automatic weekly scrubs on by default? Do you realize what ZFS scrubbing is and when the time to perform it is?

I'm a bit agitated by your writing I must confess. You want ZFS to exactly reflect your basic use case so you don't have to move your little finger (automatic naming, automagical configuration). It's not meant to be a hands-off filesystem, you are expected to understand encryption in ZFS in order to use it.

But the most annoying thing is that you did see a steeper learning curve, and want to avoid it. Why don't you write your own ZFS provisioning tool? Why are you still using /dev/sda and not disk-by-UUID or something more 2023? Etc.


Like I said, I just want a wrapper to make this stuff easier for me. It's just like not everyone wants all that comes with running Arch. Some people like having Ubuntu handle the nitty gritty of a Linux OS. What's wrong with that?


> It's not meant to be a hands-off filesystem,

??


Other useful things about ZFS:

- get to know the difference between zpool-attach(8) and zpool-replace(8).

- this one will tell you where your space is used:

    # zfs list -t all -o space
    NAME                      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
    (...)

- ZFS Boot Environments is the best feature to protect your OS before major changes/upgrades

--- this may be useful for a start: https://is.gd/BECTL

- this command will tell you all history about ZFS pool config and its changes:

    # zpool history poolname
    History for 'poolname':
    2023-06-20.14:03:08 zpool create poolname ada0p1
    2023-06-20.14:03:08 zpool set autotrim=on poolname
    2023-06-20.14:03:08 zfs set atime=off poolname
    2023-06-20.14:03:08 zfs set compression=zstd poolname
    2023-06-20.14:03:08 zfs set recordsize=1m poolname
    (...)

- the guide misses one important piece of info:

  --- you can create 3-way mirror - requires 3 disks and 2 may fail - still no data lost

  --- you can create 4-way mirror - requires 4 disks and 3 may fail - still no data lost

  --- you can create N-way mirror - requires N disks and N-1 may fail - still no data lost

  (useful when data is most important and you do not have that many slots/disks)
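
For example, a 3-way mirror where any two disks can fail (device names are examples):

    zpool create tank mirror \
        /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3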


N-way mirrors also have the property that ZFS can shard reads across them, which mattered a lot on spinning rust, since iops can be limited.


We have been running a large multi-TB PostgreSQL database on ZFS for years now. ZFS makes it super easy to do backups, create test environments from past snapshots, and saves a lot of disk space thanks to built-in compression. In case anyone is interested, you can read our experience at https://lackofimagination.org/2022/04/our-experience-with-po...


Nice - thanks for the info! I had no idea about the Toy Story 2 fiasco as well so this was a great read :)


Thanks, glad you liked it.


FreeBSD's Handbook on ZFS [0] and Aaron Toponce's articles [1] were what helped me the most when getting started with ZFS

[0] https://docs.freebsd.org/en/books/handbook/zfs/

[1] https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux...


I love FreeBSD's docs.

I had an old HP Microserver with 1GB of ECC RAM lying around so I installed FreeBSD on it. I had 5 old 500GB hard drives lying around too so I set them up in a 5x mirror with help from the FreeBSD Handbook. First time using FreeBSD and it was a breeze.


One of the diagrams under the bit about snapshotting has a typo reading "snapthot" and I immediately thought it was talking about instagram.

(I realize now after writing it that maybe snapchat should have occurred to me first, but I have never used it)


I recently rebuilt a load of infrastructure (mainly LAMP servers) and decided to back them all with ZFS on Linux for the benefit of efficient backup replication and encryption.

I've been using ZFS in combination with rsync for backups for a long time, so I was fairly comfortable with it... and it all worked out, but it was a way bigger time sink than I expected - because I wanted to do it right - and there is a lot of misleading advice on the web, particularly when it comes to running databases and replication.

For databases (you really should at minimum do basic tuning like block size alignment), by far the best resource I found for mariadb/innoDB is from the Let's Encrypt people [0]. They give reasons for everything and cite multiple sources, which is gold. If you search around the web elsewhere you will find endless contradicting advice, anecdotes and myths that are accompanied with incomplete and baseless theories. Ultimately you should also test this stuff and understand everything you tune (it's ok to decide to not tune something).

For replication, I can only recommend the man pages... yeah, really! ZFS gives you solid replication tools, but they are too agnostic; they are like git plumbing, they don't assume you're going to be doing it over SSH (even though that's almost always how it's being used)... so you have to plug it together yourself, and this feels scary at first, especially because you probably want it to be automated, which means considering edge cases... which is why everyone runs to something like syncoid.

But there's something horrible I discovered with replication scripts like syncoid: they don't use ZFS's send --replicate mode! They try to reimplement it in perl, for "greater flexibility", but incompletely. This is maddening when you are trying to test this stuff for the first time and find that all of the encryption roots break when you do a fresh restore, and not all dataset properties are automatically synced. ZFS takes care of all of this if you simply use the built-in recursive "replicate" option.

It's not that hard to script manually once you commit to it. Just keep it simple, don't add a bunch of unnecessary crap into the pipeline like syncoid does (it actually slows things down if you test), just use pv to monitor progress and it will fly.

I might publish my replication scripts at some point because I feel like there are no good functional reference scripts for this stuff that deal with the basics without going nuts and reinventing replication badly like so many others.
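
The core of it is a short pipeline like this (pool, snapshot, and host names are examples; --raw only matters for encrypted datasets, and the very first run is a full send without -I):

    # incremental, recursive, property-preserving replication over SSH with progress
    zfs snapshot -r tank/data@2023-09-05
    zfs send --raw --replicate -I tank/data@2023-09-04 tank/data@2023-09-05 \
        | pv | ssh backuphost zfs receive -F backuppool/data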

[0] https://github.com/letsencrypt/openzfs-nvme-databases


They mention tuning io_capacity and io_capacity_max, which the MySQL docs unfortunately suggest is useful, until you click through to see what the parameters actually do [0]. They control background IO actions like change buffer merges, and in fact will take IO from the main process that needs it for work.

IME with a decently busy (120K QPS) MySQL DB, you do not need to touch either of these. If you think you do, monitor the time to fill the redo log and the dirty page percentage in the buffer pool. There are probably other parameters you should tune instead.

[0] https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.ht...


To be fair they aren't going nuts with them, I've seen worse examples. But I agree with you in principle, it's not necessary, and potentially harmful to overall performance. It also doesn't really belong in a ZFS tuning guide.


> For databases (you really should at minimum do basic tuning like block size alignment),

One unexpected thing to check (and do check, because your mileage will vary) - the suggestion is usually to align record sizes, which in practice tends to mean reducing the record size on the ZFS filesystem holding the data. I don't doubt that this is at some level more efficient, but I can empirically tell you that it kills compression ratios. Now the funny knock-on effect is that it can - and again, I say can because it will vary by your workload - but it can actually result in worse throughput if you're bottlenecked on disk bandwidth, because compression lets you read/write data faster than the disk is physically capable of, so killing that compression can do bad things to your read/write bandwidth.


I know what you're getting at, I wondered the same thing, but my results were the opposite of what I expected for compression.

I enabled lz4 compression and set recordsize for database datasets to 16k to match innoDB... turns out even at 16k my databases are extremely compressible, 3-4x AFAIR (I didn't write the DB schema for the really big DBs, they are not great, and I suspect that there is a lot of redundant data even within 16k of contiguous data)... maybe I could get even more throughput with larger record sizes, but it seems unlikely.

As you say, mileage will vary, it's subjective, but then I wasn't using compression before ZFS, so I don't have a comparison. I have only done basic performance testing, overall it's an improvement over ext4, but I've not been trying to fine tune it, I'm just happy to not have made it worse so far while gaining ZFS.
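
For reference, that tuning is just the following (dataset name is an example; a recordsize change only affects blocks written afterwards):

    zfs set recordsize=16k tank/db
    zfs set compression=lz4 tank/db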


Oh, nice. In my case I sort of forgot to set the record sizes, got great compression, and then realized I'd missed the record sizes so tried lowering/matching them and watched compression ratio drop to... I think all the way to 1.0:( So that was a natural experiment, as it were. But of course it totally depends on the exact size and most of all your data.


I suppose ultimately the thing to test is application performance, (assuming you don't care too much about disk space). Because it depends on the access pattern. If reading and writing contiguously, then large record sizes (even if unaligned) with high compression ratio might be fastest. but if it's more weighted to highly randomised, then the larger ZFS record sizes might actually cause read and write amplification and a lower throughput even if the overall on-disk size is smaller (assuming it doesn't all fit in memory in which case this is all moot).

Another new thing that's possible with innoDB these days is that you can actually change the default 16k page size, i.e. you could align innoDB to your preferred ZFS recordsize if that makes sense for your application.


I started to use ZFS (on Linux) a few years ago and it went smoothly.

My only surprise was the volblocksize default, which is pretty bad for most RAIDZ configurations: you need to increase it to avoid losing 50% of raw disk space...

Articles touching this topic:

https://jro.io/nas/#overhead

https://openzfs.github.io/openzfs-docs/Basic%20Concepts/RAID...

https://www.delphix.com/blog/zfs-raidz-stripe-width-or-how-i...

And you end up on one of the ZFS "spreadsheets" out there:

ZFS overhead calc.xlsx https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT6...

RAID-Z parity cost https://docs.google.com/spreadsheets/d/1pdu_X2tR4ztF6_HLtJ-D...


In my opinion, the 50% efficiency of mirror vdevs is a fair price to pay for the simplicity and greatly improved performance. You can grow RAIDZ pools now, but it's still a lot more complicated and doesn't perform as well.


Might not remember the details correctly but when I was younger and stupider I read a lot about how great one of the open source NAS OSs (FreeNAS?) and ZFS were from fervent fans. I bought a very low spec second hand HP micro server on eBay and jumped straight in without really knowing what I was doing. I asked a few questions on the community forum but the vast majority of answers were "Have you read the documentation?!" "Do you have enough RAM?!".

The documentation in question was a PowerPoint presentation with difficult to read styling, somewhat evangelical language, lots of assumptions about knowledge and it was not regularly updated. It was vague on how much RAM was required, mainly just focused on having as much as possible. Needless to say I ignored all the red flags about the technology, the hype and my own knowledge and lost a load of data. Lots of lessons learnt.


Can you roughly remember how long ago that was? ZFS has been around since the early 2000s, with FreeNAS starting in 2005 iirc.

The filesystem has gotten a lot more stable, and imo the documentation clearer.

That said, it's "more powerful and more advanced" than traditional journaling filesystems like ext3, and thus comes with more ways to shoot yourself in the foot.


Some additional points for posterity, in case it isn't driven home here:

- All redundancy in ZFS is built in the vdev layer. Zpools are created with one or more vdevs, and no matter what, if you lose any single vdev in a zpool, the zpool is permanently destroyed.

- Historically RAIDZs (parity RAIDs) cannot be expanded by adding disks. The only way to grow a RAIDZ is to replace each disk in the array one at a time with a larger disk (and hope no disks fail during the rebuild). So in my very amateur opinion, I would only consider doing a RAIDZ if it is something like a RAIDZ2 or 3 with a large number of disks. For n<=6 and if the budget can stand it, I would do several mirrored vdevs. (Again as an amateur I am less familiar with RW performance metrics of various RAIDs so do more research for prod).


Pool of mirrors is usually the safer way, yes.

If and only if you (a) have full, on-site backups and (b) are fairly sure of your abilities and monitoring, then I can suggest RAIDZ1. I have a pool of 3x3 drives, which ships its snapshots a few U down in my rack to the backup target that wakes up daily, and has a pool of 3x4 drives, also in RAIDZ1.

In the event that I suffer a drive failure in my NAS, my plan of action would be to immediately start up the backup, ingest snapshots, and then replace the drive. That should minimize the chance of a 2nd drive failure during resilvering destroying my data.

Truly important data, of course, has off-site as well.


I've run into a ZFS problem I don't understand. I have a zpool where zpool status prints out a list of detected errors, never in files or `<metadata>` but in snapshots (and hex numbers that I assume are deleted snapshots). If I delete the listed errored snapshots and run zpool scrub twice the errors disappear and the scrub finds no errors. Zpool status never listed any errors for any of the devices.

So there aren't any errors in files. There aren't any errors in devices. There aren't any errors detected in scrub(?). And yet at runtime I get a dozen new "errors" showing up in zpool status per day. How?


Damn good question. I don’t have time to search for duplicates myself right now, but you can look through/ask the mailing list: https://zfsonlinux.topicbox.com/groups/zfs-discuss (looks weird, but this is a legit web front end for the mailing list) and the github issues: https://github.com/openzfs/zfs/issues


I've been running into the same issue, where occasionally files seem to get corrupted in the snapshot but also in the live version of the file. I cannot move it or modify it. I can only delete it. There's no indication as to why these files are getting corrupted. Thankfully they are all large Linux ISOs, so it hasn't been critical to my life.


Nice. My gotchas from using zfs on my personal laptop with Ubuntu.

- if you want to copy files for example and connect your drive to another system and mount your zpool there, it sets some pool membership value on the file system and when you put it back in your system it won’t boot unless you set it back. Which involved chroot

- the default settings I had made a snapshot every time I apt installed something, and because that snapshot included my home drive, when I deleted big files thereafter I didn't get any free space back until I figured out what was going on and arbitrarily deleted some old snapshots

- you can’t just make a swap file and use it


> - if you want to copy files for example and connect your drive to another system and mount your zpool there, it sets some pool membership value on the file system and when you put it back in your system it won’t boot unless you set it back. Which involved chroot

Isn't this what `zpool export` is for?
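
For reference, the usual dance when moving a pool between machines looks like this (pool name is an example):

    zpool export tank   # on the old system, before pulling the drives
    zpool import tank   # on the new system; -f is only needed if it wasn't exported cleanly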


Opensuse Tumbleweed comes with snapper which works with btrfs in a similar fashion and /home is not included in the snapshot by default. For your use case too you should exclude /home from your apt triggered snapshots and set a separate one for it. I had scheduled snapshots for my /home at one point but since it's a very actively used directory (downloading isos, games then deleting them) I had similar problems to yours. I guess we could both also have a dedicated separate directory for those short lived huge files, which don't need a snapshot anyway.


I found

  cat /etc/apt/apt.conf.d/90_zsys_system_autosnapshot
  // Takes a snapshot of the system before package changes.
  DPkg::Pre-Invoke {"[ -x /usr/libexec/zsys-system-autosnapshot ] && /usr/libexec/zsys-system-autosnapshot snapshot || true";};

  // Update our bootloader to list the new snapshot after the update is done to not block the critical path
  DPkg::Post-Invoke {"[ -x /usr/libexec/zsys-system-autosnapshot ] && /usr/libexec/zsys-system-autosnapshot update-menu || true";};

but how would I get this to not snapshot, say, /home/Downloads? Make that its own zpool?


Dataset should be enough I think. Not zpool.

Hard to be sure without knowing what that script does.
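
If a plain dataset is enough, the idea would be something like this (pool, dataset, and user names are examples; whether the zsys script skips it depends on that script):

    zfs create -o mountpoint=/home/user/Downloads rpool/downloads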


> I had scheduled snapshots for my /home at one point but since it's a very actively used directory (downloading isos, games then deleting them) I had similar problems to yours.

What kind of schedule was it? I feel like the low-impact alternative to no snapshots at all is daily snapshots for half a week to a week, and maybe some n-hourly snapshots that last a day or two. Which I would not expect to use up very much space.


It was very frequent (around every 5 minutes) because what I was trying to get was a failsafe in case I delete or otherwise mess up a file I was currently working on. It happened exactly once since I disabled snapshots of /home and I was able to recover it from the terminal's scrollback buffer of all places. :)

As far as I know btrfs does get slower as the number of snapshots increase. Not sure about zfs in that regard. Your plan does sound sensible. I have a feeling you could make the snapshots even more frequent without any ill effects.


I need 3 stores to feel I'm keeping years of digital family photos safe:

1) I have a live (local) FreeBSD ZFS server running for backups and snapshots; 2 pairs of mirrored physical drives.

2) I have a USB device that takes 2 mirrored drives to recv ZFS snapshots from #1; I store that vdev backup in a safe place.

3) I backup entire datasets to cloud storage from off-prem using rclone.

It's #3 where I need to do some more research/work. I need to spend some time sending snapshots/diffs to cloud blob storage and make sure I can restore. Yes, I know there is rsync.net.
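
Something along these lines is what I have in mind (remote, bucket, and snapshot names made up; this would be a full, non-incremental stream):

    zfs send --raw tank/photos@2023-09-05 \
        | rclone rcat cloud:zfs-backups/photos@2023-09-05.zfs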

Any experiences to share?


I bought a cheap HP Microserver with four 4 TB spinning disks that I placed at a relative's house ~1000 km from where I live. I do nightly replication to the off-site location, with an account on the receiving end that only has enough permissions to create snapshots and receive data, so even if that ssh key somehow got out into the wild things could not be deleted from the remote store. I hope :)

Clarification: Remote end also uses ZFS, so I can use cheap replication with encryption


My setup is similar. +1 that Restic is great. My cloud backup was sending blobs to Google Workspace, but they have clamped down on storage. I will be replacing that with another box at my parents' that will be tucked out of the way. It will just have a wireguard tunnel to my home network and I will send snapshots to it. At some point I'll turn down the Workspace solution and probably also unsubscribe.


I'm using Borg to back up to rsync.net.

Borg splits your files up into chunks, encrypts them and dedupes them client-side and then syncs them with the server. Because of the deduping, versioning is cheap and you can configure how many daily, weekly, monthly, &c. copies to keep. For example you could keep 7 days' worth of copies, 6 monthly copies and 10 yearly copies.
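
A typical setup looks roughly like this (repo path and retention numbers are examples):

    # chunk, dedupe and encrypt client-side, then sync to the repo over SSH
    borg create --compression zstd \
        ssh://user@user.rsync.net/./backups::'{hostname}-{now}' /tank/data
    # thin out old archives per the retention policy
    borg prune --keep-daily 7 --keep-monthly 6 --keep-yearly 10 \
        ssh://user@user.rsync.net/./backups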

Rsync.net have special pricing for customers using Borg/Restic:

https://www.rsync.net/products/borg.html

https://www.rsync.net/products/restic.html


I use Restic and backup to rsync.net for remote backups. Works great.

I'm not working with much data though, so even if I wanted to I couldn't get a ZFS send/receive account with rsync.net. I like the way rsync.net gives you separate credentials for managing the snapshots. This way even if my NAS gets compromised I will still have all the periodic snapshots.

For me privacy is my main concern and Restic's security model is good for me. The backup testing features are good too, and rsync.net doesn't charge for traffic, so these two work well together. I don't use the snapshots though because rsync.net already supports this via ZFS.


I have a similar setup, though with Linux. From my experience I can recommend taking a look at restic (https://restic.readthedocs.io/). It does encrypted and deduplicated snapshots to local and remote repositories. There's a good selection of remote target options available, but you can also use it with rclone to use any weird remote. Just remember to keep a backup of your encryption key somewhere besides the machine you back up ;-)


Are you running regular scrubs with ZFS and checking the results?


In your experience, what is a good schedule for scrubs?

I do one about every month or so. I should probably add a crontab for that.


Weekly is good as a generic answer. The specifics of what you store and the value you place on them could warrant tweaks up or down. The time it takes to perform a full scrub could also be a factor.
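
A crontab line for that is enough (pool name is an example):

    # scrub every Sunday at 03:00
    0 3 * * 0 /sbin/zpool scrub tank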



"Also read up on the zpool add command."

Haha, the only part of maintenance that I need to look up every time I do it is replacing a faulty hard drive.

Even this guide skips that.


is there an equivalent "btrfs for dummies" ?


As nice as the technology is as long as there's the potential of a Damoclean license issue I'll always feel hesitant around ZFS.

(Hey looks like it's a sore spot!)


This is only going to get a shedload of "you do you" responses. BSD licensing is quite content to use ZFS; it's pretty much ubiquitous now in widescale deployment and the source would be impossible to re-lock. Worst case is a fork.

I very much regret the fragmentation of FS design, it has many mothers. "there can only be one" was never going to work, but we seem to have perhaps 4-5 more than we really need. ZFS manages to wrap up a number of behaviours cohesively with good version-dependent signalling so it should always be possible to know you're risking a non-reversible change to your flags. And, it keeps improving.

But, counter "it keeps improving" so do all the other current, maintained, developed FS and if somebody tells me they prefer to use Hammer, or one of the Linux FS models with a discrete volume-management and encryption layer, I don't think thats necessarily wrong.

Mainly I regret Apple walking away. That was about Oracle behaviour. It wasn't helpful. A lot of Apple's FS design ideas persist. I never got resource/data forks, it only ever appeared on my radar as .files in the UNIX AUFS backend model of them. Obviously inside Apples code, it was dealt with somehow. It felt like the wrong lessons about meta data had been learned. Maybe an Ex-VMS person went to Apple? Also Apple has a rather "yea maybe or no, dunno" view about case-independent or case-dependent naming. Time machine is good. Feels like it should fit ZFS well. Oh well.


I wish zfs had a 'tag' type of feature like Apple-style filesystems. I have thought about making an object store to get some of the metadata at the fs level, rather than having to make some cobbled-together solution of an SQLite DB of metadata alongside the files or something, but many of the object store solutions seem to be for fairly large files, rather than a few thousand pictures of family and such. Unsure what a good solution using zfs would be.


I am not sure what kind of "tag" you mean, but in any modern file system which supports extended file attributes you can attach arbitrary metadata to any file.

You just have to define whatever kind of tags you want and choose names for the xattrs in which to store them and then you should use them in your scripts or applications.

I use xattrs all the time for things like file checksums and content classification, on XFS and on FreeBSD UFS, and almost all modern file systems support xattrs, except Linux tmpfs (which has only a partial support that is useless for those who use custom xattrs, so a file with xattrs cannot be copied to /tmp if tmpfs is used).
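On Linux the attr package provides setfattr/getfattr for this (FreeBSD uses setextattr/getextattr instead). A quick sketch with a made-up user.tag attribute:

    # attach a tag
    setfattr -n user.tag -v "family,2023" photo.jpg

    # read it back
    getfattr -n user.tag photo.jpg

    # dump all user.* attributes on the file
    getfattr -d photo.jpg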


> except Linux tmpfs (which has only a partial support that is useless for those who use custom xattrs, so a file with xattrs cannot be copied to /tmp if tmpfs is used).

tmpfs is getting user xattrs in kernel 6.6, together with quotas.

> You just have to define whatever kind of tags you want and choose names for the xattrs in which to store them and then you should use them in your scripts or applications.

It is still just an oob store; you have to traverse the directory hierarchy old-style. The old BeOS did (and Haiku today does) something more: it indexed the selected xattrs and allowed live queries against those indices. This allowed for very interesting uses.


> tmpfs is getting user xattrs in kernel 6.6

I am aware of this, but the so-called user xattr support most likely remains too limited to be of any use.

The kernel list message:

"Add support for user xattrs. While tmpfs already supports security xattrs (security.*) and POSIX ACLs for a long time it lacked support for user xattrs (user.*). With this pull request tmpfs will be able to support a limited number of user xattrs."

I am waiting to see what "a limited number" means, but based on the history of tmpfs, I suspect that it does not mean that the number of extended file attributes is limited to some value, but rather that there is a small list of user xattr names that will be preserved (those used by some application that the tmpfs devs love, while they hate the other applications), while the other xattrs will be lost, like now.

If "a limited number" is really a number, that would be much better, but this might still mean that the number is excessively low, e.g. ten attributes or less, instead of being high enough to not be a serious limitation, e.g. one hundred or one thousand.

> It is still just an oob store; you have to traverse the directory hierarchy old-style. The old BeOS did (and Haiku today does) something more: it indexed the selected xattrs and allowed live queries against those indices.

You are right, and it would be nice if the file system itself allowed faster queries by xattrs than what can be achieved by testing the attributes of files selected by their names.

That said, it is not difficult to create a database file indexing the files by their attributes, though it may need frequent updates to remain in sync with the indexed file system. I use such databases, but mainly for parts of the file system where I keep files for long-term storage, like books, research papers, handbooks, product documentation or older software projects.
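As a rough illustration of what I mean (a sketch, assuming the tags live in a user.tag xattr and GNU find plus the attr tools are installed; paths are placeholders), even a flat TSV index rebuilt from cron goes a long way:

    # rebuild the tag index for an archive tree
    find /archive -type f -print0 |
      while IFS= read -r -d '' f; do
        tag=$(getfattr --only-values -n user.tag "$f" 2>/dev/null)
        [ -n "$tag" ] && printf '%s\t%s\n' "$tag" "$f"
      done > /archive/.tag-index.tsv

    # query: list every file tagged "research"
    awk -F'\t' '$1 == "research" { print $2 }' /archive/.tag-index.tsv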


Looking over the landed patch (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...), I cannot find a hardcoded list of user.* xattrs that will be accepted.

It looks like it will store as many xattrs as there is space for, based on their contents.


That would be nice, but then the "a limited number" expression does not make much sense.

All the file systems that support extended file attributes have some limit on the size of the attributes attached to a file, for instance 64 kB.

After searching through the patch, it seems that the maximum number of xattrs is 128. There is also a size limit of 128 kB, which might be the limit for the sum of the sizes of all attributes.

If that is the only reason for saying "a limited number", then this is very good: the limits are high enough not to cause any problems in most applications, so tmpfs will finally no longer strip the attributes of files copied through it.
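For anyone who wants to check the behaviour on their own kernel, a quick test (assuming GNU coreutils, the attr tools, a disk-backed /var/tmp and a tmpfs-backed /tmp):

    # create a file with a user xattr on a disk-backed filesystem
    touch /var/tmp/xattr-test
    setfattr -n user.note -v hello /var/tmp/xattr-test

    # copy it into tmpfs, preserving xattrs, and see what survives
    cp --preserve=xattr /var/tmp/xattr-test /tmp/xattr-test
    getfattr -d /tmp/xattr-test   # empty before 6.6 (cp may also warn); should list user.note afterwards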


You might want to consider using FreeBSD then, where there is no licensing issue. I've found it has all the stuff I liked about Linux, and less of the stuff I dislike.


There is also Illumos/OpenIndiana/OmniOS, basically continuations of OpenSolaris. I have used OpenIndiana for many years and it's been dead reliable and stable.

There are quite a few "quality of life" differences, like boot environments (boot into a pre-upgrade OS state, even years old), a built-in SMB server with NFSv4-style ACLs, DTrace, built-in snapshot scheduling and management, and Napp-It is available as a web UI for management a la FreeNAS/TrueNAS.

It has a few differences (service management works quite differently from other systems), but overall it's a very underrated OS, I think.


I haven't looked at those distributions in years. I thought they were niche container host operating systems and not general purpose. I guess I need to do my research.


That's because they are niche container host operating systems. Without support for modern hardware like GPUs, it makes no sense to run them on bare metal.


Sure it does; they are great storage OSes. A NAS doesn't need GPUs.


I was concerned about this aspect initially, but ZoL has been going for many years now. At this point even Debian (the most purist in terms of GPL considerations) includes ZoL in its official repository. It doesn't distribute binaries; apt just builds the kernel module automatically.

Frankly, if even Debian can use it, it's a non-issue.
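Concretely, on Debian it is something like this (a sketch; zfs-dkms lives in the contrib component, so that has to be enabled first):

    # DKMS compiles the CDDL module locally against your kernel headers
    sudo apt install linux-headers-amd64 zfs-dkms zfsutils-linux

Nothing pre-built ever gets shipped, which is the whole point.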


The license issue, even in the very worst case (which is quite questionable; I would say that Linus is a bit paranoid here), could only be a problem for a distro that ships it by default. You as the end user can freely use both Linux and ZFS, as well as their combination.


It's not really an issue with ZFS itself. It's a great filesystem. GPL isn't the only way to do open source. It's the way Linux has chosen and that has pros and cons. Not integrating so well with other open source licenses is one of the cons.


> Not integrating so well with other open source licenses is one of the cons.

That depends on what you want. If you want a license that will play well with closed source software, then yeah, it's a downside. But the GPL family comes from the perspective of a developer who wants to retain their rights while respecting others' desire for the same. If you care about your rights, then this is an upside.


The CDDL (ZFS’s license) is also like that; the problem is exactly the two “strong” copyleft licenses.


https://www.gnu.org/licenses/license-list.en.html#CDDL

>"It has a weak per-file copyleft (like version 1 of the Mozilla Public License)"

The CDDL isn't a strong copyleft licence. It doesn't give a shit what larger project you include the code into, and it doesn't give a shit how you licence the resulting binary. It's considerably more permissive than the GPL.

The conflict is entirely on the GPL side.


Even in such an event, I don’t see how it would affect end users in an extreme way. OpenZFS doesn’t call home to make sure it’s still legal to use, so the worst that could happen is that they are forced to stop distribution and you need to stay on old software until you can move to another storage system.


Not to mention Ubuntu has included ZFS by default since 2016. I still remember all the doomsday predictions people made back then. People on HN and proggit kinda suck at legal matters.


Playing devil's advocate here, but what if Oracle decides to sue you, a company with only a few thousand in revenue, for their multi-million-dollar licensing fee?

I agree this is unlikely, but so is someone being born with as much litigiousness as Larry Ellison.


This is built on a misunderstanding of what the ZFS licensing issue is.

ZFS is licensed under the CDDL [0], which is a copyleft license that grants you full rights to use the ZFS source code for basically anything, providing that any derivative works incorporating its source code are also under the CDDL. There is no "licensing fee" that you can pay Oracle related to this.

Linux is licensed under the GPL, which just like the CDDL, is a copyleft license which requires derivative works to be distributed under the GPL.

The conflict here is that both require the entire derived work to be distributed under the CDDL/GPL respectively, but obviously only one license can be picked... even though, realistically, the GPL and CDDL are quite similar licenses.

The additional fact that is useful to know is that zfs.ko, the zfs kernel module, is distributed under the CDDL by ubuntu, and any other distro distributing it that I know of.

So, with that knowledge, who can sue who and why?

1. The people who can be sued are those distributing compiled ZFS binaries (the 'zfs.ko' kernel module binary). In practice, this means Canonical, and anyone else who distributes disk images using ZFS (such as snapshotted EC2 AMIs).

2. The people who can sue are the holders of the Linux GPL license, and they can do so by claiming zfs.ko violates their copyright by being a derived work of the linux kernel, but being distributed under the CDDL rather than the GPL.

So, first of all, the vast majority of companies using ZFS aren't also distributing binary distributions containing zfs.ko and the Linux kernel together, so this whole thing is moot for them. Realistically, most companies using ZFS only face the risk of _Canonical_ being sued, and that breaking their filesystem, not the risk of running into legal issues themselves.

And second, it would require a holder of Linux copyright to file suit, and realistically, no court could find significant damages here.

Canonical also has lawyers who claim 2 wouldn't legally follow, i.e. that 'zfs.ko' is not in fact a derived work of the Linux kernel. There are differing opinions there, but pretty much everyone agrees that the people who could sue, if this interpretation is wrong, are the holders of Linux copyright, not Oracle.

[0]: https://en.wikipedia.org/wiki/Common_Development_and_Distrib...


You have one crucial fact wrong: The CDDL is actually a weak per-file copyleft.

The CDDL is totally OK with the rest of the Linux kernel staying GPLv2. The GPL is allergic to any non-GPL (e.g. ZFS) code being distributed with the kernel source.


Sure, but I don't think that's crucial here.

Does it change any of the other information? I admit I shouldn't have written "both require the entire derived work to be distributed under CDDL/GPL", and instead should have written "both may require, depending on a court's interpretation, the zfs kernel module to be distributed under CDDL or GPL respectively".

I think the rest of the comment stands though, right?


> "both may require, depending on a court's interpretation, the zfs kernel module to be distributed under CDDL or GPL respectively".

That's wrong, and misses fundamental information about the CDDL.

First, the CDDL gives absolutely no fucks how you licence a resulting binary. It only cares about the specific covered source files, and doesn't care what you bundle them with. So nobody would ever argue that the module should be licenced as CDDL.

On the flip side, you'd have a really hard time arguing that the module should be licensed as GPL.* Sure, there's maybe a GPL wrapper, but lots of modules are non-GPL even when they feature such a wrapper, e.g. nvidia's. You also can't argue ZFS is a derived work of the kernel, which is really the crux of the issue. ZFS was first written on Solaris, developed for several years on FreeBSD, and remains neutral across 3-5 different supported OSes. There's nothing to indicate the GPL should have more claim over ZFS than over the wrapper.

The one and only place you'll ever run into trouble is if you start trying to distribute ZFS *source code* in the kernel source tree, as a fully integrated part of the kernel. (EDIT: or compiled together in a single kernel binary, from said combined source) The CDDL doesn't care about the mixing but the GPL does. Literally everyone agrees this is a bad idea and nobody does it.

*Let's ignore for the moment that you *couldn't*, since the GPL requires complete corresponding source code and you can't relicence the CDDL source code files.


I agree that statement is factually incorrect, but I still don't think it changes the overall information in my comment in any meaningful way. The actual risk is low, the risk is of GPL infringement, not of Oracle suing you, and there are people who contest it on either side. Those are the key points, and those are true, I believe.

> On the flip side, you'd have a really hard time arguing that the module should be licensed as GPL

That is the very thing that some people do argue, for example the sfconservancy people. As I note, Canonical's lawyers take your view (and I personally agree), but it's not clear-cut. The sfconservancy people outline their reasoning clearly for why they think zfs.ko is a derived work of the Linux kernel.

https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/


The SFC can get stuffed. Their argument says nothing in particular about ZFS and is essentially "the kernel is GPL and therefore ALL modules MUST BE GPL".

Their stance isn't just rejected by nvidia and company, but also by the kernel devs, most notably Torvalds himself. Their argument boils down to "if ONLY it were statically linked everyone would say it's derived" combined with "there's no difference between static and dynamic linking."

They completely ignore the inconvenient fact that being statically linked would imply being pasted into the kernel tree, and the real reason people would say it's derived at that point is because you'd expect the ZFS code to be deeply connected with Linux's guts if that were the case. Which it isn't, and it isn't. Over in reality, courts look at how code came to be and where it came from, a more reasonable definition of "derived". For one thing, it means I can't come up with a GPL module that statically links against a pre-existing bit of other software and then try to claim that software should be GPL. (Remember, OpenZFS' fork long predates ZFS' compatibility with Linux.)

Being a derived work shouldn't imply time travel, i.e. at the very minimum shouldn't violate causality.

Torvalds himself has personally refuted the "(dynamic) linking inescapably means you're derived" point on multiple occasions, and yet here the SFC is, continuing to wave it around because that's how they wish it worked.

Their article is very plainly an "extend the GPL's reach" political move, attempting to move the goalposts the kernel community set up wrt proprietary modules while using the muddy waters around ZFS as cover. Their arguments could equally be applied to the nvidia kernel module, and yet would instantly be rejected as extreme if they attempted to use nvidia's module as their subject.


Note that unlike the GPL, the CDDL is file-based. It doesn't require your entire body of software to be CDDL, it just requires that you make modifications to CDDL-licenced files available under the CDDL.


I see, thanks for educating me on this with such an in-depth reply! I simply saw "licensing issue" and "Oracle" together and assumed the standard Oracle licensing worry applied.


On what grounds? ZFS has an open-source license. The only semi-open question related to the topic is packaging it together with another, different open-source license that is viral and mandates more things (the GPL). That part is not being “violated” by an end user, but by the distro maintainer.


Oracle's hands are tied in so far as they can't revoke the CDDL licence that Sun provided.

They could sue you as a kernel contributor, but only if you do something to violate the GPL. The only thing that would clearly violate the GPL, in the era of proprietary nvidia blobs, is if you tried to ship the ZFS source code in-tree.

Everybody agrees that's a bad idea, nobody does it. ZFS instead is shipped as a separate kernel module.


ZFS on Linux, and even Ubuntu distributing the binary form in their base OS, has been going on for years. I think it would be a huge event, and very unlikely to occur, if Oracle tried to pull some licensing bullshit.

As someone ignorant of file system development, I would almost expect the more likely scenario to be BTRFS getting sued for copying a feature of ZFS, or something like that.


There's really no case to be made from the CDDL side. It's a weak per-file copyleft.

If anyone (Oracle or Linus Torvalds) launches a ZFS-related lawsuit, it'll be as an author of GPLv2 kernel code. For the time being the solution has been to ship ZFS separate from the kernel, as any module with a non-GPL licence typically does.

The biggest hurdle to ZFS isn't legal, but technical. The various teams working on ZFS have so far been able to keep up with kernel churn and with symbols (e.g. the FPU ones) being made GPL-only.

That said, the kernel devs have made it clear that they don't care if you're open source or proprietary, they will make changes and mark new symbols GPL-only to fuck with you regardless.


Do you refuse to use NVidia cards for the same reason?


Honestly, the nvidia thing was always annoying, because it seems like the Linux devs are more aggressive about licensing towards an actual full-on open source project than towards the very much closed-source nvidia drivers. They even both use the same approach of having a GPL wrapper to talk to the non-GPL code, AFAIK.


That might be a bad example. I'd trust ZFS and the CDDL before NVIDIA any day. At least the CDDL is designed with the intention of being open and attempts to work with other licenses. NVIDIA seems hellbent on working around the GPL.


The sad part is the kernel devs can't tell the difference. Doesn't matter if something is open source or proprietary, if they can't take it and relicence it under their GPLv2 licence then it deserves no comfort.


The original intention of the license doesn’t really matter when its current copyright holder is Oracle.


Oracle is *a* copyright holder, but they have no control over the direction of the OpenZFS fork.


I’m not concerned about the direction of development, I’m concerned about Oracle holding the (or at least a significant part of) copyright to some software which is used as a Linux module, but which is licensed under a license which is (arguably) incompatible with the GPL.

If Oracle wanted to, they could easily release their copyright under a dual license or something else to clear up the licensing issue. They have not, and have chosen to retain the ability to sue every Linux user using ZFS. This concerns me.


Modules don't have to be under licences compatible with the GPL. There is no universe where you could possibly argue ZFS is a derived work of the kernel. None.

Oracle can't actually sue people using ZFS modules. The CDDL licence that Sun provided already gave all the relevant rights away. They can only sue from the GPL side as a kernel contributor, and that's ONLY if someone is dumb enough to ship ZFS code in the Linux source tree.

You *seriously* underestimate the number of people using ZoL in production that Oracle would love to sue *if they could.*



It doesn’t really matter on whose “side” the conflict lies; the license is arguably GPL-incompatible, and Linux is not going to change its license. But Oracle could, if they wanted to. And they evidently don’t want to, which is concerning.


It's plainly obvious why they don't. Oracle still sells *their* fork of ZFS.

Back in the pre-Oracle days, Sun's Fishworks division sold ZFS-based storage appliances. Oracle still sells those, and thanks to ZFS' reputation they're still profitable. Oracle simply took that much older fork of ZFS (with no OpenZFS code in it) and made it closed-source.

Before anyone asks, no, that fork is too old and too diverged to be compatible with OpenZFS.


All the more reason for Oracle to someday do something to put all those Linux ZFS servers into a questionable legal status. They could then stand with a carrot in their other hand, waving their own unquestionably legal version of “ZFS”.


I don't think you understand how this works. Sun already gave all their power away via the CDDL.

We're currently in a legally-stable state, and have been for over a decade now (nearly two). If they could have done something, they would have.

The fact is that so long as nobody tries to merge ZFS code into the mainline kernel, there's nothing they can do. Everybody knows this, everybody agrees that would be a dumb idea, and everybody has agreed not to do it.


No choice on that front.


New drivers are MIT/GPL2 dual-licensed: https://github.com/NVIDIA/open-gpu-kernel-modules


Doesn't that still require the closed source user space driver to function? I'd love to be wrong on that one.


In the short term, yes; in the long term the user space driver would also be opened up (or replaced with nouveau, since it will be on equal footing with the closed one when communicating with the kernel, just like the open vs closed AMD drivers).

However, what will be left is a blob that runs on the dedicated core on the GPU. The stuff that Nvidia doesn't want to open is moved there, and that's why the open kernel driver works only on Turing and newer.


> However, what will be left is a blob, that runs on the dedicated core on the GPU

But that's basically firmware, right? So not all that different from the firmware loaded by a NIC?


Yes, exactly the same thing.



