Things Nobody Told You About ZFS
Latest update 9/12/2013 – Hot Spare, 4K Sector and ARC/L2ARC sections edited, note on ZFS Destroy section, minor edit to Compression section.
There are a couple of things about ZFS itself that are often skipped over or missed by users and administrators. Many deploy home or business production systems without even being aware of these gotchas and architectural issues. Don’t be one of those people!
I do not want you to read this and think “ugh, forget ZFS”. Every other filesystem I’m aware of has as many issues as ZFS, and more – going another route because of perceived or actual issues with ZFS is like jumping into the hungry shark tank with a bleeding leg wound, instead of the goldfish tank, because the goldfish tank smelled a little fishy! Not a smart move.
ZFS is one of the most powerful, flexible, and robust filesystems (and I use that word loosely, as ZFS is much more than just a filesystem, incorporating many elements of what is traditionally called a volume manager as well) available today. On top of that it’s open source and free (as in beer) in some cases, so there’s a lot there to love.
However, like every other man-made creation ever dreamed up, it has its own share of caveats, gotchas, hidden “features” and so on – the sorts of things an administrator should be aware of before they lead to a 3 AM phone call! Due to its relative newness in the world (compared to venerable filesystems like NTFS, ext2/3/4, and so on), and its very different architecture paired with very similar nomenclature, certain things can be ignored or assumed by potential adopters of ZFS that lead to costly issues and lots of stress later.
I make various statements in here that might be difficult to understand or that you may disagree with – often without wholly explaining why I’ve directed the way I have. I will endeavor to produce articles explaining them and update this blog with links, as time allows. In the interim, please understand that I’ve been on literally thousands of large ZFS deployments in the last 2+ years, often called in when they were broken, and much of what I say is backed by quite a bit of experience. This article is also often used, cited, and reviewed by many of my fellow ZFS support personnel, so it gets around, and mistakes in it get back to me eventually. I can be wrong – but especially if you’re new to ZFS, you’re going to be better served not assuming I am. 🙂
1. Virtual Devices Determine IOPS
2. Deduplication Is Not Free
Every block of data in a dedup’ed filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries cost RAM. It is not uncommon for DDT’s to grow larger than available RAM on zpools that aren’t even that large (a couple of terabytes). If hits against the DDT aren’t serviced primarily from RAM or fast SSD, performance quickly drops to abysmal levels. And because enabling or disabling deduplication in ZFS does nothing to data already on disk, do not enable deduplication without a full understanding of its requirements and architecture first – you will be hard-pressed to get rid of it later.
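To get a feel for the scale involved, here is a back-of-envelope sketch. The 320-bytes-per-entry figure and the pool numbers are rough assumptions for illustration, not guarantees:

```shell
# Rough DDT RAM estimate for a hypothetical pool.
# ~320 bytes of in-core RAM per unique block is a commonly cited ballpark.
POOL_BYTES=$(( 2 * 1024 * 1024 * 1024 * 1024 ))   # 2 TiB of unique data
AVG_BLOCK=65536                                    # 64 KiB average block size
ENTRY_BYTES=320                                    # approx. in-core DDT entry size

BLOCKS=$(( POOL_BYTES / AVG_BLOCK ))
DDT_MB=$(( BLOCKS * ENTRY_BYTES / 1024 / 1024 ))
echo "~${DDT_MB} MB of RAM just for the DDT"       # ~10 GB for 2 TiB of data
```

Before enabling dedup for real, `zdb -S <pool>` can simulate the dedup ratio and table histogram against data already in the pool.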
3. Snapshots Are Not Backups
This is critically important to understand. ZFS has redundancy levels from mirrors and raidz. It has checksums and scrubs to help catch bit rot. It has snapshots to take lightweight point-in-time captures of data to let you roll back or grab older versions of files. It has all of these things to help protect your data. And one ‘zfs destroy’ by a disgruntled employee, one fire in your datacenter, one random chance of bad luck that causes a whole backplane, JBOD, or a number of disks to die at once, one faulty HBA, one hacker, one virus, etc, etc, etc — and poof, your pool is gone. I’ve seen it. Lots of times. MAKE BACKUPS.
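A real backup is a copy on separate hardware, ideally in a separate location. One common approach is replicating snapshots to a second machine with zfs send/receive – sketched below with placeholder names (tank/data, backuphost, and backuppool are hypothetical; substitute your own):

```shell
# Take a snapshot and replicate it in full to another box.
zfs snapshot tank/data@2013-09-12
zfs send tank/data@2013-09-12 | ssh backuphost zfs receive -F backuppool/data

# Later runs only need to send the delta between two snapshots.
zfs snapshot tank/data@2013-09-13
zfs send -i tank/data@2013-09-12 tank/data@2013-09-13 | \
    ssh backuphost zfs receive backuppool/data
```

The remote copy lives in its own pool with its own snapshots, so a “zfs destroy” (or a fire, or a dead backplane) on the source cannot reach it.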
4. ZFS Destroy Can Be Painful
Something often glossed over or not discussed about ZFS is how it presently handles destroy tasks. This is specific to the “zfs destroy” command, be it used on a zvol, filesystem, clone or snapshot. It does not apply to deleting files within a ZFS filesystem (unless that file is very large – for instance, if a single file is all that a whole filesystem contains), nor to deleting files on the filesystem formatted onto a zvol, nor to “zpool destroy”. ZFS destroy tasks are potential downtime causers when not properly understood and treated with the respect they deserve. Many a SAN has suffered impacted performance or a full service outage due to a “zfs destroy” in the middle of the day on just a couple of terabytes (no big deal, right?) of data. The truth is that a “zfs destroy” has to go touch many of the metadata blocks related to the object(s) being destroyed. Depending on the block size of the destroy target(s), the number of metadata blocks that have to be touched can quickly reach into the millions, even hundreds of millions.
If a destroy needs to touch 100 million blocks, and the zpool’s IOPS potential is 10,000, how long will that zfs destroy take? About 10,000 seconds – nearly 3 hours! And that’s a good scenario – ask any long-time ZFS support person or administrator and they’ll tell you horror stories about day-long, even week-long “zfs destroy” commands. There is work that can make this less painful (a major improvement is in progress right now), and there are a few things that can mitigate it, but at the end of the day, always check the actual used disk size of anything you’re about to destroy, and potentially hold off if it’s significant. How big is too big? That is a function of block size, pool IOPS potential, and extenuating circumstances (current I/O workload of the pool, deduplication on or off, and a few other things).
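The arithmetic behind that estimate, using the same hedged example numbers:

```shell
# How long a metadata-bound destroy takes, to a first approximation:
# one metadata block touched per pool I/O.
BLOCKS_TO_TOUCH=100000000   # metadata blocks the destroy must visit
POOL_IOPS=10000             # pool's random IOPS potential
TOTAL_SECS=$(( BLOCKS_TO_TOUCH / POOL_IOPS ))
echo "~${TOTAL_SECS} seconds ($(( TOTAL_SECS / 3600 ))h $(( TOTAL_SECS % 3600 / 60 ))m)"
```

Plug in a 4K-heavy zvol (far more blocks per terabyte) or a pool already busy with client I/O, and the numbers get much worse.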
5. RAID Cards vs HBA’s
6. SATA vs SAS
7. Compression Is Good (Even When It Isn’t)
8. RAIDZ – Even/Odd Disk Counts
9. Pool Design Rules
- Do not use raidz1 for disks 1TB or greater in size.
- For raidz1, do not use fewer than 3 disks, nor more than 7 disks in each vdev (and again, they should be under 1 TB in size, preferably under 750 GB; 5 is a typical count).
- For raidz2, do not use fewer than 6 disks, nor more than 10 disks in each vdev (8 is a typical count).
- For raidz3, do not use fewer than 7 disks, nor more than 15 disks in each vdev (13 or 15 is typical).
- Mirrors trump raidz almost every time. A mirror pool has far higher IOPS potential than any raidz pool, given an equal number of drives. The only downside is redundancy – raidz2/3 are safer, but much slower. The only way to avoid trading performance for safety is 3-way mirrors, which sacrifice a ton of space (but I have seen customers do this – if your environment demands it, the cost may be worth it).
- For >= 3TB size disks, 3-way mirrors begin to become more and more compelling.
- Never mix disk sizes (within a few %, of course) or speeds (RPM) within a single vdev.
- Never mix disk sizes (within a few %, of course) or speeds (RPM) within a zpool, except for l2arc & zil devices.
- Never mix redundancy types for data vdevs in a zpool (no raidz1 vdev and 2 raidz2 vdevs, for example)
- Never mix disk counts on data vdevs within a zpool (if the first data vdev is 6 disks, all data vdevs should be 6 disks).
- If you have multiple JBOD’s, try to spread each vdev out so that as few of its disks as possible are in any one JBOD. If you do this with enough JBOD’s for your chosen redundancy level, you can end up with no SPOF (Single Point of Failure) at the JBOD level, and if the JBOD’s themselves are spread out amongst sufficient HBA’s, you can remove HBA’s as a SPOF as well.
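To make the JBOD-spreading rule concrete, here is a sketch of a pool whose raidz2 vdevs each take exactly one disk from each of six JBODs. The Solaris-style c#t#d# device names are hypothetical; map them to your own cabling:

```shell
# Each vdev spans all six JBODs (c1..c6), one disk per JBOD. Losing any
# single JBOD therefore costs each raidz2 vdev only one disk, which its
# two-disk redundancy absorbs -- the pool stays online.
zpool create tank \
    raidz2 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 \
    raidz2 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 \
    raidz2 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0 c6t2d0
```

The same pattern with mirrors needs only two JBODs (one side of each mirror per JBOD), which is part of why mirror pools are easier to lay out safely.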
10. 4KB Sector Disks
There are a number of in-the-wild devices that use a 4KB sector size instead of the old 512-byte sector size. ZFS handles this just fine if it knows the disk uses 4K sectors. The problem is that a number of these devices lie to the OS about their sector size, claiming it is 512 bytes (in order to be compatible with ancient Operating Systems like Windows 95); this will cause significant performance issues if not dealt with at zpool creation time.
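On platforms whose zpool command exposes it (ZFS on Linux and FreeBSD do; illumos derives the value from what the drive reports, so lying drives there need an sd.conf physical-block-size override instead), the fix is forcing the ashift at creation time. A sketch, with hypothetical device names:

```shell
# ashift is the per-vdev sector size as a power of two: 2^12 = 4096 bytes.
# It is fixed when the vdev is created and cannot be changed afterward.
zpool create -o ashift=12 tank mirror c0t0d0 c0t1d0

# Verify what the pool actually chose:
zdb -C tank | grep ashift    # expect "ashift: 12" on 4K vdevs
```

Getting this wrong means every 4K physical write may become a read-modify-write cycle, which is where the performance pain comes from.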
11. ZFS Has No “Restripe”
12. Hot Spares
For a bit of clarification, the main reasoning behind this has to do with the way hot spares are presently handled by ZFS and Solaris FMA: the whole mechanism for identifying a failed drive and choosing its replacement is far too simplistic to be useful in many situations. For instance, if you create a pool designed to have no SPOF in terms of JBOD’s and HBA’s, and even go so far as to put hot spares in each JBOD, the code presently in illumos (9/12/2013) has nothing in it to understand that you did this, and it is sheer chance whether, when a disk dies, it picks the hot spare in the same JBOD to resilver to. More likely it just picks the first hot spare in the spares list, which is probably in a different JBOD, and now your pool has a SPOF.
Further, it isn’t intelligent enough to understand catastrophic loss. Say you again have a pool set up so that the HBA’s and JBOD’s present no SPOF, and you lose an HBA and the JBOD connected to it – you had 40 drives in mirrors, and now you are only seeing half of each mirror. You also have a few hot spares, say 2, in the JBOD that is still visible. Obviously, picking 2 random mirrors and starting to resilver them from those spares is silly: you lost a whole JBOD, all your mirrors have dropped to a single drive, and the only logical solution is getting the other JBOD back online (or, if it somehow went nuts, attaching a whole new JBOD full of drives to the existing mirrors). Resilvering 2 of your 20 mirror vdevs to hot spares in the still-visible JBOD is a waste of time at best and dangerous at worst – and it is GOING to do it.
What I tend to tell customers when the hot spare discussion comes up actually starts with a question. The multi-part question is this: how many hours could possibly pass before your team is able to log in to the SAN remotely after receiving an alert that there’s been a disk loss event, and how many hours could possibly pass before your team is able to physically arrive to replace a disk after receiving such an alert?
The idea, of course, is to determine if hot spares are seemingly required, or if warm spares would do, or if cold spares are acceptable. Here’s the ruleset in my head that I use after they tell me the answers to that question (and obviously, this is just my opinion on the numbers to use):
- Under 24 hours for remote access, but physical access or a lack of on-hand disks could mean physical replacement takes longer:
  - Warm spares
- Under 24 hours for remote access, and physical access with replacement disks is available by that point as well:
  - Pool is 2-way mirror or raidz1 vdevs: warm spares
  - Pool is >2-way mirror or raidz2/3 vdevs: cold spares
- Over 24 hours for remote or physical access, and the pool is 2-way mirror or raidz1 vdevs:
  - Hot spares start to become a potential risk worth taking, but a serious discussion about best practices and risks has to be had first. Often, if the timeline is 48-72 hours, warm or cold spares may still make sense depending on pool layout. Beyond 72 hours, hot spares become something of a requirement to cover the situations where they help – but at that point a discussion also needs to be had about why the environment has a > 72 hour window in which a replacement disk isn’t available.
I’d have to make one huge bullet list to try to cover every possible contingency here – each customer is unique, but these are general guidelines. Remember, it takes a significant amount of time to resilver a disk, so adding in X additional hours does not add a lot of risk, especially for 3-way or higher mirrors and raidz2/3 vdevs, which can already handle multiple failures.
13. ZFS Is Not A Clustered Filesystem
14. To ZIL, Or Not To ZIL
So with that explained, the real question is: do you need to direct those writes to a separate device from the pool data disks or not? In general, you do if one or more of the intended use-cases of the storage server is very write-latency sensitive, or if the total combined IOPS requirement of the clients approaches, say, 30% of the raw pool IOPS potential of the zpool. In such scenarios, the addition of a log vdev can have an immediate and noticeable positive performance impact. If neither of those is true, you can likely skip a log device and be perfectly happy. Most home systems, for example, have no need of a log device and won’t miss it. Many small office environments using ZFS as a simple file store will also not require one. Larger enterprises or latency-sensitive storage will generally require fast log devices.
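If you do add a log vdev, mirror it. A sketch, with hypothetical SSD device names:

```shell
# Add a mirrored log vdev built from two fast, low-latency SSD's.
# Mirroring matters: an unmirrored slog that dies while holding
# not-yet-committed transactions can cost you those writes.
zpool add tank log mirror c2t0d0 c2t1d0

# Afterward, watch how much sync-write traffic actually lands on the log.
zpool iostat -v tank 5
```

If `zpool iostat -v` shows the log device nearly idle under your real workload, the clients aren’t issuing many sync writes and the slog is buying you little.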
15. ARC and L2ARC
One of ZFS’ strongest performance features is its intelligent caching mechanisms. The primary cache, stored in RAM, is the ARC (Adaptive Replacement Cache). The secondary cache, typically stored on fast media like SSD’s, is the L2ARC (second level ARC). Basic rule of thumb in almost all scenarios is don’t worry about L2ARC, and instead just put as much RAM into the system as you can, within financial realities. ZFS loves RAM, and it will use it – there is a point of diminishing returns depending on how big the total working set size really is for your dataset(s), but in almost all cases, more RAM is good. If your use-case does lend itself to a situation where RAM will be insufficient and L2ARC is going to end up being necessary, there are rules about how much addressable L2ARC one can have based on how much ARC (RAM) one has.
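Part of the reason RAM comes first is that L2ARC is not free: every record cached on L2ARC needs a header kept in ARC, i.e. in RAM. The figures below are rough assumptions for illustration (~200 bytes per header, 8 KiB records), not exact constants:

```shell
# ARC overhead of tracking a large L2ARC, back-of-envelope.
L2ARC_BYTES=$(( 512 * 1024 * 1024 * 1024 ))   # 512 GiB of L2ARC SSD
RECORD_BYTES=8192                              # small-record workload
HDR_BYTES=200                                  # approx. per-record header in RAM

HEADERS_MB=$(( L2ARC_BYTES / RECORD_BYTES * HDR_BYTES / 1024 / 1024 ))
echo "~${HEADERS_MB} MB of ARC consumed just to index the L2ARC"
```

A big L2ARC on a RAM-starved box can thus shrink the useful ARC and make performance worse, which is exactly why the sizing rules exist.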
16. Just Because You Can, Doesn’t Mean You Should
It is very rare for a company to need 1 PB of space in one filesystem, even if it does need 1 PB in total. Find a logical separation and build to meet it, rather than going crazy and trying to build a single 1 PB zpool. ZFS may let you, but various hardware constraints will inevitably doom the attempt, or create an environment that works but could have worked far better at the same or even lower cost.
Learn from Google, Facebook, Amazon, Yahoo and every other company with a huge server deployment — they learned to scale out, with lots of smaller systems, because scaling up with giant systems not only becomes astronomically expensive, it quickly ends up being a negative ROI versus scaling out.