Thursday, March 21, 2013

ZFS: Read Me 1st

Things Nobody Told You About ZFS

Yes, it's back. You may also notice it is now hosted on my Blogger page - just don't have time to deal with self-hosting at the moment, but I've made sure the old URL redirects here.

So, without further ado...

Foreword

I will be updating this article over time, so check back now and then.

Latest update 9/12/2013 - Hot Spare, 4K Sector and ARC/L2ARC sections edited, note on ZFS Destroy section, minor edit to Compression section.

There are a couple of things about ZFS itself that are often skipped over or missed by users/administrators. Many deploy home or business production systems without even being aware of these gotchas and architectural issues. Don't be one of those people!

I do not want you to read this and think "ugh, forget ZFS". Every other filesystem I'm aware of has as many or more issues than ZFS - going another route than ZFS because of perceived or actual issues with ZFS is like jumping into the hungry shark tank with a bleeding leg wound, instead of the goldfish tank, because the goldfish tank smelled a little fishy! Not a smart move.

ZFS is one of the most powerful, flexible, and robust filesystems (and I use that word loosely, as ZFS is much more than just a filesystem, incorporating many elements of what is traditionally called a volume manager as well) available today. On top of that it's open source and free (as in beer) in some cases, so there's a lot there to love.

However, like every other man-made creation ever dreamed up, it has its own share of caveats, gotchas, hidden "features" and so on. The sorts of things that an administrator should be aware of before they lead to a 3 AM phone call! Due to its relative newness in the world (as compared to venerable filesystems like NTFS, ext2/3/4, and so on), and its very different architecture, yet very similar nomenclature, certain things can be ignored or assumed by potential adopters of ZFS that can lead to costly issues and lots of stress later.

I make various statements in here that might be difficult to understand or that you disagree with - and often without wholly explaining why I've directed the way I have. I will endeavor to produce articles explaining them and update this blog with links to them, as time allows. In the interim, please understand that I've been on literally 1000's of large ZFS deployments in the last 2+ years, often called in when they were broken, and much of what I say is backed up by quite a bit of experience. This article is also often used, cited, reviewed, and so on by many of my fellow ZFS support personnel, so it gets around and mistakes in it get back to me eventually. I can be wrong - but especially if you're new to ZFS, you're going to be better served not assuming I am. :)

1. Virtual Devices Determine IOPS

IOPS (I/O per second) are mostly a factor of the number of virtual devices (vdevs) in a zpool. They are not a factor of the raw number of disks in the zpool. This is probably the single most important thing to realize and understand, and it commonly is not.

ZFS stripes writes across vdevs (not individual disks). A vdev is typically IOPS bound to the speed of the slowest disk within it. So if you have one vdev of 100 disks, your zpool's raw IOPS potential is effectively only a single disk, not 100. There are a couple of caveats here (such as the difference between write and read IOPS, etc), but if you just put as a rule of thumb in your head that a zpool's raw IOPS potential is equivalent to the single slowest disk in each vdev, added up across all the vdevs in the zpool, you won't end up surprised or disappointed.
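To make that rule of thumb concrete, here is a minimal Python sketch. The function name and the 150 IOPS per disk figure are assumptions for illustration only, not measurements, and it ignores the read/write caveats mentioned above.

```python
def pool_iops_estimate(vdevs):
    """Estimate raw pool IOPS: each vdev contributes its slowest disk's IOPS.

    `vdevs` is a list of vdevs, each a list of per-disk IOPS figures.
    """
    return sum(min(disk_iops) for disk_iops in vdevs)


# Hypothetical numbers: 100 disks at an assumed 150 IOPS each.
one_wide_vdev = [[150] * 100]        # a single 100-disk vdev
fifty_mirrors = [[150, 150]] * 50    # the same disks as 50 two-way mirrors

print(pool_iops_estimate(one_wide_vdev))  # 150
print(pool_iops_estimate(fifty_mirrors))  # 7500
```

Same disks, same count, and a 50x difference in the estimate purely from how they are grouped into vdevs.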

2. Deduplication Is Not Free

Another common misunderstanding is that ZFS deduplication, since its inclusion, is a nice, free feature you can enable to hopefully gain space savings on your ZFS filesystems/zvols/zpools. Nothing could be farther from the truth. Unlike a number of other deduplication implementations, ZFS deduplication is on-the-fly as data is read and written. This creates a number of architectural challenges that the ZFS team had to conquer, and the methods by which this was achieved lead to a significant and sometimes unexpectedly high RAM requirement.

Every block of data in a dedup'ed filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries need RAM. It is not uncommon for DDT's to grow to sizes larger than available RAM on zpools that aren't even that large (couple of TB's). If the hits against the DDT aren't being serviced primarily from RAM or fast SSD, performance quickly drops to abysmal levels. Because enabling/disabling deduplication within ZFS doesn't actually do anything to data already on disk, do not enable deduplication without a full understanding of its requirements and architecture first. You will be hard-pressed to get rid of it later.
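A rough back-of-the-envelope sketch of why the DDT outgrows RAM so quickly. The 320 bytes per in-core DDT entry used below is a commonly quoted rule of thumb and is an assumption here, not a figure from this article; the function also assumes the worst case where every block is unique and gets its own entry.

```python
def ddt_ram_bytes(used_bytes, avg_block_bytes, bytes_per_entry=320):
    """Worst-case DDT size: one entry per block of deduplicated data.

    bytes_per_entry=320 is an assumed rule-of-thumb value, not a measurement.
    """
    entries = used_bytes // avg_block_bytes
    return entries * bytes_per_entry


TIB = 1024 ** 4
GIB = 1024 ** 3

# 2 TiB of data at two different average block sizes.
print(ddt_ram_bytes(2 * TIB, 8 * 1024) / GIB)    # 80.0
print(ddt_ram_bytes(2 * TIB, 128 * 1024) / GIB)  # 5.0
```

Under these assumptions, a couple of terabytes of small-block data can want tens of gigabytes of RAM for the DDT alone, which matches the "pools that aren't even that large" warning above.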

3. Snapshots Are Not Backups

This is critically important to understand. ZFS has redundancy levels from mirrors and raidz. It has checksums and scrubs to help catch bit rot. It has snapshots to take lightweight point-in-time captures of data to let you roll back or grab older versions of files. It has all of these things to help protect your data. And one 'zfs destroy' by a disgruntled employee, one fire in your datacenter, one random chance of bad luck that causes a whole backplane, JBOD, or a number of disks to die at once, one faulty HBA, one hacker, one virus, etc, etc, etc -- and poof, your pool is gone. I've seen it. Lots of times. MAKE BACKUPS.

4. ZFS Destroy Can Be Painful

(9/12/2013) A few illumos-based OSes are now shipping ZFS with the "async destroy" feature. That has a significant mitigating impact on the below text, and ZFS destroys, while they still have to do the work, do so in the background in a less performance and stability damaging manner. However, not all shipping OSes have this code in them yet (for instance, NexentaStor 3.x does not). If your ZFS has feature flag support, it might have async destroy; if it is still using the old 'zpool version' method, it probably doesn't.

Something often glossed over or not discussed about ZFS is how it presently handles destroy tasks. This is specific to the "zfs destroy" command, be it used on a zvol, filesystem, clone or snapshot. This does not apply to deleting files within a ZFS filesystem (unless that file is very large - for instance, if a single file is all that a whole filesystem contains) or on the filesystem formatted onto a zvol, etc. It also does not apply to "zpool destroy". ZFS destroy tasks are potential downtime causers, when not properly understood and treated with the respect they deserve. Many a SAN has suffered impacted performance or full service outages due to a "zfs destroy" in the middle of the day on just a couple of terabytes (no big deal, right?) of data. The truth is a "zfs destroy" is going to go touch many of the metadata blocks related to the object(s) being destroyed. Depending on the block size of the destroy target(s), the number of metadata blocks that have to be touched can quickly reach into the millions, even the hundreds of millions.

If a destroy needs to touch 100 million blocks, and the zpool's IOPS potential is 10,000, how long will that zfs destroy take? Somewhere around 2 3/4 hours (100,000,000 / 10,000 = 10,000 seconds)! That's a good scenario - ask any long-time ZFS support person or administrator and they'll tell you horror stories about day long, even week long "zfs destroy" commands. There's eventual work that can be done to make this less painful (a major one is in the works right now) and there are a few things that can be done to mitigate it, but at the end of the day, always check the actual used disk size of something you're about to destroy and potentially hold off on that destroy if it's significant. How big is too big? That is a factor of block size, pool IOPS potential, and extenuating circumstances (current I/O workload of the pool, deduplication on or off, a few other things).
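The arithmetic is simple enough to sketch in a few lines of Python. This assumes one I/O per block touched and a pool doing nothing else, which is the optimistic case; the helper names are made up for this example.

```python
def blocks_in(used_bytes, block_bytes):
    """Number of blocks of block_bytes needed to hold used_bytes (rounded up)."""
    return -(-used_bytes // block_bytes)


def destroy_hours(blocks_to_touch, pool_iops):
    """Optimistic destroy time: one I/O per block, pool otherwise idle."""
    return blocks_to_touch / pool_iops / 3600


# The example from the text: 100 million blocks at 10,000 IOPS.
print(round(destroy_hours(100_000_000, 10_000), 2))  # 2.78

# Hypothetical: 2 TiB used at an 8 KiB block size, same pool.
blocks = blocks_in(2 * 1024 ** 4, 8 * 1024)
print(blocks)                                   # 268435456
print(round(destroy_hours(blocks, 10_000), 1))  # 7.5
```

Note how much the block size matters: the same 2 TiB at 128 KiB blocks is one sixteenth the block count, and so one sixteenth the time under the same assumptions.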

5. RAID Cards vs HBA's

ZFS provides RAID, and does so with a number of improvements over most traditional hardware RAID card solutions. ZFS uses block-level logic for things like rebuilds, it has far better handling of disk loss & return due to the ability to rebuild only what was missed instead of rebuilding the entire disk, it has access to more powerful processors than the RAID card and far more RAM as well, it does checksumming and auto-correction based on it, etc. Many of these features are gone or useless if the disks provided to ZFS are, in fact, RAID LUN's from a RAID card, or even RAID0 single-disk entities offered up. 

If your RAID card doesn't support a true "JBOD" (sometimes referred to as "passthrough") mode, don't use it if you can avoid it. Creating single-disk RAID0's (sometimes called "virtual drives") and then letting ZFS create a pool out of those is better than creating RAID sets on the RAID card itself and offering those to ZFS, but only about 50% better, and still 50% worse than JBOD mode or a real HBA. Use a real HBA - don't use RAID cards.

6. SATA vs SAS

This has been a long-standing argument in the ZFS world. Simple fact is, the majority of ZFS storage appliances, most of the consultants and experts you'll talk to, and the majority of enterprise installations of ZFS are using SAS disks. To be clear, "nearline" SAS (7200 RPM SAS) is fine, but what will often get you in trouble is the use of SATA (including enterprise-grade) disks behind bad interposers (which is most of them) and SAS expanders (which almost every JBOD is going to be utilizing).

Plan to purchase SAS disks if you're deploying a 'production' ZFS box. In any decent-sized deployment, they're not going to have much of a price delta over equivalent SATA disks. The only exception to this rule is home and very small business use-cases -- and for more on that, I'll try to wax on about it in a post later.

7. Compression Is Good (Even When It Isn't)

It is the very rare dataset or use-case that I run into these days where compress=on (lzjb) doesn't make sense. It is on by default on most ZFS appliances, and that is my recommendation. Turn it on, and don't worry about it. Even if you discover that your compression ratio is nearly 0% - it still isn't hurting you enough to turn it off, generally speaking. Other compression algorithms such as gzip are another matter entirely, and in almost all cases should be strongly avoided. I do see environments using gzip for datasets they truly do not care about performance on (long-term archival, etc). In my experience if that is the case, go with gzip-9, as the performance difference between gzip-1 and gzip-9 is minimal (when compared to lzjb or off). You're going to get the pain, so you may as well go for the best compression ratio.

8. RAIDZ - Even/Odd Disk Counts

Try (and not very hard) to keep the number of data disks in a raidz vdev to an even number. This means if it's raidz1, the total number of disks in the vdev would be an odd number. If it is raidz2, an even number, and if it is raidz3, an odd number again. Breaking this rule has very little repercussion, however, so you should do so if your pool layout would be nicer by doing so (like to match things up on JBOD's, etc).
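If the odd/even bookkeeping is confusing, it falls out of subtracting the parity disks (1, 2 or 3 for raidz1, raidz2 and raidz3) from the total. A small Python sketch, with function names invented for the example:

```python
PARITY_DISKS = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def data_disks(level, total_disks):
    """Data disks in a raidz vdev: total disks minus parity disks."""
    return total_disks - PARITY_DISKS[level]


def has_even_data_disks(level, total_disks):
    """True if the vdev follows the 'even number of data disks' guideline."""
    return data_disks(level, total_disks) % 2 == 0


print(has_even_data_disks("raidz1", 5))   # True  (4 data + 1 parity)
print(has_even_data_disks("raidz2", 8))   # True  (6 data + 2 parity)
print(has_even_data_disks("raidz3", 13))  # True  (10 data + 3 parity)
print(has_even_data_disks("raidz2", 7))   # False (5 data + 2 parity)
```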

9. Pool Design Rules

I've got a variety of simple rules I tell people to follow when building zpools:
  • Do not use raidz1 for disks 1TB or greater in size.
  • For raidz1, do not use less than 3 disks, nor more than 7 disks in each vdev (and again, they should be under 1 TB in size, preferably under 750 GB in size) (5 is a typical average).
  • For raidz2, do not use less than 6 disks, nor more than 10 disks in each vdev (8 is a typical average).
  • For raidz3, do not use less than 7 disks, nor more than 15 disks in each vdev (13 & 15 are typical average).
  • Mirrors trump raidz almost every time. Far higher IOPS potential from a mirror pool than any raidz pool, given equal number of drives. Only downside is redundancy - raidz2/3 are safer, but much slower. Only way that doesn't trade off performance for safety is 3-way mirrors, but it sacrifices a ton of space (but I have seen customers do this - if your environment demands it, the cost may be worth it).
  • For >= 3TB size disks, 3-way mirrors begin to become more and more compelling.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a single vdev.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a zpool, except for l2arc & zil devices.
  • Never mix redundancy types for data vdevs in a zpool (no raidz1 vdev and 2 raidz2 vdevs, for example)
  • Never mix disk counts on data vdevs within a zpool (if the first data vdev is 6 disks, all data vdevs should be 6 disks).
  • If you have multiple JBOD's, try to spread each vdev out so that the minimum number of disks are in each JBOD. If you do this with enough JBOD's for your chosen redundancy level, you can even end up with no SPOF (Single Point of Failure) in the form of JBOD, and if the JBOD's themselves are spread out amongst sufficient HBA's, you can even remove HBA's as a SPOF.
If you keep these in mind when building your pool, you shouldn't end up with something tragic.
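A few of the rules above are mechanical enough to check in code. This is only a sketch of the width, raidz1-size and no-mixing rules as I've stated them; the function names and the (level, disks, disk_tb) tuple layout are invented for the example, and it does not cover mirrors, JBOD spreading or RPM mixing.

```python
# (min, max) disks per vdev, taken from the rules above.
WIDTH_LIMITS = {"raidz1": (3, 7), "raidz2": (6, 10), "raidz3": (7, 15)}


def check_vdev(level, disks, disk_tb):
    """Return a list of rule violations for one raidz vdev."""
    problems = []
    low, high = WIDTH_LIMITS[level]
    if not low <= disks <= high:
        problems.append(f"{level} with {disks} disks (want {low}-{high})")
    if level == "raidz1" and disk_tb >= 1:
        problems.append("raidz1 on disks of 1TB or greater")
    return problems


def check_pool(vdevs):
    """vdevs: list of (level, disks, disk_tb) tuples for the data vdevs."""
    problems = []
    for index, (level, disks, disk_tb) in enumerate(vdevs):
        for problem in check_vdev(level, disks, disk_tb):
            problems.append(f"vdev {index}: {problem}")
    if len({level for level, _, _ in vdevs}) > 1:
        problems.append("mixed redundancy types")
    if len({disks for _, disks, _ in vdevs}) > 1:
        problems.append("mixed disk counts")
    return problems


print(check_pool([("raidz2", 8, 4.0)] * 3))                    # []
print(check_pool([("raidz2", 6, 2.0), ("raidz2", 8, 2.0)]))    # ['mixed disk counts']
```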

10. 4KB Sector Disks

(9/12/2013) The likelihood of this being an issue for you is presently very up in the air, very dependent on OS choice at the moment. There are more 4K disks out there, including some SSD's, and still some that are lying and claiming 512. However, there is also work being done to hard-code in recognition of these disks in illumos and so on. My blog post on here talking about my home BSD-based ZFS SAN has instructions on how to manually force recognition of 4K sector disks if they're not reporting on BSD, but it is not as easy on illumos derivatives as they do not have 'geom'. All I can suggest at the moment is Googling about zfs and "ashift" and your chosen OS and OS version -- not only does that vary the answer, but I myself am not spending any real time keeping track, so all I can suggest is do your own homework right now. I also do not recommend mixing -- if your pool started off with one sector size, keep it that way if you grow it or replace any drives. Do not mix/match.

There are a number of in-the-wild devices that are 4KB sector size instead of the old 512-byte sector size. ZFS handles this just fine if it knows the disk is 4K sector size. The problem is a number of these devices are lying to the OS about their sector size, claiming it is 512-byte (in order to be compatible with ancient Operating Systems like Windows 95); this will cause significant performance issues if not dealt with at zpool creation time.
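For reference when you go searching, "ashift" is the base-2 logarithm of the sector size the pool assumes for a vdev, so 512-byte sectors mean ashift 9 and 4KB sectors mean ashift 12. A tiny Python sketch of that relationship (the function name is made up; this does not query or set anything on a real pool):

```python
def ashift_for(sector_bytes):
    """Return log2 of the sector size; sector sizes must be powers of two."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


print(ashift_for(512))   # 9
print(ashift_for(4096))  # 12
```

A disk that is really 4KB but reports 512 ends up in a pool created with ashift 9, which is the mismatch that causes the performance problems described above.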

11. ZFS Has No "Restripe"

If you're familiar with traditional RAID arrays, then the term "restripe" is probably in your vocabulary. Many people in this boat are surprised to hear that ZFS has no equivalent function at all. The method by which ZFS delivers data to the pool has a long-term equivalent to this functionality, but not an up-front way nor a command that can be run to kick off such a thing. 

The most obvious task where this shows up is when you add a vdev to an existing zpool. You could be forgiven for expecting that the existing data in the pool would slide over and all your vdevs would end up of roughly equal used size (rebalancing is another term for this), since that's what a traditional RAID array would do. ZFS? It won't. That data balancing will only come as an indirect result of rewrites. If you only ever read from your pool, it'll never happen. Bear this in mind when designing your environment and making initial purchases. It is almost never a good idea, performance wise, to start off with a handful of disks if within a year or two you expect to grow that pool to a significantly larger size, adding in small numbers of disks every X weeks/months.

12. Hot Spares

Don't use them. Pretty much ever. Warm spares make sense in some environments. Hot spares almost never make sense. Very often it makes more sense to include the disks in the pool and increase redundancy level because of it, than it does to leave them out and have a lower redundancy level.

For a bit of clarification, the main reasoning behind this has to do with the way hot spares are presently handled by ZFS & Solaris FMA and so on - the whole environment involved in identifying a failed drive and choosing to replace it is far too simplistic to be useful in many situations. For instance, if you create a pool that is designed to have no SPOF in terms of JBOD's and HBA's, and even go so far as to put hot spares in each JBOD, the code presently in illumos (9/12/2013) has nothing in it to understand you did this, and it's going to be sheer chance if a disk dies and it picks the hot spare in the same JBOD to resilver to. It is more likely it just picks the first hot spare in the spares list, which is probably in a different JBOD, and now your pool has a SPOF.

Further, it isn't intelligent enough to understand things like catastrophic loss -- say you again have a pool setup where the HBA's and JBOD's are set up for no SPOF, and you lose an HBA and the JBOD connected to it - you had 40 drives in mirrors, and now you are only seeing half of each mirror -- but you also have a few hot spares in that JBOD, say 2. Now, obviously, picking 2 random mirrors and starting to resilver them from the hot spares still visible is silly - you lost a whole JBOD, all your mirrors have gone to single drive, and the only logical solution is getting the other JBOD back on (or if it somehow went nuts, a whole new JBOD full of drives and attach them to the existing mirrors). Resilvering 2 of your 20 mirror vdevs to hot spares in the still-visible JBOD is just a waste of time at best, and dangerous at worst, and it's GOING to do it.

What I tend to tell customers when the hot spare discussion comes up is actually to start with a question. The multi-part question is this: how many hours could possibly pass before your team is able to remotely login to the SAN after receiving an alert that there's been a disk loss event, and how many hours could possibly pass before your team is able to physically arrive to replace a disk after receiving an alert that there's been a disk loss event?

The idea, of course, is to determine if hot spares are seemingly required, or if warm spares would do, or if cold spares are acceptable. Here's the ruleset in my head that I use after they tell me the answers to that question (and obviously, this is just my opinion on the numbers to use):

  • Under 24 hours for remote access, but physical access or lack of disks could mean physical replacement takes longer
    • Warm spares
  • Under 24 hours for remote access, and physical access with replacement disks is available by that point as well
    • Pool is 2-way mirror or raidz1 vdevs
      • Warm spares
    • Pool is >2-way mirror or raidz2-3 vdevs
      • Cold spares
  • Over 24 hours for remote or physical access
    • Hot spares start to become a potential risk worth taking, but serious discussion about best practices and risks has to be had - often if it's 48-72 hours as the timeline, warm or cold spares may still make sense depending on pool layout; > 72 hours to replace is generally where hot spares become something of a requirement to cover those situations where they help, but at that point a discussion needs to be had on customer environment that there's a > 72 hour window where a replacement disk isn't available
I'd have to make one huge bullet list to try to cover every possible contingency here - each customer is unique, but these are some general guidelines. Remember, it takes a significant amount of time to resilver a disk, and so adding in X amount of additional hours is not adding a lot of risk, especially for 3-way or higher mirrors and raidz2-3 vdevs which can already handle multiple failures.
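The ruleset above can be written down as a small decision function. This is only a sketch of my opinion-level numbers, not a policy engine. One assumption to flag loudly: the "over 24 hours for remote or physical access" bullet is read here as "remote access takes over 24 hours", since slow physical access with fast remote access is already covered by the first bullet. The layout labels are invented for the example.

```python
def spare_recommendation(remote_hours, physical_hours, layout):
    """Suggest hot, warm or cold spares from response times and pool layout.

    layout is one of: 'mirror2', 'mirror3+', 'raidz1', 'raidz2', 'raidz3'
    (hypothetical labels for this sketch).
    """
    if remote_hours > 24:
        # Assumed reading of the third bullet; see the note above.
        return "hot (after a serious discussion of the risks)"
    if physical_hours > 24:
        return "warm"
    if layout in ("mirror2", "raidz1"):
        return "warm"
    return "cold"


print(spare_recommendation(4, 8, "raidz2"))    # cold
print(spare_recommendation(4, 8, "mirror2"))   # warm
print(spare_recommendation(4, 72, "raidz2"))   # warm
print(spare_recommendation(48, 72, "raidz2"))  # hot (after a serious discussion of the risks)
```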

13. ZFS Is Not A Clustered Filesystem

I don't know where this got started, but at some point, something must have been said that has led some people to believe ZFS is a clustered filesystem or has clustered filesystem features. It does not. ZFS lives on a single set of disks in a single system at a time, period. Various HA technologies have been developed to "seamlessly" move the pool from one machine to another in case of hardware issues, but they move the pool - they don't offer up the storage from multiple heads at once. There is no present (9/12/2013) method of "clustered ZFS" where the same pool is offering up datasets from multiple physical machines. I'm aware of no work to change this.

14. To ZIL, Or Not To ZIL

This is a common question - do I need a ZIL (ZFS Intent Log)? So, first of all, this is the wrong question. In almost every storage system you'll ever build utilizing ZFS, you will need and will have a ZIL. The first thing to explain is that there is a difference between the ZIL and a ZIL (referred to as a log or slog) device. It is very common for people to call a log device a "ZIL" device, but this is wrong - there is a reason ZFS' own documentation always refers to the ZIL as the ZIL, and a log device as a log device. Not having a log device does not mean you do not have a ZIL!

So with that explained, the real question is, do you need to direct those writes to a separate device from the pool data disks or not? In general, you do if one or more of the intended use-cases of the storage server are very write latency sensitive, or if the total combined IOPS requirement of the clients is approaching say 30% of the raw pool IOPS potential of the zpool. In such scenarios, the addition of a log vdev can have an immediate and noticeable positive performance impact. If neither of those is true, it is likely you can just skip a log device and be perfectly happy. Most home systems, for example, have no need of a log device and won't miss not having it. Many small office environments using ZFS as a simple file store will also not require it. Larger enterprises or latency-sensitive storage will generally require fast log devices.
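As a sketch, the two conditions above reduce to a one-line check. The function name is invented, the 30% threshold is just the "say 30%" figure from the paragraph above, and the IOPS numbers in the usage lines are hypothetical.

```python
def wants_log_device(latency_sensitive, client_iops, pool_iops, threshold=0.30):
    """True if a separate log device is likely worth it under the rule of thumb.

    threshold is the fraction of raw pool IOPS potential at which client
    demand starts to justify a log device (0.30 per the text).
    """
    return latency_sensitive or client_iops / pool_iops >= threshold


print(wants_log_device(False, 1_000, 10_000))  # False: simple file store
print(wants_log_device(False, 3_500, 10_000))  # True: clients near pool limits
print(wants_log_device(True, 500, 10_000))     # True: latency-sensitive workload
```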

15. ARC and L2ARC

(9/12/2013) There are presently issues related to memory handling and the ARC that have me strongly suggesting you physically limit RAM in any ZFS-based SAN to 128 GB. Go to > 128 GB at your own peril (it might work fine for you, or might cause you some serious headaches). Once resolved, I will remove this note.

One of ZFS' strongest performance features is its intelligent caching mechanisms. The primary cache, stored in RAM, is the ARC (Adaptive Replacement Cache). The secondary cache, typically stored on fast media like SSD's, is the L2ARC (second level ARC). Basic rule of thumb in almost all scenarios is don't worry about L2ARC, and instead just put as much RAM into the system as you can, within financial realities. ZFS loves RAM, and it will use it - there is a point of diminishing returns depending on how big the total working set size really is for your dataset(s), but in almost all cases, more RAM is good. If your use-case does lend itself to a situation where RAM will be insufficient and L2ARC is going to end up being necessary, there are rules about how much addressable L2ARC one can have based on how much ARC (RAM) one has.

16. Just Because You Can, Doesn't Mean You Should

ZFS has very few limits - and what limits it has are typically measured in megazillions, and are thus unreachable with modern hardware. Does that mean you should create a single pool made up of 5,000 hard disks? In almost every scenario, the answer is no. The fact that ZFS is so flexible and has so few limits means, if anything, that proper design is more important than in legacy storage systems. It is a truism that in most environments that need lots of storage space, it is likely more efficient and architecturally sound to find a smaller-than-total break point and design systems to meet that size, then build more than one of them to meet your total space requirements. There is almost never a time when this is not true.

It is very rare for a company to need 1 PB of space in one filesystem, even if it does need 1 PB in total space. Find a logical separation and build to meet it, rather than going crazy and trying to build a single 1 PB zpool. ZFS may let you, but various hardware constraints will inevitably doom this attempt or create an environment that works, but could have worked far better at the same or even lower cost.

Learn from Google, Facebook, Amazon, Yahoo and every other company with a huge server deployment -- they learned to scale out, with lots of smaller systems, because scaling up with giant systems not only becomes astronomically expensive, it quickly ends up being a negative ROI versus scaling out.

17. Crap In, Crap Out

ZFS is only as good as the hardware it is put on. Even ZFS can corrupt your data or lose it, if placed on inferior components. Examples of things you don't want to do if you want to keep your data intact include using non-ECC RAM, using non-enterprise disks, using SATA disks behind SAS expanders, using non-enterprise class motherboards, using a RAID card (especially one without a battery), putting the server in a poor environment for a server to be in, etc.

38 comments:

  1. This is great! When you have a slog, how do you decide pool spindle count to maximize the use of the slog. I have always used mirrors, but my math says to take advantage of a high performance slog, that I would want lots of spindles.

    My slogs do 900MB/sec, therefore don't I want a pool that does 900MB/sec, which is 20+ vdevs.

  2. That answer is really pretty specific on the workload of the pool itself. Much of the time, the slog devices are there to speed up the pool by offloading the ZIL traffic - and as an added benefit, reducing write latency from a client perspective.

    I almost always am looking at slog devices from an IOPS perspective first and foremost, and a throughput potential as a distant or even non-existent second (depends on the environment). Often a pool that can do 2.4 GB/s in a large-block sequential workload can't do anywhere near that at 4K random read/write request sizes (indeed, that's some 620,000 IOPS) -- and the client is doing exactly those, so suddenly all the interest is in IOPS and little time is spent worrying about throughput.

    In a pure throughput workload, things can and should be a bit different. And in ZFS, they are. For instance, ZFS has built-in mechanics for negating normal ZIL workflow if the incoming data is a large-block streaming workload. It can opt to send the data straight to the disk, bypassing any slog device (well, bypassing the ZIL entirely, really, and thus the slog device). There could be a whole post at some point on the varying conditions and how ZFS deals with each, I think. You've got 'logbias' on datasets (good writeup here: https://blogs.oracle.com/roch/entry/synchronous_write_bias_property ). And even on latency, there's some code to deal with limits, I believe. Take a look at: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zil.c#893 , or the ZFS On Linux guys (dechamps, specifically) has a pretty good write-up on this at https://github.com/zfsonlinux/zfs/issues/1012 .

  3. I liked the oracle article the best, thanks for the feedback. My scenario is different than theirs however. My specific workload is a VDI implementation with 80/20 r/w bias. I cannot seem to get a diskpool to get to performance levels to match the hardware I think. I have 22 spindle 10K mirrored pools with a ram based slog. The slog is rated at 90K iops and 900MB/sec.

    Wouldn't zpool iostat show under ideal conditions 22 * 50MB/s = 1100 MB/sec or near there? Best I can get is 300 MB/sec. I am just trying to explain the gap. Zpool iostat shows peaks of 42K iops which is great, but never very high MB/sec. When the system is not busy, I would think that a file copy would reach the speed of the slog at least or at least double what the readings are at 300MB/sec. Nobody seems to use zpool iostat for performance data. iostat seems to be the tool of choice, but I don't have that data compiled over time like I do for zpool iostat.


    So if I take my 22 spindle 10k mirrored pool to a 44 spindle mirrored pool, which is a little bigger than what oracle pushed at spec.org here: http://www.spec.org/sfs2008/results/res2012q2/sfs2008-20120402-00208.html I should see my numbers go up closer to the limits of my slog, right?


  4. So, 'rated at' and 'capable of' are always two different things. However, more importantly, 'capable of when used as a ZFS log device' is a whole new ballgame.

    Manufacturers tend to provide numbers that show them in the most favorable light -- and even third-party analysis websites focus on typical use-cases: database, file transfer, those sorts of things.

    ZIL log device traffic is something every device fears - synchronous single-thread I/O. Your device may be capable of 90,000 IOPS @ 4K block size with 8, 16, 32, or more threads.. and anywhere from 4 to 64 threads is likely what both they and third-party websites run tests at -- but what can it do at 1 thread, at the average block size of your pool datasets? Because that's what a log device will be asked to do. :)

    As for SPECfs - I tend to, well, ignore that benchmark entirely. What it is testing isn't particularly real-world applicable, especially since vendors tend to game the system. For instance, you mention 44 spindle mirrors - no, in that test, the Oracle system had *280* drives, which they split up into 4 pools, each containing 4 filesystems, which were then tested in aggregate I believe. I also believe the data amount tested was significantly less than the pool size, and various other tunings were likely done as well. This picture gives some idea as to how big that system was: http://www.spec.org/sfs2008/results/res2012q2/sfs2008-20120402-00208.7420cluster.jpg

    Even pretending you had the specific tunings, and ignoring for a moment it's not particularly fair to just 'divide down' to get an idea for what a smaller system could do, doing so puts your 22 spindle 10K mirrored pool at about 14K iops, on the same benchmark.

    I generally want to see both iostat and zpool iostat; they're very different in what they're reporting, as they're reporting on different layers. Sometimes the combination of both gives hints that one or the other would not alone provide.

    I suspect with a 'VDI' implementation you're probably running 4-32K block size, and at that, I'd be happy with a peak of 42K iops out of 22 10K disks.. indeed, that's way past what you should realistically expect out of the drives, most of that 42K is coming out of ARC and an 80%+ read workload. Were I just gut feeling, I'd suspect you to get much less at times.

    This sort of performance work is time-consuming and involves a ton of variables. However, it is important to note that the log device is not some sort of write cache -- that's your RAM. The log device's job is to take the ZIL workload off the data pool. The performance benefit of that is purely in that the pool devices now have all those I/O they were spending on ZIL back. If there's any further benefit, it's just the luck of the draw that the incoming writes were 'redundant' (they were writing some % of the same blocks multiple times within a txg, allowing ZFS to effectively ignore all but the last write that comes in when it lays it out on the spinning media). The pain that spinning disks feel from ZIL traffic cannot be overstated. However, the streaming small-block performance of the spinning media minus the serious pain of interjecting the random read that gets past the ARC is, at the end of the day, the actual write performance the pool is capable of -- not what the log device can do at all.

    In super streaming workloads, sometimes, the log devices end up being the bottleneck. However, in almost all VM/VDI deployments I've seen, the log device is not your bottleneck - your drives are. :)

  5. Therefore, is going from a 22 disk mirror to a 44 disk mirror bad? How many vdevs are too many vdevs? The spec test, which I get was tuned, had 280 disks and 2 controllers, which leaves 140 disks per controller. 4 slogs means 4 mirrored pools, therefore they used 35 spindles. But you say that lots of spindles is bad.

    The slog I have is a STEC ZeusRAM. I discovered them in the Nexenta setup from VMworld 2011 (I have a diagram of it also), which is what I have been trying to replicate ever since. Since I have 100 of these 10K drives and JBODS to go with them, I am trying to figure out how to get the best out of them for a VDI deployment. So far I have only tried 22 spindles and I was thinking 44 would be better.

    Lots of $$ in equipment and consultants plus gobs and gobs of wasted time still has me scratching my head.

  6. No no, more spindles is usually better up to a point. I don't start to worry about spindle counts until it is up into the 100's. However, remember the Oracle box got the 200K+ IOPS only from 280 spindles - at 44, you're at a small fraction of that.

    Your box will perform twice as well as your 22-disk mirror system does, assuming no part in the system hits a bottleneck (which is going to happen if you've insufficient RAM, CPU, network, etc), and it is properly tuned(!). I would not expect, on a properly tuned system, in an 8-32K average block size VDI workload, for 44 drives in a mirror pool to be able to outperform a single STEC ZeusRAM (eg: I wouldn't expect it to be your bottleneck, from an IOPS perspective).

    I would expect the ZeusRAM to bottleneck you on a throughput test - or if your average blocksize is 32K or greater (getting ever more likely up to 128K). Its IOPS potential is not 90,000 at 4K, nor at 8K, 32K, or 128K (each of which is worse than the previous), because ZIL traffic is single-threaded, unlike most benchmarks you'd cite when saying how fast a device is.

    I love ZeusRAM, and I recommend them on every VM/VDI deployment I'm involved with and commend you on their use; but while they are in fact the very best device you could possibly use, it is not like they can't limit you, they are not of unlimited power. Still, again, if you're at 16K or under average block size, I'd suspect your pool (22 or 44 drives) to run out of IOPS, first. What block size are you using? What protocol (iSCSI, NFS)?

    Is this a NexentaStor-licensed system, or a home grown (and if so, what O/S & version)? That will matter in terms of where you can go for performance tuning assistance - because it needs some, unless you've already started down that path? I'm unaware of a single ZFS-capable O/S whose default tuneables for ZFS will well suit a high-IOPS VDI workload. The spinning disks are very likely underutilized.

  7. I am running Solaris 11 because the Nexenta resellers I reached out to never got back to me; too busy, I guess. So I just started buying what made sense to me. If any of you out there are reading this.. look at what you missed. Sorry I wasn't interesting enough! I have 100 SAS 10K spindles, 2 STECs, 2 DDRDrives, 2 256GB w/10GbE servers. Tried a 60 disk pool but someone told me it was too big, so now I have 22. Your nugget that vdevs are for I/O was worth the price of admission. I learned this: you can never have enough RAM, ever. All in all it's been such a letdown because of the $$ spent and the results achieved.

  8. Sorry they never got back to you. Doubly so since that precludes the option of contacting Nexenta to do a performance tuning engagement. :(

    Also sorry the performance has seemed underwhelming - this is one of the current problems with ZFS go-it-on-your-own, is that there's just such a dearth of good information out there on sizing, tuning, performance gotchya's, etc - and the out of box ZFS experience at scale is quite bad. What information does exist is often in mailing lists, hidden amongst a lot of other, bad advice. I'm hoping to try to fix that as best I can with blog entries on here, but time I have to spend on this is erratic, and some of these topics are nearly impossible to address fully in a few paragraphs on a blog post, I'm afraid.

    60 disk is most assuredly not 'too big'. Average Nexenta deployment these days I'd say is probably around 96 disks per pool, or somewhere thereabouts. If you don't mind people poking around on the box via SSH (and it is in a place where that's possible), email me (nexseven@gmail.com) to work out login details, and I can try to find some off time to take a peek at it.

  9. I dropped you a note last weekend, but maybe you're on spring break like I have been. I was thinking of just adding another JBOD of 24 disks to the existing pool, creating new zfs datasets, then copying the existing data to them to spread it around the new disks. Go from 22 to 44 spindles. The whole operation should only take a few hours. Currently when I do zpool iostat I see maybe 1-2k ops/s with a high of 4k. What I don't like is the time to clone VMs; the max MB/s I get is around 500-550 and doubling the spindles would double that .. correct?

    Also.. how many minutes/seconds should a RSF1 with 96 disks take to fail over? I am curious what I would expect.

  10. Oops, email lost in the clutter. I've responded.

    It would very likely double your IOPS count, but not necessarily double your throughput count, since there are more bottleneck concerns to consider there. I assume you're using NFS -- you might (and it IS beta, so bear that in mind) be interested in this: http://nexentastor.org/boards/13/topics/9315 - we're in beta on the NFS VAAI plugin. I say that because you mentioned tasks like VM cloning and such, and NFS VAAI support could have a serious impact on certain VM image manipulation tasks in VMware when backed by NexentaStor. Possibly worth looking at (though again -- beta, probably not good for production, yet).

    The goal of RSF-1 is to fail over in the shortest safe time possible. I've seen failovers take under 20 seconds. That said, I've also seen them take over 4 minutes (which isn't bad when you put it in context -- at my last job, my Sun 7410 took *15 minutes* to fail over). There's a number of factors involved. Number of disks is one, number of datasets (zvols & filesystems) is another. In general I recommend people expect 60-120 seconds, which is why I have the blog post up on VM Timeouts and suggest at least 180 second timeout values everywhere (personally I use higher than even that, as I see no reason to go read-only when I know the SAN will come back *some day*).

  11. What about Zpool fragmentation? That seems to be another issue with ZFS that you don't see much discussion about. As your pools get older, they tend to get slower and slower because of the fragmentation, and in the case of a root filesystem on a zpool, that can even mean that you can't create a new swap device or dump device because there is no contiguous space left. Zpools really need a defrag utility. Today the only solution is to create a new pool and migrate all your data to it.

    A related issue is that there are no tools to even easily check the pool fragmentation. Locally, we estimate the fragmentation based on the output of "zdb -mm", but even that falls down when you have zpools that are using an "alternate root" (for example in a zone in a cluster). "zpool list" sees those pools fine, but zdb does not.

    Are you aware of any work being done on solutions to those issues?

    Replies
    1. BK:

      Fragmentation does remain a long-term problem of ZFS pools. The only real answer at the moment is to move the data around -- eg: zfs send|zfs recv it to another pool, then wipe out the original pool and recreate, then send back.

      The 'proper' fix for ZFS fragmentation is known -- it is generally referred to as 'block pointer rewrite', or BPR for short. I am not presently aware of anyone actively working on this functionality, I'm afraid.

      For most pools, especially ones kept under 50-60% utilization that are mostly-read, it could be years before fragmentation becomes a significant issue. Hopefully by then, a new version of ZFS will have come along with BPR in it.

  12. Ok I have a quick question and I'll include specific info below the actual question just in case you need it: I have a home "all-in-one" ESX/OpenIndiana ZFS machine. Right now I have an 8 disk RAIDz2 array with 8 2TB drives: 2 Samsung, 2 WD, 2 Seagate and 2 Hitachi drives (just worked out that way). The two Hitachi drives are 7200RPM; the rest are 5400-5900RPM drives. 6 of them are 4k and I think the two Hitachis are "regular" 512-byte drives. I want to know if I'm making a terrible mistake mixing these drives? I don't mind "losing" the performance of those 7200RPM drives over the 5400 ones; I just don't want data risk due to that. I could probably find someone who would happily trade those 7200RPMs for 5400s but if I can leave it as is I would prefer that.

    Second, I have pulled a spare 2TB 5400rpm WD Green from an external case and was going to put it in as a hotspare. Would I be better off just rebuilding the array as a z3 instead (or tossing it into the z2 array for a "free" 2TB)? Or leaving it in a box and keeping it for when something dies? (BTW, this question *might* have a different answer after you read my specs below.)

    Specs: Supermicro dual CPU MB (2x L5520 Xeons) with 48GB of ECC registered Samsung RAM. 4 PCI-E 8x slots, 3 PCI-E 8x (4x electrical) slots. 1 LSI 3081E-R 3gb/s HBA, 1 M1015 6gb/s HBA, 2 (incoming, not installed yet) 3801E 3gb/s HBAs (for the drives soon to go into my DIY DAS), 1x Mellanox DDR Infiniband card. Drives: the 8 2TB drives previously mentioned in a z2 pool and 8 300GB raptor 10kRPM in an 8 disk RAID10 array for putting VMs on (overkill honestly, but fun). Right now it's one pool on one card, one pool on the other. SOFTWARE: ESXi 4.1U3, OpenIndiana VM with 24GB of RAM handed over from ESXi.

    My future plans were to use 5 1TB drives I had lying around to create another pool. My pool ideas were raidz1 with all 5, raidz2 with all 5, or RAID10 with 6 (and locate/buy a 6th 1TB drive). Given that I was going to have this second pool, possibly set up with RAIDz1, I was seriously considering using that 9th 2TB drive as a global hotspare so it could be pulled into either pool (even the pool with the 1TB drives, right?). And, an even more bizarre idea: could I also flag that 2TB drive to be used as a spare for the RAID10 raptors? Obviously it would impact performance if it got "called-to-duty", but it would protect the array until I could source a new 10k drive, right? If that's a really stupid idea then I'll just order another 10k drive this weekend and toss it in as a spare for the mirror set, or if you prefer, sit on it in a box and swap it in when something actually dies (which saves power so I'm ok with it).

    Right now the data that is considered "vital" and business important (my wife's small business) is sitting on the 8 disk Z2 array, and two different physical locations elsewhere in the house, *and* backed up on tape. Regardless of how we set up the ZFS pools, all important data will be residing on the ZFS box *AND* *TWO SEPARATE* other locations/machines/drives (IE, it will always be on 3 separate machines on at least 3 separate drives). The array with the 1TB drives will be housing TV shows and movies that can be lost without many tears, so while I'd prefer not to lose it, it's not *vital* like the array with the 2TB disks. I would be open to changing the *vital* array/pool to a RAID10 or RAIDz3 if you believe it's worthwhile considering my requirements.

    Thanks for any help you feel like giving! :) -Ben

    Replies
    1. Let me see if I can get through all this!

      1) Mixing RPM's -- generally speaking, don't do it. You'll only go as fast as the slowest disk in the pool. If you don't mind that, along with the occasional performance hiccup, maybe that's fine. I suspect it would perform worse than you might even expect - it isn't a situation I've spent any time investigating, and it isn't something ZFS has specifically set out to work well at.

      2) Mixing 4K & non-4K disks -- I've not spent any real time thinking this one out or seen it, but in general I'd also suggest not doing it if you can avoid it. It would also be pretty important that all the disks were appearing the same, even if some technically aren't the same. However, of note -- it sounds like what you actually have are 4K disks that REPORT as 512. There are a few of these running around, and they lead to really terrible performance. You can Google around (or maybe I'll see about writing up an entry for this at some point) about zpool 'ashift'.

      3) I'd throw the extra disk into the existing z2, making it a z3, if its uptime is vital. You'll find z2/z3 slightly more failure-resistant than the same disks in a raid10, so I don't recommend that, at least. You could in theory use it as a global spare (though I don't like hotspares, just warm or cold spares), though putting it into service paired to a 10K disk would indeed lead to very odd performance issues and wouldn't generally be something I'd recommend (I tend to be conservative and cautious when it comes to storage, though).

  13. Andrew,

    I'm curious about your following rules. Could you be so kind as to point me to some further information on these? Also shouldn't the number of data disks always be 2**n, meaning that raidz2 should start at 6 not 5?

    Do not use raidz1 for disks 1TB or greater in size.
    For raidz1, do not use less than 3 disks, nor more than 7 disks in each vdev (and again, they should be under 1 TB in size, preferably under 750 GB in size).
    For raidz2, do not use less than 5 disks, nor more than 10 disks in each vdev.
    For raidz3, do not use less than 7 disks, nor more than 15 disks in each vdev.

    Ian

    Replies
    1. I need a prize if you find a typo. You are correct, I've updated it to 6.

      The primary source for that information is internal experience (1000's of live zpools), that is not knowledge I picked up from websites or blog posts. The 'do not use less' rule is fairly obvious - it's silly; why would you use less than 7 disks in a raidz3 (at just 5, you're left with more parity than data, and should probably have gone with raidz2 at 6 disks).
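      As a rough illustration of the width rules discussed in this thread, here is a minimal Python sketch. The table and the `check_vdev_width` helper are hypothetical names made up for this example; the ranges are the ones quoted above, with the raidz2 minimum at 6 after the correction. It only checks disk counts and ignores the disk-size caveats for raidz1.

```python
# Width ranges per vdev type, taken from the rules quoted in this thread
# (raidz2 minimum is 6 after the correction above).
VDEV_WIDTH_RULES = {
    "raidz1": (3, 7),
    "raidz2": (6, 10),
    "raidz3": (7, 15),
}
PARITY_DISKS = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def check_vdev_width(vdev_type, disks):
    """Return (within_rules, data_disks, parity_disks) for one vdev."""
    low, high = VDEV_WIDTH_RULES[vdev_type]
    parity = PARITY_DISKS[vdev_type]
    data = disks - parity
    return (low <= disks <= high, data, parity)


for vdev_type, disks in [("raidz3", 5), ("raidz2", 6), ("raidz3", 7)]:
    ok, data, parity = check_vdev_width(vdev_type, disks)
    verdict = "within the rules" if ok else "outside the rules"
    print(f"{vdev_type} x{disks}: {data} data + {parity} parity, {verdict}")
```

      The 5-disk raidz3 case shows the 'more parity than data' situation mentioned above: 2 data disks against 3 parity disks.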

      The 'more' rule logic is around the nightmare scenario of 'lose another disk while resilvering from an initial loss'. You do not want this to happen to you, and keeping the number of disks per vdev low is how you mitigate it. I will actually bend these rules based on disk type, JBOD layout, known workload, etc. I'll be more conservative in environments where the number of JBOD's is low, the workload is high (thus making resilvers take longer as they compete for I/O), the chosen disk type is very large (3+ TB, since they take forever to resilver), or it's a vendor or model of disk I'm less confident in (since I'll then expect more failures). I'll be less conservative and even go out of my own ranges if it's a very strong environment with no SPOF on the JBOD's, good disks that are not more than 2 TB in size, and the workload is light or confined to only certain periods of the day, etc.

      When making this decision, it is also important to be cognizant of IOPS requirements - your environment may be one that would otherwise be OK to lean towards the high end of these ranges, but you have an IOPS requirement that precludes it, and requires you go with smaller vdevs to hit the IOPS needs.

      Let me know if that didn't cover anything you were curious about.

  14. 'vdevs IOP potential is equal to that of the slowest disk in the set - a vdev of 100 disks will have the IO potential of a single disk.'

    Is this how it works in traditional RAID6 (if we are talking raidz2) hardware arrays too?

    I'm a bit concerned as my current volumes are comprised of 12 vdevs of 6 disks each (raidz2), which if I understand correctly means I'm really only seeing about 12 disks' worth of write IOPS. Which would explain why it doesn't seem that fantastic till we put a pair of Averes in front of it.

    Replies
    1. No, traditional RAID5/6 arrays tend to have IOPS potential roughly equivalent to some % of the number of data drives - parity drives. This is one of the largest performance differences between a traditional hardware RAID card and ZFS-based RAID -- when doing parity RAID on lots of disks, the traditional hardware RAID card has significantly higher raw IOPS potential.
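      To put the ZFS side of that comparison in numbers, here is a minimal Python sketch of the rule quoted in the question: each vdev contributes roughly the IOPS of its slowest disk, and vdevs add up. The function name and the 100 IOPS per disk figure are assumptions chosen for illustration only, not measurements.

```python
def pool_write_iops(vdevs):
    """Estimate pool write IOPS: each vdev counts as its slowest disk; vdevs add up."""
    return sum(min(disk_iops) for disk_iops in vdevs)


# ASSUMPTION: 100 IOPS per disk is an illustrative figure, not a measurement.
DISK_IOPS = 100

# The layout from the question: 12 raidz2 vdevs of 6 disks each (72 disks).
pool = [[DISK_IOPS] * 6 for _ in range(12)]
estimate = pool_write_iops(pool)
print(f"{estimate} IOPS, i.e. about {estimate // DISK_IOPS} disks' worth from 72 disks")
```

      Under this rough model, 72 disks deliver about 12 disks' worth of write IOPS, which matches the reading in the question.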

      ZFS mirror versus hardware RAID10 is a reasonable comparison, performance wise, but ZFS will win no wars versus traditional RAID5/6/50/60. Then again, it also won't lose your data, and isn't subject to the raid write hole problem. :)

      I often have to remind people that ZFS wasn't designed for performance. It's fairly clear from the documentation, the initial communication from the Sun team, and the source, that ZFS was designed with data integrity as the primary goal, followed I'd say by ease of administration and simplification of traditionally annoying storage stuff (like adding drives, etc) -- /performance/ was a distant second or third or even fourth priority. Things like ARC and the fact that ZFS is just newer than many really ancient filesystems give people this mistaken impression that it's a speed demon -- it isn't. It never will be.

      If your use-case is mostly-read and reasonably cacheable, ARC/L2ARC utilization can make a ZFS filesystem outperform alternatives, but it's doing so by way of the caching layer, not because the underlying IOPS potential is higher (that's rarely the case). If your use-case isn't that, then the only reason you'd go ZFS is for the data integrity first and foremost, and also possibly for the features (snapshots, cloning, compression, gigantic potential namespace, etc); not because you couldn't find a better performing alternative.

  15. Hi Andrew,

    I am a "normal" home user with a "normal" home media server and, after reading (and some testing with VirtualBox), have been considering moving to ZFS (FreeBSD or Solaris) from Windows 7. I will most likely try ESXi (no experience with this either, but I have no problems learning) to run one VM for the file server (ZFS) and another for Plex media server.

    Specs for my "server": Currently running windows 7, Intel motherboard (DP67BG if I remember correctly), i7 2600k and 16 GB of DDR3 (non ECC) ram, one 500 GB HD for the OS and 8 3TB (sata, non enterprise class) HDs for data (bought 4 then the other 4 later).

    The 8 data disks are on a raid 6 hardware array (adaptec 6805) with around 9 TB of used space. 90% of that space are movies in mkv format (I rip all my blurays with makemkv so I have 20-30 gb files) and 10% of random files (backup of family photos and stuff from main box, that I back up to another 2 different HDs).

    Main purpose of my "media server" is Plex, serving 2 HTPCs and some devices (iPads). I want to move from one windows 7 box to Esxi with VMs to have storage and Plex on different VMs and optionally a third VM for misc stuff (like encoding video/testing). Everytime I install/update something I have to reboot the windows box and if anyone is watching a movie has to wait for it to get back online.

    Apart from a learning experience, would ZFS (Solaris or FreeBSD) be better, or am I just fine and should just try to use ESXi with Windows VMs? Would a raidz2 be better than the hw raid6 array I currently have (for my use)?

    My plan is one VM for zfs (still don't know what to install here, solaris, freeBSD, nexenta, etc.), one for Plex media server (windows or linux) and one windows VM for misc stuff.

    Thanks a lot for any feedback.

    Replies
    1. I can't wait until we're "normal" home users, Simon. Pretty sure we're not, at the moment. :)

      I could easily run over the 4,096 character limit trying to advise you here. The tl;dr version would be: No, not unless you've got a backup of the data or a place to put it while migrating, and preferably only if you're willing to change boxes to something with ECC RAM in the process (that in and of itself is not a deal breaker) and definitely to an HBA instead of a RAID card. So you're definitely migrating data. It's potentially a lot of work for some data integrity gains. If you're not planning to use any ZFS features (which your present list of requirements doesn't seem to indicate you would -- you mention nothing that sounds like zfs snapshots, clones, rollback and so on would /really/ improve your life), those data integrity gains may not be worth the move (definitely not if not also going to ECC RAM & an HBA).

      Moving off Windows to a *nix derivative for the storage portion is very sane. Separating the storage to its own box or VM is reasonably sane. The level of effort to get you there safely on ZFS would almost necessitate buying a whole new server.

      As for choice, if you do decide to go through with a migration to a new box and ZFS, in order of what I feel to be the best options at the moment (9/21/2013):

      If you prefer command line administration:
      1. OmniOS
      2. FreeBSD 9.1 (or, really, wait for 10!)

      If you prefer UI administration:
      1. OmniOS with free version of napp-it if over 18 TB of space required
      2. NexentaStor Community Edition if under 18 TB of space required
      3. FreeNAS

      I can't currently recommend you use anything sporting ZFS On Linux. Lack of fault management, lack of dtrace/mdb, and a few other niggling things keep it off my list for now.

  16. > I can't wait until we're "normal" home users, Simon.

    I just stumbled upon this looking for advice on how to install a global hot spare on Solaris 11. There is conflicting information about this capability and after reading your post, I am thinking that one or two warm spares might be more appropriate.

    My setup is a Solaris 11 home server, doing multiple duties for the family as a Sun Ray server, virtual machine host (so all five family members can have as many instances of Windows, Linux or whatever) and media server. Thin clients are scattered around the house in many rooms. Rooms each have a 24 port switch with gigabit fiber backhauled to the rack.

    The server is HP DL585 G2 with 4x dual core Opterons and 60GB of RAM, two fibre HBAs each with two ports, connected to two Brocade switches in a way that any HBA, cable or switch can fail without losing a path to the disks. Disks are 500GB in four arrays of 11 disks, each with dual paths.

    Your notes on backup are spot on. Right now the main pool consists of 3 vdevs, each containing 8 disks in RAIDZ2 (6+2), allowing any two disk failures before the array becomes critical. The remaining 20 disks are a backup pool as a single RAIDZ3 vdev (17+3). Snapshots are synchronized to the backup pool every 8 hours using the zrep script.

    The disks are not terribly reliable with one failing every week or three. I have about 50 cold spares, so the loss of a spindle is not an issue, but I often cannot get time to make the replacement too quickly. I was thinking that it made sense to reduce the size of the backup pool and allocate two global hot spares, so that any failure would rebuild automatically and give me time to respond.

    Your post brought back scary memories of a single array going offline, causing ZFS to scramble to build hot spares and declaring the whole pool invalid. I think I will take your advice and simply allocate one or two warm spares.

    Replies
    1. Yeah - I push hard against using hot spares, and never feel quite as good about builds that end up with them, whether because the client demands them as a requirement or because the build actually meets my criteria for using them (I can never feel very good about a SAN that nobody will even be able to remotely log in to for over 72 hours after notification of a failure).

      Glad I could remind you of a scary memory, I suppose. :)

      PS: Nice home setup!

  17. Andrew, here is another reason to be careful with ZFS. I have a less reliable spare array and many unused fast 73GB disks. I also wanted to investigate how L2ARC would impact performance, or even if ZFS would populate L2ARC storage. No problem, power up the spare array, put in some disks and add them as cache.

    bash-4.1$ sudo zpool add tank c0t20000011C692521Bd0
    vdev verification failed: use -f to override the following errors:
    /dev/dsk/c0t20000011C692521Bd0s0 is part of exported or potentially active ZFS pool slow. Please see zpool(1M).
    Unable to build pool from specified devices: device already in use

    Oh yeah, those were part of an old pool. No problem, override.

    bash-4.1$ sudo zpool add -f tank c0t20000011C692521Bd0

    Did you catch the error? My array now looks like this:

                                 capacity     operations    bandwidth
    pool                       alloc   free   read  write   read  write
    -------------------------  -----  -----  -----  -----  -----  -----
    tank                       4.77T  6.18T      0      0  63.9K      0
      raidz2                   1.59T  2.04T      0      0      0      0
        c0t20000011C61A75FFd0      -      -      0      0      0      0
        c0t20000011C619D560d0      -      -      0      0      0      0
        c0t20000011C619A481d0      -      -      0      0      0      0
        c0t20000011C619DBDCd0      -      -      0      0      0      0
        c0t20000014C3D47348d0      -      -      0      0      0      0
        c0t20000011C619D695d0      -      -      0      0      0      0
        c0t20000011C619D742d0      -      -      0      0      0      0
        c0t20000011C619A4ADd0      -      -      0      0      0      0
      raidz2                   1.59T  2.04T      0      0  63.9K      0
        c0t20000011C619D657d0      -      -      0      0  10.7K      0
        c0t20000011C61A75A6d0      -      -      0      0  10.7K      0
        c0t20000011C619D4ECd0      -      -      0      0  10.7K      0
        c0t20000011C619A043d0      -      -      0      0  10.5K      0
        c0t20000011C619D669d0      -      -      0      0  10.5K      0
        c0t20000011C61A7F9Cd0      -      -      0      0      0      0
        c0t20000011C619D6C5d0      -      -      0      0      0      0
        c0t20000011C619D220d0      -      -      0      0  10.7K      0
      raidz2                   1.59T  2.04T      0      0      0      0
        c0t20000011C619DCD3d0      -      -      0      0      0      0
        c0t20000011C619D7FCd0      -      -      0      0      0      0
        c0t20000011C619D646d0      -      -      0      0      0      0
        c0t20000011C619A41Fd0      -      -      0      0      0      0
        c0t20000011C6199E5Ed0      -      -      0      0      0      0
        c0t20000011C619D43Fd0      -      -      0      0      0      0
        c0t20000011C61A7F82d0      -      -      0      0      0      0
        c0t20000011C619D636d0      -      -      0      0      0      0
      c0t20000011C692521Bd0    33.8M  68.0G      0      0      0      0
    cache                          -      -      -      -      -      -
      c0t20000011C615FDBAd0        0  68.4G      0      0      0      0
      c0t20000011C6924F09d0        0  68.4G      0      0      0      0
      c0t20000011C6C2163Cd0        0  68.4G      0      0      0      0
      c0t20000011C6C2C468d0        0  68.4G      0      0      0      0
      c0t20000011C6C2C4B8d0    1.15M  68.4G      0      0      0      0
    -------------------------  -----  -----  -----  -----  -----  -----

    So now, my entire pool is critical due to a single vdev located on an unreliable array. My only hope is to mirror it (which I have done) and pray that the spare array stays alive until I can rebuild the ENTIRE POOL from a backup.

    That's rather lame.

    Replies
    1. Ouch.

      Yes. This is much like forgetting what your current path is and rm -rf'ing. :(

      This is the sort of thing that generally prompts the creation of 'safe' tools to use in lieu of the underlying 'unsafe' tools. I say this somewhat tongue in cheek, since my employer makes one of those 'safe' tools and yet I'm fairly sure it would have still let you do this in the interface (though I'll make a point of bringing it up to our development staff to add some logic to keep you from doing so without a warning).
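      For illustration only, a 'safe' wrapper along those lines might look like the Python sketch below. `build_zpool_add` is a hypothetical helper, not part of any shipping tool: it makes the caller name the device's role, so the `cache` keyword that was missing from the command above cannot be silently dropped, it defaults to `zpool add -n` (the dry-run flag, which prints the resulting layout without changing the pool), and it never passes `-f`. It only builds the argument list; actually running it, and dealing with labels left over from an old pool, is left out.

```python
def build_zpool_add(pool, devices, role, dry_run=True):
    """Build an argv list for `zpool add`, forcing the caller to name a role.

    role: "cache" (L2ARC), "log" (slog), "spare", or "data" (top-level vdev).
    Hypothetical helper for illustration; it never passes -f.
    """
    allowed = {"cache", "log", "spare", "data"}
    if role not in allowed:
        raise ValueError(f"role must be one of {sorted(allowed)}")
    if not devices:
        raise ValueError("no devices given")
    argv = ["zpool", "add"]
    if dry_run:
        argv.append("-n")  # show the resulting layout without changing the pool
    argv.append(pool)
    if role != "data":
        argv.append(role)  # e.g. the "cache" keyword
    argv.extend(devices)
    return argv


print(" ".join(build_zpool_add("tank", ["c0t20000011C692521Bd0"], "cache")))
```

      Reviewing the dry-run layout before running the real command is the point of the design: a disk about to land as a new top-level vdev shows up at the same level as the raidz2 vdevs rather than under `cache`.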

    2. Yes, not very fun. The underlying issue (besides me not realizing my mistake sooner) was that the -f override silenced the warning I had seen, and most critically, the warning I had not seen yet (and would never see).

  18. Andrew - could you expand on what tragedy might happen if you mixed disk sizes, speeds, redundancy types?

    I'm thinking of expanding a pool that so far has only one vdev.
    RAID_Z2[ 10 x 600G 10k ] + SLOG + Hot-Spare (before I found your blog)

    Proposed 2nd vdev = RAID_Z3[ 11 x 1TB 7.2k ] + SLOG

    Single 24-slot 2.5in JBOD chassis. SLOG devices are STEC s840z. NexentaStor with the RSF-1 H-A plugin.

    Thanks for any response

    Replies
    1. Performance, mostly, including a few performance corner cases you'd be hard-pressed to actually hit in a homogeneous pool. Before I answer generically, let me state that as a NexentaStor user, if you have a license key with an active support contract, be aware that Nexenta Support does not support the use of heterogeneous pools. Contact them for more information.

      If you were to add an 11x 1-TB disk raidz3 vdev to an existing pool comprised of a single 10x 600-GB raidz2 vdev, you'd be effectively adding a larger, slower vdev that is also at start less utilized. First, this will make ZFS 'prefer' it, as it's emptier (which shouldn't be read as completely ignoring the other vdev for purposes of writes, but it is going to push more % of the writes to the new vdev). Second, it's larger, so it will prefer it even longer. Third, it's slower, and this 'preference' is not synonymous with 'all on new vdev'. So at the end of the day, you've added another vdev which should have almost doubled your write performance, but instead it won't double it; it in fact will probably only increase it by 20-50%, because not only is every write only as fast as the slowest vdev involved in the write (and now you've got a 7200 RPM vdev in there), but it's going to write a larger majority of the new data onto that slower vdev for a while, as well.

      Even if you rewrite data often enough that you eventually 'normalize', it will still end up only improving your pool's write IOPS by less than double the original speed, as the new vdev isn't as fast as the old one.

      I feel compelled to point out, though, that the part about normalizing and preferring the new vdev is going to happen regardless of similarity in the vdevs - that's one of the reasons I like to explain this early if I get the chance, so people know what to expect when it comes to 'expanding' a pool (it expands the space, but you can't expect it to expand the performance nearly as linearly, especially if you don't rewrite existing data that often).

      If all you're concerned about is more space, and you have no performance problems, you might be OK, but if you presently have a system that is already nearing its maximum performance, adding this vdev is likely to end up tanking you in the end, if adding capacity means you also add client demand at the same rate. The new space (and the old space) won't respond as quickly on a 'speed per GB' basis as it did pre-addition, so if you had 20 clients before and you add 30 more (as you're adding more than double the original space) for a total of 50 clients, there's every expectation the pool will fall over, performance wise. Hopefully that makes sense.
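      A quick way to see why, as a minimal Python sketch that uses only the figures from this reply (20 clients growing to 50, and a write-performance gain of 20-50%). The function name is made up for the example, and the 2.0 case is the idealized 'doubled' outcome, included for comparison.

```python
def relative_speed_per_client(clients_before, clients_after, write_gain):
    """Per-client write performance after expansion, relative to before (1.0)."""
    return write_gain * clients_before / clients_after


# Figures from the reply above: 20 clients grow to 50, and the pool's write
# performance rises by 20-50% (a gain factor of 1.2 to 1.5), not by 100%.
for gain in (1.2, 1.5, 2.0):
    share = relative_speed_per_client(20, 50, gain)
    print(f"write gain x{gain}: each client gets {share:.0%} of its old share")
```

      Even in the idealized doubling case each client ends up with less than before, because the client count grew faster than the write performance did.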

  19. Can you tell me more about this please?

    15. ARC and L2ARC
    (9/12/2013) There are presently issues related to memory handling and the ARC that have me strongly suggesting you physically limit RAM in any ZFS-based SAN to 128 GB. Go to > 128 GB at your own peril (it might work fine for you, or might cause you some serious headaches). Once resolved, I will remove this note.

    We have 384GB of RAM and on one system I notice that the disk pool goes to 100% over time (3 days), but then if I export and re-import it we are good for another couple of days. We are running Solaris x86 11.1 and specifically SRU 0.5.11-0.175.1.7.0.5.0. Later SRUs exhibit the same problem.

    Any ideas much appreciated!

    Effrem

  20. Hello,

    I was wondering if you could clarify this point for me:
    For raidz2, do not use less than 6 disks, nor more than 10 disks in each vdev (8 is a typical average).

    I am doing a home NAS for my media and was considering doing a raidz2 pool with 4x2.0TB WD Reds. Why would I want to use a minimum of six as opposed to four? It is a mini-ATX case so space is tight and I really wanted to add a second RAID0 group for a hot backup location. I would have to sacrifice this to get six drives in my raidz2 pool. Can you elaborate? Thank you for this guide as well.

    Replies
    1. For home use, 4 disks is fine. For enterprise use, follow the recommendations in this guide.

  21. I'm considering building a 24 disk storage box for home use (insane hobby). Although not recommended for production use, would there be a significant downside to just go with 2x12 disk RAIDZ2 vdevs in one pool?

  22. If you go with 3x8, at the cost of two disks for parity, you'll increase your write IOPS by 50%. The recommendations Andrew gives are a balance between speed, space, and redundancy.
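The arithmetic behind that comparison can be sketched as follows. This is a rough illustration only: the function name `raidz_layout` is hypothetical, and it assumes the common rule of thumb that random write IOPS scale with the number of vdevs rather than the number of disks.

```python
def raidz_layout(total_disks, vdevs, parity):
    """Summarise a pool built from equal-width RAIDZ vdevs (hypothetical helper)."""
    if total_disks % vdevs != 0:
        raise ValueError("disks must divide evenly across vdevs")
    width = total_disks // vdevs
    if width <= parity:
        raise ValueError("each vdev needs more disks than parity devices")
    return {
        "vdev_width": width,
        "parity_disks": vdevs * parity,
        "data_disks": vdevs * (width - parity),
        # Assumption: random write IOPS scale roughly with vdev count.
        "relative_write_iops": vdevs,
    }


two_vdevs = raidz_layout(24, vdevs=2, parity=2)    # 2 x 12-disk RAIDZ2
three_vdevs = raidz_layout(24, vdevs=3, parity=2)  # 3 x 8-disk RAIDZ2

gain = three_vdevs["relative_write_iops"] / two_vdevs["relative_write_iops"] - 1
extra_parity = three_vdevs["parity_disks"] - two_vdevs["parity_disks"]
print(f"write IOPS gain: {gain:.0%}, extra parity disks: {extra_parity}")
```

Under those assumptions the 3x8 layout gives up two data disks (18 instead of 20) in exchange for a third vdev's worth of write IOPS.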

  23. I'm building out a 12-bay solution that will mostly be used to house VMware virtual machines. My plan right now is to have 64G of ECC RAM, a 128G ZIL SSD drive, 2x240G SSD in a mirror pool for applications that need extra performance, and 8x4TB WD Red set up as a mirrored pool for the main storage. Anything in particular I need to watch out for? The majority of the servers will connect to the storage via a 4G fiber channel switch, but there will also be connections via regular 1G ethernet. If I understand the math correctly, my theoretical max throughput for the main storage would be 4 x the throughput of a single WD Red disk, so approx. 600MB/s, right?
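The math in that question can be written out as a small sketch. The 150 MB/s per-disk figure is an assumption implied by the 600 MB/s estimate (it is not a measured WD Red number), the helper name is hypothetical, and the result is a best-case sequential figure, not what random VM I/O will see.

```python
def mirror_pool_write_mbps(disks, per_disk_mbps, mirror_width=2):
    """Best-case streaming write rate for a pool of striped mirrors (hypothetical helper)."""
    vdevs = disks // mirror_width  # each mirror vdev writes at single-disk speed
    return vdevs * per_disk_mbps


PER_DISK_MBPS = 150  # assumption: implied by the 600 MB/s estimate above

pool_mbps = mirror_pool_write_mbps(8, PER_DISK_MBPS)

# Raw line rate of 1 Gbit/s Ethernet, before any protocol overhead.
gbe_mbps = 1000 / 8

print(f"pool best case: {pool_mbps} MB/s")
print(f"per-client ceiling over 1GbE: {min(pool_mbps, gbe_mbps)} MB/s")
```

Whatever the pool can deliver, a client on a single 1G Ethernet link cannot see more than the link's own line rate.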

  24. Hi,

    The article and the comments that follow have given me more insight into ZFS.

    I am trying to build storage of about 100TB usable with the following configuration. Please let me know if there are any precautions I should take in terms of performance. The requirement is NFS storage for a mailstore serving around 100,000 mail users. The mailing solution will be based on Postfix + Dovecot.

    55 x 3TB NL-SAS or enterprise SATA HDDs, configured as RAIDZ3. I will be using server-class hardware (SuperMicro or Intel Server System) with dual 6-core Xeon CPUs and around 128GB RAM. Do you recommend more RAM, or do I need to invest in SSDs for ZIL or L2ARC?

    Kindly help with any precautions that I should take before procuring this infrastructure.

    Thanks in advance.

  25. I have 2 Linux ROCKS clusters currently with hardware RAID 24-bay SATA drive systems. I intend to replace the 3TB SATA drives with 4TB SAS, and install JBOD HBAs to move from XFS to ZFS. I suspect that storage capacity and redundancy will be prized over outright performance. Can anyone suggest a starting point for the ZFS setup? I will likely start with 128GB RAM, and I need to investigate the particulars of the backplane in these Supermicro boxes. Will we likely take a big performance hit using what I presume is a 1x3 expander? Should I be looking at replacing the backplane and using three 8-channel HBAs?

  26. Hi,

    first of all, thanks to Andrew for this great article and to the people who commented on it. We have used Nexenta for a year now and this has given a great deal of valuable information.

    I've got one question about recordsize. For our first pools we used a 128k recordsize because we were told it was the default value and the most suitable for many cases.

    We trusted the techies from Nexenta who gave that advice, until we started to experience some bumps in the road with our production pools.

    One of the things I tested was different recordsizes.

    Our use case is Xen Community edition accessing Nexenta 3.x through NFS. In ZFS we store .img files, one for each vm.

    So, I did a lot of testing with iozone and my conclusion was that if you align NFS mountpoint and ZFS recordsize to either 4k or 8k, you get the best possible performance of all the possible combinations which go from 4k to 128k on both sides.

    I also used DTrace to get as much information as possible from the pool, and I saw that more than 90% of the requests to the ARC are either 4k or 8k blocks. No matter what blocksize you use on Linux or what recordsize you use on ZFS, you always get the same kind of requests from Xen.

    I'm telling you this because I've seen many articles and posts in forums about this which say the contrary, that you should use recordsizes of 32k or bigger, or even stick to the default 128k.

    I would like to know if anyone has ever done this kind of test and what results they got. Why are my results so different from the recommended values?

    I have no graphs or anything "nice-looking" to show you, just a text file with all the results, but if anyone is interested in my findings I am more than willing to publish it somewhere in a human-readable way.

    Thanks.

    Replies
    1. Jordi:
      I recently tested a number of block sizes using the free NexentaStor 4.0 release. The winner was 32k, but this was with a Linux box. Windows still uses a native 4k buffer, so depending on your mix of what you run (Linux and Windows) it may take a compromise (8k?). The buffer flushing mechanism used by Nexenta is set to 32k, so an engineer I spoke with there recommended 32k for everything.
      --Tobias

  27. minor correction: "without further adieu" should be "without further ado"
