Tuesday, 29 April 2008

More ZFS on FreeBSD

Update (2009/06/01): I should point out that a few pieces of info here are now out of date, specifically the places where I mention bugs in the ZFS implementation on FreeBSD. The developers have done, and continue to do, an amazing job, and in case anyone's not aware, I'm a big fanboy of pjd for the massive effort put in. That said, I think the info herein is still useful, but where I mention 'bugs' they may well have been resolved (;

My last post was a basic intro to ZFS on FreeBSD. This one is meant to provide some information I didn't really get out of the documentation on my first (or second... third...) read. It is basically a collection of potential 'gotchas' that didn't seem obvious to me as a new user. These things become obvious once you really understand ZFS, but to a newbie some of the features can blur together and seem like something they're not.

I have mostly left out example code, as I'm not trying to provide another 'howto' document or a replacement for the man page; I just want to share some of the features/limitations/bugs of ZFS.

Copies=N Property
The first point of confusion for me was the ZFS property 'copies'. It allows you to tell a ZFS file system to keep multiple copies of each file's blocks. My initial assumption was that if you create a zpool out of two single-disk vdevs (i.e. no redundancy), you could use this to add redundancy for some of the file systems within the zpool. This is only partially true.

If a bit is flipped somehow, or the disk space where one of the copies lives is overwritten, then the other copy can be used to recover it. However (and this is obvious once you learn more about ZFS), copies does not and cannot protect against complete disk failure. If one of the vdevs (i.e. one of the disks in my example) in a zpool is lost, you cannot access any of the file systems within it, so the file system on which you keep two copies of every file is just as unavailable as the rest.

Put simply, the 'copies' property protects against small-scale errors (e.g. bad sectors, bit flips, accidental overwriting of part of the block device) but does not provide redundancy against disk failure. That said, if you have a single disk and can't expand to multiple disks, then copies is a good way of reducing the likelihood of losing data to bad sectors and the like: with one copy ZFS can only tell you a block is bad, but with two or more it can (most likely) repair it.
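As a minimal sketch (the pool/dataset name tank/important is made up, and these commands obviously need a live ZFS pool, so treat them as illustrative):

```shell
# Ask ZFS to store two copies of every block on this file system.
# Note: only data written after setting the property gets the extra copy;
# existing blocks are not rewritten.
zfs set copies=2 tank/important

# Confirm the current value of the property
zfs get copies tank/important
```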

ZPool vdev Failure
Reiterating the previous section: a zpool is created from one or more vdevs and is effectively a RAID0 across those vdevs, with the same property as a RAID0 with respect to loss of a device. If one vdev fails or is lost, the whole zpool cannot be accessed. Running the zpool status command will show the zpool in the FAULTED state due to the missing vdev.

A failed block device within a raidz or mirror vdev is of course fine; the block device's status will become FAULTED (or similar) and the raidz/mirror vdev's status will become DEGRADED until the block device is fixed or replaced.

ZVOL Performance
For me, on my hardware (amd64 system, 4 GB RAM, nForce4 chipset, fresh FreeBSD 7.0-RELEASE install), ZVOLs do not perform well at all. My advice would be to test them seriously before relying on them. I found write speeds limited to around 5 MiB/s, and using a ZVOL as a block device in another zpool caused instability of the whole system (kernel memory use would rapidly increase until the machine locked up) when writing large files or a lot of small files.

The first issue is clearly a bug and should be fixed. The latter may not be an intelligent thing to do in the first place, and perhaps using a ZVOL as a block device in a vdev should be denied (or at least warned against).

Size of Multi-block-device vdevs
The total available space of a mirror is the size of the smallest block device within it.

The total available space of a raidz vdev is approximately as follows.
N = the number of block devices in the raidz vdev
P = the number of parity blocks per stripe (i.e. 1 for raidz1, 2 for raidz2)
S = the size of the smallest block device
Total Space = S x (N - P)
(actually it will be a little below this value)
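As a quick sanity check of the formula, here's the arithmetic for a hypothetical raidz1 vdev of five 500 GB disks:

```shell
# Usable space of a raidz vdev: S * (N - P), slightly less in practice
N=5     # number of block devices in the vdev
P=1     # parity blocks per stripe (raidz1)
S=500   # size of the smallest device, in GB

echo "$((S * (N - P))) GB"   # prints "2000 GB" (minus a little overhead)
```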

Adding and Removing Block Devices
You can add and remove block devices to/from vdevs only under very specific conditions at present (more add/remove options will probably be added in the future).
  • mirror - you can freely attach block devices to a mirror (provided they're at least as big as the smallest existing device), and devices can be detached provided there are at least two devices left in the vdev
  • raidz - you cannot add or remove block devices to/from a raidz vdev
  • disk - N/A - a disk vdev is a single block device
You can always add new vdevs to a zpool, so if you have a zpool of disks you can always add another disk (by default zpool refuses to mix vdevs of different redundancy levels, though this can be forced). You cannot, however, remove vdevs from a zpool.
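A sketch of the operations above, with hypothetical pool (tank) and device (daN) names; these require a live pool, so they're illustrative only:

```shell
# Grow the pool by striping in a new single-disk vdev
# (remember: this cannot be removed later)
zpool add tank da4

# Attach da5 alongside existing device da0, creating (or widening) a mirror
zpool attach tank da0 da5

# Detach a device from a mirror again
zpool detach tank da5
```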

Replacing Block Devices in a vdev
You can replace any block device in a vdev, with the following conditions:
  • mirror - the new block device must be at least as big as the smallest device in the mirror, the old device does not need to still exist (due to redundancy in the mirror)
  • raidz - the new block device must be at least as big as the smallest device in the raidz; the old device need not still exist (provided that without it the raidz is DEGRADED rather than FAULTED)
  • disk - the new block device must be at least as big as the old one, and the old disk must still be in place to copy the data from (as there's no redundancy in a disk vdev)
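As an illustration (pool and device names hypothetical, and a live pool is needed to actually run this):

```shell
# Swap old device da1 for new device da6; ZFS resilvers the data onto da6
zpool replace tank da1 da6

# Watch the resilver progress
zpool status tank
```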

Expanding Multi-Disk vdevs
A nice feature of the raidz vdev (and, I assume, the mirror vdev, though I haven't personally tested it) is that if you increase the size of the smallest block device (by replacing it), the whole vdev will grow to fill that space. To see this expansion, however, you must export and re-import the zpool (actually it might also work by just taking the device offline and back online, I haven't tried).

This makes it very easy to increase the size of your raidz vdev (and thus your zpool): you don't have to fit both the old and new raidz vdevs into the system at once and copy the data to a new zpool. You can simply replace one drive at a time (or all of them, if you have the space), and if you really lack the space you can completely remove a drive and replace it with the new one, which will be resilvered from the contents of the other drives in the vdev (thanks to the redundancy of your raidz vdev).
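The export/import cycle described above looks like this (pool name hypothetical; requires a live pool):

```shell
# After replacing the smallest device with a bigger one:
zpool export tank
zpool import tank

# The extra capacity should now show up
zpool list tank
```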


sallevan said...

About "copies": according to one of Sun Microsystems's ZFS architects, if you set, for example, copies=3, ZFS will allow you to recover your files after a hard disk surface failure, provided the failure destroyed no more than 1/8 of the surface.
