Tuesday, 29 April 2008

ZFS on FreeBSD

First of all, ZFS is awesome. I've been using it for a good 3 days now and will be singing its praises for much longer (at least until my raidz volume dies due to a bug and I lose everything...). I'm running ZFS on FreeBSD 7.0, and after spending a week playing with various configurations I have a few things to share that I didn't find obvious from the documentation.

In this post I intend to give an overview of ZFS (the overview I'd like to have had up-front when I was planning my system). In the next post I've got a few more advanced things to discuss. These posts aren't meant to be full how-to guides (though I link to some of those...), just an introduction to the terminology, the structure and some ideas of how to adjust to the awesomeness of ZFS (let's face it, anyone using FreeBSD, myself included, has been held back in the dark ages by UFS for quite some time now, if for no other reason than the lack of journalling).

First up, if you want to use ZFS, you'll want to read (and follow!) the ZFS Tuning Guide on the FreeBSD wiki. ZFS is still marked as 'experimental' in FreeBSD for a reason (i.e. it won't necessarily play nice straight out of the box, and you might have to tune some things manually). There's also a quick-start guide with some useful commands for those who want to know how to use the 'zfs' and 'zpool' commands. But for now, an introduction...

For those new to ZFS there are a few things to learn up-front that will make it easier to understand. Firstly, despite its name (Zettabyte File System) ZFS is not just a file system, it's a volume management system as well. It can take care of whole disks, splitting them into smaller file systems and/or combining multiple disks into larger file systems. It manages compression, quotas and a bunch of other 'tuneables' at the file system level. It manages the mount points for drives, including boot-time mounting (you cannot, at the time of writing, boot from zfs, but you can put your root on zfs, and everything apart from /boot; there are guides for this). Of course it's not limited to using whole disks: you can use slices or partitions if you want to do something else with part of the disk, and you can even use files as the base block devices (though they're really only for testing out zfs). As the documentation says, any GEOM provider will do.

Understanding the structure of zfs and zpools helps a fair bit in realising what zfs is useful for, so here goes...

Virtual Devices (vdevs)
At the lowest level you have one or more block devices available to put data on; these may be disks/slices/files as mentioned above. One or more of these can form a vdev (Virtual Device), of which there are about 7 types, so first a bit about the vdevs:
  • disk is a vdev that is just a disk/partition/slice-style block device
  • file is a vdev that is created from a file on an existing file system (really just for testing though)
  • mirror is effectively RAID1, and is created out of 2 or more block devices
  • raidz (raidz1 or raidz2) is effectively RAID5/6 (with 1 or 2 devices' worth of parity), and is created from at least 2 (raidz1) or 3 (raidz2) block devices
  • spare is a hot spare block device (which I believe can be part of a number of vdevs)
  • log and cache are for more specialised purposes and you probably know much more than me about zfs if you know you need to use them (;
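So now we know what a vdev might be, we can start to look at what a zpool is and how it's constructed. A zpool is created from one or more such vdevs, which should all be of the same type (ignoring spares, logs and caches); zpool create complains about mismatched replication levels, for example a mirror alongside a raidz, unless you force it with -f. This is a fairly important point, and is perhaps not an obvious limitation.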
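To make the redundancy cost of each vdev type concrete, here's a rough rule-of-thumb calculator. It's only a sketch under my own assumptions: all devices in the vdev are the same size, and metadata overhead is ignored. The function name vdev_usable is made up for this post and isn't part of any zfs tool.

```shell
#!/bin/sh
# Rough usable capacity of a single vdev.
# Assumptions: equal-sized devices, no metadata overhead.
# usage: vdev_usable <kind> <device-count> <size-per-device>
vdev_usable() {
    kind=$1 count=$2 size=$3
    case $kind in
        disk|file)    echo "$size" ;;                      # no redundancy
        mirror)       echo "$size" ;;                      # every device holds the same data
        raidz|raidz1) echo $(( (count - 1) * size )) ;;    # one device's worth of parity
        raidz2)       echo $(( (count - 2) * size )) ;;    # two devices' worth of parity
        *) echo "unknown vdev kind: $kind" >&2; return 1 ;;
    esac
}

vdev_usable raidz 4 500    # four 500 GB disks in a raidz1: prints 1500
vdev_usable mirror 3 500   # three-way mirror of 500 GB disks: prints 500
```

The size unit is whatever you pass in (GB above); the function just does integer arithmetic on it.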

Z-Pools
Starting out simple, a zpool (called mypool) might be created from a single disk (/dev/ad0) vdev using the following command:
# zpool create mypool ad0
Actually this command has done a few things. It's created the disk vdev (okay it's just a disk, not very exciting), it's created a zpool using that disk, and a file system (zfs obviously) in that zpool, ready to use, and finally it's mounted the file system at /mypool (which was created if it didn't already exist). That command is analogous to (something along the lines of) the following in UFS land:
# fdisk -I /dev/ad0
# bsdlabel -w /dev/ad0s1
# newfs -U /dev/ad0s1a
# mkdir /mypool
# echo "/dev/ad0s1a /mypool ufs rw 1 2" >> /etc/fstab
# mount /mypool
A more complex example (and there's many many examples around if you hit up google) would be to create a raidz using a bunch of disks (ad4, 6, 8 and 10):
# zpool create mypool raidz ad4 ad6 ad8 ad10
Not much more complex to type, but this changes the simple step from the previous example in which a (boring) disk vdev was created, to creating a raidz vdev out of the four disks listed. Again, all the other steps are also performed. The important point I'm trying to make here is that vdevs, although important to understand, don't actually get explicitly created. Their creation is done when you use zpool create or zpool add. Much the same applies to mirrors; if I want to create a three-disk mirror I'd type something like
# zpool create mypool mirror ad4 ad6 ad8
Now in each of these examples, the zpool has been created from a single vdev (disk, raidz or mirror). And the zpool will act (pretty much) like a simple disk, RAID5 (for raidz) or RAID1 (for mirror). Of course on top of that zpool is the file system with its checksumming, compression, quotas, journal, etc.

However, a zpool can consist of multiple vdevs as mentioned above; in this case it acts as a RAID0 over all vdevs within it. So it's very easy to create a RAID0, RAID5+0 or RAID1+0 by creating a zpool consisting of multiple disks, raidz vdevs or mirrors respectively. Examples for these three are as follows:
# zpool create mypool ad4 ad6
# zpool create mypool raidz ad4 ad6 ad8 raidz ad10 ad12 ad14 ad16
# zpool create mypool mirror ad4 ad6 mirror ad8 ad10
In each example, a zpool is created with two vdevs. The first uses two simple disks, the second uses two raidz vdevs separately created out of three and four disks, and the third uses two mirrors each consisting of two disks.
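Because a typo in a device name can be costly, one cautious habit is to build the zpool create line with a small script and read it before running anything. The following is a minimal sketch of that idea for the striped-mirrors case; the helper name striped_mirror_cmd is hypothetical, and it only prints the command, it never touches a disk.

```shell
#!/bin/sh
# Print (do not run) a "zpool create" command that pairs the given
# disks into two-way mirrors, striped together in one pool.
# usage: striped_mirror_cmd <pool> <disk> <disk> [<disk> <disk> ...]
striped_mirror_cmd() {
    pool=$1
    shift
    if [ $# -lt 2 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "need an even number of disks (at least 2)" >&2
        return 1
    fi
    cmd="zpool create $pool"
    while [ $# -gt 0 ]; do
        cmd="$cmd mirror $1 $2"   # each pair becomes one mirror vdev
        shift 2
    done
    printf '%s\n' "$cmd"
}

striped_mirror_cmd mypool ad4 ad6 ad8 ad10
# prints: zpool create mypool mirror ad4 ad6 mirror ad8 ad10
```

Once the printed line looks right you can paste it into a root shell yourself.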

I'm not completely familiar with the way zfs stripes data between vdevs, but I'm fairly sure it does intelligent things with stripe sizes. In any case, there's no requirement for the vdevs to be the same size in a zpool.

Zettabyte File Systems (ZFS's - the actual file systems you put files on, not the whole architecture!)
We've already seen that creating a zpool creates a zfs file system (the root zfs of that zpool) automatically. However it doesn't stop there; you can create file systems within this file system, and each child shares the free space of its sibling/parent file systems. At the very least this is useful for grouping files on your system and setting compression, quotas, etc. per group. These file systems also inherit properties from their parents, which can be useful. For example, each user's home directory can be a separate file system with its own quota, while inheriting other properties (e.g. compression) from an encompassing file system for all users' home directories, which can itself carry a quota covering everyone's home directories combined.

Each ZFS file system can have an arbitrary mount point (it's an independent property of each file system, with the initial/default value coming from its position in the parent file system), so your directory structure need not follow the file system structure, despite the inheritance of properties. And finally, the file systems (including the root file system of the zpool) don't have to be mounted if you're just using them as a container object.
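To illustrate where that default mount point comes from, here's a toy model in plain shell. It's my own simplification, not how zfs computes it internally: take the nearest ancestor that has a mount point, and append the rest of the child's name. The function name inherited_mountpoint is invented for this sketch.

```shell
#!/bin/sh
# Toy model of a default (inherited) mount point.
# usage: inherited_mountpoint <ancestor> <ancestor-mountpoint> <dataset>
inherited_mountpoint() {
    ancestor=$1 ancestor_mp=$2 dataset=$3
    case $dataset in
        "$ancestor")
            # The ancestor itself: its own mount point.
            printf '%s\n' "$ancestor_mp" ;;
        "$ancestor"/*)
            if [ "$ancestor_mp" = none ]; then
                # Children of an unmounted container stay unmounted.
                echo none
            else
                # Strip the ancestor's name and append what is left.
                printf '%s/%s\n' "${ancestor_mp%/}" "${dataset#"$ancestor"/}"
            fi ;;
        *)
            echo "$dataset is not inside $ancestor" >&2
            return 1 ;;
    esac
}

inherited_mountpoint mypool /mypool mypool/usr/ports
# prints: /mypool/usr/ports
inherited_mountpoint mypool/usr /usr mypool/usr/ports/distfiles
# prints: /usr/ports/distfiles
```

On a real system, zfs get mountpoint on the file system tells you the actual value and whether it was inherited.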

So as a simple example, setting up a system that has an existing root, /, file system, I might create a few file systems within a zpool 'mypool' as follows
# zfs set mountpoint=none mypool
# zfs create mypool/usr
# zfs create mypool/var
# zfs set mountpoint=/usr mypool/usr
# zfs set mountpoint=/var mypool/var
# zfs create mypool/usr/ports
# zfs set compression=gzip mypool/usr/ports
# zfs create mypool/usr/ports/distfiles
# zfs set compression=none mypool/usr/ports/distfiles
The above commands first tell zfs not to mount the root file system of the mypool zpool. They then create 'usr' and 'var' file systems and set appropriate mount points for these (since we don't want them at /mypool/usr and /mypool/var). Then two more file systems are created, one for /usr/ports (with compression turned on) and one for distfiles in ports (with compression off, since the distfiles are already compressed). I won't go into any more detail on this stuff; I'm just providing some ideas of why you might create a few different file systems within the zpool.
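The way compression resolves in that example (set on ports, overridden on distfiles) can be mimicked with a small toy model: look for a locally set value on the file system, and if there isn't one, walk up to the parent. This is just an illustration in plain shell and awk; the LOCAL_PROPS table, the effective_prop function and the mypool/usr/ports/www file system are all made up for the sketch.

```shell
#!/bin/sh
# Locally set properties, one "dataset property value" triple per line.
LOCAL_PROPS='mypool/usr/ports compression gzip
mypool/usr/ports/distfiles compression none'

# Resolve a property the way inheritance behaves: nearest ancestor
# with a local setting wins; "default" means nobody up the tree set it.
# usage: effective_prop <dataset> <property>
effective_prop() {
    ds=$1 prop=$2
    while :; do
        value=$(printf '%s\n' "$LOCAL_PROPS" |
            awk -v d="$ds" -v p="$prop" '$1 == d && $2 == p { print $3; exit }')
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
            return 0
        fi
        case $ds in
            */*) ds=${ds%/*} ;;              # move up to the parent
            *)   echo default; return 0 ;;   # reached the pool root
        esac
    done
}

effective_prop mypool/usr/ports/distfiles compression   # prints none
effective_prop mypool/usr/ports/www compression         # prints gzip (inherited)
effective_prop mypool/var compression                   # prints default
```

On a real pool, zfs get compression reports the same thing in its SOURCE column (local, inherited from somewhere, or default).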

Of course there's no limitation on the file systems being mounted within each other, since mount points can be set arbitrarily; and there's nothing stopping you mounting UFS and ZFS volumes in directories within each other. Personally I have a UFS boot partition (because there's no support for booting off zfs yet) mounted (read-only, except when rebuilding FreeBSD) within my root ZFS, and the rest of the system is running on ZFS using a raidz based pool with a couple of other drive based pools as well.

Z-Volumes (ZVOLs)
This section comes with a warning, that I'm not convinced ZVOLs are working properly yet (at least in the FreeBSD implementation of ZFS at the time of writing).

Within a zpool you can create a new block device known as a ZVOL. This can then be used as a block device just like a disk, slice or partition. But it's in a zpool on a ZFS, so as you can imagine this has the potential for some cool tricks. You create a ZVOL much like you create a file system, except that you need to specify its size (since it's a block of space) in advance.
# zfs create -V 2g mypool/swapspace
This command creates a 2 GiB ZVOL in the 'mypool' zpool called 'swapspace', and this device can thereafter be found at /dev/zvol/mypool/swapspace so you could use it as swap space (as suggested by the name I gave it) or use newfs to create a UFS volume on it.
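Given the warnings in this section, here's a sketch that only prints the steps for a swap ZVOL rather than running them. The helper names zvol_device and zvol_swap_plan are invented for this post; the commands they print are the zfs create -V line from above plus a plain swapon on the resulting device.

```shell
#!/bin/sh
# Map a ZVOL name to the device node it appears as.
# usage: zvol_device <pool>/<volume>
zvol_device() {
    printf '/dev/zvol/%s\n' "$1"
}

# Print (do not run) the commands to create a ZVOL and swap on it.
# usage: zvol_swap_plan <pool> <volume> <size>
zvol_swap_plan() {
    pool=$1 name=$2 size=$3
    printf 'zfs create -V %s %s/%s\n' "$size" "$pool" "$name"
    printf 'swapon %s\n' "$(zvol_device "$pool/$name")"
}

zvol_swap_plan mypool swapspace 2g
# prints:
#   zfs create -V 2g mypool/swapspace
#   swapon /dev/zvol/mypool/swapspace
```

For a UFS volume instead of swap, the second step would be a newfs on the same device path.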

Now my personal experience with ZFS and FreeBSD 7.0 at the time of writing is that ZVOLs have a few problems, at least on my hardware (I've seen others complain but I'm not sure if this is a global issue or hardware specific). Basically, the ZVOLs I created (both on a single disk zpool and on a raidz of 3-4 drives) had write speeds of around 5 MiB/s. The raidz I'd created at the time had a write speed (outside the ZVOL) of around 100 MiB/s. This was on a fresh install of FreeBSD/amd64 7.0 with 2 GiB RAM and no other data on the zpool. In theory though (and hopefully in the future, in practice as well) this should be a "good" way of creating a volume for swap space, or for a block device taking up part of a drive, allowing you to avoid editing BSD labels and messing about with slices.

I'll make another post about how, in theory, you could set up a pseudo-RAID0+1 or 0+5 setup (i.e. RAID1 or 5 on top of RAID0) using zfs. That's more of an academic exercise though, and is probably a Bad Idea™. Certainly in the current implementation of ZFS in FreeBSD on my hardware this fails pretty badly: creating zpools using ZVOLs as vdevs (I hope you're following all this...) seems to cause the kernel to hang (not panic, just stop responding to anything file system related).

So again, to re-emphasise, be warned that ZVOLs may not work at all well right now. It might just be my hardware, but if you really want to use them, do some testing, including some very large writes (using both dd from /dev/zero and copying files on the file system) before you commit to it. I found they'd be stable for small writes (~100 MiB) but once you started any serious use of them (dumping a ~1GiB file or even just copying a large number of small files) they'd hang the system.
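For the dd half of that testing, something like the following is what I have in mind. It's a minimal sketch, not a proper benchmark: the function name write_test is made up, timing is only to the nearest second (so small writes tell you nothing), and /dev/zero data will flatter any file system with compression turned on.

```shell
#!/bin/sh
# Write <mib> MiB of zeros to <target> and report a rough throughput.
# usage: write_test <target-file> <mib>
write_test() {
    target=$1 mib=$2
    start=$(date +%s)
    # bs is given in bytes so the same line works with BSD and GNU dd.
    dd if=/dev/zero of="$target" bs=1048576 count="$mib" 2>/dev/null || return 1
    sync                                   # push buffered data to disk before stopping the clock
    end=$(date +%s)
    elapsed=$(( end - start ))
    [ "$elapsed" -gt 0 ] || elapsed=1      # avoid dividing by zero on fast writes
    echo "$mib MiB in ${elapsed}s: about $(( mib / elapsed )) MiB/s"
}

# Example (hypothetical path, on a UFS volume created on a ZVOL):
#   write_test /mnt/bigfile 1024
```

Run it with a size well beyond the ~100 MiB that worked for me, and remember to delete the test file afterwards.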

ZFS Notation Summary
In summary, the structure of zfs, zpool and vdevs might look something like this...

A hierarchy of ZFS file systems exists at the top level, each may contain other file systems and ZVOLs
- These file systems sit in a zpool
--- This zpool sits on top of one or more vdevs (of the same type)
----- These vdevs may each be a disk (a single block device)
----- Alternatively they may each be a raidz/mirror (i.e. a collection of block devices)

For the picky, I have specifically ignored spares, logs and caches in this, but if you're looking at those then this introduction to ZFS is aimed below your level (;

Useful Commands
A few useful commands for seeing what you've done (this would be at the top if this were a 'howto' guide)
# zpool status -v tank
This shows you useful information (the structure, any errors, and current operations such as a replace/scrub/resilver) about a specific zpool (here one named 'tank') and the vdevs and block devices within it.
# zpool list
This shows you a list of all zpools on the system; note that the sizes shown here are the amount of physical space used/available to the zpool. This does not represent their data capacity if the zpool has raidz or mirror vdevs, as these vdevs store less data than their total physical size due to redundancy.
# zfs list
This shows you a list of all zfs file systems on the available zpools. You can see here how free space is shared between different zfs file systems, and also how much data capacity they actually have (as opposed to their physical size as in zpool list).
# zfs get all tank/fs
This shows you all the properties of the 'fs' file system within the zpool 'tank'. This includes things like compression, quotas and how many copies of each file to store (for a different type of redundancy to multi-disk vdevs).

Hopefully this makes ZFS a little clearer for you. It took me a few days to really understand the structure and a few more to find out some of the limitations...

Update: (2009/06/01) Just a short update due to some comments made on this post (worth reading). Most importantly, if you plan on using ZFS on FreeBSD then you need to read the ZFS Tuning Guide. And again, for emphasis... Don't set up ZFS on FreeBSD without reading the ZFS Tuning Guide. It's not that difficult to follow, and there's not a lot to it, and at present some tuning is still a requirement for most systems.

2 comments:

Anonymous said...

You might want to mention the sysctl parameters and the kernel limitation that are necessary to ensure that zfs doesn't crash the kernel if you run out of kernel memory. It's in the tuning guide, but they really don't make enough of a stink that tuning zfs requires some tinkering in order to be stable.

Anonymous said...

zvols work well in the MFC on -STABLE. It appears that previously there was an fsync instead of a ZIL commit, and we now get UFS-on-zvol speeds close to zfs:

www# time dd if=/dev/zero of=/mnt/512MB count=512 bs=1M
512+0 records in
512+0 records out
536870912 bytes transferred in 20.236759 secs (26529491 bytes/sec)
0.000u 1.799s 0:20.25 8.8% 26+1516k 0+4115io 0pf+0w

www# mount | grep /mnt
/dev/zvol/tank/ufs on /mnt (ufs, local)

www# time dd if=/dev/zero of=/tank/bigfiles/512MB count=512 bs=1M
512+0 records in
512+0 records out
536870912 bytes transferred in 23.743324 secs (22611447 bytes/sec)
0.007u 0.845s 0:23.75 3.5% 26+1528k 0+0io 0pf+0w
www# zfs list tank/bigfiles
NAME USED AVAIL REFER MOUNTPOINT
tank/bigfiles 995M 806G 995M /tank/bigfiles

www# uname -a
FreeBSD zfs.freebsd.test 7.2-STABLE FreeBSD 7.2-STABLE #0: Mon May 25 17:24:17 UTC 2009 root@zfs.freebsd.test:/usr/obj/usr/src/sys/GENERIC amd64