Monday, September 21, 2015

Understanding How ZFS Calculates Used Space

This document explains how ZFS calculates the amount of used space within a zpool and provides examples of how various ZFS features affect that calculation.

DETAILS

A ZFS storage pool is a logical collection of devices that provides space for datasets such as filesystems, snapshots and volumes. Datasets in a pool can be used as filesystems (mounted and unmounted like any other filesystem), snapshotted to provide read-only copies of a filesystem as it existed at some point in the past (clones are writable copies of snapshots), or created as volumes that can be accessed as raw or block devices, and properties can be set on any dataset.
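As a quick illustration, the basic dataset operations look like the following (the pool name 'pool' and dataset names are only examples):
# zfs create pool/fs1                      <-- create a filesystem dataset
# zfs snapshot pool/fs1@monday             <-- read-only, point-in-time copy of the filesystem
# zfs clone pool/fs1@monday pool/fs1clone  <-- writable copy of the snapshot
# zfs create -V 1g pool/vol1               <-- volume, exposed under /dev/zvol/dsk and /dev/zvol/rdsk
# zfs set compression=on pool/fs1          <-- set a property on a dataset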

The interaction of all these datasets and their associated properties, such as quotas, reservations, compression and deduplication (Solaris 11 only), plays a role in how ZFS calculates space usage. Neither the du nor the df command has been updated to account for ZFS filesystem space usage. When calculating ZFS space usage, always use the following command:
# zfs list -r -o space <pool_name>
Alternatively, if the system is running Solaris 10 Update 8 or later, one can use:
# zfs list -t all -o name,avail,used,usedsnap,usedds,usedrefreserv,usedchild -r <pool_name>

ZFS supports three types of dataset: filesystem, volume and snapshot.

Snapshot: Snapshots are read-only copies of a filesystem taken at some point in the past. A clone is a writable copy created from a snapshot. When a snapshot is created, its space is initially shared between the snapshot and the filesystem, and possibly with previous snapshots. As the filesystem changes, space that was previously shared becomes unique to the snapshot and is then counted in the snapshot's space usage. Deleting snapshots can increase the amount of space that is unique to (and thus used by) other snapshots.

Volume: A logical volume exported as a raw or block device.

Properties like quota, reservation and deduplication (dedup) also influence the space reported by ZFS:

Quota: Limits the amount of space a dataset and its descendants can consume. Once this limit is reached, the dataset behaves as if the pool had no space available: processes fail to create files or allocate more space.

Reservation: The minimum amount of space guaranteed to a dataset and its descendants. When the amount of space used is below this value, the dataset is treated as if it were taking up the amount of space specified by its reservation. Reservations are accounted for in the parent dataset's space usage and count against the parent dataset's quotas and reservations.

Deduplication:
Deduplication is the process of eliminating duplicate copies of data. ZFS provides inline block-level deduplication. Enabling the ZFS dedup property saves space and can also improve performance: duplicate data does not need to be written to disk again, and the memory footprint is reduced because many applications can share the same pages of memory. Deduplicated space accounting is reported at the pool level, so you must use the zpool list command rather than the zfs list command to identify disk space consumption when dedup is enabled. If you use the zfs list command to review deduplicated space, you might see that the filesystem size appears to be increasing, because more data can be stored on the same physical device. Using zpool list shows how much physical space is actually being consumed, along with the dedup ratio.
Note: The df and du commands are not dedup or compression aware and will not provide accurate space accounting.  Always use the zfs and zpool commands for accurate space accounting information.
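For example, on Solaris 11 dedup can be enabled per dataset and the pool-wide savings checked with zpool (the pool and dataset names below are only illustrative):
# zfs set dedup=on pool/data    <-- enable inline block-level deduplication on a dataset
# zpool list pool               <-- ALLOC shows physical space consumed; DEDUP shows the dedup ratio
# zpool get dedupratio pool     <-- the dedup ratio can also be queried directly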

The following examples show common scenarios which explain why zfs and zpool may report different values from df and du.

Example #1: Creating snapshots and calculating space used.

Let's create a non-redundant pool of size 1G:
# mkfile 1G /dev/dsk/mydisk
# zpool create mypool mydisk

# zpool list mypool
NAME     SIZE  USED  AVAIL  CAP  HEALTH  ALTROOT
mypool  1016M   88K  1016M   0%  ONLINE  -

# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool   85K   984M  24.5K  /mypool
Now let's create some files and snapshots:
# mkfile 100M /mypool/file1
# zfs snapshot mypool@snap1
# mkfile 100M /mypool/file2
# zfs snapshot mypool@snap2
List the amount of data referenced by the snapshots and clones. Initially, snapshots and clones reference the same amount of space as the filesystem since their contents are identical.
# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool  200M   784M   200M  /mypool
As shown, 200MB of disk space has been used, none of which is used by the snapshots. snap1 refers to 100MB of data (file1) and snap2 refers to 200MB of data (file1 and file2).

# zfs list -t snapshot -r mypool
NAME           USED  AVAIL  REFER  MOUNTPOINT
mypool@snap1  23.5K     -    100M  -
mypool@snap2      0     -    200M  -
Now let's remove file1 to see how it affects space usage. As shown below, nothing changes except the filesystem's REFER value, because file1's data is still referenced by the snapshots.
# rm /mypool/file1
# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool  200M   784M   100M  /mypool

# zfs list -t snapshot -r mypool
NAME           USED  AVAIL  REFER  MOUNTPOINT
mypool@snap1  23.5K      -   100M  -
mypool@snap2  23.5K      -   200M  -
So why don't the snapshots reflect this in their USED column? You might think snap1 should now show 100MB used; however, that would be misleading, because deleting snap1 would have no effect on the space used by the mypool filesystem. Deleting snap1 would only free up about 17KB of disk space. You can, however, use the following zfs option to see how much space is used by snapshots overall.
# zfs list -t all -o space -r mypool
NAME         AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV USEDCHILD
mypool        784M  200M      100M    100M              0      154K
mypool@snap1     -   17K         -       -              -         -
mypool@snap2     -   17K         -       -              -         -
As you can see, 200MB is used in total: 100MB is used by snapshots (file1) and 100MB is used by the dataset itself (file2).

Now let's destroy snapshot snap1, since it is not using much space.
# zfs destroy mypool@snap1
# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool  200M   784M   100M  /mypool
# zfs list -t snapshot -r mypool
NAME          USED  AVAIL  REFER  MOUNTPOINT
mypool@snap2  100M      -   200M  -

We can see that snap2 now shows 100MB used. If I were to delete snap2, I would be deleting 100MB of data (or reclaiming 100MB of space).
# zfs destroy mypool@snap2
# zfs list mypool
NAME   USED  AVAIL  REFER  MOUNTPOINT
mypool 100M   884M   100M  /mypool
# zfs list -t snapshot -r mypool
no datasets available

Example #2: Using Quotas

A quota sets a limit on the space a dataset can use; it does not reserve any space in the pool.
# zfs create -o quota=100m mypool/amer
# zfs get quota mypool/amer
NAME         PROPERTY  VALUE  SOURCE
mypool/amer  quota     100M   local
As you can see, there is no change in the amount of space used:
# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool  100M   884M   100M  /mypool
Let's take a snapshot of the amer dataset.
# zfs snapshot mypool/amer@snap1
# zfs list mypool/amer
NAME         USED  AVAIL  REFER  MOUNTPOINT
mypool/amer 24.5K   100M  24.5K  /mypool/amer
# zfs list -t snapshot -r mypool
NAME              USED  AVAIL REFER  MOUNTPOINT
mypool/amer@snap1    0      - 24.5K  -
Now create a clone from the snapshot and set a 100MB quota on it. Remember, clones can only be created from a snapshot. When a snapshot is cloned, an implicit dependency is created between the clone and the snapshot: even though the clone is created elsewhere in the dataset hierarchy, the original snapshot cannot be destroyed as long as the clone exists. Because a clone initially shares all its disk space with the original snapshot, its used property is initially zero. As changes are made to the clone, it uses more space. The used property of the original snapshot does not include the disk space consumed by the clone.
# zfs clone mypool/amer@snap1 mypool/clone1
# zfs set quota=100m mypool/clone1
# zfs get quota mypool/clone1
NAME           PROPERTY  VALUE  SOURCE
mypool/clone1  quota     100M   local

# zfs list mypool/amer
NAME         USED  AVAIL  REFER  MOUNTPOINT
mypool/amer 24.5K   100M  24.5K  /mypool/amer

# zfs list mypool/clone1
NAME          USED  AVAIL  REFER  MOUNTPOINT
mypool/clone1    0   100M  24.5K  /mypool/clone1
To get a breakdown of the space used by snapshots and clones, use:
# zfs list -r -o space mypool

Example #3: Setting Reservations

Reservations deduct space from the zpool and reserve it for the dataset.
# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool  100M   884M   100M  /mypool

# zfs create -o reservation=100m mypool/ather
# zfs get reservation mypool/ather
NAME          PROPERTY     VALUE  SOURCE
mypool/ather  reservation  100M   local

# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool  200M   784M   100M  /mypool

Example #4: ZFS Volumes (ZVOL)

Similar to the reservation property, creating a zvol results in space being deducted from the zpool.
# zfs list mypool
NAME    USED  AVAIL  REFER  MOUNTPOINT
mypool  100M   884M   100M  /mypool
Let's create a 100MB zvol.
# zfs create -V 100m mypool/vol1
# zfs list -t volume -r mypool
NAME         USED  AVAIL  REFER  MOUNTPOINT
mypool/vol1 22.5K   884M  22.5K  -

# zfs list mypool
NAME   USED  AVAIL REFER  MOUNTPOINT
mypool 200M  784M   100M  /mypool

Example #5: Overlaid Mounts

An overlaid mountpoint occurs when a filesystem is mounted over a directory that already contains data. Normally a filesystem mountpoint should be empty to prevent space accounting anomalies. In situations like this, zfs and zpool will show more space used than df and du, because df and du cannot see underneath the mountpoint whereas zpool and zfs can.
In this example a new pool called 'tank' is created. Within the root filesystem (/tank) a 2GB file is created, and a 500MB file is created within a new directory /tank/challenger2.
# zpool create tank c0d1
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
rpool  19.9G  5.45G  14.4G    27%  ONLINE  -
tank   4.97G  78.5K  4.97G     0%  ONLINE  -
# mkfile 2g /tank/2GBFile
# mkdir /tank/challenger2
# mkfile 500m /tank/challenger2/500MBFILE
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
rpool  19.9G  5.45G  14.4G    27%  ONLINE  -
tank   4.97G  2.49G  2.48G    50%  ONLINE  -
# zfs list -r
NAME                         USED  AVAIL  REFER  MOUNTPOINT
rpool                       5.86G  13.7G   106K  /rpool
rpool/ROOT                  4.34G  13.7G    31K  legacy
rpool/ROOT/s10s_u10wos_17b  4.34G  13.7G  4.34G  /
rpool/dump                  1.00G  13.7G  1.00G  -
rpool/export                  63K  13.7G    32K  /export
rpool/export/home             31K  13.7G    31K  /export/home
rpool/swap                   528M  14.1G   114M  -
tank                        2.49G  2.40G  2.49G  /tank
# df -h
Filesystem             size   used  avail capacity  Mounted on
rpool/ROOT/s10s_u10wos_17b
                        20G   4.3G    14G    25%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   400M   416K   399M     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
/platform/SUNW,T5240/lib/libc_psr/libc_psr_hwcap2.so.1
                        18G   4.3G    14G    25%    /platform/sun4v/lib/libc_psr.so.1
/platform/SUNW,T5240/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
                        18G   4.3G    14G    25%    /platform/sun4v/lib/sparcv9/libc_psr.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                   399M    32K   399M     1%    /tmp
swap                   399M    48K   399M     1%    /var/run
rpool/export            20G    32K    14G     1%    /export
rpool/export/home       20G    31K    14G     1%    /export/home
rpool                   20G   106K    14G     1%    /rpool
tank                   4.9G   2.5G   2.4G    51%    /tank
Now mount an NFS filesystem on /tank/challenger2. The filesystem type is not important.
# mount -F nfs nfssrv:/export/iso /tank/challenger2
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
rpool  19.9G  5.45G  14.4G    27%  ONLINE  -
tank   4.97G  2.49G  2.48G    50%  ONLINE  -
# zfs list -r
NAME                         USED  AVAIL  REFER  MOUNTPOINT
rpool                       5.86G  13.7G   106K  /rpool
rpool/ROOT                  4.34G  13.7G    31K  legacy
rpool/ROOT/s10s_u10wos_17b  4.34G  13.7G  4.34G  /
rpool/dump                  1.00G  13.7G  1.00G  -
rpool/export                  63K  13.7G    32K  /export
rpool/export/home             31K  13.7G    31K  /export/home
rpool/swap                   528M  14.1G   114M  -
tank                        2.49G  2.40G  2.49G  /tank
# df -h
Filesystem             size   used  avail capacity  Mounted on
rpool/ROOT/s10s_u10wos_17b
                        20G   4.3G    14G    25%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   399M   416K   399M     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
/platform/SUNW,T5240/lib/libc_psr/libc_psr_hwcap2.so.1
                        18G   4.3G    14G    25%    /platform/sun4v/lib/libc_psr.so.1
/platform/SUNW,T5240/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
                        18G   4.3G    14G    25%    /platform/sun4v/lib/sparcv9/libc_psr.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                   399M    32K   399M     1%    /tmp
swap                   399M    48K   399M     1%    /var/run
rpool/export            20G    32K    14G     1%    /export
rpool/export/home       20G    31K    14G     1%    /export/home
rpool                   20G   106K    14G     1%    /rpool
tank                   4.9G   2.5G   2.4G    51%    /tank
nfssrv:/export/iso
                        12T   1.8T    11T    15%    /tank/challenger2

# du -sh /tank   <-- Reports space used by spanning filesystem boundaries
 1.8T   /tank

# du -dsh /tank  <-- Reports space used without spanning filesystem boundaries (local)
 2.0G   /tank
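If data is suspected to be hidden beneath an overlaid mountpoint, unmounting the overlying filesystem (the NFS mount in this example) lets du see the underlying directory again:
# umount /tank/challenger2
# du -dsh /tank  <-- should now report roughly 2.5G (the 2GB file plus the previously hidden 500MB file)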

Example #6: Using zdb(1M) to look for large files and directories

One way to look for files or objects consuming space that cannot be seen with df, du, zfs or zpool is to use zdb(1M) with verbose options to dump out all the objects in a given pool or dataset. The size of each object can then be summed to find out where the space is being consumed.
Usage: zdb -ddddd pool|dataset > outputfile
Using the same 'tank' pool as in example #5, there is only one dataset, so dump out all the objects. Note: for a pool with lots of files this can take a long time and will result in a large output file.
# zdb -ddddd tank > zdb-dddd_tank.out
The output file can be processed and the size of each object summed to see the total. This number should match the 'zpool list' ALLOC value.
# awk '{if($1 == "Object") {getline;print $5}}' zdb-dddd_tank.out |sed -e 's/T/ 1099511627776/g' -e 's/G/ 1073741824/g' -e 's/M/ 1048576/g' -e 's/K/ 1024/g'|nawk '{if(NF == 1) { s+=$1 } else {s+=($1*$2)} } END {print s/1073741824"Gb"}'
2.48853Gb
Scan the same output file, but this time look only at 'ZFS plain file' objects. A large difference between the two totals would indicate that ZFS is using space for internal objects, which is unlikely unless the pool has millions of files.
# grep "ZFS plain file" zdb-dddd_tank.out |awk '{print $5}'|sed -e 's/T/ 1099511627776/g' -e 's/G/ 1073741824/g' -e 's/M/ 1048576/g' -e 's/K/ 1024/g'|nawk '{if(NF == 1) { s+=$1 } else {s+=($1*$2)} } END {print s/1073741824"Gb"}'
2.48828Gb

An example of a directory object and a file object from the zdb-dddd_tank.out file is shown below. From this information the filename can be extracted and reviewed manually, or scripts can be written to parse the data further.
=== zdb-dddd_tank.out ===
[...snip...]
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         7    1    16K    512     1K    512  100.00  ZFS directory
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /
        uid     0
        gid     0
        atime   Thu Jan 10 18:58:19 2013
        mtime   Thu Jan 10 18:58:19 2013
        ctime   Thu Jan 10 18:58:19 2013
        crtime  Thu Jan 10 18:58:19 2013
        gen     4
        mode    40555
        size    2
        parent  7
        links   2
        pflags  144
        microzap: 512 bytes, 0 entries
[...snip...]

[...snip...]
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         8    3    16K   128K  2.00G     2G  100.00  ZFS plain file
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 16383
        path    /2GBFile
        uid     0
        gid     0
        atime   Thu Jan 10 18:58:49 2013
        mtime   Thu Jan 10 19:00:09 2013
        ctime   Thu Jan 10 19:00:09 2013
        crtime  Thu Jan 10 18:58:49 2013
        gen     5
        mode    101600
        size    2147483648
        parent  4
        links   1
        pflags  4
[...snip...]
=== END ===

A nawk script to process the 'zdb -dddd' output can be downloaded here.
$ ./zdb_parser.nawk ROOT_s10x_u9wos_14a > ROOT_s10x_u9wos_14a.parsed

The script creates a file with two tab-separated columns, "FILE/OBJECT" and "SIZE (Bytes)", e.g.:
$ head -5 ROOT_s10x_u9wos_14a.parsed
FILE    SIZE
/var/sadm/install/.lockfile     128
/var/sadm/install/.pkg.lock.client      0
/var/sadm/install/.pkg.lock     0
/var/sadm/install/.door 0
/var/sadm/pkg/SUNWkvm/pkginfo   3446

Sort the parsed file to get the files in size order, largest files at the bottom:
$ sort -nk 2,2 ROOT_s10x_u9wos_14a.parsed > ROOT_s10x_u9wos_14a.parsed.sorted

In this example a ~35GB file/object is shown. The reason we see the object number instead of a filename is that the file has been deleted from the filesystem but is still known to ZFS.
$ tail ROOT_s10x_u9wos_14a.parsed.sorted
/var/adm/sa/sa24        111347584
/var/adm/sa/sa16        192024576
/var/adm/sa/sa17        192024576
/var/adm/sa/sa18        192024576
/var/adm/sa/sa19        192024576
/var/adm/sa/sa20        192024576
/var/adm/sa/sa21        192024576
/var/adm/sa/sa22        192024576
/var/adm/sa/sa23        192024576
???<object#420528>      34969691083  <-- ~35GB

The object can be found in the original 'ROOT_s10x_u9wos_14a' zdb output file:
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
    420528    4    16K   128K  32.6G  32.6G  100.00  ZFS plain file
                                        264   bonus  ZFS znode
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 266797
        path    ???<object#420528>
        uid     0
        gid     0
        atime   Wed Mar 18 21:21:17 2015
        mtime   Fri Apr 24 13:50:14 2015
        ctime   Fri Apr 24 13:50:14 2015
        crtime  Wed Mar 18 21:21:17 2015
        gen     4325862
        mode    100600
        size    34969691083
        parent  1966    <-------- [2]
        links   0
        xattr   0
        rdev    0x0000000000000000
Indirect blocks:
               0 L3    0:c0030bc00:600 4000L/600P F=266798 B=4432145/4432145
               0  L2   0:a63c72e00:1e00 4000L/1e00P F=16384 B=4334434/4334434
               0   L1  0:e2c7d0800:1a00 4000L/1a00P F=128 B=4326086/4326086
               0    L0 0:e2b2d1200:20000 20000L/20000P F=1 B=4325864/4325864
           20000    L0 0:e2bcf3c00:20000 20000L/20000P F=1 B=4325865/4325865
           40000    L0 0:e2c6a4c00:20000 20000L/20000P F=1 B=4325867/4325867
           60000    L0 0:e2d8ae000:20000 20000L/20000P F=1 B=4325870/4325870
           [...snip...]
This object is indeed found in the 'ZFS delete queue':
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         2    1    16K  26.0K  3.00K  26.0K  100.00  ZFS delete queue
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        microzap: 26624 bytes, 13 entries

                5ed80 = 388480
                5edc0 = 388544
                5edbc = 388540
                5ed83 = 388483
                5ed84 = 388484
                61878 = 399480
                5ed7f = 388479
                5ed81 = 388481
                5ed85 = 388485
                69c2c = 433196
                5ed7e = 388478
                5ed82 = 388482
                66ab0 = 420528 <-- This one

The delete queue will be processed automatically in the background, though this can take an undetermined amount of time.
To force the delete queue to be processed, the filesystem containing the objects to be deleted needs to be unmounted and remounted.
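For a dataset that can be taken offline (not possible for the root filesystem, as noted below), the remount cycle looks like this; the dataset name is only an example:
# zfs unmount pool/fs
# zfs mount pool/fs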
As the full path to this file is unknown, we turn to the object's zdb output. The line marked [2] above shows the parent object of the 35GB file, which is the following directory object:
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
      1966    1    16K    512     1K    512  100.00  ZFS directory
                                        264   bonus  ZFS znode
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /var/adm/exacct
        [...snip...]

This directory is on the root filesystem (/), which cannot be unmounted, so a system reboot is the only way to force the processing of this dataset's delete queue. Alternatively, wait for ZFS to reap the data in its own time.

Frequently Asked Questions - FAQ

Why don't du and df report correct values for ZFS space usage?


On UFS, the du command reports the size of the data blocks within the file. On ZFS, du reports the actual size of the file as stored on disk. This size includes metadata and reflects any compression. Reporting the on-disk size helps answer the question "how much more space will I get if I remove this file?" So, even when compression is off, you will still see different results between ZFS and UFS.
The GNU version of du has an option to omit this overhead from the calculation, --apparent-size.  This can be useful if the files will be copied to a non-ZFS filesystem that does not have the same overhead, e.g. copying to HSFS on a CD, so that the total size of the files within a directory structure can be determined.  Below is an example for a directory containing a number of files on ZFS - standard du with the overhead included and GNU du without.  GNU du is included with Solaris 11.
# du -sh /var/tmp
20M   /var/tmp

# /usr/gnu/bin/du -sh --apparent-size /var/tmp
19M     /var/tmp

When you compare the space consumption that is reported by the df command with the zfs list command, consider that df is reporting the pool size and not just filesystem sizes. In addition, df doesn't understand descendent datasets or whether snapshots exist. If any ZFS properties, such as compression and quotas are set on filesystems, reconciling the space consumption that is reported by df might be difficult.

Consider the following scenarios that might also impact reported space consumption:

For files that are larger than the recordsize, the last block of the file is generally about half full. With the default recordsize of 128KB, approximately 64KB is wasted per file, which can add up to a significant amount of space across many files. You can work around this by enabling compression. Even if your data is already compressed, the unused portion of the last block is zero-filled and thus compresses very well.

On a RAIDZ-2 pool, every block consumes at least 2 sectors (512-byte chunks) of parity information. The space consumed by this parity information is not reported, and because it varies and can be a much larger percentage for small blocks, it can noticeably skew space reporting. The impact is most extreme with a recordsize of 512 bytes, where each 512-byte logical block consumes 1.5KB (3 times the space). Regardless of the data being stored, if space efficiency is your primary concern, leave the recordsize at the default (128KB) and enable compression.
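Both of the workarounds above come down to the recordsize and compression properties; for example (the dataset name is illustrative):
# zfs get recordsize pool/data        <-- default recordsize is 128K
# zfs set compression=on pool/data    <-- compresses newly written data, including the zero-filled tail of each last block
# zfs get compressratio pool/data     <-- read-only property showing the achieved compression ratio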

Why doesn't the ZFS space usage reported by zpool list and zfs list match?

The SIZE value that is reported by the zpool list command is generally the amount of physical disk space in the pool, but varies depending on the pool's redundancy level. The zfs list command lists the usable space that is available to filesystems, which is the disk space minus ZFS pool redundancy metadata overhead, if any. A non-redundant storage pool created with one 136GB disk reports SIZE and initial FREE values as 136GB. The initial AVAIL space reported by the zfs list command is 134GB, due to a small amount of pool metadata overhead.
# zpool create tank c0t6d0
# zpool list tank
NAME  SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
tank  136G  95.5K  136G   0%  1.00x  ONLINE  -

# zfs list tank
NAME  USED  AVAIL  REFER  MOUNTPOINT
tank   72K   134G    21K  /tank

A mirrored storage pool created with two 136GB disks reports SIZE as 136GB and initial FREE values as 136GB. This reporting is referred to as the deflated space value. The initial AVAIL space reported by the zfs list command is 134GB, due to a small amount of pool metadata overhead.
# zpool create tank mirror c0t6d0 c0t7d0
# zpool list tank
NAME  SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
tank  136G  95.5K  136G   0%  1.00x  ONLINE  -
# zfs list tank
NAME  USED  AVAIL REFER  MOUNTPOINT
tank   72K   134G   21K  /tank

A RAIDZ-2 storage pool created with three 136GB disks reports SIZE as 408GB and initial FREE values as 408GB. This reporting is referred to as the inflated disk space value, which includes redundancy overhead, such as parity information. The initial AVAIL space reported by the zfs list command is 133GB, due to the pool redundancy overhead.
# zpool create tank raidz2 c0t6d0 c0t7d0 c0t8d0
# zpool list tank
NAME  SIZE  ALLOC  FREE  CAP DEDUP  HEALTH  ALTROOT
tank  408G   286K  408G   0% 1.00x  ONLINE  -
# zfs list tank
NAME  USED  AVAIL  REFER  MOUNTPOINT
tank 73.2K   133G  20.9K  /tank

Another major reason for a difference between the used space reported by zpool list and zfs list is the refreservation created for a zvol. Once a zvol is created, zfs list immediately reports the zvol size (plus metadata) as USED, but zpool list does not report that space as allocated until the zvol is actually written to.
# zpool create tank c0t0d0
# zpool list tank
NAME  SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
tank  278G  98.5K  278G   0%  1.00x  ONLINE  -
# zfs list tank
NAME  USED  AVAIL  REFER  MOUNTPOINT
tank   85K   274G    31K  /tank

# zfs create -V 128G tank/vol
# zpool list tank
NAME  SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
tank  278G   174K  278G   0%  1.00x  ONLINE  -
# zfs list tank
NAME  USED  AVAIL  REFER  MOUNTPOINT
tank  132G   142G    31K  /tank

NOTE: In some rare cases, the space reported may not add up. In such situations, space may be hidden in the ZFS delete queue, or there may be an unlinked file still consuming space. Normally, rebooting the server should clear the problem. You can find out whether there are objects listed in the ZFS delete queue by running:
# zdb -dddd <zpool>/<dataset> 1 | grep DELETE

Example:
# zdb -dddd rpool/ROOT/zfsroot/var 1 | grep DELETE

Why does creating a snapshot immediately reduce the available pool size?

If a snapshot is taken of a dataset (filesystem or zvol) on which refreservation is set, the pool usage immediately increases by the usedbydataset amount, even though the snapshot itself does not allocate any space.

For example, here is a dataset created with refreservation=10G, of which 3G is actually used.
# zfs list -o space -t all -r tank
NAME         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank          264G  10.0G         0     32K              0      10.0G
tank/testfs   271G    10G         0   3.00G          7.00G          0

Since refreservation is set to 10G on this dataset, this dataset has 10G-3G=7G usedbyrefreservation.

If a snapshot is taken, the used space increases by 3G, as shown below.
# zfs snapshot tank/testfs@snap1
# zfs list -o space -t all -r tank
NAME               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank                261G  13.0G         0     32K              0      13.0G
tank/testfs         271G  13.0G         0   3.00G            10G          0
tank/testfs@snap1      -      0         -       -              -          -

The dataset uses 3G, and that 3G of content is now held by the snapshot, so it cannot be freed while the snapshot exists.
At the same time, because the dataset has a 10G refreservation, it is guaranteed to be able to use 10G at any time.
The pool must therefore reserve another 3G for the dataset so that it can completely overwrite the 3G of snapshotted content in addition to the 7G already reserved.
As a result, the USED value of tank/testfs immediately increases from 10G to 13G when the snapshot is taken, even though the snapshot itself does not allocate any space.
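Using the values from the listings above, the accounting for tank/testfs works out as:
USED = USEDDS + USEDSNAP + USEDREFRESERV + USEDCHILD
  before the snapshot:  3G + 0 + (10G - 3G) + 0 = 10G
  after the snapshot:   3G + 0 +  10G       + 0 = 13G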

In addition, swap and dump devices have a special feature called preallocation: when a swap or dump device is configured with the swap(1M) or dumpadm(1M) command, all of its blocks are allocated in advance.
Because the full size of the zvol is then counted as used by the dataset, taking a snapshot of a swap or dump device immediately increases the used space by the full size of the zvol.
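For example, on a default root pool layout the swap volume's space-related properties can be inspected as follows (the dataset name follows the standard rpool naming):
# swap -l                                                   <-- lists the configured swap device(s)
# zfs get volsize,refreservation,usedbydataset rpool/swap   <-- space-related properties of the swap zvol
# dumpadm                                                   <-- shows the configured dump device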

