ZFS Storage Pools Recommendations
This section describes general recommendations for setting up ZFS storage pools.
Systems
* Run ZFS on a system that runs a 64-bit kernel
Memory and Swap Space
* One Gbyte or more of memory is recommended.
* Approximately 64 Kbytes of memory is consumed per mounted ZFS file system. On systems with 1,000s of ZFS file systems, we suggest that you provision 1 Gbyte of extra memory for every 10,000 mounted file systems including snapshots. Be prepared for longer boot times on these systems as well.
* Because ZFS caches data in kernel addressable memory, the kernel size will likely be larger than with other file systems. You may wish to configure additional disk-based swap to account for this difference on systems with limited RAM. You can use the size of physical memory as an upper bound on the extra amount of swap space that might be required. In any case, monitor swap space usage to determine whether swapping is occurring.
* If possible, do not use slices on the same disk for both swap space and ZFS file systems. Keep the swap areas separate from the ZFS file systems. The best policy is to have enough RAM so that your system does not normally use the swap devices.
For additional memory considerations, see #Memory_and_Dynamic_Reconfiguration_Recommendations.
Storage Pools
* Set up one storage pool using whole disks per system, if possible.
* For production systems, consider using whole disks for storage pools rather than slices for the following reasons:
* Allows ZFS to enable the disk's write cache for those disks that have write caches. If you are using a RAID array
with a non-volatile write cache, then this is less of an issue and slices as vdevs should still gain the benefit of
the array's write cache.
* The recovery process of replacing a failed disk is more complex when disks contain both ZFS and UFS file systems on
slices.
* ZFS pools (and underlying disks) that also contain UFS file systems on slices cannot be easily migrated to other
systems by using zpool import and export features.
* In general, maintaining slices increases administration time and cost. Lower your administration costs by
simplifying your storage pool configuration model.
* For all production environments, configure ZFS so that it can repair data inconsistencies. Use ZFS redundancy such as raidz, raidz2, mirror, or copies > 1, regardless of the RAID level implemented on the underlying storage device. With such redundancy, faults in the underlying storage device or its connections to the host can be discovered and repaired by ZFS.
* Do not create a raidz, raidz2, or a mirrored configuration with one logical device of 48 devices. See the sections below for examples of redundant configurations.
* In a replicated pool configuration, leverage multiple controllers to reduce hardware failures and improve performance. For example:
# zpool create tank mirror c1t0d0 c2t0d0
* Set up hot spares to speed up healing in the face of hardware failures. Spares are critical for high mean time to data loss (MTTDL) environments. One or two spares for a 40-disk pool is a commonly used configuration. For example:
# zpool create tank mirror c1t0d0 c2t0d0 [mirror cxtydz ...] spare c1t1d0 c2t1d0
* Run zpool scrub on a regular basis to identify data integrity problems. If you have consumer-quality drives, consider a weekly scrubbing schedule. If you have datacenter-quality drives, consider a monthly scrubbing schedule. See the example after this list.
* ZFS works well with solid-state storage devices that emulate disk drives (SSDs). You might wish to enable compression on storage pools that contain such devices because of their relatively high cost per byte.
* ZFS works well with iSCSI devices. For more information, see the ZFS Administration Guide and the following blog:
x4500_solaris_zfs_iscsi_perfect
* ZFS works well with storage-based protected LUNs (RAID-5 or mirrored LUNs from intelligent storage arrays). However, without ZFS-level redundancy, ZFS can detect, but cannot heal, corrupted blocks flagged by ZFS checksums.
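The following sketch shows how a periodic scrub and pool-level compression might be set up. The pool name tank and the cron schedule are only examples, not part of the original guidance:
# zpool scrub tank        ### start a scrub of the pool
# zpool status tank       ### check scrub progress and results
# zfs set compression=on tank     ### enable compression, for example on SSD-backed pools
To schedule a weekly scrub, a root crontab entry similar to the following could be used:
0 2 * * 0 /usr/sbin/zpool scrub tank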
Additional Cautions for Storage Pools
Review the following cautions before building your ZFS storage pool:
* A pool created with a single slice has no redundancy and is at risk for data loss.
* A pool created with multiple slices but no redundancy is also at risk for data loss. A pool created with multiple slices across disks is harder to manage than a pool created with whole disks.
* A pool created with whole disks but no redundancy is at risk for data loss. In addition, a pool that is not created with ZFS redundancy (RAIDZ or mirror) will only be able to report data inconsistencies. It will not be able to repair data inconsistencies. Finally, a pool created without ZFS redundancy is harder to manage because you cannot replace or detach disks in a non-redundant ZFS configuration.
* A pool cannot be shared across systems. ZFS is not a cluster file system.
Simple or Striped Storage Pool Limitations
Simple or striped storage pools have limitations that should be considered.
* Expansion of space is possible by two methods:
* Adding another disk to expand the stripe. This method should also increase the performance of the storage pool because more
devices can be utilized concurrently. Be aware that for current ZFS implementations, once vdevs are added, they cannot be
removed.
# zpool add tank c2t2d0
* Replacing an existing vdev with a larger vdev. For example:
# zpool replace tank c0t2d0 c2t2d0
* ZFS can tolerate many types of device failures.
* For simple storage pools, metadata is dual redundant, but data is not redundant.
* You can set the redundancy level for data using the ZFS copies property, as shown in the example after this list.
* If a block cannot be properly read and there is no redundancy, ZFS will tell you which files are affected.
* Replacing a failing disk for a simple storage pool requires access to both the old and new device in order to put the old data onto the new device.
# zpool replace tank c0t2d0 ### wrong: cannot recreate data because there is no redundancy
# zpool replace tank c0t2d0 c2t2d0 ### ok
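As a sketch (the dataset name is hypothetical), the copies property is set per file system and applies only to data written after the property is set:
# zfs set copies=2 tank/home      ### store two copies of each newly written data block
# zfs get copies tank/home        ### verify the setting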
Multiple Storage Pools on the Same System
* The pooling of resources into one ZFS storage pool allows different file systems to get the benefit from all resources at different times. This strategy can greatly increase the performance seen by any one file system.
* If some workloads require more predictable performance characteristics, then you might consider separating workloads into different pools.
* For instance, Oracle log writer performance is critically dependent on I/O latency and we expect best performance to be achieved by keeping that load on a separate small pool that has the lowest possible latency.
ZFS Root Pool Considerations
* A root pool must be created with disk slices rather than whole disks. Allocate the entire disk capacity for the root pool to slice 0, for example, rather than partition the disk that is used for booting for many different uses.
* A root pool must be labeled with a VTOC (SMI) label rather than an EFI label.
* A disk that contains the root pool cannot be repartitioned while the root pool is active. If the entire disk's capacity is allocated to the root pool, then it is less likely to need more disk space. If a pool is tight on disk space, you can add another pair of mirrored disks to the root pool.
* Consider keeping the root pool (that is, the pool with the dataset that is allocated for the root file system) separate from pool(s) that are used for data. Several reasons exist for this strategy:
* Some limitations on root pools exist that you would not want to place on data pools: only mirrored pools and single-disk pools are supported. RAID-Z pools and unreplicated pools with more than one disk are not supported.
* Data pools can be architecture-neutral. It might make sense to move a data pool between SPARC and Intel. Root pools are pretty much tied to a particular architecture.
* In general, we think it's a good idea to separate the "personality" of a system from its data. Then, you can change one without having to change the other.
* Setting up different pools for what we currently allocate as separate file systems (root, /usr and /var, for example) makes no sense at all. It probably won't even be a supported configuration. Separate datasets for those directories are possible, but only in the same pool.
Storage Pool Performance Considerations
General Storage Pool Performance Considerations
* For better performance, use individual disks or at least LUNs made up of just a few disks. By providing ZFS with more visibility into the LUNs setup, ZFS is able to make better I/O scheduling decisions.
* Depending on workloads, the current ZFS implementation can, at times, cause much more I/O to be requested than other page-based file systems. If the throughput flowing toward the storage, as observed by iostat, nears the capacity of the channel linking the storage and the host, tuning down the zfs recordsize should improve performance. This tuning is dynamic, but only impacts new file creations. Existing files keep their old recordsize.
* Tuning recordsize does not help sequential type loads. Tuning recordsize is aimed at improving workloads that intensively manage large files using small random reads and writes. See the example after this list.
* Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Under these circumstances, keep pool space under 80% utilization to maintain pool performance.
* For better performance, do not build UFS components on top of ZFS components. For ZFS performance testing, make sure you are not running UFS on top of ZFS components.
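For example, a hypothetical file system serving an application that does 8-Kbyte random I/O could be tuned as follows; because existing files keep their old recordsize, set the property before creating the data files:
# zfs create -o recordsize=8k tank/db     ### new file system with 8K records
# zfs set recordsize=8k tank/db           ### or change an existing file system (affects new files only)
# zfs get recordsize tank/db              ### verify the current value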
Separate Log Devices
The ZFS intent log (ZIL) satisfies POSIX requirements for synchronous transactions. For example, databases often require their transactions to be on stable storage devices when returning from a system call. By default, the intent log is allocated from blocks within the main pool. However, it might be possible to get better performance by using separate intent log devices, such as NVRAM or a dedicated disk. Determine whether separate log devices are appropriate for your environment:
* Any performance improvement seen by implementing a separate log device depends on the device type, the hardware configuration of the pool, and the application workload.
* Log devices can be unreplicated or mirrored, but RAIDZ is not supported for log devices.
* It is highly recommended to mirror the log device. At present, until CR 6707530 is integrated, failure of the log device may cause the storage pool to be inaccessible. Protecting the log device by mirroring will allow you to access the storage pool even if a log device has failed.
* The minimum size of a log device is the same as the minimum size of a device in a pool, which is 64 Mbytes. The amount of in-play data that might be stored on a log device is relatively small. Log blocks are freed when the log transaction (system call) is committed.
* The maximum size of a log device should be approximately 1/2 the size of physical memory because that is the maximum amount of potential in-play data that can be stored. For example, if a system has 16 Gbytes of physical memory, consider a maximum log device size of 8 Gbytes.
* For a target throughput of X MB/sec, and given that ZFS pushes transaction groups every 5 seconds (and can have two outstanding), we expect the ZIL to grow no larger than X MB/sec * 10 sec. So, to service 100 MB/sec of synchronous writes, 1 Gbyte of log device should be sufficient. See the example after this list for adding a mirrored log device.
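The following sketch (device names are examples) includes a mirrored log device at pool creation, or adds one to an existing pool:
# zpool create tank mirror c1t0d0 c2t0d0 log mirror c1t1d0 c2t1d0
# zpool add tank log mirror c1t2d0 c2t2d0     ### add a mirrored log to an existing pool
# zpool status tank                           ### the log devices appear under "logs"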
Memory and Dynamic Reconfiguration Recommendations
The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache file system data. The default is to use all of physical memory except 1 Gbyte. As memory pressure increases, the ARC relinquishes memory.
Consider limiting the maximum ARC memory footprint in the following situations:
* When a known amount of memory is always required by an application. Databases often fall into this category.
* On platforms that support dynamic reconfiguration of memory boards, to prevent ZFS from growing the kernel cage onto all boards.
* A system that requires large memory pages might also benefit from limiting the ZFS cache, which tends to break down large pages into base pages.
* Finally, if the system is running another non-ZFS file system, in addition to ZFS, it is advisable to leave some free memory to host that other file system's caches.
The trade-off is that limiting the ARC memory footprint means the ARC cannot cache as much file system data, which could impact performance. An example of capping the ARC appears below.
In general, limiting the ARC is wasteful if the memory that now goes unused by ZFS is also unused by other system components. Note that non-ZFS file systems typically manage to cache data in what is nevertheless reported as free memory by the system.
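As a sketch, on releases that provide the zfs_arc_max tunable, the ARC can be capped by adding a line like the following to /etc/system and rebooting. The 4-Gbyte cap shown is only an example; the value is in bytes:
set zfs:zfs_arc_max = 0x100000000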
RAID-Z Configuration Requirements and Recommendations
A RAID-Z configuration with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing before data integrity is compromised.
* Start a single-parity RAID-Z (raidz) configuration at 3 disks (2+1)
* Start a double-parity RAID-Z (raidz2) configuration at 5 disks (3+2)
* (N+P) with P = 1 (raidz) or 2 (raidz2) and N equals 2, 4, or 8
* The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups. See the examples after this list.
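For example, the following commands (device names are placeholders) create pools that follow these starting-point recommendations:
# zpool create tank raidz c1t0d0 c2t0d0 c3t0d0                   ### single-parity RAID-Z, 2+1
# zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0    ### double-parity RAID-Z, 3+2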
Mirrored Configuration Recommendations
* No currently reachable limits exist on the number of devices
* On a Sun Fire X4500 server, do not create a single mirror with 48 devices. Consider creating 24 2-device mirrors. This configuration reduces usable disk capacity by 1/2, but up to 24 disks (one disk in each mirror) could be lost without losing data. See the example after this list.
* If you need better data protection, a 3-way mirror has a significantly greater MTTDL than a 2-way mirror. Going to a 4-way (or greater) mirror may offer only marginal improvements in data protection. Concentrate on other methods of data protection if a 3-way mirror is insufficient.
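For example, a pool of 2-way mirrors is built by listing multiple mirror sets, and a 3-way mirror simply lists three devices per set (device names are examples):
# zpool create tank mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0    ### two 2-way mirrors
# zpool create tank mirror c0t0d0 c1t0d0 c2t0d0                  ### a single 3-way mirror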
Should I Configure a RAID-Z, RAID-Z2, or a Mirrored Storage Pool?
A general consideration is whether your goal is to maximize disk space or maximize performance.
* A RAID-Z configuration maximizes disk space and generally performs well when data is written and read in large chunks (128K or more).
* A RAID-Z2 configuration offers excellent data availability, and performs similarly to RAID-Z. RAID-Z2 has significantly better mean time to data loss (MTTDL) than either RAID-Z or 2-way mirrors.
* A mirrored configuration consumes more disk space but generally performs better with small random reads.
* If your I/Os are large, sequential, or write-mostly, then ZFS's I/O scheduler aggregates them in such a way that you'll get very efficient use of the disks regardless of the data replication model.
For better performance, a mirrored configuration is strongly favored over a RAID-Z configuration particularly for large, uncacheable, random read loads.
RAID-Z Configuration Examples
For a RAID-Z configuration on a Thumper, mirror c3t0 and c3t4 (disks 0 and 1) as your root pool, with the remaining 46 disks available for user data. The following raidz2 configurations illustrate how to set up the remaining 46 disks.
* 5x(7+2), 1 hot spare, 17.5 TB
* 4x(9+2), 2 hot spares, 18.0 TB
* 6x(5+2), 4 hot spares, 15.0 TB
ZFS Migration Strategies
Migrating to a ZFS Root File System
Starting in the SXCE (Nevada) build 90 release or the Solaris 10 10/08 release, you can migrate your UFS root file system to a ZFS root file system by upgrading to one of those releases and then using the Solaris Live Upgrade feature. You can create a mirrored ZFS root pool either during an initial installation or a JumpStart installation, or you can use the zpool attach command to create a mirrored ZFS root pool after installation.
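A minimal sketch of the Live Upgrade migration, assuming a new root pool named rpool, a boot environment named zfsBE, and example slice names, might look like this:
# zpool create rpool c1t0d0s0          ### the root pool must use a slice with an SMI label
# lucreate -n zfsBE -p rpool           ### copy the current UFS boot environment into the ZFS pool
# luactivate zfsBE                     ### activate the new boot environment, then reboot
# zpool attach rpool c1t0d0s0 c1t1d0s0 ### optionally mirror the root pool afterward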
Migrating a UFS Root File System With Zones to a ZFS Root File System
Keep the following points in mind when using ZFS datasets on a Solaris system with zones installed:
* Starting in the Solaris 10 10/08 release, you can create a zone root path on a ZFS file system. However, supported configurations are limited when migrating a system with zones to a ZFS root file system by using the Solaris Live Upgrade feature. Review the following supported configurations before you begin migrating a system with zones.
* You can use ZFS as a zone root path in the Solaris Express releases, but keep in mind that patching or upgrading these zones is not supported.
* You cannot associate ZFS snapshots with zones at this time.
Manually Migrating Non-System Data to a ZFS File System
Consider the following practices when migrating non-system-related data from UFS file systems to ZFS file systems (a command sketch follows the list):
* Unshare the existing UFS file systems
* Unmount the existing UFS file systems from the previous mount points
* Mount the UFS file systems to temporary unshared mount points
* Migrate the UFS data with parallel instances of rsync running to the new ZFS file systems
* Set the mount points and the sharenfs properties on the new ZFS file systems
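A minimal command sketch of these steps, with hypothetical file system, device, and mount point names, follows:
# unshare /export/data                            ### stop sharing the UFS file system
# umount /export/data
# mkdir -p /mnt/ufsdata
# mount -F ufs /dev/dsk/c0t1d0s6 /mnt/ufsdata     ### remount UFS at a temporary, unshared mount point
# zfs create tank/data                            ### create the target ZFS file system
# rsync -a /mnt/ufsdata/ /tank/data/              ### run one or more rsync instances in parallel
# zfs set mountpoint=/export/data tank/data
# zfs set sharenfs=on tank/data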
ZFS Interactions With Other Volume Management Products
ZFS works best without any additional volume management software.
If you must use ZFS with SVM because you need an extra level of volume management, ZFS expects that 1 to 4 Mbytes of consecutive logical blocks map to consecutive physical blocks. Keeping to this rule allows ZFS to drive the volume efficiently.
VxVM/FS
General ZFS Administration Information
* ZFS administration is performed while the data is online.
* ZFS file systems are mounted automatically when created.
* ZFS file systems do not have to be mounted by modifying the /etc/vfstab file.
* Currently, ZFS doesn't provide a comprehensive backup/restore utility like ufsdump and ufsrestore commands. However, you can use the zfs send and zfs receive commands to capture ZFS data streams. You can also use the ufsrestore command to restore UFS data into a ZFS file system.
* For most ZFS administration tasks, see the zfs.1m and zpool.1m man pages. For more detailed documentation, see the ZFS Administration Guide. To ask ZFS questions, join the opensolaris/zfs discussion group.
* You can use "iostat -En" to see drives.
Using ZFS for Application Servers Considerations
ZFS NFS Server Practices
Consider the following lessons learned from a UFS to ZFS migration experience:
* Existing user home directories were renamed but they were not unmounted. NFS continued to serve the older home directories when the new home directories were also shared.
* Do not mix UFS directories and ZFS file systems in the same file system hierarchy because this model is confusing to administer and maintain.
* Do not mix NFS legacy shared ZFS file systems and ZFS NFS shared file systems because this model is difficult to maintain. Go with ZFS NFS shared file systems.
* ZFS file systems are shared with the sharenfs file system property and zfs share command. For example:
# zfs set sharenfs=on export/home
* This syntax shares the file system automatically. If ZFS file systems need to be shared, use the zfs share command. For example:
# zfs share export/home
ZFS NFS Server Benefits
* NFSv4-style ACLs are available with ZFS file systems and ACL information is automatically available over NFS.
* ZFS snapshots are available over NFSv4 so NFS-mounted home directories can access their .zfs/snapshot directories.
ZFS file service for SMB (CIFS) or SAMBA
Many of the best practices for NFS also apply to CIFS or SAMBA servers.
* ZFS file systems can be shared using the SMB service, for those OS releases which support it.
# zfs set sharesmb=on export/home
* If native SMB support is not available, then SAMBA offers a reasonable solution.
ZFS Home Directory Server Practices
Consider the following practices when planning your ZFS home directories:
* Set up one file system per user (see the example after this list)
* Use quotas and reservations to manage user disk space
* Use snapshots to back up users' home directories
* Beware that mounting thousands of file systems will impact your boot time
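For example, a per-user file system with a quota, a reservation, and a snapshot might be created as follows (pool, file system, and user names are placeholders):
# zfs create tank/home/user1
# zfs set quota=10g tank/home/user1            ### limit the space this user can consume
# zfs set reservation=1g tank/home/user1       ### guarantee a minimum amount of space
# zfs snapshot tank/home/user1@backup          ### snapshot the home directory for backup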
Consider the following practices when migrating data from UFS file systems to ZFS file systems:
* Unshare the existing UFS file systems
* Unmount the existing UFS file systems from the previous mount points
* Mount the UFS file systems to temporary unshared mount points
* Migrate the UFS data with parallel instances of rsync running to the new ZFS file systems
* Set the mount points and the sharenfs properties on the new ZFS file systems
ZFS Home Directory Server Benefits
* ZFS can handle many small files and many users because of its high capacity architecture.
* Additional space for user home directories is easily expanded by adding more devices to the storage pool.
* ZFS quotas are an easy way to manage home directory space.
* Use ZFS property inheritance to apply properties to many file systems.
ZFS Mail/News Server
ZFS Software Development Server
ZFS Backup / Restore Recommendations
Using ZFS Snapshots
* Use ZFS snapshots as a quick and easy way to back up user home directories. For example, the following syntax creates recursive snapshots of all home directories in the tank/home file system.
# zfs snapshot -r tank/home@monday
* You can create rolling snapshots to help manage snapshot copies.
* You can use the zfs send and zfs receive commands to archive snapshots to more permanent storage.
* You can create an incremental snapshot stream (see "zfs send -i" syntax); an example follows this list.
* The zfs send and receive commands are not enterprise-backup solutions. The receive is an all-or-nothing event: you can get all of a snapshot or none of it. If you store the output of zfs send somewhere, such as in a file or on tape, and that file becomes corrupted, then it will not be possible to receive it correctly and none of the data will be recoverable. See RFE CR 6736794. Enterprise backup solutions, as well as other copying methods (for example, cp, tar, rsync, pax, and cpio), are more appropriate for backup/restore than zfs send/receive.
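As a sketch, an incremental stream between two snapshots can be sent to another pool, or a full stream can be saved to a file (the backup pool and file path are hypothetical):
# zfs send tank/home@monday | zfs receive backup/home                          ### full initial copy
# zfs send -i tank/home@monday tank/home@tuesday | zfs receive backup/home     ### incremental update
# zfs send tank/home@monday > /backup/home.monday                              ### archive a stream to a file (see the caution above)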
Using ZFS With AVS
The Sun StorageTek Availability Suite (AVS) Remote Mirror Copy and Point-in-Time Copy services, previously known as SNDR (Sun Network Data Replicator) and II (Instant Image), are similar to the Veritas VVR (Volume Replicator) and FlashSnap (point-in-time copy) products. AVS is currently available in the Solaris Express release.
SNDR differs from the ZFS send and receive features, which are time-fixed replication features. For example, you can take a point-in-time snapshot, replicate it, or replicate it based on a differential of a prior snapshot. The combination of the AVS II and SNDR features also allows you to perform time-fixed replication. The other modes of the AVS SNDR replication feature allow you to obtain CDP (continuous data replication). ZFS doesn't currently have this feature.
Using ZFS With Enterprise Backup Solutions
* The zfs send and receive commands do not provide an enterprise-level backup solution. At a minimum, these commands provide the ability to send and receive ZFS file system snapshots.
* The robustness of ZFS does not protect you from all failures. You should maintain copies of your ZFS data either by taking regular ZFS snapshots, and saving them to tape or other offline storage. Or, consider using an enterprise backup solution. The best solution is to use an enterprise backup solution to save and restore ZFS data.
* The Sun StorEdge Enterprise Backup Software (Legato Networker 7.3.2) product can fully back up and restore ZFS files including ACLs.
* The Symantec Netbackup 6.5 product can fully back up and restore ZFS files including ACLs. Release 6.5.2A offers some fixes which make backup of ZFS file systems easier.
* IBM's TSM product can be used to back up ZFS files. However, supportability is not absolutely clear. Based on the TSM documentation on the IBM website, ZFS with ACLs is supported with TSM client 5.3. It has been verified (internally to Sun) to work correctly with 5.4.1.2.
tsm> q file /opt/SUNWexplo/output
# Last Incr Date Type File Space Name
--- -------------- ---- ---------------
1 08/13/07 09:18:03 ZFS /opt/SUNWexplo/output
^
|__correct filesystem type
* ZFS snapshots are accessible in the .zfs directories of the file system that was snapshotted. Configure your backup product to skip these directories.
To take advantage of ZFS features, such as snapshots, to improve the backup and restore process, an NDMP service will be released in Solaris (first in OpenSolaris). With this addition, an enterprise backup solution could take advantage of snapshots to improve data protection for large storage repositories.
Looking Forward: ZFS and ADM
The Automatic Data Migration project will provide hierarchical storage management capabilities to ZFS, leveraging the features of SAM-QFS.
ZFS and Database Recommendations
* General remarks
* If the database uses a fixed disk block or record size for I/O, set the ZFS recordsize property to match. You can do this on a per-file system basis, even though multiple file systems can share a single pool.
* With its copy-on-write design, tuning down the ZFS recordsize is a way to improve OLTP performance at the expense of batch reporting queries.
* ZFS checksums every block stored on disk. This alleviates the need for the database layer to checksum data an additional time. If checksums are computed by ZFS instead of at the database layer, any discrepancy can be caught and fixed before the data is returned to the application.
* ZFS performance on databases is a very fast moving target. Keeping up-to-date with Solaris releases is very important.
* As of July 2007, the following features might impact database performance:
* ZFS issues up to 35 concurrent I/Os to each top-level device and this can lead to inflated service time.
* ZFS does some low-level prefetches of up to 64K for each input block, which can cause saturation of storage channels.
* Using vdev-level prefetches of 8K and between 5 and 10 concurrent I/Os was shown to help some database loads.
* Oracle Considerations
* For better OLTP performance, match the ZFS recordsize to the Oracle db_block_size (see the example at the end of this section).
* Keep an eye on batch reporting during mixed batch and OLTP; a small recordsize can lead to an IOPS inflation during batch.
* Use a separate file system with the default 128K record size for Oracle logs.
* Starting in the SXCE build 68 release, you can create a ZFS storage pool with separate devices for the ZFS intent log (ZIL).
* It is often desirable to minimize Oracle log latency because it is sometimes the only I/O in the path of a transaction. With SAN storage arrays, Oracle log latency should be close to the latency of writing to the SAN caches, and there is no need to partition spindle resources between data space and log space: a single pool works. But for Oracle over JBOD storage, it has been observed that using a segregated set of spindles (not subject to competing reads or writes) can help log latency. This in turn can help some workloads, such as those with a high write-to-read ratio at the storage level. For the Oracle log pool, using fast separate devices (NVRAM, battery-backed DRAM, 15K RPM drives) for the intent log appears to be a very good idea.
* PostgreSQL
* Limit the ZFS ARC as described in #Memory_and_Dynamic_Reconfiguration_Recommendations
* See the Oracle Considerations section above about using a separate pool for logs
* Set ZFS recordsize=8K (Note: Do this before any datafile creation!)
* Initialize the database from the log pool, then for each database create a new tablespace pointing to the data pool
* MySQL
* InnoDB:
* Limit the ZFS ARC as described in #Memory_and_Dynamic_Reconfiguration_Recommendations
* See Oracle Considerations section above about using a separate pool for logs
* Set ZFS recordsize=8K (Note: Do this before any datafile creation!)
* Then, use a different path for data and log (set in my.cnf)
* MyISAM:
* Limit the ZFS ARC as described in #Memory_and_Dynamic_Reconfiguration_Recommendations
* Create a separate intent log for the log (WAL). If you do not have this feature (that is, you're running a Solaris 10 release), then create at least 2 pools, one for data and one for the log (WAL)
* Set ZFS recordsize=8K (Note: Do this before any datafile creation!)
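As an illustrative sketch common to the database recommendations above (pool and file system names are hypothetical), data files go in an 8-Kbyte recordsize file system created before the data files exist, and logs go in a separate file system that keeps the default 128-Kbyte recordsize:
# zfs create tank/db
# zfs create -o recordsize=8k tank/db/data     ### data files; set recordsize before creating them
# zfs create tank/db/logs                      ### logs keep the default 128K recordsize
# zfs get recordsize tank/db/data tank/db/logs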
ZFS and Complex Storage Considerations
* Certain storage subsystems stage data through persistent memory devices, such as NVRAM on a storage array, allowing them to respond to writes with very low latency. These memory devices are commonly considered as stable storage, in the sense that they are likely to survive a power outage and other types of breakdown. At critical times, ZFS is unaware of the persistent nature of storage memory, and asks for that memory to be flushed to disk. If indeed the memory devices are considered stable, the storage system should be configured to ignore those requests from ZFS.
ZFS Management Tools / Observability
ZFS and Virtual Tape Libraries (VTLs)
VTL solutions are hardware and software combinations that are used to emulate tapes, tape drives, and tape libraries. VTLs are used in backup/archive systems, with a focus on reducing hardware and software costs.
VTLs consume large amounts of disk space, and we believe ZFS will allow them to manage this massive online disk space more efficiently and securely.
* Falconstor VTL - has been tested on Thumper running ZFS.
* NetVault from BakBone - This backup solution includes a VTL feature that has been tested on Thumper running ZFS.
ZFS Performance Considerations
ZFS and Application Considerations
ZFS and NFS Server Performance
ZFS is deployed over NFS in many different places with no reports of obvious deficiency. Many people have reported disappointing performance, but those scenarios more typically relate to comparing ZFS-over-NFS performance with local file system performance. It is well known that serving NFS leads to significant slowdown compared to local or directly attached file systems, especially for workloads that have low thread parallelism. A dangerous way to create better ZFS-over-NFS performance at the expense of data integrity is to set the kernel variable zil_disable. Setting this parameter is not recommended.
Later versions of ZFS have implemented a separate ZIL log device option, which improves NFS synchronous write operations. This is a better option than disabling the ZIL. Anecdotal evidence suggests a good NFS performance improvement even if the log device does not have nonvolatile RAM. To see if your zpool can support separate log devices, use zpool upgrade -v and look for version 7.
Misconceptions About File Systems and Their Behavior
ZFS Overhead Considerations
* Compression
* Checksum - Checksumming has been measured to consume very roughly 1 GHz of CPU to checksum 500 MB/sec of data.
* RAID-Z
* As of the Solaris 10 11/06 release, checksum, compression, and RAID-Z parity computation all occur in the context of the thread that syncs pool data. Under load, that thread may be a performance choke point. It is expected that in future releases all of these computations will be done in concurrent threads. Compression is no longer single-threaded, via CR 6460622, which is part of the Solaris 10 8/07 release.
Scalability
Data Reconstruction
Traditional RAID systems, where the context of the data is not known, reconstruct data (a process also known as resilvering) by blindly copying blocks in block order. ZFS only needs to reconstruct allocated data, so it can be more efficient than traditional RAID systems when the storage pool is not full. ZFS reconstruction occurs top-down in a priority-based manner.
Since ZFS reconstruction occurs on the host, there is some concern over the performance impact and availability trade-offs. There are two competing RFEs which address this subject:
* CR 6592835, resilver needs to go faster