Monday, September 21, 2015

How Solaris ZFS Cache Management Differs From UFS and VxFS File Systems

ZFS manages its cache differently from other filesystems such as UFS and VxFS.  ZFS's use of kernel memory as a cache results in higher kernel memory allocation than UFS and VxFS.  Monitoring a system with tools such as vmstat therefore reports less free memory with ZFS, which may lead to unnecessary support calls.

SOLUTION

This behaviour is due to ZFS managing its cache differently from UFS and VxFS.  Unlike those filesystems, ZFS does not use the page cache.  In the older, page-based filesystems, cached pages can be moved to the cache list after being written to the backing store and are then counted as free memory; ZFS's caching works quite differently.
ZFS affects the VM subsystem in terms of memory management.  Monitoring systems with vmstat(1M) and prstat(1M) will report less free memory when ZFS is used heavily, e.g. when copying large files into a ZFS filesystem.  The same load running on a UFS filesystem would appear to use less memory, since pages written to the backing store are moved to the cache list and counted as free memory.
ZFS thus uses a significantly different caching model than page-based filesystems like UFS and VxFS, for both performance and architectural reasons.  This may impact existing application software, such as Oracle databases, that itself consumes a large amount of memory.
The primary ZFS cache is an Adaptive Replacement Cache (ARC) built on top of a number of kernel memory (kmem) caches: zio_buf_512 through zio_buf_131072 (plus hdr_cache and buf_cache).  These kmem caches hold the data blocks; ZFS uses variable block sizes from 512 bytes to 128KB.  The ARC will be at least 64MB in size and can grow to a maximum of physical memory less 1GB.  With ZFS, reported freemem will therefore be lower than with other filesystems.
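On releases where the ::arc dcmd is available in mdb, the ARC's current size and its low/high targets can be inspected directly (a quick check only; the exact field names printed differ between Solaris releases):
# echo ::arc | mdb -k
The output typically includes the current ARC size together with its minimum and maximum targets, which correspond to the 64MB floor and the physical-memory-minus-1GB ceiling described above.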
ZFS returns memory from the ARC only when there is memory pressure.  This is different behaviour from pre-Solaris 8 systems (before priority paging), where reading or writing one large (multi-GB) file could lead to a memory shortage and cause paging/swapping of application pages, resulting in slow application performance.  That old problem was due to the failure to distinguish between a useful application page and a filesystem cache page.  See knowledge article 1003383.1.
ZFS frees up its cache in a way that does not cause a memory shortage, so the system can operate with lower freemem without suffering a performance penalty.  ZFS, unlike UFS and VxFS, does not throttle individual writers.  UFS throttles writes when the number of dirty pages per vnode reaches 16MB, the objective being to preserve free memory; the downside is slow application write performance that may be unnecessary when plenty of free memory is available.  ZFS only throttles applications when the data load overflows the I/O subsystem's capacity for 5 to 10 seconds.  See doc.
However, there are occasions when ZFS fails to evict memory from the ARC quickly enough, which can lead to application startup failure due to a memory shortage.  Also, reaping memory from the ARC can trigger high system utilization at the expense of performance.  This issue is addressed in the following bug:
<Defect 6505658> - target MRU size (arc.p) needs to be adjusted more aggressively
The workaround is to limit the ZFS ARC by setting:
set zfs:zfs_arc_max = <size in bytes>
in the /etc/system file.  This tunable determines the maximum size of the ZFS Adaptive Replacement Cache (ARC).  The default is 3/4 of physical memory on systems with less than 4GB of memory, or physmem minus 1GB on systems with more than 4GB.  For databases, the amount of memory they will consume is known in advance, so limit the ARC to the remaining free memory (or possibly even less) by setting the zfs_arc_max tunable to the desired value.
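For example, to cap the ARC at 2GB on a system where the rest of memory is reserved for a database, add the following to /etc/system and reboot (the 2GB figure is purely illustrative; size the cap for your own workload):
set zfs:zfs_arc_max = 0x80000000
The value is given in bytes and may be written in decimal or hexadecimal (0x80000000 bytes = 2GB).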
This tunable is available in Solaris 10 8/07 (Update 4), or with KU 120011-14 installed, along with the fix for <Defect 6505658>.
One can estimate the amount of kernel memory used for caching ZFS data blocks by running:
# echo ::memstat | mdb -k
Look for a line of output similar to this one:
ZFS File Data              514965              4023   25%
Where "ZFS File Data" reports the amount of memory currently allocated in all memory caches associated with ARC file data. This includes both memory actively in use as well as additional memory currently being held unused in the kernel memory caches.  When buffers are evicted from the ARC cache, they are returned to the respective caches (e.g. the zio_data_buf_131072 cache is used for allocating 128k blocks).  Buffers in the kernel caches will stay unused until the VM system can reap excess capacity in these caches when the system comes under memory pressure.  You can determine the kernel memory usage of various caches by running:
# echo ::kmastat | mdb -k
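Since the full ::kmastat output is long, it can be convenient to filter it down to the ZFS buffer caches (cache names such as zio_buf_* and zio_data_buf_* may vary slightly between releases):
# echo ::kmastat | mdb -k | grep zio_
The "memory in use" column for these caches approximates the kernel memory currently held for ARC data buffers, whether in active use or waiting to be reaped.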

One can also monitor ZFS ARC memory usage using:
# kstat -n arcstats
 Look for a line of output similar to this one:
        size                            5473591096
Where: "size" reports amount of active data in the ARC.  This value stays within the "target" size set using "zfs_arc_max" tunable. 
Also, if possible, consider using the ZFS "primarycache" property to better control what is cached in the ARC.  It allows caching to be controlled on a per-dataset (filesystem) basis and thus provides finer control over ARC usage.  If this property is set to "all", both user data and metadata are cached.  If it is set to "metadata", only metadata is cached.  If it is set to "none", neither user data nor metadata is cached.  The default is "all".
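For example, on a dataset holding database files that the database already caches in its own buffer cache, it may make sense to cache only metadata in the ARC (the pool and dataset names below are placeholders):
# zfs set primarycache=metadata dbpool/oradata
# zfs get primarycache dbpool/oradata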
DTrace scripts and the arcstat.pl and arcstat-extended.pl tools are also available and provide a good way to monitor ZFS ARC activity in more detail.
On Solaris 11.1 SRU 20.5 and later, the 'user_reserve_hint_pct' tunable is introduced and setting 'zfs_arc_max' is no longer necessary.  This is documented in Doc ID 1663862.1.
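On those releases the reservation is expressed as a percentage of memory to keep available for applications rather than as an ARC byte limit; for example, to reserve 80% of memory for applications (an illustrative value only; see Doc ID 1663862.1 for guidance on choosing and applying it), add the following to /etc/system and reboot:
set user_reserve_hint_pct=80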

Relief/Workaround
For Solaris 10 releases prior to 8/07 (Update 4), or systems without KU 120011-14 installed, a script is provided in <Defect 6505658>.

ZFS Best Practices Guide:

The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache filesystem data.  The default is to use all of physical memory except 1 Gbyte.  As memory pressure increases, the ARC relinquishes memory.
Consider limiting the maximum ARC memory footprint in the following situations:
  • When a known amount of memory is always required by an application. Databases often fall into this category.
  • On platforms that support dynamic reconfiguration of memory boards, to prevent ZFS from growing the kernel cage onto all boards.
  • A system that requires large memory pages might also benefit from limiting the ZFS cache, which tends to break down large pages into base pages (4KB on x86, 8KB on SPARC).
  • Finally, if the system also runs a non-ZFS filesystem in addition to ZFS, it is advisable to leave some free memory to host that filesystem's caches.
The trade-off is that limiting the ARC's memory footprint means it cannot cache as much file system data, and that limit could impact performance.  In general, limiting the ARC is wasteful if the memory that would go unused by ZFS is also unused by other system components.  Note that non-ZFS filesystems typically cache data in what is nevertheless reported as free memory by the system.
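One rough way to judge whether a configured limit is hurting the cache is to watch the cumulative ARC hit and miss counters under a representative workload before and after applying it:
# kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
A markedly higher miss rate for the same workload after limiting the ARC suggests the cap is too tight.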
