Friday, December 13, 2013

Solaris 10, 11 and 11.1 Systems With a ZFS Storage Pool With Cache Devices May Enter a Panic Loop

Description
A problem with L2ARC and ARC synchronization may cause a rewrite of metadata cached in L2ARC to be lost, so that a stale copy of that metadata can later be read back from the cache device and treated as current.

This inconsistency may cause an immediate panic, in which case there is a chance that the corruption does not get written to disk and the system will recover automatically after the panic.

However, if the corrupted data is written to disk and is later accessed by a critical function (syncing context), the system may enter a panic loop.

Alternatively, the issue may manifest itself as checksum errors on metadata reported by a scrub, which may lead to panics later when that piece of metadata is needed to perform work in a critical context.

Occurrence
This issue can occur in the following releases:

SPARC Platform

• Solaris 10 with patch 147147-26 and without patch 150125-01
• Solaris 11
• Solaris 11.1 without SRU 3.4

x86 Platform

• Solaris 10 with patch 147148-26 and without patch 149637-02
• Solaris 11
• Solaris 11.1 without SRU 3.4

Note: Solaris 8 and 9 are not impacted by this issue.
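
To check whether a Solaris 10 system already has one of the kernel patches listed above, or which SRU a Solaris 11.1 system is running, commands along the following lines can be used (the patch IDs shown are the SPARC ones from the list above; substitute the x86 IDs as appropriate, and note that output formats vary between releases):

     On Solaris 10, list installed patches and search for the relevant IDs:
     # showrev -p | grep 147147
     # showrev -p | grep 150125

     On Solaris 11 and 11.1, the version of the 'entire' package reflects the SRU level:
     # pkg info entire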

Systems may be impacted by this issue if they have a cache device. To determine if a system has this configuration, execute the following command:

     # zpool status
       pool: rpool
      state: ONLINE
       scan: none requested
    config:

             NAME          STATE     READ   WRITE   CKSUM
             rpool         ONLINE       0       0       0
               c4t0d0s0    ONLINE       0       0       0

      errors: No known data errors

        pool: tank
       state: ONLINE
        scan: none requested
     config:

             NAME          STATE     READ   WRITE   CKSUM
             tank          ONLINE       0       0       0
               mirror-0    ONLINE       0       0       0
                 c9t0d1    ONLINE       0       0       0
                 c9t0d2    ONLINE       0       0       0
             cache
               c9t0d0      ONLINE       0       0       0

The system in the example above has two ZFS storage pools, 'rpool' and 'tank'; only 'tank' has a cache device. This means the system may be impacted by this issue, which can manifest itself in a variety of ways.
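
If a system has many pools, a small shell loop can flag the ones that have a cache device. This is only a sketch: it assumes a Bourne-compatible shell and simply looks for the word 'cache' in each pool's status output, so it could also match unrelated device names containing that string.

     # for p in `zpool list -H -o name`; do zpool status $p | grep cache > /dev/null && echo "$p has a cache device"; done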

Symptoms
Should the described issue occur, the system will panic. The following are examples of known panic strings related to this issue:

    zfs: freeing free segment (offset=7078106103808 size=512)

    assertion failed: sm->sm_space == space (0x2dd924400 ==
    0x2dd924600), file: ../../common/fs/zfs/space_map.c, line: 355

    assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE,
    FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 948
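
To check whether a system has already hit one of these panics, one approach is to search the system log for the strings above; the exact file names and how far back they reach depend on the syslog configuration and log rotation on the system:

     # grep "freeing free segment" /var/adm/messages*
     # grep "assertion failed" /var/adm/messages*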

One known manifestation can be detected by following these steps:

    1. export the pool:
    # zpool export <pool>

    2. run the following command:
    # /sbin/zdb -emm <pool>

    3. import the pool again:
    # zpool import <pool>

If the zdb(1M) command fails or crashes, the pool is likely affected by one of the manifestations of the issue.
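
As a minimal sketch, the three steps can be combined as shown below, using the zdb exit status to decide whether the check passed. The pool name 'tank' is only an example; substitute your own pool, run the check during a maintenance window, and do not attempt it on the root pool, which cannot be exported while the system is running.

     # POOL=tank
     # zpool export $POOL
     # /sbin/zdb -emm $POOL && echo "$POOL: zdb check passed" || echo "$POOL: zdb check FAILED"
     # zpool import $POOL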

Workaround
This issue is addressed in the following releases:

SPARC Platform

• Solaris 10 with patch 150125-01 or later
• Solaris 11.1 with SRU 3.4 or later

x86 Platform

• Solaris 10 with patch 149637-02 or later
• Solaris 11.1 with SRU 3.4 or later

Solaris 11 systems need to be upgraded to Solaris 11.1 with SRU 3.4 or later to get the fix.
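
The examples below show how such a fix is typically applied; they assume the Solaris 10 patch has already been downloaded and unpacked in the current directory, and that the Solaris 11.1 system has a package repository containing SRU 3.4 or later configured:

     On Solaris 10 (SPARC), apply the kernel patch (149637-02 on x86):
     # patchadd 150125-01

     On Solaris 11.1, update from a repository that provides SRU 3.4 or later:
     # pkg update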
