Tuesday, December 3, 2013

Repairing Damaged Data in ZFS

Repairing Damaged Data

The following sections describe how to identify the type of data corruption and how to repair the data, if possible.
    ZFS uses checksums, redundancy, and self-healing data to minimize the risk of data corruption. Nonetheless, data corruption can occur if a pool isn't redundant, if corruption occurred while a pool was degraded, or if an unlikely series of events conspired to corrupt multiple copies of a piece of data. Regardless of the source, the result is the same: the data is corrupted and therefore no longer accessible. The action taken depends on the type of data being corrupted and its relative value. Two basic types of data can be corrupted:
    • Pool metadata – ZFS requires a certain amount of data to be parsed to open a pool and access datasets. If this data is corrupted, the entire pool or portions of the dataset hierarchy will become unavailable.
    • Object data – In this case, the corruption is within a specific file or directory. This problem might result in a portion of the file or directory being inaccessible, or this problem might cause the object to be broken altogether.
    Data is verified during normal operations as well as during a scrub.

    Identifying the Type of Data Corruption

    By default, the zpool status command shows only that corruption has occurred, but not where this corruption occurred. For example:
    # zpool status monkey
      pool: monkey
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://www.sun.com/msg/ZFS-8000-8A
     scrub: scrub completed after 0h0m with 8 errors on Tue Jul 13 13:17:32 2010
    config:
    
            NAME        STATE     READ WRITE CKSUM
            monkey      ONLINE       8     0     0
              c1t1d0    ONLINE       2     0     0
              c2t5d0    ONLINE       6     0     0
    
    errors: 8 data errors, use '-v' for a list
    Each error indicates only that an error occurred at a given point in time; it is not necessarily still present on the system, although under normal circumstances it is. Certain temporary outages might result in data corruption that is automatically repaired after the outage ends. A complete scrub of the pool is guaranteed to examine every active block in the pool, so the error log is reset whenever a scrub finishes. If you determine that the errors are no longer present, and you don't want to wait for a scrub to complete, reset all errors in the pool by using the zpool clear command.
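As a minimal sketch of checking the summary line shown above from a script, the helper below reads zpool status output on stdin and prints the number from the "errors: N data errors" line. The function name count_data_errors is hypothetical, and the parsing assumes the summary line has exactly the form shown in the example output.

```shell
# Print the number of data errors from the "errors: N data errors" summary
# line of `zpool status` output read on stdin.
# Returns non-zero (and prints nothing) when no such line is present, for
# example when the output says "No known data errors" or lists files (-v).
count_data_errors() {
    awk '
        /^[[:space:]]*errors:[[:space:]]*[0-9]+[[:space:]]+data error/ {
            print $2
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example usage (requires a real pool, so shown as comments only):
#   n=$(zpool status monkey | count_data_errors) && echo "$n data errors"
#   zpool scrub monkey    # re-examine every active block in the pool
#   zpool clear monkey    # reset error counters once the errors are gone
```

The helper only reads text, so it can be tried against saved zpool status output before being pointed at a live pool.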
    If the data corruption is in pool-wide metadata, the output is slightly different. For example:
    # zpool status -v morpheus
      pool: morpheus
        id: 1422736890544688191
     state: FAULTED
    status: The pool metadata is corrupted.
    action: The pool cannot be imported due to damaged devices or data.
       see: http://www.sun.com/msg/ZFS-8000-72
    config:
    
            morpheus    FAULTED   corrupted data
              c1t10d0   ONLINE
    In the case of pool-wide corruption, the pool is placed into the FAULTED state because the pool cannot provide the required redundancy level.
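To tell the two cases apart in a script, one option is to read the state: field from the output. The following is a sketch only; pool_state and pool_is_faulted are hypothetical helper names, and the parsing assumes the field layout shown in the examples above.

```shell
# Print the value of the "state:" field (for example ONLINE or FAULTED)
# from `zpool status` or `zpool import` output read on stdin.
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Succeed only when the pool state read on stdin is FAULTED.
pool_is_faulted() {
    [ "$(pool_state)" = "FAULTED" ]
}

# Example usage (requires a real pool, so shown as comments only):
#   if zpool status morpheus | pool_is_faulted; then
#       echo "morpheus is FAULTED: pool-wide damage, see the action: text"
#   fi
```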

    Repairing a Corrupted File or Directory

    If a file or directory is corrupted, the system might still function, depending on the type of corruption. Any damage is effectively unrecoverable if no good copies of the data exist on the system. If the data is valuable, you must restore the affected data from backup. Even so, you might be able to recover from this corruption without restoring the entire pool.
    If the damage is within a file data block, then the file can be safely removed, thereby clearing the error from the system. Use the zpool status -v command to display a list of file names with persistent errors. For example:
    # zpool status -v
      pool: monkey
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://www.sun.com/msg/ZFS-8000-8A
     scrub: scrub completed after 0h0m with 8 errors on Tue Jul 13 13:17:32 2010
    config:
    
            NAME        STATE     READ WRITE CKSUM
            monkey      ONLINE       8     0     0
              c1t1d0    ONLINE       2     0     0
              c2t5d0    ONLINE       6     0     0
    
    errors: Permanent errors have been detected in the following files: 
    
    /monkey/a.txt
    /monkey/bananas/b.txt
    /monkey/sub/dir/d.txt
    monkey/ghost/e.txt
    /monkey/ghost/boo/f.txt
    The list of file names with persistent errors might be described as follows:
    • If the full path to the file is found and the dataset is mounted, the full path to the file is displayed. For example:
      /monkey/a.txt
    • If the full path to the file is found, but the dataset is not mounted, then the dataset name with no preceding slash (/), followed by the path within the dataset to the file, is displayed. For example:
      monkey/ghost/e.txt
    • If the object number cannot be successfully translated to a file path, either due to an error or because the object doesn't have a real file path associated with it, as is the case for a dnode_t, then the dataset name followed by the object's number is displayed. For example:
      monkey/dnode:<0x0>
    • If an object in the metaobject set (MOS) is corrupted, then a special tag of <metadata>, followed by the object number, is displayed.
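The four forms in the list above can be told apart with simple pattern matching. This is a rough sketch, not part of ZFS: the function name classify_error_entry and the labels it prints are made up for illustration, and it assumes entries look like the examples shown (a leading slash for mounted paths, ":<0x" before an object number, and a leading <metadata> tag).

```shell
# Classify one entry from the "Permanent errors" list of `zpool status -v`.
# Prints one of: mos-object, mounted-path, object-number,
# unmounted-dataset-path.
classify_error_entry() {
    case "$1" in
        "<metadata>"*) echo "mos-object" ;;              # object in the MOS
        /*)            echo "mounted-path" ;;            # full path, dataset mounted
        *":<0x"*)      echo "object-number" ;;           # no file path available
        *)             echo "unmounted-dataset-path" ;;  # dataset name, no leading slash
    esac
}

# Example usage:
#   classify_error_entry "/monkey/a.txt"
#   classify_error_entry "monkey/ghost/e.txt"
```

Only entries classified as mounted paths can be removed or restored directly; the other forms need the dataset to be mounted or the pool to be restored.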
    If the corruption is within a directory or a file's metadata, the only choice is to move the file elsewhere. You can safely move any file or directory to a less convenient location, allowing the original object to be restored in its place.

    Repairing ZFS Storage Pool-Wide Damage

    If the damage is in pool metadata and that damage prevents the pool from being opened or imported, then the following options are available to you:
    • You can attempt to recover the pool by using the zpool clear -F command or the zpool import -F command. These commands attempt to roll back the last few pool transactions to an operational state. You can use the zpool status command to review a damaged pool and the recommended recovery steps. For example:
      # zpool status
        pool: tpool
       state: FAULTED
      status: The pool metadata is corrupted and the pool cannot be opened.
      action: Recovery is possible, but will result in some data loss.
              Returning the pool to its state as of Wed Jul 14 11:44:10 2010
              should correct the problem.  Approximately 5 seconds of data
              must be discarded, irreversibly.  Recovery can be attempted
              by executing 'zpool clear -F tpool'.  A scrub of the pool
              is strongly recommended after recovery.
         see: http://www.sun.com/msg/ZFS-8000-72
       scrub: none requested
      config:
      
              NAME        STATE     READ WRITE CKSUM
              tpool       FAULTED      0     0     1  corrupted data
                c1t1d0    ONLINE       0     0     2
                c1t3d0    ONLINE       0     0     4
      The recovery process as described in the preceding output is to use the following command:
      # zpool clear -F tpool
      If you attempt to import a damaged storage pool, you will see messages similar to the following:
      # zpool import tpool
      cannot import 'tpool': I/O error
              Recovery is possible, but will result in some data loss.
              Returning the pool to its state as of Wed Jul 14 11:44:10 2010
              should correct the problem.  Approximately 5 seconds of data
              must be discarded, irreversibly.  Recovery can be attempted
              by executing 'zpool import -F tpool'.  A scrub of the pool
              is strongly recommended after recovery.
      The recovery process as described in the preceding output is to use the following command:
      # zpool import -F tpool
      Pool tpool returned to its state as of Wed Jul 14 11:44:10 2010.
      Discarded approximately 5 seconds of transactions
      If the damaged pool is in the zpool.cache file, the problem is discovered when the system is booted, and the damaged pool is reported by the zpool status command. If the pool isn't in the zpool.cache file, it won't successfully import or open and you will see the damaged pool messages when you attempt to import the pool.
    • You can import a damaged pool in read-only mode. This method enables you to import the pool so that you can access the data. For example:
      # zpool import -o readonly=on tpool
    • You can import a pool with a missing log device by using the zpool import -m command. 
    • If the pool cannot be recovered by either pool recovery method, you must restore the pool and all its data from a backup copy. The mechanism you use varies widely depending on the pool configuration and backup strategy. First, save the configuration as displayed by the zpool status command so that you can re-create it after the pool is destroyed. Then, use the zpool destroy -f command to destroy the pool.
      Also, keep a file describing the layout of the datasets and the various locally set properties somewhere safe, as this information will become inaccessible if the pool is ever rendered inaccessible. With the pool configuration and dataset layout, you can reconstruct your complete configuration after destroying the pool. The data can then be populated by using whatever backup or restoration strategy you use.
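Two of the steps in the list above lend themselves to small helpers, sketched here under stated assumptions. The function names and output file names are hypothetical. try_rewind_import first runs zpool import -F with -n, which only checks whether a rewind could make the pool importable, and it assumes that check exits non-zero when recovery is not possible. save_pool_layout must be run while the pool is still accessible.

```shell
# Attempt a rewind import, but only after a dry run suggests it can work.
# Assumption: `zpool import -F -n` exits non-zero when rewinding cannot help.
try_rewind_import() {
    pool=$1
    # Dry run: reports whether recovery is possible without performing it
    zpool import -F -n "$pool" || return 1
    # Real recovery: discards the last few transactions, irreversibly
    zpool import -F "$pool"
}

# Save the pool configuration, dataset hierarchy, and locally set
# properties under OUTDIR so the layout can be re-created after a destroy.
save_pool_layout() {
    pool=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # Device layout as reported by zpool status
    zpool status "$pool" > "$outdir/$pool.status.txt" || return 1
    # Names of all datasets in the pool
    zfs list -r -H -o name "$pool" > "$outdir/$pool.datasets.txt" || return 1
    # Only properties whose source is "local" (set explicitly on a dataset)
    zfs get -r -H -s local -o name,property,value all "$pool" \
        > "$outdir/$pool.props.txt"
}

# Example usage (requires real pools, so shown as comments only):
#   save_pool_layout tpool /var/tmp/tpool-layout
#   try_rewind_import tpool || zpool import -o readonly=on tpool
```

Keeping the saved files outside the pool itself matters, since they are needed precisely when the pool cannot be opened.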
