Tuesday, May 5, 2020

Improper fsck(1M) Usage can Lead to Bogus "unexpected free/allocated inode" Messages From UFS Filesystem

Beginning with Solaris 8 kernel patch 108528-18 for SPARC(R) and 108529-18 for x86, the ufs module includes a sanity check for inode inconsistency. When fsck has been used improperly, this check emits the following messages to the system console and to the /var/adm/messages file:

Without logging:

"NOTICE: : unexpected free inode , run fsck(1M)"

With logging enabled:

"NOTICE: : unexpected free inode , run fsck(1M) -o f"

Additional sanity checks produce the following messages:

Without logging:

"NOTICE: : unexpected allocated inode , run fsck(1M)"

With logging enabled:

"NOTICE: : unexpected allocated inode , run fsck(1M) -o f"

Troubleshooting Steps
If the system displays the unexpected inode messages:

* Check /var/adm/messages for multiple entries for the same inode number

* Check whether there are any inode entries under the /lost+found directory. If an entry matches the inode number displayed in the unexpected inode message, an improper fsck procedure was probably followed in the past.
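
The first check above can be sketched in Python. This is a minimal sketch, not a supported tool: it assumes each message carries the inode number directly after the words "unexpected free inode" or "unexpected allocated inode", and the function name `repeated_inodes` is hypothetical.

```python
import re
from collections import Counter

# Assumed message shape: "... unexpected free inode <number>, run fsck(1M)".
# The exact position of the inode number is an assumption.
PATTERN = re.compile(r"unexpected (free|allocated) inode\s+(\d+)")

def repeated_inodes(lines):
    """Return {inode_number: count} for inodes reported more than once."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return {inode: n for inode, n in counts.items() if n > 1}
```

To use it, pass the lines of /var/adm/messages, for example `repeated_inodes(open("/var/adm/messages", errors="replace"))`. Any inode number it returns can then be compared with the entries under /lost+found.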

There are two scenarios that also need to be considered:

Scenario A - fsck was run without following the proper procedure.

* "fsck" was run on a live filesystem mounted read-write, for example a filesystem that cannot be unmounted: /, /var, /usr

* some filesystem activities running in parallel

* filesystem activity had completed, but the in-memory changes had not yet reached the disk because of delayed or asynchronous writes

In the cases listed above, the "unexpected free/allocated inode" messages are misleading, and fsck could corrupt the filesystem.
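
One way to guard against the first case in Scenario A is to check whether the device is mounted read-write before running fsck. The sketch below is only an illustration under stated assumptions: it assumes /etc/mnttab lines are whitespace-separated as special device, mount point, filesystem type, options, and time, and that the block device for a raw /dev/rdsk/ path is the same path under /dev/dsk/. The function names are hypothetical. When a volume manager is involved, the mounted device is typically the volume rather than the underlying slice, and this sketch does not resolve that mapping.

```python
def block_device(raw_device):
    """Map a raw device path (/dev/rdsk/...) to its block device (/dev/dsk/...)."""
    return raw_device.replace("/rdsk/", "/dsk/", 1)

def mount_entry(raw_device, mnttab_text):
    """Return (mount_point, options) if the device is listed, else None."""
    target = block_device(raw_device)
    for line in mnttab_text.splitlines():
        fields = line.split()
        # Assumed field order: special, mount point, fstype, options, time.
        if len(fields) >= 4 and fields[0] == target:
            return fields[1], fields[3].split(",")
    return None

def is_mounted_read_write(raw_device, mnttab_text):
    """True if the device is mounted and its options include 'rw'."""
    entry = mount_entry(raw_device, mnttab_text)
    return entry is not None and "rw" in entry[1]
```

If `is_mounted_read_write` returns True for the device, running fsck on it falls under Scenario A.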



Scenario B - fsck was run following the proper procedure.

* The filesystem could already have been corrupted, for example by a hardware problem, a volume-manager bug, a ufs bug, a system crash, or a power outage. "fsck" detects the corruption and wants to correct it.

These are genuine problems. fsck should repair them and make the filesystem healthy again.



Proper procedure for running fsck when volume managers are involved:

a. Assuming filesystems that cannot be unmounted and are mirrored

* detach the sub-mirror

* boot from the network or CD-ROM, from an alternate boot disk, or with "boot -b" (from ALOM)

* run fsck on the raw device. Re-run fsck until it comes out clean

* reboot

* attach the sub-mirror back



b. Assuming filesystems that can be unmounted and are mirrored. In "multi-user mode" do the following:

* unmount the filesystem

* detach the sub-mirror

* run fsck on the raw device. Re-run fsck until it comes out clean

* attach the sub-mirror back

* mount the filesystem back
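
The "re-run fsck until it comes out clean" step in both procedures can be sketched as a small loop. This is a minimal sketch under assumptions: it treats exit status 0 as "clean", which the text above does not state, and the function name `fsck_until_clean`, the `run` parameter, and the pass limit are hypothetical. Whether a pass that made repairs also exits with 0 is not covered here, so a final confirming pass by hand remains sensible.

```python
import subprocess

def fsck_until_clean(raw_device, run=None, max_passes=5):
    """Re-run fsck on raw_device until a pass exits with status 0.

    Returns the number of passes used, or raises RuntimeError if the
    limit is reached. Treating exit status 0 as "clean" is an assumption.
    """
    if run is None:
        # Default runner: invoke fsck on the raw device, interactively.
        run = lambda device: subprocess.call(["fsck", device])
    for attempt in range(1, max_passes + 1):
        if run(raw_device) == 0:
            return attempt
    raise RuntimeError(f"{raw_device} still not clean after {max_passes} passes")
```

The `run` parameter exists so the loop can be exercised with a stand-in function before it is pointed at a real device.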

NOTE:

* Make sure you don't fsck the sub-mirror disk that has been detached

* Avoid running fsck against mount points, as in "fsck /", "fsck /usr", "fsck /var"

* Always run fsck as "fsck /dev/rdsk/c#t#d#s#"

* If /usr is a separate mount point then /usr needs to be mounted before using /usr/sbin/fsck
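
The rule about always passing a raw device can be checked mechanically before fsck is invoked. The sketch below assumes each "#" in "c#t#d#s#" stands for one or more decimal digits. The name `is_raw_slice` is hypothetical, and device names that do not follow this exact pattern are rejected.

```python
import re

# Matches the "/dev/rdsk/c#t#d#s#" form named in the notes above,
# reading each "#" as one or more decimal digits (an assumption).
RAW_SLICE = re.compile(r"^/dev/rdsk/c\d+t\d+d\d+s\d+$")

def is_raw_slice(path):
    """True if path looks like a raw disk slice, not a mount point or block device."""
    return RAW_SLICE.match(path) is not None
```

With this check, "/", "/usr", and "/var" are rejected, as is the block device form /dev/dsk/c#t#d#s#.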

"boot -b" :

-----------

This is needed when critical filesystems cannot be mounted and need repair (the system refuses to boot because of filesystem problems, such as a failed /usr mount).

This option mounts "/" read-only and gives a login prompt well before any of the rc scripts (startup scripts) run. Use this option when running fsck on the raw device.


