Tuesday, June 10, 2014

Solaris Fault Manager

    Fault Manager is part of the Solaris self-healing functionality. It provides fault isolation and component restart for hardware components
     (SMF takes care of software components).

    Make sure that the fmd service is running and that the required packages are installed.

# pkginfo | grep fmd
system      SUNWfmd                      Fault Management Daemon and Utilities
system      SUNWfmdr                     Fault Management Daemon and Utilities (Root)
#
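
The package listing does not show whether the daemon is actually running. A minimal sketch for checking the service state, assuming a POSIX-style shell (ksh or bash); the helper name is_online is made up for illustration:

```shell
# is_online: hypothetical helper. Reads the output of `svcs -H -o state`
# on stdin and succeeds only when the state is "online".
is_online() {
    read state && [ "$state" = "online" ]
}

# Guarded so the check is skipped on hosts without SMF.
if command -v svcs >/dev/null 2>&1; then
    if svcs -H -o state svc:/system/fmd:default | is_online; then
        echo "fmd is online"
    else
        echo "fmd is not online; try: svcadm enable svc:/system/fmd:default"
    fi
fi
```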

# fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-diagnosis         1.7     active  CPU/Memory Diagnosis
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
etm                      1.2     active  FMA Event Transport Module
ext-event-transport      0.1     active  External FM event transport
fabric-xlate             1.0     active  Fabric Ereport Translater
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
generic-mem              1.0     active  SPARC-Generic-Memory Diagnosis
io-retire                1.0     active  I/O Retire Agent
pri-monitor              1.0     active  PRI Update Monitor
snmp-trapgen             1.0     active  SNMP Trap Generation Agent
sp-monitor               1.0     active  Service Processor Monitor
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent
#
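
As a quick sanity check on the listing above, the sketch below filters fmadm config output for modules whose STATUS column is anything other than active. The helper name inactive_modules is hypothetical, and the column layout is assumed to match the sample output shown here.

```shell
# inactive_modules: hypothetical helper. Reads `fmadm config` output on stdin
# and prints the name of every module whose STATUS (3rd column) is not "active".
inactive_modules() {
    awk 'NR > 1 && NF >= 3 && $3 != "active" { print $1 }'
}

# Only call fmadm where it exists (that is, on a Solaris host).
if command -v fmadm >/dev/null 2>&1; then
    fmadm config | inactive_modules
fi
```

No output means every loaded module reported itself as active.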


    To check for fmd errors or any hardware failure on the system:

# fmdump -v
# fmadm faulty
# fmadm faulty -r
# fmadm faulty -a

    For example:

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
May 02 09:40:33.3713 f01fb317-2383-caad-9883-83ea390a5bba PCIEX-8000-KP
May 02 10:15:27.9764 f01fb317-2383-caad-9883-83ea390a5bba FMD-8000-58 Updated
May 02 10:15:47.2144 4123dedb-9bbb-c58d-e88a-f12d0b28fa44 PCIEX-8000-KP
#

# fmadm faulty -r
dev:////pci@0,0/pci8086,340a@3/pci111d,806c@0                         degraded
dev:////pci@0,0/pci8086,340a@3                                        degraded
#

# fmadm faulty -r
mem:///unum=MB/CMP0/CH3:R0/D1/J2201                                   degraded
#

    To clear the fmd errors and rotate the logs, use the script below (run it at least two times).

# Run in the global zone (GZ):
fmdump

#Clean up old entries:
for i in `/usr/sbin/fmdump|awk '{print $4}'`
do
 /usr/sbin/fmadm repair $i
done
sleep 2
/usr/sbin/fmadm rotate fltlog
/usr/sbin/fmadm rotate errlog
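
The loop above hands every UUID that fmdump prints to fmadm repair, so a UUID that appears on several log lines is repaired several times. A possible variant that repairs each UUID once is sketched below; the helper name fault_uuids is hypothetical, and the field position is assumed from the sample fmdump output earlier.

```shell
# fault_uuids: hypothetical helper. Reads `fmdump` output on stdin and prints
# each UUID (4th whitespace-separated field) once, skipping the header line.
fault_uuids() {
    awk 'NR > 1 && NF >= 4 { print $4 }' | sort -u
}

# Repair each distinct fault once; guarded so it is a no-op off Solaris.
if command -v fmdump >/dev/null 2>&1; then
    for uuid in `/usr/sbin/fmdump | fault_uuids`
    do
        /usr/sbin/fmadm repair "$uuid"
    done
fi
```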

    If the errors still appear in the output of fmadm faulty -a, clear the fmd cache using the steps below.

Clear ereports and resource cache:

# cd /var/fm/fmd
# rm e* f* c*/eft/* r*/*

Clearing out FMA files with no reboot needed:

svcadm disable -s svc:/system/fmd:default
cd /var/fm/fmd
find /var/fm/fmd -type f -exec ls {} \;
find /var/fm/fmd -type f -exec rm {} \;
svcadm enable svc:/system/fmd:default
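
Because the find ... rm step above cannot be undone, it may be worth archiving the directory first, after the service has been disabled. The sketch below is one way to do that; the helper name archive_fmd_state and the archive location under /var/tmp are assumptions, not part of the original procedure.

```shell
# archive_fmd_state: hypothetical helper. Packs a directory (default
# /var/fm/fmd) into a tar archive so the files can be restored or sent
# to support later.
archive_fmd_state() {
    src=${1:-/var/fm/fmd}
    dest=${2:-/var/tmp/fmd-state.tar}   # assumed location, adjust as needed
    (cd "$src" && tar cf - .) > "$dest"
}

# Run only where the FMA state directory and SMF tools exist.
if [ -d /var/fm/fmd ] && command -v svcadm >/dev/null 2>&1; then
    archive_fmd_state
fi
```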

Then monitor the system for a few hours (or one day). If the errors come back, raise an Oracle SR to get the problem fixed.

To resolve correctable DIMM errors:

    Suppose we received a memory DIMM alert, all memory is shown as enabled on the system controller, and the FMA event was logged on the motherboard.

# prtconf -v | grep -i Mem
Memory size: 16256 Megabytes
    memory (driver not attached)
    virtual-memory (driver not attached)
#

    It is best to clear and repair the FMA events at the operating system level first; then we can go ahead with PSRs.

    From the Solaris level, display the fault in the OS:

# fmadm faulty

    Repair all FMA events listed in the output above:

# fmadm repair <UUID>

    Example:

# fmadm repair a5e24c9f-65c5-ee14-8148-a1ff8b35dc5a
# fmadm repair 07b01c83-c720-481c-cb9a-85774abb30f5

Or use the script below to clear the fmd errors:

# Run in the global zone (GZ):
fmdump

#Clean up old entries:
for i in `/usr/sbin/fmdump|awk '{print $4}'`
do
 /usr/sbin/fmadm repair $i
done
sleep 2
/usr/sbin/fmadm rotate fltlog
/usr/sbin/fmadm rotate errlog


    Check that the faults are cleared:

# fmdump -v
# fmadm faulty
# fmadm faulty -r
# fmadm faulty -a
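
The check can also be scripted. The sketch below counts the resources still listed by fmadm faulty -r, assuming (as in the sample output earlier) that the command prints one line per resource and nothing at all when no faults remain. The helper name count_faulty is hypothetical.

```shell
# count_faulty: hypothetical helper. Reads `fmadm faulty -r` output on stdin
# and prints the number of non-blank lines (one per affected resource).
count_faulty() {
    awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Guarded so it is a no-op on hosts without fmadm.
if command -v fmadm >/dev/null 2>&1; then
    remaining=`fmadm faulty -r | count_faulty`
    if [ "$remaining" -eq 0 ]; then
        echo "no faulty resources reported"
    else
        echo "$remaining resource(s) still reported as faulty"
    fi
fi
```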

    Clear the error reports and resource cache with the following commands:

cd /var/fm/fmd
rm e* f* c*/eft/* r*/*

    Reset the fmd SERD modules:

fmadm reset cpumem-diagnosis
fmadm reset cpumem-retire
fmadm reset eft
fmadm reset io-retire
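
The four resets above can be wrapped in a small loop. This is only a sketch: the helper name reset_modules and the FMADM override variable are made up for illustration, and the override exists solely so the commands can be previewed before they are run for real.

```shell
# Modules to reset, taken from the list above.
MODULES="cpumem-diagnosis cpumem-retire eft io-retire"

# reset_modules: hypothetical helper. Runs `fmadm reset` for each module and
# stops at the first failure. Set FMADM=echo to preview instead of executing.
reset_modules() {
    for m in $MODULES
    do
        ${FMADM:-/usr/sbin/fmadm} reset "$m" || return 1
    done
}

# Dry run: prints "reset <module>" per module instead of calling fmadm.
FMADM=echo reset_modules
```

Calling reset_modules without the FMADM override runs /usr/sbin/fmadm reset against each module in turn.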

    Use the command below on the system controller to confirm that the DIMM is not faulty, then wait a few days and check again for confirmation.

sc> showfaults -v