Tuesday, October 21, 2014

Solaris 11 missing kernel module after fast reboot initiated by panic

Introduction

On VCS 6.0.3 running on Solaris 11.1 x64, I hit a rather strange problem that is not specific to the use of Veritas Cluster Server.  While testing a scenario involving Coordination Point Servers where I was expecting both of the nodes in my two-node cluster to have a kernel panic initiated by Symantec's GAB kernel module, both nodes booted up after the panic but were missing the LLT kernel module.  Well, by "missing", I mean it was not loaded.
There is a dependency chain between the Symantec kernel modules: LLT -> GAB -> VxFEN, and finally VCS itself.  So LLT is the most fundamental building block, and the cluster will not function without it.
What would cause a Solaris 11 server that had a loaded and perfectly functioning kernel driver prior to a reboot to fail to load that driver after a reboot?  And not after any old reboot; normal reboots worked fine.  Only after a reboot initiated by a kernel panic?  And then only sometimes: this wasn't 100% reproducible, although I could usually reproduce it after a few attempts.
The answer to this question took many weeks to unravel ...

The symptoms

The mystery began the day both nodes failed to start LLT after a kernel panic.  The initial symptom was clear in SMF, where the llt service was reported as in maintenance state:

# svcs llt
STATE          STIME    FMRI
maintenance     5:56:42 svc:/system/llt:default
Checking the SMF log file /var/svc/log/system-llt:default.log did not make the problem immediately obvious:

[ Jun  9 05:56:42 Executing start method ("/lib/svc/method/llt
start"). ]
Starting LLT...
LLT lltconfig ERROR V-14-2-15000 open /dev/llt failed:
No such file or directory
LLT lltconfig ERROR V-14-2-15000 open /dev/llt failed:
No such file or directory
Could not start LLT successfully.
[ Jun  9 05:56:42 Method "start" exited with status 1. ]
The svcadm clear llt command did not help; it simply re-executed the start method with the same result.
Why would /dev/llt be missing?  It was there prior to the reboot.  Let's see ...

# ls -l /dev/llt
lrwxrwxrwx   1 root     root          29 May 30 04:50 /dev/llt ->
../devices/pseudo/clone@0:llt
# ls -lL /dev/llt
/dev/llt: No such file or directory
The underlying device file is missing.  Is there a corrupt file?

# pkg verify VRTSllt
The package looks OK: all files that are supposed to be installed are installed, according to IPS.  Is the module in /etc/name_to_major?

# grep llt /etc/name_to_major 
llt 230
Yes.  Next up, is the LLT kernel module loaded?

# modinfo | grep llt
Strange.  Why has the module not loaded?  Can I load it manually?

# modload /kernel/drv/amd64/llt
can't load module: No such device or address
Hmm, that should have worked.  This module was loaded before the panic.  What does No such device or address mean anyway?  In reality this is one of those overloaded C error codes that can mean a plethora of things depending on which API it came from; it is defined in /usr/include/sys/errno.h as ENXIO (error code 6).
What if I attempt to remove and then re-install the kernel module using rem_drv and then add_drv?

# /usr/sbin/rem_drv llt
Invalid argument
Cannot remove major number binding for 230
Wait, that shouldn't happen.  We are definitely in the midst of undocumented behaviour here.  Let's see that again ...

# /usr/sbin/rem_drv llt
Driver (llt) not installed.
Interesting, so despite the error from the first rem_drv, it has apparently done something.  Now let's try to re-install the module:

# /usr/sbin/add_drv -v -f -m '* 0600 root sys' llt
exit status = 0
devfsadm[7862]: verbose: symlink /dev/llt ->
../devices/pseudo/clone@0:llt
Driver (llt) installed.
# modinfo | grep -i llt
264 fffffffff8258000  2dff0 230   1  llt (LLT 6.0.1)
Success.  Now can I manually unload and reload the module?

# modunload -i 264
# modload /kernel/drv/amd64/llt
# modinfo | grep llt
264 fffffffff8258000  2dff0 230   1  llt (LLT 6.0.1)
Yes, no problems now.  So this node is fixed.  It also turned out that simply rebooting the host fixed the problem.
So how did these systems get into this state?  How do we get a host from a healthy running state, via a kernel panic, to this state where the driver won't load?

Why can't I manually load the module?

Turning on module debugging with echo "moddebug/W 0xe0000000" | mdb -wk and repeating the modload test reveals we are in _init() when the ENXIO error is received:

loadmodule:/kernel/drv/amd64/llt(llt): _init() error 6
However, according to Symantec, the LLT _init() function contains only three lines; its job is simply to call the kernel's mod_install() function.
So next I used dtrace to analyse the call tree, to determine from where the ENXIO error was originating:

  13           <- find_mbind                0 8228444470450197
  13         <- mod_name_to_major  4294967295 8228444470452093
  13       <- ddi_name_to_major    4294967295 8228444470453493
  13     <- mod_installdrv                  6 8228444470455315
  13   <- mod_install                       6 8228444470457131
  13 <- modinstall                          6 8228444470458763
  13 -> mod_unload            -69269231358152 8228444470460659
The fourth column is arg1, because I was interested in the return value of each function, and in particular whether it returned 6 (ENXIO).  The last column is a timestamp, which I used to sort the function calls into chronological order.  After the LLT _init() function has called mod_install(), we can clearly see from the dtrace output that mod_install() returned 6 because mod_installdrv() returned 6, because ddi_name_to_major() returned (major_t)-1 (DDI_MAJOR_T_NONE), because mod_name_to_major() returned (major_t)-1, because find_mbind() returned NULL.
The llt module does not have a major number in the kernel structures?  It has an entry in /etc/name_to_major and it was running before the reboot, so why would it not appear in the kernel structures?

Examining the kernel mb_hash structure

The kernel uses a structure called mb_hash which is populated by read_binding_file() called by mod_setup() during system boot.  This is a hash table structure, so in order to inspect the contents of this hash table in the running kernel with mdb, we will need to know the bucket number.  I knocked up a quick C program to calculate it for us:

/* Copy of Solaris nm_hash() routine from modsubr.c
   to calculate bucket number for mb_hashtab lookup */

#include <stdio.h>

#define MOD_BIND_HASHMASK 63

int main(int argc, char **argv)
{
    char c, *name;
    int hash = 0;

    if (argc!=2) {
        fputs("Syntax: nm_hash <modulename>\n", stderr);
        return 1;
    }
    name = argv[1];
    for (c = *name++; c; c = *name++) hash ^= c;
    printf("Hash value for module %s is %d (0x%x)\n",
                argv[1], (hash & MOD_BIND_HASHMASK),
                         (hash & MOD_BIND_HASHMASK));
    return 0;
}
I may as well check all of the Symantec kernel modules, as I had also encountered this exact same problem once with VxFEN during extensive testing, so I suspected the problem might affect LLT, GAB and VxFEN alike.  Compiling and running the above C program for these three modules reveals the bucket number for each:

$ ./nm_hash llt
Hash value for module llt is 52 (0x34)
$ ./nm_hash vxfen
Hash value for module vxfen is 35 (0x23)
$ ./nm_hash gab
Hash value for module gab is 36 (0x24)
Now I should be able to check the mb_hash for these modules in the running kernel, and also from the crash dumps created by the kernel panics (which are snapshots of the kernel at the time of the panic).  First, were the modules present at the time of the panic?

# mdb 3
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc
pcplusmp scsi_vhci zfs mpt sd ufs ip hook neti arp usba uhci kssl
qlc fctl random stmf stmf_sbd sockfs lofs idm fcp cpc crypto fcip
smbsrv nfs logindmux nsmb ptm ipc ]
> *(mb_hash+0x8+0t52*8)::walk b_next|::print struct bind b_name
b_name = 0xffffc1003b500918 "llt"
b_name = 0xffffc100005c3138 "i86pc"
b_name = 0xffffc100005b67a8 "usbif1608,241.config1.0"
<snip>

> *(mb_hash+0x8+0t35*8)::walk b_next|::print struct bind b_name
b_name = 0xffffc100ce3fb518 "vxfen"
b_name = 0xffffc100b693b728 "vxfen"
b_name = 0xffffc100005b4910 "usb403,e889"
b_name = 0xffffc100005b4f70 "pccard138,2"
<snip>

> *(mb_hash+0x8+0t36*8)::walk b_next|::print struct bind b_name
b_name = 0xffffc1002ac79038 "gab"
b_name = 0xffffc100005b4a60 "usb769,11f2"
b_name = 0xffffc100005b4b90 "pci1039,900"
<snip>
Yes, as expected: since VCS was up and running at the time of the panic, these kernel modules must have been present.  Let's compare that with the running kernel now on a broken host:

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc
pcplusmp scsi_vhci zfs mpt sd ufs ip hook neti arp usba uhci kssl
qlc fctl random sockfs stmf stmf_sbd lofs idm fcp cpc crypto fcip
smbsrv nfs logindmux nsmb ptm ipc ]
> *(mb_hash+0x8+0t52*8)::walk b_next|::print struct bind b_name !grep llt
> *(mb_hash+0x8+0t35*8)::walk b_next|::print struct bind b_name !grep vxfen
> *(mb_hash+0x8+0t36*8)::walk b_next|::print struct bind b_name !grep gab
All three modules are missing from mb_hash.

Finding the smoking gun ...

Something blew away the LLT, GAB and VxFEN modules from the kernel's mb_hash structure.  But examining the code on Illumos, it doesn't look like anything is ever removed from this hash structure.  If a module is deleted, the entry is left in the hash but marked as deleted.  So if the entries are missing entirely, they were never added in the first place.
The contents of /etc/name_to_major should drive the contents of the mb_hash structure.  In order to determine the list of kernel modules early on in the boot process before the filesystem drivers are ready, Solaris uses a boot archive which is a copy of the critical files needed to support a system boot.  The boot archive can be updated at any time using the bootadm update-archive command.
The Symantec LLT, GAB and VxFEN SMF methods, which are called to start and stop these services, ordinarily run add_drv and bootadm update-archive when starting up to install the kernel driver, and rem_drv/bootadm update-archive when shutting down to uninstall the driver.  The reason why Symantec want to continuously install and uninstall their drivers, rather than installing them once and leaving them alone, is probably lost in the mists of time, as their code hasn't changed much in years and has been ported from one release of Solaris to the next with minimal change.  The correct way to manage driver installs/uninstalls in Solaris 11 is via IPS driver actions.
This said, the add_drv and rem_drv commands appear to be working correctly and as documented, so could there be a problem with these commands that I have yet to discover?
What about the /etc/name_to_major file in the boot archive?  Is there a problem there?  Let's check:

# lofiadm -a /platform/i86pc/boot_archive
/dev/lofi/1
# mount -rF hsfs /dev/lofi/1 /mnt
# gunzip -c /mnt/etc/name_to_major | tail -4
cpqci 288
llt 230
gab 231
vxfen 287
No problem there.  This should have been the archive that was loaded during the boot sequence.  This puzzled me for ages, until further testing, performed to rule out the way that GAB panicked the system, revealed that while I could reproduce the problem with:

# dtrace -w -n syscall:::'{panic()}' 
... I could not reproduce the problem with:

# uadmin 5 1
... which was also supposed to panic the system.  This turned out to be the key that unlocked this puzzle, because uadmin 5 1 was triggering a full system reboot, whereas the other kernel panics were triggering a fast reboot.

Not so fast!

On speaking with an Oracle kernel developer about all this, it emerged that a fast reboot following a kernel panic was a special case.  My ears pricked up.  "Oh yes, a special case you say, how so?"
The kernel keeps a copy of the boot archive in memory, which it obtains during the last system boot cycle, and which it uses instead of the ordinary boot archive only in the event of a fast reboot following a panic.  Fast reboots after panics may be disabled, by setting the fastreboot_onpanic property of the system/boot-config:default SMF service to false as described in Changing the Default Behavior of the Fast Reboot Feature, but this feature is switched on by default.
Has the penny dropped?  Not yet?  Let's do some analysis given this new insight into the Solaris kernel ...

Whoosh!  Getting a boot archive out of live kernel memory

Here's a little magic trick.  I can extract some pages of kernel memory, save them to disk, and then mount them as an ISO to inspect the contents of the live in-memory copy of the boot archive.  Yes I can, it's true, just watch me ...
First I need to use mdb to determine the starting address and size of the in-memory copy, from the crash dump saved at the time of the panic:

# echo 'newkernel::print -t fastboot_info_t fi_files[1] |
           ::print fastboot_file_t fb_va fb_size' |
               mdb -k unix.29 vmcore.29
fb_va = 0xffffc1002a8aa000
fb_size = 0x3274000
Then I need to run some C code, which is effectively glue around kvm_open() and kvm_kread().  Here's the source:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <kvm.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char **argv)
{
     kvm_t     *kd;
     int       ret;
     int       archive_fd;
     char      *namelist = NULL, *corefile = NULL, *outputfile;
     char      *archive_buf;
     uintptr_t target_addr, target_size;

     if (argc != 6) {
         fprintf(stderr, "Syntax: %s <namelist> <corefile> "
                              "<target_addr> <target_size> "
                              "<output_file>\n", argv[0]);
         exit(1);
     }

     if (*argv[1] != '-') namelist = argv[1];
     if (*argv[2] != '-') corefile = argv[2];
     target_addr = (uintptr_t)strtoul(argv[3], NULL, 0);
     target_size = (uintptr_t)strtoul(argv[4], NULL, 0);
     outputfile = argv[5];

     if (!(archive_buf = malloc(target_size))) {
         perror("malloc");
         exit(1);
     }

     kd = kvm_open(namelist, corefile, NULL, O_RDONLY, argv[0]);
     if (kd == NULL) {
         perror("kvm_open");
         exit(1);
     }

     ret = kvm_kread(kd, target_addr, archive_buf, target_size);
     if (ret == -1) {
         perror("kvm_read");
         exit(1);
     }

     printf("Read %d bytes of archive from %p\n",
                              ret, (void *)target_addr);

     /* O_CREAT requires a mode argument; truncate any existing file */
     archive_fd = open(outputfile, O_CREAT|O_WRONLY|O_TRUNC, 0644);
     if (archive_fd == -1) {
         perror("open");
         exit(1);
     }

     ret = write(archive_fd, archive_buf, target_size);

     if (ret == -1) {
         perror("write");
         exit(1);
     }

     printf("Wrote %d bytes of archive to %s\n", ret, outputfile);

     exit(0);
}
I compiled this with gcc -m64 -o kvm_archive kvm_archive.c -lkvm, and ran it using the address and size I had just obtained from mdb:

# kvm_archive unix.29 vmcore.29 0xffffc1002a8aa000 0x3274000 \
                                              archive_dump.29
Read 52903936 bytes of archive from ffffc1002a8aa000
Wrote 52903936 bytes of archive to archive_dump.29
Let's mount it up and see what's inside:

# lofiadm -a archive_dump.29
/dev/lofi/1
# mount -rF hsfs /dev/lofi/1 /mnt
# pushd /mnt/etc
# gunzip -c name_to_major | egrep 'llt|gab|vxfen'
Where are they?  Well, that's the problem then: these modules are missing from the copy of the boot archive that the kernel took during the last boot.
I subsequently wrote a more extensive C program that would compare the contents of the in-memory boot archive with /etc/name_to_major on-disk and the boot archive under /platform and report differences.  This further demonstrated the same discrepancy.

Conclusion

The reason the in-memory copy of the boot archive has this discrepancy is because VCS 6.0.3 calls rem_drv on a normal shutdown, and add_drv on a normal startup.  But at the time, early in the boot process, that the kernel takes a copy of the boot archive, the modules are not there yet because their start scripts have not been run.  So the kernel has an incomplete picture.
This problem is worked around in VCS 6.0.5, as Symantec no longer run rem_drv in their normal SMF stop method, so the kernel module remains properly installed during the whole boot cycle.  The in-memory copy of the boot archive is then correct.
As previously discussed, it is possible to disable fastreboot_onpanic, and it is questionable whether this feature should be enabled by default anyway, given strange and difficult-to-diagnose inconsistencies such as the one described in this article.
Finally, Oracle have raised bug #19157755, so hopefully this issue will be fixed in the future.  It seems to me that the in-memory copy of the boot archive should be updated whenever new drivers are installed; otherwise we will always be prone to peculiarities like this whenever the kernel panics (for whatever reason) some time after new drivers were installed, without a clean reboot in between.  And as the cluster software case shows, a kernel panic cannot be assumed to mean a software or driver failure, nor even necessarily a hardware failure: it can be a normal reaction to prevent a cluster split-brain scenario from causing data corruption.
