Introduction
On VCS 6.0.3 running on Solaris 11.1 x64, I hit a rather strange problem that is not specific to the use of Veritas Cluster Server. While testing a scenario involving Coordination Point Servers, where I expected both nodes in my two-node cluster to have a kernel panic initiated by Symantec's GAB kernel module, both nodes booted up after the panic but were missing the LLT kernel module. Well, by "missing", I mean it was not loaded. There is a dependency chain between the Symantec kernel modules: LLT -> GAB -> VxFEN, and finally VCS itself. So LLT is the most fundamental building block, and the cluster will not function without it.
What would cause a Solaris 11 server that had a loaded and perfectly functioning kernel driver prior to a reboot to fail to load that driver after a reboot? And not after any old reboot (normal reboots worked fine), only after a reboot initiated by a kernel panic? And then only sometimes: it wasn't 100% reproducible, although I could usually reproduce it after a few attempts.
The answer to this question took many weeks to unravel ...
The symptoms
The mystery began the day both nodes failed to start LLT after a kernel panic. The initial symptom was clear in SMF, where the llt service was reported as being in the maintenance state:

# svcs llt
STATE          STIME    FMRI
maintenance     5:56:42 svc:/system/llt:default

Checking the SMF log file /var/svc/log/system-llt:default.log did not make the problem immediately obvious:

[ Jun 9 05:56:42 Executing start method ("/lib/svc/method/llt start"). ]
Starting LLT...
LLT lltconfig ERROR V-14-2-15000 open /dev/llt failed: No such file or directory
LLT lltconfig ERROR V-14-2-15000 open /dev/llt failed: No such file or directory
Could not start LLT successfully.
[ Jun 9 05:56:42 Method "start" exited with status 1. ]

The svcadm clear llt command did not help; it re-executed the start method with the same result.
Why would /dev/llt be missing? It was there prior to the reboot. Let's see ...

# ls -l /dev/llt
lrwxrwxrwx 1 root root 29 May 30 04:50 /dev/llt -> ../devices/pseudo/clone@0:llt
# ls -lL /dev/llt
/dev/llt: No such file or directory

The underlying device file is missing. Is there a corrupt file in the package?

# pkg verify VRTSllt

The package looks ok; all files that are supposed to be installed are installed, according to IPS. Is the module in /etc/name_to_major?

# grep llt /etc/name_to_major
llt 230

Yes. Next up, is the LLT kernel module loaded?
# modinfo | grep llt

Strange. Why has the module not loaded? Can I load it manually?

# modload /kernel/drv/amd64/llt
can't load module: No such device or address

Hmm, that should have worked. This module was loaded before the panic. What does "No such device or address" mean anyway? It is really one of those overloaded C error codes that can mean a plethora of things depending on which API it came from; it is defined in /usr/include/sys/errno.h as ENXIO (error code 6).
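As a quick sanity check on that error string (a throwaway user-land snippet of my own, nothing to do with the LLT code), the mapping between errno 6 and the message modload printed can be confirmed like this:

/* Confirm that errno 6 (ENXIO) corresponds to "No such device or address". */
#include <stdio.h>
#include <string.h>
#include <errno.h>

int
main(void)
{
        /* On Solaris this prints: ENXIO = 6: No such device or address */
        printf("ENXIO = %d: %s\n", ENXIO, strerror(ENXIO));
        return (0);
}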
What if I attempt to remove and then re-install the kernel module using rem_drv and then add_drv?

# /usr/sbin/rem_drv llt
Invalid argument
Cannot remove major number binding for 230

Wait, that shouldn't happen. We are definitely in the midst of undocumented behaviour here. Let's see that again ...

# /usr/sbin/rem_drv llt
Driver (llt) not installed.

Interesting; so despite the error from the first rem_drv, it has apparently done something. Now let's try to re-install the module:

# /usr/sbin/add_drv -v -f -m '* 0600 root sys' llt
exit status = 0
devfsadm[7862]: verbose: symlink /dev/llt -> ../devices/pseudo/clone@0:llt
Driver (llt) installed.
# modinfo | grep -i llt
264 fffffffff8258000 2dff0 230 1 llt (LLT 6.0.1)

Success. Now can I manually unload and reload the module?

# modunload -i 264
# modload /kernel/drv/amd64/llt
# modinfo | grep llt
264 fffffffff8258000 2dff0 230 1 llt (LLT 6.0.1)

Yes, no problems now. So this node is fixed. It also turns out that simply rebooting the host fixes the problem as well.
So how did these systems get into this state? How do we get a host from a healthy running state, via a kernel panic, to this state where the driver won't load?
Why can't I manually load the module?
Turning on module debugging with

echo "moddebug/W 0xe0000000" | mdb -wk

and repeating the modload test reveals we are in _init() when the ENXIO error is received:

loadmodule:/kernel/drv/amd64/llt(llt): _init() error 6

However, according to Symantec, the LLT _init() function contains three lines; its job is simply to call the kernel's mod_install() function.
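For context, that is just the standard DDI module-entry boilerplate. A minimal sketch of the pattern is shown below; this is the generic form rather than Symantec's actual source, with the driver name and the dev_ops table purely illustrative:

/*
 * Generic Solaris DDI module-entry pattern (sketch only, not the LLT source).
 * A real driver also supplies a populated dev_ops/cb_ops table,
 * attach/detach/getinfo routines, and so on.
 */
#include <sys/conf.h>
#include <sys/modctl.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static struct dev_ops example_dev_ops;  /* placeholder; a real driver fills this in */

static struct modldrv modldrv = {
        &mod_driverops,                 /* type of module: a device driver */
        "example LLT-like driver",      /* description shown by modinfo, e.g. "LLT 6.0.1" */
        &example_dev_ops
};

static struct modlinkage modlinkage = {
        MODREV_1, (void *)&modldrv, NULL
};

int
_init(void)
{
        /* All of the interesting work happens inside mod_install(). */
        return (mod_install(&modlinkage));
}

int
_fini(void)
{
        return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
        return (mod_info(&modlinkage, modinfop));
}

In other words, if _init() returns ENXIO, the error is coming from mod_install() or something beneath it, not from the driver itself.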
So next I used dtrace to analyse the call tree and determine where the ENXIO error was originating:

13  <- find_mbind          0                8228444470450197
13  <- mod_name_to_major   4294967295       8228444470452093
13  <- ddi_name_to_major   4294967295       8228444470453493
13  <- mod_installdrv      6                8228444470455315
13  <- mod_install         6                8228444470457131
13  <- modinstall          6                8228444470458763
13  -> mod_unload          -69269231358152  8228444470460659

The 4th column is arg1, because I was interested in the return value of each function, and in particular in whether it returned 6 (ENXIO). The last column is a timestamp, which I used to sort the function calls into chronological order.
After the LLT _init() function has called mod_install(), we can clearly see from the dtrace output that mod_install() returned 6 because mod_installdrv() returned 6, because ddi_name_to_major() returned (ulong_t)-1 (DDI_MAJOR_T_NONE), because mod_name_to_major() returned (ulong_t)-1 (DDI_MAJOR_T_NONE), because find_mbind() returned NULL.
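That chain is easier to see in code. The fragments below are paraphrased from the illumos sources (modctl.c, modsubr.c and sunddi.c); Oracle's Solaris 11 kernel is closed, but the dtrace output above suggests it behaves the same way, so treat this as a sketch rather than the exact code:

/* mod_install() -> mod_installdrv(): the early sanity check on the major number */
static int
mod_installdrv(struct modldrv *modl, struct modlinkage *modlp)
{
        char *modname;          /* the module's name, e.g. "llt" */
        major_t major;

        /* ... module and dev_ops sanity checks elided ... */

        if ((major = ddi_name_to_major(modname)) == DDI_MAJOR_T_NONE)
                return (ENXIO); /* error 6: "No such device or address" */

        /* ... the rest of the driver installation elided ... */
}

/* ddi_name_to_major() is essentially a thin wrapper around mod_name_to_major(). */
major_t
mod_name_to_major(char *name)
{
        struct bind *mbind;

        /* consult the name-to-major hash table populated at boot */
        if ((mbind = find_mbind(name, mb_hashtab)) != NULL)
                return ((major_t)mbind->b_num);

        return (DDI_MAJOR_T_NONE);      /* (major_t)-1, printed as 4294967295 by dtrace */
}

/* find_mbind(): chained hash lookup, keyed on nm_hash(name). */
static struct bind *
find_mbind(char *name, struct bind **hashtab)
{
        struct bind *mb;

        for (mb = hashtab[nm_hash(name)]; mb != NULL; mb = mb->b_next)
                if (strcmp(name, mb->b_name) == 0)
                        break;

        return (mb);                    /* NULL is what we hit here */
}

Every return value in that chain matches the dtrace output above.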
The llt module does not have a major number in the kernel structures? It has an entry in /etc/name_to_major and it was running before the reboot, so why would it not appear in the kernel structures?
Examining the kernel mb_hash structure

The kernel uses a structure called mb_hash, which is populated by read_binding_file(), called by mod_setup() during system boot. This is a hash table, so in order to inspect its contents in the running kernel with mdb we will need to know the bucket number. I knocked up a quick C program to calculate it for us:

/*
 * Copy of the Solaris nm_hash() routine from modsubr.c, to calculate the
 * bucket number for an mb_hashtab lookup.
 */
#include <stdio.h>

#define MOD_BIND_HASHMASK 63

int main(int argc, char **argv)
{
    char c, *name;
    int hash = 0;

    if (argc != 2) {
        fputs("Syntax: nm_hash <modulename>\n", stderr);
        return 1;
    }
    name = argv[1];
    for (c = *name++; c; c = *name++)
        hash ^= c;
    printf("Hash value for module %s is %d (0x%x)\n",
        argv[1], (hash & MOD_BIND_HASHMASK), (hash & MOD_BIND_HASHMASK));
    return 0;
}

I may as well check all of the Symantec kernel modules, as I had also encountered this exact same problem once with VxFEN during extensive testing, so I suspected LLT, GAB and VxFEN might all be affected. Compiling and running the above C program for these three modules reveals the bucket number for each:
$ ./nm_hash llt
Hash value for module llt is 52 (0x34)
$ ./nm_hash vxfen
Hash value for module vxfen is 35 (0x23)
$ ./nm_hash gab
Hash value for module gab is 36 (0x24)

Now I should be able to check the mb_hash for these modules in the running kernel, and also from the crash dumps created by the kernel panics (which are snapshots of the kernel at the time of the panic). First, were the modules present at the time of the panic?

# mdb 3
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ufs ip hook neti arp usba uhci kssl qlc fctl random stmf stmf_sbd sockfs lofs idm fcp cpc crypto fcip smbsrv nfs logindmux nsmb ptm ipc ]
> *(mb_hash+0x8+0t52*8)::walk b_next|::print struct bind b_name
b_name = 0xffffc1003b500918 "llt"
b_name = 0xffffc100005c3138 "i86pc"
b_name = 0xffffc100005b67a8 "usbif1608,241.config1.0"
<snip>
> *(mb_hash+0x8+0t35*8)::walk b_next|::print struct bind b_name
b_name = 0xffffc100ce3fb518 "vxfen"
b_name = 0xffffc100b693b728 "vxfen"
b_name = 0xffffc100005b4910 "usb403,e889"
b_name = 0xffffc100005b4f70 "pccard138,2"
<snip>
> *(mb_hash+0x8+0t36*8)::walk b_next|::print struct bind b_name
b_name = 0xffffc1002ac79038 "gab"
b_name = 0xffffc100005b4a60 "usb769,11f2"
b_name = 0xffffc100005b4b90 "pci1039,900"
<snip>

Yes, as expected; since VCS was up and running at the time of the panic, these kernel modules must have been present. Let's compare that with the running kernel now on a broken host:
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ufs ip hook neti arp usba uhci kssl qlc fctl random sockfs stmf stmf_sbd lofs idm fcp cpc crypto fcip smbsrv nfs logindmux nsmb ptm ipc ]
> *(mb_hash+0x8+0t52*8)::walk b_next|::print struct bind b_name !grep llt
> *(mb_hash+0x8+0t35*8)::walk b_next|::print struct bind b_name !grep vxfen
> *(mb_hash+0x8+0t36*8)::walk b_next|::print struct bind b_name !grep gab

All three modules are missing from mb_hash.
Finding the smoking gun ...
Something blew away the LLT, GAB and VxFEN modules from the kernel's mb_hash structure. But examining the code on Illumos, nothing appears ever to be removed from this hash structure. If a module is deleted, the entry is left in the hash but marked as deleted. So if the entries are missing entirely, they were never added in the first place.
The contents of /etc/name_to_major should drive the contents of the mb_hash structure. In order to determine the list of kernel modules early in the boot process, before the filesystem drivers are ready, Solaris uses a boot archive, which is a copy of the critical files needed to support a system boot. The boot archive can be updated at any time using the bootadm update-archive command.
The Symantec LLT, GAB and VxFEN SMF methods, which are called to start and stop these services, ordinarily run add_drv and bootadm update-archive when starting up to install the kernel driver, and rem_drv / bootadm update-archive when shutting down to uninstall the driver. The reason why Symantec want to continuously install and uninstall their drivers, rather than installing them once and leaving them alone, is probably lost in the mists of time, as their code hasn't changed much in years and has been ported from one release of Solaris to the next with minimal change. The correct way to manage driver installs/uninstalls in Solaris 11 is via IPS driver actions.
This said, the add_drv and rem_drv commands appear to be working correctly and as documented, so could there be a problem with these commands that I have yet to discover?
What about the /etc/name_to_major file in the boot archive? Is there a problem there? Let's check:

# lofiadm -a /platform/i86pc/boot_archive
/dev/lofi/1
# mount -rF hsfs /dev/lofi/1 /mnt
# gunzip -c /mnt/etc/name_to_major | tail -4
cpqci 288
llt 230
gab 231
vxfen 287

No problem there, and this should have been the archive that was loaded during the boot sequence. This puzzled me for ages, until further testing (performed to rule out the particular way in which GAB panicked the system) revealed that, while I could reproduce the problem with:

# dtrace -w -n syscall:::'{panic()}'

... I could not reproduce the problem with:

# uadmin 5 1

... which was also supposed to panic the system. This turned out to be the key that unlocked the puzzle, because uadmin 5 1 was triggering a full system reboot, whereas the other kernel panics were triggering a fast reboot.
Not so fast!
On speaking with an Oracle kernel developer about all this, it emerged that a fast reboot following a kernel panic is a special case. My ears pricked up. "Oh yes, a special case you say, how so?"

The kernel keeps a copy of the boot archive in memory, which it obtains during the last system boot cycle, and which it uses instead of the ordinary on-disk boot archive only in the event of a fast reboot following a panic. Fast reboots after panics may be disabled by setting the fastreboot_onpanic property of the system/boot-config:default SMF service to false, as described in Changing the Default Behavior of the Fast Reboot Feature, but the feature is switched on by default.
Has the penny dropped? Not yet? Let's do some analysis given this new insight into the Solaris kernel ...
Whoosh! Getting a boot archive out of live kernel memory
Here's a little magic trick. I can extract some pages of kernel memory, save them to disk, and then mount them as an ISO to inspect the contents of the live in-memory copy of the boot archive. Yes I can, it's true, just watch me ...

First I need to use mdb to determine the starting address and size of the in-memory copy, from the crash dump saved at the time of the panic:

# echo 'newkernel::print -t fastboot_info_t fi_files[1] | ::print fastboot_file_t fb_va fb_size' | mdb -k unix.29 vmcore.29
fb_va = 0xffffc1002a8aa000
fb_size = 0x3274000

Then I need to run some C code, which is effectively some glue around kvm_open() and kvm_kread(). Here's the source:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <kvm.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char **argv)
{
    kvm_t *kd;
    int ret;
    int archive_fd;
    char *namelist = NULL, *corefile = NULL, *outputfile;
    char *archive_buf;
    uintptr_t target_addr, target_size;

    if (argc != 6) {
        fprintf(stderr, "Syntax: %s <namelist> <corefile> "
            "<target_addr> <target_size> "
            "<output_file>\n", argv[0]);
        exit(1);
    }

    /* A "-" for namelist or corefile means use the live kernel defaults */
    if (*argv[1] != '-')
        namelist = argv[1];
    if (*argv[2] != '-')
        corefile = argv[2];
    target_addr = (uintptr_t)strtoul(argv[3], NULL, 0);
    target_size = (uintptr_t)strtoul(argv[4], NULL, 0);
    outputfile = argv[5];

    if (!(archive_buf = malloc(target_size))) {
        perror("malloc");
        exit(1);
    }

    kd = kvm_open(namelist, corefile, NULL, O_RDONLY, argv[0]);
    if (kd == NULL) {
        perror("kvm_open");
        exit(1);
    }

    /* Copy target_size bytes of kernel virtual memory into our buffer */
    ret = kvm_kread(kd, target_addr, archive_buf, target_size);
    if (ret == -1) {
        perror("kvm_kread");
        exit(1);
    }
    printf("Read %d bytes of archive from %p\n", ret, (void *)target_addr);

    archive_fd = open(outputfile, O_CREAT|O_RDWR, 0644);
    if (archive_fd == -1) {
        perror("open");
        exit(1);
    }
    ret = write(archive_fd, archive_buf, target_size);
    if (ret == -1) {
        perror("write");
        exit(1);
    }
    printf("Wrote %d bytes of archive to %s\n", ret, outputfile);
    exit(0);
}

I compiled this with gcc -o kvm_archive -m64 -lkvm kvm_archive.c, and ran it using the address and size I had just obtained from mdb:

# kvm_archive unix.29 vmcore.29 0xffffc1002a8aa000 0x3274000 \
    archive_dump.29
Read 52903936 bytes of archive from ffffc1002a8aa000
Wrote 52903936 bytes of archive to archive_dump.29

Let's mount it up and see what's inside:
# lofiadm -a archive_dump.29
/dev/lofi/1
# mount -rF hsfs /dev/lofi/1 /mnt
# pushd /mnt/etc
# gunzip -c name_to_major | egrep 'llt|gab|vxfen'

Where are they? Well, that's the problem then: these modules are missing from the copy of the boot archive that the kernel took during the last boot.
I subsequently wrote a more extensive C program to compare the contents of the in-memory boot archive with /etc/name_to_major on disk and with the boot archive under /platform, and report any differences. This further demonstrated the same discrepancy.
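I haven't reproduced that program here, but a minimal sketch of the comparison idea is shown below. It assumes the two copies of name_to_major have already been extracted to plain text files (for example /etc/name_to_major itself and a gunzip'd copy pulled out of a mounted archive image); the program name and file names are illustrative only:

/*
 * Minimal sketch (not the original tool): compare two name_to_major-format
 * files and report driver names present in the first but missing from the
 * second.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXDRV  4096
#define NAMELEN 64

static int
load_names(const char *path, char names[][NAMELEN], int max)
{
    FILE *fp;
    char line[256];
    int n = 0;

    if ((fp = fopen(path, "r")) == NULL) {
        perror(path);
        exit(1);
    }
    while (n < max && fgets(line, sizeof (line), fp) != NULL) {
        /* each line is "<driver> <major>"; keep just the driver name */
        if (sscanf(line, "%63s", names[n]) == 1)
            n++;
    }
    fclose(fp);
    return (n);
}

int
main(int argc, char **argv)
{
    static char a[MAXDRV][NAMELEN], b[MAXDRV][NAMELEN];
    int na, nb, i, j, found, missing = 0;

    if (argc != 3) {
        fprintf(stderr, "Syntax: %s <reference> <candidate>\n", argv[0]);
        return (1);
    }
    na = load_names(argv[1], a, MAXDRV);
    nb = load_names(argv[2], b, MAXDRV);

    for (i = 0; i < na; i++) {
        found = 0;
        for (j = 0; j < nb; j++) {
            if (strcmp(a[i], b[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found) {
            printf("%s is missing from %s\n", a[i], argv[2]);
            missing++;
        }
    }
    return (missing ? 2 : 0);
}

Run as, say, ./nm_compare /etc/name_to_major /tmp/archive_name_to_major, it prints each driver that the second copy is missing.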
Conclusion
The reason the in-memory copy of the boot archive has this discrepancy is that VCS 6.0.3 calls rem_drv on a normal shutdown and add_drv on a normal startup. But at the point early in the boot process when the kernel takes its copy of the boot archive, the modules are not installed yet, because their start methods have not yet been run. So the kernel has an incomplete picture.
This problem is worked around in VCS 6.0.5, as Symantec no longer run rem_drv in their normal SMF stop method, so the kernel module remains properly installed during the whole boot cycle. The in-memory copy of the boot archive is then correct.
As previously discussed, it is possible to disable fastreboot_onpanic, and it is questionable whether this should be enabled by default anyway, given strange and difficult-to-diagnose inconsistencies such as the one described in this article.
Finally, Oracle have raised bug #19157755, so hopefully this issue will be fixed in the future. It seems to me that the in-memory copy of the boot archive should be updated whenever new drivers are installed; otherwise we will always be prone to peculiarities like this whenever the kernel panics (for whatever reason) at some point after new drivers have been installed but before a clean reboot has taken place. And as we can see in the case of cluster software, a kernel panic cannot be assumed to mean a software, driver or even hardware failure; it can be a perfectly normal reaction, designed to prevent a cluster split-brain scenario from causing data corruption.