[ClusterLabs] More pacemaker oddities while stopping DC

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed May 25 02:10:53 EDT 2022


Hi!

We are still suffering from kernel RAM corruption on the Xen hypervisor when a VM or the hypervisor is doing I/O (it has been three months since the bug report at SUSE, with no fix or workaround, which means the whole Xen cluster project was canceled after 20 years; but that's a different topic). All VMs will be migrated to VMware, and the whole SLES15 Xen cluster will be dumped very soon.

My script that detected the RAM corruption tried to shut down pacemaker, hoping for the best (i.e. that the VMs would be live-migrated away). However, pacemaker made some very strange decisions (pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64).
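(The script itself is not shown here; what it attempts is essentially a graceful single-node shutdown, roughly like the following sketch. The standby step is optional; a plain stop of pacemaker should already make the DC migrate or stop the node's resources before the node leaves.)

  crm node standby h16        # optional: drain resources off h16 before stopping
  systemctl stop pacemaker    # graceful stop; the DC should live-migrate or stop h16's resources

This is what the logs show: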

May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos7)[24460]: INFO: test-jeos7: live migration to h19 succeeded.
May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos9)[24463]: INFO: test-jeos9: live migration to h19 succeeded.
May 24 17:05:07 h16 pacemaker-execd[7504]:  notice: prm_xen_test-jeos7 migrate_to (call 321, PID 24281) exited with status 0 (execution time 5500ms, queue time 0ms)
May 24 17:05:07 h16 pacemaker-controld[7509]:  notice: Result of migrate_to operation for prm_xen_test-jeos7 on h16: ok
May 24 17:05:07 h16 pacemaker-execd[7504]:  notice: prm_xen_test-jeos9 migrate_to (call 323, PID 24283) exited with status 0 (execution time 5514ms, queue time 0ms)
May 24 17:05:07 h16 pacemaker-controld[7509]:  notice: Result of migrate_to operation for prm_xen_test-jeos9 on h16: ok

Would you agree that the migration was successful? I'd say YES!

However, this is what happened:

May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Transition 2460 (Complete=16, Pending=0, Fired=0, Skipped=7, Incomplete=57, Source=/var/lib/pacemaker/pengine/pe-input-89.bz2): Stopped
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Cluster node h16 will be fenced: prm_ping_gw1:1 failed there
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  notice: Resource prm_xen_test-jeos7 can no longer migrate from h16 to h19 (will stop on both nodes)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  notice: Resource prm_xen_test-jeos9 can no longer migrate from h16 to h19 (will stop on both nodes)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Scheduling Node h16 for STONITH

So the DC considers the migration to have failed, even though it was reported as a success!
(The ping resource had dumped core due to the RAM corruption before:)

May 24 17:03:12 h16 kernel: ping[23973]: segfault at 213e6 ip 00000000000213e6 sp 00007ffc249fab78 error 14 in bash[5655262bc000+f1000]
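To see in detail why the scheduler reached that conclusion, the saved scheduler input can be replayed offline with crm_simulate; a hedged sketch (pe-input-89.bz2 above belongs to the earlier, aborted transition, so the file for the fencing transition has to be taken from the corresponding pacemaker-schedulerd "saving inputs in ..." message):

  # Replay the scheduler input of the transition that requested fencing;
  # -S simulates the transition, -s shows the allocation scores.
  crm_simulate -x /var/lib/pacemaker/pengine/pe-input-XX.bz2 -S -s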

So it stopped the VMs that had been migrated successfully before:
May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Initiating stop operation prm_xen_test-jeos7_stop_0 on h19
May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Initiating stop operation prm_xen_test-jeos9_stop_0 on h19
May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Requesting fencing (reboot) of node h16

Those test VMs were not important; the important part is that, due to the failure to stop the ping resource, pacemaker did not even try to migrate the other (non-test) VMs away, so those went down hard with the fenced node.
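As an aside, and purely as a hedged sketch (not our actual configuration, parameters invented): a clearly non-essential clone like the ping monitor could have its stop operation configured with on-fail=block, so that a failed stop only blocks that one resource instead of escalating to node-level fencing. In crm shell syntax that would look roughly like:

  primitive prm_ping_gw1 ocf:pacemaker:ping \
      params host_list="..." \
      op monitor interval=30s timeout=60s \
      op stop interval=0 timeout=60s on-fail=block

The trade-off is that a blocked resource stays around in a failed state and needs manual cleanup later; whether that is acceptable for a ping clone is a judgment call.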

For completeness I should add that the RAM corruption also affected pacemaker itself:

May 24 17:05:02 h16 kernel: traps: pacemaker-execd[24272] general protection fault ip:7fc572327bcf sp:7ffca7cd22d0 error:0 in libc-2.31.so[7fc572246000+1e6000]
May 24 17:05:02 h16 kernel: pacemaker-execd[24277]: segfault at 0 ip 0000000000000000 sp 00007ffca7cd22f0 error 14 in pacemaker-execd[56347df4e000+b000]
May 24 17:05:02 h16 kernel: Code: Bad RIP value.

That affected the stop of some (non-essential) ping and MD-RAID-based resources:
May 24 17:05:02 h16 pacemaker-execd[7504]:  warning: prm_ping_gw1_stop_0[24272] terminated with signal: Segmentation fault
May 24 17:05:02 h16 pacemaker-execd[7504]:  warning: prm_iotw-md10_stop_0[24277] terminated with signal: Segmentation fault
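(These failed stops are presumably what set the fail counts to INFINITY, which is where the "1000000 failures (max=1000000)" in the scheduler messages above comes from; on a surviving node that could be checked with something like:)

  crm_failcount --query --resource prm_ping_gw1 --node h16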

May 24 17:05:03 h16 sbd[7254]:  warning: inquisitor_child: pcmk health check: UNHEALTHY
May 24 17:05:03 h16 sbd[7254]:  warning: inquisitor_child: Servant pcmk is outdated (age: 1844062)

Note: If the "outdated" number is in seconds, that would be roughly 21 days (1844062 s / 86400 s per day ≈ 21.3 days), which is definitely odd!

Regards,
Ulrich



