[ClusterLabs] More pacemaker oddities while stopping DC

Andrei Borzenkov arvidjaar at gmail.com
Fri May 27 02:56:23 EDT 2022


On 25.05.2022 09:47, Gao,Yan via Users wrote:
> On 2022/5/25 8:10, Ulrich Windl wrote:
>> Hi!
>>
>> We are still suffering from kernel RAM corruption on the Xen hypervisor when a VM or the hypervisor is doing I/O (three months since the bug report at SUSE, but still no fix or workaround, which means the whole Xen cluster project was canceled after 20 years, but that's a different topic). All VMs will be migrated to VMware, dumping the whole SLES15 Xen cluster very soon.
>>
>> My script that detected the RAM corruption tried to shut down pacemaker, hoping for the best (i.e. that the VMs would be live-migrated away). However, some very strange decisions were made (pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64):
>>
>> May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos7)[24460]: INFO: test-jeos7: live migration to h19 succeeded.
>> May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos9)[24463]: INFO: test-jeos9: live migration to h19 succeeded.
>> May 24 17:05:07 h16 pacemaker-execd[7504]:  notice: prm_xen_test-jeos7 migrate_to (call 321, PID 24281) exited with status 0 (execution time 5500ms, queue time 0ms)
>> May 24 17:05:07 h16 pacemaker-controld[7509]:  notice: Result of migrate_to operation for prm_xen_test-jeos7 on h16: ok
>> May 24 17:05:07 h16 pacemaker-execd[7504]:  notice: prm_xen_test-jeos9 migrate_to (call 323, PID 24283) exited with status 0 (execution time 5514ms, queue time 0ms)
>> May 24 17:05:07 h16 pacemaker-controld[7509]:  notice: Result of migrate_to operation for prm_xen_test-jeos9 on h16: ok
>>
>> Would you agree that the migration was successful? I'd say YES!
> 
> Practically maybe yes, in terms of what migrate_to has achieved with the 
> VirtualDomain RA, but technically no from pacemaker's point of view.
> 
> Following the migrate_to on the source node, a migrate_from operation on 
> the target node and a stop operation on the source node are still needed 
> to complete a successful live migration.
> 

I do not know if there is a formal state machine for performing live
migration in pacemaker, but speaking about the VirtualDomain RA:

a) a successful migrate_to means that the VM is running on the target node
b) migrate_from does not need anything from the source node and could be run
even if the source node becomes unavailable
c) successful fencing of the source node means that the resource is stopped
on the source node

So technically this could work, but it would require pacemaker to
recognize a "partially migrated" resource state.

>>
>> However this is what happened:
>>
>> May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Transition 2460 (Complete=16, Pending=0, Fired=0, Skipped=7, Incomplete=57, Source=/var/lib/pacemaker/pengine/pe-input-89.bz2): Stopped
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Cluster node h16 will be fenced: prm_ping_gw1:1 failed there
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result (error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 away from h16 after 1000000 failures (max=1000000)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 away from h16 after 1000000 failures (max=1000000)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 away from h16 after 1000000 failures (max=1000000)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 away from h16 after 1000000 failures (max=1000000)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 away from h16 after 1000000 failures (max=1000000)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 away from h16 after 1000000 failures (max=1000000)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  notice: Resource prm_xen_test-jeos7 can no longer migrate from h16 to h19 (will stop on both nodes)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  notice: Resource prm_xen_test-jeos9 can no longer migrate from h16 to h19 (will stop on both nodes)
>> May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Scheduling Node h16 for STONITH
>>
>> So the DC considers the migration to have failed, even though it was reported as a success!
> 
> A so-called partial live-migration could no longer continue here.
> 
> Regards,
>    Yan
> 
>> (The ping resource had dumped core earlier due to the RAM corruption:)
>>
>> May 24 17:03:12 h16 kernel: ping[23973]: segfault at 213e6 ip 00000000000213e6 sp 00007ffc249fab78 error 14 in bash[5655262bc000+f1000]
>>
>> So it stopped the VMs that had been migrated successfully before:
>> May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Initiating stop operation prm_xen_test-jeos7_stop_0 on h19
>> May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Initiating stop operation prm_xen_test-jeos9_stop_0 on h19
>> May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Requesting fencing (reboot) of node h16
>>
>> Those test VMs were not important, but the important part is that, due to the failure to stop the ping resource, pacemaker did not even try to migrate the other (non-test) VMs away, so they were killed when the node was hard-fenced.
>>
>> For completeness I should add that the RAM corruption also affected pacemaker itself:
>>
>> May 24 17:05:02 h16 kernel: traps: pacemaker-execd[24272] general protection fault ip:7fc572327bcf sp:7ffca7cd22d0 error:0 in libc-2.31.so[7fc572246000+1e6000]
>> May 24 17:05:02 h16 kernel: pacemaker-execd[24277]: segfault at 0 ip 0000000000000000 sp 00007ffca7cd22f0 error 14 in pacemaker-execd[56347df4e000+b000]
>> May 24 17:05:02 h16 kernel: Code: Bad RIP value.
>>
>> That affected the stop of some (non-essential) ping and MD-RAID-based resources:
>> May 24 17:05:02 h16 pacemaker-execd[7504]:  warning: prm_ping_gw1_stop_0[24272] terminated with signal: Segmentation fault
>> May 24 17:05:02 h16 pacemaker-execd[7504]:  warning: prm_iotw-md10_stop_0[24277] terminated with signal: Segmentation fault
>>
>> May 24 17:05:03 h16 sbd[7254]:  warning: inquisitor_child: pcmk health check: UNHEALTHY
>> May 24 17:05:03 h16 sbd[7254]:  warning: inquisitor_child: Servant pcmk is outdated (age: 1844062)
>>
>> Note: If the "outdated" number is seconds (1844062 s is roughly 21 days), that's definitely odd!
>>
>> Regards,
>> Ulrich
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/


