[ClusterLabs] Node is silently unfenced if transition is very long

Vladislav Bogdanov bubble at hoster-ok.com
Tue Apr 19 11:47:45 EDT 2016


Hi,

I just found an issue where a node is silently unfenced.

This is quite a large setup (2 cluster nodes and 8 remote ones) with
plenty of slowly starting resources (Lustre filesystem).

Fencing was scheduled due to a resource stop failure.
Lustre often starts very slowly due to internal recovery, and some such
resources were starting in the same transition in which another resource
failed to stop.
Because that transition neither finished within the time specified by
"failure-timeout" (set to 9 min) nor was aborted, the stop failure expired
and was cleaned up.
There were transition aborts due to attribute changes after the stop failure
happened, but fencing was not initiated for some reason.
The node where the stop failed was the DC.
Pacemaker is 1.1.14-5a6cdd1 (from Fedora, built on EL7).

Here is a log excerpt illustrating the above:
Apr 19 14:57:56 mds1 pengine[3452]:   notice: Move    mdt0-es03a-vg        (Started mds1 -> mds0)
Apr 19 14:58:06 mds1 pengine[3452]:   notice: Move    mdt0-es03a-vg        (Started mds1 -> mds0)
Apr 19 14:58:10 mds1 crmd[3453]:   notice: Initiating action 81: monitor mdt0-es03a-vg_monitor_0 on mds0
Apr 19 14:58:11 mds1 crmd[3453]:   notice: Initiating action 2993: stop mdt0-es03a-vg_stop_0 on mds1 (local)
Apr 19 14:58:11 mds1 LVM(mdt0-es03a-vg)[6228]: INFO: Deactivating volume group vg_mdt0_es03a
Apr 19 14:58:12 mds1 LVM(mdt0-es03a-vg)[6541]: ERROR: Logical volume vg_mdt0_es03a/mdt0 contains a filesystem in use. Can't deactivate volume group "vg_mdt0_es03a" with 1 open logical volume(s)
[...]
Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9939]: ERROR: LVM: vg_mdt0_es03a did not stop correctly
Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9943]: WARNING: vg_mdt0_es03a still Active
Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9947]: INFO: Retry deactivating volume group vg_mdt0_es03a
Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
[...]
Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
Apr 19 14:58:31 mds1 crmd[3453]:   notice: Operation mdt0-es03a-vg_stop_0: unknown error (node=mds1, call=324, rc=1, cib-update=1695, confirmed=true)
Apr 19 14:58:31 mds1 crmd[3453]:   notice: mds1-mdt0-es03a-vg_stop_0:324 [ ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctl
Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) on mds1 failed (target: 0 vs. rc: 1): Error
Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) on mds1 failed (target: 0 vs. rc: 1): Error
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Node mds1 will be fenced because of resource failure(s)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:02:03 mds1 pengine[3452]:   notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
Apr 19 15:02:03 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg        (Started mds1 -> mds0)
[... many of these ]
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Node mds1 will be fenced because of resource failure(s)
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
Apr 19 15:07:23 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:07:23 mds1 pengine[3452]:   notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
Apr 19 15:07:23 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg        (Started mds1 -> mds0)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Processing failed op stop for mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Node mds1 will be fenced because of resource failure(s)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from mds1 after 1000000 failures (max=1000000)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:07:24 mds1 pengine[3452]:   notice: Stop of failed resource mdt0-es03a-vg is implicit after mds1 is fenced
Apr 19 15:07:24 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg        (Started mds1 -> mds0)
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Clearing expired failcount for mdt0-es03a-vg on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Clearing expired failcount for mdt0-es03a-vg on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Clearing expired failcount for mdt0-es03a-vg on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
Apr 19 15:07:33 mds1 crmd[3453]:   notice: Initiating action 2016: monitor mdt0-es03a-vg_monitor_60000 on mds1 (local)
Apr 19 15:07:33 mds1 crmd[3453]:   notice: Transition aborted by deletion of nvpair[@id='status-2-fail-count-mdt0-es03a-vg']: Transient attribute change (cib=0.228.2601, source=abort_unless_down:343, path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-fail-count-mdt0-es03a-vg'], 0)
Apr 19 15:10:09 mds1 pengine[3452]:   notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
Apr 19 15:12:40 mds1 pengine[3452]:   notice: Ignoring expired calculated failure mdt0-es03a-vg_stop_0 (rc=1, magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
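
The timing in the excerpt can be checked with a short sketch. It is only an illustration: the timestamps are copied from the log lines above, the year is taken from the message date, and it assumes that the 9-minute "failure-timeout" is counted from the moment the failed stop was recorded. The variable names are hypothetical and not part of Pacemaker.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpt (year taken from the message date).
stop_failed = datetime(2016, 4, 19, 14, 58, 31)       # crmd: stop returned rc=1
first_fence_plan = datetime(2016, 4, 19, 15, 2, 3)    # pengine: "Scheduling Node mds1 for STONITH"
failcount_cleared = datetime(2016, 4, 19, 15, 7, 32)  # pengine: "Clearing expired failcount"

# "failure-timeout" as described in the message (9 minutes).
failure_timeout = timedelta(minutes=9)

# Assumption: the timeout is measured from the recorded stop failure.
expiry = stop_failed + failure_timeout

print("failure expires at:", expiry.time())
print("cleared after expiry:", failcount_cleared >= expiry)
print("fencing pending before clear:", failcount_cleared - first_fence_plan)
```

Under that assumption the failure expires one second before the first "Clearing expired failcount" message, while fencing had been scheduled but not executed for several minutes.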

Best,
Vladislav



