[ClusterLabs] Antw: [EXT] Stopping a server failed and fenced, despite disabling stop timeout
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Jan 18 03:28:07 EST 2021
>>> Digimer <lists at alteeve.ca> wrote on 18.01.2021 at 03:11 in message
<816a4d1e-a92d-2a4c-b1a0-cf4353e3fa41 at alteeve.ca>:
> Hi all,
>
> Mind the slew of questions, well into testing now and finding lots of
> issues. This one is two questions... :)
>
> I set a server to be unmanaged in pacemaker while the server was
> running. Then I tried to remove the resource, and it refused saying it
> couldn't stop it, and to use '--force'. So I did, and the node got
> fenced. Now, the resource was set up with:
My guess is you shouldn't do it that way: Why not stop the resource,
unconfigure it in the cluster, then start it manually?
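A minimal sketch of that order, assuming the resource is first put back under cluster management (it was unmanaged in the report). The function name remove_cleanly and the RUN dry-run switch are made up for illustration; only the pcs subcommands themselves are real, and by default nothing is executed, the commands are just printed:

```shell
#!/bin/sh
# Sketch only: stop the resource under cluster control, then delete it.
# RUN defaults to "echo", so commands are printed rather than executed.
# Set RUN="" in the environment to actually run them.
RUN="${RUN-echo}"

remove_cleanly() {
    rsc="$1"
    # Assumption: the resource was set unmanaged earlier, so manage it again,
    # otherwise the cluster will not act on the disable request.
    $RUN pcs resource manage "$rsc" || return 1
    # Ask the cluster to stop the resource and wait until it has stopped.
    $RUN pcs resource disable "$rsc" --wait || return 1
    # Delete the definition only once nothing is running any more,
    # so no active orphan is left behind.
    $RUN pcs resource remove "$rsc"
}

remove_cleanly srv01-test
```

After that, the server can be started by hand outside the cluster with whatever tool normally manages it.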
>
> pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
> meta allow-migrate="true" target-role="started" \
> op monitor interval="60" start timeout="INFINITY" \
> on-fail="block" stop timeout="INFINITY" on-fail="block" \
> migrate_to timeout="INFINITY"
>
> I would have expected the 'stop timeout="INFINITY" on-fail="block"' to
> prevent fencing if the server failed to stop (question 1) and that if a
> resource was unmanaged, that the resource wouldn't even try to stop
> (question 2).
>
> Can someone help me understand what happened here?
The fencing reason was "srv01-test_stop_0 process (PID 113779) timed out".
Did you have a failure before your actions? The logs seem to indicate one:
"Clearing failure of srv01-test on el8-a01n02 because resource parameters
have changed"
Having the cluster in a clean state before configuring it is highly desirable,
IMHO. I use this command frequently to check: "crm_mon -1Arfj"
The logs should help to explain!
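One thing worth pulling out of the logs is the timeout that was actually applied to the failed stop. A rough sketch, assuming log lines in the format quoted below; the function name stop_timeout_ms is hypothetical:

```shell
#!/bin/sh
# Print the timeout (in ms) reported for operations that timed out.
# Reads log text on stdin and assumes the pacemaker-execd message format
# "<operation>[<pid>] timed out after <N>ms" seen in the quoted logs.
stop_timeout_ms() {
    sed -n 's/.*timed out after \([0-9][0-9]*\)ms.*/\1/p'
}

# Example using the message from the fenced node:
printf '%s\n' 'srv01-test_stop_0[113779] timed out after 20000ms' | stop_timeout_ms
```

For the quoted line this prints 20000, i.e. 20 seconds. As far as I know that is Pacemaker's default operation timeout, which would suggest the configured stop timeout="INFINITY" was no longer in effect once the resource definition had been deleted.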
Regards,
Ulrich
>
> digimer
>
> More below;
>
> ====
> [root@el8-a01n01 ~]# pcs resource remove srv01-test
> Attempting to stop: srv01-test... Warning: 'srv01-test' is unmanaged
> Error: Unable to stop: srv01-test before deleting (re-run with --force
> to force deletion)
> [root@el8-a01n01 ~]# pcs resource remove srv01-test --force
> Deleting Resource - srv01-test
> [root@el8-a01n01 ~]# client_loop: send disconnect: Broken pipe
> ====
>
> As you can see, the node was fenced. The logs on that node were;
>
> ====
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]: warning:
> srv01-test_stop_0 process (PID 113779) timed out
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]: warning:
> srv01-test_stop_0[113779] timed out after 20000ms
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]: error:
> Result of stop operation for srv01-test on el8-a01n01: Timed Out
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]: notice:
> el8-a01n01-srv01-test_stop_0:37 [ The server: [srv01-test] is indeed
> running. It will be shut down now.\n ]
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]: notice:
> Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]: notice:
> Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]: notice:
> Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]: notice:
> Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
> client_loop: send disconnect: Broken pipe
> ====
>
> On the peer node, the logs showed;
>
> ====
> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: State transition S_IDLE -> S_POLICY_ENGINE
> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Calculated transition 58, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-100.bz2
> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0,
> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-100.bz2): Complete
> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: State transition S_IDLE -> S_POLICY_ENGINE
> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Calculated transition 59, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-101.bz2
> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 59 (Complete=0, Pending=0, Fired=0, Skipped=0,
> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-101.bz2): Complete
> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: State transition S_IDLE -> S_POLICY_ENGINE
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Detected active orphan srv01-test running on el8-a01n01
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Clearing failure of srv01-test on el8-a01n02 because resource
> parameters have changed
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Removing srv01-test from el8-a01n01
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Removing srv01-test from el8-a01n02
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: * Stop srv01-test ( el8-a01n01
> ) due to node availability
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Calculated transition 60, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-102.bz2
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Initiating stop operation srv01-test_stop_0 on el8-a01n01
> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 60 aborted by deletion of
> lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 60 action 11 (srv01-test_stop_0 on el8-a01n01):
> expected 'ok' but got 'error'
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 60 (Complete=2, Pending=0, Fired=0, Skipped=0,
> Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-102.bz2): Complete
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]: notice:
> Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]: notice:
> Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Unexpected result (error) was recorded for stop of srv01-test
> on el8-a01n01 at Jan 18 02:03:35 2021
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Unexpected result (error) was recorded for stop of srv01-test
> on el8-a01n01 at Jan 18 02:03:35 2021
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Detected active orphan srv01-test running on el8-a01n01
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Scheduling Node el8-a01n01 for STONITH
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Stop of failed resource srv01-test is implicit after el8-a01n01
> is fenced
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: * Fence (reboot) el8-a01n01 'srv01-test failed there'
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: * Move virsh_node2_pulsar ( el8-a01n01 -> el8-a01n02 )
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: * Stop srv01-test ( el8-a01n01
> ) due to node availability
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Calculated transition 61 (with warnings), saving inputs in
> /var/lib/pacemaker/pengine/pe-warn-1.bz2
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Unexpected result (error) was recorded for stop of srv01-test
> on el8-a01n01 at Jan 18 02:03:35 2021
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Unexpected result (error) was recorded for stop of srv01-test
> on el8-a01n01 at Jan 18 02:03:35 2021
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Detected active orphan srv01-test running on el8-a01n01
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Forcing srv01-test away from el8-a01n01 after 1000000 failures
> (max=1000000)
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Clearing failure of srv01-test on el8-a01n01 because it is orphaned
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Scheduling Node el8-a01n01 for STONITH
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Stop of failed resource srv01-test is implicit after el8-a01n01
> is fenced
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: * Fence (reboot) el8-a01n01 'srv01-test failed there'
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: * Move virsh_node2_pulsar ( el8-a01n01 -> el8-a01n02 )
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: * Stop srv01-test ( el8-a01n01
> ) due to node availability
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> warning: Calculated transition 62 (with warnings), saving inputs in
> /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Requesting fencing (reboot) of node el8-a01n01
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Initiating start operation virsh_node2_pulsar_start_0 locally on
> el8-a01n02
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Client pacemaker-controld.490050.72911c98 wants to fence (reboot)
> 'el8-a01n01' with device '(any)'
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Requesting peer fencing (reboot) targeting el8-a01n01
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]: notice:
> Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]: notice:
> Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> virsh_node2_pulsar is not eligible to fence (reboot) el8-a01n01:
> static-list
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> virsh_node1_pulsar is eligible to fence (reboot) el8-a01n01: static-list
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 62 aborted by deletion of
> lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Requesting that el8-a01n02 perform 'reboot' action targeting el8-a01n01
> using 'virsh_node1_pulsar'
> Jan 18 02:03:56 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Result of start operation for virsh_node2_pulsar on el8-a01n02: ok
> Jan 18 02:03:57 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Operation 'reboot' [646769] (call 4 from pacemaker-controld.490050) for
> host 'el8-a01n01' with device 'virsh_node1_pulsar' returned: 0 (OK)
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]: notice:
> Node el8-a01n01 state is now lost
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]: notice:
> Removing all el8-a01n01 attributes for peer loss
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Node el8-a01n01 state is now lost
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]: notice:
> Node el8-a01n01 state is now lost
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]: notice:
> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Node el8-a01n01 state is now lost
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]: notice:
> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Action 'reboot' targeting el8-a01n01 using virsh_node1_pulsar on behalf
> of pacemaker-controld.490050@el8-a01n02: OK
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]: notice:
> Operation 'reboot' targeting el8-a01n01 on el8-a01n02 for
> pacemaker-controld.490050@el8-a01n02.8ff64dd6: OK
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Stonith operation 4/2:62:0:e827eea0-dedc-4200-a207-c4095621b3c6:
> OK (0)
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Peer el8-a01n01 was terminated (reboot) by el8-a01n02 on behalf
> of pacemaker-controld.490050: OK
> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 62 (Complete=5, Pending=0, Fired=0, Skipped=1,
> Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Removing srv01-test from el8-a01n02
> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Calculated transition 63, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-103.bz2
> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Initiating monitor operation virsh_node2_pulsar_monitor_60000
> locally on el8-a01n02
> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Initiating delete operation srv01-test_delete_0 locally on
> el8-a01n02
> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 63 aborted by deletion of
> lrm_resource[@id='srv01-test']: Resource state removal
> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Result of monitor operation for virsh_node2_pulsar on el8-a01n02:
> ok
> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 63 (Complete=2, Pending=0, Fired=0, Skipped=0,
> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-103.bz2): Complete
> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
> notice: Calculated transition 64, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-104.bz2
> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0,
> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2): Complete
> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> ====
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould