[ClusterLabs] Stopping a server failed and fenced, despite disabling stop timeout

Digimer lists at alteeve.ca
Sun Jan 17 21:11:16 EST 2021


Hi all,

  Pardon the slew of questions; I'm well into testing now and finding
lots of issues. This one is two questions... :)

  I set a server to be unmanaged in pacemaker while the server was
running. Then I tried to remove the resource, and it refused, saying it
couldn't stop it, and to use '--force'. So I did, and the node got
fenced. Now, the resource was set up with:

pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
 meta allow-migrate="true" target-role="started" \
 op monitor interval="60" start timeout="INFINITY" \
 on-fail="block" stop timeout="INFINITY" on-fail="block" \
 migrate_to timeout="INFINITY"

  I would have expected the 'stop timeout="INFINITY" on-fail="block"' to
prevent fencing if the server failed to stop (question 1), and that, if a
resource was unmanaged, the cluster wouldn't even try to stop it
(question 2).

  Can someone help me understand what happened here?

digimer

More below;

====
[root@el8-a01n01 ~]# pcs resource remove srv01-test
Attempting to stop: srv01-test... Warning: 'srv01-test' is unmanaged
Error: Unable to stop: srv01-test before deleting (re-run with --force
to force deletion)
[root@el8-a01n01 ~]# pcs resource remove srv01-test --force
Deleting Resource - srv01-test
[root@el8-a01n01 ~]# client_loop: send disconnect: Broken pipe
====

  As you can see, the node was fenced. The logs on that node were:

====
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]:  warning:
srv01-test_stop_0 process (PID 113779) timed out
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]:  warning:
srv01-test_stop_0[113779] timed out after 20000ms
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]:  error:
Result of stop operation for srv01-test on el8-a01n01: Timed Out
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]:  notice:
el8-a01n01-srv01-test_stop_0:37 [ The server: [srv01-test] is indeed
running. It will be shut down now.\n ]
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
client_loop: send disconnect: Broken pipe
====
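
One thing that stands out in the log above is that the stop ran with a
20000ms timeout, not the INFINITY that was configured. A minimal sketch
for pulling that out of journal text is below; the function name
'timed_out_ops' is made up for illustration, and it assumes the
pacemaker-execd message format matches the lines quoted in this mail.

```shell
# Hypothetical helper: read log text on stdin and print each operation
# that pacemaker-execd reported as timed out, with the timeout it hit.
timed_out_ops() {
    # Match only the "<op>[<pid>] timed out after <N>ms" fragment, so it
    # works whether or not the log lines have been wrapped by the mailer.
    grep -o '[^ ]*\[[0-9]*\] timed out after [0-9]*ms' |
        sed 's/\[[0-9]*\] timed out after / /'
}

# Example using the two wrapped lines from the log above.
printf '%s\n' \
    'Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]:  warning:' \
    'srv01-test_stop_0[113779] timed out after 20000ms' | timed_out_ops
```

Against a live system the same function could be fed from something
like 'journalctl -u pacemaker', though the unit name may differ by
distribution.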

On the peer node, the logs showed:

====
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 58, saving inputs in
/var/lib/pacemaker/pengine/pe-input-100.bz2
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-100.bz2): Complete
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 59, saving inputs in
/var/lib/pacemaker/pengine/pe-input-101.bz2
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 59 (Complete=0, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-101.bz2): Complete
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Detected active orphan srv01-test running on el8-a01n01
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Clearing failure of srv01-test on el8-a01n02 because resource
parameters have changed
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Removing srv01-test from el8-a01n01
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Removing srv01-test from el8-a01n02
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Stop       srv01-test             (               el8-a01n01
)   due to node availability
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 60, saving inputs in
/var/lib/pacemaker/pengine/pe-input-102.bz2
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating stop operation srv01-test_stop_0 on el8-a01n01
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 60 aborted by deletion of
lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 60 action 11 (srv01-test_stop_0 on el8-a01n01):
expected 'ok' but got 'error'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 60 (Complete=2, Pending=0, Fired=0, Skipped=0,
Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-102.bz2): Complete
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Detected active orphan srv01-test running on el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Scheduling Node el8-a01n01 for STONITH
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Stop of failed resource srv01-test is implicit after el8-a01n01
is fenced
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-a01n02 )
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Stop       srv01-test             (               el8-a01n01
)   due to node availability
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Calculated transition 61 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-1.bz2
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Detected active orphan srv01-test running on el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Forcing srv01-test away from el8-a01n01 after 1000000 failures
(max=1000000)
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Clearing failure of srv01-test on el8-a01n01 because it is orphaned
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Scheduling Node el8-a01n01 for STONITH
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Stop of failed resource srv01-test is implicit after el8-a01n01
is fenced
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-a01n02 )
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Stop       srv01-test             (               el8-a01n01
)   due to node availability
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Calculated transition 62 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-2.bz2
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Requesting fencing (reboot) of node el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating start operation virsh_node2_pulsar_start_0 locally on
el8-a01n02
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Client pacemaker-controld.490050.72911c98 wants to fence (reboot)
'el8-a01n01' with device '(any)'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Requesting peer fencing (reboot) targeting el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
virsh_node2_pulsar is not eligible to fence (reboot) el8-a01n01: static-list
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
virsh_node1_pulsar is eligible to fence (reboot) el8-a01n01: static-list
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 62 aborted by deletion of
lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Requesting that el8-a01n02 perform 'reboot' action targeting el8-a01n01
using 'virsh_node1_pulsar'
Jan 18 02:03:56 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Result of start operation for virsh_node2_pulsar on el8-a01n02: ok
Jan 18 02:03:57 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Operation 'reboot' [646769] (call 4 from pacemaker-controld.490050) for
host 'el8-a01n01' with device 'virsh_node1_pulsar' returned: 0 (OK)
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Removing all el8-a01n01 attributes for peer loss
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]:  notice:
Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]:  notice:
Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Action 'reboot' targeting el8-a01n01 using virsh_node1_pulsar on behalf
of pacemaker-controld.490050@el8-a01n02: OK
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Operation 'reboot' targeting el8-a01n01 on el8-a01n02 for
pacemaker-controld.490050@el8-a01n02.8ff64dd6: OK
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Stonith operation 4/2:62:0:e827eea0-dedc-4200-a207-c4095621b3c6:
OK (0)
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Peer el8-a01n01 was terminated (reboot) by el8-a01n02 on behalf
of pacemaker-controld.490050: OK
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 62 (Complete=5, Pending=0, Fired=0, Skipped=1,
Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Removing srv01-test from el8-a01n02
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 63, saving inputs in
/var/lib/pacemaker/pengine/pe-input-103.bz2
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating monitor operation virsh_node2_pulsar_monitor_60000
locally on el8-a01n02
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating delete operation srv01-test_delete_0 locally on
el8-a01n02
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 63 aborted by deletion of
lrm_resource[@id='srv01-test']: Resource state removal
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Result of monitor operation for virsh_node2_pulsar on el8-a01n02: ok
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 63 (Complete=2, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-103.bz2): Complete
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 64, saving inputs in
/var/lib/pacemaker/pengine/pe-input-104.bz2
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2): Complete
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
====

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

