[ClusterLabs] Stopping a server failed and fenced, despite disabling stop timeout

Digimer lists at alteeve.ca
Mon Jan 18 14:10:55 EST 2021


On 2021-01-18 1:52 p.m., Ken Gaillot wrote:
> On Sun, 2021-01-17 at 21:11 -0500, Digimer wrote:
>> Hi all,
>>
>>   Mind the slew of questions, well into testing now and finding lots
>> of
>> issues. This one is two questions... :)
>>
>>   I set a server to be unamaged in pacemaker while the server was
>> running. Then I tried to remove the resource, and it refused saying
>> it
>> couldn't stop it, and to use '--force'. So I did, and the node got
>> fenced. Now, the resource was setup with;
>>
>> pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
>>  meta allow-migrate="true" target-role="started" \
>>  op monitor interval="60" start timeout="INFINITY" \
>>  on-fail="block" stop timeout="INFINITY" on-fail="block" \
>>  migrate_to timeout="INFINITY"
>>
>>   I would have expected the 'stop timeout="INFINITY" on-fail="block"' 
>> to
>> prevent fencing if the server failed to stop (question 1) and that if
>> a
>> resource was unmanaged, that the resource wouldn't even try to stop
>> (question 2).
> 
> It would have ... if you hadn't removed it :)

Ahahaha! OK, I see now why it happened. I should have realized this. :)

> Removing a resource means removing its configuration entry. Pacemaker
> memorizes the resource agent parameters used to start a resource, so it
> can still execute a stop when the configuration is removed. However
> currently it does not memorize any resource meta-attributes or
> operations that were configured, so those are lost when the resource is
> removed.
> 
> We maybe should remember more information about the resource than just
> its agent parameters, but it should probably be selective rather than a
> broad net.
> 
> FYI, the stop-orphan-resources property controls whether a resource
> whose configuration is removed should be stopped (defaulting to true).

Oh! That's excellent. So if I know I want to remove a resource from
pacemaker, but keep it running, I can update the resource's config to
set that, then delete the resource and the service will still keep running?

Do you know off hand what the pcs command would be to set that? No
worries if not, I can sort it out easily enough I am sure.

digimer

>>   Can someone help me understand what happened here?
>>
>> digimer
>>
>> More below;
>>
>> ====
>> [root at el8-a01n01 ~]# pcs resource remove srv01-test
>> Attempting to stop: srv01-test... Warning: 'srv01-test' is unmanaged
>> Error: Unable to stop: srv01-test before deleting (re-run with --
>> force
>> to force deletion)
>> [root at el8-a01n01 ~]# pcs resource remove srv01-test --force
>> Deleting Resource - srv01-test
> 
> Note that the above is all within pcs, which tries to explicitly stop
> the resource before removing it. That gives pcs more control over the
> process and would ensure that the meta-attributes and operations were
> still present for the stop. But as it warns, it can't do that if the
> resource is unmanaged, so then it's left to Pacemaker to stop the
> resource after the configuration is gone.
> 
> BTW stop-orphan-resources takes precedence over whether the resource is
> managed, so Pacemaker does try to stop it even though the node is still
> unmanaged.
> 
>> [root at el8-a01n01 ~]# client_loop: send disconnect: Broken pipe
>> ====
>>
>>   As you can see, the node was fenced. The logs on that node were;
>>
>> ====
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-
>> execd[1872]:  warning:
>> srv01-test_stop_0 process (PID 113779) timed out
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-
>> execd[1872]:  warning:
>> srv01-test_stop_0[113779] timed out after 20000ms
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-
>> controld[1875]:  error:
>> Result of stop operation for srv01-test on el8-a01n01: Timed Out
> 
> A stop timeout causes fencing. I'm guessing what happened in this case
> is that the stop takes longer than 20s, which is the default operation
> timeout, which has to be used because the original operation
> configuration was removed.
> 
> The main problem here is that removing a resource from the
> configuration means that its stop timeout is lost, so if it's active
> when it's removed, the stop might time out. pcs's approach of stopping
> it first (by setting target-role to Stopped) makes sense, but having
> the node unmanaged blocks that. It's a catch-22 for the cluster that
> you'd have to resolve by changing one of those conditions.
> 
> For example you could unmanage the other resources on the node and re-
> manage the node, before stopping the desired resource then removing it.
> Or, you could temporarily set a long default timeout for all operations
> (i.e. in operation defaults) so that gets used instead of 20s. Or, you
> could temporarily set stop-orphan-resources to false, remove the
> resource, then manually stop the instance.
> 
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-
>> controld[1875]:  notice:
>> el8-a01n01-srv01-test_stop_0:37 [ The server: [srv01-test] is indeed
>> running. It will be shut down now.\n ]
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) ->
>> 1610935435
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 ->
>> (unset)
>> client_loop: send disconnect: Broken pipe
>> ====
>>
>> On the peer node, the logs showed;
>>
>> ====
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 58, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-100.bz2
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-100.bz2):
>> Complete
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 59, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-101.bz2
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 59 (Complete=0, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-101.bz2):
>> Complete
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Detected active orphan srv01-test running on el8-a01n01
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Clearing failure of srv01-test on el8-a01n02 because resource
>> parameters have changed
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Removing srv01-test from el8-a01n01
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Removing srv01-test from el8-a01n02
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Stop       srv01-test             (               el8-
>> a01n01
>> )   due to node availability
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 60, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-102.bz2
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating stop operation srv01-test_stop_0 on el8-a01n01
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 60 aborted by deletion of
>> lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation
>> removal
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 60 action 11 (srv01-test_stop_0 on el8-a01n01):
>> expected 'ok' but got 'error'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 60 (Complete=2, Pending=0, Fired=0, Skipped=0,
>> Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-102.bz2):
>> Complete
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> attrd[490048]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> attrd[490048]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) ->
>> 1610935435
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-
>> test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-
>> test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Cluster node el8-a01n01 will be fenced: srv01-test failed
>> there
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Detected active orphan srv01-test running on el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Scheduling Node el8-a01n01 for STONITH
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Stop of failed resource srv01-test is implicit after el8-
>> a01n01
>> is fenced
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-
>> a01n02 )
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Stop       srv01-test             (               el8-
>> a01n01
>> )   due to node availability
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Calculated transition 61 (with warnings), saving inputs in
>> /var/lib/pacemaker/pengine/pe-warn-1.bz2
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-
>> test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-
>> test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Cluster node el8-a01n01 will be fenced: srv01-test failed
>> there
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Detected active orphan srv01-test running on el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Forcing srv01-test away from el8-a01n01 after 1000000
>> failures
>> (max=1000000)
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Clearing failure of srv01-test on el8-a01n01 because it is
>> orphaned
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Scheduling Node el8-a01n01 for STONITH
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Stop of failed resource srv01-test is implicit after el8-
>> a01n01
>> is fenced
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-
>> a01n02 )
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Stop       srv01-test             (               el8-
>> a01n01
>> )   due to node availability
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Calculated transition 62 (with warnings), saving inputs in
>> /var/lib/pacemaker/pengine/pe-warn-2.bz2
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Requesting fencing (reboot) of node el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating start operation virsh_node2_pulsar_start_0 locally
>> on
>> el8-a01n02
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Client pacemaker-controld.490050.72911c98 wants to fence (reboot)
>> 'el8-a01n01' with device '(any)'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Requesting peer fencing (reboot) targeting el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> attrd[490048]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> attrd[490048]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 ->
>> (unset)
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> virsh_node2_pulsar is not eligible to fence (reboot) el8-a01n01:
>> static-list
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> virsh_node1_pulsar is eligible to fence (reboot) el8-a01n01: static-
>> list
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 62 aborted by deletion of
>> lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation
>> removal
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Requesting that el8-a01n02 perform 'reboot' action targeting el8-
>> a01n01
>> using 'virsh_node1_pulsar'
>> Jan 18 02:03:56 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Result of start operation for virsh_node2_pulsar on el8-
>> a01n02: ok
>> Jan 18 02:03:57 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Operation 'reboot' [646769] (call 4 from pacemaker-controld.490050)
>> for
>> host 'el8-a01n01' with device 'virsh_node1_pulsar' returned: 0 (OK)
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> attrd[490048]:  notice:
>> Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> attrd[490048]:  notice:
>> Removing all el8-a01n01 attributes for peer loss
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> based[490045]:  notice:
>> Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> based[490045]:  notice:
>> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership
>> cache
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership
>> cache
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> attrd[490048]:  notice:
>> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership
>> cache
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Action 'reboot' targeting el8-a01n01 using virsh_node1_pulsar on
>> behalf
>> of pacemaker-controld.490050 at el8-a01n02: OK
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-
>> fenced[490046]:  notice:
>> Operation 'reboot' targeting el8-a01n01 on el8-a01n02 for
>> pacemaker-controld.490050 at el8-a01n02.8ff64dd6: OK
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Stonith operation 4/2:62:0:e827eea0-dedc-4200-a207-
>> c4095621b3c6:
>> OK (0)
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Peer el8-a01n01 was terminated (reboot) by el8-a01n02 on
>> behalf
>> of pacemaker-controld.490050: OK
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 62 (Complete=5, Pending=0, Fired=0, Skipped=1,
>> Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2):
>> Stopped
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Removing srv01-test from el8-a01n02
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 63, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-103.bz2
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating monitor operation virsh_node2_pulsar_monitor_60000
>> locally on el8-a01n02
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating delete operation srv01-test_delete_0 locally on
>> el8-a01n02
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 63 aborted by deletion of
>> lrm_resource[@id='srv01-test']: Resource state removal
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Result of monitor operation for virsh_node2_pulsar on el8-
>> a01n02: ok
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 63 (Complete=2, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-103.bz2):
>> Complete
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 64, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-104.bz2
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2):
>> Complete
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>> ====
>>


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


More information about the Users mailing list