[ClusterLabs] Stopping a server failed and fenced, despite disabling stop timeout

Digimer lists at alteeve.ca
Mon Jan 18 14:08:30 EST 2021


On 2021-01-18 4:49 a.m., Tomas Jelinek wrote:
> Hi Digimer,
> 
> Regarding pcs behavior:
> 
> When deleting a resource, pcs first sets its target-role to Stopped,
> pushes the change into pacemaker and waits for the resource to stop.
> Once the resource stops, pcs removes the resource from CIB. If pcs
> simply removed the resource from CIB without stopping it first, the
> resource would be running as orphaned (until pacemaker stops it if
> configured to do so). We want to avoid that.
> 
> If the resource cannot be stopped for whatever reason, pcs reports this
> and advises running the delete command with --force. Running 'pcs
> resource delete --force' skips the part where pcs sets target role and
> waits for the resource to stop, making pcs simply remove the resource
> from CIB.
> 
> I agree that pcs should handle deleting unmanaged resources in a better
> way. We plan to address that, but it's not on top of the priority list.
> Our plan is actually to prevent deleting unmanaged resources (or require
> --force to be specified to do so) based on the following scenario:
> 
> If a resource is deleted while in unmanaged state, it ends up in
> ORPHANED state - it is removed from CIB but still present in the running
> configuration. This can cause various issues, e.g. when an unmanaged
> resource is stopped manually outside of the cluster, there might be
> problems with stopping the resource upon deletion (while unmanaged),
> which may end with stonith being initiated - this is not desired.
> 
> 
> Regards,
> Tomas
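
For reference, the sequence Tomas describes corresponds roughly to the
manual steps below. This is only a sketch under assumptions: the function
names delete_stopped_first and delete_forced are hypothetical, the PCS
variable exists only so the commands can be dry-run with echo, and pcs
does the stop-and-wait internally rather than by calling
'pcs resource disable'.

```shell
#!/bin/sh
# Command to run; set PCS=echo for a dry run (an assumption of this sketch,
# not a pcs feature).
PCS=${PCS:-pcs}

# Normal path: set target-role=Stopped, wait for the stop, then remove the
# resource from the CIB.
delete_stopped_first() {
    $PCS resource disable "$1" --wait || return 1
    $PCS resource delete "$1"
}

# Forced path: skip the stop-and-wait and remove the resource from the CIB
# directly, which can leave a running resource orphaned.
delete_forced() {
    $PCS resource delete "$1" --force
}
```

With PCS=echo, 'delete_stopped_first srv01-test' only prints the two
commands it would run, so the ordering can be checked without a cluster.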

This logic makes sense. If I may propose a reason for an alternative method;

In my case, the idea I was experimenting with was to remove a running
server from cluster management, without actually shutting down the
server. This is somewhat contrived, I freely admit, but the idea of
taking a server out of the config entirely without shutting it down
could be useful in some cases.

In my case, I didn't worry about the orphaned state and the risk of it
trying to start elsewhere as there are additional safeguards in place to
prevent this (both in our software and in that DRBD is not set to
dual-primary, so the VM simply can't start elsewhere while it's running
somewhere).

Totally understand it's not a priority, but when this is addressed, some
special mechanism to say "I know this will leave it orphaned and that's
OK" would be nice to have.

digimer

> Dne 18. 01. 21 v 3:11 Digimer napsal(a):
>> Hi all,
>>
>>    Mind the slew of questions, well into testing now and finding lots of
>> issues. This one is two questions... :)
>>
>>    I set a server to be unmanaged in pacemaker while the server was
>> running. Then I tried to remove the resource, and it refused saying it
>> couldn't stop it, and to use '--force'. So I did, and the node got
>> fenced. Now, the resource was set up with;
>>
>> pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
>>   meta allow-migrate="true" target-role="started" \
>>   op monitor interval="60" start timeout="INFINITY" \
>>   on-fail="block" stop timeout="INFINITY" on-fail="block" \
>>   migrate_to timeout="INFINITY"
>>
>>    I would have expected the 'stop timeout="INFINITY" on-fail="block"' to
>> prevent fencing if the server failed to stop (question 1) and that if a
>> resource was unmanaged, that the resource wouldn't even try to stop
>> (question 2).
>>
>>    Can someone help me understand what happened here?
>>
>> digimer
>>
>> More below;
>>
>> ====
>> [root@el8-a01n01 ~]# pcs resource remove srv01-test
>> Attempting to stop: srv01-test... Warning: 'srv01-test' is unmanaged
>> Error: Unable to stop: srv01-test before deleting (re-run with --force
>> to force deletion)
>> [root@el8-a01n01 ~]# pcs resource remove srv01-test --force
>> Deleting Resource - srv01-test
>> [root@el8-a01n01 ~]# client_loop: send disconnect: Broken pipe
>> ====
>>
>>    As you can see, the node was fenced. The logs on that node were;
>>
>> ====
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]:  warning:
>> srv01-test_stop_0 process (PID 113779) timed out
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]:  warning:
>> srv01-test_stop_0[113779] timed out after 20000ms
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]:  error:
>> Result of stop operation for srv01-test on el8-a01n01: Timed Out
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]:  notice:
>> el8-a01n01-srv01-test_stop_0:37 [ The server: [srv01-test] is indeed
>> running. It will be shut down now.\n ]
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
>> Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
>> client_loop: send disconnect: Broken pipe
>> ====
>>
>> On the peer node, the logs showed;
>>
>> ====
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 58, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-100.bz2
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-100.bz2):
>> Complete
>> Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 59, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-101.bz2
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 59 (Complete=0, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-101.bz2):
>> Complete
>> Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Detected active orphan srv01-test running on el8-a01n01
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Clearing failure of srv01-test on el8-a01n02 because resource
>> parameters have changed
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Removing srv01-test from el8-a01n01
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Removing srv01-test from el8-a01n02
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Stop       srv01-test             (               el8-a01n01
>> )   due to node availability
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 60, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-102.bz2
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating stop operation srv01-test_stop_0 on el8-a01n01
>> Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 60 aborted by deletion of
>> lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 60 action 11 (srv01-test_stop_0 on el8-a01n01):
>> expected 'ok' but got 'error'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 60 (Complete=2, Pending=0, Fired=0, Skipped=0,
>> Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-102.bz2):
>> Complete
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Detected active orphan srv01-test running on el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Scheduling Node el8-a01n01 for STONITH
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Stop of failed resource srv01-test is implicit after el8-a01n01
>> is fenced
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-a01n02 )
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Stop       srv01-test             (               el8-a01n01
>> )   due to node availability
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Calculated transition 61 (with warnings), saving inputs in
>> /var/lib/pacemaker/pengine/pe-warn-1.bz2
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Unexpected result (error) was recorded for stop of srv01-test
>> on el8-a01n01 at Jan 18 02:03:35 2021
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Detected active orphan srv01-test running on el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Forcing srv01-test away from el8-a01n01 after 1000000 failures
>> (max=1000000)
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Clearing failure of srv01-test on el8-a01n01 because it is
>> orphaned
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Scheduling Node el8-a01n01 for STONITH
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Stop of failed resource srv01-test is implicit after el8-a01n01
>> is fenced
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-a01n02 )
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice:  * Stop       srv01-test             (               el8-a01n01
>> )   due to node availability
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> warning: Calculated transition 62 (with warnings), saving inputs in
>> /var/lib/pacemaker/pengine/pe-warn-2.bz2
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Requesting fencing (reboot) of node el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating start operation virsh_node2_pulsar_start_0 locally on
>> el8-a01n02
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Client pacemaker-controld.490050.72911c98 wants to fence (reboot)
>> 'el8-a01n01' with device '(any)'
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Requesting peer fencing (reboot) targeting el8-a01n01
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
>> Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
>> Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> virsh_node2_pulsar is not eligible to fence (reboot) el8-a01n01:
>> static-list
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> virsh_node1_pulsar is eligible to fence (reboot) el8-a01n01: static-list
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 62 aborted by deletion of
>> lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
>> Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Requesting that el8-a01n02 perform 'reboot' action targeting el8-a01n01
>> using 'virsh_node1_pulsar'
>> Jan 18 02:03:56 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Result of start operation for virsh_node2_pulsar on
>> el8-a01n02: ok
>> Jan 18 02:03:57 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Operation 'reboot' [646769] (call 4 from pacemaker-controld.490050) for
>> host 'el8-a01n01' with device 'virsh_node1_pulsar' returned: 0 (OK)
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
>> Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
>> Removing all el8-a01n01 attributes for peer loss
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]:  notice:
>> Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]:  notice:
>> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Node el8-a01n01 state is now lost
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
>> Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Action 'reboot' targeting el8-a01n01 using virsh_node1_pulsar on behalf
>> of pacemaker-controld.490050@el8-a01n02: OK
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
>> Operation 'reboot' targeting el8-a01n01 on el8-a01n02 for
>> pacemaker-controld.490050@el8-a01n02.8ff64dd6: OK
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Stonith operation 4/2:62:0:e827eea0-dedc-4200-a207-c4095621b3c6:
>> OK (0)
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Peer el8-a01n01 was terminated (reboot) by el8-a01n02 on behalf
>> of pacemaker-controld.490050: OK
>> Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 62 (Complete=5, Pending=0, Fired=0, Skipped=1,
>> Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Removing srv01-test from el8-a01n02
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 63, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-103.bz2
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating monitor operation virsh_node2_pulsar_monitor_60000
>> locally on el8-a01n02
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Initiating delete operation srv01-test_delete_0 locally on
>> el8-a01n02
>> Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 63 aborted by deletion of
>> lrm_resource[@id='srv01-test']: Resource state removal
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Result of monitor operation for virsh_node2_pulsar on
>> el8-a01n02: ok
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 63 (Complete=2, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-103.bz2):
>> Complete
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
>> notice: Calculated transition 64, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-104.bz2
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0,
>> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2):
>> Complete
>> Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
>> notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>> ====
>>
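
When retracing logs like the ones quoted above, the two entries that
matter most are the stop timeout that was actually applied and the
scheduler's fencing decision. A minimal sketch for pulling them out of a
saved log follows. The helper name fence_cause is hypothetical, and it
assumes each log entry sits on a single line (the entries above are
wrapped by the mail client).

```shell
#!/bin/sh
# Hypothetical helper: print the stop timeout and the fencing decision
# from a saved pacemaker log.
# $1 is the path to the log file; one log entry per line is assumed.
fence_cause() {
    grep -E 'timed out after [0-9]+ms|will be fenced' "$1"
}
```

Run against the logs above, this would surface the "timed out after
20000ms" entry from el8-a01n01 and the "will be fenced" warnings from
el8-a01n02.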


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

