[ClusterLabs] Antw: [EXT] Re: Stopping a server failed and fenced, despite disabling stop timeout

Tue Jan 19 02:36:20 EST 2021

On 2021-01-19 2:27 a.m., Ulrich Windl wrote:
>>>> Digimer <lists at alteeve.ca> wrote on 18.01.2021 at 20:08 in message
> <64c1aa75-a15a-95c3-6853-e21fc0dc8455 at alteeve.ca>:
>> On 2021-01-18 4:49 a.m., Tomas Jelinek wrote:
>>> Hi Digimer,
>>>
>>> Regarding pcs behavior:
>>>
>>> When deleting a resource, pcs first sets its target-role to Stopped,
>>> pushes the change into pacemaker and waits for the resource to stop.
>>> Once the resource stops, pcs removes the resource from CIB. If pcs
>>> simply removed the resource from CIB without stopping it first, the
>>> resource would be running as orphaned (until pacemaker stops it if
>>> configured to do so). We want to avoid that.
>>>
>>> If the resource cannot be stopped for whatever reason, pcs reports this
>>> and advises running the delete command with --force. Running 'pcs
>>> resource delete --force' skips the part where pcs sets target role and
>>> waits for the resource to stop, making pcs simply remove the resource
>>> from CIB.
>>>
>>> I agree that pcs should handle deleting unmanaged resources in a better
>>> way. We plan to address that, but it's not on top of the priority list.
>>> Our plan is actually to prevent deleting unmanaged resources (or require
>>> --force to be specified to do so) based on the following scenario:
>>>
>>> If a resource is deleted while in unmanaged state, it ends up in
>>> ORPHANED state - it is removed from CIB but still present in running
>>> configuration. This can cause various issues, e.g. when an unmanaged
>>> resource is stopped manually outside of the cluster, there might be
>>> problems with stopping the resource upon deletion (while unmanaged),
>>> which may end up with stonith being initiated - this is not desired.
>>>
>>>
>>> Regards,
>>> Tomas
>>
>> This logic makes sense. If I may propose a reason for an alternative
>> method;
>>
>> In my case, the idea I was experimenting with was to remove a running
>> server from cluster management, without actually shutting down the
>> server. This is somewhat contrived, I freely admit, but the idea of
>> taking a server out of the config entirely without shutting it down
>> could be useful in some cases.
> 
> Assuming that the server runs resources, I'd consider that to be highly
> dangerous for data consistency.
> If you want to remove the node from the cluster, why not shut down the cluster
> node first? That would stop, move or migrate any resources running there.
> Then you would not remove any resources that still run on the other node(s).
> Basically I wonder what types of resources you would remove anyway.
> Finally I would remove the node configuration from the cluster.
> 
> Regards,
> Ulrich

To be clear, this won't be a "supported" condition, but I have seen it
done and needed to do it before for odd reasons.

Separately, in our case, we have external mechanisms that prevent a
resource from running in two places. Our software, which acts as a
logic layer on top of pacemaker, tracks a server's position directly
(it doesn't rely on pacemaker's idea of location; it queries libvirtd).

Secondly, in our new system, we run DRBD on a per-server basis and keep
allow-two-primaries off, save for during a live migration. We've tested
this and it properly refuses to start a VM on another node while it's
running somewhere.
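
As a rough illustration of that kind of "is it already running somewhere
else?" check, here is a minimal sketch. It is not our actual code; the
function names, node names and data layout are all hypothetical, and it
assumes the list of running servers per node has already been collected
by asking libvirtd on each node (for instance with `virsh list --name`).

```python
# Hypothetical sketch: names and data layout are assumptions for illustration.
from typing import Dict, List, Set


def nodes_running(server: str, domains_by_node: Dict[str, Set[str]]) -> List[str]:
    """Return the sorted list of nodes whose hypervisor reports `server` as running."""
    return sorted(node for node, domains in domains_by_node.items() if server in domains)


def safe_to_start(server: str, target: str, domains_by_node: Dict[str, Set[str]]) -> bool:
    """Allow a start only if no node other than `target` already runs the server."""
    return all(node == target for node in nodes_running(server, domains_by_node))


# Made-up example state: srv01 and srv02 run on node1, nothing runs on node2.
state = {"node1": {"srv01", "srv02"}, "node2": set()}
print(safe_to_start("srv01", "node2", state))  # False: srv01 is already up on node1
print(safe_to_start("srv02", "node1", state))  # True: only the target runs it
print(safe_to_start("srv03", "node2", state))  # True: not running anywhere
```

The point is only that the decision is made from what the hypervisors
report, independent of what pacemaker believes about resource location.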

So while your concern is absolutely valid, in our specific use case, we
have additional safety systems that allow a VM to operate outside
pacemaker without risking a split-brain.

cheers

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould