[ClusterLabs] Antw: Re: Antw: [EXT] Avoiding self-fence on RA failure
Ulrich.Windl at rz.uni-regensburg.de
Thu Oct 8 02:14:00 EDT 2020
>>> Digimer <lists at alteeve.ca> wrote on 07.10.2020 at 23:27 in message
<d8e826da-f7cc-a8c6-9793-ea73e8280aff at alteeve.ca>:
> On 2020-10-07 2:35 a.m., Ulrich Windl wrote:
>>>>> Digimer <lists at alteeve.ca> wrote on 07.10.2020 at 05:42 in message
>> <b1b2c412-1cc4-e77a-230e-a5d4423701a7 at alteeve.ca>:
>>> Hi all,
>>> While developing our program (and not being a production cluster), I
>>> find that when I push broken code to a node, causing the RA to fail to
>>> perform an operation, the node gets fenced. (example below).
>> (I see others have replied, too, but anyway)
>> Specifically it's the "stop" operation that may not fail.
>>> This brings up a question:
>>> If a single resource fails for any reason and can't be recovered, but
>>> other resources on the node are still operational, how can I suppress a
>>> self-fence? I'd rather one failed resource than having all resources get
>>> killed (they're VMs, so restarting on the peer is ... disruptive).
>> I think you can use on-fail=block (AFAIR).
>> Note: This is not a political statement for any near elections ;-)
> Indeed, and this works. I misunderstood the pcs syntax and applied the
> 'on-fail="stop"' to the monitor operation... Whoops.
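[For reference, the on-fail setting that prevents the self-fence belongs on the stop operation, not the monitor operation. A minimal sketch with pcs, assuming a hypothetical resource name srv-vm1 and a placeholder timeout:]

```shell
# Set on-fail=block on the *stop* operation so a failed stop blocks the
# resource (leaving it unmanaged) instead of escalating to node fencing.
# "srv-vm1" and the timeout value are placeholders.
pcs resource update srv-vm1 op stop on-fail=block timeout=60s

# The corresponding operation entry in the CIB looks roughly like:
#   <op id="srv-vm1-stop" name="stop" interval="0s" timeout="60s"
#       on-fail="block"/>
```

[Note that with on-fail=block the cluster stops managing the resource after a failed stop, so manual cleanup is needed before it will be recovered.]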
>>> If this is a bad approach (sufficiently bad to justify hard-rebooting
>>> other VMs that had been running on the same node), why is that? Are
>>> there any less-bad options for this scenario?
>>> Obviously, I would never push untested code to a production system,
>>> but knowing now that this is possible (losing a node with its other VMs
>>> on an RA / code fault), I'm worried about some unintended "oops" causing
>>> the loss of a node.
>>> For example, would it be possible to have the node try to live migrate
>>> services to the other peer, before self-fencing in a scenario like this?
>> As there is no guarantee that migration will succeed, migrating before
>> fencing the node could only be done with a timeout; otherwise the node
>> would hang while waiting for the migration to succeed.
> I figured as much.