[ClusterLabs] Avoiding self-fence on RA failure
Digimer
lists at alteeve.ca
Wed Oct 7 02:20:22 EDT 2020
On 2020-10-07 1:49 a.m., Andrei Borzenkov wrote:
> 07.10.2020 06:42, Digimer wrote:
>> Hi all,
>>
>> While developing our program (on a test cluster, not a production one), I
>> find that when I push broken code to a node, causing the RA to fail to
>> perform an operation, the node gets fenced (example below).
>>
>> This brings up a question:
>>
>> If a single resource fails for any reason and can't be recovered, but
>> other resources on the node are still operational, how can I suppress a
>> self-fence? I'd rather one failed resource than having all resources get
>> killed (they're VMs, so restarting on the peer is ... disruptive).
>>
>> If this is a bad approach (sufficiently bad to justify hard-rebooting
>> other VMs that had been running on the same node), why is that? Are
>> there any less-bad options for this scenario?
>>
>> Obviously, I would never push untested code to a production system,
>> but knowing now that this is possible (losing a node with its other VMs
>> on an RA / code fault), I'm worried about some unintended "oops" causing
>> the loss of a node.
>>
>> For example, would it be possible to have the node try to live migrate
>> services to the other peer, before self-fencing in a scenario like this?
>> Are there other options / considerations I might be missing here?
>>
>> example VM config:
>>
>> ====
>> <primitive class="ocf" id="srv07-el6" provider="alteeve"
>> type="server">
>> <instance_attributes id="srv07-el6-instance_attributes">
>> <nvpair id="srv07-el6-instance_attributes-name" name="name"
>> value="srv07-el6"/>
>> </instance_attributes>
>> <meta_attributes id="srv07-el6-meta_attributes">
>> <nvpair id="srv07-el6-meta_attributes-allow-migrate"
>> name="allow-migrate" value="true"/>
>> <nvpair id="srv07-el6-meta_attributes-migrate_to"
>> name="migrate_to" value="INFINITY"/>
>> <nvpair id="srv07-el6-meta_attributes-stop" name="stop"
>> value="INFINITY"/>
>> <nvpair id="srv07-el6-meta_attributes-target-role"
>> name="target-role" value="Stopped"/>
>> </meta_attributes>
>> <operations>
>> <op id="srv07-el6-migrate_from-interval-0s" interval="0s"
>> name="migrate_from" timeout="600"/>
>> <op id="srv07-el6-migrate_to-interval-0s" interval="0s"
>> name="migrate_to" timeout="INFINITY"/>
>> <op id="srv07-el6-monitor-interval-60" interval="60"
>> name="monitor" on-fail="block"/>
>> <op id="srv07-el6-notify-interval-0s" interval="0s"
>> name="notify" timeout="20"/>
>> <op id="srv07-el6-start-interval-0s" interval="0s"
>> name="start" timeout="30"/>
>> <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
>> timeout="INFINITY"/>
>> </operations>
>> </primitive>
>> ====
>>
>> Logs from a code oops in the RA triggering a node self-fence:
>>
>> ====
>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice:
>> srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR: syntax
>> error at or near "3" ]
>
> By default, only a stop operation failure results in stonith; you can
> change that with the on-fail operation attribute. The only other sensible
> value would be "block".
Ah, it looks like I misunderstood how on-fail="block" works. I see that in
the CIB it was only applied to the monitor action (which I probably
don't want, as I want it to recover if a monitor fails).
I've changed the CIB to the version below; I'll see how this handles
future code oopses.
Thanks!
digimer
====
<primitive class="ocf" id="srv07-el6" provider="alteeve"
type="server">
<instance_attributes id="srv07-el6-instance_attributes">
<nvpair id="srv07-el6-instance_attributes-name" name="name"
value="srv07-el6"/>
</instance_attributes>
<meta_attributes id="srv07-el6-meta_attributes">
<nvpair id="srv07-el6-meta_attributes-allow-migrate"
name="allow-migrate" value="true"/>
<nvpair id="srv07-el6-meta_attributes-migrate_to"
name="migrate_to" value="INFINITY"/>
<nvpair id="srv07-el6-meta_attributes-stop" name="stop"
value="INFINITY"/>
<nvpair id="srv07-el6-meta_attributes-target-role"
name="target-role" value="Stopped"/>
</meta_attributes>
<operations>
<op id="srv07-el6-migrate_from-interval-0s" interval="0s"
name="migrate_from" timeout="600"/>
<op id="srv07-el6-migrate_to-interval-0s" interval="0s"
name="migrate_to" timeout="INFINITY"/>
<op id="srv07-el6-monitor-interval-60" interval="60"
name="monitor"/>
<op id="srv07-el6-notify-interval-0s" interval="0s"
name="notify" timeout="20"/>
<op id="srv07-el6-start-interval-0s" interval="0s"
name="start" on-fail="block" timeout="INFINITY"/>
<op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
on-fail="block" timeout="INFINITY"/>
</operations>
</primitive>
====
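
To double-check which operations could still escalate to fencing after an edit like the one above, a rough Python sketch along these lines may help. It is not part of Pacemaker or of the original exchange. The function names are hypothetical, and the defaults it encodes (a stop failure means "fence" when stonith is enabled and "block" otherwise; any other failure means "restart") are an assumption based on the reply quoted above and the commonly documented behaviour. It only looks at on-fail set directly on each <op> element, ignores op_defaults, and assumes one <op> per operation name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the updated primitive from this message (operations only).
CIB_PRIMITIVE = """
<primitive class="ocf" id="srv07-el6" provider="alteeve" type="server">
  <operations>
    <op id="srv07-el6-monitor-interval-60" interval="60" name="monitor"/>
    <op id="srv07-el6-start-interval-0s" interval="0s" name="start"
        on-fail="block" timeout="INFINITY"/>
    <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
        on-fail="block" timeout="INFINITY"/>
  </operations>
</primitive>
"""


def effective_on_fail(primitive_xml, stonith_enabled=True):
    """Map each operation name to its explicit or assumed-default on-fail."""
    root = ET.fromstring(primitive_xml)
    result = {}
    for op in root.iter("op"):
        name = op.get("name")
        explicit = op.get("on-fail")
        if explicit is not None:
            result[name] = explicit
        elif name == "stop":
            # Assumed default: a failed stop is escalated to fencing.
            result[name] = "fence" if stonith_enabled else "block"
        else:
            # Assumed default for every other operation.
            result[name] = "restart"
    return result


def ops_that_may_fence(primitive_xml, stonith_enabled=True):
    """Return the sorted names of operations whose failure would fence."""
    actions = effective_on_fail(primitive_xml, stonith_enabled)
    return sorted(name for name, action in actions.items() if action == "fence")


if __name__ == "__main__":
    for name, action in effective_on_fail(CIB_PRIMITIVE).items():
        print(f"{name}: on-fail={action}")
    print("may fence:", ops_that_may_fence(CIB_PRIMITIVE))
```

Feeding it the earlier config, where on-fail="block" was set only on the monitor, would flag "stop" as the operation that can still trigger a fence.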
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould