[ClusterLabs] Antw: [EXT] Avoiding self-fence on RA failure

Wed Oct 7 17:27:22 EDT 2020

On 2020-10-07 2:35 a.m., Ulrich Windl wrote:
>>>> Digimer <lists at alteeve.ca> wrote on 07.10.2020 at 05:42 in message
> <b1b2c412-1cc4-e77a-230e-a5d4423701a7 at alteeve.ca>:
>> Hi all,
>>
>>   While developing our program (on a non-production cluster), I
>> find that when I push broken code to a node, causing the RA to fail to
>> perform an operation, the node gets fenced (example below).
> 
> (I see others have replied, too, but anyway)
> Specifically it's the "stop" operation that may not fail.
> 
>>
>>   This brings up a question:
>>
>>   If a single resource fails for any reason and can't be recovered, but
>> other resources on the node are still operational, how can I suppress a
>> self-fence? I'd rather one failed resource than having all resources get
>> killed (they're VMs, so restarting on the peer is ... disruptive).
> 
> I think you can (on-fail=block, AFAIR).
> Note: This is not a political statement for any near elections ;-)

Indeed, and this works. I misunderstood the pcs syntax and applied
'on-fail="block"' to the monitor operation instead of the stop
operation... Whoops.
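
For the archives, a rough sketch of where the option belongs, assuming
the same op id and timeout as in the config quoted below. Only the stop
operation is shown; the rest of the <operations> block is unchanged, and
whether monitor keeps its own on-fail is a separate choice:

```xml
<!-- Sketch only: id and timeout copied from the quoted config.
     on-fail="block" on the stop op means a failed stop leaves the
     resource blocked instead of escalating to fencing the node. -->
<op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
    timeout="INFINITY" on-fail="block"/>
```

With this in place, a failed stop leaves the resource unmanaged until it
is cleaned up by hand, so the cluster cannot recover that one VM
elsewhere in the meantime.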

>>   If this is a bad approach (sufficiently bad to justify hard-rebooting
>> other VMs that had been running on the same node), why is that? Are
>> there any less-bad options for this scenario?
>>
>>   Obviously, I would never push untested code to a production system,
>> but knowing now that this is possible (losing a node with its other VMs
>> on an RA / code fault), I'm worried about some unintended "oops" causing
>> the loss of a node.
>>
>>   For example, would it be possible to have the node try to live migrate
>> services to the other peer, before self-fencing in a scenario like this?
> 
> As there is no guarantee that migration will succeed without fencing the
> node, it could only be done with a timeout; otherwise the node would hang
> while waiting for migration to succeed.

I figured as much.

>> Are there other options / considerations I might be missing here?
>>
>> example VM config:
>>
>> ====
>>       <primitive class="ocf" id="srv07-el6" provider="alteeve"
>> type="server">
>>         <instance_attributes id="srv07-el6-instance_attributes">
>>           <nvpair id="srv07-el6-instance_attributes-name" name="name"
>> value="srv07-el6"/>
>>         </instance_attributes>
>>         <meta_attributes id="srv07-el6-meta_attributes">
>>           <nvpair id="srv07-el6-meta_attributes-allow-migrate"
>> name="allow-migrate" value="true"/>
>>           <nvpair id="srv07-el6-meta_attributes-migrate_to"
>> name="migrate_to" value="INFINITY"/>
>>           <nvpair id="srv07-el6-meta_attributes-stop" name="stop"
>> value="INFINITY"/>
>>           <nvpair id="srv07-el6-meta_attributes-target-role"
>> name="target-role" value="Stopped"/>
>>         </meta_attributes>
>>         <operations>
>>           <op id="srv07-el6-migrate_from-interval-0s" interval="0s"
>> name="migrate_from" timeout="600"/>
>>           <op id="srv07-el6-migrate_to-interval-0s" interval="0s"
>> name="migrate_to" timeout="INFINITY"/>
>>           <op id="srv07-el6-monitor-interval-60" interval="60"
>> name="monitor" on-fail="block"/>
>>           <op id="srv07-el6-notify-interval-0s" interval="0s"
>> name="notify" timeout="20"/>
>>           <op id="srv07-el6-start-interval-0s" interval="0s"
>> name="start" timeout="30"/>
>>           <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
>> timeout="INFINITY"/>
>>         </operations>
>>       </primitive>
>> ====
>>
>> Logs from a code oops in the RA triggering a node self-fence;
>>
>> ====
>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
>> srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR:  syntax
>> error at or near "3" ]
>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
>> srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0
>> WHERE server_uuid = '3d73db4c-d... ]
>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
>> srv07-el6_stop_0:36779:stderr [
>>                      ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
>> 13791. ]
> 
> As I'm writing a lot of Perl code, too: Do you know about "perl -c" to
> check the syntax, BTW?
> 
> And don't forget ocf-tester. ;-)

I did not know about ocf-tester, thanks for the hint.

As for 'perl -c', the issue above was caused by a bad SQL statement; I
don't think perl can catch that. :)

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould