[ClusterLabs] Avoiding self-fence on RA failure

Digimer lists at alteeve.ca
Tue Oct 6 23:42:26 EDT 2020


Hi all,

  While developing our program (on a cluster that is not in production),
I find that when I push broken code to a node and the RA then fails to
perform an operation, the node gets fenced (example below).

  This brings up a question:

  If a single resource fails for any reason and can't be recovered, but
other resources on the node are still operational, how can I suppress a
self-fence? I'd rather have one failed resource than have all resources
get killed (they're VMs, so restarting on the peer is ... disruptive).

  If this is a bad approach (sufficiently bad to justify hard-rebooting
other VMs that had been running on the same node), why is that? Are
there any less-bad options for this scenario?

  Obviously, I would never push untested code to a production system,
but knowing now that this is possible (losing a node with its other VMs
on an RA / code fault), I'm worried about some unintended "oops" causing
the loss of a node.

  For example, would it be possible to have the node try to live migrate
services to the other peer, before self-fencing in a scenario like this?
Are there other options / considerations I might be missing here?

Example VM config:

====
      <primitive class="ocf" id="srv07-el6" provider="alteeve" type="server">
        <instance_attributes id="srv07-el6-instance_attributes">
          <nvpair id="srv07-el6-instance_attributes-name" name="name" value="srv07-el6"/>
        </instance_attributes>
        <meta_attributes id="srv07-el6-meta_attributes">
          <nvpair id="srv07-el6-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
          <nvpair id="srv07-el6-meta_attributes-migrate_to" name="migrate_to" value="INFINITY"/>
          <nvpair id="srv07-el6-meta_attributes-stop" name="stop" value="INFINITY"/>
          <nvpair id="srv07-el6-meta_attributes-target-role" name="target-role" value="Stopped"/>
        </meta_attributes>
        <operations>
          <op id="srv07-el6-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="600"/>
          <op id="srv07-el6-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="INFINITY"/>
          <op id="srv07-el6-monitor-interval-60" interval="60" name="monitor" on-fail="block"/>
          <op id="srv07-el6-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
          <op id="srv07-el6-start-interval-0s" interval="0s" name="start" timeout="30"/>
          <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop" timeout="INFINITY"/>
        </operations>
      </primitive>
====

Logs from a code oops in the RA triggering a node self-fence:

====
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR:  syntax
error at or near "3" ]
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0
WHERE server_uuid = '3d73db4c-d... ]
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
srv07-el6_stop_0:36779:stderr [
                     ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
13791. ]
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR:  syntax
error at or near "3" ]
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0
WHERE server_uuid = '3d73db4c-d... ]
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
srv07-el6_stop_0:36779:stderr [
                     ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
13791. ]
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-controld[33819]:  notice:
Result of stop operation for srv07-el6 on mk-a02n01: 1 (error)
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-controld[33819]:  notice:
mk-a02n01-srv07-el6_stop_0:51 [ DBD::Pg::db do failed: ERROR:  syntax
error at or near "3"\nLINE 1: ...ut off, server_boot_time = 0 WHERE
server_uuid = '3d73db4c-d...\n
                   ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
13791.\nDBD::Pg::db do failed: ERROR:  syntax error at or near "3"\nLINE
1: ...ut off, server_boot_time = 0 WHERE server_uuid = '3d73db4c-d...\n
                                                            ^ at
/usr/share/p
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-attrd[33817]:  notice:
Setting fail-count-srv07-el6#stop_0[mk-a02n01]: (unset) -> INFINITY
Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-attrd[33817]:  notice:
Setting last-failure-srv07-el6#stop_0[mk-a02n01]: (unset) -> 1602041634
Connection to mk-a02n01.ifn closed by remote host.
Connection to mk-a02n01.ifn closed.
====
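
One direction that may apply here, sketched under assumptions rather than
tested: the logs above show the fence following a failed stop, and
Pacemaker's documented default for a failed stop is on-fail="fence" when
fencing is enabled. Setting on-fail="block" on the stop op (the same
value the monitor op in the config above already uses) should leave the
failed resource blocked on the node instead of fencing the node. The
trade-off is that the cluster will not recover that resource elsewhere
until the failure is cleaned up by hand. The op id and timeout are copied
from the config above; only the on-fail attribute is new.

====
          <!-- Assumption: untested sketch; only on-fail="block" is added. -->
          <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="INFINITY"/>
====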

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

