[ClusterLabs] fence_mpath and failed IP

Mon Mar 30 22:56:26 EDT 2020

On Sat, 2020-02-22 at 03:50 +0200, Strahil Nikolov wrote:
> Hello community,
> 
> Recently I have started playing with fence_mpath and I have noticed
> that when a node is fenced, it is kicked out of the cluster
> (corosync & pacemaker are shut down).
> 
> Fencing works correctly, but the IP address cannot be brought up on
> the designated 'replacement' host, because it was left on the old
> node.
> 
> I believe that this is a timing issue: the fenced node doesn't have
> time to shut down all its resources before pacemaker dies locally.
> 
> Can someone confirm this behaviour on another distro, as I'm
> currently testing it on RHEL7? If it only affects Red Hat, I can open
> a bug in Bugzilla.
> 
> Note: There is a workaround to reboot the node (using a symbolic
> link to /etc/watchdog.d) with the help of the fence_scsi or
> fence_mpath scripts in /usr/share/cluster.
> 
> 
> Best Regards,
> Strahil Nikolov

I'm not an expert with fabric fencing, but from what I understand, this
is an inherent limitation. Cutting off the disk obviously has no effect
on resources (like an IP) that don't require that disk.

Pacemaker 2.0.3 added a new cluster property, "fence-reaction", that
controls what a node does when notified of its own fencing. That's
intended for cases like this (though it is only useful if the node is
still functioning well enough to process the notification). The default
of "stop" is pacemaker's traditional response -- immediately stop
pacemaker itself, which can leave resources running. Using "panic" will
make pacemaker halt the node instead.
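
As a rough sketch of how that property might be set (assuming Pacemaker
2.0.3 or later and the crm_attribute tool that ships with it; the helper
name fence_reaction_cmd is made up for illustration), the following only
prints the command so it can be reviewed before running it on a cluster
node:

```shell
# Sketch only: assumes Pacemaker >= 2.0.3, where "fence-reaction" exists.
# fence_reaction_cmd is a hypothetical helper; it prints the command
# instead of running it, so nothing changes until you execute the output.
fence_reaction_cmd() {
    case "$1" in
        stop|panic)
            # Cluster-wide properties live in the crm_config section.
            printf 'crm_attribute --type crm_config --name fence-reaction --update %s\n' "$1"
            ;;
        *)
            echo "unsupported fence-reaction value: $1" >&2
            return 1
            ;;
    esac
}

# Print the command that would switch the cluster to "panic".
fence_reaction_cmd panic
```

With pcs, "pcs property set fence-reaction=panic" should be equivalent.
Either way the property is cluster-wide, so it only needs to be set
once, from any node.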

In theory, the ideal solution would be to use a fencing topology to
combine disk fencing with network access fencing via a smart switch.
However, there is a bug with that setup.

I'm not sure what people have traditionally done about the problem.
-- 
Ken Gaillot <kgaillot at redhat.com>