[ClusterLabs] fence_mpath and failed IP

Tue Mar 31 01:56:35 EDT 2020

31.03.2020 05:56, Ken Gaillot wrote:
> On Sat, 2020-02-22 at 03:50 +0200, Strahil Nikolov wrote:
>> Hello community,
>>
>> Recently I have started playing with fence_mpath and I have noticed
>> that when the node is fenced, the node is kicked out of the
>> cluster (corosync & pacemaker are shut down).
>>
>> Fencing works correctly, but the IP address cannot be brought up on
>> the designated 'replacement' host, because it was left on the old
>> node.
>>
>> I believe that this is a timing issue - the fenced node doesn't have
>> the time to shut down all its resources before pacemaker dies locally.
>>
>> Can someone confirm this behaviour on another distro, as I'm
>> currently testing it on RHEL7? If it is only for Red Hat, I can open
>> a bug in Bugzilla.
>>
>> Note: There is a workaround in order to reboot the node (using
>> a symbolic link to /etc/watchdog.d) with the help of the
>> fence_scsi or the fence_mpath scripts in /usr/share/cluster.
>>
>>
>> Best Regards,
>> Strahil Nikolov
> 
> I'm not an expert with fabric fencing, but from what I understand, this is
> an inherent limitation. Cutting off the disk obviously has no effect on
> resources (like an IP) that don't require that disk.
> 
> Pacemaker 2.0.3 added a new cluster property, "fence-reaction", that
> controls what a node does when notified of its own fencing. That's
> intended for cases like this (though it is only useful if the node is
> still functioning well enough to process the notification). The default
> of "stop" is pacemaker's traditional response -- immediately stop
> pacemaker itself, which can leave resources running. Using "panic" will
> make pacemaker halt the node instead.
> 
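A minimal sketch of how the property described above could be set,
assuming Pacemaker 2.0.3 or later and either the pcs shell or the
crm_attribute tool shipped with pacemaker. The property name
"fence-reaction" and the values "stop" and "panic" come from the
paragraph above; which tool is available depends on the distribution,
so treat this as an illustration rather than a tested procedure:

```shell
# Assumption: run as root on any one cluster node; the property is cluster-wide.

# With pcs (RHEL-style tooling):
pcs property set fence-reaction=panic

# Or with the lower-level tool shipped with pacemaker:
crm_attribute --type crm_config --name fence-reaction --update panic

# Query the current value to check what is configured:
crm_attribute --type crm_config --name fence-reaction --query
```

As noted above, this only helps if the fenced node is still healthy
enough to process the notification of its own fencing.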
> In theory, the ideal solution would be to use a fencing topology to
> combine disk fencing with network access fencing via a smart switch.
> However, there is a bug with that setup.
> 

Could you elaborate or point to the bug report?

> I'm not sure what people have traditionally done about the problem.
> 
In the cases I am aware of, either there are no additional resources
(like an SAP HANA scale-out multi-node database, where there is no IP
failover - clients are aware of the topology and connect to each
individual node) or the node is completely cut off (consider clients
with LAN access only - if you cut off the network, it does not matter
whether the node is still alive).

But yes, it is very unfortunate that "stonith" and "fencing" are mixed
in the pacemaker documentation, because they are really very different
things and cannot in general be used interchangeably.