[ClusterLabs] How can I prevent multiple start of IPaddr2 in an environment using fence_mpath?
飯田 雄介
iidayuus at intellilink.co.jp
Wed Apr 18 01:17:32 EDT 2018
Hi, Ken
Thanks for your comment.
I agree that network fencing is a valid approach. However, it depends heavily on the network equipment, and since we do not have an SNMP-capable network switch in our environment, we cannot try it right away.
Thanks, Yusuke
> -----Original Message-----
> From: Users [mailto:users-bounces at clusterlabs.org] On Behalf Of Ken Gaillot
> Sent: Friday, April 06, 2018 11:12 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] How can I prevent multiple start of IPaddr2 in an
> environment using fence_mpath?
>
> On Fri, 2018-04-06 at 04:30 +0000, 飯田 雄介 wrote:
> > Hi, all
> > I am testing the environment using fence_mpath with the following
> > settings.
> >
> > =======
> > Stack: corosync
> > Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with
> > quorum
> > Last updated: Fri Apr 6 13:16:20 2018
> > Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Online: [ x3650e x3650f ]
> >
> > Full list of resources:
> >
> > fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
> > fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
> > Resource Group: grpPostgreSQLDB
> > prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
> > prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
> > prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
> > prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
> > Resource Group: grpPostgreSQLIP
> > prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
> > Clone Set: clnDiskd1 [prmDiskd1]
> > Started: [ x3650e x3650f ]
> > Clone Set: clnDiskd2 [prmDiskd2]
> > Started: [ x3650e x3650f ]
> > Clone Set: clnPing [prmPing]
> > Started: [ x3650e x3650f ]
> > =======
> >
> > When split-brain occurs in this environment, x3650f executes fencing
> > and the resources are started on x3650f.
> >
> > === view of x3650e ====
> > Stack: corosync
> > Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition
> > WITHOUT quorum
> > Last updated: Fri Apr 6 13:16:36 2018
> > Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Node x3650f: UNCLEAN (offline)
> > Online: [ x3650e ]
> >
> > Full list of resources:
> >
> > fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
> > fenceMpath-x3650f (stonith:fence_mpath): Started [ x3650e x3650f ]
> > Resource Group: grpPostgreSQLDB
> > prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
> > prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
> > prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
> > prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
> > Resource Group: grpPostgreSQLIP
> > prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
> > Clone Set: clnDiskd1 [prmDiskd1]
> > prmDiskd1 (ocf::pacemaker:diskd): Started x3650f
> > (UNCLEAN)
> > Started: [ x3650e ]
> > Clone Set: clnDiskd2 [prmDiskd2]
> > prmDiskd2 (ocf::pacemaker:diskd): Started x3650f
> > (UNCLEAN)
> > Started: [ x3650e ]
> > Clone Set: clnPing [prmPing]
> > prmPing (ocf::pacemaker:ping): Started x3650f (UNCLEAN)
> > Started: [ x3650e ]
> >
> > === view of x3650f ====
> > Stack: corosync
> > Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition
> > WITHOUT quorum
> > Last updated: Fri Apr 6 13:16:36 2018
> > Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Online: [ x3650f ]
> > OFFLINE: [ x3650e ]
> >
> > Full list of resources:
> >
> > fenceMpath-x3650e (stonith:fence_mpath): Started x3650f
> > fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
> > Resource Group: grpPostgreSQLDB
> > prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650f
> > prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650f
> > prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650f
> > prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650f
> > Resource Group: grpPostgreSQLIP
> > prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650f
> > Clone Set: clnDiskd1 [prmDiskd1]
> > Started: [ x3650f ]
> > Stopped: [ x3650e ]
> > Clone Set: clnDiskd2 [prmDiskd2]
> > Started: [ x3650f ]
> > Stopped: [ x3650e ]
> > Clone Set: clnPing [prmPing]
> > Started: [ x3650f ]
> > Stopped: [ x3650e ]
> > =======
> >
> > However, IPaddr2 on x3650e will not stop until a pgsql monitor error
> > occurs. Until then, IPaddr2 is temporarily active on both nodes at the
> > same time.
> >
> > === view of after pgsql monitor error ===
> > Stack: corosync
> > Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition
> > WITHOUT quorum
> > Last updated: Fri Apr 6 13:16:56 2018
> > Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on x3650e
> >
> > 2 nodes configured
> > 13 resources configured
> >
> > Node x3650f: UNCLEAN (offline)
> > Online: [ x3650e ]
> >
> > Full list of resources:
> >
> > fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
> > fenceMpath-x3650f (stonith:fence_mpath): Started [ x3650e x3650f ]
> > Resource Group: grpPostgreSQLDB
> > prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
> > prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
> > prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
> > prmApPostgreSQLDB (ocf::heartbeat:pgsql): Stopped
> > Resource Group: grpPostgreSQLIP
> > prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Stopped
> > Clone Set: clnDiskd1 [prmDiskd1]
> > prmDiskd1 (ocf::pacemaker:diskd): Started x3650f
> > (UNCLEAN)
> > Started: [ x3650e ]
> > Clone Set: clnDiskd2 [prmDiskd2]
> > prmDiskd2 (ocf::pacemaker:diskd): Started x3650f
> > (UNCLEAN)
> > Started: [ x3650e ]
> > Clone Set: clnPing [prmPing]
> > prmPing (ocf::pacemaker:ping): Started x3650f (UNCLEAN)
> > Started: [ x3650e ]
> >
> > Node Attributes:
> > * Node x3650e:
> > + default_ping_set : 100
> > + diskcheck_status : normal
> > + diskcheck_status_internal : normal
> >
> > Migration Summary:
> > * Node x3650e:
> > prmApPostgreSQLDB: migration-threshold=1 fail-count=1 last-failure='Fri Apr 6 13:16:39 2018'
> >
> > Failed Actions:
> > * prmApPostgreSQLDB_monitor_10000 on x3650e 'not running' (7):
> > call=60, status=complete, exitreason='Configuration file
> > /dbfp/pgdata/data/postgresql.conf doesn't exist',
> > last-rc-change='Fri Apr 6 13:16:39 2018', queued=0ms, exec=0ms
> > ======
> >
> > We regard this behavior as a problem.
> > Is there a way to avoid this behavior?
> >
> > Regards, Yusuke
>
> Hi Yusuke,
>
> One possibility would be to implement network fabric fencing as well, e.g.
> fence_snmp with an SNMP-capable network switch. You can make a fencing topology
> level with both the storage and network devices.
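For illustration, such a topology might be configured with pcs roughly as follows. This is only a sketch: the `fenceFabric-x3650e` device name, the `fence_ifmib` agent choice, and the switch address, community, and port values are hypothetical placeholders, not part of the original cluster.

```shell
# Hypothetical sketch: add a network-fabric fence device (SNMP-capable
# switch via fence_ifmib) alongside the existing fence_mpath device, and
# put both in one fencing topology level so that fencing a node cuts
# both its storage and its network access. All parameters are placeholders.

# Create the network-fabric fence device for node x3650e:
pcs stonith create fenceFabric-x3650e fence_ifmib \
    ip=switch.example.com community=private port=Gi0/1 \
    pcmk_host_list=x3650e

# One topology level containing both devices: both must succeed
# for the fencing of x3650e to be considered complete.
pcs stonith level add 1 x3650e fenceMpath-x3650e,fenceFabric-x3650e

# The same pattern would be repeated for x3650f with its own switch port.
```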
>
> The main drawback is that unfencing isn't automatic. After a fenced node is
> ready to rejoin, you have to clear the block at the switch yourself.
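That manual step could look roughly like the following, assuming fence_ifmib and an SNMP-managed switch; the switch address, community string, and interface name are placeholders.

```shell
# Hypothetical sketch: manually re-enable (unfence) the switch port of a
# fenced node once it is healthy again. All parameters are placeholders.
fence_ifmib --ip=switch.example.com --community=private \
    --plug=Gi0/1 --action=on

# Check the port state before letting the node rejoin the cluster:
fence_ifmib --ip=switch.example.com --community=private \
    --plug=Gi0/1 --action=status
```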
> --
> Ken Gaillot <kgaillot at redhat.com>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org