[ClusterLabs] How can I prevent multiple starts of IPaddr2 in an environment using fence_mpath?
Ken Gaillot
kgaillot at redhat.com
Fri Apr 6 10:12:03 EDT 2018
On Fri, 2018-04-06 at 04:30 +0000, 飯田 雄介 wrote:
> Hi, all
> I am testing an environment that uses fence_mpath, with the following
> configuration.
>
> =======
> Stack: corosync
> Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with
> quorum
> Last updated: Fri Apr 6 13:16:20 2018
> Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on
> x3650e
>
> 2 nodes configured
> 13 resources configured
>
> Online: [ x3650e x3650f ]
>
> Full list of resources:
>
> fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
> fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
> Resource Group: grpPostgreSQLDB
> prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
> prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
> prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
> prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
> Resource Group: grpPostgreSQLIP
> prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
> Clone Set: clnDiskd1 [prmDiskd1]
> Started: [ x3650e x3650f ]
> Clone Set: clnDiskd2 [prmDiskd2]
> Started: [ x3650e x3650f ]
> Clone Set: clnPing [prmPing]
> Started: [ x3650e x3650f ]
> =======
>
> When a split-brain occurs in this environment, x3650f performs fencing
> and the resources are started on x3650f.
>
> === view of x3650e ====
> Stack: corosync
> Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition
> WITHOUT quorum
> Last updated: Fri Apr 6 13:16:36 2018
> Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on
> x3650e
>
> 2 nodes configured
> 13 resources configured
>
> Node x3650f: UNCLEAN (offline)
> Online: [ x3650e ]
>
> Full list of resources:
>
> fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
> fenceMpath-x3650f (stonith:fence_mpath): Started [ x3650e x3650f ]
> Resource Group: grpPostgreSQLDB
> prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
> prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
> prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
> prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
> Resource Group: grpPostgreSQLIP
> prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
> Clone Set: clnDiskd1 [prmDiskd1]
> prmDiskd1 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> Started: [ x3650e ]
> Clone Set: clnDiskd2 [prmDiskd2]
> prmDiskd2 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> Started: [ x3650e ]
> Clone Set: clnPing [prmPing]
> prmPing (ocf::pacemaker:ping): Started x3650f (UNCLEAN)
> Started: [ x3650e ]
>
> === view of x3650f ====
> Stack: corosync
> Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition
> WITHOUT quorum
> Last updated: Fri Apr 6 13:16:36 2018
> Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on
> x3650e
>
> 2 nodes configured
> 13 resources configured
>
> Online: [ x3650f ]
> OFFLINE: [ x3650e ]
>
> Full list of resources:
>
> fenceMpath-x3650e (stonith:fence_mpath): Started x3650f
> fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
> Resource Group: grpPostgreSQLDB
> prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650f
> prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650f
> prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650f
> prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650f
> Resource Group: grpPostgreSQLIP
> prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650f
> Clone Set: clnDiskd1 [prmDiskd1]
> Started: [ x3650f ]
> Stopped: [ x3650e ]
> Clone Set: clnDiskd2 [prmDiskd2]
> Started: [ x3650f ]
> Stopped: [ x3650e ]
> Clone Set: clnPing [prmPing]
> Started: [ x3650f ]
> Stopped: [ x3650e ]
> =======
>
> However, the IPaddr2 resource on x3650e does not stop until a pgsql
> monitor error occurs.
> Until then, IPaddr2 is active on both nodes at the same time.
>
> === view of after pgsql monitor error ===
> Stack: corosync
> Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition
> WITHOUT quorum
> Last updated: Fri Apr 6 13:16:56 2018
> Last change: Thu Mar 1 18:38:02 2018 by root via cibadmin on
> x3650e
>
> 2 nodes configured
> 13 resources configured
>
> Node x3650f: UNCLEAN (offline)
> Online: [ x3650e ]
>
> Full list of resources:
>
> fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
> fenceMpath-x3650f (stonith:fence_mpath): Started [ x3650e x3650f ]
> Resource Group: grpPostgreSQLDB
> prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
> prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
> prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
> prmApPostgreSQLDB (ocf::heartbeat:pgsql): Stopped
> Resource Group: grpPostgreSQLIP
> prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Stopped
> Clone Set: clnDiskd1 [prmDiskd1]
> prmDiskd1 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> Started: [ x3650e ]
> Clone Set: clnDiskd2 [prmDiskd2]
> prmDiskd2 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> Started: [ x3650e ]
> Clone Set: clnPing [prmPing]
> prmPing (ocf::pacemaker:ping): Started x3650f (UNCLEAN)
> Started: [ x3650e ]
>
> Node Attributes:
> * Node x3650e:
> + default_ping_set : 100
> + diskcheck_status : normal
> + diskcheck_status_internal : normal
>
> Migration Summary:
> * Node x3650e:
> prmApPostgreSQLDB: migration-threshold=1 fail-count=1 last-failure='Fri Apr 6 13:16:39 2018'
>
> Failed Actions:
> * prmApPostgreSQLDB_monitor_10000 on x3650e 'not running' (7): call=60, status=complete,
> exitreason='Configuration file /dbfp/pgdata/data/postgresql.conf doesn't exist',
> last-rc-change='Fri Apr 6 13:16:39 2018', queued=0ms, exec=0ms
> ======
>
> We regard this behavior as a problem.
> Is there a way to avoid it?
>
> Regards, Yusuke
Hi Yusuke,
One possibility would be to add network fabric fencing as well, e.g.
fence_snmp with an SNMP-capable network switch. You can then configure a
fencing topology level that uses both the storage and network devices, so
a fenced node is also cut off from the network and any IPaddr2 still
running on it can no longer conflict with the surviving node.
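As a rough, untested sketch of what that could look like with pcs: the
fenceSwitch-* resources below are hypothetical placeholders for an
SNMP-based agent such as fence_ifmib, and every value in angle brackets
depends on your switch, so check the agent's metadata before using any of
it.

  # Hypothetical per-node network fencing devices (parameters are illustrative)
  pcs stonith create fenceSwitch-x3650e fence_ifmib \
      ipaddr=<switch-ip> community=<community> port=<switch-port-of-x3650e> \
      pcmk_host_list=x3650e
  pcs stonith create fenceSwitch-x3650f fence_ifmib \
      ipaddr=<switch-ip> community=<community> port=<switch-port-of-x3650f> \
      pcmk_host_list=x3650f

  # Put the storage and network devices in the same topology level, so
  # fencing a node is only considered successful once both have succeeded
  pcs stonith level add 1 x3650e fenceMpath-x3650e,fenceSwitch-x3650e
  pcs stonith level add 1 x3650f fenceMpath-x3650f,fenceSwitch-x3650f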
The main drawback is that unfencing isn't automatic. After a fenced
node is ready to rejoin, you have to clear the block at the switch
yourself.
--
Ken Gaillot <kgaillot at redhat.com>