[ClusterLabs] Interface confusion

Adam Budziński budzinski.adam at gmail.com
Tue Mar 19 16:49:01 EDT 2019


Not sure I got “or (due to two_node / wait_for_all in corosync.conf) waits
until it can see the other node before doing anything” right. I mean,
according to https://access.redhat.com/solutions/1295713, “The 'two_node'
parameter sets the quorum to '1' and allows one node to remain quorate and
continue cluster operations after the second one is fenced.”
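
For context, the quorum section that 'pcs cluster setup' generates for such a
two-node cluster typically looks along these lines (a sketch, not a verbatim
copy of my corosync.conf):

quorum {
    provider: corosync_votequorum
    two_node: 1
    # wait_for_all defaults to 1 whenever two_node is set
}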



Exactly the same parameters are set for my cluster:



[root@srv1 ~]# corosync-quorumtool -s

Quorum information

------------------

Date:             Tue Mar 19 16:15:50 2019

Quorum provider:  corosync_votequorum

Nodes:            2

Node ID:          1

Ring ID:          1/464

Quorate:          Yes



Votequorum information

----------------------

Expected votes:   2

Highest expected: 2

Total votes:      2

Quorum:           1

Flags:            2Node Quorate WaitForAll



Membership information

----------------------

    Nodeid      Votes Name

         1          1 srv1cr1 (local)

         2          1 srv2cr1



I was testing fencing (I’m using fence_vmware_soap) and followed
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-stonithtest-haar
and could in each case see that when:



a)    fencing the passive node, the active node remained active*

b)    fencing the active node caused the passive node to take over the
active role*

* in all cases pacemaker and corosync were not configured to start on boot
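
For reference, the tests were triggered more or less as that guide describes,
i.e. something along the lines of the following, run from the node that
should survive:

[root@srv1 ~]# pcs stonith fence srv2cr1
[root@srv1 ~]# pcs status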







Yes, you are absolutely right regarding the fence race when both nodes shoot
at the same time, and I even tried option 3 from your list, that is, corosync
qdevice. Here I followed
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-quorumdev-haar#s2-managequorum-HAAR
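
The qdevice host itself was prepared roughly as that chapter describes,
something like the following (otrs_db is my quorum device host; exact package
names may differ per distribution):

# on the quorum device host (otrs_db)
yum install pcs corosync-qnetd
pcs qdevice setup model net --enable --start

# on both cluster nodes
yum install corosync-qdevice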



So here’s quorum runtime status before adding the qdevice:



[root@srv1 ~]# pcs quorum status

Quorum information

------------------

Date:             Tue Mar 19 16:34:12 2019

Quorum provider:  corosync_votequorum

Nodes:            2

Node ID:          1

Ring ID:          1/464

Quorate:          Yes



Votequorum information

----------------------

Expected votes:   2

Highest expected: 2

Total votes:      2

Quorum:           1

Flags:            2Node Quorate WaitForAll



Membership information

----------------------

    Nodeid      Votes    Qdevice Name

         1          1         NR srv1cr1 (local)

         2          1         NR srv2cr1





And here’s the one after adding it:



[root@srv1 ~]# pcs quorum status

Quorum information

------------------

Date:             Tue Mar 19 16:35:06 2019

Quorum provider:  corosync_votequorum

Nodes:            2

Node ID:          1

Ring ID:          1/464

Quorate:          Yes



Votequorum information

----------------------

Expected votes:   3

Highest expected: 3

Total votes:      3

Quorum:           2

Flags:            Quorate WaitForAll Qdevice



Membership information

----------------------

    Nodeid      Votes    Qdevice Name

         1          1    A,V,NMW srv1cr1 (local)

         2          1    A,V,NMW srv2cr1

         0          1            Qdevice







I got some warnings while adding the quorum device to the cluster because,
as mentioned, pacemaker and corosync are set not to start at boot on both
nodes:



[root@srv1 ~]# pcs quorum device add model net host=otrs_db
algorithm=ffsplit

Setting up qdevice certificates on nodes...

srv2cr1: Succeeded

srv1cr1: Succeeded

Enabling corosync-qdevice...

srv2cr1: not enabling corosync-qdevice: corosync is not enabled

srv1cr1: not enabling corosync-qdevice: corosync is not enabled

Sending updated corosync.conf to nodes...

srv2cr1: Succeeded

srv1cr1: Succeeded

Corosync configuration reloaded

Starting corosync-qdevice...

srv2cr1: corosync-qdevice started

srv1cr1: corosync-qdevice started
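
I suppose the qdevice link can also be checked from both ends with the status
tools shipped with the corosync-qdevice / corosync-qnetd packages, e.g.:

# on a cluster node
corosync-qdevice-tool -s

# on the quorum device host
corosync-qnetd-tool -l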







Is this a problem? What can I do now, and how can I test it?



Thank you !

On Tue, 19.03.2019 at 19:01, Ken Gaillot <kgaillot at redhat.com> wrote:

> On Tue, 2019-03-19 at 15:55 +0100, Adam Budziński wrote:
> > Hello Ken,
> >
> > Thank you.
> >
> > But if I have a two node cluster and a working fencing mechanism
> > wouldn't it be enough to disable the corosync and pacemaker service
> > on both nodes so when it fence they won't come up?
>
> There's actually no problem when a fenced node comes back. Either it
> joins the remaining cluster normally, or (due to two_node /
> wait_for_all in corosync.conf) waits until it can see the other node
> before doing anything.
>
> Disabling or enabling the services is a personal preference based on
> whether you'd rather investigate why a node was shot before letting it
> back in the cluster, or get the cluster back to full strength as
> quickly as possible.
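>
> Either policy boils down to something like the following on the cluster
> nodes:
>
>   pcs cluster enable --all    # start corosync/pacemaker on boot
>   pcs cluster disable --all   # leave them for a manual start after reboot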
>
> The fence delay is for when both nodes are running and communicating
> correctly, then suddenly they lose communication with each other. From
> each node's point of view, the other node is lost. So, each will
> attempt to fence the other. A delay on one node in this situation makes
> it less likely that they both pull the trigger at the same time, ending
> up with both nodes dead.
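>
> A rough sketch of what that could look like with fence_vmware_soap (host
> name, credentials and VM names made up here; one device per node, with a
> delay on only one of them):
>
>   pcs stonith create fence_srv1 fence_vmware_soap \
>       ipaddr=vcenter.example.com login=clusteruser passwd=secret \
>       ssl_insecure=1 pcmk_host_map="srv1cr1:VM_SRV1" delay=10
>   pcs stonith create fence_srv2 fence_vmware_soap \
>       ipaddr=vcenter.example.com login=clusteruser passwd=secret \
>       ssl_insecure=1 pcmk_host_map="srv2cr1:VM_SRV2"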
>
> > Thank you
> >
> > On Mon, 18.03.2019 at 16:19, Ken Gaillot <kgaillot at redhat.com> wrote:
> > > On Sat, 2019-03-16 at 11:10 +0100, Adam Budziński wrote:
> > > > Hello Andrei,
> > > >
> > > > Ok, I see your point. So per my understanding, if the resource is
> > > > started successfully (in that case fence_vmware_soap), it will be
> > > > monitored indefinitely, but as you said it will monitor the current
> > > > active node. So how does the fence agent get aware of problems with
> > > > the slave? I
> > >
> > > The fence agent doesn't monitor the active node, or any node -- it
> > > monitors the fence device.
> > >
> > > The cluster layer (i.e. corosync) monitors all nodes, and reports
> > > any
> > > issues to pacemaker, which will initiate fencing if necessary.
> > >
> > > Pacemaker also monitors each resource and fence device, via any
> > > recurring monitors that have been configured.
> > >
> > > > mean, if in a two node cluster the cluster splits into two
> > > > partitions, will each of them fence the other, or does that happen
> > > > because both will assume they are the only survivors and thus need to
> > > > fence the other node, which is in an unknown state so to say?
> > >
> > > If both nodes are functional but can't see each other, they will
> > > each
> > > want to initiate fencing. If one of them is quicker than the other
> > > to
> > > determine this, the other one will get shot before it has a chance
> > > to
> > > do anything itself.
> > >
> > > However there is the possibility that both nodes will shoot at
> > > about
> > > the same time, resulting in both nodes getting shot (a "stonith
> > > death
> > > match"). This is only a problem in 2-node clusters. There are a few
> > > ways around this:
> > >
> > > 1. Configure two separate fence devices, each targeting one of the
> > > nodes, and put a delay on one of them (or a random delay on both).
> > > This
> > > makes it highly unlikely that they will shoot at the same time.
> > >
> > > 2. Configure a fencing topology with a fence heuristics device plus
> > > your real device. A fence heuristics device runs some test, and
> > > refuses
> > > to shoot the other node if the test fails. For example,
> > > fence_heuristics_ping tries to ping an IP address you give it; the
> > > idea
> > > is that if a node can't ping that IP, you don't want it to survive.
> > > This ensures that only a node that passes the test can shoot (which
> > > means there still might be some cases where the nodes can both
> > > shoot
> > > each other, and cases where the cluster will freeze because neither
> > > node can see the IP).
> > >
> > > 3. Configure corosync with qdevice to provide true quorum via a
> > > third
> > > host (which doesn't participate in the cluster otherwise).
> > >
> > > 4. Use sbd with a hardware watchdog and a shared storage device as
> > > the
> > > fencing device. This is not a reliable option with VMWare, but I'm
> > > listing it for the general case.
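> > >
> > > As an illustration of option 2, assuming per-node devices like
> > > fence_srv1/fence_srv2 from the sketch further up and the
> > > fence_heuristics_ping agent from fence-agents (parameter names should
> > > be checked against the installed version; the ping target here is a
> > > made-up gateway address):
> > >
> > >   pcs stonith create ping-check fence_heuristics_ping \
> > >       ping_targets=10.116.63.1
> > >   pcs stonith level add 1 srv1cr1 ping-check,fence_srv1
> > >   pcs stonith level add 1 srv2cr1 ping-check,fence_srv2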
> > >
> > >
> > > >
> > > > Thank you and Best Regards,
> > > > Adam
> > > >
> > > > On Sat, 16.03.2019 at 07:17, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > > > > On 16.03.2019 9:01, Adam Budziński wrote:
> > > > > > Thank you Andrei. The problem is that I can see with 'pcs status'
> > > > > > that resources are running on srv2cr1, but at the same time it's
> > > > > > telling me that fence_vmware_soap is running on srv1cr1. That's
> > > > > > somewhat confusing. Could you possibly explain this?
> > > > > >
> > > > >
> > > > > Two points.
> > > > >
> > > > > It is actually logical to have the stonith agent running on a
> > > > > different node than the node with active resources - because it is
> > > > > the *other* node that will initiate fencing when the node with
> > > > > active resources fails.
> > > > >
> > > > > But even considering the above, the active (running) state of a
> > > > > fence (or stonith) agent just determines on which node the recurring
> > > > > monitor operation will be started. The actual result of this monitor
> > > > > operation has no impact on a subsequent stonith attempt and serves
> > > > > just as a warning to the administrator. When a stonith request comes,
> > > > > the agent may be used by any node where the stonith agent is not
> > > > > prohibited to run by (co-)location rules. My understanding is that
> > > > > this node is selected by the DC in the partition.
> > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > On Sat, 16.03.2019 at 05:37, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > > > > >
> > > > > >> On 16.03.2019 1:16, Adam Budziński wrote:
> > > > > >>> Hi Tomas,
> > > > > >>>
> > > > > >>> Ok, but how then does pacemaker or the fence agent know which
> > > > > >>> route to take to reach the vCenter?
> > > > > >>
> > > > > >> They do not know or care at all. It is up to your underlying
> > > > > operating
> > > > > >> system and its routing tables.
> > > > > >>
> > > > > >>> Btw. Do I have to add the stonith resource on each of the
> > > nodes
> > > > > or is it
> > > > > >>> just enough to add it on one as for other resources?
> > > > > >>
> > > > > >> If your fencing agent can (should) be able to run on any node,
> > > > > >> it should be enough to define it just once, as long as it can
> > > > > >> properly determine the "port" to use on the fencing "device" for
> > > > > >> a given node. There are cases when you may want to restrict the
> > > > > >> fencing agent to only a subset of nodes, or when you are forced
> > > > > >> to set a unique parameter for each node (consider an IPMI IP
> > > > > >> address); in those cases you would need a separate instance of
> > > > > >> the agent.
> > > > > >>
> > > > > >>> Thank you!
> > > > > >>>
> > > > > >>> On Fri, 15.03.2019 at 15:48, Tomas Jelinek <tojeline at redhat.com> wrote:
> > > > > >>>
> > > > > >>>> On 15. 03. 19 at 15:09, Adam Budziński wrote:
> > > > > >>>>> Hello Tomas,
> > > > > >>>>>
> > > > > >>>>> Thank you! So far I need to say how great this community is;
> > > > > >>>>> I would never have expected so many positive vibes! A big
> > > > > >>>>> thank you, you're doing a great job!
> > > > > >>>>>
> > > > > >>>>> Now let's talk business :)
> > > > > >>>>>
> > > > > >>>>> So if pcsd is using ring0 and it fails will ring1 not be
> > > used
> > > > > at all?
> > > > > >>>>
> > > > > >>>> Pcs and pcsd never use ring1, but they are just tools for
> > > > > managing
> > > > > >>>> clusters. You can have a perfectly functioning cluster
> > > without
> > > > > pcs and
> > > > > >>>> pcsd running or even installed, it would be just more
> > > > > complicated to set
> > > > > >>>> it up and manage it.
> > > > > >>>>
> > > > > >>>> Even if ring0 fails, you will be able to use pcs (in
> > > somehow
> > > > > limited
> > > > > >>>> manner) as most of its commands don't go through network
> > > > > anyway.
> > > > > >>>>
> > > > > >>>> Corosync, which is the actual cluster messaging layer,
> > > will of
> > > > > course
> > > > > >>>> use ring1 in case of ring0 failure.
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> So with regard to VMware, that would mean that the interface
> > > > > >>>>> should be configured with a network that can access the
> > > > > >>>>> vCenter to fence, right? But wouldn't it then use only ring0,
> > > > > >>>>> so if that fails it wouldn't switch to ring1?
> > > > > >>>>
> > > > > >>>> If you are talking about pcmk_host_map, that does not
> > > really
> > > > > have
> > > > > >>>> anything to do with network interfaces of cluster nodes.
> > > It
> > > > > maps node
> > > > > >>>> names (parts before :) to "ports" of a fence device (parts
> > > > > after :).
> > > > > >>>> Pcs-0.9.x does not support defining custom node names,
> > > > > therefore node
> > > > > >>>> names are the same as ring0 addresses.
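> > > > > >>>>
> > > > > >>>> For instance (VM names made up; use the names vCenter shows):
> > > > > >>>>   pcmk_host_map="srv1cr1:SRV1_VM_NAME;srv2cr1:SRV2_VM_NAME"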
> > > > > >>>>
> > > > > >>>> I am not an expert on fence agents / devices, but I'm sure
> > > > > someone else
> > > > > >>>> on this list will be able to help you with configuring
> > > fencing
> > > > > for your
> > > > > >>>> cluster.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Tomas
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> Thank you!
> > > > > >>>>>
> > > > > >>>>> On Fri, 15.03.2019 at 13:14, Tomas Jelinek <tojeline at redhat.com> wrote:
> > > > > >>>>>
> > > > > >>>>>     On 15. 03. 19 at 12:32, Adam Budziński wrote:
> > > > > >>>>>      > Hello Folks,
> > > > > >>>>>      >
> > > > > >>>>>      > Two node active/passive VMware VM cluster.
> > > > > >>>>>      >
> > > > > >>>>>      > /etc/hosts
> > > > > >>>>>      >
> > > > > >>>>>      > 10.116.63.83    srv1
> > > > > >>>>>      > 10.116.63.84    srv2
> > > > > >>>>>      > 172.16.21.12    srv2cr1
> > > > > >>>>>      > 172.16.22.12    srv2cr2
> > > > > >>>>>      > 172.16.21.11    srv1cr1
> > > > > >>>>>      > 172.16.22.11    srv1cr2
> > > > > >>>>>      >
> > > > > >>>>>      > I have 3 NICs on each VM:
> > > > > >>>>>      >
> > > > > >>>>>      > 10.116.63.83    srv1 and 10.116.63.84    srv2 are
> > > > > >>>>>      > networks used to access the VMs via SSH or any resource
> > > > > >>>>>      > directly if not via a VIP.
> > > > > >>>>>      >
> > > > > >>>>>      > Everything with cr in its name is used for corosync
> > > > > >>>>>      > communication, so basically I have two rings (these are
> > > > > >>>>>      > two non-routable networks just for that).
> > > > > >>>>>      >
> > > > > >>>>>      > My questions are:
> > > > > >>>>>      >
> > > > > >>>>>      > 1. With ‘pcs cluster auth’ which interface / interfaces
> > > > > >>>>>      > should I use?
> > > > > >>>>>
> > > > > >>>>>     Hi Adam,
> > > > > >>>>>
> > > > > >>>>>     I can see you are using pcs-0.9.x. In that case you
> > > > > should do:
> > > > > >>>>>     pcs cluster auth srv1cr1 srv2cr1
> > > > > >>>>>
> > > > > >>>>>     In other words, use the first address of each node.
> > > > > >>>>>     Authenticating all the other addresses should not
> > > cause
> > > > > any issues.
> > > > > >>>> It
> > > > > >>>>>     is pointless, though, as pcs only communicates via
> > > ring0
> > > > > addresses.
> > > > > >>>>>
> > > > > >>>>>      >
> > > > > >>>>>      > 2. With ‘pcs cluster setup --name’ I would use the
> > > > > >>>>>      > corosync interfaces, e.g. ‘pcs cluster setup --name
> > > > > >>>>>      > MyCluster srv1cr1,srv1cr2 srv2cr1,srv2cr2’, right?
> > > > > >>>>>
> > > > > >>>>>     Yes, that is correct.
> > > > > >>>>>
> > > > > >>>>>      >
> > > > > >>>>>      > 3. With fence_vmware_soap in
> > > > > >>>>>      > pcmk_host_map="X:VM_C;X:VM:OTRS_D", which interface
> > > > > >>>>>      > should replace X?
> > > > > >>>>>
> > > > > >>>>>     X should be replaced by node names as seen by
> > > pacemaker.
> > > > > Once you
> > > > > >>>>>     set up
> > > > > >>>>>     and start your cluster, run 'pcs status' to get (amongst
> > > > > >>>>>     other info) the node names. In your configuration, they
> > > > > >>>>>     should be srv1cr1 and srv2cr1.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>     Regards,
> > > > > >>>>>     Tomas
> > > > > >>>>>
> > > > > >>>>>      >
> > > > > >>>>>      > Thank you!
> > > > > >>>>>      >
> > > > > >>>>>      >
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/