[ClusterLabs] Pacemaker failover failure

alex austin alexixalex at gmail.com
Tue Jul 14 23:31:45 UTC 2015


Unfortunately I have nothing yet ...

There's something I don't quite understand, though. What's the role of
stonith if the other machine crashes unexpectedly and completely uncleanly?
Is it to reboot that machine and re-form the cluster, thus making the DRBD
volume available again, or is it something else?

The way I see it, even if stonith is functional, the other node's DRBD
filesystem will not be accessible until the crashed node is back up. Is
this correct?
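
(For reference, the only stonith/DRBD tie-in I'm aware of is DRBD's
crm-fence-peer handler, which would presumably look roughly like this in
drbd.conf; untested on my side, and "r0" is a placeholder resource name:

resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

Is that the piece that is supposed to tie stonith and the DRBD volume
together?)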

Alex

On Thu, Jul 2, 2015 at 2:56 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> ----- Original Message -----
> > Thank you!
> >
> > However, what is proper fencing in this situation?
>
> For virtual machines, there is a fence agent called fence_virt/fence_xvm,
> but it requires a daemon to be installed and configured on the underlying
> physical machine(s). If that's not a possibility, you need some other means
> of shutting the VM down. Whoever's providing your VM might also provide an
> API to start and stop it, or if your VMs have access to some shared
> external storage, it might be possible to control it via that.
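>
> If fence_virtd does turn out to be an option, the pacemaker side is just a
> stonith resource, roughly like this (untested sketch; the host names, VM
> names and key path are placeholders for your environment):
>
>   primitive vm-fencing stonith:fence_xvm \
>           params pcmk_host_map="host1:host1-vm;host2:host2-vm" key_file="/etc/cluster/fence_xvm.key" \
>           op monitor interval=60s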
>
> > On Wed, Jul 1, 2015 at 11:30 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> >
> > > On 07/01/2015 09:39 AM, alex austin wrote:
> > > > This is what crm_mon shows
> > > >
> > > >
> > > > Last updated: Wed Jul  1 10:35:40 2015
> > > >
> > > > Last change: Wed Jul  1 09:52:46 2015
> > > >
> > > > Stack: classic openais (with plugin)
> > > >
> > > > Current DC: host2 - partition with quorum
> > > >
> > > > Version: 1.1.11-97629de
> > > >
> > > > 2 Nodes configured, 2 expected votes
> > > >
> > > > 4 Resources configured
> > > >
> > > >
> > > >
> > > > Online: [ host1 host2 ]
> > > >
> > > >
> > > > ClusterIP (ocf::heartbeat:IPaddr2): Started host2
> > > >
> > > >  Master/Slave Set: redis_clone [redis]
> > > >
> > > >      Masters: [ host2 ]
> > > >
> > > >      Slaves: [ host1 ]
> > > >
> > > > pcmk-fencing    (stonith:fence_pcmk):   Started host2
> > > >
> > > > On Wed, Jul 1, 2015 at 3:37 PM, alex austin <alexixalex at gmail.com> wrote:
> > > >
> > > >> I am running version 1.4.7 of corosync
> > >
> > > If you can't upgrade to corosync 2 (which has many improvements),
> > > you'll need to set the no-quorum-policy=ignore cluster option.
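> > >
> > > For example, with the crm shell that would be something like:
> > >
> > >   crm configure property no-quorum-policy=ignore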
> > >
> > > Proper fencing is necessary to avoid a split-brain situation, which can
> > > corrupt your data.
> > >
> > > >> On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> > > >>
> > > >>> On 07/01/2015 08:57 AM, alex austin wrote:
> > > >>>> I have now configured stonith-enabled=true. What device should I use
> > > >>>> for fencing, given that it's a virtual machine and I don't have
> > > >>>> access to its configuration? Would fence_pcmk do? If so, what
> > > >>>> parameters should I configure for it to work properly?
> > > >>>
> > > >>> No, fence_pcmk is not for use in pacemaker; it is for use in RHEL6's
> > > >>> CMAN, to redirect CMAN's fencing requests to pacemaker.
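> > > >>>
> > > >>> (In a CMAN-based cluster it appears in cluster.conf roughly like
> > > >>> this, with node names as placeholders, just for illustration:
> > > >>>
> > > >>>   <clusternode name="host1" nodeid="1">
> > > >>>     <fence>
> > > >>>       <method name="pcmk-redirect">
> > > >>>         <device name="pcmk" port="host1"/>
> > > >>>       </method>
> > > >>>     </fence>
> > > >>>   </clusternode>
> > > >>>   ...
> > > >>>   <fencedevices>
> > > >>>     <fencedevice name="pcmk" agent="fence_pcmk"/>
> > > >>>   </fencedevices>
> > > >>>
> > > >>> so it only hands the fencing request back to pacemaker rather than
> > > >>> actually powering anything off.)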
> > > >>>
> > > >>> For a virtual machine, ideally you'd use fence_virtd running on the
> > > >>> physical host, but I'm guessing from your comment that you can't do
> > > >>> that. Does whoever provides your VM also provide an API for
> > > >>> controlling it (starting/stopping/rebooting)?
> > > >>>
> > > >>> Regarding your original problem, it sounds like the surviving node
> > > >>> doesn't have quorum. What version of corosync are you using? If
> > > >>> you're using corosync 2, you need "two_node: 1" in corosync.conf,
> > > >>> in addition to configuring fencing in pacemaker.
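> > > >>>
> > > >>> With corosync 2 that would be roughly this in corosync.conf:
> > > >>>
> > > >>>   quorum {
> > > >>>       provider: corosync_votequorum
> > > >>>       two_node: 1
> > > >>>   }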
> > > >>>
> > > >>>> This is my new config:
> > > >>>>
> > > >>>>
> > > >>>> node dcwbpvmuas004.edc.nam.gm.com \
> > > >>>>
> > > >>>>         attributes standby=off
> > > >>>>
> > > >>>> node dcwbpvmuas005.edc.nam.gm.com \
> > > >>>>
> > > >>>>         attributes standby=off
> > > >>>>
> > > >>>> primitive ClusterIP IPaddr2 \
> > > >>>>
> > > >>>>         params ip=198.208.86.242 cidr_netmask=23 \
> > > >>>>
> > > >>>>         op monitor interval=1s timeout=20s \
> > > >>>>
> > > >>>>         op start interval=0 timeout=20s \
> > > >>>>
> > > >>>>         op stop interval=0 timeout=20s \
> > > >>>>
> > > >>>>         meta is-managed=true target-role=Started resource-stickiness=500
> > > >>>>
> > > >>>> primitive pcmk-fencing stonith:fence_pcmk \
> > > >>>>
> > > >>>>         params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com
> > > >>>> dcwbpvmuas005.edc.nam.gm.com" \
> > > >>>>
> > > >>>>         op monitor interval=10s \
> > > >>>>
> > > >>>>         meta target-role=Started
> > > >>>>
> > > >>>> primitive redis redis \
> > > >>>>
> > > >>>>         meta target-role=Master is-managed=true \
> > > >>>>
> > > >>>>         op monitor interval=1s role=Master timeout=5s on-fail=restart
> > > >>>>
> > > >>>> ms redis_clone redis \
> > > >>>>
> > > >>>>         meta notify=true is-managed=true ordered=false interleave=false
> > > >>>>         globally-unique=false target-role=Master migration-threshold=1
> > > >>>>
> > > >>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> > > >>>>
> > > >>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> > > >>>>
> > > >>>> colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master
> > > >>>>
> > > >>>> property cib-bootstrap-options: \
> > > >>>>
> > > >>>>         dc-version=1.1.11-97629de \
> > > >>>>
> > > >>>>         cluster-infrastructure="classic openais (with plugin)" \
> > > >>>>
> > > >>>>         expected-quorum-votes=2 \
> > > >>>>
> > > >>>>         stonith-enabled=true
> > > >>>>
> > > >>>> property redis_replication: \
> > > >>>>
> > > >>>>         redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
> > > >>>>
> > > >>>> On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
> > > >>>> alexander.nekrasov at emc.com> wrote:
> > > >>>>
> > > >>>>> stonith-enabled=false
> > > >>>>>
> > > >>>>> this might be the issue. The way peer node death is resolved, the
> > > >>>>> surviving node must call STONITH on the peer. If it’s disabled, it
> > > >>>>> might not be able to resolve the event.
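> > > >>>>>
> > > >>>>> e.g., with the crm shell (together with an actual, working fence
> > > >>>>> device configured):
> > > >>>>>
> > > >>>>>   crm configure property stonith-enabled=true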
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> Alex
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> From: alex austin [mailto:alexixalex at gmail.com]
> > > >>>>> Sent: Wednesday, July 01, 2015 9:51 AM
> > > >>>>> To: Users at clusterlabs.org
> > > >>>>> Subject: Re: [ClusterLabs] Pacemaker failover failure
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> So I noticed that if I kill redis on one node, it starts on the
> > > >>>>> other, no problem, but if I actually kill pacemaker itself on one
> > > >>>>> node, the other doesn't "sense" it, so it doesn't fail over.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Wed, Jul 1, 2015 at 12:42 PM, alex austin <alexixalex at gmail.com> wrote:
> > > >>>>>
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> I have configured a virtual IP and redis in master/slave with
> > > >>>>> corosync and pacemaker. If redis fails, then the failover is
> > > >>>>> successful, and redis gets promoted on the other node. However, if
> > > >>>>> pacemaker itself fails on the active node, the failover is not
> > > >>>>> performed. Is there anything I missed in the configuration?
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> Here's my configuration (i have hashed the ip address out):
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> node host1.com
> > > >>>>>
> > > >>>>> node host2.com
> > > >>>>>
> > > >>>>> primitive ClusterIP IPaddr2 \
> > > >>>>>
> > > >>>>> params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
> > > >>>>>
> > > >>>>> op monitor interval=1s timeout=20s \
> > > >>>>>
> > > >>>>> op start interval=0 timeout=20s \
> > > >>>>>
> > > >>>>> op stop interval=0 timeout=20s \
> > > >>>>>
> > > >>>>> meta is-managed=true target-role=Started resource-stickiness=500
> > > >>>>>
> > > >>>>> primitive redis redis \
> > > >>>>>
> > > >>>>> meta target-role=Master is-managed=true \
> > > >>>>>
> > > >>>>> op monitor interval=1s role=Master timeout=5s on-fail=restart
> > > >>>>>
> > > >>>>> ms redis_clone redis \
> > > >>>>>
> > > >>>>> meta notify=true is-managed=true ordered=false interleave=false
> > > >>>>> globally-unique=false target-role=Master migration-threshold=1
> > > >>>>>
> > > >>>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> > > >>>>>
> > > >>>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> > > >>>>>
> > > >>>>> property cib-bootstrap-options: \
> > > >>>>>
> > > >>>>> dc-version=1.1.11-97629de \
> > > >>>>>
> > > >>>>> cluster-infrastructure="classic openais (with plugin)" \
> > > >>>>>
> > > >>>>> expected-quorum-votes=2 \
> > > >>>>>
> > > >>>>> stonith-enabled=false
> > > >>>>>
> > > >>>>> property redis_replication: \
> > > >>>>>
> > > >>>>> redis_REPL_INFO=host.com
> > >
> > >
> >
>
> --
> -- Ken Gaillot <kgaillot at redhat.com>
>