<div dir="ltr">Unfortunately I have nothing yet ... <div><br></div><div>There's something I don't quite understand, though. What is the role of stonith if the other machine crashes unexpectedly and uncleanly? Is it to reboot the machine and recreate the cluster, thus making the drbd volume available again? Or is it something else? </div><div><br></div><div>The way I see it, even if stonith is functional, the other node's drbd filesystem will not be accessible until the crashed node is back up. Is this correct?</div><div><br></div><div>Alex</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jul 2, 2015 at 2:56 PM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">----- Original Message -----<br>

> Thank you!<br>

><br>

> However, what is proper fencing in this situation?<br>

<br>

</span>For virtual machines, there is a fence agent called fence_virt/fence_xvm, but<br>

it requires a daemon to be installed and configured on the underlying<br>

physical machine(s). If that's not a possibility, you need some other means<br>

of shutting the VM down. Whoever's providing your VM might also provide an<br>

API to start and stop it, or if your VMs have access to some shared external<br>

storage, it might be possible to control it via that.<br>
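<br>
If the provider does offer such an API, a custom fence agent could wrap it.<br>
The following is only a rough sketch: it assumes the agent receives its<br>
options as name=value lines on stdin, that the VM name arrives in a "port"<br>
option, and that a missing action means reboot. The /vms/... paths are<br>
hypothetical and stand in for whatever the real provider API exposes.<br>

```python
# Sketch of the option handling for a custom fence agent.
# Assumptions: options arrive as name=value lines on stdin, the VM name is
# passed as "port", and the /vms/... paths belong to a hypothetical API.

def parse_fence_options(text):
    """Parse name=value lines, skipping blank lines and # comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[name.strip()] = value.strip()
    return options


def plan_request(options):
    """Map the requested action to a (method, path) pair for the API."""
    action = options.get("action", "reboot")  # assumed default action
    vm = options.get("port")
    if vm is None:
        raise ValueError("no VM name given (expected a 'port' option)")
    if action in ("off", "on", "reboot"):
        return ("POST", "/vms/" + vm + "/power/" + action)
    if action in ("status", "monitor"):
        return ("GET", "/vms/" + vm + "/power")
    raise ValueError("unsupported action: " + action)


# Example: what the agent would do when asked to power off host1.
example = "action=off\nport=host1\n"
print(plan_request(parse_fence_options(example)))
```

A real agent would also need to send the request, confirm that the VM is<br>
really off before reporting success, and answer the metadata action.<br>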

<div class="HOEnZb"><div class="h5"><br>

> On Wed, Jul 1, 2015 at 11:30 PM, Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br>

><br>

> > On 07/01/2015 09:39 AM, alex austin wrote:<br>

> > > This is what crm_mon shows<br>

> > ><br>

> > ><br>

> > > Last updated: Wed Jul  1 10:35:40 2015<br>

> > ><br>

> > > Last change: Wed Jul  1 09:52:46 2015<br>

> > ><br>

> > > Stack: classic openais (with plugin)<br>

> > ><br>

> > > Current DC: host2 - partition with quorum<br>

> > ><br>

> > > Version: 1.1.11-97629de<br>

> > ><br>

> > > 2 Nodes configured, 2 expected votes<br>

> > ><br>

> > > 4 Resources configured<br>

> > ><br>

> > ><br>

> > ><br>

> > > Online: [ host1 host2 ]<br>

> > ><br>

> > ><br>

> > > ClusterIP (ocf::heartbeat:IPaddr2): Started host2<br>

> > ><br>

> > >  Master/Slave Set: redis_clone [redis]<br>

> > ><br>

> > >      Masters: [ host2 ]<br>

> > ><br>

> > >      Slaves: [ host1 ]<br>

> > ><br>

> > > pcmk-fencing    (stonith:fence_pcmk):   Started host2<br>

> > ><br>

> > > On Wed, Jul 1, 2015 at 3:37 PM, alex austin <<a href="mailto:alexixalex@gmail.com">alexixalex@gmail.com</a>><br>

> > wrote:<br>

> > ><br>

> > >> I am running version 1.4.7 of corosync<br>

> ><br>

> > If you can't upgrade to corosync 2 (which has many improvements), you'll<br>

> > need to set the no-quorum-policy=ignore cluster option.<br>

> ><br>

> > Proper fencing is necessary to avoid a split-brain situation, which can<br>

> > corrupt your data.<br>

> ><br>

> > >> On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>><br>

> > wrote:<br>

> > >><br>

> > >>> On 07/01/2015 08:57 AM, alex austin wrote:<br>

> > >>>> I have now configured stonith-enabled=true. What device should I use<br>

> > for<br>

> > >>>> fencing given the fact that it's a virtual machine but I don't have<br>

> > >>> access<br>

> > >>>> to its configuration. Would fence_pcmk do? If so, what parameters<br>

> > >>> should I<br>

> > >>>> configure for it to work properly?<br>

> > >>><br>

> > >>> No, fence_pcmk is not meant for use within pacemaker itself; it exists so<br>

> > >>> that RHEL6's CMAN can redirect its fencing requests to pacemaker.<br>

> > >>><br>

> > >>> For a virtual machine, ideally you'd use fence_virtd running on the<br>

> > >>> physical host, but I'm guessing from your comment that you can't do<br>

> > >>> that. Does whoever provides your VM also provide an API for controlling<br>

> > >>> it (starting/stopping/rebooting)?<br>

> > >>><br>

> > >>> Regarding your original problem, it sounds like the surviving node<br>

> > >>> doesn't have quorum. What version of corosync are you using? If you're<br>

> > >>> using corosync 2, you need "two_node: 1" in corosync.conf, in addition<br>

> > >>> to configuring fencing in pacemaker.<br>

> > >>><br>

> > >>>> This is my new config:<br>

> > >>>><br>

> > >>>><br>

> > >>>> node <a href="http://dcwbpvmuas004.edc.nam.gm.com" rel="noreferrer" target="_blank">dcwbpvmuas004.edc.nam.gm.com</a> \<br>

> > >>>><br>

> > >>>>         attributes standby=off<br>

> > >>>><br>

> > >>>> node <a href="http://dcwbpvmuas005.edc.nam.gm.com" rel="noreferrer" target="_blank">dcwbpvmuas005.edc.nam.gm.com</a> \<br>

> > >>>><br>

> > >>>>         attributes standby=off<br>

> > >>>><br>

> > >>>> primitive ClusterIP IPaddr2 \<br>

> > >>>><br>

> > >>>>         params ip=198.208.86.242 cidr_netmask=23 \<br>

> > >>>><br>

> > >>>>         op monitor interval=1s timeout=20s \<br>

> > >>>><br>

> > >>>>         op start interval=0 timeout=20s \<br>

> > >>>><br>

> > >>>>         op stop interval=0 timeout=20s \<br>

> > >>>><br>

> > >>>>         meta is-managed=true target-role=Started<br>

> > resource-stickiness=500<br>

> > >>>><br>

> > >>>> primitive pcmk-fencing stonith:fence_pcmk \<br>

> > >>>><br>

> > >>>>         params pcmk_host_list="<a href="http://dcwbpvmuas004.edc.nam.gm.com" rel="noreferrer" target="_blank">dcwbpvmuas004.edc.nam.gm.com</a><br>

> > >>>> <a href="http://dcwbpvmuas005.edc.nam.gm.com" rel="noreferrer" target="_blank">dcwbpvmuas005.edc.nam.gm.com</a>" \<br>

> > >>>><br>

> > >>>>         op monitor interval=10s \<br>

> > >>>><br>

> > >>>>         meta target-role=Started<br>

> > >>>><br>

> > >>>> primitive redis redis \<br>

> > >>>><br>

> > >>>>         meta target-role=Master is-managed=true \<br>

> > >>>><br>

> > >>>>         op monitor interval=1s role=Master timeout=5s on-fail=restart<br>

> > >>>><br>

> > >>>> ms redis_clone redis \<br>

> > >>>><br>

> > >>>>         meta notify=true is-managed=true ordered=false<br>

> > interleave=false<br>

> > >>>> globally-unique=false target-role=Master migration-threshold=1<br>

> > >>>><br>

> > >>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master<br>

> > >>>><br>

> > >>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master<br>

> > >>>><br>

> > >>>> colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master<br>

> > >>>><br>

> > >>>> property cib-bootstrap-options: \<br>

> > >>>><br>

> > >>>>         dc-version=1.1.11-97629de \<br>

> > >>>><br>

> > >>>>         cluster-infrastructure="classic openais (with plugin)" \<br>

> > >>>><br>

> > >>>>         expected-quorum-votes=2 \<br>

> > >>>><br>

> > >>>>         stonith-enabled=true<br>

> > >>>><br>

> > >>>> property redis_replication: \<br>

> > >>>><br>

> > >>>>         redis_REPL_INFO=<a href="http://dcwbpvmuas005.edc.nam.gm.com" rel="noreferrer" target="_blank">dcwbpvmuas005.edc.nam.gm.com</a><br>

> > >>>><br>

> > >>>> On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <<br>

> > >>>> <a href="mailto:alexander.nekrasov@emc.com">alexander.nekrasov@emc.com</a>> wrote:<br>

> > >>>><br>

> > >>>>> stonith-enabled=false<br>

> > >>>>><br>

> > >>>>> this might be the issue. The way peer node death is resolved, the<br>

> > >>>>> surviving node must call STONITH on the peer. If it’s disabled it<br>

> > >>> might not<br>

> > >>>>> be able to resolve the event<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> Alex<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> *From:* alex austin [mailto:<a href="mailto:alexixalex@gmail.com">alexixalex@gmail.com</a>]<br>

> > >>>>> *Sent:* Wednesday, July 01, 2015 9:51 AM<br>

> > >>>>> *To:* <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>

> > >>>>> *Subject:* Re: [ClusterLabs] Pacemaker failover failure<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> So I noticed that if I kill redis on one node, it starts on the<br>

> > other,<br>

> > >>> no<br>

> > >>>>> problem, but if I actually kill pacemaker itself on one node, the<br>

> > other<br>

> > >>>>> doesn't "sense" it so it doesn't fail over.<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> On Wed, Jul 1, 2015 at 12:42 PM, alex austin <<a href="mailto:alexixalex@gmail.com">alexixalex@gmail.com</a>><br>

> > >>> wrote:<br>

> > >>>>><br>

> > >>>>> Hi all,<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> I have configured a virtual ip and redis in master-slave with<br>

> > corosync<br>

> > >>>>> pacemaker. If redis fails, then the failover is successful, and redis<br>

> > >>> gets<br>

> > >>>>> promoted on the other node. However if pacemaker itself fails on the<br>

> > >>> active<br>

> > >>>>> node, the failover is not performed. Is there anything I missed in<br>

> > the<br>

> > >>>>> configuration?<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> Here's my configuration (I have masked out the IP address):<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> node <a href="http://host1.com" rel="noreferrer" target="_blank">host1.com</a><br>

> > >>>>><br>

> > >>>>> node <a href="http://host2.com" rel="noreferrer" target="_blank">host2.com</a><br>

> > >>>>><br>

> > >>>>> primitive ClusterIP IPaddr2 \<br>

> > >>>>><br>

> > >>>>> params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \<br>

> > >>>>><br>

> > >>>>> op monitor interval=1s timeout=20s \<br>

> > >>>>><br>

> > >>>>> op start interval=0 timeout=20s \<br>

> > >>>>><br>

> > >>>>> op stop interval=0 timeout=20s \<br>

> > >>>>><br>

> > >>>>> meta is-managed=true target-role=Started resource-stickiness=500<br>

> > >>>>><br>

> > >>>>> primitive redis redis \<br>

> > >>>>><br>

> > >>>>> meta target-role=Master is-managed=true \<br>

> > >>>>><br>

> > >>>>> op monitor interval=1s role=Master timeout=5s on-fail=restart<br>

> > >>>>><br>

> > >>>>> ms redis_clone redis \<br>

> > >>>>><br>

> > >>>>> meta notify=true is-managed=true ordered=false interleave=false<br>

> > >>>>> globally-unique=false target-role=Master migration-threshold=1<br>

> > >>>>><br>

> > >>>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master<br>

> > >>>>><br>

> > >>>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master<br>

> > >>>>><br>

> > >>>>> property cib-bootstrap-options: \<br>

> > >>>>><br>

> > >>>>> dc-version=1.1.11-97629de \<br>

> > >>>>><br>

> > >>>>> cluster-infrastructure="classic openais (with plugin)" \<br>

> > >>>>><br>

> > >>>>> expected-quorum-votes=2 \<br>

> > >>>>><br>

> > >>>>> stonith-enabled=false<br>

> > >>>>><br>

> > >>>>> property redis_replication: \<br>

> > >>>>><br>

> > >>>>> redis_REPL_INFO=<a href="http://host.com" rel="noreferrer" target="_blank">host.com</a><br>

> ><br>

> ><br>

><br>

<br>

</div></div><span class="HOEnZb"><font color="#888888">--<br>

-- Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>><br>

</font></span></blockquote></div><br></div>