[ClusterLabs] Pacemaker failover failure

Wed Jul 1 18:30:14 EDT 2015

On 07/01/2015 09:39 AM, alex austin wrote:
> This is what crm_mon shows:
>
> Last updated: Wed Jul  1 10:35:40 2015
> Last change: Wed Jul  1 09:52:46 2015
> Stack: classic openais (with plugin)
> Current DC: host2 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 2 expected votes
> 4 Resources configured
>
> Online: [ host1 host2 ]
>
> ClusterIP (ocf::heartbeat:IPaddr2): Started host2
>  Master/Slave Set: redis_clone [redis]
>      Masters: [ host2 ]
>      Slaves: [ host1 ]
> pcmk-fencing    (stonith:fence_pcmk):   Started host2
> 
> On Wed, Jul 1, 2015 at 3:37 PM, alex austin <alexixalex at gmail.com> wrote:
> 
>> I am running version 1.4.7 of corosync

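If you can't upgrade to corosync 2 (which has many improvements), you'll
need to set the no-quorum-policy=ignore cluster option. With corosync 1,
a two-node cluster loses quorum as soon as one node goes away, and by
default pacemaker will not run resources in a partition without quorum.

Once quorum is ignored, proper fencing is what protects you from a
split-brain situation, which can corrupt your data.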
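
For reference, here is a minimal sketch of setting that option with the
crm shell, which your configuration listing suggests you are already
using. It assumes crmsh is installed and that you run it as root on one
of the cluster nodes; treat it as an illustration, not a tested recipe
for your cluster:

```shell
# Keep managing resources in a partition that has lost quorum
# (the usual workaround for a two-node cluster on corosync 1.x).
crm configure property no-quorum-policy=ignore

# Confirm the property is now part of the cluster configuration.
crm configure show | grep no-quorum-policy
```

The second command should show the property if the change was
committed. This only addresses quorum; it does not replace a working
fence device.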

>> On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>>> On 07/01/2015 08:57 AM, alex austin wrote:
>>>> I have now configured stonith-enabled=true. What device should I use for
>>>> fencing, given that it's a virtual machine but I don't have access
>>>> to its configuration? Would fence_pcmk do? If so, what parameters
>>>> should I configure for it to work properly?
>>>
>>> No, fence_pcmk is not meant for use in pacemaker itself; it is for
>>> RHEL6's CMAN, to redirect its fencing requests to pacemaker.
>>>
>>> For a virtual machine, ideally you'd use fence_virtd running on the
>>> physical host, but I'm guessing from your comment that you can't do
>>> that. Does whoever provides your VM also provide an API for controlling
>>> it (starting/stopping/rebooting)?
>>>
>>> Regarding your original problem, it sounds like the surviving node
>>> doesn't have quorum. What version of corosync are you using? If you're
>>> using corosync 2, you need "two_node: 1" in corosync.conf, in addition
>>> to configuring fencing in pacemaker.
>>>
>>>> This is my new config:
>>>>
>>>> node dcwbpvmuas004.edc.nam.gm.com \
>>>>         attributes standby=off
>>>> node dcwbpvmuas005.edc.nam.gm.com \
>>>>         attributes standby=off
>>>> primitive ClusterIP IPaddr2 \
>>>>         params ip=198.208.86.242 cidr_netmask=23 \
>>>>         op monitor interval=1s timeout=20s \
>>>>         op start interval=0 timeout=20s \
>>>>         op stop interval=0 timeout=20s \
>>>>         meta is-managed=true target-role=Started resource-stickiness=500
>>>> primitive pcmk-fencing stonith:fence_pcmk \
>>>>         params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com dcwbpvmuas005.edc.nam.gm.com" \
>>>>         op monitor interval=10s \
>>>>         meta target-role=Started
>>>> primitive redis redis \
>>>>         meta target-role=Master is-managed=true \
>>>>         op monitor interval=1s role=Master timeout=5s on-fail=restart
>>>> ms redis_clone redis \
>>>>         meta notify=true is-managed=true ordered=false interleave=false \
>>>>         globally-unique=false target-role=Master migration-threshold=1
>>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
>>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
>>>> colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master
>>>> property cib-bootstrap-options: \
>>>>         dc-version=1.1.11-97629de \
>>>>         cluster-infrastructure="classic openais (with plugin)" \
>>>>         expected-quorum-votes=2 \
>>>>         stonith-enabled=true
>>>> property redis_replication: \
>>>>         redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
>>>>
>>>> On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
>>>> alexander.nekrasov at emc.com> wrote:
>>>>
>>>>> stonith-enabled=false
>>>>>
>>>>> This might be the issue. The way peer node death is resolved, the
>>>>> surviving node must call STONITH on the peer. If it’s disabled, it
>>>>> might not be able to resolve the event.
>>>>>
>>>>> Alex
>>>>>
>>>>> *From:* alex austin [mailto:alexixalex at gmail.com]
>>>>> *Sent:* Wednesday, July 01, 2015 9:51 AM
>>>>> *To:* Users at clusterlabs.org
>>>>> *Subject:* Re: [ClusterLabs] Pacemaker failover failure
>>>>>
>>>>> So I noticed that if I kill redis on one node, it starts on the other,
>>>>> no problem, but if I actually kill pacemaker itself on one node, the
>>>>> other doesn't "sense" it, so it doesn't fail over.
>>>>>
>>>>> On Wed, Jul 1, 2015 at 12:42 PM, alex austin <alexixalex at gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I have configured a virtual IP and redis in master-slave mode with
>>>>> corosync and pacemaker. If redis fails, the failover succeeds and redis
>>>>> gets promoted on the other node. However, if pacemaker itself fails on
>>>>> the active node, the failover is not performed. Is there anything I
>>>>> missed in the configuration?
>>>>>
>>>>> Here's my configuration (I have masked the IP address):
>>>>>
>>>>> node host1.com
>>>>> node host2.com
>>>>> primitive ClusterIP IPaddr2 \
>>>>>         params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
>>>>>         op monitor interval=1s timeout=20s \
>>>>>         op start interval=0 timeout=20s \
>>>>>         op stop interval=0 timeout=20s \
>>>>>         meta is-managed=true target-role=Started resource-stickiness=500
>>>>> primitive redis redis \
>>>>>         meta target-role=Master is-managed=true \
>>>>>         op monitor interval=1s role=Master timeout=5s on-fail=restart
>>>>> ms redis_clone redis \
>>>>>         meta notify=true is-managed=true ordered=false interleave=false \
>>>>>         globally-unique=false target-role=Master migration-threshold=1
>>>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
>>>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
>>>>> property cib-bootstrap-options: \
>>>>>         dc-version=1.1.11-97629de \
>>>>>         cluster-infrastructure="classic openais (with plugin)" \
>>>>>         expected-quorum-votes=2 \
>>>>>         stonith-enabled=false
>>>>> property redis_replication: \
>>>>>         redis_REPL_INFO=host.com