[Pacemaker] Frequent Loss of Quorum on VM Cluster

Tue Feb 12 16:32:50 EST 2013

Just to tie this off.

The cluster has been stable since I reinstalled VMware Tools on both nodes, so
the problem seems to have had nothing to do with corosync or pacemaker.

Regards,
Darren

On 7 February 2013 11:03, Darren Mansell <darren.mansell at gmail.com> wrote:

> Hi all.
>
> I've installed a Corosync/Pacemaker cluster of 2 nodes into a VMware ESX
> environment. The install uses Debian squeeze (6.0) with packages from
> squeeze-backports.
>
> These are package versions in use:
>
> corosync                            1.4.2-1~bpo60+1
> pacemaker                           1.1.7-1~bpo60+1
> ( + required packages and libs )
> ( I had to use backports to get the failure-timeout ability )
>
> I use these 2 nodes to run ldirectord and a VIP to load-balance a MS
> Exchange cluster and it works very well in the main. But about twice a day
> there are losses of quorum where the cluster will go split-brain then
> recover after about 30 seconds.
>
> I've already had to disable STONITH for this issue as it was causing long
> shoot-outs and taking a while to recover. Now with failure-timeouts and no
> STONITH it comes back fairly quickly.
>
> I've attached a hb_report from both nodes and put the cluster config
> below. Any ideas or thoughts would be most welcome.
>
> Many thanks.
> Darren
>
> crm configure show:
> node exlb01
> node exlb02
> primitive VIP1 ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="10.8.35.55" cidr_netmask="24" broadcast="10.8.35.255" \
>         op monitor interval="60" timeout="60" \
>         meta migration-threshold="2" failure-timeout="120"
> primitive ldirectord ocf:heartbeat:ldirectord \
>         params configfile="/etc/ha.d/ldirectord.cf" \
>         op monitor interval="60" timeout="60" \
>         meta migration-threshold="2" target-role="Started" failure-timeout="120"
> group lb VIP1 ldirectord \
>         meta target-role="Started"
> location l-lb-100 lb 100: exlb01
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1355878292" \
>         cluster-recheck-interval="60s"
>
> crm status:
> ============
> Last updated: Thu Feb  7 11:01:06 2013
> Last change: Wed Dec 19 01:32:40 2012
> Stack: openais
> Current DC: exlb02 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ exlb02 exlb01 ]
>
>  Resource Group: lb
>      VIP1       (ocf::heartbeat:IPaddr2):       Started exlb01
>      ldirectord (ocf::heartbeat:ldirectord):    Started exlb01
>
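
For anyone who finds this thread with a similar two-node setup, the property
section of the quoted config can be sanity-checked with a small script. This
is only a sketch: parse_properties, quorum_threshold and review are
hypothetical helper names, not part of crm or pacemaker, and the simple
majority rule (votes // 2 + 1) is an assumption about how quorum is counted
on this stack. The embedded text is a subset of the properties quoted above.

```python
import re

# Subset of the "property" stanza from the quoted `crm configure show` output.
CONFIG = '''\
property $id="cib-bootstrap-options" \\
        expected-quorum-votes="2" \\
        no-quorum-policy="ignore" \\
        stonith-enabled="false"
'''


def parse_properties(text):
    """Return name -> value pairs found in the property stanza."""
    # Join backslash-continued lines so each stanza sits on one line.
    joined = re.sub(r"\\\s*\n\s*", " ", text)
    props = {}
    for line in joined.splitlines():
        if line.startswith("property "):
            props.update(re.findall(r'([\w$-]+)="([^"]*)"', line))
    return props


def quorum_threshold(expected_votes):
    """Votes needed for a simple majority (assumed rule, see lead-in)."""
    return expected_votes // 2 + 1


def review(props):
    """Collect plain-text notes about settings that matter with two nodes."""
    notes = []
    votes = int(props.get("expected-quorum-votes", "0"))
    if votes == 2:
        notes.append(
            "2 expected votes: quorum needs %d, so a node that loses sight "
            "of its peer also loses quorum" % quorum_threshold(votes)
        )
    if props.get("no-quorum-policy") == "ignore":
        notes.append("no-quorum-policy=ignore: resources keep running without quorum")
    if props.get("stonith-enabled") == "false":
        notes.append("stonith-enabled=false: nothing fences a node during a split")
    return notes


if __name__ == "__main__":
    for note in review(parse_properties(CONFIG)):
        print("-", note)
```

With the values from this thread, review() returns three notes: one for the
two-vote quorum arithmetic, one for no-quorum-policy, and one for the
disabled STONITH. It only reads the configuration; it does not diagnose why
the nodes lost contact with each other.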