[ClusterLabs] group resources not grouped ?!?

Ken Gaillot kgaillot at redhat.com
Wed Oct 7 13:09:00 EDT 2015


On 10/07/2015 09:12 AM, zulucloud wrote:
> Hi,
> I've got a problem I don't understand; maybe someone can give me a hint.
> 
> My 2-node cluster (nodes named ali and baba) is configured to run
> MySQL, an IP for MySQL, and the filesystem resource (on the DRBD
> master) together as a GROUP. After doing some crash tests I ended up
> with the filesystem and MySQL running happily on one host (ali), and
> the related IP on the other (baba)... although the IP isn't really up
> and running; crm_mon just SHOWS it as started there. In fact it isn't
> up anywhere, neither on ali nor on baba.
> 
> crm_mon shows that Pacemaker tried to start it on baba, but gave up
> after fail-count=1000000.
> 
> Q1: Why doesn't Pacemaker put the IP on ali, where all the rest of its
> group lives?
> Q2: Why doesn't Pacemaker try to start the IP on ali after the max
> failcount was reached on baba?
> Q3: Why does crm_mon show the IP as "started" when it's down after
> 1000000 tries?
> 
> Thanks :)
> 
> 
> config (some parts removed):
> -------------------------------
> node ali
> node baba
> 
> primitive res_drbd ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op stop interval="0" timeout="100" \
>     op start interval="0" timeout="240" \
>     op promote interval="0" timeout="90" \
>     op demote interval="0" timeout="90" \
>     op notify interval="0" timeout="90" \
>     op monitor interval="40" role="Slave" timeout="20" \
>     op monitor interval="20" role="Master" timeout="20"
> primitive res_fs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/drbd_mnt" fstype="ext4" \
>     op monitor interval="30s"
> primitive res_hamysql_ip ocf:heartbeat:IPaddr2 \
>     params ip="XXX.XXX.XXX.224" nic="eth0" cidr_netmask="23" \
>     op monitor interval="10s" timeout="20s" depth="0"
> primitive res_mysql lsb:mysql \
>     op start interval="0" timeout="15" \
>     op stop interval="0" timeout="15" \
>     op monitor start-delay="30" interval="15" timeout="15"
> 
> group gr_mysqlgroup res_fs res_mysql res_hamysql_ip \
>     meta target-role="Started"
> ms ms_drbd res_drbd \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>     clone-node-max="1" notify="true"
> 
> colocation col_fs_on_drbd_master inf: res_fs:Started ms_drbd:Master
> 
> order ord_drbd_master_then_fs inf: ms_drbd:promote res_fs:start
> 
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>     cluster-infrastructure="openais" \
>     stonith-enabled="false" \

Not having stonith is part of the problem (see below).

Without stonith, if the two nodes go into split brain (both up but can't
communicate with each other), Pacemaker will try to promote DRBD to
master on both nodes, mount the filesystem on both nodes, and start
MySQL on both nodes.
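For example, a minimal sketch in crm shell syntax, assuming the nodes
have IPMI management boards reachable out-of-band (the agent choice and
every device parameter below are illustrative placeholders, not taken
from your setup):

    primitive st_ali stonith:external/ipmi \
        params hostname="ali" ipaddr="XXX.XXX.XXX.10" \
            userid="admin" passwd="secret" interface="lan"
    primitive st_baba stonith:external/ipmi \
        params hostname="baba" ipaddr="XXX.XXX.XXX.11" \
            userid="admin" passwd="secret" interface="lan"
    # a fencing device must not run on the node it is meant to fence
    location loc_st_ali st_ali -inf: ali
    location loc_st_baba st_baba -inf: baba
    property stonith-enabled="true"

With something like that in place, whichever node wins the fencing race
in a split brain powers the other off before taking over the resources.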

>     no-quorum-policy="ignore" \
>     expected-quorum-votes="2" \
>     last-lrm-refresh="1438857246"
> 
> 
> crm_mon -rnf (some parts removed):
> ---------------------------------
> Node ali: online
>         res_fs  (ocf::heartbeat:Filesystem) Started
>         res_mysql       (lsb:mysql) Started
>         res_drbd:0      (ocf::linbit:drbd) Master
> Node baba: online
>         res_hamysql_ip  (ocf::heartbeat:IPaddr2) Started
>         res_drbd:1      (ocf::linbit:drbd) Slave
> 
> Inactive resources:
> 
> Migration summary:
> 
> * Node baba:
>    res_hamysql_ip: migration-threshold=1000000 fail-count=1000000
> 
> Failed actions:
>     res_hamysql_ip_stop_0 (node=a891vl107s, call=35, rc=1,
> status=complete): unknown error

The "_stop_" above means that a *stop* action on the IP failed.
Pacemaker tried to migrate the IP by first stopping it on baba, but it
couldn't. (Since the IP is the last member of the group, its failure
didn't prevent the other members from moving.)

Normally, when a stop fails, Pacemaker fences the node so it can safely
bring up the resource on the other node. But you disabled stonith, so it
got into this state.
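(With stonith disabled, Pacemaker falls back to the stop operation's
on-fail="block" default: the failed resource is marked unmanaged and
left where it is, which is exactly what the "unmanaged failed resources"
warnings in your log below are about. Just to show where that knob
lives, a sketch with the default spelled out explicitly on your IP
resource; the stop timeout value here is made up:

    primitive res_hamysql_ip ocf:heartbeat:IPaddr2 \
        params ip="XXX.XXX.XXX.224" nic="eth0" cidr_netmask="23" \
        op monitor interval="10s" timeout="20s" depth="0" \
        op stop interval="0" timeout="20s" on-fail="block"

With stonith enabled, the default for a failed stop becomes
on-fail="fence" instead, which is what you want.)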

So, to proceed:

1) Stonith would help :)

2) Figure out why the IP couldn't be stopped. There might be a clue in
the logs on baba (though they are indeed hard to follow; search for
"res_hamysql_ip_stop_0" around this time, and look around there). You
could also try adding and removing the IP manually, first with the usual
OS commands, and if that works, by calling the IP resource agent
directly. That often turns up the problem.
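For illustration, something along these lines; the agent path is the
usual one on Debian-ish systems and the address is your placeholder, so
adjust both as needed:

    # first with the usual OS commands
    ip addr add XXX.XXX.XXX.224/23 dev eth0
    ip addr del XXX.XXX.XXX.224/23 dev eth0

    # then by calling the resource agent directly, as the cluster would
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_ip="XXX.XXX.XXX.224"
    export OCF_RESKEY_nic="eth0"
    export OCF_RESKEY_cidr_netmask="23"
    /usr/lib/ocf/resource.d/heartbeat/IPaddr2 start
    /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
    /usr/lib/ocf/resource.d/heartbeat/IPaddr2 stop
    echo "stop rc=$?"   # 0 means the agent thinks the stop succeeded

Once the stop works, clear the failcount so Pacemaker will manage the
IP again, e.g. crm_resource -C -r res_hamysql_ip -H baba (or "crm
resource cleanup res_hamysql_ip" from the crm shell).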

> 
> corosync.log:
> --------------
> pengine: [1223]: WARN: should_dump_input: Ignoring requirement that
> res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0:
> unmanaged failed resources cannot prevent shutdown
> 
> Software:
> ----------
> corosync 1.2.1-4
> pacemaker 1.0.9.1+hg15626-1
> drbd8-utils 2:8.3.7-2.1
> (for some reason it's not possible to update at this time)

It should be possible to get a simple setup like this working on those
versions, but there have been legions of bugfixes and a much better
model for the corosync layer since then.



