[ClusterLabs] Help needed getting DRBD cluster working

Tue Oct 6 11:13:00 EDT 2015

On 10/06/2015 09:38 AM, Gordon Ross wrote:
> On 5 Oct 2015, at 15:05, Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>> The "rc=6" in the failed actions means the resource's Pacemaker
>> configuration is invalid. (For OCF return codes, see
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-ocf-return-codes
>> )
>>
>> The "_monitor_0" means that this was the initial probe that Pacemaker
>> does before trying to start the resource, to make sure it's not already
>> running. As an aside, you probably want to add recurring monitors as
>> well, otherwise Pacemaker won't notice if the resource fails. For
>> example: op monitor interval="29s" role="Master" op monitor
>> interval="31s" role="Slave"
>>
>> As to why the probe is failing, it's hard to tell. Double-check your
>> configuration to make sure disc0 is the exact DRBD name, Pacemaker can
>> read the DRBD configuration file, etc. You can also try running the DRBD
>> resource agent's "status" command manually to see if it prints a more
>> detailed error message.
> 
> I cleared the CIB and re-created most of it with your suggested parameters. It now looks like:
> 
> node $id="739377522" ct1
> node $id="739377523" ct2
> node $id="739377524" ct3 \
> 	attributes standby="on"
> primitive drbd_disc0 ocf:linbit:drbd \
> 	params drbd_resource="disc0" \
> 	meta target-role="Started" \
> 	op monitor interval="19s" on-fail="restart" role="Master" start-delay="10s" timeout="20s" \
> 	op monitor interval="20s" on-fail="restart" role="Slave" start-delay="10s" timeout="20s"
> ms ms_drbd0 drbd_disc0 \
> 	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"

You want to omit target-role, or set it to "Master". Otherwise both
nodes will start as slaves.
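
As a rough sketch, here is the ms definition from your configuration with
target-role dropped and every other meta attribute left as you had it:

```
# Sketch only: same ms resource as quoted above, minus target-role
ms ms_drbd0 drbd_disc0 \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
```

With target-role unset, Pacemaker is free to promote one instance,
subject to your constraints.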

> location cli-prefer-drbd_disc0 ms_drbd0 inf: ct2
> location cli-prefer-ms_drbd0 ms_drbd0 inf: ct2

You've given the above constraints different names, but they are
identical: they both say ms_drbd0 can run on ct2 only.

When you're using clone/ms resources, you generally only ever need to
refer to the clone's name, not the resource being cloned. So you don't
need any constraints for drbd_disc0.

You've set symmetric-cluster=false in the cluster options, which means
that Pacemaker will not start resources on any node unless a location
constraint enables it. Here, you're only enabling ct2. Duplicate the
constraint for ct1 (or set symmetric-cluster=true and use a -INF
location constraint for the third node instead).
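
A minimal sketch of the first option, assuming you keep
symmetric-cluster=false. The constraint IDs are made-up names for
illustration; any unique IDs will do:

```
# Hypothetical constraint IDs; allow ms_drbd0 on both DRBD nodes
location loc-ms_drbd0-ct1 ms_drbd0 inf: ct1
location loc-ms_drbd0-ct2 ms_drbd0 inf: ct2
```

And the second option, again with an invented ID, which would replace the
location constraints above rather than add to them:

```
# Hypothetical constraint ID; allow everywhere, then ban the third node
property symmetric-cluster="true"
location loc-ms_drbd0-not-ct3 ms_drbd0 -inf: ct3
```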

> property $id="cib-bootstrap-options" \
> 	dc-version="1.1.10-42f2063" \
> 	cluster-infrastructure="corosync" \
> 	stonith-enabled="false" \

I'm sure you've heard this before, but stonith is the only way to avoid
data corruption in a split-brain situation. It's usually best to make
fencing the first priority rather than save it for last, because some
problems can become more difficult to troubleshoot without fencing. DRBD
in particular needs special configuration to coordinate fencing with
Pacemaker: https://drbd.linbit.com/users-guide/s-pacemaker-fencing.html

> 	no-quorum-policy="stop" \
> 	symmetric-cluster="false"
> 
> 
> I think I’m missing something basic between the DRBD/Pacemaker hook-up.
> 
> As soon as Pacemaker/Corosync start, DRBD on both nodes stops. A “cat /proc/drbd” then just returns:
> 
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: 6551AD2C98F533733BE558C 
> 
> with no details on the replicated disc, and the DRBD block device disappears.
> 
> GTG
>