[ClusterLabs] pacemaker will not promote a drbd resource

Jean-Francois Malouin Jean-Francois.Malouin at bic.mni.mcgill.ca
Mon Dec 2 17:14:33 EST 2019


Hi,

I know this kind of problem has been posted numerous times, but my google-fu
doesn't turn up anything useful and I'm really stuck.

I have a 2-node cluster with multiple drbd resources in active/passive mode
(only one primary allowed). Every time a primary drbd resource gets into
trouble (I still have to figure out why; I suspect a home-grown LVM backup
cronjob that takes images of the Xen VMs living on top of drbd and drags down
I/O on the primary node), the whole node gets fenced and killed by stonith,
and the drbd resources on the secondary node are not promoted to primary;
they just sit there in the secondary role.
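For what it's worth, here is a sketch of the checks I assume would confirm the
I/O suspicion during the backup window (the backing device below is only a
placeholder, it would be whatever sits under r0):

# watch utilization of the drbd backing device while the backup runs
iostat -x 5 /dev/<backing-device>

# watch the drbd disk state at the same time
watch -n5 'drbdadm dstate r0'    # or 'drbdadm status r0' on drbd 9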

After a failure, the cluster inserts location constraints that forbid the
surviving node from promoting the drbd resources:

location drbd-fence-by-handler-r0-ms_drbd_r0 ms_drbd_r0 \
        rule $role=Master -inf: #uname ne node1
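If I understand the handlers correctly, crm-unfence-peer.sh should remove that
constraint on its own once the resync completes. A sketch of what I assume the
manual equivalent would be, only safe once the local disk is UpToDate:

# confirm the surviving node's copy of r0 is UpToDate
drbdadm dstate r0

# then drop the constraint the fence-peer handler left behind
crm configure delete drbd-fence-by-handler-r0-ms_drbd_r0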

The drbd resources are all set with:

resource <resource> {
  disk {
    fencing resource-only;
    ...
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    ...
  }
  ...
}

so I fail to see why the node with the problematic resource gets killed.

I must be doing something extremely stupid, but I can't see it!
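For reference, these are the checks that (I assume) should show whether the
kill comes from pacemaker's own stonith logic rather than from the DRBD
handler; stonith_admin and crm_mon are the stock pacemaker tools, nothing here
is specific to my setup:

# fencing actions pacemaker has recorded, with their outcome
stonith_admin --history '*' --verbose

# failed operations and fail counts that could have escalated to fencing
crm_mon -1rf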

The cluster has a run-of-the-mill config (the Xen VM 'fly' lives on a
volume group on top of the drbd resource):

primitive p_xen_fly Xen \
    params xmfile="/etc/xen/fly.cfg" name=fly \
    op monitor interval=20s timeout=60s \
    op start interval=0 timeout=90s \
    op stop interval=0 timeout=90s \
    meta migration-threshold=3 failure-timeout=60s target-role=Started

primitive resDRBDr0 ocf:linbit:drbd \
    params drbd_resource=r0 unfence_if_all_uptodate=true \
    op start interval=0 timeout=300s \
    op stop interval=0 timeout=100s \
    op monitor interval=29s role=Master timeout=300s \
    op monitor interval=31s role=Slave timeout=300s \
    meta migration-threshold=3 failure-timeout=120s

primitive p_lvm_vg0 LVM \
    params volgrpname=vg0 \
    op start timeout=30s interval=0 \
    op stop timeout=30s interval=0 \
    op monitor timeout=30s interval=10s

ms ms_drbd_r0 resDRBDr0 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Master is-managed=true

colocation c_lvm_vg0_on_drbd_r0 inf: p_lvm_vg0 ms_drbd_r0:Master
colocation c_xen_fly_on_lvm_vg0 inf: p_xen_fly p_lvm_vg0
order o_drbd_r0_before_lvm_vg0 Mandatory: ms_drbd_r0:promote p_lvm_vg0:start
order o_lvm_vg0_before_xen_fly Mandatory: p_lvm_vg0 p_xen_fly
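For completeness, these are the commands I assume are the right ones to look
at why the survivor isn't promoted, using the resource names above:

# show the placement and promotion scores the scheduler computes
crm_simulate -sL

# clear leftover fail counts on the drbd resource once the underlying problem is fixed
crm resource cleanup resDRBDr0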

Can anyone see something obvious?
Thanks,
jf

