[ClusterLabs] Simple master-slave DRBD with fs mount

Marc MAURICE marc.maurice at objectif-libre.com
Fri Apr 3 05:03:34 EDT 2015


Hi all,

I spent a day debugging this with no success.

I'm trying to set up a simple master-slave DRBD failover with a
filesystem mount (Ubuntu 14.04).

* Here is my cluster config (DRBD only):
# crm configure show
node $id="1084752174" lxc2
node $id="1084752175" lxc1
primitive DRBD ocf:linbit:drbd \
     params drbd_resource="r0" \
     op monitor interval="29s" role="Master" \
     op monitor interval="31s" role="Slave"
ms DRBDClone DRBD \
     meta master-max="1" master-node-max="1" clone-max="2" \
     clone-node-max="1" notify="true" target-role="Master"
property $id="cib-bootstrap-options" \
     dc-version="1.1.10-42f2063" \
     cluster-infrastructure="corosync" \
     stonith-enabled="false" \
     no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
     resource-stickiness="100"

# crm_mon -1
Last updated: Fri Apr  3 08:45:11 2015
Last change: Fri Apr  3 08:42:56 2015 via cibadmin on lxc1
Stack: corosync
Current DC: lxc1 (1084752175) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
2 Resources configured


Online: [ lxc1 lxc2 ]

  Master/Slave Set: DRBDClone [DRBD]
      Masters: [ lxc2 ]
      Slaves: [ lxc1 ]

# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: 6551AD2C98F533733BE558C

  1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0


* The DRBD failover works perfectly without the FS mount: I can
migrate, or force a failover by stopping pacemaker on the master.
Everything is fine.
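
* For reference, roughly what I run (a sketch, not pasted output; the
node name is just an example):

To migrate the master role:
# crm resource migrate DRBDClone lxc1

To force a failover, on the current master:
# service pacemaker stop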

* Then I add the Filesystem resource with the proper constraints; the
filesystem comes up and is mounted with no problem:

primitive DRBDfs ocf:heartbeat:Filesystem \
     params device="/dev/drbd1" directory="/mnt/drbd" fstype="ext4"
colocation fs_on_drbd inf: DRBDfs DRBDClone:Master
order fs_after_drbd inf: DRBDClone:promote DRBDfs
commit

# crm_mon -1
Last updated: Fri Apr  3 08:55:36 2015
Last change: Fri Apr  3 08:55:22 2015 via cibadmin on lxc1
Stack: corosync
Current DC: lxc1 (1084752175) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured


Online: [ lxc1 lxc2 ]

  Master/Slave Set: DRBDClone [DRBD]
      Masters: [ lxc2 ]
      Slaves: [ lxc1 ]
  DRBDfs    (ocf::heartbeat:Filesystem):    Started lxc2

* However, nothing works when I try to migrate manually with crm
resource migrate, or force a failover by stopping pacemaker on the master.

* The problem is that the filesystem resource is started BEFORE the
master is promoted: in the syslog below, DRBDfs_start_0 runs and fails
with "Wrong medium type" while the local DRBD is still Secondary, and
the resulting fail-count then forces both resources away from lxc1.

* After stopping pacemaker on the master:
# crm_mon -1
Last updated: Fri Apr  3 08:59:10 2015
Last change: Fri Apr  3 08:55:22 2015 via cibadmin on lxc1
Stack: corosync
Current DC: lxc1 (1084752175) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured


Online: [ lxc1 ]
OFFLINE: [ lxc2 ]


Failed actions:
     DRBDfs_start_0 (node=lxc1, call=124, rc=1, status=complete,
         last-rc-change=Fri Apr  3 08:59:02 2015, queued=61ms,
         exec=0ms): unknown error

* I tried to "hack" the Filesystem resource agent's shell script: with
an ugly sleep added before the mount, everything works fine.
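* (The hack, roughly; the placement and the 5 seconds are approximate:
in Filesystem_start() in /usr/lib/ocf/resource.d/heartbeat/Filesystem,
just before the mount call:)

     sleep 5   # ugly: give DRBD time to finish promoting before mounting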
* I think something is wrong with my constraints, but what?
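
* (For comparison, the DRBD user's guide spells out the then-action
explicitly. I don't know whether the missing :start matters, but this
is the variant I would try next:)

order fs_after_drbd inf: DRBDClone:promote DRBDfs:start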


* Thanks in advance!
* See below my syslog during the failing failover.


-----
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: process_lrm_event: LRM 
operation DRBD_notify_0 (call=72, rc=0, cib-update=0, confirmed=true) ok
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: process_lrm_event: LRM 
operation DRBD_notify_0 (call=77, rc=0, cib-update=0, confirmed=true) ok
Apr  3 08:05:43 lxc1 Filesystem(DRBDfs)[7947]: INFO: Running start for 
/dev/drbd1 on /mnt/drbd
Apr  3 08:05:43 lxc1 lrmd[6951]:   notice: operation_finished: 
DRBDfs_start_0:7947:stderr [ blockdev: cannot open /dev/drbd1: Wrong 
medium type ]
Apr  3 08:05:43 lxc1 lrmd[6951]:   notice: operation_finished: 
DRBDfs_start_0:7947:stderr [ mount: block device /dev/drbd1 is 
write-protected, mounting read-only ]
Apr  3 08:05:43 lxc1 lrmd[6951]:   notice: operation_finished: 
DRBDfs_start_0:7947:stderr [ mount: Wrong medium type ]
Apr  3 08:05:43 lxc1 lrmd[6951]:   notice: operation_finished: 
DRBDfs_start_0:7947:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/Filesystem: 451: 
/usr/lib/ocf/resource.d/heartbeat/Filesystem: ocf_exit_reason: not found ]
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: process_lrm_event: LRM 
operation DRBDfs_start_0 (call=75, rc=1, cib-update=25, confirmed=true) 
unknown error
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_cs_dispatch: Update 
relayed from lxc2
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_perform_update: Sent 
update 57: fail-count-DRBDfs=INFINITY
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_cs_dispatch: Update 
relayed from lxc2
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_perform_update: Sent 
update 60: last-failure-DRBDfs=1428048343
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_cs_dispatch: Update 
relayed from lxc2
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_perform_update: Sent 
update 63: fail-count-DRBDfs=INFINITY
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_cs_dispatch: Update 
relayed from lxc2
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_perform_update: Sent 
update 66: last-failure-DRBDfs=1428048343
Apr  3 08:05:43 lxc1 kernel: [57338.686691] block drbd1: peer( Primary 
-> Secondary )
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: process_lrm_event: LRM 
operation DRBD_notify_0 (call=81, rc=0, cib-update=0, confirmed=true) ok
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: process_lrm_event: LRM 
operation DRBD_notify_0 (call=86, rc=0, cib-update=0, confirmed=true) ok
Apr  3 08:05:43 lxc1 Filesystem(DRBDfs)[8049]: INFO: Running stop for 
/dev/drbd1 on /mnt/drbd
Apr  3 08:05:43 lxc1 lrmd[6951]:   notice: operation_finished: 
DRBDfs_stop_0:8049:stderr [ blockdev: cannot open /dev/drbd1: Wrong 
medium type ]
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: process_lrm_event: LRM 
operation DRBDfs_stop_0 (call=84, rc=0, cib-update=26, confirmed=true) ok
Apr  3 08:05:43 lxc1 kernel: [57338.796271] d-con r0: peer( Secondary -> 
Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Apr  3 08:05:43 lxc1 kernel: [57338.796291] d-con r0: asender terminated
Apr  3 08:05:43 lxc1 kernel: [57338.796292] d-con r0: Terminating drbd_a_r0
Apr  3 08:05:43 lxc1 kernel: [57338.802943] d-con r0: conn( TearDown -> 
Disconnecting )
Apr  3 08:05:43 lxc1 kernel: [57338.807694] d-con r0: Connection closed
Apr  3 08:05:43 lxc1 kernel: [57338.807701] d-con r0: conn( 
Disconnecting -> StandAlone )
Apr  3 08:05:43 lxc1 kernel: [57338.807702] d-con r0: receiver terminated
Apr  3 08:05:43 lxc1 kernel: [57338.807704] d-con r0: Terminating drbd_r_r0
Apr  3 08:05:43 lxc1 kernel: [57338.807726] block drbd1: disk( UpToDate 
-> Failed )
Apr  3 08:05:43 lxc1 kernel: [57338.807732] block drbd1: bitmap WRITE of 
0 pages took 0 jiffies
Apr  3 08:05:43 lxc1 kernel: [57338.818172] block drbd1: 0 KB (0 bits) 
marked out-of-sync by on disk bit-map.
Apr  3 08:05:43 lxc1 kernel: [57338.818187] block drbd1: disk( Failed -> 
Diskless )
Apr  3 08:05:43 lxc1 kernel: [57338.818235] block drbd1: drbd_bm_resize 
called with capacity == 0
Apr  3 08:05:43 lxc1 kernel: [57338.818239] d-con r0: Terminating drbd_w_r0
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-DRBD (<null>)
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: process_lrm_event: LRM 
operation DRBD_stop_0 (call=89, rc=0, cib-update=27, confirmed=true) ok
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_perform_update: Sent 
delete 68: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null), 
section=status
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_perform_update: Sent 
delete 70: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null), 
section=status
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_perform_update: Sent 
delete 72: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null), 
section=status
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: peer_update_callback: Our 
peer on the DC is dead
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: do_state_transition: State 
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION 
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Apr  3 08:05:43 lxc1 crmd[6954]:   notice: do_state_transition: State 
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=do_election_check ]
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_local_callback: 
Sending full refresh (origin=crmd)
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
Apr  3 08:05:43 lxc1 attrd[6952]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Apr  3 08:05:44 lxc1 pengine[6953]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Apr  3 08:05:44 lxc1 pengine[6953]:  warning: unpack_rsc_op: Processing 
failed op start for DRBDfs on lxc1: unknown error (1)
Apr  3 08:05:44 lxc1 pengine[6953]:  warning: common_apply_stickiness: 
Forcing DRBDClone away from lxc1 after 1000000 failures (max=1000000)
Apr  3 08:05:44 lxc1 pengine[6953]:  warning: common_apply_stickiness: 
Forcing DRBDClone away from lxc1 after 1000000 failures (max=1000000)
Apr  3 08:05:44 lxc1 pengine[6953]:  warning: common_apply_stickiness: 
Forcing DRBDfs away from lxc1 after 1000000 failures (max=1000000)
Apr  3 08:05:44 lxc1 crmd[6954]:   notice: run_graph: Transition 0 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-225.bz2): Complete
Apr  3 08:05:44 lxc1 crmd[6954]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr  3 08:05:44 lxc1 pengine[6953]:   notice: process_pe_message: 
Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-225.bz2
