[ClusterLabs] Simple master-slave DRBD with fs mount
Marc MAURICE
marc.maurice at objectif-libre.com
Fri Apr 3 09:03:34 UTC 2015
Hi all,
I spent a day debugging this with no success.
I'm trying to achieve a simple master-slave DRBD failover with a
filesystem mount.
(Ubuntu 14.04)
* Here is my cluster config (DRBD only):
# crm configure show
node $id="1084752174" lxc2
node $id="1084752175" lxc1
primitive DRBD ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="29s" role="Master" \
op monitor interval="31s" role="Slave"
ms DRBDClone DRBD \
meta master-max="1" master-node-max="1" clone-max="2" \
clone-node-max="1" notify="true" target-role="Master"
property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
# crm_mon -1
Last updated: Fri Apr 3 08:45:11 2015
Last change: Fri Apr 3 08:42:56 2015 via cibadmin on lxc1
Stack: corosync
Current DC: lxc1 (1084752175) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
2 Resources configured
Online: [ lxc1 lxc2 ]
Master/Slave Set: DRBDClone [DRBD]
Masters: [ lxc2 ]
Slaves: [ lxc1 ]
# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: 6551AD2C98F533733BE558C
1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
* The DRBD failover works perfectly without the FS mount. I can
migrate, or force a failover with a pacemaker stop. Everything is fine.
* Then I add the Filesystem resource with the proper constraints; the
filesystem comes up and is mounted with no problem:
primitive DRBDfs ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/mnt/drbd" fstype="ext4"
colocation fs_on_drbd inf: DRBDfs DRBDClone:Master
order fs_after_drbd inf: DRBDClone:promote DRBDfs
commit
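For comparison, the equivalent constraint pair is usually written with an
explicit action on both sides of the order constraint (same resource names
as above; a sketch of the commonly documented form, not a confirmed fix):

```
colocation fs_on_drbd inf: DRBDfs DRBDClone:Master
order fs_after_drbd inf: DRBDClone:promote DRBDfs:start
```

With no action given, the right-hand side should default to start, but
spelling out DRBDfs:start removes any ambiguity about what is being ordered
after the promote.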
# crm_mon -1
Last updated: Fri Apr 3 08:55:36 2015
Last change: Fri Apr 3 08:55:22 2015 via cibadmin on lxc1
Stack: corosync
Current DC: lxc1 (1084752175) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured
Online: [ lxc1 lxc2 ]
Master/Slave Set: DRBDClone [DRBD]
Masters: [ lxc2 ]
Slaves: [ lxc1 ]
DRBDfs (ocf::heartbeat:Filesystem): Started lxc2
* However, nothing works when I try to migrate manually with crm
resource migrate, or force a failover by stopping pacemaker on the master.
* The problem is that the filesystem resource is started BEFORE the
master is promoted.
* After a pacemaker stop on the master:
# crm_mon -1
Last updated: Fri Apr 3 08:59:10 2015
Last change: Fri Apr 3 08:55:22 2015 via cibadmin on lxc1
Stack: corosync
Current DC: lxc1 (1084752175) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured
Online: [ lxc1 ]
OFFLINE: [ lxc2 ]
Failed actions:
DRBDfs_start_0 (node=lxc1, call=124, rc=1, status=complete,
last-rc-change=Fri Apr 3 08:59:02 2015
, queued=61ms, exec=0ms
): unknown error
* I tried to "hack" the shell script of the Filesystem resource agent:
when I add an ugly sleep before mounting, everything works fine.
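The sleep presumably just buys time for the local DRBD device to become
Primary before the mount. A sketch of polling for that state explicitly
instead of sleeping blindly (assumes resource r0 and drbdadm from
drbd-utils; wait_until_primary is a hypothetical helper, not part of the
agent):

```shell
# Poll the local DRBD role until it reports Primary, up to $2 attempts.
# "drbdadm role <res>" prints "local/peer", e.g. "Primary/Secondary".
wait_until_primary() {
    res="$1"
    tries="${2:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        role=$(drbdadm role "$res" 2>/dev/null | cut -d/ -f1)
        [ "$role" = "Primary" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

This is only a diagnostic workaround, of course; with correct constraints
the mount should not be attempted until the promote has completed.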
* I think something is wrong with my constraints, but what?
* Thanks in advance!
* See below my syslog from the failing failover.
-----
Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
operation DRBD_notify_0 (call=72, rc=0, cib-update=0, confirmed=true) ok
Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
operation DRBD_notify_0 (call=77, rc=0, cib-update=0, confirmed=true) ok
Apr 3 08:05:43 lxc1 Filesystem(DRBDfs)[7947]: INFO: Running start for
/dev/drbd1 on /mnt/drbd
Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
DRBDfs_start_0:7947:stderr [ blockdev: cannot open /dev/drbd1: Wrong
medium type ]
Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
DRBDfs_start_0:7947:stderr [ mount: block device /dev/drbd1 is
write-protected, mounting read-only ]
Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
DRBDfs_start_0:7947:stderr [ mount: Wrong medium type ]
Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
DRBDfs_start_0:7947:stderr [
/usr/lib/ocf/resource.d/heartbeat/Filesystem: 451:
/usr/lib/ocf/resource.d/heartbeat/Filesystem: ocf_exit_reason: not found ]
Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
operation DRBDfs_start_0 (call=75, rc=1, cib-update=25, confirmed=true)
unknown error
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
relayed from lxc2
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
update 57: fail-count-DRBDfs=INFINITY
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
relayed from lxc2
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
update 60: last-failure-DRBDfs=1428048343
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
relayed from lxc2
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
update 63: fail-count-DRBDfs=INFINITY
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
relayed from lxc2
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
update 66: last-failure-DRBDfs=1428048343
Apr 3 08:05:43 lxc1 kernel: [57338.686691] block drbd1: peer( Primary
-> Secondary )
Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
operation DRBD_notify_0 (call=81, rc=0, cib-update=0, confirmed=true) ok
Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
operation DRBD_notify_0 (call=86, rc=0, cib-update=0, confirmed=true) ok
Apr 3 08:05:43 lxc1 Filesystem(DRBDfs)[8049]: INFO: Running stop for
/dev/drbd1 on /mnt/drbd
Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
DRBDfs_stop_0:8049:stderr [ blockdev: cannot open /dev/drbd1: Wrong
medium type ]
Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
operation DRBDfs_stop_0 (call=84, rc=0, cib-update=26, confirmed=true) ok
Apr 3 08:05:43 lxc1 kernel: [57338.796271] d-con r0: peer( Secondary ->
Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Apr 3 08:05:43 lxc1 kernel: [57338.796291] d-con r0: asender terminated
Apr 3 08:05:43 lxc1 kernel: [57338.796292] d-con r0: Terminating drbd_a_r0
Apr 3 08:05:43 lxc1 kernel: [57338.802943] d-con r0: conn( TearDown ->
Disconnecting )
Apr 3 08:05:43 lxc1 kernel: [57338.807694] d-con r0: Connection closed
Apr 3 08:05:43 lxc1 kernel: [57338.807701] d-con r0: conn(
Disconnecting -> StandAlone )
Apr 3 08:05:43 lxc1 kernel: [57338.807702] d-con r0: receiver terminated
Apr 3 08:05:43 lxc1 kernel: [57338.807704] d-con r0: Terminating drbd_r_r0
Apr 3 08:05:43 lxc1 kernel: [57338.807726] block drbd1: disk( UpToDate
-> Failed )
Apr 3 08:05:43 lxc1 kernel: [57338.807732] block drbd1: bitmap WRITE of
0 pages took 0 jiffies
Apr 3 08:05:43 lxc1 kernel: [57338.818172] block drbd1: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
Apr 3 08:05:43 lxc1 kernel: [57338.818187] block drbd1: disk( Failed ->
Diskless )
Apr 3 08:05:43 lxc1 kernel: [57338.818235] block drbd1: drbd_bm_resize
called with capacity == 0
Apr 3 08:05:43 lxc1 kernel: [57338.818239] d-con r0: Terminating drbd_w_r0
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: master-DRBD (<null>)
Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
operation DRBD_stop_0 (call=89, rc=0, cib-update=27, confirmed=true) ok
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
delete 68: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null),
section=status
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
delete 70: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null),
section=status
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
delete 72: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null),
section=status
Apr 3 08:05:43 lxc1 crmd[6954]: notice: peer_update_callback: Our
peer on the DC is dead
Apr 3 08:05:43 lxc1 crmd[6954]: notice: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Apr 3 08:05:43 lxc1 crmd[6954]: notice: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Apr 3 08:05:44 lxc1 pengine[6953]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Apr 3 08:05:44 lxc1 pengine[6953]: warning: unpack_rsc_op: Processing
failed op start for DRBDfs on lxc1: unknown error (1)
Apr 3 08:05:44 lxc1 pengine[6953]: warning: common_apply_stickiness:
Forcing DRBDClone away from lxc1 after 1000000 failures (max=1000000)
Apr 3 08:05:44 lxc1 pengine[6953]: warning: common_apply_stickiness:
Forcing DRBDClone away from lxc1 after 1000000 failures (max=1000000)
Apr 3 08:05:44 lxc1 pengine[6953]: warning: common_apply_stickiness:
Forcing DRBDfs away from lxc1 after 1000000 failures (max=1000000)
Apr 3 08:05:44 lxc1 crmd[6954]: notice: run_graph: Transition 0
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-225.bz2): Complete
Apr 3 08:05:44 lxc1 crmd[6954]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 3 08:05:44 lxc1 pengine[6953]: notice: process_pe_message:
Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-225.bz2