[ClusterLabs] Simple master-slave DRBD with fs mount
Vladislav Bogdanov
bubble at hoster-ok.com
Fri Apr 3 09:16:38 UTC 2015
On 03.04.2015 12:03, Marc MAURICE wrote:
> Hi all,
>
> I spent a day debugging this with no success.
>
> I'm trying to achieve a simple master-slave DRBD failover with a
> filesystem mount.
> (Ubuntu 14.04)
...
> * The DRBD failover is working perfectly without the FS mount. I can
> migrate, or force a failover with a pacemaker stop. Everything is fine.
>
> * Then I add the Filesystem resource with the proper constraints; the
> filesystem comes up and is mounted with no problem:
>
> primitive DRBDfs ocf:heartbeat:Filesystem params device="/dev/drbd1"
> directory="/mnt/drbd" fstype="ext4"
> colocation fs_on_drbd inf: DRBDfs DRBDClone:Master
> order fs_after_drbd inf: DRBDClone:promote DRBDfs
Try replacing this one with:
order fs_after_drbd inf: DRBDClone:promote DRBDfs:start
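The reason, if memory serves, is that then-action in an order constraint
defaults to the value of first-action. Your original line therefore asks
pacemaker to "promote" DRBDfs after the DRBDClone promote, which is
meaningless for a primitive, so the Filesystem start ends up unordered and
can race the promotion. With the explicit action, the pair becomes:

  colocation fs_on_drbd inf: DRBDfs DRBDClone:Master
  order fs_after_drbd inf: DRBDClone:promote DRBDfs:start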
Best,
Vladislav
> commit
>
> # crm_mon -1
> Last updated: Fri Apr 3 08:55:36 2015
> Last change: Fri Apr 3 08:55:22 2015 via cibadmin on lxc1
> Stack: corosync
> Current DC: lxc1 (1084752175) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 3 Resources configured
>
>
> Online: [ lxc1 lxc2 ]
>
> Master/Slave Set: DRBDClone [DRBD]
> Masters: [ lxc2 ]
> Slaves: [ lxc1 ]
> DRBDfs (ocf::heartbeat:Filesystem): Started lxc2
>
> * However, nothing works when I try to migrate manually with crm
> resource migrate, or force a failover with a pacemaker stop on the
> master.
>
> * The problem is that the filesystem resource is started BEFORE the
> master is promoted.
>
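You can confirm the gap by dumping the constraint XML (using the id from
your config):

  crm configure show xml fs_after_drbd

With the original syntax you should see only first-action="promote" on the
rsc_order element and no then-action, which is what lets the start run early.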
> * After a pacemaker stop on the master:
> # crm_mon -1
> Last updated: Fri Apr 3 08:59:10 2015
> Last change: Fri Apr 3 08:55:22 2015 via cibadmin on lxc1
> Stack: corosync
> Current DC: lxc1 (1084752175) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 3 Resources configured
>
>
> Online: [ lxc1 ]
> OFFLINE: [ lxc2 ]
>
>
> Failed actions:
> DRBDfs_start_0 (node=lxc1, call=124, rc=1, status=complete,
> last-rc-change=Fri Apr 3 08:59:02 2015
> , queued=61ms, exec=0ms
> ): unknown error
>
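Also note the fail-count is now INFINITY, so even with the constraint
fixed, DRBDfs stays banned from lxc1 until you clear the failure, e.g.:

  crm resource cleanup DRBDfs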
> * I tried to "hack" the shell script of the Filesystem resource agent:
> when I add an ugly sleep before mounting, everything works fine.
> * I think something is wrong with my constraints, but what?
>
>
> * Thanks in advance!
> * See below my syslog during the failing failover.
>
>
> -----
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
> operation DRBD_notify_0 (call=72, rc=0, cib-update=0, confirmed=true) ok
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
> operation DRBD_notify_0 (call=77, rc=0, cib-update=0, confirmed=true) ok
> Apr 3 08:05:43 lxc1 Filesystem(DRBDfs)[7947]: INFO: Running start for
> /dev/drbd1 on /mnt/drbd
> Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
> DRBDfs_start_0:7947:stderr [ blockdev: cannot open /dev/drbd1: Wrong
> medium type ]
> Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
> DRBDfs_start_0:7947:stderr [ mount: block device /dev/drbd1 is
> write-protected, mounting read-only ]
> Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
> DRBDfs_start_0:7947:stderr [ mount: Wrong medium type ]
> Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
> DRBDfs_start_0:7947:stderr [
> /usr/lib/ocf/resource.d/heartbeat/Filesystem: 451:
> /usr/lib/ocf/resource.d/heartbeat/Filesystem: ocf_exit_reason: not found ]
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
> operation DRBDfs_start_0 (call=75, rc=1, cib-update=25, confirmed=true)
> unknown error
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
> relayed from lxc2
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
> update 57: fail-count-DRBDfs=INFINITY
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
> relayed from lxc2
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
> update 60: last-failure-DRBDfs=1428048343
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
> relayed from lxc2
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
> update 63: fail-count-DRBDfs=INFINITY
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_cs_dispatch: Update
> relayed from lxc2
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
> update 66: last-failure-DRBDfs=1428048343
> Apr 3 08:05:43 lxc1 kernel: [57338.686691] block drbd1: peer( Primary
> -> Secondary )
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
> operation DRBD_notify_0 (call=81, rc=0, cib-update=0, confirmed=true) ok
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
> operation DRBD_notify_0 (call=86, rc=0, cib-update=0, confirmed=true) ok
> Apr 3 08:05:43 lxc1 Filesystem(DRBDfs)[8049]: INFO: Running stop for
> /dev/drbd1 on /mnt/drbd
> Apr 3 08:05:43 lxc1 lrmd[6951]: notice: operation_finished:
> DRBDfs_stop_0:8049:stderr [ blockdev: cannot open /dev/drbd1: Wrong
> medium type ]
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
> operation DRBDfs_stop_0 (call=84, rc=0, cib-update=26, confirmed=true) ok
> Apr 3 08:05:43 lxc1 kernel: [57338.796271] d-con r0: peer( Secondary ->
> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> Apr 3 08:05:43 lxc1 kernel: [57338.796291] d-con r0: asender terminated
> Apr 3 08:05:43 lxc1 kernel: [57338.796292] d-con r0: Terminating drbd_a_r0
> Apr 3 08:05:43 lxc1 kernel: [57338.802943] d-con r0: conn( TearDown ->
> Disconnecting )
> Apr 3 08:05:43 lxc1 kernel: [57338.807694] d-con r0: Connection closed
> Apr 3 08:05:43 lxc1 kernel: [57338.807701] d-con r0: conn(
> Disconnecting -> StandAlone )
> Apr 3 08:05:43 lxc1 kernel: [57338.807702] d-con r0: receiver terminated
> Apr 3 08:05:43 lxc1 kernel: [57338.807704] d-con r0: Terminating drbd_r_r0
> Apr 3 08:05:43 lxc1 kernel: [57338.807726] block drbd1: disk( UpToDate
> -> Failed )
> Apr 3 08:05:43 lxc1 kernel: [57338.807732] block drbd1: bitmap WRITE of
> 0 pages took 0 jiffies
> Apr 3 08:05:43 lxc1 kernel: [57338.818172] block drbd1: 0 KB (0 bits)
> marked out-of-sync by on disk bit-map.
> Apr 3 08:05:43 lxc1 kernel: [57338.818187] block drbd1: disk( Failed ->
> Diskless )
> Apr 3 08:05:43 lxc1 kernel: [57338.818235] block drbd1: drbd_bm_resize
> called with capacity == 0
> Apr 3 08:05:43 lxc1 kernel: [57338.818239] d-con r0: Terminating drbd_w_r0
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: master-DRBD (<null>)
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: process_lrm_event: LRM
> operation DRBD_stop_0 (call=89, rc=0, cib-update=27, confirmed=true) ok
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
> delete 68: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null),
> section=status
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
> delete 70: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null),
> section=status
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_perform_update: Sent
> delete 72: node=1084752175, attr=master-DRBD, id=<n/a>, set=(null),
> section=status
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: peer_update_callback: Our
> peer on the DC is dead
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: do_state_transition: State
> transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
> cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> Apr 3 08:05:43 lxc1 crmd[6954]: notice: do_state_transition: State
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_FSA_INTERNAL origin=do_election_check ]
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-DRBDfs (INFINITY)
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-DRBDfs (1428048343)
> Apr 3 08:05:43 lxc1 attrd[6952]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Apr 3 08:05:44 lxc1 pengine[6953]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Apr 3 08:05:44 lxc1 pengine[6953]: warning: unpack_rsc_op: Processing
> failed op start for DRBDfs on lxc1: unknown error (1)
> Apr 3 08:05:44 lxc1 pengine[6953]: warning: common_apply_stickiness:
> Forcing DRBDClone away from lxc1 after 1000000 failures (max=1000000)
> Apr 3 08:05:44 lxc1 pengine[6953]: warning: common_apply_stickiness:
> Forcing DRBDClone away from lxc1 after 1000000 failures (max=1000000)
> Apr 3 08:05:44 lxc1 pengine[6953]: warning: common_apply_stickiness:
> Forcing DRBDfs away from lxc1 after 1000000 failures (max=1000000)
> Apr 3 08:05:44 lxc1 crmd[6954]: notice: run_graph: Transition 0
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-225.bz2): Complete
> Apr 3 08:05:44 lxc1 crmd[6954]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Apr 3 08:05:44 lxc1 pengine[6953]: notice: process_pe_message:
> Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-225.bz2
>
>
>
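One more thing visible in the log: the "ocf_exit_reason: not found" stderr
line suggests your Filesystem agent is newer than the shell function library
it sources. Harmless here, but you can check whether the library defines the
helper (standard resource-agents path assumed):

  grep -c ocf_exit_reason /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs

If that prints 0, bringing the resource-agents package and the agent script
back in sync will make future error messages readable again.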
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org