[Pacemaker] A secondary DRBD is not brought online after "crm resource online" in Ubuntu 14.04

Sat Jan 3 12:35:17 EST 2015

Hello.

Ubuntu 14.04, corosync 2.3.3, pacemaker 1.1.10. The cluster consists of 2
nodes (node1 and node2), when I run "crm node standby node2" and then, in a
minute, "crm node online node2", DRBD secondary on node2 does not start.
Logs say that "drbdadm -c /etc/drbd.conf check-resize vlv" fails with an
error message: "No valid meta data found" on the onlining node. And,
surprisingly, after I run "service drbd start" on node2 manually,
everything becomes fine.

Maybe something is broken in /usr/lib/ocf/resource.d/linbit/drbd, why
cannot it start DRBD? Or I am misconfigured somehow? Could you please give
an advice what to do?

I have the following configuration (drbd + mount + postgresql, but
postgresql is innocent here, so just ignore it):

*root at node2:/var/log#* crm configure show
node $id="1017525950" node2 attributes standby="off"
node $id="1760315215" node1
primitive drbd ocf:linbit:drbd \
params drbd_resource="vlv" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="120"
primitive fs ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root"
options="noatime,nodiratime" fstype="xfs" \
op start interval="0" timeout="300" \
op stop interval="0" timeout="300"
primitive postgresql lsb:postgresql \
op monitor interval="4" timeout="60" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
group pgserver fs postgresql
ms ms_drbd drbd \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
notify="true"
location cli-prefer-pgserver pgserver inf: node1
colocation col_pgserver inf: pgserver ms_drbd:Master
order ord_pgserver inf: ms_drbd:promote pgserver:start
property $id="cib-bootstrap-options" dc-version="1.1.10-42f2063"
cluster-infrastructure="corosync"
stonith-enabled="false" no-quorum-policy="ignore"
last-lrm-refresh="1420304078"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"

The cluster and DRBD statuses on node2 look healthy:

*root at node2:/var/log#* crm status
...
Online: [ node1 node2 ]
 Master/Slave Set: ms_drbd [drbd]
     Masters: [ node1 ]
     Slaves: [ node2 ]
 Resource Group: pgserver
     fs (ocf::heartbeat:Filesystem): Started node1
     postgresql (lsb:postgresql): Started node1

*root at node2:/var/log#* cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
 0: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

Now I switch node2 to standby and verify that DRBD on it has really shot
down:

*root at node1:/etc/rc2.d#* crm node standby node2
*root at node2:/var/log#* cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
*root at node2:/var/log#*

Then I switch node2 back online and see that DRBD has not been initialized
and reattached again!

*root at node2:/var/log#* >syslog
*root at node1:/etc#* crm node online node2
*root at node2:/var/log#* crm status
...
Online: [ node1 node2 ]
 Master/Slave Set: ms_drbd [drbd]
     Masters: [ node1 ]
     Stopped: [ node2 ]
 Resource Group: pgserver
     fs (ocf::heartbeat:Filesystem): Started node1
     postgresql (lsb:postgresql): Started node1
Failed actions:
    drbd_start_0 (node=node2, call=81, rc=1, status=complete,
last-rc-change=Sat Jan  3 12:05:32 2015
, queued=1118ms, exec=0ms
): unknown error

*root at node2:/var/log#* cat syslog | head -n 30
Jan  3 12:05:31 node2 crmd[918]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Jan  3 12:05:31 node2 cib[913]:   notice: cib:diff: Diff: --- 0.29.3
Jan  3 12:05:31 node2 cib[913]:   notice: cib:diff: Diff: +++ 0.30.1
027344551b46745123e4a52562e55974
Jan  3 12:05:31 node2 pengine[917]:   notice: unpack_config: On loss of CCM
Quorum: Ignore
Jan  3 12:05:31 node2 pengine[917]:   notice: LogActions: Start
drbd:1#011(node2)
Jan  3 12:05:31 node2 crmd[918]:   notice: te_rsc_command: Initiating
action 46: notify drbd_pre_notify_start_0 on node1
Jan  3 12:05:31 node2 pengine[917]:   notice: process_pe_message:
Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-11.bz2
Jan  3 12:05:32 node2 crmd[918]:   notice: te_rsc_command: Initiating
action 10: start drbd:1_start_0 on node2 (local)
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf check-resize vlv
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan  3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan  3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan  3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan  3 12:05:33 node2 lrmd[915]:   notice: operation_finished:
drbd_start_0:1931:stderr [ no suitable meta data found :( ]
Jan  3 12:05:33 node2 lrmd[915]:   notice: operation_finished:
drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal
check-resize' terminated with exit code 255 ]
Jan  3 12:05:33 node2 lrmd[915]:   notice: operation_finished:
drbd_start_0:1931:stderr [ No valid meta data found ]
Jan  3 12:05:33 node2 lrmd[915]:   notice: operation_finished:
drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal
apply-al' terminated with exit code 255 ]
Jan  3 12:05:33 node2 lrmd[915]:   notice: operation_finished:
drbd_start_0:1931:stderr [ No valid meta data found ]
Jan  3 12:05:33 node2 lrmd[915]:   notice: operation_finished:
drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal
apply-al' terminated with exit code 255 ]
Jan  3 12:05:33 node2 crmd[918]:   notice: process_lrm_event: LRM operation
drbd_start_0 (call=81, rc=1, cib-update=79, confirmed=true) unknown error
Jan  3 12:05:33 node2 crmd[918]:   notice: process_lrm_event:
node2-drbd_start_0:81 [ \n\n\n\n\n\n\n ]
Jan  3 12:05:33 node2 crmd[918]:  warning: status_from_rc: Action 10
(drbd:1_start_0) on node2 failed (target: 0 vs. rc: 1): Error
Jan  3 12:05:33 node2 crmd[918]:  warning: update_failcount: Updating
failcount for drbd on node2 after failed start: rc=1 (update=INFINITY,
time=1420304733)
Jan  3 12:05:33 node2 attrd[916]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-drbd (INFINITY)
Jan  3 12:05:33 node2 attrd[916]:   notice: attrd_perform_update: Sent
update 60: fail-count-drbd=INFINITY
Jan  3 12:05:33 node2 attrd[916]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-drbd (1420304733)

root at node2:/var/log# cat syslog | grep ERROR | head -n 30
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf check-resize vlv
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan  3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan  3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan  3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan  3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan  3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 -v adjust vlv
Jan  3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Exit code 1
Jan  3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Command output:
drbdmeta 0 v08 /dev/loop0 internal apply-al

Here's the DRBD config file:

*root at node2:/var/log#* cat /etc/drbd.d/drbd.res
resource vlv {
  device /dev/drbd0;
  disk /dev/loop0;
  meta-disk internal;
  syncer { rate 150M; verify-alg md5; }
  on node1 { address x.x.x.x:7788; }
  on node2 { address y.y.y.y:7788; }
}

*root at node2:/var/log#* losetup -a
/dev/loop0: [fd01]:1314858 (/var/lib/vlv.drbd/vlv.img)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20150103/27ab57ef/attachment-0002.html>