[Pacemaker] Nodes will not promote DRBD resources to master on failover

Andrew Martin amartin at xes-inc.com
Tue Apr 10 10:29:27 EDT 2012


Hi Andreas,

----- Original Message ----- 

> From: "Andreas Kurz" <andreas at hastexo.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Tuesday, April 10, 2012 5:28:15 AM
> Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover

> On 04/10/2012 06:17 AM, Andrew Martin wrote:
> > Hi Andreas,
> >
> > Yes, I attempted to generalize hostnames and usernames/passwords in
> > the archive. Sorry for making it more confusing :(
> >
> > I completely purged pacemaker from all 3 nodes and reinstalled
> > everything. I then completely rebuilt the CIB by manually adding in
> > each primitive/constraint one at a time and testing along the way.
> > After doing this DRBD appears to be working at least somewhat better -
> > the ocf:linbit:drbd devices are started and managed by pacemaker.
> > However, if, for example, a node is STONITHed, when it comes back up
> > it will not restart the ocf:linbit:drbd resources until I manually
> > load the DRBD kernel module, bring the DRBD devices up (drbdadm up
> > all), and clean up the resources (e.g. crm resource cleanup
> > ms_drbd_vmstore). Is it possible that the DRBD kernel module needs
> > to be loaded at boot time, independent of pacemaker?

> No, this is done by the drbd OCF script on start.

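Good to know - the next time this happens I will check whether the module
actually gets loaded when Pacemaker starts the resource, e.g. with
something like:

# lsmod | grep drbd
# cat /proc/drbd
# crm_resource --resource ms_drbd_vmstore --locate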

> >
> > Here's the new CIB (mostly the same as before):
> > http://pastebin.com/MxrqBXMp
> >
> > Typically quorumnode stays in the OFFLINE (standby) state, though
> > occasionally it changes to pending. I have just tried cleaning
> > /var/lib/heartbeat/crm on quorumnode again, so we will see if that
> > helps keep it in the OFFLINE (standby) state. I have it explicitly
> > set to standby in the CIB configuration and also created a rule to
> > prevent some of the resources from running on it:
> > node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
> > attributes standby="on"
> > ...

> The node should be in "ONLINE (standby)" state if you start heartbeat
> and pacemaker is enabled with "crm yes" or "crm respawn" in ha.cf

I have never seen it listed as ONLINE (standby). Here's the ha.cf on quorumnode:
autojoin none
mcast eth0 239.0.0.43 694 1 0
warntime 5
deadtime 15
initdead 60
keepalive 2
node node1
node node2
node quorumnode
crm respawn

And here's the ha.cf on node[12]:
autojoin none
mcast br0 239.0.0.43 694 1 0
bcast br1
warntime 5
deadtime 15
initdead 60
keepalive 2
node node1
node node2
node quorumnode
crm respawn
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

The only difference between these boxes is that quorumnode is a CentOS 5.5 box and is therefore stuck at heartbeat 3.0.3, whereas node[12] are both on Ubuntu 10.04 using the Ubuntu HA PPA and run heartbeat 3.0.5. Would this make a difference?
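
If it would help narrow this down, the next time quorumnode drops to
pending I can also grab heartbeat's own view of the membership on it,
roughly as follows (assuming cl_status behaves the same on the older
3.0.3 packages):

# cl_status hbstatus
# cl_status listnodes
# cl_status nodestatus node1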

> > location loc_not_on_quorumnode g_vm -inf: quorumnode
> >
> > Would it be wise to create additional constraints to prevent all
> > resources (including each ms_drbd resource) from running on it,
> > even though this should be implied by standby?

> There is no need for that. A node in standby will never run resources,
> and if there is no DRBD installed on that node your resources won't
> start anyway.

I've removed this constraint.
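
For reference, I dropped it with the crm shell, roughly:

# crm configure delete loc_not_on_quorumnode
# crm configure verify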

> >
> > Below is a portion of the log from when I started a node yet DRBD
> > failed to start. As you can see, the cluster thinks the DRBD device
> > is operating correctly, as it proceeds to starting subsequent
> > resources, e.g.
> > Apr 9 20:22:55 node1 Filesystem[2939]: [2956]: WARNING: Couldn't
> > find device [/dev/drbd0]. Expected /dev/??? to exist
> > http://pastebin.com/zTCHPtWy

> The only thing I can read from those log fragments is that probes are
> running ... not enough information. Really interesting would be the
> logs from the DC.

Here is the log from the DC for that same time period:
http://pastebin.com/d4PGGLPi

> >
> > After seeing these messages in the log I run
> > # service drbd start
> > # drbdadm up all
> > # crm resource cleanup ms_drbd_vmstore
> > # crm resource cleanup ms_drbd_mount1
> > # crm resource cleanup ms_drbd_mount2

> None of that should be needed ... what is the output of "crm_mon -1frA"
> before you do all those cleanups?

I will get this output the next time I can put the cluster in this state.
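
To make sure I capture everything in one go, I plan to grab roughly the
following before doing any cleanups (let me know if anything else would
help):

# crm_mon -1frA > crm_mon.txt
# cibadmin -Q > cib.xml
# crm_simulate -sL > simulate.txt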

> > After this sequence of commands the DRBD resources appear to be
> > functioning normally and the subsequent resources start. Any ideas
> > on why DRBD is not being started as expected, or why the cluster is
> > continuing to start resources that, according to the o_drbd-fs-vm
> > constraint, should not start until DRBD is master?

> No idea, maybe creating a crm_report archive and sending it to the
> list can shed some light on that problem.

> Regards,
> Andreas

> --
> Need help with Pacemaker?
> http://www.hastexo.com/now

Thanks,

Andrew

> >
> > Thanks,
> >
> > Andrew
> > ------------------------------------------------------------------------
> > *From: *"Andreas Kurz" <andreas at hastexo.com>
> > *To: *pacemaker at oss.clusterlabs.org
> > *Sent: *Monday, April 2, 2012 6:33:44 PM
> > *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> > master on failover
> >
> > On 04/02/2012 05:47 PM, Andrew Martin wrote:
> >> Hi Andreas,
> >>
> >> Here is the crm_report:
> >> http://dl.dropbox.com/u/2177298/pcmk-Mon-02-Apr-2012.bz2
> >
> > You tried to do some obfuscation on parts of that archive? ...
> > doesn't
> > really make it easier to debug ....
> >
> > Does the third node ever change its state?
> >
> > Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending
> >
> > Looking at the logs and the transition graph, it aborts due to
> > un-runnable operations on that node, which seems to be related to
> > its pending state.
> >
> > Try to get that node up (or down) completely ... maybe a fresh
> > start-over with a clean /var/lib/heartbeat/crm directory is
> > sufficient.
> >
> > Regards,
> > Andreas
> >
> >>
> >> Hi Emmanuel,
> >>
> >> Here is the configuration:
> >> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
> >> attributes standby="off"
> >> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
> >> attributes standby="off"
> >> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
> >> attributes standby="on"
> >> primitive p_drbd_mount2 ocf:linbit:drbd \
> >> params drbd_resource="mount2" \
> >> op start interval="0" timeout="240" \
> >> op stop interval="0" timeout="100" \
> >> op monitor interval="10" role="Master" timeout="20"
> >> start-delay="1m" \
> >> op monitor interval="20" role="Slave" timeout="20"
> >> start-delay="1m"
> >> primitive p_drbd_mount1 ocf:linbit:drbd \
> >> params drbd_resource="mount1" \
> >> op start interval="0" timeout="240" \
> >> op stop interval="0" timeout="100" \
> >> op monitor interval="10" role="Master" timeout="20"
> >> start-delay="1m" \
> >> op monitor interval="20" role="Slave" timeout="20"
> >> start-delay="1m"
> >> primitive p_drbd_vmstore ocf:linbit:drbd \
> >> params drbd_resource="vmstore" \
> >> op start interval="0" timeout="240" \
> >> op stop interval="0" timeout="100" \
> >> op monitor interval="10" role="Master" timeout="20"
> >> start-delay="1m" \
> >> op monitor interval="20" role="Slave" timeout="20"
> >> start-delay="1m"
> >> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> >> params device="/dev/drbd0" directory="/mnt/storage/vmstore"
> > fstype="ext4" \
> >> op start interval="0" timeout="60s" \
> >> op stop interval="0" timeout="60s" \
> >> op monitor interval="20s" timeout="40s"
> >> primitive p_libvirt-bin upstart:libvirt-bin \
> >> op monitor interval="30"
> >> primitive p_ping ocf:pacemaker:ping \
> >> params name="p_ping" host_list="192.168.3.1 192.168.3.2"
> > multiplier="1000" \
> >> op monitor interval="20s"
> >> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> >> params email="me at example.com" \
> >> params subject="Pacemaker Change" \
> >> op start interval="0" timeout="10" \
> >> op stop interval="0" timeout="10" \
> >> op monitor interval="10" timeout="10"
> >> primitive p_vm ocf:heartbeat:VirtualDomain \
> >> params config="/mnt/storage/vmstore/config/vm.xml" \
> >> meta allow-migrate="false" \
> >> op start interval="0" timeout="180" \
> >> op stop interval="0" timeout="180" \
> >> op monitor interval="10" timeout="30"
> >> primitive stonith-node1 stonith:external/tripplitepdu \
> >> params pdu_ipaddr="192.168.3.100" pdu_port="1" pdu_username="xxx"
> >> pdu_password="xxx" hostname_to_stonith="node1"
> >> primitive stonith-node2 stonith:external/tripplitepdu \
> >> params pdu_ipaddr="192.168.3.100" pdu_port="2" pdu_username="xxx"
> >> pdu_password="xxx" hostname_to_stonith="node2"
> >> group g_daemons p_libvirt-bin
> >> group g_vm p_fs_vmstore p_vm
> >> ms ms_drbd_mount2 p_drbd_mount2 \
> >> meta master-max="1" master-node-max="1" clone-max="2"
> >> clone-node-max="1"
> >> notify="true"
> >> ms ms_drbd_mount1 p_drbd_mount1 \
> >> meta master-max="1" master-node-max="1" clone-max="2"
> >> clone-node-max="1"
> >> notify="true"
> >> ms ms_drbd_vmstore p_drbd_vmstore \
> >> meta master-max="1" master-node-max="1" clone-max="2"
> >> clone-node-max="1"
> >> notify="true"
> >> clone cl_daemons g_daemons
> >> clone cl_ping p_ping \
> >> meta interleave="true"
> >> clone cl_sysadmin_notify p_sysadmin_notify \
> >> meta target-role="Started"
> >> location l-st-node1 stonith-node1 -inf: node1
> >> location l-st-node2 stonith-node2 -inf: node2
> >> location l_run_on_most_connected p_vm \
> >> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
> >> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> >> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> >> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote
> >> ms_drbd_mount1:promote
> >> ms_drbd_mount2:promote cl_daemons:start g_vm:start
> >> property $id="cib-bootstrap-options" \
> >> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> >> cluster-infrastructure="Heartbeat" \
> >> stonith-enabled="true" \
> >> no-quorum-policy="freeze" \
> >> last-lrm-refresh="1333041002" \
> >> cluster-recheck-interval="5m" \
> >> crmd-integration-timeout="3m" \
> >> shutdown-escalation="5m"
> >>
> >> Thanks,
> >>
> >> Andrew
> >>
> >>
> >> ------------------------------------------------------------------------
> >> *From: *"emmanuel segura" <emi2fast at gmail.com>
> >> *To: *"The Pacemaker cluster resource manager"
> >> <pacemaker at oss.clusterlabs.org>
> >> *Sent: *Monday, April 2, 2012 9:43:20 AM
> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources
> >> to
> >> master on failover
> >>
> >> Sorry Andrew
> >>
> >> Can you post me your crm configure show again?
> >>
> >> Thanks
> >>
> >> Il giorno 30 marzo 2012 18:53, Andrew Martin <amartin at xes-inc.com
> >> <mailto:amartin at xes-inc.com>> ha scritto:
> >>
> >> Hi Emmanuel,
> >>
> >> Thanks, that is a good idea. I updated the colocation contraint as
> >> you described. After, the cluster remains in this state (with the
> >> filesystem not mounted and the VM not started):
> >> Online: [ node2 node1 ]
> >>
> >> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> >> Masters: [ node1 ]
> >> Slaves: [ node2 ]
> >> Master/Slave Set: ms_drbd_tools [p_drbd_mount1]
> >> Masters: [ node1 ]
> >> Slaves: [ node2 ]
> >> Master/Slave Set: ms_drbd_crm [p_drbd_mount2]
> >> Masters: [ node1 ]
> >> Slaves: [ node2 ]
> >> Clone Set: cl_daemons [g_daemons]
> >> Started: [ node2 node1 ]
> >> Stopped: [ g_daemons:2 ]
> >> stonith-node1 (stonith:external/tripplitepdu): Started node2
> >> stonith-node2 (stonith:external/tripplitepdu): Started node1
> >>
> >> I noticed that Pacemaker had not issued "drbdadm connect" for any
> >> of
> >> the DRBD resources on node2
> >> # service drbd status
> >> drbd driver loaded OK; device status:
> >> version: 8.3.7 (api:88/proto:86-91)
> >> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by
> >> root at node2, 2012-02-02 12:29:26
> >> m:res cs ro ds p
> >> mounted fstype
> >> 0:vmstore StandAlone Secondary/Unknown Outdated/DUnknown r----
> >> 1:mount1 StandAlone Secondary/Unknown Outdated/DUnknown r----
> >> 2:mount2 StandAlone Secondary/Unknown Outdated/DUnknown r----
> >> # drbdadm cstate all
> >> StandAlone
> >> StandAlone
> >> StandAlone
> >>
> >> After manually issuing "drbdadm connect all" on node2 the rest of
> >> the resources eventually started (several minutes later) on node1:
> >> Online: [ node2 node1 ]
> >>
> >> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> >> Masters: [ node1 ]
> >> Slaves: [ node2 ]
> >> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> >> Masters: [ node1 ]
> >> Slaves: [ node2 ]
> >> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >> Masters: [ node1 ]
> >> Slaves: [ node2 ]
> >> Resource Group: g_vm
> >> p_fs_vmstore (ocf::heartbeat:Filesystem): Started node1
> >> p_vm (ocf::heartbeat:VirtualDomain): Started node1
> >> Clone Set: cl_daemons [g_daemons]
> >> Started: [ node2 node1 ]
> >> Stopped: [ g_daemons:2 ]
> >> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> >> Started: [ node2 node1 ]
> >> Stopped: [ p_sysadmin_notify:2 ]
> >> stonith-node1 (stonith:external/tripplitepdu): Started node2
> >> stonith-node2 (stonith:external/tripplitepdu): Started node1
> >> Clone Set: cl_ping [p_ping]
> >> Started: [ node2 node1 ]
> >> Stopped: [ p_ping:2 ]
> >>
> >> The DRBD devices on node1 were all UpToDate, so it doesn't seem
> >> right that it would need to wait for node2 to be connected before
> >> it
> >> could continue promoting additional resources. I then restarted
> >> heartbeat on node2 to see if it would automatically connect the
> >> DRBD
> >> devices this time. After restarting it, the DRBD devices are not
> >> even configured:
> >> # service drbd status
> >> drbd driver loaded OK; device status:
> >> version: 8.3.7 (api:88/proto:86-91)
> >> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by
> >> root at webapps2host, 2012-02-02 12:29:26
> >> m:res cs ro ds p mounted fstype
> >> 0:vmstore Unconfigured
> >> 1:mount1 Unconfigured
> >> 2:mount2 Unconfigured
> >>
> >> Looking at the log I found this part about the drbd primitives:
> >> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[2] on
> >> p_drbd_vmstore:1 for client 10705: pid 11065 exited with return
> >> code 7
> >> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
> >> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,
> >> confirmed=true) not running
> >> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[4] on
> >> p_drbd_mount2:1 for client 10705: pid 11069 exited with return
> >> code 7
> >> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
> >> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=12,
> >> confirmed=true) not running
> >> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[3] on
> >> p_drbd_mount1:1 for client 10705: pid 11066 exited with return
> >> code 7
> >> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
> >> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=13,
> >> confirmed=true) not running
> >>
> >> I am not sure what exit code 7 is - is it possible to manually run
> >> the monitor code or somehow obtain more debug about this? Here is
> >> the complete log after restarting heartbeat on node2:
> >> http://pastebin.com/KsHKi3GW
> >>
> >> Thanks,
> >>
> >> Andrew
> >>
> >>
> > ------------------------------------------------------------------------
> >> *From: *"emmanuel segura" <emi2fast at gmail.com
> >> <mailto:emi2fast at gmail.com>>
> >> *To: *"The Pacemaker cluster resource manager"
> >> <pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>>
> >> *Sent: *Friday, March 30, 2012 10:26:48 AM
> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources
> >> to
> >> master on failover
> >>
> >> I think this constraint is wrong
> >> ==================================================
> >> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> >> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> >> ===================================================
> >>
> >> change to
> >> ======================================================
> >> colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master
> >> ms_drbd_mount1:Master ms_drbd_mount2:Master
> >> =======================================================
> >>
> >> Il giorno 30 marzo 2012 17:16, Andrew Martin <amartin at xes-inc.com
> >> <mailto:amartin at xes-inc.com>> ha scritto:
> >>
> >> Hi Emmanuel,
> >>
> >> Here is the output of crm configure show:
> >> http://pastebin.com/NA1fZ8dL
> >>
> >> Thanks,
> >>
> >> Andrew
> >>
> >>
> > ------------------------------------------------------------------------
> >> *From: *"emmanuel segura" <emi2fast at gmail.com
> >> <mailto:emi2fast at gmail.com>>
> >> *To: *"The Pacemaker cluster resource manager"
> >> <pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>>
> >> *Sent: *Friday, March 30, 2012 9:43:45 AM
> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources
> >> to master on failover
> >>
> >> can you show me?
> >>
> >> crm configure show
> >>
> >> Il giorno 30 marzo 2012 16:10, Andrew Martin
> >> <amartin at xes-inc.com <mailto:amartin at xes-inc.com>> ha scritto:
> >>
> >> Hi Andreas,
> >>
> >> Here is a copy of my complete CIB:
> >> http://pastebin.com/v5wHVFuy
> >>
> >> I'll work on generating a report using crm_report as well.
> >>
> >> Thanks,
> >>
> >> Andrew
> >>
> >>
> > ------------------------------------------------------------------------
> >> *From: *"Andreas Kurz" <andreas at hastexo.com
> >> <mailto:andreas at hastexo.com>>
> >> *To: *pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>
> >> *Sent: *Friday, March 30, 2012 4:41:16 AM
> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
> >> resources to master on failover
> >>
> >> On 03/28/2012 04:56 PM, Andrew Martin wrote:
> >> > Hi Andreas,
> >> >
> >> > I disabled the DRBD init script and then restarted the
> >> slave node
> >> > (node2). After it came back up, DRBD did not start:
> >> > Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):
> >> pending
> >> > Online: [ node2 node1 ]
> >> >
> >> > Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> >> > Masters: [ node1 ]
> >> > Stopped: [ p_drbd_vmstore:1 ]
> >> > Master/Slave Set: ms_drbd_mount1 [p_drbd_tools]
> >> > Masters: [ node1 ]
> >> > Stopped: [ p_drbd_mount1:1 ]
> >> > Master/Slave Set: ms_drbd_mount2 [p_drbdmount2]
> >> > Masters: [ node1 ]
> >> > Stopped: [ p_drbd_mount2:1 ]
> >> > ...
> >> >
> >> > root at node2:~# service drbd status
> >> > drbd not loaded
> >>
> >> Yes, expected unless Pacemaker starts DRBD
> >>
> >> >
> >> > Is there something else I need to change in the CIB to
> >> ensure that DRBD
> >> > is started? All of my DRBD devices are configured like this:
> >> > primitive p_drbd_mount2 ocf:linbit:drbd \
> >> > params drbd_resource="mount2" \
> >> > op monitor interval="15" role="Master" \
> >> > op monitor interval="30" role="Slave"
> >> > ms ms_drbd_mount2 p_drbd_mount2 \
> >> > meta master-max="1" master-node-max="1"
> > clone-max="2"
> >> > clone-node-max="1" notify="true"
> >>
> >> That should be enough ... unable to say more without seeing the
> >> complete configuration ... too many fragments of information ;-)
> >>
> >> Please provide (e.g. pastebin) your complete cib (cibadmin
> >> -Q) when
> >> cluster is in that state ... or even better create a
> >> crm_report archive
> >>
> >> >
> >> > Here is the output from the syslog (grep -i drbd
> >> /var/log/syslog):
> >> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op:
> >> Performing
> >> > key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> >> > op=p_drbd_vmstore:1_monitor_0 )
> >> > Mar 28 09:24:47 node2 lrmd: [3210]: info:
> >> rsc:p_drbd_vmstore:1 probe[2]
> >> > (pid 3455)
> >> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op:
> >> Performing
> >> > key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> >> > op=p_drbd_mount1:1_monitor_0 )
> >> > Mar 28 09:24:48 node2 lrmd: [3210]: info:
> >> rsc:p_drbd_mount1:1 probe[3]
> >> > (pid 3456)
> >> > Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op:
> >> Performing
> >> > key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
> >> > op=p_drbd_mount2:1_monitor_0 )
> >> > Mar 28 09:24:48 node2 lrmd: [3210]: info:
> >> rsc:p_drbd_mount2:1 probe[4]
> >> > (pid 3457)
> >> > Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING:
> >> Couldn't find
> >> > device [/dev/drbd0]. Expected /dev/??? to exist
> >> > Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:
> >> > crm_attribute -N node2 -n master-p_drbd_mount2:1 -l
> > reboot -D
> >> > Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:
> >> > crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l
> > reboot -D
> >> > Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:
> >> > crm_attribute -N node2 -n master-p_drbd_mount1:1 -l
> > reboot -D
> >> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation
> >> monitor[4] on
> >> > p_drbd_mount2:1 for client 3213: pid 3457 exited with
> >> return code 7
> >> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation
> >> monitor[2] on
> >> > p_drbd_vmstore:1 for client 3213: pid 3455 exited with
> >> return code 7
> >> > Mar 28 09:24:48 node2 crmd: [3213]: info:
> >> process_lrm_event: LRM
> >> > operation p_drbd_mount2:1_monitor_0 (call=4, rc=7,
> >> cib-update=10,
> >> > confirmed=true) not running
> >> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation
> >> monitor[3] on
> >> > p_drbd_mount1:1 for client 3213: pid 3456 exited with
> >> return code 7
> >> > Mar 28 09:24:48 node2 crmd: [3213]: info:
> >> process_lrm_event: LRM
> >> > operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7,
> >> cib-update=11,
> >> > confirmed=true) not running
> >> > Mar 28 09:24:48 node2 crmd: [3213]: info:
> >> process_lrm_event: LRM
> >> > operation p_drbd_mount1:1_monitor_0 (call=3, rc=7,
> >> cib-update=12,
> >> > confirmed=true) not running
> >>
> >> No errors, just probing ... so for some reason Pacemaker does not
> >> want to start it ... use crm_simulate to find out why ... or
> >> provide information as requested above.
> >>
> >> Regards,
> >> Andreas
> >>
> >> --
> >> Need help with Pacemaker?
> >> http://www.hastexo.com/now
> >>
> >> >
> >> > Thanks,
> >> >
> >> > Andrew
> >> >
> >> >
> >>
> > ------------------------------------------------------------------------
> >> > *From: *"Andreas Kurz" <andreas at hastexo.com
> >> <mailto:andreas at hastexo.com>>
> >> > *To: *pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>
> >> > *Sent: *Wednesday, March 28, 2012 9:03:06 AM
> >> > *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
> >> resources to
> >> > master on failover
> >> >
> >> > On 03/28/2012 03:47 PM, Andrew Martin wrote:
> >> >> Hi Andreas,
> >> >>
> >> >>> hmm ... what is that fence-peer script doing? If you
> >> want to use
> >> >>> resource-level fencing with the help of dopd, activate the
> >> >>> drbd-peer-outdater script in the line above ... and
> >> double check if the
> >> >>> path is correct
> >> >> fence-peer is just a wrapper for drbd-peer-outdater that
> >> does some
> >> >> additional logging. In my testing dopd has been working
> > well.
> >> >
> >> > I see
> >> >
> >> >>
> >> >>>> I am thinking of making the following changes to the
> >> CIB (as per the
> >> >>>> official DRBD
> >> >>>> guide
> >> >>
> >> >
> >>
> > http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
> >> in
> >> >>>> order to add the DRBD lsb service and require that it
> >> start before the
> >> >>>> ocf:linbit:drbd resources. Does this look correct?
> >> >>>
> >> >>> Where did you read that? No, deactivate the startup of
> >> DRBD on system
> >> >>> boot and let Pacemaker manage it completely.
> >> >>>
> >> >>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
> >> >>>> colocation c_drbd_together inf:
> >> >>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
> >> >>>> ms_drbd_mount2:Master
> >> >>>> order drbd_init_first inf: ms_drbd_vmstore:promote
> >> >>>> ms_drbd_mount1:promote ms_drbd_mount2:promote
> >> p_drbd-init:start
> >> >>>>
> >> >>>> This doesn't seem to require that drbd be also running
> >> on the node where
> >> >>>> the ocf:linbit:drbd resources are slave (which it would
> >> need to do to be
> >> >>>> a DRBD SyncTarget) - how can I ensure that drbd is
> >> running everywhere?
> >> >>>> (clone cl_drbd p_drbd-init ?)
> >> >>>
> >> >>> This is really not needed.
> >> >> I was following the official DRBD Users Guide:
> >> >>
> >>
> > http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
> >> >>
> >> >> If I am understanding your previous message correctly, I
> >> do not need to
> >> >> add a lsb primitive for the drbd daemon? It will be
> >> >> started/stopped/managed automatically by my
> >> ocf:linbit:drbd resources
> >> >> (and I can remove the /etc/rc* symlinks)?
> >> >
> >> > Yes, you don't need that LSB script when using Pacemaker
> >> and should not
> >> > let init start it.
> >> >
> >> > Regards,
> >> > Andreas
> >> >
> >> > --
> >> > Need help with Pacemaker?
> >> > http://www.hastexo.com/now
> >> >
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Andrew
> >> >>
> >> >>
> >>
> > ------------------------------------------------------------------------
> >> >> *From: *"Andreas Kurz" <andreas at hastexo.com
> >> <mailto:andreas at hastexo.com> <mailto:andreas at hastexo.com
> >> <mailto:andreas at hastexo.com>>>
> >> >> *To: *pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>
> >> <mailto:pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>>
> >> >> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
> >> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
> >> resources to
> >> >> master on failover
> >> >>
> >> >> On 03/28/2012 12:13 AM, Andrew Martin wrote:
> >> >>> Hi Andreas,
> >> >>>
> >> >>> Thanks, I've updated the colocation rule to be in the
> >> correct order. I
> >> >>> also enabled the STONITH resource (this was temporarily
> >> disabled before
> >> >>> for some additional testing). DRBD has its own network
> >> connection over
> >> >>> the br1 interface (192.168.5.0/24
> >> <http://192.168.5.0/24> network), a direct crossover cable
> >> >>> between node1 and node2:
> >> >>> global { usage-count no; }
> >> >>> common {
> >> >>> syncer { rate 110M; }
> >> >>> }
> >> >>> resource vmstore {
> >> >>> protocol C;
> >> >>> startup {
> >> >>> wfc-timeout 15;
> >> >>> degr-wfc-timeout 60;
> >> >>> }
> >> >>> handlers {
> >> >>> #fence-peer
> >> "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
> >> >>> fence-peer "/usr/local/bin/fence-peer";
> >> >>
> >> >> hmm ... what is that fence-peer script doing? If you want
> >> to use
> >> >> resource-level fencing with the help of dopd, activate the
> >> >> drbd-peer-outdater script in the line above ... and
> >> double check if the
> >> >> path is correct
> >> >>
> >> >>> split-brain
> >> "/usr/lib/drbd/notify-split-brain.sh
> >> >>> me at example.com <mailto:me at example.com>
> >> <mailto:me at example.com <mailto:me at example.com>>";
> >> >>> }
> >> >>> net {
> >> >>> after-sb-0pri discard-zero-changes;
> >> >>> after-sb-1pri discard-secondary;
> >> >>> after-sb-2pri disconnect;
> >> >>> cram-hmac-alg md5;
> >> >>> shared-secret "xxxxx";
> >> >>> }
> >> >>> disk {
> >> >>> fencing resource-only;
> >> >>> }
> >> >>> on node1 {
> >> >>> device /dev/drbd0;
> >> >>> disk /dev/sdb1;
> >> >>> address 192.168.5.10:7787
> >> <http://192.168.5.10:7787>;
> >> >>> meta-disk internal;
> >> >>> }
> >> >>> on node2 {
> >> >>> device /dev/drbd0;
> >> >>> disk /dev/sdf1;
> >> >>> address 192.168.5.11:7787
> >> <http://192.168.5.11:7787>;
> >> >>> meta-disk internal;
> >> >>> }
> >> >>> }
> >> >>> # and similar for mount1 and mount2
> >> >>>
> >> >>> Also, here is my ha.cf <http://ha.cf>. It uses both the
> >> direct link between the nodes
> >> >>> (br1) and the shared LAN network on br0 for communicating:
> >> >>> autojoin none
> >> >>> mcast br0 239.0.0.43 694 1 0
> >> >>> bcast br1
> >> >>> warntime 5
> >> >>> deadtime 15
> >> >>> initdead 60
> >> >>> keepalive 2
> >> >>> node node1
> >> >>> node node2
> >> >>> node quorumnode
> >> >>> crm respawn
> >> >>> respawn hacluster /usr/lib/heartbeat/dopd
> >> >>> apiauth dopd gid=haclient uid=hacluster
> >> >>>
> >> >>> I am thinking of making the following changes to the CIB
> >> (as per the
> >> >>> official DRBD
> >> >>> guide
> >> >>
> >> >
> >>
> > http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
> >> in
> >> >>> order to add the DRBD lsb service and require that it
> >> start before the
> >> >>> ocf:linbit:drbd resources. Does this look correct?
> >> >>
> >> >> Where did you read that? No, deactivate the startup of
> >> DRBD on system
> >> >> boot and let Pacemaker manage it completely.
> >> >>
> >> >>> primitive p_drbd-init lsb:drbd op monitor interval="30"
> >> >>> colocation c_drbd_together inf:
> >> >>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
> >> >>> ms_drbd_mount2:Master
> >> >>> order drbd_init_first inf: ms_drbd_vmstore:promote
> >> >>> ms_drbd_mount1:promote ms_drbd_mount2:promote
> >> p_drbd-init:start
> >> >>>
> >> >>> This doesn't seem to require that drbd be also running
> >> on the node where
> >> >>> the ocf:linbit:drbd resources are slave (which it would
> >> need to do to be
> >> >>> a DRBD SyncTarget) - how can I ensure that drbd is
> >> running everywhere?
> >> >>> (clone cl_drbd p_drbd-init ?)
> >> >>
> >> >> This is really not needed.
> >> >>
> >> >> Regards,
> >> >> Andreas
> >> >>
> >> >> --
> >> >> Need help with Pacemaker?
> >> >> http://www.hastexo.com/now
> >> >>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>> Andrew
> >> >>>
> >>
> > ------------------------------------------------------------------------
> >> >>> *From: *"Andreas Kurz" <andreas at hastexo.com
> >> <mailto:andreas at hastexo.com> <mailto:andreas at hastexo.com
> >> <mailto:andreas at hastexo.com>>>
> >> >>> *To: *pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>
> >> > <mailto:*pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>>
> >> >>> *Sent: *Monday, March 26, 2012 5:56:22 PM
> >> >>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
> >> resources to
> >> >>> master on failover
> >> >>>
> >> >>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
> >> >>>> Hi Andreas,
> >> >>>>
> >> >>>> My complete cluster configuration is as follows:
> >> >>>> ============
> >> >>>> Last updated: Sat Mar 24 13:51:55 2012
> >> >>>> Last change: Sat Mar 24 13:41:55 2012
> >> >>>> Stack: Heartbeat
> >> >>>> Current DC: node2
> >> (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
> >> >>>> with quorum
> >> >>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> >> >>>> 3 Nodes configured, unknown expected votes
> >> >>>> 19 Resources configured.
> >> >>>> ============
> >> >>>>
> >> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):
> >> OFFLINE
> >> > (standby)
> >> >>>> Online: [ node2 node1 ]
> >> >>>>
> >> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> >> >>>> Masters: [ node2 ]
> >> >>>> Slaves: [ node1 ]
> >> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> >> >>>> Masters: [ node2 ]
> >> >>>> Slaves: [ node1 ]
> >> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >> >>>> Masters: [ node2 ]
> >> >>>> Slaves: [ node1 ]
> >> >>>> Resource Group: g_vm
> >> >>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started
> > node2
> >> >>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
> >> >>>> Clone Set: cl_daemons [g_daemons]
> >> >>>> Started: [ node2 node1 ]
> >> >>>> Stopped: [ g_daemons:2 ]
> >> >>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> >> >>>> Started: [ node2 node1 ]
> >> >>>> Stopped: [ p_sysadmin_notify:2 ]
> >> >>>> stonith-node1(stonith:external/tripplitepdu):Started
> > node2
> >> >>>> stonith-node2(stonith:external/tripplitepdu):Started
> > node1
> >> >>>> Clone Set: cl_ping [p_ping]
> >> >>>> Started: [ node2 node1 ]
> >> >>>> Stopped: [ p_ping:2 ]
> >> >>>>
> >> >>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
> >> >>>> attributes standby="off"
> >> >>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
> >> >>>> attributes standby="off"
> >> >>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4"
> >> quorumnode \
> >> >>>> attributes standby="on"
> >> >>>> primitive p_drbd_mount2 ocf:linbit:drbd \
> >> >>>> params drbd_resource="mount2" \
> >> >>>> op monitor interval="15" role="Master" \
> >> >>>> op monitor interval="30" role="Slave"
> >> >>>> primitive p_drbd_mount1 ocf:linbit:drbd \
> >> >>>> params drbd_resource="mount1" \
> >> >>>> op monitor interval="15" role="Master" \
> >> >>>> op monitor interval="30" role="Slave"
> >> >>>> primitive p_drbd_vmstore ocf:linbit:drbd \
> >> >>>> params drbd_resource="vmstore" \
> >> >>>> op monitor interval="15" role="Master" \
> >> >>>> op monitor interval="30" role="Slave"
> >> >>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> >> >>>> params device="/dev/drbd0" directory="/vmstore"
> >> fstype="ext4" \
> >> >>>> op start interval="0" timeout="60s" \
> >> >>>> op stop interval="0" timeout="60s" \
> >> >>>> op monitor interval="20s" timeout="40s"
> >> >>>> primitive p_libvirt-bin upstart:libvirt-bin \
> >> >>>> op monitor interval="30"
> >> >>>> primitive p_ping ocf:pacemaker:ping \
> >> >>>> params name="p_ping" host_list="192.168.1.10
> >> 192.168.1.11"
> >> >>>> multiplier="1000" \
> >> >>>> op monitor interval="20s"
> >> >>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> >> >>>> params email="me at example.com
> >> <mailto:me at example.com> <mailto:me at example.com
> >> <mailto:me at example.com>>" \
> >> >>>> params subject="Pacemaker Change" \
> >> >>>> op start interval="0" timeout="10" \
> >> >>>> op stop interval="0" timeout="10" \
> >> >>>> op monitor interval="10" timeout="10"
> >> >>>> primitive p_vm ocf:heartbeat:VirtualDomain \
> >> >>>> params config="/vmstore/config/vm.xml" \
> >> >>>> meta allow-migrate="false" \
> >> >>>> op start interval="0" timeout="120s" \
> >> >>>> op stop interval="0" timeout="120s" \
> >> >>>> op monitor interval="10" timeout="30"
> >> >>>> primitive stonith-node1 stonith:external/tripplitepdu \
> >> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="1"
> >> pdu_username="xxx"
> >> >>>> pdu_password="xxx" hostname_to_stonith="node1"
> >> >>>> primitive stonith-node2 stonith:external/tripplitepdu \
> >> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="2"
> >> pdu_username="xxx"
> >> >>>> pdu_password="xxx" hostname_to_stonith="node2"
> >> >>>> group g_daemons p_libvirt-bin
> >> >>>> group g_vm p_fs_vmstore p_vm
> >> >>>> ms ms_drbd_mount2 p_drbd_mount2 \
> >> >>>> meta master-max="1" master-node-max="1"
> >> clone-max="2"
> >> >>>> clone-node-max="1" notify="true"
> >> >>>> ms ms_drbd_mount1 p_drbd_mount1 \
> >> >>>> meta master-max="1" master-node-max="1"
> >> clone-max="2"
> >> >>>> clone-node-max="1" notify="true"
> >> >>>> ms ms_drbd_vmstore p_drbd_vmstore \
> >> >>>> meta master-max="1" master-node-max="1"
> >> clone-max="2"
> >> >>>> clone-node-max="1" notify="true"
> >> >>>> clone cl_daemons g_daemons
> >> >>>> clone cl_ping p_ping \
> >> >>>> meta interleave="true"
> >> >>>> clone cl_sysadmin_notify p_sysadmin_notify
> >> >>>> location l-st-node1 stonith-node1 -inf: node1
> >> >>>> location l-st-node2 stonith-node2 -inf: node2
> >> >>>> location l_run_on_most_connected p_vm \
> >> >>>> rule $id="l_run_on_most_connected-rule" p_ping:
> >> defined p_ping
> >> >>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> >> >>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> >> >>>
> >> >>> As Emmanuel already said, g_vm has to come first in this
> >> >>> colocation constraint .... g_vm must be colocated with the
> >> >>> drbd masters.
> >> >>>
> >> >>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote
> >> ms_drbd_mount1:promote
> >> >>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
> >> >>>> property $id="cib-bootstrap-options" \
> >> >>>>
> >> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> >> >>>> cluster-infrastructure="Heartbeat" \
> >> >>>> stonith-enabled="false" \
> >> >>>> no-quorum-policy="stop" \
> >> >>>> last-lrm-refresh="1332539900" \
> >> >>>> cluster-recheck-interval="5m" \
> >> >>>> crmd-integration-timeout="3m" \
> >> >>>> shutdown-escalation="5m"
> >> >>>>
> >> >>>> The STONITH plugin is a custom plugin I wrote for the
> >> Tripp-Lite
> >> >>>> PDUMH20ATNET that I'm using as the STONITH device:
> >> >>>>
> >>
> > http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
> >> >>>
> >> >>> And why aren't you using it? .... stonith-enabled="false"
> >> >>>
> >> >>>>
> >> >>>> As you can see, I left the DRBD service to be started
> >> by the operating
> >> >>>> system (as an lsb script at boot time) however
> >> Pacemaker controls
> >> >>>> actually bringing up/taking down the individual DRBD
> >> devices.
> >> >>>
> >> >>> Don't start drbd on system boot, give Pacemaker the full
> >> control.
> >> >>>
> >> >>> The
> >> >>>> behavior I observe is as follows: I issue "crm resource
> >> migrate p_vm" on
> >> >>>> node1 and failover successfully to node2. During this
> >> time, node2 fences
> >> >>>> node1's DRBD devices (using dopd) and marks them as
> >> Outdated. Meanwhile
> >> >>>> node2's DRBD devices are UpToDate. I then shutdown both
> >> nodes and then
> >> >>>> bring them back up. They reconnect to the cluster (with
> >> quorum), and
> >> >>>> node1's DRBD devices are still Outdated as expected and
> >> node2's DRBD
> >> >>>> devices are still UpToDate, as expected. At this point,
> >> DRBD starts on
> >> >>>> both nodes, however node2 will not set DRBD as master:
> >> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):
> >> OFFLINE
> >> > (standby)
> >> >>>> Online: [ node2 node1 ]
> >> >>>>
> >> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> >> >>>> Slaves: [ node1 node2 ]
> >> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> >> >>>> Slaves: [ node1 node2 ]
> >> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >> >>>> Slaves: [ node1 node2 ]
> >> >>>
> >> >>> There should really be no interruption of the drbd
> >> replication on vm
> >> >>> migration that activates the dopd ... drbd has its own
> >> direct network
> >> >>> connection?
> >> >>>
> >> >>> Please share your ha.cf <http://ha.cf> file and your
> >> drbd configuration. Watch out for
> >> >>> drbd messages in your kernel log file, that should give
> >> you additional
> >> >>> information when/why the drbd connection was lost.
> >> >>>
> >> >>> Regards,
> >> >>> Andreas
> >> >>>
> >> >>> --
> >> >>> Need help with Pacemaker?
> >> >>> http://www.hastexo.com/now
> >> >>>
> >> >>>>
> >> >>>> I am having trouble sorting through the logging
> >> information because
> >> >>>> there is so much of it in /var/log/daemon.log, but I
> >> can't find an
> >> >>>> error message printed about why it will not promote
> >> node2. At this point
> >> >>>> the DRBD devices are as follows:
> >> >>>> node2: cstate = WFConnection dstate=UpToDate
> >> >>>> node1: cstate = StandAlone dstate=Outdated
> >> >>>>
> >> >>>> I don't see any reason why node2 can't become DRBD
> >> master, or am I
> >> >>>> missing something? If I do "drbdadm connect all" on
> >> node1, then the
> >> >>>> cstate on both nodes changes to "Connected" and node2
> >> immediately
> >> >>>> promotes the DRBD resources to master. Any ideas on why
> >> I'm observing
> >> >>>> this incorrect behavior?
> >> >>>>
> >> >>>> Any tips on how I can better filter through the
> >> pacemaker/heartbeat logs
> >> >>>> or how to get additional useful debug information?
> >> >>>>
> >> >>>> Thanks,
> >> >>>>
> >> >>>> Andrew
> >> >>>>
> >> >>>>
> >>
> > ------------------------------------------------------------------------
> >> >>>> *From: *"Andreas Kurz" <andreas at hastexo.com
> >> <mailto:andreas at hastexo.com>
> >> > <mailto:andreas at hastexo.com <mailto:andreas at hastexo.com>>>
> >> >>>> *To: *pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>
> >> >> <mailto:*pacemaker at oss.clusterlabs.org
> >> <mailto:pacemaker at oss.clusterlabs.org>>
> >> >>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
> >> >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
> >> resources to
> >> >>>> master on failover
> >> >>>>
> >> >>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
> >> >>>>> Hello,
> >> >>>>>
> >> >>>>> Recently I finished configuring a two-node cluster
> >> with pacemaker 1.1.6
> >> >>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04.
> >> This cluster
> >> > includes
> >> >>>>> the following resources:
> >> >>>>> - primitives for DRBD storage devices
> >> >>>>> - primitives for mounting the filesystem on the DRBD
> >> storage
> >> >>>>> - primitives for some mount binds
> >> >>>>> - primitive for starting apache
> >> >>>>> - primitives for starting samba and nfs servers
> >> (following instructions
> >> >>>>> here
> >> <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
> >> >>>>> - primitives for exporting nfs shares
> >> (ocf:heartbeat:exportfs)
> >> >>>>
> >> >>>> not enough information ... please share at least your
> >> complete cluster
> >> >>>> configuration
> >> >>>>
> >> >>>> Regards,
> >> >>>> Andreas
> >> >>>>
> >> >>>> --
> >> >>>> Need help with Pacemaker?
> >> >>>> http://www.hastexo.com/now
> >> >>>>
> >> >>>>>
> >> >>>>> Perhaps this is best described through the output of
> >> crm_mon:
> >> >>>>> Online: [ node1 node2 ]
> >> >>>>>
> >> >>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> >> (unmanaged)
> >> >>>>> p_drbd_mount1:0 (ocf::linbit:drbd):
> >> Started node2
> >> >>> (unmanaged)
> >> >>>>> p_drbd_mount1:1 (ocf::linbit:drbd):
> >> Started node1
> >> >>>>> (unmanaged) FAILED
> >> >>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >> >>>>> p_drbd_mount2:0 (ocf::linbit:drbd):
> >> Master node1
> >> >>>>> (unmanaged) FAILED
> >> >>>>> Slaves: [ node2 ]
> >> >>>>> Resource Group: g_core
> >> >>>>> p_fs_mount1 (ocf::heartbeat:Filesystem):
> >> Started node1
> >> >>>>> p_fs_mount2 (ocf::heartbeat:Filesystem):
> >> Started node1
> >> >>>>> p_ip_nfs (ocf::heartbeat:IPaddr2):
> >> Started node1
> >> >>>>> Resource Group: g_apache
> >> >>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem):
> >> Started node1
> >> >>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem):
> >> Started node1
> >> >>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem):
> >> Started node1
> >> >>>>> p_fs_varwww (ocf::heartbeat:Filesystem):
> >> Started node1
> >> >>>>> p_apache (ocf::heartbeat:apache):
> >> Started node1
> >> >>>>> Resource Group: g_fileservers
> >> >>>>> p_lsb_smb (lsb:smbd): Started node1
> >> >>>>> p_lsb_nmb (lsb:nmbd): Started node1
> >> >>>>> p_lsb_nfsserver (lsb:nfs-kernel-server):
> >> Started node1
> >> >>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs):
> >> Started node1
> >> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs):
> >> Started
> >> > node1
> >> >>>>>
> >> >>>>> I have read through the Pacemaker Explained
> >> >>>>>
> >> >>>>
> >> >>>
> >> >
> >>
> > <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
> >> >>>>> documentation, however could not find a way to further
> >> debug these
> >> >>>>> problems. First, I put node1 into standby mode to
> >> attempt failover to
> >> >>>>> the other node (node2). Node2 appeared to start the
> >> transition to
> >> >>>>> master, however it failed to promote the DRBD
> >> resources to master (the
> >> >>>>> first step). I have attached a copy of this session in
> >> commands.log and
> >> >>>>> additional excerpts from /var/log/syslog during
> >> important steps. I have
> >> >>>>> attempted everything I can think of to try and start
> >> the DRBD resource
> >> >>>>> (e.g. start/stop/promote/manage/cleanup under crm
> >> resource, restarting
> >> >>>>> heartbeat) but cannot bring it out of the slave state.
> >> However, if
> >> > I set
> >> >>>>> it to unmanaged and then run drbdadm primary all in
> >> the terminal,
> >> >>>>> pacemaker is satisfied and continues starting the rest
> >> of the
> >> > resources.
> >> >>>>> It then failed when attempting to mount the filesystem
> >> for mount2, the
> >> >>>>> p_fs_mount2 resource. I attempted to mount the
> >> filesystem myself
> >> > and was
> >> >>>>> successful. I then unmounted it and ran cleanup on
> >> p_fs_mount2 and then
> >> >>>>> it mounted. The rest of the resources started as
> >> expected until the
> >> >>>>> p_exportfs_mount2 resource, which failed as follows:
> >> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs):
> >> started node2
> >> >>>>> (unmanaged) FAILED
> >> >>>>>
> >> >>>>> I ran cleanup on this and it started, however when
> >> running this test
> >> >>>>> earlier today no command could successfully start this
> >> exportfs
> >> >> resource.
> >> >>>>>
> >> >>>>> How can I configure pacemaker to better resolve these
> >> problems and be
> >> >>>>> able to bring the node up successfully on its own?
> >> What can I check to
> >> >>>>> determine why these failures are occuring?
> >> /var/log/syslog did not seem
> >> >>>>> to contain very much useful information regarding why
> >> the failures
> >> >>>> occurred.
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>>
> >> >>>>> Andrew
> >> >>>>>

> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker

> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




More information about the Pacemaker mailing list