<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: Times New Roman; font-size: 12pt; color: #000000'><font size="3">Hi Andreas,</font><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">Yes, I attempted to generalize hostnames and usernames/passwords in the archive. Sorry for making it more confusing :( </div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">I completely purged pacemaker from all 3 nodes and reinstalled everything. I then completely rebuilt the CIB by manually adding each primitive/constraint one at a time and testing along the way. After doing this, DRBD appears to be working at least somewhat better: the ocf:linbit:drbd devices are started and managed by pacemaker. However, if, for example, a node is STONITHed, then when it comes back up it will not restart the ocf:linbit:drbd resources until I manually load the DRBD kernel module, bring the DRBD devices up (drbdadm up all), and clean up the resources (e.g. crm resource cleanup ms_drbd_vmstore). 
Is it possible that the DRBD kernel module needs to be loaded at boot time, independent of pacemaker?</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">Here's the new CIB (mostly the same as before):</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><a href="http://pastebin.com/MxrqBXMp">http://pastebin.com/MxrqBXMp</a></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><div style="font-size: medium; "><span style="font-size: 12pt; ">Typically quorumnode stays in the OFFLINE (standby) state, though occasionally it changes to pending. I have just tried cleaning </span>/var/lib/heartbeat/crm on quorumnode again, so we will see whether that helps keep it in the OFFLINE (standby) state. I have it explicitly set to standby in the CIB configuration and also created a rule to prevent some of the resources from running on it:</div><div style="font-size: medium; "><div><font face="'courier new', courier, monaco, monospace, sans-serif" size="2">node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \</font></div><div><font face="'courier new', courier, monaco, monospace, sans-serif" size="2"> attributes standby="on"</font></div></div><div style="font-size: medium; "><font face="'courier new', courier, monaco, monospace, sans-serif" size="2">...</font></div><div style="font-size: medium; "><font face="'courier new', courier, monaco, monospace, sans-serif" size="2">location loc_not_on_quorumnode g_vm -inf: quorumnode</font></div><div style="font-size: medium; "><br></div><div style="font-size: medium; ">Would it be wise to create additional constraints to prevent all resources (including each ms_drbd resource) from running on it, even though this should be implied by 
standby?</div></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">Below is a portion of the log from when I started a node, yet DRBD failed to start. As you can see, the cluster thinks the DRBD device is operating correctly, since it proceeds to start subsequent resources, e.g.</div><div>Apr 9 20:22:55 node1 Filesystem[2939]: [2956]: WARNING: Couldn't find device [/dev/drbd0]. Expected /dev/??? to exist</div><div><a href="http://pastebin.com/zTCHPtWy">http://pastebin.com/zTCHPtWy</a></div><div><br></div><div>After seeing these messages in the log, I run:</div><div><font face="'courier new', courier, monaco, monospace, sans-serif" size="2"># service drbd start</font></div><div><font face="'courier new', courier, monaco, monospace, sans-serif" size="2"># drbdadm up all</font></div><div><font face="'courier new', courier, monaco, monospace, sans-serif" size="2"># crm resource cleanup ms_drbd_vmstore</font></div><div><font face="'courier new', courier, monaco, monospace, sans-serif" size="2"># crm resource cleanup ms_drbd_mount1</font></div><div><font face="'courier new', courier, monaco, monospace, sans-serif" size="2"># crm resource cleanup ms_drbd_mount2</font><br>After this sequence of commands, the DRBD resources appear to be functioning normally and the subsequent resources start. 
Any ideas on why DRBD is not being started as expected, or why the cluster is continuing with starting resources that according to the o_drbd-fs-vm constraint should not start until DRBD is master?</div><div><br></div><div>Thanks,</div><div><br></div><div>Andrew<br><hr id="zwchr" style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><div style="color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica, Arial, sans-serif; font-size: 12pt; "><b>From: </b>"Andreas Kurz" <andreas@hastexo.com><br><b>To: </b>pacemaker@oss.clusterlabs.org<br><b>Sent: </b>Monday, April 2, 2012 6:33:44 PM<br><b>Subject: </b>Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover<br><br>On 04/02/2012 05:47 PM, Andrew Martin wrote:<br>> Hi Andreas,<br>> <br>> Here is the crm_report:<br>> http://dl.dropbox.com/u/2177298/pcmk-Mon-02-Apr-2012.bz2<br><br>You tried to do some obfuscation on parts of that archive? ... doesn't<br>really make it easier to debug ....<br><br>Does the third node ever change its state?<br><br>Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending<br><br>Looking at the logs and the transition graph says it aborts due to<br>un-runable operations on that node which seems to be related to it's<br>pending state.<br><br>Try to get that node up (or down) completely ... 
maybe a fresh<br>start-over with a clean /var/lib/heartbeat/crm directory is sufficient.<br><br>Regards,<br>Andreas<br><br>> <br>> Hi Emmanuel,<br>> <br>> Here is the configuration:<br>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \<br>> attributes standby="off"<br>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \<br>> attributes standby="off"<br>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \<br>> attributes standby="on"<br>> primitive p_drbd_mount2 ocf:linbit:drbd \<br>> params drbd_resource="mount2" \<br>> op start interval="0" timeout="240" \<br>> op stop interval="0" timeout="100" \<br>> op monitor interval="10" role="Master" timeout="20" start-delay="1m" \<br>> op monitor interval="20" role="Slave" timeout="20" start-delay="1m"<br>> primitive p_drbd_mount1 ocf:linbit:drbd \<br>> params drbd_resource="mount1" \<br>> op start interval="0" timeout="240" \<br>> op stop interval="0" timeout="100" \<br>> op monitor interval="10" role="Master" timeout="20" start-delay="1m" \<br>> op monitor interval="20" role="Slave" timeout="20" start-delay="1m"<br>> primitive p_drbd_vmstore ocf:linbit:drbd \<br>> params drbd_resource="vmstore" \<br>> op start interval="0" timeout="240" \<br>> op stop interval="0" timeout="100" \<br>> op monitor interval="10" role="Master" timeout="20" start-delay="1m" \<br>> op monitor interval="20" role="Slave" timeout="20" start-delay="1m"<br>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \<br>> params device="/dev/drbd0" directory="/mnt/storage/vmstore" fstype="ext4" \<br>> op start interval="0" timeout="60s" \<br>> op stop interval="0" timeout="60s" \<br>> op monitor interval="20s" timeout="40s"<br>> primitive p_libvirt-bin upstart:libvirt-bin \<br>> op monitor interval="30"<br>> primitive p_ping ocf:pacemaker:ping \<br>> params name="p_ping" host_list="192.168.3.1 192.168.3.2" multiplier="1000" \<br>> op monitor interval="20s"<br>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \<br>> params 
email="me@example.com" \<br>> params subject="Pacemaker Change" \<br>> op start interval="0" timeout="10" \<br>> op stop interval="0" timeout="10" \<br>> op monitor interval="10" timeout="10"<br>> primitive p_vm ocf:heartbeat:VirtualDomain \<br>> params config="/mnt/storage/vmstore/config/vm.xml" \<br>> meta allow-migrate="false" \<br>> op start interval="0" timeout="180" \<br>> op stop interval="0" timeout="180" \<br>> op monitor interval="10" timeout="30"<br>> primitive stonith-node1 stonith:external/tripplitepdu \<br>> params pdu_ipaddr="192.168.3.100" pdu_port="1" pdu_username="xxx"<br>> pdu_password="xxx" hostname_to_stonith="node1"<br>> primitive stonith-node2 stonith:external/tripplitepdu \<br>> params pdu_ipaddr="192.168.3.100" pdu_port="2" pdu_username="xxx"<br>> pdu_password="xxx" hostname_to_stonith="node2"<br>> group g_daemons p_libvirt-bin<br>> group g_vm p_fs_vmstore p_vm<br>> ms ms_drbd_mount2 p_drbd_mount2 \<br>> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"<br>> notify="true"<br>> ms ms_drbd_mount1 p_drbd_mount1 \<br>> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"<br>> notify="true"<br>> ms ms_drbd_vmstore p_drbd_vmstore \<br>> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"<br>> notify="true"<br>> clone cl_daemons g_daemons<br>> clone cl_ping p_ping \<br>> meta interleave="true"<br>> clone cl_sysadmin_notify p_sysadmin_notify \<br>> meta target-role="Started"<br>> location l-st-node1 stonith-node1 -inf: node1<br>> location l-st-node2 stonith-node2 -inf: node2<br>> location l_run_on_most_connected p_vm \<br>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping<br>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master<br>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm<br>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote<br>> ms_drbd_mount2:promote cl_daemons:start g_vm:start<br>> property $id="cib-bootstrap-options" \<br>> 
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \<br>> cluster-infrastructure="Heartbeat" \<br>> stonith-enabled="true" \<br>> no-quorum-policy="freeze" \<br>> last-lrm-refresh="1333041002" \<br>> cluster-recheck-interval="5m" \<br>> crmd-integration-timeout="3m" \<br>> shutdown-escalation="5m"<br>> <br>> Thanks,<br>> <br>> Andrew<br>> <br>> <br>> ------------------------------------------------------------------------<br>> *From: *"emmanuel segura" <emi2fast@gmail.com><br>> *To: *"The Pacemaker cluster resource manager"<br>> <pacemaker@oss.clusterlabs.org><br>> *Sent: *Monday, April 2, 2012 9:43:20 AM<br>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to<br>> master on failover<br>> <br>> Sorry Andrew<br>> <br>> Can you post me your crm configure show again?<br>> <br>> Thanks<br>> <br>> Il giorno 30 marzo 2012 18:53, Andrew Martin <amartin@xes-inc.com<br>> <mailto:amartin@xes-inc.com>> ha scritto:<br>> <br>> Hi Emmanuel,<br>> <br>> Thanks, that is a good idea. I updated the colocation contraint as<br>> you described. 
After, the cluster remains in this state (with the<br>> filesystem not mounted and the VM not started):<br>> Online: [ node2 node1 ]<br>> <br>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>> Masters: [ node1 ]<br>> Slaves: [ node2 ]<br>> Master/Slave Set: ms_drbd_tools [p_drbd_mount1]<br>> Masters: [ node1 ]<br>> Slaves: [ node2 ]<br>> Master/Slave Set: ms_drbd_crm [p_drbd_mount2]<br>> Masters: [ node1 ]<br>> Slaves: [ node2 ]<br>> Clone Set: cl_daemons [g_daemons]<br>> Started: [ node2 node1 ]<br>> Stopped: [ g_daemons:2 ]<br>> stonith-node1 (stonith:external/tripplitepdu): Started node2<br>> stonith-node2 (stonith:external/tripplitepdu): Started node1<br>> <br>> I noticed that Pacemaker had not issued "drbdadm connect" for any of<br>> the DRBD resources on node2<br>> # service drbd status<br>> drbd driver loaded OK; device status:<br>> version: 8.3.7 (api:88/proto:86-91)<br>> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by<br>> root@node2, 2012-02-02 12:29:26<br>> m:res cs ro ds p <br>> mounted fstype<br>> 0:vmstore StandAlone Secondary/Unknown Outdated/DUnknown r----<br>> 1:mount1 StandAlone Secondary/Unknown Outdated/DUnknown r----<br>> 2:mount2 StandAlone Secondary/Unknown Outdated/DUnknown r----<br>> # drbdadm cstate all<br>> StandAlone<br>> StandAlone<br>> StandAlone<br>> <br>> After manually issuing "drbdadm connect all" on node2 the rest of<br>> the resources eventually started (several minutes later) on node1:<br>> Online: [ node2 node1 ]<br>> <br>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>> Masters: [ node1 ]<br>> Slaves: [ node2 ]<br>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]<br>> Masters: [ node1 ]<br>> Slaves: [ node2 ]<br>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>> Masters: [ node1 ]<br>> Slaves: [ node2 ]<br>> Resource Group: g_vm<br>> p_fs_vmstore (ocf::heartbeat:Filesystem): Started node1<br>> p_vm (ocf::heartbeat:VirtualDomain): Started node1<br>> Clone Set: cl_daemons [g_daemons]<br>> 
Started: [ node2 node1 ]<br>> Stopped: [ g_daemons:2 ]<br>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]<br>> Started: [ node2 node1 ]<br>> Stopped: [ p_sysadmin_notify:2 ]<br>> stonith-node1 (stonith:external/tripplitepdu): Started node2<br>> stonith-node2 (stonith:external/tripplitepdu): Started node1<br>> Clone Set: cl_ping [p_ping]<br>> Started: [ node2 node1 ]<br>> Stopped: [ p_ping:2 ]<br>> <br>> The DRBD devices on node1 were all UpToDate, so it doesn't seem<br>> right that it would need to wait for node2 to be connected before it<br>> could continue promoting additional resources. I then restarted<br>> heartbeat on node2 to see if it would automatically connect the DRBD<br>> devices this time. After restarting it, the DRBD devices are not<br>> even configured:<br>> # service drbd status<br>> drbd driver loaded OK; device status:<br>> version: 8.3.7 (api:88/proto:86-91)<br>> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by<br>> root@webapps2host, 2012-02-02 12:29:26<br>> m:res cs ro ds p mounted fstype<br>> 0:vmstore Unconfigured<br>> 1:mount1 Unconfigured<br>> 2:mount2 Unconfigured<br>> <br>> Looking at the log I found this part about the drbd primitives:<br>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[2] on<br>> p_drbd_vmstore:1 for client 10705: pid 11065 exited with return code 7<br>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM<br>> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,<br>> confirmed=true) not running<br>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[4] on<br>> p_drbd_mount2:1 for client 10705: pid 11069 exited with return code 7<br>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM<br>> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=12,<br>> confirmed=true) not running<br>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[3] on<br>> p_drbd_mount1:1 for client 10705: pid 11066 exited with return code 7<br>> Mar 
30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM<br>> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=13,<br>> confirmed=true) not running<br>> <br>> I am not sure what exit code 7 is - is it possible to manually run<br>> the monitor code or somehow obtain more debug about this? Here is<br>> the complete log after restarting heartbeat on node2:<br>> http://pastebin.com/KsHKi3GW<br>> <br>> Thanks,<br>> <br>> Andrew<br>> <br>> ------------------------------------------------------------------------<br>> *From: *"emmanuel segura" <emi2fast@gmail.com<br>> <mailto:emi2fast@gmail.com>><br>> *To: *"The Pacemaker cluster resource manager"<br>> <pacemaker@oss.clusterlabs.org <mailto:pacemaker@oss.clusterlabs.org>><br>> *Sent: *Friday, March 30, 2012 10:26:48 AM<br>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to<br>> master on failover<br>> <br>> I think this constrain it's wrong<br>> ==================================================<br>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master<br>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm<br>> ===================================================<br>> <br>> change to<br>> ======================================================<br>> colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master<br>> ms_drbd_mount1:Master ms_drbd_mount2:Master<br>> =======================================================<br>> <br>> Il giorno 30 marzo 2012 17:16, Andrew Martin <amartin@xes-inc.com<br>> <mailto:amartin@xes-inc.com>> ha scritto:<br>> <br>> Hi Emmanuel,<br>> <br>> Here is the output of crm configure show:<br>> http://pastebin.com/NA1fZ8dL<br>> <br>> Thanks,<br>> <br>> Andrew<br>> <br>> ------------------------------------------------------------------------<br>> *From: *"emmanuel segura" <emi2fast@gmail.com<br>> <mailto:emi2fast@gmail.com>><br>> *To: *"The Pacemaker cluster resource manager"<br>> <pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org>><br>> 
*Sent: *Friday, March 30, 2012 9:43:45 AM<br>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources<br>> to master on failover<br>> <br>> can you show me?<br>> <br>> crm configure show<br>> <br>> Il giorno 30 marzo 2012 16:10, Andrew Martin<br>> <amartin@xes-inc.com <mailto:amartin@xes-inc.com>> ha scritto:<br>> <br>> Hi Andreas,<br>> <br>> Here is a copy of my complete CIB:<br>> http://pastebin.com/v5wHVFuy<br>> <br>> I'll work on generating a report using crm_report as well.<br>> <br>> Thanks,<br>> <br>> Andrew<br>> <br>> ------------------------------------------------------------------------<br>> *From: *"Andreas Kurz" <andreas@hastexo.com<br>> <mailto:andreas@hastexo.com>><br>> *To: *pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org><br>> *Sent: *Friday, March 30, 2012 4:41:16 AM<br>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD<br>> resources to master on failover<br>> <br>> On 03/28/2012 04:56 PM, Andrew Martin wrote:<br>> > Hi Andreas,<br>> ><br>> > I disabled the DRBD init script and then restarted the<br>> slave node<br>> > (node2). After it came back up, DRBD did not start:<br>> > Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):<br>> pending<br>> > Online: [ node2 node1 ]<br>> ><br>> > Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>> > Masters: [ node1 ]<br>> > Stopped: [ p_drbd_vmstore:1 ]<br>> > Master/Slave Set: ms_drbd_mount1 [p_drbd_tools]<br>> > Masters: [ node1 ]<br>> > Stopped: [ p_drbd_mount1:1 ]<br>> > Master/Slave Set: ms_drbd_mount2 [p_drbdmount2]<br>> > Masters: [ node1 ]<br>> > Stopped: [ p_drbd_mount2:1 ]<br>> > ...<br>> ><br>> > root@node2:~# service drbd status<br>> > drbd not loaded<br>> <br>> Yes, expected unless Pacemaker starts DRBD<br>> <br>> ><br>> > Is there something else I need to change in the CIB to<br>> ensure that DRBD<br>> > is started? 
All of my DRBD devices are configured like this:<br>> > primitive p_drbd_mount2 ocf:linbit:drbd \<br>> > params drbd_resource="mount2" \<br>> > op monitor interval="15" role="Master" \<br>> > op monitor interval="30" role="Slave"<br>> > ms ms_drbd_mount2 p_drbd_mount2 \<br>> > meta master-max="1" master-node-max="1" clone-max="2"<br>> > clone-node-max="1" notify="true"<br>> <br>> That should be enough ... unable to say more without seeing<br>> the complete<br>> configuration ... too much fragments of information ;-)<br>> <br>> Please provide (e.g. pastebin) your complete cib (cibadmin<br>> -Q) when<br>> cluster is in that state ... or even better create a<br>> crm_report archive<br>> <br>> ><br>> > Here is the output from the syslog (grep -i drbd<br>> /var/log/syslog):<br>> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op:<br>> Performing<br>> > key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc<br>> > op=p_drbd_vmstore:1_monitor_0 )<br>> > Mar 28 09:24:47 node2 lrmd: [3210]: info:<br>> rsc:p_drbd_vmstore:1 probe[2]<br>> > (pid 3455)<br>> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op:<br>> Performing<br>> > key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc<br>> > op=p_drbd_mount1:1_monitor_0 )<br>> > Mar 28 09:24:48 node2 lrmd: [3210]: info:<br>> rsc:p_drbd_mount1:1 probe[3]<br>> > (pid 3456)<br>> > Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op:<br>> Performing<br>> > key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc<br>> > op=p_drbd_mount2:1_monitor_0 )<br>> > Mar 28 09:24:48 node2 lrmd: [3210]: info:<br>> rsc:p_drbd_mount2:1 probe[4]<br>> > (pid 3457)<br>> > Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING:<br>> Couldn't find<br>> > device [/dev/drbd0]. Expected /dev/??? 
to exist<br>> > Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:<br>> > crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D<br>> > Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:<br>> > crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D<br>> > Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:<br>> > crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D<br>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation<br>> monitor[4] on<br>> > p_drbd_mount2:1 for client 3213: pid 3457 exited with<br>> return code 7<br>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation<br>> monitor[2] on<br>> > p_drbd_vmstore:1 for client 3213: pid 3455 exited with<br>> return code 7<br>> > Mar 28 09:24:48 node2 crmd: [3213]: info:<br>> process_lrm_event: LRM<br>> > operation p_drbd_mount2:1_monitor_0 (call=4, rc=7,<br>> cib-update=10,<br>> > confirmed=true) not running<br>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation<br>> monitor[3] on<br>> > p_drbd_mount1:1 for client 3213: pid 3456 exited with<br>> return code 7<br>> > Mar 28 09:24:48 node2 crmd: [3213]: info:<br>> process_lrm_event: LRM<br>> > operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7,<br>> cib-update=11,<br>> > confirmed=true) not running<br>> > Mar 28 09:24:48 node2 crmd: [3213]: info:<br>> process_lrm_event: LRM<br>> > operation p_drbd_mount1:1_monitor_0 (call=3, rc=7,<br>> cib-update=12,<br>> > confirmed=true) not running<br>> <br>> No errors, just probing ... so for any reason Pacemaker does<br>> not like to<br>> start it ... use crm_simulate to find out why ... 
or provide<br>> information<br>> as requested above.<br>> <br>> Regards,<br>> Andreas<br>> <br>> -- <br>> Need help with Pacemaker?<br>> http://www.hastexo.com/now<br>> <br>> ><br>> > Thanks,<br>> ><br>> > Andrew<br>> ><br>> ><br>> ------------------------------------------------------------------------<br>> > *From: *"Andreas Kurz" <andreas@hastexo.com<br>> <mailto:andreas@hastexo.com>><br>> > *To: *pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org><br>> > *Sent: *Wednesday, March 28, 2012 9:03:06 AM<br>> > *Subject: *Re: [Pacemaker] Nodes will not promote DRBD<br>> resources to<br>> > master on failover<br>> ><br>> > On 03/28/2012 03:47 PM, Andrew Martin wrote:<br>> >> Hi Andreas,<br>> >><br>> >>> hmm ... what is that fence-peer script doing? If you<br>> want to use<br>> >>> resource-level fencing with the help of dopd, activate the<br>> >>> drbd-peer-outdater script in the line above ... and<br>> double check if the<br>> >>> path is correct<br>> >> fence-peer is just a wrapper for drbd-peer-outdater that<br>> does some<br>> >> additional logging. In my testing dopd has been working well.<br>> ><br>> > I see<br>> ><br>> >><br>> >>>> I am thinking of making the following changes to the<br>> CIB (as per the<br>> >>>> official DRBD<br>> >>>> guide<br>> >><br>> ><br>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)<br>> in<br>> >>>> order to add the DRBD lsb service and require that it<br>> start before the<br>> >>>> ocf:linbit:drbd resources. Does this look correct?<br>> >>><br>> >>> Where did you read that? 
No, deactivate the startup of<br>> DRBD on system<br>> >>> boot and let Pacemaker manage it completely.<br>> >>><br>> >>>> primitive p_drbd-init lsb:drbd op monitor interval="30"<br>> >>>> colocation c_drbd_together inf:<br>> >>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master<br>> >>>> ms_drbd_mount2:Master<br>> >>>> order drbd_init_first inf: ms_drbd_vmstore:promote<br>> >>>> ms_drbd_mount1:promote ms_drbd_mount2:promote<br>> p_drbd-init:start<br>> >>>><br>> >>>> This doesn't seem to require that drbd be also running<br>> on the node where<br>> >>>> the ocf:linbit:drbd resources are slave (which it would<br>> need to do to be<br>> >>>> a DRBD SyncTarget) - how can I ensure that drbd is<br>> running everywhere?<br>> >>>> (clone cl_drbd p_drbd-init ?)<br>> >>><br>> >>> This is really not needed.<br>> >> I was following the official DRBD Users Guide:<br>> >><br>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html<br>> >><br>> >> If I am understanding your previous message correctly, I<br>> do not need to<br>> >> add a lsb primitive for the drbd daemon? 
It will be<br>> >> started/stopped/managed automatically by my<br>> ocf:linbit:drbd resources<br>> >> (and I can remove the /etc/rc* symlinks)?<br>> ><br>> > Yes, you don't need that LSB script when using Pacemaker<br>> and should not<br>> > let init start it.<br>> ><br>> > Regards,<br>> > Andreas<br>> ><br>> > --<br>> > Need help with Pacemaker?<br>> > http://www.hastexo.com/now<br>> ><br>> >><br>> >> Thanks,<br>> >><br>> >> Andrew<br>> >><br>> >><br>> ------------------------------------------------------------------------<br>> >> *From: *"Andreas Kurz" <andreas@hastexo.com<br>> <mailto:andreas@hastexo.com> <mailto:andreas@hastexo.com<br>> <mailto:andreas@hastexo.com>>><br>> >> *To: *pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org><br>> <mailto:pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org>><br>> >> *Sent: *Wednesday, March 28, 2012 7:27:34 AM<br>> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD<br>> resources to<br>> >> master on failover<br>> >><br>> >> On 03/28/2012 12:13 AM, Andrew Martin wrote:<br>> >>> Hi Andreas,<br>> >>><br>> >>> Thanks, I've updated the colocation rule to be in the<br>> correct order. I<br>> >>> also enabled the STONITH resource (this was temporarily<br>> disabled before<br>> >>> for some additional testing). DRBD has its own network<br>> connection over<br>> >>> the br1 interface (192.168.5.0/24<br>> <http://192.168.5.0/24> network), a direct crossover cable<br>> >>> between node1 and node2:<br>> >>> global { usage-count no; }<br>> >>> common {<br>> >>> syncer { rate 110M; }<br>> >>> }<br>> >>> resource vmstore {<br>> >>> protocol C;<br>> >>> startup {<br>> >>> wfc-timeout 15;<br>> >>> degr-wfc-timeout 60;<br>> >>> }<br>> >>> handlers {<br>> >>> #fence-peer<br>> "/usr/lib/heartbeat/drbd-peer-outdater -t 5";<br>> >>> fence-peer "/usr/local/bin/fence-peer";<br>> >><br>> >> hmm ... what is that fence-peer script doing? 
If you want<br>> to use<br>> >> resource-level fencing with the help of dopd, activate the<br>> >> drbd-peer-outdater script in the line above ... and<br>> double check if the<br>> >> path is correct<br>> >><br>> >>> split-brain<br>> "/usr/lib/drbd/notify-split-brain.sh<br>> >>> me@example.com <mailto:me@example.com><br>> <mailto:me@example.com <mailto:me@example.com>>";<br>> >>> }<br>> >>> net {<br>> >>> after-sb-0pri discard-zero-changes;<br>> >>> after-sb-1pri discard-secondary;<br>> >>> after-sb-2pri disconnect;<br>> >>> cram-hmac-alg md5;<br>> >>> shared-secret "xxxxx";<br>> >>> }<br>> >>> disk {<br>> >>> fencing resource-only;<br>> >>> }<br>> >>> on node1 {<br>> >>> device /dev/drbd0;<br>> >>> disk /dev/sdb1;<br>> >>> address 192.168.5.10:7787<br>> <http://192.168.5.10:7787>;<br>> >>> meta-disk internal;<br>> >>> }<br>> >>> on node2 {<br>> >>> device /dev/drbd0;<br>> >>> disk /dev/sdf1;<br>> >>> address 192.168.5.11:7787<br>> <http://192.168.5.11:7787>;<br>> >>> meta-disk internal;<br>> >>> }<br>> >>> }<br>> >>> # and similar for mount1 and mount2<br>> >>><br>> >>> Also, here is my ha.cf <http://ha.cf>. It uses both the<br>> direct link between the nodes<br>> >>> (br1) and the shared LAN network on br0 for communicating:<br>> >>> autojoin none<br>> >>> mcast br0 239.0.0.43 694 1 0<br>> >>> bcast br1<br>> >>> warntime 5<br>> >>> deadtime 15<br>> >>> initdead 60<br>> >>> keepalive 2<br>> >>> node node1<br>> >>> node node2<br>> >>> node quorumnode<br>> >>> crm respawn<br>> >>> respawn hacluster /usr/lib/heartbeat/dopd<br>> >>> apiauth dopd gid=haclient uid=hacluster<br>> >>><br>> >>> I am thinking of making the following changes to the CIB<br>> (as per the<br>> >>> official DRBD<br>> >>> guide<br>> >><br>> ><br>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)<br>> in<br>> >>> order to add the DRBD lsb service and require that it<br>> start before the<br>> >>> ocf:linbit:drbd resources. 
Does this look correct?<br>> >><br>> >> Where did you read that? No, deactivate the startup of<br>> DRBD on system<br>> >> boot and let Pacemaker manage it completely.<br>> >><br>> >>> primitive p_drbd-init lsb:drbd op monitor interval="30"<br>> >>> colocation c_drbd_together inf:<br>> >>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master<br>> >>> ms_drbd_mount2:Master<br>> >>> order drbd_init_first inf: ms_drbd_vmstore:promote<br>> >>> ms_drbd_mount1:promote ms_drbd_mount2:promote<br>> p_drbd-init:start<br>> >>><br>> >>> This doesn't seem to require that drbd be also running<br>> on the node where<br>> >>> the ocf:linbit:drbd resources are slave (which it would<br>> need to do to be<br>> >>> a DRBD SyncTarget) - how can I ensure that drbd is<br>> running everywhere?<br>> >>> (clone cl_drbd p_drbd-init ?)<br>> >><br>> >> This is really not needed.<br>> >><br>> >> Regards,<br>> >> Andreas<br>> >><br>> >> --<br>> >> Need help with Pacemaker?<br>> >> http://www.hastexo.com/now<br>> >><br>> >>><br>> >>> Thanks,<br>> >>><br>> >>> Andrew<br>> >>><br>> ------------------------------------------------------------------------<br>> >>> *From: *"Andreas Kurz" <andreas@hastexo.com<br>> <mailto:andreas@hastexo.com> <mailto:andreas@hastexo.com<br>> <mailto:andreas@hastexo.com>>><br>> >>> *To: *pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org><br>> > <mailto:*pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org>><br>> >>> *Sent: *Monday, March 26, 2012 5:56:22 PM<br>> >>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD<br>> resources to<br>> >>> master on failover<br>> >>><br>> >>> On 03/24/2012 08:15 PM, Andrew Martin wrote:<br>> >>>> Hi Andreas,<br>> >>>><br>> >>>> My complete cluster configuration is as follows:<br>> >>>> ============<br>> >>>> Last updated: Sat Mar 24 13:51:55 2012<br>> >>>> Last change: Sat Mar 24 13:41:55 2012<br>> >>>> Stack: Heartbeat<br>> >>>> Current DC: node2<br>> 
(9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition<br>> >>>> with quorum<br>> >>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c<br>> >>>> 3 Nodes configured, unknown expected votes<br>> >>>> 19 Resources configured.<br>> >>>> ============<br>> >>>><br>> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):<br>> OFFLINE<br>> > (standby)<br>> >>>> Online: [ node2 node1 ]<br>> >>>><br>> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>> >>>> Masters: [ node2 ]<br>> >>>> Slaves: [ node1 ]<br>> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]<br>> >>>> Masters: [ node2 ]<br>> >>>> Slaves: [ node1 ]<br>> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>> >>>> Masters: [ node2 ]<br>> >>>> Slaves: [ node1 ]<br>> >>>> Resource Group: g_vm<br>> >>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2<br>> >>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2<br>> >>>> Clone Set: cl_daemons [g_daemons]<br>> >>>> Started: [ node2 node1 ]<br>> >>>> Stopped: [ g_daemons:2 ]<br>> >>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]<br>> >>>> Started: [ node2 node1 ]<br>> >>>> Stopped: [ p_sysadmin_notify:2 ]<br>> >>>> stonith-node1(stonith:external/tripplitepdu):Started node2<br>> >>>> stonith-node2(stonith:external/tripplitepdu):Started node1<br>> >>>> Clone Set: cl_ping [p_ping]<br>> >>>> Started: [ node2 node1 ]<br>> >>>> Stopped: [ p_ping:2 ]<br>> >>>><br>> >>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \<br>> >>>> attributes standby="off"<br>> >>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \<br>> >>>> attributes standby="off"<br>> >>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4"<br>> quorumnode \<br>> >>>> attributes standby="on"<br>> >>>> primitive p_drbd_mount2 ocf:linbit:drbd \<br>> >>>> params drbd_resource="mount2" \<br>> >>>> op monitor interval="15" role="Master" \<br>> >>>> op monitor interval="30" role="Slave"<br>> >>>> primitive p_drbd_mount1 ocf:linbit:drbd \<br>> >>>> params 
drbd_resource="mount1" \<br>> >>>> op monitor interval="15" role="Master" \<br>> >>>> op monitor interval="30" role="Slave"<br>> >>>> primitive p_drbd_vmstore ocf:linbit:drbd \<br>> >>>> params drbd_resource="vmstore" \<br>> >>>> op monitor interval="15" role="Master" \<br>> >>>> op monitor interval="30" role="Slave"<br>> >>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \<br>> >>>> params device="/dev/drbd0" directory="/vmstore"<br>> fstype="ext4" \<br>> >>>> op start interval="0" timeout="60s" \<br>> >>>> op stop interval="0" timeout="60s" \<br>> >>>> op monitor interval="20s" timeout="40s"<br>> >>>> primitive p_libvirt-bin upstart:libvirt-bin \<br>> >>>> op monitor interval="30"<br>> >>>> primitive p_ping ocf:pacemaker:ping \<br>> >>>> params name="p_ping" host_list="192.168.1.10<br>> 192.168.1.11"<br>> >>>> multiplier="1000" \<br>> >>>> op monitor interval="20s"<br>> >>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \<br>> >>>> params email="me@example.com<br>> <mailto:me@example.com> <mailto:me@example.com<br>> <mailto:me@example.com>>" \<br>> >>>> params subject="Pacemaker Change" \<br>> >>>> op start interval="0" timeout="10" \<br>> >>>> op stop interval="0" timeout="10" \<br>> >>>> op monitor interval="10" timeout="10"<br>> >>>> primitive p_vm ocf:heartbeat:VirtualDomain \<br>> >>>> params config="/vmstore/config/vm.xml" \<br>> >>>> meta allow-migrate="false" \<br>> >>>> op start interval="0" timeout="120s" \<br>> >>>> op stop interval="0" timeout="120s" \<br>> >>>> op monitor interval="10" timeout="30"<br>> >>>> primitive stonith-node1 stonith:external/tripplitepdu \<br>> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="1"<br>> pdu_username="xxx"<br>> >>>> pdu_password="xxx" hostname_to_stonith="node1"<br>> >>>> primitive stonith-node2 stonith:external/tripplitepdu \<br>> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="2"<br>> pdu_username="xxx"<br>> >>>> pdu_password="xxx" hostname_to_stonith="node2"<br>> >>>> group g_daemons p_libvirt-bin<br>> >>>> 
group g_vm p_fs_vmstore p_vm<br>> >>>> ms ms_drbd_mount2 p_drbd_mount2 \<br>> >>>> meta master-max="1" master-node-max="1"<br>> clone-max="2"<br>> >>>> clone-node-max="1" notify="true"<br>> >>>> ms ms_drbd_mount1 p_drbd_mount1 \<br>> >>>> meta master-max="1" master-node-max="1"<br>> clone-max="2"<br>> >>>> clone-node-max="1" notify="true"<br>> >>>> ms ms_drbd_vmstore p_drbd_vmstore \<br>> >>>> meta master-max="1" master-node-max="1"<br>> clone-max="2"<br>> >>>> clone-node-max="1" notify="true"<br>> >>>> clone cl_daemons g_daemons<br>> >>>> clone cl_ping p_ping \<br>> >>>> meta interleave="true"<br>> >>>> clone cl_sysadmin_notify p_sysadmin_notify<br>> >>>> location l-st-node1 stonith-node1 -inf: node1<br>> >>>> location l-st-node2 stonith-node2 -inf: node2<br>> >>>> location l_run_on_most_connected p_vm \<br>> >>>> rule $id="l_run_on_most_connected-rule" p_ping:<br>> defined p_ping<br>> >>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master<br>> >>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm<br>> >>><br>> >>> As Emmanuel already said, g_vm has to be in the first<br>> place in this<br>> >>> collocation constraint .... 
g_vm must be colocated with<br>> the drbd masters.<br>> >>><br>> >>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote<br>> ms_drbd_mount1:promote<br>> >>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start<br>> >>>> property $id="cib-bootstrap-options" \<br>> >>>> <br>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \<br>> >>>> cluster-infrastructure="Heartbeat" \<br>> >>>> stonith-enabled="false" \<br>> >>>> no-quorum-policy="stop" \<br>> >>>> last-lrm-refresh="1332539900" \<br>> >>>> cluster-recheck-interval="5m" \<br>> >>>> crmd-integration-timeout="3m" \<br>> >>>> shutdown-escalation="5m"<br>> >>>><br>> >>>> The STONITH plugin is a custom plugin I wrote for the<br>> Tripp-Lite<br>> >>>> PDUMH20ATNET that I'm using as the STONITH device:<br>> >>>><br>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf<br>> >>><br>> >>> And why aren't you using it? ... stonith-enabled="false"<br>> >>><br>> >>>><br>> >>>> As you can see, I left the DRBD service to be started<br>> by the operating<br>> >>>> system (as an lsb script at boot time); however,<br>> Pacemaker controls<br>> >>>> actually bringing up/taking down the individual DRBD<br>> devices.<br>> >>><br>> >>> Don't start drbd on system boot; give Pacemaker full<br>> control.<br>> >>><br>> >>>> The<br>> >>>> behavior I observe is as follows: I issue "crm resource<br>> migrate p_vm" on<br>> >>>> node1 and fail over successfully to node2. During this<br>> time, node2 fences<br>> >>>> node1's DRBD devices (using dopd) and marks them as<br>> Outdated. Meanwhile<br>> >>>> node2's DRBD devices are UpToDate. I then shut down both<br>> nodes and then<br>> >>>> bring them back up. They reconnect to the cluster (with<br>> quorum), and<br>> >>>> node1's DRBD devices are still Outdated as expected and<br>> node2's DRBD<br>> >>>> devices are still UpToDate, as expected. 
At this point,<br>> DRBD starts on<br>> >>>> both nodes, however node2 will not set DRBD as master:<br>> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):<br>> OFFLINE<br>> > (standby)<br>> >>>> Online: [ node2 node1 ]<br>> >>>><br>> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>> >>>> Slaves: [ node1 node2 ]<br>> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]<br>> >>>> Slaves: [ node1 node 2 ]<br>> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>> >>>> Slaves: [ node1 node2 ]<br>> >>><br>> >>> There should really be no interruption of the drbd<br>> replication on vm<br>> >>> migration that activates the dopd ... drbd has its own<br>> direct network<br>> >>> connection?<br>> >>><br>> >>> Please share your ha.cf <http://ha.cf> file and your<br>> drbd configuration. Watch out for<br>> >>> drbd messages in your kernel log file, that should give<br>> you additional<br>> >>> information when/why the drbd connection was lost.<br>> >>><br>> >>> Regards,<br>> >>> Andreas<br>> >>><br>> >>> --<br>> >>> Need help with Pacemaker?<br>> >>> http://www.hastexo.com/now<br>> >>><br>> >>>><br>> >>>> I am having trouble sorting through the logging<br>> information because<br>> >>>> there is so much of it in /var/log/daemon.log, but I<br>> can't find an<br>> >>>> error message printed about why it will not promote<br>> node2. At this point<br>> >>>> the DRBD devices are as follows:<br>> >>>> node2: cstate = WFConnection dstate=UpToDate<br>> >>>> node1: cstate = StandAlone dstate=Outdated<br>> >>>><br>> >>>> I don't see any reason why node2 can't become DRBD<br>> master, or am I<br>> >>>> missing something? If I do "drbdadm connect all" on<br>> node1, then the<br>> >>>> cstate on both nodes changes to "Connected" and node2<br>> immediately<br>> >>>> promotes the DRBD resources to master. 
Any ideas on why<br>> I'm observing<br>> >>>> this incorrect behavior?<br>> >>>><br>> >>>> Any tips on how I can better filter through the<br>> pacemaker/heartbeat logs<br>> >>>> or how to get additional useful debug information?<br>> >>>><br>> >>>> Thanks,<br>> >>>><br>> >>>> Andrew<br>> >>>><br>> >>>><br>> ------------------------------------------------------------------------<br>> >>>> *From: *"Andreas Kurz" <andreas@hastexo.com<br>> <mailto:andreas@hastexo.com><br>> > <mailto:andreas@hastexo.com <mailto:andreas@hastexo.com>>><br>> >>>> *To: *pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org><br>> >> <mailto:*pacemaker@oss.clusterlabs.org<br>> <mailto:pacemaker@oss.clusterlabs.org>><br>> >>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM<br>> >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD<br>> resources to<br>> >>>> master on failover<br>> >>>><br>> >>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:<br>> >>>>> Hello,<br>> >>>>><br>> >>>>> Recently I finished configuring a two-node cluster<br>> with pacemaker 1.1.6<br>> >>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04.<br>> This cluster<br>> > includes<br>> >>>>> the following resources:<br>> >>>>> - primitives for DRBD storage devices<br>> >>>>> - primitives for mounting the filesystem on the DRBD<br>> storage<br>> >>>>> - primitives for some mount binds<br>> >>>>> - primitive for starting apache<br>> >>>>> - primitives for starting samba and nfs servers<br>> (following instructions<br>> >>>>> here<br>> <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)<br>> >>>>> - primitives for exporting nfs shares<br>> (ocf:heartbeat:exportfs)<br>> >>>><br>> >>>> not enough information ... 
please share at least your<br>> complete cluster<br>> >>>> configuration<br>> >>>><br>> >>>> Regards,<br>> >>>> Andreas<br>> >>>><br>> >>>> --<br>> >>>> Need help with Pacemaker?<br>> >>>> http://www.hastexo.com/now<br>> >>>><br>> >>>>><br>> >>>>> Perhaps this is best described through the output of<br>> crm_mon:<br>> >>>>> Online: [ node1 node2 ]<br>> >>>>><br>> >>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]<br>> (unmanaged)<br>> >>>>> p_drbd_mount1:0 (ocf::linbit:drbd): <br>> Started node2<br>> >>> (unmanaged)<br>> >>>>> p_drbd_mount1:1 (ocf::linbit:drbd): <br>> Started node1<br>> >>>>> (unmanaged) FAILED<br>> >>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>> >>>>> p_drbd_mount2:0 (ocf::linbit:drbd): <br>> Master node1<br>> >>>>> (unmanaged) FAILED<br>> >>>>> Slaves: [ node2 ]<br>> >>>>> Resource Group: g_core<br>> >>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): <br>> Started node1<br>> >>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): <br>> Started node1<br>> >>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): <br>> Started node1<br>> >>>>> Resource Group: g_apache<br>> >>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): <br>> Started node1<br>> >>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): <br>> Started node1<br>> >>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): <br>> Started node1<br>> >>>>> p_fs_varwww (ocf::heartbeat:Filesystem): <br>> Started node1<br>> >>>>> p_apache (ocf::heartbeat:apache): <br>> Started node1<br>> >>>>> Resource Group: g_fileservers<br>> >>>>> p_lsb_smb (lsb:smbd): Started node1<br>> >>>>> p_lsb_nmb (lsb:nmbd): Started node1<br>> >>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): <br>> Started node1<br>> >>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): <br>> Started node1<br>> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs):<br>> Started<br>> > node1<br>> >>>>><br>> >>>>> I have read through the Pacemaker Explained<br>> >>>>><br>> >>>><br>> >>><br>> ><br>> 
<http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained><br>> >>>>> documentation, however could not find a way to further<br>> debug these<br>> >>>>> problems. First, I put node1 into standby mode to<br>> attempt failover to<br>> >>>>> the other node (node2). Node2 appeared to start the<br>> transition to<br>> >>>>> master, however it failed to promote the DRBD<br>> resources to master (the<br>> >>>>> first step). I have attached a copy of this session in<br>> commands.log and<br>> >>>>> additional excerpts from /var/log/syslog during<br>> important steps. I have<br>> >>>>> attempted everything I can think of to try and start<br>> the DRBD resource<br>> >>>>> (e.g. start/stop/promote/manage/cleanup under crm<br>> resource, restarting<br>> >>>>> heartbeat) but cannot bring it out of the slave state.<br>> However, if<br>> > I set<br>> >>>>> it to unmanaged and then run drbdadm primary all in<br>> the terminal,<br>> >>>>> pacemaker is satisfied and continues starting the rest<br>> of the<br>> > resources.<br>> >>>>> It then failed when attempting to mount the filesystem<br>> for mount2, the<br>> >>>>> p_fs_mount2 resource. I attempted to mount the<br>> filesystem myself<br>> > and was<br>> >>>>> successful. I then unmounted it and ran cleanup on<br>> p_fs_mount2 and then<br>> >>>>> it mounted. 
The rest of the resources started as<br>> expected until the<br>> >>>>> p_exportfs_mount2 resource, which failed as follows:<br>> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): <br>> started node2<br>> >>>>> (unmanaged) FAILED<br>> >>>>><br>> >>>>> I ran cleanup on this and it started, however when<br>> running this test<br>> >>>>> earlier today no command could successfully start this<br>> exportfs<br>> >> resource.<br>> >>>>><br>> >>>>> How can I configure pacemaker to better resolve these<br>> problems and be<br>> >>>>> able to bring the node up successfully on its own?<br>> What can I check to<br>> >>>>> determine why these failures are occurring?<br>> /var/log/syslog did not seem<br>> >>>>> to contain very much useful information regarding why<br>> the failures<br>> >>>> occurred.<br>> >>>>><br>> >>>>> Thanks,<br>> >>>>><br>> >>>>> Andrew<br>> >>>>><br>> >>>>><br>> >>>>><br>> >>>>><br>> >>>><br>> >>>><br>> >>>><br>> >>>><br>> >>>><br>> >>>> _______________________________________________<br>> >>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org<br>> <mailto:Pacemaker@oss.clusterlabs.org><br>> >> <mailto:Pacemaker@oss.clusterlabs.org<br>> <mailto:Pacemaker@oss.clusterlabs.org>><br>> >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker<br>> >>>><br>> >>>> Project Home: http://www.clusterlabs.org<br>> >>>> Getting started:<br>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf<br>> >>>> Bugs: http://bugs.clusterlabs.org<br>> >>>><br>> >>>><br>> >>>><br>> >>>> _______________________________________________<br>> >>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org<br>> <mailto:Pacemaker@oss.clusterlabs.org><br>> >> <mailto:Pacemaker@oss.clusterlabs.org<br>> <mailto:Pacemaker@oss.clusterlabs.org>><br>> >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker<br>> >>>><br>> >>>> Project Home: http://www.clusterlabs.org<br>> >>>> Getting started:<br>> 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf<br>> >>>> Bugs: http://bugs.clusterlabs.org<br>> <br>> -- <br>> esta es mi vida e me la vivo hasta que dios quiera<br><br>-- <br>Need help with Pacemaker?<br>http://www.hastexo.com/now<br><br><br><br>_______________________________________________<br>Pacemaker mailing list: Pacemaker@oss.clusterlabs.org<br>http://oss.clusterlabs.org/mailman/listinfo/pacemaker<br><br>Project Home: http://www.clusterlabs.org<br>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf<br>Bugs: http://bugs.clusterlabs.org<br></div><br></div></div></body></html>