<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: Times New Roman; font-size: 12pt; color: #000000'>Hi Andreas,<div><br></div><div>Here is a copy of my complete CIB:</div><div><a href="http://pastebin.com/v5wHVFuy">http://pastebin.com/v5wHVFuy</a></div><div><br></div><div>I'll work on generating a report using crm_report as well.</div><div><br></div><div>Thanks,</div><div><br></div><div>Andrew</div><div><br><hr id="zwchr"><div style="color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Andreas Kurz" &lt;andreas@hastexo.com&gt;<br><b>To: </b>pacemaker@oss.clusterlabs.org<br><b>Sent: </b>Friday, March 30, 2012 4:41:16 AM<br><b>Subject: </b>Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover<br><br>On 03/28/2012 04:56 PM, Andrew Martin wrote:<br>> Hi Andreas,<br>> <br>> I disabled the DRBD init script and then restarted the slave node<br>> (node2). After it came back up, DRBD did not start:<br>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending<br>> Online: [ node2 node1 ]<br>> <br>>  Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>>      Masters: [ node1 ]<br>>      Stopped: [ p_drbd_vmstore:1 ]<br>>  Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]<br>>      Masters: [ node1 ]<br>>      Stopped: [ p_drbd_mount1:1 ]<br>>  Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>>      Masters: [ node1 ]<br>>      Stopped: [ p_drbd_mount2:1 ]<br>> ...<br>> <br>> root@node2:~# service drbd status<br>> drbd not loaded<br><br>Yes, expected unless Pacemaker starts DRBD<br><br>> <br>> Is there something else I need to change in the CIB to ensure that DRBD<br>> is started? 
All of my DRBD devices are configured like this:<br>> primitive p_drbd_mount2 ocf:linbit:drbd \<br>>         params drbd_resource="mount2" \<br>>         op monitor interval="15" role="Master" \<br>>         op monitor interval="30" role="Slave"<br>> ms ms_drbd_mount2 p_drbd_mount2 \<br>>         meta master-max="1" master-node-max="1" clone-max="2"<br>> clone-node-max="1" notify="true"<br><br>That should be enough ... unable to say more without seeing the complete<br>configuration ... too much fragments of information ;-)<br><br>Please provide (e.g. pastebin) your complete cib (cibadmin -Q) when<br>cluster is in that state ... or even better create a crm_report archive<br><br>> <br>> Here is the output from the syslog (grep -i drbd /var/log/syslog):<br>> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing<br>> key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc<br>> op=p_drbd_vmstore:1_monitor_0 )<br>> Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2]<br>> (pid 3455)<br>> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing<br>> key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc<br>> op=p_drbd_mount1:1_monitor_0 )<br>> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3]<br>> (pid 3456)<br>> Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing<br>> key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc<br>> op=p_drbd_mount2:1_monitor_0 )<br>> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4]<br>> (pid 3457)<br>> Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find<br>> device [/dev/drbd0]. Expected /dev/??? 
to exist<br>> Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:<br>> crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D<br>> Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:<br>> crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D<br>> Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:<br>> crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D<br>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on<br>> p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7<br>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on<br>> p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7<br>> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM<br>> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10,<br>> confirmed=true) not running<br>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on<br>> p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7<br>> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM<br>> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,<br>> confirmed=true) not running<br>> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM<br>> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12,<br>> confirmed=true) not running<br><br>No errors, just probing ... so for any reason Pacemaker does not like to<br>start it ... use crm_simulate to find out why ... 
or provide information<br>as requested above.<br><br>Regards,<br>Andreas<br><br>-- <br>Need help with Pacemaker?<br>http://www.hastexo.com/now<br><br>> <br>> Thanks,<br>> <br>> Andrew<br>> <br>> ------------------------------------------------------------------------<br>> *From: *"Andreas Kurz" <andreas@hastexo.com><br>> *To: *pacemaker@oss.clusterlabs.org<br>> *Sent: *Wednesday, March 28, 2012 9:03:06 AM<br>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to<br>> master on failover<br>> <br>> On 03/28/2012 03:47 PM, Andrew Martin wrote:<br>>> Hi Andreas,<br>>><br>>>> hmm ... what is that fence-peer script doing? If you want to use<br>>>> resource-level fencing with the help of dopd, activate the<br>>>> drbd-peer-outdater script in the line above ... and double check if the<br>>>> path is correct<br>>> fence-peer is just a wrapper for drbd-peer-outdater that does some<br>>> additional logging. In my testing dopd has been working well.<br>> <br>> I see<br>> <br>>><br>>>>> I am thinking of making the following changes to the CIB (as per the<br>>>>> official DRBD<br>>>>> guide<br>>><br>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in<br>>>>> order to add the DRBD lsb service and require that it start before the<br>>>>> ocf:linbit:drbd resources. Does this look correct?<br>>>><br>>>> Where did you read that? 
No, deactivate the startup of DRBD on system<br>>>> boot and let Pacemaker manage it completely.<br>>>><br>>>>> primitive p_drbd-init lsb:drbd op monitor interval="30"<br>>>>> colocation c_drbd_together inf:<br>>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master<br>>>>> ms_drbd_mount2:Master<br>>>>> order drbd_init_first inf: ms_drbd_vmstore:promote<br>>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start<br>>>>><br>>>>> This doesn't seem to require that drbd be also running on the node where<br>>>>> the ocf:linbit:drbd resources are slave (which it would need to do to be<br>>>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?<br>>>>> (clone cl_drbd p_drbd-init ?)<br>>>><br>>>> This is really not needed.<br>>> I was following the official DRBD Users Guide:<br>>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html<br>>><br>>> If I am understanding your previous message correctly, I do not need to<br>>> add a lsb primitive for the drbd daemon? It will be<br>>> started/stopped/managed automatically by my ocf:linbit:drbd resources<br>>> (and I can remove the /etc/rc* symlinks)?<br>> <br>> Yes, you don't need that LSB script when using Pacemaker and should not<br>> let init start it.<br>> <br>> Regards,<br>> Andreas<br>> <br>> -- <br>> Need help with Pacemaker?<br>> http://www.hastexo.com/now<br>> <br>>><br>>> Thanks,<br>>><br>>> Andrew<br>>><br>>> ------------------------------------------------------------------------<br>>> *From: *"Andreas Kurz" <andreas@hastexo.com <mailto:andreas@hastexo.com>><br>>> *To: *pacemaker@oss.clusterlabs.org <mailto:pacemaker@oss.clusterlabs.org><br>>> *Sent: *Wednesday, March 28, 2012 7:27:34 AM<br>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to<br>>> master on failover<br>>><br>>> On 03/28/2012 12:13 AM, Andrew Martin wrote:<br>>>> Hi Andreas,<br>>>><br>>>> Thanks, I've updated the colocation rule to be in the correct order. 
I<br>>>> also enabled the STONITH resource (this was temporarily disabled before<br>>>> for some additional testing). DRBD has its own network connection over<br>>>> the br1 interface (192.168.5.0/24 network), a direct crossover cable<br>>>> between node1 and node2:<br>>>> global { usage-count no; }<br>>>> common {<br>>>>         syncer { rate 110M; }<br>>>> }<br>>>> resource vmstore {<br>>>>         protocol C;<br>>>>         startup {<br>>>>                 wfc-timeout  15;<br>>>>                 degr-wfc-timeout 60;<br>>>>         }<br>>>>         handlers {<br>>>>                 #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";<br>>>>                 fence-peer "/usr/local/bin/fence-peer";<br>>><br>>> hmm ... what is that fence-peer script doing? If you want to use<br>>> resource-level fencing with the help of dopd, activate the<br>>> drbd-peer-outdater script in the line above ... and double check if the<br>>> path is correct<br>>><br>>>>                 split-brain "/usr/lib/drbd/notify-split-brain.sh<br>>>> me@example.com <mailto:me@example.com>";<br>>>>         }<br>>>>         net {<br>>>>                 after-sb-0pri discard-zero-changes;<br>>>>                 after-sb-1pri discard-secondary;<br>>>>                 after-sb-2pri disconnect;<br>>>>                 cram-hmac-alg md5;<br>>>>                 shared-secret "xxxxx";<br>>>>         }<br>>>>         disk {<br>>>>                 fencing resource-only;<br>>>>         }<br>>>>         on node1 {<br>>>>                 device /dev/drbd0;<br>>>>                 disk /dev/sdb1;<br>>>>                 address 192.168.5.10:7787;<br>>>>                 meta-disk internal;<br>>>>         }<br>>>>         on node2 {<br>>>>                 device /dev/drbd0;<br>>>>                 disk /dev/sdf1;<br>>>>                 address 192.168.5.11:7787;<br>>>>                 meta-disk internal;<br>>>>         }<br>>>> }<br>>>> # and similar for mount1 and mount2<br>>>><br>>>> Also, here is my ha.cf. 
It uses both the direct link between the nodes<br>>>> (br1) and the shared LAN network on br0 for communicating:<br>>>> autojoin none<br>>>> mcast br0 239.0.0.43 694 1 0<br>>>> bcast br1<br>>>> warntime 5<br>>>> deadtime 15<br>>>> initdead 60<br>>>> keepalive 2<br>>>> node node1<br>>>> node node2<br>>>> node quorumnode<br>>>> crm respawn<br>>>> respawn hacluster /usr/lib/heartbeat/dopd<br>>>> apiauth dopd gid=haclient uid=hacluster<br>>>><br>>>> I am thinking of making the following changes to the CIB (as per the<br>>>> official DRBD<br>>>> guide<br>>><br>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in<br>>>> order to add the DRBD lsb service and require that it start before the<br>>>> ocf:linbit:drbd resources. Does this look correct?<br>>><br>>> Where did you read that? No, deactivate the startup of DRBD on system<br>>> boot and let Pacemaker manage it completely.<br>>><br>>>> primitive p_drbd-init lsb:drbd op monitor interval="30"<br>>>> colocation c_drbd_together inf:<br>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master<br>>>> ms_drbd_mount2:Master<br>>>> order drbd_init_first inf: ms_drbd_vmstore:promote<br>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start<br>>>><br>>>> This doesn't seem to require that drbd be also running on the node where<br>>>> the ocf:linbit:drbd resources are slave (which it would need to do to be<br>>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere?<br>>>> (clone cl_drbd p_drbd-init ?)<br>>><br>>> This is really not needed.<br>>><br>>> Regards,<br>>> Andreas<br>>><br>>> --<br>>> Need help with Pacemaker?<br>>> http://www.hastexo.com/now<br>>><br>>>><br>>>> Thanks,<br>>>><br>>>> Andrew<br>>>> ------------------------------------------------------------------------<br>>>> *From: *"Andreas Kurz" <andreas@hastexo.com <mailto:andreas@hastexo.com>><br>>>> *To: *pacemaker@oss.clusterlabs.org<br>> <mailto:*pacemaker@oss.clusterlabs.org><br>>>> *Sent: *Monday, 
March 26, 2012 5:56:22 PM<br>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to<br>>>> master on failover<br>>>><br>>>> On 03/24/2012 08:15 PM, Andrew Martin wrote:<br>>>>> Hi Andreas,<br>>>>><br>>>>> My complete cluster configuration is as follows:<br>>>>> ============<br>>>>> Last updated: Sat Mar 24 13:51:55 2012<br>>>>> Last change: Sat Mar 24 13:41:55 2012<br>>>>> Stack: Heartbeat<br>>>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition<br>>>>> with quorum<br>>>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c<br>>>>> 3 Nodes configured, unknown expected votes<br>>>>> 19 Resources configured.<br>>>>> ============<br>>>>><br>>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE<br>> (standby)<br>>>>> Online: [ node2 node1 ]<br>>>>><br>>>>>  Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>>>>>      Masters: [ node2 ]<br>>>>>      Slaves: [ node1 ]<br>>>>>  Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]<br>>>>>      Masters: [ node2 ]<br>>>>>      Slaves: [ node1 ]<br>>>>>  Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>>>>>      Masters: [ node2 ]<br>>>>>      Slaves: [ node1 ]<br>>>>>  Resource Group: g_vm<br>>>>>      p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2<br>>>>>      p_vm(ocf::heartbeat:VirtualDomain):Started node2<br>>>>>  Clone Set: cl_daemons [g_daemons]<br>>>>>      Started: [ node2 node1 ]<br>>>>>      Stopped: [ g_daemons:2 ]<br>>>>>  Clone Set: cl_sysadmin_notify [p_sysadmin_notify]<br>>>>>      Started: [ node2 node1 ]<br>>>>>      Stopped: [ p_sysadmin_notify:2 ]<br>>>>>  stonith-node1(stonith:external/tripplitepdu):Started node2<br>>>>>  stonith-node2(stonith:external/tripplitepdu):Started node1<br>>>>>  Clone Set: cl_ping [p_ping]<br>>>>>      Started: [ node2 node1 ]<br>>>>>      Stopped: [ p_ping:2 ]<br>>>>><br>>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \<br>>>>>         attributes standby="off"<br>>>>> node 
$id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \<br>>>>>         attributes standby="off"<br>>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \<br>>>>>         attributes standby="on"<br>>>>> primitive p_drbd_mount2 ocf:linbit:drbd \<br>>>>>         params drbd_resource="mount2" \<br>>>>>         op monitor interval="15" role="Master" \<br>>>>>         op monitor interval="30" role="Slave"<br>>>>> primitive p_drbd_mount1 ocf:linbit:drbd \<br>>>>>         params drbd_resource="mount1" \<br>>>>>         op monitor interval="15" role="Master" \<br>>>>>         op monitor interval="30" role="Slave"<br>>>>> primitive p_drbd_vmstore ocf:linbit:drbd \<br>>>>>         params drbd_resource="vmstore" \<br>>>>>         op monitor interval="15" role="Master" \<br>>>>>         op monitor interval="30" role="Slave"<br>>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \<br>>>>>         params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \<br>>>>>         op start interval="0" timeout="60s" \<br>>>>>         op stop interval="0" timeout="60s" \<br>>>>>         op monitor interval="20s" timeout="40s"<br>>>>> primitive p_libvirt-bin upstart:libvirt-bin \<br>>>>>         op monitor interval="30"<br>>>>> primitive p_ping ocf:pacemaker:ping \<br>>>>>         params name="p_ping" host_list="192.168.1.10 192.168.1.11"<br>>>>> multiplier="1000" \<br>>>>>         op monitor interval="20s"<br>>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \<br>>>>>         params email="me@example.com <mailto:me@example.com>" \<br>>>>>         params subject="Pacemaker Change" \<br>>>>>         op start interval="0" timeout="10" \<br>>>>>         op stop interval="0" timeout="10" \<br>>>>>         op monitor interval="10" timeout="10"<br>>>>> primitive p_vm ocf:heartbeat:VirtualDomain \<br>>>>>         params config="/vmstore/config/vm.xml" \<br>>>>>         meta allow-migrate="false" \<br>>>>>         op start interval="0" timeout="120s" \<br>>>>>         op stop 
interval="0" timeout="120s" \<br>>>>>         op monitor interval="10" timeout="30"<br>>>>> primitive stonith-node1 stonith:external/tripplitepdu \<br>>>>>         params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx"<br>>>>> pdu_password="xxx" hostname_to_stonith="node1"<br>>>>> primitive stonith-node2 stonith:external/tripplitepdu \<br>>>>>         params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx"<br>>>>> pdu_password="xxx" hostname_to_stonith="node2"<br>>>>> group g_daemons p_libvirt-bin<br>>>>> group g_vm p_fs_vmstore p_vm<br>>>>> ms ms_drbd_mount2 p_drbd_mount2 \<br>>>>>         meta master-max="1" master-node-max="1" clone-max="2"<br>>>>> clone-node-max="1" notify="true"<br>>>>> ms ms_drbd_mount1 p_drbd_mount1 \<br>>>>>         meta master-max="1" master-node-max="1" clone-max="2"<br>>>>> clone-node-max="1" notify="true"<br>>>>> ms ms_drbd_vmstore p_drbd_vmstore \<br>>>>>         meta master-max="1" master-node-max="1" clone-max="2"<br>>>>> clone-node-max="1" notify="true"<br>>>>> clone cl_daemons g_daemons<br>>>>> clone cl_ping p_ping \<br>>>>>         meta interleave="true"<br>>>>> clone cl_sysadmin_notify p_sysadmin_notify<br>>>>> location l-st-node1 stonith-node1 -inf: node1<br>>>>> location l-st-node2 stonith-node2 -inf: node2<br>>>>> location l_run_on_most_connected p_vm \<br>>>>>         rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping<br>>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master<br>>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm<br>>>><br>>>> As Emmanuel already said, g_vm has to be in the first place in this<br>>>> collocation constraint .... 
g_vm must be colocated with the drbd masters.<br>>>><br>>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote<br>>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start<br>>>>> property $id="cib-bootstrap-options" \<br>>>>>         dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \<br>>>>>         cluster-infrastructure="Heartbeat" \<br>>>>>         stonith-enabled="false" \<br>>>>>         no-quorum-policy="stop" \<br>>>>>         last-lrm-refresh="1332539900" \<br>>>>>         cluster-recheck-interval="5m" \<br>>>>>         crmd-integration-timeout="3m" \<br>>>>>         shutdown-escalation="5m"<br>>>>><br>>>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite<br>>>>> PDUMH20ATNET that I'm using as the STONITH device:<br>>>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf<br>>>><br>>>> And why don't using it? .... stonith-enabled="false"<br>>>><br>>>>><br>>>>> As you can see, I left the DRBD service to be started by the operating<br>>>>> system (as an lsb script at boot time) however Pacemaker controls<br>>>>> actually bringing up/taking down the individual DRBD devices.<br>>>><br>>>> Don't start drbd on system boot, give Pacemaker the full control.<br>>>><br>>>> The<br>>>>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on<br>>>>> node1 and failover successfully to node2. During this time, node2 fences<br>>>>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile<br>>>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then<br>>>>> bring them back up. They reconnect to the cluster (with quorum), and<br>>>>> node1's DRBD devices are still Outdated as expected and node2's DRBD<br>>>>> devices are still UpToDate, as expected. 
 At this point, DRBD starts on<br>>>>> both nodes, however node2 will not set DRBD as master:<br>>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)<br>>>>> Online: [ node2 node1 ]<br>>>>><br>>>>>  Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]<br>>>>>      Slaves: [ node1 node2 ]<br>>>>>  Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]<br>>>>>      Slaves: [ node1 node2 ]<br>>>>>  Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>>>>>      Slaves: [ node1 node2 ]<br>>>><br>>>> There should really be no interruption of the drbd replication on vm<br>>>> migration that activates the dopd ... drbd has its own direct network<br>>>> connection?<br>>>><br>>>> Please share your ha.cf file and your drbd configuration. Watch out for<br>>>> drbd messages in your kernel log file, that should give you additional<br>>>> information when/why the drbd connection was lost.<br>>>><br>>>> Regards,<br>>>> Andreas<br>>>><br>>>> --<br>>>> Need help with Pacemaker?<br>>>> http://www.hastexo.com/now<br>>>><br>>>>><br>>>>> I am having trouble sorting through the logging information because<br>>>>> there is so much of it in /var/log/daemon.log, but I can't find an<br>>>>> error message printed about why it will not promote node2. At this point<br>>>>> the DRBD devices are as follows:<br>>>>> node2: cstate = WFConnection dstate=UpToDate<br>>>>> node1: cstate = StandAlone dstate=Outdated<br>>>>><br>>>>> I don't see any reason why node2 can't become DRBD master, or am I<br>>>>> missing something? If I do "drbdadm connect all" on node1, then the<br>>>>> cstate on both nodes changes to "Connected" and node2 immediately<br>>>>> promotes the DRBD resources to master. 
 Any ideas on why I'm observing<br>>>>> this incorrect behavior?<br>>>>><br>>>>> Any tips on how I can better filter through the pacemaker/heartbeat logs<br>>>>> or how to get additional useful debug information?<br>>>>><br>>>>> Thanks,<br>>>>><br>>>>> Andrew<br>>>>><br>>>>> ------------------------------------------------------------------------<br>>>>> *From: *"Andreas Kurz" &lt;andreas@hastexo.com&gt;<br>>>>> *To: *pacemaker@oss.clusterlabs.org<br>>>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM<br>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to<br>>>>> master on failover<br>>>>><br>>>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:<br>>>>>> Hello,<br>>>>>><br>>>>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6<br>>>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes<br>>>>>> the following resources:<br>>>>>> - primitives for DRBD storage devices<br>>>>>> - primitives for mounting the filesystem on the DRBD storage<br>>>>>> - primitives for some mount binds<br>>>>>> - primitive for starting apache<br>>>>>> - primitives for starting samba and nfs servers (following instructions<br>>>>>> here &lt;http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf&gt;)<br>>>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)<br>>>>><br>>>>> not enough information ... 
please share at least your complete cluster<br>>>>> configuration<br>>>>><br>>>>> Regards,<br>>>>> Andreas<br>>>>><br>>>>> --<br>>>>> Need help with Pacemaker?<br>>>>> http://www.hastexo.com/now<br>>>>><br>>>>>><br>>>>>> Perhaps this is best described through the output of crm_mon:<br>>>>>> Online: [ node1 node2 ]<br>>>>>><br>>>>>>  Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)<br>>>>>>      p_drbd_mount1:0     (ocf::linbit:drbd):     Started node2<br>>>> (unmanaged)<br>>>>>>      p_drbd_mount1:1     (ocf::linbit:drbd):     Started node1<br>>>>>> (unmanaged) FAILED<br>>>>>>  Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]<br>>>>>>      p_drbd_mount2:0       (ocf::linbit:drbd):     Master node1<br>>>>>> (unmanaged) FAILED<br>>>>>>      Slaves: [ node2 ]<br>>>>>>  Resource Group: g_core<br>>>>>>      p_fs_mount1 (ocf::heartbeat:Filesystem):    Started node1<br>>>>>>      p_fs_mount2   (ocf::heartbeat:Filesystem):    Started node1<br>>>>>>      p_ip_nfs   (ocf::heartbeat:IPaddr2):       Started node1<br>>>>>>  Resource Group: g_apache<br>>>>>>      p_fs_mountbind1    (ocf::heartbeat:Filesystem):    Started node1<br>>>>>>      p_fs_mountbind2    (ocf::heartbeat:Filesystem):    Started node1<br>>>>>>      p_fs_mountbind3    (ocf::heartbeat:Filesystem):    Started node1<br>>>>>>      p_fs_varwww        (ocf::heartbeat:Filesystem):    Started node1<br>>>>>>      p_apache   (ocf::heartbeat:apache):        Started node1<br>>>>>>  Resource Group: g_fileservers<br>>>>>>      p_lsb_smb  (lsb:smbd):     Started node1<br>>>>>>      p_lsb_nmb  (lsb:nmbd):     Started node1<br>>>>>>      p_lsb_nfsserver    (lsb:nfs-kernel-server):        Started node1<br>>>>>>      p_exportfs_mount1   (ocf::heartbeat:exportfs):      Started node1<br>>>>>>      p_exportfs_mount2     (ocf::heartbeat:exportfs):      Started<br>> node1<br>>>>>><br>>>>>> I have read through the Pacemaker Explained<br>>>>>><br>>>>><br>>>><br>> 
&lt;http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained&gt;<br>>>>>> documentation, however could not find a way to further debug these<br>>>>>> problems. First, I put node1 into standby mode to attempt failover to<br>>>>>> the other node (node2). Node2 appeared to start the transition to<br>>>>>> master, however it failed to promote the DRBD resources to master (the<br>>>>>> first step). I have attached a copy of this session in commands.log and<br>>>>>> additional excerpts from /var/log/syslog during important steps. I have<br>>>>>> attempted everything I can think of to try and start the DRBD resource<br>>>>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting<br>>>>>> heartbeat) but cannot bring it out of the slave state. However, if I set<br>>>>>> it to unmanaged and then run drbdadm primary all in the terminal,<br>>>>>> pacemaker is satisfied and continues starting the rest of the resources.<br>>>>>> It then failed when attempting to mount the filesystem for mount2, the<br>>>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself and was<br>>>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then<br>>>>>> it mounted. The rest of the resources started as expected until the<br>>>>>> p_exportfs_mount2 resource, which failed as follows:<br>>>>>> p_exportfs_mount2     (ocf::heartbeat:exportfs):      started node2<br>>>>>> (unmanaged) FAILED<br>>>>>><br>>>>>> I ran cleanup on this and it started, however when running this test<br>>>>>> earlier today no command could successfully start this exportfs resource.<br>>>>>><br>>>>>> How can I configure pacemaker to better resolve these problems and be<br>>>>>> able to bring the node up successfully on its own? What can I check to<br>>>>>> determine why these failures are occurring? 
 /var/log/syslog did not seem<br>>>>>> to contain very much useful information regarding why the failures<br>>>>> occurred.<br>>>>>><br>>>>>> Thanks,<br>>>>>><br>>>>>> Andrew<br>>>>><br>>>>> _______________________________________________<br>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org<br>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker<br>>>>><br>>>>> Project Home: http://www.clusterlabs.org<br>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf<br>>>>> Bugs: http://bugs.clusterlabs.org<br>>> 
</div><br></div></div></body></html>