[Pacemaker] Nodes will not promote DRBD resources to master on failover

Andrew Martin amartin at xes-inc.com
Fri Mar 30 10:10:43 EDT 2012


Hi Andreas, 


Here is a copy of my complete CIB: 
http://pastebin.com/v5wHVFuy 


I'll work on generating a report using crm_report as well. 


Thanks, 


Andrew 

----- Original Message -----

From: "Andreas Kurz" <andreas at hastexo.com> 
To: pacemaker at oss.clusterlabs.org 
Sent: Friday, March 30, 2012 4:41:16 AM 
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover 

On 03/28/2012 04:56 PM, Andrew Martin wrote: 
> Hi Andreas, 
> 
> I disabled the DRBD init script and then restarted the slave node 
> (node2). After it came back up, DRBD did not start: 
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending 
> Online: [ node2 node1 ] 
> 
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
> Masters: [ node1 ] 
> Stopped: [ p_drbd_vmstore:1 ] 
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
> Masters: [ node1 ] 
> Stopped: [ p_drbd_mount1:1 ] 
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
> Masters: [ node1 ] 
> Stopped: [ p_drbd_mount2:1 ] 
> ... 
> 
> root at node2:~# service drbd status 
> drbd not loaded 

Yes, that's expected unless Pacemaker starts DRBD. 
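
A quick way to confirm that, assuming the crm shell used elsewhere in this 
thread (resource name taken from the configuration below): 

    crm resource status ms_drbd_vmstore
    cat /proc/drbd   # exists only once the drbd module has been loaded (here: by Pacemaker)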

> 
> Is there something else I need to change in the CIB to ensure that DRBD 
> is started? All of my DRBD devices are configured like this: 
> primitive p_drbd_mount2 ocf:linbit:drbd \ 
> params drbd_resource="mount2" \ 
> op monitor interval="15" role="Master" \ 
> op monitor interval="30" role="Slave" 
> ms ms_drbd_mount2 p_drbd_mount2 \ 
> meta master-max="1" master-node-max="1" clone-max="2" 
> clone-node-max="1" notify="true" 

That should be enough ... I can't say more without seeing the complete 
configuration ... too many fragments of information ;-) 

Please provide (e.g. via pastebin) your complete CIB (cibadmin -Q) while the 
cluster is in that state ... or, even better, create a crm_report archive. 
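
For example (the time and file names here are only placeholders): 

    cibadmin -Q > /tmp/cib-node2.xml
    crm_report -f "2012-03-28 09:00" /tmp/drbd-promotion-report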

> 
> Here is the output from the syslog (grep -i drbd /var/log/syslog): 
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing 
> key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc 
> op=p_drbd_vmstore:1_monitor_0 ) 
> Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2] 
> (pid 3455) 
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing 
> key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc 
> op=p_drbd_mount1:1_monitor_0 ) 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3] 
> (pid 3456) 
> Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing 
> key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc 
> op=p_drbd_mount2:1_monitor_0 ) 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4] 
> (pid 3457) 
> Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find 
> device [/dev/drbd0]. Expected /dev/??? to exist 
> Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked: 
> crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D 
> Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked: 
> crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D 
> Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked: 
> crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on 
> p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on 
> p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7 
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM 
> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10, 
> confirmed=true) not running 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on 
> p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7 
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM 
> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11, 
> confirmed=true) not running 
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM 
> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12, 
> confirmed=true) not running 

No errors, just probing ... so for some reason Pacemaker does not want to 
start it ... use crm_simulate to find out why, or provide the information 
requested above. 
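
A minimal crm_simulate run for that, assuming the pacemaker 1.1.6 tools used 
in this thread, would be something like: 

    crm_simulate -s -L                      # live cluster, show allocation scores
    crm_simulate -s -x /tmp/cib-node2.xml   # or against a saved copy of the CIB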

Regards, 
Andreas 

-- 
Need help with Pacemaker? 
http://www.hastexo.com/now 

> 
> Thanks, 
> 
> Andrew 
> 
> ------------------------------------------------------------------------ 
> *From: *"Andreas Kurz" <andreas at hastexo.com> 
> *To: *pacemaker at oss.clusterlabs.org 
> *Sent: *Wednesday, March 28, 2012 9:03:06 AM 
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
> master on failover 
> 
> On 03/28/2012 03:47 PM, Andrew Martin wrote: 
>> Hi Andreas, 
>> 
>>> hmm ... what is that fence-peer script doing? If you want to use 
>>> resource-level fencing with the help of dopd, activate the 
>>> drbd-peer-outdater script in the line above ... and double check if the 
>>> path is correct 
>> fence-peer is just a wrapper for drbd-peer-outdater that does some 
>> additional logging. In my testing dopd has been working well. 
> 
> I see 
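
For reference, the dopd-based resource-level fencing mentioned above would 
look roughly like this in drbd.conf (stock drbd-peer-outdater path assumed; 
double-check it on your install): 

    disk {
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }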
> 
>> 
>>>> I am thinking of making the following changes to the CIB (as per the 
>>>> official DRBD 
>>>> guide 
>> 
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in 
>>>> order to add the DRBD lsb service and require that it start before the 
>>>> ocf:linbit:drbd resources. Does this look correct? 
>>> 
>>> Where did you read that? No, deactivate the startup of DRBD on system 
>>> boot and let Pacemaker manage it completely. 
>>> 
>>>> primitive p_drbd-init lsb:drbd op monitor interval="30" 
>>>> colocation c_drbd_together inf: 
>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master 
>>>> ms_drbd_mount2:Master 
>>>> order drbd_init_first inf: ms_drbd_vmstore:promote 
>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start 
>>>> 
>>>> This doesn't seem to require that drbd be also running on the node where 
>>>> the ocf:linbit:drbd resources are slave (which it would need to do to be 
>>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere? 
>>>> (clone cl_drbd p_drbd-init ?) 
>>> 
>>> This is really not needed. 
>> I was following the official DRBD Users Guide: 
>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html 
>> 
>> If I am understanding your previous message correctly, I do not need to 
>> add a lsb primitive for the drbd daemon? It will be 
>> started/stopped/managed automatically by my ocf:linbit:drbd resources 
>> (and I can remove the /etc/rc* symlinks)? 
> 
> Yes, you don't need that LSB script when using Pacemaker and should not 
> let init start it. 
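
On Ubuntu 10.04 that would be something like (assuming the init script is 
named drbd): 

    /etc/init.d/drbd stop
    update-rc.d -f drbd remove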
> 
> Regards, 
> Andreas 
> 
> -- 
> Need help with Pacemaker? 
> http://www.hastexo.com/now 
> 
>> 
>> Thanks, 
>> 
>> Andrew 
>> 
>> ------------------------------------------------------------------------ 
>> *From: *"Andreas Kurz" <andreas at hastexo.com <mailto:andreas at hastexo.com>> 
>> *To: *pacemaker at oss.clusterlabs.org <mailto:pacemaker at oss.clusterlabs.org> 
>> *Sent: *Wednesday, March 28, 2012 7:27:34 AM 
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
>> master on failover 
>> 
>> On 03/28/2012 12:13 AM, Andrew Martin wrote: 
>>> Hi Andreas, 
>>> 
>>> Thanks, I've updated the colocation rule to be in the correct order. I 
>>> also enabled the STONITH resource (this was temporarily disabled before 
>>> for some additional testing). DRBD has its own network connection over 
>>> the br1 interface (192.168.5.0/24 network), a direct crossover cable 
>>> between node1 and node2: 
>>> global { usage-count no; } 
>>> common { 
>>> syncer { rate 110M; } 
>>> } 
>>> resource vmstore { 
>>> protocol C; 
>>> startup { 
>>> wfc-timeout 15; 
>>> degr-wfc-timeout 60; 
>>> } 
>>> handlers { 
>>> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5"; 
>>> fence-peer "/usr/local/bin/fence-peer"; 
>> 
>> hmm ... what is that fence-peer script doing? If you want to use 
>> resource-level fencing with the help of dopd, activate the 
>> drbd-peer-outdater script in the line above ... and double check if the 
>> path is correct 
>> 
>>> split-brain "/usr/lib/drbd/notify-split-brain.sh 
>>> me at example.com <mailto:me at example.com>"; 
>>> } 
>>> net { 
>>> after-sb-0pri discard-zero-changes; 
>>> after-sb-1pri discard-secondary; 
>>> after-sb-2pri disconnect; 
>>> cram-hmac-alg md5; 
>>> shared-secret "xxxxx"; 
>>> } 
>>> disk { 
>>> fencing resource-only; 
>>> } 
>>> on node1 { 
>>> device /dev/drbd0; 
>>> disk /dev/sdb1; 
>>> address 192.168.5.10:7787; 
>>> meta-disk internal; 
>>> } 
>>> on node2 { 
>>> device /dev/drbd0; 
>>> disk /dev/sdf1; 
>>> address 192.168.5.11:7787; 
>>> meta-disk internal; 
>>> } 
>>> } 
>>> # and similar for mount1 and mount2 
>>> 
>>> Also, here is my ha.cf. It uses both the direct link between the nodes 
>>> (br1) and the shared LAN network on br0 for communicating: 
>>> autojoin none 
>>> mcast br0 239.0.0.43 694 1 0 
>>> bcast br1 
>>> warntime 5 
>>> deadtime 15 
>>> initdead 60 
>>> keepalive 2 
>>> node node1 
>>> node node2 
>>> node quorumnode 
>>> crm respawn 
>>> respawn hacluster /usr/lib/heartbeat/dopd 
>>> apiauth dopd gid=haclient uid=hacluster 
>>> 
>>> I am thinking of making the following changes to the CIB (as per the 
>>> official DRBD 
>>> guide 
>> 
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html) in 
>>> order to add the DRBD lsb service and require that it start before the 
>>> ocf:linbit:drbd resources. Does this look correct? 
>> 
>> Where did you read that? No, deactivate the startup of DRBD on system 
>> boot and let Pacemaker manage it completely. 
>> 
>>> primitive p_drbd-init lsb:drbd op monitor interval="30" 
>>> colocation c_drbd_together inf: 
>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master 
>>> ms_drbd_mount2:Master 
>>> order drbd_init_first inf: ms_drbd_vmstore:promote 
>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start 
>>> 
>>> This doesn't seem to require that drbd be also running on the node where 
>>> the ocf:linbit:drbd resources are slave (which it would need to do to be 
>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere? 
>>> (clone cl_drbd p_drbd-init ?) 
>> 
>> This is really not needed. 
>> 
>> Regards, 
>> Andreas 
>> 
>> -- 
>> Need help with Pacemaker? 
>> http://www.hastexo.com/now 
>> 
>>> 
>>> Thanks, 
>>> 
>>> Andrew 
>>> ------------------------------------------------------------------------ 
>>> *From: *"Andreas Kurz" <andreas at hastexo.com <mailto:andreas at hastexo.com>> 
>>> *To: *pacemaker at oss.clusterlabs.org 
> <mailto:*pacemaker at oss.clusterlabs.org> 
>>> *Sent: *Monday, March 26, 2012 5:56:22 PM 
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
>>> master on failover 
>>> 
>>> On 03/24/2012 08:15 PM, Andrew Martin wrote: 
>>>> Hi Andreas, 
>>>> 
>>>> My complete cluster configuration is as follows: 
>>>> ============ 
>>>> Last updated: Sat Mar 24 13:51:55 2012 
>>>> Last change: Sat Mar 24 13:41:55 2012 
>>>> Stack: Heartbeat 
>>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition 
>>>> with quorum 
>>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 
>>>> 3 Nodes configured, unknown expected votes 
>>>> 19 Resources configured. 
>>>> ============ 
>>>> 
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE 
> (standby) 
>>>> Online: [ node2 node1 ] 
>>>> 
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
>>>> Masters: [ node2 ] 
>>>> Slaves: [ node1 ] 
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
>>>> Masters: [ node2 ] 
>>>> Slaves: [ node1 ] 
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
>>>> Masters: [ node2 ] 
>>>> Slaves: [ node1 ] 
>>>> Resource Group: g_vm 
>>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2 
>>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2 
>>>> Clone Set: cl_daemons [g_daemons] 
>>>> Started: [ node2 node1 ] 
>>>> Stopped: [ g_daemons:2 ] 
>>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify] 
>>>> Started: [ node2 node1 ] 
>>>> Stopped: [ p_sysadmin_notify:2 ] 
>>>> stonith-node1(stonith:external/tripplitepdu):Started node2 
>>>> stonith-node2(stonith:external/tripplitepdu):Started node1 
>>>> Clone Set: cl_ping [p_ping] 
>>>> Started: [ node2 node1 ] 
>>>> Stopped: [ p_ping:2 ] 
>>>> 
>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \ 
>>>> attributes standby="off" 
>>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \ 
>>>> attributes standby="off" 
>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \ 
>>>> attributes standby="on" 
>>>> primitive p_drbd_mount2 ocf:linbit:drbd \ 
>>>> params drbd_resource="mount2" \ 
>>>> op monitor interval="15" role="Master" \ 
>>>> op monitor interval="30" role="Slave" 
>>>> primitive p_drbd_mount1 ocf:linbit:drbd \ 
>>>> params drbd_resource="mount1" \ 
>>>> op monitor interval="15" role="Master" \ 
>>>> op monitor interval="30" role="Slave" 
>>>> primitive p_drbd_vmstore ocf:linbit:drbd \ 
>>>> params drbd_resource="vmstore" \ 
>>>> op monitor interval="15" role="Master" \ 
>>>> op monitor interval="30" role="Slave" 
>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \ 
>>>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \ 
>>>> op start interval="0" timeout="60s" \ 
>>>> op stop interval="0" timeout="60s" \ 
>>>> op monitor interval="20s" timeout="40s" 
>>>> primitive p_libvirt-bin upstart:libvirt-bin \ 
>>>> op monitor interval="30" 
>>>> primitive p_ping ocf:pacemaker:ping \ 
>>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11" 
>>>> multiplier="1000" \ 
>>>> op monitor interval="20s" 
>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \ 
>>>> params email="me at example.com <mailto:me at example.com>" \ 
>>>> params subject="Pacemaker Change" \ 
>>>> op start interval="0" timeout="10" \ 
>>>> op stop interval="0" timeout="10" \ 
>>>> op monitor interval="10" timeout="10" 
>>>> primitive p_vm ocf:heartbeat:VirtualDomain \ 
>>>> params config="/vmstore/config/vm.xml" \ 
>>>> meta allow-migrate="false" \ 
>>>> op start interval="0" timeout="120s" \ 
>>>> op stop interval="0" timeout="120s" \ 
>>>> op monitor interval="10" timeout="30" 
>>>> primitive stonith-node1 stonith:external/tripplitepdu \ 
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx" 
>>>> pdu_password="xxx" hostname_to_stonith="node1" 
>>>> primitive stonith-node2 stonith:external/tripplitepdu \ 
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx" 
>>>> pdu_password="xxx" hostname_to_stonith="node2" 
>>>> group g_daemons p_libvirt-bin 
>>>> group g_vm p_fs_vmstore p_vm 
>>>> ms ms_drbd_mount2 p_drbd_mount2 \ 
>>>> meta master-max="1" master-node-max="1" clone-max="2" 
>>>> clone-node-max="1" notify="true" 
>>>> ms ms_drbd_mount1 p_drbd_mount1 \ 
>>>> meta master-max="1" master-node-max="1" clone-max="2" 
>>>> clone-node-max="1" notify="true" 
>>>> ms ms_drbd_vmstore p_drbd_vmstore \ 
>>>> meta master-max="1" master-node-max="1" clone-max="2" 
>>>> clone-node-max="1" notify="true" 
>>>> clone cl_daemons g_daemons 
>>>> clone cl_ping p_ping \ 
>>>> meta interleave="true" 
>>>> clone cl_sysadmin_notify p_sysadmin_notify 
>>>> location l-st-node1 stonith-node1 -inf: node1 
>>>> location l-st-node2 stonith-node2 -inf: node2 
>>>> location l_run_on_most_connected p_vm \ 
>>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping 
>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master 
>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm 
>>> 
>>> As Emmanuel already said, g_vm has to come first in this colocation 
>>> constraint ... g_vm must be colocated with the DRBD masters. 
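
That is, something like (just reordering the existing constraint): 

    colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master \
        ms_drbd_mount1:Master ms_drbd_mount2:Master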
>>> 
>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote 
>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start 
>>>> property $id="cib-bootstrap-options" \ 
>>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \ 
>>>> cluster-infrastructure="Heartbeat" \ 
>>>> stonith-enabled="false" \ 
>>>> no-quorum-policy="stop" \ 
>>>> last-lrm-refresh="1332539900" \ 
>>>> cluster-recheck-interval="5m" \ 
>>>> crmd-integration-timeout="3m" \ 
>>>> shutdown-escalation="5m" 
>>>> 
>>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite 
>>>> PDUMH20ATNET that I'm using as the STONITH device: 
>>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf 
>>> 
>>> And why aren't you using it? ... stonith-enabled="false" 
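
Once the STONITH resources are trusted again, turning it back on is a 
one-liner, e.g.: 

    crm configure property stonith-enabled=true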
>>> 
>>>> 
>>>> As you can see, I left the DRBD service to be started by the operating 
>>>> system (as an lsb script at boot time) however Pacemaker controls 
>>>> actually bringing up/taking down the individual DRBD devices. 
>>> 
>>> Don't start drbd at system boot; give Pacemaker full control. 
>>> 
>>> The 
>>>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on 
>>>> node1 and failover successfully to node2. During this time, node2 fences 
>>>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile 
>>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then 
>>>> bring them back up. They reconnect to the cluster (with quorum), and 
>>>> node1's DRBD devices are still Outdated as expected and node2's DRBD 
>>>> devices are still UpToDate, as expected. At this point, DRBD starts on 
>>>> both nodes, however node2 will not set DRBD as master: 
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE 
> (standby) 
>>>> Online: [ node2 node1 ] 
>>>> 
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
>>>> Slaves: [ node1 node2 ] 
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
>>>> Slaves: [ node1 node2 ] 
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
>>>> Slaves: [ node1 node2 ] 
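
One thing that may be worth ruling out here (a general crm shell note, not 
something raised in the thread): "crm resource migrate" leaves a cli- location 
constraint in the CIB until it is explicitly cleared, e.g.: 

    crm resource unmigrate p_vm   # removes the constraint added by "migrate"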
>>> 
>>> There should really be no interruption of the DRBD replication during VM 
>>> migration that activates dopd ... does drbd have its own direct network 
>>> connection? 
>>> 
>>> Please share your ha.cf file and your drbd configuration. Watch out for 
>>> drbd messages in your kernel log file, that should give you additional 
>>> information when/why the drbd connection was lost. 
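
For example (log location assumed for Ubuntu 10.04): 

    grep -i drbd /var/log/kern.log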
>>> 
>>> Regards, 
>>> Andreas 
>>> 
>>> -- 
>>> Need help with Pacemaker? 
>>> http://www.hastexo.com/now 
>>> 
>>>> 
>>>> I am having trouble sorting through the logging information because 
>>>> there is so much of it in /var/log/daemon.log, but I can't find an 
>>>> error message printed about why it will not promote node2. At this point 
>>>> the DRBD devices are as follows: 
>>>> node2: cstate = WFConnection dstate=UpToDate 
>>>> node1: cstate = StandAlone dstate=Outdated 
>>>> 
>>>> I don't see any reason why node2 can't become DRBD master, or am I 
>>>> missing something? If I do "drbdadm connect all" on node1, then the 
>>>> cstate on both nodes changes to "Connected" and node2 immediately 
>>>> promotes the DRBD resources to master. Any ideas on why I'm observing 
>>>> this incorrect behavior? 
>>>> 
>>>> Any tips on how I can better filter through the pacemaker/heartbeat logs 
>>>> or how to get additional useful debug information? 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Andrew 
>>>> 
>>>> ------------------------------------------------------------------------ 
>>>> *From: *"Andreas Kurz" <andreas at hastexo.com 
> <mailto:andreas at hastexo.com>> 
>>>> *To: *pacemaker at oss.clusterlabs.org 
>> <mailto:*pacemaker at oss.clusterlabs.org> 
>>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM 
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
>>>> master on failover 
>>>> 
>>>> On 01/25/2012 08:58 PM, Andrew Martin wrote: 
>>>>> Hello, 
>>>>> 
>>>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6 
>>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster 
> includes 
>>>>> the following resources: 
>>>>> - primitives for DRBD storage devices 
>>>>> - primitives for mounting the filesystem on the DRBD storage 
>>>>> - primitives for some mount binds 
>>>>> - primitive for starting apache 
>>>>> - primitives for starting samba and nfs servers (following instructions 
>>>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>) 
>>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs) 
>>>> 
>>>> not enough information ... please share at least your complete cluster 
>>>> configuration 
>>>> 
>>>> Regards, 
>>>> Andreas 
>>>> 
>>>> -- 
>>>> Need help with Pacemaker? 
>>>> http://www.hastexo.com/now 
>>>> 
>>>>> 
>>>>> Perhaps this is best described through the output of crm_mon: 
>>>>> Online: [ node1 node2 ] 
>>>>> 
>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged) 
>>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 
>>> (unmanaged) 
>>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1 
>>>>> (unmanaged) FAILED 
>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
>>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1 
>>>>> (unmanaged) FAILED 
>>>>> Slaves: [ node2 ] 
>>>>> Resource Group: g_core 
>>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1 
>>>>> Resource Group: g_apache 
>>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_apache (ocf::heartbeat:apache): Started node1 
>>>>> Resource Group: g_fileservers 
>>>>> p_lsb_smb (lsb:smbd): Started node1 
>>>>> p_lsb_nmb (lsb:nmbd): Started node1 
>>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1 
>>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1 
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started 
> node1 
>>>>> 
>>>>> I have read through the Pacemaker Explained 
>>>>> 
>>>> 
>>> 
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained> 
>>>>> documentation, however could not find a way to further debug these 
>>>>> problems. First, I put node1 into standby mode to attempt failover to 
>>>>> the other node (node2). Node2 appeared to start the transition to 
>>>>> master, however it failed to promote the DRBD resources to master (the 
>>>>> first step). I have attached a copy of this session in commands.log and 
>>>>> additional excerpts from /var/log/syslog during important steps. I have 
>>>>> attempted everything I can think of to try and start the DRBD resource 
>>>>> (e.g. start/stop/promote/manage/cleanup under crm resource, restarting 
>>>>> heartbeat) but cannot bring it out of the slave state. However, if 
> I set 
>>>>> it to unmanaged and then run drbdadm primary all in the terminal, 
>>>>> pacemaker is satisfied and continues starting the rest of the 
> resources. 
>>>>> It then failed when attempting to mount the filesystem for mount2, the 
>>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself 
> and was 
>>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and then 
>>>>> it mounted. The rest of the resources started as expected until the 
>>>>> p_exportfs_mount2 resource, which failed as follows: 
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 
>>>>> (unmanaged) FAILED 
>>>>> 
>>>>> I ran cleanup on this and it started, however when running this test 
>>>>> earlier today no command could successfully start this exportfs 
>> resource. 
>>>>> 
>>>>> How can I configure pacemaker to better resolve these problems and be 
>>>>> able to bring the node up successfully on its own? What can I check to 
>>>>> determine why these failures are occurring? /var/log/syslog did not seem 
>>>>> to contain very much useful information regarding why the failures 
>>>> occurred. 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Andrew 
>>>>> 
>>>>> 
>>>>> 
>>>>> 

_______________________________________________ 
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker 

Project Home: http://www.clusterlabs.org 
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
Bugs: http://bugs.clusterlabs.org 
