[Pacemaker] CRM checks resource on bad-nodes (in a non-symmetric cluster)

Arthur Holstvoogd a.holstvoogd at nedforce.nl
Tue Sep 9 12:11:38 EDT 2008

I'm setting up a cluster with CRM over heartbeat and I keep running into 
trouble with resources that are being checked on nodes that don't have 
them. The setup is pretty simple: we have 4 nodes, two physical servers 
and two virtual servers (Xen), in an asymmetric cluster. The Xen servers 
have to run drbd (primary/secondary), an iscsi-target and a third daemon. 
(The physical servers don't run anything yet, but will have to mount stuff 
and start more Xen guests later on. That's why they are in the cluster.)
This is the CIB XML, pretty self-explanatory I guess:

       <cluster_property_set id="cib-bootstrap-options">
           <nvpair id="cib-bootstrap-options-symmetric-cluster" 
name="symmetric-cluster" value="false"/>
       <master_slave id="ms-san">
         <meta_attributes id="ma-ms-san">
             <nvpair id="ma-ms-san-1" name="clone_max" value="2"/>
             <nvpair id="ma-ms-san-2" name="clone_node_max" value="1"/>
             <nvpair id="ma-ms-san-3" name="master_max" value="1"/>
             <nvpair id="ma-ms-san-4" name="master_node_max" value="1"/>
             <nvpair id="ma-ms-san-5" name="notify" value="yes"/>
             <nvpair id="ma-ms-san-6" name="globally_unique" value="false"/>
         <primitive id="drbd-san" class="ocf" provider="heartbeat" 
           <instance_attributes id="9002a0e4-28d2-4ca7-83d8-74cd7ac066e8">
               <nvpair name="drbd_resource" value="san" 
             <op name="monitor" interval="29s" timeout="10s" 
role="Master" id="714ea049-f14d-4b09-b856-8b374252e1de"/>
             <op name="monitor" interval="30s" timeout="10s" 
role="Slave" id="6c7ce46c-7fe5-4d22-8a31-eae6b2927711"/>
       <group id="iscsi-cluster">
         <primitive class="ocf" provider="heartbeat" type="IPaddr2" 
           <instance_attributes id="ia-iscsi-target-ip">
               <nvpair id="ia-iscsi-target-ip-1" name="ip" 
               <nvpair id="ia-iscsi-target-ip-2" name="nic" value="eth0"/>
             <op id="iscsi-target-ip-monitor-0" name="monitor" 
interval="20s" timeout="10s"/>
         <primitive id="iscsi-target" class="lsb" type="iscsi-target"/>
       <group id="puppet-cluster">
         <primitive class="ocf" provider="heartbeat" type="IPaddr2" 
           <instance_attributes id="ia-puppet-master-ip">
               <nvpair id="puppet-master-ip-1" name="ip" value=""/>
               <nvpair id="puppet-master-ip-2" name="nic" value="eth0"/>
             <op id="puppet-master-ip-monitor-0" name="monitor" 
interval="60s" timeout="10s"/>
         <primitive class="lsb" id="puppet-master" type="puppetmaster"/>
       <rsc_location id="san-placement-1" rsc="ms-san">
         <rule id="san-rule-1" score="INFINITY" boolean_op="or">
           <expression id="exp-01" value="en1-r1-san1" 
attribute="#uname" operation="eq"/>
           <expression id="exp-02" value="en1-r1-san2" 
attribute="#uname" operation="eq"/>
       <rsc_location id="iscsi-placement-1" rsc="iscsi-cluster">
         <rule id="iscsi-rule-1" score="INFINITY" boolean_op="or">
           <expression id="exp-03" value="en1-r1-san1" 
attribute="#uname" operation="eq"/>
           <expression id="exp-04" value="en1-r1-san2" 
attribute="#uname" operation="eq"/>
       <rsc_location id="puppet-placement-1" rsc="puppet-cluster">
         <rule id="puppet-rule-1" score="INFINITY" boolean_op="or">
           <expression id="exp-05" value="en1-r1-san1" 
attribute="#uname" operation="eq"/>
           <expression id="exp-06" value="en1-r1-san2" 
attribute="#uname" operation="eq"/>
       <rsc_order id="iscsi_promotes_ms-san" from="iscsi-cluster" 
action="start" to="ms-san" to_action="promote" type="after"/>
       <rsc_colocation id="iscsi_on_san" to="ms-san" to_role="Master" 
from="iscsi-cluster" score="INFINITY"/>

Oh yeah, the nodes are en1-r1-san1, en1-r1-san2 (virtual servers) and 
en1-r1-srv1, en1-r1-srv2 (physical servers)

A couple of problems arise when we start the cluster:
- CRM tries to run /etc/init.d/puppetmaster status and 
/etc/init.d/iscsi-target status on srv1 and srv2, which fails because 
they don't have these daemons installed. Because it can't tell whether 
the daemons are running, it doesn't start them on san1 or san2.
- CRM looks for the drbdadm tool on srv1 and srv2 using 'which' 
(probably as defined in the OCF script for drbd). This fails, and the 
drbd resources get started on san1 and san2. The logs show me this:
Sep  9 15:39:48 en1-r1-srv1 crmd: [8012]: info: do_lrm_rsc_op: 
Performing op=drbd-san:1_monitor_0 
Sep  9 15:39:48 en1-r1-srv1 lrmd: [8009]: info: rsc:drbd-san:1: monitor
Sep  9 15:39:48 en1-r1-srv1 lrmd: [8009]: info: RA output: 
(drbd-san:1:monitor:stderr) which: no drbdadm in (/usr/ ... )
Sep  9 15:39:48 en1-r1-srv1 drbd[8088]: [8099]: ERROR: Setup problem: 
Couldn't find utility drbdadm
Sep  9 15:39:48 en1-r1-srv1 crmd: [8012]: ERROR: process_lrm_event: LRM 
operation drbd-san:1_monitor_0 (call=7, rc=5) Error not installed
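
For reading the log above: monitor_0 is the one-off probe the CRM runs to find out whether a resource is already active on a node, and rc=5 is the OCF return code for "not installed". A minimal sketch of a lookup for the standard OCF return codes follows. The function name ocf_rc_name is made up for illustration and is not part of heartbeat or pacemaker.

```shell
#!/bin/sh
# Map an OCF resource-agent exit code to its standard meaning.
# ocf_rc_name is a hypothetical helper, not a heartbeat/pacemaker tool.
ocf_rc_name() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "generic error" ;;
        2) echo "invalid arguments" ;;
        3) echo "unimplemented" ;;
        4) echo "insufficient permissions" ;;
        5) echo "not installed" ;;
        6) echo "not configured" ;;
        7) echo "not running" ;;
        8) echo "running as master" ;;
        9) echo "failed as master" ;;
        *) echo "unknown" ;;
    esac
}

# The code reported for drbd-san:1_monitor_0 on en1-r1-srv1.
ocf_rc_name 5
```

A probe that returns 7 ("not running") is the normal answer on a node where the resource is simply inactive, so the two codes are worth telling apart when reading these logs.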

When I try to stop heartbeat on a node it ends up in a deadlock because 
CRM tries to stop drbd-san:0 and drbd-san:1 on that node (via the LRM, I 
think). I get this in the logs on a stop, and it keeps repeating every 
minute:
Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: info: do_shutdown: All 
subsystems stopped, continuing
Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: ERROR: verify_stopped: 
Resource drbd-san:1 was active at shutdown.  You may ignore this error 
if it is unmanaged.
Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: ERROR: verify_stopped: 
Resource drbd-san:0 was active at shutdown.  You may ignore this error 
if it is unmanaged.

The very first problem I've solved with a dummy script and some 
symlinks. Now the whole cluster does start properly, except for some 
'can't find drbdadm' errors, but I can't stop it properly. I can call 
stop on it, wait for the 'can't stop drbd-san:*' errors, then clean 
those resources with crm_resource -C, and then heartbeat will go down.
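
A stand-in init script of this kind might look roughly like the sketch below. It is an illustration, not the exact script used here. It relies only on the LSB conventions that "status" exits with 3 when the service is not running and that stopping an already stopped service counts as success. The function name lsb_stub is hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for /etc/init.d/<service> on nodes where the
# daemon is not installed, so that a "status" check gets a clean answer.
lsb_stub() {
    case "$1" in
        status)
            # LSB: exit status 3 from "status" means the program is not running.
            return 3 ;;
        stop)
            # LSB: stopping a service that is already stopped is a success.
            return 0 ;;
        start)
            # LSB: exit status 5 means the program is not installed.
            return 5 ;;
        *)
            echo "Usage: $0 {start|stop|status}" >&2
            # LSB: exit status 2 means invalid or excess arguments.
            return 2 ;;
    esac
}

# Only dispatch when an action is given, so the file can also be sourced.
if [ $# -gt 0 ]; then
    lsb_stub "$1"
    exit $?
fi
```

Symlinked under the init script names the cluster calls on srv1 and srv2, a stub like this would answer "not running" to a status check instead of failing because the script is missing.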

ciblint is giving me some interesting errors too which seem related:

Does anybody have a clue what I'm doing wrong? I'm at a loss here.
I've considered moving to OpenAIS instead of heartbeat, but that can't 
really be the problem, now can it?
I'm running it all on CentOS 5; pacemaker is packaged with heartbeat 
2.1.3 on it.

Any help, pointers or suggestions would be very much appreciated!
Arthur Holstvoogd
