[Pacemaker] standby attribute and same resources running at the same time

Leon Fauster leonfauster at googlemail.com
Mon Mar 4 17:20:41 UTC 2013


Dear list,

apologies in advance if this is trivial - I have just started deploying an HA
environment in a test lab, so I do not have much experience yet.



I started to set up a 2-node cluster

  corosync-1.4.1-15.el6.x86_64
  pacemaker-1.1.8-7.el6.x86_64
  cman-3.0.12.1-49.el6.x86_64

on RHEL 6.3 and then switched to RHEL 6.4. 

This update brings some differences: the crm shell is gone and pcs has been added.
Anyway, I found equivalent commands to set up and configure the resources.
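For example, creating the IP resource from B. below looked roughly like this
(from memory, so the exact syntax may differ slightly):

  # RHEL 6.3, crm shell:
  crm configure primitive resIP ocf:heartbeat:IPaddr2 \
      params ip=192.168.201.220 nic=eth0 cidr_netmask=24 \
      op monitor interval=30s

  # RHEL 6.4, pcs:
  pcs resource create resIP ocf:heartbeat:IPaddr2 \
      ip=192.168.201.220 nic=eth0 cidr_netmask=24 \
      op monitor interval=30s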

So far so good. I am doing some stress tests now and noticed that when I reboot
one node (n2), that node (n2) is marked as standby in the cib (as shown on the
other node, n1).

After the reboot, crm_mon on that node (n2) shows the other node (n1) as
offline and begins to start the resources. Meanwhile the node that wasn't
rebooted (n1) still shows n2 as standby. At that point both nodes are running
the "same" resources. After a couple of minutes the cluster notices this
situation and both nodes renegotiate the current state. Then one node takes
over the responsibility for providing the resources. On both nodes the
previously rebooted node is still listed as standby.


  cat /var/log/messages |grep error
  Mar  4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resIP (ocf::IPaddr2) is active on 2 nodes attempting recovery
  Mar  4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resApache (ocf::apache) is active on 2 nodes attempting recovery
  Mar  4 17:32:33 cn1 pengine[1378]:    error: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-error-6.bz2
  Mar  4 17:32:48 cn1 crmd[1379]:   notice: run_graph: Transition 1 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-6.bz2): Complete


  crm_mon -1
  Last updated: Mon Mar  4 17:49:08 2013
  Last change: Mon Mar  4 10:22:53 2013 via crm_resource on cn1.localdomain
  Stack: cman
  Current DC: cn1.localdomain - partition with quorum
  Version: 1.1.8-7.el6-394e906
  2 Nodes configured, 2 expected votes
  2 Resources configured.

  Node cn2.localdomain: standby
  Online: [ cn1.localdomain ]

  resIP	(ocf::heartbeat:IPaddr2):	Started cn1.localdomain
  resApache	(ocf::heartbeat:apache):	Started cn1.localdomain


I checked the init scripts and found that the standby "behavior" comes
from a function that is called on "service pacemaker stop" (added in RHEL 6.4):

cman_pre_stop() 
{
    cname=`crm_node --name`
    crm_attribute -N $cname -n standby -v true -l reboot
    echo -n "Waiting for shutdown of managed resources"
...


I could not delete the standby attribute with

  crm_attribute -G --node=cn2.localdomain -n standby

(although -G is actually the query option, not a delete).
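If I read the init script above correctly, the attribute is set with a
"reboot" lifetime, so my (untested) guess is that a delete needs the -D
option together with the same lifetime:

  crm_attribute --node=cn2.localdomain -n standby -l reboot -D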



okay - recap: 

1st: there is a delay (after rebooting) during which the two nodes don't see
each other, and the result is resources running on both nodes while they
should only run on one node. The cluster corrects this by itself, but this
situation should not happen in the first place.

2nd: the standby attribute (and there must be a reason why Red Hat added
this) prevents resources from migrating to that node. How do I delete
this attribute?
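
As an aside on the 1st point: I am currently running with stonith-enabled=false
and no-quorum-policy=ignore (see B. below), which presumably widens that
dual-active window. If fencing is the answer here, I suppose configuring a real
fence device would look roughly like this (fence_ipmilan, the address and the
credentials are just placeholders on my side):

  pcs stonith create fence_cn2 fence_ipmilan \
      pcmk_host_list="cn2.localdomain" \
      ipaddr=192.168.201.2 login=admin passwd=secret
  pcs property set stonith-enabled=true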

I appreciate any comments.

--
Leon



A. $ cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
 <cluster name="HA" config_version="5">
   <logging debug="off"/>
   <clusternodes>
     <clusternode name="cn1.localdomain" votes="1" nodeid="1">
       <fence>
         <method name="pcmk-redirect">
           <device name="pcmk" port="cn1.localdomain"/>
         </method>
       </fence>
     </clusternode>
     <clusternode name="cn2.localdomain" votes="1" nodeid="2">
       <fence>
         <method name="pcmk-redirect">
           <device name="pcmk" port="cn2.localdomain"/>
         </method>
       </fence>
     </clusternode>
   </clusternodes>
   <fencedevices>
     <fencedevice name="pcmk" agent="fence_pcmk"/>
   </fencedevices>
   <rm>
     <failoverdomains/>
     <resources/>
   </rm>
 </cluster>


B. $ pcs config
Corosync Nodes:
 
Pacemaker Nodes:
 cn1.localdomain cn2.localdomain 

Resources: 
 Resource: resIP (provider=heartbeat type=IPaddr2 class=ocf)
  Attributes: ip=192.168.201.220 nic=eth0 cidr_netmask=24 
  Operations: monitor interval=30s
 Resource: resApache (provider=heartbeat type=apache class=ocf)
  Attributes: httpd=/usr/sbin/httpd configfile=/etc/httpd/conf/httpd.conf 
  Operations: monitor interval=1min

Location Constraints:
Ordering Constraints:
  start resApache then start resIP
Colocation Constraints:
  resIP with resApache

Cluster Properties:
 dc-version: 1.1.8-7.el6-394e906
 cluster-infrastructure: cman
 expected-quorum-votes: 2
 stonith-enabled: false
 no-quorum-policy: ignore
