[Pacemaker] resource is too active problem in a 2-node cluster

Mon Feb 10 21:13:30 EST 2014

I have a 2-node cluster with no-quorum-policy=ignore. I refer to the nodes as node-0 and node-1 (gol-5-7-0 and gol-5-7-6 in the config and logs below). I also have two cluster resources in a group: an IP address and an OCF script.

Normally these resources are active on node-0. However, when I bounce pacemaker on node-1 (service pacemaker stop followed by service pacemaker start), the OCF resource gets bounced on node-0, which is unexpected and causes problems for my application. In the log messages I see that the monitor operation failed with "unknown error", which leads to a "resource is active on 2 nodes" error, and the recovery procedure then bounces the OCF resource. But when I run monitor on my OCF script manually, the return value is always either OCF_SUCCESS (0) or OCF_NOT_RUNNING (7).
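
It may be worth checking whether the manual run sees the same environment as the cluster. The sketch below is a minimal, hypothetical helper (the function names ocf_rc_name and manual_monitor are illustrative, not part of Pacemaker) that runs an agent's monitor action with the instance attributes from the CIB exported as OCF_RESKEY_* variables and decodes the return code. The agent path is passed in as an argument because its location is an assumption, and the OCF_ROOT default of /usr/lib/ocf is likewise an assumption about this install.

```shell
# Map an OCF exit code to its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        *) echo "UNKNOWN($1)" ;;
    esac
}

# Run "<agent> monitor" with the GOL-HA instance attributes from the CIB
# exported as OCF_RESKEY_* variables, then print the decoded return code.
# $1 is the path to the agent script (assumed; adjust for the local install).
manual_monitor() {
    agent="$1"
    rc=0
    OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}" \
    OCF_RESKEY_name="gol-ha" \
    OCF_RESKEY_file="/etc/init.d/gol-ha" \
        "$agent" monitor >/dev/null 2>&1 || rc=$?
    echo "rc=$rc ($(ocf_rc_name "$rc"))"
}
```

For example, manual_monitor /usr/lib/ocf/resource.d/redhat/script.sh (path assumed) could be run on the node where the resource is stopped. A probe expects 7 there, and the log below shows the probe on gol-5-7-6 returning 1 against a target of 7.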

I am using the following software versions:
   Pacemaker version: 1.1.10
   Corosync version: 1.4.1-15
   OS: CentOS 6.4

What am I doing wrong?

Below I am including the CIB config and the corresponding log messages.

<cib epoch="10" num_updates="94" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Jan  7 18:11:58 2014" update-origin="gol-5-7-0" update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="gol-5-7-0">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-1.el6_4.4-368c726"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-migration-threshold" name="migration-threshold" value="3"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="gol-5-7-6" uname="gol-5-7-6"/>
      <node id="gol-5-7-0" uname="gol-5-7-0"/>
    </nodes>
    <resources>
      <group id="Group">
        <primitive class="ocf" id="FAILOVER-INTER" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="FAILOVER-INTER-instance_attributes">
            <nvpair id="FAILOVER-INTER-instance_attributes-ip" name="ip" value="10.20.7.190"/>
            <nvpair id="FAILOVER-INTER-instance_attributes-nic" name="nic" value="eth1"/>
            <nvpair id="FAILOVER-INTER-instance_attributes-cidr_netmask" name="cidr_netmask" value="14"/>
          </instance_attributes>
          <operations>
            <op id="FAILOVER-INTER-monitor-interval-5s" interval="5s" name="monitor"/>
          </operations>
        </primitive>
        <primitive class="ocf" id="GOL-HA" provider="redhat" type="script.sh">
          <instance_attributes id="GOL-HA-instance_attributes">
            <nvpair id="GOL-HA-instance_attributes-name" name="name" value="gol-ha"/>
            <nvpair id="GOL-HA-instance_attributes-file" name="file" value="/etc/init.d/gol-ha"/>
          </instance_attributes>
          <operations>
            <op id="GOL-HA-monitor-interval-60s" interval="60s" name="monitor"/>
          </operations>
        </primitive>
      </group>
    </resources>
    <constraints/>
    <rsc_defaults>
      <meta_attributes id="rsc_defaults-options">
        <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="100"/>
      </meta_attributes>
    </rsc_defaults>
  </configuration>
</cib>

Corresponding log messages:

Feb 04 11:27:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
Feb 04 11:27:29 [45168] gol-5-7-0       crmd:   notice: crm_update_peer_state:     cman_event_callback: Node gol-5-7-6[2] - state is now member (was lost)
Feb 04 11:27:29 corosync [CPG   ] chosen downlist: sender r(0) ip(172.16.0.2) ; members(old:1 left:0)
Feb 04 11:27:29 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Feb 04 11:27:36 [45168] gol-5-7-0       crmd:   notice: do_state_transition:     State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Feb 04 11:27:38 [45166] gol-5-7-0      attrd:   notice: attrd_local_callback:     Sending full refresh (origin=crmd)
Feb 04 11:27:38 [45166] gol-5-7-0      attrd:   notice: attrd_trigger_update:     Sending flush op to all hosts for: fail-count-GOL-HA (5)
Feb 04 11:27:38 [45166] gol-5-7-0      attrd:   notice: attrd_trigger_update:     Sending flush op to all hosts for: last-failure-GOL-HA (1391444085)
Feb 04 11:27:38 [45166] gol-5-7-0      attrd:   notice: attrd_trigger_update:     Sending flush op to all hosts for: probe_complete (true)
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:   notice: unpack_config:     On loss of CCM Quorum: Ignore
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:  warning: unpack_rsc_op:     Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:   notice: process_pe_message:     Calculated Transition 1825: /var/lib/pacemaker/pengine/pe-input-45.bz2
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 7: monitor FAILOVER-INTER_monitor_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 8: monitor GOL-HA_monitor_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:  warning: status_from_rc:     Action 8 (GOL-HA_monitor_0) on gol-5-7-6 failed (target: 7 vs. rc: 1): Error
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 6: probe_complete probe_complete on gol-5-7-6 - no waiting
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:   notice: run_graph:     Transition 1825 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-45.bz2): Stopped
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:   notice: unpack_config:     On loss of CCM Quorum: Ignore
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:  warning: unpack_rsc_op:     Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:  warning: unpack_rsc_op:     Processing failed op monitor for GOL-HA on gol-5-7-6: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:    error: native_create_actions:     Resource GOL-HA (ocf::script.sh) is active on 2 nodes attempting recovery
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:   notice: LogActions:     Recover GOL-HA    (Started gol-5-7-0)
Feb 04 11:27:38 [45167] gol-5-7-0    pengine:    error: process_pe_message:     Calculated Transition 1826: /var/lib/pacemaker/pengine/pe-error-3.bz2
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 10: stop GOL-HA_stop_0 on gol-5-7-0 (local)
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 3: stop GOL-HA_stop_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 7: probe_complete probe_complete on gol-5-7-6 - no waiting
Feb 04 11:27:39 [45168] gol-5-7-0       crmd:   notice: process_lrm_event:     LRM operation GOL-HA_stop_0 (call=111, rc=0, cib-update=1953, confirmed=true) ok
Feb 04 11:27:39 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 11: start GOL-HA_start_0 on gol-5-7-0 (local)
Feb 04 11:27:40 [45168] gol-5-7-0       crmd:   notice: process_lrm_event:     LRM operation GOL-HA_start_0 (call=115, rc=0, cib-update=1954, confirmed=true) ok
Feb 04 11:27:40 [45168] gol-5-7-0       crmd:   notice: te_rsc_command:     Initiating action 1: monitor GOL-HA_monitor_60000 on gol-5-7-0 (local)
Feb 04 11:27:40 [45168] gol-5-7-0       crmd:   notice: process_lrm_event:     LRM operation GOL-HA_monitor_60000 (call=118, rc=0, cib-update=1955, confirmed=false) ok
Feb 04 11:27:40 [45168] gol-5-7-0       crmd:   notice: run_graph:     Transition 1826 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
Feb 04 11:27:40 [45168] gol-5-7-0       crmd:   notice: do_state_transition:     State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
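
To pull the probe results out of a log like the one above, something along these lines may help. It is a rough sketch: failed_probes is an illustrative name, and the pattern only matches crmd status_from_rc warnings for probes (monitor_0 operations) in the format shown above.

```shell
# Read Pacemaker log lines on stdin and print one line per failed probe:
#   <operation> <node> expected=<target rc> got=<actual rc>
# Only crmd "status_from_rc" warnings for *_monitor_0 operations are matched.
failed_probes() {
    sed -n 's/.*status_from_rc:.*Action [0-9]* (\([^)]*_monitor_0\)) on \([^ ]*\) failed (target: \([0-9]*\) vs\. rc: \([0-9]*\)).*/\1 \2 expected=\3 got=\4/p'
}
```

Usage would be something like failed_probes < /var/log/cluster/corosync.log, where the log path is an assumption about this setup. For the excerpt above it reports the GOL-HA probe on gol-5-7-6 with target 7 and rc 1.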