[Pacemaker] resource is too active problem in a 2-node cluster
Aggarwal, Ajay
aaggarwal at verizon.com
Mon Feb 10 21:13:30 EST 2014
I have a 2-node cluster with no-quorum-policy=ignore. I will call the nodes node-0 and node-1. I also have two cluster resources in a group: an IP address and an OCF script.
Normally these resources are active on node-0. However, when I bounce Pacemaker on node-1 (service pacemaker stop followed by service pacemaker start), the OCF resource gets restarted on node-0, which is unexpected and causes problems for my application. In the log messages I see that the monitor operation failed with "unknown error", which leads to a "resource is active on 2 nodes" error, and the recovery procedure then bounces the OCF resource. But when I run monitor manually on my OCF script, the return value is always either OCF_SUCCESS (0) or OCF_NOT_RUNNING (7).
I am using the following software versions:
Pacemaker version: 1.1.10
Corosync version: 1.4.1-15
OS: CentOS 6.4
What am I doing wrong?
Below I am including the CIB configuration and the corresponding log messages.
<cib epoch="10" num_updates="94" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Jan 7 18:11:58 2014" update-origin="gol-5-7-0" update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="gol-5-7-0">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-1.el6_4.4-368c726"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-migration-threshold" name="migration-threshold" value="3"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="gol-5-7-6" uname="gol-5-7-6"/>
      <node id="gol-5-7-0" uname="gol-5-7-0"/>
    </nodes>
    <resources>
      <group id="Group">
        <primitive class="ocf" id="FAILOVER-INTER" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="FAILOVER-INTER-instance_attributes">
            <nvpair id="FAILOVER-INTER-instance_attributes-ip" name="ip" value="10.20.7.190"/>
            <nvpair id="FAILOVER-INTER-instance_attributes-nic" name="nic" value="eth1"/>
            <nvpair id="FAILOVER-INTER-instance_attributes-cidr_netmask" name="cidr_netmask" value="14"/>
          </instance_attributes>
          <operations>
            <op id="FAILOVER-INTER-monitor-interval-5s" interval="5s" name="monitor"/>
          </operations>
        </primitive>
        <primitive class="ocf" id="GOL-HA" provider="redhat" type="script.sh">
          <instance_attributes id="GOL-HA-instance_attributes">
            <nvpair id="GOL-HA-instance_attributes-name" name="name" value="gol-ha"/>
            <nvpair id="GOL-HA-instance_attributes-file" name="file" value="/etc/init.d/gol-ha"/>
          </instance_attributes>
          <operations>
            <op id="GOL-HA-monitor-interval-60s" interval="60s" name="monitor"/>
          </operations>
        </primitive>
      </group>
    </resources>
    <constraints/>
    <rsc_defaults>
      <meta_attributes id="rsc_defaults-options">
        <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="100"/>
      </meta_attributes>
    </rsc_defaults>
  </configuration>
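For a quick overview of the resource layout in the configuration above, here is a minimal sketch that lists each primitive in the group with its agent and monitor interval. It assumes a trimmed copy of the <resources> section (instance attributes left out); the function name summarize_resources is only an illustration and is not part of any Pacemaker tool.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <resources> section from the CIB above
# (instance_attributes omitted for brevity).
RESOURCES_XML = """
<resources>
  <group id="Group">
    <primitive class="ocf" id="FAILOVER-INTER" provider="heartbeat" type="IPaddr2">
      <operations>
        <op id="FAILOVER-INTER-monitor-interval-5s" interval="5s" name="monitor"/>
      </operations>
    </primitive>
    <primitive class="ocf" id="GOL-HA" provider="redhat" type="script.sh">
      <operations>
        <op id="GOL-HA-monitor-interval-60s" interval="60s" name="monitor"/>
      </operations>
    </primitive>
  </group>
</resources>
"""


def summarize_resources(xml_text):
    """Return (group id, resource id, agent, monitor intervals) tuples in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for group in root.iter("group"):
        for prim in group.findall("primitive"):
            # Agent is written in the usual class:provider:type form.
            agent = "{}:{}:{}".format(
                prim.get("class"), prim.get("provider"), prim.get("type")
            )
            intervals = [
                op.get("interval")
                for op in prim.iter("op")
                if op.get("name") == "monitor"
            ]
            rows.append((group.get("id"), prim.get("id"), agent, intervals))
    return rows


SUMMARY = summarize_resources(RESOURCES_XML)
for row in SUMMARY:
    print(row)
```

The members of a group are listed in the order they appear in the XML, so the IP address comes before the OCF script.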
Corresponding log messages:
Feb 04 11:27:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
Feb 04 11:27:29 [45168] gol-5-7-0 crmd: notice: crm_update_peer_state: cman_event_callback: Node gol-5-7-6[2] - state is now member (was lost)
Feb 04 11:27:29 corosync [CPG ] chosen downlist: sender r(0) ip(172.16.0.2) ; members(old:1 left:0)
Feb 04 11:27:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
Feb 04 11:27:36 [45168] gol-5-7-0 crmd: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-GOL-HA (5)
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-GOL-HA (1391444085)
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: process_pe_message: Calculated Transition 1825: /var/lib/pacemaker/pengine/pe-input-45.bz2
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 7: monitor FAILOVER-INTER_monitor_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 8: monitor GOL-HA_monitor_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: warning: status_from_rc: Action 8 (GOL-HA_monitor_0) on gol-5-7-6 failed (target: 7 vs. rc: 1): Error
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 6: probe_complete probe_complete on gol-5-7-6 - no waiting
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: run_graph: Transition 1825 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-45.bz2): Stopped
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-6: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: native_create_actions: Resource GOL-HA (ocf::script.sh) is active on 2 nodes attempting recovery
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: LogActions: Recover GOL-HA (Started gol-5-7-0)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: process_pe_message: Calculated Transition 1826: /var/lib/pacemaker/pengine/pe-error-3.bz2
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 10: stop GOL-HA_stop_0 on gol-5-7-0 (local)
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 3: stop GOL-HA_stop_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete on gol-5-7-6 - no waiting
Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_stop_0 (call=111, rc=0, cib-update=1953, confirmed=true) ok
Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 11: start GOL-HA_start_0 on gol-5-7-0 (local)
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_start_0 (call=115, rc=0, cib-update=1954, confirmed=true) ok
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 1: monitor GOL-HA_monitor_60000 on gol-5-7-0 (local)
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_monitor_60000 (call=118, rc=0, cib-update=1955, confirmed=false) ok
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: run_graph: Transition 1826 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
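To make the return codes in the log above easier to read, here is a minimal sketch that decodes the failed-probe line (target: 7 vs. rc: 1) into OCF names. The function name decode_probe_failure and the regular expression are only an illustration written against the log format shown here, not part of Pacemaker; the table lists the standard OCF exit codes 0 through 7.

```python
import re

# Standard OCF resource agent exit codes (0-7).
OCF_RC_NAMES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}

# Matches the crmd "status_from_rc" warning as it appears in the log above.
PROBE_RE = re.compile(
    r"Action \d+ \((?P<op>\S+)\) on (?P<node>\S+) failed "
    r"\(target: (?P<target>\d+) vs\. rc: (?P<rc>\d+)\)"
)


def decode_probe_failure(line):
    """Return a dict describing a failed action, or None if the line does not match."""
    match = PROBE_RE.search(line)
    if match is None:
        return None
    target = int(match.group("target"))
    rc = int(match.group("rc"))
    return {
        "op": match.group("op"),
        "node": match.group("node"),
        # Fall back to the raw number for codes outside the table.
        "expected": OCF_RC_NAMES.get(target, str(target)),
        "actual": OCF_RC_NAMES.get(rc, str(rc)),
    }


LINE = (
    "Feb 04 11:27:38 [45168] gol-5-7-0 crmd: warning: status_from_rc: "
    "Action 8 (GOL-HA_monitor_0) on gol-5-7-6 failed (target: 7 vs. rc: 1): Error"
)
RESULT = decode_probe_failure(LINE)
print(RESULT)
```

For the line above, the probe (GOL-HA_monitor_0) on gol-5-7-6 was expected to report OCF_NOT_RUNNING but the agent returned OCF_ERR_GENERIC, which is the "unknown error (1)" that pengine reports.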