[Pacemaker] Problem configuring Heartbeat with CRM : Abnormal Failover test results

Andrew Beekhof andrew at beekhof.net
Sun Jul 31 22:50:28 EDT 2011


On Fri, Jul 29, 2011 at 11:14 PM, Deneux Olivier <odeneux at oxya.com> wrote:
> Hello,
>
> First of all, please excuse my approximate English.
> I'm facing a problem configuring a simple 2-node cluster with 1
> resource (a virtual IP).
> I've read a lot of threads and docs but haven't found an answer.
> I have to say the cluster world is pretty new to me...
>
> I've installed the following packages on my 2 RHEL 4.1.2-48 Linux servers:
>
> cluster-glue-1.0.5-1.el5.x86_64.rpm
> cluster-glue-libs-1.0.5-1.el5.x86_64.rpm
> corosync-1.2.5-1.3.el5.x86_64.rpm
> corosynclib-1.2.5-1.3.el5.x86_64.rpm
> heartbeat-3.0.3-2.el5.x86_64.rpm
> heartbeat-libs-3.0.3-2.el5.x86_64.rpm
> libesmtp-1.0.4-5.el5.x86_64.rpm
> pacemaker-1.0.9.1-1.el5.x86_64.rpm
> pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm
> resource-agents-1.0.3-2.el5.x86_64.rpm
>
> (corosync is not running; it seems I don't need it)
>
> Here is the ha.cf of node 1:
> node 1
> autojoin none
> keepalive 2
> deadtime 10
> initdead 80
> udpport 694
> ucast bond0 <@IP node2>
> auto_failback off
> node    node1
> node    node2
> use_logd yes
> crm     yes
>
> Here is the ha.cf of node 2:
> node 1
> autojoin none
> keepalive 2
> deadtime 10
> initdead 80
> udpport 694
> ucast bond0 <@IP node1>
> auto_failback off
> node    node1
> node    node2
> use_logd yes
> crm     yes
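(As a quick sanity check before failover testing - assuming the cl_status
utility shipped with the heartbeat package is available - you can verify
that both nodes and the ucast link are seen as up, e.g. on node1:

   cl_status hbstatus                    # is the local heartbeat daemon running?
   cl_status nodestatus node2            # does node1 consider node2 active?
   cl_status listhblinks node2           # which links are configured towards node2?
   cl_status hblinkstatus node2 bond0    # is the bond0 link to node2 up?
)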
>
> I used crm to configure the cluster; here is the cib.xml file:
>
> <cib validate-with="pacemaker-1.0" crm_feature_set="3.0.1" have-quorum="1"
> admin_epoch="0" epoch="190" dc-uuid="85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4"
> num_updates="0" cib-last-written="Fri Jul 29 14:18:28 2011">
> <configuration>
> <crm_config>
> <cluster_property_set id="cib-bootstrap-options">
> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
> value="1.0.9-89bd754939df5150de7cd76835f98fe90851b677"/>
> <nvpair id="cib-bootstrap-options-cluster-infrastructure"
> name="cluster-infrastructure" value="Heartbeat"/>
> <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled"
> value="false"/>
> <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh"
> value="1311941556"/>
> </cluster_property_set>
> </crm_config>
> <nodes>
> <node type="normal" uname="node2" id="85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4">
> <instance_attributes id="nodes-85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4">
> <nvpair name="standby"
> id="nodes-85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4-standby" value="off"/>
> </instance_attributes>
> </node>
> <node id="813121d2-360b-4532-8883-7f1330ed2c39" type="normal" uname="node1">
> <instance_attributes id="nodes-813121d2-360b-4532-8883-7f1330ed2c39">
> <nvpair id="nodes-813121d2-360b-4532-8883-7f1330ed2c39-standby"
> name="standby" value="off"/>
> </instance_attributes>
> </node>
> </nodes>
> <resources>
> <primitive class="ocf" id="ClusterIP" provider="heartbeat" type="IPaddr2">
> <instance_attributes id="ClusterIP-instance_attributes">
> <nvpair id="ClusterIP-instance_attributes-ip" name="ip" value="<@IP
> Virtual>"/>
> <nvpair id="ClusterIP-instance_attributes-cidr_netmask" name="cidr_netmask"
> value="32"/>
> </instance_attributes>
> <operations>
> <op id="ClusterIP-monitor-30s" interval="30s" name="monitor"/>
> </operations>
> <meta_attributes id="ClusterIP-meta_attributes">
> <nvpair id="ClusterIP-meta_attributes-target-role" name="target-role"
> value="Started"/>
> </meta_attributes>
> </primitive>
> </resources>
> <constraints/>
> <rsc_defaults/>
> <op_defaults/>
> </configuration>
> </cib>
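(For reference, the same ClusterIP primitive expressed in crm shell syntax -
with a placeholder address standing in for the masked one - would look
roughly like this:

   crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
       params ip=192.168.1.100 cidr_netmask=32 \
       op monitor interval=30s \
       meta target-role=Started
)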
>
> The Heartbeat daemon starts fine on both sides; here is the output of crm_mon:
> ============
> Last updated: Fri Jul 29 14:49:47 2011
> Stack: Heartbeat
> Current DC: node1 (813121d2-360b-4532-8883-7f1330ed2c39) - partition with
>  quorum
> Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ node2 node1 ]
>
> ClusterIP       (ocf::heartbeat:IPaddr2):       Started node2
>
>
> To test that everything works fine, I launch a script that stops the network
> on node2, waits 50s and then brings the network back up.
> When the network goes down on node2, the resource migrates to node1 as
> expected.
> But when the network is back up, the resource does not move back to node2
> (it should, as no stickiness option is defined yet).
> I get the following error in crm_mon:
>
> ============
> Last updated: Fri Jul 29 14:52:15 2011
> Stack: Heartbeat
> Current DC: node2 (85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4) - partition with
>  quorum
> Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ node1 node2 ]
>
> ClusterIP       (ocf::heartbeat:IPaddr2):       Started node1
>
> Failed actions:
>    ClusterIP_start_0 (node=node2, call=6, rc=2, status=complete): invalid
> parameter
>
> Same behaviour if I move the resource to node1 and stop/start the network on node1.
>
> Why am I getting this "invalid parameter" error?

You would have to open up the IPaddr2 script.
Perhaps it is also logging a reason (the logs you quoted are from the
policy engine, which has no additional information; better to look for
something from the resource itself).
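One way to see what the agent is complaining about is to run it by hand on
node2 with the same parameters the cluster passes in (paths below assume a
default install; substitute the real address for the placeholder):

   export OCF_ROOT=/usr/lib/ocf
   export OCF_RESKEY_ip=192.168.1.100        # the masked virtual IP
   export OCF_RESKEY_cidr_netmask=32
   /usr/lib/ocf/resource.d/heartbeat/IPaddr2 validate-all ; echo rc=$?
   /usr/lib/ocf/resource.d/heartbeat/IPaddr2 start        ; echo rc=$?

rc=2 is OCF_ERR_ARGS, the "invalid parameter" code in your logs, and the
agent normally prints or logs which parameter it objected to. ocf-tester
(from cluster-glue/resource-agents) can automate the same kind of check.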

> Here is an extract from the ha-log:
>
> Jul 29 14:48:30 node2 pengine: [30752]: ERROR: unpack_rsc_op: Hard error -
> ClusterIP_start_0 failed with rc=2: Preventing ClusterIP from re-starting on
> node2
> Jul 29 14:48:30 node2 pengine: [30752]: WARN: unpack_rsc_op: Processing
> failed op ClusterIP_start_0 on node2: invalid parameter (2)
> Jul 29 14:48:30 node2 pengine: [30752]: notice: native_print: ClusterIP
> (ocf::heartbeat:IPaddr2):       Started node1
> Jul 29 14:48:30 node2 pengine: [30752]: info: get_failcount: ClusterIP has
> failed INFINITY times on node2
> Jul 29 14:48:30 node2 pengine: [30752]: WARN: common_apply_stickiness:
> Forcing ClusterIP away from node2 after 1000000 failures (max=1000000)
> Jul 29 14:48:30 node2 pengine: [30752]: notice: LogActions: Leave resource
> ClusterIP        (Started node1)
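Those last two log lines also explain why the resource never moves back on
its own: a hard error such as rc=2 drives the fail count for node2 to
INFINITY, and the policy engine keeps ClusterIP away from that node until
the failure is cleared. Once the underlying parameter problem is fixed,
something along these lines (crm shell syntax, names as in your config)
should let it return:

   crm resource failcount ClusterIP show node2   # inspect the recorded fail count
   crm resource cleanup ClusterIP node2          # clear the failed start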
>
> If you need more info, please ask me !
>
> Thanks in advance
>
> Olivier
>
>
