[ClusterLabs] [Enhancement] When STONITH is not completed, a resource moves.
renayama19661014 at ybb.ne.jp
Fri Oct 30 04:40:02 UTC 2015
Hi Ken,
Thank you for your comments.
> The above is the reason for the behavior you're seeing.
>
> A fenced node can come back up and rejoin the cluster before the fence
> command reports completion. When Pacemaker sees the rejoin, it assumes
> the fence command completed.
>
> However in this case, the lost node rejoined on its own while fencing
> was still in progress, so that was an incorrect assumption.
>
> A proper fix will take some investigation. As a workaround in the
> meantime, you could try increasing the corosync token timeout, so the
> node is not declared lost for brief outages.
We think so, too.
We understand that we can avoid the problem by increasing the corosync token timeout.
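As a sketch of that workaround (the value below is illustrative only, not a recommendation), the token timeout lives in the totem section of corosync.conf:

```
totem {
    version: 2
    # Milliseconds to wait for the token before declaring a node lost.
    # Raising this lets the cluster ride out brief network outages.
    # 10000 is an example value only; tune for your environment.
    token: 10000
}
```

A larger token means slower failure detection for genuinely dead nodes, so it trades recovery speed for tolerance of short outages.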
If you need the logs from when the problem occurred for your investigation, please contact me.
Many Thanks!
Hideo Yamauchi.
----- Original Message -----
> From: Ken Gaillot <kgaillot at redhat.com>
> To: users at clusterlabs.org
> Cc:
> Date: 2015/10/29, Thu 23:09
> Subject: Re: [ClusterLabs] [Enhancement] When STONITH is not completed, a resource moves.
>
> On 10/28/2015 08:39 PM, renayama19661014 at ybb.ne.jp wrote:
>> Hi All,
>>
>> We encountered the following problem in Pacemaker 1.1.12.
>> A resource moved while STONITH had not yet completed.
>>
>> The following sequence of events appears to have occurred in the cluster:
>>
>> Step1) Start a cluster.
>>
>> Step2) Node 1 breaks down.
>>
>> Step3) Node 1 rejoins before the STONITH from node 2 completes.
>>
>> Step4) Steps 2 and 3 repeat.
>>
>> Step5) The STONITH from node 2 has not completed, but a resource moves to node 2.
>>
>>
>>
>> When I looked at the pe file from the time the resource moved to node 2, there was no resource information for node 1.
>> (snip)
>> <status>
>>   <node_state id="3232242311" uname="node1" in_ccm="false" crmd="offline"
>>       crm-debug-origin="do_state_transition" join="down" expected="down">
>>     <transient_attributes id="3232242311">
>>       <instance_attributes id="status-3232242311">
>>         <nvpair id="status-3232242311-last-failure-prm_XXX1" name="last-failure-prm_XXX1" value="1441957021"/>
>>         <nvpair id="status-3232242311-default_ping_set" name="default_ping_set" value="300"/>
>>         <nvpair id="status-3232242311-last-failure-prm_XXX2" name="last-failure-prm_XXX2" value="1441956891"/>
>>         <nvpair id="status-3232242311-shutdown" name="shutdown" value="0"/>
>>         <nvpair id="status-3232242311-probe_complete" name="probe_complete" value="true"/>
>>       </instance_attributes>
>>     </transient_attributes>
>>   </node_state>
>>   <node_state id="3232242312" in_ccm="true" crmd="online"
>>       crm-debug-origin="do_state_transition" uname="node2" join="member" expected="member">
>>     <transient_attributes id="3232242312">
>>       <instance_attributes id="status-3232242312">
>>         <nvpair id="status-3232242312-shutdown" name="shutdown" value="0"/>
>>         <nvpair id="status-3232242312-probe_complete" name="probe_complete" value="true"/>
>>         <nvpair id="status-3232242312-default_ping_set" name="default_ping_set" value="300"/>
>>       </instance_attributes>
>>     </transient_attributes>
>>     <lrm id="3232242312">
>>       <lrm_resources>
>> (snip)
>>
>> While STONITH has not yet completed, the node's information is deleted from the cib, and the resource movement appears to be caused by the cib no longer holding that node's resource information.
>>
>> The trigger for the problem was that the cluster communication became unstable.
>> However, the cluster's behavior in this situation is itself a problem.
>>
>> We have not seen this problem in Pacemaker 1.1.13 so far.
>> However, as far as I can tell from the source code, the processing is the same.
>>
>> Shouldn't the deletion of the node information be deferred until the new node's information has been gathered?
>>
>> * crmd/callbacks.c
>> (snip)
>> void
>> peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *data)
>> {
>> (snip)
>>     if (down) {
>>         const char *task = crm_element_value(down->xml, XML_LRM_ATTR_TASK);
>>
>>         if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
>>             crm_info("Node return implies stonith of %s (action %d) completed",
>>                      node->uname, down->id);
>
> The above is the reason for the behavior you're seeing.
>
> A fenced node can come back up and rejoin the cluster before the fence
> command reports completion. When Pacemaker sees the rejoin, it assumes
> the fence command completed.
>
> However in this case, the lost node rejoined on its own while fencing
> was still in progress, so that was an incorrect assumption.
>
> A proper fix will take some investigation. As a workaround in the
> meantime, you could try increasing the corosync token timeout, so the
> node is not declared lost for brief outages.
>
>>             st_fail_count_reset(node->uname);
>>
>>             erase_status_tag(node->uname, XML_CIB_TAG_LRM, cib_scope_local);
>>             erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
>>             /* down->confirmed = TRUE; Only stonith-ng returning should imply completion */
>>             down->sent_update = TRUE;  /* Prevent tengine_stonith_callback() from calling send_stonith_update() */
>>
>> (snip)
>>
>>
>> * We have the logs, but cannot attach them because they contain user information.
>> * Please contact me by email if you need them.
>>
>>
>> This issue is registered in Bugzilla:
>> * http://bugs.clusterlabs.org/show_bug.cgi?id=5254
>>
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>