[ClusterLabs] Re: Behavior after stop action failure with the failure-timeout set and STONITH disabled
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri May 5 08:15:45 CEST 2017
>>> Jan Wrona <wrona at cesnet.cz> wrote on 04.05.2017 at 16:41 in message
<cac9591a-7efc-d524-5d2c-248415c5c37e at cesnet.cz>:
> I hope I'll be able to explain the problem clearly and correctly.
>
> My setup (simplified): I have two cloned resources, a filesystem mount
> and a process which writes to that filesystem. The filesystem is Gluster,
> so it's OK to clone it. I also have a mandatory ordering constraint
> "start gluster-mount-clone then start writer-process-clone". I don't
> have a STONITH device, so I've disabled STONITH by setting
> stonith-enabled=false.
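>
> (Roughly, the relevant part of the configuration in crm shell syntax;
> the resource agents, names and parameter values below are simplified
> placeholders, not my exact configuration:)
>
>   # Gluster mount, monitored with a write test (OCF_CHECK_LEVEL=20)
>   primitive gluster-mount ocf:heartbeat:Filesystem \
>       params device="glusterserver:/vol0" directory="/mnt/gluster" fstype="glusterfs" \
>       op monitor interval=30s timeout=40s OCF_CHECK_LEVEL=20
>   # process that writes into the mounted filesystem
>   primitive writer-process ocf:heartbeat:anything \
>       params binfile="/usr/local/bin/writer" \
>       op monitor interval=30s timeout=30s
>   clone gluster-mount-clone gluster-mount
>   clone writer-process-clone writer-process
>   # mandatory ordering: mount first, then the writer
>   order writer-after-mount Mandatory: gluster-mount-clone writer-process-clone
>   property stonith-enabled=false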
>
> The problem: Sometimes Gluster freezes for a while, which causes the
> gluster-mount resource's monitor with OCF_CHECK_LEVEL=20 to time out
> (it is unable to write the status file). When this happens, the cluster
Actually I would do two things:
1) Find out why Gluster freezes, and what to do to avoid that
2) Implement stonith
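
For 2), a minimal sketch in crm shell syntax using fence_ipmilan (the choice
of fence agent, addresses and credentials here are assumptions, and parameter
names vary with the fence-agents version):

  primitive fence-node1 stonith:fence_ipmilan \
      params pcmk_host_list="node1.example.org" ip="192.0.2.11" \
             username="admin" password="secret" lanplus=1 \
      op monitor interval=60s
  # don't run a node's own fence device on that node
  location l-fence-node1 fence-node1 -inf: node1.example.org
  property stonith-enabled=true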
Regards,
Ulrich
> tries to recover by restarting the writer-process resource. But the
> writer-process is writing to the frozen filesystem, which makes it
> uninterruptible; not even SIGKILL works. Then the stop operation times
> out, and with STONITH disabled on-fail defaults to block (don’t perform
> any further operations on the resource):
> warning: Forcing writer-process-clone away from node1.example.org after
> 1000000 failures (max=1000000)
> After that, the cluster continues the recovery by restarting the
> gluster-mount resource on that node, which usually succeeds. As a
> consequence of that remount, the uninterruptible system call in the
> writer process fails, signals are finally delivered, and the
> writer-process is terminated. But the cluster doesn't know about that!
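>
> (If it helps, this is where that default lives; only a sketch, the
> timeout value is arbitrary, and with working fencing on-fail=fence on
> the stop operation would be the usual choice instead:)
>
>   primitive writer-process ocf:heartbeat:anything \
>       params binfile="/usr/local/bin/writer" \
>       op stop interval=0 timeout=90s on-fail=block \
>       op monitor interval=30s timeout=30s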
>
> I thought I could solve this by setting the failure-timeout meta
> attribute on the writer-process resource, but it only made things worse.
> The documentation states: "Stop failures are slightly different and
> crucial. ... If a resource fails to stop and STONITH is not enabled,
> then the cluster has no way to continue and will not try to start the
> resource elsewhere, but will try to stop it again after the failure
> timeout.", but I'm seeing something different. When the policy engine
> runs after the next cluster-recheck-interval, the following lines are
> written to syslog:
> crmd[11852]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> pengine[11851]: notice: Clearing expired failcount for writer-process:1
> on node1.example.org
> pengine[11851]: notice: Clearing expired failcount for writer-process:1
> on node1.example.org
> pengine[11851]: notice: Ignoring expired calculated failure
> writer-process_stop_0 (rc=1,
> magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
> node1.example.org
> pengine[11851]: notice: Clearing expired failcount for writer-process:1
> on node1.example.org
> pengine[11851]: notice: Ignoring expired calculated failure
> writer-process_stop_0 (rc=1,
> magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
> node1.example.org
> pengine[11851]: warning: Processing failed op monitor for
> gluster-mount:1 on node1.example.org: unknown error (1)
> pengine[11851]: notice: Calculated transition 564, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-362.bz2
> crmd[11852]: notice: Transition 564 (Complete=2, Pending=0, Fired=0,
> Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-362.bz2): Complete
> crmd[11852]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> crmd[11852]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> crmd[11852]: warning: No reason to expect node 3 to be down
> crmd[11852]: warning: No reason to expect node 1 to be down
> crmd[11852]: warning: No reason to expect node 1 to be down
> crmd[11852]: warning: No reason to expect node 3 to be down
> pengine[11851]: warning: Processing failed op stop for writer-process:1
> on node1.example.org: unknown error (1)
> pengine[11851]: warning: Processing failed op monitor for
> gluster-mount:1 on node1.example.org: unknown error (1)
> pengine[11851]: warning: Forcing writer-process-clone away from
> node1.example.org after 1000000 failures (max=1000000)
> pengine[11851]: warning: Forcing writer-process-clone away from
> node1.example.org after 1000000 failures (max=1000000)
> pengine[11851]: warning: Forcing writer-process-clone away from
> node1.example.org after 1000000 failures (max=1000000)
> pengine[11851]: notice: Calculated transition 565, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-363.bz2
> pengine[11851]: notice: Ignoring expired calculated failure
> writer-process_stop_0 (rc=1,
> magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
> node1.example.org
> pengine[11851]: warning: Processing failed op monitor for
> gluster-mount:1 on node1.example.org: unknown error (1)
> crmd[11852]: notice: Transition 566 (Complete=0, Pending=0, Fired=0,
> Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-364.bz2): Complete
> crmd[11852]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> pengine[11851]: notice: Calculated transition 566, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-364.bz2
>
> Then after each cluster-recheck-interval:
> crmd[11852]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> pengine[11851]: notice: Ignoring expired calculated failure
> writer-process_stop_0 (rc=1,
> magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
> node1.example.org
> pengine[11851]: warning: Processing failed op monitor for
> gluster-mount:1 on node1.example.org: unknown error (1)
> crmd[11852]: notice: Transition 567 (Complete=0, Pending=0, Fired=0,
> Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-364.bz2): Complete
> crmd[11852]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
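>
> (For reference, the failure-timeout and cluster-recheck-interval
> mentioned above are set roughly like this; the values here are
> placeholders, not my actual settings:)
>
>   clone writer-process-clone writer-process \
>       meta failure-timeout=300s
>   property cluster-recheck-interval=5min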
>
> And crm_mon happily shows the writer-process as Started, although it is
> not running. This is very confusing. Could anyone please explain what is
> going on here?
>