[ClusterLabs] Ignore lost monitoring request
klecho at gmail.com
Wed Mar 14 10:23:21 EDT 2018
As Ken said,
"Not currently, but that is planned for a future version",
I just want to remind how useful an "ignore X monitoring
timeouts" option would be in the newest Pacemaker.
We are still having big problems with resources restarting because of
lost monitoring requests, which leads to service interruptions.
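Until such an option exists in Pacemaker itself, one stopgap is to retry the probe inside the agent's monitor action, so a single lost request is not reported upward as a failure. A sketch only: `monitor_with_retry` and its arguments are made-up names for illustration, not Pacemaker features, and it assumes coreutils timeout(1) is available:

```shell
#!/bin/sh
# Sketch: run a probe command under timeout(1) and retry, so that a
# single hung/lost probe does not immediately count as a failure.
# monitor_with_retry CMD [MAX_TRIES] [PER_TRY_TIMEOUT_SECS]
monitor_with_retry() {
    cmd="$1"
    max_tries="${2:-3}"        # tolerate up to max_tries-1 lost probes
    per_try_timeout="${3:-10}" # kill each probe attempt after this long
    i=1
    while [ "$i" -le "$max_tries" ]; do
        # timeout(1) terminates the probe if it never returns
        if timeout "$per_try_timeout" sh -c "$cmd"; then
            return 0           # OCF_SUCCESS
        fi
        i=$((i + 1))
    done
    return 7                   # OCF_NOT_RUNNING
}
```

The operation's own timeout in the CIB must of course be larger than max_tries * per_try_timeout, or Pacemaker will still time the whole monitor out.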
On 1.09.2017 17:52, Klechomir wrote:
> On 1.09.2017 17:21, Jan Pokorný wrote:
>> On 01/09/17 09:48 +0300, Klechomir wrote:
>>> I have cases when, for an unknown reason, a single monitoring
>>> request never returns a result.
>>> So having bigger timeouts doesn't resolve this problem.
>> If I get you right, the pain point here is a command called by the
>> resource agent during the monitor operation, where this command under
>> some circumstances _never_ terminates (dead waiting, an infinite
>> loop, or whatever other reason), or terminates only on
>> external/asynchronous triggers (e.g. the network connection gets
>> re-established).
>> Stating the obvious, the solution should be:
>> - work towards fixing that particular command if blocking
>>   is unexpected behaviour (clarify this with upstream
>>   if needed)
>> - find a more reliable way for the agent to monitor the resource
>> For the planned soft-recovery options Ken talked about, I am not
>> sure it would be trivially possible to differentiate an exceeded
>> monitor timeout from a plain monitor failure.
> In any case, there is currently no differentiation between a failed
> monitoring request and a timeout, so a parameter for ignoring X
> failures in a row would be very welcome for me.
> Here is one very fresh example, entirely unrelated to LV&I/O:
> Aug 30 10:44:19  CLUSTER-1 crmd: error:
> process_lrm_event: LRM operation p_PingD_monitor_0 (1148) Timed Out
> Aug 30 10:44:56  CLUSTER-1 crmd: notice:
> process_lrm_event: LRM operation p_PingD_stop_0 (call=1234, rc=0,
> cib-update=40, confirmed=true) ok
> Aug 30 10:45:26  CLUSTER-1 crmd: notice:
> process_lrm_event: LRM operation p_PingD_start_0 (call=1240, rc=0,
> cib-update=41, confirmed=true) ok
> In this case PingD is fencing drbd and causes an unneeded restart of
> all related resources (unneeded, since the next monitoring request
> succeeds).
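For what it's worth, until a real "ignore N timeouts" knob lands, the closest existing knobs are the op-level on-fail setting plus the migration-threshold and failure-timeout meta attributes. A sketch in crm shell syntax (the agent, interval, and values are illustrative; note this still restarts the resource on the first timeout, it only limits escalation and lets old failures expire):

```
primitive p_PingD ocf:pacemaker:ping \
    op monitor interval=10s timeout=60s on-fail=restart \
    meta migration-threshold=3 failure-timeout=120s
```

With failure-timeout set, an isolated lost probe at least stops poisoning the failcount forever, though it does not prevent the immediate recovery action the thread is complaining about.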