[ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

Mon Feb 25 15:00:42 EST 2019

25.02.2019 22:36, Andrei Borzenkov wrote:
> 
>> Could you please help me understand:
>> 1. Why doesn't pacemaker process the failure of Stateful_Test_2 resource
>> immediately after first failure?
> 

I'm still not sure why.

> I vaguely remember something about sequential execution mentioned before
> but cannot find details.
> 
>> 2. Why does the monitor failure of Stateful_Test_2 continue even after the
>> promote of Stateful_Test_1 has been completed? Shouldn't it handle
>> Stateful_Test_2's failure and take necessary action on it? It feels as if
>> that particular failure 'event' has been 'dropped' and pengine is not even
>> aware of the Stateful_Test_2's failure.
>>
> 
> Yes. Although crm_mon shows the resource as being master on this node, in
> reality the resource is left in a failed state forever and the monitor
> result is simply ignored.
> 

Yes, pacemaker reacts only to a change in the result (more precisely, it
tells lrmd to report only the first result and suppress all further
consecutive duplicates). Because the first report gets lost due to the low
failure-timeout, the later identical failures are never reported, which
explains what happens.
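
To illustrate the duplicate suppression described above, here is a minimal
sketch in Python. It is a toy model, not lrmd code: the function name
reported_results and the sample sequence are made up for illustration, and
the return codes are used only as example values (8 for "running as master",
7 for "not running").

```python
from typing import Iterable, List, Optional


def reported_results(results: Iterable[int]) -> List[int]:
    """Return only the results that differ from the one just before them.

    Toy model of "report the first result, suppress all further
    consecutive duplicates". Hypothetical helper, not lrmd code.
    """
    reported: List[int] = []
    last: Optional[int] = None
    for rc in results:
        if rc != last:
            reported.append(rc)
        last = rc
    return reported


# Example values only: 8 = running as master, 7 = not running (failed).
OK_MASTER, FAILED = 8, 7

# The monitor succeeds twice, then fails on every later run.
monitor_results = [OK_MASTER, OK_MASTER, FAILED, FAILED, FAILED, FAILED]

events = reported_results(monitor_results)
print(events)                # [8, 7]
print(events.count(FAILED))  # 1
```

The monitor fails four times but only one failure event is reported. If that
single event is cleared before it is processed, nothing later brings the
failure back to pacemaker's attention.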
...

>>
>> Could you please help us in understanding this behavior and how to fix this?
>>
> 
> Your problem is triggered by a failure-timeout that is too low. The
> failure of the master is cleared before pacemaker picks it up for
> processing (or so I interpret it). You should set failure-timeout to be
> longer than your actions may take. This will give you at least a workaround.
> 
> Note that in your configuration the resource cannot be recovered anyway.
> migration-threshold is 1, so pacemaker cannot (try to) restart the master
> on the same node, but you prohibit running it anywhere else.
>
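
The two points quoted above can be written down as simple checks. This is
only a sketch of the reasoning, assuming a simplified model: the function
names, the action durations, and the timeout values are all hypothetical and
not taken from the original configuration, and real pacemaker placement
logic involves much more than this.

```python
from typing import Dict


def failure_timeout_is_safe(failure_timeout_s: int,
                            action_durations_s: Dict[str, int]) -> bool:
    """True if failure-timeout is longer than the slowest action."""
    return failure_timeout_s > max(action_durations_s.values())


def can_recover(fail_count: int, migration_threshold: int,
                allowed_elsewhere: bool) -> bool:
    """Simplified model: a node stays usable while fail_count is below
    migration-threshold; otherwise the resource needs another node."""
    may_retry_here = fail_count < migration_threshold
    return may_retry_here or allowed_elsewhere


# Hypothetical durations (seconds) for illustration only.
durations = {"monitor": 10, "promote": 60}

print(failure_timeout_is_safe(15, durations))   # False: promote outlasts it
print(failure_timeout_is_safe(120, durations))  # True

# One failure, migration-threshold=1, no other node allowed.
print(can_recover(1, 1, allowed_elsewhere=False))  # False
# Raising the threshold would allow a restart attempt on the same node.
print(can_recover(1, 3, allowed_elsewhere=False))  # True
```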