[ClusterLabs] How does failure-timeout work? Will a resource not be scheduled when it is set too short?

Sat May 19 22:19:31 EDT 2018

I have two Pacemaker resources, which I will call A and B. For environmental reasons, their start and monitor methods always return failure (OCF_ERR_GENERIC). Their configurations are shown below (the cluster property start-failure-is-fatal is set to false):

primitive A A \
        op monitor interval=20 timeout=120 \
        op stop interval=0 timeout=120 on-fail=restart \
        op start interval=0 timeout=240 on-fail=restart \
        meta failure-timeout=60s
primitive B B \
        op monitor interval=20 timeout=120 \
        op stop interval=0 timeout=120 on-fail=restart \
        op start interval=0 timeout=240 on-fail=restart \
        meta failure-timeout=60s
clone A_cl A
clone B_cl B

Their methods take different amounts of time to run:
A:
start = 60s       monitor < 1s        stop = 80s
B:
start < 1s        monitor < 1s        stop < 1s

Resource A is scheduled normally: it is repeatedly started and stopped. Resource B, however, only shows repeated monitor failures, with no start or stop in between, and no fail-count for B appears in "crm status -f".

Either of the following two changes solves the problem of B not being scheduled:
1. Increase the failure-timeout of B from 60s to 600s
2. Modify the OCF agent of A so that its stop method returns as quickly as possible
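
For reference, the first change touches only B's meta attribute. A sketch of the resulting definition, with everything else left as in the configuration above:

primitive B B \
        op monitor interval=20 timeout=120 \
        op stop interval=0 timeout=120 on-fail=restart \
        op start interval=0 timeout=240 on-fail=restart \
        meta failure-timeout=600s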

I tested this several times, and the results were always the same. Why is the resource not scheduled when failure-timeout is set too short? And what does this have to do with the slow stop of another resource? Is this a bug?

My Pacemaker version is 1.1.16. Any suggestions are welcome. Thank you!

James
2018-05-20